Detecting Power Laws

Binning Samples

MarcusFrean and NickChapman talked about how to detect power laws by binning samples. See bins.gif

This is handy for discrete distributions, and essential for real-valued ones.

If you use exponentially spaced bin boundaries, divide the bin's occupancy by its width, and then plot the log of that against the bin number, you should get a straight line whose negative slope is the exponent a.

Nick's going to check it out in practice, by generating artificial "powerlaw" distributions...

RoBt askz: So again I want to be clear about the rationale for exponentially sized bins. Is it because this will support the "scale-free" aspect of the distribution? Just to check my understanding of what you are then saying: does this mean you then do not take the log of the independent variable when plotting to look for a straight line?

MarcusFrean sayz: the (current!) idea is to # use exponentially sized bins, # divide the counts by the bin-width to get "adjusted counts" # ''plot the log of the adjusted counts versus the bin index''. Or, equivalently, the abscissa can be the log of the bin boundary. This will be a straight line if the distribution is a powerlaw. and in that case the slope is the exponent.

RoBt: Ok, this seems reasonable. I still wonder about effects at the low end though. When we are dealing with things are that come in discrete sizes, rather than continuous, it seems to me that the tiny bucket sizes make result in lots of noise data (exactly the kind of thing that bucket size selection normally handles). I am also unsure, especially at these tiny sizes, whether we are really sure what we are measuring: e.g. classes vs methods vs fields vs lines vs tokens vs lexemes vs characters vs bits. Surely other people have faced similar issues? It doesn't seem right to change our idea of the distribution because of edge-effects relating to the tiniest quanta in the data. Comments?

MarcusFrean: Yah, I guess you can start with your first bucket boundary at the smallest value in the dataset, make the first bin width =1, and go from there upwards. I've got a different worry - it seems like this logarithmic binning runs the risk of losing information by sticking all the data from (say) 10 to 15 or whatever into a single bin, when in reality we may have loads of excellent info on 10, 11, 12, 13, 14 and 15. Seems like we're throwing hard-earned data away, in effect.

Provided your data is positive integers though you don't need to do this binning anyway (do you?), and the standard method of ranking and then plotting on log scales (as in the scrips below, and the CACM paper) seems preferable... Agree / disagree ??

Matlab diagnostics

It looks like some of the distributions that ''aren't'' PowerLaw might be LogNormal.

I've written some quick and dirty diagnostics for the two cases. Put your raw data (list of samples) into a file "datafile". Grab this matlab script: diagnoseDatafile.m Then it's
  • matlab -nodesktop
followed by
  • diagnoseDatafile(10)

The 10 here is the number of bins used in the diagnosis of logNormals. You need to fiddle with this... It's not definitive, but if you can find a number of bins such that there's (a) plenty of data points still and (b) they're on a straight line, that's very suggestive of log normal.

It draws two plots. If the LHS one is a straight line you've got a powerlaw idstribution, and if the RHS is straight it's log normal. To get a feel for things, you can make some fake data from either distribution and check the diagnosis out:

  • generatePowerlawData.m - Call this with one argument, the exponent. E.g: ''generatePowerlawData(1.8) ''
  • generateLognormalData.m - Call this with two arguments, the mode and its "spread". E.g: ''generateLognormalData(10, 0.5) ''

I notice that things get a little hairy for log normals that have a very low mean (nb it's got to be at least 1). Any other observations, please collect here...

MarcusFrean


I've added a bit to Marcus' diagnose script which tries to fit a straight line to both the powerlaw and log-normal plots. It'll draw the best-fit line onto them and print out the corresponding equation.

diagnoseDatafile.m

JeromeDolman


Also see LogNormal distributions.