Power-law distributions in empirical data

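A variable *x* is said to obey a power law if it is drawn from a probability density function (pdf) of the form *p(x) = Cx^{-α}*, where *C* is the normalization constant and *α* is the scaling parameter (or exponent). In practice the power law often holds only for values above some lower bound *x_{min}*.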

Power-law distributions come in both continuous and discrete flavors, with the discrete case being more involved than the continuous one. For convenience, discrete power-law behavior is therefore often approximated by a continuous power law. One reliable approximation is to assume that the discrete values of *x* are generated from a continuous power law and then rounded to the nearest integer.
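As a rough illustration of the rounding approximation, the sketch below draws continuous power-law samples by inverse-transform sampling and rounds them to the nearest integer. The function names are hypothetical, and starting the continuous law at *x_{min}* − 1/2 (so that rounded values never fall below *x_{min}*) is an assumption made here, not something stated above.

```python
import math
import random

def sample_continuous(n, alpha, x_min, rng):
    """Draw n samples from p(x) proportional to x^(-alpha) for x >= x_min."""
    # Inverse-transform sampling: if u is uniform on [0, 1), then
    # x_min * (1 - u)^(-1 / (alpha - 1)) follows the continuous power law.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

def sample_discrete_approx(n, alpha, x_min, rng):
    """Approximate discrete power-law samples by rounding continuous ones."""
    # Assumption: start the continuous law at x_min - 0.5 so that rounding
    # to the nearest integer always yields values >= x_min.
    continuous = sample_continuous(n, alpha, x_min - 0.5, rng)
    return [math.floor(x + 0.5) for x in continuous]

rng = random.Random(42)  # fixed seed so the run is reproducible
print(sample_discrete_approx(10, alpha=2.5, x_min=1, rng=rng))
```

Most of the printed values should be small integers, with an occasional much larger one, which is the heavy tail showing up.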

Sometimes the complementary cumulative distribution function (CCDF, referred to here simply as the CDF) is considered instead, defined as *P(x) = Pr(X ≥ x)*.

A power-law distribution appears as a straight line on a log-log plot. The slope of this line can be estimated by least-squares linear regression. However, a good line fit does not guarantee that the data follow a power-law distribution. Moreover, the assumption of independent, Gaussian noise, which is a prerequisite for linear regression, does not hold in this case.
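To make the straight-line behavior concrete, the sketch below computes an empirical CCDF and fits an ordinary least-squares line to it in log-log space. It is only an illustration of the regression approach that the paragraph above cautions against. The data are idealized (evenly spaced quantiles of a power law with *α = 2.5* and *x_{min} = 1*, both chosen arbitrarily) and the function names are hypothetical.

```python
import math

def empirical_ccdf(data):
    """Return sorted values and the fraction of observations >= each value.

    Assumes all values are distinct (ties are not handled).
    """
    xs = sorted(data)
    n = len(xs)
    return xs, [(n - i) / n for i in range(n)]

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

# Idealized sample: evenly spaced quantiles of a continuous power law.
alpha, n = 2.5, 1000
data = [(1.0 - (i + 0.5) / n) ** (-1.0 / (alpha - 1.0)) for i in range(n)]

xs, ccdf = empirical_ccdf(data)
slope = loglog_slope(xs, ccdf)
print(slope)  # for a power law the CCDF slope should be near -(alpha - 1)
```

Note that the CCDF of a power law has log-log slope −(*α* − 1), whereas the pdf has slope −*α*, so the two plots should not be compared directly.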

**Estimating scaling parameter**

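Assuming that we know the value of *x_{min}*, the value of *α* can be estimated by the method of maximum likelihood. The maximum likelihood estimator (MLE) for the continuous case is:

$$\hat{\alpha} = 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}} \right]^{-1}$$

and that for the discrete case is given as:

$$\hat{\alpha} \simeq 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min} - \frac{1}{2}} \right]^{-1}$$

where *x_i*, *i = 1, ..., n*, are the observed values with *x_i ≥ x_{min}*.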

The equation for the discrete case is only an approximation, as there is no exact closed-form MLE for the discrete case.
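A minimal sketch of the two estimators follows, assuming the commonly used closed forms *α̂ = 1 + n / Σ ln(x_i / x_{min})* for the continuous case and the same expression with *x_{min}* replaced by *x_{min}* − 1/2 as the discrete approximation. The function names are hypothetical, and only observations with *x ≥ x_{min}* enter the sums.

```python
import math

def alpha_mle_continuous(data, x_min):
    """Continuous-case MLE of the scaling parameter for x >= x_min."""
    tail = [x for x in data if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

def alpha_mle_discrete_approx(data, x_min):
    """Approximate discrete-case estimate (x_min shifted down by 1/2)."""
    tail = [x for x in data if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / (x_min - 0.5)) for x in tail)

# Tiny hand-checkable example: every ln(x / x_min) equals 1, so alpha = 1 + 4/4.
sample = [math.e, math.e, math.e, math.e]
print(alpha_mle_continuous(sample, x_min=1.0))
```

The continuous estimator is undefined when every observation equals *x_{min}* (the sum of logarithms is zero), so real code would need to guard against that case.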

The MLE outperforms several approaches based on linear regression, such as line fitting on the log-log plot, line fitting after logarithmic binning (done to reduce fluctuations in the tail of the distribution), line fitting to the CDF with constant-size bins, and line fitting to the CDF without any bins. However, for any finite sample size *n* and any choice of *x_{min}*, the estimate carries a bias, which decays as *O(1/n)*.

**Estimating x_{min}**

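If we choose a value of *x_{min}* smaller than the true value, we get a biased estimate of *α*, because the fit then also covers data from the non-power-law region. If we choose a value that is too large, we discard legitimate data points, which increases the statistical error of the estimate.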

One approach is to plot the PDF or CDF on a log-log plot and mark the point beyond which the distribution becomes roughly straight. Another is to plot *α* as a function of *x_{min}* and mark the point beyond which the value appears relatively stable. Neither approach is objective, because "roughly straight" and "relatively stable" are not quantified.

The approach proposed by Clauset et al. [Clauset, A., M. Young, and K. S. Gleditsch, 2007, Journal of Conflict Resolution 51, 58] is as follows:

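Choose the value of *x_{min}* that makes the probability distribution of the measured data and the best-fit power-law model as similar as possible. The similarity between the two distributions can be measured using the Kolmogorov-Smirnov (KS) statistic, which is defined as:

$$D = \max_{x \ge x_{\min}} \left| S(x) - P(x) \right|$$

where *S(x)* is the CDF of the given data with values greater than or equal to *x_{min}* and *P(x)* is the CDF of the power-law model that best fits the data in the region *x ≥ x_{min}*. The estimate of *x_{min}* is the value that minimizes *D*.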
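The selection rule can be sketched as a scan over candidate values of *x_{min}*: for each candidate, fit *α* on the tail, compute the KS distance between the tail's empirical CDF and the fitted continuous power-law CDF, and keep the candidate with the smallest distance. This is a simplified sketch for continuous, positive data. The function names, the `min_tail` cutoff, and the synthetic test data are assumptions made for illustration.

```python
import math
import random

def ks_distance(tail, x_min, alpha):
    """Max gap between the empirical CDF of tail and the power-law CDF."""
    xs = sorted(tail)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        model_cdf = 1.0 - (x / x_min) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - model_cdf), abs(i / n - model_cdf))
    return d

def estimate_x_min(data, min_tail=10):
    """Return (x_min, alpha, D) minimizing the KS distance, or None."""
    best = None
    for candidate in sorted(set(data)):
        tail = [x for x in data if x >= candidate]
        if len(tail) < min_tail:  # hypothetical cutoff: too few points to fit
            break
        log_sum = sum(math.log(x / candidate) for x in tail)
        if log_sum <= 0.0:
            continue
        alpha = 1.0 + len(tail) / log_sum
        d = ks_distance(tail, candidate, alpha)
        if best is None or d < best[2]:
            best = (candidate, alpha, d)
    return best

# Illustrative data: uniform "body" below 1, power-law tail (alpha = 2.5) above 1.
rng = random.Random(7)
body = [rng.uniform(0.2, 1.0) for _ in range(150)]
tail = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(350)]
print(estimate_x_min(body + tail))
```

The scan refits *α* for every candidate, so it is quadratic in the number of observations. That is acceptable for a sketch but would need restructuring for large data sets.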

The MLE and the other approaches do not tell us whether a power law is a plausible fit to the given data - all they do is find the best-fit values of *x_{min}* and *α* under the assumption that the data are drawn from a power law.

**Goodness-of-fit tests**

A large number of synthetic data sets are generated from the hypothesized power-law distribution. Each synthetic data set is then fitted individually to its own power-law model, and the KS statistic is calculated relative to that fit. The *p*-value is defined as the fraction of synthetic data sets whose distance (KS statistic) is greater than the distance for the given data set. A large value of *p* (close to 1) means that the difference between the given data and the hypothesized model can be attributed to statistical fluctuations alone, while a small value of *p* (close to 0) means that the model is not a plausible fit to the data.
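The procedure can be sketched as follows, under two simplifying assumptions that are not part of the description above: *x_{min}* is treated as known and fixed, and every observation lies in the tail (*x ≥ x_{min}*). The function names are hypothetical, and the continuous-case closed-form MLE is assumed for fitting *α*.

```python
import math
import random

def fit_alpha(tail, x_min):
    """Continuous-case MLE of alpha (assumed closed form)."""
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

def ks_distance(tail, x_min, alpha):
    """Max gap between the empirical CDF of tail and the power-law CDF."""
    xs = sorted(tail)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        model_cdf = 1.0 - (x / x_min) ** (1.0 - alpha)
        d = max(d, abs((i + 1) / n - model_cdf), abs(i / n - model_cdf))
    return d

def gof_p_value(tail, x_min, n_synthetic, rng):
    """Fraction of synthetic data sets with a larger KS distance than the data."""
    alpha = fit_alpha(tail, x_min)
    d_observed = ks_distance(tail, x_min, alpha)
    exceed = 0
    for _ in range(n_synthetic):
        # Draw a synthetic sample of the same size from the fitted power law.
        synth = [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
                 for _ in tail]
        # Each synthetic set is refitted to its own power-law model.
        synth_alpha = fit_alpha(synth, x_min)
        if ks_distance(synth, x_min, synth_alpha) > d_observed:
            exceed += 1
    return exceed / n_synthetic

rng = random.Random(11)
power_law_data = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(500)]
print(gof_p_value(power_law_data, x_min=1.0, n_synthetic=100, rng=rng))
```

Refitting each synthetic data set matters: comparing against the originally fitted *α* would make the synthetic distances too large and inflate the *p*-value.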

**Dataset generation**

Each generated data set should have a distribution similar to the given data below *x_{min}* and should follow the fitted power law above *x_{min}*.
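One way to meet this requirement is a semi-parametric resampling scheme, sketched below as an assumption about how the generation could be done: each synthetic point comes from the fitted power law with probability *n_tail / n*, and is otherwise drawn uniformly at random from the observed values below *x_{min}*. The function name and the example values are hypothetical.

```python
import random

def synthetic_dataset(data, x_min, alpha, rng):
    """Generate one synthetic data set of the same size as data."""
    body = [x for x in data if x < x_min]
    n = len(data)
    n_tail = n - len(body)
    out = []
    for _ in range(n):
        if rng.random() < n_tail / n:
            # Tail: inverse-transform sample from the fitted power law.
            out.append(x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)))
        else:
            # Body: resample an observed value from below x_min.
            out.append(rng.choice(body))
    return out

rng = random.Random(0)
observed = [0.3, 0.5, 0.8, 1.2, 1.9, 2.4, 3.1, 7.5]  # made-up values
print(synthetic_dataset(observed, x_min=1.0, alpha=2.5, rng=rng))
```

When no observation lies below *x_{min}*, the tail branch is always taken, so the empty `body` list is never sampled.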

If we want the *p*-values to be accurate to within about *ε* of the true value, then we should generate at least *(1/4) ε^{-2}* synthetic data sets.
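The rule of thumb is easy to evaluate directly. The helper below (a hypothetical name) returns the smallest integer count that satisfies it. For *ε = 0.01*, the formula gives 10^4 / 4 = 2500 data sets.

```python
import math

def num_synthetic_sets(epsilon):
    """Smallest integer n with n >= (1/4) * epsilon^(-2)."""
    return math.ceil(0.25 / (epsilon * epsilon))

for eps in (0.125, 0.05, 0.01):
    print(eps, num_synthetic_sets(eps))
```

Because the count grows as *ε^{-2}*, halving the tolerance quadruples the number of synthetic data sets that must be generated and fitted.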

The power law is ruled out if *p ≤ 0.1*. A large *p*-value does not mean that the power law is the correct distribution for the data. Other distributions may fit the data equally well or even better. Moreover, for small values of *n*, the given data may follow a power law closely, and hence the *p*-value may be large, even when the power law is the wrong model for the data.

**Alternate distributions**

The *p*-value test can only be used to reject the power-law hypothesis, not to accept it. So even if *p > 0.1*, we can only say that the power-law hypothesis is not rejected. It could be the case that some other distribution fits the data equally well or even better. To eliminate this possibility, we calculate a *p*-value for a fit to the alternate distribution and compare it with the *p*-value for the power law. If the *p*-value for the power law is high and the *p*-value for the other distribution is low, we can say that the data are more likely to be drawn from the power-law distribution (though we still cannot be sure that they are **definitely** drawn from the power-law distribution).

**Likelihood Ratio Test**

This test can be used to directly compare two distributions against one another to see which is a better fit for the given data. The idea is to compute the likelihood of the given data under the two competing distributions. The one with the higher likelihood is taken to be the better fit. Equivalently, one can use the ratio of the two likelihoods, or the logarithm *R* of the ratio, whose sign indicates which distribution is favored. If *R* is close to zero, statistical fluctuations alone could push it to either side of zero, so *R* needs to be sufficiently far from zero. To check for this, Vuong's method [Vuong, Q. H., 1989, Econometrica 57, 307] is used, which gives a *p*-value that tells whether the conclusion drawn from the value of *R* is statistically significant. If this *p*-value is small (*p < 0.1*), the result is significant. Otherwise, the result is not taken to be reliable and the test does not favor either distribution.
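As an illustration, the sketch below compares a continuous power law against a shifted exponential on the tail *x ≥ x_{min}*. The choice of the exponential as the alternative, the function names, and the normal-approximation form of the significance value, *p = erfc(|R| / (√(2n) σ))* with *σ* the standard deviation of the pointwise log-likelihood differences, are assumptions made for this example rather than details given above.

```python
import math

def loglik_power_law(x, alpha, x_min):
    """Log-density of the continuous power law on x >= x_min."""
    return math.log((alpha - 1.0) / x_min) - alpha * math.log(x / x_min)

def loglik_exponential(x, lam, x_min):
    """Log-density of an exponential shifted to start at x_min."""
    return math.log(lam) - lam * (x - x_min)

def likelihood_ratio_test(tail, x_min):
    """Return (R, p): R > 0 favors the power law, R < 0 the exponential."""
    n = len(tail)
    # Maximum likelihood fits of both candidate distributions.
    alpha = 1.0 + n / sum(math.log(x / x_min) for x in tail)
    lam = 1.0 / (sum(tail) / n - x_min)
    diffs = [loglik_power_law(x, alpha, x_min)
             - loglik_exponential(x, lam, x_min) for x in tail]
    r = sum(diffs)
    mean = r / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    # Assumed normal approximation for the significance of the sign of R.
    p = math.erfc(abs(r) / (math.sqrt(2.0 * n) * sigma))
    return r, p

# Idealized power-law sample: evenly spaced quantiles, alpha = 2.5, x_min = 1.
n = 500
sample = [(1.0 - (i + 0.5) / n) ** (-1.0 / 1.5) for i in range(n)]
print(likelihood_ratio_test(sample, x_min=1.0))
```

A positive *R* with a small *p* would favor the power law over the exponential. A large *p* means the sign of *R* should not be trusted either way.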

Other than the likelihood ratio, several other tests like minimum description length (MDL) or cross-validation can also be performed.

About the contributor:

Shagun Sodhani, Analytics and Data Science team @ Adobe Systems

Copyright 2016-18, Compose Labs Inc. All rights reserved.