Pareto Analysis Technique
Introducing a Quantile Regression Method
Joseph Lee Petersen

Introduction

A broad approach to using correlation coefficients for parameter estimation, and not merely as descriptive statistics, has been developed [1,2,3,4]. A goal of this project was to extend these ideas specifically to estimating the parameters of the Pareto distribution. In this paper we recall the definition of the Pareto distribution, some of its basic properties, and some previously developed methods for estimating the parameters of the Pareto distribution from which a random sample comes. We then introduce a new parameter estimation scheme based on correlation coefficients. Finally, we study and compare the performance of each of the parameter estimation schemes.
The Pareto distribution was first proposed as a model for the distribution of incomes. It is also used as a model for the distribution of city populations within a given area. The Pareto distribution is defined by the following functions:

CDF: $F(x \mid \alpha, k) = 1 - (k/x)^{\alpha}$, $k \le x < \infty$; $\alpha, k > 0$

PDF: $f(x \mid \alpha, k) = \alpha k^{\alpha} / x^{\alpha + 1}$, $k \le x < \infty$; $\alpha, k > 0$

The parameter $k$ marks a lower bound on the possible values that a Pareto distributed random variable can take. To illustrate, Figure 1 shows a plot of the density of a Pareto(1, 1) random variable. A few well-known properties follow:

$E(X) = \alpha k / (\alpha - 1)$, $\alpha > 1$

$\mathrm{Var}(X) = \alpha k^2 / [(\alpha - 1)^2 (\alpha - 2)]$, $\alpha > 2$
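As a quick numerical companion to the definitions above, here is a minimal Python sketch of the CDF, PDF, mean, and variance. The function names and the argument order (k, alpha) are illustrative choices and do not come from the paper, whose computations were done in S-Plus. Returning infinity where a moment does not exist as a finite number is also a convention assumed here.

```python
import math


def pareto_cdf(x, k, alpha):
    """CDF: 1 - (k/x)^alpha for x >= k, and 0 below the lower bound k."""
    if x < k:
        return 0.0
    return 1.0 - (k / x) ** alpha


def pareto_pdf(x, k, alpha):
    """PDF: alpha * k^alpha / x^(alpha + 1) for x >= k, and 0 below k."""
    if x < k:
        return 0.0
    return alpha * k ** alpha / x ** (alpha + 1.0)


def pareto_mean(k, alpha):
    """E(X) = alpha*k/(alpha - 1); reported as inf when alpha <= 1 (assumed convention)."""
    if alpha <= 1.0:
        return math.inf
    return alpha * k / (alpha - 1.0)


def pareto_var(k, alpha):
    """Var(X) = alpha*k^2 / ((alpha-1)^2 (alpha-2)); reported as inf when alpha <= 2."""
    if alpha <= 2.0:
        return math.inf
    return alpha * k ** 2 / ((alpha - 1.0) ** 2 * (alpha - 2.0))
```

For the Pareto(1, 1) case plotted in Figure 1, `pareto_cdf(2.0, 1.0, 1.0)` evaluates the CDF at the median, and `pareto_mean(1.0, 1.0)` reflects that the mean is not finite there.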
Parameter Estimation
We are interested in estimating the parameters of the Pareto distribution from which a random sample comes. We will outline a few parameter estimation schemes.
2.1 Method of Moments

We modify the usual method of moments scheme according to a method laid out in Johnson and Kotz [5]. If we set the sample mean, $\bar{x}$, equal to the distribution's theoretical expected value given above, and set the sample minimum, $x_{(1)}$, equal to the theoretical expected value of the minimum of a size $n$ sample of Pareto($k$, $\alpha$) random variables, we obtain two equations in two unknowns:

$\bar{x} = \alpha k / (\alpha - 1)$

$x_{(1)} = n \alpha k / (n \alpha - 1)$
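The two moment equations can be solved in closed form. Eliminating $k$ gives $\hat{\alpha} = (n\bar{x} - x_{(1)}) / (n(\bar{x} - x_{(1)}))$, and substituting back gives $\hat{k} = (n\hat{\alpha} - 1)\,x_{(1)} / (n\hat{\alpha})$. The Python sketch below implements that algebra; the function name `pareto_mom` is hypothetical, and the code assumes the sample values are positive and not all equal.

```python
def pareto_mom(sample):
    """Modified method of moments: match the sample mean and the sample minimum.

    Returns (k_hat, alpha_hat). Assumes positive data with mean > minimum.
    """
    n = len(sample)
    x_bar = sum(sample) / n
    x_min = min(sample)
    # Solve x_bar = alpha*k/(alpha - 1) and x_min = n*alpha*k/(n*alpha - 1) for alpha.
    alpha_hat = (n * x_bar - x_min) / (n * (x_bar - x_min))
    # Back-substitute into the equation for the expected minimum.
    k_hat = (n * alpha_hat - 1.0) * x_min / (n * alpha_hat)
    return k_hat, alpha_hat
```

Rewriting the solution as $\hat{\alpha} = 1 + (n - 1)\,x_{(1)} / (n(\bar{x} - x_{(1)}))$ shows that this estimate always exceeds one and approaches one as the sample mean grows, which matters when the true $\alpha$ is small.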
2.2 Median Estimator

As far as the author knows, this is a new estimator. The idea is that, just as the method of moments sets the sample mean equal to the theoretical mean, here we set the sample median equal to the theoretical median. Many of the estimation schemes discussed in this paper were first studied in the case where $k$ was known to be equal to one, so that only $\alpha$ needed to be estimated. In this case we can see from the CDF above that if $x = m_s$, the median of the sample, we can estimate $\alpha$ as follows:

$1 - m_s^{-\alpha} = 0.5$

$\hat{\alpha} = \ln 2 / \ln m_s$

It is likely, though, that we will be interested in many cases in which $k$ is not equal to one. If we already have an estimate for $k$, call it $k_{est}$, we make the following adjustment to the estimate for $\alpha$:

$\hat{\alpha} = \ln 2 / \ln(m_s / k_{est})$
For the purposes of analyzing the performance of this estimator, we will use the minimum sample value as the estimate for k .
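A minimal Python sketch of the median estimator follows. The function name and the optional `k_est` argument are illustrative; when no estimate of $k$ is supplied, the sketch falls back to the sample minimum, as described above. It assumes the sample median is strictly greater than the estimate of $k$.

```python
import math
import statistics


def pareto_median_alpha(sample, k_est=None):
    """Estimate alpha by equating the sample median to the theoretical median.

    Solves 1 - (k_est / m_s)^alpha = 0.5 for alpha.
    """
    if k_est is None:
        k_est = min(sample)  # default: use the sample minimum as the estimate of k
    m_s = statistics.median(sample)
    return math.log(2.0) / math.log(m_s / k_est)
```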
2.3 Maximum Likelihood

The likelihood function, $L$, for the Pareto distribution has the following form:

$L(k, \alpha \mid x) = \prod_{i=1}^{n} \alpha k^{\alpha} / x_i^{\alpha + 1}$

Recall that the likelihood function tells us, as a function of the distribution parameters, how likely it is to have observed the data that we did in fact observe. The maximum likelihood estimates for $k$ and $\alpha$ are the values of $k$ and $\alpha$ that make $L$ as large as possible given the data we have. The most familiar method of maximizing functions involves calculus. However, we need no calculus to see that $L$ grows without bound as $k$ increases. The key is to recall that $k$ can be no larger than the smallest value of $x$ in our data, so the best we can do in maximizing $L$ by adjusting $k$ is:

$\hat{k} = \min\{x_i\}$

To find the maximum likelihood estimate for $\alpha$, calculus is appropriate. Since $L$ is positive, we can take its logarithm. We do this because it is easier to differentiate $\log L$ than $L$ itself. The logarithm is a strictly increasing function, so the value of $\alpha$ that maximizes $L$ also maximizes $\log L$. The process in brief looks like this:

$\log L = \sum_{i=1}^{n} \log(\alpha k^{\alpha} / x_i^{\alpha + 1}) = n \log(\alpha) + n \alpha \log(k) - (\alpha + 1) \sum_{i=1}^{n} \log(x_i)$

$\frac{d}{d\alpha} \log L = n/\alpha + n \log(k) - \sum_{i=1}^{n} \log(x_i)$

Setting the derivative equal to zero, a little algebra, and an omitted second derivative check (to confirm that we are maximizing rather than minimizing $L$) yield:

$\hat{\alpha} = n / \sum_{i=1}^{n} \log(x_i / \hat{k})$
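The two maximum likelihood estimates translate directly into code. The sketch below is a minimal Python version with a hypothetical function name; it assumes positive data containing at least one value larger than the minimum, so that the sum of log ratios is nonzero.

```python
import math


def pareto_mle(sample):
    """Maximum likelihood estimates (k_hat, alpha_hat) for Pareto data.

    k_hat is the sample minimum; alpha_hat is n divided by the sum of log(x_i / k_hat).
    """
    n = len(sample)
    k_hat = min(sample)
    log_ratio_sum = sum(math.log(x / k_hat) for x in sample)
    alpha_hat = n / log_ratio_sum
    return k_hat, alpha_hat
```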
2.4 Correlation Coefficients

Gideon [4] has shown, using a correlation based interpretation of linear regression, that the mean and standard deviation of a normally distributed data set of size $n$ can be estimated by regressing the sorted data on the 1st through $n$th $(n+1)$-tiles of the standard normal. In such a regression, the intercept of the fitted linear model serves as an unbiased estimate of the mean of the distribution from which the data came, and the slope of the fitted linear model serves as an unbiased estimate of the standard deviation. This quantile regression estimation scheme is appropriate not only for normally distributed data; it works for distributions from any scale regular family (see Footnote 1).

Footnote 1: A scale regular family is a family of distributions such that, for any member of the family, there is a scalar that can multiply the member to yield another member of the same family with unit variance. A common example of the use of the scale regular property is standardizing a Normally distributed random variable.

The connection between regression and correlation is laid out explicitly in Gideon [4]; we describe it here in brief. The value of $s$ that satisfies the following equation (see Gideon [1,2,3,4]) estimates the standard deviation:

$r(q, u_o - s q) = 0$

where $q$ is the vector of the 1st through $n$th $(n+1)$-tiles of the standard distribution, $n$ is the sample size, $u_o$ is the vector of the ordered data, and $r$ is any correlation coefficient. When $r$ is Pearson's correlation, the solution is exactly the least squares estimate of the slope of a linear model. As for the model intercept, if $r$ is Pearson's, the estimate is $\bar{u}_o - s \bar{q}$. If $r$ is Kendall's $\tau$ or the Greatest Deviation Correlation Coefficient (see Gideon [2]), the estimate of the intercept is $\mathrm{median}(u_o) - s \cdot \mathrm{median}(q)$. Again, if $r$ is Pearson's correlation coefficient, the solution is exactly the least squares line. It is important to note, however, that $r$ need not be Pearson's $r$; other correlation statistics will yield estimates of the slope and intercept of the line, perhaps with more desirable properties.

We will now describe how these ideas can be applied to the Pareto distribution. Let $X$ be a Pareto($k$, $\alpha$) distributed random variable. Then $U = \ln X$ is a two-parameter Exponentially distributed random variable with parameters $\theta$ and $\sigma$. That is, $U$ has probability density function

$f(u) = \frac{1}{\sigma} e^{-(u - \theta)/\sigma}$, $u \ge \theta$,

where $\sigma = 1/\alpha$ and $\theta = \ln k$. Its expected value is $\theta + \sigma$ or, in terms of the original Pareto random variable, $1/\alpha + \ln k$, and its variance is $\sigma^2$, or $1/\alpha^2$. (This means the standard deviation is $1/\alpha$.) Note that the Exponential distributions are a scale regular family. The random variable $Z = (U - \theta)/\sigma$ is then a standard Exponential with probability density function $f(z) = e^{-z}$ and cumulative distribution function $F(z) = 1 - e^{-z}$.
These facts suggest that if we log transform data which we hypothesize to be Pareto distributed, we can solve $r(q, u_o - s q) = 0$, with $q$ being the quantiles of the standard Exponential, to get an estimate $s$ of the scale parameter of the Exponential distribution to which the log transformed Pareto data correspond. The appropriate center measure, $i$, of the uncentered residuals then gives an estimate of the location parameter of that Exponential distribution. To turn these estimates back into the Pareto parameters, we estimate $k$ by $e^{i}$ and $\alpha$ by $1/s$. In our studies we used Pearson's $r$ and the Greatest Deviation Correlation Coefficient (see Gideon [1,2]) to evaluate the performance of this estimation scheme.
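For the Pearson case, where the solution coincides with ordinary least squares, the scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the quantiles are taken as $-\ln(1 - i/(n+1))$ from inverting $F(z) = 1 - e^{-z}$, and the Greatest Deviation version is not attempted here because its definition is given in Gideon's papers rather than in this text.

```python
import math


def exp_quantiles(n):
    """1st through nth (n+1)-tiles of the standard Exponential distribution."""
    return [-math.log(1.0 - i / (n + 1.0)) for i in range(1, n + 1)]


def pareto_ls_quantile(sample):
    """Least squares quantile regression estimates (k_hat, alpha_hat).

    Regresses the sorted log data on standard Exponential quantiles;
    the slope s estimates 1/alpha and the intercept i estimates ln k.
    """
    n = len(sample)
    u = sorted(math.log(x) for x in sample)
    q = exp_quantiles(n)
    q_bar = sum(q) / n
    u_bar = sum(u) / n
    s_qu = sum((qi - q_bar) * (ui - u_bar) for qi, ui in zip(q, u))
    s_qq = sum((qi - q_bar) ** 2 for qi in q)
    slope = s_qu / s_qq                # Pearson solution of r(q, u - s q) = 0
    intercept = u_bar - slope * q_bar  # mean of the uncentered residuals
    return math.exp(intercept), 1.0 / slope
```

If the sorted log data fall exactly on a line in the Exponential quantiles, the sketch recovers the generating parameters exactly, which makes for a convenient sanity check.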
An Example
We used the probability integral transformation to write a function in S-Plus that generates random samples of Pareto distributed data. For this example we generated a size 100 sample from a Pareto(1, 0.5) distribution. A histogram of the data appears in Figure 2. We have truncated the histogram in order to compare it to the density plot in Figure 1, and because some extreme outliers would have made the histogram uninformative had they been included. (The maximum was 5715 and the next highest value was 1581.)

The next plot shows the sorted, logarithm transformed data plotted against the first through 100th 101-tiles of an Exponential(1) random variable. The dotted line is the least squares regression line. The solid line is the Greatest Deviation based regression line. The intercept and slope of the least squares line are 0.126 and 2.006, which imply estimates for $k$ and $\alpha$ of $e^{0.126} = 1.134$ and $1/2.006 = 0.498$. The intercept and slope of the GD line are 0.023 and 2.139, which imply estimates for $k$ and $\alpha$ of 1.023 and 0.467.

The method of moments estimates are $k = 1.025$ and $\alpha = 1.009$. It is important to note how poorly the method of moments does in estimating $\alpha$. The reason is that the integral defining the variance of a Pareto distribution does not converge if $\alpha$ is less than or equal to two, and similarly the integral defining the distribution's mean is infinite if $\alpha$ is less than or equal to one. This means that the distribution is prone to extreme outliers. The limit of the method of moments estimate of $\alpha$ as $\bar{x}$ goes to infinity is one, so a single extreme outlier can yield an estimate of one for $\alpha$. The maximum likelihood estimates are $k = 1.035$ and $\alpha = 0.487$. The median estimate for $\alpha$, assuming a $k$ estimate of $k_{min}$ (1.035), is 0.504.

Finally, to illustrate the equation $r(q, u_o - s q) = 0$, we can plot $r(q, u_o - s q)$ as a function of $s$. Remember that $u_o$ is the logarithm of the sorted, presumed Pareto data and hence is distributed Exponentially. The value of $s$ where the graph crosses the $s$ axis is the correlation estimate of the standard deviation of the Exponential distribution. Our estimate of $\alpha$ is 0.498, which was obtained with a midpoint algorithm and is therefore more precise than a value read off a graph. Recall that solving the correlation equation is embedded in least squares regression. This is why the correlation estimate is the same as that from the LS regression.
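The root-finding step can be sketched as a bisection (midpoint) search on $s$. The Python below is a hedged illustration, not the author's S-Plus code: the names `pearson_r` and `solve_scale` are hypothetical, the sample is drawn with a fixed seed through the probability integral transformation $x = k(1 - U)^{-1/\alpha}$, and the search assumes the correlation is positive at $s = 0$ and changes sign once, which holds for Pearson's $r$ when the sorted data are not constant.

```python
import math
import random


def exp_quantiles(n):
    # 1st through nth (n+1)-tiles of the standard Exponential
    return [-math.log(1.0 - i / (n + 1.0)) for i in range(1, n + 1)]


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # degenerate case: treat as zero correlation
    return sxy / math.sqrt(sxx * syy)


def solve_scale(q, u, r=pearson_r, tol=1e-10):
    """Find s with r(q, u - s*q) = 0 by bisection."""
    def g(s):
        return r(q, [ui - s * qi for qi, ui in zip(q, u)])

    lo, hi = 0.0, 1.0
    while g(hi) > 0.0:  # widen the bracket until the correlation turns non-positive
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
k_true, alpha_true = 1.0, 0.5
sample = [k_true * (1.0 - rng.random()) ** (-1.0 / alpha_true) for _ in range(100)]
u = sorted(math.log(x) for x in sample)
q = exp_quantiles(len(u))
s_hat = solve_scale(q, u)
alpha_hat = 1.0 / s_hat
```

Passing a different correlation function as `r` would give a different estimator in the same framework, provided that function also crosses zero once over the bracket.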
Estimator Performance
It was interesting to see one example, but to truly evaluate the performance of the estimators we need simulations. Each simulation began by setting values for $k$ and $\alpha$. S-Plus generated a size 100 sample of Pareto($k$, $\alpha$) data and estimated $k$ using one of the parameter estimation schemes discussed earlier. This was repeated 1000 times. The mean, median, and standard error (SE) of those 1000 estimates were computed. The estimated bias was calculated as the mean minus the true value of the parameter. The mean squared error (MSE) was calculated as the bias squared plus the SE squared. This process was repeated for $\alpha$ using the same estimation scheme and the same values of $k$ and $\alpha$ to generate the random samples. When all the estimation schemes had been tested, the entire process was repeated using different values of $k$ and $\alpha$ to generate the data. We tested values of $k$ at 1, 1,000, and 1,000,000, and for each value of $k$ we tested values of $\alpha$ at 0.1, 1, 10, 100, and 1,000.

The estimation schemes were GDCC based quantile regression for $k$ (GDqk), GDCC based quantile regression for $\alpha$ (GDqα), least squares based quantile regression for $k$ (LSqk), least squares based quantile regression for $\alpha$ (LSqα), method of moments for $k$ (momk), method of moments for $\alpha$ (momα), maximum likelihood for $k$ (MLk), maximum likelihood for $\alpha$ (MLα), the median estimator for $\alpha$ assuming $k = x_{min}$ (medα), and a Pearson's correlation based estimate for $\alpha$. For the purposes of reporting in this paper, S-Plus rounded all results to 3 digits as per its significant digits function. The results of one simulation are given below. The output for the other fourteen simulations can be found at the end of the paper.
Simulation Statistics for Pareto(1, 0.1)

Estimator  Mean    Median  SE        Bias      MSE
GDqk       1.1200  0.9600  6.44e-01   0.12100  0.429000
GDqα       0.1010  0.1000  1.19e-02   0.00133  0.000144
LSqk       1.1100  0.7780  1.08e+00   0.10500  1.190000
LSqα       0.0971  0.0964  1.29e-02  -0.00293  0.000175
momk       1.1000  1.0600  1.24e-01   0.10400  0.026100
momα       1.0000  1.0000  1.42e-12   0.90000  0.810000
MLk        1.1100  1.0700  1.23e-01   0.10700  0.026500
MLα        0.1020  0.1010  1.02e-02   0.00183  0.000108
medα       0.1040  0.1020  1.55e-02   0.00354  0.000254
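The simulation procedure described in this section can be sketched in Python as follows. Several details are assumptions made for illustration: the names `rpareto`, `mle_alpha`, and `simulate` are hypothetical, the SE is taken to be the sample standard deviation of the replicate estimates, the sampler uses the probability integral transformation $x = k(1 - U)^{-1/\alpha}$, and only the maximum likelihood estimator of $\alpha$ is wired in as an example. Results from this sketch will not reproduce the S-Plus figures digit for digit, since the random number streams differ.

```python
import math
import random
import statistics


def rpareto(n, k, alpha, rng):
    """Draw n Pareto(k, alpha) values by inverting the CDF at uniform draws."""
    return [k * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


def mle_alpha(sample):
    """Maximum likelihood estimate of alpha, with k estimated by the sample minimum."""
    k_hat = min(sample)
    return len(sample) / sum(math.log(x / k_hat) for x in sample)


def simulate(estimator, true_value, k, alpha, n=100, reps=1000, seed=0):
    """Repeat: generate a sample, apply the estimator; then summarize the estimates."""
    rng = random.Random(seed)
    estimates = [estimator(rpareto(n, k, alpha, rng)) for _ in range(reps)]
    mean = statistics.mean(estimates)
    se = statistics.stdev(estimates)  # assumed meaning of "standard error" here
    bias = mean - true_value
    return {
        "mean": mean,
        "median": statistics.median(estimates),
        "se": se,
        "bias": bias,
        "mse": bias ** 2 + se ** 2,
    }


# Example: maximum likelihood for alpha with k = 1000 and alpha = 1.
summary = simulate(mle_alpha, true_value=1.0, k=1000.0, alpha=1.0)
```

Any of the other estimators can be passed as `estimator`, as long as it maps a sample to a single number and `true_value` is the parameter it targets.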
Results
Looking at all the output reveals something fascinating. The method of moments approach, despite being totally inappropriate for estimating $\alpha$, performed consistently best as an estimator for $k$. There were only two instances in which the GD based quantile regression method was less biased than the method of moments ($k$ = 1,000 and $\alpha$ = 100; $k$ = 1,000,000 and $\alpha$ = 1), and one instance ($k$ = 1,000,000 and $\alpha$ = 0.1) in which maximum likelihood yielded a smaller MSE. As for estimating $\alpha$, the trend seemed to be that the GD based quantile regression estimator was in all cases the least biased estimator, while maximum likelihood yielded the least MSE for low values of $\alpha$ and the method of moments yielded the least MSE for high values of $\alpha$.
Conclusion
Among the estimators studied, the method of moments clearly performed best for estimating $k$. While this is interesting theoretically, it could be less interesting practically. Johnson and Kotz [5, p. 242] say that the Pareto distribution serves well as a model for incomes at the extremities, but not as well over the entire range of income levels. We are guessing, then, that someone fitting a Pareto model to a set of data would first determine through other means the value $k$ at which the Pareto model takes over, and would then be interested only in the power parameter $\alpha$ that models the data greater than $k$. However, if one does need to estimate $k$, we recommend the method of moments estimator.

As for estimating $\alpha$, the Greatest Deviation based quantile regression proved to be the least biased estimator, but it generally had a 20% greater standard error than maximum likelihood. Maximum likelihood yielded smaller mean squared error than the Greatest Deviation based quantile regression over the parameter values we studied. If one needs an unbiased estimator, we recommend the Greatest Deviation based quantile regression estimator. If small MSE is desired, we recommend maximum likelihood estimation.

Part of the motivation for this project was to develop a parameter estimation scheme for the Pareto distribution that fits into the broader subject of using correlation statistics for parameter estimation and not merely as descriptive statistics. So far the quantile regression approach has only been applied to symmetric distributions such as the Normal, Gosset's t, and Cauchy. We see that correlations, specifically in the case of the Greatest Deviation Correlation Coefficient, can produce good estimates for skewed distributions. The GD quantile regression approach produced estimates for $\alpha$ that were relatively unbiased.
Simulation Statistics for Pareto(1, 10)

Estimator  Mean    Median  SE        Bias       MSE
GDqk        1.000   1.000  0.005000   0.000282  2.51e-05
GDqα       10.000   9.960  1.220000   0.033800  1.49e+00
LSqk        0.997   0.997  0.009230  -0.003360  9.64e-05
LSqα        9.670   9.550  1.340000  -0.334000  1.92e+00
momk        1.000   1.000  0.001010   0.000007  1.02e-06
momα       10.100  10.000  1.040000   0.085700  1.08e+00
MLk         1.000   1.000  0.000914   0.000978  1.79e-06
MLα        10.200  10.100  1.020000   0.202000  1.07e+00
medα       10.300  10.100  1.500000   0.253000  2.31e+00
Simulation Statistics for Pareto(1, 100)

Estimator  Mean   Median  SE        Bias       MSE
GDqk         1.0    1.0   5.03e-04   1.80e-05  2.53e-07
GDqα       100.0   99.1   1.23e+01   3.71e-01  1.51e+02
LSqk         1.0    1.0   9.32e-04  -2.85e-04  9.50e-07
LSqα        96.0   94.9   1.27e+01  -4.00e+00  1.78e+02
momk         1.0    1.0   9.33e-05  -9.00e-07  8.71e-09
momα       101.0   99.7   1.02e+01   6.77e-01  1.05e+02
MLk          1.0    1.0   9.84e-05   1.03e-04  2.03e-08
MLα        102.0  101.0   1.01e+01   2.30e+00  1.08e+02
medα       103.0  102.0   1.48e+01   3.03e+00  2.28e+02
Simulation Statistics for Pareto(1, 1000)

Estimator  Mean  Median  SE        Bias       MSE
GDqk          1       1  4.50e-05   3.00e-06  2.04e-09
GDqα       1010    1000  1.24e+02   1.03e+01  1.56e+04
LSqk          1       1  9.59e-05  -3.17e-05  1.02e-08
LSqα        970     965  1.30e+02  -3.00e+01  1.79e+04
momk          1       1  1.02e-05   0.00e+00  1.03e-10
momα       1010    1010  1.03e+02   1.27e+01  1.08e+04
MLk           1       1  9.33e-06   9.00e-06  1.68e-10
MLα        1020    1010  1.07e+02   1.72e+01  1.17e+04
medα       1030    1020  1.51e+02   2.65e+01  2.36e+04
Simulation Statistics for Pareto(1000, 0.1)

Estimator  Mean      Median    SE        Bias       MSE
GDqk       1.14e+03  9.66e+02  6.72e+02   1.38e+02  4.71e+05
GDqα       1.01e-01  1.00e-01  1.19e-02   8.77e-04  1.43e-04
LSqk       1.13e+03  7.61e+02  1.19e+03   1.34e+02  1.44e+06
LSqα       9.66e-02  9.56e-02  1.32e-02  -3.36e-03  1.87e-04
momk       1.10e+03  1.06e+03  1.13e+02   9.63e+01  2.20e+04
momα       1.00e+00  1.00e+00  1.42e-12   9.00e-01  8.10e-01
MLk        1.11e+03  1.07e+03  1.33e+02   1.12e+02  3.03e+04
MLα        1.02e-01  1.01e-01  1.02e-02   1.88e-03  1.08e-04
medα       1.03e-01  1.02e-01  1.54e-02   3.29e-03  2.49e-04
Simulation Statistics for Pareto(1000, 1)

Estimator  Mean      Median    SE      Bias      MSE
GDqk       1000.000   997.000  49.500    1.9500  2.46e+03
GDqα          1.000     0.992   0.120    0.0017  1.43e-02
LSqk        974.000   975.000  91.800  -25.6000  9.08e+03
LSqα          0.961     0.950   0.130   -0.0394  1.84e-02
momk       1000.000   998.000  11.100    1.8600  1.27e+02
momα          1.190     1.180   0.102    0.1900  4.63e-02
MLk        1010.000  1010.000   9.760    9.8200  1.92e+02
MLα           1.020     1.010   0.106    0.0200  1.17e-02
medα          1.030     1.010   0.152    0.0295  2.40e-02
Simulation Statistics for Pareto(1000, 10)

Estimator  Mean     Median   SE    Bias     MSE
GDqk       1000.00  1000.00  4.89   0.0960  23.90
GDqα         10.10     9.96  1.24   0.0937   1.54
LSqk        997.00   998.00  9.33  -2.6300  93.90
LSqα          9.67     9.57  1.36  -0.3310   1.95
momk       1000.00  1000.00  1.04   0.0330   1.07
momα         10.20    10.10  1.05   0.1810   1.13
MLk        1000.00  1000.00  1.03   0.9630   1.98
MLα          10.20    10.20  1.02   0.2290   1.10
medα         10.30    10.10  1.54   0.3070   2.45

Simulation Statistics for Pareto(1000, 100)

Estimator  Mean    Median  SE       Bias     MSE
GDqk       1000.0  1000.0   0.4870   0.0020  2.38e-01
GDqα        101.0   100.0  12.0000   0.9540  1.46e+02
LSqk       1000.0  1000.0   0.8930  -0.2640  8.66e-01
LSqα         96.2    95.1  12.7000  -3.7900  1.76e+02
momk       1000.0  1000.0   0.0996  -0.0028  9.93e-03
momα        102.0   101.0  10.1000   1.6300  1.04e+02
MLk        1000.0  1000.0   0.0976   0.1020  1.99e-02
MLα         102.0   102.0  10.0000   2.2800  1.06e+02
medα        103.0   101.0  15.3000   2.6900  2.42e+02
Simulation Statistics for Pareto(1000, 1000)

Estimator  Mean  Median  SE        Bias     MSE
GDqk       1000    1000    0.0490    0.002  2.41e-03
GDqα       1000     994  126.0000    3.770  1.59e+04
LSqk       1000    1000    0.0951   -0.029  9.88e-03
LSqα        963     955  128.0000  -36.800  1.78e+04
momk       1000    1000    0.0098    0.000  9.61e-05
momα       1010     999  107.0000    7.490  1.15e+04
MLk        1000    1000    0.0107    0.010  2.14e-04
MLα        1020    1010  102.0000   15.200  1.07e+04
medα       1020    1010  146.0000   19.500  2.17e+04
Simulation Statistics for Pareto(1mil, 0.1)

Estimator  Mean      Median    SE        Bias       MSE
GDqk       1.23e+06  1.00e+06  8.63e+05   2.26e+05  7.96e+11
GDqα       1.00e-01  9.94e-02  1.20e-02   1.67e-04  1.43e-04
LSqk       1.12e+06  7.92e+05  1.11e+06   1.25e+05  1.24e+12
LSqα       9.64e-02  9.56e-02  1.24e-02  -3.56e-03  1.67e-04
momk       1.10e+06  1.06e+06  1.23e+05   1.01e+05  2.53e+10
momα       1.00e+00  1.00e+00  1.31e-12   9.00e-01  8.10e-01
MLk        1.11e+06  1.07e+06  1.16e+05   1.05e+05  2.44e+10
MLα        1.02e-01  1.02e-01  9.98e-03   2.45e-03  1.06e-04
medα       1.03e-01  1.02e-01  1.55e-02   2.96e-03  2.50e-04
Simulation Statistics for Pareto(1mil, 1)

Estimator  Mean      Median    SE        Bias       MSE
GDqk       1.00e+06  9.96e+05  4.90e+04  -4.58e+02  2.40e+09
GDqα       1.01e+00  9.98e-01  1.23e-01   9.45e-03  1.52e-02
LSqk       9.75e+05  9.76e+05  8.85e+04  -2.52e+04  8.47e+09
LSqα       9.64e-01  9.53e-01  1.27e-01  -3.62e-02  1.75e-02
momk       1.00e+06  9.98e+05  1.02e+04   1.34e+03  1.06e+08
momα       1.20e+00  1.19e+00  9.97e-02   1.95e-01  4.81e-02
MLk        1.01e+06  1.01e+06  1.01e+04   1.02e+04  2.07e+08
MLα        1.02e+00  1.01e+00  1.03e-01   1.96e-02  1.09e-02
medα       1.02e+00  1.01e+00  1.48e-01  -1.00e+06  1.00e+12
Simulation Statistics for Pareto(1mil, 10)

Estimator  Mean      Median    SE       Bias       MSE
GDqk       1.00e+06  1.00e+06  4670.00   2.23e+02  2.18e+07
GDqα       1.00e+01  9.93e+00     1.18   1.79e-02  1.40e+00
LSqk       9.97e+05  9.97e+05  9090.00  -3.31e+03  9.35e+07
LSqα       9.62e+00  9.48e+00     1.32  -3.82e-01  1.89e+00
momk       1.00e+06  1.00e+06  1040.00  -1.29e+01  1.08e+06
momα       1.01e+01  1.01e+01     1.01   1.21e-01  1.04e+00
MLk        1.00e+06  1.00e+06   942.00   9.82e+02  1.85e+06
MLα        1.02e+01  1.01e+01     1.06   1.85e-01  1.15e+00
medα       1.03e+01  1.01e+01     1.52   2.80e-01  2.40e+00
Simulation Statistics for Pareto(1mil, 100)

Estimator  Mean      Median    SE     Bias      MSE
GDqk       1.00e+06  1.00e+06  478.0   -12.400  228000
GDqα       1.01e+02  1.00e+02   12.4     0.512     153
LSqk       1.00e+06  1.00e+06  910.0  -281.000  906000
LSqα       9.65e+01  9.57e+01   13.7    -3.470     200
momk       1.00e+06  1.00e+06   99.3    -0.100    9870
momα       1.01e+02  1.00e+02   10.3     1.020     107
MLk        1.00e+06  1.00e+06  104.0   104.000   21600
MLα        1.03e+02  1.02e+02   10.9     2.510     124
medα       1.03e+02  1.01e+02   15.4     2.680     245
Simulation Statistics for Pareto(1mil, 1000)

Estimator  Mean     Median   SE      Bias    MSE
GDqk       1000000  1000000   47.90   -0.30   2290.0
GDqα          1010      997  122.00    6.03  15000.0
LSqk       1000000  1000000   97.50  -28.50  10300.0
LSqα           974      962  125.00  -26.50  16400.0
momk       1000000  1000000    9.60   -0.20     92.1
momα          1010     1000  103.00   11.00  10800.0
MLk        1000000  1000000    9.42   10.00    189.0
MLα           1020     1020  105.00   21.80  11500.0
medα          1030     1010  146.00   25.00  22100.0
References
[1] Rudy A. Gideon, Generalized Interpretation of Pearson's r, unpublished paper (URL: http://www.math.umt.edu/gideon/Generalized%20CC.pdf)
[2] Rudy A. Gideon, Random Variables, Regression, and the Greatest Deviation Correlation Coefficient, unpublished paper (URL: http://www.math.umt.edu/gideon/SLR%20theory%20revision.pdf)
[3] Rudy A. Gideon, Correlation in Simple Linear Regression, unpublished paper (URL: http://www.math.umt.edu/gideon/CORR-NSPACE-REG.pdf)
[4] Rudy A. Gideon, Location and Scale Estimation with Correlation Coefficients, unpublished paper (URL: http://www.math.umt.edu/gideon/locscale.pdf)
[5] Norman L. Johnson and Samuel Kotz, Distributions in Statistics: Continuous Univariate Distributions - 1, Houghton Mifflin Company, Boston, 1970