Robust Pareto Tail Modeling For The Estimation of Indicators On Social Exclusion Using The R Package Laeken
Abstract In this vignette, robust semiparametric estimation of social exclusion indicators using
the R package laeken is discussed. Special emphasis is thereby given to income inequality
indicators, as the standard estimates for these indicators are highly influenced by outliers in the
upper tail of the income distribution. This influence can be reduced by modeling the upper tail
with a Pareto distribution in a robust manner. While the focus of the paper is to demonstrate the
functionality of laeken beyond the standard estimation techniques, a brief mathematical description
of the implemented procedures is given as well.
1 Introduction
From a robustness point of view, the standard estimators for some of the social exclusion indicators
defined by Eurostat (2004, 2009) are problematic. In particular the income inequality indicators
quintile share ratio (QSR) and Gini coefficient suffer from a lack of robustness. Consider, e.g., the
QSR, which is estimated as the ratio of estimated totals or means (see Section 2.1 for an exact
definition). It is well known that the classical estimates for totals or means have a breakdown point
of 0, meaning that even a single outlier can distort the results to an arbitrary extent. In fact, the
influence of a single observation in the upper tail of the income distribution on the estimation of the
QSR is linear and therefore unbounded. For practical purposes, the standard QSR estimator thus
cannot be recommended in many situations (cf. Hulliger and Schoch 2009). It is also important to
note that the behavior of the Gini coefficient is similar to the behavior of the QSR.
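The unbounded, linear influence of a single large income can be illustrated with a minimal base R sketch. It assumes a simplified, unweighted QSR in which the two extreme groups are fixed by rank; the helper `toy_qsr()` and the income values are hypothetical and are not part of laeken.

```r
# Toy unweighted QSR: total income of the richest 20% divided by the
# total income of the poorest 20% (group sizes fixed by rank for simplicity).
toy_qsr <- function(x) {
  x <- sort(x)
  m <- floor(length(x) / 5)          # size of each extreme group
  sum(tail(x, m)) / sum(head(x, m))
}

income <- seq(10000, 59000, by = 1000)   # 50 hypothetical incomes
base <- toy_qsr(income)

# Inflate only the largest income and track the resulting QSR.
shift <- c(0, 1e5, 1e6, 1e7)
qsr_shifted <- sapply(shift, function(d) {
  y <- income
  y[which.max(y)] <- max(y) + d
  toy_qsr(y)
})
print(cbind(shift, qsr_shifted))
```

In this toy setting the QSR grows by exactly the shift divided by the total income of the poorest group, so it can be made arbitrarily large by changing one observation.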
The data basis for the estimation of the social exclusion indicators according to Eurostat (2004,
2009) is the European Union Statistics on Income and Living Conditions (EU-SILC), which is
an annual panel survey conducted in EU member states and other European countries. On the
one hand, EU-SILC data typically contain a considerable amount of representative outliers in the
upper tail of the income distribution, i.e., correct observations that behave differently from the
main part of the data, but that are not unique in the population and hence need to be considered
for computing estimates of the indicators. On the other hand, EU-SILC data frequently contain
some even more extreme nonrepresentative outliers, i.e., observations that are either incorrect or
can be considered unique in the population. Consequently, such nonrepresentative outliers need to
be excluded from the estimation process or downweighted.
As a remedy, the upper tail of the income distribution may be modeled with a Pareto distribution
in order to recalibrate the sample weights or use fitted income values for observations in the
upper tail when estimating the indicators (see Section 6). Nevertheless, classical estimators for
the parameters of the Pareto distribution are themselves highly influenced by nonrepresentative
outliers. Robust methods reduce the influence of such outliers on the fit of the Pareto distribution
to the representative outliers, and therefore on the estimation of the indicators.
Rather than evaluating these methods, the paper concentrates on showing how they can be
applied in the statistical environment R (R Development Core Team 2013) with the add-on package
laeken (Alfons et al. 2013). The basic design of the package, as well as standard estimation of the
social exclusion indicators is discussed in detail in vignette laeken-standard (Templ and Alfons
2011a). Furthermore, the general framework for variance estimation is illustrated in vignette
laeken-variance (Templ and Alfons 2011b). Those documents can be viewed from within R with
the following commands:
R> vignette("laeken-standard")
R> vignette("laeken-variance")
Moreover, a general introduction to package laeken is published as Alfons and Templ (2013).
Throughout the paper, the example data from package laeken is used. The data set is called
eusilc and consists of 14 827 observations from 6 000 households. In addition, it was synthetically
generated from Austrian EU-SILC survey data from 2006 using the data simulation methodology
proposed by Alfons et al. (2011) and implemented in the R package simPopulation (Alfons and
Kraft 2012). More information on the example data can be found in vignette laeken-standard
or in the corresponding R help page.
R> library("laeken")
R> data("eusilc")
The rest of the paper is organized as follows. Section 2 gives a mathematical description of
the Eurostat definitions of the social exclusion indicators QSR and Gini coefficient. In Section 3,
the Pareto distribution is briefly discussed. Section 4 discusses a rule of thumb for estimating the
threshold for the upper tail of the distribution, and illustrates graphical methods for exploring the
data in order to find the threshold. Classical and robust estimators for the shape parameter of the
Pareto distribution are described in Section 5. How to use Pareto tail modeling to estimate the
social exclusion indicators is then shown in Section 6. Finally, Section 7 concludes.
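Using index sets $I_{\leq \hat{q}_{0.2}} := \{i \in \{1, \ldots, n\} : x_i \leq \hat{q}_{0.2}\}$ and
$I_{> \hat{q}_{0.8}} := \{i \in \{1, \ldots, n\} : x_i > \hat{q}_{0.8}\}$, the quintile share ratio is estimated by
\[ \widehat{QSR} := \frac{\sum_{i \in I_{> \hat{q}_{0.8}}} w_i x_i}{\sum_{i \in I_{\leq \hat{q}_{0.2}}} w_i x_i}. \tag{2} \]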
With package laeken, the quintile share ratio can be estimated using the function qsr(). Sample
weights can thereby be supplied via the weights argument.
R> qsr("eqIncome", weights = "rb050", data = eusilc)
Value:
[1] 3.970004
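The structure of Equation (2) can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the implementation behind qsr(): the helper names are hypothetical, and the weighted quantile used here (the smallest value whose cumulative weight share reaches p) is a simplification that may differ from Equation (1) in how boundary cases are handled.

```r
# Simplified weighted quantile (assumption): smallest value whose cumulative
# weight share reaches p.
wquantile_sketch <- function(x, w, p) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  x[which(cumsum(w) / sum(w) >= p)[1]]
}

# Weighted quintile share ratio following the structure of Equation (2).
qsr_sketch <- function(x, w = rep(1, length(x))) {
  q20 <- wquantile_sketch(x, w, 0.2)
  q80 <- wquantile_sketch(x, w, 0.8)
  low  <- x <= q20      # index set of incomes up to the 20% quantile
  high <- x >  q80      # index set of incomes above the 80% quantile
  sum(w[high] * x[high]) / sum(w[low] * x[low])
}

x <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
res_equal    <- qsr_sketch(x)
res_weighted <- qsr_sketch(x, w = c(2, 1, 1, 1, 1, 1, 1, 1, 1, 2))
print(c(equal = res_equal, weighted = res_weighted))
```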
The function gini() is available in laeken to estimate the Gini coefficient. As before, sample
weights can be specified with the weights argument.
R> gini("eqIncome", weights = "rb050", data = eusilc)
Value:
[1] 26.48962
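For orientation, a weighted Gini coefficient on the percentage scale can be sketched in base R as follows. The formula in the code is an assumption about the form of the Eurostat definition, which is not restated here, and `gini_sketch()` is a hypothetical helper rather than the code behind gini().

```r
# Weighted Gini coefficient in percent (assumed formula):
# 100 * [ (2 * sum_i(w_i x_i cumw_i) - sum_i(w_i^2 x_i))
#         / (sum_i(w_i) * sum_i(w_i x_i)) - 1 ],
# where cumw_i is the cumulative weight up to the i-th smallest income.
gini_sketch <- function(x, w = rep(1, length(x))) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cw <- cumsum(w)
  100 * ((2 * sum(w * x * cw) - sum(w^2 * x)) / (sum(w) * sum(w * x)) - 1)
}

g_equal   <- gini_sketch(rep(1, 4))       # perfectly equal incomes
g_extreme <- gini_sketch(c(0, 0, 0, 1))   # one unit holds all income
print(c(g_equal, g_extreme))
```

With equal weights, the second toy case reproduces the familiar value (n - 1)/n for a sample in which one unit holds the entire income.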
\[ f_{\theta}(x) = \frac{\theta x_0^{\theta}}{x^{\theta + 1}}, \qquad x \geq x_0. \tag{5} \]
Figure 1 visualizes the Pareto probability density function with scale parameter x0 = 1 and
different values of the shape parameter θ. Clearly, the Pareto distribution is a highly right-skewed
distribution with a heavy tail. It is therefore reasonable to assume that a random variable following
a Pareto distribution contains extreme values. The effect of changing the shape parameter θ is
visible in the probability mass at the scale parameter x0 : the higher θ, the higher the probability
mass at x0 .
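The density in Equation (5) is easy to evaluate directly. The following base R sketch uses a hypothetical helper `dpareto_sketch()` to check two properties mentioned in the text: the density at the scale parameter equals θ/x0, and the density integrates to one.

```r
# Pareto density from Equation (5); zero below the scale parameter x0.
dpareto_sketch <- function(x, x0 = 1, theta = 1) {
  ifelse(x >= x0, theta * x0^theta / x^(theta + 1), 0)
}

thetas <- c(1, 2, 3)

# Density at x0 = 1: equals theta, so it increases with the shape parameter.
at_scale <- sapply(thetas, function(th) dpareto_sketch(1, x0 = 1, theta = th))

# Numerical check that each density integrates to one over [x0, Inf).
total_mass <- sapply(thetas, function(th) {
  integrate(dpareto_sketch, lower = 1, upper = Inf, x0 = 1, theta = th)$value
})
print(rbind(at_scale, total_mass))
```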
In Pareto tail modeling, the cumulative distribution function on the whole range of x is modeled
as
\[ F(x) = \begin{cases} G(x), & \text{if } x \leq x_0, \\ G(x_0) + (1 - G(x_0)) F_{\theta}(x), & \text{if } x > x_0, \end{cases} \tag{6} \]
where G is an unknown distribution function (Dupuis and Victoria-Feser 2006).
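A minimal sketch of the composite distribution function in Equation (6) is given below. It assumes the standard Pareto distribution function F_θ(x) = 1 − (x/x0)^(−θ) for x ≥ x0, and it uses a lognormal distribution purely as a placeholder for the unknown G; both helper names and all parameter values are hypothetical.

```r
# Pareto distribution function (assumed form): 1 - (x / x0)^(-theta) for x >= x0.
ppareto_sketch <- function(x, x0, theta) {
  ifelse(x >= x0, 1 - (x / x0)^(-theta), 0)
}

# Composite distribution function of Equation (6): G below the threshold,
# rescaled Pareto tail above it.
tail_model_cdf <- function(x, G, x0, theta) {
  ifelse(x <= x0,
         G(x),
         G(x0) + (1 - G(x0)) * ppareto_sketch(x, x0, theta))
}

# Placeholder for the unknown body distribution G (illustration only).
G <- function(x) plnorm(x, meanlog = 10, sdlog = 0.5)
x0 <- 50000
theta <- 3

grid <- seq(1000, 2e5, length.out = 500)
cdf_values <- tail_model_cdf(grid, G, x0, theta)
print(tail_model_cdf(c(x0, 2 * x0, 1e9), G, x0, theta))
```

The construction is continuous at x0 and approaches one in the upper tail, since the Pareto part only redistributes the mass 1 − G(x0) above the threshold.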
[Figure 1: Pareto probability density functions f(x) with scale parameter x0 = 1 and shape parameters θ = 1, 2, 3, plotted for x from 1 to 6.]
Let n be the number of observations and let x = (x1 , . . . , xn )′ denote the observed values with
x1 ≤ . . . ≤ xn . In addition, let k be the number of observations to be used for tail modeling. In
this scenario, the threshold x0 is estimated by
\[ \hat{x}_0 := x_{n-k}. \tag{7} \]
If an estimate x̂0 for the scale parameter of the Pareto distribution has been obtained, k is given
by the number of observations larger than x̂0 . Estimating x0 and estimating k are thus equivalent:
each determines the other.
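The correspondence between the threshold and the number of tail observations can be sketched with toy values. The sketch assumes that the threshold is taken as the (n − k)-th order statistic, which matches the denominator of the relative excesses in Equation (18), and that there are no ties at the threshold.

```r
x <- sort(c(12, 15, 18, 21, 25, 40, 55, 90))  # toy incomes, sorted
n <- length(x)

# From k to the threshold: the (n - k)-th order statistic.
k <- 3
x0_hat <- x[n - k]

# From the threshold back to k: number of observations larger than it.
k_from_x0 <- sum(x > x0_hat)

print(c(x0_hat = x0_hat, k_from_x0 = k_from_x0))
```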
In the remainder of this package vignette, the equivalized disposable income of the EU-SILC
example data is of main interest. Consequently, the Pareto distribution will be modeled at the
household level rather than the individual level. Moreover, the focus of this vignette is on
robust estimation of the social exclusion indicators. Hence the equivalized disposable income of the
household with the largest income is replaced by a large outlier.
Since the aim is to model a Pareto distribution at the household level, the following command
creates a data set that contains only the equivalized disposable income and the sample weights on
the household level. This data set will be used in Sections 4 and 5 to estimate the parameters of
the Pareto distribution.
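A minimal sketch of these preparation steps on a toy data frame is shown below. The column names mirror those used for eusilc elsewhere in this vignette (db030 for the household ID, eqIncome for the equivalized disposable income, db090 for the household sample weight); the toy values and the size of the outlier are hypothetical.

```r
# Toy person-level data: every person carries the household's income and weight.
toy <- data.frame(
  db030    = c(1, 1, 2, 3, 3, 3),                          # household ID
  eqIncome = c(20000, 20000, 35000, 50000, 50000, 50000),  # equivalized income
  db090    = c(500, 500, 700, 650, 650, 650)               # household weight
)

# Replace the income of the household with the largest income by an outlier.
hID <- toy$db030[which.max(toy$eqIncome)]
toy[toy$db030 == hID, "eqIncome"] <- 1e7   # hypothetical outlier value

# Keep one row per household with only income and household weight.
toyH <- toy[!duplicated(toy$db030), c("eqIncome", "db090")]
print(toyH)
```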
by minimizing the asymptotic mean squared error (AMSE). In package laeken, this approach is
implemented in the function minAMSE(). However, the procedure is designed for the non-robust
Hill estimator and is therefore not further discussed in this paper. Furthermore, Danielsson et al.
(2001) proposed a bootstrap method to find the optimal k for the Hill estimator with respect to
the AMSE, which has fewer analytical requirements than the approach by Beirlant et al. (1996a,b).
Please note that this method is not robust either and that it is currently not available in package
laeken. A robust prediction error criterion for choosing the number of observations k in the
tail and estimating the shape parameter θ was developed by Dupuis and Victoria-Feser (2006).
Nevertheless, our implementation of this robust criterion was unstable and is therefore not included
in laeken.
In any case, Holzer (2009) concludes that graphical methods for finding the threshold outperform
those analytical approaches in the case of EU-SILC data. While this section is thus focused on
graphical methods, a simple rule of thumb designed specifically for the equivalized disposable
income in EU-SILC data is described in the following as well.
where x̄ is the weighted mean, and q0.98 and q0.97 are weighted quantiles as defined in Equation (1).
In package laeken, the function paretoScale() provides functionality for computing the
threshold with van Kerm’s rule of thumb. The argument w is available to supply sample weights.
R> ts <- paretoScale(eusilcH$eqIncome, w = eusilcH$db090)
R> ts
Threshold: 48459.43
Number of observations in the tail: 119
It should be noted that the function returns an object of class "paretoScale", which consists
of a component x0 for the threshold (scale parameter) and a component k for the number of
observations in the tail of the distribution, i.e., that are larger than the threshold.
where the correction factor n/(n + 1) ensures that the quantiles reduce to (9) if all sample weights
are equal.
R> paretoQPlot(eusilcH$eqIncome, w = eusilcH$db090)
[Plot: Pareto quantile plot with theoretical quantiles (0 to 8) on the horizontal axis and a logarithmic vertical axis (1e+02 to 1e+06).]
Figure 2: Pareto Quantile plot for the example data eusilc on the household level with the largest
observation replaced by an outlier.
If the tail of the data follows a Pareto distribution, those observations form almost a straight
line. The leftmost point of a fitted line can thus be used as an estimate of the threshold x0 , the
scale parameter. All values starting from the point after the threshold may be modeled by a Pareto
distribution, but this point cannot be determined exactly. Furthermore, the slope of the fitted line
is in turn an estimate of 1/θ, the reciprocal of the shape parameter.
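The relation between the slope and the shape parameter can be checked on simulated data. The sketch below is an unweighted illustration under stated assumptions: the theoretical quantiles are taken as standard exponential quantiles at the plotting positions i/(n + 1), the sample is drawn from a Pareto distribution by inverse transformation, and the line is fitted by ordinary least squares (which is not robust and is used here only because the simulated data contain no outliers).

```r
set.seed(42)
theta <- 2
x0 <- 1000
n <- 5000

# Pareto sample via inverse transformation of uniform random numbers.
x <- sort(x0 * runif(n)^(-1 / theta))

# Theoretical quantiles of the standard exponential distribution at i / (n + 1).
theo <- -log(1 - seq_len(n) / (n + 1))

# Logarithms of the sorted observations against the theoretical quantiles:
# for Pareto data the points scatter around a line with slope 1 / theta.
fit <- lm(log(x) ~ theo)
slope <- unname(coef(fit)[2])
print(c(slope = slope, reciprocal_shape = 1 / theta))
```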
Figure 2 displays the Pareto quantile plot for the example data eusilc on the household level
with the largest observation replaced by an outlier. The plot is generated using the function
paretoQPlot(), which accepts sample weights via the argument w. In addition, the
threshold can be selected interactively by clicking on a data point. Information on the selected
threshold is then printed on the R console. When the interactive selection is terminated, which is
typically done by a secondary mouse click, the selected threshold is returned as an object of class
"paretoScale".
Another advantage of the Pareto quantile plot is also illustrated in Figure 2. Nonrepresentative
outliers such as the large income introduced into the example data in Section 3, i.e., extreme
observations in the upper tail that deviate from the Pareto model, are clearly visible.
R> meanExcessPlot(eusilcH$eqIncome, w = eusilcH$db090)
[Plot: mean excess (vertical axis, up to 150000) against the threshold (horizontal axis).]
Figure 3: Mean excess plot for the example data eusilc on the household level with the largest
observation replaced by an outlier.
A detailed description for the case without sample weights can be found in Borkovec and
Klüppelberg (2000).
For the following definition of the mean excess plot, keep in mind that the observations are
sorted such that $x_1 \leq \ldots \leq x_n$. For each observation $x_i$, $i = 1, \ldots, \lfloor n - \sqrt{n} \rfloor$, the empirical excess
function $e_n$ is computed. In the case without sample weights, the expectation in Equation (11) is
replaced by the arithmetic mean, and the empirical excess function is given by
\[ e_n(x_i) := \frac{1}{n - i} \sum_{j=i+1}^{n} (x_j - x_i), \qquad i = 1, \ldots, \lfloor n - \sqrt{n} \rfloor. \tag{12} \]
The values of the empirical excess function $e_n(x_i)$ are then plotted against the corresponding $x_i$,
$i = 1, \ldots, \lfloor n - \sqrt{n} \rfloor$. If sample weights are available in the data, the mean excess plot is simply
generalized by using the weighted mean for the empirical excess function:
\[ e_n(x_i) := \frac{1}{\sum_{j=i+1}^{n} w_j} \sum_{j=i+1}^{n} w_j (x_j - x_i), \qquad i = 1, \ldots, \lfloor n - \sqrt{n} \rfloor. \tag{13} \]
If the tail of the data follows a Pareto distribution, those observations show a positive linear
trend. The leftmost point of a fitted line can thus be used as an estimate of the threshold x0 , the
scale parameter. As for the Pareto quantile plot, a disadvantage of the mean excess plot is that
the threshold cannot be determined exactly.
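The empirical excess function in Equations (12) and (13) can be sketched in base R as follows. The helper `mean_excess_sketch()` is hypothetical and only computes the plotted coordinates; with equal weights it reduces to the unweighted definition.

```r
# Coordinates of the mean excess plot: for each of the first floor(n - sqrt(n))
# sorted observations, the (weighted) mean excess of all larger observations.
mean_excess_sketch <- function(x, w = rep(1, length(x))) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  n <- length(x)
  idx <- seq_len(floor(n - sqrt(n)))
  e <- vapply(idx, function(i) {
    j <- (i + 1):n
    sum(w[j] * (x[j] - x[i])) / sum(w[j])
  }, numeric(1))
  data.frame(threshold = x[idx], mean_excess = e)
}

me <- mean_excess_sketch(c(1, 2, 3, 4, 10))
print(me)
```

For the five toy values, only the first two observations are used as thresholds, because ⌊5 − √5⌋ = 2.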
Figure 3 shows the mean excess plot for the example data eusilc on the household level with
the largest observation replaced by an outlier. The function meanExcessPlot() is thereby used to
produce the plot. Sample weights can be supplied via the argument w. Interactive selection of the
threshold works just like for the Pareto quantile plot. Again, the selected threshold is returned as
an object of class "paretoScale".
[1] 3.437979
[1] 3.437979
with
\[ \Psi(x, \theta) := u(x, \theta) \frac{\partial}{\partial \theta} \log f(x, \theta) = u(x, \theta) \left( \frac{1}{\theta} - \log \frac{x}{x_0} \right), \tag{17} \]
where u(x, θ) is a weight function with values in [0, 1]. In the implementation in package laeken,
a Huber type weight function is used by default, as proposed by Dupuis and Victoria-Feser (2006).
Let the logarithms of the relative excesses be denoted by
\[ z_i := \log \frac{x_{n-k+i}}{x_{n-k}}, \qquad i = 1, \ldots, k. \tag{18} \]
In the Pareto model, these can be predicted by
\[ \hat{z}_i := -\frac{1}{\theta} \log \frac{k + 1 - i}{k + 1}, \qquad i = 1, \ldots, k. \tag{19} \]
The variance of $z_i$ is given by
\[ \sigma_i^2 := \sum_{j=1}^{i} \frac{1}{\theta^2 (k - i + j)^2}, \qquad i = 1, \ldots, k. \tag{20} \]
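Equations (18) to (20) translate directly into base R. The sketch below uses a hypothetical helper and toy values to compute the log relative excesses, their predictions under the Pareto model, and the variances for a given shape parameter.

```r
# Log relative excesses z (18), predictions z_hat (19) and variances sigma2 (20)
# for the k largest observations and a given shape parameter theta.
tail_diagnostics_sketch <- function(x, k, theta) {
  x <- sort(x)
  n <- length(x)
  i <- seq_len(k)
  z <- log(x[n - k + i] / x[n - k])
  z_hat <- -log((k + 1 - i) / (k + 1)) / theta
  sigma2 <- vapply(i, function(ii) {
    j <- seq_len(ii)
    sum(1 / (theta^2 * (k - ii + j)^2))
  }, numeric(1))
  data.frame(z = z, z_hat = z_hat, sigma2 = sigma2)
}

d <- tail_diagnostics_sketch(c(1, 2, 4, 8, 16), k = 3, theta = 1)
print(d)
```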
where $u_i := u(x_{n-k+i}, \theta)$ and $f_i := f(x_{n-k+i}, \theta)$. This term is used to obtain a bias-corrected
estimator
\[ \tilde{\theta} := \hat{\theta} - \hat{B}(\hat{\theta}). \tag{24} \]
For details and proofs of the above statements, as well as for information on a probability-based
weight function u(x, θ), the reader is referred to Dupuis and Morgenthaler (2002) and Dupuis and
Victoria-Feser (2006). However, note that the WML estimator does not consider sample weights. An
adjustment of the estimator to take sample weights into account is currently not available due to
its complexity. For sampling designs that lead to equal sample weights, the WML estimator may
still be useful, though.
The function thetaWML() is available in laeken to compute the WML estimator. Again, either
the argument k or x0 needs to be used to specify the number of observations in the tail or the
threshold. Since the sample weights in the example data are not equal, the following example is
only included to demonstrate the use of the function.
R> thetaWML(eusilcH$eqIncome, k = ts$k)
[1] 4.226204
R> thetaWML(eusilcH$eqIncome, x0 = ts$x0)
[1] 4.226204
The density function of the Pareto distribution for the relative excesses is approximated by
The ISE estimator is then given by minimizing the integrated squared error criterion (Terrell 1990):
\[ \hat{\theta} = \arg\min_{\theta} \left[ \int f_{\theta}^2(y) \, dy - 2 E(f_{\theta}(Y)) \right]. \tag{27} \]
If there are no sample weights in the data, the mean is used as an unbiased estimator of $E(f_{\theta}(Y))$
in order to obtain the ISE estimate
\[ \hat{\theta}_{\mathrm{ISE}} = \arg\min_{\theta} \left[ \int f_{\theta}^2(y) \, dy - \frac{2}{k} \sum_{i=1}^{k} f_{\theta}(y_i) \right]. \tag{28} \]
See Vandewalle et al. (2007) for more information on the ISE estimator for the case without sample
weights.
If sample weights are available in the data, the mean in Equation (28) is simply replaced by a
weighted mean to obtain the weighted integrated squared error (wISE) estimator:
\[ \hat{\theta}_{\mathrm{wISE}} = \arg\min_{\theta} \left[ \int f_{\theta}^2(y) \, dy - \frac{2}{\sum_{i=1}^{k} w_{n-k+i}} \sum_{i=1}^{k} w_{n-k+i} f_{\theta}(y_i) \right]. \tag{29} \]
With package laeken, the ISE estimator can be computed using the function thetaISE(). The
arguments k and x0 are available to specify either the number of observations in the tail or the
threshold, and sample weights can be supplied via the argument w.
R> thetaISE(eusilcH$eqIncome, k = ts$k, w = eusilcH$db090)
[1] 3.993801
R> thetaISE(eusilcH$eqIncome, x0 = ts$x0, w = eusilcH$db090)
[1] 3.993801
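The criterion in Equations (28) and (29) can be minimized numerically with a few lines of base R. The following is a sketch under stated assumptions rather than the code behind thetaISE(): it takes the relative excesses as y_i = x_{n−k+i}/x_{n−k}, assumes the Pareto density with scale 1 for them, f_θ(y) = θ y^(−(1+θ)), so that the integral of f_θ² equals θ²/(2θ + 1), and searches over an arbitrarily chosen interval.

```r
# Sketch of the (weighted) ISE criterion; with equal weights it reduces to (28).
ise_sketch <- function(x, k, w = rep(1, length(x))) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  n <- length(x)
  tail_idx <- (n - k + 1):n
  y <- x[tail_idx] / x[n - k]        # relative excesses (assumed definition)
  wt <- w[tail_idx]
  crit <- function(theta) {
    # integral of f_theta^2 minus twice the weighted mean of f_theta(y_i)
    theta^2 / (2 * theta + 1) -
      2 * sum(wt * theta * y^(-(1 + theta))) / sum(wt)
  }
  optimize(crit, interval = c(0.01, 20))$minimum   # search interval is arbitrary
}

set.seed(1)
x <- 1000 * runif(20000)^(-1 / 3)    # simulated Pareto data with shape 3
theta_hat <- ise_sketch(x, k = 2000)
print(theta_hat)
```

On simulated Pareto data without contamination, the estimate should be close to the true shape parameter; the behavior under outliers is discussed in Vandewalle et al. (2007).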
6 Estimation of the indicators using Pareto tail modeling
Three approaches based on Pareto tail modeling for reducing the influence of outliers on the social
exclusion indicators are implemented in the R package laeken:
Calibration for nonrepresentative outliers (CN): Values larger than a certain quantile of the fit-
ted distribution are declared as nonrepresentative outliers. Since these are considered to be
unique to the population data, the sample weights of the corresponding observations are set
to 1 and the weights of the remaining observations are adjusted accordingly by calibration.
Replacement of nonrepresentative outliers (RN): Values larger than a certain quantile of the
fitted distribution are declared as nonrepresentative outliers. Only these nonrepresentative
outliers are replaced by values drawn from the fitted distribution, thereby preserving the
order of the original values.
Replacement of the tail (RT): All values above the threshold are replaced by values drawn from
the fitted distribution. The order of the original values is preserved.
An evaluation of the RT approach by means of a simulation study can be found in Alfons et al.
(2010).
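The idea behind the RT approach can be sketched in base R on toy data. This is an illustration of the description above and not the implementation in laeken: the helper `replace_tail_sketch()`, the threshold, and the shape parameter are hypothetical, and the draws are generated by inverse transformation.

```r
# Replace all values above x0 by draws from a Pareto distribution with scale x0
# and shape theta, assigned so that the order of the original values is kept.
replace_tail_sketch <- function(x, x0, theta) {
  in_tail <- x > x0
  k <- sum(in_tail)
  draws <- sort(x0 * runif(k)^(-1 / theta))
  out <- x
  out[in_tail] <- draws[rank(x[in_tail], ties.method = "first")]
  out
}

set.seed(1234)
x <- c(10, 80, 20, 5000, 60, 30)     # toy incomes; 5000 plays the outlier
new_x <- replace_tail_sketch(x, x0 = 50, theta = 3)
print(new_x)
```

Values at or below the threshold are left untouched, and the former outlier remains the largest observation but takes a value that is plausible under the fitted tail.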
Keep in mind that the largest observation in the example data eusilc was replaced by a large
outlier in Section 3. With the following command, the Gini coefficient is estimated according to
the Eurostat definition to show that even a single outlier can completely distort the results for the
standard estimation (see Section 2.2 for the original value).
R> gini("eqIncome", weights = "rb050", data = eusilc)
Value:
[1] 29.24333
For Pareto tail modeling, the function paretoTail() is implemented in laeken. It returns an
object of class "paretoTail", which contains all the necessary information for further analysis
using the three approaches described above. Note that the household IDs are supplied via the
argument groups such that the Pareto distribution is fitted on the household level rather than the
individual level. In addition, the PDC is used by default to estimate the shape parameter. Other
estimators can be specified via the method argument.
R> fit <- paretoTail(eusilc$eqIncome, k = ts$k,
+ w = eusilc$db090, groups = eusilc$db030)
The function reweightOut() is available for semiparametric estimation with the CN approach.
It returns a vector of the recalibrated weights. In this example, regional information is used as
auxiliary variables for calibration. The function calibVars() thereby transforms a factor into a
matrix of binary variables, as required by the calibration function calibWeights(), which is called
internally. These recalibrated weights are then simply used to estimate the Gini coefficient with
function gini().
R> w <- reweightOut(fit, calibVars(eusilc$db040))
R> gini(eusilc$eqIncome, w)
Value:
[1] 26.45973
For the RN approach, the function replaceOut() is implemented. Since values are drawn
from the fitted distribution to replace the observations flagged as outliers, the seed of the random
number generator is set first for reproducibility of the results. The returned vector of incomes is
then supplied to gini() to estimate the Gini coefficient.
R> set.seed(1234)
R> eqIncome <- replaceOut(fit)
R> gini(eqIncome, weights = eusilc$rb050)
Value:
[1] 26.46924
Similarly, the function replaceTail() is available for the RT approach. Again, the seed of the
random number generator is set beforehand.
R> set.seed(1234)
R> eqIncome <- replaceTail(fit)
R> gini(eqIncome, weights = eusilc$rb050)
Value:
[1] 26.64921
It should be noted that replaceTail() can also be used for the RN approach by setting the
argument all to FALSE. In fact, replaceOut(x, ...) is a simple wrapper for replaceTail(x,
all = FALSE, ...).
In any case, the estimates for the semiparametric approaches based on Pareto tail modeling are
very close to the original value before the outlier was introduced (see Section 2.2), whereas
the standard estimate is corrupted by the outlier. Furthermore, the estimation of other
indicators such as the quintile share ratio (see Section 2.1) using the semiparametric approaches is
straightforward and hence not shown here.
7 Conclusions
This vignette shows the functionality of package laeken for robust semiparametric estimation of
social exclusion indicators based on Pareto tail modeling. Most notably, it demonstrates that the
functions are easy to use and that the implementation follows an object-oriented design. While
the focus of the paper lies on the use of the package, a mathematical description of the methods is
given as well.
Furthermore, it is shown that the standard estimation of the inequality indicators can be cor-
rupted by a single outlier, thus underlining the need for robust alternatives. Three approaches
for robust semiparametric estimation based on Pareto tail modeling are thereby implemented such
that the corresponding functions share a common interface for ease of use.
Acknowledgments
This work was partly funded by the European Union (represented by the European Commission)
within the 7th framework programme for research (Theme 8, Socio-Economic Sciences and
Humanities, Project AMELI (Advanced Methodology for European Laeken Indicators), Grant Agreement
No. 217322). Visit http://ameli.surveystatistics.net for more information on the project.
References
A. Alfons and S. Kraft. simPopulation: Simulation of Synthetic Populations for Surveys Based on
Sample Data, 2012. URL https://CRAN.R-project.org/package=simPopulation. R package
version 0.4.0.
A. Alfons and M. Templ. Estimation of social exclusion indicators from complex surveys: The R
package laeken. Journal of Statistical Software, 54(15):1–25, 2013. doi: 10.18637/jss.v054.i15.
A. Alfons, M. Templ, P. Filzmoser, and J. Holzer. A comparison of robust methods for Pareto tail
modeling in the case of Laeken indicators. In C. Borgelt, G. González-Rodríguez, W. Trutschnig,
M.A. Lubiano, M.A. Gil, P. Grzegorzewski, and O. Hryniewicz, editors, Combining Soft
Computing and Statistical Methods in Data Analysis, volume 77 of Advances in Intelligent and Soft
Computing, pages 17–24. Springer-Verlag, Heidelberg, 2010. ISBN 978-3-642-14745-6.
A. Alfons, S. Kraft, M. Templ, and P. Filzmoser. Simulation of close-to-reality population data
for household surveys with application to EU-SILC. Statistical Methods & Applications, 20(3):
383–407, 2011.
A. Alfons, J. Holzer, and M. Templ. laeken: Estimation of Indicators on Social Exclusion and
Poverty, 2013. URL https://CRAN.R-project.org/package=laeken. R package version 0.4.4.
J. Beirlant, P. Vynckier, and J.L. Teugels. Tail index estimation, Pareto quantile plots, and
regression diagnostics. Journal of the American Statistical Association, 91(436):1659–1667, 1996a.
J. Beirlant, P. Vynckier, and J.L. Teugels. Excess functions and estimation of the extreme-value
index. Bernoulli, 2(4):293–318, 1996b.
M. Borkovec and C. Klüppelberg. Extremwerttheorie für Finanzzeitreihen – ein unverzichtbares
Werkzeug im Risikomanagement. In L. Johanning and B. Rudolph, editors, Handbuch
Risikomanagement, pages 219–241. Uhlenbruch, Bad Soden, 2000. ISBN 3933207150.
J. Danielsson, L. de Haan, L. Peng, and C.G. de Vries. Using a bootstrap method to choose the
sample fraction in tail index estimation. Journal of Multivariate Analysis, 76(2):226–248, 2001.
D.J. Dupuis and S. Morgenthaler. Robust weighted likelihood estimators with an application to
bivariate extreme value problems. The Canadian Journal of Statistics, 30(1):17–36, 2002.
D.J. Dupuis and M.-P. Victoria-Feser. A robust prediction error criterion for Pareto modelling of
upper tails. The Canadian Journal of Statistics, 34(4):639–658, 2006.
Eurostat. Common cross-sectional EU indicators based on EU-SILC; the gender pay gap. EU-SILC
131-rev/04, Unit D-2: Living conditions and social protection, Directorate D: Single Market,
Employment and Social statistics, Eurostat, Luxembourg, 2004.
Eurostat. Algorithms to compute social inclusion indicators based on EU-SILC and adopted under
the Open Method of Coordination (OMC). Doc. LC-ILC/39/09/EN-rev.1, Unit F-3: Living
conditions and social protection, Directorate F: Social and information society statistics, Eurostat,
Luxembourg, 2009.
B.M. Hill. A simple general approach to inference about the tail of a distribution. The Annals of
Statistics, 3(5):1163–1174, 1975.
J. Holzer. Robust methods for the estimation of selected Laeken indicators. Master’s thesis,
Department of Statistics and Probability Theory, Vienna University of Technology, Vienna,
Austria, 2009.
B. Hulliger and T. Schoch. Robustification of the quintile share ratio. New Techniques and
Technologies for Statistics, Brussels, 2009.
C. Kleiber and S. Kotz. Statistical Size Distributions in Economics and Actuarial Sciences. John
Wiley & Sons, Hoboken, New Jersey, 2003. ISBN 0-471-15064-9.
R Development Core Team. R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria, 2013. URL https://www.R-project.org.
ISBN 3-900051-07-0.
M. Templ and A. Alfons. Standard methods for point estimation of social inclusion indicators using
the R package laeken. Research Report CS-2011-1, Department of Statistics and Probability
Theory, Vienna University of Technology, 2011a.
M. Templ and A. Alfons. Variance estimation of social inclusion indicators using the R package
laeken. Research Report CS-2011-3, Department of Statistics and Probability Theory, Vienna
University of Technology, 2011b.
G. Terrell. Linear density estimates. In Proceedings of the Statistical Computing Section, pages
297–302. American Statistical Association, 1990.
P. Van Kerm. Extreme incomes and the estimation of poverty and inequality indicators from
EU-SILC. IRISS Working Paper Series 2007-01, CEPS/INSTEAD, 2007.
B. Vandewalle, J. Beirlant, A. Christmann, and M. Hubert. A robust estimator for the tail index
of Pareto-type distributions. Computational Statistics & Data Analysis, 51(12):6252–6268, 2007.