Robust Pareto Tail Modeling for the Estimation of Indicators on Social Exclusion using the R Package laeken
Andreas Alfons¹, Matthias Templ², Peter Filzmoser³, Josef Holzer⁴

Abstract In this vignette, robust semiparametric estimation of social exclusion indicators using
the R package laeken is discussed. Special emphasis is thereby given to income inequality indicators,
as the standard estimates for these indicators are highly influenced by outliers in the upper
tail of the income distribution. This influence can be reduced by modeling the upper tail with a
Pareto distribution in a robust manner. While the focus of the paper is to demonstrate the functionality
of laeken beyond the standard estimation techniques, a brief mathematical description
of the implemented procedures is given as well.

1 Introduction
From a robustness point of view, the standard estimators for some of the social exclusion indicators
defined by Eurostat (2004, 2009) are problematic. In particular the income inequality indicators
quintile share ratio (QSR) and Gini coefficient suffer from a lack of robustness. Consider, e.g., the
QSR, which is estimated as the ratio of estimated totals or means (see Section 2.1 for an exact
definition). It is well known that the classical estimates for totals or means have a breakdown point
of 0, meaning that even a single outlier can distort the results to an arbitrary extent. In fact, the
influence of a single observation in the upper tail of the income distribution on the estimation of the
QSR is linear and therefore unbounded. For practical purposes, the standard QSR estimator thus
cannot be recommended in many situations (cf. Hulliger and Schoch 2009). It is also important to
note that the behavior of the Gini coefficient is similar to the behavior of the QSR.
The data basis for the estimation of the social exclusion indicators according to Eurostat (2004,
2009) is the European Union Statistics on Income and Living Conditions (EU-SILC), which is
an annual panel survey conducted in EU member states and other European countries. On the
one hand, EU-SILC data typically contain a considerable amount of representative outliers in the
upper tail of the income distribution, i.e., correct observations that behave differently from the
main part of the data, but that are not unique in the population and hence need to be considered
for computing estimates of the indicators. On the other hand, EU-SILC data frequently contain
some even more extreme nonrepresentative outliers, i.e., observations that are either incorrect or
can be considered unique in the population. Consequently, such nonrepresentative outliers need to
be excluded from the estimation process or downweighted.

¹ Erasmus School of Economics, Erasmus University Rotterdam
² Zurich University of Applied Sciences
³ Vienna University of Technology
⁴ Landesstatistik Steiermark

As a remedy, the upper tail of the income distribution may be modeled with a Pareto distribution
in order to recalibrate the sample weights or use fitted income values for observations in the
upper tail when estimating the indicators (see Section 6). However, classical estimators for
the parameters of the Pareto distribution are themselves highly influenced by nonrepresentative
outliers. Robust methods reduce the influence of such outliers on fitting the Pareto distribution to the
representative outliers, and therefore on the estimation of the indicators.
Rather than evaluating these methods, the paper concentrates on showing how they can be
applied in the statistical environment R (R Development Core Team 2013) with the add-on package
laeken (Alfons et al. 2013). The basic design of the package, as well as standard estimation of the
social exclusion indicators is discussed in detail in vignette laeken-standard (Templ and Alfons
2011a). Furthermore, the general framework for variance estimation is illustrated in vignette
laeken-variance (Templ and Alfons 2011b). Those documents can be viewed from within R with
the following commands:
R> vignette("laeken-standard")
R> vignette("laeken-variance")
Moreover, a general introduction to package laeken is published as Alfons and Templ (2013).
Throughout the paper, the example data from package laeken is used. The data set is called
eusilc and consists of 14 827 observations from 6 000 households. In addition, it was synthetically
generated from Austrian EU-SILC survey data from 2006 using the data simulation methodology
proposed by Alfons et al. (2011) and implemented in the R package simPopulation (Alfons and
Kraft 2012). More information on the example data can be found in vignette laeken-standard
or in the corresponding R help page.
R> library("laeken")
R> data("eusilc")
The rest of the paper is organized as follows. Section 2 gives a mathematical description of
the Eurostat definitions of the social exclusion indicators QSR and Gini coefficient. In Section 3,
the Pareto distribution is briefly discussed. Section 4 discusses a rule of thumb for estimating the
threshold for the upper tail of the distribution, and illustrates graphical methods for exploring the
data in order to find the threshold. Classical and robust estimators for the shape parameter of the
Pareto distribution are described in Section 5. How to use Pareto tail modeling to estimate the
social exclusion indicators is then shown in Section 6. Finally, Section 7 concludes.

2 Social exclusion indicators

This paper is focused on the inequality indicators quintile share ratio (QSR) and Gini coefficient,
which are both highly influenced by outliers in the upper tail of the distribution. Note that for
the estimation of the social exclusion indicators, each person in a household is assigned the same
equivalized disposable income. See vignette laeken-standard (Templ and Alfons 2011a) for the
computation of the equivalized disposable income with the R package laeken.
For the following definitions, let $x := (x_1, \ldots, x_n)'$ be the equivalized disposable income with
$x_1 \leq \ldots \leq x_n$ and let $w := (w_1, \ldots, w_n)'$ be the corresponding personal sample weights, where $n$
denotes the number of observations.

2.1 Quintile share ratio (QSR)

The income quintile share ratio (QSR) is defined as the ratio of the sum of the equivalized disposable
income received by the 20% of the population with the highest equivalized disposable income to
that received by the 20% of the population with the lowest equivalized disposable income (Eurostat
2004, 2009).
For the estimation of the quintile share ratio from a sample, let q̂0.2 and q̂0.8 denote the weighted
20% and 80% quantiles, respectively. With 0 ≤ p ≤ 1, these weighted quantiles are given by
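\[
\hat{q}_p = \hat{q}_p(x, w) :=
\begin{cases}
\frac{1}{2} (x_j + x_{j+1}), & \text{if } \sum_{i=1}^{j} w_i = p \sum_{i=1}^{n} w_i, \\
x_{j+1}, & \text{if } \sum_{i=1}^{j} w_i < p \sum_{i=1}^{n} w_i < \sum_{i=1}^{j+1} w_i.
\end{cases} \tag{1}
\]
Using index sets $I_{\leq \hat{q}_{0.2}} := \{i \in \{1, \ldots, n\} : x_i \leq \hat{q}_{0.2}\}$ and $I_{> \hat{q}_{0.8}} := \{i \in \{1, \ldots, n\} : x_i > \hat{q}_{0.8}\}$, the
quintile share ratio is estimated by
\[
\widehat{QSR} := \frac{\sum_{i \in I_{> \hat{q}_{0.8}}} w_i x_i}{\sum_{i \in I_{\leq \hat{q}_{0.2}}} w_i x_i}. \tag{2}
\]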

With package laeken, the quintile share ratio can be estimated using the function qsr(). Sample
weights can thereby be supplied via the weights argument.
R> qsr("eqIncome", weights = "rb050", data = eusilc)

Value:
[1] 3.970004
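
To make Equations (1) and (2) concrete, the following base R sketch recomputes the weighted quantiles and the quintile share ratio by hand on toy data. The helper names weighted_quantile() and qsr_sketch(), the numerical tolerance used for the equality case, and the toy data are assumptions made for illustration; this is not the code behind qsr().

```r
# Weighted quantile following Equation (1); hypothetical helper, not part of laeken.
weighted_quantile <- function(x, w, p) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cw <- cumsum(w)                              # cumulative weights
  target <- p * sum(w)                         # p times the total weight
  tol <- sqrt(.Machine$double.eps) * sum(w)    # assumed tolerance for the equality case
  j <- which(cw >= target - tol)[1]
  if (abs(cw[j] - target) <= tol && j < length(x)) {
    (x[j] + x[j + 1]) / 2                      # first case of Equation (1)
  } else {
    x[j]                                       # second case of Equation (1)
  }
}

# Quintile share ratio following Equation (2); hypothetical helper.
qsr_sketch <- function(x, w) {
  q20 <- weighted_quantile(x, w, 0.2)
  q80 <- weighted_quantile(x, w, 0.8)
  top <- x > q80
  bottom <- x <= q20
  sum(w[top] * x[top]) / sum(w[bottom] * x[bottom])
}

# Toy data: ten incomes, each with sample weight 2.
income <- 1:10
weight <- rep(2, 10)
q20_toy <- weighted_quantile(income, weight, 0.2)
q80_toy <- weighted_quantile(income, weight, 0.8)
qsr_toy <- qsr_sketch(income, weight)
```

For the toy data, the weighted 20% and 80% quantiles are 2.5 and 8.5, so the ratio is (9 + 10)/(1 + 2) = 19/3.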

2.2 Gini coefficient

The Gini coefficient is defined as the relationship of cumulative shares of the population arranged
according to the level of equivalized disposable income, to the cumulative share of the equivalized
total disposable income received by them (Eurostat 2004, 2009).
For the estimation of the Gini coefficient from a sample, the sample weights need to be taken
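into account. In mathematical terms, the Gini coefficient is estimated by
\[
\widehat{Gini} := 100 \left[ \frac{2 \sum_{i=1}^{n} \left( w_i x_i \sum_{j=1}^{i} w_j \right) - \sum_{i=1}^{n} w_i^2 x_i}{\left( \sum_{i=1}^{n} w_i \right) \sum_{i=1}^{n} (w_i x_i)} - 1 \right]. \tag{3}
\]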

The function gini() is available in laeken to estimate the Gini coefficient. As before, sample
weights can be specified with the weights argument.

R> gini("eqIncome", weights = "rb050", data = eusilc)

Value:
[1] 26.48962
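
Equation (3) can be checked on small examples with a few lines of base R. The function name gini_sketch() and the toy inputs below are assumptions for illustration and do not reproduce the internals of gini().

```r
# Weighted Gini coefficient following Equation (3); hypothetical helper.
gini_sketch <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cw <- cumsum(w)   # inner sum over j = 1, ..., i of w_j
  100 * ((2 * sum(w * x * cw) - sum(w^2 * x)) / (sum(w) * sum(w * x)) - 1)
}

# Everyone has the same income: no inequality.
gini_equal <- gini_sketch(rep(5, 4), rep(1, 4))
# One of four persons receives all income.
gini_one <- gini_sketch(c(0, 0, 0, 1), rep(1, 4))
# A weight of 2 behaves like a duplicated observation.
gini_dup <- gini_sketch(c(1, 1, 2), c(1, 1, 1))
gini_wtd <- gini_sketch(c(1, 2), c(2, 1))
```

In these toy cases the formula gives 0 for equal incomes and 75 when one of four persons holds all income, and the weighted and duplicated versions of the same data agree.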

3 The Pareto distribution

The Pareto distribution is well studied in the literature and is defined in terms of its cumulative
distribution function
\[
F_\theta(x) = 1 - \left( \frac{x}{x_0} \right)^{-\theta}, \qquad x \geq x_0, \tag{4}
\]
where $x_0 > 0$ is the scale parameter and $\theta > 0$ is the shape parameter (Kleiber and Kotz 2003).
Furthermore, its density function is given by
\[
f_\theta(x) = \frac{\theta x_0^\theta}{x^{\theta+1}}, \qquad x \geq x_0. \tag{5}
\]
Figure 1 visualizes the Pareto probability density function with scale parameter x0 = 1 and
different values of the shape parameter θ. Clearly, the Pareto distribution is a highly right-skewed
distribution with a heavy tail. It is therefore reasonable to assume that a random variable following
a Pareto distribution contains extreme values. The effect of changing the shape parameter θ is
visible in the probability mass at the scale parameter x0 : the higher θ, the higher the probability
mass at x0 .
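
The statements about Equations (4) and (5) can be verified numerically. The following base R sketch defines the distribution and density functions under the hypothetical names ppareto_sketch() and dpareto_sketch() (they are not functions of laeken) and evaluates the density at the scale parameter for θ = 1, 2, 3.

```r
# Pareto distribution function, Equation (4); hypothetical helper.
ppareto_sketch <- function(x, x0, theta) {
  ifelse(x >= x0, 1 - (x / x0)^(-theta), 0)
}

# Pareto density function, Equation (5); hypothetical helper.
dpareto_sketch <- function(x, x0, theta) {
  ifelse(x >= x0, theta * x0^theta / x^(theta + 1), 0)
}

# Density at the scale parameter x0 = 1 for theta = 1, 2, 3.
dens_at_x0 <- sapply(1:3, function(theta) dpareto_sketch(1, x0 = 1, theta = theta))

# The density should integrate to one over [x0, Inf).
total_mass <- integrate(dpareto_sketch, lower = 1, upper = Inf,
                        x0 = 1, theta = 2)$value

# Probability of a value of at most 2 for x0 = 1 and theta = 2.
p_below_2 <- ppareto_sketch(2, x0 = 1, theta = 2)
```

With x0 = 1 the density at the scale parameter equals θ itself, which matches the ordering of the curves described for Figure 1.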
In Pareto tail modeling, the cumulative distribution function on the whole range of x is modeled
as
\[
F(x) = \begin{cases}
G(x), & \text{if } x \leq x_0, \\
G(x_0) + (1 - G(x_0)) F_\theta(x), & \text{if } x > x_0,
\end{cases} \tag{6}
\]
where G is an unknown distribution function (Dupuis and Victoria-Feser 2006).

Figure 1: Pareto probability density functions f(x) with parameters x0 = 1 and θ = 1, 2, 3.

Let n be the number of observations and let x = (x1 , . . . , xn )′ denote the observed values with
x1 ≤ . . . ≤ xn . In addition, let k be the number of observations to be used for tail modeling. In
this scenario, the threshold x0 is estimated by

x̂0 := xn−k . (7)

Conversely, if an estimate x̂0 for the scale parameter of the Pareto distribution has been obtained, k is given
by the number of observations larger than x̂0. Estimating x0 and estimating k are therefore
equivalent problems.
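
The correspondence between k and the threshold in Equation (7) can be illustrated with a small base R sketch. The toy values are made up for illustration and contain no ties; with tied observations at the threshold, the two quantities would not map one-to-one in this simple way.

```r
# Toy sample without ties (made-up values), sorted in ascending order.
x <- sort(c(12, 15, 18, 22, 30, 45, 80, 150))
n <- length(x)

# From k to the threshold, Equation (7): x0_hat = x_(n-k).
k <- 3
x0_hat <- x[n - k]

# From the threshold back to k: number of observations larger than x0_hat.
k_back <- sum(x > x0_hat)
```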
In the remainder of this package vignette, the equivalized disposable income of the EU-SILC
example data is of main interest. Consequently, the Pareto distribution will be modeled at the
household level rather than the individual level. Moreover, the focus of this vignette is on ro-
bust estimation of the social exclusion indicators. Hence the equivalized disposable income of the
household with the largest income is replaced by a large outlier.

R> hID <- eusilc$db030[which.max(eusilc$eqIncome)]

R> eusilc[eusilc$db030 == hID, "eqIncome"] <- 10000000

Since the aim is to model a Pareto distribution at the household level, the following command
creates a data set that contains only the equivalized disposable income and the sample weights on
the household level. This data set will be used in Sections 4 and 5 to estimate the parameters of
the Pareto distribution.

R> eusilcH <- eusilc[!duplicated(eusilc$db030), c("eqIncome", "db090")]

4 Finding the threshold

The aim of the methods presented in this section is to find the threshold x0 for modeling the
Pareto distribution. Several methods for the estimation of the threshold x0 or the number of
observations k in the tail have been proposed in the literature, but those proposals typically do
not consider sample weights.
Beirlant et al. (1996a,b) developed a procedure that analytically determines the optimal choice
of k for the Hill estimator of the shape parameter (Hill 1975, see also Section 5.1 of this paper)
by minimizing the asymptotic mean squared error (AMSE). In package laeken, this approach is
implemented in the function minAMSE(). However, the procedure is designed for the non-robust
Hill estimator and is therefore not further discussed in this paper. Furthermore, Danielsson et al.
(2001) proposed a bootstrap method to find the optimal k for the Hill estimator with respect to
the AMSE, which has less analytical requirements than the approach by Beirlant et al. (1996a,b).
Please note that this method is not robust either and that it is currently not available in package
laeken. A robust prediction error criterion for choosing the number of observations k in the
tail and estimating the shape parameter θ was developed by Dupuis and Victoria-Feser (2006).
Nevertheless, our implementation of this robust criterion was unstable and is therefore not included
in laeken.
In any case, Holzer (2009) concludes that graphical methods for finding the threshold outperform
those analytical approaches in the case of EU-SILC data. While this section thus focuses on
graphical methods, a simple rule of thumb designed specifically for the equivalized disposable
income in EU-SILC data is described in the following as well.

4.1 Van Kerm’s rule of thumb

Van Kerm (2007) presented a formula that is more of a rule of thumb for the threshold of the
equivalized disposable income in EU-SILC data. It is given by

x̂0 := min(max(2.5x̄, q0.98 ), q0.97 ), (8)

where x̄ is the weighted mean, and q0.98 and q0.97 are weighted quantiles as defined in Equation (1).
In package laeken, the function paretoScale() provides functionality for computing the thresh-
old with van Kerm’s rule of thumb. The argument w is available to supply sample weights.
R> ts <- paretoScale(eusilcH$eqIncome, w = eusilcH$db090)
R> ts

Threshold: 48459.43
Number of observations in the tail: 119

It should be noted that the function returns an object of class "paretoScale", which consists
of a component x0 for the threshold (scale parameter) and a component k for the number of
observations in the tail of the distribution, i.e., that are larger than the threshold.

4.2 Pareto quantile plot

The Pareto quantile plot is a graphical method for inspecting the parameters of a Pareto distribu-
tion. For the case without sample weights, it is described in detail in Beirlant et al. (1996a).
If the Pareto model holds, there exists a linear relationship between the logarithms of the
observed values and the quantiles of the standard exponential distribution, since the logarithm of
a Pareto distributed random variable follows an exponential distribution. Hence the logarithms of
the observed values, log(xi ), i = 1, . . . , n, are plotted against the theoretical quantiles.
In the case without sample weights, the theoretical quantiles of the standard exponential distri-
bution are given by
\[
-\log \left( 1 - \frac{i}{n+1} \right), \qquad i = 1, \ldots, n, \tag{9}
\]
i.e., by dividing the range into n + 1 equally sized subsets and using the resulting n inner gridpoints
as probabilities for the quantiles. If the data contain sample weights, the range of the exponential
distribution needs to be divided according to the weights of the n observations. The Pareto quantile
plot is thus generalized by using the theoretical quantiles
\[
-\log \left( 1 - \frac{\sum_{j=1}^{i} w_j}{\sum_{j=1}^{n} w_j} \, \frac{n}{n+1} \right), \qquad i = 1, \ldots, n, \tag{10}
\]
where the correction factor n/(n + 1) ensures that the quantiles reduce to (9) if all sample weights are
equal.
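
The theoretical quantiles in Equations (9) and (10), and the linear relationship that motivates the plot, can be reproduced in base R. In this sketch the function name pareto_quantiles() is hypothetical, and the data are an idealized sample placed exactly at the quantiles of a Pareto distribution with made-up parameters, not real income data.

```r
# Theoretical quantiles for the weighted Pareto quantile plot, Equation (10).
# The weights must be ordered according to the sorted observations.
pareto_quantiles <- function(w) {
  n <- length(w)
  -log(1 - cumsum(w) / sum(w) * n / (n + 1))
}

n <- 200
theta <- 2     # assumed shape parameter
x0 <- 1000     # assumed scale parameter

# With equal weights, Equation (10) reduces to Equation (9).
q_equal <- pareto_quantiles(rep(3, n))
q_eq9 <- -log(1 - seq_len(n) / (n + 1))

# Idealized sample located exactly at the Pareto quantiles.
x <- x0 * exp(q_eq9 / theta)

# The slope of log(x) against the theoretical quantiles estimates 1/theta.
slope <- unname(coef(lm(log(x) ~ q_equal))[2])
```

For such idealized data the fitted slope equals 1/θ = 0.5 up to numerical precision; real data will only follow the line approximately, and only in the tail.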

R> paretoQPlot(eusilcH$eqIncome, w = eusilcH$db090)

Figure 2: Pareto quantile plot for the example data eusilc on the household level with the largest
observation replaced by an outlier. The horizontal axis shows the theoretical quantiles.

If the tail of the data follows a Pareto distribution, those observations form almost a straight
line. The leftmost point of a fitted line can thus be used as an estimate of the threshold x0 , the
scale parameter. All values starting from the point after the threshold may be modeled by a Pareto
distribution, but this point cannot be determined exactly. Furthermore, the slope of the fitted line
is in turn an estimate of 1/θ, the reciprocal of the shape parameter.
Figure 2 displays the Pareto quantile plot for the example data eusilc on the household level
with the largest observation replaced by an outlier. The plot is generated using the function
paretoQPlot(), which allows sample weights to be supplied via the argument w. In addition, the
threshold can be selected interactively by clicking on a data point. Information on the selected
threshold is then printed on the R console. When the interactive selection is terminated, which is
typically done by a secondary mouse click, the selected threshold is returned as an object of class
"paretoScale".
Another advantage of the Pareto quantile plot is also illustrated in Figure 2. Nonrepresentative
outliers such as the large income introduced into the example data in Section 3, i.e., extreme
observations in the upper tail that deviate from the Pareto model, are clearly visible.

4.3 Mean excess plot

The mean excess plot is another graphical method for inspecting the threshold for Pareto tail
modeling, but it does not provide information on the shape parameter. It is based on the excess
function
e(x0 ) := E(x − x0 |x > x0 ), x0 ≥ 0. (11)

R> meanExcessPlot(eusilcH$eqIncome, w = eusilcH$db090)

Figure 3: Mean excess plot for the example data eusilc on the household level with the largest
observation replaced by an outlier. The horizontal axis shows the threshold and the vertical axis the mean excess.

A detailed description for the case without sample weights can be found in Borkovec and Klüppelberg
(2000).
For the following definition of the mean excess plot, keep in mind that the observations are
sorted such that $x_1 \leq \ldots \leq x_n$. For each observation $x_i$, $i = 1, \ldots, \lfloor n - \sqrt{n} \rfloor$, the empirical excess
function $e_n$ is computed. In the case without sample weights, the expectation in Equation (11) is
replaced by the arithmetic mean, and the empirical excess function is given by
\[
e_n(x_i) := \frac{1}{n-i} \sum_{j=i+1}^{n} (x_j - x_i), \qquad i = 1, \ldots, \lfloor n - \sqrt{n} \rfloor. \tag{12}
\]
The values of the empirical excess function $e_n(x_i)$ are then plotted against the corresponding $x_i$,
$i = 1, \ldots, \lfloor n - \sqrt{n} \rfloor$. If sample weights are available in the data, the mean excess plot is simply
generalized by using the weighted mean for the empirical excess function:
\[
e_n(x_i) := \frac{1}{\sum_{j=i+1}^{n} w_j} \sum_{j=i+1}^{n} w_j (x_j - x_i), \qquad i = 1, \ldots, \lfloor n - \sqrt{n} \rfloor. \tag{13}
\]

If the tail of the data follows a Pareto distribution, those observations show a positive linear
trend. The leftmost point of a fitted line can thus be used as an estimate of the threshold x0 , the
scale parameter. As for the Pareto quantile plot, a disadvantage of the mean excess plot is that
the threshold cannot be determined exactly.
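
The empirical excess function in Equations (12) and (13) is straightforward to compute directly. The base R sketch below uses the hypothetical name mean_excess_sketch() and made-up toy data; it only illustrates the formulas and is not the code behind meanExcessPlot().

```r
# Weighted empirical excess function, Equation (13); with equal weights it
# reduces to Equation (12). Hypothetical helper, not part of laeken.
mean_excess_sketch <- function(x, w = rep(1, length(x))) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  n <- length(x)
  m <- floor(n - sqrt(n))          # largest index i that is evaluated
  e <- sapply(seq_len(m), function(i) {
    j <- (i + 1):n
    sum(w[j] * (x[j] - x[i])) / sum(w[j])
  })
  data.frame(threshold = x[seq_len(m)], excess = e)
}

# Toy data: the values 1, ..., 9 with equal weights.
me <- mean_excess_sketch(1:9)
```

For the toy data, the excess function is evaluated at the six smallest observations; the mean excess is 4.5 at the threshold 1 and 2 at the threshold 6.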
Figure 3 shows the mean excess plot for the example data eusilc on the household level with
the largest observation replaced by an outlier. The function meanExcessPlot() is thereby used to
produce the plot. Sample weights can be supplied via the argument w. Interactive selection of the
threshold works just like for the Pareto quantile plot. Again, the selected threshold is returned as
an object of class "paretoScale".

5 Estimation of the shape parameter

This section is focused on methods for estimating the shape parameter θ once the threshold x0 is
fixed. It should be noted that none of the original proposals takes sample weights into account.
Most estimators presented in the following were therefore adjusted for the case of sample weights.

5.1 Hill estimator

The maximum likelihood estimator for the shape parameter of the Pareto distribution was intro-
duced by Hill (1975) and is referred to as the Hill estimator. If the data do not contain sample
weights, it is given by
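\[
\hat{\theta}_{\mathrm{Hill}} = \frac{k}{\sum_{i=1}^{k} \log x_{n-k+i} - k \log x_{n-k}}. \tag{14}
\]
In the case of sample weights, the weighted Hill (wHill) estimator is given by generalizing Equation (14) to
\[
\hat{\theta}_{\mathrm{wHill}} = \frac{\sum_{i=1}^{k} w_{n-k+i}}{\sum_{i=1}^{k} w_{n-k+i} (\log x_{n-k+i} - \log x_{n-k})}. \tag{15}
\]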
Package laeken provides the function thetaHill() to compute the Hill estimator. Either the
number of observations in the tail must be specified via the argument k, or the threshold via the
argument x0. Furthermore, the argument w can be used to supply sample weights. In the following
example, the shape parameter is estimated using the number of observations in the tail (first command) and the
threshold (second command) as computed with van Kerm’s rule of thumb in Section 4.1.

R> thetaHill(eusilcH$eqIncome, k = ts$k, w = eusilcH$db090)

[1] 3.437979

R> thetaHill(eusilcH$eqIncome, x0 = ts$x0, w = eusilcH$db090)

[1] 3.437979
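
Equations (14) and (15) translate almost literally into base R. The following sketch, with the hypothetical name theta_hill_sketch() and made-up toy data, is meant to illustrate the formulas and is not the implementation of thetaHill().

```r
# (Weighted) Hill estimator following Equations (14) and (15).
# Hypothetical helper; with equal weights, (15) reduces to (14).
theta_hill_sketch <- function(x, k, w = NULL) {
  if (is.null(w)) w <- rep(1, length(x))
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  n <- length(x)
  tail_idx <- (n - k + 1):n        # indices n-k+1, ..., n of the tail
  sum(w[tail_idx]) /
    sum(w[tail_idx] * (log(x[tail_idx]) - log(x[n - k])))
}

# Toy data: threshold 1 followed by three tail observations.
x <- c(1, exp(0.25), exp(0.5), exp(0.75))
hill_unweighted <- theta_hill_sketch(x, k = 3)
hill_weighted <- theta_hill_sketch(x, k = 3, w = c(1, 1, 1, 2))
```

In the toy data the log excesses are 0.25, 0.5 and 0.75, so the unweighted estimate is 3/1.5 = 2; giving the largest observation a weight of 2 yields 4/2.25 = 16/9.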

5.2 Weighted maximum likelihood estimator

The weighted maximum likelihood (WML) estimator (Dupuis and Morgenthaler 2002, Dupuis and
Victoria-Feser 2006) falls into the class of M-estimators and is given by the solution θ̂ of
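\[
\sum_{i=1}^{k} \Psi(x_{n-k+i}, \theta) = 0 \tag{16}
\]
with
\[
\Psi(x, \theta) := u(x, \theta) \frac{\partial}{\partial \theta} \log f(x, \theta) = u(x, \theta) \left( \frac{1}{\theta} - \log \frac{x}{x_0} \right), \tag{17}
\]
where $u(x, \theta)$ is a weight function with values in [0, 1]. In the implementation in package laeken,
a Huber type weight function is used by default, as proposed by Dupuis and Victoria-Feser (2006).
Let the logarithms of the relative excesses be denoted by
\[
z_i := \log \frac{x_{n-k+i}}{x_{n-k}}, \qquad i = 1, \ldots, k. \tag{18}
\]
In the Pareto model, these can be predicted by
\[
\hat{z}_i := -\frac{1}{\theta} \log \frac{k+1-i}{k+1}, \qquad i = 1, \ldots, k. \tag{19}
\]
The variance of $z_i$ is given by
\[
\sigma_i^2 := \sum_{j=1}^{i} \frac{1}{\theta^2 (k-i+j)^2}, \qquad i = 1, \ldots, k. \tag{20}
\]
Using the standardized residuals
\[
r_i := \frac{z_i - \hat{z}_i}{\sigma_i}, \tag{21}
\]
the Huber type weight function with tuning constant $c$ is defined as
\[
u(x_{n-k+i}, \theta) := \begin{cases}
1, & \text{if } |r_i| \leq c, \\
\frac{c}{|r_i|}, & \text{if } |r_i| > c.
\end{cases} \tag{22}
\]
For this choice of weight function, the bias of $\hat{\theta}$ is approximated by
\[
\hat{B}(\hat{\theta}) = - \frac{\sum_{i=1}^{k} \left. \left( u_i \frac{\partial}{\partial \theta} \log f_i \right) \right|_{\hat{\theta}} \left( F_{\hat{\theta}}(x_{n-k+i}) - F_{\hat{\theta}}(x_{n-k+i-1}) \right)}{\sum_{i=1}^{k} \left. \left( \frac{\partial}{\partial \theta} u_i \frac{\partial}{\partial \theta} \log f_i + u_i \frac{\partial^2}{\partial \theta^2} \log f_i \right) \right|_{\hat{\theta}} \left( F_{\hat{\theta}}(x_{n-k+i}) - F_{\hat{\theta}}(x_{n-k+i-1}) \right)}, \tag{23}
\]
where $u_i := u(x_{n-k+i}, \theta)$ and $f_i := f(x_{n-k+i}, \theta)$. This term is used to obtain a bias-corrected
estimator
\[
\tilde{\theta} := \hat{\theta} - \hat{B}(\hat{\theta}). \tag{24}
\]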
For details and proofs of the above statements, as well as for information on a probability-based
weight function u(x, θ), the reader is referred to Dupuis and Morgenthaler (2002) and Dupuis and
Victoria-Feser (2006). However, note that the WML estimator does not consider sample weights. An
adjustment of the estimator to take sample weights into account is currently not available due to
its complexity. For sampling designs that lead to equal sample weights, the WML estimator may
still be useful, though.
The function thetaWML() is available in laeken to compute the WML estimator. Again, either
the argument k or x0 needs to be used to specify the number of observations in the tail or the
threshold. Since the sample weights in the example data are not equal, the following example is
only included to demonstrate the use of the function.

R> thetaWML(eusilcH$eqIncome, k = ts$k)

[1] 4.226204

R> thetaWML(eusilcH$eqIncome, x0 = ts$x0)

[1] 4.226204
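
To illustrate how the Huber type weight function in Equations (18) to (22) downweights an extreme observation, the base R sketch below evaluates the weights for a fixed value of θ. This is a simplification: the WML estimator solves Equation (16) for θ, whereas here θ is treated as known. The function name huber_weights_sketch(), the argument name tune, its default value of 1.345 and the toy data are all assumptions for illustration and are not taken from thetaWML().

```r
# Huber type weights for the k tail observations at a fixed theta.
# Hypothetical helper; 'tune' plays the role of the tuning constant c.
huber_weights_sketch <- function(x, k, theta, tune = 1.345) {
  x <- sort(x)
  n <- length(x)
  i <- seq_len(k)
  z <- log(x[n - k + i] / x[n - k])                  # Equation (18)
  z_hat <- -log((k + 1 - i) / (k + 1)) / theta       # Equation (19)
  sigma <- sqrt(sapply(i, function(m) {              # Equation (20)
    sum(1 / (theta^2 * (k - m + seq_len(m))^2))
  }))
  r <- (z - z_hat) / sigma                           # Equation (21)
  ifelse(abs(r) <= tune, 1, tune / abs(r))           # Equation (22)
}

k <- 10
theta <- 2
# Idealized tail located exactly at the predicted values, threshold 1.
z_pred <- -log((k + 1 - seq_len(k)) / (k + 1)) / theta
x_clean <- c(0.5, 1, exp(z_pred))
# Same data with the largest observation replaced by a large outlier.
x_out <- x_clean
x_out[length(x_out)] <- 1e6

u_clean <- huber_weights_sketch(x_clean, k, theta)
u_out <- huber_weights_sketch(x_out, k, theta)
```

In this toy setting all observations of the clean sample receive weight 1, while the outlier in the contaminated sample receives a weight well below 1 and the other tail observations keep weight 1.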

5.3 Integrated squared error estimator

For the integrated squared error (ISE) estimator (Vandewalle et al. 2007), the Pareto distribution
is modeled in terms of the relative excesses
\[
y_i := \frac{x_{n-k+i}}{x_{n-k}}, \qquad i = 1, \ldots, k. \tag{25}
\]
The density function of the Pareto distribution for the relative excesses is approximated by
\[
f_\theta(y) = \theta y^{-(1+\theta)}. \tag{26}
\]
The ISE estimator is then given by minimizing the integrated squared error criterion (Terrell 1990):
\[
\hat{\theta} = \arg\min_\theta \left[ \int f_\theta^2(y) \, dy - 2 E(f_\theta(Y)) \right]. \tag{27}
\]

If there are no sample weights in the data, the mean is used as an unbiased estimator of $E(f_\theta(Y))$
in order to obtain the ISE estimate
\[
\hat{\theta}_{\mathrm{ISE}} = \arg\min_\theta \left[ \int f_\theta^2(y) \, dy - \frac{2}{k} \sum_{i=1}^{k} f_\theta(y_i) \right]. \tag{28}
\]

See Vandewalle et al. (2007) for more information on the ISE estimator for the case without sample
weights.
If sample weights are available in the data, the mean in Equation (28) is simply replaced by a
weighted mean to obtain the weighted integrated squared error (wISE) estimator:
\[
\hat{\theta}_{\mathrm{wISE}} = \arg\min_\theta \left[ \int f_\theta^2(y) \, dy - \frac{2}{\sum_{i=1}^{k} w_{n-k+i}} \sum_{i=1}^{k} w_{n-k+i} f_\theta(y_i) \right]. \tag{29}
\]

With package laeken, the ISE estimator can be computed using the function thetaISE(). The
arguments k and x0 are available to specify either the number of observations in the tail or the
threshold, and sample weights can be supplied via the argument w.
R> thetaISE(eusilcH$eqIncome, k = ts$k, w = eusilcH$db090)
[1] 3.993801
R> thetaISE(eusilcH$eqIncome, x0 = ts$x0, w = eusilcH$db090)
[1] 3.993801
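
The criterion in Equations (28) and (29) can be minimized numerically in base R. The sketch below relies on the closed form ∫ f_θ²(y) dy = θ²/(2θ + 1) for y ≥ 1, which follows from Equation (26) by direct integration. The function name theta_ise_sketch(), the search interval and the idealized toy data are assumptions for illustration; thetaISE() may be implemented differently.

```r
# (Weighted) ISE estimator following Equations (28) and (29).
# Hypothetical helper; 'interval' is an assumed search range for theta.
theta_ise_sketch <- function(x, k, w = NULL, interval = c(0.01, 20)) {
  if (is.null(w)) w <- rep(1, length(x))
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  n <- length(x)
  tail_idx <- (n - k + 1):n
  y <- x[tail_idx] / x[n - k]      # relative excesses, Equation (25)
  wt <- w[tail_idx]
  criterion <- function(theta) {
    f <- theta * y^(-(1 + theta))  # density, Equation (26)
    theta^2 / (2 * theta + 1) - 2 * sum(wt * f) / sum(wt)
  }
  optimize(criterion, interval = interval)$minimum
}

# Idealized tail located at the quantiles of a Pareto distribution with
# shape parameter 2 and threshold 1 (made-up data).
k <- 500
theta_true <- 2
p <- seq_len(k) / (k + 1)
x <- c(0.5, 1, (1 - p)^(-1 / theta_true))
theta_ise <- theta_ise_sketch(x, k = k)
```

For this idealized sample the estimate should be close to the shape parameter 2 that was used to construct the data.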

5.4 Partial density component estimator

For the partial density component (PDC) estimator (Vandewalle et al. 2007), the integrated
squared error criterion is minimized using an incomplete density mixture model $u f_\theta$. If the data do not
contain sample weights, the PDC estimator is thus given by
\[
\hat{\theta}_{\mathrm{PDC}} = \arg\min_\theta \left[ u^2 \int f_\theta^2(y) \, dy - \frac{2u}{k} \sum_{i=1}^{k} f_\theta(y_i) \right]. \tag{30}
\]
The parameter $u$ can be interpreted as a measure of the uncontaminated part of the sample and
is estimated by
\[
\hat{u} = \frac{\frac{1}{k} \sum_{i=1}^{k} f_{\hat{\theta}}(y_i)}{\int f_{\hat{\theta}}^2(y) \, dy}. \tag{31}
\]
See Vandewalle et al. (2007) and references therein for more information on the PDC estimator for
the case without sample weights.
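Taking sample weights into account, the weighted partial density component (wPDC) estimator
is obtained by generalizing Equations (30) and (31) to
\[
\hat{\theta}_{\mathrm{wPDC}} = \arg\min_\theta \left[ u^2 \int f_\theta^2(y) \, dy - \frac{2u}{\sum_{i=1}^{k} w_{n-k+i}} \sum_{i=1}^{k} w_{n-k+i} f_\theta(y_i) \right], \tag{32}
\]
\[
\hat{u} = \frac{\frac{1}{\sum_{i=1}^{k} w_{n-k+i}} \sum_{i=1}^{k} w_{n-k+i} f_{\hat{\theta}}(y_i)}{\int f_{\hat{\theta}}^2(y) \, dy}. \tag{33}
\]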
The function thetaPDC() is implemented in package laeken to compute the PDC estimator. As
for the other estimators, it is necessary to specify either the number of observations in the tail via
the argument k, or the threshold via the argument x0. Sample weights can be supplied using the
argument w.
R> thetaPDC(eusilcH$eqIncome, k = ts$k, w = eusilcH$db090)
[1] 4.132596
R> thetaPDC(eusilcH$eqIncome, x0 = ts$x0, w = eusilcH$db090)
[1] 4.132596
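
One possible way to carry out the minimization in Equations (32) and (33) is to plug the expression for û into the criterion, which leaves a function of θ only, namely minus the squared weighted mean of f_θ(y_i) divided by ∫ f_θ²(y) dy. The base R sketch below follows this route and again uses ∫ f_θ²(y) dy = θ²/(2θ + 1). It is an illustration under these assumptions, with the hypothetical name theta_pdc_sketch() and made-up data, and does not claim to reproduce how thetaPDC() works internally.

```r
# (Weighted) PDC estimator via the profiled criterion; hypothetical helper.
theta_pdc_sketch <- function(x, k, w = NULL, interval = c(0.01, 20)) {
  if (is.null(w)) w <- rep(1, length(x))
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  n <- length(x)
  tail_idx <- (n - k + 1):n
  y <- x[tail_idx] / x[n - k]      # relative excesses, Equation (25)
  wt <- w[tail_idx]
  int_f2 <- function(theta) theta^2 / (2 * theta + 1)
  mean_f <- function(theta) sum(wt * theta * y^(-(1 + theta))) / sum(wt)
  # Criterion (32) with u replaced by its estimate (33).
  profile <- function(theta) -mean_f(theta)^2 / int_f2(theta)
  theta_hat <- optimize(profile, interval = interval)$minimum
  list(theta = theta_hat, u = mean_f(theta_hat) / int_f2(theta_hat))
}

# Idealized uncontaminated tail with shape parameter 2 and threshold 1.
k <- 500
theta_true <- 2
p <- seq_len(k) / (k + 1)
x <- c(0.5, 1, (1 - p)^(-1 / theta_true))
pdc <- theta_pdc_sketch(x, k = k)
```

For this uncontaminated toy sample, the estimate of θ should be close to 2 and the estimate of u close to 1.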

6 Estimation of the indicators using Pareto tail modeling
Three approaches based on Pareto tail modeling for reducing the influence of outliers on the social
exclusion indicators are implemented in the R package laeken:

Calibration for nonrepresentative outliers (CN): Values larger than a certain quantile of the fitted
distribution are declared as nonrepresentative outliers. Since these are considered to be
unique to the population data, the sample weights of the corresponding observations are set
to 1 and the weights of the remaining observations are adjusted accordingly by calibration.

Replacement of nonrepresentative outliers (RN): Values larger than a certain quantile of the
fitted distribution are declared as nonrepresentative outliers. Only these nonrepresentative
outliers are replaced by values drawn from the fitted distribution, thereby preserving the
order of the original values.

Replacement of the tail (RT): All values above the threshold are replaced by values drawn from
the fitted distribution. The order of the original values is preserved.

An evaluation of the RT approach by means of a simulation study can be found in Alfons et al.
(2010).
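
The idea behind the RT approach can be sketched in a few lines of base R: draw values from the fitted Pareto distribution by inverting Equation (4) and assign them to the tail observations according to their original ranks. The function name replace_tail_sketch(), the toy incomes and the parameter values are assumptions for illustration; the sketch ignores sample weights and household structure and is not the implementation of replaceTail().

```r
# Replace all values above the threshold x0 by draws from a Pareto
# distribution with parameters x0 and theta, preserving the original order.
# Hypothetical helper, not part of laeken.
replace_tail_sketch <- function(x, x0, theta) {
  in_tail <- x > x0
  k <- sum(in_tail)
  # Inverse transform sampling: if U is uniform on (0, 1), then
  # x0 * U^(-1/theta) follows the Pareto distribution in Equation (4).
  draws <- x0 * runif(k)^(-1 / theta)
  # The smallest draw replaces the smallest tail value, and so on.
  x[in_tail] <- sort(draws)[rank(x[in_tail], ties.method = "first")]
  x
}

set.seed(1234)
income <- c(10, 20, 30, 40, 50, 1e7, 60, 80)   # made-up incomes with one outlier
income_rt <- replace_tail_sketch(income, x0 = 45, theta = 3)
```

Values at or below the threshold are left unchanged, the ordering of the observations is preserved, and the extreme value no longer dominates the tail.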
Keep in mind that the largest observation in the example data eusilc was replaced by a large
outlier in Section 3. With the following command, the Gini coefficient is estimated according to
the Eurostat definition to show that even a single outlier can completely distort the results for the
standard estimation (see Section 2.2 for the original value).
R> gini("eqIncome", weights = "rb050", data = eusilc)

Value:
[1] 29.24333

For Pareto tail modeling, the function paretoTail() is implemented in laeken. It returns an
object of class "paretoTail", which contains all the necessary information for further analysis
using the three approaches described above. Note that the household IDs are supplied via the
argument groups such that the Pareto distribution is fitted on the household level rather than the
individual level. In addition, the PDC is used by default to estimate the shape parameter. Other
estimators can be specified via the method argument.
R> fit <- paretoTail(eusilc$eqIncome, k = ts$k,
+ w = eusilc$db090, groups = eusilc$db030)
The function reweightOut() is available for semiparametric estimation with the CN approach.
It returns a vector of the recalibrated weights. In this example, regional information is used as
auxiliary variables for calibration. The function calibVars() thereby transforms a factor into a
matrix of binary variables, as required by the calibration function calibWeights(), which is called
internally. These recalibrated weights are then simply used to estimate the Gini coefficient with
function gini().
R> w <- reweightOut(fit, calibVars(eusilc$db040))
R> gini(eusilc$eqIncome, w)

Value:
[1] 26.45973

For the RN approach, the function replaceOut() is implemented. Since values are drawn
from the fitted distribution to replace the observations flagged as outliers, the seed of the random
number generator is set first for reproducibility of the results. The returned vector of incomes is
then supplied to gini() to estimate the Gini coefficient.
R> set.seed(1234)
R> eqIncome <- replaceOut(fit)
R> gini(eqIncome, weights = eusilc$rb050)

Value:
[1] 26.46924

Similarly, the function replaceTail() is available for the RT approach. Again, the seed of the
random number generator is set beforehand.

R> set.seed(1234)
R> eqIncome <- replaceTail(fit)
R> gini(eqIncome, weights = eusilc$rb050)

Value:
[1] 26.64921

It should be noted that replaceTail() can also be used for the RN approach by setting the
argument all to FALSE. In fact, replaceOut(x, ...) is a simple wrapper for replaceTail(x,
all = FALSE, ...).
In any case, the estimates for the semiparametric approaches based on Pareto tail modeling are
very close to the original value before the outlier was introduced (see Section 2.2), whereas
the standard estimation is corrupted by the outlier. Furthermore, the estimation of other indi-
cators such as the quintile share ratio (see Section 2.1) using the semiparametric approaches is
straightforward and hence not shown here.

7 Conclusions
This vignette shows the functionality of package laeken for robust semiparametric estimation of
social exclusion indicators based on Pareto tail modeling. Most notably, it demonstrates that the
functions are easy to use and that the implementation follows an object-oriented design. While
the focus of the paper lies on the use of the package, a mathematical description of the methods is
given as well.
Furthermore, it is shown that the standard estimation of the inequality indicators can be cor-
rupted by a single outlier, thus underlining the need for robust alternatives. Three approaches
for robust semiparametric estimation based on Pareto tail modeling are thereby implemented such
that the corresponding functions share a common interface for ease of use.

Acknowledgments
This work was partly funded by the European Union (represented by the European Commission)
within the 7th framework programme for research (Theme 8, Socio-Economic Sciences and Humanities,
Project AMELI (Advanced Methodology for European Laeken Indicators), Grant Agreement
No. 217322). Visit http://ameli.surveystatistics.net for more information on the project.

References
A. Alfons and S. Kraft. simPopulation: Simulation of Synthetic Populations for Surveys Based on
Sample Data, 2012. URL https://CRAN.R-project.org/package=simPopulation. R package
version 0.4.0.

A. Alfons and M. Templ. Estimation of social exclusion indicators from complex surveys: The R
package laeken. Journal of Statistical Software, 54(15):1–25, 2013. doi: 10.18637/jss.v054.i15.
A. Alfons, M. Templ, P. Filzmoser, and J. Holzer. A comparison of robust methods for Pareto tail
modeling in the case of Laeken indicators. In C. Borgelt, G. González-Rodríguez, W. Trutschnig,
M.A. Lubiano, M.A. Gil, P. Grzegorzewski, and O. Hryniewicz, editors, Combining Soft Com-
puting and Statistical Methods in Data Analysis, volume 77 of Advances in Intelligent and Soft
Computing, pages 17–24. Springer-Verlag, Heidelberg, 2010. ISBN 978-3-642-14745-6.

A. Alfons, S. Kraft, M. Templ, and P. Filzmoser. Simulation of close-to-reality population data
for household surveys with application to EU-SILC. Statistical Methods & Applications, 20(3):
383–407, 2011.
A. Alfons, J. Holzer, and M. Templ. laeken: Estimation of Indicators on Social Exclusion and
Poverty, 2013. URL https://CRAN.R-project.org/package=laeken. R package version 0.4.4.
J. Beirlant, P. Vynckier, and J.L. Teugels. Tail index estimation, Pareto quantile plots, and regression
diagnostics. Journal of the American Statistical Association, 91(436):1659–1667, 1996a.

J. Beirlant, P. Vynckier, and J.L. Teugels. Excess functions and estimation of the extreme-value
index. Bernoulli, 2(4):293–318, 1996b.
M. Borkovec and C. Klüppelberg. Extremwerttheorie für Finanzzeitreihen – ein unverzichtbares
Werkzeug im Risikomanagement. In L. Johanning and B. Rudolph, editors, Handbuch Risiko-
management, pages 219–241. Uhlenbruch, Bad Soden, 2000. ISBN 3933207150.
J. Danielsson, L. de Haan, L. Peng, and C.G. de Vries. Using a bootstrap method to choose the
sample fraction in tail index estimation. Journal of Multivariate Analysis, 76(2):226–248, 2001.
D.J. Dupuis and S. Morgenthaler. Robust weighted likelihood estimators with an application to
bivariate extreme value problems. The Canadian Journal of Statistics, 30(1):17–36, 2002.

D.J. Dupuis and M.-P. Victoria-Feser. A robust prediction error criterion for Pareto modelling of
upper tails. The Canadian Journal of Statistics, 34(4):639–658, 2006.
Eurostat. Common cross-sectional EU indicators based on EU-SILC; the gender pay gap. EU-SILC
131-rev/04, Unit D-2: Living conditions and social protection, Directorate D: Single Market,
Employment and Social statistics, Eurostat, Luxembourg, 2004.
Eurostat. Algorithms to compute social inclusion indicators based on EU-SILC and adopted under
the Open Method of Coordination (OMC). Doc. LC-ILC/39/09/EN-rev.1, Unit F-3: Living con-
ditions and social protection, Directorate F: Social and information society statistics, Eurostat,
Luxembourg, 2009.

B.M. Hill. A simple general approach to inference about the tail of a distribution. The Annals of
Statistics, 3(5):1163–1174, 1975.
J. Holzer. Robust methods for the estimation of selected Laeken indicators. Master’s thesis,
Department of Statistics and Probability Theory, Vienna University of Technology, Vienna,
Austria, 2009.

B. Hulliger and T. Schoch. Robustification of the quintile share ratio. New Techniques and
Technologies for Statistics, Brussels, 2009.
C. Kleiber and S. Kotz. Statistical Size Distributions in Economics and Actuarial Sciences. John
Wiley & Sons, Hoboken, New Jersey, 2003. ISBN 0-471-15064-9.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria, 2013. URL https://www.R-project.org.
ISBN 3-900051-07-0.
M. Templ and A. Alfons. Standard methods for point estimation of social inclusion indicators using
the R package laeken. Research Report CS-2011-1, Department of Statistics and Probability
Theory, Vienna University of Technology, 2011a.
M. Templ and A. Alfons. Variance estimation of social inclusion indicators using the R package
laeken. Research Report CS-2011-3, Department of Statistics and Probability Theory, Vienna
University of Technology, 2011b.

G. Terrell. Linear density estimates. In Proceedings of the Statistical Computing Section, pages
297–302. American Statistical Association, 1990.

P. Van Kerm. Extreme incomes and the estimation of poverty and inequality indicators from
EU-SILC. IRISS Working Paper Series 2007-01, CEPS/INSTEAD, 2007.

B. Vandewalle, J. Beirlant, A. Christmann, and M. Hubert. A robust estimator for the tail index
of Pareto-type distributions. Computational Statistics & Data Analysis, 51(12):6252–6268, 2007.
