Review of Kernel Density Estimation
Applications to Econometrics
arXiv:1212.2812v1 [stat.ME] 12 Dec 2012
December, 2012
Abstract

Kernel density estimation is widely used to model the probabilistic or stochastic structure of a data set. This comprehensive review summarizes the most important theoretical aspects of kernel density estimation and provides an extensive description of classical and modern data analytic methods to compute the smoothing parameter. Throughout the text, several references can be found to the most up-to-date and cutting-edge research approaches in this area, while econometric data sets are used to illustrate the methods. The review closes with SiZer, the procedure of Chaudhuri and Marron (2000), whose objective is to analyze the visible features representing important underlying structures.
1 Introduction
The field of econometrics focuses on methods that address the probabilistic or stochastic
phenomena involving economic data. Modeling the underlying probabilistic structure of the
data, i.e., the uncertainty of the process, is a crucial task, for it can be used to describe
the mechanism from which the data was generated. Thus, econometricians have widely
explored density estimation, both the parametric and nonparametric approaches, to identify
these structures and then make inferences about the unknown "true models". A parametric
model assumes that the density is known up to a finite number of parameters, while a
nonparametric model allows great flexibility in the possible form, usually assuming that it
belongs to some infinite collection of curves (differentiable with square integrable second
derivatives for example). The most used approach is kernel smoothing, which dates back to
Rosenblatt (1956) and Parzen (1962). The aim of this paper is to review the most important
aspects of kernel density estimation, both traditional approaches and modern ideas, emphasizing that a well estimated density can be extremely useful for applied purposes. An interesting comprehensive review of kernel smoothing and its applications can be found in Bierens (1987). Silverman (1986) and Scott (1992) discuss kernel density estimation thoroughly, giving details about assumptions on the kernel weight, properties of the estimator such as bias and variance, and discussing how to choose the smoothness of the estimate. The choice of
the smoothing parameter is a crucial issue in nonparametric estimation, and will be discussed
in detail in Section 4.
The remainder of this paper is as follows. In Section 2 we describe the most basic and
intuitive method of density estimation: the histogram. Then, in Section 3 we introduce kernel
density estimation and the properties of estimators of this type, followed by an overview of
old and new bandwidth selection approaches in Section 4. Finally, SiZer, a modern idea for assessing features that represent important underlying structures through different levels of smoothing, is presented in Section 5.
2 The Histogram
The grouping of data in the form of a frequency histogram is a classical methodology. Basically, the histogram is a step function defined by bin heights, which equal the proportion of observations contained in each bin divided by the bin width. The construction
of the histogram is very intuitive, and to formally describe this construction, we will now
introduce some notation. Suppose we observe random variables X1 , . . . , Xn i.i.d. from an unknown density f , absolutely continuous with respect to Lebesgue measure on R. Assume that x1 , . . . , xn are the data points observed from a realization of these random variables, and partition the real line into bins Ij of width h. Then
$$P(X \in I_j) = \int_{I_j} f(u)\,du = f(\xi)h, \qquad (1)$$
where ξ ∈ Ij and the last equality follows from the mean value theorem for continuous bounded functions. Intuitively, we can approximate the probability of X falling into the bin Ij by the proportion of observations in it:
$$P(X \in I_j) \approx \frac{\#\{x_i \in I_j\}}{n}. \qquad (2)$$
Using the approximation in (2) and the equation in (1), the density function f (x) can be estimated by
$$\hat f_h(x) = \frac{\#\{x_i \in I_j\}}{nh} = \frac{1}{nh}\sum_{i=1}^{n} 1(x_i \in I_j) \quad \text{for } x \in I_j, \qquad (3)$$
where
$$1(x_i \in I_j) = \begin{cases} 1 & \text{if } x_i \in I_j \\ 0 & \text{otherwise.} \end{cases}$$
A small bandwidth leads to a jagged estimate, while larger bandwidths tend to produce oversmoothed histogram estimates (see Hardle (1991)). Figure 1 shows an example of two histograms of the same
randomly generated data: the histogram on the left hand side was estimated with a small
bandwidth and consequently has many bins, while the histogram on the right hand side
was computed with a large bandwidth, producing a smaller number of bins. The choice of
the bandwidth is discussed in more detail in Section 4. Note that in practice, the choice of
k will determine h or vice versa (a rule of thumb for the choice of k is the Sturges’ rule:
k = 1 + log2 n).
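As a concrete illustration of (3) and Sturges' rule, the histogram estimator can be sketched in a few lines of Python; the function names and the simulated data are illustrative, not from the paper.

```python
import math
import random

def histogram_density(data, x, h, origin=0.0):
    """Histogram estimate (3): count of points in the bin containing x, over n*h."""
    j = math.floor((x - origin) / h)            # index of the bin I_j containing x
    lo, hi = origin + j * h, origin + (j + 1) * h
    count = sum(1 for xi in data if lo <= xi < hi)
    return count / (len(data) * h)

def sturges_bins(n):
    """Sturges' rule k = 1 + log2(n), rounded up to an integer number of bins."""
    return 1 + math.ceil(math.log2(n))

random.seed(0)
sample = [random.gauss(0, 1) for _ in range(200)]
k = sturges_bins(len(sample))                   # 9 bins for n = 200
h = (max(sample) - min(sample)) / k             # bin width implied by k
print(k, round(histogram_density(sample, 0.0, h, origin=min(sample)), 3))
```

Note how the choice of `origin` shifts every bin edge, which is precisely the bin edge problem discussed below.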
When building a histogram, not only the bandwidth needs to be chosen, but also the
starting point of each bin edge. These choices can produce different impressions of the shape,
and hence different estimates. The bin edge problem is a disadvantage of the histogram not
shared by other estimators, such as the kernel density estimator. Another disadvantage is
that the histogram estimators are usually not smooth, displaying bumps that may have been produced by the binning rather than by the underlying density.

3 Kernel Density Estimation

In this section we present the kernel density method. It is an approach that is rooted in the histogram methodology. The basic idea is to
[Figure 1: Histogram estimate with small bandwidth (left) and large bandwidth (right).]
estimate the density function at a point x using neighboring observations. However, instead of building up the estimate according to bin edges, the naive kernel method (adaptively) uses each point of estimation x as the center of a bin of width 2h. To express it more formally, define the weight function
$$K(x) = \tfrac{1}{2}\,1(|x| < 1), \qquad (4)$$
called the kernel weight. Then, the kernel estimate (Rosenblatt (1956)) of f (x) is defined as
$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right). \qquad (5)$$
This kernel density estimator is specifically called naive because the kernel weight used is simply a bin of width 2h centered at x. See Silverman (1986) for a deeper discussion about this estimator.

Note that the estimator in (5) is an additive function of the kernel weight, inheriting from it properties such as continuity and differentiability. Hence, with the naive weight (4), the estimate is not continuous and has zero derivatives everywhere except at the jump points Xi ± h. Moreover, even with a good
choice of h, estimators that use weights as in (4) most often do not produce reasonable estimates of smooth densities. This is because the discontinuity of the kernel weight gives the estimated function a ragged form, sometimes creating misleading impressions due to several bumps and constant estimates where few data points are observed. As an illustration, we consider the CEO compensation data in 2012, containing the 200 highest paid chief executives in the U.S. This data set can be obtained from the Forbes website. For better visualization of the plot, we excluded the number 1 in the ranking, with an income of US$131.19 million.
[Figure 2: Estimated density of CEO compensation using the naive (solid line) and the Epanechnikov (dashed line) kernels.]
Figure 2 shows two density estimators: the solid line represents the naive estimator, while
the dashed line represents a more adequate kernel type, called Epanechnikov, which will be
described later. The density estimated by the naive kernel appears to have several small
bumps, which are probably due to noise, not a characteristic of the true underlying density.
On the other hand, the Epanechnikov kernel is smooth, avoiding this issue.
A usual choice for the kernel weight K is a function that satisfies $\int_{-\infty}^{\infty} K(x)\,dx = 1$. If, in addition, K is nonnegative and symmetric about 0, then the estimated density f̂(x) is guaranteed to be a density. Note that the weight in (4) is an example of such a choice. Suitable weight functions help overcome problems with bumps and discontinuity of the estimated density. For example, if K is a Gaussian density, the estimated density function f̂ will be smooth and have derivatives of all orders. Table 1 presents some of the most used kernel functions and Figure 3 displays their shapes.
Table 1: Kernel weights K(x).

  Uniform:       K(x) = (1/2) 1(|x| < 1)
  Gaussian:      K(x) = (1/√(2π)) e^{−x²/2}
  Epanechnikov:  K(x) = (3/4)(1 − x²) 1(|x| ≤ 1)
  Biweight:      K(x) = (15/16)(1 − x²)² 1(|x| ≤ 1)
  Triweight:     K(x) = (35/32)(1 − x²)³ 1(|x| ≤ 1)

[Figure 3: Kernel weight functions K(x) for the uniform, Epanechnikov, Gaussian and biweight kernels.]
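The estimator (5) with the weights of Table 1 can be sketched as follows; this is an illustrative Python implementation, with names and data that are not from the paper.

```python
import math

# Kernel weights from Table 1
KERNELS = {
    "uniform":      lambda u: 0.5 if abs(u) < 1 else 0.0,
    "gaussian":     lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi),
    "epanechnikov": lambda u: 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0,
    "biweight":     lambda u: 15 / 16 * (1 - u * u) ** 2 if abs(u) <= 1 else 0.0,
    "triweight":    lambda u: 35 / 32 * (1 - u * u) ** 3 if abs(u) <= 1 else 0.0,
}

def kde(data, x, h, kernel="epanechnikov"):
    """Kernel density estimate (5) at the point x."""
    K = KERNELS[kernel]
    return sum(K((x - xi) / h) for xi in data) / (len(data) * h)

data = [1.0, 1.2, 1.9, 2.4, 2.5]
print(round(kde(data, 2.0, 0.8), 3))
```

Swapping the `kernel` argument while holding h fixed reproduces the qualitative contrast of Figure 2: the uniform weight yields a step function, the others yield smooth curves.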
One of the drawbacks of kernel density estimation is that it is always biased, particularly near the boundaries (when the data is bounded). However, the main drawback of this
approach happens when the underlying density has long tails. In this case, if the bandwidth
is small, spurious noise appears in the tail of the estimates, or if the bandwidth is large
enough to deal with the tails, important features of the main part in the distribution may be
lost due to the over-smoothing. To avoid this problem, adaptive bandwidth methods have
been proposed, where the size of the bandwidth depends on the location of the estimation.
See Section 4 for more details on bandwidth selection.
In this section, some of the theoretical properties of the kernel density estimator are derived, supporting its reliable practical use. Assume we have X1 , . . . , Xn i.i.d. random variables from a density f and let K(·) be a kernel weight function such that the following conditions hold:
$$\int K(u)\,du = 1, \qquad \int uK(u)\,du = 0, \qquad \text{and} \qquad \int u^2 K(u)\,du = \mu_2(K) > 0.$$
It is easy to see that f̂ is an asymptotically unbiased estimator of the density, since $E(\hat f(x)) \to f(x)\int K(y)\,dy = f(x)$ when h → 0. It is important to note that the bandwidth strongly depends on the sample size: as the sample size grows, the bandwidth tends to shrink.
Now, assume also that the second derivative f ′′ of the underlying density f is absolutely continuous and square integrable. Then, expanding f (x − yh) in a Taylor series about x, we have
$$f(x - yh) = f(x) - hyf'(x) + \frac{1}{2}h^2y^2 f''(x) + o(h^2).$$
Then, using the conditions imposed on the kernel, the bias of the density estimator is
$$\mathrm{Bias}(\hat f(x)) = \frac{h^2}{2} f''(x)\mu_2(K) + o(h^2). \qquad (8)$$
The variance of the estimated function can be calculated using steps similar to those in (6):
$$\begin{aligned}
\mathrm{Var}(\hat f(x)) &= \frac{1}{nh}\int K^2(y) f(x - hy)\,dy - \frac{1}{n} E(\hat f(x))^2 \\
&= \frac{1}{nh}\int K^2(y)\{f(x) + o(1)\}\,dy - \frac{1}{n}\{f(x) + o(1)\}^2 \\
&= \frac{1}{nh}\int K^2(y)\,dy\, f(x) + o\!\left(\frac{1}{nh}\right) \\
&= \frac{1}{nh} R(K) f(x) + o\!\left(\frac{1}{nh}\right),
\end{aligned}$$
where $R(g) = \int g^2(y)\,dy$ for any square integrable function g. From the definition of Mean Square Error, $\mathrm{MSE}(\hat f(x)) = \mathrm{Bias}^2(\hat f(x)) + \mathrm{Var}(\hat f(x))$.
It is straightforward to see that, in order for the kernel density estimation to be consistent
for the underlying density, two conditions on the bandwidth are needed as n → ∞: h →
0 and nh → ∞. When these two conditions hold, MSE(fˆ(x)) → 0, and we have consistency.
Moreover, the trade-off between bias and variance is controlled by the MSE: decreasing the bias leads to a very noisy (large variance) estimate, while decreasing the variance yields oversmoothed estimates (large bias). As has already been pointed out, the smoothness of the
estimate depends on the smoothing parameter h, which is chosen as a function of n. For the
optimal asymptotic choice of h, a closed form expression can be obtained from minimizing
the Mean Integrated Square Error (MISE). Integrating the MSE over the entire line, we find (Parzen (1962))
$$\mathrm{MISE}(\hat f) \approx \frac{R(K)}{nh} + \frac{h^4}{4}\mu_2^2(K)R(f''),$$
which is minimized at
$$h_{\mathrm{MISE}} = \left[\frac{R(K)}{\mu_2^2(K)R(f'')}\right]^{1/5} n^{-1/5}. \qquad (10)$$
Using this optimal bandwidth, we have
$$\inf_{h>0} \mathrm{MISE}(\hat f) \approx \frac{5}{4}\left[\mu_2^2(K)R^4(K)R(f'')\right]^{1/5} n^{-4/5}. \qquad (11)$$
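As a sanity check, the closed-form optimal bandwidth and the value of (11) can be verified numerically. The constants below are the standard values R(K) = 1/(2√π), µ2(K) = 1 for a Gaussian kernel and R(f′′) = 3/(8√π) for a standard normal density; the sketch is illustrative, not from the paper.

```python
import math

n = 400
RK = 1 / (2 * math.sqrt(math.pi))          # R(K) for the Gaussian kernel
mu2 = 1.0                                  # mu2(K) for the Gaussian kernel
Rf2 = 3 / (8 * math.sqrt(math.pi))         # R(f'') for the standard normal density

def amise(h):
    """Asymptotic MISE: variance term R(K)/(nh) plus squared-bias term."""
    return RK / (n * h) + (h ** 4 / 4) * mu2 ** 2 * Rf2

# Closed-form minimizer: h = [R(K) / (mu2^2 R(f'') n)]^{1/5}
h_star = (RK / (mu2 ** 2 * Rf2 * n)) ** 0.2

# A grid search agrees with the closed form
grid = [0.01 + 0.0001 * i for i in range(10000)]
h_grid = min(grid, key=amise)
print(round(h_star, 3), round(h_grid, 3))

# And the minimum value matches (11): (5/4)[mu2^2 R(K)^4 R(f'')]^{1/5} n^{-4/5}
mise_min = 1.25 * (mu2 ** 2 * RK ** 4 * Rf2) ** 0.2 * n ** (-0.8)
print(abs(amise(h_star) - mise_min) < 1e-12)
```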
A natural question is how to choose the kernel function K to minimize (11). Interestingly, if we restrict the choice to a proper density function, the minimizer is the Epanechnikov kernel.

The problem with using the optimal bandwidth is that it depends on the unknown quantity f ′′ , which measures the speed of fluctuations in the density f , i.e., the roughness of f . Many methods have been proposed to select a bandwidth that leads to good performance in practice.
The asymptotic convergence of the kernel density estimator has been widely explored.
Bickel and Rosenblatt (1973) showed that for sufficiently smooth f and K, $\sup_x |\hat f(x) - f(x)|/\sqrt{f(x)}$, when normalized properly, has an extreme value limit distribution. The strong uniform convergence of f̂ has been studied extensively when the observations are independent or weakly dependent. For instance, a law of the logarithm has been established for the maximal deviation between a kernel density estimator and the true underlying density function, Gine and Guillou (2002) find rates for the strong uniform consistency of kernel density estimators, and Einmahl and Mason (2005) introduce a general uniform-in-bandwidth framework. Results on strong uniform convergence under different conditions can be found in several other papers, such as Parzen (1962), Bhattacharya (1967), Van Ryzin (1969), and Moore and Yackel.
4 The choice of the smoothing parameter h
The bandwidth h controls the smoothness of the estimate, and the purpose of the estimation may be an influential factor in the selection method. In some applications it may suffice to look subjectively at the density estimates produced by a range of bandwidths: one can start with a large bandwidth and decrease the amount of smoothing until reaching a reasonable estimate. However, there are situations where several estimations are needed, and such subjective selection becomes impractical.
The problem of selecting the smoothing parameter for kernel estimation has been explored
by many authors, and no procedure has yet been considered the best in every situation.
Automatic bandwidth selection methods can basically be divided in two categories: classical and plug-in. Plug-in methods refer to those that find a pilot estimate of f , sometimes using a pilot estimate of h, and "plug it in" the estimation of MISE, computing the optimal bandwidth as in (10). Classical methods, such as cross-validation, Mallows' Cp , AIC, etc., are basically extensions of methods used in parametric modeling. Loader (1999) discusses the advantages and disadvantages of the plug-in and classical methods in more detail. Next, we present the reference method and the most used automatic bandwidth selection procedures.
A natural way to overcome the problem of not knowing f ′′ is to choose a reference density
for f , compute f ′′ and substitute it in (10). For example, assume that the reference density
is Gaussian with standard deviation σ, and a Gaussian kernel is used; then
$$h_{\mathrm{MISE}} = \left[\frac{R(K)}{\mu_2^2(K)\,R(f'')}\right]^{1/5} n^{-1/5} = \left[\frac{(2\sqrt{\pi})^{-1}}{\tfrac{3}{8}\,\pi^{-1/2}\,\sigma^{-5}}\right]^{1/5} n^{-1/5} = 1.06\,\sigma\, n^{-1/5}.
$$
By using an estimate of σ, one has a data-based estimate of the optimal bandwidth. In order to have an estimator that is more robust against outliers, the interquartile range R can be used to estimate the scale (see Silverman (1986)).
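A sketch of the reference rule in Python follows. The robust variant below uses the common IQR/1.34 rescaling, which is an assumption here, since the paper's exact robust formula is not reproduced above.

```python
import math
import random

def reference_bandwidth(data, robust=False):
    """Gaussian reference rule: h = 1.06 * scale * n^(-1/5)."""
    n = len(data)
    mean = sum(data) / n
    scale = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    if robust:
        xs = sorted(data)
        iqr = xs[(3 * n) // 4] - xs[n // 4]     # crude sample quartiles
        scale = min(scale, iqr / 1.34)          # assumed robust scale estimate
    return 1.06 * scale * n ** (-0.2)

random.seed(1)
sample = [random.gauss(0, 1) for _ in range(500)] + [50.0, 60.0]
# The outliers inflate the standard deviation but barely move the IQR:
print(round(reference_bandwidth(sample), 2),
      round(reference_bandwidth(sample, robust=True), 2))
```

This reproduces the behavior discussed around Figure 4: outliers in the tail make the non-robust bandwidth larger, oversmoothing the mode.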
[Figure 4: Estimated density of CO2 per capita in 2008 using the bandwidth that minimizes MISE (h_MISE) and the robust bandwidth (h_robust).]
Figure 4 shows the estimated density of CO2 per capita in the year 2008. Note that the estimated density that was computed with the robust bandwidth captures the peak that characterizes the mode, while the estimated density with the bandwidth that minimizes MISE smooths out this peak. This happens because the outliers at the tail of the distribution contribute to h_MISE being larger than the robust bandwidth h_robust.
These methods are of limited practical use, since they are restricted to situations where a pre-specified family of densities is correctly selected. Plug-in and classical methods, described next, are more broadly applicable.

There are several papers that address the plug-in approach for bandwidth selection. Some of them study different ways to estimate R(f ′′ ), others explore ideas on how to select a pilot bandwidth to better estimate R(f ′′ ). The idea is that the only unknown part of (10) needs to be estimated accurately.
Scott, Tapia and Thompson (1977) proposed a sequential process: calculate $\hat R(f'') = R(\hat f''_{h_2})$ from a pilot bandwidth $h_2$, plug $\hat R(f'')$ into (10) to obtain $h_3$, and iterate until convergence of the bandwidth. Hall and Marron (1987) proposed estimating $R(f^{(p)})$ by $\hat R(f^{(p)}) = R(\hat f^{(p)}_h) - \frac{R(K^{(p)})}{nh^{2p+1}}$. Park and Marron (1990) modified this idea, estimating $\hat R(f^{(p)}) = R(\hat f^{(p)}_g) - \frac{R(K^{(p)})}{ng^{2p+1}}$, with g having the optimal rate given in Hall and Marron (1987). An improvement of the Park and Marron (1990) method can be found in Sheather and Jones (1991). Hall, Sheather, Jones and Marron
(1991) proposed to use a kernel of order 2 and to take one extra term in the Taylor expansion of the MISE:
$$\mathrm{MISE}_2(h) = \frac{R(K)}{nh} + \frac{h^4}{4}\mu_2^2(K)R(f'') - \frac{h^6}{24}\mu_2(K)\mu_4(K)R(f'''). \qquad (13)$$
Since the minimizer of (13) is not available analytically, they proposed to estimate the bandwidth by a data-based approximation to its minimizer.
Several other plug-in methods have been proposed, and a review of the first procedures that address this type of methodology can be found in Turlach (1993). Modern research on plug-in methods has actually become somewhat hybrid, combining ideas of plug-in and classical approaches such as cross-validation; see the biased cross-validation method described below, for example.
example. More recently, inspired by developments in threshold selection, Chan, Lee and Peng
(2010) propose to choose h = O(n−1/5 ) as large as possible, so that the density estimator has a
larger bias, but smaller variance than fˆhAM SE (x) . The idea is to consider an alternative kernel
√
nh{fˆ(x;h)−f¯(x;h)}
density estimator f¯ = nh 1
Pn x−Xi
i=1 K̄ h
and define ∆n (x; h) = R
fˆ1/2 (x;h){ (K(s)−K̄(s))2 ds}1/2
.
where zα denotes a critical point in N(0, 1), c > 0 and 0 < ǫ < 1/5. The intuition is that,
d
when h is large ∆n (x; h) > zα , since ∆n (x; r) → N(0, 1).
Cross-validation is a popular and readily implemented heuristic for selecting the smoothing parameter in kernel estimation. Introduced by Rudemo (1982) and Bowman (1984), least squares cross-validation is very intuitive and has been a fundamental device in recent research. The idea is to consider the expansion of the Integrated Square Error (ISE) in the following way:
$$\mathrm{ISE}(h) = \int \hat f_h^2(x)\,dx - 2\int \hat f_h(x)f(x)\,dx + \int f^2(x)\,dx.$$
Note that the last term does not depend on f̂_h, hence not on h, so we only need to consider the first two terms. The ideal choice of bandwidth is the one which minimizes
$$L(h) = \mathrm{ISE}(h) - \int f^2(x)\,dx = \int \hat f_h^2(x)\,dx - 2\int \hat f_h(x)f(x)\,dx.$$
The principle of the least squares cross-validation method is to find an estimate of L(h) from the data. Replacing the second term by its leave-one-out estimator yields the criterion
$$CV_{LS}(h) = \int \hat f_h^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n} \hat f_{h,-i}(X_i).$$
This works because $E(\hat f_h)$ depends only on the kernel and the bandwidth, not on the sample size. It follows that $E(CV_{LS}(h)) = E(L(h))$, and hence $CV_{LS}(h) + \int f^2(x)\,dx$ is an unbiased estimator of MISE (which is why this method is also called unbiased cross-validation). Assuming that the minimizer of $CV_{LS}(h)$ is close to the minimizer of $E(CV_{LS}(h))$, the bandwidth $\hat h_{LS} = \arg\min_h CV_{LS}(h)$
is the natural choice. This method suffers from sample variation, that is, using different
samples from the same distribution, the estimated bandwidths may have large variance.
Further discussion on this method can be found in Bowman, Hall and Titterington (1984), among others.
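For a Gaussian kernel, the term ∫f̂h² has a closed form (the convolution of two normal densities), so the least squares criterion can be evaluated exactly. The following illustrative sketch selects h over a grid; names and data are not from the paper.

```python
import math
import random

def phi(u, s):
    """Normal density with mean 0 and standard deviation s, evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2 * math.pi))

def cv_ls(data, h):
    """CV_LS(h) = int fhat_h^2 - (2/n) sum_i fhat_{h,-i}(X_i), Gaussian kernel.
    The first term uses the closed form: K_h * K_h is normal with scale h*sqrt(2)."""
    n = len(data)
    int_f2 = sum(phi(xi - xj, h * math.sqrt(2)) for xi in data for xj in data) / n ** 2
    loo = sum(sum(phi(xi - xj, h) for j, xj in enumerate(data) if j != i) / (n - 1)
              for i, xi in enumerate(data)) / n
    return int_f2 - 2 * loo

def select_h(data, grid):
    """Least squares cross-validation bandwidth: minimize CV_LS over a grid."""
    return min(grid, key=lambda h: cv_ls(data, h))

random.seed(2)
sample = [random.gauss(0, 1) for _ in range(80)]
grid = [0.05 * k for k in range(2, 30)]      # candidate bandwidths 0.10, ..., 1.45
print(round(select_h(sample, grid), 2))
```

Repeating this with different seeds illustrates the sample variation mentioned above: the selected bandwidth changes noticeably from sample to sample.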
Biased cross-validation combines cross-validation with the plug-in idea, estimating R(f ′′ ) by
$$\widetilde{R}(f'') = R(\hat f_h'') - (nh^5)^{-1}R(K'') = n^{-2}\sum_{i \neq j} (K_h'' * K_h'')(X_i - X_j),$$
to give
$$BCV(h) = (nh)^{-1}R(K) + \frac{h^4}{4}\mu_2^2(K)\widetilde{R}(f'').$$
Then, the bandwidth selected is $h_{BCV} = \arg\min_h BCV(h)$. This selector is considered a hybrid of classical and plug-in methods.
Suppose that in addition to the original data set X1 , . . . , Xn , we have another independent observation X∗ from f . Thinking of f̂_h as a parametric family depending on h, but with fixed data X1 , . . . , Xn , we can view log f̂_h(X∗) as the log-likelihood of the bandwidth h. Because in practice no such extra observation is available, we instead omit one observation from the original data, say Xi , and compute f̂_{h,−i}(Xi), as in (15). Note that there is no preferred pattern when choosing the observation to be omitted, so the score function can be defined as
$$CV(h) = -\frac{1}{n}\sum_{i=1}^{n} \log \hat f_{h,-i}(X_i).$$
Naturally, we choose the bandwidth that minimizes CV(h), which is known to minimize the Kullback-Leibler distance between f̂_h(x) and f (x). This method was proposed by Habbema, Hermans and van den Broek (1974) and Duin (1976), and further results can be found in Marron (1987), Marron (1989) and Cao, Cuevas and Gonzalez-Manteiga (1994).
In general, bandwidths chosen via cross-validation methods in kernel density estimation are highly variable, and usually give undersmoothed density estimates, causing undesired spurious bumpiness.
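The likelihood cross-validation score can be sketched as follows; this is an illustrative Python version with a Gaussian kernel, where smaller scores are better because the score is the negative average log-likelihood.

```python
import math
import random

def loo_density(data, i, x, h):
    """Leave-one-out Gaussian-kernel estimate fhat_{h,-i}(x)."""
    n = len(data)
    s = sum(math.exp(-0.5 * ((x - xj) / h) ** 2)
            for j, xj in enumerate(data) if j != i)
    return s / ((n - 1) * h * math.sqrt(2 * math.pi))

def likelihood_cv(data, h):
    """CV(h) = -(1/n) sum_i log fhat_{h,-i}(X_i); minimized over h."""
    return -sum(math.log(loo_density(data, i, xi, h))
                for i, xi in enumerate(data)) / len(data)

random.seed(3)
sample = [random.gauss(0, 1) for _ in range(60)]
grid = [0.1 * k for k in range(2, 21)]        # h in [0.2, 2.0]
h_ml = min(grid, key=lambda h: likelihood_cv(sample, h))
print(round(h_ml, 1))
```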
The Indirect Cross-Validation (ICV) method, proposed by Savchuk, Hart and Sheather (2010), slightly outperforms least squares cross-validation in terms of mean integrated squared error. The method can be described as follows. First, define a family of selection kernels L; each such L is a linear combination of two Gaussian kernels. Then, select the bandwidth of an L-kernel estimator using least squares cross-validation, and call it b̂_UCV. Under some regularity conditions, Savchuk, Hart and Sheather (2010) show that the relative error of ICV bandwidths can converge to 0 at the rate n^{−1/4}, which is faster than the corresponding rate for least squares cross-validation.
4.4.1 Variable bandwidths

Rather than using a single smoothing parameter h, some authors have considered the possibility of using a bandwidth h(x) that varies according to the point x at which f is
estimated. This is often referred to as the balloon estimator and has the form
$$\hat f(x) = \frac{1}{nh(x)}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h(x)}\right). \qquad (16)$$
The balloon estimator was introduced by Loftsgaarden and Quesenberry (1965) in the form
of the kth nearest neighbor estimator. In Loftsgaarden and Quesenberry (1965), h(x) was
based on a suitable number k, so that it was a measure of the distance between x and the kth
data point nearest to x. The optimal bandwidth for this case can be shown to be the analogue of (10), now varying with the point of estimation x.
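A minimal sketch of the balloon estimator with the kth nearest neighbor bandwidth and a uniform kernel follows; it is illustrative only, and k ≥ 2 should be used so that h(x) > 0 at the data points.

```python
import random

def balloon_estimate(data, x, k):
    """Balloon estimator (16): uniform kernel with h(x) = distance from x to
    its kth nearest data point (a Loftsgaarden-Quesenberry style bandwidth)."""
    n = len(data)
    dists = sorted(abs(x - xi) for xi in data)
    h = dists[k - 1]                     # kth smallest distance; use k >= 2
    inside = sum(1 for xi in data if abs(x - xi) <= h)
    return 0.5 * inside / (n * h)        # uniform kernel K(u) = 1/2 on |u| <= 1

random.seed(4)
sample = [random.gauss(0, 1) for _ in range(300)]
# The adaptive bandwidth is narrow near the mode and wide in the tail:
print(round(balloon_estimate(sample, 0.0, 20), 2),
      round(balloon_estimate(sample, 3.0, 20), 3))
```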
Another variable bandwidth method is to have the bandwidth vary not with the point of estimation, but with each observed data point. This type of estimator, known as the sample-point or variable kernel density estimator, was introduced by Breiman, Meisel and Purcell (1977), and takes the form
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h(X_i)} K\!\left(\frac{x - X_i}{h(X_i)}\right).$$
This type of estimator has one advantage over the balloon estimator: it will always integrate to 1, assuring that it is a density. Note that h(Xi) is a function of random variables, which complicates the analysis of the resulting estimator.
More results on the variable bandwidth approach can be found in Hall (1992), Taron,
Paragios and Jolly (2005), Wu, Chen and Chen (2007) and Gine and Sang (2010).
4.4.2 Binning
An adaptive type of procedure is the binned kernel density estimation, studied by a few
authors such as Scott (1981), Silverman (1982) and Jones (1989). The idea is to consider
equally spaced bins Bi with centers at ti and bin counts ni , and define the estimator as
$$\hat f_{bin}(x) = \frac{1}{nh}\sum_{i=-\infty}^{\infty} n_i K\!\left(\frac{x - t_i}{h}\right) = \frac{1}{nh}\sum_{i=1}^{m} n_i K\!\left(\frac{x - t_i}{h}\right), \qquad (19)$$
where the sum over m means summing over the finite non-empty bins that exist in practice.
Examples of other approaches and discussion on this type of estimation can be found
in Hall and Wand (1996), Cheng (1997), Minnotte (1999), Pawlak and Stadtmuller (1999), and Holmstrom (2000).
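The binned estimator (19) can be sketched as follows; the pre-binning width `delta` is a tuning choice independent of the kernel bandwidth h, and all names are illustrative.

```python
import math

def binned_kde(data, x, h, delta):
    """Binned estimator (19): pre-bin the data on a grid of width delta, then
    smooth the bin counts n_i at the bin centers t_i with an Epanechnikov kernel."""
    counts = {}
    for xi in data:
        j = math.floor(xi / delta)
        counts[j] = counts.get(j, 0) + 1
    total = 0.0
    for j, nj in counts.items():                # only non-empty bins are stored
        u = (x - (j + 0.5) * delta) / h         # (x - t_i) / h
        if abs(u) <= 1:
            total += nj * 0.75 * (1 - u * u)
    return total / (len(data) * h)

data = [0.1, 0.2, 0.25, 0.8, 0.9, 1.5]
print(round(binned_kde(data, 0.5, 0.5, 0.1), 3))
```

As `delta` shrinks, the binned estimate converges to the ordinary kernel estimate, while the cost of the smoothing step depends on the number of non-empty bins rather than on n.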
4.4.3 Bootstrap
A methodology that has been recently explored is that of selecting the bandwidth using the bootstrap. It focuses on replacing the MSE by MSE∗, a bootstrapped version of MSE, which can be minimized directly. Some authors resample from a subsample of the data X1 , . . . , Xn (see Hall (1990)); others resample from a pilot density based on the data (see Faraway and Jhun (1990), Hazelton (1996), Hazelton (1999)), more precisely, from
$$\tilde f_{h_b}(x) = \frac{1}{nb_n}\sum_{i=1}^{n} L\!\left(\frac{x - X_i}{b_n}\right),$$
where L is another kernel and bn is a pilot bandwidth. Since the bandwidth choice reduces
to estimating s in h = n^{−1/5}s, Ziegler (2006) introduces
$$f^*_{n,s}(x) = \frac{1}{n^{4/5}s}\sum_{i=1}^{n} K\!\left(\frac{x - X_i^*}{n^{-1/5}s}\right),$$
and obtains $\mathrm{MSE}^*_{n,s}(x) = E^*\!\left((f^*_{n,s}(x) - \tilde f_{h_b}(x))^2\right)$. The proposed bandwidth is the minimizer of this bootstrap criterion.
Applications of the bootstrap idea can be found in many different areas of estimation,
see Delaigle and Gijbels (2004), Loh and Jang (2010) for example.
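A Monte Carlo sketch of the smoothed-bootstrap idea with a Gaussian pilot: resampling from f̃ amounts to drawing a data point and adding Gaussian noise of scale b. The grid-based ISE approximation and all constants below are illustrative choices, not taken from the papers cited.

```python
import math
import random

def gauss_kde(data, x, h):
    """Gaussian kernel density estimate at a single point x."""
    return (sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
            / (len(data) * h * math.sqrt(2 * math.pi)))

def bootstrap_mise(data, h, b, B=30, rng=None):
    """Monte Carlo MISE*(h): resample from the pilot (a data point plus
    Gaussian noise of scale b) and compare fhat*_h with the pilot on a grid."""
    rng = rng or random.Random(0)
    n = len(data)
    grid = [-3.0 + 0.2 * k for k in range(31)]
    pilot = [gauss_kde(data, x, b) for x in grid]
    total = 0.0
    for _ in range(B):
        star = [rng.choice(data) + rng.gauss(0, b) for _ in range(n)]
        total += sum((gauss_kde(star, x, h) - p) ** 2
                     for x, p in zip(grid, pilot)) * 0.2
    return total / B

random.seed(5)
sample = [random.gauss(0, 1) for _ in range(60)]
hs = [0.2, 0.4, 0.6, 0.8, 1.0]
h_boot = min(hs, key=lambda h: bootstrap_mise(sample, h, b=0.4))
print(h_boot)
```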
It is known that kernel density estimators have larger bias at the boundaries. Many methods have been proposed to alleviate this problem, such as the use of gamma kernels or inverse and reciprocal inverse Gaussian kernels, also known as the varying kernel approach.
Chen (2000) proposes to replace the symmetric kernel by a gamma kernel, which has flexible
shapes and locations on R+. The estimator can be described in the following way. Suppose the underlying density f has support [0, ∞) and consider the gamma kernel
$$K_{x/b+1,\,b}(t) = \frac{t^{x/b} e^{-t/b}}{b^{x/b+1}\,\Gamma(x/b+1)},$$
where b is a smoothing parameter such that b → 0 and nb → ∞. Then, the gamma kernel
estimator is defined as
$$\hat f_G(x) = \frac{1}{n}\sum_{i=1}^{n} K_{x/b+1,\,b}(X_i).$$
Its expectation satisfies $E(\hat f_G(x)) = E f(\xi_x)$, where ξx is a Gamma(x/b + 1, b) random variable. Using a Taylor expansion and the fact that $E(\xi_x) = x + b$ and $\mathrm{Var}(\xi_x) = xb + b^2$, we have
$$\begin{aligned}
E f(\xi_x) &= f(x + b) + \frac{1}{2} f''(x)\,\mathrm{Var}(\xi_x) + o(b) \\
&= f(x) + b\left(f'(x) + \frac{1}{2}x f''(x)\right) + o(b).
\end{aligned}$$
It is clear, then, that this estimator does not have bias problems at the boundaries, since the bias is O(b) both near the origin and in the interior. See Chen (2000) for further details. Other related results can be found in Mnatsakanov and Ruymgaart (2012), Mnatsakanov and Sarkisian (2012), and Comte and Genon-Catalot (2012).
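Chen's gamma kernel estimator can be sketched directly from the formulas above; the Python below is illustrative, evaluating the kernel on the log scale to avoid overflow in the gamma function.

```python
import math

def gamma_kernel(x, b, t):
    """Gamma kernel K_{x/b+1, b}(t): a Gamma density with shape x/b + 1 and
    scale b, evaluated via logs for numerical stability."""
    a = x / b + 1.0
    return math.exp((a - 1.0) * math.log(t) - t / b - a * math.log(b) - math.lgamma(a))

def gamma_kde(data, x, b):
    """Chen's (2000) gamma kernel estimator for a density supported on [0, inf)."""
    return sum(gamma_kernel(x, b, t) for t in data) / len(data)

# Positive data: the estimator puts no mass below zero, unlike symmetric kernels.
data = [0.05, 0.3, 0.4, 0.7, 1.1, 1.6, 2.2, 3.0]
print(round(gamma_kde(data, 0.0, 0.2), 3), round(gamma_kde(data, 1.0, 0.2), 3))
```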
Some density estimation research focuses on bias reduction techniques, which can be found in Jones, Linton and Nielsen (1995), Choi and Hall (1999), Cheng, Choi, Fan and Hall (2000), Choi, Hall and Roussan (2000) and Hall and Minnotte (2002). Other recent improvements and interesting applications of the kernel estimate can be found in Hirukawa (2010), Liao, Wu and Lin (2010), Matuszyk, Cardew-Hall and Rolfe (2010), Miao, Rahimi and Rao (2012), Chu, Liau, Lin and Su (2012), and Golyandina, Pepelyshev and Steland (2012).
In some applications, interest lies in estimating the cumulative distribution function F (x) instead of the density function f (x). A whole methodology known as kernel distribution function estimation (KDFE) has been explored since Nadaraya (1964) introduced the kernel estimator of F. Authors have considered many alternatives for this estimation, but the basic measures of quality remain MISE-type criteria. Sarda (1993) considered a discrete approximation to MISE, the average squared error
$$\mathrm{ASE}(h) = \frac{1}{n}\sum_{i=1}^{n} \left[\hat F_h(X_i) - F(X_i)\right]^2 W(X_i).$$
He suggests replacing the unknown F (Xi) by the empirical Fn(Xi) and then selecting the bandwidth that minimizes the resulting criterion. There are many other works on estimating kernel distribution functions, for example Bowman, Hall and Prvan (1998), Tenreiro (2006), Ahmad and Amezziane (2007), and Janssen, Swanepoel and Veraverbeke (2007).
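With a Gaussian kernel, the integrated kernel is the normal CDF, so a kernel distribution function estimate takes one line; the sketch below is illustrative.

```python
import math

def kdfe(data, x, h):
    """Kernel distribution function estimate: with a Gaussian kernel the
    integrated kernel is the normal CDF Phi((x - Xi)/h)."""
    return sum(0.5 * (1.0 + math.erf((x - xi) / (h * math.sqrt(2.0))))
               for xi in data) / len(data)

data = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.4]
# A smooth, monotone alternative to the empirical distribution function:
print(round(kdfe(data, 0.0, 0.3), 3), round(kdfe(data, 2.0, 0.3), 3))
```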
It is well known that plug-in bandwidth estimators tend to select larger bandwidths when compared to the classical estimators. They are usually tuned by arbitrary specification of pilot estimates and most often produce oversmoothed results when the smoothing problem is difficult. On the other hand, smaller bandwidths tend to be selected by classical methods, producing undersmoothed results. The goal of a selector of the smoothing parameter is to make that decision purely from the data, finding automatically which features are important and which are spurious.
Figure 5 shows an example of classical and plug-in bandwidth selectors for a real data set. The data corresponds to the exports of goods and services of countries in 2011, representing the value of all goods and other market services provided to the rest of the world. The plug-in estimators, (a) the Gaussian rule of thumb and (b) the Sheather and Jones selector, produced a very smooth fit, while unbiased cross-validation selects a small bandwidth, yielding a highly variable density estimate. The hybrid method, biased cross-validation, is the one that selects the largest bandwidth; hence its corresponding density estimate is very smooth.
[Figure 5: Density estimates of exports of goods and services, using the unbiased CV, biased CV and plug-in bandwidths.]
5 SiZer
The question of which bandwidth yields the best possible fit has been addressed through several methods, as described in previous sections. The challenge is to identify the features that are really there, but at the same time to avoid spurious noise. Marron and Chung (1997) and other authors noted that it may be worthwhile to consider a family of smooths with a broad range of bandwidths, instead of a single estimate. As an illustration, consider data generated from a mixture of a Gaussian variable with mean 0 and variance 1 and another Gaussian variable with mean 8 and variance 2. The density was estimated with an Epanechnikov kernel using bandwidths that vary from 0.4 to 10. The wide range of smoothing considered, from a small bandwidth producing a wiggly estimate to a very large bandwidth yielding a heavily oversmoothed estimate, allows a contrast of estimated features at each level of smoothing. The two highlighted bandwidths are equal to 0.6209704 and 1.493644, corresponding to the choices of biased cross-validation (blue) and of Silverman's rule of thumb (red) (see Silverman, 1986), respectively.
[Figure 6: Family of kernel density estimates for bandwidths h ranging from 0.4 to 10, with the biased cross-validation (blue) and Silverman's rule of thumb (red) bandwidths highlighted.]
The idea of considering a family of smooths has its origins in scale space theory in computer science. A fundamental concept in such analysis is that it does not aim at estimating one true curve, but at recovering the significant aspects of the underlying function, since different levels of smoothing may reveal different intrinsic features. Exploring this concept from a statistical point of view, Chaudhuri and Marron (2000) introduced a procedure called SiZer (SIgnificant ZERo crossings of derivatives), whose objective is to analyze the visible features representing important underlying structures for different bandwidths.
Suppose that h ∈ H, where H is a subinterval of (0, ∞), and x ∈ I, where I is a
subinterval of (−∞, ∞). Then the family of smooth curves {fˆh (x)|h ∈ H, x ∈ I} can be
represented by a surface called scale space surface, which captures different structures of the
curve under different levels of smoothing. Hence, the focus is really on E(f̂h(x)) as h varies in H and x in I, which Chaudhuri and Marron (2000) call the "true curves viewed at different scales of resolution".
A smooth curve fˆh (x) has derivatives equal to 0 at points of minimum (valleys), maximum
(peaks) and points of inflection. Note that, before a peak (or valley), the sign of the derivative
∂ fˆh (x)/∂x is positive (or negative), and after it the derivative is negative (or positive). In
other words, peaks and valleys are determined by zero crossings of the derivative. More generally, we can identify structures in a smooth curve by zero crossings of the mth order derivative. Using a Gaussian kernel $K(x) = (1/\sqrt{2\pi})\exp(-x^2/2)$, Silverman (1981) showed
that the number of peaks in a kernel density estimate decreases monotonically with the
increase of the bandwidth, and Chaudhuri and Marron (2000) extended this idea for the
number of zero crossings of the mth order derivative ∂ m fˆh (x)/∂xm in kernel regression.
The asymptotic theory of the scale space surfaces and their derivatives studied by Chaudhuri and Marron (2000), which holds even under bootstrapped or resampled distributions, provides tools for building bootstrap confidence intervals and tests of significance for their features (see Chaudhuri and Marron (1999)). SiZer basically considers the null hypothesis
$$H_0^{h,x}: \frac{\partial^m E(\hat f_h(x))}{\partial x^m} = 0$$
for a fixed x ∈ I and h ∈ H. If $H_0^{h,x}$ is rejected, there is evidence that $\partial^m E(\hat f_h(x))/\partial x^m$ is significantly different from zero at that location and scale.
The information is displayed in a color map of scale space, where the pixels represent the location of x (horizontally) and h (vertically). The regions are shaded blue for a significantly increasing curve, red for a significantly decreasing one, purple where SiZer is unable to distinguish, and gray where there is insufficient data. Note that purple is displayed when the confidence interval for the derivative contains 0. There are a few options of software available, including a Java implementation.
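One row of the color map can be sketched as follows. This simplified version uses a pointwise z-interval for the derivative estimate, whereas the real SiZer adjusts the critical value for simultaneous inference across pixels; all names and constants are illustrative.

```python
import math
import random

def sizer_row(data, grid, h, z=1.96):
    """One row of a (simplified) SiZer map at scale h: classify the sign of
    fhat'_h(x) with a pointwise z-interval."""
    n = len(data)
    row = []
    for x in grid:
        # Y_i = K'((x - X_i)/h) / h^2 for a Gaussian K, so fhat'(x) = mean(Y)
        ys = [-u * math.exp(-0.5 * u * u) / (math.sqrt(2 * math.pi) * h * h)
              for xi in data for u in [(x - xi) / h]]
        m = sum(ys) / n
        se = math.sqrt(sum((y - m) ** 2 for y in ys) / (n - 1) / n)
        if m - z * se > 0:
            row.append("+")        # significantly increasing (blue)
        elif m + z * se < 0:
            row.append("-")        # significantly decreasing (red)
        else:
            row.append("0")        # unable to distinguish (purple)
    return "".join(row)

random.seed(6)
data = [random.gauss(0, 1) for _ in range(400)]
grid = [-2.0 + 0.5 * k for k in range(9)]   # x from -2 to 2
print(sizer_row(data, grid, h=0.5))
```

For unimodal data the row reads as a run of "+" followed by "0" near the mode and then "-", which is exactly the pattern described for Figure 7.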
Figure 7 shows an example of a color map obtained with SiZer, computed for the GDP data. We can see that for large bandwidths, the density function significantly increases until about 16000; then, after a small area where SiZer is unable to distinguish, it has a significant decrease, hence estimating a density with one mode at around 16000. Small bandwidths produce a map that is mostly gray, meaning that the wiggles in the estimate at that level of resolution cannot be separated from spurious sampling noise. An interesting blue area appears at a mid-level resolution, near 43000, indicating a slightly significant increase. This suggests that, with a mid-level bandwidth, the estimated density would have 2 modes, one somewhere around 16000 and another near 43000.
Acknowledgments: This paper was partially supported by grant 2012/10808-2, FAPESP.
References
Ahmad, I. A. and Amezziane, M. (2007). A general and fast convergent bandwidth selection
Altman, N. and Leger, C. (1995). Bandwidth selection for kernel distribution function
Berg, A. and Politis, D. (2009). Cdf and survival function estimation with infinite order
Bickel, P. J. and Rosenblatt, M. (1973). On some global measures of the deviations of density
Bowman, A. W., Hall, P. and Prvan, T. (1998). Bandwidth selection for the smoothing of
Breiman, L., Meisel, W. and Purcell, E. (1977). Variable kernel estimates of multivariate
Cai, Q., Rushton, G. and Bhaduri, B. (2012). Validation tests of an improved kernel density estimation method, 14(3): 243–264.
Cao, R., Cuevas, A. and Gonzalez-Manteiga, W. (1994). A comparative study of several smoothing methods in density estimation, Computational Statistics & Data Analysis 17(2): 153–176.
Chan, N.-H., Lee, T. C. and Peng, L. (2010). On nonparametric local inference for density
Chaudhuri, P. and Marron, J. S. (1999). Sizer for exploration of structures in curves, Journal
Chaudhuri, P. and Marron, J. S. (2000). Scale space view of curve estimation, The Annals
Chen, S. (2000). Probability density function estimation using gamma kernels, Annals of
Cheng, M.-Y. (1997). A bandwidth selector for local linear density estimators, The Annals
Cheng, M.-Y., Choi, E., Fan, J. and Hall, P. (2000). Skewing-methods for two parameter
Choi, E. and Hall, P. (1999). Data sharpening as prelude to density estimation, Biometrika
86: 941–947.
Choi, E., Hall, P. and Rousson, V. (2000). Data sharpening methods for bias reduction in
Chu, H.-J., Liau, C.-J., Lin, C.-H. and Su, B.-S. (2012). Integration of fuzzy cluster analysis
and kernel density estimation for tracking typhoon trajectories in the Taiwan region,
Comte, F. and Genon-Catalot, V. (2012). Convolution power kernels for density estimation,
Delaigle, A. and Gijbels, I. (2004). Bootstrap bandwidth selection in kernel density estimation,
Annals of the Institute of Statistical Mathematics 56(1): 19–47.
Devroye, L. and Wagner, T. (1980). The strong uniform consistency of kernel density
estimates, in: P. R. Krishnaiah, ed., Multivariate Analysis, Vol. V, North-Holland,
Amsterdam.
Faraway, J. and Jhun, M. (1990). Bootstrap choice of bandwidth for density estimation,
Gine, E. and Guillou, A. (2002). Rates of strong uniform consistency for multivariate kernel
density estimators, Annales de l’Institut Henri Poincare (B) Probability and Statistics
Gine, E. and Sang, H. (2010). Uniform asymptotics for kernel density estimators with
Habbema, J. D. F., Hermans, J. and van den Broek, K. (1974). A stepwise discrimination
Hall, P. (1983). Large sample optimality of least squares cross-validation in density estima-
Hall, P. (1990). Using the bootstrap to estimate mean squared error and select smoothing
Hall, P. (1992). On global properties of variable bandwidth density estimators, The Annals
Hall, P. and Minnotte, M. (2002). High order data sharpening for density estimation, Journal
Hall, P. and Wand, M. (1996). On the accuracy of binned kernel density estimators, Journal
Hall, P., Sheather, S. J., Jones, M. C. and Marron, J. S. (1991). On optimal data-based
Hazelton, M. (1996). Bandwidth selection for local density estimators, Scandinavian Journal
Hazelton, M. (1999). An optimal local bandwidth selector for kernel density estimation,
Hirukawa, M. (2010). Nonparametric multiplicative bias correction for kernel-type density
estimation on the unit interval, Computational Statistics and Data Analysis 54: 473–
495.
Janssen, P., Swanepoel, J. and Veraverbeke, N. (2007). Modifying the kernel distribution
Jones, M. C. (1989). Discretized and interpolated kernel density estimates, Journal of the
Jones, M., Linton, O. and Nielsen, J. (1995). A simple bias reduction method for density
Liao, J., Wu, Y. and Lin, Y. (2010). Improving Sheather and Jones' bandwidth selector for
difficult densities in kernel density estimation, Journal of Nonparametric Statistics
22: 105–114.
Loader, C. R. (1999). Bandwidth selection: classical or plug-in?, The Annals of Statistics
27(2): 415–438.
Loh, J. M. and Jang, W. (2010). Estimating a cosmological mass bias parameter with
bootstrap bandwidth selection, Journal of the Royal Statistical Society Series C 59: 761–
779.
Matuszyk, T. I., Cardew-Hall, M. J. and Rolfe, B. F. (2010). The kernel density esti-
380.
Miao, X., Rahimi, A. and Rao, R. P. (2012). Complementary kernel density estimation,
Minnotte, M. C. (1998). Achieving higher-order convergence rates for density estimation
with binned data, Journal of the American Statistical Association 93: 663–672.
Mnatsakanov, R. and Sarkisian, K. (2012). Varying kernel density estimation on R+, Statis-
Moore, D. and Yackel, J. (1977). Consistency properties of nearest neighbour density function
Nadaraya, E. A. (1964). On estimating regression, Theory Probab. Appl. 9: 141–142.
Pawlak, M. and Stadtmuller, U. (1999). Kernel density estimation with generalized binning,
Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators, Scandi-
Sarda, P. (1993). Smoothing parameter selection for smooth distribution functions, Journal
Savchuk, O., Hart, J. and Sheather, S. (2010). Indirect cross-validation for density estima-
Scaillet, O. (2004). Density estimation using inverse and reciprocal inverse gaussian kernels,
Scott, D. W. (1981). Using computer-binned data for density estimation, Computer Science
and Statistics: Proceedings of the 13th Symposium on the Interface. W. F. Eddy, Ed.,
Scott, D. W. (1992). Multivariate density estimation: Theory, practice, and visualization,
Scott, D. W., Tapia, R. A. and Thompson, J. R. (1977). Kernel density estimation revisited,
Sheather, S. and Jones, M. (1991). A reliable data-based bandwidth selection method for
kernel density estimation, Journal of the Royal Statistical Society - B 53: 683–690.
Silverman, B. W. (1978). Weak and strong uniform consistency of the kernel estimate of a
Silverman, B. W. (1982). Kernel density estimation using the fast fourier transform, Applied
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis, Chapman &
Hall, London.
Stone, C. J. (1984). An asymptotically optimal window selection rule for kernel density
Stute, W. (1982). A law of the logarithm for kernel density estimators, The Annals of
Taron, M., Paragios, N. and Jolly, M. P. (2005). Modelling shapes with uncertainties: Higher
order polynomials, variable bandwidth kernels and nonparametric density estimation,
Tenreiro, C. (2006). Asymptotic behaviour of multistage plug-in bandwidth selections for
116.
Van Ryzin, J. (1969). On strong consistency of density estimates, The Annals of Mathematical Statistics.
Wu, T.-J., Chen, C.-F. and Chen, H.-Y. (2007). A variable bandwidth selector in multivariate
Ziegler, K. (2006). On local bootstrap bandwidth choice in kernel density estimation, Sta-