
Nonparametric Density Estimation

10716: Advanced Machine Learning


Pradeep Ravikumar (amending notes from Larry
Wasserman)

1 Introduction

Let $X_1, \dots, X_n$ be a sample from a distribution $P$ with density $p$. The goal of nonparametric density estimation is to estimate $p$ with as few assumptions about $p$ as possible. We denote the estimator by $\hat p$. The estimator will typically depend on a tuning parameter $h$, and choosing $h$ carefully is crucial. To emphasize the dependence on $h$ we sometimes write $\hat p_h$.

A very simple non-parametric distribution estimator is simply the empirical distribution:
$$P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i},$$

but this is not very suitable as an estimate of the underlying distribution. It “overfits” to
the training data by placing all probability mass on the given training points {Xi }ni=1 and
zero mass even on very nearby points. It moreover does not have a density. So usually by
nonparametric density estimation, we mean something that does a bit more, in particu-
lar by “smoothing” the empirical distribution Pn . For this reason, nonparametric density
estimation is also often referred to as smoothing.

Example 1 (Bart Simpson) The top left plot in Figure 1 shows the density
$$p(x) = \frac{1}{2}\phi(x; 0, 1) + \frac{1}{10}\sum_{j=0}^{4}\phi\bigl(x; (j/2) - 1, 1/10\bigr) \qquad (1)$$

where φ(x; µ, σ) denotes a Normal density with mean µ and standard deviation σ. Marron
and Wand (1992) call this density “the claw” although we will call it the Bart Simpson
density. Based on 1,000 draws from p, we computed a kernel density estimator, described
later. The estimator depends on a tuning parameter called the bandwidth. The top right plot
is based on a small bandwidth h which leads to undersmoothing. The bottom right plot is
based on a large bandwidth h which leads to oversmoothing. The bottom left plot is based
on a bandwidth h which was chosen to minimize estimated risk. This leads to a much more
reasonable density estimate.
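As a concrete illustration of this example, the following Python sketch (ours, not part of the original notes; only numpy is assumed) draws 1,000 points from the claw density in (1) and evaluates a Gaussian kernel density estimate on a grid for an undersmoothed, a reasonable, and an oversmoothed bandwidth.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_claw(n):
        # With probability 1/2 draw from N(0,1); otherwise pick one of the five
        # narrow components N((j/2)-1, 1/10), j = 0,...,4, each with weight 1/10.
        comp = rng.integers(0, 10, size=n)          # 10 equally likely labels
        return np.where(comp < 5,
                        rng.normal(0.0, 1.0, size=n),
                        rng.normal((comp - 5) / 2.0 - 1.0, 0.1, size=n))

    def kde_gauss(x_grid, data, h):
        # p_hat(x) = (1/n) sum_i (1/h) K((x - X_i)/h) with a Gaussian kernel, d = 1.
        z = (x_grid[:, None] - data[None, :]) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))

    data = sample_claw(1000)
    grid = np.linspace(-3, 3, 200)
    for h in (0.005, 0.05, 0.5):    # undersmoothed, "just right", oversmoothed
        p_hat = kde_gauss(grid, data, h)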

Figure 1: The Bart Simpson density from Example 1. Top left: true density. The other plots
are kernel estimators based on n = 1,000 draws. Bottom left: bandwidth h = 0.05 chosen by
leave-one-out cross-validation. Top right: bandwidth h/10. Bottom right: bandwidth 10h.

2 Applications

Density estimation could be used for sampling new points (see the outpouring of creative, and
perhaps even worrying, uses of such sampling in the context of images and text), and more
generally, for a compact summary of data useful for downstream probabilistic reasoning. It
can also be used in particular for regression, classification, and clustering. Suppose $\hat p(x, y)$ is an estimate of $p(x, y)$.

Regression. We can then compute the following estimate of the regression function:
$$\hat m(x) = \int y\, \hat p(y \mid x)\, dy = \int y\, \frac{\hat p(y, x)}{\hat p(x)}\, dy.$$
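To make the plug-in idea concrete, here is a minimal Python sketch (ours, not from the notes). With a product Gaussian kernel for $\hat p(x, y)$, integrating out $y$ reduces $\hat m(x)$ to a kernel-weighted average of the $Y_i$, i.e. the Nadaraya-Watson estimator.

    import numpy as np

    def kernel_regression(x_query, X, Y, h):
        """Plug-in regression estimate m_hat(x) = int y p_hat(y, x) dy / p_hat(x).

        With a product Gaussian kernel this reduces to a weighted average of the Y_i
        (the Nadaraya-Watson estimator)."""
        W = np.exp(-0.5 * ((x_query[:, None] - X[None, :]) / h) ** 2)
        return (W * Y).sum(axis=1) / W.sum(axis=1)

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=400)
    Y = np.sin(X) + 0.3 * rng.normal(size=400)
    m_hat = kernel_regression(np.linspace(-3, 3, 50), X, Y, h=0.3)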

Classification. For classification, recall the Bayes optimal classifier
$$h(x) = I\bigl(p_1(x)\pi_1 > p_0(x)\pi_0\bigr)$$
where $\pi_1 = \mathbb{P}(Y = 1)$, $\pi_0 = \mathbb{P}(Y = 0)$, $p_1(x) = p(x \mid y = 1)$ and $p_0(x) = p(x \mid y = 0)$. Inserting sample estimates of $\pi_1$ and $\pi_0$, and density estimates for $p_1$ and $p_0$, yields an estimate of the Bayes classifier. Many classifiers that you are familiar with can be re-expressed this way.
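A similarly minimal Python sketch of this plug-in classifier, using one-dimensional Gaussian kernel density estimates for $\hat p_0, \hat p_1$ and sample class frequencies for $\hat\pi_0, \hat\pi_1$ (all function names are ours):

    import numpy as np

    def gauss_kde(x_query, X, h):
        # 1-d Gaussian kernel density estimate evaluated at the query points.
        z = (x_query[:, None] - X[None, :]) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (X.size * h * np.sqrt(2 * np.pi))

    def plug_in_classifier(x_query, X, y, h=0.3):
        """Predict 1 iff p1_hat(x) * pi1_hat > p0_hat(x) * pi0_hat."""
        pi1 = y.mean()
        p1 = gauss_kde(x_query, X[y == 1], h)
        p0 = gauss_kde(x_query, X[y == 0], h)
        return (p1 * pi1 > p0 * (1 - pi1)).astype(int)

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=500)
    X = rng.normal(loc=2.0 * y, scale=1.0)     # class 1 is shifted to the right
    labels = plug_in_classifier(np.linspace(-2, 4, 7), X, y, h=0.4)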

Clustering. For clustering, we look for the high density regions, based on an estimate of
the density. We will discuss more on this when we discuss clustering.

Anomaly/Outlier Detection. Density estimation is sometimes also used to find unusual


observations or outliers. These are observations for which $\hat p(X_i)$ is very small.

Two-Sample Hypothesis Testing. Density estimation can be used for two-sample testing. Given $X_1, \dots, X_n \sim p$ and $Y_1, \dots, Y_m \sim q$ we can test $H_0: p = q$ using $D(\hat p, \hat q)$ as a test statistic, for some divergence $D$.
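As a rough illustration (ours, under the assumption that $D$ is taken to be the integrated squared difference between the two kernel density estimates), the statistic can be calibrated with a permutation test:

    import numpy as np

    def kde(grid, data, h):
        # 1-d Gaussian kernel density estimate evaluated on a grid.
        z = (grid[:, None] - data[None, :]) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))

    def l2_divergence(x, y, grid, h):
        # D(p_hat, q_hat) = integral of (p_hat - q_hat)^2, via the trapezoid rule.
        return np.trapz((kde(grid, x, h) - kde(grid, y, h)) ** 2, grid)

    def permutation_test(x, y, h=0.3, n_perm=500, seed=0):
        rng = np.random.default_rng(seed)
        grid = np.linspace(min(x.min(), y.min()) - 3 * h,
                           max(x.max(), y.max()) + 3 * h, 400)
        observed = l2_divergence(x, y, grid, h)
        pooled = np.concatenate([x, y])
        count = 0
        for _ in range(n_perm):
            perm = rng.permutation(pooled)
            count += l2_divergence(perm[:x.size], perm[x.size:], grid, h) >= observed
        return (1 + count) / (1 + n_perm)    # permutation p-value for H0: p = q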

3 Loss Functions

The most commonly used loss function is the $L_2$ loss
$$\int \bigl(\hat p(x) - p(x)\bigr)^2 dx = \int \hat p^2(x)\, dx - 2\int \hat p(x)p(x)\, dx + \int p^2(x)\, dx.$$
The risk is $R(p, \hat p) = \mathbb{E}(L(p, \hat p))$.

A key advantage of the $L_2$ loss is that the risk has a very mathematically convenient decomposition:
$$R(p, \hat p) = \int \mathbb{E}\bigl(p(x) - \hat p(x)\bigr)^2 dx \qquad (2)$$
$$= \int b_n^2(x)\, dx + \int v_n(x)\, dx \qquad (3)$$
where $b_n(x) = \mathbb{E}(\hat p(x)) - p(x)$ is the bias and $v_n(x) = \mathrm{Var}(\hat p(x))$ is the variance.

The estimator $\hat p$ typically involves "smoothing" the empirical distribution in some way. The
main challenge is to determine how much smoothing to do. When the data are oversmoothed,
the bias term is large and the variance is small. When the data are undersmoothed the
opposite is true. This is called the bias–variance tradeoff. Minimizing risk corresponds to
balancing bias and variance.

Devroye and Györfi (1985) make a strong case for using the $L_1$ norm
$$\|\hat p - p\|_1 \equiv \int |\hat p(x) - p(x)|\, dx$$
as the loss instead of $L_2$. The $L_1$ loss has the following nice interpretation. If $P$ and $Q$ are distributions, define the total variation metric
$$d_{TV}(P, Q) = \sup_A |P(A) - Q(A)|$$
where the supremum is over all measurable sets. Now if $P$ and $Q$ have densities $p$ and $q$ then
$$d_{TV}(P, Q) = \frac{1}{2}\int |p - q| = \frac{1}{2}\|p - q\|_1.$$
Thus, if $\int |p - q| < \delta$ then we know that $|P(A) - Q(A)| < \delta/2$ for all $A$. Also, the $L_1$ norm is transformation invariant. Suppose that $T$ is a one-to-one smooth function. Let $Y = T(X)$. Let $p$ and $q$ be densities for $X$ and let $\tilde p$ and $\tilde q$ be the corresponding densities for $Y$. Then
$$\int |p(x) - q(x)|\, dx = \int |\tilde p(y) - \tilde q(y)|\, dy.$$

Hence the distance is unaffected by transformations. The L1 loss is, in some sense, a much
better loss function than L2 for density estimation. But it is much more difficult to deal
with. For now, we will focus on L2 loss. But we may discuss L1 loss later.
Another loss function is the Kullback-Leibler loss $\int p(x)\log\bigl(p(x)/q(x)\bigr)\, dx$. This is not a good loss function to use for nonparametric density estimation. The reason is that the Kullback-Leibler loss is completely dominated by the tails of the densities, due to the density ratios.

The minimax risk over a class of densities $\mathcal{P}$ is
$$R_n(\mathcal{P}) = \inf_{\hat p}\, \sup_{p \in \mathcal{P}} R(p, \hat p) \qquad (4)$$
and an estimator is minimax if its risk is equal to the minimax risk. We say that $\hat p$ is rate optimal if
$$R(p, \hat p) \asymp R_n(\mathcal{P}). \qquad (5)$$
Typically the minimax rate is of the form $n^{-C/(C+d)}$ for some $C > 0$.

4 Function Spaces

A distinguishing characteristic of “non-parametric” methods is that what we are estimating


is not in a finite-dimensional parametric space. Typically, it is in some infinite-dimensional
function space. We briefly review some classical function spaces.

The class of Lipschitz functions $H(1, L)$ on $\mathcal{X} \subset \mathbb{R}$ is the set of functions $g$ such that
$$|g(y) - g(x)| \le L|x - y| \quad \text{for all } x, y \in \mathcal{X}.$$
A differentiable function is Lipschitz if and only if it has a bounded derivative. Conversely, a Lipschitz function is differentiable almost everywhere.

Let $\mathcal{X} \subset \mathbb{R}$ and let $\beta$ be an integer. The Hölder space $H(\beta, L)$ is the set of functions $g$ mapping $\mathcal{X}$ to $\mathbb{R}$ such that $g$ is $\ell = \beta - 1$ times differentiable and satisfies
$$|g^{(\ell)}(y) - g^{(\ell)}(x)| \le L|x - y| \quad \text{for all } x, y \in \mathcal{X}.$$
A more intuitive perspective is that this class consists of functions whose first $\beta$ derivatives are all bounded.

Yet another perspective is that this is a set of functions that are close to their Taylor series approximation up to order $\beta$. If $g \in H(\beta, L)$ and $\ell = \beta - 1$, then we can define the Taylor approximation of $g$ at $x$ by
$$\tilde g(y) = g(x) + (y - x)g'(x) + \cdots + \frac{(y - x)^{\ell}}{\ell!}\, g^{(\ell)}(x)$$
and then $|g(y) - \tilde g(y)| \le L|y - x|^{\beta}$.

The definition for higher dimensions is similar. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$. Given a vector $s = (s_1, \dots, s_d)$, define
$$D^s = \frac{\partial^{s_1 + \cdots + s_d}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}$$
as the $s$-th partial derivative. We will also use the compact notation $|s| = s_1 + \cdots + s_d$, $s! = s_1! \cdots s_d!$, and $x^s = x_1^{s_1} \cdots x_d^{s_d}$.

Let $\beta$ be a positive integer and let $L > 0$. The Hölder class is then defined as:
$$H(\beta, L) = \Bigl\{ p : |D^s p(x) - D^s p(y)| \le L\|x - y\|, \ \text{for all } s \text{ such that } |s| = \beta - 1, \text{ and all } x, y \Bigr\}. \qquad (6)$$
For example, if $d = 1$ and $\beta = 2$ (which is the most common setting) this means that
$$|p'(x) - p'(y)| \le L|x - y| \quad \text{for all } x, y.$$
As before, we could also view this class as functions with bounded $D^s$ partial derivatives for $|s| \le \beta$. For instance, with $\beta = 2$, the class consists of functions that have bounded second derivatives.

And as before, this function class comprises functions that are close to their Taylor series approximation up to order $\beta$. Let
$$p_{x,\beta}(u) = \sum_{|s| < \beta} \frac{(u - x)^s}{s!}\, D^s p(x). \qquad (7)$$
Then, if $p \in H(\beta, L)$, we can show that $p$ is close to this Taylor approximation:
$$|p(u) - p_{x,\beta}(u)| \le L\|u - x\|^{\beta}. \qquad (8)$$
In the common case of $\beta = 2$, this means that
$$\bigl| p(u) - [p(x) + (u - x)^T \nabla p(x)] \bigr| \le L\|x - u\|^2.$$

4.1 Categories of Nonparametric Density Estimators

We will discuss two broad categories of nonparametric density estimators: (a) those based on hard partitioning of the input space, viz. histograms (technically not density estimators), and soft partitioning of the input space, viz. kernel density estimators; and (b) those based on projection onto an infinite-dimensional function space, where we will look at a particular instance called series estimators.

5 Histograms

Perhaps the simplest nonparametric distribution estimators, after the empirical distribution,
are histograms. The high level idea is to discretize the data, and then simply use the MLE

of the resulting categorical distribution (which is simply the frequencies of each category in
the data).

For convenience, assume that the data X1 , . . . , Xn are contained in the unit cube X = [0, 1]d
(although this assumption is not crucial). Divide X into bins, or sub-cubes, of size h. We
discuss methods for choosing $h$ later. There are $N = (1/h)^d$ such bins and each has volume $h^d$. Denote the bins by $B_1, \dots, B_N$. Now we can write the true density as
$$p(x) = \sum_{j=1}^{N} P(X \in B_j)\, p(x \mid X \in B_j).$$
We can estimate $P(X \in B_j)$ via
$$\hat\theta_j = \frac{1}{n}\sum_{i=1}^{n} I(X_i \in B_j),$$
the fraction of data points in bin $B_j$, and we can approximate $p(x \mid X \in B_j)$ by the density of the uniform distribution over the bin $B_j$, so that $p(x \mid X \in B_j) \approx \frac{1}{h^d}\, I(x \in B_j)$. Plugging these two values in, we get the histogram density estimator:
$$\hat p_h(x) = \sum_{j=1}^{N} \frac{\hat\theta_j}{h^d}\, I(x \in B_j). \qquad (9)$$
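A minimal Python sketch of the estimator (9), assuming the data have already been rescaled to lie in $[0,1]^d$ (the function names are ours):

    import numpy as np

    def histogram_density(X, h):
        """Histogram density estimator on [0,1]^d with cubic bins of side h."""
        n, d = X.shape
        n_bins = int(np.ceil(1.0 / h))
        # Map each point to its bin index (a d-tuple), clipping the boundary 1.0.
        idx = np.minimum((X / h).astype(int), n_bins - 1)
        counts = np.zeros((n_bins,) * d)
        np.add.at(counts, tuple(idx.T), 1)
        theta_hat = counts / n                       # fraction of points per bin
        return lambda x: theta_hat[tuple(np.minimum((np.asarray(x) / h).astype(int),
                                                    n_bins - 1))] / h**d

    X = np.random.default_rng(1).uniform(size=(500, 2))
    p_hat = histogram_density(X, h=0.1)
    p_hat([0.25, 0.75])    # estimated density at a single point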

5.1 Statistical Analysis: Histograms

Suppose that $p \in \mathcal{P}(L) := H(1, L)$ where
$$H(1, L) = \Bigl\{ p : |p(x) - p(y)| \le L\|x - y\|, \ \text{for all } x, y \Bigr\}. \qquad (10)$$

Theorem 2 The $L_2$ risk of the histogram estimator is bounded by
$$\sup_{p \in H(1,L)} R(p, \hat p) = \sup_{p \in H(1,L)} \int \mathbb{E}\bigl(\hat p_h(x) - p(x)\bigr)^2 dx \le L^2 h^2 d + \frac{C}{nh^d}. \qquad (11)$$

The upper bound is minimized by choosing $h = \left(\frac{C}{L^2 n d}\right)^{\frac{1}{d+2}}$. (Later, we shall see a more practical way to choose $h$.) With this choice,
$$\sup_{p \in H(1,L)} R(p, \hat p) \le C_0 \left(\frac{1}{n}\right)^{\frac{2}{d+2}}$$
where $C_0 = L^2 d\, (C/(L^2 d))^{2/(d+2)}$.

The rate of convergence $n^{-2\beta/(2\beta+d)}$ is slow when the dimension $d$ is large. The typical rate of convergence for parametric models is $d/\sqrt{n}$. To see the difference between these two rates: to get to $\epsilon$ error with the nonparametric rate, we would require a number of samples satisfying $n^{-2\beta/(2\beta+d)} \le \epsilon$, i.e. $n \ge (1/\epsilon)^{d/(2\beta)+1}$, which scales exponentially with the dimension $d$. On the other hand, for the parametric rate, $d/\sqrt{n} \le \epsilon$ only requires that $n \ge (d/\epsilon)^2$, which only scales polynomially with the dimension.

This upper bound can also be shown to be tight. Specifically:

Theorem 3 There exists a constant $C > 0$ such that
$$\inf_{\hat p}\, \sup_{p \in H(1,L)} \mathbb{E}\int \bigl(\hat p(x) - p(x)\bigr)^2 dx \ge C \left(\frac{1}{n}\right)^{\frac{2}{d+2}}. \qquad (12)$$

The above result showed that the histogram estimator is close (with respect to the $L_2$ loss) to the true density in expectation. A more powerful result would be to show that it is close with high probability. This entails analyzing
$$\sup_{P \in \mathcal{P}} P^n\bigl(\|\hat p_h - p\|_\infty > \epsilon\bigr)$$
where $\|f\|_\infty = \sup_x |f(x)|$.

Theorem 4 With probability at least $1 - \delta$,
$$\|\hat p_h - p\|_\infty \le \sqrt{\frac{1}{cnh^d}\log\left(\frac{2}{\delta h^d}\right)} + L\sqrt{d}\, h. \qquad (13)$$

Choosing $h = (c_2/n)^{1/(2+d)}$ we conclude that, with probability at least $1 - \delta$,
$$\|\hat p_h - p\|_\infty \le \sqrt{c^{-1} n^{-\frac{2}{2+d}}\left(\log\frac{2}{\delta} + \frac{2}{2+d}\log n\right)} + L\sqrt{d}\, n^{-\frac{1}{2+d}} = O\!\left(\left(\frac{\log n}{n}\right)^{\frac{1}{2+d}}\right). \qquad (14)$$

5.2 Adaptive histograms: Density Trees

Instead of uniformly partitioning the input domain, one can adaptively partition it. Ram
and Gray (2011) suggest a recursive partitioning scheme similar to decision trees. They
split each coordinate dyadically, in a greedy fashion. The density estimator is taken to
be piecewise constant. They use an $L_2$ risk estimator to decide when to split. The idea seems to have been re-discovered in Yang and Wong (arXiv:1404.1425) and Liu and Wong (arXiv:1401.2597). Density trees seem very promising.

6 Kernel Density Estimation
A one-dimensional smoothing kernel is any smooth function $K$ such that $\int K(x)\, dx = 1$, $\int x K(x)\, dx = 0$ and $\sigma_K^2 \equiv \int x^2 K(x)\, dx > 0$. Smoothing kernels should not be confused with Mercer kernels, which we discuss later. Some commonly used kernels are the following:

Boxcar: $K(x) = \frac{1}{2} I(x)$        Gaussian: $K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Epanechnikov: $K(x) = \frac{3}{4}(1 - x^2) I(x)$        Tricube: $K(x) = \frac{70}{81}(1 - |x|^3)^3 I(x)$

where $I(x) = 1$ if $|x| \le 1$ and $I(x) = 0$ otherwise. These kernels are plotted in Figure 2. Two commonly used multivariate kernels are $\prod_{j=1}^{d} K(x_j)$ and $K(\|x\|)$. For presentational simplicity, we will overload notation for both the multivariate and univariate kernels, and if not specified, for a vector $x$ we will use $K(x) = K(\|x\|)$.

Figure 2: Examples of smoothing kernels: boxcar (top left), Gaussian (top right), Epanech-
nikov (bottom left), and tricube (bottom right).

Suppose that $X \in \mathbb{R}^d$. Given a kernel $K$ and a positive number $h$, called the bandwidth, the kernel density estimator is defined to be
$$\hat p(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d}\, K\!\left(\frac{x - X_i}{h}\right). \qquad (15)$$
More generally, we define
$$\hat p_H(x) = \frac{1}{n}\sum_{i=1}^{n} K_H(x - X_i)$$

Figure 3: A kernel density estimator $\hat p$. At each point $x$, $\hat p(x)$ is the average of the kernels centered over the data points $X_i$. The data points are indicated by short vertical bars. The kernels are not drawn to scale.

where $H$ is a positive definite bandwidth matrix and $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$. For simplicity, we will take $H = h^2 I$, which recovers the previous formula.

Sometimes we write the estimator as $\hat p_h$ to emphasize the dependence on $h$. In the multivariate case the coordinates of $X_i$ should be standardized so that each has the same variance, since the norm $\|x - X_i\|$ treats all coordinates as if they are on the same scale.

The kernel estimator places a smoothed out lump of mass of size 1/n over each data point
Xi ; see Figure 3. The choice of kernel K is not crucial, but the choice of bandwidth h
is important. Small bandwidths give very rough estimates while larger bandwidths give
smoother estimates.
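Here is a short Python sketch implementing (15) with the spherical Gaussian kernel $K(x) = K(\|x\|)$ (a rough illustration; as noted above, in practice the coordinates should first be standardized):

    import numpy as np

    def kde(x_query, X, h):
        """Kernel density estimate (15) with a spherical Gaussian kernel.

        x_query: (m, d) query points, X: (n, d) data, h: bandwidth."""
        n, d = X.shape
        # Squared distances ||x - X_i||^2 between every query point and data point.
        sq = ((x_query[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-0.5 * sq / h**2) / (2 * np.pi * h**2) ** (d / 2)
        return K.mean(axis=1)    # average of the n kernels, each integrating to 1

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    kde(np.zeros((1, 2)), X, h=0.4)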

6.1 Statistical Analysis: Kernel Estimators

In this section we examine the performance of kernel density estimation. We will first need
a few definitions.

Assume that Xi ∈ X ⊂ Rd where X is compact.

Conditions on Kernel Function. In order for the kernel density estimate to be able to
estimate well a smooth function in H(β, L) for β > 2, we need a “higher order kernel”.


Figure 4: A higher-order kernel function: specifically, a kernel of order 4

Assume now that the kernel $K$ has the form $K(x) = k(\|x\|)$ for some univariate kernel $k$ that has support on $[-1, 1]$. A univariate kernel is said to have order $\beta$ provided that: $\int k = 1$, $\int |k|^q < \infty$ for any $q \ge 1$, $\int |t|^{\beta} |k(t)|\, dt < \infty$ and $\int t^s k(t)\, dt = 0$ for $1 \le s < \beta$. An example of a kernel that satisfies these conditions for $\beta = 2$ is $k(x) = (3/4)(1 - x^2)$ for $|x| \le 1$. Constructing a kernel that satisfies $\int t^s k(t)\, dt = 0$ for $\beta > 2$ requires using kernels that can take negative values, which is why such "higher order kernels" for $\beta > 2$ are not that popular. For example, a 4th-order kernel is $k(t) = \frac{3}{8}(3 - 5t^2)\,\mathbf{1}\{|t| \le 1\}$, plotted in Figure 4. Notice that it takes negative values.

Let $p_h(x) = \mathbb{E}[\hat p_h(x)]$. The next lemma provides a bound on the bias $p_h(x) - p(x)$.

Lemma 5 The bias of $\hat p_h$ satisfies
$$\sup_{p \in H(\beta,L)} |p_h(x) - p(x)| \le c h^{\beta} \qquad (16)$$
for some constant $c$.

Next we bound the variance.

Lemma 6 The variance of $\hat p_h$ satisfies
$$\sup_{p \in H(\beta,L)} \mathrm{Var}(\hat p_h(x)) \le \frac{c}{nh^d} \qquad (17)$$
for some constant $c > 0$.

Since the mean squared error is equal to the variance plus the bias squared, together the
previous two lemmas yield:

Theorem 7 The $L_2$ risk is bounded above, uniformly over $H(\beta, L)$, as
$$\sup_{p \in H(\beta,L)} \mathbb{E}\int \bigl(\hat p_h(x) - p(x)\bigr)^2 dx \lesssim h^{2\beta} + \frac{1}{nh^d}. \qquad (18)$$
If $h \asymp n^{-1/(2\beta+d)}$ then
$$\sup_{p \in H(\beta,L)} \mathbb{E}\int \bigl(\hat p_h(x) - p(x)\bigr)^2 dx \lesssim \left(\frac{1}{n}\right)^{\frac{2\beta}{2\beta+d}}. \qquad (19)$$

When $\beta = 2$ and $h \asymp n^{-1/(4+d)}$ we get the rate $n^{-4/(4+d)}$.

6.2 Minimax Lower Bound

According to the next theorem, there does not exist an estimator that converges faster than
O(n−2β/(2β+d) ). We state the result for integrated L2 loss although similar results hold for
other loss functions and other function spaces. We will prove this later in the course.

Theorem 8 There exists $C$ depending only on $\beta$ and $L$ such that
$$\inf_{\hat p}\, \sup_{p \in H(\beta,L)} \mathbb{E}_p \int \bigl(\hat p(x) - p(x)\bigr)^2 dx \ge C \left(\frac{1}{n}\right)^{\frac{2\beta}{2\beta+d}}. \qquad (20)$$

Theorem 8 together with (19) imply that kernel estimators are rate minimax.

Concentration Analysis of Kernel Density Estimators. Now we state a result which says how fast $\hat p(x)$ concentrates around $p(x)$.

Theorem 9 For all small $\epsilon > 0$,
$$\mathbb{P}\bigl(|\hat p(x) - p_h(x)| > \epsilon\bigr) \le 2\exp\bigl(-cnh^d \epsilon^2\bigr). \qquad (21)$$
Hence, for any $\delta > 0$,
$$\sup_{p \in H(\beta,L)} \mathbb{P}\left(|\hat p(x) - p(x)| > \sqrt{\frac{C\log(2/\delta)}{nh^d}} + ch^{\beta}\right) < \delta \qquad (22)$$
for some constants $C$ and $c$. If $h \asymp n^{-1/(2\beta+d)}$ then
$$\sup_{p \in H(\beta,L)} \mathbb{P}\left(|\hat p(x) - p(x)|^2 > \frac{c}{n^{2\beta/(2\beta+d)}}\right) < \delta.$$

The first statement follows from an application of Bernstein's inequality, while the last statement follows from bias-variance calculations followed by Markov's inequality.

Concentration in $L_\infty$. While Theorem 9 shows that, for each $x$, $\hat p(x)$ is close to $p(x)$ with high probability, it would be nice to have a version of this result that holds uniformly over all $x$. That is, we want a concentration result for
$$\|\hat p - p\|_\infty = \sup_x |\hat p(x) - p(x)|.$$
We can write
$$\|\hat p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + \|p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + ch^{\beta}.$$

We can bound the first term using something called bracketing together with Bernstein's theorem to prove that
$$\mathbb{P}\bigl(\|\hat p_h - p_h\|_\infty > \epsilon\bigr) \le 4\left(\frac{C}{h^{d+1}\epsilon}\right)^d \exp\left(-\frac{3n\epsilon^2 h^d}{28K(0)}\right). \qquad (23)$$

A more sophisticated analysis in Giné and Guillou (2002) (which replaces Bernstein's inequality in the previous proof with a more refined inequality due to Talagrand) yields the following:

Theorem 10 Suppose that $p \in H(\beta, L)$. Fix any $\delta > 0$. Then
$$\mathbb{P}\left(\sup_x |\hat p(x) - p(x)| > \sqrt{\frac{C\log n}{nh^d}} + ch^{\beta}\right) < \delta$$
for some constants $C$ and $c$, where $C$ depends on $\delta$. Choosing $h \asymp (\log n/n)^{1/(2\beta+d)}$ we have
$$\mathbb{P}\left(\sup_x |\hat p(x) - p(x)|^2 > \frac{C\log n}{n^{2\beta/(2\beta+d)}}\right) < \delta.$$

6.3 Boundary Bias

One caveat with the kernel density estimator is its behavior near the boundary of the sample space. If $x$ is within $O(h)$ of the boundary, then the bias is $O(h)$ instead of $O(h^2)$. The main reason is that the estimator averages over nearby points, and points near the boundary have more neighbors in directions leading away from the boundary than in directions towards the boundary. We will discuss more about this when we cover nonparametric regression.

There are a variety of fixes including: data reflection, transformations, boundary kernels,
local likelihood. These are not as popular as simple kernel density estimation however.

6.4 Asymptotic Expansions

In this section we consider some asymptotic expansions that describe the behavior of the
kernel estimator. We focus on the case d = 1.

Theorem 11 Let $R_x = \mathbb{E}(p(x) - \hat p(x))^2$ and let $R = \int R_x\, dx$. Assume that $p''$ is absolutely continuous and that $\int p'''(x)^2\, dx < \infty$. Then,
$$R_x = \frac{1}{4}\sigma_K^4 h_n^4\, p''(x)^2 + \frac{p(x)\int K^2(x)\, dx}{nh_n} + O\!\left(\frac{1}{n}\right) + O(h_n^6)$$
and
$$R = \frac{1}{4}\sigma_K^4 h_n^4 \int p''(x)^2\, dx + \frac{\int K^2(x)\, dx}{nh_n} + O\!\left(\frac{1}{n}\right) + O(h_n^6) \qquad (24)$$
where $\sigma_K^2 = \int x^2 K(x)\, dx$.

If we differentiate (24) with respect to $h$ and set it equal to 0, we see that the asymptotically optimal bandwidth is
$$h_* = \left(\frac{c_2}{c_1^2 A(f) n}\right)^{1/5} \qquad (25)$$
where $c_1 = \int x^2 K(x)\, dx$, $c_2 = \int K(x)^2\, dx$ and $A(f) = \int f''(x)^2\, dx$. This is informative because it tells us that the best bandwidth decreases at rate $n^{-1/5}$. Plugging $h_*$ into (24), we see that if the optimal bandwidth is used then $R = O(n^{-4/5})$.
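For illustration only: if we plug a $N(0, \sigma^2)$ reference density into (25) (an assumption of ours, not something the notes do), then $A(f) = 3/(8\sqrt{\pi}\sigma^5)$, and with the Gaussian kernel ($c_1 = 1$, $c_2 = 1/(2\sqrt{\pi})$) the formula reduces to $h_* = (4/3)^{1/5}\sigma n^{-1/5} \approx 1.06\,\sigma n^{-1/5}$, the familiar normal reference rule:

    import numpy as np

    def normal_reference_bandwidth(X):
        # h* = (c2 / (c1^2 A(f) n))^(1/5) with a N(0, sigma^2) reference density
        # and the Gaussian kernel; this simplifies to (4/3)^(1/5) * sigma * n^(-1/5).
        n = X.size
        sigma = X.std(ddof=1)
        return (4.0 / 3.0) ** 0.2 * sigma * n ** (-0.2)

    X = np.random.default_rng(0).normal(size=1000)
    h_star = normal_reference_bandwidth(X)    # roughly 1.06 * sigma * n^(-1/5)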

7 Picking Bandwidths of Kernel Estimators

In practice we need a data-based method for choosing the bandwidth h. To do this, we will
need to estimate the risk of the estimator and minimize the estimated risk over h.

7.1 Leave One Out Cross-Validation

A common method for estimating risk is leave-one-out cross-validation. Recall that the loss function is
$$\int \bigl(\hat p_h(x) - p(x)\bigr)^2 dx = \int \hat p_h^2(x)\, dx - 2\int \hat p_h(x)p(x)\, dx + \int p^2(x)\, dx.$$
The last term does not involve $\hat p$ so we can drop it. Thus, we now define the loss to be
$$L(h) = \int \hat p_h^2(x)\, dx - 2\int \hat p_h(x)p(x)\, dx.$$

The risk is R(h) = E(L(h)).

Definition 12 The leave-one-out cross-validation estimator of risk is
$$\hat R(h) = \int \bigl(\hat p_h(x)\bigr)^2 dx - \frac{2}{n}\sum_{i=1}^{n} \hat p_{h,(-i)}(X_i) \qquad (26)$$
where $\hat p_{h,(-i)}$ is the density estimator obtained after removing the $i$th observation.

It is easy to check that $\mathbb{E}[\hat R(h)] = R(h)$.
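A direct (O(n^2) per bandwidth) Python sketch of the criterion (26) for the one-dimensional Gaussian-kernel estimator; the function names are ours. For the Gaussian kernel the first integral $\int \hat p_h^2$ has a closed form, since the product of two Gaussian kernels integrates to a Gaussian density with variance $2h^2$.

    import numpy as np

    def gauss(d, h):
        # N(0, h^2) density evaluated at the pairwise differences d.
        return np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2 * np.pi))

    def loo_cv_risk(X, h):
        """Leave-one-out risk estimate (26) for a 1-d Gaussian-kernel estimator."""
        n = X.size
        D = X[:, None] - X[None, :]
        # First term: integral of p_hat^2; the product of two Gaussian kernels
        # integrates to a Gaussian with variance 2 h^2.
        term1 = gauss(D, np.sqrt(2) * h).sum() / n**2
        # Second term: average of the leave-one-out estimators at the held-out points.
        K = gauss(D, h)
        np.fill_diagonal(K, 0.0)
        term2 = 2.0 * K.sum() / (n * (n - 1))
        return term1 - term2

    X = np.random.default_rng(0).normal(size=300)
    grid = np.linspace(0.05, 1.0, 40)
    h_cv = grid[np.argmin([loo_cv_risk(X, h) for h in grid])]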

A further justification for cross-validation is given by the following theorem due to Stone
(1984).

Theorem 13 (Stone’s theorem) Suppose that p is bounded. Let pbh denote the kernel
estimator with bandwidth h and let b
h denote the bandwidth chosen by cross-validation. Then,
R 2
p(x) − pbbh (x) dx a.s.
→ 1. (27)
inf h (p(x) − pbh (x))2 dx
R

The bandwidth for the density estimator in the bottom left panel of Figure 1 is based on
cross-validation. In this case it worked well but of course there are lots of examples where
there are problems. Do not assume that, if the estimator $\hat p$ is wiggly, then cross-validation has let you down. The eye is not a good judge of risk.

There are cases when cross-validation can seriously break down. In particular, if there are
ties in the data then cross-validation chooses a bandwidth of 0.

7.2 V -fold Cross-Validation

An alternative to leave-one-out is $V$-fold cross-validation. A common choice is $V = 10$. For simplicity, let us consider here just splitting the data into two halves. This version of cross-validation comes with stronger theoretical guarantees. Let $\hat p_h$ denote the kernel estimator based on bandwidth $h$. For simplicity, assume the sample size is even and denote the sample size by $2n$. Randomly split the data $X = (X_1, \dots, X_{2n})$ into two sets of size $n$. Denote these by $Y = (Y_1, \dots, Y_n)$ and $Z = (Z_1, \dots, Z_n)$.$^1$ Let $\bar{H} = \{h_1, \dots, h_N\}$ be a finite grid of bandwidths. For $j \in [N]$, denote
$$\hat p_j(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_j^d}\, K\!\left(\frac{x - Y_i}{h_j}\right).$$

Thus we have a set $\mathcal{P} = \{\hat p_1, \dots, \hat p_N\}$ of density estimators.


The loss of $\hat p_j$ is given as $L(p, \hat p_j) = \int \hat p_j^2(x)\, dx - 2\int \hat p_j(x)p(x)\, dx$. Define the estimated risk
$$\hat L_j \equiv \hat L(p, \hat p_j) = \int \hat p_j^2(x)\, dx - \frac{2}{n}\sum_{i=1}^{n} \hat p_j(Z_i). \qquad (28)$$
Let $\hat p = \mathrm{argmin}_{j \in [N]} \hat L(p, \hat p_j)$. Schematically:
$$X = (X_1, \dots, X_{2n}) \ \overset{\text{split}}{\Longrightarrow} \ \begin{cases} Y \to \{\hat p_1, \dots, \hat p_N\} = \mathcal{P} \\ Z \to \{\hat L_1, \dots, \hat L_N\} \end{cases}$$

Theorem 14 (Wegkamp 1999) There exists a $C > 0$ such that
$$\mathbb{E}\bigl(\|\hat p - p\|^2\bigr) \le 2\min_{j \in [N]} \mathbb{E}\bigl(\|\hat p_j - p\|^2\bigr) + \frac{C\log N}{n}.$$

A similar result can be proved for $V$-fold cross-validation.

$^1$ It is not necessary to split the data into two sets of equal size. We use the equal split version for simplicity.

7.3 Example

Figure 5 shows a synthetic two-dimensional data set, the cross-validation function, and two kernel density estimators. The data are 100 points generated as follows: we select a point randomly on the unit circle and then add Normal noise with standard deviation 0.1.

Figure 5: Synthetic two-dimensional data set. Top left: data. Top right: cross-validation function. Bottom left: kernel estimator based on the bandwidth that minimizes the cross-validation score. Bottom right: kernel estimator based on twice the bandwidth that minimizes the cross-validation score.

The first estimator (lower left) uses the bandwidth that minimizes the leave-one-out cross-validation score. The second uses twice that bandwidth. The cross-validation curve is very sharply peaked with a clear minimum. The resulting density estimate is somewhat lumpy. This is because cross-validation is aiming to minimize $L_2$ error, which does not guarantee that the estimate is smooth. Also, the dataset is small, so this effect is more noticeable. The estimator with the larger bandwidth is noticeably smoother. However, the lumpiness of the estimator is not necessarily a bad thing.

7.4 Picking Bandwidths to optimize L1 instead of L2 Risk

Here we discuss another approach to choosing $h$, aimed at the $L_1$ loss. Recall that the $L_1$ loss between some density $g$ and the true distribution $P$ (with density $p$) is given as
$$\int |g(x) - p(x)|\, dx = 2\sup_A \left| \int_A g(x)\, dx - P(A) \right|.$$
The idea is to restrict to a class of sets $\mathcal{A}$ (which we call test sets) and choose $h$ to make $\int_A \hat p_h(x)\, dx$ close to $P(A)$ for all $A \in \mathcal{A}$. That is, we would like to minimize
$$\Delta(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(x)\, dx - P(A) \right|. \qquad (29)$$
Note that this is only an approximation to the $L_1$ risk, which takes the supremum over all measurable sets rather than a restricted class, so the test sets have to be chosen carefully. We will next discuss two approaches to specifying these test classes.

7.4.1 VC Classes

Let $\mathcal{A}$ be a class of sets with VC dimension $\nu$. As in Section 7.2, split the data $X$ into $Y$ and $Z$, with $\mathcal{P} = \{\hat p_1, \dots, \hat p_N\}$ constructed from $Y$. For $g \in \mathcal{P}$ define
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(x)\, dx - P_n(A) \right|$$
where $P_n(A) = n^{-1}\sum_{i=1}^{n} I(Z_i \in A)$. Let $\hat p = \mathrm{argmin}_{j \in [N]} \Delta_n(\hat p_j)$.

Theorem 15 For any $\delta > 0$ there exists $c$ such that
$$\mathbb{P}\left( \Delta(\hat p) > \min_j \Delta(\hat p_j) + 2c\sqrt{\frac{\nu}{n}} \right) < \delta.$$

The difficulty in implementing this idea is computing and minimizing $\Delta_n(g)$. Hjort and Walker (2001) presented a similar method which can be practically implemented when $d = 1$. Another caveat is that $\Delta(g)$ is only an approximation of the $L_1$ loss, whose quality depends on the richness of the class of sets $\mathcal{A}$. Is there a class of sets $\mathcal{A}$ small enough to be tractable, yet rich enough that minimizing $\Delta$ is effectively the same as minimizing the $L_1$ loss?

7.4.2 Yatracos Classes

Devroye and Györfi (2001) use such a class of sets, called a Yatracos class, which leads to estimators with some remarkable properties. Let $\mathcal{P} = \{p_1, \dots, p_N\}$ be a set of densities and define the Yatracos class of sets $\mathcal{A} = \{A(i, j) : i \ne j\}$ where $A(i, j) = \{x : p_i(x) > p_j(x)\}$. Let
$$\hat p = \mathrm{argmin}_{j \in [N]} \Delta_n(p_j),$$
where
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(u)\, du - P_n(A) \right|$$
and $P_n(A) = n^{-1}\sum_{i=1}^{n} I(Z_i \in A)$ is the empirical measure based on a sample $Z_1, \dots, Z_n \sim p$.

Theorem 16 The estimator $\hat p$ satisfies
$$\int |\hat p - p| \le 3\min_j \int |p_j - p| + 4\Delta \qquad (30)$$
where $\Delta = \sup_{A \in \mathcal{A}} \left| \int_A p - P_n(A) \right|$.

The term $\min_j \int |p_j - p|$ is like a bias, while the term $\Delta$ is like the variance.

Now we apply this to kernel estimators. Again we split the data $X$ into two halves $Y = (Y_1, \dots, Y_n)$ and $Z = (Z_1, \dots, Z_n)$. For each $h$ let
$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d}\, K\!\left(\frac{\|x - Y_i\|}{h}\right).$$
Let
$$\mathcal{A} = \bigl\{ A(h, \nu) : h, \nu > 0,\ h \ne \nu \bigr\}$$
where $A(h, \nu) = \{x : \hat p_h(x) > \hat p_\nu(x)\}$. Define
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(u)\, du - P_n(A) \right|$$
where $P_n(A) = n^{-1}\sum_{i=1}^{n} I(Z_i \in A)$ is the empirical measure based on $Z$. Let
$$\hat p = \mathrm{argmin}_{h} \Delta_n(\hat p_h).$$

Under some regularity conditions on the kernel, we have the following result.

Theorem 17 (Devroye and Györfi, 2001) The risk of $\hat p$ satisfies
$$\mathbb{E}\int |\hat p - p| \le c_1 \inf_h \mathbb{E}\int |\hat p_h - p| + c_2 \sqrt{\frac{\log n}{n}}. \qquad (31)$$

The proof involves showing that the terms on the right hand side of (30) are small. We refer
the reader to Devroye and Györfi (2001) for the details.

Finding computationally efficient methods to implement this approach remains an open


question.

8 Series Methods

We have emphasized kernel density estimation. There are many other density estimation methods. Let us briefly mention a method based on basis functions. For simplicity, suppose that $X_i \in [0, 1]$ and let $\phi_1, \phi_2, \dots$ be an orthonormal basis for
$$\mathcal{F} = \Bigl\{ f : [0, 1] \to \mathbb{R},\ \int_0^1 f^2(x)\, dx < \infty \Bigr\}.$$
Thus
$$\int \phi_j^2(x)\, dx = 1, \qquad \int \phi_j(x)\phi_k(x)\, dx = 0 \ \text{ for } j \ne k.$$
An example is the cosine basis:
$$\phi_0(x) = 1, \qquad \phi_j(x) = \sqrt{2}\cos(2\pi j x), \quad j = 1, 2, \dots$$

If $p \in \mathcal{F}$ then
$$p(x) = \sum_{j} \beta_j \phi_j(x)$$
where $\beta_j = \int_0^1 p(x)\phi_j(x)\, dx$. An estimate of $p$ is $\hat p(x) = \sum_{j=1}^{k} \hat\beta_j \phi_j(x)$ where
$$\hat\beta_j = \frac{1}{n}\sum_{i=1}^{n} \phi_j(X_i).$$
The number of terms k is the smoothing parameter and can be chosen using cross-validation.
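A short Python sketch of the cosine-series estimator on $[0,1]$ (ours, following the formulas above; the number of terms $k$ is left fixed here rather than chosen by cross-validation):

    import numpy as np

    def series_estimator(X, k):
        """Cosine-series density estimate on [0,1] with k basis terms."""
        j = np.arange(1, k + 1)
        beta_hat = np.sqrt(2) * np.cos(2 * np.pi * np.outer(X, j)).mean(axis=0)
        def p_hat(x):
            phi = np.sqrt(2) * np.cos(2 * np.pi * np.outer(np.atleast_1d(x), j))
            return 1.0 + phi @ beta_hat     # phi_0 = 1 contributes beta_0 = 1
        return p_hat

    X = np.random.default_rng(0).beta(2, 2, size=500)   # some density on [0,1]
    p_hat = series_estimator(X, k=10)
    p_hat(np.linspace(0, 1, 5))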

It can be shown that
$$R = \mathbb{E}\left[\int \bigl(\hat p(x) - p(x)\bigr)^2 dx\right] = \sum_{j=1}^{k} \mathrm{Var}(\hat\beta_j) + \sum_{j=k+1}^{\infty} \beta_j^2.$$
The first term is of order $O(k/n)$. To bound the second term (the bias) one usually assumes that $p$ lies in a Sobolev space of order $q$, which means that $p \in \mathcal{P}$ with
$$\mathcal{P} = \Bigl\{ p \in \mathcal{F} : p = \sum_j \beta_j \phi_j,\ \sum_{j=1}^{\infty} \beta_j^2 j^{2q} < \infty \Bigr\}.$$

In that case it can be shown that
$$R \approx \frac{k}{n} + \left(\frac{1}{k}\right)^{2q}.$$
The optimal $k$ is $k \approx n^{1/(2q+1)}$, with risk
$$R = O\!\left(\left(\frac{1}{n}\right)^{\frac{2q}{2q+1}}\right).$$

9 Miscellanea

9.1 High Dimensions, Curse of Dimensionality

As discussed earlier, the non-parametric rate of convergence n−C/(C+d) is slow when the
dimension d is large. In this case it is hopeless to try to estimate the true density p precisely
in the L2 norm (or any similar norm). We need to change our notion of what it means to
estimate p in a high-dimensional problem. Instead of estimating p precisely we have to settle
for finding an adequate approximation of p. Any estimator that finds the regions where p
puts large amounts of mass should be considered an adequate approximation. Let us consider
a few ways to implement this type of thinking.

9.2 Biased Density Estimation

Let $p_h(x) = \mathbb{E}(\hat p_h(x))$. Then
$$p_h(x) = \int \frac{1}{h^d}\, K\!\left(\frac{\|x - u\|}{h}\right) p(u)\, du,$$
so that the mean of $\hat p_h$ can be thought of as a smoothed version of $p$. Let $P_h(A) = \int_A p_h(u)\, du$ be the probability distribution corresponding to $p_h$. Then
$$P_h = P \star K_h$$
where $\star$ denotes convolution$^2$ and $K_h$ is the distribution with density $h^{-d} K(\|u\|/h)$. In other words, if $X \sim P_h$ then $X = Y + Z$ where $Y \sim P$ and $Z \sim K_h$. This is just another way to say that $P_h$ is a blurred or smoothed version of $P$. $p_h$ need not be close in $L_2$ to $p$ but still could preserve most of the important shape information about $p$. Consider then choosing a fixed $h > 0$ and estimating $p_h$ instead of $p$. This corresponds to ignoring the bias in the density estimator. We can then show:

Theorem 18 Let $h > 0$ be fixed. Then $\mathbb{P}(\|\hat p_h - p_h\|_\infty > \epsilon) \le Ce^{-nc\epsilon^2}$. Hence,
$$\|\hat p_h - p_h\|_\infty = O_P\!\left(\sqrt{\frac{\log n}{n}}\right).$$

The rate of convergence is fast and is independent of dimension. How to choose h is not
clear.
$^2$ If $X \sim P$ and $Y \sim Q$ are independent, then the distribution of $X + Y$ is denoted by $P \star Q$ and is called the convolution of $P$ and $Q$.

9.3 Graphical Models/Conditional Independence based methods

If we can live with some bias, we can reduce the dimensionality by imposing some (con-
ditional) independence assumptions. The simplest example is to treat the components
$(X_1, \dots, X_d)$ as if they are independent. In that case
$$p(x_1, \dots, x_d) = \prod_{j=1}^{d} p_j(x_j)$$

and the problem is reduced to a set of one-dimensional density estimation problems.

An extension is to use a forest. We represent the distribution with an undirected graph; a graph with no cycles is a forest. Let $E$ be the edges of the graph. Any density consistent with the forest can be written as
$$p(x) = \prod_{j=1}^{d} p_j(x_j) \prod_{(j,k) \in E} \frac{p_{j,k}(x_j, x_k)}{p_j(x_j)\, p_k(x_k)}.$$
Estimating the density therefore only requires estimating one- and two-dimensional marginals. But how do we find the edge set $E$? Some methods are discussed in Liu et al. (2011) under the name "Forest Density Estimation." A simple approach, sketched below, is to connect pairs greedily using some measure of correlation.
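As a rough sketch of this greedy idea (ours, not the algorithm of Liu et al. 2011), one can score pairs by absolute correlation and add edges in decreasing order of score while skipping any edge that would create a cycle (Kruskal-style), which yields a forest over the variables:

    import numpy as np

    def greedy_forest(X, max_edges=None):
        """Greedily build a forest over the d variables, scoring edges by |correlation|."""
        d = X.shape[1]
        corr = np.abs(np.corrcoef(X, rowvar=False))
        pairs = sorted(((corr[j, k], j, k) for j in range(d) for k in range(j + 1, d)),
                       reverse=True)
        parent = list(range(d))               # union-find to detect cycles

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a

        edges = []
        for score, j, k in pairs:
            rj, rk = find(j), find(k)
            if rj != rk:                      # adding (j, k) keeps the graph acyclic
                parent[rj] = rk
                edges.append((j, k))
            if max_edges is not None and len(edges) >= max_edges:
                break
        return edges                          # the density then factors over these edges

    X = np.random.default_rng(0).normal(size=(500, 5))
    E = greedy_forest(X)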

9.4 Mixtures

Another approach to density estimation is to use mixtures. We will discuss mixture modelling
when we discuss clustering.

9.5 Adaptive Kernels

A generalization of the kernel method is to use adaptive kernels where one uses a different
bandwidth h(x) for each point x. One can also use a different bandwidth h(xi ) for each data
point. This makes the estimator more flexible and allows it to adapt to regions of varying
smoothness. But now we have the very difficult task of choosing many bandwidths instead
of just one.

10 Summary
1. We discussed two categories of nonparametric density estimators: partition based (hard-partition based such as histograms, and soft-partition based such as kernel density estimators), and projection onto function space based (series estimators).

2. Of these, the most commonly used nonparametric density estimator is the kernel density estimator
$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d}\, K\!\left(\frac{\|x - X_i\|}{h}\right).$$

3. The kernel estimator is rate minimax over many classes of densities.

4. Cross-validation methods can be used for choosing the bandwidth h.

11 Appendix: Proofs

11.1 Proof of Theorem 2

We prove the result by bounding the bias and variance of $\hat p_h$.


First we bound the bias. Let $\theta_j = P(X \in B_j) = \int_{B_j} p(u)\, du$. For any $x \in B_j$,
$$p_h(x) \equiv \mathbb{E}(\hat p_h(x)) = \frac{\theta_j}{h^d} \qquad (32)$$
and hence
$$p(x) - p_h(x) = p(x) - \frac{\int_{B_j} p(u)\, du}{h^d} = \frac{1}{h^d}\int_{B_j} \bigl(p(x) - p(u)\bigr)\, du.$$
Thus,
$$|p(x) - p_h(x)| \le \frac{1}{h^d}\int_{B_j} |p(x) - p(u)|\, du \le \frac{1}{h^d}\, L h\sqrt{d} \int_{B_j} du = L h\sqrt{d}$$
where we used the fact that if $x, u \in B_j$ then $\|x - u\| \le \sqrt{d}\, h$.

Now we bound the variance. Since $p$ is Lipschitz on a compact set, it is bounded. Hence, $\theta_j = \int_{B_j} p(u)\, du \le C\int_{B_j} du = Ch^d$ for some $C$. Thus, the variance is
$$\mathrm{Var}(\hat p_h(x)) = \frac{1}{h^{2d}}\mathrm{Var}(\hat\theta_j) = \frac{\theta_j(1 - \theta_j)}{nh^{2d}} \le \frac{\theta_j}{nh^{2d}} \le \frac{C}{nh^d}.$$
We conclude that the $L_2$ risk is bounded by
$$\sup_{p \in \mathcal{P}(L)} R(p, \hat p) = \sup_{p \in \mathcal{P}(L)} \int \mathbb{E}\bigl(\hat p_h(x) - p(x)\bigr)^2 dx \le L^2 h^2 d + \frac{C}{nh^d}. \qquad (33)$$

The upper bound is minimized by choosing $h = \left(\frac{C}{L^2 n d}\right)^{\frac{1}{d+2}}$. (Later, we shall see a more practical way to choose $h$.) With this choice,
$$\sup_{p \in \mathcal{P}(L)} R(p, \hat p) \le C_0 \left(\frac{1}{n}\right)^{\frac{2}{d+2}}$$
where $C_0 = L^2 d\, (C/(L^2 d))^{2/(d+2)}$.

11.2 Proof of Theorem 4

We now derive a concentration result for $\hat p_h$, where we will bound
$$\sup_{P \in \mathcal{P}} P^n\bigl(\|\hat p_h - p\|_\infty > \epsilon\bigr)$$
where $\|f\|_\infty = \sup_x |f(x)|$. Assume that $\epsilon \le 1$. First, note that
$$\mathbb{P}(\|\hat p_h - p_h\|_\infty > \epsilon) = \mathbb{P}\left(\max_j \left|\frac{\hat\theta_j}{h^d} - \frac{\theta_j}{h^d}\right| > \epsilon\right) = \mathbb{P}\bigl(\max_j |\hat\theta_j - \theta_j| > \epsilon h^d\bigr) \le \sum_j \mathbb{P}\bigl(|\hat\theta_j - \theta_j| > \epsilon h^d\bigr).$$

Recall Bernstein's inequality: suppose that $Y_1, \dots, Y_n$ are iid with mean $\mu$, $\mathrm{Var}(Y_i) \le \sigma^2$ and $|Y_i| \le M$. Then
$$\mathbb{P}(|\bar{Y} - \mu| > \epsilon) \le 2\exp\left(-\frac{n\epsilon^2}{2\sigma^2 + 2M\epsilon/3}\right). \qquad (34)$$
Using Bernstein's inequality and the fact that $\theta_j(1 - \theta_j) \le \theta_j \le Ch^d$,
$$\mathbb{P}\bigl(|\hat\theta_j - \theta_j| > \epsilon h^d\bigr) \le 2\exp\left(-\frac{1}{2}\, \frac{n\epsilon^2 h^{2d}}{\theta_j(1 - \theta_j) + \epsilon h^d/3}\right) \le 2\exp\left(-\frac{1}{2}\, \frac{n\epsilon^2 h^{2d}}{Ch^d + \epsilon h^d/3}\right) \le 2\exp\bigl(-cn\epsilon^2 h^d\bigr)$$
where $c = 1/(2(C + 1/3))$. By the union bound and the fact that $N \le (1/h)^d$,
$$\mathbb{P}\bigl(\max_j |\hat\theta_j - \theta_j| > \epsilon h^d\bigr) \le 2h^{-d}\exp\bigl(-cn\epsilon^2 h^d\bigr) \equiv \pi_n.$$
Earlier we saw that $\sup_x |p(x) - p_h(x)| \le L\sqrt{d}\, h$. Hence, with probability at least $1 - \pi_n$,
$$\|\hat p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + \|p_h - p\|_\infty \le \epsilon + L\sqrt{d}\, h. \qquad (35)$$
Now set
$$\epsilon = \sqrt{\frac{1}{cnh^d}\log\left(\frac{2}{\delta h^d}\right)}.$$
Then, with probability at least $1 - \delta$,
$$\|\hat p_h - p\|_\infty \le \sqrt{\frac{1}{cnh^d}\log\left(\frac{2}{\delta h^d}\right)} + L\sqrt{d}\, h. \qquad (36)$$
Choosing $h = (c_2/n)^{1/(2+d)}$ we conclude that, with probability at least $1 - \delta$,
$$\|\hat p_h - p\|_\infty \le \sqrt{c^{-1} n^{-\frac{2}{2+d}}\left(\log\frac{2}{\delta} + \frac{2}{2+d}\log n\right)} + L\sqrt{d}\, n^{-\frac{1}{2+d}} = O\!\left(\left(\frac{\log n}{n}\right)^{\frac{1}{2+d}}\right). \qquad (37)$$

11.3 Proof of Lemma 5 (Bias of Kernel Density Estimators)

We have
$$|p_h(x) - p(x)| = \left| \int \frac{1}{h^d}\, K\!\left(\frac{u - x}{h}\right) p(u)\, du - p(x) \right| = \left| \int K(v)\bigl(p(x + hv) - p(x)\bigr)\, dv \right|$$
$$\le \left| \int K(v)\bigl(p(x + hv) - p_{x,\beta}(x + hv)\bigr)\, dv \right| + \left| \int K(v)\bigl(p_{x,\beta}(x + hv) - p(x)\bigr)\, dv \right|.$$
The first term is bounded by $Lh^{\beta}\int |K(s)|\,|s|^{\beta}\, ds$ since $p \in H(\beta, L)$. The second term is 0 from the properties of $K$, since $p_{x,\beta}(x + hv) - p(x)$ is a polynomial of degree less than $\beta$ (with no constant term).

11.4 Proof of Lemma 6 (Variance of Kernel Density Estimators)


We can write $\hat p(x) = n^{-1}\sum_{i=1}^{n} Z_i$ where $Z_i = \frac{1}{h^d}\, K\!\left(\frac{x - X_i}{h}\right)$. Then,
$$\mathrm{Var}(Z_i) \le \mathbb{E}(Z_i^2) = \frac{1}{h^{2d}}\int K^2\!\left(\frac{x - u}{h}\right) p(u)\, du = \frac{h^d}{h^{2d}}\int K^2(v)\, p(x + hv)\, dv \le \frac{\sup_x p(x)}{h^d}\int K^2(v)\, dv \le \frac{c}{h^d}$$
for some $c$, since the densities in $H(\beta, L)$ are uniformly bounded. The result follows since $\mathrm{Var}(\hat p(x)) = n^{-1}\mathrm{Var}(Z_i)$.

11.5 Proof of Theorem 9 (Concentration of Kernel Density Estimators)

By the triangle inequality,
$$|\hat p(x) - p(x)| \le |\hat p(x) - p_h(x)| + |p_h(x) - p(x)| \qquad (38)$$
where $p_h(x) = \mathbb{E}(\hat p(x))$. From Lemma 5, $|p_h(x) - p(x)| \le ch^{\beta}$ for some $c$. Now $\hat p(x) = n^{-1}\sum_{i=1}^{n} Z_i$ where
$$Z_i = \frac{1}{h^d}\, K\!\left(\frac{\|x - X_i\|}{h}\right).$$
Note that $|Z_i| \le c_1/h^d$ where $c_1 = K(0)$. Also, $\mathrm{Var}(Z_i) \le c_2/h^d$ from Lemma 6.

Recall Bernstein's inequality: suppose that $Y_1, \dots, Y_n$ are iid with mean $\mu$, $\mathrm{Var}(Y_i) \le \sigma^2$ and $|Y_i| \le M$. Then
$$\mathbb{P}(|\bar{Y} - \mu| > \epsilon) \le 2\exp\left(-\frac{n\epsilon^2}{2\sigma^2 + 2M\epsilon/3}\right). \qquad (39)$$
Then, by Bernstein's inequality,
$$\mathbb{P}\bigl(|\hat p(x) - p_h(x)| > \epsilon\bigr) \le 2\exp\left(-\frac{n\epsilon^2}{2c_2 h^{-d} + 2c_1 h^{-d}\epsilon/3}\right) \le 2\exp\left(-\frac{nh^d \epsilon^2}{4c_2}\right)$$
whenever $\epsilon \le 3c_2/c_1$. If we choose $\epsilon = \sqrt{C\log(2/\delta)/(nh^d)}$ where $C = 4c_2$ then
$$\mathbb{P}\left(|\hat p(x) - p_h(x)| > \sqrt{\frac{C\log(2/\delta)}{nh^d}}\right) \le \delta.$$
The result follows from (38).

11.6 Proof of Theorem 11 (Asymptotics of Kernel Density Estimators)

Write $K_h(x, X) = h^{-1} K\bigl((x - X)/h\bigr)$ and $\hat p(x) = n^{-1}\sum_i K_h(x, X_i)$. Thus, $\mathbb{E}[\hat p(x)] = \mathbb{E}[K_h(x, X)]$ and $\mathrm{Var}[\hat p(x)] = n^{-1}\mathrm{Var}[K_h(x, X)]$. Now,
$$\mathbb{E}[K_h(x, X)] = \int \frac{1}{h}\, K\!\left(\frac{x - t}{h}\right) p(t)\, dt = \int K(u)\, p(x - hu)\, du$$
$$= \int K(u)\left( p(x) - hu\, p'(x) + \frac{h^2 u^2}{2}\, p''(x) + \cdots \right) du = p(x) + \frac{1}{2} h^2 p''(x) \int u^2 K(u)\, du + \cdots$$
since $\int K(x)\, dx = 1$ and $\int x K(x)\, dx = 0$. The bias is
$$\mathbb{E}[K_{h_n}(x, X)] - p(x) = \frac{1}{2}\sigma_K^2 h_n^2\, p''(x) + O(h_n^4).$$
By a similar calculation,
$$\mathrm{Var}[\hat p(x)] = \frac{p(x)\int K^2(x)\, dx}{n h_n} + O\!\left(\frac{1}{n}\right).$$
The first result then follows since the risk is the squared bias plus variance. The second result follows from integrating the first result.

11.7 Proof of Theorem 15 (VC Approximation to L1)

We know that
$$\mathbb{P}\left( \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > c\sqrt{\frac{\nu}{n}} \right) < \delta.$$
Hence, except on an event of probability at most $\delta$, we have that
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(x)\, dx - P_n(A) \right| \le \sup_{A \in \mathcal{A}} \left| \int_A g(x)\, dx - P(A) \right| + \sup_{A \in \mathcal{A}} \bigl| P_n(A) - P(A) \bigr| \le \Delta(g) + c\sqrt{\frac{\nu}{n}}.$$
By a similar argument, $\Delta(g) \le \Delta_n(g) + c\sqrt{\frac{\nu}{n}}$. Hence, $|\Delta(g) - \Delta_n(g)| \le c\sqrt{\frac{\nu}{n}}$ for all $g$. Let $p_* = \mathrm{argmin}_{g \in \mathcal{P}} \Delta(g)$. Then,
$$\Delta(p_*) \le \Delta(\hat p) \le \Delta_n(\hat p) + c\sqrt{\frac{\nu}{n}} \le \Delta_n(p_*) + c\sqrt{\frac{\nu}{n}} \le \Delta(p_*) + 2c\sqrt{\frac{\nu}{n}}.$$

11.8 Proof of Theorem 16 (Yatracos Approximation to L1)


Let $i$ be such that $\hat p = p_i$ and let $s$ be such that $\int |p_s - p| = \min_j \int |p_j - p|$. Let $B = \{p_i > p_s\}$ and $C = \{p_s > p_i\}$. Now,
$$\int |\hat p - p| \le \int |p_s - p| + \int |p_s - p_i|. \qquad (40)$$
Let $\mathcal{B}$ denote all measurable sets. Then,
$$\int |p_s - p_i| = 2\max_{A \in \{B, C\}} \left| \int_A p_i - \int_A p_s \right| \le 2\sup_{A \in \mathcal{A}} \left| \int_A p_i - \int_A p_s \right|$$
$$\le 2\sup_{A \in \mathcal{A}} \left| \int_A p_i - P_n(A) \right| + 2\sup_{A \in \mathcal{A}} \left| \int_A p_s - P_n(A) \right| \le 4\sup_{A \in \mathcal{A}} \left| \int_A p_s - P_n(A) \right|$$
$$\le 4\sup_{A \in \mathcal{A}} \left| \int_A p_s - \int_A p \right| + 4\sup_{A \in \mathcal{A}} \left| \int_A p - P_n(A) \right| \le 4\sup_{A \in \mathcal{B}} \left| \int_A p_s - \int_A p \right| + 4\Delta = 2\int |p_s - p| + 4\Delta.$$
The result follows from (40).
