
Nonparametric Density Estimation

10716: Advanced Machine Learning


Pradeep Ravikumar (amending notes from Larry
Wasserman)

1 Introduction

Let $X_1, \dots, X_n$ be a sample from a distribution $P$ with density $p$. The goal of nonparametric density estimation is to estimate $p$ with as few assumptions about $p$ as possible. We denote the estimator by $\hat p$. The estimator will typically depend on a tuning parameter $h$, and choosing $h$ carefully is crucial. To emphasize the dependence on $h$ we sometimes write $\hat p_h$.

A very simple non-parametric distribution estimator is simply the empirical distribution:
$$P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i},$$

but this is not very suitable as an estimate of the underlying distribution. It “overfits” to
the training data by placing all probability mass on the given training points {Xi }ni=1 and
zero mass even on very nearby points. It moreover does not have a density. So usually by
nonparametric density estimation, we mean something that does a bit more, in particu-
lar by “smoothing” the empirical distribution Pn . For this reason, nonparametric density
estimation is also often referred to as smoothing.

Example 1 (Bart Simpson) The top left plot in Figure 1 shows the density
$$p(x) = \frac{1}{2}\phi(x; 0, 1) + \frac{1}{10}\sum_{j=0}^{4}\phi\bigl(x; (j/2) - 1, 1/10\bigr) \qquad (1)$$

where φ(x; µ, σ) denotes a Normal density with mean µ and standard deviation σ. Marron
and Wand (1992) call this density “the claw” although we will call it the Bart Simpson
density. Based on 1,000 draws from p, we computed a kernel density estimator, described
later. The estimator depends on a tuning parameter called the bandwidth. The top right plot
is based on a small bandwidth h which leads to undersmoothing. The bottom right plot is
based on a large bandwidth h which leads to oversmoothing. The bottom left plot is based
on a bandwidth h which was chosen to minimize estimated risk. This leads to a much more
reasonable density estimate.
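As a concrete illustration of this example, the following Python sketch (ours, not part of the original notes; only numpy is assumed) draws 1,000 points from the claw density in (1) and evaluates a Gaussian kernel density estimate on a grid for an undersmoothed, a reasonable, and an oversmoothed bandwidth.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_claw(n):
        # With probability 1/2 draw from N(0,1); otherwise pick one of the five
        # narrow components N((j/2)-1, 1/10), j = 0,...,4, each with weight 1/10.
        comp = rng.integers(0, 10, size=n)          # 10 equally likely labels
        return np.where(comp < 5,
                        rng.normal(0.0, 1.0, size=n),
                        rng.normal((comp - 5) / 2.0 - 1.0, 0.1, size=n))

    def kde_gauss(x_grid, data, h):
        # p_hat(x) = (1/n) sum_i (1/h) K((x - X_i)/h) with a Gaussian kernel, d = 1.
        z = (x_grid[:, None] - data[None, :]) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))

    data = sample_claw(1000)
    grid = np.linspace(-3, 3, 200)
    for h in (0.005, 0.05, 0.5):    # undersmoothed, "just right", oversmoothed
        p_hat = kde_gauss(grid, data, h)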

Figure 1: The Bart Simpson density from Example 1. Top left: true density. The other plots
are kernel estimators based on n = 1,000 draws. Bottom left: bandwidth h = 0.05 chosen by
leave-one-out cross-validation. Top right: bandwidth h/10. Bottom right: bandwidth 10h.

2 Applications

Density estimation could be used for sampling new points (see the outpouring of creative, and
perhaps even worrying, uses of such sampling in the context of images and text), and more
generally, for a compact summary of data useful for downstream probabilistic reasoning. It
can also be used in particular for regression, classification, and clustering. Suppose $\hat p(x, y)$ is an estimate of $p(x, y)$.

Regression. We can then compute the following estimate of the regression function:
$$\hat m(x) = \int y\, \hat p(y \mid x)\, dy = \int y\, \frac{\hat p(y, x)}{\hat p(x)}\, dy.$$
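To make the plug-in idea concrete, here is a minimal Python sketch (ours, not from the notes). With a product Gaussian kernel for $\hat p(x, y)$, integrating out $y$ reduces $\hat m(x)$ to a kernel-weighted average of the $Y_i$, i.e. the Nadaraya-Watson estimator.

    import numpy as np

    def kernel_regression(x_query, X, Y, h):
        """Plug-in regression estimate m_hat(x) = int y p_hat(y, x) dy / p_hat(x).

        With a product Gaussian kernel this reduces to a weighted average of the Y_i
        (the Nadaraya-Watson estimator)."""
        W = np.exp(-0.5 * ((x_query[:, None] - X[None, :]) / h) ** 2)
        return (W * Y).sum(axis=1) / W.sum(axis=1)

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=400)
    Y = np.sin(X) + 0.3 * rng.normal(size=400)
    m_hat = kernel_regression(np.linspace(-3, 3, 50), X, Y, h=0.3)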

Classification. For classification, recall the Bayes optimal classifier
$$h(x) = I\bigl(p_1(x)\pi_1 > p_0(x)\pi_0\bigr)$$
where $\pi_1 = \mathbb{P}(Y = 1)$, $\pi_0 = \mathbb{P}(Y = 0)$, $p_1(x) = p(x \mid y = 1)$ and $p_0(x) = p(x \mid y = 0)$. Inserting sample estimates of $\pi_1$ and $\pi_0$, and density estimates for $p_1$ and $p_0$, yields an estimate of the Bayes classifier. Many classifiers that you are familiar with can be re-expressed this way.
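A similarly minimal Python sketch of this plug-in classifier, using one-dimensional Gaussian kernel density estimates for $\hat p_0, \hat p_1$ and sample class frequencies for $\hat\pi_0, \hat\pi_1$ (all function names are ours):

    import numpy as np

    def gauss_kde(x_query, X, h):
        # 1-d Gaussian kernel density estimate evaluated at the query points.
        z = (x_query[:, None] - X[None, :]) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (X.size * h * np.sqrt(2 * np.pi))

    def plug_in_classifier(x_query, X, y, h=0.3):
        """Predict 1 iff p1_hat(x) * pi1_hat > p0_hat(x) * pi0_hat."""
        pi1 = y.mean()
        p1 = gauss_kde(x_query, X[y == 1], h)
        p0 = gauss_kde(x_query, X[y == 0], h)
        return (p1 * pi1 > p0 * (1 - pi1)).astype(int)

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=500)
    X = rng.normal(loc=2.0 * y, scale=1.0)     # class 1 is shifted to the right
    labels = plug_in_classifier(np.linspace(-2, 4, 7), X, y, h=0.4)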

Clustering. For clustering, we look for the high density regions, based on an estimate of
the density. We will discuss more on this when we discuss clustering.

Anomaly/Outlier Detection. Density estimation is sometimes also used to find unusual


observations or outliers. These are observations for which $\hat p(X_i)$ is very small.

Two-Sample Hypothesis Testing. Density estimation can be used for two-sample testing. Given $X_1, \dots, X_n \sim p$ and $Y_1, \dots, Y_m \sim q$ we can test $H_0: p = q$ using $D(\hat p, \hat q)$ as a test statistic, for some divergence $D$.
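As a rough illustration (ours, under the assumption that $D$ is taken to be the integrated squared difference between the two kernel density estimates), the statistic can be calibrated with a permutation test:

    import numpy as np

    def kde(grid, data, h):
        # 1-d Gaussian kernel density estimate evaluated on a grid.
        z = (grid[:, None] - data[None, :]) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))

    def l2_divergence(x, y, grid, h):
        # D(p_hat, q_hat) = integral of (p_hat - q_hat)^2, via the trapezoid rule.
        return np.trapz((kde(grid, x, h) - kde(grid, y, h)) ** 2, grid)

    def permutation_test(x, y, h=0.3, n_perm=500, seed=0):
        rng = np.random.default_rng(seed)
        grid = np.linspace(min(x.min(), y.min()) - 3 * h,
                           max(x.max(), y.max()) + 3 * h, 400)
        observed = l2_divergence(x, y, grid, h)
        pooled = np.concatenate([x, y])
        count = 0
        for _ in range(n_perm):
            perm = rng.permutation(pooled)
            count += l2_divergence(perm[:x.size], perm[x.size:], grid, h) >= observed
        return (1 + count) / (1 + n_perm)    # permutation p-value for H0: p = q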

3 Loss Functions

The most commonly used loss function is the $L_2$ loss
$$\int \bigl(\hat p(x) - p(x)\bigr)^2 dx = \int \hat p^2(x)\, dx - 2\int \hat p(x)p(x)\, dx + \int p^2(x)\, dx.$$
The risk is $R(p, \hat p) = \mathbb{E}(L(p, \hat p))$.

A key advantage of the $L_2$ loss is that the risk has a very mathematically convenient decomposition:
$$R(p, \hat p) = \int \mathbb{E}\bigl(p(x) - \hat p(x)\bigr)^2 dx \qquad (2)$$
$$= \int b_n^2(x)\, dx + \int v_n(x)\, dx \qquad (3)$$
where $b_n(x) = \mathbb{E}(\hat p(x)) - p(x)$ is the bias and $v_n(x) = \mathrm{Var}(\hat p(x))$ is the variance.

The estimator $\hat p$ typically involves "smoothing" the empirical distribution in some way. The
main challenge is to determine how much smoothing to do. When the data are oversmoothed,
the bias term is large and the variance is small. When the data are undersmoothed the
opposite is true. This is called the bias–variance tradeoff. Minimizing risk corresponds to
balancing bias and variance.

Devroye and Györfi (1985) make a strong case for using the $L_1$ norm
$$\|\hat p - p\|_1 \equiv \int |\hat p(x) - p(x)|\, dx$$
as the loss instead of $L_2$. The $L_1$ loss has the following nice interpretation. If $P$ and $Q$ are distributions, define the total variation metric
$$d_{TV}(P, Q) = \sup_A |P(A) - Q(A)|$$
where the supremum is over all measurable sets. Now if $P$ and $Q$ have densities $p$ and $q$ then
$$d_{TV}(P, Q) = \frac{1}{2}\int |p - q| = \frac{1}{2}\|p - q\|_1.$$
Thus, if $\int |p - q| < \delta$ then we know that $|P(A) - Q(A)| < \delta/2$ for all $A$. Also, the $L_1$ norm is transformation invariant. Suppose that $T$ is a one-to-one smooth function. Let $Y = T(X)$. Let $p$ and $q$ be densities for $X$ and let $\tilde p$ and $\tilde q$ be the corresponding densities for $Y$. Then
$$\int |p(x) - q(x)|\, dx = \int |\tilde p(y) - \tilde q(y)|\, dy.$$

Hence the distance is unaffected by transformations. The L1 loss is, in some sense, a much
better loss function than L2 for density estimation. But it is much more difficult to deal
with. For now, we will focus on L2 loss. But we may discuss L1 loss later.
Another loss function is the Kullback-Leibler loss $\int p(x)\log\bigl(p(x)/q(x)\bigr)\, dx$. This is not a good loss function to use for nonparametric density estimation. The reason is that the Kullback-Leibler loss is completely dominated by the tails of the densities, due to the density ratios.

The minimax risk over a class of densities $\mathcal{P}$ is
$$R_n(\mathcal{P}) = \inf_{\hat p}\, \sup_{p \in \mathcal{P}} R(p, \hat p) \qquad (4)$$
and an estimator is minimax if its risk is equal to the minimax risk. We say that $\hat p$ is rate optimal if
$$R(p, \hat p) \asymp R_n(\mathcal{P}). \qquad (5)$$
Typically the minimax rate is of the form $n^{-C/(C+d)}$ for some $C > 0$.

4 Function Spaces

A distinguishing characteristic of “non-parametric” methods is that what we are estimating


is not in a finite-dimensional parametric space. Typically, it is in some infinite-dimensional
function space. We briefly review some classical function spaces.

The class of Lipschitz functions $H(1, L)$ on $\mathcal{X} \subset \mathbb{R}$ is the set of functions $g$ such that
$$|g(y) - g(x)| \le L|x - y| \quad \text{for all } x, y \in \mathcal{X}.$$
A differentiable function is Lipschitz if and only if it has a bounded derivative. Conversely, a Lipschitz function is differentiable almost everywhere.

Let $\mathcal{X} \subset \mathbb{R}$ and let $\beta$ be an integer. The Hölder space $H(\beta, L)$ is the set of functions $g$ mapping $\mathcal{X}$ to $\mathbb{R}$ such that $g$ is $\ell = \beta - 1$ times differentiable and satisfies
$$|g^{(\ell)}(y) - g^{(\ell)}(x)| \le L|x - y| \quad \text{for all } x, y \in \mathcal{X}.$$
A more intuitive perspective is that this class consists of functions whose first $\beta$ derivatives are all bounded.

Yet another perspective is that this is a set of functions that are close to their Taylor series approximation up to order $\beta$. If $g \in H(\beta, L)$ and $\ell = \beta - 1$, then we can define the Taylor approximation of $g$ at $x$ by
$$\tilde g(y) = g(x) + (y - x)g'(x) + \cdots + \frac{(y - x)^{\ell}}{\ell!}\, g^{(\ell)}(x)$$
and then $|g(y) - \tilde g(y)| \le L|y - x|^{\beta}$.

The definition for higher dimensions is similar. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$. Given a vector $s = (s_1, \dots, s_d)$, define
$$D^s = \frac{\partial^{s_1 + \cdots + s_d}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}$$
as the $s$-th partial derivative. We will also use the compact notation $|s| = s_1 + \cdots + s_d$, $s! = s_1! \cdots s_d!$, and $x^s = x_1^{s_1} \cdots x_d^{s_d}$.

Let $\beta$ be a positive integer and let $L > 0$. The Hölder class is then defined as:
$$H(\beta, L) = \Bigl\{ p : |D^s p(x) - D^s p(y)| \le L\|x - y\|, \ \text{for all } s \text{ such that } |s| = \beta - 1, \text{ and all } x, y \Bigr\}. \qquad (6)$$
For example, if $d = 1$ and $\beta = 2$ (which is the most common setting) this means that
$$|p'(x) - p'(y)| \le L|x - y| \quad \text{for all } x, y.$$
As before, we could also view this class as functions with bounded $D^s$ partial derivatives for $|s| \le \beta$. For instance, with $\beta = 2$, the class consists of functions that have bounded second derivatives.

And as before, this function class comprises functions that are close to their Taylor series approximation up to order $\beta$. Let
$$p_{x,\beta}(u) = \sum_{|s| < \beta} \frac{(u - x)^s}{s!}\, D^s p(x). \qquad (7)$$
Then, if $p \in H(\beta, L)$, we can show that $p$ is close to this Taylor approximation:
$$|p(u) - p_{x,\beta}(u)| \le L\|u - x\|^{\beta}. \qquad (8)$$
In the common case of $\beta = 2$, this means that
$$\bigl| p(u) - [p(x) + (u - x)^T \nabla p(x)] \bigr| \le L\|x - u\|^2.$$

4.1 Categories of Nonparametric Density Estimators

We will discuss two broad categories of nonparametric density estimators: (a) those based on hard partitioning of the input space, viz. histograms (technically not density estimators), and soft partitioning of the input space, viz. kernel density estimators; and (b) those based on projection onto an infinite-dimensional function space, where we will look at a particular instance called series estimators.

5 Histograms

Perhaps the simplest nonparametric distribution estimators, after the empirical distribution,
are histograms. The high level idea is to discretize the data, and then simply use the MLE

of the resulting categorical distribution (which is simply the frequencies of each category in
the data).

For convenience, assume that the data X1 , . . . , Xn are contained in the unit cube X = [0, 1]d
(although this assumption is not crucial). Divide X into bins, or sub-cubes, of size h. We
discuss methods for choosing $h$ later. There are $N = (1/h)^d$ such bins and each has volume $h^d$. Denote the bins by $B_1, \dots, B_N$. Now we can write the true density as
$$p(x) = \sum_{j=1}^{N} P(X \in B_j)\, p(x \mid X \in B_j).$$
We can estimate $P(X \in B_j)$ via
$$\hat\theta_j = \frac{1}{n}\sum_{i=1}^{n} I(X_i \in B_j),$$
the fraction of data points in bin $B_j$, and we can approximate $p(x \mid X \in B_j)$ by the density of the uniform distribution over the bin $B_j$, so that $p(x \mid X \in B_j) \approx \frac{1}{h^d}\, I(x \in B_j)$. Plugging these two values in, we get the histogram density estimator:
$$\hat p_h(x) = \sum_{j=1}^{N} \frac{\hat\theta_j}{h^d}\, I(x \in B_j). \qquad (9)$$
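A minimal Python sketch of the estimator (9), assuming the data have already been rescaled to lie in $[0,1]^d$ (the function names are ours):

    import numpy as np

    def histogram_density(X, h):
        """Histogram density estimator on [0,1]^d with cubic bins of side h."""
        n, d = X.shape
        n_bins = int(np.ceil(1.0 / h))
        # Map each point to its bin index (a d-tuple), clipping the boundary 1.0.
        idx = np.minimum((X / h).astype(int), n_bins - 1)
        counts = np.zeros((n_bins,) * d)
        np.add.at(counts, tuple(idx.T), 1)
        theta_hat = counts / n                       # fraction of points per bin
        return lambda x: theta_hat[tuple(np.minimum((np.asarray(x) / h).astype(int),
                                                    n_bins - 1))] / h**d

    X = np.random.default_rng(1).uniform(size=(500, 2))
    p_hat = histogram_density(X, h=0.1)
    p_hat([0.25, 0.75])    # estimated density at a single point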

5.1 Statistical Analysis: Histograms

Suppose that $p \in \mathcal{P}(L) := H(1, L)$ where
$$H(1, L) = \Bigl\{ p : |p(x) - p(y)| \le L\|x - y\|, \ \text{for all } x, y \Bigr\}. \qquad (10)$$

Theorem 2 The $L_2$ risk of the histogram estimator is bounded by
$$\sup_{p \in H(1,L)} R(p, \hat p) = \sup_{p \in H(1,L)} \int \mathbb{E}\bigl(\hat p_h(x) - p(x)\bigr)^2 dx \le L^2 h^2 d + \frac{C}{nh^d}. \qquad (11)$$

The upper bound is minimized by choosing $h = \left(\frac{C}{L^2 n d}\right)^{\frac{1}{d+2}}$. (Later, we shall see a more practical way to choose $h$.) With this choice,
$$\sup_{p \in H(1,L)} R(p, \hat p) \le C_0 \left(\frac{1}{n}\right)^{\frac{2}{d+2}}$$
where $C_0 = L^2 d\, (C/(L^2 d))^{2/(d+2)}$.

The rate of convergence $n^{-2\beta/(2\beta+d)}$ is slow when the dimension $d$ is large. The typical rate of convergence for parametric models is $d/\sqrt{n}$. To see the difference between these two rates: to get to $\epsilon$ error with the nonparametric rate, we would require a number of samples satisfying $n^{-2\beta/(2\beta+d)} \le \epsilon$, i.e. $n \ge (1/\epsilon)^{d/(2\beta)+1}$, which scales exponentially with the dimension $d$. On the other hand, for the parametric rate, $d/\sqrt{n} \le \epsilon$ only requires that $n \ge (d/\epsilon)^2$, which only scales polynomially with the dimension.

This upper bound can also be shown to be tight. Specifically:

Theorem 3 There exists a constant $C > 0$ such that
$$\inf_{\hat p}\, \sup_{p \in H(1,L)} \mathbb{E}\int \bigl(\hat p(x) - p(x)\bigr)^2 dx \ge C \left(\frac{1}{n}\right)^{\frac{2}{d+2}}. \qquad (12)$$

The above result showed that the histogram estimator is close (with respect to the $L_2$ loss) to the true density in expectation. A more powerful result would be to show that it is close with high probability. This entails analyzing
$$\sup_{P \in \mathcal{P}} P^n\bigl(\|\hat p_h - p\|_\infty > \epsilon\bigr)$$
where $\|f\|_\infty = \sup_x |f(x)|$.

Theorem 4 With probability at least $1 - \delta$,
$$\|\hat p_h - p\|_\infty \le \sqrt{\frac{1}{cnh^d}\log\left(\frac{2}{\delta h^d}\right)} + L\sqrt{d}\, h. \qquad (13)$$

Choosing $h = (c_2/n)^{1/(2+d)}$ we conclude that, with probability at least $1 - \delta$,
$$\|\hat p_h - p\|_\infty \le \sqrt{c^{-1} n^{-\frac{2}{2+d}}\left(\log\frac{2}{\delta} + \frac{2}{2+d}\log n\right)} + L\sqrt{d}\, n^{-\frac{1}{2+d}} = O\!\left(\left(\frac{\log n}{n}\right)^{\frac{1}{2+d}}\right). \qquad (14)$$

5.2 Adaptive histograms: Density Trees

Instead of uniformly partitioning the input domain, one can adaptively partition it. Ram
and Gray (2011) suggest a recursive partitioning scheme similar to decision trees. They
split each coordinate dyadically, in a greedy fashion. The density estimator is taken to
be piecewise constant. They use an $L_2$ risk estimator to decide when to split. The idea seems to have been re-discovered in Yang and Wong (arXiv:1404.1425) and Liu and Wong (arXiv:1401.2597). Density trees seem very promising.

6 Kernel Density Estimation
A one-dimensional smoothing kernel is any smooth function $K$ such that $\int K(x)\, dx = 1$, $\int x K(x)\, dx = 0$ and $\sigma_K^2 \equiv \int x^2 K(x)\, dx > 0$. Smoothing kernels should not be confused with Mercer kernels, which we discuss later. Some commonly used kernels are the following:

Boxcar: $K(x) = \frac{1}{2} I(x)$        Gaussian: $K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Epanechnikov: $K(x) = \frac{3}{4}(1 - x^2) I(x)$        Tricube: $K(x) = \frac{70}{81}(1 - |x|^3)^3 I(x)$

where $I(x) = 1$ if $|x| \le 1$ and $I(x) = 0$ otherwise. These kernels are plotted in Figure 2. Two commonly used multivariate kernels are $\prod_{j=1}^{d} K(x_j)$ and $K(\|x\|)$. For presentational simplicity, we will overload notation for both the multivariate and univariate kernels, and if not specified, for a vector $x$ we will use $K(x) = K(\|x\|)$.

Figure 2: Examples of smoothing kernels: boxcar (top left), Gaussian (top right), Epanech-
nikov (bottom left), and tricube (bottom right).

Suppose that $X \in \mathbb{R}^d$. Given a kernel $K$ and a positive number $h$, called the bandwidth, the kernel density estimator is defined to be
$$\hat p(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d}\, K\!\left(\frac{x - X_i}{h}\right). \qquad (15)$$
More generally, we define
$$\hat p_H(x) = \frac{1}{n}\sum_{i=1}^{n} K_H(x - X_i)$$

Figure 3: A kernel density estimator $\hat p$. At each point $x$, $\hat p(x)$ is the average of the kernels centered over the data points $X_i$. The data points are indicated by short vertical bars. The kernels are not drawn to scale.

where $H$ is a positive definite bandwidth matrix and $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$. For simplicity, we will take $H = h^2 I$, which recovers the previous formula.

Sometimes we write the estimator as $\hat p_h$ to emphasize the dependence on $h$. In the multivariate case the coordinates of $X_i$ should be standardized so that each has the same variance, since the norm $\|x - X_i\|$ treats all coordinates as if they are on the same scale.

The kernel estimator places a smoothed out lump of mass of size 1/n over each data point
Xi ; see Figure 3. The choice of kernel K is not crucial, but the choice of bandwidth h
is important. Small bandwidths give very rough estimates while larger bandwidths give
smoother estimates.
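Here is a short Python sketch implementing (15) with the spherical Gaussian kernel $K(x) = K(\|x\|)$ (a rough illustration; as noted above, in practice the coordinates should first be standardized):

    import numpy as np

    def kde(x_query, X, h):
        """Kernel density estimate (15) with a spherical Gaussian kernel.

        x_query: (m, d) query points, X: (n, d) data, h: bandwidth."""
        n, d = X.shape
        # Squared distances ||x - X_i||^2 between every query point and data point.
        sq = ((x_query[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-0.5 * sq / h**2) / (2 * np.pi * h**2) ** (d / 2)
        return K.mean(axis=1)    # average of the n kernels, each integrating to 1

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    kde(np.zeros((1, 2)), X, h=0.4)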

6.1 Statistical Analysis: Kernel Estimators

In this section we examine the performance of kernel density estimation. We will first need
a few definitions.

Assume that Xi ∈ X ⊂ Rd where X is compact.

Conditions on Kernel Function. In order for the kernel density estimate to be able to
estimate well a smooth function in H(β, L) for β > 2, we need a “higher order kernel”.


Figure 4: A higher-order kernel function: specifically, a kernel of order 4

Assume now that the kernel $K$ has the form $K(x) = k(\|x\|)$ for some univariate kernel $k$ that has support on $[-1, 1]$. A univariate kernel is said to have order $\beta$ provided that: $\int k = 1$, $\int |k|^q < \infty$ for any $q \ge 1$, $\int |t|^{\beta} |k(t)|\, dt < \infty$ and $\int t^s k(t)\, dt = 0$ for $1 \le s < \beta$. An example of a kernel that satisfies these conditions for $\beta = 2$ is $k(x) = (3/4)(1 - x^2)$ for $|x| \le 1$. Constructing a kernel that satisfies $\int t^s k(t)\, dt = 0$ for $\beta > 2$ requires using kernels that can take negative values, which is why such "higher order kernels" for $\beta > 2$ are not that popular. For example, a 4th-order kernel is $k(t) = \frac{3}{8}(3 - 5t^2)\,\mathbf{1}\{|t| \le 1\}$, plotted in Figure 4. Notice that it takes negative values.

Let $p_h(x) = \mathbb{E}[\hat p_h(x)]$. The next lemma provides a bound on the bias $p_h(x) - p(x)$.

Lemma 5 The bias of $\hat p_h$ satisfies
$$\sup_{p \in H(\beta,L)} |p_h(x) - p(x)| \le c h^{\beta} \qquad (16)$$
for some constant $c$.

Next we bound the variance.

Lemma 6 The variance of $\hat p_h$ satisfies
$$\sup_{p \in H(\beta,L)} \mathrm{Var}(\hat p_h(x)) \le \frac{c}{nh^d} \qquad (17)$$
for some constant $c > 0$.

Since the mean squared error is equal to the variance plus the bias squared, together the
previous two lemmas yield:

Theorem 7 The $L_2$ risk is bounded above, uniformly over $H(\beta, L)$, as
$$\sup_{p \in H(\beta,L)} \mathbb{E}\int \bigl(\hat p_h(x) - p(x)\bigr)^2 dx \lesssim h^{2\beta} + \frac{1}{nh^d}. \qquad (18)$$
If $h \asymp n^{-1/(2\beta+d)}$ then
$$\sup_{p \in H(\beta,L)} \mathbb{E}\int \bigl(\hat p_h(x) - p(x)\bigr)^2 dx \lesssim \left(\frac{1}{n}\right)^{\frac{2\beta}{2\beta+d}}. \qquad (19)$$

When $\beta = 2$ and $h \asymp n^{-1/(4+d)}$ we get the rate $n^{-4/(4+d)}$.

6.2 Minimax Lower Bound

According to the next theorem, there does not exist an estimator that converges faster than
O(n−2β/(2β+d) ). We state the result for integrated L2 loss although similar results hold for
other loss functions and other function spaces. We will prove this later in the course.

Theorem 8 There exists $C$ depending only on $\beta$ and $L$ such that
$$\inf_{\hat p}\, \sup_{p \in H(\beta,L)} \mathbb{E}_p \int \bigl(\hat p(x) - p(x)\bigr)^2 dx \ge C \left(\frac{1}{n}\right)^{\frac{2\beta}{2\beta+d}}. \qquad (20)$$

Theorem 8 together with (19) imply that kernel estimators are rate minimax.

Concentration Analysis of Kernel Density Estimators. Now we state a result which says how fast $\hat p(x)$ concentrates around $p(x)$.

Theorem 9 For all small $\epsilon > 0$,
$$\mathbb{P}\bigl(|\hat p(x) - p_h(x)| > \epsilon\bigr) \le 2\exp\bigl(-cnh^d \epsilon^2\bigr). \qquad (21)$$
Hence, for any $\delta > 0$,
$$\sup_{p \in H(\beta,L)} \mathbb{P}\left(|\hat p(x) - p(x)| > \sqrt{\frac{C\log(2/\delta)}{nh^d}} + ch^{\beta}\right) < \delta \qquad (22)$$
for some constants $C$ and $c$. If $h \asymp n^{-1/(2\beta+d)}$ then
$$\sup_{p \in H(\beta,L)} \mathbb{P}\left(|\hat p(x) - p(x)|^2 > \frac{c}{n^{2\beta/(2\beta+d)}}\right) < \delta.$$

The first statement follows from an application of Bernstein's inequality, while the last statement follows from bias-variance calculations followed by Markov's inequality.

Concentration in $L_\infty$. While Theorem 9 shows that, for each $x$, $\hat p(x)$ is close to $p(x)$ with high probability, it would be nice to have a version of this result that holds uniformly over all $x$. That is, we want a concentration result for
$$\|\hat p - p\|_\infty = \sup_x |\hat p(x) - p(x)|.$$
We can write
$$\|\hat p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + \|p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + ch^{\beta}.$$

We can bound the first term using something called bracketing together with Bernstein's theorem to prove that
$$\mathbb{P}\bigl(\|\hat p_h - p_h\|_\infty > \epsilon\bigr) \le 4\left(\frac{C}{h^{d+1}\epsilon}\right)^d \exp\left(-\frac{3n\epsilon^2 h^d}{28K(0)}\right). \qquad (23)$$

A more sophisticated analysis in Giné and Guillou (2002) (which replaces Bernstein's inequality in the previous proof with a more refined inequality due to Talagrand) yields the following:

Theorem 10 Suppose that $p \in H(\beta, L)$. Fix any $\delta > 0$. Then
$$\mathbb{P}\left(\sup_x |\hat p(x) - p(x)| > \sqrt{\frac{C\log n}{nh^d}} + ch^{\beta}\right) < \delta$$
for some constants $C$ and $c$, where $C$ depends on $\delta$. Choosing $h \asymp (\log n/n)^{1/(2\beta+d)}$ we have
$$\mathbb{P}\left(\sup_x |\hat p(x) - p(x)|^2 > \frac{C\log n}{n^{2\beta/(2\beta+d)}}\right) < \delta.$$

6.3 Boundary Bias

One caveat with the kernel density estimator is its behavior near the boundary of the sample space. If $x$ is within $O(h)$ of the boundary, then the bias is $O(h)$ instead of $O(h^2)$. The main reason is that the estimator averages over nearby points, and points near the boundary have more neighbors in directions leading away from the boundary than in directions towards the boundary. We will discuss more about this when we cover nonparametric regression.

There are a variety of fixes including: data reflection, transformations, boundary kernels,
local likelihood. These are not as popular as simple kernel density estimation however.

6.4 Asymptotic Expansions

In this section we consider some asymptotic expansions that describe the behavior of the
kernel estimator. We focus on the case d = 1.

Theorem 11 Let $R_x = \mathbb{E}(p(x) - \hat p(x))^2$ and let $R = \int R_x\, dx$. Assume that $p''$ is absolutely continuous and that $\int p'''(x)^2\, dx < \infty$. Then,
$$R_x = \frac{1}{4}\sigma_K^4 h_n^4\, p''(x)^2 + \frac{p(x)\int K^2(x)\, dx}{nh_n} + O\!\left(\frac{1}{n}\right) + O(h_n^6)$$
and
$$R = \frac{1}{4}\sigma_K^4 h_n^4 \int p''(x)^2\, dx + \frac{\int K^2(x)\, dx}{nh_n} + O\!\left(\frac{1}{n}\right) + O(h_n^6) \qquad (24)$$
where $\sigma_K^2 = \int x^2 K(x)\, dx$.

If we differentiate (24) with respect to $h$ and set it equal to 0, we see that the asymptotically optimal bandwidth is
$$h_* = \left(\frac{c_2}{c_1^2 A(f) n}\right)^{1/5} \qquad (25)$$
where $c_1 = \int x^2 K(x)\, dx$, $c_2 = \int K(x)^2\, dx$ and $A(f) = \int f''(x)^2\, dx$. This is informative because it tells us that the best bandwidth decreases at rate $n^{-1/5}$. Plugging $h_*$ into (24), we see that if the optimal bandwidth is used then $R = O(n^{-4/5})$.
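For illustration only: if we plug a $N(0, \sigma^2)$ reference density into (25) (an assumption of ours, not something the notes do), then $A(f) = 3/(8\sqrt{\pi}\sigma^5)$, and with the Gaussian kernel ($c_1 = 1$, $c_2 = 1/(2\sqrt{\pi})$) the formula reduces to $h_* = (4/3)^{1/5}\sigma n^{-1/5} \approx 1.06\,\sigma n^{-1/5}$, the familiar normal reference rule:

    import numpy as np

    def normal_reference_bandwidth(X):
        # h* = (c2 / (c1^2 A(f) n))^(1/5) with a N(0, sigma^2) reference density
        # and the Gaussian kernel; this simplifies to (4/3)^(1/5) * sigma * n^(-1/5).
        n = X.size
        sigma = X.std(ddof=1)
        return (4.0 / 3.0) ** 0.2 * sigma * n ** (-0.2)

    X = np.random.default_rng(0).normal(size=1000)
    h_star = normal_reference_bandwidth(X)    # roughly 1.06 * sigma * n^(-1/5)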

7 Picking Bandwidths of Kernel Estimators

In practice we need a data-based method for choosing the bandwidth h. To do this, we will
need to estimate the risk of the estimator and minimize the estimated risk over h.

7.1 Leave One Out Cross-Validation

A common method for estimating risk is leave-one-out cross-validation. Recall that the loss function is
$$\int \bigl(\hat p_h(x) - p(x)\bigr)^2 dx = \int \hat p_h^2(x)\, dx - 2\int \hat p_h(x)p(x)\, dx + \int p^2(x)\, dx.$$
The last term does not involve $\hat p$ so we can drop it. Thus, we now define the loss to be
$$L(h) = \int \hat p_h^2(x)\, dx - 2\int \hat p_h(x)p(x)\, dx.$$

The risk is R(h) = E(L(h)).

Definition 12 The leave-one-out cross-validation estimator of risk is
$$\hat R(h) = \int \bigl(\hat p_h(x)\bigr)^2 dx - \frac{2}{n}\sum_{i=1}^{n} \hat p_{h,(-i)}(X_i) \qquad (26)$$
where $\hat p_{h,(-i)}$ is the density estimator obtained after removing the $i$th observation.

It is easy to check that $\mathbb{E}[\hat R(h)] = R(h)$.
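A direct (O(n^2) per bandwidth) Python sketch of the criterion (26) for the one-dimensional Gaussian-kernel estimator; the function names are ours. For the Gaussian kernel the first integral $\int \hat p_h^2$ has a closed form, since the product of two Gaussian kernels integrates to a Gaussian density with variance $2h^2$.

    import numpy as np

    def gauss(d, h):
        # N(0, h^2) density evaluated at the pairwise differences d.
        return np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2 * np.pi))

    def loo_cv_risk(X, h):
        """Leave-one-out risk estimate (26) for a 1-d Gaussian-kernel estimator."""
        n = X.size
        D = X[:, None] - X[None, :]
        # First term: integral of p_hat^2; the product of two Gaussian kernels
        # integrates to a Gaussian with variance 2 h^2.
        term1 = gauss(D, np.sqrt(2) * h).sum() / n**2
        # Second term: average of the leave-one-out estimators at the held-out points.
        K = gauss(D, h)
        np.fill_diagonal(K, 0.0)
        term2 = 2.0 * K.sum() / (n * (n - 1))
        return term1 - term2

    X = np.random.default_rng(0).normal(size=300)
    grid = np.linspace(0.05, 1.0, 40)
    h_cv = grid[np.argmin([loo_cv_risk(X, h) for h in grid])]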

A further justification for cross-validation is given by the following theorem due to Stone
(1984).

Theorem 13 (Stone’s theorem) Suppose that p is bounded. Let pbh denote the kernel
estimator with bandwidth h and let b
h denote the bandwidth chosen by cross-validation. Then,
R 2
p(x) − pbbh (x) dx a.s.
→ 1. (27)
inf h (p(x) − pbh (x))2 dx
R

The bandwidth for the density estimator in the bottom left panel of Figure 1 is based on
cross-validation. In this case it worked well but of course there are lots of examples where
there are problems. Do not assume that, if the estimator $\hat p$ is wiggly, then cross-validation has let you down. The eye is not a good judge of risk.

There are cases when cross-validation can seriously break down. In particular, if there are
ties in the data then cross-validation chooses a bandwidth of 0.

7.2 V -fold Cross-Validation

An alternative to leave-one-out is $V$-fold cross-validation. A common choice is $V = 10$. For simplicity, let us consider here just splitting the data into two halves. This version of cross-validation comes with stronger theoretical guarantees. Let $\hat p_h$ denote the kernel estimator based on bandwidth $h$. For simplicity, assume the sample size is even and denote the sample size by $2n$. Randomly split the data $X = (X_1, \dots, X_{2n})$ into two sets of size $n$. Denote these by $Y = (Y_1, \dots, Y_n)$ and $Z = (Z_1, \dots, Z_n)$.$^1$ Let $\bar{H} = \{h_1, \dots, h_N\}$ be a finite grid of bandwidths. For $j \in [N]$, denote
$$\hat p_j(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_j^d}\, K\!\left(\frac{x - Y_i}{h_j}\right).$$

Thus we have a set $\mathcal{P} = \{\hat p_1, \dots, \hat p_N\}$ of density estimators.


The loss of $\hat p_j$ is given as $L(p, \hat p_j) = \int \hat p_j^2(x)\, dx - 2\int \hat p_j(x)p(x)\, dx$. Define the estimated risk
$$\hat L_j \equiv \hat L(p, \hat p_j) = \int \hat p_j^2(x)\, dx - \frac{2}{n}\sum_{i=1}^{n} \hat p_j(Z_i). \qquad (28)$$
Let $\hat p = \mathrm{argmin}_{j \in [N]} \hat L(p, \hat p_j)$. Schematically:
$$X = (X_1, \dots, X_{2n}) \ \overset{\text{split}}{\Longrightarrow} \ \begin{cases} Y \to \{\hat p_1, \dots, \hat p_N\} = \mathcal{P} \\ Z \to \{\hat L_1, \dots, \hat L_N\} \end{cases}$$

Theorem 14 (Wegkamp 1999) There exists a $C > 0$ such that
$$\mathbb{E}\bigl(\|\hat p - p\|^2\bigr) \le 2\min_{j \in [N]} \mathbb{E}\bigl(\|\hat p_j - p\|^2\bigr) + \frac{C\log N}{n}.$$

A similar result can be proved for $V$-fold cross-validation.

$^1$ It is not necessary to split the data into two sets of equal size. We use the equal split version for simplicity.

7.3 Example

Figure 5 shows a synthetic two-dimensional data set, the cross-validation function, and two kernel density estimators. The data are 100 points generated as follows: we select a point randomly on the unit circle and then add Normal noise with standard deviation 0.1.

Figure 5: Synthetic two-dimensional data set. Top left: data. Top right: cross-validation function. Bottom left: kernel estimator based on the bandwidth that minimizes the cross-validation score. Bottom right: kernel estimator based on twice the bandwidth that minimizes the cross-validation score.

The first estimator (lower left) uses the bandwidth that minimizes the leave-one-out cross-validation score. The second uses twice that bandwidth. The cross-validation curve is very sharply peaked with a clear minimum. The resulting density estimate is somewhat lumpy. This is because cross-validation is aiming to minimize $L_2$ error, which does not guarantee that the estimate is smooth. Also, the dataset is small, so this effect is more noticeable. The estimator with the larger bandwidth is noticeably smoother. However, the lumpiness of the estimator is not necessarily a bad thing.

7.4 Picking Bandwidths to optimize L1 instead of L2 Risk

Here we discuss another approach to choosing $h$, aimed at the $L_1$ loss. Recall that the $L_1$ loss between some density $g$ and the true distribution $P$ (with density $p$) is given as
$$\int |g(x) - p(x)|\, dx = 2\sup_A \left| \int_A g(x)\, dx - P(A) \right|.$$
The idea is to restrict to a class of sets $\mathcal{A}$ (which we call test sets) and choose $h$ to make $\int_A \hat p_h(x)\, dx$ close to $P(A)$ for all $A \in \mathcal{A}$. That is, we would like to minimize
$$\Delta(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(x)\, dx - P(A) \right|. \qquad (29)$$
Note that this is only an approximation to the $L_1$ risk, which takes the supremum over all measurable sets rather than a restricted class, so the test sets have to be chosen carefully. We will next discuss two approaches to specifying these test classes.

7.4.1 VC Classes

Let $\mathcal{A}$ be a class of sets with VC dimension $\nu$. As in Section 7.2, split the data $X$ into $Y$ and $Z$, with $\mathcal{P} = \{\hat p_1, \dots, \hat p_N\}$ constructed from $Y$. For $g \in \mathcal{P}$ define
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(x)\, dx - P_n(A) \right|$$
where $P_n(A) = n^{-1}\sum_{i=1}^{n} I(Z_i \in A)$. Let $\hat p = \mathrm{argmin}_{j \in [N]} \Delta_n(\hat p_j)$.

Theorem 15 For any $\delta > 0$ there exists $c$ such that
$$\mathbb{P}\left( \Delta(\hat p) > \min_j \Delta(\hat p_j) + 2c\sqrt{\frac{\nu}{n}} \right) < \delta.$$

The difficulty in implementing this idea is computing and minimizing $\Delta_n(g)$. Hjort and Walker (2001) presented a similar method which can be practically implemented when $d = 1$. Another caveat is that $\Delta(g)$ is only an approximation of the $L_1$ loss, whose quality depends on the richness of the class of sets $\mathcal{A}$. Is there a class of sets $\mathcal{A}$ small enough to be tractable, yet rich enough that minimizing $\Delta$ is effectively the same as minimizing the $L_1$ loss?

7.4.2 Yatracos Classes

Devroye and Györfi (2001) use such a class of sets, called a Yatracos class, which leads to estimators with some remarkable properties. Let $\mathcal{P} = \{p_1, \dots, p_N\}$ be a set of densities and define the Yatracos class of sets $\mathcal{A} = \{A(i, j) : i \ne j\}$ where $A(i, j) = \{x : p_i(x) > p_j(x)\}$. Let
$$\hat p = \mathrm{argmin}_{j \in [N]} \Delta_n(p_j),$$
where
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(u)\, du - P_n(A) \right|$$
and $P_n(A) = n^{-1}\sum_{i=1}^{n} I(Z_i \in A)$ is the empirical measure based on a sample $Z_1, \dots, Z_n \sim p$.

Theorem 16 The estimator $\hat p$ satisfies
$$\int |\hat p - p| \le 3\min_j \int |p_j - p| + 4\Delta \qquad (30)$$
where $\Delta = \sup_{A \in \mathcal{A}} \left| \int_A p - P_n(A) \right|$.

The term $\min_j \int |p_j - p|$ is like a bias, while the term $\Delta$ is like the variance.

Now we apply this to kernel estimators. Again we split the data $X$ into two halves $Y = (Y_1, \dots, Y_n)$ and $Z = (Z_1, \dots, Z_n)$. For each $h$ let
$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d}\, K\!\left(\frac{\|x - Y_i\|}{h}\right).$$
Let
$$\mathcal{A} = \bigl\{ A(h, \nu) : h, \nu > 0,\ h \ne \nu \bigr\}$$
where $A(h, \nu) = \{x : \hat p_h(x) > \hat p_\nu(x)\}$. Define
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(u)\, du - P_n(A) \right|$$
where $P_n(A) = n^{-1}\sum_{i=1}^{n} I(Z_i \in A)$ is the empirical measure based on $Z$. Let
$$\hat p = \mathrm{argmin}_{h} \Delta_n(\hat p_h).$$

Under some regularity conditions on the kernel, we have the following result.

Theorem 17 (Devroye and Györfi, 2001) The risk of $\hat p$ satisfies
$$\mathbb{E}\int |\hat p - p| \le c_1 \inf_h \mathbb{E}\int |\hat p_h - p| + c_2 \sqrt{\frac{\log n}{n}}. \qquad (31)$$

The proof involves showing that the terms on the right hand side of (30) are small. We refer
the reader to Devroye and Györfi (2001) for the details.

Finding computationally efficient methods to implement this approach remains an open


question.

8 Series Methods

We have emphasized kernel density estimation. There are many other density estimation methods. Let us briefly mention a method based on basis functions. For simplicity, suppose that $X_i \in [0, 1]$ and let $\phi_1, \phi_2, \dots$ be an orthonormal basis for
$$\mathcal{F} = \Bigl\{ f : [0, 1] \to \mathbb{R},\ \int_0^1 f^2(x)\, dx < \infty \Bigr\}.$$
Thus
$$\int \phi_j^2(x)\, dx = 1, \qquad \int \phi_j(x)\phi_k(x)\, dx = 0 \ \text{ for } j \ne k.$$
An example is the cosine basis:
$$\phi_0(x) = 1, \qquad \phi_j(x) = \sqrt{2}\cos(2\pi j x), \quad j = 1, 2, \dots$$

If $p \in \mathcal{F}$ then
$$p(x) = \sum_{j} \beta_j \phi_j(x)$$
where $\beta_j = \int_0^1 p(x)\phi_j(x)\, dx$. An estimate of $p$ is $\hat p(x) = \sum_{j=1}^{k} \hat\beta_j \phi_j(x)$ where
$$\hat\beta_j = \frac{1}{n}\sum_{i=1}^{n} \phi_j(X_i).$$
The number of terms k is the smoothing parameter and can be chosen using cross-validation.
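A short Python sketch of the cosine-series estimator on $[0,1]$ (ours, following the formulas above; the number of terms $k$ is left fixed here rather than chosen by cross-validation):

    import numpy as np

    def series_estimator(X, k):
        """Cosine-series density estimate on [0,1] with k basis terms."""
        j = np.arange(1, k + 1)
        beta_hat = np.sqrt(2) * np.cos(2 * np.pi * np.outer(X, j)).mean(axis=0)
        def p_hat(x):
            phi = np.sqrt(2) * np.cos(2 * np.pi * np.outer(np.atleast_1d(x), j))
            return 1.0 + phi @ beta_hat     # phi_0 = 1 contributes beta_0 = 1
        return p_hat

    X = np.random.default_rng(0).beta(2, 2, size=500)   # some density on [0,1]
    p_hat = series_estimator(X, k=10)
    p_hat(np.linspace(0, 1, 5))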

It can be shown that
$$R = \mathbb{E}\left[\int \bigl(\hat p(x) - p(x)\bigr)^2 dx\right] = \sum_{j=1}^{k} \mathrm{Var}(\hat\beta_j) + \sum_{j=k+1}^{\infty} \beta_j^2.$$
The first term is of order $O(k/n)$. To bound the second term (the bias) one usually assumes that $p$ lies in a Sobolev space of order $q$, which means that $p \in \mathcal{P}$ with
$$\mathcal{P} = \Bigl\{ p \in \mathcal{F} : p = \sum_j \beta_j \phi_j,\ \sum_{j=1}^{\infty} \beta_j^2 j^{2q} < \infty \Bigr\}.$$

In that case it can be shown that
$$R \approx \frac{k}{n} + \left(\frac{1}{k}\right)^{2q}.$$
The optimal $k$ is $k \approx n^{1/(2q+1)}$, with risk
$$R = O\!\left(\left(\frac{1}{n}\right)^{\frac{2q}{2q+1}}\right).$$

9 Miscellanea

9.1 High Dimensions, Curse of Dimensionality

As discussed earlier, the non-parametric rate of convergence n−C/(C+d) is slow when the
dimension d is large. In this case it is hopeless to try to estimate the true density p precisely
in the L2 norm (or any similar norm). We need to change our notion of what it means to
estimate p in a high-dimensional problem. Instead of estimating p precisely we have to settle
for finding an adequate approximation of p. Any estimator that finds the regions where p
puts large amounts of mass should be considered an adequate approximation. Let us consider
a few ways to implement this type of thinking.

9.2 Biased Density Estimation

Let $p_h(x) = \mathbb{E}(\hat p_h(x))$. Then
$$p_h(x) = \int \frac{1}{h^d}\, K\!\left(\frac{\|x - u\|}{h}\right) p(u)\, du,$$
so that the mean of $\hat p_h$ can be thought of as a smoothed version of $p$. Let $P_h(A) = \int_A p_h(u)\, du$ be the probability distribution corresponding to $p_h$. Then
$$P_h = P \star K_h$$
where $\star$ denotes convolution$^2$ and $K_h$ is the distribution with density $h^{-d} K(\|u\|/h)$. In other words, if $X \sim P_h$ then $X = Y + Z$ where $Y \sim P$ and $Z \sim K_h$. This is just another way to say that $P_h$ is a blurred or smoothed version of $P$. $p_h$ need not be close in $L_2$ to $p$ but still could preserve most of the important shape information about $p$. Consider then choosing a fixed $h > 0$ and estimating $p_h$ instead of $p$. This corresponds to ignoring the bias in the density estimator. We can then show:

Theorem 18 Let $h > 0$ be fixed. Then $\mathbb{P}(\|\hat p_h - p_h\|_\infty > \epsilon) \le Ce^{-nc\epsilon^2}$. Hence,
$$\|\hat p_h - p_h\|_\infty = O_P\!\left(\sqrt{\frac{\log n}{n}}\right).$$

The rate of convergence is fast and is independent of dimension. How to choose h is not
clear.
$^2$ If $X \sim P$ and $Y \sim Q$ are independent, then the distribution of $X + Y$ is denoted by $P \star Q$ and is called the convolution of $P$ and $Q$.

9.3 Graphical Models/Conditional Independence based methods

If we can live with some bias, we can reduce the dimensionality by imposing some (con-
ditional) independence assumptions. The simplest example is to treat the components
$(X_1, \dots, X_d)$ as if they are independent. In that case
$$p(x_1, \dots, x_d) = \prod_{j=1}^{d} p_j(x_j)$$

and the problem is reduced to a set of one-dimensional density estimation problems.

An extension is to use a forest. We represent the distribution with an undirected graph; a graph with no cycles is a forest. Let $E$ be the edges of the graph. Any density consistent with the forest can be written as
$$p(x) = \prod_{j=1}^{d} p_j(x_j) \prod_{(j,k) \in E} \frac{p_{j,k}(x_j, x_k)}{p_j(x_j)\, p_k(x_k)}.$$
Estimating the density therefore only requires estimating one- and two-dimensional marginals. But how do we find the edge set $E$? Some methods are discussed in Liu et al. (2011) under the name "Forest Density Estimation." A simple approach, sketched below, is to connect pairs greedily using some measure of correlation.
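As a rough sketch of this greedy idea (ours, not the algorithm of Liu et al. 2011), one can score pairs by absolute correlation and add edges in decreasing order of score while skipping any edge that would create a cycle (Kruskal-style), which yields a forest over the variables:

    import numpy as np

    def greedy_forest(X, max_edges=None):
        """Greedily build a forest over the d variables, scoring edges by |correlation|."""
        d = X.shape[1]
        corr = np.abs(np.corrcoef(X, rowvar=False))
        pairs = sorted(((corr[j, k], j, k) for j in range(d) for k in range(j + 1, d)),
                       reverse=True)
        parent = list(range(d))               # union-find to detect cycles

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a

        edges = []
        for score, j, k in pairs:
            rj, rk = find(j), find(k)
            if rj != rk:                      # adding (j, k) keeps the graph acyclic
                parent[rj] = rk
                edges.append((j, k))
            if max_edges is not None and len(edges) >= max_edges:
                break
        return edges                          # the density then factors over these edges

    X = np.random.default_rng(0).normal(size=(500, 5))
    E = greedy_forest(X)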

9.4 Mixtures

Another approach to density estimation is to use mixtures. We will discuss mixture modelling
when we discuss clustering.

9.5 Adaptive Kernels

A generalization of the kernel method is to use adaptive kernels where one uses a different
bandwidth h(x) for each point x. One can also use a different bandwidth h(xi ) for each data
point. This makes the estimator more flexible and allows it to adapt to regions of varying
smoothness. But now we have the very difficult task of choosing many bandwidths instead
of just one.

10 Summary
1. We discussed two categories of nonparametric density estimators: partition based (hard-partition based such as histograms, and soft-partition based such as kernel density estimators), and projection onto function space based (series estimators).

2. Of these, the most commonly used nonparametric density estimator is the kernel density estimator
$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d}\, K\!\left(\frac{\|x - X_i\|}{h}\right).$$

3. The kernel estimator is rate minimax over many classes of densities.

4. Cross-validation methods can be used for choosing the bandwidth h.

11 Appendix: Proofs

11.1 Proof of Theorem 2

We prove the result by bounding the bias and variance of $\hat p_h$.


First we bound the bias. Let $\theta_j = P(X \in B_j) = \int_{B_j} p(u)\, du$. For any $x \in B_j$,
$$p_h(x) \equiv \mathbb{E}(\hat p_h(x)) = \frac{\theta_j}{h^d} \qquad (32)$$
and hence
$$p(x) - p_h(x) = p(x) - \frac{\int_{B_j} p(u)\, du}{h^d} = \frac{1}{h^d}\int_{B_j} \bigl(p(x) - p(u)\bigr)\, du.$$
Thus,
$$|p(x) - p_h(x)| \le \frac{1}{h^d}\int_{B_j} |p(x) - p(u)|\, du \le \frac{1}{h^d}\, L h\sqrt{d} \int_{B_j} du = L h\sqrt{d}$$
where we used the fact that if $x, u \in B_j$ then $\|x - u\| \le \sqrt{d}\, h$.

Now we bound the variance. Since $p$ is Lipschitz on a compact set, it is bounded. Hence, $\theta_j = \int_{B_j} p(u)\, du \le C\int_{B_j} du = Ch^d$ for some $C$. Thus, the variance is
$$\mathrm{Var}(\hat p_h(x)) = \frac{1}{h^{2d}}\mathrm{Var}(\hat\theta_j) = \frac{\theta_j(1 - \theta_j)}{nh^{2d}} \le \frac{\theta_j}{nh^{2d}} \le \frac{C}{nh^d}.$$
We conclude that the $L_2$ risk is bounded by
$$\sup_{p \in \mathcal{P}(L)} R(p, \hat p) = \sup_{p \in \mathcal{P}(L)} \int \mathbb{E}\bigl(\hat p_h(x) - p(x)\bigr)^2 dx \le L^2 h^2 d + \frac{C}{nh^d}. \qquad (33)$$

The upper bound is minimized by choosing $h = \left(\frac{C}{L^2 n d}\right)^{\frac{1}{d+2}}$. (Later, we shall see a more practical way to choose $h$.) With this choice,
$$\sup_{p \in \mathcal{P}(L)} R(p, \hat p) \le C_0 \left(\frac{1}{n}\right)^{\frac{2}{d+2}}$$
where $C_0 = L^2 d\, (C/(L^2 d))^{2/(d+2)}$.

11.2 Proof of Theorem 4

We now derive a concentration result for $\hat p_h$, where we will bound
$$\sup_{P \in \mathcal{P}} P^n\bigl(\|\hat p_h - p\|_\infty > \epsilon\bigr)$$
where $\|f\|_\infty = \sup_x |f(x)|$. Assume that $\epsilon \le 1$. First, note that
$$\mathbb{P}(\|\hat p_h - p_h\|_\infty > \epsilon) = \mathbb{P}\left(\max_j \left|\frac{\hat\theta_j}{h^d} - \frac{\theta_j}{h^d}\right| > \epsilon\right) = \mathbb{P}\bigl(\max_j |\hat\theta_j - \theta_j| > \epsilon h^d\bigr) \le \sum_j \mathbb{P}\bigl(|\hat\theta_j - \theta_j| > \epsilon h^d\bigr).$$

Recall Bernstein's inequality: suppose that $Y_1, \dots, Y_n$ are iid with mean $\mu$, $\mathrm{Var}(Y_i) \le \sigma^2$ and $|Y_i| \le M$. Then
$$\mathbb{P}(|\bar{Y} - \mu| > \epsilon) \le 2\exp\left(-\frac{n\epsilon^2}{2\sigma^2 + 2M\epsilon/3}\right). \qquad (34)$$
Using Bernstein's inequality and the fact that $\theta_j(1 - \theta_j) \le \theta_j \le Ch^d$,
$$\mathbb{P}\bigl(|\hat\theta_j - \theta_j| > \epsilon h^d\bigr) \le 2\exp\left(-\frac{1}{2}\, \frac{n\epsilon^2 h^{2d}}{\theta_j(1 - \theta_j) + \epsilon h^d/3}\right) \le 2\exp\left(-\frac{1}{2}\, \frac{n\epsilon^2 h^{2d}}{Ch^d + \epsilon h^d/3}\right) \le 2\exp\bigl(-cn\epsilon^2 h^d\bigr)$$
where $c = 1/(2(C + 1/3))$. By the union bound and the fact that $N \le (1/h)^d$,
$$\mathbb{P}\bigl(\max_j |\hat\theta_j - \theta_j| > \epsilon h^d\bigr) \le 2h^{-d}\exp\bigl(-cn\epsilon^2 h^d\bigr) \equiv \pi_n.$$
Earlier we saw that $\sup_x |p(x) - p_h(x)| \le L\sqrt{d}\, h$. Hence, with probability at least $1 - \pi_n$,
$$\|\hat p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + \|p_h - p\|_\infty \le \epsilon + L\sqrt{d}\, h. \qquad (35)$$
Now set
$$\epsilon = \sqrt{\frac{1}{cnh^d}\log\left(\frac{2}{\delta h^d}\right)}.$$
Then, with probability at least $1 - \delta$,
$$\|\hat p_h - p\|_\infty \le \sqrt{\frac{1}{cnh^d}\log\left(\frac{2}{\delta h^d}\right)} + L\sqrt{d}\, h. \qquad (36)$$
Choosing $h = (c_2/n)^{1/(2+d)}$ we conclude that, with probability at least $1 - \delta$,
$$\|\hat p_h - p\|_\infty \le \sqrt{c^{-1} n^{-\frac{2}{2+d}}\left(\log\frac{2}{\delta} + \frac{2}{2+d}\log n\right)} + L\sqrt{d}\, n^{-\frac{1}{2+d}} = O\!\left(\left(\frac{\log n}{n}\right)^{\frac{1}{2+d}}\right). \qquad (37)$$

11.3 Proof of Lemma 5 (Bias of Kernel Density Estimators)

We have
$$|p_h(x) - p(x)| = \left| \int \frac{1}{h^d}\, K\!\left(\frac{u - x}{h}\right) p(u)\, du - p(x) \right| = \left| \int K(v)\bigl(p(x + hv) - p(x)\bigr)\, dv \right|$$
$$\le \left| \int K(v)\bigl(p(x + hv) - p_{x,\beta}(x + hv)\bigr)\, dv \right| + \left| \int K(v)\bigl(p_{x,\beta}(x + hv) - p(x)\bigr)\, dv \right|.$$
The first term is bounded by $Lh^{\beta}\int |K(s)|\,|s|^{\beta}\, ds$ since $p \in H(\beta, L)$. The second term is 0 from the properties of $K$, since $p_{x,\beta}(x + hv) - p(x)$ is a polynomial of degree less than $\beta$ (with no constant term).

11.4 Proof of Lemma 6 (Variance of Kernel Density Estimators)


We can write $\hat p(x) = n^{-1}\sum_{i=1}^{n} Z_i$ where $Z_i = \frac{1}{h^d}\, K\!\left(\frac{x - X_i}{h}\right)$. Then,
$$\mathrm{Var}(Z_i) \le \mathbb{E}(Z_i^2) = \frac{1}{h^{2d}}\int K^2\!\left(\frac{x - u}{h}\right) p(u)\, du = \frac{h^d}{h^{2d}}\int K^2(v)\, p(x + hv)\, dv \le \frac{\sup_x p(x)}{h^d}\int K^2(v)\, dv \le \frac{c}{h^d}$$
for some $c$, since the densities in $H(\beta, L)$ are uniformly bounded. The result follows since $\mathrm{Var}(\hat p(x)) = n^{-1}\mathrm{Var}(Z_i)$.

11.5 Proof of Theorem 9 (Concentration of Kernel Density Estimators)

By the triangle inequality,
$$|\hat p(x) - p(x)| \le |\hat p(x) - p_h(x)| + |p_h(x) - p(x)| \qquad (38)$$
where $p_h(x) = \mathbb{E}(\hat p(x))$. From Lemma 5, $|p_h(x) - p(x)| \le ch^{\beta}$ for some $c$. Now $\hat p(x) = n^{-1}\sum_{i=1}^{n} Z_i$ where
$$Z_i = \frac{1}{h^d}\, K\!\left(\frac{\|x - X_i\|}{h}\right).$$
Note that $|Z_i| \le c_1/h^d$ where $c_1 = K(0)$. Also, $\mathrm{Var}(Z_i) \le c_2/h^d$ from Lemma 6.

Recall Bernstein's inequality: suppose that $Y_1, \dots, Y_n$ are iid with mean $\mu$, $\mathrm{Var}(Y_i) \le \sigma^2$ and $|Y_i| \le M$. Then
$$\mathbb{P}(|\bar{Y} - \mu| > \epsilon) \le 2\exp\left(-\frac{n\epsilon^2}{2\sigma^2 + 2M\epsilon/3}\right). \qquad (39)$$
Then, by Bernstein's inequality,
$$\mathbb{P}\bigl(|\hat p(x) - p_h(x)| > \epsilon\bigr) \le 2\exp\left(-\frac{n\epsilon^2}{2c_2 h^{-d} + 2c_1 h^{-d}\epsilon/3}\right) \le 2\exp\left(-\frac{nh^d \epsilon^2}{4c_2}\right)$$
whenever $\epsilon \le 3c_2/c_1$. If we choose $\epsilon = \sqrt{C\log(2/\delta)/(nh^d)}$ where $C = 4c_2$ then
$$\mathbb{P}\left(|\hat p(x) - p_h(x)| > \sqrt{\frac{C\log(2/\delta)}{nh^d}}\right) \le \delta.$$
The result follows from (38).

11.6 Proof of Theorem 11 (Asymptotics of Kernel Density Estimators)

Write $K_h(x, X) = h^{-1} K\bigl((x - X)/h\bigr)$ and $\hat p(x) = n^{-1}\sum_i K_h(x, X_i)$. Thus, $\mathbb{E}[\hat p(x)] = \mathbb{E}[K_h(x, X)]$ and $\mathrm{Var}[\hat p(x)] = n^{-1}\mathrm{Var}[K_h(x, X)]$. Now,
$$\mathbb{E}[K_h(x, X)] = \int \frac{1}{h}\, K\!\left(\frac{x - t}{h}\right) p(t)\, dt = \int K(u)\, p(x - hu)\, du$$
$$= \int K(u)\left( p(x) - hu\, p'(x) + \frac{h^2 u^2}{2}\, p''(x) + \cdots \right) du = p(x) + \frac{1}{2} h^2 p''(x) \int u^2 K(u)\, du + \cdots$$
since $\int K(x)\, dx = 1$ and $\int x K(x)\, dx = 0$. The bias is
$$\mathbb{E}[K_{h_n}(x, X)] - p(x) = \frac{1}{2}\sigma_K^2 h_n^2\, p''(x) + O(h_n^4).$$
By a similar calculation,
$$\mathrm{Var}[\hat p(x)] = \frac{p(x)\int K^2(x)\, dx}{n h_n} + O\!\left(\frac{1}{n}\right).$$
The first result then follows since the risk is the squared bias plus variance. The second result follows from integrating the first result.

11.7 Proof of Theorem 15 (VC Approximation to L1)

We know that
$$\mathbb{P}\left( \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > c\sqrt{\frac{\nu}{n}} \right) < \delta.$$
Hence, except on an event of probability at most $\delta$, we have that
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(x)\, dx - P_n(A) \right| \le \sup_{A \in \mathcal{A}} \left| \int_A g(x)\, dx - P(A) \right| + \sup_{A \in \mathcal{A}} \bigl| P_n(A) - P(A) \bigr| \le \Delta(g) + c\sqrt{\frac{\nu}{n}}.$$
By a similar argument, $\Delta(g) \le \Delta_n(g) + c\sqrt{\frac{\nu}{n}}$. Hence, $|\Delta(g) - \Delta_n(g)| \le c\sqrt{\frac{\nu}{n}}$ for all $g$. Let $p_* = \mathrm{argmin}_{g \in \mathcal{P}} \Delta(g)$. Then,
$$\Delta(p_*) \le \Delta(\hat p) \le \Delta_n(\hat p) + c\sqrt{\frac{\nu}{n}} \le \Delta_n(p_*) + c\sqrt{\frac{\nu}{n}} \le \Delta(p_*) + 2c\sqrt{\frac{\nu}{n}}.$$

11.8 Proof of Theorem 16 (Yatracos Approximation to L1)


Let $i$ be such that $\hat p = p_i$ and let $s$ be such that $\int |p_s - p| = \min_j \int |p_j - p|$. Let $B = \{p_i > p_s\}$ and $C = \{p_s > p_i\}$. Now,
$$\int |\hat p - p| \le \int |p_s - p| + \int |p_s - p_i|. \qquad (40)$$
Let $\mathcal{B}$ denote all measurable sets. Then,
$$\int |p_s - p_i| = 2\max_{A \in \{B, C\}} \left| \int_A p_i - \int_A p_s \right| \le 2\sup_{A \in \mathcal{A}} \left| \int_A p_i - \int_A p_s \right|$$
$$\le 2\sup_{A \in \mathcal{A}} \left| \int_A p_i - P_n(A) \right| + 2\sup_{A \in \mathcal{A}} \left| \int_A p_s - P_n(A) \right| \le 4\sup_{A \in \mathcal{A}} \left| \int_A p_s - P_n(A) \right|$$
$$\le 4\sup_{A \in \mathcal{A}} \left| \int_A p_s - \int_A p \right| + 4\sup_{A \in \mathcal{A}} \left| \int_A p - P_n(A) \right| \le 4\sup_{A \in \mathcal{B}} \left| \int_A p_s - \int_A p \right| + 4\Delta = 2\int |p_s - p| + 4\Delta.$$
The result follows from (40).
