Nonparametric Regression
1 Introduction
The goal is to estimate the regression function m(x) = E[Y | X = x] without making parametric assumptions (such as linearity) about the regression function m(x). Estimating m is called nonparametric regression or smoothing (Härdle et al. 2012, Wasserman 2006). We can equivalently write
Y = m(X) + \epsilon
where E(\epsilon | X) = 0. This follows since, for \epsilon = Y - m(X), we have E(\epsilon | X) = 0 iff m(X) = E[Y | X].
Example 1 Figure 1 shows data on bone mineral density. The plots show the relative change
in bone density over two consecutive visits, for men and women. The smooth estimates of
the regression functions suggest that a growth spurt occurs two years earlier for females. In
this example, Y is change in bone mineral density and X is age.
Example 2 Figure 2 shows an analysis of some diabetes data from Efron, Hastie, Johnstone
and Tibshirani (2004). The outcome Y is a measure of disease progression after one year. We
consider four covariates (ignoring for now, six other variables): age, bmi (body mass index),
and two variables representing blood serum measurements. A nonparametric regression model
in this case takes the form
Y = m(x_1, x_2, x_3, x_4) + \epsilon.    (2)
A simpler, but less general, model is the additive model
Y = \sum_{j=1}^{4} m_j(x_j) + \epsilon.    (3)
Figure 2 shows the four estimated functions m̂_1, m̂_2, m̂_3 and m̂_4.
[Figure 1: Change in bone mineral density (BMD) versus age, for females (top panel) and males (bottom panel), with smooth regression estimates.]
Notation. We use m(x) to denote the regression function. Often we assume that Xi has a
density denoted by p(x). The support of the distribution of Xi is denoted by X . We assume
that X is a compact subset of Rd . Recall that the trace of a square matrix A is denoted by
tr(A) and is defined to be the sum of the diagonal elements of A.
Let m̂(x) be an estimate of m(x). A common loss is the integrated squared loss
L(\hat m, m) = \int (\hat m(x) - m(x))^2 \, dP(x),
where the integral is weighted by the data distribution P. Equivalently, this is the expectation
L(\hat m, m) = E_{X \sim P}\big[(\hat m(X) - m(X))^2\big].
The corresponding risk, also known as the integrated mean squared error, is given by
R(\hat m, m) = E \int (\hat m(x) - m(x))^2 \, dP(x) = E\big[(\hat m(X) - m(X))^2\big],    (4)
[Figure 2: Diabetes data. The estimated functions m̂_1, m̂_2, m̂_3 and m̂_4, plotted against the covariates age, bmi, map and tc.]
where the expectation is with respect to the random input X as well as the data underlying m̂. Sometimes we might also be interested in the predictive risk
R_p(\hat m, m) = E\big[(Y - \hat m(X))^2\big]    (5)
            = \sigma^2 + E\big[(\hat m(X) - m(X))^2\big]    (6)
            = \sigma^2 + \int b_n^2(x) \, dP(x) + \int v_n(x) \, dP(x),    (7)
where b_n(x) = E[\hat m(x)] - m(x) is the bias and v_n(x) = Var(\hat m(x)) is the variance.
The estimator m̂ typically involves smoothing the data in some way. The main challenge is
to determine how much smoothing to do. When the data are oversmoothed, the bias term
is large and the variance is small. When the data are undersmoothed the opposite is true.
This is called the bias–variance tradeoff. Minimizing risk corresponds to balancing bias and
variance.
An estimator m̂ is consistent if
\|\hat m - m\| \xrightarrow{P} 0.    (8)
When unspecified, the function norm \|\cdot\| will typically mean the L_2 norm \|\cdot\|_2 with respect to P_X, acting on functions m : R^d \to R, by
\|m\|_2^2 = E[m^2(X)] = \int m^2(x) \, dP_X(x).
The minimax risk over a class of regression functions M is
R_n(M) = \inf_{\hat m} \sup_{m \in M} R(\hat m, m),    (9)
and an estimator is minimax if its risk is equal to the minimax risk. We say that m̂ is rate optimal if
R(\hat m, m) \asymp R_n(M).    (10)
Typically the minimax rate is of the form n^{-C/(C+d)} for some C > 0.
2 Partition Estimators
Simple and interpretable estimators can be derived by partitioning the range of X. Let \Pi_n = \{A_1, \ldots, A_N\} be a partition of \mathcal{X} and define
\hat m(x) = \sum_{j=1}^{N} \bar Y_j \, I(x \in A_j)
where \bar Y_j = n_j^{-1} \sum_{i=1}^{n} Y_i I(X_i \in A_j) is the average of the Y_i's in A_j and n_j = \#\{X_i \in A_j\}. (We define \bar Y_j to be 0 if n_j = 0.)
The simplest partition is based on cubes. Suppose that \mathcal{X} = [0, 1]^d. Then we can partition \mathcal{X} into N = k^d cubes with sides of length h = 1/k, so that N = (1/h)^d. The smoothing parameter is h.
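To make this concrete, here is a minimal sketch of the cube-partition estimator in one dimension on [0, 1]; the function name, bin handling and bandwidth value are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def partition_estimator(X, Y, h):
    """Partition estimator on [0, 1]: average the Y_i's within each bin of width h."""
    edges = np.arange(0.0, 1.0 + h, h)                     # bin edges, h = 1/k
    idx = np.clip(np.digitize(X, edges) - 1, 0, len(edges) - 2)
    means = np.zeros(len(edges) - 1)
    for j in range(len(means)):
        in_bin = (idx == j)
        means[j] = Y[in_bin].mean() if in_bin.any() else 0.0   # Ybar_j = 0 if bin empty
    def m_hat(x):
        j = np.clip(np.digitize(x, edges) - 1, 0, len(means) - 1)
        return means[j]
    return m_hat

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(size=200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.normal(size=200)
m_hat = partition_estimator(X, Y, h=0.1)
print(m_hat(np.array([0.05, 0.5, 0.95])))
```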
Theorem 3 Let m̂(x) be the partition estimator based on cubes of side length h. Suppose that
m \in M = \{ m : |m(x) - m(z)| \le L\|x - z\| \text{ for all } x, z \in R^d \}.    (11)
Then
\sup_{m \in M} E\|\hat m - m\|^2 \le c_1 h^2 + \frac{c_2}{n h^d}
for constants c_1, c_2, and choosing h \asymp n^{-1/(d+2)} gives the rate n^{-2/(d+2)}.

Regression trees are a refinement of partition estimators in which the partition is chosen adaptively from the data. A regression tree is a model of the form
\hat m(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m),
where c_1, \ldots, c_M are constants and R_1, \ldots, R_M are disjoint rectangles that partition the space of covariates and whose sides are parallel to the coordinate axes. The model is fit in a greedy, recursive manner that can be represented as a tree; hence the name.
Denote a generic covariate value by x = (x_1, \ldots, x_j, \ldots, x_d). The covariate for the ith observation is X_i = (X_{i1}, \ldots, X_{ij}, \ldots, X_{id}). Given a covariate j and a split point s we define the rectangles R_1 = R_1(j, s) = \{x : x_j \le s\} and R_2 = R_2(j, s) = \{x : x_j > s\} where, in this expression, x_j refers to the jth covariate, not the jth observation. Then we take c_1 to be the average of all the Y_i's such that X_i \in R_1 and c_2 to be the average of all the Y_i's such that X_i \in R_2. Notice that c_1 and c_2 minimize the sums of squares \sum_{X_i \in R_1}(Y_i - c_1)^2 and \sum_{X_i \in R_2}(Y_i - c_2)^2. The choice of which covariate x_j to split on and which split point s to use is based on minimizing the residual sum of squares. The splitting process is then repeated on each rectangle R_1 and R_2.
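As an illustration, here is a minimal sketch of the greedy search for the best split (j, s); the function name and the use of midpoints between observed values as candidate split points are my own choices, not prescribed by the notes.

```python
import numpy as np

def best_split(X, Y):
    """Find the covariate j and split point s minimizing the residual sum of squares."""
    n, d = X.shape
    best = (None, None, np.inf)               # (j, s, rss)
    for j in range(d):
        xs = np.sort(np.unique(X[:, j]))
        for s in (xs[:-1] + xs[1:]) / 2:      # candidate splits: midpoints
            left = X[:, j] <= s
            right = ~left
            c1, c2 = Y[left].mean(), Y[right].mean()
            rss = np.sum((Y[left] - c1) ** 2) + np.sum((Y[right] - c2) ** 2)
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(size=(100, 2))
Y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + 0.1 * rng.normal(size=100)
print(best_split(X, Y))   # should pick j = 0 with s near 0.5
```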
Figure 3 shows a simple example of a regression tree; also shown are the corresponding rectangles. The function estimate m̂ is constant over the rectangles.
Generally one first grows a very large tree, then the tree is pruned to form a subtree by
collapsing regions together. The size of the tree is a tuning parameter and is usually chosen
by cross-validation.
[Figure 3 diagram: the first split is on X_1 at 50; for X_1 < 50 there is a second split on X_2 at 100, giving the rectangles R_1, R_2, R_3 with fitted constants c_1, c_2, c_3.]
Figure 3: A regression tree for two covariates X_1 and X_2. The function estimate is m̂(x) = c_1 I(x \in R_1) + c_2 I(x \in R_2) + c_3 I(x \in R_3) where R_1, R_2 and R_3 are the rectangles shown in the lower plot.
Example 4 Figure 4 shows a tree for the rock data. Notice that the variable shape does not
appear in the tree. This means that the shape variable was never the optimal covariate to split
on in the algorithm. The result is that the tree only depends on area and peri. This illustrates
an important feature of tree regression: it automatically performs variable selection in the
sense that a covariate xj will not appear in the tree if the algorithm finds that the variable
is not important.
A nonparametric extension of this idea is to fit piecewise polynomials: we split the input domain into multiple regions and fit a polynomial in each. Splines are piecewise polynomials with smoothness constraints: a spline of degree k with knots at t_1 < \cdots < t_p is a piecewise degree-k polynomial on the intervals determined by the knots, with continuous derivatives up to order k - 1 at each knot. Informally, a spline is a lot smoother than an unconstrained piecewise polynomial, and so modeling with splines can serve as a way of reducing the variance of fitted estimators. See Figure 5.
How can we parametrize the set of splines with knots at t_1, \ldots, t_p? The most natural way is to use the truncated power basis B_1, \ldots, B_{p+k+1}, defined as
B_j(x) = x^{j-1}, \quad j = 1, \ldots, k+1, \qquad B_{k+1+j}(x) = (x - t_j)_+^k, \quad j = 1, \ldots, p.
(Here x_+ denotes the positive part of x, i.e., x_+ = \max\{x, 0\}.) From this we can see that the space of kth-order splines with knots at t_1, \ldots, t_p has dimension p + k + 1.
Proof. (Thanks to Vishwajeet Agarwal for the short proof.) Any polynomial p(x) of degree k that, at some knot t, matches the value and first k - 1 derivatives of another degree-k polynomial p_0(x) can be written as
p(x) = p_0(x) + c_k (x - t)^k.
To see this, consider the degree-k polynomial g(x) = p(x + t) - p_0(x + t) = \sum_{j=0}^{k} c_j x^j. Note that g^{(j)}(0) = j! \, c_j. But under the matching constraint, g^{(j)}(0) = 0 for j = 0, \ldots, k - 1, so c_0 = \cdots = c_{k-1} = 0. We thus have g(x) = c_k x^k, so that, undoing the shift, p(x) = p_0(x) + c_k (x - t)^k.
Consider for now a single knot t_0. With the degree-k polynomial f_0(x) = \sum_{j=0}^{k} a_j x^j on the piece (-\infty, t_0], the polynomial f_1(x) on the next piece has to satisfy the matching constraints at the knot t_0, and hence can be written as f_1(x) = f_0(x) + b_1 (x - t_0)^k for some coefficient b_1. Since we want this extra term to be active only on [t_0, \infty), we can compactly specify the piecewise polynomial over both pieces as f_0(x) + b_1 (x - t_0)_+^k. The argument extends naturally to more knots, giving the truncated power basis.
While these basis functions are natural, a much better computational choice, both for speed
and numerical accuracy, is the B-spline basis. This was a major development in spline theory
and is now pretty much the standard in software. See de Boor (1978) or the Appendix of
Chapter 5 in Hastie et al. (2009) for details.
We can then perform regression on these basis functions; the resulting approach is also called regression splines. This would work well provided we choose good knots t_1, \ldots, t_p;
but in general choosing knots is a tricky business. Another problem with regression splines
is that the estimates tend to display erratic behavior, i.e., they have high variance at the
boundaries of the input domain. (This is the opposite problem to that with kernel smoothing,
which had poor bias at the boundaries.) This only gets worse as the polynomial order k gets
larger.
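For concreteness, here is a small sketch that builds the truncated power basis defined above and fits a regression spline by least squares; the function name, the cubic order and the knot locations in the example are illustrative.

```python
import numpy as np

def truncated_power_basis(x, knots, k=3):
    """Columns: 1, x, ..., x^k, (x - t_1)_+^k, ..., (x - t_p)_+^k."""
    x = np.asarray(x)
    cols = [x ** j for j in range(k + 1)]
    cols += [np.clip(x - t, 0.0, None) ** k for t in knots]
    return np.column_stack(cols)                 # shape (n, p + k + 1)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(size=200))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=200)
knots = np.linspace(0.1, 0.9, 9)

B = truncated_power_basis(x, knots, k=3)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)     # regression spline fit
fitted = B @ beta
```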
A way to remedy this problem is to force the piecewise polynomial to have a lower degree to the left of the leftmost knot and to the right of the rightmost knot; this is exactly what natural splines do. A natural spline of order k, with knots at t_1 < \cdots < t_p, is a piecewise polynomial function f such that f is a polynomial of degree k on each of [t_1, t_2], \ldots, [t_{p-1}, t_p], is a polynomial of degree (k - 1)/2 on (-\infty, t_1] and [t_p, \infty), and is continuous with continuous derivatives of orders 1, \ldots, k - 1 at each knot.
Figure 5: Illustration of the effects of enforcing continuity at the knots, across various orders of the derivative, for a cubic piecewise polynomial. From Chapter 5 of Hastie et al. (2009).
It is implicit here that natural splines are only defined for odd orders k. There is a variant of
the truncated power basis for natural splines, and a variant of the B-spline basis for natural
splines. Again, B-splines are the preferred parametrization for computational speed and
stability. Natural splines of cubic order are the most common special case: these are smooth piecewise cubic functions that are simply linear beyond the leftmost and rightmost knots.
Smoothing splines are simply regularized regression splines, placing knots at all inputs
x1 , . . . , xn . They circumvent the problem of knot selection as they just use the inputs as
knots, and they control for overfitting by shrinking the coefficients of the estimated function
(in its basis expansion).
Splines can also be motivated via the general approach of penalized (or regularized) regression, where m̂ is defined to be the minimizer, over functions m, of
\sum_{i=1}^{n} (Y_i - m(X_i))^2 + \lambda J(m).    (16)
For the roughness penalty J used below, the minimizer of (16) can be shown to be a cubic spline with knots at \{X_1, \ldots, X_n\}.
Theorem 5 Let m̂ be the minimizer of (16) where J(g) = \int (g''(x))^2 \, dx. Then m̂ is a cubic spline with knots at the points X_1, \ldots, X_n.
According to this result, the minimizer m̂ of (16) is contained in Mn , the set of all cubic
splines with knots at {X1 , . . . , Xn }. However, we still have to find which function in Mn is
the minimizer.
Let B_1, \ldots, B_{n+4} be the truncated power basis for M_n as defined earlier, given knots at \{X_1, \ldots, X_n\}: B_1(x) = 1, B_2(x) = x, B_3(x) = x^2, B_4(x) = x^3 and
B_j(x) = (x - X_{j-4})_+^3, \quad j = 5, \ldots, n + 4.
(As noted earlier, in practice another basis for M_n, the B-spline basis, is used, since it has better numerical properties.) Thus, every g \in M_n can be written as g(x) = \sum_{j=1}^{N} \beta_j B_j(x) for some coefficients \beta_1, \ldots, \beta_N, where N = n + 4. If we substitute \hat m(x) = \sum_{j=1}^{N} \beta_j B_j(x) into (16), the minimization problem becomes: find \beta = (\beta_1, \ldots, \beta_N) to minimize
(Y - B\beta)^T (Y - B\beta) + \lambda \beta^T \Omega \beta    (17)
where Y = (Y_1, \ldots, Y_n), B_{ij} = B_j(X_i) and \Omega_{jk} = \int B_j''(x) B_k''(x) \, dx. The solution is
\hat\beta = (B^T B + \lambda\Omega)^{-1} B^T Y
and hence
\hat m(x) = \sum_j \hat\beta_j B_j(x) = \ell(x)^T Y
where \ell(x)^T = b(x)^T (B^T B + \lambda\Omega)^{-1} B^T and b(x) = (B_1(x), \ldots, B_N(x))^T. Hence, the spline smoother is another example of a linear smoother.
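The following sketch implements the penalized fit (17) directly with the truncated power basis, approximating the penalty matrix Ω by numerical quadrature of the second derivatives on a grid; the function names and the quadrature grid are my own choices, and in practice one would use the better-conditioned B-spline basis instead.

```python
import numpy as np

def tpb_cubic(x, knots):
    """Cubic truncated power basis: 1, x, x^2, x^3, (x - knot)_+^3."""
    x = np.asarray(x)
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - t, 0.0, None) ** 3 for t in knots]
    return np.column_stack(cols)

def tpb_cubic_dd(x, knots):
    """Second derivatives of the cubic truncated power basis."""
    x = np.asarray(x)
    cols = [np.zeros_like(x), np.zeros_like(x), 2 * np.ones_like(x), 6 * x]
    cols += [6 * np.clip(x - t, 0.0, None) for t in knots]
    return np.column_stack(cols)

def smoothing_spline(X, Y, lam, n_grid=2000):
    knots = np.sort(X)                            # knots at all inputs
    B = tpb_cubic(X, knots)                       # B_{ij} = B_j(X_i)
    grid = np.linspace(X.min(), X.max(), n_grid)
    D = tpb_cubic_dd(grid, knots)
    w = grid[1] - grid[0]
    Omega = w * D.T @ D                           # Omega_{jk} ~ int B_j'' B_k''
    beta = np.linalg.solve(B.T @ B + lam * Omega, B.T @ Y)
    return lambda x: tpb_cubic(x, knots) @ beta   # m_hat(x) = sum_j beta_j B_j(x)

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(size=30))                 # small n: truncated power basis is ill-conditioned
Y = np.sin(2 * np.pi * X) + 0.2 * rng.normal(size=30)
m_hat = smoothing_spline(X, Y, lam=1e-4)
```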
Define the Sobolev class of functions W_1(m, C), for an integer m \ge 0 and C > 0, to contain all m times differentiable functions f : R \to R such that
\int \big(f^{(m)}(x)\big)^2 \, dx \le C^2.
(The Sobolev class Wd (m, C) in d dimensions can be defined similarly, where we sum over
all partial derivatives of order m.)
Assuming f_0 \in W_1(m, C) for the underlying regression function, where C > 0 is a constant, the smoothing spline estimator f̂ (the minimizer of (16) with penalty J(f) = \int (f^{(m)}(x))^2 dx) of polynomial order k = 2m - 1, with tuning parameter \lambda \asymp n^{1/(2m+1)} \asymp n^{1/(k+2)}, satisfies
\|\hat f - f_0\|_n^2 \lesssim n^{-2m/(2m+1)} \quad \text{in probability.}
The proof of this result uses much fancier techniques from empirical process theory (entropy numbers) than the proofs for kernel smoothing. See Chapter 10.1 of van de Geer (2000).
4.3 Multivariate splines
Splines can be extended to multiple dimensions, in two different ways: thin-plate splines and
tensor-product splines. See Chapter 7 of Green & Silverman (1994), and Chapters 15 and
20.4 of Gyorfi et al. (2002)). These multivariate extensions however are highly nontrivial,
especially when we compare them to the conceptually simple extension of kernel smoothing to
higher dimensions. In multiple dimensions, if one wants to study penalized nonparametric estimation, it is easier to study RKHS-based estimators, which in fact cover smoothing splines (and thin-plate splines) as special cases.
5 Orthogonal Basis Regression
Suppose that
m \in L_2(a, b) = \Big\{ g : [a, b] \to R : \int_a^b g^2(x) \, dx < \infty \Big\}.
Let \phi_1, \phi_2, \ldots be an orthonormal basis for L_2(a, b). This means that \int \phi_j^2(x) \, dx = 1, \int \phi_j(x)\phi_k(x) \, dx = 0 for j \ne k, and the only function b(x) such that \int b(x)\phi_j(x) \, dx = 0 for all j is b(x) = 0. It follows that any m \in L_2(a, b) can be written as
m(x) = \sum_{j=1}^{\infty} \beta_j \phi_j(x)
where \beta_j = \int m(x)\phi_j(x) \, dx. For [a, b] = [0, 1], an example is the cosine basis
\phi_0(x) = 1, \quad \phi_j(x) = \sqrt{2}\cos(\pi j x), \quad j = 1, 2, \ldots
To use a basis for nonparametric regression, we regress Y on the first J basis functions and treat J as a smoothing parameter. In other words, we take \hat m(x) = \sum_{j=1}^{J} \hat\beta_j \phi_j(x) where \hat\beta = (B^T B)^{-1} B^T Y and B_{ij} = \phi_j(X_i). It follows that m̂(x) is a linear smoother. See Chapters 7 and 8 of Wasserman (2006) for theoretical properties of orthogonal function smoothers.
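As a quick illustration, here is a sketch of series regression with the cosine basis, truncated at J terms; the value of J and the test function are arbitrary choices for the example.

```python
import numpy as np

def cosine_basis(x, J):
    """Columns: phi_0 = 1, phi_j = sqrt(2) cos(pi j x), j = 1, ..., J - 1."""
    x = np.asarray(x)
    cols = [np.ones_like(x)] + [np.sqrt(2) * np.cos(np.pi * j * x) for j in range(1, J)]
    return np.column_stack(cols)

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(size=300))
Y = np.exp(-3 * X) + 0.1 * rng.normal(size=300)

J = 8                                        # smoothing parameter
B = cosine_basis(X, J)                       # B_{ij} = phi_j(X_i)
beta = np.linalg.solve(B.T @ B, B.T @ Y)     # least squares coefficients
m_hat = cosine_basis(X, J) @ beta            # fitted values: a linear smoother
```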
We might discuss more about such “linear additive” models in a subsequent lecture.
6 k-nearest-neighbors regression
Before we study smoothing-kernel-based approaches, it is instructive to study their basic precursor, k-nearest-neighbors regression. We fix an integer k \ge 1 and define
\hat m(x) = \frac{1}{k} \sum_{i \in N_k(x)} Y_i,    (18)
where N_k(x) contains the indices of the k points among X_1, \ldots, X_n that are closest to x.
However, the fitted function m̂ essentially always looks jagged, especially for small or moderate k. Why is this? It helps to write
\hat m(x) = \sum_{i=1}^{n} w_i(x) Y_i,    (19)
where w_i(x) = 1/k if X_i is one of the k nearest points to x, and w_i(x) = 0 otherwise. Note that w_i(x) is discontinuous as a function of x, and therefore so is m̂(x).
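Here is a minimal sketch of the k-nearest-neighbors estimator (18); the brute-force distance computation and the function name are just for illustration.

```python
import numpy as np

def knn_regress(x0, X, Y, k):
    """m_hat(x0) = average of Y_i over the k nearest training points to x0."""
    dists = np.linalg.norm(X - x0, axis=1)   # Euclidean distances to x0
    nearest = np.argsort(dists)[:k]          # indices in N_k(x0)
    return Y[nearest].mean()

rng = np.random.default_rng(5)
X = rng.uniform(size=(500, 2))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.normal(size=500)
print(knn_regress(np.array([0.3, 0.7]), X, Y, k=20))
```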
6.1 Consistency
The k-nearest-neighbors estimator is universally consistent, which means E\|\hat m - m_0\|_2^2 \to 0 as n \to \infty, with no assumptions other than E(Y^2) < \infty, provided that we take k = k_n such that k_n \to \infty and k_n/n \to 0; e.g., k = \sqrt{n} will do. See Chapter 6.2 of Gyorfi et al. (2002). Furthermore, assuming that m_0 is Lipschitz, choosing k \asymp n^{2/(2+d)} gives
E\|\hat m - m_0\|_2^2 \lesssim n^{-2/(2+d)}.    (20)
See Chapter 6.3 of Gyorfi et al. (2002). Later, we will see that this is optimal.
Proof sketch: assume that Var(Y | X = x) = \sigma^2, a constant, for simplicity, and fix (condition on) the training points. Using the bias-variance decomposition,
E\big[(\hat m(x) - m_0(x))^2\big] = \underbrace{\big(E[\hat m(x)] - m_0(x)\big)^2}_{\text{Bias}^2(\hat m(x))} + \underbrace{E\big[(\hat m(x) - E[\hat m(x)])^2\big]}_{\text{Var}(\hat m(x))}
= \Big( \frac{1}{k} \sum_{i \in N_k(x)} \big(m_0(X_i) - m_0(x)\big) \Big)^2 + \frac{\sigma^2}{k}
\le \Big( \frac{L}{k} \sum_{i \in N_k(x)} \|X_i - x\|_2 \Big)^2 + \frac{\sigma^2}{k}.
In the last line we used the Lipschitz property |m_0(x) - m_0(z)| \le L\|x - z\|_2, for some constant L > 0. Now for "most" of the points we'll have \|X_i - x\|_2 \le C(k/n)^{1/d}, for a constant C > 0. (Think of having input points X_i, i = 1, \ldots, n spaced equally over (say) [0, 1]^d.) Then our bias-variance upper bound becomes
(CL)^2 \Big( \frac{k}{n} \Big)^{2/d} + \frac{\sigma^2}{k}.
We can minimize this by balancing the two terms so that they are equal, giving k^{1+2/d} \asymp n^{2/d}, i.e., k \asymp n^{2/(2+d)} as claimed. Plugging this in gives the error bound of n^{-2/(2+d)}, as claimed.
As discussed in the nonparametric density estimation lecture, the above error rate n^{-2/(2+d)} exhibits a very poor dependence on the dimension d: to achieve error \epsilon, the number of samples must scale exponentially in the dimension, n \ge \epsilon^{-(2+d)/2}. See Figure 6 for an illustration with \epsilon = 0.1.
[Figure 6: The number of samples n = \epsilon^{-(2+d)/2} needed to achieve error \epsilon = 0.1, plotted against the dimension d.]
7 Kernel Smoothing
Another simple nonparametric estimator is the kernel estimator. The word "kernel" is often used in two different ways. Here we are referring to smoothing kernels. Later we will discuss
Let h > 0 be a positive number, called the bandwidth. The Nadaraya–Watson kernel estimator is defined by
\hat m(x) \equiv \hat m_h(x) = \frac{\sum_{i=1}^{n} Y_i \, K\big(\|x - X_i\|/h\big)}{\sum_{i=1}^{n} K\big(\|x - X_i\|/h\big)} = \sum_{i=1}^{n} Y_i \, \ell_i(x)    (23)
where \ell_i(x) = K(\|x - X_i\|/h) / \sum_j K(\|x - X_j\|/h).
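A minimal sketch of the Nadaraya–Watson estimator (23) with a Gaussian smoothing kernel; the kernel choice and bandwidth are illustrative.

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h):
    """m_hat(x0) = sum_i ell_i(x0) Y_i with Gaussian kernel weights."""
    dists = np.linalg.norm(X - x0, axis=1) / h
    w = np.exp(-0.5 * dists ** 2)            # K(||x0 - X_i|| / h)
    return np.sum(w * Y) / np.sum(w)         # local weighted average

rng = np.random.default_rng(6)
X = rng.uniform(size=(400, 1))
Y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=400)
print(nadaraya_watson(np.array([0.5]), X, Y, h=0.1))
```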
Thus m̂(x) is a local average of the Y_i's. It can be shown that the optimal kernel is the Epanechnikov kernel. But, as with density estimation, the choice of kernel K is not too important. Estimates obtained by using different kernels are usually numerically very similar. This observation is confirmed by theoretical calculations which show that the risk is very insensitive to the choice of kernel. What does matter much more is the choice of bandwidth h, which controls the amount of smoothing. Small bandwidths give very rough estimates while larger bandwidths give smoother estimates.

Figure 7: Comparing k-nearest-neighbor and Epanechnikov kernels, when d = 1. From Chapter 6 of Hastie et al. (2009).

The kernel estimator can be derived by minimizing the localized squared error
\sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big)\,(c - Y_i)^2.    (24)
A simple calculation shows that this is minimized by the kernel estimator c = \hat m(x) as given in equation (23).

Kernel regression and kernel density estimation are related. Let \hat p(x, y) be the kernel density estimator and define
\hat m(x) = \hat E(Y \mid X = x) = \int y \, \hat p(y \mid x) \, dy = \frac{\int y \, \hat p(x, y) \, dy}{\hat p(x)},    (25)
where \hat p(x) = \int \hat p(x, y) \, dy. Then m̂(x) is the Nadaraya–Watson kernel regression estimator. In comparison to the k-nearest-neighbors estimator (18), which can be thought of as a raw (discontinuous) moving average of nearby responses, the kernel estimator in (23) is a smooth moving average of responses. See Figure 7 for an example with d = 1.
7.1 Error Analysis
The kernel smoothing estimator is universally consistent (E\|\hat m - m_0\|_2^2 \to 0 as n \to \infty, with no assumptions other than E(Y^2) < \infty), provided we take a compactly supported kernel K, and bandwidth h = h_n satisfying h_n \to 0 and nh_n^d \to \infty as n \to \infty. See Chapter 5.2 of Gyorfi et al. (2002). We can say more.
Theorem. Suppose that d = 1 and that m'' is bounded. Also suppose that X has a non-zero, differentiable density p and that the support is unbounded. Then, the risk is
R_n = \frac{h_n^4}{4} \Big( \int x^2 K(x) \, dx \Big)^2 \int \Big( m''(x) + 2\, m'(x) \frac{p'(x)}{p(x)} \Big)^2 dx + \frac{\sigma^2 \int K^2(x) \, dx}{n h_n} \int \frac{dx}{p(x)} + o\Big(\frac{1}{n h_n}\Big) + o(h_n^4).
It follows that the optimal bandwidth is h_n \approx n^{-1/5}, yielding a risk of order n^{-4/5}. In d dimensions, the term nh_n becomes nh_n^d; in that case the optimal bandwidth is h_n \approx n^{-1/(4+d)}, yielding a risk of order n^{-4/(4+d)}.
Problems with the bias. The first term in the risk bound from the theorem is the squared bias, and it has two disturbing properties. The first is that it depends on p and p', which is also called design bias. We'll fix this problem later using local linear smoothing.
If the support has boundaries then there is bias of order O(h) near the boundary, in contrast to O(h^2) in the interior. This is also called boundary bias. The risk then becomes O(h^3) instead of O(h^4). This happens because of the asymmetry of the kernel weights in such regions. See Figure 8. We'll also fix this problem using local linear smoothing.
Also, the result above depends on assuming that P_X has a density. We can drop that assumption and get a slightly weaker result due to Gyorfi, Kohler, Krzyzak and Walk (2002). For simplicity, we will use the spherical kernel K(\|x\|) = I(\|x\| \le 1); the results can be extended to other kernels. Hence,
\hat m(x) = \frac{\sum_{i=1}^{n} Y_i I(\|X_i - x\| \le h)}{\sum_{i=1}^{n} I(\|X_i - x\| \le h)} = \frac{\sum_{i=1}^{n} Y_i I(\|X_i - x\| \le h)}{n P_n(B(x, h))},
where P_n is the empirical distribution and B(x, h) = \{z : \|z - x\| \le h\}.
Theorem: Risk bound without density. Suppose that the distribution of X has compact support and that Var(Y | X = x) \le \sigma^2 < \infty for all x. Then
\sup_{P \in H_d(1, L)} E\|\hat m - m\|_P^2 \le c_1 h^2 + \frac{c_2}{n h^d}.    (26)
Hence, if h \asymp n^{-1/(d+2)} then
\sup_{P \in H_d(1, L)} E\|\hat m - m\|_P^2 \le \frac{c}{n^{2/(d+2)}}.    (27)
Recall from (21) that this is the minimax optimal rate over H_d(1, L). More generally, the minimax rate over H_d(\alpha, L), for a constant L > 0, is
\inf_{\hat m} \sup_{m_0 \in H_d(\alpha, L)} E\|\hat m - m_0\|_2^2 \gtrsim n^{-2\alpha/(2\alpha + d)};    (28)
see again Chapter 3.2 of Gyorfi et al. (2002). On the other hand, this rate n^{-2/(d+2)} is slower than the pointwise rate n^{-4/(d+4)} derived earlier (which is minimax for H_d(2, L)), because we have made weaker assumptions.
Recall that the kernel estimator can be derived by minimizing the localized squared error
\sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big)\,(c - Y_i)^2.    (29)
To reduce the design bias and the boundary bias we simply replace the constant c with a polynomial. In fact, it is enough to use a polynomial of order 1; in other words, we fit a local linear estimator instead of a local constant. The idea is that, for u near x, we can write m(u) \approx \beta_0(x) + \beta_1(x)(u - x). We define \hat\beta(x) = (\hat\beta_0(x), \hat\beta_1(x)) to minimize
\sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big)\big(Y_i - \beta_0(x) - \beta_1(x)(X_i - x)\big)^2.
Then \hat m(u) \approx \hat\beta_0(x) + \hat\beta_1(x)(u - x); in particular, \hat m(x) = \hat\beta_0(x). The minimizer is easily seen to be
\hat\beta(x) = (\hat\beta_0(x), \hat\beta_1(x))^T = (B^T W B)^{-1} B^T W Y
where Y = (Y_1, \ldots, Y_n),
B = \begin{pmatrix} 1 & X_1 - x \\ 1 & X_2 - x \\ \vdots & \vdots \\ 1 & X_n - x \end{pmatrix}, \qquad W = \mathrm{diag}\big(K_h(x - X_1), K_h(x - X_2), \ldots, K_h(x - X_n)\big).
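Here is a minimal sketch of the local linear fit at a single point x, using a Gaussian kernel for K_h; the kernel and bandwidth are illustrative choices.

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear estimate m_hat(x0) = beta0_hat from weighted least squares."""
    B = np.column_stack([np.ones_like(X), X - x0])        # design: (1, X_i - x0)
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)                # K_h(x0 - X_i), Gaussian
    W = np.diag(w)
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ Y)
    return beta[0]                                        # intercept = m_hat(x0)

rng = np.random.default_rng(7)
X = np.sort(rng.uniform(size=300))
Y = np.sin(4 * X) + 0.3 * rng.normal(size=300)
print(local_linear(0.0, X, Y, h=0.1))   # near the boundary, less biased than N-W
```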
Figure 8: Comparing (Nadaraya–Watson) kernel smoothing to local linear regression; the former is biased at the boundary, the latter is unbiased (to first order). From Chapter 6 of Hastie et al. (2009).

Then \hat m(x) = \hat\beta_0(x). It can be shown that local linear regression removes boundary bias and design bias. See Figure 8.
Theorem. Under some regularity conditions, the risk of m̂ is
\frac{h_n^4}{4} \int \mathrm{tr}^2\Big( m''(x) \int K(u)\, u u^T \, du \Big) dP(x) + \frac{1}{n h_n^d} \int K^2(u)\, du \int \sigma^2(x)\, dP(x) + o\big(h_n^4 + (n h_n^d)^{-1}\big).
For a proof, see Fan & Gijbels (1996). For points near the boundary, the bias is C h^2 m''(x) + o(h^2), whereas the bias is C h\, m'(x) + o(h) for kernel estimators.
8.1 Higher-order smoothness
How can we hope to get optimal error rates over H_d(\alpha, L) when \alpha \ge 2? With kernels there are basically two options: use local polynomials, or use higher-order kernels.
Local polynomials build on our previous idea of local linear regression (itself an extension of kernel smoothing). Consider d = 1, for concreteness. Define
\hat m(x) = \hat\beta_{x,0} + \sum_{j=1}^{k} \hat\beta_{x,j}\, x^j,
where the coefficients \hat\beta_x = (\hat\beta_{x,0}, \ldots, \hat\beta_{x,k}) minimize the locally weighted sum of squares \sum_{i=1}^{n} K\big(\frac{x - X_i}{h}\big)\big(Y_i - \beta_0 - \beta_1 X_i - \cdots - \beta_k X_i^k\big)^2. In matrix form,
\hat m(x) = b(x)^T (B^T \Omega B)^{-1} B^T \Omega\, y = w(x)^T y,
where b(x) = (1, x, \ldots, x^k)^T, B is an n \times (k + 1) matrix with ith row b(X_i)^T = (1, X_i, \ldots, X_i^k), and \Omega is the diagonal kernel weight matrix as before. Hence again, local polynomial regression is a linear smoother.
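A minimal sketch of the local polynomial estimator of order k at a point x, again with an illustrative Gaussian kernel; it reduces to the local linear fit above when k = 1.

```python
import numpy as np

def local_poly(x0, X, Y, h, k=2):
    """Local polynomial estimate of order k: b(x0)^T (B^T W B)^{-1} B^T W Y."""
    B = np.column_stack([X ** j for j in range(k + 1)])   # row i: (1, X_i, ..., X_i^k)
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)                # kernel weights
    BW = B.T * w                                          # B^T W without forming W
    beta = np.linalg.solve(BW @ B, BW @ Y)
    b0 = np.array([x0 ** j for j in range(k + 1)])        # b(x0)
    return b0 @ beta

rng = np.random.default_rng(8)
X = np.sort(rng.uniform(size=300))
Y = np.sin(4 * X) + 0.3 * rng.normal(size=300)
print(local_poly(0.5, X, Y, h=0.15, k=2))
```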
Assuming that m_0 \in H_1(\alpha, L) for a constant L > 0, a Taylor expansion shows that the local polynomial estimator m̂ of order k, where k is the largest integer strictly less than \alpha and where the bandwidth scales as h \asymp n^{-1/(2\alpha+1)}, satisfies
E\|\hat m - m_0\|_2^2 \lesssim n^{-2\alpha/(2\alpha+1)}.
See Chapter 1.6.1 of Tsybakov (2009). This matches the lower bound in (28) (when d = 1).
In multiple dimensions, d > 1, local polynomials become tricky to fit, because of the explosion in the number of parameters needed to represent a kth-order polynomial in d variables. Hence, an interesting alternative is to return to kernel smoothing but use a higher-order kernel. Recall that a kernel function K is said to be of order k provided that
\int K(t) \, dt = 1, \qquad \int t^j K(t) \, dt = 0, \quad j = 1, \ldots, k - 1, \qquad \text{and} \quad 0 < \int t^k K(t) \, dt < \infty.
This means that the kernels we were looking at so far were of order 2.
Lastly, while local polynomial regression and higher-order kernel smoothing can help “track”
the derivatives of smooth functions m0 ∈ Hd (α, L), α ≥ 2, it should be noted that they don’t
share the same universal consistency property of kernel smoothing (or k-nearest-neighbors).
See Chapters 5.3 and 5.4 of Gyorfi et al. (2002).
9.1 Hilbert Spaces
A Hilbert space is a complete inner product space. A reproducing kernel Hilbert space
(RKHS) is simply a Hilbert space with extra structure that makes it very useful for statistics
and machine learning.
What has this got to do with kernels? Hang on; we’re getting there.
10 Mercer Kernels
Let us first define what a Mercer kernel is: it is a function K(x, y) of two variables that is symmetric and positive definite. This means that, for any function f,
\int\!\!\int K(x, y) f(x) f(y)\, dx\, dy \ge 0.
(This is like the definition of a positive definite matrix: x^T A x \ge 0 for each x.)
Suppose the evaluation functional L_x of the Hilbert space H, defined by L_x(f) = f(x), is continuous. Then we can make use of an important theorem, the Riesz representation theorem, which says that any continuous linear functional L_x on a Hilbert space has a representer K_x \in H such that
L_x(f) = \langle f, K_x \rangle_H.
Define K(x, y) = \langle K_x, K_y \rangle_H. By the symmetry of the inner product, this is a symmetric function. It can also be shown to be positive semi-definite, since
\int\!\!\int a(x) K(x, y) a(y)\, dx\, dy = \int\!\!\int a(x) \langle K_x, K_y \rangle_H\, a(y)\, dx\, dy = \Big\langle \int a(x) K_x\, dx,\; \int a(y) K_y\, dy \Big\rangle_H = \Big\| \int a(x) K_x\, dx \Big\|_H^2 \ge 0.
Thus the function specified using the Riesz representers is a Mercer kernel. We can also identify K(x, \cdot) with K_x(\cdot) as a function in H, since
K(x, y) = \langle K_x, K_y \rangle_H = L_y(K_x) = K_x(y).
So far, we have seen that any Hilbert space with continuous evaluation functionals is associated with a Mercer kernel that satisfies the reproducing property \langle f, K_x \rangle_H = f(x).
We can also go in the other direction: any Mercer kernel is associated with a Hilbert space with continuous evaluation functionals. Suppose we are given a Mercer kernel K. Let K_x(\cdot) be the function obtained by fixing the first coordinate, that is, K_x(y) = K(x, y). For the Gaussian kernel, K_x is a Gaussian bump centered at x. We can create functions by taking linear combinations of the kernel:
f(x) = \sum_{j=1}^{k} \alpha_j K_{x_j}(x).
For two such functions f = \sum_j \alpha_j K_{x_j} and g = \sum_k \beta_k K_{z_k}, define the inner product
\langle f, g \rangle_K = \sum_j \sum_k \alpha_j \beta_k K(x_j, z_k).
In general, f (and g) might be representable in more than one way. You can check that \langle f, g \rangle_K is independent of how f (or g) is represented. The inner product defines a norm:
\|f\|_K = \sqrt{\langle f, f \rangle} = \sqrt{\sum_j \sum_k \alpha_j \alpha_k K(x_j, x_k)} = \sqrt{\alpha^T K \alpha}.
It can be seen that the reproducing kernel property introduced earlier holds. Let f(x) = \sum_i \alpha_i K_{x_i}(x). Then we have
\langle f, K_x \rangle = \sum_i \alpha_i K(x_i, x) = f(x).
This follows from the definition of \langle f, g \rangle where we take g = K_x. This implies that K_x is the representer of the evaluation functional.
To verify that this is a well-defined Hilbert space, you should check that the following properties hold:
\langle f, g \rangle = \langle g, f \rangle,
\langle cf + dg, h \rangle = c\langle f, h \rangle + d\langle g, h \rangle,
\langle f, f \rangle = 0 \text{ iff } f = 0.
The last one is not obvious, so let us verify it here. It is easy to see that f = 0 implies that \langle f, f \rangle = 0. Now we must show that \langle f, f \rangle = 0 implies that f(x) = 0. So suppose that \langle f, f \rangle = 0. Pick any x. Then
0 \le f^2(x) = \langle f, K_x \rangle^2 \le \|f\|^2 \|K_x\|^2 = \langle f, f \rangle \|K_x\|^2 = 0,
where we used Cauchy–Schwarz. So 0 \le f^2(x) \le 0, which means that f(x) = 0.
10.1 Examples
Example 6 Let H be all functions f on R such that the support of the Fourier transform of f is contained in [-a, a]. Then
K(x, y) = \frac{\sin(a(y - x))}{a(y - x)}
and
\langle f, g \rangle = \int f g.
Example 8 The Sobolev space of order m is (roughly speaking) the set of functions f such that \int (f^{(m)})^2 < \infty. For m = 2 and X = [0, 1] the kernel is
K(x, y) = \begin{cases} 1 + xy + \frac{x y^2}{2} - \frac{y^3}{6} & 0 \le y \le x \le 1 \\ 1 + xy + \frac{x^2 y}{2} - \frac{x^3}{6} & 0 \le x \le y \le 1 \end{cases}
and
\|f\|_K^2 = f(0)^2 + f'(0)^2 + \int_0^1 (f''(x))^2 \, dx.
10.2 Spectral Representation, RKHS as Orthogonal Series
Suppose that \sup_{x,y} K(x, y) < \infty. Define eigenvalues \lambda_j and orthonormal eigenfunctions \psi_j by
\int K(x, y)\, \psi_j(y)\, dy = \lambda_j \psi_j(x).
Then \sum_j \lambda_j < \infty and \sup_x |\psi_j(x)| < \infty. Also,
K(x, y) = \sum_{j=1}^{\infty} \lambda_j \psi_j(x) \psi_j(y).
Define the feature map \Phi(x) = (\sqrt{\lambda_1}\psi_1(x), \sqrt{\lambda_2}\psi_2(x), \ldots). We can then see that K(x, y) = \langle \Phi(x), \Phi(y) \rangle is the \ell_2 inner product of the two \ell_2 sequences \Phi(x) and \Phi(y). The key advantage of an RKHS is that this inner product is made computationally feasible by just evaluating the kernel K(x, y).
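A discrete analogue of this spectral representation can be seen on a finite sample by eigendecomposing the kernel matrix; the Gaussian kernel, its width and the sample are illustrative choices, and this is only a finite-sample sketch of the Mercer expansion.

```python
import numpy as np

# Eigendecompose the kernel matrix; rows of V * sqrt(Lambda) act as a finite
# feature map, so feature inner products reproduce the kernel values.
rng = np.random.default_rng(9)
X = rng.uniform(size=(200, 1))
sigma = 0.3
sq = (X - X.T) ** 2
K = np.exp(-sq / sigma ** 2)                 # Gaussian kernel matrix

lam, V = np.linalg.eigh(K)                   # eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]
Phi = V * np.sqrt(np.clip(lam, 0, None))     # finite-dimensional feature map
print(np.max(np.abs(Phi @ Phi.T - K)))       # ~ 0: inner products recover K
print(lam[:5] / lam[0])                      # eigenvalues decay rapidly
```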
Thus, in any algorithm that uses its features x only via inner products \langle x_i, x_j \rangle, we can then replace the features \{x_i\} by their (infinite dimensional) feature maps \{\Phi(x_i)\}, and just substitute the linear feature inner products with the feature map inner products K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle and get a nonlinear version of the algorithm. This is called the "kernel trick"
since K(xi , xj ) is easy to compute, allowing us to turn a linear procedure into a nonlinear
procedure without adding much computation.
10.4 Representer Theorem
Let \ell be a loss function depending on (X_1, Y_1), \ldots, (X_n, Y_n) and on f(X_1), \ldots, f(X_n). Let f̂ minimize
\ell + g(\|f\|_K^2),
where g is any monotone increasing function. Then f̂ has the form
\hat f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)
for some \alpha_1, \ldots, \alpha_n.
Define m̂ to minimize
R = \sum_i (Y_i - m(X_i))^2 + \lambda \|m\|_K^2.
By the representer theorem, \hat m(x) = \sum_{i=1}^{n} \alpha_i K(X_i, x). Plugging this into R, we get
\hat\alpha = (K + \lambda I)^{-1} Y
and \hat m(x) = \sum_j \hat\alpha_j K(X_j, x). The fitted values are
\hat Y = K\hat\alpha = K(K + \lambda I)^{-1} Y = LY.
We can use cross-validation to choose \lambda. Compare this with smoothing kernel regression.
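A minimal sketch of this RKHS estimator (kernel ridge regression) with a Gaussian Mercer kernel; the kernel, its width and λ are illustrative choices.

```python
import numpy as np

def kernel_ridge(X, Y, lam, sigma):
    """alpha_hat = (K + lam I)^{-1} Y, m_hat(x) = sum_j alpha_j K(X_j, x)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq / sigma ** 2)                        # Gram matrix K_{ij} = K(X_i, X_j)
    alpha = np.linalg.solve(K + lam * np.eye(len(Y)), Y)
    def m_hat(x_new):
        sq_new = np.sum((x_new[:, None, :] - X[None, :, :]) ** 2, axis=2)
        return np.exp(-sq_new / sigma ** 2) @ alpha
    return m_hat

rng = np.random.default_rng(10)
X = rng.uniform(size=(200, 1))
Y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=200)
m_hat = kernel_ridge(X, Y, lam=0.1, sigma=0.2)
print(m_hat(np.array([[0.25], [0.75]])))
```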
One could also combine RKHS estimation with losses other than squared error, which we
discuss further when we consider nonparametric classification.
There are hidden tuning parameters in the RKHS. Consider the Gaussian kernel
K(x, y) = \exp\big(-\|x - y\|^2 / \sigma^2\big).
For nonparametric regression we minimize \sum_i (Y_i - m(X_i))^2 subject to \|m\|_K \le L. We control the bias-variance tradeoff by doing cross-validation over L. But what about \sigma?
This parameter seems to get mostly ignored. Suppose we have a uniform distribution on a
circle. The eigenfunctions of K(x, y) are the sines and cosines. The eigenvalues λk die off
like (1/\sigma)^{2k}. So \sigma affects the bias-variance tradeoff since it weights things towards lower
order Fourier functions. In principle we can compensate for this by varying L. But clearly
there is some interaction between L and σ. The practical effect is not well understood.
Now consider the polynomial kernel K(x, y) = (1 + \langle x, y \rangle)^d. This kernel has the same eigenfunctions but the eigenvalues decay at a polynomial rate depending on d. So there is an interaction between L, d, and the choice of kernel itself.
Gretton, Borgwardt, Rasch, Scholkopf and Smola (GBRSS 2008) show how to use kernels for two-sample testing. Suppose that
X_1, \ldots, X_m \sim P \quad \text{and} \quad Y_1, \ldots, Y_n \sim Q.
We want to test the null hypothesis H_0 : P = Q. Define
\hat M = \sup_{f \in F} \Big( \frac{1}{m}\sum_{i=1}^{m} f(X_i) - \frac{1}{n}\sum_{i=1}^{n} f(Y_i) \Big),
where F is the unit ball of the RKHS, \{f : \|f\|_K \le 1\}. Some calculations show that
\hat M^2 = \frac{1}{m^2}\sum_{j,k} K(X_j, X_k) - \frac{2}{mn}\sum_{j,k} K(X_j, Y_k) + \frac{1}{n^2}\sum_{j,k} K(Y_j, Y_k).
We reject H_0 if \hat M > t.
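A minimal sketch of this plug-in estimate of M̂² with a Gaussian kernel; the kernel width and the toy data are illustrative, and in practice the threshold t would be calibrated, e.g. by permutation.

```python
import numpy as np

def mmd2(X, Y, sigma):
    """Estimate of M^2: mean K(X,X) - 2 mean K(X,Y) + mean K(Y,Y)."""
    def gram(A, B):
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / (2 * sigma ** 2))
    return gram(X, X).mean() - 2 * gram(X, Y).mean() + gram(Y, Y).mean()

rng = np.random.default_rng(11)
X = rng.normal(size=(300, 1))
Y = rng.normal(loc=0.5, size=(300, 1))       # shifted distribution
print(mmd2(X, Y, sigma=1.0), mmd2(X, rng.normal(size=(300, 1)), sigma=1.0))
```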
There is a connection with smoothing kernels. Let
\hat f_X(u) = \frac{1}{m} \sum_{i=1}^{m} \kappa(X_i - u), \qquad \hat f_Y(u) = \frac{1}{n} \sum_{i=1}^{n} \kappa(Y_i - u)
be kernel density estimators built from a smoothing kernel \kappa. If we take the Mercer kernel K(x, y) = \int \kappa(x - u)\kappa(y - u)\, du, then \hat M^2 = \int (\hat f_X(u) - \hat f_Y(u))^2\, du, the squared L_2 distance between the two density estimates.
11 Cross-Validation
The estimators depend on the bandwidth h. Let R(h) denote the risk of m̂_h when bandwidth
h is used. We will estimate R(h) and then choose h to minimize this estimate. As we know,
the training error
\tilde R(h) = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat m_h(X_i))^2    (30)
is biased downwards. We will estimate the risk using cross-validation. The leave-one-out cross-validation score is
\hat R(h) = \frac{1}{n} \sum_{i=1}^{n} \big(Y_i - \hat m_{(-i)}(X_i)\big)^2,    (31)
where \hat m_{(-i)} is the estimator obtained by omitting the ith pair (X_i, Y_i), that is, \hat m_{(-i)}(x) = \sum_{j=1}^{n} Y_j \ell_{j,(-i)}(x) and
\ell_{j,(-i)}(x) = \begin{cases} 0 & \text{if } j = i \\ \dfrac{\ell_j(x)}{\sum_{k \ne i} \ell_k(x)} & \text{if } j \ne i. \end{cases}    (32)
Theorem 9 Let m̂ be a linear smoother. Then the leave-one-out cross-validation score \hat R(h) can be written as
\hat R(h) = \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{Y_i - \hat m_h(X_i)}{1 - L_{ii}} \Big)^2    (33)
where L_{ii} = \ell_i(X_i) is the ith diagonal element of the smoothing matrix L.
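Here is a sketch of the leave-one-out shortcut (33) for a kernel (linear) smoother, used to compare bandwidths; the Gaussian kernel and the bandwidth grid are illustrative.

```python
import numpy as np

def smoothing_matrix(X, h):
    """L[i, j] = ell_j(X_i), the Nadaraya-Watson weight of Y_j at X_i (Gaussian kernel)."""
    W = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
    return W / W.sum(axis=1, keepdims=True)

def loocv_score(X, Y, h):
    L = smoothing_matrix(X, h)
    fitted = L @ Y
    return np.mean(((Y - fitted) / (1 - np.diag(L))) ** 2)   # equation (33)

rng = np.random.default_rng(12)
X = np.sort(rng.uniform(size=300))
Y = np.sin(4 * X) + 0.3 * rng.normal(size=300)
hs = np.array([0.01, 0.02, 0.05, 0.1, 0.2])
scores = [loocv_score(X, Y, h) for h in hs]
print(hs[np.argmin(scores)], scores)        # bandwidth minimizing the LOOCV score
```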
Consider the function
m(x) = \sqrt{x(1 - x)}\, \sin\Big(\frac{2.1\pi}{x + 0.05}\Big),
which is called the Doppler function. This function is difficult to estimate and provides a good test case for nonparametric regression methods. The function is spatially inhomogeneous, which means that its smoothness (second derivative) varies over x. The function is plotted in the top left plot of Figure 9. The top right plot shows 1000 data points simulated from Y_i = m(i/n) + \sigma\epsilon_i with \sigma = 0.1 and \epsilon_i \sim N(0, 1). The bottom left plot shows the cross-validation score versus the effective degrees of freedom using local linear regression. The
minimum occurred at 166 degrees of freedom corresponding to a bandwidth of .005. The
fitted function is shown in the bottom right plot. The fit has high effective degrees of freedom
and hence the fitted function is very wiggly. This is because the estimate is trying to fit the
rapid fluctuations of the function near x = 0. If we used more smoothing, the right-hand side
of the fit would look better at the cost of missing the structure near x = 0. This is always a
problem when estimating spatially inhomogeneous functions.
Figure 9: The Doppler function estimated by local linear regression. The function (top left),
the data (top right), the cross-validation score versus effective degrees of freedom (bottom
left), and the fitted function (bottom right).
As with density estimation, stronger guarantees can be made using a data-splitting version of cross-validation. Suppose the data are (X_1, Y_1), \ldots, (X_{2n}, Y_{2n}). Now randomly split the data into two halves that we denote by
D = \{(\tilde X_1, \tilde Y_1), \ldots, (\tilde X_n, \tilde Y_n)\}
and
E = \{(X_1^*, Y_1^*), \ldots, (X_n^*, Y_n^*)\}.
Construct regression estimators M = \{m_1, \ldots, m_N\} from D. Define the risk estimator
\hat R(m_j) = \frac{1}{n} \sum_{i=1}^{n} |Y_i^* - m_j(X_i^*)|^2.
Finally, let
\hat m = \mathrm{argmin}_{m \in M} \hat R(m).
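A short sketch of this data-splitting selection, reusing a Nadaraya–Watson smoother as the family of candidate estimators indexed by bandwidth; the candidate grid and kernel are arbitrary choices.

```python
import numpy as np

def nw_fit(Xd, Yd, h):
    """Return a Nadaraya-Watson estimator (Gaussian kernel) trained on (Xd, Yd)."""
    def m_hat(x):
        w = np.exp(-0.5 * ((x[:, None] - Xd[None, :]) / h) ** 2)
        return (w @ Yd) / w.sum(axis=1)
    return m_hat

rng = np.random.default_rng(13)
X = rng.uniform(size=600)
Y = np.sin(4 * X) + 0.3 * rng.normal(size=600)

perm = rng.permutation(600)
D, E = perm[:300], perm[300:]                        # random split into halves
candidates = {h: nw_fit(X[D], Y[D], h) for h in [0.01, 0.03, 0.1, 0.3]}
risks = {h: np.mean((Y[E] - m(X[E])) ** 2) for h, m in candidates.items()}
best_h = min(risks, key=risks.get)                   # m_hat = argmin of estimated risk
print(best_h, risks)
```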
12 Linear Smoothers
An estimator m̂ is a linear smoother if, for each x, there are weights \ell_1(x), \ldots, \ell_n(x), not depending on Y_1, \ldots, Y_n, such that
\hat m(x) = \sum_{i=1}^{n} \ell_i(x) Y_i.
In particular, the vector of fitted values (\hat m(X_1), \ldots, \hat m(X_n))^T can be written as
LY,    (38)
where L_{ij} = \ell_j(X_i) and Y = (Y_1, \ldots, Y_n)^T. Kernel estimators and local polynomial estimators are examples of linear smoothers. For kernel estimators, \ell_i(x) = K(\|x - X_i\|/h) / \sum_{j=1}^{n} K(\|x - X_j\|/h). For local linear estimators, we can deduce the weights from the expression for \hat\beta(x). Here is an interesting fact: the following estimators are also linear smoothers: Gaussian process regression, splines, and RKHS estimators.
Example 12 You should not confuse linear smoothers with linear regression. In linear regression we assume that m(x) = x^T\beta. In fact, least squares linear regression is a special case of linear smoothing: if \hat\beta denotes the least squares estimator then \hat m(x) = x^T\hat\beta = x^T(X^T X)^{-1} X^T Y = \ell(x)^T Y where \ell(x)^T = x^T(X^T X)^{-1} X^T.
The matrix L defined in (38) is called the smoothing matrix. The ith row of L is called the
effective kernel for estimating m(Xi ). We define the effective degrees of freedom by
ν = tr(L). (39)
The effective degrees of freedom behave very much like the number of parameters in a linear
regression model.
Remark. The weights in all the smoothers we will use have the property that, for all x, \sum_{i=1}^{n} \ell_i(x) = 1. This implies that the smoother preserves constants.
13 Wavelets
Not every nonparametric regression estimate needs to be a linear smoother (though this
does seem to be very common), and wavelet smoothing is one of the leading nonlinear tools
for nonparametric estimation. The theory of wavelets is elegant and we only give a brief
introduction here; see Mallat (2008) for an excellent reference.
You can think of wavelets as defining an orthonormal function basis, with the basis functions
exhibiting a highly varied level of smoothness. Importantly, these basis functions also display
spatially localized smoothness at different locations in the input domain. There are actually
many different choices of wavelet bases (Haar wavelets, symmlets, etc.), but these are
details that we will not go into.
Consider basis functions \phi_1, \ldots, \phi_n evaluated over n equally spaced inputs over [0, 1]: X_i = i/n, i = 1, \ldots, n. Thus the inputs here are fixed and not random; such a setting is called the fixed design regression setting. The assumption of evenly spaced inputs is crucial for fast computations; we also typically assume with wavelets that n is a power of 2. The goal, given outputs y = (y_1, \ldots, y_n) over the evenly spaced input points, is to represent y as a sparse combination of the wavelet basis functions. To do so, we write the wavelet smoothing estimate in a familiar form, following our previous discussions on basis functions and regularization, using the basis matrix
W_{ij} = \phi_j(X_i), \quad i, j = 1, \ldots, n.
There are two popular wavelet estimates. The first, the hard-thresholding wavelet estimate, solves
\hat\theta = \mathrm{argmin}_{\theta \in R^n} \|y - W\theta\|_2^2 + \lambda^2 \|\theta\|_0,
and then the wavelet smoothing fitted values are \hat\mu = W\hat\theta. Here \|\theta\|_0 = \sum_{i=1}^{n} 1\{\theta_i \ne 0\}, the number of nonzero components of \theta, called the "\ell_0 norm". The second, the soft-thresholding wavelet estimate, solves
\hat\theta = \mathrm{argmin}_{\theta \in R^n} \|y - W\theta\|_2^2 + 2\lambda \|\theta\|_1,
and then the wavelet smoothing fitted values are \hat\mu = W\hat\theta. Here \|\theta\|_1 = \sum_{i=1}^{n} |\theta_i|, the \ell_1 norm.
For both of these, we first perform a wavelet transform (multiply by W^T):
\tilde\theta = W^T y,
then threshold:
\hat\theta = T_\lambda(\tilde\theta),
to get our wavelet parameter estimates. To get the prediction estimate, we then perform an inverse wavelet transform (multiply by W):
\hat\mu = W\hat\theta.
The wavelet and inverse wavelet transforms (multiplication by W^T and W) each require O(n) operations, and are practically extremely fast due to clever pyramidal multiplication schemes that exploit the special structure of wavelets.
Here T_\lambda denotes either hard-thresholding, i.e.,
[T_\lambda^{hard}(z)]_i = z_i \cdot 1\{|z_i| \ge \lambda\}, \quad i = 1, \ldots, n,
or soft-thresholding, i.e.,
[T_\lambda^{soft}(z)]_i = \big(z_i - \mathrm{sign}(z_i)\lambda\big) \cdot 1\{|z_i| \ge \lambda\}, \quad i = 1, \ldots, n.
These thresholding functions are both also O(n), and computationally trivial, making wavelet smoothing very fast overall.
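Below is a self-contained sketch using an explicit orthonormal Haar basis matrix (so the transforms are plain matrix multiplications rather than the fast O(n) pyramid algorithm); the test signal and threshold level are illustrative.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar transform matrix (rows are basis functions); n must be a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        m = H.shape[0]
        H = np.vstack([np.kron(H, [1.0, 1.0]),
                       np.kron(np.eye(m), [1.0, -1.0])]) / np.sqrt(2.0)
    return H

def wavelet_smooth(y, lam, soft=True):
    W = haar_matrix(len(y)).T                 # columns play the role of phi_j at the inputs
    theta = W.T @ y                           # wavelet transform
    if soft:
        theta_hat = np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)
    else:
        theta_hat = theta * (np.abs(theta) >= lam)   # hard thresholding
    return W @ theta_hat                      # inverse transform: mu_hat

n = 256
x = np.arange(1, n + 1) / n
mu = np.sqrt(x * (1 - x)) * np.sin(2.1 * np.pi / (x + 0.05))   # Doppler-type signal
rng = np.random.default_rng(14)
y = mu + 0.1 * rng.normal(size=n)
mu_hat = wavelet_smooth(y, lam=0.1 * np.sqrt(2 * np.log(n)))   # universal-type threshold
```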
We should emphasize that wavelet smoothing is not a linear smoother, i.e., there is no single matrix S such that \hat\mu = Sy for all y.
Apart from its computational efficiency, an important strength of wavelet smoothing is that
it can represent a signal that has a spatially heterogeneous degree of smoothness, i.e., it can
be both smooth and wiggly at different regions of the input domain. The reason that wavelet
smoothing can achieve such local adaptivity is because it selects a sparse number of wavelet
basis functions, by thresholding the coefficients from a basis regression.
We can make this more precise by considering convergence rates over an appropriate function class. In particular, we define the total variation class M(k, C), for an integer k \ge 0 and C > 0, to contain all k times (weakly) differentiable functions whose kth derivative satisfies
TV(f^{(k)}) = \sup_{0 = z_1 < z_2 < \cdots < z_N < z_{N+1} = 1} \sum_{j=1}^{N} |f^{(k)}(z_{j+1}) - f^{(k)}(z_j)| \le C.
(Note that if f has k + 1 continuous derivatives, then TV(f^{(k)}) = \int_0^1 |f^{(k+1)}(x)| \, dx.)
Wavelet smoothing (with an appropriate choice of \lambda) converges at the rate n^{-(2k+2)/(2k+3)} over M(k, C), which is the minimax optimal rate over this class. (For a translation of this result to the notation of the current setting, see Tibshirani (2014).) Donoho & Johnstone (1998) showed that the minimax error over M(k, C), restricted to linear smoothers, is much larger: it is of order n^{-(2k+1)/(2k+2)}.
Practically, the differences between wavelets and linear smoothers in problems with spatially
heterogeneous smoothness can be striking as well. However, you should keep in mind that
wavelets are not perfect: a shortcoming is that they require a highly restrictive setup: recall that they require evenly spaced inputs and n to be a power of 2, and there are often further assumptions made about the behavior of the fitted function at the boundaries of the input domain.
Also, though you might say they marked the beginning of the story, wavelets are not the end
of the story when it comes to local adaptivity. The natural thing to do, it might seem, is
to make (say) kernel smoothing or smoothing splines more locally adaptive by allowing for
a local bandwidth parameter or a local penalty parameter. People have tried this, but it is difficult both theoretically and practically to get right.
References
de Boor, C. (1978), A Practical Guide to Splines, Springer.
Donoho, D. L. & Johnstone, I. (1998), 'Minimax estimation via wavelet shrinkage', Annals of Statistics 26(3), 879-921.
Fan, J. & Gijbels, I. (1996), Local Polynomial Modelling and Its Applications, Monographs on Statistics and Applied Probability 66, Chapman & Hall/CRC Press.
Green, P. & Silverman, B. (1994), Nonparametric Regression and Generalized Linear Models:
A Roughness Penalty Approach, Chapman & Hall/CRC Press.
Gyorfi, L., Kohler, M., Krzyzak, A. & Walk, H. (2002), A Distribution-Free Theory of
Nonparametric Regression, Springer.
Härdle, W. K., Müller, M., Sperlich, S. & Werwatz, A. (2012), Nonparametric and semi-
parametric models, Springer Science & Business Media.
Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning; Data
Mining, Inference and Prediction, Springer. Second edition.
Mallat, S. (2008), A wavelet tour of signal processing, Academic Press. Third edition.
Tibshirani, R. J. (2014), 'Adaptive piecewise polynomial estimation via trend filtering', Annals of Statistics 42(1), 285-323.
Tsybakov, A. (2009), Introduction to Nonparametric Estimation, Springer.
van de Geer, S. (2000), Empirical Processes in M-Estimation, Cambridge University Press.
Wasserman, L. (2006), All of Nonparametric Statistics, Springer.