Kernel
for some choice of weights w(x, xi). Indeed, both linear regression and k-nearest-neighbors are
special cases of this.
• Here we will examine another important linear smoother, called kernel smoothing or kernel
regression. We start by defining a kernel function K : R → R, satisfying
\[
\int K(x)\, dx = 1, \qquad K(x) = K(-x)
\]
Given a bandwidth h > 0, we use the weights w(x, xi) = K((xi − x)/h) / Σj K((xj − x)/h), i.e., the kernel weights normalized to sum to 1, in the linear smoother form (1). In other words, the kernel regression estimator is
\[
\hat{r}(x) = \frac{\sum_{i=1}^n K\left(\frac{x_i - x}{h}\right) y_i}{\sum_{i=1}^n K\left(\frac{x_i - x}{h}\right)}
\]
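• As a concrete illustration (not from the original notes; the function and variable names below are just for this sketch), here is a minimal Python version of the estimator above, assuming a Gaussian kernel:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel: symmetric and integrates to 1."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kernel_regression(x0, x, y, h, K=gaussian_kernel):
    """Estimate r_hat(x0) as a weighted average of the y_i,
    with weights K((x_i - x0)/h) normalized to sum to 1."""
    w = K((x - x0) / h)               # one unnormalized weight per training point
    return np.sum(w * y) / np.sum(w)  # dividing by the total makes the weights sum to 1

# Tiny synthetic example (illustrative data, not from the notes)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=100)
print(kernel_regression(0.5, x, y, h=0.1))
```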
• What is this doing? This is a weighted average of the yi values. Think about laying down a
Gaussian kernel centered at a specific query point x, and evaluating its height at each xi in order
to determine the weight associated with yi
• Because these weights are smoothly varying with x, the kernel regression estimator r̂(x) itself
is also smoothly varying with x; compare this to k-nearest-neighbors regression
• What’s in the choice of kernel? Different kernels can give different results. But many of the
common kernels tend to produce similar estimators; e.g., Gaussian vs. Epanechnikov, there’s
not a huge difference
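• For reference, the two kernels just mentioned can be written down explicitly; the short Python sketch below (an illustration, not part of the notes) shows both, and either can be plugged into the estimator above:

```python
import numpy as np

def gaussian_kernel(u):
    # K(u) = exp(-u^2 / 2) / sqrt(2*pi): positive everywhere, decays smoothly
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(u):
    # K(u) = (3/4)(1 - u^2) for |u| <= 1, and 0 outside that interval
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
```

Both integrate to 1 and are symmetric; in practice the resulting fits are usually very close, which is the point made above.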
• A much bigger difference comes from choosing different bandwidth values h. What’s the
tradeoff present when we vary h? Hint: as we’ve mentioned before, you should always keep
these two quantities in mind ...
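• As a quick numerical illustration of this tradeoff (a sketch with made-up synthetic data, not from the notes), one can compare the error of kernel regression fits at a too-small, a moderate, and a too-large bandwidth:

```python
import numpy as np

def fit_curve(x_grid, x, y, h):
    """Gaussian-kernel regression fit evaluated over a grid of query points."""
    K = lambda u: np.exp(-0.5 * u**2)          # kernel constants cancel in the ratio
    W = K((x[None, :] - x_grid[:, None]) / h)  # (n_grid, n) matrix of unnormalized weights
    return (W @ y) / W.sum(axis=1)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.5, size=100)
grid = np.linspace(0.05, 0.95, 50)
for h in (0.005, 0.05, 0.5):  # small h: wiggly (high variance); large h: oversmoothed (high bias)
    err = np.mean((fit_curve(grid, x, y, h) - np.sin(2 * np.pi * grid))**2)
    print(h, err)
```

Typically the error is smallest at an intermediate bandwidth, which is exactly the bias-variance story developed next.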
• Fortunately, these can actually roughly be worked out theoretically, under some smoothness
assumptions on r (and other assumptions). We can show that
\[
\mathrm{Bias}(\hat{r}(x))^2 = \bigl( \mathbb{E}[\hat{r}(x)] - r(x) \bigr)^2 \leq C_1 h^2
\]
and
\[
\mathrm{Var}(\hat{r}(x)) \leq \frac{C_2}{n h},
\]
for some constants C1 and C2 . Does this make sense? What happens to the bias and variance
as h shrinks? As h grows?
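• A heuristic way to see where these bounds come from (a rough sketch under stated assumptions, not the formal argument): suppose the xi are fixed, the noise has variance σ², r is Lipschitz with constant L, and the kernel effectively gives weight only to the roughly n·h points within distance about h of x. Since the weights w(x, xi) = K((xi − x)/h) / Σj K((xj − x)/h) sum to 1,
\[
\bigl| \mathbb{E}[\hat{r}(x)] - r(x) \bigr|
= \Bigl| \sum_{i=1}^n w(x, x_i) \bigl( r(x_i) - r(x) \bigr) \Bigr|
\lesssim L \, h,
\qquad
\mathrm{Var}(\hat{r}(x)) = \sigma^2 \sum_{i=1}^n w(x, x_i)^2 \approx \frac{\sigma^2}{n h},
\]
because roughly n·h points share the weight, so Σi w(x, xi)² ≈ 1/(nh). Squaring the first display gives the h² bound on the squared bias.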
• This means that
\[
\mathbb{E}[\mathrm{TestErr}(\hat{r}(x))] \leq \sigma^2 + C_1 h^2 + \frac{C_2}{n h}.
\]
We can find the best bandwidth h, i.e., the one minimizing this bound on the test error, by differentiating and
setting equal to 0: this yields
\[
h = \left( \frac{C_2}{2 C_1 n} \right)^{1/3}.
\]
Is this a realistic choice for the bandwidth? The problem is that we don’t know C1 and C2!
(And even if we did, it may not be a good idea to use this ... why?)
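• For completeness, the calculus behind that choice (a one-line sketch, using only the bound above):
\[
\frac{d}{dh} \Bigl( C_1 h^2 + \frac{C_2}{n h} \Bigr) = 2 C_1 h - \frac{C_2}{n h^2} = 0
\quad \Longrightarrow \quad
h^3 = \frac{C_2}{2 C_1 n}
\quad \Longrightarrow \quad
h = \Bigl( \frac{C_2}{2 C_1 n} \Bigr)^{1/3},
\]
and plugging this h back in shows that the excess test error (everything beyond σ²) scales like n^{-2/3}.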
• In multiple dimensions, say, each xi ∈ Rp, we can still easily use kernels: we just replace xi − x in
the kernel argument by ‖xi − x‖2, so that the multivariate kernel regression estimator is
\[
\hat{r}(x) = \frac{\sum_{i=1}^n K\left(\frac{\|x_i - x\|_2}{h}\right) y_i}{\sum_{i=1}^n K\left(\frac{\|x_i - x\|_2}{h}\right)}
\]
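• A minimal multivariate sketch in Python (illustrative only; it assumes a Gaussian kernel, with the rows of X playing the role of the xi ∈ Rp):

```python
import numpy as np

def multivariate_kernel_regression(x0, X, y, h):
    """Kernel regression in R^p: the kernel is applied to ||x_i - x0||_2 / h.
    X has shape (n, p), x0 has shape (p,)."""
    dists = np.linalg.norm(X - x0, axis=1)   # Euclidean distances ||x_i - x0||_2
    w = np.exp(-0.5 * (dists / h)**2)        # Gaussian kernel of the scaled distances
    return np.sum(w * y) / np.sum(w)

# Illustrative data in p = 3 dimensions
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(500, 3))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=500)
print(multivariate_kernel_regression(np.array([0.5, 0.5, 0.5]), X, y, h=0.3))
```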
• The same calculations as those that went into producing the bias and variance bounds above
can be done in this multivariate case, showing that
\[
\mathrm{Bias}(\hat{r}(x))^2 \leq \tilde{C}_1 h^2
\]
and
\[
\mathrm{Var}(\hat{r}(x)) \leq \frac{\tilde{C}_2}{n h^p}.
\]
Why is the variance so strongly affected now by the dimension p? What is the optimal h, now?
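• For intuition (a heuristic sketch, not from these notes): a ball of radius about h around x contains only on the order of n·h^p of the points, so far fewer observations get averaged for the same h, which is why the variance blows up with p. Repeating the earlier calculus with the new variance bound,
\[
\frac{d}{dh} \Bigl( \tilde{C}_1 h^2 + \frac{\tilde{C}_2}{n h^p} \Bigr)
= 2 \tilde{C}_1 h - \frac{p \, \tilde{C}_2}{n h^{p+1}} = 0
\quad \Longrightarrow \quad
h = \Bigl( \frac{p \, \tilde{C}_2}{2 \tilde{C}_1 n} \Bigr)^{1/(p+2)},
\]
and the resulting excess error scales like n^{-2/(p+2)}, which degrades rapidly as p grows.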
• A little later we’ll see an alternative extension to higher dimensions that doesn’t suffer nearly
the same variance problem; this is called an additive model