
Kernel Regression

Advanced Methods for Data Analysis (36-402/36-608)


Spring 2014

1 Linear smoothers and kernels


• Recall our basic setup: we are given i.i.d. samples (x_i, y_i), i = 1, . . . , n from the model

  y_i = r(x_i) + ε_i,   i = 1, . . . , n,

and our goal is to estimate r with some function r̂. Assume for now that each x_i ∈ R (i.e., the predictors are 1-dimensional)
• We talked about considering r̂ in the class of linear smoothers, so that

  r̂(x) = Σ_{i=1}^n w(x, x_i) · y_i        (1)

for some choice of weights w(x, x_i). Indeed, both linear regression and k-nearest-neighbors are special cases of this
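To make the linear smoother form (1) concrete, here is a minimal Python sketch of k-nearest-neighbors written as a choice of weights w(x, x_i). This is not from the notes; the names knn_weights and linear_smoother are made up for illustration.

```python
import numpy as np

def knn_weights(x, xs, k=5):
    """k-nearest-neighbors as a linear smoother: weight 1/k on the k
    training points x_i closest to the query x, weight 0 on the rest."""
    nearest = np.argsort(np.abs(xs - x))[:k]   # indices of the k closest x_i
    w = np.zeros(len(xs))
    w[nearest] = 1.0 / k
    return w

def linear_smoother(x, xs, ys, weight_fn):
    """Generic linear smoother: r_hat(x) = sum_i w(x, x_i) * y_i."""
    return np.sum(weight_fn(x, xs) * ys)

# e.g., given training arrays xs and ys and a query point x0:
# r_hat = linear_smoother(x0, xs, ys, knn_weights)
```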
• Here we will examine another important linear smoother, called kernel smoothing or kernel
regression. We start by defining a kernel function K : R → R, satisfying
  ∫ K(x) dx = 1,   K(x) = K(−x)

• Three common examples are the box kernel:

  K(x) = 1/2 if |x| ≤ 1, and 0 otherwise,

the Gaussian kernel:

  K(x) = (1/√(2π)) exp(−x²/2),

and the Epanechnikov kernel:

  K(x) = (3/4)(1 − x²) if |x| ≤ 1, and 0 otherwise
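For reference, here is one way these three kernels might be written in Python (a quick sketch, not part of the notes):

```python
import numpy as np

def box_kernel(x):
    """Box kernel: 1/2 on [-1, 1], zero outside."""
    return np.where(np.abs(x) <= 1, 0.5, 0.0)

def gaussian_kernel(x):
    """Gaussian kernel: the standard normal density, which integrates to 1."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(x):
    """Epanechnikov kernel: (3/4)(1 - x^2) on [-1, 1], zero outside."""
    return np.where(np.abs(x) <= 1, 0.75 * (1 - x**2), 0.0)
```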

• Given a choice of kernel K, and a bandwidth h, kernel regression is defined by taking

  w(x, x_i) = K((x_i − x)/h) / Σ_{j=1}^n K((x_j − x)/h)

in the linear smoother form (1). In other words, the kernel regression estimator is

  r̂(x) = [ Σ_{i=1}^n K((x_i − x)/h) · y_i ] / [ Σ_{i=1}^n K((x_i − x)/h) ]
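A minimal sketch of this estimator in Python might look like the following; the function name kernel_regression and the default Gaussian kernel are just illustrative choices, not prescribed by the notes.

```python
import numpy as np

def gaussian_kernel(x):
    """Standard normal density."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def kernel_regression(x0, xs, ys, h, kernel=gaussian_kernel):
    """Nadaraya-Watson estimate r_hat(x0): a weighted average of the y_i,
    with weights K((x_i - x0)/h) normalized to sum to 1."""
    k = kernel((xs - x0) / h)
    return np.sum(k * ys) / np.sum(k)
```

One practical note: with a compactly supported kernel (box or Epanechnikov), the denominator can be exactly zero when no x_i lies within distance h of the query point, so the estimate is undefined there; the Gaussian kernel avoids this since its weights are never exactly zero.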

• What is this doing? This is a weighted average of the y_i values. Think about laying down a Gaussian kernel centered at a specific query point x, and evaluating its height at each x_i in order to determine the weight associated with y_i
• Because these weights are smoothly varying with x, the kernel regression estimator r̂(x) itself
is also smoothly varying with x; compare this to k-nearest-neighbors regression

• What’s in the choice of kernel? Different kernels can give different results. But many of the
common kernels tend to produce similar estimators; e.g., Gaussian vs. Epanechnikov, there’s
not a huge difference
• A much bigger difference comes from choosing different bandwidth values h. What’s the
tradeoff present when we vary h? Hint: as we’ve mentioned before, you should always keep
these two quantities in mind ...
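Before answering, it can help to see the effect of h directly. The snippet below is a rough illustration on made-up synthetic data (nothing here comes from the notes), fitting the same points at two extreme bandwidths:

```python
import numpy as np

def nw_fit(x0, xs, ys, h):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel and bandwidth h."""
    k = np.exp(-((xs - x0) / h) ** 2 / 2)   # unnormalized Gaussian weights
    return np.sum(k * ys) / np.sum(k)

# Made-up data: noisy samples around a sine curve.
rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0, 3, size=100))
ys = np.sin(2 * xs) + rng.normal(scale=0.3, size=100)
grid = np.linspace(0, 3, 200)

fit_wiggly = [nw_fit(x0, xs, ys, h=0.02) for x0 in grid]   # small h: chases the noise
fit_smooth = [nw_fit(x0, xs, ys, h=1.5) for x0 in grid]    # large h: flattens out the curve
```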

2 Bias and variance of kernels


• At a fixed query point x, recall our fundamental decomposition

  E[TestErr(r̂(x))] = E[ (Y − r̂(x))² | X = x ] = σ² + Bias(r̂(x))² + Var(r̂(x)).

So what are the bias and variance of the kernel regression estimator?

• Fortunately, these can be worked out theoretically, at least roughly, under some smoothness assumptions on r (and other assumptions). We can show that
  Bias(r̂(x))² = ( E[r̂(x)] − r(x) )² ≤ C1 h²

and

  Var(r̂(x)) ≤ C2/(nh),
for some constants C1 and C2 . Does this make sense? What happens to the bias and variance
as h shrinks? As h grows?
• This means that

  E[TestErr(r̂(x))] = σ² + C1 h² + C2/(nh).

We can find the best bandwidth h, i.e., the one minimizing test error, by differentiating and setting equal to 0: this yields

  h = ( C2/(2 C1 n) )^{1/3}.
Is this a realistic choice for the bandwidth? The problem is that we don't know C1 and C2! (And even if we did, it may not be a good idea to use this ... why?)
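Filling in the algebra behind that formula (a step the notes leave implicit): differentiating the test error bound in h and setting the derivative to zero gives

  d/dh [ C1 h² + C2/(nh) ] = 2 C1 h − C2/(n h²) = 0  ⟹  h³ = C2/(2 C1 n)  ⟹  h = ( C2/(2 C1 n) )^{1/3}.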

3 Practical considerations, multiple dimensions


• In practice, we tend to select h by, you guessed it, cross-validation
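A rough sketch of what that could look like, using leave-one-out cross-validation with a Gaussian kernel; the function name, the kernel choice, and the candidate grid are all made up for illustration, not prescribed by the notes.

```python
import numpy as np

def loocv_error(h, xs, ys):
    """Leave-one-out CV error for bandwidth h: predict each y_i from the
    other n - 1 points with a Gaussian kernel, then average squared errors."""
    n = len(xs)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        k = np.exp(-((xs[mask] - xs[i]) / h) ** 2 / 2)
        pred = np.sum(k * ys[mask]) / np.sum(k)
        errs.append((ys[i] - pred) ** 2)
    return np.mean(errs)

# Given training arrays xs and ys, pick h from a (made-up) grid of candidates:
# hs = np.linspace(0.05, 1.0, 20)
# best_h = hs[np.argmin([loocv_error(h, xs, ys) for h in hs])]
```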
• Kernels can actually suffer from bad bias at the boundaries of the predictor range ... why? Think of the asymmetry of the weights there

• In multiple dimensions, say with each x_i ∈ R^p, we can easily use kernels: we just replace x_i − x in the kernel argument by ‖x_i − x‖₂, so that the multivariate kernel regression estimator is

  r̂(x) = [ Σ_{i=1}^n K(‖x_i − x‖₂/h) · y_i ] / [ Σ_{i=1}^n K(‖x_i − x‖₂/h) ]
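Compared with the 1-dimensional sketch earlier, the only change is the distance computation; hypothetically, in Python:

```python
import numpy as np

def kernel_regression_multi(x0, X, ys, h, kernel):
    """Multivariate Nadaraya-Watson: a weighted average of the y_i, with the
    kernel applied to the scaled Euclidean distances ||x_i - x0||_2 / h.
    X has shape (n, p); x0 has shape (p,)."""
    dists = np.linalg.norm(X - x0, axis=1)   # ||x_i - x0||_2 for each row x_i
    k = kernel(dists / h)
    return np.sum(k * ys) / np.sum(k)
```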

• The same calculations as those that went into producing the bias and variance bounds above
can be done in this multivariate case, showing that

  Bias(r̂(x))² ≤ C̃1 h²

and

  Var(r̂(x)) ≤ C̃2/(nh^p).
Why is the variance so strongly affected now by the dimension p? What is the optimal h, now?
• A little later we'll see an alternative extension to higher dimensions that doesn't suffer nearly the same variance; this is called an additive model
