Nonparametric regression
9.1 Introduction
Let $(X_1, Y_1), \cdots, (X_n, Y_n)$ be a bivariate random sample. In regression analysis, we are often interested in the regression function
$$m(x) = E(Y \mid X = x).$$
Sometimes, we will write
$$Y_i = m(X_i) + \epsilon_i,$$
where $\epsilon_i$ is mean-zero noise. The simple linear regression model assumes that $m(x) = \beta_0 + \beta_1 x$, where $\beta_0$ and $\beta_1$ are the intercept and slope parameters. In this lecture, we will talk about methods that directly estimate the regression function $m(x)$ without imposing any parametric form on $m(x)$.
We start with a very simple but extremely popular method called the regressogram, often referred to as the binning approach. You can view it as the regression analogue of the histogram.
For simplicity, we assume that the covariates $X_i$'s are from a distribution over $[0, 1]$.

Similar to the histogram, we first choose $M$, the number of bins. Then we partition the interval $[0, 1]$ into $M$ equal-width bins:
$$B_1 = \left[0, \tfrac{1}{M}\right),\quad B_2 = \left[\tfrac{1}{M}, \tfrac{2}{M}\right),\quad \cdots,\quad B_{M-1} = \left[\tfrac{M-2}{M}, \tfrac{M-1}{M}\right),\quad B_M = \left[\tfrac{M-1}{M}, 1\right].$$
When $x \in B_\ell$, we estimate $m(x)$ by
$$\widehat{m}_M(x) = \frac{\sum_{i=1}^n Y_i\, I(X_i \in B_\ell)}{\sum_{i=1}^n I(X_i \in B_\ell)} = \text{average of the responses whose covariates are in the same bin as } x.$$
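As an illustration of the binning idea, here is a minimal Python sketch of the regressogram; the function name and the simulated data are illustrative, not from the lecture:

import numpy as np

def regressogram(x0, X, Y, M):
    # Regressogram estimate of m(x0) with M equal-width bins on [0, 1].
    bin_idx = min(int(x0 * M), M - 1)                      # bin containing x0
    in_bin = np.minimum((X * M).astype(int), M - 1) == bin_idx
    if not in_bin.any():
        return np.nan                                      # empty bin: estimate undefined
    return Y[in_bin].mean()                                # average response in that bin

# Example usage on simulated data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=200)
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=200)
print(regressogram(0.25, X, Y, M=10))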
Given a point $x_0$, assume that we are interested in the value $m(x_0)$. Here is a simple method to estimate that value. When $m$ is smooth, an observation $X_i \approx x_0$ implies $m(X_i) \approx m(x_0)$. Thus, the response value $Y_i = m(X_i) + \epsilon_i \approx m(x_0) + \epsilon_i$. Using this observation, to reduce the noise $\epsilon_i$, we can use the sample average. Thus, an estimator of $m(x_0)$ is the average of those responses whose covariates are close to $x_0$.
To make it more concrete, let $h > 0$ be a threshold. The above procedure suggests using
$$\widehat{m}_{\mathrm{loc}}(x_0) = \frac{\sum_{i:|X_i - x_0| \le h} Y_i}{n_h(x_0)} = \frac{\sum_{i=1}^n Y_i\, I(|X_i - x_0| \le h)}{\sum_{i=1}^n I(|X_i - x_0| \le h)}, \tag{9.1}$$
where $n_h(x_0)$ is the number of observations whose covariate satisfies $|X_i - x_0| \le h$. This estimator, $\widehat{m}_{\mathrm{loc}}$, is called the local average estimator. Indeed, to estimate $m(x)$ at any given point $x$, we are using a local average as an estimator.
The local average estimator can be rewritten as
$$\widehat{m}_{\mathrm{loc}}(x_0) = \frac{\sum_{i=1}^n Y_i\, I(|X_i - x_0| \le h)}{\sum_{i=1}^n I(|X_i - x_0| \le h)} = \sum_{i=1}^n \frac{I(|X_i - x_0| \le h)}{\sum_{\ell=1}^n I(|X_\ell - x_0| \le h)}\cdot Y_i = \sum_{i=1}^n W_i(x_0)\, Y_i, \tag{9.2}$$
where
$$W_i(x_0) = \frac{I(|X_i - x_0| \le h)}{\sum_{\ell=1}^n I(|X_\ell - x_0| \le h)} \tag{9.3}$$
is a weight for each observation. Note that $\sum_{i=1}^n W_i(x_0) = 1$ and $W_i(x_0) \ge 0$ for all $i = 1, \cdots, n$; this implies that the $W_i(x_0)$'s are indeed weights. Equation (9.2) shows that the local average estimator can be written as a weighted average estimator, so the $i$-th weight $W_i(x_0)$ determines the contribution of the response $Y_i$ to the estimator $\widehat{m}_{\mathrm{loc}}(x_0)$.
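As a quick illustration of the weighted-average form (9.2)-(9.3), here is a minimal Python sketch (the function names are mine, not from the lecture):

import numpy as np

def local_average_weights(x0, X, h):
    # Hard-threshold weights W_i(x0) of equation (9.3); assumes at least one X_i within h of x0.
    indicator = (np.abs(X - x0) <= h).astype(float)
    return indicator / indicator.sum()

def local_average(x0, X, Y, h):
    # Local average estimator of equation (9.2): a weighted average of the responses.
    return np.sum(local_average_weights(x0, X, h) * Y)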
In constructing the local average estimator, we are placing a hard threshold on the neighboring points: those within a distance $h$ are given equal weight, while those outside the threshold $h$ are ignored completely. This hard thresholding leads to an estimator that is not continuous.
To avoid this problem, we consider another construction of the weights. Ideally, we want to give more weight to those observations that are close to $x_0$, and we want the weights to be 'smooth'. The Gaussian function $G(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ seems to be a good candidate. We now use the Gaussian function to construct an estimator. We first construct the weight
$$W_i(x_0) = \frac{G\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n G\!\left(\frac{x_0 - X_\ell}{h}\right)}.$$
The quantity $h > 0$ plays a role similar to the threshold in the local average, but now it acts as the smoothing bandwidth of the Gaussian. After constructing the weights, our new estimator is
$$\widehat{m}_G(x_0) = \sum_{i=1}^n W_i(x_0)\, Y_i = \sum_{i=1}^n \frac{G\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n G\!\left(\frac{x_0 - X_\ell}{h}\right)}\, Y_i = \frac{\sum_{i=1}^n Y_i\, G\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n G\!\left(\frac{x_0 - X_\ell}{h}\right)}. \tag{9.4}$$
This new estimator has weights that change more smoothly than those of the local average, and the resulting estimate is smooth, as desired.
Comparing equations (9.1) and (9.4), one may notice that these local estimators are all of a similar form:
$$\widehat{m}_h(x_0) = \frac{\sum_{i=1}^n Y_i\, K\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n K\!\left(\frac{x_0 - X_\ell}{h}\right)} = \sum_{i=1}^n W^K_i(x_0)\, Y_i, \qquad W^K_i(x_0) = \frac{K\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n K\!\left(\frac{x_0 - X_\ell}{h}\right)}, \tag{9.5}$$
where $K$ is some function. When $K$ is a Gaussian, we obtain the estimator (9.4); when $K$ is uniform over $[-1, 1]$, we obtain the local average (9.1). The estimator in equation (9.5) is called the kernel regression estimator or the Nadaraya-Watson estimator$^1$. The function $K$ plays a similar role to the kernel function in the KDE and thus is also called the kernel function. The quantity $h > 0$ is similar to the smoothing bandwidth in the KDE, so it is also called the smoothing bandwidth.
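As an illustration, here is a minimal Python sketch of the kernel regression estimator (9.5); the function name and the vectorized form are my own choices, not from the lecture:

import numpy as np

def nadaraya_watson(x0, X, Y, h, kernel="gaussian"):
    # Kernel regression (Nadaraya-Watson) estimate of m(x0), equation (9.5).
    u = (x0 - X) / h
    if kernel == "gaussian":
        K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)   # Gaussian kernel G, giving (9.4)
    else:
        K = (np.abs(u) <= 1).astype(float)           # uniform kernel, giving the local average (9.1)
    return np.sum(K * Y) / np.sum(K)                 # weighted average of the responses

# Example usage: evaluate the estimate on a grid.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=300)
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=300)
grid = np.linspace(0.05, 0.95, 19)
m_hat = np.array([nadaraya_watson(x, X, Y, h=0.1) for x in grid])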
9.3.1 Theory
Bias. The bias of the kernel regression estimator is
$$\mathrm{bias}(\widehat{m}_h(x)) = \frac{h^2}{2}\,\mu_K\left(m''(x) + 2\,\frac{m'(x)\,p'(x)}{p(x)}\right) + o(h^2),$$
where $p(x)$ is the probability density function of the covariates $X_1, \cdots, X_n$ and $\mu_K = \int x^2 K(x)\,dx$ is the same constant of the kernel function as in the KDE.

The bias has two components: a curvature component $m''(x)$ and a design component $\frac{m'(x)\,p'(x)}{p(x)}$. The curvature component is similar to the one in the KDE: when the regression function curves a lot, kernel smoothing will smooth out the structure, introducing some bias. The second component, also known as the design bias, is a new component compared to the bias in the KDE. This component depends on the density of the covariates, $p(x)$. Note that in some studies we can choose the values of the covariates, so the density $p(x)$ is also called the design (this is why this component is known as the design bias).
Variance. The variance of the estimator is
$$\mathrm{Var}(\widehat{m}_h(x)) = \frac{\sigma^2\cdot\sigma_K^2}{p(x)}\cdot\frac{1}{nh} + o\!\left(\frac{1}{nh}\right),$$
where $\sigma^2 = \mathrm{Var}(\epsilon_i)$ is the noise level and $\sigma_K^2 = \int K^2(x)\,dx$ is a constant of the kernel function (the same as in the KDE). This expression tells us the possible sources of variance. First, the variance increases when $\sigma^2$ increases. This makes perfect sense because $\sigma^2$ is the noise level; when the noise level is large, we expect the estimation error to increase. Second, the density of the covariates $p(x)$ is inversely related to the variance. This is also very reasonable because when $p(x)$ is large, there tend to be more data points around $x$, increasing the size of the sample that we are averaging over. Last, the convergence rate is $O\!\left(\frac{1}{nh}\right)$, which is the same as the KDE.

$^1$ https://en.wikipedia.org/wiki/Kernel_regression
$^2$ If you are interested in the derivation, check http://www.ssc.wisc.edu/~bhansen/718/NonParametrics2.pdf and http://www.maths.manchester.ac.uk/~peterf/MATH38011/NPR%20N-W%20Estimator.pdf
MSE and MISE. Using the expression of the bias and variance, the MSE at point $x$ is
$$\mathrm{MSE}(\widehat{m}_h(x)) = \frac{h^4}{4}\,\mu_K^2\left(m''(x) + 2\,\frac{m'(x)\,p'(x)}{p(x)}\right)^2 + \frac{\sigma^2\cdot\sigma_K^2}{p(x)}\cdot\frac{1}{nh} + o(h^4) + o\!\left(\frac{1}{nh}\right). \tag{9.6}$$
Optimizing the major components in equation (9.6) (the AMISE), we obtain the optimal value of the smoothing bandwidth
$$h_{\mathrm{opt}} = C^* \cdot n^{-1/5},$$
where $C^*$ is a constant depending on $p$ and $K$.
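To see where the $n^{-1/5}$ rate comes from, here is a short sketch of the optimization (a standard argument; $C_1$ and $C_2$ are shorthand I introduce for the constants collected from the squared-bias and variance terms):
$$\mathrm{AMISE}(h) = C_1 h^4 + \frac{C_2}{nh}, \qquad \frac{d}{dh}\mathrm{AMISE}(h) = 4 C_1 h^3 - \frac{C_2}{n h^2} = 0 \;\Longrightarrow\; h^5 = \frac{C_2}{4 C_1 n} \;\Longrightarrow\; h_{\mathrm{opt}} = \left(\frac{C_2}{4 C_1}\right)^{1/5} n^{-1/5}.$$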
Similarly, we can estimate the MSE as we did in Lectures 5 and 6. However, when using the bootstrap to estimate the uncertainty, one has to be very careful: when $h$ is either too small or too large, the bootstrap estimate may fail to converge to its target.

When we choose $h = O(n^{-1/5})$, the bootstrap estimate of the variance is consistent, but the bootstrap estimate of the MSE might not be. The main reason is that it is easier for the bootstrap to estimate the variance than the bias. When we choose $h$ in this way, both the bias and the variance contribute substantially to the MSE, so we cannot ignore the bias. However, in this case the bootstrap cannot estimate the bias consistently, so the estimate of the MSE is not consistent.
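As an illustration of the variance part, here is a minimal Python sketch of the empirical (pairs) bootstrap for estimating $\mathrm{Var}(\widehat{m}_h(x_0))$; this is one common bootstrap scheme, not necessarily the one used in the lectures, and nadaraya_watson refers to the sketch given earlier:

import numpy as np

def bootstrap_variance(x0, X, Y, h, B=500, seed=0):
    # Empirical (pairs) bootstrap estimate of the variance of m_hat_h(x0).
    rng = np.random.default_rng(seed)
    n = len(X)
    estimates = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)        # resample (X_i, Y_i) pairs with replacement
        estimates[b] = nadaraya_watson(x0, X[idx], Y[idx], h)
    return estimates.var(ddof=1)                # sample variance over bootstrap replicates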
Confidence interval. To construct a confidence interval for $m(x)$, we will use the following property of the kernel regression:
$$\sqrt{nh}\,\bigl(\widehat{m}_h(x) - E(\widehat{m}_h(x))\bigr) \overset{D}{\to} N\!\left(0, \frac{\sigma^2\cdot\sigma_K^2}{p(x)}\right),$$
or equivalently,
$$\frac{\widehat{m}_h(x) - E(\widehat{m}_h(x))}{\sqrt{\mathrm{Var}(\widehat{m}_h(x))}} \overset{D}{\to} N(0, 1).$$
In practice, $\sigma^2$ is unknown, so we estimate it by
$$\widehat{\sigma}^2 = \frac{1}{n - 2\nu + \widetilde{\nu}}\sum_{i=1}^n e_i^2, \tag{9.7}$$
where $e_i = Y_i - \widehat{m}_h(X_i)$ is the $i$-th residual and $\nu, \widetilde{\nu}$ are quantities acting as degrees of freedom, which we will explain later. Thus, a $1-\alpha$ CI can be constructed using
$$\widehat{m}_h(x) \pm z_{1-\alpha/2}\,\frac{\widehat{\sigma}\cdot\sigma_K}{\sqrt{nh\,\widehat{p}_n(x)}},$$
where $\widehat{p}_n(x)$ is the KDE of the covariates.
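Putting these pieces together, here is a minimal Python sketch of the plug-in confidence interval; it takes an estimate sigma_hat of $\sigma$ as input (see equation (9.9) below), uses the Gaussian-kernel constant $\sigma_K^2 = \int K^2(x)\,dx = 1/(2\sqrt{\pi})$, and calls the nadaraya_watson sketch from earlier:

import numpy as np
from scipy.stats import norm

def kernel_ci(x0, X, Y, h, sigma_hat, alpha=0.05):
    # Plug-in (1 - alpha) confidence interval for m(x0) based on the normal limit.
    sigma_K2 = 1 / (2 * np.sqrt(np.pi))          # integral of K^2 for the Gaussian kernel
    n = len(X)
    p_hat = np.mean(norm.pdf((x0 - X) / h)) / h  # KDE of the covariates at x0
    m_hat = nadaraya_watson(x0, X, Y, h)
    se = sigma_hat * np.sqrt(sigma_K2 / (n * h * p_hat))
    z = norm.ppf(1 - alpha / 2)
    return m_hat - z * se, m_hat + z * se

Note that this interval is centered at $\widehat{m}_h(x_0)$, so it covers $E(\widehat{m}_h(x_0))$ rather than $m(x_0)$ unless the bias is negligible.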
Bias issue, cross-validation, and the bootstrap approach: see http://faculty.washington.edu/yenchic/17Sp_403/Lec8-NPreg.pdf for related discussion of these topics.
Many theoretical results for the KDE carry over to nonparametric regression. For instance, we can generalize the MISE to other types of error measurement between $\widehat{m}_h$ and $m$. We can also use derivatives of $\widehat{m}_h$ as estimators of the corresponding derivatives of $m$. Moreover, when we have a multivariate covariate, we can use either a radial basis kernel or a product kernel to generalize the kernel regression to the multivariate case.
The KDE and the kernel regression have a very interesting relationship. Using the given bivariate random sample $(X_1, Y_1), \cdots, (X_n, Y_n)$, we can estimate the joint PDF $p(x, y)$ as
$$\widehat{p}_n(x, y) = \frac{1}{nh^2}\sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right) K\!\left(\frac{Y_i - y}{h}\right).$$
Since $m(x) = E(Y \mid X = x) = \frac{\int y\, p(x, y)\,dy}{p(x)}$, replacing $p(x, y)$ and $p(x)$ by their corresponding estimators $\widehat{p}_n(x, y)$ and $\widehat{p}_n(x)$, we obtain an estimate of $m(x)$ as
$$\begin{aligned}
\widehat{m}_n(x) &= \frac{\int y\, \widehat{p}_n(x, y)\,dy}{\widehat{p}_n(x)}\\
&= \frac{\int y\, \frac{1}{nh^2}\sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right) K\!\left(\frac{Y_i - y}{h}\right) dy}{\frac{1}{nh}\sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right)}\\
&= \frac{\sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right)\cdot \int y \cdot K\!\left(\frac{Y_i - y}{h}\right)\frac{dy}{h}}{\sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right)}\\
&= \frac{\sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right) Y_i}{\sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right)}\\
&= \widehat{m}_h(x).
\end{aligned}$$
Note that when $K(x)$ is symmetric, $\int y \cdot K\!\left(\frac{Y_i - y}{h}\right)\frac{dy}{h} = Y_i$, which gives the fourth equality. Namely, we may understand the kernel regression as the estimator obtained by plugging the KDE of the joint PDF into the formula for the regression function.
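The identity above can also be checked numerically; the sketch below approximates $\int y\,\widehat{p}_n(x, y)\,dy / \widehat{p}_n(x)$ on a fine grid of $y$ values (the names and the grid choice are mine) and should agree with the nadaraya_watson sketch up to numerical integration error:

import numpy as np
from scipy.stats import norm

def m_from_joint_kde(x0, X, Y, h, n_grid=2000):
    # Estimate m(x0) by integrating y against the joint KDE and dividing by the marginal KDE.
    y_grid = np.linspace(Y.min() - 5 * h, Y.max() + 5 * h, n_grid)
    Kx = norm.pdf((X[:, None] - x0) / h)                  # K((X_i - x0)/h), shape (n, 1)
    Ky = norm.pdf((Y[:, None] - y_grid[None, :]) / h)     # K((Y_i - y)/h), shape (n, n_grid)
    p_joint = (Kx * Ky).sum(axis=0) / (len(X) * h**2)     # joint KDE on the y-grid
    p_marginal = Kx.sum() / (len(X) * h)                  # marginal KDE at x0
    return np.trapz(y_grid * p_joint, y_grid) / p_marginal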
Now we are going to introduce a very important notion called the linear smoother. The class of linear smoothers contains many regression estimators with nice properties. A linear smoother is an estimator of the regression function of the form
$$\widehat{m}(x) = \sum_{i=1}^n \ell_i(x)\, Y_i, \tag{9.8}$$
where the weights $\ell_1(x), \cdots, \ell_n(x)$ may depend on the covariates $X_1, \cdots, X_n$ but not on the responses.
Let $e = (e_1, \cdots, e_n)^T$ be the vector of residuals and define an $n \times n$ matrix $L$ by $L_{ij} = \ell_j(X_i)$:
$$L = \begin{pmatrix}
\ell_1(X_1) & \ell_2(X_1) & \ell_3(X_1) & \cdots & \ell_n(X_1)\\
\ell_1(X_2) & \ell_2(X_2) & \ell_3(X_2) & \cdots & \ell_n(X_2)\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\ell_1(X_n) & \ell_2(X_n) & \ell_3(X_n) & \cdots & \ell_n(X_n)
\end{pmatrix}.$$
Then the predicted vector is $\widehat{Y} = (\widehat{Y}_1, \cdots, \widehat{Y}_n)^T = L Y$, where $Y = (Y_1, \cdots, Y_n)^T$ is the vector of observed $Y_i$'s, and $e = Y - \widehat{Y} = Y - LY = (I - L)Y$.
Example: Linear Regression. For the linear regression, let $\mathbf{X}$ denote the data matrix (the first column is all 1's and the second column is $X_1, \cdots, X_n$). We know that $\widehat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T Y$ and $\widehat{Y} = \mathbf{X}\widehat{\beta} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T Y$. This implies that the matrix $L$ is
$$L = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T,$$
which is also the projection matrix in linear regression. Thus, the linear regression is a linear smoother.
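As a quick check of this example, here is a short Python sketch (the simulated data and names are mine) verifying that $L = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ reproduces the least-squares fitted values:

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=50)
Y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=50)
Xmat = np.column_stack([np.ones_like(x), x])          # design matrix: intercept and slope columns
L = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T      # projection (hat) matrix
beta_hat = np.linalg.lstsq(Xmat, Y, rcond=None)[0]    # least-squares coefficients
assert np.allclose(L @ Y, Xmat @ beta_hat)            # L Y equals the fitted values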
Example: Regressogram. The regressogram is also a linear smoother. Let $B_1, \cdots, B_M$ be the bins of the covariate and let $B(x)$ denote the bin to which $x$ belongs. Then
$$\ell_j(x) = \frac{I(X_j \in B(x))}{\sum_{i=1}^n I(X_i \in B(x))}.$$
Example: Kernel Regression. As you may expect, the kernel regression is also a linear smoother. Recall from equation (9.5) that
$$\widehat{m}_h(x_0) = \frac{\sum_{i=1}^n Y_i\, K\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n K\!\left(\frac{x_0 - X_\ell}{h}\right)} = \sum_{i=1}^n W^K_i(x_0)\, Y_i, \qquad W^K_i(x_0) = \frac{K\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^n K\!\left(\frac{x_0 - X_\ell}{h}\right)},$$
so
$$\ell_j(x) = \frac{K\!\left(\frac{x - X_j}{h}\right)}{\sum_{\ell=1}^n K\!\left(\frac{x - X_\ell}{h}\right)}.$$
The linear smoother admits an (approximately) unbiased estimator of the underlying noise level $\sigma^2$. Recall that the noise level is $\sigma^2 = \mathrm{Var}(\epsilon_i)$.
We need a standard fact about covariance matrices: for a fixed matrix $A$ and a random vector $X$,
$$\mathrm{Cov}(AX) = A\,\mathrm{Cov}(X)\,A^T.$$
Thus, the covariance matrix of the residual vector is
$$\mathrm{Cov}(e) = \mathrm{Cov}((I - L)Y) = (I - L)\,\mathrm{Cov}(Y)\,(I - L^T).$$
Because the errors $\epsilon_1, \cdots, \epsilon_n$ are IID, $\mathrm{Cov}(Y) = \sigma^2 I_n$, where $I_n$ is the $n \times n$ identity matrix. This implies
$$\mathrm{Cov}(e) = (I - L)\,\mathrm{Cov}(Y)\,(I - L^T) = \sigma^2 (I - L - L^T + LL^T).$$
The sum of the variances of the residuals is the trace of this matrix:
$$\sum_{i=1}^n \mathrm{Var}(e_i) = \mathrm{Tr}(\mathrm{Cov}(e)) = \sigma^2\,\mathrm{Tr}(I - L - L^T + LL^T) = \sigma^2 (n - 2\nu + \widetilde{\nu}),$$
where $\nu = \mathrm{Tr}(L)$ and $\widetilde{\nu} = \mathrm{Tr}(LL^T)$. Because the squared residual $e_i^2$ is approximately $\mathrm{Var}(e_i)$, we have
$$\sum_{i=1}^n e_i^2 \approx \sum_{i=1}^n \mathrm{Var}(e_i) = \sigma^2 (n - 2\nu + \widetilde{\nu}).$$
Thus, we can estimate $\sigma^2$ by
$$\widehat{\sigma}^2 = \frac{1}{n - 2\nu + \widetilde{\nu}}\sum_{i=1}^n e_i^2, \tag{9.9}$$
which is what we did in equation (9.7). The quantity $\nu$ is called the degree of freedom. In the linear regression case, $\nu = \widetilde{\nu} = p + 1$, the number of covariates plus one for the intercept, so the variance estimator is $\widehat{\sigma}^2 = \frac{1}{n - p - 1}\sum_{i=1}^n e_i^2$. If you have learned the variance estimator of a linear regression, you should be familiar with this estimator.
The degree of freedom $\nu$ is easy to interpret in the linear regression. The power of equation (9.9) is that it works for every linear smoother as long as the errors $\epsilon_i$'s are IID. This shows how we can define an effective degree of freedom for other, more complicated regression estimators.
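To make the linear-smoother machinery concrete, here is a minimal Python sketch that builds the smoothing matrix $L$ for the Gaussian kernel regression, computes $\nu = \mathrm{Tr}(L)$ and $\widetilde{\nu} = \mathrm{Tr}(LL^T)$, and evaluates the noise estimate (9.9); the function names are mine, not from the lecture:

import numpy as np
from scipy.stats import norm

def kernel_smoother_matrix(X, h):
    # Smoothing matrix with entries L[i, j] = l_j(X_i) for the Gaussian kernel regression.
    K = norm.pdf((X[:, None] - X[None, :]) / h)     # K((X_i - X_j) / h)
    return K / K.sum(axis=1, keepdims=True)         # each row sums to one

def sigma2_hat(X, Y, h):
    # Noise-level estimate of equation (9.9) for the kernel regression smoother.
    L = kernel_smoother_matrix(X, h)
    e = Y - L @ Y                                   # residuals e = (I - L) Y
    nu = np.trace(L)                                # effective degrees of freedom
    nu_tilde = np.trace(L @ L.T)
    return np.sum(e**2) / (len(Y) - 2 * nu + nu_tilde)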