
STAT 425: Introduction to Nonparametric Statistics Winter 2018

Lecture 9: Regression: Regressogram and Kernel Regression


Instructor: Yen-Chi Chen

Reference: Chapter 5 of All of nonparametric statistics.

9.1 Introduction

Let (X_1, Y_1), · · · , (X_n, Y_n) be a bivariate random sample. In regression analysis, we are often interested in the regression function

\[ m(x) = E(Y \mid X = x). \]

Sometimes we will write

\[ Y_i = m(X_i) + \varepsilon_i, \]

where ε_i is mean-zero noise. The simple linear regression model assumes that m(x) = β_0 + β_1 x, where β_0 and β_1 are the intercept and slope parameters. In this lecture, we will talk about methods that directly estimate the regression function m(x) without imposing any parametric form on m(x).

9.2 Regressogram (Binning)

We start with a very simple but extremely popular method called the regressogram; people often call it the binning approach. You can view it as

regressogram = regression + histogram.

For simplicity, we assume that the covariates Xi ’s are from a distribution over [0, 1].
Similar to the histogram, we first choose M, the number of bins. Then we partition the interval [0, 1] into M equal-width bins:

\[ B_1 = \left[0, \tfrac{1}{M}\right),\quad B_2 = \left[\tfrac{1}{M}, \tfrac{2}{M}\right),\quad \cdots,\quad B_{M-1} = \left[\tfrac{M-2}{M}, \tfrac{M-1}{M}\right),\quad B_M = \left[\tfrac{M-1}{M}, 1\right]. \]

When x ∈ B_ℓ, we estimate m(x) by

\[ \widehat{m}_M(x) = \frac{\sum_{i=1}^{n} Y_i\, I(X_i \in B_\ell)}{\sum_{i=1}^{n} I(X_i \in B_\ell)} = \text{average of the responses whose covariate is in the same bin as } x. \]

Bias. The bias of the regressogram estimator is

\[ \mathrm{bias}(\widehat{m}_M(x)) = O\!\left(\frac{1}{M}\right). \]

Variance. The variance of the regressogram estimator is

\[ \mathrm{Var}(\widehat{m}_M(x)) = O\!\left(\frac{M}{n}\right). \]


Therefore, the MSE and MISE will be at rate

\[ \mathrm{MSE} = O\!\left(\frac{1}{M^2}\right) + O\!\left(\frac{M}{n}\right), \qquad \mathrm{MISE} = O\!\left(\frac{1}{M^2}\right) + O\!\left(\frac{M}{n}\right), \]

leading to the optimal number of bins M^* ≍ n^{1/3} and the optimal convergence rate O(n^{−2/3}), the same as the histogram.

Similar to the histogram, the regressogram has a slower convergence rate than many other competitors (we will introduce several other candidates). However, both the histogram and the regressogram remain very popular because their construction is simple and intuitive; practitioners with little mathematical training can easily master these approaches.
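
To make the construction concrete, here is a minimal Python sketch of the regressogram. It is not part of the notes: the function names, the simulated data, and the choice M ≈ n^{1/3} are illustrative assumptions.

    import numpy as np

    def regressogram(x_grid, X, Y, M):
        """Regressogram estimate of m(x) on [0, 1] with M equal-width bins."""
        def bin_index(t):
            # Map points in [0, 1] to bin indices 0, ..., M-1 (the right endpoint goes to the last bin).
            return np.minimum((np.asarray(t) * M).astype(int), M - 1)

        X_bins = bin_index(X)
        est = np.empty(len(x_grid))
        for j, x in enumerate(x_grid):
            in_bin = X_bins == bin_index(x)
            # Average of the responses whose covariate falls in the same bin as x.
            est[j] = Y[in_bin].mean() if in_bin.any() else np.nan
        return est

    # Simulated example; m(x) = sin(2*pi*x) and M ~ n^(1/3) are illustrative choices.
    rng = np.random.default_rng(0)
    n = 500
    X = rng.uniform(0, 1, n)
    Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, n)
    M = int(round(n ** (1 / 3)))
    m_hat = regressogram(np.linspace(0, 1, 200), X, Y, M)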

9.3 Kernel Regression

Given a point x_0, assume that we are interested in the value m(x_0). Here is a simple method to estimate that value. When m is smooth, an observation X_i ≈ x_0 implies m(X_i) ≈ m(x_0). Thus, the response value Y_i = m(X_i) + ε_i ≈ m(x_0) + ε_i. Using this observation, to reduce the noise ε_i we can use a sample average. Thus, an estimator of m(x_0) is the average of those responses whose covariates are close to x_0.

To make this more concrete, let h > 0 be a threshold. The above procedure suggests using

\[ \widehat{m}_{\mathrm{loc}}(x_0) = \frac{\sum_{i:\,|X_i - x_0| \le h} Y_i}{n_h(x_0)} = \frac{\sum_{i=1}^{n} Y_i\, I(|X_i - x_0| \le h)}{\sum_{i=1}^{n} I(|X_i - x_0| \le h)}, \tag{9.1} \]

where n_h(x_0) is the number of observations X_i with |X_i − x_0| ≤ h. This estimator, m̂_loc, is called the local average estimator. Indeed, to estimate m(x) at any given point x, we are using a local average as an estimator.

The local average estimator can be rewritten as

\[ \widehat{m}_{\mathrm{loc}}(x_0) = \frac{\sum_{i=1}^{n} Y_i\, I(|X_i - x_0| \le h)}{\sum_{i=1}^{n} I(|X_i - x_0| \le h)} = \sum_{i=1}^{n} \frac{I(|X_i - x_0| \le h)}{\sum_{\ell=1}^{n} I(|X_\ell - x_0| \le h)} \cdot Y_i = \sum_{i=1}^{n} W_i(x_0)\, Y_i, \tag{9.2} \]

where

\[ W_i(x_0) = \frac{I(|X_i - x_0| \le h)}{\sum_{\ell=1}^{n} I(|X_\ell - x_0| \le h)} \tag{9.3} \]

is a weight for each observation. Note that Σ_{i=1}^{n} W_i(x_0) = 1 and W_i(x_0) ≥ 0 for all i = 1, · · · , n; this implies that the W_i(x_0)'s are indeed weights. Equation (9.2) shows that the local average estimator can be written as a weighted average, so the i-th weight W_i(x_0) determines the contribution of the response Y_i to the estimator m̂_loc(x_0).

In constructing the local average estimator, we apply a hard threshold to the neighboring points: those within distance h receive equal weight, while those outside the threshold h are ignored completely. This hard thresholding leads to an estimator that is not continuous.

To avoid this problem, we consider another construction of the weights. Ideally, we want to give more weight to observations that are close to x_0, and we want weights that vary smoothly. The Gaussian function G(x) = (1/√(2π)) e^{−x²/2} is a good candidate. We now use the Gaussian function to construct an estimator. We first construct the weights

\[ W_i^G(x_0) = \frac{G\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^{n} G\!\left(\frac{x_0 - X_\ell}{h}\right)}. \]

The quantity h > 0 plays a role similar to the threshold in the local average, but now it acts as the smoothing bandwidth of the Gaussian. After constructing the weights, our new estimator is

\[ \widehat{m}_G(x_0) = \sum_{i=1}^{n} W_i^G(x_0)\, Y_i = \sum_{i=1}^{n} \frac{G\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^{n} G\!\left(\frac{x_0 - X_\ell}{h}\right)}\, Y_i = \frac{\sum_{i=1}^{n} Y_i\, G\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^{n} G\!\left(\frac{x_0 - X_\ell}{h}\right)}. \tag{9.4} \]

This new estimator has weights that change more smoothly than those of the local average, and the resulting estimator is smooth, as desired.

Comparing equations (9.1) and (9.4), one may notice that these local estimators are all of a similar form:

\[ \widehat{m}_h(x_0) = \frac{\sum_{i=1}^{n} Y_i\, K\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^{n} K\!\left(\frac{x_0 - X_\ell}{h}\right)} = \sum_{i=1}^{n} W_i^K(x_0)\, Y_i, \qquad W_i^K(x_0) = \frac{K\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^{n} K\!\left(\frac{x_0 - X_\ell}{h}\right)}, \tag{9.5} \]

where K is some function. When K is a Gaussian, we obtain estimator (9.4); when K is the uniform kernel over [−1, 1], we obtain the local average (9.1). The estimator in equation (9.5) is called the kernel regression estimator or Nadaraya-Watson estimator¹. The function K plays a role similar to the kernel function in the KDE, so it is also called the kernel function; likewise, the quantity h > 0 is analogous to the smoothing bandwidth in the KDE, so it is also called the smoothing bandwidth.
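
Below is a minimal Python sketch of the estimator (9.5); it is not from the notes, and the function names, bandwidth value, and reuse of the simulated data from the regressogram sketch are illustrative assumptions. A Gaussian kernel reproduces (9.4), while the boxcar kernel on [−1, 1] reproduces the local average (9.1).

    import numpy as np

    def gaussian_kernel(u):
        return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

    def boxcar_kernel(u):
        # Uniform kernel on [-1, 1]: reproduces the local average estimator (9.1).
        return 0.5 * (np.abs(u) <= 1)

    def nadaraya_watson(x0, X, Y, h, kernel=gaussian_kernel):
        """Kernel regression estimate m_hat(x0) as in equation (9.5)."""
        x0 = np.atleast_1d(x0)
        # Weights K((x0 - X_i)/h) for every (query point, observation) pair.
        W = kernel((x0[:, None] - X[None, :]) / h)
        return (W @ Y) / W.sum(axis=1)

    # Example usage on the simulated data (X, Y) from the regressogram sketch above.
    x_grid = np.linspace(0, 1, 100)
    m_gauss = nadaraya_watson(x_grid, X, Y, h=0.05)
    m_local = nadaraya_watson(x_grid, X, Y, h=0.05, kernel=boxcar_kernel)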

9.3.1 Theory

Now we study some statistical properties of the estimator m̂_h. We skip the details of the derivations².

Bias. The bias of the kernel regression estimator at a point x is

\[ \mathrm{bias}(\widehat{m}_h(x)) = \frac{h^2}{2}\,\mu_K \left( m''(x) + 2\,\frac{m'(x)\,p'(x)}{p(x)} \right) + o(h^2), \]

where p(x) is the probability density function of the covariates X_1, · · · , X_n and μ_K = ∫ x² K(x) dx is the same kernel constant as in the KDE.

The bias has two components: a curvature component m''(x) and a design component m'(x)p'(x)/p(x). The curvature component is similar to the one in the KDE: when the regression function is highly curved, kernel smoothing smooths out the structure, introducing some bias. The second component, known as the design bias, is a new component compared to the bias in the KDE. It depends on the density of the covariates p(x). Note that in some studies we can choose the values of the covariates, so the density p(x) is also called the design (this is why this term is known as the design bias).

Variance. The variance of the estimator is

\[ \mathrm{Var}(\widehat{m}_h(x)) = \frac{\sigma^2 \cdot \sigma_K^2}{p(x)} \cdot \frac{1}{nh} + o\!\left(\frac{1}{nh}\right), \]

where σ² = Var(ε_i) is the error variance of the regression model and σ_K² = ∫ K²(x) dx is a constant of the kernel function (the same as in the KDE). This expression tells us the possible sources of variance. First, the variance increases when σ² increases. This makes perfect sense because σ² is the noise level: when the noise level is large, we expect the estimation error to increase. Second, the density of the covariates p(x) is inversely related to the variance. This is also reasonable because when p(x) is large, there tend to be more data points around x, increasing the size of the sample we are averaging over. Last, the convergence rate is O(1/(nh)), which is the same as for the KDE.

¹ https://en.wikipedia.org/wiki/Kernel_regression
² If you are interested in the derivation, check http://www.ssc.wisc.edu/~bhansen/718/NonParametrics2.pdf and http://www.maths.manchester.ac.uk/~peterf/MATH38011/NPR%20N-W%20Estimator.pdf

MSE and MISE. Using the expressions for the bias and variance, the MSE at a point x is

\[ \mathrm{MSE}(\widehat{m}_h(x)) = \frac{h^4}{4}\,\mu_K^2 \left( m''(x) + 2\,\frac{m'(x)\,p'(x)}{p(x)} \right)^2 + \frac{\sigma^2 \cdot \sigma_K^2}{p(x)} \cdot \frac{1}{nh} + o(h^4) + o\!\left(\frac{1}{nh}\right) \]

and the MISE is

\[ \mathrm{MISE}(\widehat{m}_h) = \frac{h^4}{4}\,\mu_K^2 \int \left( m''(x) + 2\,\frac{m'(x)\,p'(x)}{p(x)} \right)^2 dx + \frac{\sigma^2 \cdot \sigma_K^2}{nh} \int \frac{1}{p(x)}\,dx + o(h^4) + o\!\left(\frac{1}{nh}\right). \tag{9.6} \]

Optimizing the major components in equation (9.6) (the AMISE), we obtain the optimal smoothing bandwidth

\[ h_{\mathrm{opt}} = C^* \cdot n^{-1/5}, \]

where C^* is a constant depending on p and K.
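
In practice C^* depends on unknown quantities, so h is usually chosen from the data. One common data-driven choice is cross-validation (listed in Section 9.3.3 below but not derived in these notes); the following leave-one-out sketch is only an illustration, reusing the hypothetical nadaraya_watson and gaussian_kernel functions and the simulated data from the earlier snippets.

    import numpy as np

    def loocv_score(h, X, Y, kernel=gaussian_kernel):
        """Leave-one-out cross-validation score for bandwidth h."""
        n = len(X)
        errs = np.empty(n)
        for i in range(n):
            mask = np.arange(n) != i
            # Predict Y_i using the other n - 1 observations only.
            errs[i] = Y[i] - nadaraya_watson(X[i], X[mask], Y[mask], h, kernel)[0]
        return np.mean(errs ** 2)

    # Pick the bandwidth that minimizes the LOOCV score over a grid (the grid range is an assumption).
    h_grid = np.linspace(0.01, 0.3, 30)
    h_cv = h_grid[int(np.argmin([loocv_score(h, X, Y) for h in h_grid]))]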

9.3.2 Uncertainty and Confidence Intervals

How do we assess the quality of our estimator m̂_h(x)?

We can use the bootstrap to do this. In this case, the empirical bootstrap, residual bootstrap, and wild bootstrap can all be applied, but note that each of them relies on slightly different assumptions. Let (X_1^*, Y_1^*), · · · , (X_n^*, Y_n^*) be a bootstrap sample. Applying the bootstrap sample to equation (9.5), we obtain a bootstrap kernel regression estimator, denoted m̂_h^*. Repeating the bootstrap procedure B times yields

\[ \widehat{m}_h^{*(1)}, \cdots, \widehat{m}_h^{*(B)}, \]

B bootstrap kernel regression estimators. Then we can estimate the variance of m̂_h(x) by the sample variance

\[ \widehat{\mathrm{Var}}_B(\widehat{m}_h(x)) = \frac{1}{B-1} \sum_{\ell=1}^{B} \left( \widehat{m}_h^{*(\ell)}(x) - \bar{m}_{h,B}^{*}(x) \right)^2, \qquad \bar{m}_{h,B}^{*}(x) = \frac{1}{B} \sum_{\ell=1}^{B} \widehat{m}_h^{*(\ell)}(x). \]

Similarly, we can estimate the MSE as we did in Lectures 5 and 6. However, when using the bootstrap to estimate the uncertainty, one has to be very careful: when h is either too small or too large, the bootstrap estimate may fail to converge to its target.

When we choose h = O(n^{−1/5}), the bootstrap estimate of the variance is consistent, but the bootstrap estimate of the MSE might not be. The main reason is that it is easier for the bootstrap to estimate the variance than the bias. When h is chosen at this rate, both the bias and the variance contribute substantially to the MSE, so we cannot ignore the bias. However, in this case the bootstrap cannot estimate the bias consistently, so the estimate of the MSE is not consistent.
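
As an illustration, here is a hedged sketch of the empirical-bootstrap variance estimate described above, reusing the hypothetical nadaraya_watson function and the simulated data from the earlier snippets; the residual and wild bootstrap variants would resample differently.

    import numpy as np

    def bootstrap_variance(x_grid, X, Y, h, B=500, rng=None):
        """Empirical-bootstrap estimate of Var(m_hat_h(x)) on a grid of x values."""
        rng = np.random.default_rng() if rng is None else rng
        n = len(X)
        boot = np.empty((B, len(x_grid)))
        for b in range(B):
            # Resample (X_i, Y_i) pairs with replacement and recompute the kernel regression.
            idx = rng.integers(0, n, n)
            boot[b] = nadaraya_watson(x_grid, X[idx], Y[idx], h)
        # Sample variance across the B bootstrap curves (ddof=1 gives the 1/(B-1) factor).
        return boot.var(axis=0, ddof=1)

    var_hat = bootstrap_variance(x_grid, X, Y, h=0.05, B=200, rng=np.random.default_rng(2))
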
Confidence interval. To construct a confidence interval for m(x), we will use the following property of the kernel regression estimator:

\[ \sqrt{nh}\,\bigl(\widehat{m}_h(x) - E(\widehat{m}_h(x))\bigr) \overset{D}{\to} N\!\left(0, \frac{\sigma^2 \cdot \sigma_K^2}{p(x)}\right), \qquad \frac{\widehat{m}_h(x) - E(\widehat{m}_h(x))}{\sqrt{\mathrm{Var}(\widehat{m}_h(x))}} \overset{D}{\to} N(0, 1). \]

The variance depends on three quantities: σ², σ_K², and p(x). The quantity σ_K² is known because it is just a characteristic of the kernel function. The density of the covariates p(x) can be estimated using a KDE. So what remains unknown is the noise level σ². The good news is that we can estimate it using the residuals. Recall that the residuals are

\[ e_i = Y_i - \widehat{Y}_i = Y_i - \widehat{m}_h(X_i). \]

When m̂_h ≈ m, the residual e_i becomes an approximation to the noise ε_i. Since σ² = Var(ε_1), we can use the sample variance of the residuals to estimate it (note that the average of the residuals is 0):

\[ \widehat{\sigma}^2 = \frac{1}{n - 2\nu + \tilde{\nu}} \sum_{i=1}^{n} e_i^2, \tag{9.7} \]

where ν and ν̃ are quantities acting as degrees of freedom that we will explain later. Thus, a 1 − α confidence interval can be constructed as

\[ \widehat{m}_h(x) \pm z_{1-\alpha/2}\,\frac{\widehat{\sigma} \cdot \sigma_K}{\sqrt{nh \cdot \widehat{p}_n(x)}}, \]

where p̂_n(x) is the KDE of the covariates.
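
A hedged sketch of this interval is given below, assuming a Gaussian kernel (for which σ_K² = 1/(2√π)) and reusing the hypothetical gaussian_kernel and nadaraya_watson functions and simulated data from the earlier snippets; for simplicity σ̂² here uses the plain average of squared residuals rather than the degrees-of-freedom correction in (9.7), which is computed in the sketch after Section 9.4.1.

    import numpy as np
    from scipy.stats import norm

    def nw_confidence_interval(x_grid, X, Y, h, alpha=0.05):
        """Pointwise 1-alpha confidence intervals for m(x) via asymptotic normality."""
        n = len(X)
        sigma_K2 = 1.0 / (2.0 * np.sqrt(np.pi))   # integral of G^2 for the Gaussian kernel
        # Plug-in pieces: residual-based noise level and a KDE for p(x).
        resid = Y - nadaraya_watson(X, X, Y, h)
        sigma2_hat = np.mean(resid ** 2)          # simplified version of (9.7)
        p_hat = gaussian_kernel((x_grid[:, None] - X[None, :]) / h).sum(axis=1) / (n * h)
        m_hat = nadaraya_watson(x_grid, X, Y, h)
        se = np.sqrt(sigma2_hat * sigma_K2 / (n * h * p_hat))
        z = norm.ppf(1 - alpha / 2)
        return m_hat - z * se, m_hat + z * se

    lo, hi = nw_confidence_interval(x_grid, X, Y, h=0.05)
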
Bias issue.

9.3.3 Resampling Techniques

Cross-validation.
Bootstrap approach.
http://faculty.washington.edu/yenchic/17Sp_403/Lec8-NPreg.pdf

9.3.4 Relation to KDE

Many theoretical results for the KDE carry over to nonparametric regression. For instance, we can generalize the MISE to other error measures between m̂_h and m. We can also use derivatives of m̂_h as estimators of the corresponding derivatives of m. Moreover, when we have a multivariate covariate, we can use either a radial basis kernel or a product kernel to generalize the kernel regression to the multivariate case.

The KDE and the kernel regression have a very interesting relationship. Using the given bivariate random sample (X_1, Y_1), · · · , (X_n, Y_n), we can estimate the joint PDF p(x, y) as

\[ \widehat{p}_n(x, y) = \frac{1}{nh^2} \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) K\!\left(\frac{Y_i - y}{h}\right). \]

This joint density estimator also leads to a marginal density estimator of X:

\[ \widehat{p}_n(x) = \int \widehat{p}_n(x, y)\,dy = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right). \]

Now recall that the regression function is the conditional expectation

\[ m(x) = E(Y \mid X = x) = \int y\,p(y \mid x)\,dy = \int y\,\frac{p(x, y)}{p(x)}\,dy = \frac{\int y\,p(x, y)\,dy}{p(x)}. \]

Replacing p(x, y) and p(x) by their corresponding estimators p̂_n(x, y) and p̂_n(x), we obtain an estimate of m(x) as

\[
\begin{aligned}
\widehat{m}_n(x) &= \frac{\int y\,\widehat{p}_n(x, y)\,dy}{\widehat{p}_n(x)} \\
&= \frac{\int y\,\frac{1}{nh^2} \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) K\!\left(\frac{Y_i - y}{h}\right) dy}{\frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)} \\
&= \frac{\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) \cdot \int y \cdot K\!\left(\frac{Y_i - y}{h}\right) \frac{dy}{h}}{\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)} \\
&= \frac{\sum_{i=1}^{n} Y_i\, K\!\left(\frac{X_i - x}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)} \\
&= \widehat{m}_h(x).
\end{aligned}
\]

Note that when K(x) is symmetric, ∫ y · K((Y_i − y)/h) dy/h = Y_i. Namely, we may understand the kernel regression estimator as converting the KDE of the joint PDF into a regression estimator.
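
This identity can be checked numerically. The sketch below (reusing the hypothetical gaussian_kernel and nadaraya_watson functions and the simulated data from the earlier snippets) approximates ∫ y p̂_n(x_0, y) dy on a grid and compares the ratio with the kernel regression estimate.

    import numpy as np

    def nw_via_joint_kde(x0, X, Y, h, y_grid=None):
        """Compute int y * p_hat(x0, y) dy / p_hat(x0) by numerical integration over y."""
        if y_grid is None:
            y_grid = np.linspace(Y.min() - 5 * h, Y.max() + 5 * h, 2000)
        Kx = gaussian_kernel((X - x0) / h)                         # K((X_i - x0)/h) for each i
        Ky = gaussian_kernel((Y[:, None] - y_grid[None, :]) / h)   # K((Y_i - y)/h) on the y grid
        p_joint = (Kx[:, None] * Ky).sum(axis=0) / (len(X) * h ** 2)   # p_hat(x0, y)
        p_marg = Kx.sum() / (len(X) * h)                               # p_hat(x0)
        numer = np.trapz(y_grid * p_joint, y_grid)                     # int y p_hat(x0, y) dy
        return numer / p_marg

    x0 = 0.5
    print(nw_via_joint_kde(x0, X, Y, h=0.1))
    print(nadaraya_watson(x0, X, Y, h=0.1)[0])   # should agree up to the integration error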

9.4 Linear Smoother

Now we are going to introduce a very important notion called the linear smoother. Linear smoothers form a collection of regression estimators with nice properties. A linear smoother is an estimator of the regression function of the form

\[ \widehat{m}(x) = \sum_{i=1}^{n} \ell_i(x)\,Y_i, \tag{9.8} \]

where ℓ_i(x) is some function depending on X_1, · · · , X_n but not on any of Y_1, · · · , Y_n.


The residual for the j-th observation can be written as

\[ e_j = Y_j - \widehat{m}(X_j) = Y_j - \sum_{i=1}^{n} \ell_i(X_j)\,Y_i. \]

Let e = (e_1, · · · , e_n)^T be the vector of residuals and define an n × n matrix L with entries L_{ij} = ℓ_j(X_i):

\[ L = \begin{pmatrix}
\ell_1(X_1) & \ell_2(X_1) & \ell_3(X_1) & \cdots & \ell_n(X_1) \\
\ell_1(X_2) & \ell_2(X_2) & \ell_3(X_2) & \cdots & \ell_n(X_2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\ell_1(X_n) & \ell_2(X_n) & \ell_3(X_n) & \cdots & \ell_n(X_n)
\end{pmatrix}. \]

Then the vector of predicted values is Ŷ = (Ŷ_1, · · · , Ŷ_n)^T = LY, where Y = (Y_1, · · · , Y_n)^T is the vector of observed Y_i's, and e = Y − Ŷ = Y − LY = (I − L)Y.

Example: Linear Regression. For linear regression, let X denote the data matrix (its first column is all 1's and its second column is X_1, · · · , X_n). We know that β̂ = (X^T X)^{-1} X^T Y and Ŷ = Xβ̂ = X(X^T X)^{-1} X^T Y. This implies that the matrix L is

\[ L = X(X^T X)^{-1} X^T, \]

which is also the projection matrix in linear regression. Thus, linear regression is a linear smoother.

Example: Regressogram. The regressogram is also a linear smoother. Let B_1, · · · , B_M be the bins of the covariate and let B(x) denote the bin to which x belongs. Then

\[ \ell_j(x) = \frac{I(X_j \in B(x))}{\sum_{i=1}^{n} I(X_i \in B(x))}. \]

Example: Kernel Regression. As you may expect, the kernel regression is also a linear smoother. Recall from equation (9.5) that

\[ \widehat{m}_h(x_0) = \frac{\sum_{i=1}^{n} Y_i\, K\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^{n} K\!\left(\frac{x_0 - X_\ell}{h}\right)} = \sum_{i=1}^{n} W_i^K(x_0)\,Y_i, \qquad W_i^K(x_0) = \frac{K\!\left(\frac{x_0 - X_i}{h}\right)}{\sum_{\ell=1}^{n} K\!\left(\frac{x_0 - X_\ell}{h}\right)}, \]

so

\[ \ell_j(x) = \frac{K\!\left(\frac{x - X_j}{h}\right)}{\sum_{\ell=1}^{n} K\!\left(\frac{x - X_\ell}{h}\right)}. \]
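
As a numerical check (reusing the hypothetical gaussian_kernel and nadaraya_watson functions and the simulated data from the earlier snippets), one can build the matrix L for these kernel weights and verify that LY reproduces the kernel regression fitted values.

    import numpy as np

    def smoother_matrix(X, h, kernel=gaussian_kernel):
        """n x n matrix L with L[i, j] = ell_j(X_i) for the kernel regression smoother."""
        K = kernel((X[:, None] - X[None, :]) / h)
        return K / K.sum(axis=1, keepdims=True)

    L_mat = smoother_matrix(X, h=0.05)
    Y_fit = L_mat @ Y
    # The rows of L sum to one, and L @ Y matches the kernel regression fitted values.
    assert np.allclose(L_mat.sum(axis=1), 1.0)
    assert np.allclose(Y_fit, nadaraya_watson(X, X, Y, h=0.05))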

9.4.1 Variance of Linear Smoother

The linear smoother admits a simple estimator of the underlying noise level σ². Recall that the noise level is σ² = Var(ε_i).

We need a standard fact about covariance matrices: for a fixed matrix A and a random vector X,

\[ \mathrm{Cov}(AX) = A\,\mathrm{Cov}(X)\,A^T. \]

Thus, the covariance matrix of the residual vector is

\[ \mathrm{Cov}(e) = \mathrm{Cov}((I - L)Y) = (I - L)\,\mathrm{Cov}(Y)\,(I - L^T). \]

Because the errors ε_1, · · · , ε_n are IID, Cov(Y) = σ² I_n, where I_n is the n × n identity matrix. This implies

\[ \mathrm{Cov}(e) = (I - L)\,\mathrm{Cov}(Y)\,(I - L^T) = \sigma^2 (I - L - L^T + LL^T). \]

Now, taking the matrix trace on both sides,

\[ \mathrm{Tr}(\mathrm{Cov}(e)) = \sum_{i=1}^{n} \mathrm{Var}(e_i) = \sigma^2\,\mathrm{Tr}(I - L - L^T + LL^T) = \sigma^2 (n - \nu - \nu + \tilde{\nu}), \]

where ν = Tr(L) and ν̃ = Tr(LL^T). Because the squared residual e_i² approximates Var(e_i), we have

\[ \sum_{i=1}^{n} e_i^2 \approx \sum_{i=1}^{n} \mathrm{Var}(e_i) = \sigma^2 (n - 2\nu + \tilde{\nu}). \]

Thus, we can estimate σ² by

\[ \widehat{\sigma}^2 = \frac{1}{n - 2\nu + \tilde{\nu}} \sum_{i=1}^{n} e_i^2, \tag{9.9} \]

which is what we did in equation (9.7). The quantity ν is called the degrees of freedom. In the linear regression case, ν = ν̃ = p + 1, where p is the number of covariates, so the variance estimator becomes σ̂² = (1/(n − p − 1)) Σ_{i=1}^{n} e_i². If you have learned the variance estimator for linear regression, you should be familiar with this estimator.

The degrees of freedom ν is easy to interpret in linear regression. The power of equation (9.9) is that it works for every linear smoother as long as the errors ε_i are IID, so it shows how we can define an effective degrees of freedom for other, more complicated regression estimators.
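
A short sketch of equation (9.9) for the kernel smoother follows, reusing the hypothetical smoother_matrix function and simulated data from the previous snippet; the bandwidth value is again an illustrative assumption.

    import numpy as np

    L_mat = smoother_matrix(X, h=0.05)            # smoother matrix of the kernel regression
    nu = np.trace(L_mat)                          # nu = Tr(L), the effective degrees of freedom
    nu_tilde = np.trace(L_mat @ L_mat.T)          # nu_tilde = Tr(L L^T)
    resid = Y - L_mat @ Y
    sigma2_hat = np.sum(resid ** 2) / (len(Y) - 2 * nu + nu_tilde)   # equation (9.9)

    # For comparison, ordinary linear regression with one covariate: nu = nu_tilde = p + 1 = 2.
    X_design = np.column_stack([np.ones_like(X), X])
    H = X_design @ np.linalg.solve(X_design.T @ X_design, X_design.T)
    print(np.trace(H), np.trace(H @ H.T))         # both approximately 2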
