
STAT 425: Introduction to Nonparametric Statistics Winter 2018

Lecture 10: Regression: Basis Approach and Regression Tree


Instructor: Yen-Chi Chen

Reference: pages 108–111 and Chapter 8 of All of Nonparametric Statistics.

10.1 Basis Approach

Recall that we observe pairs $(X_1, Y_1), \cdots, (X_n, Y_n)$ and we are interested in the regression function $m(x) = \mathbb{E}(Y_1 | X_1 = x)$. In this section, we will make the following two assumptions:

• $Y_i = m(X_i) + \sigma \cdot \epsilon_i$, where $\epsilon_i \sim N(0, 1)$ is the noise. Moreover, $\epsilon_1, \cdots, \epsilon_n$ are IID.

• $X_i = i/n$. Namely, the covariates form a uniform grid over $[0, 1]$ and are non-random (a fixed design).

Similar to the basis approach for the density estimation problem, where we approximate the density function by a sum of basis functions weighted by coefficients, we will approximate the regression function by a basis expansion:

$$m(x) = \sum_{j=1}^{\infty} \theta_j \phi_j(x),$$

where $\{\phi_1, \phi_2, \cdots\}$ is an orthonormal basis and $\theta_1, \theta_2, \cdots$ are the coefficients.


Again, here we consider the cosine basis:

$$\phi_1(x) = 1, \qquad \phi_j(x) = \sqrt{2}\cos((j-1)\pi x), \quad j = 2, 3, \cdots.$$

As in the density estimation problem, we will use only the first $M$ basis functions to form our estimator. Namely,

$$\widehat m_M(x) = \sum_{j=1}^{M} \widehat\theta_j \phi_j(x),$$

for some coefficient estimates $\widehat\theta_1, \cdots, \widehat\theta_M$. Again, $M$ is the tuning parameter in our estimator.
Here is a simple choice of the coefficient estimates that we will be using:

$$\widehat\theta_j = \frac{1}{n}\sum_{i=1}^{n} Y_i\,\phi_j(X_i) = \frac{1}{n}\sum_{i=1}^{n} Y_i\,\phi_j\!\left(\frac{i}{n}\right).$$
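To make this concrete, here is a minimal sketch in Python (my own illustration, not part of the original notes; it uses only NumPy, and the function names and the simulated regression function are arbitrary choices) of computing the coefficient estimates $\widehat\theta_j$ and the fitted curve $\widehat m_M(x)$ on the fixed grid $X_i = i/n$:

```python
import numpy as np

def cosine_basis(x, M):
    """Evaluate the first M cosine basis functions at the points x.

    phi_1(x) = 1 and phi_j(x) = sqrt(2) * cos((j - 1) * pi * x) for j >= 2.
    Returns an array of shape (len(x), M).
    """
    x = np.asarray(x, dtype=float)
    Phi = np.ones((x.size, M))
    for j in range(2, M + 1):
        Phi[:, j - 1] = np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)
    return Phi

def fit_basis_regression(Y, M):
    """theta_hat_j = (1/n) * sum_i Y_i * phi_j(i/n) on the fixed design X_i = i/n."""
    n = len(Y)
    X = np.arange(1, n + 1) / n
    Phi = cosine_basis(X, M)          # shape (n, M)
    return Phi.T @ np.asarray(Y) / n  # shape (M,)

def predict(theta_hat, x_grid):
    """m_hat_M(x) = sum_j theta_hat_j * phi_j(x), evaluated on x_grid."""
    return cosine_basis(x_grid, len(theta_hat)) @ theta_hat

# Illustrative usage on simulated data (the true m and sigma are arbitrary).
rng = np.random.default_rng(0)
n, M, sigma = 200, 10, 0.3
X = np.arange(1, n + 1) / n
Y = np.sin(2 * np.pi * X) + sigma * rng.standard_normal(n)
theta_hat = fit_basis_regression(Y, M)
m_hat = predict(theta_hat, np.linspace(0, 1, 101))
```

Increasing $M$ reduces the approximation error but increases the variance; this is exactly the tradeoff analyzed below.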

To determine the tuning parameter $M$, we analyze the MISE. We start by analyzing the bias and variance of $\widehat\theta_j$.


10.1.1 Asymptotic theory

Asymptotic normality. Note that the estimator can be rewritten as

$$\widehat m_M(x) = \sum_{j=1}^{M}\widehat\theta_j\phi_j(x) = \sum_{j=1}^{M}\frac{1}{n}\sum_{i=1}^{n} Y_i\,\phi_j\!\left(\frac{i}{n}\right)\phi_j(x) = \frac{1}{n}\sum_{i=1}^{n} Y_i\sum_{j=1}^{M}\phi_j\!\left(\frac{i}{n}\right)\phi_j(x).$$

Thus, for $M$ fixed, we have
$$\sqrt{n}\,\bigl(\widehat m_M(x) - \mathbb{E}(\widehat m_M(x))\bigr) \overset{D}{\to} N(0, \sigma_M^2)$$
for some $\sigma_M^2$. Note that our analysis later will demonstrate
$$\mathbb{E}(\widehat m_M(x)) = \sum_{j=1}^{M}\theta_j\phi_j(x), \qquad \sigma_M^2 = \sigma^2\sum_{j=1}^{M}\phi_j^2(x).$$

Bias.

$$\begin{aligned}
\mathrm{bias}(\widehat\theta_j) &= \mathbb{E}(\widehat\theta_j) - \theta_j\\
&= \mathbb{E}\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\,\phi_j\!\left(\frac{i}{n}\right) \Bigm| X_i = \frac{i}{n}\right) - \theta_j\\
&= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\!\left(Y_i \Bigm| X_i = \frac{i}{n}\right)\phi_j\!\left(\frac{i}{n}\right) - \theta_j\\
&= \frac{1}{n}\sum_{i=1}^{n} m\!\left(\frac{i}{n}\right)\phi_j\!\left(\frac{i}{n}\right) - \theta_j\\
&= \frac{1}{n}\sum_{i=1}^{n} m\!\left(\frac{i}{n}\right)\phi_j\!\left(\frac{i}{n}\right) - \int_0^1 m(x)\phi_j(x)\,dx.
\end{aligned}$$

Namely, the bias is the difference between the actual integral and a discretized (Riemann-sum) version of it. When $n$ is large, the two are almost identical, so we can ignore the bias. Thus, we will write
$$\mathrm{bias}(\widehat\theta_j) = 0$$
for simplicity.
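As a quick numerical check of this claim (my own illustration; the regression function, the basis index $j$, and the grid size are arbitrary choices), one can compare the Riemann sum with the integral it approximates:

```python
import numpy as np
from scipy.integrate import quad

m = lambda x: np.sin(2 * np.pi * x)                             # an arbitrary smooth regression function
phi = lambda x, j: np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)   # cosine basis function, j >= 2

j, n = 3, 500
X = np.arange(1, n + 1) / n
riemann = np.mean(m(X) * phi(X, j))                    # (1/n) * sum_i m(i/n) * phi_j(i/n)
theta_j = quad(lambda x: m(x) * phi(x, j), 0, 1)[0]    # int_0^1 m(x) phi_j(x) dx
print(riemann - theta_j)                               # the bias of theta_hat_j; tiny for large n
```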

Variance.
 
$$\begin{aligned}
\mathrm{Var}(\widehat\theta_j) &= \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n}\underbrace{\Bigl(m\!\left(\tfrac{i}{n}\right) + \sigma\,\epsilon_i\Bigr)}_{=Y_i}\phi_j\!\left(\frac{i}{n}\right)\right)\\
&= \frac{1}{n^2}\sum_{i=1}^{n}\phi_j^2\!\left(\frac{i}{n}\right)\mathrm{Var}(\sigma\,\epsilon_i)\\
&= \frac{\sigma^2}{n^2}\sum_{i=1}^{n}\phi_j^2\!\left(\frac{i}{n}\right).
\end{aligned}$$

Note that $\frac{1}{n}\sum_{i=1}^{n}\phi_j^2\!\left(\frac{i}{n}\right) \approx \int_0^1 \phi_j^2(x)\,dx = 1$. For simplicity, we just write
$$\mathrm{Var}(\widehat\theta_j) = \frac{\sigma^2}{n}.$$

MISE. To analyze the MISE, we first note that the bias of $\widehat m_M(x)$ is
$$\mathrm{bias}(\widehat m_M(x)) = \mathbb{E}(\widehat m_M(x)) - m(x) = \sum_{j=1}^{M}\theta_j\phi_j(x) - \sum_{j=1}^{\infty}\theta_j\phi_j(x) = -\sum_{j=M+1}^{\infty}\theta_j\phi_j(x).$$

This further implies that the integrated squared bias is
$$\begin{aligned}
\int_0^1 \mathrm{bias}^2(\widehat m_M(x))\,dx &= \int_0^1\left(\sum_{j=M+1}^{\infty}\theta_j\phi_j(x)\right)\left(\sum_{\ell=M+1}^{\infty}\theta_\ell\phi_\ell(x)\right)dx\\
&= \sum_{j=M+1}^{\infty}\sum_{\ell=M+1}^{\infty}\theta_j\theta_\ell\underbrace{\int_0^1\phi_j(x)\phi_\ell(x)\,dx}_{=I(j=\ell)}\\
&= \sum_{j=M+1}^{\infty}\theta_j^2.
\end{aligned}$$

Again, if we assume that $m$ satisfies $\int_0^1 |m''(x)|^2\,dx < \infty$, we have
$$\sum_{j=M+1}^{\infty}\theta_j^2 = O(M^{-4}).$$

Now we turn to the analysis of variance.


 
$$\begin{aligned}
\mathrm{Var}(\widehat m_M(x)) &= \mathrm{Var}\!\left(\sum_{j=1}^{M}\widehat\theta_j\phi_j(x)\right)\\
&= \sum_{j=1}^{M}\mathrm{Var}(\widehat\theta_j)\,\phi_j^2(x)\\
&= \frac{\sigma^2}{n}\sum_{j=1}^{M}\phi_j^2(x).
\end{aligned}$$

Thus, the integrated variance is
$$\int_0^1 \mathrm{Var}(\widehat m_M(x))\,dx = \frac{\sigma^2}{n}\sum_{j=1}^{M}\int_0^1\phi_j^2(x)\,dx = \frac{\sigma^2 M}{n} = O\!\left(\frac{M}{n}\right).$$

Recall that the MISE is just the sum of the integrated squared bias and the integrated variance, so we obtain
$$\mathrm{MISE}(\widehat m_M) = \int_0^1 \mathrm{bias}^2(\widehat m_M(x))\,dx + \int_0^1 \mathrm{Var}(\widehat m_M(x))\,dx = O(M^{-4}) + O\!\left(\frac{M}{n}\right).$$

Thus, the optimal choice is
$$M^* \asymp n^{1/5}.$$
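This tradeoff can be checked numerically. The sketch below (my own illustration, not part of the notes; the regression function, noise level, and Monte Carlo settings are arbitrary choices) approximates the MISE by simulation for several values of $M$:

```python
import numpy as np

def cosine_basis(x, M):
    """phi_1 = 1, phi_j(x) = sqrt(2) cos((j-1) pi x) for j >= 2, evaluated at the points x."""
    x = np.asarray(x, dtype=float)
    Phi = np.ones((x.size, M))
    for j in range(2, M + 1):
        Phi[:, j - 1] = np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)
    return Phi

def mc_mise(m, n, M, sigma, n_rep=200, seed=0):
    """Monte Carlo approximation of MISE(m_hat_M) under the fixed design X_i = i/n."""
    rng = np.random.default_rng(seed)
    X = np.arange(1, n + 1) / n
    grid = np.linspace(0, 1, 200)
    Phi_X, Phi_grid = cosine_basis(X, M), cosine_basis(grid, M)
    ise = np.empty(n_rep)
    for r in range(n_rep):
        Y = m(X) + sigma * rng.standard_normal(n)
        theta_hat = Phi_X.T @ Y / n
        m_hat = Phi_grid @ theta_hat
        ise[r] = np.mean((m_hat - m(grid)) ** 2)  # approximates the integrated squared error
    return ise.mean()

m = lambda x: np.sin(2 * np.pi * x)
for M in (2, 4, 8, 16, 32):
    print(M, mc_mise(m, n=200, M=M, sigma=0.3))
# Small M is dominated by the squared-bias term (~ M^{-4}); large M by the variance term (~ M / n).
```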

10.1.2 Basis approach as a linear smoother

The basis estimator is another linear smoother. To see this, we use the following expansion:

$$\begin{aligned}
\widehat m_M(x) &= \sum_{j=1}^{M}\widehat\theta_j\phi_j(x)\\
&= \sum_{j=1}^{M}\frac{1}{n}\sum_{i=1}^{n} Y_i\,\phi_j(X_i)\phi_j(x)\\
&= \sum_{i=1}^{n}\left(\sum_{j=1}^{M}\frac{1}{n}\phi_j(X_i)\phi_j(x)\right) Y_i\\
&= \sum_{i=1}^{n}\ell_i(x)\,Y_i,
\end{aligned}$$

where $\ell_i(x) = \sum_{j=1}^{M}\frac{1}{n}\phi_j(X_i)\phi_j(x)$.

Recall from the linear smoother theory that we can estimate $\sigma^2$ using the residuals and the degrees of freedom:
$$\widehat\sigma^2 = \frac{1}{n - 2\nu + \widetilde\nu}\sum_{i=1}^{n} e_i^2,$$
where $e_i = \widehat Y_i - Y_i = \widehat m_M(X_i) - Y_i$ and $\nu, \widetilde\nu$ are the degrees of freedom (see the previous lecture notes).
With this variance estimator, the fact that $\mathrm{Var}(\widehat m_M(x)) = \frac{\sigma^2}{n}\sum_{j=1}^{M}\phi_j^2(x)$, and the asymptotic normality, we can construct a confidence interval (band) of $m$ using
$$\widehat m_M(x) \pm z_{1-\alpha/2}\,\sqrt{\frac{\widehat\sigma^2}{n}\sum_{j=1}^{M}\phi_j^2(x)}.$$

Note that this confidence interval is valid for $\mathbb{E}(\widehat m_M(x)) = \sum_{j=1}^{M}\theta_j\phi_j(x)$, not for the actual $m(x)$. The difference between them is the bias of our estimator.
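Here is a minimal sketch (my own illustration, not from the notes) of constructing this band: build the smoothing matrix $L$ with entries $L_{ab} = \ell_b(X_a)$, take the degrees of freedom to be $\nu = \mathrm{tr}(L)$ and $\widetilde\nu = \mathrm{tr}(L^T L)$ as in the previous lecture, estimate $\sigma^2$ from the residuals, and attach the normal quantile:

```python
import numpy as np
from scipy.stats import norm

def cosine_basis(x, M):
    x = np.asarray(x, dtype=float)
    Phi = np.ones((x.size, M))
    for j in range(2, M + 1):
        Phi[:, j - 1] = np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)
    return Phi

def basis_confidence_band(Y, M, x_grid, alpha=0.05):
    """Point estimate and pointwise confidence band for E(m_hat_M(x)) on x_grid."""
    Y = np.asarray(Y, dtype=float)
    n = Y.size
    X = np.arange(1, n + 1) / n          # fixed design X_i = i/n
    Phi_X = cosine_basis(X, M)           # (n, M)
    Phi_g = cosine_basis(x_grid, M)      # (len(x_grid), M)

    theta_hat = Phi_X.T @ Y / n
    m_hat = Phi_g @ theta_hat

    # Smoothing matrix of the linear smoother: L[a, b] = ell_b(X_a) = (1/n) sum_j phi_j(X_b) phi_j(X_a).
    L = Phi_X @ Phi_X.T / n
    nu, nu_tilde = np.trace(L), np.trace(L.T @ L)
    e = L @ Y - Y                        # residuals e_i = m_hat_M(X_i) - Y_i
    sigma2_hat = np.sum(e ** 2) / (n - 2 * nu + nu_tilde)

    se = np.sqrt(sigma2_hat / n * np.sum(Phi_g ** 2, axis=1))
    z = norm.ppf(1 - alpha / 2)
    return m_hat, m_hat - z * se, m_hat + z * se
```

As emphasized above, this band covers $\mathbb{E}(\widehat m_M(x))$ rather than $m(x)$ unless the bias is negligible.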

10.2 Regression Tree

In this section, we allow the covariate to have multiple dimensions, i.e., $x = (x_1, \cdots, x_d)$, and our data are $(X_1, Y_1), \cdots, (X_n, Y_n) \sim P$ for some distribution $P$. Again, we are interested in the regression function $m(x) = \mathbb{E}(Y_1 | X_1 = x)$.
A regression tree constructs an estimator of the form
$$m(x) = \sum_{\ell=1}^{M} c_\ell\, I(x \in R_\ell),$$
where $R_1, \cdots, R_M$ form a rectangular partition of the covariate space.


Here is an example of a regression tree and its splits. In this example, there are two covariates (namely,
d = 2) and we have 3 regions R1 , R2 , R3 :
$$R_1 = \{(x_1, x_2) : x_1 < 10,\ x_2 < 5\}, \quad R_2 = \{(x_1, x_2) : x_1 < 10,\ x_2 \ge 5\}, \quad R_3 = \{(x_1, x_2) : x_1 \ge 10\}.$$

[Figure: the tree splits first on $x_1$ (left branch $x_1 < 10$, right branch $x_1 \ge 10$ giving $R_3$) and then on $x_2$ (left branch $x_2 < 5$ giving $R_1$, right branch $x_2 \ge 5$ giving $R_2$); the accompanying plot shows the corresponding rectangular partition of the $(x_1, x_2)$ plane.]
A regression tree estimator predicts the same value of the response $Y$ within each region of the covariate space; namely, the fitted $m(x)$ is constant for all $x$ within the same region.
To use a regression tree, there are 2M quantities to be determined: the regions R1 , · · · , RM and the predicted
values c1 , · · · , cM . When R1 , · · · , RM are given, c1 , · · · , cM can be simply estimated by the average within
each region, i.e.,
$$\widehat c_\ell = \frac{\sum_{i=1}^{n} Y_i\, I(X_i \in R_\ell)}{\sum_{i=1}^{n} I(X_i \in R_\ell)}.$$

Thus, the difficult part is the determination of R1 , · · · , RM .


Unfortunately, there is no simple closed-form solution for these regions; we only have a procedure for computing them. Here is what we do in practice. Let $X_{ij}$ be the $j$-th coordinate of the $i$-th observation $X_i$.

1. For a given coordinate $j$ and split point $s$, we define
$$R_a(j, s) = \{x : x_j < s\}, \qquad R_b(j, s) = \{x : x_j \ge s\}.$$

2. Find $c_a$ and $c_b$ that minimize
$$\sum_{X_i \in R_a}(Y_i - c_a)^2 \quad\text{and}\quad \sum_{X_i \in R_b}(Y_i - c_b)^2,$$
respectively (these are the averages of $Y_i$ within $R_a$ and $R_b$).
3. Compute the score
$$S(j, s) = \sum_{X_i \in R_a}(Y_i - c_a)^2 + \sum_{X_i \in R_b}(Y_i - c_b)^2.$$

4. Vary $s$ and repeat the same calculation until we find the minimizer of $S(j, s)$; denote the minimal score by $S^*(j)$.
5. Compute the score $S^*(j)$ for $j = 1, \cdots, d$.
6. Pick the coordinate $j$ and the corresponding split point $s$ attaining the minimal score $S^*(j)$. Partition the space into two parts according to this split.
7. Repeat the above procedure on each resulting region until a certain stopping criterion is satisfied.
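Here is a minimal sketch in Python of steps 1–6, i.e., the search for the best coordinate and split point on a given set of observations (my own illustration; the function name and the candidate split points, taken as midpoints between consecutive observed values, are arbitrary choices). Growing the tree amounts to applying this search recursively to each resulting region.

```python
import numpy as np

def best_split(X, Y):
    """Greedy search (steps 1-6) for the coordinate j and split point s minimizing
    S(j, s) = sum_{X_i in R_a}(Y_i - c_a)^2 + sum_{X_i in R_b}(Y_i - c_b)^2,
    where c_a, c_b are the within-region averages of Y.

    X: (n, d) array of covariates, Y: (n,) array of responses.
    Returns (best_j, best_s, best_score).
    """
    n, d = X.shape
    best_j, best_s, best_score = None, None, np.inf
    for j in range(d):
        values = np.unique(X[:, j])
        # Candidate split points: midpoints between consecutive observed values of coordinate j.
        for s in (values[:-1] + values[1:]) / 2:
            left = X[:, j] < s
            c_a, c_b = Y[left].mean(), Y[~left].mean()
            score = np.sum((Y[left] - c_a) ** 2) + np.sum((Y[~left] - c_b) ** 2)
            if score < best_score:
                best_j, best_s, best_score = j, s, score
    return best_j, best_s, best_score

# Illustrative usage: the first split should be found near x_1 = 0.5.
rng = np.random.default_rng(1)
X = rng.uniform(size=(100, 2))
Y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + 0.1 * rng.standard_normal(100)
j_star, s_star, _ = best_split(X, Y)
```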

Using the above procedure, we will eventually end up with a collection of rectangular regions $\widehat R_1, \cdots, \widehat R_M$. Then the final estimator is
$$\widehat m(x) = \sum_{\ell=1}^{M} \widehat c_\ell\, I(x \in \widehat R_\ell).$$

For the stopping criterion, sometimes people pick the number of regions $M$ in advance: as soon as we obtain $M$ regions, the splitting procedure stops. However, such a choice of $M$ is rather arbitrary. A popular alternative is to base the stopping rule on minimizing a score that balances the fitting quality and the complexity of the tree. For instance, we may stop splitting if the following score is no longer decreasing:
$$C_{\lambda, n}(M) = \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \widehat m(X_i)\bigr)^2 + \lambda M,$$
where $\lambda > 0$ is a tuning parameter that determines the 'penalty' for having a complex tree. In the next lecture, we will talk more about this type of penalty tuning parameter.
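For illustration (my own sketch, not from the notes), suppose the residuals of the fitted trees with $M = 1, 2, \dots$ regions have already been computed and stored in a dictionary `residuals_by_M` (a hypothetical name); the stopping rule then reads:

```python
import numpy as np

def penalized_score(residuals, M, lam):
    """C_{lambda, n}(M) = (1/n) * sum_i (Y_i - m_hat(X_i))^2 + lambda * M."""
    return np.mean(np.asarray(residuals) ** 2) + lam * M

def choose_tree_size(residuals_by_M, lam):
    """Keep growing the tree as long as the penalized score decreases; return the chosen M."""
    sizes = sorted(residuals_by_M)
    best_M = sizes[0]
    best_score = penalized_score(residuals_by_M[best_M], best_M, lam)
    for M in sizes[1:]:
        score = penalized_score(residuals_by_M[M], M, lam)
        if score >= best_score:   # the score is no longer decreasing: stop splitting
            break
        best_M, best_score = M, score
    return best_M
```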
Remark.

• Interpretation. A powerful feature of the regression tree is that it is easy to interpret. Even without much training, a practitioner can use the output from a regression tree very easily. A limitation of the regression tree is that it partitions the covariate space into rectangular regions, which may be unrealistic for the actual regression model.
• Cross-validation. How do we choose the tuning parameter $\lambda$? A simple approach called cross-validation1 can produce a good choice of this quantity. Not only $\lambda$: other tuning parameters, such as the number of basis functions $M$, the smoothing bandwidth $h$, and the bin size $b$, can also be chosen using cross-validation (a sketch is given at the end of this lecture).
1 https://en.wikipedia.org/wiki/Cross-validation_(statistics)

• MARS (multivariate adaptive regression splines). The regression tree has another limitation: it predicts the same value within each region, which creates a jump at the boundary between two adjacent regions. A modified regression tree called MARS (multivariate adaptive regression splines) allows continuous (and possibly smooth) changes across regions. See https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines.
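As an illustration of the cross-validation remark above (my own sketch, not from the notes), here is a K-fold procedure for choosing the number of basis functions $M$ from Section 10.1; the same template applies to $\lambda$, $h$, or $b$ by plugging in the corresponding fitting routine. Since a held-out fold breaks the uniform-grid structure, the coefficients are refit by least squares rather than by the grid-average formula of Section 10.1.

```python
import numpy as np

def cosine_basis(x, M):
    x = np.asarray(x, dtype=float)
    Phi = np.ones((x.size, M))
    for j in range(2, M + 1):
        Phi[:, j - 1] = np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)
    return Phi

def cv_choose_M(X, Y, candidates, K=5, seed=0):
    """Pick the number of basis functions M minimizing the K-fold cross-validated prediction error."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    n = Y.size
    folds = np.random.default_rng(seed).permutation(n) % K
    cv_err = {}
    for M in candidates:
        errs = []
        for k in range(K):
            train, test = folds != k, folds == k
            # Refit the coefficients by least squares on the training fold.
            theta = np.linalg.lstsq(cosine_basis(X[train], M), Y[train], rcond=None)[0]
            pred = cosine_basis(X[test], M) @ theta
            errs.append(np.mean((Y[test] - pred) ** 2))
        cv_err[M] = np.mean(errs)
    return min(cv_err, key=cv_err.get), cv_err

# Illustrative usage on simulated data.
rng = np.random.default_rng(2)
n = 200
X = np.arange(1, n + 1) / n
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(n)
best_M, cv_err = cv_choose_M(X, Y, candidates=range(2, 21))
```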
