Basis Approach
Recall that we observe pairs $(X_1, Y_1), \cdots, (X_n, Y_n)$ and we are interested in the regression function $m(x) = E(Y_1 | X_1 = x)$. In this section, we will make the following two assumptions:

• $X_i = \frac{i}{n}$. Namely, the covariates form a uniform grid over $[0, 1]$ and are non-random (a fixed design).

• $Y_i = m(X_i) + \sigma \epsilon_i$, where $\epsilon_1, \cdots, \epsilon_n$ are i.i.d. noise variables with $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = 1$.
Similar to the basis approach for the density estimation problem, where we approximate the density function by a sum of coefficients times basis functions, we will approximate the regression function by a basis expansion:
\[
m(x) = \sum_{j=1}^{\infty} \theta_j \phi_j(x),
\]
where $\{\phi_j\}$ is an orthonormal basis of $L_2([0,1])$ and $\theta_j = \int_0^1 m(x)\phi_j(x)\, dx$.
As is done in density estimation, we will use only the first $M$ basis functions to form our estimator. Namely,
\[
\hat{m}_M(x) = \sum_{j=1}^{M} \hat{\theta}_j \phi_j(x),
\]
for some coefficient estimates $\hat{\theta}_1, \cdots, \hat{\theta}_M$. Again, $M$ is the tuning parameter of our estimator.
Here is a simple choice of the coefficient estimates that we will be using:
\[
\hat{\theta}_j = \frac{1}{n} \sum_{i=1}^{n} Y_i \phi_j(X_i) = \frac{1}{n} \sum_{i=1}^{n} Y_i \phi_j\!\left(\frac{i}{n}\right).
\]
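As a concrete illustration, here is a minimal sketch of this estimator. The notes do not fix a particular basis; the cosine basis $\phi_1(x) = 1$, $\phi_j(x) = \sqrt{2}\cos((j-1)\pi x)$ for $j \geq 2$ used below is one standard orthonormal basis on $[0,1]$, and the function true_m and the noise level sigma are arbitrary choices made only for the example.

```python
import numpy as np

def cosine_basis(j, x):
    """Orthonormal cosine basis on [0,1]: phi_1(x) = 1, phi_j(x) = sqrt(2) cos((j-1) pi x)."""
    return np.ones_like(x) if j == 1 else np.sqrt(2) * np.cos((j - 1) * np.pi * x)

def fit_basis_regression(Y, M):
    """Compute theta_hat_j = (1/n) sum_i Y_i phi_j(i/n) for j = 1, ..., M."""
    n = len(Y)
    X = np.arange(1, n + 1) / n                        # fixed design X_i = i/n
    return np.array([np.mean(Y * cosine_basis(j, X)) for j in range(1, M + 1)])

def predict(theta_hat, x):
    """Evaluate m_hat_M(x) = sum_j theta_hat_j phi_j(x)."""
    return sum(t * cosine_basis(j, x) for j, t in enumerate(theta_hat, start=1))

# Toy example (true_m and sigma are illustrative, not from the notes).
rng = np.random.default_rng(0)
n, M, sigma = 500, 10, 0.3
true_m = lambda x: np.sin(2 * np.pi * x) + x
X = np.arange(1, n + 1) / n
Y = true_m(X) + sigma * rng.standard_normal(n)
theta_hat = fit_basis_regression(Y, M)
m_hat = predict(theta_hat, np.linspace(0, 1, 200))     # fitted curve on a grid
```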
To determine the tuning parameter $M$, we analyze the MISE. We start by analyzing the bias and variance of $\hat{\theta}_j$.
Before doing so, observe that
\[
\hat{m}_M(x) = \sum_{j=1}^{M} \hat{\theta}_j \phi_j(x)
= \sum_{j=1}^{M} \frac{1}{n} \sum_{i=1}^{n} Y_i \phi_j\!\left(\frac{i}{n}\right) \phi_j(x)
= \frac{1}{n} \sum_{i=1}^{n} Y_i \sum_{j=1}^{M} \phi_j\!\left(\frac{i}{n}\right) \phi_j(x),
\]
so $\hat{m}_M(x)$ is a weighted sum of independent random variables. Thus, under mild conditions, the central limit theorem gives
\[
\sqrt{n}\,\big(\hat{m}_M(x) - E(\hat{m}_M(x))\big) \overset{D}{\to} N(0, \sigma_M^2)
\]
for some $\sigma_M^2$. Note that later our analysis will demonstrate
\[
E(\hat{m}_M(x)) = \sum_{j=1}^{M} \theta_j \phi_j(x), \qquad
\sigma_M^2 = \sigma^2 \sum_{j=1}^{M} \phi_j^2(x).
\]
Bias.
\begin{align*}
\mathrm{bias}(\hat{\theta}_j) &= E(\hat{\theta}_j) - \theta_j \\
&= E\!\left(\frac{1}{n} \sum_{i=1}^{n} Y_i \phi_j\!\left(\frac{i}{n}\right) \,\Big|\, X_i = \frac{i}{n}\right) - \theta_j \\
&= \frac{1}{n} \sum_{i=1}^{n} E\!\left(Y_i \,\Big|\, X_i = \frac{i}{n}\right) \phi_j\!\left(\frac{i}{n}\right) - \theta_j \\
&= \frac{1}{n} \sum_{i=1}^{n} m\!\left(\frac{i}{n}\right) \phi_j\!\left(\frac{i}{n}\right) - \theta_j \\
&= \frac{1}{n} \sum_{i=1}^{n} m\!\left(\frac{i}{n}\right) \phi_j\!\left(\frac{i}{n}\right) - \int_0^1 m(x) \phi_j(x)\, dx.
\end{align*}
Namely, the bias is the difference between the actual integral and a discretized (Riemann-sum) version of it. We know that when $n$ is large, the two are almost the same, so we can ignore this bias. Thus, we will write
\[
\mathrm{bias}(\hat{\theta}_j) = 0
\]
for simplicity.
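To see numerically why this discretization error is negligible, here is a quick check (an illustrative example, not from the notes, using the cosine basis and an arbitrary smooth $m$) comparing the Riemann-sum version of $\theta_j$ with the integral:

```python
import numpy as np
from scipy.integrate import quad

j = 3
phi = lambda x: np.sqrt(2) * np.cos((j - 1) * np.pi * x)   # cosine basis function phi_j
m = lambda x: np.sin(2 * np.pi * x) + x                     # arbitrary smooth regression function

theta_j, _ = quad(lambda x: m(x) * phi(x), 0, 1)            # theta_j = int_0^1 m(x) phi_j(x) dx
for n in (50, 500, 5000):
    grid = np.arange(1, n + 1) / n
    riemann = np.mean(m(grid) * phi(grid))                  # (1/n) sum_i m(i/n) phi_j(i/n)
    print(n, abs(riemann - theta_j))                        # discretization error shrinks with n
```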
Variance.
\begin{align*}
\mathrm{Var}(\hat{\theta}_j) &= \mathrm{Var}\left(\frac{1}{n} \sum_{i=1}^{n} \underbrace{\left[m\!\left(\frac{i}{n}\right) + \sigma \epsilon_i\right]}_{=Y_i} \phi_j\!\left(\frac{i}{n}\right)\right) \\
&= \frac{1}{n^2} \sum_{i=1}^{n} \phi_j^2\!\left(\frac{i}{n}\right) \mathrm{Var}(\sigma \epsilon_i) \\
&= \frac{\sigma^2}{n^2} \sum_{i=1}^{n} \phi_j^2\!\left(\frac{i}{n}\right).
\end{align*}
Note that $\frac{1}{n} \sum_{i=1}^{n} \phi_j^2\!\left(\frac{i}{n}\right) \approx \int_0^1 \phi_j^2(x)\, dx = 1$. For simplicity, we just write
\[
\mathrm{Var}(\hat{\theta}_j) = \frac{\sigma^2}{n}.
\]
Again, if we assume that $m$ satisfies $\int_0^1 |m''(x)|^2\, dx < \infty$, we have
\[
\sum_{j=M+1}^{\infty} \theta_j^2 = O(M^{-4}),
\]
so the integrated squared bias of $\hat{m}_M$ is $\int_0^1 \mathrm{bias}^2(\hat{m}_M(x))\, dx = \sum_{j=M+1}^{\infty} \theta_j^2 = O(M^{-4})$. The integrated variance is
\[
\int_0^1 \mathrm{Var}(\hat{m}_M(x))\, dx = \frac{\sigma^2}{n} \sum_{j=1}^{M} \int_0^1 \phi_j^2(x)\, dx = \frac{\sigma^2 M}{n} = O\!\left(\frac{M}{n}\right).
\]
Recall that the MISE is just the sum of the integrated squared bias and the integrated variance, so we obtain
\[
\mathrm{MISE}(\hat{m}_M) = \int_0^1 \mathrm{bias}^2(\hat{m}_M(x))\, dx + \int_0^1 \mathrm{Var}(\hat{m}_M(x))\, dx = O(M^{-4}) + O\!\left(\frac{M}{n}\right).
\]
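Balancing the two terms determines the rate-optimal choice of $M$ (a standard bias and variance balancing step, with the constants hidden in the $O(\cdot)$ terms treated as fixed):
\[
M^{-4} \asymp \frac{M}{n} \;\Longrightarrow\; M^{*} \asymp n^{1/5}, \qquad \mathrm{MISE}(\hat{m}_{M^{*}}) = O\!\left(n^{-4/5}\right),
\]
the same $n^{-4/5}$ rate attained by other nonparametric regression estimators when $m$ has a square-integrable second derivative.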
The basis estimator is another linear smoother. To see this, we use the following expansion:
\begin{align*}
\hat{m}_M(x) &= \sum_{j=1}^{M} \hat{\theta}_j \phi_j(x) \\
&= \sum_{j=1}^{M} \frac{1}{n} \sum_{i=1}^{n} Y_i \phi_j(X_i) \phi_j(x) \\
&= \sum_{i=1}^{n} \left(\sum_{j=1}^{M} \frac{1}{n} \phi_j(X_i) \phi_j(x)\right) Y_i \\
&= \sum_{i=1}^{n} \ell_i(x) Y_i,
\end{align*}
where $\ell_i(x) = \sum_{j=1}^{M} \frac{1}{n} \phi_j(X_i) \phi_j(x)$.
Recall from the linear smoother theory that we can estimate $\sigma^2$ using the residuals and the degrees of freedom:
\[
\hat{\sigma}^2 = \frac{1}{n - 2\nu + \tilde{\nu}} \sum_{i=1}^{n} e_i^2,
\]
where $e_i = \hat{Y}_i - Y_i = \hat{m}_M(X_i) - Y_i$ and $\nu, \tilde{\nu}$ are the degrees of freedom (see the previous lecture note).
With this variance estimator, the fact that $\mathrm{Var}(\hat{m}_M(x)) = \frac{\sigma^2}{n} \sum_{j=1}^{M} \phi_j^2(x)$, and the asymptotic normality, we can construct a confidence interval (band) of $m$ using
\[
\hat{m}_M(x) \pm z_{1-\alpha/2} \sqrt{\frac{\hat{\sigma}^2}{n} \sum_{j=1}^{M} \phi_j^2(x)}.
\]
Note that this confidence interval is valid for $E(\hat{m}_M(x)) = \sum_{j=1}^{M} \theta_j \phi_j(x)$, not the actual $m(x)$. The difference between them is the bias of our estimator.
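Here is a minimal sketch of this band, continuing the cosine-basis example from earlier (cosine_basis is the illustrative helper defined above, repeated here for completeness). The degrees of freedom are computed from the smoothing matrix $L_{ij} = \ell_j(X_i)$ as $\nu = \mathrm{tr}(L)$ and $\tilde{\nu} = \mathrm{tr}(L^{\top} L)$, which is one common convention for linear smoothers; the previous lecture note is the authoritative definition.

```python
import numpy as np
from scipy.stats import norm

def cosine_basis(j, x):
    # Repeated from the earlier sketch: phi_1 = 1, phi_j(x) = sqrt(2) cos((j-1) pi x).
    return np.ones_like(x) if j == 1 else np.sqrt(2) * np.cos((j - 1) * np.pi * x)

def confidence_band(Y, M, grid, alpha=0.05):
    """Pointwise 1-alpha band for E[m_hat_M(x)] under the fixed design X_i = i/n."""
    n = len(Y)
    X = np.arange(1, n + 1) / n
    Phi_X = np.column_stack([cosine_basis(j, X) for j in range(1, M + 1)])     # phi_j(X_i)
    Phi_g = np.column_stack([cosine_basis(j, grid) for j in range(1, M + 1)])  # phi_j(grid_k)
    L = Phi_X @ Phi_X.T / n                         # smoothing matrix, so that Y_hat = L @ Y
    nu, nu_tilde = np.trace(L), np.trace(L.T @ L)   # degrees of freedom (one common convention)
    resid = L @ Y - Y                               # e_i = m_hat_M(X_i) - Y_i
    sigma2_hat = np.sum(resid ** 2) / (n - 2 * nu + nu_tilde)
    m_hat = Phi_g @ (Phi_X.T @ Y / n)               # m_hat_M evaluated on the grid
    half = norm.ppf(1 - alpha / 2) * np.sqrt(sigma2_hat / n * np.sum(Phi_g ** 2, axis=1))
    return m_hat - half, m_hat + half
```

For example, confidence_band(Y, 10, np.linspace(0, 1, 200)) returns the lower and upper band evaluated on a grid of 200 points.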
Regression Tree
In this section, we assume that the covariate may have multiple dimensions, i.e., $x = (x_1, \cdots, x_d)$, and our data $(X_1, Y_1), \cdots, (X_n, Y_n) \sim P$ for some distribution $P$. Again, we are interested in the regression function $m(x) = E(Y_1 | X_1 = x)$.
A regression tree constructs an estimator of the form
\[
m(x) = \sum_{\ell=1}^{M} c_\ell\, I(x \in R_\ell),
\]
where $R_1, \cdots, R_M$ are rectangular regions that partition the covariate space.
[Figure: a toy regression tree that first splits on $x_1 < 10$ versus $x_1 \geq 10$; the branch with $x_1 < 10$ is further split on $x_2 < 5$ versus $x_2 \geq 5$, giving three regions $R_1, R_2, R_3$. The second panel shows the corresponding rectangular partition of the $(x_1, x_2)$ plane.]
A regression tree estimator predicts the same value of the response $Y$ within the same region of the covariate space. Namely, $m(x)$ takes the same value whenever $x$ lies in the same region.
To use a regression tree, there are $2M$ quantities to be determined: the regions $R_1, \cdots, R_M$ and the predicted values $c_1, \cdots, c_M$. When $R_1, \cdots, R_M$ are given, $c_1, \cdots, c_M$ can be simply estimated by the average of the responses within each region, i.e.,
\[
\hat{c}_\ell = \frac{\sum_{i=1}^{n} Y_i\, I(X_i \in R_\ell)}{\sum_{i=1}^{n} I(X_i \in R_\ell)}.
\]
The regions themselves are chosen by recursively splitting the covariate space, one coordinate at a time:
1. Pick a coordinate $j \in \{1, \cdots, d\}$ and a candidate split point $s$.
2. Partition the observations into $R_a = \{X_i : X_{ij} < s\}$ and $R_b = \{X_i : X_{ij} \geq s\}$, and compute the within-region averages $\hat{c}_a$ and $\hat{c}_b$, respectively.
3. Compute the score
\[
S(j, s) = \sum_{X_i \in R_a} (Y_i - \hat{c}_a)^2 + \sum_{X_i \in R_b} (Y_i - \hat{c}_b)^2.
\]
4. Change $s$ and repeat the same calculation until we find the minimizer of $S(j, s)$; denote the minimal score by $S^*(j)$.
5. Compute the score $S^*(j)$ for every $j = 1, \cdots, d$.
6. Pick the dimension (coordinate) $j$ and the corresponding split point $s$ with the minimal score $S^*(j)$. Partition the space into two parts according to this split.
7. Repeat the above procedure within each partition until a certain stopping criterion is satisfied.
Using the above procedure, we will eventually end up with a collection of rectangular regions $\hat{R}_1, \cdots, \hat{R}_M$. Then the final estimator is
\[
\hat{m}(x) = \sum_{\ell=1}^{M} \hat{c}_\ell\, I(x \in \hat{R}_\ell).
\]
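To make the splitting step concrete, here is a minimal sketch of one greedy split (an illustrative implementation, not code from the notes): for each coordinate $j$ it scans candidate split points and returns the pair $(j, s)$ minimizing $S(j, s)$. Using midpoints between consecutive observed values as candidate splits is a common convention, not something prescribed above.

```python
import numpy as np

def best_split(X, Y):
    """Find the coordinate j and split point s minimizing S(j, s).

    X: (n, d) array of covariates, Y: (n,) array of responses.
    Returns (j_best, s_best, S_best).
    """
    n, d = X.shape
    j_best, s_best, S_best = None, None, np.inf
    for j in range(d):
        # Candidate split points: midpoints between consecutive distinct values of X[:, j].
        vals = np.unique(X[:, j])
        for s in (vals[:-1] + vals[1:]) / 2:
            left = X[:, j] < s
            right = ~left
            c_a, c_b = Y[left].mean(), Y[right].mean()            # within-region averages
            score = np.sum((Y[left] - c_a) ** 2) + np.sum((Y[right] - c_b) ** 2)
            if score < S_best:
                j_best, s_best, S_best = j, s, score
    return j_best, s_best, S_best
```

Recursively applying best_split within each resulting half, until a stopping rule such as the penalized score below is met, yields the estimated regions $\hat{R}_1, \cdots, \hat{R}_M$.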
For the stopping criterion, sometimes people will fix the number $M$ in advance: as soon as we obtain $M$ regions, the splitting procedure stops. However, such a choice of $M$ is rather arbitrary. A popular alternative is to stop based on a score that balances the fitting quality and the complexity of the tree. For instance, we may stop splitting once the following score is no longer decreasing:
\[
C_{\lambda, n}(M) = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{m}(X_i))^2 + \lambda M,
\]
where λ > 0 is a tuning parameter that determines the ‘penalty’ for having a complex tree. In the next
lecture, we will talk more about this penalty type tuning parameter.
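As a small sketch of this rule (the names train_mse and lam are hypothetical: train_mse[M-1] is assumed to hold the training error $\frac{1}{n}\sum_{i}(Y_i - \hat{m}(X_i))^2$ of the tree with $M$ regions, recorded as the tree is grown one split at a time):

```python
def stop_at(train_mse, lam):
    """Return the number of regions M at which C_{lambda,n}(M) stops decreasing.

    train_mse[M-1]: training error of the tree with M regions (assumed precomputed).
    lam: the penalty parameter lambda.
    """
    M = 1
    while M < len(train_mse):
        # Accept the next split only if it decreases the penalized score.
        if train_mse[M] + lam * (M + 1) >= train_mse[M - 1] + lam * M:
            break
        M += 1
    return M
```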
Remark.
• Interpretation. A powerful feature of the regression tree is that it is easy to interpret. Even without much training, a practitioner can use the output of a regression tree very easily. A limitation of the regression tree is that it partitions the space of covariates into rectangular regions, which may be unrealistic for the actual regression model.
• Cross-validation. How do we choose the tuning parameter $\lambda$? There is a simple approach called cross-validation¹ that gives a good choice of this quantity. Not only $\lambda$ but also other tuning parameters, such as the number of basis functions $M$, the smoothing bandwidth $h$, and the bin size $b$, can be chosen using cross-validation.
1 https://en.wikipedia.org/wiki/Cross-validation_(statistics)
• MARS (multivariate adaptive regression splines). The regression tree has another limitation: it predicts the same value within the same region, which creates a jump at the boundary between two adjacent regions. There is a modified regression tree called MARS (multivariate adaptive regression splines) that allows continuous (and possibly smooth) changes across regions. See https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines.