Additive Models and Trees
Clint P. George
1 Introduction
Tree Based Models
Example
2 Classification Trees
General Setup
Growing the Tree
Tree Pruning
Classification Tree: Example
3 Regression Trees
Overview
Growing the Tree
Tree Pruning
Regression Tree: Example
4 Conclusions
Overview
Some characteristics:
- No distribution assumptions on the variables
- The feature space can be fully represented by a single tree
- Interpretable
Overview
Observations: the iris data (150 samples of three species)
Node  Split       N_m (size)  Loss  Prediction   Class proportions (setosa, versicolor, virginica)
1     root        150         100   setosa       (0.33, 0.33, 0.33)
2     PL < 2.45    50           0   setosa       (1.00, 0.00, 0.00) *
3     PL ≥ 2.45   100          50   versicolor   (0.00, 0.50, 0.50)
6     PW < 1.75    54           5   versicolor   (0.00, 0.90, 0.09)
…
Table: Tree split path and node proportions (PL = petal length, PW = petal width; * marks a terminal node)
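The split path above can be reproduced, approximately, with standard CART software. As a hedged sketch (scikit-learn is an assumption of this example, not something used in the slides), a depth-two classification tree on the iris data recovers the same petal-length and petal-width thresholds:

```python
# Sketch: grow a small classification tree on the iris data and print its splits.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# Limit the depth so the tree stays comparable to the table above.
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Text rendering of the split path (petal length 2.45 and petal width 1.75 thresholds).
print(export_text(clf, feature_names=iris.feature_names))
```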
Overview
In node m, representing a region R_m with N_m observations, the proportion of class k observations is
\[
\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k), \qquad k = 1, 2, \ldots, K \tag{1}
\]
Classification Rule
Classify the observations in node m to the majority class
\[
k(m) = \arg\max_k \hat{p}_{mk}
\]
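As a small numerical sketch of equation (1) and the majority-class rule (the helper name and the toy node below are illustrative, not from the slides):

```python
import numpy as np

def node_class_proportions(y_node, n_classes):
    """p_hat_mk for one node: the fraction of the node's samples in each class k."""
    counts = np.bincount(y_node, minlength=n_classes)
    return counts / counts.sum()

# Toy node: 54 samples, 49 of class 1 (versicolor) and 5 of class 2 (virginica),
# matching node 6 in the table above.
y_node = np.array([1] * 49 + [2] * 5)
p_hat = node_class_proportions(y_node, n_classes=3)
k_m = int(np.argmax(p_hat))      # majority-class rule: k(m) = argmax_k p_hat_mk
print(p_hat, k_m)                # roughly [0.00, 0.91, 0.09], predicted class 1
```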
Impurity Functions
Our aim is to reduce the node misclassification cost, i.e., to make all the samples in a node belong to one class.
Assume a predictor x_j.
Let m_left and m_right be the left and right branches obtained by splitting node m on x_j:
- when x_j is continuous or ordinal, m_left and m_right are given by x_j < s and x_j ≥ s for a split point s
- when x_j is categorical, we may need an exhaustive subset search to find the best split
Let q_left and q_right be the proportions of the samples in node m assigned to m_left and m_right.
The quality of the split is measured by the decrease in impurity
\[
\Delta i_j(s, m) = i(m) - q_\text{left}\, i(m_\text{left}) - q_\text{right}\, i(m_\text{right}),
\]
where the node impurity (Gini index) is
\[
i(m) = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}) = 1 - \sum_{k=1}^{K} \hat{p}_{mk}^{\,2}
\]
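A minimal sketch of the Gini impurity and the impurity decrease Δi_j(s, m) for one candidate split (NumPy assumed; the function names are hypothetical):

```python
import numpy as np

def gini(y, n_classes):
    """Gini impurity i(m) = 1 - sum_k p_hat_mk^2 for the class labels in a node."""
    p = np.bincount(y, minlength=n_classes) / len(y)
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(y, x_j, s, n_classes):
    """Delta i_j(s, m) for splitting a node into x_j < s and x_j >= s."""
    left, right = y[x_j < s], y[x_j >= s]
    if len(left) == 0 or len(right) == 0:        # degenerate split: no decrease
        return 0.0
    q_left, q_right = len(left) / len(y), len(right) / len(y)
    return (gini(y, n_classes)
            - q_left * gini(left, n_classes)
            - q_right * gini(right, n_classes))

# Toy check: a perfectly separating split removes all of the parent's impurity.
y = np.array([0, 0, 0, 1, 1, 1])
x_j = np.array([1.0, 1.2, 1.4, 3.0, 3.2, 3.4])
print(impurity_decrease(y, x_j, s=2.0, n_classes=2))   # 0.5, the parent's Gini index
```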
Greedy Algorithm
Scan through all predictors x_j to find the best pair (j, s) with the largest decrease Δi_j(s, m).
Then repeat this splitting procedure recursively for m_left and m_right.
Define a stopping criterion:
- stop when some minimum node size N_m is reached
- split only when the decrease in cost exceeds a threshold
Tree size:
- a very large tree may overfit the data
- a small tree may not capture the important structure in the data
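These stopping rules map onto the hyperparameters of common CART implementations. As one hedged example (the parameter names below are scikit-learn's, an assumption of this sketch):

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument mirrors one of the stopping rules listed above.
clf = DecisionTreeClassifier(
    criterion="gini",            # impurity function i(m)
    min_samples_split=20,        # do not split nodes smaller than this (minimum N_m)
    min_impurity_decrease=1e-3,  # split only if the weighted impurity decrease exceeds this
    max_depth=None,              # or cap the overall tree size directly
)
```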
Key differences from classification trees:
- The outcome variable is continuous
- The criterion for splitting and pruning is the squared error
- The predicted value is the average of the outcome values in a tree node
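A brief illustration of these differences (scikit-learn names assumed, not part of the original slides): a regression tree's prediction for a point is the mean response of the leaf it falls into.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Splits are chosen by squared error (the default criterion); each leaf predicts its mean y.
reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg.predict([[2.0]]))   # the average response in the leaf containing x = 2.0
```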
Greedy Algorithm
Start with all of the data (x_i, y_i), i = 1, ..., N. Consider a splitting variable j and a split point s, and define the regions
\[
R_1(j, s) = \{x \mid x_j < s\} \quad \text{and} \quad R_2(j, s) = \{x \mid x_j \geq s\}
\]
Then the variables j and s can be found using the greedy criterion
\[
\min_{j, s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 \;+\; \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2 \Big],
\]
where the inner minima are attained at
\[
\hat{c}_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j, s))
\]
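A minimal sketch of this greedy search (NumPy assumed, helper name hypothetical), with the inner minima replaced by the region means ĉ_1 and ĉ_2 as above:

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the pair (j, s) minimizing the squared-error criterion."""
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[1:]:              # candidate split points for x_j
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            # Inner minima: c1_hat and c2_hat are the means of each region.
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse

# Toy usage: one predictor whose response jumps between x = 3 and x = 6.
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
print(best_split(X, y))   # splits feature 0 at s = 6.0 (i.e. x < 6 vs. x >= 6)
```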
Tree Pruning
Tree pruning can be done by weakest-link pruning using the squared-error impurity function
\[
i(m) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2
\]
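Weakest-link (cost-complexity) pruning is available in common implementations; a hedged scikit-learn sketch (library and parameter names are assumptions of this example):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Grow a large tree first, then prune it back along the weakest-link path.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)          # one alpha per weakest link
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]           # pick a mid-path penalty
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
print(full_tree.get_n_leaves(), pruned.get_n_leaves())       # the pruned tree has fewer leaves
```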
Summary
Advantages:
CART makes no distribution assumptions on the variables and
supports both categorical and continuous variables
The binary tree structure offers excellent interpretability
Can be used for ranking the variables, by summing the impurity decreases over the splits made on each variable (see the sketch at the end of this section)
Disadvantages:
Since CART uses a binary tree grown by recursive splits, it suffers from instability: small changes in the data can change the top splits and hence the whole tree
Splits are aligned with the axes of the feature space, which may
be suboptimal
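The variable-ranking idea mentioned under the advantages can be sketched as follows (scikit-learn's impurity-based importances, an assumption of this example, are computed from the summed impurity decreases of each variable's splits):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Each variable's summed impurity decrease across its splits, normalized over the tree.
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```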