
Additive Models and Trees

The document describes classification and regression trees (CART), which use binary recursive partitioning to divide a feature space into regions and fit a tree model. Key points:
- Trees can represent a feature space with a single model and make no distributional assumptions.
- Growing a tree involves recursively splitting nodes on variables and split points to minimize impurity; trees are then pruned to avoid overfitting.
- Classification trees assign observations to the majority class of a node; regression trees predict a continuous variable.
- Impurity measures such as the Gini index and cross-entropy are used to determine splits that reduce misclassification in nodes.

Chapter 9: Additive Models and Trees


Section 9.2 Tree Based Models

Clint P. George

Department of Computer and Information Science and Engineering


University of Florida

Elements of Statistical Learning

Mar 15, 2010


1 Introduction
   Tree Based Models
   Example
2 Classification Trees
   General Setup
   Growing the Tree
   Tree Pruning
   Classification Tree: Example
3 Regression Trees
   Overview
   Growing the Tree
   Tree Pruning
   Regression Tree: Example
4 Conclusions

Overview

It is a method based on binary recursive partitioning of the feature space into regions and fitting a tree model.

Some characteristics:
- No distributional assumptions on the variables
- The feature space can be fully represented by a single tree
- Interpretable

Example: Classification Tree


Data set: IRIS - four features, 150 samples, and three classes
Software: RPART (Terry M. Therneau, 1997) package in R

Observations
Node Split Nm loss prediction proportions
1 root 150 100 setosa (0.33 0.33 0.33)
2 PL < 2.45 50 0 setosa (1.00 0.00 0.00)*
3 PL ≥ 2.45 100 50 versicolor (0.00 0.50 0.50)
6 PW < 1.75 54 5 versicolor (0.00 0.90 0.09)
- - - - - -
Table: Tree split path and node proportions

The variables actually used in tree construction: Petal.Length, Petal.Width, and Sepal.Length
The tree partitions the sample space into nine regions
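The slides' fit uses the RPART package in R; as a rough, hedged equivalent, the sketch below uses Python's scikit-learn (our substitution, not the original software), so the exact splits and node counts will differ from the table above.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    X, y = iris.data, iris.target

    # Grow a classification tree; min_samples_leaf acts as a simple stopping
    # criterion on leaf size (not identical to rpart's defaults).
    clf = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5, random_state=0)
    clf.fit(X, y)

    # Print the split path: splitting variables, split points, and the predicted
    # class at each leaf.
    print(export_text(clf, feature_names=iris.feature_names))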

Overview

The basic idea of any tree building:

Grow a large, complicated tree that explains the data
- decide the splitting variables (predictors) and split points
- binary recursive partitioning
- stopping criterion, e.g. a maximum number of samples in a leaf node
Prune the tree to avoid overfitting
References: Bishop (2007); Pham (2006); Hastie et al. (2009)

Classification Tree Setup


Suppose we have a K-class classification problem with data $\{x_i, y_i\}_{i=1}^{N}$, where

$y_i \in \{1, 2, \ldots, K\}$
$x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, with $d$ the dimensionality.

Then the proportion of class k in node m is

$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k), \qquad k = 1, 2, \ldots, K, \qquad (1)$$

where $N_m$ is the number of observations in tree node m (region $R_m$) and $I(\cdot)$ is an indicator function.
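As a concrete illustration of equation (1), here is a minimal Python sketch; the node is represented simply as an array of class labels, and node_proportions is our own hypothetical helper, not part of any package.

    import numpy as np

    def node_proportions(y_node, K):
        """Proportions p_mk (eq. 1) for the labels y_node falling in node m."""
        N_m = len(y_node)
        return np.array([np.sum(y_node == k) / N_m for k in range(1, K + 1)])

    # A node containing 6 observations from a 3-class problem.
    y_node = np.array([1, 1, 1, 2, 2, 3])
    print(node_proportions(y_node, K=3))   # [0.5  0.333...  0.167]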

Classification Rule
Classify the observations in node m to the majority class:

$$k(m) = \arg\max_k \, \hat{p}_{mk}$$

A more general rule is to assign node m to the class

$$k(m) = \arg\min_k \, r_{mk}$$

where $r_{mk}$ is the expected misclassification cost of predicting class k in node m.

Suppose $\pi_{mj}$ is the probability that an observation in node m belongs to class j, and $c(k \mid j)$ is the cost of classifying a class-j sample as class k; then

$$r_{mk} = \sum_j c(k \mid j)\, \pi_{mj}$$
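A small numerical sketch of the cost-based rule; the cost matrix and probabilities below are made up purely for illustration.

    import numpy as np

    # pi_m[j]: probability that an observation in node m belongs to class j.
    pi_m = np.array([0.1, 0.6, 0.3])
    # cost[k, j] = c(k | j): cost of predicting class k when the true class is j.
    cost = np.array([[0.0, 1.0, 5.0],
                     [1.0, 0.0, 1.0],
                     [1.0, 1.0, 0.0]])

    r_m = cost @ pi_m                        # r_mk = sum_j c(k|j) * pi_mj
    print(r_m)                               # expected cost for each candidate class
    print("assign node m to class", np.argmin(r_m) + 1)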

Impurity Functions

Our aim is to reduce the node misclassification cost, i.e. to make all the samples in a node belong to one class.

This can be seen as reducing the node impurity.

Popular functions for measuring the degree of impurity:
- Misclassification error: $\frac{1}{N_m} \sum_{x_i \in R_m} I(y_i \neq k(m)) = 1 - \max_k \hat{p}_{mk}$
- Gini index: $\sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$
- Cross-entropy (deviance): $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$
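The three impurity measures are straightforward to compute from the node proportions; the following sketch uses our own helper functions (not library calls) to evaluate them for a single node.

    import numpy as np

    def misclassification_error(p):
        return 1.0 - np.max(p)

    def gini(p):
        return float(np.sum(p * (1.0 - p)))

    def cross_entropy(p):
        p = p[p > 0]                         # skip empty classes to avoid log(0)
        return float(-np.sum(p * np.log(p)))

    p = np.array([0.1, 0.8, 0.1])            # node proportions p_mk
    print(misclassification_error(p))        # 0.2
    print(gini(p))                           # 0.34
    print(cross_entropy(p))                  # about 0.64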

Comparison of Impurity Functions (2 class problem)

1 All three encourage the formation of regions in which a high proportion of the data points belongs to one class.
2 The Gini index and cross-entropy are differentiable and more sensitive to changes in the node proportions ($\hat{p}_{mk}$).

Tree Splitting Algorithm

Assume a predictor $x_j$.
Let $m_{left}$ and $m_{right}$ be the left and right branches obtained by splitting node m on $x_j$
- when $x_j$ is continuous or ordinal, $m_{left}$ and $m_{right}$ are given by $x_j < s$ and $x_j \geq s$ for a split point s
- when $x_j$ is categorical, we may need an exhaustive subset search to find s
And let $q_{left}$ and $q_{right}$ be the proportions of samples in node m assigned to $m_{left}$ and $m_{right}$.

Tree Splitting Algorithm

For each $x_j$, find the split point s by maximizing the decrease in impurity

$$\Delta i_j(s, m) = i(m) - [\, q_{left}\, i(m_{left}) + q_{right}\, i(m_{right}) \,]$$

where, using the Gini index,

$$i(m) = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}) = 1 - \sum_{k=1}^{K} \hat{p}_{mk}^2$$
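A minimal sketch of the split search for a single continuous predictor using the Gini index above; best_split is our own illustrative helper, and real implementations add many refinements (sorting once, handling ties, categorical subset searches).

    import numpy as np

    def gini_impurity(y):
        """Gini impurity i(m) of a node holding the class labels y."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - float(np.sum(p ** 2))

    def best_split(x, y):
        """Best split point s on one continuous predictor x, maximizing Delta i_j(s, m)."""
        parent = gini_impurity(y)
        best_s, best_gain = None, 0.0
        for s in np.unique(x)[1:]:           # candidate split points (left node never empty)
            left, right = y[x < s], y[x >= s]
            q_left = len(left) / len(y)
            gain = parent - (q_left * gini_impurity(left) + (1 - q_left) * gini_impurity(right))
            if gain > best_gain:
                best_s, best_gain = s, gain
        return best_s, best_gain

    x = np.array([1.0, 1.2, 1.4, 4.5, 4.7, 5.1])
    y = np.array([0, 0, 0, 1, 1, 1])
    print(best_split(x, y))                  # (4.5, 0.5)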

Greedy Algorithm

Scan through all predictors $x_j$ to find the best pair (j, s) with the largest decrease $\Delta i_j(s, m)$
Then repeat this splitting procedure recursively for $m_{left}$ and $m_{right}$
Define stopping criteria:
- stop when some minimum node size ($N_m$) is reached
- split only when the decrease in cost exceeds a threshold
Tree size:
- a very large tree may overfit the data
- a small tree may not capture the important structure in the data

Cost Complexity Pruning


The focus is to balance misclassification error against a measure of complexity.
Suppose we have grown a large tree $T_0$ using the greedy algorithm, with m indexing terminal nodes.
Define a subtree $T \subset T_0$ by pruning nodes from $T_0$
- collapsing internal nodes by combining the corresponding regions
Find $T_\alpha \subset T_0$ for a given $\alpha \geq 0$ that minimizes the cost-complexity criterion

$$C_\alpha(T) = R(T) + \alpha|T| = \sum_{m=1}^{|T|} N_m\, i(m) + \alpha|T|$$

- $|T|$ is the number of terminal nodes in the tree T (model complexity)
- $i(m)$ is the node impurity
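scikit-learn implements minimal cost-complexity pruning of this form; assuming the iris data from the earlier example (not the slides' data sets), the sketch below retrieves the sequence of effective α values and the corresponding total leaf impurities (R(T) up to normalization).

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Pruning path of the fully grown tree T0: each ccp_alpha is the effective
    # alpha at which the next weakest link is collapsed; each impurity is the
    # total leaf impurity of the corresponding pruned subtree.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    for alpha, impurity in zip(path.ccp_alphas, path.impurities):
        print(f"alpha = {alpha:.4f}   total leaf impurity = {impurity:.4f}")

    # Refit at a chosen alpha to obtain the pruned subtree T_alpha.
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[2]).fit(X, y)
    print("terminal nodes |T|:", pruned.get_n_leaves())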

Tuning Parameter (α)

α can be interpreted as the complexity cost per terminal node.

α determines the trade-off between the overall misclassification error and the model complexity:
- when α is small, the penalty for a larger tree is small, so $T_\alpha$ is large
- when α increases, $|T_\alpha|$ decreases

Weakest link pruning


Define $T_m$ as the branch of $T_i$ containing node m and its descendants.
When $T_i$ is pruned at node m:
- its misclassification cost increases by $R(m) - R(T_m)$, where

$$R(T) = \sum_{m=1}^{|T|} N_m\, i(m)$$

- and its complexity decreases by $|T_m| - 1$
- the ratio

$$g_i(m) = \frac{R(m) - R(T_m)}{|T_m| - 1}$$

measures the increase in misclassification cost per pruned terminal node

Weakest link pruning

$T_{i+1}$ is obtained by pruning all nodes in $T_i$ with the lowest value of $g_i(m)$, i.e. the weakest link
The α associated with $T_i$ is given by $\alpha_i = \min_m g_i(m)$
Starting with $T_0$, continue this pruning procedure until it reaches $T_I$ (the tree with only the root node)
Then CART identifies the optimal subtree from $\{T_i \mid i = 0, 1, \ldots, I\}$ by selecting
- the one with minimal classification error (0-SE rule)
- or the smallest tree within one standard error of the minimum error rate (1-SE rule)
- one approach is to use cross-validation to estimate the error
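A hedged sketch of the cross-validated selection step, again in Python/scikit-learn on iris rather than the slides' data; it scores each candidate α from the pruning sequence and applies the 0-SE rule (pick the α with the lowest CV error).

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

    # 10-fold cross-validated accuracy for every candidate alpha in the pruning sequence.
    scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=10).mean()
              for a in alphas]

    best = alphas[int(np.argmax(scores))]    # 0-SE rule: alpha with minimal CV error
    print(f"selected alpha = {best:.4f}, CV accuracy = {max(scores):.3f}")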

Weakest link pruning: Example

Spam data: Misclassification


4601 samples and 57 features (a two-class problem)
Training set: 80 percent of the data; the rest is used for testing

Spam data: Relative Error


Ten-fold cross-validation is used to find the best α

Regression Trees: Overview

Key differences:
The outcome variable is continuous
The criterion for splitting and pruning: squared error
The predicted value is computed by averaging the response values in a tree node

Impurity function: Squared Error


Suppose the feature space is partitioned into M regions $\{R_1, R_2, \ldots, R_M\}$ and the response in each region $R_m$ is represented by a constant $c_m$; then the regression model can be written as

$$f(x) = \sum_{m=1}^{M} c_m\, I(x \in R_m)$$

If we use the squared error as the minimization criterion,

$$\sum_{i=1}^{N} (y_i - f(x_i))^2,$$

then the best $\hat{c}_m$ is $\hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m)$
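A small illustrative sketch of the piecewise-constant model: given regions defined by a single split point on one feature, the fitted constant in each region is just the mean response there (the data and split point below are invented).

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
    y = np.array([1.1, 0.9, 1.0, 4.2, 3.8, 4.0])

    # Two regions R1 = {x <= 4} and R2 = {x > 4}; c_m is the average response in each.
    in_R1 = x <= 4.0
    c1, c2 = y[in_R1].mean(), y[~in_R1].mean()

    def f(x_new):
        """Piecewise-constant prediction f(x) = sum_m c_m I(x in R_m)."""
        return np.where(x_new <= 4.0, c1, c2)

    print(c1, c2)                            # 1.0 and 4.0
    print(f(np.array([2.5, 7.5])))           # [1. 4.]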


Greedy Algorithm
Start with all of the data $(x_i, y_i)_{i=1}^{N}$; consider a splitting variable j and a split point s, and define the regions

$$R_1(j, s) = \{X \mid X_j \leq s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}$$

Then the variables j and s can be found using the greedy criterion

$$\min_{j, s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$$

For any selected j and s, the inner minimization is solved by

$$\hat{c}_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j, s))$$
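For completeness, a sketch of the greedy split search under the squared-error criterion, mirroring the classification version earlier; best_regression_split is our own helper, not library code.

    import numpy as np

    def sse(y):
        """Sum of squared errors around the node mean c_m."""
        return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

    def best_regression_split(X, y):
        """Greedy search over features j and split points s minimizing the total SSE."""
        best = (None, None, np.inf)                  # (j, s, total SSE)
        for j in range(X.shape[1]):
            for s in np.unique(X[:, j])[:-1]:        # exclude the max so R2 is never empty
                left, right = y[X[:, j] <= s], y[X[:, j] > s]
                total = sse(left) + sse(right)
                if total < best[2]:
                    best = (j, s, total)
        return best

    X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
    y = np.array([1.1, 0.9, 1.0, 4.2, 3.8, 4.0])
    print(best_regression_split(X, y))               # (0, 3.0, ~0.10)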

Tree Pruning

Tree pruning can be done by weakest link pruning using the squared-error impurity function

$$i(m) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2$$

The cost-complexity criterion is similar to the classification case:

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m\, i(m) + \alpha|T|$$

Data set: cars


Data set: 60 data points and 8 features (training set = 2/3 of the data; the rest is for testing)

Tree structure’s sensitivity to the training set


The left tree is from training set 1 and the right tree is from set 2.

Summary
Advantages:
- CART makes no distributional assumptions on the variables and supports both categorical and continuous variables
- The binary tree structure offers excellent interpretability
- Can be used for ranking the variables (by summing the impurity reductions due to each variable across all nodes in the tree)
Disadvantages:
- Since CART uses a binary tree, it suffers from instability (small changes in the training set can produce very different trees)
- Splits are aligned with the axes of the feature space, which may be suboptimal

References

Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer, 1st edition.
Pham, H. (2006). Springer Handbook of Engineering Statistics. Springer, 1st edition.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, 2nd edition.
Therneau, T. M. and Atkinson, E. J. (1997). An Introduction to Recursive Partitioning Using the RPART Routines. Technical report, Mayo Foundation.
