Module 09: Tree-Based Methods
References
• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer.
• Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
Example: Construction of Regions
Fig: Regression tree for the Carseats data (response = Sales) and the corresponding partition of the predictor space into regions R1-R5.
Fig: Classification tree for the Carseats data (class = Yes if Sales > 8, else No).
Introduction
➢ Tree-based methods for regression and classification involve stratifying or segmenting the predictor space into a number of simple regions.
➢ In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.
➢ The set of splitting rules used to segment the predictor space can be summarized in a tree; such approaches are therefore known as decision tree methods.
➢ The goal of decision tree based methods is to create a model that can predict the value of a target variable based on several input variables.
Regression Trees
➢Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
➢A regression is a statistical technique that relates a dependent variable to one or more independent (explanatory) variables.
➢A regression tree is built through a process known as binary recursive partitioning.
➢Binary recursive partitioning is an iterative process that splits the data into partitions or branches, and then continues splitting each partition into smaller groups as the method moves down each branch.
➢A regression tree model divides the data into subsets using nodes, branches, and leaves; a minimal sketch of such a tree follows.
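As a rough illustration of binary recursive partitioning, the sketch below fits a small regression tree with scikit-learn on synthetic data and prints its internal nodes and leaves. The library, feature names, and depth are assumptions for illustration only, not part of the slides.

```python
# A rough sketch of binary recursive partitioning with scikit-learn's
# DecisionTreeRegressor; data, feature names, and depth are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                      # two predictors
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0, 0.1, 200)

# Each split divides one partition into two; the leaves are the final subsets.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["X1", "X2"]))       # nodes, branches, leaves
```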
Terminology
➢The regions formed after dividing the data
points are known as “terminal nodes” or
“leaves” of the tree.
➢The points along the tree where the predictor space is split are referred to as "internal nodes".
➢The segments of the tree that connect the nodes are termed "branches".
➢Building a regression tree involves the following two steps (a minimal sketch appears after this list):
1) Divide the predictor space, i.e., the set of possible values of $X_1, X_2, \dots, X_p$, into $J$ distinct and non-overlapping regions $R_1, R_2, \dots, R_J$.
2) For every observation that falls into region $R_j$, the prediction is simply the mean of the response values of the training observations in $R_j$.
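A minimal sketch of this two-step prediction rule, assuming a one-dimensional predictor and hand-chosen cutpoints chosen purely for illustration:

```python
# A minimal sketch of the two-step regression-tree prediction rule,
# with the regions given as boolean masks over a 1-D predictor (toy data).
import numpy as np

x = np.array([1.0, 2.0, 3.5, 5.0, 6.5, 8.0])
y = np.array([1.2, 1.0, 2.8, 3.1, 5.9, 6.2])

# Step 1: suppose the predictor space has been divided into regions
# R1 = {x < 3}, R2 = {3 <= x < 6}, R3 = {x >= 6} (illustrative cutpoints).
regions = [x < 3, (x >= 3) & (x < 6), x >= 6]

# Step 2: the prediction in each region is the mean training response there.
region_means = [y[mask].mean() for mask in regions]
print(region_means)   # [1.1, 2.95, 6.05]
```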
Construction of regions
➢The region could have any shape, but high-dimensional rectangles, or boxes are preferred
for simplicity and for ease of interpretation of the resulting predictive model.
➢The regions are created using a recursive binary splitting approach, which is top-down and greedy.
➢The goal is to find boxes $R_1, R_2, \dots, R_J$ that minimize the RSS (residual sum of squares):
$$\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} \left(y_i - \hat{y}_{R_j}\right)^2$$
➢Here $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box.
➢Observations in the same box are intended to be homogeneous, and observations in
different boxes should be as different as possible.
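A small sketch of the RSS criterion evaluated for a given partition, again with illustrative data and hand-chosen regions (the helper name partition_rss is an assumption, not from the slides):

```python
# A small sketch computing the RSS of a given partition (toy data).
import numpy as np

def partition_rss(y, region_masks):
    """RSS = sum over regions of squared deviations from the region mean."""
    return sum(((y[m] - y[m].mean()) ** 2).sum() for m in region_masks if m.any())

x = np.array([1.0, 2.0, 3.5, 5.0, 6.5, 8.0])
y = np.array([1.2, 1.0, 2.8, 3.1, 5.9, 6.2])
masks = [x < 3, (x >= 3) & (x < 6), x >= 6]
print(partition_rss(y, masks))   # 0.11 for this partition
```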
Recursive Binary Splitting
➢We first select the predictor $X_j$ and the cutpoint $s$ such that splitting the predictor space into the regions $\{X \mid X_j < s\}$ and $\{X \mid X_j \geq s\}$ leads to the greatest possible reduction in RSS.
➢For any $j$ and $s$, we define the pair of half-planes
$$R_1(j, s) = \{X \mid X_j < s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j \geq s\},$$
➢And we seek the values of $j$ and $s$ that minimize
$$\sum_{i:\, x_i \in R_1(j,s)} \left(y_i - \hat{y}_{R_1}\right)^2 + \sum_{i:\, x_i \in R_2(j,s)} \left(y_i - \hat{y}_{R_2}\right)^2$$
➢Next, repeat the process, looking for the best predictor and best cutpoint in order
to split the data further so as to minimize the RSS within each of the resulting
regions.
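The greedy search for the best pair (j, s) can be sketched as a brute-force scan over predictors and candidate cutpoints; the data and the helper name best_split are illustrative assumptions:

```python
# A minimal sketch of one greedy split: brute-force search over predictors j
# and cutpoints s for the pair (j, s) minimizing the two-region RSS.
import numpy as np

def best_split(X, y):
    best = (None, None, np.inf)                 # (j, s, rss)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = X[:, j] < s, X[:, j] >= s
            if not left.any() or not right.any():
                continue
            rss = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[right] - y[right].mean()) ** 2).sum())
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 2))
y = np.where(X[:, 0] < 5, 1.0, 4.0) + rng.normal(0, 0.2, 50)
print(best_split(X, y))   # should recover a cutpoint near X1 = 5
```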
Recursive Binary Splitting
➢However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions, so that the predictor space is now divided into three regions.
➢Again, look to split one of these three regions further, so as to minimize the RSS.
The process continues until a stopping criterion is reached; for instance, we may
continue until no region contains more than five observations.
➢Predict the response for a given test observation using the mean of the
training observations in the region to which that test observation belongs.
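A minimal scikit-learn sketch of growing the tree until the stopping criterion is met and then predicting a test observation by its region mean. The data are synthetic, and min_samples_split=6 is an assumed proxy for "keep splitting any region with more than five observations":

```python
# A minimal sketch: grow until no region that can be split contains more
# than five observations, then predict by the leaf (region) mean.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 300)

# min_samples_split=6 roughly mirrors "split any region with > 5 observations".
tree = DecisionTreeRegressor(min_samples_split=6, random_state=0).fit(X, y)

x_test = np.array([[2.0, 7.0, 1.0]])
print(tree.predict(x_test))   # mean training response in x_test's region
```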
Tree Pruning
➢Recursive binary splitting produces good predictions on the training set, but it is likely to overfit the data, leading to poor test-set performance.
➢A smaller tree with fewer splits (regions) might lead to lower variance and better interpretation at the cost of a little bias.
➢Fewer splits can be obtained by building the tree only so long as the decrease in the RSS due to each split exceeds some threshold. This strategy works, but it is too short-sighted, since a seemingly worthless split early on might be followed by a very good split later.
➢To overcome these issues, a better strategy is to grow a very large tree $T_0$ and then prune it back in order to obtain a subtree.
Shallow Tree after Pruning
Cost complexity pruning
➢It is also known as weakest link pruning.
➢We consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$. For each value of $\alpha$ there corresponds a subtree $T \subset T_0$ such that
$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2 + \alpha |T|$$
is as small as possible. Here $|T|$ indicates the number of terminal nodes of the tree $T$, $R_m$ is the region corresponding to the $m$th terminal node, and $\hat{y}_{R_m}$ is the mean of the training observations in $R_m$.
➢The tuning parameter 𝜶 controls a trade-off between the subtree’s complexity
and its fit to the training data.
➢Select an optimal value $\hat{\alpha}$ using cross-validation. Then return to the full data set and obtain the subtree corresponding to $\hat{\alpha}$.
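A minimal sketch of cost complexity pruning with scikit-learn, where ccp_alpha plays the role of the tuning parameter α and cost_complexity_pruning_path returns the sequence of effective α values. The data are synthetic and purely illustrative:

```python
# A minimal sketch of cost complexity pruning with scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, 300)

# Grow a large tree T0, then compute the sequence of effective alphas
# and the corresponding nested subtrees.
big_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
path = big_tree.cost_complexity_pruning_path(X, y)

subtrees = [DecisionTreeRegressor(ccp_alpha=a, random_state=0).fit(X, y)
            for a in path.ccp_alphas]
print([t.get_n_leaves() for t in subtrees])   # |T| shrinks as alpha grows
```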
Algorithm: Building a Regression Tree
1. Use recursive binary splitting to grow a large tree on the training data, stopping
only when each terminal node has fewer than some minimum number of
observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of
best subtrees, as a function of 𝛼.
3. Use K-fold cross-validation to choose 𝛼. For each 𝑘 = 1, . . . , 𝐾:
3.1. Repeat Steps 1 and 2 on all but the $k$th fold of the training data (i.e., on the remaining $\frac{K-1}{K}$ fraction of the observations).
3.2. Evaluate the mean squared prediction error on the data in the left-out 𝑘th
fold, as a function of α.
3.3. Average the results, and pick α to minimize the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of 𝛼.
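The full procedure can be sketched with scikit-learn by combining the pruning path with K-fold cross-validation over α; GridSearchCV with its default refit behavior carries out Steps 3 and 4. The data and settings are illustrative assumptions:

```python
# A minimal sketch of Algorithm steps 1-4 using scikit-learn (synthetic data).
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(400, 3))
y = 3.0 * np.log1p(X[:, 0]) - X[:, 2] + rng.normal(0, 0.4, 400)

# Steps 1-2: grow a large tree and obtain its sequence of effective alphas.
path = DecisionTreeRegressor(random_state=0).fit(X, y).cost_complexity_pruning_path(X, y)
alphas = np.maximum(path.ccp_alphas, 0.0)   # guard against tiny negative round-off

# Step 3: K-fold CV over alpha; Step 4: GridSearchCV refits on the full data.
cv = GridSearchCV(DecisionTreeRegressor(random_state=0),
                  {"ccp_alpha": alphas},
                  scoring="neg_mean_squared_error",
                  cv=KFold(n_splits=5, shuffle=True, random_state=0))
cv.fit(X, y)
print(cv.best_params_, cv.best_estimator_.get_n_leaves())
```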
Regression Tree with Hitters Data
Classification Trees
➢Classification tree is used to predict a qualitative response.
➢For a classification tree, we predict that each observation belongs to the most
commonly occurring class of training observations in the region to which it
belongs.
➢In interpreting a classification tree, we are interested not only in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region.
➢In classification trees, the classification error rate (CER) is used in place of the RSS.
➢The CER is the fraction of the training observations in a region that do not belong to the most common class:
$$E = 1 - \max_k \hat{p}_{mk}$$
Here $\hat{p}_{mk}$ represents the proportion of training observations in the $m$th region that are from the $k$th class.
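A small sketch of the classification error rate for a single region, using made-up class labels:

```python
# A small sketch of the classification error rate E = 1 - max_k p_mk
# for one region (illustrative class labels).
import numpy as np

node_labels = np.array(["Yes", "Yes", "Yes", "No", "No"])   # training obs in region m
_, counts = np.unique(node_labels, return_counts=True)
p_mk = counts / counts.sum()                                # class proportions
E = 1 - p_mk.max()                                          # = 1 - 3/5 = 0.4
print(p_mk, E)
```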
Classification Tree
Gini index
➢Classification error is not sufficiently sensitive for tree-growing, so two other
measures are preferable:
1) Gini index
2) Cross-entropy
➢The Gini index is defined as a measure of total variance across the $K$ classes:
$$G = \sum_{k=1}^{K} \hat{p}_{mk} \left(1 - \hat{p}_{mk}\right)$$
➢The Gini index takes on a small value if all of the $\hat{p}_{mk}$ are close to zero or one.
➢For this reason the Gini index is referred to as a measure of node purity — a
small value indicates that a node contains predominantly observations from a
single class.
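A minimal sketch of the Gini index computed from node class proportions (illustrative labels, not the slides' own example):

```python
# A minimal sketch of the Gini index for one node.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

print(gini(["Yes"] * 9 + ["No"] * 1))   # nearly pure node -> small Gini (0.18)
print(gini(["Yes"] * 5 + ["No"] * 5))   # evenly mixed node -> larger Gini (0.5)
```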
Cross-entropy
➢The cross-entropy is given by
$$D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$
➢Since $0 \leq \hat{p}_{mk} \leq 1$, the cross-entropy will take on a value near zero if the $\hat{p}_{mk}$ are all near zero or one.
➢Therefore, the Gini index and the cross-entropy are numerically quite similar; both take a small value if the node is pure.
➢When building a classification tree, either the Gini index or the cross-entropy is typically used to evaluate the quality of each split.
➢The lower the value of the Gini index or cross-entropy, the purer the node and, typically, the better the resulting split.
➢The classification error rate is preferable when prediction accuracy of the final pruned tree on the test set is the goal.
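A small sketch comparing the Gini index and cross-entropy on the same node proportions, and showing that both are available as splitting criteria in scikit-learn's DecisionTreeClassifier; the dataset is synthetic and purely illustrative:

```python
# Gini vs cross-entropy on the same class proportions, and the matching
# scikit-learn `criterion` options (synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def gini(p):    return np.sum(p * (1 - p))
def entropy(p): return -np.sum(p * np.log(p))

for p in (np.array([0.95, 0.05]), np.array([0.5, 0.5])):
    print(p, gini(p), entropy(p))   # both small for the nearly pure node

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
for crit in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=crit, random_state=0).fit(X, y)
    print(crit, clf.get_depth(), clf.get_n_leaves())
```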
Fig: Misclassification error as a function of tree size (pruning).
Boosting: the boosted model is a sum of $B$ shrunken trees, each fit sequentially,
$$\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^{b}(x)$$
Fig: Boosting applied to the Boston dataset.
Boosting Parameters
➢Boosting has three tuning parameters (a minimal sketch follows the list):
1) The number of trees $B$: boosting can overfit if $B$ is too large, so cross-validation is used to select $B$.
2) The shrinkage parameter $\lambda$: a small positive number (typically 0.01 or 0.001) that controls the rate at which boosting learns.
3) The number $d$ of splits in each tree, which controls the complexity of the boosted ensemble. Often $d = 1$ works well, in which case each tree is a stump, consisting of a single split.
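A minimal sketch of boosted regression trees with the three tuning parameters, using scikit-learn's GradientBoostingRegressor on synthetic data (this is not the slides' Boston analysis):

```python
# Boosted regression trees with B (n_estimators), lambda (learning_rate)
# and d (max_depth) as tuning parameters; synthetic data, illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(500, 4))
y = X[:, 0] ** 2 - 3.0 * X[:, 1] + rng.normal(0, 1.0, 500)

# d = 1 makes each tree a stump (a single split).
boost = GradientBoostingRegressor(n_estimators=1000,   # B
                                  learning_rate=0.01,  # lambda
                                  max_depth=1,         # d
                                  random_state=0)
scores = cross_val_score(boost, X, y, cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())   # CV error, used to guard against choosing B too large
```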
AdaBoost.M1 (Freund and Schapire, 1997)
• AdaBoost.M1 sequentially applies the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers $G_m(x)$, $m = 1, \dots, M$.
• The predictions from all of them are then combined through a weighted majority vote to produce the final prediction:
$$G(x) = \operatorname{sign}\left[\sum_{m=1}^{M} \alpha_m G_m(x)\right]$$
• Here $\alpha_1, \alpha_2, \dots, \alpha_M$ are computed by the boosting algorithm and weight the contribution of each respective $G_m(x)$.
• At step $m$, those observations that were misclassified by the classifier $G_{m-1}(x)$ induced at the previous step have their weights increased, whereas the weights are decreased for those that were classified correctly.
• Thus as iterations proceed, observations that are difficult to classify correctly receive
ever-increasing influence.
AdaBoost.M1 (Discrete AdaBoost for 2 classes)
1. Initialize the observation weights $w_i = 1/N$, $i = 1, 2, \dots, N$.
2. For $m = 1$ to $M$:
   (a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.
   (b) Compute $err_m = \dfrac{\sum_{i=1}^{N} w_i \, I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$.
   (c) Compute $\alpha_m = \log\left((1 - err_m)/err_m\right)$.
   (d) Set $w_i \leftarrow w_i \exp\left[\alpha_m \, I(y_i \neq G_m(x_i))\right]$ for $i = 1, 2, \dots, N$.
3. Output $G(x) = \operatorname{sign}\left[\sum_{m=1}^{M} \alpha_m G_m(x)\right]$.
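A minimal sketch of Discrete AdaBoost.M1 following the steps above, using scikit-learn decision stumps as the weak learner on synthetic data; the variable names and the small numerical guard on err_m are illustrative assumptions:

```python
# A minimal sketch of AdaBoost.M1 with decision stumps as the weak learner.
# Labels are coded as -1/+1 so that sign() gives the final weighted vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=5, random_state=0)
y = 2 * y01 - 1                      # recode classes to {-1, +1}

N, M = len(y), 50
w = np.full(N, 1.0 / N)              # 1. initialize observation weights
stumps, alphas = [], []
for m in range(M):                   # 2. for m = 1 to M
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # (a)
    miss = (stump.predict(X) != y).astype(float)
    err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)           # (b)
    alpha = np.log((1 - err) / err)                                         # (c)
    w = w * np.exp(alpha * miss)                                            # (d)
    stumps.append(stump)
    alphas.append(alpha)

# 3. weighted majority vote G(x) = sign(sum_m alpha_m * G_m(x))
votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training error:", np.mean(np.sign(votes) != y))
```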