
Tree-Based Methods

Reference
• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer.
• Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
Example: Construction of Regions

[Figure: Regression tree for the Carseats data (response = Sales) and the corresponding partition of the predictor space into regions R1–R5.]
[Figure: Classification tree for the Carseats data (class = Yes if Sales > 8, otherwise No).]
Introduction
➢ Tree-based methods for regression and classification involve stratifying or segmenting the predictor space into a number of simple regions.
➢ In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.
➢ Because the set of splitting rules used to segment the predictor space can be summarized in a tree, these approaches are known as decision tree methods.
➢ The goal of decision tree methods is to create a model that can predict the value of a target variable based on several input variables.
Regression Trees
➢Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
➢A regression is a statistical technique that relates a dependent variable to one or more independent (explanatory) variables.
➢A regression tree is built through a process known as binary recursive partitioning.
➢Binary recursive partitioning is an iterative process that splits the data into partitions or branches, and then continues splitting each partition into smaller groups as the method moves down each branch.
➢A regression tree model divides the data into subsets using nodes, branches, and leaves.
Terminology
➢The regions formed after dividing the data points are known as “terminal nodes” or “leaves” of the tree.
➢The points along the tree where the predictor space is split are referred to as “internal nodes”.
➢The segments of the tree that connect the nodes are termed “branches”.
➢Building a regression tree involves the following two steps:
1) Divide the predictor space (the set of possible values for X_1, X_2, ..., X_p) into J distinct and non-overlapping regions R_1, R_2, ..., R_J.
2) For every observation that falls into region R_j, the prediction is simply the mean of the response values for the training observations in R_j.
Construction of regions
➢The regions could in principle have any shape, but high-dimensional rectangles, or boxes, are preferred for simplicity and for ease of interpretation of the resulting predictive model.
➢The regions are created using a recursive binary splitting approach, which is top-down and greedy.
➢The goal is to find boxes R_1, R_2, ..., R_J that minimize the RSS (residual sum of squares):

RSS = \sum_{j=1}^{J} \sum_{i \in R_j} ( y_i - \hat{y}_{R_j} )^2

➢Here ŷ_{R_j} is the mean response of the training observations within the jth box.
➢Observations in the same box are intended to be homogeneous, while observations in different boxes should be as different as possible.
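As a concrete illustration of this criterion, the following sketch computes the RSS for a given assignment of training observations to boxes, using the box mean as the prediction. The function name and toy data are illustrative, not from the slides.

```python
import numpy as np

def partition_rss(y, region_ids):
    """RSS of a candidate partition: each box predicts the mean response of its
    own training observations.

    y          : 1-D array of responses.
    region_ids : 1-D array giving, for each observation, the index of the box R_j
                 it falls into.
    """
    rss = 0.0
    for j in np.unique(region_ids):
        y_j = y[region_ids == j]                 # responses of the observations in box R_j
        rss += np.sum((y_j - y_j.mean()) ** 2)   # contribution of box R_j to the RSS
    return rss

# Toy usage: two boxes, each predicted by its own mean.
y = np.array([1.0, 1.2, 0.9, 5.0, 5.3])
regions = np.array([0, 0, 0, 1, 1])
print(partition_rss(y, regions))
```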
Recursive Binary Splitting
➢We first select the predictor X_j and the cutpoint s such that splitting the predictor space into the regions {X | X_j < s} and {X | X_j ≥ s} leads to the greatest possible reduction in RSS.
➢For any j and s, we define the pair of half-planes

R_1(j, s) = {X | X_j < s} and R_2(j, s) = {X | X_j ≥ s}

➢and we seek the values of j and s that minimize

\sum_{i: x_i \in R_1(j,s)} ( y_i - \hat{y}_{R_1} )^2 + \sum_{i: x_i \in R_2(j,s)} ( y_i - \hat{y}_{R_2} )^2
➢Next, repeat the process, looking for the best predictor and best cutpoint in order
to split the data further so as to minimize the RSS within each of the resulting
regions.
Recursive Binary Splitting
➢However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions, so that the predictor space is now divided into three regions.
➢Again, look to split one of these three regions further, so as to minimize the RSS.
The process continues until a stopping criterion is reached; for instance, we may
continue until no region contains more than five observations.
➢Predict the response for a given test observation using the mean of the
training observations in the region to which that test observation belongs.
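The greedy search for the best split can be sketched as follows. This is a didactic brute-force implementation with illustrative names and synthetic data, not an optimized one.

```python
import numpy as np

def best_split(X, y):
    """Exhaustively search for the predictor j and cutpoint s that minimize the
    two-region RSS above (didactic brute force, not an optimized implementation)."""
    n, p = X.shape
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue                    # a split must create two non-empty regions
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X[:, 1] < 0.2, 1.0, 4.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))   # should recover predictor 1 and a cutpoint near 0.2
```

Applied recursively to the resulting regions, this search grows the full tree.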
Tree Pruning
➢Recursive binary splitting produces good predictions on the training set, but it is likely to overfit the data, leading to poor test-set performance.
➢A smaller tree with fewer splits (regions) might lead to lower variance and better interpretation at the cost of a little bias.
➢One option is to build the tree only so long as the decrease in RSS due to each split exceeds some threshold. This strategy yields smaller trees but is too short-sighted, since a seemingly worthless split early on may be followed by a very good split later.
➢To overcome these issues, a better strategy is to grow a very large tree T_0 and then prune it back in order to obtain a subtree.
[Figure: Shallow tree obtained after pruning]
Cost complexity pruning
➢It is also known as weakest link pruning.
➢We consider a sequence of trees indexed by a nonnegative tuning parameter α. For each value of α there corresponds a subtree T ⊂ T_0 such that

\sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} ( y_i - \hat{y}_{R_m} )^2 + \alpha |T|

is as small as possible. Here |T| indicates the number of terminal nodes of the tree T, R_m is the region corresponding to the mth terminal node, and ŷ_{R_m} is the mean of the training observations in R_m.
➢The tuning parameter α controls a trade-off between the subtree’s complexity and its fit to the training data.
➢Select an optimal value α̂ using cross-validation. Then return to the full data set and obtain the subtree corresponding to α̂.

Algorithm: Building a Regression Tree
1. Use recursive binary splitting to grow a large tree on the training data, stopping
only when each terminal node has fewer than some minimum number of
observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of
best subtrees, as a function of 𝛼.
3. Use K-fold cross-validation to choose α. For each k = 1, ..., K:
   3.1. Repeat Steps 1 and 2 on all but the kth fold of the training data (i.e., on the (K−1)/K fraction of the observations).
   3.2. Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
   3.3. Average the results over the folds, and pick α to minimize the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.
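One way to carry out these steps in practice is with scikit-learn's cost-complexity pruning support. This is only an illustrative sketch: the Hitters data used in the slides is not bundled with scikit-learn, so the bundled diabetes data stands in, and the coarse alpha grid is chosen purely for speed.

```python
import numpy as np
from sklearn.datasets import load_diabetes        # stand-in for the Hitters data
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Step 1: grow a large tree; Step 2: obtain its cost-complexity pruning path,
# i.e. the sequence of best subtrees indexed by alpha.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[::10]                     # coarse grid, purely for speed

# Step 3: K-fold cross-validation (K = 5) to choose alpha.
cv_mse = [
    -cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                     X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmin(cv_mse))]

# Step 4: refit on the full data set with the chosen alpha (the pruned subtree).
final_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, final_tree.get_n_leaves())
```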
[Figure: Regression tree fit to the Hitters data]
Classification Trees
➢A classification tree is used to predict a qualitative response.
➢For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
➢In interpreting a classification tree, we are interested both in the class prediction corresponding to a particular terminal node region and in the class proportions among the training observations that fall into that region.
➢In classification trees, the classification error rate (CER) is used in place of the RSS.
➢The CER is the fraction of the training observations in a region that do not belong to the most common class:

E = 1 - \max_k \hat{p}_{mk}

Here p̂_{mk} represents the proportion of training observations in the mth region that are from the kth class.
[Figure: Example classification tree]
Gini index
➢Classification error is not sufficiently sensitive for tree-growing, so two other measures are preferable:
1) Gini index
2) Cross-entropy
➢The Gini index is a measure of total variance across the K classes, defined by

G = \sum_{k=1}^{K} \hat{p}_{mk} ( 1 - \hat{p}_{mk} )

➢The Gini index takes on a small value if all of the p̂_{mk} are close to zero or one.
➢For this reason the Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.
Cross-entropy
➢The cross-entropy is given by

D = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}

➢Since 0 ≤ p̂_{mk} ≤ 1, the cross-entropy takes a value near zero if the p̂_{mk} are all near zero or one.
➢Therefore the Gini index and the cross-entropy are numerically quite similar; both take a small value if the node is pure.
➢When building and pruning the tree, either the Gini index or the cross-entropy is used to evaluate the quality of each split.
➢The lower the value of the Gini index or cross-entropy, the purer the resulting nodes.
➢The classification error rate is preferable when prediction accuracy on the test set is the goal.
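A small sketch of the three node measures just defined (classification error E, Gini index G, and cross-entropy D), computed from the class counts in a single node; the function name and example counts are illustrative.

```python
import numpy as np

def node_impurities(class_counts):
    """Classification error E, Gini index G, and cross-entropy D for one node,
    given the counts of training observations of each class in that node."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()                                   # the class proportions p_mk
    error = 1.0 - p.max()                             # E = 1 - max_k p_mk
    gini = np.sum(p * (1.0 - p))                      # G = sum_k p_mk (1 - p_mk)
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))    # D = -sum_k p_mk log p_mk
    return error, gini, entropy

print(node_impurities([45, 5]))   # nearly pure node: all three measures are small
print(node_impurities([25, 25]))  # maximally impure two-class node
```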
[Figure: Misclassification error versus tree size, for cost-complexity pruning of a classification tree on the Carseats data]
Advantages and Disadvantages of Trees
➢Advantages:
1. Trees are very easy to explain to people.
2. Some people believe that decision trees more closely mirror human decision-making than do other regression and classification approaches.
3. Trees can be displayed graphically, and are easily interpreted even by a non-expert
(especially if they are small).
4. Trees can easily handle qualitative predictors without the need to create dummy
variables.
➢Disadvantages:
1. Unfortunately, trees generally do not have the same level of predictive accuracy as
some of the other regression and classification approaches.
2. Trees can have high variance: a small change in the data can result in a very different tree.
➢To improve the predictive performance of trees, bagging, random forests, and boosting methods can be used.
Bagging (Breiman 1996)
➢Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method. It is frequently used in the context of decision trees.
➢For a given set of n independent observations Z_1, ..., Z_n, each with variance σ², the variance of the mean Z̄ of the observations is σ²/n. So averaging a set of observations reduces variance. This is not directly practical, because we generally do not have access to multiple training sets.
➢Instead, we can bootstrap, by taking repeated samples from the (single) training data set.
➢In this approach we generate B different bootstrapped training data sets, then train our method on the bth bootstrapped training set in order to get f̂^{*b}(x), the prediction at a point x. We then average all the predictions to obtain

\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)

This is called bagging.
➢Bagging provides improvements in accuracy by combining a large number of trees into a single procedure.
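A minimal sketch of bagging for regression trees, assuming scikit-learn's DecisionTreeRegressor as the base learner and synthetic data; in practice sklearn.ensemble.BaggingRegressor or RandomForestRegressor would normally be used instead of a hand-rolled loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def bagged_predict(X_train, y_train, X_test, B=100):
    """Average the predictions of B regression trees, each grown on a bootstrap sample."""
    n = len(y_train)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # bootstrap sample of the training data
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds[b] = tree.predict(X_test)        # f_hat^{*b}(x) for every test point
    return preds.mean(axis=0)                  # f_hat_bag(x) = (1/B) * sum_b f_hat^{*b}(x)

X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)
print(bagged_predict(X[:150], y[:150], X[150:])[:5])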
Bagging for classification tree
➢For classification trees, the prediction for each test observation is made as follows (see the sketch below):
1) record the class predicted by each of the B trees, and take a majority vote;
2) the overall prediction is the most commonly occurring class among the B predictions.
➢The value of B should be sufficiently large that the error has settled down.
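The majority vote in steps 1)–2) can be sketched as follows; majority_vote is a hypothetical helper, not a library function.

```python
import numpy as np

def majority_vote(class_predictions):
    """class_predictions has shape (B, n_test): one row of predicted labels per tree.
    The bagged prediction for each test observation is the most common class."""
    preds = np.asarray(class_predictions)
    winners = []
    for column in preds.T:                         # one test observation at a time
        labels, counts = np.unique(column, return_counts=True)
        winners.append(labels[np.argmax(counts)])  # most frequently predicted class
    return np.array(winners)

# Three trees, two test observations: the vote gives ["No", "Yes"].
print(majority_vote([["No", "Yes"], ["No", "No"], ["Yes", "Yes"]]))
```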
Out-of-Bag Error Estimation
➢On average, each bagged tree uses about two-thirds of the observations; the remaining one-third are referred to as the out-of-bag (OOB) observations.
➢The prediction for an OOB observation is obtained by averaging the responses (for regression) or taking a majority vote (for classification) among the roughly B/3 trees for which that observation was out of bag.
➢The OOB approach for estimating the test error is particularly convenient when performing bagging on large data sets, for which cross-validation would be computationally costly.
➢Bagging improves prediction accuracy at the expense of interpretability.
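A sketch of OOB error estimation under the same hand-rolled bagging setup as above (synthetic data, scikit-learn trees): for each observation, predictions are averaged only over the trees whose bootstrap sample did not include it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=300)

B, n = 200, len(y)
oob_sum = np.zeros(n)      # running sum of OOB predictions for each observation
oob_count = np.zeros(n)    # number of trees for which the observation was out of bag

for b in range(B):
    idx = rng.integers(0, n, size=n)                  # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    oob = np.setdiff1d(np.arange(n), idx)             # the ~n/3 observations left out
    oob_sum[oob] += tree.predict(X[oob])
    oob_count[oob] += 1

seen = oob_count > 0                                  # observations OOB at least once
oob_pred = oob_sum[seen] / oob_count[seen]            # average over the ~B/3 OOB trees
print("OOB estimate of test MSE:", np.mean((y[seen] - oob_pred) ** 2))
```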
[Figure: Bagging and random forests applied to the Heart data]
Random Forest (Breiman 1999)
➢Random forests (RF) are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.
➢Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees.
➢In a random forest, the number of predictors m considered at each split is approximately equal to the square root of the total number of predictors p (m ≈ √p).
➢The algorithm is therefore not allowed to consider a majority of the available predictors at each split in the tree.
➢On average, (p − m)/p of the splits will not even consider the strong predictor, which leads to a substantial reduction in variance over a single tree. This process is known as decorrelating the trees.
➢In RF, if m = p at each split, then this is simply bagging.
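With scikit-learn, the m ≈ √p rule corresponds to max_features="sqrt", and setting max_features=None (i.e., m = p) recovers bagging. The bundled breast-cancer data below merely stands in for the Heart and Boston examples referenced in the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in data set, p = 30 predictors

# Random forest: consider only m ~ sqrt(p) randomly chosen predictors at each split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

# Setting m = p (all predictors available at every split) reduces the forest to bagging.
bag = RandomForestClassifier(n_estimators=200, max_features=None, random_state=0)

print("random forest accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("bagging accuracy:      ", cross_val_score(bag, X, y, cv=5).mean())
```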
[Figure: Test error of a random forest on the Boston data]
[Figure: Boosting test error as a function of the number of trees]
Boosting
➢Boosting is similar to bagging, except that in bagging the trees are built independently of one another, whereas in boosting the trees are built sequentially, using information from previously grown trees.
➢Unlike fitting a single large tree to the data (fitting the data hard, and potentially overfitting), the boosting approach instead learns slowly.
➢In boosting, we fit a decision tree to the residuals from the current model, and then add this new decision tree into the fitted function in order to update the residuals.
➢Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm.
➢By fitting small trees to the residuals, we slowly improve f̂ in areas where it does not perform well. The shrinkage parameter λ slows the process down even further, allowing more and differently shaped trees to attack the residuals.
Algorithm: Boosting for Regression Trees
1. Set f̂(x) = 0 and r_i = y_i for all i in the training set.
2. For b = 1, 2, ..., B, repeat:
   a) Fit a tree f̂^b with d splits (d + 1 terminal nodes) to the training data (X, r).
   b) Update f̂ by adding in a shrunken version of the new tree:
      f̂(x) ← f̂(x) + λ f̂^b(x)
   c) Update the residuals:
      r_i ← r_i − λ f̂^b(x_i)
3. Output the boosted model:

   \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^{b}(x)
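A direct, didactic translation of the algorithm above into Python; the function names, the synthetic data, and the default values of B, d, and λ are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression_trees(X, y, B=500, d=1, lam=0.01):
    """Fit B small trees sequentially to the residuals, shrinking each by lam."""
    trees = []
    r = y.astype(float).copy()                      # 1. f_hat(x) = 0, r_i = y_i
    for b in range(B):                              # 2. for b = 1, ..., B
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)   # d splits -> d + 1 leaves
        tree.fit(X, r)                              # (a) fit f_hat^b to (X, r)
        r -= lam * tree.predict(X)                  # (c) update the residuals
        trees.append(tree)                          # (b) keep lam * f_hat^b in the model
    return trees

def boosted_predict(trees, X, lam=0.01):
    return lam * sum(t.predict(X) for t in trees)   # 3. f_hat(x) = sum_b lam * f_hat^b(x)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)
trees = boost_regression_trees(X, y)
print("training MSE:", np.mean((y - boosted_predict(trees, X)) ** 2))
```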
[Figure: Boosting applied to the Boston data]
Boosting Parameters
➢Boosting has three tuning parameters (see the sketch below):
1) The number of trees B: boosting can overfit if B is too large; cross-validation is used to select B.
2) The shrinkage parameter λ: a small positive number (typically 0.01 or 0.001), also known as the boosting learning rate.
3) The number of splits d in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, in which case each tree is a stump, consisting of a single split.
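In scikit-learn, these three parameters map onto n_estimators (B), learning_rate (λ), and max_depth (d) of GradientBoostingRegressor, which implements a closely related gradient-boosting procedure. The sketch below uses the bundled diabetes data as a stand-in for the Boston data, and a small grid purely for illustration.

```python
from sklearn.datasets import load_diabetes            # stand-in for the Boston data
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# B -> n_estimators, lambda -> learning_rate, d -> max_depth (d = 1 gives stumps).
grid = GridSearchCV(
    GradientBoostingRegressor(max_depth=1, random_state=0),
    param_grid={"n_estimators": [100, 500, 1000],
                "learning_rate": [0.001, 0.01, 0.1]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)    # cross-validation selects B and the learning rate
```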
AdaBoost.M1 (Freund and Schapire, 1997)
• AdaBoost.M1 sequentially applies a weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers G_m(x), m = 1, 2, ..., M.
• The predictions from all of them are then combined through a weighted majority vote to produce the final prediction:

G(x) = \mathrm{sign}\left[ \sum_{m=1}^{M} \alpha_m G_m(x) \right]

• Here α_1, α_2, ..., α_M are computed by the boosting algorithm and weight the contribution of each respective G_m(x).
• At step m, those observations that were misclassified by the classifier G_{m−1}(x) induced at the previous step have their weights increased.
• The weights are decreased for those that were classified correctly.
• Thus, as the iterations proceed, observations that are difficult to classify correctly receive ever-increasing influence.
AdaBoost.M1 (Discrete AdaBoost for two classes)

1. Initialize the observation weights w_i = 1/N, i = 1, 2, ..., N.
2. For m = 1 to M:
   (a) Fit a classifier G_m(x) to the training data using the weights w_i.
   (b) Compute the weighted error

       err_m = \frac{\sum_{i=1}^{N} w_i \, I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}

   (c) Compute α_m = log((1 − err_m)/err_m).
   (d) Set w_i ← w_i · exp[α_m · I(y_i ≠ G_m(x_i))] for i = 1, 2, ..., N.
3. Output G(x) = \mathrm{sign}\left[ \sum_{m=1}^{M} \alpha_m G_m(x) \right].

*Observations misclassified by G_m(x) have their weights scaled by a factor exp(α_m), increasing their relative influence in the next iteration.
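A compact sketch of discrete AdaBoost.M1 for labels in {−1, +1}, using a decision stump as the weak classifier G_m; the error-clipping guard and the synthetic data are my own additions for robustness and illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=100):
    """Discrete AdaBoost.M1 for labels y in {-1, +1}, with stumps as G_m."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # 1. initialize weights w_i = 1/N
    stumps, alphas = [], []
    for m in range(M):                            # 2. for m = 1, ..., M
        g = DecisionTreeClassifier(max_depth=1)
        g.fit(X, y, sample_weight=w)              # (a) fit G_m using the weights w_i
        miss = g.predict(X) != y                  # indicator I(y_i != G_m(x_i))
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)   # (b)
        alpha = np.log((1 - err) / err)           # (c) alpha_m
        w = w * np.exp(alpha * miss)              # (d) up-weight the misclassified points
        stumps.append(g)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    votes = sum(a * g.predict(X) for a, g in zip(alphas, stumps))
    return np.sign(votes)                         # 3. G(x) = sign[sum_m alpha_m G_m(x)]

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = np.where(X[:, 0] + X[:, 1] ** 2 > 1.0, 1, -1)     # nonlinear, additive boundary
stumps, alphas = adaboost_m1(X, y)
print("training accuracy:", np.mean(adaboost_predict(stumps, alphas, X) == y))
```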
