Lecture 15: Decision Trees
• Motivation
• Decision Trees
• Classification Trees
• Splitting Criteria
• Regression Trees
Recall:
Logistic regression works best for building classification boundaries when:
• the classes are well-separated in the feature space
• the classification boundary has a nice (simple) geometry
Recall: the decision boundary of logistic regression is where
$$P(Y=1) = 1 - P(Y=1) \;\Longrightarrow\; P(Y=1) = 0.5 \;\Longrightarrow\; \mathbf{x}\beta = 0.$$
Question: Can you guess the equation that defines the decision boundary below?
$$-0.8\,x_1 + x_2 = 0 \;\Longrightarrow\; x_2 = 0.8\,x_1 \;\Longrightarrow\; \text{Latitude} = 0.8\,\text{Longitude}$$
Question: Or these?
Notice that in all of the datasets the classes are still well-separated in the feature
space, but the decision boundaries cannot easily be described by single equations:
While logistic regression models with linear boundaries are intuitive to interpret
by examining the impact of each predictor on the log-odds of a positive
classification, it is less straightforward to interpret nonlinear decision boundaries
in context:
$$(x_1 + 2x_2) - x_1^2 + 10 = 0$$
People in every walk of life have long used interpretable models, such as simple flow charts, to differentiate between classes of objects and phenomena:
It turns out that the simple flow charts in our examples can be formulated as
mathematical models for classification, and these models have the properties we
desire; they:
1. are interpretable by humans
2. have sufficiently complex decision boundaries
3. have decision boundaries that are locally linear, so each component of the
boundary is simple to describe mathematically
A flow chart whose graph is a tree (connected, with no cycles) represents a model
called a decision tree.
Formally, a decision tree model is one in which the final outcome of the model is
based on a series of comparisons of the values of predictors against threshold
values.
Every flow chart tree corresponds to a partition of the feature space by axis-aligned
lines or (hyper)planes. Conversely, every such partition can be written as
a flow chart tree.
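For instance, a tiny decision tree can be written directly as nested threshold comparisons. This is only a sketch; the predictors (width, height), the thresholds, and the class names below are made up for illustration:

```python
# A decision tree is a nested series of comparisons of predictor values
# against thresholds; each comparison is an axis-aligned cut of the feature space.
# All names and numbers here are illustrative, not from the lecture's dataset.

def classify(width, height):
    """Return a class label by walking the flow chart from root to leaf."""
    if width <= 1.05:           # first split: cut on 'width'
        if width <= 0.725:      # second split: another cut on 'width'
            return "class A"
        return "class B"
    else:
        if height <= 2.0:       # split on a different predictor
            return "class A"
        return "class B"

print(classify(width=0.6, height=1.4))   # -> "class A"
```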
Learning the smallest 'optimal' decision tree for any given set of data
is NP-complete for numerous simple definitions of 'optimal'. Instead,
we will seek a reasonably good model using a greedy algorithm.
Depending on the encoding, the splits we can optimize over can be different!
We can now try to find the predictor $j$ and the threshold $t_j$ that minimize
the average classification error over the two regions, weighted by the
population of the regions:

$$\min_{j,\,t_j}\left\{\frac{N_1}{N}\,\mathrm{Error}(1 \mid j, t_j) + \frac{N_2}{N}\,\mathrm{Error}(2 \mid j, t_j)\right\}$$

where $N_i$ is the number of training points inside region $R_i$.
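A minimal sketch of this greedy search in Python/NumPy, assuming a feature matrix X, non-negative integer class labels y, and misclassification error as the per-region metric:

```python
import numpy as np

def misclassification_error(y):
    """Fraction of points in a region that are not in the region's majority class."""
    if len(y) == 0:
        return 0.0
    counts = np.bincount(y)
    return 1.0 - counts.max() / len(y)

def best_split(X, y):
    """Greedy search over every predictor j and threshold t_j,
    minimizing the population-weighted classification error."""
    N, J = X.shape
    best = (None, None, np.inf)          # (j, t_j, weighted error)
    for j in range(J):
        # candidate thresholds: midpoints between consecutive sorted unique values
        values = np.unique(X[:, j])
        thresholds = (values[:-1] + values[1:]) / 2
        for t in thresholds:
            left = X[:, j] <= t
            right = ~left
            score = (left.sum() / N) * misclassification_error(y[left]) \
                  + (right.sum() / N) * misclassification_error(y[right])
            if score < best[2]:
                best = (j, t, score)
    return best

# toy data: two predictors, binary labels
X = np.array([[0.5, 1.0], [0.7, 2.0], [1.2, 0.5], [1.5, 1.8]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # splits on predictor 0 near t = 0.95 with error 0
```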
Gini Index
Suppose we have $J$ predictors, $N$ training points, and $K$ classes.
Suppose we select the $j$-th predictor and split a region containing $N$ training
points along the threshold $t_j \in \mathbb{R}$.
We can assess the quality of this split by measuring the purity of each
newly created region, $R_1$ and $R_2$. This metric is called the Gini Index:
$$\mathrm{Gini}(i \mid j, t_j) = 1 - \sum_k p(k \mid R_i)^2$$
Question: What is the effect of squaring the proportions of each class?
What is the effect of summing the squared proportions of classes within
each region?
Example

       Class 1   Class 2   Gini(i | j, t_j)
R1     0         6         1 - ((6/6)^2 + (0/6)^2) = 0
R2     5         8         1 - ((5/13)^2 + (8/13)^2) = 80/169
We can now try to find the predictor $j$ and the threshold $t_j$ that
minimize the average Gini Index over the two regions, weighted by
the population of the regions:

$$\min_{j,\,t_j}\left\{\frac{N_1}{N}\,\mathrm{Gini}(1 \mid j, t_j) + \frac{N_2}{N}\,\mathrm{Gini}(2 \mid j, t_j)\right\}$$
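A small sketch that reproduces the Gini computations from the example above and scores the split by the weighted average, assuming R1 holds 6 training points and R2 holds 13:

```python
import numpy as np

def gini(counts):
    """Gini index of a region from its per-class counts: 1 - sum_k p(k|R)^2."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

g1 = gini([0, 6])    # 0.0        (pure region R1)
g2 = gini([5, 8])    # 80/169 ≈ 0.473  (mixed region R2)

# weighted average used to score the split (N1 = 6, N2 = 13, N = 19)
N1, N2 = 6, 13
score = (N1 / (N1 + N2)) * g1 + (N2 / (N1 + N2)) * g2
print(g1, g2, score)
```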
The last metric for evaluating the quality of a split is motivated by metrics of
uncertainty in information theory.
Ideally, our decision tree should split the feature space into regions such that each
region represents a single class. In practice, the training points in each region are
distributed over multiple classes, e.g.:

       Class 1   Class 2
R1     1         6
R2     5         6

We measure this spread with the entropy of the class distribution within each region:

$$\mathrm{Entropy}(i \mid j, t_j) = -\sum_k p(k \mid R_i)\,\log_2 p(k \mid R_i)$$
Example

       Class 1   Class 2   Entropy(i | j, t_j)
R1     0         6         -((6/6) log_2(6/6) + (0/6) log_2(0/6)) = 0
R2     5         8         -((5/13) log_2(5/13) + (8/13) log_2(8/13)) ≈ 0.96
We can now try to find the predictor $j$ and the threshold $t_j$ that
minimize the average entropy over the two regions, weighted by the
population of the regions:

$$\min_{j,\,t_j}\left\{\frac{N_1}{N}\,\mathrm{Entropy}(1 \mid j, t_j) + \frac{N_2}{N}\,\mathrm{Entropy}(2 \mid j, t_j)\right\}$$
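A short sketch reproducing the entropy computations above, using the convention 0 · log 0 = 0 for empty classes:

```python
import numpy as np

def entropy(counts):
    """Entropy of a region from its per-class counts:
    -sum_k p(k|R) log2 p(k|R), with 0*log(0) treated as 0."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                      # drop empty classes so 0*log(0) -> 0
    return -np.sum(p * np.log2(p))

print(entropy([0, 6]))                # 0.0   (pure region R1)
print(entropy([5, 8]))                # ≈ 0.96 (mixed region R2)

# weighted average used to score the split (N1 = 6, N2 = 13)
print((6/19) * entropy([0, 6]) + (13/19) * entropy([5, 8]))
```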
Recall our intuitive guidelines for splitting criteria: which of the three
criteria best fits them?
We have the following comparison of the values of the three criteria at
different levels of purity (from 0 to 1) in a single region (for binary
outcomes).
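The comparison can be reproduced numerically. The sketch below evaluates the three criteria on a grid of class-1 proportions p for a binary outcome; the grid itself is arbitrary:

```python
import numpy as np

# Value of each splitting criterion for a single two-class region,
# as a function of the proportion p of class 1 in the region.
p = np.linspace(0.001, 0.999, 9)

error   = 1 - np.maximum(p, 1 - p)                      # classification error
gini    = 1 - (p ** 2 + (1 - p) ** 2)                   # Gini index
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # entropy

for row in zip(p, error, gini, entropy):
    print("p=%.3f  error=%.3f  gini=%.3f  entropy=%.3f" % row)
```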
To note that entropy penalizes impurity the most is not to say that it
is the best splitting criterion. For one, a model with purer leaf nodes
on the training set may not perform better on the test set.
To decide whether a split is worthwhile, we compute the gain from the split,

$$\mathrm{Gain}(R) = \Delta(R) = m(R) - \frac{N_1}{N}\,m(R_1) - \frac{N_2}{N}\,m(R_2)$$

where $m$ is a metric like the Gini Index or entropy. Don't split if the
gain is less than some pre-defined threshold
(min_impurity_decrease).
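In scikit-learn, for instance, this stopping rule corresponds to the min_impurity_decrease parameter of DecisionTreeClassifier. The sketch below uses a synthetic dataset, and the threshold value 0.01 is arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic data just for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Stop splitting a node when the weighted impurity gain falls below the threshold.
tree = DecisionTreeClassifier(
    criterion="gini",              # or "entropy"
    min_impurity_decrease=0.01,    # don't split unless the gain exceeds this
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```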
Alternative to Using Stopping Conditions
[Figure: a decision tree splitting on width ≤ 1.05 in and then width ≤ 0.725 in, comparing a Simple Tree, a tree grown with Early Stopping, and the Full Tree.]
PRUNING
Rather than preventing a complex tree from growing, we can obtain a simpler
tree by ‘pruning’ a complex one.
There are many methods of pruning; a common one is cost complexity pruning,
whereby we select, from an array of smaller subtrees of the full model, the one
that optimizes a balance of performance and complexity.
$$C(T) = \mathrm{Error}(T) + \alpha\,|T|$$

where $T$ is a decision (sub)tree, $|T|$ is the number of leaves in the tree, and $\alpha$ is
the parameter penalizing model complexity.
1. Fix $\alpha$.
2. Find the subtree $T$ that minimizes $C(T)$.
3. Repeat over a range of values of $\alpha$ and choose the final $\alpha$ (and subtree) by cross-validation.
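One way to carry this out in practice is scikit-learn's cost complexity pruning interface (cost_complexity_pruning_path and ccp_alpha). The dataset and cross-validation setup below are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic data just for illustration
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Candidate alphas: the effective values of alpha at which successive
# subtrees of the full tree become optimal for C(T) = Error(T) + alpha*|T|.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# For each alpha, fit the pruned tree and estimate its performance.
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"alpha={alpha:.4f}  cv accuracy={score:.3f}")
```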
Regression Trees
Questions to consider:
• What would be a reasonable loss function?
• How would you determine any splitting criteria?
• How would you perform prediction in each leaf?
With just two modifications, we can use a decision tree model for regression:
1. The three splitting criteria we've examined each promoted splits that produced pure regions -
new regions increasingly specialized in a single class.
A. For classification, purity of the regions is a good indicator of the performance of the
model.
B. For regression, we want to select a splitting criterion that promotes splits that
improve the predictive accuracy of the model as measured by, say, the MSE.
2. For regression with output in ℝ, we want to label each region in the model with a
real number - typically the average of the output values of the training points
contained in the region.
$$\mathrm{Gain}(R) = \Delta(R) = \mathrm{MSE}(R) - \frac{N_1}{N}\,\mathrm{MSE}(R_1) - \frac{N_2}{N}\,\mathrm{MSE}(R_2)$$
and stop growing the tree when the gain is less than some pre-defined threshold.
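A minimal regression-tree sketch with scikit-learn, using the MSE ("squared_error") splitting criterion and a small min_impurity_decrease as the gain threshold; the data are synthetic and the parameter values arbitrary:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy 1-D regression problem
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# Splits are chosen to reduce MSE; each leaf predicts the mean of the
# training responses that fall in its region.
tree = DecisionTreeRegressor(
    criterion="squared_error",     # MSE criterion ("mse" in older scikit-learn)
    max_depth=3,
    min_impurity_decrease=0.001,   # stop when the MSE gain is too small
)
tree.fit(X, y)
print(tree.predict([[1.0], [4.0]]))   # piecewise-constant predictions
```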