
Lecture 15: Decision Trees

CS109A Introduction to Data Science


Pavlos Protopapas, Kevin Rader and Chris Tanner
Outline

• Motivation

• Decision Trees

• Classification Trees

• Splitting Criteria

• Stopping Conditions & Pruning

• Regression Trees

CS109A, PROTOPAPAS, RADER, TANNER 2


Geometry of Data

Recall: logistic regression for building classification boundaries works best when:
- the classes are well-separated in the feature space,
- there is a nice geometry to the classification boundary.

CS109A, PROTOPAPAS, RADER, TANNER 3


Geometry of Data

Recall:

the decision boundary is defined where the probabilities of being in class 1 and class 0 are equal, i.e.

P(Y = 1) = 1 − P(Y = 1)  ⟹  P(Y = 1) = 0.5,

which is equivalent to when the log-odds = 0:

xβ = 0.

This equation defines a line or a hyperplane. It can be generalized with higher-order polynomial terms.

CS109A, PROTOPAPAS, RADER, TANNER 4


Geometry of Data

Question: Can you guess the equation that defines the decision boundary below?
−0.8 x1 + x2 = 0  ⟹  x2 = 0.8 x1  ⟹  Latitude = 0.8 × Longitude

CS109A, PROTOPAPAS, RADER, TANNER 5


Geometry of Data

Question: How about these?

CS109A, PROTOPAPAS, RADER, TANNER 6


Geometry of Data

Question: Or these?

CS109A, PROTOPAPAS, RADER, TANNER 7


Geometry of Data

Notice that in all of the datasets the classes are still well-separated in the feature
space, but the decision boundaries cannot easily be described by single equations:

CS109A, PROTOPAPAS, RADER, TANNER 8


Geometry of Data

While logistic regression models with linear boundaries are intuitive to interpret
by examining the impact of each predictor on the log-odds of a positive
classification, it is less straightforward to interpret nonlinear decision boundaries
in context:

(x1 + 2x2) − x1x2 + 10 = 0

It would be desirable to build models that:


1. allow for complex decision boundaries.
2. are also easy to interpret.

CS109A, PROTOPAPAS, RADER, TANNER 9


Interpretable Models

People in every walk of life have long been using interpretable models for
differentiating between classes of objects and phenomena:

CS109A, PROTOPAPAS, RADER, TANNER 10


Interpretable Models (cont.)

Or in the [inferential] data analysis world:

CS109A, PROTOPAPAS, RADER, TANNER 11


Decision Trees

It turns out that the simple flow charts in our examples can be formulated as mathematical models for classification, and these models have the properties we desire; they are:
1. interpretable by humans,
2. capable of sufficiently complex decision boundaries,
3. built from locally linear decision boundaries, so each component of the decision boundary is simple to describe mathematically.

CS109A, PROTOPAPAS, RADER, TANNER 12


Decision Trees

CS109A, PROTOPAPAS, RADER, TANNER 13


The Geometry of Flow Charts

Flow charts whose graph is a tree (connected and with no cycles) represent a model called a decision tree.

Formally, a decision tree model is one in which the final outcome of the model is
based on a series of comparisons of the values of predictors against threshold
values.

In a graphical representation (flow chart),

• the internal nodes of the tree represent attribute testing.

• branching in the next level is determined by attribute value (yes/no).

• terminal leaf nodes represent class assignments.

CS109A, PROTOPAPAS, RADER, TANNER 14


The Geometry of Flow Charts

Flow charts whose graph is a tree (connected and with no cycles) represent a model called a decision tree.

Formally, a decision tree model is one in which the final outcome of the model is based on a series of comparisons of the values of predictors against threshold values.

CS109A, PROTOPAPAS, RADER, TANNER 15


The Geometry of Flow Charts

Every flow chart tree corresponds to a partition of the feature space by axis-aligned lines or (hyper)planes. Conversely, every such partition can be written as a flow chart tree.

CS109A, PROTOPAPAS, RADER, TANNER 16


The Geometry of Flow Charts
Each comparison and branching represents splitting a region in the
feature space on a single feature. Typically, at each iteration, we split
once along one dimension (one predictor). Why?

CS109A, PROTOPAPAS, RADER, TANNER 17


Learning the Model

Given a training set, learning a decision tree model for binary classification means:

• producing an optimal partition of the feature space with axis-aligned linear boundaries (very interpretable!),

• predicting, for each region, the class label of the largest class among the training points in that region (the Bayes' classifier).

CS109A, PROTOPAPAS, RADER, TANNER 18


Learning the Model

Learning the smallest 'optimal' decision tree for any given set of data is NP-complete for numerous simple definitions of 'optimal'. Instead, we will seek a reasonably good model using a greedy algorithm.

1. Start with an empty decision tree (undivided feature space).

2. Choose the 'optimal' predictor on which to split and choose the 'optimal' threshold value for splitting.

3. Recurse on each new node until a stopping condition is met.

Now, we need only define our splitting criterion and stopping condition.
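A minimal sketch of this greedy procedure (the helper names, the Gini criterion, and the exhaustive threshold search are illustrative assumptions, not the exact routine of any particular library):

```python
import numpy as np

def gini(y):
    """Gini index of the labels in one region."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Search all predictors and thresholds for the split that minimizes
    the weighted impurity of the two child regions."""
    n, d = X.shape
    best = None  # (score, feature j, threshold t)
    for j in range(d):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue  # skip trivial splits that would leave a region empty
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Greedy, recursive tree construction with a simple depth-based stop."""
    if depth == max_depth or len(np.unique(y)) == 1 or best_split(X, y) is None:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}   # predict the majority class
    _, j, t = best_split(X, y)
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}
```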
CS109A, PROTOPAPAS, RADER, TANNER 19
Numerical vs Categorical Attributes
Note that the 'compare and branch' method by which we defined classification trees works well for numerical features.

However, if a feature is categorical (with more than two possible values), comparisons like feature < threshold do not make sense.

How can we handle this?

A simple solution is to encode the values of a categorical feature using numbers and treat this feature like a numerical variable. This is indeed what some computational libraries (e.g. sklearn) do; however, this method has drawbacks.
CS109A, PROTOPAPAS, RADER, TANNER 20
Numerical vs Categorical Attributes
Example
Suppose the feature we want to split on is color, and the values are Red, Blue and Yellow. If we encode the categories numerically as:

Red = 0, Blue = 1, Yellow = 2

Then the possible non-trivial splits on color are

{{Red}, {Blue, Yellow}} {{Red, Blue},{Yellow}}

But if we encode the categories numerically as:

Red = 2, Blue = 0, Yellow = 1

The possible splits are

{{Blue}, {Yellow, Red}} {{Blue,Yellow}, {Red}}

Depending on the encoding, the splits we can optimize over can be different!
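A small sketch of this effect (the helper function and the two encodings are purely illustrative): under an ordinal encoding, a single threshold comparison can only separate the categories into a 'left' and a 'right' group along the chosen integer order.

```python
def threshold_splits(encoding):
    """List the non-trivial partitions reachable by a single
    'encoded value <= t' comparison under a given encoding."""
    ordered = sorted(encoding, key=encoding.get)   # categories in encoded order
    return [(set(ordered[:i + 1]), set(ordered[i + 1:]))
            for i in range(len(ordered) - 1)]

# Red = 0, Blue = 1, Yellow = 2 -> {Red} vs {Blue, Yellow} and {Red, Blue} vs {Yellow}
print(threshold_splits({"Red": 0, "Blue": 1, "Yellow": 2}))

# Red = 2, Blue = 0, Yellow = 1 -> {Blue} vs {Yellow, Red} and {Blue, Yellow} vs {Red}
print(threshold_splits({"Red": 2, "Blue": 0, "Yellow": 1}))
```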

CS109A, PROTOPAPAS, RADER, TANNER 21


Numerical vs Categorical Attributes

In practice, the effect of our choice of naive encoding of categorical variables is often negligible: models resulting from different choices of encoding will perform comparably.

In cases where you might worry about encoding, there is a more sophisticated way to numerically encode the values of categorical variables so that one can optimize over all possible partitions of the values of the variable.

This more principled encoding scheme is computationally more expensive but is implemented in a number of computational libraries (e.g. R's randomForest).
CS109A, PROTOPAPAS, RADER, TANNER 22
Splitting Criteria

CS109A, PROTOPAPAS, RADER, TANNER 23


Optimality of Splitting

While there is no 'correct' way to define an optimal split, there are some common-sense guidelines for every splitting criterion:
• the regions in the feature space should grow progressively more pure with the number of splits. That is, we should see each region 'specialize' towards a single class.
• the fitness metric of a split should take a differentiable form (making optimization possible).
• we shouldn't end up with empty regions, i.e. regions containing no training points.

CS109A, PROTOPAPAS, RADER, TANNER 24


Classification Error

Suppose we have J predictors and K classes.

Suppose we select the jth predictor and split a region containing N training points along the threshold tj ∈ ℝ.

We can assess the quality of this split by measuring the classification error made by each newly created region, R1, R2:

Error(i|j, tj) = 1 − max_k p(k|Ri)

where p(k|Ri) is the proportion of training points in Ri that are labeled class k.
CS109A, PROTOPAPAS, RADER, TANNER 25


Classification Error

We can now try to find the predictor j and the threshold tj that minimize the average classification error over the two regions, weighted by the population of the regions:

min_{j, tj}  (N1/N) Error(1|j, tj) + (N2/N) Error(2|j, tj)

where Ni is the number of training points inside region Ri.
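As an illustrative sketch (the function names and the brute-force midpoint search are assumptions, not the lecture's exact routine), the weighted classification error of a candidate split can be computed and minimized like this:

```python
import numpy as np

def classification_error(y):
    """1 - max_k p(k|R): error of predicting the majority class in a region."""
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

def weighted_split_error(x_j, y, t):
    """Average classification error of the regions created by x_j <= t,
    weighted by the number of training points in each region."""
    left = x_j <= t
    n, n1, n2 = len(y), left.sum(), (~left).sum()
    return (n1 / n) * classification_error(y[left]) + \
           (n2 / n) * classification_error(y[~left])

# Example: minimize over candidate thresholds for a single predictor
x_j = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y   = np.array([0,   0,   1,   1,   1])
thresholds = (x_j[:-1] + x_j[1:]) / 2          # midpoints between sorted values
best_t = min(thresholds, key=lambda t: weighted_split_error(x_j, y, t))
print(best_t)                                   # 2.5 separates the two classes perfectly
```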
CS109A, PROTOPAPAS, RADER, TANNER 26
Gini Index
Suppose we have J predictors, N training points and K classes.

Suppose we select the jth predictor and split a region containing N training points along the threshold tj ∈ ℝ.

We can assess the quality of this split by measuring the purity of each newly created region, R1, R2. This metric is called the Gini Index:

Gini(i|j, tj) = 1 − Σ_k p(k|Ri)²

Question: What is the effect of squaring the proportions of each class? What is the effect of summing the squared proportions of classes within each region?

CS109A, PROTOPAPAS, RADER, TANNER 27


Gini Index

Example

        Class 1   Class 2   Gini(i|j, tj)
R1      0         6         1 − ((6/6)² + (0/6)²) = 0
R2      5         8         1 − ((5/13)² + (8/13)²) = 80/169

We can now try to find the predictor j and the threshold tj that minimize the average Gini Index over the two regions, weighted by the population of the regions:

min_{j, tj}  (N1/N) Gini(1|j, tj) + (N2/N) Gini(2|j, tj)

where Ni is the number of training points inside region Ri.
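A quick sketch (the helper name is an assumption) that reproduces the table above from per-class counts:

```python
import numpy as np

def gini_from_counts(counts):
    """Gini index 1 - sum_k p(k|R)^2 computed from per-class counts in a region."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_from_counts([0, 6]))   # 0.0 (pure region)
print(gini_from_counts([5, 8]))   # 0.473... = 80/169
```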


CS109A, PROTOPAPAS, RADER, TANNER 28
Information Theory

The last metric for evaluating the quality of a split is motivated by metrics of uncertainty in information theory.

Ideally, our decision tree should split the feature space into regions such that each region represents a single class. In practice, the training points in each region are distributed over multiple classes, e.g.:

        Class 1   Class 2
R1      1         6
R2      5         6

However, though both are imperfect, R1 is clearly sending a stronger 'signal' for a single class (Class 2) than R2.

CS109A, PROTOPAPAS, RADER, TANNER 29


Information Theory
One way to quantify the strength of a signal in a particular region is to analyze the distribution of classes within the region. We compute the entropy of this distribution.

For a random variable with a discrete distribution, the entropy is computed by:

H(X) = − Σ_{x∈X} p(x) log2 p(x)

Higher entropy means the distribution is uniform-like (flat histogram) and thus values sampled from it are 'less predictable' (all possible values are equally probable).

Lower entropy means the distribution has more defined peaks and valleys and thus values sampled from it are 'more predictable' (values around the peaks are more probable).

CS109A, PROTOPAPAS, RADER, TANNER 30


Entropy
Suppose we have J predictors, N training points and K classes.

Suppose we select the jth predictor and split a region containing N training points along the threshold tj ∈ ℝ.

We can assess the quality of this split by measuring the entropy of the class distribution in each newly created region, R1, R2:

Entropy(i|j, tj) = − Σ_k p(k|Ri) log2 p(k|Ri)

Note: we are actually computing the conditional entropy of the distribution of training points amongst the K classes given that the point is in region i.

CS109A, PROTOPAPAS, RADER, TANNER 31


Entropy

Example (using the convention 0 · log2 0 = 0)

        Class 1   Class 2   Entropy(i|j, tj)
R1      0         6         −((6/6) log2(6/6) + (0/6) log2(0/6)) = 0
R2      5         8         −((5/13) log2(5/13) + (8/13) log2(8/13)) ≈ 0.96

We can now try to find the predictor j and the threshold tj that minimize the average entropy over the two regions, weighted by the population of the regions:

min_{j, tj}  (N1/N) Entropy(1|j, tj) + (N2/N) Entropy(2|j, tj)
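A short sketch (helper name assumed) that checks these entries numerically:

```python
import numpy as np

def entropy_from_counts(counts):
    """Entropy -sum_k p(k|R) log2 p(k|R) from per-class counts (0 log 0 := 0)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # drop empty classes so log2 is defined
    return -np.sum(p * np.log2(p))

print(entropy_from_counts([0, 6]))    # 0.0 (pure region)
print(entropy_from_counts([5, 8]))    # about 0.96 bits
```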

CS109A, PROTOPAPAS, RADER, TANNER 32


Comparison of Criteria

Recall our intuitive guidelines for splitting criteria; which of the three criteria fits our guidelines best?
We have the following comparison of the values of the three criteria at different levels of purity (from 0 to 1) in a single region (for binary outcomes).

CS109A, PROTOPAPAS, RADER, TANNER 33


Comparison of Criteria

Recall our intuitive guidelines for splitting criteria; which of the three criteria fits our guidelines best?

To note that entropy penalizes impurity the most is not to say that it is the best splitting criterion. For one, a model with purer leaf nodes on the training set may not perform better on the test set.

Another factor to consider is the size of the tree (i.e. model complexity) each criterion tends to promote.

To compare different decision tree models, we need to first discuss stopping conditions.
CS109A, PROTOPAPAS, RADER, TANNER 34
Stopping Conditions & Pruning

CS109A, PROTOPAPAS, RADER, TANNER 35


Variance vs Bias

If we don't terminate the decision tree learning algorithm manually, the tree will continue to grow until each region defined by the model possibly contains exactly one training point (and the model attains 100% training accuracy).

To prevent this from happening, we can simply stop the algorithm at a particular depth.

But how do we determine the appropriate depth?

CS109A, PROTOPAPAS, RADER, TANNER 36


Variance vs Bias

CS109A, PROTOPAPAS, RADER, TANNER 37


Variance vs Bias
We make some observations about our models:
• (High Bias) A tree of depth 4 is not a good fit for the training data - it’s unable to
capture the nonlinear boundary separating the two classes.
• (Low Bias) With an extremely high depth, we can obtain a model that correctly
classifies all points on the boundary (by zig-zagging around each point).
• (Low Variance) The tree of depth 4 is robust to slight perturbations in the training data
- the square carved out by the model is stable if you move the boundary points a bit.
• (High Variance) Trees of high depth are sensitive to perturbations in the training data,
especially to changes in the boundary points.
Not surprisingly, complex trees have low bias (able to capture more complex geometry in
the data) but high variance (can overfit). Complex trees are also harder to interpret and
more computationally expensive to train.

CS109A, PROTOPAPAS, RADER, TANNER 38


Stopping Conditions

Common simple stopping conditions:

• Don't split a region if all instances in the region belong to the same class.
• Don't split a region if the number of instances in a sub-region would fall below a pre-defined threshold (min_samples_leaf).
• Don't split a region if the total number of leaves in the tree would exceed a pre-defined threshold.

The appropriate thresholds can be determined by evaluating the model on a held-out data set or, better yet, via cross-validation.
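For concreteness, here is a hedged sketch of setting such thresholds with scikit-learn and choosing among them by cross-validation (the dataset and the particular parameter grid are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate stopping conditions: maximum depth and minimum leaf size
param_grid = {"max_depth": [2, 4, 6, 8, None],
              "min_samples_leaf": [1, 5, 20]}

# 5-fold cross-validation over the grid of stopping conditions
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```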

CS109A, PROTOPAPAS, RADER, TANNER 39


Stopping Conditions

More restrictive stopping conditions:

• Don't split a region if the class distribution of the training points inside the region is independent of the predictors.
• Compute the gain in purity (information gain / reduction in impurity) from splitting a region R into R1 and R2:

Gain(R) = Δ(R) = m(R) − (N1/N) m(R1) − (N2/N) m(R2)

where m is a metric like the Gini Index or entropy. Don't split if the gain is less than some pre-defined threshold (min_impurity_decrease).
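A minimal sketch of this gain computation (function names are assumed; m here is the Gini index defined earlier):

```python
import numpy as np

def gini(y):
    """Gini index of the labels in one region."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def purity_gain(y_parent, y_left, y_right):
    """Gain(R) = m(R) - (N1/N) m(R1) - (N2/N) m(R2) with m = Gini index."""
    n, n1, n2 = len(y_parent), len(y_left), len(y_right)
    return gini(y_parent) - (n1 / n) * gini(y_left) - (n2 / n) * gini(y_right)

# Splitting a balanced parent into two pure halves recovers its full impurity (0.5)
print(purity_gain(np.array([0, 0, 0, 1, 1, 1]),
                  np.array([0, 0, 0]), np.array([1, 1, 1])))   # 0.5
```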
CS109A, PROTOPAPAS, RADER, TANNER 40
Alternative to Using Stopping Conditions

What is the major issue with pre-specifying a stopping condition?
• You may stop too early or stop too late.
How can we fix this issue?
• Choose several stopping criteria (set the minimal Gain(R) at various levels) and cross-validate which is the best.
What is an alternative approach to this issue?
• Don't stop. Instead, prune back!

CS109A, PROTOPAPAS, RADER, TANNER 41


To Hot Dog or Not Hot Dog…

CS109A, PROTOPAPAS, RADER, TANNER 42


Hot Dog or Not

[Decision tree figure: the root splits on width ≤ 1.05in; further splits use width ≤ 0.725in, length ≤ 6.25in, and length ≤ 7.25in, with yes/no branches leading to the leaf classifications.]

CS109A, PROTOPAPAS, RADER, TANNER 43


Motivation for Pruning

CS109A, PROTOPAPAS, RADER, TANNER 44


Motivation for Pruning

CS109A, PROTOPAPAS, RADER, TANNER 45


Motivation for Pruning

CS109A, PROTOPAPAS, RADER, TANNER 46


Motivation for Pruning

[Figure comparing a simple tree (early stopping), the full tree, and pruning.]

CS109A, PROTOPAPAS, RADER, TANNER 47


Pruning

Rather than preventing a complex tree from growing, we can obtain a simpler tree by 'pruning' a complex one.

There are many methods of pruning; a common one is cost complexity pruning, whereby we select, from an array of smaller subtrees of the full model, the one that optimizes a balance of performance and complexity.

That is, we measure

C(T) = Error(T) + α|T|

where T is a decision (sub)tree, |T| is the number of leaves in the tree, and α is the parameter for penalizing model complexity.

CS109A, PROTOPAPAS, RADER, TANNER 48


Pruning

CS109A, PROTOPAPAS, RADER, TANNER 49


Pruning

CS109A, PROTOPAPAS, RADER, TANNER 50


Pruning

CS109A, PROTOPAPAS, RADER, TANNER 51


Pruning

CS109A, PROTOPAPAS, RADER, TANNER 52


Pruning

C(T) = Error(T) + α|T|

1. Fix α.

2. Find the best tree for a given α based on the cost complexity C.

3. Find the best α using CV (what should be the error measure?)
CS109A, PROTOPAPAS, RADER, TANNER 53


Pruning
The pruning algorithm:
1. Start with a full tree T0 (each leaf node is pure).
2. Replace a subtree in T0 with a leaf node to obtain a pruned tree T1. This subtree is selected to minimize the increase in error per leaf removed:

(Error(T1) − Error(T0)) / (|T0| − |T1|)

3. Iterate this pruning process to obtain T0, T1, …, Tn, where Tn is the tree containing just the root of T0.
4. Select the optimal tree Ti by cross-validation.
Note: you might wonder where we are computing the cost complexity C(Ti). One can prove that this process is equivalent to explicitly optimizing C at each step.
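scikit-learn exposes this family of pruned subtrees through cost_complexity_pruning_path and the ccp_alpha parameter; below is a hedged sketch of selecting α by cross-validation (the dataset is an illustrative assumption):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas from the cost-complexity pruning path of the full tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# Cross-validate each pruned tree and keep the alpha with the best CV accuracy
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print(best_alpha)
```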

CS109A, PROTOPAPAS, RADER, TANNER 54


Next

How can this decision tree approach apply to a regression problem (quantitative outcome)?

Questions to consider:
• What would be a reasonable loss function?
• How would you determine any splitting criteria?
• How would you perform prediction in each leaf?

A picture is worth a thousand words…

CS109A, PROTOPAPAS, RADER, TANNER 55


Regression Tree Example
How do we decide a split here?

CS109A, PROTOPAPAS, RADER, TANNER 56


Decision Trees for Regression

CS109A, PROTOPAPAS, RADER, TANNER 57


Adaptations for Regression

With just two modifications, we can use a decision tree model for regression:
1. The three splitting criteria we've examined each promoted splits that were pure, i.e. new regions increasingly specialized in a single class.
   A. For classification, purity of the regions is a good indicator of the performance of the model.
   B. For regression, we want to select a splitting criterion that promotes splits that improve the predictive accuracy of the model as measured by, say, the MSE.
2. For regression with output in ℝ, we want to label each region in the model with a real number, typically the average of the output values of the training points contained in the region.

CS109A, PROTOPAPAS, RADER, TANNER 58


Learning Regression Trees
The learning algorithm for decision trees in regression tasks is:
1. Start with an empty decision tree (undivided feature space).
2. Choose a predictor j on which to split and choose a threshold value tj for splitting such that the weighted average MSE of the new regions is as small as possible:

argmin_{j, tj}  (N1/N) MSE(R1) + (N2/N) MSE(R2)

or equivalently,

argmin_{j, tj}  (N1/N) Var(y|x ∈ R1) + (N2/N) Var(y|x ∈ R2)

where Ni is the number of training points in Ri and N is the number of points in R.
3. Recurse on each new node until a stopping condition is met.
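A minimal sketch of step 2 for a single predictor (the helper name and the midpoint candidate thresholds are assumptions):

```python
import numpy as np

def best_regression_split(x_j, y):
    """Return the threshold t minimizing the weighted average MSE
    (equivalently the weighted variance) of the two child regions."""
    order = np.argsort(x_j)
    xs, ys = x_j[order], y[order]
    candidates = (xs[:-1] + xs[1:]) / 2          # midpoints between sorted values
    best_t, best_score = None, np.inf
    for t in candidates:
        left = xs <= t
        if left.all() or (~left).all():
            continue                             # skip splits that leave a region empty
        n, n1, n2 = len(ys), left.sum(), (~left).sum()
        score = (n1 / n) * ys[left].var() + (n2 / n) * ys[~left].var()
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_regression_split(x, y))              # best split falls near x = 6.5
```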

CS109A, PROTOPAPAS, RADER, TANNER 59


Regression Trees Prediction

For any data point xi:

1. Traverse the tree until we reach a leaf node.

2. The average of the response values y of the training points in that leaf (computed from the training set) is the prediction ŷi.
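For reference, a hedged scikit-learn sketch of fitting and predicting with a regression tree (the toy data and the max_depth value are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data: a noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Each leaf predicts the mean response of its training points
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[1.5], [4.5]]))
```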

CS109A, PROTOPAPAS, RADER, TANNER 60


Regression Tree Example
How do we decide a split here?

CS109A, PROTOPAPAS, RADER, TANNER 61


Regression Tree (max_depth = 1)

CS109A, PROTOPAPAS, RADER, TANNER 62


Regression Tree (max_depth = 2)

CS109A, PROTOPAPAS, RADER, TANNER 63


Regression Tree (max_depth = 5)

CS109A, PROTOPAPAS, RADER, TANNER 64


Regression Tree (max_depth = 10)

CS109A, PROTOPAPAS, RADER, TANNER 65


Stopping Conditions

Most of the stopping conditions we saw for classification trees, like maximum depth or minimum number of points in a region, can still be applied.

In place of purity gain, we can instead compute the accuracy gain for splitting a region R:

Gain(R) = Δ(R) = MSE(R) − (N1/N) MSE(R1) − (N2/N) MSE(R2)

and stop growing the tree when the gain is less than some pre-defined threshold.

CS109A, PROTOPAPAS, RADER, TANNER 66


Overfitting

Same issues as with classification trees. Avoid overfitting by pruning or by limiting the depth of the tree, using CV.

[Figure comparing a simple tree (early stopping), the full tree, and pruning.]

CS109A, PROTOPAPAS, RADER, TANNER 67
