Lecture 12 - Decision and Regression Trees

The document discusses decision and regression trees as methods in applied machine learning, highlighting their training processes, splitting criteria, and applications. It explains how decision trees recursively choose attributes to separate classes and how regression trees minimize prediction errors for continuous values. Key concepts include entropy, information gain, and the advantages of using ensembles for improved performance.

Decision and Regression Trees

Applied Machine Learning

Derek Hoiem

[Title image, Dall-E prompt: "A dirt road splits around a large gnarly tree, fractal art"]
Recap of classification and regression
• Nearest neighbor is widely used
– Super-powers: can instantly learn new classes and predict from one or many examples
• Naïve Bayes represents a common assumption as part of density estimation, more typical as part of an approach rather than the final predictor
– Super-powers: Fast estimation from lots of data; not terrible estimation from limited data
• Logistic Regression is widely used
– Super-powers: Effective prediction from high-dimensional features; good confidence estimates
• Linear Regression is widely used
– Super-powers: Can extrapolate, explain relationships, and predict continuous values from many variables
• Almost all algorithms involve nearest neighbor, logistic regression, or linear regression
– The main learning challenge is typically feature learning
• So far, we’ve seen two main choices for how to use features:
1. Nearest neighbor uses all the features jointly to find similar examples
2. Linear models make predictions out of weighted sums of the features

[Scatter plot of ‘x’ and ‘o’ points in the (x1, x2) plane]

• If you wanted to give someone a rule to split the ‘o’ from the ‘x’, what other idea might you try?
If x2 < 0.6 and x2 > 0.2 and x2 < 0.7, ‘o’
Else ‘x’

Can we learn these kinds of rules automatically?
Decision trees
• Training: Iteratively choose the attribute and split value that
best separates the classes for the data in the current node
• Combines feature selection/modeling with prediction

Fig Credit: Zemel, Urtasun, Fidler


Decision Tree Classification

Slide Credit: Zemel, Urtasun, Fidler


Example with discrete inputs

Slide Credit: Zemel, Urtasun, Fidler


Example with discrete inputs

Figure Source: Zemel, Urtasun, Fidler


Decision Trees

Figure Source: Zemel, Urtasun, Fidler


Decision tree algorithm

Training
Recursively, for each node in tree:
1. If labels in the node are mixed:
a. Choose attribute and split values based on data that reaches each node
b. Branch and create 2 (or more) nodes
2. Return

[Worked example, animated over several slides: on the ‘x’/‘o’ scatter plot in the (x1, x2) unit square, splits are chosen one at a time (x2 < 0.6 at the root, then x1 < 0.7 and x2 < 0.8, then x1 < 0.4 and x1 < 0.5) until each leaf contains points of a single class]

Decision tree algorithm

Prediction
1. Check conditions to descend tree
2. Return label of leaf node

[The query point (*) descends the tree according to its feature values and receives the label of the leaf it reaches]
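A minimal sketch of this training and prediction loop, assuming a NumPy feature matrix X, integer 0/1 labels y, and entropy-based split selection; the function and dictionary key names (entropy, best_split, grow, predict, 'leaf', 'threshold', ...) are illustrative, not from the lecture.

import numpy as np

def entropy(y):
    # H(Y) = -sum_c p_c * log2(p_c), computed over the class proportions in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    # Try each feature and each observed value as a threshold; keep the split
    # with the highest information gain (largest drop in entropy).
    best_j, best_t, best_gain = None, None, 0.0
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            h_split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            gain = entropy(y) - h_split
            if gain > best_gain:
                best_j, best_t, best_gain = j, t, gain
    return best_j, best_t

def grow(X, y):
    # Step 1: if labels are mixed and a useful split exists, branch and recurse
    j, t = best_split(X, y)
    if j is None:                       # pure node (or no split helps): make a leaf
        return {'leaf': True, 'label': np.bincount(y).argmax()}
    go_left = X[:, j] < t
    return {'leaf': False, 'feature': j, 'threshold': t,
            'left': grow(X[go_left], y[go_left]),
            'right': grow(X[~go_left], y[~go_left])}

def predict(node, x):
    # Prediction: check conditions to descend the tree, return the leaf label
    while not node['leaf']:
        node = node['left'] if x[node['feature']] < node['threshold'] else node['right']
    return node['label']

Here grow returns a nested dict of split tests, and predict(tree, x) walks it exactly as in the two prediction steps above.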
How do you choose what/where to split?

Slide Source: Zemel, Urtasun, Fidler


Quantifying Uncertainty: Coin Flip Example

Slide Source: Zemel, Urtasun, Fidler


Quantifying Uncertainty: Coin Flip Example

Slide Source: Zemel, Urtasun, Fidler


Quantifying Uncertainty: Coin Flip Example
Entropy: H(X) = −Σ_x p(x) log₂ p(x)

Slide Source: Zemel, Urtasun, Fidler
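As a quick numerical check of that formula for the coin-flip example (the coin biases below are illustrative values, not taken from the slide):

import numpy as np

def entropy(probs):
    # H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit, maximum uncertainty
print(entropy([0.9, 0.1]))   # biased coin: about 0.47 bits
print(entropy([1.0, 0.0]))   # certain outcome: 0 bits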


Entropy of a Joint Distribution

Slide Source: Zemel, Urtasun, Fidler


Specific Conditional Entropy

Slide Source: Zemel, Urtasun, Fidler


Conditional Entropy

Slide Source: Zemel, Urtasun, Fidler


Conditional Entropy

Slide Source: Zemel, Urtasun, Fidler


Conditional Entropy

Slide Source: Zemel, Urtasun, Fidler


Information Gain

Slide Source: Zemel, Urtasun, Fidler
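Information gain is IG(Y, X) = H(Y) − H(Y | X), where H(Y | X) = Σ_x p(x) H(Y | X = x) is the conditional entropy from the previous slides. A small worked example with a made-up joint distribution p(X, Y); the numbers are illustrative, not from the slides:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Illustrative joint distribution p(X, Y): rows are values of X, columns values of Y
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])

p_x = p_xy.sum(axis=1)                      # marginal p(X)
p_y = p_xy.sum(axis=0)                      # marginal p(Y)

H_y = entropy(p_y)                          # uncertainty about Y
# Conditional entropy: H(Y|X) = sum_x p(x) * H(Y | X = x)
H_y_given_x = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))

info_gain = H_y - H_y_given_x               # how much knowing X reduces uncertainty in Y
print(H_y, H_y_given_x, info_gain)

With these numbers, knowing X removes roughly a third of a bit of uncertainty about Y; the attribute with the largest such reduction is the one the tree splits on.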


Constructing decision tree

Training
Recursively, for each node in tree:
1. If labels in the node are mixed:
a. Choose attribute and split values based on data that reaches each node
b. Branch and create 2 (or more) nodes
2. Return

Choosing the attribute and split (step 1a):
1. Measure information gain
• For each discrete attribute: compute information gain of split
• For each continuous attribute: select most informative threshold and compute its information gain. Can be done efficiently based on sorted values (see the sketch below).
2. Select attribute / threshold with highest information gain
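For the continuous-attribute case, the "efficiently based on sorted values" idea can be sketched like this, assuming a single attribute with binary 0/1 labels; the arrays values and labels and the helper names are hypothetical:

import numpy as np

def entropy_from_counts(counts):
    # Entropy of a class distribution given raw class counts
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def best_threshold(values, labels):
    # Sort once, then sweep candidate thresholds while maintaining running
    # class counts for the left side, so each candidate is O(1) to evaluate.
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    n = len(y)
    total = np.bincount(y, minlength=2).astype(float)
    h_parent = entropy_from_counts(total)

    left = np.zeros(2)
    best_gain, best_t = 0.0, None
    for i in range(n - 1):
        left[y[i]] += 1
        if v[i] == v[i + 1]:
            continue                    # only place thresholds between distinct values
        right = total - left
        h_cond = ((i + 1) * entropy_from_counts(left)
                  + (n - i - 1) * entropy_from_counts(right)) / n
        gain = h_parent - h_cond
        if gain > best_gain:
            best_gain, best_t = gain, (v[i] + v[i + 1]) / 2
    return best_t, best_gain

Sorting once costs O(n log n); the sweep then updates class counts incrementally, so every candidate threshold is scored in constant time.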
Pause, stretch, and think: Is it better to split based on type or patrons?

Slide Source: Zemel, Urtasun, Fidler


Slide Source: Zemel, Urtasun, Fidler
What if you need to predict a continuous value?
• Regression Tree
– Same idea, but choose splits to minimize sum squared error:
Σ_{n ∈ node} ( f_node(x_n) − y_n )²
– f_node(x_n) typically returns the mean target value of the data points in the leaf node containing x_n
– What are we minimizing?
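A sketch of that criterion for a single feature, under the assumption that a leaf predicts the mean target of its training points (the names here are illustrative, not the lecture's code):

import numpy as np

def sse(y):
    # Squared error when a node predicts the mean of its training targets
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def split_cost(x, y, threshold):
    # Total squared error of the two children created by splitting at `threshold`
    left, right = y[x < threshold], y[x >= threshold]
    return sse(left) + sse(right)

# The tree greedily picks the feature/threshold with the smallest split_cost.
# This answers "what are we minimizing": the within-leaf squared error, i.e. the
# variance of the targets inside each leaf (times the leaf size).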
Variants
• Different splitting criteria, e.g. Gini index: 1 − Σ_i p_i² (very similar result, a little faster to compute)
• Most commonly, split on one attribute at a time
– In case of continuous vector data, can also split on linear projections of features
• Can stop early
– when leaf node contains fewer than N_min points
– when max tree depth is reached
• Can also predict multiple continuous values or multiple classes
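In scikit-learn these variants correspond to constructor arguments; a minimal example (the parameter values are arbitrary choices, not recommendations):

from sklearn.tree import DecisionTreeClassifier

# criterion="gini" is the default; "entropy" selects information gain instead.
# min_samples_leaf and max_depth implement the early-stopping rules above.
clf = DecisionTreeClassifier(criterion="gini",
                             max_depth=5,
                             min_samples_leaf=20,
                             random_state=0)
# clf.fit(X_train, y_train) and clf.predict(X_test) would then train and predict,
# assuming X_train, y_train, X_test are defined elsewhere.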
Decision Tree vs. 1-NN

[Figure: DT Boundaries vs. 1-NN Boundaries on the same data]

• Both have piecewise-linear decisions
• Decision tree is typically “axis-aligned”
• Decision tree has ability for early stopping to improve generalization
• True power of decision trees arrives with ensembles (lots of small or randomized trees)
Regression Tree for Temperature Prediction
• Min leaf size: 200
• RMSE = 3.42
• R² = 0.88

[Tree diagram: the top splits test yesterday's temperatures in Chicago, Milwaukee, and Grand Rapids]
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

# x_train, y_train, x_val, y_val, feature_to_city, feature_to_day are defined earlier
model = DecisionTreeRegressor(random_state=0, min_samples_leaf=200)
model.fit(x_train, y_train)
y_pred = model.predict(x_val)
tree_rmse = np.sqrt(np.mean((y_pred - y_val)**2))
tree_mae = np.mean(np.abs(y_pred - y_val))
print('Tree: RMSE={}, MAE={}'.format(tree_rmse, tree_mae))
print('R^2: {}'.format(1 - tree_rmse**2 / np.mean((y_val - y_val.mean())**2)))

plt.figure(figsize=(20, 20))
tree.plot_tree(model)
plt.show()

# Report which city/day each of the top split features corresponds to
for f in [334, 372, 405]:
    print('{}: {}, {}'.format(f, feature_to_city[f], feature_to_day[f]))
Classification/Regression Trees Summary
• Key Assumptions
– Samples with similar features have similar predictions
• Model Parameters
– Tree structure with split criteria at each internal node and prediction at each leaf node
• Designs
– Limits on tree growth
– What kinds of splits are considered
– Criterion for choosing attribute/split (e.g. Gini impurity score is another common choice)
• When to Use
– Want an explainable decision function (e.g. for medical diagnosis)
– As part of an ensemble (as we’ll see Thursday)
• When Not to Use
– When you need strong accuracy from a single model: one tree is not a great performer, but a forest is
Compare classifiers

score(y) = wᵀx + b

score(y_n = 1) = wᵀx_n + b
Things to remember
• Decision/regression trees learn to split up the feature space into partitions with similar values

• Entropy is a measure of uncertainty

• Information gain measures how much particular knowledge reduces prediction uncertainty
Thursday
• Ensembles: model averaging and forests
