Lecture 12 - Decision and Regression Trees
• Naïve Bayes represents a common assumption used in density estimation; it appears more often as part of an approach than as the final predictor
  – Super-powers: fast estimation from lots of data; not-terrible estimation from limited data
• Almost all algorithms involve nearest neighbor, logistic regression, or linear regression
  – The main learning challenge is typically feature learning
• So far, we've seen two main choices for how to use features x:
  1. Nearest neighbor uses all the features jointly to find similar examples
  2. Linear models make predictions out of weighted sums of the features
• If you wanted to give someone a simple, human-readable classifier instead, it might be a rule like: if x2 < 0.6 and x2 > 0.2 and x2 < 0.7, predict 'o'

[Figure: 'x' and 'o' examples scattered in the (x1, x2) feature plane]
Decision tree algorithm

Training (recursively, for each node in the tree):
1. If the labels in the node are mixed:
   a. Choose an attribute and split value based on the data that reaches the node
   b. Branch and create 2 (or more) child nodes
2. Return

[Figure: mixed 'x' and 'o' examples in the (x1, x2) plane, to be partitioned by axis-aligned splits]
[Figure: the tree under construction. The root split x2 < 0.6 (branches y/n) and a further split x1 < 0.7 progressively partition the unit square (axes x1 and x2, from (0,0) to 1) into regions containing mostly 'x' or mostly 'o' examples]
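The training recursion maps almost line-for-line onto code. Here is a minimal Python sketch, assuming a choose_split(X, y) helper (defined below, in the information-gain discussion) and a small Node record; the names and structure are illustrative, not the lecture's reference implementation.

```python
# Minimal sketch of the training recursion (illustrative names, not the
# lecture's reference code). choose_split(X, y) is defined further below.
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: Optional[str] = None        # set on leaves only
    feature: Optional[int] = None      # index of the split attribute
    threshold: Optional[float] = None  # split value
    left: Optional["Node"] = None      # branch where x[feature] < threshold
    right: Optional["Node"] = None     # branch where x[feature] >= threshold

def build_tree(X, y):
    # 2. Return a leaf when the labels reaching this node are pure
    if len(set(y)) == 1:
        return Node(label=y[0])
    # 1a. Choose attribute and split value from the data that reaches this node
    feature, threshold = choose_split(X, y)
    if threshold is None:
        # No split separates the data: fall back to a majority-label leaf
        return Node(label=Counter(y).most_common(1)[0][0])
    # 1b. Branch and create two child nodes
    left = [i for i, x in enumerate(X) if x[feature] < threshold]
    right = [i for i, x in enumerate(X) if x[feature] >= threshold]
    return Node(
        feature=feature,
        threshold=threshold,
        left=build_tree([X[i] for i in left], [y[i] for i in left]),
        right=build_tree([X[i] for i in right], [y[i] for i in right]),
    )
```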
Decision tree algorithm

Prediction: to classify a new point '*', route it down the tree from the root, answering the question at each split (e.g., x2 < 0.6, then x1 < 0.4) until a leaf is reached, and predict that leaf's label.

[Figure: the query point '*' in the partitioned (x1, x2) plane, routed through the splits x2 < 0.6 and x1 < 0.4]
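Prediction is just a walk from root to leaf. A sketch matching the Node structure above (again illustrative):

```python
def predict(node, x):
    # Walk from the root, answering the question at each internal node,
    # until a leaf is reached; predict the leaf's label.
    while node.label is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.label

# Usage: tree = build_tree(X_train, y_train); predict(tree, x_new)
```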
How do you choose what/where to split?

1. Measure information gain:
   • For each discrete attribute: compute the information gain of the split
   • For each continuous attribute: select the most informative threshold and compute its information gain. This can be done efficiently based on sorted values.
2. Select the attribute/threshold with the highest information gain
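Entropy here is H(Y) = -Σ_c p(c) log2 p(c), and the information gain of a split is the parent's entropy minus the size-weighted entropy of the children. Below is a sketch of choose_split for continuous attributes (the helper names are mine, not the lecture's). For simplicity it re-partitions at every candidate threshold; the efficient version the slide mentions sweeps the sorted values once, updating class counts incrementally.

```python
import math
from collections import Counter

def entropy(y):
    # H(Y) = -sum_c p(c) * log2(p(c)): uncertainty of the label distribution
    n = len(y)
    return -sum((c / n) * math.log2(c / n) for c in Counter(y).values())

def information_gain(y, y_left, y_right):
    # Parent entropy minus the size-weighted entropy of the two children
    n = len(y)
    return (entropy(y)
            - (len(y_left) / n) * entropy(y_left)
            - (len(y_right) / n) * entropy(y_right))

def choose_split(X, y):
    # For every continuous attribute, try thresholds midway between
    # consecutive distinct sorted values; keep the highest-gain split.
    best_feature, best_threshold, best_gain = 0, None, -1.0
    for f in range(len(X[0])):
        values = sorted(set(x[f] for x in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            y_left = [yi for x, yi in zip(X, y) if x[f] < t]
            y_right = [yi for x, yi in zip(X, y) if x[f] >= t]
            gain = information_gain(y, y_left, y_right)
            if gain > best_gain:
                best_feature, best_threshold, best_gain = f, t, gain
    return best_feature, best_threshold
```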
Pause, stretch, and think: Is it better to split based on type or patrons?
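A worked answer, assuming the question refers to the classic restaurant-waiting dataset from Russell and Norvig (12 examples, 6 'yes' and 6 'no', so the entropy before splitting is 1 bit):

• Gain(Type) = 1 − [2/12·H(1/2) + 2/12·H(1/2) + 4/12·H(1/2) + 4/12·H(1/2)] = 1 − 1 = 0 bits: every Type branch keeps a 50/50 label mix, so nothing is learned.
• Gain(Patrons) = 1 − [2/12·0 + 4/12·0 + 6/12·H(1/3)] ≈ 1 − (6/12)·0.918 ≈ 0.541 bits: the None and Some branches come out pure.

Under those assumptions, Patrons is by far the better split.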
[Figure: regression example; panels labeled 'Milwaukee, yesterday' and 'Grand Rapids, yesterday']
• RMSE = 3.42
• R² = 0.88
• $\text{score}(y_n = 1) = \mathbf{w}^T \mathbf{x}_n + b$
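For reference, the two fit metrics quoted above can be computed as in this generic sketch (not tied to the lecture's data; assumes NumPy arrays):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root-mean-squared error: the typical size of a prediction error,
    # in the same units as y (e.g., degrees of temperature)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    # Coefficient of determination: fraction of the variance in y
    # explained by the predictions (1 = perfect, 0 = predicting the mean)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```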
Things to remember
• Decision/regression trees learn to split up the feature space into partitions with similar values
• Entropy is a measure of uncertainty