Decision Tree
Lecturer: Ji Liu
[ Some slides from Andrew Moore http://www.cs.cmu.edu/~awm/tutorials and Chuck Dyer, with permission.]
x
• The input
• These names are the same: example, point, instance, item, input
• Usually represented by a feature vector
– These names are the same: attribute, feature
– For decision trees, we will especially focus on discrete features (though continuous features are possible; see end of slides)
y
• The output
• These names are the same: label, target, goal
• It can be
– Continuous, as in our population prediction → Regression
– Discrete, e.g., is this mushroom x edible or poisonous? → Classification
Evaluating classifiers
• During training
– Train a classifier from a training set (x1, y1), (x2, y2), …, (xn, yn).
• During testing
– For new test data xn+1, …, xn+m, your classifier generates predicted labels y'n+1, …, y'n+m
• Test set accuracy:
– You need to know the true test labels yn+1, …, yn+m
– Test set accuracy: $\mathrm{acc} = \frac{1}{m}\sum_{i=n+1}^{n+m} \mathbf{1}\{y_i = y'_i\}$
– Test set error rate = 1 − acc
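A test-set accuracy computation in code (an illustrative sketch, not from the slides; the function name test_accuracy is made up):

def test_accuracy(y_true, y_pred):
    """Fraction of test examples whose predicted label equals the true label."""
    correct = sum(1 for yt, yp in zip(y_true, y_pred) if yt == yp)
    return correct / len(y_true)

# Example: 3 of 4 predictions are correct -> accuracy 0.75, error rate 0.25
acc = test_accuracy(["edible", "poisonous", "edible", "edible"],
                    ["edible", "poisonous", "poisonous", "edible"])
print(acc, 1 - acc)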
Decision Trees
• One kind of classifier (supervised
learning)
• Outline:
– The tree
– Algorithm
– Mutual information of questions
– Overfitting and Pruning
– Extensions: real-valued features, converting a tree to rules, pros/cons
Akinator: Decision Tree
• http://en.akinator.com/personnages/
A Decision Tree
• A decision tree has 2 kinds of nodes
1. Each leaf node has a class label, determined by majority vote of the training examples reaching that leaf.
2. Each internal node is a question on features. It branches out according to the answers.
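To make the two node types concrete, here is a minimal sketch of how such a tree could be represented and used for prediction (illustrative only; the class names, feature names, and values below are made up, not from the slides):

from collections import Counter

class Leaf:
    """Leaf node: stores the majority-vote class label of the training examples reaching it."""
    def __init__(self, labels):
        self.label = Counter(labels).most_common(1)[0][0]

class DecisionNode:
    """Internal node: a question on one feature; branches out according to the answers."""
    def __init__(self, feature, children):
        self.feature = feature     # e.g., "maker"
        self.children = children   # maps each answer (feature value) to a subtree

def predict(node, x):
    """Follow the answers down the tree until a leaf is reached, then return its label."""
    while isinstance(node, DecisionNode):
        node = node.children[x[node.feature]]
    return node.label

# Example: a one-question tree
tree = DecisionNode("maker", {"america": Leaf(["bad", "bad", "good"]), "asia": Leaf(["good"])})
print(predict(tree, {"maker": "asia"}))   # -> "good"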
Automobile Miles-per-gallon prediction
[Table of car records with attributes: mpg, cylinders, displacement, horsepower, weight, acceleration, modelyear, maker]
Leaves: classify by majority vote
A bigger decision tree
question: “what is the value of horsepower?”
question: “what is the value of maker?”
Predicting “good” is also reasonable here, using the majority vote at the leaf’s parent node instead of at the root node.
The full decision tree
1. Do not split when all examples have the same label
[Entropy examples: a biased coin vs. Jerry’s coin]
$H(Y \mid X) = \sum_{v\,:\,\text{values of } X} \Pr(X = v)\, H(Y \mid X = v)$
The training set
[Table of 6 training examples with columns: Example, Color, Shape, Size, Class (+ or −)]
H(class)=
H(class | color)=
H(class)= H(3/6,3/6) = 1
H(class | color)= 3/6 * H(2/3,1/3) + 1/6 * H(1,0) + 2/6 * H(0,1)
3 out of 6 are red, and 2 of the 3 red are +; 1 out of 6 is blue, and blue is +; 2 out of 6 are green, and green is −.
H(class)= H(3/6,3/6) = 1
H(class | color)= 3/6 * H(2/3,1/3) + 1/6 * H(1,0) + 2/6 * H(0,1)
I(class; color) = H(class) – H(class | color) = 0.54 bits
H(class)= H(3/6,3/6) = 1
H(class | shape)= 4/6 * H(1/2, 1/2) + 2/6 * H(1/2,1/2)
I(class; shape) = H(class) – H(class | shape) = 0 bits
Shape tells us nothing about the class!
H(class)= H(3/6,3/6) = 1
H(class | size)= 4/6 * H(3/4, 1/4) + 2/6 * H(0,1)
I(class; size) = H(class) – H(class | size) = 0.46 bits
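The three gains above can be checked with a few lines of code (an illustrative sketch; it uses only the per-value class counts shown on these slides):

from math import log2

def entropy(counts):
    """Entropy (in bits) of a distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(class_counts, per_value_counts):
    """I(class; X) = H(class) - sum_v Pr(X = v) * H(class | X = v)."""
    n = sum(class_counts)
    h_cond = sum(sum(v) / n * entropy(v) for v in per_value_counts)
    return entropy(class_counts) - h_cond

# (+, -) counts within each feature value, as on the slides:
print(information_gain([3, 3], [[2, 1], [1, 0], [0, 2]]))  # color: ~0.54 bits
print(information_gain([3, 3], [[2, 2], [1, 1]]))          # shape: 0 bits
print(information_gain([3, 3], [[3, 1], [0, 2]]))          # size:  ~0.46 bits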
Example: Overfitting in regression:
Predicting US Population
x=Year   y=Million
1900     75.995
1910     91.972
1920     105.71
1930     123.2
1940     131.67
1950     150.7
1960     179.32
1970     203.21
1980     226.51
1990     249.63
2000     281.42
• We have some training data (n=11)
• What will the population be in 2020?
Regression: Polynomial fit
• The degree d (complexity of the model) is important
$f(x) = c_d x^d + c_{d-1} x^{d-1} + \cdots + c_1 x + c_0$
• Fit (= learn) coefficients cd, …, c0 to minimize the Mean Squared Error (MSE) on the training data
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$
Overfitting
• As d increases, MSE on training data
improves, but prediction outside training data
worsens
degree=0 MSE=4181.451643
degree=1 MSE=79.600506
degree=2 MSE=9.346899
degree=3 MSE=9.289570
degree=4 MSE=7.420147
degree=5 MSE=5.310130
degree=6 MSE=2.493168
degree=7 MSE=2.278311
degree=8 MSE=1.257978
degree=9 MSE=0.001433
degree=10 MSE=0.000000
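The pattern above can be reproduced with a short script (a sketch assuming numpy, not the lecturer's code; exact MSE values may differ slightly due to numerical conditioning):

import numpy as np

years = np.array([1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000])
pop   = np.array([75.995, 91.972, 105.71, 123.2, 131.67, 150.7,
                  179.32, 203.21, 226.51, 249.63, 281.42])

x = years - 1900   # shift years to keep the powers numerically manageable
for d in range(11):
    coeffs = np.polyfit(x, pop, d)                       # least-squares fit of degree d
    mse = np.mean((pop - np.polyval(coeffs, x)) ** 2)    # training MSE
    print(f"degree={d} MSE={mse:.6f}")

# Degree 10 interpolates all 11 points (training MSE ~ 0), yet its
# extrapolation to 2020 is far off -- the overfitting illustrated above.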
Overfitting: Toy Example
• Predict whether the outcome of throwing a die is “6” from its (color, size)
• Color = {red, blue}, Size = {small, large}
• Three training samples:
– X1 = (red, large), y1 = not 6
– X2 = (blue, small), y2 = not 6
– X3 = (blue, large), y3 = 6
Overfitting: Example for
Decision Tree
• Three training samples:
– X1 = (red, large), y1 = not 6
– X2 = (blue, small), y2 = not 6
– X3 = (blue, large), y3 = 6

The tree, with (#“6”, #“not 6”) counts at each node:
Root: Color? (1, 2)
  Red → Not 6 (0, 1)
  Blue → Size? (1, 1)
    Large → It is 6 (1, 0)
    Small → Not 6 (0, 1)
Toy Example
• Assume “color” and “size” are independent attributes for any die
• Assume P(red) = P(blue) = 1/2, P(large) = P(small) = 1/2
• The expected prediction accuracy of this decision tree on new throws is 1 − (1/2·1/6 + 1/4·5/6 + 1/4·1/6) = 2/3
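Spelling out the error terms behind that number (each branch errs with the probability of the opposite die outcome):

P(error) = P(red)·P(6) + P(blue, large)·P(not 6) + P(blue, small)·P(6)
         = 1/2·1/6 + 1/4·5/6 + 1/4·1/6
         = 1/12 + 5/24 + 1/24 = 8/24 = 1/3,   so accuracy = 1 − 1/3 = 2/3.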
Toy Example
• If the decision tree only has the root node, we predict “Not 6” for all new instances.
• The accuracy is 5/6 > 2/3
Root (1, 2) → predict Not 6
Overfit a decision tree
• Five inputs, all bits, are generated in all 32 possible combinations
• Output y = a copy of e, except that a random 25% of the records have y set to the opposite of e

a b c d e y
0 0 0 0 0 0
0 0 0 0 1 0
0 0 0 1 0 0
0 0 0 1 1 1
0 0 1 0 0 1
: : : : : :
1 1 1 1 1 1
(32 records)
Overfit a decision tree
• The test set is constructed similarly
– y = e, but 25% of the time we corrupt it by setting y = ¬e
– The corruptions in the training and test sets are independent
• The training and test sets are the same, except
– Some y’s are corrupted in training, but not in test
– Some y’s are corrupted in test, but not in training
Overfit a decision tree
• We build a full tree on the training set
[Figure: the full tree; the root splits on e = 0 / e = 1 and splitting continues until each training record has its own leaf]
On average:
• ¾ of the training data is uncorrupted
– ¾ of these are uncorrupted in test → correct labels
– ¼ of these are corrupted in test → wrong
• ¼ of the training data is corrupted
– ¾ of these are uncorrupted in test → wrong
– ¼ of these are also corrupted in test → correct labels
• Test accuracy = ¾ · ¾ + ¼ · ¼ = 5/8 = 62.5%
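A small simulation that matches this analysis (an illustrative sketch, not the lecturer's code): the full tree memorizes the corrupted training labels and gets about 5/8 on test, while a tree that splits only on e (see the next slides) gets about 3/4.

import itertools
import random

def make_data(rng):
    """All 32 bit combinations; y = e, flipped independently for ~25% of records."""
    data = []
    for bits in itertools.product([0, 1], repeat=5):
        e = bits[4]
        y = e if rng.random() > 0.25 else 1 - e
        data.append((bits, y))
    return data

def evaluate(n_trials=2000):
    rng = random.Random(0)
    full_acc = e_only_acc = 0.0
    for _ in range(n_trials):
        train, test = make_data(rng), make_data(rng)
        memorized = {x: y for x, y in train}                      # full tree: one leaf per record
        full_acc += sum(memorized[x] == y for x, y in test) / 32
        e_only_acc += sum(x[4] == y for x, y in test) / 32        # e-only tree predicts y = e
    print("full tree  :", full_acc / n_trials)    # ~0.625
    print("e-only tree:", e_only_acc / n_trials)  # ~0.75

evaluate()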
Overfit a decision tree
• But if we knew that a, b, c, d are irrelevant features and did not use them in the tree…
Overfit a decision tree
• The tree would be:
Root
e = 0 → In training data, about ¾ of the y’s here are 0; majority vote predicts y = 0
e = 1 → In training data, about ¾ of the y’s here are 1; majority vote predicts y = 1
Axis-parallel cuts
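With a real-valued feature, an internal node asks a threshold question such as “is x1 ≥ t?”, and each such split is an axis-parallel cut of the feature space; candidate thresholds can be scored with the same information gain. A minimal sketch (illustrative only; the example values are made up):

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def best_threshold(values, labels):
    """Score midpoints between consecutive sorted values; return (gain, threshold)."""
    pairs = sorted(zip(values, labels))
    best = (0.0, None)
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        gain = (entropy(labels)
                - len(left) / len(labels) * entropy(left)
                - len(right) / len(labels) * entropy(right))
        if gain > best[0]:
            best = (gain, t)
    return best

print(best_threshold([95, 110, 150, 200], ["good", "good", "bad", "bad"]))  # -> (1.0, 130.0)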
Conclusions
• Decision trees are popular tools for data mining
– Easy to understand
– Easy to implement
– Easy to use
– Computationally cheap
• Overfitting might happen
• We used decision trees for classification (predicting a categorical output from categorical or real inputs)
What you should know
• Trees for classification
• Top-down tree construction
algorithm
• Information gain
• Overfitting
• Pruning
• Real-valued features