Decision Tree

This document provides an overview of decision trees, including: 1) Decision trees are a type of classifier that uses a tree-like model to predict an output based on input values. They consist of internal decision nodes and leaf nodes that make a prediction. 2) The algorithm builds the tree by selecting the best question at each node to split the data into purer child nodes. It uses information gain to evaluate questions. 3) Entropy is used as an impurity measure to calculate information gain, with the goal of creating leaf nodes with examples of the same class. The question with the highest information gain is selected to split on.

Uploaded by

Abdi kariim

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

35 views

Decision Tree

Uploaded by

Abdi kariim

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 52

Decision Trees

Lecturer: Ji Liu

Thank Jerry Zhu for sharing his slides

[ Some slides from Andrew Moore http://www.cs.cmu.edu/~awm/tutorials and Chuck Dyer, with permission.]
x
• The input
• These names are the same: example,
point, instance, item, input
• Usually represented by a feature
vector
– These names are the same: attribute,
feature
– For decision trees, we will especially
focus on discrete features (though
continuous features are possible, see
end of slides)
y
• The output
• These names are the same: label,
target, goal
• It can be
– Continuous, as in our population
predictionRegression
– Discrete, e.g., is this mushroom x edible
or poisonous?  Classification
Evaluating classifiers
• During training
– Train a classifier from a training set (x1,y1),
(x2,y2), …, (xn,yn).
• During testing
– For new test data xn+1…xn+m, your classifier
generates predicted labels y’n+1… y’n+m
• Test set accuracy:
– You need to know the true test labels yn+1…
yn+m n +m
1
– Test set accuracy: acc = m ∑ 1 yi = y ' i
i=n+ 1
– Test set error rate = 1 – acc
Decision Trees
• One kind of classifier (supervised
learning)
• Outline:
– The tree
– Algorithm
– Mutual information of questions
– Overfitting and Pruning
– Extensions: real-valued features,
treerules, pro/con
Akinator: Decision Tree
• http://en.akinator.com/personnages/
A Decision Tree
• A decision tree has 2 kinds of nodes
1. Each leaf node has a class label,
determined by majority vote of training
examples reaching that leaf.
2. Each internal node is a question on
features. It branches out according to
the answers.
Automobile Miles-per-gallon
prediction
mpg cylinders displacement horsepower weight acceleration modelyear maker

good 4 low low low high 75to78 asia

bad 6 medium medium medium medium 70to74 america
bad 4 medium medium medium low 75to78 europe
bad 8 high high high low 70to74 america
bad 6 medium medium medium medium 70to74 america
bad 4 low medium low medium 70to74 asia
bad 4 low medium low low 70to74 asia
bad 8 high high high low 75to78 america
: : : : : : : :
: : : : : : : :
: : : : : : : :
bad 8 high high high low 70to74 america
good 8 high medium high high 79to83 america
bad 8 high high high low 75to78 america
good 4 low low low low 79to83 america
bad 6 medium medium medium high 75to78 america
good 4 medium low low low 79to83 america
good 4 low low medium high 79to83 america
bad 8 high high high low 70to74 america
good 4 low medium low medium 75to78 europe
bad 5 medium medium medium medium 75to78 europe
A very small decision tree
Internal node
question: “what is the
number of
cylinders”?

Leaves: classify by
majority vote
A bigger decision tree
question: “what is the
value of
horsepower”?
question: “what is the
value of maker”?

Predict “good” is also reasonable by following its parent node instead of the root node.
The full 1. Do not split when all
examples have the same
decision label

tree

2. Can not split when we

run out of questions
Decision tree algorithm
buildtree(examples, questions, default)
/* examples: a list of training examples
questions: a set of candidate questions, e.g., “what’s the value
of feature xi?”
default: default label prediction, e.g., over-all majority vote */
IF empty(examples) THEN return(default)
IF (examples have same label y) THEN return(y)
IF empty(questions) THEN return(majority vote in
examples)
q = best_question(examples, questions)
Let there be n answers to q
– Create and return an internal node with n children
– The ith child is built by calling
buildtree({example|q=ith answer}, questions\{q}, default)
The best question
• What do we want: pure leaf nodes, i.e. all
examples having (almost) the same y.
• A good question  a split that results in
pure child nodes
• How do we measure the degree of purity
induced by a question? Here’s one
possibility (Max-Gain in book):
mutual information
(a.k.a. information gain)
A quantity from information theory
Entropy (Impurity Measure)
• At the current node, there are n=n1+…+nk
examples
– n1 examples have label y1
– n2 examples have label y2
–…
– nk examples have label yk
• What’s the impurity of the node?
• Turn it into a game: if I put these
examples in a bag, and grab one at
random, what is the probability the
example has label yi?
Entropy (Impurity Measure)
• Probability estimated from samples:
 with probability p1=n1/n the example has label y1
 with probability p2=n2/n the example has label y2
 …
 with probability pk=nk/n the example has label yk
• p1+p2+…+pk=1
• The “outcome” of the draw is a random variable y with
probability (p1, p2, …, pk)
• What’s the impurity of the node  what’s the
uncertainty of y in a random drawing?
Entropy (Impurity Measure)

• Interpretation: The number of yes/no questions (bits)

needed on average to pin down the value of y in a
random drawing

H(y)= H(y)= H(y)=

Entropy (Impurity Measure)

biased
coin
Jerry’s coin

p(head)=0.5 p(head)=0.51 p(head)=1

p(tail)=0.5 p(tail)=0.49 p(tail)=0
H=1 H=0.9997 H=0 (Why?)
Excellent Video for Entropy
https://www.youtube.com/watch?
v=R4OlXb9aTvQ

• Entropy roughly measures the average

number of yes/no questions we need to
ask to figure out the class label of an
object without any additional attribute
information.
Conditional entropy
k
H (Y ∣X =v )=∑ −Pr (Y = y i∣X =v )log 2 Pr (Y = y i∣ X=v )
i=1

H (Y ∣X )= ∑ Pr ( X =v ) H (Y ∣X =v )
v :values of X

• Y: label. X: a question (e.g., a feature). v: an

answer to the question
• Pr(Y|X=v): conditional probability
• H(Y|X) estimates the average number of y/n
questions required after know the attribute
information X
Information gain
• Information gain, or mutual
information
I (Y ; X )=H (Y )−H (Y ∣X )
• Choose question (feature) X which
maximizes I(Y;X).
Example
• Features: color, shape, size
• What’s the best question at root?

+ -
The training set
Example Color Shape Size Class