Decision Tree
(ID3 algorithm)
Example of a Decision Tree
Splitting attributes: Refund, MarSt (Marital Status), TaxInc (Taxable Income).

Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A decision tree that fits this data:
  Refund = Yes -> NO
  Refund = No  -> test MarSt:
    MarSt = Married          -> NO
    MarSt = Single, Divorced -> test TaxInc:
      TaxInc < 80K -> NO
      TaxInc > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set (class labels known):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No

A tree induction algorithm learns a decision tree model from the training set.

Test Set (class labels unknown):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?

The learned model is then applied to the test set to predict the missing class labels.
Apply Model to Test Data

Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch matching the record at each node:
• Refund = No -> take the "No" branch to the MarSt node.
• Marital Status = Married -> take the "Married" branch, which leads to the leaf NO.
• Assign Cheat to "No".

(Had the record been Single or Divorced, the TaxInc node would have been tested next: < 80K -> NO, > 80K -> YES.)
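As a minimal sketch (not from the slides), this traversal can be written directly as a Python function; the field names and the predict_cheat helper are hypothetical, and the 80K boundary case is sent to YES here only because the tree leaves it implicit.

```python
def predict_cheat(record):
    """Walk the Refund -> MarSt -> TaxInc tree above for one test record.

    Hypothetical sketch: field names are made up, income is in thousands (K),
    and exactly 80K is sent to the YES branch since the tree leaves it implicit.
    """
    if record["Refund"] == "Yes":
        return "No"                                  # Refund = Yes -> leaf NO
    if record["MaritalStatus"] == "Married":
        return "No"                                  # MarSt = Married -> leaf NO
    return "No" if record["TaxableIncome"] < 80 else "Yes"   # TaxInc test

# The test record from the walkthrough: Refund = No, Married, 80K.
print(predict_cheat({"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}))  # -> "No"
```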
Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  • Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  • Determine when to stop splitting
How to Specify Test Condition?
• Depends on the attribute type:
  • Nominal
  • Ordinal
  • Continuous

Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}.
• Binary split: divide the values into two subsets, e.g. {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. Size -> {Small}, {Medium}, {Large}.
• Binary split: divide the values into two subsets; need to find the optimal partitioning, e.g. {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}.
Splitting Based on Continuous Attributes
• Binary split: compare against a threshold, e.g. Taxable Income > 80K? (Yes / No).
• Multi-way split: discretize into disjoint ranges, e.g. Taxable Income in < 10K, ..., > 80K.
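As an illustrative sketch (not from the slides), the three kinds of test conditions could look like this in code; the record fields and function names are hypothetical.

```python
# Nominal attribute, multi-way split: one branch per distinct CarType value.
def carType_branch(record):
    return record["CarType"]                         # "Family", "Sports" or "Luxury"

# Nominal/ordinal attribute, binary split: group the values into two subsets
# (for an ordinal attribute such as Size the grouping must respect the order).
def size_binary_branch(record):
    return record["Size"] in {"Small", "Medium"}     # {Small, Medium} vs. {Large}

# Continuous attribute, binary split: compare against a threshold.
def income_binary_branch(record):
    return record["TaxableIncome"] > 80              # > 80K vs. <= 80K
```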
How to Determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1.
• A node with C0: 5, C1: 5 is non-homogeneous (high degree of impurity).
• A node with C0: 9, C1: 1 is homogeneous (low degree of impurity).
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
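A minimal sketch (not from the slides) computing the three measures for a node with the standard formulas; the function name and the class-count example are only illustrative.

```python
import math
from collections import Counter

def impurity_measures(labels):
    """Gini index, entropy and misclassification error of one node,
    given the list of class labels of the records in that node."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    gini = 1.0 - sum(p * p for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs)
    misclassification = 1.0 - max(probs)
    return gini, entropy, misclassification

# The two nodes from the example above:
print(impurity_measures(["C0"] * 5 + ["C1"] * 5))   # high impurity: (0.5, 1.0, 0.5)
print(impurity_measures(["C0"] * 9 + ["C1"] * 1))   # low impurity:  (0.18, ~0.47, 0.1)
```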
Decision Trees
• Can be viewed as a way to compactly represent a lot of data.
• The evaluation of a decision tree classifier is easy.
• Learning a good representation (a good tree) from data is the challenge.
• Labels: binary classification task, Y = {+, -}.
(Figure: a decision tree for the tennis example, with Outlook at the root: Sunny -> Humidity, Overcast -> Yes, Rain -> Wind.)
Will I play tennis today?

     O  T  H  W  Play?
1    S  H  H  W   -
2    S  H  H  S   -
3    O  H  H  W   +
4    R  M  H  W   +
5    R  C  N  W   +
6    R  C  N  S   -
7    O  C  N  S   +
8    S  M  H  W   -
9    S  C  N  W   +
10   R  M  N  W   +
11   S  M  N  S   +
12   O  M  H  S   +
13   O  H  N  W   +
14   R  M  H  S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)
Basic Decision Tree Learning Algorithm
• Data is processed in batch (i.e., all the data is available).
• Recursively build a decision tree top down.

The tree learned from the 14 tennis examples above:
  Outlook = Sunny    -> test Humidity: High -> No, Normal -> Yes
  Outlook = Overcast -> Yes
  Outlook = Rain     -> test Wind: Strong -> No, Weak -> Yes
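A small sketch (hypothetical encoding, not from the slides) of how this learned tree could be stored and applied: a nested dict maps each attribute to its branches, which is also the shape the ID3 sketch further below returns.

```python
# Hypothetical nested-dict encoding of the tree shown above.
tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, record):
    """Follow the attribute tests until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[record[attribute]]
    return tree

print(classify(tennis_tree, {"Outlook": "Rain", "Wind": "Weak"}))   # -> "Yes"
```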
Basic Decision Tree Algorithm – ID3
• Let S be the set of examples.
• Label is the target attribute (the prediction).
• Attributes is the set of measured attributes.

ID3(S, Attributes, Label):
  If all examples in S have the same label, return a single-node tree with that label.
  Otherwise:
    Create a Root node for the tree.
    A = the attribute in Attributes that best classifies S; assign it to Root.
    For each possible value v of A:
      Add a new tree branch corresponding to A = v.
      Let Sv be the subset of examples in S with A = v.
      If Sv is empty: add a leaf node with the most common value of Label in S.
      Else: below this branch add the subtree ID3(Sv, Attributes - {A}, Label).
    Return Root.
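A minimal Python sketch of this recursion, assuming each example is a dict, Attributes is a set of keys, and the "best" attribute is the one with the highest information gain (lowest expected entropy); all names here are hypothetical.

```python
import math
from collections import Counter

def entropy(examples, label):
    """Entropy of the label distribution over a list of dict-shaped examples."""
    counts = Counter(ex[label] for ex in examples)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def best_attribute(examples, attributes, label):
    """Pick the attribute with the lowest expected entropy (highest information gain)."""
    def expected_entropy(a):
        n = len(examples)
        result = 0.0
        for v in {ex[a] for ex in examples}:
            sv = [ex for ex in examples if ex[a] == v]
            result += len(sv) / n * entropy(sv, label)
        return result
    return min(attributes, key=expected_entropy)

def id3(examples, attributes, label):
    """Return a nested dict {attribute: {value: subtree_or_leaf}} or a leaf label."""
    labels = [ex[label] for ex in examples]
    if len(set(labels)) == 1 or not attributes:
        # pure node, or no attributes left: return the (most common) label
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, attributes, label)
    tree = {a: {}}
    for v in {ex[a] for ex in examples}:
        sv = [ex for ex in examples if ex[a] == v]
        # branches are created only for values present in S, so Sv is never empty here
        tree[a][v] = id3(sv, attributes - {a}, label)
    return tree
```

Applied to the 14 tennis examples above, this recursion reproduces the Outlook / Humidity / Wind tree shown earlier (up to branch ordering).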
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as possible.
• The recursive algorithm is a greedy heuristic search for a simple tree, but it cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next attribute to condition on.
Picking the Root Attribute
• Consider data with two Boolean attributes (A, B):
  < (A=0, B=0), - >: 50 examples
  < (A=0, B=1), - >: 50 examples
  < (A=1, B=0), - >: 0 examples
  < (A=1, B=1), + >: 100 examples
• What should be the first attribute we select?
  • Splitting on A: we get purely labeled nodes (A=1 -> all +, A=0 -> all -).
  • Splitting on B: we don't get purely labeled nodes.
• What if we have < (A=1, B=0), - >: 3 examples instead?
• (One way to think about it: the number of queries required to label a random data point.)
(Figures: the tree obtained by splitting on A, and the tree obtained by splitting on B, which still needs A below the B = 1 branch.)
Picking the Root Attribute
• Consider the same data, but now with < (A=1, B=0), - >: 3 examples.
• What should be the first attribute we select?
• The trees look structurally similar; which attribute should we choose?
• Advantage A. But... we need a way to quantify things.
• One way to think about it: the number of queries required to label a random data point. If we choose A we have less uncertainty about the labels.
(Figures: splitting on A gives child nodes with 100 negatives, and 100 positives plus 3 negatives; splitting on B gives child nodes with 53 negatives, and 100 positives plus 50 negatives.)
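A small sketch (not from the slides) quantifying that intuition with entropy: compute the expected entropy after each split, using the counts from the figure; the function names are hypothetical.

```python
import math

def entropy(pos, neg):
    """Binary entropy (in bits) of a node with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            h -= (count / total) * math.log2(count / total)
    return h

def expected_entropy(branches):
    """Weighted average entropy over the (pos, neg) counts of the child nodes."""
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * entropy(p, n) for p, n in branches)

# With the 3 extra negatives under (A=1, B=0):
print(expected_entropy([(100, 3), (0, 100)]))    # splitting on A: ~0.10 bits
print(expected_entropy([(100, 50), (0, 53)]))    # splitting on B: ~0.68 bits
# Splitting on A leaves much less uncertainty about the labels.
```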
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as possible.
• The main decision in the algorithm is the selection of the next attribute to condition on.
• We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node.
• The most popular heuristic is based on information gain, and originated with the ID3 system of Quinlan.
Entropy
• The entropy (impurity, disorder) of a set of examples S, relative to a binary classification, is
  $Entropy(S) = -p_+ \log p_+ - p_- \log p_-$
  where $p_+$ is the proportion of positive examples in S and $p_-$ is the proportion of negative examples in S.
• If all the examples belong to the same category: Entropy = 0.
• If the examples are equally mixed (0.5, 0.5): Entropy = 1.
• Entropy = level of uncertainty.
• In general, when $p_i$ is the fraction of examples labeled $i$:
  $Entropy(S) = Entropy(p_1, p_2, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log p_i$
• Entropy can be viewed as the number of bits required, on average, to encode the class labels. If the probability of + is 0.5, a single bit is required for each example; if it is 0.8, we can use less than one bit per example.
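As a quick check of the last claim (not worked out on the slide): for $p_+ = 0.8$,
$Entropy(S) = -0.8 \log_2 0.8 - 0.2 \log_2 0.2 \approx 0.258 + 0.464 = 0.722$ bits,
which is indeed less than the 1 bit needed when $p_+ = 0.5$.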
Information Gain
• High entropy – high level of uncertainty.
• Low entropy – low level of uncertainty (no uncertainty when the entropy is 0).
• The information gain of an attribute measures the expected reduction in entropy caused by partitioning on that attribute (defined formally below).
Will I play tennis today?
Calculate the current entropy of the labels (9 positive, 5 negative out of 14 examples):
  $p_+ = \frac{9}{14}$, $p_- = \frac{5}{14}$
  $Entropy(Play) = -p_+ \log_2 p_+ - p_- \log_2 p_- = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94$
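The same number can be checked in a couple of lines of Python (a sketch; the variable names are arbitrary):

```python
import math

p_pos, p_neg = 9 / 14, 5 / 14   # label fractions over the 14 tennis examples
print(-p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg))   # ~0.940
```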
Information Gain: Outlook

$Gain(S, a) = Entropy(S) - \sum_{v \in values(a)} \frac{|S_v|}{|S|}\, Entropy(S_v)$

Outlook = Sunny: $p_+ = 2/5$, $p_- = 3/5$, Entropy(O = S) = 0.971
Outlook = Overcast: $p_+ = 4/4$, $p_- = 0$, Entropy(O = O) = 0
Outlook = Rainy: $p_+ = 3/5$, $p_- = 2/5$, Entropy(O = R) = 0.971

Expected entropy:
$\sum_{v \in values(a)} \frac{|S_v|}{|S|}\, Entropy(S_v) = \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 = 0.694$

Information gain = 0.940 - 0.694 = 0.246
Information Gain: Humidity

$Gain(S, a) = Entropy(S) - \sum_{v \in values(a)} \frac{|S_v|}{|S|}\, Entropy(S_v)$

Humidity = High: $p_+ = 3/7$, $p_- = 4/7$, Entropy(H = H) = 0.985
Humidity = Normal: $p_+ = 6/7$, $p_- = 1/7$, Entropy(H = N) = 0.592

Expected entropy:
$\sum_{v \in values(a)} \frac{|S_v|}{|S|}\, Entropy(S_v) = \frac{7}{14} \times 0.985 + \frac{7}{14} \times 0.592 = 0.7885$

Information gain = 0.940 - 0.7885 = 0.1515
Which feature to split on?

Information gain:
  Outlook: 0.246
  Humidity: 0.151
  Wind: 0.048
  Temperature: 0.029

→ Split on Outlook (it has the highest information gain).
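A self-contained sketch (not from the slides) that recomputes all four information gains directly from the 14-example table; the row encoding and function names are hypothetical, and the printed values differ from the slide's only by rounding.

```python
import math
from collections import Counter

# The 14 training examples, encoded as (O, T, H, W, Play) tuples with the
# same abbreviations as the table above.
rows = [
    ("S","H","H","W","-"), ("S","H","H","S","-"), ("O","H","H","W","+"),
    ("R","M","H","W","+"), ("R","C","N","W","+"), ("R","C","N","S","-"),
    ("O","C","N","S","+"), ("S","M","H","W","-"), ("S","C","N","W","+"),
    ("R","M","N","W","+"), ("S","M","N","S","+"), ("O","M","H","S","+"),
    ("O","H","N","W","+"), ("R","M","H","S","-"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]   # columns 0..3

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, col):
    """Entropy of the labels minus the expected entropy after splitting on column `col`."""
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

for col, name in enumerate(attributes):
    print(name, round(information_gain(rows, col), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# -> Outlook has the highest information gain, matching the slide up to rounding.
```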