Supervised Learning:
Classification-I
M M Awais
SPJCM
Decision tree induction
Classification - Decision Tree 2
Introduction
It is a method that induces concepts from
examples (inductive learning)
Most widely used & practical learning
method
The learning is supervised: i.e. the classes
or categories of the data instances are
known
It represents concepts as decision trees
(which can be rewritten as if-then rules)
Classification - Decision Tree 3
Introduction
Decision tree learning is one of the most
widely used techniques for classification.
Its classification accuracy is competitive with
other methods, and
it is very efficient.
The classification model is a tree, called
decision tree.
C4.5 by Ross Quinlan is perhaps the best
known system. It can be downloaded from
the Web.
Classification - Decision Tree 4
Introduction
The target function can be Boolean or
discrete valued
Classification - Decision Tree 5
Decision Trees
Example: “is it a good day to play golf?”
The set of attributes and their possible values:
  outlook: sunny, overcast, rain
  temperature: cool, mild, hot
  humidity: high, normal
  windy: true, false
A particular instance in the training set might be:
  <overcast, hot, normal, false>: play
In this case, the target class
is a binary attribute, so each
instance represents a positive
or a negative example.
Classification - Decision Tree 6
Decision Tree Representation
1. Each node corresponds to an attribute
2. Each branch corresponds to an attribute
value
3. Each leaf node assigns a classification
Classification - Decision Tree 7
Example
Classification - Decision Tree 8
Example
Outlook
  Sunny → Humidity (High / Normal)
  Overcast
  Rain → Wind (Strong / Weak)
A Decision Tree for the concept PlayTennis
An unknown observation is classified by testing its
attributes and reaching a leaf node
Classification - Decision Tree 9
Using Decision Trees for Classification
Examples can be classified as follows
1. look at the example's value for the feature specified
2. move along the edge labeled with this value
3. if you reach a leaf, return the label of the leaf
4. otherwise, repeat from step 1
Example (a decision tree to decide whether to go on a picnic):
  outlook
    sunny → humidity: high → N, normal → P
    overcast → P
    rain → windy: true → N, false → P
So a new instance <rainy, hot, normal, true>: ? will be classified as “noplay”.
Classification - Decision Tree 10
Decision Trees and Decision Rules
If attributes are continuous, internal nodes may test against a threshold:
  outlook
    sunny → humidity: > 75% → no, <= 75% → yes
    overcast → yes
    rain → windy: > 20 → no, <= 20 → yes
Each path in the tree represents a decision rule:
  Rule 1: If (outlook = “sunny”) AND (humidity <= 0.75) Then (play = “yes”)
  Rule 2: If (outlook = “rainy”) AND (wind > 20) Then (play = “no”)
  Rule 3: If (outlook = “overcast”) Then (play = “yes”)
  ...
Classification - Decision Tree 11
DECISION TREES
Basic Decision Tree Learning Algorithm
Most algorithms for growing decision trees
are variants of a basic algorithm
An example of this core algorithm is the ID3
algorithm developed by Quinlan (1986)
It employs a top-down, greedy search through
the space of possible decision trees
12
DECISION TREES
Basic Decision Tree Learning Algorithm
First of all, we select the best attribute to be
tested at the root of the tree.
For making this selection, each attribute is
evaluated using a statistical test to determine
how well it alone classifies the training
examples.
13
DECISION TREES
Basic Decision Tree Learning Algorithm
We have
  - 14 observations, D1 through D14
  - 4 attributes: Outlook, Temperature, Humidity, Wind
  - 2 classes (Yes, No)
14
DECISION TREES
Basic Decision Tree Learning Algorithm
Outlook
  Sunny: D1, D2, D8, D9, D11
  Overcast: D3, D7, D12, D13
  Rain: D4, D5, D6, D10, D14
15
DECISION TREES
Basic Decision Tree Learning Algorithm
The selection process is then repeated using
the training examples associated with each
descendant node to select the best attribute
to test at that point in the tree
16
DECISION TREES
Outlook
  Sunny: D1, D2, D8, D9, D11
  Overcast: D3, D7, D12, D13
  Rain: D4, D5, D6, D10, D14
What is the “best” attribute to test at this point? The possible
choices are Temperature, Wind & Humidity
17
DECISION TREES
Basic Decision Tree Learning Algorithm
This forms a greedy search for an acceptable
decision tree, in which the algorithm never
backtracks to reconsider earlier choices.
18
DECISION TREES
Which Attribute is the Best Classifier?
The central choice in the ID3 algorithm is
selecting which attribute to test at each node
in the tree
We would like to select the attribute which is
most useful for classifying examples
For this we need a good quantitative measure.
For this purpose, statistical properties such as
information gain or the Gini index are used.
19
Top-Down Decision Tree Generation
The basic approach usually consists of two phases:
Tree construction
At the start, all the training examples are at the
root
Examples are partitioned recursively based on
selected attributes
Tree pruning
remove tree branches that may reflect noise in
the training data and lead to errors when
classifying test data
improve classification accuracy
Classification - Decision Tree 20
Top-Down Decision Tree Generation
Basic Steps in Decision Tree Construction
  The tree starts as a single node representing all data.
  If the samples are all of the same class, the node becomes a leaf labeled with that class label.
  Otherwise, select the feature that best separates the samples into individual classes (how to select this feature is the question addressed next).
  Recursion stops when:
    samples in a node belong to the same class, or
    there are no remaining attributes on which to split (the majority class labels the leaf).
Classification - Decision Tree 21
How to find Feature to split?
Many methods are available, but our focus
will be on the following:
  Information theory (information gain)
  Gain ratio
  Gini index
Classification - Decision Tree 22
Information
High Uncertainty
No Uncertainty
Classification - Decision Tree 23
Valuable Information
Which information is more valuable:
Of high uncertain region, or
Of no uncertain region
Answer: the information about the high-uncertainty region.
Classification - Decision Tree 24
Information theory
Information theory provides a mathematical basis
for measuring the information content.
To understand the notion of information, think
about it as providing the answer to a question, for
example, whether a coin will come up heads.
If one already has a good guess about the answer,
then the actual answer is less informative.
If one already knows that the coin is rigged so that it
will come up heads with probability 0.99, then a
message (advance information) about the actual
outcome of a flip is worth less than it would be for an
honest coin (50-50).
Classification - Decision Tree 25
Information theory (cont …)
For a fair (honest) coin, you have no
information, and you are willing to pay more
(say, in dollars) for advance information:
the less you know, the more valuable the
information.
Information theory uses this same intuition,
but instead of measuring the value for
information in dollars, it measures information
contents in bits.
One bit of information is enough to answer a
yes/no question about which one has no
idea, such as the flip of a fair coin
Classification - Decision Tree 26
Information Basic
Classification - Decision Tree 27
Entropy
Classification - Decision Tree 28
Classification - Decision Tree 29
Classification - Decision Tree 30
Classification - Decision Tree 31
Information: Basics
Information (entropy) for a single event is:
  E = −pᵢ log₂ pᵢ, where pᵢ is the probability of event i (−pᵢ log pᵢ is always +ve)
For multiple events:
  E(I) = −Σᵢ pᵢ log₂ pᵢ
Suppose you toss a fair coin; find the information (entropy) when the probability of head or tail is 0.5 each.
  Possible events: 2, pᵢ = 0.5
  E(I) = −0.5·log₂ 0.5 − 0.5·log₂ 0.5 = 1.0
If the coin is biased, i.e., the chance of heads is 0.75 and of tails
is 0.25, then E(I) = −0.75·log₂ 0.75 − 0.25·log₂ 0.25 < 1.0
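These two coin entropies can be reproduced with a few lines of Python (an illustrative sketch of the formula above; the helper name entropy is ours, not from the slides):

```python
import math

def entropy(probs):
    """E(I) = -sum(p_i * log2 p_i), in bits; zero-probability events are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin   -> 1.0
print(entropy([0.75, 0.25]))  # biased coin -> ~0.811, i.e. less than 1.0
```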
Classification - Decision Tree 32
Information: Basics
Suppose you have a die and you roll it; find the entropy of the roll if the probabilities of each outcome (1 to 6) are equal.
  Possible events: 6, pᵢ = 1/6
  E(I) = 6 × (−(1/6)·log₂(1/6)) = 2.585
If the die is biased, i.e., the chance of a ‘6’ is 0.75, then what is the entropy?
  p(6) = 0.75
  p(all other) = 0.25, so p(any other single number) = 0.25/5 = 0.05 (equally divided among 1 to 5)
  Then E(I) = −0.75·log₂ 0.75 − 5·(0.05)·log₂(0.05) = 1.39
Classification - Decision Tree 33
Information: Basics
(The same die example as on the previous slide.)
Key point: as the probability of an event increases, the uncertainty, and with it the entropy, decreases.
Classification - Decision Tree 34
Information: Basics
(The same die example again.)
Key point: in decision making, we choose the variable whose known values most reduce this uncertainty.
Classification - Decision Tree 35
Decision Trees
The most notable types of decision tree algorithms are:-
Iterative Dichotomiser 3 (ID3): This algorithm uses information
gain to decide which attribute is to be used to classify the current
subset of the data. For each level of the tree, information gain is
calculated for the remaining data recursively.
C4.5: This algorithm is the successor of the ID3 algorithm. It
uses either information gain or gain ratio to decide upon
the classifying attribute. It is a direct improvement over the ID3
algorithm, as it can handle both continuous and missing attribute
values.
Classification and Regression Tree (CART): A dynamic
learning algorithm which can produce a regression tree as well as a
classification tree, depending upon the dependent variable.
Classification - Decision Tree 36
DT: Entropy – A measuring Value
Entropy is a concept that originated in thermodynamics
but later found its way into information theory.
In the decision tree construction process, the definition of
entropy as a measure of disorder suits well.
If the class values of the data in a node are equally
divided among the possible class values,
we say entropy (disorder) is maximum.
If the class value of the data in a node is the same for
all data, entropy (disorder) is minimum.
Classification - Decision Tree 37
DT: Entropy – A measuring Value
A decision tree is built top-down from a root
node and involves partitioning the data into
subsets that contain instances with similar
values (homogenous).
ID3 algorithm uses entropy to calculate the
homogeneity of a sample.
If the sample is completely homogeneous, the
entropy is zero; if the sample is
equally divided, it has an entropy of one.
Classification - Decision Tree 38
Entropy
The entropy curve for a two-class problem touches its highest point, 1,
at the center, where the class probability is 0.5.
Classification - Decision Tree 39
Information theory: Entropy measure
The entropy formula:
  entropy(D) = −Σ_{j=1}^{|C|} Pr(c_j) · log₂ Pr(c_j),   with   Σ_{j=1}^{|C|} Pr(c_j) = 1
Pr(cj) is the probability of class cj in data set D
We use entropy as a measure of impurity or
disorder of data set D. (or, a measure of
information in a tree)
Classification - Decision Tree 40
Entropy measure: E = −(p/s)·log₂(p/s) − (n/s)·log₂(n/s)
  p = number of +ve examples, n = number of −ve examples, s = total examples
As the data become purer and purer, the entropy value
becomes smaller and smaller. This is useful for classification
Classification - Decision Tree 41
Information gain
Given a set of examples D, we first compute its
entropy for the ‘c’ classes:
  entropy(D) = −Σ_{j=1}^{|C|} Pr(c_j) · log₂ Pr(c_j)
If we choose attribute Aᵢ, with v values, as the root of the
current tree, this will partition D into v subsets D₁, D₂,
…, Dᵥ. The expected entropy if Aᵢ is used as the
current root:
  entropy_{Aᵢ}(D) = Σ_{j=1}^{v} (|Dⱼ| / |D|) · entropy(Dⱼ)
Classification - Decision Tree 42
Information Gain
Classification - Decision Tree 43
Information gain (cont …)
Information gained by selecting attribute Ai to
branch or to partition the data is
  gain(D, Aᵢ) = entropy(D) − entropy_{Aᵢ}(D)
We choose the attribute with the highest gain to
branch/split the current tree.
As the information gain increases for a variable,
the uncertainty in decision making reduces.
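As a sketch of how entropy(D), entropy_Ai(D) and gain(D, Ai) fit together, the following Python snippet computes them for a list of records; the record layout and helper names are our own illustration, not part of the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """entropy(D) over the class labels of a data set D."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def expected_entropy(rows, attr):
    """entropy_Ai(D): entropy of each subset D_j, weighted by |D_j| / |D|."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row["class"])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

def gain(rows, attr):
    """gain(D, Ai) = entropy(D) - entropy_Ai(D)."""
    return entropy([r["class"] for r in rows]) - expected_entropy(rows, attr)

# Tiny usage example on four hypothetical records:
rows = [{"outlook": "sunny", "class": "No"}, {"outlook": "overcast", "class": "Yes"},
        {"outlook": "sunny", "class": "No"}, {"outlook": "rain", "class": "Yes"}]
print(round(gain(rows, "outlook"), 3))   # -> 1.0 (outlook separates the classes perfectly)
```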
Classification - Decision Tree 44
Day outlook temp humidity wind play
D1 sunny hot high weak No
D2 sunny hot high strong No
D3 overcast hot high weak Yes
D4 rain mild high weak Yes
D5 rain cool normal weak Yes
D6 rain cool normal strong No
D7 overcast cool normal strong Yes
D8 sunny mild high weak No
D9 sunny cool normal weak Yes
D10 rain mild normal weak Yes
D11 sunny mild normal strong Yes
D12 overcast mild high strong Yes
D13 overcast hot normal weak Yes
D14 rain mild high strong No
Classification - Decision Tree 45
To build a decision tree, we need to calculate two types
of entropy using frequency tables as follows:
a) Entropy using the frequency table of one attribute:
Note: P(play = Yes) = 9/14 = 0.64 and P(play = No) = 5/14 = 0.36
Classification - Decision Tree 46
How to calculate log base 2?
To calculate a log₂-based value, use log(value)/log(2); for example,
log₂(0.36) = log₁₀(0.36)/log₁₀(2)
http://logbase2.blogspot.com/2008/08/log-calculator.html
Classification - Decision Tree 47
b) Entropy using the frequency table of two
attributes :
Classification - Decision Tree 48
Calculate the following entropies (hint: first build the frequency table; a worked check follows below):
  E(PlayGolf, temp) = P(Hot)·E(Yes, No) + P(Mild)·E(Yes, No) + P(Cool)·E(Yes, No)
  E(PlayGolf, humidity)
  E(PlayGolf, windy)
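A worked check for this exercise, using the 14-row table from the earlier slide (our own snippet; the expected values 0.911, 0.788 and 0.892 also reappear on the gain-ratio slide later):

```python
import math
from collections import Counter

data = [  # (outlook, temp, humidity, wind, play)
    ("sunny", "hot", "high", "weak", "No"),          ("sunny", "hot", "high", "strong", "No"),
    ("overcast", "hot", "high", "weak", "Yes"),      ("rain", "mild", "high", "weak", "Yes"),
    ("rain", "cool", "normal", "weak", "Yes"),       ("rain", "cool", "normal", "strong", "No"),
    ("overcast", "cool", "normal", "strong", "Yes"), ("sunny", "mild", "high", "weak", "No"),
    ("sunny", "cool", "normal", "weak", "Yes"),      ("rain", "mild", "normal", "weak", "Yes"),
    ("sunny", "mild", "normal", "strong", "Yes"),    ("overcast", "mild", "high", "strong", "Yes"),
    ("overcast", "hot", "normal", "weak", "Yes"),    ("rain", "mild", "high", "strong", "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def cond_entropy(col):
    """E(PlayGolf, attribute) = sum over attribute values of P(value) * E(classes | value)."""
    groups = {}
    for row in data:
        groups.setdefault(row[col], []).append(row[4])
    return sum(len(g) / len(data) * entropy(g) for g in groups.values())

for name, col in [("temp", 1), ("humidity", 2), ("windy", 3)]:
    print(name, round(cond_entropy(col), 3))   # temp 0.911, humidity 0.788, windy 0.892
```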
Classification - Decision Tree 49
Information Gain
The information gain is based on the
decrease in entropy after a dataset is split
on an attribute.
Constructing a decision tree is all about
finding the attribute that returns the highest
information gain (i.e., the most homogeneous
branches).
Classification - Decision Tree 50
Information Gain
Step 3: Choose attribute with the largest
information gain as the decision node.
Information Gain
Step 4a: A branch with entropy of 0 is a leaf
node.
Information Gain
Step 4b: A branch with entropy more than 0
needs further splitting.
Information Gain
Step 5: The ID3 algorithm is run recursively
on the non-leaf branches, until all data is
classified.
Decision Tree to Decision Rules
Decision Tree to Decision Rules
Example
Owns House   Married   Gender   Employed   Credit History   Risk Class
Yes          Yes       M        Yes        A                B
No           No        F        Yes        A                A
Yes          Yes       F        Yes        B                C
Yes          No        M        No         B                B
No           Yes       F        Yes        B                C
No           No        F        Yes        B                A
No           No        M        No         B                B
Yes          No        F        Yes        A                A
No           Yes       F        Yes        A                C
Yes          Yes       F        Yes        A                C
Classification - Decision Tree 62
Choosing the “Best” Feature
Candidate root features:
  Own House? (Yes / No)     Credit rating (A / B)
  Married? (Yes / No)       Gender (M / F)
Classification - Decision Tree 63
Choosing the “Best” Feature: Own House? (Yes / No)

Find the overall entropy first:
  Total samples: 10
  Class A: 3, Class B: 3, Class C: 4
  Entropy(D) = -(3/10)log(3/10) - (3/10)log(3/10) - (4/10)log(4/10) = 1.57

Own House has two values, Yes (5 instances) and No (5 instances), out of 10 in total, so the probability of each is 0.5.
Find entropy(Dj) for each of Yes and No, then add the two, weighted by these probabilities.
In the Yes branch, 1 of 5 instances is class A, 2 of 5 are class B, and 2 of 5 are class C:
  E(yes) = -(1/5)log(1/5) - (2/5)log(2/5) - (2/5)log(2/5) = 1.52
  E(no)  = -(2/5)log(2/5) - (1/5)log(1/5) - (2/5)log(2/5) = 1.52
  E(Dj)  = 0.5*E(yes) + 0.5*E(no) = 1.52
  Gain(D, Own House) = 1.57 - 1.52 = 0.05
Classification - Decision Tree 65
Similarly, find the gain values for all the other variables:
  Own House: 0.05
  Married: 0.72
  Gender: 0.88 (selected as the root node)
  Employed: 0.45
  Credit rating: 0.05
Classification - Decision Tree 72
Choosing the “Best” Feature
Gender
  M: Class A: 0, Class B: 3, Class C: 0. No further split is required here; it identifies B fully.
  F: Class A: 3, Class B: 0, Class C: 4. A further split is required here; it cannot identify A and C fully.
Apply the same procedure again on the other variables, leaving
out the column for Gender, and the rows for class B as it has been fully
determined.
Classification - Decision Tree 73
Choosing the “Best” Feature
Own House? (Yes / No)
  E(D) = 1.33
  Own House: 0.96
  Married: 0.00
  Etc…
Married is the best node as E(Dj) = 0;
hence its information gain will be maximum.
Classification - Decision Tree 75
Completing DT
Gender
  M → Class B: 3
  F → Class A: 3, Class C: 4 → split on Married
        Married = Yes → Class C: 4
        Married = No → Class A: 3
Classification - Decision Tree 76
Completing DT
Gender
  M → Class B: 3
  F → Married
        Yes → Class C: 4
        No → Class A: 3
Rules:
  R1: If Gender = M then Class B
  R2: If Gender = F and Married = Yes then Class C
  R3: If Gender = F and Married = No then Class A
Classification - Decision Tree 77
Table 6.1 Class‐labeled training tuples from AllElectronics customer database.
78
Classification - Decision Tree 79
Classification - Decision Tree 80
Classification - Decision Tree 81
Trees Construction Algorithm (ID3)
Decision Tree Learning Method (ID3)
Input: a set of examples S, a set of features F, and a target set T (target
class T represents the type of instance we want to classify, e.g., whether
“to play golf”)
1. If every element of S is already in T, return “yes”; if no element of S is in
T return “no”
2. Otherwise, choose the best feature f from F (if there are no features
remaining, then return failure);
3. Extend tree from f by adding a new branch for each attribute value
4. Distribute training examples to leaf nodes (so each leaf node S is now
the set of examples at that node, and F is the remaining set of features not
yet selected)
5. Repeat steps 1-4 for each leaf node
Main Question:
how do we choose the best feature at each step?
Note: the ID3 algorithm only deals with categorical attributes, but can be extended
(as in C4.5) to handle continuous attributes
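The steps above can be condensed into a short recursive sketch (a minimal Python illustration assuming categorical attributes and the information-gain criterion; it is not Quinlan's original code):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, target):
    """Information gain of splitting rows (dicts: attribute name -> value) on attr."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    return base - sum(len(g) / len(rows) * entropy(g) for g in groups.values())

def id3(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                  # all examples in one class -> leaf
        return labels[0]
    if not attrs:                              # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:      # one branch per observed attribute value
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attrs if a != best], target)
    return tree
```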
Classification - Decision Tree 82
Choosing the “Best” Feature
Using Information Gain to find the “best” (most discriminating) feature
Entropy E(I) of a set of instances I containing p positive and n negative examples:
  E(I) = −(p/(p+n))·log₂(p/(p+n)) − (n/(p+n))·log₂(n/(p+n))
Gain(A, I) is the expected reduction in entropy due to feature (attribute) A:
  Gain(A, I) = E(I) − Σ_over descendants j ((p_j + n_j)/(p + n))·E(I_j)
where the j-th descendant of I is the set of instances with value v_j for A.

Example: S: [9+, 5−], testing Outlook?
  E = −(9/14)·log(9/14) − (5/14)·log(5/14) = 0.940
  Branches: sunny [2+, 3−], overcast [4+, 0−] (“yes”, since all positive examples), rainy [3+, 2−]
Classification - Decision Tree 83
Decision Tree Learning - Example
(Using the 14 play-tennis examples in the table above.)

S: [9+, 5−]  (E = 0.940)

Split on humidity?
  high: [3+, 4−] (E = 0.985)     normal: [6+, 1−] (E = 0.592)
  Gain(S, humidity) = .940 - (7/14)*.985 - (7/14)*.592 = .151

Split on wind?
  weak: [6+, 2−] (E = 0.811)     strong: [3+, 3−] (E = 1.00)
  Gain(S, wind) = .940 - (8/14)*.811 - (6/14)*1.0 = .048

So, classifying examples by humidity provides more information gain than by
wind. In this case, however, you can verify that outlook has the largest
information gain, so it will be selected as root.
Classification - Decision Tree 84
Decision Tree Learning - Example
(Same data.)

S: [9+, 5−], Outlook {D1, D2, …, D14}

Split on outlook?
  sunny: [2+, 3−] (E = .970)     overcast: [4+, 0−] (E = 0) → yes     rainy: [3+, 2−] (E = .970)
  Gain(S, outlook) = .940 - (5/14)*.97 - (4/14)*0 - (5/14)*.97 = .247

Outlook therefore has the largest information gain of all the attributes,
so it is selected as the root.
Classification - Decision Tree 85
Decision Tree Learning - Example
Partially learned decision tree

S: [9+, 5−], Outlook {D1, D2, …, D14}
  sunny: ? (E = .970), [2+, 3−], {D1, D2, D8, D9, D11}
  overcast: yes (E = 0), [4+, 0−], {D3, D7, D12, D13}
  rainy: ? (E = .970), [3+, 2−], {D4, D5, D6, D10, D14}

Which attribute should be tested at the sunny node?
  Ssunny = {D1, D2, D8, D9, D11}
  Gain(Ssunny, humidity) = .970 - (3/5)*0.0 - (2/5)*0.0 = .970
  Gain(Ssunny, temp) = .970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = .570
  Gain(Ssunny, wind) = .970 - (2/5)*1.0 - (3/5)*.918 = .019
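The three Ssunny gains can be checked with the following snippet (our own verification code; small differences from the slide come only from rounding):

```python
import math
from collections import Counter

sunny = [  # (temp, humidity, wind, play) for D1, D2, D8, D9, D11
    ("hot", "high", "weak", "No"), ("hot", "high", "strong", "No"),
    ("mild", "high", "weak", "No"), ("cool", "normal", "weak", "Yes"),
    ("mild", "normal", "strong", "Yes"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    split = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - split

for name, col in [("humidity", 1), ("temp", 0), ("wind", 2)]:
    print(name, round(gain(sunny, col), 3))   # humidity 0.971, temp 0.571, wind 0.020
```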
Classification - Decision Tree 86
Highly-branching attributes
Problematic: attributes with a large number of
values (extreme case: ID code)
Subsets are more likely to be pure if there is
a large number of values
Information gain is biased towards choosing
attributes with a large number of values
This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
Classification - Decision Tree 87
Gain Ratio for Attribute Selection (C4.5)
Classification - Decision Tree 88
Another Alternative to avoid selecting
attributes with large -domains
Classification - Decision Tree 89
Gini Index (CART, SLIQ, ibm
IntellegentMiner)
Classification - Decision Tree 90
Contd..!!!
Classification - Decision Tree 91
Classification - Decision Tree 92
Comparing Attribute Selection Measures
Classification - Decision Tree 93
Split Algorithm with Gini Index
Basic concept taken from economics, given
by Corrado Gini (1884 to 1965)
The index varies from 0 to 1:
  ZERO means no uncertainty
  ONE means maximum uncertainty
Income distribution of various countries (Gini index):
  Brazil 0.59
  India 0.32
  China 0.45
  USA 0.41
  Japan 0.25 (most evenly distributed income)
Classification - Decision Tree 94
Gini Index
The Gini index is a measure of impurity developed by the
Italian statistician Corrado Gini in 1912.
It is usually used to measure income inequality but
can be used to measure any form of uneven
distribution.
The Gini index is a number between 0 and 1, where 0
corresponds to perfect equality (where everyone
has the same income) and 1 corresponds to perfect
inequality (where one person has all the income and
everyone else has zero income).
  GINI(t) = 1 − Σⱼ p²(j | t)
Classification - Decision Tree 95
Diversity and Gini Index
High diversity, low purity:
  G = 1 − (3/8)² − (3/8)² − (1/8)² − (1/8)² = .69   (E = 1.811)
Low diversity, high purity:
  G = 1 − (6/7)² − (1/7)² = .24   (E = 0.592)
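Both pairs of figures can be verified with a short snippet (our own helpers, taking class counts):

```python
import math

def gini(counts):
    """Gini impurity 1 - sum(p_j^2) from a list of class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

print(round(gini([3, 3, 1, 1]), 2), round(entropy([3, 3, 1, 1]), 3))  # 0.69 1.811
print(round(gini([6, 1]), 2),       round(entropy([6, 1]), 3))        # 0.24 0.592
```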
Classification - Decision Tree 96
Choosing the “Best” Feature (Gini)
Own House? (Yes / No)

Find the overall Gini first:
  Total samples: 10
  Gt = 1 - (3/10)² - (3/10)² - (4/10)² = 0.66
  Gain = Gt - Gi

Attribute: Own House
  G(y) = 1 - (1/5)² - (2/5)² - (2/5)² = 0.64
  G(n) = 0.64
  Total G = 0.5·G(y) + 0.5·G(n) = 0.64
Attribute: Married: Total G = 0.5·G(y) + 0.5·G(n) = 0.40
Attribute: Gender: G = 0.511
Attribute: Employed: G = 0.475
Attribute: Credit Rating: G = 0.64

Gains:
  Own House: 0.02
  Married: 0.26
  Gender: 0.302
  Employed: 0.66 - 0.475 = 0.185
Classification - Decision Tree 97
Choosing the “Best” Feature
Gender (M / F)
Choose Gender.
Apply the same procedure again on the other variables, leaving
out the column for Gender, and the rows for class B as it has been fully
determined.
Check if you can get the same DT or not.
Classification - Decision Tree 98
Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the set.
• Use the count matrix to make decisions.

Multi-way split on Outlook:
            Overcast   Rain   Sunny
  C1            4        3      2
  C2            0        2      3
  Gini          0       .48    .48      weighted Gini = 0.34

Two-way splits (find the best partition of values):
  {Overcast} vs {Rain, Sunny}: counts C1 (4, 5), C2 (0, 5); Gini 0 and .5; weighted Gini = 0.36
  {Overcast, Rain} vs {Sunny}: counts C1 (7, 2), C2 (2, 3); Gini .345 and .48; weighted Gini = 0.391
Classification - Decision Tree 99
Continuous Attributes: Computing
Gini Index…
Sorted values of Taxable Income with the class Cheat, and the candidate split positions:

  Cheat:            No   No   No   Yes  Yes  Yes  No   No   No   No
  Taxable Income:   60   70   75   85   90   95   100  120  125  220
  Split positions:  55   65   72   80   87   92   97   110  122  172  230

For each split position, count Yes/No on each side (≤ and >) and compute the weighted Gini:

  Gini:             0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split is at 97, with the lowest weighted Gini (0.300).
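A sketch of this split-position search in Python, using midpoints between consecutive sorted values as candidate thresholds (the slide lists slightly different positions, but the best split is the same region around 97, with weighted Gini 0.300):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

pairs = sorted(zip(income, cheat))
best = None
for i in range(len(pairs) - 1):
    threshold = (pairs[i][0] + pairs[i + 1][0]) / 2          # midpoint candidate
    left  = [c for v, c in pairs if v <= threshold]
    right = [c for v, c in pairs if v > threshold]
    weighted = (len(left) * gini([left.count("Yes"), left.count("No")]) +
                len(right) * gini([right.count("Yes"), right.count("No")])) / len(pairs)
    if best is None or weighted < best[1]:
        best = (threshold, weighted)

print(best)   # -> (97.5, 0.3): the lowest weighted Gini, matching the slide's best split
```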
Classification - Decision Tree 100
Gini index (CART)
E.g., two classes, Pos and Neg, and dataset S
with p Pos-elements and n Neg-elements.
  f_p = p / (p+n),   f_n = n / (p+n)
  gini(S) = 1 − f_p² − f_n²
If dataset S is split into S1, S2 then
  gini_split(S1, S2) = gini(S1)·(p1+n1)/(p+n) + gini(S2)·(p2+n2)/(p+n)
Classification - Decision Tree 101
Gini index - play tennis example
(The 14 play-tennis instances from the earlier table, with classes P and N.)

Two top best splits at the root node:
  Split on outlook:  S1: {overcast} (4 Pos, 0 Neg), S2: {sunny, rain}; the overcast branch is 100% P
  Split on humidity: S1: {normal} (6 Pos, 1 Neg), S2: {high}; the normal branch is 86% P
Classification - Decision Tree 102
Calculations
Outlook
  Sunny or Rainy: Yes = 5, No = 5, Gini = .5
  Overcast: Yes = 4, No = 0, Gini = 0
  Weighted Gini of the split = 0.36
Temperature
  Hot or Cool: Yes = 5, No = 3, Gini = 0.47
  Mild: Yes = 4, No = 2, Gini = 0.44
  Weighted Gini of the split = 0.46
Humidity
  High: Yes = 3, No = 4, Gini = 0.49
  Normal: Yes = 6, No = 1, Gini = 0.25
  Weighted Gini of the split = 0.37
Windy
  FALSE: Yes = 6, No = 2, Gini = 0.38
  TRUE: Yes = 3, No = 3, Gini = 0.5
  Weighted Gini of the split = 0.43
Outlook has the lowest weighted Gini, so it is chosen.
Classification - Decision Tree 103
Calculations at Node 0
Outlook
  GINI(outlook = sunny or rainy) = 1 − (5/10)² − (5/10)² = 1/2
  GINI(outlook = overcast) = 1 − (4/4)² − (0/4)² = 0
  GINI(split based on outlook) = (10/14)·(1/2) + (4/14)·(0) = 0.3571
Classification - Decision Tree 104
Temperature
  GINI(temperature = hot or cool) = 1 − (5/8)² − (3/8)² = 0.46875
  GINI(temperature = mild) = 1 − (4/6)² − (2/6)² = 0.44
  GINI(split based on temperature) = (8/14)·0.46875 + (6/14)·0.44 = 0.456
Classification - Decision Tree 105
Humidity
  GINI(humidity = high) = 1 − (3/7)² − (4/7)² = 24/49
  GINI(humidity = normal) = 1 − (6/7)² − (1/7)² = 12/49
  GINI(split based on humidity) = (7/14)·(24/49) + (7/14)·(12/49) = 0.37
Classification - Decision Tree 106
Windy
  GINI(windy = FALSE) = 1 − (6/8)² − (2/8)² = 0.375
  GINI(windy = TRUE) = 1 − (3/6)² − (3/6)² = 0.5
  GINI(split based on windy) = (8/14)·0.375 + (6/14)·0.5 = 0.43
Classification - Decision Tree 107
The resulting Gini-based tree:
  V = outlook (N = 14)
    Overcast → 4 yes, 0 no
    Rain/Sunny → Humidity (N = 10)
      Normal (4 yes, 1 no) → windy (N = 5)
        False → 3 yes, 0 no
        True → V = outlook (N = 2)
          Rain → 1 no, 0 yes
          Sunny → 1 yes, 0 no
Classification - Decision Tree 108
Classification - Decision Tree 109
Dealing With Continuous Variables
Partition continuous attribute into a discrete set of
intervals
sort the examples according to the continuous attribute A
identify adjacent examples that differ in their target classification
generate a set of candidate thresholds midway
problem: may generate too many intervals
Another Solution:
take a minimum threshold M of the examples of the majority class in each
adjacent partition; then merge adjacent partitions with the same majority class
Example: M = 3 (candidate thresholds 70.5 and 77.5)
  Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
  Play?:       yes no  yes yes yes no  no  yes yes yes no  yes yes no
  Same majority on both sides of 70.5, so those partitions are merged.
  Final mapping: temperature ≤ 77.5 ==> “yes”; temperature > 77.5 ==> “---”
Classification - Decision Tree 110
Improving on Information Gain
Info. Gain tends to favor attributes with a large number of values
larger distribution ==> lower entropy ==> larger Gain
Quinlan suggests using Gain Ratio
penalize for large number of values
  SplitInfo(A, S) = −Σᵢ (|Sᵢ| / |S|) · log₂(|Sᵢ| / |S|)
  GainRatio(A, S) = Gain(A, S) / SplitInfo(A, S)

Example: “outlook”, S: [9+, 5−], split into S1: [4+, 0−] (overcast), S2: [2+, 3−] (sunny), S3: [3+, 2−] (rainy)
  SplitInfo(outlook, S) = -(4/14).log(4/14) - (5/14).log(5/14) - (5/14).log(5/14) = 1.577
  GainRatio(outlook, S) = 0.246 / 1.577 = 0.156
Classification - Decision Tree 111
Gain Ratios of Decision Variables
Outlook:     Info = 0.693   Gain = 0.940 - .693 = 0.247   Split info = info([5, 4, 5]) = 1.577   Gain ratio = 0.247/1.577 = 0.156
Temperature: Info = 0.911   Gain = 0.940 - .911 = 0.029   Split info = info([4, 6, 4]) = 1.362   Gain ratio = 0.029/1.362 = 0.021
Humidity:    Info = 0.788   Gain = 0.940 - .788 = 0.152   Split info = info([7, 7]) = 1          Gain ratio = 0.152/1 = 0.152
Windy:       Info = 0.892   Gain = 0.940 - .892 = 0.048   Split info = info([8, 6]) = .985       Gain ratio = 0.048/.985 = 0.049
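The outlook row can be reproduced as follows (our own snippet; the helper info takes a list of class counts or subset sizes):

```python
import math

def info(counts):
    """Entropy in bits of a list of counts (class counts or subset sizes)."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

gain_outlook = info([9, 5]) - (5/14)*info([2, 3]) - (4/14)*info([4, 0]) - (5/14)*info([3, 2])
split_info   = info([5, 4, 5])          # sizes of the sunny / overcast / rainy subsets
print(round(gain_outlook, 3), round(split_info, 3), round(gain_outlook / split_info, 3))
# -> 0.247 1.577 0.156
```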
Classification - Decision Tree 112
Over-fitting in Classification
A tree generated may over-fit the training examples due to noise or too small
a set of training data
Two approaches to avoid over-fitting:
(Stop earlier): Stop growing the tree earlier
(Post-prune): Allow over-fit and then post-prune the tree
Approaches to determine the correct final tree size:
Separate training and testing sets or use cross-validation
Use all the data for training, but apply a statistical test (e.g., chi-square) to
estimate whether expanding or pruning a node may improve over entire
distribution
Use Minimum Description Length (MDL) principle: halting growth of the
tree when the encoding is minimized.
Rule post-pruning (C4.5): converting to rules before pruning
Classification - Decision Tree 113
The loan data (reproduced)
Approved or not
Classification - Decision Tree 114
A decision tree from the loan data
Decision nodes and leaf nodes (classes)
Classification - Decision Tree 115
Use the decision tree
No
Classification - Decision Tree 116
Is the decision tree unique?
No. Here is a simpler tree.
We want a smaller and accurate tree:
it is easier to understand and usually performs better.
Finding the best tree is
NP-hard.
All current tree building
algorithms are heuristic
algorithms
Classification - Decision Tree 117
From a decision tree to a set of rules
A decision tree can
be converted to a
set of rules
Each path from the
root to a leaf is a
rule.
Classification - Decision Tree 118
Algorithm for decision tree learning
Basic algorithm (a greedy divide-and-conquer algorithm)
Assume attributes are categorical now (continuous attributes
can be handled too)
Tree is constructed in a top-down recursive manner
At start, all the training examples are at the root
Examples are partitioned recursively based on selected
attributes
Attributes are selected on the basis of an impurity function
(e.g., information gain)
Conditions for stopping partitioning
All examples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority class is the leaf
There are no examples left
Classification - Decision Tree 119
Decision tree learning algorithm
Classification - Decision Tree 120
Choose an attribute to partition data
The key to building a decision tree - which
attribute to choose in order to branch.
The objective is to reduce impurity or
uncertainty in data as much as possible.
A subset of data is pure if all instances belong to
the same class.
The heuristic in C4.5 is to choose the attribute
with the maximum Information Gain or Gain
Ratio based on information theory.
Classification - Decision Tree 121
The loan data (reproduced)
Approved or not
Classification - Decision Tree 122
Two possible roots, which is better?
Fig. (B) seems to be better.
Classification - Decision Tree 123
An example
  entropy(D) = −(6/15)·log₂(6/15) − (9/15)·log₂(9/15) = 0.971

  entropy_Own_house(D) = (6/15)·entropy(D1) + (9/15)·entropy(D2)
                       = (6/15)·0 + (9/15)·0.918
                       = 0.551

  entropy_Age(D) = (5/15)·entropy(D1) + (5/15)·entropy(D2) + (5/15)·entropy(D3)
                 = (5/15)·0.971 + (5/15)·0.971 + (5/15)·0.722
                 = 0.888

  Age      Yes  No  entropy(Di)
  young     2    3    0.971
  middle    3    2    0.971
  old       4    1    0.722

Own_house is the best choice for the root.
Classification - Decision Tree 124
We build the final tree
We can use information gain ratio to evaluate the
impurity as well
Classification - Decision Tree 125
Handling continuous attributes
Handle continuous attribute by splitting into
two intervals (can be more) at each node.
How to find the best threshold to divide?
Use information gain or gain ratio again
Sort all the values of a continuous attribute in
increasing order: {v₁, v₂, …, vᵣ}.
One possible threshold lies between each two adjacent
values vᵢ and vᵢ₊₁. Try all possible thresholds and
find the one that maximizes the gain (or gain
ratio).
Classification - Decision Tree 126
An example in a continuous space
Classification - Decision Tree 127
Avoid overfitting in classification
Overfitting: A tree may overfit the training data
Good accuracy on training data but poor on test data
Symptoms: tree too deep and too many branches,
some may reflect anomalies due to noise or outliers
Two approaches to avoid overfitting
Pre-pruning: Halt tree construction early
Difficult to decide because we do not know what may
happen subsequently if we keep growing the tree.
Post-pruning: Remove branches or sub-trees from a
“fully grown” tree.
This method is commonly used. C4.5 uses a statistical
method to estimate the errors at each node for pruning.
A validation set may be used for pruning as well.
Classification - Decision Tree 128
An example: a tree likely to overfit the data
Classification - Decision Tree 129
Other issues in decision tree learning
From tree to rules, and rule pruning
Handling of missing values
Handling skewed distributions
Handling attributes and classes with different
costs.
Attribute construction (adding a new one)
etc.
Classification - Decision Tree 130
DT Example (1)

Name       Gender  Height   Output1  Output2
Kristina   F       1.6 m    Short    Medium
Jim        M       2 m      Tall     Medium
Maggie     F       1.9 m    Medium   Tall
Martha     F       1.88 m   Medium   Tall
Stephanie  F       1.7 m    Short    Medium
Bob        M       1.85 m   Medium   Medium
Kathy      F       1.6 m    Short    Medium
Dave       M       1.7 m    Short    Medium
Worth      M       2.2 m    Tall     Tall
Steven     M       2.1 m    Tall     Tall
Debbie     F       1.8 m    Medium   Medium
Todd       M       1.95 m   Medium   Medium
Kim        F       1.9 m    Medium   Tall
Amy        F       1.8 m    Medium   Medium
Wynette    F       1.75 m   Medium   Medium

Considering the data in the table and the correct classification in Output1, we have:
  Short (4/15), Medium (8/15), Tall (3/15)
  Entropy = 4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384   (base-10 logarithms)

Choosing the gender as the splitting attribute we have:
  Entropy(F) = 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
  Entropy(M) = 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
Classification - Decision Tree 131
DT Example (2)
The algorithm must determine what the gain
in information is by using this split.
To do this, we calculate the weighted sum of
these last two entropies to get:
((9/15) × 0.2764) + ((6/15) × 0.4392) = 0.34152
The gain in entropy by using the gender
attribute is thus:
0.4384 – 0.34152 = 0.09688
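A quick Python check of these numbers; note that this example uses base-10 logarithms, which is how 0.4384, 0.2764 and 0.4392 arise (the small differences below come from the slide rounding intermediate values):

```python
import math

def entropy10(counts):
    """Entropy with base-10 logs from a list of class counts."""
    n = sum(counts)
    return sum(c / n * math.log10(n / c) for c in counts if c)

e_all = entropy10([4, 8, 3])            # Short / Medium / Tall over all 15 people
e_f   = entropy10([3, 6, 0])            # females: 3 Short, 6 Medium
e_m   = entropy10([1, 2, 3])            # males: 1 Short, 2 Medium, 3 Tall
weighted = (9/15) * e_f + (6/15) * e_m
print(round(e_all, 4), round(weighted, 5), round(e_all - weighted, 5))
# -> 0.4385 0.34156 0.09691  (the slide's 0.4384 / 0.34152 / 0.09688 use rounded intermediates)
```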
Classification - Decision Tree 132
DT Example (3)
(Data as in the table in DT Example (1).)

Looking at the height attribute, we divide it into ranges:
  (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)

Now we can compute the entropy of each range:
  2 in (0, 1.6]:    (2/2(0) + 0 + 0) = 0
  2 in (1.6, 1.7]:  (2/2(0) + 0 + 0) = 0
  3 in (1.7, 1.8]:  (0 + 3/3(0) + 0) = 0
  4 in (1.8, 1.9]:  (0 + 4/4(0) + 0) = 0
  2 in (1.9, 2.0]:  (0 + 1/2(0.301) + 1/2(0.301)) = 0.301
  2 in (2.0, ∞):    (0 + 0 + 2/2(0)) = 0
Classification - Decision Tree 133
DT Example (4)
All the states are completely ordered (entropy 0)
except for the (1.9, 2.0] state.
The gain in entropy by using the height attribute is:
0.4384 − (2/15)(0.301) = 0.3983
Thus, this has the greater gain, and we choose this
over gender as the first splitting attribute
Classification - Decision Tree 134
DT Example (5)

Initial split on Height:
  ≤ 1.6 m → Short
  (1.6, 1.7] → Short
  (1.7, 1.8] → Medium
  (1.8, 1.9] → Medium
  (1.9, 2.0] → ?  (this range is too large: a further subdivision on height is needed)
  > 2.0 m → Tall

Subdividing (1.9, 2.0]:
  Height ≤ 1.95 m → Medium
  Height > 1.95 m → Tall

We can optimize the tree:
  Height ≤ 1.7 m → Short
  1.7 m < Height ≤ 1.95 m → Medium
  Height > 1.95 m → Tall
Classification - Decision Tree 135
Quinlan’s ID3 and C4.5 decision tree
algorithms
Given dataset T
Attribute1 Attribute2 Attribute3 Class
A 70 True CLASS1
A 90 True CLASS2
A 85 False CLASS2
A 95 False CLASS2
A 70 False CLASS1
B 90 True CLASS1
B 78 False CLASS1
B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
C 70 True CLASS2
C 80 False CLASS1
C 80 False CLASS1
C 96 False CLASS1
Classification - Decision Tree 136
Quinlan’s ID3 and C4.5 decision tree
algorithms
Consider test on attribute 1
freq(class,value) CLASS1 CLASS2
A 2 3 5
B 4 0 4
C 3 2 5
9 5 14
Info(T) 0.4098 0.5305 0.9403
Info(S) CLASS1 CLASS2 Info(S) Weight
A 0.5288 0.4422 0.9710 0.3571
B 0.0000 0.0000 0.0000 0.2857
C 0.4422 0.5288 0.9710 0.3571
0.6935
Gain .9403 - .6935 = 0.2467
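The Info(T) figure and the gains for Attribute 1 (and, on the next slide, Attribute 3) can be recomputed from the table with the following sketch (our own code and column layout):

```python
import math
from collections import Counter

rows = [  # (attribute1, attribute3, class) from the 14-row dataset above
    ("A", True, 1), ("A", True, 2), ("A", False, 2), ("A", False, 2), ("A", False, 1),
    ("B", True, 1), ("B", False, 1), ("B", True, 1), ("B", False, 1),
    ("C", True, 2), ("C", True, 2), ("C", False, 1), ("C", False, 1), ("C", False, 1),
]

def info(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[2])
    split = sum(len(g) / len(rows) * info(g) for g in groups.values())
    return info([r[2] for r in rows]) - split

print(round(info([r[2] for r in rows]), 4))   # Info(T) -> 0.9403
print(round(gain(0), 4), round(gain(1), 4))   # Attribute 1 -> 0.2467, Attribute 3 -> 0.0481
```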
Classification - Decision Tree 137
Quinlan’s ID3 and C4.5 decision tree
algorithms
Consider test on attribute 3
freq(class,value) CLASS1 CLASS2
True 3 3 6
False 6 2 8
9 5 14
Info(T) 0.4098 0.5305 0.9403
Info(S) CLASS1 CLASS2 Info(S) Weight
True 0.5000 0.5000 1.0000 0.4286
False 0.3113 0.5000 0.8113 0.5714
0.8922
Gain .9403 - .8922 = 0.0481
Classification - Decision Tree 138
Quinlan’s ID3 and C4.5 decision tree algorithms
Summary
Gain(Attribute1) = 0.2467
Gain(Attribute3) = 0.0481
In ID3 Attribute 2 not considered
Since it is numeric
Thus split on Attribute 1 – highest gain
Classification - Decision Tree 139
C4.5 decision tree algorithms
What about numeric attribute 2 (as done by C4.5)?
Consider as categorical
Then gain = 0.4039 – should split on it
But – 9 branches, of which 6 with only one
instance
Tree too wide – not compact
Since it is really numerical – what to do with a
different value?
Use threshold Z, and split into two subsets
Y <= Z and Y > Z
More complex tests, assuming discrete values and
variable number of subsets
Classification - Decision Tree 140
C4.5 decision tree algorithms
C4.5 and continuous attribute
Sort values into v1,…,vm
Try Zi = vi or Zi = (vi + vi+1) / 2 for i=1,…,m-1
C4.5 uses Z = vi – more explainable decision rule
Select splitting value Z*
So that gain(Z*) = max {gain(Zi), i=1,…,m-1)}
For last example – Attribute 2 (see next slide)
Z* = 80
Gain = 0.1022
So even with this approach – would have split on Attribute
1
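A sketch of this threshold search for Attribute 2 in Python, using Z_i = v_i as the slide describes for C4.5 (values and classes copied from the 14-row dataset; expected best split Z* = 80 with gain about 0.102):

```python
import math
from collections import Counter

values  = [70, 90, 85, 95, 70, 90, 78, 65, 75, 80, 70, 80, 80, 96]
classes = [1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1]

def info(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

base = info(classes)
best = None
for z in sorted(set(values))[:-1]:                      # candidate thresholds Z_i = v_i
    left  = [c for v, c in zip(values, classes) if v <= z]
    right = [c for v, c in zip(values, classes) if v > z]
    g = base - (len(left) * info(left) + len(right) * info(right)) / len(values)
    if best is None or g > best[1]:
        best = (z, g)

print(best)   # -> (80, 0.1022...): split Attribute 2 at Z* = 80
```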
Classification - Decision Tree 141
C4.5 decision tree algorithms
Attribute 2 (for each Zi, the first row is the subset with Att2 ≤ Zi, the second the subset with Att2 > Zi)
Zi   count   freq CLASS1   freq CLASS2   Info terms (CLASS1, CLASS2)   Info(S)   Weight   Info(Tx)   Gain
65 1 1 0 0.0000 0.0000 0.0000 0.0714
13 8 5 0.4310 0.5302 0.9612 0.9286 0.8926 0.0477
70 4 3 1 0.3113 0.5000 0.8113 0.2857
10 6 4 0.4422 0.5288 0.9710 0.7143 0.9253 0.0150
75 5 4 1 0.2575 0.4644 0.7219 0.3571
9 5 4 0.4711 0.5200 0.9911 0.6429 0.8950 0.0453
78 6 5 1 0.2192 0.4308 0.6500 0.4286
8 4 4 0.5000 0.5000 1.0000 0.5714 0.8500 0.0903
80 9 7 2 0.2820 0.4822 0.7642 0.6429
5 2 3 0.5288 0.4422 0.9710 0.3571 0.8380 0.1022
85 10 7 3 0.3602 0.5211 0.8813 0.7143
4 2 2 0.5000 0.5000 1.0000 0.2857 0.9152 0.0251
90 12 8 4 0.3900 0.5283 0.9183 0.8571
2 1 1 0.5000 0.5000 1.0000 0.1429 0.9300 0.0103
95 13 8 5 0.4310 0.5302 0.9612 0.9286
1 1 0 0.0000 0.0000 0.0000 0.0714 0.8926 0.0477
Classification - Decision Tree 142
C4.5 decision tree algorithms
All same class – so T2 is a leaf
Classification - Decision Tree 143
C4.5 decision tree algorithms
Consider test on attribute 3 FOR SUBSET T1
freq(class,value) CLASS1 CLASS2
True 1 1 2
False 1 2 3
2 3 5
Info(T) 0.5288 0.4422 0.9710 Book has 0.940
Info(S) CLASS1 CLASS2 Info(S) Weight
True 0.5000 0.5000 1.0000 0.4000
False 0.5283 0.3900 0.9183 0.6000
0.9510
Gain .971 - .951 = 0.0200
Attribute 2 (within subset T1; for each Zi, first row: Att2 ≤ Zi, second row: Att2 > Zi)
Zi   count   freq CLASS1   freq CLASS2   Info terms (CLASS1, CLASS2)   Info(S)   Weight   Info(Tx)   Gain
70 2 2 0 0.0000 0.0000 0.0000 0.4000
3 0 3 0.0000 0.0000 0.0000 0.6000 0.0000 0.9710
85 3 2 1 0.3900 0.5283 0.9183 0.6000
2 0 2 0.0000 0.0000 0.0000 0.4000 0.5510 0.4200
90 4 2 2 0.5000 0.5000 1.0000 0.8000
1 0 1 0.0000 0.0000 0.0000 0.2000 0.8000 0.1710
Max gain on Attribute 2 - split on Z* = 70
Classification - Decision Tree 144
C4.5 decision tree algorithms
Consider test on attribute 3 FOR SUBSET T3
freq(class,value) CLASS1 CLASS2
True 0 2 2
False 3 0 3
3 2 5
Info(T) 0.4422 0.5288 0.9710 Book has 0.940
Info(S) CLASS1 CLASS2 Info(S) Weight
True 0.0000 0.0000 0.0000 0.4000
False 0.0000 0.0000 0.0000 0.6000
0.0000
Gain .971 - 0.000 = 0.9710
Attribute 2 (within subset T3; for each Zi, first row: Att2 ≤ Zi, second row: Att2 > Zi)
Zi   count   freq CLASS1   freq CLASS2   Info terms (CLASS1, CLASS2)   Info(S)   Weight   Info(Tx)   Gain
70 1 0 1 0.0000 0.0000 0.0000 0.2000
4 3 1 0.3113 0.5000 0.8113 0.8000 0.6490 0.3219
80 4 2 2 0.5000 0.5000 1.0000 0.8000
1 1 0 0.0000 0.0000 0.0000 0.2000 0.8000 0.1710
Max gain on Attribute 3
Classification - Decision Tree 145
C4.5 decision tree algorithms
Classification - Decision Tree 146
C4.5 decision tree algorithms
Classification - Decision Tree 147
C4.5 decision tree algorithms
We used entropy of T after splitting into T1,…,Tn
  Info(T_j) = −Σ_{i=1}^{k} (freq(C_i, T_j) / |T_j|) · log₂(freq(C_i, T_j) / |T_j|)
  Info_x(T) = Σ_{j=1}^{n} (|T_j| / |T|) · Info(T_j)
  Gain(X) = Info(T) − Info_x(T)
This is biased in favor of tests X with many outcomes:
a split on ID will generate one subset for each unique value, i.e.,
one for every instance, with each subset containing 1 instance.
It has maximal gain, as Info_x(T) = 0.
But result is a one level tree with one branch for each instance
Thus divide by number of branches – to measure average gain
Classification - Decision Tree 148
C4.5 decision tree algorithms
So define
  Split-info(X) = −Σ_{j=1}^{n} (|T_j| / |T|) · log₂(|T_j| / |T|)
the potential information generated by splitting T into T_1, …, T_n
(similar to the definition of Info(T_j)).
Use the entropy of T after splitting into T_1, …, T_n as before:
  Info(T_j) = −Σ_{i=1}^{k} (freq(C_i, T_j) / |T_j|) · log₂(freq(C_i, T_j) / |T_j|)
  Info_x(T) = Σ_{j=1}^{n} (|T_j| / |T|) · Info(T_j)
  Gain(X) = Info(T) − Info_x(T)
Selection criterion:
  Gain-ratio(X) = Gain(X) / Split-info(X)
the proportion of information generated by a “useful” compact split.
Select X* so that Gain-ratio(X*) = max over attributes X of {Gain-ratio(X)}.
Classification - Decision Tree 149
C4.5 decision tree algorithms
Splitting the root
Attribute1 Attribute2 Attribute3
Gain(X) 0.2467 0.1022 0.0481
|T1| 5 9 6
|T2| 4 5 8
|T3| 5
|T| 14 14 14
|T1|/|T|*log(|T1|/|T|) 0.5305 0.4098 0.5239
|T2|/|T|*log(|T2|/|T|) 0.5164 0.5305 0.4613
|T3|/|T|*log(|T3|/|T|) 0.5305
Split-info(X) 1.5774 0.9403 0.9852
Gain-ratio(X) 0.1564 0.1087 0.0488
Still split on Attribute 1
Classification - Decision Tree 150
C4.5 decision tree algorithms
Missing data
Unknown
Not recorded
Data entry error
What to do with missing data?
Eliminate instances with missing data
Only useful when there are few
Replace missing data with some values
Fixed values, mean, mode, from distribution
Modify algorithm to work with missing data
Classification - Decision Tree 151
C4.5 decision tree algorithms
Issues with modified algorithm
How compare subsets with different number of
unknown values
With what class to associate instances with
unknown values
C4.5 replaces unknown values
Based on the distribution (=relative frequency) of
known values
Classification - Decision Tree 152
C4.5 decision tree algorithms
For Split-info(X):
  Add one subset for the missing values; that is, if there are n known values, use T_{n+1} for the missing values.
For Info(T) and Info_x(T) for a certain attribute:
  Use only known values.
  Compute F = (number of instances with a known value) / (total number of instances in the data set)
  Use Gain(X) = F · [Info(T) − Info_x(T)]
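A small sketch of this adjusted gain for Attribute 1 of the dataset on the next slide, where one Attribute 1 value is missing (our own code; the expected result, 0.1990, appears two slides later):

```python
import math
from collections import Counter

known = [  # (attribute1, class) for the 13 rows where Attribute 1 is known
    ("A", 1), ("A", 2), ("A", 2), ("A", 2), ("A", 1),
    ("B", 1), ("B", 1), ("B", 1),
    ("C", 2), ("C", 2), ("C", 1), ("C", 1), ("C", 1),
]
total_rows = 14

def info(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

labels = [c for _, c in known]
groups = {}
for value, cls in known:
    groups.setdefault(value, []).append(cls)

info_x = sum(len(g) / len(known) * info(g) for g in groups.values())
F = len(known) / total_rows                      # F = 13/14
print(round(F * (info(labels) - info_x), 3))     # -> 0.199, the slide reports 0.1990
```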
Classification - Decision Tree 153
C4.5 decision tree algorithms
Given dataset T
Attribute1 Attribute2 Attribute3 Class
A 70 True CLASS1
A 90 True CLASS2
A 85 False CLASS2
A 95 False CLASS2
A 70 False CLASS1
????? 90 True CLASS1
B 78 False CLASS1
B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
C 70 True CLASS2
C 80 False CLASS1
C 80 False CLASS1
C 96 False CLASS1
Classification - Decision Tree 154
C4.5 decision tree algorithms
Given dataset T (Attribute 1 missing in one row), consider the test on Attribute 1.

freq(class, value)   CLASS1   CLASS2
A                       2        3       5
B                       3        0       3
C                       3        2       5
                        8        5      13

Factor F = 13 / 14 = 0.9286
Info(T) = 0.4310 + 0.5302 = 0.9612

Info(S)   CLASS1   CLASS2   Info(S)   Weight
A         0.5288   0.4422   0.9710    0.3846
B         0.0000   0.0000   0.0000    0.2308
C         0.4422   0.5288   0.9710    0.3846
                             Info_x(T) = 0.7469

Original gain equation: 0.9612 - 0.7469 = 0.2144
New gain equation: F × original gain = 0.1990

The weight is calculated as n / (N − m), where n = number of instances with that
attribute value, N = total number of tuples, and m = number of tuples with a missing value for the attribute.
Classification - Decision Tree 155
C4.5 decision tree algorithms
Splitting the root
Attribute1 Attribute2 Attribute3
Gain(X) 0.1990 0.0587 -0.0205
|T1| 5 9 6
|T2| 3 5 8
|T3| 5
????? 1
|T| 13 14 14
|T1|/|T|*log(|T1|/|T|) 0.5302 0.4098 0.5239
|T2|/|T|*log(|T2|/|T|) 0.4882 0.5305 0.4613
|T3|/|T|*log(|T3|/|T|) 0.5302
Unknown 0.2846
Split-info(X) 1.8332 0.9403 0.9852
Gain-ratio(X) 0.1086 0.0625 -0.0208
Still split on Attribute 1
Classification - Decision Tree 156
C4.5 decision tree algorithms
At this point, with unknown values
Test attributes selected for each node
Subsets defined for instances with known values
But what to do with the unknown?
C4.5 assigns it to ALL the subsets T1,…,Tn
With probability (or weight)
P(Ti) = w = |Ti known values| / |T known values|
Classification - Decision Tree 157
C4.5 decision tree algorithms
Classification - Decision Tree 158
C4.5 decision tree algorithms
Given dataset T
Attribute1 Attribute2 Attribute3 Class
A 70 True CLASS1
A 90 True CLASS2
A 85 False CLASS2
A 95 False CLASS2
A 70 False CLASS1
????? 90 True CLASS1
B 78 False CLASS1
B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
C 70 True CLASS2
C 80 False CLASS1
C 80 False CLASS1
C 96 False CLASS1
Classification - Decision Tree 159
C4.5 decision tree algorithms
Classification = CLASS2 (3.4 / 0.4) means:
  3.4 = |updated T_i| = 3 + 5/13 = 3.3846
  0.4 = number of instances in T_i without a (known) value
Thus 3 / 3.3846 = 88.64% belong to CLASS2.
The balance (~12%) is the error rate: it belongs to other classes, in this case CLASS1.
Classification - Decision Tree 160
C4.5 decision tree algorithms
Prediction
Same approach – with probabilities – is used
If values of attributes known – class is well
defined
Else all paths from the root explored
Probability of each class is determined for all classes
Which is a sum of probabilities along paths
Class with highest probability is selected
Classification - Decision Tree 161