Classification - Decision Trees
Decision Trees
Agenda
What is a Decision Tree?
Hunt’s Algorithm
C4.5 Algorithm
Impurity Measures
Information Gain
Introduction
Decision tree learning is one of the most widely used techniques for classification, and it is very efficient.
Creating a Decision Tree
[Scatter plot: training data in the (x1, x2) plane, with points of two classes marked "x" and "o".]
Creating a Decision Tree
[Scatter plot with a first split at x2 = 2.5, and the corresponding tree: test "X2 < 2.5"; the Yes branch is Blue Circle (7:0), the No branch is Mixed (2:8).]
Creating a Decision Tree
[Same split: test "X2 < 2.5"; the Yes branch, Blue Circle (7:0), is pure, while the No branch (2:8) is still mixed.]
Creating a Decision Tree
[The mixed branch is split again: root "X2 < 2.5"; Yes → Blue Circle (7:0); No → test "X1 < 2", whose Yes branch is Blue Circle (2:0) and whose No branch is Red X (0:7).]
Training Data with Objects
Building The Tree:
We choose “age” as the root.
[Tree diagram: root node “age” with branches <=30, 31…40 (class = yes), and >40.]
Building The Tree:
We chose “student” on the <=30 branch.
[Tree diagram: root “age”; the <=30 branch now splits on “student” (yes / no), the 31…40 branch is a leaf with class = yes, and the >40 branch still holds a mixed subset.]
Records reaching the >40 branch:
income   student  credit     class
medium   no       fair       yes
low      yes      fair       yes
low      yes      excellent  no
medium   yes      fair       yes
medium   no       excellent  no
Building The Tree:
We chose “student” on the <=30 branch.
[Tree diagram: root “age”; <=30 → “student” (no → class = no, yes → class = yes); 31…40 → class = yes; >40 still holds the mixed subset listed above.]
Building The Tree:
We chose “credit” on the >40 branch.
[Tree diagram: root “age”; <=30 → “student” (no → class = no, yes → class = yes); 31…40 → class = yes; >40 → “credit”, which splits the remaining records into pure subsets.]
Finished Tree for class=“buys”
[Final tree: root “age”; <=30 → “student” (no → buys = no, yes → buys = yes); 31…40 → buys = yes; >40 → “credit” (excellent → buys = no, fair → buys = yes).]
A Decision Tree
[Tree diagram: root “age?” with branches <=30, 31..40, and >40, ending in “no” / “yes” leaves.]
Discriminant Rules Extracted from our Tree
The rules are:
IF age <= 30 AND student = no THEN buys = no
IF age <= 30 AND student = yes THEN buys = yes
IF age = 31…40 THEN buys = yes
IF age > 40 AND credit = excellent THEN buys = no
IF age > 40 AND credit = fair THEN buys = yes
The Loan Data
Approved or not
A decision tree from the loan data
Decision nodes and leaf nodes (classes)
Using the Decision Tree
Is the decision tree unique?
No. There are many possible trees.
Here is a simpler tree.
From a decision tree to a set of rules
A decision tree can be converted to a set of rules.
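As a minimal sketch of the conversion (the nested-dict tree encoding and the helper below are my own illustration; the attribute Own_house and the approved / not-approved labels merely echo the loan example, so the tree itself is hypothetical):

example_tree = {
    "attribute": "Own_house",
    "branches": {
        "Yes": {"label": "Approved"},                       # leaf
        "No": {
            "attribute": "Has_job",                         # hypothetical second attribute
            "branches": {
                "Yes": {"label": "Approved"},               # leaf
                "No": {"label": "Not approved"},            # leaf
            },
        },
    },
}

def tree_to_rules(node, conditions=()):
    """Flatten a tree into (conditions, class label) rules, one per root-to-leaf path."""
    if "label" in node:                                     # leaf: one finished rule
        return [(list(conditions), node["label"])]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, conditions + ((node["attribute"], value),))
    return rules

for conds, label in tree_to_rules(example_tree):
    print("IF " + " AND ".join(f"{a} = {v}" for a, v in conds) + f" THEN class = {label}")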
Example of a Decision Tree
[Figure: a small training table (Tid, Refund, Marital Status, Taxable Income, Cheat) alongside a decision tree whose splitting attributes are Refund, MarSt, and TaxInc.]
What is a decision tree?
Decision tree:
A flow-chart-like tree structure.
Each internal node denotes a test on an attribute.
Each branch represents an outcome of the test.
Leaf nodes represent class labels.
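One possible in-memory representation of such a tree (a sketch of my own, not taken from the slides): internal nodes carry the attribute test and one child per outcome, leaves carry a class label.

from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class Leaf:
    label: str                                 # class label stored at the leaf

@dataclass
class Node:
    attribute: str                             # attribute tested at this internal node
    # one child (Node or Leaf) per outcome of the test
    branches: Dict[str, Union["Node", "Leaf"]] = field(default_factory=dict)

# Illustrative one-level tree (attribute and labels chosen for illustration only):
tiny_tree = Node("Swollen Glands", {"Yes": Leaf("Strep throat"), "No": Leaf("Not strep throat")})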
Classifying Using Decision Tree
Classifying Using Decision Tree
To classify an object, the appropriate attribute value is used at each node, starting from the root, to determine which branch is taken.
The path defined by these tests leads to a leaf node, which gives the class the model believes the object belongs to.
Classifying Using Decision Tree
[Training data: attribute types are categorical (Home Owner, Marital Status), continuous (Annual Income), and class (Defaulted Borrower).]
ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes
[Decision tree: root “Home Owner” (Yes → NO; No → “MarSt”); “MarSt” (Married → NO; Single, Divorced → “Income”); “Income” (< 80K → NO; >= 80K → YES).]
Classifying Using Decision Tree
Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted = ?
[The test record is dropped down the tree from the root, following the branch that matches its attribute value at each node.]
Classifying Using Decision Tree
Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted = ?
[Home Owner = No → “MarSt” node; Marital Status = Married → leaf NO. Assign Defaulted to “No”.]
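The traversal above can be sketched in a few lines; the nested-dict encoding of the slides' default-borrower tree is my own representation, and the continuous income test is handled by binning the numeric value:

income_node = {
    "attribute": "Annual Income",
    "branches": {"< 80K": {"label": "NO"}, ">= 80K": {"label": "YES"}},
}
default_tree = {
    "attribute": "Home Owner",
    "branches": {
        "Yes": {"label": "NO"},
        "No": {
            "attribute": "Marital Status",
            "branches": {"Married": {"label": "NO"},
                         "Single": income_node,
                         "Divorced": income_node},
        },
    },
}

def classify(record, node):
    """Follow one branch per test until a leaf is reached."""
    while "label" not in node:                    # internal node: apply its test
        attr = node["attribute"]
        value = record[attr]
        if attr == "Annual Income":               # continuous test: bin the value
            value = "< 80K" if value < 80 else ">= 80K"
        node = node["branches"][value]
    return node["label"]

test_record = {"Home Owner": "No", "Marital Status": "Married", "Annual Income": 80}
print(classify(test_record, default_tree))        # -> NO  (Defaulted = "No")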
Methods for Expressing Test Conditions
Methods for Expressing Test Conditions
Depends on attribute types
Binary
Nominal
Ordinal
Continuous
Test Condition for Nominal Attributes
Multi-way split:
Use as many partitions as there are distinct values, e.g., Marital Status → Single, Divorced, Married.
Binary split:
Divides the values into two subsets.
Some decision tree algorithms, such as CART, produce only binary splits by considering all 2^(k-1) − 1 ways of creating a binary partition of k attribute values.
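As a concrete check of that count (a small sketch of my own, not from the slides), one can enumerate the binary partitions of a set of k attribute values:

from itertools import combinations

def binary_partitions(values):
    """Yield every way of splitting a set of nominal values into two
    non-empty subsets; for k values there are 2**(k-1) - 1 of them."""
    values = list(values)
    rest = values[1:]
    # Fix the first value in the left subset to avoid counting mirror images twice.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {values[0], *combo}
            right = set(values) - left
            if right:                          # both sides must be non-empty
                yield left, right

parts = list(binary_partitions(["Single", "Divorced", "Married"]))
print(len(parts))                              # 3 = 2**(3-1) - 1
for left, right in parts:
    print(left, "|", right)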
Test Condition for Ordinal Attributes
Multi-way split:
Use as many partitions as there are distinct values, e.g., Shirt Size → Small, Medium, Large, Extra Large.
Values can also be grouped into two subsets, as long as the grouping does not violate the order property of the attribute values.
For example, the grouping {Small, Large} vs. {Medium, Extra Large} violates the order property.
Test Condition for Continuous Attributes
Binary split:
The attribute test condition can be expressed as a comparison test, A < v or A >= v, for any value v between the minimum and maximum attribute values in the training data.
Considering all possible splits to find the best cut can be computationally expensive.
Multi-way split:
The attribute test condition can be expressed as a range query of the form v_i <= A < v_(i+1), for i = 1, 2, …, k.
Any collection of attribute value ranges can be used, as long as they are mutually exclusive and cover the entire range of attribute values between the minimum and maximum values observed in the training set.
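One common way to realize the binary split in practice (a sketch under my own assumptions, not an algorithm from the slides) is to take candidate cut points midway between consecutive sorted values and evaluate each resulting A < v vs. A >= v split:

def candidate_thresholds(values):
    """Candidate cut points for a continuous attribute: midpoints between
    consecutive distinct sorted values observed in the training data."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

annual_income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # from the loan table (in K)
print(candidate_thresholds(annual_income))
# Each threshold v defines one binary test: A < v  vs.  A >= v.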
Building a Decision Tree
Decision Tree Algorithms: Short History
Late 1970s - ID3 (Iterative Dichotomiser) by J. Ross Quinlan
This work expanded on earlier work on concept learning systems described by
E. B. Hunt, J. Marin, and P. T. Stone
ID3, C4.5, and CART were invented independently of one another yet
follow a similar approach for learning decision trees from training tuples.
Building a Decision Tree
ID3, C4.5, and CART adopt a greedy (i.e., a non-backtracking) approach.
Most algorithms for decision tree induction also follow such a top-down
approach.
All of the algorithms start with a training set of tuples and their associated
class labels (classification data table).
The training set is recursively partitioned into smaller subsets as the tree is
being built.
Building a Decision Tree
The aim is to build a decision tree consisting of a root node, a number
of internal nodes, and a number of leaf nodes.
Building the tree starts at the root node; the data is split into two or more child nodes, each of which is split further into lower-level nodes, and so on, until the process is complete.
An Example
The first five attributes are symptoms, and the last attribute is
diagnosis.
All attributes are categorical.
Wish to predict the diagnosis class.
Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
Yes          Yes    Yes             Yes         Yes       Strep throat
No           No     No              Yes         Yes       Allergy
Yes          Yes    No              Yes         No        Cold
Yes          No     Yes             No          No        Strep throat
No           Yes    No              Yes         No        Cold
No           No     No              Yes         No        Allergy
No           No     Yes             No          No        Strep throat
Yes          No     No              Yes         Yes       Allergy
No           Yes    No              Yes         Yes       Cold
Yes          Yes    No              Yes         Yes       Cold
An Example
Consider each of the attributes in turn to see which would be a “good” one to start with.
Sore Throat   Diagnosis
No            Allergy
No            Cold
No            Allergy
No            Strep throat
No            Cold
Yes           Strep throat
Yes           Cold
Yes           Strep throat
Yes           Allergy
Yes           Cold
An Example
Fever   Diagnosis
No      Allergy
No      Strep throat
No      Allergy
No      Strep throat
No      Allergy
Yes     Strep throat
Yes     Cold
Yes     Cold
Yes     Cold
Yes     Cold
An Example
Swollen Glands   Diagnosis
No               Allergy
No               Cold
No               Cold
No               Allergy
No               Allergy
No               Cold
No               Cold
Yes              Strep throat
Yes              Strep throat
Yes              Strep throat
An Example
Try Congestion:
Congestion   Diagnosis
No           Strep throat
No           Strep throat
Yes          Allergy
Yes          Cold
Yes          Cold
Yes          Allergy
Yes          Allergy
Yes          Cold
Yes          Cold
Yes          Strep throat
Not helpful.
An Example
Headache   Diagnosis
No         Cold
No         Cold
No         Allergy
No         Strep throat
No         Strep throat
Yes        Allergy
Yes        Allergy
Yes        Cold
Yes        Cold
Yes        Strep throat
Not helpful.
Brute Force Approach
This approach does not work if there are many attributes and a large training
set.
The tree continues to grow until finding better ways to split the objects is no
longer possible.
Basic Algorithm
1. Let the root node contain all training data D.
2. If all objects in D at the current node belong to the same class, then stop; otherwise, choose an attribute test, split the node's data into subsets, and repeat the procedure on each child node.
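A minimal recursive sketch of these steps (my own code, not the slides' pseudocode; the attribute-selection rule is left as a parameter and is filled in later by the impurity measures):

from collections import Counter

def build_tree(records, attributes, choose_attribute):
    """Recursive skeleton of the basic (Hunt-style) algorithm.
    `records` is a list of dicts with a 'class' key; `choose_attribute`
    is any function that picks the test attribute for a node."""
    classes = [r["class"] for r in records]
    # Stop if the node is pure or no attributes remain: create a leaf.
    if len(set(classes)) == 1 or not attributes:
        return {"label": Counter(classes).most_common(1)[0][0]}
    attr = choose_attribute(records, attributes)
    node = {"attribute": attr, "branches": {}}
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        remaining = [a for a in attributes if a != attr]
        node["branches"][value] = build_tree(subset, remaining, choose_attribute)
    return node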
Basic Algorithm: Example
Basic Algorithm: Example
[Figure: Hunt's algorithm applied to the loan/default training table shown earlier.
(a) A single leaf covering all records: Defaulted = No (7,3).
(b) Split on Home Owner: Yes → Defaulted = No (3,0); No → Defaulted = No (4,3).
(c) The Home Owner = No branch is split on Marital Status: Married → Defaulted = No; Single, Divorced → split further.
(d) The Single, Divorced branch is split on Annual Income: < 80K → Defaulted = No; >= 80K → Defaulted = Yes.]
C4.5 Algorithm
Algorithm for decision tree learning
Basic algorithm (a greedy divide-and-conquer algorithm)
Tree is constructed in a top-down recursive manner
At start, all the training data are at the root.
Data are partitioned recursively based on selected attributes.
Attributes are selected on the basis of an impurity function (e.g., information gain).
Decision tree learning algorithm C4.5
[Pseudocode of the C4.5 tree-learning algorithm.]
Choose an attribute to partition data
The key to building a decision tree is deciding which attribute to choose in order to branch.
The Loan Data
Approved or not
Two possible roots, which is better?
Impurity Measures
Finding the Best Split
Before splitting: 10 records of class 0 (C0) and 10 records of class 1 (C1).
[Figure: two candidate child nodes. A node with C0: 5, C1: 5 has a high degree of impurity; a node with C0: 9, C1: 1 has a low degree of impurity.]
Impurity
When deciding which question to ask at a node, we consider the
impurity in its child nodes after the question.
We want it to be as low as possible (low impurity or high purity).
Computing Impurity
There are many measures that can be used to determine the goodness of an attribute
test condition.
These measures give preference to attribute test conditions that partition the training instances into purer subsets in the child nodes, i.e., subsets in which most instances have the same class label.
Measure of Impurity: Entropy
Entropy for a given node that represents the dataset
E(D) = Entropy(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
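A direct translation of the formula into code (a sketch; the argument is the list of per-class record counts of the node):

import math

def entropy(class_counts, base=2):
    """E(D) = -sum_i p_i * log(p_i), with p_i = class count / total count."""
    n = sum(class_counts)
    return -sum((c / n) * math.log(c / n, base) for c in class_counts if c > 0)

print(entropy([5, 5]))    # 1.0 for a 50/50 two-class node
print(entropy([9, 1]))    # ~0.469: much purer, so much lower entropy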
Computing Entropy of a Dataset
Assume we have a dataset D with only two classes, positive and negative.
Entropy(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
The dataset D has 50% positive examples (P(positive) = 0.5) and 50% negative examples (P(negative) = 0.5).
E(D) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1
Computing Entropy of a Dataset
Entropy(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
Information Gain
Information Gain
Information gained by selecting attribute A_i to branch or to partition the data D is
gain(D, A_i) = entropy(D) - entropy_{A_i}(D)
gain(D, A_i) = E(D) - E_{A_i}(D)
Disadvantage:
Attributes with many values are preferred.
Computing Information Gain
1. Given a dataset D, compute the entropy of D before splitting:
E(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
2. If we make attribute A_i, with v values, the root of the current tree, this will partition D into v subsets D_1, D_2, …, D_v. The expected weighted entropy if A_i is used as the current root, after splitting, is
E_{A_i}(D) = \sum_{j=1}^{v} \frac{n_j}{n} E(D_j)
where n_j is the number of records in subset D_j at the child node and n is the number of records in D at the parent node.
[Figure: two candidate splits, A? and B?, each with Yes/No branches; their weighted entropies E_A(D) and E_B(D) are compared.]
E(D) = -\frac{6}{15} \log_2 \frac{6}{15} - \frac{9}{15} \log_2 \frac{9}{15} = 0.971
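Combining the two steps as a sketch (representing records as dicts with a 'class' key is my assumption, not the slides' notation):

import math
from collections import Counter

def entropy_of(records, base=2):
    """E(D) = -sum_i p_i * log(p_i), from the class labels of the records."""
    counts = Counter(r["class"] for r in records)
    n = len(records)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

def information_gain(records, attribute, base=2):
    """gain(D, A) = E(D) - sum_j (n_j / n) * E(D_j), splitting on attribute A."""
    n = len(records)
    weighted = 0.0
    for value in set(r[attribute] for r in records):
        subset = [r for r in records if r[attribute] == value]
        weighted += len(subset) / n * entropy_of(subset, base)
    return entropy_of(records, base) - weighted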
An example
E_{Age}(D) = \frac{5}{15} E(D_1) + \frac{5}{15} E(D_2) + \frac{5}{15} E(D_3)
           = \frac{5}{15} (0.971) + \frac{5}{15} (0.971) + \frac{5}{15} (0.722)
           = 0.888
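A quick numeric check of this arithmetic (the child entropies 0.971, 0.971, 0.722 and the parent entropy are the values given on the slides):

import math

E_D = -(6/15) * math.log2(6/15) - (9/15) * math.log2(9/15)    # parent entropy, 0.971
E_age = (5/15) * 0.971 + (5/15) * 0.971 + (5/15) * 0.722      # weighted child entropy
print(round(E_age, 3))           # 0.888
print(round(E_D - E_age, 3))     # gain(D, Age) ~ 0.083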
An example
Own_house is the best attribute for the root node.
Another Example
The first five attributes are symptoms and the last attribute is diagnosis.
All attributes are categorical.
We wish to predict the diagnosis class.
Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
Yes          Yes    Yes             Yes         Yes       Strep throat
No           No     No              Yes         Yes       Allergy
Yes          Yes    No              Yes         No        Cold
Yes          No     Yes             No          No        Strep throat
No           Yes    No              Yes         No        Cold
No           No     No              Yes         No        Allergy
No           No     Yes             No          No        Strep throat
Yes          No     No              Yes         Yes       Allergy
No           Yes    No              Yes         Yes       Cold
Yes          Yes    No              Yes         Yes       Cold
Another Example
D has n = 10 samples and c = 3 classes:
Strep throat: t = 3
Cold: d = 4
Allergy: a = 3
E(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
Another Example
Sore Throat has 2 distinct values {Yes, No}:
  D_Yes has t = 2, d = 2, a = 1 (total 5)
  D_No has t = 1, d = 2, a = 2 (total 5)
  E(D_Yes) = -2 (2/5) log(2/5) - (1/5) log(1/5) = 0.46
  E(D_No) = -2 (2/5) log(2/5) - (1/5) log(1/5) = 0.46
Fever has 2 distinct values {Yes, No}:
  D_Yes has t = 1, d = 4, a = 0 (total 5)
  D_No has t = 2, d = 0, a = 3 (total 5)
  E(D_Yes) = -(1/5) log(1/5) - (4/5) log(4/5) = 0.22
  E(D_No) = -(2/5) log(2/5) - (3/5) log(3/5) = 0.29
Congestion has 2 distinct values {Yes, No}:
  D_Yes has t = 1, d = 4, a = 3 (total 8)
  D_No has t = 2, d = 0, a = 0 (total 2)
  E(D_Yes) = -(1/8) log(1/8) - (4/8) log(4/8) - (3/8) log(3/8) = 0.42
  E(D_No) = 0
  E_Congestion(D) = 0.8 * 0.42 + 0.2 * 0 = 0.34
Another Example
Headache has 2 distinct values {Yes, No}:
  D_Yes has t = 2, d = 2, a = 1 (total 5)
  D_No has t = 1, d = 2, a = 2 (total 5)
  E(D_Yes) = -2 (2/5) log(2/5) - (1/5) log(1/5) = 0.46
  E(D_No) = -2 (2/5) log(2/5) - (1/5) log(1/5) = 0.46
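These per-attribute weighted entropies can be recomputed from the symptoms table with the sketch below (my own code). Note that the slide's numeric values (0.46, 0.22, 0.29, 0.42, 0.34) match base-10 logarithms rather than the base-2 logarithm of the entropy formula; either base produces the same ranking of attributes, so the base is left as a parameter.

import math
from collections import Counter

# (Sore Throat, Fever, Swollen Glands, Congestion, Headache, Diagnosis)
rows = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No",  "No",  "No",  "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No",  "Yes", "No",  "Cold"),
    ("Yes", "No",  "Yes", "No",  "No",  "Strep throat"),
    ("No",  "Yes", "No",  "Yes", "No",  "Cold"),
    ("No",  "No",  "No",  "Yes", "No",  "Allergy"),
    ("No",  "No",  "Yes", "No",  "No",  "Strep throat"),
    ("Yes", "No",  "No",  "Yes", "Yes", "Allergy"),
    ("No",  "Yes", "No",  "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No",  "Yes", "Yes", "Cold"),
]
attributes = ["Sore Throat", "Fever", "Swollen Glands", "Congestion", "Headache"]

def entropy(labels, base):
    n = len(labels)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(labels).values())

def weighted_entropy(attr_index, base=10):
    """E_A(D): size-weighted entropy of the subsets produced by attribute A."""
    n = len(rows)
    total = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        total += len(subset) / n * entropy(subset, base)
    return total

for i, name in enumerate(attributes):
    print(f"E_{name}(D) = {weighted_entropy(i):.2f}")
# Swollen Glands gives the lowest weighted entropy (its "Yes" subset is pure),
# so it has the highest information gain and would be chosen for the root.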
Gini Index
Gini index for a given node that represents the dataset D:
G(D) = Gini(D) = 1 - \sum_{i=1}^{c} p_i^2
Example split on attribute A (branches Yes / No):
  Parent: C1 = 7, C2 = 5, so G(D) = Gini = 0.486
  Node N1: C1 = 5, C2 = 1; Node N2: C1 = 2, C2 = 4
  Gini_A(D) = \sum_{i=1}^{v} \frac{n_i}{n} Gini(D_i)   (weighted Gini of attribute A)
  Gini(D_1) = 1 - (5/6)^2 - (1/6)^2 = 0.278
  Gini(D_2) = 1 - (2/6)^2 - (4/6)^2 = 0.444
  Gini_A(D) = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
  Gain_A = Gini(D) - Gini_A(D) = 0.486 - 0.361 = 0.125
Computing Information Gain Using Gini Index
Example split on attribute B (branches Yes / No):
  Parent: C1 = 7, C2 = 5, so G(D) = Gini = 0.486
  Node N1: C1 = 5, C2 = 2; Node N2: C1 = 1, C2 = 4
  Gini_B(D) = \sum_{i=1}^{v} \frac{n_i}{n} Gini(D_i)   (weighted Gini of attribute B)
  Gini(D_1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
  Gini(D_2) = 1 - (1/5)^2 - (4/5)^2 = 0.32
  Gini_B(D) = 7/12 * 0.408 + 5/12 * 0.32 = 0.371
  Gain_B = Gini(D) - Gini_B(D) = 0.486 - 0.371 = 0.115
Selecting the Splitting Attribute
Since Gain_A is larger than Gain_B, attribute A will be selected for the next split.
Equivalently, choose max(Gain_A, Gain_B) or min(Gini_A(D), Gini_B(D)).
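The comparison of the two candidate splits can be verified with a few lines (class counts taken from the slides):

def gini(counts):
    """Gini index from per-class counts: 1 - sum_i p_i**2."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children):
    """Size-weighted Gini of the child nodes produced by a split."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [7, 5]
split_A = [[5, 1], [2, 4]]    # nodes N1, N2 for test A?
split_B = [[5, 2], [1, 4]]    # nodes N1, N2 for test B?

for name, split in [("A", split_A), ("B", split_B)]:
    gain = gini(parent) - weighted_gini(split)
    print(f"Gini_{name}(D) = {weighted_gini(split):.3f}, Gain_{name} = {gain:.3f}")
# Gini_A(D) = 0.361, Gain_A = 0.125;  Gini_B(D) = 0.371, Gain_B = 0.115,
# so attribute A is chosen.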
Information Gain Ratio
Problem with large number of partitions
Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure.
Customer ID has the highest gain because the entropy of all its children is zero.
Gain Ratio
The tree-building algorithm blindly picks the attribute that maximizes information gain.
Gain Ratio
Gain Ratio:
GainRatio(D, A) = \frac{gain(D, A)}{SplitInfo_A(D)}
where SplitInfo_A(D) = -\sum_{i=1}^{v} \frac{n_i}{n} \log_2\left(\frac{n_i}{n}\right).
The split information is large for attributes that split the data into many small partitions (such as Customer ID), so such attributes are penalized.
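A sketch of the corresponding computation (my own code; the example partition sizes in the last lines are illustrative, except that 0.971 is the parent entropy from the earlier loan example, which an ID-like attribute with all-pure children would gain in full):

import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum_i (n_i / n) * log2(n_i / n)."""
    n = sum(partition_sizes)
    return -sum((ni / n) * math.log2(ni / n) for ni in partition_sizes if ni > 0)

def gain_ratio(gain, partition_sizes):
    """GainRatio(D, A) = gain(D, A) / SplitInfo_A(D)."""
    return gain / split_info(partition_sizes)

print(split_info([5, 5, 5]))          # ~1.585 for a balanced three-way split
print(split_info([1] * 15))           # ~3.907 for an ID-like 15-way split
# The ID-like attribute's gain of 0.971 is divided by its large split
# information, shrinking its gain ratio:
print(gain_ratio(0.971, [1] * 15))    # ~0.249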