
Data Mining

Classification: Basic Concepts

Decision Trees

1
Agenda
 What is a Decision Tree?

 Classifying Using Decision Tree

 Methods for Expressing Test Conditions

 Building a Decision Tree

 Hunt’s Algorithm

 C4.5 Algorithm

 Impurity Measures

 Information Gain

 Computing Impurity of Continuous Attributes

 Information Gain Ratio


2
What is a Decision Tree?

3
Introduction
 Decision tree learning is one of the most widely used techniques for
classification.

 Its accuracy is competitive with other methods,
 and it is very efficient.

 The classification model is a tree called a decision tree.

 C4.5 by Ross Quinlan is perhaps the best-known system.


 It can be downloaded from the Web.

4
Creating a Decision Tree

(Scatter plot: training examples of two classes, x and o, plotted in the (x1, x2) plane.)

5
Creating a Decision Tree

(Same scatter plot, now with the first test added.)

First split: X2 < 2.5
  Yes → Blue Circle (7:0)
  No  → Mixed (2:8)

6
Creating a Decision Tree

(Same tree as the previous slide; the Yes child, Blue Circle (7:0), is now marked as pure.)

X2 < 2.5
  Yes → Blue Circle (7:0)   pure
  No  → Mixed (2:8)

7
Creating a Decision Tree

(The mixed region is split again.)

X2 < 2.5
  Yes → Blue Circle (7:0)
  No  → X1 < 2
          Yes → Blue Circle (2:0)
          No  → Red X (0:7)

8
Training Data with Objects

rec Age Income Student Credit_rating Buys_computer (CLASS)


r1 <=30 High No Fair No

r2 <=30 High No Excellent No

r3 31…40 High No Fair Yes

r4 >40 Medium No Fair Yes

r5 >40 Low Yes Fair Yes

r6 >40 Low Yes Excellent No

r7 31…40 Low Yes Excellent Yes

r8 <=30 Medium No Fair No

r9 <=30 Low Yes Fair Yes

r10 >40 Medium Yes Fair Yes


r11 <=30 Medium Yes Excellent Yes

r12 31…40 Medium No Excellent Yes

r13 31…40 High Yes Fair Yes

r14 >40 Medium No Excellent No

9
Building The Tree:
we choose “age” as a root

age = <=30:
  income  student  credit     class
  high    no       fair       no
  high    no       excellent  no
  medium  no       fair       no
  low     yes      fair       yes
  medium  yes      excellent  yes

age = >40:
  income  student  credit     class
  medium  no       fair       yes
  low     yes      fair       yes
  low     yes      excellent  no
  medium  yes      fair       yes
  medium  no       excellent  no

age = 31…40:
  income  student  credit     class
  high    no       fair       yes
  low     yes      excellent  yes
  medium  no       excellent  yes
  high    yes      fair       yes

10
Building The Tree:
“age” as the root

age = <=30:
  income  student  credit     class
  high    no       fair       no
  high    no       excellent  no
  medium  no       fair       no
  low     yes      fair       yes
  medium  yes      excellent  yes

age = >40:
  income  student  credit     class
  medium  no       fair       yes
  low     yes      fair       yes
  low     yes      excellent  no
  medium  yes      fair       yes
  medium  no       excellent  no

age = 31…40 → class = yes

11
Building The Tree:
we chose “student” on <=30 branch

age = <=30 → split on student:
  student = no:   (high, fair, no), (high, excellent, no), (medium, fair, no)
  student = yes:  (low, fair, yes), (medium, excellent, yes)

age = >40 (still mixed):
  income  student  credit     class
  medium  no       fair       yes
  low     yes      fair       yes
  low     yes      excellent  no
  medium  yes      fair       yes
  medium  no       excellent  no

age = 31…40 → class = yes

12
Building The Tree:
we chose “student” on <=30 branch

age = <=30 → student:
  no  → class = no
  yes → class = yes

age = >40 (still mixed):
  income  student  credit     class
  medium  no       fair       yes
  low     yes      fair       yes
  low     yes      excellent  no
  medium  yes      fair       yes
  medium  no       excellent  no

age = 31…40 → class = yes

13
Building The Tree:
we chose “credit” on >40 branch

age = <=30 → student:  no → class = no;  yes → class = yes

age = >40 → split on credit:
  credit = excellent:  (income, student, class) = (low, yes, no), (medium, no, no)
  credit = fair:       (medium, no, yes), (low, yes, yes), (medium, yes, yes)

age = 31…40 → class = yes

14
Finished Tree for class=“buys”

age = <=30  → student:  no → buys = no;   yes → buys = yes
age = 31…40 → buys = yes
age = >40   → credit:   excellent → buys = no;   fair → buys = yes

15
A Decision Tree

age?
  <=30   → student?         no → no;          yes → yes
  31..40 → yes
  >40    → credit rating?   excellent → no;   fair → yes

16
Discriminant RULES extracted from our
TREE
 The rules are:
  IF age = “<=30” AND student = “no”  THEN buys_computer = “no”
  IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
  IF age = “31…40” THEN buys_computer = “yes”
  IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
  IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
17
The Loan Data
Approved or not

18
A decision tree from the loan data
 Decision nodes and leaf nodes (classes)

19
Using the Decision Tree

(Figure: a test loan application is traced through the tree; the predicted class is No.)

20
Is the decision tree unique?
 No. There are many possible trees.
 Here is a simpler tree.

 We want a small but accurate tree.

 It is easier to understand and tends to perform better.

 Finding the best tree is NP-hard.

 All existing tree-building algorithms are heuristic algorithms.

21
From a decision tree to a set of rules
 A decision tree can be converted to a set of rules.

 Each path from the root to a leaf is a rule.

22
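To make this concrete, here is a small Python sketch (not from the slides; the nested-dict layout and names are illustrative) that turns each root-to-leaf path of the buys_computer tree built earlier into one rule:

```python
# A small sketch (not from the slides): each root-to-leaf path becomes one rule.
# The nested dict encodes the buys_computer tree built earlier in the deck.
tree = {"age": {
    "<=30":  {"student": {"no": "no", "yes": "yes"}},
    "31…40": "yes",
    ">40":   {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def rules(node, conditions=()):
    if not isinstance(node, dict):                        # leaf: emit the rule
        print("IF " + " AND ".join(conditions) + f" THEN buys_computer = {node}")
        return
    (attr, branches), = node.items()
    for value, child in branches.items():
        rules(child, conditions + (f"{attr} = {value}",))

rules(tree)   # prints the five rules listed on the earlier slide
```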
Example of a Decision Tree

Training Data:
 Tid  Refund  Marital Status  Taxable Income  Cheat
 1    Yes     Single          125K            No
 2    No      Married         100K            No
 3    No      Single          70K             No
 4    Yes     Married         120K            No
 5    No      Divorced        95K             Yes
 6    No      Married         60K             No
 7    Yes     Divorced        220K            No
 8    No      Single          85K             Yes
 9    No      Married         75K             No
 10   No      Single          90K             Yes

Model (Decision Tree) induced from the data; splitting attributes:
 Refund?  Yes → NO
          No  → MarSt?  Married          → NO
                        Single, Divorced → TaxInc?  < 80K → NO
                                                    > 80K → YES


23
Another Example of Decision Tree

Using the same training data:
 MarSt?  Married          → NO
         Single, Divorced → Refund?  Yes → NO
                                     No  → TaxInc?  < 80K → NO
                                                    > 80K → YES

There could be more than one tree that fits the same data!

24
What is a decision tree?
 Decision tree
 A flow-chart-like tree structure
 An internal node denotes a test on an attribute
 A branch represents an outcome of the test
 Leaf nodes represent class labels

 Decision tree generation consists of two phases


 Tree construction
 At start, all the training examples are at the root
 Partition examples recursively based on selected attributes
 Tree pruning
 Identify and remove branches that reflect noise or outliers

25
Classifying Using Decision Tree

26
Classifying Using Decision Tree
 To classify an object, the appropriate attribute value is used at each node, starting from the root, to determine which branch to take.

 The path determined by these tests ends at a leaf node, which gives the class the model believes the object belongs to.

27
Classifying Using Decision Tree

Training Data (Home Owner and Marital Status are categorical, Annual Income is continuous, Defaulted Borrower is the class):
 ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
 1   Yes         Single          125K           No
 2   No          Married         100K           No
 3   No          Single          70K            No
 4   Yes         Married         120K           No
 5   No          Divorced        95K            Yes
 6   No          Married         60K            No
 7   Yes         Divorced        220K           No
 8   No          Single          85K            Yes
 9   No          Married         75K            No
 10  No          Single          90K            Yes

Model (Decision Tree); splitting attributes:
 Home Owner?  Yes → NO
              No  → MarSt?  Married          → NO
                            Single, Divorced → Income?  < 80K → NO
                                                        > 80K → YES
28
Classifying Using Decision Tree

Test Data:
 Home Owner  Marital Status  Annual Income  Defaulted Borrower
 No          Married         80K            ?

Start from the root of the tree and apply the test at each node:
 Home Owner?  Yes → NO
              No  → MarSt?  Married          → NO
                            Single, Divorced → Income?  < 80K → NO
                                                        > 80K → YES
29
Classifying Using Decision Tree

For the test record (Home Owner = No, Marital Status = Married, Annual Income = 80K):
 Home Owner = No  → follow the No branch to MarSt
 MarSt = Married  → reach the leaf NO

Assign Defaulted Borrower = “No”
34
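The walk above can be sketched in a few lines of Python (not part of the slides; the tree encoding and names are illustrative assumptions):

```python
# A minimal sketch (not from the slides) of the classification walk above:
# start at the root and follow the branch selected by each attribute test.
def classify(tree, record):
    while isinstance(tree, dict):
        attr, test = tree["attr"], tree["branch"]
        tree = test(record[attr])          # follow the branch chosen by the test
    return tree                            # a leaf holds the class label

# Hypothetical encoding of the Defaulted Borrower tree from the slides.
tree = {
    "attr": "Home Owner",
    "branch": lambda v: "NO" if v == "Yes" else {
        "attr": "Marital Status",
        "branch": lambda v: "NO" if v == "Married" else {
            "attr": "Annual Income",
            "branch": lambda v: "NO" if v < 80 else "YES",   # income in K
        },
    },
}

record = {"Home Owner": "No", "Marital Status": "Married", "Annual Income": 80}
print(classify(record=record, tree=tree))   # -> "NO": Defaulted = No
```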
Methods for Expressing Test Conditions

35
Methods for Expressing Test Conditions
 Depends on attribute types
 Binary

 Nominal

 Ordinal

 Continuous

36
Test Condition for Nominal Attributes
 Multi-way split:
  Use as many partitions as distinct values, e.g. Marital Status → Single, Divorced, Married.

 Binary split:
  Divides the values into two subsets.
  Some decision tree algorithms, such as CART, produce only binary splits by considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute values.
  For Marital Status: {Married} vs {Single, Divorced}, or {Single} vs {Married, Divorced}, or {Single, Married} vs {Divorced}.

37
Test Condition for Ordinal Attributes
 Multi-way split:
  Use as many partitions as distinct values, e.g. Shirt Size → Small, Medium, Large, Extra Large.

 Binary split:
  Divides the values into two subsets.
  Must preserve the order property among the attribute values,
  e.g. {Small, Medium} vs {Large, Extra Large}, or {Small} vs {Medium, Large, Extra Large}.
  The grouping {Small, Large} vs {Medium, Extra Large} violates the order property.

38
Test Condition for Continuous Attributes
 Binary split:
  The attribute test condition can be expressed as a comparison test: A < v or A ≥ v.
  Any value v between the minimum and maximum attribute values in the training data is a possible split point.
  Consider all possible splits and find the best cut.
  Can be more compute-intensive.

 Multi-way split:
  The attribute test condition can be expressed as a range query of the form v_i ≤ A < v_{i+1}, for i = 1, 2, …, k.
  Any collection of attribute-value ranges can be used, as long as they are mutually exclusive and cover the entire range of attribute values between the minimum and maximum values observed in the training set.

39
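As a hedged illustration of the binary-split search (not from the slides; the function names and the choice of weighted entropy as the impurity measure are assumptions), the sketch below sorts the values, tries the midpoints between consecutive distinct values as candidate thresholds A < v, and keeps the cut with the lowest weighted impurity:

```python
# A minimal sketch of the binary-split search for a continuous attribute.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_continuous_split(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    best_cut, best_impurity = None, float("inf")
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                       # only cut between distinct values
        v = (xs[i] + xs[i - 1]) / 2        # candidate threshold: A < v vs A >= v
        left, right = ys[:i], ys[i:]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if w < best_impurity:
            best_cut, best_impurity = v, w
    return best_cut, best_impurity

# Annual Income (in K) and the Defaulted class from the example table above.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_continuous_split(income, defaulted))   # best cut is 97.5 for this data
```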
Building a Decision Tree

40
Decision Tree Algorithms: Short History
 Late 1970s – ID3 (Iterative Dichotomiser) by J. Ross Quinlan
  This work expanded on earlier work on concept learning systems described by E. B. Hunt, J. Marin, and P. T. Stone.

 1993 – C4.5, a successor of ID3, by Quinlan

  C4.5 later became a benchmark to which newer supervised learning algorithms are often compared.

 In 1984, a group of statisticians published the book "Classification and Regression Trees" (CART).
  The book described the generation of binary decision trees.

 ID3, C4.5, and CART were invented independently of one another, yet they follow a similar approach for learning decision trees from training tuples.

41
Building a Decision Tree
 ID3, C4.5, and CART adopt a greedy (i.e., a non-backtracking) approach.

 In this approach, decision trees are constructed in a top-down, recursive, divide-and-conquer manner.

 Most algorithms for decision tree induction also follow such a top-down
approach.

 All of the algorithms start with a training set of tuples and their associated
class labels (classification data table).

 The training set is recursively partitioned into smaller subsets as the tree is
being built.

42
Building a Decision Tree
 The aim is to build a decision tree consisting of a root node, a number
of internal nodes, and a number of leaf nodes.

 Building the tree starts with the root node; the data are then split into two or more child nodes, which are in turn split into lower-level nodes, and so on, until the process is complete.

 We illustrate the brute-force approach using a simple example.

43
An Example
 The first five attributes are symptoms, and the last attribute is
diagnosis.
 All attributes are categorical.
 Wish to predict the diagnosis class.

Sore Swollen
Throat Fever Glands Congestion Headache Diagnosis
Yes Yes Yes Yes Yes Strep throat
No No No Yes Yes Allergy
Yes Yes No Yes No Cold
Yes No Yes No No Strep throat
No Yes No Yes No Cold
No No No Yes No Allergy
No No Yes No No Strep throat
Yes No No Yes Yes Allergy
No Yes No Yes Yes Cold
Yes Yes No Yes Yes Cold
44
An Example
Consider each of the attributes in turn to see
which would be a “good” one to start with
Sore
Throat Diagnosis
No Allergy
No Cold
No Allergy
No Strep throat
No Cold
Yes Strep throat
Yes Cold
Yes Strep throat
Yes Allergy
Yes Cold

Sore throat does not predict diagnosis.

45
An Example

Is the symptom fever any better?

Fever Diagnosis
No Allergy
No Strep throat
No Allergy
No Strep throat
No Allergy
Yes Strep throat
Yes Cold
Yes Cold
Yes Cold
Yes Cold

Fever is better but not perfect.

46
An Example

Try swollen glands

Swollen
Glands Diagnosis
No Allergy
No Cold
No Cold
No Allergy
No Allergy
No Cold
No Cold
Yes Strep throat
Yes Strep throat
Yes Strep throat

Good. Swollen glands = yes means Strep Throat

47
An Example
Try congestion

Congestion Diagnosis
No Strep throat
No Strep throat
Yes Allergy
Yes Cold
Yes Cold
Yes Allergy
Yes Allergy
Yes Cold
Yes Cold
Yes Strep throat

Not helpful.

48
An Example

Try the symptom headache

Headache Diagnosis
No Cold
No Cold
No Allergy
No Strep throat
No Strep throat
Yes Allergy
Yes Allergy
Yes Cold
Yes Cold
Yes Strep throat

Not helpful.

49
Brute Force Approach
 This approach does not work if there are many attributes and a large training
set.

 An algorithm is needed to select an attribute that best discriminates among


the target classes as the split attribute.

 How do we find the attribute that is most influential in determining the


dependent/target attribute?

 The tree continues to grow until finding better ways to split the objects is no
longer possible.

50
Basic Algorithm
 1. Let the root node contain all training data D.

 2. If all objects D in the root node belong to the same class, then
 stop.

 3. Select an attribute A from amongst the independent attributes that best divides or splits the objects in the node into subsets, and create a decision tree node.
 Split the node according to the values of A.

 4. Stop if any of the following conditions is met; otherwise, continue


with 3.
 data in each subset belongs to a single class.
 there are no remaining attributes on which the sample may be further
divided.

51
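A minimal Python sketch of steps 1–4 above (not the authors' code; it assumes categorical attributes, dict-shaped records, and weighted entropy as the selection criterion — all names are illustrative):

```python
# A hedged sketch of the basic top-down tree-induction procedure.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_entropy(records, attr, target):
    groups = {}
    for r in records:                                      # partition by attr value
        groups.setdefault(r[attr], []).append(r[target])
    n = len(records)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def build_tree(records, attributes, target):
    labels = [r[target] for r in records]
    if len(set(labels)) == 1 or not attributes:            # steps 2 and 4: stop
        return Counter(labels).most_common(1)[0][0]        # leaf: (majority) class
    best = min(attributes, key=lambda a: weighted_entropy(records, a, target))
    node = {"split_on": best, "children": {}}
    for value in {r[best] for r in records}:               # step 3: split on best
        subset = [r for r in records if r[best] == value]
        rest = [a for a in attributes if a != best]
        node["children"][value] = build_tree(subset, rest, target)
    return node
```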
Basic Algorithm: Example

Dataset D:
 ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
 1   Yes         Single          125K           No
 2   No          Married         100K           No
 3   No          Single          70K            No
 4   Yes         Married         120K           No
 5   No          Divorced        95K            Yes
 6   No          Married         60K            No
 7   Yes         Divorced        220K           No
 8   No          Single          85K            Yes
 9   No          Married         75K            No
 10  No          Single          90K            Yes

52
Basic Algorithm: Example
(The tree is grown in steps (a)–(d); class counts are shown as (No, Yes).)

(a) Single leaf: Defaulted = No (7,3)

(b) Split on Home Owner:
    Yes → Defaulted = No (3,0)
    No  → Defaulted = No (4,3)

(c) Split the impure No branch on Marital Status:
    Home Owner = Yes → Defaulted = No (3,0)
    Home Owner = No, Marital Status = Married          → Defaulted = No (3,0)
    Home Owner = No, Marital Status = Single, Divorced → Defaulted = Yes (1,3)

(d) Split the remaining impure node on Annual Income:
    Home Owner = No, Single/Divorced, Annual Income < 80K  → Defaulted = No (1,0)
    Home Owner = No, Single/Divorced, Annual Income >= 80K → Defaulted = Yes (0,3)
53
C4.5 Algorithm

57
Algorithm for decision tree learning
 Basic algorithm (a greedy divide-and-conquer algorithm)
 Tree is constructed in a top-down recursive manner
 At start, all the training data are at the root.
 Data are partitioned recursively based on selected attributes.
 Attributes are selected on the basis of an impurity function
 e.g., information gain

 Conditions for stopping partitioning


 All examples for a given node belong to the same class.
 There are no remaining attributes for further partitioning.
 There are no training examples left.

58
Decision tree learning algorithm C4.5

59
Decision tree learning algorithm C4.5

60
Choose an attribute to partition data
 The key to building a decision tree
 which attribute to choose in order to branch.

 Objective: reduce impurity in data as much as possible.


 A subset of data is pure if all instances belong to the same class.

 C4.5 chooses the attribute with the maximum Information Gain or


Gain Ratio based on information theory.

61
The Loan Data
Approved or not

62
Two possible roots, which is better?
 Fig. (A) splits on Age; Fig. (B) splits on Own_house.

Fig. (B) seems to be better.

63
Impurity Measures

64
Finding the Best Split
 Before splitting:
 10 records of class 0 (C0)
 10 records of class 1 (C1)

Three candidate test conditions:
 Gender:       two children, (C0: 6, C1: 4) and (C0: 4, C1: 6)
 Car Type:     Family (C0: 1, C1: 3), Sports (C0: 8, C1: 0), Luxury (C0: 1, C1: 7)
 Customer ID:  twenty children c1 … c20, each holding a single record, (C0: 1, C1: 0) or (C0: 0, C1: 1)

 Which test condition is the best?


65
Finding the Best Split
 Greedy approach:
 Nodes with purer class distribution are preferred.
 Need a measure of node impurity (diversity).

 Before splitting:
 10 records of class 0 (C0)
 10 records of class 1 (C1)

A node with (C0: 5, C1: 5) has a high degree of impurity; a node with (C0: 9, C1: 1) has a low degree of impurity.

66
Impurity
 When deciding which question to ask at a node, we consider the
impurity in its child nodes after the question.
 We want it to be as low as possible (low impurity or high purity).

 Let’s look at this example (assume a bucket below is simply a node in


decision tree):

67
Computing Impurity
 There are many measures that can be used to determine the goodness of an attribute
test condition.

 All three impurity measures (entropy, Gini index, and classification error) give a
 zero impurity value if a node contains instances from a single class
 maximum impurity if the node has an equal proportion of instances from multiple classes

 These measures try to give preference to attribute test conditions that partition the
training instances into purer subsets in the child nodes,
 which mostly have the same class labels.

68
Computing Impurity

69
Measure of Impurity: Entropy
 Entropy for a given node that represents the dataset D:

    E(D) = Entropy(D) = −Σ_{i=1..c} p_i · log2(p_i)

 where p_i is the relative frequency of class i in D, and c is the total number of classes.

 Maximum of log2(c)
  when records are equally distributed among all classes, implying the least beneficial situation for classification.
 Minimum of 0
  when all records belong to one class, implying the most beneficial situation for classification.
 Entropy is used in decision tree algorithms such as
  ID3, C4.5

70
Computing Entropy of a Dataset
 Assume we have a dataset D with only two classes, positive and negative.

    Entropy(D) = −Σ_{i=1..c} p_i · log2(p_i)

 The dataset D has
  50% positive examples (P(positive) = 0.5) and 50% negative examples (P(negative) = 0.5).
   E(D) = −0.5·log2 0.5 − 0.5·log2 0.5 = 1

  20% positive examples (P(positive) = 0.2) and 80% negative examples (P(negative) = 0.8).
   E(D) = −0.2·log2 0.2 − 0.8·log2 0.8 = 0.722

  100% positive examples (P(positive) = 1) and no negative examples (P(negative) = 0).
   E(D) = −1·log2 1 − 0·log2 0 = 0
   By definition, 0·log2 0 = 0.

71
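A few lines of Python (not part of the slides) reproduce the three values above; here entropy takes a list of class proportions:

```python
# Quick check of the three entropy values on the previous slide.
import math

def entropy(probs):
    # By definition 0*log2(0) = 0, so zero-probability classes are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.2, 0.8]))   # 0.7219...
print(entropy([1.0, 0.0]))   # 0.0
```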
Computing Entropy of a Dataset

    Entropy(D) = −Σ_{i=1..c} p_i · log2(p_i)

 C1 = 0, C2 = 6:   p1(C1) = 0/6 = 0,  p2(C2) = 6/6 = 1
   E(D) = −0·log2 0 − 1·log2 1 = −0 − 0 = 0

 C1 = 1, C2 = 5:   p1(C1) = 1/6,  p2(C2) = 5/6
   E(D) = −(1/6)·log2(1/6) − (5/6)·log2(5/6) = 0.65

 C1 = 2, C2 = 4:   p1(C1) = 2/6,  p2(C2) = 4/6
   E(D) = −(2/6)·log2(2/6) − (4/6)·log2(4/6) = 0.92

72
Information Gain

73
Information Gain
 Information gained by selecting attribute Ai to branch or to partition the data D is

    gain(D, Ai) = entropy(D) − entropy_Ai(D) = E(D) − E_Ai(D)

 We evaluate every attribute:
  We choose the attribute with the highest gain to branch/split the current tree.

 Disadvantage:
 Attributes with many values are preferred.

74
Computing Information Gain
 1. Given a dataset D, compute the entropy value of D before splitting:

    E(D) = −Σ_{i=1..c} p_i · log2(p_i)

 2. If we make attribute Ai, with v values, the root of the current tree, this partitions D into v subsets D1, D2, …, Dv. The expected weighted entropy if Ai is used as the current root (after splitting) is

    E_Ai(D) = Σ_{i=1..v} (n_i / n) · E(Di)

 where n_i is the number of records in Di (child node i), and n is the number of records in D (the parent node).

 3. Choose the attribute Ai that produces the highest gain:

    gain(D, Ai) = E(D) − E_Ai(D)
75
Computing Information Gain

Before splitting: dataset D with class counts C0 = N00, C1 = N01 and entropy E(D).

Candidate split A? (Yes/No) produces nodes N1 (C0 = N10, C1 = N11) and N2 (C0 = N20, C1 = N21), with weighted entropy EA(D).
Candidate split B? (Yes/No) produces nodes N3 (C0 = N30, C1 = N31) and N4 (C0 = N40, C1 = N41), with weighted entropy EB(D).

Choose the split with max(GainA = E(D) – EA(D), GainB = E(D) – EB(D)), i.e. min(EA(D), EB(D)).


76
An example
The loan data D has 15 records: 9 of class Yes and 6 of class No.

    E(D) = −(6/15)·log2(6/15) − (9/15)·log2(9/15) = 0.971

77
An example

    E_Age(D) = (5/15)·E(D1) + (5/15)·E(D2) + (5/15)·E(D3)
             = (5/15)·0.971 + (5/15)·0.971 + (5/15)·0.722
             = 0.888

 Age     Yes  No  entropy(Di)
 young    2    3    0.971
 middle   3    2    0.971
 old      4    1    0.722

78
An example

    E_Own_house(D) = (6/15)·E(D1) + (9/15)·E(D2)
                   = (6/15)·0 + (9/15)·0.918
                   = 0.551
79
An example
 Own_house is the best attribute for the root node.

80
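A short Python check of this example (not from the slides), working directly from the class counts per child node: the Age children have (Yes, No) counts (2,3), (3,2), (4,1), and the Own_house children are one pure 6-record node and one node with a 3/6 class split:

```python
# Reproduces E(D), E_Age(D) and E_Own_house(D) from the class counts above.
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def weighted_entropy(children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * entropy(c) for c in children)

print(entropy([9, 6]))                               # E(D)           = 0.971
print(weighted_entropy([[2, 3], [3, 2], [4, 1]]))    # E_Age(D)       = 0.888
print(weighted_entropy([[6, 0], [3, 6]]))            # E_Own_house(D) = 0.551
# gain(D, Own_house) = 0.971 - 0.551 > gain(D, Age) = 0.971 - 0.888
```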
Another Example
 First five attributes are symptoms and the last attribute is diagnosis.
 All attributes are categorical.
 Wish to predict the diagnosis class.

Sore Swollen
Throat Fever Glands Congestion Headache Diagnosis
Yes Yes Yes Yes Yes Strep throat
No No No Yes Yes Allergy
Yes Yes No Yes No Cold
Yes No Yes No No Strep throat
No Yes No Yes No Cold
No No No Yes No Allergy
No No Yes No No Strep throat
Yes No No Yes Yes Allergy
No Yes No Yes Yes Cold
Yes Yes No Yes Yes Cold

81
Another Example
 D has (n = 10) samples and three classes c = 3.
 Strep throat: t = 3
 Cold: d = 4
 Allergy: a = 3

 E(D) = 2·(−(3/10)·log(3/10)) − (4/10)·log(4/10) = 0.47
 (In this example log is taken base 10; the attribute ranking is the same as with log2.)

 Let us now consider using the various symptoms to split D.

82
Another Example
 (using E_Ai(D) = Σ (n_i/n)·E(Di))

 Sore Throat
  has 2 distinct values {Yes, No}
  DYes has t = 2, d = 2, a = 1, total 5
  DNo has t = 1, d = 2, a = 2, total 5
  E(DYes) = −(2/5)·log(2/5) − (2/5)·log(2/5) − (1/5)·log(1/5) = 0.46
  E(DNo)  = −(1/5)·log(1/5) − (2/5)·log(2/5) − (2/5)·log(2/5) = 0.46
  ESoreThroat(D) = 0.5·0.46 + 0.5·0.46 = 0.46

 Fever
  has 2 distinct values {Yes, No}
  DYes has t = 1, d = 4, a = 0, total 5
  DNo has t = 2, d = 0, a = 3, total 5
  E(DYes) = −(1/5)·log(1/5) − (4/5)·log(4/5) = 0.22
  E(DNo)  = −(2/5)·log(2/5) − (3/5)·log(3/5) = 0.29
  EFever(D) = 0.5·0.22 + 0.5·0.29 ≈ 0.25


Another Example
 Swollen Glands
  has 2 distinct values {Yes, No}
  DYes has t = 3, d = 0, a = 0, total 3
  DNo has t = 0, d = 4, a = 3, total 7
  E(DYes) = −(3/3)·log(3/3) = 0
  E(DNo)  = −(4/7)·log(4/7) − (3/7)·log(3/7) = 0.30
  ESwollenGlands(D) = 0.3·0 + 0.7·0.30 = 0.21

 Congestion
  has 2 distinct values {Yes, No}
  DYes has t = 1, d = 4, a = 3, total 8
  DNo has t = 2, d = 0, a = 0, total 2
  E(DYes) = −(1/8)·log(1/8) − (4/8)·log(4/8) − (3/8)·log(3/8) = 0.42
  E(DNo)  = 0
  ECongestion(D) = 0.8·0.42 + 0.2·0 = 0.34

84
Another Example
 Headache
  has 2 distinct values {Yes, No}
  DYes has t = 2, d = 2, a = 1, total 5
  DNo has t = 1, d = 2, a = 2, total 5
  E(DYes) = −(2/5)·log(2/5) − (2/5)·log(2/5) − (1/5)·log(1/5) = 0.46
  E(DNo)  = −(2/5)·log(2/5) − (2/5)·log(2/5) − (1/5)·log(1/5) = 0.46
  EHeadache(D) = 0.5·0.46 + 0.5·0.46 = 0.46

 So the weighted entropy values of all attributes are:
  Sore Throat     0.46
  Fever           0.25
  Swollen Glands  0.21
  Congestion      0.34
  Headache        0.46

 Swollen Glands has the lowest weighted entropy, so it gives the highest gain and is selected as the split attribute. The process then continues one more step on the impure (Swollen Glands = No) branch.
85
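A quick Python check of this table (not from the slides). Note that these slides use base-10 logarithms, which changes the absolute entropy values but not the ranking of the attributes:

```python
# Reproduces the weighted (base-10) entropy of each symptom attribute.
import math

def entropy10(counts):
    n = sum(counts)
    return -sum(c / n * math.log10(c / n) for c in counts if c > 0)

def weighted(children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * entropy10(c) for c in children)

# Class counts per child node as (strep throat, cold, allergy).
print(round(weighted([[2, 2, 1], [1, 2, 2]]), 2))   # Sore Throat    0.46
print(round(weighted([[1, 4, 0], [2, 0, 3]]), 2))   # Fever          0.25
print(round(weighted([[3, 0, 0], [0, 4, 3]]), 2))   # Swollen Glands 0.21
print(round(weighted([[1, 4, 3], [2, 0, 0]]), 2))   # Congestion     0.34
print(round(weighted([[2, 2, 1], [1, 2, 2]]), 2))   # Headache       0.46
```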
Gini Index

86
Gini Index
 Gini Index for a given node that represents the dataset D:

    G(D) = Gini(D) = 1 − Σ_{i=1..c} p_i²

 where p_i is the relative frequency of class i in D, and c is the total number of classes.

 Maximum of (1 − 1/c)
  when records are equally distributed among all classes, implying the least beneficial situation for classification.
 Minimum of 0
  when all records belong to one class, implying the most beneficial situation for classification.
 Gini index is used in decision tree algorithms such as
  CART, SLIQ, SPRINT
87
Gini Index of a Dataset
 For a 2-class problem (p, 1 − p):
  G(D) = 1 − p² − (1 − p)² = 2p(1 − p)

    Gini(D) = 1 − Σ_{i=1..c} p_i²

 C1 = 0, C2 = 6:   p1(C1) = 0/6 = 0,  p2(C2) = 6/6 = 1
   Gini = 1 − p1(C1)² − p2(C2)² = 1 − 0 − 1 = 0

 C1 = 1, C2 = 5:   p1(C1) = 1/6,  p2(C2) = 5/6
   Gini = 1 − (1/6)² − (5/6)² = 0.278

 C1 = 2, C2 = 4:   p1(C1) = 2/6,  p2(C2) = 4/6
   Gini = 1 − (2/6)² − (4/6)² = 0.444

 C1 = 3, C2 = 3:   p1(C1) = 3/6,  p2(C2) = 3/6
   Gini = 1 − (3/6)² − (3/6)² = 0.5


88
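The four Gini values above can be reproduced with a couple of lines of Python (not part of the slides):

```python
# Quick check of the Gini values in the table above, from class counts.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))   # 0.0
print(gini([1, 5]))   # 0.278
print(gini([2, 4]))   # 0.444
print(gini([3, 3]))   # 0.5
```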
Computing Information Gain Using Gini Index

Parent D: C1 = 7, C2 = 5,  G(D) = Gini = 0.486

Split A? (Yes/No):
  Node N1: C1 = 5, C2 = 1   Gini(D1) = 1 − (5/6)² − (1/6)² = 0.278
  Node N2: C1 = 2, C2 = 4   Gini(D2) = 1 − (2/6)² − (4/6)² = 0.444

    Gini_A(D) = Σ_{i=1..v} (n_i/n)·Gini(Di)
              = 6/12 · 0.278 + 6/12 · 0.444 = 0.361   (weighted Gini of attribute A)

    GainA = Gini(D) − Gini_A(D) = 0.486 − 0.361 = 0.125

89
Computing Information Gain Using Gini Index

Parent D: C1 = 7, C2 = 5,  G(D) = Gini = 0.486

Split B? (Yes/No):
  Node N1: C1 = 5, C2 = 2   Gini(D1) = 1 − (5/7)² − (2/7)² = 0.408
  Node N2: C1 = 1, C2 = 4   Gini(D2) = 1 − (1/5)² − (4/5)² = 0.32

    Gini_B(D) = Σ_{i=1..v} (n_i/n)·Gini(Di)
              = 7/12 · 0.408 + 5/12 · 0.32 = 0.371   (weighted Gini of attribute B)

    GainB = Gini(D) − Gini_B(D) = 0.486 − 0.371 = 0.115

90
Selecting the Splitting Attribute
 Since GainA is larger than GainB, attribute A will be selected for the
next splitting.
 max(GainA, GainB) or min(GiniA(D), GiniB(D))

GainA = Gini(D) – GiniA (D) = 0.486 – 0.361 = 0.125

GainB = Gini(D) – GiniB (D) = 0.486 – 0.371 = 0.115

91
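A quick arithmetic check of this comparison (not from the slides):

```python
# Weighted Gini of splits A and B and the resulting gains.
gini = lambda counts: 1 - sum((c / sum(counts)) ** 2 for c in counts)

g_parent = gini([7, 5])                              # 0.486
g_A = 6/12 * gini([5, 1]) + 6/12 * gini([2, 4])      # 0.361
g_B = 7/12 * gini([5, 2]) + 5/12 * gini([1, 4])      # 0.371
print(g_parent - g_A, g_parent - g_B)                # GainA = 0.125 > GainB = 0.115
```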
Information Gain Ratio

92
Problem with large number of partitions
 Node impurity measures tend to prefer splits that result in
 a large number of partitions, each being small but pure.

 Customer ID has highest gain because entropy for all the children is
zero.

Candidate splits (20 records: 10 of C0, 10 of C1):
 Gender:       two children, (C0: 6, C1: 4) and (C0: 4, C1: 6)
 Car Type:     Family (C0: 1, C1: 3), Sports (C0: 8, C1: 0), Luxury (C0: 1, C1: 7)
 Customer ID:  twenty children c1 … c20, each holding a single record, (C0: 1, C1: 0) or (C0: 0, C1: 1)

93
Gain Ratio
 Tree building algorithm blindly picks attribute that
 maximizes information gain

 Need a correction to penalize attributes whose values scatter the data into many small partitions.

 Gain ratio overcomes the disadvantage of Gain by normalizing the


gain using the entropy of the data with respect to the values of the
attribute.
 Our previous entropy computations are done with respect to the class
attribute.
 Adjusts Gain by the entropy of the partitioning.
 Higher entropy partitioning (large number of small partitions) is penalized!
 Used in C4.5 algorithm

94
Gain Ratio
 Gain Ratio:

    gain_ratio(D, A) = gain(D, A) / split_info_A(D)

    split_info_A(D) = −Σ_{i=1..v} (n_i/n)·log2(n_i/n)

  D is split into partitions {D1, D2, …, Dv} (children nodes)

  v is the number of possible values of A

  n_i is the number of records in Di (child node i), and n is the number of records in D

95
Gain Ratio
 For the buys_computer training data shown earlier (14 records), the income attribute has 4 high, 6 medium, and 4 low records:

    split_info_income(D) = −(4/14)·log2(4/14) − (6/14)·log2(6/14) − (4/14)·log2(4/14) = 1.557

 gain(D, income) = 0.029

 gain_ratio_income(D) = 0.029 / 1.557 = 0.019

 The attribute with the maximum gain ratio is selected as the splitting attribute.

96
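A short Python check of the split information and gain ratio for income (not from the slides; it uses the value counts 4 high, 6 medium, 4 low from the training table):

```python
# Split information of income and the resulting gain ratio.
import math

def split_info(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts)

si = split_info([4, 6, 4])
print(round(si, 3))            # 1.557
print(round(0.029 / si, 3))    # gain_ratio_income(D) = 0.019
```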
