05 Chap3 - Basic - Classification Edited On Oct 10, 2023
05 Chap3 - Basic - Classification Edited On Oct 10, 2023
Task:
– Learn a model that maps each attribute set x into one of the
predefined class labels y
cal cal us
i i o
or or nu
teg
teg
nti
ass
ca ca co cl
Splitting Attributes
Home Marital Annual Defaulted
ID
Owner Status Income Borrower
1 Yes Single 125K No Home
2 No Married 100K No Owner
Yes No
3 No Single 70K No
4 Yes Married 120K No NO MarSt
5 No Divorced 95K Yes Single, Divorced Married
6 No Married 60K No
Income NO
7 Yes Divorced 220K No
< 80K > 80K
8 No Single 85K Yes
9 No Married 75K No NO YES
10 No Single 90K Yes
10
• Root and internal nodes are test nodes; and leaf nodes are class label nodes.
Test Data
Start from the root of tree.
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married Assign Defaulted to
“No”
Income NO
< 80K > 80K
NO YES
cal cal us
i i o
or or nu
teg
teg
nti
ass
l
ca ca co c MarSt Single,
Married Divorced
Home Marital Annual Defaulted
ID
Owner Status Income Borrower
NO Home
1 Yes Single 125K No
Yes Owner No
2 No Married 100K No
3 No Single 70K No NO Income
4 Yes Married 120K No < 80K > 80K
5 No Divorced 95K Yes
NO YES
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No There could be more than one tree that
10 No Single 90K
fits the same data!
Yes
10
6 No Medium 60K No
Training Set
Apply Decision
Tid Attrib1 Attrib2 Attrib3 Class
Model Tree
11 No Small 55K ?
15 No Large 67K ?
10
Test Set
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ (Supervised Learning In Quest),SPRINT
Dt
21
Hunt’s Algorithm (just another slide from Tan Book 1st edition)
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Defaulted = No
(3,0) Single,
Married
Annual
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Defaulted = No
(3,0) Single,
Married
Annual
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Defaulted = No
(3,0) Single,
Married
Annual
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Defaulted = No
(3,0) Single,
Married
Annual
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Multi-way split:
Marital
– Use as many partitions as Status
distinct values.
Binary split:
– Divides values into two subsets
{Small, {Medium,
Large} Extra Large}
2/1/2021 Introduction to Data Mining, 2 nd Edition 33
Test Condition for Continuous Attributes
Annual Annual
Income Income?
> 80K?
< 10K > 80K
Yes No
Greedy approach:
– Nodes with purer class distribution are
preferred
C0: 5 C0: 9
C1: 5 C1: 1
Entropy 𝑐 −1
𝐸𝑛𝑡𝑟𝑜𝑝𝑦 =− ∑ 𝑝 𝑖 ( 𝑡 ) 𝑙𝑜 𝑔2 𝑝 𝑖 (𝑡)
𝑖=0
Misclassification error
𝐶𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑐𝑎𝑡𝑖𝑜𝑛𝑒𝑟𝑟𝑜𝑟 =1− max [𝑝¿ ¿ 𝑖(𝑡)]¿
Gain = P - M
A? B?
Yes No Yes No
M1 M2
Gain = P – M1 vs P – M2
2/1/2021 Introduction to Data Mining, 2 nd Edition 40
Measure of Impurity: GINI
number of classes
𝑖=0
C1 0 C1 1 C1 2 C1 3
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500
𝑖=0
Annual Income ?
gather count matrix and compute
its Gini index
≤ 80 > 80
– Computationally Inefficient!
Repetition of work. Defaulted Yes 0 3
Defaulted No 3 4
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
Information Gain:
𝑘
𝑛𝑖
𝐺𝑎𝑖 𝑛𝑠𝑝𝑙𝑖𝑡 = 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 ( 𝑝 ) − ∑ 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 (𝑖)
𝑖=1 𝑛
Gain Ratio:
𝑘
𝐺𝑎𝑖𝑛 𝑠𝑝𝑙𝑖𝑡 𝑛𝑖 𝑛𝑖
𝐺𝑎𝑖𝑛 𝑅𝑎𝑡𝑖𝑜= 𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜=− ∑ 𝑙𝑜𝑔 2
𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜 𝑖=1 𝑛 𝑛
Gain Ratio:
𝑘
𝐺𝑎𝑖𝑛 𝑠𝑝𝑙𝑖𝑡 𝑛𝑖 𝑛𝑖
𝐺𝑎𝑖𝑛 𝑅𝑎𝑡𝑖𝑜= 𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜=∑ 𝑙𝑜 𝑔 2
𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜 𝑖=1 𝑛 𝑛
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42
Gini(N1) N1 N2
= 1 – (3/3)2 – (0/3)2 Gini(Children)
C1 3 4 = 3/10 * 0
=0
C2 0 3 + 7/10 * 0.489
Gini(N2) Gini=0.342 = 0.342
= 1 – (4/7)2 – (3/7)2
= 0.489 Gini improves but
error remains the
same!!
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42
N1 N2 N1 N2
C1 3 4 C1 3 4
C2 0 3 C2 1 2
Gini=0.342 Gini=0.416