CSC4316 Lectures 7 and 8: Classification and Decision Trees
! Task:
– Learn a model that maps each attribute set x
into one of the predefined class labels y
! Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
– Neural Networks, Deep Neural Nets
! Ensemble Classifiers
– Boosting, Bagging, Random Forests
Example of a Decision Tree

Training data (Home Owner and Marital Status are categorical, Annual Income is continuous, Defaulted Borrower is the class label):

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Splitting attributes in the induced tree:

Home Owner?
  Yes → NO
  No  → Marital Status (MarSt)?
          Single, Divorced → Annual Income?
                               < 80K → NO
                               > 80K → YES
          Married → NO
Apply Model to Test Data

Test record:

Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Start from the root of the tree and follow the branch that matches the record at each internal node:

Home Owner?
  Yes → NO
  No  → Marital Status (MarSt)?
          Single, Divorced → Annual Income?
                               < 80K → NO
                               > 80K → YES
          Married → NO

Home Owner = No, so take the No branch; Marital Status = Married, so the record reaches the Married leaf. Assign Defaulted = "No", as sketched below.
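To make the traversal concrete, here is a minimal Python sketch (not part of the slides) that hard-codes the example tree as nested tests and classifies the test record; the dictionary keys are my own naming.

```python
# Minimal sketch: encode the example decision tree and classify one record.
# The attribute names and the classify() helper are illustrative, not from the slides.

def classify(record):
    """Walk the example tree from the slides for a single record."""
    if record["home_owner"] == "Yes":
        return "No"                      # Home Owner = Yes -> Defaulted = No
    if record["marital_status"] == "Married":
        return "No"                      # Married -> Defaulted = No
    # Single or Divorced: test Annual Income (in thousands)
    if record["annual_income"] < 80:
        return "No"                      # Income < 80K -> Defaulted = No
    return "Yes"                         # Income > 80K -> Defaulted = Yes

test_record = {"home_owner": "No", "marital_status": "Married", "annual_income": 80}
print(classify(test_record))             # -> "No", as in the slide walkthrough
```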
Another Example of a Decision Tree (built from the same training data as above):

MarSt?
  Married → NO
  Single, Divorced → Home Owner?
                       Yes → NO
                       No  → Annual Income?
                                < 80K → NO
                                > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

[Figure: a training set with schema (Tid, Attrib1, Attrib2, Attrib3, Class) is fed to a tree-induction algorithm to learn a model; the learned decision tree is then applied to a test set whose class labels are unknown, e.g. Tid 11 (No, Small, 55K, ?) and Tid 15 (No, Large, 67K, ?).]
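The slides show this train-then-apply workflow only abstractly. Purely as an illustration (the lecture does not prescribe any library), the same workflow might look as follows with scikit-learn's DecisionTreeClassifier, using the loan-default training table and a hand-rolled numeric encoding of the categorical attributes.

```python
# Sketch of the train/apply workflow using scikit-learn (an assumed choice, not
# part of the slides). Categorical attributes are encoded by hand to keep the
# example self-contained; encoding Marital Status as an integer is a simplification.
from sklearn.tree import DecisionTreeClassifier

# Training set from the slides: (Home Owner, Marital Status, Annual Income in K) -> Defaulted
train = [("Yes", "Single", 125, "No"),   ("No", "Married", 100, "No"),
         ("No", "Single", 70, "No"),     ("Yes", "Married", 120, "No"),
         ("No", "Divorced", 95, "Yes"),  ("No", "Married", 60, "No"),
         ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
         ("No", "Married", 75, "No"),    ("No", "Single", 90, "Yes")]

MARITAL = {"Single": 0, "Married": 1, "Divorced": 2}

def encode(home, marital, income):
    return [1 if home == "Yes" else 0, MARITAL[marital], income]

X = [encode(h, m, inc) for h, m, inc, _ in train]
y = [label for *_, label in train]

model = DecisionTreeClassifier(criterion="gini", random_state=0)  # learn the model
model.fit(X, y)

# Apply the model to a test record with unknown label (No, Married, 80K).
print(model.predict([encode("No", "Married", 80)]))  # prints ['No'] for this record
```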
! Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
! Hunt's Algorithm (general idea):
– Let Dt be the set of training records that reach a node t.
– If all records in Dt belong to the same class, t becomes a leaf node labelled with that class.
– If Dt contains records from more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (a minimal sketch of this recursion follows below).
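The following is a compact Python sketch of that recursive procedure, not the exact algorithm from any textbook implementation: records are dicts, attributes are categorical, and the split attribute is chosen by the weighted Gini impurity discussed later in these slides. The function and field names are my own.

```python
# A compact sketch of Hunt's recursive procedure (illustrative only).
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, attrs, target="class"):
    labels = [r[target] for r in records]
    if len(set(labels)) == 1 or not attrs:          # pure node, or no tests left
        return Counter(labels).most_common(1)[0][0] # leaf labelled with majority class
    def split_cost(a):                              # weighted Gini of the children
        groups = {}
        for r in records:
            groups.setdefault(r[a], []).append(r[target])
        return sum(len(g) / len(records) * gini(g) for g in groups.values())
    best = min(attrs, key=split_cost)               # attribute test with purest children
    tree = {"attr": best, "children": {}}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        tree["children"][value] = hunt(subset, [a for a in attrs if a != best], target)
    return tree

# Toy run on two categorical attributes from four records of the slides' data set:
data = [{"home_owner": h, "marital_status": m, "class": c}
        for h, m, c in [("Yes", "Single", "No"), ("No", "Married", "No"),
                        ("No", "Divorced", "Yes"), ("No", "Single", "Yes")]]
print(hunt(data, ["home_owner", "marital_status"]))
```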
[Figure: step-by-step growth of the tree under Hunt's algorithm on the loan-default data. The records are successively split on Marital Status (Married → Defaulted = No) and then, for the Single/Divorced branch, on Annual Income (< 80K → Defaulted = No, >= 80K → Defaulted = Yes); annotations such as (3,0) give the class counts at each node.]
! Multi-way split:
– Use as many partitions as there are distinct values (e.g., Marital Status → Single / Married / Divorced).
! Binary split:
– Divides the values into two subsets (e.g., {Small, Large} vs. {Medium, Extra Large} for a shirt-size attribute).
Test Condition for Continuous Attributes

! Binary split: Annual Income > 80K? → Yes / No
! Multi-way split: discretize Annual Income into several disjoint ranges (from < 10K up to > 80K in the slide's example).
! Greedy approach:
– Nodes with a purer class distribution are preferred.
– Example: a node with class counts C0 = 9, C1 = 1 has a much purer (more skewed) distribution than one with C0 = 5, C1 = 5, so a split that produces it is preferred.
! Entropy:
$\text{Entropy}(t) = -\sum_{i=0}^{c-1} p_i(t)\,\log_2 p_i(t)$

! Misclassification error:
$\text{Classification error}(t) = 1 - \max_i\, p_i(t)$
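As a quick check of these two measures (my own sketch, not from the slides), the following evaluates them on the two class distributions from the greedy-approach example above.

```python
# Evaluate entropy and misclassification error for the two node distributions
# from the 'greedy approach' example: (C0=5, C1=5) vs (C0=9, C1=1).
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    n = sum(counts)
    return 1 - max(counts) / n

for counts in [(5, 5), (9, 1)]:
    print(counts, round(entropy(counts), 3), round(classification_error(counts), 3))
# (5, 5): entropy = 1.0,   error = 0.5   (impure)
# (9, 1): entropy ~ 0.469, error = 0.1   (much purer, so preferred)
```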
! Finding the best split:
– Compute the impurity P of the parent node before splitting.
– For each candidate attribute test (e.g., A? or B?), compute the weighted impurity of its child nodes after splitting (M1 for A?, M2 for B?).
– Gain = P − M: compare Gain = P − M1 with Gain = P − M2 and choose the test with the higher gain.
Measure of Impurity: GINI

! Gini index at node t:
$\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} \big[p_i(t)\big]^2$
where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes.

! Examples (two classes):
– C1 = 0, C2 = 6: Gini = 0.000
– C1 = 1, C2 = 5: Gini = 0.278
– C1 = 2, C2 = 4: Gini = 0.444
– C1 = 3, C2 = 3: Gini = 0.500
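A short sketch (mine, not from the slides) that reproduces the Gini values listed above:

```python
# Verify the Gini values for the four example class distributions.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print((c1, c2), round(gini((c1, c2)), 3))   # 0.0, 0.278, 0.444, 0.5
```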
Continuous Attributes: Computing the Gini Index

! Use binary decisions based on one splitting value (e.g., Annual Income ≤ 80K vs. > 80K).
! For each candidate value, gather the count matrix and compute its Gini index:

                  ≤ 80   > 80
Defaulted = Yes     0      3
Defaulted = No      3      4

! Done naively, this is computationally inefficient: the counting work is repeated for every candidate value.
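The weighted Gini of this particular count matrix can be checked directly; a brief sketch (mine), whose result reappears in the candidate-split table below:

```python
# Weighted Gini of the count matrix for the candidate split Annual Income <= 80K.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

left, right = (0, 3), (3, 4)           # (Defaulted=Yes, Defaulted=No) on each side
n = sum(left) + sum(right)
weighted = sum(sum(side) / n * gini(side) for side in (left, right))
print(round(weighted, 3))              # 0.343, matching the table of candidate splits
```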
! For efficient computation: sort the records on Annual Income, scan the sorted values once, update the class counts and the Gini index at each candidate split position, and choose the position with the smallest Gini.

Sorted Annual Income values (K): 60, 70, 75, 85, 90, 95, 100, 120, 125, 220. Candidate split positions lie below, between, and above these values; class counts (≤ split, > split) and the resulting Gini at each position:

Yes:   0 3 | 0 3 | 0 3 | 0 3 | 1 2 | 2 1 | 3 0 | 3 0 | 3 0 | 3 0 | 3 0
No:    0 7 | 1 6 | 2 5 | 3 4 | 3 4 | 3 4 | 3 4 | 4 3 | 5 2 | 6 1 | 7 0
Gini:  0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split has Gini = 0.300 (the position between 95 and 100).
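Here is a sketch (mine) of the sort-and-scan idea on the ten Annual Income values from the training table. For clarity it recomputes the class counts at each threshold rather than updating them incrementally, which a real implementation would do; the helper names are my own.

```python
# Find the best Annual Income split by scanning candidate thresholds between
# consecutive sorted values. Counts are recomputed per threshold for clarity.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]          # from the training table
labels  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

data = sorted(zip(incomes, labels))
thresholds = [(a + b) / 2 for (a, _), (b, _) in zip(data, data[1:])]

def split_gini(t):
    left  = [lab for inc, lab in data if inc <= t]
    right = [lab for inc, lab in data if inc > t]
    counts = lambda side: (side.count("Yes"), side.count("No"))
    n = len(data)
    return len(left) / n * gini(counts(left)) + len(right) / n * gini(counts(right))

best = min(thresholds, key=split_gini)
print(best, round(split_gini(best), 3))    # -> 97.5 0.3, i.e. the Gini = 0.300 split
```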
! Information Gain:
$\text{Gain}_{\text{split}} = \text{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\text{Entropy}(i)$
– Parent node $p$ is split into $k$ partitions (children); $n_i$ is the number of records in child node $i$ and $n$ is the number of records at the parent.
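As a concrete instance (my own worked sketch), the information gain of splitting the full training table on Home Owner; the branch class counts come from the table (Yes branch: 0 defaulted, 3 not; No branch: 3 defaulted, 4 not).

```python
# Information gain of splitting the training set on Home Owner.
# Parent class counts: Defaulted Yes = 3, No = 7.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

parent = (3, 7)
children = [(0, 3), (3, 4)]                       # (Yes, No) counts per branch
n = sum(parent)
gain = entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)
print(round(gain, 3))                             # ~ 0.192 bits
```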
! Gain Ratio:
$\text{Gain Ratio} = \frac{\text{Gain}_{\text{split}}}{\text{Split Info}}, \qquad \text{Split Info} = -\sum_{i=1}^{k} \frac{n_i}{n}\,\log_2 \frac{n_i}{n}$
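Continuing the same style of sketch (mine), the gain ratio of the three-way Marital Status split of the training table; the branch class counts are taken from the table (Single: 2 defaulted / 2 not, Married: 0 / 4, Divorced: 1 / 1).

```python
# Gain ratio of the 3-way Marital Status split of the training data.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

parent   = (3, 7)                                  # (Defaulted = Yes, No) overall
children = [(2, 2), (0, 4), (1, 1)]                # Single, Married, Divorced branches
n = sum(parent)

gain = entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)
split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children)
print(round(gain, 3), round(split_info, 3), round(gain / split_info, 3))
# gain ~ 0.281, split info ~ 1.522, gain ratio ~ 0.185
```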
! Classification error at node t:
$\text{Error}(t) = 1 - \max_i\, p_i(t)$
Misclassification Error vs. Gini Index

Parent node: C1 = 7, C2 = 3, Gini = 0.42, Error = 1 − 7/10 = 0.3

Split on A?:
  Node N1 (Yes branch): C1 = 3, C2 = 0
  Node N2 (No branch):  C1 = 4, C2 = 3

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves, but the misclassification error remains the same: the weighted error of the children is 3/10 × 0 + 7/10 × 3/7 = 0.3, exactly the parent's error!
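The same numbers can be reproduced in a few lines (my own sketch); note that the code prints 0.343 where the slide's intermediate rounding gives 0.342.

```python
# Reproduce the computation: Gini improves after the split on A?, but the
# misclassification error does not.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def error(counts):
    return 1.0 - max(counts) / sum(counts)

parent, n1, n2 = (7, 3), (3, 0), (4, 3)
n = sum(parent)
for name, f in [("gini", gini), ("error", error)]:
    before = f(parent)
    after = sum(sum(c) / n * f(c) for c in (n1, n2))
    print(name, round(before, 3), "->", round(after, 3))
# gini  0.42 -> 0.343   (improves; the slide rounds to 0.342)
# error 0.3  -> 0.3     (unchanged)
```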
Two candidate binary splits of the same parent (C1 = 7, C2 = 3, Gini = 0.42):

Split 1: N1 (C1 = 3, C2 = 0), N2 (C1 = 4, C2 = 3) → weighted Gini = 0.342
Split 2: N1 (C1 = 3, C2 = 1), N2 (C1 = 4, C2 = 2) → weighted Gini = 0.416

The first split is preferred because its children are purer (lower weighted Gini).
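A final short sketch (mine) comparing the two splits by their weighted Gini:

```python
# Weighted Gini of the two alternative splits of the same parent (C1 = 7, C2 = 3).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted(children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(weighted([(3, 0), (4, 3)]), 3))   # ~ 0.343 (0.342 on the slide)
print(round(weighted([(3, 1), (4, 2)]), 3))   # ~ 0.417 (0.416 on the slide)
# The first split has the lower weighted Gini and is therefore preferred.
```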