Week 6 Chap3 - Basic - Classification
Topics
▪ Introduction
▪ Decision Trees
—Overview
—Tree Induction
—Overfitting and other Practical Issues
▪ Model Evaluation
—Metrics for Performance Evaluation
—Methods to Obtain Reliable Estimates
—Model Comparison (Relative Performance)
▪ Feature Selection
▪ Class Imbalance
Supervised Learning
▪ Examples
—Input-output pairs: E = {(x1, y1), …, (xi, yi), …, (xN, yN)}.
—We assume that the examples are produced i.i.d. (with noise and errors) from a
target function y = f(x).
▪ Learning problem
—Given a hypothesis space H
—Find a hypothesis h ∈ H such that ŷi = h(xi) ≈ yi
—That is, we want to approximate 𝑓 by ℎ using E.
▪ Includes
—Regression (outputs = real numbers). Goal: Predict the number accurately.
E.g., x is a house and 𝑓(𝑥) is its selling price.
—Classification (outputs = class labels). Goal: Assign new records to a class.
E.g., 𝑥 is an email and 𝑓(𝑥) is spam / ham
▪ Task:
—Learn a model that maps each attribute set x into one of the predefined class labels y:
y = h(x)
Examples of Classification Task
▪ Predicting tumor cells as benign or malignant
▪ Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Example of a Decision Tree

Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Induced tree (splitting attributes: Refund, MarSt, TaxInc):

Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                ├─ < 80K → NO
                └─ > 80K → YES

There could be more than one tree that fits the same data!
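The induced tree above can be read as a simple nested-if program. A minimal sketch (incomes are in thousands, so the 80K threshold becomes 80; the function name is illustrative):

```python
def predict_cheat(refund, marital_status, taxable_income):
    """Walk the induced decision tree for one record.
    Attribute order and the 80K threshold follow the example tree above."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: fall through to the Taxable Income test
    return "Yes" if taxable_income > 80 else "No"

# Record 8 from the table: Refund = No, Single, 85K -> Cheat = Yes
print(predict_cheat("No", "Single", 85))
```

Running the function over all ten training records reproduces the Cheat column, which is exactly what tree induction aims for.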
Decision Tree: Deduction — Apply Model to Test Data

Test record:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch matching each attribute value:

Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                ├─ < 80K → NO
                └─ > 80K → YES
Refund = No → follow the “No” branch to MarSt. Marital Status = Married → reach the leaf NO.
Assign Cheat to “No”.
Decision Tree: Induction
Decision Tree Induction
Many Algorithms:
▪ Hunt’s Algorithm (one of the earliest)
▪ CART (Classification And Regression Tree)
▪ ID3, C4.5, C5.0 (by Ross Quinlan, information gain)
▪ CHAID (CHi-squared Automatic Interaction Detection)
▪ MARS (Improvement for numerical features)
▪ SLIQ, SPRINT
▪ Conditional Inference Trees (recursive partitioning using statistical
tests)
Hunt’s Algorithm

"Use attributes to split the data recursively, till each split contains only a single class."

Step 1 — split on Refund:
Refund?
├─ Yes → Don’t Cheat
└─ No → mixed

Step 2 — split the mixed node on Marital Status:
Refund?
├─ Yes → Don’t Cheat
└─ No → Marital Status?
        ├─ Married → Don’t Cheat
        └─ Single, Divorced → mixed

Step 3 — split the remaining mixed node on Taxable Income:
Refund?
├─ Yes → Don’t Cheat
└─ No → Marital Status?
        ├─ Married → Don’t Cheat
        └─ Single, Divorced → Taxable Income?
                ├─ < 80K → Don’t Cheat
                └─ >= 80K → Cheat
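The recursive procedure can be sketched in a few lines. This is a minimal illustration, not the full algorithm: it tests attributes in a fixed order instead of choosing the best split, handles only categorical attributes, and uses shortened column names:

```python
from collections import Counter

def hunt(records, attributes, label="Cheat"):
    """Recursive sketch of Hunt's algorithm for categorical attributes.
    `records` is a list of dicts; returns a nested dict, or a leaf label."""
    labels = [r[label] for r in records]
    if len(set(labels)) == 1:          # pure node -> leaf with that class
        return labels[0]
    if not attributes:                 # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]               # naive choice; real trees pick the best split
    tree = {}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        tree[value] = hunt(subset, attributes[1:], label)
    return {attr: tree}

data = [
    {"Refund": "Yes", "Marital": "Single",   "Cheat": "No"},
    {"Refund": "No",  "Marital": "Married",  "Cheat": "No"},
    {"Refund": "No",  "Marital": "Divorced", "Cheat": "Yes"},
]
print(hunt(data, ["Refund", "Marital"]))
```

On this toy data the sketch reproduces the shape of the tree above: Refund = Yes is immediately pure, and the Refund = No branch is split again on Marital status.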
General Structure of Hunt’s Algorithm

Training data:

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes

▪ Let Dt be the set of training records that reach a node t
▪ General Procedure:
—If Dt contains records that all belong to the same class yt, then t is a leaf node
labeled as yt.
—If Dt contains records that belong to more than one class, use an attribute test to
split the data into smaller subsets, and recursively apply the procedure to each subset.

Resulting tree:

Home Owner?
├─ Yes → Defaulted = No (3,0)
└─ No → Marital Status?
        ├─ Married → Defaulted = No
        └─ Single, Divorced → Annual Income?
                ├─ < 80K → Defaulted = No
                └─ >= 80K → Defaulted = Yes
Example 2: Creating a Decision Tree

[Scatter plot of two classes in the (x1, x2) plane: red X’s and blue circles (o’s);
the circles lie below x2 = 2.5, and a few circles sit above it with x1 < 2.]

Step 1 — split on X2 < 2.5:
X2 < 2.5?
├─ Yes → Blue circle (pure)
└─ No → mixed

Step 2 — split the mixed node on X1 < 2:
X2 < 2.5?
├─ Yes → Blue circle
└─ No → X1 < 2?
        ├─ Yes → Blue circle
        └─ No → Red X
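The finished tree is just two nested threshold tests. A minimal sketch (thresholds 2.5 and 2 are taken from the tree above; the function and class names are illustrative):

```python
def predict(x1, x2):
    """Classify a point with the two axis-aligned splits from Example 2."""
    if x2 < 2.5:
        return "blue circle"
    # x2 >= 2.5: the region is mixed, so test x1 next
    return "blue circle" if x1 < 2 else "red X"

print(predict(3.0, 1.0))   # below the x2 = 2.5 split
print(predict(1.0, 4.0))   # above the split, but x1 < 2
print(predict(3.0, 4.0))   # above the split, x1 >= 2
```

Each leaf of the tree corresponds to one axis-aligned rectangle of the plane, which is why decision trees produce rectangular decision boundaries.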
Tree Induction
▪ Greedy strategy
—Split the records based on an attribute test that optimizes a certain criterion.
▪ Issues
—Determine how to split the records using different attribute types.
—How to determine the best split?
—Determine when to stop splitting
Design Issues of Decision Tree Induction

▪ Multi-way split (e.g., Marital Status):
—Use as many partitions as distinct values.
▪ Binary split (e.g., Shirt Size):
—Divide the values into two subsets, e.g. {Small, Large} vs. {Medium, Extra Large}.
2/1/2021 Introduction to Data Mining, 2nd Edition 35
Test Condition for Continuous Attributes

▪ Binary split: Annual Income > 80K? → Yes / No
▪ Multi-way split: Annual Income? → ranges, e.g. < 10K, …, > 80K
How to determine the Best Split

▪ Greedy approach:
—Nodes with homogeneous class distribution are preferred.

C0: 5, C1: 5 → Non-homogeneous, high degree of impurity
C0: 9, C1: 1 → Homogeneous, low degree of impurity
Find the Best Split — General Framework

Assume we have a measure M that tells us how "pure" a node is.

Before splitting: node with counts C0: N00, C1: N01 → impurity M0.

Split on attribute A (Yes/No) → children with impurities M1, M2 → weighted impurity M12.
Split on attribute B (Yes/No) → children with impurities M3, M4 → weighted impurity M34.

Gain = M0 – M12 vs. M0 – M34 → choose the split with the larger gain.
Measures of Node Impurity

GINI(t) = 1 − Σ_j [p(j|t)]²

Example for computing GINI:

C1: 3, C2: 3 → P(C1) = 3/6 = 1/2, P(C2) = 3/6 = 1/2
Gini(t) = 1 − (1/2)² − (1/2)² = 0.500
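As a quick sketch, the GINI measure can be computed directly from the per-class counts at a node:

```python
def gini(counts):
    """Gini impurity of a node, GINI(t) = 1 - sum_j p(j|t)^2,
    computed from the per-class record counts at the node."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([3, 3]))  # maximally impure two-class node -> 0.5
print(gini([6, 0]))  # pure node -> 0.0
```

For two classes the Gini index ranges from 0 (pure) to 0.5 (an even split), matching the worked example above.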
For a split of node p into k children:

GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i = number of records at child i, and n = number of records at node p.
Splitting on attribute B:

Parent: C1 = 6, C2 = 6 → Gini = 0.500

B? → Node N1 (Yes), Node N2 (No)

      N1   N2
C1     5    1
C2     3    3

Gini(N1) = 1 – (5/8)² – (3/8)² = 0.469
Gini(N2) = 1 – (1/4)² – (3/4)² = 0.375

Weighted Gini of N1, N2 = 8/12 × 0.469 + 4/12 × 0.375 = 0.438

Gain = 0.500 – 0.438 = 0.062 → GINI improves!
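The computation above can be checked with a short sketch; `gini_split` is a hypothetical helper implementing the GINI_split formula from this section:

```python
def gini(counts):
    """Gini impurity from per-class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(children):
    """Weighted Gini of a split, GINI_split = sum_i (n_i / n) * GINI(i),
    where `children` is a list of per-class count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [6, 6]                # C1 = 6, C2 = 6 at the parent -> Gini = 0.5
children = [[5, 3], [1, 3]]    # split on B: N1 = (5, 3), N2 = (1, 3)
gain = gini(parent) - gini_split(children)
print(gini_split(children))    # weighted Gini, ~ 0.438
print(gain)                    # gain, ~ 0.062
```

The split is kept because the weighted child impurity (0.438) is lower than the parent impurity (0.500).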
Hunt’s Algorithm — worked example

Using the Defaulted Borrower training data above:

Step 1 — the root holds all records: Defaulted = No (7,3).

Step 2 — split on Home Owner:
Home Owner?
├─ Yes → Defaulted = No (3,0)  (pure)
└─ No → Defaulted = No (4,3)  (still mixed)

Step 3 — split the mixed node on Marital Status, then on Annual Income:
Home Owner?
├─ Yes → Defaulted = No (3,0)
└─ No → Marital Status?
        ├─ Married → Defaulted = No
        └─ Single, Divorced → Annual Income?
                ├─ < 80K → Defaulted = No
                └─ >= 80K → Defaulted = Yes
Continuous Attributes: Computing Gini Index

Sort the records by Annual Income and evaluate the Gini index at each candidate split
position (the midpoints between consecutive values). Counts of “No” records on each
side of the split (≤ split, > split), with the resulting Gini:

No:   (0,7) (1,6) (2,5) (3,4) (3,4) (3,4) (3,4) (4,3) (5,2) (6,1) (7,0)
Gini: 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

Choose the split position with the minimum Gini (here 0.300).
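The scan over candidate thresholds can be sketched as follows, using the (Annual Income, Cheat) pairs from the example table earlier in this section; `best_income_split` is a hypothetical helper name:

```python
def gini(counts):
    """Gini impurity from per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_income_split(records):
    """Scan candidate thresholds (midpoints between consecutive sorted values)
    and return (minimum weighted Gini, threshold), as on the slide."""
    records = sorted(records)            # (income, label) pairs
    n = len(records)
    best = (float("inf"), None)
    for i in range(1, n):
        left = [lab for _, lab in records[:i]]
        right = [lab for _, lab in records[i:]]
        counts = lambda labs: [labs.count("Yes"), labs.count("No")]
        g = (len(left) * gini(counts(left)) + len(right) * gini(counts(right))) / n
        threshold = (records[i - 1][0] + records[i][0]) / 2
        if g < best[0]:
            best = (g, threshold)
    return best

# Annual Income (in thousands) and Cheat labels from the example table
data = [(125, "No"), (100, "No"), (70, "No"), (120, "No"), (95, "Yes"),
        (60, "No"), (220, "No"), (85, "Yes"), (75, "No"), (90, "Yes")]
print(best_income_split(data))   # minimum Gini 0.3, threshold 97.5
```

The naive version recomputes the class counts at every position; the usual optimization is a single linear scan that updates the counts incrementally as records move from the right partition to the left.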