
Classification - Basic Concepts

Lecture Notes for Chapter 3


Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Look for accompanying R code on the course web site.
Topics

▪ Introduction
▪ Decision Trees
—Overview
—Tree Induction
—Overfitting and other Practical Issues
▪ Model Evaluation
—Metrics for Performance Evaluation
—Methods to Obtain Reliable Estimates
—Model Comparison (Relative Performance)
▪ Feature Selection
▪ Class Imbalance
Supervised Learning

▪ Examples
—Input-output pairs: E = {(x_1, y_1), ..., (x_i, y_i), ..., (x_N, y_N)}.
—We assume that the examples are produced iid (with noise and errors) from a
target function y = f(x).
▪ Learning problem
—Given a hypothesis space H
—Find a hypothesis h ∈ H such that ŷ_i = h(x_i) ≈ y_i
—That is, we want to approximate f by h using E.

▪ Includes
—Regression (outputs = real numbers). Goal: Predict the number accurately.
E.g., x is a house and 𝑓(𝑥) is its selling price.
—Classification (outputs = class labels). Goal: Assign new records to a class.
E.g., 𝑥 is an email and 𝑓(𝑥) is spam / ham

You already know linear regression. We focus on Classification.


Classification: Definition

▪ Given a collection of records (training set)
—Each record is characterized by a tuple (x, y), where x is the attribute set and y is the
class label
◆ x: attribute, predictor, independent variable, input
◆ y: class, response, dependent variable, output

▪ Task:
—Learn a model that maps each attribute set x
into one of the predefined class labels y (see the R sketch below)

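As a minimal sketch of this learning task in R (assuming a hypothetical data frame train whose column y holds the class labels), a decision tree learner such as rpart can induce the model:

```r
# Minimal sketch: learn a model that maps the attribute set x to a class label y.
# `train` is a hypothetical data frame with predictor columns and a factor column `y`.
library(rpart)

h     <- rpart(y ~ ., data = train, method = "class")  # induce the model from the training set
y_hat <- predict(h, train, type = "class")              # apply h to attribute sets: y_hat = h(x)
mean(y_hat == train$y)                                   # resubstitution (training) accuracy
```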


Illustrating Classification Task
A learning algorithm uses the examples E (attribute sets x with class labels y) to induce a model; the model is then applied to new records to predict ŷ = h(x).
Examples of Classification Task

▪ Predicting tumor cells as benign or malignant
▪ Classifying credit card transactions as legitimate or fraudulent
▪ Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
▪ Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques

▪ Decision Tree based Methods
▪ Rule-based Methods
▪ Memory based reasoning
▪ Neural Networks / Deep Learning
▪ Naïve Bayes and Bayesian Belief Networks
▪ Support Vector Machines


Topics

▪ Introduction
▪ Decision Trees
—Overview
—Tree Induction
—Overfitting and other Practical Issues
▪ Model Evaluation
—Metrics for Performance Evaluation
—Methods to Obtain Reliable Estimates
—Model Comparison (Relative Performance)
▪ Feature Selection
▪ Class Imbalance
Example of a Decision Tree

Training Data:

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc):

Refund = Yes → NO
Refund = No:
  MarSt = Married → NO
  MarSt = Single or Divorced:
    TaxInc < 80K → NO
    TaxInc > 80K → YES


Another Example of Decision Tree

Using the same training data, a different tree also fits:

MarSt = Married → NO
MarSt = Single or Divorced:
  Refund = Yes → NO
  Refund = No:
    TaxInc < 80K → NO
    TaxInc > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree: Deduction

In the deduction step, the induced decision tree is applied to test records to predict their class labels.
Apply Model to Test Data

Test Data:

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Start from the root of the tree and follow the branch that matches the test record at each node:
—Refund = No → go to the MarSt node
—MarSt = Married → reach a leaf → Assign Cheat to "No"
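The same traversal can be written directly in R. A small sketch that hand-codes the tree from these slides (the slide does not specify the boundary case of exactly 80K; the > 80K branch is assumed here):

```r
# The decision tree from the slides, encoded by hand.
classify_cheat <- function(refund, marital_status, taxable_income) {
  if (refund == "Yes") return("No")               # Refund = Yes -> leaf NO
  if (marital_status == "Married") return("No")   # MarSt = Married -> leaf NO
  if (taxable_income < 80) "No" else "Yes"        # TaxInc split (>= 80K assumed to go to YES)
}

# Test record: Refund = No, Marital Status = Married, Taxable Income = 80K
classify_cheat("No", "Married", 80)   # Refund = No -> MarSt = Married -> Cheat = "No"
```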
Topics

▪ Introduction
▪ Decision Trees
—Overview
—Tree Induction
—Overfitting and other Practical Issues
▪ Model Evaluation
—Metrics for Performance Evaluation
—Methods to Obtain Reliable Estimates
—Model Comparison (Relative Performance)
▪ Feature Selection
▪ Class Imbalance
Decision Tree: Induction

In the induction (learning) step, a tree-induction algorithm builds the decision tree from the training set.
Decision Tree Induction

Many Algorithms:
▪ Hunt’s Algorithm (one of the earliest)
▪ CART (Classification And Regression Tree)
▪ ID3, C4.5, C5.0 (by Ross Quinlan, information gain)
▪ CHAID (CHi-squared Automatic Interaction Detection)
▪ MARS (Improvement for numerical features)
▪ SLIQ, SPRINT
▪ Conditional Inference Trees (recursive partitioning using statistical
tests)
Hunt’s Algorithm
"Use attributes to split the data recursively, till
each split contains only a single class."

The tree is grown in stages:

—Start with a single node containing all records (mixed).
—Split on Refund: Yes → Don't Cheat; No → still mixed.
—Split the mixed node on Marital Status: Married → Don't Cheat; Single, Divorced → still mixed.
—Split the remaining mixed node on Taxable Income: < 80K → Don't Cheat; >= 80K → Cheat.
General Structure of Hunt’s Algorithm
▪ Let Dt be the set of training records that reach a node t.

▪ General procedure:
—If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
—If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset (a sketch in R follows the table).

Training data (Dt at the root node):

ID   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes
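A compact sketch of this recursive procedure in R is shown below. The split-selection helper best_split() is hypothetical; it stands in for the attribute-test selection and impurity measures discussed on the following slides:

```r
# Sketch of Hunt's algorithm: recursively split D_t until each node is pure (simplified).
# `data` is a data frame, `class` the name of the label column, `attrs` the candidate attributes.
# best_split() is a hypothetical helper returning the attribute test with the best purity gain.
hunt <- function(data, class, attrs) {
  labels <- data[[class]]
  if (length(unique(labels)) == 1 || length(attrs) == 0) {
    return(list(leaf = TRUE, label = names(which.max(table(labels)))))  # leaf node labeled y_t
  }
  s <- best_split(data, class, attrs)                  # hypothetical: choose an attribute test
  parts <- split(data, s$test(data))                   # partition D_t into smaller subsets
  children <- lapply(parts, hunt, class = class, attrs = setdiff(attrs, s$attribute))
  list(leaf = FALSE, split = s, children = children)   # internal node with one child per subset
}
```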


Hunt’s Algorithm

Applied to the training data above, the tree grows in four stages (class counts are shown as (Defaulted = No, Defaulted = Yes)):

(a) A single leaf: Defaulted = No (7,3), still impure.
(b) Split on Home Owner: Yes → Defaulted = No (3,0); No → Defaulted = No (4,3), still impure.
(c) Split the Home Owner = No branch on Marital Status: Married → Defaulted = No (3,0); Single, Divorced → Defaulted = Yes (1,3).
(d) Split the Single/Divorced branch on Annual Income: < 80K → Defaulted = No (1,0); >= 80K → Defaulted = Yes (0,3).
Example 2: Creating a Decision Tree

Two-dimensional data with attributes x1 and x2: the red X's lie above x2 = 2.5 and to the right of x1 = 2; the blue circles (o) fill the rest of the plane.

First split: X2 < 2.5?
  Yes → Blue circle (pure)
  No → Mixed

Second split (of the mixed node): X1 < 2?
  Yes → Blue circle
  No → Red X
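A sketch in R, using a small synthetic data set whose coordinates are made up to mimic the figure (blue circles "o" below x2 = 2.5 or left of x1 = 2, red X's otherwise):

```r
# Sketch: learn axis-parallel splits on synthetic 2-D data (coordinates are illustrative only).
library(rpart)

train <- data.frame(
  x1 = c(0.5, 1.0, 1.5, 3.0, 3.5, 2.0, 1.0, 1.5,  2.5, 3.0, 3.5, 2.8, 3.2, 4.0),
  x2 = c(0.8, 1.2, 1.8, 0.6, 1.5, 2.0, 3.0, 3.5,  3.0, 3.5, 4.0, 4.2, 3.2, 4.5),
  y  = factor(c(rep("o", 8), rep("x", 6)))
)

# Tiny data set, so allow very small nodes; rpart then searches for the best threshold splits.
fit <- rpart(y ~ x1 + x2, data = train, method = "class",
             control = rpart.control(minsplit = 2, cp = 0))
print(fit)   # prints threshold splits analogous to "X2 < 2.5" and "X1 < 2" in the figure
```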
Tree Induction

▪ Greedy strategy
—Split the records based on an attribute test that optimizes a certain criterion.

▪ Issues
—Determine how to split the records using different attribute types.
—How to determine the best split?
—Determine when to stop splitting
Design Issues of Decision Tree Induction

▪ How should training records be split?
—Method for expressing the test condition
  ◆ depending on attribute types
—Measure for evaluating the goodness of a test condition

▪ How should the splitting procedure stop?
—Stop splitting if all the records belong to the same class or have identical attribute values
—Early termination
Methods for Expressing Test Conditions

▪ Depends on attribute types
—Binary
—Nominal
—Ordinal
—Continuous



Test Condition for Nominal Attributes

▪ Multi-way split: use as many partitions as distinct values.
  Marital Status → Single | Divorced | Married

▪ Binary split: divides values into two subsets.
  Marital Status → {Married} vs. {Single, Divorced}
  OR {Single} vs. {Married, Divorced}
  OR {Single, Married} vs. {Divorced}


Test Condition for Ordinal Attributes

▪ Multi-way split: use as many partitions as distinct values.
  Shirt Size → Small | Medium | Large | Extra Large

▪ Binary split: divides values into two subsets and preserves the order property among attribute values.
  Shirt Size → {Small, Medium} vs. {Large, Extra Large}
  OR {Small} vs. {Medium, Large, Extra Large}

  The grouping {Small, Large} vs. {Medium, Extra Large} violates the order property.
Test Condition for Continuous Attributes

(i) Binary split: Annual Income > 80K? → Yes / No

(ii) Multi-way split: Annual Income? → < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
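Both forms can be expressed in R; a sketch using the taxable-income values from the earlier training table (in thousands):

```r
# Sketch: test conditions on a continuous attribute (values in thousands).
annual_income <- c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90)

# (i) Binary split: Annual Income > 80K?
binary_test <- annual_income > 80

# (ii) Multi-way split: discretize into ordered ranges.
multiway_test <- cut(annual_income,
                     breaks = c(-Inf, 10, 25, 50, 80, Inf), right = FALSE,
                     labels = c("<10K", "[10K,25K)", "[25K,50K)", "[50K,80K)", ">=80K"))
table(multiway_test)
```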


Tree Induction

▪ Greedy strategy
—Split the records based on an attribute test that optimizes a certain criterion.

▪ Issues
—Determine how to split the records using different attribute types.
—How to determine the best split?
—Determine when to stop splitting
How to determine the Best Split

Before Splitting: 10 records of class 0 (C0: 10) and 10 records of class 1 (C1: 10).

Which test condition is the best?


How to determine the Best Split

Before Splitting: 10 records of class 0 and 10 records of class 1.

Three candidate attribute tests:

Split on Gender (two values):   {C0: 6, C1: 4} and {C0: 4, C1: 6}
Split on Car Type:              Family {C0: 1, C1: 3}, Sports {C0: 8, C1: 0}, Luxury {C0: 1, C1: 7}
Split on Customer ID:           c1 {C0: 1, C1: 0}, ..., c10 {C0: 1, C1: 0}, c11 {C0: 0, C1: 1}, ..., c20 {C0: 0, C1: 1}

Which test condition is the best?
How to determine the Best Split

▪ Greedy approach:
—Nodes with homogeneous class distribution are preferred

▪ Need a measure of node impurity:

C0: 5, C1: 5 → Non-homogeneous, high degree of impurity
C0: 9, C1: 1 → Homogeneous, low degree of impurity
Find the Best Split - General Framework

Assume we have a measure M that tells us how "pure" a node is.

Before splitting: the parent node has class counts C0: N00, C1: N01 and impurity M0.

Split on attribute A (Yes/No) → Node N1 (C0: N10, C1: N11) and Node N2 (C0: N20, C1: N21), with impurities M1 and M2 and weighted impurity M12.

Split on attribute B (Yes/No) → Node N3 (C0: N30, C1: N31) and Node N4 (C0: N40, C1: N41), with impurities M3 and M4 and weighted impurity M34.

Gain = M0 – M12 vs. M0 – M34 → Choose the split with the larger gain.
Measures of Node Impurity

▪ Gini Index
▪ Entropy
▪ Classification error
Measure of Impurity: GINI

▪ Gini Index for a given node t:

  GINI(t) = Σ_j p(j|t)(1 − p(j|t)) = 1 − Σ_j [p(j|t)]²

  where p(j|t) is estimated as the relative frequency of class j at node t.

▪ Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.
▪ Maximum of 1 − 1/n_c (n_c = number of classes) when records are equally distributed among all classes = maximal impurity.
▪ Minimum of 0 when all records belong to one class = complete purity.
▪ Examples:

  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500
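A direct R translation of the formula; the class counts are the ones from the examples above:

```r
# Gini index of a node, given a vector of class counts at that node.
gini <- function(counts) {
  p <- counts / sum(counts)   # p(j | t): relative frequency of class j at node t
  1 - sum(p^2)
}

gini(c(0, 6))   # 0.000  (pure node)
gini(c(1, 5))   # 0.278
gini(c(2, 4))   # 0.444
gini(c(3, 3))   # 0.500  (maximal impurity for two classes: 1 - 1/2)
```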
Examples for computing GINI

GINI(t) = 1 − Σ_j [p(j|t)]²

C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
               Gini = 1 − 0² − 1² = 0

C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
               Gini = 1 − (1/6)² − (5/6)² = 0.278

C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
               Gini = 1 − (2/6)² − (4/6)² = 0.444

C1: 3, C2: 3   P(C1) = 3/6 = 0.5,  P(C2) = 3/6 = 0.5
               Gini = 1 − 0.25 − 0.25 = 0.5

Maximal impurity for two classes is 1/2 = 0.5.
Splitting Based on GINI

When a node p is split into k partitions (children), the quality of the split is computed as a weighted sum:

  GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i = number of records at child i, and n = number of records at node p.

Used in the algorithms CART, SLIQ, SPRINT.
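The weighted sum translates directly into R, reusing the gini() helper defined above:

```r
# Weighted Gini of a split: `children` is a list of class-count vectors, one per child node.
gini_split <- function(children) {
  n_i <- sapply(children, sum)                    # n_i: number of records at child i
  sum(n_i / sum(n_i) * sapply(children, gini))    # sum_i (n_i / n) * GINI(i)
}
```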


Binary Attributes: Computing GINI Index

▪ Splits into two partitions
▪ Effect of weighing partitions: larger and purer partitions are sought for.

Parent: C1: 6, C2: 6 → Gini = 0.500

Split on B:
  Node N1: C1: 5, C2: 3 → Gini(N1) = 1 − (5/8)² − (3/8)² = 0.469
  Node N2: C1: 1, C2: 3 → Gini(N2) = 1 − (1/4)² − (3/4)² = 0.375

Weighted Gini of the children = 8/12 × 0.469 + 4/12 × 0.375 = 0.438 → GINI improves!

Gain = 0.5 – 0.438 = 0.062
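Using the two helpers above, the numbers on this slide can be reproduced:

```r
# Reproducing the binary split example (parent node: C1 = 6, C2 = 6).
parent   <- c(6, 6)
children <- list(N1 = c(5, 3), N2 = c(1, 3))

gini(parent)                          # 0.500
sapply(children, gini)                # N1 = 0.469, N2 = 0.375
gini_split(children)                  # 8/12 * 0.469 + 4/12 * 0.375 = 0.438
gini(parent) - gini_split(children)   # gain = 0.062
```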
▪ Δ, the gain in purity of an attribute test condition: Δ = I(parent) − I(children), where I(children) is the weighted impurity of the child nodes. The higher the gain, the purer the classes in the child nodes are relative to the parent node. The splitting criterion in the decision tree learning algorithm selects the attribute test condition that shows the maximum gain. Note that maximizing the gain at a given node is equivalent to minimizing the weighted impurity measure of its children, since I(parent) is the same for all candidate attribute test conditions.
▪ When entropy is used as the impurity measure, the difference in entropy is commonly known as the information gain.
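A sketch of the gain computation in R; entropy() is used as the impurity measure I, and the gini() helper defined earlier can be substituted for it:

```r
# Gain of an attribute test: Delta = I(parent) - sum_i (n_i / n) * I(child_i).
# With entropy as the impurity measure, Delta is the information gain.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                 # avoid 0 * log(0)
  -sum(p * log2(p))
}

gain <- function(parent, children, impurity = entropy) {
  n_i <- sapply(children, sum)
  impurity(parent) - sum(n_i / sum(n_i) * sapply(children, impurity))
}

gain(c(6, 6), list(c(5, 3), c(1, 3)))                    # information gain of the earlier split
gain(c(6, 6), list(c(5, 3), c(1, 3)), impurity = gini)   # the same split scored with Gini (0.062)
```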
Continuous Attributes: Computing Gini Index

▪ Use binary decisions based on one value


▪ Several choices for the splitting value 𝑣
—Number of possible splitting values
= Number of distinct values

▪ Simple method to choose best 𝑣 :


—For each 𝑣, scan the database to gather
count matrix and compute its Gini index
—Computationally inefficient! Repetition of work.

Example test condition: Taxable Income > v (e.g., v = 80K)? → Yes / No
Continuous Attributes: Computing Gini Index

▪ For efficient computation: for each attribute,


—Sort the attribute on values
—Linearly scan these values, each time updating the count matrix and
computing gini index
—Choose the split position that has the least gini index

Cheat              No    No    No    Yes   Yes   Yes   No    No    No    No
Taxable Income
(sorted values)    60    70    75    85    90    95    100   120   125   220

Candidate split positions v and class counts on each side (shown as "<= v | > v"):

v      55     65     72     80     87     92     97     110    122    172    230
Yes    0|3    0|3    0|3    0|3    1|2    2|1    3|0    3|0    3|0    3|0    3|0
No     0|7    1|6    2|5    3|4    3|4    3|4    3|4    4|3    5|2    6|1    7|0
Gini   0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The best split position is v = 97, with the lowest weighted Gini of 0.300.
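A sketch of this sorted scan in R, using the Cheat / Taxable Income example above and the gini() and gini_split() helpers defined earlier:

```r
# Scan candidate split positions for a continuous attribute and pick the lowest weighted Gini.
income <- c(60, 70, 75, 85, 90, 95, 100, 120, 125, 220)            # sorted attribute values
cheat  <- c("No","No","No","Yes","Yes","Yes","No","No","No","No")  # class labels, same order

# Candidate split positions between consecutive sorted values (as on the slide; the outer
# positions 55 and 230 produce an empty partition and are omitted here).
v <- c(65, 72, 80, 87, 92, 97, 110, 122, 172)

weighted_gini <- sapply(v, function(s) {
  left  <- table(factor(cheat[income <= s], levels = c("Yes", "No")))
  right <- table(factor(cheat[income >  s], levels = c("Yes", "No")))
  gini_split(list(left, right))
})

data.frame(split = v, gini = round(weighted_gini, 3))
v[which.min(weighted_gini)]   # best split position: 97 (weighted Gini = 0.300)
```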
