CSC4316 7and8


Introduction to Data Mining:
Classification: Basic Concepts and Techniques

By Habeebah Adamu Kakudi (PhD)

MAY TO JUNE 2021 CSC4316: DATA MINING


Reference Textbooks

  – Data Mining: Practical Machine Learning Tools and Techniques, by I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal
  – Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar


Classification: Definition

• Given a collection of records (training set)
  – Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
    ◦ x: attribute, predictor, independent variable, input
    ◦ y: class, response, dependent variable, output

• Task:
  – Learn a model that maps each attribute set x into one of the predefined class labels y


Examples of Classification Task

Task                          Attribute set, x                                            Class label, y
Categorizing email messages   Features extracted from email message header and content   spam or non-spam
Identifying tumor cells       Features extracted from x-rays or MRI scans                 malignant or benign cells
Cataloging galaxies           Features extracted from telescope images                    elliptical, spiral, or irregular-shaped galaxies


General Approach for Building a Classification Model


Classification Techniques

• Base Classifiers
  – Decision Tree based Methods
  – Rule-based Methods
  – Nearest-neighbor
  – Naïve Bayes and Bayesian Belief Networks
  – Support Vector Machines
  – Neural Networks, Deep Neural Nets

• Ensemble Classifiers
  – Boosting, Bagging, Random Forests


Example of a Decision Tree

Training Data (Home Owner: categorical; Marital Status: categorical; Annual Income: continuous; Defaulted Borrower: class)

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)

Home Owner
  Yes: NO
  No: MarSt
    Single, Divorced: Income
      < 80K: NO
      > 80K: YES
    Married: NO


Apply Model to Test Data

Test Data
Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Start from the root of tree.

Home Owner
  Yes: NO
  No: MarSt
    Single, Divorced: Income
      < 80K: NO
      > 80K: YES
    Married: NO




Apply Model to Test Data

Test Data
Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Follow the branch that matches the record at each node: Home Owner = No leads to the MarSt node, and MarSt = Married leads to the leaf NO.

Assign Defaulted to “No”
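
The traversal can be sketched in Python. This is a minimal sketch, assuming the tree from "Example of a Decision Tree"; the function name `classify` and the dictionary keys are illustrative, not part of the slides. The slide labels the income branches "< 80K" and "> 80K", so sending an income of exactly 80K to the upper branch is an assumption.

```python
def classify(record):
    """Walk the example decision tree and return the predicted 'Defaulted' label."""
    if record["home_owner"] == "Yes":
        return "No"
    if record["marital_status"] == "Married":
        return "No"
    # Single or Divorced: test Annual Income (in thousands).
    # Assumption: an income of exactly 80K goes to the upper branch.
    return "No" if record["annual_income"] < 80 else "Yes"


test_record = {"home_owner": "No", "marital_status": "Married", "annual_income": 80}
prediction = classify(test_record)
print(prediction)
```

The test record never reaches the Income node, so the 80K boundary question does not affect its prediction.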


Another Example of Decision Tree

Training Data: the same ten records (IDs 1 to 10) as in "Example of a Decision Tree" (Home Owner: categorical; Marital Status: categorical; Annual Income: continuous; Defaulted Borrower: class)

MarSt
  Married: NO
  Single, Divorced: Home Owner
    Yes: NO
    No: Income
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!


Decision Tree Classification Task

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: Training Set -> Tree Induction algorithm -> Learn Model -> Model (Decision Tree)

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: Test Set -> Apply Model (Decision Tree)


Decision Tree Induction

• Many Algorithms:
  – Hunt’s Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT


General Structure of Hunt’s Algorithm

• Let Dt be the set of training records that reach a node t

• General Procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

Training data (Dt at the root): the ten records (IDs 1 to 10) from "Example of a Decision Tree".
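
The two cases of the general procedure can be sketched as a short recursive function. This is a sketch under stated assumptions, not a definitive implementation: the general structure does not say which attribute test to use, so the sketch simply applies a caller-supplied list of tests in a fixed order (choosing tests by impurity is covered on later slides). The names `hunt`, `records`, and `tests` are illustrative.

```python
from collections import Counter


def hunt(records, tests):
    """records: list of (attributes, label) pairs.
    tests: ordered list of (name, function) pairs; each function maps the
    attributes of a record to a branch key (an assumption of this sketch)."""
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    # Case 1: all records share one class (or no tests remain) -> leaf node.
    if len(set(labels)) == 1 or not tests:
        return majority
    # Case 2: more than one class -> split with the next attribute test
    # and recursively apply the procedure to each subset.
    (name, test), remaining = tests[0], tests[1:]
    subsets = {}
    for attributes, label in records:
        subsets.setdefault(test(attributes), []).append((attributes, label))
    return {name: {key: hunt(subset, remaining) for key, subset in subsets.items()}}


# The ten training records from the slides (income in thousands).
records = [
    ({"home_owner": "Yes", "marital": "Single", "income": 125}, "No"),
    ({"home_owner": "No", "marital": "Married", "income": 100}, "No"),
    ({"home_owner": "No", "marital": "Single", "income": 70}, "No"),
    ({"home_owner": "Yes", "marital": "Married", "income": 120}, "No"),
    ({"home_owner": "No", "marital": "Divorced", "income": 95}, "Yes"),
    ({"home_owner": "No", "marital": "Married", "income": 60}, "No"),
    ({"home_owner": "Yes", "marital": "Divorced", "income": 220}, "No"),
    ({"home_owner": "No", "marital": "Single", "income": 85}, "Yes"),
    ({"home_owner": "No", "marital": "Married", "income": 75}, "No"),
    ({"home_owner": "No", "marital": "Single", "income": 90}, "Yes"),
]

tests = [
    ("Home Owner", lambda a: a["home_owner"]),
    ("Marital Status", lambda a: "Married" if a["marital"] == "Married" else "Single/Divorced"),
    ("Annual Income", lambda a: "< 80K" if a["income"] < 80 else ">= 80K"),
]

tree = hunt(records, tests)
print(tree)
```

With this test order the recursion stops on the Home Owner = Yes and Married branches as soon as they become pure, and only the Single/Divorced branch is split on income.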


Hunt’s Algorithm

Applied to the ten training records from "Example of a Decision Tree". Class counts at each node are shown as (Defaulted = No, Defaulted = Yes).

(a) A single leaf:
    Defaulted = No (7,3)

(b) Split on Home Owner:
    Yes: Defaulted = No (3,0)
    No: Defaulted = No (4,3)

(c) Split the Home Owner = No branch on Marital Status:
    Yes: Defaulted = No (3,0)
    No: Marital Status
      Single, Divorced: Defaulted = Yes (1,3)
      Married: Defaulted = No (3,0)

(d) Split the Single, Divorced branch on Annual Income:
    Yes: Defaulted = No (3,0)
    No: Marital Status
      Single, Divorced: Annual Income
        < 80K: Defaulted = No (1,0)
        >= 80K: Defaulted = Yes (0,3)
      Married: Defaulted = No (3,0)

Design Issues of Decision Tree Induction

• How should training records be split?
  – Method for expressing test condition
    ◦ depending on attribute types
  – Measure for evaluating the goodness of a test condition

• How should the splitting procedure stop?
  – Stop splitting if all the records belong to the same class or have identical attribute values
  – Early termination

Methods for Expressing Test Conditions

• Depends on attribute types
  – Binary
  – Nominal
  – Ordinal
  – Continuous


Test Condition for Nominal Attributes

• Multi-way split:
  – Use as many partitions as distinct values.
    Marital Status: {Single} | {Divorced} | {Married}

• Binary split:
  – Divides values into two subsets
    Marital Status: {Married} | {Single, Divorced}
    OR Marital Status: {Single} | {Married, Divorced}
    OR Marital Status: {Single, Married} | {Divorced}
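
The candidate binary splits of a nominal attribute can be enumerated mechanically. The sketch below is illustrative (the function name `binary_splits` is an assumption, not from the slides); it lists each way to divide k distinct values into two non-empty subsets exactly once, which gives 2^(k-1) - 1 candidates.

```python
from itertools import combinations


def binary_splits(values):
    """Return every way to divide distinct nominal values into two non-empty subsets."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    splits = []
    # Keep the first value in the left subset so mirrored splits are not listed twice.
    # The left subset never takes all of `rest`, so the right subset is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(values) - left
            splits.append((left, right))
    return splits


splits = binary_splits(["Single", "Divorced", "Married"])
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```

For Marital Status this produces the three groupings shown on the slide.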


Test Condition for Ordinal Attributes

• Multi-way split:
  – Use as many partitions as distinct values
    Shirt Size: {Small} | {Medium} | {Large} | {Extra Large}

• Binary split:
  – Divides values into two subsets
  – Preserve order property among attribute values
    Shirt Size: {Small, Medium} | {Large, Extra Large}
    Shirt Size: {Small} | {Medium, Large, Extra Large}
    Shirt Size: {Small, Large} | {Medium, Extra Large} (this grouping violates order property)

Test Condition for Continuous Attributes

(i) Binary split
    Annual Income > 80K?: Yes | No

(ii) Multi-way split
    Annual Income?: < 10K | [10K,25K) | [25K,50K) | [50K,80K) | > 80K


Splitting Based on Continuous Attributes

• Different ways of handling
  – Discretization to form an ordinal categorical attribute.
    Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
    ◦ Static – discretize once at the beginning
    ◦ Dynamic – repeat at each node
  – Binary Decision: (A < v) or (A ≥ v)
    ◦ consider all possible splits and find the best cut
    ◦ can be more compute intensive

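
The two bucketing schemes named above can be sketched as follows, using the Annual Income values (in thousands) from the training data. This is a minimal sketch: the function names are illustrative, and the convention that a value falls in the bucket below an edge when it is strictly less than that edge is an assumption.

```python
def equal_width_edges(values, k):
    """Interior bucket edges for k buckets of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Interior bucket edges so each bucket holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
width_edges = equal_width_edges(incomes, 2)
frequency_edges = equal_frequency_edges(incomes, 2)
print(width_edges, frequency_edges)
```

With two buckets, the equal-width edge falls at 140, leaving nine of the ten records in the lower bucket because of the single 220K income, while the equal-frequency edge at 95 puts five records on each side.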
How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Gender
  Yes: C0: 6, C1: 4
  No: C0: 4, C1: 6

Car Type
  Family: C0: 1, C1: 3
  Sports: C0: 8, C1: 0
  Luxury: C0: 1, C1: 7

Customer ID
  c1 ... c10: each C0: 1, C1: 0
  c11 ... c20: each C0: 0, C1: 1

Which test condition is the best?

How to determine the Best Split

• Greedy approach:
  – Nodes with purer class distribution are preferred

• Need a measure of node impurity:
    C0: 5, C1: 5   High degree of impurity
    C0: 9, C1: 1   Low degree of impurity


Measures of Node Impurity

• Gini Index
    $Gini\ Index = 1 - \sum_{i=0}^{c-1} p_i(t)^2$

• Entropy
    $Entropy = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$

• Misclassification error
    $Classification\ error = 1 - \max_i [p_i(t)]$

Where $p_i(t)$ is the relative frequency of class $i$ at node $t$, and $c$ is the total number of classes.
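
The three measures can be computed directly from the class counts at a node. The sketch below is illustrative; the function names are assumptions, and empty classes are skipped in the entropy sum under the usual convention that 0 log 0 = 0.

```python
from math import log2


def proportions(counts):
    """Relative frequency of each class at a node, from raw class counts."""
    total = sum(counts)
    return [c / total for c in counts]


def gini(counts):
    return 1.0 - sum(p * p for p in proportions(counts))


def entropy(counts):
    # Convention: 0 * log2(0) is taken as 0, so empty classes are skipped.
    return sum(p * log2(1 / p) for p in proportions(counts) if p > 0)


def classification_error(counts):
    return 1.0 - max(proportions(counts))


for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts,
          round(gini(counts), 3),
          round(entropy(counts), 3),
          round(classification_error(counts), 3))
```

All three measures are 0 for a pure node such as [0, 6] and largest for an evenly mixed node such as [3, 3].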


Finding the Best Split

1. Compute impurity measure (P) before splitting
2. Compute impurity measure (M) after splitting
   – Compute impurity measure of each child node
   – M is the weighted impurity of child nodes
3. Choose the attribute test condition that produces the highest gain

   Gain = P - M

   or equivalently, lowest impurity measure after splitting (M)


Finding the Best Split

Before Splitting: class counts C0: N00, C1: N01, with impurity P

A?
  Yes: Node N1 (C0: N10, C1: N11), impurity M11
  No: Node N2 (C0: N20, C1: N21), impurity M12
  Weighted impurity of N1 and N2: M1

B?
  Yes: Node N3 (C0: N30, C1: N31), impurity M21
  No: Node N4 (C0: N40, C1: N41), impurity M22
  Weighted impurity of N3 and N4: M2

Gain = P - M1 vs P - M2

Measure of Impurity: GINI

• Gini Index for a given node $t$

    $Gini\ Index = 1 - \sum_{i=0}^{c-1} p_i(t)^2$

  Where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes

  – Maximum of $1 - 1/c$ when records are equally distributed among all classes, implying the least beneficial situation for classification
  – Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification
  – Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT


Measure of Impurity: GINI

• Gini Index for a given node $t$:

    $Gini\ Index = 1 - \sum_{i=0}^{c-1} p_i(t)^2$

  – For 2-class problem (p, 1 - p):
    ◦ $GINI = 1 - p^2 - (1 - p)^2 = 2p(1 - p)$

C1    0      1      2      3
C2    6      5      4      3
Gini  0.000  0.278  0.444  0.500


Computing Gini Index of a Single Node

    $Gini\ Index = 1 - \sum_{i=0}^{c-1} p_i(t)^2$

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                 Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
                 Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
                 Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444


Computing Gini Index for a Collection of Nodes

• When a node $p$ is split into $k$ partitions (children)

    $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)$

  where $n_i$ = number of records at child $i$,
        $n$ = number of records at parent node $p$.


Binary Attributes: Computing GINI Index

• Splits into two partitions (child nodes)
• Effect of weighting partitions:
  – Larger and purer partitions are sought

Parent: C1 = 7, C2 = 5, Gini = 0.486

B?
  Yes: Node N1 (C1 = 5, C2 = 1)
  No: Node N2 (C1 = 2, C2 = 4)

Gini(N1) = 1 - (5/6)^2 - (1/6)^2 = 0.278
Gini(N2) = 1 - (2/6)^2 - (4/6)^2 = 0.444

Weighted Gini of N1 and N2 = 6/12 * 0.278 + 6/12 * 0.444 = 0.361

Gain = 0.486 - 0.361 = 0.125

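
The arithmetic for split B can be checked with a few lines of Python. This is a small sketch; the helper names `gini` and `weighted_gini` are illustrative, and the counts are the ones given on the slide.

```python
def gini(counts):
    """Gini index of a node from its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Gini of a split: child Gini values weighted by the share of records in each child."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


parent = [7, 5]              # C1, C2 at the parent node
children = [[5, 1], [2, 4]]  # N1 and N2 after splitting on B

before = gini(parent)
after = weighted_gini(children)
gain = before - after
print(round(before, 3), round(after, 3), round(gain, 3))
```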
Categorical Attributes: Computing Gini Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split

CarType  Family  Sports  Luxury
C1       1       8       1
C2       3       0       7
Gini     0.163

Two-way split (find best partition of values)

CarType  {Sports, Luxury}  {Family}
C1       9                 1
C2       7                 3
Gini     0.468

CarType  {Sports}  {Family, Luxury}
C1       8         2
C2       0         10
Gini     0.167

Which of these is the best?
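
The three candidate splits can be scored from the same count matrix by merging the counts of the values grouped into each child. This is a sketch under stated assumptions: `split_gini`, `car_type`, and the candidate labels are illustrative names, and the counts are taken from the slide.

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gini(count_matrix, groups):
    """Weighted Gini of a split in which each group of values forms one child."""
    children = []
    for group in groups:
        # Add up the per-class counts of every value merged into this child.
        merged = [sum(count_matrix[value][k] for value in group) for k in range(2)]
        children.append(merged)
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


# (C1, C2) counts per CarType value.
car_type = {"Family": (1, 3), "Sports": (8, 0), "Luxury": (1, 7)}

candidates = {
    "multi-way": [["Family"], ["Sports"], ["Luxury"]],
    "{Sports, Luxury} vs {Family}": [["Sports", "Luxury"], ["Family"]],
    "{Sports} vs {Family, Luxury}": [["Sports"], ["Family", "Luxury"]],
}

scores = {name: split_gini(car_type, groups) for name, groups in candidates.items()}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("lowest weighted Gini:", best)
```

Under these counts the multi-way split has the lowest weighted Gini, narrowly ahead of {Sports} vs {Family, Luxury}.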


Continuous Attributes: Computing Gini Index

• Use Binary Decisions based on one value
• Several Choices for the splitting value
  – Number of possible splitting values = Number of distinct values
• Each splitting value has a count matrix associated with it
  – Class counts in each of the partitions, A ≤ v and A > v
• Simple method to choose best v
  – For each v, scan the database to gather count matrix and compute its Gini index
  – Computationally Inefficient! Repetition of work.

Example, using the ten training records from "Example of a Decision Tree":

Annual Income ?   ≤ 80   > 80
Defaulted Yes     0      3
Defaulted No      3      4


Continuous Attributes: Computing Gini Index...

• For efficient computation: for each attribute,
  – Sort the attribute on values
  – Linearly scan these values, each time updating the count matrix and computing gini index
  – Choose the split position that has the least gini index

Annual Income, sorted:

Sorted Values  60  70  75  85   90   95   100  120  125  220
Cheat          No  No  No  Yes  Yes  Yes  No   No   No   No

Count matrix and Gini index at each candidate split position:

Split Position  Yes (<=)  Yes (>)  No (<=)  No (>)  Gini
55              0         3        0        7       0.420
65              0         3        1        6       0.400
72              0         3        2        5       0.375
80              0         3        3        4       0.343
87              1         2        3        4       0.417
92              2         1        3        4       0.400
97              3         0        3        4       0.300
110             3         0        4        3       0.343
122             3         0        5        2       0.375
172             3         0        6        1       0.400
230             3         0        7        0       0.420
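
The sort-then-scan procedure can be sketched as below. This is a minimal sketch rather than the exact method behind the slide: it uses exact midpoints between adjacent sorted values as candidate cuts (the slide uses rounded positions such as 97 and also lists the two end positions), and the name `best_split` is illustrative.

```python
def gini(counts):
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def best_split(values, labels):
    """Sort once, then scan left to right, moving one record at a time
    from the right partition to the left and re-scoring the split."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    n = len(pairs)
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    best_score, best_cut = float("inf"), None
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1   # incremental update of the count matrix
        right[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue  # no cut point between equal values
        n_left = i + 1
        score = (n_left / n) * gini(list(left.values())) \
            + ((n - n_left) / n) * gini(list(right.values()))
        if score < best_score:
            best_score, best_cut = score, (value + next_value) / 2
    return best_score, best_cut


# Annual Income (in thousands) and class label, in record ID order.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

score, cut = best_split(incomes, defaulted)
print(score, cut)
```

The scan finds its minimum between the values 95 and 100, which corresponds to the split position 97 with Gini 0.300 in the table.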




Measure of Impurity: Entropy

• Entropy at a given node $t$

    $Entropy = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$

  Where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes

    ◦ Maximum of $\log_2 c$ when records are equally distributed among all classes, implying the least beneficial situation for classification
    ◦ Minimum of 0 when all records belong to one class, implying most beneficial situation for classification

  – Entropy based computations are quite similar to the GINI index computations

Computing Entropy of a Single Node

    $Entropy = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                 Entropy = - 0 log2 0 - 1 log2 1 = - 0 - 0 = 0 (taking 0 log2 0 as 0)

C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
                 Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
                 Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92


Computing Information Gain After Splitting

• Information Gain:

    $Gain_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$

  Parent Node, $p$ is split into $k$ partitions (children)
  $n_i$ is number of records in child node $i$

  – Choose the split that achieves most reduction (maximizes GAIN)
  – Used in the ID3 and C4.5 decision tree algorithms
  – Information gain is the mutual information between the class variable and the splitting variable
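
As a worked illustration, the information gain of the three candidate tests from the "How to determine the Best Split" slide (Gender, Car Type, Customer ID) can be computed as follows. This is a sketch; the function names are illustrative and the class counts are the ones listed on that slide.

```python
from math import log2


def entropy(counts):
    """Entropy of a node from its class counts (empty classes contribute 0)."""
    total = sum(counts)
    return sum(c / total * log2(total / c) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the record-weighted entropy of the children."""
    n = sum(parent)
    after = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - after


parent = [10, 10]                             # C0, C1 before splitting
gender = [[6, 4], [4, 6]]
car_type = [[1, 3], [8, 0], [1, 7]]           # Family, Sports, Luxury
customer_id = [[1, 0]] * 10 + [[0, 1]] * 10   # one record per child

gains = {
    "Gender": information_gain(parent, gender),
    "Car Type": information_gain(parent, car_type),
    "Customer ID": information_gain(parent, customer_id),
}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
```

Customer ID reaches the maximum possible gain of 1 bit because every child holds a single record and is therefore pure, even though such a split says nothing useful about new customers.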


Problem with large number of partitions

• Node impurity measures tend to prefer splits that result in large number of partitions, each being small but pure

Gender
  Yes: C0: 6, C1: 4
  No: C0: 4, C1: 6

Car Type
  Family: C0: 1, C1: 3
  Sports: C0: 8, C1: 0
  Luxury: C0: 1, C1: 7

Customer ID
  c1 ... c10: each C0: 1, C1: 0
  c11 ... c20: each C0: 0, C1: 1

  – Customer ID has highest information gain because entropy for all the children is zero

Gain Ratio

• Gain Ratio:

    $Gain\ Ratio = \frac{Gain_{split}}{Split\ Info}$,   $Split\ Info = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$

  Parent Node, $p$ is split into $k$ partitions (children)
  $n_i$ is number of records in child node $i$

  – Adjusts Information Gain by the entropy of the partitioning (Split Info).
    ◦ Higher entropy partitioning (large number of small partitions) is penalized!
  – Used in C4.5 algorithm
  – Designed to overcome the disadvantage of Information Gain


Gain Ratio

• Gain Ratio:

    $Gain\ Ratio = \frac{Gain_{split}}{Split\ Info}$,   $Split\ Info = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$

  Parent Node, $p$ is split into $k$ partitions (children)
  $n_i$ is number of records in child node $i$

CarType    Family  Sports  Luxury
C1         1       8       1
C2         3       0       7
Gini       0.163
SplitINFO  1.52

CarType    {Sports, Luxury}  {Family}
C1         9                 1
C2         7                 3
Gini       0.468
SplitINFO  0.72

CarType    {Sports}  {Family, Luxury}
C1         8         2
C2         0         10
Gini       0.167
SplitINFO  0.97
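
The Split Info values for the three CarType partitionings, and the effect of dividing the gain by them, can be reproduced with the sketch below. The function names are illustrative; the Customer ID comparison reuses the 20-record example from the "Problem with large number of partitions" slide.

```python
from math import log2


def entropy(counts):
    total = sum(counts)
    return sum(c / total * log2(total / c) for c in counts if c > 0)


def split_info(children):
    """Entropy of the partition sizes themselves (class labels are ignored)."""
    return entropy([sum(child) for child in children])


def gain_ratio(parent, children):
    n = sum(parent)
    gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
    return gain / split_info(children)


multi_way = [[1, 3], [8, 0], [1, 7]]     # Family, Sports, Luxury
sports_luxury = [[9, 7], [1, 3]]         # {Sports, Luxury}, {Family}
sports_only = [[8, 0], [2, 10]]          # {Sports}, {Family, Luxury}

infos = [round(split_info(s), 2) for s in (multi_way, sports_luxury, sports_only)]
print(infos)

parent = [10, 10]
customer_id = [[1, 0]] * 10 + [[0, 1]] * 10
ratio_car = gain_ratio(parent, multi_way)
ratio_id = gain_ratio(parent, customer_id)
print(round(ratio_car, 3), round(ratio_id, 3))
```

Although Customer ID has the highest information gain, its Split Info of log2(20) is large enough that its gain ratio falls below that of the three-way Car Type split.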


Measure of Impurity: Classification Error

• Classification error at a node $t$

    $Error(t) = 1 - \max_i [p_i(t)]$

  – Maximum of $1 - 1/c$ when records are equally distributed among all classes, implying the least interesting situation
  – Minimum of 0 when all records belong to one class, implying the most interesting situation


Computing Error of a Single Node

    $Error(t) = 1 - \max_i [p_i(t)]$

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                 Error = 1 - max(0, 1) = 1 - 1 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
                 Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
                 Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3


Comparison among Impurity Measures

For a 2-class problem:

[Figure: impurity measures for a 2-class problem]
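
For a 2-class problem with class proportions (p, 1 - p), the three measures can be tabulated with the short sketch below. The function name `two_class_measures` and the grid of p values are illustrative choices, not taken from the slides.

```python
from math import log2


def two_class_measures(p):
    """Return (Gini, entropy, classification error) for class proportions (p, 1 - p)."""
    gini = 2 * p * (1 - p)
    entropy = sum(q * log2(1 / q) for q in (p, 1 - p) if q > 0)
    error = 1 - max(p, 1 - p)
    return gini, entropy, error


for p in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    g, h, e = two_class_measures(p)
    print(f"p={p:.1f}  gini={g:.3f}  entropy={h:.3f}  error={e:.3f}")
```

All three measures are zero at p = 0 and peak at p = 0.5, where Gini and classification error equal 0.5 and entropy equals 1; the values for p > 0.5 mirror those for 1 - p.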


Misclassification Error vs Gini Index

Parent: C1 = 7, C2 = 3, Gini = 0.42

A?
  Yes: Node N1 (C1 = 3, C2 = 0)
  No: Node N2 (C1 = 4, C2 = 3)

Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489

Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342

Gini improves but error remains the same!!


Misclassification Error vs Gini Index

Parent: C1 = 7, C2 = 3, Gini = 0.42

Two candidate splits on A? (Yes: Node N1, No: Node N2):

      N1  N2            N1  N2
C1    3   4       C1    3   4
C2    0   3       C2    1   2
Gini = 0.342      Gini = 0.416

Misclassification error for all three cases = 0.3 !
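
The three cases (the parent and the two candidate splits) can be compared numerically. This is a small sketch with illustrative helper names; the counts are those on the slide.

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def error(counts):
    return 1.0 - max(counts) / sum(counts)


def weighted(measure, children):
    """Average an impurity measure over child nodes, weighted by node size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


cases = {
    "parent": [[7, 3]],
    "split 1": [[3, 0], [4, 3]],
    "split 2": [[3, 1], [4, 2]],
}
results = {name: (weighted(gini, nodes), weighted(error, nodes))
           for name, nodes in cases.items()}
for name, (g, e) in results.items():
    print(f"{name}: gini={g:.3f}  error={e:.3f}")
```

The weighted misclassification error stays at 0.3 in every case, while the Gini index separates them, which is why Gini can guide split selection where the error measure cannot.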


Decision Tree Based Classification

• Advantages:
  – Relatively inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Robust to noise (especially when methods to avoid overfitting are employed)
  – Can easily handle redundant attributes
  – Can easily handle irrelevant attributes (unless the attributes are interacting)

• Disadvantages:
  – Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
  – Each decision boundary involves only a single attribute


Handling interactions

[Figure: scatter plot of the two classes over attributes X and Y]

+ : 1000 instances
o : 1000 instances

Entropy (X) : 0.99
Entropy (Y) : 0.99




Handling interactions given irrelevant attributes

[Figure: scatter plot of the two classes over attributes X and Y]

+ : 1000 instances
o : 1000 instances

Adding Z as a noisy attribute generated from a uniform distribution

Entropy (X) : 0.99
Entropy (Y) : 0.99
Entropy (Z) : 0.98

Attribute Z will be chosen for splitting!


Limitations of single attribute-based decision boundaries

Both positive (+) and negative (o) classes generated from skewed Gaussians with centers at (8,8) and (12,12) respectively.