
Data Mining
Classification: Basic Concepts and Techniques

Lecture Notes for Chapter 3
Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar


Classification: Definition

 Given a collection of records (training set)
– Each record is characterized by a tuple (x, y), where x is the attribute set (a set of attributes, not a single attribute, even though it is written as a single symbol) and y is the class label
 x: attribute set, predictor, independent variable, input
 y: class, response, dependent variable, output

 Task:
– Learn a model that maps each attribute set x into one of the predefined class labels y


Examples of Classification Task

Task                    Attribute set, x                             Class label, y
Categorizing email      Features extracted from email message        spam or non-spam
messages                header and content
Identifying tumor       Features extracted from x-rays or            malignant or benign cells
cells                   MRI scans
Cataloging galaxies     Features extracted from telescope            elliptical, spiral, or
                        images                                       irregular-shaped galaxies


Model
Example: Consider the problem of predicting whether a loan applicant will repay the loan successfully or will default (i.e., become a defaulted borrower by failing to repay the loan).

Inductive vs. deductive reasoning
• The main difference between inductive and deductive reasoning is that inductive reasoning aims at developing a theory, while deductive reasoning aims at testing an existing theory.
• Inductive reasoning moves from specific observations to broad generalizations; deductive reasoning moves in the reverse direction.
General Approach for Building Classification Model
• A classification technique (or classifier) is a systematic approach to building
classification models from an input data set.
• Examples include decision tree classifiers, rule-based classifiers, neural
networks, support vector machines, and naive Bayes classifiers.
• Each technique employs a learning algorithm to identify a model that best fits
the relationship between the attribute set and class label of the input data.
• The model generated by a learning algorithm should both fit the input data
well and correctly predict the class labels of records it has never seen before.
• Therefore, a key objective of the learning algorithm is to build models with
good generalization capability; i.e., models that accurately predict the class
labels of previously unknown records.

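This general approach can be sketched in a few lines of code. The snippet below is a minimal illustration only: it assumes scikit-learn is available (it is not part of these notes, and any classifier with fit/predict methods would serve), and the numeric encoding of Marital Status is our own hypothetical choice. The training records are the (x, y) pairs from the loan example; induction fits a decision tree, and deduction applies it to an unseen record.

from sklearn.tree import DecisionTreeClassifier

# Toy training set: each record is a tuple (x, y) with attribute set x and class label y.
# Attribute encoding (illustrative only): [home_owner (1 = Yes), marital_status
# (0 = Single, 1 = Married, 2 = Divorced), annual_income in thousands].
X_train = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
           [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y_train = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Induction: the learning algorithm identifies a model that fits the training data.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Deduction: the induced model predicts class labels for previously unseen records.
X_test = [[0, 1, 80]]                 # Home Owner = No, Married, Annual Income = 80K
print(model.predict(X_test))          # predicted "Defaulted Borrower" label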


Classification Techniques

 Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest Neighbor
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
– Neural Networks, Deep Neural Nets

 Ensemble Classifiers (the term, used in statistics, means "acting together")
– Boosting, Bagging, Random Forests

Ensemble (noun) = a group of musicians, actors, or dancers who perform together.
(In statistical mechanics, an ensemble is a collection of a large number of systems that are macroscopically identical but microscopically different.)


A decision tree for the problem:
Will a loan applicant repay the loan successfully, or default by failing to repay it?

Training Data (attribute types: categorical, categorical, continuous, class):

ID   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes

Model: Decision Tree (splitting attributes shown at each internal node):

Home Owner?
  Yes -> NO
  No  -> Marital Status (MarSt)?
           Single, Divorced -> Annual Income?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO


A decision tree has three types of nodes: a root node, internal nodes, and leaf nodes.

• The root and internal nodes are test nodes; the leaf nodes are class-label nodes.


Apply Model to Test Data

Start from the root of the tree.

Test Data:
Home Owner   Marital Status   Annual Income   Defaulted Borrower
No           Married          80K             ?

Tracing the test record through the tree:
Home Owner = No        -> follow the "No" branch to the Marital Status (MarSt) test
Marital Status = Married -> follow the "Married" branch to a leaf labeled NO
Assign Defaulted Borrower = "No" to the test record.


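The traversal above can also be written as a small hand-coded function. This is an illustrative sketch of the tree from the slides (not produced by any particular library); the dictionary keys simply mirror the attribute names of the test record.

# A hand-coded version of the decision tree from the slides, used to trace how a
# test record reaches a leaf.
def classify(record):
    """Return the predicted 'Defaulted Borrower' label for one record (a dict)."""
    if record["Home Owner"] == "Yes":
        return "No"                              # Home Owner = Yes -> leaf NO
    if record["Marital Status"] == "Married":
        return "No"                              # Married -> leaf NO
    # Single or Divorced: test Annual Income (in thousands)
    return "No" if record["Annual Income"] < 80 else "Yes"

test_record = {"Home Owner": "No", "Marital Status": "Married", "Annual Income": 80}
print(classify(test_record))                     # -> "No" (assign Defaulted = No)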


Another Decision Tree for the Same Example

(Same training data as before: ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower.)

Marital Status (MarSt)?
  Married -> NO
  Single, Divorced -> Home Owner?
                        Yes -> NO
                        No  -> Annual Income?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!


Decision Tree Classification Task

Training Set:
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Induction: a tree induction algorithm learns a model (a decision tree) from the training set.

Test Set:
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?

Deduction: the learned model is applied to the test set to predict the missing class labels.


How many decision trees are possible?

Exponentially many decision trees can be constructed from a given set of attributes, so finding the optimal tree is computationally infeasible; practical induction algorithms use greedy strategies and may therefore produce suboptimal trees.

suboptimal (adj.) = of less than the highest standard or quality, e.g., suboptimal working conditions


Decision Tree Induction

 Many Algorithms:
– Hunt's Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ (Supervised Learning In Quest), SPRINT


General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t.

Step 1: If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.

Step 2: If Dt contains records that belong to more than one class, use an attribute test condition to split the data into smaller subsets. Recursively apply the procedure to each subset.

(Training data: the ten loan records with attributes Home Owner, Marital Status, Annual Income and class Defaulted Borrower, as shown earlier.)

A sketch of the recursion is shown below.
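A compact, illustrative rendering of Hunt's algorithm follows. It is a simplification, not the exact textbook procedure: records are (attribute-dict, label) pairs, and the attribute tests are simply taken in the order supplied rather than chosen by an impurity measure.

from collections import Counter, defaultdict

def hunt(records, attributes):
    labels = [y for _, y in records]
    # Step 1: if every record at node t has the same class y_t, t becomes a leaf labeled y_t.
    # (Also stop when no attributes are left; label with the majority class.)
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick an attribute test, split the data into smaller subsets, recurse on each.
    attr, rest = attributes[0], attributes[1:]
    partitions = defaultdict(list)
    for x, y in records:
        partitions[x[attr]].append((x, y))
    return {attr: {value: hunt(subset, rest) for value, subset in partitions.items()}}

# Usage on two attributes of the loan data:
data = [({"Home Owner": "Yes", "Marital Status": "Single"}, "No"),
        ({"Home Owner": "No",  "Marital Status": "Divorced"}, "Yes"),
        ({"Home Owner": "No",  "Marital Status": "Married"}, "No")]
print(hunt(data, ["Home Owner", "Marital Status"]))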
Hunt's Algorithm (figure from the 1st edition of the Tan et al. book; same procedure as above).


Hunt's Algorithm on the loan data (class counts shown as (Defaulted = No, Defaulted = Yes)):

(a) A single leaf node labeled Defaulted = No, covering all records (7, 3).

(b) Split on Home Owner:
    Yes -> leaf Defaulted = No (3, 0)
    No  -> leaf Defaulted = No (4, 3), still impure

(c) Split the Home Owner = No branch on Marital Status:
    Single, Divorced -> leaf Defaulted = Yes (1, 3), still impure
    Married          -> leaf Defaulted = No (3, 0)

(d) Split the Single/Divorced branch on Annual Income:
    < 80K  -> leaf Defaulted = No (1, 0)
    >= 80K -> leaf Defaulted = Yes (0, 3)
Design Issues of Decision Tree Induction

 How should training records be split?
– Method for expressing the test condition (depending on attribute types)
– Measure for evaluating the goodness of a test condition

 How should the splitting procedure stop?
– Stop splitting if all the records belong to the same class or have identical attribute values
– Early termination
Methods for Expressing Test Conditions

 Depends on attribute type
– Binary
– Nominal
– Ordinal
– Continuous

 Depends on number of ways to split
– 2-way split
– Multi-way split


Test Condition for Nominal Attributes

 Multi-way split:
– Use as many partitions as there are distinct values,
  e.g., Marital Status -> {Single}, {Divorced}, {Married}

 Binary split:
– Divides the values into two subsets, e.g.,
  {Married} vs. {Single, Divorced}, or {Single} vs. {Married, Divorced}, or {Single, Married} vs. {Divorced}


Test Condition for Ordinal Attributes

 Multi-way split:
– Use as many partitions as there are distinct values,
  e.g., Shirt Size -> {Small}, {Medium}, {Large}, {Extra Large}

 Binary split:
– Divides the values into two subsets and must preserve the order property among the attribute values,
  e.g., {Small, Medium} vs. {Large, Extra Large}, or {Small} vs. {Medium, Large, Extra Large}
– The grouping {Small, Large} vs. {Medium, Extra Large} violates the order property.
Test Condition for Continuous Attributes

(i) Binary split: Annual Income > 80K? -> Yes / No
(ii) Multi-way split: Annual Income? -> < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


Splitting Based on Continuous Attributes

 Different ways of handling
– Discretization to form an ordinal categorical attribute
  Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
   Static: discretize once at the beginning
   Dynamic: repeat at each node
– Binary decision: (A < v) or (A >= v)
   consider all possible splits and find the best cut
   can be more compute intensive

A small discretization sketch follows.
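As a concrete illustration of static discretization (a sketch only; numpy is assumed to be available and is not part of these notes), the Annual Income values from the loan data can be bucketed by equal-width intervals or by percentiles:

import numpy as np

income = np.array([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])   # Annual Income (K)

# Equal-interval bucketing: 4 bins of equal width over [min, max].
equal_width_edges = np.linspace(income.min(), income.max(), 5)
equal_width_bins = np.digitize(income, equal_width_edges[1:-1])

# Equal-frequency bucketing: bin edges at the quartiles (percentiles).
quartile_edges = np.percentile(income, [25, 50, 75])
equal_freq_bins = np.digitize(income, quartile_edges)

print(equal_width_edges, equal_width_bins)
print(quartile_edges, equal_freq_bins)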
How to Determine the Best Split

Before splitting: 10 records of class C0 and 10 records of class C1.

Candidate test conditions:
– Gender:       two children with class distributions {C0: 6, C1: 4} and {C0: 4, C1: 6}
– Car Type:     Family {C0: 1, C1: 3}, Sports {C0: 8, C1: 0}, Luxury {C0: 1, C1: 7}
– Customer ID:  c1 ... c20, each child containing a single record, e.g. {C0: 1, C1: 0} or {C0: 0, C1: 1}

Which test condition is the best?
How to Determine the Best Split

 Greedy approach:
– Nodes with purer class distribution are preferred

 Need a measure of node impurity:
   {C0: 5, C1: 5}: high degree of impurity
   {C0: 9, C1: 1}: low degree of impurity


Measures of Node Impurity

where p_i(t) is the relative frequency of class i at node t, and c is the total number of classes:

 Gini Index
   Gini(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2

 Entropy
   Entropy(t) = - \sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)

 Misclassification error
   Error(t) = 1 - \max_i [ p_i(t) ]
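These three measures are short enough to write out directly. The sketch below defines them for a node described by its vector of class counts (the function names are ours, not from the text):

import math

def class_probabilities(counts):
    n = sum(counts)
    return [c / n for c in counts]

def gini(counts):
    return 1.0 - sum(p ** 2 for p in class_probabilities(counts))

def entropy(counts):
    return -sum(p * math.log2(p) for p in class_probabilities(counts) if p > 0)

def classification_error(counts):
    return 1.0 - max(class_probabilities(counts))

# Sanity check on a node with 3 records of each of two classes (maximum impurity):
print(gini([3, 3]), entropy([3, 3]), classification_error([3, 3]))   # 0.5, 1.0, 0.5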


Finding the Best Split

1. Compute the impurity measure (P) before splitting
2. Compute the impurity measure (M) after splitting
    Compute the impurity measure of each child node
    M is the weighted impurity of the child nodes
3. Choose the attribute test condition that produces the highest gain

   Gain = P - M

or, equivalently, the lowest impurity measure after splitting (M)


Finding the Best Split

Before splitting: the parent node has class counts C0: N00 and C1: N01, with impurity P.

Candidate split A? (Yes/No) produces child nodes N1 and N2, with class counts (N10, N11) and (N20, N21), impurities M11 and M12, and weighted impurity M1.
Candidate split B? (Yes/No) produces child nodes N3 and N4, with class counts (N30, N31) and (N40, N41), impurities M21 and M22, and weighted impurity M2.

Compare Gain = P - M1 vs. P - M2.
Measure of Impurity: GINI

 Gini Index for a given node t:

   Gini(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2

   where p_i(t) is the relative frequency of class i at node t, and c is the total number of classes.

– Maximum of (1 - 1/c) when records are equally distributed among all classes, implying the least beneficial situation for classification
– Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification
– The Gini index is used in decision tree algorithms such as CART, SLIQ, and SPRINT


Measure of Impurity: GINI

 Gini Index for a given node t:

   Gini(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2

– For a 2-class problem (p, 1 - p):
   Gini = 1 - p^2 - (1 - p)^2 = 2p(1 - p)

   {C1: 0, C2: 6} -> Gini = 0.000
   {C1: 1, C2: 5} -> Gini = 0.278
   {C1: 2, C2: 4} -> Gini = 0.444
   {C1: 3, C2: 3} -> Gini = 0.500


Computing Gini Index of a Single Node

   Gini(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2

   {C1: 0, C2: 6}:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                    Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

   {C1: 1, C2: 5}:  P(C1) = 1/6,  P(C2) = 5/6
                    Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

   {C1: 2, C2: 4}:  P(C1) = 2/6,  P(C2) = 4/6
                    Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
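The three worked values can be reproduced with plain arithmetic (sketch only):

# Reproducing the worked Gini values above for nodes with 6 records.
for c1, c2 in [(0, 6), (1, 5), (2, 4)]:
    p1, p2 = c1 / 6, c2 / 6
    print(f"C1={c1}, C2={c2}:  Gini = {1 - p1**2 - p2**2:.3f}")   # 0.000, 0.278, 0.444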


Computing Gini Index for a Collection of Nodes

 When a node p is split into k partitions (children):

   Gini_split = \sum_{i=1}^{k} \frac{n_i}{n} Gini(i)

   where n_i = number of records at child i, and n = number of records at the parent node p.


Binary Attributes: Computing the Gini Index

 Splits into two partitions (child nodes)
 Effect of weighting partitions: larger and purer partitions are sought

Parent: {C1: 7, C2: 5}, Gini = 0.486

Split on B? (Yes -> N1, No -> N2):
   N1: {C1: 5, C2: 1}   Gini(N1) = 1 - (5/6)^2 - (1/6)^2 = 0.278
   N2: {C1: 2, C2: 4}   Gini(N2) = 1 - (2/6)^2 - (4/6)^2 = 0.444

Weighted Gini of N1 and N2 = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
Gain = 0.486 - 0.361 = 0.125
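The same numbers can be checked in a few lines (a sketch only; class counts are written as [C1, C2] lists):

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent, n1, n2 = [7, 5], [5, 1], [2, 4]
weighted = (sum(n1) / sum(parent)) * gini(n1) + (sum(n2) / sum(parent)) * gini(n2)
print(gini(parent), gini(n1), gini(n2))   # ~0.486, 0.278, 0.444
print(weighted, gini(parent) - weighted)  # ~0.361, gain ~0.125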


Categorical Attributes: Computing the Gini Index

 For each distinct value, gather counts for each class in the dataset
 Use the count matrix to make decisions

Multi-way split on CarType:
           Family   Sports   Luxury
   C1        1        8        1
   C2        3        0        7
   Gini = 0.163

Two-way splits (find the best partition of values):
           {Sports, Luxury}   {Family}            {Sports}   {Family, Luxury}
   C1             9              1                    8              2
   C2             7              3                    0             10
   Gini = 0.468                              Gini = 0.167

Which of these is the best?


Continuous Attributes: Computing the Gini Index

 Use a binary decision based on one value v
 Several choices for the splitting value
– Number of possible splitting values = number of distinct values
 Each splitting value v has a count matrix associated with it
– Class counts in each of the two partitions, A <= v and A > v
 Simple method to choose the best v:
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.

(Training data: the ten loan records with Home Owner, Marital Status, Annual Income, Defaulted Borrower.)

Example count matrix for the split "Annual Income <= 80K / > 80K":
                     <= 80   > 80
   Defaulted = Yes     0       3
   Defaulted = No      3       4


Continuous Attributes: Computing the Gini Index...

 For efficient computation, for each attribute:
– Sort the attribute on its values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index

Class (Cheat):    No    No    No    Yes   Yes   Yes   No    No    No    No
Sorted values:    60    70    75    85    90    95    100   120   125   220
Split positions:  55    65    72    80    87    92    97    110   122   172   230
                <= >   <= >  <= >  <= >  <= >  <= >  <= >  <= >  <= >  <= >  <= >
   Yes           0 3   0 3   0 3   0 3   1 2   2 1   3 0   3 0   3 0   3 0   3 0
   No            0 7   1 6   2 5   3 4   3 4   3 4   3 4   4 3   5 2   6 1   7 0
   Gini         0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split is at position 97 (between 95 and 100), with Gini = 0.300.
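The sketch below applies this idea to the Annual Income column of the loan data. For clarity it recomputes the class counts of each partition at every candidate midpoint; the efficient version described on the slide instead updates a running count matrix during a single linear scan.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]          # already sorted
labels  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

best = None
for i in range(len(incomes) - 1):
    v = (incomes[i] + incomes[i + 1]) / 2                        # candidate split position
    left  = [lab for inc, lab in zip(incomes, labels) if inc <= v]
    right = [lab for inc, lab in zip(incomes, labels) if inc > v]
    counts = lambda part: [part.count("Yes"), part.count("No")]
    w = (len(left) * gini(counts(left)) + len(right) * gini(counts(right))) / len(labels)
    if best is None or w < best[1]:
        best = (v, w)

print(best)   # (97.5, 0.30): best split is Annual Income <= 97.5, weighted Gini = 0.300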




Measure of Impurity: Entropy

 Entropy at a given node t:

   Entropy(t) = - \sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)

   where p_i(t) is the relative frequency of class i at node t, and c is the total number of classes.

 Maximum of \log_2 c when records are equally distributed among all classes, implying the least beneficial situation for classification
 Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification

– Entropy-based computations are quite similar to the Gini index computations
Computing Entropy of a Single Node

   Entropy(t) = - \sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)

   {C1: 0, C2: 6}:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                    Entropy = - 0 log2 0 - 1 log2 1 = - 0 - 0 = 0

   {C1: 1, C2: 5}:  P(C1) = 1/6,  P(C2) = 5/6
                    Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

   {C1: 2, C2: 4}:  P(C1) = 2/6,  P(C2) = 4/6
                    Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
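The worked values can be reproduced directly (sketch only; the 0 log 0 term is treated as 0):

import math

for c1, c2 in [(0, 6), (1, 5), (2, 4)]:
    probs = [c / 6 for c in (c1, c2)]
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    print(f"C1={c1}, C2={c2}:  Entropy = {h:.2f}")   # 0.00, 0.65, 0.92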


Computing Information Gain After Splitting

 Information Gain:

   Gain_split = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)

   where the parent node p is split into k partitions (children) and n_i is the number of records in child node i.

– Choose the split that achieves the most reduction in entropy (maximizes the gain)
– Used in the ID3 and C4.5 decision tree algorithms
– Information gain is the mutual information between the class variable and the splitting variable
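A small helper that computes this quantity from class-count lists, illustrated on the Home Owner split of the loan data (the ~0.19 value is our own calculation, not taken from the slides); the parent node has 7 No and 3 Yes records:

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Home Owner split: Yes-branch (3 No, 0 Yes), No-branch (4 No, 3 Yes).
print(information_gain([7, 3], [[3, 0], [4, 3]]))   # ~0.19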
Problem with a Large Number of Partitions

 Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure.

– Split on Gender:      {C0: 6, C1: 4} and {C0: 4, C1: 6}
– Split on Car Type:    Family {C0: 1, C1: 3}, Sports {C0: 8, C1: 0}, Luxury {C0: 1, C1: 7}
– Split on Customer ID: c1 ... c20, each child containing a single record

– Customer ID has the highest information gain because the entropy of all its children is zero
Gain Ratio

 Gain Ratio:

   Gain Ratio = \frac{Gain_split}{Split Info},   Split Info = - \sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}

   where the parent node p is split into k partitions (children) and n_i is the number of records in child node i.

– Adjusts the information gain by the entropy of the partitioning (Split Info).
 Higher-entropy partitioning (a large number of small partitions) is penalized!
– Used in the C4.5 algorithm
– Designed to overcome the disadvantage of information gain


Gain Ratio

 Gain Ratio:

   Gain Ratio = \frac{Gain_split}{Split Info},   Split Info = - \sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}

   where the parent node p is split into k partitions (children) and n_i is the number of records in child node i.

           Family   Sports   Luxury       {Sports, Luxury}   {Family}       {Sports}   {Family, Luxury}
   C1        1        8        1                 9              1               8              2
   C2        3        0        7                 7              3               0             10
   Gini = 0.163                         Gini = 0.468                    Gini = 0.167
   SplitINFO = 1.52                     SplitINFO = 0.72                SplitINFO = 0.97
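The SplitINFO values follow directly from the child-node sizes (sketch only; the 20 records split as 4/8/8, 16/4, and 8/12):

import math

def split_info(child_sizes):
    n = sum(child_sizes)
    return -sum((s / n) * math.log2(s / n) for s in child_sizes)

print(split_info([4, 8, 8]))    # multi-way {Family}, {Sports}, {Luxury} -> ~1.52
print(split_info([16, 4]))      # {Sports, Luxury} vs {Family}           -> ~0.72
print(split_info([8, 12]))      # {Sports} vs {Family, Luxury}           -> ~0.97
# Gain Ratio = Gain_split / Split Info, so the many-partition split is penalized most.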


Measure of Impurity: Classification Error

 Classification error at a node t:

   Error(t) = 1 - \max_i [ p_i(t) ]

– Maximum of (1 - 1/c) when records are equally distributed among all classes, implying the least interesting situation
– Minimum of 0 when all records belong to one class, implying the most interesting situation


Computing Error of a Single Node

   Error(t) = 1 - \max_i [ p_i(t) ]

   {C1: 0, C2: 6}:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                    Error = 1 - max(0, 1) = 1 - 1 = 0

   {C1: 1, C2: 5}:  P(C1) = 1/6,  P(C2) = 5/6
                    Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

   {C1: 2, C2: 4}:  P(C1) = 2/6,  P(C2) = 4/6
                    Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3


Comparison among Impurity Measures

For a 2-class problem:

(Figure: entropy, Gini index, and misclassification error plotted as functions of p, the fraction of records belonging to one of the two classes; all three measures are 0 at p = 0 and p = 1 and peak at p = 0.5.)


Misclassification Error vs. Gini Index

Parent: {C1: 7, C2: 3}, Gini = 0.42

Split on A? (Yes -> N1, No -> N2):
   N1: {C1: 3, C2: 0}   Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
   N2: {C1: 4, C2: 3}   Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489

Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342

The Gini index improves, but the misclassification error remains the same!


Misclassification Error vs. Gini Index

Parent: {C1: 7, C2: 3}, Gini = 0.42

Two possible splits on A?:
   N1: {C1: 3, C2: 0}, N2: {C1: 4, C2: 3}  ->  weighted Gini = 0.342
   N1: {C1: 3, C2: 1}, N2: {C1: 4, C2: 2}  ->  weighted Gini = 0.416

The misclassification error for all three cases (parent and both splits) = 0.3!
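These numbers can be verified with a short computation (sketch only; class counts written as [C1, C2]):

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def error(counts):
    return 1.0 - max(counts) / sum(counts)

parent = [7, 3]
for n1, n2 in [([3, 0], [4, 3]), ([3, 1], [4, 2])]:
    w = lambda f: (sum(n1) * f(n1) + sum(n2) * f(n2)) / sum(parent)
    print(w(gini), w(error))             # Gini ~0.342 / ~0.416, error = 0.3 in both cases
print(gini(parent), error(parent))       # parent: Gini = 0.42, error = 0.3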


Decision Tree Based Classification

 Advantages:
– Relatively inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are employed)
– Can easily handle redundant attributes
– Can easily handle irrelevant attributes (unless the attributes are interacting)

 Disadvantages:
– Due to the greedy nature of the splitting criterion, interacting attributes (which can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
– Each decision boundary involves only a single attribute


Handling Interactions

(Figure: a scatter plot of attributes X and Y with 1,000 instances of class + and 1,000 instances of class o; Entropy(X) = 0.99 and Entropy(Y) = 0.99, so neither attribute is informative on its own, although together they separate the classes.)




Handling Interactions in the Presence of Irrelevant Attributes

(Figure: the same X and Y data, 1,000 instances of class + and 1,000 of class o, with an added noisy attribute Z generated from a uniform distribution. Entropy(X) = 0.99, Entropy(Y) = 0.99, Entropy(Z) = 0.98, so attribute Z will be chosen for splitting!)


Limitations of Single-Attribute Decision Boundaries

(Figure: both the positive (+) and negative (o) classes are generated from skewed Gaussians with centers at (8, 8) and (12, 12) respectively; because each decision boundary involves only a single attribute, a decision tree must approximate the oblique class boundary with many axis-parallel splits.)


