Chap3 Basic Classification

This document provides an introduction to classification techniques in data mining. It defines classification as predicting a class label based on attribute data. Common classification tasks are discussed like email spam detection. The general approach of building classification models is outlined using a training set to learn a model that maps attributes to classes. Popular classification techniques are presented including decision trees, rules, nearest neighbors, Bayesian methods, support vector machines, and neural networks. Decision trees are demonstrated as a way to classify new test data by traversing the tree based on attribute values.

Uploaded by

Novita


Data Mining

Classification: Basic Concepts and Techniques

Lecture Notes for Chapter 3

Introduction to Data Mining, 2nd Edition

by
Tan, Steinbach, Karpatne, Kumar

2/1/2021 Introduction to Data Mining, 2nd Edition 1

Classification: Definition

• Given a collection of records (training set)
  – Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
    ◆ x: attribute, predictor, independent variable, input
    ◆ y: class, response, dependent variable, output

• Task:
  – Learn a model that maps each attribute set x into one of the predefined class labels y

Examples of Classification Task

Task                          Attribute set, x                                            Class label, y
Categorizing email messages   Features extracted from email message header and content    spam or non-spam
Identifying tumor cells       Features extracted from x-rays or MRI scans                 malignant or benign cells
Cataloging galaxies           Features extracted from telescope images                    Elliptical, spiral, or irregular-shaped galaxies

General Approach for Building Classification Model

Classification Techniques

Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
– Neural Networks, Deep Neural Nets

Ensemble Classifiers
– Boosting, Bagging, Random Forests


Example of a Decision Tree

Training Data

ID   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)

Home Owner
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: Income
      < 80K: NO
      > 80K: YES

Apply Model to Test Data

Test Data

Home Owner   Marital Status   Annual Income   Defaulted Borrower
No           Married          80K             ?

Start from the root of the tree and follow the branch that matches the record at each node: Home Owner = No leads to the MarSt node, and MarSt = Married leads to a leaf labeled NO.

Assign Defaulted to “No”.
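The traversal on this slide can be sketched in Python. This is a minimal illustration rather than code from the lecture: the nested-tuple encoding, the `classify` helper, and the field names are assumptions. Sending an income of exactly 80K down the `>= 80K` branch is also an assumed tie-break, since the tree only labels its branches `< 80K` and `> 80K`.

```python
# Internal nodes are (test_function, {outcome: child}); leaves are label strings.
# The encoding and field names are hypothetical, chosen for this sketch.
TREE = (
    lambda r: r["home_owner"],
    {
        "Yes": "No",
        "No": (
            lambda r: "Married" if r["marital_status"] == "Married" else "Single, Divorced",
            {
                "Married": "No",
                "Single, Divorced": (
                    # Assumed tie-break: exactly 80K goes to the ">= 80K" side.
                    lambda r: "< 80K" if r["annual_income"] < 80 else ">= 80K",
                    {"< 80K": "No", ">= 80K": "Yes"},
                ),
            },
        ),
    },
)


def classify(node, record):
    """Start at the root and follow matching branches until a leaf is reached."""
    while not isinstance(node, str):
        test, children = node
        node = children[test(record)]
    return node


test_record = {"home_owner": "No", "marital_status": "Married", "annual_income": 80}
print(classify(TREE, test_record))  # No
```

The test record never reaches the Income node, so the tie-break does not affect its prediction.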

Another Example of Decision Tree

MarSt
  Married: NO
  Single, Divorced: Home Owner
    Yes: NO
    No: Income
      < 80K: NO
      > 80K: YES

This tree is built from the same ten training records (Home Owner, Marital Status, Annual Income, Defaulted Borrower) as the previous example.

There could be more than one tree that fits the same data!

Decision Tree Classification Task

Training Set

Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test Set

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?

Induction: Training Set → Tree Induction algorithm → Learn Model → Model (Decision Tree)
Deduction: Model (Decision Tree) → Apply Model → Test Set

Decision Tree Induction

Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t

• General Procedure:
  – If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(The slide illustrates Dt with the ten-record Defaulted Borrower training data.)
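The general procedure can be sketched as a short recursive function. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the `hunt` function, its tuple-based tree encoding, the majority-class fallback, and the pre-binarized `income_ge_80k` field are all hypothetical. Taking the first usable attribute stands in for an impurity-based choice of test condition.

```python
from collections import Counter


def hunt(records, attributes, target):
    """Grow a tree following the general procedure of Hunt's algorithm.

    records: list of dicts; attributes: categorical attribute names to try;
    target: key holding the class label. Returns a label (leaf) or a pair
    (attribute, {value: subtree}).
    """
    labels = [r[target] for r in records]
    # Case 1: every record in Dt has the same class -> leaf labeled with it.
    if len(set(labels)) == 1:
        return labels[0]
    # Assumption: if no attribute can still separate the records,
    # stop and return the majority class.
    usable = [a for a in attributes if len({r[a] for r in records}) > 1]
    if not usable:
        return Counter(labels).most_common(1)[0][0]
    # Case 2: several classes -> apply an attribute test and recurse.
    # Taking the first usable attribute is a placeholder for an
    # impurity-based choice.
    attr = usable[0]
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in sorted({r[attr] for r in records}):
        subset = [r for r in records if r[attr] == value]
        children[value] = hunt(subset, remaining, target)
    return (attr, children)


# The ten training records; income (in thousands) is binarized at 80K here.
rows = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
records = [
    {"home_owner": h, "marital_status": m, "income_ge_80k": inc >= 80, "defaulted": d}
    for h, m, inc, d in rows
]
tree = hunt(records, ["home_owner", "marital_status", "income_ge_80k"], "defaulted")
print(tree)
```

Because this sketch uses a multi-way split on marital status, the Divorced branch becomes its own leaf, which differs slightly from the binary grouping {Single, Divorced} used on the slides.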

Hunt’s Algorithm

Pairs give the class counts (Defaulted = No, Defaulted = Yes) at each node.

(a) Defaulted = No (7,3)

(b) Home Owner
      Yes: Defaulted = No (3,0)
      No: Defaulted = No (4,3)

(c) Home Owner
      Yes: Defaulted = No (3,0)
      No: Marital Status
        Single, Divorced: Defaulted = Yes (1,3)
        Married: Defaulted = No (3,0)

(d) Home Owner
      Yes: Defaulted = No (3,0)
      No: Marital Status
        Single, Divorced: Annual Income
          < 80K: Defaulted = No (1,0)
          >= 80K: Defaulted = Yes (0,3)
        Married: Defaulted = No (3,0)

Design Issues of Decision Tree Induction

• How should training records be split?
  – Method for expressing test condition
    ◆ depending on attribute types
  – Measure for evaluating the goodness of a test condition

• How should the splitting procedure stop?
  – Stop splitting if all the records belong to the same class or have identical attribute values
  – Early termination

Methods for Expressing Test Conditions

• Depends on attribute types
  – Binary
  – Nominal
  – Ordinal
  – Continuous

Test Condition for Nominal Attributes

• Multi-way split:
  – Use as many partitions as distinct values.
  – Marital Status: {Single} | {Divorced} | {Married}

• Binary split:
  – Divides values into two subsets
  – Marital Status: {Married} vs {Single, Divorced}
    OR {Single} vs {Married, Divorced}
    OR {Single, Married} vs {Divorced}
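The three binary groupings of Marital Status can be generated mechanically. The following sketch is an illustration only; `binary_splits` is a hypothetical helper. It lists each way of dividing k nominal values into two non-empty subsets once, which gives 2^(k-1) - 1 candidates.

```python
from itertools import combinations


def binary_splits(values):
    """Return every way to divide nominal values into two non-empty subsets."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    splits = []
    # Keep the first value on the left side so each partition is listed once.
    # Sizes stop short of len(rest) so the right side is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(values) - left
            splits.append((left, right))
    return splits


marital_splits = binary_splits(["Single", "Married", "Divorced"])
for left, right in marital_splits:
    print(sorted(left), "vs", sorted(right))
```

For the three marital-status values this yields the same three groupings shown on the slide.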

Test Condition for Ordinal Attributes

• Multi-way split:
  – Use as many partitions as distinct values
  – Shirt Size: {Small} | {Medium} | {Large} | {Extra Large}

• Binary split:
  – Divides values into two subsets
  – Preserve order property among attribute values
  – Shirt Size: {Small, Medium} vs {Large, Extra Large}
  – Shirt Size: {Small} vs {Medium, Large, Extra Large}
  – Shirt Size: {Small, Large} vs {Medium, Extra Large} (this grouping violates the order property)

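For an ordinal attribute, the order-preserving binary splits are the cuts of the ordered value list. A minimal sketch, with `ordinal_binary_splits` as a hypothetical helper name:

```python
def ordinal_binary_splits(ordered_values):
    """Binary splits that keep the order: cut the ordered list at each position."""
    return [
        (ordered_values[:i], ordered_values[i:])
        for i in range(1, len(ordered_values))
    ]


sizes = ["Small", "Medium", "Large", "Extra Large"]
shirt_splits = ordinal_binary_splits(sizes)
for left, right in shirt_splits:
    print(left, "vs", right)
```

With k ordered values there are k - 1 such cuts, and a grouping such as {Small, Large} vs {Medium, Extra Large} can never be produced.
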
Test Condition for Continuous Attributes

(i) Binary split: Annual Income > 80K? with outcomes Yes / No

(ii) Multi-way split: Annual Income? with outcomes < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K

Splitting Based on Continuous Attributes

Different ways of handling

– Discretization to form an ordinal categorical attribute
  Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
  ◆ Static – discretize once at the beginning
  ◆ Dynamic – repeat at each node

– Binary Decision: (A < v) or (A ≥ v)
  ◆ consider all possible splits and find the best cut
  ◆ can be more compute intensive

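The two bucketing schemes can be compared on the Annual Income column. This is a rough sketch: the helper names and the simple index-based percentile rule are assumptions, and the incomes are written in thousands.

```python
def equal_width_edges(values, k):
    """Interior cut points of k intervals of equal width (equal interval bucketing)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Interior cut points chosen so each bucket holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def bucket(value, edges):
    """Index of the bucket a value falls into (number of cut points <= value)."""
    return sum(value >= edge for edge in edges)


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
width_edges = equal_width_edges(incomes, 2)       # [140.0]
freq_edges = equal_frequency_edges(incomes, 2)    # [95]
print([bucket(v, width_edges) for v in incomes])
print([bucket(v, freq_edges) for v in incomes])
```

With two buckets, equal-width bucketing isolates the single 220K record, while equal-frequency bucketing puts five records on each side.
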
How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Gender:       Yes (C0: 6, C1: 4) | No (C0: 4, C1: 6)
Car Type:     Family (C0: 1, C1: 3) | Sports (C0: 8, C1: 0) | Luxury (C0: 1, C1: 7)
Customer ID:  c1 (C0: 1, C1: 0) | ... | c10 (C0: 1, C1: 0) | c11 (C0: 0, C1: 1) | ... | c20 (C0: 0, C1: 1)

Which test condition is the best?

How to determine the Best Split

• Greedy approach:
  – Nodes with purer class distribution are preferred

• Need a measure of node impurity:
  – C0: 5, C1: 5 (high degree of impurity)
  – C0: 9, C1: 1 (low degree of impurity)

Measures of Node Impurity

• Gini Index

$$\text{Gini Index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2$$

where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes

• Entropy

$$\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$$

• Misclassification error

$$\text{Classification error} = 1 - \max_i [p_i(t)]$$
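The three measures translate directly into code. The sketch below takes a list of per-class record counts at a node; the function names are illustrative. The entropy sum skips empty classes, which applies the usual convention that 0 log 0 counts as 0.

```python
from math import log2


def gini(counts):
    """Gini index: 1 minus the sum of squared class frequencies."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Entropy in bits; p * log2(1/p) equals -p * log2(p), and empty classes add 0."""
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


def classification_error(counts):
    """Misclassification error: 1 minus the largest class frequency."""
    return 1.0 - max(counts) / sum(counts)


# The two example nodes: (C0: 5, C1: 5) and (C0: 9, C1: 1).
for counts in ([5, 5], [9, 1]):
    print(counts, round(gini(counts), 3), round(entropy(counts), 3),
          round(classification_error(counts), 3))
```

All three measures are larger for the evenly mixed node than for the nearly pure one.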

Finding the Best Split

1. Compute impurity measure (P) before splitting
2. Compute impurity measure (M) after splitting
   • Compute impurity measure of each child node
   • M is the weighted impurity of child nodes
3. Choose the attribute test condition that produces the highest gain

   Gain = P - M

   or equivalently, lowest impurity measure after splitting (M)

Finding the Best Split
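Before Splitting: class counts C0: N00, C1: N01, with impurity P

Split on A?
  Yes: Node N1 (C0: N10, C1: N11), impurity M11
  No:  Node N2 (C0: N20, C1: N21), impurity M12
  Weighted impurity of N1 and N2: M1

Split on B?
  Yes: Node N3 (C0: N30, C1: N31), impurity M21
  No:  Node N4 (C0: N40, C1: N41), impurity M22
  Weighted impurity of N3 and N4: M2

Gain = P – M1 vs P – M2
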
Measure of Impurity: GINI

Gini Index for a given node $t$

$$\text{Gini Index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2$$

where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes

– Maximum of $1 - 1/c$ when records are equally distributed among all classes, implying the least beneficial situation for classification
– Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification
– Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT

Measure of Impurity: GINI

Gini Index for a given node $t$:

$$\text{Gini Index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2$$

– For 2-class problem $(p, 1 - p)$:
  ◆ $\text{GINI} = 1 - p^2 - (1 - p)^2 = 2p(1 - p)$

C1: 0, C2: 6   Gini = 0.000
C1: 1, C2: 5   Gini = 0.278
C1: 2, C2: 4   Gini = 0.444
C1: 3, C2: 3   Gini = 0.500

Computing Gini Index of a Single Node

$$\text{Gini Index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2$$

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444

Computing Gini Index for a Collection of Nodes

• When a node $p$ is split into $k$ partitions (children)

$$\text{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, \text{GINI}(i)$$

where $n_i$ = number of records at child $i$, and $n$ = number of records at parent node $p$.

Binary Attributes: Computing GINI Index

Splits into two partitions (child nodes)

Effect of Weighing partitions:
– Larger and purer partitions are sought

Parent: C1 = 7, C2 = 5, Gini = 0.486

Split on B?

        N1 (Yes)   N2 (No)
C1      5          2
C2      1          4

Gini(N1) = 1 – (5/6)^2 – (1/6)^2 = 0.278
Gini(N2) = 1 – (2/6)^2 – (4/6)^2 = 0.444

Weighted Gini of N1 and N2 = 6/12 * 0.278 + 6/12 * 0.444 = 0.361

Gain = 0.486 – 0.361 = 0.125
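The arithmetic on this slide can be reproduced in a few lines. The sketch assumes each node is given as a list of class counts [C1, C2]; `gini_split` is an illustrative name for the weighted Gini of the children.

```python
def gini(counts):
    """Gini index of one node from its per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_split(children):
    """Weighted Gini of the child nodes: sum of (n_i / n) * Gini(i)."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


parent = [7, 5]
children = [[5, 1], [2, 4]]   # N1 and N2
gain = gini(parent) - gini_split(children)
print(round(gini(parent), 3), round(gini_split(children), 3), round(gain, 3))
# 0.486 0.361 0.125
```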

Categorical Attributes: Computing Gini Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split

CarType   Family   Sports   Luxury
C1        1        8        1
C2        3        0        7
Gini = 0.163

Two-way split (find best partition of values)

CarType   {Sports, Luxury}   {Family}
C1        9                  1
C2        7                  3
Gini = 0.468

CarType   {Sports}   {Family, Luxury}
C1        8          2
C2        0          10
Gini = 0.167

Which of these is the best?
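One way to answer the question is to score each candidate directly from the count matrix, merging columns when values are grouped. The sketch below is illustrative only; the dictionary layout and the `merged` helper are assumptions.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_split(children):
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


# Count matrix: attribute value -> [C1 count, C2 count].
count_matrix = {"Family": [1, 3], "Sports": [8, 0], "Luxury": [1, 7]}


def merged(values):
    """Class counts of the partition formed by grouping several attribute values."""
    return [sum(count_matrix[v][k] for v in values) for k in range(2)]


candidates = {
    "multi-way": [count_matrix[v] for v in count_matrix],
    "{Sports, Luxury} | {Family}": [merged(["Sports", "Luxury"]), merged(["Family"])],
    "{Sports} | {Family, Luxury}": [merged(["Sports"]), merged(["Family", "Luxury"])],
}
scores = {name: gini_split(parts) for name, parts in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("lowest weighted Gini:", best)
```

On these counts the multi-way split has the lowest weighted Gini (0.1625), narrowly ahead of {Sports} vs {Family, Luxury} (about 0.1667). The second candidate evaluates to 0.46875, which the slide shows as 0.468.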

Continuous Attributes: Computing Gini Index

• Use Binary Decisions based on one value
• Several Choices for the splitting value
  – Number of possible splitting values = Number of distinct values
• Each splitting value has a count matrix associated with it
  – Class counts in each of the partitions, A ≤ v and A > v
• Simple method to choose best v
  – For each v, scan the database to gather count matrix and compute its Gini index
  – Computationally Inefficient! Repetition of work.

Example: Annual Income ? (on the ten-record Defaulted Borrower training data)

                 ≤ 80   > 80
Defaulted Yes    0      3
Defaulted No     3      4

Continuous Attributes: Computing Gini Index...

• For efficient computation: for each attribute,
  – Sort the attribute on values
  – Linearly scan these values, each time updating the count matrix and computing gini index
  – Choose the split position that has the least gini index

Sorted Values (Annual Income)   60     70     75     85     90     95     100    120    125    220
Cheat                           No     No     No     Yes    Yes    Yes    No     No     No     No

Split Position   Yes (<=, >)   No (<=, >)   Gini
55               0, 3          0, 7         0.420
65               0, 3          1, 6         0.400
72               0, 3          2, 5         0.375
80               0, 3          3, 4         0.343
87               1, 2          3, 4         0.417
92               2, 1          3, 4         0.400
97               3, 0          3, 4         0.300
110              3, 0          4, 3         0.343
122              3, 0          5, 2         0.375
172              3, 0          6, 1         0.400
230              3, 0          7, 0         0.420
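The sort-and-scan procedure can be sketched as follows. This is a minimal illustration with assumed names (`best_split`, `positive`). It evaluates only cuts between adjacent distinct values, using exact midpoints, so it omits the two trivial positions (55 and 230) that the slide lists outside the data range.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def best_split(values, labels, positive="Yes"):
    """Sort once, then scan linearly, moving one record from right to left per step."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(label == positive for _, label in pairs)
    left = [0, 0]                          # [positive, negative] counts for A <= v
    right = [total_pos, n - total_pos]     # counts for A > v
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label = pairs[i]
        k = 0 if label == positive else 1
        left[k] += 1                       # update the count matrix incrementally
        right[k] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue                       # no cut between equal values
        weighted = (sum(left) * gini(left) + sum(right) * gini(right)) / n
        if weighted < best[0]:
            best = (weighted, (value + next_value) / 2)
    return best


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
best_gini, cut = best_split(incomes, cheat)
print(best_gini, cut)
```

The scan selects the cut between 95 and 100 with a weighted Gini of 0.300. The slide lists this position as 97, while the exact midpoint is 97.5.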


Measure of Impurity: Entropy

• Entropy at a given node $t$

$$\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$$

where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes

  ◆ Maximum of $\log_2 c$ when records are equally distributed among all classes, implying the least beneficial situation for classification
  ◆ Minimum of 0 when all records belong to one class, implying most beneficial situation for classification

– Entropy based computations are quite similar to the GINI index computations

Computing Entropy of a Single Node

$$\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$$

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92

Computing Information Gain After Splitting

• Information Gain:

$$\text{Gain}_{split} = \text{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n} \, \text{Entropy}(i)$$

Parent node $p$ is split into $k$ partitions (children); $n_i$ is the number of records in child node $i$, and $n$ is the number of records at the parent node $p$

– Choose the split that achieves most reduction (maximizes GAIN)
– Used in the ID3 and C4.5 decision tree algorithms
– Information gain is the mutual information between the class variable and the splitting variable

Problem with large number of partitions

Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure

Gender:       Yes (C0: 6, C1: 4) | No (C0: 4, C1: 6)
Car Type:     Family (C0: 1, C1: 3) | Sports (C0: 8, C1: 0) | Luxury (C0: 1, C1: 7)
Customer ID:  c1 (C0: 1, C1: 0) | ... | c10 (C0: 1, C1: 0) | c11 (C0: 0, C1: 1) | ... | c20 (C0: 0, C1: 1)

– Customer ID has highest information gain because entropy for all the children is zero

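The effect can be checked numerically on the three test conditions shown here. The sketch below is illustrative; `information_gain` and the data layout (lists of [C0, C1] counts per child) are assumptions.

```python
from math import log2


def entropy(counts):
    """Entropy in bits from per-class counts; empty classes contribute 0."""
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)


parent = [10, 10]
tests = {
    "Gender": [[6, 4], [4, 6]],
    "Car Type": [[1, 3], [8, 0], [1, 7]],
    # Twenty single-record children: c1..c10 in class 0, c11..c20 in class 1.
    "Customer ID": [[1, 0]] * 10 + [[0, 1]] * 10,
}
gains = {name: information_gain(parent, ch) for name, ch in tests.items()}
for name, value in gains.items():
    print(name, round(value, 3))
```

Customer ID reaches the maximum possible gain of 1 bit even though an identifier cannot generalize to new records, which is the problem this slide describes.
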
Gain Ratio

• Gain Ratio:

$$\text{Gain Ratio} = \frac{\text{Gain}_{split}}{\text{Split Info}}, \qquad \text{Split Info} = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$

Parent node $p$ is split into $k$ partitions (children); $n_i$ is the number of records in child node $i$

– Adjusts Information Gain by the entropy of the partitioning (Split Info).
  ◆ Higher entropy partitioning (large number of small partitions) is penalized!
– Used in C4.5 algorithm
– Designed to overcome the disadvantage of Information Gain

Gain Ratio

• Gain Ratio:

$$\text{Gain Ratio} = \frac{\text{Gain}_{split}}{\text{Split Info}}, \qquad \text{Split Info} = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$

Parent node $p$ is split into $k$ partitions (children); $n_i$ is the number of records in child node $i$

CarType   Family   Sports   Luxury
C1        1        8        1
C2        3        0        7
Gini = 0.163, SplitINFO = 1.52

CarType   {Sports, Luxury}   {Family}
C1        9                  1
C2        7                  3
Gini = 0.468, SplitINFO = 0.72

CarType   {Sports}   {Family, Luxury}
C1        8          2
C2        0          10
Gini = 0.167, SplitINFO = 0.97
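The SplitINFO values above depend only on the partition sizes, so they can be computed as the entropy of those sizes. The sketch below is illustrative, with assumed function names. It also evaluates the entropy-based gain ratio for the three CarType splits, taking the parent node to be the full 20-record set (10 per class).

```python
from math import log2


def entropy(counts):
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


def split_info(children):
    """Entropy of the partitioning itself: uses only the child sizes n_i."""
    return entropy([sum(child) for child in children])


def gain_ratio(parent, children):
    n = sum(parent)
    gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
    return gain / split_info(children)


parent = [10, 10]
multi_way = [[1, 3], [8, 0], [1, 7]]       # Family, Sports, Luxury
two_way_a = [[9, 7], [1, 3]]               # {Sports, Luxury} | {Family}
two_way_b = [[8, 0], [2, 10]]              # {Sports} | {Family, Luxury}

for name, ch in [("multi-way", multi_way), ("two-way a", two_way_a), ("two-way b", two_way_b)]:
    print(name, round(split_info(ch), 2), round(gain_ratio(parent, ch), 3))
```

Under these assumptions the multi-way split pays the largest penalty, and the gain ratio comes out higher for {Sports} vs {Family, Luxury} than for the multi-way split.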

Measure of Impurity: Classification Error

• Classification error at a node $t$

$$\text{Error}(t) = 1 - \max_i [p_i(t)]$$

– Maximum of $1 - 1/c$ when records are equally distributed among all classes, implying the least interesting situation
– Minimum of 0 when all records belong to one class, implying the most interesting situation

Computing Error of a Single Node

$$\text{Error}(t) = 1 - \max_i [p_i(t)]$$

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Error = 1 – max (0, 1) = 1 – 1 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

Comparison among Impurity Measures

[Figure: comparison among impurity measures for a 2-class problem]

Misclassification Error vs Gini Index

Parent: C1 = 7, C2 = 3, Gini = 0.42

Split on A?

        N1 (Yes)   N2 (No)
C1      3          4
C2      0          3

Gini(N1) = 1 – (3/3)^2 – (0/3)^2 = 0
Gini(N2) = 1 – (4/7)^2 – (3/7)^2 = 0.489

Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342

Gini improves but error remains the same!!

Misclassification Error vs Gini Index

Parent: C1 = 7, C2 = 3, Gini = 0.42

Split on A?, two alternative outcomes:

        N1   N2
C1      3    4
C2      0    3
Gini = 0.342

        N1   N2
C1      3    4
C2      1    2
Gini = 0.416

Misclassification error for all three cases = 0.3 !
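The comparison can be verified with a short script. It is a sketch with assumed helper names; nodes are lists of [C1, C2] counts. The computed weighted Gini values round to 0.343 and 0.417, whereas the slides show 0.342 and 0.416, which appear to be truncated.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def error(counts):
    """Misclassification error: 1 minus the largest class frequency."""
    return 1.0 - max(counts) / sum(counts)


def weighted(measure, children):
    """Size-weighted average of an impurity measure over child nodes."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


parent = [7, 3]
splits = {"first": [[3, 0], [4, 3]], "second": [[3, 1], [4, 2]]}

print("parent", round(gini(parent), 3), round(error(parent), 3))
for name, children in splits.items():
    print(name, round(weighted(gini, children), 3), round(weighted(error, children), 3))
```

Gini separates the parent and the two candidate splits, while the misclassification error is 0.3 in all three cases.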

Decision Tree Based Classification
• Advantages:
  – Relatively inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Robust to noise (especially when methods to avoid overfitting are employed)
  – Can easily handle redundant attributes
  – Can easily handle irrelevant attributes (unless the attributes are interacting)
• Disadvantages:
  – Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
  – Each decision boundary involves only a single attribute

Handling interactions

+ : 1000 instances
o : 1000 instances

Entropy (X) : 0.99
Entropy (Y) : 0.99

[Figure: + and o instances plotted against X and Y]

Handling interactions given irrelevant attributes

+ : 1000 instances
o : 1000 instances

Adding Z as a noisy attribute generated from a uniform distribution

Entropy (X) : 0.99
Entropy (Y) : 0.99
Entropy (Z) : 0.98

Attribute Z will be chosen for splitting!

[Figure: + and o instances plotted against X and Y]
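Why interacting attributes are hard for a greedy criterion can be seen on a toy dataset. The data below are hypothetical and are not the data behind the figure: a noise-free XOR pattern over two binary attributes, with 1000 instances per class. The helper names are assumptions as well.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(c / n * log2(n / c) for c in Counter(labels).values())


def split_entropy(rows, key):
    """Size-weighted entropy of the groups obtained by splitting on key(row)."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row["label"])
    n = len(rows)
    return sum(len(g) / n * entropy(g) for g in groups.values())


# Hypothetical XOR data: the class is "+" exactly when x and y differ.
rows = [
    {"x": x, "y": y, "label": "+" if x != y else "o"}
    for x in (0, 1) for y in (0, 1) for _ in range(500)
]

print(split_entropy(rows, lambda r: r["x"]))            # 1.0
print(split_entropy(rows, lambda r: r["y"]))            # 1.0
print(split_entropy(rows, lambda r: (r["x"], r["y"])))  # 0.0
```

Splitting on x or y alone leaves the entropy at its maximum of 1 bit, while using both together gives pure partitions. A greedy splitter that scores one attribute at a time sees no benefit in either, so a noisy attribute with a slightly lower entropy can be chosen instead.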

Limitations of single attribute-based decision boundaries

Both positive (+) and negative (o) classes generated from skewed Gaussians with centers at (8,8) and (12,12) respectively.
