Week 6 Chap3 - Basic - Classification
Topics
▪ Introduction
▪ Decision Trees
—Overview
—Tree Induction
—Overfitting and other Practical Issues
▪ Model Evaluation
—Metrics for Performance Evaluation
—Methods to Obtain Reliable Estimates
—Model Comparison (Relative Performance)
▪ Feature Selection
▪ Class Imbalance
Supervised Learning
▪ Examples
—Input-output pairs: E = {(x1, y1), …, (xi, yi), …, (xN, yN)}.
—We assume that the examples are produced i.i.d. (with noise and errors) from a
target function y = f(x).
▪ Learning problem
—Given a hypothesis space H
—Find a hypothesis h ∈ H such that ŷi = h(xi) ≈ yi
—That is, we want to approximate 𝑓 by ℎ using E.
▪ Includes
—Regression (outputs = real numbers). Goal: Predict the number accurately.
E.g., x is a house and 𝑓(𝑥) is its selling price.
—Classification (outputs = class labels). Goal: Assign new records to a class.
E.g., 𝑥 is an email and 𝑓(𝑥) is spam / ham
▪ Task:
—Learn a model that maps each attribute set x into one of the predefined class labels y:
y = h(x)
Examples of Classification Task
▪ Predicting tumor cells as benign or malignant
▪ Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Example of a Decision Tree

Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Induced tree (splitting attributes: Refund, MarSt, TaxInc):

Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                ├─ < 80K → NO
                └─ > 80K → YES

There could be more than one tree that fits the same data!
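The induced tree above can be read as a simple nested-if program. A minimal sketch (incomes are in thousands, so the 80K threshold becomes 80; the function name is illustrative):

```python
def predict_cheat(refund, marital_status, taxable_income):
    """Walk the induced decision tree for one record.
    Attribute order and the 80K threshold follow the example tree above."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: fall through to the Taxable Income test
    return "Yes" if taxable_income > 80 else "No"

# Record 8 from the table: Refund = No, Single, 85K -> Cheat = Yes
print(predict_cheat("No", "Single", 85))
```

Running the function over all ten training records reproduces the Cheat column, which is exactly what tree induction aims for.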
Decision Tree: Deduction — Apply Model to Test Data

Test record:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch matching each attribute value:

Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                ├─ < 80K → NO
                └─ > 80K → YES
Refund = No → follow the “No” branch to MarSt. Marital Status = Married → reach the leaf NO.
Assign Cheat to “No”.
Decision Tree: Induction
Decision Tree Induction
Many Algorithms:
▪ Hunt’s Algorithm (one of the earliest)
▪ CART (Classification And Regression Tree)
▪ ID3, C4.5, C5.0 (by Ross Quinlan, information gain)
▪ CHAID (CHi-squared Automatic Interaction Detection)
▪ MARS (Improvement for numerical features)
▪ SLIQ, SPRINT
▪ Conditional Inference Trees (recursive partitioning using statistical
tests)
Hunt’s Algorithm

"Use attributes to split the data recursively, till each split contains only a single class."

Step 1 — split on Refund:
Refund?
├─ Yes → Don’t Cheat
└─ No → mixed

Step 2 — split the mixed node on Marital Status:
Refund?
├─ Yes → Don’t Cheat
└─ No → Marital Status?
        ├─ Married → Don’t Cheat
        └─ Single, Divorced → mixed

Step 3 — split the remaining mixed node on Taxable Income:
Refund?
├─ Yes → Don’t Cheat
└─ No → Marital Status?
        ├─ Married → Don’t Cheat
        └─ Single, Divorced → Taxable Income?
                ├─ < 80K → Don’t Cheat
                └─ >= 80K → Cheat
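The recursive procedure can be sketched in a few lines. This is a minimal illustration, not the full algorithm: it tests attributes in a fixed order instead of choosing the best split, handles only categorical attributes, and uses shortened column names:

```python
from collections import Counter

def hunt(records, attributes, label="Cheat"):
    """Recursive sketch of Hunt's algorithm for categorical attributes.
    `records` is a list of dicts; returns a nested dict, or a leaf label."""
    labels = [r[label] for r in records]
    if len(set(labels)) == 1:          # pure node -> leaf with that class
        return labels[0]
    if not attributes:                 # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]               # naive choice; real trees pick the best split
    tree = {}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        tree[value] = hunt(subset, attributes[1:], label)
    return {attr: tree}

data = [
    {"Refund": "Yes", "Marital": "Single",   "Cheat": "No"},
    {"Refund": "No",  "Marital": "Married",  "Cheat": "No"},
    {"Refund": "No",  "Marital": "Divorced", "Cheat": "Yes"},
]
print(hunt(data, ["Refund", "Marital"]))
```

On this toy data the sketch reproduces the shape of the tree above: Refund = Yes is immediately pure, and the Refund = No branch is split again on Marital status.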
General Structure of Hunt’s Algorithm

Training data:

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes

▪ Let Dt be the set of training records that reach a node t
▪ General Procedure:
—If Dt contains records that all belong to the same class yt, then t is a leaf node
labeled as yt.
—If Dt contains records that belong to more than one class, use an attribute test to
split the data into smaller subsets, and recursively apply the procedure to each subset.

Resulting tree:

Home Owner?
├─ Yes → Defaulted = No (3,0)
└─ No → Marital Status?
        ├─ Married → Defaulted = No
        └─ Single, Divorced → Annual Income?
                ├─ < 80K → Defaulted = No
                └─ >= 80K → Defaulted = Yes
Example 2: Creating a Decision Tree

[Scatter plot of two classes in the (x1, x2) plane: red X’s and blue circles (o’s);
the circles lie below x2 = 2.5, and a few circles sit above it with x1 < 2.]

Step 1 — split on X2 < 2.5:
X2 < 2.5?
├─ Yes → Blue circle (pure)
└─ No → mixed

Step 2 — split the mixed node on X1 < 2:
X2 < 2.5?
├─ Yes → Blue circle
└─ No → X1 < 2?
        ├─ Yes → Blue circle
        └─ No → Red X
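The finished tree is just two nested threshold tests. A minimal sketch (thresholds 2.5 and 2 are taken from the tree above; the function and class names are illustrative):

```python
def predict(x1, x2):
    """Classify a point with the two axis-aligned splits from Example 2."""
    if x2 < 2.5:
        return "blue circle"
    # x2 >= 2.5: the region is mixed, so test x1 next
    return "blue circle" if x1 < 2 else "red X"

print(predict(3.0, 1.0))   # below the x2 = 2.5 split
print(predict(1.0, 4.0))   # above the split, but x1 < 2
print(predict(3.0, 4.0))   # above the split, x1 >= 2
```

Each leaf of the tree corresponds to one axis-aligned rectangle of the plane, which is why decision trees produce rectangular decision boundaries.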
Tree Induction
▪ Greedy strategy
—Split the records based on an attribute test that optimizes a certain criterion.
▪ Issues
—Determine how to split the records using different attribute types.
—How to determine the best split?
—Determine when to stop splitting
Design Issues of Decision Tree Induction

▪ Multi-way split (e.g., Marital Status):
—Use as many partitions as distinct values.
▪ Binary split (e.g., Shirt Size):
—Divide the values into two subsets, e.g. {Small, Large} vs. {Medium, Extra Large}.
2/1/2021 Introduction to Data Mining, 2nd Edition 35
Test Condition for Continuous Attributes

▪ Binary split: Annual Income > 80K? → Yes / No
▪ Multi-way split: Annual Income? → ranges, e.g. < 10K, …, > 80K
How to determine the Best Split

▪ Greedy approach:
—Nodes with homogeneous class distribution are preferred.

C0: 5, C1: 5 → Non-homogeneous, high degree of impurity
C0: 9, C1: 1 → Homogeneous, low degree of impurity
Find the Best Split — General Framework

Assume we have a measure M that tells us how "pure" a node is.

Before splitting: node with counts C0: N00, C1: N01 → impurity M0.

Split on attribute A (Yes/No) → children with impurities M1, M2 → weighted impurity M12.
Split on attribute B (Yes/No) → children with impurities M3, M4 → weighted impurity M34.

Gain = M0 – M12 vs. M0 – M34 → choose the split with the larger gain.
Measures of Node Impurity

GINI(t) = 1 − Σ_j [p(j|t)]²

Example for computing GINI:

C1: 3, C2: 3 → P(C1) = 3/6 = 1/2, P(C2) = 3/6 = 1/2
Gini(t) = 1 − (1/2)² − (1/2)² = 0.500
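As a quick sketch, the GINI measure can be computed directly from the per-class counts at a node:

```python
def gini(counts):
    """Gini impurity of a node, GINI(t) = 1 - sum_j p(j|t)^2,
    computed from the per-class record counts at the node."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([3, 3]))  # maximally impure two-class node -> 0.5
print(gini([6, 0]))  # pure node -> 0.0
```

For two classes the Gini index ranges from 0 (pure) to 0.5 (an even split), matching the worked example above.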
For a split of node p into k children:

GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i = number of records at child i, and n = number of records at node p.
Splitting on attribute B:

Parent: C1 = 6, C2 = 6 → Gini = 0.500

B? → Node N1 (Yes), Node N2 (No)

      N1   N2
C1     5    1
C2     3    3

Gini(N1) = 1 – (5/8)² – (3/8)² = 0.469
Gini(N2) = 1 – (1/4)² – (3/4)² = 0.375

Weighted Gini of N1, N2 = 8/12 × 0.469 + 4/12 × 0.375 = 0.438

Gain = 0.500 – 0.438 = 0.062 → GINI improves!
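The computation above can be checked with a short sketch; `gini_split` is a hypothetical helper implementing the GINI_split formula from this section:

```python
def gini(counts):
    """Gini impurity from per-class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(children):
    """Weighted Gini of a split, GINI_split = sum_i (n_i / n) * GINI(i),
    where `children` is a list of per-class count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [6, 6]                # C1 = 6, C2 = 6 at the parent -> Gini = 0.5
children = [[5, 3], [1, 3]]    # split on B: N1 = (5, 3), N2 = (1, 3)
gain = gini(parent) - gini_split(children)
print(gini_split(children))    # weighted Gini, ~ 0.438
print(gain)                    # gain, ~ 0.062
```

The split is kept because the weighted child impurity (0.438) is lower than the parent impurity (0.500).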
Hunt’s Algorithm — worked example

Using the Defaulted Borrower training data above:

Step 1 — the root holds all records: Defaulted = No (7,3).

Step 2 — split on Home Owner:
Home Owner?
├─ Yes → Defaulted = No (3,0)  (pure)
└─ No → Defaulted = No (4,3)  (still mixed)

Step 3 — split the mixed node on Marital Status, then on Annual Income:
Home Owner?
├─ Yes → Defaulted = No (3,0)
└─ No → Marital Status?
        ├─ Married → Defaulted = No
        └─ Single, Divorced → Annual Income?
                ├─ < 80K → Defaulted = No
                └─ >= 80K → Defaulted = Yes
Continuous Attributes: Computing Gini Index

Sort the records by Annual Income and evaluate the Gini index at each candidate split
position (the midpoints between consecutive values). Counts of “No” records on each
side of the split (≤ split, > split), with the resulting Gini:

No:   (0,7) (1,6) (2,5) (3,4) (3,4) (3,4) (3,4) (4,3) (5,2) (6,1) (7,0)
Gini: 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

Choose the split position with the minimum Gini (here 0.300).
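The scan over candidate thresholds can be sketched as follows, using the (Annual Income, Cheat) pairs from the example table earlier in this section; `best_income_split` is a hypothetical helper name:

```python
def gini(counts):
    """Gini impurity from per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_income_split(records):
    """Scan candidate thresholds (midpoints between consecutive sorted values)
    and return (minimum weighted Gini, threshold), as on the slide."""
    records = sorted(records)            # (income, label) pairs
    n = len(records)
    best = (float("inf"), None)
    for i in range(1, n):
        left = [lab for _, lab in records[:i]]
        right = [lab for _, lab in records[i:]]
        counts = lambda labs: [labs.count("Yes"), labs.count("No")]
        g = (len(left) * gini(counts(left)) + len(right) * gini(counts(right))) / n
        threshold = (records[i - 1][0] + records[i][0]) / 2
        if g < best[0]:
            best = (g, threshold)
    return best

# Annual Income (in thousands) and Cheat labels from the example table
data = [(125, "No"), (100, "No"), (70, "No"), (120, "No"), (95, "Yes"),
        (60, "No"), (220, "No"), (85, "Yes"), (75, "No"), (90, "Yes")]
print(best_income_split(data))   # minimum Gini 0.3, threshold 97.5
```

The naive version recomputes the class counts at every position; the usual optimization is a single linear scan that updates the counts incrementally as records move from the right partition to the left.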