
DATA MINING

Classification
Lec 8

Mohammed
Taiz University
Outline
• Define classification
• Decision trees
• Evaluate the performance of a classifier
Introduction
Given a collection of records (training set)
Each record contains a set of attributes, one of which is the class
What is classification?
• Classification is the task of learning a
target function f that maps an
attribute set x to one of the predefined class labels y

[Training data table with attribute types: categorical, categorical, continuous, class]

One of the attributes is the class attribute, in this case: Cheat

Two class labels (or classes): Yes (1), No (0)


General Approach for Building a Classification Model
Why classification?
• The target function f is known as a classification model

• Descriptive modeling: an explanatory tool to distinguish between objects of different classes (e.g., understand why people cheat on their taxes)

• Predictive modeling: predict the class of a previously unseen record
Examples of Classification Tasks
• Predicting tumor cells as benign or malignant

• Classifying credit card transactions as legitimate or fraudulent

• Categorizing news stories as finance, weather, entertainment, sports, etc.

• Identifying spam email, spam web pages, adult content
General approach to classification
• Training set consists of records with known class labels

• Training set is used to build a classification model

• A labeled test set of previously unseen data records is used to evaluate the quality of the model

• The classification model is applied to new records with unknown class labels
Illustrating Classification Task
Decision tree example
[Training data table with attribute types: categorical, categorical, continuous, class]
Splitting Attributes

Refund
Yes No
Test outcome
NO MarSt
Single, Divorced Married

TaxInc NO
< 80K > 80K

NO YES
Class labels
Training Data Model: Decision Tree
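
For illustration, the tree above can be written directly as a prediction function. The following is a minimal Python sketch (not part of the lecture); the field names Refund, MarSt, and TaxInc are taken from the tree, and TaxInc is assumed to be in thousands.

def predict_cheat(record):
    # Root node: test Refund
    if record["Refund"] == "Yes":
        return "No"                        # left leaf: NO
    # Refund = No: test Marital Status
    if record["MarSt"] == "Married":
        return "No"                        # Married leaf: NO
    # Single or Divorced: test Taxable Income
    return "Yes" if record["TaxInc"] > 80 else "No"

# Example: Refund = No, Married => predicted Cheat = "No"
print(predict_cheat({"Refund": "No", "MarSt": "Married", "TaxInc": 95}))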
Another Example of Decision Tree
MarSt
Married          Single, Divorced

NO Refund
Yes No

NO TaxInc
< 80K > 80K

NO YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
Apply Model to Test Data
Start from the root of the tree and follow, at each node, the branch whose test outcome matches the test record's attribute value, until a leaf is reached.

Refund
Yes No

NO MarSt
Single, Divorced Married

TaxInc NO
< 80K > 80K

NO YES

For the test record (Refund = No, MarSt = Married): Refund = No leads to the MarSt node, and MarSt = Married leads to the NO leaf.
Assign Cheat to “No”
Classification Techniques
• Decision Tree-based Methods
• Rule-based Methods
• Memory-based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Tree Induction
• Greedy strategy:
• Split the records based on an attribute test that
optimizes a certain criterion

• Many Algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ, SPRINT
General Structure of Hunt’s Algorithm
● Let Dt be the set of training records that reach a node t

● General Procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
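
The procedure translates almost line-for-line into code. Below is a minimal sketch (an illustration, not the lecture's own code), where records are (attributes, label) pairs and choose_test is a placeholder for whatever attribute-test selection criterion is used (e.g., the best Gini split).

from collections import Counter

def hunt(records, choose_test):
    labels = [y for _, y in records]
    # Case 1: all records at this node belong to the same class -> leaf
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Case 2: apply an attribute test and split into smaller subsets
    test = choose_test(records)            # returns a function of the attributes
    subsets = {}
    for x, y in records:
        subsets.setdefault(test(x), []).append((x, y))
    if len(subsets) == 1:                  # test cannot separate the records
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Recursively apply the procedure to each subset
    return {"test": test,
            "children": {v: hunt(s, choose_test) for v, s in subsets.items()}}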
Hunt’s Algorithm on the tax-cheat data (class counts shown as (Cheat = No, Cheat = Yes)):

Step 1: start with a single node containing all records: (7,3)
Step 2: split on Refund → Yes: (3,0), a pure leaf; No: (4,3)
Step 3: split the (4,3) node on Marital Status → Married: (3,0), a pure leaf; Single, Divorced: (1,3)
Step 4: split the (1,3) node on Taxable Income → < 80K: (1,0); ≥ 80K: (0,3); all leaves are now pure
Design Issues of Decision Tree Induction
● How should training records be split?
– Method for expressing the test condition (depending on attribute type)
– Measure for evaluating the goodness of a test condition

● How should the splitting procedure stop?
– Stop splitting if all the records belong to the same class or have identical attribute values
– Early termination
Methods for Expressing Test Conditions
●Depends on attribute types
– Nominal
– Ordinal
– Continuous

●Depends on number of ways to split


– 2-way split
– Multi-way split
Test Condition for Nominal Attributes
● Multi-way split:
– Use as many partitions as distinct values
Marital Status → Single / Divorced / Married

● Binary split:
– Divides values into two subsets
{Married} vs. {Single, Divorced}    OR    {Single} vs. {Married, Divorced}    OR    {Single, Married} vs. {Divorced}
Test Condition for Ordinal Attributes
● Multi-way split:
– Use as many partitions as distinct values
Shirt Size → Small / Medium / Large / Extra Large

● Binary split:
– Divides values into two subsets
– Must preserve the order property among attribute values, e.g. {Small, Medium} vs. {Large, Extra Large}

This grouping violates the order property:
{Small, Large} vs. {Medium, Extra Large}
Test Condition for Continuous Attributes
– Binary split: (A < v) or (A ≥ v) for some threshold v
– Multi-way split: partition A into disjoint ranges (vi ≤ A < vi+1)
How to determine the Best Split

Before splitting: 10 records of class 0, 10 records of class 1

Which test condition is the best?
How to determine the Best Split
●Greedy approach:
– Nodes with purer class distribution are preferred

●Need a measure of node impurity:

C0: 5, C1: 5          C0: 9, C1: 1
High degree of impurity      Low degree of impurity
Measures of Node Impurity
● Gini Index

● Entropy

● Misclassification error
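
For reference, the three measures can be written for a node t, where p(j | t) is the relative frequency of class j at t (these are the standard definitions):

$$\mathrm{Gini}(t) = 1 - \sum_{j} \big[p(j \mid t)\big]^2 \qquad \mathrm{Entropy}(t) = -\sum_{j} p(j \mid t)\,\log_2 p(j \mid t) \qquad \mathrm{Error}(t) = 1 - \max_{j}\, p(j \mid t)$$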
Finding the Best Split
1. Compute impurity measure (P) before splitting
2. Compute impurity measure (M) after splitting
● Compute impurity measure of each child node
● M is the weighted impurity of child nodes

3. Choose the attribute test condition that produces the highest gain

Gain = P – M

or equivalently, the lowest impurity measure after splitting (M)
Finding the Best Split
Before Splitting: P

A? B?
Yes No Yes No

Node N1 Node N2 Node N3 Node N4

M11 M12 M21 M22

M1 M2
Gain = P – M1 vs P – M2
Measure of Impurity: GINI
• Gini index for a given node t:

Gini(t) = 1 – Σj [ p(j | t) ]²

(where p(j | t) is the relative frequency of class j at node t)

C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
Computing Gini Index of a Single Node

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Gini = 1 – (1/6)² – (5/6)² = 0.278

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Gini = 1 – (2/6)² – (4/6)² = 0.444
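
These values are easy to verify with a few lines of Python (a sketch added for illustration, not part of the slides):

def gini(counts):
    # Gini index of a node from its per-class record counts
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))               # 0.0
print(round(gini([1, 5]), 3))     # 0.278
print(round(gini([2, 4]), 3))     # 0.444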
Computing Gini Index for a Collection of Nodes
When a node p with n records is split into k partitions (children), the quality of the split is the weighted average of the children's impurity:

GINIsplit = Σi (ni / n) · Gini(i), where ni is the number of records at child i
Binary Attributes: Computing GINI Index
Splits into two partitions (child nodes)
Effect of weighting partitions: larger and purer partitions are sought

B?
Yes No
Node N1 Node N2

Gini(N1) = 1 – (5/6)² – (1/6)² = 0.278
Gini(N2) = 1 – (2/6)² – (4/6)² = 0.444

Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361

Gain = 0.486 – 0.361 = 0.125   (0.486 is the parent node's Gini before splitting)
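
The same computation in code (a self-contained sketch; the parent counts (7, 5) are inferred by summing the two children):

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [7, 5]                      # class counts before splitting (N1 + N2)
children = [[5, 1], [2, 4]]          # N1 and N2 after testing B?
n = sum(parent)
m = sum(sum(c) / n * gini(c) for c in children)   # weighted impurity M
print(round(gini(parent), 3))        # 0.486
print(round(m, 3))                   # 0.361
print(round(gini(parent) - m, 3))    # Gain = 0.125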
Measure of Impurity: Entropy
• Entropy for a given node t:

Entropy(t) = – Σj p(j | t) log₂ p(j | t)

(where p(j | t) is the relative frequency of class j at node t)
Computing Entropy of a Single Node

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = – 0 log₂ 0 – 1 log₂ 1 = – 0 – 0 = 0

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Entropy = – (1/6) log₂ (1/6) – (5/6) log₂ (5/6) = 0.65

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Entropy = – (2/6) log₂ (2/6) – (4/6) log₂ (4/6) = 0.92
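
Again, these are easy to check with a short sketch (added for illustration; 0 log 0 is taken as 0):

from math import log2

def entropy(counts):
    # Entropy of a node from its per-class record counts
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))              # 0.0
print(round(entropy([1, 5]), 2))    # 0.65
print(round(entropy([2, 4]), 2))    # 0.92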
Computing Information Gain After Splitting

Gainsplit = Entropy(p) – Σi (ni / n) Entropy(i)

where parent node p with n records is split into partitions, and ni is the number of records in partition i.

Problem with a large number of partitions
● Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure
– Customer ID has the highest information gain because the entropy of all its children is zero, even though it is useless for classifying new records
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class

• Stop expanding a node when all the records have similar attribute values

• Early termination (to be discussed later)


Decision Tree Based Classification
● Advantages:
– Relatively inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are
employed)
– Can easily handle redundant attributes
– Can easily handle irrelevant attributes
Practical Issues of Classification
• Underfitting and Overfitting

• Evaluation
Underfitting and Overfitting
• Underfitting: the model is too simple to capture the structure of the data, so its predictions are inaccurate even on the training set
• Typical causes: the training data is too small, or the model needs more training
• Overfitting: an induced tree may overfit the training data
• Too many branches, some of which may reflect anomalies due to noise or outliers
• Poor accuracy for unseen samples
Model Evaluation
• Metrics for Performance Evaluation
• How to evaluate the performance of a model?

• Methods for Performance Evaluation
• How to obtain reliable estimates?

• Methods for Model Comparison
• How to compare the relative performance among competing models?
Metrics for Performance Evaluation
• Focus on the predictive capability of a model
• Rather than how fast it classifies or builds models, scalability, etc.
• Confusion Matrix:

PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL  Class=Yes        a           b
CLASS   Class=No         c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Metrics for Performance Evaluation…
PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL  Class=Yes     a (TP)      b (FN)
CLASS   Class=No      c (FP)      d (TN)

• Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Example of confusion matrix
              Predicted C1       Predicted C2
Actual C1     true positive      false negative
Actual C2     false positive     true negative

classes                buy_computer = yes   buy_computer = no   total    recognition (%)
buy_computer = yes     6954                 46                  7000     99.34
buy_computer = no      412                  2588                3000     86.27
total                  7366                 2634                10000    95.42

(overall recognition rate = accuracy = (6954 + 2588) / 10000 = 95.42%)
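
To reproduce the figures in this table (a sketch added for illustration):

# Confusion matrix counts: rows = actual class, columns = predicted class
tp, fn = 6954, 46      # actual buy_computer = yes
fp, tn = 412, 2588     # actual buy_computer = no

print(f"recognition (yes): {tp / (tp + fn):.2%}")                   # 99.34%
print(f"recognition (no):  {tn / (fp + tn):.2%}")                   # 86.27%
print(f"accuracy:          {(tp + tn) / (tp + fn + fp + tn):.2%}")  # 95.42%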
Precision-Recall
• Precision :
• Exactness, what % of tuples that the classifier
labeled as positive are actually positive.

• Recall (completeness)
• What % of positive tuples did the classifier label as
positive?
• Perfect score is 1.0

• F measure: harmonic mean of Precision and Recall


Precision-Recall
Counts:
PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL  Class=Yes        a           b
CLASS   Class=No         c           d

Precision (p) = a / (a + c) = TP / (TP + FP)

Recall (r) = a / (a + b) = TP / (TP + FN)

F-measure (F) = 2 / (1/r + 1/p) = 2rp / (r + p) = 2a / (2a + b + c) = 2TP / (2TP + FP + FN)

● Precision is biased towards C(Yes|Yes) & C(Yes|No)


● Recall is biased towards C(Yes|Yes) & C(No|Yes)
● F-measure is biased towards all except C(No|No)
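
In code (a sketch added for illustration; the counts below are hypothetical, chosen only to exercise the formulas):

def precision_recall_f(a, b, c):
    # a = TP, b = FN, c = FP, as in the matrix above
    p = a / (a + c)              # precision
    r = a / (a + b)              # recall
    f = 2 * r * p / (r + p)      # F-measure: harmonic mean of p and r
    return p, r, f

p, r, f = precision_recall_f(a=90, b=10, c=30)
print(round(p, 2), round(r, 2), round(f, 2))   # 0.75 0.9 0.82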
Methods of Estimation for a Model
• Holdout
• Reserve 2/3 for training and 1/3 for testing
• Random subsampling
• One sample may be biased -- Repeated holdout
• Cross validation
• Partition data into k disjoint subsets
• k-fold: train on k-1 partitions, test on the remaining one
• Leave-one-out: k=n
• Guarantees that each record is used the same number
of times for training and testing
• Bootstrap
• Sampling with replacement
• On average ~63.2% of the records appear in the training sample; the remaining ~36.8% are used for testing
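
A minimal sketch of k-fold cross-validation (illustrative only; train_fn stands in for any procedure that fits a classifier and returns a predict function, an assumption rather than something defined in the lecture):

import random

def k_fold_accuracy(records, labels, train_fn, k=10, seed=0):
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k disjoint partitions
    correct = 0
    for i in range(k):
        test = folds[i]                        # hold out one partition
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn([records[j] for j in train], [labels[j] for j in train])
        correct += sum(model(records[j]) == labels[j] for j in test)
    return correct / len(records)              # every record is tested exactly once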
ANY QUESTIONS?
