
DATA MINING

Classification
Lec 8

Mohammed
Taiz University
Outline
• Define classification
• Decision trees
• Evaluate the performance of a classifier
Introduction
Given a collection of records (training set)
Each record contains a set of attributes, one of which is the class
What is classification?
• Classification is the task of learning a
target function f that maps an
attribute set x to one of the predefined class labels y

[Training data table with attribute types: categorical, categorical, continuous, class]

One of the attributes is the class attribute, in this case: Cheat

Two class labels (or classes): Yes (1), No (0)


General Approach for Building a Classification Model
Why classification?
• The target function f is known as a classification model

• Descriptive modeling: an explanatory tool to distinguish between objects of different classes (e.g., understand why people cheat on their taxes)

• Predictive modeling: predict the class of a previously unseen record
Examples of Classification Tasks
• Predicting tumor cells as benign or malignant

• Classifying credit card transactions as legitimate or fraudulent

• Categorizing news stories as finance, weather, entertainment, sports, etc.

• Identifying spam email, spam web pages, adult content
General approach to classification
• Training set consists of records with known class labels

• Training set is used to build a classification model

• A labeled test set of previously unseen data records is used to evaluate the quality of the model

• The classification model is applied to new records with unknown class labels
Illustrating Classification Task
Decision tree example
[Training data table with attribute types: categorical, categorical, continuous, class]
Splitting Attributes

Refund
Yes No
Test outcome
NO MarSt
Single, Divorced Married

TaxInc NO
< 80K > 80K

NO YES
Class labels
Training Data Model: Decision Tree
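
For illustration, the tree above can be written directly as a prediction function. The following is a minimal Python sketch (not part of the lecture); the field names Refund, MarSt, and TaxInc are taken from the tree, and TaxInc is assumed to be in thousands.

def predict_cheat(record):
    # Root node: test Refund
    if record["Refund"] == "Yes":
        return "No"                        # left leaf: NO
    # Refund = No: test Marital Status
    if record["MarSt"] == "Married":
        return "No"                        # Married leaf: NO
    # Single or Divorced: test Taxable Income
    return "Yes" if record["TaxInc"] > 80 else "No"

# Example: Refund = No, Married => predicted Cheat = "No"
print(predict_cheat({"Refund": "No", "MarSt": "Married", "TaxInc": 95}))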
Another Example of Decision Tree
MarSt
Married          Single, Divorced

NO Refund
Yes No

NO TaxInc
< 80K > 80K

NO YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
Apply Model to Test Data
Start from the root of the tree and follow, at each node, the branch whose test outcome matches the test record's attribute value, until a leaf is reached.

Refund
Yes No

NO MarSt
Single, Divorced Married

TaxInc NO
< 80K > 80K

NO YES

For the test record (Refund = No, MarSt = Married): Refund = No leads to the MarSt node, and MarSt = Married leads to the NO leaf.
Assign Cheat to “No”
Classification Techniques
• Decision Tree-based Methods
• Rule-based Methods
• Memory-based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Tree Induction
• Greedy strategy:
• Split the records based on an attribute test that
optimizes a certain criterion

• Many Algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ, SPRINT
General Structure of Hunt’s Algorithm
● Let Dt be the set of training records that reach a node t

● General Procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
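
The procedure translates almost line-for-line into code. Below is a minimal sketch (an illustration, not the lecture's own code), where records are (attributes, label) pairs and choose_test is a placeholder for whatever attribute-test selection criterion is used (e.g., the best Gini split).

from collections import Counter

def hunt(records, choose_test):
    labels = [y for _, y in records]
    # Case 1: all records at this node belong to the same class -> leaf
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Case 2: apply an attribute test and split into smaller subsets
    test = choose_test(records)            # returns a function of the attributes
    subsets = {}
    for x, y in records:
        subsets.setdefault(test(x), []).append((x, y))
    if len(subsets) == 1:                  # test cannot separate the records
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Recursively apply the procedure to each subset
    return {"test": test,
            "children": {v: hunt(s, choose_test) for v, s in subsets.items()}}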
Hunt’s Algorithm on the tax-cheat data (class counts shown as (Cheat = No, Cheat = Yes)):

Step 1: start with a single node containing all records: (7,3)
Step 2: split on Refund → Yes: (3,0), a pure leaf; No: (4,3)
Step 3: split the (4,3) node on Marital Status → Married: (3,0), a pure leaf; Single, Divorced: (1,3)
Step 4: split the (1,3) node on Taxable Income → < 80K: (1,0); ≥ 80K: (0,3); all leaves are now pure
Design Issues of Decision Tree Induction
● How should training records be split?
– Method for expressing the test condition (depending on attribute type)
– Measure for evaluating the goodness of a test condition

● How should the splitting procedure stop?
– Stop splitting if all the records belong to the same class or have identical attribute values
– Early termination
Methods for Expressing Test Conditions
●Depends on attribute types
– Nominal
– Ordinal
– Continuous

●Depends on number of ways to split


– 2-way split
– Multi-way split
Test Condition for Nominal Attributes
● Multi-way split:
– Use as many partitions as distinct values
Marital Status → Single / Divorced / Married

● Binary split:
– Divides values into two subsets
{Married} vs. {Single, Divorced}    OR    {Single} vs. {Married, Divorced}    OR    {Single, Married} vs. {Divorced}
Test Condition for Ordinal Attributes
● Multi-way split:
– Use as many partitions as distinct values
Shirt Size → Small / Medium / Large / Extra Large

● Binary split:
– Divides values into two subsets
– Must preserve the order property among attribute values, e.g. {Small, Medium} vs. {Large, Extra Large}

This grouping violates the order property:
{Small, Large} vs. {Medium, Extra Large}
Test Condition for Continuous Attributes
– Binary split: (A < v) or (A ≥ v) for some threshold v
– Multi-way split: partition A into disjoint ranges (vi ≤ A < vi+1)
How to determine the Best Split

Before splitting: 10 records of class 0, 10 records of class 1

Which test condition is the best?
How to determine the Best Split
●Greedy approach:
– Nodes with purer class distribution are preferred

●Need a measure of node impurity:

C0: 5, C1: 5          C0: 9, C1: 1
High degree of impurity      Low degree of impurity
Measures of Node Impurity
● Gini Index

● Entropy

● Misclassification error
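
For reference, the three measures can be written for a node t, where p(j | t) is the relative frequency of class j at t (these are the standard definitions):

$$\mathrm{Gini}(t) = 1 - \sum_{j} \big[p(j \mid t)\big]^2 \qquad \mathrm{Entropy}(t) = -\sum_{j} p(j \mid t)\,\log_2 p(j \mid t) \qquad \mathrm{Error}(t) = 1 - \max_{j}\, p(j \mid t)$$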
Finding the Best Split
1. Compute impurity measure (P) before splitting
2. Compute impurity measure (M) after splitting
● Compute impurity measure of each child node
● M is the weighted impurity of child nodes

3. Choose the attribute test condition that produces the highest gain

Gain = P – M

or equivalently, the lowest impurity measure after splitting (M)
Finding the Best Split
Before Splitting: P

A? B?
Yes No Yes No

Node N1 Node N2 Node N3 Node N4

M11 M12 M21 M22

M1 M2
Gain = P – M1 vs P – M2
Measure of Impurity: GINI
• Gini index for a given node t:

Gini(t) = 1 – Σj [ p(j | t) ]²

(where p(j | t) is the relative frequency of class j at node t)

C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
Computing Gini Index of a Single Node

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Gini = 1 – (1/6)² – (5/6)² = 0.278

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Gini = 1 – (2/6)² – (4/6)² = 0.444
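
These values are easy to verify with a few lines of Python (a sketch added for illustration, not part of the slides):

def gini(counts):
    # Gini index of a node from its per-class record counts
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))               # 0.0
print(round(gini([1, 5]), 3))     # 0.278
print(round(gini([2, 4]), 3))     # 0.444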
Computing Gini Index for a Collection of Nodes
When a node p with n records is split into k partitions (children), the quality of the split is the weighted average of the children's impurity:

GINIsplit = Σi (ni / n) · Gini(i), where ni is the number of records at child i
Binary Attributes: Computing GINI Index
Splits into two partitions (child nodes)
Effect of weighting partitions: larger and purer partitions are sought

B?
Yes No
Node N1 Node N2

Gini(N1) = 1 – (5/6)² – (1/6)² = 0.278
Gini(N2) = 1 – (2/6)² – (4/6)² = 0.444

Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361

Gain = 0.486 – 0.361 = 0.125   (0.486 is the parent node's Gini before splitting)
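
The same computation in code (a self-contained sketch; the parent counts (7, 5) are inferred by summing the two children):

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [7, 5]                      # class counts before splitting (N1 + N2)
children = [[5, 1], [2, 4]]          # N1 and N2 after testing B?
n = sum(parent)
m = sum(sum(c) / n * gini(c) for c in children)   # weighted impurity M
print(round(gini(parent), 3))        # 0.486
print(round(m, 3))                   # 0.361
print(round(gini(parent) - m, 3))    # Gain = 0.125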
Measure of Impurity: Entropy
• Entropy for a given node t:

Entropy(t) = – Σj p(j | t) log₂ p(j | t)

(where p(j | t) is the relative frequency of class j at node t)
Computing Entropy of a Single Node

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = – 0 log₂ 0 – 1 log₂ 1 = – 0 – 0 = 0

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Entropy = – (1/6) log₂ (1/6) – (5/6) log₂ (5/6) = 0.65

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Entropy = – (2/6) log₂ (2/6) – (4/6) log₂ (4/6) = 0.92
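
Again, these are easy to check with a short sketch (added for illustration; 0 log 0 is taken as 0):

from math import log2

def entropy(counts):
    # Entropy of a node from its per-class record counts
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))              # 0.0
print(round(entropy([1, 5]), 2))    # 0.65
print(round(entropy([2, 4]), 2))    # 0.92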
Computing Information Gain After Splitting

Gainsplit = Entropy(p) – Σi (ni / n) Entropy(i)

where parent node p with n records is split into partitions, and ni is the number of records in partition i.

Problem with a large number of partitions
● Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure
– Customer ID has the highest information gain because the entropy of all its children is zero, even though it is useless for classifying new records
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class

• Stop expanding a node when all the records have similar attribute values

• Early termination (to be discussed later)


Decision Tree Based Classification
● Advantages:
– Relatively inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are
employed)
– Can easily handle redundant attributes
– Can easily handle irrelevant attributes
Practical Issues of Classification
• Underfitting and Overfitting

• Evaluation
Underfitting and Overfitting
• Underfitting: the model is too simple to capture the structure of the data, so its predictions are inaccurate even on the training set
• Typical causes: the training data is too small, or the model needs more training
• Overfitting: an induced tree may overfit the training data
• Too many branches, some of which may reflect anomalies due to noise or outliers
• Poor accuracy for unseen samples
Model Evaluation
• Metrics for Performance Evaluation
• How to evaluate the performance of a model?

• Methods for Performance Evaluation
• How to obtain reliable estimates?

• Methods for Model Comparison
• How to compare the relative performance among competing models?
Metrics for Performance Evaluation
• Focus on the predictive capability of a model
• Rather than how fast it classifies or builds models, scalability, etc.
• Confusion Matrix:

PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL  Class=Yes        a           b
CLASS   Class=No         c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Metrics for Performance Evaluation…
PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL  Class=Yes     a (TP)      b (FN)
CLASS   Class=No      c (FP)      d (TN)

• Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Example of confusion matrix
              Predicted C1       Predicted C2
Actual C1     true positive      false negative
Actual C2     false positive     true negative

classes                buy_computer = yes   buy_computer = no   total    recognition (%)
buy_computer = yes     6954                 46                  7000     99.34
buy_computer = no      412                  2588                3000     86.27
total                  7366                 2634                10000    95.42

(overall recognition rate = accuracy = (6954 + 2588) / 10000 = 95.42%)
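
To reproduce the figures in this table (a sketch added for illustration):

# Confusion matrix counts: rows = actual class, columns = predicted class
tp, fn = 6954, 46      # actual buy_computer = yes
fp, tn = 412, 2588     # actual buy_computer = no

print(f"recognition (yes): {tp / (tp + fn):.2%}")                   # 99.34%
print(f"recognition (no):  {tn / (fp + tn):.2%}")                   # 86.27%
print(f"accuracy:          {(tp + tn) / (tp + fn + fp + tn):.2%}")  # 95.42%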
Precision-Recall
• Precision :
• Exactness, what % of tuples that the classifier
labeled as positive are actually positive.

• Recall (completeness)
• What % of positive tuples did the classifier label as
positive?
• Perfect score is 1.0

• F measure: harmonic mean of Precision and Recall


Precision-Recall
Counts:
PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL  Class=Yes        a           b
CLASS   Class=No         c           d

Precision (p) = a / (a + c) = TP / (TP + FP)

Recall (r) = a / (a + b) = TP / (TP + FN)

F-measure (F) = 2 / (1/r + 1/p) = 2rp / (r + p) = 2a / (2a + b + c) = 2TP / (2TP + FP + FN)

● Precision is biased towards C(Yes|Yes) & C(Yes|No)


● Recall is biased towards C(Yes|Yes) & C(No|Yes)
● F-measure is biased towards all except C(No|No)
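
In code (a sketch added for illustration; the counts below are hypothetical, chosen only to exercise the formulas):

def precision_recall_f(a, b, c):
    # a = TP, b = FN, c = FP, as in the matrix above
    p = a / (a + c)              # precision
    r = a / (a + b)              # recall
    f = 2 * r * p / (r + p)      # F-measure: harmonic mean of p and r
    return p, r, f

p, r, f = precision_recall_f(a=90, b=10, c=30)
print(round(p, 2), round(r, 2), round(f, 2))   # 0.75 0.9 0.82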
Methods of Estimation for a Model
• Holdout
• Reserve 2/3 for training and 1/3 for testing
• Random subsampling
• One sample may be biased -- Repeated holdout
• Cross validation
• Partition data into k disjoint subsets
• k-fold: train on k-1 partitions, test on the remaining one
• Leave-one-out: k=n
• Guarantees that each record is used the same number
of times for training and testing
• Bootstrap
• Sampling with replacement
• On average ~63.2% of the records appear in the training sample; the remaining ~36.8% are used for testing
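
A minimal sketch of k-fold cross-validation (illustrative only; train_fn stands in for any procedure that fits a classifier and returns a predict function, an assumption rather than something defined in the lecture):

import random

def k_fold_accuracy(records, labels, train_fn, k=10, seed=0):
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k disjoint partitions
    correct = 0
    for i in range(k):
        test = folds[i]                        # hold out one partition
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn([records[j] for j in train], [labels[j] for j in train])
        correct += sum(model(records[j]) == labels[j] for j in test)
    return correct / len(records)              # every record is tested exactly once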
ANY QUESTIONS?
