
UET (Since 2004)
ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN
VNU-University of Engineering and Technology
INT3209 - DATA MINING


Week 5: Classification
Model Improvements
Duc-Trong Le

Slide credit: Vipin Kumar et al.
https://www-users.cse.umn.edu/~kumar001/dmbook

Hanoi, 09/2021
Outline

● Class Imbalance
● Model Underfitting, Overfitting
● Model Selection
● Model Evaluation
Class Imbalance Problem

● Lots of classification problems where the classes are skewed (more records from one class than another)
– Credit card fraud
– Intrusion detection
– Defective products in manufacturing assembly line
– COVID-19 test results on a random sample

● Key Challenge:
– Evaluation measures such as accuracy are not well-suited for imbalanced classes
Accuracy

                          PREDICTED CLASS
                          Class=Yes      Class=No
ACTUAL      Class=Yes     a (TP)         b (FN)
CLASS       Class=No      c (FP)         d (TN)

● Most widely-used metric:
   Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Problem with Accuracy
● Consider a 2-class problem
– Number of Class NO examples = 990
– Number of Class YES examples = 10
● If a model predicts everything to be class NO, accuracy
is 990/1000 = 99 %
– This is misleading because this trivial model does not detect any class
YES example
– Detecting the rare class is usually more interesting (e.g., frauds,
intrusions, defects, etc)

PREDICTED CLASS
Class=Yes Class=No
ACTUAL Class=Yes 0 10
CLASS Class=No 0 990
Which model is better?

PREDICTED
Class=Yes Class=No
A ACTUAL Class=Yes 0 10
Class=No 0 990

Accuracy: 99%

PREDICTED
B Class=Yes Class=No
ACTUAL Class=Yes 10 0
Class=No 500 490

Accuracy: 50%
Alternative Measures

                          PREDICTED CLASS
                          Class=Yes      Class=No
ACTUAL      Class=Yes     a (TP)         b (FN)
CLASS       Class=No      c (FP)         d (TN)

Precision p = a / (a + c)
Recall r = a / (a + b)
F-measure F = 2rp / (r + p) = 2a / (2a + b + c)
Alternative Measures

                          PREDICTED CLASS
                          Class=Yes      Class=No
ACTUAL      Class=Yes     10             0
CLASS       Class=No      10             980

Precision = 10/20 = 0.5,  Recall = 10/10 = 1.0,  F-measure = 0.667,  Accuracy = 0.99
Alternative Measures

                          PREDICTED CLASS
                          Class=Yes      Class=No
ACTUAL      Class=Yes     10             0
CLASS       Class=No      10             980

Precision = 0.5,  Recall = 1.0,  F-measure = 0.667,  Accuracy = 0.99

                          PREDICTED CLASS
                          Class=Yes      Class=No
ACTUAL      Class=Yes     1              9
CLASS       Class=No      0              990

Precision = 1/1 = 1.0,  Recall = 1/10 = 0.1,  F-measure = 0.182,  Accuracy = 0.991
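The precision, recall, and F-measure values above were shown as figures in the original slides; the short Python sketch below (not part of the deck, plain Python only) recomputes them from the confusion-matrix counts.

```python
# Accuracy, precision, recall and F1 from confusion-matrix counts
# (counts taken from the two matrices above).
def summarize(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(summarize(tp=10, fn=0, fp=10, tn=980))  # (0.99, 0.5, 1.0, 0.667)
print(summarize(tp=1, fn=9, fp=0, tn=990))    # (0.991, 1.0, 0.1, 0.182)
```

Note that the second classifier has the slightly higher accuracy (0.991 vs 0.99) yet misses 9 of the 10 positive instances, which is exactly why precision, recall, and F-measure are reported for imbalanced problems.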
Measures of Classification Performance

                          PREDICTED CLASS
                          Yes            No
ACTUAL      Yes           TP             FN
CLASS       No            FP             TN

TPR = TP/(TP+FN)    FNR = FN/(TP+FN)    FPR = FP/(FP+TN)    TNR = TN/(FP+TN)

● α is the probability that we reject the null hypothesis when it is true.
  This is a Type I error or a false positive (FP).

● β is the probability that we accept the null hypothesis when it is false.
  This is a Type II error or a false negative (FN).
Alternative Measures

A PREDICTED CLASS
Class=Yes Class=No
ACTUAL Class=Yes 40 10
CLASS
Class=No 10 40

B PREDICTED CLASS
Class=Yes Class=No
ACTUAL Class=Yes 40 10
CLASS
Class=No 1000 4000
Which of these classifiers is better?

A PREDICTED CLASS
Class=Yes Class=No

ACTUAL Class=Yes 10 40
CLASS
Class=No 10 40

B PREDICTED CLASS
Class=Yes Class=No

ACTUAL Class=Yes 25 25
CLASS Class=No 25 25

C PREDICTED CLASS
Class=Yes Class=No

ACTUAL Class=Yes 40 10
CLASS
Class=No 40 10
ROC (Receiver Operating Characteristic)

● A graphical approach for displaying trade-off between detection rate and false alarm rate
● Developed in 1950s for signal detection theory
to analyze noisy signals
● ROC curve plots TPR against FPR
– Performance of a model represented as a point in an
ROC curve
ROC Curve

(TPR,FPR):
● (0,0): declare everything
to be negative class
● (1,1): declare everything
to be positive class
● (1,0): ideal

● Diagonal line:
– Random guessing
– Below diagonal line:
◆ prediction is opposite
of the true class
ROC (Receiver Operating Characteristic)

● To draw ROC curve, classifier must produce continuous-valued output
– Outputs are used to rank test records, from the most likely
positive class record to the least likely positive class record
– By using different thresholds on this value, we can create
different variations of the classifier with TPR/FPR tradeoffs
● Many classifiers produce only discrete outputs (i.e.,
predicted class)
– How to get continuous-valued outputs?
◆ Decision trees, rule-based classifiers, neural networks,
Bayesian classifiers, k-nearest neighbors, SVM
Example: Decision Trees

[Figure: a decision tree with continuous-valued outputs, e.g., Gini scores]
ROC Curve Example
- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive

At threshold t:
TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88
How to Construct an ROC curve

● Use a classifier that produces a continuous-valued score for each instance
  • The more likely it is for the instance to be in the + class, the higher the score
● Sort the instances in decreasing order according to the score
● Apply a threshold at each unique value of the score
● Count the number of TP, FP, TN, FN at each threshold
  • TPR = TP/(TP+FN)
  • FPR = FP/(FP + TN)

Instance    Score    True Class
1           0.95     +
2           0.93     +
3           0.87     -
4           0.85     -
5           0.85     -
6           0.85     +
7           0.76     -
8           0.53     +
9           0.43     -
10          0.25     +
How to construct an ROC curve

[Table: TP, FP, TN, FN counts and the resulting TPR/FPR at each threshold ("Threshold >="), together with the resulting ROC curve]
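The threshold table and the resulting curve were figures in the original slides. The sketch below is a minimal Python reconstruction of the same threshold sweep over the ten scored instances listed above; the scikit-learn cross-check in the trailing comments is optional and assumed, not part of the deck.

```python
import numpy as np

# Scores and true labels of the 10 instances from the slide ('+' -> 1, '-' -> 0)
scores = np.array([0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25])
labels = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 1])

points = []
# Apply a threshold at each unique score (plus +inf, where nothing is predicted positive)
for t in np.append(np.unique(scores), np.inf):
    pred = (scores >= t).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    tn = np.sum((pred == 0) & (labels == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)

print(sorted(points))  # ROC points running from (0, 0) to (1, 1)

# Optional cross-check (assumed available):
# from sklearn.metrics import roc_curve, roc_auc_score
# fpr, tpr, thresholds = roc_curve(labels, scores)
# print(roc_auc_score(labels, scores))
```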
Using ROC for Model Comparison

● No model consistently outperforms the other
  – M1 is better for small FPR
  – M2 is better for large FPR

● Area Under the ROC curve (AUC)
  – Ideal: Area = 1
  – Random guess: Area = 0.5
Dealing with Imbalanced Classes - Summary

● Many measures exist, but none of them may be ideal in all situations
  – Random classifiers can have high values for many of these measures
  – TPR/FPR provides important information but may not be sufficient by itself in many practical scenarios
– Given two classifiers, sometimes you can tell that one of them is
strictly better than the other
◆ C1 is strictly better than C2 if C1 has strictly better TPR and FPR relative to C2 (or
same TPR and better FPR, and vice versa)

– Even if C1 is strictly better than C2, C1’s F-value can be worse than
C2’s if they are evaluated on data sets with different imbalances
– Classifier C1 can be better or worse than C2 depending on the
scenario at hand (class imbalance, importance of TP vs FP, cost/time
trade-offs)
Which Classifier is better?

T1 PREDICTED CLASS
Class=Yes Class=No

Class=Yes 50 50
ACTUAL
CLASS Class=No 1 99

T2 PREDICTED CLASS
Class=Yes Class=No

Class=Yes 99 1
ACTUAL
Class=No 10 90
CLASS

T3 PREDICTED CLASS
Class=Yes Class=No

Class=Yes 99 1
ACTUAL
CLASS Class=No 1 99
Which Classifier is better? Medium Skew case

T1 PREDICTED CLASS
Class=Yes Class=No

Class=Yes 50 50
ACTUAL
CLASS Class=No 10 990

T2 PREDICTED CLASS
Class=Yes Class=No

Class=Yes 99 1
ACTUAL
Class=No 100 900
CLASS

T3 PREDICTED CLASS
Class=Yes Class=No

Class=Yes 99 1
ACTUAL
CLASS Class=No 10 990
Which Classifier is better? High Skew case

T1 PREDICTED CLASS
Class=Yes Class=No

Class=Yes 50 50
ACTUAL
CLASS Class=No 100 9900

T2 PREDICTED CLASS
Class=Yes Class=No

Class=Yes 99 1
ACTUAL
Class=No 1000 9000
CLASS

T3 PREDICTED CLASS
Class=Yes Class=No

Class=Yes 99 1
ACTUAL
CLASS Class=No 100 9900
Improve Classifiers with Imbalanced Training Set

● Modify the distribution of training data so that the rare class is well-represented in the training set
  – Undersample the majority class
  – Oversample the rare class
  (a minimal sketch of both strategies follows below)
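The sketch below illustrates both strategies using scikit-learn's `sklearn.utils.resample` on synthetic data (the data and sizes are assumptions for illustration); dedicated packages such as imbalanced-learn wrap the same idea behind a single call.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(1000, 2)
y = np.array([1] * 10 + [0] * 990)      # 10 rare-class records, 990 majority records

X_rare, X_maj = X[y == 1], X[y == 0]

# Oversample the rare class: sample with replacement up to the majority size
X_rare_up = resample(X_rare, replace=True, n_samples=len(X_maj), random_state=0)

# Undersample the majority class: sample without replacement down to the rare size
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_rare), random_state=0)

# e.g., an oversampled, balanced training set
X_bal = np.vstack([X_rare_up, X_maj])
y_bal = np.array([1] * len(X_rare_up) + [0] * len(X_maj))
```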
Classification Errors

● Training errors: Errors committed on the training set
● Test errors: Errors committed on the test set
● Generalization errors: Expected error of a model over random selection of records from the same distribution
Example Dataset

Two-class problem:
  + : 5400 instances
    • 5000 instances generated from a Gaussian centered at (10,10)
    • 400 noisy instances added
  o : 5400 instances
    • Generated from a uniform distribution

10% of the data used for training and 90% of the data used for testing
Increasing number of nodes in Decision Trees

Decision Tree with 4 nodes
[Figure: the 4-node decision tree and its decision boundaries on the training data]

Decision Tree with 50 nodes
[Figure: the 50-node decision tree and its decision boundaries on the training data]

Which tree is better?
Model Underfitting and Overfitting

• As the model becomes more and more complex, test errors can start increasing even though training error may be decreasing

• Underfitting: when the model is too simple, both training and test errors are large
• Overfitting: when the model is too complex, training error is small but test error is large
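A minimal sketch that reproduces this behaviour on synthetic data loosely modelled on the example dataset above; the Gaussian spread, the leaf-count grid, and the use of `max_leaf_nodes` as the complexity knob are assumptions, so exact numbers will differ from the slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
# '+' class: Gaussian centered at (10,10) plus uniform noise; 'o' class: uniform
X_pos = np.vstack([rng.normal(10, 1, size=(5000, 2)), rng.uniform(0, 20, size=(400, 2))])
X_neg = rng.uniform(0, 20, size=(5400, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * len(X_pos) + [0] * len(X_neg))

# 10% of the data for training, 90% for testing, as in the slides
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.1, random_state=1)

for n_leaves in [4, 8, 16, 50, 150, 500]:
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=1).fit(X_tr, y_tr)
    train_err = 1 - tree.score(X_tr, y_tr)   # keeps decreasing with complexity
    test_err = 1 - tree.score(X_te, y_te)    # eventually starts increasing
    print(n_leaves, round(train_err, 3), round(test_err, 3))
```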
Model Overfitting – Impact of Training Data Size

Using twice the number of data instances

• Increasing the size of training data reduces the difference between training and
testing errors at a given size of model
Model Overfitting – Impact of Training Data Size

Decision Tree with 50 nodes
[Figures: decision boundaries of a 50-node tree trained on the original data vs. on twice the number of data instances]

• Increasing the size of training data reduces the difference between training and testing errors at a given size of model
Reasons for Model Overfitting

● Not enough training data

● High model complexity


– Multiple Comparison Procedure
Effect of Multiple Comparison Procedure

● Consider the task of predicting whether the stock market will rise/fall in the next 10 trading days

● Random guessing: P(correct) = 0.5

● Make 10 random guesses in a row:

  Day 1    Up
  Day 2    Down
  Day 3    Down
  Day 4    Up
  Day 5    Down
  Day 6    Down
  Day 7    Up
  Day 8    Up
  Day 9    Up
  Day 10   Down
Effect of Multiple Comparison Procedure

● Approach:
  – Get 50 analysts
  – Each analyst makes 10 random guesses
  – Choose the analyst that makes the largest number of correct predictions

● Probability that at least one analyst makes at least 8 correct predictions:
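The closing probability appeared as a formula in the original slide; the arithmetic is just binomial counting and can be reproduced in a few lines of Python.

```python
from math import comb

# P(a single analyst gets at least 8 of 10 random guesses right)
p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2**10   # 56/1024 ~= 0.0547

# P(at least one of 50 independent analysts does so)
p_any = 1 - (1 - p_one) ** 50                          # ~= 0.94
print(p_one, p_any)
```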
Effect of Multiple Comparison Procedure

● Many algorithms employ the following greedy strategy:


– Initial model: M
– Alternative model: M’ = M ∪ γ,
where γ is a component to be added to the model
(e.g., a test condition of a decision tree)
– Keep M’ if improvement, Δ(M,M’) > α

● Oftentimes, γ is chosen from a set of alternative components, Γ = {γ1, γ2, …, γk}

● If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
Effect of Multiple Comparison - Example

● Use additional 100 noisy variables generated from a uniform distribution, along with X and Y as attributes
● Use 30% of the data for training and 70% of the data for testing

[Figure: comparison with using only X and Y as attributes]
Notes on Overfitting

● Overfitting results in decision trees that are more complex than necessary

● Training error does not provide a good estimate of how well the tree will perform on previously unseen records

● Need ways for estimating generalization errors
Model Selection

● Performed during model building


● Purpose is to ensure that model is not overly
complex (to avoid overfitting)
● Need to estimate generalization error
– Using Validation Set
– Incorporating Model Complexity
Model Selection:
Using Validation Set
● Divide training data into two parts:
– Training set:
◆ use for model building
– Validation set:
◆ use for estimating generalization error
◆ Note: validation set is not the same as test set

● Drawback:
– Less data available for training
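A minimal scikit-learn sketch of validation-set model selection; the dataset, the candidate depths, and the split proportions are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Split the training data again: one part for model building, one for validation
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_depth, best_val_acc = None, -1.0
for depth in [2, 4, 6, 8, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    val_acc = model.score(X_val, y_val)       # estimate of generalization performance
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# The untouched test set is used only once, for the final evaluation
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print(best_depth, final.score(X_test, y_test))
```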
Model Selection:
Incorporating Model Complexity
● Rationale: Occam’s Razor
– Given two models of similar generalization errors,
one should prefer the simpler model over the more
complex model

– A complex model has a greater chance of being fitted accidentally (e.g., by errors in the data)

– Therefore, one should include model complexity when evaluating a model

  Gen. Error(Model) = Train. Error(Model, Train. Data) + α × Complexity(Model)
Estimating the Complexity of Decision Trees

● Pessimistic Error Estimate of decision tree T with k leaf nodes:

  egen(T) = err(T) + Ω × k / Ntrain

  – err(T): error rate on all training records
  – Ω: trade-off hyper-parameter (similar to α): relative cost of adding a leaf node
  – k: number of leaf nodes
  – Ntrain: total number of training records
Estimating the Complexity of Decision Trees: Example

e(TL) = 4/24
e(TR) = 6/24
Ω = 1
(TL has k = 7 leaf nodes; TR has k = 4 leaf nodes)

egen(TL) = 4/24 + 1 × 7/24 = 11/24 = 0.458
egen(TR) = 6/24 + 1 × 4/24 = 10/24 = 0.417
⇒ TR has the lower pessimistic estimate
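A tiny sketch that re-checks this arithmetic (the leaf counts 7 and 4 are taken from the computation above; the helper function is ours, not from the deck).

```python
def pessimistic_error(n_errors, n_leaves, n_train, omega=1.0):
    """egen(T) = (training errors + Omega * number of leaves) / N_train."""
    return (n_errors + omega * n_leaves) / n_train

print(pessimistic_error(n_errors=4, n_leaves=7, n_train=24))  # TL: 11/24 ~= 0.458
print(pessimistic_error(n_errors=6, n_leaves=4, n_train=24))  # TR: 10/24 ~= 0.417
```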


Estimating the Complexity of Decision Trees

● Resubstitution Estimate:
– Using training error as an optimistic estimate of
generalization error
– Referred to as optimistic error estimate
e(TL) = 4/24

e(TR) = 6/24
Minimum Description Length (MDL)

● Cost(Model, Data) = Cost(Data|Model) + α × Cost(Model)
  – Cost is the number of bits needed for encoding.
  – Search for the least costly model.
● Cost(Data|Model) encodes the misclassification errors.
● Cost(Model) uses node encoding (number of children) plus splitting condition encoding.
Model Selection for Decision Trees

● Pre-Pruning (Early Stopping Rule)


– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
◆ Stop if all instances belong to the same class
◆ Stop if all the attribute values are the same
– More restrictive conditions:
  ◆ Stop if the number of instances is less than some user-specified threshold
  ◆ Stop if the class distribution of instances is independent of the available features (e.g., using the χ2 test)
  ◆ Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
  ◆ Stop if the estimated generalization error falls below a certain threshold
  (several of these conditions map onto common hyper-parameters; see the sketch below)
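A minimal sketch of how such stopping conditions surface as pre-pruning hyper-parameters in scikit-learn's `DecisionTreeClassifier`; the specific values are illustrative only, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning via early-stopping hyper-parameters (illustrative values)
tree = DecisionTreeClassifier(
    min_samples_split=20,         # stop if a node holds fewer than 20 instances
    min_samples_leaf=5,           # every leaf must keep at least 5 instances
    min_impurity_decrease=1e-3,   # stop if a split barely improves impurity
    max_depth=10,                 # hard cap on tree depth
)
# tree.fit(X_train, y_train) would then grow the pre-pruned tree
```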
Model Selection for Decision Trees

● Post-pruning
– Grow decision tree to its entirety
– Subtree replacement
◆ Trim the nodes of the decision tree in a
bottom-up fashion
◆ If generalization error improves after trimming,
replace sub-tree by a leaf node
◆ Class label of leaf node is determined from
majority class of instances in the sub-tree
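scikit-learn does not expose this subtree-replacement procedure directly; its built-in post-pruning is cost-complexity pruning, sketched below as an approximation (the dataset and the validation split are assumed for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# 1) Grow the decision tree to its entirety
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 2) Candidate pruning strengths from the cost-complexity pruning path
ccp_alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
candidates = [a for a in ccp_alphas if a >= 0]   # guard against tiny negative values

# 3) Keep the pruned tree that generalizes best on held-out (validation) data
best_alpha = max(
    candidates,
    key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
                  .fit(X_train, y_train).score(X_val, y_val),
)
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
print(full_tree.get_n_leaves(), "->", pruned.get_n_leaves())
```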
Example of Post-Pruning

Before splitting:  Class = Yes: 20,  Class = No: 10
  Training error (before splitting) = 10/30
  Pessimistic error = (10 + 0.5)/30 = 10.5/30

After splitting into 4 child nodes:
  Node 1: Yes = 8, No = 4      Node 2: Yes = 3, No = 4
  Node 3: Yes = 4, No = 1      Node 4: Yes = 5, No = 1
  Training error (after splitting) = 9/30
  Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30

PRUNE! (the pessimistic error increases after splitting)
Examples of Post-pruning
Model Evaluation

● Purpose:
– To estimate performance of classifier on previously
unseen data (test set)
● Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
● Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
Cross-validation Example

● 3-fold cross-validation
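The 3-fold illustration was a figure in the original slides; a minimal scikit-learn sketch of the same procedure is shown below (the data and model are assumed for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)

# 3-fold cross-validation: train on 2 partitions, test on the remaining one
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=3)
print(scores, scores.mean())
```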
Variations on Cross-validation

● Repeated cross-validation
– Perform cross-validation a number of times
– Gives an estimate of the variance of the
generalization error
● Stratified cross-validation
– Guarantee the same percentage of class
labels in training and test
– Important when classes are imbalanced and
the sample is small
● Use nested cross-validation approach for model
selection and evaluation
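A minimal sketch of the nested approach, with stratified folds and an inner grid search for model selection; the dataset, class weights, and parameter grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data (90% / 10%) to motivate stratified folds
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

# Inner loop: model selection (choose max_depth by grid search)
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"max_depth": [2, 4, 6, None]},
                     cv=StratifiedKFold(n_splits=3))

# Outer loop: model evaluation on folds never used for selection
outer_scores = cross_val_score(inner, X, y, cv=StratifiedKFold(n_splits=5))
print(outer_scores.mean())
```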
Summary

● Class Imbalance
● Model Underfitting, Overfitting
● Model Selection
● Model Evaluation
