CH-5 ML

Chapter Five discusses model evaluation in machine learning, emphasizing the importance of defining performance measures and the need for generalization error approximation. It covers various performance metrics for binary and multiclass classification, including accuracy, precision, recall, and F-measure, as well as techniques like k-fold cross-validation for robust evaluation. The chapter also highlights the significance of using test data for assessing model performance and the role of hypothesis testing in comparing learning algorithms.


Chapter Five

Model Evaluation
Basic Concepts

• Evaluation requires defining the performance measures to be
  optimized
• The performance of a learning algorithm cannot be evaluated on
  the entire domain (generalization error), so an approximation is
  needed
• Performance evaluation is needed for:
  • Tuning hyperparameters of the learning method (e.g. type of
    kernel and its parameters, learning rate of the perceptron)
  • Evaluating the quality of the learned predictor
  • Computing the statistical significance of the difference
    between learning algorithms
Stages of (Batch) Machine Learning

Given: labeled training data X, Y

• Assumes each training instance is drawn independently from the
  same underlying distribution

Train the model:

  model ← classifier.train(X, Y)

Apply the model to new data:

• Given: a new unlabeled instance, the learned model predicts its
  label:

  prediction ← model.predict(x)
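A minimal sketch of this train/apply workflow, assuming scikit-learn; the estimator and the synthetic data are illustrative, not part of the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Labeled data (X: features, Y: labels); synthetic here for illustration
X, Y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_new, Y_train, Y_new = train_test_split(X, Y, test_size=0.3, random_state=0)

# Train the model: model <- classifier.train(X, Y)
model = LogisticRegression().fit(X_train, Y_train)

# Apply the model to new (unlabeled) data: prediction <- model.predict(x)
predictions = model.predict(X_new)
```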
Performance measures

Training Loss and performance measures


• The training loss function measures the cost paid for predicting
  f(x) when the true output is y
• It is designed to make the learning algorithm effective and
  efficient to optimize (e.g. hinge loss for SVM)
• It is not necessarily the best measure of final performance
  • E.g. the misclassification cost is never used as a training
    loss because it is piecewise constant (not amenable to
    gradient descent)
• Multiple performance measures can be used to evaluate different
  aspects of a learner
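As a hedged illustration of why surrogate losses are used for training (not taken from the slides): the 0-1 misclassification loss is flat almost everywhere, while the hinge loss provides a usable (sub)gradient as a function of the margin.

```python
import numpy as np

def zero_one_loss(margin):
    # 1 if the example is misclassified (margin <= 0), else 0; piecewise constant
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    # max(0, 1 - margin): convex surrogate used by SVMs, amenable to gradient methods
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-2, 2, 9)       # margin = y * f(x)
print(zero_one_loss(margins))         # jumps from 1 to 0 at margin = 0
print(hinge_loss(margins))            # decreases linearly until margin = 1
```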
Performance measures

Binary classification

• The confusion matrix reports true (on rows) and predicted
  (on columns) labels
• Each entry contains the number of examples with the true label of
  its row that were predicted with the label of its column:
  • tp True positives: positives predicted as positives
  • tn True negatives: negatives predicted as negatives
  • fp False positives: negatives predicted as positives
  • fn False negatives: positives predicted as negatives
Binary classification/Classification metrics

Accuracy

• Accuracy is the fraction of correctly labelled examples among
  all predictions:

  Acc = (TP + TN) / (TP + TN + FP + FN)

• It is one minus the misclassification rate
Confusion Matrix

• Given a dataset of P positive instances and N negative instances:

                       Predicted Class
                         Yes      No
  Actual Class    Yes    TP       FN
                  No     FP       TN

  accuracy = (TP + TN) / (P + N)

• Imagine using the classifier to identify positive cases (i.e., for
  information retrieval):

  precision = TP / (TP + FP)        recall = TP / (TP + FN)

  Precision: probability that a randomly selected result is relevant.
  Recall: probability that a randomly selected relevant document is
  retrieved.
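A minimal sketch of computing these metrics from a confusion matrix, assuming scikit-learn; the labels and predictions below are illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # illustrative true labels
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])   # illustrative predictions

# For binary problems, ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy :", (tp + tn) / (tp + tn + fp + fn), accuracy_score(y_true, y_pred))
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall   :", tp / (tp + fn), recall_score(y_true, y_pred))
```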
Binary classification
Problems with accuracy
• For strongly unbalanced datasets (typically many more negatives
  than positives) accuracy is not informative:
  • Predictions are dominated by the larger class
  • Predicting everything as negative often maximizes accuracy
• One possibility is to rebalance costs (e.g. a single positive
  counts as N/P, where N = TN + FP and P = TP + FN)
Precision (recap)
• It is the fraction of true positives among the examples predicted
  as positive
• It measures the precision of the learner when predicting positives

Recall or Sensitivity (recap)
• It is the fraction of positive examples that are predicted as
  positive
• It measures the coverage of the learner in returning positive
  examples
Binary Classification

F-measure

  Fβ = ((1 + β²)(Pre · Rec)) / (β² Pre + Rec)

Precision and recall are complementary:
increasing precision typically reduces recall.
The F-measure combines the two measures, balancing
the two aspects.
β is a parameter trading off precision and recall.
F1

  F1 = 2 (Pre · Rec) / (Pre + Rec)

It is the F-measure for β = 1.
It is the harmonic mean of precision and recall.
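A minimal sketch of the Fβ computation (hand-rolled for illustration; scikit-learn's fbeta_score computes the same quantity from labels and predictions):

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure trading off precision and recall via beta."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.8, 0.5))             # F1: harmonic mean of 0.8 and 0.5
print(f_beta(0.8, 0.5, beta=2.0))   # F2: weights recall more heavily
```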
Binary Classification

Precision-recall curve
Classifiers often provide a confidence in the
prediction (e.g. the margin of an SVM).
A hard decision is made by setting a threshold
on the classifier output (zero for SVM).
Acc, Pre, Rec, F1 all measure the performance of a
classifier at a specific threshold.
It is possible to change the threshold when
interested in maximizing a specific
measure (e.g. recall).
Binary Classification

Precision-recall curve
By varying the threshold from its minimum to its maximum
possible value, we obtain a curve of performance
measures.
This curve can be shown by plotting one measure
(recall) against the complementary one
(precision).
It is possible to investigate the performance of the
learner over the full range of precision/recall trade-offs.
Binary Classification

Area under the Pre-Rec curve

A single aggregate value can be obtained by taking
the area under the curve.
It combines the performance of the
algorithm over all possible thresholds
(without expressing a preference for any of them).
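A minimal sketch of building the precision-recall curve and its area from classifier scores, assuming scikit-learn; the data and the linear SVM are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_curve, auc

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LinearSVC().fit(X_tr, y_tr)
scores = clf.decision_function(X_te)     # confidence in the prediction (SVM margin)

# One (precision, recall) point per threshold on the scores
precision, recall, thresholds = precision_recall_curve(y_te, scores)
print("area under Pre-Rec curve:", auc(recall, precision))
```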
Performance measures

Multiclass classification

  T\P   y1    y2    y3
  y1    n11   n12   n13
  y2    n21   n22   n23
  y3    n31   n32   n33

The confusion matrix is the generalized version of the binary one.
nij is the number of examples with true class yi
predicted as yj. The main diagonal contains the true
positives for each class.
The sum of off-diagonal elements along a
column is the number of false positives for the
column label.
The sum of off-diagonal elements along a
row is the number of false negatives for the
row label.
Performance measures

Multiclass classification
Acc, Pre, Rec, F1 carry over to per-class
measures by treating examples
from the other classes as negatives.
E.g.:

  Prei = nii / (nii + FPi)        Reci = nii / (nii + FNi)

Multiclass accuracy is the overall fraction of
correctly classified examples:

  MAcc = Σi nii / Σi Σj nij
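A minimal sketch of deriving per-class precision and recall from a multiclass confusion matrix, hand-rolled with NumPy for illustration (scikit-learn's precision_score/recall_score with average=None give the same per-class values); the matrix below is made up:

```python
import numpy as np

# Illustrative 3-class confusion matrix: rows = true class, columns = predicted class
cm = np.array([[50,  3,  2],
               [ 4, 40,  6],
               [ 1,  5, 44]])

tp = np.diag(cm)                 # n_ii: correct predictions per class
fp = cm.sum(axis=0) - tp         # off-diagonal column sums: false positives per class
fn = cm.sum(axis=1) - tp         # off-diagonal row sums: false negatives per class

precision_per_class = tp / (tp + fp)
recall_per_class = tp / (tp + fn)
multiclass_accuracy = tp.sum() / cm.sum()

print(precision_per_class, recall_per_class, multiclass_accuracy)
```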
Performance
measures
Training Data and Test Data

• Training data: data used to build the model


• Test data: new data, not used in the training process

• Training performance is often a poor indicator of
  generalization performance
  • Generalization is what we really care about in ML
  • It is easy to overfit the training data
  • Performance on test data is a good indicator of
    generalization performance
  • i.e., test accuracy is more important than training accuracy
Simple Decision Boundary

[Figure: two-class data in a two-dimensional feature space (Feature 1
vs. Feature 2). A simple decision boundary separates the space into
Decision Region 1 and Decision Region 2.]
More Complex Decision Boundary

[Figure: the same two-class data in a two-dimensional feature space
(Feature 1 vs. Feature 2), now separated into Decision Region 1 and
Decision Region 2 by a more complex decision boundary.]
Example: The Overfitting Phenomenon

A Complex Model

  Y = high-order polynomial in X

The True (simpler) Model

  Y = aX + b + noise
How Overfitting Affects Prediction

[Figure: predictive error versus model complexity. The error on the
training data keeps decreasing as complexity grows, while the error on
the test data decreases and then rises again. Low complexity corresponds
to underfitting, high complexity to overfitting, and the ideal range
for model complexity lies in between.]
Comparing Classifiers

Say we have two classifiers, C1 and C2, and want to
choose the best one to use for future predictions.

Can we use training accuracy to choose between them?
• No!
  – e.g., C1 = pruned decision tree, C2 = 1-NN
    training_accuracy(1-NN) = 100%, but it may not be the
    best on new data

Instead, choose based on test accuracy...
Training and Test Data

[Figure: the full data set is split into training data and test data.]

Idea: train each model on the "training data"...
...and then test each model's accuracy on the test data.
Performance estimation

Hold-out procedure
Computing a performance measure on the training set
would be optimistically biased.
We need to retain an independent set on which to
compute performance:
  validation set: when used to estimate the performance of
    different algorithmic settings (i.e. hyperparameters)
  test set: when used to estimate the final performance of the
    selected model
E.g.: split the dataset into 40%/30%/30% for training,
validation and testing.
Problem
The hold-out procedure depends on the specific test
(and validation) set chosen (especially for small datasets).
K-Fold Cross-Validation

• Why just choose one particular "split" of the data?
  – In principle, we should do this multiple times since
    performance may be different for each split

• k-Fold Cross-Validation (e.g., k = 10)
  – Randomly partition the full data set of n instances into
    k disjoint subsets (each roughly of size n/k)
  – Choose each fold in turn as the test set; train the model
    on the other folds and evaluate
  – Compute statistics over the k test performances, or
    choose the best of the k models
  – Can also do "leave-one-out CV", where k = n
Cont...

k-fold cross validation

Split D into k equal-sized disjoint subsets Di.
For i ∈ [1, k]:
  train the predictor on Ti = D \ Di
  compute the score S of the predictor L(Ti)
  on the test set Di:

    Si = S_Di[L(Ti)]

Return the average score across folds.
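A minimal runnable sketch of this procedure, assuming scikit-learn's KFold; the classifier, score and data are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=150, random_state=0)   # illustrative data set D

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train on T_i = D \ D_i, score on the held-out fold D_i
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("per-fold scores:", scores)
print("average score  :", np.mean(scores))
```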


Example of 3-fold CV

[Figure: the full data set is partitioned k times; in each partition a
different fold serves as the test data and the remaining folds as the
training data. Each partition yields a test performance, and summary
statistics are computed over the k test performances.]
Optimizing Model Parameters

Can also use CV to choose the value of a model parameter P
• Search over the space of parameter values p ∈ values(P)
  – Evaluate the model with P = p on the validation set
• Choose the value p' with the highest validation performance
• Learn the model on the full training set with P = p'

[Figure: the training data is partitioned k times; in each partition a
different fold serves as the validation set and the remaining folds as
the training data, yielding an optimal parameter value p1, p2, ..., pk
per partition. The test data is kept aside. Choose the value of p of
the model with the best validation performance.]
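A minimal sketch of choosing a hyperparameter by validation performance, assuming scikit-learn; here GridSearchCV's internal cross-validation plays the role of the validation splits, and the SVC parameter grid is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Search over values of the parameter C; each value is scored by k-fold CV
# on the training data (which acts as the train/validation splits)
search = GridSearchCV(SVC(kernel="rbf"), param_grid={"C": [0.1, 1, 10, 100]}, cv=5)
search.fit(X_train, y_train)

print("best C (p'):", search.best_params_)
# The best model is refit on the full training set; evaluate it once on the test set
print("test accuracy:", search.score(X_test, y_test))
```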
More on Cross-Validation (CV)

• Cross-validation generates an approximate estimate
  of how well the classifier will do on "unseen" data
  – As k → n, the model becomes more accurate
    (more training data)
  – ...but CV becomes more computationally expensive
  – Choosing k < n is a compromise

• Averaging over different partitions is more robust
  than just a single train/validate partition of the data

• It is an even better idea to do CV repeatedly!
Multiple Trials of k-Fold CV

1.) Loop for t trials:
    a.) Randomize (shuffle) the full data set
    b.) Perform k-fold CV on the shuffled data

2.) Compute statistics over the t x k test performances

[Figure: in each trial the full data set is shuffled and then
partitioned k times, with a different fold serving as the test data in
each partition.]
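A minimal sketch of repeated (multi-trial) k-fold CV, assuming scikit-learn's RepeatedKFold; the estimator and data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# t trials of k-fold CV, reshuffling the data before each trial
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)   # t x k = 15 scores

print("mean:", scores.mean(), "std:", scores.std())
```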
Comparing Multiple Classifiers

1.) Loop for t trials:
    a.) Randomize (shuffle) the full data set
    b.) Perform k-fold CV, testing each candidate learner on the
        same training/testing splits

2.) Compute statistics over the t x k test performances

Using the same splits for every learner allows us to do paired summary
statistics (e.g., a paired t-test).

[Figure: in each trial the shuffled data set is partitioned k times;
both classifiers C1 and C2 are evaluated on each partition's test
fold, yielding paired test performances.]
Comparing learning algorithms

Hypothesis testing
• We want to compare the generalization performance of two learning algorithms
• We want to know whether an observed difference in performance is statistically
  significant (and not due to noise in the evaluation)
• Hypothesis testing allows us to test the statistical significance of a hypothesis
  (e.g. that the two predictors have different performance)
Hypothesis testing (t-test)

Example
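The slide's original worked example is not reproduced here; as a hedged sketch, a paired t-test over per-fold scores could look like the following, assuming SciPy and per-fold accuracies obtained on the same splits (the numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

# Illustrative per-fold accuracies of two classifiers on the SAME CV splits
scores_c1 = np.array([0.81, 0.79, 0.84, 0.80, 0.83])
scores_c2 = np.array([0.78, 0.77, 0.80, 0.79, 0.80])

# Paired t-test on the per-fold differences
t_stat, p_value = stats.ttest_rel(scores_c1, scores_c2)
print("t =", t_stat, "p =", p_value)
# A small p-value suggests the observed difference is unlikely to be due to chance alone
```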
Learning Curve

• Shows performance versus the number of training examples
  – Computed over a single training/testing split
  – Then averaged across multiple trials of CV
Building Learning Curves

1.) Loop for t trials:
    a.) Randomize (shuffle) the full data set
    b.) Perform k-fold CV, computing a learning curve for each
        classifier over each training/testing split

2.) Compute statistics over the t x k learning curves

[Figure: in each trial the shuffled data set is partitioned k times;
each partition yields a learning curve for C1 and for C2.]
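A minimal sketch of computing a learning curve, assuming scikit-learn's learning_curve helper; the training sizes and the estimator are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# Test accuracy as a function of the number of training examples,
# averaged over 5-fold CV splits
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5), shuffle=True, random_state=0)

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{int(n):4d} training examples -> mean test accuracy {score:.3f}")
```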
