09 - ML-Model Evaluation
Machine Learning
Model Evaluation
What is model evaluation?
§ Ensures Generalizability:
§ Evaluates whether the model can make accurate predictions on unseen data,
rather than just memorizing the training data.
§ Prevents Overfitting:
§ Identifies models that are "over-learning" the training data, leading to
poor performance on new data.
§ Facilitates Model Comparison:
§ Allows you to compare the performance of different models and
choose the one that generalizes best.
§ Guides Model Improvement:
§ Provides insights into model weaknesses, enabling you to refine
training data, adjust settings, or explore new approaches.
§ Establishes Model Reliability:
§ Builds confidence that the model will perform well in real-world
scenarios with unseen data.
Model evaluation is crucial for building robust and generalizable
machine learning models!
Strategy for splitting data for training
§ Train-test split
§ Cross-validation
Train-test split
Train-test split is a fundamental technique in machine learning for evaluating
model performance.
Train-test split
Pros:
1. Prevents overfitting: Ensures models generalize well to unseen data.
2. Objective evaluation: Provides a fair assessment of model
performance.
3. Hyperparameter tuning: Helps find the best model configuration.
4. Simple and efficient: Easy to use and saves resources for quick
evaluations.
Cons:
1. Sensitive to random splits: Requires caution due to potential biases.
2. Data inefficiency: May limit the model's learning potential, since less
data is available for training.
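A minimal sketch of a train-test split with scikit-learn; the dataset, model, and 75/25 ratio are illustrative assumptions, not part of the slides:

```python
# Minimal train-test split sketch (illustrative only): hold out 25% of the
# data, fit a model on the training part, and evaluate on the unseen test part.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 75% train / 25% test; fixing random_state makes the split reproducible,
# which matters because the score is sensitive to the random split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```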
Cross-validation
Dev set (development set): This is another name for the validation set
Cross-Validation
Advantages:
1. Robust evaluation: Uses all data for training and testing,
mitigating bias from random splits in train-test splits.
2. Estimates model variability: Generates multiple
performance scores, revealing how sensitive the model is
to different training data selections.
3. Data efficiency: Utilizes more data for training (up to 90%)
compared to train-test split (typically 75%).
Drawback:
Increased computational cost: Requires training multiple models
compared to a single train-test split.
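As a rough sketch, k-fold cross-validation with scikit-learn; the 5-fold setting, dataset, and model below are assumptions for illustration:

```python
# 5-fold cross-validation sketch: each sample is used for testing exactly once,
# and we obtain one score per fold, revealing the model's variability.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)  # 80% train / 20% test per fold
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold scores:", scores)
print("Mean ± std: %.3f ± %.3f" % (scores.mean(), scores.std()))
```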
1. Metrics for Classification
Evaluation for Classification
Metrics for Classification
§ Accuracy/Error score
§ Confusion matrix
§ Precision and Recall
§ F1 score
§ ROC curve
§ Area Under the Curve
Accuracy Metrics
§ Solution:
§ Weighted Accuracy = (w_TP·TP + w_TN·TN) / (w_TP·TP + w_FP·FP + w_TN·TN + w_FN·FN)
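A small sketch of the weighted-accuracy formula above; the counts and weights are made-up illustration values (e.g. penalizing false negatives more heavily on an imbalanced problem):

```python
# Weighted accuracy sketch: each cell of the confusion matrix gets its own weight.
def weighted_accuracy(tp, fp, tn, fn, w_tp=1.0, w_fp=1.0, w_tn=1.0, w_fn=1.0):
    correct = w_tp * tp + w_tn * tn
    total = w_tp * tp + w_fp * fp + w_tn * tn + w_fn * fn
    return correct / total

# Made-up counts; with all weights = 1 this reduces to ordinary accuracy.
print(weighted_accuracy(tp=40, fp=5, tn=45, fn=10))            # 0.85
print(weighted_accuracy(tp=40, fp=5, tn=45, fn=10, w_fn=5.0))  # false negatives count 5x
```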
Confusion Matrix
                          Predicted Class
                          Positive                Negative
Actual    Positive        True Positive (TP)      False Negative (FN)
Class     Negative        False Positive (FP)     True Negative (TN)
Confusion Matrix
§ Imagine a study evaluating a test that screens people
for a disease. Each person taking the test either has or
does not have the disease. The test outcome can be
positive or negative.
§ The test results for each subject may or may not match
the subject's actual status. In that setting:
§ True positive: Sick people correctly identified as sick
§ False positive: Healthy people incorrectly identified as sick
§ True negative: Healthy people correctly identified as healthy
§ False negative: Sick people incorrectly identified as healthy
Confusion Matrix
https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
Confusion Matrix
                          Predicted Class
                          Positive                              Negative
Actual    Positive        True Positive (TP)                    False Negative (FN) (Type II error)
Class     Negative        False Positive (FP) (Type I error)    True Negative (TN)

§ Accuracy  = (TP + TN) / (TP + FP + TN + FN)
§ Precision = TP / (TP + FP)
§ Recall    = TP / (TP + FN)
§ F1-score  = 2 · precision · recall / (precision + recall) = 2·TP / (2·TP + FP + FN)
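A minimal sketch computing these metrics with scikit-learn; the label vectors are made-up illustration values:

```python
# Accuracy, precision, recall and F1 from predictions (illustrative labels).
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# For binary labels {0, 1}, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+FP+TN+FN)
print("Precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("F1-score :", f1_score(y_true, y_pred))         # 2PR/(P+R)
```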
Precision-Recall
F1-score
The F1 score is the harmonic mean of the precision and recall
F1-score ∈ (0, 1]
In terms of F1-score,
(precision = 0.5, recall = 0.5) is better than (precision = 0.3, recall = 0.8)
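A quick check with the harmonic-mean formula shows why:

F1(0.5, 0.5) = 2 · 0.5 · 0.5 / (0.5 + 0.5) = 0.5
F1(0.3, 0.8) = 2 · 0.3 · 0.8 / (0.3 + 0.8) = 0.48 / 1.1 ≈ 0.436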
Type I and II error
Normalized confusion matrix
                          Predicted Class
                          Positive                  Negative
Actual    Positive        TPR = TP / (TP + FN)      FNR = FN / (TP + FN)
Class     Negative        FPR = FP / (FP + TN)      TNR = TN / (FP + TN)

The False Positive Rate is also called the False Alarm Rate.
The False Negative Rate is also called the Miss Detection Rate.
In landmine detection, "better a false alarm than a miss" applies: we can accept
a high False Alarm Rate in order to achieve a low Miss Detection Rate.
In spam filtering, sending an important email to the trash is more serious than
classifying a spam email as a normal one.
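As a sketch, these rates can be read directly from a row-normalized confusion matrix in scikit-learn; the labels below are made-up:

```python
# Row-normalized confusion matrix: each row sums to 1, giving
# [[TNR, FPR], [FNR, TPR]] for binary labels {0, 1}.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred, normalize="true")
tnr, fpr = cm[0]
fnr, tpr = cm[1]
print("TPR (recall):", tpr, " FNR (miss rate):", fnr)
print("FPR (false alarm rate):", fpr, " TNR:", tnr)
```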
ROC curve
ROC curve
§ Diagonal line
§ Random guessing (50%)
ROC curve
Threshold = 0.5

                          Predicted Class
                          Positive    Negative
Actual    Positive        3           1
Class     Negative        1           3

True positive rate:  TPR = 3/4 = 0.75
False positive rate: FPR = 1/4 = 0.25

[Figure: logistic regression curve of the probability of obesity versus weight;
samples with predicted probability above the 0.5 threshold are classified as positive.]
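A minimal sketch of computing an ROC curve and its AUC with scikit-learn; the weight/obesity data below are synthetic assumptions that merely mirror the slide's example:

```python
# ROC curve sketch: sweep the decision threshold over predicted probabilities
# and trace (FPR, TPR) pairs; AUC summarizes the whole curve in one number.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
weight = np.r_[rng.normal(60, 8, 100), rng.normal(90, 8, 100)].reshape(-1, 1)
obese = np.r_[np.zeros(100), np.ones(100)]   # synthetic labels

proba = LogisticRegression(max_iter=1000).fit(weight, obese).predict_proba(weight)[:, 1]

fpr, tpr, thresholds = roc_curve(obese, proba)
print("AUC:", roc_auc_score(obese, proba))   # 0.5 = random guessing, 1.0 = perfect
```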
For multi-class classification
§ Micro-average: compute the metric globally by summing TP, FP and FN over all classes.
§ Macro-average: compute the metric per class and take the unweighted mean.
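A rough sketch of the difference with scikit-learn; the 3-class labels are made-up:

```python
# Micro vs. macro averaging for multi-class precision (illustrative labels).
# micro: pool all TP/FP counts over classes, then compute precision once.
# macro: compute precision per class, then take the unweighted mean.
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 2, 0]

print("micro:", precision_score(y_true, y_pred, average="micro"))
print("macro:", precision_score(y_true, y_pred, average="macro"))
```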
2. Metrics for Regression
2.1. Bias
Figure 1 presents the relationship between a target variable (y) and a single feature (x)
2.2. Mean squared error (MSE)
Pros:
§ MSE uses the mean (instead of the sum) to keep the metric independent of the
dataset size.
§ As the residuals are squared, MSE puts a significantly heavier penalty on large errors.
§ The metric is differentiable, which makes it useful for optimization algorithms.
Cons:
§ Because large errors dominate, MSE is not robust to outliers.
§ MSE is not measured in the original units, which can make it harder to interpret.
§ MSE is scale-dependent, so it cannot be used to compare performance between
different datasets.
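For reference, the formula in the same plain notation used for the classification metrics above (n is the number of samples, ŷᵢ the prediction):

MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²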
2.3. Root mean squared error
Root mean squared error (RMSE)
§ Pros:
§ Taking the square root of MSE brings the metric back to the scale of the target
variable, so it is easier to interpret and understand.
§ Cons:
§ However, take caution: one fact that is often overlooked is that although RMSE is on
the same scale as the target, an RMSE of 10 does not actually mean you are off by 10
units on average.
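In the same notation, the square root simply rescales MSE:

RMSE = √MSE = √( (1/n) · Σᵢ (yᵢ − ŷᵢ)² )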
https://developer.nvidia.com/
2.4. Mean absolute error (MAE)
§ Pros:
§ Due to the lack of squaring, the metric is expressed at the same scale as the
target variable, making it easier to interpret.
§ All errors are treated equally, so the metric is robust to outliers.
§ Cons:
§ The absolute value disregards the direction of the errors, so under-forecasting
and over-forecasting are penalized equally.
§ Similar to MSE and RMSE, MAE is also scale-dependent, so you cannot compare
it between different datasets.
§ When you optimize for MAE, the optimal prediction overestimates the actual value
as often as it underestimates it. That means you are effectively looking for the
median, that is, a value that splits the dataset into two equal parts.
§ As the formula contains absolute values, MAE is not easily differentiable.
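A minimal sketch computing MSE, RMSE and MAE with scikit-learn; the target values are made-up:

```python
# MSE, RMSE and MAE on the same made-up predictions; note how the single
# large error (last point) inflates MSE/RMSE much more than MAE.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 15.0, 20.0, 30.0])
y_pred = np.array([11.0, 11.0, 16.0, 19.0, 40.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                       # back on the scale of the target
mae = mean_absolute_error(y_true, y_pred)

print("MSE :", mse)    # 20.8
print("RMSE:", rmse)   # ≈ 4.56
print("MAE :", mae)    # 2.8
```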
2.5. R-squared (R²)
2.5. R-squared
Measures how well your model fits the data:
R² = 1 − RSS / TSS
• RSS: the residual sum of squares
• TSS: the total sum of squares
§ Pros:
§ Model Fit Assessment & Model Comparisons
o A higher R-squared means a better fit.
§ Helps in Feature Selection
o If adding a variable improves R-squared a lot, it's
likely a good predictor.
§ Cons:
§ Sensitive to Outliers
§ Depends on Sample Size
§ Does not distinguish between different types of relationships
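A small sketch matching the RSS/TSS definition above against scikit-learn's r2_score; the values are made-up:

```python
# R² computed from its definition (1 - RSS/TSS) and via sklearn.metrics.r2_score.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 12.0, 15.0, 20.0, 30.0])
y_pred = np.array([11.0, 11.0, 16.0, 19.0, 28.0])

rss = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
print("manual R²:", 1 - rss / tss)
print("sklearn  :", r2_score(y_true, y_pred))   # same value
```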
2.6. Some other metrics
§ Mean squared log error (MSLE)
§ Root mean squared log error (RMSLE)
§ Symmetric mean absolute percentage error (sMAPE)
§…
3. Metrics for Clustering
Rand index (RI)
§ Given the knowledge of the ground-truth class assignments labels_true and our
clustering algorithm's assignments of the same samples labels_pred, the (adjusted
or unadjusted) Rand index measures the similarity of the two assignments,
ignoring permutations.
§ If C is a ground-truth class assignment and K the clustering, let us define a and b as:
§ a: the number of pairs of elements that are in the same set in C and in the same set in K
§ b: the number of pairs of elements that are in different sets in C and in different sets in K
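With a and b defined as above, the (unadjusted) Rand index is the fraction of agreeing pairs; stated here for completeness in the same plain notation:

RI = (a + b) / C(n_samples, 2)
where C(n_samples, 2) is the total number of pairs of samples in the dataset.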
Adjusted Rand index (ARI)
§ However, the Rand index does not guarantee that random
label assignments will get a value close to zero (esp. if the
number of clusters is in the same order of magnitude as the
number of samples).
Rand index (RI) & Adjust Rand Index (ARI)
Rand index is a function that measures the similarity of the two assignments,
ignoring permutations:
The Rand index is not guaranteed to be close to 0.0 for a random labelling.
The adjusted Rand index corrects for chance and provides such a baseline.
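A minimal sketch with scikit-learn; the label assignments are made-up, and the permuted cluster ids in labels_pred are deliberately different from labels_true:

```python
# Rand index vs. adjusted Rand index on made-up assignments.
from sklearn.metrics import adjusted_rand_score, rand_score

labels_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]   # cluster ids are permuted; RI/ARI ignore that

print("RI :", rand_score(labels_true, labels_pred))
print("ARI:", adjusted_rand_score(labels_true, labels_pred))  # corrected for chance
```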
Silhouette Score
§ The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of
such an evaluation, where a higher Silhouette Coefficient score relates to a
model with better defined clusters. The Silhouette Coefficient is defined for
each sample and is composed of two scores:
• a: The mean distance between a sample and all other points in the same class.
• b: The mean distance between a sample and all other points in the next nearest cluster.
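A rough sketch: the per-sample coefficient is s = (b − a) / max(a, b), and sklearn.metrics.silhouette_score reports the mean over all samples. The k-means-on-blobs setup below is an illustrative assumption:

```python
# Silhouette score sketch: fit k-means on synthetic blobs and evaluate how
# well-separated the resulting clusters are (closer to 1 is better).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette score:", silhouette_score(X, labels))
```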
Some other metrics
References:
§ https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation