
Ho Chi Minh University of Banking

Department of Economic Mathematics

Machine Learning
Model Evaluation

Vuong Trong Nhân ([email protected])


Outline

§ What is model evaluation?


§ Methods:
§ Data splitting Techniques
o Train-test split
o Cross-validation
§ Performance Metrics
o 1. Metrics for Classification
o 2. Metrics for Regression
o 3. Metrics for Clustering

2
What is model evaluation?
§ Ensures Generalizability:
§ Evaluates if the model can make accurate predictions on unseen data,
not just memorize training data.
§ Prevents Overfitting:
§ Identifies models that are "over-learning" the training data, leading to
poor performance on new data.
§ Facilitates Model Comparison:
§ Allows you to compare the performance of different models and
choose the one that generalizes best.
§ Guides Model Improvement:
§ Provides insights into model weaknesses, enabling you to refine
training data, adjust settings, or explore new approaches.
§ Establishes Model Reliability:
§ Builds confidence that the model will perform well in real-world
scenarios with unseen data.
Model evaluation is crucial for building robust and generalizable
machine learning models!

3
Strategy for splitting data for training

§ Train-test split
§ Cross-validation

4
Train-test split
Train-test split is a fundamental technique in machine learning for evaluating
model performance.
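A minimal sketch with scikit-learn's train_test_split (the dataset and model here are illustrative placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 25% of the data for testing; fix random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))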

5
Train-test split
Pros:
1. Prevents overfitting: Ensures models generalize well to unseen data.
2. Objective evaluation: Provides a fair assessment of model
performance.
3. Hyperparameter tuning: Helps find the best model configuration.
4. Simple and efficient: Easy to use and saves resources for quick
evaluations.
Cons:
1. Sensitive to random splits: Requires caution due to potential biases.
2. Data inefficiency: May limit the model's learning potential, since less
data is used for training.

6
Cross-validation

7
Cross-validation

Dev set (development set): This is another name for the validation set
8
Cross-Validation

Advantages:
1. Robust evaluation: Uses all data for training and testing,
mitigating bias from random splits in train-test splits.
2. Estimates model variability: Generates multiple
performance scores, revealing how sensitive the model is
to different training data selections.
3. Data efficiency: Utilizes more data for training (up to 90%)
compared to train-test split (typically 75%).
Drawback:
Increased computational cost: Requires training multiple models
compared to a single train-test split.
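A minimal k-fold cross-validation sketch with scikit-learn (the dataset and model are illustrative placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: each fold trains on 80% of the data and evaluates on the remaining 20%.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))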

9
1. Metrics for Classification

10
Evaluation for Classification

11
Metrics for Classification

§ Accuracy/Error score
§ Confusion matrix
§ Precision and Recall
§ F1 score
§ ROC curve
§ Area Under the Curve

12
Accuracy Metrics

§ Accuracy:


§ The proportion of examples in the test set that are
predicted correctly

Accuracy = (# correct predictions) / (# examples)

§ Best possible value: 1.0 - Worst possible value: 0.0


§ Error:
§ The proportion of examples in the test set that are
predicted incorrectly

Error = (# incorrect predictions) / (# examples) = 1 − Accuracy

§ Best possible value: 0.0 - Worst possible value: 1.0


13
Limitation of Accuracy

§ Consider a binary classification problem


§ Number of Class 0 examples = 9990
§ Number of Class 1 examples = 10
§ If predict all as 0, accuracy is 9990/10000 = 99.9%

§ Solution:
§ Weighted Accuracy = (w_TP·TP + w_TN·TN) / (w_TP·TP + w_FP·FP + w_TN·TN + w_FN·FN)
(see the sketch below)

§ Other metrics: precision, recall, F1-score, …
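A minimal sketch of the weighted accuracy above on this slide's imbalanced example; the weights here are illustrative (inverse class frequencies), not prescribed by the slide:

# Predicting everything as class 0 on the 9990-vs-10 dataset above.
TP, FN = 0, 10        # class 1 (positive): all 10 examples missed
TN, FP = 9990, 0      # class 0 (negative): all 9990 examples correct

plain_acc = (TP + TN) / (TP + FP + TN + FN)              # 0.999

# Assumed weighting: inverse frequency of each cell's true class.
w_pos, w_neg = 1 / 10, 1 / 9990
weighted_acc = (w_pos * TP + w_neg * TN) / (
    w_pos * TP + w_neg * FP + w_neg * TN + w_pos * FN)   # 0.5

print(plain_acc, weighted_acc)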

14
Confusion Matrix

§ Shows the performance of an algorithm, especially its
predictive capability,
§ rather than how fast it classifies, how fast it builds models, or
how well it scales.

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)

15
Confusion Matrix
§ Imagine a study evaluating a test that screens people
for a disease. Each person taking the test either has or
does not have the disease. The test outcome can be
positive or negative.
§ The test results for each subject may or may not match
the subject's actual status. In that setting:
§ True positive: Sick people correctly identified as sick
§ False positive: Healthy people incorrectly identified as sick
§ True negative: Healthy people correctly identified as healthy
§ False negative: Sick people incorrectly identified as healthy

16
Confusion Matrix

https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers 17
Confusion Matrix
                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
                                             (Type II error)
Actual Negative      False Positive (FP)     True Negative (TN)
                     (Type I error)

§ Accuracy
= (TP+TN) / (TP+FP+TN+FN)
§ Precision
= TP / (TP + FP)
§ Recall
= TP / (TP + FN)
§ F1-score = 2 * (precision * recall) / (precision + recall)
           = 2 * TP / (2 * TP + FP + FN)
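A minimal sketch computing these metrics with scikit-learn (the label vectors below are made-up examples):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # illustrative ground truth
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]   # illustrative predictions

# scikit-learn orders the binary confusion matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP+TN)/(TP+FP+TN+FN)
print("Precision:", precision_score(y_true, y_pred))   # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP/(TP+FN)
print("F1-score :", f1_score(y_true, y_pred))          # 2TP/(2TP+FP+FN)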
18
Precision-Recall

19
F1-score
The F1 score is the harmonic mean of the precision and recall
F1-score ∈ (0, 1]

precision   recall   F1_score
1           1        1
0.1         0.1      0.1
0.5         0.5      0.5
1           0.1      0.182
0.3         0.8      0.436

F1-score:
(precision 0.5, recall = 0.5) is better than (precision = 0.3, recall = 0.8)

20
Type I and II error

21
Normalized confusion matrix

                     Predicted Positive       Predicted Negative
Actual Positive      TPR = TP / (TP + FN)     FNR = FN / (TP + FN)
Actual Negative      FPR = FP / (FP + TN)     TNR = TN / (FP + TN)

The False Positive Rate is also called the False Alarm Rate.

The False Negative Rate is also called the Miss Detection Rate.

In landmine detection, "better a false alarm than a miss": we can accept a high
False Alarm Rate in order to achieve a low Miss Detection Rate.
In spam filtering, sending an important email to the trash by mistake is more
serious than letting a spam email through as a normal one.
22
ROC curve

§ Receiver Operating Characteristic


§ Graphical approach for displaying the tradeoff between the
true positive rate (TPR) and false positive rate (FPR) of a
classifier
o TPR = positives correctly classified/total positives
o FPR = negatives incorrectly classified/total negatives
§ TPR on y-axis and FPR on x-axis

23
ROC curve

§ Points of interest (TPR, FPR)


§ (0, 0): everything is negative
§ (1, 1): everything is positive
§ (1, 0): perfect (ideal)

§ Diagonal line
§ Random guessing (50%)

§ Area Under Curve (AUC)


§ Measures how good the model is on average
§ Good to compare with other methods
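A minimal sketch of the ROC curve and AUC with scikit-learn, assuming a classifier that outputs scores or probabilities (values below are illustrative):

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]                     # illustrative labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]    # illustrative predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)      # one (FPR, TPR) point per threshold
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, y_scores))          # 1.0 = ideal, 0.5 = random guessing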

24
ROC curve
Threshold = 0.5

                     Predicted Positive   Predicted Negative
Actual Positive      3                    1
Actual Negative      1                    3

True positive rate:  TPR = 3/4 = 0.75
False positive rate: FPR = 1/4 = 0.25

[Figure: logistic regression curve of the predicted probability of obesity vs. weight, with the 0.5 decision threshold marked]
25
For multi-class classification
Micro-average

Macro-average
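A minimal sketch contrasting the two averaging modes with scikit-learn: micro-averaging pools the TP/FP/FN counts of all classes before computing the metric, while macro-averaging computes the metric per class and takes the unweighted mean (labels below are illustrative):

from sklearn.metrics import precision_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]   # illustrative multi-class labels
y_pred = [0, 1, 0, 1, 2, 2, 2, 0, 2, 2]

# Micro: aggregate counts over all classes, then compute precision once.
print("Micro precision:", precision_score(y_true, y_pred, average="micro"))

# Macro: compute precision per class, then average the per-class scores.
print("Macro precision:", precision_score(y_true, y_pred, average="macro"))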

27
2. Metrics for Regression

28
2.1. Bias

The sum of residuals, sometimes referred to as bias.

residual = actual - prediction

As the residuals can be both positive (prediction is smaller than the actual
value) and negative (prediction is larger than the actual value), bias generally
tells you whether your predictions were higher or lower than the actuals.

However, as residuals of opposing signs offset each other, you can obtain a
model that generates predictions with a very low bias, while not being accurate
at all.

Figure 1 presents the relationship between a target variable (y) and a single feature (x)
29
2.2. Mean squared error (MSE)

Pros:
§ MSE uses the mean (instead of the sum) to keep the metric independent of the
dataset size.
§ As the residuals are squared, MSE puts a significantly heavier penalty on large errors.
Some of those might be outliers, so MSE is not robust to their presence.
§ The metric is useful for optimization algorithms.
Cons:
§ MSE is not measured in the original units, which can make it harder to interpret.
§ MSE cannot be used to compare the performance between different datasets.

30
2.3. Root mean squared error
Root mean squared error (RMSE)

§ Pros:
§ Taking the square root of MSE brings the metric back to the scale of the target
variable, so it is easier to interpret and understand.
§ Cons:
§ However, take caution: one fact that is often overlooked is that although RMSE is on
the same scale as the target, an RMSE of 10 does not actually mean you are off by 10
units on average.

31
https://developer.nvidia.com/
2.4. Mean absolute error (MAE)

§ Pros:
§ Due to the lack of squaring, the metric is expressed at the same scale as the
target variable, making it easier to interpret.
§ All errors are treated equally, so the metric is robust to outliers.
§ Cons:
§ Absolute value disregards the direction of the errors, so underforecasting =
overforecasting.
§ Similar to MSE and RMSE, MAE is also scale-dependent, so you cannot compare
it between different datasets.
§ When you optimize for MAE, the prediction must be higher than the actual value
as many times as it is lower. That means that you are effectively
looking for the median; that is, a value that splits the dataset into two equal parts.
§ As the formula contains absolute values, MAE is not easily differentiable.
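A minimal sketch computing MSE, RMSE and MAE with scikit-learn (the arrays are illustrative):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse  = mean_squared_error(y_true, y_pred)    # mean of squared residuals
rmse = np.sqrt(mse)                          # back on the target's scale
mae  = mean_absolute_error(y_true, y_pred)   # mean of |residuals|, less sensitive to outliers

print(mse, rmse, mae)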
32
2.5. R-squared (R²)

Measure: how well your model fits the data

R² = 1 − RSS / TSS

RSS : the residual sum of squares

TSS : the total sum of squares

33
2.5. R-squared
Measure: how well your model fits the data
• RSS : the residual sum of squares
• TSS : the total sum of squares

§ Pros:
§ Model Fit Assessment & Model Comparisons
o A higher R-squared means a better fit.
§ Helps in Feature Selection
o If adding a variable improves R-squared a lot, it's
likely a good predictor.
§ Cons:
§ Sensitive to Outliers
§ Depends on Sample Size
§ Not distinguishing between different types of
relationships
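A minimal sketch computing R² both from the RSS/TSS definition and with scikit-learn (arrays are illustrative):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

rss = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
print(1 - rss / tss)                           # R² from the definition
print(r2_score(y_true, y_pred))                # same value via scikit-learn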
34
2.6. Some other metrics
§ Mean squared log error (MSLE)
§ Root mean squared log error (RMSLE)
§ Symmetric mean absolute percentage error (sMAPE)
§…

35
3. Metrics for Clustering

36
Rand index (RI)
§ Given the knowledge of the ground truth class assignments labels_true
and our clustering algorithm assignments of the same samples
labels_pred, the (adjusted or unadjusted) Rand index measures the
similarity of the two assignments, ignoring permutations.
§ If C is a ground truth class assignment and K the clustering, let us define
a and b as:
§ a: the number of pairs of elements that are in the same set in C and in the
same set in K
§ b: the number of pairs of elements that are in different sets in C and in
different sets in K

The unadjusted Rand index is then

RI = (a + b) / C(n_samples, 2)

where C(n_samples, 2) is the total number of possible pairs in the dataset.

37
Adjusted Rand index (ARI)
§ However, the Rand index does not guarantee that random
label assignments will get a value close to zero (esp. if the
number of clusters is in the same order of magnitude as the
number of samples).

§ To counter this effect we can discount the expected RI (E[RI])
of random labelings by defining the adjusted Rand index as
follows:

ARI = (RI − E[RI]) / (max(RI) − E[RI])

38
Rand index (RI) & Adjust Rand Index (ARI)

Rand index is a function that measures the similarity of the two assignments,
ignoring permutations:

The Rand index does not guarantee a value close to 0.0 for a random
labelling.

The adjusted Rand index corrects for chance and will give such a baseline.
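A minimal sketch with scikit-learn (the cluster labelings are illustrative; permuting the predicted cluster IDs would not change either score):

from sklearn.metrics import rand_score, adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print("RI :", rand_score(labels_true, labels_pred))            # unadjusted Rand index
print("ARI:", adjusted_rand_score(labels_true, labels_pred))   # corrected for chance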

39
Silhouette Score
§ The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of
such an evaluation, where a higher Silhouette Coefficient score relates to a
model with better defined clusters. The Silhouette Coefficient is defined for
each sample and is composed of two scores:

•a: The mean distance between a sample and all other points in
the same class.
•b: The mean distance between a sample and all other points in
the next nearest cluster.

§ The Silhouette Coefficient s for a single sample is then given as:

s = (b − a) / max(a, b)
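A minimal sketch with scikit-learn, fitting k-means on an illustrative synthetic dataset and scoring the resulting clusters:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print("Silhouette:", silhouette_score(X, labels))   # closer to 1 = better-defined clusters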

40
Some other metrics

§ Mutual Information based scores


§ Homogeneity, completeness and V-measure
§ Fowlkes-Mallows scores
§ Calinski-Harabasz Index
§ Davies-Bouldin Index
§ Contingency Matrix
§ Pair Confusion Matrix

41
References:
§ https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation

42
