09 - ML-Model Evaluation
Machine Learning
Model Evaluation
What is model evaluation?
§ Ensures Generalizability:
§ Evaluates whether the model can make accurate predictions on unseen data,
rather than just memorizing the training data.
§ Prevents Overfitting:
§ Identifies models that are "over-learning" the training data, leading to
poor performance on new data.
§ Facilitates Model Comparison:
§ Allows you to compare the performance of different models and
choose the one that generalizes best.
§ Guides Model Improvement:
§ Provides insights into model weaknesses, enabling you to refine
training data, adjust settings, or explore new approaches.
§ Establishes Model Reliability:
§ Builds confidence that the model will perform well in real-world
scenarios with unseen data.
Model evaluation is crucial for building robust and generalizable
machine learning models!
Strategy for splitting data for training
§ Train-test split
§ Cross-validation
Train-test split
Train-test split is a fundamental technique in machine learning for evaluating
model performance.
Train-test split
Pros:
1. Prevents overfitting: Ensures models generalize well to unseen data.
2. Objective evaluation: Provides a fair assessment of model
performance.
3. Hyperparameter tuning: Helps find the best model configuration.
4. Simple and efficient: Easy to use and saves resources for quick
evaluations.
Cons:
1. Sensitive to random splits: Requires caution due to potential biases.
2. Data inefficiency: May limit the model's learning potential, since less
data is available for training.
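A minimal sketch of a train-test split with scikit-learn; the dataset, model, and 75/25 ratio are illustrative assumptions, not part of the slides:

```python
# Minimal train-test split sketch (illustrative only): hold out 25% of the
# data, fit a model on the training part, and evaluate on the unseen test part.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 75% train / 25% test; fixing random_state makes the split reproducible,
# which matters because the score is sensitive to the random split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```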
Cross-validation
Dev set (development set): This is another name for the validation set
Cross-Validation
Advantages:
1. Robust evaluation: Uses all data for training and testing,
mitigating bias from random splits in train-test splits.
2. Estimates model variability: Generates multiple
performance scores, revealing how sensitive the model is
to different training data selections.
3. Data efficiency: Utilizes more data for training (up to 90%)
compared to train-test split (typically 75%).
Drawback:
Increased computational cost: Requires training multiple models
compared to a single train-test split.
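As a rough sketch, k-fold cross-validation with scikit-learn; the 5-fold setting, dataset, and model below are assumptions for illustration:

```python
# 5-fold cross-validation sketch: each sample is used for testing exactly once,
# and we obtain one score per fold, revealing the model's variability.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)  # 80% train / 20% test per fold
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold scores:", scores)
print("Mean ± std: %.3f ± %.3f" % (scores.mean(), scores.std()))
```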
1. Metrics for Classification
Evaluation for Classification
Metrics for Classification
§ Accuracy/Error score
§ Confusion matrix
§ Precision and Recall
§ F1 score
§ ROC curve
§ Area Under the Curve
Accuracy Metrics
§ Solution:
§ Weighted Accuracy = (w_TP·TP + w_TN·TN) / (w_TP·TP + w_FP·FP + w_TN·TN + w_FN·FN)
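A small sketch of the weighted-accuracy formula above; the counts and weights are made-up illustration values (e.g. penalizing false negatives more heavily on an imbalanced problem):

```python
# Weighted accuracy sketch: each cell of the confusion matrix gets its own weight.
def weighted_accuracy(tp, fp, tn, fn, w_tp=1.0, w_fp=1.0, w_tn=1.0, w_fn=1.0):
    correct = w_tp * tp + w_tn * tn
    total = w_tp * tp + w_fp * fp + w_tn * tn + w_fn * fn
    return correct / total

# Made-up counts; with all weights = 1 this reduces to ordinary accuracy.
print(weighted_accuracy(tp=40, fp=5, tn=45, fn=10))            # 0.85
print(weighted_accuracy(tp=40, fp=5, tn=45, fn=10, w_fn=5.0))  # false negatives count 5x
```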
Confusion Matrix
                          Predicted Class
                          Positive                Negative
Actual    Positive        True Positive (TP)      False Negative (FN)
Class     Negative        False Positive (FP)     True Negative (TN)
Confusion Matrix
§ Imagine a study evaluating a test that screens people
for a disease. Each person taking the test either has or
does not have the disease. The test outcome can be
positive or negative.
§ The test results for each subject may or may not match
the subject's actual status. In that setting:
§ True positive: Sick people correctly identified as sick
§ False positive: Healthy people incorrectly identified as sick
§ True negative: Healthy people correctly identified as healthy
§ False negative: Sick people incorrectly identified as healthy
Confusion Matrix
https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
Confusion Matrix
                          Predicted Class
                          Positive                              Negative
Actual    Positive        True Positive (TP)                    False Negative (FN) (Type II error)
Class     Negative        False Positive (FP) (Type I error)    True Negative (TN)

§ Accuracy  = (TP + TN) / (TP + FP + TN + FN)
§ Precision = TP / (TP + FP)
§ Recall    = TP / (TP + FN)
§ F1-score  = 2 · precision · recall / (precision + recall) = 2·TP / (2·TP + FP + FN)
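A minimal sketch computing these metrics with scikit-learn; the label vectors are made-up illustration values:

```python
# Accuracy, precision, recall and F1 from predictions (illustrative labels).
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# For binary labels {0, 1}, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+FP+TN+FN)
print("Precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("F1-score :", f1_score(y_true, y_pred))         # 2PR/(P+R)
```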
Precision-Recall
F1-score
The F1 score is the harmonic mean of the precision and recall
F1-score ∈ (0, 1]
In terms of F1-score,
(precision = 0.5, recall = 0.5) is better than (precision = 0.3, recall = 0.8)
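A quick check with the harmonic-mean formula shows why:

F1(0.5, 0.5) = 2 · 0.5 · 0.5 / (0.5 + 0.5) = 0.5
F1(0.3, 0.8) = 2 · 0.3 · 0.8 / (0.3 + 0.8) = 0.48 / 1.1 ≈ 0.436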
Type I and II error
Normalized confusion matrix
                          Predicted Class
                          Positive                  Negative
Actual    Positive        TPR = TP / (TP + FN)      FNR = FN / (TP + FN)
Class     Negative        FPR = FP / (FP + TN)      TNR = TN / (FP + TN)

The False Positive Rate is also called the False Alarm Rate.
The False Negative Rate is also called the Miss Detection Rate.
In landmine detection, "better a false alarm than a miss" applies: we can accept
a high False Alarm Rate in order to achieve a low Miss Detection Rate.
In spam filtering, sending an important email to the trash is more serious than
classifying a spam email as a normal one.
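As a sketch, these rates can be read directly from a row-normalized confusion matrix in scikit-learn; the labels below are made-up:

```python
# Row-normalized confusion matrix: each row sums to 1, giving
# [[TNR, FPR], [FNR, TPR]] for binary labels {0, 1}.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred, normalize="true")
tnr, fpr = cm[0]
fnr, tpr = cm[1]
print("TPR (recall):", tpr, " FNR (miss rate):", fnr)
print("FPR (false alarm rate):", fpr, " TNR:", tnr)
```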
ROC curve
ROC curve
§ Diagonal line
§ Random guessing (50%)
ROC curve
Threshold = 0.5

                          Predicted Class
                          Positive    Negative
Actual    Positive        3           1
Class     Negative        1           3

True positive rate:  TPR = 3/4 = 0.75
False positive rate: FPR = 1/4 = 0.25

[Figure: logistic regression curve of the probability of obesity versus weight;
samples with predicted probability above the 0.5 threshold are classified as positive.]
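A minimal sketch of computing an ROC curve and its AUC with scikit-learn; the weight/obesity data below are synthetic assumptions that merely mirror the slide's example:

```python
# ROC curve sketch: sweep the decision threshold over predicted probabilities
# and trace (FPR, TPR) pairs; AUC summarizes the whole curve in one number.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
weight = np.r_[rng.normal(60, 8, 100), rng.normal(90, 8, 100)].reshape(-1, 1)
obese = np.r_[np.zeros(100), np.ones(100)]   # synthetic labels

proba = LogisticRegression(max_iter=1000).fit(weight, obese).predict_proba(weight)[:, 1]

fpr, tpr, thresholds = roc_curve(obese, proba)
print("AUC:", roc_auc_score(obese, proba))   # 0.5 = random guessing, 1.0 = perfect
```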
For multi-class classification
§ Micro-average: compute the metric globally by summing TP, FP and FN over all classes.
§ Macro-average: compute the metric per class and take the unweighted mean.
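A rough sketch of the difference with scikit-learn; the 3-class labels are made-up:

```python
# Micro vs. macro averaging for multi-class precision (illustrative labels).
# micro: pool all TP/FP counts over classes, then compute precision once.
# macro: compute precision per class, then take the unweighted mean.
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 2, 0]

print("micro:", precision_score(y_true, y_pred, average="micro"))
print("macro:", precision_score(y_true, y_pred, average="macro"))
```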
2. Metrics for Regression
2.1. Bias
Figure 1 presents the relationship between a target variable (y) and a single feature (x)
2.2. Mean squared error (MSE)
Pros:
§ MSE uses the mean (instead of the sum) to keep the metric independent of the
dataset size.
§ As the residuals are squared, MSE puts a significantly heavier penalty on large errors.
§ The metric is differentiable, which makes it useful for optimization algorithms.
Cons:
§ Because large errors dominate, MSE is not robust to outliers.
§ MSE is not measured in the original units, which can make it harder to interpret.
§ MSE is scale-dependent, so it cannot be used to compare performance between
different datasets.
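For reference, the formula in the same plain notation used for the classification metrics above (n is the number of samples, ŷᵢ the prediction):

MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²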
2.3. Root mean squared error
Root mean squared error (RMSE)
§ Pros:
§ Taking the square root of MSE brings the metric back to the scale of the target
variable, so it is easier to interpret and understand.
§ Cons:
§ However, take caution: one fact that is often overlooked is that although RMSE is on
the same scale as the target, an RMSE of 10 does not actually mean you are off by 10
units on average.
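In the same notation, the square root simply rescales MSE:

RMSE = √MSE = √( (1/n) · Σᵢ (yᵢ − ŷᵢ)² )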
https://developer.nvidia.com/
2.4. Mean absolute error (MAE)
§ Pros:
§ Due to the lack of squaring, the metric is expressed at the same scale as the
target variable, making it easier to interpret.
§ All errors are treated equally, so the metric is robust to outliers.
§ Cons:
§ The absolute value disregards the direction of the errors, so under-forecasting
and over-forecasting are penalized equally.
§ Similar to MSE and RMSE, MAE is also scale-dependent, so you cannot compare
it between different datasets.
§ When you optimize for MAE, the optimal prediction overestimates the actual value
as often as it underestimates it. That means you are effectively looking for the
median, that is, a value that splits the dataset into two equal parts.
§ As the formula contains absolute values, MAE is not easily differentiable.
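A minimal sketch computing MSE, RMSE and MAE with scikit-learn; the target values are made-up:

```python
# MSE, RMSE and MAE on the same made-up predictions; note how the single
# large error (last point) inflates MSE/RMSE much more than MAE.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 15.0, 20.0, 30.0])
y_pred = np.array([11.0, 11.0, 16.0, 19.0, 40.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                       # back on the scale of the target
mae = mean_absolute_error(y_true, y_pred)

print("MSE :", mse)    # 20.8
print("RMSE:", rmse)   # ≈ 4.56
print("MAE :", mae)    # 2.8
```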
2.5. R-squared (R²)
2.5. R-squared
Measures how well your model fits the data:
R² = 1 − RSS / TSS
• RSS: the residual sum of squares
• TSS: the total sum of squares
§ Pros:
§ Model Fit Assessment & Model Comparisons
o A higher R-squared means a better fit.
§ Helps in Feature Selection
o If adding a variable improves R-squared a lot, it's
likely a good predictor.
§ Cons:
§ Sensitive to Outliers
§ Depends on Sample Size
§ Does not distinguish between different types of relationships
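A small sketch matching the RSS/TSS definition above against scikit-learn's r2_score; the values are made-up:

```python
# R² computed from its definition (1 - RSS/TSS) and via sklearn.metrics.r2_score.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 12.0, 15.0, 20.0, 30.0])
y_pred = np.array([11.0, 11.0, 16.0, 19.0, 28.0])

rss = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
print("manual R²:", 1 - rss / tss)
print("sklearn  :", r2_score(y_true, y_pred))   # same value
```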
2.6. Some other metrics
§ Mean squared log error (MSLE)
§ Root mean squared log error (RMSLE)
§ Symmetric mean absolute percentage error (sMAPE)
§…
3. Metrics for Clustering
Rand index (RI)
§ Given the knowledge of the ground-truth class assignments labels_true and our
clustering algorithm's assignments of the same samples labels_pred, the (adjusted
or unadjusted) Rand index measures the similarity of the two assignments,
ignoring permutations.
§ If C is a ground-truth class assignment and K the clustering, let us define a and b as:
§ a: the number of pairs of elements that are in the same set in C and in the same set in K
§ b: the number of pairs of elements that are in different sets in C and in different sets in K
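With a and b defined as above, the (unadjusted) Rand index is the fraction of agreeing pairs; stated here for completeness in the same plain notation:

RI = (a + b) / C(n_samples, 2)
where C(n_samples, 2) is the total number of pairs of samples in the dataset.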
Adjusted Rand index (ARI)
§ However, the Rand index does not guarantee that random
label assignments will get a value close to zero (esp. if the
number of clusters is in the same order of magnitude as the
number of samples).
Rand index (RI) & Adjust Rand Index (ARI)
Rand index is a function that measures the similarity of the two assignments,
ignoring permutations:
The Rand index is not guaranteed to be close to 0.0 for a random labelling.
The adjusted Rand index corrects for chance and provides such a baseline.
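A minimal sketch with scikit-learn; the label assignments are made-up, and the permuted cluster ids in labels_pred are deliberately different from labels_true:

```python
# Rand index vs. adjusted Rand index on made-up assignments.
from sklearn.metrics import adjusted_rand_score, rand_score

labels_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]   # cluster ids are permuted; RI/ARI ignore that

print("RI :", rand_score(labels_true, labels_pred))
print("ARI:", adjusted_rand_score(labels_true, labels_pred))  # corrected for chance
```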
Silhouette Score
§ The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of
such an evaluation, where a higher Silhouette Coefficient score relates to a
model with better defined clusters. The Silhouette Coefficient is defined for
each sample and is composed of two scores:
• a: The mean distance between a sample and all other points in the same class.
• b: The mean distance between a sample and all other points in the next nearest cluster.
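A rough sketch: the per-sample coefficient is s = (b − a) / max(a, b), and sklearn.metrics.silhouette_score reports the mean over all samples. The k-means-on-blobs setup below is an illustrative assumption:

```python
# Silhouette score sketch: fit k-means on synthetic blobs and evaluate how
# well-separated the resulting clusters are (closer to 1 is better).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette score:", silhouette_score(X, labels))
```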
Some other metrics
References:
§ https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation