Machine Learning Evaluation Metrics Lecturer
University of Bergamo
Performance metrics
Outline
1. Metrics
4. Worked example
Metrics
It is extremely important to use quantitative metrics for evaluating a machine learning
model
Until now, we relied on the cost function value for regression and classification
Other metrics can be used to better evaluate and understand the model
For classification
Accuracy/Precision/Recall/F1-score, ROC curves,…
For regression
Normalized RMSE, Normalized Mean Absolute Error (NMAE),…
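Although the worked examples later in the lecture use MATLAB, a minimal Python sketch of the two regression metrics mentioned above may help; it assumes the common convention of normalizing by the range of the true values (normalizing by their mean is another option):

```python
import math

def nrmse(y_true, y_pred):
    """Root Mean Squared Error, normalized by the range of the true values."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return rmse / (max(y_true) - min(y_true))

def nmae(y_true, y_pred):
    """Mean Absolute Error, normalized by the range of the true values."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return mae / (max(y_true) - min(y_true))

# Made-up regression targets and predictions, purely for illustration
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(nrmse(y_true, y_pred), nmae(y_true, y_pred))
```

Normalizing makes the error scale-free, so models trained on targets with different units can be compared.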
Accuracy
Accuracy is a measure of how close the predictions of our model are to the true values.
If a classifier makes 10 predictions and 9 of them are correct, the accuracy is 90%.
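As a minimal sketch of this computation (in Python, with made-up labels reproducing the 9-out-of-10 example):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# 10 predictions, 9 of them correct -> accuracy 0.9
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # only the last prediction is wrong
print(accuracy(y_true, y_pred))  # 0.9
```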
Classification case: metrics for skewed classes
Disease dichotomic classification example
Suppose we find a small error on the test set (i.e., mostly correct diagnoses).
However, if we use a classifier that always assigns the observations to the majority class, we can still obtain a very high accuracy!
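A quick Python sketch of this trap, with an assumed disease prevalence of 0.5% (the numbers are purely illustrative, not taken from the slide):

```python
# Illustrative skewed dataset: 995 healthy (0) and 5 diseased (1) patients
y_true = [0] * 995 + [1] * 5

# A "dumb" classifier that always predicts the majority class (healthy)
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.995 -- high accuracy, yet every diseased patient is missed
```

This is exactly why, with skewed classes, accuracy alone is a misleading metric.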
Precision and recall
Suppose we are in the presence of a rare class that we want to detect (the positive class)
                       Estimated class
                    1 (p)              0 (n)
Actual  1 (p)   true positives    false negatives
class   0 (n)   false positives   true negatives

Precision = TP / (TP + FP), Recall = TP / (TP + FN)
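A minimal Python sketch of the two metrics, using the standard definitions Precision = TP/(TP+FP) and Recall = TP/(TP+FN); the labels below are made up:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN) for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 3 rare-class positives; the model finds 2 of them and raises 1 false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))
```

Unlike accuracy, both numbers stay low when the rare positive class is handled poorly.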
F1-score
It is usually better to compare models by means of one number only. The F1-score, the harmonic mean of precision and recall, F1 = 2·P·R / (P + R), can be used to combine precision and recall
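A small sketch of the combination, assuming the standard definition of the F1-score as the harmonic mean of precision and recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean is dominated by the smaller of the two values:
# a model with perfect recall but 50% precision gets F1 = 2/3, not 0.75
print(f1_score(0.5, 1.0))
```

Because the harmonic mean penalizes imbalance, a model cannot score a high F1 by maximizing only one of the two metrics.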
Summaries of the confusion matrix
Different metrics can be computed from the confusion matrix, depending on the class of
interest (https://en.wikipedia.org/wiki/Precision_and_recall)
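As an illustration, every such summary can be derived from the four counts of the confusion matrix; a self-contained Python sketch (with made-up labels):

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) counts for a binary problem."""
    c = Counter((t == positive, p == positive) for t, p in zip(y_true, y_pred))
    tp = c[(True, True)]
    fp = c[(False, True)]
    fn = c[(True, False)]
    tn = c[(False, False)]
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

precision   = tp / (tp + fp)
recall      = tp / (tp + fn)  # a.k.a. sensitivity, true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(tp, fp, fn, tn)
```

Which summary matters depends on the class of interest, exactly as the Wikipedia page linked above illustrates.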
Ranking instead of classifying
Classifiers such as logistic regression can output a probability of belonging to a class (or
something similar)
We can use this to rank the different instances and take actions on the cases at the top of
the list
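A minimal Python sketch of this idea, with hypothetical instance names and scores:

```python
# Hypothetical scores from a probabilistic classifier (e.g., logistic regression)
instances = [("case A", 0.23), ("case B", 0.91), ("case C", 0.55), ("case D", 0.78)]

# Rank instances by decreasing score and act on the top of the list
ranked = sorted(instances, key=lambda pair: pair[1], reverse=True)
top_2 = [name for name, score in ranked[:2]]
print(top_2)  # ['case B', 'case D']
```

Ranking defers the choice of a decision threshold: we simply act on as many top cases as our budget allows.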
Ranking instead of classifying
[Figure: a list of instances ranked by decreasing classifier score (0.99, 0.85, 0.80, 0.70, …) with their true class Y/N, together with the confusion-matrix counts (p/n) obtained by placing the decision threshold at different points of the list]
Ranking instead of classifying
ROC curves are a very general way to represent and compare the performance of
different models (on a binary classification task)
[Figure: ROC curve with Recall (True Positive Rate) on the vertical axis; the top-left corner is perfection, the origin corresponds to a model that classifies always negative, and the observed models are plotted as points]
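A ROC curve is traced by sweeping the decision threshold over the scores; a self-contained Python sketch (made-up labels and scores, AUC by the trapezoidal rule):

```python
def roc_points(y_true, scores):
    """Sweep the threshold over the scores and collect (FPR, TPR) points."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# A classifier whose scores separate the two classes perfectly -> AUC = 1
y_true = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.1]
print(auc(roc_points(y_true, scores)))  # 1.0
```

An AUC of 0.5 corresponds to random guessing, while 1.0 corresponds to the "perfection" corner of the figure.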
Breast cancer detection
Breast cancer is the most common cancer amongst women in the world.
It accounts for 25% of all cancer cases, and affected over 2.1 million people in 2015
alone.
It starts when cells in the breast begin to grow out of control. These cells usually
form tumors that can be seen via X-ray or felt as lumps in the breast area.
The key challenge in its detection is how to classify tumors into malignant
(cancerous) or benign (non-cancerous).
Goal: classifying these tumors using machine learning and the Breast Cancer
Wisconsin (Diagnostic) Dataset.
Breast cancer Wisconsin dataset
This dataset is taken from Kaggle.
Output: Class 4 stands for malignant cancer; Class 2 stands for benign cancer.
id_num | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
1041801 5 3 3 3 2 3 4 4 1 4
1043999 1 1 1 1 2 3 3 1 1 2
1044572 8 7 5 10 7 9 5 5 4 4
1047630 7 4 6 4 6 1 4 3 1 4
1048672 4 1 1 1 2 1 2 1 1 2
1049815 4 1 1 1 2 1 3 1 1 2
1050670 10 7 7 6 4 10 4 1 2 4
Breast cancer detection
We will use the dataset to compare different logistic regression models by means
of the ROC curve associated with each of them.
To this aim we will work with 4 different datasets (plus an extra one)
Extra: after learning the model of CASE 1, take only the features with the smallest
p-value.
Class 4 stands for malignant cancer and it is for us the positive output: we set it to 1.
Class 2 stands for benign cancer and it is for us the negative output: we set it to 0.

%% Load and clean data
Phi = table2array(data(:,1:end-1));
y = table2array(data(:,end));
y(y==4) = 1; % in the original data 4 stands for malignant cancer
y(y==2) = 0; % in the original data 2 stands for benign cancer
% Set up the data matrix appropriately, and add ones for the intercept term
[N, d] = size(Phi);
Phi = [ones(N, 1) Phi]; % Add intercept term
%% Train and test data
[X,Y,T,AUC] = perfcurve(y, scores, 1);
Results
Results
Comparison of case 1, case 4 and the best model
Pneumonia detection
Suppose we have at our disposal X-ray images of lungs: healthy people and COVID-19
patients
Acknowledgments
The COVID-19 X-ray image dataset is curated by Dr. Joseph Cohen, a postdoctoral fellow at
the University of Montreal, see https://josephpcohen.com/w/public-covid19-dataset/
The previous data contain only X-ray images of people with a disease. To collect
images of healthy people, we can download another X-ray dataset on the platform
Kaggle https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
Pneumonia detection
We want to use a classifier to perform classification:
Healthy patients: class
Patients with a disease: class
For these computer vision tasks, the state-of-the-art algorithms are Convolutional
Neural Networks:
we can use them to classify the images into healthy and diseased
Pneumonia detection
[Figure: example chest X-rays with their true label and the estimated covid / healthy label]
Pneumonia detection
Classification results on test set
[Figure: confusion matrix on the test set, actual class vs. estimated class, 1 (p) / 0 (n)]
Sensitivity (recall, true positive rate) = TP / (TP + FN)
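With hypothetical test-set counts (the numbers below are invented for illustration; only the 100% specificity reflects the result reported in these slides), sensitivity and specificity are computed as:

```python
# Hypothetical test-set counts, purely for illustration
tp, fn = 18, 2   # COVID-19 patients: correctly / wrongly classified
tn, fp = 40, 0   # healthy patients:  correctly / wrongly classified

sensitivity = tp / (tp + fn)  # recall, true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(sensitivity, specificity)  # 0.9 1.0
```

Note that the two summaries answer different questions: sensitivity looks only at the sick patients, specificity only at the healthy ones.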
Pneumonia detection
Classification results on test set
Specificity = TN / (TN + FP): of the patients that do not have COVID-19 (i.e., the true
negatives), our model accurately identifies them as “COVID-19 negative” 100% of the time.
Pneumonia detection
Classification results on test set
Being able to accurately detect healthy patients with 100% accuracy is great: we do
not want to quarantine someone for nothing
…but we don’t want to classify someone as “healthy” when they are “COVID-19
positive”, since they could infect other people without knowing it
Summary
Balancing sensitivity and specificity is incredibly challenging when it comes to medical
applications