Machine Learning
Performance Metrics
University of Bergamo

Performance metrics
Outline
1. Metrics

2. Precision and recall

3. Receiver Operating Characteristic (ROC) curves

4. Worked examples

Metrics
It is extremely important to use quantitative metrics for evaluating a machine learning model.

• Until now, we relied on the cost function value for regression and classification
• Other metrics can be used to better evaluate and understand the model
• For classification:
  Accuracy, Precision, Recall, F1-score, ROC curves, …
• For regression:
  Normalized RMSE, Normalized Mean Absolute Error (NMAE), …
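As a side illustration (not from the slides), the regression metrics listed above can be computed as follows; the targets are hypothetical and the normalization by the target range is only one common choice:

% A minimal sketch: normalized RMSE and normalized MAE for regression.
y     = [3.0 2.5 4.1 3.8 5.2];          % hypothetical true targets
y_hat = [2.8 2.7 4.0 4.1 5.0];          % hypothetical predictions
rmse  = sqrt(mean((y - y_hat).^2));     % root mean squared error
nrmse = rmse / (max(y) - min(y));       % normalized by the target range
nmae  = mean(abs(y - y_hat)) / (max(y) - min(y));   % normalized mean absolute error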
Accuracy
Accuracy is a measure of how close the predictions of our model are to their true values.

  Accuracy = (number of correct predictions) / (total number of predictions)

• If a classifier makes 10 predictions and 9 of them are correct, the accuracy is 90%.

• Accuracy is a measure of how well a binary classifier correctly identifies or excludes a condition.

• It is the proportion of correct predictions among the total number of cases examined.
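As a quick illustration (not from the slides, with hypothetical labels), accuracy can be computed directly by comparing predicted and true labels:

% A minimal sketch: accuracy from 0/1 label vectors of the same length.
y_true = [1 0 1 1 0 1 0 1 1 1];   % hypothetical true labels
y_pred = [1 0 1 1 0 0 0 1 1 1];   % hypothetical predictions (one mistake)
accuracy = sum(y_pred == y_true) / numel(y_true);   % fraction of correct predictions
fprintf('Accuracy: %.0f%%\n', 100*accuracy);        % prints 90% for this example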
Classification case: metrics for skewed classes
Disease dichotomous classification example

• Train a logistic regression model h(x), with y = 1 if disease, y = 0 otherwise.

• Suppose you find a very small error on the test set (almost all diagnoses are correct).

• Only a small fraction of patients actually have the disease: the y = 1 class has very few samples with respect to the y = 0 class.

• If I use a classifier that always assigns observations to the y = 0 class, I can get an even higher accuracy!

For skewed classes, the accuracy metric can be deceptive.
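A minimal sketch of the problem (the class proportions below are hypothetical): on heavily skewed data, a classifier that always predicts the negative class already reaches a very high accuracy while detecting nobody.

% Hypothetical skewed data set: 5 diseased patients out of 1000.
y_true = [ones(5,1); zeros(995,1)];
y_pred = zeros(1000,1);                                  % "always negative" classifier
accuracy = mean(y_pred == y_true);                       % 0.995, despite finding no patient
recall   = sum(y_pred==1 & y_true==1) / sum(y_true==1);  % 0: no diseased patient detected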
Precision and recall
Suppose that y = 1 corresponds to a rare class that we want to detect.

Confusion matrix (estimated class on the rows, actual class on the columns):

                    Actual 1 (p)           Actual 0 (n)
Estimated 1 (Y)     True positive (TP)     False positive (FP)
Estimated 0 (N)     False negative (FN)    True negative (TN)

Precision (how precise we are in the detection)
Of all patients for which we predicted y = 1, what fraction actually has the disease?

  Precision = TP / (TP + FP)

Recall (how good we are at detecting)
Of all patients that actually have the disease, what fraction did we correctly detect as having the disease?

  Recall = TP / (TP + FN)
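A minimal sketch (with hypothetical labels) of precision and recall computed from the confusion-matrix counts:

y_true = [1 1 1 0 0 0 0 1 0 0];    % hypothetical true labels
y_pred = [1 0 1 0 0 1 0 1 0 0];    % hypothetical predictions
TP = sum(y_pred==1 & y_true==1);   % true positives  (3)
FP = sum(y_pred==1 & y_true==0);   % false positives (1)
FN = sum(y_pred==0 & y_true==1);   % false negatives (1)
precision = TP / (TP + FP);        % 3/4 = 0.75
recall    = TP / (TP + FN);        % 3/4 = 0.75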
F1-score
It is usually better to compare models by means of one number only. The F1 score can be used to combine precision and recall.

              Precision (P)   Recall (R)   Average   F1 score
Algorithm 1       0.5            0.4         0.45      0.444
Algorithm 2       0.7            0.1         0.4       0.175
Algorithm 3       0.02           1.0         0.51      0.0392

• Algorithm 3 always classifies y = 1: the average wrongly says that Algorithm 3 is the best.
• The F1 score correctly says that the best is Algorithm 1.

  Average = (P + R) / 2

  F1 score = 2 * P * R / (P + R)
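The table can be reproduced with a few lines of MATLAB (the numbers are the ones above):

P = [0.5 0.7 0.02];                % precision of algorithms 1, 2, 3
R = [0.4 0.1 1.0];                 % recall of algorithms 1, 2, 3
avg = (P + R) / 2;                 % 0.45  0.40  0.51  -> would pick Algorithm 3
F1  = 2 * (P .* R) ./ (P + R);     % 0.444 0.175 0.039 -> picks Algorithm 1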
Summaries of the confusion matrix
Different metrics can be computed from the confusion matrix, depending on the class of
interest (https://en.wikipedia.org/wiki/Precision_and_recall)

Ranking instead of classifying
Classifiers such as logistic regression can output a probability of belonging to a class (or something similar).

• We can use this to rank the different instances and take actions on the cases at the top of the list

• We may have a budget, so we have to target the most promising individuals

• Ranking enables the use of different techniques for visualizing model performance
Ranking instead of classifying
[Figure: test instances sorted by decreasing classifier score (0.99, 0.98, 0.96, 0.90, 0.88, 0.87, 0.85, 0.80, 0.70, ...), each with its true class. Sweeping the decision threshold down the ranked list produces a different confusion matrix at each cut, e.g. (TP, FP; FN, TN) = (0, 0; 100, 100), (1, 0; 99, 100), (2, 0; 98, 100), (2, 1; 98, 99), ..., (6, 4; 94, 96).]
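A minimal sketch (with hypothetical scores and labels) of how sweeping the threshold over the ranked scores generates a different confusion matrix at each cut:

scores = [0.99 0.98 0.96 0.90 0.88 0.87 0.85 0.80 0.70];   % hypothetical ranked scores
y_true = [1    1    0    1    1    0    1    0    0   ];   % hypothetical true classes
for t = scores                       % use each score as a candidate threshold
    y_pred = scores >= t;
    TP = sum( y_pred & y_true==1);   FP = sum( y_pred & y_true==0);
    FN = sum(~y_pred & y_true==1);   TN = sum(~y_pred & y_true==0);
    fprintf('t = %.2f   TP=%d FP=%d FN=%d TN=%d\n', t, TP, FP, FN, TN);
end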
Ranking instead of classifying
ROC curves are a very general way to represent and compare the performance of different models (on a binary classification task).

[Figure: ROC plane with the recall (true positive rate) on the y-axis and 1 - specificity (false positive rate) on the x-axis; perfection is the top-left corner, the diagonal corresponds to random guessing, and curves above (below) the diagonal correspond to better (worse) classifiers.]

Observations:
• (0, 0): classify always negative
• (1, 1): classify always positive
• TPR = FPR (the diagonal): random classifier
• TPR < FPR (below the diagonal): worse than a random classifier
• Different classifiers can be compared
• Area Under the Curve (AUC): probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance
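The AUC interpretation above can be checked numerically (the scores and labels below are hypothetical): the value returned by perfcurve coincides with the fraction of positive/negative pairs in which the positive instance receives the higher score.

scores = [0.9 0.8 0.7 0.6 0.55 0.5 0.4 0.3 0.2 0.1];   % hypothetical scores
labels = [1   1   0   1   0    1   0   0   1   0  ];   % hypothetical true labels
[~,~,~,AUC] = perfcurve(labels, scores, 1);            % AUC from the ROC curve

pos = scores(labels==1);  neg = scores(labels==0);
wins = 0;
for s = pos
    wins = wins + sum(s > neg) + 0.5*sum(s == neg);    % ties count one half
end
AUC_rank = wins / (numel(pos)*numel(neg));             % same value as AUC (0.72 here)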
Breast cancer detection
• Breast cancer is the most common cancer amongst women in the world.

• It accounts for 25% of all cancer cases, and affected over 2.1 million people in 2015 alone.

• It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area.

• The key challenge in its detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous).

• Goal: classify these tumors using machine learning and the Breast Cancer Wisconsin (Diagnostic) Dataset.
Breast cancer Wisconsin dataset

This dataset is taken from Kaggle.

Output: class 4 stands for malignant cancer, class 2 stands for benign cancer.

id_num    Clump      Uniformity    Uniformity     Marginal   Single Epith.  Bare     Bland      Normal     Mitoses   Class
          Thickness  of Cell Size  of Cell Shape  Adhesion   Cell Size      Nuclei   Chromatin  Nucleoli
1041801       5          3             3              3          2             3        4          4           1        4
1043999       1          1             1              1          2             3        3          1           1        2
1044572       8          7             5             10          7             9        5          5           4        4
1047630       7          4             6              4          6             1        4          3           1        4
1048672       4          1             1              1          2             1        2          1           1        2
1049815       4          1             1              1          2             1        3          1           1        2
1050670      10          7             7              6          4            10        4          1           2        4
   ...      ...        ...           ...            ...        ...           ...      ...        ...         ...      ...
Breast cancer detection
We will use the dataset to compare different logistic regression models by means of the ROC curve associated with each of them.

To this aim we will work with 4 different datasets (plus an extra one):

1. Case 1: the whole dataset
2. Case 2: the first group of 5 features
3. Case 3: the second group of 5 features
4. Case 4: only the first two features

Extra: after learning the model of Case 1, take only the features with the smallest p-values.
Matlab code

Notes:
• Class 4 stands for malignant cancer and it is for us the positive output: we set it to 1.
• Class 2 stands for benign cancer and it is for us the negative output: we set it to 0.
• perfcurve computes the points of the ROC curve as well as the AUC.

%% Load and clean data
data = readtable('breast_cancer_w.xlsx');   % load our data as a table

Phi = table2array(data(:,1:end-1));
y   = table2array(data(:,end));

y(y==4) = 1;   % in the original data, 4 stands for malignant cancer
y(y==2) = 0;   % in the original data, 2 stands for benign cancer

% Set up the data matrix appropriately, and add ones for the intercept term
[N, d] = size(Phi);
Phi = [ones(N, 1) Phi];   % add intercept term

%% Train and test data
mdl = fitglm(Phi, y, 'Distribution', 'binomial', 'Link', 'logit')

%% ============ Part 2: Compute the ROC curve ============
scores = mdl.Fitted.Probability;

[X, Y, T, AUC] = perfcurve(y, scores, 1);

% Plot the ROC curve
figure
plot(X, Y)
xlabel('False positive rate')
ylabel('True positive rate')
title('ROC for Classification by Logistic Regression')
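To compare cases 1-4 one can repeat the same fit on different column subsets of Phi and look at the resulting AUC values; the sketch below is only illustrative and the column indices are hypothetical (they depend on how the features are ordered in the spreadsheet):

% Hypothetical feature subsets (column 1 of Phi is the intercept term).
subsets = {2:size(Phi,2), 2:6, 7:11, 2:3};          % case 1, 2, 3, 4
for k = 1:numel(subsets)
    cols  = [1 subsets{k}];                          % always keep the intercept column
    mdl_k = fitglm(Phi(:,cols), y, 'Distribution', 'binomial', 'Link', 'logit');
    [~,~,~,AUC_k] = perfcurve(y, mdl_k.Fitted.Probability, 1);
    fprintf('Case %d: AUC = %.3f\n', k, AUC_k);
end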
Results

Comparison of cases 1, 2, 3 and 4

[Figure: ROC curves of the four cases.]

Using only the first 2 features is not a smart choice.
Results
Comparison of cases 1, 4 and best

[Figure: ROC curves of case 1, case 4 and the "best features" model.]

Using only the best features provides a model that performs almost as well as using all the features.
Pneumonia detection
Suppose we have at our disposal X-ray images of lungs: healthy people and COVID-19 disease patients.
Acknowledgments
• The COVID-19 X-ray image dataset is curated by Dr. Joseph Cohen, a postdoctoral fellow at the University of Montreal, see https://josephpcohen.com/w/public-covid19-dataset/

• The previous data contain only X-ray images of people with a disease. To collect images of healthy people, we can download another X-ray dataset from the platform Kaggle: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia

• The analysis is inspired by a tutorial by Adrian Rosebrock: https://www.pyimagesearch.com/2020/03/16/detecting-covid-19-in-x-ray-images-with-keras-tensorflow-and-deep-learning/
Pneumonia detection

We want to use a classifier to perform classification:
• Healthy patients: class 0
• Patients with a disease: class 1

The input data are directly the X-ray images.

For these computer vision tasks, the state-of-the-art algorithms are Convolutional Neural Networks:
• we can use them to classify the images into healthy and diseased
Pneumonia detection

[Figure: example X-ray test images with the true label and the estimated label ("covid" / "healthy") for each image.]
Pneumonia detection
Classification results on the test set

Confusion matrix (estimated class on the rows, actual class on the columns):

                    Actual 1 (p)          Actual 0 (n)
Estimated 1 (Y)     True positive: 11     False positive: 0
Estimated 0 (N)     False negative: 1     True negative: 11

Sensitivity (recall, true positive rate)

  Sensitivity = TP / (TP + FN) = 11 / 12 ≈ 0.92

Specificity (true negative rate)

  Specificity = TN / (TN + FP) = 11 / 11 = 1.00

Accuracy

  Accuracy = (TP + TN) / (TP + TN + FP + FN) = 22 / 23 ≈ 0.96
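A minimal check of the slide's numbers directly from the confusion-matrix counts:

TP = 11; FP = 0; FN = 1; TN = 11;                 % counts from the confusion matrix above
sensitivity = TP / (TP + FN);                     % 0.9167 (about 92%)
specificity = TN / (TN + FP);                     % 1.0 (100%)
accuracy    = (TP + TN) / (TP + TN + FP + FN);    % 0.9565 (about 96%)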
Pneumonia detection
Classification results on the test set

  Sensitivity (recall, true positive rate) = TP / (TP + FN) ≈ 0.92
  Specificity (true negative rate) = TN / (TN + FP) = 1.00

• Sensitivity: of patients that do have COVID-19 (i.e., true positives), we could accurately identify them as "COVID-19 positive" 92% of the time using our model.

• Specificity: of patients that do not have COVID-19 (i.e., true negatives), we could accurately identify them as "COVID-19 negative" 100% of the time using our model.
Pneumonia detection
Classification results on the test set

  Sensitivity (recall, true positive rate) ≈ 0.92
  Specificity (true negative rate) = 1.00

• Being able to accurately detect healthy patients with 100% accuracy is great: we do not want to quarantine someone for nothing…

• …but we do not want to classify someone as "healthy" when they are "COVID-19 positive", since they could infect other people without knowing it.
Summary
Balancing sensitivity and specificity is incredibly challenging when it comes to medical applications.

The results should always be validated on another pool of patients.

Furthermore, we need to be concerned about what the model is actually learning:
• Do the results align with the medical knowledge?
• Was the dataset representative of the population, or was there selection bias?
Summary

Furthermore, we need to be concerned about what the model is actually learning:
• Did we account for all external factors (confounders) that could interfere with the response?
