
Evaluation

Wei Li

Fundamentals of Machine Learning for Predictive Data Analytics


Chapter 8: Evaluation
Sections 8.1, 8.3, 8.4.1, 8.4.2
Content
• Big idea
• Hold-out test set
• Misclassification accuracy and confusion matrix
• Precision, recall and F1 measure
• More test set selection approaches
Big idea
The purpose of evaluation is threefold:

1. To determine which model is the most suitable for a task.


2. To estimate how the model will perform.
3. To convince users that the model will meet their needs.

Training
Data Test
Data
Model
Standard Approach: Measuring
Misclassification Rate on a
Hold-out Test Set
Misclassification Rate on a Hold-out Test Set

Figure: The process of building and evaluating a model using a


hold-out test set.
Figure: Hold-out sampling can divide the full data into training, validation, and test sets, e.g. (a) a 50:20:30 split or (b) a 40:20:40 split.
Figure: Using a validation set to avoid overfitting in iterative machine learning algorithms (misclassification rate on the training set and on the validation set, plotted against training iteration).
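As a rough illustration of hold-out sampling, here is a minimal Python sketch (assuming NumPy; the function name, the 50:20:30 proportions, and the dataset size are illustrative, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def holdout_split(n_instances, fractions=(0.5, 0.2, 0.3)):
    """Shuffle the instance indices and slice them into training, validation, and test sets."""
    indices = rng.permutation(n_instances)
    n_train = int(fractions[0] * n_instances)
    n_valid = int(fractions[1] * n_instances)
    train_idx = indices[:n_train]
    valid_idx = indices[n_train:n_train + n_valid]
    test_idx = indices[n_train + n_valid:]
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = holdout_split(1000)
print(len(train_idx), len(valid_idx), len(test_idx))  # 500 200 300
```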
Misclassification accuracy and
confusion matrix
Table: A sample test set with model predictions of email (ham/spam).
ID Target Pred. Outcome ID Target Pred. Outcome
1 spam ham FN 11 ham ham TN
2 spam ham FN 12 spam ham FN
3 ham ham TN 13 ham ham TN
4 spam spam TP 14 ham ham TN
5 ham ham TN 15 ham ham TN
6 spam spam TP 16 ham ham TN
7 ham ham TN 17 ham spam FP
8 spam spam TP 18 spam spam TP
9 spam spam TP 19 ham ham TN
10 spam spam TP 20 ham spam FP

misclassification rate = (number of incorrect predictions) / (total predictions)

misclassification rate = 5 / 20 = 0.25



In this example spam is the positive class and ham is the negative class.
For binary prediction problems there are 4 possible outcomes:
1. True Positive (TP): Target = positive AND Pred = positive
2. True Negative (TN): Target = negative AND Pred = negative
3. False Positive (FP): Target = negative AND Pred = positive
4. False Negative (FN): Target = positive AND Pred = negative

Table: The structure of a confusion matrix.

                        Prediction
                    positive   negative
Target   positive      TP         FN
         negative      FP         TN

Error:

misclassification rate = (FP + FN) / (TP + TN + FP + FN)          (2)

misclassification rate = (2 + 3) / (6 + 9 + 2 + 3) = 0.25

Accuracy:

classification accuracy = (TP + TN) / (TP + TN + FP + FN)          (3)

classification accuracy = (6 + 9) / (6 + 9 + 2 + 3) = 0.75
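As a concrete check of the formulas above, here is a minimal Python sketch (not from the chapter) that counts the confusion-matrix cells for the 20-instance spam/ham test set and reproduces the misclassification rate and classification accuracy:

```python
targets = ["spam", "spam", "ham", "spam", "ham", "spam", "ham", "spam", "spam", "spam",
           "ham", "spam", "ham", "ham", "ham", "ham", "ham", "spam", "ham", "ham"]
preds   = ["ham",  "ham",  "ham", "spam", "ham", "spam", "ham", "spam", "spam", "spam",
           "ham",  "ham",  "ham", "ham",  "ham", "ham",  "spam", "spam", "ham", "spam"]

# spam is treated as the positive class and ham as the negative class
tp = sum(t == "spam" and p == "spam" for t, p in zip(targets, preds))  # 6
tn = sum(t == "ham"  and p == "ham"  for t, p in zip(targets, preds))  # 9
fp = sum(t == "ham"  and p == "spam" for t, p in zip(targets, preds))  # 2
fn = sum(t == "spam" and p == "ham"  for t, p in zip(targets, preds))  # 3

misclassification_rate  = (fp + fn) / (tp + tn + fp + fn)   # 5 / 20 = 0.25
classification_accuracy = (tp + tn) / (tp + tn + fp + fn)   # 15 / 20 = 0.75
print(tp, tn, fp, fn, misclassification_rate, classification_accuracy)
```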
Performance Measures for
Categorical Targets

Precision, Recall and F1 measure


Accuracy Metrics
• Widely used metrics:
  – Precision: number of correct predictions for a class / number of predictions of that class
  – Recall: number of correct predictions for a class / size of the reference data for that class
  – F1: the harmonic mean of precision and recall
Precision, recall and F1 measure
• For the positive class:

precision = TP / (TP + FP)

recall = TP / (TP + FN)

For the sample test set:

precision = 6 / (6 + 2) = 0.75

recall = 6 / (6 + 3) = 0.667
• Recall and precision are often in tension with each other:

precision = TP / (TP + FP)

recall = TP / (TP + FN)

• Extreme cases:
  – FP = 0: Precision = 100%, Recall = 10%
  – FN = 0: Precision = 60%, Recall = 100%

The F1 measure balances the two as their harmonic mean:

F1-measure = 2 × (precision × recall) / (precision + recall)

For the extreme cases above:
  – Precision = 100%, Recall = 10%: F1-measure = 18%
  – Precision = 60%, Recall = 100%: F1-measure = 75%

For the spam example:

precision = 6 / (6 + 2) = 0.75

recall = 6 / (6 + 3) = 0.667

F1-measure = 2 × (0.75 × 0.667) / (0.75 + 0.667) = 0.71
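A minimal Python sketch of the same calculations, reusing the counts TP = 6, FP = 2, FN = 3 from the sample test set (illustrative code, not from the chapter):

```python
tp, fp, fn = 6, 2, 3  # counts for the positive (spam) class in the sample test set

precision = tp / (tp + fp)                             # 6 / 8  = 0.75
recall    = tp / (tp + fn)                             # 6 / 9  = 0.667
f1 = 2 * (precision * recall) / (precision + recall)   # about 0.71
print(round(precision, 3), round(recall, 3), round(f1, 2))
```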
Making the most of the data

• Generally, the larger the training data, the better the classifier.
• The larger the test data, the more accurate the error estimate.
• What can we do to make our evaluation more reliable?
Stratification
• The hold-out method reserves a certain amount of data for testing and uses the remainder for training.
• For an unbalanced dataset, the samples might not be representative:
  – Few or no instances of some classes.
• Stratified sampling: an advanced way of balancing the data.
  – Make sure that each class is represented with approximately equal proportions in both subsets (a minimal sketch follows this list).
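A minimal Python sketch of a stratified hold-out split (assuming NumPy; the function name, the 30:70 toy class balance, and the 70:30 train/test proportions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def stratified_holdout(labels, test_fraction=0.3):
    """Split instance indices so each class keeps roughly the same proportion in both subsets."""
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_fraction * len(cls_idx)))
        test_idx.extend(cls_idx[:n_test])
        train_idx.extend(cls_idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

labels = ["spam"] * 30 + ["ham"] * 70   # an unbalanced toy dataset
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))    # 70 30, with about 30% spam in each subset
```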
Repeated Hold-out Method
• The hold-out estimate can be made more reliable by repeating the process with different subsamples:
  – In each iteration, a certain proportion is randomly selected for training (possibly with stratification).
  – The error rates from the different iterations are averaged to yield an overall error rate.
• This is called the repeated hold-out method (a minimal sketch follows this list).
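A minimal Python sketch of the repeated hold-out method, building on the stratified split above; train_and_evaluate is a hypothetical callback that trains a model on the training indices and returns its misclassification rate on the test indices:

```python
import numpy as np

def repeated_holdout(labels, train_and_evaluate, n_repeats=10, test_fraction=0.3):
    """Average the error rate over several random (stratified) hold-out splits."""
    error_rates = []
    for _ in range(n_repeats):
        # stratified_holdout is the sketch from the Stratification section above
        train_idx, test_idx = stratified_holdout(labels, test_fraction)
        error_rates.append(train_and_evaluate(train_idx, test_idx))
    return np.mean(error_rates)
```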
Repeated Hold-out Method (Cont.)
• Still not optimal: the different test sets overlap, but we would like every instance in the data to be tested at least once.

• Can we prevent overlapping?


K-Fold Cross Validation
• K-fold cross-validation avoids
overlapping test sets:
– First step: data is split into k subsets of
equal size;
– Second step: each subset in turn is used
for testing and the remainder for
training.
• The subsets are stratified before the cross-validation is performed.
• The k estimates are averaged to yield an overall estimate (a minimal sketch follows the figure below).

Figure: In each of the k iterations, one fold is used as the test set and the remaining folds are used for training.
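A minimal Python sketch of (non-stratified) k-fold cross-validation; as before, train_and_evaluate is a hypothetical callback, and the fold count and seed are illustrative:

```python
import numpy as np

def kfold_cross_validation(n_instances, train_and_evaluate, k=10, seed=42):
    """Split the instances into k folds; each fold is used once for testing, the rest for training."""
    indices = np.random.default_rng(seed).permutation(n_instances)
    folds = np.array_split(indices, k)
    error_rates = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        error_rates.append(train_and_evaluate(train_idx, test_idx))
    return np.mean(error_rates)  # the overall (averaged) error estimate
```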


More on Cross-Validation
• Standard method for evaluation: stratified 10-fold
cross-validation.
• Why 10?
– Extensive experiments have shown that this is the best
choice to get an accurate estimate.
• Stratification reduces the estimates’ variance.
• Even better: repeated stratified cross-validation.
  – E.g. 10-fold cross-validation is repeated 10 times and the results are averaged (further reducing the variance).
Leave-One-Out Cross-Validation
• Leave-One-Out is a particular form of cross-validation:
  – Set the number of folds to the number of training instances.
• For n training instances, the classifier is built n times.
• Makes the best use of the data.
• Involves no random sub-sampling.
• Very computationally expensive (see the fragment after this list).
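Because leave-one-out is just k-fold cross-validation with k equal to the number of instances, the k-fold sketch above can be reused directly; this fragment assumes the hypothetical train_and_evaluate callback and the kfold_cross_validation function from the earlier sketches:

```python
# Leave-one-out: every instance is the test set exactly once, so the model is built n times.
n = 100  # number of training instances (illustrative)
loo_error = kfold_cross_validation(n_instances=n, train_and_evaluate=train_and_evaluate, k=n)
```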
Leave-One-Out Cross-Validation and Stratification
• A disadvantage of Leave-One-Out Cross-Validation is that stratification is not possible:
  – It guarantees a non-stratified sample because there is only one instance in the test set!
