
Evaluating learning methods

Luca Citi
[email protected]

School of Computer Science and Electronic Engineering


University of Essex (UK)

CE802

Outline

Evaluating learning methods


Measuring performance
Classification accuracy
Cross validation
Test set
Confusion matrices
Kappa statistics
Recall and precision

Evaluating learning methods for regression


MEASURING PERFORMANCE
Assuming we are evaluating a system that learns to predict
classifications, what would be a good measure of its
performance?

Accuracy
The percentage of unknown examples that the system
classifies correctly.
Not quite as simple as it seems.
In some situations, some errors matter more than others.
e.g. A system to diagnose a serious disease: What is
the relative cost of a false alarm and a missed
diagnosis?


Other Performance Characteristics


Accuracy is an obviously important characteristic, but there
are others including:
Space and time complexity
The computing resources required by the learning
procedure.
Size of training set
How many training examples are needed to reach the best
achievable classification accuracy? How well will the
system perform with a limited number of training
examples?
Simplicity of the final model
Important if a human user wants to discover something
about the data.


MEASURING CLASSIFICATION ACCURACY


Suppose we have a learning program and a data set.
We use the data set to train the system.
After training the system can classify 98% of the data set
correctly.
Is this good?
It is impossible to say for several reasons.


Training and Testing Must Use Different Data Sets


The system was trained and evaluated using the same data.
This does not give a good indication of how well the system
will perform given a new example.
Consider an instance based system that saves all training
examples.
Such a system would score 100% if tested using the data
used for training!
Similar considerations apply to any learning procedure.

Correct procedure:
Randomly partition the data set into two subsets:
A training set used to train the system.
A test set used to evaluate the system's performance
after learning is completed.
So let us assume we did this and the test phase achieved
95% correct classifications.
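As a concrete illustration (not part of the original handout, which uses Weka), a minimal scikit-learn sketch of this hold-out procedure might look as follows; the dataset and classifier are placeholders:

```python
# Minimal sketch of a random train/test split (assumes scikit-learn;
# the dataset and classifier are illustrative placeholders).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Randomly partition the data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Accuracy on the training data is optimistic; the held-out test set
# gives a fairer estimate of performance on new examples.
print("Training accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("Test accuracy:    ", accuracy_score(y_test, clf.predict(X_test)))
```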

We Do Not Know How Difficult The Learning Task Was


Suppose that the task was a binary classification and that
94% of examples belonged to Class 1.
Then a system that predicted Class 1 for every unknown
example would be right 94% of the time.
In such a case, achieving 95% accuracy looks
unimpressive.
If, on the other hand, both classes were equally likely, 95%
would look very good.
So we really need to know the frequency of the modal class
(i.e. the most frequently occurring class) to provide a baseline
for judging the performance.
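As a hedged sketch of this baseline idea (not in the original handout), scikit-learn's DummyClassifier can report the accuracy of always predicting the modal class, against which a learned model can be compared; the data and model below are illustrative:

```python
# Minimal sketch: compare a learned model against the modal-class baseline.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Always predicts the most frequent class seen in the training data.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("Modal-class baseline accuracy:", baseline.score(X_test, y_test))
print("Learned model accuracy:       ", model.score(X_test, y_test))
```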


Now suppose the task is predicting the result of a coin toss on


the basis of the date, time, weather and name of person
tossing the coin.
Does the fact that a learning system only achieves 50%
accuracy mean the learning system is no good?
Of course not. The task is such that it is impossible to
predict the outcome from the attributes available.
So the inherent difficulty of a learning task must also be
considered.
Generally the inherent difficulty is unknown. Hence the only
available basis is comparison with other learning procedures.


HOW MANY EXAMPLES DO WE NEED?

FOR TRAINING?
As many as possible.
The minimum requirement will depend on:
The complexity of the relationship that is being
learned.
This will depend on both the number of attributes and
the nature of the relationships between them.
More attributes will require more training data
More complex relationships between attributes will
require more training data.
The learning procedure
Procedures that build more complex models
require more training data.

FOR TESTING?
It depends on how accurately you want to estimate the performance.

STRATIFICATION
Suppose we are training a system to predict which of two
classes, C1 and C2, examples belong to.
Suppose also that the amount of data available is limited to a
sample (which we assume is drawn randomly from the
original population).
We divide the sample into a training set and a test set.
Suppose it turns out that most of the examples in the
training set belong to C1 and most of those in the test set to
C2.
Clearly this is not a good basis for either training or testing.

What we should have done was ensure that the proportion of each class in the training set was the same as the proportion in the original sample.
This is called stratification.

Stratification is well worth doing but it is still possible that the training and test sets were not truly representative of the sample (and hence of the original population).
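A minimal sketch of a stratified split, assuming scikit-learn's train_test_split with its stratify argument (the labels below are synthetic and purely illustrative):

```python
# Minimal sketch: stratified hold-out split. Passing the labels to
# `stratify` keeps the class proportions in the training and test sets
# the same as in the full sample.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.choice(["C1", "C2"], size=1000, p=[0.8, 0.2])  # imbalanced labels
X = rng.normal(size=(1000, 5))                         # illustrative features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

for name, labels in [("train", y_tr), ("test", y_te)]:
    print(name, "proportion of C1:", np.mean(labels == "C1"))
```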

CROSS-VALIDATION
One way to further reduce the likelihood of unrepresentative
training and test data sets is to repeat the process.
Suppose we divided the sample randomly into ten disjoint
subsets called folds.
Each fold could be used as the test set and the remaining
9/10 of the data set used as a training set.
Thus we can obtain ten runs, each with a different test set,
from one data set.
The average across all 10 runs would be our accuracy estimate.

This technique is known as cross-validation and is widely used in machine learning.
Normally stratification is also imposed when partitioning the
sample.
This is known as stratified cross-validation.
[Figure: cross-validation (credit: Shan-Hung Wu)]

Weka provides a facility for automatically performing n-fold stratified cross-validation; 10-fold cross-validation is the default experimental procedure.

Why 10?
Although there is no rigorous theory to back this up, wide
experience suggests 10 is about the right number to get
the best accuracy estimate.
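A minimal sketch of stratified 10-fold cross-validation, assuming scikit-learn rather than Weka; the dataset and classifier are placeholders:

```python
# Minimal sketch: stratified 10-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy per fold; the mean is the cross-validation estimate.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:  ", round(scores.mean(), 3))
```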


EVEN BETTER ESTIMATES


A single 10-fold cross-validation will provide an accuracy estimate based on a single 10-way partitioning of the sample. The results that you would get with a different 10-way partitioning might be different. So, you could run the 10-fold cross-validation procedure a number of times and average the results.

Note that, if you do this in Weka, you must change the random number seed manually for each cross-validation.

So how many times should you run the cross-validation procedure?
It depends on how precisely you want to know the accuracy of the learning system.
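A minimal sketch of repeating the cross-validation with different partitionings and averaging the results, assuming scikit-learn's RepeatedStratifiedKFold (a rough equivalent of changing the random seed manually in Weka):

```python
# Minimal sketch: 10 repetitions of stratified 10-fold cross-validation,
# each with a different random partitioning of the sample.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("Mean accuracy over 10 x 10-fold CV:", round(scores.mean(), 3))
print("Std of the 100 fold accuracies:    ", round(scores.std(), 3))
```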
The Impact of Sample Size on Confidence Limits

If we assume that the results are normally distributed, we can make use of the fact that the standard deviation of their mean will decrease with the number of results we obtain:

σ(x̄) = σ(x) / √N

This is known as the standard error.

Using this relationship we can establish a confidence interval.

That is, a range within which the true value will lie with
some specified probability.

The width of the confidence interval decreases with the square root of sample size.
So we need a large number of samples if we want to
estimate a parameter accurately.
Note that the required sample size is not dependent on the
population size.
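A minimal sketch of the standard-error and confidence-interval calculation, assuming a set of accuracy estimates from repeated runs (the numbers below are made up for illustration):

```python
# Minimal sketch: standard error and approximate 95% confidence interval
# for the mean of repeated accuracy estimates (assumes the estimates are
# roughly normally distributed, as in the handout).
import numpy as np

accuracies = np.array([0.93, 0.95, 0.94, 0.96, 0.92,
                       0.95, 0.94, 0.93, 0.96, 0.94])  # illustrative results

mean = accuracies.mean()
std_error = accuracies.std(ddof=1) / np.sqrt(len(accuracies))

# About 1.96 standard errors either side of the mean covers ~95%.
print(f"mean = {mean:.3f}, standard error = {std_error:.4f}")
print(f"95% CI: [{mean - 1.96 * std_error:.3f}, {mean + 1.96 * std_error:.3f}]")
```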

Model selection and test set


When testing more than one classifier (or tuning parameters), keep a
separate test set (or perform an additional outer CV loop)

Example: Fortune teller to predict the sex of babies


I ask 10 fortune tellers (using different techniques: crystal ball,
dreams, astrology, ...) to predict the sex of 20 babies
I make sure there is no overfitting (CV, or no training set at all!) and obtain:
{6/20, 8/20, 13/20, 10/20, 15/20, 12/20, 9/20, 11/20, 13/20, 7/20}
I write a paper claiming that (for example) “crystal ball fortune tellers
outperform other techniques and achieve 75% accuracy (p < 0.02) in
predicting sex of babies”
This is clearly wrong. I should now repeat the test using only the best fortune teller to confirm the result on new data; I will very likely obtain a much smaller percentage.
Why? P(B(20, 0.5) ≥ 15) ≈ 0.015, but given k_i ∼ B(20, 0.5) for i = 1..10, then P(max{k_i} ≥ 15) ≫ 0.015.
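A minimal simulation of this argument (not in the original handout); it assumes only numpy and simply repeats the experiment many times:

```python
# Minimal sketch: each "fortune teller" is a random guesser on 20 babies.
# Looking only at the best of 10 tellers makes a score of 15/20 or better
# far more likely than the single-teller probability suggests.
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

single = rng.binomial(20, 0.5, size=n_trials)
best_of_10 = rng.binomial(20, 0.5, size=(n_trials, 10)).max(axis=1)

print("P(one teller scores >= 15):", np.mean(single >= 15))
print("P(best of 10 scores >= 15):", np.mean(best_of_10 >= 15))
```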
Final evaluation

[Figure: final evaluation workflow (credit: scikit-learn.org)]
Final evaluation: nested cross-validation

[Figure: nested cross-validation (credit: Bischl et al., “mlr: Machine Learning in R”)]
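A minimal sketch of nested cross-validation, assuming scikit-learn's GridSearchCV for the inner (tuning) loop and cross_val_score for the outer (evaluation) loop; the classifier and parameter grid are placeholders:

```python
# Minimal sketch: the inner loop tunes a hyperparameter, the outer loop
# estimates the performance of the whole tuning procedure on data that
# was never used during tuning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

tuned_model = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None]},
    cv=inner_cv)

# Each outer fold refits GridSearchCV on its training part only.
outer_scores = cross_val_score(tuned_model, X, y, cv=outer_cv)
print("Nested CV accuracy estimate:", round(outer_scores.mean(), 3))
```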



CONFUSION MATRICES
So far we have considered how often a classifier gets the right
answer.
However, we are sometimes also interested in what kind of
mistakes it makes and how often it makes them.
Consider a diagnostic system that simply predicts whether or
not a patient has a particular disease:

              Predicted
              Yes    No
Actual  Yes    27     3
        No     17    53

This classifier is correct 80% of the time. However, the table also reveals that 17% of the predictions are false alarms (false positives) but only 3% are misses (false negatives).

              Predicted
              Yes                 No
Actual  Yes   True Positives      False Negatives
        No    False Positives     True Negatives



This type of table can be generalised to cover situations
where more than two classes are predicted.
It is then known as a confusion matrix:

                Predicted
                Red   Green   Blue
        Red      37       1      2
Actual  Green     3      16     11
        Blue      1      12     17

Overall, this classifier is right 70% of the time.


However, although the classifier is good at identifying red
items, it is much less good at distinguishing blue and green
items.
Of the 30 incorrect predictions, 23 arise from confusing
blue and green.
When the system is presented with a red item, it is right
92.5% of the time.
But when the system is presented with a blue or green
item, it is only right 55% of the time.
Thus the overall accuracy does not tell the whole story.
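As an illustration (not part of the original handout), scikit-learn's confusion_matrix produces this kind of table directly; the labels and predictions below are made up and do not reproduce the table above:

```python
# Minimal sketch: confusion matrix for a three-class problem.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["red", "red", "green", "blue", "green", "blue", "red", "green"]
y_pred = ["red", "red", "blue",  "blue", "green", "green", "red", "green"]

labels = ["red", "green", "blue"]
print(confusion_matrix(y_true, y_pred, labels=labels))  # rows = actual classes
print("Overall accuracy:", accuracy_score(y_true, y_pred))
```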

The Kappa Statistic

The classifier on the last slide predicted Red 41 times, Green 29 times and Blue 30 times:

                Predicted
                Red   Green   Blue   Total
        Red      37       1      2      40
Actual  Green     3      16     11      30
        Blue      1      12     17      30
        Total    41      29     30     100

Suppose those predictions had been random guesses. How often would the classifier have been right?

0.4 × 41 + 0.3 × 29 + 0.3 × 30 = 34.1

(0.4, 0.3 and 0.3 are the proportions of Red, Green and Blue examples in the actual data.)

So the actual success rate of 70 represents an improvement of about 35.9% on random guessing.

This is the basis of the kappa statistic.


The kappa statistic expresses this improvement as a proportion
of that to be expected from a perfect predictor.

Thus the improvement for a perfect predictor would be 100 − 34.1 = 65.9.

The improvement achieved by the classifier was 70 − 34.1 = 35.9.

Hence the kappa statistic is 35.9 / 65.9 = 0.54.

A kappa statistic of 1 implies a perfect predictor.


A kappa statistic of 0 implies the classifier provides no
information – it behaves as if it were guessing randomly.

Weka provides a confusion matrix and the kappa statistic in the results produced for all of its classifiers.
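A minimal sketch of the kappa calculation for the confusion matrix above, done by hand and cross-checked with scikit-learn's cohen_kappa_score (an assumed equivalent of the value Weka reports):

```python
# Minimal sketch: kappa from the slide's confusion matrix
# (rows = actual classes, columns = predicted classes).
import numpy as np
from sklearn.metrics import cohen_kappa_score

cm = np.array([[37,  1,  2],
               [ 3, 16, 11],
               [ 1, 12, 17]])

total = cm.sum()
observed = np.trace(cm) / total                                # 70 / 100
expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2  # 34.1 / 100
kappa = (observed - expected) / (1 - expected)
print(f"kappa (by hand) = {kappa:.2f}")                        # about 0.54

# The same value from per-example labels built to match the matrix.
y_true = np.repeat([0, 1, 2], cm.sum(axis=1))
y_pred = np.concatenate([np.repeat([0, 1, 2], row) for row in cm])
print("kappa (scikit-learn) =", round(cohen_kappa_score(y_true, y_pred), 2))
```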

Recall and Precision

Recall and precision are measures originally developed in the field of information retrieval.
Consider a document retrieval system that is asked to search
a set of documents for those relevant to a particular topic.
Suppose it returns a subset of documents, of which some are
in fact relevant but the remainder are irrelevant.
We need some measures of how good the system is.

The system is essentially classifying the entire set of documents into two classes: relevant and irrelevant.
The relevant documents returned are examples of
True Positives (see earlier).
The irrelevant documents that were returned are
examples of False Positives.
However we also need to consider the documents that were
not returned.
There will be relevant documents that were not
returned: these are False Negatives.
Finally, there will be irrelevant documents that were
not returned: the True Negatives.
Information retrieval researchers have found two measures to
be particularly useful in assessing the quality of an information
retrieval system.
Recall
This is the most obvious measure: the proportion of relevant documents that were returned. Recall is defined as

Recall = TP / (TP + FN)
where
TP is the number of relevant documents returned.
FP is the number of irrelevant documents returned.
TN is the number of irrelevant documents not returned.
FN is the number of relevant documents not returned.
Clearly, TP+FN is the total number of relevant documents.

Although a high value for Recall is very desirable, it is easily achieved by a very poor system: one that returns all the documents.


Precision
For this reason we need a second measure: the proportion of the returned documents that were relevant. Precision is defined as

Precision = TP / (TP + FP)

Clearly, TP+FP is the number of documents returned.



Trade Off Between Recall and Precision

It is easy to build a system with 100% Recall: simply return everything.
Such a system would have a very low Precision because of the
large number of irrelevant documents returned.

It is almost as easy to build a system with 100% Precision.


Only return 1 document for which the evidence of
relevance is extremely strong.
Such a system would have a very low Recall because of the
large number of relevant documents not returned.

So a practical information retrieval system is going to require striking a balance between Recall and Precision.
It is almost always possible to improve one at the expense
of the other.
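A minimal sketch of computing Recall and Precision, both from raw counts and with scikit-learn (an assumed library here); the counts are illustrative:

```python
# Minimal sketch: recall and precision for a retrieval-style classifier.
from sklearn.metrics import precision_score, recall_score

# Illustrative counts: 30 relevant documents, 20 returned, 15 of them relevant.
TP, FP, FN, TN = 15, 5, 15, 60

print("Recall    =", TP / (TP + FN))  # fraction of relevant docs returned
print("Precision =", TP / (TP + FP))  # fraction of returned docs relevant

# The same from per-document labels (1 = relevant; "returned" = predicted 1).
y_true = [1] * TP + [0] * FP + [1] * FN + [0] * TN
y_pred = [1] * TP + [1] * FP + [0] * FN + [0] * TN
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred))
```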

Other definitions

In other areas of machine learning, the following definitions are also commonly used:

Sensitivity: TP / (TP + FN)   (same as recall)
Specificity: TN / (TN + FP)
Positive predictive value: TP / (TP + FP)   (same as precision)



Evaluating learning methods for regression


NUMERICAL PREDICTIONS
So far we have been concerned with measuring the performance of classifiers, i.e. systems that predict nominal variables.
What about systems that make numeric predictions?
The key difference is that we are no longer concerned with
whether predictions are right or wrong – the issue is how large
the errors tend to be.
Hence, instead of accuracy (percentage correct) we typically
use one of the following:

Mean Square Error
Root Mean Square Error
Mean Absolute Error
Correlation Coefficient
Coefficient of Determination – R²
(See notes on Linear Regression)
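A minimal sketch of these measures, assuming scikit-learn's metrics module and numpy; the true and predicted values are illustrative:

```python
# Minimal sketch: common error measures for numeric predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.2])
y_pred = np.array([2.8, 5.4, 2.9, 6.1, 4.0])

mse = mean_squared_error(y_true, y_pred)
print("Mean square error:      ", mse)
print("Root mean square error: ", np.sqrt(mse))
print("Mean absolute error:    ", mean_absolute_error(y_true, y_pred))
print("Correlation coefficient:", np.corrcoef(y_true, y_pred)[0, 1])
print("R^2:                    ", r2_score(y_true, y_pred))
```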
References


Required course material reading:


Alpaydin 2010/2014: sections 19.1, 19.6, 19.6.1, 19.6.2, 19.7 (ROC, AUC)
Scott’s notes on Evaluating Learning Methods, pp. 1–16
