CE802 Lec Eval Handouts
Luca Citi
[email protected]
CE802
Evaluating learning methods
Outline
MEASURING PERFORMANCE
Assuming we are evaluating a system that learns to predict
classifications, what would be a good measure of its
performance?
Accuracy
The percentage of previously unseen examples that the system
classifies correctly.
Not quite as simple as it seems.
In some situations, some errors matter more than others.
e.g. A system to diagnose a serious disease: What is
the relative cost of a false alarm and a missed
diagnosis?
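One way to make this concrete (a sketch only; the cost values below are invented for illustration) is to compare classifiers by the total cost of their errors rather than by the raw number of errors:

# Sketch: comparing two diagnostic classifiers by the total cost of their
# errors rather than by how many errors they make. The costs are invented.
COST_FALSE_ALARM = 1        # healthy patient incorrectly flagged as ill
COST_MISSED_DIAGNOSIS = 50  # ill patient incorrectly declared healthy

def total_cost(false_positives, false_negatives):
    """Total cost of a classifier's errors on a test set."""
    return (false_positives * COST_FALSE_ALARM
            + false_negatives * COST_MISSED_DIAGNOSIS)

# Classifier A makes fewer errors overall but misses more diagnoses;
# classifier B raises more false alarms but rarely misses a diagnosis.
print("Cost of A:", total_cost(false_positives=17, false_negatives=3))   # 167
print("Cost of B:", total_cost(false_positives=40, false_negatives=1))   # 90

Under these made-up costs, the "less accurate" classifier B is the preferable one.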
Correct procedure:
Randomly partition the data set into two subsets:
A training set used to train the system.
A test set used to evaluate the system's performance
after learning is completed.
So let us assume we did this and the test phase achieved
95% correct classifications.
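A minimal sketch of this hold-out procedure, assuming scikit-learn and a synthetic dataset (neither is prescribed by the handout):

# Sketch of the hold-out procedure using scikit-learn (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=0)

# Randomly partition the data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train on the training set only.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Evaluate on the held-out test set after learning is completed.
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))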
HOW MANY EXAMPLES FOR TRAINING?
As many as possible.
The minimum requirement will depend on:
The complexity of the relationship that is being
learned.
This will depend on both the number of attributes and
the nature of the relationships between them.
More attributes will require more training data.
More complex relationships between attributes will
require more training data.
The learning procedure
Procedures that build more complex models
require more training data.
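The sketch below illustrates the general point: with more training examples the test accuracy of a given learner typically improves (the synthetic data and decision tree are used purely for illustration).

# Sketch: how test accuracy typically grows with the amount of training
# data (illustrative; uses a synthetic dataset and a decision tree).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

# Train on progressively larger subsets and evaluate on a fixed test set.
for n in (50, 200, 1000, 2500):
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{n:5d} training examples -> test accuracy {acc:.3f}")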
HOW MANY EXAMPLES FOR TESTING?
It depends on how accurately you want to estimate the
performance.
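One rough way to quantify this (an aside, not stated in the handout): treating each test prediction as a Bernoulli trial, the standard error of an estimated accuracy p from n test examples is sqrt(p(1-p)/n).

# Sketch: how the uncertainty of an accuracy estimate shrinks as the
# test set grows, using the binomial standard error sqrt(p*(1-p)/n).
from math import sqrt

p = 0.95  # estimated accuracy (e.g. 95% correct on the test set)
for n in (100, 1000, 10000):
    se = sqrt(p * (1 - p) / n)
    print(f"n = {n:5d}: accuracy {p:.2f} +/- {1.96 * se:.3f} (approx. 95% interval)")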
STRATIFICATION
Suppose we are training a system to predict which of two
classes, C1 and C2, examples belong to.
Suppose also that the amount of data available is limited to a
sample (which we assume is drawn randomly from the
original population).
We divide the sample into a training set and a test set.
Suppose it turns out that most of the examples in the
training set belong to C1 and most of those in the test set to
C2.
Clearly this is not a good basis for either training or testing.
Stratification avoids this by ensuring that each subset has
(approximately) the same class proportions as the full sample.
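A sketch of how a stratified split can be obtained with scikit-learn (the stratify argument keeps the class proportions the same in both subsets; the imbalanced synthetic data is illustrative only):

# Sketch: stratified hold-out split that preserves class proportions
# in both the training and test sets (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# An imbalanced two-class sample: roughly 80% C1, 20% C2.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Both subsets keep (approximately) the same class proportions as the sample.
print("Full sample: ", np.bincount(y) / len(y))
print("Training set:", np.bincount(y_train) / len(y_train))
print("Test set:    ", np.bincount(y_test) / len(y_test))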
CROSS-VALIDATION
One way to further reduce the likelihood of unrepresentative
training and test data sets is to repeat the process.
Suppose we divide the sample randomly into ten disjoint
subsets called folds.
Each fold can be used in turn as the test set, with the
remaining 9/10 of the data used as the training set.
Thus we obtain ten runs, each with a different test set,
from a single data set.
The average accuracy across all 10 runs is our accuracy
estimate.
[Figure: Cross-validation. Credit: Shan-Hung Wu]
This technique is known as cross-validation and is widely
used in machine learning.
Normally stratification is also imposed when partitioning the
sample.
This is known as stratified cross-validation.
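A sketch of stratified 10-fold cross-validation with scikit-learn (the dataset and classifier are placeholders):

# Sketch: stratified 10-fold cross-validation (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# One accuracy per fold; the mean is the cross-validated estimate.
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:  ", scores.mean().round(3))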
Why 10?
Although there is no rigorous theory to back this up, wide
experience suggests that 10 folds is about the right number
to obtain a good accuracy estimate.
The estimate can be reported together with a confidence
interval, that is, a range within which the true value will lie
with some specified probability.
[Figure: Final evaluation using a held-out test set. Credit: scikit-learn.org]
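A sketch of the workflow in the figure: hold out a final test set, use cross-validation on the remaining data to choose between candidate models, then evaluate the chosen model once on the test set (the candidate settings here are arbitrary):

# Sketch: cross-validation for model selection plus a final, untouched
# test set for the last evaluation (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Model selection: compare candidate settings by cross-validation only.
candidates = {depth: DecisionTreeClassifier(max_depth=depth, random_state=0)
              for depth in (2, 5, 10)}
cv_means = {depth: cross_val_score(clf, X_train, y_train, cv=10).mean()
            for depth, clf in candidates.items()}
best_depth = max(cv_means, key=cv_means.get)

# Final evaluation: fit the chosen model on all training data and
# report its accuracy on the held-out test set.
best = candidates[best_depth].fit(X_train, y_train)
print("Chosen max_depth:", best_depth)
print("Test accuracy:   ", accuracy_score(y_test, best.predict(X_test)))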
CONFUSION MATRICES
So far we have considered how often a classifier gets the right
answer.
However, we are sometimes also interested in what kind of
mistakes it makes and how often it makes them.
Consider a diagnostic system that simply predicts whether or
not a patient has a particular disease:
                  Predicted
                  Yes    No
Actual   Yes       27     3
         No        17    53

In general, each cell of such a matrix has a name:

                  Predicted
                  Yes                 No
Actual   Yes      True Positives      False Negatives
         No       False Positives     True Negatives
The same idea extends to more than two classes:

                  Predicted
                  Red   Green   Blue
Actual   Red       37       1      2
         Green      3      16     11
         Blue       1      12     17

Adding row and column totals makes the matrix easier to read:

                  Predicted
                  Red   Green   Blue   Total
Actual   Red       37       1      2      40
         Green      3      16     11      30
         Blue       1      12     17      30
         Total     41      29     30     100
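A sketch of how such a matrix can be computed with scikit-learn (the toy labels are invented for illustration):

# Sketch: computing a confusion matrix from actual and predicted labels.
from sklearn.metrics import confusion_matrix

actual    = ["Red", "Red", "Green", "Blue", "Green", "Blue", "Red"]
predicted = ["Red", "Green", "Green", "Blue", "Blue", "Blue", "Red"]

# Rows are the actual classes, columns the predicted classes.
labels = ["Red", "Green", "Blue"]
print(confusion_matrix(actual, predicted, labels=labels))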
Recall
Now consider an information-retrieval setting: documents are
either relevant or irrelevant to a query, and the system either
returns them or not.
Recall is the proportion of the relevant documents that were
returned:

    Recall = TP / (TP + FN)
where
TP is the number of relevant documents returned.
FP is the number of irrelevant documents returned.
TN is the number of irrelevant documents not returned.
FN is the number of relevant documents not returned.
Clearly, TP + FN is the total number of relevant documents.
Note that a system could achieve perfect recall simply by
returning every document, relevant or not.
Precision
For this reason we need a second measure: the proportion of
the returned documents that were relevant. Precision is defined as:

    Precision = TP / (TP + FP)
Other definitions

    Sensitivity = TP / (TP + FN)    (same as recall)

    Specificity = TN / (TN + FP)

    Positive predictive value = TP / (TP + FP)    (same as precision)
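As a worked sketch, plugging in the counts from the diagnostic confusion matrix shown earlier (TP = 27, FN = 3, FP = 17, TN = 53):

# Worked sketch using the diagnostic confusion matrix shown earlier.
TP, FN, FP, TN = 27, 3, 17, 53

accuracy    = (TP + TN) / (TP + TN + FP + FN)    # 80/100 = 0.80
recall      = TP / (TP + FN)                     # sensitivity: 27/30 = 0.90
specificity = TN / (TN + FP)                     # 53/70 ~= 0.757
precision   = TP / (TP + FP)                     # positive predictive value: 27/44 ~= 0.614

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f}")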
Outline
NUMERICAL PREDICTIONS
So far we have been concerned with measuring the
performance of classifiers, i.e. systems that predict nominal
variables.
What about systems that make numeric predictions?
The key difference is that we are no longer concerned with
whether predictions are right or wrong – the issue is how large
the errors tend to be.
Hence, instead of accuracy (percentage correct) we typically
use a measure of the size of the errors, such as those sketched below.
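Common choices (an illustration, not necessarily the measures listed on the original slide) include the mean absolute error and the (root) mean squared error:

# Sketch: common error measures for numeric predictions
# (the toy values are invented for illustration).
import numpy as np

actual    = np.array([3.0, -0.5, 2.0, 7.0])
predicted = np.array([2.5,  0.0, 2.0, 8.0])

errors = predicted - actual
mae  = np.mean(np.abs(errors))   # mean absolute error
mse  = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)              # root mean squared error

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")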