
Evaluating learning methods

Luca Citi
[email protected]

School of Computer Science and Electronic Engineering


University of Essex (UK)

CE802

Outline

Evaluating learning methods


Measuring performance
Classification accuracy
Cross validation
Test set
Confusion matrices
Kappa statistics
Recall and precision

Evaluating learning methods for regression


MEASURING PERFORMANCE
Assuming we are evaluating a system that learns to predict
classifications, what would be a good measure of its
performance?

Accuracy
The percentage of unknown examples that the system
classifies correctly.
Not quite as simple as it seems.
In some situations, some errors matter more than others.
e.g. A system to diagnose a serious disease: What is
the relative cost of a false alarm and a missed
diagnosis?


Other Performance Characteristics


Accuracy is an obviously important characteristic, but there
are others including:
Space and time complexity
The computing resources required by the learning
procedure.
Size of training set
How many training examples are needed to reach the best
achievable classification accuracy? How well will the
system perform with a limited number of training
examples?
Simplicity of the final model
Important if a human user wants to discover something
about the data.


MEASURING CLASSIFICATION ACCURACY


Suppose we have a learning program and a data set.
We use the data set to train the system.
After training the system can classify 98% of the data set
correctly.
Is this good?
It is impossible to say for several reasons.


Training and Testing Must Use Different Data Sets


The system was trained and evaluated using the same data.
This does not give a good indication of how well the system
will perform given a new example.
Consider an instance based system that saves all training
examples.
Such a system would score 100% if tested using the data
used for training!
Similar considerations apply to any learning procedure.

Correct procedure:
Randomly partition the data set into two subsets:
A training set used to train the system.
A test set used to evaluate the system's performance
after learning is completed.
So let us assume we did this and the test phase achieved
95% correct classifications.
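As a concrete illustration (not part of the original handout, which uses Weka), a minimal scikit-learn sketch of this hold-out procedure might look as follows; the dataset and classifier are placeholders:

```python
# Minimal sketch of a random train/test split (assumes scikit-learn;
# the dataset and classifier are illustrative placeholders).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Randomly partition the data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Accuracy on the training data is optimistic; the held-out test set
# gives a fairer estimate of performance on new examples.
print("Training accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("Test accuracy:    ", accuracy_score(y_test, clf.predict(X_test)))
```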

We Do Not Know How Difficult The Learning Task Was


Suppose that the task was a binary classification and that
94% of examples belonged to Class 1.
Then a system that predicted Class 1 for every unknown
example would be right 94% of the time.
In such a case, achieving 95% accuracy looks
unimpressive.
If, on the other hand, both classes were equally likely, 95%
would look very good.
So we really need to know the frequency of the modal class
(i.e. the most frequently occurring class) to provide a baseline
for judging the performance.
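As a hedged sketch of this baseline idea (not in the original handout), scikit-learn's DummyClassifier can report the accuracy of always predicting the modal class, against which a learned model can be compared; the data and model below are illustrative:

```python
# Minimal sketch: compare a learned model against the modal-class baseline.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Always predicts the most frequent class seen in the training data.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("Modal-class baseline accuracy:", baseline.score(X_test, y_test))
print("Learned model accuracy:       ", model.score(X_test, y_test))
```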


Now suppose the task is predicting the result of a coin toss on


the basis of the date, time, weather and name of person
tossing the coin.
Does the fact that a learning system only achieves 50%
accuracy mean the learning system is no good?
Of course not. The task is such that it is impossible to
predict the outcome from the attributes available.
So the inherent difficulty of a learning task must also be
considered.
Generally the inherent difficulty is unknown. Hence the only
available basis is comparison with other learning procedures.


HOW MANY EXAMPLES DO WE NEED?

FOR TRAINING?
As many as possible.
The minimum requirement will depend on:
The complexity of the relationship that is being
learned.
This will depend on both the number of attributes and
the nature of the relationships between them.
More attributes will require more training data
More complex relationships between attributes will
require more training data.
The learning procedure
Procedures that build more complex models
require more training data.

FOR TESTING?
It depends on how accurately you want to estimate the performance.

STRATIFICATION
Suppose we are training a system to predict which of two
classes, C1 and C2, examples belong to.
Suppose also that the amount of data available is limited to a
sample (which we assume is drawn randomly from the
original population).
We divide the sample into a training set and a test set.
Suppose it turns out that most of the examples in the
training set belong to C1 and most of those in the test set to
C2.
Clearly this is not a good basis for either training or testing.

What we should have done was ensure that the proportion of each class in the training set was the same as the proportion in the original sample.
This is called stratification.

Stratification is well worth doing but it is still possible that the training and test sets were not truly representative of the sample (and hence of the original population).
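A minimal sketch of a stratified split, assuming scikit-learn's train_test_split with its stratify argument (the labels below are synthetic and purely illustrative):

```python
# Minimal sketch: stratified hold-out split. Passing the labels to
# `stratify` keeps the class proportions in the training and test sets
# the same as in the full sample.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.choice(["C1", "C2"], size=1000, p=[0.8, 0.2])  # imbalanced labels
X = rng.normal(size=(1000, 5))                         # illustrative features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

for name, labels in [("train", y_tr), ("test", y_te)]:
    print(name, "proportion of C1:", np.mean(labels == "C1"))
```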

CROSS-VALIDATION
One way to further reduce the likelihood of unrepresentative
training and test data sets is to repeat the process.
Suppose we divided the sample randomly into ten disjoint
subsets called folds.
Each fold could be used as the test set and the remaining
9/10 of the data set used as a training set.
Thus we can obtain ten runs, each with a different test set,
from one data set.
The average across all 10 runs would be our accuracy estimate.

This technique is known as cross-validation and is widely used in machine learning.
Normally stratification is also imposed when partitioning the
sample.
This is known as stratified cross-validation.
[Figure: cross-validation (credit: Shan-Hung Wu)]

Weka provides a facility for automatically performing n-fold stratified cross-validation; 10-fold cross-validation is the default experimental procedure.

Why 10?
Although there is no rigorous theory to back this up, wide
experience suggests 10 is about the right number to get
the best accuracy estimate.
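A minimal sketch of stratified 10-fold cross-validation, assuming scikit-learn rather than Weka; the dataset and classifier are placeholders:

```python
# Minimal sketch: stratified 10-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy per fold; the mean is the cross-validation estimate.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:  ", round(scores.mean(), 3))
```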


EVEN BETTER ESTIMATES


A single 10-fold cross-validation will provide an accuracy estimate based on a single 10-way partitioning of the sample. The results that you would get with a different 10-way partitioning might be different. So, you could run the 10-fold cross-validation procedure a number of times and average the results.

Note that, if you do this in Weka, you must change the random number seed manually for each cross-validation.

So how many times should you run the cross-validation procedure?
It depends on how precisely you want to know the accuracy of the learning system.
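A minimal sketch of repeating the cross-validation with different partitionings and averaging the results, assuming scikit-learn's RepeatedStratifiedKFold (a rough equivalent of changing the random seed manually in Weka):

```python
# Minimal sketch: 10 repetitions of stratified 10-fold cross-validation,
# each with a different random partitioning of the sample.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("Mean accuracy over 10 x 10-fold CV:", round(scores.mean(), 3))
print("Std of the 100 fold accuracies:    ", round(scores.std(), 3))
```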
The Impact of Sample Size on Confidence Limits

If we assume that the results are normally distributed, we can make use of the fact that the standard deviation of their mean will decrease with the number of results we obtain:

σ(x̄) = σ(x) / √N

This is known as the standard error.

Using this relationship we can establish a confidence interval.

That is, a range within which the true value will lie with
some specified probability.

The width of the confidence interval decreases with the square root of sample size.
So we need a large number of samples if we want to
estimate a parameter accurately.
Note that the required sample size is not dependent on the
population size.
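A minimal sketch of the standard-error and confidence-interval calculation, assuming a set of accuracy estimates from repeated runs (the numbers below are made up for illustration):

```python
# Minimal sketch: standard error and approximate 95% confidence interval
# for the mean of repeated accuracy estimates (assumes the estimates are
# roughly normally distributed, as in the handout).
import numpy as np

accuracies = np.array([0.93, 0.95, 0.94, 0.96, 0.92,
                       0.95, 0.94, 0.93, 0.96, 0.94])  # illustrative results

mean = accuracies.mean()
std_error = accuracies.std(ddof=1) / np.sqrt(len(accuracies))

# About 1.96 standard errors either side of the mean covers ~95%.
print(f"mean = {mean:.3f}, standard error = {std_error:.4f}")
print(f"95% CI: [{mean - 1.96 * std_error:.3f}, {mean + 1.96 * std_error:.3f}]")
```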

Model selection and test set


When testing more than one classifier (or tuning parameters), keep a
separate test set (or perform an additional outer CV loop)

Example: Fortune teller to predict the sex of babies


I ask 10 fortune tellers (using different techniques: crystal ball,
dreams, astrology, ...) to predict the sex of 20 babies
I make sure there is no overfitting (CV, or no training set at all!) and obtain:
{6/20, 8/20, 13/20, 10/20, 15/20, 12/20, 9/20, 11/20, 13/20, 7/20}
I write a paper claiming that (for example) “crystal ball fortune tellers
outperform other techniques and achieve 75% accuracy (p < 0.02) in
predicting sex of babies”
This is clearly wrong. I should now repeat the test using only the best fortune teller to confirm the result on new data; I will very likely obtain a much smaller percentage.
Why? P(B(20, 0.5) ≥ 15) ≈ 0.015, but given k_i ∼ B(20, 0.5) for i = 1..10, then P(max{k_i} ≥ 15) ≫ 0.015.
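A minimal simulation of this argument (not in the original handout); it assumes only numpy and simply repeats the experiment many times:

```python
# Minimal sketch: each "fortune teller" is a random guesser on 20 babies.
# Looking only at the best of 10 tellers makes a score of 15/20 or better
# far more likely than the single-teller probability suggests.
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

single = rng.binomial(20, 0.5, size=n_trials)
best_of_10 = rng.binomial(20, 0.5, size=(n_trials, 10)).max(axis=1)

print("P(one teller scores >= 15):", np.mean(single >= 15))
print("P(best of 10 scores >= 15):", np.mean(best_of_10 >= 15))
```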
Final evaluation

[Figure: final evaluation workflow (credit: scikit-learn.org)]
Final evaluation: nested cross-validation

[Figure: nested cross-validation (credit: Bischl et al., “mlr: Machine Learning in R”)]
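A minimal sketch of nested cross-validation, assuming scikit-learn's GridSearchCV for the inner (tuning) loop and cross_val_score for the outer (evaluation) loop; the classifier and parameter grid are placeholders:

```python
# Minimal sketch: the inner loop tunes a hyperparameter, the outer loop
# estimates the performance of the whole tuning procedure on data that
# was never used during tuning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

tuned_model = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None]},
    cv=inner_cv)

# Each outer fold refits GridSearchCV on its training part only.
outer_scores = cross_val_score(tuned_model, X, y, cv=outer_cv)
print("Nested CV accuracy estimate:", round(outer_scores.mean(), 3))
```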



CONFUSION MATRICES
So far we have considered how often a classifier gets the right
answer.
However, we are sometimes also interested in what kind of
mistakes it makes and how often it makes them.
Consider a diagnostic system that simply predicts whether or
not a patient has a particular disease:

              Predicted
              Yes    No
Actual  Yes    27     3
        No     17    53

This classifier is correct 80% of the time. However, the table also reveals that 17% of the predictions are false alarms (false positives) but only 3% are misses (false negatives).

              Predicted
              Yes                 No
Actual  Yes   True Positives      False Negatives
        No    False Positives     True Negatives



This type of table can be generalised to cover situations
where more than two classes are predicted.
It is then known as a confusion matrix:

                Predicted
                Red   Green   Blue
        Red      37       1      2
Actual  Green     3      16     11
        Blue      1      12     17

Overall, this classifier is right 70% of the time.


However, although the classifier is good at identifying red
items, it is much less good at distinguishing blue and green
items.
Of the 30 incorrect predictions, 23 arise from confusing
blue and green.
When the system is presented with a red item, it is right
92.5% of the time.
But when the system is presented with a blue or green
item, it is only right 55% of the time.
Thus the overall accuracy does not tell the whole story.
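As an illustration (not part of the original handout), scikit-learn's confusion_matrix produces this kind of table directly; the labels and predictions below are made up and do not reproduce the table above:

```python
# Minimal sketch: confusion matrix for a three-class problem.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["red", "red", "green", "blue", "green", "blue", "red", "green"]
y_pred = ["red", "red", "blue",  "blue", "green", "green", "red", "green"]

labels = ["red", "green", "blue"]
print(confusion_matrix(y_true, y_pred, labels=labels))  # rows = actual classes
print("Overall accuracy:", accuracy_score(y_true, y_pred))
```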

The Kappa Statistic

The classifier on the last slide predicted Red 41 times, Green 29 times and Blue 30 times:

                Predicted
                Red   Green   Blue   Total
        Red      37       1      2      40
Actual  Green     3      16     11      30
        Blue      1      12     17      30
        Total    41      29     30     100

Suppose those predictions had been random guesses. How often would the classifier have been right?

0.4 × 41 + 0.3 × 29 + 0.3 × 30 = 34.1

(0.4, 0.3 and 0.3 are the proportions of Red, Green and Blue examples in the actual data.)

So the actual success rate of 70 represents an improvement of about 35.9% on random guessing.

This is the basis of the kappa statistic.


The kappa statistic expresses this improvement as a proportion
of that to be expected from a perfect predictor.

Thus the improvement for a perfect predictor would be 100 − 34.1 = 65.9.

The improvement achieved by the classifier was 70 − 34.1 = 35.9.

Hence the kappa statistic is 35.9 / 65.9 = 0.54.

A kappa statistic of 1 implies a perfect predictor.


A kappa statistic of 0 implies the classifier provides no
information – it behaves as if it were guessing randomly.

Weka provides a confusion matrix and the kappa statistic in the results produced for all of its classifiers.
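A minimal sketch of the kappa calculation for the confusion matrix above, done by hand and cross-checked with scikit-learn's cohen_kappa_score (an assumed equivalent of the value Weka reports):

```python
# Minimal sketch: kappa from the slide's confusion matrix
# (rows = actual classes, columns = predicted classes).
import numpy as np
from sklearn.metrics import cohen_kappa_score

cm = np.array([[37,  1,  2],
               [ 3, 16, 11],
               [ 1, 12, 17]])

total = cm.sum()
observed = np.trace(cm) / total                                # 70 / 100
expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2  # 34.1 / 100
kappa = (observed - expected) / (1 - expected)
print(f"kappa (by hand) = {kappa:.2f}")                        # about 0.54

# The same value from per-example labels built to match the matrix.
y_true = np.repeat([0, 1, 2], cm.sum(axis=1))
y_pred = np.concatenate([np.repeat([0, 1, 2], row) for row in cm])
print("kappa (scikit-learn) =", round(cohen_kappa_score(y_true, y_pred), 2))
```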

Recall and Precision

Recall and precision are measures originally developed in the field of information retrieval.
Consider a document retrieval system that is asked to search
a set of documents for those relevant to a particular topic.
Suppose it returns a subset of documents, of which some are
in fact relevant but the remainder are irrelevant.
We need some measures of how good the system is.

The system is essentially classifying the entire set of documents into two classes: relevant and irrelevant.
The relevant documents returned are examples of
True Positives (see earlier).
The irrelevant documents that were returned are
examples of False Positives.
However we also need to consider the documents that were
not returned.
There will be relevant documents that were not
returned: these are False Negatives.
Finally, there will be irrelevant documents that were
not returned: the True Negatives.
Information retrieval researchers have found two measures to
be particularly useful in assessing the quality of an information
retrieval system.
Recall
This is the most obvious measure: the proportion of relevant documents that were returned. Recall is defined as

Recall = TP / (TP + FN)
where
TP is the number of relevant documents returned.
FP is the number of irrelevant documents returned.
TN is the number of irrelevant documents not returned.
FN is the number of relevant documents not returned.
Clearly, TP+FN is the total number of relevant documents.

Although a high value for Recall is very desirable, it is easily achieved by a very poor system: one that returns all the documents.


Precision
For this reason we need a second measure: the proportion of the returned documents that were relevant. Precision is defined as

Precision = TP / (TP + FP)

Clearly, TP+FP is the number of documents returned.



Trade Off Between Recall and Precision

It is easy to build a system with 100% Recall: simply return everything.
Such a system would have a very low Precision because of the
large number of irrelevant documents returned.

It is almost as easy to build a system with 100% Precision.


Only return 1 document for which the evidence of
relevance is extremely strong.
Such a system would have a very low Recall because of the
large number of relevant documents not returned.

So a practical information retrieval system is going to require striking a balance between Recall and Precision.
It is almost always possible to improve one at the expense
of the other.
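A minimal sketch of computing Recall and Precision, both from raw counts and with scikit-learn (an assumed library here); the counts are illustrative:

```python
# Minimal sketch: recall and precision for a retrieval-style classifier.
from sklearn.metrics import precision_score, recall_score

# Illustrative counts: 30 relevant documents, 20 returned, 15 of them relevant.
TP, FP, FN, TN = 15, 5, 15, 60

print("Recall    =", TP / (TP + FN))  # fraction of relevant docs returned
print("Precision =", TP / (TP + FP))  # fraction of returned docs relevant

# The same from per-document labels (1 = relevant; "returned" = predicted 1).
y_true = [1] * TP + [0] * FP + [1] * FN + [0] * TN
y_pred = [1] * TP + [1] * FP + [0] * FN + [0] * TN
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred))
```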

Other definitions

In other areas of machine learning, the following definitions are also commonly used:

Sensitivity: TP / (TP + FN)   (same as recall)
Specificity: TN / (TN + FP)
Positive predictive value: TP / (TP + FP)   (same as precision)



Evaluating learning methods for regression


NUMERICAL PREDICTIONS
So far we have been concerned with measuring the performance of classifiers, i.e. systems that predict nominal variables.
What about systems that make numeric predictions?
The key difference is that we are no longer concerned with
whether predictions are right or wrong – the issue is how large
the errors tend to be.
Hence, instead of accuracy (percentage correct) we typically
use one of the following:

Mean Square Error
Root Mean Square Error
Mean Absolute Error
Correlation Coefficient
Coefficient of Determination – R²
(See notes on Linear Regression)
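A minimal sketch of these measures, assuming scikit-learn's metrics module and numpy; the true and predicted values are illustrative:

```python
# Minimal sketch: common error measures for numeric predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.2])
y_pred = np.array([2.8, 5.4, 2.9, 6.1, 4.0])

mse = mean_squared_error(y_true, y_pred)
print("Mean square error:      ", mse)
print("Root mean square error: ", np.sqrt(mse))
print("Mean absolute error:    ", mean_absolute_error(y_true, y_pred))
print("Correlation coefficient:", np.corrcoef(y_true, y_pred)[0, 1])
print("R^2:                    ", r2_score(y_true, y_pred))
```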
References


Required course material reading:


Alpaydin 2010/2014: sections 19.1, 19.6, 19.6.1, 19.6.2, 19.7 (ROC, AUC)
Scott’s notes on Evaluating Learning Methods, pp. 1–16
