CH-5 ML

Chapter Five discusses model evaluation in machine learning, emphasizing the importance of defining performance measures and the need for generalization error approximation. It covers various performance metrics for binary and multiclass classification, including accuracy, precision, recall, and F-measure, as well as techniques like k-fold cross-validation for robust evaluation. The chapter also highlights the significance of using test data for assessing model performance and the role of hypothesis testing in comparing learning algorithms.


Chapter Five

Model Evaluation
Basic Concepts

• Evaluation requires defining the performance measures to be
  optimized
• The performance of a learning algorithm cannot be evaluated on
  the entire domain (generalization error), so an approximation is
  needed
• Performance evaluation is needed for:
  • Tuning hyperparameters of the learning method (e.g. type of
    kernel and its parameters, learning rate of the perceptron)
  • Evaluating the quality of the learned predictor
  • Computing the statistical significance of the difference
    between learning algorithms
Stages of (Batch) Machine Learning

Given: labeled training data X, Y

• Assumes each training instance is drawn independently from the
  same underlying distribution

Train the model:

  model ← classifier.train(X, Y)

Apply the model to new data:

• Given: a new unlabeled instance, the learned model predicts its
  label:

  prediction ← model.predict(x)
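A minimal sketch of this train/apply workflow, assuming scikit-learn; the estimator and the synthetic data are illustrative, not part of the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Labeled data (X: features, Y: labels); synthetic here for illustration
X, Y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_new, Y_train, Y_new = train_test_split(X, Y, test_size=0.3, random_state=0)

# Train the model: model <- classifier.train(X, Y)
model = LogisticRegression().fit(X_train, Y_train)

# Apply the model to new (unlabeled) data: prediction <- model.predict(x)
predictions = model.predict(X_new)
```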
Performance measures

Training Loss and performance measures


• The training loss function measures the cost paid for predicting
  f(x) when the true output is y
• It is designed to make the learning algorithm effective and
  efficient to optimize (e.g. hinge loss for SVM)
• It is not necessarily the best measure of final performance
  • E.g. the misclassification cost is never used as a training
    loss because it is piecewise constant (not amenable to
    gradient descent)
• Multiple performance measures can be used to evaluate different
  aspects of a learner
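As a hedged illustration of why surrogate losses are used for training (not taken from the slides): the 0-1 misclassification loss is flat almost everywhere, while the hinge loss provides a usable (sub)gradient as a function of the margin.

```python
import numpy as np

def zero_one_loss(margin):
    # 1 if the example is misclassified (margin <= 0), else 0; piecewise constant
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    # max(0, 1 - margin): convex surrogate used by SVMs, amenable to gradient methods
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-2, 2, 9)       # margin = y * f(x)
print(zero_one_loss(margins))         # jumps from 1 to 0 at margin = 0
print(hinge_loss(margins))            # decreases linearly until margin = 1
```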
Performance measures

Binary classification

• The confusion matrix reports true (on rows) and predicted
  (on columns) labels
• Each entry contains the number of examples with the true label of
  its row that were predicted with the label of its column:
  • tp True positives: positives predicted as positives
  • tn True negatives: negatives predicted as negatives
  • fp False positives: negatives predicted as positives
  • fn False negatives: positives predicted as negatives
Binary classification/Classification metrics

Accuracy

• Accuracy is the fraction of correctly labelled examples among
  all predictions:

  Acc = (TP + TN) / (TP + TN + FP + FN)

• It is one minus the misclassification rate
Confusion Matrix

• Given a dataset of P positive instances and N negative instances:

                       Predicted Class
                         Yes      No
  Actual Class    Yes    TP       FN
                  No     FP       TN

  accuracy = (TP + TN) / (P + N)

• Imagine using the classifier to identify positive cases (i.e., for
  information retrieval):

  precision = TP / (TP + FP)        recall = TP / (TP + FN)

  Precision: probability that a randomly selected result is relevant.
  Recall: probability that a randomly selected relevant document is
  retrieved.
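A minimal sketch of computing these metrics from a confusion matrix, assuming scikit-learn; the labels and predictions below are illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # illustrative true labels
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])   # illustrative predictions

# For binary problems, ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy :", (tp + tn) / (tp + tn + fp + fn), accuracy_score(y_true, y_pred))
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall   :", tp / (tp + fn), recall_score(y_true, y_pred))
```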
Binary classification
Problems with accuracy
• For strongly unbalanced datasets (typically many more negatives
  than positives) accuracy is not informative:
  • Predictions are dominated by the larger class
  • Predicting everything as negative often maximizes accuracy
• One possibility is to rebalance costs (e.g. a single positive
  counts as N/P, where N = TN + FP and P = TP + FN)
Precision (recap)
• It is the fraction of true positives among the examples predicted
  as positive
• It measures the precision of the learner when predicting positives

Recall or Sensitivity (recap)
• It is the fraction of positive examples that are predicted as
  positive
• It measures the coverage of the learner in returning positive
  examples
Binary Classification

F-measure

  Fβ = ((1 + β²)(Pre · Rec)) / (β² Pre + Rec)

Precision and recall are complementary:
increasing precision typically reduces recall.
The F-measure combines the two measures, balancing
the two aspects.
β is a parameter trading off precision and recall.
F1

  F1 = 2 (Pre · Rec) / (Pre + Rec)

It is the F-measure for β = 1.
It is the harmonic mean of precision and recall.
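A minimal sketch of the Fβ computation (hand-rolled for illustration; scikit-learn's fbeta_score computes the same quantity from labels and predictions):

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure trading off precision and recall via beta."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.8, 0.5))             # F1: harmonic mean of 0.8 and 0.5
print(f_beta(0.8, 0.5, beta=2.0))   # F2: weights recall more heavily
```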
Binary Classification

Precision-recall curve
Classifiers often provide a confidence in the
prediction (e.g. the margin of an SVM).
A hard decision is made by setting a threshold
on the classifier output (zero for SVM).
Acc, Pre, Rec, F1 all measure the performance of a
classifier at a specific threshold.
It is possible to change the threshold when
interested in maximizing a specific
measure (e.g. recall).
Binary Classification

Precision-recall curve
By varying the threshold from its minimum to its maximum
possible value, we obtain a curve of performance
measures.
This curve can be shown by plotting one measure
(recall) against the complementary one
(precision).
It is possible to investigate the performance of the
learner over the full range of precision/recall trade-offs.
Binary Classification

Area under the Pre-Rec curve

A single aggregate value can be obtained by taking
the area under the curve.
It combines the performance of the
algorithm over all possible thresholds
(without expressing a preference for any of them).
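A minimal sketch of building the precision-recall curve and its area from classifier scores, assuming scikit-learn; the data and the linear SVM are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_curve, auc

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LinearSVC().fit(X_tr, y_tr)
scores = clf.decision_function(X_te)     # confidence in the prediction (SVM margin)

# One (precision, recall) point per threshold on the scores
precision, recall, thresholds = precision_recall_curve(y_te, scores)
print("area under Pre-Rec curve:", auc(recall, precision))
```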
Performance measures

Multiclass classification

  T\P   y1    y2    y3
  y1    n11   n12   n13
  y2    n21   n22   n23
  y3    n31   n32   n33

The confusion matrix is the generalized version of the binary one.
nij is the number of examples with true class yi
predicted as yj. The main diagonal contains the true
positives for each class.
The sum of off-diagonal elements along a
column is the number of false positives for the
column label.
The sum of off-diagonal elements along a
row is the number of false negatives for the
row label.
Performance measures

Multiclass classification
Acc, Pre, Rec, F1 carry over to per-class
measures by treating examples
from the other classes as negatives.
E.g.:

  Prei = nii / (nii + FPi)        Reci = nii / (nii + FNi)

Multiclass accuracy is the overall fraction of
correctly classified examples:

  MAcc = Σi nii / Σi Σj nij
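A minimal sketch of deriving per-class precision and recall from a multiclass confusion matrix, hand-rolled with NumPy for illustration (scikit-learn's precision_score/recall_score with average=None give the same per-class values); the matrix below is made up:

```python
import numpy as np

# Illustrative 3-class confusion matrix: rows = true class, columns = predicted class
cm = np.array([[50,  3,  2],
               [ 4, 40,  6],
               [ 1,  5, 44]])

tp = np.diag(cm)                 # n_ii: correct predictions per class
fp = cm.sum(axis=0) - tp         # off-diagonal column sums: false positives per class
fn = cm.sum(axis=1) - tp         # off-diagonal row sums: false negatives per class

precision_per_class = tp / (tp + fp)
recall_per_class = tp / (tp + fn)
multiclass_accuracy = tp.sum() / cm.sum()

print(precision_per_class, recall_per_class, multiclass_accuracy)
```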
Performance
measures
Training Data and Test Data

• Training data: data used to build the model


• Test data: new data, not used in the training process

• Training performance is often a poor indicator of
  generalization performance
  • Generalization is what we really care about in ML
  • It is easy to overfit the training data
  • Performance on test data is a good indicator of
    generalization performance
  • i.e., test accuracy is more important than training accuracy
Simple Decision Boundary

[Figure: two-class data in a two-dimensional feature space (Feature 1
vs. Feature 2). A simple decision boundary separates the space into
Decision Region 1 and Decision Region 2.]
More Complex Decision Boundary

[Figure: the same two-class data in a two-dimensional feature space
(Feature 1 vs. Feature 2), now separated into Decision Region 1 and
Decision Region 2 by a more complex decision boundary.]
Example: The Overfitting Phenomenon

A Complex Model

  Y = high-order polynomial in X

The True (simpler) Model

  Y = aX + b + noise
How Overfitting Affects Prediction

[Figure: predictive error versus model complexity. The error on the
training data keeps decreasing as complexity grows, while the error on
the test data decreases and then rises again. Low complexity corresponds
to underfitting, high complexity to overfitting, and the ideal range
for model complexity lies in between.]
Comparing Classifiers

Say we have two classifiers, C1 and C2, and want to
choose the best one to use for future predictions.

Can we use training accuracy to choose between them?
• No!
  – e.g., C1 = pruned decision tree, C2 = 1-NN
    training_accuracy(1-NN) = 100%, but it may not be the
    best on new data

Instead, choose based on test accuracy...
Training and Test Data

[Figure: the full data set is split into training data and test data.]

Idea: train each model on the "training data"...
...and then test each model's accuracy on the test data.
Performance estimation

Hold-out procedure
Computing a performance measure on the training set
would be optimistically biased.
We need to retain an independent set on which to
compute performance:
  validation set: when used to estimate the performance of
    different algorithmic settings (i.e. hyperparameters)
  test set: when used to estimate the final performance of the
    selected model
E.g.: split the dataset into 40%/30%/30% for training,
validation and testing.
Problem
The hold-out procedure depends on the specific test
(and validation) set chosen (especially for small datasets).
K-Fold Cross-Validation

• Why just choose one particular "split" of the data?
  – In principle, we should do this multiple times since
    performance may be different for each split

• k-Fold Cross-Validation (e.g., k = 10)
  – Randomly partition the full data set of n instances into
    k disjoint subsets (each roughly of size n/k)
  – Choose each fold in turn as the test set; train the model
    on the other folds and evaluate
  – Compute statistics over the k test performances, or
    choose the best of the k models
  – Can also do "leave-one-out CV", where k = n
Cont...

k-fold cross validation

Split D into k equal-sized disjoint subsets Di.
For i ∈ [1, k]:
  train the predictor on Ti = D \ Di
  compute the score S of the predictor L(Ti)
  on the test set Di:

    Si = S_Di[L(Ti)]

Return the average score across folds.
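A minimal runnable sketch of this procedure, assuming scikit-learn's KFold; the classifier, score and data are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=150, random_state=0)   # illustrative data set D

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train on T_i = D \ D_i, score on the held-out fold D_i
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("per-fold scores:", scores)
print("average score  :", np.mean(scores))
```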


Example of 3-fold CV

[Figure: the full data set is partitioned k times; in each partition a
different fold serves as the test data and the remaining folds as the
training data. Each partition yields a test performance, and summary
statistics are computed over the k test performances.]
Optimizing Model Parameters

Can also use CV to choose the value of a model parameter P
• Search over the space of parameter values p ∈ values(P)
  – Evaluate the model with P = p on the validation set
• Choose the value p' with the highest validation performance
• Learn the model on the full training set with P = p'

[Figure: the training data is partitioned k times; in each partition a
different fold serves as the validation set and the remaining folds as
the training data, yielding an optimal parameter value p1, p2, ..., pk
per partition. The test data is kept aside. Choose the value of p of
the model with the best validation performance.]
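A minimal sketch of choosing a hyperparameter by validation performance, assuming scikit-learn; here GridSearchCV's internal cross-validation plays the role of the validation splits, and the SVC parameter grid is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Search over values of the parameter C; each value is scored by k-fold CV
# on the training data (which acts as the train/validation splits)
search = GridSearchCV(SVC(kernel="rbf"), param_grid={"C": [0.1, 1, 10, 100]}, cv=5)
search.fit(X_train, y_train)

print("best C (p'):", search.best_params_)
# The best model is refit on the full training set; evaluate it once on the test set
print("test accuracy:", search.score(X_test, y_test))
```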
More on Cross-Validation (CV)

• Cross-validation generates an approximate estimate
  of how well the classifier will do on "unseen" data
  – As k → n, the model becomes more accurate
    (more training data)
  – ...but CV becomes more computationally expensive
  – Choosing k < n is a compromise

• Averaging over different partitions is more robust
  than just a single train/validate partition of the data

• It is an even better idea to do CV repeatedly!
Multiple Trials of k-Fold CV

1.) Loop for t trials:
    a.) Randomize (shuffle) the full data set
    b.) Perform k-fold CV on the shuffled data

2.) Compute statistics over the t x k test performances

[Figure: in each trial the full data set is shuffled and then
partitioned k times, with a different fold serving as the test data in
each partition.]
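A minimal sketch of repeated (multi-trial) k-fold CV, assuming scikit-learn's RepeatedKFold; the estimator and data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# t trials of k-fold CV, reshuffling the data before each trial
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)   # t x k = 15 scores

print("mean:", scores.mean(), "std:", scores.std())
```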
Comparing Multiple Classifiers

1.) Loop for t trials:
    a.) Randomize (shuffle) the full data set
    b.) Perform k-fold CV, testing each candidate learner on the
        same training/testing splits

2.) Compute statistics over the t x k test performances

Using the same splits for every learner allows us to do paired summary
statistics (e.g., a paired t-test).

[Figure: in each trial the shuffled data set is partitioned k times;
both classifiers C1 and C2 are evaluated on each partition's test
fold, yielding paired test performances.]
Comparing learning algorithms

Hypothesis testing
• We want to compare the generalization performance of two learning algorithms
• We want to know whether an observed difference in performance is statistically
  significant (and not due to noise in the evaluation)
• Hypothesis testing allows us to test the statistical significance of a hypothesis
  (e.g. that the two predictors have different performance)
Hypothesis testing (t-test)

Example
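The slide's original worked example is not reproduced here; as a hedged sketch, a paired t-test over per-fold scores could look like the following, assuming SciPy and per-fold accuracies obtained on the same splits (the numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

# Illustrative per-fold accuracies of two classifiers on the SAME CV splits
scores_c1 = np.array([0.81, 0.79, 0.84, 0.80, 0.83])
scores_c2 = np.array([0.78, 0.77, 0.80, 0.79, 0.80])

# Paired t-test on the per-fold differences
t_stat, p_value = stats.ttest_rel(scores_c1, scores_c2)
print("t =", t_stat, "p =", p_value)
# A small p-value suggests the observed difference is unlikely to be due to chance alone
```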
Learning Curve

• Shows performance versus the number of training examples
  – Computed over a single training/testing split
  – Then averaged across multiple trials of CV
Building Learning Curves

1.) Loop for t trials:
    a.) Randomize (shuffle) the full data set
    b.) Perform k-fold CV, computing a learning curve for each
        classifier over each training/testing split

2.) Compute statistics over the t x k learning curves

[Figure: in each trial the shuffled data set is partitioned k times;
each partition yields a learning curve for C1 and for C2.]
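A minimal sketch of computing a learning curve, assuming scikit-learn's learning_curve helper; the training sizes and the estimator are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# Test accuracy as a function of the number of training examples,
# averaged over 5-fold CV splits
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5), shuffle=True, random_state=0)

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{int(n):4d} training examples -> mean test accuracy {score:.3f}")
```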
