
Evaluating a machine learning model. (jeremyjordan.me)

JEREMY JORDAN
21 JUL 2017 • 10 MIN READ
So you've built a machine learning model and trained it on some data...
now what? In this post, I'll discuss how to evaluate your model and offer
practical advice for improving it based on what we learn from evaluating it.

I'll answer questions like:

• How well is my model doing? Is it a useful model?
• Will training my model on more data improve its performance?
• Do I need to include more features?
The train/test/validation split
The most important thing you can do to properly evaluate your model is
to not train the model on the entire dataset. I repeat: do not train the
model on the entire dataset. I talked about this in my post on preparing
data for a machine learning model and I'll mention it again now because
it's that important. A typical train/test split would be to use 70% of the
data for training and 30% of the data for testing.
As I discussed previously, it's important to use new data when evaluating
our model to reduce the likelihood of overfitting to the training set.
However, sometimes it's useful to evaluate our model as we're building it
to find the best parameters of a model - but we can't use the test set for
this evaluation or else we'll end up selecting the parameters that
perform best on the test data but maybe not the parameters that
generalize best. To evaluate the model while still building and tuning it,
we create a third subset of the data known as the validation set. A
typical train/test/validation split would be to use 60% of the data for
training, 20% of the data for validation, and 20% of the data for testing.

I'll also note that it's very important to shuffle the data before making
these splits so that each split has an accurate representation of the
dataset.
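
As a rough sketch of what this can look like in code (assuming scikit-learn and placeholder arrays X and y rather than anything from a specific project), the split can be done in two passes of train_test_split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for a real feature matrix X and label vector y.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First hold out 20% of the data as the test set.
# train_test_split shuffles the rows by default before splitting.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20,
                                                  random_state=42)

# Then split the remaining 80% into 60% train / 20% validation
# (25% of the remaining 80% equals 20% of the original dataset).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest,
                                                  test_size=0.25,
                                                  random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```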

Metrics
In this section, I'll discuss common metrics used to evaluate models.
Classification metrics

When performing classification predictions, there are four types of
outcomes that could occur.

• True positives are when you predict an observation belongs to a
class and it actually does belong to that class.
• True negatives are when you predict an observation does not
belong to a class and it actually does not belong to that class.
• False positives occur when you predict an observation belongs to
a class when in reality it does not.
• False negatives occur when you predict an observation does not
belong to a class when in fact it does.
These four outcomes are often plotted on a confusion matrix. The
following confusion matrix is an example for the case of binary
classification. You would generate this matrix after making predictions
on your test data and then identifying each prediction as one of the four
possible outcomes described above.
You can also extend this confusion matrix to plot multi-class
classification predictions. The following is an example confusion matrix
for classifying observations from the Iris flower dataset.

[Figure: confusion matrix for the Iris flower dataset]
The three main metrics used to evaluate a classification model are
accuracy, precision, and recall.
Accuracy is defined as the percentage of correct predictions for the test
data. It can be calculated easily by dividing the number of correct
predictions by the number of total predictions.
$$\text{accuracy} = \frac{\text{correct predictions}}{\text{all predictions}}$$

Precision is defined as the fraction of relevant examples (true positives)
among all of the examples which were predicted to belong in a certain
class.
$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$

Recall is defined as the fraction of examples which were predicted to
belong to a class with respect to all of the examples that truly belong in
the class.
$$\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$
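
To make these definitions concrete, here's a minimal sketch using scikit-learn's metric functions on a small set of made-up labels and predictions (the y_test and y_pred values below are purely illustrative):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Purely illustrative true labels and predictions for a binary problem.
y_test = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes.
print(confusion_matrix(y_test, y_pred))

print("accuracy: ", accuracy_score(y_test, y_pred))    # correct / all
print("precision:", precision_score(y_test, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_test, y_pred))      # TP / (TP + FN)
```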

The following graphic does a phenomenal job visualizing the difference
between precision and recall.

[Figure: visualization of the difference between precision and recall]
Precision and recall are useful in cases where classes aren't evenly
distributed. The common example is for developing a classification
algorithm that predicts whether or not someone has a disease. If only a
small percentage of the population (let's say 1%) has this disease and we
build a classifier that always predicts that the person does not have the
disease, we will have built a model which is 99% accurate and 0% useful.

However, if we measured the recall of this useless predictor, it would be
clear that there was something wrong with our model. In this example,
recall ensures that we're not overlooking the people who have the
disease, while precision ensures that we're not misclassifying too many
people as having the disease when they don't. Obviously, you wouldn't
want a model that incorrectly predicts a person has cancer (the person
would end up in a painful and expensive treatment process for a disease
they didn't have) but you also don't want to incorrectly predict a person
does not have cancer when in fact they do. Thus, it's important to
evaluate both the precision and recall of a model.

Ultimately, it's nice to have one number to evaluate a machine learning
model, just as you get a single grade on a test in school. Thus, it makes
sense to combine the precision and recall metrics; the common approach
for combining these metrics is known as the F-score.

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$

The $\beta$ parameter allows us to control the tradeoff of importance
between precision and recall. $\beta < 1$ focuses more on precision,
while $\beta > 1$ focuses more on recall.
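
A quick illustration of the $\beta$ tradeoff, using scikit-learn's fbeta_score on made-up labels (the values below are illustrative only):

```python
from sklearn.metrics import fbeta_score

# Purely illustrative labels and predictions for an imbalanced problem.
y_test = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# beta = 1 weights precision and recall equally (the familiar F1 score).
print(fbeta_score(y_test, y_pred, beta=1))

# beta > 1 emphasizes recall; beta < 1 emphasizes precision.
print(fbeta_score(y_test, y_pred, beta=2))
print(fbeta_score(y_test, y_pred, beta=0.5))
```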

When I was searching for other examples to explain the tradeoff between
precision and recall, I came across the following article discussing using
machine learning to predict suicide. In this case, we'd want to put much
more focus on the model's recall than its precision. It would be much less
harmful to have an intervention with someone who was not actually
considering suicide than it would be to miss someone who was
considering suicide. However, precision is still important because you
don't want too many instances where your model predicts false positives
or else you have the case of "The Model Who Cried Wolf" (this is the
reference for those not familiar with the story).
Note: The article only reports the accuracy of these models, not the
precision or recall! By now you should know that accuracy alone is not too
informative regarding a model's effectiveness. The original published
paper, however, does present the precision and recall.
Regression metrics

Evaluation metrics for regression models are quite different from the
metrics we discussed above for classification models, because we are
now predicting in a continuous range instead of a discrete number of
classes. If your regression model predicts the price of a house to
be $400K and it sells for $405K, that's a pretty good prediction. However,
in the classification examples we were only concerned with whether a
prediction was correct or incorrect; there was no ability to say a
prediction was "pretty good". Thus, we have a different set of evaluation
metrics for regression models.
Explained variance compares the variance of the expected outcomes to
the variance in the error of our model. This metric essentially represents
the amount of variation in the original dataset that our model is able to
explain.

$$EV(y_{\text{true}}, y_{\text{pred}}) = 1 - \frac{\text{Var}(y_{\text{true}} - y_{\text{pred}})}{\text{Var}(y_{\text{true}})}$$

Mean squared error is simply defined as the average of squared
differences between the predicted output and the true output. Squared
error is commonly used because it is agnostic to whether the prediction
was too high or too low; it just reports that the prediction was incorrect.

$$MSE(y_{\text{true}}, y_{\text{pred}}) = \frac{1}{n_{\text{samples}}} \sum (y_{\text{true}} - y_{\text{pred}})^2$$

The R² coefficient represents the proportion of variance in the outcome
that our model is capable of predicting based on its features.

$$R^2(y_{\text{true}}, y_{\text{pred}}) = 1 - \frac{\sum (y_{\text{true}} - y_{\text{pred}})^2}{\sum (y_{\text{true}} - \bar{y})^2}$$

$$\bar{y} = \frac{1}{n_{\text{samples}}} \sum y_{\text{true}}$$
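
A short sketch of computing these regression metrics with scikit-learn, using made-up house prices (the values below are illustrative only):

```python
from sklearn.metrics import (explained_variance_score, mean_squared_error,
                             r2_score)

# Illustrative true house prices and predictions, in thousands of dollars.
y_true = [400, 250, 310, 520, 180]
y_pred = [405, 240, 330, 500, 190]

print("explained variance:", explained_variance_score(y_true, y_pred))
print("mean squared error:", mean_squared_error(y_true, y_pred))
print("R^2:               ", r2_score(y_true, y_pred))
```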

Bias vs Variance
The ultimate goal of any machine learning model is to learn from
examples and generalize some degree of knowledge regarding the task
we're training it to perform. Some machine learning models provide the
framework for generalization by suggesting the underlying structure of
that knowledge. For example, a linear regression model imposes a
framework to learn linear relationships between the information we
feed it. However, sometimes we provide a model with so much pre-built
structure that we limit the model's ability to learn from the examples -
such as the case where we train a linear model on an exponential dataset.
In this case, our model is biased by the pre-imposed structure and
relationships.
Models with high bias pay little attention to the data presented; this is
also known as underfitting.

It's also possible to bias a model by trying to teach it to perform a task
without presenting all of the necessary information. If you know the
constraints of the model are not biasing the model's performance yet
you're still observing signs of underfitting, it's likely that you are not
using enough features to train the model.

On the other extreme, sometimes when we train our model it learns too
much from the training data. That is, our model captures the noise in the
data in addition to the signal. This can cause wild fluctuations in the
model that do not represent the true trend; in this case, we say that the
model has high variance. In this case, our model does not generalize
well because it pays too much attention to the training data without
consideration for generalizing to new data. In other words,
we've overfit the model to the training data.

In summary, a model with high bias is limited from learning the true
trend and underfits the data. A model with high variance learns too much
from the training data and overfits the data. The best model sits
somewhere in the middle of the two extremes.

Next, I'll discuss two common tools that are used to diagnose whether a
model is susceptible to high bias or high variance.
Validation curves

As we discussed in the previous section, the goal with any machine
learning model is generalization. Validation curves allow us to find the
sweet spot between underfitting and overfitting a model to build a model
that generalizes well.
A typical validation curve is a plot of the model's error as a function of
some model hyperparameter which controls the model's tendency to
overfit or underfit the data. The parameter you choose depends on the
specific model you're evaluating; for example, you might choose to plot
the degree of polynomial features (typically, this means you have
polynomial features up to this degree) for a linear regression model.
Generally, the chosen parameter will have some degree of control over
the model's complexity. On this curve, we plot both the training error
and the validation error of the model. Using both of these errors
combined, we can diagnose whether a model is suffering from high bias
or high variance.
In the region where both the training error and validation error are high,
the model is subject to high bias. Here, it was not able to learn from the
data and is performing poorly.

In the region where the training error and validation error diverge, with
the training error staying low and validation error increasing, we're
beginning to see the effects of high variance. The training error is low
because we're overfitting the data and learning too much from the
training examples, while the validation error remains high because our
model isn't able to generalize from the training data to new data.
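
Here is one way the data behind such a curve might be generated, as a hedged sketch: the SVM and its gamma parameter are just one possible choice of model and complexity hyperparameter, and scikit-learn reports scores (higher is better) rather than errors.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Sweep a complexity-related hyperparameter (here, the SVM's gamma) and
# record cross-validated train/validation scores at each value.
param_range = np.logspace(-6, 1, 8)
train_scores, val_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)

# Both scores low -> high bias; a high training score paired with a much
# lower validation score -> high variance.
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```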
Learning curves

The second tool we'll discuss for diagnosing bias and variance in a model
is learning curves. Here, we'll plot the error of a model as a function of
the number of training examples. Similar to validation curves, we'll plot
the error for both the training data and validation data.
If our model has high bias, we'll observe fairly quick convergence to a
high error for the validation and training datasets. If the model suffers
from high bias, training on more data will do very little to improve the
model. This is because models which underfit the data pay little attention
to the data, so feeding in more data will be useless. A better approach to
improving models which suffer from high bias is to consider adding
additional features to the dataset so that the model is better equipped
to learn the proper relationships.

If our model has high variance, we'll see a gap between the training and
validation error. This is because the model is performing well for the
training data, since it has been overfit to that subset, and performs
poorly for the validation data since it was not able to generalize the
proper relationships. In this case, feeding more data during training can
help improve the model's performance.
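
A minimal sketch of generating learning-curve data with scikit-learn; the logistic regression model and the Iris dataset are just stand-ins, and the function reports scores rather than errors, so the interpretation is inverted relative to an error plot.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)

# Train on increasingly large subsets of the data and record
# cross-validated train/validation scores at each training-set size.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Both curves converging to a low score suggests high bias; a persistent
# gap between the training and validation curves suggests high variance.
print(train_sizes)
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```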

Other practical advice


Another common thing I'll do when evaluating classifier models is
to reduce the dataset into two dimensions and then plot the observations
and decision boundary. Sometimes it's helpful to visually inspect the
data and your model when evaluating its performance.
[Figure: observations and decision boundary plotted in two dimensions]
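
A rough sketch of this kind of plot, assuming scikit-learn and matplotlib, with PCA for the dimensionality reduction and a logistic regression classifier as a stand-in model:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Project the features down to two dimensions so they can be plotted.
X_2d = PCA(n_components=2).fit_transform(X)

# Fit a classifier in the reduced space and evaluate it over a grid of
# points so the decision regions can be drawn as filled areas.
clf = LogisticRegression(max_iter=1000).fit(X_2d, y)
xx, yy = np.meshgrid(
    np.linspace(X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1, 200),
    np.linspace(X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1, 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)                       # decision regions
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, edgecolor="k")   # observations
plt.show()
```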
Further reading
• 37 Reasons why your Neural Network is not working
• Improving the Validation and Test Split
