Evaluating A Machine Learning Model
JEREMY JORDAN
21 JUL 2017 • 10 MIN READ
So you've built a machine learning model and trained it on some data...
now what? In this post, I'll discuss how to evaluate your model, and offer
practical advice for improving it based on what we learn from evaluating it.
Metrics
In this section, I'll discuss common metrics used to evaluate models.
Classification metrics
The three main metrics used to evaluate a classification model are
accuracy, precision, and recall.
Accuracy is defined as the percentage of correct predictions for the test
data. It can be calculated easily by dividing the number of correct
predictions by the number of total predictions.
\text{accuracy} = \frac{\text{correct predictions}}{\text{all predictions}}
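As a quick sketch with made-up labels, accuracy is simply the share of predictions that match the true labels; scikit-learn's accuracy_score (one common implementation) computes the same quantity:

    import numpy as np
    from sklearn.metrics import accuracy_score

    # made-up labels purely for illustration
    y_true = np.array([1, 0, 1, 1, 0, 1])
    y_pred = np.array([1, 0, 0, 1, 0, 1])

    print(np.mean(y_true == y_pred))       # 5 correct out of 6 predictions -> ~0.83
    print(accuracy_score(y_true, y_pred))  # the same quantity via scikit-learn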
Precision is the fraction of positive predictions that are actually positive, while recall is the fraction of actual positives that the model correctly identifies. The F-beta score combines the two into a single number, where a beta greater than 1 weights recall more heavily than precision:

F_\beta = (1 + \beta^2) \, \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}
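As another quick sketch, using made-up labels where 1 marks the positive class, scikit-learn exposes each of these metrics directly:

    from sklearn.metrics import precision_score, recall_score, fbeta_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 1, 0, 0, 0, 1]

    print(precision_score(y_true, y_pred))        # 3 true positives out of 4 positive predictions -> 0.75
    print(recall_score(y_true, y_pred))           # 3 of the 4 actual positives were found -> 0.75
    print(fbeta_score(y_true, y_pred, beta=2.0))  # beta > 1 weights recall more heavily than precision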
When I was searching for other examples to explain the tradeoff between
precision and recall, I came across an article discussing the use of
machine learning to predict suicide. In this case, we'd want to put much
more focus on the model's recall than its precision. It would be much less
harmful to have an intervention with someone who was not actually
considering suicide than it would be to miss someone who was
considering suicide. However, precision is still important: if your model
produces too many false positives, you end up with the case of "The
Model Who Cried Wolf" (a reference to the fable of the boy who cried
wolf, for those not familiar with the story).
Note: The article only reports the accuracy of these models, not the
precision or recall! By now you should know that accuracy alone is not too
informative regarding a model's effectiveness. The original published
paper, however, does report the precision and recall.
Regression metrics
Evaluation metrics for regression models are quite different from the
classification metrics discussed above, because we are now predicting
values in a continuous range rather than one of a discrete set of
classes. If your regression model predicts the price of a house to
be $400K and it sells for $405K, that's a pretty good prediction. However,
in the classification examples we were only concerned with whether a
prediction was correct or incorrect; there was no way to say a
prediction was "pretty good". Thus, we need a different set of evaluation
metrics for regression models.
Explained variance compares the variance of our model's errors to the
variance of the true outcomes. This metric essentially represents the
proportion of the variation in the original dataset that our model is
able to explain.
EV(y_{true}, y_{pred}) = 1 - \frac{Var(y_{true} - y_{pred})}{Var(y_{true})}

The closely related R^2 coefficient of determination measures the
proportion of variance in the outcome that is predictable from the inputs,
where \bar{y} is the mean of the observed values:

R^2(y_{true}, y_{pred}) = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad \bar{y} = \frac{1}{n_{samples}} \sum_i y_i
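scikit-learn provides both of these metrics; the sketch below uses made-up house prices (in thousands of dollars) purely for illustration:

    import numpy as np
    from sklearn.metrics import explained_variance_score, r2_score

    # made-up sale prices and predictions, in $1000s
    y_true = np.array([405, 310, 520, 255, 470])
    y_pred = np.array([400, 300, 540, 260, 450])

    print(explained_variance_score(y_true, y_pred))  # close to 1: most of the variation is explained
    print(r2_score(y_true, y_pred))                  # similar, but also penalizes a systematic offset in the errors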
Bias vs Variance
The ultimate goal of any machine learning model is to learn from
examples and generalize some degree of knowledge regarding the task
we're training it to perform. Some machine learning models provide the
framework for generalization by suggesting the underlying structure of
that knowledge. For example, a linear regression model imposes a
framework that can only learn linear relationships in the information we
feed it. However, sometimes we provide a model with so much pre-built
structure that we limit its ability to learn from the examples, such as
the case where we train a linear model on an exponential dataset.
In this case, our model is biased by the pre-imposed structure and
relationships.
Models with high bias pay little attention to the data presented; this is
also known as underfitting.
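To make this concrete, here's a small sketch with synthetic data standing in for the exponential dataset above: a plain linear regression levels off at a mediocre score no matter how much we train it, because the linear structure can't express the relationship:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # synthetic data following an exponential trend, with a little noise
    rng = np.random.RandomState(0)
    X = np.linspace(0, 5, 200).reshape(-1, 1)
    y = np.exp(X).ravel() + rng.normal(scale=5, size=200)

    linear = LinearRegression().fit(X, y)
    print(linear.score(X, y))  # R^2 stays noticeably below 1 even on the training data: high bias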
On the other extreme, sometimes when we train our model it learns too
much from the training data. That is, our model captures the noise in the
data in addition to the signal. This can cause wild fluctuations in the
model that do not represent the true trend; in this case, we say that the
model has high variance. In this case, our model does not generalize
well because it pays too much attention to the training data without
consideration for generalizing to new data. In other words,
we've overfit the model to the training data.
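Here's a small sketch of the opposite failure, again on synthetic data: a very flexible model (a high-degree polynomial regression, chosen here just for illustration) can drive its training error close to zero while doing much worse on held-out data:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # a simple sine trend plus noise
    rng = np.random.RandomState(0)
    X = rng.uniform(0, 6, size=(40, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # degree-15 polynomial regression: far more flexibility than the data warrants
    overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
    overfit.fit(X_train, y_train)

    print(mean_squared_error(y_train, overfit.predict(X_train)))  # near zero on the training set
    print(mean_squared_error(y_val, overfit.predict(X_val)))      # typically much larger on the validation set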
In summary, a model with high bias is limited from learning the true
trend and underfits the data. A model with high variance learns too much
from the training data and overfits the data. The best model sits
somewhere in the middle of the two extremes.
Next, I'll discuss two common tools that are used to diagnose whether a
model is susceptible to high bias or high variance.
Validation curves
A validation curve plots the training error and validation error as a
function of a hyperparameter that controls the model's complexity (for
example, the degree of a polynomial model). Where the model is too
simple, both errors are high: the signature of high bias. In the region
where the training error and validation error diverge, with the training
error staying low and the validation error increasing, we're beginning to
see the effects of high variance. The training error is low because we're
overfitting the data and learning too much from the training examples,
while the validation error remains high because our model isn't able to
generalize from the training data to new data.
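scikit-learn's validation_curve is one convenient way to compute such a curve; the model, data, and hyperparameter range below are all made up for illustration:

    import numpy as np
    from sklearn.model_selection import validation_curve
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Ridge

    # synthetic data with a nonlinear trend
    rng = np.random.RandomState(0)
    X = rng.uniform(0, 6, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

    model = make_pipeline(PolynomialFeatures(), Ridge())
    degrees = np.arange(1, 13)

    # cross-validated training and validation scores for each polynomial degree
    train_scores, val_scores = validation_curve(
        model, X, y,
        param_name="polynomialfeatures__degree", param_range=degrees,
        scoring="neg_mean_squared_error", cv=5)

    # convert scores back to errors and average over the folds
    train_error = -train_scores.mean(axis=1)
    val_error = -val_scores.mean(axis=1)
    for degree, tr, va in zip(degrees, train_error, val_error):
        print(f"degree {degree:2d}  train MSE {tr:.3f}  validation MSE {va:.3f}")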
Learning curves
The second tool we'll discuss for diagnosing bias and variance in a model
is learning curves. Here, we'll plot the error of a model as a function of
the number of training examples. Similar to validation curves, we'll plot
the error for both the training data and validation data.
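scikit-learn's learning_curve is one way to generate these numbers; as before, the data and model below are synthetic and purely for illustration:

    import numpy as np
    from sklearn.model_selection import learning_curve
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    # synthetic data with a nonlinear trend
    rng = np.random.RandomState(0)
    X = rng.uniform(0, 6, size=(300, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

    model = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())

    # cross-validated training and validation scores at increasing training-set sizes
    sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),
        scoring="neg_mean_squared_error", cv=5)

    for n, tr, va in zip(sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
        print(f"{int(n):4d} training examples  train MSE {tr:.3f}  validation MSE {va:.3f}")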
If our model has high bias, we'll observe fairly quick convergence to a
high error for the validation and training datasets. If the model suffers
from high bias, training on more data will do very little to improve the
model. This is because a model that underfits pays little attention to
the data, so feeding in more of it will not help. A better approach to
improving models which suffer from high bias is to consider adding
additional features to the dataset so that the model can be more
equipped to learn the proper relationships.
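One simple way to add features, sketched here on synthetic data: expand the inputs with polynomial terms so that a linear model can capture a curved relationship (PolynomialFeatures is just one illustrative choice):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    # synthetic data with a quadratic trend that a plain linear model underfits
    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = X.ravel() ** 2 + rng.normal(scale=0.5, size=200)

    plain = LinearRegression().fit(X, y)
    richer = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           LinearRegression()).fit(X, y)

    print(plain.score(X, y))   # low R^2: the model is too limited (high bias)
    print(richer.score(X, y))  # close to 1 once the squared feature is available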
If our model has high variance, we'll see a gap between the training and
validation error. This is because the model is performing well for the
training data, since it has been overfit to that subset, and performs
poorly for the validation data since it was not able to generalize the
proper relationships. In this case, feeding more data during training can
help improve the model's performance.