CS229 Bias-Variance and Error Analysis

Yoann Le Calonnec

October 2, 2017

1 The Bias-Variance Tradeoff
When a model performs poorly on the test set (high test MSE), there are three possible explanations:

• Overfitting: the model is too closely tailored to the examples in the training set and doesn't generalize well to other examples.

• Underfitting: the model didn't gather enough information from the training set, and doesn't capture the link between the features x and the target y.

• The data is simply noisy, that is, the model is neither overfitting nor underfitting, and the high MSE is simply due to the amount of noise in the dataset.
Write $y = f(x) + \varepsilon$, where $\varepsilon$ is noise with $E[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$. Since the noise $\varepsilon$ is independent of the prediction $\hat{f}(x)$, the cross term $E[\varepsilon(f(x) - \hat{f}(x))]$ vanishes and the test MSE decomposes as
\begin{align*}
\text{Test MSE} &= E\big[(y - \hat{f}(x))^2\big] \\
&= E\big[(\varepsilon + f(x) - \hat{f}(x))^2\big] \\
&= E[\varepsilon^2] + E\big[(f(x) - \hat{f}(x))^2\big] \\
&= \sigma^2 + \big(E[f(x) - \hat{f}(x)]\big)^2 + \operatorname{Var}\big(f(x) - \hat{f}(x)\big) \\
&= \sigma^2 + \operatorname{Bias}\big(\hat{f}(x)\big)^2 + \operatorname{Var}\big(\hat{f}(x)\big)
\end{align*}
There is nothing we can do about the first term $\sigma^2$, since by definition we cannot predict the noise. The bias term is due to underfitting: on average, $\hat{f}$ does not predict $f$ well. The last term is closely related to overfitting: the prediction $\hat{f}$ is too close to the training values $y_{\text{train}}$ and varies a lot with the choice of the training set.
To sum up, we can understand our MSE as follows:
\[
\text{Test MSE} = \underbrace{\sigma^2}_{\text{irreducible noise}} + \underbrace{\operatorname{Bias}\big(\hat{f}(x)\big)^2}_{\text{underfitting}} + \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{overfitting}}
\]
Hence, when analyzing the performance of a machine learning algorithm, we must always ask ourselves how to reduce the bias without increasing the variance, and conversely, how to reduce the variance without increasing the bias. Most of the time, reducing one will increase the other, and there is a tradeoff between bias and variance.
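To see this decomposition numerically, here is a minimal simulation (not part of the original notes; it assumes only numpy). It draws many training sets from $y = f(x) + \varepsilon$, fits an underfitting model (degree-1 polynomial) and an overfitting one (degree-9), and estimates the bias and variance of the prediction at a fixed test point.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # true regression function
    return np.sin(2 * np.pi * x)

sigma = 0.3                    # noise level: y = f(x) + eps, Var(eps) = sigma^2
n, trials = 20, 2000           # training-set size, number of resampled sets
x_test = 0.25                  # point at which we estimate the decomposition

for degree in (1, 9):          # an underfitting and an overfitting model
    preds = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(0, 1, n)
        y = f(x) + sigma * rng.normal(size=n)
        coefs = np.polyfit(x, y, degree)      # least-squares polynomial fit
        preds[t] = np.polyval(coefs, x_test)  # prediction f_hat(x_test)
    bias2 = (preds.mean() - f(x_test)) ** 2   # Bias(f_hat(x))^2
    var = preds.var()                         # Var(f_hat(x))
    y_new = f(x_test) + sigma * rng.normal(size=trials)
    mse = np.mean((y_new - preds) ** 2)       # direct estimate of the test MSE
    print(f"degree {degree}: bias^2={bias2:.4f}  var={var:.4f}  "
          f"sigma^2+bias^2+var={sigma**2 + bias2 + var:.4f}  mse={mse:.4f}")
```

The degree-1 fit shows a large bias and a small variance, the degree-9 fit the opposite, and in both cases $\sigma^2 + \text{bias}^2 + \text{variance}$ matches the directly estimated test MSE.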
2 Error Analysis
Even though it is important to understand whether our poor test error is due to high bias or high variance, it is crucial to know which parts of the machine learning algorithm are responsible for this error.
Consider the machine learning pipeline in Figure 1.
The algorithm is divided into several steps:

1. An input image is captured by a camera.

2. Preprocessing to remove the background of the image. For instance, if the images are taken from a security camera, the background is always the same, and we can remove it easily by keeping only the pixels that changed in the image.

3. Face detection to locate the face in the image.

4. Segmentation of the eyes, nose, and mouth within the detected face.

5. Logistic regression on these segments to produce the final prediction.
Figure 1: Face recognition pipeline
If you build a complicated system like this one, you might want to figure out how much error is attributable to each of the components, that is, how good each of these green boxes is. Indeed, if one of them is really problematic, you might want to spend more time trying to improve the performance of that one box. How do you decide what part to focus on?
One thing we can do is plug in the ground truth for each component and see how the accuracy changes. Let's say the overall accuracy of the system is 85% (pretty bad). You can now take your development set and manually give it the perfect background removal, that is, instead of using your background removal algorithm, manually specify the perfect background removal yourself (using Photoshop, for instance), and look at how much that affects the performance of the overall system.
Now let's say the accuracy only improves by 0.1%. This gives us an upper bound: even if we worked for years on background removal, it wouldn't improve our system by more than 0.1%.
Now let's give the pipeline the perfect face detection by specifying the position of the face manually, see how much that improves the performance, and so on. The results are given in Table 1.
Looking at the table, we know that working on the background removal won't help much. It also tells us where the biggest jumps are. We notice that having an accurate face detection mechanism really improves the performance, and similarly, the eyes really help make the prediction more accurate.
Error analysis is also useful when publishing a paper, since it's a convenient way to analyze the error of an algorithm and explain which parts should be improved.

Component                          Accuracy
Overall system                     85%
Preprocess (remove background)     85.1%
Face detection                     91%
Eyes segmentation                  95%
Nose segmentation                  96%
Mouth segmentation                 97%
Logistic regression                100%

Table 1: Accuracy when providing the system with the perfect component
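The bookkeeping behind Table 1 is easy to automate. The sketch below is a hypothetical harness, not from the notes: a lookup table replays the Table 1 numbers in place of a real pipeline run, and the loop prints the accuracy gain from making each successive stage perfect, which is exactly the "jump" we look for.

```python
# Hypothetical error-analysis harness. In a real system, each entry would be
# measured by running the pipeline on the dev set with every stage up to and
# including the key replaced by hand-labeled ground truth; here we simply
# replay the Table 1 numbers so the bookkeeping is concrete.

PIPELINE = ["remove_background", "face_detection", "eyes_segmentation",
            "nose_segmentation", "mouth_segmentation", "logistic_regression"]

ACCURACY_WITH_ORACLE = {
    None: 0.85,                  # no oracle: the unmodified end-to-end system
    "remove_background": 0.851,
    "face_detection": 0.91,
    "eyes_segmentation": 0.95,
    "nose_segmentation": 0.96,
    "mouth_segmentation": 0.97,
    "logistic_regression": 1.00,
}

prev = ACCURACY_WITH_ORACLE[None]
print(f"Overall system: {prev:.1%}")
for stage in PIPELINE:
    acc = ACCURACY_WITH_ORACLE[stage]
    # The jump over the previous row upper-bounds how much improving this
    # single stage could help the end-to-end system.
    print(f"{stage:<20} perfect: {acc:6.1%}  (gain {acc - prev:+.1%})")
    prev = acc
```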
3 Ablative Analysis
While error analysis tries to explain the difference between current performance and perfect
performance, ablative analysis tries to explain the difference between some baseline (much
poorer) performance and current performance.
For instance, suppose you have built a good anti-spam classifier by adding lots of clever features to logistic regression:

• Spelling correction

• Sender host features

• Email header features

• Email text parser features

• Javascript parser

• Features from images
and your question is: How much did each of these components really help?
In this example, let's say that simple logistic regression without any clever features gets 94% accuracy, but when adding these clever features, we get 99.9% accuracy. In ablative analysis, we start from the current level of performance (99.9%) and remove these features one at a time to see how that affects performance. The results are provided in Table 2.
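A minimal sketch of this loop, assuming a scikit-learn-style logistic regression and synthetic stand-in data (the group names mirror the feature list above; the numbers it prints are illustrative, not the ones in Table 2):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in data: each feature group occupies a block of three
# columns. In a real spam system these would come from the actual extractors.
GROUPS = ["spelling_correction", "sender_host", "email_header",
          "text_parser", "javascript_parser", "image_features"]
n = 2000
X = rng.normal(size=(n, 3 * len(GROUPS)))
# Synthetic labels in which every group carries some signal.
y = (X[:, ::3].sum(axis=1) + 0.5 * rng.normal(size=n) > 0).astype(int)
X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]

def accuracy(active):
    """Retrain logistic regression on the columns of the active groups."""
    cols = [3 * GROUPS.index(g) + j for g in active for j in range(3)]
    clf = LogisticRegression().fit(X_tr[:, cols], y_tr)
    return clf.score(X_te[:, cols], y_te)

active = list(GROUPS)
print(f"All features: {accuracy(active):.1%}")
for g in GROUPS[:-1]:          # remove groups one at a time, cumulatively
    active.remove(g)
    print(f"Without {g} (and those above): {accuracy(active):.1%}")
```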
When presenting the results in a paper, ablative analysis really helps identify the features that decreased the misclassification rate. Instead of simply giving the loss/error rate of the algorithm, we can provide evidence that some specific features are actually more important than others.
Component                      Accuracy
Overall system                 99.9%
Spelling correction            99.0%
Sender host features           98.9%
Email header features          98.9%
Email text parser features     95%
Javascript parser              94.5%
Features from images           94.0%

Table 2: Accuracy as each feature is successively removed from the system