Machine Learning Yearning
Suppose your cat classifier achieves:
• Training error = 1%
• Dev error = 11%
What problem does it have? Applying the definitions from the previous chapter, we estimate
the bias as 1%, and the variance as 10% (=11%-1%). Thus, it has high variance. The
classifier has very low training error, but it is failing to generalize to the dev set. This is also
called overfitting.
Now suppose your classifier achieves:
• Training error = 15%
• Dev error = 16%
We estimate the bias as 15%, and variance as 1%. This classifier is fitting the training set
poorly with 15% error, but its error on the dev set is barely higher than the training error.
This classifier therefore has high bias, but low variance. We say that this algorithm is
underfitting.
Next, suppose:
• Training error = 15%
• Dev error = 30%
We estimate the bias as 15%, and variance as 15%. This classifier has high bias and high
variance: It is doing poorly on the training set, and therefore has high bias, and its
performance on the dev set is even worse, so it also has high variance. The
overfitting/underfitting terminology is hard to apply here since the classifier is
simultaneously overfitting and underfitting.
Finally, suppose your classifier achieves a very low training error and a dev error of 1%. This classifier is doing well, as it has low bias and low variance. Congratulations on achieving this great performance!
Other problems are harder. For example, suppose that you are building a speech recognition
system, and find that 14% of the audio clips have so much background noise or are so
unintelligible that even a human cannot recognize what was said. In this case, even the most
“optimal” speech recognition system might have error around 14%.
Suppose that, on this speech problem, your algorithm achieves:
• Training error = 15%
• Dev error = 30%
The training set performance is already close to the optimal error rate of 14%. Thus, there is
not much room for improvement in terms of bias or in terms of training set performance.
However, this algorithm is not generalizing well to the dev set; thus there is ample room for
improvement in the errors due to variance.
This example is similar to the third example from the previous chapter, which also had a
training error of 15% and dev error of 30%. If the optimal error rate is ~0%, then a training
error of 15% leaves much room for improvement. This suggests bias-reducing changes might
be fruitful. But if the optimal error rate is 14%, then the same training set performance tells
us that there’s little room for improvement in the classifier’s bias.
For problems where the optimal error rate is far from zero, here’s a more detailed
breakdown of an algorithm’s error. Continuing with our speech recognition example above,
the total dev set error of 30% can be broken down as follows (a similar analysis can be
applied to the test set error):
• Optimal error rate (“unavoidable bias”): 14%. Suppose we decide that, even with the
best possible speech system in the world, we would still suffer 14% error. We can think of
this as the “unavoidable” part of a learning algorithm’s bias.
• Avoidable bias: 1%. This is calculated as the difference between the training error and the optimal error rate.
• Variance: 15%. The difference between the dev error and the training error.
To relate this to our earlier definitions, Bias and Avoidable Bias are related as follows:

Bias = Optimal error rate ("unavoidable bias") + Avoidable bias
The “avoidable bias” reflects how much worse your algorithm performs on the training set
than the “optimal classifier.”
The concept of variance remains the same as before. In theory, we can always reduce
variance to nearly zero by training on a massive training set. Thus, all variance is “avoidable”
with a sufficiently large dataset, so there is no such thing as “unavoidable variance.”
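To make this arithmetic concrete, here is a minimal Python sketch (the function name and the way the rates are passed in are illustrative, not from any library):

```python
def error_breakdown(optimal_error, training_error, dev_error):
    """Split total error into unavoidable bias, avoidable bias, and variance."""
    unavoidable_bias = optimal_error                  # error even the best possible system would make
    avoidable_bias = training_error - optimal_error   # how much worse than "optimal" we do on the training set
    variance = dev_error - training_error             # how much worse we do on the dev set than the training set
    return unavoidable_bias, avoidable_bias, variance

# Speech recognition example above: optimal error 14%, training error 15%, dev error 30%.
u, a, v = error_breakdown(0.14, 0.15, 0.30)
print(f"unavoidable bias = {u:.0%}, avoidable bias = {a:.0%}, variance = {v:.0%}")
# -> unavoidable bias = 14%, avoidable bias = 1%, variance = 15%
```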
Consider one more example, where the optimal error rate is 14%, and we have:
• Training error = 15%
• Dev error = 16%
Whereas in the previous chapter we called this a high bias classifier, now we would say that
error from avoidable bias is 1%, and the error from variance is about 1%. Thus, the algorithm
is already doing well, with little room for improvement. It is only 2% worse than the optimal
error rate.
We see from these examples that knowing the optimal error rate is helpful for guiding our
next steps. In statistics, the optimal error rate is also called Bayes error rate, or Bayes
rate.
How do we know what the optimal error rate is? For tasks that humans are reasonably good
at, such as recognizing pictures or transcribing audio clips, you can ask a human to provide
labels then measure the accuracy of the human labels relative to your training set. This
would give an estimate of the optimal error rate. If you are working on a problem that even humans have a hard time solving, it can be harder to estimate the optimal error rate.
8. If this number is negative, you are doing better on the training set than the optimal error rate. This means you are overfitting on the training set, and the algorithm has over-memorized the training set. You should focus on variance reduction methods rather than on further bias reduction methods.
9. These definitions are chosen to convey insight on how to improve your learning algorithm. These
definitions are different than how statisticians define Bias and Variance. Technically, what I define
here as “Bias” should be called “Error we attribute to bias”; and “Avoidable bias” should be “error we
attribute to the learning algorithm’s bias that is over the optimal error rate.”
In the last few chapters, you learned how to estimate avoidable/unavoidable bias and
variance by looking at training and dev set error rates. The next chapter will discuss how you
can use insights from such an analysis to prioritize techniques that reduce bias vs.
techniques that reduce variance. There are very different techniques that you should apply
depending on whether your project’s current problem is high (avoidable) bias or high
variance. Read on!
• If you have high avoidable bias, increase the size of your model (for example, increase the size of your neural network by adding layers/neurons).
• If you have high variance, add data to your training set.
If you are able to increase the neural network size and increase training data without limit, it
is possible to do very well on many learning problems.
In practice, increasing the size of your model will eventually cause you to run into
computational problems because training very large models is slow. You might also exhaust
your ability to acquire more training data. (Even on the internet, there is only a finite
number of cat pictures!)
Increasing the model size generally reduces bias, but it might also increase variance and the
risk of overfitting. However, this overfitting problem usually arises only when you are not
using regularization. If you include a well-designed regularization method, then you can
usually safely increase the size of the model without increasing overfitting.
Suppose you are applying deep learning, with L2 regularization or dropout, with the
regularization parameter that performs best on the dev set. If you increase the model size,
usually your performance will stay the same or improve; it is unlikely to worsen significantly.
The only reason to avoid using a bigger model is the increased computational cost.
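As a hypothetical illustration, here is a minimal PyTorch sketch of growing a fully connected network while keeping dropout and L2 regularization (via weight decay) in place; the layer sizes, dropout rate, and weight-decay value are arbitrary choices for the example, not recommendations:

```python
import torch.nn as nn
import torch.optim as optim

def build_model(input_dim=1000, hidden_units=512, num_layers=3, dropout=0.5, num_classes=2):
    """Fully connected classifier whose size can be increased without removing regularization."""
    layers, prev = [], input_dim
    for _ in range(num_layers):
        layers += [nn.Linear(prev, hidden_units), nn.ReLU(), nn.Dropout(dropout)]
        prev = hidden_units
    layers.append(nn.Linear(prev, num_classes))
    return nn.Sequential(*layers)

# A bigger model to reduce bias; weight_decay adds L2 regularization to keep variance in check.
model = build_model(hidden_units=1024, num_layers=4)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```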
You might have heard of the "bias vs. variance tradeoff": of the changes you could make to most learning algorithms, some reduce bias at the cost of increasing variance, and vice versa. For example, increasing the size of your model—adding neurons/layers in a neural network,
or adding input features—generally reduces bias but could increase variance. Alternatively,
adding regularization generally increases bias but reduces variance.
In the modern era, we often have access to plentiful data and can use very large neural
networks (deep learning). Therefore, there is less of a tradeoff, and there are now more
options for reducing bias without hurting variance, and vice versa.
For example, you can usually increase a neural network size and tune the regularization
method to reduce bias without noticeably increasing variance. By adding training data, you
can also usually reduce variance without affecting bias.
If you select a model architecture that is well suited for your task, you might also reduce bias
and variance simultaneously. Selecting such an architecture can be difficult.
In the next few chapters, we discuss additional specific techniques for addressing bias and
variance.
• Increase the model size (such as number of neurons/layers): This technique reduces
bias, since it should allow you to fit the training set better. If you find that this increases
variance, then use regularization, which will usually eliminate the increase in variance.
• Modify input features based on insights from error analysis: Say your error
analysis inspires you to create additional features that help the algorithm eliminate a
particular category of errors. (We discuss this further in the next chapter.) These new
features could help with both bias and variance. In theory, adding more features could
increase the variance; but if you find this to be the case, then use regularization, which will
usually eliminate the increase in variance.
• Add more training data: This technique helps with variance problems, but it usually
has no significant effect on bias.
In addition to the techniques described earlier to address high bias, I sometimes also carry
out an error analysis on the training data, following a protocol similar to error analysis on
the Eyeball dev set. This can be useful if your algorithm has high bias—i.e., if it is not fitting
the training set well.
For example, suppose you are building a speech recognition system for an app and have
collected a training set of audio clips from volunteers. If your system is not doing well on the
training set, you might consider listening to a set of ~100 examples that the algorithm is
doing poorly on to understand the major categories of training set errors. Similar to the dev
set error analysis, you can count the errors in different categories:
Audio clip | Error categories noted (✔) | Comments
1 | ✔ | Car noise
2 | ✔ ✔ | Restaurant noise
3 | ✔ ✔ | User shouting across living room?
4 | ✔ | Coffeeshop
In this example, you might realize that your algorithm is having a particularly hard time with
training examples that have a lot of background noise. Thus, you might focus on techniques
that allow it to better fit training examples with background noise.
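If you track these annotations in code rather than a spreadsheet, a minimal sketch might look like the following; the category names and entries are hypothetical:

```python
from collections import Counter

# One entry per misrecognized training clip, tagged with the error categories you noted.
annotations = [
    {"loud background noise"},
    {"loud background noise", "far from microphone"},
    {"user spoke quickly", "far from microphone"},
    {"loud background noise"},
    # ... roughly 100 clips in total
]

counts = Counter(tag for tags in annotations for tag in tags)
for tag, n in counts.most_common():
    print(f"{tag}: {n}/{len(annotations)} clips ({n / len(annotations):.0%})")
```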
You might also double-check whether it is possible for a person to transcribe these audio
clips, given the same input audio as your learning algorithm. If there is so much background
noise that it is simply impossible for anyone to make out what was said, then it might be
unreasonable to expect any algorithm to correctly recognize such utterances. We will discuss
the benefits of comparing your algorithm to human-level performance in a later section.
• Add more training data: This is the simplest and most reliable way to address variance,
so long as you have access to significantly more data and enough computational power to
process the data.
• Add early stopping (i.e., stop gradient descent early, based on dev set error): This
technique reduces variance but increases bias. Early stopping behaves a lot like
regularization methods, and some authors call it a regularization technique.
• Decrease the model size (such as number of neurons/layers): Use with caution. This
technique could decrease variance, while possibly increasing bias. However, I don’t
recommend this technique for addressing variance. Adding regularization usually gives
better classification performance. The advantage of reducing the model size is reducing
your computational cost and thus speeding up how quickly you can train models. If
speeding up model training is useful, then by all means consider decreasing the model size.
But if your goal is to reduce variance, and you are not concerned about the computational
cost, consider adding regularization instead.
Here are two additional tactics, repeated from the previous chapter on addressing bias:
• Modify input features based on insights from error analysis: Say your error
analysis inspires you to create additional features that help the algorithm to eliminate a
particular category of errors. These new features could help with both bias and variance. In theory, adding more features could increase the variance; but if you find this to be the case, then use regularization, which will usually eliminate the increase in variance.
• Modify model architecture (such as the neural network architecture) so that it is more suitable for your problem: This technique can affect both bias and variance.
A learning curve plots your dev set error against the number of training examples. To plot it,
you would run your algorithm using different training set sizes. For example, if you have
1,000 examples, you might train separate copies of the algorithm on 100, 200, 300, …, 1000
examples. Then you could plot how dev set error varies with the training set size. Here is an
example:
As the training set size increases, the dev set error should decrease.
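Here is a minimal sketch of how such a curve can be produced. It uses scikit-learn with synthetic data purely so the example is self-contained; the model and the 100, 200, ..., 1000 schedule mirror the description above:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset, just so the sketch runs end to end.
X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
train_x, dev_x, train_y, dev_y = train_test_split(X, y, train_size=1000, random_state=0)

sizes = list(range(100, 1001, 100))
train_err, dev_err = [], []
for n in sizes:
    # Train a fresh copy of the algorithm on the first n training examples.
    model = LogisticRegression(max_iter=1000).fit(train_x[:n], train_y[:n])
    train_err.append(1 - model.score(train_x[:n], train_y[:n]))  # training error on those n examples
    dev_err.append(1 - model.score(dev_x, dev_y))                # dev error always on the full dev set

plt.plot(sizes, train_err, label="training error")
plt.plot(sizes, dev_err, label="dev error")
plt.xlabel("Training set size")
plt.ylabel("Error")
plt.legend()
plt.show()
```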
We will often have some “desired error rate” that we hope our learning algorithm will
eventually achieve. For example:
• If we hope for human-level performance, then the human error rate could be the “desired
error rate.”
• If our learning algorithm serves some product (such as delivering cat pictures), we might
have an intuition about what level of performance is needed to give users a great
experience.
You can visually extrapolate the red “dev error” curve to guess how much closer you could
get to the desired level of performance by adding more data. In the example above, it looks
plausible that doubling the training set size might allow you to reach the desired
performance.
But if the dev error curve has “plateaued” (i.e. flattened out), then you can immediately tell
that adding more data won’t get you to your goal:
Looking at the learning curve might therefore help you avoid spending months collecting
twice as much training data, only to realize it does not help.
Your training error usually increases as the training set size grows. Let's illustrate this effect with an example. Suppose your training set has only 2 examples:
One cat image and one non-cat image. Then it is easy for the learning algorithms to
“memorize” both examples in the training set, and get 0% training set error. Even if either or
both of the training examples were mislabeled, it is still easy for the algorithm to memorize
both labels.
Now suppose your training set has 100 examples. Perhaps even a few examples are
mislabeled, or ambiguous—some images are very blurry, so even humans cannot tell if there
is a cat. Perhaps the learning algorithm can still “memorize” most or all of the training set,
but it is now harder to obtain 100% accuracy. By increasing the training set from 2 to 100
examples, you will find that the training set accuracy will drop slightly.
Finally, suppose your training set has 10,000 examples. In this case, it becomes even harder
for the algorithm to perfectly fit all 10,000 examples, especially if some are ambiguous or
mislabeled. Thus, your learning algorithm will do even worse on this training set.
You can see that the blue “training error” curve increases with the size of the training set.
Furthermore, your algorithm usually does better on the training set than on the dev set; thus
the red dev error curve usually lies strictly above the blue training error curve.
We previously said that, if your dev error curve plateaus, you are unlikely to achieve the
desired performance just by adding data.
But it is hard to know exactly what an extrapolation of the red dev error curve will look like.
If the dev set was small, you would be even less certain because the curves could be noisy.
Suppose we add the training error curve to this plot and get the following:
Now, you can be absolutely sure that adding more data will not, by itself, be sufficient. Why
is that? Remember our two observations:
• As we add more training data, training error can only get worse. Thus, the blue training error curve can only stay the same or go higher, and so it can only move further away from the level of desired performance.
• The red dev error curve is usually higher than the blue training error. Thus, there's almost no way that adding more data would allow the red dev error curve to drop down to the desired level of performance when even the training error is higher than the desired level of performance.
Examining both the dev error curve and the training error curve on the same plot allows us
to more confidently extrapolate the dev error curve.
Suppose, for the sake of discussion, that the desired performance is our estimate of the
optimal error rate. The figure above is then the standard “textbook” example of what a
learning curve with high avoidable bias looks like: At the largest training set
size—presumably corresponding to all the training data we have—there is a large gap
between the training error and the desired performance, indicating large avoidable bias.
Furthermore, the gap between the training and dev curves is small, indicating small
variance.
Previously, we were measuring training and dev set error only at the rightmost point of this
plot, which corresponds to using all the available training data. Plotting the full learning
curve gives us a more comprehensive picture of the algorithm's performance on different
training set sizes.
The blue training error curve is relatively low, and the red dev error curve is much higher
than the blue training error. Thus, the bias is small, but the variance is large. Adding more
training data will probably help close the gap between dev error and training error.
This time, the training error is large, as it is much higher than the desired level of
performance. The dev error is also much larger than the training error. Thus, you have
significant bias and significant variance. You will have to find a way to reduce both bias and
variance in your algorithm.
When training on just 10 randomly chosen examples, you might be unlucky and have a
particularly “bad” training set, such as one with many ambiguous/mislabeled examples. Or,
you might get lucky and get a particularly “good” training set. Having a small training set
means that the dev and training errors may randomly fluctuate.
If your machine learning application is heavily skewed toward one class (such as a cat
classification task where the fraction of negative examples is much larger than positive
examples), or if it has a huge number of classes (such as recognizing 100 different animal
species), then the chance of selecting an especially “unrepresentative” or bad training set is
also larger. For example, if 80% of your examples are negative examples (y=0), and only
20% are positive examples (y=1), then there is a chance that a training set of 10 examples
contains only negative examples, thus making it very difficult for the algorithm to learn
something meaningful.
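To put a number on this risk: if each example is negative with probability 0.8, the chance that all 10 randomly drawn training examples are negative is 0.8^10 ≈ 10.7%, so roughly one in ten such training sets would contain no positive examples at all.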
If the noise in the training curve makes it hard to see the true trends, here are two solutions:
• Instead of training just one model on 10 examples, select several (say 3-10) different randomly chosen training sets of 10 examples by sampling with replacement
from your original set of 100. Train a different model on each of these, and compute the
training and dev set error of each of the resulting models. Compute and plot the average
training error and average dev set error.
• If your training set is skewed towards one class, or if it has many classes, choose a "balanced" subset instead of picking 10 training examples at random out of the set of 100. For example, you can make sure that 2/10 of the examples are positive examples, and 8/10 are negative. More generally, you can make sure the fraction of examples from each class is as close as possible to the overall fraction in the original training set.

10. Here's what sampling with replacement means: You would randomly pick 10 different examples out of the 100 to form your first training set. Then to form the second training set, you would again pick 10 examples, but without taking into account what had been chosen in the first training set. Thus, it is possible for one specific example to appear in both the first and second training sets. In contrast, if you were sampling without replacement, the second training set would be chosen from just the 90 examples that had not been chosen the first time around. In practice, sampling with or without replacement shouldn't make a huge difference, but the former is common practice.
I would not bother with either of these techniques unless you have already tried plotting
learning curves and concluded that the curves are too noisy to see the underlying trends. If
your training set is large—say over 10,000 examples—and your class distribution is not very
skewed, you probably won’t need these techniques.
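If you do try the first tactic, here is a minimal sketch of averaging over several training sets sampled with replacement; it assumes a scikit-learn-style classifier, and the dataset variables and model_factory are placeholders you would supply:

```python
import numpy as np

def averaged_errors(train_x, train_y, dev_x, dev_y, model_factory,
                    subset_size=10, num_repeats=5, seed=0):
    """Train on several training subsets sampled with replacement and average the errors."""
    rng = np.random.default_rng(seed)
    train_errs, dev_errs = [], []
    for _ in range(num_repeats):
        idx = rng.choice(len(train_x), size=subset_size, replace=True)  # sample with replacement
        # With a heavily skewed class distribution, draw a class-balanced subset here instead
        # (the second tactic above), so every subset contains examples from each class.
        model = model_factory().fit(train_x[idx], train_y[idx])
        train_errs.append(1 - model.score(train_x[idx], train_y[idx]))
        dev_errs.append(1 - model.score(dev_x, dev_y))
    return float(np.mean(train_errs)), float(np.mean(dev_errs))

# Hypothetical usage, repeated for each training set size you want on the learning curve:
# avg_train, avg_dev = averaged_errors(train_x, train_y, dev_x, dev_y,
#                                      lambda: LogisticRegression(max_iter=1000), subset_size=10)
```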
Finally, plotting a learning curve may be computationally expensive: For example, you might
have to train ten models with 1,000, then 2,000, all the way up to 10,000 examples. Training
models with small datasets is much faster than training models with large datasets. Thus,
instead of evenly spacing out the training set sizes on a linear scale as above, you might train
models with 1,000, 2,000, 4,000, 6,000, and 10,000 examples. This should still give you a
clear sense of the trends in the learning curves. Of course, this technique is relevant only if
the computational cost of training all the additional models is significant.
Further, there are several reasons building an ML system is easier if you are trying to do a
task that people can do well:
1. Ease of obtaining data from human labelers. For example, since people recognize
cat images well, it is straightforward for people to provide high accuracy labels for your
learning algorithm.
2. Error analysis can draw on human intuition. If the algorithm gets an example wrong that a person gets right, you can ask the person why they got it right and use that insight to improve the system.
3. Use human-level performance to estimate the optimal error rate and also set
a “desired error rate.” Suppose your algorithm achieves 10% error on a task, but a person
achieves 2% error. Then we know that the optimal error rate is 2% or lower and the
avoidable bias is at least 8%. Thus, you should try bias-reducing techniques.
Even though item #3 might not sound important, I find that having a reasonable and
achievable target error rate helps accelerate a team’s progress. Knowing your algorithm has
high avoidable bias is incredibly valuable and opens up a menu of options to try.
There are some tasks that even humans aren’t good at. For example, picking a book to
recommend to you; or picking an ad to show a user on a website; or predicting the stock
market. Computers already surpass the performance of most people on these tasks. With
these applications, we run into the following problems:
• It is harder to obtain labels. For example, it’s hard for human labelers to annotate a
database of users with the “optimal” book recommendation. If you operate a website or
app that sells books, you can obtain data by showing books to users and seeing what they
buy. If you do not operate such a site, you need to find more creative ways to get data.
• It is hard to know what the optimal error rate and reasonable desired error
rate is. Suppose you already have a book recommendation system that is doing quite
well. How do you know how much more it can improve without a human baseline?
Suppose you are working on a medical imaging task where a junior doctor achieves 10% error, an experienced doctor achieves 5% error, and a small team of doctors that discusses and debates each image achieves 2% error. In this case, I would use 2% as the human-level performance proxy for our optimal error
rate. You can also set 2% as the desired performance level because all three reasons from the
previous chapter for comparing to human-level performance apply:
• Ease of obtaining labeled data from human labelers. You can get a team of doctors
to provide labels to you with a 2% error rate.
• Error analysis can draw on human intuition. By discussing images with a team of
doctors, you can draw on their intuitions.
• Use human-level performance to estimate the optimal error rate and also set
achievable “desired error rate.” It is reasonable to use 2% error as our estimate of the
optimal error rate. The optimal error rate could be even lower than 2%, but it cannot be
higher, since it is possible for a team of doctors to achieve 2% error. In contrast, it is not
reasonable to use 5% or 10% as an estimate of the optimal error rate, since we know these
estimates are necessarily too high.
When it comes to obtaining labeled data, you might not want to discuss every image with an
entire team of doctors since their time is expensive. Perhaps you can have a single junior
doctor label the vast majority of cases and bring only the harder cases to more experienced
doctors or to the team of doctors.
If your system is currently at 40% error, then it doesn’t matter much whether you use a
junior doctor (10% error) or an experienced doctor (5% error) to label your data and provide
intuitions. But if your system is already at 10% error, then defining the human-level
reference as 2% gives you better tools to keep improving your system.
If you can identify a subset of data in which humans significantly surpass your system, then
you can still use those techniques to drive rapid progress. For example, suppose your system
is much better than people at recognizing speech in noisy audio, but humans are still better
at transcribing very rapidly spoken speech.
For this subset of rapidly spoken speech:
1. You can still obtain transcripts from humans that are higher quality than your algorithm's
output.
2. You can draw on human intuition to understand why they correctly heard a rapidly
spoken utterance when your system didn’t.
3. You can use human-level performance on rapidly spoken speech as a desired performance
target.
More generally, so long as there are dev set examples where humans are right and your
algorithm is wrong, then many of the techniques described earlier will apply. This is true
even if, averaged over the entire dev/test set, your performance is already surpassing
human-level performance.
There are many important machine learning applications where machines surpass human
level performance. For example, machines are better at predicting movie ratings, how long it
takes for a delivery car to drive somewhere, or whether to approve loan applications. Once humans have a hard time identifying examples that the algorithm is clearly getting wrong, only a subset of these techniques apply. Consequently, progress is usually slower on problems
where machines already surpass human-level performance, while progress is faster when
machines are still trying to catch up to humans.
Suppose you are building a cat picture mobile app, and users have uploaded 10,000 images to the app. You have also downloaded an additional 200,000 images from the internet. Since the 10,000 user images closely reflect the actual probability distribution of data you
want to do well on, you might use that for your dev and test sets. If you are training a
data-hungry deep learning algorithm, you might give it the additional 200,000 internet
images for training. Thus, your training and dev/test sets come from different probability
distributions. How does this affect your work?
Instead of partitioning our data into train/dev/test sets, we could take all 210,000 images we
have, and randomly shuffle them into train/dev/test sets. In this case, all the data comes
from the same distribution. But I recommend against this method, because about
200,000/210,000 ≈ 95.2% of your dev/test data would come from internet images, which does
does not reflect the actual distribution you want to do well on. Remember our
recommendation on choosing dev/test sets:
Choose dev and test sets to reflect data you expect to get in the future
and want to do well on.
Most of the academic literature on machine learning assumes that the training set, dev set, and test set all come from the same distribution. In the early days of machine learning, data
was scarce. We usually only had one dataset drawn from some probability distribution. So
we would randomly split that data into train/dev/test sets, and the assumption that all the
data was coming from the same source was usually satisfied.
11. There is some academic research on training and testing on different distributions. Examples
include “domain adaptation,” “transfer learning” and “multitask learning.” But there is still a huge
gap between theory and practice. If you train on dataset A and test on some very different type of data
B, luck could have a huge effect on how well your algorithm performs. (Here, “luck” includes the
researcher’s hand-designed features for the particular task, as well as other factors that we just don’t
understand yet.) This makes the academic study of training and testing on different distributions
difficult to carry out in a systematic way.
For the cat detector example, instead of putting all 10,000 user-uploaded images into the
dev/test sets, we might instead put 5,000 into the dev/test sets. We can put the remaining
5,000 user-uploaded examples into the training set. This way, your training set of 205,000
examples contains some data that comes from your dev/test distribution along with the
200,000 internet images. We will discuss in a later chapter why this method is helpful.
Let’s consider a second example. Suppose you are building a speech recognition system to
transcribe street addresses for a voice-controlled mobile map/navigation app. You have
20,000 examples of users speaking street addresses. But you also have 500,000 examples of
other audio clips with users speaking about other topics. You might take 10,000 examples of
street addresses for the dev/test sets, and use the remaining 10,000, plus the additional
500,000 examples, for training.
We will continue to assume that your dev data and your test data come from the same
distribution. But it is important to understand that different training and dev/test
distributions offer some special challenges.
But in the modern era of powerful, flexible learning algorithms—such as large neural
networks—this risk has greatly diminished. If you can afford to build a neural network with a
large enough number of hidden units/layers, you can safely add the 20,000 images to your
training set. Adding the images is more likely to increase your performance.
This observation relies on the fact that there is some x —> y mapping that works well for
both types of data. In other words, there exists some system that inputs either an internet
image or a mobile app image and reliably predicts the label, even without knowing the
source of the image.
Adding the extra internet images to your training set has two effects:
1. It gives your neural network more examples of what cats do/do not look like. This is
helpful, since internet images and user-uploaded mobile app images do share some
similarities. Your neural network can apply some of the knowledge acquired from internet
images to mobile app images.
2. It forces the neural network to expend some of its capacity to learn about properties that
are specific to internet images (such as higher resolution, different distributions of how
the images are framed, etc.) If these properties differ greatly from mobile app images, it
will “use up” some of the representational capacity of the neural network. Thus there is
less capacity for recognizing data drawn from the distribution of mobile app images,
which is what you really care about. Theoretically, this could hurt your algorithms’
performance.
Fortunately, if you have the computational capacity needed to build a big enough neural
network—i.e., a big enough attic—then this is not a serious concern. You have enough
capacity to learn from both internet and from mobile app images, without the two types of
data competing for capacity. Your algorithm’s “brain” is big enough that you don’t have to
worry about running out of attic space.
But if you do not have a big enough neural network (or another highly flexible learning
algorithm), then you should pay more attention to your training data matching your dev/test
set distribution.
If you think you have data that has no benefit, you should just leave out that data for computational reasons. For example, suppose your dev/test sets contain mainly casual pictures of people, places, landmarks, and animals. Suppose you also have a large collection of scanned historical documents.
These documents don’t contain anything resembling a cat. They also look completely unlike
your dev/test distribution. There is no point including this data as negative examples,
because the benefit from the first effect above is negligible—there is almost nothing your
neural network can learn from this data that it can apply to your dev/test set distribution.
Including them would waste computation resources and representation capacity of the
neural network.
Suppose you want to predict housing prices in New York City, taking the size of a house as the input x and predicting its price y. Housing prices in New York City are very high. Suppose you have a second dataset of
housing prices in Detroit, Michigan, where housing prices are much lower. Should you
include this data in your training set?
Given the same size x, the price of a house y is very different depending on whether it is in
New York City or in Detroit. If you only care about predicting New York City housing prices,
putting the two datasets together will hurt your performance. In this case, it would be better
to leave out the inconsistent Detroit data.
How is this New York City vs. Detroit example different from the mobile app vs. internet cat
images example?
The cat image example is different because, given an input picture x, one can reliably predict
the label y indicating whether there is a cat, even without knowing if the image is an internet
image or a mobile app image. I.e., there is a function f(x) that reliably maps from the input x
to the target output y, even without knowing the origin of x. Thus, the task of recognition
from internet images is “consistent” with the task of recognition from mobile app images.
This means there was little downside (other than computational cost) to including all the
data, and some possible significant upside. In contrast, New York City and Detroit, Michigan
data are not consistent. Given the same x (size of house), the price is very different
depending on where the house is.
13. There is one way to address the problem of Detroit data being inconsistent with New York City
data, which is to add an extra feature to each training example indicating the city. Given an input
x—which now specifies the city—the target value of y is now unambiguous. However, in practice I do
not see this done frequently.
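As a hypothetical sketch of the footnote's suggestion, you could append a city indicator to each example so that the same house size no longer maps to two very different prices; the sizes and prices below are made-up numbers:

```python
import numpy as np

# Made-up examples: house size in square feet, the city, and the price (the label y).
sizes_sqft = np.array([800.0, 1200.0, 800.0, 1200.0])
city = np.array(["nyc", "nyc", "detroit", "detroit"])
prices = np.array([900_000.0, 1_400_000.0, 60_000.0, 95_000.0])

# Add an indicator feature for the city, so x = (size, is_nyc) is no longer ambiguous.
is_nyc = (city == "nyc").astype(float)
X = np.column_stack([sizes_sqft, is_nyc])  # each row: [size_sqft, 1.0 if NYC else 0.0]
```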
But in practice, having 40x as many internet images as mobile app images might mean you
need to spend 40x (or more) as much computational resources to model both, compared to if
you trained on only the 5,000 images.
If you don’t have huge computational resources, you could give the internet images a much
lower weight as a compromise.
For example, suppose your optimization objective is squared error. (This is not a good choice for a classification task, but it will simplify our explanation.) Thus, our learning algorithm tries to optimize:

min_θ Σ_{(x,y)∈MobileImg} (h_θ(x) − y)² + Σ_{(x,y)∈InternetImg} (h_θ(x) − y)²
The first sum above is over the 5,000 mobile images, and the second sum is over the
200,000 internet images. You can instead optimize with an additional parameter 𝛽:

min_θ Σ_{(x,y)∈MobileImg} (h_θ(x) − y)² + 𝛽 Σ_{(x,y)∈InternetImg} (h_θ(x) − y)²
If you set 𝛽=1/40, the algorithm would give equal weight to the 5,000 mobile images and the
200,000 internet images. You can also set the parameter 𝛽 to other values, perhaps by
tuning to the dev set.
By weighting the additional Internet images less, you don’t have to build as massive a neural
network to make sure the algorithm does well on both types of tasks. This type of
re-weighting is needed only when you suspect the additional data (Internet Images) has a
very different distribution than the dev/test set, or if the additional data is much larger than
the data that came from the same distribution as the dev/test set (mobile images).
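Here is a minimal numpy sketch of this weighted objective; the prediction and label arrays are placeholders, and squared error is used only to mirror the formula above:

```python
import numpy as np

def weighted_squared_error(pred_mobile, y_mobile, pred_internet, y_internet, beta):
    """Squared-error objective in which every internet example is down-weighted by beta."""
    mobile_term = np.sum((pred_mobile - y_mobile) ** 2)
    internet_term = beta * np.sum((pred_internet - y_internet) ** 2)
    return mobile_term + internet_term

# With 5,000 mobile and 200,000 internet examples, beta = 1/40 gives both sources
# the same total weight in the objective.
beta = 1 / 40
```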
When your training set comes from a different distribution than your dev/test sets, there are three ways your algorithm might fall short of the performance you want:
1. It does not do well on the training set. This is the problem of high (avoidable) bias on the
training set distribution.
2. It does well on the training set, but does not generalize well to previously unseen data
drawn from the same distribution as the training set. This is high variance.
3. It generalizes well to new data drawn from the same distribution as the training set, but
not to data drawn from the dev/test set distribution. We call this problem data
mismatch, since it is because the training set data is a poor match for the dev/test set
data.
For example, suppose that humans achieve near perfect performance on the cat recognition task. Your algorithm achieves this:
• 1% error on the training set
• 1.5% error on data drawn from the same distribution as the training set that the algorithm has not seen
• 10% error on the dev set
In this case, you clearly have a data mismatch problem. To address this, you might try to
make the training data more similar to the dev/test data. We discuss some techniques for
this later.
In order to diagnose to what extent an algorithm suffers from each of the problems 1-3
above, it will be useful to have another dataset. Specifically, rather than giving the algorithm
all the available training data, you can split it into two subsets: The actual training set which
the algorithm will train on, and a separate set, which we will call the “Training dev” set, that
we will not train on.
• Training set: This is the data the algorithm will learn from (e.g., Internet images + Mobile images). It does not have to be drawn from the same distribution as what we really care about (the dev/test set distribution).
• Training dev set: This data is drawn from the same distribution as the training set (e.g.,
Internet images + Mobile images). This is usually smaller than the training set; it only
needs to be large enough to evaluate and track the progress of our learning algorithm.
• Dev set: This is drawn from the same distribution as the test set, and it reflects the
distribution of data that we ultimately care about doing well on. (E.g., mobile images.)
• Test set: This is drawn from the same distribution as the dev set. (E.g., mobile images.)
Armed with these four separate datasets, you can now evaluate:
• Training error, by evaluating on the training set.
• The algorithm's ability to generalize to new data drawn from the training set distribution,
by evaluating on the training dev set.
• The algorithm’s performance on the task you care about, by evaluating on the dev and/or
test sets.
Most of the guidelines in Chapters 5-7 for picking the size of the dev set also apply to the
training dev set.
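Putting these pieces together, here is a minimal sketch of carving out a training dev set and reading off the three error components; the split fraction and function names are illustrative, and the error rates in the example call mirror the cat example above:

```python
import numpy as np

def split_off_training_dev(train_x, train_y, fraction=0.05, seed=0):
    """Hold out a random slice of the (internet + mobile) training data as a 'training dev' set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(train_x))
    n_hold = int(len(train_x) * fraction)
    hold, keep = idx[:n_hold], idx[n_hold:]
    return train_x[keep], train_y[keep], train_x[hold], train_y[hold]

def diagnose(train_err, training_dev_err, dev_err, optimal_err=0.0):
    """Attribute the dev set error to avoidable bias, variance, and data mismatch."""
    return {
        "avoidable bias": train_err - optimal_err,
        "variance": training_dev_err - train_err,
        "data mismatch": dev_err - training_dev_err,
    }

# Illustrative rates (cf. the cat example above): ~0% optimal, 1% training,
# 1.5% training dev, 10% dev error.
print(diagnose(train_err=0.01, training_dev_err=0.015, dev_err=0.10))
# -> the "data mismatch" term dominates, so make the training data more like the dev/test data.
```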