Machine Learning Yearning


Machine Learning Yearning is a deeplearning.ai project.

© 2018 Andrew Ng. All Rights Reserved.



Table of Contents

1 Why Machine Learning Strategy

2 How to use this book to help your team

3 Prerequisites and Notation

4 Scale drives machine learning progress

5 Your development and test sets

6 Your dev and test sets should come from the same distribution

7 How large do the dev/test sets need to be?

8 Establish a single-number evaluation metric for your team to optimize

9 Optimizing and satisficing metrics

10 Having a dev set and metric speeds up iterations

11 When to change dev/test sets and metrics

12 Takeaways: Setting up development and test sets

13 Build your first system quickly, then iterate

14 Error analysis: Look at dev set examples to evaluate ideas

15 Evaluating multiple ideas in parallel during error analysis

16 Cleaning up mislabeled dev and test set examples

17 If you have a large dev set, split it into two subsets, only one of which you look at

18 How big should the Eyeball and Blackbox dev sets be?

19 Takeaways: Basic error analysis

20 Bias and Variance: The two big sources of error

21 Examples of Bias and Variance

22 Comparing to the optimal error rate

23 Addressing Bias and Variance

24 Bias vs. Variance tradeoff

25 Techniques for reducing avoidable bias


26 Error analysis on the training set

27 Techniques for reducing variance

28 Diagnosing bias and variance: Learning curves

29 Plotting training error

30 Interpreting learning curves: High bias

31 Interpreting learning curves: Other cases

32 Plotting learning curves

33 Why we compare to human-level performance

34 How to define human-level performance

35 Surpassing human-level performance

36 When you should train and test on different distributions

37 How to decide whether to use all your data

38 How to decide whether to include inconsistent data

39 Weighting data

40 Generalizing from the training set to the dev set

41 Identifying Bias, Variance, and Data Mismatch Errors

42 Addressing data mismatch

43 Artificial data synthesis

44 The Optimization Verification test

45 General form of Optimization Verification test

46 Reinforcement learning example

47 The rise of end-to-end learning

48 More end-to-end learning examples

49 Pros and cons of end-to-end learning

50 Choosing pipeline components: Data availability

51 Choosing pipeline components: Task simplicity



52 Directly learning rich outputs

53 Error analysis by parts

54 Attributing error to one part

55 General case of error attribution

56 Error analysis by parts and comparison to human-level performance

57 Spotting a flawed ML pipeline

58 Building a superhero team - Get your teammates to read this



21 Examples of Bias and Variance
Consider our cat classification task. An “ideal” classifier (such as a human) might achieve
nearly perfect performance in this task.

Suppose your algorithm performs as follows:

• Training error = 1%

• Dev error = 11%

What problem does it have? Applying the definitions from the previous chapter, we estimate the bias as 1%, and the variance as 10% (=11%-1%). Thus, it has high variance. The classifier has very low training error, but it is failing to generalize to the dev set. This is also called overfitting.

Now consider this:

• Training error = 15%

• Dev error = 16%

We estimate the bias as 15%, and variance as 1%. This classifier is fitting the training set
poorly with 15% error, but its error on the dev set is barely higher than the training error.
This classifier therefore has high bias, but low variance. We say that this algorithm is underfitting.

Now, consider this:

• Training error = 15%

• Dev error = 30%

We estimate the bias as 15%, and variance as 15%. This classifier has high bias and high variance: It is doing poorly on the training set, and therefore has high bias, and its
performance on the dev set is even worse, so it also has high variance. The
overfitting/underfitting terminology is hard to apply here since the classifier is
simultaneously overfitting and underfitting.



Finally, consider this:

• Training error = 0.5%

• Dev error = 1%

This classifier is doing well, as it has low bias and low variance. Congratulations on achieving
this great performance!



22 Comparing to the optimal error rate
In our cat recognition example, the “ideal” error rate—that is, one achievable by an “optimal”
classifier—is nearly 0%. A human looking at a picture would be able to recognize if it
contains a cat almost all the time; thus, we can hope for a machine that would do just as well.

Other problems are harder. For example, suppose that you are building a speech recognition
system, and find that 14% of the audio clips have so much background noise or are so
unintelligible that even a human cannot recognize what was said. In this case, even the most
“optimal” speech recognition system might have error around 14%.

Suppose that on this speech recognition problem, your algorithm achieves:

• Training error = 15%

• Dev error = 30%

The training set performance is already close to the optimal error rate of 14%. Thus, there is
not much room for improvement in terms of bias or in terms of training set performance.
However, this algorithm is not generalizing well to the dev set; thus there is ample room for
improvement in the errors due to variance.

This example is similar to the third example from the previous chapter, which also had a
training error of 15% and dev error of 30%. If the optimal error rate is ~0%, then a training
error of 15% leaves much room for improvement. This suggests bias-reducing changes might
be fruitful. But if the optimal error rate is 14%, then the same training set performance tells
us that there’s little room for improvement in the classifier’s bias.

For problems where the optimal error rate is far from zero, here’s a more detailed breakdown of an algorithm’s error. Continuing with our speech recognition example above, the total dev set error of 30% can be broken down as follows (a similar analysis can be applied to the test set error):

• Optimal error rate (“unavoidable bias”): 14%. Suppose we decide that, even with the best possible speech system in the world, we would still suffer 14% error. We can think of this as the “unavoidable” part of a learning algorithm’s bias.



• Avoidable bias: 1%. This is calculated as the difference between the training error and the optimal error rate.[8]

• Variance: 15%. The difference between the dev error and the training error.[9]

To relate this to our earlier definitions, Bias and Avoidable Bias are related as follows:

Bias = Optimal error rate (“unavoidable bias”) + Avoidable bias

The “avoidable bias” reflects how much worse your algorithm performs on the training set
than the “optimal classifier.”

The concept of variance remains the same as before. In theory, we can always reduce
variance to nearly zero by training on a massive training set. Thus, all variance is “avoidable”
with a sufficiently large dataset, so there is no such thing as “unavoidable variance.”
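To make the arithmetic above concrete, here is a minimal sketch in Python (the book itself uses no code; the helper below is purely illustrative, not part of the text):

    def error_decomposition(training_error, dev_error, optimal_error):
        """Split dev set error into unavoidable bias, avoidable bias, and variance.

        All inputs and outputs are error rates expressed as fractions (0.15 = 15%).
        """
        return {
            "unavoidable_bias": optimal_error,                  # the optimal (Bayes) error rate
            "avoidable_bias": training_error - optimal_error,   # training error above optimal
            "variance": dev_error - training_error,             # gap between dev and training error
        }

    # Speech recognition example from this chapter (optimal error rate 14%):
    print(error_decomposition(training_error=0.15, dev_error=0.30, optimal_error=0.14))
    # Roughly: unavoidable_bias 0.14, avoidable_bias 0.01, variance 0.15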

Consider one more example, where the optimal error rate is 14%, and we have:

• Training error = 15%

• Dev error = 16%

Whereas in the previous chapter we called this a high bias classifier, now we would say that
error from avoidable bias is 1%, and the error from variance is about 1%. Thus, the algorithm
is already doing well, with little room for improvement. It is only 2% worse than the optimal
error rate.

We see from these examples that knowing the optimal error rate is helpful for guiding our
next steps. In statistics, the optimal error rate is also called Bayes error rate, or Bayes rate.

How do we know what the optimal error rate is? For tasks that humans are reasonably good at, such as recognizing pictures or transcribing audio clips, you can ask a human to provide labels and then measure the accuracy of those labels relative to your training set. This would give an estimate of the optimal error rate. If you are working on a problem that even humans have a hard time solving (e.g., predicting what movie to recommend, or what ad to show to a user), it can be hard to estimate the optimal error rate.

[8] If this number is negative, you are doing better on the training set than the optimal error rate. This means you are overfitting on the training set, and the algorithm has over-memorized the training set. You should focus on variance reduction methods rather than on further bias reduction methods.

[9] These definitions are chosen to convey insight on how to improve your learning algorithm. These definitions are different from how statisticians define Bias and Variance. Technically, what I define here as “Bias” should be called “Error we attribute to bias”; and “Avoidable bias” should be “error we attribute to the learning algorithm’s bias that is over the optimal error rate.”

In the section “Comparing to human-level performance” (Chapters 33 to 35), I will discuss in more detail the process of comparing a learning algorithm’s performance to human-level performance.

In the last few chapters, you learned how to estimate avoidable/unavoidable bias and
variance by looking at training and dev set error rates. The next chapter will discuss how you
can use insights from such an analysis to prioritize techniques that reduce bias vs.
techniques that reduce variance. There are very different techniques that you should apply
depending on whether your project’s current problem is high (avoidable) bias or high
variance. Read on!



23 Addressing Bias and Variance
Here is the simplest formula for addressing bias and variance issues:

• If you have high avoidable bias, increase the size of your model (for example, increase the
size of your neural network by adding layers/neurons).

• If you have high variance, add data to your training set.

If you are able to increase the neural network size and increase training data without limit, it
is possible to do very well on many learning problems.

In practice, increasing the size of your model will eventually cause you to run into
computational problems because training very large models is slow. You might also exhaust
your ability to acquire more training data. (Even on the internet, there is only a finite
number of cat pictures!)

Different model architectures—for example, different neural network architectures—will have different amounts of bias/variance for your problem. A lot of recent deep learning research has developed many innovative model architectures. So if you are using neural networks, the academic literature can be a great source of inspiration. There are also many great open-source implementations on GitHub. But the results of trying new architectures are less predictable than the simple formula of increasing the model size and adding data.

Increasing the model size generally reduces bias, but it might also increase variance and the
risk of overfitting. However, this overfitting problem usually arises only when you are not
using regularization. If you include a well-designed regularization method, then you can
usually safely increase the size of the model without increasing overfitting.

Suppose you are applying deep learning, with L2 regularization or dropout, with the
regularization parameter that performs best on the dev set. If you increase the model size,
usually your performance will stay the same or improve; it is unlikely to worsen significantly.
The only reason to avoid using a bigger model is the increased computational cost.
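As a minimal sketch of this point, assuming PyTorch (the book does not prescribe any framework, and the layer sizes, dropout rate, and weight decay below are placeholders to be tuned on your dev set):

    import torch.nn as nn
    import torch.optim as optim

    def build_classifier(hidden_size=512, num_layers=4, dropout_p=0.5, in_features=784):
        """Fully connected cat/not-cat classifier; raise hidden_size or num_layers to cut bias."""
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(in_features, hidden_size), nn.ReLU(), nn.Dropout(dropout_p)]
            in_features = hidden_size
        layers.append(nn.Linear(in_features, 1))      # single logit for a binary label
        return nn.Sequential(*layers)

    model = build_classifier(hidden_size=1024, num_layers=6)   # a bigger model than, say, 512x4
    # L2 regularization via weight decay, kept in place as the model grows.
    optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

The point is only that, with dropout and weight decay tuned on the dev set, enlarging the model is usually a safe first move against high avoidable bias, at the cost of more computation.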



24 Bias vs. Variance tradeoff
You might have heard of the “Bias vs. Variance tradeoff.” Of the changes you could make to
most learning algorithms, there are some that reduce bias errors but at the cost of increasing
variance, and vice versa. This creates a “trade off” between bias and variance.

For example, increasing the size of your model—adding neurons/layers in a neural network,
or adding input features—generally reduces bias but could increase variance. Alternatively,
adding regularization generally increases bias but reduces variance.

In the modern era, we often have access to plentiful data and can use very large neural
networks (deep learning). Therefore, there is less of a tradeoff, and there are now more
options for reducing bias without hurting variance, and vice versa.

For example, you can usually increase a neural network size and tune the regularization
method to reduce bias without noticeably increasing variance. By adding training data, you
can also usually reduce variance without affecting bias.

If you select a model architecture that is well suited for your task, you might also reduce bias
and variance simultaneously. Selecting such an architecture can be difficult.

In the next few chapters, we discuss additional specific techniques for addressing bias and
variance.



25 Techniques for reducing avoidable bias
If your learning algorithm suffers from high avoidable bias, you might try the following
techniques:

• Increase the model size (such as number of neurons/layers): This technique reduces bias, since it should allow you to fit the training set better. If you find that this increases variance, then use regularization, which will usually eliminate the increase in variance.

• Modify input features based on insights from error analysis: Say your error
analysis inspires you to create additional features that help the algorithm eliminate a
particular category of errors. (We discuss this further in the next chapter.) These new
features could help with both bias and variance. In theory, adding more features could
increase the variance; but if you find this to be the case, then use regularization, which will
usually eliminate the increase in variance.

• Reduce or eliminate regularization (L2 regularization, L1 regularization, dropout): This will reduce avoidable bias, but increase variance.

• Modify model architecture (such as neural network architecture) so that it is more suitable for your problem: This technique can affect both bias and variance.

One method that is not helpful:

• Add more training data: This technique helps with variance problems, but it usually
has no significant effect on bias.



26 Error analysis on the training set
Your algorithm must perform well on the training set before you can expect it to perform
well on the dev/test sets.

In addition to the techniques described earlier to address high bias, I sometimes also carry
out an error analysis on the ​training data​, following a protocol similar to error analysis on
the Eyeball dev set. This can be useful if your algorithm has high bias—i.e., if it is not fitting
the training set well.

For example, suppose you are building a speech recognition system for an app and have
collected a training set of audio clips from volunteers. If your system is not doing well on the
training set, you might consider listening to a set of ~100 examples that the algorithm is
doing poorly on to understand the major categories of training set errors. Similar to the dev
set error analysis, you can count the errors in different categories:

Audio clip | Loud background noise | User spoke quickly | Far from microphone | Comments
1          | ✔                     |                    |                     | Car noise
2          | ✔                     |                    | ✔                   | Restaurant noise
3          |                       | ✔                  | ✔                   | User shouting across living room?
4          | ✔                     |                    |                     | Coffeeshop
% of total | 75%                   | 25%                | 50%                 |

In this example, you might realize that your algorithm is having a particularly hard time with
training examples that have a lot of background noise. Thus, you might focus on techniques
that allow it to better fit training examples with background noise.
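If you log this error analysis in a spreadsheet or a small script, tallying the categories takes only a few lines. A minimal sketch in plain Python (the records below simply mirror the table above; the field names are illustrative):

    from collections import Counter

    # One record per misrecognized training clip, tagging the error categories that apply.
    errors = [
        {"clip": 1, "categories": {"loud background noise"}},
        {"clip": 2, "categories": {"loud background noise", "far from microphone"}},
        {"clip": 3, "categories": {"user spoke quickly", "far from microphone"}},
        {"clip": 4, "categories": {"loud background noise"}},
    ]

    counts = Counter(cat for e in errors for cat in e["categories"])
    for category, n in counts.most_common():
        print(f"{category}: {100 * n / len(errors):.0f}% of clips examined")
    # loud background noise: 75%, far from microphone: 50%, user spoke quickly: 25%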

You might also double-check whether it is possible for a person to transcribe these audio
clips, given the same input audio as your learning algorithm. If there is so much background
noise that it is simply impossible for anyone to make out what was said, then it might be
unreasonable to expect any algorithm to correctly recognize such utterances. We will discuss
the benefits of comparing your algorithm to human-level performance in a later section.



27 Techniques for reducing variance
If your learning algorithm suffers from high variance, you might try the following
techniques:

• Add more training data: This is the simplest and most reliable way to address variance,
so long as you have access to significantly more data and enough computational power to
process the data.

• Add regularization (L2 regularization, L1 regularization, dropout): This technique reduces variance but increases bias.

• Add early stopping (i.e., stop gradient descent early, based on dev set error): This technique reduces variance but increases bias. Early stopping behaves a lot like regularization methods, and some authors call it a regularization technique. (A minimal sketch of early stopping appears after this list.)

• Feature selection to decrease number/type of input features: This technique might help with variance problems, but it might also increase bias. Reducing the number of features slightly (say going from 1,000 features to 900) is unlikely to have a huge effect on bias. Reducing it significantly (say going from 1,000 features to 100—a 10x reduction) is more likely to have a significant effect, so long as you are not excluding too many useful features. In modern deep learning, when data is plentiful, there has been a shift away from feature selection, and we are now more likely to give all the features we have to the algorithm and let the algorithm sort out which ones to use based on the data. But when your training set is small, feature selection can be very useful.

• Decrease the model size (such as number of neurons/layers): Use with caution. This
technique could decrease variance, while possibly increasing bias. However, I don’t
recommend this technique for addressing variance. Adding regularization usually gives
better classification performance. The advantage of reducing the model size is reducing
your computational cost and thus speeding up how quickly you can train models. If
speeding up model training is useful, then by all means consider decreasing the model size.
But if your goal is to reduce variance, and you are not concerned about the computational
cost, consider adding regularization instead.
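Here is the early-stopping sketch promised above (framework-agnostic Python; train_one_epoch and dev_error are hypothetical callables you would supply—one gradient-descent pass and one dev set evaluation, respectively):

    import copy

    def train_with_early_stopping(model, train_one_epoch, dev_error,
                                  max_epochs=100, patience=5):
        """Stop training once dev set error has not improved for `patience` epochs."""
        best_error, best_model, stale_epochs = float("inf"), None, 0
        for _ in range(max_epochs):
            train_one_epoch(model)            # one pass of gradient descent over the training set
            error = dev_error(model)          # monitor the dev set, not the training set
            if error < best_error:
                best_error, best_model, stale_epochs = error, copy.deepcopy(model), 0
            else:
                stale_epochs += 1
                if stale_epochs >= patience:  # dev error has plateaued
                    break
        return best_model, best_error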

Here are two additional tactics, repeated from the previous chapter on addressing bias:

• Modify input features based on insights from error analysis: Say your error analysis inspires you to create additional features that help the algorithm to eliminate a particular category of errors. These new features could help with both bias and variance. In theory, adding more features could increase the variance; but if you find this to be the case, then use regularization, which will usually eliminate the increase in variance.

• Modify model architecture (such as neural network architecture) so that it is more suitable for your problem: This technique can affect both bias and variance.



Learning curves



28 Diagnosing bias and variance: Learning
curves
We’ve seen some ways to estimate how much error can be attributed to avoidable bias vs.
variance. We did so by estimating the optimal error rate and computing the algorithm’s
training set and dev set errors. Let’s discuss a technique that is even more informative:
plotting a learning curve.

A learning curve plots your dev set error against the number of training examples. To plot it,
you would run your algorithm using different training set sizes. For example, if you have
1,000 examples, you might train separate copies of the algorithm on 100, 200, 300, …, 1000
examples. Then you could plot how dev set error varies with the training set size. Here is an
example:

As the training set size increases, the dev set error should decrease.
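A minimal sketch of this procedure, assuming Python with scikit-learn and matplotlib (the book does not prescribe any library; the synthetic data and logistic regression model below just keep the example self-contained):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in for your data: 1,000 training and 1,000 dev examples.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 20))
    y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)
    X_train, y_train, X_dev, y_dev = X[:1000], y[:1000], X[1000:], y[1000:]

    sizes = list(range(100, 1001, 100))                   # train on 100, 200, ..., 1000 examples
    dev_errors = []
    for n in sizes:
        clf = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
        dev_errors.append(1.0 - clf.score(X_dev, y_dev))  # error = 1 - accuracy

    plt.plot(sizes, dev_errors, color="red", label="dev error")
    plt.xlabel("training set size")
    plt.ylabel("dev set error")
    plt.legend()
    plt.show()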

We will often have some “desired error rate” that we hope our learning algorithm will
eventually achieve. For example:

• If we hope for human-level performance, then the human error rate could be the “desired
error rate.”

• If our learning algorithm serves some product (such as delivering cat pictures), we might
have an intuition about what level of performance is needed to give users a great
experience.



• If you have worked on an important application for a long time, then you might have
intuition about how much more progress you can reasonably make in the next
quarter/year.

Add the desired level of performance to your learning curve:

You can visually extrapolate the red “dev error” curve to guess how much closer you could
get to the desired level of performance by adding more data. In the example above, it looks
plausible that doubling the training set size might allow you to reach the desired
performance.

But if the dev error curve has “plateaued” (i.e. flattened out), then you can immediately tell
that adding more data won’t get you to your goal:

Looking at the learning curve might therefore help you avoid spending months collecting
twice as much training data, only to realize it does not help.



One downside of this process is that if you only look at the dev error curve, it can be hard to
extrapolate and predict exactly where the red curve will go if you had more data. There is one
additional plot that can help you estimate the impact of adding more data: the training error.



29 Plotting training error
Your dev set (and test set) error should decrease as the training set size grows. But your training set error usually increases as the training set size grows.

Let’s illustrate this effect with an example. Suppose your training set has only 2 examples:
One cat image and one non-cat image. Then it is easy for the learning algorithms to
“memorize” both examples in the training set, and get 0% training set error. Even if either or
both of the training examples were mislabeled, it is still easy for the algorithm to memorize
both labels.

Now suppose your training set has 100 examples. Perhaps even a few examples are
mislabeled, or ambiguous—some images are very blurry, so even humans cannot tell if there
is a cat. Perhaps the learning algorithm can still “memorize” most or all of the training set,
but it is now harder to obtain 100% accuracy. By increasing the training set from 2 to 100
examples, you will find that the training set accuracy will drop slightly.

Finally, suppose your training set has 10,000 examples. In this case, it becomes even harder
for the algorithm to perfectly fit all 10,000 examples, especially if some are ambiguous or
mislabeled. Thus, your learning algorithm will do even worse on this training set.

Let’s add a plot of training error to our earlier figures:

You can see that the blue “training error” curve increases with the size of the training set.
Furthermore, your algorithm usually does better on the training set than on the dev set; thus
the red dev error curve usually lies strictly above the blue training error curve.

Let’s discuss next how to interpret these plots.



30 Interpreting learning curves: High bias
Suppose your dev error curve looks like this:

We previously said that, if your dev error curve plateaus, you are unlikely to achieve the
desired performance just by adding data.

But it is hard to know exactly what an extrapolation of the red dev error curve will look like.
If the dev set was small, you would be even less certain because the curves could be noisy.

Suppose we add the training error curve to this plot and get the following:

Now, you can be absolutely sure that adding more data will not, by itself, be sufficient. Why
is that? Remember our two observations:



• As we add more training data, training error can only get worse. Thus, the blue training
error curve can only stay the same or go higher, and thus it can only get further away from
the (green line) level of desired performance.

• The red dev error curve is usually higher than the blue training error. Thus, there’s almost
no way that adding more data would allow the red dev error curve to drop down to the
desired level of performance when even the training error is higher than the desired level
of performance.

Examining both the dev error curve and the training error curve on the same plot allows us
to more confidently extrapolate the dev error curve.

Suppose, for the sake of discussion, that the desired performance is our estimate of the
optimal error rate. The figure above is then the standard “textbook” example of what a
learning curve with high avoidable bias looks like: At the largest training set
size—presumably corresponding to all the training data we have—there is a large gap
between the training error and the desired performance, indicating large avoidable bias.
Furthermore, the gap between the training and dev curves is small, indicating small
variance.

Previously, we were measuring training and dev set error only at the rightmost point of this
plot, which corresponds to using all the available training data. Plotting the full learning
curve gives us a more comprehensive picture of the algorithm’s performance on different
training set sizes.



31 Interpreting learning curves: Other cases
Consider this learning curve:

Does this plot indicate high bias, high variance, or both?

The blue training error curve is relatively low, and the red dev error curve is much higher
than the blue training error. Thus, the bias is small, but the variance is large. Adding more
training data will probably help close the gap between dev error and training error.

Now, consider this:

This time, the training error is large, as it is much higher than the desired level of
performance. The dev error is also much larger than the training error. Thus, you have
significant bias and significant variance. You will have to find a way to reduce both bias and
variance in your algorithm.



32 Plotting learning curves
Suppose you have a very small training set of 100 examples. You train your algorithm using a
randomly chosen subset of 10 examples, then 20 examples, then 30, up to 100, increasing
the number of examples by intervals of ten. You then use these 10 data points to plot your
learning curve. You might find that the curve looks slightly noisy (meaning that the values
are higher/lower than expected) at the smaller training set sizes.

When training on just 10 randomly chosen examples, you might be unlucky and have a
particularly “bad” training set, such as one with many ambiguous/mislabeled examples. Or,
you might get lucky and get a particularly “good” training set. Having a small training set
means that the dev and training errors may randomly fluctuate.

If your machine learning application is heavily skewed toward one class (such as a cat
classification task where the fraction of negative examples is much larger than positive
examples), or if it has a huge number of classes (such as recognizing 100 different animal
species), then the chance of selecting an especially “unrepresentative” or bad training set is
also larger. For example, if 80% of your examples are negative examples (y=0), and only
20% are positive examples (y=1), then there is a chance that a training set of 10 examples
contains only negative examples, thus making it very difficult for the algorithm to learn
something meaningful.

If the noise in the training curve makes it hard to see the true trends, here are two solutions:

• Instead of training just one model on 10 examples, select several (say 3-10) different randomly chosen training sets of 10 examples by sampling with replacement[10] from your original set of 100. Train a different model on each of these, and compute the training and dev set error of each of the resulting models. Compute and plot the average training error and average dev set error.

• If your training set is skewed towards one class, or if it has many classes, choose a “balanced” subset instead of picking 10 training examples at random out of the set of 100. For example, you can make sure that 2/10 of the examples are positive examples, and 8/10 are negative. More generally, you can make sure the fraction of examples from each class is as close as possible to the overall fraction in the original training set.

[10] Here’s what sampling with replacement means: You would randomly pick 10 different examples out of the 100 to form your first training set. Then to form the second training set, you would again pick 10 examples, but without taking into account what had been chosen in the first training set. Thus, it is possible for one specific example to appear in both the first and second training sets. In contrast, if you were sampling without replacement, the second training set would be chosen from just the 90 examples that had not been chosen the first time around. In practice, sampling with or without replacement shouldn’t make a huge difference, but the former is common practice.

I would not bother with either of these techniques unless you have already tried plotting
learning curves and concluded that the curves are too noisy to see the underlying trends. If
your training set is large—say over 10,000 examples—and your class distribution is not very
skewed, you probably won’t need these techniques.
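If you do need the first remedy, here is a minimal sketch of one learning-curve point averaged over several random draws (Python with scikit-learn as an illustrative stand-in classifier; it assumes numpy arrays like the X_train/y_train/X_dev/y_dev from the sketch in Chapter 28):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def averaged_curve_point(X_train, y_train, X_dev, y_dev, n, repeats=5, seed=0):
        """Average training and dev error at training set size n over several
        subsets drawn with replacement from the full training pool."""
        rng = np.random.default_rng(seed)
        train_errors, dev_errors = [], []
        for _ in range(repeats):
            idx = rng.choice(len(X_train), size=n, replace=True)   # sample n examples with replacement
            clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
            train_errors.append(1.0 - clf.score(X_train[idx], y_train[idx]))
            dev_errors.append(1.0 - clf.score(X_dev, y_dev))
        return float(np.mean(train_errors)), float(np.mean(dev_errors))

    # e.g., averaged_curve_point(X_train, y_train, X_dev, y_dev, n=10)
    # Note: with a very small or skewed n, a draw can contain a single class; if that
    # happens, re-draw or use the balanced-subset remedy described above.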

Finally, plotting a learning curve may be computationally expensive: For example, you might
have to train ten models with 1,000, then 2,000, all the way up to 10,000 examples. Training
models with small datasets is much faster than training models with large datasets. Thus,
instead of evenly spacing out the training set sizes on a linear scale as above, you might train
models with 1,000, 2,000, 4,000, 6,000, and 10,000 examples. This should still give you a
clear sense of the trends in the learning curves. Of course, this technique is relevant only if
the computational cost of training all the additional models is significant.



Comparing to
human-level
performance



33 Why we compare to human-level
performance
Many machine learning systems aim to automate things that humans do well. Examples
include image recognition, speech recognition, and email spam classification. Learning
algorithms have also improved so much that we are now surpassing human-level
performance on more and more of these tasks.

Further, there are several reasons building an ML system is easier if you are trying to do a
task that people can do well:

1. Ease of obtaining data from human labelers. For example, since people recognize cat images well, it is straightforward for people to provide high accuracy labels for your learning algorithm.

2. Error analysis can draw on human intuition. Suppose a speech recognition algorithm is doing worse than human-level recognition. Say it incorrectly transcribes an audio clip as “This recipe calls for a pear of apples,” mistaking “pair” for “pear.” You can draw on human intuition and try to understand what information a person uses to get the correct transcription, and use this knowledge to modify the learning algorithm.

3. Use human-level performance to estimate the optimal error rate and also set a “desired error rate.” Suppose your algorithm achieves 10% error on a task, but a person achieves 2% error. Then we know that the optimal error rate is 2% or lower and the avoidable bias is at least 8%. Thus, you should try bias-reducing techniques.

Even though item #3 might not sound important, I find that having a reasonable and
achievable target error rate helps accelerate a team’s progress. Knowing your algorithm has
high avoidable bias is incredibly valuable and opens up a menu of options to try.

There are some tasks that even humans aren’t good at. For example, picking a book to
recommend to you; or picking an ad to show a user on a website; or predicting the stock
market. Computers already surpass the performance of most people on these tasks. With
these applications, we run into the following problems:

• It is harder to obtain labels. For example, it’s hard for human labelers to annotate a
database of users with the “optimal” book recommendation. If you operate a website or
app that sells books, you can obtain data by showing books to users and seeing what they
buy. If you do not operate such a site, you need to find more creative ways to get data.



• Human intuition is harder to count on. For example, pretty much no one can
predict the stock market. So if our stock prediction algorithm does no better than random
guessing, it is hard to figure out how to improve it.

• It is hard to know what the optimal error rate and a reasonable desired error rate are. Suppose you already have a book recommendation system that is doing quite well. How do you know how much more it can improve without a human baseline?



34 How to define human-level performance
Suppose you are working on a medical imaging application that automatically makes
diagnoses from x-ray images. A typical person with no previous medical background besides
some basic training achieves 15% error on this task. A junior doctor achieves 10% error. An
experienced doctor achieves 5% error. And a small team of doctors that discuss and debate
each image achieves 2% error. Which one of these error rates defines “human-level
performance”?

In this case, I would use 2% as the human-level performance proxy for our optimal error
rate. You can also set 2% as the desired performance level because all three reasons from the
previous chapter for comparing to human-level performance apply:

• Ease of obtaining labeled data from human labelers. You can get a team of doctors
to provide labels to you with a 2% error rate.

• Error analysis can draw on human intuition. By discussing images with a team of
doctors, you can draw on their intuitions.

• Use human-level performance to estimate the optimal error rate and also set an achievable “desired error rate.” It is reasonable to use 2% error as our estimate of the
optimal error rate. The optimal error rate could be even lower than 2%, but it cannot be
higher, since it is possible for a team of doctors to achieve 2% error. In contrast, it is not
reasonable to use 5% or 10% as an estimate of the optimal error rate, since we know these
estimates are necessarily too high.

When it comes to obtaining labeled data, you might not want to discuss every image with an
entire team of doctors since their time is expensive. Perhaps you can have a single junior
doctor label the vast majority of cases and bring only the harder cases to more experienced
doctors or to the team of doctors.

If your system is currently at 40% error, then it doesn’t matter much whether you use a
junior doctor (10% error) or an experienced doctor (5% error) to label your data and provide
intuitions. But if your system is already at 10% error, then defining the human-level
reference as 2% gives you better tools to keep improving your system.



35 Surpassing human-level performance
You are working on speech recognition and have a dataset of audio clips. Suppose your
dataset has many noisy audio clips so that even humans have 10% error. Suppose your
system already achieves 8% error. Can you use any of the three techniques described in
Chapter 33 to continue making rapid progress?

If you can identify a subset of data in which humans significantly surpass your system, then
you can still use those techniques to drive rapid progress. For example, suppose your system
is much better than people at recognizing speech in noisy audio, but humans are still better
at transcribing very rapidly spoken speech.

For the subset of data with rapidly spoken speech:

1. You can still obtain transcripts from humans that are higher quality than your algorithm’s
output.

2. You can draw on human intuition to understand why they correctly heard a rapidly
spoken utterance when your system didn’t.

3. You can use human-level performance on rapidly spoken speech as a desired performance
target.

More generally, so long as there are dev set examples where humans are right and your
algorithm is wrong, then many of the techniques described earlier will apply. This is true
even if, averaged over the entire dev/test set, your performance is already surpassing
human-level performance.

There are many important machine learning applications where machines surpass human
level performance. For example, machines are better at predicting movie ratings, how long it
takes for a delivery car to drive somewhere, or whether to approve loan applications. Only a
subset of techniques apply once humans have a hard time identifying examples that the
algorithm is clearly getting wrong. Consequently, progress is usually slower on problems
where machines already surpass human-level performance, while progress is faster when
machines are still trying to catch up to humans.



Training and
testing on different
distributions



36 When you should train and test on
different distributions
Users of your cat pictures app have uploaded 10,000 images, which you have manually
labeled as containing cats or not. You also have a larger set of 200,000 images that you
downloaded off the internet. How should you define train/dev/test sets?

Since the 10,000 user images closely reflect the actual probability distribution of data you
want to do well on, you might use that for your dev and test sets. If you are training a
data-hungry deep learning algorithm, you might give it the additional 200,000 internet
images for training. Thus, your training and dev/test sets come from different probability
distributions. How does this affect your work?

Instead of partitioning our data into train/dev/test sets, we could take all 210,000 images we
have, and randomly shuffle them into train/dev/test sets. In this case, all the data comes
from the same distribution. But I recommend against this method, because about
205,000/210,000 ≈ 97.6% of your dev/test data would come from internet images, which
does not reflect the actual distribution you want to do well on. Remember our
recommendation on choosing dev/test sets:

Choose dev and test sets to reflect data you expect to get in the future
and want to do well on.
Most of the academic literature on machine learning assumes that the training set, dev set and test set all come from the same distribution.[11] In the early days of machine learning, data was scarce. We usually only had one dataset drawn from some probability distribution. So we would randomly split that data into train/dev/test sets, and the assumption that all the data was coming from the same source was usually satisfied.

[11] There is some academic research on training and testing on different distributions. Examples include “domain adaptation,” “transfer learning” and “multitask learning.” But there is still a huge gap between theory and practice. If you train on dataset A and test on some very different type of data B, luck could have a huge effect on how well your algorithm performs. (Here, “luck” includes the researcher’s hand-designed features for the particular task, as well as other factors that we just don’t understand yet.) This makes the academic study of training and testing on different distributions difficult to carry out in a systematic way.



But in the era of big data, we now have access to huge training sets, such as cat internet
images. Even if the training set comes from a different distribution than the dev/test set, we
still want to use it for learning since it can provide a lot of information.

For the cat detector example, instead of putting all 10,000 user-uploaded images into the
dev/test sets, we might instead put 5,000 into the dev/test sets. We can put the remaining
5,000 user-uploaded examples into the training set. This way, your training set of 205,000
examples contains some data that comes from your dev/test distribution along with the
200,000 internet images. We will discuss in a later chapter why this method is helpful.

Let’s consider a second example. Suppose you are building a speech recognition system to
transcribe street addresses for a voice-controlled mobile map/navigation app. You have
20,000 examples of users speaking street addresses. But you also have 500,000 examples of
other audio clips with users speaking about other topics. You might take 10,000 examples of
street addresses for the dev/test sets, and use the remaining 10,000, plus the additional
500,000 examples, for training.

We will continue to assume that your dev data and your test data come from the same
distribution. But it is important to understand that different training and dev/test
distributions offer some special challenges.



37 How to decide whether to use all your data
Suppose your cat detector’s training set includes 10,000 user-uploaded images. This data
comes from the same distribution as a separate dev/test set, and represents the distribution
you care about doing well on. You also have an additional 20,000 images downloaded from
the internet. Should you provide all 20,000+10,000=30,000 images to your learning
algorithm as its training set, or discard the 20,000 internet images for fear of it biasing your
learning algorithm?

When using earlier generations of learning algorithms (such as hand-designed computer vision features, followed by a simple linear classifier), there was a real risk that merging both types of data would cause you to perform worse. Thus, some engineers will warn you against including the 20,000 internet images.

But in the modern era of powerful, flexible learning algorithms—such as large neural
networks—this risk has greatly diminished. If you can afford to build a neural network with a
large enough number of hidden units/layers, you can safely add the 20,000 images to your
training set. Adding the images is more likely to increase your performance.

This observation relies on the fact that there is some x —> y mapping that works well for
both types of data. In other words, there exists some system that inputs either an internet
image or a mobile app image and reliably predicts the label, even without knowing the
source of the image.

Adding the additional 20,000 images has the following effects:

1. It gives your neural network more examples of what cats do/do not look like. This is
helpful, since internet images and user-uploaded mobile app images do share some
similarities. Your neural network can apply some of the knowledge acquired from internet
images to mobile app images.

2. It forces the neural network to expend some of its capacity to learn about properties that
are specific to internet images (such as higher resolution, different distributions of how
the images are framed, etc.) If these properties differ greatly from mobile app images, it
will “use up” some of the representational capacity of the neural network. Thus there is
less capacity for recognizing data drawn from the distribution of mobile app images,
which is what you really care about. Theoretically, this could hurt your algorithm’s
performance.



To describe the second effect in different terms, we can turn to the fictional character Sherlock Holmes, who says that your brain is like an attic; it only has a finite amount of space. He says that “for every addition of knowledge, you forget something that you knew before. It is of the highest importance, therefore, not to have useless facts elbowing out the useful ones.”[12]

Fortunately, if you have the computational capacity needed to build a big enough neural
network—i.e., a big enough attic—then this is not a serious concern. You have enough
capacity to learn from both internet and from mobile app images, without the two types of
data competing for capacity. Your algorithm’s “brain” is big enough that you don’t have to
worry about running out of attic space.

But if you do not have a big enough neural network (or another highly flexible learning
algorithm), then you should pay more attention to your training data matching your dev/test
set distribution.

If you think you have data that provides no benefit, you should just leave out that data for computational reasons. For example, suppose your dev/test sets contain mainly casual pictures of people, places, landmarks, and animals. Suppose you also have a large collection of scanned historical documents:

These documents don’t contain anything resembling a cat. They also look completely unlike
your dev/test distribution. There is no point including this data as negative examples,
because the benefit from the first effect above is negligible—there is almost nothing your
neural network can learn from this data that it can apply to your dev/test set distribution.
Including them would waste computational resources and representational capacity of the neural network.

[12] A Study in Scarlet, by Arthur Conan Doyle.



38 How to decide whether to include
inconsistent data
Suppose you want to learn to predict housing prices in New York City. Given the size of a
house (input feature x), you want to predict the price (target label y).

Housing prices in New York City are very high. Suppose you have a second dataset of
housing prices in Detroit, Michigan, where housing prices are much lower. Should you
include this data in your training set?

Given the same size x, the price of a house y is very different depending on whether it is in New York City or in Detroit. If you only care about predicting New York City housing prices, putting the two datasets together will hurt your performance. In this case, it would be better to leave out the inconsistent Detroit data.[13]

How is this New York City vs. Detroit example different from the mobile app vs. internet cat
images example?

The cat image example is different because, given an input picture x, one can reliably predict
the label y indicating whether there is a cat, even without knowing if the image is an internet
image or a mobile app image. I.e., there is a function f(x) that reliably maps from the input x
to the target output y, even without knowing the origin of x. Thus, the task of recognition
from internet images is “consistent” with the task of recognition from mobile app images.
This means there was little downside (other than computational cost) to including all the
data, and some possible significant upside. In contrast, New York City and Detroit, Michigan
data are not consistent. Given the same x (size of house), the price is very different
depending on where the house is.

[13] There is one way to address the problem of Detroit data being inconsistent with New York City data, which is to add an extra feature to each training example indicating the city. Given an input x—which now specifies the city—the target value of y is now unambiguous. However, in practice I do not see this done frequently.
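For what it is worth, here is a minimal sketch of the footnote’s suggestion in Python with numpy (the feature layout and city names are illustrative only):

    import numpy as np

    def add_city_indicator(sizes, cities):
        """Turn (house size, city) pairs into features [size, is_new_york, is_detroit]."""
        one_hot = {"new_york": (1.0, 0.0), "detroit": (0.0, 1.0)}
        return np.array([[size, *one_hot[city]] for size, city in zip(sizes, cities)])

    X = add_city_indicator(sizes=[1200, 1500, 900],
                           cities=["new_york", "detroit", "new_york"])
    # Each row is now [size, is_new_york, is_detroit], so the same size can map to very
    # different prices depending on the city, and the two datasets no longer conflict.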



39 Weighting data
Suppose you have 200,000 images from the internet and 5,000 images from your mobile
app users. There is a 40:1 ratio between the size of these datasets. In theory, so long as you
build a huge neural network and train it long enough on all 205,000 images, there is no
harm in trying to make the algorithm do well on both internet images and mobile images.

But in practice, having 40x as many internet images as mobile app images might mean you
need to spend 40x (or more) as much computational resources to model both, compared to if
you trained on only the 5,000 images.

If you don’t have huge computational resources, you could give the internet images a much
lower weight as a compromise.

For example, suppose your optimization objective is squared error. (This is not a good choice for a classification task, but it will simplify our explanation.) Thus, our learning algorithm tries to optimize (writing h_θ(x) for the model’s prediction on input x):

    min_θ   Σ_((x,y) ∈ MobileImg) (h_θ(x) − y)²   +   Σ_((x,y) ∈ InternetImg) (h_θ(x) − y)²

The first sum above is over the 5,000 mobile images, and the second sum is over the 200,000 internet images. You can instead optimize with an additional parameter β:

    min_θ   Σ_((x,y) ∈ MobileImg) (h_θ(x) − y)²   +   β · Σ_((x,y) ∈ InternetImg) (h_θ(x) − y)²

If you set β = 1/40, the algorithm would give equal weight to the 5,000 mobile images and the 200,000 internet images. You can also set the parameter β to other values, perhaps by tuning to the dev set.
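A minimal numpy sketch of this weighted objective (the generic predict function is a hypothetical stand-in for your model; in practice you would fold these weights into your framework’s loss function):

    import numpy as np

    def weighted_squared_error(predict, X_mobile, y_mobile, X_internet, y_internet, beta=1/40):
        """Squared error in which each internet example counts beta times as much as a
        mobile example; beta is typically tuned on the dev set."""
        mobile_term = np.sum((predict(X_mobile) - y_mobile) ** 2)
        internet_term = np.sum((predict(X_internet) - y_internet) ** 2)
        return mobile_term + beta * internet_term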

By weighting the additional Internet images less, you don’t have to build as massive a neural
network to make sure the algorithm does well on both types of tasks. This type of
re-weighting is needed only when you suspect the additional data (Internet Images) has a
very different distribution than the dev/test set, or if the additional data is much larger than
the data that came from the same distribution as the dev/test set (mobile images).



40 Generalizing from the training set to the
dev set
Suppose you are applying ML in a setting where the training and the dev/test distributions
are different. Say, the training set contains Internet images + Mobile images, and the
dev/test sets contain only Mobile images. However, the algorithm is not working well: It has
a much higher dev/test set error than you would like. Here are some possibilities of what
might be wrong:

1. It does not do well on the training set. This is the problem of high (avoidable) bias on the
training set distribution.

2. It does well on the training set, but does not generalize well to previously unseen data drawn from the same distribution as the training set. This is high variance.

3. It generalizes well to new data drawn from the same distribution as the training set, but not to data drawn from the dev/test set distribution. We call this problem data mismatch, since it is because the training set data is a poor match for the dev/test set data.

For example, suppose that humans achieve near perfect performance on the cat recognition
task. Your algorithm achieves this:

• 1% error on the training set

• 1.5% error on data drawn from the same distribution as the training set that the algorithm
has not seen

• 10% error on the dev set

In this case, you clearly have a data mismatch problem. To address this, you might try to
make the training data more similar to the dev/test data. We discuss some techniques for
this later.

In order to diagnose to what extent an algorithm suffers from each of the problems 1-3
above, it will be useful to have another dataset. Specifically, rather than giving the algorithm
all the available training data, you can split it into two subsets: The actual training set which
the algorithm will train on, and a separate set, which we will call the “Training dev” set, that
we will not train on.

You now have four subsets of data:



• Training set. This is the data that the algorithm will learn from (e.g., Internet images +
Mobile images). This does not have to be drawn from the same distribution as what we
really care about (the dev/test set distribution).

• Training dev set: This data is drawn from the same distribution as the training set (e.g.,
Internet images + Mobile images). This is usually smaller than the training set; it only
needs to be large enough to evaluate and track the progress of our learning algorithm.

• Dev set: This is drawn from the same distribution as the test set, and it reflects the
distribution of data that we ultimately care about doing well on. (E.g., mobile images.)

• Test set: This is drawn from the same distribution as the dev set. (E.g., mobile images.)

Armed with these four separate datasets, you can now evaluate:

• Training error, by evaluating on the training set.

• The algorithm’s ability to generalize to new data drawn from the training set distribution,
by evaluating on the training dev set.

• The algorithm’s performance on the task you care about, by evaluating on the dev and/or
test sets.

Most of the guidelines in Chapters 5-7 for picking the size of the dev set also apply to the
training dev set.
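A minimal sketch of carving out these four subsets, assuming Python with numpy arrays of mobile and internet examples that you already have (the split sizes below are illustrative):

    import numpy as np

    def four_way_split(mobile_X, mobile_y, internet_X, internet_y,
                       n_dev=2500, n_test=2500, n_train_dev=5000, seed=0):
        """Build training / training dev / dev / test sets as described above.

        Dev and test come only from the mobile (dev/test distribution) data; the
        training dev set is held out from the combined training distribution."""
        rng = np.random.default_rng(seed)
        m_idx = rng.permutation(len(mobile_X))
        dev_idx, test_idx, rest_idx = np.split(m_idx, [n_dev, n_dev + n_test])

        # The remaining mobile data joins the internet data to form the training distribution.
        pool_X = np.concatenate([mobile_X[rest_idx], internet_X])
        pool_y = np.concatenate([mobile_y[rest_idx], internet_y])
        p_idx = rng.permutation(len(pool_X))
        train_dev_idx, train_idx = p_idx[:n_train_dev], p_idx[n_train_dev:]

        return {
            "train": (pool_X[train_idx], pool_y[train_idx]),
            "train_dev": (pool_X[train_dev_idx], pool_y[train_dev_idx]),
            "dev": (mobile_X[dev_idx], mobile_y[dev_idx]),
            "test": (mobile_X[test_idx], mobile_y[test_idx]),
        }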

