CS229 Notes 1
Andrew Ng
Major categories of DL models
1. General neural networks.
2. Sequence models (1D sequences): RNN, GRU, LSTM, CTC, attention models, ….
3. Image models: 2D and 3D convolutional networks.
4. Advanced/future tech: unsupervised learning (sparse coding, ICA, SFA, …), reinforcement learning, ….

Trend #2: The rise of end-to-end learning
• Learning with integer or real-valued outputs.
• Learning with complex (e.g., string-valued) outputs.
End-to-end learning: Speech recognition
[Figure: traditional pipeline vs. end-to-end model]
This works well given enough labeled (audio, transcript) data. End-to-end learning works only when you have enough (x, y) data to learn a function of the needed level of complexity.

End-to-end learning: Autonomous driving
[Figure: traditional pipeline vs. end-to-end model]
Given the safety-critical requirement of autonomous driving, and thus the need for extremely high levels of accuracy, a pure end-to-end approach is still challenging to get to work.
Compared to earlier eras, we still talk about bias and variance, but somewhat less
about the “tradeoff” between them.
Basic recipe for machine learning
• Training error high? (Bias)
  Yes → bigger model; train longer; new model architecture.
  No → continue.
• Dev error high? (Variance)
  Yes → more data; regularization; new model architecture.
  No → done!

Automatic data synthesis examples
• OCR: text against random backgrounds.
• Speech recognition: synthesize clean audio against different background noises (see the sketch below).
• NLP (grammar correction): synthesize random grammatical errors.
Sometimes synthesized data that appears great to human eyes is actually very impoverished in the eyes of ML algorithms, and covers only a minuscule fraction of the actual distribution of data. E.g., images of cars extracted from video games.
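The speech-recognition bullet above lends itself to a concrete illustration. Below is a minimal sketch of synthesizing noisy training audio by mixing clean recordings with background noise at random signal-to-noise ratios; the function name, the SNR range, and the stand-in arrays are illustrative assumptions, not something specified in the notes.

import numpy as np

def mix_with_noise(clean, noise, snr_db):
    """Mix a clean speech waveform with background noise at a target SNR (in dB)."""
    # Tile or trim the noise so it covers the whole clean utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    # Scale the noise to hit the requested signal-to-noise ratio.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
# Stand-ins for real recordings: 1 s of "speech", 3 s of "car noise" at 16 kHz.
clean_utterances = [rng.standard_normal(16000)]
background_noises = [0.1 * rng.standard_normal(48000)]

# One synthetic noisy copy per utterance, at a random SNR between 0 and 20 dB.
synthetic = [
    mix_with_noise(c, background_noises[rng.integers(len(background_noises))],
                   snr_db=rng.uniform(0, 20))
    for c in clean_utterances
]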
Different training and test set distributions
Say you want to build a speech recognition system for a new in-car rearview mirror product. You have 50,000 hours of general speech data, and 10 hours of in-car data. How do you split your data? This is a bad way to do it:
• Training and Dev: general speech data (50,000 hours).
• Test: in-car data (10 hours).
Having mismatched dev and test distributions is not a good idea. Your team may spend months optimizing for dev set performance only to find it doesn’t work well on the test set.

Better way: make the dev and test sets come from the same distribution.
• Training (~50,000h): general speech data.
• Training-Dev (20h): general speech data, held out from training.
• Dev (5h) and Test (5h): in-car data.
(See the split sketch below.)
With this split, the gaps between successive error rates tell you what to work on:
Human level error ………........ 1%
    (“avoidable bias”)
Training error ……….............. 1.1%
    (overfitting of training set)
Training-Dev error ………....... 1.5%
    (data mismatch)
Dev set error ……………......... 8%
    (overfitting of dev set)
Test set error ……………......... 8.5%
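As a minimal sketch of the “better way” split above, the helper below assumes the two pools of utterances are already loaded as Python lists, and carves out a fraction of the general pool (roughly 20h out of 50,000h) as the train-dev slice; the function name and sizes are illustrative, not prescribed by the notes.

import random

def split_for_mismatched_data(general_pool, in_car_pool, train_dev_frac=20 / 50000, seed=0):
    """Split as the notes recommend: dev and test both come from the (small)
    in-car distribution, and a small held-out slice of general data
    ("train-dev") lets variance and data mismatch be measured separately."""
    rng = random.Random(seed)
    general = list(general_pool)      # copy so the caller's lists aren't mutated
    in_car = list(in_car_pool)
    rng.shuffle(general)
    rng.shuffle(in_car)

    n_train_dev = max(1, int(len(general) * train_dev_frac))
    train_dev, train = general[:n_train_dev], general[n_train_dev:]

    half = len(in_car) // 2           # 10h of in-car data -> 5h dev, 5h test
    dev, test = in_car[:half], in_car[half:]
    return train, train_dev, dev, test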
New recipe for machine learning
• Training error high? (Bias)
  Yes → bigger model; train longer; new model architecture.
• Train-Dev error high? (Variance)
  Yes → more data; regularization; new model architecture.
• Dev error high? (Train-test data mismatch)
  Yes → make the training data more similar to the test data; data synthesis (domain adaptation); new model architecture.
• Test error high? (Overfit dev set)
  Yes → get more dev set data.
• If none of the above: done!
(A code sketch of this decision sequence appears after this section.)

General Human/Bias/Variance analysis
For each of the two distributions — general speech data (50,000 hours) and in-car speech data (10 hours) — measure:
• Performance of humans → human-level error. (Carry out a human evaluation on the in-car data to measure it there.)
• Performance on examples you’ve trained on → training error. (Insert some in-car data into the training set to measure it there.)
• Performance on examples you haven’t trained on → Training-Dev error (general speech) and Dev/Test error (in-car).
The gap between human-level error and training error is the “avoidable bias”; the gap between training error and Training-Dev error is the “variance”/degree of overfitting; and the gap between Training-Dev error and Dev/Test error is the data mismatch.
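Here is one way the “new recipe” above could be written down as code. This is only a sketch under assumed names: err maps split names to error rates, human_level approximates Bayes error, and the tolerance tol is an arbitrary illustrative threshold, none of which come from the notes.

def new_recipe_step(err, human_level, tol=0.005):
    """Walk the decision sequence from the 'new recipe' and suggest actions."""
    if err["train"] - human_level > tol:          # avoidable bias
        return ["bigger model", "train longer", "new model architecture"]
    if err["train_dev"] - err["train"] > tol:     # variance / overfitting of training set
        return ["more data", "regularization", "new model architecture"]
    if err["dev"] - err["train_dev"] > tol:       # train-test data mismatch
        return ["make training data more similar to test data",
                "data synthesis", "domain adaptation"]
    if err["test"] - err["dev"] > tol:            # overfitting of the dev set
        return ["get more dev set data"]
    return ["done"]

# The error ladder from the in-car speech example above:
print(new_recipe_step({"train": 0.011, "train_dev": 0.015, "dev": 0.08, "test": 0.085},
                      human_level=0.01))
# -> the data-mismatch actions, since the largest gap is between Training-Dev and Dev error.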
You’ll often see the fastest performance improvements on a task while the ML is performing worse than humans.
• Human-level performance is a proxy for Bayes optimal error, which we can never surpass.
• You can rely on human intuition: (i) have humans provide labeled data; (ii) do error analysis to understand how humans got examples right; (iii) estimate bias/variance.
E.g., on an image recognition task, training error = 8%, dev error = 10%. What do you do? Two cases:
• Case 1: human level error = 1%, training set error = 8%, dev set error = 10%. The “avoidable bias” (training error minus human level error) dominates, so focus on bias.
• Case 2: human level error = 7.5%, training set error = 8%, dev set error = 10%. The “variance” (dev error minus training error) dominates, so focus on variance.
(A quick numeric check of both cases appears at the end of this section.)

What is “human-level error”? Suppose that on an image labeling task:
Typical human ………………..… 3% error
Typical doctor …………………... 1% error
Experienced doctor ……………. 0.7% error
Team of experienced doctors …. 0.5% error
Answer: for the purpose of driving ML progress, 0.5% is the best answer, since it’s closest to Bayes error.
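The two cases above can be checked with a few lines of arithmetic; the variable names here are illustrative.

train_err, dev_err = 0.08, 0.10
for human_err in (0.01, 0.075):                    # case 1 and case 2
    avoidable_bias = train_err - human_err
    variance = dev_err - train_err
    focus = "bias" if avoidable_bias > variance else "variance"
    print(f"human={human_err:.3f}: bias gap={avoidable_bias:.3f}, "
          f"variance gap={variance:.3f} -> focus on {focus}")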
AI Product Management
The availability of new supervised DL algorithms means we’re rethinking the workflow of how teams collaborate to build applications using DL. A Product Manager (PM) can help an AI team prioritize the most fruitful ML tasks. E.g., should you improve speech performance with car noise, café noise, or low-bandwidth audio, improve performance for accented speech, improve latency, reduce binary size, or something else?

What can AI do today? Some heuristics for PMs:
• If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.
• For any concrete, repeated event that we observe (e.g., whether a user clicks on an ad; how long it takes to deliver a package; …), we can reasonably try to predict the outcome of the next event (e.g., whether the user clicks on the next ad).

How should PMs and AI teams work together? Here’s one default split of responsibilities:

Product Manager (PM) responsibility:
• Provide dev/test sets, ideally drawn from the same distribution.
• Provide an evaluation metric for the learning algorithm (accuracy, F1, etc.).

AI Scientist/Engineer responsibility:
• Acquire training data.
• Develop a system that does well, according to the provided metric, on the dev/test data.