Ch10 Deep Learning
Deep Learning
Neural networks became popular in the 1980s.
Lots of successes, hype, and great conferences: NeurIPS,
Snowbird.
Then along came SVMs, Random Forests and Boosting in the
1990s, and Neural Networks took a back seat.
Re-emerged around 2010 as Deep Learning.
By the 2020s, very dominant and successful.
Part of the success is due to vast improvements in computing power,
larger training sets, and software: TensorFlow and PyTorch.
[Figure: a single-hidden-layer neural network. Inputs X1, …, X4 feed hidden activations A1, …, A5, which feed the output f(X) for the response Y.]
Details
[Figure: the sigmoid and ReLU activation functions g(z), plotted for z between −4 and 4.]

• $A_k = h_k(X) = g\big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\big)$ are called the activations in the hidden layer (a small numerical sketch follows this slide).
• $g(z)$ is called the activation function. Popular choices are the sigmoid and rectified linear (ReLU), shown in the figure.
• Activation functions in hidden layers are typically nonlinear, otherwise the model collapses to a linear model.
• So the activations are like derived features — nonlinear transformations of linear combinations of the features.
• The model is fit by minimizing $\sum_{i=1}^{n} (y_i - f(x_i))^2$ (e.g. for regression).
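To make the algebra concrete, here is a minimal NumPy sketch of this forward pass (my own illustration, not code from the chapter; the ReLU activation and random weights are arbitrary choices):

```python
import numpy as np

def relu(z):
    # rectified linear activation g(z) = max(0, z)
    return np.maximum(0.0, z)

def forward(x, W, w0, beta, beta0):
    """Single-hidden-layer network: A_k = g(w_k0 + sum_j w_kj x_j),
    f(x) = beta_0 + sum_k beta_k A_k."""
    A = relu(w0 + W @ x)      # hidden activations A_1, ..., A_K
    return beta0 + beta @ A   # scalar output f(x)

# tiny example with p = 4 inputs and K = 5 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W, w0 = rng.normal(size=(5, 4)), rng.normal(size=5)
beta, beta0 = rng.normal(size=5), 0.1
print(forward(x, W, w0, beta, beta0))
```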
Example: MNIST Digits
Handwritten digits
28 × 28 grayscale images
60K train, 10K test images
Features are the 784 pixel
grayscale values in [0, 255]
Labels are the digit class 0–9
[Figure: a two-hidden-layer network for MNIST. The input layer X1, …, Xp feeds hidden layer L1 (activations A1(1), …, AK1(1)), which feeds hidden layer L2 (activations A1(2), …, AK2(2)); the output layer computes f0(X), …, f9(X) for the classes Y0, …, Y9. W1, W2 and B are the corresponding weight matrices.]
Details of Output Layer
• Let $Z_m = \beta_{m0} + \sum_{\ell=1}^{K_2} \beta_{m\ell} A_\ell^{(2)}$, $m = 0, 1, \ldots, 9$, be 10 linear combinations of activations at the second layer.
• The output activation function encodes the softmax function
$$f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{\ell=0}^{9} e^{Z_\ell}}$$
(a short numerical sketch of the softmax follows this slide).
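A minimal sketch of the softmax computation in NumPy (illustrative scores only, not values from the example):

```python
import numpy as np

def softmax(Z):
    # subtracting the max leaves the result unchanged but avoids overflow
    e = np.exp(Z - np.max(Z))
    return e / e.sum()

Z = np.array([2.0, 1.0, 0.1, -1.5, 0.0, 0.3, -0.2, 1.2, 0.5, -0.7])  # 10 class scores Z_0..Z_9
probs = softmax(Z)
print(probs, probs.sum())   # class probabilities f_m(X); they sum to 1
```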
Results
Method                                      Test Error
Neural Network + Ridge Regularization           2.3%
Neural Network + Dropout Regularization         1.8%
Multinomial Logistic Regression                 7.2%
Linear Discriminant Analysis                   12.7%
Convolutional Neural Network — CNN
How CNNs Work
Convolution Example
Max pooling of non-overlapping 2 × 2 blocks (sketched in code below):

    1 2 5 3
    3 0 1 2                      3 5
    2 1 3 4    -- max pool -->   2 4
    1 1 2 0
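A minimal NumPy sketch of this 2 × 2 max-pooling operation (my own illustration), reproducing the example above:

```python
import numpy as np

def max_pool_2x2(img):
    """Non-overlapping 2x2 max pooling of a 2D array with even dimensions."""
    h, w = img.shape
    # view the image as a grid of 2x2 blocks, then take the max within each block
    blocks = img.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

img = np.array([[1, 2, 5, 3],
                [3, 0, 1, 2],
                [2, 1, 3, 4],
                [1, 1, 2, 0]])
print(max_pool_2x2(img))   # [[3 5]
                           #  [2 4]]
```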
Architecture of a CNN

[Figure: a CNN architecture. Convolution layers are interleaved with pooling layers that halve the spatial dimensions (32 → 16 → 8 → 4) while the number of channels grows; the final feature maps are flattened and passed through fully connected layers to the output classes.]
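For concreteness, a hedged Keras sketch of an architecture in this spirit (the filter counts, kernel sizes, and dense-layer width are my own illustrative choices, not the network from the slides):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),                                 # 32 x 32 color images
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),   # convolve
    layers.MaxPooling2D((2, 2)),                                    # pool: 32 -> 16
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),                                    # pool: 16 -> 8
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),                                    # pool: 8 -> 4
    layers.Flatten(),                                               # flatten the feature maps
    layers.Dense(256, activation="relu"),
    layers.Dense(10, activation="softmax"),                         # output class probabilities
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```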
Using Pretrained Networks to Classify Images
Document Classification: IMDB Movie Reviews
Featurization: Bag-of-Words
Documents have different lengths, and consist of sequences of words. How do we create features X to characterize a document?
• From a dictionary, identify the 10K most frequently occurring words.
• Create a binary vector of length p = 10K for each document, and score a 1 in every position where the corresponding word occurs.
• With n documents, we now have an n × p sparse feature matrix X (sketched in code after this list).
• We compare a lasso logistic regression model to a two-hidden-layer neural network on the next slide. (No convolutions here!)
• Bag-of-words features are unigrams. We can instead use bigrams (occurrences of adjacent word pairs), and in general m-grams.
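A minimal scikit-learn sketch of this binary bag-of-words featurization (my own illustration with a two-document toy corpus; the book's lab code may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "this is one of the best films I have ever seen",
    "the film was dull and far too long",
]

# binary=True scores 1 if a word occurs at all; max_features would keep the
# 10K most frequent words (the toy corpus here has far fewer)
vectorizer = CountVectorizer(binary=True, max_features=10_000)
X = vectorizer.fit_transform(docs)   # n x p sparse matrix (scipy CSR)
print(X.shape, X.nnz)                # dimensions and number of stored 1s
```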
Lasso versus Neural Network — IMDB Reviews
[Figure: classification accuracy on the IMDB reviews. Left panel: the lasso, with accuracy plotted against −log(λ). Right panel: the two-hidden-layer neural network, with accuracy plotted against training epochs. Each panel shows train, validation, and test curves.]
Recurrent Neural Networks
Often data arise as sequences:
• Documents are sequences of words, and their relative
positions have meaning.
• Time-series such as weather data or financial indices.
• Recorded speech or music.
• Handwriting, such as doctor’s notes.
RNNs build models that take into account this sequential
nature of the data, and build a memory of the past.
Simple Recurrent Neural Network Architecture

[Figure: an RNN unrolled over the sequence. The input sequence X1, X2, …, XL is processed one element at a time; at step ℓ the activation Aℓ is computed from the input Xℓ (weights W) and the previous activation Aℓ−1 (weights U), and produces the output Oℓ (weights B). The same weight matrices W, U, B are used at every step, and the final output OL is associated with the response Y.]
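A minimal NumPy sketch of the recurrence in this diagram (the dimensions and the tanh activation are my own illustrative choices):

```python
import numpy as np

def rnn_forward(X, W, U, B, w0, b0, g=np.tanh):
    """Unrolled RNN: A_l = g(w0 + W x_l + U A_{l-1}),  O_l = b0 + B A_l.

    X : (L, p) input sequence; W : (K, p); U : (K, K); B : (q, K).
    Returns the outputs O_1, ..., O_L (the last one is typically the prediction)."""
    K = W.shape[0]
    A = np.zeros(K)                       # A_0 = 0
    outputs = []
    for x in X:                           # the same W, U, B at every step
        A = g(w0 + W @ x + U @ A)
        outputs.append(b0 + B @ A)
    return np.array(outputs)

rng = np.random.default_rng(0)
L, p, K, q = 6, 3, 4, 1                   # toy dimensions
O = rnn_forward(rng.normal(size=(L, p)),
                rng.normal(size=(K, p)), rng.normal(size=(K, K)),
                rng.normal(size=(q, K)), rng.normal(size=K), rng.normal(size=q))
print(O.shape)                            # (L, q)
```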
RNN and IMDB Reviews
• The document feature is a sequence of words $\{W_\ell\}_{\ell=1}^{L}$. We typically truncate/pad the documents to the same number L of words (we use L = 500).
• Each word $W_\ell$ is represented as a one-hot encoded binary vector $X_\ell$ (dummy variable) of length 10K, with all zeros and a single one in the position for that word in the dictionary.
• This results in an extremely sparse feature representation, and would not work well.
• Instead we use a lower-dimensional pretrained word embedding matrix E (m × 10K, next slide; a small sketch follows this list).
• This reduces the binary feature vector of length 10K to a real feature vector of dimension m ≪ 10K (e.g. m in the low hundreds).
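A small NumPy sketch of the embedding step (my own illustration, with a 10-word dictionary and m = 4 standing in for 10K and m ≪ 10K): multiplying the one-hot vector by E is just a column lookup.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, m = 10, 4                  # stand-ins for 10K and m << 10K
E = rng.normal(size=(m, vocab_size))   # embedding matrix, m x vocab_size

word_index = 7                          # position of the word in the dictionary
x = np.zeros(vocab_size)
x[word_index] = 1.0                     # one-hot encoding X_l

dense = E @ x                           # m-dimensional embedding of the word
assert np.allclose(dense, E[:, word_index])   # same as picking out a column
print(dense)
```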
Word Embedding
[Figure: word embedding. Each word of the review "this is one of the best films actually the best I have ever seen the film starts one fall day · · ·" is one-hot encoded and then mapped (Embed) to a dense embedding vector.]
RNN on IMDB Reviews
Time Series Forecasting

[Figure: daily time series of Log(Trading Volume), Dow Jones Return, and Log(Volatility) for the New York Stock Exchange, plotted over 1962–1986.]
New York Stock Exchange Data
Shown in the previous slide are three daily time series for the period
December 3, 1962 to December 31, 1986 (6,051 trading days):
• Log trading volume. This is the fraction of all
outstanding shares that are traded on that day, relative to
a 100-day moving average of past turnover, on the log scale.
• Dow Jones return. This is the difference between the log
of the Dow Jones Industrial Index on consecutive trading
days.
• Log volatility. This is based on the absolute values of
daily price movements.
Goal: predict Log trading volume tomorrow, given its
observed values up to today, as well as those of Dow Jones
return and Log volatility.
These data were assembled by LeBaron and Weigend (1998) IEEE
Transactions on Neural Networks, 9(1): 213–220.
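To make the forecasting setup concrete, here is a sketch of building lagged predictors and fitting a least-squares autoregression (simulated data stand in for the real series; the lag order 5 matches the AR(5) baseline mentioned later in the deck):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=200)     # stand-in for the log trading volume series

L = 5                         # use the previous 5 days as predictors (cf. AR(5))
X = np.column_stack([v[L - j : len(v) - j] for j in range(1, L + 1)])  # lags 1..5
y = v[L:]                     # tomorrow's value

# ordinary least squares: v_t ~ b0 + b1 v_{t-1} + ... + b5 v_{t-5}
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)
```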
Autocorrelation

[Figure: the autocorrelation function of Log(Trading Volume) for lags 0 to 35 days.]
RNN Results for NYSE Data
[Figure: test period, observed and predicted. The observed log(Trading Volume) and the model's predictions are plotted over the test years.]
Summary of RNNs
When to Use Deep Learning
• CNNs have had enormous successes in image classification
and modeling, and are starting to be used in medical
diagnosis. Examples include digital mammography,
ophthalmology, MRI scans, and digital X-rays.
• RNNs have had big wins in speech modeling, language
translation, and forecasting.
Should we always use deep learning models?
• Often the big successes occur when the signal to noise ratio
is high — e.g. image recognition and language translation.
Datasets are large, and overfitting is not a big problem.
• For noisier data, simpler models can often work better.
  • On the NYSE data, the AR(5) model is much simpler than an RNN, and performed as well.
  • On the IMDB review data, the linear model fit by glmnet did as well as the neural network, and better than the RNN.
• We endorse the Occam's razor principle — we prefer simpler models if they work as well. More interpretable!
Fitting Neural Networks
[Figure: the single-hidden-layer network again, with inputs X1, …, X4, activations A1, …, A5, and output f(X) for Y.]

$$\underset{\{w_k\}_1^K,\,\beta}{\text{minimize}} \;\; \frac{1}{2}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2,$$

where

$$f(x_i) = \beta_0 + \sum_{k=1}^{K} \beta_k\, g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} x_{ij}\Big).$$

[Figure: a nonconvex objective R(θ) plotted against θ, with gradient-descent iterates R(θ^0), R(θ^1), R(θ^2), …, R(θ^7) descending toward a minimum.]

1. Start with a guess $\theta^0$ for all the parameters in $\theta$, and set $t = 0$.
2. Iterate until the objective $R(\theta)$ fails to decrease:
   (a) Find a vector $\delta$ that reflects a small change in $\theta$, such that $\theta^{t+1} = \theta^t + \delta$ reduces the objective; i.e. $R(\theta^{t+1}) < R(\theta^t)$.
   (b) Set $t \leftarrow t + 1$.
Gradient Descent Continued
• In this simple example we reached the global minimum.
• If we had started a little to the left of $\theta^0$ we would have gone in the other direction, and ended up in a local minimum.
• Although $\theta$ is multi-dimensional, we have depicted the process as one-dimensional. It is much harder to identify whether one is in a local minimum in high dimensions.

How do we find a direction $\delta$ that points downhill? We compute the gradient vector
$$\nabla R(\theta^t) = \left.\frac{\partial R(\theta)}{\partial \theta}\right|_{\theta = \theta^t},$$
i.e. the vector of partial derivatives at the current guess $\theta^t$. The gradient points uphill, so our update is $\delta = -\rho \nabla R(\theta^t)$, or
$$\theta^{t+1} \leftarrow \theta^t - \rho \nabla R(\theta^t),$$
where $\rho$ is the learning rate (typically small, e.g. $\rho = 0.001$).
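A minimal sketch of this update rule on a toy one-dimensional objective (my own example; the slides do not fix a particular R):

```python
import numpy as np

def R(theta):
    # toy nonconvex objective with several local minima
    return np.sin(3 * theta) + 0.3 * theta**2

def grad_R(theta):
    # analytic derivative of the toy objective
    return 3 * np.cos(3 * theta) + 0.6 * theta

rho = 0.05                 # learning rate
theta = 2.0                # initial guess theta^0
for t in range(200):
    theta = theta - rho * grad_R(theta)   # theta^{t+1} = theta^t - rho * grad R(theta^t)
print(theta, R(theta))     # which minimum we reach depends on the starting point
```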
Gradients and Backpropagation
$R(\theta) = \sum_{i=1}^{n} R_i(\theta)$ is a sum, so the gradient is a sum of gradients.

$$R_i(\theta) = \tfrac{1}{2}\big(y_i - f_\theta(x_i)\big)^2 = \tfrac{1}{2}\Big(y_i - \beta_0 - \sum_{k=1}^{K} \beta_k\, g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} x_{ij}\Big)\Big)^2$$

For ease of notation, let $z_{ik} = w_{k0} + \sum_{j=1}^{p} w_{kj} x_{ij}$.
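Completing the chain-rule step that this notation sets up (a standard derivation, written out here rather than copied from the slides):

$$\frac{\partial R_i(\theta)}{\partial \beta_k} = -\big(y_i - f_\theta(x_i)\big)\, g(z_{ik}), \qquad
\frac{\partial R_i(\theta)}{\partial w_{kj}} = -\big(y_i - f_\theta(x_i)\big)\, \beta_k\, g'(z_{ik})\, x_{ij}.$$

Each component is a product of the residual and quantities already available from the forward pass, which is what backpropagation exploits.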
Tricks of the Trade
• Slow learning. Gradient descent is slow, and a small
learning rate ρ slows it even further. With early stopping,
this is a form of regularization.
• Stochastic gradient descent. Rather than compute the gradient using all the data, use a small minibatch drawn at random at each step. E.g. for MNIST data, with n = 60K, we use minibatches of 128 observations (sketched in code after this list).
• An epoch is a count of iterations, and amounts to the number of minibatch updates needed to process n samples in total; i.e. 60K/128 ≈ 469 for MNIST.
• Regularization. Ridge and lasso regularization can be used
to shrink the weights at each layer. Two other popular
forms of regularization are dropout and augmentation,
discussed next.
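A schematic NumPy sketch of one epoch of minibatch SGD (illustrative; grad_R_batch is a placeholder for the model's backpropagated minibatch gradient):

```python
import numpy as np

def sgd_epoch(theta, X, y, grad_R_batch, rho=0.001, batch_size=128, rng=None):
    """One epoch of minibatch stochastic gradient descent.

    grad_R_batch(theta, X_batch, y_batch) is a placeholder for the gradient
    of the objective on a minibatch (e.g. obtained by backpropagation)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    order = rng.permutation(n)                 # shuffle the data once per epoch
    for start in range(0, n, batch_size):      # roughly n / batch_size updates
        idx = order[start:start + batch_size]
        theta = theta - rho * grad_R_batch(theta, X[idx], y[idx])
    return theta
```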
Dropout Learning
[Figure: a two-panel scatter plot of points in the (X1, X2) plane, each axis ranging from −2 to 2.]
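Dropout itself is simple to sketch: during training each unit is set to zero independently with some probability φ, and the surviving activations are rescaled so their expected value is unchanged. A generic "inverted dropout" sketch (my own illustration; φ = 0.4 is arbitrary):

```python
import numpy as np

def dropout(A, phi, rng, training=True):
    """Inverted dropout: zero each activation with probability phi during
    training, scaling survivors by 1/(1-phi) so expected values match.
    At test time the activations are left untouched."""
    if not training:
        return A
    mask = rng.random(A.shape) >= phi
    return A * mask / (1.0 - phi)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))            # a small batch of hidden activations
print(dropout(A, phi=0.4, rng=rng))
```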
Data Augmentation on the Fly
Double Descent
Simulation
The Double-Descent Error Curve
[Figure: training error and test error plotted against degrees of freedom (2 to 50, log scale).]

[Figure: the fitted functions from the simulation at 42 degrees of freedom and at 80 degrees of freedom, plotted over the range of x.]
Some Facts
Software