Ch10 Deep Learning

Deep learning, which re-emerged around 2010, has become dominant due to advancements in computing power, larger datasets, and software like TensorFlow and PyTorch. Pioneers in the field, including Yann LeCun, Geoffrey Hinton, and Yoshua Bengio, were awarded the 2019 ACM Turing Award for their contributions to neural networks. The document also discusses the architecture and functioning of neural networks and convolutional neural networks (CNNs), highlighting their applications in image classification.

Deep Learning
Neural networks became popular in the 1980s. Lots of successes, hype, and great conferences: NeurIPS, Snowbird.
Then along came SVMs, Random Forests and Boosting in the 1990s, and Neural Networks took a back seat.
Re-emerged around 2010 as Deep Learning. By the 2020s, very dominant and successful.
Part of the success is due to vast improvements in computing power, larger training sets, and software: TensorFlow and PyTorch.

Much of the credit goes to three pioneers and their students: Yann LeCun, Geoffrey Hinton and Yoshua Bengio, who received the 2019 ACM Turing Award for their work in Neural Networks.

1 / 46
Single Layer Neural Network

  f(X) = β_0 + Σ_{k=1}^K β_k h_k(X)
       = β_0 + Σ_{k=1}^K β_k g(w_{k0} + Σ_{j=1}^p w_{kj} X_j).

[Figure: a single-layer network diagram with input layer X_1, ..., X_4, hidden layer A_1, ..., A_5, and output layer f(X) feeding the response Y.]

2 / 46
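
To make the formula concrete, here is a minimal base-R sketch of the forward pass for a single-layer network with K hidden units. The weight matrix W, the output weights beta, and the example input are hypothetical placeholders (randomly initialized, not fitted).

```r
# Forward pass: f(X) = beta_0 + sum_k beta_k * g(w_k0 + sum_j w_kj * X_j)
relu    <- function(z) pmax(z, 0)            # rectified linear activation
sigmoid <- function(z) 1 / (1 + exp(-z))     # sigmoid activation

forward <- function(x, W, beta, g = relu) {
  # x: numeric vector of p features; W: K x (p + 1) matrix, first column is the bias w_k0
  # beta: vector of length K + 1, first element is beta_0
  A <- g(W[, 1] + W[, -1, drop = FALSE] %*% x)   # hidden activations A_1, ..., A_K
  drop(beta[1] + sum(beta[-1] * A))              # output f(x)
}

set.seed(1)
p <- 4; K <- 5
W    <- matrix(rnorm(K * (p + 1)), K, p + 1)     # random (untrained) hidden-layer weights
beta <- rnorm(K + 1)                             # random output weights
forward(c(0.5, -1, 2, 0.3), W, beta)             # one prediction
```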
Details

[Figure: the sigmoid and ReLU activation functions g(z), plotted for z between −4 and 4.]

• A_k = h_k(X) = g(w_{k0} + Σ_{j=1}^p w_{kj} X_j) are called the activations in the hidden layer.
• g(z) is called the activation function. Popular choices are the sigmoid and the rectified linear (ReLU), shown in the figure.
• Activation functions in hidden layers are typically nonlinear, otherwise the model collapses to a linear model.
• So the activations are like derived features — nonlinear transformations of linear combinations of the features.
• The model is fit by minimizing Σ_{i=1}^n (y_i − f(x_i))² (e.g. for regression).

3 / 46
Example: MNIST Digits

Handwritten digits: 28 × 28 grayscale images.
60K train, 10K test images.
Features are the 784 pixel grayscale values ∈ (0, 255).
Labels are the digit class 0–9.

• Goal: build a classifier to predict the image class.
• We build a two-layer network with 256 units at the first layer, 128 units at the second layer, and 10 units at the output layer.
• Along with the intercepts (called biases) there are 235,146 parameters (referred to as weights).

4 / 46
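
A hedged sketch of this two-layer architecture using the keras R package mentioned in the Chapter 10 lab (this is not the authors' lab code). The layer sizes follow the slide; the activations, optimizer and loss settings are illustrative assumptions.

```r
library(keras)   # assumes keras and a TensorFlow backend are installed

# 784 inputs -> 256 -> 128 -> 10, as described on the slide
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

# (784*256 + 256) + (256*128 + 128) + (128*10 + 10) = 235,146 weights, matching the slide
summary(model)

model %>% compile(loss = "categorical_crossentropy",
                  optimizer = "rmsprop", metrics = c("accuracy"))
```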
[Figure: the two-hidden-layer MNIST network. Input layer X_1, ..., X_p; hidden layer L1 with units A_1^(1), ..., A_{K_1}^(1); hidden layer L2 with units A_1^(2), ..., A_{K_2}^(2); output layer f_0(X), ..., f_9(X) for Y_0, ..., Y_9. The weight matrices are W_1, W_2 and B.]

5 / 46
Details of Output Layer

• Let Z_m = β_{m0} + Σ_{ℓ=1}^{K_2} β_{mℓ} A_ℓ^(2), m = 0, 1, ..., 9, be 10 linear combinations of the activations at the second layer.
• The output activation function encodes the softmax function

    f_m(X) = Pr(Y = m | X) = e^{Z_m} / Σ_{ℓ=0}^{9} e^{Z_ℓ}.

• We fit the model by minimizing the negative multinomial log-likelihood (or cross-entropy):

    − Σ_{i=1}^n Σ_{m=0}^{9} y_{im} log(f_m(x_i)).

• y_{im} is 1 if the true class for observation i is m, else 0 — i.e. one-hot encoded.

6 / 46
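
A small base-R illustration of the softmax output and the cross-entropy above. Z here is a hypothetical matrix of the 10 linear combinations for a few observations, and y a vector of true class labels 0–9.

```r
softmax <- function(z) {
  z <- z - max(z)                  # subtract the max for numerical stability
  exp(z) / sum(exp(z))             # f_m = exp(Z_m) / sum_l exp(Z_l)
}

cross_entropy <- function(Z, y) {
  # Z: n x 10 matrix of linear combinations Z_m; y: integer labels in 0..9
  P <- t(apply(Z, 1, softmax))                     # n x 10 matrix of class probabilities
  -sum(log(P[cbind(seq_len(nrow(Z)), y + 1)]))     # -sum_i log f_{y_i}(x_i), i.e. the one-hot sum
}

set.seed(1)
Z <- matrix(rnorm(5 * 10), 5, 10)   # 5 toy observations
y <- c(3, 0, 9, 1, 3)
cross_entropy(Z, y)
```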
Results

Method                                     Test Error
Neural Network + Ridge Regularization          2.3%
Neural Network + Dropout Regularization        1.8%
Multinomial Logistic Regression                7.2%
Linear Discriminant Analysis                  12.7%

• Early success for neural networks in the 1990s.
• With so many parameters, regularization is essential.
• Some details of regularization and fitting will come later.
• Very overworked problem — best reported rates are < 0.5%!
• Human error rate is reported to be around 0.2%, or 20 of the 10K test images.

7 / 46
Convolutional Neural Network — CNN

• Major success story for classifying images.
• Shown are samples from the CIFAR100 database: 32 × 32 color natural images, with 100 classes.
• 50K training images, 10K test images.
• Each image is a three-dimensional array or feature map: a 32 × 32 × 3 array of 8-bit numbers. The last dimension represents the three color channels for red, green and blue.

8 / 46
How CNNs Work

• The CNN builds up an image in a hierarchical fashion.
• Edges and shapes are recognized and pieced together to form more complex shapes, eventually assembling the target image.
• This hierarchical construction is achieved using convolution and pooling layers.

9 / 46
Convolution Filter

  Input Image = [ a b c          Convolution Filter = [ α β
                  d e f                                 γ δ ]
                  g h i
                  j k l ]

  Convolved Image = [ aα + bβ + dγ + eδ   bα + cβ + eγ + fδ
                      dα + eβ + gγ + hδ   eα + fβ + hγ + iδ
                      gα + hβ + jγ + kδ   hα + iβ + kγ + lδ ]

• The filter is itself an image, and represents a small shape, edge, etc.
• We slide it around the input image, scoring for matches.
• The scoring is done via dot-products, illustrated above.
• If the subimage of the input image is similar to the filter, the score is high, otherwise low.
• The filters are learned during training.

10 / 46
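
A base-R sketch of the sliding dot-product above: convolving a 4 × 3 input with a 2 × 2 filter gives a 3 × 2 convolved image, matching the layout on the slide. The numeric values of the input and filter here are arbitrary stand-ins for (a, b, ..., l) and (α, β, γ, δ).

```r
convolve2d <- function(img, filt) {
  # "valid" convolution: slide filt over img and take dot-products with each subimage
  out <- matrix(0, nrow(img) - nrow(filt) + 1, ncol(img) - ncol(filt) + 1)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out)))
      out[i, j] <- sum(img[i:(i + nrow(filt) - 1), j:(j + ncol(filt) - 1)] * filt)
  out
}

img  <- matrix(1:12, nrow = 4, byrow = TRUE)       # plays the role of (a b c / d e f / g h i / j k l)
filt <- matrix(c(1, 0, 0, -1), 2, 2, byrow = TRUE) # plays the role of (alpha beta / gamma delta)
convolve2d(img, filt)                              # the 3 x 2 convolved image
```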
Convolution Example

• The idea of convolution with a filter is to find common patterns that occur in different parts of the image.
• The two filters shown here highlight vertical and horizontal stripes.
• The result of the convolution is a new feature map.
• Since images have three color channels, the filter does as well: one filter per channel, and the dot-products are summed.
• The weights in the filters are learned by the network.

11 / 46
Pooling

  Max pool:  [ 1 2 5 3
               3 0 1 2     →   [ 3 5
               2 1 3 4           2 4 ]
               1 1 2 0 ]

• Each non-overlapping 2 × 2 block is replaced by its maximum.
• This sharpens the feature identification.
• Allows for locational invariance.
• Reduces the dimension by a factor of 4 — i.e. a factor of 2 in each dimension.

12 / 46
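
A base-R sketch of 2 × 2 max pooling applied to the 4 × 4 matrix on the slide; it reproduces the 2 × 2 result shown.

```r
max_pool <- function(x, size = 2) {
  # replace each non-overlapping size x size block by its maximum
  out <- matrix(0, nrow(x) / size, ncol(x) / size)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out))) {
      rows <- ((i - 1) * size + 1):(i * size)
      cols <- ((j - 1) * size + 1):(j * size)
      out[i, j] <- max(x[rows, cols])
    }
  out
}

x <- matrix(c(1, 2, 5, 3,
              3, 0, 1, 2,
              2, 1, 3, 4,
              1, 1, 2, 0), nrow = 4, byrow = TRUE)
max_pool(x)   # gives the 2 x 2 matrix (3 5 / 2 4), as on the slide
```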
Architecture of a CNN

[Figure: schematic of a CNN. The 32 × 32 input passes through alternating convolve and pool layers of decreasing spatial size (32, 16, 8, 4) and increasing depth, is then flattened, and feeds fully connected layers leading to the output units.]

• Many convolve + pool layers.
• Filters are typically small, e.g. each channel 3 × 3.
• Each filter creates a new channel in the convolution layer.
• As pooling reduces size, the number of filters/channels is typically increased.
• The number of layers can be very large. E.g. resnet50, trained on the imagenet 1000-class image database, has 50 layers!

13 / 46
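
A hedged keras R sketch of a CNN in the style of the schematic above, for 32 × 32 × 3 CIFAR100-sized inputs. The filter counts (32, 64, 128) and the dense-layer size are illustrative choices, not read off the figure; only the alternating convolve + pool pattern and the 100-class output follow the slides.

```r
library(keras)   # assumes keras and a TensorFlow backend are installed

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), padding = "same",
                activation = "relu", input_shape = c(32, 32, 3)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%        # 32 x 32 -> 16 x 16
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), padding = "same",
                activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%        # 16 x 16 -> 8 x 8
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), padding = "same",
                activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%        # 8 x 8 -> 4 x 4
  layer_flatten() %>%
  layer_dense(units = 512, activation = "relu") %>%
  layer_dense(units = 100, activation = "softmax")     # 100 CIFAR100 classes
```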
Using Pretrained Networks to Classify Images

Image: flamingo           Image: Cooper's hawk           Image: Cooper's hawk
flamingo         0.83     kite (raptor)        0.60      fountain    0.35
spoonbill        0.17     great grey owl       0.09      nail        0.12
white stork      0.00     robin                0.06      hook        0.07

Image: Lhasa Apso         Image: cat                     Image: Cape weaver
Tibetan terrier  0.56     Old English sheepdog 0.82      jacamar     0.28
Lhasa            0.32     Shih-Tzu             0.04      macaw       0.12
cocker spaniel   0.03     Persian cat          0.04      robin       0.12

Here we use the 50-layer resnet50 network, trained on the 1000-class imagenet corpus, to classify some photographs. Each column shows the top three predicted classes and their probabilities for one image.

14 / 46
Document Classification: IMDB Movie Reviews

The IMDB corpus consists of user-supplied movie ratings for a large collection of movies. Each has been labeled for sentiment as positive or negative. Here is the beginning of a negative review:

  This has to be one of the worst films of the 1990s. When my friends & I were watching this film (being the target audience it was aimed at) we just sat & watched the first half an hour with our jaws touching the floor at how bad it really was. The rest of the time, everyone else in the theater just started talking to each other, leaving or generally crying into their popcorn . . .

We have labeled training and test sets, each consisting of 25,000 reviews, and each balanced with regard to sentiment.

We wish to build a classifier to predict the sentiment of a review.

15 / 46
Featurization: Bag-of-Words

Documents have different lengths, and consist of sequences of words. How do we create features X to characterize a document?
• From a dictionary, identify the 10K most frequently occurring words.
• Create a binary vector of length p = 10K for each document, and score a 1 in every position for which the corresponding word occurred.
• With n documents, we now have an n × p sparse feature matrix X.
• We compare a lasso logistic regression model to a two-hidden-layer neural network on the next slide. (No convolutions here!)
• Bag-of-words uses unigrams. We can instead use bigrams (occurrences of adjacent word pairs), and in general m-grams.

16 / 46
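
A base-R sketch of the bag-of-words featurization: a tiny made-up corpus, a small "dictionary" of the most frequent words, and the resulting binary document-by-word matrix. In practice one would use a sparse matrix class and a 10K-word dictionary.

```r
docs <- c("the film was a great great film",
          "worst film ever",
          "a great cast and a great story")

tokens <- strsplit(tolower(docs), "\\s+")                             # split each document into words
dict   <- names(sort(table(unlist(tokens)), decreasing = TRUE))[1:5]  # top-5 words as a toy dictionary

# n x p binary matrix: X[i, j] = 1 if dictionary word j occurs anywhere in document i
X <- t(sapply(tokens, function(w) as.integer(dict %in% w)))
colnames(X) <- dict
X
```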
Lasso versus Neural Network — IMDB Reviews

[Figure: two panels of classification accuracy on the IMDB data, with curves for the train, validation and test sets. Left: lasso logistic regression, accuracy versus −log(λ). Right: two-hidden-layer neural network, accuracy versus training epochs.]

• The simpler lasso logistic regression model works as well as the neural network in this case.
• glmnet was used to fit the lasso model, and is very effective because it can exploit sparsity in the X matrix.

17 / 46
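
A hedged sketch of fitting the lasso logistic regression with glmnet. The objects x_train, y_train, x_test and y_test are assumed to exist (a sparse binary feature matrix and 0/1 sentiment labels, as built above); the cross-validation call shown here chooses λ, rather than the validation-set curves in the figure.

```r
library(glmnet)

# x_train: sparse n x 10,000 binary matrix; y_train: 0/1 sentiment labels (assumed to exist)
cvfit <- cv.glmnet(x_train, y_train, family = "binomial", type.measure = "class")
plot(cvfit)                                        # misclassification error versus log(lambda)

pred <- predict(cvfit, newx = x_test, s = "lambda.min", type = "class")
mean(pred == y_test)                               # test-set accuracy
```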
Recurrent Neural Networks

Often data arise as sequences:
• Documents are sequences of words, and their relative positions have meaning.
• Time series such as weather data or financial indices.
• Recorded speech or music.
• Handwriting, such as doctor's notes.
RNNs build models that take into account this sequential nature of the data, and build a memory of the past.

• The feature for each observation is a sequence of vectors X = {X_1, X_2, ..., X_L}.
• The target Y is often of the usual kind — e.g. a single variable such as Sentiment, or a one-hot vector for multiclass classification.
• However, Y can also be a sequence, such as the same document in a different language.

18 / 46
Simple Recurrent Neural Network Architecture

[Figure: unrolled RNN diagram. The input sequence X_1, X_2, ..., X_L feeds hidden units A_1, A_2, ..., A_L, each A_ℓ also receiving A_{ℓ−1}, and each producing an output O_ℓ; the shared weight matrices are W (input to hidden), U (hidden to hidden) and B (hidden to output).]

• The hidden layer is a sequence of vectors A_ℓ, receiving as input X_ℓ as well as A_{ℓ−1}. A_ℓ produces an output O_ℓ.
• The same weights W, U and B are used at each step in the sequence — hence the term recurrent.
• The A_ℓ sequence represents an evolving model for the response that is updated as each element X_ℓ is processed.

19 / 46
RNN in Detail

Suppose X_ℓ = (X_{ℓ1}, X_{ℓ2}, ..., X_{ℓp}) has p components, and A_ℓ = (A_{ℓ1}, A_{ℓ2}, ..., A_{ℓK}) has K components. Then the computation at the kth component of the hidden unit A_ℓ is

  A_{ℓk} = g(w_{k0} + Σ_{j=1}^p w_{kj} X_{ℓj} + Σ_{s=1}^K u_{ks} A_{ℓ−1,s}),

  O_ℓ = β_0 + Σ_{k=1}^K β_k A_{ℓk}.

Often we are concerned only with the prediction O_L at the last unit. For squared-error loss, and n sequence/response pairs, we would minimize

  Σ_{i=1}^n (y_i − o_{iL})² = Σ_{i=1}^n (y_i − [β_0 + Σ_{k=1}^K β_k g(w_{k0} + Σ_{j=1}^p w_{kj} x_{iLj} + Σ_{s=1}^K u_{ks} a_{i,L−1,s})])².

20 / 46
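
A base-R sketch of the recurrent forward pass above for one sequence, returning the last output O_L. The weight matrices W (K × (p+1)), U (K × K) and the output weights beta are random placeholders, not fitted values; tanh is used as an example choice of g.

```r
rnn_forward <- function(X, W, U, beta, g = tanh) {
  # X: L x p matrix, one row per position in the sequence
  K <- nrow(W)
  A <- rep(0, K)                                   # A_0 = 0
  for (l in seq_len(nrow(X))) {
    # A_lk = g(w_k0 + sum_j w_kj X_lj + sum_s u_ks A_{l-1,s})
    A <- g(W[, 1] + W[, -1, drop = FALSE] %*% X[l, ] + U %*% A)
  }
  drop(beta[1] + sum(beta[-1] * A))                # O_L = beta_0 + sum_k beta_k A_Lk
}

set.seed(3)
L <- 5; p <- 3; K <- 12
X    <- matrix(rnorm(L * p), L, p)
W    <- matrix(rnorm(K * (p + 1), sd = 0.1), K, p + 1)
U    <- matrix(rnorm(K * K, sd = 0.1), K, K)
beta <- rnorm(K + 1, sd = 0.1)
rnn_forward(X, W, U, beta)
```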
RNN and IMDB Reviews

• The document feature is a sequence of words {W_ℓ}_{ℓ=1}^L. We typically truncate/pad the documents to the same number L of words (we use L = 500).
• Each word W_ℓ is represented as a one-hot encoded binary vector X_ℓ (dummy variable) of length 10K, with all zeros and a single one in the position for that word in the dictionary.
• This results in an extremely sparse feature representation, and would not work well.
• Instead we use a lower-dimensional pretrained word embedding matrix E (m × 10K, next slide).
• This reduces the binary feature vector of length 10K to a real feature vector of dimension m ≪ 10K (e.g. m in the low hundreds).

21 / 46
Word Embedding

[Figure: the start of a review, "this is one of the best films actually the best I have ever seen the film starts one fall day · · ·", shown first as one-hot columns (one row per dictionary word) and then, after the embedding, as lower-dimensional real-valued columns.]

Embeddings are pretrained on very large corpora of documents, using methods similar to principal components. word2vec and GloVe are popular.

22 / 46
RNN on IMDB Reviews

• After a lot of work, the results are a disappointing 76% accuracy.
• We then fit a more exotic RNN than the one displayed — an LSTM, with long and short term memory. Here A_ℓ receives input from A_{ℓ−1} (short term memory) as well as from a version that reaches further back in time (long term memory). Now we get 87% accuracy, slightly less than the 88% achieved by glmnet.
• These data have been used as a benchmark for new RNN architectures. The best reported result found at the time of writing (2020) was around 95%. We point to a leaderboard in Section 10.5.1.

23 / 46
Time Series Forecasting

[Figure: three daily NYSE time series plotted over time: Log(Trading Volume), Dow Jones Return and Log(Volatility).]

24 / 46
New York Stock Exchange Data
Shown in previous slide are three daily time series for the period
December 3, 1962 to December 31, 1986 (6,051 trading days):
• Log trading volume. This is the fraction of all
outstanding shares that are traded on that day, relative to
a 100-day moving average of past turnover, on the log scale.
• Dow Jones return. This is the difference between the log
of the Dow Jones Industrial Index on consecutive trading
days.
• Log volatility. This is based on the absolute values of
daily price movements.
Goal: predict Log trading volume tomorrow, given its
observed values up to today, as well as those of Dow Jones
return and Log volatility.
These data were assembled by LeBaron and Weigend (1998) IEEE
Transactions on Neural Networks, 9(1): 213–220.
25 / 46
Autocorrelation

[Figure: the autocorrelation function of Log(Trading Volume), for lags 0 to 35 trading days.]

• The autocorrelation at lag ℓ is the correlation of all pairs (v_t, v_{t−ℓ}) that are ℓ trading days apart.
• These sizable correlations give us confidence that past values will be helpful in predicting the future.
• This is a curious prediction problem: the response v_t is also a feature v_{t−ℓ}!

26 / 46
RNN Forecaster

We only have one series of data! How do we set up for an RNN?

We extract many short mini-series of input sequences X = {X_1, X_2, ..., X_L} with a predefined length L known as the lag:

  X_1 = (v_{t−L}, r_{t−L}, z_{t−L})ᵀ,  X_2 = (v_{t−L+1}, r_{t−L+1}, z_{t−L+1})ᵀ,  ...,  X_L = (v_{t−1}, r_{t−1}, z_{t−1})ᵀ,  and Y = v_t.

Since T = 6,051, with L = 5 we can create 6,046 such (X, Y) pairs.

We use the first 4,281 as training data, and the following 1,770 as test data. We fit an RNN with 12 hidden units per lag step (i.e. per A_ℓ).

27 / 46
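
A base-R sketch of extracting the lagged (X, Y) pairs described above from three series v (log volume), r (DJ return) and z (log volatility). The toy series here are random placeholders; with the real data and T = 6,051, L = 5 this would produce the 6,046 pairs on the slide.

```r
make_lagged <- function(v, r, z, L = 5) {
  T <- length(v)
  n <- T - L                                        # number of (X, Y) pairs
  X <- array(NA, dim = c(n, L, 3))                  # n sequences, L lag steps, 3 series per step
  for (i in seq_len(n)) {
    idx <- i:(i + L - 1)                            # days t-L, ..., t-1 for target day t = i + L
    X[i, , ] <- cbind(v[idx], r[idx], z[idx])
  }
  list(X = X, Y = v[(L + 1):T])                     # Y = v_t, tomorrow's log trading volume
}

set.seed(4)
toy <- make_lagged(v = rnorm(100), r = rnorm(100), z = rnorm(100), L = 5)
dim(toy$X)   # 95 x 5 x 3
```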
RNN Results for NYSE Data

[Figure: observed and predicted log(Trading Volume) over the test period, 1980–1986.]

The figure shows predictions and truth for the test period.

R² = 0.42 for the RNN.
R² = 0.18 for the straw man — use yesterday's value of Log trading volume to predict that of today.

28 / 46
Autoregression Forecaster

The RNN forecaster is similar in structure to a traditional autoregression procedure.

  y = (v_{L+1}, v_{L+2}, v_{L+3}, ..., v_T)ᵀ,

  M = [ 1  v_L      v_{L−1}  ···  v_1
        1  v_{L+1}  v_L      ···  v_2
        1  v_{L+2}  v_{L+1}  ···  v_3
        ⋮
        1  v_{T−1}  v_{T−2}  ···  v_{T−L} ]

Fit an OLS regression of y on M, giving

  v̂_t = β̂_0 + β̂_1 v_{t−1} + β̂_2 v_{t−2} + ··· + β̂_L v_{t−L}.

Known as an order-L autoregression model, or AR(L).

For the NYSE data we can include lagged versions of DJ return and log volatility in the matrix M, resulting in 3L + 1 columns.

29 / 46
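
A base-R sketch of the AR(L) fit: embed() builds the lagged design matrix M above (without the column of ones, which lm supplies as the intercept), and lm runs the OLS regression. The series v here is a simulated stand-in for log trading volume.

```r
set.seed(5)
v <- arima.sim(list(ar = 0.8), n = 300)    # a toy autocorrelated series standing in for log volume
L <- 5

lagged <- embed(v, L + 1)                  # row t contains (v_t, v_{t-1}, ..., v_{t-L})
y <- lagged[, 1]                           # response v_t
M <- lagged[, -1]                          # lagged predictors v_{t-1}, ..., v_{t-L}

ar5 <- lm(y ~ M)                           # v_t ~ beta_0 + beta_1 v_{t-1} + ... + beta_L v_{t-L}
coef(ar5)
```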
Autoregression Results for NYSE Data

R² = 0.41 for the AR(5) model (16 parameters)
R² = 0.42 for the RNN model (205 parameters)
R² = 0.42 for the AR(5) model fit by a neural network
R² = 0.46 for all models if we include the day of the week of the day being predicted

30 / 46
Summary of RNNs

• We have presented the simplest of RNNs. Many more complex variations exist.
• One variation treats the sequence as a one-dimensional image, and uses CNNs for fitting. For example, a sequence of words using an embedding representation can be viewed as an image, and the CNN convolves by sliding a convolutional filter along the sequence.
• Can have additional hidden layers, where each hidden layer is a sequence, and treats the previous hidden layer as an input sequence.
• Can have the output also be a sequence, with input and output sharing the hidden units. So-called seq2seq learning is used for language translation.

31 / 46
When to Use Deep Learning

• CNNs have had enormous successes in image classification and modeling, and are starting to be used in medical diagnosis. Examples include digital mammography, ophthalmology, MRI scans, and digital X-rays.
• RNNs have had big wins in speech modeling, language translation, and forecasting.

Should we always use deep learning models?
• Often the big successes occur when the signal-to-noise ratio is high — e.g. image recognition and language translation. Datasets are large, and overfitting is not a big problem.
• For noisier data, simpler models can often work better.
• On the NYSE data, the AR(5) model is much simpler than an RNN, and performed as well.
• On the IMDB review data, the linear model fit by glmnet did as well as the neural network, and better than the RNN.
• We endorse the Occam's razor principle — we prefer simpler models if they work as well. More interpretable!

32 / 46
Fitting Neural Networks

[Figure: single hidden layer network with inputs X_1, ..., X_4, hidden units A_1, ..., A_5, and output f(X) feeding Y.]

  minimize over {w_k}_{k=1}^K, β:   (1/2) Σ_{i=1}^n (y_i − f(x_i))²,

where

  f(x_i) = β_0 + Σ_{k=1}^K β_k g(w_{k0} + Σ_{j=1}^p w_{kj} x_{ij}).

This problem is difficult because the objective is non-convex.

Despite this, effective algorithms have evolved that can optimize complex neural network problems efficiently.

33 / 46
Non Convex Functions and Gradient Descent

Let R(θ) = (1/2) Σ_{i=1}^n (y_i − f_θ(x_i))², with θ = ({w_k}_{k=1}^K, β).

[Figure: a non-convex objective R(θ) plotted against a one-dimensional θ, with the gradient-descent iterates R(θ⁰), R(θ¹), R(θ²), ..., R(θ⁷) descending toward a minimum.]

1. Start with a guess θ⁰ for all the parameters in θ, and set t = 0.
2. Iterate until the objective R(θ) fails to decrease:
   (a) Find a vector δ that reflects a small change in θ, such that θ^{t+1} = θ^t + δ reduces the objective; i.e. R(θ^{t+1}) < R(θ^t).
   (b) Set t ← t + 1.

34 / 46
Gradient Descent Continued

• In this simple example we reached the global minimum.
• If we had started a little to the left of θ⁰ we would have gone in the other direction, and ended up in a local minimum.
• Although θ is multi-dimensional, we have depicted the process as one-dimensional. It is much harder to identify whether one is in a local minimum in high dimensions.

How do we find a direction δ that points downhill? We compute the gradient vector

  ∇R(θ^t) = ∂R(θ)/∂θ evaluated at θ = θ^t,

i.e. the vector of partial derivatives at the current guess θ^t. The gradient points uphill, so our update is δ = −ρ∇R(θ^t), or

  θ^{t+1} ← θ^t − ρ∇R(θ^t),

where ρ is the learning rate (typically small, e.g. ρ = 0.001).

35 / 46
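
A base-R sketch of the update θ^{t+1} ← θ^t − ρ∇R(θ^t) on a one-dimensional non-convex function, in the spirit of the figure. The objective here is made up purely for illustration.

```r
R     <- function(theta) theta^4 - theta^2 + 0.2 * theta      # a toy non-convex objective
gradR <- function(theta) 4 * theta^3 - 2 * theta + 0.2        # its derivative (the gradient)

theta <- 1.0          # starting guess theta^0
rho   <- 0.01         # learning rate
for (t in 1:200) {
  theta <- theta - rho * gradR(theta)    # step downhill along the negative gradient
}
c(theta = theta, R = R(theta))           # a (possibly local) minimum
```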
Gradients and Backpropagation

R(θ) = Σ_{i=1}^n R_i(θ) is a sum, so the gradient is a sum of gradients.

  R_i(θ) = (1/2)(y_i − f_θ(x_i))² = (1/2)(y_i − β_0 − Σ_{k=1}^K β_k g(w_{k0} + Σ_{j=1}^p w_{kj} x_{ij}))²

For ease of notation, let z_{ik} = w_{k0} + Σ_{j=1}^p w_{kj} x_{ij}.

Backpropagation uses the chain rule for differentiation:

  ∂R_i(θ)/∂β_k = ∂R_i(θ)/∂f_θ(x_i) · ∂f_θ(x_i)/∂β_k
               = −(y_i − f_θ(x_i)) · g(z_{ik}).

  ∂R_i(θ)/∂w_{kj} = ∂R_i(θ)/∂f_θ(x_i) · ∂f_θ(x_i)/∂g(z_{ik}) · ∂g(z_{ik})/∂z_{ik} · ∂z_{ik}/∂w_{kj}
                  = −(y_i − f_θ(x_i)) · β_k · g′(z_{ik}) · x_{ij}.

36 / 46
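
A base-R sketch of the two gradient formulas above for a single observation, using the ReLU activation (so g′(z) is 1 for z > 0 and 0 otherwise). The weights W and beta are random placeholders, and the convention for the bias terms (first column of W, first element of beta) is an assumption of this sketch.

```r
relu  <- function(z) pmax(z, 0)
drelu <- function(z) as.numeric(z > 0)      # g'(z) for the ReLU

grad_one_obs <- function(x, y, W, beta) {
  z <- drop(W[, 1] + W[, -1, drop = FALSE] %*% x)   # z_ik = w_k0 + sum_j w_kj x_ij
  A <- relu(z)                                      # hidden activations g(z_ik)
  f <- beta[1] + sum(beta[-1] * A)                  # f_theta(x_i)
  resid <- y - f
  dbeta <- -resid * c(1, A)                            # dR_i/dbeta_0 = -(y-f), dR_i/dbeta_k = -(y-f) g(z_ik)
  dW    <- (-resid * beta[-1] * drelu(z)) %o% c(1, x)  # dR_i/dw_kj = -(y-f) beta_k g'(z_ik) x_ij
  list(dbeta = dbeta, dW = dW)
}

set.seed(6)
p <- 3; K <- 4
W    <- matrix(rnorm(K * (p + 1)), K, p + 1)
beta <- rnorm(K + 1)
grad_one_obs(x = c(1, -2, 0.5), y = 1.5, W, beta)
```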
Tricks of the Trade
• Slow learning. Gradient descent is slow, and a small
learning rate ρ slows it even further. With early stopping,
this is a form of regularization.
• Stochastic gradient descent. Rather than compute the
gradient using all the data, use a small minibatch drawn at
random at each step. E.g. for MNIST data, with n = 60K,
we use minibatches of 128 observations.
• An epoch is a count of iterations and amounts to the
number of minibatch updates such that n samples in total
have been processed; i.e. 60K/128 ≈ 469 for MNIST.
• Regularization. Ridge and lasso regularization can be used
to shrink the weights at each layer. Two other popular
forms of regularization are dropout and augmentation,
discussed next.

37 / 46
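
A base-R sketch of the minibatch bookkeeping for one epoch: shuffle the n observation indices and split them into batches of 128. For n = 60,000 this gives the roughly 469 minibatch updates per epoch mentioned above; the actual gradient update on each batch is left as a placeholder.

```r
n <- 60000; batch_size <- 128
ceiling(n / batch_size)                     # about 469 minibatch updates per epoch, as on the slide

shuffled <- sample(n)                                                    # new random order each epoch
batches  <- split(shuffled, ceiling(seq_along(shuffled) / batch_size))   # list of minibatch index sets
length(batches)   # also 469; each SGD step would use the gradient computed on one of these index sets
```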
Dropout Learning

• At each SGD update, randomly remove units with probability φ, and scale up the weights of those retained by 1/(1 − φ) to compensate.
• In simple scenarios like linear regression, a version of this process can be shown to be equivalent to ridge regularization.
• As in ridge, the other units stand in for those temporarily removed, and their weights are drawn closer together.
• Similar to randomly omitting variables when growing trees in random forests (Chapter 8).

38 / 46
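
A base-R sketch of a dropout mask as described in the first bullet: each hidden activation is zeroed with probability φ and the survivors are scaled by 1/(1 − φ). This would be applied afresh at each SGD update, during training only; the activations here are made-up values.

```r
dropout <- function(A, phi = 0.4) {
  # A: vector (or matrix) of hidden-layer activations
  keep <- rbinom(length(A), size = 1, prob = 1 - phi)   # 0 = dropped, 1 = retained
  A * keep / (1 - phi)                                  # scale retained units by 1/(1 - phi)
}

set.seed(7)
A <- runif(10)          # ten hypothetical activations
dropout(A)              # roughly 40% of them set to zero, the rest scaled up
```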
Ridge and Data Augmentation

[Figure: two scatter plots in (X1, X2). Left: the original observations. Right: the augmented data, with a small cloud of noisy copies around each observation.]

• Make many copies of each (x_i, y_i) and add a small amount of Gaussian noise to the x_i — a little cloud around each observation — but leave the copies of y_i alone!
• This makes the fit robust to small perturbations in x_i, and is equivalent to ridge regularization in an OLS setting.

39 / 46
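
A base-R sketch of the augmentation described above: each row of a feature matrix x is copied several times with a small amount of Gaussian noise added, while the responses y are simply repeated. The number of copies and noise level are arbitrary illustration values.

```r
augment <- function(x, y, copies = 10, sd = 0.1) {
  # x: n x p feature matrix; y: length-n response vector
  x_big <- x[rep(seq_len(nrow(x)), each = copies), , drop = FALSE]
  x_big <- x_big + matrix(rnorm(length(x_big), sd = sd), nrow(x_big), ncol(x_big))  # the "cloud"
  y_big <- rep(y, each = copies)                                                    # labels unchanged
  list(x = x_big, y = y_big)
}

set.seed(8)
x <- matrix(rnorm(20), 10, 2)          # 10 observations, 2 features
y <- rbinom(10, 1, 0.5)
aug <- augment(x, y)
dim(aug$x)                             # 100 x 2: a little cloud of 10 copies around each point
```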
Data Augmentation on the Fly

• Data augmentation is especially effective with SGD, here demonstrated for a CNN and image classification.
• Natural transformations are made of each training image when it is sampled by SGD, thus ultimately making a cloud of images around each original training image.
• The label is left unchanged — in each case still tiger.
• Improves the performance of the CNN, and is similar to ridge.

40 / 46
Double Descent

• With neural networks, it seems better to have too many hidden units than too few.
• Likewise, more hidden layers seem better than few.
• Running stochastic gradient descent till zero training error often gives good out-of-sample error.
• Increasing the number of units or layers and again training till zero error sometimes gives even better out-of-sample error.

What happened to overfitting and the usual bias-variance trade-off?

Belkin, Hsu, Ma and Mandal (arXiv 2018), "Reconciling Modern Machine Learning and the Bias-Variance Trade-off."

41 / 46
Simulation

• y = sin(x) + ε, with x ∼ U[−5, 5] and ε Gaussian with S.D. = 0.3.
• Training set n = 20, test set very large (10K).
• We fit a natural spline to the data (Section 7.4) with d degrees of freedom — i.e. a linear regression onto d basis functions: ŷ_i = β̂_1 N_1(x_i) + β̂_2 N_2(x_i) + ··· + β̂_d N_d(x_i).
• When d = 20 we fit the training data exactly, and get all residuals equal to zero.
• When d > 20, we still fit the data exactly, but the solution is not unique. Among the zero-residual solutions, we pick the one with minimum norm — i.e. the zero-residual solution with the smallest Σ_{j=1}^d β̂_j².

42 / 46
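
A sketch of this simulation, assuming the splines and MASS packages are available: for d > 20 basis functions the least-squares problem has many zero-residual solutions, and the Moore–Penrose pseudoinverse (MASS::ginv) selects the minimum-norm one. The choice d = 42 matches one of the panels on the next slides; details such as the exact basis construction differ from the authors' code.

```r
library(splines)   # ns() natural-spline basis
library(MASS)      # ginv() Moore-Penrose pseudoinverse

set.seed(9)
n <- 20
x <- runif(n, -5, 5)
y <- sin(x) + rnorm(n, sd = 0.3)

d <- 42                                   # more basis functions than observations
B <- ns(x, df = d)                        # n x d natural-spline basis matrix

beta_hat <- ginv(B) %*% y                 # minimum-norm zero-residual solution
max(abs(B %*% beta_hat - y))              # residuals are (numerically) zero
sum(beta_hat^2)                           # the norm that shrinks as d grows beyond 20
```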
The Double-Descent Error Curve

[Figure: training and test error versus degrees of freedom d, from 2 to 50. The training error drops to zero at d = 20, while the test error follows the double-descent shape, peaking at the interpolation threshold d = 20 and decreasing again for larger d.]

• When d ≤ 20, the model is OLS, and we see the usual bias-variance trade-off.
• When d > 20, we revert to the minimum-norm solution. As d increases above 20, Σ_{j=1}^d β̂_j² decreases, since it is easier to achieve zero error, and hence we get less wiggly solutions.

43 / 46
Less Wiggly Solutions

[Figure: the fitted functions with 8, 20, 42 and 80 degrees of freedom, plotted over x ∈ [−5, 5]. The d = 20 zero-residual fit is very wiggly; the minimum-norm fits with d = 42 and d = 80 are noticeably smoother.]

To achieve a zero-residual solution with d = 20 is a real stretch! Easier for larger d.

44 / 46
Some Facts

• In a wide linear model (p ≫ n) fit by least squares, SGD with a small step size leads to a minimum-norm zero-residual solution.
• Stochastic gradient flow — i.e. the entire path of SGD solutions — is somewhat similar to the ridge path.
• By analogy, deep and wide neural networks fit by SGD down to zero training error often give good solutions that generalize well.
• In particular, cases with a high signal-to-noise ratio — e.g. image recognition — are less prone to overfitting; the zero-error solution is mostly signal!

45 / 46
Software

• Wonderful software is available for neural networks and deep learning: TensorFlow from Google and PyTorch from Facebook. Both are Python packages.
• In the Chapter 10 lab we demonstrate the tensorflow and keras packages in R, which interface to Python. See the textbook and online resources for Rmarkdown and Jupyter notebooks for these and all labs for the second edition of the ISLR book.
• The torch package in R is available as well, and implements the PyTorch dialect. The Chapter 10 lab will be available in this dialect too; watch the resources page at www.statlearning.com.

46 / 46
