Module 11 - NN and Deep Learning
Deep Learning
$$f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k h_k(X) = \beta_0 + \sum_{k=1}^{K} \beta_k\, g\!\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\right)$$
[Figure: single hidden layer feed-forward neural network with input layer X1–X4, hidden layer activations A1–A5, and output layer f(X) → Y]
Training the Neural Network
• $A_k = h_k(X) = g\!\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\right)$ are called the activations in the hidden layer.
• g(z) is called the activation function (e.g. sigmoid,
ReLU)
• Activation functions in hidden layers are typically
nonlinear, otherwise the model collapses to a linear
model.
• The model is fit by minimizing $\sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2$ for regression, and the cross-entropy for classification.
• The weights $w_{kj}$ are learned using the backpropagation algorithm (a forward-pass sketch of this model follows below).
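As a concrete illustration, here is a minimal NumPy sketch (my own, not from the slides) of the forward pass $f(X) = \beta_0 + \sum_k \beta_k\, g(w_{k0} + \sum_j w_{kj} X_j)$ with a sigmoid activation; the array shapes and random weights are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, w0, beta, beta0):
    """f(X) = beta0 + sum_k beta_k * g(w_k0 + sum_j w_kj * X_j)."""
    # X: (n, p), W: (K, p), w0: (K,), beta: (K,), beta0: scalar
    A = sigmoid(X @ W.T + w0)      # hidden activations A_k, shape (n, K)
    return beta0 + A @ beta        # output f(X), shape (n,)

# toy example with p = 4 inputs and K = 5 hidden units (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
W, w0 = rng.normal(size=(5, 4)), rng.normal(size=5)
beta, beta0 = rng.normal(size=5), 0.1
print(forward(X, W, w0, beta, beta0).shape)   # (10,)
```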
Activation Functions
Sigmoid
$$g(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}$$
ReLU
$$g(z) = (z)_{+} = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{otherwise} \end{cases}$$
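A short NumPy sketch of the two activation functions above; the function names are my own.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})"""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """g(z) = max(z, 0), applied element-wise"""
    return np.maximum(z, 0.0)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))   # [0.119 0.5   0.953] (approximately)
print(relu(z))      # [0. 0. 3.]
```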
Activation Function
[Figure: neural network with two hidden layers and ten output units. Inputs X1–Xp connect to the first hidden layer L1 with activations A(1)1–A(1)K1 via weights W1; these connect to the second hidden layer L2 with activations A(2)1–A(2)K2 via weights W2; the second hidden layer connects to the output layer f0(X)–f9(X) → Y0–Y9 via weights B.]
Details of Output Layer
• Let $Z_m = \beta_{m0} + \sum_{l=1}^{K_2} \beta_{ml} A_l^{(2)}$, $m = 0, 1, \ldots, 9$, be 10 linear combinations of the activations at the second hidden layer.
• The output activation function encodes the softmax function:
$$f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{l=0}^{9} e^{Z_l}}$$
• Fit the model by minimizing the negative multinomial log-likelihood (cross-entropy), where $y_{im} = 1$ if the $i$th observation is in class $m$ and 0 otherwise (a NumPy sketch follows below):
$$-\sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log f_m(x_i)$$
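A small NumPy sketch of the softmax output and the cross-entropy loss it is trained with; the variable names and example scores are illustrative.

```python
import numpy as np

def softmax(Z):
    """f_m(X) = exp(Z_m) / sum_l exp(Z_l), computed row-wise with a stability shift."""
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(F, y):
    """Negative multinomial log-likelihood: -sum_i log f_{y_i}(x_i)."""
    n = len(y)
    return -np.log(F[np.arange(n), y]).sum()

Z = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2, 0.3]])   # linear scores Z_m for two observations, three classes
y = np.array([0, 2])              # true class labels
F = softmax(Z)
print(F.sum(axis=1))              # each row sums to 1
print(cross_entropy(F, y))
```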
Convolution Layers
• Convolution layers identify low-level features (edges, patches of colour) by searching for small patterns.
• Pooling layers down-sample the small patterns that have been identified.
• Later layers combine these into mid-level features (eyes, ears).
• A final classifier maps the high-level features to classes (tiger, lion).
Convolution Layers
Input image ($4 \times 3$ pixels):
$$\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \\ j & k & l \end{pmatrix}$$
Convolution filter ($2 \times 2$):
$$\begin{pmatrix} \alpha & \beta \\ \gamma & \delta \end{pmatrix}$$
Convolved image ($3 \times 2$):
$$\begin{pmatrix}
a\alpha + b\beta + d\gamma + e\delta & b\alpha + c\beta + e\gamma + f\delta \\
d\alpha + e\beta + g\gamma + h\delta & e\alpha + f\beta + h\gamma + i\delta \\
g\alpha + h\beta + j\gamma + k\delta & h\alpha + i\beta + k\gamma + l\delta
\end{pmatrix}$$
• The convolution filter (CF) is slid around the input image, scoring for matches.
• The scoring is done via dot products.
• If a sub-image of the input image is similar to the filter, the score is high; otherwise it is low (a NumPy sketch follows after this list).
• CFs are learned during training.
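A minimal NumPy sketch of the valid convolution above; the function name and the example values are mine, not from the slides.

```python
import numpy as np

def convolve2d_valid(image, cf):
    """Slide the convolution filter over the image and record the dot product at each position."""
    H, W = image.shape
    h, w = cf.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * cf)   # element-wise product, then sum
    return out

image = np.arange(12, dtype=float).reshape(4, 3)   # stands in for the 4x3 pixel block a..l
cf = np.array([[1.0, 0.0],
               [0.0, -1.0]])                       # stands in for the 2x2 filter (alpha..delta)
print(convolve2d_valid(image, cf))                 # 3x2 convolved image
```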
Convolution Example
• The two filters shown here highlight vertical and horizontal stripes.
• The result of the convolution is a new feature map.
• In the first image vertical stripes are more prominent
• In the second image the horizontal stripes are more prominent
Convolution Layer
• In CNNs the filters are learned for the specific
classification task.
• The filter weights are the parameters going from
an input layer to a hidden layer, with one hidden
unit for each pixel in the convolved image.
CIFAR100 Example
• A colour image has three channels, represented by a three-dimensional feature map (array).
• Each channel is a 2D (32 × 32) feature map, one each for R, G and B.
• A single CF also has three channels, one per colour, each of dimension 3 × 3, with potentially different filter weights.
• At the first hidden layer, K CFs produce K 2D feature maps.
• For each CF, the results of the three channel-wise convolutions are summed to form a single 2D output feature map; a sketch follows below.
• A ReLU activation function is then applied to the convolved image, in a separate layer known as the detector layer.
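As referenced above, a rough NumPy sketch of a multi-channel convolution at the first hidden layer: each of the K filters is convolved with each of the three channels, the results are summed, and a ReLU detector stage follows. Shapes and random weights are illustrative only.

```python
import numpy as np

def convolve2d_valid(channel, cf):
    """Valid 2D convolution of one image channel with one 2D filter slice."""
    H, W = channel.shape
    h, w = cf.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(channel[i:i + h, j:j + w] * cf)
    return out

def conv_layer(image, filters):
    """image: (32, 32, 3); filters: (K, 3, 3, 3) -> K feature maps of size (30, 30)."""
    K = filters.shape[0]
    maps = []
    for k in range(K):
        # sum the three per-channel convolutions into one 2D feature map
        fm = sum(convolve2d_valid(image[:, :, c], filters[k, :, :, c]) for c in range(3))
        maps.append(np.maximum(fm, 0.0))   # ReLU "detector" stage
    return np.stack(maps, axis=-1)

rng = np.random.default_rng(1)
image = rng.random((32, 32, 3))          # one CIFAR-like colour image
filters = rng.normal(size=(4, 3, 3, 3))  # K = 4 filters, each 3x3x3
print(conv_layer(image, filters).shape)  # (30, 30, 4)
```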
Pooling Layers
• Max pooling
• Convolution repeatedly multiplies matrix elements and adds the results.
• The resultant (convolved) image emphasizes the sections of the original image that are similar to the CF.
Pooling
$$\text{Max pool:}\quad
\begin{pmatrix} 1 & 2 & 5 & 3 \\ 3 & 0 & 1 & 2 \\ 2 & 1 & 3 & 4 \\ 1 & 1 & 2 & 0 \end{pmatrix}
\;\longrightarrow\;
\begin{pmatrix} 3 & 5 \\ 2 & 4 \end{pmatrix}$$
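A small NumPy sketch of 2 × 2 max pooling that reproduces the example above; the function name is my own.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling: keep the largest value in each 2x2 block."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 5, 3],
              [3, 0, 1, 2],
              [2, 1, 3, 4],
              [1, 1, 2, 0]])
print(max_pool_2x2(x))
# [[3 5]
#  [2 4]]
```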
This has to be one of the worst films of the 1990s. When my friends &
I were watching this film (being the target audience it was aimed at) we
just sat & watched the first half an hour with our jaws touching the floor
at how bad it really was. The rest of the time, everyone else in the theater
just started talking to each other, leaving or generally crying into their
popcorn . . .
[Figure: classification accuracy of the train, validation and test sets; left panel: accuracy versus −log(λ), right panel: accuracy versus training epochs]
Data as sequences:
• Documents are sequences of words, and their relative
positions have meaning.
• Time-series such as weather data
• Financial time series: market indices, stock and bond prices, exchange rates.
• Recorded speech or music.
• Handwriting, such as doctor’s notes.
RNNs build models that take into account this
sequential nature of the data, and build a memory of
the past.
Recurrent Neural Networks
[Figure: recurrent neural network unrolled over a sequence X1, …, XL. Each input Xl feeds the hidden activations Al through shared weights W; the previous hidden state Al−1 feeds Al through shared weights U; each Al produces an output Ol through shared weights B.]
$$A_{lk} = g\!\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_{lj} + \sum_{s=1}^{K} u_{ks} A_{l-1,s}\right), \qquad
O_l = \beta_0 + \sum_{k=1}^{K} \beta_k A_{lk}$$
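A rough NumPy sketch of the recurrent forward pass defined above, with the shared weight matrices W, U and the output coefficients applied at every step; dimensions, the tanh choice for g, and the random values are illustrative.

```python
import numpy as np

def rnn_forward(X, W, w0, U, beta, beta0, g=np.tanh):
    """X: (L, p) sequence; returns outputs O_1..O_L using shared weights at every step."""
    L, p = X.shape
    K = U.shape[0]
    a_prev = np.zeros(K)                   # A_0 is initialized to zero
    outputs = []
    for l in range(L):
        # A_l = g(w0 + W x_l + U A_{l-1})  -- same W, U at every position l
        a = g(w0 + W @ X[l] + U @ a_prev)
        outputs.append(beta0 + beta @ a)   # O_l = beta0 + sum_k beta_k A_lk
        a_prev = a
    return np.array(outputs)

rng = np.random.default_rng(2)
L, p, K = 6, 3, 4
X = rng.normal(size=(L, p))
W, w0 = rng.normal(size=(K, p)), rng.normal(size=K)
U = rng.normal(size=(K, K))
beta, beta0 = rng.normal(size=K), 0.0
print(rnn_forward(X, W, w0, U, beta, beta0))   # one output per position in the sequence
```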
[Figure: one-hot and lower-dimensional embedding representations of the words in the document "this is one of the best films actually the best I have ever seen the film starts one fall day"]
Time Series Forecasting
[Figure: three daily New York Stock Exchange time series (log trading volume, Dow Jones return, log volatility), December 1962 to December 1986]
New York Stock Exchange Data
Shown in the previous slide are three daily time series for the period December 3, 1962 to December 31, 1986 (6,051 trading days):
• Log trading volume (𝒗𝒕 ) - This is the fraction of all
outstanding shares that are traded on that day, relative to
a 100-day moving average of past turnover, on the log
scale.
• Dow Jones return (𝒓𝒕 ) - This is the difference between
the log of the Dow Jones Industrial Index on consecutive
trading days.
• Log volatility (𝒛𝒕) - This is based on the absolute values of daily price movements.
Goal: Predict Log trading volume tomorrow, given its observed values up to today, as well as those of Dow Jones return and Log volatility.
Autocorrelation
[Figure: autocorrelation function of Log(Trading Volume) versus lag (0–35 days)]
$$\mathbf{y} = \begin{pmatrix} v_{L+1} \\ v_{L+2} \\ v_{L+3} \\ \vdots \\ v_T \end{pmatrix}, \qquad
\mathbf{M} = \begin{pmatrix}
1 & v_L & v_{L-1} & \cdots & v_1 \\
1 & v_{L+1} & v_L & \cdots & v_2 \\
1 & v_{L+2} & v_{L+1} & \cdots & v_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & v_{T-1} & v_{T-2} & \cdots & v_{T-L}
\end{pmatrix}$$
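A small NumPy sketch of building the response y and lag matrix M above and fitting the autoregression by ordinary least squares; the series here is synthetic and the lag L = 5 is an illustrative choice.

```python
import numpy as np

def make_lagged(v, L):
    """Build y = (v_{L+1}, ..., v_T) and M with rows (1, v_{t-1}, ..., v_{t-L})."""
    T = len(v)
    y = v[L:]                                              # v_{L+1}, ..., v_T
    M = np.column_stack([np.ones(T - L)] +
                        [v[L - k:T - k] for k in range(1, L + 1)])  # lags 1..L
    return y, M

rng = np.random.default_rng(3)
v = np.cumsum(rng.normal(size=200))            # synthetic stand-in for log trading volume
y, M = make_lagged(v, L=5)
coef, *_ = np.linalg.lstsq(M, y, rcond=None)   # OLS fit of the AR(L) model
y_hat = M @ coef                               # one-step-ahead predictions
print(M.shape, coef.shape)                     # (195, 6) (6,)
```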
[Figure: single hidden layer neural network with inputs X1–X4, hidden units A1–A5 and output f(X) → Y, used to illustrate the fitting problem below]
$$\underset{\{w_k\}_{1}^{K},\, \beta}{\text{minimize}} \;\; \frac{1}{2} \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2,
\qquad \text{where } f(x_i) = \beta_0 + \sum_{k=1}^{K} \beta_k\, g\!\left(w_{k0} + \sum_{j=1}^{p} w_{kj} x_{ij}\right)$$
[Figure: gradient descent on a one-dimensional objective R(θ), showing the objective value decreasing from R(θ0) through R(θ1) and R(θ2) to R(θ7) as θ is updated from θ0 to θ7]
• For ease of notation, let $z_{ik} = w_{k0} + \sum_{j=1}^{p} w_{kj} x_{ij}$.
• Backpropagation uses the chain rule for differentiation:
$$\frac{\partial R_i(\theta)}{\partial \beta_k} = \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial \beta_k}$$
Gradient descent update:
$$\theta^{m+1} \longleftarrow \theta^{m} - \rho\, \nabla_\theta R(\theta^{m})$$
Stochastic gradient update (based on a single component $R_k$):
$$\theta^{m+1} \longleftarrow \theta^{m} - \rho\, \nabla_\theta R_k(\theta^{m})$$
Example: consider the one-dimensional least squares problem
$$\min_{\theta}\; R(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left(a_i \theta - b_i\right)^2,$$
whose closed-form minimizer is
$$\theta^{*} = \frac{\sum_{i=1}^{n} a_i b_i}{\sum_{i=1}^{n} a_i^{2}}.$$
The individual components $\left(a_1\theta - b_1\right)^2, \left(a_2\theta - b_2\right)^2, \ldots, \left(a_n\theta - b_n\right)^2$ are each minimized at $\theta = b_i / a_i$, and these minimizers span the interval
$$Q = \left[\min_i \frac{b_i}{a_i},\; \max_i \frac{b_i}{a_i}\right]$$
• $Q$: region of confusion
• The full-gradient minimizer $\theta^{*}$ lies somewhere in $Q$
• Outside $Q$, the signs of $\nabla_\theta R(\theta)$ and $\nabla_\theta R_i(\theta)$ are the same
• This means that outside $Q$, a step along $\nabla_\theta R_i(\theta)$ moves in the right direction
• As a result, SGD makes quick improvement in the initial steps
• Inside $Q$ this property breaks down: as you get closer to the optimum, the fluctuation increases
Stochastic Gradient Descent
The stochastic gradient $g(\theta)$ is an unbiased estimate of the full gradient:
$$E\left[g(\theta)\right] = \nabla_\theta R(\theta)$$
$$\min_\theta R(\theta)$$
At each iteration:
• Option 1: Pick index $i$ with replacement
• Option 2: Pick index $i$ without replacement
• Use $g(\theta) = \nabla_\theta R_i(\theta)$ as the stochastic gradient (SG)
• Update $\theta^{m+1} \longleftarrow \theta^{m} - \rho\, g(\theta^{m})$ (a worked sketch follows below)
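To make the procedure concrete, here is a hedged NumPy sketch of SGD on the one-dimensional least squares problem above, compared with the closed-form θ*; the data, learning rate and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
a = rng.normal(size=n)
b = 2.5 * a + rng.normal(scale=0.5, size=n)   # so theta* is near 2.5

theta_star = np.sum(a * b) / np.sum(a ** 2)   # closed-form minimizer

theta, rho = 0.0, 0.05                        # initial value and learning rate
for m in range(2000):
    i = rng.integers(n)                       # Option 1: pick index i with replacement
    grad_i = a[i] * (a[i] * theta - b[i])     # gradient of R_i(theta) = (1/2)(a_i*theta - b_i)^2
    theta -= rho * grad_i                     # SGD update

print(theta_star, theta)                      # SGD ends up close to theta*, fluctuating inside Q
```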
[Figure: loss contours in the (𝑊1, 𝑊2) plane]
Since the curvature in the 𝑊2 direction is more pronounced, the gradient component in the 𝑊2 direction is much larger, causing oscillation.
Momentum Optimizer
[Figure: loss contours in the (𝜃1, 𝜃2) plane]
Since the curvature in the 𝜃2 direction is more pronounced, the gradient component in the 𝜃2 direction is much larger, causing oscillation.
Momentum Optimizer
[Figure: loss contours in the (𝜃1, 𝜃2) plane]
$$\theta^{m+1} \longleftarrow \theta^{m} + \gamma V^{m} - \rho\, \nabla R(\theta^{m})$$
where $V$ is the momentum component.
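A brief NumPy sketch of the momentum update above, applied to a simple elongated quadratic where plain gradient descent would oscillate; the objective, γ and ρ are illustrative choices.

```python
import numpy as np

# R(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2): curvature is much larger in the theta_2 direction
def grad_R(theta):
    return np.array([theta[0], 25.0 * theta[1]])

theta = np.array([5.0, 1.0])
V = np.zeros(2)                   # momentum component
gamma, rho = 0.9, 0.03
for m in range(100):
    step = gamma * V - rho * grad_R(theta)    # gamma*V^m - rho*grad R(theta^m)
    theta = theta + step                      # theta^{m+1} <- theta^m + gamma*V^m - rho*grad R(theta^m)
    V = step                                  # carry the step forward as the next momentum term
print(theta)                                  # approaches the minimum at (0, 0)
```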
Issues?
Now let
$$w^{t} = \sum_{\tau=1}^{t} g^{\tau} \circ g^{\tau}$$
(component-wise multiplication, accumulated up to iteration $t$).
Update (element-wise):
$$\theta^{t+1} \longleftarrow \theta^{t} - \frac{\rho}{\epsilon I + \sqrt{w^{t}}}\, g^{t}$$
where $I$ is a vector of 1s.
Adagrad
• For a $d$-dimensional problem:
$$\theta^{t+1} \longleftarrow \theta^{t} - \frac{\rho}{\epsilon I + \sqrt{w^{t}}}\, g^{t}$$
$$\begin{pmatrix} \theta_1^{t+1} \\ \vdots \\ \theta_d^{t+1} \end{pmatrix}
= \begin{pmatrix} \theta_1^{t} \\ \vdots \\ \theta_d^{t} \end{pmatrix}
- \begin{pmatrix} \dfrac{\rho}{\epsilon + \sqrt{w_1^{t}}}\, g_1^{t} \\ \vdots \\ \dfrac{\rho}{\epsilon + \sqrt{w_d^{t}}}\, g_d^{t} \end{pmatrix}$$
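A rough NumPy sketch of the Adagrad update on the same illustrative quadratic used earlier; ρ and ε are placeholder values.

```python
import numpy as np

def grad_R(theta):
    # gradient of R(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2)
    return np.array([theta[0], 25.0 * theta[1]])

theta = np.array([5.0, 1.0])
w = np.zeros(2)                   # accumulated squared gradients, one entry per dimension
rho, eps = 0.5, 1e-8
for t in range(500):
    g = grad_R(theta)
    w += g * g                                # w^t = sum of g^tau o g^tau (component-wise)
    theta -= rho / (eps + np.sqrt(w)) * g     # per-dimension scaled step
print(theta)                      # each coordinate is scaled by its own gradient history
```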
Adagrad
Advantages:
• Adaptively scales the learning rate for different dimensions
by normalizing w.r.t the gradient magnitude in the
corresponding dimension
• Eliminates the need to set the learning rate manually
• Converges rapidly when applied to convex functions
Disadvantages:
• For non-convex functions, it may pass through many complex terrains and end up in a local optimum
• With a large number of iterations, the accumulated scale factor grows and the learning rate becomes very small
• In such cases, the model may stop learning
RMSProp
• Replace the running sum by an exponentially weighted moving average of the squared gradients:
$$w^{t} = \beta w^{t-1} + (1 - \beta)\, g^{t} \circ g^{t}$$
• Bias correction (with $s^{t}$ the analogous moving average of the gradients themselves):
$$\hat{s}^{t} = \frac{s^{t}}{1 - \beta_1^{t}}; \qquad \hat{w}^{t} = \frac{w^{t}}{1 - \beta_2^{t}}$$
• Update:
$$\theta^{t+1} \longleftarrow \theta^{t} - \rho\, \frac{\hat{s}^{t}}{\epsilon I + \sqrt{\hat{w}^{t}}}$$
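A hedged NumPy sketch of the update above; as written (with a first-moment average s^t and bias correction by β1 and β2) it matches the Adam-style variant, and all constants here are illustrative.

```python
import numpy as np

def grad_R(theta):
    # gradient of R(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2)
    return np.array([theta[0], 25.0 * theta[1]])

theta = np.array([5.0, 1.0])
s = np.zeros(2)                      # moving average of gradients (first moment)
w = np.zeros(2)                      # moving average of squared gradients (second moment)
beta1, beta2, rho, eps = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 501):
    g = grad_R(theta)
    s = beta1 * s + (1 - beta1) * g          # first-moment average
    w = beta2 * w + (1 - beta2) * g * g      # w^t = beta*w^{t-1} + (1-beta) g o g
    s_hat = s / (1 - beta1 ** t)             # bias correction
    w_hat = w / (1 - beta2 ** t)
    theta = theta - rho * s_hat / (eps + np.sqrt(w_hat))
print(theta)                                  # moves toward the minimum at (0, 0)
```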
Tuning Parameters for the Model
[Figure: training error and test error versus degrees of freedom (2 to 50)]
[Figure: fitted functions with 42 degrees of freedom (left) and 80 degrees of freedom (right)]