Module 11 - NN and Deep Learning
Deep Learning
$$f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k h_k(X) = \beta_0 + \sum_{k=1}^{K} \beta_k\, g\!\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\right)$$
[Figure: single hidden layer feed-forward neural network with input layer X1–X4, hidden layer activations A1–A5, and output layer f(X) → Y]
Training the Neural Network
• $A_k = h_k(X) = g\!\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\right)$ are called the activations in the hidden layer.
• g(z) is called the activation function (e.g. sigmoid,
ReLU)
• Activation functions in hidden layers are typically
nonlinear, otherwise the model collapses to a linear
model.
• The model is fit by minimizing $\sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2$ for regression, and the cross-entropy for classification.
• The weights $w_{kj}$ are learned using the backpropagation algorithm (a forward-pass sketch of this model follows below).
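As a concrete illustration, here is a minimal NumPy sketch (my own, not from the slides) of the forward pass $f(X) = \beta_0 + \sum_k \beta_k\, g(w_{k0} + \sum_j w_{kj} X_j)$ with a sigmoid activation; the array shapes and random weights are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, w0, beta, beta0):
    """f(X) = beta0 + sum_k beta_k * g(w_k0 + sum_j w_kj * X_j)."""
    # X: (n, p), W: (K, p), w0: (K,), beta: (K,), beta0: scalar
    A = sigmoid(X @ W.T + w0)      # hidden activations A_k, shape (n, K)
    return beta0 + A @ beta        # output f(X), shape (n,)

# toy example with p = 4 inputs and K = 5 hidden units (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
W, w0 = rng.normal(size=(5, 4)), rng.normal(size=5)
beta, beta0 = rng.normal(size=5), 0.1
print(forward(X, W, w0, beta, beta0).shape)   # (10,)
```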
Activation Functions
Sigmoid
$$g(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}$$
ReLU
$$g(z) = (z)_{+} = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{otherwise} \end{cases}$$
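A short NumPy sketch of the two activation functions above; the function names are my own.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})"""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """g(z) = max(z, 0), applied element-wise"""
    return np.maximum(z, 0.0)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))   # [0.119 0.5   0.953] (approximately)
print(relu(z))      # [0. 0. 3.]
```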
Activation Function
[Figure: neural network with two hidden layers and ten output units. Inputs X1–Xp connect to the first hidden layer L1 with activations A(1)1–A(1)K1 via weights W1; these connect to the second hidden layer L2 with activations A(2)1–A(2)K2 via weights W2; the second hidden layer connects to the output layer f0(X)–f9(X) → Y0–Y9 via weights B.]
Details of Output Layer
• Let $Z_m = \beta_{m0} + \sum_{l=1}^{K_2} \beta_{ml} A_l^{(2)}$, $m = 0, 1, \ldots, 9$, be 10 linear combinations of the activations at the second hidden layer.
• The output activation function encodes the softmax function:
$$f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{l=0}^{9} e^{Z_l}}$$
• Fit the model by minimizing the negative multinomial log-likelihood (cross-entropy), where $y_{im} = 1$ if the $i$th observation is in class $m$ and 0 otherwise (a NumPy sketch follows below):
$$-\sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log f_m(x_i)$$
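A small NumPy sketch of the softmax output and the cross-entropy loss it is trained with; the variable names and example scores are illustrative.

```python
import numpy as np

def softmax(Z):
    """f_m(X) = exp(Z_m) / sum_l exp(Z_l), computed row-wise with a stability shift."""
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(F, y):
    """Negative multinomial log-likelihood: -sum_i log f_{y_i}(x_i)."""
    n = len(y)
    return -np.log(F[np.arange(n), y]).sum()

Z = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2, 0.3]])   # linear scores Z_m for two observations, three classes
y = np.array([0, 2])              # true class labels
F = softmax(Z)
print(F.sum(axis=1))              # each row sums to 1
print(cross_entropy(F, y))
```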
Convolution Layers
• Convolution layers identify low-level features (edges, patches of colour) by searching for small patterns.
• Pooling layers down-sample the small patterns that have been identified.
• Later layers combine these into mid-level features (eyes, ears).
• A final classifier maps the high-level features to classes (tiger, lion).
Convolution Layers
Input image ($4 \times 3$ pixels):
$$\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \\ j & k & l \end{pmatrix}$$
Convolution filter ($2 \times 2$):
$$\begin{pmatrix} \alpha & \beta \\ \gamma & \delta \end{pmatrix}$$
Convolved image ($3 \times 2$):
$$\begin{pmatrix}
a\alpha + b\beta + d\gamma + e\delta & b\alpha + c\beta + e\gamma + f\delta \\
d\alpha + e\beta + g\gamma + h\delta & e\alpha + f\beta + h\gamma + i\delta \\
g\alpha + h\beta + j\gamma + k\delta & h\alpha + i\beta + k\gamma + l\delta
\end{pmatrix}$$
• The convolution filter (CF) is slid around the input image, scoring for matches.
• The scoring is done via dot products.
• If a sub-image of the input image is similar to the filter, the score is high; otherwise it is low (a NumPy sketch follows after this list).
• CFs are learned during training.
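A minimal NumPy sketch of the valid convolution above; the function name and the example values are mine, not from the slides.

```python
import numpy as np

def convolve2d_valid(image, cf):
    """Slide the convolution filter over the image and record the dot product at each position."""
    H, W = image.shape
    h, w = cf.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * cf)   # element-wise product, then sum
    return out

image = np.arange(12, dtype=float).reshape(4, 3)   # stands in for the 4x3 pixel block a..l
cf = np.array([[1.0, 0.0],
               [0.0, -1.0]])                       # stands in for the 2x2 filter (alpha..delta)
print(convolve2d_valid(image, cf))                 # 3x2 convolved image
```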
Convolution Example
• The two filters shown here highlight vertical and horizontal stripes.
• The result of the convolution is a new feature map.
• In the first image vertical stripes are more prominent
• In the second image the horizontal stripes are more prominent
Convolution Layer
• In CNNs the filters are learned for the specific
classification task.
• The filter weights are the parameters going from
an input layer to a hidden layer, with one hidden
unit for each pixel in the convolved image.
CIFAR100 Example
• A colour image has three channels, represented by a three-dimensional feature map (array).
• Each channel is a 2D (32 × 32) feature map, one each for R, G and B.
• A single CF also has three channels, one per colour, each of dimension 3 × 3, with potentially different filter weights.
• At the first hidden layer, K CFs produce K 2D feature maps.
• For each CF, the results of the three channel-wise convolutions are summed to form a single 2D output feature map; a sketch follows below.
• A ReLU activation function is then applied to the convolved image, in a separate layer known as the detector layer.
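As referenced above, a rough NumPy sketch of a multi-channel convolution at the first hidden layer: each of the K filters is convolved with each of the three channels, the results are summed, and a ReLU detector stage follows. Shapes and random weights are illustrative only.

```python
import numpy as np

def convolve2d_valid(channel, cf):
    """Valid 2D convolution of one image channel with one 2D filter slice."""
    H, W = channel.shape
    h, w = cf.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(channel[i:i + h, j:j + w] * cf)
    return out

def conv_layer(image, filters):
    """image: (32, 32, 3); filters: (K, 3, 3, 3) -> K feature maps of size (30, 30)."""
    K = filters.shape[0]
    maps = []
    for k in range(K):
        # sum the three per-channel convolutions into one 2D feature map
        fm = sum(convolve2d_valid(image[:, :, c], filters[k, :, :, c]) for c in range(3))
        maps.append(np.maximum(fm, 0.0))   # ReLU "detector" stage
    return np.stack(maps, axis=-1)

rng = np.random.default_rng(1)
image = rng.random((32, 32, 3))          # one CIFAR-like colour image
filters = rng.normal(size=(4, 3, 3, 3))  # K = 4 filters, each 3x3x3
print(conv_layer(image, filters).shape)  # (30, 30, 4)
```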
Pooling Layers
• Max pooling
• Convolution repeatedly multiplies matrix elements and adds the results.
• The resultant (convolved) image emphasizes the sections of the original image that are similar to the CF.
Pooling
$$\text{Max pool:}\quad
\begin{pmatrix} 1 & 2 & 5 & 3 \\ 3 & 0 & 1 & 2 \\ 2 & 1 & 3 & 4 \\ 1 & 1 & 2 & 0 \end{pmatrix}
\;\longrightarrow\;
\begin{pmatrix} 3 & 5 \\ 2 & 4 \end{pmatrix}$$
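A small NumPy sketch of 2 × 2 max pooling that reproduces the example above; the function name is my own.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling: keep the largest value in each 2x2 block."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 5, 3],
              [3, 0, 1, 2],
              [2, 1, 3, 4],
              [1, 1, 2, 0]])
print(max_pool_2x2(x))
# [[3 5]
#  [2 4]]
```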
This has to be one of the worst films of the 1990s. When my friends &
I were watching this film (being the target audience it was aimed at) we
just sat & watched the first half an hour with our jaws touching the floor
at how bad it really was. The rest of the time, everyone else in the theater
just started talking to each other, leaving or generally crying into their
popcorn . . .
[Figure: classification accuracy of the train, validation and test sets; left panel: accuracy versus −log(λ), right panel: accuracy versus training epochs]
Data as sequences:
• Documents are sequences of words, and their relative
positions have meaning.
• Time-series such as weather data
• Financial time series: market indices, stock and bond prices, exchange rates.
• Recorded speech or music.
• Handwriting, such as doctor’s notes.
RNNs build models that take into account this
sequential nature of the data, and build a memory of
the past.
Recurrent Neural Networks
[Figure: recurrent neural network unrolled over a sequence X1, …, XL. Each input Xl feeds the hidden activations Al through shared weights W; the previous hidden state Al−1 feeds Al through shared weights U; each Al produces an output Ol through shared weights B.]
$$A_{lk} = g\!\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_{lj} + \sum_{s=1}^{K} u_{ks} A_{l-1,s}\right), \qquad
O_l = \beta_0 + \sum_{k=1}^{K} \beta_k A_{lk}$$
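A rough NumPy sketch of the recurrent forward pass defined above, with the shared weight matrices W, U and the output coefficients applied at every step; dimensions, the tanh choice for g, and the random values are illustrative.

```python
import numpy as np

def rnn_forward(X, W, w0, U, beta, beta0, g=np.tanh):
    """X: (L, p) sequence; returns outputs O_1..O_L using shared weights at every step."""
    L, p = X.shape
    K = U.shape[0]
    a_prev = np.zeros(K)                   # A_0 is initialized to zero
    outputs = []
    for l in range(L):
        # A_l = g(w0 + W x_l + U A_{l-1})  -- same W, U at every position l
        a = g(w0 + W @ X[l] + U @ a_prev)
        outputs.append(beta0 + beta @ a)   # O_l = beta0 + sum_k beta_k A_lk
        a_prev = a
    return np.array(outputs)

rng = np.random.default_rng(2)
L, p, K = 6, 3, 4
X = rng.normal(size=(L, p))
W, w0 = rng.normal(size=(K, p)), rng.normal(size=K)
U = rng.normal(size=(K, K))
beta, beta0 = rng.normal(size=K), 0.0
print(rnn_forward(X, W, w0, U, beta, beta0))   # one output per position in the sequence
```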
[Figure: one-hot and lower-dimensional embedding representations of the words in the document "this is one of the best films actually the best I have ever seen the film starts one fall day"]
Time Series Forecasting
[Figure: three daily New York Stock Exchange time series (log trading volume, Dow Jones return, log volatility), December 1962 to December 1986]
New York Stock Exchange Data
Shown in the previous slide are three daily time series for the period December 3, 1962 to December 31, 1986 (6,051 trading days):
• Log trading volume (𝒗𝒕 ) - This is the fraction of all
outstanding shares that are traded on that day, relative to
a 100-day moving average of past turnover, on the log
scale.
• Dow Jones return (𝒓𝒕 ) - This is the difference between
the log of the Dow Jones Industrial Index on consecutive
trading days.
• Log volatility (𝒛𝒕) - This is based on the absolute values of daily price movements.
Goal: Predict Log trading volume tomorrow, given its observed values up to today, as well as those of Dow Jones return and Log volatility.
Autocorrelation
[Figure: autocorrelation function of Log(Trading Volume) versus lag (0–35 days)]
$$\mathbf{y} = \begin{pmatrix} v_{L+1} \\ v_{L+2} \\ v_{L+3} \\ \vdots \\ v_T \end{pmatrix}, \qquad
\mathbf{M} = \begin{pmatrix}
1 & v_L & v_{L-1} & \cdots & v_1 \\
1 & v_{L+1} & v_L & \cdots & v_2 \\
1 & v_{L+2} & v_{L+1} & \cdots & v_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & v_{T-1} & v_{T-2} & \cdots & v_{T-L}
\end{pmatrix}$$
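A small NumPy sketch of building the response y and lag matrix M above and fitting the autoregression by ordinary least squares; the series here is synthetic and the lag L = 5 is an illustrative choice.

```python
import numpy as np

def make_lagged(v, L):
    """Build y = (v_{L+1}, ..., v_T) and M with rows (1, v_{t-1}, ..., v_{t-L})."""
    T = len(v)
    y = v[L:]                                              # v_{L+1}, ..., v_T
    M = np.column_stack([np.ones(T - L)] +
                        [v[L - k:T - k] for k in range(1, L + 1)])  # lags 1..L
    return y, M

rng = np.random.default_rng(3)
v = np.cumsum(rng.normal(size=200))            # synthetic stand-in for log trading volume
y, M = make_lagged(v, L=5)
coef, *_ = np.linalg.lstsq(M, y, rcond=None)   # OLS fit of the AR(L) model
y_hat = M @ coef                               # one-step-ahead predictions
print(M.shape, coef.shape)                     # (195, 6) (6,)
```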
[Figure: single hidden layer neural network with inputs X1–X4, hidden units A1–A5 and output f(X) → Y, used to illustrate the fitting problem below]
$$\underset{\{w_k\}_{1}^{K},\, \beta}{\text{minimize}} \;\; \frac{1}{2} \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2,
\qquad \text{where } f(x_i) = \beta_0 + \sum_{k=1}^{K} \beta_k\, g\!\left(w_{k0} + \sum_{j=1}^{p} w_{kj} x_{ij}\right)$$
[Figure: gradient descent on a one-dimensional objective R(θ), showing the objective value decreasing from R(θ0) through R(θ1) and R(θ2) to R(θ7) as θ is updated from θ0 to θ7]
• For ease of notation, let $z_{ik} = w_{k0} + \sum_{j=1}^{p} w_{kj} x_{ij}$.
• Backpropagation uses the chain rule for differentiation:
$$\frac{\partial R_i(\theta)}{\partial \beta_k} = \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial \beta_k}$$
Gradient descent update:
$$\theta^{m+1} \longleftarrow \theta^{m} - \rho\, \nabla_\theta R(\theta^{m})$$
Stochastic gradient update (based on a single component $R_k$):
$$\theta^{m+1} \longleftarrow \theta^{m} - \rho\, \nabla_\theta R_k(\theta^{m})$$
Example: consider the one-dimensional least squares problem
$$\min_{\theta}\; R(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left(a_i \theta - b_i\right)^2,$$
whose closed-form minimizer is
$$\theta^{*} = \frac{\sum_{i=1}^{n} a_i b_i}{\sum_{i=1}^{n} a_i^{2}}.$$
The individual components $\left(a_1\theta - b_1\right)^2, \left(a_2\theta - b_2\right)^2, \ldots, \left(a_n\theta - b_n\right)^2$ are each minimized at $\theta = b_i / a_i$, and these minimizers span the interval
$$Q = \left[\min_i \frac{b_i}{a_i},\; \max_i \frac{b_i}{a_i}\right]$$
• $Q$: region of confusion
• The full-gradient minimizer $\theta^{*}$ lies somewhere in $Q$
• Outside $Q$, the signs of $\nabla_\theta R(\theta)$ and $\nabla_\theta R_i(\theta)$ are the same
• This means that outside $Q$, a step along $\nabla_\theta R_i(\theta)$ moves in the right direction
• As a result, SGD makes quick improvement in the initial steps
• Inside $Q$ this property breaks down: as you get closer to the optimum, the fluctuation increases
Stochastic Gradient Descent
The stochastic gradient $g(\theta)$ is an unbiased estimate of the full gradient:
$$E\left[g(\theta)\right] = \nabla_\theta R(\theta)$$
$$\min_\theta R(\theta)$$
At each iteration:
• Option 1: Pick index $i$ with replacement
• Option 2: Pick index $i$ without replacement
• Use $g(\theta) = \nabla_\theta R_i(\theta)$ as the stochastic gradient (SG)
• Update $\theta^{m+1} \longleftarrow \theta^{m} - \rho\, g(\theta^{m})$ (a worked sketch follows below)
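To make the procedure concrete, here is a hedged NumPy sketch of SGD on the one-dimensional least squares problem above, compared with the closed-form θ*; the data, learning rate and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
a = rng.normal(size=n)
b = 2.5 * a + rng.normal(scale=0.5, size=n)   # so theta* is near 2.5

theta_star = np.sum(a * b) / np.sum(a ** 2)   # closed-form minimizer

theta, rho = 0.0, 0.05                        # initial value and learning rate
for m in range(2000):
    i = rng.integers(n)                       # Option 1: pick index i with replacement
    grad_i = a[i] * (a[i] * theta - b[i])     # gradient of R_i(theta) = (1/2)(a_i*theta - b_i)^2
    theta -= rho * grad_i                     # SGD update

print(theta_star, theta)                      # SGD ends up close to theta*, fluctuating inside Q
```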
[Figure: loss contours in the (𝑊1, 𝑊2) plane]
Since the curvature in the 𝑊2 direction is more pronounced, the gradient component in the 𝑊2 direction is much larger, causing oscillation.
Momentum Optimizer
[Figure: loss contours in the (𝜃1, 𝜃2) plane]
Since the curvature in the 𝜃2 direction is more pronounced, the gradient component in the 𝜃2 direction is much larger, causing oscillation.
Momentum Optimizer
[Figure: loss contours in the (𝜃1, 𝜃2) plane]
$$\theta^{m+1} \longleftarrow \theta^{m} + \gamma V^{m} - \rho\, \nabla R(\theta^{m})$$
where $V$ is the momentum component.
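A brief NumPy sketch of the momentum update above, applied to a simple elongated quadratic where plain gradient descent would oscillate; the objective, γ and ρ are illustrative choices.

```python
import numpy as np

# R(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2): curvature is much larger in the theta_2 direction
def grad_R(theta):
    return np.array([theta[0], 25.0 * theta[1]])

theta = np.array([5.0, 1.0])
V = np.zeros(2)                   # momentum component
gamma, rho = 0.9, 0.03
for m in range(100):
    step = gamma * V - rho * grad_R(theta)    # gamma*V^m - rho*grad R(theta^m)
    theta = theta + step                      # theta^{m+1} <- theta^m + gamma*V^m - rho*grad R(theta^m)
    V = step                                  # carry the step forward as the next momentum term
print(theta)                                  # approaches the minimum at (0, 0)
```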
Issues?
Now let
$$w^{t} = \sum_{\tau=1}^{t} g^{\tau} \circ g^{\tau}$$
(component-wise multiplication, accumulated up to iteration $t$).
Update (element-wise):
$$\theta^{t+1} \longleftarrow \theta^{t} - \frac{\rho}{\epsilon I + \sqrt{w^{t}}}\, g^{t}$$
where $I$ is a vector of 1s.
Adagrad
• For a $d$-dimensional problem:
$$\theta^{t+1} \longleftarrow \theta^{t} - \frac{\rho}{\epsilon I + \sqrt{w^{t}}}\, g^{t}$$
$$\begin{pmatrix} \theta_1^{t+1} \\ \vdots \\ \theta_d^{t+1} \end{pmatrix}
= \begin{pmatrix} \theta_1^{t} \\ \vdots \\ \theta_d^{t} \end{pmatrix}
- \begin{pmatrix} \dfrac{\rho}{\epsilon + \sqrt{w_1^{t}}}\, g_1^{t} \\ \vdots \\ \dfrac{\rho}{\epsilon + \sqrt{w_d^{t}}}\, g_d^{t} \end{pmatrix}$$
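A rough NumPy sketch of the Adagrad update on the same illustrative quadratic used earlier; ρ and ε are placeholder values.

```python
import numpy as np

def grad_R(theta):
    # gradient of R(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2)
    return np.array([theta[0], 25.0 * theta[1]])

theta = np.array([5.0, 1.0])
w = np.zeros(2)                   # accumulated squared gradients, one entry per dimension
rho, eps = 0.5, 1e-8
for t in range(500):
    g = grad_R(theta)
    w += g * g                                # w^t = sum of g^tau o g^tau (component-wise)
    theta -= rho / (eps + np.sqrt(w)) * g     # per-dimension scaled step
print(theta)                      # each coordinate is scaled by its own gradient history
```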
Adagrad
Advantages:
• Adaptively scales the learning rate for different dimensions
by normalizing w.r.t the gradient magnitude in the
corresponding dimension
• Eliminates the need to set the learning rate manually
• Converges rapidly when applied to convex functions
Disadvantages:
• For non-convex functions, it may pass through many complex terrains and end up in a local optimum
• With a large number of iterations, the accumulated scale factor grows and the learning rate becomes very small
• In such cases, the model may stop learning
RMSProp
• Replace the running sum by an exponentially weighted moving average of the squared gradients:
$$w^{t} = \beta w^{t-1} + (1 - \beta)\, g^{t} \circ g^{t}$$
• Bias correction (with $s^{t}$ the analogous moving average of the gradients themselves):
$$\hat{s}^{t} = \frac{s^{t}}{1 - \beta_1^{t}}; \qquad \hat{w}^{t} = \frac{w^{t}}{1 - \beta_2^{t}}$$
• Update:
$$\theta^{t+1} \longleftarrow \theta^{t} - \rho\, \frac{\hat{s}^{t}}{\epsilon I + \sqrt{\hat{w}^{t}}}$$
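A hedged NumPy sketch of the update above; as written (with a first-moment average s^t and bias correction by β1 and β2) it matches the Adam-style variant, and all constants here are illustrative.

```python
import numpy as np

def grad_R(theta):
    # gradient of R(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2)
    return np.array([theta[0], 25.0 * theta[1]])

theta = np.array([5.0, 1.0])
s = np.zeros(2)                      # moving average of gradients (first moment)
w = np.zeros(2)                      # moving average of squared gradients (second moment)
beta1, beta2, rho, eps = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 501):
    g = grad_R(theta)
    s = beta1 * s + (1 - beta1) * g          # first-moment average
    w = beta2 * w + (1 - beta2) * g * g      # w^t = beta*w^{t-1} + (1-beta) g o g
    s_hat = s / (1 - beta1 ** t)             # bias correction
    w_hat = w / (1 - beta2 ** t)
    theta = theta - rho * s_hat / (eps + np.sqrt(w_hat))
print(theta)                                  # moves toward the minimum at (0, 0)
```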
Tuning Parameters for the Model
[Figure: training error and test error versus degrees of freedom (2 to 50)]
[Figure: fitted functions with 42 degrees of freedom (left) and 80 degrees of freedom (right)]