Chap3slides

The document discusses the necessity and complexity of backpropagation in neural networks, emphasizing the computation of partial derivatives in computational graphs. It explains the challenges of differentiating complex composition functions and introduces dynamic programming as a solution to reduce the computational complexity from exponential to polynomial time. The text also covers various approaches to compute derivatives, including pre-activation and post-activation variables, and addresses shared weights in neural networks.

Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Backpropagation I: Computing Derivatives in Computational Graphs [without Backpropagation] in Exponential Time

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.2
Why Do We Need Backpropagation?

• To perform any kind of learning, we need to compute the partial derivative of the loss function with respect to each intermediate weight.

– Simple with single-layer architectures like the perceptron.

– Not a simple matter with multi-layer architectures.


The Complexity of Computational Graphs

• A computational graph is a directed acyclic graph in which each node computes a function of its incoming node variables.

• A neural network is a special case of a computational graph.

– Each node computes a combination of a linear vector multiplication and a (possibly nonlinear) activation function.

• The output is a very complicated composition function of each intermediate weight in the network.

– The complex composition function might be hard to express neatly in closed form.

∗ Difficult to differentiate!
Recursive Nesting is Ugly!

• Consider a computational graph containing two nodes in a path and input w.

• The first node computes y = g(w) and the second node computes the output o = f(y).

– Overall composition function is f(g(w)).

– Setting f(·) and g(·) to the sigmoid function results in the following:

$$f(g(w)) = \frac{1}{1 + \exp\left(-\frac{1}{1 + \exp(-w)}\right)} \qquad (1)$$

– Increasing path length increases recursive nesting.


Backpropagation along Single Path (Univariate Chain Rule)

[Figure: two-node path; the input weight w enters g(w) = w^2, whose output y feeds f(y) = cos(y), producing the output O = f(g(w)) = cos(w^2)]

• Consider a two-node path with f(g(w)) = cos(w^2).

• In the univariate chain rule, we compute the product of local derivatives.

$$\frac{\partial f(g(w))}{\partial w} = \underbrace{\frac{\partial f(y)}{\partial y}}_{-\sin(y)} \cdot \underbrace{\frac{\partial g(w)}{\partial w}}_{2w} = -2w \cdot \sin(y) = -2w \cdot \sin(w^2)$$

• Local derivatives are easy to compute because they care only about their own input and output. (A numerical check follows.)
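
As a minimal numerical check of this slide's example (plain Python, nothing beyond the standard library), one can compare the product of local derivatives with a finite-difference estimate:

    import math

    w = 1.3                      # input weight
    y = w ** 2                   # forward: g(w) = w^2
    o = math.cos(y)              # forward: f(y) = cos(y)

    dg_dw = 2 * w                # local derivative of g at w
    df_dy = -math.sin(y)         # local derivative of f at y
    do_dw = df_dy * dg_dw        # univariate chain rule

    eps = 1e-6                   # finite-difference comparison
    numeric = (math.cos((w + eps) ** 2) - o) / eps
    print(do_dw, numeric)        # the two values agree closely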
Backpropagation along Multiple Paths (Multivariate Chain Rule)

• Neural networks contain multiple nodes in each layer.

• Consider the function f(g_1(w), ..., g_k(w)), in which a unit computing the multivariate function f(·) gets its inputs from k units computing g_1(w), ..., g_k(w).

• The multivariable chain rule needs to be used:

$$\frac{\partial f(g_1(w), \ldots, g_k(w))}{\partial w} = \sum_{i=1}^{k} \frac{\partial f(g_1(w), \ldots, g_k(w))}{\partial g_i(w)} \cdot \frac{\partial g_i(w)}{\partial w} \qquad (2)$$
Example of Multivariable Chain Rule

[Figure: the input weight w feeds f(w) = w^2; its output goes to both g(y) = cos(y) and h(z) = sin(z), which feed K(p, q) = p + q; the output is O = [cos(w^2)] + [sin(w^2)]]

$$\frac{\partial o}{\partial w} = \underbrace{\frac{\partial K(p, q)}{\partial p}}_{1} \cdot \underbrace{g'(y)}_{-\sin(y)} \cdot \underbrace{f'(w)}_{2w} + \underbrace{\frac{\partial K(p, q)}{\partial q}}_{1} \cdot \underbrace{h'(z)}_{\cos(z)} \cdot \underbrace{f'(w)}_{2w}$$

$$= -2w \cdot \sin(y) + 2w \cdot \cos(z) = -2w \cdot \sin(w^2) + 2w \cdot \cos(w^2)$$

• Product of local derivatives along all paths from w to o.
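
A small sketch (plain Python, only restating the slide's example) that sums the products of local derivatives over both paths:

    import math

    w = 0.7
    y = z = w ** 2                       # both branches receive f(w) = w^2

    # path through g(y) = cos(y): (dK/dp) * g'(y) * f'(w)
    path_g = 1 * (-math.sin(y)) * (2 * w)
    # path through h(z) = sin(z): (dK/dq) * h'(z) * f'(w)
    path_h = 1 * math.cos(z) * (2 * w)

    do_dw = path_g + path_h              # multivariable chain rule: sum over paths
    print(do_dw, -2*w*math.sin(w**2) + 2*w*math.cos(w**2))   # identical values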


Pathwise Aggregation Lemma

• Let a non-null set P of paths exist from a variable w in the computational graph to the output o.

– The local gradient of the node with variable y(j) with respect to the variable y(i) for directed edge (i, j) is z(i, j) = ∂y(j)/∂y(i).

• The value of ∂o/∂w is given by computing the product of the local gradients along each path in P, and summing these products over all paths:

$$\frac{\partial o}{\partial w} = \sum_{P \in \mathcal{P}} \ \prod_{(i,j) \in P} z(i, j) \qquad (3)$$

• Observation: Each z(i, j) is easy to compute.


An Exponential Time Algorithm for Computing Partial Derivatives

• The path aggregation lemma provides a simple way to compute the derivative with respect to an intermediate variable w (a brute-force sketch follows this list):

– Use the computational graph to compute each value y(i) of node i in a forward phase.

– Compute the local derivative z(i, j) = ∂y(j)/∂y(i) on each edge (i, j) in the network.

– Identify the set P of all paths from the node with variable w to the output o.

– For each path P ∈ P, compute the product M(P) = ∏_{(i,j)∈P} z(i, j) of the local derivatives on that path.

– Add up these values over all paths P ∈ P.
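
A minimal sketch of this procedure (hypothetical helper names, not from the book); the recursion branches at every outgoing edge, so the running time grows with the number of paths:

    def pathwise_derivative(succ, z, src, sink):
        """succ[i]: successor nodes of i; z[(i, j)]: local derivative on (i, j)."""
        if src == sink:
            return 1.0          # the empty path contributes a product of 1
        # sum over all paths: one recursive branch per outgoing edge
        return sum(z[(src, j)] * pathwise_derivative(succ, z, j, sink)
                   for j in succ[src])

    # Example: w -> a -> o and w -> b -> o, all local derivatives equal to 2
    succ = {'w': ['a', 'b'], 'a': ['o'], 'b': ['o'], 'o': []}
    z = {('w', 'a'): 2.0, ('w', 'b'): 2.0, ('a', 'o'): 2.0, ('b', 'o'): 2.0}
    print(pathwise_derivative(succ, z, 'w', 'o'))   # 2*2 + 2*2 = 8.0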


Example: Deep Computational Graph with Product Nodes

[Figure: deep computational graph with input weight w, five hidden layers of two nodes each, h(1,1) ... h(5,2), and output O; the node values in layers 1 through 5 are w, w^2, w^4, w^8, w^16, and O = w^32. Each node computes the product of its inputs.]

• Each node computes the product of its inputs ⇒ The partial derivative of xy with respect to one input x is the other input y.

• Computing the product of partial derivatives along a path is equivalent to computing the product of values along the only other node-disjoint path.

• Aggregating products of partial derivatives (only in this case) equals aggregating products of values.
Example of Increasing Complexity with Depth

[Figure: the same product-node graph as above; node values are w, w^2, w^4, w^8, w^16, and O = w^32]

$$\frac{\partial O}{\partial w} = \sum_{j_1, j_2, j_3, j_4, j_5 \in \{1,2\}^5} \underbrace{h(1, j_1)}_{w} \, \underbrace{h(2, j_2)}_{w^2} \, \underbrace{h(3, j_3)}_{w^4} \, \underbrace{h(4, j_4)}_{w^8} \, \underbrace{h(5, j_5)}_{w^{16}} = \underbrace{\sum}_{\text{All 32 paths}} w^{31} = 32 w^{31}$$

• Impractical with increasing depth.
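
A quick check of the 32-path computation (a sketch only; the values mirror the figure):

    import itertools

    w = 1.05
    layer_value = [w, w**2, w**4, w**8, w**16]     # node value in layers 1..5

    total = 0.0
    for path in itertools.product([1, 2], repeat=5):   # 2^5 = 32 paths
        prod = 1.0
        for k in range(5):
            prod *= layer_value[k]   # value on the node-disjoint "other" path
        total += prod

    print(total, 32 * w**31)         # both evaluate to 32 w^31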


Observations on Exponential Time Algorithm

• Not a very practical approach ⇒ A million paths for a network with 100 nodes in each layer and three layers.

• This is the approach of traditional machine learning with complex objective functions in closed form.

– For a composition function in closed form, manual differentiation explicitly traverses all paths with the chain rule.

– The algebraic expression of the derivative of a complex function might not fit the paper you write on.

– Explains why most of traditional machine learning amounts to a shallow neural model.

• The beautiful dynamic programming idea of backpropagation rescues us from this complexity.
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Backpropagation II: Using Dynamic Programming [Backpropagation] to Compute Derivatives in Polynomial Time

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.2
Differentiating Composition Functions

• Neural networks compute composition functions with a lot of repetitiveness caused by a node appearing in multiple paths.

• The most natural and intuitive way to differentiate such a composition function is not the most efficient way to do it.

• Natural approach: Top down

$$f(w) = \sin(w^2) + \cos(w^2)$$

• We should not have to differentiate w^2 twice!

• Dynamic programming collapses repetitive computations to reduce exponential complexity to polynomial complexity!
[Figure: computational graph with input weight w, two rows of nodes 1-10, a node 11, and the output O; each node i contains y(i), and each edge between i and j contains z(i, j). Example: z(4, 6) is the partial derivative of y(6) with respect to y(4).]

• We want to compute the derivative of the output with respect to variable w.

• We can easily compute z(i, j) = ∂y(j)/∂y(i).

• The naive approach computes S(w, o) = ∂o/∂w = Σ_{P∈P} ∏_{(i,j)∈P} z(i, j) by explicit aggregation over all paths in P.
Dynamic Programming and Directed Acyclic Graphs

• Dynamic programming is used extensively in directed acyclic graphs.

– Typical: Exponentially aggregative path-centric functions between source-sink pairs.

– Example: Polynomial solution to the longest path problem in directed acyclic graphs (NP-hard in general).

– General approach: Start at either the source or the sink, and recursively compute the relevant function over paths of increasing length by reusing intermediate computations.

• Our path-centric function: S(w, o) = Σ_{P∈P} ∏_{(i,j)∈P} z(i, j).

– The backwards direction makes more sense here because we have to compute the derivative of the output (sink) with respect to all variables in the early layers.
Dynamic Programming Update

• Let A(i) be the set of nodes at the ends of outgoing edges from node i.

• Let S(i, o) be the intermediate variable indicating the same path-aggregative function from i to o.

$$S(i, o) \Leftarrow \sum_{j \in A(i)} S(j, o) \cdot z(i, j) \qquad (4)$$

• Initialize S(o, o) to 1 and compute backwards to reach S(w, o).

– Intermediate computations like S(i, o) are also useful for computing derivatives in other layers.

• Do you recognize the multivariate chain rule in Equation 4?

$$\frac{\partial o}{\partial y(i)} = \sum_{j \in A(i)} \frac{\partial o}{\partial y(j)} \cdot \frac{\partial y(j)}{\partial y(i)}$$
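
A minimal sketch of Equation 4 (hypothetical helper names; compare with the exponential path-enumeration sketch earlier: storing S(i, o) in a single backward sweep is what makes this polynomial):

    def backprop_derivatives(succ, z, o, nodes_reverse_topological):
        """succ[i]: A(i); z[(i, j)]: local derivative on edge (i, j)."""
        S = {o: 1.0}                              # initialize S(o, o) = 1
        for i in nodes_reverse_topological:       # sweep backwards from o
            if i != o:
                # Equation 4: S(i, o) = sum over j in A(i) of S(j, o) * z(i, j)
                S[i] = sum(S[j] * z[(i, j)] for j in succ[i])
        return S                                  # each edge is used exactly once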
How Does it Apply to Neural Networks?

[Figure: a neuron computing h = Φ(W · X) is broken up into a pre-activation value a_h = W · X followed by a post-activation value h = Φ(a_h)]

• A neural network is a special case of a computational graph.

– We can define the computational graph in multiple ways.

– Pre-activation variables or post-activation variables or both as the node variables of the computation graph?

– The three lead to different updates, but the end result is equivalent.
Pre-Activation Variables to Create Computational Graph

• Compute the derivative δ(i, o) of the loss L at o with respect to the pre-activation variable at node i.

• We always compute loss derivatives δ(i, o) with respect to activations in nodes during dynamic programming, rather than weights.

– The loss derivative with respect to the weight wij from node i to node j is given by the product of δ(j, o) and the hidden variable at i (why?)

• Key points: z(i, j) = wij · Φ'_i; initialize S(o, o) = δ(o, o) = (∂L/∂o) · Φ'_o.

$$\delta(i, o) = S(i, o) = \Phi'_i \sum_{j \in A(i)} w_{ij} \, S(j, o) = \Phi'_i \sum_{j \in A(i)} w_{ij} \, \delta(j, o) \qquad (5)$$
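
A compact sketch of Equation 5 in layer-wise form (hypothetical helper; NumPy assumed; weights[k] connects layer k to layer k+1 with shape (n_k, n_{k+1}), h[k] holds post-activation values, and a[k] pre-activation values):

    import numpy as np

    def backprop_preactivation(weights, h, a, dL_do, phi_prime):
        L = len(weights)
        delta = [None] * (L + 1)
        delta[L] = dL_do * phi_prime(a[L])        # δ(o, o) = (∂L/∂o) Φ'_o
        for k in range(L - 1, 0, -1):
            # Equation 5: δ(i, o) = Φ'_i * Σ_j w_ij δ(j, o)
            delta[k] = phi_prime(a[k]) * (weights[k] @ delta[k + 1])
        # loss derivative w.r.t. w_ij: δ(j, o) times the hidden value at i
        return [np.outer(h[k], delta[k + 1]) for k in range(L)]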
Post-Activation Variables to Create Computation Graph

• The variables in the computation graph are the hidden values after activation function application.

• Compute the derivative Δ(i, o) of the loss L at o with respect to the post-activation variable at node i.

• Key points: z(i, j) = wij · Φ'_j; initialize S(o, o) = Δ(o, o) = ∂L/∂o.

$$\Delta(i, o) = S(i, o) = \sum_{j \in A(i)} w_{ij} \, S(j, o) \, \Phi'_j = \sum_{j \in A(i)} w_{ij} \, \Delta(j, o) \, \Phi'_j \qquad (6)$$

– Compare with the pre-activation approach δ(i, o) = Φ'_i Σ_{j∈A(i)} wij δ(j, o).

– The pre-activation approach is more common in textbooks.


Variables for Both Pre-Activation and Post-Activation Values

• A nice way of decoupling the linear multiplication and activation operations.

• Simplified approach in which each layer is treated as a single node with a vector variable.

– The update can be computed with vector and matrix multiplications.

• Topic of discussion in the next part of the backpropagation series.


Losses at Arbitrary Nodes

• We assume that the loss is incurred at a single output node.

• In the case of multiple output nodes, one only has to add up the contributions of the different outputs in the backwards phase.

• In some cases, penalties may be applied to hidden nodes.

• For a hidden node i, we add an "initialization value" to S(i, o) just after it has been computed during dynamic programming, which is based on its penalty.

– Similar treatment as the initialization of an output node, except that we add the contribution to the existing value of S(i, o).
Handling Shared Weights

• You saw an example in autoencoders where encoder and decoder weights are shared.

• Also happens in specialized architectures like recurrent or convolutional neural networks.

• Can be addressed with a simple application of the chain rule.

• Let w_1 ... w_r be r copies of the same weight w in the neural network.

$$\frac{\partial L}{\partial w} = \sum_{i=1}^{r} \frac{\partial L}{\partial w_i} \cdot \frac{\partial w_i}{\partial w} = \sum_{i=1}^{r} \frac{\partial L}{\partial w_i} \qquad (7)$$

• Pretend all weights are different and just add!
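
A one-line illustration of Equation 7 (sketch only; NumPy assumed, with hypothetical per-copy gradients):

    import numpy as np

    # grads[i] holds dL/dw_i for each of the r copies of a shared weight w
    grads = np.array([0.3, -0.1, 0.25])
    dL_dw = grads.sum()              # Equation 7: just add the copies up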


Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Backpropagation III: A Decoupled View of Vector-Centric Backpropagation

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.2
Multiple Computational Graphs from Same Neural Network

• We can create a computational graph in multiple ways from the variables in a neural network.

– Computational graph of pre-activation variables (part II of the lecture)

– Computational graph of post-activation variables (part II of the lecture)

– Computational graph of both (this part of the lecture)

• Using both pre-activation and post-activation variables creates decoupled backpropagation updates for the linear layer and for the activation function.
Scalar Versus Vector Computational Graphs

• The backpropagation discussion so far uses scalar operations.

• Neural networks are constructed in layer-wise fashion.

• We can treat an entire layer as a node with a vector variable.

• We want to use layer-wise operations on vectors.

– Most real implementations use vector and matrix multiplications.

• Want to decouple the operations of linear matrix multiplication and activation function into separate "layers."
Vector-Centric and Decoupled View of Single Layer

[Figure: decoupled layers i−1 through i+3 alternate linear transforms and activation functions before some loss; forward direction: multiply with W^T, then apply Φ elementwise; backward direction: multiply with W, then multiply with Φ' elementwise]

• Note that linear matrix multiplication and the activation function are separate layers.

• Method 1 (requires knowledge of matrix calculus): You can use the vector-to-vector chain rule to backpropagate on a single path!
Converting Scalar Updates to Vector Form

• Recap: When the partial derivative of node q with respect to node p is z(p, q), the dynamic programming update is:

$$S(p, o) = \sum_{q \in \text{Next Layer}} S(q, o) \cdot z(p, q) \qquad (8)$$

• We can write the above update in vector form by creating a single column vector g_i for layer i ⇒ it contains S(p, o) for all values of p.

$$\overline{g}_i = Z \overline{g}_{i+1} \qquad (9)$$

• The matrix Z = [z(p, q)] is the transpose of the Jacobian!

– We will use the notation J = Z^T in further slides.


The Jacobian

• Consider layer i and layer (i + 1) with activations z_i and z_{i+1}.

– The kth activation in layer (i + 1) is obtained by applying an arbitrary function f_k(·) to the vector of activations in layer i.

• Definition of the Jacobian matrix entries:

$$J_{kr} = \frac{\partial f_k(\overline{z}_i)}{\partial z_i^{(r)}} \qquad (10)$$

• Backpropagation update:

$$\overline{g}_i = J^T \overline{g}_{i+1} \qquad (11)$$
Effect on Linear Layer and Activation Functions

[Figure: the same decoupled-layer diagram as before; forward: multiply with W^T, apply Φ elementwise; backward: multiply with W, multiply with Φ' elementwise]

• Backpropagation is multiplication with the transposed weight matrix for a linear layer.

• Elementwise multiplication with the derivative for an activation layer.
Table of Forward Propagation and Backward Propagation
Function         Forward                          Backward
Linear           z_{i+1} = W^T z_i                g_i = W g_{i+1}
Sigmoid          z_{i+1} = sigmoid(z_i)           g_i = g_{i+1} ⊙ z_{i+1} ⊙ (1 − z_{i+1})
Tanh             z_{i+1} = tanh(z_i)              g_i = g_{i+1} ⊙ (1 − z_{i+1} ⊙ z_{i+1})
ReLU             z_{i+1} = z_i ⊙ I(z_i > 0)       g_i = g_{i+1} ⊙ I(z_i > 0)
Hard Tanh        Set to ±1 (∉ [−1, +1])           Set to 0 (∉ [−1, +1])
                 Copy (∈ [−1, +1])                Copy (∈ [−1, +1])
Max              Maximum of inputs                Set to 0 (non-maximal inputs)
                                                  Copy (maximal input)
Arbitrary        z_{i+1}^{(k)} = f_k(z_i)         g_i = J^T g_{i+1}
function f_k(·)                                   J is the Jacobian (Equation 10)

• Two types of Jacobians: Linear layers are dense and activation layers are sparse.

• The maximization function is used in max-pooling.
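
A sketch of two decoupled layer objects from the table (NumPy assumed; only the linear and ReLU rows are shown):

    import numpy as np

    class Linear:
        """Decoupled linear layer: forward multiplies by W^T, backward by W."""
        def __init__(self, W):            # W has shape (inputs, outputs)
            self.W = W
        def forward(self, z):
            return self.W.T @ z
        def backward(self, g):
            return self.W @ g

    class ReLU:
        """Decoupled activation layer: backward multiplies elementwise."""
        def forward(self, z):
            self.mask = (z > 0).astype(z.dtype)
            return z * self.mask          # z_{i+1} = z_i * I(z_i > 0)
        def backward(self, g):
            return g * self.mask          # g_i = g_{i+1} * I(z_i > 0)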


Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Neural Network Training [Initialization, Preprocessing, Mini-Batching, Tuning, and Other Black Art]

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.3
How to Check Correctness of Backpropagation

• Consider a particular weight w of a randomly selected edge in the network.

• Let L(w) be the current value of the loss.

• The weight of this edge is perturbed by adding a small amount ε > 0 to it.

• Estimate of the derivative:

$$\frac{\partial L(w)}{\partial w} \approx \frac{L(w + \epsilon) - L(w)}{\epsilon} \qquad (12)$$

• When the partial derivatives do not match closely enough, it might be indicative of an incorrectness in the implementation.
What Does “Closely Enough” Mean?

• The algorithm-determined derivative is G_e and the approximate derivative is G_a.

$$\rho = \frac{|G_e - G_a|}{|G_e + G_a|} \qquad (13)$$

• The ratio should be less than 10^{−6}.

• If ReLU is used, the ratio should be less than 10^{−3}.

• One should perform the checks for a sample of the weights a few times during training.
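
Both equations in a few lines (sketch; loss is assumed to be a function of the single perturbed weight, with all other weights held fixed):

    def gradient_check(loss, w, analytic_grad, eps=1e-6):
        """Compare the backpropagated derivative with Equation 12's estimate."""
        numeric_grad = (loss(w + eps) - loss(w)) / eps
        # Equation 13: relative mismatch ratio
        rho = abs(analytic_grad - numeric_grad) / abs(analytic_grad + numeric_grad)
        return rho          # flag the implementation if rho exceeds 1e-6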
Stochastic Gradient Descent

• We have always worked with point-wise loss functions so far.

– Corresponds to stochastic gradient descent.

– In practice, stochastic gradient descent is only a randomized approximation of the true loss function.

• The true loss function is typically additive over points.

– Example: Sum-of-squared errors in regression.

– Computing the gradient over a single point is like a sampled gradient estimate.
Mini-batch Stochastic Gradient Descent

• One can improve the accuracy of gradient computation by using a batch of instances.

– Instead of holding a vector of activations, we hold a matrix of activations in each layer.

– Matrix-to-matrix multiplications are required for forward and backward propagation.

– Increases the memory requirements.

• Typical sizes are powers of 2, like 32, 64, 128, or 256.


Why Does Mini-Batching Work?

• At early learning stages, the weight vectors are very poor.

– Training data is highly redundant in terms of important patterns.

– Small batch sizes give the correct direction of the gradient.

• At later learning stages, the gradient direction becomes less accurate.

– But some amount of noise helps avoid overfitting anyway!

• Performance on out-of-sample data does not deteriorate!


Feature Normalization

• Standardization: Normalize to zero mean and unit variance.

• Whitening: Transform the data to a de-correlated axis system with principal component analysis (mean-centered SVD); a sketch of both operations follows this list.

– Truncate directions with extremely low variance.

– Standardize the other directions.

• Basic principle: Assume that the data is generated from a Gaussian distribution and give equal importance to all directions.
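
A sketch of standardization and PCA whitening (NumPy assumed; the variance threshold is a hypothetical choice):

    import numpy as np

    def standardize(X):
        """Zero-mean, unit-variance scaling of each feature (column) of X."""
        return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

    def whiten(X, var_threshold=1e-8):
        """PCA whitening via SVD of the mean-centered data matrix."""
        Xc = X - X.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        keep = (s ** 2) / len(X) > var_threshold   # drop low-variance directions
        # project onto the kept principal directions and standardize them
        return (Xc @ Vt[keep].T) / (s[keep] / np.sqrt(len(X)))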
Weight Initialization

• Initializations are surprisingly important.

– Poor initializations can lead to bad convergence behavior.

– Instability across different layers (vanishing and exploding gradients).

• More sophisticated initializations, such as pretraining, are covered in a later lecture.

• Even some simple rules in initialization can help with conditioning.
Symmetry Breaking

• It is a bad idea to initialize weights to the same value.

– Results in weights being updated in lockstep.

– Creates redundant features.

• Initializing weights to random values breaks symmetry.

• The average magnitude of the random variables is important for stability.
Sensitivity to Number of Inputs

• More inputs increase the output's sensitivity to the average weight.

– Additive effect of multiple inputs: the variance linearly increases with the number of inputs r.

– The standard deviation scales with the square root of the number of inputs r.

• Each weight is initialized from a Gaussian distribution with standard deviation √(1/r) (√(2/r) for ReLU).

• More sophisticated: Use a standard deviation of √(2/(r_in + r_out)).
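
Both initialization rules in one helper (a sketch; the function name and flags are hypothetical):

    import numpy as np

    def init_weights(r_in, r_out, relu=False, fan_avg=False, rng=np.random):
        """Gaussian initialization scaled by fan-in (or fan-in plus fan-out)."""
        if fan_avg:
            std = np.sqrt(2.0 / (r_in + r_out))    # the more sophisticated rule
        else:
            std = np.sqrt((2.0 if relu else 1.0) / r_in)
        return rng.normal(0.0, std, size=(r_in, r_out))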


Tuning Hyperparameters

• Hyperparameters are parameters like the number of layers, nodes per layer, learning rate, and regularization parameter.

• Use a separate validation set for tuning.

• Do not use the same data set for backpropagation training as for tuning.
Grid Search

• Perform a grid search over the parameter space.

– Select a set of values for each parameter in some "reasonable" range.

– Test over all combinations of values.

• Be careful about parameters at the borders of the selected range.

• Optimization: Search over a coarse grid first, and then drill down into the region of interest with finer grids.
How to Select Values for Each Parameter

• The natural approach is to select uniformly distributed values of the parameters.

– Not the best approach in many cases! ⇒ Log-uniform intervals.

– Search uniformly over a reasonable range of log-values and then exponentiate.

– Example: Uniformly sample the log-learning rate between −3 and −1, and then use the sampled value as the exponent of 10 (a sketch follows).
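
The log-uniform sampling in two lines (sketch; NumPy assumed):

    import numpy as np

    def sample_learning_rate(low_exp=-3, high_exp=-1, rng=np.random):
        """Uniform in the exponent, then exponentiate: rates in [1e-3, 1e-1]."""
        return 10.0 ** rng.uniform(low_exp, high_exp)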
Sampling versus Grid Search

• With a large number of parameters, grid search is still expensive.

• With 10 parameters, choosing just 3 values for each parameter leads to 3^10 = 59049 possibilities.

• A flexible choice is to sample over the grid space.

• Used more commonly in large-scale settings, with good results.
Large-Scale Settings

• Multiple threads are often run with sampled parameter settings.

• Accuracy is tracked on a separate out-of-sample validation set.

• Bad runs are detected and killed after a certain number of epochs.

• New runs may also be started after killing threads (if needed).

• Only a few winners are trained to completion, and their predictions are combined in an ensemble.
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Gradient Ratios, Vanishing and Exploding Gradient Problems

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.4
Effect of Varying Slopes in Gradient Descent

• Neural network learning is a multivariable optimization problem.

• Different weights have different magnitudes of partial derivatives.

• Widely varying magnitudes of partial derivatives affect the learning.

• Gradient descent works best when the different weights have derivatives of similar magnitude.

– The path of steepest descent in most loss functions is only an instantaneous direction of best movement, and is not the correct direction of descent in the longer term.
Example

[Figure: contour plots of two loss functions; (a) circular bowl L = x^2 + y^2, (b) elliptical bowl L = x^2 + 4y^2]

• Loss functions with varying sensitivity to different attributes.


Revisiting Feature Normalization

• In the previous lecture, we discussed feature normalization.

• When features have very different magnitudes, the gradient ratios of different weights are likely very different.

• Feature normalization helps even out gradient ratios to some extent.

– The exact behavior depends on the target variable and loss function.
The Vanishing and Exploding Gradient Problems

• An extreme manifestation of varying sensitivity occurs in deep networks.

• The weights/activation derivatives in different layers affect the backpropagated gradient in a multiplicative way.

– With increasing depth this effect is magnified.

– The partial derivatives can either increase or decrease with depth.
Example

[Figure: chain network x → h_1 → h_2 → ... → h_{m−1} → o with weights w_1 ... w_m and one node per layer]

• Neural network with one node per layer.

• Forward propagation multiplicatively depends on each weight and activation function evaluation.

• Backpropagated partial derivatives get multiplied by weights and activation function derivatives.

• Unless the values are exactly one, the partial derivatives will either continuously increase (explode) or decrease (vanish).

• Hard to initialize weights exactly right.


Activation Function Propensity to Vanishing Gradients

• The partial derivative of the sigmoid with output o is o(1 − o).

– Maximum value of 0.25 at o = 0.5.

– For 10 layers, the activation functions alone will multiply the gradient by less than 0.25^10 ≈ 10^{−6}.

• At extreme output values, the partial derivative is close to 0, which is called saturation.

• The tanh activation function, with partial derivative (1 − o^2), has a maximum value of 1 at o = 0, but saturation will still cause problems.
Exploding Gradients

• Initializing weights to very large values to compensate for the activation functions can cause exploding gradients.

• Exploding gradients can also occur when weights across different layers are shared (e.g., recurrent neural networks).

– The effect of a finite change in weight is extremely unpredictable across different layers.

– A small finite change changes the loss negligibly, but a slightly larger value might change the loss drastically.
Cliffs

[Figure: loss surface with a cliff; a gentle gradient before the cliff causes overshooting]

• Often occurs with the exploding gradient problem.


A Partial Fix to Vanishing Gradients

• The ReLU has linear activation for nonnegative values and otherwise sets outputs to 0.

• The ReLU has a partial derivative of 1 for nonnegative inputs.

• However, it can have a partial derivative of 0 in some cases and never get updated.

– The neuron is permanently dead!

Leaky ReLU

• For negative inputs, the leaky ReLU can still propagate some gradient backwards.

– At the reduced rate of α < 1 times the rate for nonnegative inputs:

$$\Phi(v) = \begin{cases} \alpha \cdot v & v \le 0 \\ v & \text{otherwise} \end{cases} \qquad (14)$$

• The value of α is a hyperparameter chosen by the user.

• The gains with the leaky ReLU are not guaranteed.
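
Equation 14 and its derivative in NumPy (a sketch):

    import numpy as np

    def leaky_relu(v, alpha=0.01):
        """Equation 14: alpha * v for v <= 0, v otherwise."""
        return np.where(v <= 0, alpha * v, v)

    def leaky_relu_grad(v, alpha=0.01):
        """Negative inputs still propagate a reduced gradient of alpha."""
        return np.where(v <= 0, alpha, 1.0)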


Maxout

• The activation used is max{W_1 · X, W_2 · X} with two coefficient vectors.

• One can view the maxout as a generalization of the ReLU.

– The ReLU is obtained by setting one of the coefficient vectors to 0.

– The leaky ReLU can also be simulated by setting the other coefficient vector to W_2 = αW_1.

• The main disadvantage is that it doubles the number of parameters.
Gradient Clipping for Exploding Gradients

• Try to make the different components of the partial derivatives more even.

– Value-based clipping: All partial derivatives outside ranges are set to the range boundaries.

– Norm-based clipping: The entire gradient vector is normalized by the L2-norm of the entire vector.

• One can achieve a better conditioning of the values, so that the updates from mini-batch to mini-batch are roughly similar.

• Prevents an anomalous gradient explosion during the course of training.
Other Comments on Vanishing and Exploding Gradients

• The methods discussed above are only partial fixes.

• Other fixes are discussed in later lectures:

– Stronger initializations with pretraining.

– Second-order learning methods that make use of second-order derivatives (or the curvature of the loss function).
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

First-Order Gradient Descent Methods

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.5
First-Order Descent

• First-order methods work with steepest-descent directions.

• Modifications to the basic form of steepest descent:

– Need to reduce step sizes as the algorithm progresses.

– Need a way of avoiding local optima.

– Need to address widely varying slopes with respect to different weight parameters.
Learning Rate Decay

• Initial learning rates should be high but reduce over time.

• The two most common decay functions are exponential decay and inverse decay.

• The learning rate αt can be expressed in terms of the initial decay rate α0 and epoch t as follows:

$$\alpha_t = \alpha_0 \exp(-k \cdot t) \qquad \text{[Exponential Decay]}$$

$$\alpha_t = \frac{\alpha_0}{1 + k \cdot t} \qquad \text{[Inverse Decay]}$$

The parameter k controls the rate of the decay.
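
Both schedules as direct transcriptions of the formulas above (sketch):

    import math

    def exponential_decay(alpha0, k, t):
        return alpha0 * math.exp(-k * t)     # [Exponential Decay]

    def inverse_decay(alpha0, k, t):
        return alpha0 / (1.0 + k * t)        # [Inverse Decay]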
Momentum Methods: Marble Rolling Down Hill

[Figure: loss versus the value of a neural network parameter; gradient descent slows down in a flat region and gets trapped in a local optimum]

• Use a friction parameter β ∈ (0, 1) to gain speed in the direction of movement.

$$\overline{V} \Leftarrow \beta \overline{V} - \alpha \frac{\partial L}{\partial \overline{W}}; \qquad \overline{W} \Leftarrow \overline{W} + \overline{V}$$
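
The update as a one-step helper (sketch; grad stands for ∂L/∂W evaluated at the current W):

    def momentum_step(W, V, grad, alpha=0.01, beta=0.9):
        """V <= beta*V - alpha*grad; W <= W + V."""
        V = beta * V - alpha * grad
        return W + V, V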
Avoiding Zig-Zagging with Momentum

[Figure: (a) relative directions of the gradient and momentum steps from the starting point; (b) without momentum, gradient descent zig-zags toward the optimum; (c) with momentum, the path to the optimum is smoothed]


Nesterov Momentum

• A modification of the traditional momentum method in which the gradients are computed at a point that would be reached after executing a β-discounted version of the previous step again.

• Compute the gradient at a point reached using only the momentum portion of the current update:

$$\overline{V} \Leftarrow \beta \overline{V} - \alpha \frac{\partial L(\overline{W} + \beta \overline{V})}{\partial \overline{W}}; \qquad \overline{W} \Leftarrow \overline{W} + \overline{V}$$
Momentum

• Put on the brakes as the marble nears the bottom of the hill.

• Nesterov momentum should always be used with mini-batch SGD (rather than SGD).
AdaGrad

• Aggregate the squared magnitude of the ith partial derivative in Ai.

• The square root of Ai is proportional to the root-mean-square slope.

– The absolute value will increase over time.

$$A_i \Leftarrow A_i + \left(\frac{\partial L}{\partial w_i}\right)^2 \quad \forall i \qquad (15)$$

• The update for the ith parameter wi is as follows:

$$w_i \Leftarrow w_i - \frac{\alpha}{\sqrt{A_i}} \left(\frac{\partial L}{\partial w_i}\right) \quad \forall i \qquad (16)$$

• Use A_i + ε in the denominator to avoid ill-conditioning.
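
Equations 15 and 16 in one helper (sketch; NumPy assumed):

    import numpy as np

    def adagrad_step(w, A, grad, alpha=0.01, eps=1e-8):
        """Accumulate squared gradients, then scale the step per parameter."""
        A = A + grad ** 2                          # Equation 15
        w = w - alpha * grad / np.sqrt(A + eps)    # Equation 16, with eps
        return w, A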
AdaGrad Intuition


• Scaling the derivative inversely with √Ai encourages faster relative movements along gently sloping directions.

– Absolute movements tend to slow down prematurely.

– Scaling parameters use stale values.

RMSProp

• The RMSProp algorithm uses exponential smoothing with parameter ρ ∈ (0, 1) in the relative estimation of the gradients.

– Absolute magnitudes of the scaling factors do not grow with time.

– The problem of staleness is ameliorated.

$$A_i \Leftarrow \rho A_i + (1 - \rho) \left(\frac{\partial L}{\partial w_i}\right)^2 \quad \forall i \qquad (17)$$

$$w_i \Leftarrow w_i - \frac{\alpha}{\sqrt{A_i}} \left(\frac{\partial L}{\partial w_i}\right) \quad \forall i$$

• Use A_i + ε to avoid ill-conditioning.
RMSProp with Nesterov Momentum

• It is possible to combine RMSProp with Nesterov momentum:

$$v_i \Leftarrow \beta v_i - \frac{\alpha}{\sqrt{A_i}} \left(\frac{\partial L(\overline{W} + \beta \overline{V})}{\partial w_i}\right); \quad w_i \Leftarrow w_i + v_i \quad \forall i$$

• Maintenance of Ai is done with shifted gradients as well:

$$A_i \Leftarrow \rho A_i + (1 - \rho) \left(\frac{\partial L(\overline{W} + \beta \overline{V})}{\partial w_i}\right)^2 \quad \forall i \qquad (18)$$
AdaDelta and Adam

• Both methods derive intuition from RMSProp.

– AdaDelta keeps track of an exponentially smoothed value of the incremental changes of the weights Δwi in previous iterations to decide the parameter-specific learning rate.

– Adam keeps track of exponentially smoothed gradients from previous iterations (in addition to normalizing like RMSProp).

• Adam is an extremely popular method.

Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Second-Order Gradient Descent Methods

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.5.5
Why Second-Order Methods?

[Figure: loss surface with a cliff; a gentle gradient before the cliff causes overshooting]

• First-order methods are not enough when there is curvature.


Revisiting the Bowl

[Figure: contour plots of (a) circular bowl L = x^2 + y^2 and (b) elliptical bowl L = x^2 + 4y^2]

• High-curvature directions cause bouncing in spite of a higher gradient ⇒ We need the second derivative for more information.
A Valley

[Figure: surface plot of a sloping valley f(x, y); the least-curvature direction is marked]

• Gently sloping directions are better with less curvature!


The Hessian

• The second-order derivatives of the loss function L(W) are of the following form:

$$H_{ij} = \frac{\partial^2 L(\overline{W})}{\partial w_i \, \partial w_j}$$

• The partial derivatives use all pairwise parameters in the denominator.

• For a neural network with d parameters, we have a d × d Hessian matrix H, for which the (i, j)th entry is Hij.
Quadratic Approximation of Loss Function

• One can write a quadratic approximation of the loss function with a Taylor expansion about W0:

$$L(\overline{W}) \approx L(\overline{W}_0) + (\overline{W} - \overline{W}_0)^T [\nabla L(\overline{W}_0)] + \frac{1}{2} (\overline{W} - \overline{W}_0)^T H (\overline{W} - \overline{W}_0) \qquad (19)$$

• One can derive a single-step optimality condition from the initial point W0 by setting the gradient to 0.
Newton’s Update

• Can solve the quadratic approximation in one step from the initial point W0.

$$\nabla L(\overline{W}) = 0 \quad \text{[Gradient of Loss Function]}$$

$$\nabla L(\overline{W}_0) + H(\overline{W} - \overline{W}_0) = 0 \quad \text{[Gradient of Taylor approximation]}$$

• Rearrange the optimality condition to obtain the Newton update:

$$\overline{W} \Leftarrow \overline{W}_0 - H^{-1}[\nabla L(\overline{W}_0)] \qquad (20)$$

• Note the ratio of first-order to second-order ⇒ Trade-off between speed and curvature.

• A step size is not needed!
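
Equation 20 in one line (sketch; solving the linear system is preferable to explicitly inverting H):

    import numpy as np

    def newton_step(W0, grad, H):
        """W <= W0 - H^{-1} grad, via a linear solve rather than inversion."""
        return W0 - np.linalg.solve(H, grad)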


Why Second-Order Methods?

• Pre-multiplying with the inverse Hessian finds a trade-off between the speed of descent and the curvature.
Basic Second-Order Algorithm and Approximations

• Keep making Newton updates until convergence (a single step is needed for a quadratic function).

– Even computing the Hessian is difficult!

– Inverting it is even more difficult.

• Solutions:

– Approximate the Hessian.

– Find an algorithm that works with the projection Hv for some direction v.
Conjugate Gradient Method

• Get to the optimum in d steps (instead of a single Newton step), where d is the number of parameters.

• Use optimal step sizes to get the best point along a direction.

• Thou shalt not worsen with respect to previous directions!

• Conjugate direction: The gradient of the loss function at any point on an update direction is always orthogonal to the previous update directions.

$$\overline{q}_{t+1} = -\nabla L(\overline{W}_{t+1}) + \left(\frac{\overline{q}_t^T H [\nabla L(\overline{W}_{t+1})]}{\overline{q}_t^T H \overline{q}_t}\right) \overline{q}_t \qquad (21)$$

• For a quadratic function, it requires d updates instead of the single update of the Newton method.
Conjugate Gradients on 2-Dimensional Quadratic

• Two conjugate directions are required to reach optimality


Conjugate Gradient Algorithm

• For quadratic functions only.

– Update W_{t+1} ⇐ W_t + α_t q_t. Here, the step size α_t is computed using line search.

– Set $$\overline{q}_{t+1} = -\nabla L(\overline{W}_{t+1}) + \left(\frac{\overline{q}_t^T H [\nabla L(\overline{W}_{t+1})]}{\overline{q}_t^T H \overline{q}_t}\right) \overline{q}_t.$$ Increment t by 1.

• For non-quadratic functions, approximate the loss function with a Taylor expansion and perform about d of the above steps. Then repeat.
Efficiently Computing Projection of Hessian

• The update requires computation of a projection of the Hessian rather than an inversion of the Hessian.

$$\overline{q}_{t+1} = -\nabla L(\overline{W}_{t+1}) + \left(\frac{\overline{q}_t^T H [\nabla L(\overline{W}_{t+1})]}{\overline{q}_t^T H \overline{q}_t}\right) \overline{q}_t \qquad (22)$$

• Easy to perform numerically!

$$H\overline{v} \approx \frac{\nabla L(\overline{W}_0 + \delta \overline{v}) - \nabla L(\overline{W}_0)}{\delta} \qquad (23)$$
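
Equation 23 as a helper (sketch; grad_fn is assumed to return ∇L at a given parameter vector):

    def hessian_vector_product(grad_fn, W0, v, delta=1e-5):
        """Finite-difference estimate of Hv from two gradient evaluations."""
        return (grad_fn(W0 + delta * v) - grad_fn(W0)) / delta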
Other Second-Order Methods

• Quasi-Newton methods: A sequence of increasingly accurate approximations of the inverse Hessian matrix is used in various steps.

• Many variations of this approach exist.

• A commonly used update is BFGS, which stands for the Broyden-Fletcher-Goldfarb-Shanno algorithm, and its limited-memory variant L-BFGS.
Problems with Second-Order Methods

[Figure: (a) f(x) = x^3, which has a degenerate stationary point at x = 0; (b) g(x, y) = x^2 − y^2, which has a saddle point at the origin]

• Degenerate stationary points and saddle points are problematic for second-order methods.

• Saddle points: Whether one is a maximum or a minimum depends on which direction we approach it from.
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Batch Normalization

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.6
Revisiting the Vanishing and Exploding Gradient Problems

[Figure: chain network x → h_1 → h_2 → ... → h_{m−1} → o with weights w_1 ... w_m and one node per layer]

• Neural network with one node per layer.

• Forward propagation multiplicatively depends on each weight and activation function evaluation.

• Backpropagated partial derivatives get multiplied by weights and activation function derivatives.

• Unless the values are exactly one, the partial derivatives will either continuously increase (explode) or decrease (vanish).

• Hard to initialize weights exactly right.


Revisiting the Bowl

[Figure: contour plots of (a) circular bowl L = x^2 + y^2 and (b) elliptical bowl L = x^2 + 4y^2]

• A varying scale of different parameters will cause bouncing.

• A varying scale of features causes a varying scale of parameters.


Input Shift

• One can view the input to each layer as a shifting data set of hidden activations during training.

• A shifting input causes problems during learning.

– Convergence becomes slower.

– The final result may not generalize well because of unstable inputs.

• Batch normalization ensures (somewhat) more stable inputs to each layer.
Solution: Batch Normalization

[Figure: (a) post-activation normalization adds a BN node after the combined Σ/Φ unit; (b) pre-activation normalization breaks the unit up and applies BN to the pre-activation value v_i, producing a_i before the activation]

• Add an additional layer that normalizes in batch-wise fashion.

• Additional learnable parameters ensure that the optimal level of nonlinearity is used.

• Pre-activation normalization is more common than post-activation normalization.
Batch Normalization Node

• The ith unit contains two parameters βi and γi that need to be learned.

• Normalize over a batch of m instances for the ith unit (a sketch of the forward computation follows):

$$\mu_i = \frac{\sum_{r=1}^{m} v_i^{(r)}}{m} \quad \forall i \qquad \text{[Batch Mean]}$$

$$\sigma_i^2 = \frac{\sum_{r=1}^{m} (v_i^{(r)} - \mu_i)^2}{m} + \epsilon \quad \forall i \qquad \text{[Batch Variance]}$$

$$\hat{v}_i^{(r)} = \frac{v_i^{(r)} - \mu_i}{\sigma_i} \quad \forall i, r \qquad \text{[Normalize Batch Instances]}$$

$$a_i^{(r)} = \gamma_i \cdot \hat{v}_i^{(r)} + \beta_i \quad \forall i, r \qquad \text{[Scale with Learnable Parameters]}$$

• Why do we need βi and γi?

– Most activations would otherwise be near zero (near-linear regime).
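
The four equations above in vectorized form (sketch; NumPy assumed; V is an (m, d) batch of pre-activation values and gamma, beta are length-d learnable vectors):

    import numpy as np

    def batchnorm_forward(V, gamma, beta, eps=1e-5):
        mu = V.mean(axis=0)                 # [Batch Mean]
        var = V.var(axis=0) + eps           # [Batch Variance]
        V_hat = (V - mu) / np.sqrt(var)     # [Normalize Batch Instances]
        return gamma * V_hat + beta         # [Scale with Learnable Parameters]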


Changes to Backpropagation

• We need to backpropagate through the newly added layer of normalization nodes.

– The BN node can be treated like any other node.

• We want to optimize the parameters βi and γi.

– The gradients with respect to these parameters are computed during backpropagation.

• Detailed derivations are in the book.

Issues in Inference

• The transformation parameters μi and σi depend on the batch.

• How should one compute them during testing, when a single test instance is available?

• The values of μi and σi are computed up front using the entire population (of training data), and then treated as constants during testing time.

– One can also maintain exponentially weighted averages during training.

• The normalization is a simple linear transformation during inference.
Batch Normalization as Regularizer

• Batch normalization also acts as a regularizer.

• The same data point can cause somewhat different updates depending on which batch it is included in.

• One can view this effect as a kind of noise added to the update process.

• Regularization can be shown to be equivalent to adding a small amount of noise to the training data.

• The regularization is relatively mild.