Chap3slides

The document discusses the necessity and complexity of backpropagation in neural networks, emphasizing the computation of partial derivatives in computational graphs. It explains the challenges of differentiating complex composition functions and introduces dynamic programming as a solution to reduce the computational complexity from exponential to polynomial time. The text also covers various approaches to compute derivatives, including pre-activation and post-activation variables, and addresses shared weights in neural networks.

Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Backpropagation I: Computing Derivatives in Computational Graphs [without Backpropagation] in Exponential Time

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.2
Why Do We Need Backpropagation?

• To perform any kind of learning, we need to compute the partial derivative of the loss function with respect to each intermediate weight.

– Simple with single-layer architectures like the perceptron.

– Not a simple matter with multi-layer architectures.


The Complexity of Computational Graphs

• A computational graph is a directed acyclic graph in which each node computes a function of its incoming node variables.

• A neural network is a special case of a computational graph.

– Each node computes a combination of a linear vector multiplication and a (possibly nonlinear) activation function.

• The output is a very complicated composition function of each intermediate weight in the network.

– The complex composition function might be hard to express neatly in closed form.

∗ Difficult to differentiate!
Recursive Nesting is Ugly!

• Consider a computational graph containing two nodes in a path and input w.

• The first node computes y = g(w) and the second node computes the output o = f(y).

– Overall composition function is f(g(w)).

– Setting f(·) and g(·) to the sigmoid function results in the following:

$$f(g(w)) = \frac{1}{1 + \exp\left(-\frac{1}{1 + \exp(-w)}\right)} \qquad (1)$$

– Increasing path length increases recursive nesting.


Backpropagation along Single Path (Univariate Chain Rule)

[Figure: two-node path; the input weight w enters g(w) = w^2, whose output y feeds f(y) = cos(y), producing the output O = f(g(w)) = cos(w^2)]

• Consider a two-node path with f(g(w)) = cos(w^2).

• In the univariate chain rule, we compute the product of local derivatives.

$$\frac{\partial f(g(w))}{\partial w} = \underbrace{\frac{\partial f(y)}{\partial y}}_{-\sin(y)} \cdot \underbrace{\frac{\partial g(w)}{\partial w}}_{2w} = -2w \cdot \sin(y) = -2w \cdot \sin(w^2)$$

• Local derivatives are easy to compute because they care only about their own input and output. (A numerical check follows.)
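
As a minimal numerical check of this slide's example (plain Python, nothing beyond the standard library), one can compare the product of local derivatives with a finite-difference estimate:

    import math

    w = 1.3                      # input weight
    y = w ** 2                   # forward: g(w) = w^2
    o = math.cos(y)              # forward: f(y) = cos(y)

    dg_dw = 2 * w                # local derivative of g at w
    df_dy = -math.sin(y)         # local derivative of f at y
    do_dw = df_dy * dg_dw        # univariate chain rule

    eps = 1e-6                   # finite-difference comparison
    numeric = (math.cos((w + eps) ** 2) - o) / eps
    print(do_dw, numeric)        # the two values agree closely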
Backpropagation along Multiple Paths (Multivariate Chain Rule)

• Neural networks contain multiple nodes in each layer.

• Consider the function f(g_1(w), ..., g_k(w)), in which a unit computing the multivariate function f(·) gets its inputs from k units computing g_1(w), ..., g_k(w).

• The multivariable chain rule needs to be used:

$$\frac{\partial f(g_1(w), \ldots, g_k(w))}{\partial w} = \sum_{i=1}^{k} \frac{\partial f(g_1(w), \ldots, g_k(w))}{\partial g_i(w)} \cdot \frac{\partial g_i(w)}{\partial w} \qquad (2)$$
Example of Multivariable Chain Rule

[Figure: the input weight w feeds f(w) = w^2; its output goes to both g(y) = cos(y) and h(z) = sin(z), which feed K(p, q) = p + q; the output is O = [cos(w^2)] + [sin(w^2)]]

$$\frac{\partial o}{\partial w} = \underbrace{\frac{\partial K(p, q)}{\partial p}}_{1} \cdot \underbrace{g'(y)}_{-\sin(y)} \cdot \underbrace{f'(w)}_{2w} + \underbrace{\frac{\partial K(p, q)}{\partial q}}_{1} \cdot \underbrace{h'(z)}_{\cos(z)} \cdot \underbrace{f'(w)}_{2w}$$

$$= -2w \cdot \sin(y) + 2w \cdot \cos(z) = -2w \cdot \sin(w^2) + 2w \cdot \cos(w^2)$$

• Product of local derivatives along all paths from w to o.
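
A small sketch (plain Python, only restating the slide's example) that sums the products of local derivatives over both paths:

    import math

    w = 0.7
    y = z = w ** 2                       # both branches receive f(w) = w^2

    # path through g(y) = cos(y): (dK/dp) * g'(y) * f'(w)
    path_g = 1 * (-math.sin(y)) * (2 * w)
    # path through h(z) = sin(z): (dK/dq) * h'(z) * f'(w)
    path_h = 1 * math.cos(z) * (2 * w)

    do_dw = path_g + path_h              # multivariable chain rule: sum over paths
    print(do_dw, -2*w*math.sin(w**2) + 2*w*math.cos(w**2))   # identical values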


Pathwise Aggregation Lemma

• Let a non-null set P of paths exist from a variable w in the computational graph to the output o.

– The local gradient of the node with variable y(j) with respect to the variable y(i) for directed edge (i, j) is z(i, j) = ∂y(j)/∂y(i).

• The value of ∂o/∂w is given by computing the product of the local gradients along each path in P, and summing these products over all paths:

$$\frac{\partial o}{\partial w} = \sum_{P \in \mathcal{P}} \ \prod_{(i,j) \in P} z(i, j) \qquad (3)$$

• Observation: Each z(i, j) is easy to compute.


An Exponential Time Algorithm for Computing Partial Derivatives

• The path aggregation lemma provides a simple way to compute the derivative with respect to an intermediate variable w (a brute-force sketch follows this list):

– Use the computational graph to compute each value y(i) of node i in a forward phase.

– Compute the local derivative z(i, j) = ∂y(j)/∂y(i) on each edge (i, j) in the network.

– Identify the set P of all paths from the node with variable w to the output o.

– For each path P ∈ P, compute the product M(P) = ∏_{(i,j)∈P} z(i, j) of the local derivatives on that path.

– Add up these values over all paths P ∈ P.
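
A minimal sketch of this procedure (hypothetical helper names, not from the book); the recursion branches at every outgoing edge, so the running time grows with the number of paths:

    def pathwise_derivative(succ, z, src, sink):
        """succ[i]: successor nodes of i; z[(i, j)]: local derivative on (i, j)."""
        if src == sink:
            return 1.0          # the empty path contributes a product of 1
        # sum over all paths: one recursive branch per outgoing edge
        return sum(z[(src, j)] * pathwise_derivative(succ, z, j, sink)
                   for j in succ[src])

    # Example: w -> a -> o and w -> b -> o, all local derivatives equal to 2
    succ = {'w': ['a', 'b'], 'a': ['o'], 'b': ['o'], 'o': []}
    z = {('w', 'a'): 2.0, ('w', 'b'): 2.0, ('a', 'o'): 2.0, ('b', 'o'): 2.0}
    print(pathwise_derivative(succ, z, 'w', 'o'))   # 2*2 + 2*2 = 8.0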


Example: Deep Computational Graph with Product Nodes

[Figure: deep computational graph with input weight w, five hidden layers of two nodes each, h(1,1) ... h(5,2), and output O; the node values in layers 1 through 5 are w, w^2, w^4, w^8, w^16, and O = w^32. Each node computes the product of its inputs.]

• Each node computes the product of its inputs ⇒ The partial derivative of xy with respect to one input x is the other input y.

• Computing the product of partial derivatives along a path is equivalent to computing the product of values along the only other node-disjoint path.

• Aggregating products of partial derivatives (only in this case) equals aggregating products of values.
Example of Increasing Complexity with Depth

[Figure: the same product-node graph as above; node values are w, w^2, w^4, w^8, w^16, and O = w^32]

$$\frac{\partial O}{\partial w} = \sum_{j_1, j_2, j_3, j_4, j_5 \in \{1,2\}^5} \underbrace{h(1, j_1)}_{w} \, \underbrace{h(2, j_2)}_{w^2} \, \underbrace{h(3, j_3)}_{w^4} \, \underbrace{h(4, j_4)}_{w^8} \, \underbrace{h(5, j_5)}_{w^{16}} = \underbrace{\sum}_{\text{All 32 paths}} w^{31} = 32 w^{31}$$

• Impractical with increasing depth.
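
A quick check of the 32-path computation (a sketch only; the values mirror the figure):

    import itertools

    w = 1.05
    layer_value = [w, w**2, w**4, w**8, w**16]     # node value in layers 1..5

    total = 0.0
    for path in itertools.product([1, 2], repeat=5):   # 2^5 = 32 paths
        prod = 1.0
        for k in range(5):
            prod *= layer_value[k]   # value on the node-disjoint "other" path
        total += prod

    print(total, 32 * w**31)         # both evaluate to 32 w^31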


Observations on Exponential Time Algorithm

• Not a very practical approach ⇒ A million paths for a network with 100 nodes in each layer and three layers.

• This is the approach of traditional machine learning with complex objective functions in closed form.

– For a composition function in closed form, manual differentiation explicitly traverses all paths with the chain rule.

– The algebraic expression of the derivative of a complex function might not fit the paper you write on.

– Explains why most of traditional machine learning amounts to a shallow neural model.

• The beautiful dynamic programming idea of backpropagation rescues us from this complexity.
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Backpropagation II: Using Dynamic Programming [Backpropagation] to Compute Derivatives in Polynomial Time

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.2
Differentiating Composition Functions

• Neural networks compute composition functions with a lot of repetitiveness caused by a node appearing in multiple paths.

• The most natural and intuitive way to differentiate such a composition function is not the most efficient way to do it.

• Natural approach: Top down

$$f(w) = \sin(w^2) + \cos(w^2)$$

• We should not have to differentiate w^2 twice!

• Dynamic programming collapses repetitive computations to reduce exponential complexity to polynomial complexity!
[Figure: computational graph with input weight w, two rows of nodes 1-10, a node 11, and the output O; each node i contains y(i), and each edge between i and j contains z(i, j). Example: z(4, 6) is the partial derivative of y(6) with respect to y(4).]

• We want to compute the derivative of the output with respect to variable w.

• We can easily compute z(i, j) = ∂y(j)/∂y(i).

• The naive approach computes S(w, o) = ∂o/∂w = Σ_{P∈P} ∏_{(i,j)∈P} z(i, j) by explicit aggregation over all paths in P.
Dynamic Programming and Directed Acyclic Graphs

• Dynamic programming is used extensively in directed acyclic graphs.

– Typical: Exponentially aggregative path-centric functions between source-sink pairs.

– Example: Polynomial solution to the longest path problem in directed acyclic graphs (NP-hard in general).

– General approach: Start at either the source or the sink, and recursively compute the relevant function over paths of increasing length by reusing intermediate computations.

• Our path-centric function: S(w, o) = Σ_{P∈P} ∏_{(i,j)∈P} z(i, j).

– The backwards direction makes more sense here because we have to compute the derivative of the output (sink) with respect to all variables in the early layers.
Dynamic Programming Update

• Let A(i) be the set of nodes at the ends of outgoing edges from node i.

• Let S(i, o) be the intermediate variable indicating the same path-aggregative function from i to o.

$$S(i, o) \Leftarrow \sum_{j \in A(i)} S(j, o) \cdot z(i, j) \qquad (4)$$

• Initialize S(o, o) to 1 and compute backwards to reach S(w, o).

– Intermediate computations like S(i, o) are also useful for computing derivatives in other layers.

• Do you recognize the multivariate chain rule in Equation 4?

$$\frac{\partial o}{\partial y(i)} = \sum_{j \in A(i)} \frac{\partial o}{\partial y(j)} \cdot \frac{\partial y(j)}{\partial y(i)}$$
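
A minimal sketch of Equation 4 (hypothetical helper names; compare with the exponential path-enumeration sketch earlier: storing S(i, o) in a single backward sweep is what makes this polynomial):

    def backprop_derivatives(succ, z, o, nodes_reverse_topological):
        """succ[i]: A(i); z[(i, j)]: local derivative on edge (i, j)."""
        S = {o: 1.0}                              # initialize S(o, o) = 1
        for i in nodes_reverse_topological:       # sweep backwards from o
            if i != o:
                # Equation 4: S(i, o) = sum over j in A(i) of S(j, o) * z(i, j)
                S[i] = sum(S[j] * z[(i, j)] for j in succ[i])
        return S                                  # each edge is used exactly once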
How Does it Apply to Neural Networks?

[Figure: a neuron computing h = Φ(W · X) is broken up into a pre-activation value a_h = W · X followed by a post-activation value h = Φ(a_h)]

• A neural network is a special case of a computational graph.

– We can define the computational graph in multiple ways.

– Pre-activation variables or post-activation variables or both as the node variables of the computation graph?

– The three lead to different updates, but the end result is equivalent.
Pre-Activation Variables to Create Computational Graph

• Compute the derivative δ(i, o) of the loss L at o with respect to the pre-activation variable at node i.

• We always compute loss derivatives δ(i, o) with respect to activations in nodes during dynamic programming, rather than weights.

– The loss derivative with respect to the weight wij from node i to node j is given by the product of δ(j, o) and the hidden variable at i (why?)

• Key points: z(i, j) = wij · Φ'_i; initialize S(o, o) = δ(o, o) = (∂L/∂o) · Φ'_o.

$$\delta(i, o) = S(i, o) = \Phi'_i \sum_{j \in A(i)} w_{ij} \, S(j, o) = \Phi'_i \sum_{j \in A(i)} w_{ij} \, \delta(j, o) \qquad (5)$$
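
A compact sketch of Equation 5 in layer-wise form (hypothetical helper; NumPy assumed; weights[k] connects layer k to layer k+1 with shape (n_k, n_{k+1}), h[k] holds post-activation values, and a[k] pre-activation values):

    import numpy as np

    def backprop_preactivation(weights, h, a, dL_do, phi_prime):
        L = len(weights)
        delta = [None] * (L + 1)
        delta[L] = dL_do * phi_prime(a[L])        # δ(o, o) = (∂L/∂o) Φ'_o
        for k in range(L - 1, 0, -1):
            # Equation 5: δ(i, o) = Φ'_i * Σ_j w_ij δ(j, o)
            delta[k] = phi_prime(a[k]) * (weights[k] @ delta[k + 1])
        # loss derivative w.r.t. w_ij: δ(j, o) times the hidden value at i
        return [np.outer(h[k], delta[k + 1]) for k in range(L)]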
Post-Activation Variables to Create Computation Graph

• The variables in the computation graph are the hidden values after activation function application.

• Compute the derivative Δ(i, o) of the loss L at o with respect to the post-activation variable at node i.

• Key points: z(i, j) = wij · Φ'_j; initialize S(o, o) = Δ(o, o) = ∂L/∂o.

$$\Delta(i, o) = S(i, o) = \sum_{j \in A(i)} w_{ij} \, S(j, o) \, \Phi'_j = \sum_{j \in A(i)} w_{ij} \, \Delta(j, o) \, \Phi'_j \qquad (6)$$

– Compare with the pre-activation approach δ(i, o) = Φ'_i Σ_{j∈A(i)} wij δ(j, o).

– The pre-activation approach is more common in textbooks.


Variables for Both Pre-Activation and Post-Activation Values

• A nice way of decoupling the linear multiplication and activation operations.

• Simplified approach in which each layer is treated as a single node with a vector variable.

– The update can be computed with vector and matrix multiplications.

• Topic of discussion in the next part of the backpropagation series.


Losses at Arbitrary Nodes

• We assume that the loss is incurred at a single output node.

• In the case of multiple output nodes, one only has to add up the contributions of the different outputs in the backwards phase.

• In some cases, penalties may be applied to hidden nodes.

• For a hidden node i, we add an "initialization value" to S(i, o) just after it has been computed during dynamic programming, which is based on its penalty.

– Similar treatment as the initialization of an output node, except that we add the contribution to the existing value of S(i, o).
Handling Shared Weights

• You saw an example in autoencoders where encoder and decoder weights are shared.

• Also happens in specialized architectures like recurrent or convolutional neural networks.

• Can be addressed with a simple application of the chain rule.

• Let w_1 ... w_r be r copies of the same weight w in the neural network.

$$\frac{\partial L}{\partial w} = \sum_{i=1}^{r} \frac{\partial L}{\partial w_i} \cdot \frac{\partial w_i}{\partial w} = \sum_{i=1}^{r} \frac{\partial L}{\partial w_i} \qquad (7)$$

• Pretend all weights are different and just add!
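
A one-line illustration of Equation 7 (sketch only; NumPy assumed, with hypothetical per-copy gradients):

    import numpy as np

    # grads[i] holds dL/dw_i for each of the r copies of a shared weight w
    grads = np.array([0.3, -0.1, 0.25])
    dL_dw = grads.sum()              # Equation 7: just add the copies up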


Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Backpropagation III: A Decoupled View of Vector-Centric Backpropagation

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.2
Multiple Computational Graphs from Same Neural Network

• We can create a computational graph in multiple ways from the variables in a neural network.

– Computational graph of pre-activation variables (part II of the lecture)

– Computational graph of post-activation variables (part II of the lecture)

– Computational graph of both (this part of the lecture)

• Using both pre-activation and post-activation variables creates decoupled backpropagation updates for the linear layer and for the activation function.
Scalar Versus Vector Computational Graphs

• The backpropagation discussion so far uses scalar operations.

• Neural networks are constructed in layer-wise fashion.

• We can treat an entire layer as a node with a vector variable.

• We want to use layer-wise operations on vectors.

– Most real implementations use vector and matrix multiplications.

• Want to decouple the operations of linear matrix multiplication and activation function into separate "layers."
Vector-Centric and Decoupled View of Single Layer

[Figure: decoupled layers i−1 through i+3 alternate linear transforms and activation functions before some loss; forward direction: multiply with W^T, then apply Φ elementwise; backward direction: multiply with W, then multiply with Φ' elementwise]

• Note that linear matrix multiplication and the activation function are separate layers.

• Method 1 (requires knowledge of matrix calculus): You can use the vector-to-vector chain rule to backpropagate on a single path!
Converting Scalar Updates to Vector Form

• Recap: When the partial derivative of node q with respect to node p is z(p, q), the dynamic programming update is:

$$S(p, o) = \sum_{q \in \text{Next Layer}} S(q, o) \cdot z(p, q) \qquad (8)$$

• We can write the above update in vector form by creating a single column vector g_i for layer i ⇒ it contains S(p, o) for all values of p.

$$\overline{g}_i = Z \overline{g}_{i+1} \qquad (9)$$

• The matrix Z = [z(p, q)] is the transpose of the Jacobian!

– We will use the notation J = Z^T in further slides.


The Jacobian

• Consider layer i and layer (i + 1) with activations z_i and z_{i+1}.

– The kth activation in layer (i + 1) is obtained by applying an arbitrary function f_k(·) to the vector of activations in layer i.

• Definition of the Jacobian matrix entries:

$$J_{kr} = \frac{\partial f_k(\overline{z}_i)}{\partial z_i^{(r)}} \qquad (10)$$

• Backpropagation update:

$$\overline{g}_i = J^T \overline{g}_{i+1} \qquad (11)$$
Effect on Linear Layer and Activation Functions

[Figure: the same decoupled-layer diagram as before; forward: multiply with W^T, apply Φ elementwise; backward: multiply with W, multiply with Φ' elementwise]

• Backpropagation is multiplication with the transposed weight matrix for a linear layer.

• Elementwise multiplication with the derivative for an activation layer.
Table of Forward Propagation and Backward Propagation
Function         Forward                          Backward
Linear           z_{i+1} = W^T z_i                g_i = W g_{i+1}
Sigmoid          z_{i+1} = sigmoid(z_i)           g_i = g_{i+1} ⊙ z_{i+1} ⊙ (1 − z_{i+1})
Tanh             z_{i+1} = tanh(z_i)              g_i = g_{i+1} ⊙ (1 − z_{i+1} ⊙ z_{i+1})
ReLU             z_{i+1} = z_i ⊙ I(z_i > 0)       g_i = g_{i+1} ⊙ I(z_i > 0)
Hard Tanh        Set to ±1 (∉ [−1, +1])           Set to 0 (∉ [−1, +1])
                 Copy (∈ [−1, +1])                Copy (∈ [−1, +1])
Max              Maximum of inputs                Set to 0 (non-maximal inputs)
                                                  Copy (maximal input)
Arbitrary        z_{i+1}^{(k)} = f_k(z_i)         g_i = J^T g_{i+1}
function f_k(·)                                   J is the Jacobian (Equation 10)

• Two types of Jacobians: Linear layers are dense and activation layers are sparse.

• The maximization function is used in max-pooling.
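
A sketch of two decoupled layer objects from the table (NumPy assumed; only the linear and ReLU rows are shown):

    import numpy as np

    class Linear:
        """Decoupled linear layer: forward multiplies by W^T, backward by W."""
        def __init__(self, W):            # W has shape (inputs, outputs)
            self.W = W
        def forward(self, z):
            return self.W.T @ z
        def backward(self, g):
            return self.W @ g

    class ReLU:
        """Decoupled activation layer: backward multiplies elementwise."""
        def forward(self, z):
            self.mask = (z > 0).astype(z.dtype)
            return z * self.mask          # z_{i+1} = z_i * I(z_i > 0)
        def backward(self, g):
            return g * self.mask          # g_i = g_{i+1} * I(z_i > 0)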


Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Neural Network Training [Initialization, Preprocessing, Mini-Batching, Tuning, and Other Black Art]

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.3
How to Check Correctness of Backpropagation

• Consider a particular weight w of a randomly selected edge in the network.

• Let L(w) be the current value of the loss.

• The weight of this edge is perturbed by adding a small amount ε > 0 to it.

• Estimate of the derivative:

$$\frac{\partial L(w)}{\partial w} \approx \frac{L(w + \epsilon) - L(w)}{\epsilon} \qquad (12)$$

• When the partial derivatives do not match closely enough, it might be indicative of an incorrectness in the implementation.
What Does “Closely Enough” Mean?

• The algorithm-determined derivative is G_e and the approximate derivative is G_a.

$$\rho = \frac{|G_e - G_a|}{|G_e + G_a|} \qquad (13)$$

• The ratio should be less than 10^{−6}.

• If ReLU is used, the ratio should be less than 10^{−3}.

• One should perform the checks for a sample of the weights a few times during training.
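
Both equations in a few lines (sketch; loss is assumed to be a function of the single perturbed weight, with all other weights held fixed):

    def gradient_check(loss, w, analytic_grad, eps=1e-6):
        """Compare the backpropagated derivative with Equation 12's estimate."""
        numeric_grad = (loss(w + eps) - loss(w)) / eps
        # Equation 13: relative mismatch ratio
        rho = abs(analytic_grad - numeric_grad) / abs(analytic_grad + numeric_grad)
        return rho          # flag the implementation if rho exceeds 1e-6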
Stochastic Gradient Descent

• We have always worked with point-wise loss functions so far.

– Corresponds to stochastic gradient descent.

– In practice, stochastic gradient descent is only a randomized approximation of the true loss function.

• The true loss function is typically additive over points.

– Example: Sum-of-squared errors in regression.

– Computing the gradient over a single point is like a sampled gradient estimate.
Mini-batch Stochastic Gradient Descent

• One can improve the accuracy of gradient computation by using a batch of instances.

– Instead of holding a vector of activations, we hold a matrix of activations in each layer.

– Matrix-to-matrix multiplications are required for forward and backward propagation.

– Increases the memory requirements.

• Typical sizes are powers of 2, like 32, 64, 128, or 256.


Why Does Mini-Batching Work?

• At early learning stages, the weight vectors are very poor.

– Training data is highly redundant in terms of important patterns.

– Small batch sizes give the correct direction of the gradient.

• At later learning stages, the gradient direction becomes less accurate.

– But some amount of noise helps avoid overfitting anyway!

• Performance on out-of-sample data does not deteriorate!


Feature Normalization

• Standardization: Normalize to zero mean and unit variance.

• Whitening: Transform the data to a de-correlated axis system with principal component analysis (mean-centered SVD); a sketch of both operations follows this list.

– Truncate directions with extremely low variance.

– Standardize the other directions.

• Basic principle: Assume that the data is generated from a Gaussian distribution and give equal importance to all directions.
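
A sketch of standardization and PCA whitening (NumPy assumed; the variance threshold is a hypothetical choice):

    import numpy as np

    def standardize(X):
        """Zero-mean, unit-variance scaling of each feature (column) of X."""
        return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

    def whiten(X, var_threshold=1e-8):
        """PCA whitening via SVD of the mean-centered data matrix."""
        Xc = X - X.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        keep = (s ** 2) / len(X) > var_threshold   # drop low-variance directions
        # project onto the kept principal directions and standardize them
        return (Xc @ Vt[keep].T) / (s[keep] / np.sqrt(len(X)))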
Weight Initialization

• Initializations are surprisingly important.

– Poor initializations can lead to bad convergence behavior.

– Instability across different layers (vanishing and exploding gradients).

• More sophisticated initializations, such as pretraining, are covered in a later lecture.

• Even some simple rules in initialization can help with conditioning.
Symmetry Breaking

• It is a bad idea to initialize weights to the same value.

– Results in weights being updated in lockstep.

– Creates redundant features.

• Initializing weights to random values breaks symmetry.

• The average magnitude of the random variables is important for stability.
Sensitivity to Number of Inputs

• More inputs increase the output's sensitivity to the average weight.

– Additive effect of multiple inputs: the variance linearly increases with the number of inputs r.

– The standard deviation scales with the square root of the number of inputs r.

• Each weight is initialized from a Gaussian distribution with standard deviation √(1/r) (√(2/r) for ReLU).

• More sophisticated: Use a standard deviation of √(2/(r_in + r_out)).
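
Both initialization rules in one helper (a sketch; the function name and flags are hypothetical):

    import numpy as np

    def init_weights(r_in, r_out, relu=False, fan_avg=False, rng=np.random):
        """Gaussian initialization scaled by fan-in (or fan-in plus fan-out)."""
        if fan_avg:
            std = np.sqrt(2.0 / (r_in + r_out))    # the more sophisticated rule
        else:
            std = np.sqrt((2.0 if relu else 1.0) / r_in)
        return rng.normal(0.0, std, size=(r_in, r_out))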


Tuning Hyperparameters

• Hyperparameters are parameters like the number of layers, nodes per layer, learning rate, and regularization parameter.

• Use a separate validation set for tuning.

• Do not use the same data set for backpropagation training as for tuning.
Grid Search

• Perform a grid search over the parameter space.

– Select a set of values for each parameter in some "reasonable" range.

– Test over all combinations of values.

• Be careful about parameters at the borders of the selected range.

• Optimization: Search over a coarse grid first, and then drill down into the region of interest with finer grids.
How to Select Values for Each Parameter

• The natural approach is to select uniformly distributed values of the parameters.

– Not the best approach in many cases! ⇒ Log-uniform intervals.

– Search uniformly over a reasonable range of log-values and then exponentiate.

– Example: Uniformly sample the log-learning rate between −3 and −1, and then use the sampled value as the exponent of 10 (a sketch follows).
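
The log-uniform sampling in two lines (sketch; NumPy assumed):

    import numpy as np

    def sample_learning_rate(low_exp=-3, high_exp=-1, rng=np.random):
        """Uniform in the exponent, then exponentiate: rates in [1e-3, 1e-1]."""
        return 10.0 ** rng.uniform(low_exp, high_exp)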
Sampling versus Grid Search

• With a large number of parameters, grid search is still expensive.

• With 10 parameters, choosing just 3 values for each parameter leads to 3^10 = 59049 possibilities.

• A flexible choice is to sample over the grid space.

• Used more commonly in large-scale settings, with good results.
Large-Scale Settings

• Multiple threads are often run with sampled parameter settings.

• Accuracy is tracked on a separate out-of-sample validation set.

• Bad runs are detected and killed after a certain number of epochs.

• New runs may also be started after killing threads (if needed).

• Only a few winners are trained to completion, and their predictions are combined in an ensemble.
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Gradient Ratios, Vanishing and Exploding Gradient Problems

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.4
Effect of Varying Slopes in Gradient Descent

• Neural network learning is a multivariable optimization problem.

• Different weights have different magnitudes of partial derivatives.

• Widely varying magnitudes of partial derivatives affect the learning.

• Gradient descent works best when the different weights have derivatives of similar magnitude.

– The path of steepest descent in most loss functions is only an instantaneous direction of best movement, and is not the correct direction of descent in the longer term.
Example

[Figure: contour plots of two loss functions; (a) circular bowl L = x^2 + y^2, (b) elliptical bowl L = x^2 + 4y^2]

• Loss functions with varying sensitivity to different attributes.


Revisiting Feature Normalization

• In the previous lecture, we discussed feature normalization.

• When features have very different magnitudes, the gradient ratios of different weights are likely very different.

• Feature normalization helps even out gradient ratios to some extent.

– The exact behavior depends on the target variable and loss function.
The Vanishing and Exploding Gradient Problems

• An extreme manifestation of varying sensitivity occurs in deep networks.

• The weights/activation derivatives in different layers affect the backpropagated gradient in a multiplicative way.

– With increasing depth this effect is magnified.

– The partial derivatives can either increase or decrease with depth.
Example

[Figure: chain network x → h_1 → h_2 → ... → h_{m−1} → o with weights w_1 ... w_m and one node per layer]

• Neural network with one node per layer.

• Forward propagation multiplicatively depends on each weight and activation function evaluation.

• Backpropagated partial derivatives get multiplied by weights and activation function derivatives.

• Unless the values are exactly one, the partial derivatives will either continuously increase (explode) or decrease (vanish).

• Hard to initialize weights exactly right.


Activation Function Propensity to Vanishing Gradients

• The partial derivative of the sigmoid with output o is o(1 − o).

– Maximum value of 0.25 at o = 0.5.

– For 10 layers, the activation functions alone will multiply the gradient by less than 0.25^10 ≈ 10^{−6}.

• At extreme output values, the partial derivative is close to 0, which is called saturation.

• The tanh activation function, with partial derivative (1 − o^2), has a maximum value of 1 at o = 0, but saturation will still cause problems.
Exploding Gradients

• Initializing weights to very large values to compensate for the activation functions can cause exploding gradients.

• Exploding gradients can also occur when weights across different layers are shared (e.g., recurrent neural networks).

– The effect of a finite change in weight is extremely unpredictable across different layers.

– A small finite change changes the loss negligibly, but a slightly larger value might change the loss drastically.
Cliffs

[Figure: loss surface with a cliff; a gentle gradient before the cliff causes overshooting]

• Often occurs with the exploding gradient problem.


A Partial Fix to Vanishing Gradients

• The ReLU has linear activation for nonnegative values and otherwise sets outputs to 0.

• The ReLU has a partial derivative of 1 for nonnegative inputs.

• However, it can have a partial derivative of 0 in some cases and never get updated.

– The neuron is permanently dead!

Leaky ReLU

• For negative inputs, the leaky ReLU can still propagate some gradient backwards.

– At the reduced rate of α < 1 times the rate for nonnegative inputs:

$$\Phi(v) = \begin{cases} \alpha \cdot v & v \le 0 \\ v & \text{otherwise} \end{cases} \qquad (14)$$

• The value of α is a hyperparameter chosen by the user.

• The gains with the leaky ReLU are not guaranteed.
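
Equation 14 and its derivative in NumPy (a sketch):

    import numpy as np

    def leaky_relu(v, alpha=0.01):
        """Equation 14: alpha * v for v <= 0, v otherwise."""
        return np.where(v <= 0, alpha * v, v)

    def leaky_relu_grad(v, alpha=0.01):
        """Negative inputs still propagate a reduced gradient of alpha."""
        return np.where(v <= 0, alpha, 1.0)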


Maxout

• The activation used is max{W_1 · X, W_2 · X} with two coefficient vectors.

• One can view the maxout as a generalization of the ReLU.

– The ReLU is obtained by setting one of the coefficient vectors to 0.

– The leaky ReLU can also be simulated by setting the other coefficient vector to W_2 = αW_1.

• The main disadvantage is that it doubles the number of parameters.
Gradient Clipping for Exploding Gradients

• Try to make the different components of the partial derivatives more even.

– Value-based clipping: All partial derivatives outside ranges are set to the range boundaries.

– Norm-based clipping: The entire gradient vector is normalized by the L2-norm of the entire vector.

• One can achieve a better conditioning of the values, so that the updates from mini-batch to mini-batch are roughly similar.

• Prevents an anomalous gradient explosion during the course of training.
Other Comments on Vanishing and Exploding Gradients

• The methods discussed above are only partial fixes.

• Other fixes are discussed in later lectures:

– Stronger initializations with pretraining.

– Second-order learning methods that make use of second-order derivatives (or the curvature of the loss function).
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

First-Order Gradient Descent Methods

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.5
First-Order Descent

• First-order methods work with steepest-descent directions.

• Modifications to the basic form of steepest descent:

– Need to reduce step sizes as the algorithm progresses.

– Need a way of avoiding local optima.

– Need to address widely varying slopes with respect to different weight parameters.
Learning Rate Decay

• Initial learning rates should be high but reduce over time.

• The two most common decay functions are exponential decay and inverse decay.

• The learning rate αt can be expressed in terms of the initial decay rate α0 and epoch t as follows:

$$\alpha_t = \alpha_0 \exp(-k \cdot t) \qquad \text{[Exponential Decay]}$$

$$\alpha_t = \frac{\alpha_0}{1 + k \cdot t} \qquad \text{[Inverse Decay]}$$

The parameter k controls the rate of the decay.
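
Both schedules as direct transcriptions of the formulas above (sketch):

    import math

    def exponential_decay(alpha0, k, t):
        return alpha0 * math.exp(-k * t)     # [Exponential Decay]

    def inverse_decay(alpha0, k, t):
        return alpha0 / (1.0 + k * t)        # [Inverse Decay]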
Momentum Methods: Marble Rolling Down Hill

[Figure: loss versus the value of a neural network parameter; gradient descent slows down in a flat region and gets trapped in a local optimum]

• Use a friction parameter β ∈ (0, 1) to gain speed in the direction of movement.

$$\overline{V} \Leftarrow \beta \overline{V} - \alpha \frac{\partial L}{\partial \overline{W}}; \qquad \overline{W} \Leftarrow \overline{W} + \overline{V}$$
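
The update as a one-step helper (sketch; grad stands for ∂L/∂W evaluated at the current W):

    def momentum_step(W, V, grad, alpha=0.01, beta=0.9):
        """V <= beta*V - alpha*grad; W <= W + V."""
        V = beta * V - alpha * grad
        return W + V, V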
Avoiding Zig-Zagging with Momentum

[Figure: (a) relative directions of the gradient and momentum steps from the starting point; (b) without momentum, gradient descent zig-zags toward the optimum; (c) with momentum, the path to the optimum is smoothed]


Nesterov Momentum

• A modification of the traditional momentum method in which the gradients are computed at a point that would be reached after executing a β-discounted version of the previous step again.

• Compute the gradient at a point reached using only the momentum portion of the current update:

$$\overline{V} \Leftarrow \beta \overline{V} - \alpha \frac{\partial L(\overline{W} + \beta \overline{V})}{\partial \overline{W}}; \qquad \overline{W} \Leftarrow \overline{W} + \overline{V}$$
Momentum

• Put on the brakes as the marble nears the bottom of the hill.

• Nesterov momentum should always be used with mini-batch SGD (rather than SGD).
AdaGrad

• Aggregate the squared magnitude of the ith partial derivative in Ai.

• The square root of Ai is proportional to the root-mean-square slope.

– The absolute value will increase over time.

$$A_i \Leftarrow A_i + \left(\frac{\partial L}{\partial w_i}\right)^2 \quad \forall i \qquad (15)$$

• The update for the ith parameter wi is as follows:

$$w_i \Leftarrow w_i - \frac{\alpha}{\sqrt{A_i}} \left(\frac{\partial L}{\partial w_i}\right) \quad \forall i \qquad (16)$$

• Use A_i + ε in the denominator to avoid ill-conditioning.
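
Equations 15 and 16 in one helper (sketch; NumPy assumed):

    import numpy as np

    def adagrad_step(w, A, grad, alpha=0.01, eps=1e-8):
        """Accumulate squared gradients, then scale the step per parameter."""
        A = A + grad ** 2                          # Equation 15
        w = w - alpha * grad / np.sqrt(A + eps)    # Equation 16, with eps
        return w, A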
AdaGrad Intuition


• Scaling the derivative inversely with √Ai encourages faster relative movements along gently sloping directions.

– Absolute movements tend to slow down prematurely.

– Scaling parameters use stale values.

RMSProp

• The RMSProp algorithm uses exponential smoothing with parameter ρ ∈ (0, 1) in the relative estimation of the gradients.

– Absolute magnitudes of the scaling factors do not grow with time.

– The problem of staleness is ameliorated.

$$A_i \Leftarrow \rho A_i + (1 - \rho) \left(\frac{\partial L}{\partial w_i}\right)^2 \quad \forall i \qquad (17)$$

$$w_i \Leftarrow w_i - \frac{\alpha}{\sqrt{A_i}} \left(\frac{\partial L}{\partial w_i}\right) \quad \forall i$$

• Use A_i + ε to avoid ill-conditioning.
RMSProp with Nesterov Momentum

• It is possible to combine RMSProp with Nesterov momentum:

$$v_i \Leftarrow \beta v_i - \frac{\alpha}{\sqrt{A_i}} \left(\frac{\partial L(\overline{W} + \beta \overline{V})}{\partial w_i}\right); \quad w_i \Leftarrow w_i + v_i \quad \forall i$$

• Maintenance of Ai is done with shifted gradients as well:

$$A_i \Leftarrow \rho A_i + (1 - \rho) \left(\frac{\partial L(\overline{W} + \beta \overline{V})}{\partial w_i}\right)^2 \quad \forall i \qquad (18)$$
AdaDelta and Adam

• Both methods derive intuition from RMSProp.

– AdaDelta keeps track of an exponentially smoothed value of the incremental changes of the weights Δwi in previous iterations to decide the parameter-specific learning rate.

– Adam keeps track of exponentially smoothed gradients from previous iterations (in addition to normalizing like RMSProp).

• Adam is an extremely popular method.

Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Second-Order Gradient Descent Methods

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.5.5
Why Second-Order Methods?

[Figure: loss surface with a cliff; a gentle gradient before the cliff causes overshooting]

• First-order methods are not enough when there is curvature.


Revisiting the Bowl

[Figure: contour plots of (a) circular bowl L = x^2 + y^2 and (b) elliptical bowl L = x^2 + 4y^2]

• High-curvature directions cause bouncing in spite of a higher gradient ⇒ We need the second derivative for more information.
A Valley

[Figure: surface plot of a sloping valley f(x, y); the least-curvature direction is marked]

• Gently sloping directions are better with less curvature!


The Hessian

• The second-order derivatives of the loss function L(W) are of the following form:

$$H_{ij} = \frac{\partial^2 L(\overline{W})}{\partial w_i \, \partial w_j}$$

• The partial derivatives use all pairwise parameters in the denominator.

• For a neural network with d parameters, we have a d × d Hessian matrix H, for which the (i, j)th entry is Hij.
Quadratic Approximation of Loss Function

• One can write a quadratic approximation of the loss function with a Taylor expansion about W0:

$$L(\overline{W}) \approx L(\overline{W}_0) + (\overline{W} - \overline{W}_0)^T [\nabla L(\overline{W}_0)] + \frac{1}{2} (\overline{W} - \overline{W}_0)^T H (\overline{W} - \overline{W}_0) \qquad (19)$$

• One can derive a single-step optimality condition from the initial point W0 by setting the gradient to 0.
Newton’s Update

• Can solve the quadratic approximation in one step from the initial point W0.

$$\nabla L(\overline{W}) = 0 \quad \text{[Gradient of Loss Function]}$$

$$\nabla L(\overline{W}_0) + H(\overline{W} - \overline{W}_0) = 0 \quad \text{[Gradient of Taylor approximation]}$$

• Rearrange the optimality condition to obtain the Newton update:

$$\overline{W} \Leftarrow \overline{W}_0 - H^{-1}[\nabla L(\overline{W}_0)] \qquad (20)$$

• Note the ratio of first-order to second-order ⇒ Trade-off between speed and curvature.

• A step size is not needed!
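
Equation 20 in one line (sketch; solving the linear system is preferable to explicitly inverting H):

    import numpy as np

    def newton_step(W0, grad, H):
        """W <= W0 - H^{-1} grad, via a linear solve rather than inversion."""
        return W0 - np.linalg.solve(H, grad)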


Why Second-Order Methods?

• Pre-multiplying with the inverse Hessian finds a trade-off between the speed of descent and the curvature.
Basic Second-Order Algorithm and Approximations

• Keep making Newton updates until convergence (a single step is needed for a quadratic function).

– Even computing the Hessian is difficult!

– Inverting it is even more difficult.

• Solutions:

– Approximate the Hessian.

– Find an algorithm that works with the projection Hv for some direction v.
Conjugate Gradient Method

• Get to the optimum in d steps (instead of a single Newton step), where d is the number of parameters.

• Use optimal step sizes to get the best point along a direction.

• Thou shalt not worsen with respect to previous directions!

• Conjugate direction: The gradient of the loss function at any point on an update direction is always orthogonal to the previous update directions.

$$\overline{q}_{t+1} = -\nabla L(\overline{W}_{t+1}) + \left(\frac{\overline{q}_t^T H [\nabla L(\overline{W}_{t+1})]}{\overline{q}_t^T H \overline{q}_t}\right) \overline{q}_t \qquad (21)$$

• For a quadratic function, it requires d updates instead of the single update of the Newton method.
Conjugate Gradients on 2-Dimensional Quadratic

• Two conjugate directions are required to reach optimality


Conjugate Gradient Algorithm

• For quadratic functions only.

– Update W_{t+1} ⇐ W_t + α_t q_t. Here, the step size α_t is computed using line search.

– Set $$\overline{q}_{t+1} = -\nabla L(\overline{W}_{t+1}) + \left(\frac{\overline{q}_t^T H [\nabla L(\overline{W}_{t+1})]}{\overline{q}_t^T H \overline{q}_t}\right) \overline{q}_t.$$ Increment t by 1.

• For non-quadratic functions, approximate the loss function with a Taylor expansion and perform about d of the above steps. Then repeat.
Efficiently Computing Projection of Hessian

• The update requires computation of a projection of the Hessian rather than an inversion of the Hessian.

$$\overline{q}_{t+1} = -\nabla L(\overline{W}_{t+1}) + \left(\frac{\overline{q}_t^T H [\nabla L(\overline{W}_{t+1})]}{\overline{q}_t^T H \overline{q}_t}\right) \overline{q}_t \qquad (22)$$

• Easy to perform numerically!

$$H\overline{v} \approx \frac{\nabla L(\overline{W}_0 + \delta \overline{v}) - \nabla L(\overline{W}_0)}{\delta} \qquad (23)$$
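
Equation 23 as a helper (sketch; grad_fn is assumed to return ∇L at a given parameter vector):

    def hessian_vector_product(grad_fn, W0, v, delta=1e-5):
        """Finite-difference estimate of Hv from two gradient evaluations."""
        return (grad_fn(W0 + delta * v) - grad_fn(W0)) / delta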
Other Second-Order Methods

• Quasi-Newton methods: A sequence of increasingly accurate approximations of the inverse Hessian matrix is used in various steps.

• Many variations of this approach exist.

• A commonly used update is BFGS, which stands for the Broyden-Fletcher-Goldfarb-Shanno algorithm, and its limited-memory variant L-BFGS.
Problems with Second-Order Methods

[Figure: (a) f(x) = x^3, which has a degenerate stationary point at x = 0; (b) g(x, y) = x^2 − y^2, which has a saddle point at the origin]

• Degenerate stationary points and saddle points are problematic for second-order methods.

• Saddle points: Whether one is a maximum or a minimum depends on which direction we approach it from.
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Batch Normalization

Neural Networks and Deep Learning, Springer, 2018


Chapter 3, Section 3.6
Revisiting the Vanishing and Exploding Gradient Problems

[Figure: chain network x → h_1 → h_2 → ... → h_{m−1} → o with weights w_1 ... w_m and one node per layer]

• Neural network with one node per layer.

• Forward propagation multiplicatively depends on each weight and activation function evaluation.

• Backpropagated partial derivatives get multiplied by weights and activation function derivatives.

• Unless the values are exactly one, the partial derivatives will either continuously increase (explode) or decrease (vanish).

• Hard to initialize weights exactly right.


Revisiting the Bowl

[Figure: contour plots of (a) circular bowl L = x^2 + y^2 and (b) elliptical bowl L = x^2 + 4y^2]

• A varying scale of different parameters will cause bouncing.

• A varying scale of features causes a varying scale of parameters.


Input Shift

• One can view the input to each layer as a shifting data set of hidden activations during training.

• A shifting input causes problems during learning.

– Convergence becomes slower.

– The final result may not generalize well because of unstable inputs.

• Batch normalization ensures (somewhat) more stable inputs to each layer.
Solution: Batch Normalization

[Figure: (a) post-activation normalization adds a BN node after the combined Σ/Φ unit; (b) pre-activation normalization breaks the unit up and applies BN to the pre-activation value v_i, producing a_i before the activation]

• Add an additional layer that normalizes in batch-wise fashion.

• Additional learnable parameters ensure that the optimal level of nonlinearity is used.

• Pre-activation normalization is more common than post-activation normalization.
Batch Normalization Node

• The ith unit contains two parameters βi and γi that need to be learned.

• Normalize over a batch of m instances for the ith unit (a sketch of the forward computation follows):

$$\mu_i = \frac{\sum_{r=1}^{m} v_i^{(r)}}{m} \quad \forall i \qquad \text{[Batch Mean]}$$

$$\sigma_i^2 = \frac{\sum_{r=1}^{m} (v_i^{(r)} - \mu_i)^2}{m} + \epsilon \quad \forall i \qquad \text{[Batch Variance]}$$

$$\hat{v}_i^{(r)} = \frac{v_i^{(r)} - \mu_i}{\sigma_i} \quad \forall i, r \qquad \text{[Normalize Batch Instances]}$$

$$a_i^{(r)} = \gamma_i \cdot \hat{v}_i^{(r)} + \beta_i \quad \forall i, r \qquad \text{[Scale with Learnable Parameters]}$$

• Why do we need βi and γi?

– Most activations would otherwise be near zero (near-linear regime).
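
The four equations above in vectorized form (sketch; NumPy assumed; V is an (m, d) batch of pre-activation values and gamma, beta are length-d learnable vectors):

    import numpy as np

    def batchnorm_forward(V, gamma, beta, eps=1e-5):
        mu = V.mean(axis=0)                 # [Batch Mean]
        var = V.var(axis=0) + eps           # [Batch Variance]
        V_hat = (V - mu) / np.sqrt(var)     # [Normalize Batch Instances]
        return gamma * V_hat + beta         # [Scale with Learnable Parameters]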


Changes to Backpropagation

• We need to backpropagate through the newly added layer of normalization nodes.

– The BN node can be treated like any other node.

• We want to optimize the parameters βi and γi.

– The gradients with respect to these parameters are computed during backpropagation.

• Detailed derivations are in the book.

Issues in Inference

• The transformation parameters μi and σi depend on the batch.

• How should one compute them during testing, when a single test instance is available?

• The values of μi and σi are computed up front using the entire population (of training data), and then treated as constants during testing time.

– One can also maintain exponentially weighted averages during training.

• The normalization is a simple linear transformation during inference.
Batch Normalization as Regularizer

• Batch normalization also acts as a regularizer.

• The same data point can cause somewhat different updates depending on which batch it is included in.

• One can view this effect as a kind of noise added to the update process.

• Regularization can be shown to be equivalent to adding a small amount of noise to the training data.

• The regularization is relatively mild.