Neural Networks, Backpropagation and Deep Learning

CS 410/510: CV & DL
Outline
• Historical Notes
• UAT
• Gradient Descent
• Backpropagation & Computational Graphs
• Automatic Differentiation
• Activations, Weight Initialization
• Deep Learning Challenges
• Data Augmentation, Feature Pre-Processing
• SGD, Momentum, AdaGrad, Adam, Second-Order Methods
• Backpropagation Derivation
Historical Notes
• Feedforward networks can be seen as efficient non-linear function approximators based on using
gradient descent to minimize the error in a function approximation.

•As such, the modern feedforward NN is the culmination of centuries of progress on the general
function approximation task.

• The chain rule underlying backprop was invented by Leibniz (1676), building naturally on foundations
also laid by Newton.

• Calculus and algebra have been used to solve optimization problems in closed form since their
inception, but gradient descent was not introduced as a technique for iteratively approximating the
solution to optimization problems until the 19th century (Cauchy, 1847).

(Pictured: Newton, Leibniz, Cauchy, Al-Khwarizmi, Galois)


Historical Notes
Neurons & the Brain
Hebb’s Postulate
McCulloch & Pitts Neuron Model (1943)

Three components:
(1) Set of weighted inputs {wi} that correspond to synapses
(2) An “adder” that sums the input signals (equivalent to the membrane of the cell, which collects
the electrical charge)
(3) An activation function (initially a threshold function) that decides whether the neuron
fires (“spikes”) for the current inputs.
McCulloch & Pitts Neuron Model (1943)
Limitations & Deviations of the M-P Neuron Model:

• Summing is linear.
• No explicit model of “spike trains” (sequence of pulses that encodes
information in biological neuron).
• Threshold value is usually fixed.
• Sequential updating is implicit (biological neurons usually update themselves
asynchronously).
• Weights can switch between positive (excitatory) and negative (inhibitory); biological
synapses do not change sign in this way.
• Real neurons can have synapses that link back to themselves (e.g. feedback
loop) – see RNNs (recurrent neural networks).
• Other biological aspects ignored: chemical concentrations, refractory
periods, etc.
Historical Notes
• Beginning in the 1940s, these function approximation techniques were used to motivate ML models
such as the perceptron. However, the earliest such models were linear models.

• In the 1960s Rosenblatt proved that the perceptron learning rule converges to correct weights
in a finite number of steps, provided the training examples are linearly separable.

• Critics, including Marvin Minsky, pointed out several flaws of the linear model family, such as its
inability to learn the XOR function, which led to a backlash against the entire NN approach.

• Learning non-linear functions required the development of a MLP (multi-layer perceptron) and a
means of computing the gradient through such a model. Efficient applications of the chain rule based
on DP (dynamic programming) began to appear in the 1960s and 1970s.

(Pictured: Rosenblatt, Minsky)

Historical Notes
1969: Minsky and Papert proved that perceptrons cannot represent non-linearly
separable target functions.

• However, they showed that adding a fully connected hidden layer makes the
network more powerful.
– i.e., Multi-layer neural networks can represent non-linear decision surfaces.
• Later it was shown that by using continuous activation functions (rather than
thresholds), a fully connected network with a single hidden layer can in principle
represent any function.
• 1986: “rediscovery” of backprop algorithm: Hinton et al.

• The Universal Approximation Theorem (1989) states that one hidden layer is
sufficient to approximate any function to arbitrary accuracy with a NN. (we say:
“NNs are universal function approximators”); RNNs are Turing Complete.
Universal Approximation Properties
• A linear model, mapping from features to outputs via matrix multiplication,
can by definition represent only linear functions. It has the advantage of being
easy to train because many loss functions result in convex optimization
problems when applied to linear models.

• The universal approximation theorem (UAT) states that a feedforward


network with a linear output layer and at least one hidden layer with any
“squashing” activation function can approximate any Borel measurable (e.g. a
continuous function on a closed and bounded subset of Rn) function from one
finite-dimensional space to another with any desired non-zero amount of
error, provided the network is given enough hidden units.

• The UAT states that regardless of what function we are trying to learn, we
know that a sufficiently large MLP will be able to represent this function. We
are not guaranteed, however, that the training algorithm will be able to learn
the function.
Universal Approximation Properties
• Cybenko (1989) proved the UAT for sigmoid activations.

• Hornik (1991) proved that it is the feedforward architecture itself that gives rise to the universal
approximation property -- not the specific choice of activation (so long as the activation is
non-linear).
• The classical UAT relates to depth-bounded networks (e.g., depth 2). Lu et al.
(2017) proved a UAT for width-bounded NNs (width n+4 with ReLU, where n is
the input dimension).
Universal Approximation Properties
• Even if the MLP is able to represent the function, learning can fail for two different
reasons:

(1) The optimization algorithm used for training may not be able to find the value of the
parameters that corresponds to the desired function.

(2) The training algorithm might choose the wrong function as a result of overfitting.
Universal Approximation Properties
• Feedforward networks provide a universal system for representing functions in the
sense that, given a function, there exists a feedforward network that approximates the
function; however, there is no universal procedure for examining a training set of specific
examples and choosing a function that will generalize to points not in the training set.

*Note also that the theorem does not prescribe the size of the network (though some bounds can
be estimated); unfortunately, in the worst case, an exponential number of hidden units
may be required.

*Recall that any time we choose a specific ML algorithm, we are implicitly imposing some
set of prior beliefs we have about what kind of function the algorithm should learn (this is
the so-called inductive bias of the learning algorithm); choosing a deep model generally
indicates that we want to learn a composition of several simpler functions.
Historical Notes
• The “rediscovery” of the backpropagation algorithm (Rumelhart & Hinton) ushered in a very active
period of research for MLPs. In particular, “connectionism” took root in the ML community, which
placed emphasis on connections between neurons as the locus of learning and memory (cf.
distributed representation: each concept is represented by many neurons, and each neuron participates
in the representation of many concepts).

http://www.cs.toronto.edu/~bonner/courses/2014s/csc321/lectures/lec5.pdf

http://www.jneurosci.org/content/35/13/5180
Historical Notes
• Following the success of backprop, NN research gained popularity and reached an (initial) apex in the
early 1990s. Afterwards, other ML techniques became more popular until the modern deep learning
renaissance that began in 2006.

• The core ideas behind modern feedforward nets have not changed substantially since the 1980s. The
same backprop algorithm and the same approaches to gradient descent are still in use. Most of the
improvement in NN performance from 1986-2018 can be attributed to two factors:

(1) Larger datasets have reduced the degree to which statistical generalization is a challenge for NNs.

(2) NNs have become much larger because of more powerful computers (including the use of GPUs)
and better software infrastructure.
Historical Notes
• Nevertheless, a number of algorithmic changes have also contributed to subsequent
improvements in the performance of NNs.

• One of these algorithmic changes was the replacement of mean squared error (MSE) with the
cross-entropy family of loss functions. MSE was popular in the 1980s and 1990s but was
gradually replaced by cross-entropy losses and the principles of MLE as ideas spread between the
statistics community and ML community.

• The use of cross-entropy losses greatly improved the performance of models with sigmoid and
softmax outputs, which had previously suffered from saturation and slow learning when using
MSE.
Historical Notes
• The other major algorithmic change that has greatly improved the performance of
feedforward networks was the replacement of sigmoid hidden units with piecewise linear
hidden units, such as rectified linear units (RELUs). Rectification using the max{0,z} function
was introduced in early NN models.

• As of the early 2000s, rectified linear units were avoided due to the belief that activation
functions with non-differentiable points must be avoided.

• For small datasets, Jarrett et al. (2009) observed that using rectifying non-linearities is even
more important than learning the weights of the hidden layers. Random weights are sufficient
to propagate useful information through a rectified linear network, enabling the classifier layer
at the top to learn how to map different feature vectors to class identities.
Historical Notes
• RELUs are also of historical interest because they show that neuroscience has continued to have
an influence on the development of deep learning algorithms. Glorot et al. (2011) motivated
RELUs from biological considerations. The half-rectifying non-linearity was intended to capture
these properties of biological neurons:

(1) For some inputs, biological neurons are completely inactive.

(2) For some inputs, a biological neuron’s output is proportional to its inputs.

(3) Most of the time, biological neurons operate in the regime where they are inactive (i.e. they
should have sparse activations).
Neurons & the Brain
– The human brain contains ~10^11 neurons
– Each individual neuron connects to ~10^4 neurons
– ~10^14 total synapses!
Historical Notes
A “two”-layer neural network: the inputs (whose activations represent the feature vector for one
training example) feed a hidden layer (an internal representation), which feeds the output layer
(whose activations represent the classification).

• Input layer — contains the units (artificial neurons) that receive input from the outside world: the
data the network will learn about, recognize, or otherwise process.
• Output layer — contains the units that report the network’s response to the task it has learned.
• Hidden layer — these units sit between the input and output layers; the job of a hidden layer is to
transform the input into something the output units can use.

Most neural networks are fully connected, meaning that each hidden neuron is connected to every
neuron in the previous (input) layer and in the next (output) layer.
A Neural Network “Zoo”
Neural network notation
x_i : activation of input node i.

h_j : activation of hidden node j.

o_k : activation of output node k.

w_ji : weight from node i to node j.

σ : “sigmoid function”.

For each node j in the hidden layer:

h_j = σ( Σ_{i ∈ input layer} w_ji x_i + w_j0 )

For each node k in the output layer:

o_k = σ( Σ_{j ∈ hidden layer} w_kj h_j + w_k0 )
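(*) To make the notation concrete, here is a minimal NumPy sketch of this forward pass for a single hidden layer with sigmoid activations; the layer sizes and random weight values are arbitrary illustrations, not taken from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 2          # arbitrary example sizes
W1 = rng.normal(size=(n_hidden, n_in))   # w_ji : input -> hidden weights
b1 = np.zeros(n_hidden)                  # w_j0 : hidden biases
W2 = rng.normal(size=(n_out, n_hidden))  # w_kj : hidden -> output weights
b2 = np.zeros(n_out)                     # w_k0 : output biases

x = rng.normal(size=n_in)                # one training example (feature vector)
h = sigmoid(W1 @ x + b1)                 # h_j = sigma( sum_i w_ji x_i + w_j0 )
o = sigmoid(W2 @ h + b2)                 # o_k = sigma( sum_j w_kj h_j + w_k0 )
print(h, o)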
Gradient Descent

(*) Backpropagation is one particular instance of a larger paradigm of optimization
algorithms known as Gradient Descent (also called “hill climbing”).

(*) There exists a large array of nuanced methodologies for efficiently training NNs
(particularly DNNs), including the use of regularization, momentum, dropout,
batch normalization, pre-training regimes, initialization processes, etc.

(*) Traditionally, backpropagation with plain (stochastic) gradient descent has been used to efficiently
train a NN; more recently the Adam stochastic optimization method (2014) has eclipsed plain gradient
descent in practice (backpropagation is still used to compute the gradients): https://arxiv.org/abs/1412.6980
DNNs Learn Hierarchical Feature Representations
Backpropagation
• Backpropagation is the engine behind most (but not all) deep learning training algorithms.
Backpropagation consists of two alternating steps:

(1) Forward step: Propagate the input vector through the network (this consists primarily of
dot product operations followed by non-linear activation operations).

(2) Backward step: Using the output computed in step (1), the “error” (according to some
prescribed loss function) is propagated backward through the network. The backward
step assigns an attribution value to the edges in the network based on the loss
calculated.

(*) For an alternative to backpropagation methods, see, for example, the ELM (“extreme learning
machines”) methodologies (which are considered controversial in mainstream ML):
https://www.researchgate.net/publication/264273594_Extreme_learning_machines
Backpropagation: Computational Graphs
• A neural network (NN) can be modeled as a computational graph, in which the unit of
computation is the neuron.

• NNs are fundamentally more powerful than their building blocks because the parameters of
these models are learned jointly to create a highly optimized composition function of these
models. In addition, the non-linear activations between the different layers enhance the
expressive power of the network.
Backpropagation: Computational Graphs
• A neural network (NN) is a computational graph, in which a unit of computation is the neuron.

• A multi-layer NN evaluates compositions of functions computed at individual nodes. For instance, a


path of length 2 in the NN in which the activation function g(∙) follows a basic affine transformation (i.e.,
matrix multiplication plus a “bias” shift) results in the composition:

• Weight updates are traditionally computed using gradient descent (or a related variant), in which case
one applies the chain rule of differential calculus with respect to the various function compositions
defined across the layers of the network.
Computational Graph: Example
• Next, we consider a simple example of learning the XOR function in 2D.

• (Right image, left side) Every unit in the computational graph is
shown; (right image, right side) more compactly, each node
represents a layer.

• (Left) XOR represented in original space (notice the data are not linearly separable);
(Right) By introducing non-linearity, the data are linearly separable in the learned space.
Computational Graph: Example
• Denote the ith element activation: h_i = g(xᵀW_{:,i} + c_i), where g is our activation function –
here we’ll use the standard RELU depicted below, defined: g(z) = max{0, z}

• Notice that the complete (mathematical) specification of our network is given as:

f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b

• Let: W = [[1, 1], [1, 1]], c = [0, −1]ᵀ, w = [1, −2]ᵀ, b = 0, and X = [[0, 0], [0, 1], [1, 0], [1, 1]]
(the XOR inputs, one per row).

XW = [[0, 0], [1, 1], [1, 1], [2, 2]] ⟶ XW + c = [[0, −1], [1, 0], [1, 0], [2, 1]]
⟶ max{0, XW + c} = [[0, 0], [1, 0], [1, 0], [2, 1]]

⟶ max{0, XW + c} w + b = [0, 1, 1, 0]ᵀ, the predicted output over the XOR dataset.
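(*) The same computation can be checked numerically; the sketch below simply evaluates f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b with the parameter values above over the four XOR inputs.

import numpy as np

W = np.array([[1., 1.], [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.0
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # XOR inputs, one per row

H = np.maximum(0.0, X @ W + c)     # hidden representation, max{0, XW + c}
y = H @ w + b                      # w^T max{0, W^T x + c} + b for each row
print(y)                           # -> [0. 1. 1. 0.], the XOR outputs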
Computational Graph: Example
• Some example computation graphs:
(a) z = xy
(b) y = σ(xᵀw + b) (logistic regression)
(c) H = max{0, XW + b}
(d) a linear regression model with regularization (an L2 weight-decay penalty λ Σ_i w_i²), i.e., ŷ = xᵀw.
Backpropagation
• Recall the Chain Rule of Calculus:

Let z = f(g(x)) = f(y); then the Chain Rule states:

dz/dx = (dz/dy)(dy/dx)

• For functions of several variables, we introduce the analogue of the derivative,


termed the partial derivative. Partial derivatives entail computing the derivative of a
multivariate function wrt (“with respect to”) a single variable, while treating all other
variables as constants. Let 𝑓(𝑥, 𝑦, 𝑧, … ); partial derivatives are commonly denoted:

∂f/∂x, or equivalently f_x or D_x f; recall that in general f_xy = f_yx, etc.
Backpropagation
• We can thus generalize the chain rule to vector-valued functions. Suppose that x ∈ ℝ^m,
y ∈ ℝ^n; if y = g(x) and z = f(y) = f(g(x)), then:

∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i),  or equivalently in vector notation:  ∇_x z = (∂y/∂x)ᵀ ∇_y z,

where ∂y/∂x is the Jacobian of g.

• Consider the following example:

Let y = g(x) = g(x1, x2) = ( x1, x1 sin(x2) ),  f(y) = f(y1, y2) = y1 + 2 y2, so that

z = f(g(x)) = x1 + 2 x1 sin(x2)

∂z/∂x1 = Σ_j (∂z/∂y_j)(∂y_j/∂x1) = (∂z/∂y1)(∂y1/∂x1) + (∂z/∂y2)(∂y2/∂x1) = 1·1 + 2·sin(x2)

∂z/∂x2 = Σ_j (∂z/∂y_j)(∂y_j/∂x2) = (∂z/∂y1)(∂y1/∂x2) + (∂z/∂y2)(∂y2/∂x2) = 1·0 + 2·x1 cos(x2)

Equivalently, using the (transposed) Jacobian of g:

∇_x z = (∂y/∂x)ᵀ ∇_y z = [ 1, sin(x2) ; 0, x1 cos(x2) ] [ 1 ; 2 ] = [ 1 + 2 sin(x2) ; 2 x1 cos(x2) ]
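(*) A quick numerical sanity check of this example (the test point is arbitrary): the analytic gradient [1 + 2 sin(x2), 2 x1 cos(x2)] should agree with centered finite differences.

import numpy as np

def z(x1, x2):
    return x1 + 2.0 * x1 * np.sin(x2)          # z = f(g(x))

def grad_analytic(x1, x2):
    return np.array([1.0 + 2.0 * np.sin(x2),   # dz/dx1
                     2.0 * x1 * np.cos(x2)])   # dz/dx2

x1, x2, eps = 0.7, 1.3, 1e-6
grad_numeric = np.array([
    (z(x1 + eps, x2) - z(x1 - eps, x2)) / (2 * eps),
    (z(x1, x2 + eps) - z(x1, x2 - eps)) / (2 * eps),
])
print(grad_analytic(x1, x2), grad_numeric)     # the two should match closely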
Backpropagation
• The main computational challenge for backpropagation relates to the multivariate
chain rule. (see pathwise aggregation lemma, next slides)
Computational Graphs & Backpropagation
• Here is an example schematic of symbol-to-symbol computation of derivatives from a
computation graph.

• Another example schematic of the computational graph used to train a single-layer NN


using cross-entropy loss and weight decay:
Computational Graphs & Backpropagation
• Notice that the calculation of the chain rule along a path in a computational graph typically
admits of many redundancies. Let x = f(w), y = f(x), z = f(y):

• Notice that a naive calculation of ∂z/∂w requires that we compute the value f(w) many times.
Naturally, a more efficient approach is to simply compute this value only once and store it in
order to avoid these redundant calculations. This is the key idea behind applying dynamic
programming to backpropagation.
Backpropagation
• Pathwise Aggregation Lemma: Consider a directed acyclic computational graph
(DAG) in which the ith node contains variable y(i). The local derivative z(i,j) of the
directed edge (i,j) in the graph is defined as z(i,j) = ∂y(j)/∂y(i). Let a non-null set of paths P exist
from variable w in the graph to the output node containing variable o. Then the value of
∂o/∂w is given by computing the product of the local gradients along each path in P, and
summing these products over all paths:

∂o/∂w = Σ_{p ∈ P} Π_{(i,j) ∈ p} z(i, j)
Backpropagation: Dynamic Programming
• Although the summation discussed previously has an exponential number of paths, one can
nonetheless compute it efficiently using dynamic programming.

• We want to compute the product of z(i,j) over each path p ∈ P from source node w to output o,
and then add these products over all paths:

S(w, o) = Σ_{p ∈ P} Π_{(i,j) ∈ p} z(i, j)
(*) In practice, dynamic programming allows backpropagation to avoid the redundant calculations
that would be required to enumerate all paths explicitly.
Backpropagation: Dynamic Programming
• Pathwise Aggregation – in this example, explicit computation of the partial derivative
of the output (o) wrt the input (w) requires “pathwise aggregation” over all 2^5 = 32 paths
in the network!

∂o/∂w = Σ_{p ∈ P} Π_{(i,j) ∈ p} z(i, j)
Backpropagation: Dynamic Programming
• Notice that the given network represents a DAG (so it admits of a topological
ordering), so we can apply dynamic programming (DP) to generate an efficient solution
for the calculation of ∂o/∂w. Recall that z(i, j) = ∂y(j)/∂y(i).

• We will use a common DP methodology – compute S(i, o) for all nodes i in the
graph, beginning with i := o (so we traverse right to left); see the formula below for
the general calculation of S(i, o). Note that A(i) symbolizes the set of nodes at the
endpoints of the outgoing edges of each intermediate node i. Let S(11, 11) = 1 by
default.
Backpropagation: Dynamic Programming
• We compute S(i, o) for all nodes i in the graph, beginning with i := o (traversing right to
left); A(i) denotes the set of nodes at the endpoints of the outgoing edges of each intermediate
node i. You should see that S(11, 11) = 1.

• Next, we compute S(9, 11) = S(11, 11) · z(9, 11) = 1 · ∂y(11)/∂y(9) = ∂w^32/∂w^16 = 2w^16

• Similarly, S(10, 11) = S(11, 11) · z(10, 11) = 1 · ∂y(11)/∂y(10) = ∂w^32/∂w^16 = 2w^16

• and S(7, 11) = S(9, 11) · z(7, 9) + S(10, 11) · z(7, 10) = 2w^16 · ∂w^16/∂w^8 + 2w^16 · ∂w^16/∂w^8
= 2w^16 · 2w^8 + 2w^16 · 2w^8 = 4w^24 + 4w^24 = 8w^24.

• You should verify that iterating this DP strategy to completion yields ∂o/∂w = 32w^31;
note that this method avoids the exponential path-aggregation calculations, as was to be
shown.
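(*) The recursion S(i, o) = Σ_{j ∈ A(i)} S(j, o) · z(i, j) is easy to code once the local derivatives are known. The sketch below applies it to a small hypothetical chain graph (w → y1 = w², y2 = y1², o = y2², so o = w^8), not the 11-node graph from the slide, and checks it against the analytic derivative.

# Reverse-sweep dynamic programming over local derivatives z(i, j) = dy(j)/dy(i).
w = 1.5
value = {"w": w, "y1": w**2, "y2": w**4, "o": w**8}            # node values
z = {("w", "y1"): 2 * value["w"],                              # local edge derivatives
     ("y1", "y2"): 2 * value["y1"],
     ("y2", "o"): 2 * value["y2"]}
children = {"w": ["y1"], "y1": ["y2"], "y2": ["o"], "o": []}   # A(i): outgoing-edge endpoints

S = {"o": 1.0}                                  # S(o, o) = 1
for node in ["y2", "y1", "w"]:                  # reverse topological order
    S[node] = sum(S[j] * z[(node, j)] for j in children[node])

print(S["w"], 8 * w**7)                         # S(w, o) equals d(w^8)/dw = 8 w^7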
Automatic Differentiation
• Automatic Differentiation (AD) is a set of techniques to numerically evaluate the derivative of a
function. Many contemporary ML and DL libraries (e.g., Pytorch, TensorFlow) include AD
capabilities.

• Different from symbolic differentiation (i.e., directly manipulating mathematical expressions) and
numerical differentiation (e.g., an iterative algorithm to estimate the derivative of a function), AD
augments the program variables to carry derivative values that are combined per the chain rule.

• AD computes derivatives through the accumulation of values during code execution to generate
numerical derivative evaluations (rather than derivative expressions).

Baydin et al., “Automatic Differentiation: A Survey” (JMLR 2018)


Automatic Differentiation
• AD is a deep topic; for brevity, we note that at a high level AD uses the chain rule to compute the
accumulation of derivatives. This is done via two fundamental processes: (1) forward accumulation
(i.e., AD for the forward pass through a NN) and (2) reverse accumulation (used for backprop).

• Consider the following example (https://sidsite.com/posts/autodiff/ ), where we wish to compute
∂d/∂a. Using the product rule we have:

• Note that if we wish to compute ∂d/∂b, a similarly tedious process is required.
Automatic Differentiation
• Let’s now contrast the computation of ∂d/∂a using AD.

• The left image denotes the computational graph for this problem. On the right, we see the AD
methodology. Consider the derivatives appearing on the edges as local derivatives.

• The basic idea through reverse accumulation is to begin at the output node of the computational graph,
and then consider each path in the computational graph from the output node to the input nodes.

• We follow two simple rules: (1) add together different path accumulations and (2) multiply local
derivatives along each path.
(Figure: AD execution vs. conventional Chain Rule execution.)
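(*) A toy illustration of reverse accumulation (this is not the PyTorch/TensorFlow API, just a self-contained sketch; the expression d = a·(a + b) is an assumed example, not the one from the reference). Each variable stores its parents together with the local derivative on the connecting edge; backward() multiplies local derivatives along each path from the output and adds the contributions of different paths, exactly the two rules above.

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents        # tuples of (parent_var, local_derivative)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=((self, other.value), (other, self.value)))

    def __add__(self, other):
        return Var(self.value + other.value, parents=((self, 1.0), (other, 1.0)))

    def backward(self, upstream=1.0):
        # accumulate this path's contribution, then continue toward the inputs
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

a, b = Var(3.0), Var(2.0)
d = a * (a + b)                        # d = a^2 + a*b
d.backward()
print(a.grad, 2 * a.value + b.value)   # dd/da = 2a + b = 8.0
print(b.grad, a.value)                 # dd/db = a = 3.0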


Architectural Considerations
• Vanilla NNs consist of chains of feed-forward layers, with the main considerations being the depth
of the network and width of each layer.

• In practice, though, NN architectures can be very diverse (see the NN “Zoo”, shown previously).
Ultimately, NN design should be intentional and developed with consideration for the specific task at
hand.

• Special architectures for computer vision called convolutional neural networks (CNNs) are described
later in our course. FF networks can be generalized to the recurrent neural networks (RNNs) for
sequence processing, which have their own architectural considerations.
Architectural Considerations
• Observe that layers need not be connected in a sequential chain; many architectures make use of
skip connections and residual layers to benefit gradient flow in the network.

• One can additionally vary the connectivity strategy of the network. Many specialized networks
have fewer connections (than dense networks) or they admit of some other form of
compression. Note that CNNs utilize parameter sharing to this end.

• Recently, Neural Architecture Search (NAS) has emerged as a new paradigm for automating the
design of DNNs.
Architectural Considerations
• DNNs frequently embody large and unwieldy, overparameterized models. Thus, recent
research has focused on transforming DNNs into more sustainable network designs.

• This effort is catalyzed by several factors, including: the desire to conserve memory and
compute overhead for the deployment of commercial DL models, energy sustainability, the
need for greater model interpretability, and the aspiration to port DL models to low compute
environments, including edge and IOT devices.

• Today there exist a large variety of DL model compression techniques, due to the desirability
of compact models with state-of-the-art functionality.

• Roughly, these techniques fall into several generic categories, comprising pruning,
quantization, low-rank and sparse approximations, and knowledge distillation.
Weight Initialization
• Training algorithms for DNN models are usually iterative and thus require the user
to specify some initial point from which to begin the iterations. Moreover, training
deep models is a sufficiently difficult task that most algorithms are strongly affected
by the choice of initialization.

• The initial point can determine whether the algorithm converges at all, with
some initial points being so unstable that the algorithm encounters numerical
difficulties and fails altogether. When learning does converge, the initial point can
determine how quickly learning converges and whether it converges to a point
with high or low cost.
Weight Initialization
• Modern initialization strategies are usually simple and heuristic; designing improved
initialization strategies is a difficult task because NN optimization is not yet well
understood.

• The most general guideline agreed upon by most practitioners is known as


“symmetry-breaking.” If two hidden units with the same activation function are
connected to the same inputs, then these units must have different initial parameters. If the
training is deterministic, “symmetric” units will update identically (and hence be
useless); even if the training is stochastic, it is usually best to initialize each unit to
compute a different function from all the other units.

• Note that the scale of the initial distribution does have a large effect on both the
outcome of the optimization procedure and the ability of the network to generalize.
Weight Initialization
• Larger initial weights will yield a strong symmetry-breaking effect, helping to avoid
redundant units; in addition, they will also potentially help avoid the problem of
vanishing gradients. Nevertheless, they may conversely exacerbate the exploding
gradient problem; in RNNs, large initial weights can manifest chaotic behavior.

* Sparse initialization (Martens, 2010) fixes the number of non-zero weights for
initialization; Xavier initialization draws random initial values from a distribution with
zero mean and variance inversely proportional to the size of the previous layer in the
network.
Weight Initialization
• Another related approach is to initialize the weights by generating random values from a
Gaussian distribution with zero mean and small standard deviation (e.g., 10^-2). This will
result in small random values that are both positive and negative.

• One problem with this initialization is that it is not sensitive to the number of
inputs to a specific neuron. For example, if one neuron has only 2 inputs and
another has 100 inputs, the output of the latter is far more sensitive to the average
weight because of the additive effect of more inputs (which will manifest itself through
a much larger gradient).

https://www.deeplearning.ai/ai-notes/initialization/index.html
Weight Initialization
(*) It can be shown that the variance of the outputs scales with the number of inputs, and
therefore the standard deviation scales with the square root of the number of inputs.

• To balance this fact, each weight can be initialized by a value drawn from N(0, 1/r),
where r indicates the number of inputs to that neuron.

• Xavier initialization is somewhat more sophisticated: initial weights are
drawn from N(0, 2/(r_in + r_out)), where r_in and r_out are the fan-in and fan-out values of a
particular neuron, respectively.

https://www.deeplearning.ai/ai-notes/initialization/index.html
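(*) A short sketch of these initialization rules (the layer sizes below are arbitrary): weights drawn from N(0, 1/r) and from the Xavier recipe N(0, 2/(r_in + r_out)), compared with a fixed small standard deviation.

import numpy as np

rng = np.random.default_rng(0)
r_in, r_out = 256, 128                       # fan-in / fan-out of an example layer

# naive: fixed small standard deviation, ignores the fan-in
W_naive = rng.normal(0.0, 1e-2, size=(r_out, r_in))

# scale by 1/sqrt(r): variance 1/r_in
W_scaled = rng.normal(0.0, np.sqrt(1.0 / r_in), size=(r_out, r_in))

# Xavier: variance 2/(r_in + r_out)
W_xavier = rng.normal(0.0, np.sqrt(2.0 / (r_in + r_out)), size=(r_out, r_in))

# the spread of W @ x stays of order one for the fan-in-aware initializations
x = rng.normal(size=r_in)
print((W_naive @ x).std(), (W_scaled @ x).std(), (W_xavier @ x).std())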
Challenges for DNN Optimization
• Traditionally, ML implementations avoid the difficulty of general optimization by carefully
designing the objective function and constraints to ensure that the optimization problem is
convex.

• When training NNs, however, we must confront the general non-convex case.

(Figure: a convex function vs. a non-convex function.)


Challenges for DNN Optimization: Local Minima
• For a convex function, any local minimum is guaranteed to be a global minimum.

• With non-convex functions, such as the loss functions of NNs, it is possible to have
many local minima. Moreover, nearly any DNN is essentially guaranteed to have a very
large number of local minima (even uncountably many).

• One of the chief reasons for the presence of many local minima in NNs is
the problem of model identifiability. A model is said to be identifiable if a sufficiently
large training set can rule out all but one setting of the model’s parameters.
Challenges for DNN Optimization: Local Minima
• Models with latent variables (e.g. hidden neurons) are not in general identifiable
because we can obtain equivalent models by exchanging latent variables with one
another.
• Local minima are problematic if they correspond with high cost (vis-à-vis the global
minimum). *Note that local minima are typically less problematic for DNN training
than saddle points (this concept is not always well-appreciated by ML
practitioners).
Challenges for DNN Optimization: Plateaus, Saddle Points
• For many high-dimensional, non-convex functions, local minima (and maxima) are in
fact rare compared to saddle points.

• Some points around a saddle point have greater cost than the saddle point, while
others have lower cost. At a saddle point, the Hessian matrix has both positive and
negative eigenvalues.

• Why are saddle points more common than local extrema in high dimensions? The
basic intuition is this: in order to render a local extreme value, all of the eigenvalues of the
Hessian must be of the same sign (naturally, this is very unlikely – all else being equal – in
high dimensions).
Challenges for DNN Optimization: Plateaus, Saddle Points
• For first-order optimization, saddle points are not necessarily a significant problem
(Goodfellow); however, for second-order methods, they clearly constitute a problem.

• Degenerate locations such as plateaus can pose major problems for all
numerical algorithms.
Challenges for DNN Optimization: Cliffs, Exploding and Vanishing Gradients
• NNs with many layers often have extremely steep regions resembling cliffs in the parameter space. This is
due to the multiplication of several large weights together. On the face of an extremely steep cliff structure,
the gradient update step can alter the parameters drastically.

• Gradient clipping, a heuristic technique, can help avoid this issue. When the traditional gradient descent
algorithm proposes making a large step, the gradient clipping heuristic intervenes to reduce the step size,
thereby making it less likely to go outside the region where the gradient indicates the direction of
approximately steepest descent.

• When the computational graph for a NN becomes very large (e.g., in RNNs), the issue of
exploding/vanishing gradients can arise. Vanishing gradients make it difficult to know which direction
the parameters should move to improve the cost function, while exploding gradients can make learning
unstable.

*LSTMs, RELU, and ResNet (Microsoft) have been applied to solve the vanishing gradient problem.
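(*) A common way to implement the gradient-clipping heuristic is to rescale the whole gradient when its norm exceeds a threshold; a minimal sketch (the threshold value below is arbitrary):

import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so its L2 norm never exceeds max_norm (direction preserved)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])           # an "exploding" gradient near a cliff
print(clip_by_norm(g, max_norm=5.0))  # step size limited -> [3., -4.]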
Challenges for DNN Optimization: Hill-Climbing
• Potentially compounding this problem, many activation functions have small derivatives.

• The specific choice of activation function often has a considerable effect on the severity of the vanishing
gradient problem.

• In recent years, the sigmoid and tanh activation functions have been increasingly supplanted by the ReLU
and the hard tanh functions (see subsequent slides on variants of the ReLU activation).

(*) NB: The computational requirements to generate the derivative of a piecewise linear function are naturally
significantly less than those required for a transcendental function (e.g., e^x) – and a sigmoid is defined as the
composition of a transcendental function.
Cross-Entropy Loss
• As mentioned, cross-entropy loss is generally preferred to MSE, particularly for classification
problems with DNNs.

Cross-entropy loss is defined:

E = − Σ_i [ c_i log(p_i) + (1 − c_i) log(1 − p_i) ]

where c refers to the one-hot encoded classes (or labels), and p refers to the predicted (softmax/sigmoid) probabilities.

Two properties make cross-entropy a natural loss function:

(1) E ≥ 0: all the individual terms inside the sum are non-positive, and there is a minus sign outside.

(2) If the neuron's actual output is close to the desired output for all training inputs, x, then the cross-
entropy will be close to zero. To demonstrate this, we assume (WLOG) that the desired outputs c are
all either 0 or 1. Suppose for example that c = 0 and p ≈ 0, for some input x (so the neuron has done
well on this input). The first term in E vanishes, while the second term is close to zero; a similar
analysis holds when c = 1 and p ≈ 1.
Cross-Entropy Loss
• Cross-entropy loss is defined:

E = − Σ_i [ c_i log(p_i) + (1 − c_i) log(1 − p_i) ]

One can show, for example, that the partial derivative of the cross-entropy loss for a sigmoid neuron is:

∂E/∂w_j = Σ_x x_j ( σ(z) − y )

(σ denotes the sigmoid function.) This indicates that the gradient is larger (i.e., learning is faster) the
larger the error; in addition, the cross-entropy loss function does not in general “bottom out” like the
MSE loss.
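(*) A sketch of the binary cross-entropy loss and the gradient formula above for a single sigmoid neuron (random data, purely illustrative; the gradient here is averaged over the batch). Note that the gradient is simply x_j(σ(z) − y), with no σ′(z) factor to shrink it.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))               # 8 examples, 3 features
y = rng.integers(0, 2, size=8)            # binary labels
w, b = rng.normal(size=3), 0.0

p = sigmoid(X @ w + b)                                    # predicted probabilities
E = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))     # cross-entropy loss
grad_w = X.T @ (p - y) / len(y)                           # dE/dw_j = mean_x x_j (sigma(z) - y)
print(E, grad_w)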
RELU & Their Generalizations
• Rectified linear units use the activation function g(z) = max{0, z}.

• These units are easy to optimize because they are so similar to linear units; the only difference being
the RELU is zero across half of its domain. This makes the derivatives through a RELU remain large
whenever the unit is active.

• The gradients are therefore not only large but consistent.

RELUs are typically used on top of an affine transformation:

h = g(Wᵀx + b)

• One drawback of RELUs is that they cannot learn via gradient-based methods on examples for which
their activation is zero; various generalizations of the RELU guarantee that they receive a gradient everywhere.

*affine transformations preserve points, straight lines, planes, and parallelism.


RELU & Their Generalizations
Three generalizations of RELUs are based on using a non-zero slope α_i when z_i < 0:

h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i)

(1) Absolute value rectification fixes α_i = −1, to obtain g(z) = |z|; this method has been used for
object recognition from images, where it makes sense to seek features that are invariant under a polarity
reversal of the input illumination.

(2) Leaky RELU fixes αi to a small value like 0.01.

• Note that the gradient of a standard ReLU is zero for negative values of its argument. While this
inactivity is arguably biologically plausible -- since in real brains, neuron firing is often sporadic and
followed by refractory periods (see previous slides) -- it can nevertheless lead to undesirable,
pathological behavior for artificial NNs.

• In artificial NNs, zero outputs can cause some ReLU units to be “knocked out”, in which case they
can reach a state in which they are never further updated during training. Such a neuron can be
considered dead, which is a kind of permanent “brain damage” in biological parlance.

(*) The problem of dying neurons can be partially ameliorated by the leaky ReLU.
RELU & Their Generalizations

(3) Maxout units (Goodfellow, 2013); instead of applying an element-wise function g(z), maxout units
divide z into groups of k values. Each maxout unit then outputs the maximum element of one of
those groups.

This provides a way of learning a piecewise linear function that responds to multiple directions in the
input x space. Each maxout unit can learn a piecewise linear, convex function with up to k pieces;
maxout units can thus be seen as learning the activation function itself rather than just the relationship
between units; with a large enough k, a maxout unit can learn to approximate any convex function with
arbitrary fidelity.
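(*) The generalized rectifier h_i = max(0, z_i) + α_i min(0, z_i) covers all three cases; a small sketch (the α values below are the standard illustrative choices, and the maxout group size k is arbitrary):

import numpy as np

def generalized_relu(z, alpha):
    """h_i = max(0, z_i) + alpha_i * min(0, z_i)."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(generalized_relu(z, alpha=0.0))    # standard ReLU
print(generalized_relu(z, alpha=-1.0))   # absolute value rectification, |z|
print(generalized_relu(z, alpha=0.01))   # leaky ReLU

def maxout(z, k=2):
    """Maxout: split z into groups of k values and keep the max of each group."""
    return z.reshape(-1, k).max(axis=1)

print(maxout(np.array([0.3, -1.2, 2.0, 0.7])))   # -> [0.3, 2.0]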
“Swish” Activations
• In 2018 Google Brain introduced the “swish” activation function
(https://arxiv.org/pdf/1710.05941.pdf); swish is a smooth, non-monotonic function
matching or outperforming RELU in their experiments.

f(x) = x · σ(βx) = x / (1 + e^(−βx))

• Of note, swish activation introduces a non-monotonic “bump” for 𝑥 < 0 (the shape of this
bump is modulated by the parameter β), as this regularizes large initial negative parameter weights.

• The non-monotonicity increases the “expressivity” of the activation; the smoothness helps improve
network optimization efficiency by making the output landscape smoother and thus easier to traverse for
optimization.
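(*) A sketch of the swish activation f(x) = x·σ(βx); β = 1 is the common default, and larger β pushes it toward the ReLU.

import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))     # x * sigmoid(beta * x)

x = np.linspace(-4, 4, 9)
print(swish(x, beta=1.0))     # smooth, with a small negative "bump" for x < 0
print(swish(x, beta=10.0))    # close to ReLU(x) = max(0, x)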
Feature Preprocessing
• There are two general forms of feature preprocessing:

(1) Additive preprocessing and mean-centering. It can be useful to mean-center the data to remove
certain types of bias effects (recall that PCA does this); mean-centering is often paired with standardization.

(*) If it is desirable for all feature values to be non-negative (e.g. χ2 test for feature selection), then one can
simply add the absolute value of the maximum negative feature to the data set.
Feature Preprocessing
(1) Additive preprocessing and mean-centering. It can be useful to mean-center the data to remove
certain types of bias effects (recall that PCA does this); mean-centering is often paired with standardization.

(2) Feature normalization. Standardization is a default feature normalization technique:

x_i ← (x_i − μ) / σ

This treats each feature as (approximately) Gaussian and rescales it to a standard normal (i.e., N(0,1)).

(*) Another common form of feature normalization is min-max normalization:

x_i ← (x_i − min(x)) / (max(x) − min(x))

This data transformation maps the dataset to [0,1].


(*) In general, feature normalization often ensures better performance, as it safeguards against ill-
conditioning (where the loss function is more sensitive to some parameters vs. others).
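(*) Both transformations take a few lines of NumPy (the toy data are arbitrary; in practice the statistics are fit on the training set only and then reused for test data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))    # toy training matrix

# standardization: x <- (x - mu) / sigma, per feature
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

# min-max normalization: x <- (x - min) / (max - min), per feature
xmin, xmax = X.min(axis=0), X.max(axis=0)
X_minmax = (X - xmin) / (xmax - xmin)

print(X_std.mean(axis=0).round(3), X_std.std(axis=0).round(3))   # ~0 and ~1
print(X_minmax.min(axis=0), X_minmax.max(axis=0))                # 0 and 1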
Feature Preprocessing: Whitening
• Whitening is a linear data transformation that transforms a vector of random variables with known
covariance matrix into a set of new variables whose covariance is the identity matrix (i.e. this procedure
produces decorrelated variables with variances equal to 1). (this procedure is called “whitening” because it
changes the input vector into a white noise vector).

• Suppose X is a random column vector with non-singular covariance matrix M and mean equal to zero
(that is to say, assume the data has been mean-centered).

Then the transformation Y = WX, for a whitening matrix W satisfying WᵀW = M^(-1), yields the
whitened random vector Y with unit diagonal covariance matrix:

Cov(Y) = E[YYᵀ] = E[WX(WX)ᵀ] = E[WXXᵀWᵀ] = W E[XXᵀ] Wᵀ = W Cov(X) Wᵀ = I

(*) Note that the choice of the whitening matrix W is not unique. Common choices include W = M^(-1/2)
(Mahalanobis whitening), Cholesky decomposition-based whitening (with W obtained from the Cholesky
factorization of M^(-1)), and whitening based on the eigen-system of M (PCA whitening).
Feature Preprocessing: Whitening
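(*) A minimal NumPy whitening sketch using W = M^(-1/2) built from the eigen-system of the covariance matrix (the Mahalanobis choice above); the correlated data are synthetic and purely illustrative. After the transform, the sample covariance is approximately the identity.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))
X = rng.normal(size=(1000, 2)) @ A.T          # correlated data, rows are samples
X = X - X.mean(axis=0)                        # mean-center first

M = np.cov(X, rowvar=False)                   # covariance matrix
eigvals, eigvecs = np.linalg.eigh(M)
W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T    # W = M^(-1/2)

Y = X @ W.T                                   # y = W x for each row
print(np.cov(Y, rowvar=False).round(3))       # ~ identity matrix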
Data Augmentation
• The best way to make an ML model generalize better is to train it on more data. Of course, data are
limited/expensive.

• One way to get around this problem is to generate synthetic data and add it to the training set.

• This approach is easiest for classification. A classifier needs to take a complicated, high-dimensional
input x and summarize it with a single category identity y. This means that the main task facing a
classifier is to be invariant to a wide variety of transformations; we can generate new (x, y) pairs easily
by transforming the x inputs in our training set.

• Dataset augmentation has been particularly effective for object recognition; operations like
translating the training images a few pixels in each direction can often greatly improve generalization;
many operations such as rotating the image or scaling the image are also quite effective (one needs to
be careful that the transformation does not alter the correct image class).

• Injecting noise in the input to a NN can also be seen as a form of data augmentation; one way to
improve the robustness of a NN is to simply train them with random noise applied to their inputs.
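(*) A minimal sketch of such label-preserving transformations on a single image array (translation, horizontal flip, additive input noise); real pipelines typically use a library such as torchvision or albumentations, but the idea is the same.

import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))                         # toy image with values in [0, 1]

shifted = np.roll(img, shift=(2, -3), axis=(0, 1))    # translate a few pixels
flipped = img[:, ::-1, :]                             # horizontal flip
noisy = np.clip(img + rng.normal(0.0, 0.05, img.shape), 0.0, 1.0)  # injected input noise

augmented_batch = np.stack([img, shifted, flipped, noisy])   # label is unchanged
print(augmented_batch.shape)                                 # (4, 32, 32, 3)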
Early Stopping
• When training large models with sufficient representation capacity to overfit the task, we often
observe that training error decreases steadily over time, but validation set error begins to rise again.

• This means we can obtain a model with better validation set error (and hopefully better test error) by
returning to the parameter setting at the point in time with the lowest validation set error. Every time
the error on the validation set improves, we store a copy of the model parameters; when the training
terminates, we return these parameters, rather than the latest parameters.

* This strategy is known as early stopping; it is one of the most common forms of regularization
used in deep learning.
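(*) The strategy amounts to a few lines wrapped around the training loop; the sketch below assumes hypothetical train_one_epoch and validation_error functions and a model object with copyable parameters.

import copy

def train_with_early_stopping(model, max_epochs=200, patience=10):
    best_error, best_params, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                    # hypothetical training step
        val_error = validation_error(model)       # hypothetical validation metric
        if val_error < best_error:                # validation improved: snapshot parameters
            best_error, best_params = val_error, copy.deepcopy(model.parameters)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:     # no improvement for `patience` epochs
                break
    model.parameters = best_params                # return the best parameters, not the last
    return model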
Dropout

• Dropout (Srivastava et al., 2014) provides a computationally inexpensive but powerful method of regularizing a
broad family of models (it is akin to bagging).

• Dropout trains the ensemble consisting of all subnetworks that can be formed by removing non-output units from
an underlying base network. Recall that to learn with bagging, we define k different models, construct k different
datasets by sampling from the training set with replacement, and then train model i on dataset i. Dropout aims to
approximate this process, but with an exponentially large number of NNs.

• In practice, each time we load an example into a minibatch for training, we randomly sample a different binary
mask to apply to all input and hidden units in the network; the mask is sampled independently for each unit (e.g. 0.8
probability for including an input unit and 0.5 for hidden units).

• In the case of bagging, the models are all independent; for dropout, the models share parameters.
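(*) The per-minibatch masking step is simple; a sketch for one hidden layer (keep probability 0.5, using the common "inverted dropout" rescaling so that expected activations match at test time):

import numpy as np

rng = np.random.default_rng(0)

def dropout(h, keep_prob=0.5, train=True):
    if not train:
        return h                                  # no masking at test time
    mask = rng.random(h.shape) < keep_prob        # fresh binary mask per minibatch
    return h * mask / keep_prob                   # "inverted" scaling keeps E[output] unchanged

h = rng.normal(size=(4, 6))                       # hidden-layer activations (batch of 4)
print(dropout(h, keep_prob=0.5))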
Adversarial Training

• Szegedy et al. (2014) found that even NNs that perform at human-level accuracy have a nearly 100 percent
error rate on examples that are intentionally constructed by using an optimization procedure to search for an
input x’ near a data point x such that the model output at x’ is very different from the output at x (oftentimes
such adversarial perturbations are indiscernible to humans).

• In the context of regularization, one can reduce the error rate on the original i.i.d. test set via adversarial
training – training on adversarially perturbed examples from the training set.

• Goodfellow et al. (2014) showed that one of the primary causes of these adversarial examples is excessive
linearity. NNs are primarily built out of linear parts, and so the overall function that they implement proves to
be highly linear as a result.

• Adversarial training helps to illustrate the power of using a large function family in combination with
aggressive regularization – a major theme in contemporary deep learning.
Basic Algorithms: SGD
• Stochastic Gradient Descent (SGD) and its variants are some of the most frequently used optimization
algorithms in ML. Using a minibatch of i.i.d. samples, one can obtain an unbiased estimate of the gradient
(where examples are drawn from the data-generating distribution).

• A crucial parameter for the SGD algorithm is the learning rate, ε. In practice, it is necessary to gradually
decrease the learning rate over time. This is because the SGD gradient estimator introduces a source of noise
(the random sampling of m training examples) that does not vanish even when we arrive at a minimum.

A sufficient condition to guarantee convergence of SGD is that:

Σ_{k=1}^∞ ε_k = ∞   and   Σ_{k=1}^∞ ε_k² < ∞

In practice, it is common to decay the learning rate linearly until iteration τ:

ε_k = (1 − α) ε_0 + α ε_τ,   with α = k/τ
* Note that for SGD, the computation time per update does not grow with the number of training examples.
This allows convergence even when the number of training examples becomes very large.
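(*) A generic minibatch-SGD loop with the linear learning-rate decay above; grad_fn, the data array, and the final learning rate ε_τ = 0.01·ε_0 are placeholders or common illustrative choices, not prescriptions from the slides.

import numpy as np

def sgd(params, grad_fn, data, n_steps=1000, eps0=0.1, tau=500, batch_size=32):
    eps_tau = 0.01 * eps0                              # final learning rate (a common choice)
    rng = np.random.default_rng(0)
    for k in range(n_steps):
        alpha = min(k / tau, 1.0)
        eps_k = (1 - alpha) * eps0 + alpha * eps_tau   # linear decay until iteration tau
        batch = rng.choice(len(data), size=batch_size, replace=False)
        params -= eps_k * grad_fn(params, data[batch]) # unbiased minibatch gradient estimate
    return params

# toy usage: estimate the mean of a dataset by minimizing squared error
data = np.random.default_rng(1).normal(3.0, 1.0, size=(5000, 1))
grad = lambda p, batch: p - batch.mean(axis=0)         # gradient of 0.5*||p - mean||^2
print(sgd(np.zeros(1), grad, data))                    # converges near 3.0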
Momentum
• The method of momentum is designed to accelerate learning, especially in the face of high curvature, small
but consistent gradients, or noisy gradients.

• The momentum algorithm accumulates an exponentially decaying moving average of past gradients and
continues to move in their direction.

• Formally, the momentum algorithm introduces a variable v that plays the role of velocity – it is the direction
and speed at which the parameters move through parameter space. The velocity is set to an exponentially
decaying average of the negative gradient.

• The name momentum derives from a physical analogy, in which the negative gradient is a force moving a
particle through parameter space, according to Newton’s laws of motion. If the only force is the gradient of
the cost function, then the particle might never come to rest. To resolve this problem, we add one other force,
proportional to −v(t); in physics terminology this force corresponds to viscous drag, as if the particle must
push through a resistant medium such as syrup.

• The velocity v accumulates the gradient elements; the larger alpha is relative to epsilon, the more previous
gradients affect the current direction.
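A minimal sketch of the momentum update described above; the decay α and learning rate ε are illustrative values.

```python
import numpy as np

def momentum_step(theta, v, grad, eps=0.01, alpha=0.9):
    """Classical momentum update (sketch).

    v accumulates an exponentially decaying average of past negative gradients.
    """
    v = alpha * v - eps * grad   # update velocity
    theta = theta + v            # move in the velocity direction
    return theta, v
```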
Momentum
Algorithms with Adaptive Learning Rates
• The learning rate is reliably one of the most challenging hyperparameters to set because
it significantly affects model performance. The cost function is often highly sensitive to some directions in
parameter space and insensitive to others.

• While the momentum algorithm mitigates these issues somewhat, it does so at the expense of introducing
another hyperparameter.

• Recently, a number of incremental methods have been introduced that adapt the learning rates of model
parameters.
AdaGrad
• The AdaGrad algorithm (Duchi et al, 2011) individually adapts the learning rates of all model
parameters by scaling them inversely proportional to the square root of the sum of all the historical
squared values of the gradient.

• The parameters with the largest partial derivatives of the loss have a correspondingly rapid decrease in their
learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning
rate. The net effect is greater progress in the more gently sloped directions of parameter space.

*Note: empirically, for training DNNs, the accumulation of squared gradients from the beginning of training can
result in premature and excessive decrease in the effective learning rate.
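A minimal sketch of the AdaGrad per-parameter update; the small stability constant δ is an implementation detail, not from the slides.

```python
import numpy as np

def adagrad_step(theta, r, grad, eps=0.01, delta=1e-7):
    """AdaGrad update (sketch). r accumulates the sum of squared gradients."""
    r = r + grad ** 2                                    # accumulate squared gradient
    theta = theta - eps * grad / (delta + np.sqrt(r))    # per-parameter scaled step
    return theta, r
```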
RMSProp
• The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by
changing the gradient accumulation into an exponentially-weighted moving average. Where AdaGrad shrinks
the learning rate according to the entire history of the squared gradient, RMSProp uses an exponentially
decaying average to discard history from the extreme past so that it can converge rapidly after
finding a convex bowl.

• Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for DNNs.
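A minimal sketch of the RMSProp update; the decay ρ, learning rate ε, and stability constant δ are illustrative values.

```python
import numpy as np

def rmsprop_step(theta, r, grad, eps=0.001, rho=0.9, delta=1e-6):
    """RMSProp update (sketch): AdaGrad with an exponentially weighted
    moving average of squared gradients (decay rho) in place of the full sum."""
    r = rho * r + (1 - rho) * grad ** 2
    theta = theta - eps * grad / np.sqrt(delta + r)
    return theta, r
```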
Adam
• Adam (Kingma and Ba, 2014) is another adaptive learning rate optimization algorithm (“adaptive
moments”). It can be seen as a variant of the combination of RMSProp and momentum, with several
distinctions.

• First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with
exponential weighting) of the gradient. Second, Adam includes bias corrections to the estimates of both the
first-order moments (the momentum term) and the (uncentered) second-order moments to account for their
initialization at the origin.

• RMSProp also incorporates an estimate of the (uncentered) second-order moment; however, it lacks the
correction factor. Thus, unlike in Adam, the RMSProp second-order moment estimate may have high bias
early in training. *Adam is generally regarded as being fairly robust to the choice of hyperparameters.
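A minimal sketch of the Adam update, showing the bias-corrected first and second moment estimates described above; the hyperparameter values are illustrative defaults.

```python
import numpy as np

def adam_step(theta, s, r, grad, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam update (sketch). s and r are the first and (uncentered) second moment
    estimates; bias correction accounts for their initialization at zero.
    t is the (1-based) time step."""
    s = rho1 * s + (1 - rho1) * grad          # first moment (momentum term)
    r = rho2 * r + (1 - rho2) * grad ** 2     # second (uncentered) moment
    s_hat = s / (1 - rho1 ** t)               # bias-corrected first moment
    r_hat = r / (1 - rho2 ** t)               # bias-corrected second moment
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r
```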
DL Optimization Comparison

Left: contours of a loss surface and the time evolution of different optimization algorithms.
Notice the "overshooting" behavior of momentum-based methods, which makes the
optimization look like a ball rolling down a hill. Right: a visualization of a saddle point in
the optimization landscape, where the curvature along different dimensions has different
signs (one dimension curves up and another down). Notice that SGD has a very hard time
breaking symmetry and gets stuck near the top. Conversely, algorithms such as RMSProp
see very low gradients in the saddle direction; due to the denominator term in the RMSProp
update, this increases the effective learning rate along this direction, helping RMSProp
proceed. Images credit: Alec Radford.
Second-Order Methods
• A number of methods have been proposed in recent years for using second-order derivatives for
optimization (consider this scenario as incorporating an approximation of the curvature of the loss function
into the optimization problem).

• Such methods can partially alleviate some of the problems caused by curvature of the loss function,
including cliffs, and the necessity of many course correction steps for hill climbing.

Second-Order Methods: Newton’s Method
• Newton’s method is a classical second-order iterative approximation method. In contrast to first-order
methods, second-order methods make use of second derivatives (i.e. the curvature of the loss function) to
improve optimization.

• Newton’s method is an optimization scheme based on using a second-order Taylor series expansion to
approximate J(θ) near some point θ0, ignoring derivatives of higher order:

$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_\theta J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T H (\theta - \theta_0)$$

where H is the Hessian of J with respect to θ evaluated at θ0. If we then solve for the critical point of this function, we
obtain the Newton parameter update rule:

$$\theta^* = \theta_0 - H^{-1}\nabla_\theta J(\theta_0)$$
Second-Order Methods: Newton’s Method
$$\theta^* = \theta_0 - H^{-1}\nabla_\theta J(\theta_0), \qquad J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_\theta J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T H (\theta - \theta_0)$$
• If the objective function is convex but not quadratic, this update can be iterated, yielding a training
algorithm. For surfaces that are not quadratic, as long as the Hessian remains positive definite, Newton’s
method can be applied iteratively. This implies a two-step procedure: (1) update or compute the inverse
Hessian; (2) update the parameters according to the equation above.

* In deep learning, the surface of the objective function is usually non-convex; with many features and
potential saddle points, this is a potential problem for Newton’s Method.

• Commonly, researchers apply a regularization strategy, for which the update becomes (this regularization is
used in approximations to Newton’s method, including the Levenberg-Marquardt algorithm):

$$\theta^* = \theta_0 - \left[ H\!\left( f(\theta_0) \right) + \alpha I \right]^{-1} \nabla_\theta f(\theta_0)$$
• Beyond the challenges of saddle points, the application of Newton’s method for training large NNs is
limited by its significant computational requirements; ostensibly, Newton’s method requires the inversion
of an n × n Hessian matrix (O(n³)); as a consequence, only networks with a very small number of parameters can be
practically trained via Newton’s method.

(*) In practice, it is common to apply a second-order method using a “Hessian-free” approach, meaning that
the full Hessian is either approximated with a low-rank matrix or eigenvector methods are applied (see
conjugate gradients).
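A sketch of the regularized (damped) Newton update above, assuming the gradient and Hessian are available explicitly; for large networks one would instead rely on the Hessian-free / conjugate-gradient approximations just mentioned.

```python
import numpy as np

def damped_newton_step(theta, grad, hessian, alpha=1e-3):
    """Regularized Newton step (sketch), following the update above.

    grad and hessian are the gradient and Hessian of the objective at theta;
    alpha is the damping constant added to the Hessian's diagonal.
    Solving the linear system avoids forming an explicit inverse.
    """
    n = theta.size
    step = np.linalg.solve(hessian + alpha * np.eye(n), grad)
    return theta - step
```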
Second-Order Methods: Newton’s Method
Supplemental Backpropagation Derivation
Backpropagation Algorithm

• Initialize the network weights w to small random numbers (e.g.,


between −0.05 and 0.05).

• Until the termination condition is met, Do:


– For each (x, t) ∈ training set, Do:
1. Propagate the input forward:

– Input x to the network and compute the activation hj of


each hidden unit j.

– Compute the activation ok of each output unit k.


2. Calculate error terms

For each output unit k, calculate error term δk:

For each hidden unit j, calculate error term δj:

$$\delta_j \leftarrow h_j\,(1 - h_j)\left(\sum_{k\,\in\,\text{output units}} w_{kj}\,\delta_k\right)$$
3. Update weights

Hidden to Output layer: For each weight wkj:

$$w_{kj} \leftarrow w_{kj} - \Delta w_{kj}, \qquad \text{where } \Delta w_{kj} = \eta\,\delta_k\,h_j$$

Input to Hidden layer: For each weight wji:

$$w_{ji} \leftarrow w_{ji} - \Delta w_{ji}, \qquad \text{where } \Delta w_{ji} = \eta\,\delta_j\,x_i$$
Backpropagation Algorithm (BP)
– Forwards phase: compute the activation of each neuron in the hidden layers and outputs using:

$$h_j = \sigma\!\left(\sum_{i\,\in\,\text{input layer}} w_{ji}\,x_i + w_{j0}\right) \qquad\qquad o_k = \sigma\!\left(\sum_{j\,\in\,\text{hidden layer}} w_{kj}\,h_j + w_{k0}\right)$$

– Backwards phase:
  – Compute the error term δk at each output unit.
  – Compute the error at the hidden layer(s) using:

$$\delta_j \leftarrow h_j\,(1 - h_j)\left(\sum_{k\,\in\,\text{output units}} w_{kj}\,\delta_k\right)$$

  – Update the output layer weights using: $w_{kj} \leftarrow w_{kj} - \Delta w_{kj}$, where $\Delta w_{kj} = \eta\,\delta_k\,h_j$
  – Update the hidden layer weights using: $w_{ji} \leftarrow w_{ji} - \Delta w_{ji}$, where $\Delta w_{ji} = \eta\,\delta_j\,x_i$
  – (If using sequential updating) randomize the order of the input vectors so that you don’t train in exactly the same order each
iteration.
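A minimal numpy sketch of one training step for a single-hidden-layer sigmoid network following the equations above. It assumes the standard squared-error output error term δk = ok(1 − ok)(tk − ok), which is not written out on the slide, and writes the update in the usual gradient-descent form w ← w + η·δ·(activation); layer sizes and the learning rate are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, eta=0.2):
    """One forward/backward pass for a 1-hidden-layer sigmoid network (sketch).

    x: input vector, t: target vector.
    W1: (n_hidden, n_in + 1) input-to-hidden weights, bias in column 0.
    W2: (n_out, n_hidden + 1) hidden-to-output weights, bias in column 0.
    """
    # Forward phase
    x1 = np.concatenate(([1.0], x))          # prepend bias input
    h = sigmoid(W1 @ x1)                     # hidden activations
    h1 = np.concatenate(([1.0], h))          # prepend hidden bias unit
    o = sigmoid(W2 @ h1)                     # output activations

    # Backward phase: error terms
    delta_k = o * (1 - o) * (t - o)                    # output units (assumed form)
    delta_j = h * (1 - h) * (W2[:, 1:].T @ delta_k)    # hidden units

    # Weight updates (standard gradient descent on squared error)
    W2 = W2 + eta * np.outer(delta_k, h1)
    W1 = W1 + eta * np.outer(delta_j, x1)
    return W1, W2, o
```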
Backprop Example

Training set: x = (1, 0), label 0.9; x = (0, 1), label −0.3. Test set: x = (1, 1), label 0.8.

[Network diagram: two inputs x1, x2 plus a bias input of 1; two hidden units h1, h2 plus a hidden bias of 1; one output o1. Every weight is initialized to 0.1.]
Present the first training example, x = (1, 0), with target 0.9. [Network diagram: all weights still 0.1.]
“Forward Phase” – hidden layers: each hidden unit receives 0.1·1 (bias) + 0.1·1 + 0.1·0 = 0.2, so h1 = h2 = σ(0.2) ≈ 0.55.
“Forward Phase” – output layer: the output unit receives 0.1·1 (bias) + 0.1·0.55 + 0.1·0.55 = 0.21, so o1 = σ(0.21) ≈ 0.552.
“Backward Phase” – calculate the error terms (target 0.9, output o1 = 0.552), then compute the output weight updates and the hidden weight updates, using for the hidden units:

$$\delta_j \leftarrow h_j\,(1 - h_j)\left(\sum_{k\,\in\,\text{output units}} w_{kj}\,\delta_k\right)$$
Update the hidden-to-output weights (learning rate = 0.2; momentum = 0.9):
$w_{k=1,j=0} = 0.1 - 0.0172 = 0.0828$
$w_{k=1,j=1} = 0.1 - 0.0095 = 0.0905$
$w_{k=1,j=2} = 0.1 - 0.0095 = 0.0905$

[Network diagram: hidden-to-output weights are now 0.0828 (bias), 0.0905, 0.0905; the input-to-hidden weights are still 0.1.]
Update the input-to-hidden weights (learning rate = 0.2; momentum = 0.9), hidden unit j = 1:

$\Delta w_{j=1,i=0} = (0.2)(0.002)(1) + (0.9)(0) = 0.0004$, so $w_{j=1,i=0} = 0.1 - 0.0004 = 0.0996$
$\Delta w_{j=1,i=1} = (0.2)(0.002)(1) + (0.9)(0) = 0.0004$, so $w_{j=1,i=1} = 0.1 - 0.0004 = 0.0996$
$\Delta w_{j=1,i=2} = (0.2)(0.002)(0) + (0.9)(0) = 0$, so $w_{j=1,i=2} = 0.1$

[Network diagram: the weights into h1 are updated accordingly.]
Update the input-to-hidden weights (learning rate = 0.2; momentum = 0.9), hidden unit j = 2:

$\Delta w_{j=2,i=0} = (0.2)(0.002)(1) + (0.9)(0) = 0.0004$, so $w_{j=2,i=0} = 0.1 - 0.0004 = 0.0996$
$\Delta w_{j=2,i=1} = (0.2)(0.002)(1) + (0.9)(0) = 0.0004$, so $w_{j=2,i=1} = 0.1 - 0.0004 = 0.0996$
$\Delta w_{j=2,i=2} = (0.2)(0.002)(0) + (0.9)(0) = 0$, so $w_{j=2,i=2} = 0.1$

[Network diagram: the weights into h2 are updated accordingly.]
Now present the second training example, x = (0, 1), with target −0.3. [Network diagram: weights as updated above.]

Note: this is time step 2, so the momentum term will be nonzero.
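A small numpy sketch (not part of the original slides) that reproduces the numbers in the worked example; it assumes the standard sigmoid/squared-error output error term δk = ok(1 − ok)(tk − ok), which yields the update magnitudes shown above (0.0172, 0.0095, 0.0004).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward phase for the first training example, x = (1, 0), all weights 0.1
x = np.array([1.0, 1.0, 0.0])                    # [bias, x1, x2]
W_hid = np.full((2, 3), 0.1)                     # weights into h1, h2
h = sigmoid(W_hid @ x)                           # -> [0.550, 0.550]
w_out = np.full(3, 0.1)                          # [bias, h1, h2] weights into o1
o = sigmoid(w_out @ np.concatenate(([1.0], h)))  # -> 0.552

# Backward phase (assumed sigmoid / squared-error error terms)
t, eta = 0.9, 0.2
delta_k = o * (1 - o) * (t - o)                        # ~0.0867
dw_out = eta * delta_k * np.concatenate(([1.0], h))    # ~[0.0172, 0.0095, 0.0095]
delta_j = h * (1 - h) * (w_out[1:] * delta_k)          # ~[0.002, 0.002]
dw_hid = eta * np.outer(delta_j, x)                    # ~0.0004 for the nonzero inputs
print(h, o, dw_out, dw_hid)
```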

Another detailed backprop example:


https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
(*) Here is a derivation (later slides contain a derivation with
visuals) of BP (note our text also has a derivation pp. 101-
108). Time permitting, I’ll walk us through this. If you require
further details don’t hesitate to ask for help.

(*) “Will this be on the exam?” No, but understanding the


material at this level makes you a better person – moreover, it
will make your friends envious, your mother will love you
more, and strangers at cocktail parties will be drawn to you
like a magnet. You’re welcome.
Backprop Derivation
What do we need to derive the backpropagation (BP) algorithm?
Only a basic knowledge of differential Calculus!

(*) Recall that we will use BP to update weights in both the


hidden layer(s) and the output layer of our NN.

(*) We use the chain rule to “propagate” the error back


through the network (following the “forward phase”).

obligatory backprop
meme
Backprop Derivation
(*) We’ll call the current input x (a vector) and the output y; the activation
function (throughout the network) will be denoted g(∙).

(*) For simplicity, let’s assume the NN contains only a single hidden layer (BP
extends naturally for more layers); denote the weights of the network v and
w, for the first and second layers respectively.

(*) Recall that “learning” entails tuning the weights of the network.
Backprop Derivation
(*) We wish to minimize the error function:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{k=1}^{N}\left(y_k - t_k\right)^2 = \frac{1}{2}\sum_{k=1}^{N}\left(g\!\left(\sum_{i=0}^{L} w_{ik}\,x_i\right) - t_k\right)^2$$

where y is the output, t is the target; N is the data set size and L is the
number of nodes (in a given layer).

(*) We use gradient descent. In particular, we wish to know how the error
function changes with respect to the different weights:

$$\frac{\partial E}{\partial w_{\zeta\kappa}}$$

*Note: j = ζ, k = κ are fixed indices.
Backprop Derivation
(*) Let $a = g(h) = \dfrac{1}{1 + e^{-h}}$ (the sigmoid function); recall that $a' = a\,(1 - a)$.

(*) Using the chain rule, we have:

$$\frac{\partial E}{\partial w_{\zeta\kappa}} = \frac{\partial E}{\partial h_\kappa}\,\frac{\partial h_\kappa}{\partial w_{\zeta\kappa}}, \qquad \text{where } h_\kappa = \sum_{j=0}^{M} w_{j\kappa}\,a_j \text{ is the input to output-layer neuron } \kappa$$

The equation above says that the error at the output changes as we vary
the second-layer weights as a function of how the error changes with respect to
the input to the output neurons, and how that input changes with respect to
the weights.
Backprop Derivation
E E h
=
w h w

(*) Consider the (2)nd factor:


Why?
M
  w j a j
h M w j a j
=
j =0
= = a
w w j =0 w
Backprop Derivation
E E h h
= = a
w h w
w
(*) Consider the (2)nd factor:

w j 
(*) Last step holds because = 0 , except in the case: j = 
w
Backprop Derivation
E E h
=
w h w

(*) Consider the (1)st factor, which we short-hand as follows:


E
 O ( ) =
h

(*) By the chain rule, we have:

E E y
 O ( ) = =
h y h M hidden 
y = g ( h
output
) = g   w j a j 
 j =0 

Note: hΚ signifies the value of the output neuron prior to activation, whereas
yΚ denotes the value of the output neuron after activation.
Backprop Derivation
E E h
=
w h w

(*) Consider the (1)st factor, which we short-hand as follows:


E
 O ( ) =
h

(*) By the chain rule, we have:


E E y
 O ( ) = =
h y h

M hidden 
(*) Also, note that: y = g ( h
output
)   j j 
= g w a
 j =0 
Backprop Derivation
E E h
=
w h w

E E y
 O ( ) = =
h y h

Continuing…
E g ( h )
output
E
 O ( ) = =
g ( h )  ( h ) g ( h )
output output output
g  ( h
output
)
Backprop Derivation
E E h
=
w h w

E E y
 O ( ) = =
h y h

Continuing…
E g ( h )
output
E
 O ( ) = =
g ( h )  ( h ) g ( h )
output output output
g  ( h
output
)

 1 N 2
= output  
g ( h )  k =1
2
g ((h
output
) − t k )


g  ( h
output
)
Backprop Derivation
E E h
=
w h w

E E y
 O ( ) = =
h y h

Continuing…
E g ( h )
output
E
 O ( ) = =
g ( h )  ( h ) g ( h )
output output output
g  ( h
output
)

 1 N 2
= output  
g ( h )  k =1
2
g ((h
output
) − t k )


g  ( h
output
)

( )
= g ( houtput ) − tk g  ( houtput )
Backprop Derivation
E E h
=
w h w

E E y
 O ( ) = =
h y h

Continuing…
E g ( h )
output
E
 O ( ) = =
g ( h )  ( h ) g ( h )
output output output
g  ( h
output
)

 1 N 2
= output  
g ( h )  k =1
2
g ((h
output
) − t k 

)g  ( h
output
)

( )
= g ( houtput ) − tk g  ( houtput ) = ( yk − tk ) g  ( houtput )
Backprop Derivation
In Summary…

$$\frac{\partial E}{\partial w_{\zeta\kappa}} = \frac{\partial E}{\partial h_\kappa}\,\frac{\partial h_\kappa}{\partial w_{\zeta\kappa}} = \left(y_\kappa - t_\kappa\right) g'\!\left(h_\kappa^{\text{output}}\right) a_\zeta$$

(*) Recall, a “gradient descent” based weight update has the form:

$$w_{\zeta\kappa} \leftarrow w_{\zeta\kappa} - \eta\,\frac{\partial E}{\partial w_{\zeta\kappa}} = w_{\zeta\kappa} - \eta\left(y_\kappa - t_\kappa\right) g'\!\left(h_\kappa^{\text{output}}\right) a_\zeta$$

(*) Cool, but this doesn’t look like the formulas for BP you showed us before.
Backprop Derivation
$$w_{\zeta\kappa} \leftarrow w_{\zeta\kappa} - \eta\,\frac{\partial E}{\partial w_{\zeta\kappa}} = w_{\zeta\kappa} - \eta\left(y_\kappa - t_\kappa\right) g'\!\left(h_\kappa^{\text{output}}\right) a_\zeta$$

(*) Recall that g is the sigmoid, so $\dfrac{d\sigma(z)}{dz} = \sigma(z)\left(1 - \sigma(z)\right)$. So what’s our new formula?

$$w_{\zeta\kappa} \leftarrow w_{\zeta\kappa} - \eta\left(y_\kappa - t_\kappa\right) y_\kappa\left(1 - y_\kappa\right) a_\zeta$$

(*) This is the final formula for the BP update for the output layer weights!
Backprop Derivation
“Hold on a second. You still need to derive the hidden layer weight updates!”
Backprop Derivation
(*) Short version of the input-to-hidden layer weight updates for BP. We compute:

$$\delta_h(\zeta) = \frac{\partial E}{\partial h_\zeta^{\text{hidden}}} = \sum_{k=1}^{K}\frac{\partial E}{\partial h_k^{\text{output}}}\,\frac{\partial h_k^{\text{output}}}{\partial h_\zeta^{\text{hidden}}} = \sum_{k=1}^{K}\delta_O(k)\,\frac{\partial h_k^{\text{output}}}{\partial h_\zeta^{\text{hidden}}}$$

(where K is the number of output units)

(*) This formula comes from the fact that each hidden node contributes to
the activation of all the output nodes, and so we need to consider all of these
contributions.

(*) From here, using the chain rule, the differential properties of the sigmoid and
the NN topology, it is not difficult to show (proceeding analogously to before):

$$\delta_h(\zeta) = a_\zeta\left(1 - a_\zeta\right)\sum_{k=1}^{K}\delta_O(k)\,w_{\zeta k}$$
Backprop Derivation
$$\delta_h(\zeta) = a_\zeta\left(1 - a_\zeta\right)\sum_{k=1}^{K}\delta_O(k)\,w_{\zeta k}$$

(*) This yields the following update rule for the first-layer weight $v_{\iota\zeta}$ (connecting input ι to hidden node ζ):

$$v_{\iota\zeta} \leftarrow v_{\iota\zeta} - \eta\,\frac{\partial E}{\partial v_{\iota\zeta}} = v_{\iota\zeta} - \eta\,a_\zeta\left(1 - a_\zeta\right)\left[\sum_{k=1}^{K}\delta_O(k)\,w_{\zeta k}\right] x_\iota$$
Derivation complete! (at least for NNs with one hidden layer)
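As a sanity check on the derivation (a sketch, not from the slides), the derived output-layer gradient $(y_\kappa - t_\kappa)\,y_\kappa(1 - y_\kappa)\,a_\zeta$ can be compared against a finite-difference estimate of E; the tiny example values below are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny setting: hidden activations a (with bias), one output with weights w, target t
a = np.array([1.0, 0.55, 0.55])
w = np.array([0.1, 0.1, 0.1])
t = 0.9

def E(w):
    y = sigmoid(w @ a)
    return 0.5 * (y - t) ** 2

y = sigmoid(w @ a)
analytic = (y - t) * y * (1 - y) * a      # derived gradient for each output weight

numeric = np.zeros_like(w)
eps = 1e-6
for i in range(w.size):                    # central finite differences
    dw = np.zeros_like(w); dw[i] = eps
    numeric[i] = (E(w + dw) - E(w - dw)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # expected: True
```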
