week 03-04 - Deep Feedforward Networks - Intro
Paolo Favaro
Contents
• Introduction to Feedforward Neural Networks: definition, design, training
• Resources: books and online material for further studies
Feedforward Neural Networks
• Feedforward networks are a sequence of layers, each processing the output of the previous layer(s)
[Figure: a network mapping the input x through a first layer of units z1, z2, z3 and a second layer of units q1, q2, q3 (grouped into the vectors h1 and h2) to the output y. Annotations: x is the input; the z and q nodes form the hidden layers; the number of units in a layer is its width; processing is sequential across layers and parallel within a layer.]

Feedforward Neural Networks
• Example (rectified linear unit): a unit may compute
f_{1,1}(x) = ReLU(x)
[Figure: plot of ReLU(x).]

Feedforward Neural Networks
• The output layer combines the units of the last hidden layer,
y = f_4(q_1, q_2, q_3)
Although each layer may implement a very simple function, the composition of several simple functions quickly becomes a very complex one.
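To make the composition concrete, here is a minimal NumPy sketch of such a two-hidden-layer forward pass (the layer widths and random weights are illustrative assumptions, not values from the slides):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
x = rng.normal(size=2)                          # input x
W1, c1 = rng.normal(size=(3, 2)), np.zeros(3)   # first hidden layer (width 3)
W2, c2 = rng.normal(size=(3, 3)), np.zeros(3)   # second hidden layer (width 3)
w, b = rng.normal(size=3), 0.0                  # output layer

z = relu(W1 @ x + c1)   # z1, z2, z3: computed in parallel within the layer
q = relu(W2 @ z + c2)   # q1, q2, q3: sequential processing across layers
y = w @ q + b           # y = f_4(q1, q2, q3)
print(y)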
Feedforward Neural Networks
• Feedforward neural networks define a family of functions f(x; θ)
y = f(x; θ)
The fundamental property of feedforward neural networks is the way they define the family of functions f. As in the previous slides, the main constraint is in the (compositional) dependencies.
• We need
• a cost function (and a training set)
• a network model
• an optimization procedure
First, we need a clear objective. In ML, as already mentioned, this is specified via pairs (x, y) of inputs and outputs (the training set). Then we need to choose how to penalize mistakes in the estimated mapping (e.g., the 0-1 loss, the L2 loss, etc.). Next we need to design the network. This is largely an art at this stage: common practice is to start from networks that fit the task and that have been proven to work on similar problems, and then to modify the network, using diagnosis and error-analysis tools to guide the changes. Finally, training requires choosing an optimizer to minimize the cost function. We will use a gradient-based method, with the gradients computed by back-propagation.
XOR Example
[Figure: the four XOR training points, shown in the original input space (x1, x2) and in the learned feature space (h1, h2).]
Cost Function
• For the XOR training set we can use the mean squared error over the four pairs:
J(θ) = (1/4) Σ_{i=1}^{4} (y^i − f(x^i; θ))²
Linear Model
• Let us try a linear model of the form
f(x; w, b) = w^T x + b
• Minimizing J in closed form yields
w = [0 0]^T,  b = 1/2
So the output is the constant b = 1/2 for any input, which is largely unsatisfactory. Graphically, the plane with the smallest vertical distance from each 2D point in the training set is indeed the one with w = [0 0]^T and b = 1/2.
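A quick numerical check of this claim (a minimal NumPy sketch; the least-squares solve is an illustration, not part of the slides):

import numpy as np

# XOR training set
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([0., 1., 1., 0.])

# Append a constant 1 to each input to absorb the bias b
A = np.hstack([X, np.ones((4, 1))])
params, *_ = np.linalg.lstsq(A, Y, rcond=None)
w, b = params[:2], params[2]
print(w, b)   # w ≈ [0, 0], b ≈ 0.5: the best linear fit is the constant 1/2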
Network Model
• Let us try a simple feedforward network with one hidden layer h and two hidden units h1, h2
• For simplicity, let us group together all units in the same layer into the vectors x and h; the output is then
y(h) = w^T h + b
[Figure 6.2 (Goodfellow et al.): an example of a feedforward network, drawn in two different styles; specifically, the network used to solve the XOR example, with a single hidden layer containing two units. Left: every unit drawn as a node in the graph; very explicit and unambiguous, but it can consume too much space for networks larger than this example. Right: one node per layer-activation vector; much more compact. The edges may be annotated with the names of the parameters that describe the relationship between two layers: here a matrix W describes the mapping from x to h, and a vector w the mapping from h to y; the intercept parameters of each layer are typically omitted in this kind of drawing.]
Optimization
• The full model is
f(x; W, c, w, b) = w^T max{0, W^T x + c} + b
• Consider the parameter setting
W = [1 1; 1 1],  c = [0 −1]^T,  w = [1 −2]^T,  b = 0
Let us take these settings and compute the output of the function f on the training set.
Simulation
• Stack the training inputs as the rows of a matrix X:
X = [0 0; 0 1; 1 0; 1 1]
• The affine part of the hidden layer gives
XW + 1c^T = [0 −1; 1 0; 1 0; 2 1]
• Rectifying,
max{0, XW + 1c^T} = [0 0; 1 0; 1 0; 2 1]
• Finally, the output layer gives
max{0, XW + 1c^T} w + 1b = [0 1 1 0]^T
which is the XOR function (matches Y).
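The same computation as a minimal NumPy sketch, using the parameter values above:

import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
W = np.array([[1., 1.], [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.0

H = np.maximum(0.0, X @ W + c)   # hidden layer: max{0, XW + 1c^T}
y = H @ w + b                    # output layer
print(y)                         # [0. 1. 1. 0.]: the XOR function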
Step-by-Step Analysis
[Figure: the network model maps a training input x^i through the hidden layers and the output layer to the prediction f(x^i; θ); the cost function compares prediction and target via loss(y^i, f(x^i; θ)); the training set provides the pairs (x^i, y^i).]
Now that we have seen an example of how to design the cost function and a model (and motivated the need for nonlinearity), and analysed the performance, we can present more in depth each item in the design of a machine learning algorithm: the cost function, the model (the neural network and, in particular, its hidden layers), and the optimization procedure.
Cost Function
Saturation
• Functions that saturate (have flat regions) have a very small gradient and slow down gradient descent
• We choose loss functions that have a non-flat region when the answer is incorrect (they might be flat otherwise)
In practice the cross-entropy is preferred over other methods (e.g., based on statistics) because it is numerically more stable. In the example above, we want z to have the same sign as 2y − 1, where y = 0 or y = 1. Minimizing the exponential form given above achieves this purpose: the gradient is nonzero when there is a mismatch (so gradient descent does not stop), and it can become zero when there is a match.
Cost Function
• For example, directly estimating statistics of y|x
Another option is to directly estimate some statistics of y|x: the L2 loss yields the conditional mean, and another well-known option, the L1 loss, yields the conditional median.
Output Units
• The choice of the output representation (e.g., a probability vector or the mean estimate) determines the cost function
• The output unit operates on the features computed by the hidden layers,
h = f(x; θ)
Linear Units
• With a little abuse of terminology, linear units include affine transformations
ŷ = W^T h + b
• The output ŷ can be seen as the mean of the conditional Gaussian distribution (in the maximum likelihood loss)
Linear units are easy to work with: during training, their gradients are simple and stable.
Softplus
• The softplus function is defined as
ζ(x) = log(1 + exp(x))
and it is a smooth approximation of the rectifier
x⁺ = max(0, x)
[Figure 3.4 (Goodfellow et al.): plot of the softplus function, alongside max(0, x).]
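A minimal NumPy sketch of softplus; the rewriting below is the standard numerically stable form (an added detail, not from the slides):

import numpy as np

def softplus(x):
    # log(1 + exp(x)) computed stably: ~x for large x, ~exp(x) for very
    # negative x, with no overflow in the exponential.
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

x = np.linspace(-10, 10, 5)
print(softplus(x))           # smooth approximation of the rectifier
print(np.maximum(0.0, x))    # the rectifier x+ = max(0, x)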
Sigmoid Units
• Used to predict binary variables, or the probability of binary variables
p(y = 0|x) ∈ [0, 1]
• The sigmoid unit defines a suitable mapping and has no strongly flat regions:
σ(x) = 1 / (1 + e^{−x})
[Figure 3.3 (Goodfellow et al.): plot of the logistic sigmoid function.]
In practice, the sigmoid unit is a combination of a linear unit and the logistic sigmoid function. The lack of (strongly) flat regions helps gradient descent: where regions are flat, the gradients become zero and there is no useful update to the parameters (an extremum has been reached). Notice that the logistic sigmoid function is the derivative of the softplus function.
Bernoulli Parametrization
• Let z = w^T h + b. Then, we can define the Bernoulli distribution
p(y|z) = σ((2y − 1)z)
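A minimal sketch of this parametrization (the stable negative log-likelihood uses the standard identity −log σ(a) = ζ(−a), an added observation):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softplus(a):
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

z = np.array([-3.0, 0.5, 2.0])    # z = w^T h + b for three examples
y = np.array([0.0, 1.0, 1.0])     # binary targets

p = sigmoid((2*y - 1) * z)        # p(y | z)
nll = softplus((1 - 2*y) * z)     # −log p(y|z): flat only when the signs match
print(p, nll)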
Smoothed Max
• An extension of the softplus function is the smoothed max
log Σ_j exp(z_j)
which gives a smooth approximation to max_j z_j
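A minimal sketch (the max-subtraction trick for numerical stability is an addition, not from the slides):

import numpy as np

def smoothed_max(z):
    # log sum_j exp(z_j), shifted by max(z) to avoid overflow
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1.0, 3.0, 2.5])
print(smoothed_max(z), np.max(z))   # a smooth upper bound on max_j z_j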
Softmax Units
• An extension of the logistic sigmoid to multiple variables
softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
• In maximum likelihood we have
log softmax(z)_i = z_i − log Σ_j exp(z_j)
• Recall the smoothed max; then we can write
log softmax(z)_i ≈ z_i − max_j z_j
• Softmax reduces to the logistic sigmoid in the case of 2 variables with z_1 = 0, z_2 = z
Because probabilities must add up to 1, we can parametrize a probability over n variables with an (n−1)-dimensional vector. In the case of binary variables this is convenient because we only need to care about one probability, and the logistic sigmoid is a practical parametrization choice. However, with multi-dimensional variables it is often simpler to implement the full n-dimensional probability vector.
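A minimal NumPy sketch tying these together (the shift by max(z) is the standard stabilization, an added detail):

import numpy as np

def log_softmax(z):
    # log softmax(z)_i = z_i − log sum_j exp(z_j)
    m = np.max(z)
    return z - (m + np.log(np.sum(np.exp(z - m))))

z = np.array([0.0, 2.0])     # two variables with z1 = 0, z2 = z
p = np.exp(log_softmax(z))
sigma = 1.0 / (1.0 + np.exp(-2.0))
print(p[1], sigma)           # equal: softmax reduces to the logistic sigmoid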
Mixture Density Outputs
• The parameters include
p(c = i|x),  μ_i(x),  Σ_i(x)
[Figure 6.4 (Goodfellow et al.): samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution and the output y is sampled from p_model(y | x). The neural network is able to learn nonlinear mappings from the input to the parameters of the output distribution: the probabilities governing which of three mixture components will generate the output, as well as the parameters (predicted mean and variance) of each Gaussian mixture component. All of these aspects of the output distribution vary with respect to the input x, and do so in nonlinear ways.]
In the example we sample x uniformly and then sample y from p(y|x). As we can see, depending on x we have different multi-modal distributions: the concentration of y samples takes place at different locations, and these locations change with x.
Hidden Units
• The design of a neural network is so far still an art
• A common choice is the rectified linear unit applied to an affine transformation,
g(z) = max{0, z}
[Figure: plot of the ReLU.]
Maxout units generalize all the other units. However, they may be more difficult to train. Their redundancy in the representation also allows them to be more robust to catastrophic forgetting (where the network forgets how to perform a previously learned task, which matters for transfer learning).
Hyperbolic Tangent
• The hyperbolic tangent is defined as
g(z) = tanh(z)
[Figure: plot of tanh(z).]
and it is related to the logistic sigmoid via
tanh(z) = 2σ(2z) − 1
The hyperbolic tangent and the logistic sigmoid were used quite widely as the nonlinear components in neural networks. However, their saturation properties can make gradient-based learning very difficult, so they have now mostly been replaced by ReLUs. They can still serve as output units, especially when the loss function "undoes" the saturation. They are also often used in recurrent neural networks, probabilistic models and autoencoders.
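A one-line numerical check of this identity (illustrative only):

import numpy as np

z = np.linspace(-5, 5, 11)
sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
print(np.allclose(np.tanh(z), 2*sigma(2*z) - 1))   # True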
Other Units
Many other hidden units have been tested and published. When they perform only as well as the existing ones, they are not deemed useful or interesting.
A linear projection allows factorizations: instead of using a single affine transformation W, we can first apply a projection U (possibly to a lower-dimensional space) and then a projection V. The (low-rank) factorization VU of W uses fewer parameters than a general matrix W.
The skip connection is a special case that implements the identity function.
The RBF unit saturates for most values of x, so it is also difficult to optimize.
The softplus might seem a better option than the ReLU because it is differentiable everywhere; however, experimentally it performs worse than the ReLU.
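A minimal sketch of the parameter saving from a low-rank factorization (the sizes are illustrative assumptions):

import numpy as np

n, m, k = 1024, 1024, 64          # input dim, output dim, bottleneck dim
rng = np.random.default_rng(0)

W = rng.normal(size=(m, n))       # general linear map: m*n parameters
U = rng.normal(size=(k, n))       # project down to k dimensions
V = rng.normal(size=(m, k))       # project back up

x = rng.normal(size=n)
h_full = W @ x                    # one dense layer
h_fact = V @ (U @ x)              # factorized layer VU, rank at most k
print(W.size, U.size + V.size)    # 1048576 vs 131072 parameters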
Network Design
Universal Approximation
• Theorem
A feedforward network with a linear output layer and enough (but at least one) hidden nonlinear layers (e.g., with logistic sigmoid units) can approximate up to any desired precision any (Borel measurable) function between two finite-dimensional spaces
An example of a Borel measurable function is any continuous function on a closed and bounded subset of R^n.
Depth
• A general rule is that depth helps generalization
• It is better to have many simple layers than few highly complex ones
[Figure (Goodfellow et al., Ch. 6): test accuracy increases with the number of layers.]

Depth
• Other network modifications do not have the same effect
[Figure (Goodfellow et al., Ch. 6): test accuracy vs. number of parameters (up to about 10^8) for networks of different depths, including a 3-layer convolutional variant; increasing the number of parameters without increasing depth does not yield the same improvement.]
Depth
• Another interpretation is that depth allows a more gradual abstraction
A deep network might give a useful representation, where concepts are gradually more and more abstract. A learning problem is solved via the detection of simple underlying factors of variation that can in turn be described via the detection of more abstract factors of variation.
[Figure 1.2 (Goodfellow et al.): illustration of a deep learning model. It is difficult for a computer to understand the meaning of raw sensory input data, such as an image represented as a collection of pixel values, and the function mapping from a set of pixels to an object identity (e.g., car, person, animal) is very complicated. Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer, so named because it contains the variables we are able to observe. A series of hidden layers then extracts increasingly abstract features; these layers are called "hidden" because their values are not given in the data, and the model must determine which concepts are useful for explaining the relationships in the observed data. Given the pixels, the first layer can identify edges by comparing the brightness of neighboring pixels; given the edges, the second layer can search for corners and extended contours; given those, the third layer can detect entire parts of specific objects; finally, the description of the image in terms of object parts can be used to recognize the objects present in the image. Images reproduced with permission from Zeiler and Fergus (2014).]
Depth
• Another interpretation is that the network implements a gradient descent algorithm
This interpretation defines the network explicitly, and the depth corresponds to the number of gradient descent iterations (the more iterations, the better the final accuracy of the function); each step performs a local linearized operation (that is, a relatively simple one).
Optimization
• Given a task, we define a cost function to minimize
We now look in more detail at how we can improve the network parameters so that the network makes better predictions on the training set.
Optimization
• The MSE cost function J(θ) is convex with a linear model
[Figure: a convex J(θ) as a function of θ, with its global optimum marked.]
We used the MSE as our loss function in the initial (XOR) example. With a linear model, the optimization problem is convex and has a unique global optimum. While we could get a closed-form solution in that case, the solution was not good. Then, we considered a neural network, which made the model nonlinear and thus the cost function non-convex.
Optimization
• However, since the cost function J(θ) is typically non-convex in the parameters, we use an iterative solution
θ_{t+1} = θ_t − α ∇J(θ_t)
In the more general case, we need to resort to iterative solutions such as gradient descent. This method takes a scaled version of the negative gradient as the update to the previous set of parameters.
When we want to reach the valley and we start from a random position, we can follow the slope at our current location. This is the principle of gradient descent.

Optimization
gradient descent: θ_{t+1} = θ_t − α ∇J(θ_t)
[Figure: a convex 1D loss J(θ) and its gradient ∇J(θ).]
Let us illustrate how gradient descent works when the loss is convex and one-dimensional. In this case it is immediate to see that the negative of the gradient always points towards the global minimum.
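A minimal gradient descent loop on a 1D convex loss (the quadratic J and the learning rate are illustrative assumptions):

import numpy as np

J = lambda theta: (theta - 3.0)**2          # convex toy loss
grad_J = lambda theta: 2.0 * (theta - 3.0)  # its gradient

theta, alpha = -5.0, 0.1                    # random start, learning rate
for t in range(100):
    theta = theta - alpha * grad_J(theta)   # theta_{t+1} = theta_t − alpha ∇J(theta_t)
print(theta)                                # ≈ 3.0, the global optimum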
Local Minima
• Does gradient descent reach a (local) minimum even with a non-convex function?
[Figure: a non-convex J(θ).]
How do we ensure that, regardless of where we are in the domain, gradient descent moves towards a local minimum?
Assumptions
• Gradient descent will reach a local minimum under some constraints on both the cost function and the learning rate
One such constraint is Lipschitz continuity of the gradient of the cost function (notice that it applies to the gradient, not to the cost function itself). Lipschitz continuity bounds the maximum slope over the whole domain. In the illustration, it states that there exists a cone (red lines) that can be translated to any point of the function such that the function never enters it (the function always stays in the red-shaded region).
Convergence
• If we assume Lipschitz continuity of the gradient,
∃ L ≥ 0 :  |∇J(θ) − ∇J(θ̄)| ≤ L |θ − θ̄|,  ∀ θ, θ̄
then, under mild conditions, we can show that there is a suitably small learning rate such that gradient descent is guaranteed to converge to a local minimum.
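A small numerical illustration of the interplay between L and the learning rate (a standard fact, not from the slides: for a quadratic with gradient-Lipschitz constant L, gradient descent converges for α < 2/L and diverges above it; the constants here are assumptions):

import numpy as np

grad_J = lambda theta: 4.0 * theta      # J(theta) = 2 theta^2, so L = 4

for alpha in (0.4, 0.6):                # below and above 2/L = 0.5
    theta = 1.0
    for t in range(50):
        theta = theta - alpha * grad_J(theta)
    print(alpha, theta)                 # 0.4 converges to ~0; 0.6 blows up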
Diagnosing GD
• In practice, what do we do when gradient descent does not work?

Diagnosis 1/2
• Case 1: Lipschitzianity
This is a purely analytical diagnosis: one can simply look at the cost function formula and determine whether the functions in it may fail to satisfy Lipschitz continuity (of the gradient). This step should be done first.

Diagnosis 2/2
• Case 2: Learning rate
Optimization
• For more efficiency, we use the stochastic gradient descent method
θ_{t+1} = θ_t − α ∇J̃(θ_t)
Under some mild requirements (some smoothness of the loss function and a decreasing learning rate), stochastic GD (and other similar variations) can also be shown to converge to a local minimum. Because SGD requires only a small subset of all the samples, it can be computed very efficiently, with a positive impact on the overall training time.
Above, we discussed how to set the learning rate. Now we look at the other component of the update equation, the gradient of the loss: how do we obtain it?
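A minimal minibatch SGD sketch on a linear least-squares toy problem (the data, model and batch size are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.01 * rng.normal(size=1000)

w, alpha, batch = np.zeros(5), 0.05, 32
for t in range(2000):
    idx = rng.integers(0, len(X), size=batch)   # random minibatch
    r = X[idx] @ w - y[idx]
    grad = X[idx].T @ r / batch                 # stochastic gradient of the MSE/2
    w = w - alpha * grad                        # SGD update
print(w)                                        # ≈ [1, 2, 3, 4, 5]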
Back-Propagation
• (Stochastic) gradient descent boils down to the calculation of the gradient of the loss with respect to the parameters
Note that backprop does not refer to the whole training algorithm, but just to the specific way used to compute the gradient.
Computational Graphs
• Because neural networks can quickly become very complex, a clear representation is needed
Examples of computational graphs:
(a) z = x y
(b) ŷ = σ(x^T w + b)  (logistic regression)
(c) H = max{0, XW + b}
(d) ŷ = x^T w and u^(3) = λ Σ_i w_i²  (w is used to perform multiple operations)
[Figure 6.8 (Goodfellow et al.): examples of computational graphs. (a) The graph using the × operation to compute z = xy. (b) The graph for the logistic regression prediction ŷ = σ(x^T w + b); intermediate expressions that have no name in the algebraic expression are named u^(i) in the graph. (c) The graph for the expression H = max{0, XW + b}, which computes a design matrix of rectified linear unit activations H given a design matrix containing a minibatch of inputs X. (d) Examples a-c apply at most one operation to each variable, but it is possible to apply more: here the weights w of a linear regression model are used to make both the prediction ŷ and the weight decay penalty λ Σ_i w_i².]
Computational Graphs
[Figure: a computational graph with input nodes u1, u2, u3, intermediate nodes u4, ..., u16, and loss node u17; θ = [u4, ..., u16].]
We define each variable in the network with a node u_i. The nodes in the first layer denote the input variables, and the last node denotes the loss function. The input could be just one minibatch. Each link defines the inputs to a node, and each node is associated with a function f_i of its inputs.
Computing Gradients
• The main objective is to compute the derivatives of the loss node with respect to the input nodes,
∂u17/∂u_i,  i = 1, 2, 3
Because of the compositional structure of the network (and therefore of the computational graph), the natural tool to compute derivatives is the chain rule. Moreover, to compute the gradients of the loss with respect to the inputs in a computationally and memory efficient way, we need to implement the chain rule carefully.
Chain Rule
• The derivatives of a function composition can be computed via the chain rule: for z = f(y) with y = g(x),
dz/dx = (dz/dy)(dy/dx)
The same rule can be applied to matrices and tensors simply by vectorizing them, applying the above equation, and finally reshaping back to the original shape. Indeed, we can think of a tensor as a vector whose indices have multiple coordinates.
Computational Challenges
• Consider the chain x = f(w), y = f(x), z = f(y), with the same f : R → R applied at every step. By the chain rule,
∂z/∂w = (∂z/∂y)(∂y/∂x)(∂x/∂w)
      = f′(y) f′(x) f′(w)
      = f′(f(f(w))) f′(f(w)) f′(w)
[Figure 6.9 (Goodfellow et al.): a computational graph that results in repeated subexpressions when computing the gradient.]
The second form suggests an implementation in which we compute the value of f(w) only once and store it in the variable x: this is the approach taken by the back-propagation algorithm. The third form, where the subexpression f(w) appears more than once, corresponds to recomputing f(w) each time it is needed. When the memory required to store the values of these expressions is low, the back-propagation approach is clearly preferable because of its reduced runtime; however, the recomputation approach is also a valid implementation of the chain rule, and it is useful when memory is limited.
Repeating the same calculation is wasteful, but keeping all these calculations in memory (to avoid repeating them) can also become unmanageable: storing all the intermediate Jacobians may require too many memory resources.
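A minimal sketch of the two implementations of this chain rule (the choice f = tanh is an illustrative assumption):

import numpy as np

f  = np.tanh
df = lambda a: 1.0 - np.tanh(a)**2   # derivative of f

w = 0.7

# Back-propagation style: the forward pass stores the intermediates x, y
x = f(w); y = f(x); z = f(y)
dz_dw_stored = df(y) * df(x) * df(w)

# Memory-saving style: recompute f(w) and f(f(w)) wherever they appear
dz_dw_recomputed = df(f(f(w))) * df(f(w)) * df(w)

print(np.allclose(dz_dw_stored, dz_dw_recomputed))   # True: same value, different cost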
[Figure: the computational graph from before, with the gradient nodes added.]
The original graph can be used to compute the loss given the input. To compute the gradient we need to duplicate each node (and allocate the associated memory).
Forward Propagation
• The forward propagation is computed by propagating the inputs through the nodes, from left to right, until the loss node is reached
• The input nodes hold the inputs: u1 = x1, u2 = x2, u3 = x3
• Each subsequent node applies its function to its parents:
u_i = f_i(A_i),  A_i = {u_j | j ∈ Parents(u_i)}
The parents of a node are the arguments of the function that computes the value of that node.
• First layer:
u4 = f4(u1, u2, u3),  u5 = f5(u1, u2, u3),  u6 = f6(u1, u2, u3),  u7 = f7(u1, u2, u3)
• Second layer:
u8 = f8(u4),  u9 = f9(u4, u5, u6),  u10 = f10(u4, u5, u6, u7),  u11 = f11(u5, u6, u7),  u12 = f12(u7)
• Third layer:
u13 = f13(u8, u9),  u14 = f14(u9, u10),  u15 = f15(u10, u11),  u16 = f16(u11, u12)
• Finally, the loss node u17 combines u13, ..., u16 with the target y
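A minimal sketch of forward propagation on this graph, evaluating the nodes in topological order (the per-node functions f_i are illustrative placeholders; only the parent structure is from the slides):

import numpy as np

# Parents of each node, as in the slides (node 17 also consumes the target y)
parents = {
    4: [1, 2, 3], 5: [1, 2, 3], 6: [1, 2, 3], 7: [1, 2, 3],
    8: [4], 9: [4, 5, 6], 10: [4, 5, 6, 7], 11: [5, 6, 7], 12: [7],
    13: [8, 9], 14: [9, 10], 15: [10, 11], 16: [11, 12], 17: [13, 14, 15, 16],
}

f = lambda *args: np.tanh(sum(args))   # placeholder for every f_i

u = {1: 0.5, 2: -1.0, 3: 2.0}          # inputs: u1 = x1, u2 = x2, u3 = x3
for i in range(4, 18):                 # topological order: u_i = f_i(A_i)
    u[i] = f(*(u[j] for j in parents[i]))
print(u[17])                           # value at the loss node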
Backward Propagation
• The backward propagation is an efficient way to compute the derivatives of u17 with respect to all the other variables
• It propagates information from the loss node on the right to the parent nodes on the left, up to the input nodes; in this way we avoid the computation of repeated gradients
• The last hidden layer is reached directly from the loss node:
∂u17/∂u13 = ∂f17(u13, u14, u15, u16, y)/∂u13
and similarly for u14, u15, u16
• More in general, we can use the chain rule in recursive form and exploit the previous calculations:
∂un/∂uj = Σ_{i : j ∈ Pa(u_i)} (∂un/∂u_i)(∂u_i/∂uj)
• For example,
∂u17/∂u9 = Σ_{i : 9 ∈ Pa(u_i)} (∂u17/∂u_i)(∂u_i/∂u9)
where the terms ∂u17/∂u_i have already been computed at the previous iteration.
• Explicitly, for the second-to-last layer:
∂u17/∂u8 = (∂u17/∂u13)(∂u13/∂u8)
∂u17/∂u9 = (∂u17/∂u13)(∂u13/∂u9) + (∂u17/∂u14)(∂u14/∂u9)
...
∂u17/∂u12 = (∂u17/∂u16)(∂u16/∂u12)
Note: in the recursive formula, the factors ∂un/∂u_i are gradients that have already been computed (and stored) and can be reused, while the factors ∂u_i/∂uj are the new gradients between directly connected nodes (these are defined as soon as the function implementing the forward pass from one layer to the next is defined).
The back-propagation algorithm thus defines an efficient method to compute the gradients while sacrificing memory. Derivative computations grow linearly with the number of edges.
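A minimal sketch of this reverse accumulation on the same graph (again, the placeholder f_i(args) = tanh(sum(args)) is an illustrative assumption, so each local derivative ∂u_i/∂u_j equals 1 − u_i²):

import numpy as np

parents = {
    4: [1, 2, 3], 5: [1, 2, 3], 6: [1, 2, 3], 7: [1, 2, 3],
    8: [4], 9: [4, 5, 6], 10: [4, 5, 6, 7], 11: [5, 6, 7], 12: [7],
    13: [8, 9], 14: [9, 10], 15: [10, 11], 16: [11, 12], 17: [13, 14, 15, 16],
}

# Forward pass (stores every u_i for reuse during the backward pass)
u = {1: 0.5, 2: -1.0, 3: 2.0}
for i in range(4, 18):
    u[i] = np.tanh(sum(u[j] for j in parents[i]))

# Invert the parent lists: children[j] = {i : j in Pa(u_i)}
children = {}
for i, ps in parents.items():
    for j in ps:
        children.setdefault(j, []).append(i)

# Reverse pass: du17/du_j = sum over i with j in Pa(u_i) of (du17/du_i)(du_i/du_j)
grad = {17: 1.0}
for j in range(16, 0, -1):
    grad[j] = sum(grad[i] * (1.0 - u[i]**2) for i in children[j])
print(grad[1], grad[2], grad[3])   # derivatives of the loss w.r.t. the inputs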
Back-Propagation
Note: instead of assigning values to the functional expressions, one can obtain the functional expressions of the derivatives.

Symbol-to-Symbol
• The idea is to convert the network into another network that also contains the corresponding derivative layers
[Figure 6.10 (Goodfellow et al.): an example of the symbol-to-symbol approach to computing derivatives. In this approach, the back-propagation algorithm never needs to access any actual specific numeric values; instead, it adds nodes to the computational graph describing how to compute the derivatives, and a generic graph-evaluation engine can later compute them for any specific numeric values. Left: a graph representing z = f(f(f(w))). Right: the back-propagation algorithm constructs, alongside the original graph, the graph for the expression dz/dw, built from nodes for f′ and ×.]
Symbol-to-Symbol
• Advantages
Higher-order derivatives can be computed by building the computational graph of the previous (gradient) computational graph. However, the dimensionality of higher-order derivatives makes this option not very useful in practice; other practical solutions are possible (e.g., Krylov methods).
Back-Propagation Forms
• The back-propagation algorithm exploits a special case of the chain rule, written in recursive form
∂un/∂uj = Σ_{i : j ∈ Pa(u_i)} (∂un/∂u_i)(∂u_i/∂uj)
• Alternatively, one could use the sequential form
∂un/∂uj = Σ_{paths (u_{π_1}, ..., u_{π_t}) from π_1 = j to π_t = n}  Π_{k=2}^{t} ∂u_{π_k}/∂u_{π_{k−1}}
The number of paths might grow exponentially with the path length, and this would lead to a computational explosion in the non-recursive formula. The recursive formula can also be seen as computing the gradient via dynamic programming: split the original problem into repeated subproblems, and define the solution of the original problem as a composition of the solutions of the subproblems.
Further Issues
• Memory consumption
• Data types
• Undefined gradients
When is it more efficient to return more than one output? For example, if we need both the maximum of a tensor and the argument of the maximum, it is better to implement a single function that computes both at once rather than two separate nodes. Memory might be challenged by the temporary storage of several intermediate tensors; one workaround is to keep a separate buffer to which the tensors are added as they are computed. Finally, one needs to keep track of undefined gradients.
Extensions
• Automatic simplification of the derivatives (or of the computational graph), as in Theano and TensorFlow
Theano and TensorFlow use known rules to try to simplify the computational graph. When there are k outputs, it is not efficient to repeat the reverse-mode accumulation k times.
• Example: the evaluation order of a matrix product ABCD
When the last factor D is a vector, it is more efficient to multiply from right to left, so that the products are always between a matrix and a vector (each product results in a vector). In other cases it is more efficient to multiply from left to right, so that the products are always between small matrices.
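A minimal sketch of why the evaluation order matters (the sizes are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
n = 200
A, B, C = (rng.normal(size=(n, n)) for _ in range(3))
d = rng.normal(size=n)                 # D is a vector

v = A @ (B @ (C @ d))                  # right to left: three matrix-vector products, O(3 n^2)
u = ((A @ B) @ C) @ d                  # left to right: two matrix-matrix products, O(2 n^3)

print(np.allclose(u, v))               # True: same result, very different cost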