
Deep Feedforward Networks
Paolo Favaro
2

Contents
• Introduction to Feedforward Neural Networks: definition, design, training

• Based on Chapter 6 (and 4) of Deep Learning by Goodfellow, Bengio, Courville

• References to Pattern Recognition and Machine Learning by Bishop
3

Resources
• Books and online material for further studies

• CS231n @ Stanford (Fei-Fei Li)

• Pattern Recognition and Machine Learning by Christopher M. Bishop

• Machine Learning: a Probabilistic Perspective by Kevin P. Murphy
4

Feedforward Neural Networks
• Feedforward networks are a sequence of layers, each processing the output of the previous layer(s)

[Diagram: input x feeds units h1, h2, which feed z1, z2, z3, which feed q1, q2, q3, which produce the output y; annotated with "input", "layers", "output"]

What is a feedforward neural network?

It is a chain of functions (typically "simple" functions) that have a limited scope (i.e., they depend only on a subset of the variables). The dependency is also hierarchical:
each layer takes as inputs the outputs of the previous layer (there is no feedback).
These relations can also be specified by a directed acyclic graph (DAG).
5

Feedforward Neural Networks

[Diagram: the same network, annotated with the common terminology: "first layer" (h1, h2), "second layer" (z1, z2, z3), "output layer"; the middle layers are the "hidden layers"; each node is a "unit" (neuron, activation function); "depth" runs along the sequential processing direction, "width" along the parallel processing direction]

Some common terminology.

The width applies to each layer.
Deep learning refers to the large number of layers used in the latest versions of neural networks.
Typically, hidden layers are not directly associated to a known output.
One unit is also referred to as a neuron. It is largely inspired by neuroscience
and the functional analogy of the neurons in the brain (an attempt to imitate them).
The function used in one unit is also referred to as an activation function (again a loose reference to an analogy in neuroscience).
6

Feedforward Neural Networks

[Diagram: x → (h1, h2) → (z1, z2, z3) → (q1, q2, q3) → y]

h1 = f1,1(x)    z1 = f2,1(h1, h2)    q1 = f3,1(z1, z2, z3)
h2 = f1,2(x)    z2 = f2,2(h1, h2)    q2 = f3,2(z1, z2, z3)
                z3 = f2,3(h1, h2)    q3 = f3,3(z1, z2, z3)

y = f4(q1, q2, q3)

Mathematically we can express the relationships via functional composition.


7

Feedforward Neural Networks

[Diagram: the same network; plot of ReLU(x)]

Example (rectified linear unit): f1,1(x) = ReLU(x) = max{0, x}

An example of a simple function is the rectified linear unit (ReLU).

This is an example of a nonlinear function.
This function is parameter-free.
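As a sketch of how little code this unit takes (numpy, not from the slides):

    import numpy as np

    def relu(x):
        # Rectified linear unit: zero on the negative axis, identity elsewhere
        return np.maximum(0.0, x)

    print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # [0.  0.  0.  1.5]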
8

Feedforward Neural Networks

[Diagram: the same network]

Example (fully connected unit): f2,2(h1, h2) = w1 h1 + w2 h2

Another example of a simple function is the fully connected unit.

This is an example of a linear function.
This function is parametric (the parameters are w1 and w2).
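A corresponding sketch for the fully connected unit (again numpy; the weight values are arbitrary illustrations):

    import numpy as np

    def fully_connected(h, w):
        # Weighted sum of the inputs: w1*h1 + w2*h2 + ...
        return np.dot(w, h)

    h = np.array([0.5, -1.0])
    w = np.array([2.0, 3.0])          # the parameters w1, w2
    print(fully_connected(h, w))      # 0.5*2 + (-1)*3 = -2.0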
9

Feedforward Neural Networks

[Diagram: the same network]

Hierarchical composition of functions:

y = f4(q1, q2, q3) = f4(f3,1(f2,1(f1,1(x), f1,2(x)), …), …)

Although each layer may implement a very simple function, the composition of several simple functions quickly becomes a very complex one.
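A minimal numpy sketch of such a composition (the layer sizes and random weights are assumptions for illustration, not the network of the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)

    # three simple layers: affine + ReLU, affine + ReLU, affine
    W1 = rng.normal(size=(4, 3))
    W2 = rng.normal(size=(4, 4))
    W3 = rng.normal(size=(1, 4))

    h = np.maximum(0.0, W1 @ x)   # first hidden layer
    z = np.maximum(0.0, W2 @ h)   # second hidden layer
    y = W3 @ z                    # output layer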
10

Feedforward Neural Networks
• Feedforward neural networks define a family of functions f(x; θ)

• The goal is to find parameters θ that define the best mapping

y = f(x; θ)

between input x and output y

• The key constraints are the I/O dependencies

The fundamental property of feedforward neural networks is the way they define the family of functions f. As in the previous slides, the main constraint is in the (compositional) dependencies.
11

Deploying a Neural Network


• Given a task (in terms of I/O mappings)

• We need

• Cost function

• Neural network model (e.g., choice of units,


their number, their connectivity)

• Optimization method (back-propagation)

First, we need to have a clear objective. In ML we have already mentioned that this is done by means of (input, output) pairs (x, y): the training set.
Then we need to choose how we will penalize mistakes in the estimated mapping (e.g., 0-1 loss, L2, etc.). Then we need to design the network. This is largely an art at this stage. Common practice is to start from networks that fit the task and that have been proven to work on similar problems. Then one modifies the network and uses diagnosis and error-analysis tools to guide the changes. Finally, the training will require choosing an optimizer to minimize the cost function. We will use a gradient-based method, also referred to as back-propagation.
12

Example: Learning XOR

• The objective function is the XOR operation between two binary inputs x1 and x2

• The training set of (x, y) pairs is

{ ([0, 0]ᵀ, 0), ([0, 1]ᵀ, 1), ([1, 0]ᵀ, 1), ([1, 1]ᵀ, 0) }

[Figure 6.1: Solving the XOR problem by learning a representation. Two plots: the original x space and the learned h space. A linear model applied directly to the original input cannot implement the XOR function: when x1 = 0 the output must increase as x2 increases, while when x1 = 1 it must decrease as x2 increases, but a linear model must apply a fixed coefficient w2 to x2. In the space of features extracted by a neural network, the two points that must output 1, x = [1, 0]ᵀ and x = [0, 1]ᵀ, collapse to a single point h = [1, 0]ᵀ, and a linear model can describe the function as increasing in h1 and decreasing in h2. Here the learned features only increase model capacity so that the training set can be fit; in more realistic applications, learned representations also help the model to generalize.]

Let us use an example to get a sense for the whole process. We look at the case of data in a XOR configuration (an example that is often used as a toy problem to analyze classifiers).
13

Cost Function

• Let us use the Mean Squared Error (MSE) as a first attempt

J(θ) = (1/4) Σ_{i=1}^{4} (y^i − f(x^i; θ))²
14

Linear Model
• Let us try a linear model of the form

f(x; w, b) = wᵀx + b

• This choice leads to the normal equations (see slides on Machine Learning Review) and the following values for the parameters

w = [0; 0],   b = 1/2

So the output is a constant (b) for any input, which is largely unsatisfactory.
Graphically we can see that the plane with the smallest vertical distance from each 2D point in the training set is indeed the one with w = [0 0]ᵀ and b = 1/2.
15

Nonlinear Model
• Let us try a simple feedforward network with one hidden layer and two hidden units

[Figure 6.2: the XOR network drawn in two styles. Left: every unit is a node (x1, x2 → h1, h2 → y); explicit but verbose for large networks. Right: one node per layer vector (x → h → y), with edges annotated by the parameters: a matrix W maps x to h and a vector w maps h to y; intercept parameters are typically omitted.]
16

Nonlinear Model
• If each activation function is linear then the composite function would also be linear

• We would have the same poor result as before

• We must consider nonlinear activation functions

[Figure 6.2 as on the previous slide]
17

Nonlinear Model

f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b

y(h) = wᵀh + b
h(x) = ReLU(Wᵀx + c)

For simplicity, let us group together all units in the same layer into vectors x and h.
18

Optimization

f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b

At this stage we would use optimization to fit f to the y in the training set. In this example, we skip this step and assume that some oracle gives us the parameters

W = [1 1; 1 1],   c = [0; −1],   w = [1; −2],   b = 0

Let us consider these settings and then compute the output of the function f on the training set.
19

Simulation

X = [0 0; 0 1; 1 0; 1 1]

XW + 1cᵀ = [0 −1; 1 0; 1 0; 2 1]

max{0, XW + 1cᵀ} = [0 0; 1 0; 1 0; 2 1]

max{0, XW + 1cᵀ} w + 1b = [0; 1; 1; 0]   (matches Y, the XOR function)

[Figure 6.2 as before. The activation function g is applied element-wise, h_i = g(xᵀW:,i + c_i); in modern neural networks the default recommendation is the rectified linear unit (ReLU).]
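The whole simulation fits in a few lines of numpy; a minimal sketch reproducing the steps above:

    import numpy as np

    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    W = np.array([[1., 1.], [1., 1.]])
    c = np.array([0., -1.])
    w = np.array([1., -2.])
    b = 0.0

    H = np.maximum(0.0, X @ W + c)   # rows: [0,0], [1,0], [1,0], [2,1]
    y = H @ w + b
    print(y)                         # [0. 1. 1. 0.], the XOR truth table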
20

Step-by-Step Analysis

[Diagram: a training-set sample x^i enters the network model (hidden layers and output layer) producing f(x^i; θ); together with the target y^i it feeds the cost function loss(y^i, f(x^i; θ)), which drives the optimization]

Now that we have seen an example of how to design the cost function, a model (and motivated the need for nonlinearity), and analysed the performance, we can present more in depth each item in the design of a machine learning algorithm: the cost function, the model (the neural network and, in particular, its hidden layers), and the optimization procedure.


21

Cost Function

• Based on the conditional distribution p_model(y|x; θ)

• Maximum Likelihood (i.e., cross-entropy between model pdf and data pdf)

min_θ E_{x,y∼p̂_data}[−log p_model(y|x; θ)]

Let's start with the choice of cost function.

One option is to recover the whole probability density function of y|x.
This can be achieved with the cross-entropy minimization, which is equivalent to using maximum likelihood.
22

Saturation
• Functions that saturate (have flat regions) have a very small gradient and slow down gradient descent

• We choose loss functions that have a non-flat region when the answer is incorrect (they might be flat otherwise)

• E.g., exponential functions exp((1 − 2y)z), with a binary variable y ∈ {0, 1}, saturate in the negative domain; we map errors to the non-flat region and then minimize

[plot of exp((1 − 2y)z): flat where the answer is correct, steep where there is an error]

• The logarithm also helps with saturation (see next slides)

In practice the cross-entropy is preferred over other methods (e.g., based on statistics) because it is numerically more stable.
In the example above, we want z to have the same sign as 2y − 1, where y = 0 or y = 1.
If we minimize the exponential form given above, we achieve this purpose.
The gradient will be nonzero when there is a mismatch (so gradient descent will not stop).
When there is a match, the gradient can become zero.
23

Cost Function

• Based on conditional statistics f(x; θ) of y|x

• For example

f* = argmin_f E_{x,y∼p̂_data} |y − f(x; θ)|²

gives the conditional mean

f* = E_{y∼p̂_data(y|x)}[y]

Another option is to directly estimate some statistics of y|x. The L2 loss yields the conditional mean. Another well-known option is the L1 loss, which yields the conditional median.
24

Output Units
• The choice of the output representation (e.g., a probability vector or the mean estimate) determines the cost function

• Let us denote with

h = f(x; θ)

the output of the layer before the output unit


25

Linear Units
• With a little abuse of terminology, linear units include affine transformations

ŷ = Wᵀh + b

which can be seen as the mean of the conditional Gaussian distribution (in the Maximum Likelihood loss)

p(y|x) = N(y; ŷ, I)

• The Maximum Likelihood loss becomes

−log p(y|ŷ) = (1/2)|y − ŷ|² + const

Linear units are easy to work with (during training the gradients are simple and stable).
26

Softplus
• The softplus function is defined as

ζ(x) = log(1 + exp(x))

and it is a smooth approximation of the Rectified Linear Unit (ReLU)

x⁺ = max(0, x)

[Figure 3.4: plot of the softplus function]
27

Sigmoid Units
• Used to predict binary variables or to predict the probability of binary variables

p(y = 1|x) ∈ [0, 1]

• The sigmoid unit defines a suitable mapping and has no flat regions (useful in gradient descent)

ŷ = σ(wᵀh + b)

where we have used the logistic sigmoid function

σ(x) = 1 / (1 + e^{−x})

[Figure 3.3: plot of the logistic sigmoid function]

In practice, the sigmoid unit is a combination of a linear unit and the logistic sigmoid function.
The lack of (strongly) flat regions helps gradient descent. When regions are flat the gradients become zero and there is no useful update to the parameters (an extremum has been reached).
Notice that the logistic sigmoid function is the derivative of the softplus function.
28

Bernoulli Parametrization
• Let z = wᵀh + b. Then, we can define the Bernoulli distribution

p(y|z) = σ((2y − 1)z)

• The loss function with Maximum Likelihood is then

−log p(y|z) = ζ((1 − 2y)z) ≈ max(0, (1 − 2y)z)

and saturation occurs only when the output is correct (y = 0 and z < 0, or y = 1 and z > 0)

When the output is incorrect the gradient changes linearly with z.

This is a desirable, stable behaviour in the algorithm.
This behaviour is not guaranteed when we use other loss functions (e.g., least squares).
Thus, ML is recommended when using the sigmoid.
29

Smoothed Max
• An extension of the softplus function is the smoothed max

log Σ_j exp(z_j)

which gives a smooth approximation to max_j z_j

• If we rewrite the softplus function as

log(1 + exp(z)) = log(exp(0) + exp(z))

we can see that it is the case with z1 = 0, z2 = z

This is a generalisation of softplus that is used in the Softmax output layer.


30

Softmax Units
• An extension of the logistic sigmoid to multiple variables

• Used as the output of a multi-class classifier

• The Softmax function is defined as

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

• Shift-invariance, softmax(z + 1c) = softmax(z), gives a numerically stable implementation:

softmax(z − 1 max_j z_j) = softmax(z)

The softmax gives a probability over multiple classes.


It is automatically normalized.
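A minimal numpy sketch of the numerically stable implementation based on shift-invariance:

    import numpy as np

    def softmax(z):
        # Shifting by max(z) leaves the result unchanged (shift-invariance)
        # but prevents exp() from overflowing
        e = np.exp(z - np.max(z))
        return e / e.sum()

    print(softmax(np.array([1000., 1001., 1002.])))   # finite, sums to 1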
31

Softmax Units
• In Maximum Likelihood we have

log softmax(z)_i = z_i − log Σ_j exp(z_j)

• Recall the smoothed max; then we can write

log softmax(z)_i ≈ z_i − max_j z_j

• Maximization, with i = argmax_j z_j, yields

softmax(z)_i = 1 and softmax(z)_{j≠i} = 0

ML leads softmax to a vector with one 1 and all the rest at 0.


32

Softmax Units
• Softmax is an extension of the logistic sigmoid where we have 2 variables and z1 = 0, z2 = z

p(y = 1|x) = softmax(z)_2 = σ(z_2)

• Softmax is a winner-take-all formulation

• Softmax is more related to the argmax function than to the max function

Because probabilities must add up to 1, we can parametrize a probability over n variables with an (n−1)-dimensional vector. In the case of binary variables this is convenient because we only need to keep track of one probability, and the logistic sigmoid is a practical parametrization choice. However, with multi-dimensional variables it is often simpler to implement the full n-dimensional probability vector.
33

General Output Units


• A neural network can be written as a function f(x; θ)

• This function could output the value of y or parameters ω of the pdf of y, such that f(x; θ) = ω

• In this case the loss function (with ML) is

−log p(y; ω(x))

• For example, the parameters could represent the mean and precision of the Gaussian distribution of y
34

General Output Units


• Mixture density models are used for multimodal probability densities (i.e., multi-peaked outputs)

p(y|x) = Σ_i p(c = i|x) N(y; μ^i(x), Σ^i(x))

• The parameters include

p(c = i|x)
μ^i(x)
Σ^i(x)

[Figure 6.4: Samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution and the output y is sampled from p_model(y|x). The network learns nonlinear mappings from the input to the parameters of the output distribution: the probabilities governing which of three Gaussian mixture components generates the output, and the predicted mean and variance of each component. All of these vary with the input x in nonlinear ways.]

In the example above we sample x uniformly and then we sample from p(y|x). As we can see, depending on x we have different multi-modal distributions. The concentration of y samples takes place at different locations, and these locations change with x.
35

Hidden Units
• The design of a neural network is so far still an art

• The basic principle is the trial-and-error process:

1. Start from a known model
2. Modify
3. Implement and test (go back to 2. if needed)

• A good choice is to always use ReLUs

• In general the hidden unit picks a g for

h(x) = g(Wᵀx + b)
36

Rectified Linear Units


• ReLUs typically also use an affine transformation

g(z) = max{0, z}

[plot of g(z) = max{0, z}]

• A good initialization is b = 0.1 (initially, an almost linear layer)

• The negative axis cannot learn due to the null gradient

• Generalizations help avoid the null gradient


37

Leaky ReLUs and More


• A generalization of ReLU is

g(z, α) = max{0, z} + α min{0, z}

• To avoid a null gradient the following are in use

1. Absolute value rectification: α = −1
2. Leaky ReLU: α = 0.01
3. Parametric ReLU: α learnable
4. Maxout units: g(z)_i = max_{j∈S_i} z_j, with ∪_i S_i = {1, …, m} and S_i ∩ S_j = ∅ for i ≠ j

Maxout units generalize all the other units. However, they may be more difficult to train.
Their redundancy in the representation also allows them to be more robust to catastrophic forgetting (the network forgets how to perform a task that was previously
learned — transfer learning).
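A small numpy sketch of these generalizations (the group size in maxout is an arbitrary choice for illustration):

    import numpy as np

    def generalized_relu(z, alpha):
        # alpha = -1: absolute value rectification; alpha = 0.01: leaky ReLU;
        # a learnable alpha gives the parametric ReLU
        return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

    def maxout(z, k):
        # Partition the pre-activations into groups of size k (the sets S_i)
        # and keep the maximum of each group
        return z.reshape(-1, k).max(axis=1)

    z = np.array([-2.0, 3.0, 1.0, -4.0])
    print(generalized_relu(z, 0.01))   # leaky ReLU
    print(maxout(z, 2))                # [3. 1.]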
38

Hyperbolic Tangent
• The hyperbolic tangent is defined as

g(z) = tanh(z)

[plot of tanh(z)]

and it is related to the logistic sigmoid via

tanh(z) = 2σ(2z) − 1

The hyperbolic tangent and the logistic sigmoid were used quite widely as the nonlinear components in neural networks. However, their saturation properties can make
gradient-based learning very difficult. Thus, they are now mostly substituted by ReLUs. These networks can be used as output units especially when the loss function
“undoes” the saturation.
They are used often in recurrent neural networks, probabilistic models and autoencoders.
39

Other Units

• Linear projection: W = VU

• Radial Basis Functions: h_i(x) = exp(−(1/σ_i²) |x − W:,i|²)

• Softplus: g(z) = ζ(z) = log(1 + exp(z))

• Hard tanh: g(z) = max{−1, min{+1, z}}

Many other hidden units have been tested and published. When they perform on par with the existing ones, they are not deemed useful or interesting.
The linear projection allows factorizations.
Instead of using an affine transformation W we can first apply a projection U (possibly to a lower-dimensional space) and then a projection V. The (low-rank) factorization VU of W uses fewer parameters than a general W matrix.
The skip connection is a special case that implements the identity function.
The RBF saturates for most x values, so it is also difficult to optimize.
The softplus seems to be a better option than ReLU because it is always differentiable. However, experimentally it performs worse than ReLU.
40

Network Design

• The network architecture is the overall structure of


the network: number of units and their connectivity

• Today, the design for a task must be found


experimentally via a careful analysis of the training
and validation error
41

Universal Approximation
• Theorem
A feedforward network with a linear output layer and enough (but at least one) hidden nonlinear layers (e.g., with logistic sigmoid units) can approximate up to any desired precision any (Borel measurable) function between two finite-dimensional spaces

• This means that neural networks provide a


universal representation/approximation

An example of Borel measurable function is any continuous function on a closed and bounded subset of R^n.
42

Universal Approximation

• However, we are not guaranteed that the learning


algorithm will be able to build that representation

• Learning might fail to find some good parameters

• Learning might fail due to overfitting (see “No Free


Lunch” theorem)

43

Depth
• A general rule is that depth helps generalization

• It is better to have many simple layers than few highly complex ones

[Figure 6.6: test accuracy (92% to 96.5%) versus number of hidden layers (3 to 11) when transcribing multi-digit numbers from photographs of addresses; data from Goodfellow et al. (2014d). Test set accuracy consistently increases with depth. See figure 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.]

The figure shows that deeper networks generalize better than shallower ones.
44
Depth
• Other network modifications do not have the same effect

[Figure 6.7: test accuracy versus number of parameters (up to 10^8) for networks of depth 3 and 11, varying the size of the convolutional or fully connected layers; from Goodfellow et al. (2014d). Increasing the number of parameters without increasing depth is not nearly as effective at increasing test set performance: shallow models overfit at around 20 million parameters, while deep ones benefit from over 60 million. A deep model expresses a useful preference for functions composed of many simpler functions, e.g., representations composed in turn of simpler representations (corners defined in terms of edges) or programs with sequentially dependent steps (first locate a set of objects, then segment them, then recognize them).]

Depth increases the number of parameters of the network (and thus its capacity). Increasing the parameters by working on other factors does not have the same impact on the performance. Moreover, all shallow models overfit.
45

Depth
• Another interpretation is that depth allows a more gradual abstraction

[Figure 1.2: Illustration of a deep learning model. The visible layer (input pixels) is followed by hidden layers that extract increasingly abstract features: the first identifies edges by comparing the brightness of neighboring pixels; the second finds corners and extended contours as collections of edges; the third detects entire object parts as collections of contours and corners; the output recovers the object identity (car, person, animal). Images reproduced with permission from Zeiler and Fergus (2014).]

A deep network might give a useful representation, where concepts are gradually more and more abstract. A learning problem is solved via the detection of simple underlying factors of variation that can also be described in a similar way via the detection of more abstract factors of variation.
46

Depth
• Another interpretation is that the network implements a gradient descent algorithm for

x̃ = argmin_x E[x|y, ω]

where the network represents f(x_0; θ) = x_{t+1} and

x_1 = x_0 − α∇E[x_0|y, ω],  …,  x_{t+1} = x_t − α∇E[x_t|y, ω]

This interpretation defines the network explicitly, and the depth corresponds to the number of gradient descent iterations (the more iterations, the better the final accuracy of the function); each step performs a local linearized operation (that is, a relatively simple one).
47

Optimization
• Given a task we define

• The training data {x^i, y^i}_{i=1,…,m}

• A network design f(x; θ)

• The loss function J(θ) = Σ_{i=1}^{m} loss(y^i, f(x^i; θ))

• Next, we optimize the network parameters θ

• This operation is called training

We now look in more detail at how we can improve the network parameters so that the network makes better predictions on the training set.
48

Optimization
• The MSE cost function J(θ) is convex with a linear model

[plot of a convex J(θ) with its global optimum marked]

We used the MSE as our loss function in the initial (XOR) example. With a linear model the optimization problem is convex and has a unique global optimum.
While we could get a closed-form solution in this case, the solution was not good.
Then, we considered a neural network, which made the model nonlinear and thus the loss function nonconvex.
49

Optimization
• However, since the cost function J(θ) is typically nonconvex in the parameters, we use an iterative solution

• We consider the gradient descent method

θ_{t+1} = θ_t − α∇J(θ_t)

where α > 0 is the learning rate

In the more general case, we need to resort to iterative solutions such as gradient descent. This method takes a scaled version of the negative of the gradient as the update for the previous set of parameters.
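A minimal sketch of the iteration (the example cost function and the settings are illustrative assumptions):

    def gradient_descent(grad_J, theta0, alpha=0.1, steps=100):
        # Move against the gradient, scaled by the learning rate alpha
        theta = theta0
        for _ in range(steps):
            theta = theta - alpha * grad_J(theta)
        return theta

    # J(theta) = (theta - 3)^2, so grad J(theta) = 2*(theta - 3)
    print(gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0))   # ~3.0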
50

When we want to reach the valley and we start from a random position we could follow the slope at our current location. This is the principle of gradient descent.
51

Optimization
gradient descent: θ_{t+1} = θ_t − α∇J(θ_t)

[Diagram: a convex J(θ) in 1D. Where ∇J(θ_t) < 0 (negative gradient) the update moves θ to the right; where ∇J(θ_t) > 0 (positive gradient) it moves θ to the left.]

Let us illustrate the way gradient descent works when the loss is convex and in 1D. In this case it is immediate to see how the negative of the gradient always points towards the global minimum.
52

Local Minima
• Does gradient descent reach a (local) minimum even with a nonconvex function?

[plot of a nonconvex J(θ)]

• How do we show that?

How do we ensure that, regardless of where we are in the domain (the horizontal axis in the figure), gradient descent moves towards the local minimum?
53

Assumptions
• Gradient descent will reach a local minimum under some constraints on both the cost function and the learning rate

• We show that by using a smoothness assumption called Lipschitz continuity on ∇J(θ)

[Diagram: ∇J(θ) with a red cone translated along it; the function never enters the cone]

Gradient descent requires some additional constraints to converge to a local minimum.


Understanding these constraints is key to know what to do when gradient descent does not work.

One such constraint is Lipschitz continuity of the gradient of the cost function (notice that it applies to the gradient, not to the cost function itself).

Lipschitz continuity defines the maximum slope in the whole domain. In the illustration Lipschitz continuity states that there exists a cone (red lines) that can be
translated at any point of the function and such that the function will not enter it (the function will always be in the red-shaded region).
54

Convergence
If we assume (Lipschitz continuity)

∃ L ≥ 0: |∇J(θ) − ∇J(θ̄)| ≤ L|θ − θ̄|,  ∀ θ, θ̄

then for a small enough learning rate α the gradient descent iteration will generate a decreasing sequence θ_1, …, θ_T such that*

J(θ_{t+1}) < J(θ_t)

if ∇J(θ_t) ≠ 0; i.e., it will converge to a local minimum.

*See Tutorial 2 of the Machine Learning Course

Under mild conditions, we can show that there is a suitable small learning rate such that gradient descent is guaranteed to converge to a local minimum.
55

Diagnosing GD
• In practice, what do we do when gradient descent
does not work?

• Since it must work when the assumptions are true,


it must be that the assumptions are violated

• The next step is to determine which assumptions


are violated

• There are two: Lipschitzianity and small learning


rate
56

Diagnosis 1/2
• Case 1: Lipschitzianity

• The cost function does not satisfy the Lipschitz condition for any L

• Solution: Smooth the cost function until an L exists

Diagnosis: this is a purely analytical diagnosis. One can simply look at the cost function formula and determine whether the functions in it may fail to satisfy Lipschitz continuity (of the gradient).
This step should be done first.

57

Diagnosis 2/2
• Case 2: Learning rate

• The learning rate is not small enough

• Solution: Make it smaller until gradient descent starts to work

Diagnosis here is more experimental.

One can compute the magnitude of the gradient of the cost function and compare it to the magnitude of the current parameters. The learning rate should scale the update (the gradient of the cost function) to a small percentage (e.g., 10^-4) of the current parameters' magnitude.
58

Optimization
• For more efficiency, we use the stochastic gradient descent method

• The gradient of the loss function is computed on a small set of samples from the training set

J̃(θ) = Σ_{i∈B⊂{1,…,m}} loss(y^i, f(x^i; θ))

and the iteration is as before

θ_{t+1} = θ_t − α∇J̃(θ_t)

Under some mild requirements (some smoothness of the loss function and a decreasing learning rate) also Stochastic GD (and other similar variations) can be shown to
converge to a local minimum.
Because SGD requires a small subset of all the samples, it can be computed very efficiently with a positive impact on the overall training time.

Earlier we discussed how to set the learning rate. Now we look at the other remaining component in the update equation: the gradient of the loss. How do we obtain it?
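A sketch of the stochastic iteration (the function names, batch size, and the grad_loss callback are assumptions for illustration):

    import numpy as np

    def sgd(grad_loss, theta0, data, alpha=0.01, batch_size=32, epochs=10, seed=0):
        # Each step uses the gradient of the loss on a small random
        # minibatch B instead of the full training set
        rng = np.random.default_rng(seed)
        theta, m = theta0, len(data)
        for _ in range(epochs):
            for _ in range(m // batch_size):
                batch = data[rng.choice(m, size=batch_size, replace=False)]
                theta = theta - alpha * grad_loss(theta, batch)
        return theta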
59

Back-Propagation
• (Stochastic) gradient descent boils down to the
calculation of the loss gradient with respect to the
parameters

• Due to the compositional structure of the network,


the gradient can be computed in many ways

• Back-propagation (or simply backprop) refers to a


particularly computationally efficient procedure for
computing the gradient

Note that backprop does not refer to the whole training algorithm, but just to the specific way used to compute the gradient
60

Computational Graphs
• Because neural networks can quickly become very
complex, a clear representation is needed

• We use a computational graph to formalize the


computations

• Each node is assigned to a variable (e.g., a scalar,


a vector, a matrix, a tensor)

• An operation transforms one or more variables into


another variable
61
Examples

[Figure 6.8: Examples of computational graphs. Intermediate variables that have no name in the algebraic expression are named u^(i) in the graph.]

(a) z = x × y
(b) The logistic regression prediction ŷ = σ(xᵀw + b)
(c) H = max{0, XW + b}: a design matrix of rectified linear unit activations for a minibatch of inputs X
(d) More than one operation applied to the same variable: the weights w of a linear regression model are used both for the prediction ŷ = xᵀw and for the weight decay penalty λ Σ_i w_i²
62

The Computational Graph

[Diagram: a DAG with input nodes u1, u2, u3, intermediate nodes u4, …, u16 arranged in layers, and loss node u17; θ = [u4, …, u16]]

We define each variable in the network with a node u_i. The nodes in the first layer denote the input variables and the last node denotes the loss function. The input could be just one minibatch. Each link defines the inputs to a node, and each node is associated to a function f_i of its inputs.
63

Computing Gradients
• The main objective is to compute the derivatives of the loss node with respect to the input nodes

∂u17/∂u_i,  i = 1, 2, 3

• The loss depends on the input nodes through functional composition

u17 = f17(u13, u14, u15, u16)

Because of the compositional structure of the network (and therefore of the computational graph), the natural tool to compute derivatives is the chain rule. Moreover, to
compute the gradients of the loss with respect to the inputs in a computationally and memory efficient way we need to implement the chain rule carefully.
64

Chain Rule
• The derivatives of a function composition can be computed via the chain rule

• Consider y = g(x) and z = f(g(x)) = f(y)

∂z/∂x = (∂z/∂y)(∂y/∂x)

• In the multivariate case we have

∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i)

or, more compactly, ∇_x z = (∇_x y)ᵀ ∇_y z

The same rule can be applied to matrices and tensors simply by vectorizing them, then by applying the above equations and finally by reshaping them to the original
shape. Indeed we can think of a tensor as a vector whose indices have multiple coordinates.
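As a quick numeric sketch of the compact form (the Jacobian and gradient values are made up for illustration):

    import numpy as np

    # y = g(x) with Jacobian dy/dx, and z = f(y) with gradient grad_y z
    J_yx = np.array([[2.0, 0.0],     # dy1/dx1, dy1/dx2
                     [1.0, 3.0]])    # dy2/dx1, dy2/dx2
    grad_y_z = np.array([0.5, -1.0])

    grad_x_z = J_yx.T @ grad_y_z     # (grad_x y)^T grad_y z
    print(grad_x_z)                  # [ 0. -3.]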
65

Computational Challenges

• In the chain rule, subexpressions may repeat

• Example: a chain w → x → y → z that applies the same function f at every step, x = f(w), y = f(x), z = f(y). Then

∂z/∂w = (∂z/∂y)(∂y/∂x)(∂x/∂w)
      = f′(y) f′(x) f′(w)
      = f′(f(f(w))) f′(f(w)) f′(w)

The second line suggests an implementation in which we compute the value of f(w) only once and store it in the variable x; this is the approach taken by the back-propagation algorithm, clearly preferable for its reduced runtime. In the third line the subexpression f(w) appears more than once and is recomputed each time it is needed; this is also a valid implementation of the chain rule, useful when memory is limited.

Repeating the same calculation is wasteful. Also, keeping these calculations in memory (to avoid repeated calculations) could become unmanageable.
Keeping all the intermediate Jacobians may require too many memory resources.
66

The Computational Graph

[Diagram: the graph u1, …, u17 as before, shown twice: the original graph and a duplicated copy of its nodes]

The original graph can be used to compute the loss given the input. To compute the gradient we need to duplicate each node (and allocate the associated memory).
67

Forward Propagation

The forward propagation can be computed by propagating the inputs through the nodes, from left to right, until the loss node.

Inputs: u1 = x1, u2 = x2, u3 = x3

Each node is computed from its parents:

u_i = f_i(A_i),  A_i = {u_j | j ∈ Parents(u_i)}

The parents of a node are the arguments of the function that computes the value of that node. Layer by layer:

u4 = f4(u1, u2, u3)    u5 = f5(u1, u2, u3)    u6 = f6(u1, u2, u3)    u7 = f7(u1, u2, u3)

u8 = f8(u4)    u9 = f9(u4, u5, u6)    u10 = f10(u4, u5, u6, u7)    u11 = f11(u5, u6, u7)    u12 = f12(u7)

u13 = f13(u8, u9)    u14 = f14(u9, u10)    u15 = f15(u10, u11)    u16 = f16(u11, u12)

Finally, the loss node also receives the target y:

loss function u17 = f17(u13, u14, u15, u16, y)
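A minimal sketch of forward propagation over a toy DAG (the graph, its functions, and the node names are assumptions for illustration, much smaller than the graph in the slides):

    import numpy as np

    # node -> (function, parents); u1..u3 are inputs, u6 plays the loss node
    graph = {
        "u4": (lambda u1, u2: u1 * u2,                  ["u1", "u2"]),
        "u5": (lambda u2, u3: u2 + u3,                  ["u2", "u3"]),
        "u6": (lambda u4, u5: np.maximum(0.0, u4 + u5), ["u4", "u5"]),
    }

    def forward(inputs, order=("u4", "u5", "u6")):
        # Visit nodes in topological order so every parent is computed before
        # its children; store every value for reuse (and for backprop later)
        values = dict(inputs)
        for name in order:
            f, parents = graph[name]
            values[name] = f(*(values[p] for p in parents))
        return values

    print(forward({"u1": 2.0, "u2": 1.0, "u3": 3.0})["u6"])   # 6.0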


74

Backward Propagation
u8
u4 u13
u1
u9
u5 u14
u2 u10 u17
u6 u15
u3 u11

u7 u16
u12

The backward propagation is an efficient way to compute the derivatives of u17 (in green) with respect to all the other variables (in red).
75

Backward Propagation
y

u17

The backward propagation can be computed by propagating information from the loss node on the right to the parent nodes on the left up to the input nodes. In this way
we avoid the computation of repeated gradients.
75

Backward Propagation
u13
y

u14
u17
u15

u16
75

Backward Propagation
@u17 @f17 (u13 , u14 , u15 , u16 , y)
= u13
@u13 @u13 y

@u17 @f17 (u13 , u14 , u15 , u16 , y)


= u14
@u14 @u14
u17
@u17 @f17 (u13 , u14 , u15 , u16 , y)
= u15
@u15 @u15

@u17 @f17 (u13 , u14 , u15 , u16 , y)


= u16
@u16 @u16
76

Backward Propagation
u13

u14
u17
u15

gradients
u16
already
computed

More in general, we can use the chain rule formula and exploit the previous calculations.
76

Backward Propagation
u8
u13

u9
u14
u10 u17
u15
u11
gradients
u16
u12 already
computed
76

Backward Propagation
u8
u13

u9
u14
@un X @un @ui
= u10 u17
@uj @ui @uj
i:j2Pa(ui ) u15
u11
gradients
u16
u12 already
computed
76

Backward Propagation
u8
u13

u9
u14
u10 u17
u15
u11
u16
u12

gradients already computed
76

Backward Propagation
u8
u13
u9
u14
u10 u17
u15
u11
u16
u12

∂u17/∂u9 = Σ_{i : 9 ∈ Pa(u_i)} (∂u17/∂u_i)(∂u_i/∂u9)

gradients already computed
77

Backward Propagation
u13

u14
u17
u15

u16

gradients already computed

These terms (in orange) have already been computed at the previous iteration (on the right-hand side).
77

Backward Propagation
u8
u13

u9
u14
u10 u17
u15
u11
u16
u12

gradients already computed
77

Backward Propagation
u8
u13
u9
u14
u10 u17
u15
u11
u16
u12

∂u17/∂u8 = (∂u17/∂u13)(∂u13/∂u8)
∂u17/∂u9 = (∂u17/∂u13)(∂u13/∂u9) + (∂u17/∂u14)(∂u14/∂u9)
...
∂u17/∂u12 = (∂u17/∂u16)(∂u16/∂u12)

gradients already computed
77

Backward Propagation
u8
u13
u9
u14
u10 u17
u15
u11
u16
u12

∂u17/∂u8 = (∂u17/∂u13)(∂u13/∂u8)
∂u17/∂u9 = (∂u17/∂u13)(∂u13/∂u9) + (∂u17/∂u14)(∂u14/∂u9)
...
∂u17/∂u12 = (∂u17/∂u16)(∂u16/∂u12)

gradients already computed
78

Backward Propagation
u8
u13

u9
u14
u10 u17
u15
u11

u16
u12

Note
78

Backward Propagation
u8
u4 u13

u9
u5 u14
u10 u17
u6 u15
u11

u7 u16
u12
78

Backward Propagation
u8
u4 u13

u9
u5 u14
u10 u17
u6 u15
u11

u7 u16
u12
∂u_n/∂u_j = Σ_{i : j ∈ Pa(u_i)} (∂u_n/∂u_i)(∂u_i/∂u_j)
79

Backward Propagation
u8
u4 u13

u9
u5 u14
u10 u17
u6 u15
u11

u7 u16
u12
∂u_n/∂u_j = Σ_{i : j ∈ Pa(u_i)} (∂u_n/∂u_i)(∂u_i/∂u_j)

These are nodes whose gradients have already been computed (and stored) and can be reused in the chain rule to compute the new gradients.
80

Backward Propagation
u8
u4 u13

u9
u5 u14
u10 u17
u6 u15
u11

u7 u16
u12
∂u_n/∂u_j = Σ_{i : j ∈ Pa(u_i)} (∂u_n/∂u_i)(∂u_i/∂u_j)

These are the new gradients between directly connected nodes (these local gradients are defined as soon as the function implementing the forward pass from one layer to the next is defined).
81

Backward Propagation
u8
u4 u13
u1
u9
u5 u14
u2 u10 u17
u6 u15
u3 u11

u7 u16
u12

The back-propagation algorithm thus defines an efficient method to compute the gradients at the cost of additional memory: derivative computations grow linearly with the number of edges in the graph.
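To make the procedure concrete, here is a minimal sketch of reverse-mode back-propagation (the toy graph and its local derivative functions are illustrative assumptions; real frameworks implement the same recursion over tensors).

def backward(nodes, values, loss_name):
    """nodes: (name, parent_names, local_grads) in topological order, where
    local_grads(values) returns {parent: du_node/du_parent} at those values."""
    grads = {loss_name: 1.0}                 # seed: du_n/du_n = 1
    for name, parents, local_grads in reversed(nodes):
        if name not in grads:                # node does not influence the loss
            continue
        local = local_grads(values)
        for p in parents:
            # recursive chain rule: du_n/du_j += (du_n/du_i)(du_i/du_j)
            grads[p] = grads.get(p, 0.0) + grads[name] * local[p]
    return grads

# Toy graph: u3 = u1 * u2 and loss u4 = u3 + u1
values = {"u1": 2.0, "u2": 3.0}
values["u3"] = values["u1"] * values["u2"]
values["u4"] = values["u3"] + values["u1"]
nodes = [
    ("u3", ("u1", "u2"), lambda v: {"u1": v["u2"], "u2": v["u1"]}),
    ("u4", ("u3", "u1"), lambda v: {"u3": 1.0, "u1": 1.0}),
]
print(backward(nodes, values, "u4"))
# {'u4': 1.0, 'u3': 1.0, 'u1': 4.0, 'u2': 2.0}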
82

Back-Propagation

• The approach seen so far is called symbol-to-number (also symbol-to-value) differentiation

• This is used by libraries such as Caffe and Torch

Note
83

Back-Propagation

• An alternative approach is symbol-to-symbol differentiation

• This is used by libraries such as Theano and TensorFlow

Instead of assigning values to the functional expressions, one obtains the functional expressions of the derivatives.
84

Symbol-to-Symbol

Figure 6.9: A computational graph that results in repeated subexpressions when computing the gradient. Let w ∈ R be the input to the graph. We use the same function f : R → R as the operation that we apply at every step of a chain: x = f(w), y = f(x), z = f(y). To compute ∂z/∂w, we apply equation 6.44 and obtain:

∂z/∂w                               (6.50)
= (∂z/∂y)(∂y/∂x)(∂x/∂w)             (6.51)
= f'(y) f'(x) f'(w)                 (6.52)
= f'(f(f(w))) f'(f(w)) f'(w)        (6.53)

Equation 6.52 suggests an implementation in which we compute the value of f(w) only once and store it in the variable x. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by equation 6.53, where the subexpression f(w) appears more than once. In the alternative approach, f(w) is recomputed each time it is needed. When the memory required to store the value of these expressions is low, the back-propagation approach of equation 6.52 is clearly preferable because of its reduced runtime. However, equation 6.53 is also a valid implementation of the chain rule, and is useful when memory is limited.

Figure 6.10: An example of the symbol-to-symbol approach to computing derivatives. In this approach, the back-propagation algorithm does not need to ever access any actual specific numeric values. Instead, it adds nodes to a computational graph describing how to compute these derivatives. A generic graph evaluation engine can later compute the derivatives for any specific numeric values. (Left) In this example, we begin with a graph representing z = f(f(f(w))). (Right) We run the back-propagation algorithm, instructing it to construct the graph for the expression corresponding to dz/dw. In this example, we do not explain how the back-propagation algorithm works. The purpose is only to illustrate what the desired result is: a computational graph with a symbolic description of the derivative.

The idea is to convert the network into another network with the corresponding derivative layers.

Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach "symbol-to-number" differentiation. This is the approach used by libraries such as Torch (Collobert et al., 2011b) and Caffe (Jia, 2013).

Another approach is to take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives. This is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012) and TensorFlow (Abadi et al., 2015). An example of how this approach works is shown in figure 6.10.
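As an assumption-level sketch (a toy expression graph, not any real library's API), the symbol-to-symbol idea can be implemented by having grad return a new graph that describes the derivative; here z = f(f(f(w))) with f = sin, and the derivative graph is evaluated only when numeric values are supplied.

import math

class Node:
    """A node in a symbolic computational graph: an op plus child nodes."""
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def eval(self, env):
        """Evaluate the graph numerically for the variable values in env."""
        if self.op == "const": return self.args[0]
        if self.op == "var":   return env[self.args[0]]
        vals = [a.eval(env) for a in self.args]
        if self.op == "sin": return math.sin(vals[0])
        if self.op == "cos": return math.cos(vals[0])
        if self.op == "mul": return vals[0] * vals[1]

def grad(node, wrt):
    """Symbol-to-symbol: return a new GRAPH describing d(node)/d(wrt)."""
    if node.op == "const":
        return Node("const", 0.0)
    if node.op == "var":
        return Node("const", 1.0 if node.args[0] == wrt else 0.0)
    if node.op == "sin":                 # chain rule: cos(u) * du/dwrt
        (u,) = node.args
        return Node("mul", Node("cos", u), grad(u, wrt))
    raise NotImplementedError(node.op)

w = Node("var", "w")
z = Node("sin", Node("sin", Node("sin", w)))   # z = f(f(f(w))), f = sin
dz_dw = grad(z, "w")                           # a graph, not a number
print(dz_dw.eval({"w": 0.5}))                  # numeric value on demand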
85

Symbol-to-Symbol

• Advantages

• Derivatives are computed as a forward propagation in another graph

• Higher order derivatives can be easily computed

Higher-order derivatives can be computed by building the computational graph of the previous (gradient) computational graph.
However, the dimensionality of higher-order derivatives often makes this option impractical; other practical solutions are possible (e.g., Krylov methods).
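For instance, assuming a recent TensorFlow installation, nested gradient tapes realize exactly this idea: the first tape builds the graph of the gradient, and the second differentiates that graph again.

import tensorflow as tf

x = tf.Variable(0.5)
with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        y = tf.sin(x)
    dy_dx = inner.gradient(y, x)    # first derivative: cos(x)
d2y_dx2 = outer.gradient(dy_dx, x)  # second derivative: -sin(x)
print(float(d2y_dx2))               # approx. -0.479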
86

Back-Propagation Forms
• The back-propagation algorithm exploits a special
case of the chain rule, written in recursive form
∂u_n/∂u_j = Σ_{i : j ∈ Pa(u_i)} (∂u_n/∂u_i)(∂u_i/∂u_j)

• Alternatively, one could use the sequential form

∂u_n/∂u_j = Σ_{paths (u_{π_1}, …, u_{π_t}) with π_1 = j, π_t = n}  Π_{k=2}^{t} ∂u_{π_k}/∂u_{π_{k-1}}

The number of paths might grow exponentially with the length of the paths, which would lead to a computational explosion in the non-recursive formula.
The recursive formula can also be seen as a way to compute the gradient via dynamic programming: the original problem is split into repeated subproblems, and its solution is defined as a composition of the solutions of the subproblems.
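A back-of-the-envelope comparison (the layer sizes are arbitrary assumptions) shows the gap between the two forms on a fully connected chain of layers:

layers, width = 10, 100   # a chain of 10 fully connected layers of width 100

# Paths between one input node and one output node pass through one node
# per hidden layer, so the sequential form sums width**(layers - 2) products.
paths = width ** (layers - 2)
# The recursive (backprop) form touches each edge only once.
edges = (layers - 1) * width ** 2

print(f"sequential form: ~{paths:.1e} path products")  # ~1.0e+16
print(f"recursive form:  ~{edges:.1e} edge visits")    # ~9.0e+04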
87

Further Issues

• Returning more than one output (> 1 tensor) might be more efficient

• Memory consumption

• Data types

• Undefined gradients (e.g. L1 norm)

When is it more efficient to return more than one output? For example, if we need both the maximum and the argument of the maximum of a tensor, it is better to implement a single function that computes both at once (rather than two separate nodes).
Memory might be strained by the temporary storage of several intermediate tensors (e.g., the Gi in the previous function); one workaround is to keep a separate buffer to which the tensors are added as they are computed.
One must also keep track of undefined gradients.
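A hypothetical fused node (the function name and the use of NumPy are illustrative) for the max/argmax case mentioned above:

import numpy as np

def max_with_argmax(x):
    """One node returning two outputs: a single scan locates the maximum,
    and its value then comes for free."""
    idx = int(np.argmax(x))
    return x[idx], idx

m, i = max_with_argmax(np.array([0.1, 2.5, -1.0]))
print(m, i)  # 2.5 1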
88

Extensions
• Automatic simplification of the derivatives (or the
computational graph) — Theano, TensorFlow

• Reverse mode accumulation (what we have seen so far; efficient with a single output)

• Forward mode accumulation (efficient when there are more outputs than inputs)

Theano and TensorFlow use known rules to try to simplify the computational graph.
When there are k outputs, it is not efficient to repeat the reverse mode accumulation k times.
89

Forward vs Reverse Mode


• The computation of the gradients involves the
products (and sums) of Jacobians

• The order of these products determines the forward (left to right) or reverse (right to left) mode

• Suppose that there is one output: D ∈ R^{m×1}

ABCD → reverse mode is more efficient

In this case it is more efficient to multiply from right to left, so that the products are always between a matrix and a vector (each product yields a vector).
90

Forward vs Reverse Mode


• The computation of the gradients involves the
products (and sums) of Jacobians

• The order of these products determines the forward (left to right) or reverse (right to left) mode

• With multiple outputs: D ∈ R^{m×n}, A ∈ R^{p×q}, and p < n

ABCD → forward mode is more efficient

In this case it is more efficient to multiply from left to right so that the products are always between small matrices.
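A rough FLOP count (with arbitrary illustrative shapes) makes the difference between the two orderings concrete for the Jacobian chain ABCD:

def flops(p, q, r):
    """Cost (up to a constant) of multiplying a p×q matrix by a q×r matrix."""
    return p * q * r

m = 1000  # A, B, C are m×m Jacobians; a single output makes D an m×1 vector

reverse = 3 * flops(m, m, 1)                   # A(B(CD)): matrix-vector products
forward = 2 * flops(m, m, m) + flops(m, m, 1)  # ((AB)C)D: matrix-matrix products
print(f"right-to-left (reverse): {reverse:.1e} FLOPs")  # ~3.0e+06
print(f"left-to-right (forward): {forward:.1e} FLOPs")  # ~2.0e+09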
