Chapter 13: Neural Nets and Deep Learning
In Sections 12.2 and 12.3 we discussed the design of single “neurons” (percep-
trons). These take a collection of inputs and, based on weights associated with
those inputs, compute a number that, compared with a threshold, determines
whether to output “yes” or “no.” These methods allow us to separate inputs
into two classes, as long as the classes are linearly separable. However, most
problems of interest and importance are not linearly separable. In this chapter,
we shall consider the design of neural nets, which are collections of perceptrons,
or nodes, where the outputs of one rank (or layer) of nodes become the inputs
to nodes at the next layer. The last layer of nodes produces the outputs of
the entire neural net. The training of neural nets with many layers requires
enormous numbers of training examples, but has proven to be an extremely
powerful technique, referred to as deep learning, when it can be used.
We also consider several specialized forms of neural nets that have proved
useful for special kinds of data. These forms are characterized by requiring that
certain sets of nodes in the network share the same weights. Since learning all
the weights on all the inputs to all the nodes of the network is in general a
hard and time-consuming task, these special forms of network greatly simplify
the process of training the network to recognize the desired class or classes of
inputs. We shall study convolutional neural networks (CNN’s), which are spe-
cially designed to recognize classes of images. We shall also study recurrent neu-
ral networks (RNN’s) and long short-term memory networks (LSTM’s), which
are designed to recognize classes of sequences, such as sentences (sequences of
words).
Example 13.1 : The problem we discuss is to learn the concept that “good”
bit-vectors are those that have two consecutive 1’s. Since we want to deal with
only tiny example instances, we shall assume bit-vectors have length four. Our
training examples will thus have the form ([x1 , x2 , x3 , x4 ], y), where each of the
xi ’s are bits, 0 or 1. There are 16 possible training examples, and we shall
assume we are given some subset of these as our training set. Notice that eight
of the possible bit vectors are good – they do have consecutive 1’s, and there
are also eight “bad” examples. For instance, 0111 and 1100 are good; 1001 and
0100 are bad.1
To start, let us look at a neural net that solves this simple problem exactly.
How we might design this net from training examples is the true subject for
discussion, but this net will serve as an example of what we would like to
achieve. The net is shown in Fig. 13.1.
Figure 13.1: A neural net that tells whether a bit-vector has consecutive 1’s
The net has two layers, the first consisting of two nodes, and the second
with a single node that produces the output y. Each node is a perceptron,
exactly as was described in Section 12.2. In the first layer, the first node is
characterized by weight vector [w1 , w2 , w3 , w4 ] = [1, 2, 1, 0] and threshold 2.5.
Since each input $x_i$ is either 0 or 1, we note that the only way to reach a sum
$\sum_{i=1}^{4} x_i w_i$ as high as 2.5 is if $x_2 = 1$ and at least one of $x_1$ and $x_3$ is also 1.
The output of this node is 1 if and only if the input is one of 1100, 1101, 1110,
1111, 0110, or 0111. That is, it recognizes those bit-vectors that either begin
with two 1’s or have two 1’s in the middle. The only good inputs it does not
recognize are those that end with 11 but do not have 11 elsewhere. These are
0011 and 1011.
1 We shall show bit vectors as bit strings in what follows, so we can avoid the commas.
Fortunately, the second node in the first layer, with weights [0, 0, 1, 1] and
threshold 1.5, gives output 1 whenever x3 = x4 = 1, and not otherwise. This
node thus recognizes the inputs 0011 and 1011, as well as some other good
inputs that are also recognized by the first node.
Now, let us turn to the second layer, with a single node; that node has
weights [1, 1] and threshold 0.5. It thus behaves as an “OR-gate.” It gives
output y = 1 whenever either or both of the nodes in the first layer have output
1, but gives output y = 0 if both of the first-layer nodes give output 0. Thus,
the neural net of Fig. 13.1 gives output 1 for all the good inputs but none of
the bad inputs. ✷
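To make the example concrete, here is a small sketch of the net of Fig. 13.1 in Python with numpy (our own illustration; the function names are not from the text):

```python
import numpy as np

def step(z):
    """Output 1 if z exceeds 0, else 0 (the threshold is already folded in)."""
    return (z > 0).astype(int)

def has_consecutive_ones(x):
    """The net of Fig. 13.1: two first-layer perceptrons and one OR-like output node."""
    x = np.asarray(x)
    # First layer: weights [1,2,1,0] with threshold 2.5, and [0,0,1,1] with threshold 1.5.
    h1 = step(np.dot([1, 2, 1, 0], x) - 2.5)
    h2 = step(np.dot([0, 0, 1, 1], x) - 1.5)
    # Second layer: weights [1,1] and threshold 0.5 -- an OR gate.
    return step(1 * h1 + 1 * h2 - 0.5)

# Every good vector (two consecutive 1's) yields 1; every bad one yields 0.
for bits in [(0,1,1,1), (1,1,0,0), (1,0,0,1), (0,1,0,0)]:
    print(bits, has_consecutive_ones(bits))   # 1, 1, 0, 0
```

Running it confirms that the good vectors 0111 and 1100 yield 1, while the bad vectors 1001 and 0100 yield 0.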
[Figure 13.2: the net of Fig. 13.1 redrawn with each threshold replaced by a weight (−2.5, −0.5, and −1.5, respectively) on an extra constant input.]
important part of the design process for neural nets. Especially, note that the
output layer can have many nodes. For instance, the neural net could classify
inputs into many different classes, with one output node for each class.
[Figure: a generic multilayer network, with an input layer x1, . . . , xn, one or more hidden layers, and an output layer y1, . . . , yk.]
Each layer, except for the input layer, consists of one or more nodes, which
we arrange in the column that represents that layer. We can think of each node
as a perceptron. The inputs to a node are outputs of some or all of the nodes in
the previous layer. So that we can assume the threshold for each node is zero,
we can also allow a node to have an input that is a constant, typically 1, as
we suggested in Fig. 13.2. Associated with each input to each node is a weight.
The output of the node depends on $\sum_i x_i w_i$, where the sum is over all the inputs
$x_i$, and $w_i$ is the weight of that input. Sometimes, the output is either 0 or 1;
the output is 1 if that sum is positive and 0 otherwise. However, as we shall
see in Section 13.2, it is often convenient, when trying to learn the weights for
a neural net that solves some problem, to have outputs that are almost always
close to 0 or 1, but may be slightly different. The reason, intuitively, is that
it is then possible for the output of a node to be a continuous function of its
inputs. We can then use gradient descent to converge to the ideal values of all
the weights in the net.
1. Random. For some m, we pick for each node m nodes from the previous
layer and make those, and only those, be inputs to this node.
2. Pooled. Partition the nodes of one layer into some number of clusters. In
the next layer, which is called a pooled layer, there is one node for each
cluster, and this node has all and only the members of its cluster as inputs.
3. Convolutional. This approach to interconnection, which we discuss in
more detail in the next section and Section 13.4, views the nodes of each
layer as arranged in a grid, typically two-dimensional. In a convolutional
layer, a node corresponding to coordinates (i, j) receives as inputs the
nodes of the previous layer that have coordinates in some small region
around (i, j). For example, the node (i, j) at one convolutional layer may
have as inputs those nodes from the previous layer that correspond to
coordinates (p, q), where i ≤ p ≤ i + 2 and j ≤ q ≤ j + 2 (i.e., the square
of side 3 whose lower-left corner is the point (i, j)).
particular angle, e.g., if the upper left corner is light and the other eight pixels
dark. Moreover, the algorithm for recognizing an edge of a certain type is the
same, regardless of where in the field of vision this little square appears. That
observation justifies the CNN constraint that all the nodes of a layer use the
same weights. In the eye, additional layers combine results from the previous
layers to recognize more and more complex structures: long boundaries, regions
of similar color, and finally faces and all the familiar objects that we see daily.
We shall have more to say about convolutional neural networks in Sec-
tion 13.4. Moreover, CNN’s are only one example of a kind of neural network
where certain families of nodes are constrained to have the same weights. For
example, in Section 13.5, we shall consider recurrent neural networks and long
short-term memory networks, which are specially adapted to recognizing prop-
erties of sequences, such as sentences (sequences of words).
3. In what manner will we interconnect the outputs of one layer to the inputs
of the next layer?
Further, in later sections we shall see that there are other decisions that need
to be made when we train the neural net. These include:
4. What cost function should we minimize to express what weights are best?
! Exercise 13.1.4 : Prove that there is no single perceptron that behaves like
an exclusive-or gate.
to write row vectors, and we shall do so in the text. But in formulas, we shall use the transpose
operator when we actually want to use the vector as a row rather than a column.
The threshold inputs to the hidden layer nodes (i.e., the negatives of the thresh-
olds) form a 2-vector b = [b1 , b2 ], often called the bias vector. The perceptron
applies the nonlinear step function to produce its output, defined as:
$$\mathrm{step}(z) = \begin{cases} 1 & \text{when } z > 0 \\ 0 & \text{otherwise} \end{cases}$$
Each hidden node hi can now be described using the expression:
$$h_i = \mathrm{step}(w_i^T x + b_i) \qquad \text{for } i = 1, 2$$
The logistic sigmoid has several advantages over the step function as a way
to define the output of a perceptron. The logistic sigmoid is continuous and dif-
ferentiable, so it enables us to use gradient descent to discover the best weights.
Since its value is in the range [0, 1], it is possible to interpret the outputs of the
network as a probability. However, the logistic sigmoid saturates very quickly
as we move away from the “critical region” around 0. So the derivative goes
towards zero and gradient-based learning can stall out. That is, weights almost
stop changing, once they get away from 0.
In Section 13.3.3, when we describe the backpropagation algorithm, we shall
see that we need the derivatives of activation functions and loss functions. As
an exercise, you can verify that if y = σ(x), then
$$\frac{dy}{dx} = y(1 - y)$$
The hyperbolic tangent function is:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
Simple algebraic manipulation yields:
tanh(x) = 2σ(2x) − 1
So the hyperbolic tangent is just a scaled and shifted version of the sigmoid.
It has two desirable properties that make it attractive in some situations: its
output is in the range [−1, 1] and is symmetric around 0. It also shares the
good properties and the saturation problem of the sigmoid. You may show that
if y = tanh(x) then
$$\frac{dy}{dx} = 1 - y^2$$
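These activation functions and their derivatives are easy to compute directly. The following sketch (ours, using numpy) also checks the identity $\tanh(x) = 2\sigma(2x) - 1$ numerically:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    y = sigmoid(x)
    return y * (1.0 - y)          # dy/dx = y(1 - y)

def tanh_prime(x):
    y = np.tanh(x)
    return 1.0 - y * y            # dy/dx = 1 - y^2

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(sigmoid_prime(x))           # near 0 at the extremes: the saturation problem
print(tanh_prime(x))
# The identity tanh(x) = 2*sigmoid(2x) - 1:
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True
```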
Figure 13.4: The logistic sigmoid (a) and hyperbolic tangent (b) functions
Figure 13.4 shows the logistic sigmoid and hyperbolic tangent functions.
Note the difference in scale along the x-axis between the two charts. It is easy
to see that the functions are identical after shifting and scaling.
13.2.5 Softmax
The softmax function differs from sigmoid functions in that it does not operate
element-wise on a vector. Rather, the softmax function applies to an entire
vector. If x = [x1 , x2 , . . . , xn ], then its softmax µ(x) = [µ(x1 ), µ(x2 ), . . . , µ(xn )]
where
$$\mu(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Softmax pushes the largest component of the vector towards 1 while pushing all
the other components towards zero. Also, all the outputs sum to 1, regardless
of the sum of the components of the input vector. Thus, the output of the
softmax function can be interpreted as a probability distribution.
A common application is to use softmax in the output layer for a classi-
fication problem. The output vector has a component corresponding to each
target class, and the softmax output is interpreted as the probability of the
input belonging to the corresponding class.
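A direct implementation follows the definition; the only subtlety is numerical: subtracting max(x) before exponentiating avoids overflow without changing the result (a standard trick, not discussed in the text):

```python
import numpy as np

def softmax(x):
    """Softmax over an entire vector. Shifting by max(x) leaves the
    result unchanged but keeps exp() from overflowing."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, p.sum())   # the components sum to 1, a probability distribution
```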
Softmax has the same saturation problem as the sigmoid function, since one
component gets larger than all the others. There is a simple workaround to this
problem, however, when softmax is used at the output layer. In this case, it is
usual to pick cross entropy as the loss function, which undoes the exponentiation
in the definition of softmax and avoids saturation. Cross entropy is explained
later in this section.
The name of this function derives from the analogy to half-wave rectification in
electrical engineering. The function is not differentiable at 0 but is differentiable
everywhere else, including at points arbitrarily close to 0. In practice, we “set”
the derivative at 0 to be either 0 (the left derivative) or 1 (the right derivative).
In modern neural nets, a version of ReLU has replaced sigmoid as the de-
fault choice of activation function. The popularity of ReLU derives from two
properties:
1. The gradient of ReLU remains constant and never saturates for positive
x, speeding up training. It has been found in practice that networks
that use ReLU offer a significant speedup in training compared to sigmoid
activation.
2. Both the function and its derivative can be computed using elementary
and efficient mathematical operations (no exponentiation).
ReLU does suffer from a problem related to the saturation of its derivative
when x < 0. Once a node’s input values become negative, it is possible that the
Figure 13.5: The ReLU (a) and ELU (b) functions, the latter with α = 1
node’s output gets “stuck” at 0 through the rest of the training. This is called
the dying ReLU problem.
The Leaky ReLU attempts to fix this problem by defining the activation
function as follows:
$$f(x) = \begin{cases} x & \text{for } x \ge 0 \\ \alpha x & \text{for } x < 0 \end{cases}$$
where α is typically a small positive value such as 0.01. The Parametric ReLU
(PReLU) makes α a parameter to be optimized as part of the learning process.
An improvement on both the original and leaky ReLU functions is Expo-
nential Linear Unit, or ELU. This function is defined as:
$$f(x) = \begin{cases} x & \text{for } x \ge 0 \\ \alpha(e^x - 1) & \text{for } x < 0 \end{cases}$$
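For reference, all three variants are one-liners in numpy (a sketch of our own; the α values are the typical defaults mentioned above):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))         # [0. 0. 0. 0.5 3.]
print(leaky_relu(x))   # a small negative slope keeps the gradient alive for x < 0
print(elu(x))          # smooth saturation toward -alpha for very negative x
```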
In general, we compute the loss for a set of predictions. Suppose the observed
(i.e., training set) input-output pairs are T = {(x1 , yˆ1 ), (x2 , yˆ2 ), . . . , (xn , yˆn )},
while the corresponding input-output pairs predicted by the model are P =
{(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )}. The mean squared error (MSE) for the set is:
$$L(P, T) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Note that the mean squared error is just the square of the RMSE. It is convenient
to omit the square root to simplify the derivative of the function, which we shall
use during training. In any case, when we minimize MSE we also automatically
minimize RMSE.
One problem with MSE is that it is very sensitive to outliers due to the squared
term. A few outliers can contribute very highly to the loss and swamp out the
effect of other points, making the training process susceptible to wild swings.
One way to deal with this issue is to use the Huber Loss. Suppose z = y − ŷ,
and δ is a constant. The Huber Loss Lδ is given by:
$$L_\delta(z) = \begin{cases} \frac{1}{2} z^2 & \text{if } |z| \le \delta \\ \delta\big(|z| - \frac{1}{2}\delta\big) & \text{otherwise} \end{cases}$$
Figure 13.6 contrasts the squared error and Huber loss functions.
In the case where we have a vector y of outputs rather than a single output,
we use $\|y - \hat{y}\|$ in place of $|y - \hat{y}|$ in the definitions of mean squared error and
Huber loss.
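The effect of the Huber loss on outliers is easy to see numerically. In this sketch (ours), both losses are averaged over the points, and the single outlier dominates MSE but contributes only linearly to the Huber loss:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def huber(y, y_hat, delta=1.0):
    z = np.abs(y - y_hat)
    # Quadratic near zero, linear beyond delta.
    per_point = np.where(z <= delta,
                         0.5 * z ** 2,
                         delta * (z - 0.5 * delta))
    return np.mean(per_point)

y_hat = np.array([1.0, 2.0, 3.0, 4.0])
y     = np.array([1.1, 1.9, 3.2, 14.0])   # one outlier
print(mse(y, y_hat))     # about 25, dominated by the outlier
print(huber(y, y_hat))   # about 2.4; the outlier contributes only linearly
```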
Figure 13.6: Huber Loss (solid line, δ = 1) and Squared Error (dotted line) as
functions of z = y − ŷ
Note that H(p, p) = H(p), and in general H(p, q) ≥ H(p). The difference
between the cross entropy and the entropy is the average number of additional
bits needed per symbol. It is a reasonable measure of the distance between the
distributions p and q, called the Kullback-Leibler divergence (KL-divergence)
and denoted $D(p\|q)$:
$$D(p\|q) = H(p, q) - H(p) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}$$
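A small numerical check (ours; logarithms are base 2, matching the bits-per-symbol interpretation above, and all components of p and q are assumed nonzero):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)   # = sum_i p_i log(p_i / q_i)

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.5, 0.25])
print(kl_divergence(p, q))   # 0.25 bits: p != q, so H(p, q) > H(p)
print(kl_divergence(p, p))   # 0.0: H(p, p) = H(p)
```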
[Figure 13.7: a neural net with specific weights; inputs x1 = 1, x2 = 0, x3 = 1, x4 = 0; two nodes in the first layer; and three output nodes y1, y2, and y3.]
Exercise 13.2.6 : In Fig. 13.7 is a neural net with particular values shown
for all the weights and inputs. Suppose that we use the sigmoid function to
compute outputs of nodes at the first layer, and we use softmax to compute the
outputs of the nodes in the output layer.
(a) Compute the values of the outputs for each of the five nodes.
! (b) Assuming each of the weights and each of the xi ’s is a variable, express
the output of the first (top) output node in terms of the weights and the
xi ’s.
! (c) Find the derivative of your function from part (b) with respect to the
weight on the first (top) input to the first (top) node in the first layer.
parameters, it is possible to find parameters that yield low training loss but
nevertheless perform poorly in the real world. This phenomenon is called over-
fitting, a problem we have mentioned several times, starting in Section 9.4.4.
For the moment, we assume that our goal is to find parameters that minimize
the expected loss on the training set. This goal is achieved by gradient descent.
There is an elegant algorithm called backpropagation that allows us to compute
these gradients efficiently. Before we describe backpropagation, we need a few
preliminaries.
Example 13.2 : Figure 13.8 shows the compute graph for a single-layer dense
network described by y = σ(W x + b) where x is the input and y is the output.
We then compute an MSE loss against the training-set output ŷ. That is, we
have a single layer of n nodes. The vector y is of length n and represents the
outputs of each of these nodes. There are k inputs, and (x, ŷ) represents one
training example. Matrix W represents the weights on the inputs of the nodes;
that is, Wij is the weight for input j at the ith node. Finally, b represents the
n biases, so its ith element is the negative of the threshold of the ith node.
[Figure 13.8: the compute graph. The middle row of operator nodes computes u, v, y, and L in turn; W, x, b, and the training output ŷ enter as inputs.]
u = Wx
v = u+b
y = σ(v)
L = MSE(y, ŷ)
Each of these steps corresponds to one of the four nodes in the middle row,
in order from the left. The first step corresponds to the node with operand
u and operator ×. Here is an example where it must be understood that the
node labeled W is the first argument. If necessary, we could label each incoming
edge with a number to indicate its order, but in this case the order should be
obvious, since a column vector x could not multiply a matrix W unless the
matrix happened to have only one row. The second step corresponds to the
node with operator + and operand v. Here, the order of arguments does not
matter, since + on vectors is commutative. ✷
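The forward pass through this compute graph is just the four steps executed in order. Here is a sketch (ours; we take MSE to be the mean of squared differences, although the constant factor makes no difference to training):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(W, b, x, y_hat):
    """The four steps of the compute graph of Fig. 13.8, in order."""
    u = W @ x                      # u = Wx
    v = u + b                      # v = u + b
    y = sigmoid(v)                 # y = sigma(v)
    L = np.mean((y - y_hat) ** 2)  # L = MSE(y, y-hat)
    return u, v, y, L

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))        # n = 3 nodes, k = 4 inputs
b = np.zeros(3)
x = rng.normal(size=4)
y_hat = np.array([1.0, 0.0, 1.0])  # one training example's output
print(forward(W, b, x, y_hat)[-1])
```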
Example 13.3 : We could let function f be the loss function, e.g., the squared-
error loss, which we denote L. This loss is a scalar-valued function of the output
y:
$$L(y) = \|y - \hat{y}\|^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
are equivalent. Recall that we are assuming all vectors are column vectors unless transposed,
but we show them as row vectors so they can be written in-line.
$$J_x(y) = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_m} & \cdots & \frac{\partial y_n}{\partial x_m} \end{pmatrix}$$
We shall make use of the chain rule for derivatives from calculus. If y = g(x)
and z = f (y) = f (g(x)), then the chain rule says:
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$
Also, if z = f (u, v) where u = g(x) and v = h(x), then
$$\frac{dz}{dx} = \frac{\partial z}{\partial u} \frac{du}{dx} + \frac{\partial z}{\partial v} \frac{dv}{dx}$$
For functions of vectors, we can restate the chain rule in terms of gradients and
Jacobians. Suppose y = g(x) and z = f (y) = f (g(x)); then:
$$\nabla_x z = J_x(y) \nabla_y z$$
Similarly, if z = f (u, v) where u = g(x) and v = h(x), then:
$$\nabla_x z = J_x(u) \nabla_u z + J_x(v) \nabla_v z$$
We work backwards through the compute graph, applying the chain rule at
each stage. At each point we pick a node all of whose successors have already
been processed. Suppose a is such a node, and suppose it has just one immediate
successor b in the graph (note that in the simple compute graph of Fig. 13.8,
each node has just one immediate successor). Since we have already processed
node b, we have already computed g(b). We can now compute g(a) using the
chain rule:
g(a) = Ja (b)g(b)
In the case where node a has more than one successor node, we use the more
general sum version of the chain rule. That is, g(a) would be the sum of the
above terms for each successor b of a.
Since we shall need to compute these gradients several times, once for each
iteration of gradient descent, we can avoid repeated computation by adding
additional nodes to the compute graph for backpropagation: one node for each
gradient computation. In general, the Jacobian Ja (b) is a function of both
a and b, and so the node for g(a) will have arcs to it from the nodes for a,
b and g(b). Popular frameworks for deep learning (e.g., TensorFlow) know
how to compute the functional expression for the Jacobians and gradients of
commonly used operators such as those that appear in Fig. 13.8. In that case,
the developer needs only to provide the compute graph and the framework
will add the new nodes for backpropagation. Figure 13.9 shows the resulting
compute graph with added gradient nodes.
[Figure 13.9: the compute graph of Fig. 13.8 with added gradient nodes g(W), g(u), g(v), g(y), and g(b).]
Example 13.4 : We shall work out the functional expressions for the gradients
of all the nodes in Fig. 13.8. We already know g(y). So the next node we choose
to process is v.
$$g(v) = s \circ g(y)$$
where $s$ is the vector of sigmoid derivatives, $s_i = y_i(1 - y_i)$, and $a \circ b$ is the vector resulting from the element-wise product of a and b.5
Now that we have g(v), we can compute g(b) and g(u). We have g(b) =
Jb (v)g(v) and g(u) = Ju (v)g(v). Since
$$v = u + b$$
we have
$$J_b(v) = J_u(v) = I_n$$
$$g(w_i) = g(u_i)\, x$$
$$\log(q_i) = \log\Big(\frac{e^{y_i}}{\sum_j e^{y_j}}\Big) = y_i - \log\Big(\sum_j e^{y_j}\Big)$$
5 This operation is sometimes called the Hadamard product, so as not to confuse it with the
more usual dot product, which is the sum of the components of the Hadamard product.
Therefore, noting that $\sum_i p_i = 1$, we have:
$$\begin{aligned} l = H(p, q) &= -\sum_i p_i \log q_i \\ &= -\sum_i p_i \Big(y_i - \log\big(\textstyle\sum_j e^{y_j}\big)\Big) \\ &= -\Big(\sum_i p_i y_i - \log\big(\textstyle\sum_j e^{y_j}\big) \sum_i p_i\Big) \\ &= -\Big(\sum_i p_i y_i - \log\big(\textstyle\sum_j e^{y_j}\big)\Big) \end{aligned}$$
Differentiating, we get:
$$\frac{\partial l}{\partial y_k} = -p_k + \frac{e^{y_k}}{\sum_j e^{y_j}} = -p_k + \mu(y_k) = q_k - p_k$$
In vector form:
$$\nabla_y l = q - p$$
This combined gradient does not saturate or explode, and leads to good learning
behavior. That observation explains why softmax and cross entropy loss work
so well together in practice. ✷
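We can confirm $\nabla_y l = q - p$ with a finite-difference check (a sketch of ours, using natural logarithms as in the derivation):

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def cross_entropy_loss(y, p):
    return -np.sum(p * np.log(softmax(y)))    # l = H(p, q), natural log

y = np.array([1.0, -2.0, 0.5])                # raw scores
p = np.array([0.0, 1.0, 0.0])                 # true distribution (one-hot)

analytic = softmax(y) - p                     # the claimed gradient q - p

# Central-difference estimate of each component of the gradient:
eps = 1e-6
numeric = np.zeros_like(y)
for k in range(len(y)):
    d = np.zeros_like(y); d[k] = eps
    numeric[k] = (cross_entropy_loss(y + d, p) -
                  cross_entropy_loss(y - d, p)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```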
lead to convergence. Usually picking the right learning rate is a matter of trial
and error. It is also possible and common to vary the learning rate. Start with
an initial learning rate η0 . Then, at each iteration, multiply the learning rate
by a factor β (0 < β < 1) until the learning rate reaches a sufficiently low value.
When we have a large training set, we may not want to use the entire
training set for each iteration, as it might be too time-consuming. So for each
iteration we randomly sample a “minibatch” of training examples. This variant
is called stochastic gradient descent, as was discussed in Section 12.3.5, since
we estimate the gradients using a different sample of the training set at each
iteration.
We have left open the question of how the parameter values are initialized
before we start gradient descent. The usual approach is to choose them at
random. Popular approaches include sampling each entry uniformly at ran-
dom in [−1, 1], or choosing randomly using a normal distribution. Notice that
initializing all the weights to the same value would cause all nodes in a layer
to behave the same way, and thus we would never reap the benefit of having
different nodes in a layer recognize different features of the input.
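Putting these pieces together, a minibatch SGD loop with a multiplicatively decaying learning rate and random initialization might look like the following (a sketch of ours on a toy problem; the function and parameter names are illustrative, not from any particular library):

```python
import numpy as np

def sgd(grad_fn, params, data, eta0=0.1, beta=0.95, eta_min=1e-4,
        batch_size=32, epochs=10, rng=np.random.default_rng(0)):
    """Minibatch SGD with learning rate eta decaying by factor beta each step.
    grad_fn(params, batch) must return the gradient estimate on the batch."""
    eta = eta0
    n = len(data)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # fresh random sample order
        for start in range(0, n, batch_size):
            batch = data[idx[start:start + batch_size]]
            params = params - eta * grad_fn(params, batch)
            eta = max(eta * beta, eta_min)       # decay until sufficiently low
    return params

# Random initialization: uniform in [-1, 1] (one of the options mentioned).
rng = np.random.default_rng(1)
w0 = rng.uniform(-1.0, 1.0, size=5)

# Toy problem: minimize the mean squared distance of w to the data rows.
data = rng.normal(loc=2.0, size=(200, 5))
grad = lambda w, batch: 2 * (w - batch.mean(axis=0))
print(sgd(grad, w0, data).round(2))   # approaches the data mean, about 2.0
```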
13.3.5 Tensors
Previously, we have imagined that the inputs to a neural net are one-dimensional
vectors. But there is no reason why we cannot view the input as having a higher
dimension.
Example 13.7 : This example is based on the MNIST dataset.6 This dataset
consists of 28 × 28 monochrome images, each represented by a two-dimensional
square bit array whose sides are of length 28. Our goal is to build a neural net
that determines whether an image corresponds to a handwritten digit (0-9) and
if so which one. Consider a single image X, which is a 28 × 28 matrix. Suppose
the first layer of our network is a dense layer7 consisting of 49 hidden nodes,
which we shall imagine is arranged in a 7 × 7 array. We model the hidden layer
as a 7 × 7 matrix H, where the output of the node in row i and column j is hij .
We can model the weights for each of the 28 × 28 inputs and each of the
7 × 7 nodes as a weight tensor W with dimensions 7 × 7 × 28 × 28. That is, Wijkl
represents the weight, at the node whose position in the array of nodes is row i
and column j, for the input pixel in row k and column l of the image. Then:
$$h_{ij} = \sum_{k=1}^{28} \sum_{l=1}^{28} w_{ijkl}\, x_{kl} \qquad \text{for } 1 \le i, j \le 7$$
where we omit the bias term for simplicity (i.e., we assume all thresholds of all
nodes are 0).
An equivalent way to think about this structure is to flatten the input X into
a vector x of length 784 (since 28 × 28 = 784) and the hidden layer H into a
vector h of length 49. We flatten the weight tensor W as follows: the last two
dimensions are flattened into a single dimension to match x, and its first two
dimensions are flattened into a single dimension to match the hidden vector h,
resulting in a 49 × 784 weight matrix. As in Section 13.2.1, we let $w_i^T$
denote the ith row of this new weight matrix. We can now write:
$$h_i = w_i^T x \qquad \text{for } 1 \le i \le 49$$
It’s straightforward to see that there is a 1-to-1 mapping between the hidden
nodes in the old and new arrangements. Moreover, the output of each hidden
node is determined by a dot product, just as in Section 13.2.1. Thus the tensor
notation is just a convenient way to group vectors. ✷
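The equivalence is easy to verify numerically (our sketch; numpy's reshape merges adjacent dimensions exactly as described):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(7, 7, 28, 28))   # weight tensor: node (i,j), pixel (k,l)
X = rng.normal(size=(28, 28))         # one input image

# Direct form: h_ij = sum over k,l of W[i,j,k,l] * X[k,l]
H = np.einsum('ijkl,kl->ij', W, X)

# Flattened form: a 49 x 784 weight matrix times a 784-vector.
W2 = W.reshape(49, 784)               # merge the first two and last two dimensions
x  = X.reshape(784)
h  = W2 @ x

print(np.allclose(H.reshape(49), h))  # True: the two views are equivalent
```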
Thus the tensors used in neural networks have little in common with the
tensors used in Physics and other mathematical sciences. A tensor in our con-
text is just a nested collection of vectors. The only tensor operation we shall
need is the flattening of a tensor by merging dimensions as in Example 13.7. We
can use the backpropagation algorithm described in Section 13.3.3 for tensors
once we have flattened them appropriately.
6 See yann.lecun.com/exdb/mnist/.
7 In reality, the first network layer for this problem is likely to be a convolutional layer. See Section 13.4.
(e) Draw the compute graph with gradient computation for the entire net-
work.
We now slide the filter along the length and width of the input image, applying
the filter at each position, so that we capture every possible 5 × 5 square region
of pixels in the image. Notice that we can apply the filter at input locations
1 ≤ i ≤ 220 and 1 ≤ j ≤ 220, although it does not “fit” at positions with a
higher i or j. The resulting set of responses rij are then passed through an
activation function to form the activation map R of the filter. In most cases,
the activation function is ReLU or one of its variants. When trained, i.e., the
weights wij of the filter are determined, the filter will recognize some feature of
the image, and the activation map tells whether (or to what degree) this feature
is present at each position of the image.
Example 13.8 : In Fig. 13.10(b), we see a 2 × 2 filter, which is to be applied
to the 4 × 4 image in Fig. 13.10(a). To do so, we lay the filter over all nine of
the 2 × 2 squares of the image. In the figure, we suggest the filter being placed
over the 2 × 2 square in the upper-right corner. After overlaying the filter, we
multiply each of the filter elements by the corresponding image element and
then take the sum of the products. In principle, we then need to add in a bias
term, but in this example, we shall assume the bias is 0.
1 0 1 0                          0  0  0
0 1 0 1         1  0            −1  1  0
1 1 0 0         0 −1             1  1 −1
0 0 0 1
(a) 4-by-4 image    (b) 2-by-2 filter    (c) the resulting convolution
Figure 13.10: Applying a 2-by-2 filter to a 4-by-4 image
Another way to look at this process is that we turn the filter into a vector
by concatenating its rows, in order, and we do the same to the square of the
image. Then, we take the dot product of the vectors. For instance, the filter
can be thought of as the vector [1, 0, 0, −1], and the square in the upper-left
corner of the image can be thought of as the vector [1, 0, 0, 1]. The dot product
of these vectors is 1 × 1 + 0 × 0 + 0 × 0 + (−1) × 1 = 0. Thus, the result, shown
in Fig. 13.10(c), has a 0 for its upper-left entry.
For another example, if we slide the filter down one row, the dot product of
the filter as a vector and the vector formed from the first two elements of the
second and third rows of the image is 1 × 0 + 0 × 1 + 0 × 1 + (−1) × 1 = −1.
Thus, the first element of the second row of the convolution is −1. ✷
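The whole convolution of Example 13.8 can be reproduced in a few lines (our sketch; stride 1 and bias 0, as in the example):

```python
import numpy as np

image = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 1, 0, 0],
                  [0, 0, 0, 1]])
filt = np.array([[1, 0],
                 [0, -1]])

# Slide the 2x2 filter over all nine 2x2 squares of the image.
n = image.shape[0] - filt.shape[0] + 1
out = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        out[i, j] = np.sum(image[i:i+2, j:j+2] * filt)

print(out)
# [[ 0  0  0]
#  [-1  1  0]
#  [ 1  1 -1]]
```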
When we deal with color images, the input has three channels. That is, each
pixel is represented by three values, one for each color. Suppose we have a color
image of size 224 × 224 × 3. The filter must then also have three channels,
and so it is now encoded by a 5 × 5 × 3 weight tensor W and a single
bias parameter b. The activation map R remains two-dimensional (220 × 220 in our example), with each response
given by:
$$r_{ij} = \sum_{k=0}^{4} \sum_{l=0}^{4} \sum_{d=1}^{3} x_{i+k,\,j+l,\,d}\; w_{kld} + b \qquad (13.2)$$
In our example, we imagined a filter of size 5. In general, the size of the filter is
a hyperparameter of the convolutional layer. Filters of size 3, 4, or 5 are most
commonly used. Note that the filter size specifies only the width and height of
the filter; the number of channels of the filter always matches the number of
channels of the input.
The activation map in our example is slightly smaller than the input. In
many cases, it is convenient to have the activation map be of the same size as the
input. We can expand the response by using zero padding: adding additional
rows and columns of zeros to pad out the input. A zero padding of p corresponds
to adding p rows of zeros each to the top and bottom, and p columns to the left
and right, increasing the dimensionality of the input by 2p along both width
and height. A zero padding of 2 in our example would augment the input size
to 228 × 228 and result in an activation map of size 224 × 224, the same size as
the original input image.
The third hyperparameter of interest is stride. In our example, we assumed
that we apply the filter at every possible point in the input image. We could
think instead of sliding the filter one pixel at a time along the width and height
of the input, corresponding to a stride s = 1. Instead, we could slide the
filter along the width and the height of image two or three pixels at a time,
corresponding to a stride s of 2 or 3. The larger the stride, the smaller the
activation map compared to the input.
Suppose the input is an m × m square of pixels, the output an n × n square,
filter size is f , stride is s, and zero padding is p. It is easily seen that:
n = (m − f + 2p)/s + 1 (13.3)
In particular, we must be careful to pick hyperparameters such that s evenly
divides m − f + 2p; else we would have an invalid arrangement for the convo-
lutional layer, and most software implementations would throw an exception.
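Equation 13.3 is simple enough to check in code (our sketch; the error case mirrors the invalid-arrangement behavior just described):

```python
def conv_output_size(m, f, s, p):
    """Equation 13.3: n = (m - f + 2p)/s + 1; s must evenly divide m - f + 2p."""
    if (m - f + 2 * p) % s != 0:
        raise ValueError("invalid hyperparameters: s must divide m - f + 2p")
    return (m - f + 2 * p) // s + 1

print(conv_output_size(224, 5, 1, 0))   # 220: a 5x5 filter, no padding
print(conv_output_size(224, 5, 1, 2))   # 224: zero padding of 2 preserves the size
print(conv_output_size(224, 4, 2, 1))   # 112: stride 2 roughly halves each dimension
```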
We can intuitively think of a filter as looking for an image feature, such as
a splotch of color or an edge. Classifying an image usually requires identifying
many features. We therefore use many filters, ideally one for each useful feature.
During the training of the CNN, we hope that each filter will learn to identify
one of these features. Suppose we use k filters; to keep things simple, we
constrain all filters to use the same size, stride, and zero padding. Then the output
contains k activation maps. The dimensionality of the output layer is therefore
n × n × k, where n is given by Equation 13.3.
The set of k filters together constitute a convolutional layer. Given input
with d channels, a filter of size f requires df 2 + 1 parameters (df 2 weight param-
eters and 1 bias parameter). Therefore, a convolutional layer of k such filters
uses k(df 2 + 1) parameters.
Example 13.9 : Continuing the ImageNet example, suppose the input consists
of 224×224×3 images, and we use a convolutional layer of 64 filters, each of size
5, stride 1, and zero padding 2. The size of the output layer is 224 × 224 × 64.
Each filter needs 3 × 5 × 5 + 1 = 76 parameters (including one for the threshold)
and the convolutional layer contains 64 × 76 = 4864 parameters – orders of
magnitude smaller than the number of parameters for a fully connected layer
with the same input and output sizes. ✷
Here we are interested in the discrete version of convolution, where f and g are
defined over the integers:
$$(f * g)(i) = \sum_{k=-\infty}^{\infty} f(k)\, g(i-k) = \sum_{k=-\infty}^{\infty} f(i-k)\, g(k)$$
Let us define h to be the kernel obtained by flipping g, i.e., h(i, j) = g(−i, −j)
for i, j ∈ {0, . . . , m − 1}. It can be verified that the convolution f ∗ h of f with the
flipped kernel h is given by:
$$(f * h)(i, j) = \sum_{k=0}^{m-1} \sum_{l=0}^{m-1} f(i+k,\, j+l)\, g(k, l) \qquad (13.4)$$
Note the similarity of Equation 13.4 to Equation 13.1, ignoring the bias term b.
The operation of the convolutional layer can be thought of as the convolution of
the input with a flipped kernel. This similarity is the reason why convolutional
layers are so named, and filters are sometimes called kernels.
The cross-correlation f ⋆ g is defined by (f ⋆ g)(x, y) = (f ∗ h)(x, y) where
h is the flipped version of g. Thus the operation of the convolutional layer can
also be seen as the cross-correlation of the input with the filter.
1. The pooling function, which is most commonly the max function but could
in theory be any aggregate function, such as average.
2. The size f of each pool, which specifies that each pool uses an f ×f square
of inputs.
3. The stride s between pools, analogous to the stride used in the convolu-
tional layer.
The most common use cases in practice are f = 2 and s = 2, which specifies
nonoverlapping 2 × 2 regions, and f = 3, s = 2, which specifies 3 × 3 regions
with some overlap. Higher values of f lead to too much loss of information in
practice. Note that the pooling operation shrinks the height and width of the
input layer, but preserves the number of channels. It operates independently
on each channel of its input. Note that unlike the convolution layer, it is not
common practice to use zero padding for the max pooling layer.
Pooling is appropriate if we believe that features are approximately in-
variant to small translations. For example, we might care about the relative
locations of features such as legs or wings and not their exact locations. In
such cases pooling can greatly reduce the size of the hidden layer that forms
the input to the subsequent layers of the network.
Example 13.10 : Suppose we apply max pooling with size = 2 and stride =
2 to the 224 × 224 × 64 output of the convolutional layer from Example 13.9.
The resulting output is of size 112 × 112 × 64. ✷
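A direct implementation of max pooling (our sketch; a per-channel max over f × f regions with stride s and no padding):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling of an (h, w, c) input; each channel is pooled independently."""
    h, w, c = x.shape
    n_h = (h - f) // s + 1
    n_w = (w - f) // s + 1
    out = np.zeros((n_h, n_w, c))
    for i in range(n_h):
        for j in range(n_w):
            out[i, j, :] = x[i*s:i*s+f, j*s:j*s+f, :].max(axis=(0, 1))
    return out

x = np.random.default_rng(0).normal(size=(224, 224, 64))
print(max_pool(x).shape)   # (112, 112, 64), as in Example 13.10
```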
Example 13.11 : For instance, Fig. 13.11 is a simple network architecture for
classifying ImageNet images into one of 1000 image classes, loosely inspired by
Recognition,” arXiv:1409.1556v6.
previous pooling layer and its filters can represent structures of progressively
larger sizes and complexities. Thus, the number of filters has been chosen to
double at each convolution layer.
Finally, the eleventh, and last, layer is a fully connected layer. It has 1000
nodes, corresponding to the 1000 images classes we are trying to learn how to
recognize. Being fully connected, each of these 1000 nodes takes all 7 × 7 × 3 =
147 outputs from the 10th layer; the factor 3 is from the fact that all filters of
the previous layers have three channels. ✷
Designing CNN’s and other deep network architectures is still more art than
science. Over the past few years, however, some rules of thumb have emerged
that are worth keeping in mind:
1. Deep networks that use many convolutional layers, each using many small
filters, are better than shallow networks that use large filters.
4. It’s very useful to have the input size evenly divisible by 2 many times.
The idea is to flatten the filter F and the corresponding region in the input into vectors. Consider
the convolution of an m × m × 1 tensor X (i.e., X is actually an m × m matrix)
with an f × f filter F and bias b, to produce as output the n × n matrix Z.
We now explain how to implement the convolution operation using a single
vector-matrix multiplication.
We first flatten the filter F into an $f^2 \times 1$ vector g. We then create matrix
Y from X as follows: each square $f \times f$ region of X is flattened into an $f^2 \times 1$
vector, and all these vectors are lined up as columns to form a single $f^2 \times n^2$
matrix Y. Construct the $n^2 \times 1$ vector b so that all its entries are equal to the
bias b. Then
$$z = Y^T g + b$$
yields an $n^2 \times 1$ vector z. Moreover, each element of z is a single element in the
convolution. Therefore, we can rearrange the entries in z into an n × n matrix
Z that gives the output of the convolution.
Notice that the matrix Y is larger than the input X (approximately by a
factor of f 2 ), because each entry of X is repeated many times in Y . Thus,
this implementation uses a lot of memory. However, multiplying a matrix and
a vector is extremely fast on modern hardware such as Graphics Processing
Units (GPU’s), and so it is the method used in practice.
This approach to computing convolutions can easily be extended to the case
of inputs with more than one channel. Moreover, we can also handle the case
where we have k filters rather than just 1. We then need to replace the vector
g with a df 2 × k matrix G and use a larger matrix Y (df 2 × n2 ). We also need
to use an n2 × k bias matrix B, where each column repeats the bias term of
the corresponding filter. Finally, the output of the convolution is expressed by
an n2 × k matrix C, with a column for the output of each filter, where:
$$C = Y^T G + B$$
We have explained how to perform the forward pass through the convolu-
tional layer. During training, we shall need to backpropagate through the layer.
Since each entry in the output of convolution is a dot product of vectors followed
by a sum, we can use the techniques from Section 13.3.3 to compute deriva-
tives. It turns out that the derivative of a convolution can also be expressed as
a convolution, but we shall not go into the details here.
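Here is the single-channel, single-filter version of this trick in code (our sketch; it reproduces the sliding-window convolution of Example 13.8):

```python
import numpy as np

def conv_by_matmul(X, F, b):
    """Convolution of an m x m input with an f x f filter, via one
    vector-matrix multiplication (the flattening trick described above)."""
    m, f = X.shape[0], F.shape[0]
    n = m - f + 1
    g = F.reshape(f * f)                 # flatten the filter into a vector
    # Build Y: one column per f x f region of X, flattened.
    Y = np.zeros((f * f, n * n))
    for i in range(n):
        for j in range(n):
            Y[:, i * n + j] = X[i:i+f, j:j+f].reshape(f * f)
    z = Y.T @ g + b                      # z = Y^T g + b
    return z.reshape(n, n)               # rearrange z into the n x n output Z

X = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 0, 1]], dtype=float)
F = np.array([[1, 0], [0, -1]], dtype=float)
print(conv_by_matmul(X, F, 0.0))   # same result as the sliding-window version
```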
(a) How many responses will be computed for this layer of a CNN?
(b) How much zero padding is necessary to produce an output of size equal
to the input?
(c) Suppose we do not do any zero padding. If the output of one layer is
input to the next layer, after how many layers will there be no output at
all?
Exercise 13.4.2 : Repeat Exercise 13.4.1(a) and (c) for the case when there is
a stride of three.
Exercise 13.4.4 : For this exercise, assume that inputs are single bits 0 (white)
and 1 (black). Consider a 3-by-3 filter, whose weights are wij , for 0 ≤ i ≤ 2 and
0 ≤ j ≤ 2, and whose bias is b. Suggest weights and bias so that the output of
this filter would detect the following simple features:
(a) A vertical boundary, where the left column is 0, and the other two columns
are 1.
(b) A diagonal boundary, where only the triangle of three pixels in the upper
right corner are 1.
(c) A corner, in which the 2-by-2 square in the lower right is 0 and the other
pixels are 1.
1. The output at each point depends on the entire prefix of the sentence
until that point, and not just the last word. The network needs to retain
some “memory” of the past.
2. The underlying language model does not change across positions in the
sequence, so we should use the same parameters (weights for each of the
nodes) at each position.
Figure 13.12: (a) The basic unit of an RNN. (b) The unrolled RNN of length n.
1. The RNN has inputs at all (or almost all) layers, and not just at the first
layer.
2. The weights at each of the first n layers are constrained to be the same;
these weights are the matrices U and W in Equation 13.5 below. Thus,
each of the first n layers has the same set of nodes, and correspond-
ing nodes from each of the layers share weights (and are thus really the
same node), just as nodes of a CNN representing different locations share
weights and are thus really the same node.
At each step t, we have a hidden state vector st that functions as the memory
in which the network encodes information about the prefix of the sequence it
has seen. The hidden state at time t is a function of the input at time t and
the hidden state at time t − 1:
$$s_t = f(U x_t + W s_{t-1} + b) \qquad (13.5)$$
10 Since there could in principle be an infinite number of words, we might in practice devote
components of the vector only to the most common words or the words that are most impor-
tant in the application at hand. Other words would all be represented by a single additional
component of the vector.
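In code, the unrolled RNN of Fig. 13.12 is a single loop that reuses the same parameters at every step (our sketch, with f = tanh, the usual choice discussed below, and omitting any output nonlinearity g):

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, d, s0):
    """Unrolled RNN: s_t = tanh(U x_t + W s_{t-1} + b), y_t = V s_t + d.
    The same U, W, V, b, d are shared across all time steps."""
    s, ys = s0, []
    for x in xs:
        s = np.tanh(U @ x + W @ s + b)   # Equation 13.5 with f = tanh
        ys.append(V @ s + d)             # the output at time t
    return ys, s

rng = np.random.default_rng(0)
k, h, o = 4, 3, 2                        # input, state, and output sizes
U, W, V = rng.normal(size=(h, k)), rng.normal(size=(h, h)), rng.normal(size=(o, h))
b, d, s0 = np.zeros(h), np.zeros(o), np.zeros(h)
ys, s_final = rnn_forward(rng.normal(size=(5, k)), U, W, V, b, d, s0)
print(len(ys), s_final.shape)            # 5 outputs, final state of length 3
```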
$$\frac{de_t}{ds_t} = V^T (y_t - \hat{y}_t) \qquad (13.6)$$
Setting $R_t = \frac{ds_t}{dW}$ and noting that $s_t = \tanh(z_t)$, where $z_t = W s_{t-1} + U x_t + b$, we have:
$$R_t = \frac{ds_t}{dz_t} \cdot \frac{dz_t}{dW}$$
It is straightforward to verify that $\frac{ds_t}{dz_t}$ is the diagonal matrix A defined by:
$$a_{ij} = \begin{cases} 1 - s_{ti}^2 & \text{when } i = j \\ 0 & \text{otherwise} \end{cases}$$
$$R_t = P_t + Q_t R_{t-1} \qquad (13.7)$$
$$\begin{aligned} R_t &= P_t + Q_t R_{t-1} \\ &= P_t + Q_t (P_{t-1} + Q_{t-1} R_{t-2}) \\ &= P_t + Q_t P_{t-1} + Q_t Q_{t-1} R_{t-2} \\ &\quad \vdots \end{aligned}$$
Ultimately yielding:
$$R_t = P_t + \sum_{j=0}^{t-1} P_j \prod_{k=j+1}^{t} Q_k \qquad (13.8)$$
From Equation 13.8, it is clear that the contribution of step i to Rt is given by:
$$R_t^i = P_i \prod_{k=i+1}^{t} Q_k \qquad (13.9)$$
Equation 13.9 includes the product of several matrices that look like the diagonal
matrix A. Each entry in A is strictly less than 1. Just as the product of many
numbers, each strictly less than 1, approaches zero as we add more multiplicands,
the term $\prod_{k=i+1}^{t} Q_k$ approaches zero for i ≪ t. In other words, the
gradient at step t is determined entirely by the preceding few time steps, with
very little contribution from much earlier time steps. This phenomenon is called
the problem of vanishing gradients.
Equation 13.9 results in vanishing gradients because we used the tanh ac-
tivation function for state update. If instead we use other activation functions
such as ReLU, we end up with the product of many matrices with large entries,
resulting in the problem of exploding gradients. Exploding gradients are easier
to handle than vanishing gradients, because we can clip the gradient at each
step to lie within a fixed range. However, the resulting RNN’s still have trouble
learning long-distance associations.
2. The ability to save selected information into memory. For example, when
we process product reviews, we might want to save only words expressing
opinions (e.g., excellent, terrible) and ignore other words.
3. The ability to focus only on the aspects of memory that are immediately
relevant. For example, focus only on information about the characters of
the current movie scene, or only on the subject of the sentence currently
being analyzed. We can implement this focus by using a 2-tier archi-
tecture: a long-term memory that retains information about the entire
processed prefix of the sequence, and a working memory that is restricted
to the items of immediate relevance.
The RNN model has a single hidden state vector st at time t. The LSTM
model adds an additional state vector ct , called the cell state, for each time t.
Intuitively, the hidden state corresponds to working memory and the cell state
corresponds to long-term memory. Both state vectors are of the same length,
and both have entries in the range [−1, 1]. We may imagine the working memory
having most of its entries near zero with only the relevant entries turned “on.”
The architectural ingredient that enables the ability to forget, save, and
focus is the gate. A gate g is just a vector of the same length as a state vector
Note that W , U , and b with subscript h are two weight matrices and a bias
vector that we learn and use for just the purpose of computing ht for each t.
We also compute two gates, the forget gate ft and the input gate it . The
forget gate determines which aspects of the long-term memory we retain. The
input gate determines which parts of the candidate state update to save into the
long-term memory. These gates are computed using different weight matrices
and bias vectors, which also must be learned. We indicate these matrices and
vector with subscripts f and i, respectively.
We update the long-term memory using the gates and the candidate update
vector as follows:12
$$c_t = c_{t-1} \circ f_t + h_t \circ i_t \qquad (13.13)$$
Now that we have updated the long-term memory, we need to update the work-
ing memory. We do this in two steps. The first step is to create an output gate
ot . The second step is to apply this gate to the long-term memory, followed by
a tanh activation:13
entry; so the forget gate should really be called the remember gate. Similarly, the input gate
might be better named the save gate. Here we follow the naming convention commonly used
in the literature.
13 Once again, the output gate might be better named the focus gate since it focuses the
Here, we use subscript o to indicate another pair of weight matrices and a bias
vector that must be learned.
Finally, the output at time t is computed in exactly the same manner as the
RNN output:
yt = g(V st + d) (13.16)
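Assembling the pieces, one time step of the LSTM looks as follows (a sketch of ours; the gate equations themselves do not appear above, so we use the standard forms, sigmoids of learned affine functions of xt and st−1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, s_prev, c_prev, P):
    """One LSTM time step. P holds (U, W, b) triples with subscripts h, f, i, o,
    plus V and d for the output."""
    h = np.tanh(P['Uh'] @ x + P['Wh'] @ s_prev + P['bh'])   # candidate update
    f = sigmoid(P['Uf'] @ x + P['Wf'] @ s_prev + P['bf'])   # forget gate
    i = sigmoid(P['Ui'] @ x + P['Wi'] @ s_prev + P['bi'])   # input gate
    c = c_prev * f + h * i                                  # Equation 13.13
    o = sigmoid(P['Uo'] @ x + P['Wo'] @ s_prev + P['bo'])   # output gate
    s = np.tanh(c) * o                                      # new working memory
    y = P['V'] @ s + P['d']                                 # Equation 13.16, before g
    return y, s, c

rng = np.random.default_rng(0)
k, h = 4, 3
P = {f'U{g}': rng.normal(size=(h, k)) for g in 'hfio'}
P.update({f'W{g}': rng.normal(size=(h, h)) for g in 'hfio'})
P.update({f'b{g}': np.zeros(h) for g in 'hfio'})
P.update(V=rng.normal(size=(2, h)), d=np.zeros(2))
y, s, c = lstm_step(rng.normal(size=k), np.zeros(h), np.zeros(h), P)
print(y.shape, s.shape, c.shape)   # (2,) (3,) (3,)
```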
(a) A node to signal when the input is 1 and the previous input is 0.
! (b) A node to signal when the last three inputs have all been 1.
!! (c) A node to signal when the input is the same as the previous input.
! Exercise 13.5.3 : Give the formulas for the gradients $\frac{de}{dU}$ and $\frac{de}{dV}$ for the general RNN of Fig. 13.12.
14 We should understand that RNN’s, like any neural network, are to be learned from data,
13.6 Regularization
Thus far, we have presented our goal as one of minimizing loss (i.e., prediction
error) on the training set. Gradient descent and stochastic gradient descent
help us achieve this objective. In practice, the real objective of training is
to minimize the loss on new and hitherto unseen inputs. Our hope is that
our training set is representative of unknown future inputs, so a low loss on the
training set translates into good performance on new inputs. Unfortunately, the
trained model sometimes learns idiosyncrasies of the training data that allow
it to have low training loss, but not generalize well to new inputs – the familiar
problem of overfitting.
How can we tell if a model has overfit? In general, we split the available
data into a training set and a test set. We train the model using only the
training-set data, withholding the test set. We then evaluate the performance
of the model on the test set. If the model performs much worse on the test set
than on the training set, we know the model has overfit. Assuming data points
are independent of one another, we can pick a fraction of the available data
points at random to form the test set. A common ratio for the training:test
split is 80:20, i.e., 80% of the data for training and 20% for test. We have to
be careful, however: in sequence-learning problems (e.g., modeling time series),
the state of the sequence at any point in time encodes information about the
past. In such cases the final piece of the sequence is a better test set.
Overfitting is a general problem that affects all machine-learning models.
However, deep neural networks are particularly susceptible to overfitting, be-
cause they use many more parameters (weights and biases) than other kinds of
models. Several techniques have been developed to reduce overfitting in deep
networks, usually by trading higher training error for better generalization. The
process is referred to as model regularization. In this section we describe some
of the most important regularization methods for deep learning.
In practice, it is observed that the L2 -norm penalty works best for most
applications. The L1 -norm penalty is useful in some situations calling for model
compression, because it tends to produce models where many of the weights are
zero.
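In code, a norm penalty is simply an extra term added to the training loss (our sketch; λ is the hyperparameter that trades off fitting the data against keeping the weights small):

```python
import numpy as np

def penalized_loss(loss, weights, lam=0.01, norm='l2'):
    """Add a norm penalty on all the weights to the training loss."""
    w = np.concatenate([W.ravel() for W in weights])
    if norm == 'l2':
        return loss + lam * np.sum(w ** 2)      # L2: shrinks weights toward zero
    else:
        return loss + lam * np.sum(np.abs(w))   # L1: drives many weights to exactly zero

W1 = np.random.default_rng(0).normal(size=(3, 4))
print(penalized_loss(0.5, [W1], lam=0.01, norm='l2'))
```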
13.6.2 Dropout
Dropout is a technique that reduces overfitting by making random changes to
the underlying deep neural network. Recall that when we train using stochastic
gradient descent, at each step we sample at random a minibatch of inputs to
process. When using dropout, we also select at random a certain fraction (say
half) of all the hidden nodes from the network and delete them, along with
any edges connected to them. We then perform forward propagation and back-
propagation for the minibatch using this modified network, and update the
weights and biases. After processing the minibatch, we restore all the deleted
nodes and edges. When we sample the next minibatch, we delete a different
random subset of nodes and repeat the training process.
The fraction of hidden nodes deleted each time is a hyperparameter called
the dropout rate. When training is complete, and we actually use the full
network, we need to take into account that the full network contains a larger
number of hidden nodes than the networks used for training. We therefore need
to scale the weight on each outgoing edge from a hidden node by the dropout
rate.
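A sketch of dropout applied to one hidden layer's outputs (ours; we use the common "inverted" variant, which performs the rescaling during training rather than after it, to the same effect):

```python
import numpy as np

def dropout_forward(h, rate, rng, training=True):
    """Dropout on a hidden layer's outputs h. During training, each node is
    deleted with probability `rate`; survivors are scaled up so that the
    expected output matches the full network at test time."""
    if not training:
        return h
    mask = (rng.random(h.shape) >= rate) / (1.0 - rate)
    return h * mask

rng = np.random.default_rng(0)
h = np.ones(10)
print(dropout_forward(h, rate=0.5, rng=rng))   # about half the nodes zeroed
```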
Why does dropout reduce overfitting? Several hypotheses have been put
forward, but perhaps the most convincing argument is that dropout allows a
single neural network to behave effectively as a collection of neural networks.
Imagine that we have a collection of neural networks, each with a different
network topology. Suppose we trained each network independently using the
training data, and used some kind of voting or averaging scheme to create a
higher-level model. Such a scheme would perform better than any of the indi-
vidual networks. The dropout technique simulates this setup without explicitly
creating a collection of neural networks.
While the loss on the training set (the training loss) decreases through the training process, the loss
on the test set (the test loss) often behaves differently. The test loss falls dur-
ing the initial part of the training, and then may hit a minimum and actually
increase after a large number of training iterations, even as the training loss
keeps falling.
Intuitively, the point at which the test loss starts to increase is the point at
which the training process has started learning idiosyncrasies of the training
data rather than a generalizable model. A simple approach to avoiding this
problem is to stop the training when the test loss stops falling. There is, how-
ever, a subtle problem with this approach: we might inadvertently overfit to the
test data (rather than to the training data) by stopping training at the point
of minimum test loss. Therefore, the test error no longer is a reliable measure
of the true performance of the model on hitherto unseen inputs.
The usual solution is to use a third subset of inputs, the validation set, to
determine the point at which we stop training. We split the data not just into
training and test sets, but into three groups: training, validation, and test.
Both the validation and test sets are withheld from the training process. When
the loss on the validation set stops decreasing, we stop the training process.
Since the test set has played no role at all in the training process, the test error
remains a reliable indicator of the true performance of the model.
✦ Types of Layers: Many layers are fully connected, meaning that each node
in the layer has all the nodes of the previous layer as inputs. Other layers
are pooled, meaning that the nodes of the previous layer are partitioned,
and each node of this layer takes as input only the nodes of one block
of the partition. Convolutional layers are also used, especially in image
processing applications.
✦ Loss Functions: These measure the difference between the output of the
net and the correct output according to the training set. Commonly used
loss functions include squared-error loss, Huber loss, classification loss,
and cross-entropy loss.
the same weights, so the training process needs to deal with a
relatively small number of weights.
1. http://caffe2.ai
5. http://www.image-net.org.