from just a subset of the training data. Fourth, it can (in principle) escape local minima.
Fifth, it reduces the chances of getting stuck near saddle points; it is likely that at least
some of the possible batches will have a significant gradient at any point on the loss
function. Finally, there is some evidence that SGD finds parameters for neural networks
that cause them to generalize well to new data in practice (see section 9.2).
SGD does not necessarily “converge” in the traditional sense. However, the hope is
that when we are close to the global minimum, all the data points will be well described
by the model. Consequently, the gradient will be small, whichever batch is chosen, and
the parameters will cease to change much. In practice, SGD is often applied with a
learning rate schedule. The learning rate α starts at a high value and is decreased by a
constant factor every N epochs. The logic is that in the early stages of training, we want
the algorithm to explore the parameter space, jumping from valley to valley to find a
sensible region. In later stages, we are roughly in the right place and are more concerned
with fine-tuning the parameters, so we decrease α to make smaller changes.
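For concreteness, the sketch below implements minibatch SGD with such a step-decay schedule on a toy least squares problem; the data, the initial rate of 0.1, and the halving interval of 20 epochs are illustrative choices rather than values taken from the text.

```python
import numpy as np

# Toy data: fit y = w*x with a least squares loss using minibatch SGD.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + 0.1 * rng.normal(size=200)

w = 0.0                          # parameter phi
alpha = 0.1                      # initial learning rate
batch_size, n_epochs = 20, 100

for epoch in range(n_epochs):
    perm = rng.permutation(len(x))               # shuffle indices once per epoch
    for start in range(0, len(x), batch_size):
        idx = perm[start:start + batch_size]     # indices of this batch
        grad = np.mean(2 * (w * x[idx] - y[idx]) * x[idx])   # dL/dw on the batch
        w -= alpha * grad
    if (epoch + 1) % 20 == 0:                    # step-decay schedule
        alpha *= 0.5

print(w)   # close to 3.0
```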
6.3 Momentum
Adding a momentum term is a common modification of SGD; the update is driven by a weighted combination of the gradient computed from the current batch and the direction moved at the previous step:

$$m_{t+1} \leftarrow \beta \cdot m_t + (1-\beta) \sum_{i \in B_t} \frac{\partial \ell_i[\phi_t]}{\partial \phi}$$
$$\phi_{t+1} \leftarrow \phi_t - \alpha \cdot m_{t+1}, \tag{6.11}$$
where mt is the momentum (which drives the update at iteration t), β ∈ [0, 1) controls
the degree to which the gradient is smoothed over time, and α is the learning rate.
The recursive formulation of the momentum calculation means that the gradient step
is an infinite weighted sum of all the previous gradients, where the weights get smaller
as we move back in time (see problem 6.10). The effective learning rate increases if all these gradients are aligned over multiple iterations but decreases if the gradient direction repeatedly changes as the terms in the sum cancel out. The overall effect is a smoother trajectory
and reduced oscillatory behavior in valleys (figure 6.7).
The momentum term can be considered a coarse prediction of where the SGD algorithm will move next (see notebook 6.4). Nesterov accelerated momentum (figure 6.8) computes the gradients at this predicted point rather than at the current point:
$$m_{t+1} \leftarrow \beta \cdot m_t + (1-\beta) \sum_{i \in B_t} \frac{\partial \ell_i[\phi_t - \alpha\beta \cdot m_t]}{\partial \phi}$$
$$\phi_{t+1} \leftarrow \phi_t - \alpha \cdot m_{t+1}, \tag{6.12}$$
where now the gradients are evaluated at ϕt − αβ · mt . One way to think about this is
that the gradient term now corrects the path provided by momentum alone.
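The following minimal sketch (not code from the book) expresses one update of equations 6.11 and 6.12; the function `grad(phi, batch)` is assumed to return the summed gradient of the batch losses at its argument, and the default step size and momentum coefficient are illustrative.

```python
import numpy as np

def sgd_momentum_step(phi, m, grad, batch, alpha=0.05, beta=0.9, nesterov=False):
    """One update of equation 6.11 (or 6.12 if nesterov=True).

    grad(phi, batch) should return the summed gradient of the batch losses.
    """
    # Nesterov momentum evaluates the gradient at the predicted next position.
    lookahead = phi - alpha * beta * m if nesterov else phi
    m_new = beta * m + (1 - beta) * grad(lookahead, batch)
    phi_new = phi - alpha * m_new
    return phi_new, m_new
```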
6.4 Adam
Gradient descent with a fixed step size has the following undesirable property: it makes
large adjustments to parameters associated with large gradients (where perhaps we
should be more cautious) and small adjustments to parameters associated with small
gradients (where perhaps we should explore further). When the gradient of the loss
surface is much steeper in one direction than another, it is difficult to choose a learning
rate that (i) makes good progress in both directions and (ii) is stable (figures 6.9a–b).
A straightforward approach is to normalize the gradients so that we move a fixed
distance (governed by the learning rate) in each direction. To do this, we first measure
the gradient mt+1 and the pointwise squared gradient vt+1 :
$$m_{t+1} \leftarrow \frac{\partial L[\phi_t]}{\partial \phi}$$
$$v_{t+1} \leftarrow \left(\frac{\partial L[\phi_t]}{\partial \phi}\right)^2. \tag{6.13}$$

We then update the parameters according to:

$$\phi_{t+1} \leftarrow \phi_t - \alpha \cdot \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon}, \tag{6.14}$$
where the square root and division are both pointwise, α is the learning rate, and ϵ is a
small constant that prevents division by zero when the gradient magnitude is zero. The
term vt+1 is the squared gradient, and the positive root of this is used to normalize the
gradient itself, so all that remains is the sign in each coordinate direction. The result is
that the algorithm moves a fixed distance α along each coordinate, where the direction
is determined by whichever way is downhill (figure 6.9c). This simple algorithm makes
good progress in both directions but will not converge unless it happens to land exactly
at the minimum. Instead, it will bounce back and forth around the minimum.
Adaptive moment estimation, or Adam, takes this idea and adds momentum to both
the estimate of the gradient and the squared gradient:
Figure 6.9 Adaptive moment estimation (Adam). a) This loss function changes
quickly in the vertical direction but slowly in the horizontal direction. If we run
full-batch gradient descent with a learning rate that makes good progress in the
vertical direction, then the algorithm takes a long time to reach the final hor-
izontal position. b) If the learning rate is chosen so that the algorithm makes
good progress in the horizontal direction, it overshoots in the vertical direction
and becomes unstable. c) A straightforward approach is to move a fixed distance
along each axis at each step so that we move downhill in both directions. This is
accomplished by normalizing the gradient magnitude and retaining only the sign.
However, this does not usually converge to the exact minimum but instead oscil-
lates back and forth around it (here between the last two points). d) The Adam
algorithm uses momentum in both the estimated gradient and the normalization
term, which creates a smoother path.
$$m_{t+1} \leftarrow \beta \cdot m_t + (1-\beta)\frac{\partial L[\phi_t]}{\partial \phi}$$
$$v_{t+1} \leftarrow \gamma \cdot v_t + (1-\gamma)\left(\frac{\partial L[\phi_t]}{\partial \phi}\right)^2, \tag{6.15}$$
where β and γ are the momentum coefficients for the two statistics.
Using momentum is equivalent to taking a weighted average over the history of each
of these statistics. At the start of the procedure, all the previous measurements are
effectively zero, resulting in unrealistically small estimates. Consequently, we modify
these statistics using the rule:
$$\tilde{m}_{t+1} \leftarrow \frac{m_{t+1}}{1 - \beta^{t+1}}$$
$$\tilde{v}_{t+1} \leftarrow \frac{v_{t+1}}{1 - \gamma^{t+1}}. \tag{6.16}$$
Since β and γ are in the range [0, 1), the terms with exponents t + 1 become smaller
with each time step, the denominators become closer to one, and this modification has
a diminishing effect.
Finally, we update the parameters as before, but with the modified terms:
$$\phi_{t+1} \leftarrow \phi_t - \alpha \cdot \frac{\tilde{m}_{t+1}}{\sqrt{\tilde{v}_{t+1}} + \epsilon}. \tag{6.17}$$
The result is an algorithm that can converge to the overall minimum and makes good progress in every direction in the parameter space (see notebook 6.5). Note that Adam is usually used in a stochastic setting where the gradients and their squares are computed from mini-batches:
$$m_{t+1} \leftarrow \beta \cdot m_t + (1-\beta) \sum_{i \in B_t} \frac{\partial \ell_i[\phi_t]}{\partial \phi}$$
$$v_{t+1} \leftarrow \gamma \cdot v_t + (1-\gamma) \left(\sum_{i \in B_t} \frac{\partial \ell_i[\phi_t]}{\partial \phi}\right)^2, \tag{6.18}$$
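A rough NumPy sketch of the full procedure (equations 6.15–6.18) is shown below. The assumed function `grad(phi, batch)` returns the summed batch gradient, and the default hyperparameters are common choices rather than values prescribed by the text.

```python
import numpy as np

def adam(grad, phi, batches, alpha=0.001, beta=0.9, gamma=0.999, eps=1e-8):
    """Minimal Adam following equations 6.15-6.18.

    grad(phi, batch) returns the gradient of the batch loss at phi.
    """
    m = np.zeros_like(phi)   # smoothed gradient
    v = np.zeros_like(phi)   # smoothed squared gradient
    for t, batch in enumerate(batches):
        g = grad(phi, batch)
        m = beta * m + (1 - beta) * g
        v = gamma * v + (1 - gamma) * g**2
        m_hat = m / (1 - beta**(t + 1))    # bias correction (equation 6.16)
        v_hat = v / (1 - gamma**(t + 1))
        phi = phi - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return phi
```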
6.6 Summary
This chapter discussed model training. This problem was framed as finding parameters ϕ
that corresponded to the minimum of a loss function L[ϕ]. The gradient descent method
measures the gradient of the loss function for the current parameters (i.e., how the loss
changes when we make a small change to the parameters). Then it moves the parameters
in the direction that decreases the loss fastest. This is repeated until convergence.
For nonlinear functions, the loss function may have both local minima (where gradi-
ent descent gets trapped) and saddle points (where gradient descent may appear to have
converged but has not). Stochastic gradient descent helps mitigate these problems. At
each iteration, we use a different random subset of the data (a batch) to compute the
gradient. This adds noise to the process and helps prevent the algorithm from getting
trapped in a sub-optimal region of parameter space. Each iteration is also computation-
ally cheaper since it only uses a subset of the data. We saw that adding a momentum
term makes convergence more efficient. Finally, we introduced the Adam algorithm.
The ideas in this chapter apply to optimizing any model. The next chapter tackles
two aspects of training specific to neural networks. First, we address how to compute
the gradients of the loss with respect to the parameters of a neural network. This is
accomplished using the famous backpropagation algorithm. Second, we discuss how to
initialize the network parameters before optimization begins. Without careful initializa-
tion, the gradients used by the optimization can become extremely large or extremely
small, which can hinder the training process.
Notes
Convexity, minima, and saddle points: A function is convex if every chord (line segment
between two points on the surface) lies above the function and does not intersect it. This can
be tested algebraically by considering the Hessian matrix (the matrix of second derivatives):
$$\mathbf{H}[\phi] = \begin{bmatrix} \dfrac{\partial^2 L}{\partial \phi_0^2} & \dfrac{\partial^2 L}{\partial \phi_0 \partial \phi_1} & \cdots & \dfrac{\partial^2 L}{\partial \phi_0 \partial \phi_N} \\ \dfrac{\partial^2 L}{\partial \phi_1 \partial \phi_0} & \dfrac{\partial^2 L}{\partial \phi_1^2} & \cdots & \dfrac{\partial^2 L}{\partial \phi_1 \partial \phi_N} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 L}{\partial \phi_N \partial \phi_0} & \dfrac{\partial^2 L}{\partial \phi_N \partial \phi_1} & \cdots & \dfrac{\partial^2 L}{\partial \phi_N^2} \end{bmatrix}. \tag{6.19}$$
If the Hessian matrix is positive definite (has positive eigenvalues; see appendix B.3.7) for all possible parameter values, then the function is convex; the loss function will look like a smooth bowl (as in figure 6.1c), so training will be relatively easy. There will be a single global minimum and no local minima or saddle points.
For any loss function, the eigenvalues of the Hessian matrix at places where the gradient is
zero allow us to classify this position as (i) a minimum (the eigenvalues are all positive), (ii)
a maximum (the eigenvalues are all negative), or (iii) a saddle point (positive eigenvalues are
associated with directions in which we are at a minimum and negative ones with directions
where we are at a maximum).
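As a small illustration of this classification, the snippet below examines the Hessian of a hypothetical loss L[ϕ] = ϕ0² − ϕ1² at its stationary point; the mixed eigenvalue signs identify a saddle point.

```python
import numpy as np

# Hessian of the (hypothetical) loss L[phi] = phi_0**2 - phi_1**2
# evaluated at the stationary point phi = (0, 0).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigenvalues = np.linalg.eigvalsh(H)     # symmetric matrix -> real eigenvalues
if np.all(eigenvalues > 0):
    print("minimum")
elif np.all(eigenvalues < 0):
    print("maximum")
else:
    print("saddle point")               # mixed signs, as here
```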
Line search: Gradient descent with a fixed step size is inefficient because the distance moved
depends entirely on the magnitude of the gradient. It moves a long distance when the function
is changing fast (where perhaps it should be more cautious) but a short distance when the
function is changing slowly (where perhaps it should explore further). For this reason, gradient
descent methods are usually combined with a line search procedure in which we sample the
function along the desired direction to try to find the optimal step size. One such approach
is bracketing (figure 6.10). Another problem with gradient descent is that it tends to lead to
inefficient oscillatory behavior when descending valleys (e.g., path 1 in figure 6.5a).
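A minimal sketch of such a bracketing search is shown below; the choice of interior points at one- and two-thirds of the interval, and the assumption that the loss is unimodal on the search interval, are simplifications for illustration.

```python
def bracket_search(loss, a, d, tol=1e-6):
    """Shrink the interval [a, d] around a minimum of a 1D function `loss`.

    At each step, evaluate two interior points b < c and discard the
    sub-interval that cannot contain the minimum (assumes a unimodal loss).
    """
    while d - a > tol:
        b = a + (d - a) / 3.0          # interior points (one illustrative choice)
        c = a + 2.0 * (d - a) / 3.0
        if loss(b) > loss(c):
            a = b                      # minimum cannot lie in [a, b]
        else:
            d = c                      # minimum cannot lie in [c, d]
    return (a + d) / 2.0

# Example: minimum of (s - 0.7)**2 on [0, 1]
print(bracket_search(lambda s: (s - 0.7)**2, 0.0, 1.0))  # ~0.7
```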
Beyond gradient descent: Numerous algorithms have been developed that remedy the prob-
lems of gradient descent. Most notable is the Newton method, which takes the curvature of the
surface into account using the inverse of the Hessian matrix; if the gradient of the function is
changing quickly, then it applies a more cautious update. This method eliminates the need for
line search and does not suffer from oscillatory behavior. However, it has its own problems; in
its simplest form, it moves toward the nearest extremum, but this may be a maximum if we
are closer to the top of a hill than we are to the bottom of a valley. Moreover, computing the inverse Hessian is intractable when the number of parameters is large, as in neural networks (see problem 6.11).
Properties of SGD: The limit of SGD as the learning rate tends to zero is a stochastic
differential equation. Jastrzębski et al. (2018) showed that this equation relies on the learning-
rate to batch size ratio and that there is a relation between the learning rate to batch size ratio
and the width of the minimum found. Wider minima are considered more desirable; if the loss
function for test data is similar, then small errors in the parameter estimates will have little
effect on test performance. He et al. (2019) prove a generalization bound for SGD that has a
positive correlation with the ratio of batch size to learning rate. They train a large number of
models on different architectures and datasets and find empirical evidence that test accuracy
improves when the ratio of batch size to learning rate is low. Smith et al. (2018) and Goyal et al.
(2018) also identified the ratio of batch size to learning rate as being important for generalization
(see figure 20.10).
Momentum: The idea of using momentum to speed up optimization dates to Polyak (1964). Goh (2017) presents an in-depth discussion of the properties of momentum.
Figure 6.10 Line search using the bracketing approach. a) The current solution is
at position a (orange point), and we wish to search the region [a, d] (gray shaded
area). We define two points b, c interior to the search region and evaluate the loss
function at these points. Here L[b] > L[c], so we eliminate the range [a, b]. b) We
now repeat this procedure in the refined search region and find that L[b] < L[c],
so we eliminate the range [c, d]. c) We repeat this process until this minimum is
closely bracketed.
The Nesterov accelerated gradient method was introduced by Nesterov (1983). Nesterov momentum was first applied in the context of stochastic gradient descent by Sutskever et al. (2013).
changes the momentum term over time in a way that helps avoid high variance. Dozat (2016)
incorporated Nesterov momentum into the Adam algorithm.
SGD vs. Adam: There has been a lively discussion about the relative merits of SGD and
Adam. Wilson et al. (2017) provided evidence that SGD with momentum can find lower minima
than Adam, which generalizes better over a variety of deep learning tasks. However, this is
strange since SGD is a special case of Adam (when β = 0, γ = 1) once the modification
term (equation 6.16) becomes one, which happens quickly. It is hence more likely that SGD
outperforms Adam when we use Adam’s default hyperparameters. Loshchilov & Hutter (2019)
proposed AdamW, which substantially improves the performance of Adam in the presence of
L2 regularization (see section 9.1). Choi et al. (2019) provide evidence that if we search for the
best Adam hyperparameters, it performs just as well as SGD and converges faster. Keskar &
Socher (2017) proposed a method called SWATS that starts using Adam (to make rapid initial
progress) and then switches to SGD (to get better final generalization performance).
Exhaustive search: All the algorithms discussed in this chapter are iterative. A completely
different approach is to quantize the network parameters and exhaustively search the resulting
discretized parameter space using SAT solvers (Mézard & Mora, 2009). This approach has
the potential to find the global minimum and provide a guarantee that there is no lower loss
elsewhere but is only practical for very small models.
Problems
Problem 6.1 Show that the derivatives of the least squares loss function in equation 6.5 are
given by the expressions in equation 6.7.
Problem 6.2 A surface is guaranteed to be convex if the eigenvalues of the Hessian H[ϕ] are
positive everywhere. In this case, the surface has a unique minimum, and optimization is easy.
Find an algebraic expression for the Hessian matrix,
$$\mathbf{H}[\phi] = \begin{bmatrix} \dfrac{\partial^2 L}{\partial \phi_0^2} & \dfrac{\partial^2 L}{\partial \phi_0 \partial \phi_1} \\ \dfrac{\partial^2 L}{\partial \phi_1 \partial \phi_0} & \dfrac{\partial^2 L}{\partial \phi_1^2} \end{bmatrix}, \tag{6.20}$$
for the linear regression model (equation 6.5). Prove that this function is convex by showing that the eigenvalues are always positive. This can be done by showing that both the trace and the determinant of the matrix are positive (see appendices B.3.7–B.3.8).

Problem 6.3 Compute the derivatives of the least squares loss L[ϕ] with respect to the parameters ϕ0 and ϕ1 for the Gabor model (equation 6.8).
Problem 6.4∗ The logistic regression model uses a linear function to assign an input x to one of two classes y ∈ {0, 1}. For a 1D input and a 1D output, it has two parameters, ϕ0 and ϕ1 , and is defined by:

$$Pr(y = 1|x) = \text{sig}[\phi_0 + \phi_1 x], \tag{6.21}$$

where sig[•] is the logistic sigmoid function:

$$\text{sig}[z] = \frac{1}{1 + \exp[-z]}. \tag{6.22}$$
(i) Plot y against x for this model for different values of ϕ0 and ϕ1 and explain the qualitative
meaning of each parameter. (ii) What is a suitable loss function for this model? (iii) Compute
the derivatives of this loss function with respect to the parameters. (iv) Generate ten data
points from a normal distribution with mean -1 and standard deviation 1 and assign them the
label y = 0. Generate another ten data points from a normal distribution with mean 1 and
standard deviation 1 and assign these the label y = 1. Plot the loss as a heatmap in terms of
the two parameters ϕ0 and ϕ1 . (v) Is this loss function convex? How could you prove this?
Problem 6.5∗ Compute the derivatives of the least squares loss with respect to the ten parameters of the simple neural network model introduced in equation 3.1:

$$\text{f}[x, \phi] = \phi_0 + \phi_1\, a[\theta_{10} + \theta_{11} x] + \phi_2\, a[\theta_{20} + \theta_{21} x] + \phi_3\, a[\theta_{30} + \theta_{31} x].$$

Think carefully about what the derivative of the ReLU function a[•] will be.
Problem 6.6 Which of the functions in figure 6.11 is convex? Justify your answer. Characterize
each of the points 1–7 as (i) a local minimum, (ii) the global minimum, or (iii) neither.
Problem 6.7∗ The gradient descent trajectory for path 1 in figure 6.5a oscillates back and forth
inefficiently as it moves down the valley toward the minimum. It’s also notable that it turns at
right angles to the previous direction at each step. Provide a qualitative explanation for these
phenomena. Propose a solution that might help prevent this behavior.
Problem 6.8∗ Can (non-stochastic) gradient descent with a fixed learning rate escape local
minima?
Problem 6.9 We run the stochastic gradient descent algorithm for 1,000 iterations on a dataset
of size 100 with a batch size of 20. For how many epochs did we train the model?
Problem 6.10 Show that the momentum term mt (equation 6.11) is an infinite weighted sum
of the gradients at the previous iterations and derive an expression for the coefficients (weights)
of that sum.
Problem 6.11 What dimensions will the Hessian have if the model has one million parameters?
Consider a network f[x, ϕ] with multivariate input x, parameters ϕ, and three hidden
layers h1 , h2 , and h3 :
h1 = a[β 0 + Ω0 x]
h2 = a[β 1 + Ω1 h1 ]
h3 = a[β 2 + Ω2 h2 ]
f[x, ϕ] = β 3 + Ω3 h3 , (7.1)
where the function a[•] applies the activation function separately to every element of the
input. The model parameters ϕ = {β 0 , Ω0 , β 1 , Ω1 , β 2 , Ω2 , β 3 , Ω3 } consist of the bias
vectors β k and weight matrices Ωk between every layer (figure 7.1).
We also have individual loss terms ℓi , which return the negative log-likelihood of
the ground truth label yi given the model prediction f[xi , ϕ] for training input xi . For
example, this might be the least squares loss ℓi = (f[xi , ϕ] − yi )2 . The total loss is the
sum of these terms over the training data:
$$L[\phi] = \sum_{i=1}^{I} \ell_i. \tag{7.2}$$
The most commonly used optimization algorithm for training neural networks is
stochastic gradient descent (SGD), which updates the parameters as:
$$\phi_{t+1} \longleftarrow \phi_t - \alpha \sum_{i \in B_t} \frac{\partial \ell_i[\phi_t]}{\partial \phi}, \tag{7.3}$$
where α is the learning rate, and Bt contains the batch indices at iteration t. To compute
this update, we need to calculate the derivatives:
$$\frac{\partial \ell_i}{\partial \beta_k} \quad\text{and}\quad \frac{\partial \ell_i}{\partial \Omega_k}, \tag{7.4}$$
for the parameters {β k , Ωk } at every layer k ∈ {0, 1, . . . , K} and for each index i in the batch (see problem 7.1). The first part of this chapter describes the backpropagation algorithm, which computes these derivatives efficiently.
In the second part of the chapter, we consider how to initialize the network parameters
before we commence training. We describe methods to choose the initial weights Ωk and
biases β k so that training is stable.
7.2 Computing derivatives

The derivatives of the loss tell us how the loss changes when we make a small change
to the parameters. Optimization algorithms exploit this information to manipulate the
parameters so that the loss becomes smaller. The backpropagation algorithm computes
these derivatives. The mathematical details are somewhat involved, so we first make two
observations that provide some intuition.
Figure 7.1 Backpropagation forward pass. The goal is to compute the derivatives
of the loss ℓ with respect to each of the weights (arrows) and biases (not shown).
In other words, we want to know how a small change to each parameter will affect
the loss. Each weight multiplies the hidden unit at its source and contributes the
result to the hidden unit at its destination. Consequently, the effects of any small
change to the weight will be scaled by the activation of the source hidden unit.
For example, the blue weight is applied to the second hidden unit at layer 1; if
the activation of this unit doubles, then the effect of a small change to the blue
weight will double too. Hence, to compute the derivatives of the weights, we need
to calculate and store the activations at the hidden layers. This is known as the
forward pass since it involves running the network equations sequentially.
unit. This, in turn, changes the values of the hidden units in the subsequent layer, which
will change the hidden units in the layer after that, and so on, until a change is made to
the model output and, finally, the loss.
Hence, to know how changing a parameter modifies the loss, we also need to know
how changes to every subsequent hidden layer will, in turn, modify their successor. These
same quantities are required when considering other parameters in the same or earlier
layers. It follows that we can calculate them once and reuse them. For example, consider
computing the effect of a small change in weights that feed into hidden layers h3 , h2 ,
and h1 , respectively:
• To calculate how a small change in a weight or bias feeding into hidden layer h3
modifies the loss, we need to know (i) how a change in layer h3 changes the model
output f , and (ii) how a change in this output changes the loss ℓ (figure 7.2a).
• To calculate how a small change in a weight or bias feeding into hidden layer h2
modifies the loss, we need to know (i) how a change in layer h2 affects h3 , (ii) how h3
changes the model output, and (iii) how this output changes the loss (figure 7.2b).
• To calculate how a small change in a weight or bias feeding into hidden layer h1
modifies the loss, we need to know (i) how a change in layer h1 affects layer h2 ,
(ii) how a change in layer h2 affects layer h3 , (iii) how layer h3 changes the model
output, and (iv) how the model output changes the loss (figure 7.2c).
As we move backward through the network, we see that most of the terms we need
were already calculated in the previous step, so we do not need to re-compute them.
Proceeding backward through the network in this way to compute the derivatives is
known as the backward pass.
The ideas behind backpropagation are relatively easy to understand. However, the
derivation requires matrix calculus because the bias and weight terms are vectors and
matrices, respectively. To help grasp the underlying mechanics, the following section
derives backpropagation for a simpler toy model with scalar parameters. We then apply
the same approach to a deep neural network in section 7.4.
7.3 Toy example

Consider a toy model f[x, ϕ] with a scalar input x, a scalar output, and eight scalar parameters ϕ = {β0 , ω0 , β1 , ω1 , β2 , ω2 , β3 , ω3 }:

$$\text{f}[x,\phi] = \beta_3 + \omega_3 \cdot \cos\Bigl[\beta_2 + \omega_2 \cdot \exp\bigl[\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x]\bigr]\Bigr], \tag{7.5}$$

and an individual least squares loss term:

$$\ell_i = \bigl(\text{f}[x_i, \phi] - y_i\bigr)^2, \tag{7.6}$$
where, as usual, xi is the ith training input, and yi is the ith training output. You can
think of this as a simple neural network with one input, one output, one hidden unit at
each layer, and different activation functions sin[•], exp[•], and cos[•] between each layer.
We aim to compute the derivatives of the loss with respect to each of these parameters. These expressions could be derived by hand; for example, the derivative with respect to ω0 is:

$$\begin{aligned} \frac{\partial \ell_i}{\partial \omega_0} = &-2\Bigl(\beta_3 + \omega_3 \cdot \cos\Bigl[\beta_2 + \omega_2 \cdot \exp\bigl[\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i]\bigr]\Bigr] - y_i\Bigr) \\ &\cdot \omega_1 \omega_2 \omega_3 \cdot x_i \cdot \cos[\beta_0 + \omega_0 \cdot x_i] \cdot \exp\bigl[\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i]\bigr] \\ &\cdot \sin\Bigl[\beta_2 + \omega_2 \cdot \exp\bigl[\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i]\bigr]\Bigr]. \end{aligned} \tag{7.8}$$
Such expressions are awkward to derive and code without mistakes and do not exploit
the inherent redundancy; notice that the three exponential terms are the same.
The backpropagation algorithm is an efficient method for computing all of these
derivatives at once. It consists of (i) a forward pass, in which we compute and store a
series of intermediate values and the network output, and (ii) a backward pass, in which
Figure 7.3 Backpropagation forward pass. We compute and store each of the
intermediate variables in turn until we finally calculate the loss.
we calculate the derivatives of each parameter, starting at the end of the network, and
reusing previous calculations as we move toward the start.
Forward pass: We treat the computation of the loss as a series of intermediate calculations:

f0 = β0 + ω0 · xi
h1 = sin[f0 ]
f1 = β1 + ω1 · h1
h2 = exp[f1 ]
f2 = β2 + ω2 · h2
h3 = cos[f2 ]
f3 = β3 + ω3 · h3
ℓi = (f3 − yi )2 . (7.9)
We compute and store the values of the intermediate variables fk and hk (figure 7.3).
Backward pass #1: We now compute the derivatives of ℓi with respect to these inter-
mediate variables, but in reverse order:
$$\frac{\partial \ell_i}{\partial f_3} = 2(f_3 - y_i). \tag{7.11}$$
The next derivative can be calculated using the chain rule:

$$\frac{\partial \ell_i}{\partial h_3} = \frac{\partial f_3}{\partial h_3}\frac{\partial \ell_i}{\partial f_3}. \tag{7.12}$$
Figure 7.4 Backpropagation backward pass #1. We work backward from the end
of the function computing the derivatives ∂ℓi /∂fk and ∂ℓi /∂hk of the loss with
respect to the intermediate quantities. Each derivative is computed from the
previous one by multiplying by terms of the form ∂fk /∂hk or ∂hk /∂fk−1 .
The two terms on the right-hand side represent the effects of this chain. Notice that we already computed the second of these derivatives, and the other is the derivative of β3 + ω3 · h3 with respect to h3 , which is ω3 .
We continue in this way, computing the derivatives of the output with respect to
these intermediate quantities (figure 7.4):
$$\begin{aligned} \frac{\partial \ell_i}{\partial f_2} &= \frac{\partial h_3}{\partial f_2}\left(\frac{\partial f_3}{\partial h_3}\frac{\partial \ell_i}{\partial f_3}\right) \\ \frac{\partial \ell_i}{\partial h_2} &= \frac{\partial f_2}{\partial h_2}\left(\frac{\partial h_3}{\partial f_2}\frac{\partial f_3}{\partial h_3}\frac{\partial \ell_i}{\partial f_3}\right) \\ \frac{\partial \ell_i}{\partial f_1} &= \frac{\partial h_2}{\partial f_1}\left(\frac{\partial f_2}{\partial h_2}\frac{\partial h_3}{\partial f_2}\frac{\partial f_3}{\partial h_3}\frac{\partial \ell_i}{\partial f_3}\right) \\ \frac{\partial \ell_i}{\partial h_1} &= \frac{\partial f_1}{\partial h_1}\left(\frac{\partial h_2}{\partial f_1}\frac{\partial f_2}{\partial h_2}\frac{\partial h_3}{\partial f_2}\frac{\partial f_3}{\partial h_3}\frac{\partial \ell_i}{\partial f_3}\right) \\ \frac{\partial \ell_i}{\partial f_0} &= \frac{\partial h_1}{\partial f_0}\left(\frac{\partial f_1}{\partial h_1}\frac{\partial h_2}{\partial f_1}\frac{\partial f_2}{\partial h_2}\frac{\partial h_3}{\partial f_2}\frac{\partial f_3}{\partial h_3}\frac{\partial \ell_i}{\partial f_3}\right). \end{aligned} \tag{7.13}$$
In each case, we have already computed the quantities in the brackets in the previous step, and the remaining term has a simple expression (see problem 7.2). These equations embody Observation 2 from the previous section (figure 7.2); we can reuse the previously computed derivatives if we calculate them in reverse order.
Backward pass #2: Finally, we consider how the loss ℓi changes when we change the parameters {βk } and {ωk }. Once more, we apply the chain rule (figure 7.5):

$$\frac{\partial \ell_i}{\partial \beta_k} = \frac{\partial f_k}{\partial \beta_k}\frac{\partial \ell_i}{\partial f_k} \quad\text{and}\quad \frac{\partial \ell_i}{\partial \omega_k} = \frac{\partial f_k}{\partial \omega_k}\frac{\partial \ell_i}{\partial f_k}, \tag{7.14}$$

where the derivatives ∂ℓi /∂fk were already computed in the first backward pass, and:

$$\frac{\partial f_k}{\partial \beta_k} = 1 \quad\text{and}\quad \frac{\partial f_k}{\partial \omega_k} = h_k. \tag{7.15}$$
Figure 7.5 Backpropagation backward pass #2. Finally, we compute the deriva-
tives ∂ℓi /∂βk and ∂ℓi /∂ωk . Each derivative is computed by multiplying the
term ∂ℓi /∂fk by ∂fk /∂βk or ∂fk /∂ωk as appropriate.
This is consistent with Observation 1 from the previous section; the effect of a change
in the weight ωk is proportional to the value of the source variable hk (which was stored
in the forward pass). The final derivatives from the term f0 = β0 + ω0 · xi are:
$$\frac{\partial f_0}{\partial \beta_0} = 1 \quad\text{and}\quad \frac{\partial f_0}{\partial \omega_0} = x_i. \tag{7.16}$$

(See notebook 7.1 for backpropagation in this toy model.)
Backpropagation is both simpler and more efficient than computing the derivatives individually, as in equation 7.8.¹
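The following sketch (one possible implementation, not the book's notebook code) carries out the forward and backward passes of equations 7.9–7.16 for a single training pair; the parameter values passed in are arbitrary.

```python
import numpy as np

def toy_forward_backward(x, y, beta, omega):
    """Forward and backward passes for the toy model of equation 7.9.

    beta, omega are length-4 arrays; returns the loss and the derivatives
    of the loss with respect to each beta_k and omega_k.
    """
    # Forward pass: compute and store the intermediate quantities.
    f0 = beta[0] + omega[0] * x
    h1 = np.sin(f0)
    f1 = beta[1] + omega[1] * h1
    h2 = np.exp(f1)
    f2 = beta[2] + omega[2] * h2
    h3 = np.cos(f2)
    f3 = beta[3] + omega[3] * h3
    loss = (f3 - y) ** 2

    # Backward pass #1: derivatives w.r.t. intermediate quantities (eqns 7.11-7.13).
    dl_df3 = 2 * (f3 - y)
    dl_dh3 = omega[3] * dl_df3
    dl_df2 = -np.sin(f2) * dl_dh3
    dl_dh2 = omega[2] * dl_df2
    dl_df1 = np.exp(f1) * dl_dh2
    dl_dh1 = omega[1] * dl_df1
    dl_df0 = np.cos(f0) * dl_dh1

    # Backward pass #2: derivatives w.r.t. parameters (equations 7.15-7.16).
    dl_dbeta = np.array([dl_df0, dl_df1, dl_df2, dl_df3])
    dl_domega = np.array([dl_df0 * x, dl_df1 * h1, dl_df2 * h2, dl_df3 * h3])
    return loss, dl_dbeta, dl_domega
```

Comparing the returned derivatives against finite differences (perturbing one parameter by a small amount) is a quick way to check the chain-rule calculations.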
7.4 Backpropagation algorithm

Now we repeat this process for a three-layer network (figure 7.1). The intuition and much
of the algebra are identical. The main differences are that intermediate variables fk , hk
are vectors, the biases β k are vectors, the weights Ωk are matrices, and we are using
ReLU functions rather than simple algebraic functions like cos[•].
f0 = β 0 + Ω0 xi
h1 = a[f0 ]
f1 = β 1 + Ω1 h1
h2 = a[f1 ]
f2 = β 2 + Ω2 h2
h3 = a[f2 ]
f3 = β 3 + Ω3 h3
ℓi = l[f3 , yi ], (7.17)
¹Note that we did not actually need the derivatives ∂ℓi /∂hk of the loss with respect to the activations. In the final backpropagation algorithm, we will not compute these explicitly.
where fk−1 represents the pre-activations at the k th hidden layer (i.e., the values before
the ReLU function a[•]) and hk contains the activations at the k th hidden layer (i.e., after
the ReLU function). The term l[f3 , yi ] represents the loss function (e.g., least squares or
binary cross-entropy loss). In the forward pass, we work through these calculations and
store all the intermediate quantities.
Backward pass #1: Now let's consider how the loss changes when the pre-activations f0 , f1 , f2 change. Applying the chain rule, the expression for the derivative of the loss ℓi with respect to f2 is (see appendix B.5 for a review of matrix calculus):

$$\frac{\partial \ell_i}{\partial f_2} = \frac{\partial h_3}{\partial f_2}\frac{\partial f_3}{\partial h_3}\frac{\partial \ell_i}{\partial f_3}. \tag{7.18}$$

The derivatives with respect to f1 and f0 follow the same pattern:

$$\frac{\partial \ell_i}{\partial f_1} = \frac{\partial h_2}{\partial f_1}\frac{\partial f_2}{\partial h_2}\left(\frac{\partial h_3}{\partial f_2}\frac{\partial f_3}{\partial h_3}\frac{\partial \ell_i}{\partial f_3}\right) \tag{7.19}$$

$$\frac{\partial \ell_i}{\partial f_0} = \frac{\partial h_1}{\partial f_0}\frac{\partial f_1}{\partial h_1}\left(\frac{\partial h_2}{\partial f_1}\frac{\partial f_2}{\partial h_2}\frac{\partial h_3}{\partial f_2}\frac{\partial f_3}{\partial h_3}\frac{\partial \ell_i}{\partial f_3}\right). \tag{7.20}$$
Note that in each case, the term in brackets was computed in the previous step. By working backward through the network, we can reuse the previous computations (see problem 7.3).
Moreover, the terms themselves are simple. Working backward through the right-hand side of equation 7.18, we have (see problems 7.4–7.5):
• The derivative ∂ℓi /∂f3 of the loss ℓi with respect to the network output f3 will
depend on the loss function but usually has a simple form.
• The derivative ∂f3 /∂h3 of the network output with respect to hidden layer h3 is:
$$\frac{\partial f_3}{\partial h_3} = \frac{\partial}{\partial h_3}\bigl(\beta_3 + \Omega_3 h_3\bigr) = \Omega_3^T. \tag{7.21}$$
If you are unfamiliar with matrix calculus, this result is not obvious; it is explored in problem 7.6.
• The derivative ∂h3 /∂f2 of the output h3 of the activation function with respect to its input f2 will depend on the activation function. It will be a diagonal matrix since each activation only depends on the corresponding pre-activation. For ReLU functions, the diagonal terms are zero everywhere f2 is less than zero and one otherwise (figure 7.6; problems 7.7–7.8). Rather than multiply by this matrix, we extract the diagonal terms as a vector I[f2 > 0] and pointwise multiply, which is more efficient.
The terms on the right-hand side of equations 7.19 and 7.20 have similar forms. As
we progress back through the network, we alternately (i) multiply by the transpose of
the weight matrices ΩTk and (ii) threshold based on the inputs fk−1 to the hidden layer.
These inputs were stored during the forward pass.
Backward pass #2: Now that we know how to compute ∂ℓi /∂fk , we can focus on calculating the derivatives of the loss with respect to the weights and biases. To calculate the derivatives with respect to the biases β k , we again use the chain rule:

$$\frac{\partial \ell_i}{\partial \beta_k} = \frac{\partial f_k}{\partial \beta_k}\frac{\partial \ell_i}{\partial f_k} = \frac{\partial \ell_i}{\partial f_k}, \tag{7.22}$$

since ∂fk /∂β k is the identity. Similarly, the derivative with respect to the weights Ωk is:

$$\frac{\partial \ell_i}{\partial \Omega_k} = \frac{\partial \ell_i}{\partial f_k}\, h_k^T. \tag{7.23}$$
We now briefly summarize the final backpropagation algorithm. Consider a deep neural
network f[xi , ϕ] that takes input xi , has K hidden layers with ReLU activations, and
individual loss term ℓi = l[f[xi , ϕ], yi ]. The goal of backpropagation is to compute the
derivatives ∂ℓi /∂β k and ∂ℓi /∂Ωk with respect to the biases β k and weights Ωk .
Forward pass: We compute and store the following intermediate quantities:

f0 = β 0 + Ω0 xi
hk = a[fk−1 ] k ∈ {1, 2, . . . , K}
fk = β k + Ωk hk . k ∈ {1, 2, . . . , K} (7.24)
Backward pass: We start with the derivative ∂ℓi /∂fK of the loss function ℓi with respect
to the network output fK and work backward through the network:
$$\begin{aligned} \frac{\partial \ell_i}{\partial \beta_k} &= \frac{\partial \ell_i}{\partial f_k} & k &\in \{K, K-1, \ldots, 1\} \\ \frac{\partial \ell_i}{\partial \Omega_k} &= \frac{\partial \ell_i}{\partial f_k}\, h_k^T & k &\in \{K, K-1, \ldots, 1\} \\ \frac{\partial \ell_i}{\partial f_{k-1}} &= \mathbb{I}[f_{k-1} > 0] \odot \left(\Omega_k^T \frac{\partial \ell_i}{\partial f_k}\right), & k &\in \{K, K-1, \ldots, 1\} \end{aligned} \tag{7.25}$$
where ⊙ denotes pointwise multiplication, and I[fk−1 > 0] is a vector containing ones
where fk−1 is greater than zero and zeros elsewhere. Finally, we compute the derivatives
with respect to the first set of biases and weights:
$$\begin{aligned} \frac{\partial \ell_i}{\partial \beta_0} &= \frac{\partial \ell_i}{\partial f_0} \\ \frac{\partial \ell_i}{\partial \Omega_0} &= \frac{\partial \ell_i}{\partial f_0}\, x_i^T. \end{aligned} \tag{7.26}$$
We calculate these derivatives for every training example in the batch and sum them together to retrieve the gradient for the SGD update (see problem 7.10).
Note that the backpropagation algorithm is extremely efficient; the most demanding computational step in both the forward and backward pass is matrix multiplication (by Ω and ΩT , respectively), which only requires additions and multiplications. However, it is not memory efficient; the intermediate values in the forward pass must all be stored, and this can limit the size of the model we can train (see notebook 7.2).
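As an illustration, a minimal NumPy version of this forward and backward pass for a single training example is sketched below; it assumes a least squares loss, and the layer sizes are whatever the supplied weight matrices imply.

```python
import numpy as np

def backprop_one_example(x, y, betas, Omegas):
    """Forward pass (eqn 7.24) and backward pass (eqns 7.25-7.26) for a ReLU net.

    betas, Omegas: lists of bias vectors / weight matrices for layers 0..K.
    Assumes a least squares loss l = sum((f_K - y)**2).
    """
    K = len(Omegas) - 1
    # Forward pass: store pre-activations f and activations h.
    f = [betas[0] + Omegas[0] @ x]
    h = [None]
    for k in range(1, K + 1):
        h.append(np.maximum(f[k - 1], 0))               # ReLU
        f.append(betas[k] + Omegas[k] @ h[k])

    # Backward pass.
    dl_df = 2 * (f[K] - y)                              # derivative of least squares loss
    dl_dbetas, dl_dOmegas = [None] * (K + 1), [None] * (K + 1)
    for k in range(K, 0, -1):
        dl_dbetas[k] = dl_df
        dl_dOmegas[k] = np.outer(dl_df, h[k])
        dl_df = (f[k - 1] > 0) * (Omegas[k].T @ dl_df)  # eqn 7.25, pointwise mask
    dl_dbetas[0] = dl_df
    dl_dOmegas[0] = np.outer(dl_df, x)                  # eqn 7.26
    return dl_dbetas, dl_dOmegas
```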
Although it’s important to understand the backpropagation algorithm, it’s unlikely that
you will need to code it in practice. Modern deep learning frameworks such as PyTorch
and TensorFlow calculate the derivatives automatically, given the model specification.
This is known as algorithmic differentiation.
Each functional component (linear transform, ReLU activation, loss function) in the
framework knows how to compute its own derivative. For example, the PyTorch ReLU
function zout = relu[zin ] knows how to compute the derivative of its output zout with
respect to its input zin . Similarly, a linear function zout = β + Ωzin knows how to
compute the derivatives of the output zout with respect to the input zin and with re-
spect to the parameters β and Ω. The algorithmic differentiation framework also knows
the sequence of operations in the network and thus has all the information required to
perform the forward and backward passes.
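As a minimal illustration (the two-layer model and the sizes here are arbitrary), the following PyTorch fragment computes every parameter derivative with a single call to backward():

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(3, 4)                 # batch of 3 four-dimensional inputs
y = torch.randn(3, 2)                 # corresponding targets

loss = nn.functional.mse_loss(model(x), y)
loss.backward()                       # backward pass fills .grad for every parameter
print(model[0].weight.grad.shape)     # torch.Size([8, 4])
```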
These frameworks exploit the massive parallelism of modern graphics processing units
(GPUs). Computations such as matrix multiplication (which features in both the forward
and backward pass) are naturally amenable to parallelization. Moreover, it's possible to perform the forward and backward passes for the entire batch in parallel if the model and intermediate results in the forward pass do not exceed the available memory (see problem 7.11).
Since the training algorithm now processes the entire batch in parallel, the input
becomes a multi-dimensional tensor. In this context, a tensor can be considered the
generalization of a matrix to arbitrary dimensions. Hence, a vector is a 1D tensor, a
matrix is a 2D tensor, and a 3D tensor is a 3D grid of numbers. Until now, the training
data have been 1D, so the input for backpropagation would be a 2D tensor where the
first dimension indexes the batch element and the second indexes the data dimension.
In subsequent chapters, we will encounter more complex structured input data. For
example, in models where the input is an RGB image, the original data examples are
3D (height × width × channel). Here, the input to the learning framework would be a
4D tensor, where the extra dimension indexes the batch element.
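For example, in PyTorch the two cases described here would have the following shapes; the channel-first ordering in the image case is a PyTorch convention, and other frameworks may order these dimensions differently:

```python
import torch

vector_batch = torch.zeros(128, 40)           # 2D tensor: batch x data dimension
image_batch = torch.zeros(128, 3, 224, 224)   # 4D tensor: batch x channels x height x width
print(vector_batch.ndim, image_batch.ndim)    # prints: 2 4
```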
7.5 Parameter initialization

The backpropagation algorithm computes the derivatives that are used by stochastic gradient descent and Adam to train the model. We now address how to initialize the
parameters before we start training. To see why this is crucial, consider that during the
forward pass, each set of pre-activations fk is computed as:
fk = β k + Ωk hk
= β k + Ωk a[fk−1 ], (7.27)
where a[•] applies the ReLU functions and Ωk and β k are the weights and biases, respec-
tively. Imagine that we initialize all the biases to zero and the elements of Ωk according
to a normal distribution with mean zero and variance σ 2 . Consider two scenarios:
• If the variance σ 2 is very small (e.g., 10−5 ), then each element of β k + Ωk hk will be
a weighted sum of hk where the weights are very small; the result will likely have
a smaller magnitude than the input. In addition, the ReLU function clips values
less than zero, so the range of hk will be half that of fk−1 . Consequently, the
magnitudes of the pre-activations at the hidden layers will get smaller and smaller
as we progress through the network.
• If the variance σ 2 is very large (e.g., 105 ), then each element of β k + Ωk hk will be
a weighted sum of hk where the weights are very large; the result is likely to have
a much larger magnitude than the input. The ReLU function halves the range of
the inputs, but if σ 2 is large enough, the magnitudes of the pre-activations will still
get larger as we progress through the network.
In these two situations, the values at the pre-activations can become so small or so large
that they cannot be represented with finite precision floating point arithmetic.
Even if the forward pass is tractable, the same logic applies to the backward pass.
Each gradient update (equation 7.25) consists of multiplying by ΩT . If the values of Ω
are not initialized sensibly, then the gradient magnitudes may decrease or increase un-
controllably during the backward pass. These cases are known as the vanishing gradient
problem and the exploding gradient problem, respectively. In the former case, updates to
the model become vanishingly small. In the latter case, they become unstable.
We now present a mathematical version of the same argument. Consider the computation
between adjacent pre-activations f and f ′ with dimensions Dh and Dh′ , respectively:
h = a[f ],
f ′ = β + Ωh (7.28)
where h represents the activations, Ω and β represent the weights and biases, and a[•]
is the activation function.
Assume the pre-activations fj in the input layer f have variance σ²f . Consider initializing the biases βi to zero and the weights Ωij as normally distributed with mean zero and variance σ²Ω . Now we derive expressions for the mean and variance of the pre-activations f ′ in the subsequent layer.
The expectation (mean) E[fi′ ] of the intermediate values fi′ is (see appendix C.2):

$$\begin{aligned} \mathbb{E}[f_i'] &= \mathbb{E}\Biggl[\beta_i + \sum_{j=1}^{D_h} \Omega_{ij} h_j\Biggr] \\ &= \mathbb{E}[\beta_i] + \sum_{j=1}^{D_h} \mathbb{E}[\Omega_{ij} h_j] \\ &= \mathbb{E}[\beta_i] + \sum_{j=1}^{D_h} \mathbb{E}[\Omega_{ij}]\,\mathbb{E}[h_j] \\ &= 0 + \sum_{j=1}^{D_h} 0 \cdot \mathbb{E}[h_j] = 0, \end{aligned} \tag{7.29}$$
where Dh is the dimensionality of the input layer h. We have used the rules for manipulating expectations (appendix C.2.1), and we have assumed that the distributions over the hidden units hj and the network weights Ωij are independent between the second and third lines.
Using this result, we see that the variance σ²fi′ of the pre-activations fi′ is:

$$\begin{aligned} \sigma_{f_i'}^2 &= \mathbb{E}\bigl[f_i'^2\bigr] - \mathbb{E}[f_i']^2 = \mathbb{E}\Biggl[\Bigl(\sum_{j=1}^{D_h}\Omega_{ij}h_j\Bigr)^2\Biggr] - 0 \\ &= \sum_{j=1}^{D_h} \mathbb{E}\bigl[\Omega_{ij}^2\bigr]\,\mathbb{E}\bigl[h_j^2\bigr] = \sum_{j=1}^{D_h} \sigma_{\Omega}^2\,\mathbb{E}\bigl[h_j^2\bigr] = \sigma_{\Omega}^2 \sum_{j=1}^{D_h} \mathbb{E}\bigl[h_j^2\bigr], \end{aligned} \tag{7.30}$$

where we have used the variance identity σ² = E[(z − E[z])²] = E[z²] − E[z]² (see appendix C.2.3). We have assumed once more that the distributions of the weights Ωij and the hidden units hj are independent, which allows each expectation of a product to be split into a product of expectations.
Assuming that the distribution of pre-activations fj at the previous layer is symmetric about zero, half of these pre-activations will be clipped by the ReLU function, and the second moment E[h²j ] will be half the variance σ²f of fj (see problem 7.14):

$$\sigma_{f_i'}^2 = \sigma_{\Omega}^2 \sum_{j=1}^{D_h} \frac{\sigma_f^2}{2} = \frac{1}{2} D_h \sigma_{\Omega}^2 \sigma_f^2. \tag{7.31}$$
Figure 7.7 Weight initialization. Consider a deep network with 50 hidden layers and Dh = 100 hidden units per layer. The network has a 100-dimensional input x initialized from a standard normal distribution, a single fixed target y = 0, and a least squares loss function. The bias vectors β k are initialized to zero, and the weight matrices Ωk are initialized with a normal distribution with mean zero and five different variances σ²Ω ∈ {0.001, 0.01, 0.02, 0.1, 1.0}. a) Variance of hidden unit activations computed in the forward pass as a function of the network layer. For He initialization (σ²Ω = 2/Dh = 0.02), the variance is stable. However, for larger values, it increases rapidly, and for smaller values, it decreases rapidly (note log scale). b) The variance of the gradients in the backward pass (solid lines) continues this trend; if we initialize with a value larger than 0.02, the magnitude of the gradients increases rapidly as we pass back through the network. If we initialize with a smaller value, then the magnitude decreases. These are known as the exploding gradient and vanishing gradient problems, respectively.
This, in turn, implies that if we want the variance σf2 ′ of the subsequent pre-activations f ′
to be the same as the variance σf2 of the original pre-activations f during the forward
pass, we should set:
$$\sigma_{\Omega}^2 = \frac{2}{D_h}, \tag{7.32}$$
where Dh is the dimension of the original layer to which the weights were applied. This
is known as He initialization.
A similar argument establishes how the variance of the gradients ∂l/∂fk changes during
the backward pass. During the backward pass, we multiply by the transpose ΩT of the
weight matrix (equation 7.25), so the equivalent expression becomes:
$$\sigma_{\Omega}^2 = \frac{2}{D_{h'}}, \tag{7.33}$$
where Dh′ is the dimension of the layer that the weights feed into.
If the weight matrix Ω is not square (i.e., there are different numbers of hidden units
in the two adjacent layers, so Dh and Dh′ differ), then it is not possible to choose the
variance to satisfy both equations 7.32 and 7.33 simultaneously. One possible compromise
is to use the mean (Dh + Dh′ )/2 as a proxy for the number of terms, which gives:
$$\sigma_{\Omega}^2 = \frac{4}{D_h + D_{h'}}. \tag{7.34}$$
Figure 7.7 shows empirically that both the variance of the hidden units in the forward pass and the variance of the gradients in the backward pass remain stable when the parameters are initialized appropriately (see problem 7.15 and notebook 7.3).
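The sketch below mimics the forward-pass part of this experiment on a smaller scale: it propagates one random input through 50 ReLU layers of width 100 under He initialization and prints the variance of the pre-activations, which stays roughly constant; the depth, width, and random seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
D_h, n_layers = 100, 50
h = rng.normal(size=D_h)                       # input activations (variance ~1)

sigma2_Omega = 2.0 / D_h                       # He initialization (equation 7.32)
for layer in range(1, n_layers + 1):
    Omega = rng.normal(0.0, np.sqrt(sigma2_Omega), size=(D_h, D_h))
    f = Omega @ h                              # biases initialized to zero
    h = np.maximum(f, 0)                       # ReLU
    if layer % 10 == 0:
        print(layer, np.var(f))                # pre-activation variance stays roughly constant
```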
7.6 Example training code

The primary focus of this book is scientific; this is not a guide for implementing deep learning models. Nonetheless, in figure 7.8, we present PyTorch code that implements the ideas explored in this book so far. The code defines a neural network and initializes the weights (see problems 7.16–7.17). It creates random input and output datasets and defines a least squares loss
function. The model is trained from the data using SGD with momentum in batches of
size 10 over 100 epochs. The learning rate starts at 0.01 and halves every 10 epochs.
The takeaway is that although the underlying ideas in deep learning are quite com-
plex, implementation is relatively simple. For example, all of the details of the back-
propagation are hidden in the single line of code: loss.backward().
7.7 Summary
The previous chapter introduced stochastic gradient descent (SGD), an iterative opti-
mization algorithm that aims to find the minimum of a function. In the context of neural
networks, this algorithm finds the parameters that minimize the loss function. SGD re-
lies on the gradient of the loss function with respect to the parameters, which must be
initialized before optimization. This chapter has addressed these two problems for deep
neural networks.
The gradients must be evaluated for a very large number of parameters, for each member of the batch, and at each SGD iteration.
# He initialization of weights
def weights_init(layer_in):
if isinstance(layer_in, nn.Linear):
nn.init.kaiming_normal_(layer_in.weight)
layer_in.bias.data.fill_(0.0)
model.apply(weights_init)
# create 100 random data points and store in data loader class
x = torch.randn(100, D_i)
y = torch.randn(100, D_o)
data_loader = DataLoader(TensorDataset(x,y), batch_size=10, shuffle=True)
Figure 7.8 Sample code for training two-layer network on random data.
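Figure 7.8 shows only an excerpt of the program. The sketch below fills in one possible complete version of the recipe described above; the layer sizes D_i, D_k, D_o and the use of nn.Sequential are assumptions for illustration rather than the book's exact code.

```python
import torch, torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

D_i, D_k, D_o = 10, 40, 5      # illustrative input, hidden, and output sizes
model = nn.Sequential(nn.Linear(D_i, D_k), nn.ReLU(), nn.Linear(D_k, D_o))

def weights_init(layer_in):    # He initialization, as in figure 7.8
    if isinstance(layer_in, nn.Linear):
        nn.init.kaiming_normal_(layer_in.weight)
        layer_in.bias.data.fill_(0.0)
model.apply(weights_init)

x, y = torch.randn(100, D_i), torch.randn(100, D_o)
data_loader = DataLoader(TensorDataset(x, y), batch_size=10, shuffle=True)

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(100):
    for x_batch, y_batch in data_loader:
        optimizer.zero_grad()
        loss = criterion(model(x_batch), y_batch)
        loss.backward()                  # backpropagation in one line
        optimizer.step()
    scheduler.step()                     # halve the learning rate every 10 epochs
```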
It is hence imperative that the gradient computation is efficient, and to this end, the backpropagation algorithm was introduced.
Careful parameter initialization is also critical. The magnitudes of the hidden unit
activations can either decrease or increase exponentially in the forward pass. The same
is true of the gradient magnitudes in the backward pass, where these behaviors are known
as the vanishing gradient and exploding gradient problems. Both impede training but
can be avoided with appropriate initialization.
We’ve now defined the model and the loss function, and we can train a model for a
given task. The next chapter discusses how to measure the model performance.
Notes
Closely related to these methods are schemes such as BatchNorm (Ioffe & Szegedy, 2015), in
which the network normalizes the variance of each batch as part of its processing at every
step. BatchNorm and its variants are discussed in chapter 11. Other initialization schemes have
been proposed for specific architectures, including the ConvolutionOrthogonal initializer (Xiao
et al., 2018a) for convolutional networks, Fixup (Zhang et al., 2019a) for residual networks, and
TFixup (Huang et al., 2020a) and DTFixup (Xu et al., 2021b) for transformers.
Distributed training: For sufficiently large models, the memory requirements or total re-
quired time may be too much for a single processor. In this case, we must use distributed
training, in which training takes place in parallel across multiple processors. There are several
approaches to parallelism. In data parallelism, each processor or node contains a full copy of
the model but runs a subset of the batch (see Xing et al., 2015; Li et al., 2020b). The gradients
from each node are aggregated centrally and then redistributed back to each node to ensure
that the models remain consistent. This is known as synchronous training. The synchronization
required to aggregate and redistribute the gradients can be a performance bottleneck, and this
leads to the idea of asynchronous training. For example, in the Hogwild! algorithm (Recht
et al., 2011), the gradient from a node is used to update a central model whenever it is ready.
The updated model is then redistributed to the node. This means that each node may have a
slightly different version of the model at any given time, so the gradient updates may be stale;
however, it works well in practice. Other decentralized schemes have also been developed. For
example, in Zhang et al. (2016a), the individual nodes update one another in a ring structure.
Data parallelism methods still assume that the entire model can be held in the memory of a
single node. Pipeline model parallelism stores different layers of the network on different nodes
and hence does not have this requirement. In a naïve implementation, the first node runs the
forward pass for the batch on the first few layers and passes the result to the next node, which
runs the forward pass on the next few layers and so on. In the backward pass, the gradients are
updated in the opposite order. The obvious disadvantage of this approach is that each machine
lies idle for most of the cycle. Various schemes revolving around each node processing micro-
batches sequentially have been proposed to reduce this inefficiency (e.g., Huang et al., 2019;
Narayanan et al., 2021a). Finally, in tensor model parallelism, computation at a single network
layer is distributed across nodes (e.g., Shoeybi et al., 2019). A good overview of distributed
training methods can be found in Narayanan et al. (2021b), who combine tensor, pipeline, and
data parallelism to train a language model with one trillion parameters on 3072 GPUs.
Problems
Problem 7.1 A two-layer network with two hidden units in each layer can be defined as:
$$y = \phi_0 + \phi_1\, a\Bigl[\psi_{01} + \psi_{11}\, a[\theta_{01} + \theta_{11} x] + \psi_{21}\, a[\theta_{02} + \theta_{12} x]\Bigr] + \phi_2\, a\Bigl[\psi_{02} + \psi_{12}\, a[\theta_{01} + \theta_{11} x] + \psi_{22}\, a[\theta_{02} + \theta_{12} x]\Bigr], \tag{7.35}$$
where the functions a[•] are ReLU functions. Compute the derivatives of the output y with
respect to each of the 13 parameters ϕ• , θ•• , and ψ•• directly (i.e., not using the backpropagation
algorithm). The derivative of the ReLU function with respect to its input ∂a[z]/∂z is the
indicator function I[z > 0], which returns one if the argument is greater than zero and zero
otherwise (figure 7.6).
Problem 7.2 Find an expression for the final term in each of the five chains of derivatives in
equation 7.13.
Problem 7.3 What size are each of the terms in equation 7.20?
Problem 7.4 Calculate the derivative ∂ℓi /∂f[xi , ϕ] for the least squares loss function:

$$\ell_i = \bigl(\text{f}[x_i, \phi] - y_i\bigr)^2. \tag{7.36}$$
Problem 7.5 Calculate the derivative ∂ℓi /∂f[xi , ϕ] for the binary classification loss function:

$$\ell_i = -(1 - y_i)\log\Bigl[1 - \text{sig}\bigl[\text{f}[x_i, \phi]\bigr]\Bigr] - y_i \log\Bigl[\text{sig}\bigl[\text{f}[x_i, \phi]\bigr]\Bigr], \tag{7.37}$$

where the function sig[•] is the logistic sigmoid and is defined as:

$$\text{sig}[z] = \frac{1}{1 + \exp[-z]}. \tag{7.38}$$
Problem 7.6∗ Show that for the linear transformation z = β + Ωh:

$$\frac{\partial z}{\partial h} = \Omega^T, \tag{7.39}$$

where ∂z/∂h is a matrix containing the term ∂zi /∂hj in its ith column and j th row. To do this, first find an expression for the constituent elements ∂zi /∂hj , and then consider the form that the matrix ∂z/∂h must take.
Problem 7.7 Consider the case where we use the logistic sigmoid (see equation 7.38) as an
activation function, so h = sig[f ]. Compute the derivative ∂h/∂f for this activation function.
What happens to the derivative when the input takes (i) a large positive value and (ii) a large
negative value?
Problem 7.8 Consider using (i) the Heaviside function and (ii) the rectangular function as activation functions:

$$\text{Heaviside}[z] = \begin{cases} 0 & z < 0 \\ 1 & z \geq 0 \end{cases}, \tag{7.40}$$

and

$$\text{rect}[z] = \begin{cases} 0 & z < 0 \\ 1 & 0 \leq z \leq 1 \\ 0 & z > 1 \end{cases}. \tag{7.41}$$

Discuss why these functions are problematic for neural network training with gradient-based optimization methods.

Figure 7.9 Computational graph for problem 7.12 and problem 7.13. Adapted from Domke (2010).
Problem 7.9∗ Consider a loss function ℓ[f ], where f = β + Ωh. We want to find how the loss ℓ
changes when we change Ω, which we’ll express with a matrix that contains the derivative
∂ℓ/∂Ωij at the ith row and j th column. Find an expression for ∂fi /∂Ωij and, using the chain
rule, show that:
$$\frac{\partial \ell}{\partial \Omega} = \frac{\partial \ell}{\partial f}\, h^T. \tag{7.42}$$
Problem 7.10∗ Derive the equations for the backward pass of the backpropagation algorithm
for a network that uses leaky ReLU activations, which are defined as:
$$a[z] = \begin{cases} \alpha \cdot z & z < 0 \\ z & z \geq 0 \end{cases}, \tag{7.43}$$

where α is a small positive constant.
Problem 7.11 Consider training a network with fifty layers, where we only have enough memory
to store the pre-activations at every tenth hidden layer during the forward pass. Explain how
to compute the derivatives in this situation using gradient checkpointing.
Problem 7.12∗ This problem explores computing derivatives on general acyclic computational
graphs. Consider the function:
$$y = \exp\Bigl[\exp[x] + \exp[x]^2\Bigr] + \sin\Bigl[\exp[x] + \exp[x]^2\Bigr]. \tag{7.44}$$
This function can be computed as a sequence of intermediate quantities:

$$\begin{aligned} f_1 &= \exp[x] \\ f_2 &= f_1^2 \\ f_3 &= f_1 + f_2 \\ f_4 &= \exp[f_3] \\ f_5 &= \sin[f_3] \\ y &= f_4 + f_5. \end{aligned} \tag{7.45}$$
The associated computational graph is depicted in figure 7.9. Compute the derivative ∂y/∂x
by reverse-mode differentiation. In other words, compute in order:
$$\frac{\partial y}{\partial f_5},\quad \frac{\partial y}{\partial f_4},\quad \frac{\partial y}{\partial f_3},\quad \frac{\partial y}{\partial f_2},\quad \frac{\partial y}{\partial f_1}, \quad\text{and}\quad \frac{\partial y}{\partial x}, \tag{7.46}$$
using the chain rule in each case to make use of the derivatives already computed.
Problem 7.13∗ For the same function as in problem 7.12, compute the derivative ∂y/∂x by forward-mode differentiation. In other words, compute in order:

$$\frac{\partial f_1}{\partial x},\quad \frac{\partial f_2}{\partial x},\quad \frac{\partial f_3}{\partial x},\quad \frac{\partial f_4}{\partial x},\quad \frac{\partial f_5}{\partial x}, \quad\text{and}\quad \frac{\partial y}{\partial x}, \tag{7.47}$$

using the chain rule in each case to make use of the derivatives already computed.
Problem 7.14 Consider a random variable a with variance Var[a] = σ² and a symmetrical distribution around the mean E[a] = 0. Prove that if we pass this variable through the ReLU function:

$$b = \text{ReLU}[a] = \begin{cases} 0 & a < 0 \\ a & a \geq 0 \end{cases}, \tag{7.48}$$

then the second moment of the transformed variable is E[b²] = σ²/2.
Problem 7.15 What would you expect to happen if we initialized all of the weights and biases
in the network to zero?
Problem 7.16 Implement the code in figure 7.8 in PyTorch and plot the training loss as a
function of the number of epochs.
Problem 7.17 Change the code in figure 7.8 to tackle a binary classification problem. You will need to (i) change the targets y so they are binary, (ii) change the network to predict numbers between zero and one, and (iii) change the loss function appropriately.
Measuring performance
Previous chapters described neural network models, loss functions, and training algo-
rithms. This chapter considers how to measure the performance of the trained models.
With sufficient capacity (i.e., number of hidden units), a neural network model will often
perform perfectly on the training data. However, this does not necessarily mean it will
generalize well to new test data.
We will see that the test errors have three distinct causes and that their relative
contributions depend on (i) the inherent uncertainty in the task, (ii) the amount of
training data, and (iii) the choice of model. The latter dependency raises the issue of
hyperparameter search. We discuss how to select both the model hyperparameters (e.g.,
the number of hidden layers and the number of hidden units in each) and the learning
algorithm hyperparameters (e.g., the learning rate and batch size).
8.1 Training a simple model

We explore model performance using the MNIST-1D dataset (figure 8.1). This con-
sists of ten classes y ∈ {0, 1, . . . , 9}, representing the digits 0–9. The data are derived
from 1D templates for each of the digits. Each data example x is created by randomly
transforming one of these templates and adding noise. The full training dataset {xi , yi }
consists of I = 4000 training examples, each consisting of Di = 40 dimensions representing
the horizontal offset at 40 positions. The ten classes are drawn uniformly during data
generation, so there are ∼ 400 examples of each class.
We use a network with Di = 40 inputs and Do = 10 outputs which are passed through
a softmax function to produce class probabilities (see section 5.5). The network has two
hidden layers with D = 100 hidden units each. It is trained using stochastic gradient
descent with batch size 100 and learning rate 0.1 for 6000 steps (150 epochs) with a
multiclass cross-entropy loss (equation 5.24). Figure 8.2 shows that the training error
decreases as training proceeds. The training data are classified perfectly after about 4000 steps (see problem 8.1). The training loss also decreases, eventually approaching zero.
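A hedged sketch of this setup is shown below; it uses random placeholder tensors in place of the actual MNIST-1D data (loading the dataset is not covered here), so only the model architecture and training recipe correspond to the description above.

```python
import torch, torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Placeholder tensors standing in for the MNIST-1D training set.
x_train = torch.randn(4000, 40)
y_train = torch.randint(0, 10, (4000,))
loader = DataLoader(TensorDataset(x_train, y_train), batch_size=100, shuffle=True)

model = nn.Sequential(nn.Linear(40, 100), nn.ReLU(),
                      nn.Linear(100, 100), nn.ReLU(),
                      nn.Linear(100, 10))          # logits; softmax is applied inside the loss
criterion = nn.CrossEntropyLoss()                  # multiclass cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(150):                           # 150 epochs = 6000 steps of 100 examples
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = criterion(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()
```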
However, this doesn't imply that the classifier is perfect; the model might have memorized the training set but be unable to predict new examples.
Figure 8.1 MNIST-1D. a) Templates for 10 classes y ∈ {0, . . . , 9}, based on digits
0–9. b) Training examples x are created by randomly transforming a template
and c) adding noise. d) The horizontal offset of the transformed template is then
sampled at 40 vertical positions. Adapted from Greydanus (2020).
To estimate the true
performance, we need a separate test set of input/output pairs {xi , yi }. To this end, we
generate 1000 more examples using the same process. Figure 8.2a also shows the errors
for this test data as a function of the training step. These decrease as training proceeds,
but only to around 40%. This is better than the chance error rate of 90% but far worse
than for the training set; the model has not generalized well to the test data.
The test loss (figure 8.2b) decreases for the first 1500 training steps but then increases again (see notebook 8.1). At this point, the test error rate is fairly constant; the model makes the same mistakes but with increasing confidence. This decreases the probability of the correct answers and thus increases the negative log-likelihood. This increasing confidence is a
side-effect of the softmax function; the pre-softmax activations are driven to increasingly
extreme values to make the probability of the training data approach one (see figure 5.10).
8.2 Sources of error

We now consider the sources of the errors that occur when a model fails to generalize. To
make this easier to visualize, we revert to a 1D least squares regression problem where
we know exactly how the ground truth data were generated. Figure 8.3 shows a quasi-
sinusoidal function; both training and test data are generated by sampling input values
in the range [0, 1], passing them through this function, and adding Gaussian noise with
a fixed variance.
We fit a simplified shallow neural net to this data (figure 8.4). The weights and biases
that connect the input layer to the hidden layer are chosen so that the “joints” of the
function are evenly spaced across the interval. If there are D hidden units, then these
joints will be at 0, 1/D, 2/D, . . . , (D − 1)/D. This model can represent any piecewise
linear function with D equally sized regions in the range [0, 1]. As well as being easy to
understand, this model also has the advantage that it can be fit in closed form without the need for stochastic optimization algorithms (see problems 8.2–8.3). Consequently, we can guarantee to find the global minimum of the loss function during training.
Figure 8.4 Simplified neural network with three hidden units. a) The weights and
biases between the input and hidden layer are fixed (dashed arrows). b–d) They
are chosen so that the hidden unit activations have slope one, and their joints are
equally spaced across the interval, with joints at x = 0, x = 1/3, and x = 2/3,
respectively. Modifying the remaining parameters ϕ = {β, ω1 , ω2 , ω3 } can create
any piecewise linear function over x ∈ [0, 1] with joints at 1/3 and 2/3. e–g)
Three example functions with different values of the parameters ϕ.
Figure 8.5 Sources of test error. a) Noise. Data generation is noisy, so even if the
model exactly replicates the true underlying function (black line), the noise in the
test data (gray points) means that some error will remain (gray region represents
two standard deviations). b) Bias. Even with the best possible parameters, the
three-region model (cyan line) cannot exactly fit the true function (black line).
This bias is another source of error (gray regions represent signed error). c)
Variance. In practice, we have limited noisy training data (orange points). When
we fit the model, we don’t recover the best possible function from panel (b) but
a slightly different function (cyan line) that reflects idiosyncrasies of the training
data. This provides an additional source of error (gray region represents two
standard deviations). Figure 8.6 shows how this region was calculated.
There are three possible sources of error, which are known as noise, bias, and variance
respectively (figure 8.5):
Noise The data generation process includes the addition of noise, so there are multiple
possible valid outputs y for each input x (figure 8.5a). This source of error is insurmount-
able for the test data. Note that it does not necessarily limit the training performance;
we will likely never see the same input x twice during training, so it is still possible to
fit the training data perfectly.
Noise may arise because there is a genuine stochastic element to the data generation
process, because some of the data are mislabeled, or because there are further explanatory
variables that were not observed. In rare cases, noise may be absent; for example,
a network might approximate a function that is deterministic but requires significant
computation to evaluate. However, noise is usually a fundamental limitation on the
possible test performance.
Bias A second potential source of error may occur because the model is not flexible
enough to fit the true function perfectly. For example, the three-region neural network
model cannot exactly describe the quasi-sinusoidal function, even when the parameters
are chosen optimally (figure 8.5b). This is known as bias.
Variance We have limited training examples, and there is no way to distinguish sys-
tematic changes in the underlying function from noise in the underlying data. When
we fit a model, we do not get the closest possible approximation to the true underly-
ing function. Indeed, for different training datasets, the result will be slightly different
each time. This additional source of variability in the fitted function is termed variance
(figure 8.5c). In practice, there might also be additional variance due to the stochastic
learning algorithm, which does not necessarily converge to the same solution each time.
We now make the notions of noise, bias, and variance mathematically precise. Consider
a 1D regression problem where the data generation process has additive noise with vari-
ance σ² (e.g., figure 8.3); we can observe different outputs y for the same input x, so for
each x, there is a distribution Pr(y|x) with expected value (mean) µ[x]:

\[
\mu[x] = \mathbb{E}_y\bigl[y[x]\bigr] = \int y[x]\, Pr(y|x)\, dy, \tag{8.1}
\]

and fixed noise σ² = Ey[(µ[x] − y[x])²]. Here we have used the notation y[x] to specify
that we are considering the output y at a given input position x.
Appendix C.2 Expectation
Now consider a least squares loss between the model prediction f[x, ϕ] at position x
and the observed value y[x] at that position:
\[
\begin{aligned}
L[x] &= \bigl(f[x,\phi] - y[x]\bigr)^2 \\
&= \bigl(\bigl(f[x,\phi] - \mu[x]\bigr) + \bigl(\mu[x] - y[x]\bigr)\bigr)^2 \\
&= \bigl(f[x,\phi] - \mu[x]\bigr)^2 + 2\bigl(f[x,\phi] - \mu[x]\bigr)\bigl(\mu[x] - y[x]\bigr) + \bigl(\mu[x] - y[x]\bigr)^2,
\end{aligned} \tag{8.2}
\]
where we have both added and subtracted the mean µ[x] of the underlying function in
the second line and have expanded out the squared term in the third line.
The underlying function is stochastic, so this loss depends on the particular y[x] we
observe. The expected loss is:

\[
\begin{aligned}
\mathbb{E}_y\bigl[L[x]\bigr] &= \mathbb{E}_y\Bigl[\bigl(f[x,\phi]-\mu[x]\bigr)^2 + 2\bigl(f[x,\phi]-\mu[x]\bigr)\bigl(\mu[x]-y[x]\bigr) + \bigl(\mu[x]-y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi]-\mu[x]\bigr)^2 + 2\bigl(f[x,\phi]-\mu[x]\bigr)\bigl(\mu[x]-\mathbb{E}_y[y[x]]\bigr) + \mathbb{E}_y\Bigl[\bigl(\mu[x]-y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi]-\mu[x]\bigr)^2 + 2\bigl(f[x,\phi]-\mu[x]\bigr)\cdot 0 + \mathbb{E}_y\Bigl[\bigl(\mu[x]-y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi]-\mu[x]\bigr)^2 + \sigma^2,
\end{aligned} \tag{8.3}
\]
where we have made use of the rules for manipulating expectations. In the second line, we
have distributed the expectation operator and removed it from terms with no dependence
on y[x], and in the third line, we note that the second term is zero since Ey[y[x]] = µ[x]
by definition. Finally, in the fourth line, we have substituted in the definition of the
noise σ². We can see that the expected loss has been broken down into two terms; the
first term is the squared deviation between the model and the true function mean, and
the second term is the noise.
Appendix C.2.1 Expectation rules
The first term can be further partitioned into bias and variance. The parameters ϕ of
the model f[x, ϕ] depend on the training dataset D = {xi , yi }, so more properly, we should
write f [x, ϕ[D]]. The training dataset is a random sample from the data generation
process; with a different sample of training data, we would learn different parameter
values. The expected model output fµ[x] with respect to all possible datasets D is hence:

\[
f_\mu[x] = \mathbb{E}_{\mathcal{D}}\Bigl[f\bigl[x, \phi[\mathcal{D}]\bigr]\Bigr]. \tag{8.4}
\]
Returning to the first term of equation 8.3, we add and subtract fµ [x] and expand:
\[
\begin{aligned}
\bigl(f[x,\phi[\mathcal{D}]]-\mu[x]\bigr)^2
&= \bigl(\bigl(f[x,\phi[\mathcal{D}]]-f_\mu[x]\bigr) + \bigl(f_\mu[x]-\mu[x]\bigr)\bigr)^2 \\
&= \bigl(f[x,\phi[\mathcal{D}]]-f_\mu[x]\bigr)^2 + 2\bigl(f[x,\phi[\mathcal{D}]]-f_\mu[x]\bigr)\bigl(f_\mu[x]-\mu[x]\bigr) + \bigl(f_\mu[x]-\mu[x]\bigr)^2.
\end{aligned} \tag{8.5}
\]
We then take the expectation with respect to the training dataset D:

\[
\mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x,\phi[\mathcal{D}]] - \mu[x]\bigr)^2\Bigr] = \mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr)^2\Bigr] + \bigl(f_\mu[x] - \mu[x]\bigr)^2, \tag{8.6}
\]
where we have simplified using similar steps as for equation 8.3. Finally, we substitute
this result into equation 8.3:

\[
\mathbb{E}_{\mathcal{D}}\bigl[\mathbb{E}_y[L[x]]\bigr] =
\underbrace{\mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x,\phi[\mathcal{D}]] - f_\mu[x]\bigr)^2\Bigr]}_{\text{variance}}
+ \underbrace{\bigl(f_\mu[x]-\mu[x]\bigr)^2}_{\text{bias}}
+ \underbrace{\sigma^2}_{\text{noise}}. \tag{8.7}
\]
This equation says that the expected loss after considering the uncertainty in the training
data D and the test data y consists of three additive components. The variance is
uncertainty in the fitted model due to the particular training dataset we sample. The bias
is the systematic deviation of the model from the mean of the function we are modeling.
The noise is the inherent uncertainty in the true mapping from input to output. These
three sources of error will be present for any task. They combine additively for regression
tasks with a least squares loss. However, their interaction can be more complex for other
types of problems.
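The decomposition in equation 8.7 can be estimated numerically by simulating many training sets. In the sketch below, a cubic polynomial fitted by least squares stands in for the toy network (an assumption made purely for brevity).

```python
import numpy as np

def true_fn(x):
    return np.sin(2 * np.pi * x)             # assumed ground-truth function

sigma, n_train, n_datasets = 0.2, 15, 500
rng = np.random.default_rng(1)
x_eval = np.linspace(0, 1, 100)               # positions at which we evaluate the terms

preds = np.zeros((n_datasets, len(x_eval)))
for d in range(n_datasets):                   # sample many training sets D
    x = rng.uniform(0, 1, n_train)
    y = true_fn(x) + rng.normal(0, sigma, n_train)
    coeffs = np.polyfit(x, y, 3)              # "training" the stand-in model
    preds[d] = np.polyval(coeffs, x_eval)

f_mu = preds.mean(axis=0)                     # E_D[ f[x, phi[D]] ]
variance = ((preds - f_mu) ** 2).mean(axis=0)
bias_sq = (f_mu - true_fn(x_eval)) ** 2
noise = sigma ** 2
expected_loss = variance + bias_sq + noise    # equation 8.7, term by term
```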
In the previous section, we saw that test error results from three sources: noise, bias,
and variance. The noise component is insurmountable; there is nothing we can do to
circumvent this, and it represents a fundamental limit on expected model performance.
However, it is possible to reduce the other two terms.
Recall that the variance results from limited noisy training data. Fitting the model
to two different training sets results in slightly different parameters. It follows that we can
reduce the variance by increasing the quantity of training data. This averages out the
inherent noise and ensures that the input space is well sampled.
Figure 8.6 shows the effect of training with 6, 10, and 100 samples. For each dataset
size, we show the best-fitting model for three training datasets. With only six samples,
the fitted function is quite different each time: the variance is significant. As we increase
the number of samples, the fitted models become very similar, and the variance reduces.
In general, adding training data almost always improves test performance.
The bias term results from the inability of the model to describe the true underlying
function. This suggests that we can reduce this error by making the model more flexible.
This is usually done by increasing the model capacity. For neural networks, this means
adding more hidden units and/or hidden layers.
In the simplified model, adding capacity corresponds to adding more hidden units
so that the interval [0, 1] is divided into more linear regions. Figures 8.7a–c show that
(unsurprisingly) this does indeed reduce the bias; as we increase the number of linear
regions to ten, the model becomes flexible enough to fit the true function closely.
However, figures 8.7d–f show an unexpected side-effect of increasing the model capacity.
For a fixed-size training dataset, the variance term typically increases as the model
capacity increases. Consequently, increasing the model capacity does not necessarily
reduce the test error. This is known as the bias-variance trade-off.
Figure 8.8 explores this phenomenon. In panels a–c), we fit the simplified three-region
model to three different datasets of fifteen points. Although the datasets differ, the final
model is much the same; the noise in the dataset roughly averages out in each linear
region. In panels d–f), we fit a model with ten regions to the same three datasets. This
model has more flexibility, but this is disadvantageous; the model certainly fits the data
better, and the training error will be lower, but much of the extra descriptive power is
devoted to modeling the noise. This phenomenon is known as overfitting.
We’ve seen that as we add capacity to the model, the bias decreases, but the variance
increases for a fixed-size training dataset. This suggests that there is an optimal capacity
where the bias is not too large and the variance is still relatively small. Figure 8.9 shows
how these terms vary numerically for the toy model as we increase the capacity, using
the data from figure 8.8. For regression models, the total expected error is the sum of
the bias and the variance, and this sum is minimized when the model capacity is four
(i.e., with four hidden units and four linear regions in the range of the data).
Notebook 8.2 Bias-variance trade-off
Figure 8.6 Reducing variance by increasing training data. a–c) The three-region
model fitted to three different randomly sampled datasets of six points. The
fitted model is quite different each time. d) We repeat this experiment many
times and plot the mean model predictions (cyan line) and the variance of the
model predictions (gray area shows two standard deviations). e–h) We do the
same experiment, but this time with datasets of size ten. The variance of the
predictions is reduced. i–l) We repeat this experiment with datasets of size 100.
Now the fitted model is always similar, and the variance is small.
Figure 8.7 Bias and variance as a function of model capacity. a–c) As we in-
crease the number of hidden units of the toy model, the number of linear regions
increases, and the model becomes able to fit the true function closely; the bias
(gray region) decreases. d–f) Unfortunately, increasing the model capacity has
the side-effect of increasing the variance term (gray region). This is known as the
bias-variance trade-off.
Figure 8.8 Overfitting. a–c) A model with three regions is fit to three different
datasets of fifteen points each. The result is similar in all three cases (i.e., the
variance is low). d–f) A model with ten regions is fit to the same datasets. The
additional flexibility does not necessarily produce better predictions. While these
three models each describe the training data better, they are not necessarily closer
to the true underlying function (black curve). Instead, they overfit the data and
describe the noise, and the variance (difference between fitted curves) is larger.
training labels. Once more, the training error decreases to zero. This time, there is
more randomness, and the model requires almost as many parameters as there are data
points to memorize the data. The test error does show the typical bias-variance trade-off
as we increase the capacity to the point where the model fits the training data exactly.
However, then it does something unexpected; it starts to decrease again. Indeed, if we
add enough capacity, the test loss reduces to below the minimal level that we achieved
in the first part of the curve.
This phenomenon is known as double descent. For some datasets like MNIST, it is
present with the original data (figure 8.10c). For others, like MNIST-1D and CIFAR-100
(figure 8.10d), it emerges or becomes more prominent when we add noise to the labels.
The first part of the curve is referred to as the classical or under-parameterized regime,
and the second part as the modern or over-parameterized regime. The central part where
the error increases is termed the critical regime.
Notebook 8.3 Double descent
8.4.1 Explanation
The discovery of double descent is recent, unexpected, and somewhat puzzling. It results
from an interaction of two phenomena. First, the test performance becomes temporarily
worse when the model has just enough capacity to memorize the data. Second, the test
performance continues to improve with capacity even when this exceeds the point where
the training data are all classified correctly. The first phenomenon is exactly as predicted
by the bias-variance trade-off. The second phenomenon is more confusing; it’s unclear
why performance should be better in the over-parameterized regime, given that there are
now not even enough training data points to constrain the model parameters uniquely.
To understand why performance continues to improve as we add more parameters,
note that once the model has enough capacity to drive the training loss to near zero,
the model fits the training data almost perfectly. This implies that further capacity
cannot help the model fit the training data any better; any change must occur between
the training points. The tendency of a model to prioritize one solution over another as
it extrapolates between data points is known as its inductive bias.
Problems 8.4–8.5
The model’s behavior between data points is critical because, in high-dimensional
space, the training data are extremely sparse. The MNIST-1D dataset has 40 dimensions,
and we trained with 10,000 examples. If this seems like plenty of data, consider what
would happen if we quantized each input dimension into 10 bins. There would be 10^40
bins in total, constrained by only 10^4 examples. Even with this coarse quantization,
there will only be one data point in every 10^36 bins! The tendency of the volume of
high-dimensional space to overwhelm the number of training points is termed the curse
of dimensionality.
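Spelling out the arithmetic:

\[
\frac{10^{40}\ \text{bins}}{10^{4}\ \text{examples}} = 10^{36}\ \text{bins per example},
\]

so even in the best case only one bin in every 10^36 contains a training point.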
The implication is that problems in high dimensions might look more like figure 8.11a;
there are small regions of the input space where we observe data with significant gaps
between them. The putative explanation for double descent is that as we add capacity
to the model, it interpolates between the nearest data points increasingly smoothly. In
the absence of information about what happens between the training points, assuming
smoothness is sensible and will probably generalize reasonably to new data.
Figure 8.10 Double descent. a) Training and test error on MNIST-1D for a
two-hidden layer network as we increase the number of hidden units (and hence
parameters) in each layer. The training error decreases to zero as the number of
parameters approaches the number of training examples (vertical dashed line).
The test error does not show the expected bias-variance trade-off but continues
to decrease even after the model has memorized the dataset. b) The same exper-
iment is repeated with noisier training data. Again, the training error reduces
to zero, although it now takes almost as many parameters as training points to
memorize the dataset. The test error shows the predicted bias/variance trade-off;
it decreases as the capacity increases but then increases again as we near the
point where the training data is exactly memorized. However, it subsequently
decreases again and ultimately reaches a better performance level. This is known
as double descent. Depending on the loss function, the model, and the amount
of noise in the data, the double descent pattern can be seen to a greater or lesser
degree across many datasets. c) Results on MNIST (without label noise) with
shallow neural network from Belkin et al. (2019). d) Results on CIFAR-100 with
ResNet18 network (see chapter 11) from Nakkiran et al. (2021). See original
papers for details.
Figure 8.11 Increasing capacity (hidden units) allows smoother interpolation be-
tween sparse data points. a) Consider this situation where the training data
(orange circles) are sparse; there is a large region in the center with no data ex-
amples to constrain the model to mimic the true function (black curve). b) If we
fit a model with just enough capacity to fit the training data (cyan curve), then it
has to contort itself to pass through the training data, and the output predictions
will not be smooth. c–f) However, as we add more hidden units, the model has
the ability to interpolate between the points more smoothly (smoothest possible
curve plotted in each case). However, unlike in this figure, it is not obliged to.
This argument is plausible. It’s certainly true that as we add more capacity to the
model, it will have the capability to create smoother functions. Figures 8.11b–f show the
smoothest possible functions that still pass through the data points as we increase the
number of hidden units. When the number of parameters is very close to the number
of training data examples (figure 8.11b), the model is forced to contort itself to fit the
training data exactly, resulting in erratic predictions. This explains why the peak in the
double descent curve is so pronounced. As we add more hidden units, the model has the
ability to construct smoother functions that are likely to generalize better to new data.
However, this does not explain why over-parameterized models should produce smooth
functions. Figure 8.12 shows three functions that can be created by the simplified model
with 50 hidden units. In each case, the model fits the data exactly, so the loss is zero. If
the modern regime of double descent is explained by increasing smoothness, then what
exactly is encouraging this smoothness?
Figure 8.12 Regularization. a–c) Each of the three fitted curves passes through
the data points exactly, so the training loss for each is zero. However, we might
expect the smooth curve in panel (a) to generalize much better to new data than
the erratic curves in panels (b) and (c). Any factor that biases a model toward
a subset of the solutions with a similar training loss is known as a regularizer.
It is thought that the initialization and/or fitting of neural networks have an
implicit regularizing effect. Consequently, in the over-parameterized regime, more
reasonable solutions, such as that in panel (a), are encouraged.
The answer to this question is uncertain, but there are two likely possibilities. First,
the network initialization may encourage smoothness, and the model never departs from
the sub-domain of smooth functions during the training process. Second, the training
algorithm may somehow “prefer” to converge to smooth functions. Any factor that
biases a solution toward a subset of equivalent solutions is known as a regularizer, so one
possibility is that the training algorithm acts as an implicit regularizer (see section 9.2).
In the previous section, we discussed how test performance changes with model capac-
ity. Unfortunately, in the classical regime, we don’t have access to either the bias (which
requires knowledge of the true underlying function) or the variance (which requires mul-
tiple independently sampled datasets to estimate). In the modern regime, there is no
way to tell how much capacity should be added before the test error stops improving.
This raises the question of exactly how we should choose model capacity in practice.
For a deep network, the model capacity depends on the numbers of hidden layers
and hidden units per layer as well as other aspects of architecture that we have yet to
introduce. Furthermore, the choice of learning algorithm and any associated parameters
(learning rate, etc.) also affects the test performance. These elements are collectively
termed hyperparameters. The process of finding the best hyperparameters is termed
hyperparameter search or (when focused on network structure) neural architecture search.
Hyperparameters are typically chosen empirically; we train many models with differ-
ent hyperparameters on the same training set, measure their performance, and retain the
best model. However, we do not measure their performance on the test set; this would
admit the possibility that these hyperparameters just happen to work well for the test
set but don’t generalize to further data. Instead, we introduce a third dataset known
as a validation set. For every choice of hyperparameters, we train the associated model
using the training set and evaluate performance on the validation set. Finally, we select
the model that worked best on the validation set and measure its performance on the
test set. In principle, this should give a reasonable estimate of the true performance.
The hyperparameter space is generally smaller than the parameter space but still
too large to try every combination exhaustively. Unfortunately, many hyperparameters
are discrete (e.g., the number of hidden layers), and others may be conditional on one
another (e.g., we only need to specify the number of hidden units in the tenth hidden
layer if there are ten or more layers). Hence, we cannot rely on gradient descent methods
as we did for learning the model parameters. Hyperparameter optimization algorithms
intelligently sample the space of hyperparameters, contingent on previous results. This
procedure is computationally expensive since we must train an entire model and measure
the validation performance for each combination of hyperparameters.
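A minimal sketch of this procedure is random search over hyperparameters scored on a validation set; train_and_evaluate below is a hypothetical stand-in for training a model and measuring its validation error.

```python
import numpy as np

def train_and_evaluate(hp, rng):
    # Hypothetical stand-in for "train a model with hyperparameters hp and
    # return its validation error"; here a noisy synthetic response is used.
    return (np.log10(hp["learning_rate"]) + 2.5) ** 2 + 0.1 * rng.standard_normal()

def random_search(n_trials=20, seed=0):
    rng = np.random.default_rng(seed)
    best_err, best_hp = np.inf, None
    for _ in range(n_trials):
        hp = {                                             # sample one configuration
            "n_layers": int(rng.integers(1, 5)),
            "n_hidden": int(2 ** rng.integers(4, 9)),
            "learning_rate": float(10 ** rng.uniform(-4, -1)),
        }
        err = train_and_evaluate(hp, rng)                  # error on the validation set
        if err < best_err:                                 # retain the best configuration
            best_err, best_hp = err, hp
    return best_hp      # finally, measure the test error once with this configuration

best = random_search()
```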
8.6 Summary
To measure performance, we use a separate test set. The degree to which performance is
maintained on this test set is known as generalization. Test errors can be explained by
three factors: noise, bias, and variance. These combine additively in regression problems
with least squares losses. Adding training data decreases the variance. When the model
capacity is less than the number of training examples, increasing the capacity decreases
bias but increases variance. This is known as the bias-variance trade-off, and there is a
capacity where the trade-off is optimal.
However, this is balanced against a tendency for performance to improve with ca-
pacity, even when the parameters exceed the training examples. Together, these two
phenomena create the double descent curve. It is thought that the model interpolates
more smoothly between the training data points in the over-parameterized “modern
regime,” although it is unclear what drives this. To choose the capacity and other model
and training algorithm hyperparameters, we fit multiple models and evaluate their per-
formance using a separate validation set.
Notes
Bias-variance trade-off: We showed that the test error for regression problems with least
squares loss decomposes into the sum of noise, bias, and variance terms. These factors are
all present for models with other losses, but their interaction is typically more complicated
(Friedman, 1997; Domingos, 2000). For classification problems, there are some counter-intuitive
predictions; for example, if the model is biased toward selecting the wrong class in a region of
the input space, then increasing the variance can improve the classification rate as this pushes
some of the predictions over the threshold to be classified correctly.
Cross-validation: We saw that it is typical to divide the data into three parts: training
data (which is used to learn the model parameters), validation data (which is used to choose
the hyperparameters), and test data (which is used to estimate the final performance). This
approach is known as cross-validation. However, this division may cause problems where the
total number of data examples is limited; if the number of training examples is comparable to
the model capacity, then the variance will be large.
One way to mitigate this problem is to use k-fold cross-validation. The training and validation
data are partitioned into K disjoint subsets. For example, we might divide these data into
five parts. We train with four and validate with the fifth for each of the five permutations
and choose the hyperparameters based on the average validation performance. The final test
performance is assessed using the average of the predictions from the five models with the best
hyperparameters on an entirely different test set. There are many variations of this idea, but
all share the general goal of using a larger proportion of the data to train the model, thereby
reducing variance.
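A minimal sketch of k-fold cross-validation, assuming a user-supplied fit_and_score helper that trains on the training folds and returns the error on the held-out fold:

```python
import numpy as np

def k_fold_validation_error(x, y, fit_and_score, k=5, seed=0):
    # Average validation error over k folds for one hyperparameter setting.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)       # k disjoint subsets
    errors = []
    for i in range(k):
        val_idx = folds[i]
        tr_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(fit_and_score(x[tr_idx], y[tr_idx], x[val_idx], y[val_idx]))
    return float(np.mean(errors))

# Example with a trivial "model" that predicts the training mean.
def mean_predictor_error(x_tr, y_tr, x_val, y_val):
    return float(np.mean((y_val - y_tr.mean()) ** 2))

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(1).standard_normal(50)
print(k_fold_validation_error(x, y, mean_predictor_error))
```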
Capacity: We have used the term capacity informally to mean the number of parameters or
hidden units in the model (and hence indirectly, the ability of the model to fit functions of
increasing complexity). The representational capacity of a model describes the space of possible
functions it can construct when we consider all possible parameter values. When we take into
account the fact that an optimization algorithm may not be able to reach all of these solutions,
what is left is the effective capacity.
The Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1971) is a more formal
measure of capacity. It is the largest number of training examples that a binary classifier can
label arbitrarily. Bartlett et al. (2019) derive upper and lower bounds for the VC dimension in
terms of the number of layers and weights. An alternative measure of capacity is the Rademacher
complexity, which is the expected empirical performance of a classification model (with optimal
parameters) for data with random labels. Neyshabur et al. (2017) derive a lower bound on the
generalization error in terms of the Rademacher complexity.
Double descent: The term “double descent” was coined by Belkin et al. (2019), who demon-
strated that the test error decreases again in the over-parameterized regime for two-layer neural
networks and random features. They also claimed that this occurs in decision trees, although
Buschjäger & Morik (2021) subsequently provided evidence to the contrary. Nakkiran et al.
(2021) show that double descent occurs for various modern datasets (CIFAR-10, CIFAR-100,
IWSLT’14 de-en), architectures (CNNs, ResNets, transformers), and optimizers (SGD, Adam).
The phenomenon is more pronounced when noise is added to the target labels (Nakkiran et al.,
2021) and when some regularization techniques are used (Ishida et al., 2020).
Nakkiran et al. (2021) also provide empirical evidence that test performance depends on effective
model capacity (the largest number of samples for which a given model and training method can
achieve zero training error). At this point, the model starts to devote its efforts to interpolating
smoothly. As such, the test performance depends not just on the model but also on the training
algorithm and length of training. They observe the same pattern when they study a model with
fixed capacity and increase the number of training iterations. They term this epoch-wise double
descent. This phenomenon has been modeled by Pezeshki et al. (2022) in terms of different
features in the model being learned at different speeds.
Double descent makes the rather strange prediction that adding training data can sometimes
worsen test performance. Consider an over-parameterized model in the second descending part
of the curve. If we increase the training data to match the model capacity, we will now be in
the critical region of the new test error curve, and the test loss may increase.
Bubeck & Sellke (2021) prove that overparameterization is necessary to interpolate data smoothly
in high dimensions. They demonstrate a trade-off between the number of parameters and the
Lipschitz constant of a model (the fastest the output can change for a small input change). A
review of the theory of over-parameterized machine learning can be found in Dar et al. (2021).
Appendix B.1.1 Lipschitz constant
Curse of dimensionality: As dimensionality increases, the volume of space grows so fast that
the amount of data needed to densely sample it increases exponentially. This phenomenon is
known as the curse of dimensionality. High-dimensional space has many unexpected properties,
and caution should be used when trying to reason about it based on low-dimensional exam-
ples. This book visualizes many aspects of deep learning in one or two dimensions, but these
visualizations should be treated with healthy skepticism.
Surprising properties of high-dimensional spaces include: (i) Two randomly sampled data points
from a standard normal distribution are very close to orthogonal to one another (relative to
the origin) with high likelihood. (ii) The distance from the origin of samples from a standard
normal distribution is roughly constant. (iii) Most of the volume of a high-dimensional sphere
(hypersphere) is adjacent to its surface (a common metaphor is that most of the volume of a high-
dimensional orange is in the peel, not in the pulp). (iv) If we place a unit-diameter hypersphere
inside a hypercube with unit-length sides, then the hypersphere takes up a decreasing proportion
of the volume of the cube as the dimension increases. Since the volume of the cube is fixed at
size one, this implies that the volume of a high-dimensional hypersphere becomes close to zero.
(v) For random points drawn from a uniform distribution in a high-dimensional hypercube, the
ratio of the Euclidean distance between the nearest and furthest points becomes close to one.
For further information, consult Beyer et al. (1999) and Aggarwal et al. (2001).
Problems 8.6–8.9
Notebook 8.4 High-dimensional spaces
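Properties (i) and (ii) are easy to verify empirically; the sketch below samples from standard normal distributions of increasing dimension (this illustration is not part of the original text).

```python
import numpy as np

rng = np.random.default_rng(0)
for D in [25, 100, 1000]:
    a, b = rng.standard_normal(D), rng.standard_normal(D)
    cos_angle = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    radii = np.linalg.norm(rng.standard_normal((10000, D)), axis=1)
    print(f"D={D}: cos(angle) between two samples = {cos_angle:+.3f}, "
          f"distance from origin = {radii.mean():.1f} +/- {radii.std():.2f}")
# As D grows, the cosine concentrates near zero (near-orthogonality) and the
# distances concentrate around sqrt(D) with a roughly constant spread.
```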
Real-world performance: In this chapter, we argued that model performance could be evalu-
ated using a held-out test set. However, the result won’t be indicative of real-world performance
if the statistics of the test set don’t match those of real-world data. Moreover, the statistics
of real-world data may change over time, causing the model to become increasingly stale and
performance to decrease. This is known as data drift and means that deployed models must be
carefully monitored.
There are three main reasons why real-world performance may be worse than the test perfor-
mance implies. First, the statistics of the input data x may change; we may now be observing
parts of the function that were sparsely sampled or not sampled at all during training. This
is known as covariate shift. Second, the statistics of the output data y may change; if some
output values are infrequent during training, then the model may learn not to predict these in
ambiguous situations and will make mistakes if they are more common in the real world. This
is known as prior shift. Third, the relationship between input and output may change. This is
known as concept shift. These issues are discussed in Moreno-Torres et al. (2012).
Hyperparameter search: A simple approach is to sample the space randomly (Bergstra & Bengio, 2012). However,
for continuous variables, it is better to build a model of performance as a function of the
hyperparameters and the uncertainty in this function. This can be exploited to test where the
uncertainty is great (explore the space) or home in on regions where performance looks promising
(exploit previous knowledge). Bayesian optimization is a framework based on Gaussian processes
that does just this, and its application to hyperparameter search is described in Snoek et al.
(2012). The Beta-Bernoulli bandit (see Lattimore & Szepesvári, 2020) is a roughly equivalent
model for describing uncertainty in results due to discrete variables.
The sequential model-based configuration (SMAC) algorithm (Hutter et al., 2011) can cope with
continuous, discrete, and conditional parameters. The basic approach is to use a random forest
to model the objective function where the mean of the tree predictions is the best guess about
the objective function, and their variance represents the uncertainty. A completely different
approach that can also cope with combinations of continuous, discrete, and conditional param-
eters is Tree-Parzen Estimators (Bergstra et al., 2011). The previous methods modeled the
probability of the model performance given the hyperparameters. In contrast, the Tree-Parzen
estimator models the probability of the hyperparameters given the model performance.
Hyperband (Li et al., 2017b) is a multi-armed bandit strategy for hyperparameter optimization.
It assumes that there are computationally cheap but approximate ways to measure performance
(e.g., by not training to completion) and that these can be associated with a budget (e.g., by
training for a fixed number of iterations). A number of random configurations are sampled and
run until the budget is used up. Then the best fraction η of runs is kept, and the budget is
multiplied by 1/η. This is repeated until the maximum budget is reached. This approach has
the advantage of efficiency; for bad configurations, it does not need to run the experiment to the
end. However, each sample is just chosen randomly, which is inefficient. The BOHB algorithm
(Falkner et al., 2018) combines the efficiency of Hyperband with the more sensible choice of
hyperparameters from Tree Parzen estimators to construct an even better method.
Problems
Problem 8.1 Will the multiclass cross-entropy training loss in figure 8.2 ever reach zero? Explain
your reasoning.
Problem 8.2 What values should we choose for the three weights and biases in the first layer of
the model in figure 8.4a so that the hidden unit’s responses are as depicted in figures 8.4b–d?
Problem 8.3∗ Given a training dataset consisting of I input/output pairs {xi , yi }, show how
the parameters {β, ω1 , ω2 , ω3 } for the model in figure 8.4a can be found in closed form using
the least squares loss function.
Problem 8.4 Consider the curve in figure 8.10b at the point where we train a model with a
hidden layer of size 200, which would have 50,410 parameters. What do you predict will happen
to the training and test performance if we increase the number of training examples from 10,000
to 50,410?
Problem 8.5 Consider the case where the model capacity exceeds the number of training data
points, and the model is flexible enough to reduce the training loss to zero. What are the
implications of this for fitting a heteroscedastic model? Propose a method to resolve any
problems that you identify.
Problem 8.6 Show that two random points drawn from a 1000-dimensional standard Gaussian
distribution are orthogonal relative to the origin with high probability.
Problem 8.7∗ The volume of a hypersphere of radius r in D dimensions is given by:

\[
\mathrm{Vol}[r] = \frac{r^D \pi^{D/2}}{\Gamma[D/2 + 1]}, \tag{8.8}
\]

where Γ[•] is the Gamma function. Show using Stirling's formula that the volume of a hyper-
sphere of diameter one (radius r = 0.5) becomes zero as the dimension increases.
Appendix B.1.3 Gamma function
Appendix B.1.4 Stirling's formula

Problem 8.8∗ Consider a hypersphere of radius r = 1. Find an expression for the proportion
of the total volume that lies in the outermost 1% of the distance from the center (i.e., in the
outermost shell of thickness 0.01). Show that this becomes one as the dimension increases.
Problem 8.9 Figure 8.13c shows the distribution of distances of samples of a standard normal
distribution as the dimension increases. Empirically verify this finding by sampling from the
standard normal distributions in 25, 100, and 500 dimensions and plotting a histogram of the
distances from the center. What closed-form probability distribution describes these distances?
Regularization
Chapter 8 described how to measure model performance and identified that there could
be a significant performance gap between the training and test data. Possible reasons for
this discrepancy include: (i) the model describes statistical peculiarities of the training
data that are not representative of the true mapping from input to output (overfitting),
and (ii) the model is unconstrained in areas with no training examples, leading to sub-
optimal predictions.
This chapter discusses regularization techniques. These are a family of methods that
reduce the generalization gap between training and test performance. Strictly speaking,
regularization involves adding explicit terms to the loss function that favor certain pa-
rameter choices. However, in machine learning, this term is commonly used to refer to
any strategy that improves generalization.
We start by considering regularization in its strictest sense. Then we show how
the stochastic gradient descent algorithm itself favors certain solutions. This is known
as implicit regularization. Following this, we consider a set of heuristic methods that
improve test performance. These include early stopping, ensembling, dropout, label
smoothing, and transfer learning.
Consider fitting a model f[x, ϕ] with parameters ϕ using a training set {xi , yi } of in-
put/output pairs. We seek the parameters ϕ̂ that minimize the loss function L[ϕ]:

\[
\hat{\phi} = \underset{\phi}{\mathrm{argmin}}\; L[\phi]
= \underset{\phi}{\mathrm{argmin}}\left[\sum_{i=1}^{I} \ell_i[x_i, y_i]\right], \tag{9.1}
\]
where the individual terms ℓi [xi , yi ] measure the mismatch between the network pre-
dictions f[xi , ϕ] and output targets yi for each training pair. To bias this minimization
toward certain solutions, we add an extra regularization term:
Figure 9.1 Explicit regularization. a) Loss function for Gabor model (see sec-
tion 6.1.2). Cyan circles represent local minima. Gray circle represents the global
minimum. b) The regularization term favors parameters close to the center of the
plot by adding an increasing penalty as we move away from this point. c) The
final loss function is the sum of the original loss function plus the regularization
term. This surface has fewer local minima, and the global minimum has moved
to a different position (arrow shows change).
\[
\hat{\phi} = \underset{\phi}{\mathrm{argmin}}\left[\sum_{i=1}^{I} \ell_i[x_i, y_i] + \lambda \cdot g[\phi]\right], \tag{9.2}
\]

where g[ϕ] is a function that returns a scalar which takes larger values when the pa-
rameters are less preferred. The term λ is a positive scalar that controls the relative
contribution of the original loss function and the regularization term. The minima of
the regularized loss function usually differ from those in the original, so the training
procedure converges to different parameter values (figure 9.1).
Regularization can be viewed from a probabilistic perspective. Section 5.1 shows how
loss functions are constructed from the maximum likelihood criterion:

\[
\hat{\phi} = \underset{\phi}{\mathrm{argmax}}\left[\prod_{i=1}^{I} Pr(y_i \,|\, x_i, \phi)\right]. \tag{9.3}
\]
The regularization term can be considered as a prior P r(ϕ) that represents knowledge
about the parameters before we observe the data and we now have the maximum a
posteriori or MAP criterion:

\[
\hat{\phi} = \underset{\phi}{\mathrm{argmax}}\left[\prod_{i=1}^{I} Pr(y_i \,|\, x_i, \phi)\, Pr(\phi)\right]. \tag{9.4}
\]
Moving back to the negative log-likelihood loss function by taking the log and multiplying
by minus one, we see that λ · g[ϕ] = − log[P r(ϕ)].
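For example (a standard derivation, not spelled out above), a zero-mean normal prior over the parameters recovers the L2 penalty of the next section:

\[
Pr(\phi) = \prod_j \mathrm{Norm}_{\phi_j}\!\bigl[0, \sigma_\phi^2\bigr]
\;\;\Rightarrow\;\;
-\log\bigl[Pr(\phi)\bigr] = \frac{1}{2\sigma_\phi^2}\sum_j \phi_j^2 + \text{const},
\]

so choosing g[ϕ] = Σ_j ϕ_j² with λ = 1/(2σ_ϕ²) reproduces the same MAP objective.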
9.1.2 L2 regularization
This discussion has sidestepped the question of which solutions the regularization term
should penalize (or equivalently that the prior should favor). Since neural networks are
used in an extremely broad range of applications, these can only be very generic pref-
erences. The most commonly used regularization term is the L2 norm, which penalizes
the sum of the squares of the parameter values:

\[
\hat{\phi} = \underset{\phi}{\mathrm{argmin}}\left[\sum_{i=1}^{I} \ell_i[x_i, y_i] + \lambda \sum_{j} \phi_j^2\right], \tag{9.5}
\]
• If the network is overfitting, then adding the regularization term means that the
network must trade off slavish adherence to the data against the desire to be
smooth. One way to think about this is that the error due to variance reduces (the
model no longer needs to pass through every data point) at the cost of increased
bias (the model can only describe smooth functions).
• When the network is over-parameterized, some of the extra model capacity de-
scribes areas with no training data. Here, the regularization term will favor func-
tions that smoothly interpolate between the nearby points. This is reasonable
behavior in the absence of knowledge about the true function.
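A minimal sketch of explicit L2 regularization in a training loop, assuming PyTorch and placeholder data; the penalty λ Σ_j ϕ_j² of equation 9.5 is simply added to the data loss.

```python
import math
import torch

# Small network and placeholder 1D regression data (illustrative values only).
model = torch.nn.Sequential(torch.nn.Linear(1, 14), torch.nn.ReLU(), torch.nn.Linear(14, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
lam = 1e-3

x = torch.rand(15, 1)
y = torch.sin(2 * math.pi * x) + 0.15 * torch.randn(15, 1)

for step in range(2000):
    optimizer.zero_grad()
    data_loss = torch.mean((model(x) - y) ** 2)
    penalty = sum((p ** 2).sum() for p in model.parameters())  # sum of squared parameters
    loss = data_loss + lam * penalty                           # regularized loss (eq. 9.5)
    loss.backward()
    optimizer.step()
```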
Figure 9.2 L2 regularization in simplified network with 14 hidden units (see fig-
ure 8.4). a–f) Fitted functions as we increase the regularization coefficient λ. The
black curve is the true function, the orange circles are the noisy training data,
and the cyan curve is the fitted model. For small λ (panels a–b), the fitted func-
tion passes exactly through the data points. For intermediate λ (panels c–d), the
function is smoother and more similar to the ground truth. For large λ (panels
e–f), the regularization term overpowers the likelihood term, so the fitted function
is too smooth and the overall fit is worse.
An intriguing recent finding is that neither gradient descent nor stochastic gradient
descent moves neutrally to the minimum of the loss function; each exhibits a preference
for some solutions over others. This is known as implicit regularization.
Consider a continuous version of gradient descent where the step size is infinitesimal.
The change in parameters ϕ will be governed by the differential equation:
\[
\frac{d\phi}{dt} = -\frac{\partial L}{\partial \phi}. \tag{9.6}
\]
Figure 9.3 Implicit regularization in gradient descent. a) Loss function with family
of global minima on horizontal line ϕ1 = 0.61. Dashed blue line shows continuous
gradient descent path starting in bottom-left. Cyan trajectory shows discrete
gradient descent with step size 0.1 (first few steps shown explicitly as arrows).
The finite step size causes the paths to diverge and reach a different final position.
b) This disparity can be approximated by adding a regularization term to the
continuous gradient descent loss function that penalizes the squared gradient
magnitude. c) After adding this term, the continuous gradient descent path
converges to the same place that the discrete one did on the original function.
Gradient descent approximates this process with a series of discrete steps of size α:
\[
\phi_{t+1} = \phi_t - \alpha \frac{\partial L[\phi_t]}{\partial \phi}. \tag{9.7}
\]
The discretization causes a deviation from the continuous path (figure 9.3).
This deviation can be understood by deriving a modified loss term L̃ for the continu-
ous case that arrives at the same place as the discretized version on the original loss L. It
can be shown (see notes “Implicit regularization in gradient descent” at end of chapter)
that this modified loss is:
\[
\tilde{L}_{GD}[\phi] = L[\phi] + \frac{\alpha}{4}\left\lVert \frac{\partial L}{\partial \phi} \right\rVert^2. \tag{9.8}
\]
In other words, the discrete trajectory is repelled from places where the gradient norm
is large (the surface is steep). This doesn’t change the position of the minima where the
gradients are zero anyway. However, it changes the effective loss function elsewhere and
modifies the optimization trajectory, which potentially converges to a different minimum.
Implicit regularization due to gradient descent may be responsible for the observation
that full batch gradient descent generalizes better with larger step sizes (figure 9.5a).
A similar analysis can be applied to stochastic gradient descent. Now we seek a modified
loss function such that the continuous version reaches the same place as the average of
the possible random SGD updates. This can be shown to be:
\[
\begin{aligned}
\tilde{L}_{SGD}[\phi] &= \tilde{L}_{GD}[\phi] + \frac{\alpha}{4B}\sum_{b=1}^{B}\left\lVert \frac{\partial L_b}{\partial \phi} - \frac{\partial L}{\partial \phi} \right\rVert^2 \\
&= L[\phi] + \frac{\alpha}{4}\left\lVert \frac{\partial L}{\partial \phi}\right\rVert^2 + \frac{\alpha}{4B}\sum_{b=1}^{B}\left\lVert \frac{\partial L_b}{\partial \phi} - \frac{\partial L}{\partial \phi} \right\rVert^2. \tag{9.9}
\end{aligned}
\]
Here, Lb is the loss for the bth of the B batches in an epoch, and both L and Lb now
represent the means of the I individual losses in the full dataset and of the |Bb| individual
losses in batch b, respectively:

\[
L = \frac{1}{I}\sum_{i=1}^{I} \ell_i[x_i, y_i]
\qquad \text{and} \qquad
L_b = \frac{1}{|\mathcal{B}_b|}\sum_{i \in \mathcal{B}_b} \ell_i[x_i, y_i]. \tag{9.10}
\]
Equation 9.9 reveals an extra regularization term, which corresponds to the variance
of the gradients of the batch losses Lb . In other words, SGD implicitly favors places
where the gradients are stable (where all the batches agree on the slope). Once more, this
modifies the trajectory of the optimization process (figure 9.4) but does not necessarily
change the position of the global minimum; if the model is over-parameterized, then it
may fit all the training data exactly, so each of these gradient terms will be zero at the
global minimum.
SGD generalizes better than gradient descent, and smaller batch sizes generally per-
form better than larger ones (figure 9.5b). One possible explanation is that the inherent
randomness allows the algorithm to reach different parts of the loss function. However,
it's also possible that some or all of this performance increase is due to implicit regular-
ization; this encourages solutions where all the data fits well (so the batch variance is
small) rather than solutions where some of the data fit extremely well and other data less
well (perhaps with the same overall loss, but with larger batch variance). The former
solutions are likely to generalize better.
Notebook 9.2 Implicit regularization
We’ve seen that explicit regularization encourages the training algorithm to find a good
solution by adding extra terms to the loss function. This also occurs implicitly as an un-
intended (but seemingly helpful) byproduct of stochastic gradient descent. This section
describes other heuristic methods used to improve generalization.
Figure 9.4 Implicit regularization for stochastic gradient descent. a) Original loss
function for Gabor model (section 6.1.2). Blue point represents global minimum.
b) Implicit regularization term from gradient descent penalizes the squared gra-
dient magnitude. c) Additional implicit regularization from stochastic gradient
descent penalizes the variance of the batch gradients. d) Modified loss function
(sum of original loss plus two implicit regularization components). Blue point
represents global minimum which may now be in a different place from panel (a).
Figure 9.5 Effect of learning rate (LR) and batch size for 4000 training and
4000 test examples from MNIST-1D (see figure 8.1) for a neural network with
two hidden layers. a) Performance is better for large learning rates than for
intermediate or small ones. In each case, the number of iterations is 6000/LR, so
each solution has the opportunity to move the same distance. b) Performance is
superior for smaller batch sizes. In each case, the number of iterations was chosen
so that the training data were memorized at roughly the same model capacity.
Early stopping refers to stopping the training procedure before it has fully converged.
This can reduce overfitting if the model has already captured the coarse shape of the
underlying function but has not yet had time to overfit to the noise (figure 9.6). One
way of thinking about this is that since the weights are initialized to small values (see
section 7.5), they simply don’t have time to become large, so early stopping has a similar
effect to explicit L2 regularization. A different view is that early stopping reduces the
effective model complexity. Hence, we move back down the bias/variance trade-off curve
from the critical region, and performance improves (see figures 8.9 and 8.10).
Early stopping has a single hyperparameter, the number of steps after which learning
is terminated. As usual, this is chosen empirically using a validation set (section 8.5).
However, for early stopping, the hyperparameter can be selected without the need to
train multiple models. The model is trained once, the performance on the validation set
is monitored every T iterations, and the associated parameters are stored. The stored
parameters where the validation performance was best are selected.
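A minimal sketch of this procedure, assuming hypothetical train_step and validation_error callables supplied by the user:

```python
import copy
import numpy as np

def train_with_early_stopping(model, train_step, validation_error, n_steps=10000, T=100):
    # train_step(model) performs one SGD update; validation_error(model) returns
    # the current validation error. Both are user-supplied (hypothetical) helpers.
    best_err, best_model = np.inf, copy.deepcopy(model)
    for step in range(n_steps):
        train_step(model)
        if step % T == 0:                        # monitor every T iterations
            err = validation_error(model)
            if err < best_err:                   # store the best parameters seen so far
                best_err, best_model = err, copy.deepcopy(model)
    return best_model
```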
9.3.2 Ensembling
Another approach to reducing the generalization gap between training and test data is
to build several models and average their predictions. A group of such models is known
Figure 9.6 Early stopping. a) Simplified shallow network model with 14 linear
regions (figure 8.4) is initialized randomly (cyan curve) and trained with SGD
using a batch size of five and a learning rate of 0.05. b–d) As training proceeds,
the function first captures the coarse structure of the true function (black curve)
before e–f) overfitting to the noisy training data (orange points). Although the
training loss continues to decrease throughout this process, the learned models in
panels (c) and (d) are closest to the true underlying function. They will generalize
better on average to test data than those in panels (e) or (f).
as an ensemble. This technique reliably improves test performance at the cost of training
and storing multiple models and performing inference multiple times.
The models can be combined by taking the mean of the outputs (for regression
problems) or the mean of the pre-softmax activations (for classification problems). The
assumption is that model errors are independent and will cancel out. Alternatively,
we can take the median of the outputs (for regression problems) or the most frequent
predicted class (for classification problems) to make the predictions more robust.
One way to train different models is just to use different random initializations. This
may help in regions of input space far from the training data. Here, the fitted function
is relatively unconstrained, and different models may produce different predictions, so
the average of several models may generalize better than any single model.
Notebook 9.3 Ensembling
A second approach is to generate several different datasets by re-sampling the train-
ing data with replacement and training a different model from each. This is known as
bootstrap aggregating or bagging for short (figure 9.7). It has the effect of smoothing
out the data; if a data point is not present in one training set, the model will interpo-
Figure 9.7 Ensemble methods. a) Fitting a single model (gray curve) to the
entire dataset (orange points). b–e) Four models created by re-sampling the data
with replacement (bagging) four times (size of orange point indicates number of
times the data point was re-sampled). f) When we average the predictions of this
ensemble, the result (cyan curve) is smoother than the result from panel (a) for
the full dataset (gray curve) and will probably generalize better.
late from nearby points; hence, if that point was an outlier, the fitted function will be
more moderate in this region. Other approaches include training models with different
hyperparameters or training completely different families of models.
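A minimal sketch of bagging; fit_model is a user-supplied helper, and in the usage example a cubic polynomial fit stands in for a neural network.

```python
import numpy as np

def fit_bagged_ensemble(x, y, fit_model, n_models=4, seed=0):
    # fit_model(x, y) is a user-supplied helper that returns a predictor p(x).
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), len(x))        # re-sample with replacement
        models.append(fit_model(x[idx], y[idx]))
    return models

def ensemble_predict(models, x):
    return np.mean([m(x) for m in models], axis=0)   # average (for regression)

# Example: bag a cubic polynomial fit.
fit_cubic = lambda x, y: np.poly1d(np.polyfit(x, y, 3))
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(30)
ensemble = fit_bagged_ensemble(x, y, fit_cubic)
y_hat = ensemble_predict(ensemble, np.linspace(0, 1, 100))
```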
9.3.3 Dropout
Dropout clamps a random subset (typically 50%) of hidden units to zero at each iteration
of SGD (figure 9.8). This makes the network less dependent on any given hidden unit and
encourages the weights to have smaller magnitudes so that the change in the function
due to the presence or absence of any specific hidden unit is reduced.
This technique has the positive benefit that it can eliminate undesirable “kinks” in
the function that are far from the training data and don’t affect the loss. For example,
consider three hidden units that become active sequentially as we move along the curve
(figure 9.9a). The first hidden unit causes a large increase in the slope. A second hidden
unit decreases the slope, so the function goes back down. Finally, the third unit cancels
out this decrease and returns the curve to its original trajectory. These three units
conspire to make an undesirable local change in the function. This will not change the
training loss but is unlikely to generalize well.
When several units conspire in this way, eliminating one (as would happen in dropout)
causes a considerable change to the output function in the half-space where that unit
was active (figure 9.9b). A subsequent gradient descent step will attempt to compensate
for the change that this induces, and such dependencies will be eliminated over time.
The overall effect is that large unnecessary changes between training data points are
gradually removed even though they contribute nothing to the loss (figure 9.9).
At test time, we can run the network as usual with all the hidden units active;
however, the network now has more hidden units than it was trained with at any given
iteration, so we multiply the weights by one minus the dropout probability to compensate.
This is known as the weight scaling inference rule. A different approach to inference is
to use Monte Carlo dropout, in which we run the network multiple times with different
random subsets of units clamped to zero (as in training) and combine the results. This
is closely related to ensembling in that every random version of the network is a different
model; however, we do not have to train or store multiple networks here.
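A minimal sketch of both inference schemes, assuming PyTorch; note that PyTorch implements the equivalent "inverted" form of dropout, which scales activations by 1/(1 − p) during training rather than scaling weights at test time.

```python
import torch

p_drop = 0.5
model = torch.nn.Sequential(
    torch.nn.Linear(40, 100), torch.nn.ReLU(), torch.nn.Dropout(p_drop),
    torch.nn.Linear(100, 100), torch.nn.ReLU(), torch.nn.Dropout(p_drop),
    torch.nn.Linear(100, 10),
)
x = torch.randn(8, 40)                         # placeholder batch of inputs

# Standard inference: eval() disables dropout; no weight rescaling is needed
# because the scaling was already applied during training (inverted dropout).
model.eval()
with torch.no_grad():
    y_standard = model(x)

# Monte Carlo dropout: keep dropout active at test time and average the
# predictions of several random sub-networks.
model.train()
with torch.no_grad():
    y_mc = torch.stack([model(x) for _ in range(20)]).mean(dim=0)
```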
Figure 9.10 Adding noise to inputs. At each step of SGD, random noise with
variance σx2 is added to the batch data. a–c) Fitted model with different noise
levels (small dots represent ten samples). Adding more noise smooths out the
fitted function (cyan line).
the training labels are incorrect and belong with equal probability to the other classes.
This could be done by randomly changing the labels at each training iteration. However,
the same end can be achieved by changing the loss function to minimize the cross-
entropy between the predicted distribution and a distribution where the true label has
probability 1 − ρ, and the other classes have equal probability. This is known as label
smoothing and improves generalization in diverse scenarios.
Problem 9.4
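A minimal sketch of this modified loss, assuming PyTorch; the smoothed target places probability 1 − ρ on the true label and ρ/(K − 1) on each other class.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, rho=0.1):
    # Cross-entropy against a distribution with probability 1 - rho on the true
    # label and rho/(K - 1) on each of the remaining classes.
    K = logits.shape[-1]
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, rho / (K - 1))
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - rho)
    return -(smooth * log_probs).sum(dim=-1).mean()

logits = torch.randn(8, 10)                     # placeholder pre-softmax activations
targets = torch.randint(0, 10, (8,))
loss = label_smoothing_loss(logits, targets, rho=0.1)
```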
The maximum likelihood approach is generally overconfident; it selects the most likely
parameters during training and uses these to make predictions. However, many param-
eter values may be broadly compatible with the data and only slightly less likely. The
Bayesian approach treats the parameters as unknown variables and computes a distri-
bution Pr(ϕ|{xi , yi }) over these parameters ϕ conditioned on the training data {xi , yi }
using Bayes' rule:

\[
Pr(\phi \,|\, \{x_i, y_i\}) = \frac{\prod_{i=1}^{I} Pr(y_i \,|\, x_i, \phi)\, Pr(\phi)}{\int \prod_{i=1}^{I} Pr(y_i \,|\, x_i, \phi)\, Pr(\phi)\, d\phi}, \tag{9.11}
\]

where Pr(ϕ) is the prior probability of the parameters, and the denominator is a nor-
malizing term. Hence, every parameter choice is assigned a probability (figure 9.11).
Appendix C.1.4 Bayes' rule
The prediction y for new input x is an infinite weighted sum (i.e., an integral) of the
predictions for each parameter set, where the weights are the associated probabilities:
\[
Pr(y \,|\, x, \{x_i, y_i\}) = \int Pr(y \,|\, x, \phi)\, Pr(\phi \,|\, \{x_i, y_i\})\, d\phi. \tag{9.12}
\]
This is effectively an infinite weighted ensemble, where the weight depends on (i) the
prior probability of the parameters and (ii) their agreement with the data.
Figure 9.11 Bayesian approach for simplified network model (see figure 8.4). The
parameters are treated as uncertain. The posterior probability P r(ϕ|{xi , yi }) for
a set of parameters is determined by their compatibility with the data {xi , yi }
and a prior distribution P r(ϕ). a–c) Two sets of parameters (cyan and gray
curves) sampled from the posterior using normally distributed priors with mean
zero and three variances. When the prior variance σϕ² is small, the parameters
also tend to be small, and the functions smoother. d–f) Inference proceeds by
taking a weighted sum over all possible parameter values where the weights are
the posterior probabilities. This produces both a prediction of the mean (cyan
curves) and the associated uncertainty (gray region is two standard deviations).
The Bayesian approach is elegant and can provide more robust predictions than
those that derive from maximum likelihood. Unfortunately, for complex models like
neural networks, there is no practical way to represent the full probability distribution
over the parameters or to integrate over it during the inference phase. Consequently, all
current methods of this type make approximations of some kind, and typically these add
considerable complexity to learning and inference.
Notebook 9.4 Bayesian approach
When training data are limited, other datasets can be exploited to improve performance.
In transfer learning (figure 9.12a), the network is pre-trained to perform a related sec-
ondary task for which data are more plentiful. The resulting model is then adapted to
the original task. This is typically done by removing the last layer and adding one or
more layers that produce a suitable output. The main model may be fixed, and the new
layers trained for the original task, or we may fine-tune the entire model.
The principle is that the network will build a good internal representation of the
data from the secondary task, which can subsequently be exploited for the original task.
Equivalently, transfer learning can be viewed as initializing most of the parameters of
the final network in a sensible part of the space that is likely to produce a good solution.
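A minimal sketch of this recipe in PyTorch-style code is shown below (this is not the book's implementation; it assumes the pre-trained backbone exposes its final linear layer as `backbone.fc`, as torchvision classification models typically do):

```python
import torch.nn as nn

def adapt_pretrained(backbone, num_new_classes, fine_tune_all=False):
    if not fine_tune_all:
        # Freeze the pre-trained representation; only the new head will train.
        for param in backbone.parameters():
            param.requires_grad = False
    # Remove the original output layer and add one suited to the new task.
    backbone.fc = nn.Linear(backbone.fc.in_features, num_new_classes)
    return backbone
```

Setting `fine_tune_all=True` corresponds to fine-tuning the entire model rather than training only the new layers.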
Multi-task learning (figure 9.12b) is a related technique in which the network is trained
to solve several problems concurrently. For example, the network might take an image
and simultaneously learn to segment the scene, estimate the pixel-wise depth, and predict
a caption describing the image. All of these tasks require some understanding of the
image and, when learned simultaneously, the model performance for each may improve.
The above discussion assumes that we have plentiful data for a secondary task or data for
multiple tasks to be learned concurrently. If not, we can create large amounts of “free”
labeled data using self-supervised learning and use this for transfer learning. There are
two families of methods for self-supervised learning: generative and contrastive.
In generative self-supervised learning, part of each data example is masked, and the
secondary task is to predict the missing part (figure 9.12c). For example, we might use
a corpus of unlabeled images and a secondary task that aims to inpaint (fill in) missing
parts of the image (figure 9.12c). Similarly, we might use a large corpus of text and mask
some words. We train the network to predict the missing words and then fine-tune it for
the actual language task we are interested in (see chapter 12).
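For instance, a generative self-supervision task for text can be constructed by hiding a random subset of words and asking the network to recover them. The sketch below is illustrative only; the mask token and masking rate are arbitrary choices, not values taken from any particular system:

```python
import random

def make_masked_example(tokens, mask_token="[MASK]", mask_prob=0.15):
    inputs, targets = [], []
    for token in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)   # hide the word from the network
            targets.append(token)       # ...and make it the prediction target
        else:
            inputs.append(token)
            targets.append(None)        # no loss at unmasked positions
    return inputs, targets
```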
In contrastive self-supervised learning, pairs of examples with commonalities are com-
pared to unrelated pairs. For images, the secondary task might be to identify whether a
pair of images are transformed versions of one another or are unconnected. For text, the
secondary task might be to determine whether two sentences followed one another in the
original document. Sometimes, the precise relationship between a connected pair must
be identified (e.g., finding the relative position of two patches from the same image).
9.3.8 Augmentation
Figure 9.13 Data augmentation. For some problems, each data example can be
transformed to augment the dataset. a) Original image. b–h) Various geometric
and photometric transformations of this image. For image classification, all these
images still have the same label, “bird.” Adapted from Wu et al. (2015a).
Generating extra training data in this way is known as data augmentation. The aim
is to teach the model to be indifferent to these irrelevant data transformations.
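A minimal sketch of such label-preserving transformations for an image stored as a NumPy array (values in [0, 1], height × width) is shown below; the particular transformations and their ranges are arbitrary illustrative choices:

```python
import numpy as np

def augment(image, rng):
    # Geometric transformation: mirror the image left-right half the time.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Geometric transformation: random crop with up to 8 pixels of jitter.
    dy, dx = rng.integers(0, 9, size=2)
    image = image[dy:dy + image.shape[0] - 8, dx:dx + image.shape[1] - 8]
    # Photometric transformation: random brightness change.
    return np.clip(image * rng.uniform(0.8, 1.2), 0.0, 1.0)

# Example usage: augmented = augment(image, np.random.default_rng(0))
```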
9.4 Summary
Explicit regularization involves adding an extra term to the loss function that changes
the position of the minimum. The term can be interpreted as a prior probability over
the parameters. Stochastic gradient descent with a finite step size does not neutrally
descend to the minimum of the loss function. This bias can be interpreted as adding
additional terms to the loss function, and this is known as implicit regularization.
There are also many heuristics for improving generalization, including early stopping,
dropout, ensembling, the Bayesian approach, adding noise, transfer learning, multi-task
learning, and data augmentation. There are four main principles behind these methods
(figure 9.14). We can (i) encourage the function to be smoother (e.g., L2 regularization),
(ii) increase the amount of data (e.g., data augmentation), (iii) combine models (e.g.,
ensembling), or (iv) search for wider minima (e.g., applying noise to network weights).
Another way to improve generalization is to choose the model architecture to suit the
task. For example, in image segmentation, we can share parameters within the model,
so we don’t need to independently learn what a tree looks like at every image location.
Chapters 10–13 consider architectural variations designed for different tasks.
Notes
Regularization: L2 regularization penalizes the sum of squares of the network weights. This
encourages the output function to change slowly (i.e., become smoother) and is the most used
regularization term. It is sometimes referred to as Frobenius norm regularization as it penalizes
the Frobenius norms of the weight matrices. It is often also mistakenly referred to as “weight
decay,” although this is a separate technique devised by Hanson & Pratt (1988) in which the
parameters ϕ are updated as:
ϕ ← (1 − λ′) · ϕ − α · ∂L/∂ϕ,    (9.13)
where, as usual, α is the learning rate, and L is the loss. This is identical to gradient descent,
except that the weights are reduced by a factor of 1−λ′ before the gradient update. For standard
SGD, weight decay is equivalent to L2 regularization (equation 9.5) with coefficient λ = λ′ /2α.
Problem 9.5
However, for Adam, the learning rate α is different for each parameter, so L2 regularization
and weight decay differ. Loshchilov & Hutter (2019) present AdamW, which modifies Adam to
implement weight decay correctly and show that this improves performance.
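A one-line sketch of the decoupled update in equation 9.13 (plain NumPy; the function and parameter names are illustrative) makes the distinction concrete: the shrinkage is applied directly to the weights rather than being folded into the loss:

```python
def weight_decay_step(phi, grad, alpha=0.01, lam=1e-4):
    # Shrink the weights by (1 - lambda') and then take the usual gradient step.
    # For vanilla SGD this matches L2 regularization with coefficient
    # lambda'/(2*alpha), but for adaptive optimizers such as Adam the two differ.
    return (1.0 - lam) * phi - alpha * grad
```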
Other choices of vector norm encourage sparsity in the weights. The L0 regularization term (Appendix B.3.2: Vector norms) applies a fixed penalty for every non-zero weight. The effect is to "prune" the network. L0 regularization can also be used to encourage group sparsity; this might apply a fixed penalty if
any of the weights contributing to a given hidden unit are non-zero. If they are all zero, we can
remove the unit, decreasing the model size and making inference faster.
Unfortunately, L0 regularization is challenging to implement since the derivative of the regular-
ization term is not smooth, and more sophisticated fitting methods are required (see Louizos
et al., 2018). Somewhere between L2 and L0 regularization is L1 regularization or LASSO
(least absolute shrinkage and selection operator), which imposes a penalty on the absolute val-
ues of the weights. L2 regularization somewhat discourages sparsity in that the derivative of
the squared penalty decreases as the weight becomes smaller, lowering the pressure to make it
smaller still. L1 regularization does not have this disadvantage, as the derivative of the penalty
is constant. This can produce sparser solutions than L2 regularization but is much easier to
Problem 9.6
optimize than L0 regularization. Sometimes both L1 and L2 regularization terms are used,
which is termed an elastic net penalty (Zou & Hastie, 2005).
A different approach to regularization is to modify the gradients of the learning algorithm
without ever explicitly formulating a new loss function (e.g., equation 9.13). This approach has
been used to promote sparsity during backpropagation (Schwarz et al., 2021).
The evidence on the effectiveness of explicit regularization is mixed. Zhang et al. (2017a) showed
that L2 regularization contributes little to generalization. It has been proven that the Lipschitz constant of the network (how fast the function can change as we modify the input; Appendix B.1.1) bounds the generalization error (Bartlett et al., 2017; Neyshabur et al., 2018). However, the Lipschitz constant depends on the product of the spectral norms (Appendix B.3.7) of the weight matrices Ωk, which are only indirectly dependent on the magnitudes of the individual weights. Bartlett et al. (2017), Neyshabur et al. (2018), and Yoshida & Miyato (2017) all add terms that indirectly encourage the spectral norms to be smaller. Gouk et al. (2021) take a different approach and develop an
algorithm that constrains the Lipschitz constant of the network to be below a particular value.
Implicit regularization: A single discrete gradient descent step moves the parameters from ϕ0 to
ϕ1 = ϕ0 + α · g[ϕ0],    (9.14)
where g[ϕ0 ] is the negative of the gradient of the loss function, and α is the step size. As α → 0,
the gradient descent process can be described by a differential equation:
dϕ/dt = g[ϕ].    (9.15)
For typical step sizes α, the discrete and continuous versions converge to different solutions. We
can use backward error analysis to find a correction g1 [ϕ] to the continuous version:
dϕ/dt ≈ g[ϕ] + α · g1[ϕ] + . . . ,    (9.16)
so that it gives the same result as the discrete version.
Consider the first two terms of a Taylor expansion of the modified continuous solution ϕ around
initial position ϕ0 :
ϕ[α] ≈ [ ϕ + α · dϕ/dt + (α²/2) · d²ϕ/dt² ]|_{ϕ=ϕ0}
     ≈ [ ϕ + α(g[ϕ] + α · g1[ϕ]) + (α²/2) · ( (∂g[ϕ]/∂ϕ) · dϕ/dt + α · (∂g1[ϕ]/∂ϕ) · dϕ/dt ) ]|_{ϕ=ϕ0}
     = [ ϕ + α(g[ϕ] + α · g1[ϕ]) + (α²/2) · ( (∂g[ϕ]/∂ϕ) · g[ϕ] + α · (∂g1[ϕ]/∂ϕ) · g[ϕ] ) ]|_{ϕ=ϕ0}
     ≈ [ ϕ + α · g[ϕ] + α² · ( g1[ϕ] + (1/2) · (∂g[ϕ]/∂ϕ) · g[ϕ] ) ]|_{ϕ=ϕ0},    (9.17)
where in the second line, we have introduced the correction term (equation 9.16), and in the
final line, we have removed terms of greater order than α2 .
Note that the first two terms on the right-hand side ϕ0 + αg[ϕ0 ] are the same as the discrete
update (equation 9.14). Hence, to make the continuous and discrete versions arrive at the same
place, the third term on the right-hand side must equal zero, allowing us to solve for g1 [ϕ]:
g1[ϕ] = −(1/2) · (∂g[ϕ]/∂ϕ) · g[ϕ].    (9.18)
During training, the evolution function g[ϕ] is the negative of the gradient of the loss:
dϕ/dt ≈ g[ϕ] + α · g1[ϕ]
      = −∂L/∂ϕ − (α/2) · (∂²L/∂ϕ²) · (∂L/∂ϕ).    (9.19)
This corresponds to performing (continuous) gradient descent on a modified loss:
L_GD[ϕ] = L[ϕ] + (α/4) · ‖∂L/∂ϕ‖²,    (9.20)
because the right-hand side of equation 9.19 is the negative of the derivative of the loss in equation 9.20.
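The modified loss of equation 9.20 can also be optimized explicitly by adding the squared gradient norm as a penalty. A hedged PyTorch sketch (not from the book; `loss_fn` and `phi` are placeholders for a loss routine and a parameter tensor with requires_grad=True) is:

```python
import torch

def implicit_reg_loss(loss_fn, phi, alpha):
    loss = loss_fn(phi)
    # create_graph=True lets us backpropagate through the gradient itself,
    # so the squared-gradient penalty can contribute to the parameter update.
    (grad,) = torch.autograd.grad(loss, phi, create_graph=True)
    return loss + (alpha / 4.0) * grad.pow(2).sum()
```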
This formulation of implicit regularization was developed by Barrett & Dherin (2021) and
extended to stochastic gradient descent by Smith et al. (2021). Smith et al. (2020) and others
have shown that stochastic gradient descent with small or moderate batch sizes outperforms full
batch gradient descent on the test set, and this may in part be due to implicit regularization.
Relatedly, Jastrzębski et al. (2021) and Cohen et al. (2021) both show that using a large learn-
ing rate reduces the tendency of typical optimization trajectories to move to “sharper” parts of
the loss function (i.e., where at least one direction has high curvature). This implicit regular-
ization effect of large learning rates can be approximated by penalizing the trace of the Fisher
Information Matrix, which is closely related to penalizing the gradient norm in equation 9.20
(Jastrzębski et al., 2021).
Early stopping: Bishop (1995) and Sjöberg & Ljung (1995) argued that early stopping limits
the effective solution space that the training procedure can explore; given that the weights are
initialized to small values, this leads to the idea that early stopping helps prevent the weights
from getting too large. Goodfellow et al. (2016) show that under a quadratic approximation
of the loss function with parameters initialized to zero, early stopping is equivalent to L2 reg-
ularization in gradient descent. The effective regularization weight λ is approximately 1/(τ α)
where α is the learning rate, and τ is the early stopping time.
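A typical implementation of early stopping monitors a validation loss and reverts to the best parameters seen so far. The sketch below assumes a PyTorch-style model with state_dict/load_state_dict and hypothetical fit_one_epoch and val_loss helpers:

```python
import copy

def train_with_early_stopping(model, fit_one_epoch, val_loss,
                              patience=5, max_epochs=1000):
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0
    for _ in range(max_epochs):
        fit_one_epoch(model)
        loss = val_loss(model)
        if loss < best_loss:
            best_loss = loss
            best_state = copy.deepcopy(model.state_dict())  # remember best parameters
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                        # stop early
    model.load_state_dict(best_state)
    return model
```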
et al. (2017) showed that SGD finds wider minima as the batch size is reduced. This may be
because of the batch variance term that results from implicit regularization by SGD.
Ishida et al. (2020) use a technique named flooding, in which they intentionally prevent the
training loss from becoming zero. This encourages the solution to perform a random walk over
the loss landscape and drift into a flatter area with better generalization.
Bayesian approaches: For some models, including the simplified neural network model in
figure 9.11, the Bayesian predictive distribution can be computed in closed form (see Bishop,
2006; Prince, 2012). For neural networks, the posterior distribution over the parameters can-
not be represented in closed form and must be approximated. The two main approaches are
variational Bayes (Hinton & van Camp, 1993; MacKay, 1995; Barber & Bishop, 1997; Blundell
et al., 2015), in which the posterior is approximated by a simpler tractable distribution, and
Markov Chain Monte Carlo (MCMC) methods, which approximate the distribution by drawing
a set of samples (Neal, 1995; Welling & Teh, 2011; Chen et al., 2014; Ma et al., 2015; Li et al.,
2016a). The generation of samples can be integrated into SGD, and this is known as stochas-
tic gradient MCMC (see Ma et al., 2015). It has recently been discovered that “cooling” the
posterior distribution over the parameters (making it sharper) improves predictions from these
models (Wenzel et al., 2020a), but this is not currently fully understood (see Noci et al., 2021).
Transfer learning: Transfer learning for visual tasks works extremely well (Sharif Razavian
et al., 2014) and has supported rapid progress in computer vision, including the original AlexNet
results (Krizhevsky et al., 2012). Transfer learning has also impacted natural language process-
ing (NLP), where many models are based on pre-trained features from the BERT model (Devlin
et al., 2019). More information can be found in Zhuang et al. (2020) and Yang et al. (2020b).
Self-supervised learning: Self-supervised learning techniques for images have included in-
painting masked image regions (Pathak et al., 2016), predicting the relative position of patches
in an image (Doersch et al., 2015), re-arranging permuted image tiles back into their original
configuration (Noroozi & Favaro, 2016), colorizing grayscale images (Zhang et al., 2016b), and
transforming rotated images back to their original orientation (Gidaris et al., 2018). In Sim-
CLR (Chen et al., 2020c), a network is learned that maps versions of the same image that
have been photometrically and geometrically transformed to the same representation while re-
pelling versions of different images, with the goal of becoming indifferent to irrelevant image
transformations. Jing & Tian (2020) present a survey of self-supervised learning in images.
Self-supervised learning in NLP can be based on predicting masked words (Devlin et al., 2019),
predicting the next word in a sentence (Radford et al., 2019; Brown et al., 2020), or predicting
whether two sentences follow one another (Devlin et al., 2019). In automatic speech recognition,
the Wav2Vec model (Schneider et al., 2019) aims to distinguish an original audio sample from
one where 10ms of audio has been swapped out from elsewhere in the clip. Self-supervision
has also been applied to graph neural networks (chapter 13). Tasks include recovering masked
features (You et al., 2020) and recovering the adjacency structure of the graph (Kipf & Welling,
2016). Liu et al. (2023a) review self-supervised learning for graph models.
Data augmentation: Data augmentation for images dates back to at least LeCun et al.
(1998) and contributed to the success of AlexNet (Krizhevsky et al., 2012), in which the dataset
was increased by a factor of 2048. Image augmentation approaches include geometric transfor-
mations, changing or manipulating the color space, noise injection, and applying spatial filters.
More elaborate techniques include randomly mixing images (Inoue, 2018; Summers & Dinneen,
2019), randomly erasing parts of the image (Zhong et al., 2020), style transfer (Jackson et al.,
2019), and randomly swapping image patches (Kang et al., 2017). In addition, many studies
have used generative adversarial networks or GANs (see chapter 15) to produce novel but plau-
sible data examples (e.g., Calimeri et al., 2017). In other cases, the data have been augmented
with adversarial examples (Goodfellow et al., 2015a), which are minor perturbations of the
training data that cause the example to be misclassified. A review of data augmentation for
images can be found in Shorten & Khoshgoftaar (2019).
Augmentation methods for acoustic data include pitch shifting, time stretching, dynamic range
compression, and adding random noise (e.g., Abeßer et al., 2017; Salamon & Bello, 2017; Xu
et al., 2015; Lasseck, 2018), as well as mixing data pairs (Zhang et al., 2017c; Yun et al., 2019),
masking features (Park et al., 2019), and using GANs to generate new data (Mun et al., 2017).
Augmentation for speech data includes vocal tract length perturbation (Jaitly & Hinton, 2013;
Kanda et al., 2013), style transfer (Gales, 1998; Ye & Young, 2004), adding noise (Hannun et al.,
2014), and synthesizing speech (Gales et al., 2009).
Augmentation methods for text include adding noise at a character level by switching, deleting,
and inserting letters (Belinkov & Bisk, 2018; Feng et al., 2020), or by generating adversarial
examples (Ebrahimi et al., 2018), using common spelling mistakes (Coulombe, 2018), randomly
swapping or deleting words (Wei & Zou, 2019), using synonyms (Kolomiyets et al., 2011),
altering adjectives (Li et al., 2017c), passivization (Min et al., 2020), using generative models
to create new data (Qiu et al., 2020), and round-trip translation to another language and back
(Aiken & Park, 2010). Augmentation methods for text are reviewed by Bayer et al. (2022).
Problems
Problem 9.1 Consider a model where the prior distribution over the parameters is a normal distribution with mean zero and variance σϕ² so that
Pr(ϕ) = ∏_{j=1}^{J} Norm_{ϕj}[0, σϕ²],    (9.21)
where j indexes the model parameters. We now maximize ∏_{i=1}^{I} Pr(yi|xi, ϕ) · Pr(ϕ). Show that the associated loss function of this model is equivalent to L2 regularization.
Problem 9.2 How do the gradients of the loss function change when L2 regularization (equa-
tion 9.5) is added?
Problem 9.3∗ Consider a linear regression model y = ϕ0 + ϕ1 x with input x, output y, and
parameters ϕ0 and ϕ1 . Assume we have I training examples {xi , yi } and use a least squares
loss. Consider adding Gaussian noise with mean zero and variance σx2 to the inputs xi at each
training iteration. What is the expected gradient update?
Problem 9.4∗ Derive the loss function for multiclass classification when we use label smooth-
ing so that the target probability distribution has 0.9 at the correct class and the remaining
probability mass of 0.1 is divided between the remaining Do − 1 classes.
Problem 9.5 Show that the weight decay parameter update with decay rate λ:
ϕ ← (1 − λ) · ϕ − α · ∂L/∂ϕ,    (9.22)
on the original loss function L[ϕ] is equivalent to a standard gradient update using L2 regularization so that the modified loss function L̃[ϕ] is:
L̃[ϕ] = L[ϕ] + (λ/2α) · Σ_k ϕk²,    (9.23)
where ϕ are the parameters, and α is the learning rate.
Problem 9.6 Consider a model with parameters ϕ = [ϕ0, ϕ1]ᵀ. Draw the L0, L½, and L1 regularization terms in a similar form to figure 9.1b. The LP regularization term is Σ_{d=1}^{D} |ϕd|^P.
Chapter 10
Convolutional networks
Chapters 2–9 introduced the supervised learning pipeline for deep neural networks. How-
ever, these chapters only considered fully connected networks with a single path from
input to output. Chapters 10–13 introduce more specialized network components with
sparser connections, shared weights, and parallel processing paths. This chapter de-
scribes convolutional layers, which are mainly used for processing image data.
Images have three properties that suggest the need for specialized model architec-
ture. First, they are high-dimensional. A typical image for a classification task contains
224×224 RGB values (i.e., 150,528 input dimensions). Hidden layers in fully connected
networks are generally larger than the input size, so even for a shallow network, the
number of weights would exceed 150,528², or 22 billion. This poses obvious practical
problems in terms of the required training data, memory, and computation.
Second, nearby image pixels are statistically related. However, fully connected net-
works have no notion of “nearby” and treat the relationship between every input equally.
If the pixels of the training and test images were randomly permuted in the same way,
the network could still be trained with no practical difference. Third, the interpretation
of an image is stable under geometric transformations. An image of a tree is still an
image of a tree if we shift it leftwards by a few pixels. However, this shift changes every
input to the network. Hence, a fully connected model must learn the patterns of pixels
that signify a tree separately at every position, which is clearly inefficient.
Convolutional layers process each local image region independently, using parameters
shared across the whole image. They use fewer parameters than fully connected layers,
exploit the spatial relationships between nearby pixels, and don’t have to re-learn the
interpretation of the pixels at every position. A network predominantly consisting of
convolutional layers is known as a convolutional neural network or CNN.
Figure 10.1 Invariance and equivariance for translation. a–b) In image classi-
fication, the goal is to categorize both images as “mountain” regardless of the
horizontal shift that has occurred. In other words, we require the network pre-
diction to be invariant to translation. c,e) The goal of semantic segmentation is
to associate a label with each pixel. d,f) When the input image is translated, we
want the output (colored overlay) to translate in the same way. In other words,
we require the output to be equivariant with respect to translation. Panels c–f)
adapted from Bousselham et al. (2021).
A function f[x] of an image x is invariant to a transformation t[x] if:
f[t[x]] = f[x].    (10.1)
In other words, the output of the function f[x] is the same regardless of the transformation t[x]. Networks for image classification should be invariant to geometric trans-
formations of the image (figure 10.1a–b). The network f[x] should identify an image as
containing the same object, even if it has been translated, rotated, flipped, or warped.
A function f[x] of an image x is equivariant or covariant to a transformation t[x] if:
f[t[x]] = t[f[x]].    (10.2)
In other words, f[x] is equivariant to the transformation t[x] if its output changes in
the same way under the transformation as the input. Networks for per-pixel image
segmentation should be equivariant to transformations (figure 10.1c–f); if the image is
translated, rotated, or flipped, the network f[x] should return a segmentation that has
been transformed in the same way.
Figure 10.2 1D convolution with kernel size three. Each output zi is a weighted
sum of the nearest three inputs xi−1 , xi , and xi+1 , where the weights are ω =
[ω1 , ω2 , ω3 ]. a) Output z2 is computed as z2 = ω1 x1 + ω2 x2 + ω3 x3 . b) Output z3
is computed as z3 = ω1 x2 + ω2 x3 + ω3 x4 . c) At position z1 , the kernel extends
beyond the first input x1 . This can be handled by zero-padding, in which we
assume values outside the input are zero. The final output is treated similarly.
d) Alternatively, we could only compute outputs where the kernel fits within the
input range (“valid” convolution); now, the output will be smaller than the input.
Convolutional layers are network layers based on the convolution operation. In 1D, a
convolution transforms an input vector x into an output vector z so that each output zi
is a weighted sum of nearby inputs. The same weights are used at every position and
are collectively called the convolution kernel or filter. The size of the region over which
inputs are combined is termed the kernel size. For a kernel size of three, we have:
zi = ω1 · xi−1 + ω2 · xi + ω3 · xi+1.    (10.3)
Strictly speaking, this is a correlation rather than a convolution, in which the weights would be flipped relative to the input (so we would switch xi−1 with xi+1). Regardless, this (incorrect) definition is the usual convention in machine learning.
Figure 10.3 Stride, kernel size, and dilation. a) With a stride of two, we evaluate
the kernel at every other position, so the first output z1 is computed from a
weighted sum centered at x1 , and b) the second output z2 is computed from a
weighted sum centered at x3 and so on. c) The kernel size can also be changed.
With a kernel size of five, we take a weighted sum of the nearest five inputs. d)
In dilated or atrous convolution (from the French “à trous” – with holes), we
intersperse zeros in the weight vector to allow us to combine information over a
large area using fewer weights.
10.2.2 Padding
Equation 10.3 shows that each output is computed by taking a weighted sum of the
previous, current, and subsequent positions in the input. This begs the question of how
to deal with the first output (where there is no previous input) and the final output
(where there is no subsequent input).
There are two common approaches. The first is to pad the edges of the inputs with
new values and proceed as usual. Zero-padding assumes the input is zero outside its
valid range (figure 10.2c). Other possibilities include treating the input as circular or
reflecting it at the boundaries. The second approach is to discard the output positions
where the kernel exceeds the range of input positions. These valid convolutions have the
advantage of introducing no extra information at the edges of the input. However, they
have the disadvantage that the representation decreases in size.
In the example above, each output was a sum of the nearest three inputs. However,
this is just one of a larger family of convolution operations, the members of which are
distinguished by their stride, kernel size, and dilation rate. When we evaluate the output
at every position, we term this a stride of one. However, it is also possible to shift the
kernel by a stride greater than one. If we have a stride of two, we create roughly half
the number of outputs (figure 10.3a–b).
The kernel size can be increased to integrate over a larger area (figure 10.3c). How-
ever, it typically remains an odd number so that it can be centered around the current
position. Increasing the kernel size has the disadvantage of requiring more weights. This
leads to the idea of dilated or atrous convolutions, in which the kernel values are inter-
spersed with zeros. For example, we can turn a kernel of size five into a dilated kernel of
size three by setting the second and fourth elements to zero. We still integrate informa-
Problems 10.2–10.4
tion from a larger input region but only require three weights to do this (figure 10.3d).
The number of zeros we intersperse between the weights determines the dilation rate.
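The sketch below (plain NumPy, not from the book) gathers these options into a single 1D convolution routine with kernel weights ω, a stride, a dilation rate, and optional zero-padding:

```python
import numpy as np

def conv1d(x, omega, stride=1, dilation=1, zero_pad=True):
    k = len(omega)
    span = (k - 1) * dilation            # distance between first and last kernel tap
    if zero_pad:
        # Treat positions outside the valid input range as zero (figure 10.2c).
        x = np.concatenate([np.zeros(span // 2), x, np.zeros(span // 2)])
    outputs = []
    for i in range(0, len(x) - span, stride):
        window = x[i : i + span + 1 : dilation]   # every dilation-th input in the span
        outputs.append(np.dot(omega, window))
    return np.array(outputs)

# Example: conv1d(np.arange(8.0), omega=[1.0, 2.0, 1.0], stride=2, dilation=1)
```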
A convolutional layer computes its output by convolving the input, adding a bias β, and
passing each result through an activation function a[•]. With kernel size three, stride
one, and dilation rate one, the ith hidden unit hi would be computed as:
hi = a[ β + ω1 · xi−1 + ω2 · xi + ω3 · xi+1 ]
   = a[ β + Σ_{j=1}^{3} ωj · xi+j−2 ],    (10.4)
where the bias β and kernel weights ω1 , ω2 , ω3 are trainable parameters, and (with zero-
padding) we treat the input x as zero when it is out of the valid range. This is a special
case of a fully connected layer that computes the ith hidden unit as:
hi = a[ βi + Σ_{j=1}^{D} ωij · xj ].    (10.5)
If there are D inputs x• and D hidden units h• , this fully connected layer would have D2
weights ω•• and D biases β• . The convolutional layer only uses three weights and one
bias. A fully connected layer can reproduce this exactly if most weights are set to zero
Problem 10.5
and others are constrained to be identical (figure 10.4).
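To make this correspondence explicit, the sketch below (illustrative NumPy, not from the book) builds the fully connected weight matrix that reproduces a kernel-size-three, stride-one convolution with zero-padding, as in figure 10.4d:

```python
import numpy as np

def conv_as_fully_connected(omega, D):
    # Row i computes h_i = omega_1*x_{i-1} + omega_2*x_i + omega_3*x_{i+1};
    # most entries are zero, and the non-zero entries are shared copies of omega.
    W = np.zeros((D, D))
    for i in range(D):
        for j, w in enumerate(omega):
            col = i + j - 1
            if 0 <= col < D:
                W[i, col] = w
    return W

print(conv_as_fully_connected([1.0, 2.0, 3.0], D=6))
```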
10.2.5 Channels
If we only apply a single convolution, information will likely be lost; we are averaging
nearby inputs, and the ReLU activation function clips results that are less than zero.
Hence, it is usual to compute several convolutions in parallel. Each convolution produces
a new set of hidden variables, termed a feature map or channel.
Figure 10.4 Fully connected vs. convolutional layers. a) A fully connected layer
has a weight connecting each input x to each hidden unit h (colored arrows)
and a bias for each hidden unit (not shown). b) Hence, the associated weight
matrix Ω contains 36 weights relating the six inputs to the six hidden units. c) A
convolutional layer with kernel size three computes each hidden unit as the same
weighted sum of the three neighboring inputs (arrows) plus a bias (not shown).
d) The weight matrix is a special case of the fully connected matrix where many
weights are zero and others are repeated (same colors indicate same value, white
indicates zero weight). e) A convolutional layer with kernel size three and stride
two computes a weighted sum at every other position. f) This is also a special
case of a fully connected network with a different sparse weight structure.
Figure 10.5 Channels. Typically, multiple convolutions are applied to the input x
and stored in channels. a) A convolution is applied to create hidden units h1
to h6 , which form the first channel. b) A second convolution operation is applied
to create hidden units h7 to h12 , which form the second channel. The channels
are stored in a 2D array H1 that contains all the hidden units in the first hidden
layer. c) If we add a further convolutional layer, there are now two channels at
each input position. Here, the 1D convolution defines a weighted sum over both
input channels at the three closest positions to create each new output channel.
Figure 10.5a–b illustrates this with two convolution kernels of size three and with
zero-padding. The first kernel computes a weighted sum of the nearest three pixels, adds
a bias, and passes the results through the activation function to produce hidden units h1
to h6 . These comprise the first channel. The second kernel computes a different weighted
sum of the nearest three pixels, adds a different bias, and passes the results through the
activation function to create hidden units h7 to h12 . These comprise the second channel.
In general, the input and the hidden layers all have multiple channels (figure 10.5c). If
the incoming layer has Ci channels and we select a kernel size K per channel, the hidden
Problems 10.6–10.8
units in each output channel are computed as a weighted sum over all Ci channels and K kernel entries using a weight matrix Ω ∈ R^{Ci×K} and one bias. Hence, if there are Co channels in the next layer, then we need Ω ∈ R^{Ci×Co×K} weights and β ∈ R^{Co} biases (Notebook 10.1: 1D convolution).
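The parameter count follows directly; a one-line sketch (the function name is illustrative):

```python
def conv1d_params(C_in, C_out, K):
    # C_in x C_out x K kernel weights plus one bias per output channel.
    return C_in * C_out * K + C_out

print(conv1d_params(C_in=15, C_out=15, K=3))   # 690
```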
We now apply a convolutional network to the MNIST-1D data (see figure 8.1). The
input x is a 40D vector, and the output f is a 10D vector that is passed through a
softmax layer to produce class probabilities. We use a network with three hidden layers
(figure 10.7). The fifteen channels of the first hidden layer H1 are each computed using
a kernel size of three and a stride of two with “valid” padding, giving nineteen spatial
positions. The second hidden layer H2 is also computed using a kernel size of three, a
stride of two, and “valid” padding. The third hidden layer is computed similarly. At this
stage, the representation has four spatial positions and fifteen channels. These values
are reshaped into a vector of size sixty, which is mapped by a fully connected layer to
the ten output activations.
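A PyTorch-style sketch of this architecture (not the book's reference implementation; layer settings follow the description above, and the softmax is assumed to be folded into the loss function) is:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(1, 15, kernel_size=3, stride=2), nn.ReLU(),   # 40 -> 19 positions
    nn.Conv1d(15, 15, kernel_size=3, stride=2), nn.ReLU(),  # 19 -> 9 positions
    nn.Conv1d(15, 15, kernel_size=3, stride=2), nn.ReLU(),  # 9 -> 4 positions
    nn.Flatten(),                                           # 4 x 15 = 60 hidden units
    nn.Linear(60, 10),                                       # ten output activations
)
print(sum(p.numel() for p in model.parameters()))            # 2050 parameters
```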
This network was trained for 100,000 steps using SGD without momentum, a learning
rate of 0.01, and a batch size of 100 on a dataset of 4,000 examples. We compare this to
Problem 10.12
a fully connected network with the same number of layers and hidden units (i.e., three
hidden layers with 285, 135, and 60 hidden units, respectively). The convolutional net-
work has 2,050 parameters, and the fully connected network has 59,065 parameters. By
the logic of figure 10.4, the convolutional network is a special case of the fully connected
Figure 10.6 Receptive fields for network with kernel width of three. a) An input
with eleven dimensions feeds into a hidden layer with three channels and convo-
lution kernel of size three. The pre-activations of the three highlighted hidden
units in the first hidden layer H1 are different weighted sums of the nearest three
inputs, so the receptive field in H1 has size three. b) The pre-activations of the
four highlighted hidden units in layer H2 each take a weighted sum of the three
channels in layer H1 at each of the three nearest positions. Each hidden unit in
layer H1 weights the nearest three input positions. Hence, hidden units in H2
have a receptive field size of five. c) The hidden units in the third layer (kernel
size three, stride two) increases the receptive field size to seven. d) By the time
we add a fourth layer, the receptive field of the hidden units at position three
have a receptive field that covers the entire input.
Figure 10.7 Convolutional network for classifying MNIST-1D data (see figure 8.1).
The MNIST-1D input has dimension Di = 40. The first convolutional layer has
fifteen channels, kernel size three, stride two, and only retains “valid” positions to
make a hidden layer with nineteen positions and fifteen channels. The following
two convolutional layers have the same settings, gradually reducing the repre-
sentation size at each subsequent hidden layer. Finally, a fully connected layer
takes all sixty hidden units from the third hidden layer. It outputs ten activations
that are subsequently passed through a softmax layer to produce the ten class
probabilities.
Figure 10.8 MNIST-1D results. a) The convolutional network from figure 10.7
eventually fits the training data perfectly and has ∼17% test error. b) A fully
connected network with the same number of hidden layers and the number of
hidden units in each learns the training data faster but fails to generalize well with
∼40% test error. The latter model can reproduce the convolutional model but
fails to do so. The convolutional structure restricts the possible mappings to those
that process every position similarly, and this restriction improves performance.
one. The latter has enough flexibility to replicate the former exactly. Figure 10.8 shows both models fit the training data perfectly (Notebook 10.2: Convolution for MNIST-1D). However, the test error for the convolutional network is much less than for the fully connected network.
This discrepancy is probably not due to the difference in the number of parameters;
we know overparameterization usually improves performance (section 8.4.1). The likely
explanation is that the convolutional architecture has a superior inductive bias (i.e.,
interpolates between the training data better) because we have embodied some prior
knowledge in the architecture; we have forced the network to process each position in
the input in the same way. We know that the data were created by starting with a
template that is (among other operations) randomly translated, so this is sensible.
The fully connected network has to learn what each digit template looks like at every
position. In contrast, the convolutional network shares information across positions and
hence learns to identify each category more accurately. Another way of thinking about
this is that when we train the convolutional network, we search through a smaller family
of input/output mappings, all of which are plausible. Alternatively, the convolutional
structure can be considered a regularizer that applies an infinite penalty to most of the
solutions that a fully connected network can describe.
The previous section described convolutional networks for processing 1D data. Such
networks can be applied to financial time series, audio, and text. However, convolutional
networks are more usually applied to 2D image data. The convolutional kernel is now
a 2D object. A 3×3 kernel Ω ∈ R^{3×3} applied to a 2D input comprising elements xij computes a single layer of hidden units hij as:
hij = a[ β + Σ_{m=1}^{3} Σ_{n=1}^{3} ωmn · xi+m−2,j+n−2 ],    (10.6)
where ωmn are the entries of the convolutional kernel. This is simply a weighted sum
over a square 3×3 input region. The kernel is translated both horizontally and vertically
Problem 10.13
across the 2D input (figure 10.9) to create an output at each position.
Often the input is an RGB image, which is treated as a 2D signal with three channels (figure 10.10). Here, a 3×3 kernel would have 3×3×3 weights and be applied to the three input channels at each of the 3×3 positions to create a 2D output that is the same height and width as the input image, assuming zero-padding (Notebook 10.3: 2D convolution). To generate multiple output channels, we repeat this process with different kernel weights and append the results to form a 3D tensor (Problem 10.14; Appendix B.3: Tensors). If the kernel is size K × K, and there are Ci input channels, each output channel is a weighted sum of Ci × K × K quantities plus one bias. It follows that to compute Co output channels, we need Ci × Co × K × K weights and Co biases.
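The sketch below (PyTorch, with illustrative channel counts) checks this for a 3×3 kernel mapping Ci = 3 input channels to Co = 16 output channels while preserving the spatial size via zero-padding:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.zeros(1, 3, 224, 224)                      # one RGB image
print(conv(x).shape)                                 # torch.Size([1, 16, 224, 224])
print(sum(p.numel() for p in conv.parameters()))     # 3*16*3*3 + 16 = 448
```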
Figure 10.9 2D convolutional layer. Each output hij computes a weighted sum of
the 3×3 nearest inputs, adds a bias, and passes the result through an activation
function. a) Here, the output h23 (shaded output) is a weighted sum of the nine
positions from x12 to x34 (shaded inputs). b) Different outputs are computed
by translating the kernel across the image grid in two dimensions. c–d) With
zero-padding, positions beyond the image’s edge are considered to be zero.
10.4.1 Downsampling
There are three main approaches to scaling down a 2D representation. Here, we consider
the most common case of scaling down both dimensions by a factor of two. First, we
can sample every other position. When we use a stride of two, we effectively apply this
Problem 10.15
method simultaneously with the convolution operation (figure 10.11a).
Second, max pooling retains the maximum of the 2×2 input values (figure 10.11b).
This induces some invariance to translation; if the input is shifted by one pixel, many
of these maximum values remain the same. Finally, mean pooling or average pooling
averages the inputs. For all approaches, we apply downsampling separately to each
channel, so the output has half the width and height but the same number of channels.
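A minimal NumPy sketch of 2×2 max pooling applied independently to each channel (not from the book; it assumes a height × width × channels array with even spatial dimensions) is:

```python
import numpy as np

def max_pool_2x2(h):
    H, W, C = h.shape
    # Group the values into non-overlapping 2x2 blocks and keep each block's maximum.
    blocks = h.reshape(H // 2, 2, W // 2, 2, C)
    return blocks.max(axis=(1, 3))
```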
10.4.2 Upsampling
The simplest way to scale up a network layer to double the resolution is to duplicate
all the channels at each spatial position four times (figure 10.12a). A second method
is max unpooling; this is used where we have previously used a max pooling operation
for downsampling, and we distribute the values to the positions they originated from
(figure 10.12b). A third approach uses bilinear interpolation to fill in the missing values
between the points where we have samples (figure 10.12c).
A fourth approach is roughly analogous to downsampling using a stride of two. In that method, there were half as many outputs as inputs, and for kernel size three, each output was a weighted sum of the three closest inputs (figure 10.13a; Notebook 10.4: Downsampling & upsampling). In transposed
convolution, this picture is reversed (figure 10.13c). There are twice as many outputs
as inputs, and each input contributes to three of the outputs. When we consider the
associated weight matrix of this upsampling mechanism (figure 10.13d), we see that it is
the transpose of the matrix for the downsampling mechanism (figure 10.13b).
Sometimes we want to change the number of channels between one hidden layer and the
next without further spatial pooling. This is usually so we can combine the representation
with another parallel computation (see chapter 11). To accomplish this, we apply a
convolution with kernel size one. Each element of the output layer is computed by
taking a weighted sum of all the channels at the same position (figure 10.14). We can
repeat this multiple times with different weights to generate as many output channels as
we need. The associated convolution weights have size 1 × 1 × Ci × Co . Hence, this is
known as 1×1 convolution. Combined with a bias and activation function, it is equivalent
to running the same fully connected network on the input channels at every position.
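A brief PyTorch sketch (illustrative channel counts) shows a 1×1 convolution reducing 256 channels to 64 while leaving the spatial dimensions untouched:

```python
import torch
import torch.nn as nn

change_channels = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1)
h = torch.zeros(1, 256, 14, 14)
print(change_channels(h).shape)   # torch.Size([1, 64, 14, 14])
```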
10.5 Applications
Much of the pioneering work on deep learning in computer vision focused on image
classification using the ImageNet dataset (figure 10.15). This contains 1,281,167 training
images, 50,000 validation images, and 100,000 test images, and every image is labeled as
belonging to one of 1000 possible categories.
Most methods reshape the input images to a standard size; in a typical system,
the input x to the network is a 224×224 RGB image, and the output is a probability
distribution over the 1000 classes. The task is challenging; there are a large number
of classes, and they exhibit considerable variation (figure 10.15). In 2011, before deep networks were applied, the state-of-the-art method had a top-5 error rate of ∼25% (i.e., the correct class was not among its five best suggestions for roughly a quarter of the test images). Five years later, the best deep learning models eclipsed human performance.
In 2012, AlexNet was the first convolutional network to perform well on this task.
It consists of eight hidden layers with ReLU activation functions, of which the first
five are convolutional and the rest fully connected (figure 10.16). The network starts by
Figure 10.14 1×1 convolution. To change the number of channels without spatial
pooling, we apply a 1×1 kernel. Each output channel is computed by taking
a weighted sum of all of the channels at the same position, adding a bias, and
passing through an activation function. Multiple output channels are created by
repeating this operation with different weights and biases.
Figure 10.15 Example ImageNet classification images. The model aims to assign
an input image to one of 1000 classes. This task is challenging because the
images vary widely along different attributes (columns). These include rigidity
(monkey < canoe), number of instances in image (lizard < strawberry), clutter
(compass<steel drum), size (candle<spiderweb), texture (screwdriver<leopard),
distinctiveness of color (mug < red wine), and distinctiveness of shape (headland
< bell). Adapted from Russakovsky et al. (2015).
downsampling the input using an 11×11 kernel with a stride of four to create 96 channels.
It then downsamples again using a max pooling layer before applying a 5×5 kernel to
create 256 channels. There are three more convolutional layers with kernel size 3×3,
Problems 10.16–10.17
eventually resulting in a 13×13 representation with 256 channels. A final max-pooling
layer yields a 6×6 representation with 256 channels which is resized into a vector of
length 9,216 and passed through three fully connected layers containing 4096, 4096, and
1000 hidden units, respectively. The last layer is passed through the softmax function to
output a probability distribution over the 1000 classes. The complete network contains
∼60 million parameters, most of which are in the fully connected layers.
The dataset size was augmented by a factor of 2048 using (i) spatial transformations and (ii) modifications of the input intensities (Notebook 10.5: Convolution for MNIST). At test time, five different cropped and mirrored versions of the image were run through the network, and their predictions
averaged. The system was learned using SGD with a momentum coefficient of 0.9 and a
batch size of 128. Dropout was applied in the fully connected layers, and an L2 (weight
decay) regularizer was used. This system achieved a 16.4% top-5 error rate and a 38.1%
top-1 error rate. At the time, this was an enormous leap forward in performance at a task
considered far beyond the capabilities of contemporary methods. This result revealed
the potential of deep learning and kick-started the modern era of AI research.
The VGG network was also targeted at classification in the ImageNet task and
achieved a considerably better performance of 6.8% top-5 error rate and a 23.7% top-1
error rate. This network is similarly composed of a series of interspersed convolutional
and max pooling layers, where the spatial size of the representation gradually decreases,
but the number of channels increase. These are followed by three fully connected layers
(figure 10.17). The VGG network was also trained using data augmentation, weight
decay, and dropout.
Although there were various minor differences in the training regime, the most impor-
tant change between AlexNet and VGG was the depth of the network. The latter used
Problem 10.18
19 hidden layers and 144 million parameters. The networks in figures 10.16 and 10.17
are depicted at the same scale for comparison. There was a general trend for several
years for performance on this task to improve as the depth of the networks increased,
and this is evidence that depth is important in neural networks.
Figure 10.17 VGG network (Simonyan & Zisserman, 2014) depicted at the same
scale as AlexNet (see figure 10.16). This network consists of a series of convolu-
tional layers and max pooling operations, in which the spatial scale of the rep-
resentation gradually decreases, but the number of channels gradually increases.
The hidden layer after the last convolutional operation is resized to a 1D vector
and three fully connected layers follow. The network outputs 1000 activations
corresponding to the class labels that are passed through a softmax function to
create class probabilities.
In object detection, the goal is to identify and localize multiple objects within the image.
An early method based on convolutional networks was You Only Look Once, or YOLO
for short. The input to the YOLO network is a 448×448 RGB image. This is passed
through 24 convolutional layers that gradually decrease the representation size using
max pooling operations while concurrently increasing the number of channels, similarly
to the VGG network. The final convolutional layer is of size 7 × 7 and has 1024 channels.
This is reshaped to a vector, and a fully connected layer maps it to 4096 values. One
further fully connected layer maps this representation to the output.
The output values encode which class is present at each of a 7×7 grid of locations
(figure 10.18a–b). For each location, the output values also encode a fixed number of
bounding boxes. Five parameters define each box: the x- and y-positions of the center,
the height and width of the box, and the confidence of the prediction (figure 10.18c).
The confidence estimates the overlap between the predicted and ground truth bound-
ing boxes. The system is trained using momentum, weight decay, dropout, and data
augmentation. Transfer learning is employed; the network is initially trained on the
ImageNet classification task and is then fine-tuned for object detection.
After the network is run, a heuristic process is used to remove rectangles with low
confidence and to suppress predicted bounding boxes that correspond to the same object
so only the most confident one is retained.
Figure 10.18 YOLO object detection. a) The input image is reshaped to 448×448
and divided into a regular 7×7 grid. b) The system predicts the most likely class
at each grid cell. c) It also predicts two bounding boxes per cell, and a confidence
value (represented by thickness of line). d) During inference, the most likely
bounding boxes are retained, and boxes with lower confidence values that belong
to the same object are suppressed. Adapted from Redmon et al. (2016).
The goal of semantic segmentation is to assign a label to each pixel according to the object
that it belongs to or no label if that pixel does not correspond to anything in the training
database. An early network for semantic segmentation is depicted in figure 10.19. The
input is a 224×224 RGB image, and the output is a 224×224×21 array that contains
the probability of each of 21 possible classes at each position.
The first part of the network is a smaller version of VGG (figure 10.17) that contains
thirteen rather than sixteen convolutional layers and downsizes the representation to size
14×14. There is then one more max pooling operation, followed by two fully connected
layers that map to two 1D representations of size 4096. These layers do not represent
spatial position but instead, combine information from across the whole image.
Here, the architecture diverges from VGG. Another fully connected layer reconsti-
tutes the representation into 7×7 spatial positions and 512 channels. This is followed
Figure 10.19 Semantic segmentation network of Noh et al. (2015). The input is a
224×224 image, which is passed through a version of the VGG network and even-
tually transformed into a representation of size 4096 using a fully connected layer.
This contains information about the entire image. This is then reformed into a
representation of size 7×7 using another fully connected layer, and the image is
upsampled and deconvolved (transposed convolutions without upsampling) in a
mirror image of the VGG network. The output is a 224×224×21 representation
that gives the output probabilities for the 21 classes at each position.
by a series of max unpooling layers (see figure 10.12b) and deconvolution layers. These
are transposed convolutions (see figure 10.13) but in 2D and without the upsampling.
Finally, there is a 1×1 convolution to create 21 channels representing the possible classes
and a softmax operation at each spatial position to map the activations to class proba-
bilities. The downsampling side of the network is sometimes referred to as an encoder,
and the upsampling side as a decoder, so networks of this type are sometimes called
encoder-decoder networks or hourglass networks due to their shape.
The final segmentation is generated using a heuristic method that greedily searches
for the class that is most represented and infers its region, taking into account the
probabilities but also encouraging connectedness. Then the next most-represented class
is added where it dominates at the remaining unlabeled pixels. This continues until there
is insufficient evidence to add more (figure 10.20).
10.6 Summary
In convolutional layers, each hidden unit is computed by taking a weighted sum of the
nearby inputs, adding a bias, and applying an activation function. The weights and
the bias are the same at every spatial position, so there are far fewer parameters than
in a fully connected network, and the number of parameters doesn’t increase with the
input image size. To ensure that information is not lost, this operation is repeated with
Figure 10.20 Semantic segmentation results. The final result is created from the
21 probability maps by greedily selecting the best class and using a heuristic
method to find a sensible binary map based on the probabilities and their spatial
proximity. If there is enough evidence, subsequent classes are added, and their
segmentation maps are combined. Adapted from Noh et al. (2015).
different weights and biases to create multiple channels at each spatial position.
Typical convolutional networks consist of convolutional layers interspersed with layers
that downsample by a factor of two. As a data example passes through the network, the
spatial dimensions usually decrease by factors of two, and the channels increase by factors
of two. At the end of the network, there are typically one or more fully connected layers
that integrate information from across the entire input and create the desired output. If
the output is an image, a mirrored “decoder” upsamples back to the original size.
The translational equivariance of convolutional layers imposes a useful inductive bias
that increases performance for image-based tasks relative to fully connected networks.
We described image classification, object detection, and semantic segmentation networks.
Image classification performance was shown to improve as the network became deeper.
However, subsequent experiments showed that increasing the network depth indefinitely
doesn’t continue to help; after a certain depth, the system becomes difficult to train.
This is the motivation for residual connections, which are the topic of the next chapter.
Notes
Dumoulin & Visin (2016) present an overview of the mathematics of convolutions that expands
on the brief treatment in this chapter.
Early applications of convolutional networks included handwriting recognition (LeCun et al., 1989a; Martin, 1993), face recognition (Lawrence et al.,
1997), phoneme recognition (Waibel et al., 1989), spoken word recognition (Bottou et al., 1990),
and signature verification (Bromley et al., 1993). However, convolutional networks were popu-
larized by LeCun et al. (1998), who built a system called LeNet for classifying 28×28 grayscale
images of handwritten digits. This is immediately recognizable as a precursor of modern net-
works; it uses a series of convolutional layers, followed by fully connected layers, sigmoid activa-
tions rather than ReLUs, and average pooling rather than max pooling. AlexNet (Krizhevsky
et al., 2012) is widely considered the starting point for modern deep convolutional networks.
ImageNet Challenge: Deng et al. (2009) collated the ImageNet database and the associated
classification challenge drove progress in deep learning for several years after AlexNet. Notable
subsequent winners of this challenge include the network-in-network architecture (Lin et al.,
2014), which alternated convolutions with fully connected layers that operated independently
on all of the channels at each position (i.e., 1×1 convolutions). Zeiler & Fergus (2014) and
Simonyan & Zisserman (2014) trained larger and deeper architectures that were fundamentally
similar to AlexNet. Szegedy et al. (2017) developed an architecture called GoogLeNet, which
introduced inception blocks. These use several parallel paths with different filter sizes, which
are then recombined. This effectively allowed the system to learn the filter size.
The trend was for performance to improve with increasing depth. However, it ultimately became
difficult to train deeper networks without modifications; these include residual connections
and normalization layers, both of which are described in the next chapter. Progress in the
ImageNet challenges is summarized in Russakovsky et al. (2015). A more general survey of
image classification using convolutional networks can be found in Rawat & Wang (2017). The
improvement of image classification networks over time is visualized in figure 10.21.
Downsampling and upsampling: Average pooling dates back to at least LeCun et al. (1989a)
and max pooling to Zhou & Chellappa (1988). Scherer et al. (2010) compared these methods
and concluded that max pooling was superior. The max unpooling method was introduced by
Zeiler et al. (2011) and Zeiler & Fergus (2014). Max pooling can be thought of as applying
an L∞ norm to the hidden units that are to be pooled. This led to applying other Lk norms
Appendix B.3.2
Vector norms
(Springenberg et al., 2015; Sainath et al., 2013), although these require more computation and
are not widely used. Zhang (2019) introduced max-blur-pooling, in which a low-pass filter is
applied before downsampling to prevent aliasing, and showed that this improves generalization
over translation of the inputs and protects against adversarial attacks (see section 20.4.6).
Shi et al. (2016) introduced PixelShuffle, which used convolutional filters with a stride of 1/s
to scale up 1D signals by a factor of s. Only the weights that lie exactly on positions are
used to create the outputs, and the ones that fall between positions are discarded. This can
be implemented by multiplying the number of channels in the kernel by a factor of s, where
the sth output position is computed from just the sth subset of channels. This can be trivially
extended to 2D convolution, which requires s2 channels.
Convolution in 1D and 3D: Convolutional networks are usually applied to images but have
also been applied to 1D data in applications that include speech recognition (Abdel-Hamid
et al., 2012), sentence classification (Zhang et al., 2015; Conneau et al., 2017), electrocardiogram
classification (Kiranyaz et al., 2015), and bearing fault diagnosis (Eren et al., 2019). A survey
of 1D convolutional networks can be found in Kiranyaz et al. (2021). Convolutional networks
have also been applied to 3D data, including video (Ji et al., 2012; Saha et al., 2016; Tran et al.,
2015) and volumetric measurements (Wu et al., 2015b; Maturana & Scherer, 2015).
Invariance and equivariance: Part of the motivation for convolutional layers is that they
are approximately equivariant with respect to translation, and part of the motivation for max
pooling is to induce invariance to small translations. Zhang (2019) considers the degree to
which convolutional networks really have these properties and proposes the max-blur-pooling
modification that demonstrably improves them. There is considerable interest in making net-
works equivariant or invariant to other types of transformations, such as reflections, rotations,
and scaling. Sifre & Mallat (2013) constructed a system based on wavelets that induced both
translational and rotational invariance in image patches and applied this to texture classifica-
tion. Kanazawa et al. (2014) developed locally scale-invariant convolutional neural networks.
Cohen & Welling (2016) exploited group theory to construct group CNNs, which are equivariant
to larger families of transformations, including reflections and rotations. Esteves et al. (2018)
introduced polar transformer networks, which are invariant to translations and equivariant to
rotation and scale. Worrall et al. (2017) developed harmonic networks, the first example of a
group CNN that was equivariant to continuous rotations.
Adaptive kernels: The inception block (Szegedy et al., 2017) applies convolutional filters of
different sizes in parallel and, as such, provides a crude mechanism by which the network can
learn the appropriate filter size. Other work has investigated learning the scale of convolutions
as part of the training process (e.g., Pintea et al., 2021; Romero et al., 2021) or the stride of
downsampling layers (Riad et al., 2022).
In some systems, the kernel size is changed adaptively based on the data. This is sometimes in
the context of guided convolution, where one input is used to help guide the computation from
another input. For example, an RGB image might be used to help upsample a low-resolution
depth map. Jia et al. (2016) directly predicted the filter weights themselves using a different
network branch. Xiong et al. (2020b) change the kernel size adaptively. Su et al. (2019a) moderate the weights of fixed kernels using a function learned from another modality. Dai et al. (2017) learn offsets for the kernel weights so that they need not be applied on a regular grid.
Object detection and semantic segmentation: Object detection methods can be divided
into proposal-based and proposal-free schemes. In the former case, processing occurs in two
stages. A convolutional network ingests the whole image and proposes regions that might
contain objects. These proposal regions are then resized, and a second network analyzes them
to establish whether there is an object there and what it is. An early example of this approach
was R-CNN (Girshick et al., 2014). This was subsequently extended to allow end-to-end training
(Girshick, 2015) and to reduce the cost of the region proposals (Ren et al., 2015). Subsequent
work on feature pyramid networks improved both performance and speed by combining features
across multiple scales (Lin et al., 2017b). In contrast, proposal-free schemes perform all the
processing in a single pass. YOLO (Redmon et al., 2016), which was described in section 10.5.2,
is the most celebrated example of a proposal-free scheme. The most recent iteration of this
framework at the time of writing is YOLOv7 (Wang et al., 2022a). A recent review of object
detection can be found in Zou et al. (2023).
The semantic segmentation network described in section 10.5.3 was developed by Noh et al.
(2015). Many subsequent approaches have been variations of U-Net (Ronneberger et al., 2015),
which is described in section 11.5.3. Recent surveys of semantic segmentation can be found in
Minaee et al. (2021) and Ulku & Akagündüz (2022).
Problems
Problem 10.1∗ Show that the operation in equation 10.3 is equivariant with respect to transla-
tion.
Problem 10.2 Equation 10.3 defines 1D convolution with a kernel size of three, stride of one,
and dilation one. Write out the equivalent equation for the 1D convolution with a kernel size
of three and a stride of two as pictured in figure 10.3a–b.
Problem 10.3 Write out the equation for the 1D dilated convolution with a kernel size of three
and a dilation rate of two, as pictured in figure 10.3d.
Problem 10.4 Write out the equation for a 1D convolution with a kernel size of seven, a dilation rate of three, and a stride of three.
Problem 10.5 Draw weight matrices in the style of figure 10.4d for (i) the strided convolution
in figure 10.3a–b, (ii) the convolution with kernel size 5 in figure 10.3c, and (iii) the dilated
convolution in figure 10.3d.
Problem 10.6∗ Draw a 12×6 weight matrix in the style of figure 10.4d relating inputs x1, . . . , x6 to outputs h1, . . . , h12 in the multi-channel convolution as depicted in figures 10.5a–b.
Problem 10.7∗ Draw a 6×12 weight matrix in the style of figure 10.4d relating inputs h1, . . . , h12 to outputs h′1, . . . , h′6 in the multi-channel convolution in figure 10.5c.
Problem 10.8 Consider a 1D convolutional network where the input has three channels. The
first hidden layer is computed using a kernel size of three and has four channels. The second
hidden layer is computed using a kernel size of five and has ten channels. How many biases and
how many weights are needed for each of these two convolutional layers?
Problem 10.9 A network consists of three 1D convolutional layers. At each layer, a zero-padded
convolution with kernel size three, stride one, and dilation one is applied. What size is the
receptive field of the hidden units in the third layer?
Problem 10.10 A network consists of three 1D convolutional layers. At each layer, a zero-
padded convolution with kernel size seven, stride one, and dilation one is applied. What size is
the receptive field of hidden units in the third layer?
Problem 10.11 Consider a convolutional network with 1D input x. The first hidden layer H1 is
computed using a convolution with kernel size five, stride two, and a dilation rate of one. The
second hidden layer H2 is computed using a convolution with kernel size three, stride one, and
a dilation rate of one. The third hidden layer H3 is computed using a convolution with kernel
size five, stride one, and a dilation rate of two. What are the receptive field sizes at each hidden
layer?
Problem 10.12 The 1D convolutional network in figure 10.7 was trained using stochastic gradient
descent with a learning rate of 0.01 and a batch size of 100 on a training dataset of 4,000 examples
for 100,000 steps. How many epochs was the network trained for?
Problem 10.13 Draw a weight matrix in the style of figure 10.4d that shows the relationship
between the 24 inputs and the 24 outputs in figure 10.9.
Problem 10.14 Consider a 2D convolutional layer with kernel size 5×5 that takes 3 input
channels and returns 10 output channels. How many convolutional weights are there? How
many biases?
Problem 10.15 Draw a weight matrix in the style of figure 10.4d that samples every other
variable in a 1D input (i.e., the 1D analog of figure 10.11a). Show that the weight matrix for
1D convolution with kernel size three and stride two is equivalent to composing the matrices
for 1D convolution with kernel size three and stride one and this sampling matrix.
Problem 10.16∗ Consider the AlexNet network (figure 10.16). How many parameters are used
in each convolutional and fully connected layer? What is the total number of parameters?
Problem 10.17 What is the receptive field size at each of the first three layers of AlexNet (i.e.,
the first three orange blocks in figure 10.16)?
Problem 10.18 How many weights and biases are there at each convolutional layer and fully
connected layer in the VGG architecture (figure 10.17)?
Problem 10.19∗ Consider two hidden layers of size 224×224 with C1 and C2 channels, respec-
tively, connected by a 3×3 convolutional layer. Describe how to initialize the weights using He
initialization.
Chapter 11
Residual networks
The previous chapter described how image classification performance improved as the
depth of convolutional networks was extended from eight layers (AlexNet) to nineteen
layers (VGG). This led to experimentation with even deeper networks. However, per-
formance decreased again when many more layers were added.
This chapter introduces residual blocks. Here, each network layer computes an addi-
tive change to the current representation instead of transforming it directly. This allows
deeper networks to be trained but causes an exponential increase in the activation mag-
nitudes at initialization. Residual blocks employ batch normalization to compensate for
this, which re-centers and rescales the activations at each layer.
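As a preview, here is a minimal sketch of this additive update, using a hypothetical linear-plus-ReLU residual function; it is not the exact block defined later in this chapter.

import numpy as np

def residual_layer(h, weights, biases):
    # The layer output is the input plus a learned change, rather than a
    # direct transformation of the input. The weights must map h back to
    # its own dimension so that the addition is valid.
    change = np.maximum(0.0, weights @ h + biases)
    return h + change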
Residual blocks with batch normalization allow much deeper networks to be trained,
and these networks improve performance across a variety of tasks. Architectures that
combine residual blocks to tackle image classification, medical image segmentation, and
human pose estimation are described.
Every network we have seen so far processes the data sequentially; each layer receives
the previous layer’s output and passes the result to the next (figure 11.1). For example,
a three-layer network is defined by:
h1 = f1 [x, ϕ1 ]
h2 = f2 [h1 , ϕ2 ]
h3 = f3 [h2 , ϕ3 ]
y = f4 [h3 , ϕ4 ], (11.1)
where h1 , h2 , and h3 denote the intermediate hidden layers, x is the network input, y
is the output, and the functions fk [•, ϕk ] perform the processing.
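As a concrete illustration of equation 11.1, the following minimal NumPy sketch chains hypothetical linear-plus-ReLU layer functions; the names and the choice of layer function are assumptions, not code from the book.

import numpy as np

def relu_layer(h, phi):
    # One layer function f_k[., phi_k]: a linear transformation followed by a ReLU.
    weights, biases = phi
    return np.maximum(0.0, weights @ h + biases)

def sequential_network(x, params):
    # Each layer receives the previous layer's output and passes its
    # result to the next, as in equation 11.1.
    h = x
    for phi in params:   # e.g., params = [phi1, phi2, phi3, phi4]
        h = relu_layer(h, phi)
    return h             # the network output y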
In a standard neural network, each layer consists of a linear transformation followed
by an activation function, and the parameters ϕk comprise the weights and biases of the