
from just a subset of the training data. Fourth, it can (in principle) escape local minima.
Fifth, it reduces the chances of getting stuck near saddle points; it is likely that at least
some of the possible batches will have a significant gradient at any point on the loss
function. Finally, there is some evidence that SGD finds parameters for neural networks
that cause them to generalize well to new data in practice (see section 9.2).
SGD does not necessarily “converge” in the traditional sense. However, the hope is
that when we are close to the global minimum, all the data points will be well described
by the model. Consequently, the gradient will be small, whichever batch is chosen, and
the parameters will cease to change much. In practice, SGD is often applied with a
learning rate schedule. The learning rate α starts at a high value and is decreased by a
constant factor every N epochs. The logic is that in the early stages of training, we want
the algorithm to explore the parameter space, jumping from valley to valley to find a
sensible region. In later stages, we are roughly in the right place and are more concerned
with fine-tuning the parameters, so we decrease α to make smaller changes.
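As a concrete illustration, here is a minimal sketch of such a step-decay schedule in Python; the function name and the constants are illustrative, not taken from the book's code:

def step_decay_lr(initial_lr, epoch, decay_factor=0.5, decay_every=10):
    """Return the learning rate after cutting it by a constant factor every `decay_every` epochs."""
    num_decays = epoch // decay_every
    return initial_lr * (decay_factor ** num_decays)

# Example: the learning rate halves every 10 epochs.
for epoch in [0, 10, 20, 30]:
    print(epoch, step_decay_lr(0.1, epoch))   # 0.1, 0.05, 0.025, 0.0125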

6.3 Momentum

A common modification to stochastic gradient descent is to add a momentum term. We
update the parameters with a weighted combination of the gradient computed from the
current batch and the direction moved in the previous step:

$$
\begin{aligned}
\mathbf{m}_{t+1} &\leftarrow \beta \cdot \mathbf{m}_t + (1-\beta)\sum_{i \in \mathcal{B}_t} \frac{\partial \ell_i[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}} \\
\boldsymbol{\phi}_{t+1} &\leftarrow \boldsymbol{\phi}_t - \alpha \cdot \mathbf{m}_{t+1},
\end{aligned}
\tag{6.11}
$$

where mt is the momentum (which drives the update at iteration t), β ∈ [0, 1) controls
the degree to which the gradient is smoothed over time, and α is the learning rate.
The recursive formulation of the momentum calculation means that the gradient step
is an infinite weighted sum of all the previous gradients, where the weights get smaller
as we move back in time (Problem 6.10). The effective learning rate increases if all these gradients
are aligned over multiple iterations but decreases if the gradient direction repeatedly
changes as the terms in the sum cancel out. The overall effect is a smoother trajectory
and reduced oscillatory behavior in valleys (figure 6.7).
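The update in equation 6.11 takes only a few lines of code. The sketch below assumes a helper batch_gradient(phi, batch) that returns the summed gradient of the batch losses with respect to the parameters; the names are illustrative rather than taken from any particular library:

import numpy as np

def sgd_with_momentum(phi, batches, batch_gradient, alpha=0.01, beta=0.9):
    """Sketch of SGD with momentum (equation 6.11); `batch_gradient` is assumed to
    return the gradient of the batch loss with respect to the parameters phi."""
    m = np.zeros_like(phi)                       # momentum starts at zero
    for batch in batches:
        grad = batch_gradient(phi, batch)        # sum of dl_i/dphi over the batch
        m = beta * m + (1 - beta) * grad         # smooth the gradient over time
        phi = phi - alpha * m                    # step along the momentum direction
    return phi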

6.3.1 Nesterov accelerated momentum

The momentum term can be considered a coarse prediction of where the SGD algorithm
will move next (Notebook 6.4: Momentum). Nesterov accelerated momentum (figure 6.8) computes the gradients at
this predicted point rather than at the current point:


Figure 6.7 Stochastic gradient descent with momentum. a) Regular stochastic descent takes a very indirect path toward the minimum. b) With a momentum term, the change at the current step is a weighted combination of the previous change and the gradient computed from the batch. This smooths out the trajectory and increases the speed of convergence.

Figure 6.8 Nesterov accelerated momentum. The solution has traveled along the dashed line to arrive at point 1. A traditional momentum update measures the gradient at point 1, moves some distance in this direction to point 2, and then adds the momentum term from the previous iteration (i.e., in the same direction as the dashed line), arriving at point 3. The Nesterov momentum update first applies the momentum term (moving from point 1 to point 4) and then measures the gradient and applies an update to arrive at point 5.


$$
\begin{aligned}
\mathbf{m}_{t+1} &\leftarrow \beta \cdot \mathbf{m}_t + (1-\beta)\sum_{i \in \mathcal{B}_t} \frac{\partial \ell_i[\boldsymbol{\phi}_t - \alpha\beta \cdot \mathbf{m}_t]}{\partial \boldsymbol{\phi}} \\
\boldsymbol{\phi}_{t+1} &\leftarrow \boldsymbol{\phi}_t - \alpha \cdot \mathbf{m}_{t+1},
\end{aligned}
\tag{6.12}
$$

where now the gradients are evaluated at ϕt − αβ · mt . One way to think about this is
that the gradient term now corrects the path provided by momentum alone.
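Relative to the plain momentum sketch above, the only change is where the gradient is evaluated. A sketch under the same assumptions (hypothetical batch_gradient helper):

import numpy as np

def nesterov_momentum(phi, batches, batch_gradient, alpha=0.01, beta=0.9):
    """Sketch of Nesterov accelerated momentum (equation 6.12)."""
    m = np.zeros_like(phi)
    for batch in batches:
        lookahead = phi - alpha * beta * m             # predicted next position
        grad = batch_gradient(lookahead, batch)        # gradient at the predicted point
        m = beta * m + (1 - beta) * grad
        phi = phi - alpha * m
    return phi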

6.4 Adam

Gradient descent with a fixed step size has the following undesirable property: it makes
large adjustments to parameters associated with large gradients (where perhaps we
should be more cautious) and small adjustments to parameters associated with small
gradients (where perhaps we should explore further). When the gradient of the loss
surface is much steeper in one direction than another, it is difficult to choose a learning
rate that (i) makes good progress in both directions and (ii) is stable (figures 6.9a–b).
A straightforward approach is to normalize the gradients so that we move a fixed
distance (governed by the learning rate) in each direction. To do this, we first measure
the gradient mt+1 and the pointwise squared gradient vt+1 :

$$
\begin{aligned}
\mathbf{m}_{t+1} &\leftarrow \frac{\partial L[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}} \\
\mathbf{v}_{t+1} &\leftarrow \left(\frac{\partial L[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}}\right)^2.
\end{aligned}
\tag{6.13}
$$

Then we apply the update rule:

$$
\boldsymbol{\phi}_{t+1} \leftarrow \boldsymbol{\phi}_t - \alpha \cdot \frac{\mathbf{m}_{t+1}}{\sqrt{\mathbf{v}_{t+1}} + \epsilon},
\tag{6.14}
$$

where the square root and division are both pointwise, α is the learning rate, and ϵ is a
small constant that prevents division by zero when the gradient magnitude is zero. The
term vt+1 is the squared gradient, and the positive root of this is used to normalize the
gradient itself, so all that remains is the sign in each coordinate direction. The result is
that the algorithm moves a fixed distance α along each coordinate, where the direction
is determined by whichever way is downhill (figure 6.9c). This simple algorithm makes
good progress in both directions but will not converge unless it happens to land exactly
at the minimum. Instead, it will bounce back and forth around the minimum.
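A minimal sketch of this normalized update (equations 6.13–6.14), assuming a helper full_gradient(phi) that returns ∂L/∂ϕ:

import numpy as np

def normalized_gradient_step(phi, full_gradient, alpha=0.01, eps=1e-8):
    """Sketch of the normalized update in equations 6.13-6.14; moves roughly a
    fixed distance alpha along each coordinate, keeping only the gradient's sign."""
    m = full_gradient(phi)                 # gradient
    v = m ** 2                             # pointwise squared gradient
    return phi - alpha * m / (np.sqrt(v) + eps)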
Adaptive moment estimation, or Adam, takes this idea and adds momentum to both
the estimate of the gradient and the squared gradient:


Figure 6.9 Adaptive moment estimation (Adam). a) This loss function changes
quickly in the vertical direction but slowly in the horizontal direction. If we run
full-batch gradient descent with a learning rate that makes good progress in the
vertical direction, then the algorithm takes a long time to reach the final hor-
izontal position. b) If the learning rate is chosen so that the algorithm makes
good progress in the horizontal direction, it overshoots in the vertical direction
and becomes unstable. c) A straightforward approach is to move a fixed distance
along each axis at each step so that we move downhill in both directions. This is
accomplished by normalizing the gradient magnitude and retaining only the sign.
However, this does not usually converge to the exact minimum but instead oscil-
lates back and forth around it (here between the last two points). d) The Adam
algorithm uses momentum in both the estimated gradient and the normalization
term, which creates a smoother path.


$$
\begin{aligned}
\mathbf{m}_{t+1} &\leftarrow \beta \cdot \mathbf{m}_t + (1-\beta)\frac{\partial L[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}} \\
\mathbf{v}_{t+1} &\leftarrow \gamma \cdot \mathbf{v}_t + (1-\gamma)\left(\frac{\partial L[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}}\right)^2,
\end{aligned}
\tag{6.15}
$$

where β and γ are the momentum coefficients for the two statistics.
Using momentum is equivalent to taking a weighted average over the history of each
of these statistics. At the start of the procedure, all the previous measurements are
effectively zero, resulting in unrealistically small estimates. Consequently, we modify
these statistics using the rule:

$$
\begin{aligned}
\tilde{\mathbf{m}}_{t+1} &\leftarrow \frac{\mathbf{m}_{t+1}}{1-\beta^{t+1}} \\
\tilde{\mathbf{v}}_{t+1} &\leftarrow \frac{\mathbf{v}_{t+1}}{1-\gamma^{t+1}}.
\end{aligned}
\tag{6.16}
$$

Since β and γ are in the range [0, 1), the terms with exponents t + 1 become smaller
with each time step, the denominators become closer to one, and this modification has
a diminishing effect.
Finally, we update the parameters as before, but with the modified terms:

$$
\boldsymbol{\phi}_{t+1} \leftarrow \boldsymbol{\phi}_t - \alpha \cdot \frac{\tilde{\mathbf{m}}_{t+1}}{\sqrt{\tilde{\mathbf{v}}_{t+1}} + \epsilon}.
\tag{6.17}
$$

The result is an algorithm that can converge to the overall minimum and makes good
progress in every direction in the parameter space (Notebook 6.5: Adam). Note that Adam is usually used in a
stochastic setting where the gradients and their squares are computed from mini-batches:

$$
\begin{aligned}
\mathbf{m}_{t+1} &\leftarrow \beta \cdot \mathbf{m}_t + (1-\beta)\sum_{i \in \mathcal{B}_t}\frac{\partial \ell_i[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}} \\
\mathbf{v}_{t+1} &\leftarrow \gamma \cdot \mathbf{v}_t + (1-\gamma)\left(\sum_{i \in \mathcal{B}_t}\frac{\partial \ell_i[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}}\right)^2,
\end{aligned}
\tag{6.18}
$$

and so the trajectory is noisy in practice.
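Putting equations 6.15–6.17 together, a minimal sketch of the Adam update over a sequence of batches (again assuming a hypothetical batch_gradient helper; in practice, one would use a framework implementation):

import numpy as np

def adam(phi, batches, batch_gradient, alpha=0.001, beta=0.9, gamma=0.999, eps=1e-8):
    """Sketch of Adam (equations 6.15-6.17) applied over a sequence of batches."""
    m = np.zeros_like(phi)                 # smoothed gradient
    v = np.zeros_like(phi)                 # smoothed squared gradient
    for t, batch in enumerate(batches):
        grad = batch_gradient(phi, batch)
        m = beta * m + (1 - beta) * grad
        v = gamma * v + (1 - gamma) * grad ** 2
        m_tilde = m / (1 - beta ** (t + 1))    # correct for initialization at zero
        v_tilde = v / (1 - gamma ** (t + 1))
        phi = phi - alpha * m_tilde / (np.sqrt(v_tilde) + eps)
    return phi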


As we shall see in chapter 7, the gradient magnitudes of neural network parameters
can depend on their depth in the network. Adam helps compensate for this tendency
and balances out changes across the different layers. In practice, Adam also has the
advantage of being less sensitive to the initial learning rate because it avoids situations
like those in figures 6.9a–b, so it doesn’t need complex learning rate schedules.


6.5 Training algorithm hyperparameters


The choices of learning algorithm, batch size, learning rate schedule, and momentum
coefficients are all considered hyperparameters of the training algorithm; these directly
affect the final model performance but are distinct from the model parameters. Choosing
these can be more art than science, and it’s common to train many models with different
hyperparameters and choose the best one. This is known as hyperparameter search. We
return to this issue in chapter 8.

6.6 Summary
This chapter discussed model training. This problem was framed as finding parameters ϕ
that corresponded to the minimum of a loss function L[ϕ]. The gradient descent method
measures the gradient of the loss function for the current parameters (i.e., how the loss
changes when we make a small change to the parameters). Then it moves the parameters
in the direction that decreases the loss fastest. This is repeated until convergence.
For nonlinear functions, the loss function may have both local minima (where gradi-
ent descent gets trapped) and saddle points (where gradient descent may appear to have
converged but has not). Stochastic gradient descent helps mitigate these problems.1 At
each iteration, we use a different random subset of the data (a batch) to compute the
gradient. This adds noise to the process and helps prevent the algorithm from getting
trapped in a sub-optimal region of parameter space. Each iteration is also computation-
ally cheaper since it only uses a subset of the data. We saw that adding a momentum
term makes convergence more efficient. Finally, we introduced the Adam algorithm.
The ideas in this chapter apply to optimizing any model. The next chapter tackles
two aspects of training specific to neural networks. First, we address how to compute
the gradients of the loss with respect to the parameters of a neural network. This is
accomplished using the famous backpropagation algorithm. Second, we discuss how to
initialize the network parameters before optimization begins. Without careful initializa-
tion, the gradients used by the optimization can become extremely large or extremely
small, which can hinder the training process.

Notes

Optimization algorithms: Optimization algorithms are used extensively throughout engineering,
and it is generally more typical to use the term objective function rather than loss
function or cost function. Gradient descent was invented by Cauchy (1847), and stochastic gra-
dient descent dates back to at least Robbins & Monro (1951). A modern compromise between
the two is stochastic variance-reduced descent (Johnson & Zhang, 2013), in which the full gra-
dient is computed periodically, with stochastic updates interspersed. Reviews of optimization
algorithms for neural networks can be found in Ruder (2016), Bottou et al. (2018), and Sun
(2020). Bottou (2012) discusses best practice for SGD, including shuffling without replacement.
¹ Chapter 20 discusses the extent to which saddle points and local minima really are problems in deep learning. In practice, deep networks are surprisingly easy to train.


Convexity, minima, and saddle points: A function is convex if every chord (line segment
between two points on the surface) lies above the function and does not intersect it. This can
be tested algebraically by considering the Hessian matrix (the matrix of second derivatives):

$$
\mathbf{H}[\boldsymbol{\phi}] =
\begin{bmatrix}
\frac{\partial^2 L}{\partial \phi_0^2} & \frac{\partial^2 L}{\partial \phi_0\,\partial \phi_1} & \cdots & \frac{\partial^2 L}{\partial \phi_0\,\partial \phi_N} \\[4pt]
\frac{\partial^2 L}{\partial \phi_1\,\partial \phi_0} & \frac{\partial^2 L}{\partial \phi_1^2} & \cdots & \frac{\partial^2 L}{\partial \phi_1\,\partial \phi_N} \\[4pt]
\vdots & \vdots & \ddots & \vdots \\[4pt]
\frac{\partial^2 L}{\partial \phi_N\,\partial \phi_0} & \frac{\partial^2 L}{\partial \phi_N\,\partial \phi_1} & \cdots & \frac{\partial^2 L}{\partial \phi_N^2}
\end{bmatrix}.
\tag{6.19}
$$

If the Hessian matrix is positive definite (has positive eigenvalues) for all possible parameter values, then the function is convex (Appendix B.3.7: Eigenvalues); the loss function will look like a smooth bowl (as in figure 6.1c), so training will be relatively easy. There will be a single global minimum and no local minima or saddle points.
For any loss function, the eigenvalues of the Hessian matrix at places where the gradient is
zero allow us to classify this position as (i) a minimum (the eigenvalues are all positive), (ii)
a maximum (the eigenvalues are all negative), or (iii) a saddle point (positive eigenvalues are
associated with directions in which we are at a minimum and negative ones with directions
where we are at a maximum).
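This classification is easy to carry out numerically. A short sketch, assuming the Hessian at a point with zero gradient is available as a symmetric NumPy array:

import numpy as np

def classify_stationary_point(hessian):
    """Classify a point with zero gradient from the eigenvalues of the Hessian (sketch)."""
    eigenvalues = np.linalg.eigvalsh(hessian)
    if np.all(eigenvalues > 0):
        return "minimum"
    if np.all(eigenvalues < 0):
        return "maximum"
    return "saddle point"

print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, -1.0]])))  # saddle point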

Line search: Gradient descent with a fixed step size is inefficient because the distance moved
depends entirely on the magnitude of the gradient. It moves a long distance when the function
is changing fast (where perhaps it should be more cautious) but a short distance when the
function is changing slowly (where perhaps it should explore further). For this reason, gradient
descent methods are usually combined with a line search procedure in which we sample the
function along the desired direction to try to find the optimal step size. One such approach
is bracketing (figure 6.10). Another problem with gradient descent is that it tends to lead to
inefficient oscillatory behavior when descending valleys (e.g., path 1 in figure 6.5a).
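A sketch of the bracketing procedure from figure 6.10, searching for the step size along a fixed direction; the choice of interior points at thirds of the interval is an illustrative assumption:

def bracket_line_search(loss_along_direction, a, d, iterations=20):
    """Sketch of line search by bracketing: repeatedly shrink [a, d] by comparing
    the loss at two interior points b and c."""
    for _ in range(iterations):
        b = a + (d - a) / 3.0              # interior points at thirds of the interval
        c = a + 2.0 * (d - a) / 3.0
        if loss_along_direction(b) > loss_along_direction(c):
            a = b                          # minimum cannot lie in [a, b]
        else:
            d = c                          # minimum cannot lie in [c, d]
    return (a + d) / 2.0                   # approximate step size

# Example: minimize a 1D quadratic along the search direction.
print(bracket_line_search(lambda s: (s - 1.2) ** 2, 0.0, 4.0))   # close to 1.2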

Beyond gradient descent: Numerous algorithms have been developed that remedy the prob-
lems of gradient descent. Most notable is the Newton method, which takes the curvature of the
surface into account using the inverse of the Hessian matrix; if the gradient of the function is
changing quickly, then it applies a more cautious update. This method eliminates the need for
line search and does not suffer from oscillatory behavior. However, it has its own problems; in
its simplest form, it moves toward the nearest extremum, but this may be a maximum if we
are closer to the top of a hill than we are to the bottom of a valley. Moreover, computing the
inverse Hessian is intractable when the number of parameters is large, as in neural networks (Problem 6.11).

Properties of SGD: The limit of SGD as the learning rate tends to zero is a stochastic
differential equation. Jastrzębski et al. (2018) showed that this equation relies on the learning-
rate to batch size ratio and that there is a relation between the learning rate to batch size ratio
and the width of the minimum found. Wider minima are considered more desirable; if the loss
function for test data is similar, then small errors in the parameter estimates will have little
effect on test performance. He et al. (2019) prove a generalization bound for SGD that has a
positive correlation with the ratio of batch size to learning rate. They train a large number of
models on different architectures and datasets and find empirical evidence that test accuracy
improves when the ratio of batch size to learning rate is low. Smith et al. (2018) and Goyal et al.
(2018) also identified the ratio of batch size to learning rate as being important for generalization
(see figure 20.10).

Momentum: The idea of using momentum to speed up optimization dates to Polyak (1964).
Goh (2017) presents an in-depth discussion of the properties of momentum. The Nesterov


Figure 6.10 Line search using the bracketing approach. a) The current solution is
at position a (orange point), and we wish to search the region [a, d] (gray shaded
area). We define two points b, c interior to the search region and evaluate the loss
function at these points. Here L[b] > L[c], so we eliminate the range [a, b]. b) We
now repeat this procedure in the refined search region and find that L[b] < L[c],
so we eliminate the range [c, d]. c) We repeat this process until this minimum is
closely bracketed.

accelerated gradient method was introduced by Nesterov (1983). Nesterov momentum was first
applied in the context of stochastic gradient descent by Sutskever et al. (2013).

Adaptive training algorithms: AdaGrad (Duchi et al., 2011) is an optimization algorithm
that addresses the possibility that some parameters may have to move further than others by
assigning a different learning rate to each parameter. AdaGrad uses the cumulative squared
gradient for each parameter to attenuate its learning rate. This has the disadvantage that the
learning rates decrease over time, and learning can halt before the minimum is found. RMSProp
(Hinton et al., 2012a) and AdaDelta (Zeiler, 2012) modified this algorithm to help prevent these
problems by recursively updating the squared gradient term.
By far the most widely used adaptive training algorithm is adaptive moment optimization or
Adam (Kingma & Ba, 2015). This combines the ideas of momentum (in which the gradient
vector is averaged over time) and AdaGrad, AdaDelta, and RMSProp (in which a smoothed
squared gradient term is used to modify the learning rate for each parameter). The original
paper on the Adam algorithm provided a convergence proof for convex loss functions, but a
counterexample was identified by Reddi et al. (2018), who developed a modification of Adam
called AMSGrad, which does converge. Of course, in deep learning, the loss functions are non-
convex, and Zaheer et al. (2018) subsequently developed an adaptive algorithm called YOGI
and proved that it converges in this scenario. Regardless of these theoretical objections, the
original Adam algorithm works well in practice and is widely used, not least because it works
well over a broad range of hyperparameters and makes rapid initial progress.
One potential problem with adaptive training algorithms is that the learning rates are based on
accumulated statistics of the observed gradients. At the start of training, when there are few
samples, these statistics may be very noisy. This can be remedied by learning rate warm-up
(Goyal et al., 2018), in which the learning rates are gradually increased over the first few thou-
sand iterations. An alternative solution is rectified Adam (Liu et al., 2021a), which gradually


changes the momentum term over time in a way that helps avoid high variance. Dozat (2016)
incorporated Nesterov momentum into the Adam algorithm.

SGD vs. Adam: There has been a lively discussion about the relative merits of SGD and
Adam. Wilson et al. (2017) provided evidence that SGD with momentum can find lower minima
than Adam, which generalizes better over a variety of deep learning tasks. However, this is
strange since SGD is a special case of Adam (when β = 0, γ = 1) once the modification
term (equation 6.16) becomes one, which happens quickly. It is hence more likely that SGD
outperforms Adam when we use Adam’s default hyperparameters. Loshchilov & Hutter (2019)
proposed AdamW, which substantially improves the performance of Adam in the presence of
L2 regularization (see section 9.1). Choi et al. (2019) provide evidence that if we search for the
best Adam hyperparameters, it performs just as well as SGD and converges faster. Keskar &
Socher (2017) proposed a method called SWATS that starts using Adam (to make rapid initial
progress) and then switches to SGD (to get better final generalization performance).

Exhaustive search: All the algorithms discussed in this chapter are iterative. A completely
different approach is to quantize the network parameters and exhaustively search the resulting
discretized parameter space using SAT solvers (Mézard & Mora, 2009). This approach has
the potential to find the global minimum and provide a guarantee that there is no lower loss
elsewhere but is only practical for very small models.

Problems
Problem 6.1 Show that the derivatives of the least squares loss function in equation 6.5 are
given by the expressions in equation 6.7.

Problem 6.2 A surface is guaranteed to be convex if the eigenvalues of the Hessian H[ϕ] are
positive everywhere. In this case, the surface has a unique minimum, and optimization is easy.
Find an algebraic expression for the Hessian matrix,

$$
\mathbf{H}[\boldsymbol{\phi}] =
\begin{bmatrix}
\frac{\partial^2 L}{\partial \phi_0^2} & \frac{\partial^2 L}{\partial \phi_0\,\partial \phi_1} \\[4pt]
\frac{\partial^2 L}{\partial \phi_1\,\partial \phi_0} & \frac{\partial^2 L}{\partial \phi_1^2}
\end{bmatrix},
\tag{6.20}
$$

for the linear regression model (equation 6.5). Prove that this function is convex by showing that the eigenvalues are always positive. This can be done by showing that both the trace and the determinant of the matrix are positive (Appendix B.3.7: Eigenvalues; Appendix B.3.8: Trace and Determinant).

Problem 6.3 Compute the derivatives of the least squares loss L[ϕ] with respect to the parameters ϕ0 and ϕ1 for the Gabor model (equation 6.8).
Problem 6.4∗ The logistic regression model uses a linear function to assign an input x to one
of two classes y ∈ {0, 1}. For a 1D input and a 1D output, it has two parameters, ϕ0 and ϕ1 ,
and is defined by:

$$
\Pr(y = 1|x) = \text{sig}[\phi_0 + \phi_1 x],
\tag{6.21}
$$


where sig[•] is the logistic sigmoid function:

$$
\text{sig}[z] = \frac{1}{1 + \exp[-z]}.
\tag{6.22}
$$


Figure 6.11 Three 1D loss functions for problem 6.6.

(i) Plot y against x for this model for different values of ϕ0 and ϕ1 and explain the qualitative
meaning of each parameter. (ii) What is a suitable loss function for this model? (iii) Compute
the derivatives of this loss function with respect to the parameters. (iv) Generate ten data
points from a normal distribution with mean -1 and standard deviation 1 and assign them the
label y = 0. Generate another ten data points from a normal distribution with mean 1 and
standard deviation 1 and assign these the label y = 1. Plot the loss as a heatmap in terms of
the two parameters ϕ0 and ϕ1 . (v) Is this loss function convex? How could you prove this?

Problem 6.5∗ Compute the derivatives of the least squares loss with respect to the ten param-
eters of the simple neural network model introduced in equation 3.1:

$$
\text{f}[x, \boldsymbol{\phi}] = \phi_0 + \phi_1\, a[\theta_{10} + \theta_{11}x] + \phi_2\, a[\theta_{20} + \theta_{21}x] + \phi_3\, a[\theta_{30} + \theta_{31}x].
\tag{6.23}
$$

Think carefully about what the derivative of the ReLU function a[•] will be.

Problem 6.6 Which of the functions in figure 6.11 is convex? Justify your answer. Characterize
each of the points 1–7 as (i) a local minimum, (ii) the global minimum, or (iii) neither.

Problem 6.7∗ The gradient descent trajectory for path 1 in figure 6.5a oscillates back and forth
inefficiently as it moves down the valley toward the minimum. It’s also notable that it turns at
right angles to the previous direction at each step. Provide a qualitative explanation for these
phenomena. Propose a solution that might help prevent this behavior.

Problem 6.8∗ Can (non-stochastic) gradient descent with a fixed learning rate escape local
minima?

Problem 6.9 We run the stochastic gradient descent algorithm for 1,000 iterations on a dataset
of size 100 with a batch size of 20. For how many epochs did we train the model?

Problem 6.10 Show that the momentum term mt (equation 6.11) is an infinite weighted sum
of the gradients at the previous iterations and derive an expression for the coefficients (weights)
of that sum.

Problem 6.11 What dimensions will the Hessian have if the model has one million parameters?



Chapter 7

Gradients and initialization

Chapter 6 introduced iterative optimization algorithms. These are general-purpose methods
for finding the minimum of a function. In the context of neural networks, they find
parameters that minimize the loss so that the model accurately predicts the training
outputs from the inputs. The basic approach is to choose initial parameters randomly
and then make a series of small changes that decrease the loss on average. Each change is
based on the gradient of the loss with respect to the parameters at the current position.
This chapter discusses two issues that are specific to neural networks. First, we
consider how to calculate the gradients efficiently. This is a serious challenge since the
largest models at the time of writing have ∼10¹² parameters, and the gradient needs to
be computed for every parameter at every iteration of the training algorithm. Second,
we consider how to initialize the parameters. If this is not done carefully, the initial
losses and their gradients can be extremely large or small. In either case, this impedes
the training process.

7.1 Problem definitions

Consider a network f[x, ϕ] with multivariate input x, parameters ϕ, and three hidden
layers h1 , h2 , and h3 :

$$
\begin{aligned}
\mathbf{h}_1 &= a[\boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0\mathbf{x}] \\
\mathbf{h}_2 &= a[\boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1\mathbf{h}_1] \\
\mathbf{h}_3 &= a[\boldsymbol{\beta}_2 + \boldsymbol{\Omega}_2\mathbf{h}_2] \\
\mathbf{f}[\mathbf{x}, \boldsymbol{\phi}] &= \boldsymbol{\beta}_3 + \boldsymbol{\Omega}_3\mathbf{h}_3,
\end{aligned}
\tag{7.1}
$$

where the function a[•] applies the activation function separately to every element of the
input. The model parameters ϕ = {β 0 , Ω0 , β 1 , Ω1 , β 2 , Ω2 , β 3 , Ω3 } consist of the bias
vectors β k and weight matrices Ωk between every layer (figure 7.1).


We also have individual loss terms ℓi , which return the negative log-likelihood of
the ground truth label yi given the model prediction f[xi , ϕ] for training input xi . For
example, this might be the least squares loss ℓi = (f[xi , ϕ] − yi )2 . The total loss is the
sum of these terms over the training data:

$$
L[\boldsymbol{\phi}] = \sum_{i=1}^{I} \ell_i.
\tag{7.2}
$$

The most commonly used optimization algorithm for training neural networks is
stochastic gradient descent (SGD), which updates the parameters as:
$$
\boldsymbol{\phi}_{t+1} \longleftarrow \boldsymbol{\phi}_t - \alpha\sum_{i\in\mathcal{B}_t}\frac{\partial\ell_i[\boldsymbol{\phi}_t]}{\partial\boldsymbol{\phi}},
\tag{7.3}
$$

where α is the learning rate, and Bt contains the batch indices at iteration t. To compute
this update, we need to calculate the derivatives:

$$
\frac{\partial\ell_i}{\partial\boldsymbol{\beta}_k} \quad\text{and}\quad \frac{\partial\ell_i}{\partial\boldsymbol{\Omega}_k},
\tag{7.4}
$$
for the parameters {β k , Ωk } at every layer k ∈ {0, 1, . . . , K} and for each index i in
the batch (Problem 7.1). The first part of this chapter describes the backpropagation algorithm, which
computes these derivatives efficiently.
computes these derivatives efficiently.
In the second part of the chapter, we consider how to initialize the network parameters
before we commence training. We describe methods to choose the initial weights Ωk and
biases β k so that training is stable.

7.2 Computing derivatives

The derivatives of the loss tell us how the loss changes when we make a small change
to the parameters. Optimization algorithms exploit this information to manipulate the
parameters so that the loss becomes smaller. The backpropagation algorithm computes
these derivatives. The mathematical details are somewhat involved, so we first make two
observations that provide some intuition.

Observation 1: Each weight (element of Ωk ) multiplies the activation at a source hidden
unit and adds the result to a destination hidden unit in the next layer. It follows that the
effect of any small change to the weight is amplified or attenuated by the activation at
the source hidden unit. Hence, we run the network for each data example in the batch
and store the activations of all the hidden units. This is known as the forward pass
(figure 7.1). The stored activations will subsequently be used to compute the gradients.

Observation 2: A small change in a bias or weight causes a ripple effect of changes
through the subsequent network. The change modifies the value of its destination hidden


Figure 7.1 Backpropagation forward pass. The goal is to compute the derivatives
of the loss ℓ with respect to each of the weights (arrows) and biases (not shown).
In other words, we want to know how a small change to each parameter will affect
the loss. Each weight multiplies the hidden unit at its source and contributes the
result to the hidden unit at its destination. Consequently, the effects of any small
change to the weight will be scaled by the activation of the source hidden unit.
For example, the blue weight is applied to the second hidden unit at layer 1; if
the activation of this unit doubles, then the effect of a small change to the blue
weight will double too. Hence, to compute the derivatives of the weights, we need
to calculate and store the activations at the hidden layers. This is known as the
forward pass since it involves running the network equations sequentially.

unit. This, in turn, changes the values of the hidden units in the subsequent layer, which
will change the hidden units in the layer after that, and so on, until a change is made to
the model output and, finally, the loss.
Hence, to know how changing a parameter modifies the loss, we also need to know
how changes to every subsequent hidden layer will, in turn, modify their successor. These
same quantities are required when considering other parameters in the same or earlier
layers. It follows that we can calculate them once and reuse them. For example, consider
computing the effect of a small change in weights that feed into hidden layers h3 , h2 ,
and h1 , respectively:

• To calculate how a small change in a weight or bias feeding into hidden layer h3
modifies the loss, we need to know (i) how a change in layer h3 changes the model
output f , and (ii) how a change in this output changes the loss ℓ (figure 7.2a).

• To calculate how a small change in a weight or bias feeding into hidden layer h2
modifies the loss, we need to know (i) how a change in layer h2 affects h3 , (ii) how h3
changes the model output, and (iii) how this output changes the loss (figure 7.2b).

• To calculate how a small change in a weight or bias feeding into hidden layer h1
modifies the loss, we need to know (i) how a change in layer h1 affects layer h2 ,
(ii) how a change in layer h2 affects layer h3 , (iii) how layer h3 changes the model
output, and (iv) how the model output changes the loss (figure 7.2c).


Figure 7.2 Backpropagation backward pass. a) To compute how a change to
a weight feeding into layer h3 (blue arrow) changes the loss, we need to know
how the hidden unit in h3 changes the model output f and how f changes the
loss (orange arrows). b) To compute how a small change to a weight feeding
into h2 (blue arrow) changes the loss, we need to know (i) how the hidden unit
in h2 changes h3 , (ii) how h3 changes f , and (iii) how f changes the loss (orange
arrows). c) Similarly, to compute how a small change to a weight feeding into h1
(blue arrow) changes the loss, we need to know how h1 changes h2 and how
these changes propagate through to the loss (orange arrows). The backward pass
first computes derivatives at the end of the network and then works backward to
exploit the inherent redundancy of these computations.


As we move backward through the network, we see that most of the terms we need
were already calculated in the previous step, so we do not need to re-compute them.
Proceeding backward through the network in this way to compute the derivatives is
known as the backward pass.
The ideas behind backpropagation are relatively easy to understand. However, the
derivation requires matrix calculus because the bias and weight terms are vectors and
matrices, respectively. To help grasp the underlying mechanics, the following section
derives backpropagation for a simpler toy model with scalar parameters. We then apply
the same approach to a deep neural network in section 7.4.

7.3 Toy example

Consider a model f[x, ϕ] with eight scalar parameters ϕ = {β0, ω0, β1, ω1, β2, ω2, β3, ω3} that consists of a composition of the functions sin[•], exp[•], and cos[•]:

$$
\text{f}[x,\boldsymbol{\phi}] = \beta_3 + \omega_3\cdot\cos\Big[\beta_2 + \omega_2\cdot\exp\big[\beta_1 + \omega_1\cdot\sin[\beta_0 + \omega_0\cdot x]\big]\Big],
\tag{7.5}
$$

and a least squares loss function $L[\boldsymbol{\phi}] = \sum_i \ell_i$ with individual terms:

$$
\ell_i = (\text{f}[x_i,\boldsymbol{\phi}] - y_i)^2,
\tag{7.6}
$$

where, as usual, xi is the ith training input, and yi is the ith training output. You can
think of this as a simple neural network with one input, one output, one hidden unit at
each layer, and different activation functions sin[•], exp[•], and cos[•] between each layer.
We aim to compute the derivatives:

$$
\frac{\partial\ell_i}{\partial\beta_0},\;\; \frac{\partial\ell_i}{\partial\omega_0},\;\; \frac{\partial\ell_i}{\partial\beta_1},\;\; \frac{\partial\ell_i}{\partial\omega_1},\;\; \frac{\partial\ell_i}{\partial\beta_2},\;\; \frac{\partial\ell_i}{\partial\omega_2},\;\; \frac{\partial\ell_i}{\partial\beta_3},\;\; \text{and}\;\; \frac{\partial\ell_i}{\partial\omega_3}.
\tag{7.7}
$$
Of course, we could find expressions for these derivatives by hand and compute them
directly. However, some of these expressions are quite complex. For example:

$$
\begin{aligned}
\frac{\partial \ell_i}{\partial \omega_0} = &-2\Big(\beta_3 + \omega_3 \cdot \cos\Big[\beta_2 + \omega_2 \cdot \exp\big[\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i]\big]\Big] - y_i\Big) \\
&\cdot \omega_1\omega_2\omega_3 \cdot x_i \cdot \cos[\beta_0 + \omega_0 \cdot x_i] \cdot \exp\big[\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i]\big] \\
&\cdot \sin\Big[\beta_2 + \omega_2 \cdot \exp\big[\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i]\big]\Big].
\end{aligned}
\tag{7.8}
$$

Such expressions are awkward to derive and code without mistakes and do not exploit
the inherent redundancy; notice that the three exponential terms are the same.
The backpropagation algorithm is an efficient method for computing all of these
derivatives at once. It consists of (i) a forward pass, in which we compute and store a
series of intermediate values and the network output, and (ii) a backward pass, in which


Figure 7.3 Backpropagation forward pass. We compute and store each of the
intermediate variables in turn until we finally calculate the loss.

we calculate the derivatives of each parameter, starting at the end of the network, and
reusing previous calculations as we move toward the start.

Forward pass: We treat the computation of the loss as a series of calculations:

$$
\begin{aligned}
f_0 &= \beta_0 + \omega_0\cdot x_i \\
h_1 &= \sin[f_0] \\
f_1 &= \beta_1 + \omega_1\cdot h_1 \\
h_2 &= \exp[f_1] \\
f_2 &= \beta_2 + \omega_2\cdot h_2 \\
h_3 &= \cos[f_2] \\
f_3 &= \beta_3 + \omega_3\cdot h_3 \\
\ell_i &= (f_3 - y_i)^2.
\end{aligned}
\tag{7.9}
$$

We compute and store the values of the intermediate variables fk and hk (figure 7.3).

Backward pass #1: We now compute the derivatives of ℓi with respect to these intermediate variables, but in reverse order:

$$
\frac{\partial\ell_i}{\partial f_3},\;\; \frac{\partial\ell_i}{\partial h_3},\;\; \frac{\partial\ell_i}{\partial f_2},\;\; \frac{\partial\ell_i}{\partial h_2},\;\; \frac{\partial\ell_i}{\partial f_1},\;\; \frac{\partial\ell_i}{\partial h_1},\;\; \text{and}\;\; \frac{\partial\ell_i}{\partial f_0}.
\tag{7.10}
$$

The first of these derivatives is straightforward:

$$
\frac{\partial\ell_i}{\partial f_3} = 2(f_3 - y_i).
\tag{7.11}
$$
The next derivative can be calculated using the chain rule:

$$
\frac{\partial\ell_i}{\partial h_3} = \frac{\partial f_3}{\partial h_3}\frac{\partial\ell_i}{\partial f_3}.
\tag{7.12}
$$
The left-hand side asks how ℓi changes when h3 changes. The right-hand side says we can
decompose this into (i) how f3 changes when h3 changes and (ii) how ℓi changes when f3
changes. In the original equations, h3 changes f3 , which changes ℓi , and the derivatives

Draft: please send errata to [email protected].


102 7 Gradients and initialization

Figure 7.4 Backpropagation backward pass #1. We work backward from the end
of the function computing the derivatives ∂ℓi /∂fk and ∂ℓi /∂hk of the loss with
respect to the intermediate quantities. Each derivative is computed from the
previous one by multiplying by terms of the form ∂fk /∂hk or ∂hk /∂fk−1 .

represent the effects of this chain. Notice that we already computed the second of these
derivatives, and the other is the derivative of β3 + ω3 · h3 with respect to h3 , which is ω3 .
We continue in this way, computing the derivatives of the output with respect to
these intermediate quantities (figure 7.4):

$$
\begin{aligned}
\frac{\partial\ell_i}{\partial f_2} &= \frac{\partial h_3}{\partial f_2}\left(\frac{\partial f_3}{\partial h_3}\frac{\partial\ell_i}{\partial f_3}\right) \\
\frac{\partial\ell_i}{\partial h_2} &= \frac{\partial f_2}{\partial h_2}\left(\frac{\partial h_3}{\partial f_2}\frac{\partial f_3}{\partial h_3}\frac{\partial\ell_i}{\partial f_3}\right) \\
\frac{\partial\ell_i}{\partial f_1} &= \frac{\partial h_2}{\partial f_1}\left(\frac{\partial f_2}{\partial h_2}\frac{\partial h_3}{\partial f_2}\frac{\partial f_3}{\partial h_3}\frac{\partial\ell_i}{\partial f_3}\right) \\
\frac{\partial\ell_i}{\partial h_1} &= \frac{\partial f_1}{\partial h_1}\left(\frac{\partial h_2}{\partial f_1}\frac{\partial f_2}{\partial h_2}\frac{\partial h_3}{\partial f_2}\frac{\partial f_3}{\partial h_3}\frac{\partial\ell_i}{\partial f_3}\right) \\
\frac{\partial\ell_i}{\partial f_0} &= \frac{\partial h_1}{\partial f_0}\left(\frac{\partial f_1}{\partial h_1}\frac{\partial h_2}{\partial f_1}\frac{\partial f_2}{\partial h_2}\frac{\partial h_3}{\partial f_2}\frac{\partial f_3}{\partial h_3}\frac{\partial\ell_i}{\partial f_3}\right).
\end{aligned}
\tag{7.13}
$$
In each case, we have already computed the quantities in the brackets in the previous
step, and the last term has a simple expression (Problem 7.2). These equations embody Observation 2
from the previous section (figure 7.2); we can reuse the previously computed derivatives
if we calculate them in reverse order.

Backward pass #2: Finally, we consider how the loss ℓi changes when we change the
parameters {βk } and {ωk }. Once more, we apply the chain rule (figure 7.5):

$$
\begin{aligned}
\frac{\partial\ell_i}{\partial\beta_k} &= \frac{\partial f_k}{\partial\beta_k}\frac{\partial\ell_i}{\partial f_k} \\
\frac{\partial\ell_i}{\partial\omega_k} &= \frac{\partial f_k}{\partial\omega_k}\frac{\partial\ell_i}{\partial f_k}.
\end{aligned}
\tag{7.14}
$$
In each case, the second term on the right-hand side was computed in equation 7.13.
When k > 0, we have fk = βk + ωk · hk , so:

$$
\frac{\partial f_k}{\partial\beta_k} = 1 \quad\text{and}\quad \frac{\partial f_k}{\partial\omega_k} = h_k.
\tag{7.15}
$$


Figure 7.5 Backpropagation backward pass #2. Finally, we compute the deriva-
tives ∂ℓi /∂βk and ∂ℓi /∂ωk . Each derivative is computed by multiplying the
term ∂ℓi /∂fk by ∂fk /∂βk or ∂fk /∂ωk as appropriate.

This is consistent with Observation 1 from the previous section; the effect of a change
in the weight ωk is proportional to the value of the source variable hk (which was stored
in the forward pass). The final derivatives from the term f0 = β0 + ω0 · xi are (Notebook 7.1: Backpropagation in toy model):

$$
\frac{\partial f_0}{\partial\beta_0} = 1 \quad\text{and}\quad \frac{\partial f_0}{\partial\omega_0} = x_i.
\tag{7.16}
$$

Backpropagation is both simpler and more efficient than computing the derivatives individually, as in equation 7.8.¹

¹ Note that we did not actually need the derivatives ∂ℓi/∂hk of the loss with respect to the activations. In the final backpropagation algorithm, we will not compute these explicitly.
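For concreteness, a NumPy sketch of the whole toy-model procedure (equations 7.9–7.16) for a single training pair; the function and variable names are illustrative, not the book's code:

import numpy as np

def toy_backprop(x_i, y_i, beta, omega):
    """Sketch of backpropagation for the toy model (equations 7.9-7.16).
    beta and omega are lists [beta0..beta3] and [omega0..omega3]."""
    # Forward pass: compute and store intermediate values (equation 7.9).
    f = [None] * 4; h = [None] * 4
    f[0] = beta[0] + omega[0] * x_i
    h[1] = np.sin(f[0]);  f[1] = beta[1] + omega[1] * h[1]
    h[2] = np.exp(f[1]);  f[2] = beta[2] + omega[2] * h[2]
    h[3] = np.cos(f[2]);  f[3] = beta[3] + omega[3] * h[3]
    loss = (f[3] - y_i) ** 2

    # Backward pass #1: derivatives with respect to f3, f2, f1, f0 (equation 7.13).
    dl_df3 = 2 * (f[3] - y_i)
    dl_df2 = -np.sin(f[2]) * omega[3] * dl_df3     # dh3/df2 * df3/dh3 * dl/df3
    dl_df1 = np.exp(f[1]) * omega[2] * dl_df2
    dl_df0 = np.cos(f[0]) * omega[1] * dl_df1

    # Backward pass #2: derivatives with respect to the parameters (equations 7.14-7.16).
    dl_dbeta = [dl_df0, dl_df1, dl_df2, dl_df3]
    dl_domega = [x_i * dl_df0, h[1] * dl_df1, h[2] * dl_df2, h[3] * dl_df3]
    return loss, dl_dbeta, dl_domega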

7.4 Backpropagation algorithm

Now we repeat this process for a three-layer network (figure 7.1). The intuition and much
of the algebra are identical. The main differences are that intermediate variables fk , hk
are vectors, the biases β k are vectors, the weights Ωk are matrices, and we are using
ReLU functions rather than simple algebraic functions like cos[•].

Forward pass: We write the network as a series of sequential calculations:

$$
\begin{aligned}
\mathbf{f}_0 &= \boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0\mathbf{x}_i \\
\mathbf{h}_1 &= a[\mathbf{f}_0] \\
\mathbf{f}_1 &= \boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1\mathbf{h}_1 \\
\mathbf{h}_2 &= a[\mathbf{f}_1] \\
\mathbf{f}_2 &= \boldsymbol{\beta}_2 + \boldsymbol{\Omega}_2\mathbf{h}_2 \\
\mathbf{h}_3 &= a[\mathbf{f}_2] \\
\mathbf{f}_3 &= \boldsymbol{\beta}_3 + \boldsymbol{\Omega}_3\mathbf{h}_3 \\
\ell_i &= \text{l}[\mathbf{f}_3, \mathbf{y}_i],
\end{aligned}
\tag{7.17}
$$


Figure 7.6 Derivative of rectified linear
unit. The rectified linear unit (orange
curve) returns zero when the input is
less than zero and returns the input oth-
erwise. Its derivative (cyan curve) re-
turns zero when the input is less than
zero (since the slope here is zero) and
one when the input is greater than zero
(since the slope here is one).

where fk−1 represents the pre-activations at the kth hidden layer (i.e., the values before
the ReLU function a[•]) and hk contains the activations at the kth hidden layer (i.e., after
the ReLU function). The term l[f3 , yi ] represents the loss function (e.g., least squares or
binary cross-entropy loss). In the forward pass, we work through these calculations and
store all the intermediate quantities.

Backward pass #1: Now let’s consider how the loss changes when the pre-activations f0, f1, f2 change. Applying the chain rule, the expression for the derivative of the loss ℓi with respect to f2 is (Appendix B.5: Matrix calculus):

$$
\frac{\partial\ell_i}{\partial\mathbf{f}_2} = \frac{\partial\mathbf{h}_3}{\partial\mathbf{f}_2}\frac{\partial\mathbf{f}_3}{\partial\mathbf{h}_3}\frac{\partial\ell_i}{\partial\mathbf{f}_3}.
\tag{7.18}
$$
The three terms on the right-hand side have sizes D3 × D3 , D3 × Df , and Df × 1,
respectively, where D3 is the number of hidden units in the third layer, and Df is the
dimensionality of the model output f3 .
Similarly, we can compute how the loss changes when we change f1 and f0:

$$
\frac{\partial\ell_i}{\partial\mathbf{f}_1} = \frac{\partial\mathbf{h}_2}{\partial\mathbf{f}_1}\left(\frac{\partial\mathbf{f}_2}{\partial\mathbf{h}_2}\frac{\partial\mathbf{h}_3}{\partial\mathbf{f}_2}\frac{\partial\mathbf{f}_3}{\partial\mathbf{h}_3}\frac{\partial\ell_i}{\partial\mathbf{f}_3}\right)
\tag{7.19}
$$

$$
\frac{\partial\ell_i}{\partial\mathbf{f}_0} = \frac{\partial\mathbf{h}_1}{\partial\mathbf{f}_0}\left(\frac{\partial\mathbf{f}_1}{\partial\mathbf{h}_1}\frac{\partial\mathbf{h}_2}{\partial\mathbf{f}_1}\frac{\partial\mathbf{f}_2}{\partial\mathbf{h}_2}\frac{\partial\mathbf{h}_3}{\partial\mathbf{f}_2}\frac{\partial\mathbf{f}_3}{\partial\mathbf{h}_3}\frac{\partial\ell_i}{\partial\mathbf{f}_3}\right).
\tag{7.20}
$$

Note that in each case, the term in brackets was computed in the previous step. By
working backward through the network, we can reuse the previous computations (Problem 7.3).
Moreover, the terms themselves are simple (Problems 7.4–7.5). Working backward through the right-hand side of equation 7.18, we have:

• The derivative ∂ℓi /∂f3 of the loss ℓi with respect to the network output f3 will
depend on the loss function but usually has a simple form.

• The derivative ∂f3 /∂h3 of the network output with respect to hidden layer h3 is:


$$
\frac{\partial\mathbf{f}_3}{\partial\mathbf{h}_3} = \frac{\partial}{\partial\mathbf{h}_3}\left(\boldsymbol{\beta}_3 + \boldsymbol{\Omega}_3\mathbf{h}_3\right) = \boldsymbol{\Omega}_3^T.
\tag{7.21}
$$

If you are unfamiliar with matrix calculus, this result is not obvious. It is explored in problem 7.6.
• The derivative ∂h3 /∂f2 of the output h3 of the activation function with respect to
its input f2 will depend on the activation function. It will be a diagonal matrix
since each activation only depends on the corresponding pre-activation. For ReLU
functions, the diagonal terms are zero everywhere f2 is less than zero and one
otherwise (figure 7.6; Problems 7.7–7.8). Rather than multiply by this matrix, we extract the diagonal
terms as a vector I[f2 > 0] and pointwise multiply, which is more efficient.
The terms on the right-hand side of equations 7.19 and 7.20 have similar forms. As
we progress back through the network, we alternately (i) multiply by the transpose of
the weight matrices ΩTk and (ii) threshold based on the inputs fk−1 to the hidden layer.
These inputs were stored during the forward pass.

Backward pass #2: Now that we know how to compute ∂ℓi /∂fk , we can focus on
calculating the derivatives of the loss with respect to the weights and biases. To calculate
the derivatives of the loss with respect to the biases β k , we again use the chain rule:

$$
\begin{aligned}
\frac{\partial\ell_i}{\partial\boldsymbol{\beta}_k} &= \frac{\partial\mathbf{f}_k}{\partial\boldsymbol{\beta}_k}\frac{\partial\ell_i}{\partial\mathbf{f}_k} \\
&= \frac{\partial}{\partial\boldsymbol{\beta}_k}\left(\boldsymbol{\beta}_k + \boldsymbol{\Omega}_k\mathbf{h}_k\right)\frac{\partial\ell_i}{\partial\mathbf{f}_k} \\
&= \frac{\partial\ell_i}{\partial\mathbf{f}_k},
\end{aligned}
\tag{7.22}
$$
which we already calculated in equations 7.18 and 7.19.
Similarly, the derivative for the weights matrix Ωk is given by:

$$
\begin{aligned}
\frac{\partial\ell_i}{\partial\boldsymbol{\Omega}_k} &= \frac{\partial\mathbf{f}_k}{\partial\boldsymbol{\Omega}_k}\frac{\partial\ell_i}{\partial\mathbf{f}_k} \\
&= \frac{\partial}{\partial\boldsymbol{\Omega}_k}\left(\boldsymbol{\beta}_k + \boldsymbol{\Omega}_k\mathbf{h}_k\right)\frac{\partial\ell_i}{\partial\mathbf{f}_k} \\
&= \frac{\partial\ell_i}{\partial\mathbf{f}_k}\mathbf{h}_k^T.
\end{aligned}
\tag{7.23}
$$
Again, the progression from line two to line three is not obvious and is explored in
problem 7.9. However, the result makes sense. The final line is a matrix of the same size
as Ωk . It depends linearly on hk , which was multiplied by Ωk in the original expression.
This is also consistent with the initial intuition that the derivative of the weights in Ωk
will be proportional to the values of the hidden units hk that they multiply. Recall that
we already computed these during the forward pass.


7.4.1 Backpropagation algorithm summary

We now briefly summarize the final backpropagation algorithm. Consider a deep neural
network f[xi , ϕ] that takes input xi , has K hidden layers with ReLU activations, and
individual loss term ℓi = l[f[xi , ϕ], yi ]. The goal of backpropagation is to compute the
derivatives ∂ℓi /∂β k and ∂ℓi /∂Ωk with respect to the biases β k and weights Ωk .

Forward pass: We compute and store the following quantities:

$$
\begin{aligned}
\mathbf{f}_0 &= \boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0\mathbf{x}_i && \\
\mathbf{h}_k &= a[\mathbf{f}_{k-1}] && k \in \{1, 2, \ldots, K\} \\
\mathbf{f}_k &= \boldsymbol{\beta}_k + \boldsymbol{\Omega}_k\mathbf{h}_k. && k \in \{1, 2, \ldots, K\}
\end{aligned}
\tag{7.24}
$$

Backward pass: We start with the derivative ∂ℓi /∂fK of the loss function ℓi with respect
to the network output fK and work backward through the network:

$$
\begin{aligned}
\frac{\partial\ell_i}{\partial\boldsymbol{\beta}_k} &= \frac{\partial\ell_i}{\partial\mathbf{f}_k} && k \in \{K, K-1, \ldots, 1\} \\
\frac{\partial\ell_i}{\partial\boldsymbol{\Omega}_k} &= \frac{\partial\ell_i}{\partial\mathbf{f}_k}\mathbf{h}_k^T && k \in \{K, K-1, \ldots, 1\} \\
\frac{\partial\ell_i}{\partial\mathbf{f}_{k-1}} &= \mathbb{I}[\mathbf{f}_{k-1} > 0] \odot \left(\boldsymbol{\Omega}_k^T\frac{\partial\ell_i}{\partial\mathbf{f}_k}\right), && k \in \{K, K-1, \ldots, 1\}
\end{aligned}
\tag{7.25}
$$
where ⊙ denotes pointwise multiplication, and I[fk−1 > 0] is a vector containing ones
where fk−1 is greater than zero and zeros elsewhere. Finally, we compute the derivatives
with respect to the first set of biases and weights:

$$
\begin{aligned}
\frac{\partial\ell_i}{\partial\boldsymbol{\beta}_0} &= \frac{\partial\ell_i}{\partial\mathbf{f}_0} \\
\frac{\partial\ell_i}{\partial\boldsymbol{\Omega}_0} &= \frac{\partial\ell_i}{\partial\mathbf{f}_0}\mathbf{x}_i^T.
\end{aligned}
\tag{7.26}
$$
We calculate these derivatives for every training example in the batch and sum them
together to retrieve the gradient for the SGD update (Problem 7.10).
Note that the backpropagation algorithm is extremely efficient (Notebook 7.2: Backpropagation); the most demanding
computational step in both the forward and backward pass is matrix multiplication (by Ω
and ΩT , respectively) which only requires additions and multiplications. However, it is
not memory efficient; the intermediate values in the forward pass must all be stored, and
this can limit the size of the model we can train.
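A compact NumPy sketch of the complete algorithm for a ReLU network (equations 7.24–7.26); the least squares loss and the variable names are illustrative choices, not the only possibility:

import numpy as np

def backprop(x_i, y_i, betas, omegas):
    """Sketch of backpropagation for a ReLU network with K hidden layers
    (equations 7.24-7.26), using a least squares loss for illustration."""
    K = len(betas) - 1
    # Forward pass: store pre-activations f[k] and activations h[k].
    f = [None] * (K + 1); h = [None] * (K + 1)
    f[0] = betas[0] + omegas[0] @ x_i
    for k in range(1, K + 1):
        h[k] = np.maximum(f[k - 1], 0)            # ReLU activation
        f[k] = betas[k] + omegas[k] @ h[k]
    loss = np.sum((f[K] - y_i) ** 2)

    # Backward pass: start from dl/df_K and work back through the layers.
    dl_dbetas = [None] * (K + 1); dl_domegas = [None] * (K + 1)
    dl_df = 2 * (f[K] - y_i)                      # derivative of least squares loss
    for k in range(K, 0, -1):
        dl_dbetas[k] = dl_df
        dl_domegas[k] = np.outer(dl_df, h[k])     # dl/dOmega_k = dl/df_k h_k^T
        dl_df = (f[k - 1] > 0) * (omegas[k].T @ dl_df)   # threshold and multiply by Omega_k^T
    dl_dbetas[0] = dl_df
    dl_domegas[0] = np.outer(dl_df, x_i)
    return loss, dl_dbetas, dl_domegas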

7.4.2 Algorithmic differentiation

Although it’s important to understand the backpropagation algorithm, it’s unlikely that
you will need to code it in practice. Modern deep learning frameworks such as PyTorch


and TensorFlow calculate the derivatives automatically, given the model specification.
This is known as algorithmic differentiation.
Each functional component (linear transform, ReLU activation, loss function) in the
framework knows how to compute its own derivative. For example, the PyTorch ReLU
function zout = relu[zin ] knows how to compute the derivative of its output zout with
respect to its input zin . Similarly, a linear function zout = β + Ωzin knows how to
compute the derivatives of the output zout with respect to the input zin and with re-
spect to the parameters β and Ω. The algorithmic differentiation framework also knows
the sequence of operations in the network and thus has all the information required to
perform the forward and backward passes.
These frameworks exploit the massive parallelism of modern graphics processing units
(GPUs). Computations such as matrix multiplication (which features in both the forward
and backward pass) are naturally amenable to parallelization. Moreover, it’s possible to
perform the forward and backward passes for the entire batch in parallel if the model
and intermediate results in the forward pass do not exceed the available memory (Problem 7.11).
Since the training algorithm now processes the entire batch in parallel, the input
becomes a multi-dimensional tensor. In this context, a tensor can be considered the
generalization of a matrix to arbitrary dimensions. Hence, a vector is a 1D tensor, a
matrix is a 2D tensor, and a 3D tensor is a 3D grid of numbers. Until now, the training
data have been 1D, so the input for backpropagation would be a 2D tensor where the
first dimension indexes the batch element and the second indexes the data dimension.
In subsequent chapters, we will encounter more complex structured input data. For
example, in models where the input is an RGB image, the original data examples are
3D (height × width × channel). Here, the input to the learning framework would be a
4D tensor, where the extra dimension indexes the batch element.
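For instance (a minimal illustration; the sizes are arbitrary and the channel-first ordering follows the usual PyTorch convention):

import torch

batch_of_vectors = torch.randn(10, 100)          # 2D tensor: (batch, data dimension)
batch_of_images = torch.randn(10, 3, 64, 64)     # 4D tensor: (batch, channels, height, width)
print(batch_of_vectors.ndim, batch_of_images.ndim)   # 2 4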

7.4.3 Extension to arbitrary computational graphs

We have described backpropagation in a deep neural network that is naturally sequential;
we calculate the intermediate quantities f0 , h1 , f1 , h2 , . . . , fk in turn. However, models
need not be restricted to sequential computation. Later in this book, we will meet
models with branching structures. For example, we might take the values in a hidden
layer and process them through two different sub-networks before recombining.
Problems 7.12–7.13
Fortunately, the ideas of backpropagation still hold if the computational graph is
acyclic. Modern algorithmic differentiation frameworks such as PyTorch and TensorFlow
can handle arbitrary acyclic computational graphs.

7.5 Parameter initialization

The backpropagation algorithm computes the derivatives that are used by stochastic
gradient descent and Adam to train the model. We now address how to initialize the
parameters before we start training. To see why this is crucial, consider that during the
forward pass, each set of pre-activations fk is computed as:


$$
\begin{aligned}
\mathbf{f}_k &= \boldsymbol{\beta}_k + \boldsymbol{\Omega}_k\mathbf{h}_k \\
&= \boldsymbol{\beta}_k + \boldsymbol{\Omega}_k a[\mathbf{f}_{k-1}],
\end{aligned}
\tag{7.27}
$$

where a[•] applies the ReLU functions and Ωk and β k are the weights and biases, respec-
tively. Imagine that we initialize all the biases to zero and the elements of Ωk according
to a normal distribution with mean zero and variance σ². Consider two scenarios:

• If the variance σ² is very small (e.g., 10⁻⁵), then each element of β k + Ωk hk will be
a weighted sum of hk where the weights are very small; the result will likely have
a smaller magnitude than the input. In addition, the ReLU function clips values
less than zero, so the range of hk will be half that of fk−1 . Consequently, the
magnitudes of the pre-activations at the hidden layers will get smaller and smaller
as we progress through the network.
• If the variance σ² is very large (e.g., 10⁵), then each element of β k + Ωk hk will be
a weighted sum of hk where the weights are very large; the result is likely to have
a much larger magnitude than the input. The ReLU function halves the range of
the inputs, but if σ² is large enough, the magnitudes of the pre-activations will still
get larger as we progress through the network.

In these two situations, the values at the pre-activations can become so small or so large
that they cannot be represented with finite precision floating point arithmetic.
Even if the forward pass is tractable, the same logic applies to the backward pass.
Each gradient update (equation 7.25) consists of multiplying by ΩT . If the values of Ω
are not initialized sensibly, then the gradient magnitudes may decrease or increase un-
controllably during the backward pass. These cases are known as the vanishing gradient
problem and the exploding gradient problem, respectively. In the former case, updates to
the model become vanishingly small. In the latter case, they become unstable.

7.5.1 Initialization for forward pass

We now present a mathematical version of the same argument. Consider the computation
between adjacent pre-activations f and f ′ with dimensions Dh and Dh′ , respectively:

$$
\begin{aligned}
\mathbf{h} &= a[\mathbf{f}], \\
\mathbf{f}' &= \boldsymbol{\beta} + \boldsymbol{\Omega}\mathbf{h},
\end{aligned}
\tag{7.28}
$$

where h represents the activations, Ω and β represent the weights and biases, and a[•]
is the activation function.
Assume the pre-activations $f_j$ in the input layer $\mathbf{f}$ have variance $\sigma^2_{f_j}$. Consider initializing the biases $\beta_i$ to zero and the weights $\Omega_{ij}$ as normally distributed with mean zero and variance $\sigma^2_\Omega$. Now we derive expressions for the mean and variance of the pre-activations $\mathbf{f}'$ in the subsequent layer.


The expectation (mean) $\mathbb{E}[f_i']$ of the intermediate values $f_i'$ is (Appendix C.2: Expectation):

$$
\begin{aligned}
\mathbb{E}[f_i'] &= \mathbb{E}\left[\beta_i + \sum_{j=1}^{D_h}\Omega_{ij}h_j\right] \\
&= \mathbb{E}[\beta_i] + \sum_{j=1}^{D_h}\mathbb{E}[\Omega_{ij}h_j] \\
&= \mathbb{E}[\beta_i] + \sum_{j=1}^{D_h}\mathbb{E}[\Omega_{ij}]\,\mathbb{E}[h_j] \\
&= 0 + \sum_{j=1}^{D_h} 0\cdot\mathbb{E}[h_j] = 0,
\end{aligned}
\tag{7.29}
$$

where Dh is the dimensionality of the input layer h. We have used the rules for manipulating expectations (Appendix C.2.1: Expectation rules), and we have assumed that the distributions over the hidden units hj and the network weights Ωij are independent between the second and third lines.
Using this result, we see that the variance $\sigma^2_{f_i'}$ of the pre-activations $f_i'$ is:

$$
\begin{aligned}
\sigma^2_{f_i'} &= \mathbb{E}\big[f_i'^2\big] - \mathbb{E}\big[f_i'\big]^2 \\
&= \mathbb{E}\left[\Bigg(\beta_i + \sum_{j=1}^{D_h}\Omega_{ij}h_j\Bigg)^{\!2}\,\right] - 0 \\
&= \mathbb{E}\left[\Bigg(\sum_{j=1}^{D_h}\Omega_{ij}h_j\Bigg)^{\!2}\,\right] \\
&= \sum_{j=1}^{D_h}\mathbb{E}\big[\Omega_{ij}^2\big]\,\mathbb{E}\big[h_j^2\big] \\
&= \sum_{j=1}^{D_h}\sigma^2_\Omega\,\mathbb{E}\big[h_j^2\big] = \sigma^2_\Omega\sum_{j=1}^{D_h}\mathbb{E}\big[h_j^2\big],
\end{aligned}
\tag{7.30}
$$

where we have used the variance identity $\sigma^2 = \mathbb{E}[(z - \mathbb{E}[z])^2] = \mathbb{E}[z^2] - \mathbb{E}[z]^2$ (Appendix C.2.3: Variance identity). We have assumed once more that the distributions of the weights Ωij and the hidden units hj are independent between lines three and four.
Assuming that the distribution of pre-activations fj at the previous layer is symmetric about zero, half of these pre-activations will be clipped by the ReLU function, and the second moment $\mathbb{E}[h_j^2]$ will be half the variance $\sigma_f^2$ of fj (see problem 7.14):

$$
\sigma^2_{f_i'} = \sigma^2_\Omega\sum_{j=1}^{D_h}\frac{\sigma_f^2}{2} = \frac{1}{2}D_h\,\sigma^2_\Omega\,\sigma_f^2.
\tag{7.31}
$$


Figure 7.7 Weight initialization. Consider a deep network with 50 hidden layers and Dh = 100 hidden units per layer. The network has a 100-dimensional input x initialized from a standard normal distribution, a single fixed target y = 0, and a least squares loss function. The bias vectors β k are initialized to zero, and the weight matrices Ωk are initialized with a normal distribution with mean zero and five different variances $\sigma^2_\Omega \in \{0.001, 0.01, 0.02, 0.1, 1.0\}$. a) Variance of hidden unit activations computed in forward pass as a function of the network layer. For He initialization ($\sigma^2_\Omega = 2/D_h = 0.02$), the variance is stable. However, for larger values, it increases rapidly, and for smaller values, it decreases rapidly (note log scale). b) The variance of the gradients in the backward pass (solid lines) continues this trend; if we initialize with a value larger than 0.02, the magnitude of the gradients increases rapidly as we pass back through the network. If we initialize with a smaller value, then the magnitude decreases. These are known as the exploding gradient and vanishing gradient problems, respectively.

This, in turn, implies that if we want the variance $\sigma^2_{f'}$ of the subsequent pre-activations f′ to be the same as the variance $\sigma^2_f$ of the original pre-activations f during the forward pass, we should set:

$$
\sigma^2_\Omega = \frac{2}{D_h},
\tag{7.32}
$$
where Dh is the dimension of the original layer to which the weights were applied. This
is known as He initialization.
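A sketch of this initialization in NumPy, assuming a layer that takes Dh inputs and produces Dh′ outputs (the function name and seed are illustrative):

import numpy as np

def he_initialize(D_h, D_h_prime, seed=0):
    """Sketch of He initialization (equation 7.32): weights drawn from a normal
    distribution with variance 2/D_h, biases initialized to zero."""
    rng = np.random.default_rng(seed)
    Omega = rng.normal(loc=0.0, scale=np.sqrt(2.0 / D_h), size=(D_h_prime, D_h))
    beta = np.zeros(D_h_prime)
    return Omega, beta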

7.5.2 Initialization for backward pass

A similar argument establishes how the variance of the gradients ∂l/∂fk changes during
the backward pass. During the backward pass, we multiply by the transpose ΩT of the
weight matrix (equation 7.25), so the equivalent expression becomes:


$$
\sigma^2_\Omega = \frac{2}{D_{h'}},
\tag{7.33}
$$
where Dh′ is the dimension of the layer that the weights feed into.

7.5.3 Initialization for both forward and backward pass

If the weight matrix Ω is not square (i.e., there are different numbers of hidden units
in the two adjacent layers, so Dh and Dh′ differ), then it is not possible to choose the
variance to satisfy both equations 7.32 and 7.33 simultaneously. One possible compromise
is to use the mean (Dh + Dh′ )/2 as a proxy for the number of terms, which gives:

$$
\sigma^2_\Omega = \frac{4}{D_h + D_{h'}}.
\tag{7.34}
$$
Figure 7.7 shows empirically that both the variance of the hidden units in the forward
pass and the variance of the gradients in the backward pass remain stable when the
parameters are initialized appropriately (Problem 7.15; Notebook 7.3: Initialization).

7.6 Example training code

The primary focus of this book is scientific; this is not a guide for implementing deep
learning models. Nonetheless, in figure 7.8, we present PyTorch code that implements
the ideas explored in this book so far (Problems 7.16–7.17). The code defines a neural network and initializes
the weights. It creates random input and output datasets and defines a least squares loss
function. The model is trained from the data using SGD with momentum in batches of
size 10 over 100 epochs. The learning rate starts at 0.01 and halves every 10 epochs.
The takeaway is that although the underlying ideas in deep learning are quite com-
plex, implementation is relatively simple. For example, all of the details of the back-
propagation are hidden in the single line of code: loss.backward().

7.7 Summary

The previous chapter introduced stochastic gradient descent (SGD), an iterative opti-
mization algorithm that aims to find the minimum of a function. In the context of neural
networks, this algorithm finds the parameters that minimize the loss function. SGD re-
lies on the gradient of the loss function with respect to the parameters, which must be
initialized before optimization. This chapter has addressed these two problems for deep
neural networks.
The gradients must be evaluated for a very large number of parameters, for each
member of the batch, and at each SGD iteration. It is hence imperative that the gradient


import torch, torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from torch.optim.lr_scheduler import StepLR

# define input size, hidden layer size, output size
D_i, D_k, D_o = 10, 40, 5
# create model with two hidden layers
model = nn.Sequential(
    nn.Linear(D_i, D_k),
    nn.ReLU(),
    nn.Linear(D_k, D_k),
    nn.ReLU(),
    nn.Linear(D_k, D_o))

# He initialization of weights
def weights_init(layer_in):
    if isinstance(layer_in, nn.Linear):
        nn.init.kaiming_normal_(layer_in.weight)
        layer_in.bias.data.fill_(0.0)
model.apply(weights_init)

# choose least squares loss function
criterion = nn.MSELoss()
# construct SGD optimizer and initialize learning rate and momentum
optimizer = torch.optim.SGD(model.parameters(), lr = 0.1, momentum=0.9)
# object that decreases learning rate by half every 10 epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)

# create 100 random data points and store in data loader class
x = torch.randn(100, D_i)
y = torch.randn(100, D_o)
data_loader = DataLoader(TensorDataset(x, y), batch_size=10, shuffle=True)

# loop over the dataset 100 times
for epoch in range(100):
    epoch_loss = 0.0
    # loop over batches
    for i, data in enumerate(data_loader):
        # retrieve inputs and labels for this batch
        x_batch, y_batch = data
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward pass
        pred = model(x_batch)
        loss = criterion(pred, y_batch)
        # backward pass
        loss.backward()
        # SGD update
        optimizer.step()
        # update statistics
        epoch_loss += loss.item()
    # print error
    print(f'Epoch {epoch:5d}, loss {epoch_loss:.3f}')
    # tell scheduler to consider updating learning rate
    scheduler.step()

Figure 7.8 Sample code for training a two-layer network on random data.


computation is efficient, and to this end, the backpropagation algorithm was introduced.
Careful parameter initialization is also critical. The magnitudes of the hidden unit
activations can either decrease or increase exponentially in the forward pass. The same
is true of the gradient magnitudes in the backward pass, where these behaviors are known
as the vanishing gradient and exploding gradient problems. Both impede training but
can be avoided with appropriate initialization.
We’ve now defined the model and the loss function, and we can train a model for a
given task. The next chapter discusses how to measure the model performance.

Notes

Backpropagation: Efficient reuse of partial computations while calculating gradients in com-
putational graphs has been repeatedly discovered, including by Werbos (1974), Bryson et al.
(1979), LeCun (1985), and Parker (1985). However, the most celebrated description of this
idea was by Rumelhart et al. (1985) and Rumelhart et al. (1986), who also coined the term
“backpropagation.” This latter work kick-started a new phase of neural network research in the
eighties and nineties; for the first time, it was practical to train networks with hidden layers.
However, progress stalled due (in retrospect) to a lack of training data, limited computational
power, and the use of sigmoid activations. Areas such as natural language processing and com-
puter vision did not rely on neural network models until the remarkable image classification
results of Krizhevsky et al. (2012) ushered in the modern era of deep learning.
The implementation of backpropagation in modern deep learning frameworks such as PyTorch
and TensorFlow is an example of reverse-mode algorithmic differentiation. This is distinguished
from forward-mode algorithmic differentiation in which the derivatives from the chain rule
are accumulated while moving forward through the computational graph (see problem 7.13).
Further information about algorithmic differentiation can be found in Griewank & Walther
(2008) and Baydin et al. (2018).
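
To make the distinction concrete, here is a toy forward-mode sketch on the function y = exp[x] · sin[x] (this is only an illustration, not how PyTorch or TensorFlow implement differentiation): each intermediate quantity carries its derivative with respect to the input, accumulated while moving forward through the graph.

import math

def forward_mode(x):
    # each pair is (value, derivative with respect to x)
    v,  dv  = x, 1.0                          # seed: dx/dx = 1
    f1, df1 = math.exp(v), math.exp(v) * dv   # f1 = exp[x]
    f2, df2 = math.sin(v), math.cos(v) * dv   # f2 = sin[x]
    y,  dy  = f1 * f2, df1 * f2 + f1 * df2    # product rule
    return y, dy

print(forward_mode(0.3))                      # (value of y, dy/dx) at x = 0.3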

Initialization: He initialization was first introduced by He et al. (2015). It follows closely
from Glorot or Xavier initialization (Glorot & Bengio, 2010), which is very similar but does
not consider the effect of the ReLU layer and so differs by a factor of two. Essentially the
same method was proposed much earlier by LeCun et al. (2012) but with a slightly different
motivation; in this case, sigmoidal activation functions were used, which naturally normalize the
range of outputs at each layer, and hence help prevent an exponential increase in the magnitudes
of the hidden units. However, if the pre-activations are too large, they fall into the flat regions
of the sigmoid function and result in very small gradients. Hence, it is still important to
initialize the weights sensibly. Klambauer et al. (2017) introduce the scaled exponential linear
unit (SeLU) and show that, within a certain range of inputs, this activation function tends to
make the activations in network layers automatically converge to mean zero and unit variance.
A completely different approach is to pass data through the network and then normalize by the
empirically observed variance. Layer-sequential unit variance initialization (Mishkin & Matas,
2016) is an example of this kind of method, in which the weight matrices are initialized as
orthonormal. GradInit (Zhu et al., 2021) randomizes the initial weights and temporarily fixes
them while it learns non-negative scaling factors for each weight matrix. These factors are
selected to maximize the decrease in the loss for a fixed learning rate subject to a constraint
on the maximum gradient norm. Activation normalization or ActNorm adds a learnable scaling
and offset parameter after each network layer at each hidden unit. They run an initial batch
through the network and then choose the offset and scale so that the mean of the activations is
zero and the variance one. After this, these extra parameters are learned as part of the model.


Closely related to these methods are schemes such as BatchNorm (Ioffe & Szegedy, 2015), in
which the network normalizes the variance of each batch as part of its processing at every
step. BatchNorm and its variants are discussed in chapter 11. Other initialization schemes have
been proposed for specific architectures, including the ConvolutionOrthogonal initializer (Xiao
et al., 2018a) for convolutional networks, Fixup (Zhang et al., 2019a) for residual networks, and
TFixup (Huang et al., 2020a) and DTFixup (Xu et al., 2021b) for transformers.

Reducing memory requirements: Training neural networks is memory intensive. We must
store both the model parameters and the pre-activations at the hidden units for every member
of the batch during the forward pass. Two methods that decrease memory requirements are
gradient checkpointing (Chen et al., 2016a) and micro-batching (Huang et al., 2019). In gradient
checkpointing, the activations are only stored every N layers during the forward pass. During
the backward pass, the intermediate missing activations are recalculated from the nearest check-
point. In this manner, we can drastically reduce the memory requirements at the computational
cost of performing the forward pass twice (problem 7.11). In micro-batching, the batch is sub-
divided into smaller parts, and the gradient updates are aggregated from each sub-batch before
being applied to the network. A completely different approach is to build a reversible network
(e.g., Gomez et al., 2017), in which the activations at the previous layer can be computed from
the activations at the current one, so there is no need to cache anything during the forward pass
(see chapter 16). Sohoni et al. (2019) review approaches to reducing memory requirements.
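
A minimal sketch of gradient checkpointing with torch.utils.checkpoint (the 50-block stack and segment length are arbitrary choices, and recent PyTorch versions may additionally ask for an explicit use_reentrant argument):

import torch, torch.nn as nn
from torch.utils.checkpoint import checkpoint

# a deep stack of blocks; only the inputs to each checkpointed segment are stored,
# and the activations inside a segment are recomputed during the backward pass
blocks = nn.ModuleList([nn.Sequential(nn.Linear(100, 100), nn.ReLU())
                        for _ in range(50)])

def forward_with_checkpoints(x, segment=10):
    for start in range(0, len(blocks), segment):
        seg = nn.Sequential(*blocks[start:start + segment])
        x = checkpoint(seg, x)                # forward pass for this segment only
    return x

x = torch.randn(16, 100, requires_grad=True)
forward_with_checkpoints(x).sum().backward()  # segments are re-run here as needed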

Distributed training: For sufficiently large models, the memory requirements or total re-
quired time may be too much for a single processor. In this case, we must use distributed
training, in which training takes place in parallel across multiple processors. There are several
approaches to parallelism. In data parallelism, each processor or node contains a full copy of
the model but runs a subset of the batch (see Xing et al., 2015; Li et al., 2020b). The gradients
from each node are aggregated centrally and then redistributed back to each node to ensure
that the models remain consistent. This is known as synchronous training. The synchronization
required to aggregate and redistribute the gradients can be a performance bottleneck, and this
leads to the idea of asynchronous training. For example, in the Hogwild! algorithm (Recht
et al., 2011), the gradient from a node is used to update a central model whenever it is ready.
The updated model is then redistributed to the node. This means that each node may have a
slightly different version of the model at any given time, so the gradient updates may be stale;
however, it works well in practice. Other decentralized schemes have also been developed. For
example, in Zhang et al. (2016a), the individual nodes update one another in a ring structure.
Data parallelism methods still assume that the entire model can be held in the memory of a
single node. Pipeline model parallelism stores different layers of the network on different nodes
and hence does not have this requirement. In a naïve implementation, the first node runs the
forward pass for the batch on the first few layers and passes the result to the next node, which
runs the forward pass on the next few layers and so on. In the backward pass, the gradients are
updated in the opposite order. The obvious disadvantage of this approach is that each machine
lies idle for most of the cycle. Various schemes revolving around each node processing micro-
batches sequentially have been proposed to reduce this inefficiency (e.g., Huang et al., 2019;
Narayanan et al., 2021a). Finally, in tensor model parallelism, computation at a single network
layer is distributed across nodes (e.g., Shoeybi et al., 2019). A good overview of distributed
training methods can be found in Narayanan et al. (2021b), who combine tensor, pipeline, and
data parallelism to train a language model with one trillion parameters on 3072 GPUs.
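
The synchronous data parallelism idea can be sketched in a single process (this toy code simulates four "nodes" holding model copies; real systems use libraries such as torch.distributed rather than this loop):

import copy, torch, torch.nn as nn

master = nn.Linear(10, 1)                          # toy "model"
nodes = [copy.deepcopy(master) for _ in range(4)]  # one copy per simulated node
x, y = torch.randn(32, 10), torch.randn(32, 1)

# each node computes gradients on its own slice of the batch
for node, xb, yb in zip(nodes, x.chunk(4), y.chunk(4)):
    nn.MSELoss()(node(xb), yb).backward()

# aggregate the gradients centrally and apply one SGD step to the master copy
for p_master, *p_nodes in zip(master.parameters(),
                              *[n.parameters() for n in nodes]):
    grad = torch.stack([p.grad for p in p_nodes]).mean(dim=0)
    with torch.no_grad():
        p_master -= 0.1 * grad                     # learning rate 0.1 (arbitrary)

# redistribute the updated parameters so all nodes stay consistent
for node in nodes:
    node.load_state_dict(master.state_dict())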

Problems
Problem 7.1 A two-layer network with two hidden units in each layer can be defined as:


y = ϕ0 + ϕ1 a[ψ01 + ψ11 a[θ01 + θ11 x] + ψ21 a[θ02 + θ12 x]]
      + ϕ2 a[ψ02 + ψ12 a[θ01 + θ11 x] + ψ22 a[θ02 + θ12 x]],            (7.35)

where the functions a[•] are ReLU functions. Compute the derivatives of the output y with
respect to each of the 13 parameters ϕ• , θ•• , and ψ•• directly (i.e., not using the backpropagation
algorithm). The derivative of the ReLU function with respect to its input ∂a[z]/∂z is the
indicator function I[z > 0], which returns one if the argument is greater than zero and zero
otherwise (figure 7.6).

Problem 7.2 Find an expression for the final term in each of the five chains of derivatives in
equation 7.13.

Problem 7.3 What size are each of the terms in equation 7.20?

Problem 7.4 Calculate the derivative ∂ℓi /∂f [xi , ϕ] for the least squares loss function:

ℓi = (yi − f[xi, ϕ])².                                                   (7.36)

Problem 7.5 Calculate the derivative ∂ℓi /∂f[xi , ϕ] for the binary classification loss function:

ℓi = −(1 − yi) log[1 − sig[f[xi, ϕ]]] − yi log[sig[f[xi, ϕ]]],           (7.37)

where the function sig[•] is the logistic sigmoid and is defined as:

sig[z] = 1 / (1 + exp[−z]).                                              (7.38)

Problem 7.6∗ Show that for z = β + Ωh:

∂z/∂h = ΩT,                                                              (7.39)
where ∂z/∂h is a matrix containing the term ∂zi /∂hj in its ith column and j th row. To do this,
first find an expression for the constituent elements ∂zi /∂hj , and then consider the form that
the matrix ∂z/∂h must take.

Problem 7.7 Consider the case where we use the logistic sigmoid (see equation 7.38) as an
activation function, so h = sig[f ]. Compute the derivative ∂h/∂f for this activation function.
What happens to the derivative when the input takes (i) a large positive value and (ii) a large
negative value?

Problem 7.8 Consider using (i) the Heaviside function and (ii) the rectangular function as
activation functions:
Heaviside[z] = { 0   z < 0
                 1   z ≥ 0 ,                                             (7.40)


Figure 7.9 Computational graph for problem 7.12 and problem 7.13. Adapted
from Domke (2010).

and

rect[z] = { 0   z < 0
            1   0 ≤ z ≤ 1
            0   z > 1 .                                                  (7.41)

Discuss why these functions are problematic for neural network training with gradient-based
optimization methods.

Problem 7.9∗ Consider a loss function ℓ[f ], where f = β + Ωh. We want to find how the loss ℓ
changes when we change Ω, which we’ll express with a matrix that contains the derivative
∂ℓ/∂Ωij at the ith row and j th column. Find an expression for ∂fi /∂Ωij and, using the chain
rule, show that:

∂ℓ/∂Ω = (∂ℓ/∂f) hT.                                                      (7.42)

Problem 7.10∗ Derive the equations for the backward pass of the backpropagation algorithm
for a network that uses leaky ReLU activations, which are defined as:

a[z] = { α · z   z < 0
         z       z ≥ 0 ,                                                 (7.43)

where α is a small positive constant (typically 0.1).

Problem 7.11 Consider training a network with fifty layers, where we only have enough memory
to store the pre-activations at every tenth hidden layer during the forward pass. Explain how
to compute the derivatives in this situation using gradient checkpointing.

Problem 7.12∗ This problem explores computing derivatives on general acyclic computational
graphs. Consider the function:

y = exp[exp[x] + exp[x]²] + sin[exp[x] + exp[x]²].                       (7.44)

We can break this down into a series of intermediate computations so that:


f1 = exp[x]
f2 = f1²
f3 = f1 + f2
f4 = exp[f3]
f5 = sin[f3]
y = f4 + f5.                                                             (7.45)

The associated computational graph is depicted in figure 7.9. Compute the derivative ∂y/∂x
by reverse-mode differentiation. In other words, compute in order:

∂y/∂f5,  ∂y/∂f4,  ∂y/∂f3,  ∂y/∂f2,  ∂y/∂f1,  and  ∂y/∂x,                 (7.46)
using the chain rule in each case to make use of the derivatives already computed.

Problem 7.13∗ For the same function as in problem 7.12, compute the derivative ∂y/∂x by
forward-mode differentiation. In other words, compute in order:

∂f1/∂x,  ∂f2/∂x,  ∂f3/∂x,  ∂f4/∂x,  ∂f5/∂x,  and  ∂y/∂x,                 (7.47)
using the chain rule in each case to make use of the derivatives already computed. Why do
we not use forward-mode differentiation when we calculate the parameter gradients for deep
networks?

Problem 7.14 Consider a random variable a with variance Var[a] = σ 2 and a symmetrical
distribution around the mean E[a] = 0. Prove that if we pass this variable through the ReLU
function:
b = ReLU[a] = { 0   a < 0
                a   a ≥ 0 ,                                              (7.48)

then the second moment of the transformed variable is E[b²] = σ²/2.

Problem 7.15 What would you expect to happen if we initialized all of the weights and biases
in the network to zero?

Problem 7.16 Implement the code in figure 7.8 in PyTorch and plot the training loss as a
function of the number of epochs.

Problem 7.17 Change the code in figure 7.8 to tackle a binary classification problem. You will
need to (i) change the targets y so they are binary, (ii) change the network to predict numbers
between zero and one, and (iii) change the loss function appropriately.



Chapter 8

Measuring performance

Previous chapters described neural network models, loss functions, and training algo-
rithms. This chapter considers how to measure the performance of the trained models.
With sufficient capacity (i.e., number of hidden units), a neural network model will often
perform perfectly on the training data. However, this does not necessarily mean it will
generalize well to new test data.
We will see that the test errors have three distinct causes and that their relative
contributions depend on (i) the inherent uncertainty in the task, (ii) the amount of
training data, and (iii) the choice of model. The latter dependency raises the issue of
hyperparameter search. We discuss how to select both the model hyperparameters (e.g.,
the number of hidden layers and the number of hidden units in each) and the learning
algorithm hyperparameters (e.g., the learning rate and batch size).

8.1 Training a simple model

We explore model performance using the MNIST-1D dataset (figure 8.1). This con-
sists of ten classes y ∈ {0, 1, . . . , 9}, representing the digits 0–9. The data are derived
from 1D templates for each of the digits. Each data example x is created by randomly
transforming one of these templates and adding noise. The full training dataset {xi , yi }
consists of I = 4000 training examples, each consisting of Di = 40 dimensions representing
the horizontal offset at 40 positions. The ten classes are drawn uniformly during data
generation, so there are ∼ 400 examples of each class.
We use a network with Di = 40 inputs and Do = 10 outputs which are passed through
a softmax function to produce class probabilities (see section 5.5). The network has two
hidden layers with D = 100 hidden units each. It is trained using stochastic gradient
descent with batch size 100 and learning rate 0.1 for 6000 steps (150 epochs) with a
multiclass cross-entropy loss (equation 5.24). Figure 8.2 shows that the training error
decreases as training proceeds. The training data are classified perfectly after about
4000 steps (problem 8.1). The training loss also decreases, eventually approaching zero.
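
A sketch of this network and training setup (random tensors stand in for the real MNIST-1D data, which is omitted here to keep the example short):

import torch, torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

D_i, D, D_o = 40, 100, 10
model = nn.Sequential(nn.Linear(D_i, D), nn.ReLU(),
                      nn.Linear(D, D), nn.ReLU(),
                      nn.Linear(D, D_o))             # outputs logits; softmax is inside the loss
criterion = nn.CrossEntropyLoss()                    # multiclass cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4000, D_i)                           # stand-in for the 4000 MNIST-1D examples
y = torch.randint(0, D_o, (4000,))                   # stand-in for the class labels
loader = DataLoader(TensorDataset(x, y), batch_size=100, shuffle=True)

for epoch in range(150):                             # 150 epochs of 40 batches = 6000 steps
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = criterion(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()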
However, this doesn’t imply that the classifier is perfect; the model might have mem-


Figure 8.1 MNIST-1D. a) Templates for 10 classes y ∈ {0, . . . , 9}, based on digits
0–9. b) Training examples x are created by randomly transforming a template
and c) adding noise. d) The horizontal offset of the transformed template is then
sampled at 40 vertical positions. Adapted from (Greydanus, 2020)

Figure 8.2 MNIST-1D results. a) Percent classification error as a function of the
training step. The training set errors decrease to zero, but the test errors do not
drop below ∼ 40%. This model doesn’t generalize well to new test data. b) Loss
as a function of the training step. The training loss decreases steadily toward
zero. The test loss decreases at first but subsequently increases as the model
becomes increasingly confident about its (wrong) predictions.


Figure 8.3 Regression function. Solid
black line shows ground truth function.
To generate I training examples {xi , yi },
the input space x ∈ [0, 1] is divided
into I equal segments and one sample xi
is drawn from a uniform distribution
within each segment. The correspond-
ing value yi is created by evaluating the
function at xi and adding Gaussian noise
(gray region shows ±2 standard devia-
tions). The test data are generated in
the same way.

orized the training set but be unable to predict new examples. To estimate the true
performance, we need a separate test set of input/output pairs {xi , yi }. To this end, we
generate 1000 more examples using the same process. Figure 8.2a also shows the errors
for this test data as a function of the training step. These decrease as training proceeds,
but only to around 40%. This is better than the chance error rate of 90% but far worse
than for the training set; the model has not generalized well to the test data.
The test loss (figure 8.2b) decreases for the first 1500 training steps but then increases
again (see notebook 8.1 on MNIST-1D performance). At this point, the test error rate is fairly
constant; the model makes the same mistakes but with increasing confidence. This decreases
the probability of the correct
answers and thus increases the negative log-likelihood. This increasing confidence is a
side-effect of the softmax function; the pre-softmax activations are driven to increasingly
extreme values to make the probability of the training data approach one (see figure 5.10).

8.2 Sources of error

We now consider the sources of the errors that occur when a model fails to generalize. To
make this easier to visualize, we revert to a 1D least squares regression problem where
we know exactly how the ground truth data were generated. Figure 8.3 shows a quasi-
sinusoidal function; both training and test data are generated by sampling input values
in the range [0, 1], passing them through this function, and adding Gaussian noise with
a fixed variance.
We fit a simplified shallow neural net to this data (figure 8.4). The weights and biases
that connect the input layer to the hidden layer are chosen so that the “joints” of the
function are evenly spaced across the interval. If there are D hidden units, then these
joints will be at 0, 1/D, 2/D, . . . , (D − 1)/D. This model can represent any piecewise
linear function with D equally sized regions in the range [0, 1]. As well as being easy to
understand, this model also has the advantage that it can be fit in closed form without
the need for stochastic optimization algorithms (see problems 8.2–8.3). Consequently, we are
guaranteed to find the global minimum of the loss function during training.


Figure 8.4 Simplified neural network with three hidden units. a) The weights and
biases between the input and hidden layer are fixed (dashed arrows). b–d) They
are chosen so that the hidden unit activations have slope one, and their joints are
equally spaced across the interval, with joints at x = 0, x = 1/3, and x = 2/3,
respectively. Modifying the remaining parameters ϕ = {β, ω1 , ω2 , ω3 } can create
any piecewise linear function over x ∈ [0, 1] with joints at 1/3 and 2/3. e–g)
Three example functions with different values of the parameters ϕ.


Figure 8.5 Sources of test error. a) Noise. Data generation is noisy, so even if the
model exactly replicates the true underlying function (black line), the noise in the
test data (gray points) means that some error will remain (gray region represents
two standard deviations). b) Bias. Even with the best possible parameters, the
three-region model (cyan line) cannot exactly fit the true function (black line).
This bias is another source of error (gray regions represent signed error). c)
Variance. In practice, we have limited noisy training data (orange points). When
we fit the model, we don’t recover the best possible function from panel (b) but
a slightly different function (cyan line) that reflects idiosyncrasies of the training
data. This provides an additional source of error (gray region represents two
standard deviations). Figure 8.6 shows how this region was calculated.

8.2.1 Noise, bias, and variance

There are three possible sources of error, which are known as noise, bias, and variance
respectively (figure 8.5):

Noise The data generation process includes the addition of noise, so there are multiple
possible valid outputs y for each input x (figure 8.5a). This source of error is insurmount-
able for the test data. Note that it does not necessarily limit the training performance;
we will likely never see the same input x twice during training, so it is still possible to
fit the training data perfectly.
Noise may arise because there is a genuine stochastic element to the data generation
process, because some of the data are mislabeled, or because there are further explanatory
variables that were not observed. In rare cases, noise may be absent; for example,
a network might approximate a function that is deterministic but requires significant
computation to evaluate. However, noise is usually a fundamental limitation on the
possible test performance.

Bias A second potential source of error may occur because the model is not flexible
enough to fit the true function perfectly. For example, the three-region neural network
model cannot exactly describe the quasi-sinusoidal function, even when the parameters
are chosen optimally (figure 8.5b). This is known as bias.


Variance We have limited training examples, and there is no way to distinguish sys-
tematic changes in the underlying function from noise in the underlying data. When
we fit a model, we do not get the closest possible approximation to the true underly-
ing function. Indeed, for different training datasets, the result will be slightly different
each time. This additional source of variability in the fitted function is termed variance
(figure 8.5c). In practice, there might also be additional variance due to the stochastic
learning algorithm, which does not necessarily converge to the same solution each time.

8.2.2 Mathematical formulation of test error

We now make the notions of noise, bias, and variance mathematically precise. Consider
a 1D regression problem where the data generation process has additive noise with vari-
ance σ² (e.g., figure 8.3); we can observe different outputs y for the same input x, so for
each x, there is a distribution Pr(y|x) with expected value (mean) µ[x] (see appendix C.2 on
expectation):

µ[x] = Ey[y[x]] = ∫ y[x] Pr(y|x) dy,                                     (8.1)

and fixed noise σ² = Ey[(µ[x] − y[x])²]. Here we have used the notation y[x] to specify
that we are considering the output y at a given input position x.
Now consider a least squares loss between the model prediction f[x, ϕ] at position x
and the observed value y[x] at that position:

L[x] = (f[x, ϕ] − y[x])²                                                 (8.2)
     = ((f[x, ϕ] − µ[x]) + (µ[x] − y[x]))²
     = (f[x, ϕ] − µ[x])² + 2(f[x, ϕ] − µ[x])(µ[x] − y[x]) + (µ[x] − y[x])²,

where we have both added and subtracted the mean µ[x] of the underlying function in
the second line and have expanded out the squared term in the third line.
The underlying function is stochastic, so this loss depends on the particular y[x] we
observe. The expected loss is:

Ey[L[x]] = Ey[(f[x, ϕ] − µ[x])² + 2(f[x, ϕ] − µ[x])(µ[x] − y[x]) + (µ[x] − y[x])²]
         = (f[x, ϕ] − µ[x])² + 2(f[x, ϕ] − µ[x])(µ[x] − Ey[y[x]]) + Ey[(µ[x] − y[x])²]
         = (f[x, ϕ] − µ[x])² + 2(f[x, ϕ] − µ[x]) · 0 + Ey[(µ[x] − y[x])²]
         = (f[x, ϕ] − µ[x])² + σ²,                                       (8.3)

where we have made use of the rules for manipulating expectations (see appendix C.2.1 on
expectation rules). In the second line, we have distributed the expectation operator and
removed it from terms with no dependence
on y[x], and in the third line, we note that the second term is zero since Ey [y[x]] = µ[x]
by definition. Finally, in the fourth line, we have substituted in the definition of the


noise σ 2 . We can see that the expected loss has been broken down into two terms; the
first term is the squared deviation between the model and the true function mean, and
the second term is the noise.
The first term can be further partitioned into bias and variance. The parameters ϕ of
the model f[x, ϕ] depend on the training dataset D = {xi , yi }, so more properly, we should
write f [x, ϕ[D]]. The training dataset is a random sample from the data generation
process; with a different sample of training data, we would learn different parameter
values. The expected model output fµ [x] with respect to all possible datasets D is hence:
fµ[x] = ED[f[x, ϕ[D]]].                                                  (8.4)
Returning to the first term of equation 8.3, we add and subtract fµ [x] and expand:

(f[x, ϕ[D]] − µ[x])²                                                     (8.5)
   = ((f[x, ϕ[D]] − fµ[x]) + (fµ[x] − µ[x]))²
   = (f[x, ϕ[D]] − fµ[x])² + 2(f[x, ϕ[D]] − fµ[x])(fµ[x] − µ[x]) + (fµ[x] − µ[x])².
We then take the expectation with respect to the training dataset D:
ED[(f[x, ϕ[D]] − µ[x])²] = ED[(f[x, ϕ[D]] − fµ[x])²] + (fµ[x] − µ[x])²,  (8.6)

where we have simplified using similar steps as for equation 8.3. Finally, we substitute
this result into equation 8.3:
ED[Ey[L[x]]] = ED[(f[x, ϕ[D]] − fµ[x])²] + (fµ[x] − µ[x])² + σ².         (8.7)
                    (variance)               (bias)          (noise)
This equation says that the expected loss after considering the uncertainty in the training
data D and the test data y consists of three additive components. The variance is
uncertainty in the fitted model due to the particular training dataset we sample. The bias
is the systematic deviation of the model from the mean of the function we are modeling.
The noise is the inherent uncertainty in the true mapping from input to output. These
three sources of error will be present for any task. They combine additively for regression
tasks with a least squares loss. However, their interaction can be more complex for other
types of problems.
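
The decomposition in equation 8.7 can be checked numerically. The sketch below uses a cubic polynomial fitted by least squares as a stand-in for the piecewise linear network; the ground-truth function, noise level, and dataset sizes are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)           # stand-in ground-truth function mu[x]
sigma, n_train, n_datasets = 0.3, 15, 500
x_test = np.linspace(0, 1, 200)

preds = []
for _ in range(n_datasets):                        # many independent training sets D
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + sigma * rng.normal(size=n_train)
    coeffs = np.polyfit(x, y, deg=3)               # least squares fit of the model
    preds.append(np.polyval(coeffs, x_test))
preds = np.array(preds)

variance = preds.var(axis=0).mean()                # E_D[(f - f_mu)^2], averaged over x
bias = ((preds.mean(axis=0) - true_f(x_test)) ** 2).mean()
noise = sigma ** 2
print(variance, bias, noise)                       # the three additive error components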

8.3 Reducing error

In the previous section, we saw that test error results from three sources: noise, bias,
and variance. The noise component is insurmountable; there is nothing we can do to
circumvent this, and it represents a fundamental limit on expected model performance.
However, it is possible to reduce the other two terms.


8.3.1 Reducing variance

Recall that the variance results from limited noisy training data. Fitting the model
to two different training sets results in slightly different parameters. It follows we can
reduce the variance by increasing the quantity of training data. This averages out the
inherent noise and ensures that the input space is well sampled.
Figure 8.6 shows the effect of training with 6, 10, and 100 samples. For each dataset
size, we show the best-fitting model for three training datasets. With only six samples,
the fitted function is quite different each time: the variance is significant. As we increase
the number of samples, the fitted models become very similar, and the variance reduces.
In general, adding training data almost always improves test performance.

8.3.2 Reducing bias

The bias term results from the inability of the model to describe the true underlying
function. This suggests that we can reduce this error by making the model more flexible.
This is usually done by increasing the model capacity. For neural networks, this means
adding more hidden units and/or hidden layers.
In the simplified model, adding capacity corresponds to adding more hidden units
so that the interval [0, 1] is divided into more linear regions. Figures 8.7a–c show that
(unsurprisingly) this does indeed reduce the bias; as we increase the number of linear
regions to ten, the model becomes flexible enough to fit the true function closely.

8.3.3 Bias-variance trade-off

However, figures 8.7d–f show an unexpected side-effect of increasing the model capacity.
For a fixed-size training dataset, the variance term typically increases as the model
capacity increases. Consequently, increasing the model capacity does not necessarily
reduce the test error. This is known as the bias-variance trade-off.
Figure 8.8 explores this phenomenon. In panels a–c), we fit the simplified three-region
model to three different datasets of fifteen points. Although the datasets differ, the final
model is much the same; the noise in the dataset roughly averages out in each linear
region. In panels d–f), we fit a model with ten regions to the same three datasets. This
model has more flexibility, but this is disadvantageous; the model certainly fits the data
better, and the training error will be lower, but much of the extra descriptive power is
devoted to modeling the noise. This phenomenon is known as overfitting.
We’ve seen that as we add capacity to the model, the bias decreases, but the variance
increases for a fixed-size training dataset. This suggests that there is an optimal capacity
where the bias is not too large and the variance is still relatively small. Figure 8.9 shows
how these terms vary numerically for the toy model as we increase the capacity, using
the data from figure 8.8 (see notebook 8.2 on the bias-variance trade-off). For regression
models, the total expected error is the sum of the bias and the variance, and this sum is
minimized when the model capacity is four (i.e., with four hidden units and four linear
regions in the range of the data).


Figure 8.6 Reducing variance by increasing training data. a–c) The three-region
model fitted to three different randomly sampled datasets of six points. The
fitted model is quite different each time. d) We repeat this experiment many
times and plot the mean model predictions (cyan line) and the variance of the
model predictions (gray area shows two standard deviations). e–h) We do the
same experiment, but this time with datasets of size ten. The variance of the
predictions is reduced. i–l) We repeat this experiment with datasets of size 100.
Now the fitted model is always similar, and the variance is small.


Figure 8.7 Bias and variance as a function of model capacity. a–c) As we in-
crease the number of hidden units of the toy model, the number of linear regions
increases, and the model becomes able to fit the true function closely; the bias
(gray region) decreases. d–f) Unfortunately, increasing the model capacity has
the side-effect of increasing the variance term (gray region). This is known as the
bias-variance trade-off.

8.4 Double descent

In the previous section, we examined the bias-variance trade-off as we increased the
capacity of a model. Let's now return to the MNIST-1D dataset and see whether this
happens in practice. We use 10,000 training examples, test with another 5,000 examples
and examine the training and test performance as we increase the capacity (number of
parameters) in the model. We train the model with Adam and a step size of 0.005 using
a full batch of 10,000 examples for 4000 steps.
Figure 8.10a shows the training and test error for a neural network with two hid-
den layers as the number of hidden units increases. The training error decreases as the
capacity grows and quickly becomes close to zero. The vertical dashed line represents
the capacity where the model has the same number of parameters as there are training
examples, but the model memorizes the dataset before this point. The test error de-
creases as we add model capacity but does not increase as predicted by the bias-variance
trade-off curve; it keeps decreasing.
In figure 8.10b, we repeat this experiment, but this time, we randomize 15% of the


Figure 8.8 Overfitting. a–c) A model with three regions is fit to three different
datasets of fifteen points each. The result is similar in all three cases (i.e., the
variance is low). d–f) A model with ten regions is fit to the same datasets. The
additional flexibility does not necessarily produce better predictions. While these
three models each describe the training data better, they are not necessarily closer
to the true underlying function (black curve). Instead, they overfit the data and
describe the noise, and the variance (difference between fitted curves) is larger.

Figure 8.9 Bias-variance trade-off. The
bias and variance terms from equa-
tion 8.7 are plotted as a function of
the model capacity (number of hidden
units / linear regions in range of data)
in the simplified model using training
data from figure 8.8. As the capacity
increases, the bias (solid orange line) de-
creases, but the variance (solid cyan line)
increases. The sum of these two terms
(dashed gray line) is minimized when the
capacity is four.


training labels. Once more, the training error decreases to zero. This time, there is
more randomness, and the model requires almost as many parameters as there are data
points to memorize the data. The test error does show the typical bias-variance trade-off
as we increase the capacity to the point where the model fits the training data exactly.
However, then it does something unexpected; it starts to decrease again. Indeed, if we
add enough capacity, the test loss reduces to below the minimal level that we achieved
in the first part of the curve.
This phenomenon is known as double descent. For some datasets like MNIST, it is
present with the original data (figure 8.10c). For others, like MNIST-1D and CIFAR-100
(figure 8.10d), it emerges or becomes more prominent when we add noise to the labels.
The first part of the curve is referred to as the classical or under-parameterized regime,
and the second part as the modern or over-parameterized regime (see notebook 8.3 on double
descent). The central part where the error increases is termed the critical regime.

8.4.1 Explanation

The discovery of double descent is recent, unexpected, and somewhat puzzling. It results
from an interaction of two phenomena. First, the test performance becomes temporarily
worse when the model has just enough capacity to memorize the data. Second, the test
performance continues to improve with capacity even when this exceeds the point where
the training data are all classified correctly. The first phenomenon is exactly as predicted
by the bias-variance trade-off. The second phenomenon is more confusing; it’s unclear
why performance should be better in the over-parameterized regime, given that there are
now not even enough training data points to constrain the model parameters uniquely.
To understand why performance continues to improve as we add more parameters,
note that once the model has enough capacity to drive the training loss to near zero,
the model fits the training data almost perfectly. This implies that further capacity
cannot help the model fit the training data any better (problems 8.4–8.5); any change must occur between
the training points. The tendency of a model to prioritize one solution over another as
it extrapolates between data points is known as its inductive bias.
The model’s behavior between data points is critical because, in high-dimensional
space, the training data are extremely sparse. The MNIST-1D dataset has 40 dimensions,
and we trained with 10,000 examples. If this seems like plenty of data, consider what
would happen if we quantized each input dimension into 10 bins. There would be 10⁴⁰
bins in total, constrained by only 10⁴ examples. Even with this coarse quantization,
there will only be one data point in every 10³⁶ bins! The tendency of the volume of
high-dimensional space to overwhelm the number of training points is termed the curse
of dimensionality.
The implication is that problems in high dimensions might look more like figure 8.11a;
there are small regions of the input space where we observe data with significant gaps
between them. The putative explanation for double descent is that as we add capacity
to the model, it interpolates between the nearest data points increasingly smoothly. In
the absence of information about what happens between the training points, assuming
smoothness is sensible and will probably generalize reasonably to new data.


Figure 8.10 Double descent. a) Training and test error on MNIST-1D for a
two-hidden layer network as we increase the number of hidden units (and hence
parameters) in each layer. The training error decreases to zero as the number of
parameters approaches the number of training examples (vertical dashed line).
The test error does not show the expected bias-variance trade-off but continues
to decrease even after the model has memorized the dataset. b) The same exper-
iment is repeated with noisier training data. Again, the training error reduces
to zero, although it now takes almost as many parameters as training points to
memorize the dataset. The test error shows the predicted bias/variance trade-off;
it decreases as the capacity increases but then increases again as we near the
point where the training data is exactly memorized. However, it subsequently
decreases again and ultimately reaches a better performance level. This is known
as double descent. Depending on the loss function, the model, and the amount
of noise in the data, the double descent pattern can be seen to a greater or lesser
degree across many datasets. c) Results on MNIST (without label noise) with
shallow neural network from Belkin et al. (2019). d) Results on CIFAR-100 with
ResNet18 network (see chapter 11) from Nakkiran et al. (2021). See original
papers for details.


Figure 8.11 Increasing capacity (hidden units) allows smoother interpolation be-
tween sparse data points. a) Consider this situation where the training data
(orange circles) are sparse; there is a large region in the center with no data ex-
amples to constrain the model to mimic the true function (black curve). b) If we
fit a model with just enough capacity to fit the training data (cyan curve), then it
has to contort itself to pass through the training data, and the output predictions
will not be smooth. c–f) However, as we add more hidden units, the model has
the ability to interpolate between the points more smoothly (smoothest possible
curve plotted in each case). However, unlike in this figure, it is not obliged to.

This argument is plausible. It’s certainly true that as we add more capacity to the
model, it will have the capability to create smoother functions. Figures 8.11b–f show the
smoothest possible functions that still pass through the data points as we increase the
number of hidden units. When the number of parameters is very close to the number
of training data examples (figure 8.11b), the model is forced to contort itself to fit the
training data exactly, resulting in erratic predictions. This explains why the peak in the
double descent curve is so pronounced. As we add more hidden units, the model has the
ability to construct smoother functions that are likely to generalize better to new data.
However, this does not explain why over-parameterized models should produce smooth
functions. Figure 8.12 shows three functions that can be created by the simplified model
with 50 hidden units. In each case, the model fits the data exactly, so the loss is zero. If
the modern regime of double descent is explained by increasing smoothness, then what
exactly is encouraging this smoothness?


Figure 8.12 Regularization. a–c) Each of the three fitted curves passes through
the data points exactly, so the training loss for each is zero. However, we might
expect the smooth curve in panel (a) to generalize much better to new data than
the erratic curves in panels (b) and (c). Any factor that biases a model toward
a subset of the solutions with a similar training loss is known as a regularizer.
It is thought that the initialization and/or fitting of neural networks have an
implicit regularizing effect. Consequently, in the over-parameterized regime, more
reasonable solutions, such as that in panel (a), are encouraged.

The answer to this question is uncertain, but there are two likely possibilities. First,
the network initialization may encourage smoothness, and the model never departs from
the sub-domain of smooth functions during the training process. Second, the training
algorithm may somehow “prefer” to converge to smooth functions. Any factor that
biases a solution toward a subset of equivalent solutions is known as a regularizer, so one
possibility is that the training algorithm acts as an implicit regularizer (see section 9.2).

8.5 Choosing hyperparameters

In the previous section, we discussed how test performance changes with model capac-
ity. Unfortunately, in the classical regime, we don’t have access to either the bias (which
requires knowledge of the true underlying function) or the variance (which requires mul-
tiple independently sampled datasets to estimate). In the modern regime, there is no
way to tell how much capacity should be added before the test error stops improving.
This raises the question of exactly how we should choose model capacity in practice.
For a deep network, the model capacity depends on the numbers of hidden layers
and hidden units per layer as well as other aspects of architecture that we have yet to
introduce. Furthermore, the choice of learning algorithm and any associated parameters
(learning rate, etc.) also affects the test performance. These elements are collectively
termed hyperparameters. The process of finding the best hyperparameters is termed
hyperparameter search or (when focused on network structure) neural architecture search.


Hyperparameters are typically chosen empirically; we train many models with differ-
ent hyperparameters on the same training set, measure their performance, and retain the
best model. However, we do not measure their performance on the test set; this would
admit the possibility that these hyperparameters just happen to work well for the test
set but don’t generalize to further data. Instead, we introduce a third dataset known
as a validation set. For every choice of hyperparameters, we train the associated model
using the training set and evaluate performance on the validation set. Finally, we select
the model that worked best on the validation set and measure its performance on the
test set. In principle, this should give a reasonable estimate of the true performance.
The hyperparameter space is generally smaller than the parameter space but still
too large to try every combination exhaustively. Unfortunately, many hyperparameters
are discrete (e.g., the number of hidden layers), and others may be conditional on one
another (e.g., we only need to specify the number of hidden units in the tenth hidden
layer if there are ten or more layers). Hence, we cannot rely on gradient descent methods
as we did for learning the model parameters. Hyperparameter optimization algorithms
intelligently sample the space of hyperparameters, contingent on previous results. This
procedure is computationally expensive since we must train an entire model and measure
the validation performance for each combination of hyperparameters.
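
A minimal sketch of this train/validate/select procedure (the candidate grid, model builder, and random stand-in data are all illustrative placeholders):

import torch, torch.nn as nn

x = torch.randn(500, 40)                       # stand-in inputs
y = torch.randint(0, 10, (500,))               # stand-in labels
x_train, y_train, x_val, y_val = x[:400], y[:400], x[400:], y[400:]

def fit_and_score(n_hidden, lr):
    # train a candidate model on the training set, then score it on the validation set
    model = nn.Sequential(nn.Linear(40, n_hidden), nn.ReLU(), nn.Linear(n_hidden, 10))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(100):                       # abbreviated full-batch training loop
        opt.zero_grad()
        loss_fn(model(x_train), y_train).backward()
        opt.step()
    with torch.no_grad():
        return loss_fn(model(x_val), y_val).item()

configs = [(h, lr) for h in (50, 100, 200) for lr in (0.01, 0.1)]
best = min(configs, key=lambda cfg: fit_and_score(*cfg))
print(best)                                    # hyperparameters with lowest validation loss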

8.6 Summary

To measure performance, we use a separate test set. The degree to which performance is
maintained on this test set is known as generalization. Test errors can be explained by
three factors: noise, bias, and variance. These combine additively in regression problems
with least squares losses. Adding training data decreases the variance. When the model
capacity is less than the number of training examples, increasing the capacity decreases
bias but increases variance. This is known as the bias-variance trade-off, and there is a
capacity where the trade-off is optimal.
However, this is balanced against a tendency for performance to improve with ca-
pacity, even when the parameters exceed the training examples. Together, these two
phenomena create the double descent curve. It is thought that the model interpolates
more smoothly between the training data points in the over-parameterized “modern
regime,” although it is unclear what drives this. To choose the capacity and other model
and training algorithm hyperparameters, we fit multiple models and evaluate their per-
formance using a separate validation set.

Notes
Bias-variance trade-off: We showed that the test error for regression problems with least
squares loss decomposes into the sum of noise, bias, and variance terms. These factors are
all present for models with other losses, but their interaction is typically more complicated
(Friedman, 1997; Domingos, 2000). For classification problems, there are some counter-intuitive


predictions; for example, if the model is biased toward selecting the wrong class in a region of
the input space, then increasing the variance can improve the classification rate as this pushes
some of the predictions over the threshold to be classified correctly.

Cross-validation: We saw that it is typical to divide the data into three parts: training
data (which is used to learn the model parameters), validation data (which is used to choose
the hyperparameters), and test data (which is used to estimate the final performance). This
approach is known as cross-validation. However, this division may cause problems where the
total number of data examples is limited; if the number of training examples is comparable to
the model capacity, then the variance will be large.
One way to mitigate this problem is to use k-fold cross-validation. The training and validation
data are partitioned into K disjoint subsets. For example, we might divide these data into
five parts. We train with four and validate with the fifth for each of the five permutations
and choose the hyperparameters based on the average validation performance. The final test
performance is assessed using the average of the predictions from the five models with the best
hyperparameters on an entirely different test set. There are many variations of this idea, but
all share the general goal of using a larger proportion of the data to train the model, thereby
reducing variance.
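
A sketch of the k-fold split itself (plain index shuffling; libraries such as scikit-learn provide equivalent utilities, and the scoring function named in the comment is hypothetical):

import numpy as np

def k_fold_indices(n_examples, k=5, seed=0):
    # shuffle the example indices and cut them into k disjoint folds
    idx = np.random.default_rng(seed).permutation(n_examples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# hypothetical usage: average the validation score over the k permutations
# scores = [fit_and_score_on(train_idx, val_idx) for train_idx, val_idx in k_fold_indices(1000)]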

Capacity: We have used the term capacity informally to mean the number of parameters or
hidden units in the model (and hence indirectly, the ability of the model to fit functions of
increasing complexity). The representational capacity of a model describes the space of possible
functions it can construct when we consider all possible parameter values. When we take into
account the fact that an optimization algorithm may not be able to reach all of these solutions,
what is left is the effective capacity.
The Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1971) is a more formal
measure of capacity. It is the largest number of training examples that a binary classifier can
label arbitrarily. Bartlett et al. (2019) derive upper and lower bounds for the VC dimension in
terms of the number of layers and weights. An alternative measure of capacity is the Rademacher
complexity, which is the expected empirical performance of a classification model (with optimal
parameters) for data with random labels. Neyshabur et al. (2017) derive a lower bound on the
generalization error in terms of the Rademacher complexity.

Double descent: The term “double descent” was coined by Belkin et al. (2019), who demon-
strated that the test error decreases again in the over-parameterized regime for two-layer neural
networks and random features. They also claimed that this occurs in decision trees, although
Buschjäger & Morik (2021) subsequently provided evidence to the contrary. Nakkiran et al.
(2021) show that double descent occurs for various modern datasets (CIFAR-10, CIFAR-100,
IWSLT’14 de-en), architectures (CNNs, ResNets, transformers), and optimizers (SGD, Adam).
The phenomenon is more pronounced when noise is added to the target labels (Nakkiran et al.,
2021) and when some regularization techniques are used (Ishida et al., 2020).
Nakkiran et al. (2021) also provide empirical evidence that test performance depends on effective
model capacity (the largest number of samples for which a given model and training method can
achieve zero training error). At this point, the model starts to devote its efforts to interpolating
smoothly. As such, the test performance depends not just on the model but also on the training
algorithm and length of training. They observe the same pattern when they study a model with
fixed capacity and increase the number of training iterations. They term this epoch-wise double
descent. This phenomenon has been modeled by Pezeshki et al. (2022) in terms of different
features in the model being learned at different speeds.
Double descent makes the rather strange prediction that adding training data can sometimes
worsen test performance. Consider an over-parameterized model in the second descending part


of the curve. If we increase the training data to match the model capacity, we will now be in
the critical region of the new test error curve, and the test loss may increase.
Bubeck & Sellke (2021) prove that overparameterization is necessary to interpolate data smoothly
in high dimensions. They demonstrate a trade-off between the number of parameters and the
Lipschitz constant of a model (the fastest the output can change for a small input change;
see appendix B.1.1). A review of the theory of over-parameterized machine learning can be
found in Dar et al. (2021).

Curse of dimensionality: As dimensionality increases, the volume of space grows so fast that
the amount of data needed to densely sample it increases exponentially. This phenomenon is
known as the curse of dimensionality. High-dimensional space has many unexpected properties,
and caution should be used when trying to reason about it based on low-dimensional exam-
ples. This book visualizes many aspects of deep learning in one or two dimensions, but these
visualizations should be treated with healthy skepticism.
Surprising properties of high-dimensional spaces include (see problems 8.6–8.9 and notebook 8.4
on high-dimensional spaces): (i) Two randomly sampled data points from a standard normal
distribution are very close to orthogonal to one another (relative to the origin) with high
likelihood. (ii) The distance from the origin of samples from a standard normal distribution is
roughly constant. (iii) Most of the volume of a high-dimensional sphere (hypersphere) is adjacent
to its surface (a common metaphor is that most of the volume of a high-dimensional orange is in
the peel, not in the pulp). (iv) If we place a unit-diameter hypersphere inside a hypercube with
unit-length sides, then the hypersphere takes up a decreasing proportion of the volume of the cube
as the dimension increases. Since the volume of the cube is fixed at size one, this implies that
the volume of a high-dimensional hypersphere becomes close to zero. (v) For random points drawn
from a uniform distribution in a high-dimensional hypercube, the ratio of the Euclidean distance
between the nearest and furthest points becomes close to one.
For further information, consult Beyer et al. (1999) and Aggarwal et al. (2001).
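
Properties (i) and (ii) are easy to check numerically (the dimension and sample count below are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
D, n = 1000, 500
x = rng.normal(size=(n, D))                      # samples from a standard normal

norms = np.linalg.norm(x, axis=1)
print(norms.mean(), norms.std())                 # distances cluster near sqrt(D) with small spread

a, b = x[:n // 2], x[n // 2:]
cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
print(np.abs(cos).mean())                        # typical |cosine| is tiny: near-orthogonal pairs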

Real-world performance: In this chapter, we argued that model performance could be evalu-
ated using a held-out test set. However, the result won’t be indicative of real-world performance
if the statistics of the test set don’t match those of real-world data. Moreover, the statistics
of real-world data may change over time, causing the model to become increasingly stale and
performance to decrease. This is known as data drift and means that deployed models must be
carefully monitored.
There are three main reasons why real-world performance may be worse than the test perfor-
mance implies. First, the statistics of the input data x may change; we may now be observing
parts of the function that were sparsely sampled or not sampled at all during training. This
is known as covariate shift. Second, the statistics of the output data y may change; if some
output values are infrequent during training, then the model may learn not to predict these in
ambiguous situations and will make mistakes if they are more common in the real world. This
is known as prior shift. Third, the relationship between input and output may change. This is
known as concept shift. These issues are discussed in Moreno-Torres et al. (2012).

Hyperparameter search: Finding the best hyperparameters is a challenging optimization
task. Testing a single configuration of hyperparameters is expensive; we must train an entire
model and measure its performance. We have no easy way to access the derivatives (i.e., how
performance changes when we make a small change to a hyperparameter). Moreover, many of
the hyperparameters are discrete, so we cannot use gradient descent methods. There are multiple
local minima and no way to tell if we are close to the global minimum. The noise level is high
since each training/validation cycle uses a stochastic training algorithm; we expect different
results if we train a model twice with the same hyperparameters. Finally, some variables are
conditional and only exist if others are set. For example, the number of hidden units in the
third hidden layer is only relevant if we have at least three hidden layers.


A simple approach is to sample the space randomly (Bergstra & Bengio, 2012). However,
for continuous variables, it is better to build a model of performance as a function of the
hyperparameters and the uncertainty in this function. This can be exploited to test where the
uncertainty is great (explore the space) or home in on regions where performance looks promising
(exploit previous knowledge). Bayesian optimization is a framework based on Gaussian processes
that does just this, and its application to hyperparameter search is described in Snoek et al.
(2012). The Beta-Bernoulli bandit (see Lattimore & Szepesvári, 2020) is a roughly equivalent
model for describing uncertainty in results due to discrete variables.
The sequential model-based configuration (SMAC) algorithm (Hutter et al., 2011) can cope with
continuous, discrete, and conditional parameters. The basic approach is to use a random forest
to model the objective function where the mean of the tree predictions is the best guess about
the objective function, and their variance represents the uncertainty. A completely different
approach that can also cope with combinations of continuous, discrete, and conditional param-
eters is Tree-Parzen Estimators (Bergstra et al., 2011). The previous methods modeled the
probability of the model performance given the hyperparameters. In contrast, the Tree-Parzen
estimator models the probability of the hyperparameters given the model performance.
Hyperband (Li et al., 2017b) is a multi-armed bandit strategy for hyperparameter optimization.
It assumes that there are computationally cheap but approximate ways to measure performance
(e.g., by not training to completion) and that these can be associated with a budget (e.g., by
training for a fixed number of iterations). A number of random configurations are sampled and
run until the budget is used up. Then the best fraction η of runs is kept, and the budget is
multiplied by 1/η. This is repeated until the maximum budget is reached. This approach has
the advantage of efficiency; for bad configurations, it does not need to run the experiment to the
end. However, each sample is just chosen randomly, which is inefficient. The BOHB algorithm
(Falkner et al., 2018) combines the efficiency of Hyperband with the more sensible choice of
hyperparameters from Tree Parzen estimators to construct an even better method.

Problems
Problem 8.1 Will the multiclass cross-entropy training loss in figure 8.2 ever reach zero? Explain
your reasoning.

Problem 8.2 What values should we choose for the three weights and biases in the first layer of
the model in figure 8.4a so that the hidden unit’s responses are as depicted in figures 8.4b–d?

Problem 8.3∗ Given a training dataset consisting of I input/output pairs {xi , yi }, show how
the parameters {β, ω1 , ω2 , ω3 } for the model in figure 8.4a using the least squares loss function
can be found in closed form.

Problem 8.4 Consider the curve in figure 8.10b at the point where we train a model with a
hidden layer of size 200, which would have 50,410 parameters. What do you predict will happen
to the training and test performance if we increase the number of training examples from 10,000
to 50,410?

Problem 8.5 Consider the case where the model capacity exceeds the number of training data
points, and the model is flexible enough to reduce the training loss to zero. What are the
implications of this for fitting a heteroscedastic model? Propose a method to resolve any
problems that you identify.

Problem 8.6 Show that two random points drawn from a 1000-dimensional standard Gaussian
distribution are orthogonal relative to the origin with high probability.


Figure 8.13 Typical sets. a) Standard normal distribution in two dimensions. Circles are four samples from this distribution. As the distance from the center increases, the probability decreases, but the volume of space at that radius (i.e., the area between adjacent evenly spaced circles) increases. b) These factors trade off so that the histogram of distances of samples from the center has a pronounced peak. c) In higher dimensions, this effect becomes more extreme, and the probability of observing a sample close to the mean becomes vanishingly small. Although the most likely point is at the mean of the distribution, the typical samples are found in a relatively narrow shell.

Problem 8.7 The volume of a hypersphere with radius r in D dimensions is:

Vol[r] = r^D π^{D/2} / Γ[D/2 + 1],   (8.8)

where Γ[•] is the Gamma function (Appendix B.1.3). Show using Stirling's formula (Appendix B.1.4) that the volume of a hypersphere of diameter one (radius r = 0.5) becomes zero as the dimension increases.

Problem 8.8∗ Consider a hypersphere of radius r = 1. Find an expression for the proportion of the total volume that lies in the outermost 1% of the distance from the center (i.e., in the outermost shell of thickness 0.01). Show that this becomes one as the dimension increases.

Problem 8.9 Figure 8.13c shows the distribution of distances of samples of a standard normal
distribution as the dimension increases. Empirically verify this finding by sampling from the
standard normal distributions in 25, 100, and 500 dimensions and plotting a histogram of the
distances from the center. What closed-form probability distribution describes these distances?



Chapter 9

Regularization

Chapter 8 described how to measure model performance and identified that there could
be a significant performance gap between the training and test data. Possible reasons for
this discrepancy include: (i) the model describes statistical peculiarities of the training
data that are not representative of the true mapping from input to output (overfitting),
and (ii) the model is unconstrained in areas with no training examples, leading to sub-
optimal predictions.
This chapter discusses regularization techniques. These are a family of methods that
reduce the generalization gap between training and test performance. Strictly speaking,
regularization involves adding explicit terms to the loss function that favor certain pa-
rameter choices. However, in machine learning, this term is commonly used to refer to
any strategy that improves generalization.
We start by considering regularization in its strictest sense. Then we show how
the stochastic gradient descent algorithm itself favors certain solutions. This is known
as implicit regularization. Following this, we consider a set of heuristic methods that
improve test performance. These include early stopping, ensembling, dropout, label
smoothing, and transfer learning.

9.1 Explicit regularization

Consider fitting a model f[x, ϕ] with parameters ϕ using a training set {x_i, y_i} of input/output pairs. We seek the parameters ϕ̂ that minimize the loss function L[ϕ]:

ϕ̂ = argmin_ϕ [ L[ϕ] ] = argmin_ϕ [ Σ_{i=1}^{I} ℓ_i[x_i, y_i] ],   (9.1)

where the individual terms ℓi [xi , yi ] measure the mismatch between the network pre-
dictions f[xi , ϕ] and output targets yi for each training pair. To bias this minimization


Figure 9.1 Explicit regularization. a) Loss function for Gabor model (see sec-
tion 6.1.2). Cyan circles represent local minima. Gray circle represents the global
minimum. b) The regularization term favors parameters close to the center of the
plot by adding an increasing penalty as we move away from this point. c) The
final loss function is the sum of the original loss function plus the regularization
term. This surface has fewer local minima, and the global minimum has moved
to a different position (arrow shows change).

toward certain solutions, we include an additional term:

ϕ̂ = argmin_ϕ [ Σ_{i=1}^{I} ℓ_i[x_i, y_i] + λ · g[ϕ] ],   (9.2)

where g[ϕ] is a function that returns a scalar which takes larger values when the pa-
rameters are less preferred. The term λ is a positive scalar that controls the relative
contribution of the original loss function and the regularization term. The minima of
the regularized loss function usually differ from those in the original, so the training
procedure converges to different parameter values (figure 9.1).

9.1.1 Probabilistic interpretation

Regularization can be viewed from a probabilistic perspective. Section 5.1 shows how
loss functions are constructed from the maximum likelihood criterion:

ϕ̂ = argmax_ϕ [ ∏_{i=1}^{I} Pr(y_i|x_i, ϕ) ].   (9.3)

The regularization term can be considered as a prior Pr(ϕ) that represents knowledge about the parameters before we observe the data, and we now have the maximum a posteriori or MAP criterion:

ϕ̂ = argmax_ϕ [ ∏_{i=1}^{I} Pr(y_i|x_i, ϕ) Pr(ϕ) ].   (9.4)

Moving back to the negative log-likelihood loss function by taking the log and multiplying by minus one, we see that λ · g[ϕ] = − log[Pr(ϕ)].

9.1.2 L2 regularization

This discussion has sidestepped the question of which solutions the regularization term
should penalize (or equivalently that the prior should favor). Since neural networks are
used in an extremely broad range of applications, these can only be very generic pref-
erences. The most commonly used regularization term is the L2 norm, which penalizes
the sum of the squares of the parameter values:

ϕ̂ = argmin_ϕ [ Σ_{i=1}^{I} ℓ_i[x_i, y_i] + λ Σ_j ϕ_j² ],   (9.5)

where j indexes the parameters. This is also referred to as Tikhonov regularization or ridge regression, or (when applied to matrices) Frobenius norm regularization (Problems 9.1–9.2).
For neural networks, L2 regularization is usually applied to the weights but not the biases and is hence referred to as a weight decay term. The effect is to encourage smaller weights, so the output function is smoother. To see this, consider that the output prediction is a weighted sum of the activations at the last hidden layer. If the weights have a smaller magnitude, the output will vary less. The same logic applies to the computation of the pre-activations at the last hidden layer and so on, progressing backward through the network. In the limit, if we forced all the weights to be zero, the network would produce a constant output determined by the final bias parameter (Notebook 9.1, L2 regularization).
Figure 9.2 shows the effect of fitting the simplified network from figure 8.4 with weight
decay and different values of the regularization coefficient λ. When λ is small, it has
little effect. However, as λ increases, the fit to the data becomes less accurate, and the
function becomes smoother. This might improve the test performance for two reasons:

• If the network is overfitting, then adding the regularization term means that the
network must trade off slavish adherence to the data against the desire to be
smooth. One way to think about this is that the error due to variance reduces (the
model no longer needs to pass through every data point) at the cost of increased
bias (the model can only describe smooth functions).
• When the network is over-parameterized, some of the extra model capacity de-
scribes areas with no training data. Here, the regularization term will favor func-
tions that smoothly interpolate between the nearby points. This is reasonable
behavior in the absence of knowledge about the true function.


Figure 9.2 L2 regularization in simplified network with 14 hidden units (see fig-
ure 8.4). a–f) Fitted functions as we increase the regularization coefficient λ. The
black curve is the true function, the orange circles are the noisy training data,
and the cyan curve is the fitted model. For small λ (panels a–b), the fitted func-
tion passes exactly through the data points. For intermediate λ (panels c–d), the
function is smoother and more similar to the ground truth. For large λ (panels
e–f), the regularization term overpowers the likelihood term, so the fitted function
is too smooth and the overall fit is worse.

9.2 Implicit regularization

An intriguing recent finding is that neither gradient descent nor stochastic gradient
descent moves neutrally to the minimum of the loss function; each exhibits a preference
for some solutions over others. This is known as implicit regularization.

9.2.1 Implicit regularization in gradient descent

Consider a continuous version of gradient descent where the step size is infinitesimal.
The change in parameters ϕ will be governed by the differential equation:

dϕ/dt = −∂L/∂ϕ.   (9.6)


Figure 9.3 Implicit regularization in gradient descent. a) Loss function with family
of global minima on horizontal line ϕ1 = 0.61. Dashed blue line shows continuous
gradient descent path starting in bottom-left. Cyan trajectory shows discrete
gradient descent with step size 0.1 (first few steps shown explicitly as arrows).
The finite step size causes the paths to diverge and reach a different final position.
b) This disparity can be approximated by adding a regularization term to the
continuous gradient descent loss function that penalizes the squared gradient
magnitude. c) After adding this term, the continuous gradient descent path
converges to the same place that the discrete one did on the original function.

Gradient descent approximates this process with a series of discrete steps of size α:

ϕ_{t+1} = ϕ_t − α ∂L[ϕ_t]/∂ϕ.   (9.7)

The discretization causes a deviation from the continuous path (figure 9.3).
This deviation can be understood by deriving a modified loss term L̃ for the continu-
ous case that arrives at the same place as the discretized version on the original loss L. It
can be shown (see notes “Implicit regularization in gradient descent” at end of chapter)
that this modified loss is:

L̃_GD[ϕ] = L[ϕ] + (α/4) ‖∂L/∂ϕ‖².   (9.8)

In other words, the discrete trajectory is repelled from places where the gradient norm
is large (the surface is steep). This doesn’t change the position of the minima where the
gradients are zero anyway. However, it changes the effective loss function elsewhere and
modifies the optimization trajectory, which potentially converges to a different minimum.
Implicit regularization due to gradient descent may be responsible for the observation
that full batch gradient descent generalizes better with larger step sizes (figure 9.5a).


9.2.2 Implicit regularization in stochastic gradient descent

A similar analysis can be applied to stochastic gradient descent. Now we seek a modified
loss function such that the continuous version reaches the same place as the average of
the possible random SGD updates. This can be shown to be:

L̃_SGD[ϕ] = L̃_GD[ϕ] + (α/4B) Σ_{b=1}^{B} ‖∂L_b/∂ϕ − ∂L/∂ϕ‖²
          = L[ϕ] + (α/4) ‖∂L/∂ϕ‖² + (α/4B) Σ_{b=1}^{B} ‖∂L_b/∂ϕ − ∂L/∂ϕ‖².   (9.9)

Here, Lb is the loss for the bth of the B batches in an epoch, and both L and Lb now
represent the means of the I individual losses in the full dataset and the |B| individual
losses in the batch, respectively:

L = (1/I) Σ_{i=1}^{I} ℓ_i[x_i, y_i]   and   L_b = (1/|B|) Σ_{i∈B_b} ℓ_i[x_i, y_i].   (9.10)

Equation 9.9 reveals an extra regularization term, which corresponds to the variance
of the gradients of the batch losses Lb . In other words, SGD implicitly favors places
where the gradients are stable (where all the batches agree on the slope). Once more, this
modifies the trajectory of the optimization process (figure 9.4) but does not necessarily
change the position of the global minimum; if the model is over-parameterized, then it
may fit all the training data exactly, so each of these gradient terms will be zero at the
global minimum.
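The extra term can be computed directly. The sketch below (a toy least-squares model in NumPy; the data, parameter values, learning rate, and batch split are arbitrary choices) evaluates the gradient-norm penalty of equation 9.8 and the batch-variance penalty of equation 9.9 at a fixed parameter value:

import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(100), rng.standard_normal(100)    # toy 1D regression data
phi = np.array([0.5, -0.3])                                  # parameters [intercept, slope]
alpha, B = 0.1, 10                                           # learning rate and number of batches

def grad(xb, yb, phi):
    # Gradient of the mean squared error (1/|B|) * sum (phi_0 + phi_1*x - y)^2
    r = phi[0] + phi[1] * xb - yb
    return np.array([2 * r.mean(), 2 * (r * xb).mean()])

full_grad = grad(x, y, phi)
batch_grads = [grad(xb, yb, phi) for xb, yb in zip(np.split(x, B), np.split(y, B))]

gd_term = (alpha / 4) * np.sum(full_grad ** 2)                # penalty from equation 9.8
sgd_term = (alpha / (4 * B)) * sum(np.sum((g - full_grad) ** 2) for g in batch_grads)
print(gd_term, sgd_term)                                      # the two implicit penalties in equation 9.9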
SGD generalizes better than gradient descent, and smaller batch sizes generally perform better than larger ones (figure 9.5b). One possible explanation is that the inherent randomness allows the algorithm to reach different parts of the loss function. However, it's also possible that some or all of this performance increase is due to implicit regularization; this encourages solutions where all the data fit well (so the batch variance is small) rather than solutions where some of the data fit extremely well and other data less well (perhaps with the same overall loss, but with larger batch variance). The former solutions are likely to generalize better (Notebook 9.2, Implicit regularization).

9.3 Heuristics to improve performance

We’ve seen that explicit regularization encourages the training algorithm to find a good
solution by adding extra terms to the loss function. This also occurs implicitly as an un-
intended (but seemingly helpful) byproduct of stochastic gradient descent. This section
describes other heuristic methods used to improve generalization.


Figure 9.4 Implicit regularization for stochastic gradient descent. a) Original loss
function for Gabor model (section 6.1.2). Blue point represents global minimum.
b) Implicit regularization term from gradient descent penalizes the squared gra-
dient magnitude. c) Additional implicit regularization from stochastic gradient
descent penalizes the variance of the batch gradients. d) Modified loss function
(sum of original loss plus two implicit regularization components). Blue point
represents global minimum which may now be in a different place from panel (a).


Figure 9.5 Effect of learning rate (LR) and batch size for 4000 training and
4000 test examples from MNIST-1D (see figure 8.1) for a neural network with
two hidden layers. a) Performance is better for large learning rates than for
intermediate or small ones. In each case, the number of iterations is 6000/LR, so
each solution has the opportunity to move the same distance. b) Performance is
superior for smaller batch sizes. In each case, the number of iterations was chosen
so that the training data were memorized at roughly the same model capacity.

9.3.1 Early stopping

Early stopping refers to stopping the training procedure before it has fully converged.
This can reduce overfitting if the model has already captured the coarse shape of the
underlying function but has not yet had time to overfit to the noise (figure 9.6). One
way of thinking about this is that since the weights are initialized to small values (see
section 7.5), they simply don’t have time to become large, so early stopping has a similar
effect to explicit L2 regularization. A different view is that early stopping reduces the
effective model complexity. Hence, we move back down the bias/variance trade-off curve
from the critical region, and performance improves (see figures 8.9 and 8.10).
Early stopping has a single hyperparameter, the number of steps after which learning
is terminated. As usual, this is chosen empirically using a validation set (section 8.5).
However, for early stopping, the hyperparameter can be selected without the need to
train multiple models. The model is trained once, the performance on the validation set
is monitored every T iterations, and the associated parameters are stored. The stored
parameters where the validation performance was best are selected.
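A minimal sketch of this procedure is shown below (it assumes a PyTorch-style model with state_dict()/load_state_dict(), and the functions sgd_step and compute_loss are hypothetical placeholders; the monitoring interval T is arbitrary):

import copy

def train_with_early_stopping(model, train_data, val_data, sgd_step, compute_loss,
                              max_iters=10000, T=100):
    best_val = float("inf")
    best_params = copy.deepcopy(model.state_dict())
    for it in range(max_iters):
        sgd_step(model, train_data)                    # one SGD update on a training batch
        if it % T == 0:                                # monitor the validation loss every T iterations
            val_loss = compute_loss(model, val_data)
            if val_loss < best_val:                    # store the best parameters seen so far
                best_val = val_loss
                best_params = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_params)                 # restore the stored parameters
    return model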

9.3.2 Ensembling

Another approach to reducing the generalization gap between training and test data is
to build several models and average their predictions. A group of such models is known


Figure 9.6 Early stopping. a) Simplified shallow network model with 14 linear
regions (figure 8.4) is initialized randomly (cyan curve) and trained with SGD
using a batch size of five and a learning rate of 0.05. b–d) As training proceeds,
the function first captures the coarse structure of the true function (black curve)
before e–f) overfitting to the noisy training data (orange points). Although the
training loss continues to decrease throughout this process, the learned models in
panels (c) and (d) are closest to the true underlying function. They will generalize
better on average to test data than those in panels (e) or (f).

as an ensemble. This technique reliably improves test performance at the cost of training
and storing multiple models and performing inference multiple times.
The models can be combined by taking the mean of the outputs (for regression
problems) or the mean of the pre-softmax activations (for classification problems). The
assumption is that model errors are independent and will cancel out. Alternatively,
we can take the median of the outputs (for regression problems) or the most frequent
predicted class (for classification problems) to make the predictions more robust.
One way to train different models is just to use different random initializations. This may help in regions of input space far from the training data. Here, the fitted function is relatively unconstrained, and different models may produce different predictions, so the average of several models may generalize better than any single model (Notebook 9.3, Ensembling).
A second approach is to generate several different datasets by re-sampling the train-
ing data with replacement and training a different model from each. This is known as
bootstrap aggregating or bagging for short (figure 9.7). It has the effect of smoothing
out the data; if a data point is not present in one training set, the model will interpo-


Figure 9.7 Ensemble methods. a) Fitting a single model (gray curve) to the
entire dataset (orange points). b–e) Four models created by re-sampling the data
with replacement (bagging) four times (size of orange point indicates number of
times the data point was re-sampled). f) When we average the predictions of this
ensemble, the result (cyan curve) is smoother than the result from panel (a) for
the full dataset (gray curve) and will probably generalize better.

late from nearby points; hence, if that point was an outlier, the fitted function will be
more moderate in this region. Other approaches include training models with different
hyperparameters or training completely different families of models.
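A sketch of bagging under these conventions (make_model and fit are hypothetical placeholders for constructing and training a single model; they are not library functions):

import numpy as np

def bagging_ensemble(x_train, y_train, make_model, fit, n_models=4, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x_train), size=len(x_train))   # resample with replacement
        models.append(fit(make_model(), x_train[idx], y_train[idx]))
    return models

def ensemble_predict(models, x):
    # Average the outputs of the ensemble (appropriate for a regression problem);
    # use the median or a majority vote for a more robust combination
    return np.mean([m(x) for m in models], axis=0)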

9.3.3 Dropout

Dropout clamps a random subset (typically 50%) of hidden units to zero at each iteration
of SGD (figure 9.8). This makes the network less dependent on any given hidden unit and
encourages the weights to have smaller magnitudes so that the change in the function
due to the presence or absence of any specific hidden unit is reduced.
This technique has the positive benefit that it can eliminate undesirable “kinks” in
the function that are far from the training data and don’t affect the loss. For example,
consider three hidden units that become active sequentially as we move along the curve
(figure 9.9a). The first hidden unit causes a large increase in the slope. A second hidden


Figure 9.8 Dropout. a) Original network. b–d) At each training iteration, a random subset of hidden units is clamped to zero (gray nodes). The result is that the incoming and outgoing weights from these units have no effect, so we are training with a slightly different network each time.

unit decreases the slope, so the function goes back down. Finally, the third unit cancels
out this decrease and returns the curve to its original trajectory. These three units
conspire to make an undesirable local change in the function. This will not change the
training loss but is unlikely to generalize well.
When several units conspire in this way, eliminating one (as would happen in dropout)
causes a considerable change to the output function in the half-space where that unit
was active (figure 9.9b). A subsequent gradient descent step will attempt to compensate
for the change that this induces, and such dependencies will be eliminated over time.
The overall effect is that large unnecessary changes between training data points are
gradually removed even though they contribute nothing to the loss (figure 9.9).
At test time, we can run the network as usual with all the hidden units active;
however, the network now has more hidden units than it was trained with at any given
iteration, so we multiply the weights by one minus the dropout probability to compensate.
This is known as the weight scaling inference rule. A different approach to inference is
to use Monte Carlo dropout, in which we run the network multiple times with different
random subsets of units clamped to zero (as in training) and combine the results. This
is closely related to ensembling in that every random version of the network is a different
model; however, we do not have to train or store multiple networks here.
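The following NumPy sketch (a single hidden layer of 14 units with dropout probability 0.5; the weights and input are arbitrary) illustrates training-time dropout, the weight scaling inference rule, and Monte Carlo dropout:

import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5
W, b = rng.standard_normal((14, 1)), np.zeros(14)      # one hidden layer with 14 units

def hidden_train(x):
    h = np.maximum(0.0, W @ x + b)                     # ReLU activations
    mask = rng.random(h.shape) > p_drop                # clamp a random subset of units to zero
    return h * mask

def hidden_test(x):
    # Weight scaling inference rule: all units active; scaling the activations by
    # (1 - p_drop) is equivalent to scaling their outgoing weights
    return (1 - p_drop) * np.maximum(0.0, W @ x + b)

def hidden_mc(x, n_samples=10):
    # Monte Carlo dropout: average over several random dropout patterns at test time
    return np.mean([hidden_train(x) for _ in range(n_samples)], axis=0)

x = np.array([0.3])
print(hidden_train(x), hidden_test(x), hidden_mc(x), sep="\n")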


Figure 9.9 Dropout mechanism. a) An undesirable kink in the curve is caused by a sequential increase in the slope, decrease in the slope (at circled joint), and
then another increase to return the curve to its original trajectory. Here we are
using full-batch gradient descent, and the model (from figure 8.4) fits the data
as well as possible, so further training won’t remove the kink. b) Consider what
happens if we remove the eighth hidden unit that produced the circled joint in
panel (a), as might happen using dropout. Without the decrease in the slope,
the right-hand side of the function takes an upwards trajectory, and a subsequent
gradient descent step will aim to compensate for this change. c) Curve after 2000
iterations of (i) randomly removing one of the three hidden units that cause the
kink and (ii) performing a gradient descent step. The kink does not affect the loss
but is nonetheless removed by this approximation of the dropout mechanism.

9.3.4 Applying noise

Dropout can be interpreted as applying multiplicative Bernoulli noise to the network activations. This leads to the idea of applying noise to other parts of the network during training to make the final model more robust.
One option is to add noise to the input data; this smooths out the learned function (figure 9.10). For regression problems, it can be shown to be equivalent to adding a regularizing term that penalizes the derivatives of the network's output with respect to its input (Problem 9.3). An extreme variant is adversarial training, in which the optimization algorithm
actively searches for small perturbations of the input that cause large changes to the
output. These can be thought of as worst-case additive noise vectors.
A second possibility is to add noise to the weights. This encourages the network to
make sensible predictions even for small perturbations of the weights. The result is that
the training converges to local minima in the middle of wide, flat regions, where changing
the individual weights does not matter much.
Finally, we can perturb the labels. The maximum-likelihood criterion for multiclass
classification aims to predict the correct class with absolute certainty (equation 5.24).
To this end, the final network activations (i.e., before the softmax function) are pushed
to very large values for the correct class and very small values for the wrong classes.
We could discourage this overconfident behavior by assuming that a proportion ρ of


Figure 9.10 Adding noise to inputs. At each step of SGD, random noise with variance σ_x² is added to the batch data. a–c) Fitted model with different noise levels (small dots represent ten samples). Adding more noise smooths out the fitted function (cyan line).

the training labels are incorrect and belong with equal probability to the other classes. This could be done by randomly changing the labels at each training iteration. However, the same end can be achieved by changing the loss function to minimize the cross-entropy between the predicted distribution and a distribution where the true label has probability 1 − ρ, and the other classes have equal probability (Problem 9.4). This is known as label smoothing and improves generalization in diverse scenarios.
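A sketch of the resulting loss (NumPy; ρ = 0.1 is a common default, and the example logits are arbitrary): the cross-entropy is computed against a target that places 1 − ρ on the true class and spreads ρ evenly over the remaining classes.

import numpy as np

def label_smoothing_loss(logits, true_class, rho=0.1):
    # Cross-entropy against a target with probability 1 - rho at the true class and
    # the remaining mass rho divided equally among the other classes
    num_classes = logits.shape[-1]
    log_probs = logits - np.log(np.sum(np.exp(logits)))      # log softmax of the final activations
    target = np.full(num_classes, rho / (num_classes - 1))
    target[true_class] = 1.0 - rho
    return -np.sum(target * log_probs)

print(label_smoothing_loss(np.array([3.0, 0.5, -1.0]), true_class=0))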

9.3.5 Bayesian inference

The maximum likelihood approach is generally overconfident; it selects the most likely
parameters during training and uses these to make predictions. However, many param-
eter values may be broadly compatible with the data and only slightly less likely. The
Bayesian approach treats the parameters as unknown variables and computes a distribution Pr(ϕ|{x_i, y_i}) over these parameters ϕ conditioned on the training data {x_i, y_i} using Bayes' rule (Appendix C.1.4):

Pr(ϕ|{x_i, y_i}) = ( ∏_{i=1}^{I} Pr(y_i|x_i, ϕ) Pr(ϕ) ) / ( ∫ ∏_{i=1}^{I} Pr(y_i|x_i, ϕ) Pr(ϕ) dϕ ),   (9.11)

where Pr(ϕ) is the prior probability of the parameters, and the denominator is a normalizing term. Hence, every parameter choice is assigned a probability (figure 9.11).
The prediction y for new input x is an infinite weighted sum (i.e., an integral) of the
predictions for each parameter set, where the weights are the associated probabilities:
Pr(y|x, {x_i, y_i}) = ∫ Pr(y|x, ϕ) Pr(ϕ|{x_i, y_i}) dϕ.   (9.12)

This is effectively an infinite weighted ensemble, where the weight depends on (i) the
prior probability of the parameters and (ii) their agreement with the data.


Figure 9.11 Bayesian approach for simplified network model (see figure 8.4). The parameters are treated as uncertain. The posterior probability Pr(ϕ|{x_i, y_i}) for a set of parameters is determined by their compatibility with the data {x_i, y_i} and a prior distribution Pr(ϕ). a–c) Two sets of parameters (cyan and gray curves) sampled from the posterior using normally distributed priors with mean zero and three variances. When the prior variance σ_ϕ² is small, the parameters also tend to be small, and the functions smoother. d–f) Inference proceeds by taking a weighted sum over all possible parameter values where the weights are the posterior probabilities. This produces both a prediction of the mean (cyan curves) and the associated uncertainty (gray region is two standard deviations).

The Bayesian approach is elegant and can provide more robust predictions than those that derive from maximum likelihood. Unfortunately, for complex models like neural networks, there is no practical way to represent the full probability distribution over the parameters or to integrate over it during the inference phase. Consequently, all current methods of this type make approximations of some kind, and typically these add considerable complexity to learning and inference (Notebook 9.4, Bayesian approach).

9.3.6 Transfer learning and multi-task learning

When training data are limited, other datasets can be exploited to improve performance.
In transfer learning (figure 9.12a), the network is pre-trained to perform a related sec-


ondary task for which data are more plentiful. The resulting model is then adapted to
the original task. This is typically done by removing the last layer and adding one or
more layers that produce a suitable output. The main model may be fixed, and the new
layers trained for the original task, or we may fine-tune the entire model.
The principle is that the network will build a good internal representation of the
data from the secondary task, which can subsequently be exploited for the original task.
Equivalently, transfer learning can be viewed as initializing most of the parameters of
the final network in a sensible part of the space that is likely to produce a good solution.
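A typical recipe looks like the following sketch (PyTorch/torchvision, assuming a recent version of torchvision; the choice of ResNet-18 pre-trained on ImageNet as the secondary task and the number of classes are placeholder assumptions): load the pre-trained network, replace its final layer, and train only the new layer, or alternatively fine-tune everything with a small learning rate.

import torch
import torch.nn as nn
from torchvision import models

num_classes = 10                                         # classes in the primary task (placeholder)
model = models.resnet18(weights="IMAGENET1K_V1")         # pre-trained on the secondary task (ImageNet)

for param in model.parameters():                         # freeze the pre-trained backbone
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, num_classes)  # new final layer for the primary task

# Train only the new layer; to fine-tune the whole network instead, unfreeze the
# backbone and pass model.parameters() with a small learning rate.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01)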
Multi-task learning (figure 9.12b) is a related technique in which the network is trained
to solve several problems concurrently. For example, the network might take an image
and simultaneously learn to segment the scene, estimate the pixel-wise depth, and predict
a caption describing the image. All of these tasks require some understanding of the
image and, when learned simultaneously, the model performance for each may improve.

9.3.7 Self-supervised learning

The above discussion assumes that we have plentiful data for a secondary task or data for
multiple tasks to be learned concurrently. If not, we can create large amounts of “free”
labeled data using self-supervised learning and use this for transfer learning. There are
two families of methods for self-supervised learning: generative and contrastive.
In generative self-supervised learning, part of each data example is masked, and the
secondary task is to predict the missing part (figure 9.12c). For example, we might use
a corpus of unlabeled images and a secondary task that aims to inpaint (fill in) missing
parts of the image (figure 9.12c). Similarly, we might use a large corpus of text and mask
some words. We train the network to predict the missing words and then fine-tune it for
the actual language task we are interested in (see chapter 12).
In contrastive self-supervised learning, pairs of examples with commonalities are com-
pared to unrelated pairs. For images, the secondary task might be to identify whether a
pair of images are transformed versions of one another or are unconnected. For text, the
secondary task might be to determine whether two sentences followed one another in the
original document. Sometimes, the precise relationship between a connected pair must
be identified (e.g., finding the relative position of two patches from the same image).

9.3.8 Augmentation

Transfer learning improves performance by exploiting a different dataset. Multi-task learning improves performance using additional labels. A third option is to expand the dataset. We can often transform each input data example in such a way that the label stays the same. For example, we might aim to determine if there is a bird in an image (figure 9.13). Here, we could rotate, flip, blur, or manipulate the color balance of the image, and the label “bird” remains valid. Similarly, for tasks where the input is text, we can substitute synonyms or translate to another language and back again. For tasks where the input is audio, we can amplify or attenuate different frequency bands (Notebook 9.5, Augmentation).


Figure 9.12 Transfer, multi-task, and self-supervised learning. a) Transfer learning is used when we have limited labeled data for the primary task (here depth
estimation) but plentiful data for a secondary task (here segmentation). We train
a model for the secondary task, remove the final layers, and replace them with
new layers appropriate to the primary task. We then train only the new layers
or fine-tune the entire network for the primary task. The network learns a good
internal representation from the secondary task that is then exploited for the pri-
mary task. b) In multi-task learning, we train a model to perform multiple tasks
simultaneously, hoping that performance on each will improve. c) In generative
self-supervised learning, we remove part of the data and train the network to
complete the missing information. Here, the task is to fill in (inpaint) a masked
portion of the image. This permits transfer learning when no labels are available.
Images from Cordts et al. (2016).


Figure 9.13 Data augmentation. For some problems, each data example can be
transformed to augment the dataset. a) Original image. b–h) Various geometric
and photometric transformations of this image. For image classification, all these
images still have the same label, “bird.” Adapted from Wu et al. (2015a).

Generating extra training data in this way is known as data augmentation. The aim
is to teach the model to be indifferent to these irrelevant data transformations.
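In code, augmentation is usually applied on the fly as data are loaded. The sketch below (torchvision; the particular transforms and their parameters are arbitrary examples, not the augmentations used in the figures) composes geometric and photometric transformations that leave the label unchanged:

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                      # geometric: mirror the image
    transforms.RandomRotation(degrees=15),                  # geometric: small rotation
    transforms.ColorJitter(brightness=0.3, contrast=0.3),   # photometric: color balance
    transforms.GaussianBlur(kernel_size=3),                 # blur
    transforms.ToTensor(),
])

# Applied when loading data, e.g.:
# dataset = torchvision.datasets.ImageFolder("path/to/images", transform=augment)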

9.4 Summary

Explicit regularization involves adding an extra term to the loss function that changes
the position of the minimum. The term can be interpreted as a prior probability over
the parameters. Stochastic gradient descent with a finite step size does not neutrally
descend to the minimum of the loss function. This bias can be interpreted as adding
additional terms to the loss function, and this is known as implicit regularization.
There are also many heuristics for improving generalization, including early stopping,
dropout, ensembling, the Bayesian approach, adding noise, transfer learning, multi-task
learning, and data augmentation. There are four main principles behind these methods
(figure 9.14). We can (i) encourage the function to be smoother (e.g., L2 regularization),
(ii) increase the amount of data (e.g., data augmentation), (iii) combine models (e.g.,
ensembling), or (iv) search for wider minima (e.g., applying noise to network weights).


Figure 9.14 Regularization methods. The regularization methods discussed in this chapter aim to improve generalization by one of four mechanisms. Some methods
aim to make the modeled function smoother. Other methods increase the effective
amount of data. The third group of methods combine multiple models and hence
mitigate against uncertainty in the fitting process. Finally, the fourth group of
methods encourages the training process to converge to a wide minimum where
small errors in the estimated parameters are less important (see also figure 20.11).

Another way to improve generalization is to choose the model architecture to suit the
task. For example, in image segmentation, we can share parameters within the model,
so we don’t need to independently learn what a tree looks like at every image location.
Chapters 10–13 consider architectural variations designed for different tasks.

Notes

An overview and taxonomy of regularization techniques in deep learning can be found in Kukačka et al. (2017). Notably missing from the discussion in this chapter are BatchNorm (Szegedy et al., 2016) and its variants, which are described in chapter 11.

Regularization: L2 regularization penalizes the sum of squares of the network weights. This
encourages the output function to change slowly (i.e., become smoother) and is the most used
regularization term. It is sometimes referred to as Frobenius norm regularization as it penalizes
the Frobenius norms of the weight matrices. It is often also mistakenly referred to as “weight
decay,” although this is a separate technique devised by Hanson & Pratt (1988) in which the
parameters ϕ are updated as:

ϕ ← (1 − λ′)ϕ − α ∂L/∂ϕ,   (9.13)


where, as usual, α is the learning rate, and L is the loss. This is identical to gradient descent,
except that the weights are reduced by a factor of 1−λ′ before the gradient update. For standard
SGD, weight decay is equivalent to L2 regularization (equation 9.5) with coefficient λ = λ′ /2α.
Problem 9.5
However, for Adam, the learning rate α is different for each parameter, so L2 regularization
and weight decay differ. Loshchilov & Hutter (2019) present AdamW, which modifies Adam to
implement weight decay correctly and show that this improves performance.
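The distinction is visible directly in the update equations. The sketch below (plain NumPy with a hypothetical gradient function; the parameter and coefficient values are arbitrary) contrasts SGD with L2 regularization, which folds the penalty into the gradient, against the decoupled weight decay update of equation 9.13; with λ = λ′/2α the two SGD updates coincide, whereas for Adam they do not.

import numpy as np

def sgd_l2_step(phi, grad, alpha, lam):
    # L2 regularization: the penalty lam * sum(phi^2) adds 2 * lam * phi to the gradient
    return phi - alpha * (grad(phi) + 2 * lam * phi)

def weight_decay_step(phi, grad, alpha, lam_prime):
    # Decoupled weight decay (equation 9.13): shrink the weights, then take a gradient step
    return (1 - lam_prime) * phi - alpha * grad(phi)

phi = np.array([1.0, -2.0])
grad = lambda p: 2 * p                        # hypothetical gradient of the loss
# With lam = lam_prime / (2 * alpha), the two updates agree for SGD (but not for Adam)
print(sgd_l2_step(phi, grad, alpha=0.1, lam=0.05))
print(weight_decay_step(phi, grad, alpha=0.1, lam_prime=0.01))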
Other choices of vector norm encourage sparsity in the weights. The L0 regularization term applies a fixed penalty for every non-zero weight (Appendix B.3.2, Vector norms). The effect is to “prune” the network. L0 regularization can also be used to encourage group sparsity; this might apply a fixed penalty if any of the weights contributing to a given hidden unit are non-zero. If they are all zero, we can remove the unit, decreasing the model size and making inference faster.
Unfortunately, L0 regularization is challenging to implement since the derivative of the regular-
ization term is not smooth, and more sophisticated fitting methods are required (see Louizos
et al., 2018). Somewhere between L2 and L0 regularization is L1 regularization or LASSO
(least absolute shrinkage and selection operator), which imposes a penalty on the absolute val-
ues of the weights. L2 regularization somewhat discourages sparsity in that the derivative of
the squared penalty decreases as the weight becomes smaller, lowering the pressure to make it
smaller still. L1 regularization does not have this disadvantage, as the derivative of the penalty
is constant. This can produce sparser solutions than L2 regularization but is much easier to optimize than L0 regularization (Problem 9.6). Sometimes both L1 and L2 regularization terms are used, which is termed an elastic net penalty (Zou & Hastie, 2005).
A different approach to regularization is to modify the gradients of the learning algorithm
without ever explicitly formulating a new loss function (e.g., equation 9.13). This approach has
been used to promote sparsity during backpropagation (Schwarz et al., 2021).
The evidence on the effectiveness of explicit regularization is mixed. Zhang et al. (2017a) showed
that L2 regularization contributes little to generalization. It has been proven that the Lipschitz constant of the network (how fast the function can change as we modify the input; Appendix B.1.1) bounds the generalization error (Bartlett et al., 2017; Neyshabur et al., 2018). However, the Lipschitz constant depends on the product of the spectral norms of the weight matrices Ω_k (Appendix B.3.7), which are only indirectly dependent on the magnitudes of the individual weights. Bartlett et al. (2017), Neyshabur et al. (2018), and Yoshida & Miyato (2017) all add terms that indirectly encourage the spectral norms to be smaller. Gouk et al. (2021) take a different approach and develop an algorithm that constrains the Lipschitz constant of the network to be below a particular value.

Implicit regularization in gradient descent: The gradient descent step is:

ϕ_1 = ϕ_0 + α · g[ϕ_0],   (9.14)
where g[ϕ0 ] is the negative of the gradient of the loss function, and α is the step size. As α → 0,
the gradient descent process can be described by a differential equation:


dϕ/dt = g[ϕ].   (9.15)
For typical step sizes α, the discrete and continuous versions converge to different solutions. We
can use backward error analysis to find a correction g1 [ϕ] to the continuous version:


dϕ/dt ≈ g[ϕ] + α g_1[ϕ] + …,   (9.16)
so that it gives the same result as the discrete version.
Consider the first two terms of a Taylor expansion of the modified continuous solution ϕ around
initial position ϕ0 :


ϕ[α] ≈ [ ϕ + α dϕ/dt + (α²/2) d²ϕ/dt² ]_{ϕ=ϕ_0}
     ≈ [ ϕ + α (g[ϕ] + α g_1[ϕ]) + (α²/2) ( ∂g[ϕ]/∂ϕ · dϕ/dt + α ∂g_1[ϕ]/∂ϕ · dϕ/dt ) ]_{ϕ=ϕ_0}
     = [ ϕ + α (g[ϕ] + α g_1[ϕ]) + (α²/2) ( ∂g[ϕ]/∂ϕ · g[ϕ] + α ∂g_1[ϕ]/∂ϕ · g[ϕ] ) ]_{ϕ=ϕ_0}
     ≈ [ ϕ + α g[ϕ] + α² ( g_1[ϕ] + (1/2) ∂g[ϕ]/∂ϕ · g[ϕ] ) ]_{ϕ=ϕ_0},   (9.17)

where in the second line, we have introduced the correction term (equation 9.16), and in the final line, we have removed terms of greater order than α². Note that the first two terms on the right-hand side, ϕ_0 + α g[ϕ_0], are the same as the discrete update (equation 9.14). Hence, to make the continuous and discrete versions arrive at the same place, the third term on the right-hand side must equal zero, allowing us to solve for g_1[ϕ]:

g_1[ϕ] = −(1/2) ∂g[ϕ]/∂ϕ · g[ϕ].   (9.18)
During training, the evolution function g[ϕ] is the negative of the gradient of the loss:


dϕ/dt ≈ g[ϕ] + α g_1[ϕ]
      = −∂L/∂ϕ − (α/2) (∂²L/∂ϕ²) (∂L/∂ϕ).   (9.19)

This is equivalent to performing continuous gradient descent on the loss function:

L_GD[ϕ] = L[ϕ] + (α/4) ‖∂L/∂ϕ‖²,   (9.20)
because the right-hand side of equation 9.19 is the derivative of that in equation 9.20.
This formulation of implicit regularization was developed by Barrett & Dherin (2021) and
extended to stochastic gradient descent by Smith et al. (2021). Smith et al. (2020) and others
have shown that stochastic gradient descent with small or moderate batch sizes outperforms full
batch gradient descent on the test set, and this may in part be due to implicit regularization.
Relatedly, Jastrzębski et al. (2021) and Cohen et al. (2021) both show that using a large learn-
ing rate reduces the tendency of typical optimization trajectories to move to “sharper” parts of
the loss function (i.e., where at least one direction has high curvature). This implicit regular-
ization effect of large learning rates can be approximated by penalizing the trace of the Fisher
Information Matrix, which is closely related to penalizing the gradient norm in equation 9.20
(Jastrzębski et al., 2021).
Early stopping: Bishop (1995) and Sjöberg & Ljung (1995) argued that early stopping limits
the effective solution space that the training procedure can explore; given that the weights are
initialized to small values, this leads to the idea that early stopping helps prevent the weights
from getting too large. Goodfellow et al. (2016) show that under a quadratic approximation
of the loss function with parameters initialized to zero, early stopping is equivalent to L2 reg-
ularization in gradient descent. The effective regularization weight λ is approximately 1/(τ α)
where α is the learning rate, and τ is the early stopping time.


Ensembling: Ensembles can be trained using different random seeds (Lakshminarayanan et al., 2017), hyperparameters (Wenzel et al., 2020b), or even entirely different families of
models. The models can be combined by averaging their predictions, weighting the predictions,
or stacking (Wolpert, 1992), in which the results are combined using another machine learning
model. Lakshminarayanan et al. (2017) showed that averaging the output of independently
trained networks can improve accuracy, calibration, and robustness. Conversely, Frankle et al.
(2020) showed that if we average together the weights to make one model, the network fails.
Fort et al. (2019) compared ensembling solutions that resulted from different initializations with ensembling solutions that were generated from the same original model. For example, in the latter case, they consider exploring around the solution in a limited subspace (Appendix B.3.6) to find other good nearby points. They found that both techniques provide complementary benefits but that genuine ensembling from different random starting points provides a bigger improvement.
An efficient way of ensembling is to combine models from intermediate stages of training. To this
end, Izmailov et al. (2018) introduce stochastic weight averaging, in which the model weights
are sampled at different time steps and averaged together. As the name suggests, snapshot
ensembles (Huang et al., 2017a) also store the models from different time steps and average
their predictions. The diversity of these models can be improved by cyclically increasing and
decreasing the learning rate. Garipov et al. (2018) observed that different minima of the loss
function are often connected by a low-energy path (i.e., a path with a low loss everywhere along
it). Motivated by this observation, they developed a method that explores low-energy regions
around an initial solution to provide diverse models without retraining. This is known as fast
geometric ensembling. A review of ensembling methods can be found in Ganaie et al. (2022).
Dropout: Dropout was first introduced by Hinton et al. (2012b) and Srivastava et al. (2014).
Dropout is applied at the level of hidden units. Dropping a hidden unit has the same effect
as temporarily setting all the incoming and outgoing weights and the bias to zero. Wan et al.
(2013) generalized dropout by randomly setting individual weights to zero. Gal & Ghahramani
(2016) and Kendall & Gal (2017) proposed Monte Carlo dropout, in which inference is computed
with several dropout patterns, and the results are averaged together. Gal & Ghahramani (2016)
argued that this could be interpreted as approximating Bayesian inference.
Dropout is equivalent to applying multiplicative Bernoulli noise to the hidden units. Similar
benefits derive from using other distributions, including the normal (Srivastava et al., 2014;
Shen et al., 2017), uniform (Shen et al., 2017), and beta distributions (Liu et al., 2019b).
Adding noise: Bishop (1995) and An (1996) added Gaussian noise to the network inputs to
improve performance. Bishop (1995) showed that this is equivalent to weight decay. An (1996)
also investigated adding noise to the weights. DeVries & Taylor (2017a) added Gaussian noise
to the hidden units. The randomized ReLU (Xu et al., 2015) applies noise in a different way by
making the activation functions stochastic.
Label smoothing: Label smoothing was introduced by Szegedy et al. (2016) for image classi-
fication but has since been shown to be helpful in speech recognition (Chorowski & Jaitly, 2017),
machine translation (Vaswani et al., 2017), and language modeling (Pereyra et al., 2017). The
precise mechanism by which label smoothing improves test performance isn’t well understood,
although Müller et al. (2019a) show that it improves the calibration of the predicted output
probabilities. A closely related technique is DisturbLabel (Xie et al., 2016), in which a certain
percentage of the labels in each batch are randomly switched at each training iteration.
Finding wider minima: It is thought that wider minima generalize better (see figure 20.11).
Here, the exact values of the weights are less important, so performance should be robust to
errors in their estimates. One of the reasons that applying noise to parts of the network during
training is effective is that it encourages the network to be indifferent to their exact values.
Chaudhari et al. (2019) developed a variant of SGD that biases the optimization toward flat
minima, which they call entropy SGD. The idea is to incorporate local entropy as a term in the
loss function. In practice, this takes the form of one SGD-like update within another. Keskar


et al. (2017) showed that SGD finds wider minima as the batch size is reduced. This may be
because of the batch variance term that results from implicit regularization by SGD.
Ishida et al. (2020) use a technique named flooding, in which they intentionally prevent the
training loss from becoming zero. This encourages the solution to perform a random walk over
the loss landscape and drift into a flatter area with better generalization.

Bayesian approaches: For some models, including the simplified neural network model in
figure 9.11, the Bayesian predictive distribution can be computed in closed form (see Bishop,
2006; Prince, 2012). For neural networks, the posterior distribution over the parameters can-
not be represented in closed form and must be approximated. The two main approaches are
variational Bayes (Hinton & van Camp, 1993; MacKay, 1995; Barber & Bishop, 1997; Blundell
et al., 2015), in which the posterior is approximated by a simpler tractable distribution, and
Markov Chain Monte Carlo (MCMC) methods, which approximate the distribution by drawing
a set of samples (Neal, 1995; Welling & Teh, 2011; Chen et al., 2014; Ma et al., 2015; Li et al.,
2016a). The generation of samples can be integrated into SGD, and this is known as stochas-
tic gradient MCMC (see Ma et al., 2015). It has recently been discovered that “cooling” the
posterior distribution over the parameters (making it sharper) improves predictions from these
models (Wenzel et al., 2020a), but this is not currently fully understood (see Noci et al., 2021).

Transfer learning: Transfer learning for visual tasks works extremely well (Sharif Razavian
et al., 2014) and has supported rapid progress in computer vision, including the original AlexNet
results (Krizhevsky et al., 2012). Transfer learning has also impacted natural language process-
ing (NLP), where many models are based on pre-trained features from the BERT model (Devlin
et al., 2019). More information can be found in Zhuang et al. (2020) and Yang et al. (2020b).

Self-supervised learning: Self-supervised learning techniques for images have included in-
painting masked image regions (Pathak et al., 2016), predicting the relative position of patches
in an image (Doersch et al., 2015), re-arranging permuted image tiles back into their original
configuration (Noroozi & Favaro, 2016), colorizing grayscale images (Zhang et al., 2016b), and
transforming rotated images back to their original orientation (Gidaris et al., 2018). In Sim-
CLR (Chen et al., 2020c), a network is learned that maps versions of the same image that
have been photometrically and geometrically transformed to the same representation while re-
pelling versions of different images, with the goal of becoming indifferent to irrelevant image
transformations. Jing & Tian (2020) present a survey of self-supervised learning in images.
Self-supervised learning in NLP can be based on predicting masked words (Devlin et al., 2019),
predicting the next word in a sentence (Radford et al., 2019; Brown et al., 2020), or predicting
whether two sentences follow one another (Devlin et al., 2019). In automatic speech recognition,
the Wav2Vec model (Schneider et al., 2019) aims to distinguish an original audio sample from
one where 10ms of audio has been swapped out from elsewhere in the clip. Self-supervision
has also been applied to graph neural networks (chapter 13). Tasks include recovering masked
features (You et al., 2020) and recovering the adjacency structure of the graph (Kipf & Welling,
2016). Liu et al. (2023a) review self-supervised learning for graph models.

Data augmentation: Data augmentation for images dates back to at least LeCun et al.
(1998) and contributed to the success of AlexNet (Krizhevsky et al., 2012), in which the dataset
was increased by a factor of 2048. Image augmentation approaches include geometric transfor-
mations, changing or manipulating the color space, noise injection, and applying spatial filters.
More elaborate techniques include randomly mixing images (Inoue, 2018; Summers & Dinneen,
2019), randomly erasing parts of the image (Zhong et al., 2020), style transfer (Jackson et al.,
2019), and randomly swapping image patches (Kang et al., 2017). In addition, many studies
have used generative adversarial networks or GANs (see chapter 15) to produce novel but plau-
sible data examples (e.g., Calimeri et al., 2017). In other cases, the data have been augmented
with adversarial examples (Goodfellow et al., 2015a), which are minor perturbations of the
training data that cause the example to be misclassified. A review of data augmentation for
images can be found in Shorten & Khoshgoftaar (2019).


Augmentation methods for acoustic data include pitch shifting, time stretching, dynamic range
compression, and adding random noise (e.g., Abeßer et al., 2017; Salamon & Bello, 2017; Xu
et al., 2015; Lasseck, 2018), as well as mixing data pairs (Zhang et al., 2017c; Yun et al., 2019),
masking features (Park et al., 2019), and using GANs to generate new data (Mun et al., 2017).
Augmentation for speech data includes vocal tract length perturbation (Jaitly & Hinton, 2013;
Kanda et al., 2013), style transfer (Gales, 1998; Ye & Young, 2004), adding noise (Hannun et al.,
2014), and synthesizing speech (Gales et al., 2009).
Augmentation methods for text include adding noise at a character level by switching, deleting,
and inserting letters (Belinkov & Bisk, 2018; Feng et al., 2020), or by generating adversarial
examples (Ebrahimi et al., 2018), using common spelling mistakes (Coulombe, 2018), randomly
swapping or deleting words (Wei & Zou, 2019), using synonyms (Kolomiyets et al., 2011),
altering adjectives (Li et al., 2017c), passivization (Min et al., 2020), using generative models
to create new data (Qiu et al., 2020), and round-trip translation to another language and back
(Aiken & Park, 2010). Augmentation methods for text are reviewed by Bayer et al. (2022).

Problems
Problem 9.1 Consider a model where the prior distribution over the parameters is a normal
distribution with mean zero and variance σ_ϕ^2 so that

Pr(ϕ) = ∏_{j=1}^{J} Norm_{ϕ_j}[0, σ_ϕ^2],    (9.21)

where j indexes the model parameters. We now maximize ∏_{i=1}^{I} Pr(y_i|x_i, ϕ) Pr(ϕ). Show that
the associated loss function of this model is equivalent to L2 regularization.
Problem 9.2 How do the gradients of the loss function change when L2 regularization (equa-
tion 9.5) is added?
Problem 9.3∗ Consider a linear regression model y = ϕ0 + ϕ1 x with input x, output y, and
parameters ϕ0 and ϕ1 . Assume we have I training examples {xi , yi } and use a least squares
loss. Consider adding Gaussian noise with mean zero and variance σ_x^2 to the inputs x_i at each
training iteration. What is the expected gradient update?
Problem 9.4∗ Derive the loss function for multiclass classification when we use label smooth-
ing so that the target probability distribution has 0.9 at the correct class and the remaining
probability mass of 0.1 is divided between the remaining D_o − 1 classes.
Problem 9.5 Show that the weight decay parameter update with decay rate λ:

ϕ ← (1 − λ)ϕ − α ∂L/∂ϕ,    (9.22)

on the original loss function L[ϕ] is equivalent to a standard gradient update using L2 regular-
ization so that the modified loss function L̃[ϕ] is:

L̃[ϕ] = L[ϕ] + λ/(2α) ∑_k ϕ_k^2,    (9.23)

where ϕ are the parameters, and α is the learning rate.
Problem 9.6 Consider a model with parameters ϕ = [ϕ_0, ϕ_1]^T. Draw the L0, L1/2, and L1
regularization terms in a similar form to figure 9.1b. The LP regularization term is ∑_{d=1}^{D} |ϕ_d|^P.

Chapter 10

Convolutional networks

Chapters 2–9 introduced the supervised learning pipeline for deep neural networks. How-
ever, these chapters only considered fully connected networks with a single path from
input to output. Chapters 10–13 introduce more specialized network components with
sparser connections, shared weights, and parallel processing paths. This chapter de-
scribes convolutional layers, which are mainly used for processing image data.
Images have three properties that suggest the need for specialized model architec-
ture. First, they are high-dimensional. A typical image for a classification task contains
224×224 RGB values (i.e., 150,528 input dimensions). Hidden layers in fully connected
networks are generally larger than the input size, so even for a shallow network, the
number of weights would exceed 150,528^2, or 22 billion. This poses obvious practical
problems in terms of the required training data, memory, and computation.
Second, nearby image pixels are statistically related. However, fully connected net-
works have no notion of “nearby” and treat the relationship between every input equally.
If the pixels of the training and test images were randomly permuted in the same way,
the network could still be trained with no practical difference. Third, the interpretation
of an image is stable under geometric transformations. An image of a tree is still an
image of a tree if we shift it leftwards by a few pixels. However, this shift changes every
input to the network. Hence, a fully connected model must learn the patterns of pixels
that signify a tree separately at every position, which is clearly inefficient.
Convolutional layers process each local image region independently, using parameters
shared across the whole image. They use fewer parameters than fully connected layers,
exploit the spatial relationships between nearby pixels, and don’t have to re-learn the
interpretation of the pixels at every position. A network predominantly consisting of
convolutional layers is known as a convolutional neural network or CNN.

10.1 Invariance and equivariance


We argued above that some properties of images (e.g., tree texture) are stable under
transformations. In this section, we make this idea more mathematically precise. A


Figure 10.1 Invariance and equivariance for translation. a–b) In image classi-
fication, the goal is to categorize both images as “mountain” regardless of the
horizontal shift that has occurred. In other words, we require the network pre-
diction to be invariant to translation. c,e) The goal of semantic segmentation is
to associate a label with each pixel. d,f) When the input image is translated, we
want the output (colored overlay) to translate in the same way. In other words,
we require the output to be equivariant with respect to translation. Panels c–f)
adapted from Bousselham et al. (2021).

function f[x] of an image x is invariant to a transformation t[x] if:

f[t[x]] = f[x].    (10.1)

In other words, the output of the function f[x] is the same regardless of the transfor-
mation t[x]. Networks for image classification should be invariant to geometric trans-
formations of the image (figure 10.1a–b). The network f[x] should identify an image as
containing the same object, even if it has been translated, rotated, flipped, or warped.
A function f[x] of an image x is equivariant or covariant to a transformation t[x] if:

f[t[x]] = t[f[x]].    (10.2)

In other words, f[x] is equivariant to the transformation t[x] if its output changes in
the same way under the transformation as the input. Networks for per-pixel image
segmentation should be equivariant to transformations (figure 10.1c–f); if the image is
translated, rotated, or flipped, the network f[x] should return a segmentation that has
been transformed in the same way.
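To make these definitions concrete, the following minimal NumPy sketch (the helper names and the use of a circular shift as the transformation t[x] are illustrative choices) checks both properties numerically for a 1D signal: a convolution is equivariant to the shift, while a global maximum is invariant to it.

import numpy as np

def conv1d_valid(x, w):
    # "Valid" cross-correlation: each output is a weighted sum of len(w) neighboring inputs.
    return np.array([np.dot(w, x[i:i + len(w)]) for i in range(len(x) - len(w) + 1)])

def shift(x, s):
    # Circular shift by s positions; a simple stand-in for the translation t[x].
    return np.roll(x, s)

x = np.random.randn(12)
w = np.array([0.25, 0.5, 0.25])

# Equivariance (equation 10.2): shifting the input shifts the output in the same way.
print(np.allclose(conv1d_valid(shift(x, 3), w)[3:], conv1d_valid(x, w)[:-3]))   # True

# Invariance (equation 10.1): the global maximum ignores the shift entirely.
print(np.isclose(np.max(shift(x, 3)), np.max(x)))                               # True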


Figure 10.2 1D convolution with kernel size three. Each output zi is a weighted
sum of the nearest three inputs xi−1 , xi , and xi+1 , where the weights are ω =
[ω1 , ω2 , ω3 ]. a) Output z2 is computed as z2 = ω1 x1 + ω2 x2 + ω3 x3 . b) Output z3
is computed as z3 = ω1 x2 + ω2 x3 + ω3 x4 . c) At position z1 , the kernel extends
beyond the first input x1 . This can be handled by zero-padding, in which we
assume values outside the input are zero. The final output is treated similarly.
d) Alternatively, we could only compute outputs where the kernel fits within the
input range (“valid” convolution); now, the output will be smaller than the input.

10.2 Convolutional networks for 1D inputs

Convolutional networks consist of a series of convolutional layers, each of which is equiv-
ariant to translation. They also typically include pooling mechanisms that induce partial
invariance to translation. For clarity of exposition, we first consider convolutional net-
works for 1D data, which are easier to visualize. In section 10.3, we progress to 2D
convolution, which can be applied to image data.

10.2.1 1D convolution operation

Convolutional layers are network layers based on the convolution operation. In 1D, a
convolution transforms an input vector x into an output vector z so that each output zi
is a weighted sum of nearby inputs. The same weights are used at every position and
are collectively called the convolution kernel or filter. The size of the region over which
inputs are combined is termed the kernel size. For a kernel size of three, we have:

z_i = ω_1 x_{i−1} + ω_2 x_i + ω_3 x_{i+1},    (10.3)


where ω = [ω_1, ω_2, ω_3]^T is the kernel (figure 10.2).¹ Notice that the convolution oper-
ation is equivariant with respect to translation. If we translate the input x, then the
corresponding output z is translated in the same way.

¹ Strictly speaking, this is a cross-correlation and not a convolution, in which the weights would be
flipped relative to the input (so we would switch x_{i−1} with x_{i+1}). Regardless, this (incorrect) definition
is the usual convention in machine learning.
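The following NumPy sketch implements equation 10.3 with zero-padding (the helper name conv1d_same and the example kernel are illustrative); it also shows that flipping the kernel and calling np.convolve gives the same result, reflecting the cross-correlation convention noted in the footnote.

import numpy as np

def conv1d_same(x, omega):
    # Kernel size three with zero-padding: z_i = w1*x_{i-1} + w2*x_i + w3*x_{i+1} (equation 10.3).
    x_pad = np.concatenate(([0.0], x, [0.0]))
    return np.array([omega @ x_pad[i:i + 3] for i in range(len(x))])

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
omega = np.array([-1.0, 0.0, 1.0])                 # an arbitrary example kernel
print(conv1d_same(x, omega))                       # [ 3.  1.  2.  2. -5.]

# np.convolve flips its second argument (a true convolution), so flipping the kernel
# first recovers the machine-learning "convolution" (cross-correlation) above.
print(np.convolve(x, omega[::-1], mode='same'))    # same values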


Figure 10.3 Stride, kernel size, and dilation. a) With a stride of two, we evaluate
the kernel at every other position, so the first output z1 is computed from a
weighted sum centered at x1 , and b) the second output z2 is computed from a
weighted sum centered at x3 and so on. c) The kernel size can also be changed.
With a kernel size of five, we take a weighted sum of the nearest five inputs. d)
In dilated or atrous convolution (from the French “à trous” – with holes), we
intersperse zeros in the weight vector to allow us to combine information over a
large area using fewer weights.

10.2.2 Padding

Equation 10.3 shows that each output is computed by taking a weighted sum of the
previous, current, and subsequent positions in the input. This begs the question of how
to deal with the first output (where there is no previous input) and the final output
(where there is no subsequent input).
There are two common approaches. The first is to pad the edges of the inputs with
new values and proceed as usual. Zero-padding assumes the input is zero outside its
valid range (figure 10.2c). Other possibilities include treating the input as circular or
reflecting it at the boundaries. The second approach is to discard the output positions
where the kernel exceeds the range of input positions. These valid convolutions have the
advantage of introducing no extra information at the edges of the input. However, they
have the disadvantage that the representation decreases in size.
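Both choices are available in standard libraries. A PyTorch-style sketch (assuming the torch library; the tensor sizes are arbitrary): zero-padding preserves the number of positions, while a "valid" convolution shrinks the representation by the kernel size minus one.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 8)                             # (batch, channels, positions)

same = nn.Conv1d(1, 1, kernel_size=3, padding=1)     # zero-padding keeps the number of positions
valid = nn.Conv1d(1, 1, kernel_size=3, padding=0)    # "valid": no values invented at the edges

print(same(x).shape)    # torch.Size([1, 1, 8])
print(valid(x).shape)   # torch.Size([1, 1, 6]): the representation shrinks by K - 1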

10.2.3 Stride, kernel size, and dilation

In the example above, each output was a sum of the nearest three inputs. However,
this is just one of a larger family of convolution operations, the members of which are


distinguished by their stride, kernel size, and dilation rate. When we evaluate the output
at every position, we term this a stride of one. However, it is also possible to shift the
kernel by a stride greater than one. If we have a stride of two, we create roughly half
the number of outputs (figure 10.3a–b).
The kernel size can be increased to integrate over a larger area (figure 10.3c). How-
ever, it typically remains an odd number so that it can be centered around the current
position. Increasing the kernel size has the disadvantage of requiring more weights. This
leads to the idea of dilated or atrous convolutions, in which the kernel values are inter-
spersed with zeros. For example, we can turn a kernel of size five into a dilated kernel of
size three by setting the second and fourth elements to zero. We still integrate informa-
tion from a larger input region but only require three weights to do this (figure 10.3d).
The number of zeros we intersperse between the weights determines the dilation rate.
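These three quantities together determine how many outputs a convolution produces. The helper below (an illustrative function, not taken from the text) uses the standard output-length formula; a kernel of size K with dilation rate d covers d(K − 1) + 1 input positions.

def conv_output_length(length, kernel_size, stride=1, dilation=1, padding=0):
    # Number of output positions for a 1D convolution; the dilated kernel spans
    # dilation * (kernel_size - 1) + 1 input positions.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (length + 2 * padding - effective_kernel) // stride + 1

print(conv_output_length(40, kernel_size=3, stride=2))     # 19 ("valid" stride-two layer, as in figure 10.7)
print(conv_output_length(8, kernel_size=3, padding=1))     # 8  (zero-padding keeps the length)
print(conv_output_length(11, kernel_size=3, dilation=2))   # 7  (kernel of size three spanning five inputs)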

10.2.4 Convolutional layers

A convolutional layer computes its output by convolving the input, adding a bias β, and
passing each result through an activation function a[•]. With kernel size three, stride
one, and dilation rate one, the ith hidden unit h_i would be computed as:

h_i = a[β + ω_1 x_{i−1} + ω_2 x_i + ω_3 x_{i+1}]
    = a[β + ∑_{j=1}^{3} ω_j x_{i+j−2}],    (10.4)

where the bias β and kernel weights ω1 , ω2 , ω3 are trainable parameters, and (with zero-
padding) we treat the input x as zero when it is out of the valid range. This is a special
case of a fully connected layer that computes the ith hidden unit as:

h_i = a[β_i + ∑_{j=1}^{D} ω_{ij} x_j].    (10.5)

If there are D inputs x• and D hidden units h•, this fully connected layer would have D^2
weights ω•• and D biases β•. The convolutional layer only uses three weights and one
bias. A fully connected layer can reproduce this exactly if most weights are set to zero
and others are constrained to be identical (figure 10.4).
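The sketch below (NumPy; the helper names are illustrative) builds the banded, weight-shared matrix of figure 10.4d from a kernel of size three and confirms that applying it as a fully connected layer reproduces the convolutional layer of equation 10.4 with a ReLU activation.

import numpy as np

def conv_layer(x, omega, beta):
    # Equation 10.4 with zero-padding and a ReLU activation.
    x_pad = np.concatenate(([0.0], x, [0.0]))
    pre = beta + np.array([omega @ x_pad[i:i + 3] for i in range(len(x))])
    return np.maximum(pre, 0.0)

def as_fully_connected(omega, D):
    # The equivalent fully connected weight matrix: banded, with shared weights (figure 10.4d).
    W = np.zeros((D, D))
    for i in range(D):
        for j, w in zip(range(i - 1, i + 2), omega):
            if 0 <= j < D:
                W[i, j] = w
    return W

x = np.random.randn(6)
omega, beta = np.array([0.5, -1.0, 0.5]), 0.1
W = as_fully_connected(omega, 6)
print(np.allclose(conv_layer(x, omega, beta), np.maximum(W @ x + beta, 0.0)))   # True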

10.2.5 Channels

If we only apply a single convolution, information will likely be lost; we are averaging
nearby inputs, and the ReLU activation function clips results that are less than zero.
Hence, it is usual to compute several convolutions in parallel. Each convolution produces
a new set of hidden variables, termed a feature map or channel.


Figure 10.4 Fully connected vs. convolutional layers. a) A fully connected layer
has a weight connecting each input x to each hidden unit h (colored arrows)
and a bias for each hidden unit (not shown). b) Hence, the associated weight
matrix Ω contains 36 weights relating the six inputs to the six hidden units. c) A
convolutional layer with kernel size three computes each hidden unit as the same
weighted sum of the three neighboring inputs (arrows) plus a bias (not shown).
d) The weight matrix is a special case of the fully connected matrix where many
weights are zero and others are repeated (same colors indicate same value, white
indicates zero weight). e) A convolutional layer with kernel size three and stride
two computes a weighted sum at every other position. f) This is also a special
case of a fully connected network with a different sparse weight structure.

Figure 10.5 Channels. Typically, multiple convolutions are applied to the input x
and stored in channels. a) A convolution is applied to create hidden units h1
to h6 , which form the first channel. b) A second convolution operation is applied
to create hidden units h7 to h12 , which form the second channel. The channels
are stored in a 2D array H1 that contains all the hidden units in the first hidden
layer. c) If we add a further convolutional layer, there are now two channels at
each input position. Here, the 1D convolution defines a weighted sum over both
input channels at the three closest positions to create each new output channel.


Figure 10.5a–b illustrates this with two convolution kernels of size three and with
zero-padding. The first kernel computes a weighted sum of the nearest three pixels, adds
a bias, and passes the results through the activation function to produce hidden units h1
to h6 . These comprise the first channel. The second kernel computes a different weighted
sum of the nearest three pixels, adds a different bias, and passes the results through the
activation function to create hidden units h7 to h12 . These comprise the second channel.
In general, the input and the hidden layers all have multiple channels (figure 10.5c). If
the incoming layer has Ci channels and we select a kernel size K per channel, the hidden
units in each output channel are computed as a weighted sum over all Ci channels and K
kernel entries using a weight matrix Ω ∈ ℝ^{Ci×K} and one bias. Hence, if there are Co
channels in the next layer, then we need Ω ∈ ℝ^{Ci×Co×K} weights and β ∈ ℝ^{Co} biases.
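The same bookkeeping can be read off a standard library (a PyTorch sketch; the ordering of the stored weight dimensions is library-specific, but the total count matches Ci × Co × K weights and Co biases):

import torch.nn as nn

Ci, Co, K = 15, 15, 3
conv = nn.Conv1d(in_channels=Ci, out_channels=Co, kernel_size=K)
print(conv.weight.shape)   # torch.Size([15, 15, 3]): Ci * Co * K weights in total
print(conv.bias.shape)     # torch.Size([15]): one bias per output channel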

10.2.6 Convolutional networks and receptive fields

Chapter 4 described deep networks, which consisted of a sequence of fully connected
layers. Similarly, convolutional networks comprise a sequence of convolutional layers.
The receptive field of a hidden unit in the network is the region of the original input that
feeds into it. Consider a convolutional network where each convolutional layer has kernel
size three. The hidden units in the first layer take a weighted sum of the three closest
inputs, so have receptive fields of size three. The units in the second layer take a weighted
sum of the three closest positions in the first layer, which are themselves weighted sums
of three inputs. Hence, the hidden units in the second layer have a receptive field of size
five. In this way, the receptive field of units in successive layers increases, and information
from across the input is gradually integrated (figure 10.6).
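The receptive field can be computed layer by layer: each new layer adds (kernel size − 1) times the product of the strides of the layers below it. A small illustrative helper (not from the text):

def receptive_field(layers):
    # layers: a list of (kernel_size, stride) pairs applied in order.
    # Each layer adds (kernel_size - 1) times the cumulative stride of the layers below it.
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Three stride-one layers with kernel size three give receptive fields 3, 5, 7 (figure 10.6).
print([receptive_field([(3, 1)] * n) for n in (1, 2, 3)])   # [3, 5, 7]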

10.2.7 Example: MNIST-1D

We now apply a convolutional network to the MNIST-1D data (see figure 8.1). The
input x is a 40D vector, and the output f is a 10D vector that is passed through a
softmax layer to produce class probabilities. We use a network with three hidden layers
(figure 10.7). The fifteen channels of the first hidden layer H1 are each computed using
a kernel size of three and a stride of two with “valid” padding, giving nineteen spatial
positions. The second hidden layer H2 is also computed using a kernel size of three, a
stride of two, and “valid” padding. The third hidden layer is computed similarly. At this
stage, the representation has four spatial positions and fifteen channels. These values
are reshaped into a vector of size sixty, which is mapped by a fully connected layer to
the ten output activations.
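A minimal PyTorch sketch of this architecture (the accompanying notebook may differ in details such as initialization and the training loop) confirms the spatial sizes 19, 9, and 4 and the parameter count quoted below:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(1, 15, kernel_size=3, stride=2), nn.ReLU(),    # 40 positions -> 19
    nn.Conv1d(15, 15, kernel_size=3, stride=2), nn.ReLU(),   # 19 -> 9
    nn.Conv1d(15, 15, kernel_size=3, stride=2), nn.ReLU(),   # 9 -> 4
    nn.Flatten(),                                            # 4 positions x 15 channels = 60
    nn.Linear(60, 10),                                       # ten output activations
)

x = torch.randn(100, 1, 40)                                  # a batch of 100 MNIST-1D inputs
print(model(x).shape)                                        # torch.Size([100, 10])
print(sum(p.numel() for p in model.parameters()))            # 2050 trainable parameters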
This network was trained for 100,000 steps using SGD without momentum, a learning
rate of 0.01, and a batch size of 100 on a dataset of 4,000 examples. We compare this to
a fully connected network with the same number of layers and hidden units (i.e., three
hidden layers with 285, 135, and 60 hidden units, respectively). The convolutional net-
work has 2,050 parameters, and the fully connected network has 59,065 parameters. By
the logic of figure 10.4, the convolutional network is a special case of the fully connected


Figure 10.6 Receptive fields for network with kernel width of three. a) An input
with eleven dimensions feeds into a hidden layer with three channels and convo-
lution kernel of size three. The pre-activations of the three highlighted hidden
units in the first hidden layer H1 are different weighted sums of the nearest three
inputs, so the receptive field in H1 has size three. b) The pre-activations of the
four highlighted hidden units in layer H2 each take a weighted sum of the three
channels in layer H1 at each of the three nearest positions. Each hidden unit in
layer H1 weights the nearest three input positions. Hence, hidden units in H2
have a receptive field size of five. c) The hidden units in the third layer (kernel
size three, stride two) increase the receptive field size to seven. d) By the time
we add a fourth layer, the hidden units at position three have a receptive field
that covers the entire input.


Figure 10.7 Convolutional network for classifying MNIST-1D data (see figure 8.1).
The MNIST-1D input has dimension Di = 40. The first convolutional layer has
fifteen channels, kernel size three, stride two, and only retains “valid” positions to
make a hidden layer with nineteen positions and fifteen channels. The following
two convolutional layers have the same settings, gradually reducing the repre-
sentation size at each subsequent hidden layer. Finally, a fully connected layer
takes all sixty hidden units from the third hidden layer. It outputs ten activations
that are subsequently passed through a softmax layer to produce the ten class
probabilities.

Figure 10.8 MNIST-1D results. a) The convolutional network from figure 10.7
eventually fits the training data perfectly and has ∼17% test error. b) A fully
connected network with the same number of hidden layers and hidden units in each
layer learns the training data faster but fails to generalize well, with
∼40% test error. The latter model could reproduce the convolutional model exactly
but fails to find that solution. The convolutional structure restricts the possible mappings to those
that process every position similarly, and this restriction improves performance.


one. The latter has enough flexibility to replicate the former exactly. Figure 10.8 shows
both models fit the training data perfectly. However, the test error for the convolutional
network is much less than for the fully connected network.
This discrepancy is probably not due to the difference in the number of parameters;
we know overparameterization usually improves performance (section 8.4.1). The likely
explanation is that the convolutional architecture has a superior inductive bias (i.e.,
interpolates between the training data better) because we have embodied some prior
knowledge in the architecture; we have forced the network to process each position in
the input in the same way. We know that the data were created by starting with a
template that is (among other operations) randomly translated, so this is sensible.
The fully connected network has to learn what each digit template looks like at every
position. In contrast, the convolutional network shares information across positions and
hence learns to identify each category more accurately. Another way of thinking about
this is that when we train the convolutional network, we search through a smaller family
of input/output mappings, all of which are plausible. Alternatively, the convolutional
structure can be considered a regularizer that applies an infinite penalty to most of the
solutions that a fully connected network can describe.

10.3 Convolutional networks for 2D inputs

The previous section described convolutional networks for processing 1D data. Such
networks can be applied to financial time series, audio, and text. However, convolutional
networks are more usually applied to 2D image data. The convolutional kernel is now
a 2D object. A 3×3 kernel Ω ∈ ℝ^{3×3} applied to a 2D input comprising elements x_{ij}
computes a single layer of hidden units h_{ij} as:

h_{ij} = a[β + ∑_{m=1}^{3} ∑_{n=1}^{3} ω_{mn} x_{i+m−2,j+n−2}],    (10.6)

where ωmn are the entries of the convolutional kernel. This is simply a weighted sum
over a square 3×3 input region. The kernel is translated both horizontally and vertically
across the 2D input (figure 10.9) to create an output at each position.
Often the input is an RGB image, which is treated as a 2D signal with three channels
(figure 10.10). Here, a 3×3 kernel would have 3×3×3 weights and be applied to the
three input channels at each of the 3×3 positions to create a 2D output that is the same
height and width as the input image (assuming zero-padding). To generate multiple
output channels, we repeat this process with different kernel weights and append the
results to form a 3D tensor. If the kernel is size K × K, and there are Ci input channels,
each output channel is a weighted sum of Ci × K × K quantities plus one bias. It follows
that to compute Co output channels, we need Ci × Co × K × K weights and Co biases.
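For example (a PyTorch sketch with arbitrary channel counts), a 3×3 kernel mapping 3 input channels to 16 output channels needs 3 × 16 × 3 × 3 = 432 weights and 16 biases:

import torch.nn as nn

Ci, Co, K = 3, 16, 3
conv = nn.Conv2d(Ci, Co, kernel_size=K, padding=K // 2)   # zero-padding keeps height and width
print(conv.weight.numel())                                # 432 = Ci * Co * K * K weights
print(conv.bias.numel())                                  # 16 = Co biases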


Figure 10.9 2D convolutional layer. Each output hij computes a weighted sum of
the 3×3 nearest inputs, adds a bias, and passes the result through an activation
function. a) Here, the output h23 (shaded output) is a weighted sum of the nine
positions from x12 to x34 (shaded inputs). b) Different outputs are computed
by translating the kernel across the image grid in two dimensions. c–d) With
zero-padding, positions beyond the image’s edge are considered to be zero.

10.4 Downsampling and upsampling


The network in figure 10.7 increased receptive field size by scaling down the representa-
tion at each layer using stride two convolutions. We now consider methods for scaling
down or downsampling 2D input representations. We also describe methods for scaling
them back up (upsampling), which is useful when the output is also an image. Finally,
we consider methods to change the number of channels between layers. This is helpful
when recombining representations from two branches of a network (chapter 11).

10.4.1 Downsampling

There are three main approaches to scaling down a 2D representation. Here, we consider
the most common case of scaling down both dimensions by a factor of two. First, we


Figure 10.10 2D convolution applied to an image. The image is treated as a 2D
input with three channels corresponding to the red, green, and blue components.
With a 3×3 kernel, each pre-activation in the first hidden layer is computed by
pointwise multiplying the 3×3×3 kernel weights with the 3×3 RGB image patch
centered at the same position, summing, and adding the bias. To calculate all
the pre-activations in the hidden layer, we “slide” the kernel over the image in
both horizontal and vertical directions. The output is a 2D layer of hidden units.
To create multiple output channels, we would repeat this process with multiple
kernels, resulting in a 3D tensor of hidden units at hidden layer H1 .

can sample every other position. When we use a stride of two, we effectively apply this
method simultaneously with the convolution operation (figure 10.11a).
Second, max pooling retains the maximum of the 2×2 input values (figure 10.11b).
This induces some invariance to translation; if the input is shifted by one pixel, many
of these maximum values remain the same. Finally, mean pooling or average pooling
averages the inputs. For all approaches, we apply downsampling separately to each
channel, so the output has half the width and height but the same number of channels.
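All three options are one-liners in a standard library (a PyTorch sketch with arbitrary sizes); each halves the height and width and leaves the channel count unchanged:

import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)                  # (batch, channels, height, width)

print(nn.MaxPool2d(kernel_size=2)(x).shape)    # torch.Size([1, 8, 8, 8]): max pooling
print(nn.AvgPool2d(kernel_size=2)(x).shape)    # torch.Size([1, 8, 8, 8]): mean pooling
print(x[:, :, ::2, ::2].shape)                 # torch.Size([1, 8, 8, 8]): sub-sampling every other position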

10.4.2 Upsampling

The simplest way to scale up a network layer to double the resolution is to duplicate
all the channels at each spatial position four times (figure 10.12a). A second method
is max unpooling; this is used where we have previously used a max pooling operation
for downsampling, and we distribute the values to the positions they originated from
(figure 10.12b). A third approach uses bilinear interpolation to fill in the missing values
between the points where we have samples (figure 10.12c).
A fourth approach is roughly analogous to downsampling using a stride of two. In
that method, there were half as many outputs as inputs, and for kernel size three, each
output was a weighted sum of the three closest inputs (figure 10.13a). In transposed
convolution, this picture is reversed (figure 10.13c). There are twice as many outputs


Figure 10.11 Methods for scaling down representation size (downsampling). a)
Sub-sampling. The original 4×4 representation (left) is reduced to size 2×2 (right)
by retaining every other input. Colors on the left indicate which inputs contribute
to the outputs on the right. This is effectively what happens with a kernel of stride
two, except that the intermediate values are never computed. b) Max pooling.
Each output comprises the maximum value of the corresponding 2×2 block. c)
Mean pooling. Each output is the mean of the values in the 2×2 block.

Figure 10.12 Methods for scaling up representation size (upsampling). a) The
simplest way to double the size of a 2D layer is to duplicate each input four
times. b) In networks where we have previously used a max pooling operation
(figure 10.11b), we can redistribute the values to the same positions they originally
came from (i.e., where the maxima were). This is known as max unpooling. c) A
third option is bilinear interpolation between the input values.

Figure 10.13 Transposed convolution in 1D. a) Downsampling with kernel size
three, stride two, and zero-padding. Each output is a weighted sum of three
inputs (arrows indicate weights). b) This can be expressed by a weight matrix
(same color indicates shared weight). c) In transposed convolution, each input
contributes three values to the output layer, which has twice as many outputs as
inputs. d) The associated weight matrix is the transpose of that in panel (b).


as inputs, and each input contributes to three of the outputs. When we consider the
associated weight matrix of this upsampling mechanism (figure 10.13d), we see that it is
the transpose of the matrix for the downsampling mechanism (figure 10.13b).
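The following NumPy sketch (illustrative kernel values and sizes) builds the weight matrix of the stride-two, kernel-size-three downsampling in figure 10.13b and then upsamples by multiplying with its transpose, as in figure 10.13d:

import numpy as np

# Weight matrix for 1D convolution with kernel size three, stride two, and zero-padding
# (figure 10.13b); the kernel values are illustrative.
omega = np.array([1.0, 2.0, 3.0])
D = 6                                            # number of inputs
down = np.zeros((D // 2, D))
for i in range(D // 2):
    for j, w in zip(range(2 * i - 1, 2 * i + 2), omega):
        if 0 <= j < D:
            down[i, j] = w

x = np.random.randn(D)
z = down @ x                                     # downsampling: half as many outputs as inputs

# Transposed convolution: apply the transpose of the same weight matrix (figure 10.13d).
x_up = down.T @ z                                # upsampling: twice as many outputs as inputs
print(down.shape, x_up.shape)                    # (3, 6) (6,)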

10.4.3 Changing the number of channels

Sometimes we want to change the number of channels between one hidden layer and the
next without further spatial pooling. This is usually so we can combine the representation
with another parallel computation (see chapter 11). To accomplish this, we apply a
convolution with kernel size one. Each element of the output layer is computed by
taking a weighted sum of all the channels at the same position (figure 10.14). We can
repeat this multiple times with different weights to generate as many output channels as
we need. The associated convolution weights have size 1 × 1 × Ci × Co . Hence, this is
known as 1×1 convolution. Combined with a bias and activation function, it is equivalent
to running the same fully connected network on the input channels at every position.
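A PyTorch sketch (with arbitrary channel counts): the 1×1 convolution changes the number of channels while leaving the spatial dimensions untouched.

import torch
import torch.nn as nn

Ci, Co = 256, 64
x = torch.randn(1, Ci, 28, 28)

proj = nn.Conv2d(Ci, Co, kernel_size=1)        # 1x1 convolution: 1 * 1 * Ci * Co weights
print(proj(x).shape)                           # torch.Size([1, 64, 28, 28]): same spatial size, fewer channels
print(proj.weight.shape)                       # torch.Size([64, 256, 1, 1])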

10.5 Applications

We conclude by describing three computer vision applications. We describe convolu-
tional networks for image classification where the goal is to assign the image to one of a
predetermined set of categories. Then we consider object detection, where the goal is to
identify multiple objects in an image and find the bounding box around each. Finally,
we describe an early system for semantic segmentation where the goal is to assign a label
to each pixel according to which object is present.

10.5.1 Image classification

Much of the pioneering work on deep learning in computer vision focused on image
classification using the ImageNet dataset (figure 10.15). This contains 1,281,167 training
images, 50,000 validation images, and 100,000 test images, and every image is labeled as
belonging to one of 1000 possible categories.
Most methods reshape the input images to a standard size; in a typical system,
the input x to the network is a 224×224 RGB image, and the output is a probability
distribution over the 1000 classes. The task is challenging; there are a large number
of classes, and they exhibit considerable variation (figure 10.15). In 2011, before deep
networks were applied, the state-of-the-art method had a top-5 error rate of ∼25% on the test
images (i.e., the correct class was not among its five most likely suggestions). Five years later, the best
deep learning models eclipsed human performance.
In 2012, AlexNet was the first convolutional network to perform well on this task.
It consists of eight hidden layers with ReLU activation functions, of which the first
five are convolutional and the rest fully connected (figure 10.16). The network starts by


Figure 10.14 1×1 convolution. To change the number of channels without spatial
pooling, we apply a 1×1 kernel. Each output channel is computed by taking
a weighted sum of all of the channels at the same position, adding a bias, and
passing through an activation function. Multiple output channels are created by
repeating this operation with different weights and biases.

Figure 10.15 Example ImageNet classification images. The model aims to assign
an input image to one of 1000 classes. This task is challenging because the
images vary widely along different attributes (columns). These include rigidity
(monkey < canoe), number of instances in image (lizard < strawberry), clutter
(compass < steel drum), size (candle < spiderweb), texture (screwdriver < leopard),
distinctiveness of color (mug < red wine), and distinctiveness of shape (headland
< bell). Adapted from Russakovsky et al. (2015).


Figure 10.16 AlexNet (Krizhevsky et al.,
2012). The network maps a 224 × 224
color image to a 1000-dimensional vec-
tor representing class probabilities. The
network first convolves with 11×11 ker-
nels and stride 4 to create 96 channels.
It decreases the resolution again using a
max pool operation and applies a 5×5
convolutional layer. Another max pool-
ing layer follows, and three 3×3 convo-
lutional layers are applied. After a fi-
nal max pooling operation, the result
is vectorized and passed through three
fully connected (FC) layers and finally
the softmax layer.

downsampling the input using an 11×11 kernel with a stride of four to create 96 channels.
It then downsamples again using a max pooling layer before applying a 5×5 kernel to
create 256 channels. There are three more convolutional layers with kernel size 3×3,
eventually resulting in a 13×13 representation with 256 channels. A final max-pooling
layer yields a 6×6 representation with 256 channels which is resized into a vector of
length 9,216 and passed through three fully connected layers containing 4096, 4096, and
1000 hidden units, respectively. The last layer is passed through the softmax function to
output a probability distribution over the 1000 classes. The complete network contains
∼60 million parameters, most of which are in the fully connected layers.
The dataset size was augmented by a factor of 2048 using (i) spatial transformations
and (ii) modifications of the input intensities. At test time, five different cropped and
mirrored versions of the image were run through the network, and their predictions
averaged. The system was learned using SGD with a momentum coefficient of 0.9 and a
batch size of 128. Dropout was applied in the fully connected layers, and an L2 (weight
decay) regularizer was used. This system achieved a 16.4% top-5 error rate and a 38.1%
top-1 error rate. At the time, this was an enormous leap forward in performance at a task
considered far beyond the capabilities of contemporary methods. This result revealed
the potential of deep learning and kick-started the modern era of AI research.
The VGG network was also targeted at classification in the ImageNet task and
achieved a considerably better performance of 6.8% top-5 error rate and a 23.7% top-1
error rate. This network is similarly composed of a series of interspersed convolutional
and max pooling layers, where the spatial size of the representation gradually decreases,
but the number of channels increase. These are followed by three fully connected layers
(figure 10.17). The VGG network was also trained using data augmentation, weight
decay, and dropout.
Although there were various minor differences in the training regime, the most impor-
tant change between AlexNet and VGG was the depth of the network. The latter used
19 hidden layers and 144 million parameters. The networks in figures 10.16 and 10.17
are depicted at the same scale for comparison. There was a general trend for several
years for performance on this task to improve as the depth of the networks increased,
and this is evidence that depth is important in neural networks.


Figure 10.17 VGG network (Simonyan & Zisserman, 2014) depicted at the same
scale as AlexNet (see figure 10.16). This network consists of a series of convolu-
tional layers and max pooling operations, in which the spatial scale of the rep-
resentation gradually decreases, but the number of channels gradually increases.
The hidden layer after the last convolutional operation is resized to a 1D vector
and three fully connected layers follow. The network outputs 1000 activations
corresponding to the class labels that are passed through a softmax function to
create class probabilities.

10.5.2 Object detection

In object detection, the goal is to identify and localize multiple objects within the image.
An early method based on convolutional networks was You Only Look Once, or YOLO
for short. The input to the YOLO network is a 448×448 RGB image. This is passed
through 24 convolutional layers that gradually decrease the representation size using
max pooling operations while concurrently increasing the number of channels, similarly
to the VGG network. The final convolutional layer is of size 7 × 7 and has 1024 channels.
This is reshaped to a vector, and a fully connected layer maps it to 4096 values. One
further fully connected layer maps this representation to the output.
The output values encode which class is present at each of a 7×7 grid of locations
(figure 10.18a–b). For each location, the output values also encode a fixed number of
bounding boxes. Five parameters define each box: the x- and y-positions of the center,
the height and width of the box, and the confidence of the prediction (figure 10.18c).
The confidence estimates the overlap between the predicted and ground truth bound-
ing boxes. The system is trained using momentum, weight decay, dropout, and data
augmentation. Transfer learning is employed; the network is initially trained on the
ImageNet classification task and is then fine-tuned for object detection.
After the network is run, a heuristic process is used to remove rectangles with low
confidence and to suppress predicted bounding boxes that correspond to the same object
so only the most confident one is retained.


Figure 10.18 YOLO object detection. a) The input image is reshaped to 448×448
and divided into a regular 7×7 grid. b) The system predicts the most likely class
at each grid cell. c) It also predicts two bounding boxes per cell, and a confidence
value (represented by thickness of line). d) During inference, the most likely
bounding boxes are retained, and boxes with lower confidence values that belong
to the same object are suppressed. Adapted from Redmon et al. (2016).

10.5.3 Semantic segmentation

The goal of semantic segmentation is to assign a label to each pixel according to the object
that it belongs to or no label if that pixel does not correspond to anything in the training
database. An early network for semantic segmentation is depicted in figure 10.19. The
input is a 224×224 RGB image, and the output is a 224×224×21 array that contains
the probability of each of 21 possible classes at each position.
The first part of the network is a smaller version of VGG (figure 10.17) that contains
thirteen rather than sixteen convolutional layers and downsizes the representation to size
14×14. There is then one more max pooling operation, followed by two fully connected
layers that map to two 1D representations of size 4096. These layers do not represent
spatial position but instead, combine information from across the whole image.
Here, the architecture diverges from VGG. Another fully connected layer reconsti-
tutes the representation into 7×7 spatial positions and 512 channels. This is followed


Figure 10.19 Semantic segmentation network of Noh et al. (2015). The input is a
224×224 image, which is passed through a version of the VGG network and even-
tually transformed into a representation of size 4096 using a fully connected layer.
This contains information about the entire image. This is then reformed into a
representation of size 7×7 using another fully connected layer, and the image is
upsampled and deconvolved (transposed convolutions without upsampling) in a
mirror image of the VGG network. The output is a 224×224×21 representation
that gives the output probabilities for the 21 classes at each position.

by a series of max unpooling layers (see figure 10.12b) and deconvolution layers. These
are transposed convolutions (see figure 10.13) but in 2D and without the upsampling.
Finally, there is a 1×1 convolution to create 21 channels representing the possible classes
and a softmax operation at each spatial position to map the activations to class proba-
bilities. The downsampling side of the network is sometimes referred to as an encoder,
and the upsampling side as a decoder, so networks of this type are sometimes called
encoder-decoder networks or hourglass networks due to their shape.
The final segmentation is generated using a heuristic method that greedily searches
for the class that is most represented and infers its region, taking into account the
probabilities but also encouraging connectedness. Then the next most-represented class
is added where it dominates at the remaining unlabeled pixels. This continues until there
is insufficient evidence to add more (figure 10.20).

10.6 Summary
In convolutional layers, each hidden unit is computed by taking a weighted sum of the
nearby inputs, adding a bias, and applying an activation function. The weights and
the bias are the same at every spatial position, so there are far fewer parameters than
in a fully connected network, and the number of parameters doesn’t increase with the
input image size. To ensure that information is not lost, this operation is repeated with


Figure 10.20 Semantic segmentation results. The final result is created from the
21 probability maps by greedily selecting the best class and using a heuristic
method to find a sensible binary map based on the probabilities and their spatial
proximity. If there is enough evidence, subsequent classes are added, and their
segmentation maps are combined. Adapted from Noh et al. (2015).

different weights and biases to create multiple channels at each spatial position.
Typical convolutional networks consist of convolutional layers interspersed with layers
that downsample by a factor of two. As a data example passes through the network, the
spatial dimensions usually decrease by factors of two, and the channels increase by factors
of two. At the end of the network, there are typically one or more fully connected layers
that integrate information from across the entire input and create the desired output. If
the output is an image, a mirrored “decoder” upsamples back to the original size.
The translational equivariance of convolutional layers imposes a useful inductive bias
that increases performance for image-based tasks relative to fully connected networks.
We described image classification, object detection, and semantic segmentation networks.
Image classification performance was shown to improve as the network became deeper.
However, subsequent experiments showed that increasing the network depth indefinitely
doesn’t continue to help; after a certain depth, the system becomes difficult to train.
This is the motivation for residual connections, which are the topic of the next chapter.

Notes

Dumoulin & Visin (2016) present an overview of the mathematics of convolutions that expands
on the brief treatment in this chapter.

Convolutional networks: Early convolutional networks were developed by Fukushima &
Miyake (1982), LeCun et al. (1989a), and LeCun et al. (1989b). Initial applications included


handwriting recognition (LeCun et al., 1989a; Martin, 1993), face recognition (Lawrence et al.,
1997), phoneme recognition (Waibel et al., 1989), spoken word recognition (Bottou et al., 1990),
and signature verification (Bromley et al., 1993). However, convolutional networks were popu-
larized by LeCun et al. (1998), who built a system called LeNet for classifying 28×28 grayscale
images of handwritten digits. This is immediately recognizable as a precursor of modern net-
works; it uses a series of convolutional layers, followed by fully connected layers, sigmoid activa-
tions rather than ReLUs, and average pooling rather than max pooling. AlexNet (Krizhevsky
et al., 2012) is widely considered the starting point for modern deep convolutional networks.

ImageNet Challenge: Deng et al. (2009) collated the ImageNet database and the associated
classification challenge drove progress in deep learning for several years after AlexNet. Notable
subsequent winners of this challenge include the network-in-network architecture (Lin et al.,
2014), which alternated convolutions with fully connected layers that operated independently
on all of the channels at each position (i.e., 1×1 convolutions). Zeiler & Fergus (2014) and
Simonyan & Zisserman (2014) trained larger and deeper architectures that were fundamentally
similar to AlexNet. Szegedy et al. (2017) developed an architecture called GoogLeNet, which
introduced inception blocks. These use several parallel paths with different filter sizes, which
are then recombined. This effectively allowed the system to learn the filter size.
The trend was for performance to improve with increasing depth. However, it ultimately became
difficult to train deeper networks without modifications; these include residual connections
and normalization layers, both of which are described in the next chapter. Progress in the
ImageNet challenges is summarized in Russakovsky et al. (2015). A more general survey of
image classification using convolutional networks can be found in Rawat & Wang (2017). The
improvement of image classification networks over time is visualized in figure 10.21.

Types of convolutional layers: Atrous or dilated convolutions were introduced by Chen
et al. (2018c) and Yu & Koltun (2015). Transposed convolutions were introduced by Long et al.
(2015). Odena et al. (2016) pointed out that they can lead to checkerboard artifacts and should
be used with caution. Lin et al. (2014) is an early example of convolution with 1×1 filters.
Many variants of the standard convolutional layer aim to reduce the number of parameters.
These include depthwise or channel-separate convolution (Howard et al., 2017; Tran et al., 2018),
in which a different filter convolves each channel separately to create a new set of channels. For
a kernel size of K × K with C input channels and C output channels, this requires K × K × C
parameters rather than the K × K × C × C parameters in a regular convolutional layer. A
related approach is grouped convolutions (Xie et al., 2017), where each convolution kernel is
only applied to a subset of the channels with a commensurate reduction in the parameters. In
fact, grouped convolutions were used in AlexNet for computational reasons; the whole network
could not run on a single GPU, so some channels were processed on one GPU and some on
another, with limited interaction points. Separable convolutions treat each kernel as an outer
product of 1D vectors; they use C + K + K parameters for each of the C channels. Partial
convolutions (Liu et al., 2018a) are used when inpainting missing pixels and account for the
partial masking of the input. Gated convolutions learn the mask from the previous layer (Yu
et al., 2019; Chang et al., 2019b). Hu et al. (2018b) propose squeeze-and-excitation networks
which re-weight the channels using information pooled across all spatial positions.
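These parameter counts are easy to verify with a library's groups argument (a PyTorch sketch; C = 64 and K = 3 are arbitrary choices):

import torch.nn as nn

C, K = 64, 3
regular = nn.Conv2d(C, C, kernel_size=K, bias=False)               # K*K*C*C weights
depthwise = nn.Conv2d(C, C, kernel_size=K, groups=C, bias=False)   # K*K*C weights: one filter per channel
grouped = nn.Conv2d(C, C, kernel_size=K, groups=4, bias=False)     # K*K*C*C/4 weights

for conv in (regular, depthwise, grouped):
    print(conv.weight.numel())                                     # 36864, 576, 9216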

Downsampling and upsampling: Average pooling dates back to at least LeCun et al. (1989a)
and max pooling to Zhou & Chellappa (1988). Scherer et al. (2010) compared these methods
and concluded that max pooling was superior. The max unpooling method was introduced by
Zeiler et al. (2011) and Zeiler & Fergus (2014). Max pooling can be thought of as applying


Figure 10.21 ImageNet performance. Each circle represents a different published
model. Blue circles represent models that were state-of-the-art. Models dis-
cussed in this book are also highlighted. The AlexNet and VGG networks were
remarkable for their time but are now far from state of the art. ResNet-200 and
DenseNet are discussed in chapter 11. ImageGPT, ViT, SWIN, and DaViT are
discussed in chapter 12. Adapted from https://paperswithcode.com/sota/image-classification-on-imagenet.

an L∞ norm to the hidden units that are to be pooled. This led to applying other Lk norms
(Springenberg et al., 2015; Sainath et al., 2013), although these require more computation and
are not widely used. Zhang (2019) introduced max-blur-pooling, in which a low-pass filter is
applied before downsampling to prevent aliasing, and showed that this improves generalization
over translation of the inputs and protects against adversarial attacks (see section 20.4.6).
Shi et al. (2016) introduced PixelShuffle, which used convolutional filters with a stride of 1/s
to scale up 1D signals by a factor of s. Only the weights that lie exactly on positions are
used to create the outputs, and the ones that fall between positions are discarded. This can
be implemented by multiplying the number of channels in the kernel by a factor of s, where
the sth output position is computed from just the sth subset of channels. This can be trivially
extended to 2D convolution, which requires s2 channels.

Convolution in 1D and 3D: Convolutional networks are usually applied to images but have
also been applied to 1D data in applications that include speech recognition (Abdel-Hamid
et al., 2012), sentence classification (Zhang et al., 2015; Conneau et al., 2017), electrocardiogram
classification (Kiranyaz et al., 2015), and bearing fault diagnosis (Eren et al., 2019). A survey
of 1D convolutional networks can be found in Kiranyaz et al. (2021). Convolutional networks
have also been applied to 3D data, including video (Ji et al., 2012; Saha et al., 2016; Tran et al.,
2015) and volumetric measurements (Wu et al., 2015b; Maturana & Scherer, 2015).

Invariance and equivariance: Part of the motivation for convolutional layers is that they
are approximately equivariant with respect to translation, and part of the motivation for max


pooling is to induce invariance to small translations. Zhang (2019) considers the degree to
which convolutional networks really have these properties and proposes the max-blur-pooling
modification that demonstrably improves them. There is considerable interest in making net-
works equivariant or invariant to other types of transformations, such as reflections, rotations,
and scaling. Sifre & Mallat (2013) constructed a system based on wavelets that induced both
translational and rotational invariance in image patches and applied this to texture classifica-
tion. Kanazawa et al. (2014) developed locally scale-invariant convolutional neural networks.
Cohen & Welling (2016) exploited group theory to construct group CNNs, which are equivariant
to larger families of transformations, including reflections and rotations. Esteves et al. (2018)
introduced polar transformer networks, which are invariant to translations and equivariant to
rotation and scale. Worrall et al. (2017) developed harmonic networks, the first example of a
group CNN that was equivariant to continuous rotations.

Initialization and regularization: Convolutional networks are typically initialized using
Xavier initialization (Glorot & Bengio, 2010) or He initialization (He et al., 2015), as described
in section 7.5. However, the ConvolutionOrthogonal initializer (Xiao et al., 2018a) is special-
ized for convolutional networks. Networks of up to 10,000 layers can be trained using this
initialization without the need for residual connections.
Dropout is effective for fully connected networks but less so for convolutional layers (Park &
Kwak, 2016). This may be because neighboring image pixels are highly correlated, so if a hidden
unit drops out, the same information is passed on via adjacent positions. This is the motivation
for spatial dropout and cutout. In spatial dropout (Tompson et al., 2015), entire feature maps
are discarded instead of individual pixels. This circumvents the problem of neighboring pixels
carrying the same information. Similarly, DeVries & Taylor (2017b) propose cutout, in which a
square patch of each input image is masked at training time. Wu & Gu (2015) modified max
pooling for dropout layers using a method that involves sampling from a probability distribution
over the constituent elements rather than always taking the maximum.

Adaptive Kernels: The inception block (Szegedy et al., 2017) applies convolutional filters of
different sizes in parallel and, as such, provides a crude mechanism by which the network can
learn the appropriate filter size. Other work has investigated learning the scale of convolutions
as part of the training process (e.g., Pintea et al., 2021; Romero et al., 2021) or the stride of
downsampling layers (Riad et al., 2022).
In some systems, the kernel size is changed adaptively based on the data. This is sometimes in
the context of guided convolution, where one input is used to help guide the computation from
another input. For example, an RGB image might be used to help upsample a low-resolution
depth map. Jia et al. (2016) directly predicted the filter weights themselves using a different
network branch. Xiong et al. (2020b) change the kernel size adaptively. Su et al. (2019a)
moderate weights of fixed kernels by a function learned from another modality. Dai et al.
(2017) learn offsets of weights so that they do not have to be applied in a regular grid.

Object detection and semantic segmentation: Object detection methods can be divided
into proposal-based and proposal-free schemes. In the former case, processing occurs in two
stages. A convolutional network ingests the whole image and proposes regions that might
contain objects. These proposal regions are then resized, and a second network analyzes them
to establish whether there is an object there and what it is. An early example of this approach
was R-CNN (Girshick et al., 2014). This was subsequently extended to allow end-to-end training
(Girshick, 2015) and to reduce the cost of the region proposals (Ren et al., 2015). Subsequent
work on feature pyramid networks improved both performance and speed by combining features


across multiple scales (Lin et al., 2017b). In contrast, proposal-free schemes perform all the
processing in a single pass. YOLO (Redmon et al., 2016), which was described in section 10.5.2,
is the most celebrated example of a proposal-free scheme. The most recent iteration of this
framework at the time of writing is YOLOv7 (Wang et al., 2022a). A recent review of object
detection can be found in Zou et al. (2023).
The semantic segmentation network described in section 10.5.3 was developed by Noh et al.
(2015). Many subsequent approaches have been variations of U-Net (Ronneberger et al., 2015),
which is described in section 11.5.3. Recent surveys of semantic segmentation can be found in
Minaee et al. (2021) and Ulku & Akagündüz (2022).

Visualizing Convolutional Networks: The dramatic success of convolutional networks led
to a series of efforts to visualize the information they extract from the image (see Qin et al., 2018,
for a review). Erhan et al. (2009) visualized the optimal stimulus that activated a hidden unit
by starting with an image containing noise and then optimizing the input to make the hidden
unit most active using gradient ascent. Zeiler & Fergus (2014) trained a network to reconstruct
the input and then set all the hidden units to zero except the one they were interested in;
the reconstruction then provides information about what drives the hidden unit. Mahendran
& Vedaldi (2015) visualized an entire layer of a network. Their network inversion technique
aimed to find an image that resulted in the activations at that layer but also incorporates prior
knowledge that encourages this image to have similar statistics to natural images.
Finally, Bau et al. (2017) introduced network dissection. Here, a series of images with known
pixel labels capturing color, texture, and object type are passed through the network, and the
correlation of a hidden unit with each property is measured. This method has the advantage
that it only uses the forward pass of the network and does not require optimization. These
methods did provide some partial insight into how the network processes images. For example,
Bau et al. (2017) showed that earlier layers correlate more with texture and color and later
layers with the object type. However, it is fair to say that fully understanding the processing
of networks containing millions of parameters is currently not possible.

Problems
Problem 10.1∗ Show that the operation in equation 10.3 is equivariant with respect to translation.

Problem 10.2 Equation 10.3 defines 1D convolution with a kernel size of three, stride of one,
and dilation one. Write out the equivalent equation for the 1D convolution with a kernel size
of three and a stride of two as pictured in figure 10.3a–b.

Problem 10.3 Write out the equation for the 1D dilated convolution with a kernel size of three
and a dilation rate of two, as pictured in figure 10.3d.

Problem 10.4 Write out the equation for a 1D convolution with a kernel size of seven, a dilation rate of three, and a stride of three.

Problem 10.5 Draw weight matrices in the style of figure 10.4d for (i) the strided convolution
in figure 10.3a–b, (ii) the convolution with kernel size 5 in figure 10.3c, and (iii) the dilated
convolution in figure 10.3d.


Problem 10.6∗ Draw a 12×6 weight matrix in the style of figure 10.4d relating inputs x1 , . . . , x6
to outputs h1 , . . . , h12 in the multi-channel convolution as depicted in figures 10.5a–b.

Problem 10.7∗ Draw a 6×12 weight matrix in the style of figure 10.4d relating inputs h1 , . . . , h12
to outputs h′1 , . . . , h′6 in the multi-channel convolution in figure 10.5c.

Problem 10.8 Consider a 1D convolutional network where the input has three channels. The
first hidden layer is computed using a kernel size of three and has four channels. The second
hidden layer is computed using a kernel size of five and has ten channels. How many biases and
how many weights are needed for each of these two convolutional layers?

Problem 10.9 A network consists of three 1D convolutional layers. At each layer, a zero-padded
convolution with kernel size three, stride one, and dilation one is applied. What size is the
receptive field of the hidden units in the third layer?

Problem 10.10 A network consists of three 1D convolutional layers. At each layer, a zero-
padded convolution with kernel size seven, stride one, and dilation one is applied. What size is
the receptive field of hidden units in the third layer?

Problem 10.11 Consider a convolutional network with 1D input x. The first hidden layer H1 is
computed using a convolution with kernel size five, stride two, and a dilation rate of one. The
second hidden layer H2 is computed using a convolution with kernel size three, stride one, and
a dilation rate of one. The third hidden layer H3 is computed using a convolution with kernel
size five, stride one, and a dilation rate of two. What are the receptive field sizes at each hidden
layer?

Problem 10.12 The 1D convolutional network in figure 10.7 was trained using stochastic gradient
descent with a learning rate of 0.01 and a batch size of 100 on a training dataset of 4,000 examples
for 100,000 steps. How many epochs was the network trained for?

Problem 10.13 Draw a weight matrix in the style of figure 10.4d that shows the relationship
between the 24 inputs and the 24 outputs in figure 10.9.

Problem 10.14 Consider a 2D convolutional layer with kernel size 5×5 that takes 3 input
channels and returns 10 output channels. How many convolutional weights are there? How
many biases?

Problem 10.15 Draw a weight matrix in the style of figure 10.4d that samples every other
variable in a 1D input (i.e., the 1D analog of figure 10.11a). Show that the weight matrix for
1D convolution with kernel size three and stride two is equivalent to composing the matrices
for 1D convolution with kernel size three and stride one and this sampling matrix.

Problem 10.16∗ Consider the AlexNet network (figure 10.16). How many parameters are used
in each convolutional and fully connected layer? What is the total number of parameters?

Problem 10.17 What is the receptive field size at each of the first three layers of AlexNet (i.e.,
the first three orange blocks in figure 10.16)?

Problem 10.18 How many weights and biases are there at each convolutional layer and fully
connected layer in the VGG architecture (figure 10.17)?

Problem 10.19∗ Consider two hidden layers of size 224×224 with C1 and C2 channels, respectively, connected by a 3×3 convolutional layer. Describe how to initialize the weights using He initialization.



Chapter 11

Residual networks

The previous chapter described how image classification performance improved as the
depth of convolutional networks was extended from eight layers (AlexNet) to nineteen
layers (VGG). This led to experimentation with even deeper networks. However, performance decreased again when many more layers were added.
This chapter introduces residual blocks. Here, each network layer computes an additive change to the current representation instead of transforming it directly. This allows deeper networks to be trained but causes an exponential increase in the activation magnitudes at initialization. Residual blocks employ batch normalization to compensate for this, which re-centers and rescales the activations at each layer.
Residual blocks with batch normalization allow much deeper networks to be trained,
and these networks improve performance across a variety of tasks. Architectures that
combine residual blocks to tackle image classification, medical image segmentation, and
human pose estimation are described.
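As a one-line preview of the key idea, which the rest of this chapter develops properly, a residual layer adds a computed change to its input rather than replacing it; the layer f and the width below are arbitrary illustrative assumptions:

import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(20, 20), nn.ReLU(), nn.Linear(20, 20))  # any layer
h = torch.randn(1, 20)
h_next = h + f(h)  # additive change to the current representation, not a replacement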

11.1 Sequential processing

Every network we have seen so far processes the data sequentially; each layer receives
the previous layer’s output and passes the result to the next (figure 11.1). For example,
a three-layer network is defined by:

h1 = f1[x, ϕ1]
h2 = f2[h1, ϕ2]
h3 = f3[h2, ϕ3]
y = f4[h3, ϕ4],     (11.1)
where h1, h2, and h3 denote the intermediate hidden layers, x is the network input, y is the output, and the functions fk[•, ϕk] perform the processing.
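A minimal PyTorch sketch of this sequential composition (equation 11.1) follows; the layer widths and the choice of linear-plus-ReLU layers are arbitrary assumptions for illustration:

import torch
import torch.nn as nn

# Each function f_k consumes only the previous layer's output.
f1 = nn.Sequential(nn.Linear(10, 20), nn.ReLU())
f2 = nn.Sequential(nn.Linear(20, 20), nn.ReLU())
f3 = nn.Sequential(nn.Linear(20, 20), nn.ReLU())
f4 = nn.Linear(20, 1)

x = torch.randn(1, 10)
h1 = f1(x)   # h1 = f1[x, phi1]
h2 = f2(h1)  # h2 = f2[h1, phi2]
h3 = f3(h2)  # h3 = f3[h2, phi3]
y = f4(h3)   # y  = f4[h3, phi4]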
In a standard neural network, each layer consists of a linear transformation followed
by an activation function, and the parameters ϕk comprise the weights and biases of the
