
SIAM REVIEW                                  © 2019 SIAM. Published by SIAM under the terms
Vol. 61, No. 4, pp. 860–891                  of the Creative Commons 4.0 license

Deep Learning: An Introduction for Applied Mathematicians∗

Catherine F. Higham†
Desmond J. Higham‡

Abstract. Multilayered artificial neural networks are becoming a pervasive tool in a host of application
fields. At the heart of this deep learning revolution are familiar concepts from applied and
computational mathematics, notably from calculus, approximation theory, optimization,
and linear algebra. This article provides a very brief introduction to the basic ideas that
underlie deep learning from an applied mathematics perspective. Our target audience
includes postgraduate and final year undergraduate students in mathematics who are keen
to learn about the area. The article may also be useful for instructors in mathematics who
wish to enliven their classes with references to the application of deep learning techniques.
We focus on three fundamental questions: What is a deep neural network? How is a
network trained? What is the stochastic gradient method? We illustrate the ideas with
a short MATLAB code that sets up and trains a network. We also demonstrate the use
of state-of-the-art software on a large scale image classification problem. We finish with
references to the current literature.

Key words. back propagation, chain rule, convolution, image classification, neural network, overfit-
ting, sigmoid, stochastic gradient method, supervised learning

AMS subject classifications. 68T05, 65K10, 65D15

DOI. 10.1137/18M1165748

Contents

1 Motivation
2 Example of an Artificial Neural Network
3 The General Setup
4 Stochastic Gradient
5 Back Propagation
6 Full MATLAB Example
7 Image Classification Example
   7.1 Convolutional Neural Networks
   7.2 Avoiding Overfitting
   7.3 Activation and Cost Functions
   7.4 Image Classification Experiment
8 Of Things Not Treated
Acknowledgments
References

∗ Received by the editors January 17, 2018; accepted for publication (in revised form) February 7, 2019; published electronically November 6, 2019.
https://doi.org/10.1137/18M1165748
Funding: The work of the first author was supported by the EPSRC UK Quantum Technology Programme under grant EP/M01326X/1. The work of the second author was supported by EPSRC/RCUK Established Career Fellowship EP/M00158X/1 and by EPSRC Programme Grant EP/P020720/1.
† School of Computing Science, University of Glasgow, Glasgow G12 8RZ, UK (Catherine.Higham@glasgow.ac.uk).
‡ School of Mathematics, University of Edinburgh, Edinburgh EH9 3FD, UK (d.j.higham@ed.ac.uk).

1. Motivation. Most of us have come across the phrase deep learning. It relates
to a set of tools that have become extremely popular in a vast range of application
fields, from image recognition, speech recognition, and natural language processing
to targeted advertising and drug discovery. The field has grown to the extent where
sophisticated software packages are available in the public domain, many produced
by high-profile technology companies. Chip manufacturers are also customizing their
graphics processing units (GPUs) for kernels at the heart of deep learning.
Whether or not its current level of attention is fully justified, deep learning is
clearly a topic of interest to employers, and therefore to our students. Although there
are many useful resources available, we feel that there is a niche for a brief treatment
aimed at mathematical scientists. For a mathematics student, gaining some familiarity
with deep learning can enhance employment prospects. For mathematics educators,
slipping “Applications to Deep Learning” into the syllabus of a class on calculus,
approximation theory, optimization, linear algebra, or scientific computing is a great
way to attract students and maintain their interest in core topics. The area is also
ripe for independent project study.
There is no novel material in this article, and many topics are glossed over or
omitted. Our aim is to present some key ideas as clearly as possible while avoiding
nonessential detail. The treatment is aimed at readers with a background in math-
ematics who have completed a course in linear algebra and are familiar with partial
differentiation. Some experience of scientific computing is also desirable.
To keep the material concrete, we list and walk through a short MATLAB code
that illustrates the main algorithmic steps in setting up, training, and applying an
artificial neural network. We also demonstrate the high-level use of state-of-the-art
software on a larger scale problem.
Section 2 introduces some key ideas by creating and training an artificial neural
network using a simple example. Section 3 sets up some useful notation and defines a
general network. Training a network, which involves the solution of an optimization
problem, is the main computational challenge in this field. In section 4 we describe the
stochastic gradient method, a variation of a traditional optimization technique that
is designed to cope with very large scale sets of training data. Section 5 explains how
the partial derivatives needed for the stochastic gradient method can be computed
efficiently using back propagation. First-principles MATLAB code that illustrates
these ideas is provided in section 6. A larger scale problem is treated in section 7,
where we make use of existing software. Rather than repeatedly acknowledging work
throughout the text, we have chosen to place the bulk of our citations in section 8,

where pointers to the large and growing literature are given. In that section we also
raise issues that were not mentioned elsewhere and highlight some current hot topics.

Fig. 2.1 Labeled data points in R2. Circles denote points in category A. Crosses denote points in
category B.
2. Example of an Artificial Neural Network. This article takes a data fitting
view of artificial neural networks. To be concrete, consider the set of points shown
in Figure 2.1. This shows labeled data—some points are in category A, indicated by
circles, and the rest are in category B, indicated by crosses. For example, the data
may show oil drilling sites on a map, where category A denotes a successful outcome.
Can we use this data to categorize a newly proposed drilling site? Our job is to
construct a mapping that takes any point in R2 and returns either a circle or a cross.
Of course, there are many reasonable ways to construct such a mapping. The artificial
neural network approach uses repeated application of a simple, nonlinear function.
We will base our network on the sigmoid function
(2.1)   σ(x) = 1/(1 + e^{−x}),
which is illustrated in the upper half of Figure 2.2 over the interval −10 ≤ x ≤ 10.
We may regard σ(x) as a smoothed version of a step function, which itself mimics the
behavior of a neuron in the brain—firing (giving output equal to one) if the input is
large enough, and remaining inactive (giving output equal to zero) otherwise. The
sigmoid also has the convenient property that its derivative takes the simple form
(2.2)   σ′(x) = σ(x)(1 − σ(x)),
which is straightforward to verify. Other types of nonlinearity are discussed in sec-
tion 8.
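As a quick numerical check of (2.2) (our own snippet, not one of the paper's listings), we can compare the closed-form derivative with a centered finite difference in MATLAB:

% Verify the identity (2.2) numerically.
sigma = @(x) 1./(1+exp(-x));
x = linspace(-10,10,21);
h = 1e-6;
fd = (sigma(x+h) - sigma(x-h))/(2*h);   % centered difference approximation
exact = sigma(x).*(1-sigma(x));         % right-hand side of (2.2)
max(abs(fd - exact))                    % small, of order h^2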
The steepness and location of the transition in the sigmoid function may be
altered by scaling and shifting the argument or, in the language of neural networks,
by weighting and biasing the input. The lower plot in Figure 2.2 shows σ (3(x − 5)).
The factor 3 has sharpened the changeover and the shift −5 has altered its location.
To keep our notation manageable, we need to interpret the sigmoid function in a
vectorized sense. For z ∈ Rm , σ : Rm → Rm is defined by applying the sigmoid
function in the obvious componentwise manner, so that
(σ(z))_i = σ(z_i).


Fig. 2.2 Upper: sigmoid function (2.1). Lower: sigmoid with shifted and scaled input.

With this notation, we can set up layers of neurons. In each layer, every neuron
outputs a single real number, which is passed to every neuron in the next layer. At
the next layer, each neuron forms its own weighted combination of these values, adds
its own bias, and applies the sigmoid function. Introducing some mathematics, if the
real numbers produced by the neurons in one layer are collected into a vector, a, then
the vector of outputs from the next layer has the form

(2.3) σ(W a + b).

Here, W is a matrix and b is a vector. We say that W contains the weights and b
contains the biases. The number of columns in W matches the number of neurons
that produced the vector a at the previous layer. The number of rows in W matches
the number of neurons at the current layer. The number of components in b also
matches the number of neurons at the current layer. To emphasize the role of the ith
neuron in (2.3), we could pick out the ith component as
 
σ( ∑_j w_{ij} a_j + b_i ),

where the sum runs over all entries in a. Throughout this article, we will be switching
between the vectorized and componentwise viewpoints to strike a balance between
clarity and brevity.
In the next section, we introduce a full set of notation that allows us to define a
general network. Before reaching that stage, we will give a specific example. Figure 2.3
represents an artificial neural network with four layers. We will apply this form of
network to the problem defined by Figure 2.1. For the network in Figure 2.3 the first
(input) layer is represented by two circles, because our input data points have two
components. The second layer has two solid circles, indicating that two neurons are
being employed. The arrows from layer 1 to layer 2 indicate that both components
of the input data are made available to the two neurons in layer 2.

Fig. 2.3 A network with four layers.

Since the input data has the form x ∈ R2, the weights and biases for layer 2 may be
represented by
a matrix W [2] ∈ R2×2 and a vector b[2] ∈ R2 , respectively. The output from layer 2
then has the form
σ(W [2] x + b[2] ) ∈ R2 .
Layer 3 has three neurons, each receiving input in R2 , so the weights and biases
for layer 3 may be represented by a matrix W [3] ∈ R3×2 and a vector b[3] ∈ R3 ,
respectively. The output from layer 3 then has the form
 
σ( W [3] σ(W [2] x + b[2]) + b[3] ) ∈ R3.

The fourth (output) layer has two neurons, each receiving input in R3 , so the weights
and biases for this layer may be represented by a matrix W [4] ∈ R2×3 and a vector
b[4] ∈ R2, respectively. The output from layer 4, and hence from the overall network,
has the form
   
(2.4)   F(x) = σ( W [4] σ( W [3] σ(W [2] x + b[2]) + b[3] ) + b[4] ) ∈ R2.

The expression (2.4) defines a function F : R2 → R2 in terms of its 23 parameters—
the entries in the weight matrices and bias vectors. Recall that our aim is to produce
a classifier based on the data in Figure 2.1. We do this by optimizing over the pa-
rameters. We will require F (x) to be close to [1, 0]T for data points in category A
and close to [0, 1]T for data points in category B. Then, given a new point x ∈ R2 , it
would be reasonable to classify it according to the largest component of F (x); that is,
category A if F1 (x) > F2 (x) and category B if F1 (x) < F2 (x), with some rule to break
ties. This requirement on F may be specified through a cost function. Denoting the
ten data points in Figure 2.1 by {x{i}}_{i=1}^{10}, we use y(x{i}) for the target
output; that is,

(2.5)   y(x{i}) = [1, 0]T if x{i} is in category A,   and   y(x{i}) = [0, 1]T if x{i} is in category B.

Our cost function then takes the form


(2.6)   Cost( W [2], W [3], W [4], b[2], b[3], b[4] ) = (1/10) ∑_{i=1}^{10} (1/2) ||y(x{i}) − F(x{i})||_2^2 .


Here, the factor 1/2 is included for convenience; it simplifies matters when we start
differentiating. We emphasize that Cost is a function of the weights and biases—the
data points are fixed. The form in (2.6), where discrepancy is measured by averaging
the squared Euclidean norm over the data points, is often referred to as a quadratic
cost function. In the language of optimization, Cost is our objective function.
Choosing the weights and biases in a way that minimizes the cost function is
referred to as training the network. We note that, in principle, rescaling an objective
function does not change an optimization problem: We should arrive at the same
minimizer if we change Cost to, for example, 100 Cost or Cost/30, so the factors 1/10
and 1/2 in (2.6) should have no effect on the outcome.
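To make (2.4) and (2.6) concrete, here is a short sketch of our own (separate from the listings in section 6) that evaluates F(x) with randomly initialized parameters and computes the cost over ten placeholder data points:

% Evaluate the four-layer network (2.4) and the cost (2.6) for random parameters.
sigma = @(z) 1./(1+exp(-z));
W2 = randn(2,2); W3 = randn(3,2); W4 = randn(2,3);
b2 = randn(2,1); b3 = randn(3,1); b4 = randn(2,1);
F = @(x) sigma(W4*sigma(W3*sigma(W2*x + b2) + b3) + b4);
X = rand(2,10);                                     % ten data points in R^2
Y = [ones(1,5) zeros(1,5); zeros(1,5) ones(1,5)];   % targets, as in (2.5)
total = 0;
for i = 1:10
    total = total + 0.5*norm(Y(:,i) - F(X(:,i)))^2;
end
cost = total/10                                     % the value of (2.6)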
For the data in Figure 2.1, we used the MATLAB optimization toolbox to min-
imize the cost function (2.6) over the 23 parameters defining W [2] , W [3] , W [4] , b[2] ,
b[3] , and b[4] . More precisely, we used the nonlinear least-squares solver lsqnonlin.
For the trained network, Figure 2.4 shows the boundary where F1 (x) > F2 (x). So,
with this approach, any point in the shaded region would be assigned to category A
and any point in the unshaded region to category B.
Figure 2.5 shows how the network responds to additional training data. Here we
added one further category B point, indicated by the extra cross at (0.3, 0.7), and
re-ran the optimization routine.

Fig. 2.4 Visualization of output from an artificial neural network applied to the data in Figure 2.1.

Fig. 2.5 Repeat of the experiment in Figure 2.4 with an additional data point.

The example illustrated in Figure 2.4 is small scale by the standards of today’s
deep learning tools. However, the underlying optimization problem, minimizing a
nonconvex objective function over 23 variables, is fundamentally difficult. We cannot
exhaustively search over a 23-dimensional parameter space, and we cannot guarantee
to find the global minimum of a nonconvex function. Indeed, some experimentation
with the location of the data points in Figure 2.4 and with the choice of initial guess
for the weights and biases makes it clear that lsqnonlin, with its default settings,
cannot always find an acceptable solution. This motivates the material in sections 4
and 5, where we focus on the optimization problem.
3. The General Setup. The four layer example in section 2 introduced the idea
of neurons, represented by the sigmoid function, acting in layers. At a general layer,
each neuron receives the same input—one real value from every neuron at the previous
layer—and produces one real value, which is passed to every neuron at the next layer.
There are two exceptional layers. At the input layer, there is no “previous layer”
and each neuron receives the input vector. At the output layer, there is no “next
layer” and these neurons provide the overall output. The layers in between these
two are called hidden layers. There is no special meaning to this phrase; it simply
indicates that these neurons are performing intermediate calculations. Deep learning
is a loosely defined term which implies that many hidden layers are being used.
We now spell out the general form of the notation from section 2. We suppose
that the network has L layers, with layers 1 and L being the input and output layers,
respectively. Suppose that layer l, for l = 1, 2, 3, . . . , L, contains nl neurons. So n1 is
the dimension of the input data. Overall, the network maps from Rn1 to RnL . We
use W [l] ∈ Rnl×nl−1 to denote the matrix of weights at layer l. More precisely, w[l]_jk is
the weight that neuron j at layer l applies to the output from neuron k at layer l − 1.
Similarly, b[l] ∈ Rnl is the vector of biases for layer l, so neuron j at layer l uses the
bias b[l]_j.
In Figure 3.1 we give an example with L = 5 layers. Here, n1 = 4, n2 = 3,
n3 = 4, n4 = 5, and n5 = 2, so W [2] ∈ R3×4 , W [3] ∈ R4×3 , W [4] ∈ R5×4 , W [5] ∈ R2×5 ,
b[2] ∈ R3 , b[3] ∈ R4 , b[4] ∈ R5 , and b[5] ∈ R2 .
Given an input x ∈ Rn1 , we may then neatly summarize the action of the network
by letting a[l]_j denote the output, or activation, from neuron j at layer l. So, we have

(3.1) a[1] = x ∈ Rn1 ,


 
(3.2)   a[l] = σ( W [l] a[l−1] + b[l] ) ∈ Rnl   for l = 2, 3, . . . , L.

It should be clear that (3.1) and (3.2) amount to an algorithm for feeding the input
forward through the network in order to produce an output a[L] ∈ RnL . At the end
of section 5 this algorithm appears within a pseudocode description of an approach
for training a network.
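As an illustration of (3.1) and (3.2) (our own sketch, separate from the code in section 6), the forward pass becomes a single loop once the weights and biases are stored in cell arrays:

% Forward pass (3.1)-(3.2) with weights and biases stored in cell arrays.
sigma = @(z) 1./(1+exp(-z));
n = [4 3 4 5 2];                  % layer sizes, as in Figure 3.1
L = numel(n);
W = cell(L,1); b = cell(L,1);
for l = 2:L
    W{l} = 0.5*randn(n(l),n(l-1));
    b{l} = 0.5*randn(n(l),1);
end
x = randn(n(1),1);                % an input vector
a = x;                            % (3.1): a^[1] = x
for l = 2:L
    a = sigma(W{l}*a + b{l});     % (3.2)
end
a                                 % the network output a^[L]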
Now suppose we have N pieces of data, or training points, in Rn1, {x{i}}_{i=1}^{N},
for which there are given target outputs {y(x{i})}_{i=1}^{N} in RnL. Generalizing (2.6),
the quadratic cost function that we wish to minimize has the form

(3.3)   Cost = (1/N) ∑_{i=1}^{N} (1/2) ||y(x{i}) − a[L](x{i})||_2^2 ,

where, to keep notation under control, we have not explicitly indicated that Cost is a
function of all the weights and biases.

Fig. 3.1 A network with five layers. The edge corresponding to the weight w[3]_43 is highlighted.
The output from neuron number 3 at layer 2 is weighted by the factor w[3]_43 when it
is fed into neuron number 4 at layer 3.

4. Stochastic Gradient. We saw in the previous two sections that training a
network corresponds to choosing the parameters, that is, the weights and biases, that
minimize the cost function. The weights and biases take the form of matrices and
vectors, but at this stage it is convenient to imagine them stored as a single vector that
we call p. The example in Figure 2.3 has a total of 23 weights and biases. Therefore,
in that case, p ∈ R23 . Generally, we will suppose p ∈ Rs and write the cost function in
(3.3) as Cost(p) to emphasize its dependence on the parameters. So Cost : Rs → R.
We now introduce a classical method in optimization that is often referred to
as steepest descent or gradient descent. The method proceeds iteratively, computing
a sequence of vectors in Rs with the aim of converging to a vector that minimizes
the cost function. Suppose that our current vector is p. How should we choose a
perturbation, ∆p, so that the next vector, p + ∆p, represents an improvement? If ∆p
is small, then ignoring terms of order ||∆p||^2, a Taylor series expansion gives

(4.1)   Cost(p + ∆p) ≈ Cost(p) + ∑_{r=1}^{s} ( ∂ Cost(p)/∂ p_r ) ∆p_r .

Here ∂ Cost(p)/∂ p_r denotes the partial derivative of the cost function with respect to
the rth parameter. For convenience, we will let ∇Cost(p) ∈ Rs denote the vector of
partial derivatives, known as the gradient, so that

(∇Cost(p))_r = ∂ Cost(p)/∂ p_r .

Then (4.1) becomes

(4.2) Cost(p + ∆p) ≈ Cost(p) + ∇Cost(p)T ∆p.


Our aim is to reduce the value of the cost function. The relation (4.2) motivates the
idea of choosing ∆p to make ∇Cost(p)T ∆p as negative as possible. We can address
this problem via the Cauchy–Schwarz inequality, which states that for any f, g ∈ Rs ,
we have |f^T g| ≤ ||f||_2 ||g||_2. So the most negative that f^T g can be is −||f||_2 ||g||_2,
which happens when f = −g. Hence, based on (4.2), we should choose ∆p to lie
in the direction −∇Cost(p). Keeping in mind that (4.2) is an approximation that is
relevant only for small ∆p, we will limit ourselves to a small step in that direction.
This leads to the update

(4.3) p → p − η∇Cost(p).

Here η is a small stepsize that, in this context, is known as the learning rate. This
equation defines the steepest descent method. We choose an initial vector and iterate
with (4.3) until some stopping criterion has been met, or until the number of iterations
has exceeded the computational budget.
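As a toy illustration of (4.3) (our own, using an arbitrary quadratic cost rather than a neural network), the steepest descent loop takes only a few lines:

% Steepest descent (4.3) on a toy cost, Cost(p) = 0.5*p'*A*p.
A = [3 1; 1 2];               % arbitrary symmetric positive definite matrix
gradCost = @(p) A*p;          % gradient of this toy cost
eta = 0.1;                    % learning rate
p = [1; -1];                  % initial guess
for iter = 1:100
    p = p - eta*gradCost(p);  % the update (4.3)
end
p                             % close to the minimizer [0; 0]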
Our cost function (3.3) involves a sum of individual terms that runs over the
training data. It follows that the partial derivative ∇Cost(p) is a sum over the training
data of individual partial derivatives. More precisely, let

(4.4)   C_{x{i}} = (1/2) ||y(x{i}) − a[L](x{i})||_2^2 .

Then, from (3.3),


(4.5)   ∇Cost(p) = (1/N) ∑_{i=1}^{N} ∇C_{x{i}}(p).

When we have a large number of parameters and a large number of training points,
computing the gradient vector (4.5) at every iteration of the steepest descent method
(4.3) can be prohibitively expensive. A much cheaper alternative is to replace the
mean of the individual gradients over all training points by the gradient at a single,
randomly chosen, training point. This leads to the simplest form of what is called the
stochastic gradient method. A single step may be summarized as follows:
1. Choose an integer i uniformly at random from {1, 2, 3, . . . , N }.
2. Update

(4.6)   p → p − η ∇C_{x{i}}(p).

In words, at each step, the stochastic gradient method uses one randomly chosen
training point to represent the full training set. As the iteration proceeds, the method
sees more training points, so there is some hope that this dramatic reduction in cost-
per-iteration will be worthwhile overall. We note that, even for very small η, the
update (4.6) is not guaranteed to reduce the overall cost function—we have traded
the mean for a single sample. Hence, although the phrase stochastic gradient descent
is widely used, we prefer to use stochastic gradient.
The version of the stochastic gradient method that we introduced in (4.6) is the
simplest from a large range of possibilities. In particular, the index i in (4.6) was
chosen by sampling with replacement—after using a training point, it is returned to
the training set and is just as likely as any other point to be chosen at the next step.
An alternative is to sample without replacement; that is, to cycle through each of the
N training points in a random order. Performing N steps in this manner, referred to
as completing an epoch, may be summarized as follows:

1. Shuffle the integers {1, 2, 3, . . . , N} into a new order, {k1, k2, k3, . . . , kN}.
2. For i = 1 upto N, update

(4.7)   p → p − η ∇C_{x{ki}}(p).

If we regard the stochastic gradient method as approximating the mean over all
training points in (4.5) by a single sample, then it is natural to consider a compromise
where we use a small sample average. For some m ≪ N we could take steps of the
following form:
1. Choose m integers, k1 , k2 , . . . , km , uniformly at random from {1, 2, 3, . . . , N }.
2. Update
(4.8)   p → p − η (1/m) ∑_{i=1}^{m} ∇C_{x{ki}}(p).

In this iteration, the set {x{ki}}_{i=1}^{m} is known as a minibatch. There is a without
replacement alternative where, assuming N = Km for some K, we split the training
set randomly into K distinct minibatches and cycle through them.
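As a self-contained sketch (ours, applied to a toy least-squares problem rather than a network), one run of the minibatch iteration (4.8) looks like this:

% Minibatch stochastic gradient (4.8) on a toy least-squares problem:
% C_{x{i}}(p) = 0.5*(X(:,i)'*p - t(i))^2 with data X and targets t.
N = 1000; X = randn(2,N); t = randn(1,N);
gradC = @(p,i) X(:,i)*(X(:,i)'*p - t(i));   % gradient of one term
p = zeros(2,1); eta = 0.01; m = 10;
for step = 1:500
    ks = randi(N,m,1);                      % m indices, sampled with replacement
    g = zeros(2,1);
    for i = 1:m
        g = g + gradC(p,ks(i));
    end
    p = p - eta*(g/m);                      % the update (4.8)
end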
Because the stochastic gradient method is usually implemented within the context
of a very large scale computation, algorithmic choices such as minibatch size and
the form of randomization are often driven by the requirements of high performance
computing architectures. Also, it is, of course, possible to dynamically vary these
choices, along with others such as the learning rate, as the training progresses in an
attempt to accelerate convergence. Indeed, an effective algorithm for specifying the
learning rate is a key ingredient for practical computations.
Section 6 gives a simple MATLAB code that uses a vanilla stochastic gradient
method. In section 7 we describe a state-of-the-art implementation and section 8 has
pointers to the current literature.
5. Back Propagation. We are now in a position to apply the stochastic gradient
method in order to train an artificial neural network. We switch from the general
vector of parameters, p, used in section 4 to the entries in the weight matrices and bias
vectors. Our task is to compute partial derivatives of the cost function with respect to
each w[l]_jk and b[l]_j. We have seen that the idea behind the stochastic gradient method
is to exploit the structure of the cost function: because (3.3) is a linear combination
of individual terms that runs over the training data, the same is true of its partial
derivatives. We therefore focus our attention on computing those individual partial
derivatives.
Hence, for a fixed training point we regard C_{x{i}} in (4.4) as a function of the
weights and biases. So we may drop the dependence on x{i} and simply write

(5.1)   C = (1/2) ||y − a[L]||_2^2 .
We recall from (3.2) that a[L] is the output from the artificial neural network. The
dependence of C on the weights and biases arises only through a[L] .
To derive worthwhile expressions for the partial derivatives, it is useful to intro-
duce two further sets of variables. First we let
(5.2) z [l] = W [l] a[l−1] + b[l] ∈ Rnl for l = 2, 3, . . . , L.
We refer to z[l]_j as the weighted input for neuron j at layer l. The fundamental relation
(3.2) that propagates information through the network may then be written as
 
(5.3)   a[l] = σ( z [l] )   for l = 2, 3, . . . , L.


Second, we let δ [l] ∈ Rnl be defined by

(5.4)   δ[l]_j = ∂C/∂z[l]_j   for 1 ≤ j ≤ nl and 2 ≤ l ≤ L.

This expression, which is often called the error in the jth neuron at layer l, is an
intermediate quantity that is useful for both analysis and computation. However, we
point out that this usage of the term error is somewhat ambiguous. At a general,
hidden layer, it is not clear how much to “blame” each neuron for discrepancies in the
final output. Also, at the output layer, L, the expression (5.4) does not quantify those
discrepancies directly. The idea of referring to δ[l]_j in (5.4) as an error seems to have
arisen because the cost function can only be at a minimum if all partial derivatives
are zero, so δ[l]_j = 0 is a useful goal. As we mention later in this section, it may be
more helpful to keep in mind that δ[l]_j measures the sensitivity of the cost function to
the weighted input for neuron j at layer l.
At this stage we also need to define the Hadamard, or componentwise, product
of two vectors. If x, y ∈ Rn , then x ◦ y ∈ Rn is defined by (x ◦ y)i = xi yi . In words,
the Hadamard product is formed by pairwise multiplication of the corresponding
components.
With this notation, the following results are a consequence of the chain rule.
Lemma 5.1. We have

(5.5)   δ [L] = σ′(z [L]) ◦ (a[L] − y),
(5.6)   δ [l] = σ′(z [l]) ◦ (W [l+1])T δ [l+1]   for 2 ≤ l ≤ L − 1,
(5.7)   ∂C/∂b[l]_j = δ[l]_j   for 2 ≤ l ≤ L,
(5.8)   ∂C/∂w[l]_jk = δ[l]_j a[l−1]_k   for 2 ≤ l ≤ L.

Proof. We begin by proving (5.5). The relation (5.3) with l = L shows that z[L]_j
and a[L]_j are connected by a[L] = σ(z[L]), and hence

∂a[L]_j / ∂z[L]_j = σ′(z[L]_j).

Also, from (5.1),

∂C/∂a[L]_j = (∂/∂a[L]_j) (1/2) ∑_{k=1}^{nL} (y_k − a[L]_k)^2 = −(y_j − a[L]_j).

So, using the chain rule,

δ[L]_j = ∂C/∂z[L]_j = (∂C/∂a[L]_j)(∂a[L]_j/∂z[L]_j) = (a[L]_j − y_j) σ′(z[L]_j),

which is the componentwise form of (5.5).

To show (5.6), we use the chain rule to convert from z[l]_j to {z[l+1]_k}_{k=1}^{n_{l+1}}.
Applying the chain rule and using the definition (5.4),

(5.9)   δ[l]_j = ∂C/∂z[l]_j = ∑_{k=1}^{n_{l+1}} (∂C/∂z[l+1]_k)(∂z[l+1]_k/∂z[l]_j)
              = ∑_{k=1}^{n_{l+1}} δ[l+1]_k (∂z[l+1]_k/∂z[l]_j).

Now, from (5.2) we know that z[l+1]_k and z[l]_j are connected via

z[l+1]_k = ∑_{s=1}^{n_l} w[l+1]_ks σ(z[l]_s) + b[l+1]_k .

Hence,

∂z[l+1]_k / ∂z[l]_j = w[l+1]_kj σ′(z[l]_j).

In (5.9) this gives

δ[l]_j = ∑_{k=1}^{n_{l+1}} δ[l+1]_k w[l+1]_kj σ′(z[l]_j),

which may be rearranged as

δ[l]_j = σ′(z[l]_j) ( (W [l+1])T δ [l+1] )_j .

This is the componentwise form of (5.6).


To show (5.7), we note from (5.2) and (5.3) that z[l]_j is connected to b[l]_j by

z[l]_j = ( W [l] σ(z [l−1]) )_j + b[l]_j .

Since z [l−1] does not depend on b[l]_j, we find that

∂z[l]_j / ∂b[l]_j = 1.

Then, from the chain rule,

∂C/∂b[l]_j = (∂C/∂z[l]_j)(∂z[l]_j/∂b[l]_j) = ∂C/∂z[l]_j = δ[l]_j ,

using the definition (5.4). This gives (5.7).


Finally, to obtain (5.8) we start with the componentwise version of (5.2),
z[l]_j = ∑_{k=1}^{n_{l−1}} w[l]_jk a[l−1]_k + b[l]_j ,

which gives
(5.10)   ∂z[l]_j / ∂w[l]_jk = a[l−1]_k ,   independently of j,

and

(5.11)   ∂z[l]_s / ∂w[l]_jk = 0   for s ≠ j.

In words, (5.10) and (5.11) follow because the jth neuron at layer l uses the weights
from only the jth row of W [l] and applies these weights linearly. Then, from the chain
rule, (5.10) and (5.11) give
∂C/∂w[l]_jk = ∑_{s=1}^{n_l} (∂C/∂z[l]_s)(∂z[l]_s/∂w[l]_jk) = (∂C/∂z[l]_j)(∂z[l]_j/∂w[l]_jk)
            = (∂C/∂z[l]_j) a[l−1]_k = δ[l]_j a[l−1]_k ,

where the last step used the definition of δ[l]_j in (5.4). This completes the proof.
There are many aspects of Lemma 5.1 that deserve our attention. We recall from
(3.1), (5.2), and (5.3) that the output a[L] can be evaluated from a forward pass
through the network, computing a[1] , z [2] , a[2] , z [3] , . . . , a[L] in order. Having done
this, we see from (5.5) that δ [L] is immediately available. Then, from (5.6), δ [L−1] ,
δ [L−2] , . . . , δ [2] may be computed in a backward pass. From (5.7) and (5.8), we then
have access to the partial derivatives. Computing gradients in this way is known as
back propagation.
To gain further understanding of the back propagation formulas (5.7) and (5.8) in
Lemma 5.1, it is useful to recall the fundamental definition of a partial derivative. The
quantity ∂C/∂w[l]_jk measures how C changes when we make a small perturbation to
w[l]_jk. As illustration, Figure 3.1 highlights the weight w[3]_43. It is clear that a change in
this weight has no effect on the output of previous layers, so to work out ∂C/∂w[3]_43 we
do not need to know about partial derivatives at previous layers. It should, however,
be possible to express ∂C/∂w[3]_43 in terms of partial derivatives at subsequent layers.
More precisely, the weighted input feeding into the fourth neuron on layer 3 is z[3]_4, and,
by definition, δ[3]_4 measures the sensitivity of C with respect to this input. Feeding in
to this neuron we have w[3]_43 a[2]_3 + constant, so it makes sense that

∂C/∂w[3]_43 = δ[3]_4 a[2]_3 .

Similarly, in terms of the bias, b[3]_4 + constant is feeding in to the neuron, which
explains why

∂C/∂b[3]_4 = δ[3]_4 × 1.
We may avoid the Hadamard product notation in (5.5) and (5.6) by introducing
diagonal matrices. Let D[l] ∈ Rnl ×nl denote the diagonal matrix with (i, i) entry
given by σ′(z[l]_i). Then we see that δ [L] = D[L] (a[L] − y) and δ [l] = D[l] (W [l+1])T δ [l+1].
We could expand this out as

δ [l] = D[l] (W [l+1] )T D[l+1] (W [l+2] )T · · · D[L−1] (W [L] )T D[L] (a[L] − y).

We also recall from (2.2) that σ′(z) is trivial to compute.


The relation (5.7) shows that δ [l] corresponds precisely to the gradient of the cost
function with respect to the biases at layer l. If we regard ∂C/∂w[l]_jk as defining the
(j, k) component in a matrix of partial derivatives at layer l, then (5.8) shows this
matrix to be the outer product δ [l] (a[l−1])T ∈ Rnl×nl−1 .
Putting this together, we may write the following pseudocode for an algorithm
that trains a network using a fixed number, Niter, of stochastic gradient iterations.
For simplicity, we consider the basic version (4.6) where single samples are chosen
with replacement. For each training point, we perform a forward pass through the
network in order to evaluate the activations, weighted inputs, and overall output a[L] .
Then we perform a backward pass to compute the errors and updates.
For counter = 1 upto Niter
    Choose an integer k uniformly at random from {1, 2, 3, . . . , N}
    x{k} is current training data point
    a[1] = x{k}
    For l = 2 upto L
        z [l] = W [l] a[l−1] + b[l]
        a[l] = σ( z [l] )
        D[l] = diag( σ′(z [l]) )
    end
    δ [L] = D[L] ( a[L] − y(x{k}) )
    For l = L − 1 downto 2
        δ [l] = D[l] (W [l+1])T δ [l+1]
    end
    For l = L downto 2
        W [l] → W [l] − η δ [l] (a[l−1])T
        b[l] → b[l] − η δ [l]
    end
end

6. Full MATLAB Example. We now give a concrete illustration involving back


propagation and the stochastic gradient method. Listing 6.1 shows how a network of
the form shown in Figure 2.3 may be used on the data in Figure 2.1. We note that
this MATLAB code has been written for clarity and brevity, rather than efficiency
or elegance. In particular, we have “hardwired” the number of layers and iterated
through the forward and backward passes line by line. (Because the weights and
biases do not have the same dimension in each layer, it is not convenient to store
them in a three-dimensional array. We could use a cell array or structure array [19],
and then implement the forward and backward passes in for loops. However, this
approach produced a less readable code and violated our self-imposed one page limit.)
The function netbp in Listing 6.1 contains the nested function cost, which eval-
uates a scaled version of Cost in (2.6). Because this function is nested, it has access
to the variables in the main function, notably the training data. We point out that
the nested function cost is not used directly in the forward and backward passes. It
is called at each iteration of the stochastic gradient method so that we can monitor
the progress of the training.
Listing 6.2 shows the function activate, used by netbp, which applies the sigmoid
function in vectorized form.
At the start of netbp we set up the training data and target y values, as defined
in (2.5). We then initialize all weights and biases using the normal pseudorandom
number generator randn. For simplicity, we set a constant learning rate eta = 0.05
and perform a fixed number of iterations Niter = 1e6.


function netbp
%NETBP Uses backpropagation to train a network

%%%%%%% DATA %%%%%%%%%%%


x1 = [0.1,0.3,0.1,0.6,0.4,0.6,0.5,0.9,0.4,0.7];
x2 = [0.1,0.4,0.5,0.9,0.2,0.3,0.6,0.2,0.4,0.6];
y = [ones(1,5) zeros(1,5); zeros(1,5) ones(1,5)];

% Initialize weights and biases


rng(5000);
W2 = 0.5*randn(2,2); W3 = 0.5*randn(3,2); W4 = 0.5*randn(2,3);
b2 = 0.5*randn(2,1); b3 = 0.5*randn(3,1); b4 = 0.5*randn(2,1);

% Forward and Back propagate


eta = 0.05; % learning rate
Niter = 1e6; % number of SG iterations
savecost = zeros(Niter,1); % value of cost function at each iteration
for counter = 1:Niter
k = randi(10); % choose a training point at random
x = [x1(k); x2(k)];
% Forward pass
a2 = activate(x,W2,b2);
a3 = activate(a2,W3,b3);
a4 = activate(a3,W4,b4);
% Backward pass
delta4 = a4.*(1-a4).*(a4-y(:,k));
delta3 = a3.*(1-a3).*(W4'*delta4);
delta2 = a2.*(1-a2).*(W3'*delta3);
% Gradient step
W2 = W2 - eta*delta2*x';
W3 = W3 - eta*delta3*a2';
W4 = W4 - eta*delta4*a3';
b2 = b2 - eta*delta2;
b3 = b3 - eta*delta3;
b4 = b4 - eta*delta4;
% Monitor progress
newcost = cost(W2,W3,W4,b2,b3,b4) % display cost to screen
savecost(counter) = newcost;
end

% Show decay of cost function


save costvec
semilogy([1:1e4:Niter],savecost(1:1e4:Niter))

function costval = cost(W2,W3,W4,b2,b3,b4)


costvec = zeros(10,1);
for i = 1:10
x =[x1(i);x2(i)];
a2 = activate(x,W2,b2);
a3 = activate(a2,W3,b3);
a4 = activate(a3,W4,b4);
costvec(i) = norm(y(:,i) - a4,2);
end
costval = norm(costvec,2)^2;
end % of nested function

end

Listing 6.1: M-file netbp.m.


function y = activate(x,W,b)
%ACTIVATE Evaluates sigmoid function.
%
% x is the input vector, y is the output vector

% W contains the weights, b contains the shifts


%
% The ith component of y is activate((Wx+b)_i)
% where activate(z) = 1/(1+exp(-z))

y = 1./(1+exp(-(W*x+b)));

Listing 6.2: M-file activate.m.

Fig. 6.1 The vertical axis shows a scaled value of the cost function (2.6). The horizontal axis shows
the iteration number. Here we used the stochastic gradient method to train a network of the
form shown in Figure 2.3 on the data in Figure 2.1. The resulting classification function
is illustrated in Figure 6.2.

We use the basic stochastic gradient iteration summarized at the end of section 5.
Here, the command randi(10) returns a uniformly and independently chosen integer
between 1 and 10.
Having stored the value of the cost function at each iteration, we use the semilogy
command to visualize the progress of the iteration.
In this experiment, our initial guess for the weights and biases produced a cost
function value of 5.3. After 10^6 stochastic gradient steps this was reduced to
7.3 × 10^−4.
Figure 6.1 shows the semilogy plot, and we see that the decay is not consistent—the
cost undergoes a flat period toward the start of the process. After this plateau, we
found that the cost decays at a very slow linear rate—the ratio between successive
values is typically within around 10^−6 of unity.
An extended version of netbp can be found in the supplementary material. This
version has the extra graphics commands that make Figure 6.1 more readable. It also
takes the trained network and produces Figure 6.2. This plot shows how the trained
network carves up the input space. Eagle-eyed readers will spot that the solution in
Figure 6.2 differs slightly from the version in Figure 2.4, where the same optimization
problem was tackled by the nonlinear least-squares solver lsqnonlin. In Figure 6.3
we show the corresponding result when an extra data point is added; this can be
compared with Figure 2.5.


Fig. 6.2 Visualization of output from an artificial neural network applied to the data in Figure 2.1.
Here we trained the network using the stochastic gradient method with back propagation—
behavior of the cost function is shown in Figure 6.1. The same optimization problem was
solved with the lsqnonlin routine from MATLAB in order to produce Figure 2.4.

Fig. 6.3 Visualization of output from an artificial neural network applied to the data in Figure 2.1
with an additional data point. Here we trained the network using the stochastic gradi-
ent method with back propagation. The same optimization problem was solved with the
lsqnonlin routine from MATLAB in order to produce Figure 2.5.

7. Image Classification Example. We now move on to a more realistic task,
which allows us to demonstrate the power of the deep learning approach. We make
use of the MatConvNet toolbox [34], which is designed to offer key deep learning
building blocks as simple MATLAB commands and so is an excellent environment for
prototyping and for educational use. Support for GPUs also makes MatConvNet
efficient for large scale computations, and pretrained networks may be downloaded
for immediate use.
Applying MatConvNet on a large scale problem also gives us the opportunity
to outline further concepts that are relevant to practical computation. These are in-
troduced in the next few subsections, before we apply them to the image classification
exercise.
7.1. Convolutional Neural Networks. MatConvNet uses a special class of
artificial neural networks known as convolutional neural networks (CNNs), which
have become a standard tool in computer vision applications. To motivate CNNs,


we note that the general framework described in section 3 does not scale well in the
case of digital image data. Consider a color image made up of 200 by 200 pixels,
each with a red, green, and blue component. This corresponds to an input vector of
dimension n1 = 200 × 200 × 3 = 120,000, and hence a weight matrix W [2] at level 2
that has 120,000 columns. If we allow general weights and biases, then this approach
is clearly infeasible. CNNs get around this issue by constraining the values that
are allowed in the weight matrices and bias vectors. Rather than a single full-sized
linear transformation, CNNs repeatedly apply a small scale linear kernel, or filter,
across portions of their input data. In effect, the weight matrices used by CNNs are
extremely sparse and highly structured.
To understand why this approach might be useful, consider premultiplying an
input vector in R6 by the matrix
 
(7.1)   [ 1 −1  0  0  0  0 ]
        [ 0  1 −1  0  0  0 ]
        [ 0  0  1 −1  0  0 ]   ∈ R5×6.
        [ 0  0  0  1 −1  0 ]
        [ 0  0  0  0  1 −1 ]
This produces a vector in R5 made up of differences between neighboring values. In
this case we are using a filter [1, −1] and a stride of length one—the filter advances by
one place after each use. Appropriate generalizations of this matrix to the case of input
vectors arising from two-dimensional images can be used to detect edges in an image—
returning a large absolute value when there is an abrupt change in neighboring pixel
values. Moving a filter across an image can also reveal other features, for example,
particular types of curves or blotches of the same color. So, having specified a filter
size and stride length, we can allow the training process to learn the weights in the
filter as a means to extract useful structure.
The word “convolutional” arises because the linear transformations involved may
be written in the form of a convolution. In the one-dimensional case, the convolution
of the vector x ∈ Rp with the filter g_{1−p}, g_{2−p}, . . . , g_{p−2}, g_{p−1} has kth
component given by

y_k = ∑_{n=1}^{p} x_n g_{k−n} .
The example in (7.1) corresponds to a filter with g0 = 1, g−1 = −1, and all other
gk = 0. In the case
 
(7.2)   [ y1 ]   [ a b c d 0 0 0 0 0 0 ]
        [ y2 ] = [ 0 0 a b c d 0 0 0 0 ] [ x1 x2 x3 x4 x5 x6 x7 x8 x9 0 ]T ,
        [ y3 ]   [ 0 0 0 0 a b c d 0 0 ]
        [ y4 ]   [ 0 0 0 0 0 0 a b c d ]
we are applying a filter with four weights, a, b, c, and d, using a stride length of two.
Because the dimension of the input vector x is not compatible with the filter length,
we have padded with an extra zero value.
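As a quick illustration (our own sketch), the products in (7.1) and (7.2) can be reproduced by sliding the filter along the input explicitly:

% Apply a filter with a given stride, as in (7.1) and (7.2).
x = randn(9,1);
x = [x; 0];                     % zero padding, as in (7.2)
g = [1; 2; 3; 4];               % filter weights a, b, c, d (arbitrary values here)
stride = 2;
ny = (length(x) - length(g))/stride + 1;
y = zeros(ny,1);
for k = 1:ny
    idx = (k-1)*stride + (1:length(g));
    y(k) = g'*x(idx);           % one application of the filter
end
y                               % four outputs, matching the shape of (7.2)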

In practice, image data is typically regarded as a three-dimensional tensor: each
pixel has two spatial coordinates and a red/green/blue value. From this viewpoint,
the filter takes the form of a small tensor that is successively applied to patches of
the input tensor and the corresponding convolution operation is multidimensional.
From a computational perspective, a key benefit of CNNs is that the matrix-vector
products involved in the forward and backward passes through the network can be
computed extremely efficiently using fast transform techniques.
A convolutional layer is often followed by a pooling layer, which reduces the
dimension by mapping small regions of pixels into single numbers. For example,
when these small regions are taken to be squares of four neighboring pixels in a two-
dimensional image, a max pooling or average pooling layer replaces each set of four by
their maximum or average value, respectively.
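A minimal sketch of our own for 2-by-2 max pooling with stride two on a single feature map:

% 2x2 max pooling with stride two on one feature map.
A = randn(8,8);                    % a feature map
P = zeros(4,4);
for i = 1:4
    for j = 1:4
        block = A(2*i-1:2*i, 2*j-1:2*j);
        P(i,j) = max(block(:));    % use mean(block(:)) for average pooling
    end
end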
7.2. Avoiding Overfitting. Overfitting occurs when a trained network performs
very accurately on the given data, but cannot generalize well to new data. Loosely,
this means that the fitting process has focused too heavily on the unimportant and
unrepresentative “noise” in the given data. Many ways to combat overfitting have
been suggested, some of which can be used together.
One useful technique is to split the given data into two distinct groups.
• Training data is used in the definition of the cost function that defines the
optimization problem. Hence this data drives the process that iteratively
updates the weights.
• Validation data is not used in the optimization process—it has no effect on the
way that the weights are updated from step to step. We use the validation
data only to judge the performance of the current network. At each step
of the optimization, we can evaluate the cost function corresponding to the
validation data. This measures how well the current set of trained weights
performs on unseen data.
Intuitively, overfitting corresponds to the situation where the optimization process is
driving down its cost function (giving a better fit to the training data), but the cost
function for the validation error is no longer decreasing (so the performance on unseen
data is not improving). It is therefore reasonable to terminate the training at a stage
where no improvement is seen on the validation data.
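In code, this early stopping strategy might look like the following sketch (our own; trainstep and valcost are hypothetical placeholders for one optimization pass over the training data and for evaluating the cost on the validation data):

% Early stopping: halt when the validation cost stops improving.
best = inf; patience = 10; bad = 0;
for epoch = 1:maxepochs
    p = trainstep(p);          % update weights using the training data only
    v = valcost(p);            % validation cost; never used to update p
    if v < best
        best = v; bad = 0;
    else
        bad = bad + 1;         % count epochs without improvement
        if bad >= patience, break; end
    end
end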
Another popular approach to tackle overfitting is to randomly and independently
remove neurons during the training phase. For example, at each step of the stochastic
gradient method, we could delete each neuron with probability p and train on the
remaining network. At the end of the process, because the weights and biases were
produced on these smaller networks, we could multiply each one by a factor of p for
use on the full-sized network. This technique, known as dropout, has the intuitive
interpretation that we are constructing an average over many trained networks, with
such a consensus being more reliable than any individual.
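A sketch of how such a dropout mask could be applied to one layer during training (our own illustration, reusing activate from Listing 6.2; keep denotes the probability that a neuron survives):

% Apply a dropout mask to the layer 2 activations during training.
keep = 0.85;                        % survival probability (an arbitrary choice)
a2 = activate(x,W2,b2);
mask = (rand(size(a2)) < keep);     % 1 = neuron active, 0 = dropped
a2 = mask.*a2;                      % train on the reduced network
% After training, weights produced this way would be rescaled before the
% full-sized network is applied to test data, as described above.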
7.3. Activation and Cost Functions. In sections 2 to 6 we used activation func-
tions of sigmoid form (2.1) and a quadratic cost function (3.3). There are many other
widely used choices, and their relative performance is application-specific. In our
image classification setting it is common to use a rectified linear unit, or ReLU,

(7.3)   σ(x) = 0 for x ≤ 0,   σ(x) = x for x > 0,

as the activation.

In the case where our training data {x{i}}_{i=1}^{N} comes from K labeled categories,
let l_i ∈ {1, 2, . . . , K} be the given label for data point x{i}. As an alternative to the
quadratic cost function (3.3), we could use a softmax log loss approach, as follows.
Let the output a[L](x{i}) =: v{i} from the network take the form of a vector in RK
such that the jth component is large when the image is believed to be from category
j. The softmax operation
(v{i})_s ↦ e^{v{i}_s} / ∑_{j=1}^{K} e^{v{i}_j}
boosts the large components and produces a vector of positive weights summing to
unity, which may be interpreted as probabilities. Our aim is now to force the softmax
value for training point x{i} to be as close to unity as possible in component li , which
corresponds to the correct label. Using a logarithmic rather than quadratic measure
of error, we arrive at the cost function
(7.4)   − ∑_{i=1}^{N} log( e^{v{i}_{l_i}} / ∑_{j=1}^{K} e^{v{i}_j} ).
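As an illustration (our own sketch, with the usual stabilizing shift before exponentiation), the softmax log loss (7.4) for a batch of outputs can be computed as:

% Softmax log loss (7.4) for outputs V (K-by-N) and labels lab (entries in 1..K).
K = 10; N = 5;
V = randn(K,N);                     % stand-in network outputs v^{i}
lab = randi(K,1,N);                 % stand-in labels l_i
V = V - max(V,[],1);                % shift each column; softmax is unchanged
S = exp(V)./sum(exp(V),1);          % softmax: each column sums to one
loss = 0;
for i = 1:N
    loss = loss - log(S(lab(i),i)); % contribution of training point i
end
loss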
7.4. Image Classification Experiment. We now show results for a supervised
learning task in image classification. To do this, we rely on the codes cnn_cifar.m
and cnn_cifar_init.m that are available via the MatConvNet website. We made
only minor edits, including some changes that allowed us to test the use of dropout.
Hence, in particular, we are using the network architecture and parameter choices from
those codes. We refer to the MatConvNet documentation and tutorial material for
the fine details and focus here on some of the bigger picture issues.
We consider a set of images, each of which falls into exactly one of the following
ten categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. We
use labeled data from the freely available CIFAR-10 collection [21]. The images are
small, having 32 by 32 pixels, each with a red, green, and blue component. So one
piece of training data consists of 32 × 32 × 3 = 3,072 values. We use a training set of
50,000 images and use 10,000 more images as our validation set. Having completed
the optimization and trained the network, we then judge its performance on a fresh
collection of 10,000 test images, with 1,000 from each category. We note that more
sophisticated cross-validation strategies could be used, where the same overall data
set is repeatedly divided into distinct training, validation, and test sets.
Following the architecture used in the relevant MatConvNet codes, we set up
a network whose layers are divided into five blocks as follows. Here we describe the
dimensions of the inputs/outputs and weights in compact tensor notation. (Of course,
the tensors could be stretched into sparse vectors and matrices in order to fit in with
the general framework of sections 2 to 6, but we feel that the tensor notation is natural
in this context and is consistent with the MatConvNet syntax.)
Block 1 consists of a convolution layer followed by a pooling layer and activation.
This converts the original 32 × 32 × 3 input into dimension 16 × 16 × 32. In
more detail, the convolutional layer uses 5 × 5 filters that also scan across the
three color channels. There are 32 different filters, so overall the weights can
be represented in a 5 × 5 × 3 × 32 array. The output from each filter may
be described as a feature map. The filters are applied with unit stride. In
this way, each of the 32 feature maps has dimension 32 × 32. Max pooling is
then applied to each feature map using stride length two. This reduces the
dimension of the feature maps to 16 × 16. A ReLU activation is then used.


Fig. 7.1 Overview of the CNN used for the image classification task.

Block 2 applies convolution followed by activation and then a pooling layer. This
reduces the dimension to 8 × 8 × 32. In more detail, we use 32 filters, each of
which is 5 × 5 across the dimensions of the feature maps and also scans across
all 32 feature maps. So the weights could be regarded as a 5 × 5 × 32 × 32
tensor. The stride length is one, so the resulting 32 feature maps are still of
dimension 16 × 16. After ReLU activation, an average pooling layer of stride
two is then applied, which reduces each of the 32 feature maps to dimension
8 × 8.
Block 3 applies a convolution layer followed by the activation function, and then
performs a pooling operation in a way that reduces the dimension to 4×4×64.
In more detail, 64 filters are applied. Each filter is 5 × 5 across the dimensions
of the feature maps, and also scans across all 32 feature maps. So the weights
could be regarded as a 5 × 5 × 32 × 64 tensor. The stride has length one,
resulting in feature maps of dimension 8 × 8. After ReLU activation, an
average pooling layer of stride two is applied, which reduces each of the 64
feature maps to dimension 4 × 4.
Block 4 does not use pooling, just convolution followed by activation, leading to
dimension 1 × 1 × 64. In more detail, 64 filters are used. Each filter is 4 × 4
across the 64 feature maps, so the weights could be regarded as a 4×4×64×64
tensor, and each filter produces a single number.
Block 5 does not involve convolution. It uses a general (fully connected) weight
matrix of the type discussed in sections 2 to 6 to give output of dimension
1 × 1 × 10. This corresponds to a weight matrix of dimension 10 × 64.
A final softmax operation transforms each of the ten output components to the range
[0, 1].
Figure 7.1 gives a visual overview of the network architecture.
Our output is a vector of ten real numbers. The cost function in the optimization
problem takes the softmax log loss form (7.4) with K = 10. We specify stochastic
gradient with momentum, which uses a “moving average” of current and past gradient
directions. We use minibatches of size 100 (so m = 100 in (4.8)) and set a fixed number
of 45 epochs. We predefine the learning rate for each epoch: η = 0.05, η = 0.005, and
η = 0.0005 for the first 30 epochs, next 10 epochs, and final 5 epochs, respectively.
Running on a Tesla C2075 GPU in single precision, the 45 epochs can be completed
in just under four hours.
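The momentum variant mentioned above augments the minibatch update (4.8) with a moving average of past gradients. A common form is sketched below (our own illustration, not the MatConvNet internals; minibatchgrad is a hypothetical function returning a sampled gradient):

% Stochastic gradient with momentum: v accumulates past search directions.
v = zeros(size(p));
mom = 0.9;                          % momentum parameter (a typical value)
for step = 1:nsteps
    g = minibatchgrad(p);           % minibatch gradient, as in (4.8)
    v = mom*v - eta*g;              % blend current and past directions
    p = p + v;
end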
As an additional test, we also train the network with dropout. Here, on each
stochastic gradient step, any neuron has its output reset to zero with independent
probability

• 0.15 in block 1,
• 0.15 in block 2,
• 0.15 in block 3,
• 0.35 in block 4,
• 0 in block 5 (no dropout).
We emphasize that in this case all neurons become active when the trained network
is applied to the test data.
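An illustrative sketch of one dropout step applied to the output vector a of a given layer (this is not the internal implementation of the software we used, and the rescaling by 1/(1 - p), known as inverted dropout, is one common convention that we assume here):

    p = 0.15;                     % dropout probability, as in blocks 1 to 3
    mask = (rand(size(a)) >= p);  % keep each neuron independently with probability 1-p
    a = (a .* mask) / (1 - p);    % zero the dropped outputs and rescale the rest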
Fig. 7.2 Errors for the trained network. The horizontal axis runs over the 45 epochs of the
stochastic gradient method (that is, 45 passes through the training data). Left: Circles show
the cost function on the training data; crosses show the cost function on the validation data.
Middle: Circles show the percentage of instances where the most likely classification from
the network does not match the correct category, over the training data images; crosses show
the same measure computed over the validation data. Right: Circles show the percentage of
instances where the five most likely classifications from the network do not include the correct
category, over the training data images; crosses show the same measure computed over the
validation data.

In Figure 7.2 we illustrate the training process in the case of no dropout. For the
plot on the left, circles are used to show how the objective function (7.4) decreases
after each of the 45 epochs. We also use crosses to indicate the objective function
value on the validation data. (More precisely, these error measures are averaged over
the individual batches that form the epoch—note that weights are updated after each
batch.) Given that our overall aim is to assign images to one of the ten classes, the
middle plot in Figure 7.2 looks at the percentage of errors that take place when we
classify with the highest probability choice. Similarly, the plot on the right shows the
percentage of cases where the correct category is not among the top five. We see from
Figure 7.2 that the validation error starts to plateau at a stage where the stochastic
gradient method continues to make significant reductions in the training error. This
gives an indication that we are overfitting—learning fine details about the training
data that will not help the network generalize to unseen data.
Fig. 7.3 As for Figure 7.2 in the case where dropout was used.

Figure 7.3 shows the analogous results in the case where dropout is used. We
see that the training errors are significantly larger than those in Figure 7.2 and the

validation errors are of a similar magnitude. However, two key features in the dropout
case are that (a) the validation error is below the training error, and (b) the validation
error continues to decrease in sync with the training error, both of which indicate that
the optimization procedure is extracting useful information over all epochs.
Figure 7.4 gives a summary of the performance of the trained network with no
dropout (after 45 epochs) in the form of a confusion matrix. Here, the integer value
in the general (i, j) entry shows the number of occasions where the network predicted
category i for an image from category j. Hence, off-diagonal elements indicate mis-
classifications. For example, the (1,1) element equal to 814 in Figure 7.4 records the
number of airplane images that were correctly classified as airplanes, and the (1,2)
element equal to 21 records the number of automobile images that were incorrectly
classified as airplanes. Below each integer is the corresponding percentage, rounded
to one decimal place, given that the test data has 1,000 images from each category.
The extra row, labeled “all,” summarizes the entries in each column. For example, the
value 81.4% in the first column of the final row arises because 814 of the 1,000 airplane
images were correctly classified. Beneath this, the value 18.6% arises because 186 of
these airplane images were incorrectly classified. The final column of the matrix, also
labeled “all,” summarizes each row. For example, the value 82.4% in the final column of
the first row arises because 988 images were classified by the network as airplanes, with
814 of these classifications being correct. Beneath this, the value 17.6% arises because
the remaining 174 out of these 988 airplane classifications were incorrect. Finally, the
entries in the lower right corner summarize over all categories. We see that 80.1% of
all images were correctly classified (and hence 19.9% were incorrectly classified).
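For concreteness, such a confusion matrix could be assembled from vectors ypred and ytrue of predicted and correct class indices over the test images (hypothetical names) along the following lines:

    C = zeros(10,10);
    for k = 1:numel(ytrue)
        C(ypred(k), ytrue(k)) = C(ypred(k), ytrue(k)) + 1;  % row: predicted, column: true
    end
    col_pct = 100*diag(C)'./sum(C,1);  % per-category success rates (the final row)
    overall = 100*trace(C)/sum(C(:));  % overall success rate (lower right corner)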
Figure 7.5 gives the corresponding results in the case where dropout was used. We
see that the use of dropout has generally improved performance, and in particular has
increased the overall success rate from 80.1% to 81.1%. Dropout gives larger values
along the diagonal elements of the confusion matrix in nine out of the ten categories.

Fig. 7.4 Confusion matrix for the trained network from Figure 7.2.

Fig. 7.5 Confusion matrix for the trained network from Figure 7.3, which used dropout.

Fig. 7.6 Sixteen of the images that were misclassified by the trained network from Figure 7.2. The
predicted category is indicated, with the correct category shown in parentheses. Note that
images are low resolution, with 32 × 32 pixels.

To give a feel for the difficulty of this task, Figure 7.6 shows 16 images randomly
sampled from those that were misclassified by the nondropout network.
8. Of Things Not Treated. This short introductory article is aimed at those who
are new to deep learning. In the interests of brevity and accessibility we have ruth-
lessly omitted many topics. For those wishing to learn more, a good starting point
is the free online book [27], which provides a hands-on tutorial style description of
deep learning techniques. The survey [23] gives an intuitive and accessible overview
of many of the key ideas behind deep learning and highlights recent success stories.
A more detailed overview of the prize-winning performance of deep learning tools can
be found in [30], which also traces the development of ideas across more than 800
references. The review [36] discusses the prehistory of deep learning and explains how
key ideas evolved. For a comprehensive treatment of the state of the art, we recom-
mend the book [11] which, in particular, does an excellent job of introducing fun-
damental ideas from computer science/discrete mathematics, applied/computational
mathematics, and probability/statistics/inference before pulling them all together in
the deep learning setting. Those references, [11, 23, 27, 30, 36], also give a feel for
the range of applications that have been tackled. The recent review article [3] focuses
on optimization tasks arising in machine learning. It summarizes the current theory
underlying the stochastic gradient method, along with many alternative techniques.
Those authors also emphasize that optimization tools must be interpreted and judged
carefully when operating within this inherently statistical framework. Leaving aside
the training issue, a mathematical framework for understanding the cascade of linear
and nonlinear transformations used by deep networks is given in [25].

To give a feel for some of the key issues that can be followed up, we finish with a
list of questions that may have occurred to interested readers, along with brief answers
and further citations.
Why Use Artificial Neural Networks? Looking at Figure 2.4, it is clear that there
are many ways to come up with a mapping that divides the x-y axis into two
regions: a shaded region containing the circles and an unshaded region con-
taining the crosses. Artificial neural networks provide one useful approach.
In real applications, success corresponds to a small generalization error ; the
mapping should perform well when presented with new data. In order to
make rigorous, general statements about performance, we need to make some
assumptions about the type of data. For example, we could analyze the situ-
ation where the data consists of samples drawn independently from a certain
probability distribution. If an algorithm is trained on such data, how will it
perform when presented with new data from the same distribution? The au-
thors in [16] prove that artificial neural networks trained with the stochastic
gradient method can behave well in this sense. Of course, in practice we can-
not rely on the existence of such a distribution. Indeed, experiments in [37]
indicate that the worst case can be as bad as possible. These authors tested
state-of-the-art convolutional networks for image classification. In terms of
the heuristic performance indicators used to monitor the progress of the train-
ing phase, they found that the stochastic gradient method appears to work
just as effectively when the images are randomly relabeled. This implies that
the network is happy to learn noise—if the labels for the unseen data are
similarly randomized, then the classifications from the trained network are
no better than random choice. Other authors have established negative re-
sults by showing that small and seemingly unimportant perturbations to an
image can change its predicted class, including cases where one pixel is al-
tered [33]. Related work in [4] showed proof-of-principle for an adversarial
patch, which alters the classification when added to a wide range of images;
for example, such a patch could be printed as a small sticker and used in the
physical world. Hence, although artificial neural networks have outperformed
rival methods in many application fields, and seem particularly well suited for
computer vision, speech and audio recognition, natural language processing,
image segmentation, regression, imputing missing data, and playing board
games, the reasons behind this success are not fully understood. The sur-
vey [35] describes a range of mathematical approaches that are beginning to
provide useful insights, while the discussion piece [26] includes a list of ten
concerns.
Which Nonlinearity? The sigmoid function (2.1), illustrated in Figure 2.2, and the
ReLU (7.3) are popular choices for the activation function. Alternatives in-
clude the step function,

$$f(x) = \begin{cases} 0 & \text{for } x \le 0, \\ 1 & \text{for } x > 0. \end{cases}$$
Each of these can undergo saturation: produce very small derivatives that
thereby reduce the size of the gradient updates. Indeed, the step function
and ReLU have completely flat portions. For this reason, a leaky ReLU, such
as

$$f(x) = \begin{cases} 0.01x & \text{for } x \le 0, \\ x & \text{for } x > 0, \end{cases}$$

is sometimes preferred, in order to force a nonzero derivative for negative
inputs. The back propagation algorithm described in section 5 carries through
to general activation functions.
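For reference, the step function, ReLU, and leaky ReLU above can be written as vectorized MATLAB anonymous functions; a sketch, with the leaky slope 0.01 matching the definition above:

    step  = @(x) double(x > 0);               % step function
    relu  = @(x) max(x, 0);                   % ReLU, as in (7.3)
    leaky = @(x) max(x, 0) + 0.01*min(x, 0);  % leaky ReLU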
How Do We Decide on the Structure of Our Net? Often, there is a natural choice
for the size of the output layer. For example, to classify images of individual
handwritten digits, it would make sense to have an output layer consisting
of ten neurons, corresponding to 0, 1, 2, . . . , 9, as used in Chapter 1 of [27].
In some cases, a physical application imposes natural constraints on one or
more of the hidden layers [5, 17]. However, in general, choosing the overall
number of layers, the number of neurons within each layer, and any con-
straints involving interneuron connections is not an exact science. Rules of
thumb have been suggested, but there is no widely accepted technique. In the
context of image processing, it may be possible to attribute roles to different
layers; for example, detecting edges, motifs, and larger structures as informa-
tion flows forward [23], and our understanding of biological neurons provides
further insights [11]. But specific roles cannot be completely hardwired into
the network design—the weights and biases, and hence the tasks performed
by each layer, emerge from the training procedure. We note that the use of
back propagation to compute gradients is not restricted to the types of con-
nectivity, activation functions, and cost functions discussed here. Indeed, the
method fits into a very general framework of techniques known as automatic
differentiation or algorithmic differentiation [14].
How Big Do Deep Learning Networks Get? The AlexNet architecture [22] achieved
groundbreaking image classification results in 2012. This network used 650,000
neurons, with five convolutional layers followed by two fully connected lay-
ers and a final softmax. The program AlphaGo, developed by the Google
DeepMind team to play the board game Go, rose to fame by beating the hu-
man European champion by five games to nil in October 2015 [31]. AlphaGo
makes use of two artificial neural networks with 13 layers and 15 layers, some
convolutional and others fully connected, involving millions of weights. We
emphasize that for the large number of parameters used by such deep net-
works, over 100 million in some cases, correspondingly large data sets must
be available and the overfitting issue looms large.
Didn’t My Numerical Analysis Teacher Tell Me Never to Use Steepest Descent?
It is known that the steepest descent method can perform poorly on exam-
ples where other methods, notably those using information about the second
derivative of the objective function, are much more efficient. Hence, opti-
mization textbooks typically downplay steepest descent [10, 28]. However,
it is important to note that training an artificial neural network is a very
specific optimization task:
• the problem dimension, and the expense of computing the objective
function and its derivatives, can be extremely high;
• the optimization task is set within a framework that is inherently sta-
tistical in nature;
• a great deal of research effort has been devoted to the development of
practical improvements to the basic stochastic gradient method in the
deep learning context.
Choosing a suitable value for the learning rate in a stochastic gradient method
is a major issue that continues to challenge researchers. Many algorithms

have been proposed for altering the learning rate dynamically, that is, based
on information from previous iterates. Competing aims of such adaptive
approaches are (i) to learn the overall scale of the problem, (ii) to avoid slow
progress, and (iii) to converge to a local minimum of the underlying cost
function. In addition, many variations of the stochastic gradient step have
been developed that make use of previous information more directly. From a
theoretical perspective, results that guarantee convergence of the stochastic
gradient method typically assume that the learning rate tends to zero as the
iteration progresses. For example, denoting the learning rate on iteration k
by ηk , a typical convergence proof requires

$$(8.1) \qquad \sum_{k=1}^{\infty} \eta_k^2$$

to be finite. Intuitively, this constraint helps to avoid the circumstance where
the method bounces around the minimum as the gradient sample changes
from step to step.
A promising line of research is to connect the stochastic gradient method with
discretizations of stochastic differential equations [32], generalizing the idea
that many deterministic optimization algorithms can be viewed as timestep-
ping methods for computing steady states of gradient ODEs [18]. This per-
spective, where ηk is regarded as a timestep, also helps to motivate a second
requirement that is often imposed alongside finiteness of (8.1) in asymptotic
convergence proofs—we need the overall “time interval”

$$\sum_{k=1}^{\infty} \eta_k$$

to be infinite in order for the method to have a chance of picking up the
long-time behavior.
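For example, the classical choice ηk = c/k for a fixed c > 0 satisfies both requirements:

$$\sum_{k=1}^{\infty} \eta_k^2 = c^2 \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{c^2 \pi^2}{6} < \infty \qquad \text{and} \qquad \sum_{k=1}^{\infty} \eta_k = c \sum_{k=1}^{\infty} \frac{1}{k} = \infty.$$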
We also remark that the introduction of more traditional tools from the field
of optimization is leading to new classes of training algorithms. We refer the
reader to [3] for the state of the art in algorithm development and convergence
theory, while noting that a theoretical underpinning for the success of the
stochastic gradient method in training networks is far from complete.
Is It Possible to Regularize? As we discussed in section 7, overfitting occurs when a
trained network performs accurately on the given data, but cannot generalize
well to new data. Regularization is a broad term that describes attempts to
avoid overfitting by rewarding smoothness. One approach is to alter the cost
function in order to encourage small weights. For example, (3.3) could be
extended to
$$(8.2) \qquad \text{Cost} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} \left\| y\big(x^{\{i\}}\big) - a^{[L]}\big(x^{\{i\}}\big) \right\|_2^2 \; + \; \frac{\lambda}{N} \sum_{l=2}^{L} \big\| W^{[l]} \big\|_2^2.$$

Here λ > 0 is the regularization parameter. One motivation for (8.2) is that
large weights may lead to neurons that are sensitive to their inputs and hence
less reliable when new data is presented. This argument does not apply to
the biases, which typically are not included in such a regularization term.
It is straightforward to check that using (8.2) instead of (3.3) makes a very
minor and inexpensive change to the back propagation algorithm.
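In a sketch, with a hypothetical cell-array layout for the weights and their gradients, and interpreting the matrix norm in (8.2) entrywise, the only change is an extra weight-decay term in each weight gradient, since the derivative of the new term with respect to W [l] is (2λ/N)W [l]:

    % lambda: regularization parameter; N: number of training points
    for l = 2:L
        gradW{l} = gradW{l} + (2*lambda/N) * W{l};  % add the weight-decay gradient
    end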

What About Ethics and Accountability? The use of “algorithms” to aid decision-
making is not a recent phenomenon. However, the increasing influence of
black-box technologies is naturally causing concern in many quarters. The
recent articles [8, 15] raise several relevant issues and illustrate them with
concrete examples. They also highlight the particular challenges arising from
massively parameterized artificial neural networks. Professional and govern-
mental institutions are, of course, alert to these matters. In 2017, the Asso-
ciation for Computing Machinery’s U.S. Public Policy Council released seven
Principles for Algorithmic Transparency and Accountability.1 Among their
recommendations are that
• “Systems and institutions that use algorithmic decision-making are en-
couraged to produce explanations regarding both the procedures fol-
lowed by the algorithm and the specific decisions that are made,”
and
• “A description of the way in which the training data was collected should
be maintained by the builders of the algorithms, accompanied by an
exploration of the potential biases induced by the human or algorithmic
data-gathering process.”
Article 15 of the European Union’s General Data Protection Regulation
2016/679,2 which took effect in May 2018, concerns “Right of access by the
data subject,” and includes the requirement that “The data subject shall
have the right to obtain from the controller confirmation as to whether or not
personal data concerning him or her are being processed, and, where that is
the case, access to the personal data and the following information.” Item
(h) on the subsequent list covers
• “the existence of automated decision-making, including profiling, re-
ferred to in Article 22(1) and (4) and, at least in those cases, meaningful
information about the logic involved, as well as the significance and the
envisaged consequences of such processing for the data subject.”

1 https://www.acm.org/
2 https://www.privacy-regulation.eu/en/15.htm
What Are Some Current Research Topics? Deep learning is a fast-moving, high-
bandwidth field, in which many new advances are driven by the needs of
specific application areas and the features of new high performance computing
architectures. Here, we briefly mention three hot-topic areas that have not
yet been discussed.
Training a network can be an extremely expensive task. When a trained
network is seen to make a mistake on new data, it is tempting to fix this
with a local perturbation to the weights and/or network structure, rather
than retraining from scratch. Approaches for this type of on the fly tuning
can be developed and justified using the theory of measure concentration in
high-dimensional spaces [13].
Adversarial networks [12] are based on the concept that an artificial neu-
ral network may be viewed as a generative model: a way to create realistic
data. Such a model may be useful, for example, as a means to produce real-
istic sentences or very high resolution images. In the adversarial setting, the
generative model is pitted against a discriminative model. The role of the
discriminative model is to distinguish between real training data and data
produced by the generative model. By iteratively improving the performance
of these models, the quality of both the generation and the discrimination
can be increased dramatically.
The idea behind autoencoders [29] is, perhaps surprisingly, to produce an
overall network whose output matches its input. More precisely, one network,
known as the encoder, corresponds to a map F that takes an input vector,
x ∈ Rs , and produces a lower-dimensional output vector F (x) ∈ Rt , so t ≪ s.
Then a second network, known as the decoder, corresponds to a map G that
takes us back to the same dimension as x; that is, G(F (x)) ∈ Rs . We could
then aim to minimize the sum of the squared error $\|x - G(F(x))\|_2^2$ over a set
of training data. Note that this technique does not require the use of labeled
data—in the case of images we are attempting to reproduce each picture
without knowing what it depicts. Intuitively, a good encoder is a tool for
dimension reduction. It extracts the key features. Similarly, a good decoder
can reconstruct the data from those key features.
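As a minimal sketch with single-layer linear maps, using hypothetical weight matrices We (t-by-s) and Wd (s-by-t), the reconstruction error for one input x could be written:

    F   = @(x) We*x;                 % encoder: maps R^s to R^t, with t << s
    G   = @(z) Wd*z;                 % decoder: maps R^t back to R^s
    err = @(x) norm(x - G(F(x)))^2;  % squared reconstruction error for one x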
Where Can I Find Code and Data? There are many publicly available codes that
provide access to deep learning algorithms. In addition to MatConvNet
[34], we mention Caffe [20], Keras [6], TensorFlow [1], Theano [2], and Torch
[7]. These packages differ in their underlying platforms and in the extent of
expert knowledge required. Your favorite scientific computing environment
may also offer a range of proprietary and user-contributed deep learning tool-
boxes. However, it is currently the case that making serious use of modern
deep learning technology requires a strong background in numerical comput-
ing. Among the standard benchmark data sets are the CIFAR-10 collection
[21] that we used in section 7 and its big sibling CIFAR-100, ImageNet [9],
and the handwritten digit database MNIST [24].
Acknowledgments. We are grateful to the MatConvNet team for making their
package available under a permissive BSD license. The MATLAB code in Listings 6.1
and 6.2 is available as online supplementary material. EPSRC Data Statement:
Original MATLAB code associated with this article can be accessed via the Supple-
mentary Materials. Other code and data used in this article are available in the public
domain, as indicated in the text.

REFERENCES

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat,
G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray,
B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng,
TensorFlow: A system for large-scale machine learning, in 12th USENIX Symposium on
Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283. (Cited on
p. 889)
[2] R. Al-Rfou et al., Theano: A Python Framework for Fast Computation of Mathematical
Expressions, preprint, https://arxiv.org/abs/1605.02688, 2016. (Cited on p. 889)
[3] L. Bottou, F. E. Curtis, and J. Nocedal, Optimization methods for large-scale machine
learning, SIAM Rev., 60 (2018), pp. 223–311, https://doi.org/10.1137/16M1080173. (Cited
on pp. 884, 887)
[4] T. B. Brown, D. Mané, A. R. M. Abadi, and J. Gilmer, Adversarial Patch, preprint,
https://arxiv.org/abs/1712.09665, 2017. (Cited on p. 885)
[5] P. Caramazza, A. Boccolini, D. Buschek, M. Hullin, C. F. Higham, R. Henderson,
R. Murray-Smith, and D. Faccio, Neural network identification of people hidden from
view with a single-pixel, single-photon detector, Sci. Rep., 8 (2018), art. 11945. (Cited on
p. 886)
[6] F. Chollet et al., Keras, GitHub, 2015. (Cited on p. 889)

[7] R. Collobert, K. Kavukcuoglu, and C. Farabet, Torch7: A Matlab-like environment for
machine learning, in BigLearn, NIPS Workshop, 2011. (Cited on p. 889)
[8] J. H. Davenport, The debate about algorithms, Math. Today, 53 (2017), pp. 162–165. (Cited
on p. 888)
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, ImageNet: A large-scale
hierarchical image database, in CVPR, IEEE Computer Society, 2009, pp. 248–255. (Cited
on p. 889)
[10] R. Fletcher, Practical Methods of Optimization, 2nd ed., Wiley, Chichester, UK, 1987. (Cited
on p. 886)
[11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Boston, 2016.
(Cited on pp. 884, 886)
[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. C. Courville, and Y. Bengio, Generative adversarial nets, in Advances in Neural
Information Processing Systems 27, Montreal, Canada, 2014, pp. 2672–2680. (Cited on
p. 888)
[13] A. N. Gorban and I. Y. Tyukin, Stochastic separation theorems, Neural Networks, 94 (2017),
pp. 255–259. (Cited on p. 888)
[14] A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of Al-
gorithmic Differentiation, 2nd ed., SIAM, Philadelphia, 2008, https://doi.org/10.1137/1.
9780898717761. (Cited on p. 886)
[15] P. Grindrod, Beyond privacy and exposure: Ethical issues within citizen-facing analytics,
Phil. Trans. Roy. Soc. A, 374 (2016), p. 2083. (Cited on p. 888)
[16] M. Hardt, B. Recht, and Y. Singer, Train faster, generalize better: Stability of stochastic
gradient descent, in Proceedings of the 33rd International Conference on Machine Learning,
2016, pp. 1225–1234. (Cited on p. 885)
[17] C. F. Higham, R. Murray-Smith, M. J. Padgett, and M. P. Edgar, Deep learning for
real-time single-pixel video, Sci. Rep., 8 (2018), art. 2369. (Cited on p. 886)
[18] D. J. Higham, Trust region algorithms and timestep selection, SIAM J. Numer. Anal., 37
(1999), pp. 194–210, https://doi.org/10.1137/S0036142998335972. (Cited on p. 887)
[19] D. J. Higham and N. J. Higham, MATLAB Guide, 3rd ed., SIAM, Philadelphia, 2017. (Cited
on p. 873)
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,
and T. Darrell, Caffe: Convolutional Architecture for Fast Feature Embedding, in Pro-
ceedings of the 22nd ACM International Conference on Multimedia, ACM, New York, 2014,
pp. 675–678. (Cited on p. 889)
[21] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Tech. rep., University
of Toronto, 2009. (Cited on pp. 879, 889)
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep con-
volutional neural networks, in Advances in Neural Information Processing Systems 25,
F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds., 2012, pp. 1097–1105.
(Cited on p. 886)
[23] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, 521 (2015), pp. 436–444. (Cited
on pp. 884, 886)
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to
document recognition, Proc. IEEE, 86 (1998), pp. 2278–2324. (Cited on p. 889)
[25] S. Mallat, Understanding deep convolutional networks, Philos. Trans. Roy. Soc. London A,
374 (2016), art. 20150203. (Cited on p. 884)
[26] G. Marcus, Deep Learning: A Critical Appraisal, preprint, https://arxiv.org/abs/1801.00631,
2018. (Cited on p. 885)
[27] M. Nielsen, Neural Networks and Deep Learning, Determination Press, 2015. (Cited on
pp. 884, 886)
[28] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed., Springer, Berlin, 2006.
(Cited on p. 886)
[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by
error propagation, in Parallel Distributed Processing: Explorations in the Microstructure
of Cognition, Vol. 1, MIT Press, Cambridge, MA, 1986, pp. 318–362. (Cited on p. 889)
[30] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks, 61 (2015),
pp. 85–117. (Cited on p. 884)
[31] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrit-
twieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe,
J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu,
T. Graepel, and D. Hassabis, Mastering the game of Go with deep neural networks and
tree search, Nature, 529 (2016), pp. 484–489. (Cited on p. 886)
[32] J. Sirignano and K. Spiliopoulos, Stochastic gradient descent in continuous time, SIAM
J. Finan. Math., 8 (2017), pp. 933–961, https://doi.org/10.1137/17M1126825. (Cited on
p. 887)
[33] J. Su, D. V. Vargas, and S. Kouichi, One pixel attack for fooling deep neural networks, IEEE
Trans. Evol. Comput, to appear. (Cited on p. 885)
[34] A. Vedaldi and K. Lenc, MatConvNet: Convolutional neural networks for MATLAB, in
ACM International Conference on Multimedia, Brisbane, 2015, pp. 689–692. (Cited on
pp. 876, 889)
[35] R. Vidal, R. Giryes, J. Bruna, and S. Soatto, Mathematics of deep learning, in Proc. Conf.
Decision and Control (CDC), 2017. (Cited on p. 885)
[36] H. Wang and B. Raj, On the Origin of Deep Learning, preprint, https://arxiv.org/abs/1702.
07800, 2017. (Cited on p. 884)
[37] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning
requires rethinking generalization, in 5th International Conference on Learning Represen-
tations, 2017. (Cited on p. 885)
