CSC321 Grosse Lecture Notes
This series of readings forms the lecture notes for the course CSC321,
“Intro to Neural Networks,” for undergraduates at the University of Toronto.
I’m aiming for it also to function as a stand-alone mini-textbook for self-
directed learners and for students at other universities. These notes are
aimed at students who have some background in basic calculus, probability
theory, and linear algebra, but possibly no prior background in machine
learning.
1 Motivation
1.1 Why machine learning?
Think about some of the things we do effortlessly on a day-to-day basis:
visually recognize people, places and things, pick up objects, understand
spoken language, and so on. How would you program a machine to do these
things? Unfortunately, it’s hard to give a step-by-step program, since we
have very little introspective awareness of the workings of our minds. How
do you recognize your best friend? Exactly which facial features do you
pick up on? AI researchers tried for decades to come up with computational
procedures for these sorts of tasks, and it proved frustratingly difficult.
Machine learning takes a different approach: collect lots of data, and
have an algorithm automatically figure out a good behavior from the data.
If you’re trying to write a program to distinguish different categories of
objects (tree, dog, etc.), you might first collect a dataset of images of each
kind of object, and then use a machine learning algorithm to train a model
(such as a neural network) to classify an image as one category or another.
Maybe it will learn to see in a way analogous to the human visual system,
or maybe it will come up with a different approach altogether. Either way,
the whole process can be much easier than specifying everything by hand.
Aside from being easier, there are lots of other reasons we might want
to use machine learning to solve a given problem:
• We may want an algorithm to behave autonomously for privacy or
fairness reasons, such as with ranking search results or targeting ads.
Here are just a few important applications where machine learning al-
gorithms are regularly deployed:
• Detecting credit card fraud
• Determining when to apply a C-section
• Transcribing human speech
• Recognizing faces
• Robots learning complex behaviors
• There are powerful software packages like Caffe, Theano, Torch, and
TensorFlow, which allow us to quickly implement sophisticated learn-
ing algorithms.
• Many of the important algorithms are much simpler to explain, com-
pared with other subfields of machine learning. This makes it possible
for undergraduates to quickly get up to speed on state-of-the-art tech-
niques in the field.
This class is very unusual among undergrad classes, in that it covers
modern research techniques, i.e. algorithms introduced in the last 5 years.
It’s pretty amazing that with less than a page of code, we can build learning
algorithms more powerful than the best ones researchers had come up with
as of 5 years ago.
In fact, these software packages make neural nets deceptively easy. One
might wonder, if you can implement a neural net in TensorFlow using a
handful of lines of code, why do we need a whole class on the subject?
The answer is that the algorithms generally won’t work perfectly the first
time. Diagnosing and fixing the problems requires careful detective work
and a sophisticated understanding of what’s going on beneath the hood.
In this class, we’ll work from the bottom up: we’ll derive the algorithms
mathematically, implement them from scratch, and only then look at the
out-of-the-box implementations. This will help us build up the depth of
understanding we need to reason about how an algorithm is behaving.
2.1 Supervised learning
The majority of this course will focus on supervised learning. This is the
best-understood type of machine learning, because (compared with unsu-
pervised and reinforcement learning) it is much easier to give supervised learning
problems a mathematically precise formulation that matches what one
is trying to achieve. In general, one defines a task, where the algorithm’s
goal is to train a model which takes an input (such as an image) and
predicts a target (such as the object category). One collects a dataset
consisting of pairs of inputs and labels (i.e. true values of the target). A
subset of the data, called the training set, is used to train the model, and
a separate subset, called the test set, is used to measure the algorithm’s
performance. There are a lot of highly effective and broadly applicable su-
pervised learning algorithms, many of which will be covered in this course.
For several decades, image classification has been perhaps the pro-
totypical application of neural networks. In the late 1980s, the US Postal
Service was interested in automatically reading handwritten zip codes, so
they collected 9,298 examples of handwritten digits (0-9), given as 16 × 16
images, and labeled each one; the task is to predict the digit class from
the image. This dataset is now known as the USPS Dataset1 . In the ter-
minology of supervised learning, we say that the input is the image, and
the target is the digit class. By the late 1990s, neural networks were good
enough at this task that they became regularly used to sort letters.
In the 1990s, researchers collected a similar but larger handwritten digit
dataset called MNIST2 ; for decades, MNIST has served as the “fruit fly” of
neural network research. I.e., even though handwritten digit classification
is now considered too easy a problem to be of practical interest, MNIST
has been used for almost two decades to benchmark neural net learning
algorithms. Amazingly, this classic dataset continues to yield algorithmic
insights which generalize to challenging problems of more practical interest.
A more challenging task is to classify full-size images into object cat-
egories, a task known as object recognition. The ImageNet dataset3
consists of 14 million images of nearly 22,000 distinct object categories. A
(still rather large) subset of this dataset, containing 1.2 million images in
1000 object categories, is currently one of the most important benchmarks
for computer vision algorithms; this task is known as the ImageNet Large
Scale Visual Recognition Challenge (ILSVRC). Since 2012, all of the best-
performing algorithms have been neural networks. Recently, progress on
the ILSVRC has been extremely rapid, with the error rate4 dropping from
25.7% to 5.7% over the span of a few years!
All of the above examples concerned image classification, where the goal
is to predict a discrete category for each image. A closely related task is
object detection, where the task is to identify all of the objects present in
the image, as well as their locations. I.e., the input is an image, and the
target is a listing of object categories together with their bounding boxes.
1 http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html
2 http://yann.lecun.com/exdb/mnist/
3 http://www.image-net.org/
4 In particular, the top-5 error rate; the algorithm predicts 5 object categories, and gets it right if any of the 5 is correct.
Other variants include localization, where one is given a list of object
categories and has to predict their locations, and semantic segmentation,
where one tries to label each pixel of an image as belonging to an object
category. There are a huge variety of different supervised learning problems
related to image understanding, depending on exactly what one is hoping
to achieve. The variety of tasks can be bewildering, but fortunately we can
approach most of them using very similar principles.
Neural nets have been applied in lots of areas other than vision. Another
important problem domain is language. Consider, for example, the problem
of machine translation. The task is to translate a sentence from one
language (e.g. French) to another language (e.g. English). One has available
a large corpus of French sentences coupled with their English translations; a
good example is the proceedings of the Canadian Parliament. Observe that
this task is more complex than image classification, in that the target is an
entire sentence. Observe also that there generally won’t be a unique best
translation, so it may be preferable for the algorithm to return a probability
distribution over possible translations, rather than a single translation. This
ambiguity also makes evaluation difficult, since one needs to distinguish
almost-correct translations from completely incorrect ones.
The general category of supervised learning problem where the inputs
and targets are both sequences is known as sequence-to-sequence learn-
ing. The sequences need not be of the same type. An important example
is speech recognition, where one is given a speech waveform and wants
to produce a transcription of what was said. Neural networks led to dra-
matic advances in speech recognition around 2010, and form the basis of
all of the modern systems. Caption generation is a task which combines
vision and language understanding; here the task is to take an image and
return a textual description of the image. The most successful approaches
are based on neural nets. Caption generation is far from a solved problem,
and the systems can be fun to experiment with, not least because of their
entertaining errors.5
Games played against an opponent or adversary constitute an adversarial setting, which is beyond the scope of this class.
However, single-player games can be formulated as reinforcement learning
problems. For instance, we will look at the example of training an agent
to play classic Atari games. The agent observes the pixels on the screen,
has a set of actions corresponding to the controller buttons, and receives
rewards corresponding to the score of the game. Neural net algorithms have
outperformed humans on many games, in the sense of being able to achieve
a high score in a short period of time.
[Figure: a neuron-like processing unit. The inputs x1, x2, x3 are multiplied by the weights w1, w2, w3 and added to the bias b to form the pre-activation z = b + Σ_i x_i w_i, which is passed through the nonlinearity φ to produce the output y = φ(z).]
The scalar value b, called a bias, determines the neuron’s activation in the
absence of inputs. The pre-activation is passed through a nonlinearity φ
(also called an activation function) to compute the activation a = φ(z).
Examples of nonlinearities include the logistic sigmoid

φ(z) = 1 / (1 + e^{−z})

and linear rectification

φ(z) = z if z > 0, 0 if z ≤ 0.
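For concreteness, here is a minimal NumPy sketch of these two activation functions (the code and function names are illustrative, not from the original notes):

```python
import numpy as np

def logistic(z):
    # Logistic sigmoid: squashes any real input into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Linear rectification: passes positive inputs through and zeros out the rest.
    return np.maximum(0.0, z)
```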
That’s it. That’s all that our idealized neurons do. Note that the whole
idea of a continuous-valued activation is biologically unrealistic, since a real
neuron’s action potentials are an all-or-nothing phenomenon: either they
happen or they don’t, and they do not vary in strength. The continuous-
valued activation is sometimes thought of as representing a “firing rate,” but
mostly we just ignore the whole issue and don’t even think about the rela-
tionships with biology. From now on, we’ll refer to these idealized neurons
using the more scientifically neutral term units, rather than neurons.
If the relationship with biology seems strained, it gets even worse when
we talk about learning, i.e. adapting the weights of the neurons. Most
modern neural networks are trained using a procedure called backprop-
agation, where each neuron propagates error signals backwards through
its incoming connections. Nothing analogous has been observed in actual
biological neurons. There have been some creative proposals for how bio-
logical neurons might implement something like backpropagation, but for
the most part we just ignore the issue of whether our neural nets are bio-
logically realistic, and simply try to get the best performance we can out of
the tools we have. (There is a separate field called theoretical neuroscience,
which builds much more accurate models of neurons, towards the goal of
understanding better how the brain works. This field has produced lots of
interesting insights, and has achieved accurate quantitative models of some
neural systems, but so far there doesn’t appear to be much practical benefit
to using more realistic neuronal models in machine learning systems.)
However, neural networks do share one important commonality with the
brain: they consist of a very large number of computational units, each of
which performs a rather simple set of operations, but which in aggregate
produce very sophisticated and complex behaviors. Most of the models
we’ll discuss in this course are simply large collections of units, each of
which computes a linear function followed by a nonlinearity.
Another analogy with the brain is worth pointing out: the brain is or-
ganized into hierarchies of processing, where different brain regions encode
information at different levels of abstraction. Information processing starts
at the retina of the eye, where neurons compute simple center-surround
functions of their inputs. Signals are passed to the primary visual cor-
tex, where (to vastly oversimplify things) cells detect simple image features
such as edges. Information is passed through several additional “layers” of
processing, each one taking place in a different brain region, until the in-
formation reaches areas of the cortex which encode things at a high level of
abstraction. For instance, individual neurons in the infero-temporal cortex
have been shown (again, vastly oversimplifying) to encode the identities of
objects.
In summary, visual information is processed in a series of layers of in-
creasing abstraction. This inspired machine learning researchers to build
neural networks which are many layers deep, in hopes that they would
learn analogous representations where higher layers represent increasingly
abstract features. In the last 5 years or so, very deep networks have indeed
been found to achieve startlingly good performance on a wide variety of
problems in vision and other application areas; for this reason, the research
area of neural networks is often referred to as deep learning. There is
some circumstantial evidence that deep networks learn hierarchical repre-
sentations, but this is notoriously difficult to analyze rigorously.
4 Software
There are a lot of software tools that make it easy to build powerful and
sophisticated neural nets. In this course, we will use the programming lan-
guage Python, a friendly but powerful high-level language which is widely
used both in introductory programming courses and a wide variety of pro-
duction systems. Because Python is an interpreted language, executing a
line of Python code is very slow, perhaps hundreds of times slower than the
C equivalent. Therefore, we never write algorithms directly using for-loops
in Python. Instead, we vectorize the algorithms by expressing them in
terms of operations on matrices and vectors; those operations are imple-
mented in an efficient low-level language such as C or Fortran. This allows
a large number of computational operations to be performed with minimal
interpreter overhead. In this course, we will use the NumPy library, which
provides an efficient and easy-to-use array abstraction in Python.
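As a hypothetical illustration of why we vectorize, the two functions below compute the same linear predictions; the first loops over examples in Python, while the second pushes all the work into NumPy's compiled routines:

```python
import numpy as np

def predict_loop(X, w, b):
    # Slow: Python-level loops over examples and input dimensions.
    N, D = X.shape
    y = np.zeros(N)
    for i in range(N):
        for j in range(D):
            y[i] += w[j] * X[i, j]
        y[i] += b
    return y

def predict_vectorized(X, w, b):
    # Fast: the same computation as a single matrix-vector product.
    return X.dot(w) + b
```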
Ten years ago, most neural networks were implemented directly on top
of a linear algebra framework like NumPy, or perhaps a lower level pro-
gramming language when efficiency was especially critical. More recently,
a variety of powerful neural net frameworks have been developed, including
Torch, Caffe, Theano, TensorFlow, and PyTorch. These frameworks
make it easy to quickly implement a sophisticated neural net model. Here
are some of the features provided by some or all of these frameworks (we’ll
use TensorFlow as an example):
• Automatic differentiation. If one implements a neural net directly
on top of NumPy, much of the implementational work involves writing
procedures to compute derivatives. TensorFlow automatically con-
structs routines for computing derivatives which are generally at least
as efficient as the ones we would have written by hand.
• GPU support. While NumPy is much faster than raw Python, it’s
not nearly fast enough for modern neural nets. Because neural nets
consist of a large collection of simple processing units, they natu-
rally lend themselves to parallel computation. Graphics processing
units (GPUs) are a particular parallel architecture which has been
especially powerful in training neural nets. It can be a huge pain to
write GPU routines at a low level, but TensorFlow provides an easy
interface so that the same code can run on either a CPU or a GPU.
For this course, we’ll use two neural net frameworks. The first is Au-
tograd, a lightweight automatic differentiation library. It is simple enough
that you will be able to understand how it is implemented; while it is miss-
ing many of the key features of PyTorch or TensorFlow, it provides a useful
mental model for reasoning about those frameworks.
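As a small, hedged illustration of what Autograd provides (this snippet is an assumption about typical usage, not an excerpt from the course materials): you write an ordinary Python function using Autograd's thin NumPy wrapper, and `grad` hands back a function that computes its derivative.

```python
import autograd.numpy as np   # thinly wrapped NumPy
from autograd import grad

def loss(w, x, t):
    # Squared error of a linear prediction, written with ordinary NumPy operations.
    y = np.dot(x, w)
    return 0.5 * (y - t) ** 2

dloss_dw = grad(loss)   # differentiates with respect to the first argument, w
print(dloss_dw(np.array([1.0, 2.0]), np.array([3.0, 4.0]), 5.0))
```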
For roughly the second half of the course, we will use PyTorch, a pow-
erful and widely used neural net framework. It’s not quite as popular as
TensorFlow, but we think it is easier to learn. But once you are done
with this course, you should find it pretty easy to pick up any of the other
frameworks.
Lecture 2: Linear regression
Roger Grosse
1 Introduction
Let’s jump right in and look at our first machine learning algorithm, linear
regression. In regression, we are interested in predicting a scalar-valued
target, such as the price of a stock. By linear, we mean that the target must
be predicted as a linear function of the inputs. This is a kind of supervised
learning algorithm; recall that, in supervised learning, we have a collection
of training examples labeled with the correct outputs.
Regression is an important problem in its own right. But today’s dis-
cussion will also highlight a number of themes which will recur throughout
the course:
• Thinking about the data points and the model parameters as vectors.
• Deriving both the closed-form solution and the gradient descent updates for linear regression.
• Knowing how linear regression can learn nonlinear functions using feature maps.

Figure 1: Three possible hypotheses for a linear regression model, shown in data space and weight space.
2 Problem setup
In order to formulate a learning problem mathematically, we need to define
two things: a model and a loss function. The model, or architecture
defines the set of allowable hypotheses, or functions that compute predic-
tions from the inputs. In the case of linear regression, the model simply
consists of linear functions. Recall that a linear function of D inputs is
parameterized in terms of D coefficients, which we’ll call the weights, and
an intercept term, which we’ll call the bias. Mathematically, this is written
as:

y = Σ_j w_j x_j + b.    (1)
Figure 1 shows two ways to visualize linear models. In this case, the data are
one-dimensional, so the model reduces to simply y = wx + b. On one side,
we have the data space, or input space, where t is plotted as a function
of x. Three different possible linear fits are shown. On the other side, we
have weight space, where the corresponding pairs (w, b) are plotted. (You should study these figures and try to understand how the lines in the left figure map onto the X's in the right figure. Think back to middle school. Hint: w is the slope of the line, and b is the y-intercept.)

Clearly, some of these linear fits are better than others. In order to quantify how good the fit is, we define a loss function. This is a function L(y, t) which says how far off the prediction y is from the target t. In linear regression, we use squared error, defined as

L(y, t) = (1/2)(y − t)^2.    (2)

This is small when y and t are close together, and large when they are far apart. In general, the value y − t is known as the residual, and we'd like the residuals to be close to zero. (Why is there the factor of 1/2 in front? It just makes the calculations convenient.)
When we combine our model and loss function, we get an optimization
problem, where we are trying to minimize a cost function with respect
to the model parameters (i.e. the weights and bias). The cost function is
simply the loss, averaged over all the training examples.

Figure 2: Left: three hypotheses for a regression dataset. Middle: contour plot of the least-squares cost function for the regression problem. Colors of the points match the hypotheses. Right: surface plot matching the contour plot. Surface plots are usually hard to interpret, so we won't look at them very often.

When we plug in the model definition (Eqn. 1), we get the following cost function:

E(w_1, . . . , w_D, b) = (1/N) Σ_{i=1}^N L(y^{(i)}, t^{(i)})    (3)
                       = (1/2N) Σ_{i=1}^N (y^{(i)} − t^{(i)})^2    (4)
                       = (1/2N) Σ_{i=1}^N ( Σ_j w_j x_j^{(i)} + b − t^{(i)} )^2    (5)
To minimize the cost, we need its partial derivatives with respect to the parameters. Applying this to Eqn. 5, we get
∂E/∂w_j = (1/N) Σ_{i=1}^N x_j^{(i)} ( Σ_{j′} w_{j′} x_{j′}^{(i)} + b − t^{(i)} )    (6)

∂E/∂b = (1/N) Σ_{i=1}^N ( Σ_{j′} w_{j′} x_{j′}^{(i)} + b − t^{(i)} ).    (7)
It's possible to simplify this a bit: notice that part of the term in parentheses is simply the prediction y^{(i)}. (It's always a good idea to try to simplify equations by finding familiar terms.) The partial derivatives can be rewritten as:

∂E/∂w_j = (1/N) Σ_{i=1}^N x_j^{(i)} (y^{(i)} − t^{(i)})    (8)

∂E/∂b = (1/N) Σ_{i=1}^N (y^{(i)} − t^{(i)}).    (9)
Now, it’s good practice to do a sanity check of the derivatives. For instance,
suppose we overestimated all of the targets. Then we should be able to
improve the predictions by decreasing the bias, while holding all of the
weights fixed. Does this work out mathematically? Well, the residuals y (i) −
t(i) will be positive, so based on Eqn. 9, ∂E/∂b will be positive. This means
increasing the bias will increase E, and decreasing the bias will decrease E, which matches up with our expectation. So Eqn. 9 is plausible. Try to come up with a similar sanity check for ∂E/∂w_j. (Later in this course, we'll introduce a more powerful way to test partial derivative computations, but you should still get used to doing sanity checks on all your computations!)

Now how do we use these partial derivatives? Let's discuss the two methods which we will use throughout the course.

3.1 Direct solution
One way to compute the minimum of a function is to set the partial deriva-
tives to zero. Recall from single variable calculus that (assuming a function
is differentiable) the minimum x⋆ of a function f has the property that the derivative df/dx is zero at x = x⋆. Note that the converse is not true: if df/dx = 0 at some point, that point might be a maximum or an inflection point, rather than
a minimum. But the minimum can only occur at points that have derivative
zero.
An analogous result holds in the multivariate case: if f is differentiable,
then all of the partial derivatives ∂f /∂xi are zero at the minimum. The
intuition is simple: if ∂f /∂xi is positive, then one can decrease f slightly
by decreasing xi slightly. Conversely, if ∂f /∂xi is negative, then one can
decrease f slightly by increasing xi slightly. In either case, this implies we’re
not at the minimum. Therefore, if the minimum exists (i.e. f doesn’t keep
growing as x goes to infinity), it occurs at a critical point, i.e. a point
where the partial derivatives are zero. This gives us a strategy for finding
minima: set the partial derivatives to zero, and solve for the parameters.
This method is known as direct solution.
Let’s apply this to linear regression. For simplicity, let’s assume the
model doesn’t have a bias term. (We actually don’t lose anything by getting
rid of the bias. Just add a “dummy” input x0 which always takes the value
1; then the weight w0 acts as a bias.) We simplify Eqn. 6 to remove the
bias, and set the partial derivatives to zero:
∂E/∂w_j = (1/N) Σ_{i=1}^N x_j^{(i)} ( Σ_{j′=1}^D w_{j′} x_{j′}^{(i)} − t^{(i)} ) = 0    (10)
Since we’re trying to solve for the weights, let’s pull these out:
∂E/∂w_j = Σ_{j′=1}^D ( (1/N) Σ_{i=1}^N x_j^{(i)} x_{j′}^{(i)} ) w_{j′} − (1/N) Σ_{i=1}^N x_j^{(i)} t^{(i)} = 0    (11)
The reason that this formula gives the direction of steepest ascent is beyond
the scope of this course. (You would learn about it in a multivariable
calculus class.) But this suggests that to decrease a function as quickly
as possible, we should update the parameters in the direction opposite the
gradient.
We can formalize this using the following update rule, which is known
as gradient descent:
w ← w − α ∂E/∂w,    (14)

or in terms of coordinates,

w_j ← w_j − α ∂E/∂w_j.    (15)
The symbol ← means that the left-hand side is updated to take the value
on the right-hand side; the constant α is known as a learning rate. The
larger it is, the larger a step we take. We’ll talk in much more detail later
about how to choose a learning rate, but in general it’s good to choose a
small value such as 0.01 or 0.001. If we plug in the formula for the partial
derivatives of the regression model (Eqn. 8), we get the update rule:

w_j ← w_j − α (1/N) Σ_{i=1}^N x_j^{(i)} (y^{(i)} − t^{(i)})    (16)

(In practice, we rarely if ever go through this last step. From a software engineering perspective, it's better to write our code in a modular way, where one function computes the gradient, and another function implements gradient descent, taking the gradient as given.)

So we just repeat this update lots of times. What does gradient descent give us in the end? For analyzing iterative algorithms, it's useful to look for
fixed points, i.e. points where the iterate doesn’t change. By inspecting
Eqn. 14, setting the left-hand side equal to the right-hand side, we see that
the fixed points occur where ∂E/∂w = 0. Since we know the gradient must
be zero at the optimum, this is an encouraging sign that maybe it will
converge to the optimum. But there are lots of things that could go wrong,
such as divergence or local optima; we’ll look at these in more detail in a
later lecture. (Lecture 9 discusses optimization issues.)

You might ask: by setting the partial derivatives to zero, we compute the
exact solution. With gradient descent, we never actually reach the optimum,
but merely approach it gradually. Why, then, would we ever prefer gradient
descent? Two reasons:
For these reasons, gradient descent will be our workhorse throughout the
course. We will use it to train almost all of our models, with the exception
of a handful for which we can derive exact solutions.
4 Vectorization
Now it’s time to bring in linear algebra. We’re going to rewrite the linear
regression model, as well as both solution methods, in terms of operations
on matrices and vectors. This process is known as vectorization. There
are two reasons for doing this: (Vectorization takes a lot of practice to get used to. We'll cover a lot of examples in the first few weeks of the course. I'd recommend practicing these until they start to feel natural.)

1. The formulas can be much simpler, more compact, and more readable in this form.
2. Vectorized code can be much faster than explicit for-loops, for several
reasons.
First, we need to represent the data and model parameters in the form of
matrices and vectors. If we have N training examples, each D-dimensional,
we will represent the inputs as an N × D matrix X. Each row of X cor-
responds to a training example, and each column corresponds to a single
input dimension. The weights are represented as a D-dimensional vector
w, and the targets are represented as an N-dimensional vector t. (In general, matrices will be denoted with capital boldface, vectors with lowercase boldface, and scalars with plain type.)

The predictions are computed using a matrix-vector product

y = Xw + b1,    (17)
where 1 denotes a vector of all ones. We can express the cost function in
vectorized form:

E = (1/2N) ‖y − t‖^2    (18)
  = (1/2N) ‖Xw + b1 − t‖^2.    (19)

(You should stop now and try to show that these equations are equivalent to Eqns. 3–5. The only way you get comfortable with this is by practicing.)
Note that this is considerably simpler than Eqn. 5. Even more importantly,
it saves us from having to explicitly sum over the indices i and j. As our
models get more complicated, we would run out of convenient letters to use
as indices if we didn’t vectorize.
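Following the modular advice mentioned earlier (one function computes the gradient, another implements gradient descent), here is a minimal NumPy sketch in the notation of this section; the learning rate and number of steps are arbitrary illustrative choices:

```python
import numpy as np

def cost_and_gradients(w, b, X, t):
    # Vectorized cost (Eqn. 19) and the vectorized forms of Eqns. 8 and 9.
    N = X.shape[0]
    y = X.dot(w) + b              # predictions (Eqn. 17)
    residual = y - t
    E = np.sum(residual ** 2) / (2.0 * N)
    dE_dw = X.T.dot(residual) / N
    dE_db = np.mean(residual)
    return E, dE_dw, dE_db

def gradient_descent(X, t, alpha=0.01, num_steps=1000):
    # Repeatedly apply the update rule (Eqn. 14), starting from zero parameters.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(num_steps):
        _, dE_dw, dE_db = cost_and_gradients(w, b, X, t)
        w = w - alpha * dE_dw
        b = b - alpha * dE_db
    return w, b
```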
Now let's revisit the exact solution for linear regression. We derived a system of linear equations, with coefficients A_{jj′} = (1/N) Σ_{i=1}^N x_j^{(i)} x_{j′}^{(i)} and c_j = (1/N) Σ_{i=1}^N x_j^{(i)} t^{(i)}. In terms of linear algebra, we can write these as the matrix A = (1/N) X^T X and the vector c = (1/N) X^T t. The solution to the linear system Aw = c is given by w = A^{−1} c (assuming A is invertible), so this gives us a formula for the optimal weights:

w = (X^T X)^{−1} X^T t.    (20)
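In code, this direct solution takes only a couple of lines; the sketch below (illustrative, not part of the notes) solves the linear system rather than forming the inverse explicitly, which is the numerically preferred route:

```python
import numpy as np

def direct_solution(X, t):
    # Solve (X^T X) w = X^T t, i.e. Eqn. 20, without explicitly inverting X^T X.
    return np.linalg.solve(X.T.dot(X), X.T.dot(t))
```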
5 Feature mappings
Linear regression might sound pretty limited. What if the true relationship
between inputs and targets is nonlinear? Fortunately, there’s an easy way to
use linear regression to learn nonlinear dependencies: use a feature mapping.
I'll introduce this by way of an example. Suppose we want to approximate the true input–target relationship with a cubic polynomial. In other words, we would compute the predictions as:

y = w_3 x^3 + w_2 x^2 + w_1 x + w_0.    (22)
This setting is known as polynomial regression.
Let’s use the squared error loss function, just as with ordinary linear re-
gression. The important thing to notice is that algorithmically, polynomial
regression is no different from linear regression. We can apply any of the
linear regression algorithms described above, using (x, x2 , x3 ) as the inputs.
Mathematically, we define a feature mapping φ, in this case

φ(x) = (1, x, x^2, x^3)^T,    (23)

and compute the predictions as y = w^T φ(x) instead of w^T x. (Just as in Section 3.1, we're including a constant feature to account for the bias term, since this simplifies the notation.) The rest of the algorithm is completely unchanged.
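Concretely, polynomial regression can reuse the linear regression machinery unchanged once the inputs are expanded through the feature map; a small illustrative sketch (assuming scalar inputs) is below:

```python
import numpy as np

def polynomial_features(x, degree=3):
    # Map each scalar input to (1, x, x^2, ..., x^degree), as in Eqn. 23.
    x = np.asarray(x, dtype=float)
    return np.stack([x ** k for k in range(degree + 1)], axis=1)

# Hypothetical usage: Phi plays the role of the data matrix, so any linear
# regression routine applies, e.g. the direct solution of Eqn. 20:
# Phi = polynomial_features(x_train)
# w = np.linalg.solve(Phi.T.dot(Phi), Phi.T.dot(t_train))
```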
Feature maps are a useful tool, but they’re not a silver bullet, for several
reasons:
• The features must be known in advance. It’s not always easy to pick
good features, and up until very recently, feature engineering would
take up most of the time and ingenuity in building a practical machine
learning system.
• In high dimensions, the feature representations can get very large. For
instance, the number of terms in a cubic polynomial is cubic in the
dimension! (It's possible to work with polynomial feature maps efficiently using something called the "kernel trick," but that's beyond the scope of this course.)

In this course, rather than construct feature maps, we will use neural networks to learn nonlinear predictors directly from the raw inputs. In most cases, this eliminates the need for hand-engineering of features.
6 Generalization
We don’t just want a learning algorithm to make correct predictions on
the training examples; we’d like it to generalize to examples it hasn’t
seen before. The average squared error on novel examples is known as the
generalization error, and we’d like this to be as small as possible.
Returning to the previous example, let’s consider three different polyno-
mial models: (a) a linear function, or equivalently, a degree 1 polynomial;
(b) a cubic polynomial; (c) a degree-10 polynomial. The linear function may
be too simplistic to describe the data; this is known as underfitting. The degree-10 polynomial may be able to fit every training example exactly, but only by learning a crazy function. It would make silly predictions everywhere except the observed data. This is known as overfitting. The cubic polynomial is a reasonable compromise. We need to worry about both underfitting and overfitting in pretty much every application of machine learning. (The terms underfitting and overfitting are a bit misleading, since they suggest the two phenomena are mutually exclusive. In fact, most machine learning models suffer from both problems simultaneously.)
The degree of the polynomial is an example of a hyperparameter.
Hyperparameters are values that we can’t include in the training procedure
itself, but which we need to set using some other means. (Statisticians prefer the term metaparameter since hyperparameter has a different meaning in statistics.) In practice, we normally tune hyperparameters by partitioning the dataset into three different subsets:
3. The test set is used at the very end, to estimate the generalization
error of the final model, once all hyperparameters have been chosen.
We will talk about validation and generalization in a lot more detail later
on in this course.
Lecture 3: Linear Classification
Roger Grosse
1 Introduction
Last week, we saw an example of a learning task called regression. There,
the goal was to predict a scalar-valued target from a set of features. This
week, we’ll focus on a slightly different task: binary classification, where
the goal is to predict a binary-valued target. Here are some examples of
binary classification problems:
• Weight space, where each set of classification weights corresponds to
a vector. Each training case corresponds to a constraint in this space,
where some regions of weight space are “good” (classify it correctly)
and some regions are “bad” (classify it incorrectly).
The idea of weight space may seem pretty abstract, but it is very important
that you become comfortable with it, since it underlies nearly everything
we do in the course.
Using our understanding of input space and weight space, the limita-
tions of linear classifiers will become immediately apparent. We’ll see some
examples of datasets which are not linearly separable (i.e. no linear classi-
fier can correctly classify all the training cases), but which become linearly
separable if we use a basis function representation.
which we’ll call classes, and which are typically referred to as positive
and negative. (E.g., the positive class might be “has disease” and the
negative class might be “does not have disease.”) Data cases belonging
to these classes are called positive examples and negative examples,
respectively. The training set consists of a set of N pairs (x(i) , t(i) ), where
x(i) is the input and t(i) is the binary-valued target, or label. Since the
training cases come with labels, they’re referred to as labeled examples.
Confusingly, even though we talk about positive and negative examples, the
t(i) typically take values in {0, 1}, where 0 corresponds to the “negative”
class. Sorry, you’ll just have to live with this terminology.
Our goal is to correctly classify all the training cases (and, hopefully,
examples not in the training set). In order to do the classification, we need
to specify a model, which determines how the predictions are computed
from the inputs. As we said before, our model for this week is binary linear
classifiers.
The way binary linear classifiers work is simple: they compute a linear
function of the inputs, and determine whether or not the value is larger
than some threshold r. Recall from Lecture 2 that a linear function of the
input can be written as
w_1 x_1 + · · · + w_D x_D + b = w^T x + b,

where w is the vector of weights and b is the bias. The classifier can then be written as:

z = w^T x + b
y = 1 if z ≥ r, 0 if z < r.
This is the model we’ll use for the rest of the week.
We can simplify the model by eliminating the threshold: since

w^T x + b ≥ r ⟺ w^T x + (b − r) ≥ 0,

we can absorb r into the bias and assume without loss of generality that r = 0. The model then becomes:

z = w^T x + b
y = 1 if z ≥ 0, 0 if z < 0.
In fact, it’s possible to eliminate the bias as well. We simply add another
input dimension x0 , called a dummy feature, which always takes the value
1. Then
w0 x0 + w1 x1 + · · · + wD xD = w0 + w1 x1 + · · · + wD xD ,
so w0 effectively plays the role of a bias. We can then simply write
z = wT x.
As a first example, consider the NOT function of a single input:

x1   t
0    1
1    0

As a second example, consider the AND function of two inputs:

x1   x2   t
0    0    0
0    1    0
1    0    0
1    1    1
Just like in the previous example, we can start by writing out the
inequalities corresponding to each training case. We get:
b<0
w2 + b < 0
w1 + b < 0
w1 + w2 + b > 0
From these inequalities, we immediately see that b < 0 and
w1 , w2 > 0. The simplest way forward at this point is proba-
bly trial and error. Since the problem is symmetric with respect
to w1 and w2 , we might as well decide that w1 = w2 . So let’s
try b = −1, w1 = w2 = 1 and see if it works. The first and
fourth inequalities are clearly satisfied, but the second and third
are not, since w1 + b = w2 + b = 0. So let’s try making the bias a
bit more negative. When we try b = −1.5, w1 = w2 = 1, we see
that all four inequalities are satisfied, so we have our solution.
Figure 1: (a) Training examples for the NOT function, in data space. (b) NOT, in weight space. (c) Slice of data space for the AND function corresponding to x0 = 1. (d) Slice of weight space for the AND function corresponding to w0 = −1.
Figure 2: Visualizing a slice of a 3-dimensional weight space.
z = w^T x    (1)
y = 1 if z ≥ 0, −1 if z < 0    (2)
Here’s a rough sketch of the perceptron algorithm. We examine each
of the training cases one at a time. For each input x(i) , we compute the
prediction y (i) and see if it matches the target t(i) . If the prediction is
correct, we do nothing. If it is wrong, we adjust the weights in a direction
that makes it more correct.
Now for the details. First of all, how do we determine if the prediction is
correct? We could simply check if y (i) = t(i) , but this has a slight problem:
if x(i) lies exactly on the classification boundary, it is technically classified as
positive according to the above definition. But we don’t want our training
cases to lie on the decision boundary, since this means the classification may
change if the input is perturbed even slightly. We’d like our classifiers to be
more robust than this. Instead, we'll use the stricter criterion

z^{(i)} t^{(i)} > 0.    (3)
You should now check that this criterion correctly handles the various cases
that may occur.
The other question is, how do we adjust the weight vector? If the train-
ing case is positive and we classify it as negative, we’d like to increase the
value of z. In other words, we'd like

z′ > z,    (4)

where w′ and w are the new and old weight vectors, respectively, and z′ = w′^T x is the new pre-activation. The perceptron algorithm achieves this using the update

w′ = w + αx,    (5)
for some scalar α > 0. To see why this works, note that z′ = w′^T x = (w + αx)^T x = z + α‖x‖^2. Here, ‖x‖ represents the Euclidean norm of x. Since the squared norm is always positive, we have z′ > z.
Conversely, if it’s a negative example which we mistakenly classified as
positive, we want to decrease z, so we use a negative value of α. Since it’s
possible to show that the absolute value of α doesn’t matter, we generally
use α = 1 for positive cases and α = −1 for negative cases. We can denote
this compactly with
w ← w + tx. (9)
This rule is known as the perceptron learning rule.
Now we write out the perceptron algorithm in full:
For each training case (x^{(i)}, t^{(i)}):
    z^{(i)} ← w^T x^{(i)}
    If z^{(i)} t^{(i)} ≤ 0:
        w ← w + t^{(i)} x^{(i)}
In thinking about this algorithm, remember that we’re denoting the classes
with -1 and 1 (rather than 0 and 1, as we do in the rest of the course).
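Here is a direct NumPy transcription of the algorithm above (a sketch; the fixed number of passes over the data is an arbitrary choice, and X is assumed to include the dummy feature so no separate bias is needed):

```python
import numpy as np

def perceptron(X, t, num_epochs=100):
    # X: N x D array (with a dummy column of ones); t: length-N array of +1/-1 labels.
    w = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in range(X.shape[0]):
            z = np.dot(w, X[i])
            if z * t[i] <= 0:          # mistake, or the case lies on the boundary
                w = w + t[i] * X[i]
    return w
```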
[Figure: discriminating simple patterns ("A" and "B") under translation with wrap-around.]
instances as A. Since 4 out of the 16 values are on, the aver-
age of all instances is simply the vectors (0.25, 0.25, . . . , 0.25).
Similarly, for it to correctly classify all 16 instances of B, it
must also classify their average as B. But the average is also
(0.25, 0.25, . . . , 0.25). Since this vector can’t possibly be classi-
fied as both A and B, this dataset must not be linearly separable.
More generally, we can’t expect any linear classifier to detect a
pattern in all possible translations. This is a serious limitation
of linear classifiers as a basis for a vision system.
φ_1(x) = x_1
φ_2(x) = x_2
φ_3(x) = x_1 x_2

b = −0.5,  w_1 = 1,  w_2 = 1,  w_3 = −2.
The only problem is, where do we get the features from? In this example,
we just pulled them out of a hat. Unfortunately, there’s no recipe for coming
up with good features, which is part of what makes machine learning hard.
But next week, we’ll see how we can learn a set of features by training a
multilayer neural net.
Lecture 4: Training a Classifier
Roger Grosse
1 Introduction
Now that we’ve defined what binary classification is, let’s actually train a
classifier. We’ll approach this problem in much the same way as we did
linear regression: define a model and a cost function, and minimize the
cost using gradient descent. The one thing that makes the classification
case harder is that it’s not obvious what loss function to use. We can’t
just use the classification error itself, because the gradient is zero almost
everywhere! Instead, we’ll define a surrogate loss function, i.e. an alternative
loss function which is easier to optimize.
In the last lecture, our goal was to correctly classify every training exam-
ple. But this might be impossible if the dataset isn’t linearly separable.
Even if it’s possible to correctly classify every training example, it may be
undesirable since then we might just overfit!
How can we define a sensible learning criterion when the dataset isn’t
linearly separable? One natural criterion is to minimize the number of mis-
classified training examples. We can formalize this with the classification
error loss, or the 0-1 loss:
L_{0−1}(y, t) = 0 if y = t, 1 otherwise.    (2)
As always, the cost function is just the loss averaged over the training
examples; in this case, that corresponds to the error rate, or fraction of
misclassified examples. How do we make this small?
y = w^T x + b    (3)
L_SE(y, t) = (1/2)(y − t)^2    (4)
We’ve already seen two ways of optimizing this: gradient descent, and a
closed-form solution. But does it make sense for classification? One obvious
problem is that the predictions are real-valued rather than binary. But
that’s OK, since we can just pick some scheme for binarizing them, such
as thresholding at y = 1/2. When we replace a loss function we trust with
another one we trust less but which is easier to optimize, the replacement
one is called a surrogate loss function.
But there’s still a problem. Suppose we have a positive example, i.e. t =
1. If we predict y = 1, we get a cost of 0, whereas if we make the wrong
prediction y = 0, we get a cost of 1/2; so far, so good. But suppose we’re
really confident that this is a positive example, and predict y = 9. Then we pay a cost of (1/2)(9 − 1)^2 = 32. This is far higher than the cost for y = 0, so
the learning algorithm will try very hard to prevent this from happening.
That’s not bad in itself, but it means that something else might need to be
sacrificed, if it’s impossible to match all of the targets exactly. Perhaps the
sacrifice will be that it incorrectly classifies some other training examples.
z = w^T x + b    (6)
y = σ(z)    (7)
L_SE(y, t) = (1/2)(y − t)^2.    (8)
Notice that this model solves the problem we observed with linear regression.
As the predictions get more and more confident on the correct answer, the
loss continues to decrease.
To derive the gradient descent updates, we’ll need the partial derivatives
of the cost function. We’ll do this by applying the Chain Rule twice: first
to compute dLSE /dz, and then again to compute ∂LSE /∂wj . But first, let’s
note the convenient fact that

∂y/∂z = e^{−z} / (1 + e^{−z})^2
      = y(1 − y).    (9)

(This is equivalent to the elegant identity σ′(z) = σ(z)(1 − σ(z)).)
Figure 1: Visualization of derivatives of squared error loss with logistic
nonlinearity, for a training example with t = 1. The derivative dE/dz
corresponds to the slope of the tangent line.
Figure 2: Plot of cross-entropy loss as a function of the input z to the
logistic activation function.
The problem with squared error loss in the classification setting is that
it doesn't distinguish bad predictions from extremely bad predictions. If t = 1, then a prediction of y = 0.01 has roughly the same squared-error loss as a prediction of y = 0.00001, even though in some sense the latter is more wrong. (Think about how the argument in this paragraph relates to the one in the previous paragraph.) This isn't necessarily a problem in terms of the cost function itself: whether 0.00001 is inherently much worse than 0.01 depends on the situation. (If all we care about is classification error, they're essentially equivalent.) But from the perspective of optimization, the fact that the losses are nearly equivalent is a big problem. If we can increase y from 0.00001 to 0.0001, that means we're "getting warmer," but this doesn't show up in the squared-error loss. We'd like a loss function which reflects our intuitive notion of getting warmer. (Actually, the effect discussed here can also be beneficial, because it makes the algorithm robust, in that it can learn to ignore mislabeled examples. Cost functions like this are sometimes used for this reason. However, when you do use it, you should be aware of the optimization difficulties it creates!)

2.4 Final touch: cross-entropy loss
The problem with squared-error loss is that it treats y = 0.01 and y =
0.00001 as nearly equivalent (for a positive example). We’d like a loss
function which makes these very different. One such loss function is cross-
entropy (CE). (You'll sometimes see cross-entropy abbreviated XE.) This is defined as follows:

L_CE(y, t) = −log y if t = 1, −log(1 − y) if t = 0.    (13)
In our earlier example, we see that LCE (0.01, 1) = 4.6, whereas LCE (0.00001, 1) =
11.5, so cross-entropy treats the latter as much worse than the former.
When we do calculations, it’s cumbersome to use the case notation, so
we instead rewrite Eqn. 13 in the following form. You should check that
they are equivalent:

L_CE(y, t) = −t log y − (1 − t) log(1 − y).    (14)

Remember, the logistic function squashes y to be between 0 and 1, but cross-entropy draws big distinctions between probabilities close to 0 or 1. Interestingly, these effects cancel out: Figure 2 plots the loss as a function of z. You get a sizable gradient signal even when the predictions are very wrong. (See if you can derive the equations for the asymptote lines.)
When we combine the logistic activation function with cross-entropy loss, we get logistic regression:

z = w^T x + b
y = σ(z)    (15)
L_CE = −t log y − (1 − t) log(1 − y).
Now let’s compute the derivatives. We’ll do it two different ways: the
mechanical way, and the clever way. Let’s do the mechanical way first, as
an example of the chain rule for derivatives. Remember, our job here isn’t
to produce a formula for the derivatives, the way we would in calculus class.
Our job is to give a procedure for computing the derivatives which we could
translate into NumPy code. The following does that (the second step of this derivation uses Eqn. 9):

dL_CE/dy = −t/y + (1 − t)/(1 − y)

dL_CE/dz = (dL_CE/dy) · (dy/dz)
         = (dL_CE/dy) · y(1 − y)    (16)

∂L_CE/∂w_j = (dL_CE/dz) · (∂z/∂w_j)
           = (dL_CE/dz) · x_j
This can be translated directly into NumPy (exercise: how do you vec-
torize this?). If we were good little computer scientists, we would stop here.
But today we’re going to be naughty computer scientists and break the
abstraction barrier between the activation function (logistic) and the cost
function (cross-entropy).
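For concreteness, one possible per-example translation is sketched below (the vectorized version is left as the exercise mentioned above; the function name is illustrative):

```python
import numpy as np

def logistic_xent_gradients(w, b, x, t):
    # Gradients of cross-entropy loss for one training case, following Eqn. 16.
    z = np.dot(w, x) + b
    y = 1.0 / (1.0 + np.exp(-z))      # logistic activation
    dL_dy = -t / y + (1 - t) / (1 - y)
    dL_dz = dL_dy * y * (1 - y)       # uses Eqn. 9
    dL_dw = dL_dz * x
    dL_db = dL_dz
    return dL_dw, dL_db
```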
Figure 3: Comparison of the loss functions considered so far.
When we combine Eqn. 16 with the formula for dy/dz, the terms cancel and we are left with the remarkably simple result dL_CE/dz = y − t. This is like magic! We took a somewhat complicated formula for the logistic activation function, combined it with a somewhat complicated formula for the cross-entropy loss, and wound up with a stunningly simple formula for the loss derivative! (This isn't a coincidence. The reason it happens is beyond the scope of this course, but if you're curious, look up "generalized linear models.") Observe that this is exactly the same formula as for dL_SE/dy in the case of linear regression. And it has the same intuitive
interpretation: if y > t, you made too positive a prediction, so you want
to shift your prediction in the negative direction. Conversely, if y < t, you
want to shift your prediction in the positive direction.
Hinge loss is plotted in Figure 3 for a positive example. One useful property
of hinge loss is that it’s an upper bound on 0–1 loss; this is a useful property
for a surrogate loss function, since it means that if you make the hinge loss small, you've also made 0–1 loss small. A linear model with hinge loss is known as a support vector machine (SVM):

y = w^T x + b    (19)
L_H = max(0, 1 − ty)    (20)

1 The log-sum-exp trick is pretty neat. https://hips.seas.harvard.edu/blog/2013/01/09/computing-log-sum-exp/
If you take CSC411, you’ll learn a lot about SVMs, including their statis-
tical motivation, how to optimize them efficiently and how to make them
nonlinear (using something called the “kernel trick”). But you already know
one optimization method: you already know enough to derive the gradient
descent updates for an SVM.
Interestingly, even though SVMs came from a different community and
had a different sort of motivation from logistic regression, the algorithms
behave very similarly in practice. The reason has to do with the loss func-
tions. Figure 3 compares hinge loss to cross-entropy loss; even though cross-
entropy is smoother, the asymptotic behavior is the same, suggesting the
loss functions are basically pretty similar.
All of the loss functions covered so far are shown in Figure 3. Take the time to review them, to understand their strengths and weaknesses.
3 Multiclass classification
So far we’ve talked about binary classification, but most classification prob-
lems involve more than two categories. Fortunately, this doesn’t require any
new ideas: everything pretty much works by analogy with the binary case.
The first question is how to represent the targets. We could represent them
as integers, but it’s more convenient to use a one-hot vector, also called
a one-of-K encoding:
t = (0, . . . , 0, 1, 0, . . . , 0),    (21)

where entry k is 1.
Now let’s design the whole model by analogy with the binary case.
First of all, consider the linear part of the model. We have K outputs
and D inputs. To represent a linear function, we’ll need a K × D weight
matrix, as well as a K-dimensional bias vector. We first compute the
intermediate quantities as follows:
z = Wx + b.    (22)

To turn these into probabilities, we pass them through the softmax function, y_k = e^{z_k} / Σ_{k′} e^{z_{k′}}. Importantly, the outputs of the softmax function are nonnegative and sum
to 1, so they can be interpreted as a probability distribution over the K
classes (just like the output of the logistic could be interpreted as a prob-
ability). The inputs to the softmax are called the logits. (Think about the logits as the "log-odds", because when you exponentiate them you get the odds ratios of the probabilities.) Note that when one of the z_k's is much larger than the others, the output of the softmax will be approximately the argmax, in the one-hot encoding. Hence, a more accurate name might be "soft-argmax."
Finally, the loss function. Cross-entropy can be generalized to the
multiple-output case: (You'll sometimes see σ(z) used to denote the softmax function, by analogy with the logistic. But in this course, it will always denote the logistic function.)

L_CE(y, t) = −Σ_{k=1}^K t_k log y_k
           = −t^T (log y).
Here, log y represents the elementwise log. Note that only one of the tk ’s is
1 and the rest are 0, so the summation has the effect of picking the relevant
entry of the vector log y. (See how convenient the one-hot notation is?)
Note that this loss function only makes sense for predictions which sum to 1; if you eliminate that constraint, you could trivially minimize the loss by making all of the y_k's large. (Try plugging in K = 2 to see how this relates to binary cross-entropy.)
Putting this all together, the full model is:

z = Wx + b
y = softmax(z)
L_CE = −t^T (log y)
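A vectorized NumPy sketch of these three equations for a whole batch is below (illustrative only; subtracting the rowwise maximum inside the softmax is the usual numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax; subtracting the max of each row avoids overflow in exp.
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def softmax_cross_entropy(W, b, X, T):
    # X: N x D inputs, T: N x K one-hot targets, W: K x D weights, b: length-K biases.
    Z = X.dot(W.T) + b          # logits, one row per training example
    Y = softmax(Z)
    return -np.mean(np.sum(T * np.log(Y), axis=1))   # average cross-entropy
```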
4 Convex Functions
An important criterion we often use to compare different loss functions is
convexity. Recall that a set S is convex if the line segment connecting any
two points in S lies entirely within S. Mathematically, this means that for
x0 , x1 ∈ S,
(1 − λ)x0 + λx1 ∈ S for 0 ≤ λ ≤ 1.
The definition of a convex function is closely related. A function f is
convex if for any x_0, x_1 in the domain of f,

f((1 − λ)x_0 + λx_1) ≤ (1 − λ)f(x_0) + λf(x_1)    for 0 ≤ λ ≤ 1.
Figure 4: Left: Definition of convexity. Right: Proof-by-picture that if
the model is linear and L is a convex function of z = w> x + b, then it’s also
convex as a function of w and b.
1. All critical points are global minima, so if you can set the derivatives
to zero, you’ve solved the problem.
We’ll talk in more detail in a later lecture about what can go wrong when the
cost function is not convex. Look back at our comparison of loss functions
in Figure 3. You can see visually that squared error, hinge loss, and the
logistic regression objective are all convex; 0–1 loss and logistic-with-least-
squares are not convex. It’s not a coincidence that the loss functions we
might actually try to optimize are the convex ones. There is an entire field
of research on convex optimization, which comes up with better ways to
minimize convex functions over convex sets, as well as ways to formulate
various kinds of problems in terms of convex optimization.
Note that even though convexity is important, most of the optimization
problems we’ll consider in this course will be non-convex, because training
a deep neural network is a non-convex problem, even when the loss func-
tion is convex. Nonetheless, convex loss functions somehow still tend to be
advantageous from the standpoint of optimization.
Figure 5: Estimating a derivative using one-sided and two-sided finite dif-
ferences.
Recall the definition of the partial derivative:

∂f(x_1, . . . , x_N)/∂x_i = lim_{h→0} [ f(x_1, . . . , x_i + h, . . . , x_N) − f(x_1, . . . , x_i, . . . , x_N) ] / h    (26)
We can check the derivatives numerically by plugging in a small value of h,
such as 10^{−10}. This is known as the method of finite differences. (You don't want to implement your actual learning algorithm using finite differences, because it's very slow, but it's great for testing.)

It's actually better to use the two-sided definition of the partial derivative than the one-sided one, since it is much more accurate:
∂f(x_1, . . . , x_N)/∂x_i = lim_{h→0} [ f(x_1, . . . , x_i + h, . . . , x_N) − f(x_1, . . . , x_i − h, . . . , x_N) ] / (2h)    (27)
Figure 5 shows an example of estimating a derivative using the one-sided and two-sided formulas.
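A finite-difference gradient checker takes only a few lines; the sketch below uses the two-sided formula (Eqn. 27), with the step size and usage example being illustrative choices:

```python
import numpy as np

def check_gradient(f, grad_f, x, h=1e-6):
    # Compare an analytic gradient grad_f against two-sided finite differences of f.
    x = np.asarray(x, dtype=float)
    numeric = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        numeric[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return np.max(np.abs(numeric - grad_f(x)))   # should be tiny if grad_f is correct

# Hypothetical usage, on a function whose gradient we know exactly:
# print(check_gradient(lambda w: 0.5 * np.sum(w ** 2), lambda w: w, np.random.randn(5)))
```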
Gradient checking is really important! In machine learning, your algo-
rithm can often seem to learn well even if the gradient calculation is totally
wrong. This might lead you to skip the correctness checks. But it might
work even better if the derivatives are correct, and this is important when
you’re trying to squeeze out the last bit of accuracy. Wrong derivatives can
also lead you on wild goose chases, as you make changes to your system
which appear to help significantly, but actually are only helping because
they compensate for errors in the gradient calculations. If you implement
derivatives by hand, gradient checking is the single most important thing
you need to do to get your algorithm to work well.
Lecture 5: Multilayer Perceptrons
Roger Grosse
1 Introduction
So far, we’ve only talked about linear models: linear regression and linear
binary classifiers. We noted that there are functions that can’t be rep-
resented by linear models; for instance, linear regression can’t represent
quadratic functions, and linear classifiers can’t represent XOR. We also saw
one particular way around this issue: by defining features, or basis func-
tions. E.g., linear regression can represent a cubic polynomial if we use the
feature map ψ(x) = (1, x, x^2, x^3). We also observed that this isn't a very
satisfying solution, for two reasons:
In this lecture, and for the rest of the course, we’ll take a different ap-
proach. We'll represent complex nonlinear functions by connecting together lots of simple processing units into a neural network, each of which computes a linear function, possibly followed by a nonlinearity. In aggregate, these units can compute some surprisingly complex functions. By historical accident, these networks are called multilayer perceptrons. (Some people would claim that the methods covered in this course are really "just" adaptive basis function representations. I've never found this a very useful way of looking at things.)
• Given the weights and biases for a neural net, be able to compute its
output from its input
• Understand why shallow neural nets are universal, and why this isn’t
necessarily very interesting
Figure 1: A multilayer perceptron with two hidden layers. Left: with the
units written out explicitly. Right: representing layers as boxes.
2 Multilayer Perceptrons
In the first lecture, we introduced our general neuron-like processing unit:
a = φ( Σ_j w_j x_j + b ),
where the xj are the inputs to the unit, the wj are the weights, b is the bias,
φ is the nonlinear activation function, and a is the unit’s activation. We’ve
seen a bunch of examples of such units:
Figure 2: An MLP that computes the XOR function. All activation func-
tions are binary thresholds at 0.
in these layers are known as input units, output units, and hidden
units, respectively. The number of layers is known as the depth, and the
number of units in a layer is known as the width. As you might guess, "deep learning" refers to training neural nets with many layers. (Terminology for the depth is very inconsistent. A network with one hidden layer could be called a one-layer, two-layer, or three-layer network, depending on whether you count the input and output layers.)

As an example to illustrate the power of MLPs, let's design one that computes the XOR function. Remember, we showed that linear models cannot do this. We can verbally describe XOR as "one of the inputs is 1,
but not both of them.” So let’s have hidden unit h1 detect if at least one
of the inputs is 1, and have h2 detect if they are both 1. We can easily do
this if we use a hard threshold activation function. You know how to design
such units — it’s an exercise of designing a binary linear classifier. Then
the output unit will activate only if h1 = 1 and h2 = 0. A network which
does this is shown in Figure 2.
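Here is the same construction written out as a small NumPy sketch. The particular weights and biases below are one valid choice implementing the strategy described above; Figure 2 may use different numbers:

import numpy as np

step = lambda z: (z > 0).astype(float)   # hard threshold at 0

W1 = np.array([[1., 1.],                 # h1: fires if at least one input is 1
               [1., 1.]])                # h2: fires if both inputs are 1
b1 = np.array([-0.5, -1.5])
w2 = np.array([1., -1.])                 # output: fires if h1 = 1 and h2 = 0
b2 = -0.5

def xor_net(x):
    h = step(W1 @ x + b1)
    return step(w2 @ h + b2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))   # 0, 1, 1, 0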
Let’s write out the MLP computations mathematically. Conceptually,
there’s nothing new here; we just have to pick a notation to refer to various
parts of the network. As with the linear case, we’ll refer to the activations
of the input units as x_j and the activation of the output unit as y. The units
in the ℓth hidden layer will be denoted h_i^(ℓ). Our network is fully connected,
so each unit receives connections from all the units in the previous layer.
This means each unit has its own bias, and there’s a weight for every pair
of units in two consecutive layers. Therefore, the network’s computations
can be written out as:
h_i^(1) = φ^(1)( Σ_j w_ij^(1) x_j + b_i^(1) )
h_i^(2) = φ^(2)( Σ_j w_ij^(2) h_j^(1) + b_i^(2) )        (1)
y_i = φ^(3)( Σ_j w_ij^(3) h_j^(2) + b_i^(3) )
Note that we distinguish φ(1) and φ(2) because different layers may have
different activation functions.
Since all these summations and indices can be cumbersome, we usually
write the computations in vectorized form. Since each layer contains mul-
tiple units, we represent the activations of all its units with an activation
vector h(`) . Since there is a weight for every pair of units in two consecutive
layers, we represent each layer’s weights with a weight matrix W(`) . Each
layer also has a bias vector b(`) . The above computations are therefore
written in vectorized form as:
h^(1) = φ^(1)( W^(1) x + b^(1) )
h^(2) = φ^(2)( W^(2) h^(1) + b^(2) )        (2)
y = φ^(3)( W^(3) h^(2) + b^(3) )
When we write the activation function applied to a vector, this means it’s
applied independently to all the entries.
Recall how in linear regression, we combined all the training examples
into a single matrix X, so that we could compute all the predictions using a
single matrix multiplication. We can do the same thing here. We can store
all of each layer’s hidden units for all the training examples as a matrix H(`) .
Each row contains the hidden units for one example. The computations are
written as follows (note the transposes):

H^(1) = φ^(1)( X W^(1)ᵀ + 1 b^(1)ᵀ )
H^(2) = φ^(2)( H^(1) W^(2)ᵀ + 1 b^(2)ᵀ )        (3)
Y = φ^(3)( H^(2) W^(3)ᵀ + 1 b^(3)ᵀ )

(If it's hard to remember when a matrix or vector is transposed, fear not: you
can usually figure it out by making sure the dimensions match up.)
These equations can be translated directly into NumPy code which effi-
ciently computes the predictions over the whole dataset.
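For instance, here is a minimal sketch of Eqn. 3; the layer sizes are made up, the parameters are randomly initialized, and the logistic function stands in for φ(1) and φ(2) (with a linear output layer), just to make things concrete:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N, D, H1, H2, K = 100, 5, 20, 10, 3            # made-up sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))                    # one training example per row
W1, b1 = rng.normal(size=(H1, D)), np.zeros(H1)
W2, b2 = rng.normal(size=(H2, H1)), np.zeros(H2)
W3, b3 = rng.normal(size=(K, H2)), np.zeros(K)

# Eqn. 3: note the transposes, and how the biases broadcast across rows.
H1_act = sigmoid(X @ W1.T + b1)
H2_act = sigmoid(H1_act @ W2.T + b2)
Y = H2_act @ W3.T + b3                         # linear output layer
print(Y.shape)                                 # (100, 3)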
3 Feature Learning
We already saw that linear regression could be made more powerful using a
feature mapping. For instance, the feature mapping ψ(x) = (1, x, x2 , x3 ) can
represent third-degree polynomials. But static feature mappings were lim-
ited because it can be hard to design all the relevant features, and because
the mappings might be impractically large. Neural nets can be thought
of as a way of learning nonlinear feature mappings. E.g., in Figure 1, the
last hidden layer can be thought of as a feature map ψ(x), and the output
layer weights can be thought of as a linear model using those features. But
the whole thing can be trained end-to-end with backpropagation, which
we’ll cover in the next lecture. The hope is that we can learn a feature
representation where the data become linearly separable:
Figure 3: Left: Some training examples from the MNIST handwritten digit
dataset. Each input is a 28 × 28 grayscale image, which we treat as a 784-
dimensional vector. Right: A subset of the learned first-layer features.
Observe that many of them pick up oriented edges.
4 Expressive Power
Linear models are fundamentally limited in their expressive power: they
can’t represent functions like XOR. Are there similar limitations for MLPs?
It depends on the activation function.
Figure 4: Designing a binary threshold network to compute a particular
function.
4.2 Universality
As it turns out, nonlinear activation functions give us much more power:
under certain technical conditions, even a shallow MLP (i.e. one with a
single hidden layer) can represent arbitrary functions. Therefore, we say it
is universal.
Let's demonstrate universality in the case of binary inputs. We do this
using the following game: suppose we're given a function mapping input
vectors to outputs; we will need to produce a neural network (i.e. specify
the weights and biases) which matches that function. (This argument can
easily be made into a rigorous proof, but this course won't be concerned
with mathematical rigor.) The function can be given to us as a table which
lists the output corresponding to every possible input vector. If there are
D inputs, this table will have 2^D rows. An example
is shown in Figure 4. For convenience, let’s suppose these inputs are ±1,
rather than 0 or 1. All of our hidden units will use a hard threshold at 0
(but we’ll see shortly that these can easily be converted to soft thresholds),
and the output unit will be linear.
Our strategy will be as follows: we will have 2^D hidden units, each
of which recognizes one possible input vector. We can then specify the
function by specifying the weights connecting each of these hidden units
to the outputs. For instance, suppose we want a hidden unit to recognize
the input (−1, 1, −1). This can be done using the weights (−1, 1, −1) and
bias −2.5, and this unit will be connected to the output unit with weight 1.
(Can you come up with the general rule?) Using these weights, any input
pattern will produce a set of hidden activations where exactly one of the
units is active. The weights connecting the hidden units to the output unit can be set based
on the input-output table. Part of the network is shown in Figure 4.
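Here is a sketch of the whole construction in NumPy, for an arbitrary input-output table over ±1 inputs; the variable names are mine, and the bias −(D − 0.5) generalizes the −2.5 used in the example above:

import numpy as np
from itertools import product

def build_universal_net(table):
    """table maps each +/-1 input tuple of length D to a scalar output."""
    keys = list(table.keys())
    D = len(keys[0])
    patterns = np.array(keys, dtype=float)              # 2^D x D
    targets = np.array([table[k] for k in keys], dtype=float)
    W = patterns                          # hidden unit k has weights equal to pattern k
    b = -(D - 0.5) * np.ones(len(keys))   # so it fires only on its own pattern
    v = targets                           # output weight = desired output for that pattern

    def net(x):
        h = (W @ x + b > 0).astype(float)   # hard threshold hidden units
        return v @ h                        # linear output unit
    return net

# Example: the XOR table, written with +/-1 inputs.
net = build_universal_net({(-1, -1): 0, (-1, 1): 1, (1, -1): 1, (1, 1): 0})
for x in product([-1, 1], repeat=2):
    print(x, net(np.array(x, dtype=float)))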
Universality is a neat property, but it has a major catch: the network
required to represent a given function might have to be extremely large (in
particular, exponential). In other words, not all functions can be represented
compactly. We desire compact representations for two reasons:
Lecture 6: Backpropagation
Roger Grosse
1 Introduction
So far, we’ve seen how to train “shallow” models, where the predictions are
computed as a linear function of the inputs. We’ve also observed that deeper
models are much more powerful than linear ones, in that they can compute a
broader set of functions. Let’s put these two together, and see how to train
a multilayer neural network. We will do this using backpropagation, the
central algorithm of this course. Backpropagation (“backprop” for short) is
a way of computing the partial derivatives of a loss function with respect to
the parameters of a network; we use these derivatives in gradient descent,
exactly the way we did with linear regression and logistic regression.
If you’ve taken a multivariate calculus class, you’ve probably encoun-
tered the Chain Rule for partial derivatives, a generalization of the Chain
Rule from univariate calculus. In a sense, backprop is “just” the Chain Rule
— but with some interesting twists and potential gotchas. This lecture and
Lecture 8 focus on backprop. (In between, we’ll see a cool example of how
to use it.) This lecture covers the mathematical justification and shows how
to implement a backprop routine by hand. Implementing backprop can get
tedious if you do it too often. In Lecture 8, we’ll see how to implement an
automatic differentiation engine, so that derivatives even of rather compli-
cated cost functions can be computed automatically. (And just as efficiently
as if you’d done it carefully by hand!)
This will be your least favorite lecture, since it requires the most tedious
derivations of the whole course.
1.2 Background
I would highly recommend reviewing and practicing the Chain Rule for
partial derivatives. I'd suggest Khan Academy[1], but you can also find lots
of resources on Metacademy[2].

[1] https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/multivariable-chain-rule/v/multivariable-chain-rule
[2] https://metacademy.org/graphs/concepts/chain_rule
2 The Chain Rule revisited
Before we get to neural networks, let’s start by looking more closely at an
example we’ve already covered: a linear classification model. For simplicity,
let’s assume we have univariate inputs and a single training example (x, t).
The predictions are a linear function followed by a sigmoidal nonlinearity.
Finally, we use the squared error loss function. The model and loss function
are as follows:
z = wx + b        (1)
y = σ(z)        (2)
L = (1/2) (y − t)^2        (3)
Now, to change things up a bit, let’s add a regularizer to the cost function.
We’ll cover regularizers properly in a later lecture, but intuitively, they try to
encourage “simpler” explanations. In this example, we’ll use the regularizer
(λ/2) w^2, which encourages w to be close to zero. (λ is a hyperparameter; the
larger it is, the more strongly the weights prefer to be close to zero.) The
cost function, then, is:

R = (1/2) w^2        (4)
Lreg = L + λR.        (5)
We can compute the derivatives by repeated applications of the univariate Chain Rule:
Lreg = (1/2) (σ(wx + b) − t)^2 + (λ/2) w^2

∂Lreg/∂w = ∂/∂w [ (1/2) (σ(wx + b) − t)^2 + (λ/2) w^2 ]
    = (1/2) ∂/∂w (σ(wx + b) − t)^2 + (λ/2) ∂/∂w w^2
    = (σ(wx + b) − t) ∂/∂w (σ(wx + b) − t) + λw
    = (σ(wx + b) − t) σ′(wx + b) ∂/∂w (wx + b) + λw
    = (σ(wx + b) − t) σ′(wx + b) x + λw

∂Lreg/∂b = ∂/∂b [ (1/2) (σ(wx + b) − t)^2 + (λ/2) w^2 ]
    = (1/2) ∂/∂b (σ(wx + b) − t)^2 + (λ/2) ∂/∂b w^2
    = (σ(wx + b) − t) ∂/∂b (σ(wx + b) − t) + 0
    = (σ(wx + b) − t) σ′(wx + b) ∂/∂b (wx + b)
    = (σ(wx + b) − t) σ′(wx + b)
This gives us the correct answer, but hopefully it’s apparent from this
example that this method has several drawbacks:
2.2 Multivariable chain rule: the easy case
We’ve already used the univariate Chain Rule a bunch of times, but it’s
worth remembering the formal definition:
d/dt f(g(t)) = f′(g(t)) g′(t).        (6)

Roughly speaking, increasing t by some infinitesimal quantity h1 "causes" g
to change by the infinitesimal h2 = g′(t) h1. This in turn causes f to change
by f′(g(t)) h2 = f′(g(t)) g′(t) h1.
The multivariable Chain Rule is a generalization of the univariate one.
Let's say we have a function f in two variables, and we want to compute
(d/dt) f(x(t), y(t)). Changing t slightly has two effects: it changes x slightly,
and it changes y slightly. Each of these effects causes a slight change to f.
For infinitesimal changes, these effects combine additively. The Chain Rule,
therefore, is given by:

d/dt f(x(t), y(t)) = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt).        (7)
Figure 1: Computation graph for the regularized linear regression example
in Section 2.4. The magenta arrows indicate the case which requires the
multivariate chain rule because w is used to compute both z and R.
Now let’s return to our running example, written again for convenience:
z = wx + b
y = σ(z)
L = (1/2) (y − t)^2
R = (1/2) w^2
Lreg = L + λR.
Let’s introduce the computation graph. The nodes in the graph corre-
spond to all the values that are computed, with edges to indicate which
values are computed from which other values. The computation graph for
our running example is shown in Figure 1. Note that the computation graph
is not the network architecture.
The goal of backprop is to compute the derivatives w̄ and b̄. We do this
by repeatedly applying the Chain Rule (Eqn. 9). Observe that to compute
a derivative using Eqn. 9, you first need the derivatives for its children in
the computation graph. This means we must start from the result of the
computation (in this case, Lreg) and work our way backwards through the
graph. It is because we work backward through the graph that backprop
and reverse mode autodiff get their names. (The nodes correspond to values
that are computed, rather than to units in the network.)
Let’s start with the formal definition of the algorithm. Let v1 , . . . , vN
denote all of the nodes in the computation graph, in a topological ordering.
(A topological ordering is any ordering where parents come before children.)
We wish to compute all of the derivatives v̄i, although we may only be
interested in a subset of these values. We first compute all of the values in
a forward pass, and then compute the derivatives in a backward pass.
As a special case, vN denotes the result of the computation (in our running
example, vN = Lreg), and is the thing we're trying to compute the derivatives
of. Therefore, by convention, we set v̄N = 1. (v̄N = 1 because increasing the
cost by h increases the cost by h.) The algorithm is as follows:
For i = 1, . . . , N:
    Compute vi as a function of Pa(vi )
v̄N = 1
For i = N − 1, . . . , 1:
    v̄i = Σ_{j ∈ Ch(vi )} v̄j ∂vj /∂vi
Here Pa(vi ) and Ch(vi ) denote the parents and children of vi .
This procedure may become clearer when we work through the example
in full:
L̄reg = 1

R̄ = L̄reg (dLreg/dR)
  = L̄reg λ

L̄ = L̄reg (dLreg/dL)
  = L̄reg

ȳ = L̄ (dL/dy)
  = L̄ (y − t)

z̄ = ȳ (dy/dz)
  = ȳ σ′(z)

w̄ = z̄ (∂z/∂w) + R̄ (dR/dw)
  = z̄ x + R̄ w

b̄ = z̄ (∂z/∂b)
  = z̄
Since we've derived a procedure for computing w̄ and b̄, we're done. Let's
write out this procedure without the mess of the derivation, so that we can
compare it with the naı̈ve method of Section 2.1:

L̄reg = 1
R̄ = L̄reg λ
L̄ = L̄reg
ȳ = L̄ (y − t)
z̄ = ȳ σ′(z)
w̄ = z̄ x + R̄ w
b̄ = z̄
The derivation, and the final result, are much cleaner than with the naı̈ve
method. There are no redundant computations here. (Actually, there's one
redundant computation, since σ(z) can be reused when computing σ′(z). But
we're not going to focus on this point.) Furthermore, the procedure is modular:
it is broken down into small chunks that can be reused for other computations.
For instance, if we want to change the loss function, we'd only have to modify
the formula for ȳ. With the naı̈ve method, we'd have to start over from scratch.
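As a sanity check, here is the whole forward and backward pass for this running example in NumPy; the numerical values are arbitrary, and the results can be compared against finite differences:

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

w, b, x, t, lam = 0.5, -0.3, 1.2, 1.0, 0.1    # arbitrary values

# Forward pass
z = w * x + b
y = sigmoid(z)
L = 0.5 * (y - t) ** 2
R = 0.5 * w ** 2
L_reg = L + lam * R

# Backward pass, following the procedure above
L_reg_bar = 1.0
R_bar = L_reg_bar * lam
L_bar = L_reg_bar
y_bar = L_bar * (y - t)
z_bar = y_bar * sigmoid(z) * (1 - sigmoid(z))  # sigma'(z) = sigma(z) (1 - sigma(z))
w_bar = z_bar * x + R_bar * w
b_bar = z_bar

print(w_bar, b_bar)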
Figure 2: (a) Full computation graph for the loss computation in a multi-
layer neural net. (b) Vectorized form of the computation graph.
z_i = Σ_j w_ij^(1) x_j + b_i^(1)
h_i = σ(z_i)
y_k = Σ_i w_ki^(2) h_i + b_k^(2)
L = (1/2) Σ_k (y_k − t_k)^2
As before, we start by drawing out the computation graph for the network.
The case of two input dimensions and two hidden units is shown in Figure
2(a). Because the graph clearly gets pretty cluttered if we include all the
units individually, we can instead draw the computation graph for the vec-
torized form (Figure 2(b)), as long as we can mentally convert it to Figure
2(a) as needed.
Based on this computation graph, we can work through the derivations
of the backwards pass just as before. (Once you get used to it, feel free to
skip the step where we write down L̄.)

L̄ = 1
ȳ_k = L̄ (y_k − t_k)
w̄_ki^(2) = ȳ_k h_i
b̄_k^(2) = ȳ_k
h̄_i = Σ_k ȳ_k w_ki^(2)
z̄_i = h̄_i σ′(z_i)
w̄_ij^(1) = z̄_i x_j
b̄_i^(1) = z̄_i
Focus especially on the derivation of h̄_i, since this is the only step which
actually uses the multivariable Chain Rule.
Once we’ve derived the update rules in terms of indices, we can find
the vectorized versions the same way we’ve been doing for all our other
calculations. For the forward pass:
z = W(1) x + b(1)
h = σ(z)
y = W(2) h + b(2)
L = (1/2) ‖t − y‖^2
And the backward pass:
L̄ = 1
ȳ = L̄ (y − t)
W̄^(2) = ȳ hᵀ
b̄^(2) = ȳ
h̄ = W^(2)ᵀ ȳ
z̄ = h̄ ◦ σ′(z)
W̄^(1) = z̄ xᵀ
b̄^(1) = z̄
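Translated into NumPy, the forward and backward passes look like the following sketch; the layer sizes are made up and the logistic function stands in for the hidden activation:

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

D, H, K = 4, 5, 3                      # made-up sizes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(H, D)), np.zeros(H)
W2, b2 = rng.normal(size=(K, H)), np.zeros(K)
x, t = rng.normal(size=D), rng.normal(size=K)

# Forward pass
z = W1 @ x + b1
h = sigmoid(z)
y = W2 @ h + b2
L = 0.5 * np.sum((t - y) ** 2)

# Backward pass (the vectorized equations above)
L_bar = 1.0
y_bar = L_bar * (y - t)
W2_bar = np.outer(y_bar, h)
b2_bar = y_bar
h_bar = W2.T @ y_bar
z_bar = h_bar * sigmoid(z) * (1 - sigmoid(z))   # elementwise product with sigma'(z)
W1_bar = np.outer(z_bar, x)
b1_bar = z_bar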
∂E/∂w means, how much does E change if we change w while holding b
fixed? By contrast, Eqn. 10 treats E as a function of L and w; in Eqn. 10,
we’re making a change to the second argument to E (which happens to be
denoted w), while holding the first argument fixed.
Unfortunately, we need to refer to both of these interpretations when
describing backprop, and the partial derivative notation just leaves this dif-
ference implicit. Doubly unfortunately, our field hasn’t consistently adopted
any notational conventions which will help us here. There are dozens of ex-
planations of backprop out there, most of which simply ignore this issue,
letting the meaning of the partial derivatives be determined from context.
This works well for experts, who have enough intuition about the problem
to resolve the ambiguities. But for someone just starting out, it might be
hard to deduce the meaning from context.
That’s why I picked the bar notation. It’s the least bad solution I’ve
been able to come up with.
Lecture 7: Distributed Representations
Roger Grosse
1 Introduction
We’ll take a break from derivatives and optimization, and look at a partic-
ular example of a neural net that we can train using backprop: the neural
probabilistic language model. Here, the goal is to model the distribution
of English sentences (a task known as language modeling), and we do this
by reducing it to a sequential prediction task. I.e., we learn to predict the
distribution of the next word in a sentence given the previous words. This
lecture will also serve as an example of one of the most important concepts
about neural nets, that of a distributed representation. We can understand
this in contrast with a localized representation, where a particular piece
of information is stored in only one place. In a distributed representation,
information is spread throughout the representation. This turns out to be
really useful, since it lets us share information between related entities —
in the case of language modeling, between related words.
• The observation model, represented as p(a | s), which tells us how likely a
sentence is to lead to a given acoustic signal. (The notation p(· | ·) denotes
the conditional distribution.) You might, for instance, build a model of the
human vocal system. A lot of work has gone into this, but we're not going
to talk about it here.
• The prior, represented as p(s), which tells us how likely a given sen-
tence is to be spoken, before we’ve seen a. This is the thing we’re
trying to estimate when we do language modeling.
Given these two distributions, we can combine them using Bayes’ Rule
to infer the posterior distribution over sentences, i.e. the probability
distribution over sentences taking into account the observations. Recall
that Bayes’ Rule is as follows:
p(s | a) = p(s) p(a | s) / Σ_{s′} p(s′) p(a | s′).        (1)
Hence, Bayes’ Rule lets us combine our prior beliefs with an observation
model in a principled and elegant way.
Having a good prior distribution p(s) is very useful, since speech signals
are inherently ambiguous. E.g., “recognize speech” sounds very similar to
“wreck a nice beach”, but the former is much more likely to be spoken. This
is the sort of thing we’d like our language models to capture.
Hence, we can talk instead about modeling the distribution over sentences.
We’ll try to fit a model which represents a distribution pθ (s), parame-
terized by θ. The maximum likelihood criterion says we’d like to choose
the θ which maximizes the likelihood, or the probability of the observed
data:
max_θ Π_{i=1}^N p_θ(s^(i)).        (4)
At this point, you might be concerned that the probability of any particular
sentence will be vanishingly small. This is true, but we can fix that prob-
lem by working with log probabilities. Then the probability of the corpus
conveniently decomposes as a sum:

log Π_{i=1}^N p(s^(i)) = Σ_{i=1}^N log p(s^(i)).        (5)

(Since it's easier to work with positive numbers, and log probabilities are
negative, we often rephrase maximum likelihood as minimizing negative log
probabilities.)
The log probability of monkeys typing the entire works of Shakespeare is
on a scale we can reasonably work with. (What is this probability, under the
assumption that they type all keys uniformly at random?) And if slightly
better trained monkeys are slightly more likely to type Hamlet, it will give
us a smooth training criterion we can optimize with gradient descent.
A sentence is a sequence of words w1, w2, . . . , wT. The chain rule of
conditional probability implies that p(s) factorizes as the product of
conditional probabilities of individual words:

p(s) = p(w1, . . . , wT) = p(w1) p(w2 | w1) · · · p(wT | w1, . . . , wT−1).        (6)

(Note that the Chain Rule applies to any distribution, i.e. we're not making
any assumptions here.)
up with a variety of clever ways for dealing with data sparsity, including
adding imaginary counts of all the words, and combining the predictions of
different context lengths.
But there’s one problem fundamental to the n-gram approach: it’s hard
to share information between related words. If we see the sentence “The
cat got squashed in the garden on Friday”, we should estimate a higher
probability of seeing the sentence “The dog got flattened in the yard on
Monday”, even though these two sentences have few words in common.
Distributed representations give a great way of doing this.
If we write out the negative log-likelihood for a sentence, it decomposes
as the sum of cross-entropies for predicting each word:
− log p(s) = − log Π_{t=1}^T p(w_t | w_1, . . . , w_{t−1})        (9)
           = − Σ_{t=1}^T log p(w_t | w_1, . . . , w_{t−1})        (10)
           = − Σ_{t=1}^T log y_{tv}        (11)
           = − Σ_{t=1}^T Σ_{v=1}^V t_{tv} log y_{tv},        (12)
The only new concept here is the table look-up in the first layer. The
network learns a representation of every word in the dictionary as a vector,
and keeps these in a lookup table. This can be seen as a matrix R, where
each column gives the vector representation of one word. The network does
one table lookup for each of the context words, and the activation vector
for the embedding layer is the concatenation of the representations of all
the context words.
There’s another way to think of the embedding layer: suppose the con-
text words are represented with one-hot encodings. Then we can think of
the embedding layer as basically a linear layer whose weights are shared
between all the context words. Recall that a linear layer just computes
a matrix-vector product. In this case, we’re multiplying the representa-
tion matrix R by the one-hot vectors, which corresponds to pulling out the
corresponding column of R. You should convince yourself that
this is the case.
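A quick numerical sketch of this equivalence (the vocabulary size and embedding dimension are made up; R stores one word's representation per column, as above):

import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
R = rng.normal(size=(embed_dim, vocab_size))   # column v = representation of word v

word_index = 7
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Multiplying R by a one-hot vector pulls out the corresponding column,
# so the linear-layer view and the table-lookup view agree.
print(np.allclose(R @ one_hot, R[:, word_index]))   # True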
After the embedding layer, there’s a hidden layer, followed by a softmax
output layer, which is what we’d expect if we’re using cross-entropy loss.
This architecture also includes a skip connection from the embedding layer
to the output layer; we’ll talk about skip connections later in the course,
but roughly speaking, they help information travel faster through the net-
work. This whole network can be trained using backpropagation, exactly
as we’ve discussed in the previous lecture. You’ll implement this for your
first homework assignment.
There are various synonyms for word representation, such as word embedding and word vector.
Observe that unlike n-gram models, the neural language model is very
compact, even for long context lengths. While the size of the CPTs grows
exponentially in the context length, the size of the network (number of
weights, or number of units) grows linearly in the context length. This
means that we can efficiently account for much longer context lengths, such
as 10. (The number of weights is linear only assuming the number of hidden
units stays fixed. But in practice, we might need more hidden units to
represent longer contexts.)

If all goes well, the learned representations will reflect the semantic
relationships between words. Here are two common ways to measure this:
• If two words are dissimilar, the Euclidean distance between their representations, ‖r1 − r2‖, should be large.

• If two words are similar, the dot product r1ᵀr2 of their representations should be large.
These two criteria aren't equivalent in general, but they are equivalent in
the case where r1 and r2 are both unit vectors:

‖r1 − r2‖^2 = (r1 − r2)ᵀ(r1 − r2)        (13)
            = r1ᵀr1 − 2 r1ᵀr2 + r2ᵀr2        (14)
            = 2 − 2 r1ᵀr2        (15)

(If the representations are unit vectors, r1ᵀr2 is also referred to as cosine
similarity, since it is the cosine of the angle between the representations.)
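A two-line numerical check of Eqns. 13-15, with random unit vectors (the dimension is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
r1, r2 = rng.normal(size=5), rng.normal(size=5)
r1, r2 = r1 / np.linalg.norm(r1), r2 / np.linalg.norm(r2)    # make them unit vectors
print(np.allclose(np.sum((r1 - r2) ** 2), 2 - 2 * r1 @ r2))  # True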
dimensions might be put close together in 2-D. But it is still a pretty
instructive visualization. Here[2] is an example of a tSNE visualization of word
representations learned by a different model, but one based on similar
principles. Notice that semantically similar words get grouped together.

[2] http://www.cs.toronto.edu/~hinton/turian.png
Lecture 8: Optimization
Roger Grosse
1 Introduction
Now that we’ve seen how to compute derivatives of the cost function with
respect to model parameters, what do we do with those derivatives? In this
lecture, we’re going to take a step back and look at optimization problems
more generally. We’ve briefly discussed gradient descent and used it to train
some models, but what exactly is the gradient, and why is it a good idea to
move opposite it? We also introduce stochastic gradient descent, a way of
obtaining noisy gradient estimates from a small subset of the data.
Using modern neural network libraries, it is easy to implement the back-
prop algorithm so that it correctly computes the gradient. It’s not always
so easy to get it to work well. In this lecture, we’ll make a list of things that
can go drastically wrong in neural net training, and talk about how we can
spot them. This includes: learning rates that are too large or too small,
symmetries, dead or saturated units, and badly conditioned curvature. We
discuss tricks to ameliorate all of these problems. In general, debugging a
learning algorithm is like debugging any other complex piece of software:
if something goes wrong, you need to make hypotheses about what might
have happened, and look for evidence or design experiments to test those
hypotheses. This requires a thorough understanding of the principles of
optimization. (Understanding the principles of neural nets and being able to
diagnose failure modes are what distinguishes someone who's finished CSC321
from someone who's merely worked through the TensorFlow tutorial.)

Our style of thinking in this lecture will be very different from that
in the last several lectures. When we discussed backprop, we looked at the
gradient computations algebraically: we derived mathematical equations for
computing all the derivatives. We also looked at the computations imple-
mentationally, seeing how to implement them efficiently (e.g. by vectorizing
the computations), and designing an automatic differentiation system which
the computations), and designing an automatic differentiation system which
separated the backprop algorithm itself from the design of a network archi-
tecture. In this lecture, we’ll look at gradient descent geometrically: we’ll
reason qualitatively about optimization problems and about the behavior
of gradient descent, without thinking about how the gradients are actually
computed. I.e., we abstract away the gradient computation. One of the
most important skills to develop as a computer scientist is the ability to
move between different levels of abstraction, and to figure out which level
is most appropriate for the problem at hand.
• Know why stochastic gradient descent can be faster than batch gradi-
ent descent, and understand the tradeoffs in choosing the mini-batch
size.
• Know what effect the learning rate has on the training process. Why
can it be advantageous to decay the learning rate over time?
– slow progress
– instability
– fluctuations
– dead or saturated units
– symmetries
– badly conditioned curvature
Figure 1: (a) Cost surface for an optimization problem with two local min-
ima, one of which is the global minimum. (b) Cartoon plot of a one-
dimensional optimization problem, and the gradient descent iterates start-
ing from two different initializations, in two different basins of attraction.
denoted ∇θ E. This is the direction which goes directly uphill, i.e. the di-
rection which increases the cost the fastest relative to the distance moved.
We can’t determine the magnitude of the gradient from the contour plot,
but it is easy to determine its direction: the gradient is always orthogonal
(perpendicular) to the level sets. This gives an easy way to draw it on a
contour plot (e.g. see Figure 2(a)). Algebraically, the gradient is simply the
vector of partial derivatives of the cost function:

∇_θ E = ∂E/∂θ = ( ∂E/∂θ_1, . . . , ∂E/∂θ_M )ᵀ        (1)

(In this context, E is taken as a function of the parameters, not of the loss L.
Therefore, the partial derivatives correspond to the values w̄_ij, b̄_i, etc.,
computed from backpropagation.)
The fact that the vector of partial derivatives gives the steepest ascent
direction is far from obvious; you would see the derivation in a multivariable
calculus class, but here we will take it for granted.
The gradient descent update rule (which we’ve already seen multiple
times) can be written in terms of the gradient:
θ ← θ − α∇θ E, (2)
where α is the scalar-valued learning rate. This shows directly that gra-
dient descent moves opposite the gradient, or in the direction of steepest
descent. Too large a learning rate can cause instability, whereas too small
a learning rate can cause slow progress. In general, the learning rate is one
of the most important hyperparameters of a learning algorithm, so it’s very
important to tune it, i.e. look for a good value. (Most commonly, one tries
a bunch of values and picks the one which works the best. Recall that
hyperparameters are parameters which aren't part of the model and which
aren't tuned with gradient descent.)

For completeness, it's worth mentioning one more possible feature of a
cost function, namely a saddle point, shown in Figure 2(b). This is a point
where the gradient is zero, but which isn’t a local minimum because the cost
increases in some directions and decreases in others. If we’re exactly on a
saddle point, gradient descent won’t go anywhere because the gradient is
zero.
If we use this formula directly, we must visit every training example to com-
pute the gradient. This is known as batch training, since we’re treating
the entire training set as a batch. But this can be very time-consuming, and
it’s also unnecessary: we can get a stochastic estimate of the gradient from
a single training example. In stochastic gradient descent (SGD), we
pick a training example, and update the parameters opposite the gradient
for that example:

θ ← θ − α ∇_θ E_n.        (7)

(This is identical to the gradient descent update rule, except that E is
replaced with E_n.)
SGD is able to make a lot of progress even before the whole training set has
been visited. A lot of datasets are so large that it can take hours or longer
to make a single pass over the training set; in such cases, batch training is
impractical, and we need to use a stochastic algorithm.
In practice, we don’t compute the gradient on a single example, but
rather average it over a batch of B training examples known as a mini-
batch. Typical mini-batch sizes are on the order of 100. Why mini-batches?
Observe that the number of operations required to compute the gradient for
a mini-batch is linear in the size of the mini-batch (since mathematically, the
gradient for each training example is a separate computation). Therefore, if
all operations were equally expensive, one would always prefer to use B = 1.
In practice, there are two important reasons to use B > 1:
• Operations on mini-batches can be vectorized by writing them in
terms of matrix operations. This reduces the interpreter overhead,
and makes use of efficient and carefully tuned linear algebra libraries.
(In previous lectures, we already derived vectorized forms of batch gradient
descent; the same formulas can be applied in mini-batch mode.)

• Most large neural networks are trained on GPUs or some other architecture
which enables a high degree of parallelism. There is much more parallelism
to exploit when B is large, since the gradients can be computed independently
for each training example.
On the flip side, we don’t want to make B too large, because then it takes
too long to compute the gradients. In the extreme case where B = N , we
get batch gradient descent. (The activations for large mini-batches may also
be too large to store in memory.)
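In code, mini-batch SGD is a small loop around a vectorized gradient computation. Here is a sketch for linear regression with squared error; the function names, sizes, and hyperparameter values are illustrative:

import numpy as np

def sgd(grad_fn, theta, X, T, alpha=0.1, batch_size=100, num_epochs=10, seed=0):
    """Mini-batch SGD. grad_fn(theta, X_batch, T_batch) returns the averaged gradient."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    for epoch in range(num_epochs):
        perm = rng.permutation(N)                    # visit examples in random order
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - alpha * grad_fn(theta, X[idx], T[idx])
    return theta

# Gradient of the mean squared error for linear regression, theta = weight vector.
def linreg_grad(theta, Xb, Tb):
    return Xb.T @ (Xb @ theta - Tb) / len(Tb)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
T = X @ true_w + 0.01 * rng.normal(size=1000)
print(sgd(linreg_grad, np.zeros(3), X, T))           # close to true_w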
4.1 Incorrect gradient computations
If your computed gradients are wrong, then all bets are off. If you’re lucky,
the training will fail completely, and you’ll notice that something is wrong.
If you’re unlucky, it will sort of work, but it will also somehow be broken.
This is much more common than you might expect: it’s not unusual for
an incorrectly implemented learning algorithm to perform reasonably well.
But it will perform a bit worse than it should; furthermore, it will make it
harder to tune, since some of the diagnostics might give misleading results
if the gradients are wrong. Therefore, it’s completely useless to do anything
else until you’re sure the gradients are correct.
Fortunately, it’s possible to be confident in the correctness of the gra-
dients. We’ve already covered finite difference methods, which are pretty
reliable (see the lecture “Training a Classifier”). If you’re using one of
the major neural net frameworks, you’re pretty safe, because the gradients
are being computed automatically by a system which has been thoroughly
tested. For the rest of this discussion, we’ll assume the gradient computa-
tion is correctly implemented.
4.3 Symmetries
Suppose we initialize all the weights and biases of a neural network to zero.
All the hidden activations will be identical, and you can check by inspection
(see the lecture on backprop) that all the weights feeding into a given hid-
den unit will have identical derivatives. Therefore, these weights will have
identical values in the next step, and so on. With nothing to distinguish
different hidden units, no learning will occur. This phenomenon is perhaps
the most important example of a saddle point in neural net training.
Fortunately, the problem is easy to deal with, using any sort of sym-
metry breaking. Once two hidden units compute slightly different things,
they will probably get a gradient signal driving them even farther apart.
(Think of this in terms of the saddle point picture; if you’re exactly on
the saddle point, you get zero gradient, but if you’re slightly to one side,
Figure 3: (a) Slow progress due to a small learning rate. (b) Instability
due to a large learning rate. (c) Oscillations due to a large learning rate.
you’ll move away from it, which gives you a larger gradient, and so on.) In
practice, we typically initialize all the weights randomly.
the dynamics are essentially those described above. (The potential energy
is the height of the surface.)
We can simulate these dynamics with the following update rule, known
as gradient descent with momentum. (Momentum can be used with
either the batch version or with SGD.)
p ← µp − α∇θ En (8)
θ ← θ + p        (9)
Just as with ordinary SGD, there is a learning rate α. There is also another
parameter µ, called the momentum parameter, satisfying 0 ≤ µ ≤ 1.
It determines the timescale on which momentum decays. In terms of the
physical analogy, it determines the amount of friction (with µ = 1 being
frictionless). As usual, it's useful to think about the edge cases: if µ = 0, the
update reduces to ordinary (stochastic) gradient descent, while if µ = 1 there is
no friction, so the velocity keeps accumulating and the iterates can overshoot and
oscillate rather than settling into a minimum.
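In code, momentum is a small change to the SGD loop sketched earlier; the velocity p is carried along between updates. The function name and mini-batch interface below are mine:

import numpy as np

def sgd_momentum(grad_fn, theta, X, T, alpha=0.01, mu=0.9,
                 batch_size=100, num_epochs=10, seed=0):
    """Mini-batch SGD with momentum (Eqns. 8-9)."""
    rng = np.random.default_rng(seed)
    p = np.zeros_like(theta)                                      # the "velocity"
    for epoch in range(num_epochs):
        perm = rng.permutation(X.shape[0])
        for start in range(0, X.shape[0], batch_size):
            idx = perm[start:start + batch_size]
            p = mu * p - alpha * grad_fn(theta, X[idx], T[idx])   # Eqn. 8
            theta = theta + p                                     # Eqn. 9
    return theta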
4.6 Fluctuations
All of the problems we’ve discussed so far occur both in batch training and
in SGD. But in SGD, we have the further problem that the gradients are
stochastic; even if they point in the right direction on average, individual
stochastic gradients are noisy and may even increase the cost function. The
effect of this noise is to push the parameters in a random direction, causing
them to fluctuate. Note the difference between oscillations and fluctua-
tions: oscillations are a systematic effect caused by the cost surface itself,
whereas fluctuations are an effect of the stochasticity in the gradients.
Fluctuations often show up as fluctuations in the cost function, and can
be seen in the training curves. One solution to fluctuations is to decrease
the learning rate; however, this can slow down the progress too much. It’s
actually fine to have fluctuations during training, since the parameters are
still moving in the right direction “on average.”
A better approach to deal with fluctuations is learning rate decay.
My favorite approach is to keep the learning rate relatively high throughout
training, but then at the very end, to decay it using an exponential schedule,
i.e.
αt = α0 e−t/τ , (10)
where α0 is the initial learning rate, t is the iteration count, τ is the decay
timescale, and t = 0 corresponds to the start of the decay.
I should emphasize that we don’t begin the decay until late in training,
when the parameters are already pretty good “on average” and we merely
have a high cost because of fluctuations. Once you start decaying α, progress
Figure 4: If you decay the learning rate too soon, you’ll get a sudden drop
in the loss as a result of reducing fluctuations, but the algorithm will stop
making progress towards the optimum, leading to slower convergence in the
long run. This is a big problem in practice, and we haven’t figured out any
good ways to detect if this is happening.
slows down drastically. If you decay α too early, you may get a sudden
improvement in the cost from reducing fluctuations, at the cost of failure to
converge in the long term. This phenomenon is illustrated in Figure 4.
Another neat trick for dealing with fluctuations is iterate averaging.
Separate from the training process, we keep an exponential moving av-
erage θ̃ of the iterates, as follows:
θ̃ ← (1 − 1/τ) θ̃ + (1/τ) θ.        (11)
τ is a hyperparameter called the timescale. Iterate averaging doesn’t
change the training algorithm itself at all, but when we apply or evalu-
ate the network, we use θ̃ rather than θ. In practice, iterate averaging can
give a huge performance boost by reducing the fluctuations.
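Iterate averaging is only a few extra lines on top of whichever optimizer is being used. Here is a sketch on a noisy quadratic cost, with made-up values for the learning rate and the timescale τ:

import numpy as np

rng = np.random.default_rng(0)
tau, alpha = 100.0, 0.1
theta = np.array([5.0, -3.0])       # parameters being trained
theta_avg = theta.copy()            # exponential moving average of the iterates (Eqn. 11)

for step in range(2000):
    # Noisy gradient of the quadratic cost 0.5 * ||theta||^2, standing in for an SGD gradient.
    grad = theta + rng.normal(size=theta.shape)
    theta = theta - alpha * grad
    theta_avg = (1 - 1 / tau) * theta_avg + (1 / tau) * theta

# theta fluctuates around the optimum (0, 0); theta_avg is typically much closer to it.
print(theta, theta_avg)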
Figure 5: The Rosenbrock function, a function which is commonly used as
an optimization benchmark and demonstrates badly conditioned curvature
(i.e. a ravine).
conditioned curvature: batch normalization and Adam. We won’t cover
them properly, but the original papers are very readable, in case you’re cu-
rious.[2] Batch normalization normalizes the activations of each layer of a
network to have zero mean and unit variance. This can help significantly
for the reason outlined above. (It can also attenuate the problem of satu-
rated units.) Adam separately adapts the learning rate of each individual
parameter, in order to correct for differences in curvature along individual
coordinate directions.
4.9 Recap
Here is a table to summarize all the pitfalls, diagnostics, and workarounds
that we’ve covered:
[2] D. P. Kingma and J. L. Ba, 2015. Adam: a method for stochastic optimization. ICLR.
S. Ioffe and C. Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. ICML.
Lecture 9: Generalization
Roger Grosse
1 Introduction
When we train a machine learning model, we don’t just want it to learn to
model the training data. We want it to generalize to data it hasn’t seen
before. Fortunately, there’s a very convenient way to measure an algorithm’s
generalization performance: we measure its performance on a held-out test
set, consisting of examples it hasn’t seen before. If an algorithm works well
on the training set but fails to generalize, we say it is overfitting. Improving
generalization (or preventing overfitting) in neural nets is still somewhat of
a dark art, but this lecture will cover a few simple strategies that can often
help a lot.
2 Measuring generalization
So far in this course, we’ve focused on training, or optimizing, neural net-
works. We defined a cost function, the average loss over the training set:
(1/N) Σ_{i=1}^N L(y(x^(i)), t^(i)).        (1)
But we don’t just want the network to get the training examples right; we
also want it to generalize to novel instances it hasn’t seen before.
Fortunately, there’s an easy way to measure a network’s generalization
performance. We simply partition our data into three subsets:
• A training set, a set of training examples the network is trained on.

• A validation set, which is used to tune hyperparameters such as the
number of hidden units, or the learning rate.

• A test set, which is used to measure the generalization performance.

(There are lots of variants on this basic strategy, including something called
cross-validation. Typically, these alternatives are used in situations with
small datasets, i.e. less than a few thousand examples. Most applications
of neural nets involve datasets large enough to split into training, validation
and test sets.)

The losses on these subsets are called training, validation, and test
loss, respectively. Hopefully it's clear why we need separate training and
test sets: if we train on the test data, we have no idea whether the network
is correctly generalizing, or whether it’s simply memorizing the training
examples. It’s a more subtle point why we need a separate validation set.
• We also can’t tune them on the test set, because that would be “cheat-
ing.” We’re only allowed to use the test set once, to report the final
performance. If we “peek” at the test data by using it to tune hyper-
parameters, it will no longer give a realistic estimate of generalization
performance.1
Figure 1: (left) Qualitative relationship between the number of training
examples and training and test error. (right) Qualitative relationship be-
tween the number of parameters (or model capacity) and training and test
error.
of capacity on test error is non-monotonic: it decreases, and then increases.
We would like to design network architectures which have enough capacity
to learn the true regularities in the training data, but not enough capacity
to simply memorize the training set or exploit accidental regularities. This
is shown qualitatively in Figure 1.
where in the last step we introduce y∗ = E[t | x], which is the best possible
prediction we can make, because the first term is nonnegative and the second
term doesn’t depend on y. The second term is known as the Bayes error,
and corresponds to the best possible generalization error we can achieve
even if we model the data perfectly.
Now let’s treat y as a random variable. Assume we repeat the following
experiment: sample a training set randomly from pD , train our network,
and compute its predictions on x. If we suppress the dependence on x for
simplicity, the expected squared error decomposes as:
The first term is the bias, which tells us how far off the model’s average
prediction is. The second term is the variance, which tells us about the
variability in its predictions as a result of the choice of training set, i.e. the
amount to which it overfits the idiosyncrasies of the training data. The
third term is the Bayes error, which we have no control over. So this de-
composition is known as the bias-variance decomposition.
To visualize this, suppose we have two test examples, with targets
(t(1) , t(2) ). Figure 2 is a visualization in output space, where the axes
correspond to the outputs of the network on these two examples. (Understand
why output space is different from input space or weight space.) It shows
the test error as a function of the predictions on these two test examples;
because we're measuring mean squared error, the test error takes the shape
of a quadratic bowl. The various quantities computed above can be seen in
the diagram:
• The generalization error is the average squared length ‖y − t‖^2 of the
line segment labeled residual.

• The bias term is the average squared length ‖E[y] − y∗‖^2 of the line
segment labeled bias.
4 Reducing overfitting
Now that we’ve talked about generalization error and how to measure it,
let’s see how we can improve generalization by reducing overfitting. Notice
that I said reduce, rather than eliminate, overfitting. Good models will
probably still overfit at least a little bit, and if we try to eliminate overfitting,
i.e. eliminate the gap between training and test error, we’ll probably cripple
our model so that it doesn’t learn anything at all. Improving generalization
is somewhat of a dark art, and there are very few techniques which both
work well in practice and have rigorous theoretical justifications. In this
section, I’ll outline a few tricks that seem to help a lot. In practice, most
good neural networks combine several of these tricks. Unfortunately, for the
most part, these intuitive justifications are hard to translate into rigorous
guarantees.
Figure 2: Schematic relating bias, variance, and error. Top: If the model
is underfitting, the bias will be large, but the variance (spread of the green
x’s) will be small. Bottom: If the model is overfitting, the bias will be
small, but the variance will be large.
In general, linear and nonlinear layers have different uses. Recall that
adding nonlinear layers can increase the expressive power of a network archi-
tecture, i.e. broaden the set of functions it’s able to represent. By contrast,
adding linear layers can’t increase the expressivity, because the same func-
tion can be represented by a single layer. For instance, in Figure 3, the
left-hand network can represent all the same functions as the right-hand
one, since one can set W̃ = W(2) W(1) ; it can also represent some functions
that the right-hand one can’t. The main use of linear layers, therefore, is
for bottlenecks. One benefit is to reduce the number of parameters, as de-
scribed above. Bottlenecks are also useful for another reason which we’ll
talk about later on, when we discuss autoencoders.
Reducing capacity has an important drawback: it might make the net-
work too simple to learn the true regularities in the data. Therefore, it’s
often preferable to keep the capacity high, but prevent it from overfitting
in other ways. We’ll discuss some such alternatives now.
Figure 4: Training curves, showing the relationship between the number of
training iterations and the training and test error. (left) Idealized version.
(right) Accounting for fluctuations in the error, caused by stochasticity in
the SGD updates.
Figure 5: Two sets of weights which make the same predictions assuming
inputs x1 and x2 are identical.
For instance, suppose we are training a linear regression model with two
inputs, x1 and x2 , and these inputs are identical in the training set. The
two sets of weights shown in Figure 5 will make identical predictions on the
training set, so they are equivalent from the standpoint of minimizing the
loss. However, Hypothesis A is somehow better, because we would expect it
to be more stable if the data distribution changes. E.g., suppose we observe
the input (x1 = 1, x2 = 0) on the test set; in this case, Hypothesis A will
predict 1, while Hypothesis B will predict -8. The former is probably more
sensible. We would like a regularizer to favor Hypothesis A by assigning it
a smaller penalty.
One such regularizer which achieves this is L2 regularization; for a
linear model, it is defined as follows:

R_L2(w) = (λ/2) Σ_{j=1}^D w_j^2.        (3)

(This is an abuse of terminology; mathematically speaking, this really
corresponds to the squared L2 norm.)
smaller. For instance, in the above example, with λ = 1, it assigns a penalty
of (1/2)(1^2 + 1^2) = 1 to Hypothesis A and (1/2)((−8)^2 + 10^2) = 82 to Hypothesis B,
so it strongly prefers Hypothesis A. Because the cost function includes both
the training loss and the regularizer, the training algorithm is encouraged
to find a compromise between the fit to the training data and the norms
of the weights. L2 regularization can be generalized to neural nets in the
obvious way: penalize the sum of squares of all the weights in all layers of
the network.
It’s pretty straightforward to incorporate regularizers into the stochastic
gradient descent computations. In particular, by linearity of derivatives,
∂E/∂θ_j = (1/N) Σ_{i=1}^N ∂L^(i)/∂θ_j + ∂R/∂θ_j.        (4)
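In code, this amounts to adding the regularizer's gradient (here λw for the L2 case) to the averaged loss gradient in each update; the helper names below are mine:

import numpy as np

def l2_penalty_grad(w, lam):
    # d/dw of (lam/2) * sum(w ** 2)
    return lam * w

def sgd_step(w, loss_grad, lam, alpha):
    """One SGD step on the regularized cost: averaged loss gradient plus lam * w."""
    return w - alpha * (loss_grad + l2_penalty_grad(w, lam))

# With a zero loss gradient, the update just shrinks the weights toward zero.
w = np.array([1.0, -8.0, 10.0])
print(sgd_step(w, loss_grad=np.zeros(3), lam=1.0, alpha=0.1))   # [0.9, -7.2, 9.0]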
4.4 Ensembles
Think back to Figure 2. If you average the predictions of multiple networks
trained independently on separate training sets, this reduces the variance of
the predictions, which can lead to lower loss. Of course, we can’t actually
carry out the hypothetical procedure of sampling training sets indepen-
dently (otherwise we’re probably better off combining them into one big
training set). We could try to train a bunch of networks on the same train-
ing set starting from different initializations, but their predictions might be
too similar to get much benefit from averaging. However, we can try to sim-
ulate the effect of independent training sets by somehow injecting variability
into the training procedure. Here are some ways of injecting variability:
• Train on random subsets of the full training data. This procedure is
known as bagging.
The most popular form of stochastic regularization is dropout. The
algorithm itself is simple: we drop out each individual unit with some prob-
ability ρ (usually ρ = 1/2) by setting its activation to zero. We can represent
this in terms of multiplying the activations by a mask variable mi , which
randomly takes the values 0 or 1:
Why does dropout help? Think back to Figure 5, where we had two
different sets of weights which make the same predictions if inputs x1 and
x2 are always identical. We saw that L2 regularization strongly prefers A
over B. Dropout has the same preference. Suppose we drop out each of the
inputs with 1/2 probability. B’s predictions will vary wildly, causing it to
get much higher error on the training set. Thus, it can achieve some of the
same benefits that L2 regularization is intended to achieve.
One important point: while stochasticity is helpful in preventing over-
fitting, we don’t want to make predictions stochastically at test time. One
naı̈ve approach would be to simply not use dropout at test time. Unfortu-
nately, this would mean that all the units receive twice as many incoming
signals as they do during training time, so their responses will be very dif-
ferent. Therefore, at test time, we compensate for this by multiplying the
values of the weights by 1 − ρ. You’ll see an interesting interpretation of
this in Homework 4.
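Here is a sketch of dropout applied to one layer's activations, with the test-time rescaling described above (ρ = 1/2; scaling the activations by 1 − ρ at test time is equivalent to scaling the outgoing weights):

import numpy as np

rng = np.random.default_rng(0)
rho = 0.5                                     # drop probability

def dropout_layer(h, train):
    if train:
        mask = (rng.random(h.shape) >= rho).astype(float)   # each m_i is 0 or 1
        return mask * h                       # drop each unit with probability rho
    else:
        return (1 - rho) * h                  # compensate so expected inputs match training

h = np.ones(8)
print(dropout_layer(h, train=True))    # roughly half the entries zeroed
print(dropout_layer(h, train=False))   # all entries scaled to 0.5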
In a few short years, dropout has become part of the standard tool-
box for neural net training, and can give a significant performance boost,
even if one is already using the other techniques described above. Other
stochastic regularizers have also been proposed; notably batch normaliza-
tion, a method we already mentioned in the context of optimization, but
which has also been shown to have some regularization benefits. It’s also
been observed that the stochasticity in stochastic gradient descent (which
is normally considered a drawback) can itself serve as a regularizer. The
details of stochastic regularization are still poorly understood, but it seems
likely that it will continue to be a useful technique.
Lecture 11: Convolutional Networks
Roger Grosse
1 Introduction
So far, all the neural networks we’ve looked at consisted of layers which
computed a linear function followed by a nonlinearity:
h = φ(Wx). (1)
We never gave these layers a name, since they’re the only thing we used.
Now we will. They’re called fully connected layers, because every one of
the input units is connected to every one of the output units. While fully
connected layers are useful, they’re not always what we want. Here are
some reasons:
• They require a lot of connections: if the input layer has M units and
the output layer has N units, then we need M N connections. This
can be quite a lot; for instance, suppose the input layer is an image
consisting of M = 256 × 256 = 65536 grayscale pixels, and the output
layer consists of N = 1000 units (modest by today’s standards). A
fully connected layer would require 65 million connections. This causes
two problems:
For the next three lectures, we’ll talk about a particular kind of network ar-
chitecture which deals with all these issues: the convolutional network, or
conv net for short. Like the name suggests, the architecture is inspired by
a mathematical operator called convolution (which we’ll explain shortly).
Figure 1: Translate-and-scale interpretation of convolution of one-
dimensional signals.
2 Convolution
Before we talk about conv nets, let’s introduce convolution. Suppose we
have two signals x and w, which you can think of as arrays, with elements
denoted as x[t] and so on. As you can guess based on the letters, you can
think of x as an input signal (such as a waveform or an image) and w as
a set of weights, which we’ll refer to as a filter or kernel. Normally the
signals we work with are finite in extent, but it is sometimes convenient to
treat them as infinitely large by treating the values as zero everywhere else;
this is known as zero padding.
Let’s start with the one-dimensional case. The convolution of x and
w, denoted x ∗ w, is a signal with entries given by
(x ∗ w)[t] = Σ_τ x[t − τ] w[τ].        (2)
There are two ways to think about this equation. The first is translate-
and-scale: the signal x ∗ w is composed of multiple copies of x, translated
and scaled by various amounts according to the entries of w. An example
of this is shown in Figure 1.
A second way to think about it is flip-and-filter. Here we generate
each of the entries of x ∗ w by flipping w, shifting it, and taking the dot
product with x. An example is shown in Figure 2.
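Here is a direct (and deliberately inefficient) implementation of Eqn. 2 with zero padding, checked against NumPy's built-in routine:

import numpy as np

def conv1d(x, w):
    """Convolution of finite 1-D signals x and w (Eqn. 2), with zero padding."""
    T = len(x) + len(w) - 1                   # length of the result
    y = np.zeros(T)
    for t in range(T):
        for tau in range(len(w)):
            if 0 <= t - tau < len(x):         # terms outside the signal are zero
                y[t] += x[t - tau] * w[tau]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 0.0, -1.0])
print(conv1d(x, w))
print(np.convolve(x, w))                      # same answer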
The two-dimensional case is exactly analogous to the one-dimensional
case; we apply the same definition, but with more indices:
(x ∗ w)[s, t] = Σ_{σ,τ} x[s − σ, t − τ] w[σ, τ].        (3)
Figure 2: Flip-and-filter interpretation of convolution of one-dimensional
signals.
2.1 Examples
Despite the simplicity of the operation, convolution can do some pretty
interesting things. For instance, we can blur an image by convolving it with the kernel

0  1  0
1  4  1
0  1  0

or sharpen it by convolving with

 0 -1  0
-1  8 -1
 0 -1  0
Figure 3: Translate-and-scale interpretation of convolution of two-
dimensional signals.
We can detect edges, for instance with the kernel

 0 -1  0
-1  4 -1
 0 -1  0

(That is, edges in the image itself, rather than edges in the world. Detecting
edges in the world is a very hard problem.) Another edge filter, known as a
Sobel filter, is

1  0 -1
2  0 -2
1  0 -1
While both properties follow easily from the definition, they’re a bit surpris-
ing and counterintuitive when you think about flip-and-filter. For instance,
let’s say you blur the image and then run a horizontal edge filter, rep-
resented as (x ∗ wblur ) ∗ whorz . By commutativity and associativity, this
is equivalent to first running the edge filter, and then blurring the result,
i.e. (x ∗ whorz ) ∗ wblur . It’s also equivalent to convolving the image with a
single kernel which is obtained by blurring the edge kernel: x∗(whorz ∗wblur ).
Another useful property of convolution is that it is linear:
This is convenient, because linear operations are often easier to deal with.
But it also shows an inherent limit to convolution: if you have a neural net
which computes lots of convolutions in sequence, it can still only compute
linear functions. In order to compute more complex operations, we’ll need
to apply some sort of nonlinear activation function in each layer. (More on
this later.)
One last property of convolution is that it’s equivariant to translation.
This means that if we shift, or translate, x by some amount, then the output
x ∗ w is shifted by the same amount. This is a useful property in the context
of neural nets, because it means the network’s computations behave in a
well-defined way as we transform the inputs.
[Diagram: a convolution (which is linear) followed by rectification, grouped together as a convolution layer.]
3 Convolution layers
We just saw that a convolution, followed by a nonlinear activation function,
followed by another convolution, could compute something interesting. This
motivates the convolution layer, a neural net layer which computes convo-
lutions followed by a nonlinear activation function. Since convolution layers
can be thought of as doing feature detection, they’re sometimes referred to
as detection layers. First, let’s see how we can think about convolution
in terms of units and connections.
Confusingly, the way they’re standardly defined, convolution layers don’t
actually compute convolutions, but a closely related operation called filter-
ing:

(x ⋆ w)[t] = Σ_τ x[t + τ] w[τ].        (9)
[Diagram: filtering drawn as units and connections, with outputs y0, y1, y2 computed from inputs x0, . . . , x4 using shared weights w0, w1, w2.]
Like the name suggests, filtering is essentially like flip-and-filter, but without
the flipping. (I.e., x ∗ w = x ? flip(w).) The two operations are basically
equivalent — the difference is just a matter of how the filter (or kernel) is
represented.
In the above example, we computed a single feature map, but just as we
normally use more than one hidden unit in fully connected layers, convolu-
tion layers normally compute multiple feature maps z1 , . . . , zM . The input
layers also consist of multiple feature maps x1 , . . . , xD ; these could be differ-
ent color channels of an RGB image, or feature maps computed by another
convolution layer. There is a separate filter wij associated with each pair of
an input and output feature map. The activations are computed as follows:
z_i = Σ_j x_j ⋆ w_ij        (10)
h_i = φ(z_i)        (11)
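Here is a sketch of Eqns. 10-11 for 1-D feature maps, with filtering implemented directly over the 'valid' positions; the sizes are made up and tanh is just a placeholder for φ:

import numpy as np

def filter1d(x, w):
    """Filtering (Eqn. 9): like convolution, but without flipping the kernel."""
    T = len(x) - len(w) + 1                   # 'valid' output positions only
    return np.array([np.dot(x[t:t + len(w)], w) for t in range(T)])

def conv_layer(xs, W, phi=np.tanh):
    """Eqns. 10-11: xs is a list of D input feature maps, W[i][j] is the filter
    connecting input map j to output map i; returns the M output feature maps."""
    zs = [sum(filter1d(xs[j], W[i][j]) for j in range(len(xs)))
          for i in range(len(W))]
    return [phi(z) for z in zs]

rng = np.random.default_rng(0)
xs = [rng.normal(size=20) for _ in range(3)]                       # D = 3 input maps
W = [[rng.normal(size=5) for _ in range(3)] for _ in range(4)]     # M = 4 output maps
hs = conv_layer(xs, W)
print(len(hs), hs[0].shape)                                        # 4 maps of length 16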
The number of connections is approximately
50 × 50 × 5 × 5 × 16 × 32 = 32 million.
4 Pooling layers
In the introduction to this lecture, we observed that a neural network’s clas-
sifications ought to be invariant to small transformations of an image, such
as shifting it by a few pixels. In order to achieve invariance, we introduce
another kind of layer: the pooling layer. Pooling layers summarize (or
compress) the feature maps of the previous layer by computing a simple
function over small regions of the image. Most commonly, this function is
taken to be the maximum, so the operation is known as max-pooling.
Suppose we have input feature maps x1 , . . . , xN . Each unit of the output
map computes the maximum over some region (called a pooling group) of
the input map. (Typically, the region could be 3 × 3.) In order to shrink the
representation, we don’t consider all offsets, but instead we space them by
a stride S along each dimension. This results in the representation being
shrunk by a factor of approximately S along each dimension. (A typical
value for the stride is 2.)
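A minimal sketch of max-pooling on a single feature map, with a p × p pooling group and stride S (boundary pixels that don't fit into a full group are simply dropped):

```python
import numpy as np

def max_pool(x, p=3, stride=2):
    """Max-pool a feature map x (H x W): take the max over p x p pooling
       groups whose top-left corners are spaced `stride` apart."""
    H, W = x.shape
    out_h = (H - p) // stride + 1
    out_w = (W - p) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride : i*stride + p, j*stride : j*stride + p].max()
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(x, p=3, stride=2).shape)   # (2, 2): shrunk by roughly the stride
```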
Figure 7 shows an example of how pooling can provide partial invariance
to translations of the input.
Pooling also has the effect of increasing the size of units’ receptive
fields, or the regions of the input image which influence their activations.
For instance, consider the network architecture in Figure 8, which alternates
between convolution and pooling layers. Suppose all the filters are 5 ×
5 and the pooling layer uses a stride of 2. Then each unit in the first
convolution layer has a receptive field of size 5 × 5. But each unit in the
second convolution layer has a receptive field of size approximately 10 × 10,
since it does 5 × 5 filtering over a representation which was shrunken by
a factor of 2 along each dimension. A third convolution layer would have
20 × 20 receptive fields. Hence, pooling allows small filters to account for
information over large regions of an image.
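The receptive-field arithmetic above can be tracked with a few lines of bookkeeping (this uses the same rough approximation as the text: a 5 × 5 filter "sees" 5 steps of the current representation, and each stride-2 pooling layer doubles how many input pixels one step corresponds to).

```python
# Rough receptive field of a unit in each convolution layer of an
# alternating conv(5x5) / pool(stride 2) stack.
scale = 1      # input pixels per step of the current representation
for layer in ["conv 5x5", "pool /2", "conv 5x5", "pool /2", "conv 5x5"]:
    if layer.startswith("conv"):
        rf = 5 * scale
        print(layer, "-> receptive field ~", rf, "x", rf)
    else:
        scale *= 2
# conv 5x5 -> receptive field ~ 5 x 5
# conv 5x5 -> receptive field ~ 10 x 10
# conv 5x5 -> receptive field ~ 20 x 20
```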
8
Figure 7: An example of how pooling can provide partial invariance to
translations of the input. Observe that the first output does not change,
since the maximum value remains within its pooling group.
...
9
Lecture 12: Object Recognition with Conv Nets
Roger Grosse
1 Introduction
Vision feels so easy, since we do it all day long without thinking about it. But
think about just how hard the problem is, and how amazing it is that we can
see. A grayscale image is just a two dimensional array of intensity values, Even talking about “images”
and somehow we can recover from that a three-dimensional understanding of masks a lot of complexity; the
human retina has to deal with 11
a scene, including the types of objects and their locations, which particular orders of magnitude in intensity
people are present, what materials things are made of, and so on. In order variation and uses fancy optics
to see, we have to deal with all sorts of “nuisance” factors, such as change that let us recover detailed
in pose or lighting. It’s amazing that the human visual system does this all information in the fovea of our
visual field, for a variety of
so seamlessly that we don’t even have to think about it. wavelengths of light.
There is a large and active field of research called computer vision which
tries to get machines to see. The field has made rapid progress in the
past decade, largely because of increasing sophistication of machine learn-
ing techniques and the availability of large image collections. They’ve for-
mulated hundreds of interesting visual “tasks” which encapsulate some of
the hidden complexity we deal with on a daily basis, such as estimating the
calorie content of a plate of food or predicting whether a structure is likely
to fall down. But there’s one task which has received an especially large
amount of attention for the past 30 years and which has driven a lot of the
progress in the field: object recognition, the task of classifying an image
into a set of object categories.
Object recognition is also a useful example for looking at how conv nets
have changed over the years, since they were a state-of-the-art tool in the
early days, and in the last five years, they have re-emerged as the state-of-
the-art tool for object recognition as well as dozens of other vision tasks.
When conv nets took over the field of computer vision, object recognition
was the first domino to fall. Computers have gotten dramatically faster
during this time, and the networks have gotten correspondingly bigger and
more powerful, but they’re still based on more or less the same design
principles. This lecture will talk about some of those design principles.
1
we get them? Do we preprocess them in some way to make life easier for
the algorithm? We’ll look at just a few examples of particularly influential
datasets, but we’ll ignore dozens more, which each have their virtues and
drawbacks.
2
as 2012, Geoff Hinton and collaborators introduced dropout (a regulariza-
tion method discussed in Lecture 9) on MNIST; this turned out to work
well on a lot of other problems, and has become one of the standard tools
in the neural net toolbox.
• Where do the images come from? They used Google Image Search
to find candidate images, and then filtered by hand which images
actually represented the object category.
3
distribution, this kind of overfitting can be eliminated if one builds a large
enough training set. Dataset bias is different — it consists of systematic
biases in a dataset resulting from the way in which the data was collected.
These regularities occur in both the training and the test sets, so algorithms
which exploit them appear to generalize well on the test set. However, if
those regularities aren’t present in the situation where one actually wants
to use the classifier (e.g. a robot trying to identify objects), the system
will perform very poorly in practice. (If an image classifier only recognizes
minarets by exploiting rotation artifacts, it’s unlikely to perform very well
in the real world.)
If dataset bias is strong enough, it encourages the troubling practice of
dataset hacking, whereby researchers engineer their learning algorithms to
be able to exploit the dataset biases in order to make their results seem more
impressive. In the case of Caltech101, the dataset biases were strong enough
that dataset hacking became essentially the only way to compete. After
about 5 years, Caltech101 basically stopped being used for computer vision
research. Dozens of other object recognition datasets were created, all using
different methodology intended to attenuate dataset bias; see this paper4 for
an interesting discussion. Despite a lot of clever attempts, creating a fully
realistic dataset is an elusive goal, and dataset bias will probably always
exist to some degree. (An interesting tidbit: both human researchers and
learning algorithms are able to determine with surprisingly high accuracy
which object recognition dataset a given image was drawn from.)

2.3 ImageNet
In 2009, taking into account lessons learned from Caltech101 and other com-
puter vision datasets, researchers built ImageNet, a massive object recogni-
tion database consisting of millions of full-resolution images and thousands
of object categories. Based on this dataset, the ImageNet Large Scale Vi-
sual Recognition Challenge (ILSVRC) became one of the most important
computer vision benchmarks. Here’s how they approached the same ques-
tions:
• How many images? The aim was to come up with hundreds of labeled
images for each synset. The ILSVRC categories all have hundreds of
associated training examples, for a total of 1.2 million images.
4
and then humans manually labeled them. Labeling millions of im-
ages is obviously challenging, so they paid Amazon Mechanical Turk
workers to annotate images. Since some of the categories were highly
specific or unusual, they had to provide the annotators with additional
information (e.g. Wikipedia articles) to help them, and carefully vali-
dated the process by measuring inter-annotator agreement.
3 LeNet
Let’s look at a particular conv net architecture: LeNet, which was used
to classify MNIST digits in 1998. The inputs are grayscale images of size
32 × 32. One detail I’ve skipped over so far is the sizes of the outputs of
convolution layers. LeNet uses valid convolutions, where the values are
computed for only those locations whose filters lie entirely within the input.
Therefore, if the input is 32 × 32 and the filters are 5 × 5, the outputs will be
28 × 28. (The main alternative is same convolution, where the output is
the same size as the input, and the input image is padded with zeros in all
directions.) The LeNet architecture is shown in Figure 1 and summarized
in Table 1.
• Convolution layer C1. This layer has 6 feature maps and filters of size
5 × 5. It has 28 × 28 × 6 = 4704 units, 28 × 28 × 5 × 5 × 6 = 117, 600
connections, and 5 × 5 × 6 = 150 weights and 6 biases, for a total of
156 trainable parameters.
5
(Figure 1: the architecture of LeNet-5.)
• Fully connected layer F5. This layer has 120 units with a full set of
connections to layer S4. Since S4 has 5 × 5 × 16 = 400 units, this layer
has 400 × 120 = 48, 000 connections, and hence the same number of
weights.
• Fully connected layer F6. This layer has 84 units, fully connected to
F5. Therefore, it has 84 × 120 = 10, 080 connections and the same
number of weights.
• Output layer. The original network used something called radial basis
functions, but for simplicity we’ll pretend it’s just a linear function,
followed by a softmax over 10 categories. It has 84 × 10 = 840 con-
nections and weights.
6
Layer    Type              # units    # connections    # weights
C1       convolution       4704       117,600          150
S2       subsampling       1176       4704             0
C3       convolution       1600       240,000          2400
S4       subsampling       400        1600             0
F5       fully connected   120        48,000           48,000
F6       fully connected   84         10,080           10,080
output   fully connected   10         840              840

Table 1: Summary of the LeNet architecture.
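To see where the counts in the table come from, here's a small sketch that reproduces them for the convolution and fully connected layers (following the table's conventions: 5 × 5 valid convolutions, biases not counted, and layer C3 treated as fully connected across the 6 input feature maps).

```python
# Reproduce the unit / connection / weight counts for a few LeNet layers.
def conv_stats(in_maps, in_size, out_maps, k=5):
    out_size = in_size - k + 1                     # "valid" convolution
    units = out_size * out_size * out_maps
    weights = k * k * in_maps * out_maps           # ignoring biases
    connections = out_size * out_size * weights    # each weight reused at every location
    return units, connections, weights

def fc_stats(in_units, out_units):
    return out_units, in_units * out_units, in_units * out_units

print("C1:", conv_stats(1, 32, 6))      # (4704, 117600, 150)
print("C3:", conv_stats(6, 14, 16))     # (1600, 240000, 2400)
print("F5:", fc_stats(400, 120))        # (120, 48000, 48000)
print("F6:", fc_stats(120, 84))         # (84, 10080, 10080)
print("out:", fc_stats(84, 10))         # (10, 840, 840)
```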
LeNet was carefully designed to push the limits of all of these resource
constraints using the computing power of 1998. As we’ll see, conv nets have
grown substantially larger in order to exploit modern computing resources.
(Try increasing the sizes of various layers and checking that you’re
substantially increasing the usage of one or more of these resources.)
4 Modern conv nets
As mentioned above, AlexNet was the conv net architecture which started
a revolution in computer vision by smashing the ILSVRC benchmark. This
6
This isn’t quite true, actually. There are tricks for storing activations for only a subset
of the layers, and recomputing the rest of the activations as needed. Indeed, frameworks
like TensorFlow implement this behind the scenes. However, a larger number of units generally
implies a higher memory footprint.
7
architecture is shown in Figure 2. Like LeNet, it consists mostly of convolution,
pooling, and fully connected layers. It additionally has some “response
normalization” layers, which I won’t talk about because they’re not believed
to make a big difference, and have mostly stopped being used.
By most measures, AlexNet is 100 to 1000 times bigger than LeNet,
as shown in Table 2. But qualitatively, the structure is very similar to
LeNet: it consists of alternating convolution and pooling layers, followed
by fully connected layers. Furthermore, like LeNet, most of the units and
connections are in the convolution layers, and most of the weights are in
the fully connected layers.
Computers have improved a lot since LeNet, but the hardware advance
that suddenly made it practical to train large neural nets was graphics
processing units (GPUs). GPUs are a kind of processor geared towards
highly parallel processing involving relatively simple operations. One of
the things they especially excel at is matrix multiplication. Since most
of the running time for a neural net consists of matrix multiplication (even
convolutions are implemented as matrix products beneath the hood), GPUs
gave roughly a 30-fold speedup in practice for training neural nets.

Figure 2: The AlexNet architecture from 2012.

                      LeNet (1989)    LeNet (1998)    AlexNet (2012)
classification task   digits          digits          objects
dataset               USPS            MNIST           ImageNet
# categories          10              10              1,000
image size            16 × 16         28 × 28         256 × 256 × 3
training examples     7,291           60,000          1.2 million
units                 1,256           8,084           658,000
parameters            9,760           60,000          60 million
connections           65,000          344,000         652 million
total operations      11 billion      412 billion     200 quadrillion (est.)

Table 2: Comparison of conv net classification architectures.
AlexNet set the agenda for object recognition research ever since. In
2013, the ILSVRC winner was based on tweaks to AlexNet. In 2014, the
second place entry was VGGNet, another conv net based on more or less
similar principles.
The winning entry for 2014, GoogLeNet, or Inception, deserves men-
tion. As the name suggests, it was designed by researchers at Google. The
8
architecture is shown in Figure 3. Clearly things have gotten more com-
plicated since the days of LeNet. But the main point of interest is that
they went out of their way to reduce the number of trainable parameters
(weights) from AlexNet’s 60 million, to about 2 million. Why? Partly it was
to reduce overfitting — amazingly, it’s possible to overfit a million images
if you have a big enough network like AlexNet.
The other reason has to do with saving memory at “test time”, i.e. when
the network is being used. Traditionally, networks would be both trained
and run on a single PC, so there wasn’t much reason to draw a distinc-
tion between training and test time. But at Google, the training could be
distributed over lots of machines in a datacenter. (The activations and pa-
rameters could even be divided up between multiple machines, increasing
the amount of available memory at training time.) But the network was also
supposed to be runnable on an Android cell phone, so that images wouldn’t
have to be sent to Google’s servers for classification. On a cell phone, it
would have been extravagant to spend 240MB to store AlexNet’s 60 million
parameters, so it was really important to cut down on parameters to make
it fit in memory.
They achieved this in two ways. First, they eliminated the fully con-
nected layers, which we already saw contain most of the parameters in LeNet
and AlexNet. GoogLeNet is convolutions all the way. It also avoids having
large convolutions by breaking them down into a sequence of convolutions
involving smaller filters. (Two 3 × 3 filters have fewer parameters than a
5 × 5 filter, even though they cover a similar radius of the image. This is
analogous to how linear bottleneck layers can reduce the number of parameters.)
They call this layer-within-a-layer architecture “Inception”, after the movie
about dreams-within-dreams.
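A quick back-of-the-envelope check on the parameter-savings argument (my own numbers; real Inception modules also change the number of feature maps between layers):

```python
# Parameters for a single 5 x 5 convolution vs. two stacked 3 x 3 convolutions,
# keeping the number of feature maps fixed at C throughout (biases ignored).
C = 64
params_5x5 = 5 * 5 * C * C
params_two_3x3 = 2 * (3 * 3 * C * C)
print(params_5x5, params_two_3x3)   # 102400 vs 73728: fewer parameters,
                                    # even though both cover a 5 x 5 region.
```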
Performance on ImageNet improved astonishingly fast during the years
the competition was run. Here are the figures (we'll put off the last item,
deep residual nets (ResNets), until Lecture 16, since they depend on some
ideas that we won't cover until we talk about RNNs):

Year   Model                             Top-5 error
2010   Hand-designed descriptors + SVM   28.2%
2011   Compressed Fisher Vectors + SVM   25.8%
2012   AlexNet                           16.4%
2013   a variant of AlexNet              11.7%
2014   GoogLeNet                         6.6%
2015   deep residual nets                4.5%
It’s really unusual for error rates to drop by a factor of 6 over a period
of 5 years, especially on a task like object recognition that hundreds of
researchers had already worked hard on and where performance had seemed
to plateau.
9
Figure 3: The Inception architecture from 2014.
10
Lecture 15: Recurrent Neural Nets
Roger Grosse
1 Introduction
Most of the prediction tasks we’ve looked at have involved pretty simple
kinds of outputs, such as real values or discrete categories. But much of the
time, we’re interested in predicting more complex structures, such as images
or sequences. The next three lectures are about producing sequences; we’ll
get to producing images later in the course. If the inputs and outputs are
both sequences, we refer to this as sequence-to-sequence prediction.
Here are a few examples of sequence prediction tasks:
We’ve already seen one architecture which generates sequences: the neu-
ral language model. Recall that we used the chain rule of conditional prob-
ability to decompose the probability of a sentence:
p(w1 , . . . , wT ) = ∏_{t=1}^{T} p(wt | w1 , . . . , wt−1 ),              (1)
and then made a Markov assumption so that we could focus on a short time
window:
p(wt | w1 , . . . , wt−1 ) = p(wt | wt−K , . . . , wt−1 ), (2)
where K is the context length. This means the neural language model is
memoryless: its predictions don’t depend on anything before the context
window. But sometimes long-term dependencies can be important:
1
Figure 1: Left: A neural language model with context length of 1. Right:
Turning this into a recurrent neural net by adding connections between the
hidden units. Note that information can pass through the hidden units,
allowing it to model long-distance dependencies.
The fact that the sentence is about Rob Ford gives some clues about what is
coming next. But the neural language model can’t make use of that unless
its context length is at least 13.
Figure 1 shows a neural language model with context length 1 being
used to generate a sentence. Let’s say we modify the architecture slightly
by adding connections between the hidden units. This gives it a long-term
memory: information about the first word can flow through the hidden units
to affect the predictions about later words in the sentence. Such an archi-
tecture is called a recurrent neural network (RNN). This seems like a
simple change, but actually it makes the architecture much more powerful.
(For a neural language model, each set of hidden units would usually receive
connections from the last K inputs, for K > 1. For RNNs, usually it only
has connections from the current input. Why?)
RNNs are widely used today both in academia and in the technology in-
dustry; the state-of-the-art systems for all of the sequence prediction tasks
listed above use RNNs.
• Know how to compute the loss derivatives for an RNN using backprop
through time.
2
Figure 2: An example of an RNN and its unrolled representation. Note
that each color corresponds to a weight matrix which is replicated at all
time steps.
at all time steps, as well as the connections between them. For a given se-
quence length, the unrolled network is essentially just a feed-forward neural
net, although the weights are shared between all time steps. See Figure 2
for an example.
The trainable parameters for an RNN include the weights and biases for
all of the layers; these are replicated at every time step. In addition, we
need some extra parameters to get the whole thing started, i.e. determine
the values of the hidden units at the first time step. We can do this one of
two ways:
• We can learn a separate set of biases for the hidden units in the first
time step.

• We can start with a dummy time step which receives no inputs. We
would then learn the initial values of the hidden units, i.e. their values
during the dummy time step.

(Really, these two approaches aren’t very different. The signal from the
t = 0 hiddens to the t = 1 hiddens is always the same, so we can just learn
a set of biases which do the same thing.)
Let’s look at some simple examples of RNNs.
3
Figure 3: Top: the RNN for Example 1. Bottom: the RNN for Example 2.
Input: 0 1 0 1 1 0 1 0 1 1
Parity bits: 0 1 1 0 1 1 −→
This suggests a strategy: the output unit y (t) represents the parity
bit, and it feeds into the computation at the next time step. In
other words, we’d like to achieve the following relationship:
4
Figure 4: RNN which computes the parity function (Example 3).
5
It performs the following computations in the forward pass (all of these
equations are basically like the feed-forward case, except for z(t)):

z(t) = u x(t) + w h(t−1)                                                   (3)
h(t) = φ(z(t))                                                             (4)
r(t) = v h(t)                                                              (5)
y(t) = φ(r(t)).                                                            (6)
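Here's how Equations 3–6 might look in code for this scalar example (a sketch; I use the logistic function for φ, though the equations hold for any activation function).

```python
import numpy as np

def phi(a):
    return 1.0 / (1.0 + np.exp(-a))      # logistic activation

def rnn_forward(x, u, v, w, h0=0.0):
    """Forward pass of the scalar RNN in Equations (3)-(6) for an input sequence x."""
    T = len(x)
    z, h, r, y = (np.zeros(T) for _ in range(4))
    h_prev = h0
    for t in range(T):
        z[t] = u * x[t] + w * h_prev     # (3)
        h[t] = phi(z[t])                 # (4)
        r[t] = v * h[t]                  # (5)
        y[t] = phi(r[t])                 # (6)
        h_prev = h[t]
    return z, h, r, y

z, h, r, y = rnn_forward(np.array([1.0, -0.5, 2.0, 0.0]), u=0.8, v=1.5, w=-0.3)
print(y)     # the output sequence y(1), ..., y(T)
```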
Figure 5 shows the unrolled computation graph. Note the weight shar-
ing. Now we just need to do backprop on this graph, which is hopefully a
completely mechanical procedure by now (pay attention to the rules for h(t),
u, v, and w):
L̄ = 1                                                                     (7)
ȳ(t) = L̄ ∂L/∂y(t)                                                          (8)
r̄(t) = ȳ(t) φ′(r(t))                                                       (9)
h̄(t) = r̄(t) v + z̄(t+1) w                                                   (10)
z̄(t) = h̄(t) φ′(z(t))                                                       (11)
ū = Σ_{t=1}^{T} z̄(t) x(t)                                                  (12)
v̄ = Σ_{t=1}^{T} r̄(t) h(t)                                                  (13)
w̄ = Σ_{t=1}^{T−1} z̄(t+1) h(t)                                              (14)
These update rules are basically like the ones for an MLP, except that the
weight updates are summed over all time steps. (Why are the bounds different
in the summations over t?)
6
Figure 5: The unrolled computation graph.
The vectorized backprop rules are analogous. (Remember that for all the
activation matrices, rows correspond to training examples and columns
correspond to units, and N is the number of data points, or mini-batch size.)

L̄ = 1                                                                     (15)
Ȳ(t) = L̄ ∂L/∂Y(t)                                                          (16)
R̄(t) = Ȳ(t) ◦ φ′(R(t))                                                     (17)
H̄(t) = R̄(t) V⊤ + Z̄(t+1) W⊤                                                 (18)
Z̄(t) = H̄(t) ◦ φ′(Z(t))                                                     (19)
Ū = (1/N) Σ_{t=1}^{T} Z̄(t)⊤ X(t)                                           (20)
V̄ = (1/N) Σ_{t=1}^{T} R̄(t)⊤ H(t)                                           (21)
W̄ = (1/N) Σ_{t=1}^{T−1} Z̄(t+1)⊤ H(t)                                       (22)
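Below is a sketch of backprop through time in vectorized form. The weight-shape conventions are my own (U is D × H and multiplies on the right, so some transposes differ from the equations above), but the structure is the same: error signals flow backwards through time, and the weight gradients are accumulated over time steps. The loss here is squared error, chosen just to make the example self-contained.

```python
import numpy as np

def phi(a):
    return 1.0 / (1.0 + np.exp(-a))              # logistic activation

def forward(X, U, V, W):
    """X: list of T matrices of shape N x D (rows are training examples).
       Z(t) = X(t) U + H(t-1) W,  H(t) = phi(Z(t)),  R(t) = H(t) V,  Y(t) = phi(R(t))."""
    N, H = X[0].shape[0], W.shape[0]
    Hs, Zs, Rs, Ys = [np.zeros((N, H))], [], [], []
    for Xt in X:
        Z = Xt @ U + Hs[-1] @ W
        Hcur = phi(Z)
        R = Hcur @ V
        Zs.append(Z); Hs.append(Hcur); Rs.append(R); Ys.append(phi(R))
    return Zs, Hs, Rs, Ys

def backprop_through_time(X, targets, U, V, W):
    """Gradients of the mean squared error (over examples) w.r.t. U, V, W."""
    Zs, Hs, Rs, Ys = forward(X, U, V, W)
    N = X[0].shape[0]
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    dZ_next = None
    for t in reversed(range(len(X))):
        dY = (Ys[t] - targets[t]) / N                          # error signal on outputs
        dR = dY * Ys[t] * (1.0 - Ys[t])                        # times phi'(R)
        dH = dR @ V.T + (dZ_next @ W.T if dZ_next is not None else 0.0)
        dZ = dH * Hs[t + 1] * (1.0 - Hs[t + 1])                # times phi'(Z)
        dU += X[t].T @ dZ
        dV += Hs[t + 1].T @ dR
        dW += Hs[t].T @ dZ                                     # Hs[t] is h(t-1)
        dZ_next = dZ
    return dU, dV, dW

rng = np.random.default_rng(0)
X = [rng.normal(size=(5, 3)) for _ in range(4)]                # T = 4, N = 5, D = 3
targets = [rng.uniform(size=(5, 2)) for _ in range(4)]
U, V, W = rng.normal(size=(3, 6)), rng.normal(size=(6, 2)), 0.1 * rng.normal(size=(6, 6))
dU, dV, dW = backprop_through_time(X, targets, U, V, W)
print(dU.shape, dV.shape, dW.shape)                            # (3, 6) (6, 2) (6, 6)
```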
7
4 Sequence Modeling
Now let’s look at some ways RNNs can be applied to sequence modeling.
However, unlike with the other two models, we won’t make a Markov as-
sumption. In other words, the distribution over each word depends on all
the previous words. We’ll make the predictions using an RNN; each of the
conditional distributions will be predicted using the output units at a given
time step. As usual, we’ll use a softmax activation function for the output
units, and cross-entropy loss.
At training time, the words of a training sentence are used as both the
inputs and the targets to the network, as follows:
It may seem funny to use the sentence as both input and output — isn’t it
easy to predict a sentence from itself? But each word appears as a target
before it appears as an input, so there’s no way for information to flow from
the word-as-input to the word-as-target. That means the network can’t
cheat by just copying its input.
To generate from the RNN, we sample each of the words in sequence
from its predictive distribution. This means we compute the output units
for a given time step, sample the word from the corresponding distribution,
and then feed the sampled word back in as an input in the next time step.
We can represent this as follows:
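In code, this generation loop might look like the following sketch (everything here is illustrative: rnn_step is a hypothetical function returning the new hidden state and a distribution over the vocabulary, and the token names are placeholders).

```python
import numpy as np

def sample_sentence(rnn_step, h0, vocab, bos="<BOS>", eos="<EOS>", max_len=50):
    """Sample a sentence one word at a time, feeding each sampled word back
       in as the input at the next time step."""
    rng = np.random.default_rng()
    h, word, words = h0, bos, []
    for _ in range(max_len):
        h, probs = rnn_step(h, word)        # output distribution for this time step
        word = rng.choice(vocab, p=probs)   # sample from the predictive distribution
        if word == eos:
            break
        words.append(word)
    return " ".join(words)
```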
Remember that vocabularies can get very large, especially once you
include proper nouns. As we saw in Lecture 10, it’s computationally dif-
ficult to predict distributions over millions of words. In the context of a
8
neural language model, one has to deal with this by changing the scheme
for predicting the distribution (e.g. using hierarchical softmax or negative
sampling). But RNNs have memory, which gives us another option: we
can model text one character at a time! In addition to the computational
problems of large vocabularies, there are additional advantages to modeling
text as sequences of characters:
• Any words that don’t appear in the vocabulary are implicitly assigned
probability 0. But with a character-based language model, there’s only
a finite set of ASCII characters to consider.
The first thing to notice is that the text isn’t globally coherent, so it’s
clearly not just memorized in its entirety from Wikipedia. But the model
produces mostly English words and some grammatical sentences, which is
a nontrivial achievement given that it works at the character level. It even
produces a plausible non-word, “ephemerable”, meaning it has picked up
some morphological structure. The text is also locally coherent, in that it
starts by talking about politics, and then transportation.
9
But this has some clear problems: the sentences might not even be the
same length, and even if they were, the words wouldn’t be perfectly aligned
because different languages have different grammar. There might also be
some ambiguous words early on in the sentence which can only be resolved
using information from later in the sentence.
Another approach, which was done successfully in 20142 , is to have the
RNN first read the English sentence, remember as much as it can in its
hidden units, and then generate the French sentence. This is represented as
follows:
The special end-of-sentence token <EOS> marks the end of the input. The
part of the network which reads the English sentence is known as the en-
coder, and the part that reads the French sentence is known as the de-
coder, and they don’t share parameters.
Interestingly, remembering the English sentence is a nontrivial subprob-
lem in itself. We can define a simplified task called memorization, where
the network gets an English sentence as input, and has to output the same
sentence. Memorization can be a useful testbed for experimenting with
RNN algorithms, just as MNIST is a useful testbed for experimenting with
classification algorithms.
Before RNNs took over, most machine translation was done by algo-
rithms which tried to transform one sentence into another. The RNN ap-
proach described above is pretty different, in that it converts the whole
sentence into an abstract semantic representation, and then uses that to
generate the French sentence. This is a powerful approach, because the en-
coders and decoders can be shared between different languages. Inputs of
any language would be mapped to a common semantic space (which ought
to capture the “meaning”), and then any other language could be generated
from that semantic representation. This has actually been made to work,
and RNNs are able to perform machine translation on pairs of languages
for which there were no aligned pairs in the training set!
2
I. Sutskever. Sequence to sequence learning with neural networks. 2014
10
A particularly impressive example of the capabilities of RNNs is that they
are able to learn to execute simple programs. This was demonstrated by
Wojciech Zaremba and Ilya Sutskever, then at Google.3 Here, the input to
the RNN was a simple Python program consisting of simple arithmetic and
control flow, and the target was the result of executing the program. Both
the inputs and the targets were fed to the RNN one character at a time.
Examples are shown in Figure 6.
Their RNN architecture was able to learn to do this fairly well. Some
examples of outputs of various versions of their system are shown in Figure 7.

Figure 6: Left: Example inputs for the “learning to execute” task. Right:
An input with scrambled characters, to highlight the difficulty of the task.

Figure 7: Examples of outputs of the RNNs from the “Learning to execute”
paper.

3
W. Zaremba and I. Sutskever. Learning to Execute. ICLR, 2015.
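To make the setup concrete, here's a toy sketch of how a (program, target) training pair for this kind of task could be generated and serialized (a simplification of my own; the paper's actual generator controls program length and nesting much more carefully).

```python
import random

def make_example(rng):
    """Generate a tiny arithmetic program and its result, both as strings."""
    a, b, n = rng.randint(1, 9999), rng.randint(1, 9999), rng.randint(2, 9)
    program = f"j={a}\nfor x in range({n}):j+={b}\nprint(j)"
    target = str(a + n * b)
    return program, target

rng = random.Random(0)
prog, tgt = make_example(rng)
print(prog)
print("Target:", tgt)
# Both strings would then be fed to the RNN one character at a time.
```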
11
Lecture 16: Learning Long-Term Dependencies
Roger Grosse
1 Introduction
Last lecture, we introduced RNNs and saw how to derive the gradients using
backprop through time. In principle, this lets us train them using gradient
descent. But in practice, gradient descent doesn’t work very well unless
we’re careful. The problem is that we need to learn dependencies over long
time windows, and the gradients can explode or vanish.
We’ll first look at the problem itself, i.e. why gradients explode or vanish.
Then we’ll look at some techniques for dealing with the problem — most
significantly, changing the architecture to one where the gradients are stable.
• Know about various methods for dealing with the problem, and why
they help:
– Gradient clipping
– Reversing the input sequence
– Identity initialization
– Reason about how the memory cell behaves for a given setting
of the input, output, and forget gates
– Understand how this architecture helps keep the gradients stable
information about the first word in the sentence doesn’t get used in the
1
Figure 1: Encoder-decoder model for machine translation (see 14.4.2 for
full description). Note that adjusting the weights based on the first in-
put requires the error signal to travel backwards through the entire path
highlighted in red.
h̄(t) = z̄(t+1) w
z̄(t) = h̄(t) φ′(z(t))

Hence, h̄(1) is a linear function of h̄(T). The coefficient is the partial deriva-
tive ∂h(T)/∂h(1). If we make the simplifying assumption that the activation func-
1
Review Section 14.3 if you’re hazy on backprop through time.
2
tions are linear, we get
∂h(T)/∂h(1) = w^(T−1),

which can clearly explode or vanish unless w is very close to 1. For instance,
if w = 1.1 and T = 50, we get ∂h(T)/∂h(1) = 117.4, whereas if w = 0.9
and T = 50, we get ∂h(T)/∂h(1) = 0.00515. In general, with nonlinear
activation functions, there’s nothing special about w = 1; the boundary
between exploding and vanishing will depend on the values h(t) .
More generally, in the multivariate case,
∂h(T)/∂h(1) = W^(T−1).
This will explode if the largest eigenvalue of W is larger than 1, and vanish
if the largest eigenvalue is smaller than 1.
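Here's a quick numerical illustration (a sketch; the matrix is random and rescaled so that its largest eigenvalue magnitude is exactly 1 before scaling it up or down).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 10))
A /= np.max(np.abs(np.linalg.eigvals(A)))        # largest |eigenvalue| is now 1

for scale in [1.1, 0.9]:
    J = np.linalg.matrix_power(scale * A, 49)    # W^(T-1) for T = 50
    print(scale, np.linalg.norm(J))              # large norm for 1.1, tiny for 0.9
```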
Contrast this with the behavior of the forward pass. In the forward
pass, the activations at each step are put through a nonlinear activation
function, which typically squashes the values, preventing them from blowing
up. Since the backwards pass is entirely linear, there’s nothing to prevent
the derivatives from blowing up.
This looks a bit like repeatedly applying the function f . Therefore, we can
gain some intuition for how RNNs behave by studying iterated functions,
i.e. functions which we iterate many times.
Iterated functions can be complicated. Consider the innocuous-looking
quadratic function
f (x) = 3.5 x (1 − x). (2)
If we iterate this function multiple times (i.e. f (f (f (x))), etc.), we get some
complicated behavior, as shown in Figure 2. Another famous example of
the complexity of iterated functions is the Mandelbrot set, which is defined in terms of the mapping z ← z² + c for a complex constant c:
3
Figure 2: Iterations of the function f (x) = 3.5 x (1 − x).
If you initialize at z0 = 0 and iterate this mapping, it will either stay within
some bounded region or shoot off to infinity, and the behavior depends on
the value of c. The Mandelbrot set is the set of values of c where it stays
bounded; as you can see, this is an incredibly complex fractal.
It’s a bit easier to analyze iterated functions if they’re monotonic. Con-
sider the function
f (x) = x² + 0.15.
This is monotonic over [0, 1]. We can determine the behavior of repeated
iterations visually:
4
Here, the red line shows the trajectory of the iterates. If the initial value is
x0 = 0.6, start with your pencil at x = y = 0.6, which lies on the dashed line.
Set y = f (x) by moving your pencil vertically to the graph of the function,
and then set x = y by moving it horizontally to the dashed line. Repeat
this procedure, and you should notice a pattern. There are some regions
where the iterates move to the left, and other regions where they move to
the right. Eventually, the iterates either shoot off to infinity or wind up at
a fixed point, i.e. a point where x = f (x). Fixed points are represented
graphically as points where the graph of x intersects the dashed line. Some
fixed points (such as 0.82 in this example) repel the iterates; these are called
sources. Other fixed points (such as 0.17) attract the iterates; these are
called sinks, or attractors. The behavior of the system can be summarized
with a phase plot:
Observe that fixed points with derivatives f ′(x) < 1 are sinks and fixed
points with f ′(x) > 1 are sources.
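You can reproduce this behavior numerically in a few lines (a sketch; starting below the source at roughly 0.82, the iterates settle at the sink, while starting above it they diverge).

```python
def f(x):
    return x * x + 0.15

for x0 in [0.6, 0.85]:
    x = x0
    for _ in range(30):
        x = f(x)
    print(x0, "->", x)
# 0.6  -> converges to the sink
# 0.85 -> the iterates blow up (printed as inf)
```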
Even though the computations of an RNN are discrete, we can think of
them as a sort of dynamical system, which has various attractors:
This figure is a cartoon of the space of hidden activations. If you start out
in the blue region, you wind up in one attractor, whereas if you start out
in the red region, you wind up in the other attractor. If you evaluate the
Jacobian ∂h(T ) /∂h(1) in the interior of one of these regions, it will be close
to 0, since if you change the initial conditions slightly, you still wind up
at exactly the same place. But the Jacobian right on the boundary will be
large, since shifting the initial condition slightly moves us from one attractor
to the other.
To make this story more concrete, consider the following RNN, which
uses the tanh activation function:
Figure 3 shows the function computed at each time step, as well as the
function computed by the network as a whole. From this figure, you can
see which regions have exploding or vanishing gradients. (Think about how
we can derive the right-hand figure from the left-hand one using the analysis
given above.)
5
Figure 3: (left) The function computed by the RNN at each time step,
(right) the function computed by the network.
Figure 4: (left) Loss function for individual training examples, (right) cost
function averaged over 1000 training examples.
6
3.1 Gradient Clipping
First, there’s a simple trick which sometimes helps a lot: gradient clip-
ping. Basically, we prevent gradients from blowing up by rescaling them so
that their norm is at most a particular value η. I.e., if ‖g‖ > η, where g is
the gradient, we set

g ← η g / ‖g‖.                                                             (4)
This biases the training procedure, since the resulting values won’t actually
be the gradient of the cost function. However, this bias can be worth it if
it keeps things stable. The following figure shows an example with a cliff
and a narrow valley; if you happen to land on the face of the cliff, you
take a huge step which propels you outside the good region. With gradient
clipping, you can stay within the valley.
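In code, clipping by the gradient norm is just a couple of lines (a sketch; most deep learning frameworks provide an equivalent built-in operation).

```python
import numpy as np

def clip_gradient(g, eta):
    """Rescale g so that its norm is at most eta (Equation 4)."""
    norm = np.linalg.norm(g)
    return g if norm <= eta else eta * g / norm

g = np.array([3.0, 4.0])                 # norm 5
print(clip_gradient(g, eta=1.0))         # [0.6 0.8], norm 1
```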
There’s a gap of only one time step between when the first word is read
and when it’s needed. This means that the network can easily learn the
relationships between the first words; this could allow it to learn good word
representations, for instance. Once it’s learned this, it can go on to the
more difficult dependencies between words later in the sentences.
7
3.3 Identity Initialization
In general, iterated functions can have complex and chaotic behavior. But
there’s one particular function you can iterate as many times as you like: the
identity function f (x) = x. If your network computes the identity function,
the gradient computation will be perfectly stable, since the Jacobian is
simply the identity matrix. Of course, the identity function isn’t a very
interesting thing to compute, but it still suggests we can keep things stable
by encouraging the computations to stay close to the identity function.
The identity RNN architecture2 is a kind of RNN where the activation
functions are all ReLU, and the recurrent weights are initialized to the
identity matrix. The ReLU activation function clips the activations to be
nonnegative, but for nonnegative activations, it’s equivalent to the identity
function. This simple initialization trick achieved some neat results; for
instance, it was able to classify MNIST digits which were fed to the network
one pixel at a time, as a length-784 sequence.
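The initialization itself is trivial to write down. Here's a sketch of one recurrent step under this scheme (the sizes are my own; for the pixel-at-a-time MNIST example, the input dimension would be 1).

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 100, 1                               # hidden units; one pixel per time step
W_hh = np.eye(H)                            # recurrent weights initialized to the identity
W_xh = 0.001 * rng.normal(size=(D, H))      # small random input-to-hidden weights

def step(h_prev, x_t):
    """One time step of an identity RNN: ReLU units, identity recurrent init.
       For nonnegative activations, the recurrence starts out close to the identity map."""
    return np.maximum(x_t @ W_xh + h_prev @ W_hh, 0.0)
```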
8
• At the center is a memory cell, which is the thing that’s able to
remember information over time. It has a linear activation function,
and a self-loop which is modulated by a forget gate, which takes
values between 0 and 1; this means that the weight of the self-loop is
equal to the value of the forget gate.
• The forget gate is a unit similar to the ones we’ve covered previously;
it computes a linear function of its inputs, followed by a logistic ac-
tivation function (which means its output is between 0 and 1). The
forget gate would probably be better called a “remember gate”, since
if it is on (takes the value 1), the memory cell remembers its previous
value, whereas if the forget gate is off, the cell forgets it.
• The block also receives inputs from other blocks in the network; these
are summed together and passed through a tanh activation function
(which squashes the values to be between -1 and 1). The connection
from the input unit to the memory cell is gated by an input gate,
which has the same functional form as the forget gate (i.e., linear-
then-logistic).
• The block produces an output, which is the value of the memory cell,
passed through a tanh activation function. It may or may not pass
this on to the rest of the network; this connection is modulated by the
output gate, which has the same form as the input and forget gates.
It’s useful to summarize various behaviors the memory cell can achieve
depending on the values of the input and forget gates:
input gate forget gate behavior
0 1 remember the previous value
1 1 add to the previous value
0 0 erase the value
1 0 overwrite the value
If the forget gate is on and the input gate is off, the block simply computes
the identity function, which is a useful default behavior. But the ability to
read and write from it lets it implement more sophisticated computations.
The ability to add to the previous value means these units can simulate a
counter; this can be useful, for instance, when training a language model,
if sentences tend to be of a particular length.
When we implement an LSTM, we have a bunch of vectors at each time
step, representing the values of all the memory cells and each of the gates.
Mathematically, the computations are as follows:
[ i(t) ; f(t) ; o(t) ; g(t) ] = [ σ ; σ ; σ ; tanh ] ( W [ x(t) ; h(t−1) ] )        (6)
c(t) = f(t) ◦ c(t−1) + i(t) ◦ g(t)                                                  (7)
h(t) = o(t) ◦ tanh(c(t)).                                                           (8)
Here, (6) uses a shorthand for applying different activation functions to
different parts of the vector. Observe that the blocks receive signals from
9
Figure 5: The LSTM unit.
10
the current inputs and the previous time step’s hidden units, just like in
standard RNNs. But the network’s input g and the three gates i, o, and f
have independent sets of incoming weights. Then (7) gives the update rule
for the memory cell (think about how this relates to the verbal description
above), and (8) defines the output of the block.
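Here's a sketch of a single LSTM time step implementing Equations 6–8 (the shape of W is my own choice: it maps the concatenated [x(t), h(t−1)] to the four stacked blocks i, f, o, g; biases are omitted for brevity).

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM time step. W maps [x(t), h(t-1)] (size D + H) to the four
       blocks i, f, o, g (each of size H), stacked as 4H output columns."""
    H = h_prev.shape[1]
    a = np.concatenate([x_t, h_prev], axis=1) @ W   # pre-activations, shape (N, 4H)
    i = sigma(a[:, 0*H:1*H])          # input gate       } together these are
    f = sigma(a[:, 1*H:2*H])          # forget gate      } Equation (6): different
    o = sigma(a[:, 2*H:3*H])          # output gate      } activations applied to
    g = np.tanh(a[:, 3*H:4*H])        # candidate input  } different vector blocks
    c = f * c_prev + i * g            # memory cell update, Equation (7)
    h = o * np.tanh(c)                # block output, Equation (8)
    return h, c

N, D, H = 5, 10, 20
rng = np.random.default_rng(0)
x = rng.normal(size=(N, D))
h0, c0 = np.zeros((N, H)), np.zeros((N, H))
W = 0.1 * rng.normal(size=(D + H, 4 * H))
h1, c1 = lstm_step(x, h0, c0, W)
print(h1.shape, c1.shape)             # (5, 20) (5, 20)
```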
For homework, you are asked to show that if the forget gate is on and
the input and output gates are off, it just passes the memory cell gradients
through unmodified at each time step. Therefore, the LSTM architecture
is resistant to exploding and vanishing gradients, although mathematically
both phenomena are still possible.
If the LSTM architecture sounds complicated, that was the reaction of
machine learning researchers when it was first proposed. It wasn’t used
much until 2013 and 2014, when researchers achieved impressive results
on two challenging and important sequence prediction problems: speech-to-
text and machine translation. Since then, they’ve become one of the most
widely used RNN architectures; if someone tells you they’re using an RNN,
there’s a good chance they’re actually using an LSTM. There have been
many attempts to simplify the architecture, and one particular variant called
the gated recurrent unit (GRU) is fairly widely used, but so far nobody has
found anything that’s both simpler and at least as effective across the board.
It appears that most of the complexity is probably required. Fortunately,
you hardly ever have to think about it, since LSTMs are implemented as a
black box in all of the major neural net frameworks.
11
Lecture 17: ResNets and Attention
Roger Grosse
1 Introduction
We have two unrelated agenda items for today. First, we’ll revisit image
classification in light of what we’ve learned about RNNs. In particular, we
saw that one way to make it easier for RNNs to learn long-distance depen-
dencies is to make it easy for each layer to represent the identity function,
which lets them pass information unmodified through many layers. This is
a useful thing for the network to do, and it also helps keep the gradients
from exploding or vanishing. If we want to train image classifiers with a
ridiculously large number of layers, we need to use these sorts of tricks. The
deep residual network (ResNet) is a particularly elegant architecture which
lets information pass directly through; it can be used to train networks with
hundreds, or even thousands of layers, and is the current state-of-the-art for
a variety of computer vision tasks.
Our second agenda item is attention. The problem with the encoder-
decoder architecture for translation is that all the information about the
input sentence needs to be stored in the vector of hidden activations. This
has a fixed dimension (typically on the order of 1000), i.e. it doesn’t grow
with the length of the sentence. It’s pretty neat that summarizing the
meaning of a sentence as a vector works at all, but this strategy hits its
limits once the sentences are about 20 words or so, a fairly typical sentence
length. Attention-based architectures allow the network to refer back to the
input sentence as they produce their output, thereby reducing the pressure
on the hidden units and allowing them to easily handle very long sentences.
2 ResNets
Before 2015, the GoogLeNet (Inception) architecture set the standard for
a deep conv net. It was about 20 layers deep, not counting pooling. In
2015, the new state-of-the-art on ImageNet was the deep residual network
(ResNet), which had the distinction that it was 150 layers deep. When
we discussed image classification, I promised we’d come back to ResNets
once we covered a key conceptual idea. That idea was exploding and van-
ishing gradients.
Recall that the Jacobian ∂h(T)/∂h(1) for an RNN is the product of the
Jacobians of individual layers:

∂h(T)/∂h(1) = ∂h(T)/∂h(T−1) · · · ∂h(2)/∂h(1).
1
But notice that this same formula applies to the Jacobian for a feed-forward
network (e.g. MLP or conv net). How come we never talked about exploding
and vanishing gradients until we got to RNNs? The reason is that until
recently, feed-forward nets were at most tens of layers deep, whereas RNNs
would often be unrolled for hundreds of time steps. Hence, we’d be doing
lots more steps of backprop (i.e. multiplying lots of Jacobians together),
making things more likely to explode or vanish. This means if we want to
train feed-forward nets with hundreds of layers, we need to figure out how
to keep the backprop computations stable.
In Homework 3, you derived the backprop equations for the following
architecture, where the inputs get added to the outputs:

z = W^(1) x + b^(1)
h = φ(z)                (1)
y = x + W^(2) h
This is a special case of a more general architectural primitive called the
residual block:
y = x + F(x), (2)
where F is a function called the residual function. In the above example,
F is an MLP with one hidden layer. In general, it’s typically a shallow
neural net, with 1–3 hidden layers. We can represent the residual block
graphically as follows:
(Each layer computes a separate residual function, with separate trainable
parameters.) Last lecture, we noted two architectures that make it easy to
represent the identity function: identity RNNs and LSTMs. The ResNet is a
third such architecture. Observe that if each F returns zero (e.g. because all
the weights are 0), then this architecture simply passes the input x through
unmodified. I.e., it computes the identity function.
We can also see this algebraically in terms of the backprop equation for
a residual block:
x̄^(ℓ) = x̄^(ℓ+1) + x̄^(ℓ+1) ∂F/∂x
      = x̄^(ℓ+1) (I + ∂F/∂x)                (3)
Hence, if ∂F/∂x = 0, the error signals are simply passed through unmodi-
fied. As long as ∂F/∂x is small, the Jacobian for the residual block will be
close to the identity, and the error signals won’t explode or vanish.
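To make this concrete, here is a minimal NumPy sketch of a single residual block whose residual function F is a one-hidden-layer MLP, as in the homework architecture above. The choice of ReLU for φ and the particular shapes are illustrative assumptions, not part of the original notes.

import numpy as np

def residual_block(x, W1, b1, W2):
    """Compute y = x + F(x), where F is a one-hidden-layer MLP (Eqns. 1-2)."""
    h = np.maximum(0.0, W1 @ x + b1)    # hidden activations of the residual function
    return x + W2 @ h                   # the skip connection adds the input back in

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1, W2 = rng.normal(size=(8, 4)), np.zeros(8), rng.normal(size=(4, 8))
y = residual_block(x, W1, b1, W2)

# With all-zero weights, F(x) = 0 and the block computes the identity function,
# which is what keeps error signals from exploding or vanishing in deep stacks.
assert np.allclose(residual_block(x, np.zeros((8, 4)), np.zeros(8), np.zeros((4, 8))), x)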
So that’s the one big idea behind ResNets. If people say they are using
ResNets for a vision task, they’re probably referring to particular architec-
tures based on the ones in this paper1 . This paper achieved state-of-the-art
on ImageNet in 2015, and since then, the state-of-the-art on many computer
vision tasks has consisted of variants of these ResNet architectures. There’s
one important detail that needs to be mentioned: the input and output to
a residual block clearly need to be the same size, because the output is the
sum of the input and the residual function. But for conv nets, it’s important
to shrink the images (e.g. using pooling) in order to expand the number of
feature maps. ResNets typically achieve this by having a few convolution
layers with a stride of 2, so that the dimension of the image is reduced by
a factor of 2 along each dimension.
The benefit of the ResNet architecture is that it’s possible to train ab-
surdly large numbers of layers. The state-of-the-art ImageNet classifier from
the above paper had 50 residual blocks, and the residual function for each
was a 3-layer conv net, so the network as a whole had about 150 layers.
Hardly anybody had been expecting it to be useful to train 150 layers. On
a smaller object recognition benchmark called CIFAR, they were actually
able to train a ResNet with 1000 layers, though it didn’t work any better
than their 100-layer network.
What on earth are all these layers doing? When we visualized the In-
ception activations, we found pretty good evidence that higher layers were
learning more abstract and high-level features. But the idea that there are
150 meaningfully different levels of abstraction seems pretty fishy. We ac-
tually don’t have a good explanation for why 150 layers works better than
50.
3 Attention
Our second topic for today is attention. Recall the encoder-decoder model
for machine translation from last lecture:
1 K. He, X. Zhang, S. Ren, and J. Sun, 2016. Deep residual learning for image recognition.
All the information the decoder receives about the input sentence is stored
in a single code vector, which is the final hidden state of the encoder. This
means the code vector needs to store all the relevant information about
the input sentence — and since we’re translating the whole sentence, that
effectively means it must have memorized the sentence. It’s a bit surprising
that this is possible, though not implausible: it may require about 1000
bits to store the ASCII characters in a 20-word sentence, so it should be
possible in principle to store the same information in a vector of length 5000
(a typical hidden dimension for this architecture). But still, this is putting
a lot of pressure on the RNN’s memory.
Attention-based modeling fixes this problem by allowing the decoder
to look at the input sentence as it generates text. This removes the need
for the hidden units to store the whole input sentence. Instead, they’ll
just have to remember a little bit of context about things like where it is
in the input sentence and what part of speech it’s looking for next. The
original attention-based translation paper was Bahdanau et al., “Neural
machine translation by jointly learning to align and translate”2 , and we’ll
be focusing on their architecture here.
This model has both an encoder and a decoder. Let’s consider both in
sequence. First, the encoder. The encoder’s job is to compute an anno-
tation vector for each word in the sentence; these vectors are what the
decoder will see when it focuses on a word. One seemingly obvious choice
would be to use a lookup table of word representations, just like the neural
language model from Lecture 7. But words can have multiple meanings, and
it often requires information from the rest of the sentence to disambiguate
the meaning. The relevant information could be either earlier or later in the
sentence. So instead, we use an architecture called a bidirectional RNN.
(The original bidirectional RNN uses a kind of architecture called the gated
recurrent unit (GRU), which is similar to the LSTM; you could use an LSTM
instead if you want.)
This is really just a fancy term for two completely separate RNNs, one of
which processes the words in forward order, and the other of which processes
them in reverse order. The hidden states of the two RNNs are concatenated
at each time step to give the annotation vector.
The decoder architecture is shown in Figure 1. It is very similar to the
RNN language models we’ve looked at, in that it is an RNN which predicts a
2 D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate. ICLR, 2015. https://arxiv.org/abs/1409.0473
Figure 1: The decoder architecture from Bahdanau et al.’s attention-based
translation model.
distribution over words at each time step, and receives the words as input.
Like the RNN language models, it's trained using teacher forcing, so it
receives the words of the actual target sentence as inputs. To sample from
the model at test time, the words are sampled from the model's predictions
and fed back in as inputs. So far, nothing new. (In practice, we don't actually
sample from the model's predictions, since that has too high a chance of
producing a silly sentence. Instead, we search for the most probable output
sentence using a technique called beam search. But this is beyond the scope
of the class.)
The difference is that the decoder uses attention to compute a context
vector c^(i) for each output time step i. (We're using i rather than t to
keep separate the time steps of the encoder and decoder.) This is a soft
attention model, which means that it has the ability to spread its attention
across all the input words. (Hard attention models only look at one part of
the input at a time, but we won't consider those in this class.) More precisely,
it computes a weighted average of the annotation vectors for all the input words:

c^(i) = Σ_j α_ij h^(j),                (4)
where the attention weights αij are computed as a softmax over all the input
words:
α_ij = exp(e_ij) / Σ_{j′} exp(e_ij′)                (5)
Notice that the logits eij are a function of both the decoder’s previous
hidden state s(i−1) and the annotation vector h(j) . The previous hidden
state is clearly needed, since the decoder needs to remember some context
about what it has already generated (such as the part of speech of the
previous word) in order to know where to look. Using the annotation vectors
themselves as inputs to the attention weights is an interesting approach,
as it lets the attention mechanism implement content-based addressing,
which looks up words according to their semantics rather than their position
in the sentence. For instance, if the decoder has just produced an adjective,
it can look for the noun that it modifies in the input sentence.
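Here is a small NumPy sketch of the soft attention computation in Eqns. 4–5 for a single decoder time step: a softmax over the attention logits gives the weights, and the context vector is the corresponding weighted average of the annotation vectors. How the logits e_ij are computed from s^(i−1) and h^(j) is left abstract here.

import numpy as np

def attention_context(e, H):
    """Soft attention for one decoder time step.

    e: attention logits, shape (T,), one per input position j
    H: annotation vectors, shape (T, d), one row per input position
    Returns the attention weights alpha (Eqn. 5) and the context vector c (Eqn. 4).
    """
    e = e - e.max()                       # subtract the max for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()   # softmax over input positions
    c = alpha @ H                         # weighted average of the annotation vectors
    return alpha, c

# Toy usage: 6 input words with 10-dimensional annotation vectors.
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 10))
e = rng.normal(size=6)
alpha, c = attention_context(e, H)
assert np.isclose(alpha.sum(), 1.0)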
Figure 2: Visualizations of where the attention model is looking as it gener-
ates each output word. Each row corresponds to the attention vector (the
αij ’s) for one word in the output (French) sentence. Figure from Bahdanau
et al.
Lecture 18: Learning probabilistic models
Roger Grosse
1 Overview
In the first half of the course, we introduced backpropagation, a technique we used to
train neural nets to minimize a variety of cost functions. One of the cost functions we
discussed was cross-entropy, which encourages the network to learn to predict a probability
distribution over the targets. This was our first glimpse into probabilistic modeling. But
probabilistic modeling is so important that we’re going to spend almost the last third of
the course on it. This lecture introduces some of the key principles.
This lecture and the next one aren’t about neural nets. Instead, they’ll introduce the
principles of probabilistic modeling in as simple a setting as possible. Then, starting next
week, we’re going to apply these principles in the context of neural nets, and this will result
in some very powerful models.
2 Maximum likelihood
The first method we’ll cover for fitting probabilistic models is maximum likelihood. In
addition to being a useful method in its own right, it will also be a stepping stone towards
Bayesian modeling.
Let’s begin with a simple example: we have flipped a particular coin 100 times, and it
landed heads NH = 55 times and tails NT = 45 times. We want to know the probability
that it will come up heads if we flip it again. We formulate the probabilistic model:
The behavior of the coin is summarized with a parameter θ, the probability
that a flip lands heads (H). The flips D = x(1) , . . . , x(100) are independent
Bernoulli random variables with parameter θ.
(In general, we will use D as a shorthand for all the observed data.) We say that the indi-
vidual flips are independent and identically distributed (i.i.d.); they are independent
because one outcome does not influence any of the other outcomes, and they are identically
distributed because they all follow the same distribution (i.e. a Bernoulli distribution with
parameter θ).
We now define the likelihood function L(θ), which is the probability of the observed
data, as a function of θ. In the coin example, the likelihood is the probability of the
particular sequence of H’s and T’s being generated:

L(θ) = θ^NH (1 − θ)^NT.                (1)
Note that L is a function of the model parameters (in this case, θ), not the observed data.
This likelihood function will generally take on extremely small values; for instance,
L(0.5) = 0.5^100 ≈ 7.9 × 10^−31. Therefore, in practice we almost always work with the
log-likelihood function,

ℓ(θ) = log L(θ) = NH log θ + NT log(1 − θ).                (2)

For our coin example, ℓ(0.5) = log 0.5^100 = 100 log 0.5 = −69.31. This is a much easier
value to work with.
In general, we would expect good choices of θ to assign high likelihood to the observed
data. This suggests the maximum likelihood criterion: choose the parameter θ which
maximizes `(θ). If we’re lucky, we can do this analytically by computing the derivative and
setting it to zero. (More precisely, we find critical points by setting the derivative to zero.
We check which of the critical points, or boundary points, has the largest value.) Let’s try
this for the coin example:
dℓ/dθ = d/dθ [NH log θ + NT log(1 − θ)]
      = NH/θ − NT/(1 − θ)                (3)

Setting this to zero, we find the maximum likelihood estimate

θ̂_ML = NH / (NH + NT),                (4)
i.e. the maximum likelihood estimate is simply the fraction of flips that came up heads.
(We put a hat over the parameter to emphasize that it’s an estimate.) Hopefully this seems
like a sensible guess for θ.
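As a quick sanity check, here are the coin-flip computations in a few lines of Python; the counts are the ones above, and the printed numbers match the values computed earlier (−69.31 and 0.55).

import numpy as np

NH, NT = 55, 45                          # observed counts of heads and tails

def log_likelihood(theta):
    # log L(theta) = NH log(theta) + NT log(1 - theta)
    return NH * np.log(theta) + NT * np.log(1 - theta)

theta_ml = NH / (NH + NT)                # Eqn. 4: fraction of flips that came up heads

print(log_likelihood(0.5))               # about -69.31
print(theta_ml)                          # 0.55
print(log_likelihood(theta_ml))          # the maximum, about -68.81

Now let’s look at some more examples.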
Example 1. Suppose we are interested in modeling the distribution of temper-
atures in Toronto in March, using the high temperatures, in Celsius, from
the first week of March 2014. We model the temperatures as i.i.d. Gaussian with
unknown mean µ and known standard deviation σ.
Since µ can take any possible real value, the maximum of the log-likelihood must occur at a critical
point, so let’s look for critical points. Setting the derivative to 0,
0 = dℓ/dµ = −(1/(2σ²)) Σ_{i=1}^N d/dµ (x^(i) − µ)²
          = (1/σ²) Σ_{i=1}^N (x^(i) − µ)                (6)

Therefore, Σ_{i=1}^N (x^(i) − µ) = 0, and solving for µ, we get µ = (1/N) Σ_{i=1}^N x^(i). The
maximum likelihood estimate of the mean of a normal distribution is simply
the mean of the observed values, or the empirical mean. Plugging in our
temperature data, we get µ̂_ML = −5.97.
Example 2. Now let’s also estimate the standard deviation σ from the data,
rather than assuming it is known. Since ℓ is now a function
of two variables, we find critical points by setting the partial derivatives to 0.
In this case,

0 = ∂ℓ/∂µ = (1/σ²) Σ_{i=1}^N (x^(i) − µ)                (7)

0 = ∂ℓ/∂σ = ∂/∂σ Σ_{i=1}^N [ −½ log 2π − log σ − (1/(2σ²)) (x^(i) − µ)² ]
          = Σ_{i=1}^N [ 0 − 1/σ + (1/σ³) (x^(i) − µ)² ]
          = −N/σ + (1/σ³) Σ_{i=1}^N (x^(i) − µ)²                (8)

From the first equation, we find that µ̂_ML = (1/N) Σ_{i=1}^N x^(i) is the empirical mean,
just as before. From the second equation, we find σ̂_ML = √( (1/N) Σ_{i=1}^N (x^(i) − µ)² ).
In other words, σ̂_ML is simply the empirical standard deviation. In the case of
the Toronto temperatures, we get µ̂_ML = −5.97 (as before) and σ̂_ML = 4.55.
Example 3. We’ve just seen two examples where we could obtain the exact
maximum likelihood solution analytically. Unfortunately, this situation is the
exception rather than the rule. Let’s consider how to compute the maximum
likelihood estimate of the parameters of the gamma distribution, whose PDF
is defined as
p(x) = (b^a / Γ(a)) x^(a−1) e^(−bx),                (9)
where Γ(a) is the gamma function, which is a generalization of the factorial
function to continuous values.1 The model parameters are a and b, both of
which must take positive values. The log-likelihood, therefore, is
ℓ(a, b) = Σ_{i=1}^N [ a log b − log Γ(a) + (a − 1) log x^(i) − b x^(i) ]
        = N a log b − N log Γ(a) + (a − 1) Σ_{i=1}^N log x^(i) − b Σ_{i=1}^N x^(i).                (10)
1 The definition is Γ(t) = ∫_0^∞ x^(t−1) e^(−x) dx, but we’re never going to use the definition in this class.
Most scientific computing environments provide a function which computes
log Γ(a). In SciPy, for instance, it is scipy.special.gammaln.
To maximize the log-likelihood, we’re going to use gradient ascent, which is just
like gradient descent, except we move uphill rather than downhill. To derive
the update rules, we need the partial derivatives:
∂ℓ/∂a = N log b − N d/da log Γ(a) + Σ_{i=1}^N log x^(i)                (11)

∂ℓ/∂b = N a/b − Σ_{i=1}^N x^(i).                (12)
• The derivatives worked out nicely because we were dealing with log-likelihoods. Try
taking derivatives of the likelihood function L(θ), and you’ll see that they’re much
messier.
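To make the gradient ascent procedure concrete, here is a minimal sketch that fits a and b by repeatedly taking uphill steps along Eqns. 11–12, using scipy.special.digamma for the derivative of log Γ(a). The synthetic data, initialization, learning rate, and number of steps are all arbitrary choices for illustration.

import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=0.5, size=1000)   # synthetic data; true a = 3, b = 2
N, sum_log_x, sum_x = len(x), np.log(x).sum(), x.sum()

a, b = 1.0, 1.0                      # initial guess; both must stay positive
lr = 0.01                            # learning rate for gradient ascent
for _ in range(20000):
    grad_a = N * np.log(b) - N * digamma(a) + sum_log_x   # Eqn. 11
    grad_b = N * a / b - sum_x                            # Eqn. 12
    a += lr * grad_a / N             # dividing by N just rescales the step size
    b += lr * grad_b / N
print(a, b)                          # should land near the true parameters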
language model is clearly bogus. If this were a statistics class, we’d talk about ways
to test your modeling assumptions. But because this is a machine learning class,
we’ll throw caution to the wind and fit models that we know are wrong. Hopefully
they’ll still be good enough that they can make sensible predictions (in the supervised
setting) or reveal interesting structure (in the unsupervised setting).
• A distribution p(θ), known as the prior distribution. It’s called the prior because
it’s supposed to encode your “prior beliefs,” i.e. everything you believed about the
parameters before looking at the data. In practice, we normally choose priors to be
computationally convenient, rather than based on any sort of statistical principle.
More on this later.
• The likelihood p(D | θ), the probability of the observations given the parameters,
just like in maximum likelihood.
• The posterior distribution p(θ | D). This corresponds to our beliefs about the
parameters after observing the data. In general, the posterior distribution can be
computed using Bayes’ Rule:
p(θ | D) = p(θ) p(D | θ) / ∫ p(θ′) p(D | θ′) dθ′.                (13)
However, we don’t normally compute the denominator directly. Instead we work with
unnormalized distributions as long as possible, and normalize only when we need to.
Bayes’ Rule can therefore be written in a more succinct form, using the symbol ∝ to
denote “proportional to”:
p(θ | D) ∝ p(θ)p(D | θ). (14)
• The posterior predictive distribution p(D0 | D), which is the distribution over
future observables given past observations. For instance, given that we’ve observed
55 H’s and 45 T’s, what’s the probability that the next flip will land H? We can
compute the posterior predictive distribution by computing the posterior over θ and
then marginalizing out θ:
p(D′ | D) = ∫ p(θ | D) p(D′ | θ) dθ.                (15)
Figure 1: The PDF of the beta distribution for various values of the parameters a and b.
Observe that the distribution becomes more peaked as a and b become large, and the peak
is near a/(a + b).
particularly useful one is the beta distribution, parameterized by a, b > 0, and defined as:

p(θ; a, b) = [Γ(a + b) / (Γ(a) Γ(b))] θ^(a−1) (1 − θ)^(b−1).                (16)
This distribution is visualized in Figure 1. Why did we choose the beta distribution, of all
things? Once we work through the example, we’ll see that it’s actually pretty convenient.
Observe that the first term (with all the Γ’s) is just a normalizing constant, so it doesn’t
depend on θ. In most of our computations, we’ll only need to work with unnormalized dis-
tributions (i.e. ones which don’t necessarily integrate to 1), so we can drop the cumbersome
normalizing constant and write

p(θ; a, b) ∝ θ^(a−1) (1 − θ)^(b−1).
A few values are plotted in Figure 1. From these plots, we observe a few things:
• The distribution appears to be centered around a/(a + b). In fact, it’s possible to
show that if θ ∼ Beta(a, b), then E[θ] = a/(a + b).
Figure 2: Plots of the prior, likelihood, and posterior for the coin flip example, with the
prior Beta(2, 2). (Left) Small data setting, NH = 2, NT = 0. (Right) Large data
setting, NH = 55, NT = 45. In this case, the data overwhelm the prior, so the posterior is
determined by the likelihood. Note: for visualization purposes, the likelihood function is
normalized to integrate to 1, since otherwise it would be too small to see.
Now let’s compute the posterior and posterior predictive distributions. When we plug
in our prior and likelihood terms for the coin example, we get:

p(θ | D) ∝ p(θ) p(D | θ)
         ∝ θ^(a−1) (1 − θ)^(b−1) · θ^NH (1 − θ)^NT
         = θ^(NH+a−1) (1 − θ)^(NT+b−1).

But this is just a beta distribution with parameters NH + a and NT + b. Let’s stop and
check if our answer makes sense. As we observe more flips, NH and NT both get larger,
and the distribution becomes more peaked around a particular value. Furthermore, the
peak of the distribution will be near NH /(NH + NT ), our maximum likelihood solution.
This reflects the fact that the more data we observe, the less uncertainty there is about
the parameter, and the more the likelihood comes to dominate. We say that the data
overwhelm the prior.
We now compute the posterior predictive distribution over the next flip x′:

θ_pred = Pr(x′ = H | D)
       = ∫ p(θ | D) Pr(x′ = H | θ) dθ
       = ∫ Beta(θ; NH + a, NT + b) · θ dθ
       = (NH + a) / (NH + NT + a + b),

where the last step uses the formula for the mean of the beta distribution.
Notice that the prior was chosen to have the same functional form as the likelihood.3
Since we multiply these expressions together to get the (unnormalized) posterior, the pos-
terior will also have this functional form. A prior chosen in this way is called a conjugate
prior. In this case, the parameters of the prior distribution simply got added to the
observed counts, so they are sometimes referred to as pseudo-counts.
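As a small illustration of the pseudo-count update and the posterior predictive probability of heads, here is the coin example in code (scipy.stats.beta is used only to query the posterior; the update itself is just adding counts).

from scipy.stats import beta

a, b = 2.0, 2.0                              # prior pseudo-counts, Beta(2, 2)
NH, NT = 55, 45                              # observed heads and tails

# The posterior is Beta(NH + a, NT + b): the observed counts get added to the pseudo-counts.
posterior = beta(NH + a, NT + b)

theta_pred = (NH + a) / (NH + NT + a + b)    # posterior predictive Pr(next flip = H)
print(theta_pred)                            # 57/104, about 0.548
print(posterior.mean())                      # the same value: the posterior mean of theta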
Let’s look at some more examples.
Example 4. Let’s return to our problem of estimating the mean temperature
in Toronto, where our model assumes a Gaussian with unknown mean µ and
known standard deviation σ = 5. The first task is to choose a conjugate prior.
In order to do this, let’s look at the PDF of a single data point:

p(x | µ) = (1 / (√(2π) σ)) exp( −(x − µ)² / (2σ²) )                (24)
3 The ∝ notation obscures the fact that the normalizing constants in these two expressions may be completely different, since p(θ) is a distribution over parameters, while p(D | θ) is a distribution over observed data. In this example, the latter normalizing constant happens to be 1, but that won’t always be the case.
If we look at this as a function of µ (rather than x), we see that it’s still a
Gaussian! This should lead us to conjecture that the conjugate prior for a
Gaussian is a Gaussian. Let’s try it and see if it works.
Our prior distribution will be a Gaussian distribution with mean µpri and stan-
dard deviation σpri . The posterior is then given by:
p(µ | D) ∝ p(µ) p(D | µ)
         = [ 1/(√(2π) σ_pri) exp( −(µ − µ_pri)² / (2σ_pri²) ) ] · Π_{i=1}^N [ 1/(√(2π) σ) exp( −(x^(i) − µ)² / (2σ²) ) ]
         ∝ exp( −(µ − µ_pri)² / (2σ_pri²) − (1/(2σ²)) Σ_{i=1}^N (x^(i) − µ)² )
         ∝ exp( −(µ − µ_post)² / (2σ_post²) ),

where

σ_post = 1 / √( 1/σ_pri² + N/σ² )                (25)

µ_post = [ (1/σ_pri²) µ_pri + (N/σ²) · (1/N) Σ_{i=1}^N x^(i) ] / ( 1/σ_pri² + N/σ² ).                (26)
The last step uses a technique called completing the square. You’ve probably
done this before in a probability theory class. So the posterior distribution is a
Gaussian with mean µpost and standard deviation σpost .
The formulas are rather complicated, so let’s break them apart. First look how
σpost changes if we vary the prior or the data.
• What if we increase the prior standard deviation σpri or the observation
standard deviation σ? Then the denominator gets smaller, which means
σpost gets larger. This should be intuitive, because increasing the uncer-
tainty in either the prior or the likelihood should increase the uncertainty
in the posterior.
Now let’s look at the formula for µ_post. It takes the form of a weighted
average of the prior mean µ_pri and the maximum likelihood mean (1/N) Σ_{i=1}^N x^(i).
By weighted average, I mean something of the form

(ar + bs) / (a + b),

where the weights a and b are both positive. This is a weighted average of r
and s; if a is larger, it is closer to r, while if b is larger, it is closer to s. For
µ_post, the weights for the prior mean and maximum likelihood mean are 1/σ_pri²
and N/σ², respectively. Let’s see what happens if we change the problem.
Observe that Σ_{i=1}^N x^(i) is a sufficient statistic, since it is the only thing we need
to remember about the data in order to compute the posterior.
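Here is a short sketch of the posterior update in Eqns. 25–26. The prior parameters and data values are made up for illustration (the actual temperature data isn't reproduced here); only σ = 5 is taken from the example.

import numpy as np

sigma = 5.0                                  # known observation standard deviation
mu_pri, sigma_pri = 0.0, 10.0                # prior mean and standard deviation (illustrative)
x = np.array([-8.0, -3.0, -6.5, -4.0, -7.0, -5.5, -8.0])   # hypothetical week of observations

N = len(x)
precision_post = 1.0 / sigma_pri**2 + N / sigma**2          # this is 1 / sigma_post^2
sigma_post = 1.0 / np.sqrt(precision_post)                  # Eqn. 25
mu_post = (mu_pri / sigma_pri**2 + x.sum() / sigma**2) / precision_post   # Eqn. 26

print(mu_post, sigma_post)
# The weights on the prior mean and the empirical mean are 1/sigma_pri^2 and N/sigma^2.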
Finally, let’s take a look at the posterior predictive distribution. We compute
this as
p(x′ | D) = ∫ p(µ | D) p(x′ | µ) dµ
          = ∫ Gaussian(µ; µ_post, σ_post) Gaussian(x′; µ, σ) dµ
          = Gaussian(x′; µ_post, √(σ_post² + σ²))                (27)
The last step uses the formula for the convolution of two Gaussian distributions.
Now let’s see how it behaves.
Figure 3: The prior, posterior, and posterior predictive distributions for the Toronto tem-
peratures example.
• When there are no observations (i.e. N = 0), then µpost and σpost are the
prior mean and standard deviation. The predictive distribution is centered
at µpost , but more spread out than the prior.
• When N is very large, the mean of the predictive distribution is close to
the maximum likelihood mean, and the standard deviation is very close to
σ. In other words, it makes almost the same predictions as the maximum
likelihood estimate.
The prior, posterior, and posterior predictive distributions are all shown in
Figure 3.
For both the coin and Gaussian examples, the posterior predictive distribution had
the same parametric form as the model. (I.e., it was a Bernoulli distribution for the coin
model, and a Gaussian distribution for the Gaussian model.) This does not happen in
general; often the posterior predictive distribution doesn’t have a convenient form, which
is part of what makes the full Bayesian approach difficult to apply.
But for the Bayesian approach, we need to compute an integral in order to marginalize
out the model parameters. If we only have a few parameters, we can do this using nu-
merical quadrature methods. Unfortunately, these methods are exponential in the number
of variables being integrated out. If we’re trying to fit a neural net with thousands (or
even millions) of parameters, this is completely impractical. There are other methods for
integration which perform well in high dimensional spaces; we’ll discuss one such set of
techniques, called Markov chain Monte Carlo, later in the course. However, integration
still tends to be a much more difficult problem than optimization, so if possible we would
like to formulate our learning algorithms in terms of optimization. Let’s now look at the
maximum a-posteriori (MAP) approximation, a way of converting the integration problem
into an optimization problem.
Example 5. Let’s return to our coin flip example. The joint probability is
given by:

log p(θ, D) = log p(θ) + log p(D | θ)
            = const + (a − 1) log θ + (b − 1) log(1 − θ) + NH log θ + NT log(1 − θ)
            = const + (NH + a − 1) log θ + (NT + b − 1) log(1 − θ).
(Here, const is a shorthand for terms which don’t depend on θ.) Let’s maximize
this by finding a critical point:
d/dθ log p(θ, D) = (NH + a − 1)/θ − (NT + b − 1)/(1 − θ)                (33)

Setting this to zero, we get

θ̂_MAP = (NH + a − 1) / (NH + NT + a + b − 2)                (34)
We can summarize the results of the three different methods in the following
table, for a = b = 2.
           Formula                                 NH = 2, NT = 0    NH = 55, NT = 45
θ̂_ML       NH / (NH + NT)                          1                 55/100 = 0.55
θ_pred     (NH + a) / (NH + NT + a + b)            4/6 ≈ 0.67        57/104 ≈ 0.548
θ̂_MAP      (NH + a − 1) / (NH + NT + a + b − 2)    3/4 = 0.75        56/102 ≈ 0.549
When we have 100 observations, all three methods agree quite closely with each
other. However, with only 2 observations, they are quite different. θ̂ML = 1,
which as we noted above, is dangerous because it assigns no probability to T,
and it will have a test log-likelihood of −∞ if there is a single T in the test
set. The other methods smooth the estimates considerably. MAP behaves
somewhere in between ML and FB; this happens pretty often, as MAP is a sort
of compromise between the two methods.
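The table above is easy to reproduce in code; this sketch simply evaluates the three formulas for both data settings.

def estimates(NH, NT, a=2, b=2):
    theta_ml = NH / (NH + NT)                            # maximum likelihood, Eqn. 4
    theta_pred = (NH + a) / (NH + NT + a + b)            # full Bayesian posterior predictive
    theta_map = (NH + a - 1) / (NH + NT + a + b - 2)     # MAP, Eqn. 34
    return theta_ml, theta_pred, theta_map

print(estimates(2, 0))      # (1.0, 0.667, 0.75): with little data, the methods disagree
print(estimates(55, 45))    # (0.55, 0.548, 0.549): with more data, they nearly agree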
Example 6. Let’s return to our Gaussian example. Let’s maximize the joint
probability:
log p(µ, D) = const − (1/(2σ_pri²)) (µ − µ_pri)² − (1/(2σ²)) Σ_{i=1}^N (x^(i) − µ)²                (35)

d/dµ log p(µ, D) = −(1/σ_pri²) (µ − µ_pri) + (1/σ²) Σ_{i=1}^N (x^(i) − µ)                (36)
When we set this to 0, we get exactly the same formula for µ̂MAP as we derived
earlier for µpost . This doesn’t mean the two methods make the same predictions,
though. The two predictive distributions have the same mean, but the MAP
one has standard deviation σ, compared with σ_pred = √(σ_post² + σ²) for
the full Bayesian approach. In other words, the full Bayesian approach smooths
the predictions, while MAP does not. Therefore, the full Bayesian approach
tends to make more sensible predictions in the small data setting. A comparison
of the three methods is shown in Figure 4.

Figure 4: Comparison of the predictions made by the ML, FB, and MAP methods about
future temperatures. (Left) After observing one training case. (Right) After observing
7 training cases, i.e. one week.
Figure 5: The full Bayesian posterior predictive distribution given the temperatures for
the first week, and a histogram of temperatures for the remainder of the month. Observe
that the predictions are poor because of model misspecification.
However, in the presence of model misspecification, the full Bayesian approach can
still overfit. This term is unfortunate because it makes it sound like misspecification only
happens when we do something wrong. But pretty much all the models we use in machine
learning are vast oversimplifications of reality, so we can’t rely on the theoretical guarantees
of the Bayesian approach (which rely on the model being correctly specified). We can
see this in our Toronto temperatures example. Figure 5 shows the posterior predictive
distribution given the first week of March, as well as a histogram of temperature values for
the rest of the month. A lot of the temperature values are outside the range predicted by
the model! There are at least two problems here, both of which result from the erroneous
i.i.d. assumption:
• The data are not identically distributed: the observed data are for the start of the
month, and temperatures may be higher later in the month.
• The data are not independent: temperatures in subsequent days are correlated, so
treating each observation as a new independent sample results in a more confident
posterior distribution than is actually justified.
Unfortunately, the data are rarely independent in practice, and there are often systematic
differences between the datasets we train on and the settings where we’ll need to apply the
learned models in practice. Therefore, overfitting remains a real possibility even with the
full Bayesian approach.
4 Summary
We’ve introduced three different methods for learning probabilistic models:
• Maximum likelihood (ML), where we choose the parameters which maximize the
likelihood:

θ̂_ML = arg max_θ ℓ(θ) = arg max_θ log p(D | θ).                (37)
• The full Bayesian (FB) approach, where we make predictions using the posterior
predictive distribution. To do this, we condition on the data and integrate out the
parameters:

p(D′ | D) = ∫ p(θ | D) p(D′ | θ) dθ.                (38)

• The maximum a-posteriori (MAP) approximation, where we choose the parameters
which maximize the posterior:

θ̂_MAP = arg max_θ log p(θ | D) = arg max_θ [ log p(θ) + log p(D | θ) ].                (39)
This is similar to ML in that it’s an optimization problem, and the prior term log p(θ)
is analogous to a regularization term.
All three approaches behave similarly in the setting where there are many more data points
than parameters. However, in settings where there isn’t enough data to accurately fit the
parameters, the Bayesian methods have a smoothing effect, which can result in much more
sensible predictions. Later in this course, we’ll see models where each of these methods is
useful.
Lecture 19: Generative Adversarial Networks
Roger Grosse
1 Introduction
Generative modeling is a type of machine learning where the aim is to
model the distribution that a given set of data (e.g. images, audio) came
from. Normally this is an unsupervised problem, in the sense that the
models are trained on a large collection of data. (Generative models are
sometimes used for supervised learning, but we won't consider that here; see
Gaussian discriminant analysis or naive Bayes in CSC411.) For instance, recall that
the MNIST dataset was obtained by scanning handwritten zip code digits
from envelopes. So consider the distribution of all the digits people ever
write in zip codes. The MNIST training examples can be viewed as samples
from this distribution. If we fit a generative model to MNIST, we’re trying
to learn about the distribution from the training samples. Notice that this
formulation doesn’t use the labels, so it’s an unsupervised learning problem.
Figure 1(a) shows a random subset of the MNIST training examples,
and Figure 1(b) shows some samples from a generative model (called a
Deep Boltzmann Machine) trained on MNIST1 ; this was considered an im-
pressive result back in 2009. The model’s samples are visually hard to dis-
tinguish from the training examples, suggesting that the model has learned
to match the distribution fairly closely. (We’ll see later why this can be
misleading.) But generative modeling has come a long way since then, and
in fact has made astounding progress over the past 4 years. Figure 1(c)
shows some samples from a Generative Adversarial Network (GAN) trained
on the “dog” category from the CIFAR-10 object recognition dataset in
20152 ; this was considered an impressive result at the time. Fast forward
two years, and GANs are now able to produce convincing high-resolution
images3 , as exemplified in Figure 1(d).
Why train a generative model?
Figure 1: (a) Training images from the MNIST dataset. (b) Samples from a
Deep Boltzmann Machine (Salakhutdinov and Hinton, 2009). (c) Samples
from a GAN trained on the “dog” category of CIFAR-10 (Denton et al.,
2015) (d) Samples from a GAN trained on images of celebrities (Karras et
al., 2017).
• Learning features from unlabeled data. This idea got a lot of attention about 10 years ago, with the motivation that there’s
a lot more unlabeled data than labeled data, and having more data
ought to let us learn better features. This motivation has declined
in importance due to the availability of large labeled datasets such
as ImageNet, and to the surprising success of supervised features at
transferring to other tasks.
Last time, we saw very simple examples of learning distributions, i.e. fit-
ting Gaussian and Bernoulli distributions using maximum likelihood. This
lecture and the next one are about deep generative models, where we use
neural nets to learn powerful generative models of complex datasets. There
are four kinds of deep generative models in widespread use today:
• Generative adversarial networks (the topic of today’s lecture)
• Reversible architectures (Lecture 20)
• Autoregressive models (Lectures 7, 15–17, and 20)
• Variational autoencoders (beyond the scope of this class; you can learn about them in csc412)
Three of these four kinds of generative models are typically trained with
maximum likelihood. But Generative Adversarial Networks (GANs)
are based on a very different idea: we’d like the model to produce samples
which are indistinguishable from the real data, as judged by a discriminator
network whose job it is to tell real from fake. GANs are probably the current
state-of-the-art generative model, as judged by the quality of the samples.
Figure 2: A 1-dimensional generator network.
Implicit generative models are pretty hard to think about, since the relation-
ship between the network weights and the density is complicated. Figure 2
shows an example of a generator network which encodes a univariate dis-
tribution with two different modes. Try to understand why it produces the
density shown.
When we train an implicit generative model of images, we’re aiming to
learn the following:
This probably seems preposterous at first; how can you encode something as
complex as a probability distribution over images in terms of a deterministic
mapping from a spherical Gaussian distribution? But amazingly, it works.
Since the density is intractable, we can't train an implicit generative model by
maximizing the likelihood. Generative adversarial networks use an elegant training criterion
that doesn’t require computing the likelihood. In particular, if the genera-
tor is doing a good job of modeling the data distribution, then the generated
samples should be indistinguishable from the true data. So the idea behind
GANs is to train a discriminator network whose job it is to classify
whether an observation (e.g. an image) is from the training set or whether
it was produced by the generator. The generator is evaluated based on the
discriminator’s inability to tell its samples from data.
To rephrase this, we simultaneously train two different networks:
• The generator network G, defined as in Section 2, which tries to gen-
erate realistic samples
• The discriminator network D, which is a binary classification network
which tries to classify real vs. fake samples. It takes an input x and
computes D(x), the probability it assigns to x being real.
The two networks are trained competitively: the generator is trying to fool
the discriminator, and the discriminator is trying not to be fooled by the
generator. This is shown schematically as follows:
Note that the generator has no control over the first term in Eqn. 1, which
is why we simply write it as constant.
Consider the cost function from the perspective of the generator. Given
a fixed generator, the discriminator will learn to minimize its cross-entropy.
The generator knows this, so it wants to maximize the minimum cross-
entropy achievable by any discriminator. Mathematically, it’s trying to
compute

arg max_G min_D J_D.                (3)

Since this cost function involves a min inside a max, it’s called the minimax
formulation. It’s an example of a perfectly competitive game, or zero-
sum game, since the generator and discriminator have exactly opposing
objectives.
The generator and discriminator are trained jointly, so they can adapt
to each other. Both networks are trained using backprop on their cost func- Unfortunately, not having a unified
tions. This is handled automatically by autodiff packages, but conceptually cost function for training both
networks makes the training
we can understand it as shown in Figure 3. In practice, we don’t actually do dynamics much more complicated
separate updates for the generator and discriminator; rather, both networks compared with the optimization
are updated in a single backprop step. Figure 4 shows a cartoon example setting, as we assumed in the rest
of a GAN being trained on a 1-dimensional toy dataset. of this course. This means GAN
training can be pretty finnicky.
Figure 4: Cartoon of training a GAN to model a 1-dimensional distribution.
Black: the data density. Blue: the discriminator function. Green: the
generator distribution. Arrows: the generator function. First the discrim-
inator is updated, then the generator, and so on. Figure from Goodfellow
et al., 2014, “Generative adversarial nets”.
pushing the network to assign probability 0.01 rather than 0.001, and then
0.1 rather than 0.01, and so on.
The same reasoning applies to GANs. Observe what happens if the
discriminator is doing very well, or equivalently, the generator is doing very
badly. This means D(G(z)) is very close to 0, and hence JG is close to 0
(the worst possible value). If we were to change the generator’s weights just
a little bit, then JG would still be close to 0. This means we’re in a plateau
of the minimax cost function, i.e. the generator’s gradient is close to 0, and
it will hardly get updated.
We can apply a fix that’s roughly analogous to when we switched from
logistic-least-squares to logistic-cross-entropy in Lecture 4. In particular, we
modify the generator’s cost function to magnify small differences in D(G(z))
when it is close to 0. Mathematically, we replace the generator cost from
Eqn. 2 with the modified cost

J_G = E_z[ −log D(G(z)) ].
This cost function is really unhappy when the discriminator is able to con-
fidently recognize one of its samples as fake, so the generator gets a strong
gradient signal pushing it to make the discriminator less confident. Even-
tually, it should be able to produce samples which actually fool the dis-
criminator. The relationship between the two generator costs is shown in
Figure 5. The modified generator cost is typically much more effective than
the minimax formulation, and is what’s nearly always used in practice.
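To see the difference numerically, here is a small sketch of the two generator costs for a single sample, following the standard formulation (Goodfellow et al., 2014) that matches the description above; D(G(z)) is treated as just a number between 0 and 1.

import numpy as np

def minimax_generator_cost(d_gz):
    # Generator term of the minimax objective: log(1 - D(G(z))).
    # Nearly flat (close to 0) when D(G(z)) is close to 0, so the gradient vanishes.
    return np.log(1.0 - d_gz)

def modified_generator_cost(d_gz):
    # Modified (non-saturating) cost: -log D(G(z)).
    # Blows up as D(G(z)) -> 0, giving a strong gradient exactly when the
    # generator's samples are confidently recognized as fake.
    return -np.log(d_gz)

for d in [0.001, 0.01, 0.1, 0.5]:
    print(d, minimax_generator_cost(d), modified_generator_cost(d))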
Figure 5: Comparison of the minimax generator cost to the modified one.
https://github.com/junyanz/CycleGAN
Figure 6: The CycleGAN architecture.
Lecture 20: Reversible and Autoregressive Models
Roger Grosse
In this lecture, we’ll cover two kinds of deep generative model architec-
tures for which we can measure the likelihood, and hence can train them
using maximum likelihood. The first kind is reversible architectures, where
the network’s computations can be inverted in order to recover the input
which maps to a given output. We’ll see that this makes the likelihood
computation tractable.
The second kind of architecture is autoregressive models. This isn’t new:
we’ve already covered neural language models and RNN language models,
both of which are examples of autoregressive models. In this lecture, we’ll
introduce two tricks for making them much more scalable, so that we can
apply them to high-dimensional data modalities like high-resolution images
and audio waveforms.
1 Reversible Models
Recall the GAN generator architecture from last lecture: we would first
sample a code vector from a fixed, simple distribution such as uniform or
spherical Gaussian. The generator (which is a deterministic feed-forward
network) maps the code vector to the observation space. Hopefully, the dis-
tribution of the network’s outputs should approximate the data distribution.
We noted that this was an implicit generative model, since it’s intractable
to determine the density p(x) for any observation x. But if we modify the
generator architecture to be reversible, then it’s possible to compute the
density.
Mathematically, this is based on the change-of-variables formula for
probability density functions. Suppose we have a bijective, differentiable
mapping f : Z → X . (“Bijective” means the mapping must be 1–1 and
cover all of X .) Since f is bijective, we can think of it as representing a
change-of-variables transformation. For instance, x = f (z) = 12z could
represent a conversion of units from feet to inches. If we have a density
p_Z(z), the change-of-variables formula gives us the density p_X(x):

p_X(x) = p_Z(z) | det(∂x/∂z) |^(−1),                (1)
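For the feet-to-inches example above (x = f(z) = 12z with z drawn from a standard Gaussian), the formula is easy to check numerically; this small sketch compares it against the known answer.

from scipy.stats import norm

# Suppose z ~ N(0, 1) and x = f(z) = 12 z.  Then dx/dz = 12, so Eqn. 1 gives
# p_X(x) = p_Z(x / 12) * |12|^{-1}.
def p_x(x):
    return norm.pdf(x / 12.0) / 12.0

# Sanity check: x = 12 z is exactly N(0, 12^2), so the two densities should agree.
x = 7.3
print(p_x(x), norm.pdf(x, loc=0.0, scale=12.0))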
y = x + F(x), (2)
y1 = x1 + F(x2)
y2 = x2                (3)

x2 = y2
x1 = y1 − F(x2)                (4)

Here’s what happens when we compose two residual blocks, with the
roles of x1 and x2 swapped:

y1 = x1 + F(x2)
y2 = x2 + G(y1)                (5)
The Jacobian of a single block (Eqn. 3) is

∂y/∂x = [ I   ∂F/∂x2 ]
        [ 0   I      ]

This is an upper triangular matrix. Think back to linear algebra class:
the determinant of an upper triangular matrix is simply the product of
the diagonal entries. In this case, the diagonal entries are all 1’s, so the
determinant is 1. How convenient! Since the determinant is 1, the mapping
is volume preserving, i.e. it maps any given set to another set of the same
volume. In our context, this just means the determinant term disappears
from the change-of-variables formula (Eqn. 1).
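Here is a minimal NumPy sketch of a single reversible block (Eqns. 3–4): the forward pass adds F(x2) to x1 and passes x2 through unchanged, and the inverse recovers the input exactly by subtracting the same quantity. The particular F used here (a tiny one-hidden-layer MLP with random weights) is just an illustrative stand-in; F never needs to be inverted.

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(8, 16))

def F(x2):
    # The residual function: any function of x2 works, since we never invert F itself.
    return W2 @ np.maximum(0.0, W1 @ x2)

def forward(x1, x2):
    # Eqn. 3: y1 = x1 + F(x2), y2 = x2.  The Jacobian is upper triangular with
    # ones on the diagonal, so the mapping has determinant 1 (volume preserving).
    return x1 + F(x2), x2

def inverse(y1, y2):
    # Eqn. 4: recover x2 first, then subtract F(x2) from y1.
    x2 = y2
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.normal(size=8), rng.normal(size=8)
y1, y2 = forward(x1, x2)
x1_rec, x2_rec = inverse(y1, y2)
assert np.allclose(x1, x1_rec) and np.allclose(x2, x2_rec)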
All this analysis so far was for a single reversible block. What if we build
a reversible network by chaining together lots of reversible blocks, f = f_k ∘ ⋯ ∘ f_1?
Fortunately, inversion of the whole network is still easy, since we just invert
each block from top to bottom. Mathematically,

f^(−1) = f_1^(−1) ∘ ⋯ ∘ f_k^(−1).

For the determinant, we can apply the chain rule for derivatives, followed
by the product rule for determinants: the determinant of the whole network’s
Jacobian is the product of the determinants of the individual blocks, and since
each of those is 1, the whole network is volume preserving as well. (We can
now see how to train reversible architectures with maximum likelihood.) The
change-of-variables formula gives us:

p_X(x) = p_Z(z) | det(∂x/∂z) |^(−1)
       = p_Z(z).                (9)
Hence, the maximum likelihood objective over the whole dataset is:
Π_{i=1}^N p_X(x^(i)) = Π_{i=1}^N p_Z(f^(−1)(x^(i)))                (10)
Figure 2: Examples of sequence modeling tasks with very long contexts.
Left: Modeling images as sequences using raster scan order. Right: Mod-
eling an audio waveform (e.g. speech signal).
Audio waveforms are typically sampled at 16,000 samples per second, which means
that to predict the next term in the sequence, if we want to account for
even 1 second of context, this requires a context of length 16,000.
One way to account for such a long context is to use an RNN, which
(through its hidden units) accounts for the entire sequence that was gener-
ated so far. The problem is that computing the hidden units for each time
step depends on the hidden units from the previous time step, so the for-
ward pass of backprop requires a for-loop over time steps. (The backward
pass requires a for-loop as well.) With thousands of time steps, this can
get very expensive. But think about the neural language model architecture
from Lecture 7. At training time, the predictions at each time step are done
independently of each other, so all the time steps can be processed simulta-
neously with vectorized computations. This implies that training with very
long sequences could be done much more efficiently if we could somehow
get rid of the recurrent connections.
Causal convolution is an elegant solution to this problem. Observe
that, in order to apply the chain rule for conditional probability (Eqn. 11),
it’s important that information never leak backwards in time, i.e. that each
prediction be made only using observations from earlier in the sequence.
A model with this property is called causal. We can design convolutional
neural nets (CNNs) to have a causal structure by masking their connections,
i.e. constraining certain of their weights to be zero, as shown in Figure 3. At
training time, the predictions can be computed for the entire sequence with
a single forward pass through the CNN. Causal convolution is a particularly
elegant architecture in that it allows computations to be shared between the
predictions for different time steps, e.g. a given unit in the first layer will
affect the predictions at multiple different time steps.
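A small sketch of causal convolution on a 1-D sequence: each output depends only on the current and earlier inputs, which is exactly the masking constraint described above. The kernel size and left-padding scheme here are illustrative choices rather than any particular paper's architecture.

import numpy as np

def causal_conv1d(x, w):
    """1-D causal convolution: output[t] depends only on x[t], x[t-1], ..., x[t-k+1].

    x: input sequence, shape (T,)
    w: filter weights, shape (k,); w[0] multiplies the current time step
    """
    k = len(w)
    x_padded = np.concatenate([np.zeros(k - 1), x])   # pad on the left only
    return np.array([w @ x_padded[t:t + k][::-1] for t in range(len(x))])

# Causality check: changing a later input never changes earlier outputs.
x = np.arange(6, dtype=float)
w = np.array([0.5, 0.3, 0.2])
y_before = causal_conv1d(x, w)
x_changed = x.copy()
x_changed[5] += 100.0
y_after = causal_conv1d(x_changed, w)
assert np.allclose(y_before[:5], y_after[:5])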
It’s interesting to contrast a causal convolution architecture with an
RNN. We could turn the causal CNN into an RNN by adding recurrent
connections between the hidden units. This would have the advantage that,
because of its memory, the model could use information from all previous
time steps to make its predictions. But training would be very slow, since
it would require a for-loop over time steps. A very influential recent pa-
per2 showed that both strategies are actually highly effective for modeling
images. Take a moment to look at the examples in that paper.
2 van den Oord et al., 2016, “Pixel recurrent neural networks”. https://arxiv.org/abs/1601.06759
Figure 3: Top: a causal CNN applied to sequential data (such as an audio
waveform). Source: van den Oord et al., 2016, “WaveNet: a generative
model for raw audio”. Bottom: applying causal convolution to model-
ing images. Source: van den Oord et al., 2016, “Pixel recurrent neural
networks”.
Figure 4: The dilated convolution architecture used in WaveNet. Source:
van den Oord et al., 2016, “WaveNet: a generative model for raw audio”.
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
3 van den Oord et al., 2016, “WaveNet: a generative model for raw audio”. https://arxiv.org/abs/1609.03499