L10 Image Classification
Roger Grosse
1 Introduction
Vision feels so easy, since we do it all day long without thinking about it. But
think about just how hard the problem is, and how amazing it is that we can
see. A grayscale image is just a two dimensional array of intensity values, Even talking about “images”
and somehow we can recover from that a three-dimensional understanding of masks a lot of complexity; the
human retina has to deal with 11
a scene, including the types of objects and their locations, which particular orders of magnitude in intensity
people are present, what materials things are made of, and so on. In order variation and uses fancy optics
to see, we have to deal with all sorts of “nuisance” factors, such as change that let us recover detailed
in pose or lighting. It’s amazing that the human visual system does this all information in the fovea of our
visual field, for a variety of
so seamlessly that we don’t even have to think about it. wavelengths of light.
There is a large and active field of research called computer vision which
tries to get machines to see. The field has made rapid progress in the
past decade, largely because of increasing sophistication of machine learn-
ing techniques and the availability of large image collections. They’ve for-
mulated hundreds of interesting visual “tasks” which encapsulate some of
the hidden complexity we deal with on a daily basis, such as estimating the
calorie content of a plate of food or predicting whether a structure is likely
to fall down. But there’s one task which has received an especially large
amount of attention for the past 30 years and which has driven a lot of the
progress in the field: object recognition, the task of classifying an image
into a set of object categories.
Object recognition is also a useful example for looking at how conv nets
have changed over the years, since they were a state-of-the-art tool in the
early days, and in the last five years, they have re-emerged as the state-of-
the-art tool for object recognition as well as dozens of other vision tasks.
When conv nets took over the field of computer vision, object recognition
was the first domino to fall. Computers have gotten dramatically faster
during this time, and the networks have gotten correspondingly bigger and
more powerful, but they’re still based on more or less the same design
principles. This lecture will talk about some of those design principles.
we get them? Do we preprocess them in some way to make life easier for
the algorithm? We’ll look at just a few examples of particularly influential
datasets, but we’ll ignore dozens more, which each have their virtues and
drawbacks.
as 2012, Geoff Hinton and collaborators introduced dropout (a regulariza-
tion method discussed in Lecture 9) on MNIST; this turned out to work
well on a lot of other problems, and has become one of the standard tools
in the neural net toolbox.
• Where do the images come from? They used Google Image Search
to find candidate images, and then filtered by hand which images
actually represented the object category.
distribution, this kind of overfitting can be eliminated if one builds a large
enough training set. Dataset bias is different — it consists of systematic
biases in a dataset resulting from the way in which the data was collected.
These regularities occur in both the training and the test sets, so algorithms
which exploit them appear to generalize well on the test set. However, if
those regularities aren’t present in the situation where one actually wants
to use the classifier (e.g. a robot trying to identify objects), the system
will perform very poorly in practice. (If an image classifier only recognizes
minarets by exploiting rotation artifacts, it’s unlikely to perform very well
in the real world.)
If dataset bias is strong enough, it encourages the troubling practice of
dataset hacking, whereby researchers engineer their learning algorithms to
be able to exploit the dataset biases in order to make their results seem more
impressive. In the case of Caltech101, the dataset biases were strong enough
that dataset hacking became essentially the only way to compete. After
about 5 years, Caltech101 basically stopped being used for computer vision
research. Dozens of other object recognition datasets were created, all using
different methodology intended to attenuate dataset bias; see this paper4 for
an interesting discussion. Despite a lot of clever attempts, creating a fully realistic dataset is an elusive goal, and dataset bias will probably always exist to some degree. [Margin note: An interesting tidbit: both human researchers and learning algorithms are able to determine with surprisingly high accuracy which object recognition dataset a given image was drawn from.]
2.3 ImageNet
In 2009, taking into account lessons learned from Caltech101 and other com-
puter vision datasets, researchers built ImageNet, a massive object recogni-
tion database consisting of millions of full-resolution images and thousands
of object categories. Based on this dataset, the ImageNet Large Scale Vi-
sual Recognition Challenge (ILSVRC) became one of the most important
computer vision benchmarks. Here’s how they approached the same ques-
tions:
• How many images? The aim was to come up with hundreds of labeled
images for each synset. The ILSVRC categories all have hundreds of
associated training examples, for a total of 1.2 million images.
and then humans manually labeled them. Labeling millions of im-
ages is obviously challenging, so they paid Amazon Mechanical Turk
workers to annotate images. Since some of the categories were highly
specific or unusual, they had to provide the annotators with additional
information (e.g. Wikipedia articles) to help them, and carefully vali-
dated the process by measuring inter-annotator agreement.
3 LeNet
Let’s look at a particular conv net architecture: LeNet, which was used
to classify MNIST digits in 1998. The inputs are grayscale images of size
32 × 32. One detail I’ve skipped over so far is the sizes of the outputs of
convolution layers. LeNet uses valid convolutions, where the values are
computed for only those locations whose filters lie entirely within the input.
Therefore, if the input is 32 × 32 and the filters are 5 × 5, the outputs will be
28 × 28. (The main alternative is same convolution, where the output is
the same size as the input, and the input image is padded with zeros in all
directions.) The LeNet architecture is shown in Figure 1 and summarized
in Table 1.
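To make the size bookkeeping concrete, here is a minimal sketch of the output-size rule for stride-1 convolutions, checked against SciPy's convolve2d; the helper function is my own, not something from the lecture.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_output_size(input_size, filter_size, mode="valid"):
    """Spatial output size of a stride-1 2-D convolution."""
    if mode == "valid":
        # The filter must lie entirely within the input.
        return input_size - filter_size + 1
    elif mode == "same":
        # The input is zero-padded so the output matches the input size.
        return input_size
    else:
        raise ValueError("unknown mode: %s" % mode)

# LeNet's first layer: 32 x 32 input, 5 x 5 filters, valid convolution.
print(conv_output_size(32, 5, "valid"))   # 28

# Sanity check with an actual convolution.
image = np.random.randn(32, 32)
kernel = np.random.randn(5, 5)
print(convolve2d(image, kernel, mode="valid").shape)  # (28, 28)
print(convolve2d(image, kernel, mode="same").shape)   # (32, 32)
```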
• Convolution layer C1. This layer has 6 feature maps and filters of size
5 × 5. It has 28 × 28 × 6 = 4704 units, 28 × 28 × 5 × 5 × 6 = 117, 600
connections, and 5 × 5 × 6 = 150 weights and 6 biases, for a total of
156 trainable parameters.
Figure 1: The architecture of LeNet5.
• Fully connected layer F5. This layer has 120 units with a full set of
connections to layer S4. Since S4 has 5 × 5 × 16 = 400 units, this layer
has 400 × 120 = 48, 000 connections, and hence the same number of
weights.
• Fully connected layer F6. This layer has 84 units, fully connected to
F5. Therefore, it has 84 × 120 = 10, 080 connections and the same
number of weights.
• Output layer. The original network used something called radial basis
functions, but for simplicity we’ll pretend it’s just a linear function,
followed by a softmax over 10 categories. It has 84 × 10 = 840 con-
nections and weights.
Layer    Type             # units   # connections   # weights
C1       convolution      4704      117,600         150
S2       subsampling      1176      4704            0
C3       convolution      1600      240,000         2400
S4       subsampling      400       1600            0
F5       fully connected  120       48,000          48,000
F6       fully connected  84        10,080          10,080
output   fully connected  10        840             840

Table 1: Sizes of the layers of LeNet.
LeNet was carefully designed to push the limits of all of these resource constraints using the computing power of 1998. As we’ll see, conv nets have grown substantially larger in order to exploit modern computing resources. [Margin note: Try increasing the sizes of various layers and checking that you’re substantially increasing the usage of one or more of these resources.] [Footnote: This isn’t quite true, actually. There are tricks for storing activations for only a subset of the layers, and recomputing the rest of the activations as needed. Indeed, frameworks like TensorFlow implement this behind the scenes. However, a larger number of units generally implies a higher memory footprint.]
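Following the margin note's suggestion, here is a rough sketch that recomputes the entries of Table 1 from the layer sizes. It uses the same counting conventions as the bullet points above (valid convolutions, biases ignored in the weight counts, and C3 counted as if fully connected to all six S2 maps); the helper functions are mine, not from the lecture.

```python
def conv(in_size, in_maps, k, out_maps):
    """Valid convolution, stride 1; weight counts ignore biases, as in Table 1."""
    out_size = in_size - k + 1
    units = out_size * out_size * out_maps
    connections = units * k * k * in_maps
    weights = k * k * in_maps * out_maps
    return out_size, units, connections, weights

def pool(in_size, maps, k=2):
    """k x k subsampling; no trainable weights are counted."""
    out_size = in_size // k
    units = out_size * out_size * maps
    connections = units * k * k
    return out_size, units, connections, 0

def fc(n_in, n_out):
    """Fully connected layer: one weight per connection."""
    return n_out, n_in * n_out, n_in * n_out

size = 32
size, u, c, w = conv(size, 1, 5, 6);  print("C1", u, c, w)    # 4704 117600 150
size, u, c, w = pool(size, 6);        print("S2", u, c, w)    # 1176 4704 0
size, u, c, w = conv(size, 6, 5, 16); print("C3", u, c, w)    # 1600 240000 2400
size, u, c, w = pool(size, 16);       print("S4", u, c, w)    # 400 1600 0
u, c, w = fc(size * size * 16, 120);  print("F5", u, c, w)    # 120 48000 48000
u, c, w = fc(120, 84);                print("F6", u, c, w)    # 84 10080 10080
u, c, w = fc(84, 10);                 print("out", u, c, w)   # 10 840 840
```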
4 Modern conv nets
As mentioned above, AlexNet was the conv net architecture which started
a revolution in computer vision by smashing the ILSVRC benchmark. This
architecture is shown in Figure 2. Like LeNet, it consists mostly of convolution, pooling, and fully connected layers. It additionally has some “response normalization” layers, which I won’t talk about because they’re not believed to make a big difference, and have mostly stopped being used.

Figure 2: The AlexNet architecture from 2012.

                      LeNet (1989)   LeNet (1998)   AlexNet (2012)
classification task   digits         digits         objects
dataset               USPS           MNIST          ImageNet
# categories          10             10             1,000
image size            16 × 16        28 × 28        256 × 256 × 3
training examples     7,291          60,000         1.2 million
units                 1,256          8,084          658,000
parameters            9,760          60,000         60 million
connections           65,000         344,000        652 million
total operations      11 billion     412 billion    200 quadrillion (est.)

Table 2: Comparison of conv net classification architectures.

By most measures, AlexNet is 100 to 1000 times bigger than LeNet, as shown in Table 2. But qualitatively, the structure is very similar to LeNet: it consists of alternating convolution and pooling layers, followed by fully connected layers. Furthermore, like LeNet, most of the units and connections are in the convolution layers, and most of the weights are in the fully connected layers.

Computers have improved a lot since LeNet, but the hardware advance that suddenly made it practical to train large neural nets was graphics processing units (GPUs). GPUs are a kind of processor geared towards highly parallel processing involving relatively simple operations. One of the things they especially excel at is matrix multiplication. Since most of the running time for a neural net consists of matrix multiplication (even convolutions are implemented as matrix products beneath the hood), GPUs gave roughly a 30-fold speedup in practice for training neural nets.
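The “convolutions as matrix products” point is usually realized with an im2col construction: every receptive field is unrolled into a row of a matrix, so that a single large matrix multiplication evaluates all filters at all locations. Here is a minimal NumPy sketch; the helper names are my own, and it computes the cross-correlation that neural nets conventionally call “convolution”.

```python
import numpy as np

def im2col(images, k):
    """Unroll all k x k receptive fields (valid positions) into rows.

    images: (N, C, H, W) batch; returns (N * out_H * out_W, C * k * k).
    """
    N, C, H, W = images.shape
    out_H, out_W = H - k + 1, W - k + 1
    cols = np.empty((N, out_H, out_W, C * k * k))
    for i in range(out_H):
        for j in range(out_W):
            patch = images[:, :, i:i + k, j:j + k]   # (N, C, k, k)
            cols[:, i, j, :] = patch.reshape(N, -1)
    return cols.reshape(N * out_H * out_W, C * k * k)

def conv_as_matmul(images, filters):
    """Valid 'convolution' (cross-correlation) via one big matrix product.

    filters: (F, C, k, k); returns (N, F, out_H, out_W).
    """
    N, C, H, W = images.shape
    F, _, k, _ = filters.shape
    out_H, out_W = H - k + 1, W - k + 1
    cols = im2col(images, k)            # (N * out_H * out_W, C * k * k)
    W_mat = filters.reshape(F, -1).T    # (C * k * k, F)
    out = cols @ W_mat                  # this matmul is where the time goes
    return out.reshape(N, out_H, out_W, F).transpose(0, 3, 1, 2)

# Tiny example (shapes scaled way down from AlexNet).
x = np.random.randn(2, 3, 8, 8)
f = np.random.randn(4, 3, 3, 3)
print(conv_as_matmul(x, f).shape)   # (2, 4, 6, 6)
```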
AlexNet has set the agenda for object recognition research ever since. In
2013, the ILSVRC winner was based on tweaks to AlexNet. In 2014, the
second place entry was VGGNet, another conv net based on more or less
similar principles.
The winning entry for 2014, GoogLeNet, or Inception, deserves men-
tion. As the name suggests, it was designed by researchers at Google. The
architecture is shown in Figure 3. Clearly things have gotten more com-
plicated since the days of LeNet. But the main point of interest is that
they went out of their way to reduce the number of trainable parameters
(weights) from AlexNet’s 60 million, to about 2 million. Why? Partly it was
to reduce overfitting — amazingly, it’s possible to overfit a million images
if you have a big enough network like AlexNet.
The other reason has to do with saving memory at “test time”, i.e. when
the network is being used. Traditionally, networks would be both trained
and run on a single PC, so there wasn’t much reason to draw a distinc-
tion between training and test time. But at Google, the training could be
distributed over lots of machines in a datacenter. (The activations and pa-
rameters could even be divided up between multiple machines, increasing
the amount of available memory at training time.) But the network was also
supposed to be runnable on an Android cell phone, so that images wouldn’t
have to be sent to Google’s servers for classification. On a cell phone, it
would have been extravagant to spend 240MB to store AlexNet’s 60 million
parameters, so it was really important to cut down on parameters to make
it fit in memory.
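The 240MB figure is just the 60 million parameters stored as 32-bit floats; a tiny sketch makes the arithmetic explicit (the 2 million figure for GoogLeNet is the one quoted above).

```python
def param_memory_mb(num_params, bytes_per_param=4):
    """Memory to store the weights, assuming 32-bit floats."""
    return num_params * bytes_per_param / 1e6

print(param_memory_mb(60e6))  # AlexNet: roughly 240 MB
print(param_memory_mb(2e6))   # GoogLeNet (figure quoted above): roughly 8 MB
```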
They achieved this in two ways. First, they eliminated the fully con-
nected layers, which we already saw contain most of the parameters in LeNet
and AlexNet. GoogLeNet is convolutions all the way. It also avoids having
large convolutions by breaking them down into a sequence of convolutions
involving smaller filters. (Two 3 × 3 filters have fewer parameters than a
5 × 5 filter, even though they cover a similar radius of the image.) They call this layer-within-a-layer architecture “Inception”, after the movie about dreams-within-dreams. [Margin note: This is analogous to how linear bottleneck layers can reduce the number of parameters.]
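To quantify the savings from factoring a large filter into smaller ones, here is a quick sketch comparing one 5 × 5 convolution with a stack of two 3 × 3 convolutions over the same number of input and output channels; the channel count is an arbitrary illustrative choice, and biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k convolution layer, ignoring biases."""
    return k * k * c_in * c_out

C = 192  # illustrative channel count; the ratio is the same for any C
one_5x5 = conv_params(5, C, C)                          # 25 * C^2
two_3x3 = conv_params(3, C, C) + conv_params(3, C, C)   # 18 * C^2
print(one_5x5, two_3x3, two_3x3 / one_5x5)              # ratio is 0.72

# Both arrangements see a 5 x 5 region of the input: two stacked 3 x 3
# (valid) convolutions shrink a 5 x 5 patch down to a single output unit.
```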
Performance on ImageNet improved astonishingly fast during the years the competition was run. Here are the figures: [Margin note: We’ll put off the last item, deep residual nets (ResNets), until a later lecture since they depend on some ideas that we won’t cover until we talk about RNNs.]

Year   Model                              Top-5 error
2010   Hand-designed descriptors + SVM    28.2%
2011   Compressed Fisher Vectors + SVM    25.8%
2012   AlexNet                            16.4%
2013   a variant of AlexNet               11.7%
2014   GoogLeNet                          6.6%
2015   deep residual nets                 4.5%
It’s really unusual for error rates to drop by a factor of 6 over a period
of 5 years, especially on a task like object recognition that hundreds of
researchers had already worked hard on and where performance had seemed
to plateau.
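For reference, the top-5 error reported in the table counts an image as correctly classified if the true label appears among the model's five highest-scoring classes. A minimal sketch of the metric (the variable names are mine, not from any benchmark code):

```python
import numpy as np

def top5_error(scores, labels):
    """scores: (N, 1000) class scores; labels: (N,) integer class IDs."""
    top5 = np.argsort(-scores, axis=1)[:, :5]        # 5 highest-scoring classes per image
    correct = (top5 == labels[:, None]).any(axis=1)  # true label among them?
    return 1.0 - correct.mean()

# Toy example with random scores over 1,000 ImageNet classes.
rng = np.random.default_rng(0)
scores = rng.standard_normal((100, 1000))
labels = rng.integers(0, 1000, size=100)
print(top5_error(scores, labels))   # chance level is about 0.995
```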
Figure 3: The Inception architecture from 2014.