
Lecture 10: Image Classification

Roger Grosse

1 Introduction
Vision feels so easy, since we do it all day long without thinking about it. But
think about just how hard the problem is, and how amazing it is that we can
see. A grayscale image is just a two dimensional array of intensity values,
and somehow we can recover from that a three-dimensional understanding of
a scene, including the types of objects and their locations, which particular
people are present, what materials things are made of, and so on. In order
to see, we have to deal with all sorts of “nuisance” factors, such as change
in pose or lighting. It’s amazing that the human visual system does this all
so seamlessly that we don’t even have to think about it.

[Margin note: Even talking about “images” masks a lot of complexity; the human retina has to deal with 11 orders of magnitude in intensity variation and uses fancy optics that let us recover detailed information in the fovea of our visual field, for a variety of wavelengths of light.]
There is a large and active field of research called computer vision which
tries to get machines to see. The field has made rapid progress in the
past decade, largely because of increasing sophistication of machine learn-
ing techniques and the availability of large image collections. Researchers
have formulated hundreds of interesting visual “tasks” which encapsulate some of
the hidden complexity we deal with on a daily basis, such as estimating the
calorie content of a plate of food or predicting whether a structure is likely
to fall down. But there’s one task which has received an especially large
amount of attention for the past 30 years and which has driven a lot of the
progress in the field: object recognition, the task of classifying an image
into a set of object categories.
Object recognition is also a useful example for looking at how conv nets
have changed over the years, since they were a state-of-the-art tool in the
early days, and in the last five years, they have re-emerged as the state-of-
the-art tool for object recognition as well as dozens of other vision tasks.
When conv nets took over the field of computer vision, object recognition
was the first domino to fall. Computers have gotten dramatically faster
during this time, and the networks have gotten correspondingly bigger and
more powerful, but they’re still based on more or less the same design
principles. This lecture will talk about some of those design principles.

2 Object recognition datasets


Recall that object recognition is a kind of supervised learning problem,
which means there’s a particular behavior we would like our system to
achieve (labeling an image with the correct category), and we need to pro-
vide the system with labeled examples of the correct behavior. This means
we need to come up with a dataset, a set of images with their corresponding
labels. This raises questions such as: how do we choose the set of categories?
What sorts of images do we allow, how many do we need, and where do

we get them? Do we preprocess them in some way to make life easier for
the algorithm? We’ll look at just a few examples of particularly influential
datasets, but we’ll ignore dozens more, which each have their virtues and
drawbacks.

2.1 USPS and MNIST


Before machine learning algorithms were good enough to recognize objects
in images, researchers’ attention focused on a simpler image classification
problem: handwritten digit recognition. In the 1980s, the US Postal
Service was interested in automatically reading zip codes on envelopes. This
task is a bit harder than handwritten digit recognition, since one also has to
identify the locations and orientations of the individual digits, but clearly
digit recognition would be a useful step towards solving the problem. They
collected a dataset of images of handwritten digits (now called the USPS
Dataset) by hand-segmenting individual digits from handwritten zip codes.
To make things easier for the algorithm, the digits were normalized to be
a consistent size and orientation. Despite this normalization, the dataset
still included a lot of sources of variability: digits were written in a variety
of writing styles and using different kinds of writing instruments. Many of
the digits are ambiguous, even to humans.
Classifying USPS digits became the first practical use of conv nets: in
1989, a group of researchers at Bell Labs introduced a conv net architecture,
which involved several convolution and subsampling layers, followed by a
fully connected layer. This network was able to classify the digits with
91.9% accuracy.
Almost a decade later, researchers created a slightly larger handwrit-
ten digit dataset. They made some modifications to a dataset produced
by the National Institute of Standards and Technology, so the dataset was
called Modified NIST, or MNIST. Similar to the USPS Dataset, MNIST
images were normalized by centering the digits within the image and nor-
malizing them to a standard size. The main difference is that the dataset
is larger: there were 70,000 examples, of which 60,000 are used for training
and 10,000 are used for testing. Yann LeCun and colleagues introduced a
larger conv net architecture called LeNet which was able to classify images
with 98.9% accuracy, and used this network in the context of a larger sys-
tem for automatically reading the numbers on checks. (Because LeNet was
trained on segmented and normalized digit images, this system had to solve
the problems of automatic segmentation and normalization, among other
things.) This was the first automatic check reading system that was accu-
rate enough to be practically useful. This was one of the big success stories
of AI in the 1990s — and interestingly, it happened during the “neural net
winter”, showing that good ideas can still work even when they fall out of
fashion.[1]

[Margin note: LeCun is now one of the leading researchers in the field, and directs Facebook AI Research.]
Apart from its initial practical uses, MNIST has served as one of the
most widely used machine learning benchmarks for two decades. Even
though the test errors have long been low enough to be practically meaning-
less, MNIST has driven a lot of progress in neural net research. As recently
as 2012, Geoff Hinton and collaborators introduced dropout (a regularization
method discussed in Lecture 9) on MNIST; this turned out to work
well on a lot of other problems, and has become one of the standard tools
in the neural net toolbox.

[1] LeCun et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

2.2 Caltech101 and the perils of dataset design


In 2003, researchers at Caltech released the first major object recognition
dataset which was used to train and benchmark object recognition algo-
rithms. Since their dataset included 101 object categories, they called it
Caltech101.[2] Here’s how they approached some of the key questions of
dataset design.

• Which object categories to consider? They chose a set of 101 object
categories by opening a dictionary to random pages and choosing from
the nouns which were associated with images.

• Where do the images come from? They used Google Image Search
to find candidate images, and then filtered by hand which images
actually represented the object category.

• How many images? They didn’t target a particular number of objects
per category, but just collected as many as possible. The numbers of
objects per category were very unbalanced as a result, but in practice,
when the dataset is used for benchmarking, most systems are trained
with a fixed number of images per category (which is usually between
1 and 20).

• How to normalize the images? They normalized the images in a vari-
ety of ways to make things simpler for the learning algorithms. Images
were scaled to be about 300 pixels wide. In order to reduce variabil-
ity in pose, they flipped some of the images so that a given object
was always facing the same direction. More controversially, images of
certain object categories were rotated because the authors’ proposed
method had trouble dealing with vertically oriented objects.

For about 5 years, Caltech101 was widely used as a benchmark dataset
for object recognition, and academic papers showed rapid improvements in
classification accuracy. Unfortunately, the dataset had a number of
idiosyncrasies, known as dataset biases. E.g., for some reason, objects always
appeared at a consistent location within an image, with the result that if
one averages the raw pixel values, the average image still resembles the
object category.[3] Also, as mentioned above, images of certain categories were
rotated, leading to distinctive rotation artifacts.

[2] https://www.vision.caltech.edu/Image_Datasets/Caltech101/
[3] https://www.vision.caltech.edu/Image_Datasets/Caltech101/averages100objects.jpg

Dataset bias results in a kind of overfitting which is different from what
we’ve talked about so far. In our lecture on generalization, we observed
that a training set might happen to have certain accidental regularities
which don’t occur in the test set; algorithms can overfit if they exploit
these regularities. If the training and test images are drawn from the same
distribution, this kind of overfitting can be eliminated if one builds a large
enough training set. Dataset bias is different — it consists of systematic
biases in a dataset resulting from the way in which the data was collected.
These regularities occur in both the training and the test sets, so algorithms
which exploit them appear to generalize well on the test set. However, if
those regularities aren’t present in the situation where one actually wants
to use the classifier (e.g. a robot trying to identify objects), the system
will perform very poorly in practice. (If an image classifier only recognizes
minarets by exploiting rotation artifacts, it’s unlikely to perform very well
in the real world.)
If dataset bias is strong enough, it encourages the troubling practice of
dataset hacking, whereby researchers engineer their learning algorithms to
be able to exploit the dataset biases in order to make their results seem more
impressive. In the case of Caltech101, the dataset biases were strong enough
that dataset hacking became essentially the only way to compete. After
about 5 years, Caltech101 basically stopped being used for computer vision
research. Dozens of other object recognition datasets were created, all using
different methodology intended to attenuate dataset bias; see this paper[4] for
an interesting discussion. Despite a lot of clever attempts, creating a fully
realistic dataset is an elusive goal, and dataset bias will probably always
exist to some degree.

[Margin note: An interesting tidbit: both human researchers and learning algorithms are able to determine with surprisingly high accuracy which object recognition dataset a given image was drawn from.]

[4] A. Torralba and A. Efros. An unbiased look at dataset bias. Computer Vision and Pattern Recognition (CVPR), 2011.

2.3 ImageNet

In 2009, taking into account lessons learned from Caltech101 and other com-
puter vision datasets, researchers built ImageNet, a massive object recogni-
tion database consisting of millions of full-resolution images and thousands
of object categories. Based on this dataset, the ImageNet Large Scale Vi-
sual Recognition Challenge (ILSVRC) became one of the most important
computer vision benchmarks. Here’s how they approached the same ques-
tions:

• Which object categories to consider? ImageNet was meant to be very
comprehensive. The categories were taken from WordNet, a lexical
database for English constructed by cognitive scientists at Princeton.
WordNet consists of a hierarchy of “synsets”, or sets of synonyms
which all denote the same concept. ImageNet was intended to include
as many synsets as possible; as of 2010, it included almost 22,000
synsets, out of 80,000 noun synsets in WordNet. The categories are
very specific, including hundreds of different types of dogs. Out of
these categories, 1000 were chosen for the ILSVRC.

• How many images? The aim was to come up with hundreds of labeled
images for each synset. The ILSVRC categories all have hundreds of
associated training examples, for a total of 1.2 million images.

• Where do the images come from? Similarly to Caltech101, candidate
images were taken from the results of various image search engines,
and then humans manually labeled them. Labeling millions of im-
ages is obviously challenging, so they paid Amazon Mechanical Turk
workers to annotate images. Since some of the categories were highly
specific or unusual, they had to provide the annotators with additional
information (e.g. Wikipedia articles) to help them, and carefully vali-
dated the process by measuring inter-annotator agreement.

• How are the images normalized? In contrast to Caltech101, the im-
ages in the dataset itself are not normalized. (However, the object
recognition systems themselves might perform some sort of prepro-
cessing.)

Because the object categories are so diverse and fine-grained, and images
can contain multiple objects, there might not be a unique right answer for
every image. Therefore, one normally reports top-5 accuracy, whereby
the algorithm is allowed to make 5 different predictions for each image, and
it gets it right if any of the 5 predictions are the correct category.
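
To make the top-5 rule concrete, here is a minimal NumPy sketch of the metric (the function name, array shapes, and toy data are my own, not from the ILSVRC tooling): a prediction counts as correct if the true label appears among the five highest-scoring classes.

```python
import numpy as np

def top5_accuracy(scores, labels):
    """scores: (N, C) array of class scores; labels: (N,) array of true class indices."""
    # Indices of the 5 highest-scoring classes for each image (order doesn't matter).
    top5 = np.argsort(scores, axis=1)[:, -5:]
    # An image counts as correct if its true label appears anywhere in its top 5.
    hits = (top5 == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy usage: 4 images, 10 classes.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 10))
labels = np.array([3, 1, 7, 2])
print(top5_accuracy(scores, labels))
```
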
ImageNet is an extremely challenging dataset to work with because of
its scale and the diversity of object categories. The first algorithms to be ap-
plied were not neural nets, but in 2012, researchers in Toronto entered this
competition using a neural net called AlexNet (in honor of its lead creator,
Alex Krizhevsky). It achieved a top-5 error of 16.4%, which was substantially
better than the competitors. This result created a big splash, leading com-
puter vision researchers to switch to using neural nets and prompting some
of the world’s largest software companies to start up research labs focused
on deep learning. Since AlexNet, error rates on ImageNet have fallen dra-
matically, hitting 4.5% error in 2015 (the last year the competition was run),
and all of the leading approaches have been based on conv nets. This even
beat human performance, which was measured at 5.1% error (although this
can vary significantly depending on how one measures).

3 LeNet
Let’s look at a particular conv net architecture: LeNet, which was used
to classify MNIST digits in 1998. The inputs are grayscale images of size
32 × 32. One detail I’ve skipped over so far is the sizes of the outputs of
convolution layers. LeNet uses valid convolutions, where the values are
computed for only those locations whose filters lie entirely within the input.
Therefore, if the input is 32 × 32 and the filters are 5 × 5, the outputs will be
28 × 28. (The main alternative is same convolution, where the output is
the same size as the input, and the input image is padded with zeros in all
directions.) The LeNet architecture is shown in Figure 1 and summarized
in Table 1.
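
As a quick check of this output-size arithmetic, here is a small sketch (a hypothetical helper, not part of the original LeNet code) for the stride-1 case; with a stride of s, the same reasoning gives ⌊(input − filter)/s⌋ + 1.

```python
def conv_output_size(input_size, filter_size, mode="valid"):
    """Spatial size of a convolution layer's output, assuming stride 1."""
    if mode == "valid":
        # Values are computed only where the filter lies entirely within the input.
        return input_size - filter_size + 1
    if mode == "same":
        # The input is zero-padded so the output matches the input size.
        return input_size
    raise ValueError(mode)

print(conv_output_size(32, 5, "valid"))  # 28, as in LeNet's first layer
print(conv_output_size(32, 5, "same"))   # 32
```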

• Convolution layer C1. This layer has 6 feature maps and filters of size
5 × 5. It has 28 × 28 × 6 = 4704 units, 28 × 28 × 5 × 5 × 6 = 117,600
connections, and 5 × 5 × 6 = 150 weights and 6 biases, for a total of
156 trainable parameters.

Figure 1: The LeNet architecture from 1998.

• Subsampling layer S2. In LeNet, the “subsampling layers” are essen-
tially pooling layers, where the pooling function is the mean (rather
than max). They use a stride of 2, so the image size is shrunk by a
factor of 2 along each dimension.

• Convolution layer C3. This layer has 16 feature maps of size 10 × 10
and filters of size 5 × 5. Therefore, it has 10 × 10 × 16 = 1600 units.
If all of C3’s feature maps were connected to all of S2’s feature maps, this
layer would have 10 × 10 × 5 × 5 × 6 × 16 = 240,000 connections and
5 × 5 × 6 × 16 = 2400 weights.[5]

• Subsampling layer S4. This is another pooling layer with a stride of
2, so it reduces each dimension by another factor of 2, to 5 × 5.

• Fully connected layer F5. This layer has 120 units with a full set of
connections to layer S4. Since S4 has 5 × 5 × 16 = 400 units, this layer
has 400 × 120 = 48,000 connections, and hence the same number of
weights.

• Fully connected layer F6. This layer has 84 units, fully connected to
F5. Therefore, it has 84 × 120 = 10,080 connections and the same
number of weights.

• Output layer. The original network used something called radial basis
functions, but for simplicity we’ll pretend it’s just a linear function,
followed by a softmax over 10 categories. It has 84 × 10 = 840 con-
nections and weights.
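
The bookkeeping above (and in Table 1 below) is easy to recompute. Here is a short sketch, with my own helper functions rather than anything from the original LeNet code, that reproduces a few of the entries:

```python
def conv_layer_stats(in_maps, out_maps, in_size, filter_size):
    """Units, connections, and weights for a stride-1 valid convolution layer."""
    out_size = in_size - filter_size + 1
    units = out_size * out_size * out_maps
    # Each output unit looks at a filter_size x filter_size window in every input map.
    connections = units * filter_size * filter_size * in_maps
    weights = filter_size * filter_size * in_maps * out_maps
    return units, connections, weights

def fc_layer_stats(in_units, out_units):
    """Units, connections, and weights for a fully connected layer."""
    return out_units, in_units * out_units, in_units * out_units

print(conv_layer_stats(1, 6, 32, 5))   # C1: (4704, 117600, 150)
print(conv_layer_stats(6, 16, 14, 5))  # C3, if fully connected: (1600, 240000, 2400)
print(fc_layer_stats(400, 120))        # F5: (120, 48000, 48000)
print(fc_layer_stats(120, 84))         # F6: (84, 10080, 10080)
```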

[5] Since this would have been a lot of connections by the standards of 1998, they skimped on connections by connecting only a subset of the feature maps. This brought the number of connections down to 156,000.

These calculations are all summarized in Table 1. After sitting through
all this tedium, we can draw a number of useful conclusions:

• Most of the units are in the first convolution layer.

• Most of the connections are in the second convolution layer.

• Most of the weights are in the fully connected layers.

These observations correspond to various resource limitations when design-
ing a network architecture. In particular, if we want to make the network
as big as possible, here are some of the limitations we run into:

Layer Type # units # connections # weights
C1 convolution 4704 117,600 150
S2 subsampling 1176 4704 0
C3 convolution 1600 240,000 2400
S4 subsampling 400 1600 0
F5 fully connected 120 48,000 48,000
F6 fully connected 84 10,080 10,080
output fully connected 10 840 840

Table 1: LeNet architecture, with the sizes of layers.

• Running the network to compute predictions (equivalently, the for-
ward pass of backprop) requires approximately one add-multiply oper-
ation per connection in the network. As observed in a previous lecture,
the backwards pass is about as expensive as two forward passes, so the
total computational cost of backprop is proportional to the number
of connections. This means the convolution layers are generally the
most expensive part of the network in terms of running time.

• Memory is another scarce resource. It’s worth distinguishing two sit-
uations: training time, where we train the network using backprop,
and test time, the somewhat misleading name for the setting where
we use an already-trained network.

– Backprop requires storing all of the activations in memory.[6] Since
the number of activations is the number of units times the mini-
batch size, the number of units determines the memory footprint
of the activations at training time. The activations don’t need
to be stored at test time.
– The weights also need to be stored in memory, both at training
time and test time.

• The weights constitute the vast majority of trainable parameters of
the model (the number of biases generally being far smaller), so if
you’re worried about overfitting, you could consider cutting down the
number of weights.
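
To make the memory arithmetic in the bullets above concrete, here is a rough back-of-the-envelope sketch using the per-layer totals from Table 1; the 32-bit floats and the mini-batch size of 128 are illustrative assumptions, not numbers from the lecture.

```python
BYTES_PER_FLOAT = 4                                    # assuming 32-bit floats
units   = [4704, 1176, 1600, 400, 120, 84, 10]         # per layer, from Table 1
weights = [150, 0, 2400, 0, 48000, 10080, 840]         # per layer, from Table 1
batch_size = 128                                       # made-up mini-batch size

# Backprop stores one activation per unit per example in the mini-batch.
activation_bytes = sum(units) * batch_size * BYTES_PER_FLOAT
# The weights are stored at both training and test time.
weight_bytes = sum(weights) * BYTES_PER_FLOAT

print(activation_bytes / 1e6, "MB of activations (training time only)")
print(weight_bytes / 1e6, "MB of weights (training and test time)")
```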

LeNet was carefully designed to push the limits of all of these resource
constraints using the computing power of 1998. As we’ll see, conv nets have
grown substantially larger in order to exploit modern computing resources.

[Margin note: Try increasing the sizes of various layers and checking that you’re substantially increasing the usage of one or more of these resources.]

4 Modern conv nets
As mentioned above, AlexNet was the conv net architecture which started
a revolution in computer vision by smashing the ILSVRC benchmark.

[6] This isn’t quite true, actually. There are tricks for storing activations for only a subset of the layers, and recomputing the rest of the activations as needed. Indeed, frameworks like TensorFlow implement this behind the scenes. However, a larger number of units generally implies a higher memory footprint.

Figure 2: The AlexNet architecture from 2012.

                      LeNet (1989)    LeNet (1998)    AlexNet (2012)
classification task   digits          digits          objects
dataset               USPS            MNIST           ImageNet
# categories          10              10              1,000
image size            16 × 16         28 × 28         256 × 256 × 3
training examples     7,291           60,000          1.2 million
units                 1,256           8,084           658,000
parameters            9,760           60,000          60 million
connections           65,000          344,000         652 million
total operations      11 billion      412 billion     200 quadrillion (est.)

Table 2: Comparison of conv net classification architectures.

This architecture is shown in Figure 2. Like LeNet, it consists mostly of
convolution, pooling, and fully connected layers. It additionally has some
“response normalization” layers, which I won’t talk about because they’re
not believed to make a big difference, and have mostly stopped being used.
By most measures, AlexNet is 100 to 1000 times bigger than LeNet,
as shown in Table 2. But qualitatively, the structure is very similar to
LeNet: it consists of alternating convolution and pooling layers, followed
by fully connected layers. Furthermore, like LeNet, most of the units and
connections are in the convolution layers, and most of the weights are in
the fully connected layers.
Computers have improved a lot since LeNet, but the hardware advance
that suddenly made it practical to train large neural nets was graphics
processing units (GPUs). GPUs are a kind of processor geared towards
highly parallel processing involving relatively simple operations. One of
the things they especially excel at is matrix multiplication. Since most
of the running time for a neural net consists of matrix multiplication (even
convolutions are implemented as matrix products beneath the hood), GPUs
gave roughly a 30-fold speedup in practice for training neural nets.
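
As an aside on the remark that convolutions are implemented as matrix products, here is a minimal NumPy sketch of the standard “im2col” idea for a single-channel image: every patch the filter visits is copied into a row of a matrix, so the whole convolution (strictly, a cross-correlation) becomes one matrix multiplication. This is an illustration of the trick, not how any particular framework implements it.

```python
import numpy as np

def conv2d_as_matmul(image, filters):
    """Valid cross-correlation of a single-channel image with a bank of k x k filters,
    implemented as one matrix multiplication (the im2col trick)."""
    H, W = image.shape
    F, k, _ = filters.shape                   # F filters, each k x k
    out_h, out_w = H - k + 1, W - k + 1
    # Gather every k x k patch of the image into a row of a big matrix.
    patches = np.stack([
        image[i:i + k, j:j + k].ravel()
        for i in range(out_h) for j in range(out_w)
    ])                                        # shape: (out_h * out_w, k * k)
    weights = filters.reshape(F, k * k)       # shape: (F, k * k)
    out = patches @ weights.T                 # the single big matrix product
    return out.T.reshape(F, out_h, out_w)

img = np.arange(36, dtype=float).reshape(6, 6)
filt = np.ones((2, 3, 3))
print(conv2d_as_matmul(img, filt).shape)      # (2, 4, 4)
```
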
AlexNet has set the agenda for object recognition research ever since. In
2013, the ILSVRC winner was based on tweaks to AlexNet. In 2014, the
second place entry was VGGNet, another conv net based on more or less
similar principles.
The winning entry for 2014, GoogLeNet, or Inception, deserves men-
tion. As the name suggests, it was designed by researchers at Google. The
architecture is shown in Figure 3. Clearly things have gotten more com-
plicated since the days of LeNet. But the main point of interest is that
they went out of their way to reduce the number of trainable parameters
(weights) from AlexNet’s 60 million, to about 2 million. Why? Partly it was
to reduce overfitting — amazingly, it’s possible to overfit a million images
if you have a big enough network like AlexNet.
The other reason has to do with saving memory at “test time”, i.e. when
the network is being used. Traditionally, networks would be both trained
and run on a single PC, so there wasn’t much reason to draw a distinc-
tion between training and test time. But at Google, the training could be
distributed over lots of machines in a datacenter. (The activations and pa-
rameters could even be divided up between multiple machines, increasing
the amount of available memory at training time.) But the network was also
supposed to be runnable on an Android cell phone, so that images wouldn’t
have to be sent to Google’s servers for classification. On a cell phone, it
would have been extravagant to spend 240MB to store AlexNet’s 60 million
parameters, so it was really important to cut down on parameters to make
it fit in memory.
They achieved this in two ways. First, they eliminated the fully con-
nected layers, which we already saw contain most of the parameters in LeNet
and AlexNet. GoogLeNet is convolutions all the way. It also avoids having
large convolutions by breaking them down into a sequence of convolutions
involving smaller filters. (Two 3 × 3 filters have fewer parameters than a
5 × 5 filter, even though they cover a similar radius of the image.) They
call this layer-within-a-layer architecture “Inception”, after the movie about
dreams-within-dreams.

[Margin note: This is analogous to how linear bottleneck layers can reduce the number of parameters.]
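
To see where the savings come from, here is the parameter arithmetic for a stack of two 3 × 3 convolutions versus a single 5 × 5 convolution; the channel count of 64 is made up for illustration, and both options cover a 5 × 5 receptive field.

```python
def conv_params(filter_size, in_channels, out_channels):
    """Weight count for one convolution layer (biases ignored)."""
    return filter_size * filter_size * in_channels * out_channels

channels = 64                                   # illustrative channel count
one_5x5 = conv_params(5, channels, channels)
two_3x3 = 2 * conv_params(3, channels, channels)
print(one_5x5, two_3x3)  # 102400 vs 73728: the stacked 3x3 pair is ~28% smaller
```
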
Performance on ImageNet improved astonishingly fast during the years
the competition was run. Here are the figures:

Year   Model                              Top-5 error
2010   Hand-designed descriptors + SVM    28.2%
2011   Compressed Fisher Vectors + SVM    25.8%
2012   AlexNet                            16.4%
2013   a variant of AlexNet               11.7%
2014   GoogLeNet                          6.6%
2015   deep residual nets                 4.5%

[Margin note: We’ll put off the last item, deep residual nets (ResNets), until a later lecture since they depend on some ideas that we won’t cover until we talk about RNNs.]

It’s really unusual for error rates to drop by a factor of 6 over a period
of 5 years, especially on a task like object recognition that hundreds of
researchers had already worked hard on and where performance had seemed
to plateau.

Figure 3: The Inception architecture from 2014.
