
Building High-level Features Using Large Scale Unsupervised Learning

arXiv:1112.6209v5 [cs.LG] 12 Jul 2012

Quoc V. Le [email protected]
Marc’Aurelio Ranzato [email protected]
Rajat Monga [email protected]
Matthieu Devin [email protected]
Kai Chen [email protected]
Greg S. Corrado [email protected]
Jeff Dean [email protected]
Andrew Y. Ng [email protected]
Abstract

We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 22,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

1. Introduction

The focus of this work is to build high-level, class-specific feature detectors from unlabeled images. For instance, we would like to understand if it is possible to build a face detector from only unlabeled images. This approach is inspired by the neuroscientific conjecture that there exist highly class-specific neurons in the human brain, generally and informally known as "grandmother neurons." The extent of class-specificity of neurons in the brain is an area of active investigation, but current experimental evidence suggests the possibility that some neurons in the temporal cortex are highly selective for object categories such as faces or hands (Desimone et al., 1984), and perhaps even specific people (Quiroga et al., 2005).

Contemporary computer vision methodology typically emphasizes the role of labeled data to obtain these class-specific feature detectors. For example, to build a face detector, one needs a large collection of images labeled as containing faces, often with a bounding box around the face. The need for large labeled sets poses a significant challenge for problems where labeled data are rare. Although approaches that make use of inexpensive unlabeled data are often preferred, they have not been shown to work well for building high-level features.

This work investigates the feasibility of building high-level features from only unlabeled data. A positive answer to this question will give rise to two significant results. Practically, this provides an inexpensive way to develop features from unlabeled data. But perhaps more importantly, it answers an intriguing question as to whether the specificity of the "grandmother neuron" could possibly be learned from unlabeled data. Informally, this would suggest that it is at least in principle possible that a baby learns to group faces into one class because it has seen many of them and not because it is guided by supervision or rewards.
Unsupervised feature learning and deep learning have emerged as methodologies in machine learning for building features from unlabeled data. Using unlabeled data in the wild to learn features is the key idea behind the self-taught learning framework (Raina et al., 2007). Successful feature learning algorithms and their applications can be found in the recent literature, using a variety of approaches such as RBMs (Hinton et al., 2006), autoencoders (Hinton & Salakhutdinov, 2006; Bengio et al., 2007), sparse coding (Lee et al., 2007) and K-means (Coates et al., 2011). So far, most of these algorithms have only succeeded in learning low-level features such as "edge" or "blob" detectors. Going beyond such simple features and capturing complex invariances is the topic of this work.

Recent studies observe that it is quite time intensive to train deep learning algorithms to yield state-of-the-art results (Ciresan et al., 2010). We conjecture that the long training time is partially responsible for the lack of high-level features reported in the literature. For instance, researchers typically reduce the sizes of datasets and models in order to train networks in a practical amount of time, and these reductions undermine the learning of high-level features.

We address this problem by scaling up the core components involved in training deep networks: the dataset, the model, and the computational resources. First, we use a large dataset generated by sampling random frames from random YouTube videos.[1] Our input data are 200x200 images, much larger than the typical 32x32 images used in deep learning and unsupervised feature learning (Krizhevsky, 2009; Ciresan et al., 2010; Le et al., 2010; Coates et al., 2011). Our model, a deep autoencoder with pooling and local contrast normalization, is scaled to these large images by using a large computer cluster. To support parallelism on this cluster, we use the idea of local receptive fields, e.g., (Raina et al., 2009; Le et al., 2010; 2011b). This idea reduces communication costs between machines and thus allows model parallelism (parameters are distributed across machines). Asynchronous SGD is employed to support data parallelism. The model was trained in a distributed fashion on a cluster with 1,000 machines (16,000 cores) for three days.

[1] This is different from the work of (Lee et al., 2009), who trained their model on images from one class.

Experimental results using classification and visualization confirm that it is indeed possible to build high-level features from unlabeled data. In particular, using a hold-out test set consisting of faces and distractors, we discover a feature that is highly selective for faces. This result is also validated by visualization via numerical optimization. Control experiments show that the learned detector is not only invariant to translation but also to out-of-plane rotation and scaling.

Similar experiments reveal that the network also learns the concepts of cat faces and human bodies.

The learned representations are also discriminative. Using the learned features, we obtain significant leaps in object recognition with ImageNet. For instance, on ImageNet with 22,000 categories, we achieved 15.8% accuracy, a relative improvement of 70% over the state-of-the-art. Note that random guessing achieves less than 0.005% accuracy on this dataset.

2. Training set construction

Our training dataset is constructed by sampling frames from 10 million YouTube videos. To avoid duplicates, each video contributes only one image to the dataset. Each example is a color image with 200x200 pixels.

A subset of training images is shown in Appendix A. To check the proportion of faces in the dataset, we ran an OpenCV face detector on 60x60 randomly-sampled patches from the dataset (http://opencv.willowgarage.com/wiki/). This experiment shows that patches detected as faces by the OpenCV face detector account for less than 3% of the 100,000 sampled patches.
3. Algorithm

In this section, we describe the algorithm that we use to learn features from the unlabeled training set.

3.1. Previous work

Our work is inspired by recent successful algorithms in unsupervised feature learning and deep learning (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007; Lee et al., 2007). It is strongly influenced by the work of (Olshausen & Field, 1996) on sparse coding. According to their study, sparse coding can be trained on unlabeled natural images to yield receptive fields akin to V1 simple cells (Hubel & Wiesel, 1959).

One shortcoming of early approaches such as sparse coding (Olshausen & Field, 1996) is that their architectures are shallow and typically capture only low-level concepts (e.g., edge "Gabor" filters) and simple invariances. Addressing this issue is a focus of recent work in deep learning (Hinton et al., 2006; Bengio et al., 2007; Bengio & LeCun, 2007; Lee et al., 2008; 2009), which builds hierarchies of feature representations. In particular, Lee et al. (2008) show that stacked sparse RBMs can model certain simple functions of the V2 area of the cortex. They also demonstrate that convolutional DBNs (Lee et al., 2009), trained on aligned images of faces, can learn a face detector.
This result is interesting, but unfortunately requires a certain degree of supervision during dataset construction: their training images (i.e., Caltech 101 images) are aligned, homogeneous and belong to one selected category.

3.2. Architecture

Our algorithm is built upon these ideas and can be viewed as a sparse deep autoencoder with three important ingredients: local receptive fields, pooling and local contrast normalization. First, to scale the autoencoder to large images, we use a simple idea known as local receptive fields (LeCun et al., 1998; Raina et al., 2009; Lee et al., 2009; Le et al., 2010). This biologically inspired idea proposes that each feature in the autoencoder can connect only to a small region of the lower layer. Next, to achieve invariance to local deformations, we employ local L2 pooling (Hyvärinen et al., 2009; Gregor & LeCun, 2010; Le et al., 2010) and local contrast normalization (Jarrett et al., 2009). L2 pooling, in particular, allows the learning of invariant features (Hyvärinen et al., 2009; Le et al., 2010).

Our deep autoencoder is constructed by replicating three times the same stage composed of local filtering, local pooling and local contrast normalization. The output of one stage is the input to the next one, and the overall model can be interpreted as a nine-layered network (see Figure 1).

Figure 1. The architecture and parameters in one layer of our network. The overall network replicates this structure three times. For simplicity, the images are in 1D.

The first and second sublayers are often known as filtering (or simple) and pooling (or complex) sublayers respectively. The third sublayer performs local subtractive and divisive normalization and is inspired by biological and computational models (Pinto et al., 2008; Lyu & Simoncelli, 2008; Jarrett et al., 2009).[2]

As mentioned above, central to our approach is the use of local connectivity between neurons. In our experiments, the first sublayer has receptive fields of 18x18 pixels and the second sublayer pools over 5x5 overlapping neighborhoods of features (i.e., pooling size). The neurons in the first sublayer connect to pixels in all input channels (or maps) whereas the neurons in the second sublayer connect to pixels of only one channel (or map).[3] While the first sublayer outputs linear filter responses, the pooling layer outputs the square root of the sum of the squares of its inputs, and therefore it is known as L2 pooling.

Our style of stacking a series of uniform modules, switching between selectivity and tolerance layers, is reminiscent of the Neocognitron and HMAX (Fukushima & Miyake, 1982; LeCun et al., 1998; Riesenhuber & Poggio, 1999). It has also been argued to be an architecture employed by the brain (DiCarlo et al., 2012).

Although we use local receptive fields, they are not convolutional: the parameters are not shared across different locations in the image. This is a stark difference between our approach and previous work (LeCun et al., 1998; Jarrett et al., 2009; Lee et al., 2009). In addition to being more biologically plausible, unshared weights allow the learning of invariances other than translational invariances (Le et al., 2010).

In terms of scale, our network is perhaps one of the largest known networks to date. It has 1 billion trainable parameters, which is more than an order of magnitude larger than other large networks reported in the literature, e.g., (Ciresan et al., 2010; Sermanet & LeCun, 2011) with around 10 million parameters. It is worth noting that our network is still tiny compared to the human visual cortex, which is 10^6 times larger in terms of the number of neurons and synapses (Pakkenberg et al., 2003).

[2] The subtractive normalization removes the weighted average of neighboring neurons from the current neuron: $g_{i,j,k} = h_{i,j,k} - \sum_{iuv} G_{uv}\, h_{i,j+u,i+v}$. The divisive normalization computes $y_{i,j,k} = g_{i,j,k} / \max\{c, (\sum_{iuv} G_{uv}\, g_{i,j+u,i+v}^2)^{0.5}\}$, where c is set to a small number, 0.01, to prevent numerical errors. G is a Gaussian weighting window (Jarrett et al., 2009).
[3] For more details regarding connectivity patterns and parameter sensitivity, see Appendix B and E.
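To make the sublayer definitions concrete, the following sketch implements one filtering/pooling/LCN stage on a 1D toy signal, mirroring the 1D simplification of Figure 1. It is a minimal NumPy illustration of untied local filtering, L2 pooling, and the subtractive/divisive normalization of footnote [2]; the Gaussian window width and the toy sizes are assumptions for illustration, not the trained model's exact configuration.

```python
import numpy as np

def local_filtering_1d(x, W):
    # Untied (non-convolutional) local filters: one row of W per output position.
    rf = W.shape[1]
    return np.array([W[i] @ x[i:i+rf] for i in range(W.shape[0])])

def l2_pooling_1d(h, pool=5):
    # Each pooling unit outputs the square root of the sum of squares of its inputs.
    return np.array([np.sqrt(np.sum(h[i:i+pool] ** 2))
                     for i in range(len(h) - pool + 1)])

def local_contrast_norm_1d(p, radius=4, c=0.01):
    # Footnote [2]: subtractive then divisive normalization with a Gaussian window G,
    # g = p - G*p ; y = g / max(c, sqrt(G*g^2)).
    u = np.arange(-radius, radius + 1)
    G = np.exp(-u ** 2 / (2.0 * (radius / 2.0) ** 2))
    G /= G.sum()
    g = p - np.convolve(p, G, mode="same")
    sigma = np.sqrt(np.convolve(g ** 2, G, mode="same"))
    return g / np.maximum(c, sigma)

# One stage on a toy 1D input; the real model stacks three such stages on 2D images.
x = np.random.randn(200)
W = 0.1 * np.random.randn(183, 18)     # 18-pixel receptive fields, untied weights
y = local_contrast_norm_1d(l2_pooling_1d(local_filtering_1d(x, W), pool=5))
```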
3.3. Learning and Optimization

Learning: During learning, the parameters of the second sublayers (H) are fixed to uniform weights, whereas the encoding weights W1 and decoding weights W2 of the first sublayers are adjusted using the following optimization problem:

\[
\min_{W_1, W_2} \; \sum_{i=1}^{m} \left( \left\| W_2 W_1^{T} x^{(i)} - x^{(i)} \right\|_2^2 \;+\; \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left( W_1^{T} x^{(i)} \right)^2} \right) \qquad (1)
\]

Here, λ is a tradeoff parameter between sparsity and reconstruction; m and k are the number of examples and the number of pooling units in a layer, respectively; H_j is the vector of weights of the j-th pooling unit. In our experiments, we set λ = 0.1.
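Written as code, objective (1) for a single layer is a sum of a reconstruction term and an L2-pooled sparsity term; the sketch below evaluates it on a batch (the matrix shapes and the value of ε are illustrative assumptions).

```python
import numpy as np

def layer_objective(W1, W2, H, X, lam=0.1, eps=1e-8):
    """Objective (1): reconstruction error plus L2-pooled sparsity.

    W1: (n_input, n_features) encoding weights
    W2: (n_input, n_features) decoding weights
    H:  (n_pool, n_features) fixed pooling weights (rows are the H_j vectors)
    X:  (m, n_input) batch of examples x^(i)
    """
    Z = X @ W1                     # W1^T x^(i) for every example, shape (m, n_features)
    recon = Z @ W2.T               # W2 W1^T x^(i), shape (m, n_input)
    reconstruction = np.sum((recon - X) ** 2)
    sparsity = np.sum(np.sqrt(eps + (Z ** 2) @ H.T))   # sum_j sqrt(eps + H_j (W1^T x)^2)
    return reconstruction + lam * sparsity
```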
and the parameter servers, freeing implementors of the
This optimization problem is also known as recon-
layer functions from having to deal with these issues.
struction Topographic Independent Component Anal-
ysis (Hyvärinen et al., 2009; Le et al., 2011a).4 The Asynchronous SGD is more robust to failure and slow-
first term in the objective ensures the representations ness than standard (synchronous) SGD. Specifically,
encode important information about the data, i.e., for synchronous SGD, if one of the machines is slow,
they can reconstruct input data; whereas the second the entire training process is delayed; whereas for asyn-
term encourages pooling features to group similar fea- chronous SGD, if one machine is slow, only one copy
tures together to achieve invariances. of SGD is delayed while the rest of the optimization
can still proceed.
Optimization: All parameters in our model were In our training, at every step of SGD, the gradient is
trained jointly with the objective being the sum of the computed on a minibatch of 100 examples. We trained
objectives of the three layers. the network on a cluster with 1,000 machines for three
To train the model, we implemented model parallelism days. See Appendix B, C, and D for more details re-
by distributing the local weights W1, W2 and H to garding our implementation of the optimization.
different machines. A single instance of the model
partitions the neurons and weights out across 169 ma- 4. Experiments on Faces
chines (where each machine had 16 CPU cores). A
In this section, we describe our analysis of the learned
set of machines that collectively make up a single copy
representations in recognizing faces (“the face detec-
of the model is referred to as a “model replica.” We
tor”) and present control experiments to understand
have built a software framework called DistBelief that
invariance properties of the face detector. Results for
manages all the necessary communication between the
other concepts are presented in the next section.
different machines within a model replica, so that users
of the framework merely need to write the desired up-
wards and downwards computation functions for the 4.1. Test set
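DistBelief itself is not described at the code level here, but the fetch-every-P / push-every-G pattern can be illustrated with a toy, single-process stand-in. In the sketch below, ParameterServer, run_replica and grad_fn are hypothetical names used only for illustration; a real implementation shards parameters across many server partitions and runs replicas on separate machines.

```python
import numpy as np

class ParameterServer:
    # Toy stand-in for one parameter-server partition: it holds the current
    # parameters and applies whatever gradients replicas push, asynchronously.
    def __init__(self, params, lr=0.01):
        self.params, self.lr = params, lr
    def fetch(self):
        return self.params.copy()
    def push(self, grad):
        self.params -= self.lr * grad

def run_replica(server, data, grad_fn, steps, fetch_every=5, push_every=1):
    # Each model replica works on its own shard of data with a possibly stale
    # copy of the parameters, illustrating the P/G schedule described above.
    local = server.fetch()
    acc = np.zeros_like(local)
    for t in range(steps):
        if t % fetch_every == 0:
            local = server.fetch()                       # refresh every P steps
        batch = data[np.random.choice(len(data), size=100)]   # minibatch of 100
        acc += grad_fn(local, batch)
        if (t + 1) % push_every == 0:
            server.push(acc)                             # push gradients every G steps
            acc = np.zeros_like(local)
```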
In our training, at every step of SGD, the gradient is computed on a minibatch of 100 examples. We trained the network on a cluster with 1,000 machines for three days. See Appendix B, C, and D for more details regarding our implementation of the optimization.

4. Experiments on Faces

In this section, we describe our analysis of the learned representations in recognizing faces ("the face detector") and present control experiments to understand invariance properties of the face detector. Results for other concepts are presented in the next section.

4.1. Test set

The test set consists of 37,000 images sampled from two datasets: the Labeled Faces in the Wild dataset (Huang et al., 2007) and the ImageNet dataset (Deng et al., 2009). There are 13,026 faces sampled from non-aligned Labeled Faces in the Wild.[5] The rest are distractor objects randomly sampled from ImageNet. These images are resized to fit the visible areas of the top neurons. Some example images are shown in Appendix A.

[5] http://vis-www.cs.umass.edu/lfw/lfw.tgz

4.2. Experimental protocols

After training, we used this test set to measure the performance of each neuron in classifying faces against distractors. For each neuron, we found its maximum and minimum activation values, then picked 20 equally spaced thresholds in between. The reported accuracy is the best classification accuracy among the 20 thresholds.
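A minimal sketch of this per-neuron evaluation, assuming the neuron's activations on the test set and the face/distractor labels are available as arrays:

```python
import numpy as np

def best_threshold_accuracy(activations, labels, n_thresholds=20):
    # activations: neuron outputs on the test set; labels: 1 for face, 0 for distractor.
    # Sweep 20 equally spaced thresholds between the neuron's minimum and maximum
    # activation and report the best classification accuracy (Section 4.2).
    thresholds = np.linspace(activations.min(), activations.max(), n_thresholds)
    best = 0.0
    for t in thresholds:
        pred = (activations > t).astype(int)
        best = max(best, np.mean(pred == labels))
    return best
```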
4.3. Recognition

Surprisingly, the best neuron in the network performs very well in recognizing faces, despite the fact that no supervisory signals were given during training. The best neuron in the network achieves 81.7% accuracy in detecting faces. There are 13,026 faces in the test set, so guessing all negative only achieves 64.8%. The best neuron in a one-layered network only achieves 71% accuracy, while the best linear filter, selected among 100,000 filters sampled randomly from the training set, only achieves 74%.

To understand the contribution of the local contrast normalization sublayers, we removed them and trained the network again. Results show that the accuracy of the best neuron drops to 78.5%. This agrees with a previous study showing the importance of local contrast normalization (Jarrett et al., 2009).

We visualize histograms of activation values for face images and random images in Figure 2. It can be seen that, even with exclusively unlabeled data, the neuron learns to differentiate between faces and random distractors. Specifically, when we give a face as an input image, the neuron tends to output a value larger than the threshold, 0. In contrast, if we give a random image as an input image, the neuron tends to output a value less than 0.

Figure 2. Histograms of faces (red) vs. no faces (blue). The test set is subsampled such that the ratio between faces and no faces is one.

4.4. Visualization

In this section, we present two visualization techniques to verify whether the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near-optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus (Berkes & Wiskott, 2005; Erhan et al., 2009; Le et al., 2010). In particular, we find the norm-bounded input x which maximizes the output f of the tested neuron by solving:

\[
x^{*} = \arg\max_{x} f(x; W, H), \quad \text{subject to } \|x\|_2 = 1.
\]

Here, f(x; W, H) is the output of the tested neuron given the learned parameters W, H and input x. In our experiments, this constrained optimization problem is solved by projected gradient descent with line search.

These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown in Figure 3, confirm that the tested neuron indeed learns the concept of faces.

Figure 3. Top: Top 48 stimuli of the best neuron from the test set. Bottom: The optimal stimulus according to numerical constraint optimization.
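A simplified sketch of the second visualization technique is shown below: projected gradient ascent on the neuron's output with the input constrained to the unit sphere. The gradient function f_grad is a placeholder for the backpropagated gradient of the tested neuron, and the fixed step size replaces the line search used in the actual experiments.

```python
import numpy as np

def optimal_stimulus(f_grad, n_pixels, steps=500, step_size=0.1, seed=0):
    """Sketch of the numerical visualization: maximize f(x; W, H) subject to
    ||x||_2 = 1 by projected gradient ascent. `f_grad(x)` must return the
    gradient of the tested neuron's output with respect to the input x."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_pixels)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        x = x + step_size * f_grad(x)   # ascent step on the neuron's output
        x /= np.linalg.norm(x)          # project back onto the unit sphere
    return x
```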
4.5. Invariance properties

We would like to assess the robustness of the face detector against common object transformations, e.g., translation, scaling and out-of-plane rotation. First, we chose a set of 10 face images and performed distortions to them, e.g., scaling and translating. For out-of-plane rotation, we used 10 images of faces rotating in 3D ("out-of-plane") as the test set. To check the robustness of the neuron, we plot its averaged response over this small test set with respect to changes in scale, 3D rotation (Figure 4), and translation (Figure 5).[6]

[6] Scaled, translated faces are generated by standard cubic interpolation. For 3D rotated faces, we used 10 sequences of rotated faces from The Sheffield Face Database – http://www.sheffield.ac.uk/eee/research/iel/research/face. See Appendix F for a sample sequence.
Figure 4. Scale (left) and out-of-plane (3D) rotation (right) invariance properties of the best feature.

Figure 5. Translational invariance properties of the best feature. The x-axis is in pixels.

The results show that the neuron is robust against complex and difficult-to-hard-wire invariances such as out-of-plane rotation and scaling.

Control experiments on a dataset without faces: As reported above, the best neuron achieves 81.7% accuracy in classifying faces against random distractors. What if we remove all images that have faces from the training set?

We performed this control experiment by running a face detector in OpenCV and removing those training images that contain at least one face. The recognition accuracy of the best neuron dropped to 72.5%, which is as low as the simple linear filters reported in Section 4.3.

5. Cat and human body detectors

Having achieved a face-sensitive neuron, we would like to understand whether the network is also able to detect other high-level concepts. For instance, cats and body parts are quite common on YouTube. Did the network also learn these concepts?

To answer this question and quantify the selectivity properties of the network with respect to these concepts, we constructed two datasets, one for classifying human bodies against random backgrounds and one for classifying cat faces against other random distractors. For ease of interpretation, these datasets have a positive-to-negative ratio identical to the face dataset.

The cat face images are collected from the dataset described in (Zhang et al., 2008). In this dataset, there are 10,000 positive images and 18,409 negative images (so that the positive-to-negative ratio is similar to the case of faces). The negative images are chosen randomly from the ImageNet dataset.

Negative and positive examples in our human body dataset are subsampled at random from a benchmark dataset (Keller et al., 2009). In the original dataset, each example is a pair of stereo black-and-white images. For simplicity, we keep only the left images. In total, as in the case of human faces, we have 13,026 positive and 23,974 negative examples.

We then followed the same experimental protocols as before. The results, shown in Figure 6, confirm that the network learns not only the concept of faces but also the concepts of cat faces and human bodies.

Figure 6. Visualization of the cat face neuron (left) and human body neuron (right).

Our high-level detectors also outperform standard baselines in terms of recognition rates, achieving 74.8% and 76.7% on cat faces and human bodies respectively. In comparison, the best linear filters (sampled from the training set) only achieve 67.2% and 68.1% respectively.

In Table 1, we summarize all previous numerical results comparing the best neurons against other baselines such as linear filters and random guesses. To understand the effects of training, we also measure the performance of the best neurons in the same network at random initialization.

We also compare our method against several other algorithms such as deep autoencoders (Hinton & Salakhutdinov, 2006; Bengio et al., 2007) and K-means (Coates et al., 2011). Results of these baselines are reported at the bottom of Table 1.

6. Object recognition with ImageNet

We applied the feature learning method to the task of recognizing objects in the ImageNet dataset (Deng et al., 2009). We started from a network that already learned features from YouTube and ImageNet images using the techniques described in this paper. We then added one-versus-all logistic classifiers on top of the highest layer of this network. This method of initializing a network by unsupervised learning is also known as "unsupervised pretraining."
Table 1. Summary of numerical comparisons between our algorithm and other baselines. Top: our algorithm vs. simple baselines. The first three columns are results for methods that do not require training: random guess, random weights (of the network at initialization, without any training) and best linear filters selected from 100,000 examples sampled from the training set. The last three columns are results for methods that involve training: the best neuron in the first layer, the best neuron in the highest layer after training, and the best neuron in the network when the contrast normalization layers are removed. Bottom: our algorithm vs. autoencoders and K-means.

Concept         Random   Same architecture     Best           Best first     Best     Best neuron without
                guess    with random weights   linear filter  layer neuron   neuron   contrast normalization
Faces           64.8%    67.0%                 74.0%          71.0%          81.7%    78.5%
Human bodies    64.8%    66.5%                 68.1%          67.2%          76.8%    71.8%
Cats            64.8%    66.0%                 67.8%          67.1%          74.6%    69.3%

Concept         Our network   Deep autoencoders   Deep autoencoders   K-means on
                              3 layers            6 layers            40x40 images
Faces           81.7%         72.3%               70.9%               72.5%
Human bodies    76.7%         71.2%               69.8%               69.3%
Cats            74.8%         67.5%               68.3%               68.5%

Table 2. Summary of classification accuracies for our method and other state-of-the-art baselines on ImageNet.

Dataset version     2009 (~9M images, ~10K categories)          2011 (~14M images, ~22K categories)
State-of-the-art    16.7% (Sanchez & Perronnin, 2011)           9.3% (Weston et al., 2011)
Our method          16.1% (without unsupervised pretraining)    13.6% (without unsupervised pretraining)
                    19.2% (with unsupervised pretraining)       15.8% (with unsupervised pretraining)

During supervised learning with labeled ImageNet images, the parameters of the lower layers and the logistic classifiers were both adjusted. This was done by first adjusting the logistic classifiers and then adjusting the entire network (also known as "fine-tuning"). As a control experiment, we also train a network starting with all random weights (i.e., without unsupervised pretraining: all parameters are initialized randomly and only adjusted by ImageNet labeled data).
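The two-phase recipe can be sketched generically: first fit linear one-versus-all (softmax) classifiers on frozen features from the pretrained network, then fine-tune the classifiers and the network jointly. The sketch below covers the first phase with plain NumPy; `features` stands in for the activations of the highest layer and is an assumption, not the paper's code.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_classifiers_on_frozen_features(features, labels, n_classes, lr=0.1, epochs=10):
    # Phase 1: fit a linear softmax classifier on top of frozen unsupervised
    # features (features: (n, d) activations of the highest layer; labels: ints).
    n, d = features.shape
    W, b = np.zeros((d, n_classes)), np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]                  # one-hot targets
    for _ in range(epochs):
        P = softmax(features @ W + b)
        W -= lr * features.T @ (P - Y) / n
        b -= lr * (P - Y).mean(axis=0)
    return W, b

# Phase 2 ("fine-tuning") would then backpropagate the same classification loss
# through the classifier *and* the lower layers of the network, updating both.
```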
We followed the experimental protocols specified by (Deng et al., 2010; Sanchez & Perronnin, 2011), in which the datasets are randomly split into two halves for training and validation. We report the performance on the validation set and compare against state-of-the-art baselines in Table 2. Note that the splits are not identical to those of previous work, but validation set performances vary only slightly across different splits.

The results show that our method, starting from scratch (i.e., from raw pixels), bests many state-of-the-art hand-engineered features. On ImageNet with 10K categories, our method yielded a 15% relative improvement over the previous best published result. On ImageNet with 22K categories, it achieved a 70% relative improvement over the highest other result of which we are aware (including unpublished results known to the authors of (Weston et al., 2011)). Note that random guessing achieves less than 0.005% accuracy on this dataset.

7. Conclusion

In this work, we simulated high-level class-specific neurons using unlabeled data. We achieved this by combining ideas from recently developed algorithms to learn invariances from unlabeled data. Our implementation scales to a cluster with thousands of machines thanks to model parallelism and asynchronous SGD.

Our work shows that it is possible to train neurons to be selective for high-level concepts using entirely unlabeled data. In our experiments, we obtained neurons that function as detectors for faces, human bodies, and cat faces by training on random frames of YouTube videos. These neurons naturally capture complex invariances such as out-of-plane and scale invariances.

The learned representations also work well for discriminative tasks. Starting from these representations, we obtain 15.8% accuracy for object recognition on ImageNet with 20,000 categories, a significant leap of 70% relative improvement over the state-of-the-art.

Acknowledgements: We thank Samy Bengio, Adam Coates, Tom Dean, Jia Deng, Mark Mao, Peter Norvig, Paul Tucker, Andrew Saxe, and Jon Shlens for helpful discussions and suggestions.

References

Bengio, Y. and LeCun, Y. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, 2007.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layerwise training of deep networks. In NIPS, 2007.

Berkes, P. and Wiskott, L. Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 2005.

Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.
Coates, A., Lee, H., and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In AISTATS 14, 2011.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Deng, J., Berg, A., Li, K., and Fei-Fei, L. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.

Desimone, R., Albright, T., Gross, C., and Bruce, C. Stimulus-selective properties of inferior temporal neurons in the macaque. The Journal of Neuroscience, 1984.

DiCarlo, J. J., Zoccolan, D., and Rust, N. C. How does the brain solve visual object recognition? Neuron, 2012.

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of deep networks. Technical report, University of Montreal, 2009.

Fukushima, K. and Miyake, S. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 1982.

Gregor, K. and LeCun, Y. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.

Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 2006.

Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

Hubel, D. H. and Wiesel, T. N. Receptive fields of single neurons in the cat's visual cortex. Journal of Physiology, 1959.

Hyvärinen, A., Hurri, J., and Hoyer, P. O. Natural Image Statistics. Springer, 2009.

Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.

Keller, C., Enzweiler, M., and Gavrila, D. M. A new benchmark for stereo-based pedestrian detection. In Proc. of the IEEE Intelligent Vehicles Symposium, 2009.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P. W., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.

Le, Q. V., Karpenko, A., Ngiam, J., and Ng, A. Y. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, 2011a.

Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. Y. On optimization methods for deep learning. In ICML, 2011b.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Lee, H., Battle, A., Raina, R., and Ng, A. Y. Efficient sparse coding algorithms. In NIPS, 2007.

Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief net model for visual area V2. In NIPS, 2008.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.

Lyu, S. and Simoncelli, E. P. Nonlinear image representation using divisive normalization. In CVPR, 2008.

Olshausen, B. and Field, D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.

Pakkenberg, B., Pelvig, D., Marner, L., Bundgaard, M. J., Gundersen, H. J. G., Nyengaard, J. R., and Regeur, L. Aging and the human neocortex. Experimental Gerontology, 2003.

Pinto, N., Cox, D. D., and DiCarlo, J. J. Why is real-world visual object recognition hard? PLoS Computational Biology, 2008.

Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. Invariant visual representation by single neurons in the human brain. Nature, 2005.

Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: Transfer learning from unlabelled data. In ICML, 2007.

Raina, R., Madhavan, A., and Ng, A. Y. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.

Ranzato, M., Huang, F. J., Boureau, Y., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.

Riesenhuber, M. and Poggio, T. Hierarchical models of object recognition in cortex. Nature Neuroscience, 1999.

Sanchez, J. and Perronnin, F. High-dimensional signature compression for large-scale image classification. In CVPR, 2011.

Sermanet, P. and LeCun, Y. Traffic sign recognition with multi-scale convolutional neural networks. In IJCNN, 2011.

Weston, J., Bengio, S., and Usunier, N. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011.

Zhang, W., Sun, J., and Tang, X. Cat head detection - how to effectively exploit shape and texture features. In ECCV, 2008.
A. Training and test images

A subset of training images is shown in Figure 7. As can be seen, the positions, scales, and orientations of faces in the dataset are diverse. A subset of test images for identifying the face neuron is shown in Figure 8.

Figure 7. Thirty randomly-selected training images (shown before the whitening step).

Figure 8. Some example test set images (shown before the whitening step).

B. Models

Central to our approach in this paper is the use of locally-connected networks. In these networks, neurons only connect to a local region of the layer below.

In Figure 9, we show the connectivity patterns of the neural network architecture described in the paper. The actual images in the experiments are 2D, but for simplicity, our images in the visualization are in 1D.

Figure 9. Diagram of the network we used, with more detailed connectivity patterns. Color arrows mean that weights connect to only one map. Dark arrows mean that weights connect to all maps. Pooling neurons only connect to one map whereas simple neurons and LCN neurons connect to all maps.

C. Model Parallelism

We use model parallelism to distribute the storage of parameters and the gradient computations to different machines. In Figure 10, we show how the weights are divided and stored in different "partitions," or, more simply, machines (see also (Krizhevsky, 2009)).

D. Further multicore parallelism

Machines in our cluster have many cores, which allow further parallelism. Hence, we split these cores to perform different tasks. In our implementation, the cores are divided into three groups: reading data, sending (or writing) data, and performing arithmetic computations. At every time instance, these groups work in parallel to load data, compute numerical results, and send results to the network or write data to disk.
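As a rough single-machine illustration of this split (not the actual implementation), one can dedicate separate threads to reading, computing and writing, connected by bounded queues:

```python
import queue, threading

def reader(in_q, batches):
    # "Reading data" group: load batches and hand them to the compute group.
    for b in batches:
        in_q.put(b)
    in_q.put(None)                       # signal end of input

def worker(in_q, out_q, compute):
    # "Arithmetic" group: perform the numerical work on each batch.
    while (b := in_q.get()) is not None:
        out_q.put(compute(b))
    out_q.put(None)

def writer(out_q, sink):
    # "Sending/writing" group: ship results to the network or to disk.
    while (r := out_q.get()) is not None:
        sink(r)

def run_pipeline(batches, compute, sink):
    in_q, out_q = queue.Queue(maxsize=8), queue.Queue(maxsize=8)
    threads = [threading.Thread(target=reader, args=(in_q, batches)),
               threading.Thread(target=worker, args=(in_q, out_q, compute)),
               threading.Thread(target=writer, args=(out_q, sink))]
    for t in threads: t.start()
    for t in threads: t.join()
```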

E. Parameter sensitivity

The hyper-parameters of the network are chosen to fit computational constraints and optimize the training time of our algorithm. These parameters can be changed at the expense of longer training time or more computational resources. For instance, one could increase the size of the receptive fields at the expense of using more memory, more computation, and more network bandwidth per machine; or one could increase the number of maps at the expense of using more machines and memory.
Figure 10. Model parallelism with the network architecture in use. Here, it can be seen that the weights are divided according to the locality of the image and stored on different machines. Concretely, the weights that connect to the left side of the image are stored in machine 1 ("partition 1"). The weights that connect to the central part of the image are stored in machine 2 ("partition 2"). The weights that connect to the right side of the image are stored in machine 3 ("partition 3").

These hyper-parameters could also affect the performance of the features. We performed control experiments to understand the effects of two hyper-parameters: the size of the receptive fields and the number of maps. By varying each of these parameters and observing the test set accuracies, we can gain an understanding of how much they affect the performance on the face recognition task. Results, shown in Figure 11, confirm that the results are only slightly sensitive to changes in these control parameters.

Figure 11. Left: effects of receptive field sizes on the test set accuracy. Right: effects of the number of maps on the test set accuracy.

F. Example out-of-plane rotated face sequence

In Figure 12, we show an example sequence of 3D (out-of-plane) rotated faces. Note that the faces are black and white but are treated as color pictures in the test. More details are available at the webpage for The Sheffield Face Database – http://www.sheffield.ac.uk/eee/research/iel/research/face

Figure 12. A sequence of 3D (out-of-plane) rotated faces of one individual. The dataset consists of 10 sequences.

G. Best linear filters

In the paper, we performed control experiments to compare our features against "best linear filters."

This baseline works as follows. The first step is to sample 100,000 random patches (or filters) from the training set (each patch has the size of a test set image). Then, for each patch, we compute the cosine distances between the patch and the test set images. The cosine distances are treated as feature values. Using these feature values, we then search among 20 thresholds to find the best accuracy of a patch in classifying faces against distractors. Each patch gives one accuracy for our test set.

The reported accuracy is the best accuracy among the 100,000 patches randomly selected from the training set.
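A compact sketch of this baseline, assuming the sampled patches and test images are already flattened to equal-length vectors:

```python
import numpy as np

def best_linear_filter_accuracy(train_patches, test_images, labels, n_thresholds=20):
    # train_patches: (n_filters, d) random patches sampled from the training set,
    # test_images:  (n_test, d) test images flattened to the same size,
    # labels:       1 for the target concept, 0 for distractors.
    def unit(v):
        return v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-12)
    F, X = unit(train_patches), unit(test_images)
    best = 0.0
    for f in F:
        scores = X @ f                                   # cosine similarity as feature value
        thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
        acc = max(np.mean((scores > t) == labels) for t in thresholds)
        best = max(best, acc)
    return best
```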

H. Histograms on the entire test set

Here, we also show the detailed histograms for the neurons on the entire test sets. The fact that the histograms are distinctive for positive and negative images suggests that the network has learned the concept detectors.
Figure 13. Histograms of the neuron's activation values for the best face neuron on the test set. Red: the histogram for face images. Blue: the histogram for random distractors.

Figure 14. Histograms for the best human body neuron on the test set. Red: the histogram for human body images. Blue: the histogram for random distractors.

Figure 15. Histograms for the best cat neuron on the test set. Red: the histogram for cat images. Blue: the histogram for random distractors.

I. Most responsive stimuli for cats and human bodies

In Figure 16, we show the most responsive stimuli for the cat and human body neurons on the test sets. Note that the top stimuli for the human body neuron are black-and-white images because the test set images are black and white (Keller et al., 2009).

Figure 16. Top: most responsive stimuli on the test set for the cat neuron. Bottom: most responsive stimuli on the test set for the human body neuron.

J. Implementation details for autoencoders and K-means

In our implementation, deep autoencoders are also locally connected and use a sigmoidal activation function. For K-means, we downsample images to 40x40 in order to lower computational costs. We also varied the parameters of the autoencoders and K-means and chose them to maximize performance given resource constraints. In our experiments, we used 30,000 centroids for K-means. These models also employed parallelism in a fashion similar to that described in the paper. They also used 1,000 machines for three days.
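The K-means baseline is not released as code; the sketch below is an illustrative stand-in using scikit-learn's MiniBatchKMeans on images assumed to be already downsampled to 40x40, with distances to the 30,000 centroids used as features.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def kmeans_features(train_images, test_images, n_centroids=30_000):
    # Illustrative sketch of the K-means baseline: normalize each (already
    # downsampled) 40x40 image, learn centroids on the training images, and use
    # negative distances to the centroids as the feature representation.
    def prep(imgs):
        flat = imgs.reshape(len(imgs), -1).astype(np.float64)
        return (flat - flat.mean(axis=1, keepdims=True)) / (flat.std(axis=1, keepdims=True) + 1e-8)
    km = MiniBatchKMeans(n_clusters=n_centroids, batch_size=1000, n_init=3)
    km.fit(prep(train_images))
    return -km.transform(prep(test_images))      # one feature per centroid
```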
