Building High-Level Features Using Large-Scale Unsupervised Learning

Quoc V. Le [email protected]
Marc’Aurelio Ranzato [email protected]
Rajat Monga [email protected]
Matthieu Devin [email protected]
Kai Chen [email protected]
Greg S. Corrado [email protected]
Abstract

We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 22,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.

1. Introduction

The focus of this work is to build high-level, class-specific feature detectors from unlabeled images. For instance, we would like to understand if it is possible to build a face detector from only unlabeled images. This approach is inspired by the neuroscientific conjecture that there exist highly class-specific neurons in the human brain, generally and informally known as “grandmother neurons.” The extent of class-specificity of neurons in the brain is an area of active investigation, but current experimental evidence suggests the possibility that some neurons in the temporal cortex are highly selective for object categories such as faces or hands (Desimone et al., 1984), and perhaps even specific people (Quiroga et al., 2005).

Contemporary computer vision methodology typically emphasizes the role of labeled data to obtain these class-specific feature detectors. For example, to build a face detector, one needs a large collection of images labeled as containing faces, often with a bounding box around the face. The need for large labeled sets poses a significant challenge for problems where labeled data are rare. Although approaches that make use of inexpensive unlabeled data are often preferred, they have not been shown to work well for building high-level features.

This work investigates the feasibility of building high-level features from only unlabeled data. A positive answer to this question will give rise to two significant results. Practically, this provides an inexpensive way to develop features from unlabeled data. But perhaps more importantly, it answers an intriguing question as to whether the specificity of the “grandmother neuron” could possibly be learned from unlabeled data. Informally, this would suggest that it is at least in principle possible that a baby learns to group faces into one class
because it has seen many of them and not because it is guided by supervision or rewards.

Unsupervised feature learning and deep learning have emerged as methodologies in machine learning for building features from unlabeled data. Using unlabeled data in the wild to learn features is the key idea behind the self-taught learning framework (Raina et al., 2007). Successful feature learning algorithms and their applications can be found in recent literature using a variety of approaches such as RBMs (Hinton et al., 2006), autoencoders (Hinton & Salakhutdinov, 2006; Bengio et al., 2007), sparse coding (Lee et al., 2007) and K-means (Coates et al., 2011). So far, most of these algorithms have only succeeded in learning low-level features such as “edge” or “blob” detectors. Going beyond such simple features and capturing complex invariances is the topic of this work.

Recent studies observe that it is quite time intensive to train deep learning algorithms to yield state-of-the-art results (Ciresan et al., 2010). We conjecture that the long training time is partially responsible for the lack of high-level features reported in the literature. For instance, researchers typically reduce the sizes of datasets and models in order to train networks in a practical amount of time, and these reductions undermine the learning of high-level features.

We address this problem by scaling up the core components involved in training deep networks: the dataset, the model, and the computational resources. First, we use a large dataset generated by sampling random frames from random YouTube videos [1]. Our input data are 200x200 images, much larger than typical 32x32 images used in deep learning and unsupervised feature learning (Krizhevsky, 2009; Ciresan et al., 2010; Le et al., 2010; Coates et al., 2011). Our model, a deep autoencoder with pooling and local contrast normalization, is scaled to these large images by using a large computer cluster. To support parallelism on this cluster, we use the idea of local receptive fields, e.g., (Raina et al., 2009; Le et al., 2010; 2011b). This idea reduces communication costs between machines and thus allows model parallelism (parameters are distributed across machines). Asynchronous SGD is employed to support data parallelism. The model was trained in a distributed fashion on a cluster with 1,000 machines (16,000 cores) for three days.

[1] This is different from the work of (Lee et al., 2009) who trained their model on images from one class.

Experimental results using classification and visualization confirm that it is indeed possible to build high-level features from unlabeled data. In particular, using a hold-out test set consisting of faces and distractors, we discover a feature that is highly selective for faces. This result is also validated by visualization via numerical optimization. Control experiments show that the learned detector is not only invariant to translation but also to out-of-plane rotation and scaling.

Similar experiments reveal the network also learns the concepts of cat faces and human bodies.

The learned representations are also discriminative. Using the learned features, we obtain significant leaps in object recognition with ImageNet. For instance, on ImageNet with 22,000 categories, we achieved 15.8% accuracy, a relative improvement of 70% over the state-of-the-art. Note that random guessing achieves less than 0.005% accuracy for this dataset.

2. Training set construction

Our training dataset is constructed by sampling frames from 10 million YouTube videos. To avoid duplicates, each video contributes only one image to the dataset. Each example is a color image with 200x200 pixels.

A subset of training images is shown in Appendix A. To check the proportion of faces in the dataset, we run an OpenCV face detector on 60x60 randomly-sampled patches from the dataset (http://opencv.willowgarage.com/wiki/). This experiment shows that patches detected as faces by the OpenCV face detector account for less than 3% of the 100,000 sampled patches.
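This sanity check is simple to reproduce. The sketch below is a minimal illustration under stated assumptions, not the authors' script: it assumes a directory of already-extracted frames and uses OpenCV's stock Haar-cascade frontal-face model; the frames path and cascade choice are placeholders.

```python
# Hypothetical sketch: estimate the fraction of random 60x60 patches that a
# stock OpenCV Haar-cascade detector labels as faces.
import glob
import random
import cv2  # OpenCV Python bindings

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frames = glob.glob("frames/*.jpg")          # placeholder: one frame per video
num_patches, num_faces = 100_000, 0

for _ in range(num_patches):
    img = cv2.imread(random.choice(frames), cv2.IMREAD_GRAYSCALE)
    h, w = img.shape
    y, x = random.randint(0, h - 60), random.randint(0, w - 60)
    patch = img[y:y + 60, x:x + 60]
    # detectMultiScale returns a (possibly empty) list of detected face boxes.
    if len(detector.detectMultiScale(patch)) > 0:
        num_faces += 1

print("face patches: %.2f%%" % (100.0 * num_faces / num_patches))
```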
3. Algorithm

In this section, we describe the algorithm that we use to learn features from the unlabeled training set.

3.1. Previous work

Our work is inspired by recent successful algorithms in unsupervised feature learning and deep learning (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007; Lee et al., 2007). It is strongly influenced by the work of (Olshausen & Field, 1996) on sparse coding. According to their study, sparse coding can be trained on unlabeled natural images to yield receptive fields akin to V1 simple cells (Hubel & Wiesel, 1959).

One shortcoming of early approaches such as sparse coding (Olshausen & Field, 1996) is that their architectures are shallow and typically capture low-level concepts (e.g., edge “Gabor” filters) and simple invariances. Addressing this issue is a focus of recent work in deep learning (Hinton et al., 2006; Bengio et al., 2007; Bengio & LeCun, 2007; Lee et al., 2008; 2009) which builds hierarchies of feature representations. In particular, Lee et al. (2008) show that stacked sparse RBMs can model certain simple functions of the V2 area of the cortex. They also demonstrate that convolutional DBNs (Lee et al., 2009), trained on aligned images of faces, can learn a face detector. This result is interesting, but unfortunately requires a certain degree of supervision during dataset construction: their training images (i.e., Caltech 101 images) are aligned, homogeneous and belong to one selected category.

Figure 1. The architecture and parameters in one layer of our network. The overall network replicates this structure three times. For simplicity, the images are in 1D.

3.2. Architecture

Our algorithm is built upon these ideas and can be viewed as a sparse deep autoencoder with three important ingredients: local receptive fields, pooling and local contrast normalization. First, to scale the autoencoder to large images, we use a simple idea known as local receptive fields (LeCun et al., 1998; Raina et al., 2009; Lee et al., 2009; Le et al., 2010). This biologically inspired idea proposes that each feature in the autoencoder can connect only to a small region of the lower layer. Next, to achieve invariance to local deformations, we employ local L2 pooling (Hyvärinen et al., 2009; Gregor & LeCun, 2010; Le et al., 2010) and local contrast normalization (Jarrett et al., 2009). L2 pooling, in particular, allows the learning of invariant features (Hyvärinen et al., 2009; Le et al., 2010).

Our deep autoencoder is constructed by replicating three times the same stage composed of local filtering, local pooling and local contrast normalization. The output of one stage is the input to the next one and the overall model can be interpreted as a nine-layered network (see Figure 1).

The first and second sublayers are often known as filtering (or simple) and pooling (or complex) respectively. The third sublayer performs local subtractive and divisive normalization and it is inspired by biological and computational models (Pinto et al., 2008; Lyu & Simoncelli, 2008; Jarrett et al., 2009) [2].

[2] The subtractive normalization removes the weighted average of neighboring neurons from the current neuron: $g_{i,j,k} = h_{i,j,k} - \sum_{iuv} G_{uv} h_{i,j+u,i+v}$. The divisive normalization computes $y_{i,j,k} = g_{i,j,k} / \max\{c, (\sum_{iuv} G_{uv}\, g_{i,j+u,i+v}^{2})^{0.5}\}$, where c is set to be a small number, 0.01, to prevent numerical errors. G is a Gaussian weighting window. (Jarrett et al., 2009)

As mentioned above, central to our approach is the use of local connectivity between neurons. In our experiments, the first sublayer has receptive fields of 18x18 pixels and the second sublayer pools over 5x5 overlapping neighborhoods of features (i.e., pooling size). The neurons in the first sublayer connect to pixels in all input channels (or maps) whereas the neurons in the second sublayer connect to pixels of only one channel (or map) [3]. While the first sublayer outputs linear filter responses, the pooling layer outputs the square root of the sum of the squares of its inputs, and therefore, it is known as L2 pooling.

[3] For more details regarding connectivity patterns and parameter sensitivity, see Appendix B and E.

Our style of stacking a series of uniform modules, switching between selectivity and tolerance layers, is reminiscent of the Neocognitron and HMAX (Fukushima & Miyake, 1982; LeCun et al., 1998; Riesenhuber & Poggio, 1999). It has also been argued to be an architecture employed by the brain (DiCarlo et al., 2012).

Although we use local receptive fields, they are not convolutional: the parameters are not shared across different locations in the image. This is a stark difference between our approach and previous work (LeCun et al., 1998; Jarrett et al., 2009; Lee et al., 2009). In addition to being more biologically plausible, unshared weights allow the learning of invariances other than translational invariance (Le et al., 2010).

In terms of scale, our network is perhaps one of the largest known networks to date. It has 1 billion trainable parameters, which is more than an order of magnitude larger than other large networks reported in the literature, e.g., (Ciresan et al., 2010; Sermanet & LeCun, 2011) with around 10 million parameters. It is worth noting that our network is still tiny compared to the human visual cortex, which is 10^6 times larger in terms of the number of neurons and synapses (Pakkenberg et al., 2003).
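As a concrete reading of one stage, the sketch below applies the three sublayers to a 1D toy input, in the spirit of Figure 1. It is an illustrative reconstruction under simplifying assumptions, not the released implementation: the shapes, receptive-field width, and Gaussian window are chosen for brevity, and the 2D image structure, multiple maps, and per-location untied weights of the real model are omitted.

```python
# Illustrative sketch (not the authors' code): one stage of the model on a
# 1D signal -- local filtering, local L2 pooling, then local contrast
# normalization with subtractive and divisive steps (cf. footnote [2]).
import numpy as np

def one_stage(x, W1, pool_size=5, c=0.01):
    """x: (n,) input; W1: (num_units, rf) untied local filters."""
    num_units, rf = W1.shape
    # Sublayer 1: linear filtering with *unshared* local receptive fields.
    h = np.array([W1[i] @ x[i:i + rf] for i in range(num_units)])

    # Sublayer 2: L2 pooling over overlapping neighborhoods.
    p = np.array([np.sqrt(np.sum(h[i:i + pool_size] ** 2))
                  for i in range(num_units - pool_size + 1)])

    # Sublayer 3: local contrast normalization with a Gaussian window G.
    G = np.exp(-0.5 * np.arange(-2, 3) ** 2)
    G /= G.sum()
    local_mean = np.convolve(p, G, mode="same")
    g = p - local_mean                       # subtractive normalization
    local_sd = np.sqrt(np.convolve(g ** 2, G, mode="same"))
    return g / np.maximum(c, local_sd)       # divisive normalization

# Toy usage: a 200-dimensional input with 18-wide receptive fields.
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
W1 = rng.standard_normal((200 - 18 + 1, 18))
print(one_stage(x, W1).shape)
```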
3.3. Learning and Optimization

Learning: During learning, the parameters of the second sublayers (H) are fixed to uniform weights, whereas the encoding weights W1 and decoding weights W2 of the first sublayers are adjusted using the following optimization problem:

$$\underset{W_1,\,W_2}{\text{minimize}} \;\; \sum_{i=1}^{m} \left( \left\lVert W_2 W_1^{T} x^{(i)} - x^{(i)} \right\rVert_2^{2} \;+\; \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left( W_1^{T} x^{(i)} \right)^{2}} \right). \qquad (1)$$

Here, λ is a tradeoff parameter between sparsity and reconstruction; m, k are the number of examples and pooling units in a layer, respectively; Hj is the vector of weights of the j-th pooling unit. In our experiments, we set λ = 0.1.

This optimization problem is also known as reconstruction Topographic Independent Component Analysis (Hyvärinen et al., 2009; Le et al., 2011a) [4]. The first term in the objective ensures the representations encode important information about the data, i.e., they can reconstruct input data; whereas the second term encourages pooling features to group similar features together to achieve invariances.

[4] In (Bengio et al., 2007; Le et al., 2011a), the encoding weights and the decoding weights are tied: W1 = W2. However, for better parallelism and better features, our implementation does not enforce tied weights.

Optimization: All parameters in our model were trained jointly with the objective being the sum of the objectives of the three layers.

To train the model, we implemented model parallelism by distributing the local weights W1, W2 and H to different machines. A single instance of the model partitions the neurons and weights out across 169 machines (where each machine had 16 CPU cores). A set of machines that collectively make up a single copy of the model is referred to as a “model replica.” We have built a software framework called DistBelief that manages all the necessary communication between the different machines within a model replica, so that users of the framework merely need to write the desired upwards and downwards computation functions for the neurons in the model, and don’t have to deal with the low-level communication of data across machines.

We further scaled up the training by implementing asynchronous SGD using multiple replicas of the core model. For the experiments described here, we divided the training into 5 portions and ran a copy of the model on each of these portions. The models communicate updates through a set of centralized “parameter servers,” which keep the current state of all parameters for the model in a set of partitioned servers (we used 256 parameter server partitions for training the model described in this paper). In the simplest implementation, before processing each mini-batch a model replica asks the centralized parameter servers for an updated copy of its model parameters. It then processes a mini-batch to compute a parameter gradient, and sends the parameter gradients to the appropriate parameter servers, which then apply each gradient to the current value of the model parameter. We can reduce the communication overhead by having each model replica request updated parameters every P steps and by sending updated gradient values to the parameter servers every G steps (where G might not be equal to P). Our DistBelief software framework automatically manages the transfer of parameters and gradients between the model partitions and the parameter servers, freeing implementors of the layer functions from having to deal with these issues.

Asynchronous SGD is more robust to failure and slowness than standard (synchronous) SGD. Specifically, for synchronous SGD, if one of the machines is slow, the entire training process is delayed; whereas for asynchronous SGD, if one machine is slow, only one copy of SGD is delayed while the rest of the optimization can still proceed.

In our training, at every step of SGD, the gradient is computed on a minibatch of 100 examples. We trained the network on a cluster with 1,000 machines for three days. See Appendix B, C, and D for more details regarding our implementation of the optimization.
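The replica-to-parameter-server exchange described above can be summarized in a short sketch. This is a hedged illustration of the communication pattern, not DistBelief itself: ParameterServer, fetch_parameters, push_gradients, and compute_gradient are hypothetical names, and sharding across 256 partitions, threading, and failure handling are omitted.

```python
# Hedged sketch of asynchronous SGD against a centralized parameter server.
# All class and function names here are illustrative, not DistBelief APIs.
import numpy as np

class ParameterServer:
    """Holds the current parameters; applies gradients as they arrive."""
    def __init__(self, dim, lr=0.01):
        self.theta, self.lr = np.zeros(dim), lr

    def fetch_parameters(self):
        return self.theta.copy()

    def push_gradients(self, grad):
        self.theta -= self.lr * grad          # applied without waiting for other replicas

def compute_gradient(theta, x):
    # Placeholder: gradient of a trivial quadratic objective, for illustration only.
    return theta - x.mean(axis=0)

def run_replica(server, data_shard, steps, P=10, G=10, batch=100):
    """One model replica: refresh parameters every P steps, push gradients every G steps."""
    theta = server.fetch_parameters()
    pending = np.zeros_like(theta)
    for t in range(steps):
        if t % P == 0:                        # request an updated copy of the parameters
            theta = server.fetch_parameters()
        x = data_shard[np.random.randint(len(data_shard), size=batch)]
        grad = compute_gradient(theta, x)
        pending += grad
        theta -= server.lr * grad             # local update between pushes
        if (t + 1) % G == 0:                  # send accumulated gradients to the server
            server.push_gradients(pending)
            pending = np.zeros_like(theta)

# Toy usage: in a real deployment each replica runs on its own machines and data shard.
server = ParameterServer(dim=8)
run_replica(server, np.random.randn(1000, 8), steps=100)
```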
4. Experiments on Faces

In this section, we describe our analysis of the learned representations in recognizing faces (“the face detector”) and present control experiments to understand invariance properties of the face detector. Results for other concepts are presented in the next section.

4.1. Test set

The test set consists of 37,000 images sampled from two datasets: the Labeled Faces in the Wild dataset (Huang et al., 2007) and the ImageNet dataset (Deng et al., 2009). There are 13,026 faces sampled from non-aligned Labeled Faces in the Wild [5]. The rest are distractor objects randomly sampled from ImageNet. These images are resized to fit the visible areas of the top neurons. Some example images are shown in Appendix A.

[5] http://vis-www.cs.umass.edu/lfw/lfw.tgz

4.2. Experimental protocols

After training, we used this test set to measure the performance of each neuron in classifying faces against distractors. For each neuron, we found its maximum and minimum activation values, then picked 20 equally spaced thresholds in between.

Table 1. Summary of numerical comparisons between our algorithm and other baselines. Top: our algorithm vs. simple baselines. The first three result columns are for methods that do not require training: random guess, random weights (of the network at initialization, without any training) and the best linear filter selected from 100,000 examples sampled from the training set. The last three columns are for methods that involve training: the best neuron in the first layer, the best neuron in the highest layer after training, and the best neuron in the network when the contrast normalization layers are removed. Bottom: our algorithm vs. deep autoencoders and K-means.

Concept      | Random guess | Same architecture with random weights | Best linear filter | Best first layer neuron | Best neuron | Best neuron without contrast normalization
Faces        | 64.8% | 67.0% | 74.0% | 71.0% | 81.7% | 78.5%
Human bodies | 64.8% | 66.5% | 68.1% | 67.2% | 76.8% | 71.8%
Cats         | 64.8% | 66.0% | 67.8% | 67.1% | 74.6% | 69.3%

Concept      | Our network | Deep autoencoders (3 layers) | Deep autoencoders (6 layers) | K-means on 40x40 images
Faces        | 81.7% | 72.3% | 70.9% | 72.5%
Human bodies | 76.7% | 71.2% | 69.8% | 69.3%
Cats         | 74.8% | 67.5% | 68.3% | 68.5%

Table 2. Summary of classification accuracies for our method and other state-of-the-art baselines on ImageNet.

Dataset version  | 2009 (~9M images, ~10K categories)       | 2011 (~14M images, ~22K categories)
State-of-the-art | 16.7% (Sanchez & Perronnin, 2011)        | 9.3% (Weston et al., 2011)
Our method       | 16.1% (without unsupervised pretraining) | 13.6% (without unsupervised pretraining)
                 | 19.2% (with unsupervised pretraining)    | 15.8% (with unsupervised pretraining)

… known as “unsupervised pretraining.” During supervised learning with labeled ImageNet images, the parameters of lower layers and the logistic classifiers were both adjusted. This was done by first adjusting the logistic classifiers and then adjusting the entire network (also known as “fine-tuning”). As a control experiment, we also train a network starting with all random weights (i.e., without unsupervised pretraining: all parameters are initialized randomly and only adjusted by ImageNet labeled data).

We followed the experimental protocols specified by (Deng et al., 2010; Sanchez & Perronnin, 2011), in which the datasets are randomly split into two halves for training and validation. We report the performance on the validation set and compare against state-of-the-art baselines in Table 2. Note that the splits are not identical to previous work, but validation set performances vary slightly across different splits.

The results show that our method, starting from scratch (i.e., raw pixels), bests many state-of-the-art hand-engineered features. On ImageNet with 10K categories, our method yielded a 15% relative improvement over the previous best published result. On ImageNet with 22K categories, it achieved a 70% relative improvement over the highest other result of which we are aware (including unpublished results known to the authors of (Weston et al., 2011)). Note that random guessing achieves less than 0.005% accuracy for this dataset.

…combining ideas from recently developed algorithms to learn invariances from unlabeled data. Our implementation scales to a cluster with thousands of machines thanks to model parallelism and asynchronous SGD.

Our work shows that it is possible to train neurons to be selective for high-level concepts using entirely unlabeled data. In our experiments, we obtained neurons that function as detectors for faces, human bodies, and cat faces by training on random frames of YouTube videos. These neurons naturally capture complex invariances such as out-of-plane and scale invariances.

The learned representations also work well for discriminative tasks. Starting from these representations, we obtain 15.8% accuracy for object recognition on ImageNet with 20,000 categories, a significant leap of 70% relative improvement over the state-of-the-art.

Acknowledgements: We thank Samy Bengio, Adam Coates, Tom Dean, Jia Deng, Mark Mao, Peter Norvig, Paul Tucker, Andrew Saxe, and Jon Shlens for helpful discussions and suggestions.

References

Bengio, Y. and LeCun, Y. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, 2007.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layerwise training of deep networks. In NIPS, 2007.
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.

Coates, A., Lee, H., and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In AISTATS 14, 2011.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.

Deng, J., Berg, A., Li, K., and Fei-Fei, L. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.

Desimone, R., Albright, T., Gross, C., and Bruce, C. Stimulus-selective properties of inferior temporal neurons in the macaque. The Journal of Neuroscience, 1984.

DiCarlo, J. J., Zoccolan, D., and Rust, N. C. How does the brain solve visual object recognition? Neuron, 2012.

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of deep networks. Technical report, University of Montreal, 2009.

Fukushima, K. and Miyake, S. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 1982.

Gregor, K. and LeCun, Y. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.

Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 2006.

Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

Hubel, D. H. and Wiesel, T. N. Receptive fields of single neurons in the cat's visual cortex. Journal of Physiology, 1959.

Hyvärinen, A., Hurri, J., and Hoyer, P. O. Natural Image Statistics. Springer, 2009.

Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.

Keller, C., Enzweiler, M., and Gavrila, D. M. A new benchmark for stereo-based pedestrian detection. In Proc. of the IEEE Intelligent Vehicles Symposium, 2009.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P. W., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.

Le, Q. V., Karpenko, A., Ngiam, J., and Ng, A. Y. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, 2011a.

Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. Y. On optimization methods for deep learning. In ICML, 2011b.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Lee, H., Battle, A., Raina, R., and Ng, A. Y. Efficient sparse coding algorithms. In NIPS, 2007.

Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief net model for visual area V2. In NIPS, 2008.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.

Lyu, S. and Simoncelli, E. P. Nonlinear image representation using divisive normalization. In CVPR, 2008.

Olshausen, B. and Field, D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.

Pakkenberg, B., P., D., Marner, L., Bundgaard, M. J., Gundersen, H. J. G., Nyengaard, J. R., and Regeur, L. Aging and the human neocortex. Experimental Gerontology, 2003.

Pinto, N., Cox, D. D., and DiCarlo, J. J. Why is real-world visual object recognition hard? PLoS Computational Biology, 2008.

Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. Invariant visual representation by single neurons in the human brain. Nature, 2005.

Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: Transfer learning from unlabelled data. In ICML, 2007.

Raina, R., Madhavan, A., and Ng, A. Y. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.

Ranzato, M., Huang, F. J., Boureau, Y., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.

Riesenhuber, M. and Poggio, T. Hierarchical models of object recognition in cortex. Nature Neuroscience, 1999.

Sanchez, J. and Perronnin, F. High-dimensional signature compression for large-scale image classification. In CVPR, 2011.

Sermanet, P. and LeCun, Y. Traffic sign recognition with multiscale convolutional neural networks. In IJCNN, 2011.

Weston, J., Bengio, S., and Usunier, N. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011.

Zhang, W., Sun, J., and Tang, X. Cat head detection - how to effectively exploit shape and texture features. In ECCV, 2008.
C. Model Parallelism
We use model parallelism to distribute the storage of
parameters and gradient computations to different ma-
chines. In Figure 10, we show how the weights are
divided and stored in different “partitions,” or more
simply, machines (see also (Krizhevsky, 2009)).
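To make the partitioning concrete, here is a hedged sketch of how one layer's weight matrix might be split into per-machine blocks; the row-wise partition scheme and all names are assumptions chosen for illustration, not the DistBelief implementation.

```python
# Illustrative sketch: partition an untied filtering layer's weights so that
# each machine owns a contiguous block of output neurons and their gradients.
import numpy as np

def partition_rows(W, num_machines):
    """Split weight matrix W (num_neurons x rf) into per-machine row blocks."""
    return np.array_split(W, num_machines, axis=0)

def distributed_forward(partitions, x_slices):
    """Each 'machine' computes responses for its own neurons; results are then
    concatenated. In a real system this loop would run on separate hosts."""
    outputs = [np.einsum('nr,nr->n', W_part, x_part)
               for W_part, x_part in zip(partitions, x_slices)]
    return np.concatenate(outputs)

# Toy usage: 1,000 neurons with 18-dimensional receptive fields on 4 "machines".
rng = np.random.default_rng(0)
W1 = rng.standard_normal((1000, 18))
x_patches = rng.standard_normal((1000, 18))   # one input patch per neuron
parts = partition_rows(W1, 4)
x_parts = partition_rows(x_patches, 4)
print(distributed_forward(parts, x_parts).shape)   # (1000,)
```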