[Figure 2. The architecture of the network: a 224 × 224 × 3 input, five convolutional layers split across two GPUs (with a stride of 4 in the first layer and max pooling after the first, second, and fifth), feature maps of spatial size 55, 27, 13, 13, and 13, two fully connected layers of 2048 neurons per GPU each, and a 1000-way output.]
pooling. This is what we use throughout our network, with s = 2 and z = 3. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme s = 2, z = 2, which produces output of equivalent dimensions. We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.
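As a concrete illustration of the scheme above, here is a minimal NumPy sketch of overlapping max pooling with window size z = 3 and stride s = 2; the function name and single-channel input are our own illustrative assumptions, not code from the paper.

import numpy as np

def overlapping_max_pool(x, z=3, s=2):
    # Max-pool a 2D feature map with a z x z window moved s pixels at a time.
    # With z > s the pooling windows overlap, as described in Section 4.4.
    h, w = x.shape
    out_h = (h - z) // s + 1
    out_w = (w - z) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + z, j * s:j * s + z].max()
    return out

# Example: a 13 x 13 map (the size of the fifth-layer maps) pools to 6 x 6.
print(overlapping_max_pool(np.random.randn(13, 13)).shape)  # (6, 6)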
4.5. Overall architecture
Now we are ready to describe the overall architecture of our CNN. As depicted in Figure 2, the net contains eight layers with weights; the first five are convolutional and the remaining three are fully connected. The output of the last fully connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. Our network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.
The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU (see Figure 2). The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully connected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers, of the kind described in Section 4.4, follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully connected layer.
The first convolutional layer filters the 224 × 224 × 3 input image with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels (this is the distance between the receptive field centers of neighboring neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully connected layers have 4096 neurons each.
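To make these layer sizes concrete, the following sketch tallies the weight counts implied by the description above. It assumes, per Figure 2, that the fifth-layer feature maps are 13 × 13 and are max-pooled (z = 3, s = 2) to 6 × 6 × 256 before the first fully connected layer; the layer names and the tally itself are our own illustration, not code from the paper.

layers = [
    ("conv1", 96,   11 * 11 * 3),   # 96 kernels of size 11 x 11 x 3
    ("conv2", 256,  5 * 5 * 48),    # each kernel sees only the 48 maps on its own GPU
    ("conv3", 384,  3 * 3 * 256),   # connected to all 256 maps of layer 2
    ("conv4", 384,  3 * 3 * 192),
    ("conv5", 256,  3 * 3 * 192),
    ("fc6",   4096, 6 * 6 * 256),   # assumed pooled conv5 output, flattened
    ("fc7",   4096, 4096),
    ("fc8",   1000, 4096),          # feeds the 1000-way softmax
]

total = 0
for name, n_units, fan_in in layers:
    n_params = n_units * fan_in + n_units   # weights plus one bias per unit
    total += n_params
    print(f"{name}: {n_params:,} parameters")
print(f"total: {total:,}")                   # roughly 60 million, matching Section 5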
5. REDUCING OVERFITTING
Our neural network architecture has 60 million parameters. Although the 1000 classes of ILSVRC make each training example impose 10 bits of constraint on the mapping from image to label, this turns out to be insufficient to learn so many parameters without considerable overfitting. Below, we describe the two primary ways in which we combat overfitting.

5.1. Data augmentation
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (e.g., Refs. 4, 5, 30). We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk. In our implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images. So these data augmentation schemes are, in effect, computationally free.
The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training our network on these extracted patches.d This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence 10 patches in all), and averaging the predictions made by the network's softmax layer on the ten patches.

d This is the reason why the input images in Figure 2 are 224 × 224 × 3 dimensional.
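The following is a minimal NumPy sketch of this crop-and-reflect scheme together with the ten-patch test-time procedure; the function names and the height × width × channel image layout are our own assumptions, and the paper's actual CPU-side Python code is not reproduced here.

import numpy as np

def random_crop_and_flip(img, crop=224):
    # Sample one training patch: a random 224 x 224 crop plus an optional mirror.
    h, w, _ = img.shape                      # expects a 256 x 256 x 3 array
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]               # horizontal reflection
    return patch

def ten_crop(img, crop=224):
    # Test-time patches: four corners plus the center, each with its reflection.
    h, w, _ = img.shape
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    patches = [img[t:t + crop, l:l + crop] for t, l in offsets]
    patches += [p[:, ::-1] for p in patches]
    return np.stack(patches)                 # shape (10, 224, 224, 3)

# At test time the network's softmax outputs for the 10 patches would be averaged.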
The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1. Therefore to each RGB image pixel Ixy = [IRxy, IGxy, IBxy]T we add the following quantity:

[p1, p2, p3] [α1λ1, α2λ2, α3λ3]T,

where pi and λi are the ith eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values, respectively, and αi is the aforementioned random variable. Each αi is drawn only once for all the pixels of a particular training image until that image is used for training again, at which point it is re-drawn. This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination. This scheme reduces the top-1 error rate by over 1%.
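A NumPy sketch of this color augmentation follows; the variable names, the placeholder pixel sample, and the H × W × 3 image layout are our own illustrative assumptions.

import numpy as np

# PCA over RGB pixel values from the training set (computed once).
pixels = np.random.rand(100000, 3)        # placeholder for RGB values sampled from the training set
cov = np.cov(pixels, rowvar=False)        # 3 x 3 covariance matrix of RGB values
eigvals, eigvecs = np.linalg.eigh(cov)    # eigvecs[:, i] is p_i, eigvals[i] is lambda_i

def pca_color_augment(img, sigma=0.1):
    # Draw alpha_i once per presentation of the image, then add
    # [p1 p2 p3][alpha1*lambda1, alpha2*lambda2, alpha3*lambda3]^T to every pixel.
    alphas = np.random.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alphas * eigvals)   # a length-3 RGB offset
    return img + shift                     # broadcast over all H x W pixels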
6. DETAILS OF LEARNING
We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model's training error. The update rule for weight w was

ui+1 := 0.9 · ui − 0.0005 · ε · wi − ε · 〈∂L/∂w | wi〉Di
wi+1 := wi + ui+1

where i is the iteration index, u is the momentum variable, ε is the learning rate, and 〈∂L/∂w | wi〉Di is the average over the ith batch Di of the derivative of the objective with respect to w, evaluated at wi.

We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01. We initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully connected hidden layers, with the constant 1. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. We initialized the neuron biases in the remaining layers with the constant 0.

We used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination. We trained the network for roughly 90 cycles through the training set of 1.2 million images, which took 5–6 days on two NVIDIA GTX 580 3GB GPUs.
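A NumPy sketch of this update rule and initialization for a single weight matrix; avg_grad stands for the batch-averaged derivative 〈∂L/∂w | wi〉Di, and all names, shapes, and the surrounding training loop are our own illustrative assumptions.

import numpy as np

eps = 0.01            # learning rate; divided by 10 when validation error stops improving
momentum = 0.9
weight_decay = 0.0005

w = np.random.normal(0.0, 0.01, size=(4096, 4096))  # zero-mean Gaussian init, std 0.01
b = np.ones(4096)     # biases set to 1 in conv2/4/5 and the hidden FC layers, 0 elsewhere
u = np.zeros_like(w)  # momentum variable

def sgd_step(w, u, avg_grad):
    # u_{i+1} = 0.9*u_i - 0.0005*eps*w_i - eps*avg_grad;  w_{i+1} = w_i + u_{i+1}
    u = momentum * u - weight_decay * eps * w - eps * avg_grad
    return w + u, u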
Table 2. Comparison of error rates on ILSVRC-2012 validation and test sets.

Model         Top-1 (val, %)   Top-5 (val, %)   Top-5 (test, %)
SIFT + FVs6   –                –                26.2
1 CNN         40.7             18.2             –
5 CNNs        38.1             16.4             16.4
1 CNN*        39.0             16.6             –
7 CNNs*       36.7             15.4             15.3

In italics are best results achieved by others. Models with an "*" were "pre-trained" to classify the entire ImageNet 2011 Fall release (see Section 7 for details).
Figure 4. (Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column. The remaining columns show the six training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test image.
of poses. We present the results for many more test images in the supplementary material.

Computing similarity by using Euclidean distance between two 4096-dimensional, real-valued vectors is inefficient, but it could be made efficient by training an autoencoder to compress these vectors to short binary codes. This should produce a much better image retrieval method than applying autoencoders to the raw pixels,16 which does not make use of image labels and hence has a tendency to retrieve images with similar patterns of edges, whether or not they are semantically similar.
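As a sketch of the retrieval idea just described, the following brute-force NumPy snippet finds, for one test image, the training images whose last-hidden-layer feature vectors are closest in Euclidean distance; the feature matrices here are random placeholders and all names are our own.

import numpy as np

def nearest_training_images(test_feat, train_feats, k=6):
    # Euclidean distance from the test feature vector to every training feature vector.
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    return np.argsort(dists)[:k]          # indices of the k closest training images

train_feats = np.random.randn(1000, 4096).astype(np.float32)   # placeholder 4096-d activations
test_feat = np.random.randn(4096).astype(np.float32)           # placeholder test activation
print(nearest_training_images(test_feat, train_feats))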
8. DISCUSSION
Our results show that a large, deep CNN is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning. It is notable that our network's performance degrades if a single convolutional layer is removed. For example, removing any of the middle layers results in a loss of about 2% for the top-1 performance of the network. So the depth really is important for achieving our results.
To simplify our experiments, we did not use any unsupervised pre-training even though we expect that it will help, especially if we obtain enough computational power to significantly increase the size of the network without obtaining a corresponding increase in the amount of labeled data. Thus far, our results have improved as we have made our network larger and trained it longer, but we still have many orders of magnitude to go in order to match the infero-temporal pathway of the human visual system. Ultimately we would like to use very large and deep convolutional nets on video sequences, where the temporal structure provides very helpful information that is missing or far less obvious in static images.
9. EPILOGUE
The response of the computer vision community to the success of SuperVision was impressive. Over the next year or two, they switched to using deep neural networks and these are now widely deployed by Google, Facebook, Microsoft, Baidu, and many other companies. By 2015, better hardware, more hidden layers, and a host of technical advances reduced the error rate of deep convolutional neural nets by a further factor of three, so that they are now quite close to human performance for static images.11, 31 Much of the credit for this revolution should go to the pioneers who spent many years developing the technology of CNNs, but the essential missing ingredient was supplied by Fei-Fei et al.7 who put a huge effort into producing a labeled dataset that was finally large enough to show what neural networks could really do.
Alex Krizhevsky and Geoffrey E. Hinton ({akrizhevsky, geoffhinton}@google.com), Google Inc.
Ilya Sutskever ([email protected]), OpenAI.

References
8. Fei-Fei, L., Fergus, R., Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Comput. Vision Image Understanding 106, 1 (2007), 59–70.
9. Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 4 (1980), 193–202.
10. Griffin, G., Holub, A., Perona, P. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
11. He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
12. Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012).
13. Jarrett, K., Kavukcuoglu, K., Ranzato, M.A., LeCun, Y. What is the best multi-stage architecture for object recognition? In International Conference on Computer Vision (2009). IEEE, 2146–2153.
14. Krizhevsky, A. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
15. Krizhevsky, A. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 2010.
16. Krizhevsky, A., Hinton, G. Using very deep autoencoders for content-based image retrieval. In ESANN (2011).
17. LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L., et al. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems (1990).
18. LeCun, Y. Une procedure d'apprentissage pour reseau a seuil asymmetrique (a learning scheme for asymmetric threshold networks). 1985.
19. LeCun, Y., Huang, F., Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004). Volume 2 (2004). IEEE, II–97.
20. LeCun, Y., Kavukcuoglu, K., Farabet, C. Convolutional networks and applications in vision. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS) (2010). IEEE, 253–256.
21. Lee, H., Grosse, R., Ranganath, R., Ng, A. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning (2009). ACM, 609–616.
22. Linnainmaa, S. Taylor expansion of the accumulated rounding error. BIT Numer. Math. 16, 2 (1976), 146–160.
23. Mensink, T., Verbeek, J., Perronnin, F., Csurka, G. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV – European Conference on Computer Vision (Florence, Italy, Oct. 2012).
24. Nair, V., Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (2010).
25. Pinto, N., Cox, D., DiCarlo, J. Why is real-world visual object recognition hard? PLoS Comput. Biol. 4, 1 (2008), e27.
26. Pinto, N., Doukhan, D., DiCarlo, J., Cox, D. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Comput. Biol. 5, 11 (2009), e1000579.
27. Rumelhart, D.E., Hinton, G.E., Williams, R.J. Learning internal representations by error propagation. Technical report, DTIC Document, 1985.
28. Russell, B.C., Torralba, A., Murphy, K., Freeman, W. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 77, 1 (2008), 157–173.
29. Sánchez, J., Perronnin, F. High-dimensional signature compression for large-scale image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2011). IEEE, 1665–1672.
30. Simard, P., Steinkraus, D., Platt, J. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition. Volume 2 (2003), 958–962.
31. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 1–9.
32. Turaga, S., Murray, J., Jain, V., Roth, F., Helmstaedter, M., Briggman, K., Denk, W., Seung, H. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Comput. 22, 2 (2010), 511–538.
33. Werbos, P. Beyond regression: New tools for prediction and analysis in the behavioral sciences, 1974.