Quoc V. Le [email protected]
Jiquan Ngiam [email protected]
Adam Coates [email protected]
Abhik Lahiri [email protected]
Bobby Prochnow [email protected]
Andrew Y. Ng [email protected]
Computer Science Department, Stanford University, Stanford, CA 94305, USA
…Lee et al., 2009a), Map-Reduce style parallelism is still an effective mechanism for scaling up. In such cases, the cost of communicating the parameters across the network is small relative to the cost of computing the objective function value and gradient.

3. Deep learning algorithms

3.1. Restricted Boltzmann Machines

In RBMs (Smolensky, 1986; Hinton et al., 2006), the gradient used in training is an approximation formed by taking a small number of Gibbs sampling steps (Contrastive Divergence). Given the biased nature of the gradient and the intractability of the objective function, it is difficult to use any optimization methods other than plain SGDs. For this reason we will not consider RBMs in our experiments.

3.2. Autoencoders and denoising autoencoders

Given an unlabelled dataset {x^(i)}_{i=1}^m, an autoencoder is a two-layer network that learns nonlinear codes to represent (or “reconstruct”) the data. Specifically, we want to learn representations h(x^(i); W, b) = σ(W x^(i) + b) such that σ(W^T h(x^(i); W, b) + c) is approximately x^(i):

    \min_{W,b,c} \; \sum_{i=1}^{m} \Big\| \sigma\big(W^T \sigma(W x^{(i)} + b) + c\big) - x^{(i)} \Big\|_2^2        (1)

Here, we use the L2 norm to penalize the difference between the reconstruction and the input. In other studies, when x is binary, the cross entropy cost can also be used (Bengio et al., 2007). Typically, we set the activation function σ to be the sigmoid or hyperbolic tangent function.

3.3. Sparse RBMs and autoencoders

In sparse RBMs (Lee et al., 2008), the hidden representations are encouraged to mimic certain properties of the area V2 in the visual cortex. The key idea in their approach is to penalize the deviation between the expected value of the hidden representations E[h_j(x; W, b)] and a preferred target activation ρ. By setting ρ to be close to zero, the hidden unit will be sparsely activated.

Sparse representations have been employed successfully in many applications such as object recognition (Ranzato et al., 2007; Lee et al., 2009a; Nair & Hinton, 2009; Yang et al., 2009), speech recognition (Lee et al., 2009b) and activity recognition (Taylor et al., 2010; Le et al., 2011).

Training sparse RBMs is usually difficult. This is due to the stochastic nature of RBMs. Specifically, in stochastic mode, the estimate of the expectation E[h_j(x; W, b)] is very noisy. A common practice for training sparse RBMs is to use a running estimate of E[h_j(x; W, b)] and to penalize only the bias (Lee et al., 2008; Larochelle & Bengio, 2008). This further complicates the optimization procedure and makes it hard to debug the learning algorithm. Moreover, it is important to tune the learning rates correctly for the different parameters W, b and c. Consequently, it can be difficult to train sparse RBMs.

In our experience, it is often faster and simpler to obtain sparse representations via autoencoders with the proposed sparsity penalties, especially when batch or large-minibatch optimization methods are used.

In detail, we consider sparse autoencoders with a target activation of ρ and penalize the deviation from it using the KL divergence (Hinton, 2010):

    \sum_{j=1}^{n} D_{KL}\Big( \rho \;\Big\|\; \frac{1}{m} \sum_{i=1}^{m} h_j(x^{(i)}; W, b) \Big)        (2)
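To make Eqs. (1) and (2) concrete, the following NumPy/SciPy sketch implements the tied-weight sigmoid autoencoder objective with the KL sparsity penalty and minimizes it with an off-the-shelf L-BFGS routine. It is a minimal illustration rather than the implementation used in our experiments; the penalty weight lam, the target ρ = 0.1 and the random data are placeholder choices.

    import numpy as np
    from scipy.optimize import minimize

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sparse_ae_loss_grad(theta, X, n_hidden, rho=0.1, lam=3.0):
        """Objective of Eq. (1) plus the KL penalty of Eq. (2) for a tied-weight
        sigmoid autoencoder; X stores one example per column."""
        d, m = X.shape
        W = theta[:n_hidden * d].reshape(n_hidden, d)
        b = theta[n_hidden * d:n_hidden * d + n_hidden].reshape(n_hidden, 1)
        c = theta[n_hidden * d + n_hidden:].reshape(d, 1)

        H = sigmoid(W @ X + b)                     # hidden codes h(x; W, b)
        R = sigmoid(W.T @ H + c)                   # reconstructions
        rho_hat = H.mean(axis=1, keepdims=True)    # average activation of each hidden unit

        recon = np.sum((R - X) ** 2)
        kl = np.sum(rho * np.log(rho / rho_hat) +
                    (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

        # Backpropagation for the exact gradient of the penalized objective.
        d2 = 2 * (R - X) * R * (1 - R)
        dH = W @ d2 + (lam / m) * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
        d1 = dH * H * (1 - H)
        gW = d1 @ X.T + H @ d2.T                   # W appears in both encoder and decoder
        gb = d1.sum(axis=1, keepdims=True)
        gc = d2.sum(axis=1, keepdims=True)
        grad = np.concatenate([gW.ravel(), gb.ravel(), gc.ravel()])
        return recon + lam * kl, grad

    # Toy run on random data standing in for real image patches.
    rng = np.random.RandomState(0)
    X = rng.rand(64, 1000)                         # 64-dimensional inputs, 1000 examples
    n_hidden = 25
    theta0 = 0.1 * rng.randn(n_hidden * 64 + n_hidden + 64)
    res = minimize(sparse_ae_loss_grad, theta0, args=(X, n_hidden),
                   jac=True, method='L-BFGS-B', options={'maxiter': 200})
    print('final objective:', res.fun)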
For L-BFGS, we vary the minibatch size in {1000, 10000}; whereas for CG, we vary the minibatch size in {100, 10000}. For SGDs, we tried 20 combinations of optimization parameters, including varying the minibatch size in {1, 10, 100, 1000} (when the minibatch is large, this method is also called minibatch Gradient Descent).
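Operationally, "minibatch L-BFGS/CG" here means running a batch method for a few iterations on each successive large minibatch, warm-starting from the current parameters. The sketch below illustrates this schedule; it assumes a loss_and_grad(theta, batch) function such as the sparse autoencoder objective above (with its extra arguments bound, e.g., via functools.partial), and the number of iterations per minibatch is a placeholder rather than the setting used in our experiments.

    import numpy as np
    from scipy.optimize import minimize

    def train_with_minibatch_batch_method(loss_and_grad, theta0, X, minibatch_size=1000,
                                          n_epochs=10, method='L-BFGS-B', iters_per_batch=20):
        """Run a batch optimizer (L-BFGS or nonlinear CG) on successive large minibatches."""
        theta = theta0.copy()
        m = X.shape[1]                       # examples are stored as columns of X
        rng = np.random.RandomState(0)
        for epoch in range(n_epochs):
            order = rng.permutation(m)
            for start in range(0, m, minibatch_size):
                batch = X[:, order[start:start + minibatch_size]]
                res = minimize(loss_and_grad, theta, args=(batch,), jac=True,
                               method=method, options={'maxiter': iters_per_batch})
                theta = res.x                # warm-start the next minibatch from here
        return theta

Plain SGD corresponds instead to taking a single gradient step on each (small) minibatch with a hand-tuned learning rate rather than calling the batch optimizer.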
We compared the reconstruction errors on the test set of different optimization methods and summarize the results in Figure 1. For SGDs, we only report the results for the two best parameter settings.

The results show that minibatch L-BFGS and CG with line search converge faster than carefully tuned plain SGDs. In particular, CG performs better than L-BFGS because computing the conjugate information can be less expensive than estimating the Hessian. CG also performs better than SGDs thanks to both the line search and the conjugate information.

To understand how much estimating conjugacy helps CG, we also performed a control experiment where we tuned (increased) the minibatch size and added a line search procedure to SGDs.
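A simple way to add a line search to minibatch SGD, sketched below under the assumption of a standard backtracking (Armijo) rule, is to shrink a trial step size until the minibatch objective decreases sufficiently along the negative gradient. This is an illustrative reconstruction, not necessarily the exact line search procedure used in the control experiment.

    import numpy as np

    def sgd_step_with_line_search(loss_and_grad, theta, batch,
                                  init_step=0.1, shrink=0.5, c1=1e-4, max_tries=20):
        """One minibatch gradient step with a backtracking (Armijo) line search."""
        f0, g = loss_and_grad(theta, batch)
        step = init_step
        for _ in range(max_tries):
            trial = theta - step * g
            f1, _ = loss_and_grad(trial, batch)
            if f1 <= f0 - c1 * step * np.dot(g, g):   # sufficient-decrease condition
                return trial, step
            step *= shrink                            # be more conservative and retry
        return theta, 0.0                             # no acceptable step was found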
…the hidden unit activations are computed by averaging over at least about 1000 examples.

Figure 3. Sparse autoencoder training with 10000 units, ρ = 0.1, one machine.

We report the performance of different methods in Figure 3. The results show that L-BFGS/CG are much faster than SGDs. The difference, however, is more significant than in the case of standard autoencoders. This is because L-BFGS and CG prefer larger minibatch sizes, and consequently it is easier to estimate the expected value of the hidden activations. In contrast, SGDs have to deal with a noisy estimate of the hidden activations, and we have to set the learning rate parameters to be small to make the algorithm more stable. Interestingly, the line search does not significantly improve SGDs, unlike in the previous experiment. A close inspection of the line search shows that its initial step sizes are chosen to be slightly smaller (more conservative) than the tuned step size.
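The role of the minibatch size in the sparsity term can also be seen directly: the penalty in Eq. (2) depends on the average hidden activation, and the variance of a minibatch estimate of that average shrinks roughly in proportion to 1/(minibatch size). The toy sketch below uses simulated activations (not a trained network) with mean close to ρ = 0.1 to show how noisy the estimate is for the minibatch sizes considered above.

    import numpy as np

    rng = np.random.RandomState(0)
    # Simulated per-example activations of one hidden unit (a stand-in for h_j(x)),
    # drawn so that their true mean is about 0.1, i.e. near the target rho.
    activations = rng.beta(a=1.0, b=9.0, size=2_000_000)

    for batch_size in [1, 10, 100, 1000, 10000]:
        batches = activations[:batch_size * 200].reshape(200, batch_size)
        estimates = batches.mean(axis=1)   # 200 minibatch estimates of E[h_j]
        print(f"minibatch size {batch_size:>5}: std of the estimate = {estimates.std():.4f}")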
The idea of using GPUs for training deep learning algorithms was first proposed in (Raina et al., 2009). In this section, we will consider GPUs and carry out experiments with standard autoencoders to understand how different optimization algorithms perform.

Using the same experimental protocols described in Section 4.3, we compared optimization methods and their gains in switching from CPUs to GPUs and present the results in Figure 4. From the figure, the speed-up gains are much higher for L-BFGS and CG than for SGDs. This is because L-BFGS and CG prefer larger minibatch sizes, which can be parallelized more efficiently on the GPUs. In particular, with a larger minibatch size, L-BFGS enjoys more significant speed improvements.
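Most of the GPU benefit for L-BFGS/CG comes from evaluating the objective and gradient on a large minibatch, which is dominated by dense matrix products. The sketch below moves the autoencoder reconstruction objective of Eq. (1) onto the GPU with CuPy; this library choice is ours for illustration (our experiments used a different GPU implementation), and the wrapper returns NumPy arrays so that a CPU-side L-BFGS routine can consume them unchanged.

    import numpy as np
    import cupy as cp   # assumes a CUDA-capable GPU and the cupy package

    def recon_loss_grad_gpu(theta, X_gpu, n_hidden):
        """Reconstruction objective of Eq. (1), evaluated on the GPU; returns NumPy values."""
        d, m = X_gpu.shape
        t = cp.asarray(theta)
        W = t[:n_hidden * d].reshape(n_hidden, d)
        b = t[n_hidden * d:n_hidden * d + n_hidden].reshape(n_hidden, 1)
        c = t[n_hidden * d + n_hidden:].reshape(d, 1)

        H = 1 / (1 + cp.exp(-(W @ X_gpu + b)))        # hidden codes, all examples at once
        R = 1 / (1 + cp.exp(-(W.T @ H + c)))          # reconstructions
        d2 = 2 * (R - X_gpu) * R * (1 - R)
        d1 = (W @ d2) * H * (1 - H)
        gW = d1 @ X_gpu.T + H @ d2.T                  # tied weights: encoder + decoder terms
        grad = cp.concatenate([gW.ravel(), d1.sum(axis=1), d2.sum(axis=1)])
        return float(cp.sum((R - X_gpu) ** 2)), cp.asnumpy(grad)

    # Usage: X_gpu = cp.asarray(X); then pass recon_loss_grad_gpu to the same
    # scipy.optimize.minimize(..., jac=True, method='L-BFGS-B') call as on the CPU.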
Figure 6. Parallel training of CNNs.

…improves the performance of both L-BFGS and CG.

4.9. Classification on standard MNIST

Finally, we carried out experiments to determine if L-BFGS affects classification accuracy. We used a convolutional network with a first layer having 32 maps of 5x5 filters and 3x3 pooling with subsampling. The second layer had 64 maps of 5x5 filters and 2x2 pooling with subsampling. This architecture is similar to that described in Ranzato et al. (2007), with the following two differences: (i) we did not use an additional hidden layer of 200 hidden units; (ii) the receptive field of our first-layer pooling units is slightly larger (for computational reasons).
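For concreteness, the sketch below writes down a network of this shape in PyTorch (a framework choice of ours, not the implementation used here): 32 first-layer maps of 5x5 filters with 3x3 pooling, then 64 maps of 5x5 filters with 2x2 pooling, followed by a linear classifier over the 10 MNIST classes. The nonlinearity and the pooling type are assumptions, as are the L-BFGS settings in the toy training step.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SmallConvNet(nn.Module):
        """Two convolutional layers roughly matching the architecture described above."""
        def __init__(self, n_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=5),          # 28x28 -> 24x24, 32 maps
                nn.Tanh(),
                nn.MaxPool2d(kernel_size=3, stride=3),    # 24x24 -> 8x8 (3x3 pooling/subsampling)
                nn.Conv2d(32, 64, kernel_size=5),         # 8x8 -> 4x4, 64 maps
                nn.Tanh(),
                nn.MaxPool2d(kernel_size=2, stride=2),    # 4x4 -> 2x2
            )
            self.classifier = nn.Linear(64 * 2 * 2, n_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    # A single L-BFGS update on one large minibatch (torch.optim.LBFGS needs a closure).
    model = SmallConvNet()
    opt = torch.optim.LBFGS(model.parameters(), max_iter=20)
    images, labels = torch.randn(1000, 1, 28, 28), torch.randint(0, 10, (1000,))

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        return loss

    opt.step(closure)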
Table 1. Classification error on the MNIST test set for some representative methods without pretraining. SGDs with diagonal Levenberg-Marquardt are used in (LeCun et al., 1998; Ranzato et al., 2007).

    LeNet-5, SGDs, no distortions (LeCun et al., 1998)       0.95%
    LeNet-5, SGDs, huge distortions (LeCun et al., 1998)     0.85%
    LeNet-5, SGDs, distortions (LeCun et al., 1998)          0.80%
    ConvNet, SGDs, no distortions (Ranzato et al., 2007)     0.89%
    ConvNet, L-BFGS, no distortions (this paper)             0.69%
We trained our network using 4 machines (with GPUs). For every epoch, we saved the parameters to disk and used a hold-out validation set of 10000 examples⁶ to select the best model. The best model is used to make predictions on the test set. The results of our method (ConvNet) using minibatch L-BFGS are reported in Table 1. The results show that the CNN, trained with L-BFGS, achieves an encouraging classification result: 0.69%. We note that this is the best result for MNIST among algorithms that do not use unsupervised pretraining or distortions. In particular, engineering distortions, typically viewed as a way to introduce domain knowledge, …

⁶ We used a reduced training set of 50000 examples throughout the classification experiments.

5. Discussion

In our experiments, different optimization algorithms appear to be superior on different problems. Contrary to what appears to be a widely held belief that SGDs are almost always preferred, we found that L-BFGS and CG can be superior to SGDs in many cases. Among the problems we considered, L-BFGS is a good candidate for low-dimensional problems, where the number of parameters is relatively small (e.g., convolutional neural networks). For high-dimensional problems, CG often does well.

Sparsity provides another compelling case for using L-BFGS/CG. In our experiments, L-BFGS and CG outperform SGDs on training sparse autoencoders.

We note that there are cases where L-BFGS may not be expected to perform well (e.g., if the Hessian is not well approximated with a low-rank estimate). For instance, on local networks (Le et al., 2010) where the overlaps between receptive fields are small, the Hessian has a block-diagonal structure and L-BFGS, which uses low-rank updates, may not perform well.⁷ In such cases, algorithms that exploit the problem structure may perform much better.

⁷ Personal communication with Will Zou.

CG and L-BFGS are also methods that can take better advantage of the GPUs thanks to their preference for larger minibatch sizes. Furthermore, if one uses tiled (locally connected) networks or other networks with a relatively small number of parameters, it is possible to compute the gradients in a Map-Reduce framework and speed up training with L-BFGS.
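The Map-Reduce point follows from the fact that the gradient of an objective that sums over examples decomposes into per-shard partial gradients: each worker ("map") evaluates the loss and gradient on its shard, and a single "reduce" sums them before the L-BFGS update. The sketch below uses Python's multiprocessing as a stand-in for a real Map-Reduce cluster and assumes a picklable loss_and_grad(theta, shard) function such as the autoencoder objective sketched earlier.

    import numpy as np
    from multiprocessing import Pool

    def distributed_loss_grad(theta, shards, loss_and_grad, pool):
        """Map: per-shard loss and gradient. Reduce: sum them for the optimizer."""
        results = pool.starmap(loss_and_grad, [(theta, shard) for shard in shards])
        loss = sum(f for f, _ in results)
        grad = np.sum([g for _, g in results], axis=0)
        return loss, grad

    # Usage (inside an `if __name__ == "__main__":` guard):
    #   shards = np.array_split(X, n_workers, axis=1)   # one data shard per worker
    #   pool = Pool(n_workers)
    #   objective = lambda theta: distributed_loss_grad(theta, shards, loss_and_grad, pool)
    #   scipy.optimize.minimize(objective, theta0, jac=True, method='L-BFGS-B')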
Acknowledgments: We thank Andrew Maas, Andrew Saxe, Quinn Slack, Alex Smola and Will Zou for comments and discussions. This work is supported by the DARPA Deep Learning program under contract number FA8650-10-C-7020.

References

Bartlett, P., Hazan, E., and Rakhlin, A. Adaptive online gradient descent. In NIPS, 2008.
Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layerwise training of deep networks. In NIPS, 2007.
Bordes, A., Bottou, L., and Gallinari, P. SGD-QN: Careful quasi-Newton stochastic gradient descent. JMLR, 2010.
Bottou, L. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes 91, 1991.
Bottou, L. and Bousquet, O. The tradeoffs of large scale learning. In NIPS, 2008.
Chu, C. T., Kim, S. K., Lin, Y. A., Yu, Y. Y., Bradski, G., Ng, A. Y., and Olukotun, K. Map-Reduce for machine learning on multicore. In NIPS 19, 2007.
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.
Coates, A., Lee, H., and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In AISTATS 14, 2011.
Dean, J. and Ghemawat, S. Map-Reduce: simplified data processing on large clusters. Comm. ACM, 2008.
Do, C. B., Le, Q. V., and Foo, C. S. Proximal regularization for online and batch learning. In ICML, 2009.
Goodfellow, I., Le, Q. V., Saxe, A., Lee, H., and Ng, A. Y. Measuring invariances in deep networks. In NIPS, 2010.
Hinton, G. A practical guide to training restricted Boltzmann machines. Technical report, U. of Toronto, 2010.
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.
Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
Larochelle, H. and Bengio, Y. Classification using discriminative restricted Boltzmann machines. In ICML, 2008.
Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P. W., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.
Le, Q. V., Zou, W., Yeung, S. Y., and Ng, A. Y. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient based learning applied to document recognition. Proceedings of the IEEE, 1998.
LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, 1998.
Lee, H., Battle, A., Raina, R., and Ng, A. Y. Efficient sparse coding algorithms. In NIPS, 2007.
Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief net model for visual area V2. In NIPS, 2008.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009a.
Lee, H., Largman, Y., Pham, P., and Ng, A. Y. Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS, 2009b.
Mann, G., McDonald, R., Mohri, M., Silberman, N., and Walker, D. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, 2009.
Martens, J. Deep learning via Hessian-free optimization. In ICML, 2010.
Nair, V. and Hinton, G. E. 3D object recognition with deep belief nets. In NIPS, 2009.
Olshausen, B. and Field, D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: Transfer learning from unlabelled data. In ICML, 2007.
Raina, R., Madhavan, A., and Ng, A. Y. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.
Ranzato, M., Huang, F. J., Boureau, Y., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
Shalev-Shwartz, S., Singer, Y., and Srebro, N. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.
Simard, P., Steinkraus, D., and Platt, J. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, 2003.
Smolensky, P. Information processing in dynamical systems: foundations of harmony theory. In Parallel Distributed Processing, 1986.
Taylor, G. W., Fergus, R., LeCun, Y., and Bregler, C. Convolutional learning of spatio-temporal features. In ECCV, 2010.
Teo, C. H., Le, Q. V., Smola, A. J., and Vishwanathan, S. V. N. A scalable modular convex solver for regularized risk minimization. In KDD, 2007.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. A. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., and Murphy, K. P. Accelerated training of conditional random fields with stochastic gradient methods. In ICML, 2007.
Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
Zinkevich, M., Weimer, M., Smola, A., and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.