
On optimization methods for deep learning

Quoc V. Le [email protected]
Jiquan Ngiam [email protected]
Adam Coates [email protected]
Abhik Lahiri [email protected]
Bobby Prochnow [email protected]
Andrew Y. Ng [email protected]
Computer Science Department, Stanford University, Stanford, CA 94305, USA

Abstract

The predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite their ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated off-the-shelf optimization methods such as Limited memory BFGS (L-BFGS) and Conjugate Gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the difference between L-BFGS/CG and SGDs is more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of L-BFGS with locally connected networks and convolutional neural networks. Using L-BFGS, our convolutional network model achieves 0.69% on the standard MNIST dataset. This is a state-of-the-art result on MNIST among algorithms that do not use distortions or pretraining.

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

1. Introduction

Stochastic Gradient Descent methods (SGDs) have been extensively employed in machine learning (Bottou, 1991; LeCun et al., 1998; Shalev-Shwartz et al., 2007; Bottou & Bousquet, 2008; Zinkevich et al., 2010). A strength of SGDs is that they are simple to implement and also fast for problems that have many training examples.

However, SGD methods have many disadvantages. One key disadvantage of SGDs is that they require much manual tuning of optimization parameters such as learning rates and convergence criteria. If one does not know the task at hand well, it is very difficult to find a good learning rate or a good convergence criterion. A standard strategy in this case is to run the learning algorithm with many optimization parameters and pick the model that gives the best performance on a validation set. Since one needs to search over the large space of possible optimization parameters, this makes SGDs difficult to use in settings where running the optimization procedure many times is computationally expensive. The second weakness of SGDs is that they are inherently sequential: it is very difficult to parallelize them using GPUs or distribute them using computer clusters.

Batch methods, such as Limited memory BFGS (L-BFGS) or Conjugate Gradient (CG), with the presence of a line search procedure, are usually much more stable to train and easier to check for convergence. These methods also enjoy parallelism by computing the gradient on GPUs (Raina et al., 2009) and/or distributing that computation across machines (Chu et al., 2007). These methods, conventionally considered to be slow, can be fast thanks to the availability of large amounts of RAM, multicore CPUs, GPUs and computer clusters with fast network hardware.

On a single machine, the speed benefits of L-BFGS come from using the approximated second-order information (modelling the interactions between variables). For CG, on the other hand, the benefits come from using conjugacy information during optimization. Thanks to these bookkeeping steps, L-BFGS and CG can be faster and more stable than SGDs.

A weakness of batch L-BFGS and CG, which require the computation of the gradient on the entire dataset to make an update, is that they do not scale gracefully with the number of examples. We found that minibatch training, which requires the computation of the gradient on only a small subset of the dataset, addresses this weakness well: minibatch L-BFGS/CG are fast even when the dataset is large.

Our experimental results reflect the different strengths and weaknesses of the different optimization methods. Among the problems we considered, L-BFGS is highly competitive with, and sometimes superior to, SGDs/CG for low dimensional problems, especially convolutional models. For high dimensional problems, CG is more competitive and usually outperforms L-BFGS and SGDs. Additionally, using a large minibatch and line search with SGDs can improve performance.

More significant speed improvements of L-BFGS and CG over SGDs are observed in our experiments with sparse autoencoders. This is because having a larger minibatch makes the optimization problem easier for sparse autoencoders: in this case, the cost of estimating the second-order and conjugate information is small compared to the cost of computing the gradient. Furthermore, when training autoencoders, L-BFGS and CG can both be sped up significantly (2x) by simply performing the computations on GPUs. Conversely, only small speed improvements were observed when SGDs are used with GPUs on the same problem.

We also present results showing that Map-Reduce style optimization works well for L-BFGS when the model utilizes locally connected networks (Le et al., 2010) or convolutional neural networks (LeCun et al., 1998). Our experimental results show that the speed improvements are close to linear in the number of machines when locally connected networks and convolutional networks are used (up to 8 machines were considered in the experiments).

We applied our findings to train a convolutional network model (similar to Ranzato et al. (2007)) with L-BFGS on a GPU cluster and obtained 0.69% test set error. This is the state-of-the-art result on MNIST among algorithms that do not use pretraining or distortions.

Batch optimization is also behind the success of feature learning algorithms that achieve state-of-the-art performance on a variety of object recognition problems (Le et al., 2010; Coates et al., 2011) and action recognition problems (Le et al., 2011).

2. Related work

Optimization research has a long history. Examples of successful unconstrained optimization methods include Newton-Raphson's method, BFGS methods, Conjugate Gradient methods and Stochastic Gradient Descent methods. These methods are usually associated with a line search method to ensure that the algorithms consistently improve the objective function.

When it comes to large scale machine learning, the favorite optimization method is usually SGDs. Recent work on SGDs focuses on adaptive strategies for the learning rate (Shalev-Shwartz et al., 2007; Bartlett et al., 2008; Do et al., 2009) or on improving SGD convergence by approximating second-order information (Vishwanathan et al., 2007; Bordes et al., 2010). In practice, plain SGDs with constant learning rates or learning rates of the form α/(β + t) are still popular thanks to their ease of implementation. These simple methods are even more common in deep learning (Hinton, 2010) because the optimization problems are nonconvex and the convergence properties of more complex methods (Shalev-Shwartz et al., 2007; Bartlett et al., 2008; Do et al., 2009) no longer hold.

Recent proposals for training deep networks argue for the use of layerwise pretraining (Hinton et al., 2006; Bengio et al., 2007; Bengio, 2009). Optimization techniques for training these models include Contrastive Divergence (Hinton et al., 2006), Conjugate Gradient (Hinton & Salakhutdinov, 2006), stochastic diagonal Levenberg-Marquardt (LeCun et al., 1998) and Hessian-free optimization (Martens, 2010). Convolutional neural networks (LeCun et al., 1998) have traditionally employed SGDs with the stochastic diagonal Levenberg-Marquardt method, which uses a diagonal approximation to the Hessian (LeCun et al., 1998).

In this paper, our goal is to empirically study the pros and cons of off-the-shelf optimization algorithms in the context of unsupervised feature learning and deep learning. In that direction, we focus on comparing L-BFGS, CG and SGDs.

Parallel optimization methods have recently attracted attention as a way to scale up machine learning algorithms. Map-Reduce (Dean & Ghemawat, 2008) style optimization methods (Chu et al., 2007; Teo et al., 2007) have been successful early approaches. We also note recent studies (Mann et al., 2009; Zinkevich et al., 2010) that have parallelized SGDs without using the Map-Reduce framework.

In our experiments, we found that if we use tiled (locally connected) networks (Le et al., 2010), which include convolutional architectures (LeCun et al., 1998; Lee et al., 2009a), Map-Reduce style parallelism is still an effective mechanism for scaling up. In such cases, the cost of communicating the parameters across the network is small relative to the cost of computing the objective function value and gradient.

3. Deep learning algorithms

3.1. Restricted Boltzmann Machines

In RBMs (Smolensky, 1986; Hinton et al., 2006), the gradient used in training is an approximation formed by taking a small number of Gibbs sampling steps (Contrastive Divergence). Given the biased nature of this gradient and the intractability of the objective function, it is difficult to use any optimization methods other than plain SGDs. For this reason we will not consider RBMs in our experiments.

3.2. Autoencoders and denoising autoencoders

Given an unlabelled dataset {x^(i)}_{i=1}^m, an autoencoder is a two-layer network that learns nonlinear codes to represent (or "reconstruct") the data. Specifically, we want to learn representations h(x^(i); W, b) = σ(W x^(i) + b) such that σ(W^T h(x^(i); W, b) + c) is approximately x^(i):

    \min_{W,b,c} \; \sum_{i=1}^{m} \big\| \sigma\big(W^T \sigma(W x^{(i)} + b) + c\big) - x^{(i)} \big\|_2^2        (1)

Here, we use the L2 norm to penalize the difference between the reconstruction and the input. In other studies, when x is binary, the cross entropy cost can also be used (Bengio et al., 2007). Typically, we set the activation function σ to be the sigmoid or hyperbolic tangent function.

Unlike RBMs, the gradient of the autoencoder objective can be computed exactly, and this gives rise to an opportunity to use more advanced optimization methods, such as L-BFGS and CG, to train the networks. Denoising autoencoders (Vincent et al., 2008) are also algorithms that can be trained by L-BFGS/CG.

3.3. Sparse RBMs and Autoencoders

Sparsity regularization typically leads to more interpretable features that perform well for classification. Sparse coding was first proposed by Olshausen & Field (1996) as a model of simple cells in the visual cortex. Lee et al. (2007) and Raina et al. (2007) applied sparse coding to learn features for machine learning applications. Lee et al. (2008) combined sparsity and RBMs to learn representations that mimic certain properties of area V2 in the visual cortex. The key idea in their approach is to penalize the deviation between the expected value of the hidden representations E[h_j(x; W, b)] and a preferred target activation ρ. By setting ρ to be close to zero, the hidden units will be sparsely activated.

Sparse representations have been employed successfully in many applications such as object recognition (Ranzato et al., 2007; Lee et al., 2009a; Nair & Hinton, 2009; Yang et al., 2009), speech recognition (Lee et al., 2009b) and activity recognition (Taylor et al., 2010; Le et al., 2011).

Training sparse RBMs is usually difficult. This is due to the stochastic nature of RBMs. Specifically, in stochastic mode, the estimate of the expectation E[h_j(x; W, b)] is very noisy. A common practice to train sparse RBMs is to use a running estimate of E[h_j(x; W, b)] and penalize only the bias (Lee et al., 2008; Larochelle & Bengio, 2008). This further complicates the optimization procedure and makes it hard to debug the learning algorithm. Moreover, it is important to tune the learning rates correctly for the different parameters W, b and c. Consequently, it can be difficult to train sparse RBMs.

In our experience, it is often faster and simpler to obtain sparse representations via autoencoders with the proposed sparsity penalties, especially when batch or large minibatch optimization methods are used.

In detail, we consider sparse autoencoders with a target activation of ρ and penalize deviations from it using the KL divergence (Hinton, 2010):

    \sum_{j=1}^{n} D_{KL}\Big( \rho \,\Big\|\, \frac{1}{m} \sum_{i=1}^{m} h_j(x^{(i)}; W, b) \Big)        (2)

where m is the number of examples and n is the number of hidden units.

To train sparse autoencoders, we need to estimate the expected activation value for each hidden unit. However, we will not be able to compute this statistic unless we run the optimization method in batch mode. In practice, if we have a small dataset, it is better to use a batch method to train a sparse autoencoder because we do not have to tweak optimization parameters such as the minibatch size and λ (described below).

Using a minibatch of size m′ << m, it is typical to keep a running estimate τ of the expectation E[h(x; W, b)]. In this case, the KL penalty is

    \sum_{j=1}^{n} D_{KL}\Big( \rho \,\Big\|\, \lambda \frac{1}{m'} \sum_{i=1}^{m'} h_j(x^{(i)}; W, b) + (1 - \lambda)\, \tau_j \Big)        (3)

where λ is another tunable parameter.
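To make these penalties concrete, the following is a minimal NumPy sketch of a batch-mode sparse autoencoder cost that combines the reconstruction term of equation (1) with the KL penalty of equation (2). The helper names, the clipping constant and the sparsity weight beta are illustrative assumptions of ours; in particular, the paper does not state how the two terms are weighted.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_bernoulli(rho, rho_hat, eps=1e-8):
    # Elementwise KL divergence between Bernoulli(rho) and Bernoulli(rho_hat).
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)
    return rho * np.log(rho / rho_hat) + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))

def sparse_autoencoder_cost(W, b, c, X, rho=0.1, beta=1.0):
    # X: (m, d) data matrix; W: (n_hidden, d); b, c: bias vectors.
    # beta is an assumed weight on the sparsity term (not specified in the text).
    H = sigmoid(X @ W.T + b)                        # hidden codes h(x; W, b), shape (m, n_hidden)
    X_rec = sigmoid(H @ W + c)                      # reconstruction with tied weights W^T
    recon = np.sum((X_rec - X) ** 2)                # equation (1)
    rho_hat = H.mean(axis=0)                        # (1/m) * sum_i h_j(x^(i); W, b)
    sparsity = np.sum(kl_bernoulli(rho, rho_hat))   # equation (2)
    return recon + beta * sparsity

An off-the-shelf L-BFGS or CG routine would additionally require the gradient of this cost, obtained by backpropagation or automatic differentiation.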

3.4. Tiled and locally connected networks

RBMs and autoencoders have densely-connected network architectures which do not scale well to large images. For large images, the most common approach is to use convolutional neural networks (LeCun et al., 1998; Lee et al., 2009a). Convolutional neural networks have local receptive field architectures: each hidden unit can only connect to a small region of the image. Translational invariance is usually hardwired by weight tying. Recent approaches try to relax this constraint (Le et al., 2010) in their tiled convolutional architectures in order to also learn other invariances (Goodfellow et al., 2010).

Our experimental results show that local architectures, such as tiled convolutional or convolutional architectures, can be efficiently trained with a computer cluster using the Map-Reduce framework. With local architectures, the cost of communicating the gradient over the network is often smaller than the cost of computing it (e.g., in the cases considered in the experiments).

4. Experiments

4.1. Datasets and computers

Our experiments were carried out on the standard MNIST dataset. We used up to 8 machines for our experiments; each machine has 4 Intel CPU cores (at 2.67 GHz) and a GeForce GTX 285 GPU. Most experiments below are done on a single machine unless indicated with "parallel."

We performed our experiments using Matlab and its GPU plugin Jacket [1]. For parallel experiments, we used our custom toolbox that makes remote procedure calls in Matlab and Java.

[1] http://www.accelereyes.com/

In the experiments below, we report the standard metric in machine learning: the objective function evaluated on test data (i.e., test error) against time. We note that the objective function evaluated on the training set shows similar trends.

4.2. Optimization methods

We are interested in off-the-shelf SGDs, L-BFGS and CG. For SGDs, we used a learning rate schedule of α/(β + t), where t is the iteration number. In our experiments, we found that it is better to use this learning rate schedule than a constant learning rate. We also use momentum, and vary the number of examples used to compute the gradient. In summary, the optimization parameters associated with SGDs are: α, β, the momentum parameters (Hinton, 2010) and the number of examples in a minibatch.
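As a concrete illustration of this parameterization, below is a minimal sketch of a minibatch SGD loop with the α/(β + t) learning rate schedule and classical momentum. The function signature, the default constants and the assumption that grad_fn returns the minibatch gradient are our own illustrative choices, not settings reported in the paper.

import numpy as np

def sgd_train(grad_fn, theta, data, alpha=0.1, beta=100.0,
              momentum=0.9, minibatch_size=100, num_iters=1000, seed=0):
    # grad_fn(theta, batch) returns the gradient of the objective on the batch.
    # data: NumPy array of training examples (first axis indexes examples).
    rng = np.random.default_rng(seed)
    velocity = np.zeros_like(theta)
    m = len(data)
    for t in range(1, num_iters + 1):
        idx = rng.choice(m, size=minibatch_size, replace=False)
        g = grad_fn(theta, data[idx])
        lr = alpha / (beta + t)                  # the alpha / (beta + t) schedule
        velocity = momentum * velocity - lr * g  # classical momentum update
        theta = theta + velocity
    return theta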
We run L-BFGS and CG with a fixed minibatch for several iterations and then resample a new minibatch from the larger training set. For each new minibatch, we discard the cached optimization history in L-BFGS/CG.

In our settings, for CG and L-BFGS, there are two optimization parameters: the minibatch size and the number of iterations per minibatch. We use the default values [2] for other optimization parameters, such as the line search parameters. For CG and L-BFGS, we replaced the minibatch after 3 iterations and 20 iterations respectively. We found that these parameters generally work very well for many problems. Therefore, the only remaining tunable parameter is the minibatch size.

[2] We used the L-BFGS implementation in minFunc by Mark Schmidt and a CG implementation from Carl Rasmussen. We note that both of these are fairly optimized implementations of these algorithms; less sophisticated implementations may perform worse.
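A minimal sketch of this minibatch protocol is given below, using SciPy's off-the-shelf L-BFGS routine in place of minFunc: each outer step draws a fresh minibatch and starts a new, short L-BFGS run, so no curvature history is carried over between minibatches. The 20-iteration inner budget mirrors the setting above, while the cost function interface and the remaining constants are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def minibatch_lbfgs(cost_and_grad, theta0, data, minibatch_size=1000,
                    iters_per_minibatch=20, num_minibatches=50, seed=0):
    # cost_and_grad(theta, batch) must return (objective, gradient) on the batch.
    rng = np.random.default_rng(seed)
    theta = theta0
    for _ in range(num_minibatches):
        idx = rng.choice(len(data), size=minibatch_size, replace=False)
        result = minimize(cost_and_grad, theta, args=(data[idx],), jac=True,
                          method="L-BFGS-B",
                          options={"maxiter": iters_per_minibatch})
        theta = result.x  # restarting discards the cached L-BFGS history
    return theta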
4.3. Autoencoder training

We compare L-BFGS and CG against SGDs for training autoencoders. Our autoencoders have 10000 hidden units and the sigmoid activation function (σ). As a result, our model has approximately 8 × 10^6 parameters, which is considered challenging for high order optimization methods like L-BFGS [3].

[3] For lower dimensional problems, L-BFGS works much better than the other candidates (we omit the results due to space constraints).

Figure 1. Autoencoder training with 10000 units on one machine.

For L-BFGS, we vary the minibatch size in {1000, 10000}; whereas for CG, we vary the minibatch size in {100, 10000}. For SGDs, we tried 20 combinations of optimization parameters, including varying the minibatch size in {1, 10, 100, 1000} (when the minibatch is large, this method is also called minibatch Gradient Descent).

We compared the reconstruction errors on the test set of the different optimization methods and summarize the results in Figure 1. For SGDs, we only report the results for the two best parameter settings.

The results show that minibatch L-BFGS and CG with line search converge faster than carefully tuned plain SGDs. In particular, CG performs better compared to L-BFGS because computing the conjugate information can be less expensive than estimating the Hessian. CG also performs better than SGDs thanks to both the line search and the conjugate information.

To understand how much estimating conjugacy helps CG, we also performed a control experiment where we tuned (increased) the minibatch size and added a line search procedure to SGDs.

Figure 2. Control experiment with line search for SGDs.

The results are shown in Figure 2 and confirm that having a line search procedure makes SGDs simpler to tune and faster. Using information from the previous steps to form the Hessian approximation (L-BFGS) or conjugate directions (CG) further improves the results.
4.4. Sparse autoencoder training

In this experiment, we trained the autoencoders with the KL sparsity penalty. The target activation ρ is set to be 10% (a typical value for sparse autoencoders or RBMs). The weighting between the estimate for the current sample and the old estimate (λ) is set to the ratio between the minibatch size m′ and 1000, i.e., λ = min(m′/1000, 1). This means that our estimates of the hidden unit activations are computed by averaging over at least about 1000 examples.
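The following short sketch shows how such a running estimate of the mean hidden activations can be maintained with the weighting of equation (3) and λ = min(m′/1000, 1); the function name and the reference batch size constant are illustrative choices of ours.

import numpy as np

def update_running_activation(tau, hidden_batch, reference_size=1000):
    # hidden_batch: (m_prime, n_hidden) activations h(x; W, b) on the current minibatch.
    m_prime = hidden_batch.shape[0]
    lam = min(m_prime / reference_size, 1.0)     # lambda = min(m'/1000, 1)
    batch_mean = hidden_batch.mean(axis=0)       # (1/m') * sum_i h_j(x^(i); W, b)
    return lam * batch_mean + (1.0 - lam) * tau  # blended estimate used inside D_KL in equation (3)

The returned vector plays the role of the estimated expected activation in the KL penalty of the earlier sketch.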

Figure 3. Sparse autoencoder training with 10000 units, ρ = 0.1, one machine.

We report the performance of the different methods in Figure 3. The results show that L-BFGS/CG are much faster than SGDs. The difference is even more significant than in the case of standard autoencoders. This is because L-BFGS and CG prefer larger minibatch sizes, and consequently it is easier to estimate the expected value of the hidden activations. In contrast, SGDs have to deal with a noisy estimate of the hidden activations, and we have to set the learning rate parameters to be small to make the algorithm more stable. Interestingly, the line search does not significantly improve SGDs, unlike in the previous experiment. A close inspection of the line search shows that the initial step sizes are chosen to be slightly smaller (more conservative) than the tuned step size.

4.5. Training autoencoders with GPUs

The idea of using GPUs for training deep learning algorithms was first proposed by Raina et al. (2009). In this section, we consider GPUs and carry out experiments with standard autoencoders to understand how different optimization algorithms perform.

Using the same experimental protocols described in Section 4.3, we compared the optimization methods and their gains in switching from CPUs to GPUs, and present the results in Figure 4. From the figure, the speed up gains are much higher for L-BFGS and CG than for SGDs. This is because L-BFGS and CG prefer larger minibatch sizes, which can be parallelized more efficiently on GPUs.

Figure 4. Autoencoder training with 10000 units, ρ = 0.1, one machine with GPUs. L-BFGS and CG enjoy a speed up of approximately 2x, while no significant improvement is observed for plain SGDs.

4.6. Parallel training of dense networks

In this experiment, we explore optimization methods for training autoencoders in a distributed fashion using the Map-Reduce framework (Chu et al., 2007) [4]. We also used the same settings for all algorithms as mentioned above in Section 4.3.

[4] In detail, for the parallelized methods, one central machine ("master") runs the optimization procedure while the slaves compute the objective values and gradients. At every step during optimization, the master sends the parameters to all slaves; the slaves then compute the objective function and gradient and send them back to the master.
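A minimal single-process sketch of this master/slave scheme is shown below, with a multiprocessing pool standing in for the cluster; the helper names, the toy objective and the sharding strategy are illustrative assumptions rather than details of the authors' Matlab/Java toolbox.

import numpy as np
from multiprocessing import Pool

def local_objective_and_grad(args):
    # Runs on a "slave": objective value and gradient on its local data shard.
    theta, shard = args
    # Toy quadratic objective; in practice this would be the network cost and
    # its gradient evaluated on the shard.
    diff = shard - theta
    return 0.5 * np.sum(diff ** 2), -np.sum(diff, axis=0)

def distributed_objective_and_grad(theta, shards, pool):
    # "Master" side: broadcast theta, then sum the per-shard objectives and gradients.
    results = pool.map(local_objective_and_grad, [(theta, s) for s in shards])
    total_obj = sum(obj for obj, _ in results)
    total_grad = np.sum([grad for _, grad in results], axis=0)
    return total_obj, total_grad  # handed to L-BFGS/CG exactly as in the single-machine case

if __name__ == "__main__":
    data = np.random.randn(8000, 10)
    shards = np.array_split(data, 4)  # one shard per slave
    with Pool(4) as pool:
        obj, grad = distributed_objective_and_grad(np.zeros(10), shards, pool)
        print(obj, grad.shape)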
Our results with training dense autoencoders (omitted due to lack of space) show that parallelizing densely connected networks in this manner can result in slower convergence than running the method on a standalone machine. This can be attributed to the communication costs involved in passing the models and gradients across the network: the parameter vectors have a size of 64 MB, which amounts to a considerable volume of network traffic since they are frequently communicated.

4.7. Parallel training of local networks

If we use tiled (locally connected) networks (Le et al., 2010), Map-Reduce style gradient computation can be used as an effective way for training. In tiled networks, the number of parameters is small and thus the cost of transferring the gradient across the network can often be smaller than the cost of computing it. Specifically, in this experiment, we constrain each hidden unit to connect to a small section of the image. Furthermore, we do not share any weights across hidden units (no weight tying constraints). We learn 20 feature maps, where each map consists of 441 filters, each of size 8x8.

The results presented in Figure 5 show that SGDs are slower when a computer cluster is used. On the other hand, thanks to its preference for a larger minibatch size, L-BFGS enjoys more significant speed improvements [5].

[5] In this experiment, we did not tune the minibatch size, i.e., when we have 4 slaves, the minibatch size per computer is divided by 4. We expect that tuning this minibatch size will improve the results when the number of computers goes up.

Figure 5. Parallel training of locally connected networks. With locally connected networks, the communication cost is reduced significantly. The inset figure shows the (L-BFGS) speed improvement as a function of the number of slaves. The speed up factor is measured by taking the amount of time required by each method to reach a test objective equal to or better than 2.

Also, the figure shows that L-BFGS enjoys an almost linear speed up for up to the 8 slave machines considered in the experiments when locally connected networks are used. On models where the number of parameters is small, L-BFGS's bookkeeping and communication costs are both small compared to the gradient computations (which are distributed across the machines).

4.8. Parallel training of supervised CNNs

In this experiment, we compare different optimization methods for supervised training of two-layer convolutional neural networks (CNNs). Specifically, our model has 16 maps of 5x5 filters in the first layer, followed by (non-overlapping) pooling units that pool over a 3x3 region. The second layer has 16 maps of 4x4 filters, without any pooling units. Additionally, we have a softmax classification layer which is connected to all the output units from the second layer. In this experiment, we distribute the gradient computations across many machines with GPUs.

The experimental results (Figure 6) show that L-BFGS is better than CG and SGDs on this problem because of its low dimensionality (fewer than 10000 parameters). Map-Reduce style parallelism also significantly improves the performance of both L-BFGS and CG.

Figure 6. Parallel training of CNNs.

4.9. Classification on standard MNIST

Finally, we carried out experiments to determine whether L-BFGS affects classification accuracy. We used a convolutional network with a first layer having 32 maps of 5x5 filters and 3x3 pooling with subsampling. The second layer had 64 maps of 5x5 filters and 2x2 pooling with subsampling. This architecture is similar to that described in Ranzato et al. (2007), with the following two differences: (i) we did not use an additional hidden layer of 200 hidden units; (ii) the receptive field of our first layer pooling units is slightly larger (for computational reasons).
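For readers who want to reproduce the shape of this network, here is a minimal sketch of the architecture in PyTorch; this is our own re-implementation choice, not the authors' Matlab code, and the use of max pooling and sigmoid activations is an assumption since the text does not specify them. With 28x28 MNIST inputs, the 5x5 convolutions and the 3x3/2x2 pooling stages yield a 64 x 2 x 2 feature map that feeds the softmax layer.

import torch
import torch.nn as nn

class MnistConvNet(nn.Module):
    # Sketch of the two-layer CNN described above (layer shapes only).
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5),   # 32 maps of 5x5 filters: 28x28 -> 24x24
            nn.Sigmoid(),                      # activation choice is an assumption
            nn.MaxPool2d(kernel_size=3),       # 3x3 pooling with subsampling: 24x24 -> 8x8
            nn.Conv2d(32, 64, kernel_size=5),  # 64 maps of 5x5 filters: 8x8 -> 4x4
            nn.Sigmoid(),
            nn.MaxPool2d(kernel_size=2),       # 2x2 pooling with subsampling: 4x4 -> 2x2
        )
        self.classifier = nn.Linear(64 * 2 * 2, num_classes)  # softmax layer (via cross-entropy loss)

    def forward(self, x):
        h = self.features(x)
        return self.classifier(h.flatten(start_dim=1))

# Example: a batch of four 28x28 images produces four 10-way score vectors.
logits = MnistConvNet()(torch.zeros(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])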
Table 1. Classification error on the MNIST test set for some representative methods without pretraining. SGDs with diagonal Levenberg-Marquardt are used in (LeCun et al., 1998; Ranzato et al., 2007).

LeNet-5, SGDs, no distortions (LeCun et al., 1998)      0.95%
LeNet-5, SGDs, huge distortions (LeCun et al., 1998)    0.85%
LeNet-5, SGDs, distortions (LeCun et al., 1998)         0.80%
ConvNet, SGDs, no distortions (Ranzato et al., 2007)    0.89%
ConvNet, L-BFGS, no distortions (this paper)            0.69%

We trained our network using 4 machines (with GPUs). For every epoch, we saved the parameters to disk and used a hold-out validation set of 10000 examples [6] to select the best model. The best model is used to make predictions on the test set. The results of our method (ConvNet) using minibatch L-BFGS are reported in Table 1. The results show that the CNN, trained with L-BFGS, achieves an encouraging classification result: 0.69%. We note that this is the best result for MNIST among algorithms that do not use unsupervised pretraining or distortions. In particular, engineering distortions, typically viewed as a way to introduce domain knowledge, can improve classification results for MNIST. In fact, state-of-the-art results involve more careful distortion engineering and/or unsupervised pretraining, e.g., 0.4% (Simard et al., 2003), 0.53% (Jarrett et al., 2009), 0.39% (Ciresan et al., 2010).

[6] We used a reduced training set of 50000 examples throughout the classification experiments.

5. Discussion

In our experiments, different optimization algorithms appear to be superior on different problems. Contrary to what appears to be a widely-held belief, that SGDs are almost always preferred, we found that L-BFGS and CG can be superior to SGDs in many cases. Among the problems we considered, L-BFGS is a good candidate for low dimensional problems, where the number of parameters is relatively small (e.g., convolutional neural networks). For high dimensional problems, CG often does well.

Sparsity provides another compelling case for using L-BFGS/CG. In our experiments, L-BFGS and CG outperform SGDs on training sparse autoencoders.

We note that there are cases where L-BFGS may not be expected to perform well (e.g., if the Hessian is not well approximated with a low-rank estimate). For instance, on local networks (Le et al., 2010) where the overlaps between receptive fields are small, the Hessian has a block-diagonal structure and L-BFGS, which uses low-rank updates, may not perform well [7]. In such cases, algorithms that exploit the problem structure may perform much better.

[7] Personal communication with Will Zou.

CG and L-BFGS are also methods that can take better advantage of GPUs thanks to their preference for larger minibatch sizes. Furthermore, if one uses tiled (locally connected) networks or other networks with a relatively small number of parameters, it is possible to compute the gradients in a Map-Reduce framework and speed up training with L-BFGS.

Acknowledgments: We thank Andrew Maas, Andrew Saxe, Quinn Slack, Alex Smola and Will Zou for comments and discussions. This work is supported by the DARPA Deep Learning program under contract number FA8650-10-C-7020.

References

Bartlett, P., Hazan, E., and Rakhlin, A. Adaptive online gradient descent. In NIPS, 2008.

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layerwise training of deep networks. In NIPS, 2007.

Bordes, A., Bottou, L., and Gallinari, P. SGD-QN: Careful quasi-Newton stochastic gradient descent. JMLR, 2010.

Bottou, L. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes 91, 1991.

Bottou, L. and Bousquet, O. The tradeoffs of large scale learning. In NIPS, 2008.

Chu, C. T., Kim, S. K., Lin, Y. A., Yu, Y. Y., Bradski, G., Ng, A. Y., and Olukotun, K. Map-Reduce for machine learning on multicore. In NIPS 19, 2007.

Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.

Coates, A., Lee, H., and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In AISTATS 14, 2011.

Dean, J. and Ghemawat, S. Map-Reduce: simplified data processing on large clusters. Comm. ACM, 2008.

Do, C. B., Le, Q. V., and Foo, C. S. Proximal regularization for online and batch learning. In ICML, 2009.

Goodfellow, I., Le, Q. V., Saxe, A., Lee, H., and Ng, A. Y. Measuring invariances in deep networks. In NIPS, 2010.

Hinton, G. A practical guide to training restricted Boltzmann machines. Technical report, U. of Toronto, 2010.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.

Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 2006.

Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.

Larochelle, H. and Bengio, Y. Classification using discriminative restricted Boltzmann machines. In ICML, 2008.

Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P. W., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.

Le, Q. V., Zou, W., Yeung, S. Y., and Ng, A. Y. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient based learning applied to document recognition. Proceedings of the IEEE, 1998.

LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, 1998.

Lee, H., Battle, A., Raina, R., and Ng, A. Y. Efficient sparse coding algorithms. In NIPS, 2007.

Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief net model for visual area V2. In NIPS, 2008.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009a.

Lee, H., Largman, Y., Pham, P., and Ng, A. Y. Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS, 2009b.

Mann, G., McDonald, R., Mohri, M., Silberman, N., and Walker, D. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, 2009.

Martens, J. Deep learning via Hessian-free optimization. In ICML, 2010.

Nair, V. and Hinton, G. E. 3D object recognition with deep belief nets. In NIPS, 2009.

Olshausen, B. and Field, D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.

Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: Transfer learning from unlabelled data. In ICML, 2007.

Raina, R., Madhavan, A., and Ng, A. Y. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.

Ranzato, M., Huang, F. J., Boureau, Y., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.

Shalev-Shwartz, S., Singer, Y., and Srebro, N. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.

Simard, P., Steinkraus, D., and Platt, J. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, 2003.

Smolensky, P. Information processing in dynamical systems: foundations of harmony theory. In Parallel Distributed Processing, 1986.

Taylor, G. W., Fergus, R., LeCun, Y., and Bregler, C. Convolutional learning of spatio-temporal features. In ECCV, 2010.

Teo, C. H., Le, Q. V., Smola, A. J., and Vishwanathan, S. V. N. A scalable modular convex solver for regularized risk minimization. In KDD, 2007.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. A. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.

Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., and Murphy, K. P. Accelerated training of conditional random fields with stochastic gradient methods. In ICML, 2007.

Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.

Zinkevich, M., Weimer, M., Smola, A., and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.
