0% found this document useful (0 votes)
55 views10 pages

Training Neural Networks Without Gradients

Training Neural Networks Without Gradients

Uploaded by

sagarkurandwad
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
55 views10 pages

Training Neural Networks Without Gradients

Training Neural Networks Without Gradients

Uploaded by

sagarkurandwad
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

Training Neural Networks Without Gradients:

A Scalable ADMM Approach

Gavin Taylor1 TAYLOR @ USNA . EDU


Ryan Burmeister1
Zheng Xu2 XUZH @ CS . UMD . EDU
Bharat Singh2 BHARAT @ CS . UMD . EDU
Ankit Patel3 ABP 4@ RICE . EDU
arXiv:1605.02026v1 [cs.LG] 6 May 2016

Tom Goldstein2 TOMG @ CS . UMD . EDU


1
United States Naval Academy, Annapolis, MD USA
2
University of Maryland, College Park, MD USA
3
Rice University, Houston, TX USA

Abstract many parameters. Because big datasets provide results that


With the growing importance of large network (often dramatically) outperform the prior state-of-the-art in
models and enormous training datasets, GPUs many machine learning tasks, researchers are willing to
have become increasingly necessary to train neu- purchase specialized hardware such as GPUs, and commit
ral networks. This is largely because conven- large amounts of time to training models and tuning hyper-
tional optimization algorithms rely on stochastic parameters.
gradient methods that don’t scale well to large Gradient-based training methods have several properties
numbers of cores in a cluster setting. Further- that contribute to this need for specialized hardware. First,
more, the convergence of all gradient methods, while large amounts of data can be shared amongst many
including batch methods, suffers from common cores, existing optimization methods suffer when paral-
problems like saturation effects, poor condition- lelized. Second, training neural nets requires optimizing
ing, and saddle points. This paper explores an highly non-convex objectives that exhibit saddle points,
unconventional training method that uses alter- poor conditioning, and vanishing gradients, all of which
nating direction methods and Bregman iteration slow down gradient-based methods such as stochastic gra-
to train networks without gradient descent steps. dient descent, conjugate gradients, and BFGS. Several mit-
The proposed method reduces the network train- igating approaches to avoiding this issue have been intro-
ing problem to a sequence of minimization sub- duced, including rectified linear units (ReLU) (Nair & Hin-
steps that can each be solved globally in closed ton, 2010), Long Short-Term Memory networks (Hochre-
form. The proposed method is advantageous be- iter & Schmidhuber, 1997), RPROP (Riedmiller & Braun,
cause it avoids many of the caveats that make 1993), and others, but the fundamental problem remains.
gradient methods slow on highly non-convex
problems. The method exhibits strong scaling in In this paper, we introduce a new method for training the
the distributed setting, yielding linear speedups parameters of neural nets using the Alternating Direction
even when split over thousands of cores. Method of Multipliers (ADMM) and Bregman iteration.
This approach addresses several problems facing classi-
cal gradient methods; the proposed method exhibits linear
1. Introduction scaling when data is parallelized across cores, and is robust
to gradient saturation and poor conditioning. The method
As hardware and algorithms advance, neural network per- decomposes network training into a sequence of sub-steps
formance is constantly improving for many machine learn- that are each solved to global optimality. The scalability
ing tasks. This is particularly true in applications where of the proposed method, combined with the ability to avoid
extremely large datasets are available to train models with local minima by globally solving each substep, can lead to
dramatic speedups.
Proceedings of the 33 rd International Conference on Machine
Learning, New York, NY, USA, 2016. JMLR: W&CP volume We begin in Section 2 by describing the mathematical no-
48. Copyright 2016 by the author(s). tation and context, and providing a discussion of several
Scaling Neural Networks with ADMM

weaknesses of gradient-based methods that we hope to ad- number of training samples, and then a descent step is
dress. Sections 3 and 4 introduce and describe the opti- taken using this approximate gradient. Stochastic gradient
mization approach, and Sections 5 and 6 describe in detail methods work extremely well in the serial setting, but lack
the distributed implementation. Section 7 provides an ex- scalability. Recent attempts to scale SGD include Down-
perimental comparison of the new approach with standard pour, which runs SGD simultaneously on multiple cores.
implementations of several gradient-based methods on two This model averages parameters across cores using multi-
problems of differing size and difficulty. Finally, Section ple communication nodes that store copies of the model. A
8 contains a closing discussion of the paper’s contributions conceptually similar approach is elastic averaging (Zhang
and the future work needed. et al., 2015), in which different processors simultaneously
run SGD using a quadratic penalty term that prevents dif-
2. Background and notation ferent processes from drifting too far from the central av-
erage. These methods have found success with modest
Though there are many variations, a typical neural network numbers of processors, but fail to maintain strong scaling
consists of L layers, each of which is defined by a linear for large numbers of cores. For example, for several ex-
operator Wl , and a non-linear neural activation function periments reported in (Dean et al., 2012), the Downpour
hl . Given a (column) vector of input activations, al−1 , a distributed SGD method runs slower with 1500 cores than
single layer computes and outputs the non-linear function with 500 cores.
al = hl (Wl al−1 ). A network is formed by layering these
The scalability of SGD is limited because it relies on a large
units together in a nested fashion to compute a composite
number of inexpensive minimization steps that each use a
function; in the 3-layer case, for example, this would be
small amount of data. Forming a noisy gradient from a
f (a0 ; W ) = W3 (h2 (W2 h1 (W1 a0 ))) (1) small mini-batch requires very little computation. The low
cost of this step is an asset in the serial setting where it en-
where W = {Wl } denotes the ensemble of weight ma- ables the algorithm to move quickly, but disadvantageous
trices, and a0 contains input activations for every training in the parallel setting where each step is too inexpensive
sample (one sample per column). The function h3 is absent to be split over multiple processors. For this reason, SGD
as it is common for the last layer to not have an activation is ideally suited for computation on GPUs, where multi-
function. ple cores can simultaneously work on a small batch of data
using a shared memory space with virtually no communi-
Training the network is the task of tuning the weight ma-
cation overhead.
trices W to match the output activations aL to the targets
y, given the inputs a0 . Using a loss function `, the training When parallelizing over CPUs, it is preferable to have
problem can be posed as methods that use a small number of expensive minimiza-
tion steps, preferably involving a large number of data. The
min `(f (a0 ; W ), y) (2) work required on each minimization step can then be split
W
across many worker nodes, and the latency of communica-
Note that in our notation, we have included all input acti- tion is amortized over a large amount of computation. This
vations for all training data into the matrix/tensor a0 . This approach has been suggested by numerous authors who
notation benefits our discussion of the proposed algorithm, propose batch computation methods (Ngiam et al., 2011),
which operates on all training data simultaneously as a which compute exact gradients on each iteration using the
batch. entire dataset, including conjugate gradients (Towsey et al.,
Also, in our formulation the tensor W contains linear op- 1995; Møller, 1993), BFGS, and Hessian-free (Martens &
erators, but not necessarily dense matrices. These linear Sutskever, 2011; Sainath et al., 2013) methods.
operators can be convolutions with an ensemble of filters, Unfortunately, all gradient-based approaches, whether
in which case (1) represents a convolutional net. batched or stochastic, also suffer from several other criti-
Finally, the formulation used here assumes a feed-forward cal drawbacks. First, gradient-based methods suffer from
architecture. However, our proposed methods can handle the vanishing gradients. During backpropagation, the
more complex network topologies (such as recurrent net- derivative of shallow layers in a network are formed us-
works) with little modification. ing products of weight matrices and derivatives of non-
linearities from downstream layers. When the eigenvalues
2.1. What’s wrong with backprop? of the weight matrices are small and the derivatives of non-
linearities are nearly zero (as they often are for sigmoid and
Most networks are trained using stochastic gradient de- ReLU non-linearities), multiplication by these terms anni-
scent (SGD, i.e. backpropagation) in which the gradient hilates information. The resulting gradients in shallow lay-
of the network loss function is approximated using a small
Scaling Neural Networks with ADMM

ers contain little information about the error (Bengio et al., employs least-squares parameter updates that are conceptu-
1994; Riedmiller & Braun, 1993; Hochreiter & Schmidhu- ally similar to (but different from) the Parallel Weight Up-
ber, 1997). date proposed here (see Section 5). However, there is cur-
rently no implementation nor any training results to com-
Second, backprop has the potential to get stuck at local
pare against.
minima and saddle points. While recent results suggest
that local minimizers of SGD are close to global minima Note that our work is the first to consider alternating least
(Choromanska et al., 2014), in practice SGD often lingers squares as a method to distribute computation across a clus-
near saddle points where gradients are small (Dauphin ter, although the authors of (Carreira-Perpinán & Wang,
et al., 2014). 2012) do consider implementations that are “distributed”
in the sense of using multiple threads on a single machine
Finally, backprop does not easily parallelize over layers, a
via the Matlab matrix toolbox.
significant bottleneck when considering deep architectures.
However, recent work on SGD has successfully used model
parallelism by using multiple replicas of the entire network 3. Alternating minimization for neural
(Dean et al., 2012). networks
We propose a solution that helps alleviate these problems The idea behind our method is to decouple the weights
by separating the objective function at each layer of a neu- from the nonlinear link functions using a splitting tech-
ral network into two terms: one term measuring the relation nique. Rather than feeding the output of the linear opera-
between the weights and the input activations, and the other tor Wl directly into the activation function hl , we store the
term containing the nonlinear activation function. We then output of layer l in a new variable zl = Wl al−1 . We also
apply an alternating direction method that addresses each represent the output of the link function as a vector of ac-
term separately. The first term allows the weights to be tivations al = hl (zl ). We then wish to solve the following
updated without the effects of vanishing gradients. In the problem
second step, we have a non-convex minimization problem
that can be solved globally in closed-form. Also, the form minimize `(zL , y)
{Wl },{al },{zl }
of the objective allows the weights of every layer to be up- (3)
subject to zl = Wl al−1 , for l = 1, 2, · · · L
dated independently, enabling parallelization over layers.
al = hl (zl ), for l = 1, 2, · · · L − 1.
This approach does not require any gradient steps at all.
Rather, the problem of training network parameters is re- Observe that solving (3) is equivalent to solving (2). Rather
duced to a series of minimization sub-problems using the than try to solve (3) directly, we relax the constraints by
alternating direction methods of multipliers. These mini- adding an `2 penalty function to the objective and attack
mization sub-problems are solved globally in closed form. the unconstrained problem
minimize `(zL , y) + βL kzL − WL aL−1 k2
2.2. Related work {Wl },{al },{zl }
L−1
Other works have applied least-squares based methods to +
X 
γl kal − hl (zl )k2 + βl kzl − Wl al−1 k2 (4)

neural networks. One notable example is the method of
l=1
auxiliary coordinates (MAC) (Carreira-Perpinán & Wang,
2012) which uses quadratic penalties to approximately where {γl } and {βl } are constants that control the weight
enforce equality constraints. Unlike our method, MAC of each constraint. The formulation (4) only approximately
requires iterative solvers for sub-problems, whereas the enforces the constraints in (3). To obtain exact enforcement
method proposed here is designed so that all sub-problems of the constraints, we add a Lagrange multiplier term to (4),
have closed form solutions. Also unlike MAC, the method which yields
proposed here uses Lagrange multipliers to exactly enforce minimize `(zL , y) (5)
equality constraints, which we have found to be necessary {Wl },{al },{zl }
for training deeper networks. + hzL , λi + βL kzL − WL aL−1 k2
Another related approach is the expectation-maximization L−1
X
γl kal − hl (zl )k2 + βl kzl − Wl al−1 k2 .
 
(EM) algorithm of (Patel et al., 2015), which is derived +
from the Deep Rendering Model (DRM), a hierarchical l=1
generative model for natural images. They show that feed- where λ is a vector of Lagrange multipliers with the same
forward propagation in a deep convolutional net corre- dimensions as zL . Note that in a classical ADMM formu-
sponds to inference on their proposed DRM. They derive lation, a Lagrange multiplier would be added for each con-
a new EM learning algorithm for their proposed DRM that straint in (3). The formulation above corresponds more
Scaling Neural Networks with ADMM

Algorithm 1 ADMM for Neural Nets Activations update Minimization for al is a simple least-
Input: training features {a0 }, and labels {y}, squares problem similar to the weight update. However, in
Initialize: allocate {al }L=1 L this case the matrix al appears in two penalty terms in (4),
l=1 , {zl }l=1 , and λ
repeat and so we must minimize βl kzl+1 − Wl+1 al k + γl kal −
for l = 1, 2, · · · , L − 1 do hl (zl )k2 for al , holding all other variables fixed. The new
Wl ← zl a†l−1 value of al is given by
T
al←(βl+1 Wl+1 Wl+1 +γl I)−1 (βl+1 Wl+1
T
zl+1 +γl hl (zl )) T
(βl+1 Wl+1 Wl+1 + γl I)−1 (βl+1 Wl+1
T
zl+1 + γl hl (zl )) (6)
zl ← arg minz γl kal − hl (z)k2 + βl kzl − Wl al−1 k2
end for T
where Wl+1 is the adjoint (transpose) of Wl+1 .
WL ← zL a†L−1
zL ← arg minz `(z, y) + hzL , λi + βL kz − WL al−1 k2 Outputs update The update for zl requires minimizing
λ ← λ + βL (zL − WL aL−1 )
until converged min γl kal − hl (z)k2 + βl kz − Wl al−1 k2 . (7)
z

This problem is non-convex and non-quadratic (because


of the non-linear term h). Fortunately, because the non-
closely to Bregman iteration, which only requires a La- linearity h works entry-wise on its argument, the entries in
grange correction to be added to the objective term (and zl are de-coupled. Solving (7) is particularly easy when h
not the constraint terms), rather than classical ADMM. We is piecewise linear, as it can be solved in closed form; com-
have found the Bregman formulation to be far more stable mon piecewise linear choices for h include rectified lin-
than a full scale ADMM formulation. This issue will be ear units (ReLUs) and non-differentiable sigmoid functions
discussed in detail in Section 4. given by
The split formulation (4) is carefully designed to be easily 
1, if x ≥ 1
( 
minimized using an alternating direction method in which x, if x > 0
each sub-step has a simple closed-form solution. The al- hrelu (x) = , hsig (x) = x, if 0 < x < 1 .
0, otherwise 
0, otherwise

ternating direction scheme proceeds by updating one set of
variables at a time – either {Wl }, {al }, or {zl } – while
For such choices of h, the minimizer of (7) is easily com-
holding the others constant. The simplicity of the proposed
puted using simple if-then logic. For more sophistical
scheme comes from the following observation: The min-
choices of h, including smooth sigmoid curves, the prob-
imization of (4) with respect to both {Wl } and {al−1 } is
lem can be solved quickly with a lookup table of pre-
a simple linear least-squares problem. Only the minimiza-
computed solutions because each 1-dimensional problem
tion of (4) with respect {zl } is nonlinear. However, there is
only depends on two inputs.
no coupling between the entries of {zl }, and so the prob-
lem of minimizing for {zl } decomposes into solving a large
Lagrange multiplier update After minimizing for
number of one-dimensional problems, one for each entry
{Wl }, {al }, and {zl }, the Lagrange multiplier update is
in {zl }. Because each sub-problem has a simple form and
given simply by
only 1 variable, these problems can be solved globally in
closed form. λ ← λ + βL (zL − WL aL−1 ). (8)
The full alternating direction method is listed in Algo- We discuss this update further in Section 4.
rithm 1. We discuss the details below.

3.1. Minimization sub-steps 4. Lagrange multiplier updates via method of


multipliers and Bregman iteration
In this section, we consider the updates for each variable in
(5). The algorithm proceeds by minimizing for Wl , al , and The proposed method can be viewed as solving the con-
zl , and then updating the Lagrange multipliers λ. strained problem (3) using Bregman iteration, which is
closely related to ADMM. The convergence of Bregman
iteration is fairly well understood in the presence of linear
Weight update We first consider the minimization of (4) constraints (Yin et al., 2008). The convergence of ADMM
with respect to {Wl }. For each layer l, the optimal solution is fairly well understood for convex problems involving
minimizes kzl − Wl al−1 k2 . This is simply a least squares only two separate variable blocks (He & Yuan, 2015). Con-
problem, and the solution is given by Wl ← zl a†l−1 where vergence results also guarantee that a local minima is ob-
a†l−1 represents the pseudoinverse of the (rectangular) acti- tained for two-block non-convex objectives under certain
vation matrix al−1 . smoothness assumptions (Nocedal & Wright, 2006).
Scaling Neural Networks with ADMM

Because the proposed scheme involves more than two cou- 4.2. Interpretation as method of multipliers
pled variable blocks and a non-smooth penalty function,
In addition to the Bregman interpretation, the proposed
it lies outside the scope of known convergence results for
method can also be viewed as an approximation to the
ADMM. In fact, when ADMM is applied to (3) in a con-
method of multipliers, which solves constrained problems
ventional way using separate Lagrange multiplier vectors
of the form
for each constraint, the method is highly unstable because
of the de-stabilizing effect of a large number of coupled, min J(u) subject to Au = b (11)
non-smooth, non-convex terms. u

Fortunately, we will see below that the Bregman Lagrange for some convex function J and (possibly non-linear) op-
update method (13) does not involve any non-smooth con- erator A. In its most general form (which does not assume
straint terms, and the resulting method seems to be ex- linear constraints) the method proceeds using the iterative
tremely stable. updates
(
4.1. Bregman interpretation uk+1 ← min J(u) + hλk , A(u) − bi + β2 kA(u) − bk2
λk+1 ← λk + ∂u { β2 kA(u) − bk2 }
Bregman iteration (also known as the method of multipli-
ers) is a general framework for solving constrained opti-
where λk is a vector of Lagrange multipliers that is gen-
mization problems. Methods of this type have been used
erally initialized to zero, and β2 kA(u) − bk2 is a quadratic
extensively in the sparse optimization literature (Yin et al.,
penalty term. After each minimization sub-problem, the
2008). Consider the general problem of minimizing
gradient of the penalty term is added to the Lagrange mul-
min J(u) subject to Au = b (9) tipliers. When the operator A is linear, this update takes
u the form λk+1 ← λk + βAT (Au − b), which is the most
for some convex function J and linear operator A. Breg- common form of the method of multipliers.
man iteration repeatedly solves Just like in the Bregman case, we now let J(u) = `(zL , y),
and let A contain the constraints in (3). After a minimiza-
1
uk+1 ← min DJ (u, uk ) + kAu − bk2 (10) tion pass, we must update the Lagrange multiplier vector.
2 Assuming a good minimizer has been achieved, the deriva-
where p ∈ ∂J(uk ) is a (sub-)gradient of J at uk , and tive of (5) should be nearly zero. All variables except zL ap-
DJ (u, uk ) = J(u) − J(uk ) − hu − uk , pi is the so-called pear only in the quadratic penalty, and so these derivatives
Bregman distance. The iterative process (10) can be viewed should be negligibly small. The only major contributor to
as minimizing the objective J subject to an inexact penalty the gradient of the penalty term is zL , which appears in both
that approximately obtains Ax ≈ b, and then adding a lin- the loss function and the quadratic penalty. The gradient of
ear term to the objective to weaken it so that the quadratic the penalty term with respect to zL , is βL (zL − WL aL−1 ),
penalty becomes more influential on the next iteration. which is exactly the proposed multiplier update.

The Lagrange update described in Section 3 can be inter- When the objective is approximately minimized by al-
preted as performing Bregman iteration to solve the prob- ternately updating separate blocks of variables (as in
lem (3), where J(u) = `(zL , y), and A contains the con- the proposed method), this becomes an instance of the
straints in (3). On each iteration, the outputs zl are updated ADMM (Boyd et al., 2011).
immediately before the Lagrange step is taken, and so zl−1
satisfies the optimality condition 5. Distributed implementation using data
parallelism
0 ∈ ∂z `(zL , y) + βL (zL − WL aL−1 ) + λ.
The main advantage of the proposed alternating minimiza-
It follows that tion method is its high degree of scalability. In this section,
we explain how the method is distributed.
λ + βL (zL − WL aL−1 ) ∈ −∂z `(zL , y).
Consider distributing the algorithm across N worker nodes.
For this reason, the Lagrange update (13) can be inter- The ADMM method is scaled using a data parallelization
preted as updating the sub-gradient in the Bregman iterative strategy, in which different nodes store activations and out-
method for solving (3). The combination of the Bregman puts corresponding to different subsets of the training data.
iterative update with an alternating minimization strategy For each layer, the activation matrix is broken into columns
makes the proposed algorithm an instance of the split Breg- subsets as ai = (a1 , a2 , · · · , aN ). The output matrix zl and
man method (Goldstein & Osher, 2009). Lagrange multipliers λ decompose similarly.
Scaling Neural Networks with ADMM

The optimization sub-steps for updating {al } and {zl } worker n computing
do not require any communication and parallelize triv-
ially. The weight matrix update requires the computation of λn ← λn + βL (zL
n
− WL anL−1 ) (13)
pseudo-inverses and products involving the matrices {al } using only local data.
and {zl }. This can be done effectively using transpose re-
duction strategies that reduce the dimensionality of matri-
ces before they are transmitted to a central node. 6. Implementation details
Like many training methods for neural networks, the
Parallel Weight update The weight update has the form ADMM approach requires several tips and tricks to get
Wl ← zl a†l , where a†l represents the pseudoinverse of maximum performance. The convergence theory for the
the activation matrix al . This pseudoinverse can be writ- method of multipliers requires a good minimizer to be com-
ten a†l = aTl (al aTl )−1 . Using this expansion, the W update puted before updating the Lagrange multipliers. When the
decomposes across nodes as method is initialized with random starting values, the ini-
! !−1 tial iterates are generally far from optimal. For this reason,
N N
X X we frequently “warm start” the ADMM method by running
Wl ← zln (anl )T anl (anl )T .
several iterations without Lagrange multiplier updates.
n=1 n=1
The method potentially requires the user to choose a large
The individual products zln (anl )T and anl (anl )T are com- number of parameters {γi } and {βi }. We choose γi = 10
puted separately on each node, and then summed across and βi = 1 for all trials runs reported here, and we have
nodes using a single reduce operation. Note that the width found that this choice works reliably for a wide range of
of anl equals the number of training vectors that are stored problems and network architectures. Note that in the clas-
on node n, which is potentially very large for big data sets. sical ADMM method, convergence is guaranteed for any
When the number of features (the number of rows in anl ) is choice of the quadratic penalty parameters.
less than the number of training data (columns of anl ), we
can exploit transpose reduction when forming these prod- We use training data with binary class labels, in which each
ucts – the product anl (anl )T is much smaller than the matrix output entry aL is either 1 or 0. We use a separable loss
anl alone. This dramatically reduces the quantity of data function with a hinge penalty of the form
transmitted during the reduce operation. (
max{1 − z, 0}, when a = 1,
Once these products have been formed and reduced onto `(z, a) =
max{a, 0}, when a = 0.
a central server, the central node computes the inverse of
al aTl , updates Wl , and then broadcasts the result to the This loss function works well in practice, and yields min-
worker nodes. imization sub-problems that are easily solved in closed
form.
Parallel Activations update The update (6) trivially de-
Finally, our implementation simply initializes the activa-
composes across workers, with each worker computing
tion matrices {al } and output matrices {zl } using i.i.d
anl ← (βl+1 Wl+1
T
Wl+1 +γI)−1 (βl+1 Wl+1
T n
zl+1 +γl hl (zln )). Gaussian random variables. Because our method updates
the weights before anything else, the weight matrices do
Each server maintains a full representation of the entire not require any initialization. The results presented here
weight matrix, and can formulate its own local copy of the are using Gaussian random variables with unit variance,
T
matrix inverse (βl+1 Wl+1 Wl+1 + γI)−1 . and the results seem to be fairly insensitive to the variance
of this distribution. This seems to be because the output
Parallel Outputs update Like the activations update, the updates are solved to global optimality on each iteration.
update for zl trivially parallelizes and each worker node
solves 7. Experiments
min γl kanl − hl (zln )k2 + βl kzln − Wl anl−1 k2 . (12) In this section, we present experimental results that com-
zln
pare the performance of the ADMM method to other
Each worker node simply computes Wl anl−1 using local approaches, including SGD, conjugate gradients, and L-
data, and then updates each of the (decoupled) entries in BFGS on benchmark classification tasks. Comparisons are
zln by solving a 1-dimensional problem in closed form. made across multiple axes. First, we illustrate the scaling
of the approach, by varying the number of cores available
Parallel Lagrange multiplier update The Lagrange and clocking the compute time necessary to meet an accu-
multiplier update also trivially splits across nodes, with racy threshold on the test set of the problem. Second, we
Scaling Neural Networks with ADMM

800 1.00
700
Time to Benchmark Accuracy (s)
0.95
600 0.90

Test Set Accuracy


500
0.85
400
0.80
300
200 0.75
100 0.70
0 0.65 5 10 15 20 25
100 101 102 103
Number of Cores Time (s)
(a) Time required for ADMM to reach 95% test accuracy (b) Test set predictive accuracy as a function of time in sec-
vs number of cores. This problem was not large enough to onds for ADMM on 2,496 cores (blue), in addition to GPU
support parallelization over many cores, yet the advantages implementations of conjugate gradients (green), SGD (red),
of scaling are still apparent (note the x-axis has log scale). In and L-BFGS (cyan).
comparison, on the GPU, L-BFGS reached this threshold in
3.2 seconds, CG in 9.3 seconds, and SGD in 8.2 seconds.

Figure 1. Street View House Numbers (subsection 7.1)

show test set classification accuracy as a function of time to 7.1. SVHN


compare the rate of convergence of the optimization meth-
First, we focus on the problem posed by the SVHN dataset.
ods. Finally, we show these comparisons on two different
For this dataset, we optimized a net with two hidden layers
data sets, one small and relatively easy, and one large and
of 100 and 50 nodes and ReLU activation functions. This
difficult.
is an easy problem (test accuracy rises quickly) that does
The new ADMM approach was implemented in Python on not require a large volume of data and is easily handled
a Cray XC30 supercomputer with Ivy Bridge processors, by gradient-based methods on a GPU. However, Figure
and communication between cores performed via MPI. 1a demonstrates that ADMM exhibits linear scaling with
SGD, conjugate gradients, and L-BFGS are run as imple- cores. Even though the implementations of the gradient-
mented in the Torch optim package on NVIDIA Tesla K40 based methods enjoy communication via shared memory
GPUs. These methods underwent a thorough hyperparam- on the GPU while ADMM required CPU-to-CPU commu-
eter grid search to identify the algorithm parameters that nication, strong scaling allows ADMM on CPU cores to
produced the best results. In all cases, timings indicate compete with the gradient-based methods on a GPU.
only the time spent optimizing, excluding time spent load-
This is illustrated clearly in Figure 1b, which shows each
ing data and setting up the network.
method’s performance on the test set as a function of time.
Experiments were run on two datasets. The first is a subset With 1,024 compute cores, on an average of 10 runs,
of the Street View House Numbers (SVHN) dataset (Net- ADMM was able to meet the 95% test set accuracy thresh-
zer et al., 2011). Neural nets were constructed to classify old in 13.3 seconds. After an extensive hyperparameter
pictures of 0s from 2s using histogram of gradient (HOG) search to find the settings which resulted in the fastest con-
features of the original dataset. Using the “extra” dataset vergence, SGD converged on average in 28.3 seconds, L-
to train, this meant 120,290 training datapoints of 648 fea- BFGS in 3.3 seconds, and conjugate gradients in 10.1 sec-
tures each. The testing set contained 5,893 data points. onds. Though the small dataset kept ADMM from taking
full advantage of its scalability, it was nonetheless sufficient
The second dataset is the far more difficult Higgs
to allow it to be competitive with GPU implementations.
dataset (Baldi et al., 2014), consisting of a training set of
10,500,000 datapoints of 28 features each, with each data-
point labelled as either a signal process producing a Higgs 7.2. Higgs
boson or a background process which does not. The testing For the much larger and more difficult Higgs dataset, we
set consists of 500,000 datapoints. optimized a simple network with ReLU activation func-
tions and a hidden layer of 300 nodes, as suggested in
(Baldi et al., 2014). The graph illustrates the amount of
time required to optimize the network to a test set pre-
Scaling Neural Networks with ADMM

Time to Benchmark Accuracy (s) 350 0.65


300
250 0.60

Test Set Accuracy


200
0.55
150
100 0.50
50
0 1000 2000 3000 4000 5000 6000 7000 0.45 0
10 101 102 103
Number of Cores Time (s)
(a) Time required for ADMM to reach 64% test accuracy (b) Test set predictive accuracy as a function of time for
when parallelized over varying levels of cores. L-BFGS ADMM on 7200 cores (blue), conjugate gradients (green),
on a GPU required 181 seconds, and conjugate gradients re- and SGD (red). Note the x-axis is scaled logarithmically.
quired 44 minutes. SGD never reached 64% accuracy.
Figure 2. Higgs (subsection 7.2)

diction accuracy of 64%; this parameter was chosen as all 8.1. Looking forward
batch methods being tested reliably hit this accuracy bench-
The experiments shown here represent a fairly narrow
mark over numerous trials. As is clear from Figure 2a, par-
range of classification problems and are not meant to
allelizing over additional cores decreases the time required
demonstrate the absolute superiority of ADMM as a train-
dramatically, and again exhibits linear scaling.
ing method. Rather, this study is meant to be a proof of
In this much larger problem, the advantageous scaling al- concept demonstrating that the caveats of gradient-based
lowed ADMM to reach the 64% benchmark much faster methods can be avoided using alternative minimization
than the other approaches. Figure 2b illustrates this clearly, schemes. Future work will explore the behavior of alter-
with ADMM running on 7200 cores reaching this bench- nating direction methods in broader contexts.
mark in 7.8 seconds. In comparison, L-BFGS required 181
We are particularly interested in focusing future work
seconds, and conjugate gradients required 44 minutes.1 In
on recurrent nets and convolutional nets. Recurrent
seven hours of training, SGD never reached 64% accuracy
nets, which complicate standard gradient methods (Jaeger,
on the test set. These results suggest that, for large and
2002; Lukoševičius, 2012), pose no difficulty for ADMM
difficult problems, the strong linear scaling of ADMM en-
schemes whatsoever because they decouple layers using
ables it to leverage large numbers of cores to (dramatically)
auxiliary variables. Convolutional networks are also of
out-perform GPU implementations.
interest because ADMM can, in principle, handle them
very efficiently. When the linear operators {Wl } represent
8. Discussion & Conclusion convolutions rather than dense weight matrices, the least
squares problems that arise in the updates for {Wl } and
We present a method for training neural networks without
{al } can be solved efficiently using fast Fourier transforms.
using gradient steps. In addition to avoiding many diffi-
culties of gradient methods (like saturation and choice of Finally, there are avenues to explore to potentially im-
learning rates), performance of the proposed method scales prove convergence speed. These include adding momen-
linearly up to thousands of cores. This strong scaling en- tum terms to the weight updates and studying different ini-
ables the proposed approach to out-perform other methods tialization schemes, both of which are known to be impor-
on problems involving extremely large datasets. tant for gradient-based schemes (Sutskever et al., 2013).
1
It is worth noting that though L-BFGS required substantially
more time to reach 64% than did ADMM, it was the only method Acknowledgements
to produce a superior classifier, doing as well as 75% accuracy on
the test set. This work was supported by the National Science Founda-
tion (#1535902), the Office of Naval Research (#N00014-
15-1-2676 and #N0001415WX01341), and the DoD High
Performance Computing Center.
Scaling Neural Networks with ADMM

References Lukoševičius, Mantas. A practical guide to applying echo


state networks. In Neural Networks: Tricks of the Trade,
Baldi, Pierre, Sadowski, Peter, and Whiteson, Daniel.
pp. 659–686. Springer, 2012.
Searching for exotic particles in high-energy physics
with deep learning. Nature communications, 5, 2014.
Martens, James and Sutskever, Ilya. Learning recurrent
Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. neural networks with hessian-free optimization. In Pro-
Learning long-term dependencies with gradient descent ceedings of the 28th International Conference on Ma-
is difficult. Neural Networks, IEEE Transactions on, 5 chine Learning (ICML-11), pp. 1033–1040, 2011.
(2):157–166, 1994.
Møller, Martin Fodslette. A scaled conjugate gradient al-
Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and gorithm for fast supervised learning. Neural networks, 6
Eckstein, Jonathan. Distributed optimization and statisti- (4):525–533, 1993.
cal learning via the alternating direction method of mul-
tipliers. Foundations and Trends R in Machine Learn- Nair, Vinod and Hinton, Geoffrey E. Rectified linear units
ing, 3(1):1–122, 2011. improve restricted boltzmann machines. In Proceedings
of the 27th International Conference on Machine Learn-
Carreira-Perpinán, Miguel A and Wang, Weiran. Dis-
ing (ICML-10), pp. 807–814, 2010.
tributed optimization of deeply nested systems. arXiv
preprint arXiv:1212.5921, 2012.
Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco,
Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading dig-
Arous, Gérard Ben, and LeCun, Yann. The loss surfaces its in natural images with unsupervised feature learning.
of multilayer networks. arXiv preprint arXiv:1412.0233, In NIPS workshop on deep learning and unsupervised
2014. feature learning, volume 2011, pp. 4. Granada, Spain,
2011.
Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar,
Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Ngiam, Jiquan, Coates, Adam, Lahiri, Ahbik, Prochnow,
Identifying and attacking the saddle point problem in Bobby, Le, Quoc V, and Ng, Andrew Y. On optimization
high-dimensional non-convex optimization. In Advances methods for deep learning. In Proceedings of the 28th
in neural information processing systems, pp. 2933– International Conference on Machine Learning (ICML-
2941, 2014. 11), pp. 265–272, 2011.
Dean, Jeffrey, Corrado, Greg, Monga, Rajat, Chen, Kai,
Devin, Matthieu, Mao, Mark, Ranzato, Marc’aurelio, Nocedal, Jorge and Wright, Stephen. Numerical optimiza-
Senior, Andrew, Tucker, Paul, Yang, Ke, Le, Quoc V., tion. Springer Science & Business Media, 2006.
and Ng, Andrew Y. Large scale distributed deep net-
Patel, Ankit B, Nguyen, Tan, and Baraniuk, Richard. A
works. In Pereira, F., Burges, C.J.C., Bottou, L., and
probabilistic theory of deep learning. arXiv preprint
Weinberger, K.Q. (eds.), Advances in Neural Informa-
arXiv:1504.00641, 2015.
tion Processing Systems 25, pp. 1223–1231. Curran As-
sociates, Inc., 2012.
Riedmiller, Martin and Braun, Heinrich. A direct adaptive
Goldstein, Tom and Osher, Stanley. The split bregman method for faster backpropagation learning: The rprop
method for l1-regularized problems. SIAM Journal on algorithm. In Neural Networks, 1993., IEEE Interna-
Imaging Sciences, 2(2):323–343, 2009. tional Conference on, pp. 586–591. IEEE, 1993.

He, Bingsheng and Yuan, Xiaoming. On non-ergodic con- Sainath, Tara N, Horesh, Lior, Kingsbury, Brian, Aravkin,
vergence rate of douglas–rachford alternating direction Aleksandr Y, and Ramabhadran, Bhuvana. Accelerat-
method of multipliers. Numerische Mathematik, 130(3): ing hessian-free optimization for deep neural networks
567–577, 2015. by implicit preconditioning and sampling. In Automatic
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short- Speech Recognition and Understanding (ASRU), 2013
term memory. Neural computation, 9(8):1735–1780, IEEE Workshop on, pp. 303–308. IEEE, 2013.
1997.
Sutskever, Ilya, Martens, James, Dahl, George, and Hinton,
Jaeger, Herbert. Tutorial on training recurrent neural net- Geoffrey. On the importance of initialization and mo-
works, covering BPPT, RTRL, EKF and the” echo state mentum in deep learning. In Proceedings of the 30th in-
network” approach. GMD-Forschungszentrum Infor- ternational conference on machine learning (ICML-13),
mationstechnik, 2002. pp. 1139–1147, 2013.
Scaling Neural Networks with ADMM

Towsey, Michael, Alpsan, Dogan, and Sztriha, Laszlo.


Training a neural network with conjugate gradient meth-
ods. In Neural Networks, 1995. Proceedings., IEEE
International Conference on, volume 1, pp. 373–378.
IEEE, 1995.
Yin, Wotao, Osher, Stanley, Goldfarb, Donald, and Dar-
bon, Jerome. Bregman iterative algorithms for \ell 1-
minimization with applications to compressed sensing.
SIAM Journal on Imaging Sciences, 1(1):143–168, 2008.
Zhang, Sixin, Choromanska, Anna E, and LeCun, Yann.
Deep learning with elastic averaging sgd. In Advances
in Neural Information Processing Systems, pp. 685–693,
2015.

You might also like