Learning to Optimize Neural Nets
Ke Li 1 Jitendra Malik 1
¹University of California, Berkeley, CA 94720, United States. Correspondence to: Ke Li <[email protected]>.

Abstract

Learning to Optimize (Li & Malik, 2016) is a recently proposed framework for learning optimization algorithms using reinforcement learning. In this paper, we explore learning an optimization algorithm for training shallow neural nets. Such high-dimensional stochastic optimization problems present interesting challenges for existing reinforcement learning algorithms. We develop an extension that is suited to learning optimization algorithms in this setting and demonstrate that the learned optimization algorithm consistently outperforms other known optimization algorithms even on unseen tasks and is robust to changes in stochasticity of gradients and the neural net architecture. More specifically, we show that an optimization algorithm trained with the proposed method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on the Toronto Faces Dataset, CIFAR-10 and CIFAR-100.

1. Introduction

Machine learning is centred on the philosophy that learning patterns automatically from data is generally better than meticulously crafting rules by hand. This data-driven approach has delivered: today, machine learning techniques can be found in a wide range of application areas, both in AI and beyond. Yet, there is one domain that has conspicuously been left untouched by machine learning: the design of tools that power machine learning itself.

One of the most widely used tools in machine learning is optimization algorithms. We have grown accustomed to seeing an optimization algorithm as a black box that takes in a model that we design and the data that we collect and outputs the optimal model parameters. The optimization algorithm itself largely stays static: its design is reserved for human experts, who must toil through many rounds of theoretical analysis and empirical validation to devise a better optimization algorithm.

Recently, Li & Malik (2016) and Andrychowicz et al. (2016) introduced two different frameworks for learning optimization algorithms. Whereas Andrychowicz et al. (2016) focuses on learning an optimization algorithm for training models on a particular task, Li & Malik (2016) sets a more ambitious objective of learning an optimization algorithm for training models that is task-independent. We study the latter paradigm in this paper and develop a method for learning an optimization algorithm for high-dimensional stochastic optimization problems, like the problem of training shallow neural nets.

Under the "Learning to Optimize" framework proposed by Li & Malik (2016), the problem of learning an optimization algorithm is formulated as a reinforcement learning problem. We consider the general structure of an unconstrained continuous optimization algorithm, as shown in Algorithm 1. In each iteration, the algorithm takes a step ∆x and uses it to update the current iterate x^(i). In hand-engineered optimization algorithms, ∆x is computed using some fixed formula φ that depends on the objective function, the current iterate and past iterates. Often, it is simply a function of the current and past gradients.

Algorithm 1 General structure of optimization algorithms
Require: Objective function f
  x^(0) ← random point in the domain of f
  for i = 1, 2, . . . do
    ∆x ← φ(f, {x^(0), . . . , x^(i−1)})
    if stopping condition is met then
      return x^(i−1)
    end if
    x^(i) ← x^(i−1) + ∆x
  end for

Different choices of φ yield different optimization algorithms and so each optimization algorithm is essentially characterized by its update formula φ. Hence, by learning φ, we can learn an optimization algorithm. Li & Malik (2016) observed that an optimization algorithm can be viewed as a Markov decision process (MDP), where the state includes the current iterate, the action is the step vector ∆x and the policy is the update formula φ. Hence, the problem of learning φ simply reduces to a policy search problem.
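To make the structure in Algorithm 1 concrete, the following is a minimal Python sketch of the loop with a hand-engineered update formula φ plugged in. The function names, the fixed step size and the explicit gradient-oracle argument are illustrative assumptions rather than part of the framework; a learned optimizer would simply swap in a different phi.

import numpy as np

def optimize(f, grad_f, x0, phi, num_iters=100, tol=1e-8):
    """General structure of an optimization algorithm (cf. Algorithm 1).
    phi maps the objective, a gradient oracle and the history of iterates
    to a step vector delta_x."""
    history = [np.asarray(x0, dtype=float)]
    for _ in range(num_iters):
        delta_x = phi(f, grad_f, history)
        # Illustrative stopping condition: the proposed step is negligible.
        if np.linalg.norm(delta_x) < tol:
            break
        history.append(history[-1] + delta_x)
    return history[-1]

def gradient_descent_phi(f, grad_f, history, step_size=0.1):
    """A hand-engineered update formula: a fixed-step gradient step."""
    return -step_size * grad_f(history[-1])

# Sanity check on a simple quadratic f(x) = ||x||^2.
f = lambda x: float(np.dot(x, x))
grad_f = lambda x: 2.0 * x
x_star = optimize(f, grad_f, x0=np.ones(5), phi=gradient_descent_phi)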
In this paper, we build on the method proposed in (Li & Malik, 2016) and develop an extension that is suited to learning optimization algorithms for high-dimensional stochastic problems. We use it to learn an optimization algorithm for training shallow neural nets and show that it outperforms popular hand-engineered optimization algorithms like ADAM (Kingma & Ba, 2014), AdaGrad (Duchi et al., 2011) and RMSprop (Tieleman & Hinton, 2012) and an optimization algorithm learned using the supervised learning method proposed in (Andrychowicz et al., 2016). Furthermore, we demonstrate that our optimization algorithm learned from the experience of training on MNIST generalizes to training on other datasets that have very dissimilar statistics, like the Toronto Faces Dataset, CIFAR-10 and CIFAR-100.

2. Related Work

The line of work on learning optimization algorithms is fairly recent. Li & Malik (2016) and Andrychowicz et al. (2016) were the first to propose learning optimization algorithms. Li & Malik (2016) explored learning task-independent optimization algorithms for training various models and used reinforcement learning to learn the optimization algorithm, while Andrychowicz et al. (2016) investigated learning task-dependent optimization algorithms and used supervised learning.

More broadly related is work on "meta-learning" and "learning to learn" (Baxter et al., 1995; Vilalta & Drissi, 2002; Brazdil et al., 2008; Thrun & Pratt, 2012). Historically, these terms have been used by different authors to refer to approaches that aim to achieve different objectives, which essentially differ in the type of knowledge learned at the meta-level. Many methods try to learn commonalities shared across a family of related tasks; this line of work has blossomed into an area that has later become known as transfer learning and multi-task learning. Other methods (Brazdil et al., 2003; Schmidhuber, 2004) aim to learn how to choose a base-level learner that would perform the best on a given task. One challenge under this setting is to decide on a parameterization of the space of base-level learners that is both rich enough to be capable of representing disparate base-level learners and compact enough to permit tractable search over this space. Brazdil et al. (2003) proposes a nonparametric representation and stores examples of different base-level learners in a database, whereas Schmidhuber (2004) proposes representing base-level learners as general-purpose programs. The former has limited representation power, while the latter makes search and learning in the space of base-level learners intractable.

Work on learning optimization algorithms can be viewed as a different approach to learning to learn/meta-learning, where the process of training a base-level learner is represented as a sequence of updates to the learner's parameters.

Because the proposed method learns an algorithm, it is related to the line of work on program induction, the goal of which is to learn programs from examples of input and output. Various approaches have been explored: Cramer (1985) investigated evolving programs represented as abstract syntax trees using genetic algorithms; Liang et al. (2010) defines a hierarchical Bayesian prior over programs in a formal language and infers the best program using approximate sampling; and neural Turing machines (Graves et al., 2014) try to predict output examples from input examples directly. Hochreiter et al. (2001) observed that a recurrent neural net with a particular setting of weights that takes a training example as input at each time step can be viewed as an online learning algorithm, and learns an online learning algorithm by learning the weights of the recurrent neural net.

Work on hyperparameter optimization tries to design automatic ways of searching for the best hyperparameters for training a model on a particular task. Various kinds of methods have been proposed, such as those based on Bayesian optimization (Hutter et al., 2011; Bergstra et al., 2011; Snoek et al., 2012; Swersky et al., 2013; Feurer et al., 2015), random search (Bergstra & Bengio, 2012) and gradient-based optimization (Bengio, 2000; Domke, 2012; Maclaurin et al., 2015). Note that the discovered hyperparameters are generally specific to the task and the model, and hyperparameter optimization must be rerun for a new task or model.

More closely related are methods for online hyperparameter adaptation, which adjust particular types of hyperparameters automatically while performing optimization. Some rules for adjusting hyperparameters are manually designed (Bray et al., 2004), while others learn policies for adjusting the step size (Hansen, 2016; Daniel et al., 2016; Fu et al., 2016) or the damping factor in the Levenberg-Marquardt algorithm (Ruvolo et al., 2009). Unlike this line of work, this paper explores learning a general optimization algorithm, which may differ from existing optimization algorithms both in direction and size of the step taken in each iteration. A different line of work (Gregor & LeCun, 2010; Sprechmann et al., 2013) parameterizes intermediate operands of special-purpose solvers for a class of optimization problems that arise in sparse coding and learns them using supervised learning.
3. Learning to Optimize

3.1. Setting

In the "Learning to Optimize" framework, we are given a set of training objective functions f_1, . . . , f_n drawn from some distribution F. An optimization algorithm A takes an objective function f and an initial iterate x^(0) as input and produces a sequence of iterates x^(1), . . . , x^(T), where x^(T) is the solution found by the optimizer. We are also given a distribution D that generates the initial iterate x^(0) and a meta-loss L, which takes an objective function f and a sequence of iterates x^(1), . . . , x^(T) produced by an optimization algorithm as input and outputs a scalar that measures the quality of the iterates. The goal is to learn an optimization algorithm A* such that E_{f∼F, x^(0)∼D}[L(f, A*(f, x^(0)))] is minimized. The meta-loss is chosen to penalize optimization algorithms that exhibit behaviours we find undesirable, like slow convergence or excessive oscillations. Assuming we would like to learn an algorithm that minimizes the objective function it is given, a good choice of meta-loss would then simply be Σ_{i=1}^{T} f(x^(i)), which can be interpreted as the area under the curve of objective values over time.

The objective functions f_1, . . . , f_n may correspond to loss functions for training base-level learners, in which case the algorithm that learns the optimization algorithm can be viewed as a meta-learner. In this setting, each objective function is the loss function for training a particular base-learner on a particular task, and so the set of training objective functions can be loss functions for training a base-learner or a family of base-learners on different tasks. At test time, the learned optimization algorithm is evaluated on unseen objective functions, which correspond to loss functions for training base-learners on new tasks, which may be completely unrelated to tasks used for training the optimization algorithm. Therefore, the learned optimization algorithm must not learn anything about the tasks used for training. Instead, the goal is to learn an optimization algorithm that can exploit the geometric structure of the error surface induced by the base-learners. For example, if the base-level model is a neural net with ReLU activation units, the optimization algorithm should hopefully learn to leverage the piecewise linearity of the model. Hence, there is a clear division of responsibilities between the meta-learner and base-learners. The knowledge learned at the meta-level should be pertinent for all tasks, whereas the knowledge learned at the base-level should be task-specific. The meta-learner should therefore generalize across tasks, whereas the base-learner should generalize across instances.

3.2. RL Preliminaries and Formulation

The goal of reinforcement learning is to learn to interact with an environment in a way that minimizes cumulative costs that are expected to be incurred over time. The environment is formalized as a partially observable Markov decision process (POMDP)¹, which is defined by the tuple (S, O, A, p_i, p, p_o, c, T), where S ⊆ R^D is the set of states, O ⊆ R^{D'} is the set of observations, A ⊆ R^d is the set of actions, p_i(s_0) is the probability density over initial states s_0, p(s_{t+1} | s_t, a_t) is the probability density over the subsequent state s_{t+1} given the current state s_t and action a_t, p_o(o_t | s_t) is the probability density over the current observation o_t given the current state s_t, c : S → R is a function that assigns a cost to each state and T is the time horizon. Often, the probability densities p_i, p and p_o are not given explicitly, but can be accessed indirectly via sampling.

¹What is described is an undiscounted finite-horizon POMDP with continuous state, observation and action spaces.

In our setting, the state s_t consists of the current iterate x^(t) and features Φ(·) that depend on the history of iterates x^(1), . . . , x^(t), (possibly noisy) gradients ∇f̂(x^(1)), . . . , ∇f̂(x^(t)) and (possibly noisy) objective values f̂(x^(1)), . . . , f̂(x^(t)). The observation o_t excludes x^(t) and consists of possibly different features Ψ({x^(i), ∇f̂(x^(i)), f̂(x^(i))}_{i=0}^{t}). The action a_t is the step ∆x that will be used to update the iterate. The initial probability density p_i is defined implicitly in terms of D and is the density of the random variable (x^(0), Φ({x^(0), ∇f̂(x^(0)), f̂(x^(0))}))^T, where x^(0) ∼ D and f ∼ F. The transition probability density p is the density of the random variable (x^(t) + ∆x, Φ({x^(i), ∇f̂(x^(i)), f̂(x^(i))}_{i=0}^{t+1}))^T given (x^(t), Φ({x^(i), ∇f̂(x^(i)), f̂(x^(i))}_{i=0}^{t}))^T and ∆x, where f ∼ F. The observation probability density p_o is also defined implicitly and depends on the definition of Φ and Ψ. Assuming the goal is to learn an optimization algorithm that minimizes the objective function, the cost c of a state s_t = (x^(t), Φ(·))^T is simply the true objective value f(x^(t)).
A policy π(a_t | o_t, t) is a conditional probability density over actions a_t given the current observation o_t and time step t. When a policy is independent of t, it is known as a stationary policy. The goal of the reinforcement learning algorithm is to learn a policy π* that minimizes the total expected cost over time. More precisely,

\[
\pi^{*} = \arg\min_{\pi}\; \mathbb{E}_{s_0, a_0, s_1, \ldots, s_T}\!\left[\sum_{t=0}^{T} c(s_t)\right],
\]

where the expectation is taken with respect to the joint distribution over the sequence of states and actions, often referred to as a trajectory, which has the density

\[
q(s_0, a_0, s_1, \ldots, s_T) = p_i(s_0)\int_{o_0, \ldots, o_T} p_o(o_0 \mid s_0)\,\pi(a_0 \mid o_0, 0)\prod_{t=1}^{T-1} p(s_t \mid s_{t-1}, a_{t-1})\, p_o(o_t \mid s_t)\,\pi(a_t \mid o_t, t)\; p(s_T \mid s_{T-1}, a_{T-1})\, p_o(o_T \mid s_T)\; do_0 \cdots do_T .
\]

In our setting, any particular policy π(a_t | o_t, t), which generates a_t = ∆x given o_t = Ψ(·) at every time step, corresponds to a particular (noisy) update formula φ, and therefore a particular (noisy) optimization algorithm. In practice, stochasticity is only used for training the policy; when testing the policy, Σ_π(o_t) is typically set to zero, thereby making the optimization algorithm behave deterministically. Therefore, learning an optimization algorithm simply reduces to searching for the optimal policy.

3.3. Guided Policy Search

The reinforcement learning method we use is guided policy search (GPS) (Levine et al., 2015), which is a policy search method designed for searching over large classes of expressive non-linear policies in continuous state and action spaces. It maintains two policies, ψ and π, where the former lies in a time-varying linear policy class in which the optimal policy can be found in closed form, and the latter lies in a stationary non-linear policy class in which policy optimization is challenging. In each iteration, it performs policy optimization on ψ, and uses the resulting policy as supervision to train π.

More precisely, GPS solves the following constrained optimization problem:

\[
\min_{\theta, \eta}\; \mathbb{E}_{\psi}\!\left[\sum_{t=0}^{T} c(s_t)\right] \quad \text{s.t.} \quad \psi(a_t \mid s_t, t; \eta) = \pi(a_t \mid s_t; \theta) \;\;\forall a_t, s_t, t
\]

where η and θ denote the parameters of ψ and π respectively, E_ρ[·] denotes the expectation taken with respect to the trajectory induced by a policy ρ and π(a_t | s_t; θ) := ∫_{o_t} π(a_t | o_t; θ) p_o(o_t | s_t) do_t.²

²In practice, the explicit form of the observation probability p_o is usually not known or the integral may be intractable to compute. So, a linear Gaussian model is fitted to samples of s_t and a_t and used in place of the true π(a_t | s_t; θ) where necessary.

Since there are an infinite number of equality constraints, the problem is relaxed by enforcing equality on the mean actions taken by ψ and π at every time step³. So, the problem becomes:

\[
\min_{\theta, \eta}\; \mathbb{E}_{\psi}\!\left[\sum_{t=0}^{T} c(s_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{\psi}\left[a_t\right] = \mathbb{E}_{\psi}\!\left[\mathbb{E}_{\pi}\left[a_t \mid s_t\right]\right] \;\;\forall t
\]

This relaxed problem is solved by alternately updating η, θ and the dual variables λ_t, with Bregman divergence penalty terms weighted by ν_t, where D_t(θ, η) := E_ψ[D_KL(π(a_t | s_t; θ) ‖ ψ(a_t | s_t, t; η))] and D_t(η, θ) := E_ψ[D_KL(ψ(a_t | s_t, t; η) ‖ π(a_t | s_t; θ))].

³Though the Bregman divergence penalty is applied to the original probability distributions over a_t.

The algorithm assumes that ψ(a_t | s_t, t; η) = N(K_t s_t + k_t, G_t), where η := (K_t, k_t, G_t)_{t=1}^{T}, and π(a_t | o_t; θ) = N(μ_ω^π(o_t), Σ_π), where θ := (ω, Σ_π) and μ_ω^π(·) can be an arbitrary function that is typically modelled using a nonlinear function approximator like a neural net.

At the start of each iteration, the algorithm constructs a model of the transition probability density p̃(s_{t+1} | s_t, a_t, t; ζ) = N(A_t s_t + B_t a_t + c_t, F_t), where ζ := (A_t, B_t, c_t, F_t)_{t=1}^{T} is fitted to samples of s_t drawn from the trajectory induced by ψ, which essentially amounts to a local linearization of the true transition probability p(s_{t+1} | s_t, a_t, t). We will use Ẽ_ψ[·] to denote expectation taken with respect to the trajectory induced by ψ under the modelled transition probability p̃. Additionally, the algorithm fits local quadratic approximations to c(s_t) around samples of s_t drawn from the trajectory induced by ψ so that c(s_t) ≈ c̃(s_t) := ½ s_t^T C_t s_t + d_t^T s_t + h_t for s_t's that are near the samples.

With these assumptions, the subproblem that needs to be solved to update η = (K_t, k_t, G_t)_{t=1}^{T} becomes:

\[
\min_{\eta}\; \tilde{\mathbb{E}}_{\psi}\!\left[\sum_{t=0}^{T} \tilde{c}(s_t) - a_t^{T}\lambda_t + \nu_t D_t(\eta, \theta)\right]
\quad \text{s.t.} \quad
\sum_{t=0}^{T} \tilde{\mathbb{E}}_{\psi}\!\left[ D_{\mathrm{KL}}\!\left( \psi(a_t \mid s_t, t; \eta)\,\|\, \psi(a_t \mid s_t, t; \eta')\right)\right] \le \epsilon,
\]

where η' denotes the old η from the previous iteration. Because p̃ and c̃ are only valid locally around the trajectory induced by ψ, the constraint is added to limit the amount by which η is updated. It turns out that the unconstrained problem can be solved in closed form using a dynamic programming algorithm known as the linear-quadratic-Gaussian (LQG) regulator in time linear in the time horizon T and cubic in the dimensionality of the state space D. The constrained problem is solved using dual gradient descent, which uses LQG as a subroutine to solve for the primal variables in each iteration and increments the dual variable on the constraint until it is satisfied.
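The model-fitting step described above can be sketched as a per-time-step least-squares regression: given N rollouts sampled from the trajectory induced by ψ, fit p̃(s_{t+1} | s_t, a_t, t) = N(A_t s_t + B_t a_t + c_t, F_t) for each t. Real GPS implementations add priors and regularization to keep the fit well-conditioned with few samples; the version below is only a minimal sketch.

import numpy as np

def fit_linear_gaussian_dynamics(states, actions):
    """Fit a time-varying linear-Gaussian transition model.
    states:  (N, T+1, D) array of sampled states s_0, ..., s_T from N rollouts
    actions: (N, T, d) array of the corresponding actions a_0, ..., a_{T-1}
    Returns a list of (A_t, B_t, c_t, F_t), one tuple per time step."""
    N, _, D = states.shape
    d = actions.shape[2]
    params = []
    for t in range(actions.shape[1]):
        # Regress s_{t+1} on [s_t, a_t, 1] across the N sampled rollouts.
        X = np.hstack([states[:, t], actions[:, t], np.ones((N, 1))])
        Y = states[:, t + 1]
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)          # shape (D + d + 1, D)
        A_t, B_t, c_t = W[:D].T, W[D:D + d].T, W[-1]
        residuals = Y - X @ W
        F_t = np.cov(residuals, rowvar=False) + 1e-6 * np.eye(D)  # noise covariance
        params.append((A_t, B_t, c_t, F_t))
    return params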
Updating θ is straightforward, since expectations taken with respect to the trajectory induced by π are always conditioned on s_t and all outer expectations over s_t are taken with respect to the trajectory induced by ψ. Therefore, π is essentially decoupled from the transition probability p(s_{t+1} | s_t, a_t, t) and so its parameters can be updated without affecting the distribution of s_t's. The subproblem that needs to be solved to update θ therefore amounts to a standard supervised learning problem.

Since ψ(a_t | s_t, t; η) and π(a_t | s_t; θ) are Gaussian, D_t(θ, η) can be computed analytically. More concretely, if we assume Σ_π to be fixed for simplicity, the subproblem that is solved for updating θ = (ω, Σ_π) is:

\[
\min_{\theta}\; \mathbb{E}_{\psi}\!\left[\sum_{t=0}^{T} \lambda_t^{T}\mu_{\omega}^{\pi}(o_t) + \frac{\nu_t}{2}\operatorname{tr}\!\left(G_t^{-1}\Sigma_{\pi}\right) - \frac{\nu_t}{2}\log\left|\Sigma_{\pi}\right| + \frac{\nu_t}{2}\left(\mu_{\omega}^{\pi}(o_t) - \mathbb{E}_{\psi}\left[a_t \mid s_t, t\right]\right)^{T} G_t^{-1}\left(\mu_{\omega}^{\pi}(o_t) - \mathbb{E}_{\psi}\left[a_t \mid s_t, t\right]\right)\right]
\]

Note that the last term is the squared Mahalanobis distance between the mean actions of ψ and π at time step t, which is intuitive as we would like to encourage π to match ψ.

3.4. Convolutional GPS

The problem of learning high-dimensional optimization algorithms presents challenges for reinforcement learning algorithms due to high dimensionality of the state and action spaces. For example, in the case of GPS, because the running time of LQG is cubic in the dimensionality of the state space, performing policy search even in the simple class of linear-Gaussian policies would be prohibitively expensive when the dimensionality of the optimization problem is high.

Fortunately, many high-dimensional optimization problems have underlying structure that can be exploited. For example, the parameters of neural nets are equivalent up to permutation among certain coordinates. More concretely, for fully connected neural nets, the dimensions of a hidden layer and the corresponding weights can be permuted arbitrarily without changing the function they compute. Because permuting the dimensions of two adjacent layers can permute the weight matrix arbitrarily, an optimization algorithm should be invariant to permutations of the rows and columns of a weight matrix. A reasonable prior to impose is that the algorithm should behave in the same manner on all coordinates that correspond to entries in the same matrix. That is, if the values of two coordinates in all current and past gradients and iterates are identical, then the step vector produced by the algorithm should have identical values in these two coordinates. We will refer to the set of coordinates on which permutation invariance is enforced as a coordinate group. For the purposes of learning an optimization algorithm for neural nets, a natural choice would be to make each coordinate group correspond to a weight matrix or a bias vector. Hence, the total number of coordinate groups is twice the number of layers, which is usually fairly small.

In the case of GPS, we impose this prior on both ψ and π. For the purposes of updating η, we first impose a block-diagonal structure on the parameters A_t, B_t and F_t of the fitted transition probability density p̃(s_{t+1} | s_t, a_t, t; ζ) = N(A_t s_t + B_t a_t + c_t, F_t), so that for each coordinate in the optimization problem, the dimensions of s_{t+1} that correspond to the coordinate only depend on the dimensions of s_t and a_t that correspond to the same coordinate. As a result, p̃(s_{t+1} | s_t, a_t, t; ζ) decomposes into multiple independent probability densities p̃^j(s^j_{t+1} | s^j_t, a^j_t, t; ζ^j), one for each coordinate j. Similarly, we also impose a block-diagonal structure on C_t for fitting c̃(s_t) and on the parameter matrix of the fitted model for π(a_t | s_t; θ). Under these assumptions, K_t and G_t are guaranteed to be block-diagonal as well. Hence, the Bregman divergence penalty term D_t(η, θ) decomposes into a sum of Bregman divergence terms, one for each coordinate.

We then further constrain the dual variables λ_t, sub-vectors of parameter vectors and sub-matrices of parameter matrices corresponding to each coordinate group to be identical across the group. Additionally, we replace the weight ν_t on D_t(η, θ) with an individual weight on each Bregman divergence term for each coordinate group. The problem then decomposes into multiple independent subproblems, one for each coordinate group. Because the dimensionality of the state subspace corresponding to each coordinate is constant, LQG can be executed on each subproblem much more efficiently.

Similarly, for π, we choose a μ_ω^π(·) that shares parameters across different coordinates in the same group. We also impose a block-diagonal structure on Σ_π and constrain the appropriate sub-matrices to share their entries.
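To make the notion of a coordinate group concrete, the sketch below partitions the flattened parameter vector of a fully connected net into one group per weight matrix and one per bias vector (the natural choice described above), and applies a per-coordinate update rule that is shared within each group, so that coordinates belonging to the same matrix are treated identically. The helper names and the placeholder update rule are illustrative assumptions.

import numpy as np

def coordinate_groups(layer_sizes):
    """Partition the flattened parameters of a fully connected net into
    coordinate groups: one group per weight matrix and one per bias vector.
    E.g. layer_sizes = [48, 48, 10] gives 2 * (len(layer_sizes) - 1) = 4 groups.
    Returns a list of index arrays into the flattened parameter vector."""
    groups, offset = [], 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w_size, b_size = n_in * n_out, n_out
        groups.append(np.arange(offset, offset + w_size))                    # weight matrix
        groups.append(np.arange(offset + w_size, offset + w_size + b_size))  # bias vector
        offset += w_size + b_size
    return groups

def grouped_step(rules, groups, grad):
    """Compute a step vector by applying, element-wise, the update rule shared
    by each coordinate group -- the permutation-invariance prior above."""
    step = np.empty_like(grad)
    for rule, idx in zip(rules, groups):
        step[idx] = rule(grad[idx])
    return step

groups = coordinate_groups([48, 48, 10])
rules = [lambda g: -0.1 * g] * len(groups)   # placeholder per-group rules
step = grouped_step(rules, groups, np.random.randn(48 * 48 + 48 + 48 * 10 + 10))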
Figure 1. Comparison of the various hand-engineered and learned algorithms on training neural nets with 48 input and hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 64. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.

3.5. Features and Policy Model

We describe the features Φ(·) and Ψ(·) at time step t, which define the state s_t and observation o_t respectively.

Because of the stochasticity of gradients and objective values, the state features Φ(·) are defined in terms of summary statistics of the history of iterates {x^(i)}_{i=0}^{t}, gradients {∇f̂(x^(i))}_{i=0}^{t} and objective values {f̂(x^(i))}_{i=0}^{t}. We define the following statistics, which we will refer to as the average recent iterate, gradient and objective value respectively:

• $\bar{x}^{(i)} := \frac{1}{\min(i+1,3)} \sum_{j=\max(i-2,0)}^{i} x^{(j)}$

• $\overline{\nabla\hat{f}}(x^{(i)}) := \frac{1}{\min(i+1,3)} \sum_{j=\max(i-2,0)}^{i} \nabla\hat{f}(x^{(j)})$

• $\overline{\hat{f}}(x^{(i)}) := \frac{1}{\min(i+1,3)} \sum_{j=\max(i-2,0)}^{i} \hat{f}(x^{(j)})$

The state features Φ(·) consist of the relative change in the average recent objective value, the average recent gradient normalized by the magnitude of a previous average recent gradient and a previous change in average recent iterate relative to the current change in average recent iterate:

• $\left\{\left(\overline{\hat{f}}(x^{(t-5i)}) - \overline{\hat{f}}(x^{(t-5(i+1))})\right) / \overline{\hat{f}}(x^{(t-5(i+1))})\right\}_{i=0}^{24}$

• $\left\{\overline{\nabla\hat{f}}(x^{(t-5i)}) / \left(\left|\overline{\nabla\hat{f}}(x^{(\max(t-5(i+1),\,t \bmod 5))})\right| + 1\right)\right\}_{i=0}^{25}$

• $\left\{\left|\bar{x}^{(\max(t-5(i+1),\,t \bmod 5+5))} - \bar{x}^{(\max(t-5(i+2),\,t \bmod 5))}\right| / \left(\left|\bar{x}^{(t-5i)} - \bar{x}^{(t-5(i+1))}\right| + 0.1\right)\right\}_{i=0}^{24}$

Note that all operations are applied element-wise. Also, whenever a feature becomes undefined (i.e., when the time step index becomes negative), it is replaced with the all-zeros vector.

Unlike state features, which are only used when training the optimization algorithm, observation features Ψ(·) are used both during training and at test time. Consequently, we use noisier observation features, which depend directly on the noisy gradients and objective values. The observation features include the following:

• $\left(\hat{f}(x^{(t)}) - \hat{f}(x^{(t-1)})\right) / \hat{f}(x^{(t-1)})$

• $\nabla\hat{f}(x^{(t)}) / \left(\left|\nabla\hat{f}(x^{(\max(t-1,0))})\right| + 1\right)$

• $\left|x^{(\max(t-1,1))} - x^{(\max(t-2,0))}\right| / \left(\left|x^{(t)} - x^{(t-1)}\right| + 0.1\right)$

We use a recurrent neural net with a single layer of 128 LSTM (Hochreiter & Schmidhuber, 1997) cells to model the mean of the policy, μ_ω^π(o_t). As mentioned above, the policy operates on each coordinate individually and the parameters of the policy on all coordinates within a coordinate group are identical. Technically, because μ_ω^π(·) should only depend on o_t, Ψ(·) also includes the previous state of the LSTM cells.

4. Experiments

For clarity, we will refer to training of the optimization algorithm as "meta-training" to differentiate it from base-level training, which will simply be referred to as "training".

We meta-trained an optimization algorithm on a single objective function, which corresponds to the problem of training a two-layer neural net with 48 input units, 48 hidden units and 10 output units on a randomly projected and normalized version of the MNIST training set with dimensionality 48 and unit variance in each dimension. We used a time horizon of 400 iterations and a mini-batch size of 64 for computing stochastic gradients and objective values. We evaluate the optimization algorithm on its ability to generalize to unseen objective functions, which correspond to the problems of training neural nets on different tasks/datasets. We evaluate the learned optimization algorithm on three datasets, the Toronto Faces Dataset (TFD), CIFAR-10 and CIFAR-100. These datasets are chosen for their very different characteristics from MNIST and each other: TFD contains 3300 grayscale images that have relatively little variation and has seven different categories, whereas CIFAR-100 contains 50,000 colour images that have varied appearance and has 100 different categories.
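The meta-training objective above uses a randomly projected and normalized version of MNIST with dimensionality 48 and unit variance in each dimension. A minimal version of such preprocessing might look like the following; the specific projection distribution and the centering step are our assumptions, since the paper does not spell them out.

import numpy as np

def random_project_and_normalize(X, out_dim=48, seed=0):
    """Project data to `out_dim` dimensions with a random linear map and rescale
    each dimension to unit variance (an assumed stand-in for the preprocessing
    described in the experiments).
    X: (num_examples, num_features) array, e.g. flattened MNIST images."""
    rng = np.random.RandomState(seed)
    P = rng.randn(X.shape[1], out_dim) / np.sqrt(X.shape[1])  # random projection
    Z = X @ P
    Z = Z - Z.mean(axis=0)            # centering (assumption)
    Z = Z / (Z.std(axis=0) + 1e-8)    # unit variance in each dimension
    return Z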
Figure 2. Comparison of the various hand-engineered and learned algorithms on training neural nets with 100 input units and 200 hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 64. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.

Figure 3. Comparison of the various hand-engineered and learned algorithms on training neural nets with 48 input and hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 10. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.

All algorithms are tuned on the training objective function. For hand-engineered algorithms, this entails choosing the best hyperparameters; for learned algorithms, this entails meta-training on the objective function. We compare to seven hand-engineered algorithms: stochastic gradient descent, momentum, conjugate gradient, L-BFGS, ADAM, AdaGrad and RMSprop. In addition, we compare to an optimization algorithm meta-trained using the method described in (Andrychowicz et al., 2016) on the same training objective function (training a two-layer neural net on randomly projected and normalized MNIST) under the same setting (a time horizon of 400 iterations and a mini-batch size of 64).

First, we examine the performance of various optimization algorithms on similar objective functions. The optimization problems under consideration are those for training neural nets that have the same number of input and hidden units (48 and 48) as those used during meta-training. The number of output units varies with the number of categories in each dataset. We use the same mini-batch size as that used during meta-training. As shown in Figure 1, the optimization algorithm meta-trained using our method (which we will refer to as Predicted Step Descent) consistently descends to the optimum the fastest across all datasets. On the other hand, other algorithms are not as consistent and the relative ranking of other algorithms varies by dataset. This suggests that Predicted Step Descent has learned to be robust to variations in the data distributions, despite being trained on only one objective function, which is associated with a very specific data distribution that characterizes MNIST. It is also interesting to note that while the algorithm meta-trained using (Andrychowicz et al., 2016) (which we will refer to as L2LBGDBGD) performs well on CIFAR, it is unable to reach the optimum on TFD.

Next, we change the architecture of the neural nets and see if Predicted Step Descent generalizes to the new architecture. We increase the number of input units to 100 and the number of hidden units to 200, so that the number of parameters is roughly increased by a factor of 8. As shown in Figure 2, Predicted Step Descent consistently outperforms other algorithms on each dataset, despite having not been trained to optimize neural nets of this architecture. Interestingly, while it exhibited a bit of oscillation initially on TFD and CIFAR-10, it quickly recovered and overtook other algorithms, which is reminiscent of the phenomenon reported in (Li & Malik, 2016) for low-dimensional optimization problems.
Figure 4. Comparison of the various hand-engineered and learned algorithms on training neural nets with 100 input units and 200 hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 10. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.

Figure 5. Comparison of the various hand-engineered and learned algorithms on training neural nets with 100 input units and 200 hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 for 800 iterations with mini-batches of size 64. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.

This suggests that it has learned to detect when it is performing poorly and knows how to change tack accordingly. L2LBGDBGD experienced difficulties on TFD and CIFAR-10 as well, but slowly diverged.

We now investigate how robust Predicted Step Descent is to stochasticity of the gradients. To this end, we take a look at its performance when we reduce the mini-batch size from 64 to 10 on both the original architecture with 48 input and hidden units and the enlarged architecture with 100 input units and 200 hidden units. As shown in Figure 3, on the original architecture, Predicted Step Descent still outperforms all other algorithms and is able to handle the increased stochasticity fairly well. In contrast, conjugate gradient and L2LBGDBGD had some difficulty handling the increased stochasticity on TFD and, to a lesser extent, on CIFAR-10. In the former case, both diverged; in the latter case, both were progressing slowly towards the optimum.

On the enlarged architecture (Figure 4), Predicted Step Descent experienced some significant oscillations on TFD and CIFAR-10, but still managed to achieve a much better objective value than all the other algorithms. Many hand-engineered algorithms also experienced much greater oscillations than previously, suggesting that the optimization problems are inherently harder. L2LBGDBGD diverged fairly quickly on these two datasets.

Finally, we try doubling the number of iterations. As shown in Figure 5, despite being trained over a time horizon of 400 iterations, Predicted Step Descent behaves reasonably beyond the number of iterations it is trained for.

5. Conclusion

In this paper, we presented a new method for learning optimization algorithms for high-dimensional stochastic problems. We applied the method to learning an optimization algorithm for training shallow neural nets. We showed that the algorithm learned using our method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on unrelated tasks/datasets like the Toronto Faces Dataset, CIFAR-10 and CIFAR-100. We also demonstrated that the learned optimization algorithm is robust to changes in the stochasticity of gradients and the neural net architecture.