Learning to Optimize Neural Nets
Ke Li 1 Jitendra Malik 1
¹University of California, Berkeley, CA 94720, United States. Correspondence to: Ke Li <[email protected]>.

Abstract

Learning to Optimize (Li & Malik, 2016) is a recently proposed framework for learning optimization algorithms using reinforcement learning. In this paper, we explore learning an optimization algorithm for training shallow neural nets. Such high-dimensional stochastic optimization problems present interesting challenges for existing reinforcement learning algorithms. We develop an extension that is suited to learning optimization algorithms in this setting and demonstrate that the learned optimization algorithm consistently outperforms other known optimization algorithms even on unseen tasks and is robust to changes in stochasticity of gradients and the neural net architecture. More specifically, we show that an optimization algorithm trained with the proposed method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on the Toronto Faces Dataset, CIFAR-10 and CIFAR-100.

1. Introduction

Machine learning is centred on the philosophy that learning patterns automatically from data is generally better than meticulously crafting rules by hand. This data-driven approach has delivered: today, machine learning techniques can be found in a wide range of application areas, both in AI and beyond. Yet, there is one domain that has conspicuously been left untouched by machine learning: the design of tools that power machine learning itself.

One of the most widely used tools in machine learning is optimization algorithms. We have grown accustomed to seeing an optimization algorithm as a black box that takes in a model that we design and the data that we collect and outputs the optimal model parameters. The optimization algorithm itself largely stays static: its design is reserved for human experts, who must toil through many rounds of theoretical analysis and empirical validation to devise a better optimization algorithm.

Recently, Li & Malik (2016) and Andrychowicz et al. (2016) introduced two different frameworks for learning optimization algorithms. Whereas Andrychowicz et al. (2016) focuses on learning an optimization algorithm for training models on a particular task, Li & Malik (2016) sets a more ambitious objective of learning an optimization algorithm for training models that is task-independent. We study the latter paradigm in this paper and develop a method for learning an optimization algorithm for high-dimensional stochastic optimization problems, like the problem of training shallow neural nets.

Under the "Learning to Optimize" framework proposed by Li & Malik (2016), the problem of learning an optimization algorithm is formulated as a reinforcement learning problem. We consider the general structure of an unconstrained continuous optimization algorithm, as shown in Algorithm 1. In each iteration, the algorithm takes a step ∆x and uses it to update the current iterate x^(i). In hand-engineered optimization algorithms, ∆x is computed using some fixed formula φ that depends on the objective function, the current iterate and past iterates. Often, it is simply a function of the current and past gradients.

Algorithm 1 General structure of optimization algorithms
Require: Objective function f
  x^(0) ← random point in the domain of f
  for i = 1, 2, . . . do
    ∆x ← φ(f, {x^(0), . . . , x^(i−1)})
    if stopping condition is met then
      return x^(i−1)
    end if
    x^(i) ← x^(i−1) + ∆x
  end for

Different choices of φ yield different optimization algorithms and so each optimization algorithm is essentially characterized by its update formula φ. Hence, by learning φ, we can learn an optimization algorithm. Li & Malik (2016) observed that an optimization algorithm can be viewed as a Markov decision process (MDP), where the state includes the current iterate, the action is the step vector ∆x and the policy is the update formula φ. Hence, the problem of learning φ simply reduces to a policy search problem.
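To make the structure in Algorithm 1 concrete, the following is a minimal Python sketch of the loop with a hand-engineered update formula φ plugged in. The function names, the fixed step size and the explicit gradient-oracle argument are illustrative assumptions rather than part of the framework; a learned optimizer would simply swap in a different phi.

import numpy as np

def optimize(f, grad_f, x0, phi, num_iters=100, tol=1e-8):
    """General structure of an optimization algorithm (cf. Algorithm 1).
    phi maps the objective, a gradient oracle and the history of iterates
    to a step vector delta_x."""
    history = [np.asarray(x0, dtype=float)]
    for _ in range(num_iters):
        delta_x = phi(f, grad_f, history)
        # Illustrative stopping condition: the proposed step is negligible.
        if np.linalg.norm(delta_x) < tol:
            break
        history.append(history[-1] + delta_x)
    return history[-1]

def gradient_descent_phi(f, grad_f, history, step_size=0.1):
    """A hand-engineered update formula: a fixed-step gradient step."""
    return -step_size * grad_f(history[-1])

# Sanity check on a simple quadratic f(x) = ||x||^2.
f = lambda x: float(np.dot(x, x))
grad_f = lambda x: 2.0 * x
x_star = optimize(f, grad_f, x0=np.ones(5), phi=gradient_descent_phi)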
In this paper, we build on the method proposed in (Li & Malik, 2016) and develop an extension that is suited to learning optimization algorithms for high-dimensional stochastic problems. We use it to learn an optimization algorithm for training shallow neural nets and show that it outperforms popular hand-engineered optimization algorithms like ADAM (Kingma & Ba, 2014), AdaGrad (Duchi et al., 2011) and RMSprop (Tieleman & Hinton, 2012) and an optimization algorithm learned using the supervised learning method proposed in (Andrychowicz et al., 2016). Furthermore, we demonstrate that our optimization algorithm learned from the experience of training on MNIST generalizes to training on other datasets that have very dissimilar statistics, like the Toronto Faces Dataset, CIFAR-10 and CIFAR-100.

2. Related Work

The line of work on learning optimization algorithms is fairly recent. Li & Malik (2016) and Andrychowicz et al. (2016) were the first to propose learning optimization algorithms. Li & Malik (2016) explored learning task-independent optimization algorithms for training various models and used reinforcement learning to learn the optimization algorithm, while Andrychowicz et al. (2016) investigated learning task-dependent optimization algorithms and used supervised learning.

More broadly related is work on "meta-learning" and "learning to learn" (Baxter et al., 1995; Vilalta & Drissi, 2002; Brazdil et al., 2008; Thrun & Pratt, 2012). Historically, these terms have been used by different authors to refer to approaches that aim to achieve different objectives, which essentially differ in the type of knowledge learned at the meta-level. Many methods try to learn commonalities shared across a family of related tasks; this line of work has blossomed into an area that has later become known as transfer learning and multi-task learning. Other methods (Brazdil et al., 2003; Schmidhuber, 2004) aim to learn how to choose a base-level learner that would perform the best on a given task. One challenge under this setting is to decide on a parameterization of the space of base-level learners that is both rich enough to be capable of representing disparate base-level learners and compact enough to permit tractable search over this space. Brazdil et al. (2003) proposes a nonparametric representation and stores examples of different base-level learners in a database, whereas Schmidhuber (2004) proposes representing base-level learners as general-purpose programs. The former has limited representation power, while the latter makes search and learning in the space of base-level learners intractable.

Work on learning optimization algorithms can be viewed as a different approach to learning to learn/meta-learning, where the process of training a base-level learner is represented as a sequence of updates to the learner's parameters.

Because the proposed method learns an algorithm, it is related to the line of work on program induction, the goal of which is to learn programs from examples of input and output. Various approaches have been explored: Cramer (1985) investigated evolving programs represented as abstract syntax trees using genetic algorithms; Liang et al. (2010) defines a hierarchical Bayesian prior over programs in a formal language and infers the best program using approximate sampling; and neural Turing machines (Graves et al., 2014) try to predict output examples from input examples directly. Hochreiter et al. (2001) observed that a recurrent neural net with a particular setting of weights that takes a training example as input at each time step can be viewed as an online learning algorithm, and learns an online learning algorithm by learning the weights of the recurrent neural net.

Work on hyperparameter optimization tries to design automatic ways of searching for the best hyperparameters for training a model on a particular task. Various kinds of methods have been proposed, such as those based on Bayesian optimization (Hutter et al., 2011; Bergstra et al., 2011; Snoek et al., 2012; Swersky et al., 2013; Feurer et al., 2015), random search (Bergstra & Bengio, 2012) and gradient-based optimization (Bengio, 2000; Domke, 2012; Maclaurin et al., 2015). Note that the discovered hyperparameters are generally specific to the task and the model, and hyperparameter optimization must be rerun for a new task or model.

More closely related are methods for online hyperparameter adaptation, which adjust particular types of hyperparameters automatically while performing optimization. Some rules for adjusting hyperparameters are manually designed (Bray et al., 2004), while others learn policies for adjusting the step size (Hansen, 2016; Daniel et al., 2016; Fu et al., 2016) or the damping factor in the Levenberg-Marquardt algorithm (Ruvolo et al., 2009). Unlike this line of work, this paper explores learning a general optimization algorithm, which may differ from existing optimization algorithms both in direction and size of the step taken in each iteration. A different line of work (Gregor & LeCun, 2010; Sprechmann et al., 2013) parameterizes intermediate operands of special-purpose solvers for a class of optimization problems that arise in sparse coding and learns them using supervised learning.
3. Learning to Optimize

3.1. Setting

In the "Learning to Optimize" framework, we are given a set of training objective functions f_1, . . . , f_n drawn from some distribution F. An optimization algorithm A takes an objective function f and an initial iterate x^(0) as input and produces a sequence of iterates x^(1), . . . , x^(T), where x^(T) is the solution found by the optimizer. We are also given a distribution D that generates the initial iterate x^(0) and a meta-loss L, which takes an objective function f and a sequence of iterates x^(1), . . . , x^(T) produced by an optimization algorithm as input and outputs a scalar that measures the quality of the iterates. The goal is to learn an optimization algorithm A* such that E_{f∼F, x^(0)∼D}[L(f, A*(f, x^(0)))] is minimized. The meta-loss is chosen to penalize optimization algorithms that exhibit behaviours we find undesirable, like slow convergence or excessive oscillations. Assuming we would like to learn an algorithm that minimizes the objective function it is given, a good choice of meta-loss would then simply be Σ_{i=1}^{T} f(x^(i)), which can be interpreted as the area under the curve of objective values over time.

The objective functions f_1, . . . , f_n may correspond to loss functions for training base-level learners, in which case the algorithm that learns the optimization algorithm can be viewed as a meta-learner. In this setting, each objective function is the loss function for training a particular base-learner on a particular task, and so the set of training objective functions can be loss functions for training a base-learner or a family of base-learners on different tasks. At test time, the learned optimization algorithm is evaluated on unseen objective functions, which correspond to loss functions for training base-learners on new tasks, which may be completely unrelated to tasks used for training the optimization algorithm. Therefore, the learned optimization algorithm must not learn anything about the tasks used for training. Instead, the goal is to learn an optimization algorithm that can exploit the geometric structure of the error surface induced by the base-learners. For example, if the base-level model is a neural net with ReLU activation units, the optimization algorithm should hopefully learn to leverage the piecewise linearity of the model. Hence, there is a clear division of responsibilities between the meta-learner and base-learners. The knowledge learned at the meta-level should be pertinent for all tasks, whereas the knowledge learned at the base-level should be task-specific. The meta-learner should therefore generalize across tasks, whereas the base-learner should generalize across instances.

3.2. RL Preliminaries and Formulation

The goal of reinforcement learning is to learn to interact with an environment in a way that minimizes cumulative costs that are expected to be incurred over time. The environment is formalized as a partially observable Markov decision process (POMDP)¹, which is defined by the tuple (S, O, A, p_i, p, p_o, c, T), where S ⊆ R^D is the set of states, O ⊆ R^{D'} is the set of observations, A ⊆ R^d is the set of actions, p_i(s_0) is the probability density over initial states s_0, p(s_{t+1} | s_t, a_t) is the probability density over the subsequent state s_{t+1} given the current state s_t and action a_t, p_o(o_t | s_t) is the probability density over the current observation o_t given the current state s_t, c : S → R is a function that assigns a cost to each state and T is the time horizon. Often, the probability densities p_i, p and p_o are not given explicitly, but can be accessed indirectly via sampling.

¹What is described is an undiscounted finite-horizon POMDP with continuous state, observation and action spaces.

In our setting, the state s_t consists of the current iterate x^(t) and features Φ(·) that depend on the history of iterates x^(1), . . . , x^(t), (possibly noisy) gradients ∇f̂(x^(1)), . . . , ∇f̂(x^(t)) and (possibly noisy) objective values f̂(x^(1)), . . . , f̂(x^(t)). The observation o_t excludes x^(t) and consists of possibly different features Ψ({x^(i), ∇f̂(x^(i)), f̂(x^(i))}_{i=0}^{t}). The action a_t is the step ∆x that will be used to update the iterate. The initial probability density p_i is defined implicitly in terms of D and is the density of the random variable (x^(0), Φ({x^(0), ∇f̂(x^(0)), f̂(x^(0))}))^T, where x^(0) ∼ D and f ∼ F. The transition probability density p is the density of the random variable (x^(t) + ∆x, Φ({x^(i), ∇f̂(x^(i)), f̂(x^(i))}_{i=0}^{t+1}))^T given (x^(t), Φ({x^(i), ∇f̂(x^(i)), f̂(x^(i))}_{i=0}^{t}))^T and ∆x, where f ∼ F. The observation probability density p_o is also defined implicitly and depends on the definition of Φ and Ψ. Assuming the goal is to learn an optimization algorithm that minimizes the objective function, the cost c of a state s_t = (x^(t), Φ(·))^T is simply the true objective value f(x^(t)).
A policy π(a_t | o_t, t) is a conditional probability density over actions a_t given the current observation o_t and time step t. When a policy is independent of t, it is known as a stationary policy. The goal of the reinforcement learning algorithm is to learn a policy π* that minimizes the total expected cost over time. More precisely,

\[
\pi^{*} = \arg\min_{\pi}\; \mathbb{E}_{s_0, a_0, s_1, \ldots, s_T}\!\left[\sum_{t=0}^{T} c(s_t)\right],
\]

where the expectation is taken with respect to the joint distribution over the sequence of states and actions, often referred to as a trajectory, which has the density

\[
q(s_0, a_0, s_1, \ldots, s_T) = p_i(s_0)\int_{o_0, \ldots, o_T} p_o(o_0 \mid s_0)\,\pi(a_0 \mid o_0, 0)\prod_{t=1}^{T-1} p(s_t \mid s_{t-1}, a_{t-1})\, p_o(o_t \mid s_t)\,\pi(a_t \mid o_t, t)\; p(s_T \mid s_{T-1}, a_{T-1})\, p_o(o_T \mid s_T)\; do_0 \cdots do_T .
\]

In our setting, any particular policy π(a_t | o_t, t), which generates a_t = ∆x given o_t = Ψ(·) at every time step, corresponds to a particular (noisy) update formula φ, and therefore a particular (noisy) optimization algorithm. In practice, stochasticity is only used for training the policy; when testing the policy, Σ_π(o_t) is typically set to zero, thereby making the optimization algorithm behave deterministically. Therefore, learning an optimization algorithm simply reduces to searching for the optimal policy.

3.3. Guided Policy Search

The reinforcement learning method we use is guided policy search (GPS) (Levine et al., 2015), which is a policy search method designed for searching over large classes of expressive non-linear policies in continuous state and action spaces. It maintains two policies, ψ and π, where the former lies in a time-varying linear policy class in which the optimal policy can be found in closed form, and the latter lies in a stationary non-linear policy class in which policy optimization is challenging. In each iteration, it performs policy optimization on ψ, and uses the resulting policy as supervision to train π.

More precisely, GPS solves the following constrained optimization problem:

\[
\min_{\theta, \eta}\; \mathbb{E}_{\psi}\!\left[\sum_{t=0}^{T} c(s_t)\right] \quad \text{s.t.} \quad \psi(a_t \mid s_t, t; \eta) = \pi(a_t \mid s_t; \theta) \;\;\forall a_t, s_t, t
\]

where η and θ denote the parameters of ψ and π respectively, E_ρ[·] denotes the expectation taken with respect to the trajectory induced by a policy ρ and π(a_t | s_t; θ) := ∫_{o_t} π(a_t | o_t; θ) p_o(o_t | s_t) do_t.²

²In practice, the explicit form of the observation probability p_o is usually not known or the integral may be intractable to compute. So, a linear Gaussian model is fitted to samples of s_t and a_t and used in place of the true π(a_t | s_t; θ) where necessary.

Since there are an infinite number of equality constraints, the problem is relaxed by enforcing equality on the mean actions taken by ψ and π at every time step³. So, the problem becomes:

\[
\min_{\theta, \eta}\; \mathbb{E}_{\psi}\!\left[\sum_{t=0}^{T} c(s_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{\psi}\left[a_t\right] = \mathbb{E}_{\psi}\!\left[\mathbb{E}_{\pi}\left[a_t \mid s_t\right]\right] \;\;\forall t
\]

This relaxed problem is solved by alternately updating η, θ and the dual variables λ_t, with Bregman divergence penalty terms weighted by ν_t, where D_t(θ, η) := E_ψ[D_KL(π(a_t | s_t; θ) ‖ ψ(a_t | s_t, t; η))] and D_t(η, θ) := E_ψ[D_KL(ψ(a_t | s_t, t; η) ‖ π(a_t | s_t; θ))].

³Though the Bregman divergence penalty is applied to the original probability distributions over a_t.

The algorithm assumes that ψ(a_t | s_t, t; η) = N(K_t s_t + k_t, G_t), where η := (K_t, k_t, G_t)_{t=1}^{T}, and π(a_t | o_t; θ) = N(μ_ω^π(o_t), Σ_π), where θ := (ω, Σ_π) and μ_ω^π(·) can be an arbitrary function that is typically modelled using a nonlinear function approximator like a neural net.

At the start of each iteration, the algorithm constructs a model of the transition probability density p̃(s_{t+1} | s_t, a_t, t; ζ) = N(A_t s_t + B_t a_t + c_t, F_t), where ζ := (A_t, B_t, c_t, F_t)_{t=1}^{T} is fitted to samples of s_t drawn from the trajectory induced by ψ, which essentially amounts to a local linearization of the true transition probability p(s_{t+1} | s_t, a_t, t). We will use Ẽ_ψ[·] to denote expectation taken with respect to the trajectory induced by ψ under the modelled transition probability p̃. Additionally, the algorithm fits local quadratic approximations to c(s_t) around samples of s_t drawn from the trajectory induced by ψ so that c(s_t) ≈ c̃(s_t) := ½ s_t^T C_t s_t + d_t^T s_t + h_t for s_t's that are near the samples.

With these assumptions, the subproblem that needs to be solved to update η = (K_t, k_t, G_t)_{t=1}^{T} becomes:

\[
\min_{\eta}\; \tilde{\mathbb{E}}_{\psi}\!\left[\sum_{t=0}^{T} \tilde{c}(s_t) - a_t^{T}\lambda_t + \nu_t D_t(\eta, \theta)\right]
\quad \text{s.t.} \quad
\sum_{t=0}^{T} \tilde{\mathbb{E}}_{\psi}\!\left[ D_{\mathrm{KL}}\!\left( \psi(a_t \mid s_t, t; \eta)\,\|\, \psi(a_t \mid s_t, t; \eta')\right)\right] \le \epsilon,
\]

where η' denotes the old η from the previous iteration. Because p̃ and c̃ are only valid locally around the trajectory induced by ψ, the constraint is added to limit the amount by which η is updated. It turns out that the unconstrained problem can be solved in closed form using a dynamic programming algorithm known as the linear-quadratic-Gaussian (LQG) regulator in time linear in the time horizon T and cubic in the dimensionality of the state space D. The constrained problem is solved using dual gradient descent, which uses LQG as a subroutine to solve for the primal variables in each iteration and increments the dual variable on the constraint until it is satisfied.
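The model-fitting step described above can be sketched as a per-time-step least-squares regression: given N rollouts sampled from the trajectory induced by ψ, fit p̃(s_{t+1} | s_t, a_t, t) = N(A_t s_t + B_t a_t + c_t, F_t) for each t. Real GPS implementations add priors and regularization to keep the fit well-conditioned with few samples; the version below is only a minimal sketch.

import numpy as np

def fit_linear_gaussian_dynamics(states, actions):
    """Fit a time-varying linear-Gaussian transition model.
    states:  (N, T+1, D) array of sampled states s_0, ..., s_T from N rollouts
    actions: (N, T, d) array of the corresponding actions a_0, ..., a_{T-1}
    Returns a list of (A_t, B_t, c_t, F_t), one tuple per time step."""
    N, _, D = states.shape
    d = actions.shape[2]
    params = []
    for t in range(actions.shape[1]):
        # Regress s_{t+1} on [s_t, a_t, 1] across the N sampled rollouts.
        X = np.hstack([states[:, t], actions[:, t], np.ones((N, 1))])
        Y = states[:, t + 1]
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)          # shape (D + d + 1, D)
        A_t, B_t, c_t = W[:D].T, W[D:D + d].T, W[-1]
        residuals = Y - X @ W
        F_t = np.cov(residuals, rowvar=False) + 1e-6 * np.eye(D)  # noise covariance
        params.append((A_t, B_t, c_t, F_t))
    return params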
Updating θ is straightforward, since expectations taken with respect to the trajectory induced by π are always conditioned on s_t and all outer expectations over s_t are taken with respect to the trajectory induced by ψ. Therefore, π is essentially decoupled from the transition probability p(s_{t+1} | s_t, a_t, t) and so its parameters can be updated without affecting the distribution of s_t's. The subproblem that needs to be solved to update θ therefore amounts to a standard supervised learning problem.

Since ψ(a_t | s_t, t; η) and π(a_t | s_t; θ) are Gaussian, D_t(θ, η) can be computed analytically. More concretely, if we assume Σ_π to be fixed for simplicity, the subproblem that is solved for updating θ = (ω, Σ_π) is:

\[
\min_{\theta}\; \mathbb{E}_{\psi}\!\left[\sum_{t=0}^{T} \lambda_t^{T}\mu_{\omega}^{\pi}(o_t) + \frac{\nu_t}{2}\operatorname{tr}\!\left(G_t^{-1}\Sigma_{\pi}\right) - \frac{\nu_t}{2}\log\left|\Sigma_{\pi}\right| + \frac{\nu_t}{2}\left(\mu_{\omega}^{\pi}(o_t) - \mathbb{E}_{\psi}\left[a_t \mid s_t, t\right]\right)^{T} G_t^{-1}\left(\mu_{\omega}^{\pi}(o_t) - \mathbb{E}_{\psi}\left[a_t \mid s_t, t\right]\right)\right]
\]

Note that the last term is the squared Mahalanobis distance between the mean actions of ψ and π at time step t, which is intuitive as we would like to encourage π to match ψ.

3.4. Convolutional GPS

The problem of learning high-dimensional optimization algorithms presents challenges for reinforcement learning algorithms due to high dimensionality of the state and action spaces. For example, in the case of GPS, because the running time of LQG is cubic in the dimensionality of the state space, performing policy search even in the simple class of linear-Gaussian policies would be prohibitively expensive when the dimensionality of the optimization problem is high.

Fortunately, many high-dimensional optimization problems have underlying structure that can be exploited. For example, the parameters of neural nets are equivalent up to permutation among certain coordinates. More concretely, for fully connected neural nets, the dimensions of a hidden layer and the corresponding weights can be permuted arbitrarily without changing the function they compute. Because permuting the dimensions of two adjacent layers can permute the weight matrix arbitrarily, an optimization algorithm should be invariant to permutations of the rows and columns of a weight matrix. A reasonable prior to impose is that the algorithm should behave in the same manner on all coordinates that correspond to entries in the same matrix. That is, if the values of two coordinates in all current and past gradients and iterates are identical, then the step vector produced by the algorithm should have identical values in these two coordinates. We will refer to the set of coordinates on which permutation invariance is enforced as a coordinate group. For the purposes of learning an optimization algorithm for neural nets, a natural choice would be to make each coordinate group correspond to a weight matrix or a bias vector. Hence, the total number of coordinate groups is twice the number of layers, which is usually fairly small.

In the case of GPS, we impose this prior on both ψ and π. For the purposes of updating η, we first impose a block-diagonal structure on the parameters A_t, B_t and F_t of the fitted transition probability density p̃(s_{t+1} | s_t, a_t, t; ζ) = N(A_t s_t + B_t a_t + c_t, F_t), so that for each coordinate in the optimization problem, the dimensions of s_{t+1} that correspond to the coordinate only depend on the dimensions of s_t and a_t that correspond to the same coordinate. As a result, p̃(s_{t+1} | s_t, a_t, t; ζ) decomposes into multiple independent probability densities p̃^j(s^j_{t+1} | s^j_t, a^j_t, t; ζ^j), one for each coordinate j. Similarly, we also impose a block-diagonal structure on C_t for fitting c̃(s_t) and on the parameter matrix of the fitted model for π(a_t | s_t; θ). Under these assumptions, K_t and G_t are guaranteed to be block-diagonal as well. Hence, the Bregman divergence penalty term D_t(η, θ) decomposes into a sum of Bregman divergence terms, one for each coordinate.

We then further constrain the dual variables λ_t, sub-vectors of parameter vectors and sub-matrices of parameter matrices corresponding to each coordinate group to be identical across the group. Additionally, we replace the weight ν_t on D_t(η, θ) with an individual weight on each Bregman divergence term for each coordinate group. The problem then decomposes into multiple independent subproblems, one for each coordinate group. Because the dimensionality of the state subspace corresponding to each coordinate is constant, LQG can be executed on each subproblem much more efficiently.

Similarly, for π, we choose a μ_ω^π(·) that shares parameters across different coordinates in the same group. We also impose a block-diagonal structure on Σ_π and constrain the appropriate sub-matrices to share their entries.
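To make the notion of a coordinate group concrete, the sketch below partitions the flattened parameter vector of a fully connected net into one group per weight matrix and one per bias vector (the natural choice described above), and applies a per-coordinate update rule that is shared within each group, so that coordinates belonging to the same matrix are treated identically. The helper names and the placeholder update rule are illustrative assumptions.

import numpy as np

def coordinate_groups(layer_sizes):
    """Partition the flattened parameters of a fully connected net into
    coordinate groups: one group per weight matrix and one per bias vector.
    E.g. layer_sizes = [48, 48, 10] gives 2 * (len(layer_sizes) - 1) = 4 groups.
    Returns a list of index arrays into the flattened parameter vector."""
    groups, offset = [], 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w_size, b_size = n_in * n_out, n_out
        groups.append(np.arange(offset, offset + w_size))                    # weight matrix
        groups.append(np.arange(offset + w_size, offset + w_size + b_size))  # bias vector
        offset += w_size + b_size
    return groups

def grouped_step(rules, groups, grad):
    """Compute a step vector by applying, element-wise, the update rule shared
    by each coordinate group -- the permutation-invariance prior above."""
    step = np.empty_like(grad)
    for rule, idx in zip(rules, groups):
        step[idx] = rule(grad[idx])
    return step

groups = coordinate_groups([48, 48, 10])
rules = [lambda g: -0.1 * g] * len(groups)   # placeholder per-group rules
step = grouped_step(rules, groups, np.random.randn(48 * 48 + 48 + 48 * 10 + 10))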
Figure 1. Comparison of the various hand-engineered and learned algorithms on training neural nets with 48 input and hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 64. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.

3.5. Features and Policy Model

We describe the features Φ(·) and Ψ(·) at time step t, which define the state s_t and observation o_t respectively.

Because of the stochasticity of gradients and objective values, the state features Φ(·) are defined in terms of summary statistics of the history of iterates {x^(i)}_{i=0}^{t}, gradients {∇f̂(x^(i))}_{i=0}^{t} and objective values {f̂(x^(i))}_{i=0}^{t}. We define the following statistics, which we will refer to as the average recent iterate, gradient and objective value respectively:

• $\bar{x}^{(i)} := \frac{1}{\min(i+1,3)} \sum_{j=\max(i-2,0)}^{i} x^{(j)}$

• $\overline{\nabla\hat{f}}(x^{(i)}) := \frac{1}{\min(i+1,3)} \sum_{j=\max(i-2,0)}^{i} \nabla\hat{f}(x^{(j)})$

• $\overline{\hat{f}}(x^{(i)}) := \frac{1}{\min(i+1,3)} \sum_{j=\max(i-2,0)}^{i} \hat{f}(x^{(j)})$

The state features Φ(·) consist of the relative change in the average recent objective value, the average recent gradient normalized by the magnitude of a previous average recent gradient and a previous change in average recent iterate relative to the current change in average recent iterate:

• $\left\{\left(\overline{\hat{f}}(x^{(t-5i)}) - \overline{\hat{f}}(x^{(t-5(i+1))})\right) / \overline{\hat{f}}(x^{(t-5(i+1))})\right\}_{i=0}^{24}$

• $\left\{\overline{\nabla\hat{f}}(x^{(t-5i)}) / \left(\left|\overline{\nabla\hat{f}}(x^{(\max(t-5(i+1),\,t \bmod 5))})\right| + 1\right)\right\}_{i=0}^{25}$

• $\left\{\left|\bar{x}^{(\max(t-5(i+1),\,t \bmod 5+5))} - \bar{x}^{(\max(t-5(i+2),\,t \bmod 5))}\right| / \left(\left|\bar{x}^{(t-5i)} - \bar{x}^{(t-5(i+1))}\right| + 0.1\right)\right\}_{i=0}^{24}$

Note that all operations are applied element-wise. Also, whenever a feature becomes undefined (i.e., when the time step index becomes negative), it is replaced with the all-zeros vector.

Unlike state features, which are only used when training the optimization algorithm, observation features Ψ(·) are used both during training and at test time. Consequently, we use noisier observation features, which depend directly on the noisy gradients and objective values. The observation features include the following:

• $\left(\hat{f}(x^{(t)}) - \hat{f}(x^{(t-1)})\right) / \hat{f}(x^{(t-1)})$

• $\nabla\hat{f}(x^{(t)}) / \left(\left|\nabla\hat{f}(x^{(\max(t-1,0))})\right| + 1\right)$

• $\left|x^{(\max(t-1,1))} - x^{(\max(t-2,0))}\right| / \left(\left|x^{(t)} - x^{(t-1)}\right| + 0.1\right)$

We use a recurrent neural net with a single layer of 128 LSTM (Hochreiter & Schmidhuber, 1997) cells to model the mean of the policy, μ_ω^π(o_t). As mentioned above, the policy operates on each coordinate individually and the parameters of the policy on all coordinates within a coordinate group are identical. Technically, because μ_ω^π(·) should only depend on o_t, Ψ(·) also includes the previous state of the LSTM cells.

4. Experiments

For clarity, we will refer to training of the optimization algorithm as "meta-training" to differentiate it from base-level training, which will simply be referred to as "training".

We meta-trained an optimization algorithm on a single objective function, which corresponds to the problem of training a two-layer neural net with 48 input units, 48 hidden units and 10 output units on a randomly projected and normalized version of the MNIST training set with dimensionality 48 and unit variance in each dimension. We used a time horizon of 400 iterations and a mini-batch size of 64 for computing stochastic gradients and objective values. We evaluate the optimization algorithm on its ability to generalize to unseen objective functions, which correspond to the problems of training neural nets on different tasks/datasets. We evaluate the learned optimization algorithm on three datasets, the Toronto Faces Dataset (TFD), CIFAR-10 and CIFAR-100. These datasets are chosen for their very different characteristics from MNIST and each other: TFD contains 3300 grayscale images that have relatively little variation and has seven different categories, whereas CIFAR-100 contains 50,000 colour images that have varied appearance and has 100 different categories.
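The meta-training objective above uses a randomly projected and normalized version of MNIST with dimensionality 48 and unit variance in each dimension. A minimal version of such preprocessing might look like the following; the specific projection distribution and the centering step are our assumptions, since the paper does not spell them out.

import numpy as np

def random_project_and_normalize(X, out_dim=48, seed=0):
    """Project data to `out_dim` dimensions with a random linear map and rescale
    each dimension to unit variance (an assumed stand-in for the preprocessing
    described in the experiments).
    X: (num_examples, num_features) array, e.g. flattened MNIST images."""
    rng = np.random.RandomState(seed)
    P = rng.randn(X.shape[1], out_dim) / np.sqrt(X.shape[1])  # random projection
    Z = X @ P
    Z = Z - Z.mean(axis=0)            # centering (assumption)
    Z = Z / (Z.std(axis=0) + 1e-8)    # unit variance in each dimension
    return Z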
Figure 2. Comparison of the various hand-engineered and learned algorithms on training neural nets with 100 input units and 200 hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 64. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.

Figure 3. Comparison of the various hand-engineered and learned algorithms on training neural nets with 48 input and hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 10. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.

All algorithms are tuned on the training objective function. For hand-engineered algorithms, this entails choosing the best hyperparameters; for learned algorithms, this entails meta-training on the objective function. We compare to seven hand-engineered algorithms: stochastic gradient descent, momentum, conjugate gradient, L-BFGS, ADAM, AdaGrad and RMSprop. In addition, we compare to an optimization algorithm meta-trained using the method described in (Andrychowicz et al., 2016) on the same training objective function (training a two-layer neural net on randomly projected and normalized MNIST) under the same setting (a time horizon of 400 iterations and a mini-batch size of 64).

First, we examine the performance of various optimization algorithms on similar objective functions. The optimization problems under consideration are those for training neural nets that have the same number of input and hidden units (48 and 48) as those used during meta-training. The number of output units varies with the number of categories in each dataset. We use the same mini-batch size as that used during meta-training. As shown in Figure 1, the optimization algorithm meta-trained using our method (which we will refer to as Predicted Step Descent) consistently descends to the optimum the fastest across all datasets. On the other hand, other algorithms are not as consistent and the relative ranking of other algorithms varies by dataset. This suggests that Predicted Step Descent has learned to be robust to variations in the data distributions, despite being trained on only one objective function, which is associated with a very specific data distribution that characterizes MNIST. It is also interesting to note that while the algorithm meta-trained using (Andrychowicz et al., 2016) (which we will refer to as L2LBGDBGD) performs well on CIFAR, it is unable to reach the optimum on TFD.

Next, we change the architecture of the neural nets and see if Predicted Step Descent generalizes to the new architecture. We increase the number of input units to 100 and the number of hidden units to 200, so that the number of parameters is roughly increased by a factor of 8. As shown in Figure 2, Predicted Step Descent consistently outperforms other algorithms on each dataset, despite having not been trained to optimize neural nets of this architecture. Interestingly, while it exhibited a bit of oscillation initially on TFD and CIFAR-10, it quickly recovered and overtook other algorithms, which is reminiscent of the phenomenon reported in (Li & Malik, 2016) for low-dimensional optimization problems.
Figure 4. Comparison of the various hand-engineered and learned algorithms on training neural nets with 100 input units and 200 hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 10. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.

Figure 5. Comparison of the various hand-engineered and learned algorithms on training neural nets with 100 input units and 200 hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 for 800 iterations with mini-batches of size 64. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.

This suggests that it has learned to detect when it is performing poorly and knows how to change tack accordingly. L2LBGDBGD experienced difficulties on TFD and CIFAR-10 as well, but slowly diverged.

We now investigate how robust Predicted Step Descent is to stochasticity of the gradients. To this end, we take a look at its performance when we reduce the mini-batch size from 64 to 10 on both the original architecture with 48 input and hidden units and the enlarged architecture with 100 input units and 200 hidden units. As shown in Figure 3, on the original architecture, Predicted Step Descent still outperforms all other algorithms and is able to handle the increased stochasticity fairly well. In contrast, conjugate gradient and L2LBGDBGD had some difficulty handling the increased stochasticity on TFD and, to a lesser extent, on CIFAR-10. In the former case, both diverged; in the latter case, both were progressing slowly towards the optimum.

On the enlarged architecture (Figure 4), Predicted Step Descent experienced some significant oscillations on TFD and CIFAR-10, but still managed to achieve a much better objective value than all the other algorithms. Many hand-engineered algorithms also experienced much greater oscillations than previously, suggesting that the optimization problems are inherently harder. L2LBGDBGD diverged fairly quickly on these two datasets.

Finally, we try doubling the number of iterations. As shown in Figure 5, despite being trained over a time horizon of 400 iterations, Predicted Step Descent behaves reasonably beyond the number of iterations it is trained for.

5. Conclusion

In this paper, we presented a new method for learning optimization algorithms for high-dimensional stochastic problems. We applied the method to learning an optimization algorithm for training shallow neural nets. We showed that the algorithm learned using our method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on unrelated tasks/datasets like the Toronto Faces Dataset, CIFAR-10 and CIFAR-100. We also demonstrated that the learned optimization algorithm is robust to changes in the stochasticity of gradients and the neural net architecture.