Learning by Turning: Neural Architecture Aware Optimisation

Yang Liu∗ Jeremy Bernstein∗ Markus Meister Yisong Yue


[email protected] [email protected] [email protected] [email protected]

arXiv:2102.07227v1 [cs.NE] 14 Feb 2021

Abstract

Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Also, Nero's memory footprint is ~ square root that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron's hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.

1 Introduction

Deep learning has brought on a new paradigm in computer science, enabling artificial systems to interact with the world at an unprecedented level of complexity. That said, the core technology relies on various heuristic numerical techniques that are sometimes brittle and often require extensive tuning. A major goal of modern research in machine learning is to uncover the principles underlying learning in neural systems, and thus to derive more reliable learning algorithms.

Part of the challenge of this endeavour is that learning in deep networks is an inherently coupled problem. Suppose that training performance is sensitive to a particular detail of the neural architecture—then it is unclear whether that detail affects the expressivity of the architecture, or just the ability of the descent method to train the architecture.

This observation motivates the combined study of architecture and optimisation, and this paper explores several questions at that intersection. First of all:

⟨?⟩ What is the right domain of optimisation for a neural network's weights? Is it R^d, or something more exotic—such as a Cartesian product of hyperspheres?

Typically, optimisation is conducted over R^d, while a careful weight initialisation and a tuned weight decay hyperparameter impose a soft constraint on the optimisation domain. Since normalisation schemes such as batch norm (Ioffe & Szegedy, 2015) render the network invariant to the scale of the weights, weight decay also plays a somewhat subtle second role in modifying the effective learning rate. Hyperparameters with this kind of subtle coupling add to the compounding cost of hyperparameter search.

Furthermore, descent methods such as Adam (Kingma & Ba, 2015) and LAMB (You et al., 2020) use either synapse-specific or layer-specific gradient normalisation. This motivates a second question:

⟨?⟩ At what level of granularity should an optimiser work? Should normalisation occur per-synapse or per-layer—or perhaps, per-neuron?

This paper contends that in deep learning, hyperparameters proliferate because of hidden couplings between optimiser and architecture. By studying the above questions, and distilling the simple rules that govern optimisation and architecture, this paper aims to make deep learning less brittle—and less sensitive to opaque hyperparameters.

Summary of contributions:

1. A new optimiser—Nero: the neuronal rotator. Nero performs per-neuron projected gradient descent, and uses ~ square root the memory of Adam or LAMB.

2. Experiments across image classification, image generation, natural language processing and reinforcement learning, in which Nero's out-of-the-box configuration tends to outperform tuned baseline optimisers.

3. Discussion of how the connection between optimisation and architecture relates to generalisation theories, such as PAC-Bayes and norm-based complexity.

∗ Equal contribution.
2 Related Work

This section reviews relevant work pertaining to both neural architecture design and optimisation in machine learning, and concludes with a bridge to the neuroscience literature.

2.1 Neural Architecture Design

The importance of wiring constraints for the stable function of engineered neural systems is not a new discovery. One important concept is that of balanced excitation and inhibition. For instance, Rosenblatt (1958) found that balancing the proportion of excitatory and inhibitory synaptic connections made his perceptron more robust to varying input sizes. Another concept relates to the total magnitude of synapse strengths. For example, Rochester et al. (1956) constrained the sum of magnitudes of synapses impinging on a neuron so as to stabilise the process of learning. Similar ideas were explored by von der Malsburg (1973) and Miller & MacKay (1994). These works are early predecessors to this paper's definition of balanced networks given in Section 3.1.

Given the resurgence of neural networks over the last decade, the machine learning community has taken up the mantle of research on neural architecture design. Special weight scalings—such as Xavier init (Glorot & Bengio, 2010) and Kaiming init (He et al., 2015)—have been proposed to stabilise signal transmission through deep networks. These scalings are only imposed at initialisation and are free to wander during training—an issue which may be addressed by tuning a weight decay hyperparameter. More recent approaches—such as batch norm (Ioffe & Szegedy, 2015)—explicitly control activation statistics throughout training by adding extra normalisation layers to the network.

Other recent normalisation techniques lie closer to the work of Rosenblatt (1958) and Rochester et al. (1956). Techniques that involve constraining a neuron's weights to the unit hypersphere include: weight norm (Salimans & Kingma, 2016), decoupled networks (Liu et al., 2017, 2018) and orthogonal parameterised training (Liu et al., 2020). Techniques that also balance excitation and inhibition include centred weight norm (Huang et al., 2017) and weight standardisation (Qiao et al., 2019).

2.2 Descent Methods in Deep Learning

Much classic work in optimisation theory focuses on deriving convergence results for descent methods under assumptions such as convexity (Boyd & Vandenberghe, 2004) and Lipschitz continuity of the gradient (Nesterov, 2004). These simplifying assumptions are often used in the machine learning literature. For instance, Bottou et al. (2018) provide convergence guarantees for stochastic gradient descent (SGD) under each of these assumptions. However, these assumptions do not hold in deep learning (Sun, 2019).

On a related note, SGD is not the algorithm of choice in many deep learning applications, and heuristic methods such as RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015) often work better. For instance, Adam often works much better than SGD for training generative adversarial networks (Bernstein et al., 2020a). Yet the theory behind Adam is poorly understood (Reddi et al., 2018).

A more recent line of work has explored optimisation methods that make relative updates to neural network parameters. Optimisers like LARS (You et al., 2017), LAMB (You et al., 2020) and Fromage (Bernstein et al., 2020a) make per-layer relative updates, while Madam (Bernstein et al., 2020b) makes per-synapse relative updates. You et al. (2017) found that these methods stabilise large batch training, while Bernstein et al. (2020a) found that they require little to no learning rate tuning across tasks.

Though these recent methods partially account for the neural architecture—by paying attention to its layered operator structure—they do not rigorously address the optimisation domain. As such, LARS and LAMB require a tunable weight decay hyperparameter, while Fromage and Madam restrict the optimisation to a bounded set of tunable size (i.e. weight clipping). Without this additional tuning, these methods can be unstable—see for instance (Bernstein et al., 2020a, Figure 2) and (Bernstein et al., 2020b, Figure 3).

The discussion in the previous paragraph typifies the machine learning state of the art: optimisation techniques that work well, albeit only after hyperparameter tuning. For instance, LAMB is arguably the state-of-the-art relative optimiser, but it contains in total five tunable hyperparameters. Since—at least naïvely—the cost of hyperparameter search is exponential in the number of hyperparameters, the prospect of fully tuning LAMB is computationally daunting.

2.3 Homeostatic Control in Neuroscience

Since the brain is a system that must learn stably without hyperparameter do-overs, it is worth looking to neuroscience for inspiration on designing better learning algorithms.

A major swathe of neuroscience research studies mechanisms by which the brain performs homeostatic control. For instance, neuroscientists report a form of homeostasis termed synaptic scaling, where a neuron modulates the strengths of all its synapses to stabilise its firing rate (Turrigiano, 2008). More generally, heterosynaptic plasticity refers to homeostatic mechanisms that modulate the strength of unstimulated synapses (Chistiakova et al., 2015). Shen et al. (2020) review connections to normalisation methods used in machine learning.

These observations inspired this paper to consider implementing homeostatic control via projected gradient descent—leading to the Nero optimiser.
3 Background Theory

In general, an L-layer neural network f(·) is a composition of L simpler functions f_1(·), ..., f_L(·):

    f(x) = f_L ∘ f_{L−1} ∘ ... ∘ f_1(x).    (forward pass)

Due to this compositionality, any slight ill-conditioning in the simple functions f_i(·) has the potential to compound over layers, making the overall network f(·) very ill-conditioned. Architecture design should aim to prevent this from happening, as will be covered in Section 3.1.

The Jacobian ∂f/∂f_l, which plays a key role in evaluating gradients, also takes the form of a deep product:

    ∂f/∂f_l = ∂f_L/∂f_{L−1} · ∂f_{L−1}/∂f_{L−2} · ... · ∂f_{l+1}/∂f_l.    (backward pass)

Therefore, it is also important from the perspective of gradient-based optimisation that compositionality is adequately addressed, as will be covered in Section 3.2.

3.1 Balanced Network Architectures

A common strategy to mitigate the issue of compounding ill-conditioning is to explicitly re-normalise the activations at every network layer. Batch norm (Ioffe & Szegedy, 2015) exemplifies this strategy, and was found to improve the trainability of deep residual networks. Batch norm works by standardising the activations across a batch of inputs at each network layer—that is, it shifts and scales the activations to have mean zero and variance one across a batch.

Although batch norm works well, it adds computational overhead to both the forward and backward pass. To explore how far one can get without explicit re-normalisation, the following definitions are useful:

Definition 1. A neuron is balanced if its weight vector w ∈ R^d satisfies the following constraints:

    ∑_{i=1}^d w_i = 0;      (balanced excitation & inhibition)
    ∑_{i=1}^d w_i^2 = 1.    (ℓ2 constant sum rule)

Definition 2. A network is balanced if all its constituent neurons are balanced.

As noted by Huang et al. (2017), balanced neurons attain some of the properties of batch norm for free. To see this, consider a linear neuron y = ∑_i w_i x_i with inputs x_i that are uncorrelated with mean µ and variance 1. Then the output y is standardised:

    E[y] = ∑_i w_i E[x_i] = µ ∑_i w_i = 0;
    Var[y] = ∑_i w_i^2 Var[x_i] = ∑_i w_i^2 = 1.

While the assumptions on the inputs x_i are unlikely to hold exactly, under more general conditions the constraints may at least encourage the standardisation of activation statistics through the layers of the network (Brock et al., 2021).

3.2 Stable Descent Steps

Since a network is trained via perturbations to its parameters, it is important to know what size perturbations are appropriate. Consider an L-layer network with weight matrices W = (W_1, W_2, ..., W_L) and loss function L(W). For a perturbation ∆W = (∆W_1, ∆W_2, ..., ∆W_L), the following definition establishes a notion of stable step size:

Definition 3. Let θ_l denote the angle between ∆W_l and −∇_{W_l} L(W). A descent step is stable if for all l = 1, ..., L:

    ‖∇_{W_l} L(W + ∆W) − ∇_{W_l} L(W)‖_F / ‖∇_{W_l} L(W)‖_F < cos θ_l.    (1)

Or in words: for each layer, the relative change in gradient induced by the perturbation should not exceed the cosine of the angle between the perturbation and the negative gradient.

This definition is useful because Bernstein et al. (2020a) proved that a stable descent step is guaranteed to decrease a continuously differentiable loss function L(W). Since Inequality 1 is of little use without a model of its lefthand side, Bernstein et al. (2020a) proposed the following model:

Definition 4. The loss function obeys deep relative trust if for all perturbations ∆W = (∆W_1, ∆W_2, ..., ∆W_L):

    ‖∇_{W_l} L(W + ∆W) − ∇_{W_l} L(W)‖_F / ‖∇_{W_l} L(W)‖_F ≤ ∏_{k=1}^L (1 + ‖∆W_k‖_F / ‖W_k‖_F) − 1.

While deep relative trust is based on a perturbation analysis of L-layer perceptrons (Bernstein et al., 2020a, Theorem 1), the key idea is that its product structure explicitly models the product structure of the network's backward pass.

The deep relative trust model suggests that a stable descent step should involve small relative perturbations per layer. This motivates the layer-wise family of descent methods (You et al., 2017, 2020). Still, it is unclear whether layers are the right base object to consider. Perhaps a more refined analysis would replace the layers appearing in Definition 4 with individual neurons or even synapses.

Small relative perturbations per-synapse were explored by Bernstein et al. (2020b) and found to slightly degrade training performance compared to Adam and SGD. But this paper will explore the per-neuron middle ground:

Definition 5. A step of size η > 0 is said to be per-neuron relative if for any neuron with weights w ∈ R^d and bias b ∈ R, the perturbations ∆w ∈ R^d and ∆b ∈ R satisfy:

    ‖∆w‖_2 / ‖w‖_2 ≲ η   and   |∆b| / |b| ≲ η.

A per-neuron relative update is automatically per-layer relative. To see this, consider a weight matrix W whose N rows correspond to N neurons w^(1), ..., w^(N). Then:

    ‖∆W‖_F / ‖W‖_F = √(∑_{i=1}^N ‖∆w^(i)‖_2^2) / √(∑_{i=1}^N ‖w^(i)‖_2^2) ≲ √(∑_{i=1}^N η^2 ‖w^(i)‖_2^2) / √(∑_{i=1}^N ‖w^(i)‖_2^2) = η.    (2)
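To make the constraints of Definition 1 and the per-neuron relative steps of Definition 5 concrete, here is a small NumPy sketch. It is illustrative only (not taken from the paper's codebase): it projects each row of a weight matrix onto the balanced constraint set and numerically confirms the per-layer bound of Equation 2 for a per-neuron relative perturbation.

```python
import numpy as np

def project_balanced(w):
    """Project a weight vector onto the balanced constraint set of
    Definition 1: zero mean (balanced excitation & inhibition) and
    unit l2 norm (the l2 constant sum rule)."""
    w = w - w.mean()
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
eta = 0.01                                    # per-neuron relative step size
W = rng.standard_normal((64, 256))            # 64 neurons, fan-in 256
W = np.stack([project_balanced(row) for row in W])

# A per-neuron relative perturbation: each row moves by at most eta * ||row||_2.
Delta = rng.standard_normal(W.shape)
Delta = eta * Delta / np.linalg.norm(Delta, axis=1, keepdims=True) \
            * np.linalg.norm(W, axis=1, keepdims=True)

# Equation 2: the induced per-layer relative change is then also at most eta.
layer_ratio = np.linalg.norm(Delta) / np.linalg.norm(W)
print(layer_ratio <= eta + 1e-12)             # True
```

The projection used here is exactly the one Nero applies after every update, as described in Section 4.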
4 Nero: the Neuronal Rotator

Following the discussion in Section 3, this paper will consider an optimisation algorithm that makes per-neuron relative updates (Definition 5) constrained to the space of balanced networks (Definition 2).

Since a balanced neuron is constrained to the unit hypersphere, a per-neuron relative update with step size η corresponds to a pure rotation of the neuron's weight vector by angle ≈ η. To see this, take η small in the following picture:

    [Diagram: two unit-norm vectors, ‖w‖_2 = 1 and ‖w + ∆w‖_2 = 1, separated by angle θ, with chord ‖∆w‖_2 = η.]

Hence, this paper proposes Nero: the neuronal rotator. Nero's goal is to reduce the burden of hyperparameter tuning by baking architectural information into the optimiser. More concretely, the anticipated advantages are as follows:

1. Since per-neuron relative updates are automatically per-layer relative by Equation 2, they should inherit the properties of per-layer updates—in particular, stability across batch sizes (You et al., 2017) while needing little to no learning rate tuning (Bernstein et al., 2020a).

2. Since balanced networks place hard constraints on the norm of a neuron's weights, the need for initialisation tuning and weight decay should be removed.

3. Gradients are often normalised by running averages, in order to retain relative scale information between successive minibatch gradients (Tieleman & Hinton, 2012). Along with momentum, this is the main memory overhead of Adam and LAMB compared to vanilla SGD. Per-neuron running averages consume ~ square root the memory of per-synapse running averages.

4. Since normalisation is local to a neuron, no communication is needed between neurons in a layer (unlike for per-layer updates). This makes the optimiser more distributable—for example, a single layer can be split across multiple compute devices without fuss. For the same reason, the Nero update is biologically plausible.

There is one significant difference between Nero and prior work on balanced networks. In centred weight norm (Huang et al., 2017) and weight standardisation (Qiao et al., 2019), a neuron's underlying weight representation is an unnormalised vector w̃ ∈ R^d—which is normalised by including the following reparameterisation in the neural architecture:

    normalise(w̃) := (w̃ − 1ᵀw̃ · 1/d) / ‖w̃ − 1ᵀw̃ · 1/d‖_2,    (3)

where 1 denotes the vector of 1s.

Since the target of automatic differentiation is still the unnormalised vector w̃, overhead is incurred in both the forward and backward pass. Moreover, there is a subtle coupling between the step size in additive optimisers like Adam and the scale of the unnormalised weights w̃ (see Section 5.3).

In contrast, Nero opts to implement balanced networks via projected gradient descent. This is lighter-weight than Equation 3, since duplicate copies of the weights are not needed and the network's backward pass does not involve extra operations. Furthermore, Nero can be used as a drop-in replacement for optimisers like Adam, SGD or LAMB, without the user needing to manually modify the network architecture via the reparameterisation in Equation 3.

Pseudocode for Nero is provided in Algorithm 1. For brevity, the Adam-style bias correction of the running averages is omitted from the pseudocode. But in the Pytorch implementation used in this paper's experiments, the running averages ḡ_w and ḡ_b are divided by a factor of √(1 − β^t) before the t-th update. This corrects for the warmup bias stemming from ḡ_w and ḡ_b being initialised to zero (Kingma & Ba, 2015).

While the pseudocode in Algorithm 1 is presented for neurons and biases, in the Pytorch implementation the bias update is applied to any parameters lacking a notion of fan-in—including batch norm gains and biases. Typical initialisation scales are σ_b = 1 for gains and σ_b = 0.01 for biases. The Pytorch implementation of Nero used σ_b = 0.01 for any bias parameter initialised to zero.

Algorithm 1. Nero optimiser. "Out-of-the-box" hyperparameter defaults are η = 0.01 and β = 0.999. The constant σ_b ∈ R_+ refers to the initialisation scale of the biases.

    Input: step size η ∈ (0, 1], averaging constant β ∈ [0, 1)
    repeat
        for each neuron do
            ▷ get weight & bias gradients g_w ∈ R^n and g_b ∈ R
            ▷ update running averages
            ḡ_w² ← β · ḡ_w² + (1 − β) · ‖g_w‖_2²
            ḡ_b² ← β · ḡ_b² + (1 − β) · g_b²
            ▷ update weights w ∈ R^n and bias b ∈ R
            w ← w − η · (‖w‖_2 / ḡ_w) · g_w
            b ← b − η · (σ_b / ḡ_b) · g_b
            ▷ project weights back to constraint set
            w ← w − (1/n) ∑_{i=1}^n w_i
            w ← w / ‖w‖_2
        end for
    until converged
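To make Algorithm 1 concrete, below is a minimal NumPy sketch of a single Nero step for one neuron. It is an illustrative reading of the pseudocode rather than the paper's PyTorch implementation (which lives at github.com/jxbz/nero): the bias correction is omitted, as in Algorithm 1, and the small eps constant is an added safeguard against division by zero that the pseudocode does not specify.

```python
import numpy as np

def nero_neuron_step(w, b, g_w, g_b, state,
                     eta=0.01, beta=0.999, sigma_b=0.01, eps=1e-10):
    """One Nero update for a single neuron with weights w and bias b.

    `state` carries the squared running averages of the gradient magnitudes.
    `eps` is an added numerical safeguard, not part of Algorithm 1."""
    # Update running averages of the squared gradient magnitudes.
    state["g_w_sq"] = beta * state["g_w_sq"] + (1 - beta) * float(g_w @ g_w)
    state["g_b_sq"] = beta * state["g_b_sq"] + (1 - beta) * g_b ** 2

    # Per-neuron relative updates: the weight step scales with ||w||_2,
    # and the bias step scales with the bias initialisation scale sigma_b.
    w = w - eta * np.linalg.norm(w) / (np.sqrt(state["g_w_sq"]) + eps) * g_w
    b = b - eta * sigma_b / (np.sqrt(state["g_b_sq"]) + eps) * g_b

    # Project the weights back onto the balanced constraint set:
    # zero mean, unit l2 norm.
    w = w - w.mean()
    w = w / (np.linalg.norm(w) + eps)
    return w, b, state

# Example usage on a single neuron with fan-in 8.
rng = np.random.default_rng(0)
w = rng.standard_normal(8)
w = (w - w.mean()) / np.linalg.norm(w - w.mean())   # start from a balanced neuron
b = 0.0
state = {"g_w_sq": 0.0, "g_b_sq": 0.0}
w, b, state = nero_neuron_step(w, b, g_w=rng.standard_normal(8), g_b=0.3, state=state)
```

In the full optimiser this update runs over every row of every weight matrix, and parameters without a notion of fan-in (such as batch norm gains and biases) follow the bias branch, as described above.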


5 Experiments

This section begins with targeted experiments intended to demonstrate Nero's key properties. Then, in Section 5.6, Nero is benchmarked across a range of popular tasks. In all figures, the mean and range are plotted over three repeats. For Nero, out-of-the-box refers to setting η = 0.01 and β = 0.999. More experimental details are given in Appendix A.

5.1 Constraints Help Nero

To verify that projecting to the space of balanced networks improves the performance of Nero, an ablation experiment was conducted. As can be seen in Figure 1, when training a VGG-11 image classifier on the CIFAR-10 dataset, Nero performed best with both constraints switched on.

[Figure 1. Ablating the balanced network constraints. A VGG-11 network was trained on CIFAR-10. The legend denotes which of Nero's constraints were active. Mean refers to balanced excitation & inhibition, while norm refers to the ℓ2 constant sum rule. Panels: training and validation top-1 error versus epoch.]

5.2 Per-Neuron Updates are a Good Middle Ground

Since Bernstein et al. (2020b) found that per-synapse relative updates led to slightly degraded performance, while per-layer relative updates typically perform well (You et al., 2017, 2020; Bernstein et al., 2020a), this section compares per-synapse, per-neuron and per-layer relative updates. In particular, Nero is compared to Madam (per-synapse relative) and LAMB (per-layer relative).

A VGG-11 model was trained on the CIFAR-10 dataset. Without constraints, the three optimisers performed similarly, achieving ~12% top-1 validation error (Figure 2, top). Constraining to the space of balanced networks (Definition 2) improved both Nero and LAMB, but did not have a significant effect on Madam (Figure 2, bottom). In both configurations, Nero outperformed Madam and LAMB, demonstrating the viability of per-neuron relative updates.

[Figure 2. Comparing per-synapse (Madam), per-neuron (Nero) and per-layer (LAMB) relative updates. A VGG-11 network was trained to classify CIFAR-10. Top: all optimisers without balanced network constraints. Bottom: all optimisers with constraints. Panels: training and validation top-1 error versus epoch.]

5.3 The Pitfalls of Reparameterisation

Existing implementations of balanced networks (Definition 2) work via the re-parameterisation given in Equation 3 (Huang et al., 2017; Qiao et al., 2019). This leads to an undesired coupling between the learning rate in optimisers like Adam and the scale of the unnormalised w̃ parameters.

To verify this, a network with weights normalised by Equation 3 was trained to classify the MNIST dataset. The initial weights w̃ were drawn from N(0, σ²), and the experiment was repeated for σ = 1 and σ = 100. The Adam optimiser was used for training with a fixed learning rate of 0.01. As can be seen in Figure 3 (left), the training performance was sensitive to the weight scale σ, despite the fact that a weight normalisation scheme was being used.

The unnecessary scale freedom of reparameterisation can lead to other undesired consequences, such as numerical overflow. Nero completely eliminates this issue by implementing balanced networks via projected gradient descent.

[Figure 3. Left: Training a 5-layer perceptron normalised via reparameterisation (Equation 3) on MNIST. For a fixed Adam learning rate, training is sensitive to the scale σ of the raw weights w̃. This motivates the different approach taken by Nero. Right: Using Nero to train a 100-layer perceptron—without batch norm or skip connections—to classify MNIST. Panels: training accuracy versus epoch.]
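The coupling described in Section 5.3 can be reproduced in a few lines. The following PyTorch snippet is an illustrative sketch, not the experiment's code: it applies the reparameterisation of Equation 3 to raw weights drawn at scale σ and shows that, while the normalised forward pass is insensitive to σ, the gradient with respect to w̃ shrinks roughly like 1/σ, so a fixed additive step on w̃ moves the normalised weights by very different amounts at σ = 1 and σ = 100.

```python
import torch

def normalise(w_raw):
    # Equation 3: subtract the mean, then divide by the l2 norm.
    centred = w_raw - w_raw.mean()
    return centred / centred.norm()

torch.manual_seed(0)
x = torch.randn(256)
for sigma in (1.0, 100.0):
    w_raw = (sigma * torch.randn(256)).requires_grad_()
    y = torch.dot(normalise(w_raw), x)   # linear neuron on the normalised weights
    y.backward()
    # The output scale is independent of sigma, but the raw-weight gradient
    # is roughly 1/sigma smaller, so a fixed additive update to w_raw has a
    # much weaker effect on normalise(w_raw) when sigma is large.
    print(f"sigma={sigma:>5}: |y|={abs(y.item()):.3f}, "
          f"|grad w_raw|={w_raw.grad.norm().item():.6f}")
```

Nero sidesteps this coupling entirely: there is no raw w̃, since the stored weights are themselves kept on the constraint set by the projection step.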
5.4 Nero Trains Deeper Networks

Very deep networks are typically difficult to train without architectural modifications such as residual connections (He et al., 2016) or batch norm (Ioffe & Szegedy, 2015). To test whether Nero enables training very deep models without such modifications, Figure 3 (right) shows the results of training a very deep multilayer perceptron (MLP) on the MNIST dataset. Unlike SGD, Adam and LAMB, Nero could reliably train a 100-layer MLP.

5.5 Highly Tuned SGD can Outperform Nero

This section compares Nero out-of-the-box to an SGD implementation with tuned learning rate, weight decay and momentum. The comparison was made for training a ResNet-50 image classifier on the ImageNet dataset. As can be seen in Figure 4, SGD with tuned learning rate, momentum, and weight decay outperformed Nero. However, the optimal set of SGD hyperparameters was brittle, and ablating weight decay alone increased the top-1 validation error by 5%.

[Figure 4. Training a ResNet-50 network to classify the ImageNet dataset. Nero uses its out-of-the-box default hyperparameters η = 0.01 and β = 0.999. SGD+wd uses initial learning rate 0.1, momentum 0.9 and weight decay (wd) 0.0001 as tuned by He et al. (2016). SGD is also shown without weight decay. Panels: training and validation top-1 error versus epoch.]

5.6 Nero Works Well Out-of-the-Box

This section probes the versatility and robustness of Nero by comparing its optimisation and generalisation performance with three popular alternatives—SGD, Adam, and LAMB—across six tasks. The tasks span the domains of computer vision, natural language processing, and reinforcement learning. A wide spectrum of neural architectures were tested—from convolutional networks to transformers.

To make a fair comparison between optimisers, a fair hyperparameter tuning strategy is needed. In this section:

1. Learning rates were tuned over {10^−4, 10^−3, ..., 10^0}.

2. For Adam, LAMB and SGD, the momentum hyperparameter was tuned to achieve good performance on the most complicated benchmark—cGAN training—and then fixed across the rest of the benchmarks. In each case, the best momentum value for cGAN was 0.

3. β in Nero and β_2 in Adam and LAMB were fixed to 0.999 across all experiments, as recommended by Kingma & Ba (2015) and You et al. (2020).

4. Weight decay was not used in any of the experiments.

The results are collated in Table 1. Nero achieved the best validation performance in every experiment—while the runner-up varied across tasks. What's more, the same learning rate of η = 0.01 was optimal for Nero in five out of six experiments. This means that Nero has strong out-of-the-box performance, since Nero's only other hyperparameter was fixed to β = 0.999 across all experiments.

The remainder of this section discusses each experiment in turn. Implementation details are given in Appendix A.
Task           | Dataset     | Model       | Metric (↑/↓)    | Nero          | SGD           | Adam          | LAMB          | Nero η | SGD η  | Adam η | LAMB η
cGAN           | CIFAR-10    | BigGAN-like | FID (↓)         | 15.43 ± 0.37  | 33.06 ± 0.42  | 23.42 ± 0.85  | 16.32 ± 0.23  | 0.01   | 0.01   | 0.0001 | 0.01
Classification | CIFAR-10    | VGG11       | Top-1 Error (↓) | 11.16% ± 0.17 | 12.61% ± 0.21 | 12.86% ± 0.34 | 13.66% ± 0.05 | 0.01   | 0.1    | 0.001  | 0.01
Classification | CIFAR-10    | ResNet-18   | Top-1 Error (↓) | 5.75% ± 0.07  | 7.75% ± 0.17  | 5.93% ± 0.19  | 6.46% ± 0.12  | 0.01   | 0.1    | 0.01   | 0.1
Language Model | Wikitext-2  | Transformer | Perplexity (↓)  | 172.99 ± 0.51 | 181.76 ± 0.49 | 178.05 ± 0.96 | 200.54 ± 0.53 | 0.01   | 1.0    | 0.0001 | 0.01
Translation    | WMT16 En–De | Transformer | Perplexity (↓)  | 11.35 ± 1.20  | 92.40 ± 89.48 | 12.63 ± 0.34  | 16.36 ± 0.29  | 0.001  | 0.0001 | 0.0001 | 0.01
PPO            | Atari Pong  | vanilla CNN | Reward (↑)      | 20.62 ± 0.05  | 11.99 ± 8.65  | 15.92 ± 3.40  | −19.46 ± 0.10 | 0.01   | 0.1    | 0.0001 | 0.001

Table 1. Validation results for the best learning rate η. The best result is shown in bold, while the runner-up is underlined (in the typeset paper).

Image synthesis with cGAN  Generative Adversarial Network (Goodfellow et al., 2014, GAN) training is perhaps the most challenging optimisation problem tackled in this paper. Good performance has traditionally relied on extensive tuning: different learning rates are often used in the generator and discriminator (Heusel et al., 2017) and training is highly sensitive to momentum (Brock et al., 2019, p. 35). The class-conditional GAN model in this paper is based on the BigGAN architecture (Brock et al., 2019). This is a heterogeneous network involving a variety of building blocks: convolutions, embeddings, fully connected layers, attention layers, conditional batch norm and spectral norm (Miyato et al., 2018). The results are presented in Figure 5.

[Figure 5. Class-conditional GAN training on CIFAR-10. Equal learning rates were used in the generator and discriminator. The Fréchet Inception Distance (Heusel et al., 2017, FID) measures the distance between the sample statistics of real and fake data as represented at a deep layer of a pre-trained image classifier. Panels: training and test FID versus epoch.]

Image classification  In Section 5.5, Nero out-of-the-box was shown to outperform SGD without weight decay when training ResNet-50 on ImageNet. Due to limited computational resources, the authors of this paper were unable to run the LAMB and Adam baselines on ImageNet. Experiments were run across all baselines on the smaller CIFAR-10 dataset instead. The networks used were the vanilla, convolutional VGG-11 network (Simonyan & Zisserman, 2015) and the batch-normalised, residual ResNet-18 network (He et al., 2015). The results are presented in Figure 6.

[Figure 6. CIFAR-10 classification. Top: performance of a vanilla, convolutional VGG-11 network. Bottom: performance of a batch-normalised, residual ResNet-18 network. Panels: training and validation top-1 error versus epoch.]

Natural language processing  Much recent progress in natural language processing is based on the transformer architecture (Vaswani et al., 2017). Transformers process information via layered, all-to-all comparisons—without recourse to recurrence or convolution. This paper experimented with a smaller transformer (19 tensors) trained on the Wikitext-2 dataset, and a larger transformer (121 tensors) trained on WMT 2016 English–German translation. The results are presented in Figures 7 and 8.

[Figure 7. Training a language model on the Wikitext-2 dataset. A small transformer network was used, composed of 19 tensors. Nero achieved the best anytime performance. Panels: training and validation perplexity versus epoch.]

[Figure 8. Training an English–German translation model on WMT16. A larger transformer network was used, composed of 121 tensors. The optimisers with gradient normalisation—Nero, Adam, and LAMB—performed best in training this model. Training with SGD was unstable and led to significantly worse perplexity. Panels: training and validation perplexity versus epoch.]

Reinforcement learning  Many reinforcement learning algorithms use neural networks to perform function approximation. Proximal Policy Optimization (Schulman et al., 2017, PPO) is one example, and PPO has gained increasing popularity for its simplicity, scalability, and robust performance. This paper experimented with PPO on the Atari Pong video game. The results are presented in Figure 9.

While LAMB failed to train on this task, further investigation revealed that setting LAMB's momentum hyperparameter to 0.9 enabled LAMB to learn. This demonstrates LAMB's sensitivity to hyperparameters.

[Figure 9. Training a policy network to play Pong. Proximal Policy Optimisation (PPO) was used. Pong's reward is bounded between ±21. While investigating LAMB's failure to train the policy network, it was discovered that adjusting the β_1 momentum hyperparameter from 0 to 0.9 improved LAMB's performance. Plot: reward versus millions of environment steps.]
6 Discussion: Rotation and Generalisation

The results in this paper may have a bearing on the generalisation theory of neural systems—an area of research that is still not settled. Consider the following hypothesis:

Hypothesis 1. Deep learning generalises because SGD is biased towards solutions with small norm.

This hypothesis is well-known, and is alluded to or mentioned explicitly in many papers (Wilson et al., 2017; Zhang et al., 2017; Bansal et al., 2018; Advani et al., 2020).

But in light of the results in Table 1, Hypothesis 1 encounters some basic problems. First, for some tasks—such as the GAN and translation experiment—SGD simply performs very poorly. And second, Nero is able to find generalising solutions even when the norm of the network is constrained. For instance, the VGG-11 network and the Wikitext-2 transformer model have no gain parameters so, under Nero, the norm of the weights (though not the biases) is fixed and cannot be "adapting to the data complexity".

Then it seems right to consider an alternative theory:

Hypothesis 2. Deep learning generalises because the space of networks that fit the training data has large measure.

This hypothesis is essentially the PAC-Bayesian generalisation theory (McAllester, 1998; Langford & Seeger, 2001) applied to deep learning. Valle-Perez et al. (2019) have developed this line of work, proving the following result:

Theorem 1 (Realisable PAC-Bayes). First, fix a probability measure P over the weight space Ω of a classifier. Let S denote a training set of n iid datapoints and let V_S ⊂ Ω denote the version space—that is, the subset of classifiers that fit the training data. Consider the population error rate 0 ≤ ε(w) ≤ 1 of weight setting w ∈ Ω, and its average over the version space ε(V_S) := E_{w∼P}[ε(w) | w ∈ V_S]. Then, for a proportion 1 − δ of random draws of the training set S,

    ε(V_S) ≤ ln [1 / (1 − ε(V_S))] ≤ [ln (1 / P[V_S]) + ln (2n/δ)] / (n − 1).    (4)

The intuition is that for a larger measure of solutions P[V_S], less information needs to be extracted from the training data to find just one solution, thus memorisation is less likely.

A simple formula for P[V_S] is possible based on this paper's connection between optimisation and architecture, since the problem is reduced to hyperspherical geometry. Consider a balanced network (Definition 2) composed of m neurons each with fan-in d. Then the optimisation domain is isomorphic to the Cartesian product of m hyperspheres:

    Ω ≅ S^{d−2} × ... × S^{d−2}    (m times),

while P can be fixed to the uniform distribution on Ω.

By restricting deep relative trust (Definition 4) or its antecedent (Bernstein et al., 2020a, Theorem 1) to balanced networks, the following definition becomes natural:

Definition 6. A solution that attains zero training error is α-robust if all neurons may be simultaneously and arbitrarily rotated by up to angle α without inducing an error.

Geometrically, an α-robust solution is the product of m hyperspherical caps. If the version space consists of K non-intersecting α-robust solutions, then its measure is:

    P[V_S] = K · P[cap_{d−2}(α)]^m ≥ (K / 2^m) · sin^{m(d−2)}(α/2),    (5)

where cap_{d−2}(α) denotes an α-cap of S^{d−2}, and the inequality follows from (Ball, 1997, Lemma 2.3). Combining Inequality 5 with Inequality 4 yields the following generalisation bound for neural networks:

    ε(V_S) ≤ [m ln 2 + m(d − 2) ln (1 / sin(α/2)) + ln (2n/δ) − ln K] / (n − 1).

Focusing on the dominant terms, the bound suggests that the average test error ε(V_S) over the space of solutions V_S is low when the number of datapoints n exceeds the number of parameters md less the entropy ln K of the multitude of distinct solutions. The theory has two main implications:

1. In the "over-parameterised" regime md ≫ n, generalisation can still occur if the number of distinct solutions K is exponential in the number of parameters md. In practice, ln K might be increased relative to md by constraining the architecture based on the symmetries of the data—e.g. using convolutions for image data.

2. All else equal, solutions with larger α-robustness may generalise better. In practice, α might be increased by regularising the training procedure (Foret et al., 2021).

Future work might investigate these ideas more thoroughly.

7 Conclusion

This paper has proposed the Nero optimiser based on a combined study of optimisation and neural architecture. Nero pairs two ingredients: (1) projected gradient descent over the space of balanced networks; and (2) per-neuron relative updates. Taken together, a Nero update turns each neuron through an angle set by the learning rate.

Nero was found to have strong out-of-the-box performance. In almost all the experiments in this paper—spanning GAN training, image classification, natural language processing and reinforcement learning—Nero trained well using its default hyperparameter settings. The two exceptions were the 100-layer MLP and the WMT16 En–De transformer, for which Nero required a reduced learning rate of η = 0.001. Thus Nero has the potential to accelerate deep learning research and development, since the need for time and energy intensive hyperparameter search may be reduced.
References

Advani, M. S., Saxe, A. M., and Sompolinsky, H. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 2020.
Ball, K. An elementary introduction to modern convex geometry. In MSRI Publications, 1997.
Bansal, Y., Advani, M., Cox, D., and Saxe, A. M. Minnorm training: an algorithm for training over-parameterized deep neural networks. arXiv:1806.00730, 2018.
Bernstein, J., Vahdat, A., Yue, Y., and Liu, M.-Y. On the distance between two neural networks and the stability of learning. In Neural Information Processing Systems, 2020a.
Bernstein, J., Zhao, J., Meister, M., Liu, M.-Y., Anandkumar, A., and Yue, Y. Learning compositional functions via multiplicative weight updates. In Neural Information Processing Systems, 2020b.
Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 2018.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
Brock, A., De, S., and Smith, S. L. Characterizing signal propagation to close the performance gap in unnormalized ResNets. In International Conference on Learning Representations, 2021.
Chistiakova, M., Bannon, N., Chen, J.-Y., Bazhenov, M., and Volgushev, M. Homeostatic role of heterosynaptic plasticity: models and experiments. Frontiers in Computational Neuroscience, 2015.
Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Neural Information Processing Systems, 2014.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Neural Information Processing Systems, 2017.
Huang, L., Liu, X., Liu, Y., Lang, B., and Tao, D. Centered weight normalization in accelerating training of deep neural networks. In International Conference on Computer Vision, 2017.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Kostrikov, I. Pytorch implementations of reinforcement learning algorithms. github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, 2018.
Langford, J. and Seeger, M. Bounds for averaging classifiers. Technical report, Carnegie Mellon University, 2001.
Liu, W., Zhang, Y.-M., Li, X., Yu, Z., Dai, B., Zhao, T., and Song, L. Deep hyperspherical learning. In Neural Information Processing Systems, 2017.
Liu, W., Liu, Z., Yu, Z., Dai, B., Lin, R., Wang, Y., Rehg, J. M., and Song, L. Decoupled networks. In Computer Vision and Pattern Recognition, 2018.
Liu, W., Lin, R., Liu, Z., Rehg, J. M., Xiong, L., Weller, A., and Song, L. Orthogonal over-parameterized training. arXiv:2004.04690, 2020.
McAllester, D. A. Some PAC-Bayesian theorems. In Conference on Computational Learning Theory, 1998.
Miller, K. and MacKay, D. The role of constraints in Hebbian learning. Neural Computation, 1994.
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
Nesterov, Y. Introductory lectures on convex optimization: A basic course. In Applied Optimization, 2004.
Qiao, S., Wang, H., Liu, C., Shen, W., and Yuille, A. Micro-batch training with batch-channel normalization and weight standardization. arXiv:1903.10520, 2019.
Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
Rochester, N., Holland, J., Haibt, L., and Duda, W. Tests on a cell assembly theory of the action of the brain, using a large digital computer. Information Theory, 1956.
Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Neural Information Processing Systems, 2016.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
Shen, Y., Wang, J., and Navlakha, S. A correspondence between normalization strategies in artificial and biological neural networks. In From Neuroscience to Artificially Intelligent Systems, 2020.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
Sun, R. Optimization for deep learning: theory and algorithms. arXiv:1912.08957, 2019.
Tieleman, T. and Hinton, G. E. Lecture 6.5—RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
Turrigiano, G. The self-tuning neuron: Synaptic scaling of excitatory synapses. Cell, 2008.
Valle-Perez, G., Camargo, C. Q., and Louis, A. A. Deep learning generalizes because the parameter–function map is biased towards simple functions. In International Conference on Learning Representations, 2019.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Neural Information Processing Systems, 2017.
von der Malsburg, C. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 1973.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. In Neural Information Processing Systems, 2017.
You, Y., Gitman, I., and Ginsburg, B. Scaling SGD batch size to 32K for ImageNet training. Technical Report UCB/EECS-2017-156, University of California, Berkeley, 2017.
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
Appendix A    Experimental Details

All code is to be found at github.com/jxbz/nero. This appendix records important details of the implementations and their hyperparameters.

MNIST classification  These experiments used a multilayer perceptron (MLP) network. An L-layer architecture consisted of (L − 1) layers of dimension 784 × 784 followed by an output layer of dimension 784 × 10. A "scaled relu" nonlinearity was used, defined by φ(x) := √2 · max(0, x). The factor of √2 was motivated by Kaiming init (He et al., 2015) and was not tuned. The reparameterisation experiment used L = 5 layers and trained for 5 epochs without learning rate decay. The very deep MLP used L = 100 layers and trained for 50 epochs with the learning rate decayed by a factor of 0.9 at the end of every epoch, and with the initial learning rate tuned over {0.0001, 0.001, 0.01, 0.1}. Training took place on an unknown Google Colab GPU. On an NVIDIA Tesla P100 GPU, the 5-layer MLP took ~1 minute to train and the 100-layer MLP took ~30 minutes.

CIFAR-10 cGAN  Equal learning rates were used in the generator and discriminator. The initial learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0} for all optimisers. The networks were trained for 120 epochs, with the learning rate decayed by a factor of 10 at epoch 100. The momentum parameter in SGD and β_1 in Adam and LAMB were tuned over {0.0, 0.9}. Nero's β and β_2 in Adam and LAMB were set to 0.999 without tuning. Training took around 3 hours on an NVIDIA RTX 2080Ti GPU.

CIFAR-10 classification  All models were trained for 200 epochs, with 5 epochs of linear learning rate warm-up and learning rate decay by a factor of 0.2 at epochs 100, 150 and 180. The initial learning rates were tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. Training was performed on an NVIDIA RTX 2080Ti GPU. Training time for the VGG-11 network was ~1 hour, and for ResNet-18 was ~2 hours.

ImageNet classification  For training with SGD + momentum + weight decay, the initial learning rate was set to 0.1, momentum was set to 0.9 and weight decay was set to 0.0001. These settings follow He et al. (2016). One epoch of linear learning rate warm-up was used, followed by 89 epochs of cosine annealing. The batch size was set to 400 for ResNet-50 to fit the GPU vRAM budget, and this was in the range known to yield good performance (Goyal et al., 2017). This paper's implementation surpassed the target ImageNet top-1 accuracy of 76.3% for ResNet-50 (Goyal et al., 2017; You et al., 2020). The training was distributed over four NVIDIA RTX 2080Ti GPUs, taking ~45 hours per training run. The total number of GPU hours for all ImageNet experiments in this paper was ~1500.

Wikitext-2 language model  The small transformer model was trained for 20 epochs, with the learning rate decayed by a factor of 0.1 at epoch 10. The initial learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. The batch size was set to 20. Training on an NVIDIA RTX 2080Ti GPU took ~15 minutes.

WMT16 En–De translation  The large transformer model was trained for 100 epochs, with a linear warm-up from epoch 0 to 50, and linear annealing from epoch 50 to 100. The maximum learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. A batch size of 128 was used. Training took ~1 hour on an NVIDIA RTX 2080Ti GPU.

Reinforcement learning  Hyperparameter settings followed Kostrikov (2018), except for the initial learning rate and the total number of environment steps. The number of steps was fixed to 5 million, and the initial learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. The policy network combined convolutional image feature extractors with dense output layers. Training was performed on an NVIDIA RTX 2080Ti GPU, and the training time was ~1.5 hours.

Since the CIFAR-10 experiments in Figures 1 and 2 were intended to probe the fundamental properties of optimisers rather than their performance under a limited tuning budget, a more fine-grained learning rate search was conducted. Specifically, the learning rates were tuned over {0.01, 0.02, 0.04, 0.06, 0.08, 0.1}. The best results are listed in the following table:

Optimiser | Fix Mean | Fix Norm | Top-1 Error   | Best η
Nero      |          |          | 12.17% ± 0.08 | 0.02
Nero      | ✓        |          | 11.99% ± 0.14 | 0.01
Nero      |          | ✓        | 10.03% ± 0.24 | 0.02
Nero      | ✓        | ✓        | 8.61% ± 0.22  | 0.02
Madam     |          |          | 12.77% ± 0.20 | 0.02
Madam     | ✓        | ✓        | 12.60% ± 0.12 | 0.06
LAMB      |          |          | 12.73% ± 0.10 | 0.02
LAMB      | ✓        | ✓        | 8.88% ± 0.08  | 0.06
