4 Nero: the Neuronal Rotator

Following the discussion in Section 3, this paper will consider an optimisation algorithm that makes per-neuron relative updates (Definition 5) constrained to the space of balanced networks (Definition 2). Since a balanced neuron is constrained to the unit hypersphere, a per-neuron relative update with step size η corresponds to a pure rotation of the neuron's weight vector by angle ≈ η. To see this, take η small in the following picture:

Algorithm 1 Nero optimiser. “Out-of-the-box” hyperparameter defaults are η = 0.01 and β = 0.999. The constant σ_b ∈ ℝ+ refers to the initialisation scale of the biases.
  Input: step size η ∈ (0, 1], averaging constant β ∈ [0, 1)
  repeat
    for each neuron do
      ▷ get weight & bias gradients g_w ∈ ℝⁿ and g_b ∈ ℝ
      ▷ update running averages:
        ḡ_w² ← β · ḡ_w² + (1 − β) · ‖g_w‖₂²
        ḡ_b² ← β · ḡ_b² + (1 − β) · g_b²
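As a concrete illustration of the pieces shown above, the following is a minimal NumPy sketch of a per-neuron relative update constrained to the unit hypersphere. The running-average bookkeeping mirrors Algorithm 1; the RMS-normalised step and the projection back onto unit-norm weights are assumptions made here for illustration, since the remainder of Algorithm 1 is not reproduced above.

    import numpy as np

    def nero_style_update(w, g_w, gbar_sq, eta=0.01, beta=0.999, eps=1e-8):
        """One per-neuron step in the spirit of Algorithm 1 (sketch only).

        w        -- weight vector of a single neuron, assumed unit-norm
        g_w      -- gradient of the loss with respect to w
        gbar_sq  -- running average of the squared gradient norm for this neuron
        """
        # running average of the squared gradient norm (as in Algorithm 1)
        gbar_sq = beta * gbar_sq + (1.0 - beta) * np.dot(g_w, g_w)

        # assumed relative update: a step of size ~eta in the direction of the
        # RMS-normalised gradient (this detail is not shown in the excerpt above)
        w = w - eta * g_w / (np.sqrt(gbar_sq) + eps)

        # project back onto the unit hypersphere, keeping the neuron balanced
        w = w / np.linalg.norm(w)
        return w, gbar_sq

    # usage: for small eta, each step is approximately a rotation by angle eta
    w = np.random.randn(64); w /= np.linalg.norm(w)
    w, gbar_sq = nero_style_update(w, np.random.randn(64), gbar_sq=0.0)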
In Section 5.6, Nero is benchmarked across a range of popular tasks. In all figures, the mean and range are plotted over three repeats. For Nero, out-of-the-box refers to setting η = 0.01 and β = 0.999.

Since Bernstein et al. (2020b) found that per-synapse relative updates led to slightly degraded performance, while per-layer relative updates typically perform well (You et al., 2017, 2020; Bernstein et al., 2020a), this section focuses on per-neuron relative updates.

Figure 3. Left: Training a 5 layer perceptron normalised via reparameterisation (Equation 3) on MNIST. For a fixed Adam learning rate, training is sensitive to the scale σ of the raw weights w̃. This motivates the different approach taken by Nero. Right: Using Nero to train a 100 layer perceptron, without batch norm or skip connections, to classify MNIST.

The initial weights w̃ were drawn from N(0, σ²), and the experiment was repeated for σ = 1 and σ = 100. The Adam optimiser was used for training with a fixed learning rate of 0.01. As can be seen in Figure 3 (left), the training performance was sensitive to the weight scale σ, despite the fact that a weight normalisation scheme was being used.

The unnecessary scale freedom of reparameterisation can lead to other undesired consequences, such as numerical overflow. Nero completely eliminates this issue by implementing balanced networks via projected gradient descent.
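To make the scale freedom concrete, the following sketch assumes Equation 3 is the standard weight-normalisation reparameterisation w = w̃ / ‖w̃‖₂ per neuron. Because Adam takes raw-space steps of roughly fixed size, the same step moves the normalised weights about 1/σ as far when the raw weights are scaled by σ, which is one way to understand the sensitivity in Figure 3 (left).

    import numpy as np

    rng = np.random.default_rng(0)
    w_raw = rng.normal(size=64)           # raw weights at unit scale
    delta = 0.01 * rng.normal(size=64)    # a fixed-size raw-space step, roughly what Adam applies

    for sigma in (1.0, 100.0):
        w_tilde = sigma * w_raw                                    # rescale the raw weights
        w_before = w_tilde / np.linalg.norm(w_tilde)               # assumed Equation 3: normalise
        w_after = (w_tilde + delta) / np.linalg.norm(w_tilde + delta)
        # the effective (normalised) weights move roughly 1/sigma as far
        print(sigma, np.linalg.norm(w_after - w_before))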
5.4 Nero Trains Deeper Networks
Very deep networks are typically difficult to train without architectural modifications such as residual connections (He et al., 2016) or batch norm (Ioffe & Szegedy, 2015). To test whether Nero enables training very deep models without such modifications, Figure 3 (right) shows the results of training a very deep multilayer perceptron (MLP) on the MNIST dataset. Unlike SGD, Adam and LAMB, Nero could reliably train a 100-layer MLP.
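For reference, a plain 100-layer MLP of the kind used in this experiment can be written in a few lines of PyTorch; the hidden width, activation and output head below are illustrative choices, since the text does not specify them.

    import torch.nn as nn

    def deep_mlp(depth=100, width=128, in_dim=784, n_classes=10):
        """A plain MLP: no batch norm, no skip connections (sketch)."""
        layers = [nn.Flatten(), nn.Linear(in_dim, width), nn.ReLU()]
        for _ in range(depth - 2):                  # hidden layers
            layers += [nn.Linear(width, width), nn.ReLU()]
        layers.append(nn.Linear(width, n_classes))  # output layer
        return nn.Sequential(*layers)

    model = deep_mlp()                              # 100 linear layers in total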
5.5 Highly Tuned SGD can Outperform Nero

This section compares Nero out-of-the-box to an SGD implementation with tuned learning rate, weight decay and momentum. The comparison was made for training a ResNet-50 image classifier on the ImageNet dataset. As can be seen in Figure 4, SGD with tuned learning rate, momentum, and weight decay outperformed Nero. However, the optimal set of SGD hyperparameters was brittle, and ablating weight decay alone increased the top-1 validation error by 5%.

Figure 4. Training a ResNet-50 network to classify the ImageNet dataset. Nero uses its out-of-the-box default hyperparameters η = 0.01 and β = 0.999. SGD+wd uses initial learning rate 0.1, momentum 0.9 and weight decay (wd) 0.0001 as tuned by He et al. (2016). SGD is also shown without weight decay.
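For concreteness, the tuned baseline in Figure 4 corresponds to a standard torch.optim.SGD configuration. The Nero line is left as a comment because the constructor shown is only an assumed signature, not a documented API.

    import torch
    from torchvision.models import resnet50

    model = resnet50()

    # SGD+wd, as tuned by He et al. (2016) and used in Figure 4
    sgd_wd = torch.optim.SGD(model.parameters(), lr=0.1,
                             momentum=0.9, weight_decay=1e-4)

    # ablation in Figure 4: identical settings but without weight decay
    sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # out-of-the-box Nero defaults (hypothetical constructor signature):
    # nero = Nero(model.parameters(), lr=0.01, beta=0.999)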
5.6 Nero Works Well Out-of-the-Box
This section probes the versatility and robustness of Nero by comparing its optimisation and generalisation performance with three popular alternatives (SGD, Adam, and LAMB) across six tasks. The tasks span the domains of computer vision, natural language processing, and reinforcement learning. A wide spectrum of neural architectures was tested, from convolutional networks to transformers.

Figure 5. Class-conditional GAN training on CIFAR-10. Equal learning rates were used in the generator and discriminator. The Fréchet Inception Distance (Heusel et al., 2017, FID) measures the distance between the sample statistics of real and fake data as represented at a deep layer of a pre-trained image classifier.

To make a fair comparison between optimisers, a fair hyperparameter tuning strategy is needed. In this section:

1. Learning rates were tuned over {10⁻⁴, 10⁻³, ..., 10⁰} (see the search sketch below).

2. For Adam, LAMB and SGD, the momentum hyperparameter was tuned to achieve good performance.
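The coarse grid in item 1 translates directly into a small search loop. The train_and_evaluate callable below is a placeholder for whichever task is being tuned; it is not a function from the paper.

    # learning-rate grid used for the benchmarks in this section
    learning_rates = [1e-4, 1e-3, 1e-2, 1e-1, 1e0]

    def tune(train_and_evaluate):
        """Train once per learning rate and keep the best validation result.

        train_and_evaluate: callable mapping a learning rate to a validation
        metric where lower is better (e.g. top-1 error or perplexity).
        """
        results = {lr: train_and_evaluate(lr) for lr in learning_rates}
        best_lr = min(results, key=results.get)
        return best_lr, results[best_lr]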
Table 1. Validation results for the best learning rate η. The best result is shown in bold, while the runner-up is underlined.
Figure 7. Training a language model on the Wikitext-2 dataset. A small transformer network was used, composed of 19 tensors. Nero achieved the best anytime performance.

GAN training is known to require extensive tuning: different learning rates are often used in the generator and discriminator (Heusel et al., 2017) and training is highly sensitive to momentum (Brock et al., 2019, p. 35). The class-conditional GAN model in this paper is based on the BigGAN architecture (Brock et al., 2019). This is a heterogeneous network involving a variety of building blocks: convolutions, embeddings, fully connected layers, attention layers, conditional batch norm and spectral norm (Miyato et al., 2018). The results are presented in Figure 5.
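Figure 5's note that equal learning rates were used in the generator and discriminator amounts to building one optimiser per network with a shared step size. The sketch below uses Adam as a stand-in, with β1 = 0.0 (one of the values tuned in the cGAN details later in this section); the text does not prescribe this exact construction.

    import torch

    def make_gan_optimisers(generator, discriminator, lr):
        """Equal learning rates in generator and discriminator (sketch)."""
        opt_g = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.0, 0.999))
        opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(0.0, 0.999))
        return opt_g, opt_d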
Image classification. In Section 5.5, Nero out-of-the-box was shown to outperform SGD without weight decay when training ResNet-50 on ImageNet. Due to limited computational resources, the authors of this paper were …
CIFAR-10 cGAN. Equal learning rates were used in the generator and discriminator. The initial learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0} for all optimisers. The networks were trained for 120 epochs, with the learning rate decayed by a factor of 10 at epoch 100. The momentum parameter in SGD and β1 in Adam and LAMB were tuned over {0.0, 0.9}. Nero's β and β2 in Adam and LAMB were set to 0.999 without tuning. Training took around 3 hours on an NVIDIA RTX 2080Ti GPU.

CIFAR-10 classification. All models were trained for 200 epochs, with 5 epochs of linear learning rate warm-up and learning rate decay by a factor of 0.2 at epochs 100, 150 and 180 (a sketch of this schedule follows below). The initial learning rates were tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. Training was performed on an NVIDIA RTX 2080Ti GPU. Training time for the VGG-11 network was ~1 hour, and for ResNet-18 was ~2 hours.

WMT16 En–De translation. The large transformer model was trained for 100 epochs, with a linear warm-up from epoch 0 to 50, and linear annealing from epoch 50 to 100. The maximum learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. A batch size of 128 was used. Training took ~1 hour on an NVIDIA RTX 2080Ti GPU.

Reinforcement learning. Hyperparameter settings followed Kostrikov (2018), except for the initial learning rate and the total number of environment steps. The number of steps was fixed to 5 million, and the initial learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. The policy network combined convolutional image feature extractors with dense output layers. Training was performed on an NVIDIA RTX 2080Ti GPU, and the training time was ~1.5 hours.
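The CIFAR-10 classification schedule above (5 warm-up epochs, then decay by a factor of 0.2 at epochs 100, 150 and 180) can be written as a multiplier on the tuned initial learning rate. The LambdaLR sketch below is one reasonable implementation under that reading; the exact warm-up shape is an assumption.

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    def cifar_schedule(epoch):
        """Multiplier applied to the initial learning rate at a given epoch."""
        if epoch < 5:                        # 5 epochs of linear warm-up (assumed shape)
            return (epoch + 1) / 5
        factor = 1.0
        for milestone in (100, 150, 180):    # decay by 0.2 at each milestone
            if epoch >= milestone:
                factor *= 0.2
        return factor

    params = [torch.zeros(3, requires_grad=True)]      # stand-in parameters
    optimiser = torch.optim.SGD(params, lr=0.1)        # initial lr chosen from the grid
    scheduler = LambdaLR(optimiser, lr_lambda=cifar_schedule)

    for epoch in range(200):
        # ... one epoch of training would go here ...
        scheduler.step()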
Since the experiments in Figures 1 and 2 were intended
to probe the fundamental properties of optimisers rather
than their performance under a limited tuning budget,
a more fine-grained learning rate search was conducted.
Specifically, the learning rates were tuned over {0.01, 0.02,
0.04, 0.06, 0.08, 0.1}. The best results are listed in the
following table: