
Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences

Andis Draguns, Emīls Ozoliņš, Agris Šostaks, Matīss Apinis, Kārlis Freivalds

Institute of Mathematics and Computer Science, University of Latvia. Correspondence to: Kārlis Freivalds <[email protected]>.

arXiv:2004.04662v1 [cs.LG] 6 Apr 2020

Abstract

Attention is a commonly used mechanism in sequence processing, but it is of O(n^2) complexity, which prevents its application to long sequences. The recently introduced neural Shuffle-Exchange network offers a computation-efficient alternative, enabling the modelling of long-range dependencies in O(n log n) time. The model, however, is quite complex, involving a sophisticated gating mechanism derived from the Gated Recurrent Unit. In this paper, we present a simple and lightweight variant of the Shuffle-Exchange network, which is based on a residual network employing GELU and Layer Normalization. The proposed architecture not only scales to longer sequences but also converges faster and provides better accuracy. It surpasses the Shuffle-Exchange network on the LAMBADA language modelling task and achieves state-of-the-art performance on the MusicNet dataset for music transcription while using significantly fewer parameters. We show how to combine the Shuffle-Exchange network with convolutional layers, establishing it as a useful building block in long sequence processing applications.

1. Introduction

More and more applications of sequence processing performed by neural networks require dealing with long inputs. A key requirement is to allow modelling of dependencies between distant parts of the sequences. Such long-range dependencies occur in natural language when the meaning of some word depends on other words in the same or a previous sentence. There are important cases, e.g., resolving coreferences, when such distant information may not be disregarded.

In music, dependencies occur on several scales. At the finest scale, samples of the waveform correlate to form note pitches; at medium scale, neighbouring notes relate to each other by forming melodies and chord progressions; at coarse scale, common melodies reappear throughout the entire piece, creating a coherent musical form (Thickstun et al., 2016; Huang et al., 2019). Dealing with such dependencies requires processing very long sequences (several pages of text or an entire musical composition) in a manner that aggregates information from their distant parts. Especially challenging are approaches that work directly on the raw waveform.

The ability to combine distant information is even more important for algorithmic tasks, where each output symbol typically depends on every input symbol. The goal of algorithm synthesis is to derive an algorithm from input-output examples, which are often given as sequences. Algorithmic tasks are especially challenging due to the need for processing sequences of unlimited length. Also, generalization plays an important role since training is often performed on short sequences but testing on long ones.

The commonly used (for example, in Transformers) attention mechanism is of quadratic complexity in the sequence length and therefore is not an attractive choice for long sequences. Recently, Freivalds et al. (2019) introduced neural Shuffle-Exchange networks that allow modelling of long-range dependencies in sequences in O(n log n) time. The idea is very promising and offers a computation-efficient alternative to the attention mechanism.

In this paper, we present a much simpler, faster and more lightweight version of the Shuffle-Exchange network which is based on the residual network idea, employing Gaussian Error Linear Units (Hendrycks and Gimpel, 2016) and Layer Normalization (Ba et al., 2016) instead of gates.

We empirically validate our improved model on algorithmic tasks, LAMBADA question answering and multi-instrument musical note recognition (MusicNet dataset). It surpasses the original Shuffle-Exchange network by 2.1% on the LAMBADA language modelling task and achieves a state-of-the-art 78.02% average precision score on MusicNet, which is 3.8% better than the previous best result.
We introduce a modification where we prepend the Residual Shuffle-Exchange network with strided convolutions to increase its speed and applicability to long sequences even more. This enables processing a sequence of length 2M symbols in only 3.97 seconds.

2. Related Work

Recurrent networks, in particular LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014), are important tools for sequence processing. They can efficiently process sequences of any length and have the ability to remember arbitrarily long dependencies. Although successful at natural language processing, they process symbols one by one and have limited state memory, hence can remember only a limited number of such dependencies.

Another option is using convolutional architectures (Gehring et al., 2017). But convolutions are inherently local: the value of a particular neuron depends on a small neighbourhood of the previous layer. To bring in more distant relationships, it is common to augment the network with the attention mechanism. The attention mechanism has become a standard choice in numerous neural models, including Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2018), which achieve state-of-the-art accuracy in NLP and related tasks. But the complexity of the attention mechanism is quadratic in the input length and does not scale to long sequences.

An obvious way to overcome the complexity of attention is cutting the sequence into short segments and using attention only within the segment boundaries (Al-Rfou et al., 2018). To at least partially recover the lost information, recurrent connections can be added between the segments (Dai et al., 2019). Sparse Transformers (Child et al., 2019) reduce the complexity of attention to O(n√n) by attending only to a small predetermined subset of locations. Star-Transformer (Guo et al., 2019) sparsifies attention even more by pushing all the long-range dependency information through one central node and reaches linear time performance. Clark and Gardner (2017) enable document-level question answering by preselecting the paragraph most likely to contain the answer. Reformer (Kitaev et al., 2020) uses locality-sensitive hashing to approximate attention in O(n log n) time.

A different way to capture long-range structure is to increase the receptive field of convolution by using dilated (atrous) convolution, where the convolution mask is spread out at regular spatial intervals. Dilated architectures have achieved great success in image segmentation (Yu and Koltun, 2015) and audio generation (van den Oord et al., 2016).

An important use of sequence processing models is in learning algorithmic tasks (see (Kant, 2018) for a good overview), where the way memory is accessed is crucial. An attention mechanism to access memory is used in Pointer Networks (Vinyals et al., 2015). Specialized memory modules, which are controlled by a mechanism similar to attention, are used by the Neural Turing Machine (Graves et al., 2014) and the Differentiable Neural Computer (Graves et al., 2016). The Neural GPU (Kaiser and Sutskever, 2015) utilizes active memory (Kaiser and Bengio, 2016), where computation is coupled with memory access. This architecture can learn fairly complicated algorithms such as long number addition and multiplication. But the coupling of computation and memory introduces a limitation: for the information to travel from one end of the sequence to the other, Ω(n) layers are required, which results in Ω(n^2) total complexity. The flow of information is facilitated by introducing diagonal gates in (Freivalds and Liepins, 2018), which improves training and generalization but does not address the performance problem caused by many layers.

For music transcription tasks, convolutional architectures are common; see (Benetos et al., 2019) for a good overview. In the article introducing the MusicNet dataset (Thickstun et al., 2016), a convolutional network performed the best. A notable performance on MusicNet is achieved by a convolutional network based on complex numbers (Trabelsi et al., 2018). Recently, state-of-the-art performance was achieved by a Transformer network employing the Fourier transform of the waveform in the complex domain (Yang et al., 2019).

3. Neural Shuffle-Exchange Networks

Neural Shuffle-Exchange networks (Freivalds et al., 2019) have recently been proposed as an efficient alternative to the attention mechanism that allows modelling of long-range dependencies in sequences in O(n log n) time. The neural Shuffle-Exchange network is the neural adaptation of the Shuffle-Exchange and Beneš networks, which are well known from packet routing tasks in computer networks. The Shuffle-Exchange network has a regular layered structure and consists of repeated applications of two stages – shuffle and exchange. Fig. 1 shows a Shuffle-Exchange network for routing 8 messages from the left side to the right.

Figure 1. Shuffle-Exchange network.

Figure 2. Residual Shuffle-Exchange network with 2 Beneš blocks and 8 inputs.

First comes the exchange stage, where elements are divided into adjacent pairs, and each pair is passed through a switch. The switch contains logic to select which input is routed to which output. The shuffle stage follows (depicted as arrows in the figures), where messages are permuted according to the perfect-shuffle permutation. The perfect shuffle is a method of shuffling a deck of cards by splitting the deck into two halves and then interleaving the halves. In this permutation, the destination address is a cyclic bit shift (left or right) of the source address. The network for routing 2^k messages contains k exchange stages and k − 1 shuffle stages. It is proven that the switches can always be programmed in a way that connects any source to any destination through the network (Dally and Towles, 2004). But the throughput of the Shuffle-Exchange network is limited: it may not be possible to route several messages simultaneously. A better design for multiple message routing is the Beneš network.
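To make the shuffle stage concrete, here is a minimal Python sketch (our illustration, not code from the paper) of the perfect-shuffle permutation as a cyclic rotation of the k-bit source address; the rotation direction is a convention.

```python
def perfect_shuffle(seq):
    """Permute a sequence of length 2^k by a cyclic left rotation of the
    k-bit source address; this matches the card-deck interleaving above."""
    n = len(seq)
    k = n.bit_length() - 1
    assert n == 2 ** k, "length must be a power of two"
    out = [None] * n
    for src in range(n):
        dst = ((src << 1) | (src >> (k - 1))) & (n - 1)  # rotate address bits left
        out[dst] = seq[src]
    return out

# Splitting "abcdefgh" into halves "abcd" / "efgh" and interleaving them:
print(perfect_shuffle(list("abcdefgh")))  # ['a', 'e', 'b', 'f', 'c', 'g', 'd', 'h']
```

The mirrored (inverse) shuffle used in the second half of a Beneš block simply rotates the address bits in the opposite direction.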

The Beneš network is formed by connecting a Shuffle-Exchange network with its mirror copy. The mirror copy is obtained by reversing the direction of the bit shift in the destination address calculation. Such a network is able to route 2^k messages according to any input-to-output permutation. To do that, 2k − 1 exchange stages and 2k − 2 shuffle stages are needed (Dally and Towles, 2004).

The neural Shuffle-Exchange network adapts the structure of the Beneš network and places a learnable 2-to-2 function in each switch. The original Switch Unit of (Freivalds et al., 2019) is based on the GRU, but in this work, we give a better implementation based on a residual network.

4. The Model

We propose the Residual Shuffle-Exchange network – a simpler and faster replacement for the neural Shuffle-Exchange network. The Residual Shuffle-Exchange network consists of alternating Switch Layers and Shuffle Layers and uses the same architecture and weight sharing as the neural Shuffle-Exchange network (we do not use skip connections between Beneš blocks as in the original model, as they do not help our improved model); for an example, see Fig. 2, which depicts a Residual Shuffle-Exchange network with 2 Beneš blocks for an input sequence of length 8.

We replace the Switch Unit with the Residual Switch Unit (RSU) employing GELU and Layer Normalization. Our network has an input of 2^k cells, where each cell is a vector of size m.

In the Switch Layer, we apply the RSU to adjacent non-overlapping pairs of input cells. The RSU has two inputs [i1, i2] and two outputs [o1, o2].

Creating the pairs is technically implemented as reshaping the sequence i into a twice shorter sequence where each new cell concatenates two adjacent cells [i1, i2] along the feature dimension. The RSU consists of two linear transformations on the feature dimension. The first linear transformation is followed by Layer Normalization (LayerNorm) without output bias and gain (Xu et al., 2019) and then by the Gaussian Error Linear Unit (GELU) (Hendrycks and Gimpel, 2016). By default, we use a 2x larger middle layer size than the input of the first layer; this is a good compromise between speed and accuracy (see Section 5.4). A second linear transformation is applied after GELU. The RSU is defined as follows:

i = [i1, i2]
g = GELU(LayerNorm(Z i))
c = W g + B
[o1, o2] = σ(S) ⊙ i + h · c

In the above equations, Z and W are weight matrices of size 2m × 4m and 4m × 2m, respectively, S is a vector of size 2m and B is a bias vector – all of those are learnable parameters; h is a scalar value, ⊙ denotes element-wise vector multiplication and σ is the sigmoid function.
Figure 3. Residual Switch Unit. It has two inputs and two outputs. The number of feature maps is given in parentheses.
The output of the RSU is connected with its input through a residual connection. This connection is scaled by a learnable parameter S, which is restricted to the range [0, 1] by the sigmoid function. Additionally, we scale the new value c coming out of the last linear transformation by a constant h. We initialize S and h such that the signal travelling through the network keeps its expected amplitude at 0.25 under the assumption of the normal distribution (which is observed in practice). To have that, we initialize S as σ⁻¹(r) and h as √(1 − r²) · 0.25, where r is an experimentally chosen constant close to 1. We use r = 0.9, which works well. The signal amplitude after LayerNorm is 1, and the weight matrix W is initialized to keep this amplitude. If the amplitude of the input is 0.25, then the expected amplitude at the output is also 0.25, which is a good range for the softmax loss. During training, the network is free to adjust these amplitudes, but this initialization provides stable convergence even for deep networks.
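The following NumPy sketch (our illustration, not the paper's TensorFlow implementation) spells out the RSU forward pass defined by the equations above, the pairing of adjacent cells by reshaping, and the S, h initialization just described.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm over the feature dimension, without output bias and gain.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # Gaussian Error Linear Unit (tanh approximation).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def rsu(i, Z, W, B, S, h):
    """Residual Switch Unit: i has shape (..., 2m); Z is (2m, 4m), W is (4m, 2m),
    B and S are length-2m vectors, h is a scalar."""
    g = gelu(layer_norm(i @ Z))
    c = g @ W + B
    return 1.0 / (1.0 + np.exp(-S)) * i + h * c   # sigma(S) * i + h * c

def switch_layer(x, Z, W, B, S, h):
    """Apply the RSU to adjacent non-overlapping pairs by reshaping a
    (length, m) sequence into (length / 2, 2m) and back."""
    n, m = x.shape
    return rsu(x.reshape(n // 2, 2 * m), Z, W, B, S, h).reshape(n, m)

# Initialization of the residual scale, following the amplitude argument above.
m, r = 192, 0.9
S0 = np.full(2 * m, np.log(r / (1 - r)))       # sigma^{-1}(r)
h0 = np.sqrt(1 - r ** 2) * 0.25
```

Alternating such switch layers with the shuffle and inverse-shuffle permutations yields one Beneš block.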
4.1. Prepending convolutions

There are tasks, e.g. the MusicNet task, where there is a large mismatch of information content between the input and the Residual Shuffle-Exchange network: each input unit contains one sample, but the Residual Shuffle-Exchange network requires a large number of feature maps to work well. Encoding just one number into many feature maps is wasteful. For such tasks, we prepend the Residual Shuffle-Exchange network with several strided convolutions to increase the number of feature maps and reduce the sequence length. We use convolutions with stride 2 and apply LayerNorm and GELU after each convolution, like in the RSU. Before the result is passed to the Residual Shuffle-Exchange network, a linear transformation is applied. The structure of the network with two prepended convolutional layers for processing inputs of length 4096 is depicted in Fig. 4.

Figure 4. The architecture with two prepended convolutions employed for the MusicNet task.

Prepending convolutions shortens the input to the RSE network and speeds up processing. The obtained accuracy for MusicNet is roughly the same; see the analysis below. Note that this approach leads to a shorter output than the input. It may be necessary to append transposed convolution layers at the end of the network to upsample the signal back to its original length. For the MusicNet task, upsampling is not necessary since we utilize only a few elements of the output sequence.
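As an illustration of this front-end (a sketch, not the released implementation; the kernel size is an assumption, and the filter counts follow the shape annotations of Fig. 4), a stride-2 Conv1D stack with LayerNorm and GELU followed by the linear projection could look like this in TensorFlow:

```python
import tensorflow as tf

def conv_frontend(filters_per_stage=(96, 384), out_width=192, kernel_size=4):
    """Stride-2 Conv1D stages, each followed by LayerNorm (no bias/gain) and GELU,
    then a linear projection to the RSE feature width."""
    layers = []
    for filters in filters_per_stage:
        layers += [
            tf.keras.layers.Conv1D(filters, kernel_size, strides=2, padding="same"),
            tf.keras.layers.LayerNormalization(center=False, scale=False),
            tf.keras.layers.Activation(tf.nn.gelu),
        ]
    layers.append(tf.keras.layers.Dense(out_width))  # linear transform before the RSE
    return tf.keras.Sequential(layers)

# A 4096-sample waveform window becomes 1024 cells of width 192:
x = tf.random.normal([1, 4096, 1])
print(conv_frontend()(x).shape)   # (1, 1024, 192)
```

Each stride-2 stage halves the sequence length, so two stages map a 4096-sample window to 1024 cells.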

5. Evaluation

We have implemented the proposed architecture in TensorFlow. The code is available at https://github.com/LUMII-Syslab/RSE. All models are trained on a single Nvidia RTX 2080 Ti (11 GB) GPU with the RAdam optimizer (Liu et al., 2019).

5.1. Algorithmic tasks

Let us evaluate how well the Residual Shuffle-Exchange (RSE) network performs on algorithmic tasks in comparison with the neural Shuffle-Exchange (SE) network (Freivalds et al., 2019). The goal is to infer O(n log n) time algorithms purely from input-output examples. Algorithmic tasks are good benchmarks to evaluate the model's ability to develop a rich set of long-term dependencies.

We consider long binary addition, long binary multiplication and sorting, which are common benchmark tasks in several papers including (Freivalds et al., 2019; Kalchbrenner et al., 2015; Zaremba and Sutskever, 2015; Zaremba et al., 2016; Joulin and Mikolov, 2015; Grefenstette et al., 2015; Kaiser and Sutskever, 2015; Freivalds and Liepins, 2018; Dehghani et al., 2018).

The model for evaluation consists of an embedding layer where each symbol of the input is mapped to a vector of length m, one or two Beneš blocks, and an output layer which performs a linear transformation to the required number of classes with a softmax cross-entropy loss for each symbol independently. We use an RSE model having one Beneš block for the addition and sorting tasks, two blocks for the multiplication task, and m = 192 feature maps.

We use dataset generators and curriculum learning from (Freivalds et al., 2019). For training, we instantiate several models for sequence lengths (powers of 2) from 8 to 64 sharing the same weights and train each example on the smallest instance it fits. We pad the sequence up to the required length with zeroes. Figure 5 shows the testing accuracy on sequences of length 64 vs the training step. We can see that on the multiplication task the proposed model trains much faster than SE, reaching near-zero error in about 20K steps vs 200K steps for the SE. For the addition and sorting tasks, both models perform similarly.

Figure 5. Comparison of Residual Shuffle-Exchange (RSE) and Shuffle-Exchange (SE) networks on algorithmic tasks of length 64.
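A minimal sketch of the padding rule described above (illustrative only, not the authors' generator code): each training example is zero-padded to the smallest weight-shared model length it fits.

```python
def pad_to_model_length(seq, lengths=(8, 16, 32, 64), pad=0):
    """Pad an example with zeros up to the smallest model instance length it fits
    into; the instances share weights, so the padded example trains them all."""
    for n in lengths:
        if len(seq) <= n:
            return list(seq) + [pad] * (n - len(seq))
    raise ValueError("example longer than the largest training length")

print(pad_to_model_length([1, 0, 1, 1, 0]))  # padded to length 8
```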
Next, let us compare the generalization performance of both models, see Fig. 6. We train both models on lengths up to 64 and evaluate on lengths up to 4K. On the addition and sorting tasks, the proposed RSE model generalizes very well to length 256 but loses slightly to SE on longer sequences. For the multiplication task, the RSE model generalizes reasonably well to twice as long sequences but not more, whereas the old model does not generalize even this much.

Figure 6. Comparison of Residual Shuffle-Exchange (RSE) and Shuffle-Exchange (SE) generalization to longer sequences.
5.2. LAMBADA question answering

The goal of the LAMBADA task is to predict a given target word from its broad context (on average, 4.6 sentences collected from novels). The sentences in the LAMBADA dataset (Paperno et al., 2016) are specially selected such that giving the right answer requires examining the whole passage. In 81% of the test set cases, the target word can be found in the text, and we follow a common strategy (Chu et al., 2017; Dehghani et al., 2018) of choosing the target word as one from the text. The answer will be wrong in the remaining cases, so the achieved accuracy will not exceed 81%.

We instantiate the model for input length 256 (all test and train examples fit in this length) and pad the input sequence to that length by placing the sequence at a random position and adding zeros on both ends. Randomized padding improves test accuracy. We use a pretrained fastText 1M English word embedding (Mikolov et al., 2018) for the input words. The embedding layer is followed by 2 Beneš blocks with 384 feature maps. To perform the answer selection as a word from the text, each symbol of the output is linearly mapped to a single scalar, and we use a softmax loss over the obtained sequence to select the position of the answer word.
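The padding and answer-selection steps can be sketched as follows (an illustration under the description above; the function names are ours, not from the released code):

```python
import numpy as np

def pad_random(tokens, length=256, pad_id=0, rng=np.random):
    """Place a LAMBADA example at a random offset inside a fixed-length window,
    zero-padding both ends (the randomized padding described above)."""
    offset = rng.randint(0, length - len(tokens) + 1)
    padded = np.full(length, pad_id, dtype=np.int64)
    padded[offset:offset + len(tokens)] = tokens
    return padded, offset

def answer_position_loss(per_position_scores, target_position):
    """Softmax over the per-position scalars produced by the output projection,
    with cross-entropy against the position of the target word."""
    scores = per_position_scores - per_position_scores.max()
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[target_position]
```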

In Table 1, we give our results in the context of previous work. The Residual Shuffle-Exchange network scores better than the Shuffle-Exchange network by 2.1% while using 3x fewer learnable parameters. The current state-of-the-art model GPT-2 (Radford et al., 2019) surpasses our model by 8.9% while using 140 times more learnable parameters and pretraining on a huge dataset.

Table 1. Accuracy on the LAMBADA word prediction task

Model                                                      Learnable parameters (M)   Test accuracy (%)
Random word from passage (Paperno et al., 2016)            -                          1.6
Gated-Attention Reader (Chu et al., 2017)                  unknown                    49.0
Neural Shuffle-Exchange network (Freivalds et al., 2019)   33                         52.28
Residual Shuffle-Exchange network (this work)              11                         54.34
Universal Transformer (Dehghani et al., 2018)              152                        56.0
GPT-2 (Radford et al., 2019)                               1542                       63.24
Human performance (Chu et al., 2017)                       -                          86.0

In Fig. 7, we compare the training and evaluation time of the Residual Shuffle-Exchange (RSE), Shuffle-Exchange (SE) and Universal Transformer (UT) networks using configurations that reach their best test accuracy. We use the official Universal Transformer and Shuffle-Exchange implementations and measure the time for one training and evaluation step on a single sequence. For the Universal Transformer, we use its base configuration with 152M learnable parameters. The Shuffle-Exchange and Residual Shuffle-Exchange networks have 384 feature maps and 2 Beneš blocks, with total parameter counts of 33M and 11M, respectively. We evaluate sequence lengths that fit in the 11 GB of GPU memory. The Residual Shuffle-Exchange network works faster and can be evaluated on 4x longer sequences than the Shuffle-Exchange network and 128x longer sequences than the Universal Transformer.

Figure 7. Evaluation and training time on different input lengths (log-scale).

5.3. MusicNet

The music transcription dataset MusicNet (Thickstun et al., 2016) consists of 330 classical music recordings paired with the MIDI transcriptions of their notes. The total length of the recordings is 34 hours, and it features 11 different instruments. The task is to classify which notes are being played at each time step given the waveform. As multiple notes can be played at the same time, this is a multi-label classification task.

The dataset does not provide a separate validation set, so we split off 6 recordings from the training set as in (Trabelsi et al., 2018) and use them for validation. Like in (Yang et al., 2019), the waveform is downsampled from 44.1 kHz to 11 kHz. To perform classification, regularly spaced windows of a given length are extracted from the waveform, and we predict all the notes that are being played at the midpoint of the window.

We use an RSE model with two Beneš blocks with 192 feature maps. We experimentally found this to be the best configuration. To increase the training speed, we prepend two strided convolutions in front of that; see the analysis of other options below. To obtain the note predictions, we use the element in the middle of the sequence output by the RSE model, linearly transform it to 128 values, one for each note pitch, and apply the sigmoid cross-entropy loss function to perform multi-label classification.

We add an additional term to the loss function which predicts the notes played at regularly spaced intervals with stride 128 in the input sequence. This term is used only during training. The term is added because using only the middle element in the loss function seems to lead to lower accuracy in predicting the beginning and the end of the notes. The loss for these additional predictions is calculated in the same way as for the middle element. We find that adding this term to the loss function improves the training speed and the final accuracy of the model. For example, without adding this loss term we achieve 76.84%, which is 1.18% lower than with the added term.
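A hedged TensorFlow sketch of this two-term loss (our illustration; it assumes, for simplicity, that the output sequence has the same length as the input window, ignoring any shortening by the prepended convolutions, and note_projection is a hypothetical Dense(128) layer):

```python
import tensorflow as tf

NUM_PITCHES = 128
AUX_STRIDE = 128  # auxiliary note predictions every 128 positions (training only)

def musicnet_loss(rse_output, note_projection, mid_labels, strided_labels):
    """rse_output     : (batch, window, features) network output
       note_projection: tf.keras.layers.Dense(NUM_PITCHES) producing per-pitch logits
       mid_labels     : (batch, 128) multi-hot notes at the window midpoint
       strided_labels : (batch, window // AUX_STRIDE, 128) notes at strided positions"""
    window = rse_output.shape[1]
    mid_logits = note_projection(rse_output[:, window // 2, :])
    mid_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=mid_labels,
                                                       logits=mid_logits)
    aux_logits = note_projection(rse_output[:, ::AUX_STRIDE, :])
    aux_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=strided_labels,
                                                       logits=aux_logits)
    return tf.reduce_mean(mid_loss) + tf.reduce_mean(aux_loss)
```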
We train the model for 800k iterations, which corresponds to approximately 8 epochs. We found this value by examining the training dynamics on the validation set.

For evaluating the model, we use the average precision score (APS), which is the area under the precision-recall curve. This metric is well suited to prediction tasks with imbalanced classes and is suggested for the MusicNet dataset in the original paper (Thickstun et al., 2016).
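For reference, one common way to compute such a score is scikit-learn's average_precision_score, micro-averaged over all (frame, pitch) pairs; whether this exactly matches the paper's evaluation script is an assumption on our part.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Placeholder data: y_true holds multi-hot note labels, y_score per-pitch probabilities.
y_true = np.random.randint(0, 2, size=(1000, 128))
y_score = np.random.rand(1000, 128)
aps = average_precision_score(y_true, y_score, average="micro")
print(f"APS: {aps:.4f}")
```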

We train the model on different window sizes ranging from 128 to 8192, see Fig. 8. We find that larger windows invariably lead to better accuracy. The best APS score of 78.02% is obtained on length 8192, which is 3.8% better than the previous state of the art achieved by the Complex Transformer (Yang et al., 2019); see Table 2 for the comparison with other works.

Figure 8. MusicNet accuracy depending on the window size.

Figure 9. Training and evaluation speed (log-scale) depending on the prepended convolution count on the MusicNet task.

Table 2. Performance on the MusicNet music transcription dataset

Model                                            Learnable parameters (M)   APS (%)
cgRNN (Wolter and Yao, 2018)                     2.36                       53.00
Deep Real Network (Trabelsi et al., 2018)        10.00                      69.80
Deep Complex Network (Trabelsi et al., 2018)     17.14                      72.90
Complex Transformer (Yang et al., 2019)          11.61                      74.22
Residual Shuffle-Exchange network (this work)    3.06                       78.02

Table 3. Accuracy (APS) depending on the number of convolutional layers on length 1024

Convolutional layer count   APS (%)
0                           69.29
1                           69.57
2                           68.95
3                           67.39

Figure 10. The top picture shows the predictions of our model on the test set. The bottom picture shows the corresponding labels.

For a visualisation of the notes predicted with window size 8192 and the correct labels, see Fig. 10. The notes are generally well predicted, but their start and end times are smoothed out, especially for the lower pitches. The predicted note shading represents the classification confidence. In real applications, an appropriate threshold should be applied to get the note duration.

We have tested how prepending convolutions impacts the speed and accuracy of the model. Table 3 shows the obtained accuracy on window size 1024 for a different number of convolutions. We find that one convolution gives the best results, although the differences are small. We chose to use two convolutions for a good balance between speed and accuracy.

Fig. 9 shows the training and evaluation speed of the model depending on the number of convolutions. We use a batch size of one example in this test to see the sequence length limit at which our model can be trained and tested on a single GPU. The training lines stop at the length at which the model no longer fits in the GPU memory. The testing lines reach a 2M technical limitation of our implementation. Increasing the number of convolution layers improves training and testing speed, and the version with two convolutions can be trained on sequences up to length 128K.

Figure 11. Ablation experiments. The plot shows test error on the multiplication task (length 128) vs training step.

Figure 12. Test error depending on the middle layer size (m is the number of feature maps of the model).

5.4. Ablation study

We have chosen the multiplication task as a showcase for the ablation study. It is a hard task which challenges every aspect of the model, and performance differences are clearly observable. We use a model with 2 Beneš blocks and 192 feature maps and train it on length 128. We consider the following simplifications of the proposed architecture:

• removing LayerNorm (without LayerNorm)
• using ReLU instead of GELU
• removing the residual connection; the last equation of the RSU becomes [o1, o2] = c (without residual)
• setting the residual weight parameter σ(S) to a constant 1 instead of a learnable parameter; the equation becomes [o1, o2] = i + h · c (without residual scale)

We can see in Fig. 11 that the proposed baseline performs the best. Versions without the residual connection or without normalization do not work well. The residual weight parameter and the GELU non-linearity give a smaller contribution to the model's performance.
In Fig. 12, we investigate the effect of the RSU middle layer size on the performance. The parameter count and speed of the model are directly proportional to the middle layer size; therefore, we want to select the smallest size which gives a good performance. By default, we use 2m feature maps, where m is the number of feature maps of the model. Versions with half as many or twice as many middle maps are explored. We see that a larger middle map count gives better performance. We consider the choice of 2m inner maps a good compromise between performance and parameter count.
We have performed ablation experiments also for the LAMBADA and MusicNet tasks, with similar conclusions, but the differences are much less pronounced.

6. Conclusions

We have introduced a new Residual Shuffle-Exchange neural network model that outperforms the previous one in terms of accuracy, speed and simplicity. It has O(n log n) complexity and enables processing of sequences up to length 2 million, where standard methods, like attention, fail. We have shown how to combine the model with strided convolutions, which increases its speed and the sequence length that can be processed.

The proposed model achieves state-of-the-art accuracy in recognizing musical notes directly from the waveform – a task where the ability to process long sequences is crucial. Notably, our architecture uses significantly fewer parameters than the previously best models for this task.

By providing an alternative version of the Switch Unit, we have established that the interconnection structure of the Shuffle-Exchange network is the key to the power of the network, not a particular implementation of the switching mechanism. Our experiments confirm the Residual Shuffle-Exchange network as a useful building block for long sequence processing applications.

Acknowledgements

We would like to thank the IMCS UL Scientific Cloud for the computing power and Leo Trukšāns for the technical support. This research is funded by the Latvian Council of Science, project No. lzp-2018/1-0327.

References

Al-Rfou, R., D. Choe, N. Constant, M. Guo, and L. Jones. 2018. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.

Ba, J. L., J. R. Kiros, and G. E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Benetos, E., S. Dixon, Z. Duan, and S. Ewert. 2019. Automatic music transcription: An overview. IEEE Signal Processing Magazine, 36:20–30.

Child, R., S. Gray, A. Radford, and I. Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Cho, K., B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Chu, Z., H. Wang, K. Gimpel, and D. McAllester. 2017. Broad context language modeling as reading comprehension. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Pp. 52–57. ACL (Association for Computational Linguistics).

Clark, C. and M. Gardner. 2017. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723.

Dai, Z., Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Dally, W. J. and B. P. Towles. 2004. Principles and Practices of Interconnection Networks. Elsevier.

Dehghani, M., S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser. 2018. Universal transformers. arXiv preprint arXiv:1807.03819.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Freivalds, K. and R. Liepins. 2018. Improving the neural GPU architecture for algorithm learning. The ICML workshop Neural Abstract Machines & Program Induction v2 (NAMPI 2018).

Freivalds, K., E. Ozoliņš, and A. Šostaks. 2019. Neural shuffle-exchange networks – sequence processing in O(n log n) time. In Advances in Neural Information Processing Systems 32, Pp. 6626–6637. Curran Associates, Inc.

Gehring, J., M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh, eds., volume 70 of Proceedings of Machine Learning Research, Pp. 1243–1252. PMLR.

Graves, A., G. Wayne, and I. Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Graves, A., G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471.

Grefenstette, E., K. M. Hermann, M. Suleyman, and P. Blunsom. 2015. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems 28, C. Cortes and Lee D.D. et al., eds., Pp. 1828–1836. Curran Associates, Inc.

Guo, Q., X. Qiu, P. Liu, Y. Shao, X. Xue, and Z. Zhang. 2019. Star-Transformer. arXiv preprint arXiv:1902.09113.

Hendrycks, D. and K. Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Hochreiter, S. and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Huang, C.-Z. A., A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck. 2019. Music Transformer. In International Conference on Learning Representations.

Joulin, A. and T. Mikolov. 2015. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems 28, C. Cortes and Lee D.D. et al., eds., Pp. 190–198. Curran Associates, Inc.

Kaiser, Ł. and S. Bengio. 2016. Can active memory replace attention? In Advances in Neural Information Processing Systems 29, D. Lee and Luxburg U.V. et al., eds., Pp. 3781–3789. Curran Associates, Inc.

Kaiser, Ł. and I. Sutskever. 2015. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228.

Kalchbrenner, N., I. Danihelka, and A. Graves. 2015. Grid long short-term memory. arXiv preprint arXiv:1507.01526.

Kant, N. 2018. Recent advances in neural program synthesis. arXiv preprint arXiv:1802.02353.

Kitaev, N., Ł. Kaiser, and A. Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.

Liu, L., H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han. 2019. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265.

Mikolov, T., E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), N. Calzolari and Choukri, Khalid et al., eds. European Language Resources Association (ELRA).

Paperno, D., G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández. 2016. The LAMBADA dataset: word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Pp. 1525–1534. ACL (Association for Computational Linguistics).

Radford, A., J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. 2019. Language models are unsupervised multitask learners.

Thickstun, J., Z. Harchaoui, and S. Kakade. 2016. Learning features of music from scratch. arXiv preprint arXiv:1611.09827.

Trabelsi, C., O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. Pal. 2018. Deep complex networks. arXiv preprint arXiv:1705.09792.

van den Oord, A., S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. In SSW – the 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 2016, P. 125.

Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon and Luxburg U.V. et al., eds., Pp. 5998–6008. Curran Associates, Inc.

Vinyals, O., M. Fortunato, and N. Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28, C. Cortes and Lee D.D. et al., eds., Pp. 2692–2700. Curran Associates, Inc.

Wolter, M. and A. Yao. 2018. Complex gated recurrent neural networks. In Advances in Neural Information Processing Systems, Pp. 10536–10546.

Xu, J., X. Sun, Z. Zhang, G. Zhao, and J. Lin. 2019. Understanding and improving layer normalization. In Advances in Neural Information Processing Systems, Pp. 4383–4393.

Yang, M., M. Q. Ma, D. Li, Y.-H. H. Tsai, and R. Salakhutdinov. 2019. Complex Transformer: A framework for modeling complex-valued sequence. arXiv preprint arXiv:1910.10202.

Yu, F. and V. Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.

Zaremba, W., T. Mikolov, A. Joulin, and R. Fergus. 2016. Learning simple algorithms from examples. In International Conference on Machine Learning, Pp. 421–429.

Zaremba, W. and I. Sutskever. 2015. Reinforcement learning neural Turing machines – revised. arXiv preprint arXiv:1505.00521.