Language Modeling with Gated Convolutional Networks
Yann N. Dauphin
Angela Fan
Michael Auli
David Grangier
Facebook AI Research
2. Approach

In this paper we introduce a new neural language model that replaces the recurrent connections typically used in recurrent networks with gated temporal convolutions. Neural language models (Bengio et al., 2003) produce a representation H = [h0, . . . , hN] of the context for each word w0, . . . , wN to predict the next word P(wi | hi). Recurrent neural networks compute H through a recurrent function hi = f(hi−1, wi−1), which is an inherently sequential process that cannot be parallelized over i.¹

The proposed approach convolves the inputs to obtain H = f ∗ w and therefore has no temporal dependencies, which makes it easier to parallelize over the individual words of a sentence. This process computes each context as a function of a number of preceding words. Compared to recurrent networks, the context size is finite, but we will demonstrate that we can represent large enough contexts to perform well in practice (§5).

Figure 1 illustrates the model architecture. Words are represented by a vector embedding stored in a lookup table D|V|×m, where |V| is the number of words in the vocabulary and m is the embedding size. The input to our model is a sequence of words w0, . . . , wN which are represented by word embeddings E = [Dw0, . . . , DwN]. We compute the hidden layers h0, . . . , hL as

    hl(X) = (X ∗ W + b) ⊗ σ(X ∗ V + c)    (1)

where X ∈ RN×m is the input of layer hl, that is, either word embeddings or the outputs of previous layers, W ∈ Rk×m×n, b ∈ Rn, V ∈ Rk×m×n, c ∈ Rn are learned parameters, σ is the sigmoid function and ⊗ is the element-wise product between matrices.

When convolving inputs, we take care that hi does not contain information from future words. We address this by shifting the convolution inputs to prevent the kernels from seeing future context (Oord et al., 2016a). Specifically, we zero-pad the beginning of the sequence by k/2 elements, assuming the first input element is the beginning-of-sequence marker which we do not predict, where k is the width of the kernel.

The output of each layer is a linear projection X ∗ W + b modulated by the gates σ(X ∗ V + c). Similar to LSTMs, the gates multiply each element of the matrix X ∗ W + b and control the information passed in the hierarchy. We dub this gating mechanism Gated Linear Units (GLU). Stacking multiple layers on top of the input E gives a representation of the context for each word, H = hL ◦ . . . ◦ h0(E). We wrap the convolution and the gated linear unit in a pre-activation residual block that adds the input of the block to the output (He et al., 2015a). The blocks have a bottleneck structure for computational efficiency and each block has up to 5 layers.

Figure 1. Architecture of the gated convolutional network for language modeling: input embeddings E, a convolution producing A = E ∗ W + b and B = E ∗ V + c, gating H0 = A ⊗ σ(B), a stack of L − 1 further convolution+gating blocks, and a softmax output Y = softmax(W HL).

¹ Parallelization is usually done over multiple sequences instead.
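To make Equation 1 concrete, the following is a minimal PyTorch sketch of a single gated convolutional layer with causal (left) zero-padding. It is not the authors' Torch implementation; the class and argument names are illustrative, and for simplicity it pads by k − 1 on the left rather than reproducing the exact k/2 padding-and-shift scheme described above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedConvLayer(nn.Module):
        """One gated convolutional layer: h_l(X) = (X*W + b) * sigmoid(X*V + c)."""

        def __init__(self, in_channels, out_channels, kernel_width):
            super().__init__()
            self.kernel_width = kernel_width
            # A single convolution produces both the linear part (X*W + b)
            # and the gates (X*V + c) in one pass.
            self.conv = nn.Conv1d(in_channels, 2 * out_channels, kernel_width)

        def forward(self, x):
            # x: (batch, channels, time). Left-pad so that position i never
            # sees inputs to its right, i.e. no future context leaks in.
            x = F.pad(x, (self.kernel_width - 1, 0))
            a, b = self.conv(x).chunk(2, dim=1)
            return a * torch.sigmoid(b)  # Gated Linear Unit

    # Example: a batch of 8 sequences of 20 tokens with embedding size 128.
    layer = GatedConvLayer(128, 256, kernel_width=4)
    h = layer(torch.randn(8, 128, 20))  # -> shape (8, 256, 20)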
Gradient clipping can be derived from a spherical trust region:

    ∆θ∗ = argmin_{∆θ : ‖∆θ‖ ≤ ε} f(θ) + ∇fᵀ∆θ
        = −max(‖∇f‖, ε) ∇f / ‖∇f‖.    (4)

Our experiments run significantly faster with the use of gradient clipping, even though we do not use a recurrent architecture.
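As a concrete illustration, here is a minimal sketch (not the authors' code) of this clipping rule: the gradient is rescaled in place so that its norm never exceeds a threshold ε. PyTorch's built-in torch.nn.utils.clip_grad_norm_ performs the same operation.

    import torch

    def clip_gradient_norm(parameters, epsilon=0.1):
        """Rescale gradients in place so their global norm is at most epsilon."""
        grads = [p.grad for p in parameters if p.grad is not None]
        total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
        # If the norm exceeds epsilon, scale the gradient back onto the trust
        # region; otherwise the scale is 1 and the gradient is left untouched.
        scale = epsilon / max(total_norm.item(), epsilon)
        for g in grads:
            g.mul_(scale)
        return total_norm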
We train on a single Tesla M40 GPU and implement our models in Torch (Collobert et al., 2011). While better performance could be achieved by training longer and on multiple GPUs, we focused on better exploring the hyper-parameter space of small models to identify a compact model with good generalization performance. This strategy is attractive both to understand architectural choices and to identify models with better efficiency at test time.

4.3. Hyper-parameters

We found good hyper-parameter configurations by cross-validation using random search on a validation set. In terms of the architecture of the model, we select the number of residual blocks between {1, . . . , 10}, the size of the embeddings between {128, . . . , 256}, the number of units between {128, . . . , 2048}, and the kernel width between {3, . . . , 5}. In general, finding a good architecture is simple and the rule of thumb is that the larger the model, the better the performance. In terms of optimization, we initialize the layers of the model with the Kaiming initialization (He et al., 2015b), with the learning rate sampled uniformly in the interval [1., 2.], the momentum set to 0.99 and gradient clipping set to 0.1. Good hyper-parameters for the optimizer are quite straightforward to find and the optimal values do not seem to change very much between datasets.
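For illustration, a minimal sketch of this random search follows. The discrete values listed for the embedding and unit sizes are assumptions (the text above only gives the ranges), and train_and_evaluate is a hypothetical helper that trains a model with the sampled configuration and returns its validation perplexity.

    import random

    def sample_config():
        # Search ranges taken from the description above.
        return {
            "residual_blocks": random.randint(1, 10),
            "embedding_size": random.choice([128, 256]),
            "units": random.choice([128, 256, 512, 1024, 2048]),
            "kernel_width": random.randint(3, 5),
            "learning_rate": random.uniform(1.0, 2.0),
            "momentum": 0.99,
            "gradient_clipping": 0.1,
        }

    # Keep the configuration with the lowest validation perplexity, e.g.:
    # best = min((sample_config() for _ in range(100)), key=train_and_evaluate)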
We compare strong LSTM and RNN models from the literature to our gated convolutional approach on two datasets.

Table 1 shows that our model outperforms all state-of-the-art approaches that have been trained on a single GPU on the Google Billion Word benchmark. Of the methods that use multiple GPUs, only the very large LSTM of Jozefowicz et al. (2016) achieves better results. However, this model was trained on 32 GPUs for 3 weeks, whereas our model trains on a single GPU in 2 weeks. The GCNN-13 model has 13 layers of 1268 units each and kernel width 4.

On Google Billion Word, the average sentence length is only 20 words, which is relatively short. Next, we test on WikiText-103 to answer the question whether our model can perform equally well in a setup where much larger contexts are possible. On WikiText-103, an input sequence is an entire Wikipedia article instead of an individual sentence. The results (Table 2) show that the gated convolutional model outperforms an LSTM on this problem as well. The GCNN-8 model has 8 layers with 800 units each and the LSTM has 1024 units.

Model                              Test PPL
LSTM-1024 (Grave et al., 2016b)    48.7
GCNN-8                             44.9

Table 2. Results on the WikiText-103 dataset.

5.1. Computational Efficiency

             Throughput             Responsiveness
             (CPU)      (GPU)       (GPU)
LSTM-2048    169        45,622      2,282
GCNN-22      179        45,878      45,878

Table 3. Throughput and responsiveness in tokens per second for models reaching approximately 43.9 perplexity on Google Billion Word.
Figure 2. Learning curves on WikiText-103 (left, test perplexity vs. training epochs) and Google Billion Word (right, test perplexity vs. training hours) for models with different activation mechanisms (Tanh, ReLU, GTU and GLU). Models with gated linear units (GLU) converge faster and to a lower perplexity.
Computational cost is an important consideration for language models. Depending on the application, there are a number of metrics to consider. We measure the throughput of a model as the number of tokens that can be processed per second. Throughput can be maximized by processing many sentences in parallel to amortize sequential operations. In contrast, responsiveness is the speed of processing the input sequentially, one token at a time. Throughput is important because it indicates the time required to process a corpus of text, and responsiveness is an indicator of the time to finish processing a sentence. A model can have low responsiveness but high throughput by evaluating many sentences simultaneously through batching. In this case, such a model is slow in finishing individual sentences, but can process many sentences at a good rate.
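To illustrate the two metrics, here is a hedged sketch (not the authors' benchmark code): model stands for any language model callable on a batch of token ids, and the vocabulary size is arbitrary.

    import time
    import torch

    def throughput_tokens_per_sec(model, batch_size=750, length=20, vocab=1000):
        tokens = torch.randint(0, vocab, (batch_size, length))
        start = time.time()
        model(tokens)  # process all sequences and all positions at once
        return batch_size * length / (time.time() - start)

    def responsiveness_tokens_per_sec(model, length=20, vocab=1000):
        tokens = torch.randint(0, vocab, (1, length))
        start = time.time()
        for i in range(1, length + 1):  # feed one token at a time, sequentially
            model(tokens[:, :i])
        return length / (time.time() - start)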
We evaluate the throughput and responsiveness of models that reach approximately 43.9 perplexity on the Google Billion Word benchmark. We consider the LSTM with 2048 units in Table 1 and a GCNN with 22 layers with ResNet blocks that have a bottleneck structure as described by He et al. (2015a). The network has 3 bottleneck blocks of the form 128, 128, 512, followed by a 256, 256, 512 block, followed by a fully connected 1024, 1024, 2048 block. Note that only the middle layer of these blocks is a convolution (k = 5). We found that this architecture is quite important to obtain good computational efficiency.

The throughput of the LSTM is measured by using a large batch of 750 sequences of length 20, resulting in 15,000 tokens per batch. Table 3 shows that the throughput of the LSTM and the GCNN is similar on CPU but not on GPU. The LSTM performs very well on GPU because the large batch size of 750 enables high parallelization: its implementation has been thoroughly optimized and uses cuDNN, whereas the cuDNN implementation of the 1-D convolutions we use in our model has not been similarly optimized. We believe much better performance could be achieved by a more efficient 1-D cuDNN convolution. Unlike the LSTM, the GCNN can be parallelized both over sequences as well as across the tokens of each sequence. On the other hand, the GCNN is 20 times faster in terms of responsiveness.

Table 4 shows that the convolutional model requires fewer parameters and floating point operations per token than a comparable LSTM.

             Parameters    FLOPs/token
LSTM-2048    289M          19M
GCNN-22      185M          14M

Table 4. Number of parameters and FLOPs for the models of Figure 3. FLOPs exclude the operations required by the softmax layer, which are identical.

5.2. Gating Mechanisms

In this section we compare the gated linear unit with other mechanisms as well as with models without gating. We consider the LSTM-style gating mechanism (GTU) tanh(X ∗ W + b) ⊗ σ(X ∗ V + c) of Oord et al. (2016b) and networks that use regular ReLU or Tanh activations. Gating units add parameters, so in order to make a fair comparison we carefully cross-validate models with a comparable number of parameters. Figure 2 (left) shows that GLU networks converge to a lower perplexity than the other approaches on WikiText-103. Similar to gated linear units, the ReLU has a linear path that lets the gradients easily pass through the active units. In our experiments we observe that this translates to much faster convergence for both the ReLU and the GLU. On the other hand, neither Tanh nor GTU have a linear path and thus suffer from vanishing gradients. Comparing the GTU and Tanh models allows us to measure the effect of gating, since the Tanh model can be thought of as a GTU network with the sigmoid gating units removed.
The results (Figure 2, left) show that the gating units make a vast difference. Both Tanh and GTU units suffer from vanishing gradients since in the GTU both the inputs as well as the gating units cut the gradients when the units saturate. We argue that the difference between GTU and Tanh indicates that gating units provide useful modeling capabilities. The ReLU unit is not an exact ablation of the gating units in the GLU, but it can be seen as the simplification ReLU(X) = X ⊗ (X > 0) where the gates become active depending on the sign of the input. However, also in this case, GLU units lead to lower perplexity.

In Figure 2 (right) we repeat the same experiment on the larger Google Billion Word dataset. We consider a fixed time budget of 100 hours because of the considerable training time required for this task. Similar to WikiText-103, we see that gated linear units achieve the best results on this problem. There is a gap of about 5 perplexity points between the GLU and ReLU, which is similar to the difference between the LSTM and RNN models measured by Jozefowicz et al. (2016) on the same dataset.
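For reference, the four mechanisms compared in Figure 2 can be written as functions of the convolution outputs a = X ∗ W + b and, where gates are used, g = X ∗ V + c. This is a minimal sketch with illustrative names, not the authors' code.

    import torch

    def glu(a, g):   # Gated Linear Unit: linear path modulated by sigmoid gates
        return a * torch.sigmoid(g)

    def gtu(a, g):   # LSTM-style gating of Oord et al. (2016b)
        return torch.tanh(a) * torch.sigmoid(g)

    def relu(a):     # gates are (a > 0): active depending only on the input sign
        return a * (a > 0).float()

    def tanh(a):     # no linear path; gradients vanish when the unit saturates
        return torch.tanh(a)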
5.3. Non-linear Modeling

Figure 3. Learning curves on Google Billion Word for models with varying degrees of non-linearity (linear, bilinear and GLU layers).

The experiments so far have shown that the gated linear unit benefits from the linear path the unit provides compared to other non-linearities. Next, we compare networks with GLUs to purely linear networks and networks with bilinear layers in order to measure the impact of the non-linear path provided by the gates of the GLU. One motivation for this experiment is the success of linear models on many natural language processing tasks (Manning & Schütze, 1999).

We consider deep linear convolutional networks where the layers lack the gating units of the GLU and take the form hl(X) = X ∗ W + b. Stacking several layers on top of each other is simply a factorization of the model which remains linear up to the softmax, at which point it becomes log-linear. Another variation of GLUs is bilinear layers (Mnih & Hinton, 2007), which take the form hl(X) = (X ∗ W + b) ⊗ (X ∗ V + c). This is similar to GLUs but with linear gating units instead.
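In the same notation as before (a = X ∗ W + b, g = X ∗ V + c), the three layer types compared here differ only in how the two convolution outputs are combined; a minimal illustrative sketch:

    import torch

    def linear_layer(a):         # h_l(X) = X*W + b
        return a

    def bilinear_layer(a, g):    # h_l(X) = (X*W + b) ⊗ (X*V + c)
        return a * g

    def glu_layer(a, g):         # h_l(X) = (X*W + b) ⊗ σ(X*V + c)
        return a * torch.sigmoid(g)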
Figure 3 shows that GLUs perform best, followed by bilinear layers and then linear layers. Bilinear layers improve over linear ones by more than 40 perplexity points, and the GLU improves another 20 perplexity points over the bilinear model. The linear model performs very poorly, at perplexity 115, even compared to the 67.6 of a Kneser-Ney 5-gram model, even though the former has access to more context. The linear gating units of the bilinear model provide a way for the model to modulate the flow of information in the network, and the large reduction in perplexity shows that this is important. Surprisingly, introducing the linear gating units is enough to reach 61 perplexity on Google 1B, which surpasses Kneser-Ney 5-gram models and the non-linear neural model of Ji et al. (2015). However, the non-linear gating units of the GLU ultimately perform better.

5.4. Network Depth

Figure 4. Impact of network depth on test perplexity for Google Billion Word. Deeper models perform better.

Next we turn to the question of how network depth affects the accuracy of our model. Figure 4 shows that perplexity on Google Billion Word improves as we increase the depth of the model. This also shows that good results are possible with a number of layers smaller than the average sentence length of 20 on GBW, since we use 13 layers in this setting. The GCNN in Table 1 builds context representations by applying exactly 13 layers to each input, while a recurrent model would pass information through 20 sequential layers on average on this corpus.
Figure 5. Test perplexity as a function of context size for Google Billion Word (left) and WikiText-103 (right). Models with larger contexts achieve better results, but the returns diminish quickly after a context of 20.
5.5. Context Size

Figure 5 shows the impact of context size for the gated CNN. We tried different combinations of network depth and kernel widths for each context size and chose the best performing one for each size. Generally, larger contexts improve accuracy, but returns diminish drastically with windows larger than 20 words, even for WikiText-103.

(Figure: test perplexity when training without gradient clipping, without weight normalization, and with both techniques.) The overhead of these techniques is small compared to the large gains in convergence speed.
Our gated convolutional model outperforms several strong LSTM results.

Acknowledgments

We would like to thank Jonas Gehring, Edouard Grave, Armand Joulin and Ronan Collobert for helpful discussions related to this work.

References

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015a.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015b.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Ji, Shihao, Vishwanathan, SVN, Satish, Nadathur, Anderson, Michael J, and Dubey, Pradeep. Blackout: Speeding up recurrent neural network language models with very large vocabularies. arXiv preprint arXiv:1511.06909, 2015.

Jozefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

Kneser, Reinhard and Ney, Hermann. Improved backing-off for m-gram language modeling. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pp. 181–184. IEEE, 1995.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.

Shazeer, Noam, Pelemans, Joris, and Chelba, Ciprian. Skip-gram language modeling using sparse non-negative matrix probability estimation. arXiv preprint arXiv:1412.1454, 2014.

Steedman, Mark. The Syntactic Process. 2002.

Sutskever, Ilya, Martens, James, Dahl, George E, and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. 2013.

Yu, Dong and Deng, Li. Automatic Speech Recognition: A Deep Learning Approach. Springer Publishing Company, Incorporated, 2014. ISBN 1447157788, 9781447157786.