
Language Modeling with Gated Convolutional Networks

Yann N. Dauphin
Angela Fan
Michael Auli
David Grangier
Facebook AI Research
arXiv:1612.08083v1 [cs.CL] 23 Dec 2016

Abstract

The predominant approach to language modeling to date is based on recurrent neural networks. In this paper we present a convolutional approach to language modeling. We introduce a novel gating mechanism that eases gradient propagation and which performs better than the LSTM-style gating of Oord et al. (2016b) despite being simpler. We achieve a new state of the art on WikiText-103 as well as a new best single-GPU result on the Google Billion Word benchmark. In settings where latency is important, our model achieves an order of magnitude speed-up compared to a recurrent baseline since computation can be parallelized over time. To our knowledge, this is the first time a non-recurrent approach outperforms strong recurrent models on these tasks.

1. Introduction

Statistical language models estimate the probability distribution of a sequence of words. This amounts to modeling the probability of the next word given the preceding words, i.e.

P(w0, . . . , wN) = P(w0) ∏_{i=1}^{N} P(wi | w0, . . . , wi−1),

where wi are discrete word indices in a vocabulary. Language models are a critical part of systems for speech recognition (Yu & Deng, 2014) as well as machine translation (Koehn, 2010).

Recently, neural networks (Bengio et al., 2003; Mikolov et al., 2010; Jozefowicz et al., 2016) have been shown to outperform classical n-gram language models (Kneser & Ney, 1995; Chen & Goodman, 1996). Classical language models suffer under data sparsity, which makes it difficult to represent large contexts and therefore long-range dependencies. Neural language models tackle this issue by embedding words in a continuous space over which a neural network is applied. The current state of the art for language modeling is based on long short-term memory networks (LSTM; Hochreiter & Schmidhuber, 1997), which can model potentially arbitrarily long dependencies.

In this paper, we introduce gated convolutional networks and apply them to language modeling. Convolutional networks can be stacked to represent large context sizes and extract hierarchical features over larger and larger contexts with increasingly abstract features (LeCun & Bengio, 1995). This allows them to model long-term dependencies by applying O(N/k) operations over a context of size N and kernel width k. In contrast, recurrent networks view the input as a chain structure and therefore require a linear number O(N) of operations.

Analyzing the input hierarchically bears resemblance to classical grammar formalisms, which build syntactic tree structure of increasing granularity, e.g., sentences consist of noun phrases and verb phrases, each comprising further internal structure (Manning & Schütze, 1999; Steedman, 2002). Hierarchical structure also eases learning since the number of non-linearities for a given context size is reduced compared to a chain structure, thereby mitigating the vanishing gradient problem (Glorot & Bengio, 2010).

Modern hardware is well suited to models that are highly parallelizable. In recurrent networks, the next output depends on the previous hidden state, which does not enable parallelization over the elements of a sequence. Convolutional networks are very amenable to this computing paradigm since the computation of all input words can be performed simultaneously (§2).

Gating has been shown to be essential for recurrent neural networks to reach state-of-the-art performance (Jozefowicz et al., 2016). Our gated linear units reduce the vanishing gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear capabilities (§3).

We run experiments in a single-GPU setup and show that gated convolutional networks outperform other recently published language models such as LSTMs trained in a similar setting on the Google Billion Word benchmark (Chelba et al., 2013). We also evaluate the ability of our models to deal with long-range dependencies on the WikiText-103 benchmark, for which the model is conditioned on an entire paragraph rather than a single sentence, and we achieve a new state of the art on this dataset (Merity et al., 2016). Finally, we show that gated linear units achieve higher accuracy and converge faster than the LSTM-style gating of Oord et al. (2016b; §4, §5).
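As a concrete illustration of this factorization and of the perplexity measure used for evaluation in §4.1, the following minimal Python sketch (ours, not from the paper) combines per-word conditional probabilities into a sequence probability and a perplexity; the cond_probs values are invented purely for illustration.

```python
import math

# Toy per-word conditional probabilities P(wi | w0, ..., wi-1) for a
# 4-word sequence; the numbers are invented purely for illustration.
cond_probs = [0.20, 0.05, 0.10, 0.30]

# Chain rule: P(w0, ..., wN) = P(w0) * prod_i P(wi | w0, ..., wi-1).
seq_prob = math.prod(cond_probs)

# Perplexity: exponential of the average negative log-likelihood per word.
perplexity = math.exp(sum(-math.log(p) for p in cond_probs) / len(cond_probs))

print(f"sequence probability: {seq_prob:.2e}")
print(f"perplexity:           {perplexity:.1f}")
```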
2. Approach

In this paper we introduce a new neural language model that replaces the recurrent connections typically used in recurrent networks with gated temporal convolutions. Neural language models (Bengio et al., 2003) produce a representation H = [h0, . . . , hN] of the context for each word w0, . . . , wN to predict the next word P(wi | hi). Recurrent neural networks compute H through a recurrent function hi = f(hi−1, wi−1), which is an inherently sequential process that cannot be parallelized over i.¹

¹ Parallelization is usually done over multiple sequences instead.

The proposed approach convolves the inputs to obtain H = f ∗ w and therefore has no temporal dependencies, which makes it easier to parallelize over the individual words of a sentence. This process computes each context as a function of a number of preceding words. Compared to recurrent networks, the context size is finite, but we will demonstrate that we can represent large enough contexts to perform well in practice (§5).

Figure 1 illustrates the model architecture. Words are represented by a vector embedding stored in a lookup table D^{|V|×m}, where |V| is the number of words in the vocabulary and m is the embedding size. The input to our model is a sequence of words w0, . . . , wN which are represented by word embeddings E = [Dw0, . . . , DwN]. We compute the hidden layers h0, . . . , hL as

hl(X) = (X ∗ W + b) ⊗ σ(X ∗ V + c)    (1)

where X ∈ R^{N×m} is the input of layer hl (either word embeddings or the outputs of previous layers), W ∈ R^{k×m×n}, b ∈ R^n, V ∈ R^{k×m×n}, c ∈ R^n are learned parameters, σ is the sigmoid function and ⊗ is the element-wise product between matrices.

[Figure 1: input sentence → lookup table E = Dwi → convolution A = E ∗ W + b, B = E ∗ V + c → gating H0 = A ⊗ σ(B) → stack of L − 1 convolution+gating blocks → softmax Y = softmax(W HL).]
Figure 1. Architecture of the gated convolutional network for language modeling.

When convolving inputs, we take care that hi does not contain information from future words. We address this by shifting the convolution inputs to prevent the kernels from seeing future context (Oord et al., 2016a). Specifically, we zero-pad the beginning of the sequence by k/2 elements, assuming the first input element is the beginning of sequence marker which we do not predict, where k is the width of the kernel.

The output of each layer is a linear projection X ∗ W + b modulated by the gates σ(X ∗ V + c). Similar to LSTMs, the gates multiply each element of the matrix X ∗ W + b and control the information passed on in the hierarchy. We dub this gating mechanism Gated Linear Units (GLU). Stacking multiple layers on top of the input E gives a representation of the context for each word, H = hL ◦ . . . ◦ h0(E). We wrap the convolution and the gated linear unit in a pre-activation residual block that adds the input of the block to the output (He et al., 2015a). The blocks have a bottleneck structure for computational efficiency and each block has up to 5 layers.
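The following is a minimal PyTorch-style sketch of Equation 1 together with the causal masking described above. It is an illustrative re-implementation, not the authors' Torch code: the layer sizes are arbitrary, and for simplicity it pads the left of the sequence by k − 1 positions so that position i never sees later words, whereas the paper combines a smaller zero-pad with shifting the inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalGLUConv(nn.Module):
    """One gated convolutional layer: h(X) = (X*W + b) * sigmoid(X*V + c).

    Illustrative sketch only: left-padding by k - 1 keeps the convolution
    causal (position i never sees words after position i), whereas the
    paper combines a smaller zero-pad with shifting the inputs.
    """

    def __init__(self, in_channels: int, out_channels: int, kernel_width: int):
        super().__init__()
        self.pad = kernel_width - 1
        # A single convolution produces both the linear part and the gates.
        self.conv = nn.Conv1d(in_channels, 2 * out_channels, kernel_width)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, sequence length).
        x = F.pad(x, (self.pad, 0))          # zero-pad the beginning only
        a, b = self.conv(x).chunk(2, dim=1)  # X*W + b  and  X*V + c
        return a * torch.sigmoid(b)          # gated linear unit (GLU)

# Example: a batch of 2 sequences of 7 word embeddings of size m = 16.
emb = torch.randn(2, 16, 7)
layer = CausalGLUConv(in_channels=16, out_channels=32, kernel_width=4)
print(layer(emb).shape)  # torch.Size([2, 32, 7])
```

PyTorch also provides nn.GLU, which performs the same split-and-gate operation over a chosen dimension.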
The simplest choice to obtain model predictions is to use a softmax layer, but this choice is often computationally inefficient for large vocabularies, and an approximation such as noise contrastive estimation (Gutmann & Hyvärinen) or hierarchical softmax (Morin & Bengio, 2005) is preferred. We choose an improvement of the latter known as adaptive softmax, which assigns higher capacity to very frequent words and lower capacity to rare words (Grave et al., 2016a). This results in lower memory requirements as well as faster computation, both at training and at test time.

3. Gating Mechanisms

Gating mechanisms control the path through which information flows in the network and have proven to be useful for recurrent neural networks (Hochreiter & Schmidhuber, 1997). LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep. In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates.

Therefore, our gated linear units only possess output gates, which allow the network to control which information should be propagated through the hierarchy of layers. We show this mechanism to be useful for language modeling as it allows the model to select which words or features are relevant to predict the next word. In parallel to our work, Oord et al. (2016b) have shown the effectiveness of an LSTM-style mechanism of the form tanh(X ∗ W + b) ⊗ σ(X ∗ V + c) for the convolutional modeling of images.

Gated linear units are a simplified gating mechanism based on the work of Dauphin & Grangier (2015) on non-deterministic gates that reduce the vanishing gradient problem by having linear units coupled to the gates. This retains the non-linear capabilities of the layer while allowing the gradient to pass through the linear unit without scaling. The gradient of the LSTM-style gating of Oord et al. (2016b) is

∇[tanh(X) ⊗ σ(X)] = tanh′(X)∇X ⊗ σ(X) + σ′(X)∇X ⊗ tanh(X).    (2)

Notice that it gradually vanishes as we stack layers because of the downscaling factors tanh′(X) and σ′(X). In contrast, the gradient of the gated linear unit

∇[X ⊗ σ(X)] = ∇X ⊗ σ(X) + X ⊗ σ′(X)∇X    (3)

has a path ∇X ⊗ σ(X) without downscaling for the activated gating units in σ(X). This can be seen as a multiplicative skip connection which helps gradients flow through the layers. We find that gated linear units perform better in practice compared to LSTM-style gating, which we dub gated tanh units (GTU; §5).
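A quick way to see the difference between Equations 2 and 3 numerically is to backpropagate through a stack of element-wise GTU- and GLU-style operations and compare the gradient magnitudes that reach the input. The sketch below is our own illustration (it applies the gating point-wise, without convolutions or learned weights), not an experiment from the paper.

```python
import torch

def stacked_gate(x: torch.Tensor, depth: int, mode: str) -> torch.Tensor:
    """Apply `depth` element-wise gating operations of the given kind."""
    for _ in range(depth):
        if mode == "gtu":
            x = torch.tanh(x) * torch.sigmoid(x)  # LSTM-style gating (Eq. 2)
        else:
            x = x * torch.sigmoid(x)              # gated linear unit (Eq. 3)
    return x

for mode in ("gtu", "glu"):
    x = torch.randn(1000, requires_grad=True)
    stacked_gate(x, depth=10, mode=mode).sum().backward()
    # The GTU stack typically yields far smaller input gradients than the GLU stack.
    print(f"{mode}: mean |gradient| at the input = {x.grad.abs().mean():.2e}")
```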
Model                                                      Test PPL   Hardware
Sigmoid-RNN-2048 (Ji et al., 2015)                         68.3       1 CPU
Interpolated KN 5-Gram (Chelba et al., 2013)               67.6       100 CPUs
Sparse Non-Negative Matrix LM (Shazeer et al., 2014)       52.9       -
RNN-1024 + MaxEnt 9 Gram Features (Chelba et al., 2013)    51.3       24 GPUs
LSTM-2048-512 (Jozefowicz et al., 2016)                    43.7       32 GPUs
2-layer LSTM-8192-1024 (Jozefowicz et al., 2016)           30.6       32 GPUs
LSTM-2048 (Grave et al., 2016a)                            43.9       1 GPU
2-layer LSTM-2048 (Grave et al., 2016a)                    39.8       1 GPU
GCNN-13                                                    38.1       1 GPU

Table 1. Results on the Google Billion Word test set.

4. Experimental Setup

4.1. Datasets

We report results on two public large-scale language modeling datasets. First, the Google Billion Word dataset (GBW; Chelba et al., 2013) is considered one of the largest language modeling datasets, with close to one billion tokens and a vocabulary of over 800K words. In this dataset, words appearing less than 3 times are replaced with a special unknown symbol. The data is based on an English corpus of 30,301,028 sentences whose order has been shuffled. Second, WikiText-103 is a smaller dataset of over 100M tokens with a vocabulary of about 200K words (Merity et al., 2016). Unlike GBW, sentences are consecutive, which allows the model to be conditioned on larger contexts than single sentences. For both datasets, we add a beginning of sequence marker <S> at the start of each line and an end of sequence marker </S> at the end of each line. On the Google Billion Word corpus each sequence is a single sentence, while on WikiText-103 a sequence is an entire paragraph. The model sees <S> and </S> as input but only predicts the end of sequence marker </S>. We evaluate models by computing the perplexity e^{(1/N) Σi −log p(wi | . . . , wi−1)} on the standard held-out test portion of each dataset.

4.2. Training

We found Nesterov's momentum (Sutskever et al., 2013) to be worth the overhead compared to standard stochastic gradient descent. The cost in terms of memory is storing another vector of the size of the parameters, but it increases the speed of convergence significantly with minimal computational overhead. The speed of convergence was further increased by clipping the gradients to 0.1 (Pascanu et al., 2013) and by weight normalization (Salimans & Kingma, 2016). The combination of these methods allowed us to achieve stable and fast convergence with comparatively large learning rates such as 1.

Pascanu et al. (2013) argue for gradient clipping because it prevents the gradient explosion problem that characterizes RNNs. We argue that gradient clipping is not tied to RNNs, since it can be derived from the more general concept of trust region methods. Gradient clipping is found using a spherical trust region

∆θ∗ = argmin_{∆θ} f(θ) + ∇fᵀ∆θ   s.t. ‖∆θ‖ ≤ ε
    = −max(‖∇f‖, ε) ∇f / ‖∇f‖.    (4)

Our experiments run significantly faster with the use of gradient clipping even though we do not use a recurrent architecture.
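A hedged sketch of what this training setup might look like in PyTorch is given below; the paper's implementation is in Torch, so everything beyond the stated ingredients (Nesterov momentum of 0.99, a learning rate around 1, gradient clipping to 0.1, and weight normalization) is an assumption, and the model and loss are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model: any stack of layers would do for this sketch.
model = nn.Sequential(
    nn.utils.weight_norm(nn.Linear(128, 256)),  # weight normalization per layer
    nn.ReLU(),
    nn.utils.weight_norm(nn.Linear(256, 128)),
)

# Nesterov momentum of 0.99 with a comparatively large learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.99, nesterov=True)

def training_step(x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)  # placeholder loss, not the LM objective
    loss.backward()
    # Clip the gradient norm to 0.1, i.e. the trust-region radius above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
    return loss.item()

print(training_step(torch.randn(32, 128), torch.randn(32, 128)))
```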
We train on a single Tesla M40 GPU and implement our models in Torch (Collobert et al., 2011). While better performance could be achieved by training longer and on multiple GPUs, we focused on better exploring the hyper-parameter space of small models to identify a compact model with good generalization performance. This strategy is attractive both to understand architectural choices and to identify models with better efficiency at test time.

4.3. Hyper-parameters

We found good hyper-parameter configurations by cross-validation using random search on a validation set. In terms of the architecture of the model, we select the number of residual blocks between {1, . . . , 10}, the size of the embeddings within {128, . . . , 256}, the number of units between {128, . . . , 2048}, and the kernel width between {3, . . . , 5}. In general, finding a good architecture is simple and the rule of thumb is that the larger the model, the better the performance. In terms of optimization, we initialize the layers of the model with the Kaiming initialization (He et al., 2015b), with the learning rate sampled uniformly in the interval [1., 2.], the momentum set to 0.99 and clipping set to 0.1. Good hyper-parameters for the optimizer are quite straightforward to find and the optimal values do not seem to change very much between datasets.

5. Results

LSTMs and recurrent networks are able to capture long-term dependencies and are fast becoming cornerstones of natural language processing. In this section, we compare strong LSTM and RNN models from the literature to our gated convolutional approach on two datasets.

Table 1 shows that our model outperforms all state-of-the-art approaches that have been trained on a single GPU on the Google Billion Word benchmark. Of the methods that use multiple GPUs, only the very large LSTM of Jozefowicz et al. (2016) achieves better results. However, this model was trained on 32 GPUs for 3 weeks, whereas our model trains on a single GPU in 2 weeks. The GCNN-13 model has 13 layers of 1268 units each and kernel width 4.

Model                               Test PPL
LSTM-1024 (Grave et al., 2016b)     48.7
GCNN-8                              44.9

Table 2. Results on the WikiText-103 dataset.

On Google Billion Word, the average sentence length is only 20 words, which is relatively short. Next, we test on WikiText-103 to answer the question whether our model can perform equally well in a setup where much larger contexts are possible. On WikiText-103, an input sequence is an entire Wikipedia article instead of an individual sentence. The results (Table 2) show that the gated convolutional model outperforms an LSTM on this problem as well. The GCNN-8 model has 8 layers with 800 units each and the LSTM has 1024 units.

              Throughput             Responsiveness
              (CPU)     (GPU)        (GPU)
LSTM-2048     169       45,622       2,282
GCNN-22       179       45,878       45,878

Table 3. Processing speed in tokens/s at test time for an LSTM with 2048 units and a GCNN with 22 layers achieving 43.9 and 43.8 perplexity, respectively, on Google Billion Word. The GCNN improves the responsiveness by 20 times while maintaining high throughput.
[Figure 2: test perplexity vs. training epochs on WikiText-103 and vs. training hours on Google Billion Word for Tanh, ReLU, GTU and GLU activations.]
Figure 2. Learning curves on WikiText-103 (left) and Google Billion Word (right) for models with different activation mechanisms. Models with gated linear units (GLU) converge faster and to a lower perplexity.

5.1. Computational Efficiency

Computational cost is an important consideration for language models. Depending on the application, there are a number of metrics to consider. We measure the throughput of a model as the number of tokens that can be processed per second. Throughput can be maximized by processing many sentences in parallel to amortize sequential operations. In contrast, responsiveness is the speed of processing the input sequentially, one token at a time. Throughput is important because it indicates the time required to process a corpus of text, and responsiveness is an indicator of the time to finish processing a sentence. A model can have low responsiveness but high throughput by evaluating many sentences simultaneously through batching. In this case, such a model is slow in finishing the processing of individual sentences, but can process many sentences at a good rate.

We evaluate the throughput and responsiveness for models that reach approximately 43.9 perplexity on the Google Billion Word benchmark. We consider the LSTM with 2048 units in Table 1 and a GCNN with 22 layers with ResNet blocks that have a bottleneck structure as described by He et al. (2015a). The network has 3 bottleneck blocks of the form 128, 128, 512, followed by a 256, 256, 512 block, followed by a fully connected 1024, 1024, 2048 block. Note that only the middle layer of these blocks is a convolution (k = 5). We found that this architecture is quite important to obtain good computational efficiency.

              Parameters     FLOPs/token
LSTM-2048     289M           19M
GCNN-22       185M           14M

Table 4. Number of parameters and FLOPs for the models of Figure 3. FLOPs exclude the operations required by the softmax layer, which are identical.

The throughput of the LSTM is measured by using a large batch of 750 sequences of length 20, resulting in 15,000 tokens per batch. Table 3 shows that the throughputs of the LSTM and the GCNN are similar on CPU but not on GPU. The LSTM performs very well on GPU because the large batch size of 750 enables high parallelization. This is because the LSTM implementation has been thoroughly optimized and uses cuDNN, whereas the cuDNN implementation of convolutions has not been optimized for the 1-D convolutions we use in our model. We believe much better performance can be achieved by a more efficient 1-D cuDNN convolution. Unlike the LSTM, the GCNN can be parallelized both over sequences as well as across the tokens of each sequence. On the other hand, the GCNN is 20 times faster in terms of responsiveness.

Table 4 shows that the convolutional model requires fewer parameters and floating point operations per token than a comparable LSTM.
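As an illustration of the two metrics (not the paper's benchmarking code), throughput can be estimated by timing a large batch, and responsiveness by feeding the input one token at a time; the dummy linear model below merely stands in for an LSTM or GCNN.

```python
import time
import torch
import torch.nn as nn

model = nn.Linear(128, 128)  # dummy stand-in for an LSTM or GCNN
model.eval()

def tokens_per_second(batch: int, length: int, steps: int = 20) -> float:
    x = torch.randn(batch, length, 128)
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(steps):
            model(x)  # each call processes batch * length tokens
    return steps * batch * length / (time.perf_counter() - start)

# Throughput: many sequences in parallel, e.g. 750 sequences of 20 tokens.
print("throughput    :", round(tokens_per_second(batch=750, length=20)), "tokens/s")
# Responsiveness: a single sequence processed one token at a time.
print("responsiveness:", round(tokens_per_second(batch=1, length=1)), "tokens/s")
```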
5.2. Gating Mechanisms

In this section we compare the gated linear unit with other mechanisms as well as with models without gating. We consider the LSTM-style gating mechanism (GTU) tanh(X ∗ W + b) ⊗ σ(X ∗ V + c) of Oord et al. (2016b) and networks that use regular ReLU or Tanh activations. Gating units add parameters, so in order to make a fair comparison we carefully cross-validate models with a comparable number of parameters. Figure 2 (left) shows that GLU networks converge to a lower perplexity than the other approaches on WikiText-103. Similar to gated linear units, the ReLU has a linear path that lets the gradients easily pass through the active units. In our experiments we observe that this translates to much faster convergence for both the ReLU and the GLU. On the other hand, neither Tanh nor GTU have a linear path and thus suffer from vanishing gradients.

Comparing the GTU and Tanh models allows us to measure the effect of gating, since the Tanh model can be thought of as a GTU network with the sigmoid gating units removed. The results (Figure 2, left) show that the gating units make a vast difference. Both Tanh and GTU units suffer from vanishing gradients, since in the GTU both the inputs as well as the gating units cut the gradients when the units saturate. We argue that the difference between GTU and Tanh indicates that gating units provide useful modeling capabilities. The ReLU unit is not an exact ablation of the gating units in the GLU, but it can be seen as a simplification ReLU(X) = X ⊗ (X > 0) where the gates become active depending on the sign of the input. However, also in this case, GLU units lead to lower perplexity.

In Figure 2 (right) we repeat the same experiment on the larger Google Billion Word dataset. We consider a fixed time budget of 100 hours because of the considerable training time required for this task. Similar to WikiText-103, we see that gated linear units achieve the best results on this problem. There is a gap of about 5 perplexity points between the GLU and ReLU, which is similar to the difference between the LSTM and RNN models measured by Jozefowicz et al. (2016) on the same dataset.

5.3. Non-linear Modeling

[Figure 3: test perplexity vs. training hours on Google Billion Word for Linear, Bilinear and GLU models.]
Figure 3. Learning curves on Google Billion Word for models with varying degrees of non-linearity.

The experiments so far have shown that the gated linear unit benefits from the linear path the unit provides compared to other non-linearities. Next, we compare networks with GLUs to purely linear networks and networks with bilinear layers in order to measure the impact of the non-linear path provided by the gates of the GLU. One motivation for this experiment is the success of linear models on many natural language processing tasks (Manning & Schütze, 1999).

We consider deep linear convolutional networks where the layers lack the gating units of the GLU and take the form hl(X) = X ∗ W + b. Stacking several such layers on top of each other is simply a factorization of the model, which remains linear up to the softmax, at which point it becomes log-linear. Another variation of GLUs are bilinear layers (Mnih & Hinton, 2007), which take the form hl(X) = (X ∗ W + b) ⊗ (X ∗ V + c). This is similar to GLUs but with linear gating units instead.

Figure 3 shows that GLUs perform best, followed by bilinear layers and then linear layers. Bilinear layers improve over linear ones by more than 40 perplexity points, and the GLU improves by another 20 perplexity points over the bilinear model. The linear model performs very poorly, at perplexity 115, even compared to the 67.6 of a Kneser-Ney 5-gram model, even though the former has access to more context. The linear gating units of the bilinear model provide a way for the model to modulate the flow of information in the network, and the large reduction in perplexity shows that this is important. Surprisingly, introducing the linear gating units is enough to reach 61 perplexity on Google 1B, which surpasses Kneser-Ney 5-gram models and the non-linear neural model of Ji et al. (2015). However, the non-linear gating units of the GLU ultimately perform better.

5.4. Network Depth

[Figure 4: test perplexity vs. network depth (7 to 13 layers) on Google Billion Word.]
Figure 4. Impact of network depth on test perplexity for Google Billion Word. Deeper models perform better.

Next we turn to the question of how network depth affects the accuracy of our model. Figure 4 shows that perplexity on Google Billion Word improves as we increase the depth of the model. This also shows that good results are possible with a number of layers smaller than the average sentence length of 20 on GBW, since we use 13 layers in this setting. The GCNN in Table 1 builds context representations by applying exactly 13 layers to each input, while a recurrent model would pass information through 20 sequential layers on average on this corpus.
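The context a stacked convolutional model can use follows directly from its depth and kernel width: each convolutional layer with kernel width k (stride 1, no dilation) widens the causal receptive field by k − 1 words, while the 1×1 layers of a bottleneck block add nothing. The helper below is our own illustration of that relationship; the paper does not state this formula explicitly.

```python
def receptive_field(depth: int, kernel_width: int) -> int:
    """Words of context visible to a stack of causal convolutions.

    Each convolutional layer with kernel width k (stride 1, no dilation)
    widens the causal receptive field by k - 1 positions.
    """
    return depth * (kernel_width - 1) + 1

# A few depth / kernel-width combinations and the context they cover.
for depth in (4, 8, 13):
    for k in (3, 4, 5):
        print(f"depth={depth:2d}  k={k}  context={receptive_field(depth, k):3d} words")
```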
[Figure 5: test perplexity vs. context size in tokens.]
Figure 5. Test perplexity as a function of context for Google Billion Word (left) and Wiki-103 (right). We observe that models with bigger context achieve better results but the results start diminishing quickly after a context of 20.

5.5. Context Size

Figure 5 shows the impact of context size for the gated CNN. We tried different combinations of network depth and kernel width for each context size and chose the best performing one for each size. Generally, larger contexts improve accuracy, but returns diminish drastically for windows larger than 20 words, even for WikiText-103 where we may condition on an entire Wikipedia article. This means that the unlimited context offered by recurrent models is not strictly necessary for language modeling. Furthermore, this finding is also congruent with the fact that good performance with recurrent networks can be obtained by truncating gradients after only 20 timesteps using truncated back-propagation through time. Figure 5 also shows that WikiText-103 benefits much more from larger context sizes than Google Billion Word, as the performance degrades more sharply with smaller contexts. WikiText-103 provides much more context than Google Billion Word, where the average sentence size is 20. However, while the average size of the documents is close to 4,000 tokens, we find that strong performance can be achieved with a context size as low as 30 tokens.

5.6. Training Algorithms

In this section, we perform an ablation of weight normalization and gradient clipping. We separately cross-validate the hyper-parameters of each configuration to make the comparison fair. Due to the high cost of each of these experiments, we only consider a single iteration over the training data. Figure 6 shows that both methods significantly speed up convergence. Weight normalization in particular improves the speed by over two times. This speed-up is partly due to the ability to use much larger learning rates (1 instead of 0.01) than would otherwise be possible. Both clipping and weight normalization add computational overhead, but it is small compared to the large gains in convergence speed.

[Figure 6: test perplexity vs. number of updates when training without clipping, without weight normalization, and with both.]
Figure 6. Effect of weight normalization and gradient clipping on Google Billion Word.

6. Conclusion

We introduce a convolutional neural network for language modeling with a novel gating mechanism. Compared to recurrent neural networks, our approach builds a hierarchical representation of the input words that makes it easier to capture long-range dependencies, similar in spirit to the tree-structured analysis of linguistic grammar formalisms. The same property eases learning since features are passed through a fixed number of layers and non-linearities, unlike for recurrent networks where the number of processing steps differs depending on the position of the word in the input. The results show that our gated convolutional network achieves a new state of the art on WikiText-103. On the larger Google Billion Word benchmark, we achieve a new best result for models trained on a single GPU, thereby outperforming several strong LSTM results.
Acknowledgments

We would like to thank Jonas Gehring, Edouard Grave, Armand Joulin and Ronan Collobert for helpful discussions related to this work.

References

Bengio, Yoshua, Ducharme, Réjean, Vincent, Pascal, and Jauvin, Christian. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

Chelba, Ciprian, Mikolov, Tomas, Schuster, Mike, Ge, Qi, Brants, Thorsten, Koehn, Phillipp, and Robinson, Tony. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

Chen, Stanley F and Goodman, Joshua. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 310–318. Association for Computational Linguistics, 1996.

Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clement. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011. URL http://torch.ch.

Dauphin, Yann N and Grangier, David. Predicting distributions with linearizing belief networks. arXiv preprint arXiv:1511.05622, 2015.

Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. The handbook of brain theory and neural networks, 2010.

Grave, E., Joulin, A., Cissé, M., Grangier, D., and Jégou, H. Efficient softmax approximation for GPUs. ArXiv e-prints, September 2016a.

Grave, E., Joulin, A., and Usunier, N. Improving neural language models with a continuous cache. ArXiv e-prints, December 2016b.

Gutmann, Michael and Hyvärinen, Aapo. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015a.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015b.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Ji, Shihao, Vishwanathan, S. V. N., Satish, Nadathur, Anderson, Michael J, and Dubey, Pradeep. BlackOut: Speeding up recurrent neural network language models with very large vocabularies. arXiv preprint arXiv:1511.06909, 2015.

Jozefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

Kneser, Reinhard and Ney, Hermann. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995 (ICASSP-95), volume 1, pp. 181–184. IEEE, 1995.

Koehn, Philipp. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition, 2010. ISBN 0521874157, 9780521874151.

LeCun, Yann and Bengio, Yoshua. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10), 1995.

Manning, Christopher D and Schütze, Hinrich. Foundations of Statistical Natural Language Processing, 1999.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. ArXiv e-prints, September 2016.

Mikolov, Tomáš, Karafiát, Martin, Burget, Lukáš, Cernocký, Jan, and Khudanpur, Sanjeev. Recurrent neural network based language model. In Proc. of INTERSPEECH, pp. 1045–1048, 2010.

Mnih, Andriy and Hinton, Geoffrey. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pp. 641–648. ACM, 2007.

Morin, Frederic and Bengio, Yoshua. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pp. 246–252. Citeseer, 2005.

Oord, Aaron van den, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016a.

Oord, Aaron van den, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328, 2016b.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, pp. 1310–1318, 2013.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.

Shazeer, Noam, Pelemans, Joris, and Chelba, Ciprian. Skip-gram language modeling using sparse non-negative matrix probability estimation. arXiv preprint arXiv:1412.1454, 2014.

Steedman, Mark. The Syntactic Process. 2002.

Sutskever, Ilya, Martens, James, Dahl, George E, and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. 2013.

Yu, Dong and Deng, Li. Automatic Speech Recognition: A Deep Learning Approach. Springer Publishing Company, Incorporated, 2014. ISBN 1447157788, 9781447157786.
