0% found this document useful (0 votes)

157 views9 pages

NLP Cache Model

This document summarizes a paper that proposes a Neural Cache Model to improve neural language models. The Neural Cache Model stores past hidden states from the language model and uses them to retrieve the next word through a dot product with the current hidden state. This allows the model to adapt to recent context without complex memory mechanisms. Evaluation on several datasets shows the Neural Cache Model outperforms other memory augmented networks.

Uploaded by

Nilpa Jha

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

157 views9 pages

NLP Cache Model

Uploaded by

Nilpa Jha

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

Published as a conference paper at ICLR 2017

I MPROVING N EURAL L ANGUAGE M ODELS WITH A

C ONTINUOUS C ACHE
Edouard Grave, Armand Joulin, Nicolas Usunier
Facebook AI Research
{egrave,ajoulin,usunier}@fb.com

A BSTRACT

We propose an extension to neural network language models to adapt their pre-

diction to the recent history. Our model is a simplified version of memory aug-
mented networks, which stores past hidden activations as memory and accesses
them through a dot product with the current hidden activation. This mechanism is
very efficient and scales to very large memory sizes. We also draw a link between
the use of external memory in neural network and cache models used with count
based language models. We demonstrate on several language model datasets that
our approach performs significantly better than recent memory augmented net-
works.

1 I NTRODUCTION

Language models, which are probability distributions over sequences of words, have many appli-
cations such as machine translation (Brown et al., 1993), speech recognition (Bahl et al., 1983) or
dialogue agents (Stolcke et al., 2000). While traditional neural networks language models have ob-
tained state-of-the-art performance in this domain (Jozefowicz et al., 2016; Mikolov et al., 2010),
they lack the capacity to adapt to their recent history, limiting their application to dynamic environ-
ments (Dodge et al., 2015). A recent approach to solve this problem is to augment these networks
with an external memory (Graves et al., 2014; Grefenstette et al., 2015; Joulin & Mikolov, 2015;
Sukhbaatar et al., 2015). These models can potentially use their external memory to store new
information and adapt to a changing environment.
While these networks have obtained promising results on language modeling datasets (Sukhbaatar
et al., 2015), they are quite computationally expensive. Typically, they have to learn a parametrizable
mechanism to read or write to memory cells (Graves et al., 2014; Joulin & Mikolov, 2015). This may
limit both the size of their usable memory as well as the quantity of data they can be trained on. In
this work, we propose a very light-weight alternative that shares some of the properties of memory
augmented networks, notably the capability to dynamically adapt over time. By minimizing the
computation burden of the memory, we are able to use larger memory and scale to bigger datasets.
We observe in practice that this allows us to surpass the perfomance of memory augmented networks
on different language modeling tasks.
Our model share some similarities with a model proposed by Kuhn (1988), called the cache model.
A cache model stores a simple representation of the recent past, often in the form of unigrams, and
uses them for prediction (Kuhn & De Mori, 1990). This contextual information is quite cheap to
store and can be accessed efficiently. It also does not need any training and can be appplied on
top of any model. This makes this model particularly interesting for domain adaptation (Kneser &
Steinbiss, 1993).
Our main contribution is to propose a continuous version of the cache model, called Neural Cache
Model, that can be adapted to any neural network language model. We store recent hidden activations
and use them as representation for the context. Using simply a dot-product with the current hidden
activations, they turn out to be extremely informative for prediction. Our model requires no training
and can be used on any pre-trained neural networks. It also scales effortlessly to thousands of
memory cells. We demonstrate the quality of the Neural Cache models on several language model
tasks and the LAMBADA dataset (Paperno et al., 2016).

1
Published as a conference paper at ICLR 2017

2 L ANGUAGE MODELING
A language model is a probability distribution over sequences of words. Let V be the size of the
vocabulary; each word is represented by a one-hot encoding vector x in RV = V, corresponding to
its index in the vocabulary. Using the chain rule, the probability assigned to a sequence of words
x1 , . . . , xT can be factorized as
T
Y
p(x1 , ..., xT ) = p(xt | xt1 , ..., x1 ).
t=1

Language modeling is often framed as learning the conditional probability over words, given the
history (Bahl et al., 1983).
This conditional probability is traditionally approximated with non-parameteric models based on
counting statistics (Goodman, 2001). In particular, smoothed N-gram models (Katz, 1987; Kneser &
Ney, 1995) achieve good performance in practice (Mikolov et al., 2011). Parametrized alternatives
are either maximum entropy language models (Rosenfeld, 1996), feedforward networks (Bengio
et al., 2003) or recurrent networks (Mikolov et al., 2010). In particular, recurrent networks are
currently the best solution to approximate this conditional probability, achieving state-of-the-arts
performance on standard language modeling benchmarks (Jozefowicz et al., 2016; Zilly et al., 2016).

Recurrent networks. Assuming that we have a vector ht Rd encoding the history xt , ..., x1 ,
the conditional probability of a word w can be parametrized as
pvocab (w | xt , ..., x1 ) exp(h>
t ow ).

The history vector ht is computed by a recurrent network by recursively applying an equation of the
form
ht = (xt , ht1 ) ,
where is a function depending on the architecture of the network. Several architecture for recur-
rent networks have been proposed, such as the Elman network (Elman, 1990), the long short-term
memory (LSTM) (Hochreiter & Schmidhuber, 1997) or the gated recurrent unit (GRU) (Chung
et al., 2014). One of the simplest recurrent networks is the Elman network (Elman, 1990), where
ht = (Lxt + Rht1 ) ,
where is a non-linearity such as the logistic or tanh functions, L RdV is a word embedding
matrix and R Rdd is the recurrent matrix. The LSTM architecture is particularly interesting in
the context of language modelling (Jozefowicz et al., 2016) and we refer the reader to Graves et al.
(2013) for details on this architecture.
The parameters of recurrent neural network language models are learned by minimizing the nega-
tive log-likelihood of the training data. This objective function is usually minimized by using the
stochastic gradient descent algorithm, or variants such as Adagrad (Duchi et al., 2011). The gradient
is computed using the truncated backpropagation through time algorithm (Werbos, 1990; Williams
& Peng, 1990).

Cache model. After a word appears once in a document, it is much more likely to appear again.
As an example, the frequency of the word tiger on the Wikipedia page of the same name is 2.8%,
compared to 0.0037% over the whole Wikipedia. Cache models exploit this simple observation
to improve n-gram language models by capturing long-range dependencies in documents. More
precisely, these models have a cache component, which contains the words that appeared in the
recent history (either the document or a fixed number of words). A simple language model, such as
a unigram or smoothed bigram model, is fitted on the words of the cache and interpolated with the
static language model (trained over a larger dataset). This technique has many advantages. First,
this is a very efficient way to adapt a language model to a new domain. Second, such models can
predict out-of-vocabulary words (OOV words), after seeing them once. Finally, this helps capture
long-range dependencies in documents, in order to generate more coherent text.

2
Published as a conference paper at ICLR 2017

Id x5
(h1 , x2 ) (h2 , x3 ) (h3 , x4 )
Figure 1: The neural cache stores
Id Id Id O the previous hidden states in memory
cells. They are then used as keys to re-
R R R trieve their corresponding word, that
h1 h2 h3 h4 is the next word. There is no transfor-
mation applied to the storage during
writing and reading.
L L L L

x1 x2 x3 x4

3 N EURAL C ACHE M ODEL

The Neural Cache Model adds a cache-like memory to neural network language models. It exploits
the hidden representations ht to define a probability distribution over the words in the cache. As
illustrated Figure 1, the cache stores pairs (hi , xi+1 ) of a hidden representation, and the word which
was generated based on this representation (we remind the reader that the vector hi encodes the
history xi , ..., x1 ). At time t, we then define a probability distribution over words stored in the cache
based on the stored hidden representations and the current one ht as
t1
1{w=xi+1 } exp(h>
X
pcache (w | h1..t , x1..t ) t hi )
i=1
where the scalar is a parameter which controls the flatness of the distribution. When is equal
to zero, the probability distribution over the history is uniform, and our model is equivalent to a
unigram cache model (Kuhn & De Mori, 1990).
From the point of view of memory-augmented neural networks, the probability
pcache (w | h1..t , x1..t ) given by the neural cache model can be interpreted as the probability
to retrieve the word w from the memory given the query ht , where the desired answer is the next
word xt+1 . Using previous hidden states as keys for the words in the memory, the memory lookup
operator can be implemented with simple dot products between the keys and the query. In contrast
to existing memory-augmented neural networks, the neural cache model avoids the need to learn the
memory lookup operator. Such a cache can thus be added to a pre-trained recurrent neural language
model without fine tuning of the parameters, and large cache size can be used with negligible impact
on the computational cost of a prediction.

Neural cache language model. Following the standard practice in n-gram cache-based language
models, the final probability of a word is given by the linear interpolation of the cache language
model with the regular language model, obtaining:
p(w | h1..t , x1..t ) = (1 )pvocab (w | ht ) + pcache (w | h1..t , x1..t ) .

Instead of taking a linear interpolation between the two distribution with a fixed , we also consider
a global normalization over the two distribution:
t1
!
1{w=xi+1 } exp(ht hi + ) .
X
> >
p(w | h1..t , x1..t ) exp(ht ow ) +
i=1
This corresponds to taking a softmax over the vocabulary and the words in the cache. The parameter
controls the weight of the cache component, and is the counterpart of the parameter for linear
interpolation.
The addition of the neural cache to a recurrent neural language model inherits the advantages of n-
gram caches in usual cache-based models: The probability distribution over words is updated online
depending on the context, and out-of-vocabulary words can be predicted as soon as they have been
seen at least once in the recent history. The neural cache also inherits the ability of the hidden states
of recurrent neural networks to model longer-term contexts than small n-grams, and thus allows for
a finer modeling of the current context than e.g., unigram caches.

3
Published as a conference paper at ICLR 2017

Linear interpolation (ptb) Global normalization (ptb)

0.05 0.0
0.1 96 0.5 96
lambda 0.15 1.0
0.2 90 1.5 90

alpha
0.25 2.0 84
0.3 84 2.5
0.35 78 3.0 78
0.4 3.5
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.08 0.16 0.24 0.32 0.4
theta theta
Figure 2: Perplexity on the validation set of Penn Tree Bank for linear interpolation (left) and
global normalization (right), for various values of hyperparameters , and . We use a cache
model of size 500. The base model has a validation perplexity of 86.9. The best linear interpolation
has a perplexity of 74.6, while the best global normalization has a perplexity of 74.9.

Model Test PPL

RNN+LSA+KN5+cache (Mikolov & Zweig, 2012) 90.3
LSTM (Zaremba et al., 2014) 78.4
Variational LSTM (Gal & Ghahramani, 2015) 73.4
Recurrent Highway Network (Zilly et al., 2016) 66.0
Pointer Sentinel LSTM (Merity et al., 2016) 70.9
LSTM (our implem.) 82.3
Neural cache model 72.1

Table 1: Test perplexity on the Penn Tree Bank.

Training procedure. For now, we first train the (recurrent) neural network language model, with-
out the cache component. We only apply the cache model at test time, and choose the hyperparam-
eters and (or ) on the validation set. A big advantage of our method is that it is very easy
and cheap to apply, with already trained neural models. There is no need to perform backpropaga-
tion over large contexts, and we can thus apply our method with large cache sizes (larger than one
thousand).

4 R ELATED WORK

Cache model. Adding a cache to a language model was intoducted in the context of speech recog-
nition(Kuhn, 1988; Kupiec, 1989; Kuhn & De Mori, 1990). These models were further extended by
Jelinek et al. (1991) into a smoothed trigram language model, reporting reduction in both perplexity
and word error rates. Della Pietra et al. (1992) adapt the cache to a general n-gram model such that
it satisfies marginal constraints obtained from the current document.

Adaptive language models. Other adaptive language models have been proposed in the past:
Kneser & Steinbiss (1993) and Iyer & Ostendorf (1999) dynamically adapt the parameters of their
model to the recent history using different weight interpolation schemes. Bellegarda (2000) and
Coccaro & Jurafsky (1998) use latent semantic analysis to adapt their models to the current context.
Similarly, topic features have been used with either maximum entropy models (Khudanpur & Wu,
2000) or recurrent networks (Mikolov & Zweig, 2012; Wang & Cho, 2015). Finally, Lau et al.
(1993) proposes to use pairs of distant of words to capture long-range dependencies.

Memory augmented neural networks. In the context of sequence prediction, several memory
augmented neural networks have obtained promising results (Sukhbaatar et al., 2015; Graves et al.,
2014; Grefenstette et al., 2015; Joulin & Mikolov, 2015). In particular, Sukhbaatar et al. (2015)
stores a representation of the recent past and accesses it using an attention mechanism Bahdanau
et al. (2014). Sukhbaatar et al. (2015) shows that this reduces the perplexity for language modeling.

4
Published as a conference paper at ICLR 2017

Linear interpolation (wikitext2) Global normalization (wikitext2)

0.05 104 0.0 104
0.1 96 0.5 96
lambda 0.15 1.0
0.2 1.5

alpha
0.25 88 2.0 88
0.3 80 2.5 80
0.35 3.0
0.4 72 3.5 72
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.08 0.16 0.24 0.32 0.4
theta theta
Figure 3: Perplexity on the validation set of wikitext2 for linear interpolation (left) and global
normalization (right), for various values of hyperparameters , and . We use a cache model of
size 2000. The base model has a validation perplexity of 104.2. The best linear interpolation has a
perplexity of 72.1, while the best global normalization has a perplexity of 73.5.

Model wikitext2 wikitext103

Zoneout + Variational LSTM (Merity et al., 2016) 100.9 -
Pointer Sentinel LSTM (Merity et al., 2016) 80.8 -
LSTM (our implementation) 99.3 48.7
Neural cache model (size = 100) 81.6 44.8
Neural cache model (size = 2,000) 68.9 40.8

Table 2: Test perplexity on the wikitext datasets. The two datasets share the same validation and
test sets, making all the results comparable.

This approach has been successfully applied to question answering, when the answer is contained
in a given paragraph (Chen et al., 2016; Hermann et al., 2015; Kadlec et al., 2016; Sukhbaatar et al.,
2015). Similarly, Vinyals et al. (2015) explores the use of this mechanism to reorder sequences
of tokens. Their network uses an attention (or pointer) over the input sequence to predict which
element should be selected as the next output. Gulcehre et al. (2016) have shown that a similar
mechanism called pointer softmax could be used in the context of machine translation, to decide
which word to copy from the source to target.
Independently of our work, Merity et al. (2016) apply the same mechanism to recurrent network.
Unlike our work, they uses the current hidden activation as a representation of the current input
(while we use it to represent the output). This requires additional learning of a transformation
between the current representation and those in the past. The advantage of our approach is that we
can scale to very large caches effortlessly.

5 E XPERIMENTS
In this section, we evaluate our method on various language modeling datasets, which have different
sizes and characteristics. On all datasets, we train a static recurrent neural network language model
with LSTM units. We then use the hidden representations from this model to obtain our cache, which
is interpolated with the static LSTM model. We also evaluate a unigram cache model interpolated
with the static model as another baseline.

5.1 S MALL SCALE EXPERIMENTS

Datasets. In this section, we describe experiments performed on two small datasets: the Penn
Tree Bank (Marcus et al., 1993) and the wikitext2 (Merity et al., 2016) datasets. The Penn
Tree Bank dataset is made of articles from the Wall Street Journal, contains 929k training tokens
and has a vocabulary size of 10k. The wikitext2 dataset is derived from Wikipedia articles,
contains 2M training tokens and has a vocabulary size of 33k. These datasets contain non-shuffled
documents, therefore requiring models to capture inter-sentences dependencies to perform well.

5
Published as a conference paper at ICLR 2017

text8 wikitext103
125 49
120 48
47
115 46
perplexity

perplexity
110 baseline 45 baseline
unigram 44 unigram
105 neural 43 neural
100 42
41
95 40
102 103 104 102 103 104
cache size (log scale) cache size (log scale)
Figure 4: Test perplexity as a function of the number of words in the cache, for our method and a
unigram cache baseline. We observe that our approach can uses larger caches than the baseline.

Implementation details. We train recurrent neural network language models with 1024 LSTM
units, regularized with dropout (probability of dropping out units equals to 0.65). We use the Ada-
grad algorithm, with a learning rate of 0.2, a batchsize of 20 and initial weight uniformly sampled in
the range [0.05, 0.05]. We clip the norm of the gradient to 0.1 and unroll the network for 30 steps.
We consider cache sizes on a logarithmic scale, from 50 to 10, 000, and fit the cache hyperparameters
on the validation set.

Results. We report the perplexity on the validation sets in Figures 2 and 3, for various values
of hyperparameters, for linear interpolation and global normalization. First, we observe that on
both datasets, the linear interpolation method performs slightly better than the global normalization
approach. It is also easier to apply in practice, and we thus use this method in the remainder of this
paper. In Tables 1 and 2, we report the test perplexity of our approach and state-of-the-art models.
Our approach is competitive with previous models, in particular with the pointer sentinel LSTM
model of Merity et al. (2016). On Penn Tree Bank, we note that the improvement over the base
model is similar for both methods. On the wikitext2 dataset, both methods obtain similar results
when using the same cache size (100 words). Since our method is computationally cheap, it is easy
to increase the cache to larger values (2, 000 words), leading to dramatic improvements (30% over
the baseline, 12% over a small cache of 100 words).

5.2 M EDIUM SCALE EXPERIMENTS

Datasets and implementation details. In this section, we describe experiments performed over
two medium scale datasets: text8 and wikitext103. Both datasets are derived from Wikipedia,
but different pre-processing were applied. The text8 dataset contains 17M training tokens and
has a vocabulary size of 44k words, while the wikitext103 dataset has a training set of size
103M, and a vocabulary size of 267k words. We use the same setting as in the previous section,
except for the batchsize (we use 128) and dropout parameters (we use 0.45 for text8 and 0.25 for
wikitext103). Since both datasets have large vocabularies, we use the adaptive softmax (Grave
et al., 2016) for faster training.

Results. We report the test perplexity as a function of the cache size in Figure 4, for the neural
cache model and a unigram cache baseline. We observe that our approach can exploits larger cache
sizes, compared to the baseline. In Table 2, we observe that the improvement in perplexity of
our method over the LSTM baseline on wikitext103 is smaller than for wikitext2 (approx.
16% v.s. 30%). The fact that improvements obtained with more advanced techniques decrease
when the size of training data increases has already been observed by Goodman (2001). Both
wikitext datasets sharing the same test set, we also observe that the LSTM baseline, trained
on 103M tokens (wikitext103), strongly outperforms more sophisticated methods, trained on
2M tokens (wikitext2). For these two reasons, we believe that it is important to evaluate and
compare methods on relatively large datasets.

6
Published as a conference paper at ICLR 2017

Model Test Model Dev Ctrl

LSTM-500 (Mikolov et al., 2014) 156 WB5 (Paperno et al., 2016) 3125 285
SCRNN (Mikolov et al., 2014) 161 WB5+cache (Paperno et al., 2016) 768 270
MemNN (Sukhbaatar et al., 2015) 147 LSTM-512 (Paperno et al., 2016) 5357 149
LSTM-1024 (our implem.) 121.8 LSTM-1024 (our implem.) 4088 94
Neural cache model 99.9 Neural cache model 138 129

(a) text8 (b) lambada

Table 3: Perplexity on the text8 and lambada datasets. WB5 stands for 5-gram language model
with Witten-Bell smoothing.

lambada
700
600 Control
500 Development
perplexity

400
300
200
100
0
0.0 0.2 0.4 0.6 0.8 1.0
lambda
Figure 5: Perplexity on the development and control sets of lambada, as a function of the interpo-
lation parameters .

5.3 E XPERIMENTS ON THE LAMBADA DATASET

Finally, we report experiments carried on the lambada dataset, introduced by Paperno et al. (2016).
This is a dataset of short passages extracted from novels. The goal is to predict the last word of the
excerpt. This dataset was built so that human subjects solve the task perfectly when given the full
context (approx. 4.6 sentences), but fail to do so when only given the sentence with the target word.
Thus, most state-of-the-art language models fail on this dataset. The lambada training set contains
approximately 200M tokens and has a vocabulary size of 93, 215. We report results for our method
in Table 3, as well the performance of baselines from Paperno et al. (2016). Adding a neural cache
model to the LSTM baseline strongly improves the performance on the lambada dataset. We also
observe in Figure 5 that the best interpolation parameter between the static model and the cache
is not the same for the development and control sets. This is due to the fact that more than 83%
of passages of the development set include the target word, while this is true for only 14% of the
control set. Ideally, a model should have strong results on both sets. One possible generalization of
our model would be to adapt the interpolation parameter based on the current vector representation
of the history ht .

6 C ONCLUSION

We presented the neural cache model to augment neural language models with a longer-term mem-
ory that dynamically updates the word probablilities based on the long-term context. A neural cache
can be added on top of a pre-trained language model at negligible cost. Our experiments on both lan-
guage modeling tasks and the challenging LAMBADA dataset shows that significant performance
gains can be expected by adding this external memory component.
Technically, the neural cache models is similar to some recent memory-augmented neural networks
such as pointer networks. However, its specific design makes it possible to avoid learning the mem-
ory lookup component. This makes the neural cache appealing since it can use larger cache sizes
than memory-augment networks and can be applied as easily as traditional count-based caches.

7
Published as a conference paper at ICLR 2017

R EFERENCES
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to
align and translate. arXiv preprint arXiv:1409.0473, 2014.
Lalit R Bahl, Frederick Jelinek, and Robert L Mercer. A maximum likelihood approach to continuous speech
recognition. PAMI, 1983.
Jerome R Bellegarda. Exploiting latent semantic information in statistical language modeling. Proceedings of
the IEEE, 2000.
Yoshua Bengio, Rjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language
model. JMLR, 2003.
Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. The mathematics of
statistical machine translation: Parameter estimation. Computational linguistics, 1993.
Danqi Chen, Jason Bolton, and Christopher D Manning. A thorough examination of the cnn/daily mail reading
comprehension task. arXiv preprint arXiv:1606.02858, 2016.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated re-
current neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
Noah Coccaro and Daniel Jurafsky. Towards better integration of semantic predictors in statistical language
modeling. In ICSLP. Citeseer, 1998.
Stephen Della Pietra, Vincent Della Pietra, Robert L Mercer, and Salim Roukos. Adaptive language modeling
using minimum discriminant estimation. In Proceedings of the workshop on Speech and Natural Language,
1992.
Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam,
and Jason Weston. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint
arXiv:1511.06931, 2015.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic
optimization. JMLR, 2011.
Jeffrey L Elman. Finding structure in time. Cognitive science, 1990.
Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural net-
works. arXiv preprint arXiv:1512.05287, 2015.
Joshua T Goodman. A bit of progress in language modeling. Computer Speech & Language, 2001.
Edouard Grave, Armand Joulin, Moustapha Ciss, David Grangier, and Herv Jgou. Efficient softmax ap-
proximation for gpus. arXiv preprint arXiv:1609.04309, 2016.
A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP,
2013.
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with
unbounded memory. In Advances in Neural Information Processing Systems, pp. 18281836, 2015.
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. Pointing the unknown
words. arXiv preprint arXiv:1603.08148, 2016.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman,
and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, 2015.
Sepp Hochreiter and Jrgen Schmidhuber. Long short-term memory. Neural computation, 1997.
Rukmini M Iyer and Mari Ostendorf. Modeling long distance dependence in language: Topic mixtures versus
dynamic cache models. IEEE Transactions on speech and audio processing, 1999.
Frederick Jelinek, Bernard Merialdo, Salim Roukos, and Martin Strauss. A dynamic language model for speech
recognition. In HLT, 1991.
Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In
Advances in Neural Information Processing Systems, pp. 190198, 2015.

8
Published as a conference paper at ICLR 2017

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of
language modeling. arXiv preprint arXiv:1602.02410, 2016.
Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum
reader network. arXiv preprint arXiv:1603.01547, 2016.
Slava M Katz. Estimation of probabilities from sparse data for the language model component of a speech
recognizer. ICASSP, 1987.
Sanjeev Khudanpur and Jun Wu. Maximum entropy techniques for exploiting syntactic, semantic and colloca-
tional dependencies in language modeling. Computer Speech & Language, 2000.
Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In ICASSP, 1995.
Reinhard Kneser and Volker Steinbiss. On the dynamic adaptation of stochastic language models. In ICASSP,
1993.
Roland Kuhn. Speech recognition and the frequency of recently used words: A modified markov model for
natural language. In Proceedings of the 12th conference on Computational linguistics-Volume 1, 1988.
Roland Kuhn and Renato De Mori. A cache-based natural language model for speech recognition. PAMI, 1990.
Julien Kupiec. Probabilistic models of short and long distance word dependencies in running text. In Proceed-
ings of the workshop on Speech and Natural Language, 1989.
Raymond Lau, Ronald Rosenfeld, and Salim Roukos. Trigger-based language models: A maximum entropy
approach. In ICASSP, 1993.
Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of
english: The penn treebank. Computational linguistics, 1993.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv
preprint arXiv:1609.07843, 2016.
Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. In SLT,
2012.
Tomas Mikolov, Martin Karafit, Lukas Burget, Jan Cernock`y, and Sanjeev Khudanpur. Recurrent neural
network based language model. In INTERSPEECH, 2010.
Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Cernock`y. Empirical evaluation and
combination of advanced language modeling techniques. In INTERSPEECH, 2011.
Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and MarcAurelio Ranzato. Learning longer
memory in recurrent neural networks. arXiv preprint arXiv:1412.7753, 2014.
Denis Paperno, Germn Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro
Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernndez. The lambada dataset: Word prediction
requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
Ronald Rosenfeld. A maximum entropy approach to adaptive statistical language modeling. Computer, Speech
and Language, 1996.
Andreas Stolcke, Noah Coccaro, Rebecca Bates, Paul Taylor, Carol Van Ess-Dykema, Klaus Ries, Elizabeth
Shriberg, Daniel Jurafsky, Rachel Martin, and Marie Meteer. Dialogue act modeling for automatic tagging
and recognition of conversational speech. Computational linguistics, 2000.
Sainbayar Sukhbaatar, Szlam Arthur, Jason Weston, and Rob Fergus. End-to-end memory networks. In NIPS,
2015.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015.
Tian Wang and Kyunghyun Cho. Larger-context language modelling. arXiv preprint arXiv:1511.03729, 2015.
Paul J Werbos. Backpropagation through time: what it does and how to do it. 1990.
Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of recurrent net-
work trajectories. Neural computation, 1990.
Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint
arXiv:1409.2329, 2014.
Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutnk, and Jrgen Schmidhuber. Recurrent highway
networks. arXiv preprint arXiv:1607.03474, 2016.

Ai ML Final Lab Manual
0% (1)
Ai ML Final Lab Manual
23 pages
Previewpdf
No ratings yet
Previewpdf
75 pages
Artificial Neural Network Based Power System Restoratoin
50% (2)
Artificial Neural Network Based Power System Restoratoin
22 pages
Building An Open Source Facial Recognition System For Mass Surveillance
100% (1)
Building An Open Source Facial Recognition System For Mass Surveillance
31 pages
Data Mining
100% (3)
Data Mining
18 pages
Recurrent Neural Nets
No ratings yet
Recurrent Neural Nets
144 pages
Vehicle License Plate Detection and Recognition Using Neural Network
No ratings yet
Vehicle License Plate Detection and Recognition Using Neural Network
5 pages
Jntuk Machine Learning 3-2 Unit-5
No ratings yet
Jntuk Machine Learning 3-2 Unit-5
15 pages
1 s2.0 S1568494623000844 Main
No ratings yet
1 s2.0 S1568494623000844 Main
18 pages
Neural Turing
No ratings yet
Neural Turing
194 pages
Cheatsheet Recurrent Neural Networks
No ratings yet
Cheatsheet Recurrent Neural Networks
5 pages
Recurrent Neural Networks: Amir H. Payberah
No ratings yet
Recurrent Neural Networks: Amir H. Payberah
142 pages
Intro DL 10 NLP
No ratings yet
Intro DL 10 NLP
99 pages
Minorprojectthesis
No ratings yet
Minorprojectthesis
43 pages
Language Modeling
No ratings yet
Language Modeling
88 pages
A Survey of Large Language Models
No ratings yet
A Survey of Large Language Models
58 pages
(2303.18223) A Survey of Large Language Models
No ratings yet
(2303.18223) A Survey of Large Language Models
115 pages
Survey On Large Language Models
No ratings yet
Survey On Large Language Models
52 pages
Trend
No ratings yet
Trend
47 pages
2005 14165v3 PDF
No ratings yet
2005 14165v3 PDF
74 pages
Nn4nlp 02 LM
No ratings yet
Nn4nlp 02 LM
47 pages
SSRN Id3607845 PDF
No ratings yet
SSRN Id3607845 PDF
39 pages
L5 Cse256 Fa24 LM
No ratings yet
L5 Cse256 Fa24 LM
65 pages
Natural Language Processing With Deep Learning 1 PDF
No ratings yet
Natural Language Processing With Deep Learning 1 PDF
37 pages
Temporal Pattern Classification Using Spiking Neural Networks
No ratings yet
Temporal Pattern Classification Using Spiking Neural Networks
67 pages
Cs224n 2023 Lecture05 RNNLM
No ratings yet
Cs224n 2023 Lecture05 RNNLM
68 pages
A Survey Large Language Models
No ratings yet
A Survey Large Language Models
58 pages
NLP - N-Gram Language Model
No ratings yet
NLP - N-Gram Language Model
22 pages
Cs224n 2025 Lecture05 RNNLM
No ratings yet
Cs224n 2025 Lecture05 RNNLM
54 pages
01-Transformer Based NLP Applications
No ratings yet
01-Transformer Based NLP Applications
55 pages
XCS224N Module4 Slides
No ratings yet
XCS224N Module4 Slides
91 pages
Icml2016 Memnn Tutorial
No ratings yet
Icml2016 Memnn Tutorial
89 pages
Accepted Manuscript: Applied Soft Computing
No ratings yet
Accepted Manuscript: Applied Soft Computing
39 pages
Lecture 3 - Language Modelling and RNNs Part 1
No ratings yet
Lecture 3 - Language Modelling and RNNs Part 1
44 pages
NLP NN Language Modeling Week5
No ratings yet
NLP NN Language Modeling Week5
33 pages
Natual Language Processing
No ratings yet
Natual Language Processing
33 pages
PerceptiLabs-ML Handbook
No ratings yet
PerceptiLabs-ML Handbook
31 pages
1 s2.0 S0142112324004857 Main
No ratings yet
1 s2.0 S0142112324004857 Main
12 pages
BIA Master of Science in Computer Science 3
No ratings yet
BIA Master of Science in Computer Science 3
22 pages
5th Unit
No ratings yet
5th Unit
36 pages
A Self-Organizing Deep Network Architecture Designed Based On LSTM Network Via Elitism-Driven Roulette-Wheel Selection For Time-Series Forecasting
No ratings yet
A Self-Organizing Deep Network Architecture Designed Based On LSTM Network Via Elitism-Driven Roulette-Wheel Selection For Time-Series Forecasting
17 pages
Unit 2
No ratings yet
Unit 2
35 pages
L6 - UCLxDeepMind DL2020 Document of Google
No ratings yet
L6 - UCLxDeepMind DL2020 Document of Google
141 pages
Deep Learning (MODULE-4) - RNN - NLP
No ratings yet
Deep Learning (MODULE-4) - RNN - NLP
52 pages
Recurrent Neural Networks Cheatsheet
No ratings yet
Recurrent Neural Networks Cheatsheet
44 pages
Dynamic Mixtures of Contextual Experts For Language Modeling
No ratings yet
Dynamic Mixtures of Contextual Experts For Language Modeling
11 pages
NLP Unit-4
No ratings yet
NLP Unit-4
62 pages
An Automated Online Proctoring System Using Attentive-Net To Assess Student Mischievous Behavior
No ratings yet
An Automated Online Proctoring System Using Attentive-Net To Assess Student Mischievous Behavior
30 pages
Word Embedding
No ratings yet
Word Embedding
9 pages
Character-Aware Neural Language Models
No ratings yet
Character-Aware Neural Language Models
9 pages
Language Models
No ratings yet
Language Models
11 pages
Tailoring An Interpretable Neural Language Model: Yike Zhang, Pengyuan Zhang, and Yonghong Yan
No ratings yet
Tailoring An Interpretable Neural Language Model: Yike Zhang, Pengyuan Zhang, and Yonghong Yan
15 pages
NLP
No ratings yet
NLP
12 pages
NLP Midsem Last Lesson
No ratings yet
NLP Midsem Last Lesson
53 pages
CS224n: Natural Language Processing With Deep Learning
No ratings yet
CS224n: Natural Language Processing With Deep Learning
14 pages
ReliefNet Fast Bas Relief Generation From 3D Scen - 2021 - Computer Aided Desig
No ratings yet
ReliefNet Fast Bas Relief Generation From 3D Scen - 2021 - Computer Aided Desig
14 pages
Probabilistic Language Modeling Challenges
No ratings yet
Probabilistic Language Modeling Challenges
12 pages
Skin Disease Detection and Remedial System
No ratings yet
Skin Disease Detection and Remedial System
7 pages
04 - RNNs
No ratings yet
04 - RNNs
37 pages
Artificial Intelligence (AI)
No ratings yet
Artificial Intelligence (AI)
11 pages
Pre Trained Models For NLP
No ratings yet
Pre Trained Models For NLP
15 pages
MemoryPrompt A Light Wrapper To Improve Context Tracking in
No ratings yet
MemoryPrompt A Light Wrapper To Improve Context Tracking in
9 pages
Relay Protection System of Transmission Line Based On AI - PLOS ONE
No ratings yet
Relay Protection System of Transmission Line Based On AI - PLOS ONE
13 pages
Anomaly Detection in Structural Health Monitoring With Ensemble Learning and Reinforcement Learning
No ratings yet
Anomaly Detection in Structural Health Monitoring With Ensemble Learning and Reinforcement Learning
16 pages
Research Article - Format
No ratings yet
Research Article - Format
7 pages
Spa 2018 8563389
No ratings yet
Spa 2018 8563389
6 pages
10 (3S) 4112-4118
No ratings yet
10 (3S) 4112-4118
7 pages
1709.07432 DYNAMIC EVALUATION OF NEURAL SEQUENCE MODELS Improve PTB
No ratings yet
1709.07432 DYNAMIC EVALUATION OF NEURAL SEQUENCE MODELS Improve PTB
10 pages
Predicting Students Academic Performance Using Artificial Neural Network: A Case Study of An Engineering Course
No ratings yet
Predicting Students Academic Performance Using Artificial Neural Network: A Case Study of An Engineering Course
9 pages
Ieee Format
No ratings yet
Ieee Format
8 pages
Document Context Language Models
No ratings yet
Document Context Language Models
10 pages
1508.06615 - PTB Character Aware Neural Language Models Yoon Kim
No ratings yet
1508.06615 - PTB Character Aware Neural Language Models Yoon Kim
9 pages
Tsai Et Al - 2020 - Learning Molecular Dynamics With Simple Language Model Built Upon Long
No ratings yet
Tsai Et Al - 2020 - Learning Molecular Dynamics With Simple Language Model Built Upon Long
11 pages
Convolutional Neural Networks For
No ratings yet
Convolutional Neural Networks For
9 pages
Proposal AASTU
No ratings yet
Proposal AASTU
9 pages
Cyber Security Using AI
No ratings yet
Cyber Security Using AI
8 pages
Domain-Adaptative Continual Learning For Low-Resource Tasks: Evaluation On Nepali
No ratings yet
Domain-Adaptative Continual Learning For Low-Resource Tasks: Evaluation On Nepali
10 pages
Revisiting Simple Neural Probabilistic Language Models (2021)
No ratings yet
Revisiting Simple Neural Probabilistic Language Models (2021)
8 pages
AI-Based Navigator For Hassle-Free Navigation For Visually Impaired
No ratings yet
AI-Based Navigator For Hassle-Free Navigation For Visually Impaired
6 pages
A Survey On Neural Network Language Models
No ratings yet
A Survey On Neural Network Language Models
7 pages
Deep Neural Network Language Models - W12-2703
No ratings yet
Deep Neural Network Language Models - W12-2703
9 pages
Neural Language Modelling - NLP
No ratings yet
Neural Language Modelling - NLP
30 pages
A Neural Probabilistic Language Model by Yoshua Bengio Ducharme and Vincent 2001
No ratings yet
A Neural Probabilistic Language Model by Yoshua Bengio Ducharme and Vincent 2001
7 pages
Unit 3-Notes AI
No ratings yet
Unit 3-Notes AI
36 pages
09 - Extensions of Recurrent Neural Network Language Model
No ratings yet
09 - Extensions of Recurrent Neural Network Language Model
4 pages
Mikolov10 Interspeech
No ratings yet
Mikolov10 Interspeech
4 pages
Programming with X10: Definitive Reference for Developers and Engineers
From Everand
Programming with X10: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Visual Word: Unlocking the Power of Image Understanding
From Everand
Visual Word: Unlocking the Power of Image Understanding
Fouad Sabry
No ratings yet
Semantic Network: Fundamentals and Applications
From Everand
Semantic Network: Fundamentals and Applications
Fouad Sabry
No ratings yet
Relationship Extraction: Fundamentals and Applications
From Everand
Relationship Extraction: Fundamentals and Applications
Fouad Sabry
No ratings yet