
Lecture notes on recurrent neural networks (RNNs)

1.1 Simple RNNs


Mathematical form
Recurrent neural networks (RNNs) are neural networks suited for processing sequential data which, if well trained, can model dependencies within a sequence of arbitrary length. Assume we want to build a neural network to process a data sequence x1:T = (x1, ..., xT), and assume a supervised learning task which aims to learn the mapping from inputs x1:T to outputs y1:T = (y1, ..., yT). Then a simple RNN computes the following mapping for t = 1, ..., T:

ht = ϕh (Wh ht−1 + Wx xt + bh ), (1)


yt = ϕy (Wy ht + by ). (2)

Here the network parameters are θ = {Wh, Wx, Wy, bh, by}, and ϕh and ϕy are the non-linear activation functions for the hidden state ht and the output yt, respectively. For t = 1 the convention is to set h0 = 0 so that h1 = ϕh(Wx x1 + bh); alternatively, h0 can be added to θ as a learnable parameter.
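
As a concrete illustration of eqs. (1)-(2), here is a minimal NumPy sketch of the forward pass; the function name, the shapes, and the choice of tanh/identity activations are illustrative assumptions rather than part of the notes.

import numpy as np

def simple_rnn_forward(x_seq, Wh, Wx, Wy, bh, by,
                       phi_h=np.tanh, phi_y=lambda a: a, h0=None):
    """Forward pass of the simple RNN in eqs. (1)-(2).

    x_seq has shape (T, d_in); Wh is (d_h, d_h), Wx is (d_h, d_in),
    Wy is (d_out, d_h). The tanh hidden activation and identity output
    activation are illustrative choices."""
    h = np.zeros(Wh.shape[0]) if h0 is None else h0   # convention: h_0 = 0
    hs, ys = [], []
    for x_t in x_seq:
        h = phi_h(Wh @ h + Wx @ x_t + bh)             # eq. (1)
        ys.append(phi_y(Wy @ h + by))                 # eq. (2)
        hs.append(h)
    return np.stack(hs), np.stack(ys)

# toy usage with random parameters (shapes are illustrative)
rng = np.random.default_rng(0)
T, d_in, d_h, d_out = 4, 3, 5, 2
x = rng.normal(size=(T, d_in))
hs, ys = simple_rnn_forward(x,
                            Wh=0.1 * rng.normal(size=(d_h, d_h)),
                            Wx=rng.normal(size=(d_h, d_in)),
                            Wy=rng.normal(size=(d_out, d_h)),
                            bh=np.zeros(d_h), by=np.zeros(d_out))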

Back-propagation through time (BPTT)


A loss function is required for training an RNN. Assume the loss function to be minimised takes the form

L(θ) = Σ_{t=1}^T L(yt).   (3)

This is a common form of loss function in many sequential modelling tasks such as video/audio sequence reconstruction. The derivative of the loss w.r.t. θ is dL(θ)/dθ = Σ_{t=1}^T dL(yt)/dθ, therefore it remains to compute dL(yt)/dθ for θ = {Wh, Wx, Wy, bh, by}.

• Derivative of L(yt ) w.r.t. Wy and by :

dL(yt)/dWy = dL(yt)/dyt · dyt/dWy,   dL(yt)/dby = dL(yt)/dyt · dyt/dby.

• Derivative of L(yt ) w.r.t. Wx and bh :

dL(yt)/dWx = dL(yt)/dyt · dyt/dht · dht/dWx,   dL(yt)/dbh = dL(yt)/dyt · dyt/dht · dht/dbh.

Here Wx and bh contribute to ht in two ways, through both direct and indirect contributions, where the latter is through ht−1. This means:

dht/dWx = ∂ht/∂Wx + dht/dht−1 · dht−1/dWx,   dht/dbh = ∂ht/∂bh + dht/dht−1 · dht−1/dbh.

Deriving these derivatives requires the chain rule, which will be explained next.

• Derivative of L(yt ) w.r.t. Wh : by chain rule, we have

dL(yt)/dWh = dL(yt)/dht · dht/dWh

Figure 1: Visualising Back-propagation through time (BPTT) without truncation. The black arrows
show forward pass computations, while the red arrows show the gradient back-propagation in order
to compute ∇Wh L(yt ).

where dL(yt)/dht = dL(yt)/dyt · dyt/dht. Importantly, here the entries in the Jacobian dht/dWh contain the total gradient of ht[i] w.r.t. Wh[m, n]. It remains to compute dht/dWh, and notice that ht depends on ht−1, which also depends on Wh:

dht/dWh = ∂ht/∂Wh + dht/dht−1 · dht−1/dWh.   (4)

Here the entries in ∂ht/∂Wh contain partial gradients only (obtained by treating ht−1 as a constant w.r.t. Wh, even though ht depends on ht−1). By expanding the dht−1/dWh term further, we have:

dht/dWh = ∂ht/∂Wh + dht/dht−1 · ∂ht−1/∂Wh + dht/dht−1 · dht−1/dht−2 · dht−2/dWh = ...
        = Σ_{τ=1}^t ( Π_{l=τ}^{t−1} dhl+1/dhl ) ∂hτ/∂Wh,   (5)

with the convention that Π_{l=t}^{t−1} dhl+1/dhl = 1 when τ = t. This means the chain rule of the gradients needs to be computed in reversed order from time t = T to time t = 1, hence the name back-propagation through time (BPTT). A visualisation of BPTT is provided in Figure 1. Truncation with length L might be applied to this back-propagation procedure, and with truncated BPTT the gradient is computed as

truncate[dht/dWh] = Σ_{τ=max(1,t−L)}^t ( Π_{l=τ}^{t−1} dhl+1/dhl ) ∂hτ/∂Wh.

A minimal numerical sketch of the recursion in eq. (4) is given after this list.
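
The sketch below illustrates the recursion in eq. (4) for a scalar-valued RNN ht = tanh(wh ht−1 + wx xt + bh), accumulating dht/dwh forward in time and checking it against a finite-difference estimate; the scalar setting and all names are illustrative assumptions, and the snippet implements the recursion itself rather than the reverse-order sweep.

import numpy as np

def hidden_and_grad(x_seq, w_h, w_x, b_h):
    """Scalar RNN h_t = tanh(w_h*h_{t-1} + w_x*x_t + b_h), with dh_t/dw_h
    accumulated via eq. (4):
    dh_t/dw_h = phi'(a_t)*h_{t-1} + phi'(a_t)*w_h * dh_{t-1}/dw_h."""
    h, dh_dwh = 0.0, 0.0
    for x_t in x_seq:
        h_new = np.tanh(w_h * h + w_x * x_t + b_h)
        dphi = 1.0 - h_new ** 2                      # tanh'(a_t)
        dh_dwh = dphi * h + dphi * w_h * dh_dwh      # eq. (4)
        h = h_new
    return h, dh_dwh

x_seq, w_h, w_x, b_h = [0.5, -1.0, 0.3], 0.8, 0.5, 0.1
h, g = hidden_and_grad(x_seq, w_h, w_x, b_h)

# finite-difference check of dh_T/dw_h
eps = 1e-6
h_plus, _ = hidden_and_grad(x_seq, w_h + eps, w_x, b_h)
print(g, (h_plus - h) / eps)   # the two values should roughly agree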

Gradient vanishing/explosion issues


Simple RNNs are often said to suffer from gradient vanishing or gradient explosion issues. To
understand this, notice that
dhl+1/dhl = ϕ′h(Wh hl + Wx xl+1 + bh) ⊙ Wh⊤.   (6)

Here ϕ′h(Wh hl + Wx xl+1 + bh) denotes a vector containing element-wise derivatives, and we overload the element-wise product operator for a vector a ∈ R^{d×1} and a matrix B ∈ R^{d×d} as the “broadcasting element-wise product” a ⊙ B := [a, ..., a] ⊙ B, where a is repeated d times. This means Π_{l=τ}^{t−1} dhl+1/dhl contains products of t − τ copies of Wh and of the derivatives ϕ′h(·) at time steps l = τ, ..., t − 1.

Figure 2: Visualising the gradient step with/without gradient clipping. Source: Goodfellow et al. [2016].

Now consider a simple case where ϕh(·) is an identity mapping so that ϕ′h(·) = 1. We further assume the hidden states are scalar-valued, i.e. dim(ht) = 1. Then we have Π_{l=τ}^{t−1} dhl+1/dhl = Wh^{t−τ}, which can vanish or explode when t − τ is large, depending on whether |Wh| < 1 or not. In the general case when Wh is a matrix, the spectral norm of Π_{l=τ}^{t−1} dhl+1/dhl = (Wh^{t−τ})⊤ will vanish or explode as t − τ increases, depending on whether the spectral radius of Wh (i.e. the maximum of the absolute values of its eigenvalues) is smaller or larger than 1. When ϕh(·) is the sigmoid or the hyperbolic tangent function, the gradient vanishing problem can still happen. Take the hyperbolic tangent function as an example: when entries of ht are close to ±1, then ϕ′h(·) ≈ 0, i.e. the derivative is saturated. Multiplying several such saturated derivatives together also leads to the gradient vanishing problem.
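
To make the scalar intuition concrete, the following toy snippet (an illustration, not from the notes) prints the factor Wh^{t−τ} for a contracting and an expanding recurrent weight:

# Toy illustration: powers of a scalar recurrent weight w_h, i.e. the factor
# prod_l dh_{l+1}/dh_l = w_h**(t - tau) when phi_h is the identity.
for w_h in (0.9, 1.1):
    print(w_h, [round(w_h ** k, 6) for k in (1, 10, 50, 100)])
# 0.9**100 is about 2.7e-5 (vanishing); 1.1**100 is about 1.4e4 (explosion).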

Some tricks to fix the gradient vanishing/explosion issues


There are a handful of empirical tricks to fix the gradient vanishing/explosion issues discussed above.
• Gradient clipping:
This trick is often used to prevent the gradient from exploding. With a fixed hyper-parameter γ, a gradient g is clipped whenever ||g|| > γ:

g ← (γ / ||g||) g.

This trick ensures the gradients used in optimisation have their norm bounded by a pre-defined hyper-parameter. It introduces bias into the gradient-based optimisation procedure, but in certain cases it can be beneficial. Figure 2 visualises such an example, where with gradient clipping the updates can stay in the valley of the loss function. A minimal sketch of this clipping rule is given after this list.
• Good initialisation of the recurrent weight matrix Wh:
The IRNN approach [Le et al., 2015] uses the ReLU activation for ϕh and initialises Wh = I, bh = 0. This makes ϕ′h(t) = δ(t > 0) and dhl+1/dhl = δ(Wx xl+1 > 0) at initialisation. While there is no guarantee of eliminating the gradient vanishing/explosion problem during the whole course of training, empirically RNNs with this trick have achieved performance competitive with LSTMs in a variety of tasks.
• Alternatively, one can construct the recurrent weight matrix Wh to be an orthogonal or unitary matrix. See e.g. Saxe et al. [2014]; Arjovsky et al. [2016] for examples.
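
A minimal sketch of the norm-clipping rule above (the function name and the flat gradient vector are illustrative assumptions):

import numpy as np

def clip_gradient(g, gamma):
    """Rescale g so that its norm never exceeds the threshold gamma."""
    norm = np.linalg.norm(g)
    if norm > gamma:
        g = (gamma / norm) * g
    return g

print(clip_gradient(np.array([3.0, 4.0]), gamma=1.0))   # [0.6, 0.8], norm 1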

1.2 Long Short-Term Memory (LSTM)


Mathematical form
The Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] was proposed with the motivation of addressing the gradient vanishing/explosion problem. It introduces memory cell states and gates to control the error flow. In detail, the computation goes as follows (with σ(·) the sigmoid function):

ft : forget gate            ft = σ(Wf · [ht−1, xt] + bf)        (7)
it : input gate             it = σ(Wi · [ht−1, xt] + bi)        (8)
ot : output gate            ot = σ(Wo · [ht−1, xt] + bo)        (9)
c̃t : candidate cell state   c̃t = tanh(Wc · [ht−1, xt] + bc)     (10)
ct : memory cell state      ct = ft ⊙ ct−1 + it ⊙ c̃t            (11)
ht : hidden state           ht = ot ⊙ tanh(ct)                  (12)
The parameters of an LSTM are therefore θ = {Wf, Wi, Wo, Wc, bf, bi, bo, bc}. Again by convention, h0 and c0 are either set to zero vectors or added to the learnable parameters. We note that the elements of ht are always bounded within (−1, 1), since ot ∈ (0, 1) and tanh(ct) ∈ (−1, 1); the cell state ct itself, however, is not confined to this range and can grow in magnitude over time. The output yt can be produced by a 1-layer neural network similar to the simple RNN case: yt = ϕy(Wy ht + by).
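
The following NumPy sketch implements one step of eqs. (7)-(12); the parameter container, the shapes, and the concatenation [ht−1, xt] layout are illustrative assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following eqs. (7)-(12); params is assumed to be a dict of
    matrices W_f, W_i, W_o, W_c acting on [h_{t-1}, x_t], plus biases."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate, eq. (7)
    i = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate, eq. (8)
    o = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate, eq. (9)
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate, eq. (10)
    c = f * c_prev + i * c_tilde                           # cell state, eq. (11)
    h = o * np.tanh(c)                                     # hidden state, eq. (12)
    return h, c

# toy usage with random parameters (shapes are illustrative)
rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = {}
for name in ("f", "i", "o", "c"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(d_h, d_h + d_in))
    params[f"b_{name}"] = np.zeros(d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(5):
    h, c = lstm_step(rng.normal(size=d_in), h, c, params)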

*Gradient computation
Readers are encouraged to derive for themselves the gradient of L(yt) with respect to θ. Specifically, for the recurrent weight matrix Wc, computing the derivative dL(yt)/dWc requires the following terms:

dL(yt)/dWc = dL(yt)/dht · dht/dWc   (13)
dht/dWc = ot ⊙ dtanh(ct)/dWc + tanh(ct) ⊙ dot/dWc   (14)
dot/dWc = dot/dht−1 · dht−1/dWc   (15)
dct/dWc = ft ⊙ dct−1/dWc + ct−1 ⊙ dft/dWc + it ⊙ dc̃t/dWc + c̃t ⊙ dit/dWc.   (16)
Since ht−1 also depends on ct−1 and ot−1, this means that
dft/dWc = dft/dht−1 · dht−1/dWc = dft/dht−1 · ( ot−1 ⊙ dtanh(ct−1)/dWc + tanh(ct−1) ⊙ dot−1/dWc )   (17)
dit/dWc = dit/dht−1 · dht−1/dWc = dit/dht−1 · ( ot−1 ⊙ dtanh(ct−1)/dWc + tanh(ct−1) ⊙ dot−1/dWc )   (18)
dc̃t/dWc = ∂c̃t/∂Wc + dc̃t/dht−1 · dht−1/dWc = ∂c̃t/∂Wc + dc̃t/dht−1 · ( ot−1 ⊙ dtanh(ct−1)/dWc + tanh(ct−1) ⊙ dot−1/dWc ).   (19)
Note again the difference between dc̃t/dWc and ∂c̃t/∂Wc. The former Jacobian dc̃t/dWc has as its entries the total gradient of c̃t[i] w.r.t. Wc[m, n], while the latter ∂c̃t/∂Wc has as its entries the partial gradient of c̃t[i] w.r.t. Wc[m, n] (obtained by treating ht−1 as a constant w.r.t. Wc, even though c̃t depends on ht−1 as well). Combining the derivations, we have (noticing that dtanh(ct−1)/dWc = dtanh(ct−1)/dct−1 · dct−1/dWc):
dot/dWc = dot/dht−1 · ( ot−1 ⊙ dtanh(ct−1)/dWc + tanh(ct−1) ⊙ dot−1/dWc )   (20)
dct/dWc = ( ft + ot−1 ⊙ dtanh(ct−1)/dct−1 ⊙ dct/dht−1 ) ⊙ dct−1/dWc + tanh(ct−1) ⊙ dct/dht−1 ⊙ dot−1/dWc + it ⊙ ∂c̃t/∂Wc,   (21)
where the bracketed term in eq. (21) equals dct/dct−1, and
dct/dht−1 = ct−1 ⊙ dft/dht−1 + c̃t ⊙ dit/dht−1 + it ⊙ dc̃t/dht−1.   (22)

This means that computing dct/dWc requires computing

Π_{l=τ}^{t−1} dcl+1/dcl = Π_{l=τ}^{t−1} [ fl+1 + ol ⊙ dtanh(cl)/dcl ⊙ dcl+1/dhl ]

for all τ = 1, ..., t. There is no guarantee that this term will not vanish or explode; however, the usage of forget gates makes the issue less severe. To see this, notice that by expanding the product term above, it contains terms proportional to fi+1 ⊙ Π_{l=τ}^i ol ⊙ dcl+1/dhl for i = τ+1, ..., t−1. So if in the forward pass the network sets fi+1 → 0 (i.e. forgetting the previous cell state), then this is also likely to bring fi+1 ⊙ Π_{l=τ}^i ol ⊙ dcl+1/dhl ≈ 0, which helps cope with the gradient explosion problem. On the other hand, dct/dWc also contains terms proportional to Π_{l=τ+1}^i fl ⊙ oτ ⊙ dcτ+1/dhτ for i = τ+1, ..., t−1. This means that if the network sets fl → 1 for l = τ+1, ..., i (i.e. maintaining the cell state until at least time t = i), then it is likely that Π_{l=τ+1}^i fl ⊙ oτ ⊙ dcτ+1/dhτ ≈ oτ ⊙ dcτ+1/dhτ, which helps prevent the gradient information at time τ from vanishing when oτ → 1, and is thus helpful for learning longer-term dependencies. The gradients dct/dWc and dot/dWc also require computing products of doi+1/dhi and dci+1/dhi terms, and an analogous analysis can be done for those product terms. It is worth emphasising again that the LSTM does NOT solve the gradient vanishing/explosion problem completely; however, empirical evidence has shown that it is easier for LSTMs to learn longer-term dependencies when compared with the simple RNN.

Gated Recurrent Unit (GRU): a simplified gated RNN


The Gated Recurrent Unit (GRU) [Cho et al., 2014] also improves on the simple RNN with a gating mechanism. Compared with the LSTM, the GRU removes the input/output gates and the cell state, but still maintains the forgetting mechanism in some form:

zt = σ(Wz · [ht−1 , xt ] + bz ) (23)


rt = σ(Wr · [ht−1 , xt ] + br ) (24)
h̃t = tanh(Wh · [rt ⊙ ht−1 , xt ] + bh ) (25)
ht = (1 − zt ) ⊙ ht−1 + zt ⊙ h̃t (26)

The network parameters are then θ = {Wz, Wr, Wh, bz, br, bh}. We see here that zt acts as the update gate, which determines how much of the current information is incorporated into the hidden state, and rt is named the reset gate, which also affects how much of the historical information is maintained.
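
A minimal sketch of one GRU step following eqs. (23)-(26); the shapes and the concatenation layout are illustrative assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step; each W acts on the concatenation of [h_{t-1}, x_t]."""
    z_in = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ z_in + bz)                                      # update gate, eq. (23)
    r = sigmoid(Wr @ z_in + br)                                      # reset gate, eq. (24)
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)   # candidate, eq. (25)
    return (1.0 - z) * h_prev + z * h_tilde                          # new hidden state, eq. (26)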

1.3 Sequence-to-sequence models


Given a dataset of input-output sequence pairs (x1:T, y1:L), the goal of sequence prediction is to build a model pθ(y1:L | x1:T) to fit the data. Note here that the x, y sequences might have different lengths, and the input/output lengths T and L can vary across input-output pairs. So to handle sequence outputs of arbitrary length, we define an auto-regressive model

pθ(y1:L | x1:T) = Π_{l=1}^L pθ(yl | y<l, v),   v = enc(x1:T).   (27)

Here pθ(yl | y<l, v) is defined by a sequence decoder, e.g. an LSTM:

pθ(yl | y<l, v) = pθ(yl | h^d_l, c^d_l),   h^d_l, c^d_l = LSTM_θ^dec(y<l),   (28)

and the decoder LSTM has its internal recurrent states h^d_0, c^d_0 initialised using the input sequence representation v = enc(x1:T). The encoder also uses an LSTM, meaning that

v = enc(x1:T) = NN_θ(h^e_T, c^e_T),   h^e_T, c^e_T = LSTM_θ^enc(x1:T).   (29)

Figure 3: Visualising the forward pass of the Seq2Seq model at training (left) and test (right) time.

Regarding the prediction for the first output y1, either pθ(y1 | x1:T) can be produced using the last recurrent states of the encoder LSTM (i.e. pθ(y1 | x1:T) = pθ(y1 | h^e_T, c^e_T)), or we can add a "start of sentence" token as y0 and compute the probability vector for y1 using the LSTM decoder: pθ(y1 | x1:T) = pθ(y1 | y0, v). This model is named the sequence-to-sequence (Seq2Seq) model, proposed by Sutskever et al. [2014].
The decoder LSTM forward pass differs between training and test time, as visualised in Figure 3. In training, maximum likelihood training requires evaluating pθ(yl | y<l, v) for the output sequences y1:L from the dataset; this means the inputs to the decoder LSTM are the words in the data label sequence, and the output of the LSTM is the probability vector for the current word, which is then used to compute the MLE objective (i.e. the negative cross-entropy). At test time, however, there is no ground-truth output sequence provided, therefore the input to the decoder LSTM at step l is the predicted word yl−1 ∼ pθ(yl−1 | y<l−1, v). In practice prediction is done by e.g. beam search rather than naive sequential sampling [Sutskever et al., 2014].
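
For concreteness, a minimal greedy-decoding loop at test time might look as follows; encode, decoder_step, and the token ids SOS/EOS are hypothetical placeholders (not from the notes), and a greedy argmax stands in for the beam search mentioned above.

def greedy_decode(x_tokens, encode, decoder_step, SOS=0, EOS=1, max_len=50):
    """Sketch of test-time decoding: feed back the previous prediction.

    encode(x_tokens) -> initial decoder state; decoder_step(state, token) ->
    (probability vector over the vocabulary, next state). Both are hypothetical."""
    state = encode(x_tokens)          # v = enc(x_{1:T}) initialises (h_0^d, c_0^d)
    y, token = [], SOS                # start-of-sentence token as y_0
    for _ in range(max_len):
        probs, state = decoder_step(state, token)
        token = int(probs.argmax())   # greedy choice instead of beam search
        if token == EOS:
            break
        y.append(token)
    return y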
In NLP applications such as machine translation, both x1:T and y1:L are sequences of words, which cannot be directly processed by neural networks. Instead each xt (yl) needs to be mapped to a real-valued vector before feeding it to the encoder (decoder) LSTM. A naive approach is to use one-hot encoding: assume that the input sentence is in English and we have a sorted English vocabulary of size V, then

xt → (0, ..., 0, 1, 0, ..., 0),   with the 1 in the kth position (preceded by k−1 zeros), if xt is the kth word in the vocabulary.

This is clearly inefficient since the English vocabulary has tens of thousands of words. Instead, it is recommended to map the words to their word embeddings using word2vec [Mikolov et al., 2013a] or GloVe [Pennington et al., 2014], which have much smaller dimensions. But more importantly, semantics are preserved to some extent in these word embeddings, e.g. it has been shown that vector arithmetic results like "emb(king) − emb(male) + emb(female) ≈ emb(queen)" hold for word2vec embeddings.
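
A toy comparison of the two representations (the vocabulary size, embedding dimension, and random embedding matrix are illustrative assumptions):

import numpy as np

V, d_emb = 10_000, 300
k = 42                                   # index of the word in the vocabulary

one_hot = np.zeros(V)
one_hot[k] = 1.0                         # V-dimensional, mostly zeros

E = np.random.default_rng(0).normal(size=(V, d_emb))  # embedding matrix
embedding = E[k]                         # same as one_hot @ E, but d_emb-dimensional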
In many NLP applications the decoder output is a probability vector pθ(yl | y<l, v) which specifies the predictive probability of each of the words in the vocabulary. When the vocabulary is large (which is often the case), applying the softmax to obtain the probability vector can be computationally expensive. Interested readers can check e.g. hierarchical softmax [Mikolov et al., 2013a] and negative sampling [Mikolov et al., 2013b] for solutions that mitigate this issue.

1.4 *Generative models for sequences
To generate sequential data such as video, text and audio, one needs to build a generative model
pθ (x1:T ) and train it with e.g. (approximate) maximum likelihood. In the following we discuss two
types of latent variable models that are often used in sequence generation tasks.

Sequence VAE with global latent variables


Similar to VAEs for image generation, one can define a latent variable model with a global latent
variable for sequence generation [Fabius and van Amersfoort, 2014; Bowman et al., 2016]:
pθ(x1:T) = ∫ pθ(x1:T | z) p(z) dz.   (30)

If the variational lower-bound is used for training, then this also requires an approximate posterior qϕ(z | x1:T) to be optimised:

ϕ*, θ* = arg max L(ϕ, θ),   L(ϕ, θ) = E_{pdata(x1:T)}[ E_{qϕ(z|x1:T)}[log pθ(x1:T | z)] − KL[qϕ(z|x1:T) || p(z)] ],   (31)

where the term inside the outer expectation is denoted L(x1:T, ϕ, θ).
Now it remains to define the encoder and decoder distributions such that they can process sequences of any length. This can be achieved using e.g. LSTMs to define an auto-regressive decoder:

pθ(x1:T | z) = Π_{t=1}^T pθ(xt | x<t, z),   pθ(x1 | x<1, z) = pθ(x1 | z),   (32)

with the distributional parameters of pθ(xt | x<t, z) defined by LSTM_θ(x<t), which has its recurrent states h0, c0 initialised using z. For the encoder, LSTMs can also be used to process the input:

qϕ(z | x1:T) = N(z; µϕ(x1:T), diag(σϕ²(x1:T))),   µϕ(x1:T), log σϕ(x1:T) = LSTM_ϕ(x1:T).   (33)
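
As a concrete illustration of eq. (33), the sketch below summarises a sequence with a recurrent encoder, maps the final hidden state to µϕ and log σϕ, and draws a reparameterised sample of z; the plain tanh recurrence (standing in for the LSTM of the notes) and all parameter names are simplifying assumptions.

import numpy as np

rng = np.random.default_rng(0)

def encode_sequence(x_seq, Wh, Wx, bh, W_mu, W_logsig, b_mu, b_logsig):
    """Run a recurrence over x_{1:T}, then map the final hidden state to the
    Gaussian parameters of q_phi(z | x_{1:T}) as in eq. (33)."""
    h = np.zeros(Wh.shape[0])
    for x_t in x_seq:
        h = np.tanh(Wh @ h + Wx @ x_t + bh)        # stands in for the LSTM encoder
    mu = W_mu @ h + b_mu
    log_sigma = W_logsig @ h + b_logsig
    # reparameterisation: z = mu + sigma * eps, eps ~ N(0, I)
    z = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)
    return mu, log_sigma, z

# toy usage with illustrative shapes
d_in, d_h, d_z, T = 3, 8, 2, 6
x_seq = rng.normal(size=(T, d_in))
mu, log_sigma, z = encode_sequence(
    x_seq,
    Wh=0.1 * rng.normal(size=(d_h, d_h)), Wx=rng.normal(size=(d_h, d_in)), bh=np.zeros(d_h),
    W_mu=rng.normal(size=(d_z, d_h)), W_logsig=rng.normal(size=(d_z, d_h)),
    b_mu=np.zeros(d_z), b_logsig=np.zeros(d_z))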

State-space models
State-space models assume that for every observation xt at time t, there is a latent variable zt
that generates it, and the sequence dynamic model is defined in the latent space rather than in the
observation space. In detail, a prior dynamic model is assumed on the transitions of the latent states
zt , often in an auto-regressive way:
pθ(z1:T) = Π_{t=1}^T pθ(zt | z<t),   pθ(z1 | z<1) = pθ(z1).   (34)

The observation xt at time t is assumed to be conditionally dependent on zt only, and this conditional
distribution is also called the emission model:
pθ (xt |z1:T ) = pθ (xt |zt ). (35)
Combining both definitions, we have the sequence generative model defined as
pθ(x1:T) = ∫ Π_{t=1}^T pθ(xt | zt) pθ(zt | z<t) dz1:T.   (36)

A variational lower-bound objective for training this state-space model requires an approximate posterior distribution qϕ(z1:T | x1:T):

L(ϕ, θ) = E_{pdata(x1:T)}[ E_{qϕ(z1:T|x1:T)}[log pθ(x1:T | z1:T)] − KL[qϕ(z1:T|x1:T) || pθ(z1:T)] ]
        = E_{pdata(x1:T)}[ E_{qϕ(z1:T|x1:T)}[ Σ_{t=1}^T log pθ(xt | zt) ] − KL[qϕ(z1:T|x1:T) || pθ(z1:T)] ].   (37)

Different from the image generation case, the prior distribution pθ(z1:T) now also has learnable parameters in θ, so in this case it is less appropriate to view the KL term as a "regulariser". The expanded expression for the variational lower-bound depends on the definition of the encoder distribution qϕ(z1:T | x1:T). The simplest solution is to use a factorised approximate posterior

qϕ(z1:T | x1:T) = Π_{t=1}^T qϕ(zt | x≤t),   (38)

and the variational lower-bound becomes

L(ϕ, θ) = E_{pdata(x1:T)}[ Σ_{t=1}^T E_{q(z<t|x<t)}[ E_{qϕ(zt|x≤t)}[log pθ(xt | zt)] − KL[qϕ(zt|x≤t) || pθ(zt | z<t)] ] ],   (39)

where the inner term (inside E_{q(z<t|x<t)}) is denoted L(xt, ϕ, θ, z<t). We see that the term L(xt, ϕ, θ, z<t) resembles the VAE objective in the image generation case, except that the prior distribution pθ(zt | z<t) is conditioned on the previous latent states z<t rather than being a standard Gaussian, and the q distribution takes x≤t as input rather than a single frame xt.
Neural networks can be used to construct the conditional distributions in the following way. The distributional parameters (e.g. mean and variance) of the emission model pθ(xt | zt) can be defined by a neural network transformation of zt, similar to deep generative models for images. The prior dynamic model pθ(zt | z<t) can be defined as (e.g. with an LSTM)

pθ(zt | z<t) = pθ(zt | h^p_t, c^p_t),   h^p_t, c^p_t = LSTM_θ(z<t).   (40)

This means the previous latent states z<t are summarised by the LSTM internal recurrent states h^p_t and c^p_t, which are then transformed into the distributional parameters of pθ(zt | z<t). The factorised encoder distribution can also be defined using an LSTM:

qϕ(zt | x≤t) = qϕ(zt | h^q_t, c^q_t),   h^q_t, c^q_t = LSTM_ϕ(x≤t).   (41)
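
To illustrate the generative direction of eqs. (34)-(36), the sketch below ancestrally samples a sequence from a toy state-space model with a linear-Gaussian transition and emission; the first-order linear transition (in place of the LSTM summary of z<t) and all parameter values are simplifying assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
d_z, d_x, T = 2, 3, 5

# Toy parameters: first-order linear-Gaussian transition and emission.
A = 0.9 * np.eye(d_z)            # transition matrix for p(z_t | z_{t-1})
C = rng.normal(size=(d_x, d_z))  # emission matrix for p(x_t | z_t)

z = np.zeros(d_z)                # first step then draws z_1 around zero, i.e. p(z_1)
xs = []
for t in range(T):
    z = A @ z + rng.normal(scale=0.1, size=d_z)   # z_t ~ p(z_t | z_{t-1})
    x = C @ z + rng.normal(scale=0.1, size=d_x)   # x_t ~ p(x_t | z_t), the emission model
    xs.append(x)
x_seq = np.stack(xs)             # a sampled sequence x_{1:T}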

References
Arjovsky, M., Shah, A., and Bengio, Y. (2016). Unitary evolution recurrent neural networks. In
International Conference on Machine Learning, pages 1120–1128.
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. (2016). Generating
sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computa-
tional Natural Language Learning, pages 10–21. Association for Computational Linguistics.
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio,
Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine
translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics.

Fabius, O. and van Amersfoort, J. R. (2014). Variational recurrent auto-encoders. arXiv preprint
arXiv:1412.6581.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–
1780.

Le, Q. V., Jaitly, N., and Hinton, G. E. (2015). A simple way to initialize recurrent networks of
rectified linear units. arXiv preprint arXiv:1504.00941.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representa-
tions in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representa-
tions of words and phrases and their compositionality. Advances in neural information processing
systems, 26:3111–3119.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representa-
tion. In Proceedings of the 2014 conference on empirical methods in natural language processing
(EMNLP), pages 1532–1543.
Saxe, A. M., McClelland, J. L., and Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of
learning in deep linear neural networks. In International Conference on Learning Representations.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks.
Advances in neural information processing systems, 27:3104–3112.
