Context based Text-generation using LSTM networks
Sivasurya Santhanam
Institute for Software Technology
German Aerospace Center (DLR)
[email protected]
May 4, 2020
Abstract
Long short-term memory (LSTM) units in sequence-based models are used in translation, question-answering systems and classification tasks due to their capability of learning long-term dependencies. In natural language generation, LSTM networks provide impressive results on text generation models by learning language models with grammatically stable syntax. The downside is that the network does not learn about the context: it only learns the input-output function and generates text for a given set of input words irrespective of pragmatics. As the model is trained without any such context, there is no semantic consistency among the generated sentences.
The proposed model is trained to generate text for a given set of input words along with a context vector. A context vector is similar to a paragraph vector that grasps the semantic meaning (context) of the sentence. Several methods of extracting the context vectors are proposed in this work. While training the language model, the context vectors are trained along with the input-output sequences. Due to this structure, the model learns the relation among the input words, the context vector and the target word. Given a set of context terms, a well-trained model generates text around the provided context.
Based on the way the context vectors are computed, the model has been tried out in two variations (word importance and word clustering). In the word clustering method, suitable embeddings from different domains are also explored. The results are evaluated based on the semantic closeness of the generated text to the given context.
Keywords Natural language generation · LSTM networks · Sequence models · language models
1 Introduction
In the past few years, natural language processing (NLP) has seen rapid developments, mainly attributed to learning-based algorithms and advancements in computational power. As the availability of computational resources increased, active research and development happened in the architecture of deep learning algorithms. The use of feed-forward neural networks in many applications has yielded impressive results, yet they are not considered optimal for language models. This is because a feed-forward network tries to optimize the input-output function without any concern for the order of the inputs. Considering the type of data used in natural language processing (both speech and text), the sequence of encoding is important, similar to time-series data. In the case of text processing, a significant amount of information is embedded in the word order. Thus, the usage of feed-forward networks results in loss of information. In order to capture time and sequence, recurrent neural networks are applied in the field of natural language processing.
A recurrent neural network (RNN) is a type of sequence-based neural network which has memory units to remember past information. In feed-forward networks, only the output of the previous layer is fed into the next layer. In RNNs, the output of each cell (neuron) is given back to the same cell as input, hence the name recurrent network. Due to this recurrent nature, the positional information of words in sequences is encoded as delays in the memory unit, which helps the recurrent network keep track of word order. The current state of the RNN cell is calculated from the input to the current cell and the output of the previous state. Thus, in addition to the output from the previous layer, the output of the current layer at the previous time step is also fed as input to the current layer. To demystify the time-based delays in the state representation, the RNN is usually unfolded through time and viewed as a feed-forward network. Figure 1a shows the architecture of a vanilla RNN in the state representation and Figure 1b shows the same RNN unfolded over time. "Vanilla" is a common name for a generic model without any special variations. Here "x_t" represents the input, "h_t" the output, and the "A" block represents the hidden layer containing a single activation function. For example, the sentence "The quick brown fox jumps" is encoded as "The (x_0), quick (x_1), brown (x_2), fox (x_3), jumps (x_4)". Similar to a feed-forward network, the model learns via backpropagation. Since the inputs are given as sequences, backpropagation also occurs with respect to the time factor and is termed backpropagation through time (BPTT). RNNs work well for short sequences, but for longer sequences the gradients cannot be propagated through many cells. This is referred to as the vanishing gradient problem [1].
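As a brief aside (standard BPTT analysis, not specific to this paper), the vanishing gradient can be seen in the product of Jacobians that BPTT accumulates for a vanilla RNN with hidden state $h_t = \tanh(W_h h_{t-1} + W_x x_t)$:

$$
\frac{\partial \mathcal{L}_T}{\partial h_k} \;=\; \frac{\partial \mathcal{L}_T}{\partial h_T} \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\qquad
\frac{\partial h_t}{\partial h_{t-1}} \;=\; \operatorname{diag}\!\left(1 - h_t^{2}\right) W_h .
$$

When the norm of each Jacobian factor is below one, the product shrinks exponentially with the distance T − k, so gradient signals from distant time steps effectively vanish.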
One way of looking at this vanishing gradient problem is that the network is trying to propagate all the information through long sequences, and with longer sequences the gradients diminish during BPTT. To remedy this issue, the long short-term memory network (LSTM) [2] was introduced. LSTMs manage to retain long-term information by propagating back only the relevant information. LSTMs are a variant of RNNs where, instead of a single activation function in the hidden layer, there are multiple gates: the input gate, the forget gate, the output gate and the cell state. At each time step, the gates determine which of the past information is relevant to retain and which has to be forgotten by throwing it away. The input gate controls the current cell information, the forget gate regulates how much information from the previous state has to be forgotten, and the output gate filters the information to be passed on to the next layer. Due to this structure, LSTMs can preserve history over long sentences and have been used extensively in NLP applications from question-answering systems to machine translation.
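For reference, the commonly used LSTM formulation with input, forget and output gates as described above is (x_t is the input, h_{t-1} the previous hidden state, σ the logistic sigmoid, ⊙ the element-wise product):

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(output / hidden state)}
\end{aligned}
$$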
For instance, consider two language models (LM), one trained as a vanilla RNN and one as an LSTM, and both tested on the sentence "I have participated in lots of chess competitions since my childhood and that made me really good at playing ___". Since the vanilla RNN model only retains short-term memory, it only considers the last phrase "really good at playing" and would predict words such as "piano, football, cards", depending on the probability of occurrence and the dataset it was trained on. The LSTM model, by contrast, preserves the long-term information about chess competitions and would predict the word "chess". It is also possible that the RNN predicts "chess", but its confidence score will be lower compared to an LSTM model.
Based upon the length of the input and output sequences used to train the RNN/LSTM model, the configuration may be one-to-many, many-to-many or many-to-one. Machine translation is a classical example of the many-to-many configuration, where the input is a sequence in the source language and the output is a sequence in the target language. Sentiment analysis and intent classification come under many-to-one, where there is an input sequence and an output class to predict. The many-to-many configuration is commonly called a sequence-to-sequence (Seq2Seq) model.
2 Related work
Several variations of LSTM models have been proposed by making changes to the internal gates. The gated recurrent unit (GRU) introduced by Cho et al. [3] is a simpler model of the LSTM with just a reset gate and an update gate. Tai et al. [4] attempted to use a tree-based topology over syntactic dependencies, rather than the linear structure of an LSTM network.
Feeding the input sequences in the order of words lets the model learn from past history. Schuster et al. [5] proposed the bidirectional RNN, where the input sequences are trained not only from the first word to the last, but also in the reversed order. This works effectively for machine translation tasks [6], as for some target languages the grammatical structure differs from the source language, and training simultaneously in both directions lets the model also account for future input words.
Sutskever et al. [7] applied RNNs to language modeling based on character-level encoding. Graves [8] demonstrated how LSTMs are used in text prediction (language modeling) and applied the same to handwriting prediction tasks. Sutskever et al. [6] proposed the general end-to-end approach of applying multi-layered LSTMs to language translation.
3 Problem statement
It is evident from previous works that LSTMs are widely applied in NLP tasks. From machine translation to question-answering systems, the configuration of the network is quite similar, i.e., sequence-to-sequence models. These encoder-decoder Seq2Seq networks are good at capturing intrinsic features by mapping input to output sequences. However, building a good language model which is able to capture semantic context requires more than just an input-output mapping function. For instance, in a text generation model, the network only learns the function of mapping a sequence of words to the successive word. Without the necessary contextual information, the network just produces grammatically stable but pragmatically nonsensical sentences. Thus, as the network goes on generating sentences, the output has neither coherence nor a specific context.
Contexts provide additional semantic information about sentences. From a linguistic point of view, Dash [9] discussed how contextual information influences the meaning of words. Gale et al. [10] used context-based methods for word sense disambiguation. Maynard et al. [11] exploited contexts in terminological acquaintance by clustering them.
This work is based on the hypothesis that training neural network models along with relevant contextual information helps them learn the semantic association of sentences to the context. This also helps to avoid lexical ambiguity. Based on this hypothesis, I propose a method of including contextual information when training a language model using LSTM networks. With such contextual information, the model not only learns the input-output function, but also the relation of the context words to it. Such a language model can be applied to a context-based text generation task, where the network generates sentences based on a given set of context words.
For a text generation problem, the performance of the model is usually evaluated as a trade-off between not duplicating the training text and generating new sentences which are syntactically strong; perplexity is a good measure for this. The proposed work instead focuses on the degree of semantic closeness between the provided contexts and the generated text, not on the quality of the generated text itself. Various methods of extracting contextual information from the text are proposed, and each method is evaluated based on the semantic closeness of the generated sentences to the given context using a cosine similarity measure. The proposed approach of training a language model along with context is agnostic to the architecture of the LM. Thus the performance of the LSTM network itself does not influence the proposed model and is not evaluated in this work.
training examples in the dataset and n represents the length of the input in terms of words. Depending upon the number of network layers and the dataset size, the training process may require heavy computational power. For applications like text generation, datasets which have the form of a story are suitable, as the generated text will then carry specific domain knowledge.
The proposed model of context-based text generation adopts the same strategy, with a variation in the configuration to include the context along with the input-output instances. As shown in Figure 2, context extraction methods are applied to extract context vectors, which are appended to the input words to train the language model.
I have proposed two different methods to extract the context vectors; both are discussed in Section 5. Irrespective of the method applied to obtain the context vectors, these vectors are of the same dimension as the input and output vectors. The proposed system is based on the hypothesis that the context vector of a sequence preserves the semantic information of the sequence. For instance, the context vector for the sentence "The quick brown fox jumps over the lazy dog" contains words such as "animals, forest, action". For every input-output instance, a context vector is computed to train the context-based text generation model. A context could be just a keyword, a set of words or a topic, anything that represents the semantic content of the sentence. Context vectors are similar to paragraph vectors [12], but they do not reflect the complete semantics. The use of topic modeling techniques such as latent Dirichlet allocation (LDA) or latent semantic analysis (LSA) is unsuitable for this application, as topic modeling is based on the distribution of topics over documents, whereas for text generation models a domain-based dataset lacks the diversity of topics.
Of all the various configurations, text generation uses the many-to-one structure, in which the model is trained against multiple input words and a single target word. It is worth noting that tokenizing the dataset into sentences is deliberately avoided, because features such as sentence completion, sentence openings and the semantic flow from one sentence to the next are preserved when the model is trained on continuous words of a paragraph. Thus the input sequences extracted from the dataset have uniform length but are not necessarily complete sentences. Context vectors, on the other hand, are calculated per valid sentence. While generating the training dataset, the inputs are the words i, ..., i+n of a paragraph with the output word at i+n+1. The sentence that contributes the maximum number of words to an input sequence determines the context vector for that input instance. The context vectors are simply appended to the inputs when training the model, as sketched below.
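The following is a minimal sketch of how such training instances could be assembled. It is an illustration under assumptions, not the paper's actual implementation: `context_vector_for_sentence` stands in for whichever extraction method of Section 5 is used, and `sentence_spans` is assumed to hold the word-index ranges of the valid sentences.

```python
def build_training_instances(words, sentence_spans, context_vector_for_sentence, n=8):
    """Pair each sliding window of n words (plus its target word) with the context
    vector of the sentence contributing the most words to that window."""
    instances = []
    for i in range(len(words) - n):
        window = words[i:i + n]                       # n consecutive input words
        target = words[i + n]                         # the word to be predicted
        # count how many words of the window fall inside each sentence span
        overlaps = {}
        for start, end in sentence_spans:             # [start, end) word indices
            overlap = max(0, min(end, i + n) - max(start, i))
            if overlap > 0:
                overlaps[(start, end)] = overlap
        if not overlaps:                              # window outside any valid sentence
            continue
        best_span = max(overlaps, key=overlaps.get)   # sentence with maximum contribution
        context = context_vector_for_sentence(best_span)
        instances.append((context, window, target))
    return instances
```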
5 Methodology
A section of The Lord of the Rings dataset (https://github.com/wess/iotr) is used in this work. For data pre-processing, punctuation is removed and only sentences with a minimum length of 7 words are used for training. The total number of training sentences is also limited to 100,000 in order to keep the memory requirements under control. As explained in Section 4, the input-output training instances are generated from the dataset. All words are vectorized using one-hot encoding; thus each word is represented by a vector of dimension V, where V is the size of the vocabulary. The length of the input sequence, excluding the context vector, is set to 8. The context vector of a training sample is also a V-dimensional vector, and its encoding depends upon the method of extraction. As discussed, it is a many-to-one configuration in which the input contains the context vector appended to the input words and the output is the succeeding word in the sequence.
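As a rough illustration of this encoding (a sketch assuming a simple word-to-index vocabulary, not the paper's exact code), an instance could be vectorized as follows:

```python
import numpy as np

def one_hot(index, vocab_size):
    """V-dimensional one-hot vector for a single word index."""
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[index] = 1.0
    return vec

def encode_instance(context_indices, input_indices, target_index, vocab_size):
    """Encode one training instance: a V-dimensional context vector (here a
    bag-of-words over the context terms) prepended to the 8 one-hot input words,
    with the target word as a one-hot output vector."""
    context_vec = np.zeros(vocab_size, dtype=np.float32)
    context_vec[list(context_indices)] = 1.0          # bag-of-words context encoding
    x = np.stack([context_vec] + [one_hot(i, vocab_size) for i in input_indices])
    y = one_hot(target_index, vocab_size)
    return x, y                                       # x: (9, V), y: (V,)
```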
A sequential LSTM model is applied with the following configuration: bidirectional LSTM, 256 hidden units, 'relu' activation for the inner layers and 'softmax' activation for the output layer. The Keras [13] library is used to train the model. Since the language model is a prediction problem in which the model has to predict a one-hot encoded target word, 'softmax' is suitable for determining the word with the maximum probability. The implementation is done in Python with the help of libraries including, but not limited to, scikit-learn, NumPy and gensim.
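A minimal Keras sketch matching this description (a plausible reconstruction rather than the author's exact code; the loss, optimizer and input layout are assumptions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense

def build_model(vocab_size, seq_len=9):               # 1 context vector + 8 input words
    model = Sequential([
        Bidirectional(LSTM(256, activation='relu'),   # 256 hidden units, relu inner activation
                      input_shape=(seq_len, vocab_size)),
        Dense(vocab_size, activation='softmax'),      # probability distribution over the vocabulary
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model
```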
Once the model is trained along with the context vectors, the prediction phase requires two sets of inputs from the user. The first input is the sequence of seed words, i.e., the opening words given to the network to initiate the text generation process. The network expects 8 seed words, as it was trained with an input sequence of length 8. The second input is the set of context words; any number of context words can be provided, and for C context words the network generates C sentences. During the prediction phase, the network sets up the input from the first context word along with the initial seed words and predicts the subsequent word. Once the word is predicted, the first word of the seed sequence is removed and the predicted word is appended, thus maintaining the sequence length of 8. With the new input sequence and the same context word, the next word is predicted. This process continues until the sentence is complete, where sentence completion is identified by a full stop. Once a sentence is completed, the next context word is used to generate the following sentence by the same procedure, but instead of the seed words, the words generated for the previous sentence are used, as in the sketch below.
The two methods of extracting context vectors are described as follows:
The first method, word importance, is based on the hypothesis that the most important word in a sentence represents the context of the sentence. Term frequency-inverse document frequency (tf-idf) is a classical measure for determining the importance of a word in a document or sentence. Based on the frequency of a word within the sentence and within the entire document, its importance (tf-idf value) is calculated. From the tf-idf values of all words in a sentence, the word with the highest importance is picked out and chosen as the context word. In this method, the context is therefore a keyword taken from the sentence itself. This process is executed for all the training sentences and the context words are extracted. Since the context is simply a word from the vocabulary, the context vector is the one-hot representation of that word. The language model is trained along with these context vectors and tested on some random context words. A short sketch of this extraction step is given next, followed by the text generated with this method.
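A minimal sketch of the extraction step using scikit-learn, treating each sentence as a tf-idf document (an illustration of the idea, not the paper's code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_context_words(sentences):
    """Pick the highest-scoring tf-idf word of each sentence as its context word."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(sentences)        # rows: sentences, columns: vocabulary
    vocab = vectorizer.get_feature_names_out()
    context_words = []
    for row in tfidf:                                  # each row is a sparse tf-idf vector
        scores = row.toarray().ravel()
        context_words.append(vocab[scores.argmax()])   # most important word of the sentence
    return context_words
```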
Context provided: “Hobbits – gollum – adventure – king – ring – war – friends – war – book – home”
Generated text: (Hobbits) Hobbits lived in the woods and once an elf came all his There was Tom
to go on by this If he meant all the game but there was a bit of Moria of and Bilbo was running a black
eyebrows at the Mountain pretty of and and as you played up about and he said up about the hall in all this
They will not come down in it. (gollum) He had been going up a bit and crept in into the side of a magic
beard and into the prisoners of our If you have get him or I will get my dear sir and Bilbo felt he must have
to find this adventures but you are cook the question they all are yes. (adventure) Dont me on to Bilbo
my sudden understanding a pity mixed with various noticing down and their parlour as well as he could.
(king) No trouble He was in a boat and a breakfast and their sun had been and one of his second wealth
as Gandalf dared their ponies behind they had just had a fair breakfast as they had been gone further he
had been adventures many He had and been very He had come a ring of their wild Well all this That
From the result, we can conclude that the generated text does not have a good syntactic structure; but, as stated earlier, this work focuses on the semantic relation between the context words and the generated words, not on the performance of the text prediction model. More importantly, the generated sentences did not reflect the provided context keywords as expected. The proposed hypothesis did not hold: the word with the highest tf-idf score was not a good representation of the context of the sentence, and too much information was lost by considering only a single word.
The second method, word clustering, does not pick a single most important word as context; instead it derives a context vector from all the words in a sentence. It is based on the hypothesis that a context vector in a high-dimensional word vector space captures the semantic content of the sentence. Word embeddings are vectorized representations of words encoded with certain semantic information; Word2Vec [14], GloVe [15] and FastText [16] are some of the commonly used word embeddings. I have chosen Word2Vec as the embedding model for the word clustering method. Word embeddings preserve two interesting properties, semantic similarity and semantic relationships, and the word clustering method exploits both of them. The method hypothesizes that the context of a sentence is determined by the context vector in the word vector space whose semantic similarity to the accumulated sentence vector is highest. Given a list of sentences S and the cluster centers C of the word vector space, the hypothesis is expressed as in equation (1):
∀s ∈ S, ∃c ∈ C s.t. cos(w_1 + w_2 + ... + w_n, c) is minimum        (1)

where w_1, ..., w_n are the word vectors of the words in sentence s and cos(·, ·) denotes the cosine distance.
Figure 3: Word-embedding clusters (n=100) visualized with t-SNE
The accumulated sentence vector is computed as the summation of all the word vectors in the sentence. By the property of semantic relationships, this sum should preserve the semantic content of all the words in it. By the property of semantic similarity, the vector closest to the accumulated sentence vector can be obtained from a word vector space. The word vector space is a high-dimensional space containing the entire vocabulary of a dataset represented as word embeddings, in which words with similar semantic meaning lie close to each other. Such a word vector space is generated by training a self-supervised word2vec model over a dataset; these word embeddings (200-dimensional vectors) constitute the word vector space.
While the language model itself is trained on The Lord of the Rings (LOR) dataset, the word vector space can be constructed from any dataset. I have constructed word vector spaces from two sources. The first one, the "domain vector space", is generated from the same LOR dataset and thus has an enclosed domain of vocabulary pertaining to the story. The second one, the "wiki vector space", is built from various Wikipedia sources (http://wortschatz.uni-leipzig.de) [17], which cover a comprehensive vocabulary. Experimenting with both vector spaces helps to identify which one best captures the context of a sentence. As a quick comparison, the context for a sentence was extracted from both vector spaces. For the sentence "It was often said in other families that long ago one of the Took ancestors must have taken a fairy wife", the domain vector space yielded the context words "took, drew, gave, taking, unimpeachable", whereas the wiki vector space yielded "father, mother, son, daughter, grandmother". It can be seen that the domain space produced localized words around the action, whereas the wiki space captured subjects and nouns in a broader context.
Once the word vector spaces are built, the word vectors in the high-dimensional space are clustered into multiple clusters (n=100) using k-means. The clustering groups semantically similar words together, so that sentences with closer meanings obtain the same context vector. For visualizing the clusters, the 200-dimensional vectors of the wiki space are reduced to 60 dimensions using singular value decomposition (SVD) and then t-distributed stochastic neighbor embedding (t-SNE) [18] is applied to bring them down to 2 dimensions, as shown in Figure 3 and sketched below. Visualizing 100 different colors naturally looks blurry, but the various clusters can still be observed.
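A sketch of this visualization step with scikit-learn (illustrative; the t-SNE settings beyond the target dimensionality are assumptions):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

def project_embeddings(embeddings_200d):
    """Reduce 200-d word vectors to 60-d with SVD, then to 2-d with t-SNE."""
    reduced = TruncatedSVD(n_components=60).fit_transform(embeddings_200d)
    return TSNE(n_components=2, perplexity=30, init='random').fit_transform(reduced)

# points = project_embeddings(np.asarray(word_vectors))
# points[:, 0] and points[:, 1] can then be scattered, colored by k-means cluster id.
```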
After clustering the vocabulary, the context vector for a sentence is obtained from the cluster center having the minimum cosine distance to the accumulated sentence vector. Thus, for each training instance, the corresponding cluster center is calculated. From that cluster, the words it contains are selected as the context words and the context vector is represented with a bag-of-words (BOW) encoding. Using all the words of a cluster for the context vector would dilute the significance of the context, so only the top n words (n=5) closest to the cluster center are chosen to represent the context vector, as in the sketch below.
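A sketch of this clustering-based context extraction using gensim and scikit-learn (an illustrative reconstruction; hyperparameters other than the stated embedding size, cluster count and top-n are assumptions):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

def build_context_extractor(tokenized_sentences, dim=200, n_clusters=100, top_n=5):
    """Train word2vec embeddings, cluster them with k-means, and return a function
    mapping a tokenized sentence to its top-n context words."""
    w2v = Word2Vec(tokenized_sentences, vector_size=dim, min_count=1)
    words = list(w2v.wv.index_to_key)
    vectors = np.stack([w2v.wv[w] for w in words])
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(vectors)
    centers = kmeans.cluster_centers_

    def context_words(sentence_tokens):
        # accumulated sentence vector = sum of the sentence's word vectors
        sent_vec = np.sum([w2v.wv[w] for w in sentence_tokens if w in w2v.wv], axis=0)
        # cluster center with minimum cosine distance to the sentence vector
        center = centers[cosine_distances(sent_vec[None, :], centers).argmin()]
        # the top-n vocabulary words closest to that center represent the context
        order = cosine_distances(center[None, :], vectors).ravel().argsort()[:top_n]
        return [words[i] for i in order]

    return context_words
```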
With the training instances prepared from both the domain and wiki spaces, I trained the context-based text generation model with the same configuration as described before. The model trained with the wiki vector space was tested and the results are shown below.
Context provided: “gandalf – ring – war – friends – snake – book – home – king – hobbit”
Generated text: (gandalf) characters hobbits lived in the woods and an elf said that please try down
they all that been had only been so they had only been much in the meantime in the deal of the and and
they had brought the dwarves of the dwarves but they had brought him from the road and the goblins went
out and again that they could hear him on their chief and the great Master of the town that of the great
wooden spur which they had had been as the Great Goblin and his little sword that had was in a way
but he had picked up in the dark. (ring) Some ago it was so all that so fitted off he found he more one all
his second staff on the floor and the laughter of the others of the door and the laughter of the line of the
dwarves and he had never a shocking of the below. (war) end of their way had would be and the most
of the dwarves were built with guards. (friends) It was like a pull for the jug. (snake) smash and but the
various birds that is the more and horses now he never a cave and was not to be done in the obscurest
words. (book) Bilbo I have a little time for the terrible hours and in fact and besides of the town were me
that the track had come from a good way that was too I dont know but I am your last Burglar said a share
of the rate. (home) The heart of the hobbit and made of yellow wood. (king) Bilbo was more and a long
deal of the dwarves they had been busy in the direction of the Mountain. (hobbit) The dwarves that was in
a story of the chief of the dwarves and they were all alone and that they were all alone and whether they
had brought of them the dwarves were not in the valley and they were forced to pull the frightful Eagle the
Great Mountain and coming
From the results, the semantic content of some of the generated sentences is in line with the provided context word. For instance, the context "war" generated phrases like "built with guards", and "snake" produced words around the forest topic such as "birds, horses, cave". The context "home" was quite interesting and sounded like an idiom: "The heart of the hobbit and made of yellow wood". It was not successful for all the contexts; for the context "friends", the generated text was poor, even grammatically. The following section discusses the evaluation of these results in detail.
6 Results
Before comparing the results of the methods, it is important to examine how the language model learnt over time, i.e., the changes in loss and accuracy on the training and validation sets.
Figure 4: Training and validation performance. (a) Training and validation accuracy over time. (b) Training and validation loss over time.
Figure 4a shows a plot of accuracy over epochs (time) on the training and validation datasets, comparing both word vector spaces (wiki and domain). One can observe that the validation accuracy never really took off, while the training accuracy keeps improving and even reaches around 97%. It is very clear that the model is overfitting. The problem with text generation models, however, is that unlike other models there is no sweet spot at which training accuracy increases while validation accuracy decreases, signalling that training should be halted. We are looking for a generalizable model that can generate new text, so the usual role of the validation dataset does not apply here.
It comes as no surprise when we also look at the loss plotted against epochs in Figure 4b: the training loss decreases while the validation loss keeps increasing. One interesting thing to note in both plots is that the wiki and domain spaces learn at the same rate. From these plots alone we cannot tell which vector space representation generates better text. The only solution is to run the evaluation at regular epochs for both vector spaces to find out which model extracts context features best. In addition to comparing which model performs better, the evaluation score at each epoch also helps to ascertain the specific point at which to stop training for the optimum output.
There are various evaluation methods used in this area of research, such as BLEU, F-measure, WER, accuracy and perplexity. To evaluate against reference data, as in machine translation or question-answering downstream tasks, metrics such as BLEU, NIST or F-measure can be used. For applications without any reference text, perplexity is applied to measure how good the learned language model is. These metrics test the syntactic integrity of the generated text. The task at hand, however, is not to evaluate the performance of the language model, but to evaluate whether the generated text is semantically close to the provided context. Due to the nature of the problem, I have used the following evaluation technique, based on word embeddings and cosine similarity, to compute the similarity of the generated text.
The similarity is computed between the context word and a list of words selected from the generated sentence based on their part of speech (POS). The evaluation is based on the hypothesis that the nouns in a sentence convey most of the semantic context compared to the rest of the words. Thus, only the nouns (NOUN, PROPN) are processed and the cosine similarity of each noun with the context word is calculated. The average of these cosine similarities gives the similarity of the sentence to the context word. Given a sentence S containing nouns n_i, the similarity measure is expressed as in equation (2):
(1/N) (cos(n_1, c) + cos(n_2, c) + ... + cos(n_N, c))        (2)

where N is the number of nouns in the sentence and c is the context word vector.
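A sketch of this evaluation using spaCy for POS tagging and pre-trained word vectors (the spaCy pipeline name and the vector lookup are assumptions; the paper only specifies noun selection and cosine similarity):

```python
import numpy as np
import spacy
from numpy.linalg import norm

nlp = spacy.load("en_core_web_sm")        # assumed English pipeline for POS tagging

def sentence_context_similarity(sentence, context_word, word_vectors):
    """Average cosine similarity between the sentence's nouns and the context word."""
    ctx = word_vectors[context_word]
    sims = []
    for token in nlp(sentence):
        if token.pos_ in ("NOUN", "PROPN") and token.text in word_vectors:
            vec = word_vectors[token.text]
            sims.append(float(np.dot(vec, ctx) / (norm(vec) * norm(ctx))))
    return float(np.mean(sims)) if sims else 0.0
```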
Figure 5: Cosine similarity of the generated text with context over training time
The evaluation measure in equation (2) is applied every 3 epochs to both the wiki and domain space models to measure the semantic closeness between the provided context and the generated text. Figure 5 shows a plot of this semantic similarity measure of the output over training time.
Apart from the context-based methods, it is also essential to evaluate them against a base model. The base model performs the text generation task without any context vectors; the context words provided at test time therefore do not affect the base model at all, and the spikes it produces are purely random. It is obvious from the plot that training language models along with context information produces much better results than the base model.
From the plot, it is clear that the performance of the context-based models increases over time and starts to decrease after a specific point. One important thing to note is that the cosine similarity measure can become negative if the words are highly dissimilar; when considering the cosine similarity over the whole generated text, the few negative values that may occur are negligible in the overall average. By comparison, the model trained on context extraction from the domain vector space performed better than the one using the wiki vector space embeddings. Around 20 epochs of training, both models achieved their maximum performance. Thus, by locating the maximum peaks of the domain-trained model for the given dataset, one can determine the optimum performance of the model with respect to context-based generation. This is shown in Figure 6.
From Figure 6, we can see that a training accuracy between 67% and 85% provides a generalizable model for the considered use case.
7 Discussion
This paper discussed the need for applying contextual information when training language models for text generation tasks. LSTM networks have been chosen as the language model for the considered use case. The main focus of the work is on finding the best way of extracting context from sentences, so as to train a text generation model along with the contexts, aiming for a better language model. Multiple methods have been attempted for context extraction, and among them, contexts extracted from word clusters in word vector spaces worked better. It also became clear from the results that word embeddings generated from the same dataset provided better context vectors. The evaluation method used cosine similarity to calculate the semantic closeness between the generated sentences and the provided context. The evaluation results emphasized that context-based models perform better than base models. The evaluation method also provided feedback to the training process, indicating when the model overfits so that training can be stopped. The proposed system could find potential applications in the following areas:
• Question answering systems, where instead of following a template-based approach the model could identify the context from the question and respond meaningfully with respect to that context
• Chat-bots, where by keeping track of topics, conversations could be generated based on the previously conversed topics/contexts
• Besides encoding the semantic meaning of sentences, context vectors could instead be used to specify a structure, for instance by training text generation with context vectors of an introduction-content-summary-moral structure; a generated report/story would then follow a defined form instead of a random format
References
[1] Yoshua Bengio, Patrice Simard, Paolo Frasconi, et al. Learning long-term dependencies with gradient descent is
difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
[2] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[3] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio.
Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint
arXiv:1406.1078, 2014.
[4] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-
structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for
Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume
1: Long Papers), pages 1556–1566, Beijing, China, July 2015. Association for Computational Linguistics.
[5] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. Signal Processing, IEEE Transac-
tions on, 45(11):2673–2681, 1997.
[6] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances
in neural information processing systems, pages 3104–3112, 2014.
[7] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In
Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.
[8] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[9] Niladri Sekhar Dash. Context and contextual word meaning. SKASE Journal of Theoretical Linguistics, 2008.
[10] William A Gale, Kenneth W Church, and David Yarowsky. Work on statistical methods for word sense disam-
biguation. 1992.
[11] Diana Maynard and Sophia Ananiadou. Terminological acquaintance: the importance of contextual information
in terminology. In Proc. of NLP2000 Workshop on Computational Terminology for Medical and Biological
Applications. Citeseer, 2000.
[12] Andrew M Dai, Christopher Olah, and Quoc V Le. Document embedding with paragraph vectors. arXiv preprint
arXiv:1507.07998, 2015.
[13] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[14] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words
and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119,
2013.
[15] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation.
In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages
1532–1543, 2014.
[16] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-
training distributed word representations. In Proceedings of the International Conference on Language Resources
and Evaluation (LREC 2018), 2018.
[17] Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. Building large monolingual dictionaries at the leipzig
corpora collection: From 100 to 200 languages. In LREC, volume 29, pages 31–43, 2012.
[18] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research,
9(Nov):2579–2605, 2008.