Extracting Word Synonyms From Text Using Neural Approaches
Abstract: Extracting synonyms from textual corpora using computational techniques is an interesting research problem in the
Natural Language Processing (NLP) domain. Neural techniques (such as Word2Vec) have been recently utilized to produce
distributional word representations (also known as word embeddings) that capture semantic similarity/relatedness between
words based on linear context. Nevertheless, using these techniques for synonym extraction poses many challenges due to the
fact that similarity between vector word representations does not indicate only synonymy between words, but also other sense
relations as well as word association or relatedness. In this paper, we tackle this problem using a novel 2-step approach: we
first build distributional word embeddings using Word2Vec, then use the induced word embeddings as input to train a feed-
forward neural network on an annotated dataset to distinguish between synonyms and other semantically related words.
Keywords: Neural networks, semantic similarity, word representations, natural language processing.
In experiments we have conducted, the most similar word to “large” was “small”, even though the two are actually exact opposites of each other (an antonym pair). These observations motivate the need for novel computational techniques that are tailored to capture synonymy between word pairs and not any other semantic relation.

The rest of this paper is organized as follows: Section 2 presents the related work in the literature regarding computational synonym extraction. Section 3 introduces the data and approach we have used to construct the word embeddings, or distributional word representations. Section 4 presents the neural network we have used to build a classifier for synonym identification.

Figure 1. Architecture of the Skip-Gram model for learning word representations. The input word w(t) is projected and used to predict the surrounding context words w(t-2), w(t-1), w(t+1), and w(t+2).
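For reference, and following the original Word2Vec formulation rather than an equation given in this paper, the Skip-Gram model of Figure 1 is trained to maximize the average log-probability of the context words given the current word:

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p\left(w_{t+j} \mid w_t\right)$$

where $T$ is the number of words in the corpus and $c$ is the context window size (5 in our setting).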
To achieve their goal, the authors conducted an intrinsic and an extrinsic evaluation of their approach. For the intrinsic evaluation, two human evaluators assessed whether or not each word pair is synonymous. For the extrinsic evaluation, the authors used their system in a machine translation evaluation task and observed an improvement in the evaluation metric of the machine translation. It is worth mentioning that Leeuwenberg et al. [7] favoured a minimally supervised system so that their approach can be extended to other less-resourced languages which may not be supported by rich lexical databases or sophisticated NLP tools. The only source of annotation used in that research is a part-of-speech tagged corpus.

In this paper, we use an approach similar to the work proposed by Leeuwenberg et al. [7], but we add a supervised step that requires a small set of annotated data. We show that synonymy identification can be tackled as a supervised machine learning task in which the features are the word embeddings constructed using a generic technique such as Word2Vec. Contrary to previous works in NLP and script recognition that rely on manually engineered features (for example, the work of Bahashwan et al. [1] and Khan et al. [6]), our work relies on hidden representative features discovered by a neural network.
3. Constructing Word Embeddings

An intuitive, simple solution to this problem is to convert each lexical item in the text corpus into a vector representation and then compute the similarity between the vectors using a similarity measure (e.g., cosine similarity). Given the vector representations of two words, namely w1 and w2, the cosine similarity is the inner product of the two vectors normalized by their magnitudes, which measures the cosine of the angle between them. The higher the similarity between two vectors, the higher the chance that the lexical items they represent are synonyms. This might be followed by a classification step to identify whether the similarity is actually due to synonymy (positive instance) or to another type of similarity/association (negative instance). In this section, the approach that was used to build the word embeddings is explained in detail.
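Concretely, for two $n$-dimensional word vectors $w_1$ and $w_2$, the cosine similarity is:

$$\text{sim}(w_1, w_2) = \frac{w_1 \cdot w_2}{\lVert w_1 \rVert\, \lVert w_2 \rVert} = \frac{\sum_{i=1}^{n} w_{1,i}\, w_{2,i}}{\sqrt{\sum_{i=1}^{n} w_{1,i}^2}\ \sqrt{\sum_{i=1}^{n} w_{2,i}^2}}$$

so that vectors pointing in the same direction score 1 and orthogonal vectors score 0.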
3.1. Data and Approach

In this paper, we used the NewsCrawl 2014 corpus from the WMT workshop. We removed the headlines from the corpus by removing sentences that do not end with a punctuation mark. After that, the corpus was tagged using the Stanford Part-Of-Speech (POS) tagger developed by Toutanova et al. [12]. The advantage of using POS-tagged data is that homographs with different POS tags can be distinguished if the POS tag is appended to the word. For example, the two main senses of the word “book” would be represented as two different words, that is, the noun “book/NN”, which refers to the instrument that is used to convey information to the general public in written form, and the verb “book/VB”, which is synonymous with the verb “reserve”. To ensure a good quality of the constructed word embeddings, we only considered the words with frequency higher than 25. This was followed by a normalization step in which words were lower-cased (e.g., “The/DT” to “the/DT”) and digits were replaced by a wildcard (e.g., “546/CD” to “DIGIT/CD”). The final vocabulary size was 197,361 word types. Our pre-processed corpus contained 24.7 million sentences and around 602 million word tokens.

In order to construct vectors from the POS-tagged corpus, we used the neural word embedding model (word2vec), with which the vectors of similar lexical items are grouped together in the semantic vector space. The Skip-Gram model was used, the number of dimensions was set to 300, and the context window size was set to 5. Then, we used the cosine similarity measure to obtain a cluster of similar words for nouns, verbs, adjectives, and adverbs.
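The following is a minimal sketch of this construction using the gensim library, assuming the corpus has already been POS-tagged; the helper name is ours, and details such as the exact digit pattern are illustrative assumptions rather than the authors' exact pipeline:

```python
import re
from gensim.models import Word2Vec  # gensim 4.x API


def normalize(token):
    """Lower-case the word part of a 'word/TAG' token and collapse digits."""
    word, _, tag = token.rpartition("/")
    word = word.lower()
    if re.fullmatch(r"[\d.,]+", word):
        word = "DIGIT"                 # e.g. "546/CD" -> "DIGIT/CD"
    return word + "/" + tag


# sentences: an iterable of POS-tagged sentences,
# e.g. [["The/DT", "police/NN", ...], ...]
def train_embeddings(sentences):
    tagged = [[normalize(tok) for tok in sent] for sent in sentences]
    return Word2Vec(
        tagged,
        sg=1,             # Skip-Gram model
        vector_size=300,  # 300 dimensions, as in the paper
        window=5,         # context window size 5
        min_count=26,     # keep only words with frequency higher than 25
    )
```

With such a model, a query like model.wv.most_similar("large/JJ", topn=5) yields neighbour lists of the kind shown in Table 1.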
3.2. Qualitative Analysis

To perform a simple (qualitative) evaluation of the aforementioned approach, we manually inspect the 5 most similar words to a given set of target words (Table 1).

Table 1. A few target words (with their POS tag) and their 5 most similar words.

Target word (POS)        Top 5 similar words
police (NN)              authorities, officers, investigators, detectives, eyewitnesses
large (JJ)               small, huge, sizable, massive, big
social-networking (JJ)   photo-sharing, video-sharing, micro-blogging, on-demand, twitter-like
funded (JJ)              state-funded, government-funded, non-federal, non-university, bursary
scientific (JJ)          peer-reviewed, empirical, neuroscientific, laboratory-based, scholarly
inexpensive (JJ)         cheap, off-the-shelf, affordable, cost-effective, nontoxic
murdered (VBD)           raped, abducted, hanged, stabbed, killed
treated (VBN)            cared, discharged, drugged, hospitalized, readmitted
quickly (RB)             soon, swiftly, fast, easily, slowly

Many interesting observations can be made from Table 1. For the word “police”, none of the top 5 words can be recognized as a direct synonym. However, these words exhibit domain similarity to the target word. Since the corpus used in this experiment is a news corpus crawled from the open web, one can conclude that these words occur in crime-related reports. In these contexts, the phrase “police investigators” can be regarded as a synonym of the phrase “police detectives”. More interestingly, other paradigmatic relations can be recognized within the top 5 words themselves. For example, “police investigators” and “police officers” are co-hyponyms of the hypernym “employees at the police department”.
The second target word is the adjective “large”. The top 5 similar words are all adjectives that can be used to describe size. Among these, three words are synonymous with the word large (huge, massive, and big). However, the most similar word is “small”, which is actually an antonym of the target word. The top 5 words seem to exhibit functional similarity to the target word.

We have observed the presence of many contemporary compound adjectives in the corpus that are made of multiple words separated by a hyphen. An example of these adjectives is social-networking, as in “social-networking/JJ website/NN”. The top 5 similar words to this token were also compound adjectives and, interestingly, all of them are within the web 2.0 domain. However, these adjectives exhibit functional similarity as well. For example, if we replace “social-networking” by “micro-blogging” in the sentence “Twitter is a popular social-networking platform”, the meaning of the sentence would not change. In addition, “photo-sharing”, “video-sharing”, and “micro-blogging” are different ways of creating user-generated content on the web, which manifests the co-hyponymy relation. This observation motivates an appealing research direction that aims to discover multi-word adjectives and their domain of usage (e.g., technology, bio-sciences, politics, etc.). A similar trend can be observed with the adjective “funded”. The two most similar lexical items are actually a specification of the adjective, namely “state-funded” and “government-funded”.

Because the technique used in this paper is based on linear context (that is, context words that are present in the neighbourhood of a target word), adjectives that tend to modify the same set of words would have high cosine similarity. For example, the most similar word to the target word “scientific” is “peer-reviewed”, which can be justified by the fact that these two adjectives modify similar words (e.g., “scientific journal” vs. “peer-reviewed journal”, or “scientific evidence” vs. “empirical evidence”). Perhaps the only direct example of the feasibility of distributional similarity to discover synonyms is present in the target word “inexpensive”, where the most similar words can be considered synonyms (or nearly synonyms) by most native English users (i.e., cheap, off-the-shelf, affordable, cost-effective).

For verbs, we observed a behaviour similar to those already discussed. For example, the top 5 similar words to the verb “murdered” in the simple past tense (tagged as VBD) are all verbs that describe criminal events, and only one of them can be considered a synonym (“killed”). On the other hand, the verb “treated” in the past participial form (tagged as VBN) was similar to verbs that describe events which usually take place at a hospital or medical institution (the medical sense of the verb treat). Interestingly, the verb “treated” in the simple past tense (tagged as VBD) does not show a similar behaviour. This observation suggests that fine-grained POS tags might help distinguish different senses of homographs as well as different senses of verbs. Finally, the last entry in Table 1 shows the adverb “quickly” and its top 5 similar words.

3.3. Does Distributional Similarity Indicate Synonymy?

To further understand the nature of the distributional similarity between words, we conducted an investigation on a set of target words that was compiled by the author. To narrow the scope of our investigation, we only considered adjectives. We looked in depth into the most similar words to a set of 100 words. We used WordNet as a reference to discover the relation between the similar words if they are linked somehow in this lexical database. If two words are not linked in WordNet, we performed a manual evaluation to categorize the relation between them. The result of this investigation is shown in Table 2. Moreover, we were able to find the following categories of word similarity/relatedness in our investigation:

WordNet synonyms: words which are recognized as synonyms in WordNet.
WordNet antonyms: words which are recognized as antonyms in WordNet.
WordNet similar to: words which are recognized as similar to each other in WordNet.
WordNet see also: words which are connected to each other in WordNet under the see also category.
Nearly synonyms: words which can be recognized as synonyms (by a human evaluator) even though they are not directly linked as synonyms in WordNet.
Specification: a word is a specific case of another (more general) word.
Domain similarity: words that occur in the same domain or topic (topically similar words).
Contrasting: words that contrast but do not qualify as opposites or antonyms.
Association: words that are associated in most contexts.
Shortening: a word is an orthographic short form of another full word.
Similar: the words are somehow similar, but they are neither synonyms nor antonyms and do not exhibit any of the other relations above.

WordNet contains a lot of words that are connected by the synonymy or antonymy relation. However, some adjective pairs might be considered synonyms by a human evaluator even though WordNet does not recognize them as direct synonyms. For example, the word “appealing” would be considered a synonym of the word “attractive” by most English speakers, but WordNet puts “appealing” under the similar to category of the word “attractive”. The same can be said about
the word “beautiful”, which is put under the see also category of the word “attractive”. Therefore, we added these relations to Table 2.
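As an illustration of how such relations can be looked up programmatically, the sketch below uses NLTK's WordNet interface; the function is our own, and the priority order in which relations are checked is an assumption:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")


def wordnet_relation(a, b):
    """Return the first WordNet link found between adjectives a and b."""
    syns_b = set(wn.synsets(b, pos=wn.ADJ))
    for s in wn.synsets(a, pos=wn.ADJ):
        if b in s.lemma_names():                  # shared synset: direct synonyms
            return "synonym"
        for lemma in s.lemmas():                  # antonymy is stored on lemmas
            if b in (ant.name() for ant in lemma.antonyms()):
                return "antonym"
        if syns_b & set(s.similar_tos()):         # the 'similar to' category
            return "similar to"
        if syns_b & set(s.also_sees()):           # the 'see also' category
            return "see also"
    return "not linked"


print(wordnet_relation("appealing", "attractive"))     # expected: a weak link
print(wordnet_relation("successful", "unsuccessful"))  # expected: antonym
```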
From Table 2, one can observe that only 6 adjective pairs out of the 100 pairs are actually WordNet synonyms. Interestingly, 21 of the pairs are WordNet antonyms. However, many of the pairs that are identified as synonyms by manual investigation are actually within the similar to category in WordNet. We also identified other pairs that are not directly linked in WordNet but can be considered synonyms (we refer to them as nearly synonyms); we found 23 such pairs. These observations could justify the low precision of the work of Leeuwenberg et al. [7] when evaluation was performed against the synonyms in WordNet, compared to the manual evaluation, which gave much more optimistic results. Perhaps the most interesting example in Table 2 is (natural-liquefied), since the similarity between the two words seems unclear at first. We manually checked the corpus for instances of the two adjectives and found that the two words co-occur in many contexts due to the wide use of the expression liquefied natural gas in newswire corpora such as the one we have used in these experiments. We chose to refer to this particular instance of similarity as association. It seems that the Skip-Gram model of Word2Vec captures this kind of association with a high similarity score.

Table 2. Counts per category for 100 target adjectives with their most similar words.

Category             Most similar (out of 100)   Example
WordNet synonyms     6                           possible-potential
WordNet antonyms     21                          unsuccessful-successful
WordNet similar to   21                          responsible-accountable
WordNet see also     3                           careful-cautious
Nearly-synonyms      23                          limited-minimal
Specification        5                           healthy-heart-healthy
Domain similarity    4                           scientific-peer-reviewed
Contrasting          3                           personal-work-related
Association          2                           natural-liquefied
Shortening           1                           professional-pro
Similar              11                          proper-adequate

Finally, we found 11 adjective pairs where semantic similarity somehow exists but the pairs do not qualify as exact synonyms and cannot be categorized under any of the aforementioned categories. These pairs are also not directly linked in WordNet, and we simply refer to this category as similar.

4. Neural Network for Synonymy Identification

From the analysis in the previous section, it is clear that similarity (as measured by cosine similarity) does not necessarily indicate synonymy. It is quite rare that the most similar word is actually a synonym of the target word. This raises the question: given the vector representations of two words, can we build a system that classifies whether the two words are synonyms or not? If yes, how do we obtain labelled data to train the system?

We addressed the synonymy identification problem as a classification task. To make the problem simpler and doable within the time-frame of the research, only adjectives were considered for the classification. We used a feed-forward neural network with backward propagation as the learning algorithm. The classifier architecture is shown in Figure 2. To obtain labelled training data, we extracted synonym pairs from the SimLex-999 similarity lexicon with a similarity score > 6.5. SimLex-999 is a gold-standard resource for evaluating distributional semantic models [4]. In the authors' own words: “Simlex-999 explicitly quantifies similarity rather than association or relatedness”. Simlex-999 was produced by crowd-sourcing: 500 paid native speakers were asked to rate the similarity, as opposed to the association, of different concepts provided through a visual interface. Table 3 shows some examples from the data set.

Figure 2. Feed-Forward Neural Network architecture for synonymy classification.

Table 3. Examples from the SimLex-999 lexicon. The similarity scale ranges from 10 (most similar) to 0 (least similar).

Word1    Word2        POS   Sim-score
old      new          A     1.58
smart    intelligent  A     9.2
plane    jet          N     8.1
woman    man          N     3.33
word     dictionary   N     3.68
create   build        V     8.48
get      put          V     1.98
keep     protect      V     5.4
Considering the examples given in Table 3, one can conclude that SimLex-999 was designed to reflect semantic similarity due to the synonymy relation (as shown in the similarity score of smart and intelligent) more than any other relation. For example, even though the words “man” and “woman” denote similar concepts (i.e., co-hyponyms of the hypernym person or human), they received a relatively low similarity score. In addition, and despite the strong functional similarity between “old” and “new”, they received a low similarity score. Therefore, SimLex-999 can be used to perform a quantitative evaluation of the effectiveness of semantic models in reflecting synonymy with high confidence.

Then, we expanded the dataset by linking the pairs to the WordNet lexicon. For example, for a given pair (good, great), we searched the synsets of “good” in which “great” is one of the synonyms and included all other synonyms within the same synset. To obtain negative examples, antonyms in the same synset were extracted and antonym pairs were formed. Using this method, we obtained 128 synonymy pairs (SYN) and 91 antonymy pairs (ANT). We added 90 pairs of words that are neither synonyms nor antonyms (ELS), which were annotated manually; a sketch of this construction step is given below.
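The sketch assumes the standard tab-separated SimLex-999.txt distribution (with columns named word1, word2, POS, and SimLex999) and NLTK's WordNet; the helper is illustrative, and the manually annotated ELS pairs are not shown:

```python
from nltk.corpus import wordnet as wn


def build_pairs(simlex_path, threshold=6.5):
    """Extract SYN pairs from SimLex-999 and ANT pairs via WordNet."""
    syn_pairs, ant_pairs = set(), set()
    with open(simlex_path, encoding="utf-8") as f:
        header = f.readline().rstrip("\n").split("\t")
        for line in f:
            row = dict(zip(header, line.rstrip("\n").split("\t")))
            if row["POS"] != "A":          # only adjectives are considered
                continue
            w1, w2 = row["word1"], row["word2"]
            if float(row["SimLex999"]) <= threshold:
                continue
            syn_pairs.add((w1, w2))
            # Expansion: find the synsets of w1 in which w2 is a synonym
            # and include all other synonyms of the same synset.
            for s in wn.synsets(w1, pos=wn.ADJ):
                if w2 not in s.lemma_names():
                    continue
                for name in s.lemma_names():
                    if name not in (w1, w2):
                        syn_pairs.add((w1, name))
                # Negative examples: antonyms of lemmas in the same synset.
                for lemma in s.lemmas():
                    for ant in lemma.antonyms():
                        ant_pairs.add((lemma.name(), ant.name()))
    return syn_pairs, ant_pairs
```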
We performed several experiments for both 2-way classification (SYN/ANT) and 3-way classification (SYN/ANT/ELS). We used the vector representations (the output of word2vec) of each pair as input to the neural network. In each of the classification runs, the data were split into 75:25 training and testing sets, respectively. The results are reported in Table 4.
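The paper does not spell out how the two word vectors are combined or the hidden-layer size; the sketch below assumes simple concatenation of the two 300-dimensional vectors and stands in for the feed-forward network with scikit-learn's backpropagation-trained MLPClassifier, an illustrative hidden layer of 100 units, and the 75:25 split described above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def to_features(model, pairs, labels, tag="/JJ"):
    """Concatenate the two word vectors of each pair into one feature vector."""
    X, y = [], []
    for (w1, w2), label in zip(pairs, labels):
        k1, k2 = w1 + tag, w2 + tag
        if k1 in model.wv and k2 in model.wv:  # skip out-of-vocabulary pairs
            X.append(np.concatenate([model.wv[k1], model.wv[k2]]))
            y.append(label)
    return np.array(X), np.array(y)


# model: the trained gensim Word2Vec model; labels are "SYN", "ANT" or "ELS".
def train_classifier(model, pairs, labels):
    X, y = to_features(model, pairs, labels)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)    # 75:25 split as in the paper
    clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)
    clf.fit(X_train, y_train)                    # feed-forward net, backpropagation
    print("test accuracy:", clf.score(X_test, y_test))
    return clf
```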
antonyms, (co-)hyponyms, etc.). This task can be tackled as a multi-class classification in a similar fashion to the one we used to address the synonymy identification task.

6. Conclusions

In this paper, we addressed the problem of extracting synonyms from a text corpus using word embeddings and a supervised neural network. We used word2vec to construct word embeddings and performed a qualitative evaluation of the most similar words to a few target words. Our investigation showed that distributional similarity does not always indicate synonymy; the similarity might instead be due to other functional similarity (e.g., antonymy) or domain similarity (e.g., association). Then, we showed that embeddings constructed using word2vec can be used as features to feed a neural network for the synonymy classification task. For future work, we suggested extending our approach to sense relation identification instead of synonymy discovery, which can be tackled as a multi-class classification task.