
Machine Learning with Applications 9 (2022) 100369


A metaheuristic with a neural surrogate function for Word Sense Disambiguation
Azim Keshavarzian Nodehi ∗, Nasrollah Moghadam Charkari
Tarbiat Modares University, Tehran, Iran

ARTICLE INFO

Keywords: Word Sense Disambiguation, Metaheuristics, Surrogate Functions, Sense Mapping

ABSTRACT

Word Sense Disambiguation (WSD) is one of the earliest problems in natural language processing which aims to determine the correct sense of words in context. The semantic information provided by WSD systems is highly beneficial to many tasks such as machine translation, information extraction, and semantic parsing. In this work, a new approach for WSD is proposed which uses a neural network as a surrogate fitness function in a metaheuristic algorithm. Also, a new method for simultaneous training of word and sense embeddings is proposed in this work. Accordingly, the node2vec algorithm is employed on the WordNet graph to generate sequences containing both words and senses. These sequences are then used along with paragraphs from Wikipedia in the word2vec algorithm to generate embeddings for words and senses at the same time. In order to address data imbalance in this task, sense probability distribution data extracted from the training corpus is used in the search process of the proposed simulated annealing algorithm. Furthermore, we introduce a new approach for clustering and mapping senses in the WordNet graph, which considerably improves the accuracy of the proposed method. In this approach, nodes in the WordNet graph are clustered on the condition that no two senses of the same word be present in one cluster. Then, repeatedly, all nodes in each cluster are mapped to a randomly selected node from that cluster, meaning that the representative node can take advantage of the training instances of all the other nodes in the cluster. Training the proposed method in this work is done using the SemCor dataset and the SemEval-2015 dataset has been used as the validation set. The final evaluation of the system is performed on SensEval-2, SensEval-3, SemEval-2007, SemEval-2013, SemEval-2015, and the concatenation of all five mentioned datasets. The performance of the system is also evaluated on the four content word categories, namely, nouns, verbs, adjectives, and adverbs. Experimental results show that the proposed method achieves accuracies in the range of 74.8 to 84.6 percent in the ten aforementioned evaluation categories which are close to and in some cases better than the state of the art in this task.

1. Introduction

Lexical ambiguity of polysemous words is considered as one of the most challenging problems in several NLP applications such as machine translation, question answering, and information retrieval. Word sense disambiguation, which attempts to select, among a finite set of senses, the most appropriate sense for a word within a given context, can be considerably beneficial to a range of tasks in NLP by explicitly addressing this ambiguity (Navigli, 2018).

Research on WSD is mainly focused on two types of systems, namely, knowledge-based and supervised WSD. Knowledge-based approaches take advantage of the structural information provided by resources like WordNet (Miller, 1995) and BabelNet (Navigli & Ponzetto, 2012), which are structured as graphs, to identify the most probable meaning of a word in its context using graph algorithms such as PageRank. A major problem with these systems is the expensiveness of manual creation and development of such resources. As a result, they are often incomplete and become outdated as languages evolve (Navigli, 2009). And despite efforts to automate this costly process that has resulted in much richer knowledge bases, methods that solely rely on these resources still lag behind the supervised variants (Bevilacqua et al., 2021).

Supervised WSD, on the other hand, relies on labeled data to learn one classifier per word in the vocabulary. With the exception of sequence-to-sequence models that take as input a sequence of words and output a sequence with polysemous words replaced by their most probable meaning, disambiguation is done separately for each word using these classifiers; since the number of meanings for different words varies, there is not a fixed number of classes to choose from, therefore necessitating the training of one classifier per word, and consequently, the existence of a very large body of sense-labeled text.

Abbreviations: WSD, Word Sense Disambiguation


∗ Corresponding author.
E-mail addresses: [email protected] (A.K. Nodehi), [email protected] (N.M. Charkari).

https://doi.org/10.1016/j.mlwa.2022.100369
Received 2 January 2022; Received in revised form 5 June 2022; Accepted 14 June 2022
Available online 17 June 2022
2666-8270/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).

Another challenge facing supervised WSD is dealing with imbalanced data. Word senses follow the Pareto distribution, meaning the most and least frequent senses of words are respectively over- and under-represented in the data, resulting in a tendency for the system to overfit to the most frequent class (Manning & Schütze, 1999).

In this paper, we present a hybrid approach that uses a metaheuristic algorithm with a deep neural network classifier as a surrogate objective function (Brownlee et al., 2015) to perform WSD simultaneously on sequences of words. Doing so, we alleviate the need, to a certain extent, to balance naturally imbalanced sense-labeled data, which is further addressed by modifying the search procedure of the metaheuristic. The sense and word embeddings used in the training of the neural network are trained simultaneously using the node2vec algorithm (Grover & Leskovec, 2016) on a modified version of the WordNet graph. We also present a new method for clustering and mapping WordNet senses in order to address the lack of training data, especially for low-frequency senses. Our work offers the following contributions:

• Using a deep neural network classifier as a surrogate objective function inside a metaheuristic algorithm for WSD which allows parallelization and enhanced performance on GPU.
• Incorporating sense distribution information in the search mechanism of the metaheuristic algorithm that yields two benefits: addressing the imbalance in sense-annotated data and boosting performance in terms of both speed and accuracy.
• Simultaneous training of word and sense embeddings using two knowledge sources, namely, WordNet and Wikipedia.
• Clustering and repeated random mapping of WordNet senses to overcome the lack of training data, especially for low-frequency senses in WordNet.

The remainder of this paper is structured as follows. Section 2 summarizes the related work in the WSD task. The proposed method and its components are described in detail in Section 3. In Section 4, the experimental results achieved by the proposed method are detailed and compared to the state-of-the-art in the task, and Section 5 contains the conclusion of this work.

2. Related work

WSD methods can be divided into 3 general categories depending on the knowledge source they use to perform disambiguation. Knowledge-based systems take advantage of knowledge resources such as WordNet which are often structured as graphs to calculate a probability distribution over the possible senses of a word. These methods usually use lexical similarity measures (Lesk, 1986) or graph-based measures (Moro et al., 2014). Supervised methods rely on sense-annotated data to train classifiers that determine the most appropriate sense of a word in context (Lee & Ng, 2002). Finally, semi-supervised methods attempt to address the knowledge acquisition bottleneck by introducing automatically labeled data to be used along with the hand-labeled training data to enhance system performance (Mihalcea, 2004).

2.1. Knowledge-based methods

In Lesk (1986), disambiguation was done by selecting the sense whose gloss had the highest overlap with the glosses of the words in the context of the target word. In Cowie et al. (1992), the Lesk approach was used as the energy function in the simulated annealing algorithm to disambiguate all ambiguous words in a sentence simultaneously. Banerjee and Pedersen (2013) also used the Lesk approach but extended synset glosses with glosses of related synsets in the WordNet graph determined by a relatedness measure based on relations between synsets. The overlap measure was also modified to take into account overlap length and assign more weight to longer overlaps in proportion to the length, as opposed to only operating on the word level, as done in Lesk. In Basile et al. (2014), dense vector representations were introduced to the extended Lesk method and the idea of definition overlap was replaced with cosine similarity. The sense whose gloss vector representation had the highest similarity to the target word's representation was considered to be the correct sense. In Agirre et al. (2014), for each target word to be disambiguated, all the content words in its context were inserted into the WordNet graph as nodes, with an edge from each of them to all of its possible senses. Then, using the Personalized PageRank algorithm, among the possible senses of the target word, the one with the highest score was chosen to be the most appropriate. In Moro et al. (2014), weights were introduced to the edges in BabelNet's knowledge graph based on structural density of the graph, and for each node, a set of related concepts was created using random walks with restart. Given an input text, a subgraph was created containing all possible interpretations of the text (i.e., including all the different senses of all the ambiguous words in the text) and the most coherent interpretation was chosen to be the one corresponding to the densest subgraph; first by removing low-probability nodes and then choosing the subgraph that had the highest average degree. In Dongsuk et al. (2018), semantic relational connections in a lexical knowledge base were used to connect the different senses. The concatenation of these connecting paths was fed to the doc2vec algorithm (Le & Mikolov, 2014), which encodes entire documents into dense vector representations. These representations were taken to represent words in the vocabulary. To perform WSD, words that had the highest similarity with each target word were collected using the vectors generated in the previous step and a similarity measure, namely, cosine similarity. Then, for each sense of the target word, a subgraph containing that sense and all senses of its chosen context words was extracted from the lexical knowledge base. Finally, the sense whose corresponding subgraph had the highest connectivity according to the personalized PageRank algorithm was selected as the most appropriate one for the target word. In Chaplot and Salakhutdinov (2018), topic modeling (Latent Dirichlet Allocation) was used to facilitate the use of the entire document as context, following the idea that documents are about a particular topic, sentences in a document are not independent of one another, and that words strongly tend to exhibit only one sense throughout a document (Gale et al., 1992). In Scarlini et al. (2020a), multiple knowledge resources and the BERT model (Devlin et al., 2019) were used to compute two vector representations for each sense, one by averaging contextual embeddings of related words in sentences extracted from Wikipedia and the other by exploiting the sense gloss enriched with the synset lemmas. At test time, the context vector of each target word was generated with BERT, and using the previously calculated sense embeddings and applying a 1-nearest-neighbor approach, the sense closest to the context of the target word was taken to be the most appropriate.

2.2. Supervised methods

In Lee and Ng (2002), a variety of supervised learning algorithms including naive Bayes, decision trees, and support vector machines were trained and evaluated using manually-defined features, such as parts of speech of neighboring words, single words and collocations surrounding the target word, and syntactic relations from dependency trees. In Taghipour and Ng (2015), pre-trained word embeddings as well as embeddings fine-tuned to specific tasks were introduced in the feature space of support vector machines. In Kageback and Salomonsson (2013), pre-trained word embeddings were used as input for a recurrent neural network consisting of a bidirectional LSTM layer, a fully connected layer, and a softmax layer. The first two layers shared parameters, but the softmax layer learned a different set of parameters for each target word. In Melamud et al. (2016), generic task-independent context embeddings were learned in an unsupervised manner using two LSTM networks and a similar approach to word2vec's negative sampling (Mikolov et al., 2013), encoding the context to the


right and left of the target word separately. The concatenation of the outputs of these two networks went through a fully-connected layer to produce the context embedding of the target word. To disambiguate a test word, its context embedding was compared to the context embedding of every occurrence of the word in the training set using cosine similarity, and the label was chosen from the instance that had the highest similarity. In Yuan et al. (2016), an LSTM language model was trained on a large corpus of text data to predict held-out words in text. Pre-trained word embeddings were fed to the LSTM layer and the output was projected to a lower dimensional context layer which, coupled with a softmax layer, predicted held-out words. Sense embeddings were generated by averaging the context vectors of all examples sharing the same sense label. For inference, the correct sense was the one that had the highest cosine similarity with the context vector of the ambiguous word. In Raganato, Bovi et al. (2017), a supervised system was proposed which, instead of learning a dedicated classifier for every target lemma, treated WSD as a sequence labeling problem, thus learning a single all-words model from training data. This transformed WSD into the problem of translating a sequence of words to a sequence of sense-tagged tokens. To this end, three different architectures were used: a bidirectional LSTM network, a bidirectional LSTM network with an attention mechanism, and a complete encoder–decoder architecture with LSTM and attention layers. In Vial et al. (2019), semantic relations in WordNet were used to cluster senses that refer to the same concept to decrease the granularity of the senses and increase their coverage. Using these clusters representing all the senses inside them as the new output labels, and with BERT contextualized vectors as input, a sequence-to-sequence model featuring transformer encoders was trained on a sense-annotated corpus to perform WSD. In Kumar et al. (2019), both input and output were encoded into dense vector representations; labels (i.e., senses) were encoded into sense embeddings by either contextual representations of gloss definitions provided by BERT or knowledge graph embeddings that exploit relational information to produce dense representations of graph entities. Input tokens were encoded using a BLSTM network with an attention mechanism. Finally, the dot product of each input token with all of its possible senses plus a bias term was fed to a softmax layer to predict the correct sense. In Bevilacqua and Navigli (2020), following the work of Kumar et al. (2019), structural information from the WordNet graph was injected into the pre-softmax scores that were used for prediction. In Huang et al. (2019), gloss information in WordNet and labeled sentences from SemCor (Miller et al., 1993) were used to generate context-gloss pairs of training instances with positive and negative labels. These pairs were then used with a classification layer to fine-tune a pre-trained BERT model. In Scarlini et al. (2020b), in addition to the knowledge-based system mentioned in Section 2.1, a supervised approach to calculate the sense embeddings was proposed where the context vector was generated using SemCor instead of Wikipedia; for every sense, all the sentences in which the sense appears were fed to BERT and the average of the output vectors was concatenated with the gloss vector to represent the sense. Prediction was then done using the same 1-nearest neighbor approach. In Conia and Navigli (2021), WSD was framed as a multi-label classification problem where more than one sense could be assigned to each target word, addressing the observation that around 5% of instances in both training and evaluation data for WSD are labeled with two or more senses, which suggests that there is more than one possible meaning for the target words in question in their respective contexts. The aim was achieved by using binary cross-entropy loss in place of cross-entropy where instead of choosing the best sense for each target word among its possible senses, the network decided for each sense whether or not it was suitable in that context.

2.3. Semi-supervised systems

Mihalcea (2004) investigated the application of co-training and self-training, and their combination with majority voting to the problem of supervised word sense disambiguation. Co-training and self-training are bootstrapping methods that iteratively classify unlabeled instances and add the most confidently predicted ones to the training data. In Niu et al. (2005), unlabeled examples were annotated and added to labeled training data using label propagation performed on a graph consisting of labeled and unlabeled instances as nodes, with an edge between each pair of nodes if one was among the k nearest neighbors of the other, and edge weights being inversely proportional to the distance between nodes. The idea being that the closer the examples, the higher the probability of their labels to be same, thus forming clusters; labels are iteratively propagated from labeled to nearby unlabeled instances in the graph until convergence. In Mihalcea (2007), sense annotated corpora were generated by mapping hyperlinked occurrences of ambiguous words in Wikipedia to WordNet senses. Yuan et al. (2016) also used label propagation to add a large number of unlabeled sentences from the web to the training data. Addition of a large number of unlabeled instances leads to the decision boundary between the different sense clusters to be better approximated, since the nearest neighbor classifier assumes spherical boundaries for clusters.

3. Proposed method

In this paper, we propose a hybrid model that combines the strengths of metaheuristic algorithms and neural networks to simultaneously disambiguate sequences of words. Metaheuristic algorithms are particularly interesting in the task of word sense disambiguation in that they resemble the way human beings seem to disambiguate words in context, albeit unconsciously in most cases; given a sequence of words, different combinations of senses are constructed and the combination that is selected is the one that "makes the most sense". The challenge for metaheuristic WSD is then to define "making sense" in mathematical terms, since this would be the function used by the metaheuristic as the objective function to find the best combination of senses. Several attempts have been made to make use of this notion in WSD using different metaheuristic algorithms, including simulated annealing, genetic algorithms, ant colony optimization, and particle swarm optimization (Cowie et al., 1992). In all of these attempts, a very simple yet at times effective measure called the Lesk method has been used to determine the fitness of generated solutions (i.e., combinations of senses) where the combination that has the highest overlap of sense definitions is considered to be the most appropriate. This method, however, has not been able to perform well as an objective function for metaheuristic algorithms when it comes to disambiguation results. Therefore, in this work, we propose a model in which a neural network is trained to act as a surrogate objective function (also known as meta-model, proxy function, and approximation function) inside a metaheuristic algorithm. We also introduce a new approach for clustering and mapping WordNet senses to combat the granularity of senses in WordNet and the lack of sufficient training examples for low-frequency senses of words. This is achieved through clustering WordNet senses and generating multiple mappings by granting every sense an equal chance to represent a cluster. Done repeatedly, this has the effect of increasing training examples, disproportionately for low-frequency senses and thus flattening the sense probability distribution. Furthermore, we propose a method that facilitates the simultaneous training of word and sense embeddings using multiple knowledge sources of different natures, enabling the two sets of embeddings to be informed by one another during the training process. Fig. 1 provides an overall view of the proposed framework, the major parts of which are detailed in the remainder of this section.

3.1. The simulated annealing algorithm

The metaheuristic used in the proposed model is the simulated annealing algorithm. It was selected because of its simplicity and convergence speed. Its name is an analogy to the annealing process


Fig. 1. An overall view of the proposed framework with the training pipeline on the left and the testing pipeline on the right.

in metals where slow cooling allows the metal to reach a minimum energy state, meaning fewer atom dislocations in the crystal lattice, lower internal stress, and as a result, less brittle material.

In simulated annealing, function 𝐸 corresponds to the energy to be optimized, and a parameter 𝑇 which corresponds to temperature is initialized at a high value and decreased slowly to create a balance between exploration and exploitation; high temperatures lead to more exploration and low temperatures to more exploitation in the search space.

Initially, the algorithm starts by computing the value of 𝐸 for a random configuration in the search space. Then, a random change is made to this configuration and a new value of 𝐸 is computed. If the new value of 𝐸 is lower (in the case of minimization) than the old value, the new configuration replaces the old one. However, if the new value is not lower, there is still a probability for it to be chosen, with the probability being inversely proportional to the temperature; the higher the temperature, the higher the probability of the algorithm to go uphill. This feature is essential in that it prevents the algorithm from being stuck in a local minimum in the early stages of the process. Finally, when the algorithm satisfies a pre-defined stopping criterion, e.g., a fixed number of iterations, it is terminated.

In word sense disambiguation, configurations in the search space are the different combinations of senses in a sequence of words and the function 𝐸 is what determines whether or not, and to what degree, a given combination "makes sense". In this work, we take the probability which is output by the final layer of a neural network to represent how much sense a given combination makes and since we seek the combination that makes the most sense, the task becomes a maximization problem.

More formally, given a sequence of words with length 𝑛, we represent the senses of the 𝑖th word as 𝑠𝑖1, 𝑠𝑖2, …, 𝑠𝑖𝑘, with 𝑘𝑖 being the number of senses of the 𝑖th word in WordNet. A combination is generated by selecting a sense for each word in the sequence. The initial combination 𝐶 is a sequence of ones since senses in WordNet are sorted based on frequency. This means that initially for each word in the sequence, its first sense is selected as it is the most commonly used. This can be observed by the histogram analysis of the training data, namely SemCor, which reveals that word senses follow the Pareto distribution, meaning the first sense accounts for the majority of occurrences. The value of 𝐸 for this configuration is the probability given by the final layer of the neural objective function with a sigmoid activation. Next, a new configuration is generated by choosing a different sense for a random word 𝑖 in the sequence. The new sense 𝑠𝑖𝑗 for word 𝑖 could be chosen uniformly randomly. However, it is possible to exploit the knowledge of the distribution of senses extracted from the training data to considerably improve both accuracy and convergence speed. Therefore, having chosen a random word 𝑖 in the sequence, and knowing the possible number of senses 𝑘𝑖 for word 𝑖, a random sense number 1 ≤ 𝑗 ≤ 𝑘𝑖 is generated with its probability based on the distribution of senses of all words in the training data with 𝑘𝑖 possible number of senses. A new combination 𝐶′ is generated by replacing the old sense 𝑠𝑖1 with the newly selected sense 𝑠𝑖𝑗 for word 𝑖. Let 𝛥𝐸 = 𝐸𝐶′ − 𝐸𝐶 be the change in fitness from 𝐶 to 𝐶′. If 𝛥𝐸 > 0, 𝐶′ replaces 𝐶 and subsequent changes are made to 𝐶′. If 𝛥𝐸 ≤ 0, 𝐶′ replaces 𝐶 with probability 𝑝 = exp(𝛥𝐸∕𝑇), where 𝑇 is the temperature. The value of 𝑇 is initialized at 1 and replaced by 0.9995𝑇 after every iteration. This means that the probability of "going uphill" decreases as the algorithm progresses.

One of the benefits of using a neural network as the objective function in a metaheuristic is parallelization, meaning sequences do not need to be disambiguated in series; in every iteration, a change is made in all sequences first, and then the new sequences are evaluated in batches by the neural network, leading to substantial improvement in computation time. However, evaluating sequences in batches prevents the use of dynamic stopping criteria in the algorithm and for this reason, we use a fixed 5000-iteration stopping condition in the algorithm.
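To make this search procedure concrete, the following is a minimal Python sketch of the described simulated annealing loop for a single sequence. The helpers n_senses, sense_prior, and surrogate_score are hypothetical stand-ins for, respectively, the per-word sense counts, the sense-rank distributions extracted from the training data, and the sigmoid output of the neural objective function; the actual system evaluates candidate sequences in batches across many sequences rather than one at a time.

```python
import math
import random
import numpy as np

def simulated_annealing(words, n_senses, sense_prior, surrogate_score,
                        t0=1.0, cooling=0.9995, iterations=5000):
    """Disambiguate one sequence of words; returns a list of 1-based sense ranks."""
    current = [1] * len(words)              # start from the most frequent sense of every word
    current_score = surrogate_score(current)
    temperature = t0
    for _ in range(iterations):
        i = random.randrange(len(words))    # pick a random position in the sequence
        k = n_senses[i]
        if k > 1:
            # sample a sense rank in 1..k from the training-data distribution for k-sense words
            j = int(np.random.choice(np.arange(1, k + 1), p=sense_prior[k]))
            candidate = list(current)
            candidate[i] = j
            candidate_score = surrogate_score(candidate)
            delta = candidate_score - current_score
            # always accept improvements; accept worse moves with probability exp(delta / T)
            if delta > 0 or random.random() < math.exp(delta / temperature):
                current, current_score = candidate, candidate_score
        temperature *= cooling              # annealing schedule: T <- 0.9995 * T
    return current
```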


Fig. 2. The architecture of the neural network trained to recognize correct combinations of senses and used as the objective function in the simulated annealing algorithm.

3.2. The neural objective function

The choice of objective function is crucial to the success of any metaheuristic algorithm as it acts as a guide in the algorithm's search for the optimum solution. One of the main challenges in metaheuristic WSD is defining the objective function as it is not clear what it should be. In previous attempts (Cowie et al., 1992), the Lesk method and its variations have been used as the objective function, where given a combination of senses for a sequence of words, the overlap of these senses' definitions is taken to be the fitness of the combination in question. This method performs well when there are coarse distinctions between senses, but as the granularity of senses increases, the method begins to struggle. One reason for this may be that the model operates at the sparse level of words and as senses become more and more fine-grained, there is more overlap between the different senses of the same word. A logical conclusion, therefore, is that the metaheuristic algorithm could benefit from a more complex objective function.

As a result, we propose using a neural network trained to distinguish between correct and incorrect combinations of senses as the objective function in the simulated annealing algorithm described in Section 3.1. This allows more flexibility in the choice of network architecture and parameter space, as well as dense representations of words and senses to be exploited by the algorithm.

The architecture of the model used is depicted in Fig. 2. Given an input sequence of length 𝑙, a one-hot vector representation of the sequence is used to collect the embeddings associated with its tokens. These embeddings are fed to a bidirectional LSTM layer which outputs a vector of size 1 × 2𝑢 where 𝑢 is the number of LSTM units. This vector is then fed to a fully connected layer with ReLU units before going through the final layer of the network with the sigmoid activation which produces a probability corresponding to the certainty by which the input sequence can be classified as correct.
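This architecture maps directly onto a few layers in any standard deep learning framework. The sketch below uses tensorflow.keras and the hyperparameter values reported in Appendix A (sequence length 32, embedding size 512, 128 LSTM units, a 64-unit dense layer); it is an illustrative reconstruction rather than the authors' code, and embedding_matrix stands for the pre-trained word and sense vectors of Section 3.4.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_surrogate(vocab_size, seq_len=32, embed_dim=512, lstm_units=128,
                    dense_units=64, embedding_matrix=None):
    """Binary classifier scoring a window of word/sense tokens as a correct combination."""
    token_ids = layers.Input(shape=(seq_len,), dtype="int32")
    embedding = layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=(tf.keras.initializers.Constant(embedding_matrix)
                                if embedding_matrix is not None else "uniform"))(token_ids)
    hidden = layers.Bidirectional(layers.LSTM(lstm_units))(embedding)  # output size 2u
    hidden = layers.Dense(dense_units, activation="relu")(hidden)
    score = layers.Dense(1, activation="sigmoid")(hidden)              # P(sequence is correct)
    model = models.Model(token_ids, score)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy")
    return model
```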

Training such a model requires a set of correct (i.e., positive) and incorrect (i.e., negative) instances. Correct instances, in this context, are defined as sequences of words and senses that have been used together in a certain order in the dataset. Any other combination of senses that does not fit this definition is considered incorrect. Correct sequences of this nature are readily available through sense-tagged corpora such as SemCor; given a body of text, sense-tagged tokens are viewed as senses and others as words. Correct sequences are collected by the means of a moving window of length 𝑙. Incorrect sequences, however, need to be generated and since there can be a large number of possible incorrect instances, the question of how and how many negative instances are generated and used becomes important. The number of negative instances is treated as a hyperparameter in this work. Negative sequences are generated by making changes in the correct sequences harvested from training data; given a correct sequence, a random labeled token is selected and replaced by a different sense of the same word. If the newly generated sequence does not exist in the pool of correct sequences, it is kept as a negative sequence.

In the process of generating negative instances, changes could be made either uniformly randomly or based on underlying sense distributions extracted from training data. Making changes uniformly randomly makes it easier for the network to learn the boundary between the two classes as they would have completely different sense distributions. However, when deployed into the simulated annealing algorithm, the model then struggles to find the correct sequences. This is due to the fact that many negative sequences have a similar distribution of senses to correct sequences. On the other hand, making changes based on the underlying sense distributions makes it more difficult to train the network due to the high similarity of instances, but the resulting model is found to perform better at test time. Therefore, the latter method was employed to generate the negative class.

Finally, since there exist multiple negative instances for each positive instance, 𝑁 being the number of negative instances added to the dataset for a given positive sequence, that sequence is duplicated 𝑁 times to achieve a balanced dataset.
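As an illustration of this sampling scheme, the sketch below builds positive and negative windows in the way just described. alternative_senses is a hypothetical helper that returns the other senses of the word underlying a sense-tagged token (and an empty list for plain words); it draws replacements uniformly here, whereas the system described above draws them from the underlying sense distribution. Each positive window is duplicated once per accepted negative so that the two classes stay balanced.

```python
import random

def build_examples(tagged_windows, alternative_senses, negatives_per_window=10):
    """tagged_windows: length-32 token windows from SemCor, where sense-tagged tokens
    are sense identifiers and all other tokens are plain words."""
    positives = {tuple(w) for w in tagged_windows}
    examples = []
    for window in tagged_windows:
        sense_positions = [i for i, tok in enumerate(window) if alternative_senses(tok)]
        negatives, attempts = set(), 0
        while len(negatives) < negatives_per_window and sense_positions and attempts < 100:
            attempts += 1
            i = random.choice(sense_positions)
            corrupted = list(window)
            corrupted[i] = random.choice(alternative_senses(window[i]))  # swap in another sense
            corrupted = tuple(corrupted)
            if corrupted not in positives:        # keep only sequences never seen as correct
                negatives.add(corrupted)
        examples += [(tuple(window), 1)] * len(negatives)   # duplicate the positive N times
        examples += [(negative, 0) for negative in negatives]
    return examples
```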
3.3. Sense compression

One of the main challenges in WSD when working with the WordNet inventory is the high granularity of senses, especially when it comes to verbs and nouns. This could make it extremely difficult for systems to distinguish between highly similar senses of the same word, particularly when dealing with a lack of sufficient sense-labeled data.

Following the work of Vial et al. (2019), we propose a new method to cluster and map WordNet senses in order to reduce vocabulary size while increasing the number of training instances, especially for less frequent senses in the training data.

In Vial et al. (2019), clusters are generated iteratively, where initially, the set of all clusters 𝐶 includes clusters with one sense each. Then in each iteration, clusters are sorted by size and the smallest cluster 𝑐𝑥 along with its smallest related cluster 𝑐𝑦 are selected and merged, with two clusters being related if there is at least one edge between any of their members. The validity of the merge is determined by the condition that no two senses of the same word should be in one cluster, leaving intact the ability to disambiguate the word. If the merge is valid, it is kept and added to 𝐶 while 𝑐𝑥 and 𝑐𝑦 are discarded. Otherwise, the merge is undone and the next smallest related cluster to 𝑐𝑥 is selected and the process continues until no merges are possible, at which point 𝑐𝑥 is taken out of the cluster set as a final cluster. Finally, when the clusters are generated, all members in each cluster can be mapped to one of the members, which means the representative member can take advantage of the training instances of the other members. However, it is not clear how the representative member in each cluster is selected in Vial et al. (2019). One possible choice could be the most frequent sense in each cluster. Another is to group the senses in each cluster by part of speech and map all the members in


each group to the most frequent member in training data, meaning that senses will not be mapped across parts of speech.

However, implementing the mappings in both mentioned ways increases the intensity of the already steep underlying Pareto distribution of the training data, which means that the more frequent senses would receive more training instances and those belonging to less frequent ones would be taken away from them. This, in effect, reduces the vocabulary size by a large margin and if the mappings could be performed on test data as well, it would have a very positive effect on disambiguation results. Making changes to the test data, however, is not possible since it is assumed that the correct senses are unknown and need to be determined. Performing mappings in this manner only on the training data leads to sense distributions in training and test data to get further away from each other, which results in a decrease in the performance of our system. It is necessary to mention again that the manner in which cluster representatives are chosen in Vial et al. (2019) has not been detailed in the paper and our analysis is done with the assumption that the most frequent sense is chosen to represent the cluster, which seems to be the most feasible choice.

In this work, we propose a mapping method that not only does not create the aforementioned problem, but acts as a mechanism for generating training examples, especially for less frequent senses. In this method, the selection of cluster representatives is done randomly, which assigns less frequent senses an equal probability to be chosen as cluster representatives. As a result, when all of the senses in a cluster are mapped to a low-frequency sense, it can utilize all the other senses' training instances and possibly overcome the lack of training data. It is obvious that performing this random mapping only once is insufficient and therefore, the process of sampling, generating the mappings, and generating a mapped version of the training data needs to be done several times so as to allow different senses to have the chance to represent their cluster. As a result of this process, we have multiple versions of the original training data, all of which are combined and used to train the proposed neural surrogate function.
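A compact sketch of this repeated random mapping is shown below. It assumes the clusters produced by the merge procedure of this section (each a set of sense identifiers containing no two senses of the same word) and a sense-tagged corpus whose tagged tokens are those identifiers; the number of remapped copies is a free choice here.

```python
import random

def remap_training_data(tagged_corpus, clusters, n_versions=5, seed=0):
    """Generate several copies of the corpus, each with every sense mapped to a
    randomly chosen representative of its cluster."""
    rng = random.Random(seed)
    versions = []
    for _ in range(n_versions):
        mapping = {}
        for cluster in clusters:
            members = sorted(cluster)
            representative = rng.choice(members)   # every sense equally likely to represent
            for sense in members:
                mapping[sense] = representative
        # rewrite sense-tagged tokens; plain words pass through unchanged
        versions.append([[mapping.get(tok, tok) for tok in sentence]
                         for sentence in tagged_corpus])
    return versions   # the concatenation of all versions trains the surrogate function
```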
The different effects of the two mapping methods can be seen in Fig. 3. Fig. 3a shows the distribution of senses among the first five senses of all words in training and validation data. As mentioned before, mapping to the most frequent sense in each cluster leads to an increase in the occurrence of the first sense and a decrease in others, while mapping with our method results in a decrease in the first and second sense, and an increase in the third, fourth, and fifth. Even though these changes are small, the advantage of the proposed method in this level of analysis is that it does not distance the distribution of training data from validation data.

The more prominent advantage of the proposed method, however, is shown in Fig. 3b, which shows the distribution of senses for words with exactly five senses. It can be seen that the training data, its mapped version to the most frequent sense in each cluster, and the validation data all follow the Pareto distribution, which is steeper in the mapped version of the training data. However, the mapped version with our method indicates a significantly more uniform distribution, which means that the less frequent senses have a comparable number of training instances to the more frequent ones. It is important to note that this does not imply a decrease in the number of training instances for more frequent senses, but an increase for all senses disproportionally in favor of the less frequent ones.

Another difference between our work and that of Vial et al. (2019) is the generation of initial clusters. In Vial et al. (2019), each initial cluster contains one sense from the WordNet graph and the combination of clusters is done using the semantic relationships between senses. Taking into consideration that 14.2 percent of nodes in the WordNet graph are isolated and a substantial number of nodes have very low levels of connectivity in the graph, it is not surprising that 87.7 percent of nodes remain isolated in their initial clusters at the end of the clustering process, meaning no mapping will be done on these nodes. Therefore, in order to address this problem, we add the synonymous words provided for each sense in WordNet as a second category of nodes to the graph, thereby creating a graph with two categories of nodes, namely senses and words. Using this new graph, each initial cluster now contains one sense and all its synonymous words, and the remaining clustering steps are performed as before. It should be noted that the presence of a word in more than one cluster is inevitable and allowed as the merging condition only relies on senses and not words, which are removed after clustering is finished.

The impact of this change in the graph can be observed in Fig. 4. Before adding the synonyms, the initial cluster containing the sense 'frost.v.03' is connected to only one other cluster and during clustering, it is not merged with any other cluster. After adding the synonyms, the cluster containing 'frost.v.03' is connected to seven other clusters through the word 'frost', resulting in a much higher probability of being merged, which indeed happens in this case during clustering. Implementing this method yields the outcomes below, all of which result in more compression of the senses:

• A 3.8 percent decrease in single-member clusters from 87.8 to 84 percent
• A 45.7 percent decrease in the number of generated clusters from 19 732 to 10 711 clusters
• Increasing the average size of the generated clusters from 5.96 to 10.98 senses in each cluster

It can be observed that the biggest impact of this method is on the number and the size of generated clusters, and that the number of single-member clusters is still substantial. This is due to the fact that not only do many senses in the WordNet graph have very low connectivity, many of them have very few synonyms which themselves have very low connectivity as well.

3.4. Pre-training word and sense embeddings

The neural network described in Section 3.2 takes as input sequences that consist of both words and senses, therefore requiring two sets of embeddings. There are various methods to train word and sense embeddings separately, as well as to convert one to the other. In this work, we propose a method to train word and sense embeddings simultaneously, allowing both sets to affect one another.

The WordNet graph is composed of word senses as nodes and semantic relationships between these senses as edges. Each sense in the graph is associated with a group of synonymous words, called synsets, which are not directly part of the graph. In order to be able to train word and sense embeddings at the same time, we add these synonyms to the graph as a second category of nodes, where edges between the two groups of nodes represent synonymy. The result of this addition on a small subset of the graph can be seen in Fig. 4.

Having constructed a graph that includes both words and senses, the node2vec algorithm is applied to map the vertices to dense representations in a d-dimensional space. Node2vec is a feature representation learning algorithm based on word2vec. It utilizes a flexible neighborhood sampling strategy which allows for smoothly interpolating between BFS and DFS search algorithms with the aim of creating a proper balance between structural equivalence and homophily. This is achieved with a flexible biased random walk procedure that can explore neighborhoods in both a BFS and a DFS fashion to generate sequences of nodes that are then treated as normal sequences of text in the word2vec algorithm to learn dense representations with the skip-gram model.

Doing so means that word and sense embeddings are affected by one another during the training process. It also means that we are no longer bound to WordNet as the knowledge base. It allows the integration of knowledge bases that are different in nature, namely WordNet and Wikipedia. In order to exploit this additional wealth of knowledge, paragraphs are sampled from Wikipedia and added to the node2vec sequences. With Wikipedia driving word representation learning and


Fig. 3. The distribution of senses among the first five senses of all words (3a) and the distribution of senses for words with exactly five senses (3b). The number five was selected
as an example for demonstration purposes and the observed patterns here are present in all other groups.

Fig. 4. A small subset of the WordNet graph before and after adding the synonymous words for each sense in WordNet, showing the impact on the connectivity of the initial
clusters.

WordNet driving both word and sense representation learning, we have two knowledge sources being used in tandem to learn two separate sets of embeddings simultaneously.
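The joint training step can be summarized in a few lines. The sketch below generates unbiased random walks (node2vec with p = q = 1, i.e., DeepWalk) over a networkx graph whose nodes are senses and their synonymous words, mixes them with tokenized Wikipedia paragraphs, and feeds the combined corpus to gensim's skip-gram word2vec (gensim 4.x argument names assumed); build_wordnet_graph and load_wikipedia_paragraphs are hypothetical loaders.

```python
import random
from gensim.models import Word2Vec

def graph_walks(graph, walk_length=80):
    """Uniform random walks over the extended WordNet graph (senses + synonym words)."""
    walks = []
    for node in graph.nodes():
        for _ in range(graph.degree(node)):      # one walk per neighbor, as in Section 4.2
            walk, current = [str(node)], node
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(current))
                if not neighbors:
                    break
                current = random.choice(neighbors)
                walk.append(str(current))
            walks.append(walk)
    return walks

# corpus = graph_walks(build_wordnet_graph()) + load_wikipedia_paragraphs()
# model = Word2Vec(corpus, vector_size=512, window=5, sg=1, min_count=1)
# model.wv["frost"] and model.wv["frost.v.03"] would then share one 512-dimensional space.
```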
4. Experimental results

In this section, we detail the experimental setup for the different modules of the proposed system, analyze the performance of its different components, and finally compare the experimental results with state-of-the-art methods in this task.

4.1. Data and evaluation metric

The training of the neural surrogate function is done using the SemCor corpus. SemCor is the most commonly used dataset for training WSD systems. It consists of bodies of text whose content words (i.e., nouns, verbs, adjectives, and adverbs) have been labeled with their correct sense from WordNet. WordNet is a lexical knowledge base that gathers word senses in the form of synsets (synonym sets) where each word sense is associated with a set of synonymous words. These synsets are connected through a multitude of relations such as hypernym and hyponym relations. This structure makes it possible to conceptualize WordNet as a graph with synsets as nodes and the relations between synsets as edges. WordNet and Wikipedia are used in this work as knowledge sources to train the word and sense embeddings, with Wikipedia providing information for words and WordNet for both words and senses. The assessment of the proposed system is done in the fine-grained all-words WSD task using the evaluation framework suggested by Raganato, Camacho-Collados et al. (2017), which includes five datasets from the SensEval and SemEval workshops, including SensEval-2 (Edmonds & Cotton, 2001), SensEval-3 (Snyder & Palmer, 2004), SemEval-2007 (Pradhan et al., 2007), SemEval-2013 (Navigli et al., 2013) and SemEval-2015 (Moro & Navigli, 2015), as well as the concatenation of all five datasets. The SensEval and SemEval datasets are provided in the same form as SemCor, with the difference being that only a subset of content words in these datasets has been sense-labeled. As validation set, we use SemEval-2015 as it is the smallest of the evaluation datasets that contains labeled instances for all four content word categories. The performance of the system is also reported in the four content word categories, namely nouns, verbs, adjectives, and adverbs. We report the F1 score, which is the harmonic mean of precision and recall, and as our system provides an answer for all instances, the three metrics are equivalent.

4.2. Experimental setup and hyperparameters

In the node2vec algorithm, parameters p and q, which control the probability of returning to the previous step and the local/global tendency of the random walks, are both experimentally set to 1, which in effect turns the node2vec algorithm into the DeepWalk algorithm (Perozzi et al., 2014), which is a special case of node2vec. The length of the random walks is set to 80, equal to the average length of paragraphs sampled from Wikipedia. This is done to ensure the balance between words and senses in the data. Also, the number of sequences starting from each node is set to the number of its neighbors so that each neighbor has an equal chance of being visited on the first step, since the first step is taken uniformly randomly.

The word and sense embeddings are trained by the word2vec algorithm using WordNet and Wikipedia. The skip-gram model is used in word2vec, with a window size and embedding size of 5 and 512 respectively. Binary cross-entropy is used as loss function and RMSprop as training algorithm.

To generate positive and negative instances to train the neural surrogate function, the text in the training data is divided into windows of size 32. For each window, 10 negative instances are generated and added to the example pool along with 10 copies of the positive instance. To simplify the search in the simulated annealing algorithm, only the first 12 senses of words are considered, which cover over 96 percent of training examples.

In the training of the neural surrogate function, we use a learning rate of 10⁻⁴, a batch size of 512, binary cross-entropy as loss function, and Adam as the training algorithm, with training done in one epoch.
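Put together with the architecture sketch given earlier in Section 3.2, the surrogate training run of this subsection would reduce to something like the following (hypothetical objects token_index, pretrained_vectors, X_train, and y_train standing for the vocabulary, the pre-trained embedding matrix, the token-id windows, and their 0/1 labels):

```python
model = build_surrogate(vocab_size=len(token_index), seq_len=32, embed_dim=512,
                        lstm_units=128, embedding_matrix=pretrained_vectors)
model.fit(X_train, y_train, batch_size=512, epochs=1)  # lr 1e-4, binary cross-entropy, Adam
```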


Table 1
The effect of using the underlying distribution of senses in the search process of the simulated annealing algorithm. The first model uses a uniform distribution, while the second
model uses the underlying distribution of senses extracted from training data to guide the search. Both models use random embeddings and no mapping is performed.
SE2 SE3 SE07 SE13 SE15 | Concatenation of all datasets: N V Adj Adv ALL
Uniform Sense Distro. 66.0 65.4 59.9 63.9 66.2 67.3 55.7 67.6 73.1 64.9
Underlying Sense Distro. 69.1 68.4 61.6 67.9 69.7 70.6 57.9 69.0 72.2 68.2

Table 2
The effect of using pre-trained word and sense embeddings in the training of the neural surrogate function. Both models use the underlying distribution of senses and no mapping
is performed.
SE2 SE3 SE07 SE13 SE15 | Concatenation of all datasets: N V Adj Adv ALL
Random Embeddings 69.1 68.4 61.6 67.9 69.7 70.6 57.9 69.0 72.2 68.2
Pre-trained Embeddings 71.8 70.1 67.4 72.2 73.9 73.6 61.9 70.9 75.1 71.5

Table 3
The effect of clustering WordNet senses and the repeated random mapping of the training data on system performance. All three models use the underlying sense distribution and
pre-trained embeddings.
SE2 SE3 SE07 SE13 SE15 | Concatenation of all datasets: N V Adj Adv ALL
No Mapping 71.8 70.1 67.4 72.2 73.9 73.6 61.9 70.9 75.1 71.5
Mapping (Senses) 76.2 77.8 77.6 82.5 76.9 84.0 72.3 68.9 72.3 78.1
Mapping (Senses & Words) 78.3 76.1 78.3 84.1 78.9 84.6 73.7 74.8 75.3 79.5

In the simulated annealing algorithm, the initial temperature is set to 1 and the annealing rate to 0.9995, which is applied after each iteration. The stopping criterion for the algorithm is fixed to 5000 iterations for all sequences, as a dynamic criterion cannot be used due to the parallelization of the search through the neural surrogate function.

The hyperparameter values for which a reason was not specified in this subsection were chosen experimentally. The hyperparameters of the systems and their values are further detailed in Table 6 in Appendix A.

4.3. Results and discussion

In this section we present, analyze, and compare the experimental results provided by the proposed system. Table 1 shows the effect of using the underlying distribution of senses in the search process of the simulated annealing algorithm. In the first model, changes to the current solution (combination of senses) are made uniformly randomly. In the second model, changes are made based on the underlying distribution of senses extracted from training data. In both models, random initial embeddings have been used and no mapping has been done on senses. It can be seen that using this information has led to a 0.9 percent decrease in accuracy for adverbs, but a 1.4 to 4 percent increase in the other nine categories.

Table 2 shows the effect of using pre-trained word and sense embeddings as opposed to random embeddings. In both models, the underlying sense distribution has been used and no mapping has been performed on senses. It can be observed that using pre-trained word and sense embeddings results in a 1.7 to 5.8 percent increase in accuracy across the ten evaluation categories.

The impact of sense clustering, repeated random mapping, and training the neural surrogate function with the combination of the different mapped versions of the training data can be seen in Table 3. In the first model, no mapping has been done on training data. In the second model, clustering and mapping is done using only senses in WordNet. In the third model, synonymous words are used as well as senses for clustering and mapping. All three models leverage the underlying sense distribution and pre-trained embeddings. It is evident from this table that applying mappings generated using only senses leads to an improvement by 3 to 10.4 percent in all but two of the evaluation categories, namely adjectives and adverbs, which see a 2 and 2.8 percent decrease in accuracy, respectively. This problem is solved by incorporating the synonymous words in the graph, as the third model shows an improvement of between 5 and 11.9 percent in all categories compared to the first model. Compared to the second model, the third model results in a 1.7 percent decrease on the SensEval-2 dataset, and a 0.7 to 5.9 percent increase for others.

Moreover, it is noteworthy that the performance of the proposed clustering and mapping method is highly superior on nouns and verbs. This can be further confirmed considering the results on SemEval-2007, which contains only nouns and verbs, and SemEval-2013, which contains only nouns as labeled examples.

The inferior performance of this method on adjectives and adverbs can be explained to some extent considering the statistics in Table 4. There is a very large number of isolated adjectives and adverbs in the WordNet graph, meaning that the probability of a merge and therefore a mapping for these nodes is zero. The addition of words to the graph has a considerable impact on these numbers, but they remain substantially higher than isolated nouns and verbs. Additionally, the average degree of adjectives and adverbs in the WordNet graph, compared to that of nouns and verbs, is much lower even without considering the nodes with a degree of zero. This means that even though these nodes are not completely isolated, they have a much lower chance to be merged with other clusters.

Finally, in Table 5, we compare the experimental results achieved by the proposed method with the state of the art in this task. The systems have been divided into three categories: supervised, knowledge-based, and semi-supervised systems. We can see that our proposed system outperformed all other systems in the categories of nouns and verbs, and the SemEval-2013 dataset by 1.5, 3.4, and 2.3 percent respectively, but it lags behind on the other four datasets by 1.6 to 4.4 percentage points. However, the major downside of the system is in the two categories of adjectives and adverbs, where our system lags behind the best systems by 9.9 and 12 percent. This is in contrast to the common pattern in WSD systems, where performance on adverbs and adjectives is normally higher than on nouns and especially verbs. We believe that the reason for this is the low connectivity of adjectives and adverbs in the WordNet graph, as discussed in this section, which hinders the performance of our clustering and mapping approach.

Table 4
Number of isolated nodes before and after clustering, and average degree of nodes with and without words in the WordNet graph.
Noun Verb Adjective Adverb
Isolated Nodes in WordNet 0 137 13 666 2958
Isolated Nodes after Clustering (Using Senses Only) 312 379 13 671 2959
Isolated Nodes after Clustering (Using Senses and Words) 225 66 6712 1995
Average Degree of Sense Nodes in WordNet (without Words)ᵃ 2.76 2.18 1.28 1.02
Average Degree of Sense Nodes in WordNet (with Words)ᵃ 4.54 3.98 1.97 1.73
ᵃ Without considering zero-degree nodes.

Table 5
Experimental results achieved by our proposed system compared to the state of the art in word sense disambiguation.
SE2 SE3 SE07 SE13 SE15 | Concatenation of all datasets: N V Adj Adv ALL

Most frequent sense 65.6 66.0 54.5 63.8 67.1 67.7 50.3 74.3 80.9 65.2

Semi-supervised
Taghipour and Ng (2015) – 68.2 – – – – – – – –
Yuan et al. (2016) 73.8 71.8 63.5 69.5 72.6 72.8 62.2 81.3 85.7 72.6

Knowledge-based
Agirre et al. (2014) 59.7 57.9 41.7 – – – – – – –
Basile et al. (2014) – – – 71.5 – – – – – –
Moro et al. (2014) – 68.3 62.7 69.2 – – – – – –
Dongsuk et al. (2018) – – 56.1 75.0 65.8 – – – – –
Chaplot and Salakhutdinov (2018) 69.0 66.9 55.6 65.3 69.6 69.7 51.2 76.0 80.9 66.9
Scarlini et al. (2020a, 2020b) 80.6 70.3 73.6 74.8 80.2 – – – – 75.9

Supervised
Melamud et al. (2016) 72.3 68.2 61.5 67.2 71.7 – – – – –
Yuan et al. (2016) 73.6 69.2 64.2 67.0 72.1 71.3 64.2 81.3 84.5 72.1
Raganato, Bovi et al. (2017) and Raganato, Camacho-Collados et al. (2017) 72.0 70.2 64.8 66.9 72.4 71.5 57.5 75.0 83.8 69.9
Kumar et al. (2019) 73.8 71.1 67.3 69.4 74.5 74.0 60.2 78.0 82.1 71.8
Vial et al. (2019) 79.7 77.8 73.4 78.7 82.6 81.4 68.7 83.7 85.5 79.0
Huang et al. (2019) 77.7 75.2 72.5 76.1 80.4 79.3 66.9 78.2 86.4 77.0
Scarlini et al. (2020a, 2020b) 83.7 79.7 79.9 78.7 80.2 – – – – 80.4
Bevilacqua and Navigli (2020) 80.8 79.0 75.2 80.7 81.8 82.9 69.4 83.6 87.3 80.1
Conia and Navigli (2021) 80.4 77.8 76.2 81.8 83.3 82.9 70.3 83.4 85.5 80.2

This work 78.3 76.1 78.3 84.1 78.9 84.6 73.7 74.8 75.3 79.5

5. Conclusion

In this work, we presented a metaheuristic WSD system that uses a neural surrogate function in the simulated annealing algorithm. To the best of our knowledge, this is the first time that a metaheuristic with a surrogate function has been used in the WSD task. In the simulated annealing algorithm, we used the underlying Pareto sense distribution extracted from the training data to guide the search. We also presented a method that facilitates simultaneous training of word and sense embeddings, which are then used in the training of the neural surrogate function. This allows different knowledge sources to be used together to train two different sets of embeddings at the same time, each affected by the other during training. In addition, we presented a new approach to cluster and map WordNet senses in order to provide more training instances, especially for low-frequency senses in the data. Through experimental results, we showed that our proposed method outperforms the state of the art in WSD on verbs and nouns. However, due to the low connectivity of adjectives and adverbs in the WordNet graph, our model struggles to perform well on these two categories of words. Despite this weakness, our model yields the best accuracy on the SemEval-2013 dataset and achieves competitive results on the other four evaluation datasets, with its accuracy on the concatenation of all five datasets being only 0.9 percent lower than that of the best available system.

For future work, we aim to investigate the performance of our system using different knowledge sources, most importantly BabelNet. We believe that higher levels of connectivity among nodes in the graph can benefit the performance of our proposed system, especially on adjectives and adverbs.

CRediT authorship contribution statement

Azim Keshavarzian Nodehi: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing. Nasrollah Moghadam Charkari: Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding Sources

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Appendix A

In this appendix, we present all of the hyperparameters in the proposed method and their values, separated by module, in Table 6. The values for the parameters p and q in the node2vec algorithm are the same as those investigated in the original paper by Grover and Leskovec (2016), where the algorithm was introduced. The walk length is equal to the average length of the paragraphs sampled from Wikipedia, to ensure the balance between words and senses in the data. The number of walks per node is equal to the number of neighbors of that node, so that each adjacent node has an equal probability of being visited on the first step, since the first step of the walk is taken uniformly at random. The set of tested values for the maximum number of senses for each word was centered around 12, as the first 12 senses in WordNet account for 96 percent of the instances in the training data. Also, we chose not to include all senses due to the exponential increase in the complexity of the search. The initial temperature in the simulated annealing algorithm was set to 1, following the example of Cowie et al. (1992), where the algorithm was first used in the word sense disambiguation problem.
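As an illustration of these settings, the following is a minimal sketch of the walk-generation step; with p = q = 1 the node2vec walk reduces to an unbiased random walk, and the graph object here (a mapping from each node, word or sense, to the list of its neighbors) is a placeholder rather than the actual WordNet graph construction.

```python
# Sketch: node2vec-style walks under the settings above (p = q = 1, walk length 80,
# one walk started per neighbor of each node). `graph` is an assumed adjacency dict;
# building it from WordNet is outside the scope of this sketch.
import random

WALK_LENGTH = 80  # roughly the average length of the sampled Wikipedia paragraphs

def generate_walks(graph, walk_length=WALK_LENGTH):
    walks = []
    for node, neighbors in graph.items():
        # One walk per adjacent node, so that each neighbor is equally likely
        # to be visited on the (uniformly random) first step.
        for _ in range(len(neighbors)):
            walk = [node]
            current = node
            for _ in range(walk_length - 1):
                adjacent = graph[current]
                if not adjacent:                    # dead end: stop this walk early
                    break
                current = random.choice(adjacent)   # p = q = 1: unbiased step
                walk.append(current)
            walks.append(walk)
    return walks
```

The resulting walks can then be mixed with the Wikipedia paragraphs and passed to word2vec (for example, gensim's Word2Vec with vector_size=512 and window=5), so that word and sense embeddings are trained together.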

Table 6
The hyperparameters of the proposed method and their values separated by module.

Module | Hyperparameter | Values Tested | Value
Node2vec | p | 0.25, 0.5, 1, 2, 4 | 1
Node2vec | q | 0.25, 0.5, 1, 2, 4 | 1
Node2vec | Walk length | 80 | 80
Node2vec | Number of walks per node | num(adj(node)) | num(adj(node))
Word2vec | Window size | 5, 7, 9 | 5
Word2vec | Embedding dimensions | 128, 256, 512 | 512
Training Sample Generation | Window size | 8, 16, 32, 64 | 32
Training Sample Generation | Maximum number of senses for each word | 10, 11, 12, 13, 14 | 12
Training Sample Generation | Number of negative samples generated for each positive sequence | 5, 10, 20, 30 | 10
Neural network (Structure) | Input sequence length | 8, 16, 32, 64 | 32
Neural network (Structure) | Embedding dimensions | 128, 256, 512 | 512
Neural network (Structure) | LSTM units | 32, 64, 128, 256 | 128
Neural network (Structure) | dense_1 | 16, 32, 64, 128 | 64
Neural network (Structure) | dense_2 | 1 | 1
Neural network (Training) | Learning rate | 0.001, 0.0001, 0.00001 | 0.0001
Neural network (Training) | Epochs | 1, 2, 3 | 1
Neural network (Training) | Batch size | 128, 256, 512 | 512
Simulated Annealing | Initial temperature | 1 | 1
Simulated Annealing | Annealing rate (applied after each iteration) | 0.999–0.9999 | 0.9995
Simulated Annealing | Number of iterations | 100, 500, 1000, 2500, 5000, 7500, 10000 | 5000
Simulated Annealing | Maximum number of senses for each word | 10, 11, 12, 13, 14 | 12
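To make the role of the simulated annealing hyperparameters in Table 6 concrete, the following is a minimal sketch of an annealing loop with an initial temperature of 1, a multiplicative annealing rate of 0.9995 applied after each iteration, and 5000 iterations. The score function stands in for the neural surrogate fitness function and propose for the sense-distribution-guided move; both are hypothetical placeholders rather than the exact implementation.

```python
# Sketch: simulated annealing with the hyperparameter values listed in Table 6.
# `score(assignment)` and `propose(assignment)` are placeholders for the neural
# surrogate fitness function and the guided neighbor proposal, respectively.
import math
import random

def simulated_annealing(initial_assignment, score, propose,
                        t_init=1.0, anneal_rate=0.9995, iterations=5000):
    current, current_score = initial_assignment, score(initial_assignment)
    best, best_score = current, current_score
    temperature = t_init
    for _ in range(iterations):
        candidate = propose(current)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse candidates with a probability
        # that shrinks as the temperature is annealed after every iteration.
        if delta >= 0 or random.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature *= anneal_rate
    return best, best_score
```

After 5000 iterations the temperature has decayed to roughly 0.9995^5000 ≈ 0.08, so late moves that noticeably worsen the surrogate score are accepted only rarely.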

References

Agirre, E., de Lacalle, Oier Lopez, & Soroa, A. (2014). Random walks for knowledge-based word sense disambiguation. Association for Computational Linguistics, 40(1), 57–84.
Banerjee, S., & Pedersen, T. (2013). Extended gloss overlaps as a measure of semantic relatedness. In International joint conference on artificial intelligence (IJCAI).
Basile, P., Caputo, A., & Semeraro, G. (2014). An enhanced lesk word sense disambiguation algorithm through a distributional semantic model. In Proceedings of the 25th international conference on computational linguistics: Technical papers (pp. 1591–1600).
Bevilacqua, M., & Navigli, R. (2020). Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 2854–2864).
Bevilacqua, M., Pasini, T., Raganato, A., & Navigli, R. (2021). Recent trends in word sense disambiguation: A survey. In Proceedings of the thirtieth international joint conference on artificial intelligence (pp. 4330–4338).
Brownlee, A., Woodward, J., & Swan, J. (2015). Metaheuristic design pattern: Surrogate fitness functions. In Proceedings of the companion publication of the annual conference on genetic and evolutionary computation (pp. 1261–1264).
Chaplot, D. S., & Salakhutdinov, R. (2018). Knowledge-based word sense disambiguation using topic models. In Proceedings of the 30th innovative applications of artificial intelligence conference (pp. 5062–5069).
Conia, S., & Navigli, R. (2021). Framing word sense disambiguation as a multi-label problem for model-agnostic knowledge integration. In Proceedings of the 16th conference of the European chapter of the association for computational linguistics (pp. 3269–3275).
Cowie, J., Guthrie, J., & Guthrie, L. (1992). Lexical disambiguation using simulated annealing. In Proceedings of the workshop on speech and natural language (pp. 238–242).
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the conference of the North American chapter of the association for computational linguistics: Human language technologies, Volume 1 (pp. 4171–4186).
Dongsuk, O., Kwon, S., Kim, K., & Ko, Y. (2018). Word sense disambiguation based on word similarity calculation using word vector representation from a knowledge-based graph. In Proceedings of the association for computational linguistics (pp. 2704–2714).
Edmonds, P., & Cotton, S. (2001). [dataset] SENSEVAL-2: Overview. In Proceedings of the second international workshop on evaluating word sense disambiguation systems.
Gale, W. A., Church, K., & Yarowsky, D. (1992). A method for disambiguating word senses in a corpus. In Computers and the humanities, Vol. 26 (pp. 415–439).
Grover, A., & Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the international conference on knowledge discovery and data mining (pp. 855–864).
Huang, L., Sun, C., Qiu, X., & Huang, X. (2019). Glossbert: Bert for word sense disambiguation with gloss knowledge. In Proceedings of the conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (pp. 3500–3505).
Kageback, M., & Salomonsson, H. (2013). Word sense disambiguation using a bidirectional LSTM. In The 5th workshop on cognitive aspects of the lexicon (CogALex).
Kumar, S., Jat, S., Saxena, K., & Talukdar, P. (2019). Zero-shot word sense disambiguation using sense definition embeddings. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 5670–5681).
Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st international conference on machine learning.
Lee, Y. K., & Ng, H. T. (2002). An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 41–48).
Lesk, M. (1986). Automatic sense disambiguation using machine-readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th special interest group on design of communication (SIGDOC) (pp. 24–26).
Manning, C., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts, USA: MIT Press.
Melamud, O., Goldberger, J., & Dagan, I. (2016). context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL conference on computational natural language learning (pp. 51–61).
Mihalcea, R. (2004). Co-training and self-training for word sense disambiguation. In Proceedings of the 8th conference on computational natural language learning (CoNLL) (pp. 33–40).
Mihalcea, R. (2007). Using wikipedia for automatic word sense disambiguation. In Proceedings of the North American chapter of the association for computational linguistics (pp. 196–203).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Miller, G. A. (1995). Wordnet: A lexical database for english. Communications of the ACM, 38(11), 39–41.
Miller, G. A., Leacock, C., Tengi, R., & Bunker, R. T. (1993). [dataset] A semantic concordance. In Proceedings of the ARPA workshop on human language technology (pp. 303–308).
Moro, A., & Navigli, R. (2015). [dataset] SemEval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th international workshop on semantic evaluation.
Moro, A., Raganato, A., & Navigli, R. (2014). Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics, 2, 231–244.
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys, 41(2), 1–69.
Navigli, R. (2018). Natural language understanding: Instructions for (Present and Future) use. In Proceedings of the twenty-seventh international joint conference on artificial intelligence (pp. 5697–5702).
Navigli, R., Jurgens, D., & Vannella, D. (2013). [dataset] SemEval-2013 task 12: Multilingual word sense disambiguation. In Proceedings of the 7th international workshop on semantic evaluation.
Navigli, R., & Ponzetto, S. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217–250.
Niu, Z. Y., Ji, D. H., & Tan, C. L. (2005). Word sense disambiguation using label propagation based semi-supervised learning. In Proceedings of the 43rd annual meeting of the association for computational linguistics (pp. 395–402).
Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 701–710).
Pradhan, S., Loper, E., Dligach, D., & Palmer, M. (2007). [dataset] SemEval-2007 task-17: English lexical sample, SRL and all-words. In Proceedings of the 4th international workshop on semantic evaluations.
Raganato, A., Bovi, C. D., & Navigli, R. (2017). Neural sequence learning models for word sense disambiguation. In Proceedings of the conference on empirical methods in natural language processing (pp. 1156–1167).
Raganato, A., Camacho-Collados, J., & Navigli, R. (2017). Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th conference of the European chapter of the association for computational linguistics, Volume 1 (pp. 99–110).
Scarlini, B., Pasini, T., & Navigli, R. (2020a). SensEmBERT: Context-enhanced sense embeddings for multilingual word sense disambiguation. In Proceedings of the 34th AAAI conference on artificial intelligence.
Scarlini, B., Pasini, T., & Navigli, R. (2020b). With more contexts comes better performance: Contextualized sense embeddings for all-round word sense disambiguation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP).
Snyder, B., & Palmer, M. (2004). [dataset] The english all-words task. In Proceedings of the third international workshop on the evaluation of systems for the semantic analysis of text.
Taghipour, K., & Ng, H. T. (2015). One million sense-tagged instances for word sense disambiguation and induction. In Proceedings of the 19th conference on computational natural language learning (pp. 338–344).
Vial, L., Benjamin, L., & Didier, S. (2019). Sense vocabulary compression through the semantic knowledge of wordnet for neural word sense disambiguation. In Proceedings of the 10th global wordnet conference (GWC).
Yuan, D., Richardson, J., Doherty, R., Evans, C., & Altendorf, E. (2016). Semi-supervised word sense disambiguation with neural models. In Proceedings of the 26th international conference on computational linguistics (COLING) (pp. 1374–1385).
