
Progress in Artificial Intelligence
ISSN 2192-6352
https://doi.org/10.1007/s13748-019-00180-4

REGULAR PAPER

Semantic textual similarity between sentences using bilingual word semantics

Md. Shajalal¹ · Masaki Aono²

¹ Department of Computer Science and Mathematics, Bangladesh Agricultural University, 2202 Mymensingh, Bangladesh
² Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Aichi, Japan
Correspondence: Md. Shajalal, [email protected]; Masaki Aono, [email protected]

Received: 4 December 2018 / Accepted: 2 March 2019
© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract
Semantic textual similarity between sentences is indispensable for many information retrieval tasks. Traditional lexical similarity measures cannot compute the similarity beyond a trivial level. Moreover, they can only capture textual similarity, not semantic similarity. In this paper, we propose a method for semantic textual similarity that leverages bilingual word-level semantics to compute the semantic similarity between sentences. To capture word-level semantics, we employ distributed representations of words in two different languages. A similarity function based on the concept-to-concept relationship corresponding to the words is also utilized for the same purpose. Multiple new semantic similarity measures are introduced based on word-embedding models trained on two different corpora in two different languages. Apart from these, another new semantic similarity measure is introduced using word sense comparison. The similarity score between the sentences is then computed by applying a linear ranking approach to all proposed measures, with their importance scores estimated by a supervised feature selection technique. We conducted experiments on the SemEval Semantic Textual Similarity (STS-2017) test collections. The experimental results demonstrate that our method is effective for measuring semantic textual similarity and outperforms some known related methods.

Keywords Semantic similarity · Word semantics · Word-embedding · Textual similarity · Bilingual semantics

1 Introduction

Semantic textual similarity between sentences is beneficial and mandatory for many information retrieval (IR) tasks. The vector space model in IR is the earliest application of textual similarity. The model retrieves the most relevant documents for a given query from a certain collection using the text similarity between the query and the documents. Textual similarity is used in some other applications such as web search, subtopic mining, word sense disambiguation (WSD), relevance feedback, text classification and so on [20,23,26]. There are also some natural language processing (NLP) applications where text similarity is employed widely, including text summarization, machine translation, paraphrase detection, sentiment analysis, etc. [3].

The most typical technique for computing the textual similarity between two text segments is lexical matching, which produces the similarity score by considering the number of lexical items (words/phrases) that exist in both input segments [17]. There are some other techniques to improve the similarity measure by canonicalizing the text using stemming, stopword removal, part-of-speech (POS) tagging, longest common subsequence matching, etc. [24]. But these types of similarity measures are not able to capture the similarity beyond a trivial level. Furthermore, these similarity measures can only capture textual similarity and cannot always identify semantic similarity. For example, consider two sentences "I own a cat" and "I have a pet." There is an obvious semantic similarity between these two sentences, but the conventional textual similarity measures are not able to capture any kind of semantic connection between the given sentences. On the contrary, consider two sentences "How are you?" and "How old are you?": although there is no semantic similarity between these two sentences, lexical matching will decide that they are 75% similar in terms of word overlap.

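For illustration, this word-overlap figure can be reproduced with a minimal sketch; the tokenization and the exact overlap formula (shared words over the size of the longer sentence) are our assumptions, not the baseline implementation evaluated later.

```python
# Illustrative word-overlap similarity; tokenization and the overlap formula
# (shared words divided by the size of the longer sentence) are assumptions.
def word_overlap(s1: str, s2: str) -> float:
    t1 = set(s1.lower().rstrip("?.!").split())
    t2 = set(s2.lower().rstrip("?.!").split())
    return len(t1 & t2) / max(len(t1), len(t2))

print(word_overlap("How are you?", "How old are you?"))  # 0.75
```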

Some example sentence pairs from STS-2017 [9] with their corresponding semantic similarity scores ranging over [0,5] are given in Table 1. The larger the score, the more similar the sentences are.

Table 1 The similarity scores for different sentence pairs ranging [0,5] [9]

The two sentences are completely equivalent, as they mean the same thing (score 5.00)
  Sentence 1: The bird is bathing in the sink
  Sentence 2: Birdie is washing itself in the water basin
The two sentences are mostly equivalent, but some unimportant details differ (score 4.00)
  Sentence 1: In May 2010, the troops attempted to invade Kabul
  Sentence 2: The US army invaded Kabul on May 7, 2010
The two sentences are roughly equivalent, but some important information differs/is missing (score 3.00)
  Sentence 1: John said he is considered a witness but not a suspect
  Sentence 2: "He is not a suspect anymore," John said
The two sentences are not equivalent, but share some details (score 2.00)
  Sentence 1: They flew out of the nest in groups
  Sentence 2: They flew into the nest together
The two sentences are not equivalent, but are on the same topic (score 1.00)
  Sentence 1: The woman is playing the violin
  Sentence 2: The young lady enjoys listening to the guitar
The two sentences are completely dissimilar (score 0.00)
  Sentence 1: John went horse back riding at dawn with a whole group of friends
  Sentence 2: Sunrise at dawn is a magnificent view to take in if you wake up early enough for it

Recently, semantic textual similarity (STS) has gained much attention in the research community [1,2,9,24,25,28]. SemEval (Semantic Evaluation, https://en.wikipedia.org/wiki/SemEval) has organized several multilingual and cross-language tasks on STS in the recent past [1,2,9]. Researchers have proposed different methods to compute the similarity based on different techniques and resources [1,2,9,24,25,28].

This paper presents an effective method for measuring semantic textual similarity which uses bilingual word-level semantics to estimate sentence-level semantic similarity. To capture word-level semantics, we utilize vector representations of words (word-embedding) and WordNet. Multiple new semantic similarity measures are proposed based on the word-embedding models. We leverage multiple pretrained embedding models which are trained on three different corpora in two different languages. In these similarity measures, we exploit the word-level similarity between words from the two corresponding sentences only when the words belong to the same class (POS tag). Apart from these, we also introduce a new similarity measure by utilizing the word sense from WordNet. To estimate the importance of each measure, a supervised linear regression model is used. Finally, a linear weighted ranking is applied to all measures with their importance scores. The experimental results on the SemEval STS 2017 test collections demonstrate that our proposed method is effective and outperforms some known related works.

The rest of the paper is structured as follows: Section 2 summarizes the related work on semantic textual similarity. In Sect. 3, we present our proposed method to estimate semantic similarity by addressing the challenges. We present the experiments and evaluation results to show the effectiveness of our proposed method in Sect. 4. Some concluding remarks and future directions are described in Sect. 5.

2 Related work

SemEval has organized several multilingual and cross-language tasks on STS in the recent past [1,2,9]. The participating methods in the SemEval STS tasks estimate the similarity by introducing numerous features employing multiple resources [4,25]. They identified and applied some rules to handle challenges like currency values, negation, compounds, number overlap and literal matching [1,2,9]. To extract features, they leveraged multiple knowledge bases such as WordNet and Wikipedia. They also used some tools to amplify the performance, including a named entity recognizer (NER), dependency parser, stemmer, lemmatizer, part-of-speech (POS) tagger, stopword list, etc. Mihalcea et al. [24] suggested different types of corpus-based and knowledge-based similarity measures. Their method utilized the word-level similarity from the corresponding texts. Hassanzadeh et al. [15] introduced multiple syntactic, semantic and structural features based on the content information of the text segments and external resources to capture the similarity. Kozareva et al. [19] introduced an answer validation system by adopting a machine-learning-based textual entailment or similarity system using multilingual semantics; the covered languages include English, Dutch, French, German, Spanish, Italian and Portuguese.

Fig. 1 Semantic textual similarity estimation framework. [Figure: pipeline from an input sentence pair through (i) preprocessing, (ii) similarity measures (WordNet-based, word-embedding-based and traditional similarity) and (iii) similarity score estimation (ElasticNet regularization and weighted linear ranking), using Wikipedia (English & Bengali) and the Google News Corpus as resources.]

The semantic information from some external structured knowledge bases such as Wikipedia and WordNet has been employed to estimate the similarity. In some prior works [11,12,22], the proposed methods identified the semantic meaning of words from WordNet and applied that semantic information to compute the similarity between texts. Researchers have also proposed corpus-based methods combined with WordNet-based measures [21,24]. In [24], they introduced an approach where the individual weight of each word, estimated using a large corpus, is exploited together with the similarity score derived from WordNet. They applied the similarities between words of the same class in the two sentences. The average of the maximum similarities is then used as the final similarity score. Li et al. [21] combined the word order score with a WordNet-based measure to calculate the sentence similarity. Recently, researchers have tried word-embedding-based techniques, which are also used for semantic similarity [14,18].

3 Our approach

This section presents our proposed method to compute the semantic textual similarity between two sentences. The high-level building blocks of our method are depicted in Fig. 1; it can be divided into three phases, namely (1) preprocessing, (2) similarity measures and (3) similarity score estimation.

In the preprocessing phase, each sentence is canonicalized into a set of words after filtering the stopwords out. Lemmatization is also applied to each word to convert it into its base form. Then, we introduce multiple new semantic similarity measures and discuss how they estimate the similarity of a sentence pair in the similarity measure phase. Wikipedia (both Bengali and English) and the Google News Corpus are used as the resources in this phase. We also investigate the performance of classical traditional similarity measures to compare their performance with the newly introduced ones. Finally, the similarity score is computed by leveraging all extracted features with their importance scores estimated using a supervised feature selection technique.

3.1 Preprocessing

Given a sentence pair, at first, the punctuation marks are removed from each sentence. Then, the stopwords (words that have little or no semantic meaning associated) are filtered out from the sentence by using Indri's stopword collection (http://www.lemurproject.org/stopwords/stoplist.dft). We also apply the NLTK WordNet lemmatizer to the words, which converts them into their base forms. For example, consider an example sentence "A girl is on a sandy beach". The outcome of the preprocessing phase is the set of words S = {girl, sandy, beach}.

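A minimal sketch of this preprocessing step, assuming NLTK is available; the small stopword set below is only a stand-in for the Indri stoplist file referenced above.

```python
import string
from nltk.stem import WordNetLemmatizer  # requires the NLTK 'wordnet' corpus

# Small stand-in for the Indri stoplist (the real list is read from stoplist.dft).
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "and", "to"}

def preprocess(sentence: str) -> set:
    """Remove punctuation, filter stopwords and lemmatize the remaining words."""
    lemmatizer = WordNetLemmatizer()
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation)).lower()
    return {lemmatizer.lemmatize(tok) for tok in cleaned.split() if tok not in STOPWORDS}

print(preprocess("A girl is on a sandy beach."))  # {'girl', 'sandy', 'beach'}
```
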
3.2 Similarity measures

Let S1 and S2 be the two sets of words for two corresponding sentences. We propose multiple new similarity measures, with the help of WordNet and word-embedding, that exploit the words' semantics to estimate the sentence-level similarity. In this regard, we use word-level semantics from WordNet senses. We also employ multiple pretrained word-embedding models in two different languages to estimate the similarity of a sentence pair using the vector representation of words. Here, we try to capture word-level semantics from the distributed representation of words in multiple text corpora. The details of each similarity measure are described below.

3.2.1 Similarity based on WordNet

We utilize word-level semantic information from WordNet to compute sentence-level semantic similarity. The similarities in WordNet are defined between concepts, rather than words. We leverage those concepts to estimate word-to-word similarity for two words corresponding to two sentences.

Similarity based on WordNet (WN_sim): Our proposed similarity measure based on WordNet is defined as follows:

$$WN_{sim}(S_1, S_2) = \frac{\sum_{w \in S_1} \max_{v \in S_2} \bigl(sim_{lch}(w, v)\bigr) \cdot weight(w)}{\sum_{w \in S_1} weight(w)} \qquad (1)$$

Table 2 Basic notation used in Algorithm 1

  Symbol                    Description
  S_terms[]                 List of words after processing sentence S
  AVS                       N-dimensional average feature vector for sentence S
  tc_S                      Total number of words of S contained in the vocabulary of w2v_model
  w2v_model                 Trained word-embedding model
  vocab(w2v_model)          Vocabulary of w2v_model
  add(t, AVS, w2v_model)    Adds the N-dimensional vector of term t to AVS
  divide(AVS, tc_S)         Divides each value of AVS by tc_S
  μ                         Mean of the elements of the average vector AVS
  σ                         Standard deviation of the elements of vector AVS

where $\max_{v \in S_2}(sim_{lch}(w, v))$ returns the maximum similarity score of word w in sentence S1 with any of the words v in sentence S2. The function weight(w) denotes the IDF of w. The similarity function sim_lch(w, v) is the Leacock & Chodorow similarity, defined as follows:

$$sim_{lch}(w, v) = -\log \frac{length(w, v)}{2 \cdot D}$$

where length(w, v) is the shortest distance from concept w to concept v, and the maximum depth of the taxonomy is denoted by D.

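A sketch of WN_sim on top of NLTK's WordNet interface; taking only the first noun synset of each word and falling back to a default IDF weight are simplifying assumptions on our part.

```python
from nltk.corpus import wordnet as wn  # requires the NLTK 'wordnet' corpus

def lch(w, v, pos=wn.NOUN):
    """Leacock-Chodorow similarity between the first synsets of w and v (same POS)."""
    sw, sv = wn.synsets(w, pos=pos), wn.synsets(v, pos=pos)
    return sw[0].lch_similarity(sv[0]) if sw and sv else 0.0

def wn_sim(s1_words, s2_words, idf):
    """Eq. (1): IDF-weighted average of each word's best LCH match in the other sentence."""
    num = sum(max(lch(w, v) for v in s2_words) * idf.get(w, 1.0) for w in s1_words)
    den = sum(idf.get(w, 1.0) for w in s1_words)
    return num / den if den else 0.0

idf = {"cat": 2.1, "pet": 1.8}  # hypothetical IDF weights (the paper estimates IDF from ClueWeb09)
print(wn_sim({"cat"}, {"pet"}, idf))
```
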
3.2.2 Similarity based on word-embedding

Word-embedding represents each word as a vector in a high-dimensional space. The embedding space can be used to extract the semantic information of words. Therefore, pretrained word-embedding models are used in this research to introduce two new similarity measures that capture word-level semantics.

Average Pairwise Similarity (APS_sim): In our first word-embedding-based similarity measure, the similarity is estimated between two words within the same class. The words are classified considering their part-of-speech (POS) tags. If two words have the same POS tag, they are considered to be in the same class. In other words, the similarity is estimated if and only if two words from the two sentences have the same POS tag. The average pairwise similarity APS_sim is defined as follows:

$$APS_{sim}(S_1, S_2) = \frac{\sum_{w \in S_1} \max_{v \in S_2} \bigl(sim(w, tag_w, v, wv_{model})\bigr)}{|S_1|} \qquad (2)$$

where $\max_{v \in S_2}(sim(w, tag_w, v, wv_{model}))$ returns the maximum similarity, as described for Eq. 1. The function sim(w, tag_w, v, wv_model) denotes the similarity based on the word-embedding model wv_model between w and v, which must share the same POS tag; tag_w denotes the POS of w, which must be the same as the POS of v.

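A sketch of APS_sim (Eq. 2) using gensim and the NLTK POS tagger (the tools listed in Sect. 4); tagging words in isolation and the model path are our simplifying assumptions.

```python
import nltk  # requires the 'averaged_perceptron_tagger' resource
from gensim.models import KeyedVectors

# Hypothetical path; any pretrained word2vec model in this format works.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def aps_sim(s1_words, s2_words):
    """Eq. (2): average of each word's best same-POS cosine similarity in the other sentence."""
    tags1, tags2 = dict(nltk.pos_tag(list(s1_words))), dict(nltk.pos_tag(list(s2_words)))
    scores = []
    for w in s1_words:
        candidates = [wv.similarity(w, v) for v in s2_words
                      if tags1[w] == tags2[v] and w in wv and v in wv]
        scores.append(max(candidates) if candidates else 0.0)
    return sum(scores) / len(s1_words) if s1_words else 0.0

print(aps_sim({"girl", "beach"}, {"boy", "shore"}))
```
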
tor for each term t belongs to S_ter ms[ ]. The vectors are
where Max(simv∈S2 (w, tagw , v, wvmodel )) returns the max- computed only for words those belong to the vocabulary of
imum similarity as described in Eq. 1. The function sim the Word2Vec model, vocab(w2v_model). The feature vec-
v∈S2
(w, tagw , v, wvmodel ) denotes the similarity based on word- tors for each word belongs to a particular sentence are then
embedding model wvmodel between w and v which share the added.

123
Author's personal copy
Progress in Artificial Intelligence

This addition is done in the loop that starts in step 4 and ends in step 8. The vector after the addition is stored in AVS for the corresponding sentence. Each value in AVS is then divided by the total number of words tc_S for the respective sentence S; the division is done in step 10. Averaging the vectors may smooth over differences between them, which may or may not be important. Therefore, we normalize each of the elements of vector AVS to a common scale; the normalization is done in step 11. The following Z-score normalization technique is applied for this purpose:

$$x' = \frac{x - \mu}{\sigma}$$

where the normalized value and the original value are indicated by x' and x, respectively, and μ and σ denote the mean and standard deviation of the vector's elements, respectively. Then, we apply the cosine similarity to compute the similarity score between S1 and S2.

Let AVS1 and AVS2 be the two normalized average feature vectors calculated by Algorithm 1 for the two corresponding sentences S1 and S2. The average feature vector-based similarity (AFS_sim) measure is defined as follows:

$$AFS_{sim}(S1, S2) = \frac{AVS1 \cdot AVS2}{\lVert AVS1 \rVert \cdot \lVert AVS2 \rVert} = \frac{\sum_{i=0}^{N} AVS1_i \cdot AVS2_i}{\sqrt{\sum_{i=0}^{N} AVS1_i^2}\,\sqrt{\sum_{i=0}^{N} AVS2_i^2}} \qquad (3)$$

where AVS1_i and AVS2_i denote the i-th feature value of vectors AVS1 and AVS2, respectively, and the dimension of the vector AVS is denoted by N. To investigate the performance of traditional textual similarity functions, we also employ some similarities including edit-distance-based lexical similarity, Jaccard similarity, etc.

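A sketch of Algorithm 1 and Eq. (3) with NumPy and gensim; the per-vector μ and σ follow step 11, while the model path is a placeholder assumption.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical model path, as in the earlier sketch.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def avsc(terms):
    """Algorithm 1: average the in-vocabulary word vectors, then z-normalize."""
    vecs = [wv[t] for t in terms if t in wv]
    if not vecs:
        return np.zeros(wv.vector_size)
    avs = np.mean(vecs, axis=0)            # steps 4-10: sum and divide by tc_S
    return (avs - avs.mean()) / avs.std()  # step 11: x' = (x - mu) / sigma

def afs_sim(terms1, terms2):
    """Eq. (3): cosine similarity of the two normalized average vectors."""
    a, b = avsc(terms1), avsc(terms2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(afs_sim({"girl", "sandy", "beach"}, {"boy", "rocky", "shore"}))
```
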
3.3 Similarity score estimation

The final similarity score is computed using our proposed semantic similarity measures (described in the previous section) and some common traditional similarity measures. The similarity scores from these measures vary widely. Moreover, the different individual measures contribute at different levels when computing the sentence similarity. That is why we estimate the measures' importance using a supervised feature selection technique. Here, we treat the different similarity measures as the feature set. We divide this step into two phases: (i) importance estimation and (ii) linear ranking. The remainder of this section presents the details of these two phases.

3.3.1 Importance estimation

The ElasticNet regularization model [29] is a supervised feature selection and importance estimation technique applied to compute the importance of every similarity measure. The extracted features may contain some noisy and redundant features. Those features may not contribute to the accuracy of the predictive model or may even decrease the accuracy of the model. We employ a supervised feature selection technique, ElasticNet, to estimate the importance of each measure.

The ElasticNet is a regularized regression method that linearly combines the l1 and l2 penalties of the lasso and ridge methods. The elastic net [13,29] can be defined as the combination of the Lasso [27] and ridge regression [16] as follows:

$$\min_{\beta_0, \beta \in \mathbb{R}^{p+1}} \left[ \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^{\top}\beta)^2 + \lambda P_\alpha(\beta) \right] \qquad (4)$$

where

$$P_\alpha(\beta) = (1-\alpha)\,\frac{1}{2}\lVert \beta \rVert_{\ell_2}^2 + \alpha \lVert \beta \rVert_{\ell_1} = \sum_{j=1}^{p} \left[ \frac{1}{2}(1-\alpha)\beta_j^2 + \alpha \lvert \beta_j \rvert \right]$$

where y_i denotes the response of observation i and x_i represents the data. n, p and β are the sample size, the dimension of the feature space and the parameters of the linear regression, respectively. For α = 0 and α = 1, the elastic net reduces to ridge regression and the Lasso, respectively. Due to the smoothness of the l2 norm, ridge regression always keeps all the explanatory variables in the model. On the other hand, the Lasso provides a compact representation of the feature space because of the sharp edges of the l1 constraint [13].

However, the Lasso has some limitations compared to the elastic net. One of them is that the number of features selected by the Lasso may not exceed the sample size when p > n. Another limitation is that when a group of features is highly correlated, the Lasso does not make a good selection. We, therefore, make use of the elastic net to alleviate these limitations.

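The paper fits this model with the R package glmnet (Sect. 4.3); an equivalent scikit-learn sketch is given below, with random data standing in for the real training features. The feature layout and parameter grid are assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# X: one row per training sentence pair, one column per similarity measure
# (WN_sim, APS/AFS variants, traditional measures); y: gold similarity scores.
# Random data stands in for the real STS training features.
rng = np.random.default_rng(0)
X, y = rng.random((200, 7)), rng.random(200) * 5

# Fivefold CV over the l1/l2 mixing parameter, mirroring the glmnet setup
# (glmnet's alpha corresponds to l1_ratio here).
model = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9], cv=5).fit(X, y)
print(model.coef_)  # per-measure importance scores w_i used in Eq. (5)
```
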
3.3.2 Linear ranking

Finally, we employ a linear ranking approach using all similarity measures as well as their weights to estimate the final similarity score between S1 and S2 as follows:

$$sim(S1, S2) = \frac{\sum_{i=1}^{T} w_i \cdot SM_i(S1, S2)}{\sum_{i=1}^{T} w_i} \qquad (5)$$

where SM_i(S1, S2) denotes the i-th similarity measure and w_i is the importance score of the i-th corresponding measure estimated by Eq. 4.

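A small sketch of Eq. (5); the example scores and weights are hypothetical.

```python
def linear_ranking(measure_scores, weights):
    """Eq. (5): importance-weighted average of the individual similarity scores."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, measure_scores)) / total if total else 0.0

scores  = [0.82, 0.74, 0.69]   # e.g. WN_sim, APS_sim, AFS_sim (hypothetical values)
weights = [0.19, 0.12, 0.10]   # importance scores from the ElasticNet fit
print(linear_ranking(scores, weights))
```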

Table 3 Overview of different word-embedding models

  Model        Dimension   Vocabulary size   Corpus               Language
  W2V_GN       300         3,000,000         Google News Corpus   English
  W2V_Wiki_E   200         71,291            Wikipedia            English
  W2V_Wiki_B   200         77,076            Wikipedia            Bengali

4 Experiments and evaluation

We carried out experiments with a wide range of experimental settings on the STS-2017 dataset and validated the performance of our method with a standard evaluation metric used in the SemEval STS task. The remainder of this section presents the details of the dataset collection, the evaluation metric, the experimental setup and a comparative discussion of the performance of our method against some known related works.

4.1 Dataset collection

To test the performance of our proposed method, we carried out experiments on a benchmark dataset for semantic textual similarity. The SemEval Semantic Textual Similarity 2017 [9] (STS2017, http://alt.qcri.org/semeval2017/task1/) task provided a dataset of 250 pairs of sentences. The STS2017 organizers provided the similarity score per sentence pair, calculated from human assessors' judgments. We employed their provided gold-standard judgment as the ground truth in this research. The human assessors assigned the similarity scores using the following similarity labels ranging over [0, 5]:

• Label 0: On different topics
• Label 1: Not similar but share few common details
• Label 2: Not similar but share some common details
• Label 3: Roughly similar
• Label 4: Similar
• Label 5: Completely similar

The distribution of the similarity labels after the annotation is reported in [9]. The human assessors were instructed to assign the labels as follows [9]:

1. Assign labels as precisely as possible according to the underlying meaning of the two sentences rather than their superficial similarities or differences.
2. Be careful of wording differences that have an impact on what is being said or described.
3. Ignore grammatical errors and awkward wording as long as they do not obscure what is being conveyed.
4. Avoid overlabeling pairs with middle range scores.
5. Be careful of overreliance on an extreme score like 0 or 5.

In total, we exploited three pretrained word-embedding models; a summary of them is presented in Table 3. Two of them are trained on English and Bengali Wikipedia, and the other one is trained on the Google News Corpus. We used a Python Google Translator package to translate the sentences in order to capture semantics using the Bengali word-embedding model. The Bengali word-embedding model and the translated sentence pairs are used by Eqs. 2 and 3. The NLTK POS tagger and the NLTK WordNet lemmatizer have also been used to identify the POS of each word and to stem the word, respectively. The IDF score in Eq. 1 for each word is estimated using the ClueWeb09 [8] document corpus, which comprises 50 million web documents.

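A sketch of how these resources might be loaded with gensim; the file names are placeholders, and the commented translation call shows one possible choice of Google Translator package (the exact library is not specified here).

```python
from gensim.models import KeyedVectors

# Hypothetical file names for the three pretrained models in Table 3.
w2v_gn     = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
w2v_wiki_e = KeyedVectors.load_word2vec_format("wiki_en_200d.txt")
w2v_wiki_b = KeyedVectors.load_word2vec_format("wiki_bn_200d.txt")

# Translation to Bengali before applying Eqs. (2) and (3) with w2v_wiki_b.
# One option (an assumption, not necessarily the package used in the paper):
# from googletrans import Translator
# bn_sentence = Translator().translate("A girl is on a sandy beach.", dest="bn").text
```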


4.2 Evaluation metric

The performance of our method has been tested based on the Pearson Correlation Coefficient (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). This evaluation metric has also been used as the official metric to test the performance of a method in SemEval STS2017 [9].

Let X = {x1, x2, x3, ..., xn} and Y = {y1, y2, y3, ..., yn} be the two sets of scores for n pairs of sentences generated by the system and by the human assessors' judgment, respectively. Each element x_i or y_i in set X and Y, respectively, represents the semantic textual similarity of the i-th sentence pair. The Pearson Correlation Coefficient r is defined as follows:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \qquad (6)$$

where n is the number of sentence pairs and x_i and y_i are the similarity scores given by the participant and by the human assessors, respectively, indexed by i. The arithmetic mean of the elements of X is defined by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, and analogously for ȳ.

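Eq. (6) in a few lines of NumPy (a sketch; scipy.stats.pearsonr gives the same value):

```python
import numpy as np

def pearson_r(x, y):
    """Eq. (6): Pearson correlation between system scores x and gold scores y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

system = [4.2, 1.1, 3.4, 0.3]   # hypothetical system scores
gold   = [5.0, 1.0, 3.0, 0.0]   # corresponding assessor judgments
print(pearson_r(system, gold))
```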

4.3 Measures' importance

We applied the ElasticNet [29] regularization technique to compute the importance of each similarity measure. In this regard, an R package named glmnet has been used to compute the importance. We applied fivefold cross-validation with the parameter α ranging from 0.1 to 0.9. Figure 2 reflects the estimated importance of the similarity measures. The figure shows that our introduced similarity measure based on WordNet (Eq. 1) ranked in the first position. Therefore, we can conclude that semantics from WordNet can capture semantic similarity. Among the other measures with all variants, the two measures (Eqs. 2, 3) based on word-embedding trained on the Google News Corpus ranked second and third, which reflects the importance of our proposed measures.

Fig. 2 Estimated importance of the similarity measures. The importance score is represented by the Y-axis and the similarity measures by the X-axis. [Figure: bar chart of the importance (coefficient) of the measures WN, APS_GN, AFS_GN, APS_Wiki_E, AFS_Wiki_E, APS_Wiki_B and AFS_Wiki_B.]

4.4 Experimental setup

To validate the performance of our method, we conducted experiments in multiple experimental settings. At first, we applied our proposed semantic similarity measures based on WordNet and word-embedding separately to observe their performance. Then, we applied the combination of all three proposed measures, in which different embedding models (illustrated in Table 3) were used. Finally, we applied the linear ranking discussed in Sect. 3.3 (Eq. 5) to all measures with their variants in terms of the embedding models. Lexical matching in terms of word overlap is used as the baseline. The summary of all experimental settings is illustrated in Table 4.

4.5 Experimental results

The performance of our proposed method with all experimental settings on the SemEval STS 2017 test collection [9] in terms of Pearson's (r × 100) is summarized in Table 5.

Table 4 Summary of all experimental settings

  SN  Run             Description
  1   WN_sim          Similarity score computed by WN_sim(S1, S2) (Eq. 1)
  2   WE_GN           Word-embedding-based similarity measures APS_sim(S1, S2) and AFS_sim(S1, S2) (Eqs. 2 and 3) with the 300-dimensional (N = 300) word2vec model trained on the Google News Corpus
  3   WE_Wiki_E       APS_sim(S1, S2) and AFS_sim(S1, S2) (Eqs. 2 and 3) with the 200-dimensional word2vec model trained on English Wikipedia
  4   WE_Wiki_B       APS_sim(S1, S2) and AFS_sim(S1, S2) (Eqs. 2 and 3) with the 200-dimensional word2vec model trained on Bengali Wikipedia
  5   WN+WE_GN        Linear ranking of experimental setups 1 and 2
  6   WN+WE_Wiki_E    Linear ranking of experimental setups 1 and 3
  7   WN+WE_Wiki_B    Linear ranking of experimental setups 1 and 4
  8   TF + WN+WE_AV   Linear ranking with all introduced measures, including their variants, and the traditional measures
  9   LM              Lexical matching based on term overlap


Table 5 Performance of our proposed method in terms of Pearson's (r × 100) on the SemEval 2017 semantic textual similarity dataset (STS 2017)

  Method       Run              Pearson's (r × 100)
  Our method   WN_sim           66.63
               WE_GN            65.09
               WE_Wiki_E        63.15
               WE_Wiki_B        57.29
               WN+WE_GN         68.81
               WN+WE_Wiki_E     67.42
               WN+WE_Wiki_B     59.12
               TF + WN+WE_AV    77.13
  Baseline     LM               31.59

  The best result is in bold

The table illustrates that the method applying supervised feature selection to the proposed semantic similarity measures with their variants, TF + WN+WE_AV, achieved the best performance over all other experimental settings. Among all introduced measures including their variants, WN_sim performs better than the other measures based on word-embedding. The performances of the proposed measures using the different word-embedding models (WE_GN, WE_Wiki_E and WE_Wiki_B, corresponding to settings 2, 3 and 4) are also quite competitive compared to WN_sim. Therefore, we can conclude that our proposed measures are able to capture better semantics to estimate the semantic similarity between texts. Moreover, our proposed measures applied to multiple word-embedding models trained on different corpora as well as different languages achieved competitive and consistent performance. These findings and results demonstrate the effectiveness of the proposed measures in computing the semantic similarity of texts. Combining the measures based on WordNet and word-embedding (settings WN+WE_GN, WN+WE_Wiki_E and WN+WE_Wiki_B) further improves the performance and also surpasses the individual contributions. Finally, the results of the linear ranking applied to all measures with their variants, TF + WN+WE_AV, proved the effectiveness of importance estimation using the ElasticNet regularization technique. This setting achieved a new state-of-the-art performance in estimating the semantic similarity between texts.

The sentence-pairwise performance comparison among experimental settings 1, 2, 3, 4 and 8 (w.r.t. Table 4) for 40 randomly selected sentence pairs is depicted in Fig. 3. The figure indicates that the evaluation results of the individual settings vary widely. But in most cases, we can see that our proposed method using ElasticNet importance estimation, TF + WN+WE_AV, achieved better performance compared to the others.

4.6 Comparison with related work

To validate the performance of our proposed method, we compared its performance with some known related methods and the baseline. Table 6 reflects the comparison between our method and some known related methods in terms of Pearson's correlation coefficient r × 100. The table illustrates that our proposed method outperformed the other methods as well as the baseline.

Table 6 Performance comparison among our proposed method and some known related methods in terms of Pearson's (r × 100) on the SemEval 2017 semantic textual similarity dataset (STS 2017)

  Method            Run                      Pearson's (r × 100)
  Our method        TF + WN+WE_AV            77.13
  Related methods   Bonet & Cedeño [10]      72.69
                    Bjerva and Östling [7]   69.06
                    MatrusriIndia [9]        65.79
                    NLPProxem [9]            62.56
                    Borrow and Peskov [5]    61.74
                    Biçici [6]               54.68
  Baseline          LM                       31.59

  The best result is in bold among all

Fig. 3 Pairwise performance of different experimental settings. [Figure: similarity scores (0 to 5) for 40 randomly selected sentence pairs under the settings WN_sim, WE_GN, WE_E, WE_B and TF+WN_WE_AV.]

Bonet and Cedeño [10] proposed a method based on different features, including lexical features, explicit semantic analysis, context vector-based features and embedding-based features. A multilingual word representation is employed by the group of Bonet and Cedeño [10] to capture semantic similarity. On the other side, Borrow and Peskov [5] applied an end-to-end shared weight deep LSTM model for semantic textual similarity. Though the related methods are based on similarity measures using numerous techniques and resources, our proposed method uses only word-level semantics to capture sentence-level similarity. The results also indicate that word-level semantics is effective for capturing sentence-level similarity. Overall, the experimental results demonstrate the effectiveness of our method in computing semantic similarity.

5 Conclusion and future directions

This paper introduced a method for measuring the semantic textual similarity between sentences. We estimated the similarity between sentences using word-level semantics. In this regard, we investigated bilingual word semantics, which has been utilized to capture the semantic similarity between sentences. We proposed three new semantic similarity measures exploiting word-embedding and WordNet. The performance of each individual measure using different resources (Wikipedia and the Google News corpus) on the STS-2017 dataset was observed. The observation concluded that the proposed measures are effective for computing similarity. Moreover, the performance of the measures using bilingual semantics (both Bengali and English) was competitive, which indicates the consistency of our proposed measures in capturing similarity. The combination of measures further improved the performance. Finally, the linear ranking applied to all measures, with their importance scores computed by a linear regression technique, surpassed the performance of all individual measures and achieved a new state-of-the-art performance. The performance comparison with some known related methods also demonstrated the effectiveness of our method.

In the near future, we would like to apply our proposed semantic measures in some other fields such as query suggestion generation, web search diversification, query completion and subtopic mining. It would be interesting to apply multilingual word semantics to estimate cross-language sentence similarity. We also have a plan to apply long short-term memory (LSTM) to introduce a new similarity measure for semantic similarity.

References

1. Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W., Lopez-Gazpio, I., Maritxalar, M., Mihalcea, R.: SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 252–263 (2015)
2. Agirre, E., Banea, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Mihalcea, R., Rigau, G., Wiebe, J.: SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 497–511 (2016)
3. Aliguliyev, R.M.: A new sentence similarity measure and sentence based extractive technique for automatic text summarization. Expert Syst. Appl. 36(4), 7764–7772 (2009)
4. Bär, D., Biemann, C., Gurevych, I., Zesch, T.: UKP: computing semantic textual similarity by combining multiple content similarity measures. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, Association for Computational Linguistics, pp. 435–440 (2012)
5. Barrow, J., Peskov, D.: UMDeep at SemEval-2017 task 1: end-to-end shared weight LSTM model for semantic textual similarity. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 180–184 (2017)
6. Biçici, E.: RTM at SemEval-2017 task 1: referential translation machines for predicting semantic similarity. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 203–207 (2017)
7. Bjerva, J., Östling, R.: ResSim at SemEval-2017 task 1: multilingual word representations for semantic textual similarity. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 154–158 (2017)
8. Callan, J., Hoy, M., Yoo, C., Zhao, L.: ClueWeb09 data set (2009)
9. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: SemEval-2017 task 1: semantic textual similarity - multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055 (2017)
10. España-Bonet, C., Barrón-Cedeño, A.: Lump at SemEval-2017 task 1: towards an interlingua semantic similarity. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 144–149 (2017)
11. Fernando, S., Stevenson, M.: A semantic similarity approach to paraphrase detection. In: Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, pp. 45–52 (2008)
12. Ferreira, R., Lins, R.D., Freitas, F., Simske, S.J., Riss, M.: A new sentence similarity assessment measure based on a three-layer sentence representation. In: Proceedings of the 2014 ACM Symposium on Document Engineering, ACM, pp. 25–34 (2014)
13. Fewzee, P., Karray, F.: Elastic net for paralinguistic speech recognition. In: Proceedings of the 14th ACM International Conference on Multimodal Interaction, ACM, pp. 509–516 (2012)
14. Han, L., Kashyap, A.L., Finin, T., Mayfield, J., Weese, J.: UMBC_ebiquity-core: semantic textual similarity systems. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, vol. 1, pp. 44–52 (2013)
15. Hassanzadeh, H., Groza, H., Nguyen, A., Hunter, J.: UQeResearch: semantic textual similarity quantification. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 123–127 (2015)
16. Hoerl, A., Kennard, R.: Ridge Regression. In: Encyclopedia of Statistical Sciences, vol. 8, pp. 129–136. Wiley, New York (1988)
17. Jijkoun, V., de Rijke, M.: Recognizing textual entailment using lexical similarity. In: Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment, Citeseer, pp. 73–76 (2005)


18. Kenter, T., De Rijke, M.: Short text similarity with word embeddings. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, ACM, pp. 1411–1420 (2015)
19. Kozareva, Z., Vazquez, S., Montoyo, A.: Adaptation of a machine-learning textual entailment system to a multilingual answer validation exercise. In: CLEF (Working Notes) (2006)
20. Li, H., Xu, J.: Semantic matching in search. Found. Trends Inf. Retr. 7(5), 343–469 (2014)
21. Li, Y., McLean, D., Bandar, Z.A., Crockett, K.: Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. Knowl. Data Eng. 8, 1138–1150 (2006)
22. Lintean, M.C., Rus, V.: Measuring semantic similarity in short texts through greedy pairing and word semantics. In: FLAIRS Conference (2012)
23. Metzler, D., Dumais, S., Meek, C.: Similarity measures for short segments of text. In: European Conference on Information Retrieval, pp. 16–27. Springer, Berlin (2007)
24. Mihalcea, R., Corley, C., Strapparava, C.: Corpus-based and knowledge-based measures of text semantic similarity. AAAI 6, 775–780 (2006)
25. Šarić, F., Glavaš, G., Karan, M., Šnajder, J., Bašić, B.D.: TakeLab: systems for measuring semantic text similarity. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, Association for Computational Linguistics, pp. 441–448 (2012)
26. Shajalal, Md., Ullah, M.Z., Chy, A.N., Aono, M.: Query subtopic diversification based on cluster ranking and semantic features. In: Advanced Informatics: Concepts, Theory and Application (ICAICTA), 2016 International Conference on, IEEE, pp. 1–6 (2016)
27. Tibshirani, R.: Regression shrinkage and selection via the lasso: a retrospective. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 73(3), 273–282 (2011)
28. Zhang, Z., Saligrama, V.: Zero-shot learning via semantic similarity embedding. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4166–4174 (2015)
29. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

