Keywords: Arabic Text Similarity, Semantic Similarity, Lexical Similarity, Word Embedding, Permutation Feature, Negation Effect.
The rest of the paper is organized as follows. Section 2 reviews related work on semantic similarity approaches. Section 3 explains the methodology of the approach. In Section 4, the evaluations of the experimental results are described. Finally, the conclusion of our approach is presented in Section 5.

2. Related Work
Approaches to the similarity of Arabic snipped texts can be classified according to the adopted methodology as follows:
One of the most used strategies for evaluating semantic similarity is deep learning with feature-engineered models. Tian et al. [6] used features such as n-gram overlap, edit distance, and the longest common prefix/suffix/substring to train deep learning algorithms. They were able to reach a PCC of 0.7440. Henderson et al. [7] employed the same method on alignment-similarity and string-similarity features with various algorithms, such as Recurrent Neural Networks (RNN) and Recurrent Convolutional Neural Networks (RCNN). For the same dataset, their PCC is 0.7304.
For the same dataset, the semantic information space (SIS) is a technique that produced a high Pearson correlation coefficient. With this method, a non-overlapping Information Content (IC) computation is obtained, based on the semantic hierarchical taxonomy in WordNet. This method was employed by Wu et al. [8] in three studies, in which they used the IC based on word frequencies from WordNet and the British National Corpus. They also combined the IDF weighting scheme with the IC and cosine similarity. The highest Pearson correlation coefficient they presented in this competition is 0.7543.
BabelNet is a huge, broad-coverage multilingual semantic network that gathered its data from WordNet and Wikipedia [9]. The multilingual word-sense aligner proposed by Hassan et al. [10] relies heavily on the BabelNet network. Based on the Babel synsets, they built an aligner that aligns each word in one phrase to a word in the other phrase. In many languages, these synsets reflect the word's meaning, named entities, and synonyms. For the Arabic dataset used, the PCC is 0.7158.
Word embedding is a common method for various text applications and NLP tasks. The distributed representation of words in a vector space is referred to as word embedding. Traditional NLP techniques miss syntactic (structure) and semantic (meaning) links across collections of words. As a result, using word vectors to represent words has its advantages. The word vectors are multidimensional continuous floating-point values in which semantically comparable words are mapped to geometrically close regions. Each point in the word vector represents a dimension of the word's meaning, i.e., words used in comparable contexts are mapped to a proximal vector space. Different techniques represent the words in vector spaces, such as Skip-gram (skip-G), Continuous Bag of Words (CBOW), and co-occurrence frequency. The terms "flower" and "tree," for example, are semantically related since they both refer to plants and are used in the same context [11]. FastText is one of the word embedding models that provides word representations for the Arabic language. FastText is an unsupervised learning technique that generates vector representations for words in 294 languages [12]. It supports both the CBOW and skip-G models.
Nagoudi et al. [13] used a word embedding model presented by Zahran et al. [14], based on the CBOW, skip-G, and GloVe approaches, to determine the semantic similarity of Arabic sentences. Additionally, they included two weighting functions: IDF weighting and POS tagging. They achieved a PCC of 0.7463, ranked first among approaches applied to the native language and second among all participants.
Alian et al. [15] proposed a method that combined lexical, semantic, and syntactic-semantic features with machine learning techniques such as linear regression and support vector machine regression. They applied the Levenshtein method and one of the word embedding models to represent words in a vector space. They evaluated their approach on three different datasets. For the STS-MSRvid dataset of the SemEval competition, they achieved a PCC of 0.743.
The word-based similarity category treats the phrase as a collection of words; hence it is based on the similarity between terms. There are several approaches for determining phrase similarity in this area, including maximum similarity, the similarity matrix, employing similar and dissimilar components, and word sense disambiguation [4]. In addition, several of these strategies are combined. First, the maximum similarity of each word in the first sentence against each word in the second sentence is determined using the max-similarity approach; the average similarity is then computed [16]. The similarity matrix method generates a matrix containing the results of calculating the similarity between each word in one sentence and the words in the other sentence [17]. Wang et al. [18] describe the use of similar and dissimilar parts to represent words using word embedding. They then used cosine similarity to create a similarity matrix. Furthermore, a semantic matching function was used to create a semantic matching vector for each word. They decompose the resulting match vectors to discover which portions of each vector are similar and which are distinct. Finally, the similarity is calculated using these vectors. In [19], they utilized WordNet synonyms to expand the words of the original phrases, then generated a vector representation for these words in addition to the vectors
of the set of terms in each sentence. Finally, the cosine similarity of the two vectors is used to calculate the similarity.

3. Methodology
In this research, we present two hybrid approaches based on two different semantic techniques.
3.1 The First Proposed Approach
In the first approach, we present a hybrid methodology that combines two semantic similarity approaches, word-based and vector-based methods, to quantify the semantic similarity between two snipped Arabic texts. First, the vector space of each word is retrieved in lemma form using the fastText vector model. The StanfordNLP library was used to generate the lemma forms. The StanfordNLP package is employed to analyze natural language: it turns a string of human-language text into lists of sentences and words, generates the base forms of those words, their parts of speech and morphological features, and provides a syntactic dependency parse, in parallel across more than 70 languages [20]. In our technique, it is utilized for tokenization and lemmatization of both the fastText vector model and the input text; the following part describes how this is done. Second, utilizing the vector word-space of the sentences, a word-matching matrix is created. Finally, the degree of similarity between the sentences is calculated.
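As an illustration of the lemmatization step, a minimal sketch using the stanfordnlp package is given below; the example sentence handling and the exact pipeline configuration are assumptions, not necessarily the configuration used in this work.

    import stanfordnlp

    # One-time download of the Arabic models, then build a pipeline that
    # tokenizes the text and produces a lemma for each token.
    stanfordnlp.download('ar')
    nlp = stanfordnlp.Pipeline(lang='ar', processors='tokenize,mwt,pos,lemma')

    def lemmatize_text(text):
        """Return the lemma form of every word in an Arabic snippet."""
        doc = nlp(text)
        return [word.lemma for sentence in doc.sentences for word in sentence.words]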
Fig. (1) illustrates the suggested technique, which is divided into three stages:
1) Vector-Based Similarity,
2) Word-Based Similarity,
3) Similarity Measures.
3.1.1 Vector-Based Similarity
FastText models are available and readable for the Arabic language. The fastText models were trained using CBOW with position weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives [21]. The Arabic fastText corpus contains more than 356 thousand distinct words, all represented in their surface form. A screenshot from this dataset is shown in Fig. (2).
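For concreteness, the following sketch shows one way to read such pre-trained vectors into memory; the file name cc.ar.300.vec and the in-memory layout are assumptions for illustration, not the paper's actual implementation.

    import numpy as np

    def load_vectors(path='cc.ar.300.vec'):
        """Read fastText word vectors (text format) into a dict: surface form -> 300-d vector."""
        vectors = {}
        with open(path, encoding='utf-8') as f:
            next(f)  # skip the header line: "<vocab size> <dimension>"
            for line in f:
                parts = line.rstrip().split(' ')
                word, values = parts[0], parts[1:]
                vectors[word] = np.array(values, dtype=np.float32)
        return vectors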
Many studies employed the surface form of the vector space model to derive the semantic similarity between words, which does not include additional semantically related terms. In this research, however, a lemmatization technique is used to enlarge the search word space of the fastText model words. The lemmatization is applied to both the input text and the fastText vector model, and the result is referred to as the mapped fastText model. Some other preprocessing steps are also applied to the text, such as noise removal, word normalization, and stopword elimination.
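A minimal sketch of how such a mapped (lemma-keyed) model could be built from the loaded vectors is shown below; grouping all vectors that share a lemma under one key is our reading of the description, and the lemmatize helper (one word in, one lemma out) is hypothetical.

    from collections import defaultdict

    def build_mapped_model(vectors, lemmatize):
        """Group fastText vectors by the lemma of their surface form.

        vectors   : dict mapping surface form -> vector (see load_vectors above)
        lemmatize : callable returning the lemma of a single word
        A lemma may end up with several distinct vectors, one per surface
        form that reduces to it (the "numerous indices" described below).
        """
        mapped = defaultdict(list)
        for surface, vec in vectors.items():
            mapped[lemmatize(surface)].append(vec)
        return mapped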
Two techniques are used to extract the semantic word space of each input word: the closest words algorithm, or a ready-made function built into the fastText module in the Python language.
3.1.1.1 Using the closest words algorithm
To extract the closest words for each word in the sentences, the vectors of a word embedding (word representation) are employed to extract the semantic similarity; for related or near-synonymous words, these vectors share the same semantic properties. The suggested method extracts comparable words from the mapped fastText model using the preprocessed, lemma form of the input words. For each word in the input text there are numerous indices in the mapped model that contain the same word but distinct vectors. Fig. (3) shows the process of extracting the semantic word space for a specific word.
The mapped vector model is used to extract the closest words using Algorithm 1, where np is the NumPy library (Numerical Python).
Fig. (3) The process of extracting the semantic word space for a word using the closest words algorithm
Fig. (4) The process of extracting the semantic word space for a word using the fastText library
The closest word can be found by iterating over all the vectors of the mapped fastText model to obtain the index of the most comparable word. By repeating the loop N times in the same manner, N related terms are discovered. The word space size for each word could be fixed; however, according to the mapped fastText model, each word has a distinct number of indices. The main purpose is to increase the word search space by collecting more relevant comparable terms for a particular word. In this approach, we extract the 10 closest words for each index.
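A minimal sketch of this nearest-word loop, assuming cosine similarity over the mapped model built above, is given here; the exact scoring and bookkeeping of Algorithm 1 may differ, and the names are illustrative only.

    import numpy as np

    def closest_words(query_vec, mapped, n=10):
        """Return the n lemmas whose vectors are most similar to query_vec (cosine similarity)."""
        scores = []
        for lemma, vecs in mapped.items():
            for vec in vecs:  # a lemma may own several vectors (indices)
                sim = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
                scores.append((sim, lemma))
        scores.sort(reverse=True)  # highest similarity first
        return [lemma for _, lemma in scores[:n]]

    def semantic_word_space(word, mapped, n=10):
        """Union of the n closest words for every vector index of `word` in the mapped model."""
        space = set()
        for vec in mapped.get(word, []):
            space.update(closest_words(vec, mapped, n))
        return space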
3.1.1.2 Using the fastText module
The language model of the fastText organization was released as a Python module [22]. Similar words can be retrieved using a built-in function of this module named "get_nearest_neighbors". Like the word space built for each word above, this function returns the 10 closest neighbors of the searched word in its surface form. The process of extracting the semantic word space using the fastText module is shown in Fig. (4).
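A short usage sketch of that built-in function follows; the model file name and the query word are placeholders for illustration.

    import fasttext

    # Load the pre-trained Arabic model (binary format) released by fastText.
    model = fasttext.load_model('cc.ar.300.bin')

    # get_nearest_neighbors returns a list of (similarity, word) pairs,
    # ten by default, for the surface form of the query word.
    neighbors = model.get_nearest_neighbors('مدرسة', k=10)
    for score, word in neighbors:
        print(score, word)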
3.1.2 Word-Based Similarity
In this stage, we try to find the relatedness, or association, between words. From the semantic word spaces, a common word matrix is built. This matrix is constructed from the common words of each pair of words, taken from their semantic word spaces. From this matrix, a matching matrix is generated by selecting the most common words between each pair of words. The matching matrix consists of the words of the first sentence matched to some of the words of the second sentence, together with the number of common words (NCW).
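One way to realize this pairing, assuming the semantic_word_space helper sketched earlier and reading the NCW of a word pair as the overlap of their two word spaces, is sketched below; the paper's exact construction may differ.

    def common_word_matrix(sent1, sent2, mapped, n=10):
        """NCW of every word pair: size of the overlap of their semantic word spaces."""
        matrix = {}
        for w1 in sent1:
            space1 = semantic_word_space(w1, mapped, n)
            for w2 in sent2:
                space2 = semantic_word_space(w2, mapped, n)
                matrix[(w1, w2)] = len(space1 & space2)
        return matrix

    def matching_matrix(sent1, sent2, mapped, n=10):
        """For each word of the first sentence, keep the second-sentence word with the highest NCW."""
        ncw = common_word_matrix(sent1, sent2, mapped, n)
        matches = []
        for w1 in sent1:
            best = max(sent2, key=lambda w2: ncw[(w1, w2)])
            if ncw[(w1, best)] > 0:
                matches.append((w1, best, ncw[(w1, best)]))
        return matches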
3.1.3 Similarity Measures
For two sentences s1 and s2 with lengths n and m respectively, the similarity score is measured by Equation 1, where p is the length of the matching matrix and score_i is the score of the i-th matched pair:

Sim(s_1, s_2) = \frac{\sum_{i=1}^{p} score_i}{(n + m)/2}    (1)

The resulting value is a ratio on a 0-to-1 scale, so we multiply the output by 5 to put it on a 0-to-5 scale. A value of zero indicates that the two sentences are quite different, and a value of five indicates that the two sentences are identical.
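Under the same reading of Equation 1 (each matched pair contributing a score of +1 by default), the final 0-to-5 score could be computed as in this sketch; the normalization term reflects our reconstruction of the equation above.

    def similarity_score(matches, n, m, scores=None):
        """Equation 1 followed by the 0-5 rescaling described above.

        matches : the matching matrix (one entry per matched word pair)
        n, m    : lengths of the two sentences
        scores  : optional per-pair scores (defaults to +1 each)
        """
        if scores is None:
            scores = [1.0] * len(matches)
        ratio = sum(scores) / ((n + m) / 2)   # Equation 1, a value in [0, 1]
        return 5 * ratio                      # rescale to the 0-5 similarity range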
3.2 The Second Proposed Approach
In the second approach, a modified approach is presented to measure the similarity of two snipped Arabic texts lexically and semantically based on the edit distance approach. This approach is hybrid in the sense that both syntactic and semantic features are used to measure the similarity. Different knowledge resources are employed, such as the semantic word spaces. The approach also presents a solution to the mis-ordering of words between the two given sentences. The modified edit distance approach is based on different weights (edit costs) assigned according to the state of the two words.
The proposed workflow for measuring the edit cost between two words is shown in Fig. (5).
Fig. (5) The workflow diagram of finding the edit cost of two words
… operations. For each candidate sequence, the edit distance is calculated. The candidate sequence with the shortest edit distance is chosen as the one that most accurately matches the alignment of the words in the two sentences.
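Reading the permutation feature as trying candidate word orderings of one sentence and keeping the one with the smallest edit distance, a minimal sketch is given below; a plain word-level Levenshtein distance stands in for the paper's weighted edit cost, and the brute-force enumeration is only for illustration.

    from itertools import permutations

    def edit_distance(a, b):
        """Word-level Levenshtein distance between two token lists."""
        dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
        return dp[len(a)][len(b)]

    def best_alignment_distance(sent1, sent2):
        """Smallest edit distance over candidate word orderings of the second sentence."""
        return min(edit_distance(sent1, list(p)) for p in permutations(sent2))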
4. Experiments
4.1 Dataset Description
We utilized the dataset of the Semantic Evaluation "SemEval" yearly competition to assess the performance of our approach. This event covers a variety of languages and tracks. The semantic similarity between texts is one of these tracks (for word phrases, sentences, paragraphs, or full documents). Furthermore, the texts might be in monolingual or multilingual formats. 2017 was the final year of the competition that included the semantic similarity track.
Development Sets and Evaluation Sets are the two datasets included in the released data. One of the Development Sets for monolingual Arabic snipped texts is the STS-MSRvid dataset¹, which contains 368 pairs of sentences. The Evaluation Sets² are a collection of 250 sentence pairs with human-judgment ratings that were published as the Evaluation Gold Standard³. In the output of these datasets, the Pearson Correlation Coefficient (PCC) between each pair of sentences is supplied as a one-column table. This coefficient runs from -1 to 1, with a value of (-1) indicating that the values of the two columns are completely different and a value of (1) indicating that they are identical. The coefficient is expressed in Equation 4:

r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}    (4)

where \bar{x} is the mean of x, which is defined by Equation 5:

\bar{x} = \frac{\sum_{i} x_i}{n}    (5)
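Equations 4 and 5 correspond to the standard sample Pearson correlation, which can be computed as in the following sketch (NumPy is assumed; scipy.stats.pearsonr would give the same value).

    import numpy as np

    def pcc(x, y):
        """Pearson Correlation Coefficient between two score columns (Equations 4 and 5)."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        dx, dy = x - x.mean(), y - y.mean()   # the means follow Equation 5
        return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))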
4.2 Experimental Evaluations
The two proposed approaches are evaluated on two different datasets in two tests as follows:
4.2.1 The First Approach (Evaluation Gold Standard)
The first approach is evaluated using the Evaluation Gold Standard dataset. The experimental results are classified according to the algorithm applied for finding the closest words of the semantic word space of each word.
4.2.1.1 Using the closest words algorithm
In this experiment, two tests were carried out. The first test uses the mapped vector model with the input text words in their lemma form. The second uses the input text and the fastText vector model in their surface form. First, the input text is preprocessed with the preprocessing tools; the stopwords are eliminated in one run and kept in another. The results in terms of the Pearson correlation coefficient are shown in Table (1).
The results in Table (1) show that applying the lemmatization technique to the input text together with the mapped vector model has a better effect than using the fastText model and the input text in their surface form. In addition, removing the stopwords from the input text improved the results slightly.
Table (1) Experimental results using the closest words algorithm

Dataset with Pre-Processing   With Surface Form   With Lemma Form
With StopWords                0.6708              0.6886
Without StopWords             0.6887              0.7000

Table (2) Experimental results using the fastText module

The dataset in Surface Form   Built-In Function   Closest Words Algorithm
With StopWords                0.6513              0.6708
Without StopWords             0.6679              0.6887

Table (3) Experimental results for studying the negation effect on the proposed approach

Dataset with Pre-Processing   Before Studying Negation   After Studying Negation
With StopWords                0.6924                     0.7052
Without StopWords             0.7039                     0.7212

¹ http://alt.qcri.org/semeval2017/task1/data/uploads/ar_sts_data_updated.zip
² http://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.eval.v1.1.zip
³ http://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.gs.zip
4.2.1.2 Using the fastText module
In this experiment, the proposed approach is applied with the input text in its surface form, to be compatible with the results of the built-in function "get_nearest_neighbors". The results shown in Table (2) are obtained with and without stopwords.
Table (2) shows that the proposed algorithm for finding the closest words achieved better results than the built-in function that performs the same task.
4.2.1.3 Studying the Negation Effect
Negation is a significant factor that influences the sentence's orientation (Ismail et al., 2016). Negation terms in Arabic include (ليس، ليست، لن، لا، عدم، لم). The meaning of the sentence is reversed by these negative words. In the proposed method, these negation terms are scanned for in each of the two input sentences. If a negation word is found in one sentence while the other sentence does not include a negation term, the overall score is reduced by 1.5: the presence of negation is represented by a score of (-0.5), which substitutes for the (+1) common word score. The Pearson Correlation Coefficient over the entire dataset was modified by these scores, as seen in Table (3). In this last experiment, the Pearson Correlation Coefficient becomes close to the human judgment scores. The score of 0.7212 is the highest value obtained in applying the proposed approach.
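A small sketch of this adjustment, mirroring the term list above and our reading that the (+1) to (-0.5) substitution is applied once per sentence pair, is:

    NEGATION_TERMS = {'ليس', 'ليست', 'لن', 'لا', 'عدم', 'لم'}

    def negation_adjustment(sent1, sent2):
        """Adjustment applied to the summed pair scores of Equation 1.

        When exactly one of the two sentences contains a negation term, one
        (+1) common-word score is replaced by (-0.5), i.e. the sum drops by 1.5.
        """
        neg1 = any(w in NEGATION_TERMS for w in sent1)
        neg2 = any(w in NEGATION_TERMS for w in sent2)
        return -1.5 if neg1 != neg2 else 0.0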
4.2.1.4 Comparison with other approaches
Table (4) compares our proposed approach to the other works in the AR-AR track of the SemEval 2017 competition that achieved the highest PCCs among the participants. In Table (4), some researchers used the Google machine translator to enlarge the training dataset, as required by the deep learning approach; therefore, their results are much better than the results of the traditional approaches that use the native language. The proposed approach is ranked second after [13].
4.2.2 The Second Approach (STS-MSRvid)
The second approach is evaluated using the STS-MSRvid dataset. The experimental results were obtained for each methodology and after applying the permutation feature, as shown in Table (5).
Table (4) Results of the SemEval participants for the Evaluation Sets
Table (5) The proposed approach correlation results for the Development Sets
Table (6) The proposed approach correlation compared with similar works for the STS-MSRvid dataset
The proposed approach is compared to other works that used the same dataset, as shown in Table (6). It demonstrates that our suggested approach has a higher PCC than the other studies, implying an advantage over the other methodologies.
[19] K. Abdalgader, A. Skabar, "Short-text similarity measurement using word sense disambiguation and synonym expansion," in Australasian Joint Conference on Artificial Intelligence, pp. 435-444, Springer, Berlin, Heidelberg, 2010.
[20] StanfordNLP Package, "StanfordNLP 0.2.0 - Python NLP Library for Many Human Languages," https://stanfordnlp.github.io/stanfordnlp/index.html, [Date accessed 25/03/2022].
[21] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov, "Learning word vectors for 157 languages," arXiv preprint arXiv:1802.06893, 2018.
[22] FastText Model, "Word vectors for 157 languages," https://fasttext.cc/docs/en/crawl-vectors.html, [Date accessed 25/03/2022].