
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/362786644

Arabic Semantic-Based Textual Similarity

Article in Benha Journal of Applied Sciences · April 2022
DOI: 10.21608/bjas.2022.254708

Authors: Shimaa Ismail, AbdelWahab Alsammak and Tarek Elshishtawy (Benha University)

All content following this page was uploaded by Shimaa Ismail on 19 August 2022.


Benha Journal of Applied Sciences (BJAS)
Vol. (7) Issue (4) (2022), (133-142)
print: ISSN 2356–9751, online: ISSN 2356–976x
http://bjas.journals.ekb.eg

Arabic Semantic-Based Textual Similarity


Shimaa Ismail¹, AbdelWahab Alsammak² and Tarek Elshishtawy¹
¹Faculty of Computers and Artificial Intelligence, Benha Univ., Benha, Egypt
²Faculty of Engineering Shoubra, Benha Univ., Benha, Egypt
E-mail: [email protected]
Abstract
Textual similarity is one of the most important aspects of information retrieval. This paper examines several techniques for semantic textual similarity as well as the factors that influence them. Two hybrid approaches for measuring the degree of similarity between two snipped Arabic texts are presented. The first proposed approach combines word-based and vector-based similarity methods to construct a semantic word space for each word of the input text. The words are represented in their lemma forms to capture all semantically related words. In this approach, the semantic word spaces are used to find the best matching between the words of the input texts, and hence the degree of similarity between the two snipped texts is computed. The second proposed approach combines semantic-based and syntactic-based approaches. The basic Levenshtein concept forms the main structure of this approach; it has been modified to measure the edit cost at the token level rather than the character level. In addition, the semantic word spaces are incorporated to add semantic features to the syntactic ones. Further techniques are embedded to overcome problems of the syntactic approach, such as word ordering. The Pearson correlation coefficient is used to measure the degree of correctness of the two proposed approaches against two benchmark datasets. The experiments achieved scores of 0.7212 and 0.7589 for the two proposed approaches on the two datasets.

Keywords: Arabic Text Similarity, Semantic Similarity, Lexical Similarity, Word Embedding, Permutation Feature,
Negation Effect.

1. Introduction

STS (Semantic Textual Similarity) metrics have been a major topic in a variety of studies and applications. Information retrieval, machine translation, text summarization, sentiment analysis, question generation and answering, automatic essay scoring, automatic short answer grading, and other activities all rely on them [1].

There are several barriers in determining the semantic similarity or relatedness between two snipped Arabic texts, including morphological inflections and orthographic confusion related to optional diacritization. As a result, there are more homographs than in English, which adds to the confusion [2]. The Arabic word "ذهب", for example, may be the verb "went" or the noun "gold" under different diacritics. It is possible to discern between the two meanings based on the part-of-speech tag and the context of the word. As a result, it is important to perform this work with a conventional morphological tool. In this research, the StanfordNLP library is used as a morphological tool to determine the Part of Speech (POS) tags of words. In addition, lemmatization was employed in our study to avoid the diacritization difficulty and its impact, by using the lemma form of each word. The inflected word form is transformed into its dictionary lemma look-up form through lemmatization [3]. This lemma form is the shortest form that captures all of the word's semantic characteristics.

There are two different types of sentence similarity: lexical similarity and semantic similarity. Lexical similarity looks at a sentence as a series of characters and determines how similar these characters are. Semantic similarity, on the other hand, is determined by the meaning of the phrases. It is necessary to determine the degree of relatedness between sentences to quantify semantic similarity [4].

The Levenshtein approach is one of the approaches that measure the similarity of texts lexically. The Levenshtein distance is a statistic for calculating how many edit operations are necessary to convert one string to another [5]. In its original form, the distance between two words is defined as the number of single-character changes (i.e., insertions, deletions, or replacements) necessary to change one word into the other. The edit distance is extended in this study to measure distances between two sentences given in words rather than characters. In addition, semantic features are embedded to change the edit cost between the input words.

Semantic word spaces are generated from one of the word embedding models that represent words in a vector space. This vector space defines the meaning of words and the relations between them. Two algorithms are used to extract these semantic word spaces in this research, and the two proposed approaches use the semantic word spaces in two different manners.

In this paper, two hybrid approaches are proposed to enhance the overall calculation of the similarity between two texts. An interlaced approach that combines word-based and vector-based approaches is proposed, and a modified strategy that combines both syntactic and semantic techniques is also presented.

The rest of the paper is organized as follows. Section 2 reviews related work on semantic similarity approaches. Section 3 explains the methodology. In section 4, the evaluations of the experimental results are described. Finally, the conclusion is presented in section 5.

2. Related Work

Work on Arabic snipped texts can be classified according to the adopted methodology as follows.

One of the most used strategies for evaluating semantic similarity is deep learning with feature-engineered models. Tian et al. [6] used characteristics such as n-gram overlap, edit distance, and longest common prefix/suffix/substring to train deep learning algorithms. They were able to reach a PCC of 0.7440. On the aspects of alignment similarity and string similarity measurements, Henderson et al. [7] employed the same method with various algorithms, such as Recurrent Neural Networks (RNN) and Recurrent Convolutional Neural Networks (RCNN). For the same dataset, their PCC is 0.7304.

For the same dataset, the semantic information space (SIS) is a technique that produced a high Pearson correlation coefficient. The non-overlapping Information Content (IC) computation is obtained using this method, which is based on the semantic hierarchical taxonomy in WordNet. This method was employed by Wu et al. [8] in three studies, in which they used the IC based on word frequencies from WordNet and the British National Corpus. They also combined the IDF weighting system, using the IC, with cosine similarity. Their PCC of 0.7543 was the highest presented in this competition.

BabelNet is a huge, multilingual semantic network with broad coverage that gathered its data from WordNet and Wikipedia [9]. The multilingual word-sense aligner proposed by Hassan et al. [10] relies heavily on the BabelNet network. According to the Babel synsets, they built an aligner that aligns each word in one phrase to another word in the other. In many languages, these synsets reflect the word's meaning, named entities, and synonyms. For the Arabic dataset used, their PCC is 0.7158.

Word embedding is a common method for various text applications and NLP activities. The distributed representation of words in a vector space is referred to as word embedding. Traditional NLP techniques miss syntactic (structure) and semantic (meaning) links across collections of words. As a result, using word vectors to represent words has its advantages. The word vectors are multidimensional continuous floating-point values in which semantically comparable words are mapped to geometrically close regions. Each point in the word vector represents a dimension of the word's meaning, i.e., words used in comparable contexts are mapped to a proximal vector space. Different techniques represent the words in vector spaces, such as Skipgram (skip-G), Continuous Bag of Words (CBOW), and co-occurrence frequency. The terms "flower" and "tree", for example, are semantically related since they both refer to plants and are used in the same context [11]. FastText is one of the word embedding models that presented word representations for the Arabic language. FastText is an unsupervised learning technique that generates vector representations for words in 294 languages [12]. It supports both CBOW and skip-G models.

Nagoudi et al. [13] used a word embedding model presented by Zahran et al. [14], based on the CBOW, skip-G, and GloVe approaches, to determine the semantic similarity of Arabic sentences. Additionally, they included two weighting functions: IDF weighting and POS tagging. They achieved a PCC of 0.7463, ranked first among approaches applied to the native language and second among all participants.

Alian et al. [15] proposed a method that combined lexical, semantic, and syntactic-semantic features with machine learning techniques such as linear regression and support vector machine regression. They applied the Levenshtein method and one of the word embedding models to represent words in a vector space. They evaluated their approach on three different datasets. For the STS-MSRvid dataset of the SemEval competition, they achieved a PCC of 0.743.

The word-based similarity category treats the phrase as a collection of words; hence it is based on the similarity between terms. There are several approaches for determining phrase similarity in this area, including maximum similarity, similarity matrices, employing similar and dissimilar components, and word sense disambiguation [4]. In addition, several of these strategies are integrated. First, the maximum similarity of each word in the first sentence to each word in the second sentence is determined using the max-similarity approach; the average similarity is then computed [16]. The similarity matrix method generates a matrix containing the results of calculating the similarity between each word in each sentence and the words in the other sentence [17]. Wang et al. [18] describe the use of similar and dissimilar parts to represent words using word embedding. They then used cosine similarity to create a similarity matrix. Furthermore, a semantic matching function was used to create a semantic matching vector for each word. They break down the resulting match vectors to discover which portions of each vector are similar and which are distinct; the similarity is finally calculated using these vectors. In [19], WordNet synonyms are utilized to extend the words of the original phrases, and a vector representation is then generated for these words in addition to the vectors



of the set of terms in each sentence. Finally, the cosine similarity of the two vectors is used to calculate the overall similarity.

3. Methodology

In this research, we present two hybrid approaches based on two different semantic techniques.

3.1 The First Proposed Approach

In the first approach, we present a hybrid methodology that combines two semantic similarity approaches, word-based and vector-based methods, to quantify the semantic similarity between two snipped Arabic texts. First, the vector space of each word is retrieved in lemma form using the fastText vector model. The StanfordNLP library was used to generate the lemma forms. The StanfordNLP package is employed to analyze natural language: it turns a string of human language text into lists of sentences and words, generates base forms of those words together with their parts of speech and morphological features, and provides a syntactic structural dependency parse, in parallel for over 70 languages [20]. In our technique, it is utilized for tokenization and lemmatization of both the fastText vector model and the input text; the following part goes through how this is done. Second, utilizing the vector word-spaces of the sentences, a word-matching matrix is created. Finally, the degree of similarity between the sentences is calculated.

Fig. (1) illustrates the suggested technique, which is divided into three stages:
1) Vector-Based Similarity,
2) Word-Based Similarity,
3) Similarity Measures.

3.1.1 Vector-Based Similarity

FastText models are available and readable for the Arabic language. The fastText models are trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives [21]. The Arabic fastText corpus contains more than 356 thousand words. These words are all distinct and represented in their surface form. A screenshot from this dataset is shown in fig. (2).

Fig. (1) The Proposed approach Overview.
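The preprocessing applied to the input text before stage one (noise removal, word normalization, stopword elimination, and stripping the optional diacritics) can be sketched as below. This is an illustrative sketch only: the stopword list, the normalization rules, and the example sentence are assumptions, not the paper's actual resources, and real lemmatization is done with StanfordNLP.

```python
import re

# Toy sketch of the input-text preprocessing: diacritic (tashkeel) removal,
# alef normalization, and stopword elimination. The stopword list and the
# normalization rules below are illustrative placeholders.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")          # harakat, tanween, etc.
ALEF_VARIANTS = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا"})
STOPWORDS = {"في", "من", "الى", "عن", "كل"}                 # tiny sample list

def normalize(text):
    text = DIACRITICS.sub("", text)         # drop optional diacritics
    return text.translate(ALEF_VARIANTS)    # unify alef forms

def preprocess(text):
    tokens = normalize(text).split()
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("تسطع الشمس في السماء كل صباح"))
```

The stopwords are kept behind a set so the "with/without stopwords" runs of the experiments amount to swapping in an empty set.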

Fig. (2) A screenshot from the fastText Arabic Model dataset
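Building the "mapped fastText model" of Section 3.1.1 can be sketched as follows: every entry of a `.vec`-style file is re-keyed by its lemma form, so one lemma can collect several vectors (the distinct indices mentioned later). This is a toy illustration: `lemma_stub` and the three sample lines stand in for StanfordNLP lemmatization and the real 300-dimensional, 356k-word Arabic model file.

```python
import numpy as np

def lemma_stub(word):
    # Placeholder lemmatizer for the sketch; the paper uses StanfordNLP
    # morphological analysis for real Arabic lemmatization.
    return word.rstrip("s")

def build_mapped_model(lines):
    # Parse "word v1 v2 ..." lines and group the vectors under lemma keys.
    mapped = {}
    for line in lines:
        word, rest = line.split(" ", 1)
        vec = np.array(rest.split(), dtype=float)
        mapped.setdefault(lemma_stub(word), []).append(vec)
    return mapped

# Tiny inline sample standing in for the Arabic .vec file.
sample = ["cat 0.1 0.2", "cats 0.1 0.3", "dog 0.9 0.8"]
mapped = build_mapped_model(sample)
print(sorted(mapped))        # ['cat', 'dog'] — "cats" collapses onto "cat"
print(len(mapped["cat"]))    # 2 vectors kept under one lemma key
```

Keeping a list of vectors per lemma preserves the fact that one lemma owns several model indices with distinct vectors.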




Many studies employed the vector space model's surface form to derive the semantic similarity between words, which does not cover additional semantically related terms. In this research, however, a lemmatization technique is used to improve the search word space for the fastText model words. Lemmatization is applied to the input text and to the fastText vector model, which is then referred to as the mapped fastText model. Some other preprocessing tools are applied to the text, such as noise removal, word normalization, and stopword elimination.

Two techniques are used to extract the semantic word space for each of the input words: the closest words algorithm, or a ready-made function built into the fastText module in the Python language.

3.1.1.1 Using the closest words algorithm

To extract the closest words for each word in the sentences, vectors from a word embedding or word representation are employed to extract semantic similarity. For related or near-synonymous words, these vectors share the same semantic properties. The suggested method extracts comparable words from the mapped fastText model using the preprocessed lemma form of the input words. For each word in the input text, there are numerous indices in the mapped model that contain the same word but distinct vectors. Fig. (3) shows the process of extracting the semantic word space for a specific word.

The mapped vector model is used to extract the closest words using Algorithm 1, where np is the NumPy (Numerical Python) library.

Fig. (3) The process of extracting the semantic word space for a word using the closest words algorithm

Algorithm 1: Finding the N closest words for a specific word

Input: the word vector, N
Output: similar words
1. difference = vectors of all words − vector of the word
2. delta = np.sum(difference * difference, axis=1)
3. i = np.argmin(delta)
4. similar word = word at index i
5. drop the index of this similar word
6. repeat the previous steps N times
7. return the similar words
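Algorithm 1 can be vectorized in NumPy: a single argsort over the squared distances replaces the repeated argmin-and-drop loop. A minimal, self-contained sketch with toy 2-D vectors (the real model is 300-dimensional; the words and vectors below are illustrative):

```python
import numpy as np

def closest_words(query_vec, all_vecs, all_words, n=10):
    # Algorithm 1, vectorized: squared Euclidean distance to every word,
    # then take the n smallest. If the query word itself is in the matrix
    # it ranks first with distance zero, mirroring the drop step.
    diff = all_vecs - query_vec             # step 1: all vectors minus query
    delta = np.sum(diff * diff, axis=1)     # step 2: one squared distance per word
    order = np.argsort(delta)               # steps 3-6 in one pass
    return [all_words[i] for i in order[:n]]

words = ["شمس", "قمر", "كتاب", "ضوء"]
vecs = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.2]])
print(closest_words(np.array([0.0, 0.0]), vecs, words, n=2))
```

The per-row `axis=1` sum is what makes `delta` hold one distance per vocabulary word rather than a single scalar.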

Fig. (4) The process of extracting the semantic word space for a word using the fastText library




The closest word can be found by iterating over the whole set of vectors of the mapped fastText model to obtain the index of the most comparable word. By repeating the loop N times with the same method, N related terms are discovered. The word space size for each word can be assigned; however, according to the mapped fastText model, each word has a distinct number of indexes. The main purpose is to increase the word search space by collecting more relevant comparable terms for a particular word. In this approach, we extract the 10 closest words for each index.

3.1.1.2 Using the fastText module

The language model of the fastText organization was released as a Python module [22]. Similar words can be retrieved using a built-in function named "get_nearest_neighbors" in this module. Like the word space for each word, this function returns the 10 closest neighbors of the searched word in its surface form. The process of extracting the semantic word space using the fastText module is shown in Fig. (4).

3.1.2 Word-Based Similarity

In this stage, we try to find the relatedness, or association, between words. From the semantic word spaces, a common word matrix is built. This matrix is constructed from the common words of each pair of words according to their semantic word spaces. From this matrix, a matching matrix is generated by selecting the most common words between each pair of words. The matching matrix consists of the words of the first sentence matched to some of the second sentence's words, together with the number of common words (NCW).

3.1.3 Similarity Measures

For two sentences s1 and s2 with lengths n and m respectively, the similarity score is measured by Equation 1, where p is the length of the matching matrix and score_k is the matching score of the k-th matched pair:

Sim(s1, s2) = ( Σ_{k=1}^{p} score_k ) / ( (n + m) / 2 )    (1)

The resulting value is a ratio on a 0-to-1 scale, so we multiply the output by 5 to map it onto a 0-to-5 scale. A value of zero indicates that the two sentences are quite different, and a value of five indicates that the two sentences are identical.

3.2 The Second Proposed Approach

In the second approach, a modified approach is presented to measure the similarity of two snipped Arabic texts lexically and semantically, based on the edit distance approach. This proposed approach is hybrid in the sense that both syntactic and semantic features are used to measure the similarity. Different knowledge resources are employed, such as the semantic word spaces. The approach also presents a solution to the mis-ordering of words between two given sentences. The modified edit distance approach is based on different weights (edit costs) according to the state of the two words.

The proposed workflow for measuring the edit cost between two words is shown in Fig. (5).
matching matrix consists of the words of the first

Fig. (5) The workflow diagram of finding the edit cost of two words




Algorithm 2: Finding the edit cost between two words

Input: pair of words S1[i], S2[j]
Output: edit cost
Let cost be the edit cost.
IF the word S1[i] = the word S2[j]
    then cost = 0;
ELSE IF S1[i] is one of the semantic space words of S2[j] according to the fastText module
    then cost = 1 − similarity score (SC);
ELSE IF S2[j] is one of the semantic space words of S1[i] according to the fastText module
    then cost = 1 − SC;
ELSE IF there are common words between the two semantic spaces obtained by the closest words algorithm
    then cost = 1 − similarity ratio (SR), where
        SR = NCW / max(|WS1|, |WS2|),
        NCW is the number of common words between WS1[i] and WS2[j],
        |WS1| is the length of the word space of word i in S1, which differs for each word, and
        |WS2| is the length of the word space of word j in S2, which differs for each word;
ELSE cost = 1;
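Algorithm 2, together with the token-level edit-distance matrix CM that consumes its costs, can be sketched as follows. The semantic resources are mocked with tiny dictionaries: `fasttext_neighbors` stands in for the module's scored neighbors and `closest_space` for the Algorithm-1 word spaces; all entries are illustrative assumptions.

```python
def edit_cost(w1, w2, fasttext_neighbors, closest_space):
    # Sketch of Algorithm 2.
    # fasttext_neighbors: word -> {neighbor: similarity score SC}
    # closest_space:      word -> set of closest words (Algorithm 1 output)
    if w1 == w2:
        return 0.0
    if w1 in fasttext_neighbors.get(w2, {}):        # cost = 1 - SC
        return 1.0 - fasttext_neighbors[w2][w1]
    if w2 in fasttext_neighbors.get(w1, {}):
        return 1.0 - fasttext_neighbors[w1][w2]
    ws1 = closest_space.get(w1, set())
    ws2 = closest_space.get(w2, set())
    ncw = len(ws1 & ws2)                            # number of common words
    if ncw:
        return 1.0 - ncw / max(len(ws1), len(ws2))  # cost = 1 - SR
    return 1.0

def similarity(s1, s2, cost_fn):
    # Token-level Levenshtein matrix CM: unit cost for insertion/deletion,
    # edit_cost for substitution; the result is normalized by max length.
    n, m = len(s1), len(s2)
    cm = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cm[i][0] = float(i)
    for j in range(1, m + 1):
        cm[0][j] = float(j)
    for i in range(n):
        for j in range(m):
            cm[i + 1][j + 1] = min(
                cm[i][j + 1] + 1.0,                 # deletion
                cm[i + 1][j] + 1.0,                 # insertion
                cm[i][j] + cost_fn(s1[i], s2[j]),   # substitution
            )
    return 1.0 - cm[-1][-1] / max(n, m)

neighbors = {"ذهب": {"مضى": 0.8}}   # mocked fastText neighbor scores
spaces = {"شمس": {"ضوء", "نهار"}, "قمر": {"ليل", "نهار"}}

cost = lambda a, b: edit_cost(a, b, neighbors, spaces)
print(round(cost("شمس", "قمر"), 2))                                 # shared closest words
print(round(similarity(["ذهب", "الولد"], ["مضى", "الولد"], cost), 2))
```

The cost function is consulted only for substitutions, so a semantically close pair of tokens shrinks the edit distance without touching the insertion and deletion costs.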
The frame algorithm for calculating the edit cost of two given words in two sentences is presented in Algorithm 2.

In this proposed approach, new features are combined to provide accurate measures of the similarity between two snipped Arabic texts. The following summarizes the features and proposed modules of the approach.

3.2.1 Token Lemma Level

The Levenshtein approach is one of the lexical similarity strategies based on edit distance metrics. The cost of changing one string into another is calculated using its methodology, which assigns a unit cost to all edit operations. This cost is utilized to create a character matrix between the two words, which then yields the edit distance. The suggested method extends the Levenshtein algorithm by computing edit distances at the token level rather than the character level as in the traditional algorithm. The tokens are represented by their lemma forms. In addition, the input text is pre-processed using the pre-processing tools mentioned in the first proposed approach.

3.2.2 Embedding Semantic Knowledge

The cost between each pair of words is calculated using the semantic word spaces produced by the two algorithms outlined in the first proposed approach. In Algorithm 2, the edit cost between each pair of words takes a different value according to their state. First, the semantic word space obtained from the fastText module is used. The ten nearest-neighbor words obtained from the fastText module come with a similarity score (SC) for each term. This score shows how closely the retrieved comparable word is semantically connected to the searched word. The edit cost between two words is set to (1 − SC) if the first word is one of the semantic space words of the second word, and vice versa. Otherwise, the semantic word spaces generated by Algorithm 1 are used: the common terms of the two semantic spaces are counted as NCW, and the similarity ratio SR is determined by dividing NCW by the maximum length of the two semantic spaces, which differs for each word. The cost will then be (1 − SR). Otherwise, the cost equals one.

For two sentences S1 and S2 with lengths n and m words respectively, a corresponding matrix CM is constructed depending on the edit cost between each pair of words. The value of CM[i+1][j+1] is as shown in Equation 2:

CM[i+1][j+1] = min { CM[i][j+1] + 1,  CM[i+1][j] + 1,  CM[i][j] + cost(S1[i], S2[j]) }    (2)

Finally, the edit distance between the two sentences is given by the final value CM[-1][-1]. The similarity between the two sentences is measured by Equation 3:

Sim(S1, S2) = 1 − CM[-1][-1] / max(n, m)    (3)

3.2.3 Applying Word Permutation

The Levenshtein method is sensitive to word order and word syntactic dependencies. For example, the two phrases "Every morning the sun shines in the sky" and "The sun shines in the sky every morning", in Arabic "تسطع الشمس في السماء كل صباح" and "كل صباح تسطع الشمس في السماء", have the same meaning, and their similarity should be 100 percent; however, when using the Levenshtein technique, the similarity would be zero owing to the incorrect word order. As a result, a permutation approach is employed to determine the optimal word alignment between the two texts. A permutation is a mathematical approach for determining the number of alternative arrangements of a set when the order of the arrangements matters. One of the two phrases is rearranged n! times, where n is the sentence's word length. In our approach, we apply this technique to five words only, to reduce the complexity of permutation




operations. For each candidate sequence, the edit distance is calculated. The candidate sequence with the shortest edit distance is chosen as the one that most accurately matches the alignment of the words in the two sentences.

4. Experiments

4.1 Dataset Description

We utilized the dataset from the Semantic Evaluation "SemEval" yearly competition to assess the performance of our approach. This event covers a variety of languages and tracks. The semantic similarity between texts (word phrases, sentences, paragraphs, or full documents) is one of these tracks. Furthermore, the text might be in monolingual or multilingual formats. 2017 was the final year of the competition that ran the semantic similarity track.

Development Sets and Evaluation Sets are the two datasets included in the released data. One of the Development Sets for monolingual Arabic snipped texts is the STS-MSRvid dataset(1), which contains 368 pairs of sentences. The Evaluation Sets(2) are a collection of 250 sentence pairs with human judgment ratings that were published as the Evaluation Gold Standard(3). For these datasets, the Pearson Correlation Coefficient (PCC) between each pair of sentences is supplied in a one-column table. This coefficient runs from -1 to 1, with (-1) indicating that the values of the two columns are completely opposite and (1) indicating that they are identical. This coefficient is expressed in Equation 4:

PCC = Σ_i (x_i − x̄)(y_i − ȳ) / ( √(Σ_i (x_i − x̄)²) · √(Σ_i (y_i − ȳ)²) )    (4)

where x̄ is the mean of x, which is defined by Equation 5:

x̄ = (Σ_i x_i) / n    (5)

4.2 Experiments Evaluations

The two proposed approaches are evaluated on two different datasets in two tests as follows.

4.2.1 The First Approach (Evaluation Gold Standard)

The first approach is evaluated using the Evaluation Gold Standard dataset. The experimental results are classified according to the algorithm applied for finding the closest words of the semantic word space of each word.

4.2.1.1 Using the closest words algorithm

In this experiment, two tests are carried out. The first test uses the mapped vector model with the input text words in their lemma form. The second uses the input text and the fastText vector model in their surface form. First, the input text is preprocessed with the preprocessing tools; the stopwords are eliminated in one run and kept in another. The resulting Pearson correlation coefficients are shown in Table (1).

The results in Table (1) show that applying the lemmatization technique to the input text with the mapped vector model has a better effect than using the fastText model and the input text in their surface form. In addition, removing the stopwords from the input text improved the results slightly.

Table (1) Experimental results using the closest words algorithm

Dataset With Pre-Processing   With Surface Form   With Lemma Form
With StopWords                0.6708              0.6886
Without StopWords             0.6887              0.7000

Table (2) Experimental results using the fastText module

The Dataset in Surface Form   Built-In Function   Closest Words Algorithm
With StopWords                0.6513              0.6708
Without StopWords             0.6679              0.6887

Table (3) Experimental results for studying the negation effect on the proposed approach

Dataset With Pre-Processing   Before Studying Negation   After Studying Negation
With StopWords                0.6924                     0.7052
Without StopWords             0.7039                     0.7212

(1) http://alt.qcri.org/semeval2017/task1/data/uploads/ar_sts_data_updated.zip
(2) http://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.eval.v1.1.zip
(3) http://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.gs.zip
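The PCC of Equations 4 and 5 can be computed directly; a minimal sketch with made-up score columns, cross-checked against NumPy's built-in routine:

```python
import numpy as np

def pcc(x, y):
    # Pearson correlation coefficient: center both columns on their means
    # (Equation 5), then divide the dot product by the product of the norms.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) /
                 (np.sqrt(np.sum(dx * dx)) * np.sqrt(np.sum(dy * dy))))

system = [4.5, 1.0, 3.2, 0.5]    # made-up system similarity scores (0-5 scale)
gold = [5.0, 0.8, 3.0, 1.1]      # made-up human judgments
print(round(pcc(system, gold), 4))
print(round(float(np.corrcoef(system, gold)[0, 1]), 4))  # same value
```

`np.corrcoef` serves as an independent check that the hand-rolled formula matches the library implementation.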




4.2.1.2 Using the fastText module

In this experiment, the proposed approach is applied with the input text in its surface form, to be compatible with the results of the built-in function "get_nearest_neighbors". The results shown in Table (2) are obtained with and without stopwords.

Table (2) shows that the proposed closest words algorithm achieved better results than the built-in function that performs the same task.

4.2.1.3 Studying the Negation Effect

Negation is a significant factor that influences a sentence's orientation [3]. Negation terms in Arabic include (ليس، ليست، لن، لا، عدم، لم). The meaning of a sentence is reversed by these negation words. In the proposed method, these negation terms are scanned for in each of the two input sentences. If a negation word is found in one sentence, the overall score will be reduced by 1.5, as long as the other sentence does not include negation terms. The negation presence was represented by a score of (-0.5), which substituted for the (+1) score from the common word score. The Pearson Correlation Coefficient of the entire dataset was modified by these scores, as seen in Table (3). In this last experiment, the Pearson Correlation Coefficient becomes close to the human judgment scores. The score of 0.7212 is the highest value obtained in applying the proposed approach.

4.2.1.4 Comparison with other approaches

Table (4) compares our proposed approach to the works in the AR-AR track of the SemEval 2017 competition that had the highest PCC among the participants. In Table (4), some researchers used the Google machine translator to increase the training dataset, as a requirement of the deep learning approach. Therefore, their results are much better than the results of the traditional approaches that use the native language. The proposed approach is ranked second after [13].

4.2.2 The Second Approach (STS-MSRvid)

The second approach is evaluated using the STS-MSRvid dataset. The experiments were carried out for each methodology, before and after applying the permutation feature, as shown in Table (5).

Table (4) Results of the SemEval participants on the Evaluation Sets

Language                     Researchers                PCC
Google Machine Translator    [8]                        0.7543
                             [6]                        0.7440
                             [7]                        0.7304
Native Language              [13]                       0.7463
                             First Proposed Approach    0.7212
                             [10]                       0.7158

Table (5) The proposed approach correlation results for the Development Sets

Input With Lemma Form                              PCC      PCC + Permutation
Modified Edit Distance (MED)   With StopWords      0.6402   0.7010
                               Without StopWords   0.6544   0.7314
MED + Semantic Word Space      With StopWords      0.6792   0.7349
                               Without StopWords   0.6835   0.7589

Table (6) The proposed approach correlation compared with similar works on the STS-MSRvid dataset

Used Technique                            Research                    PCC
Lexical & Semantic Similarity             Second Proposed Approach    0.759
Similarity features + Machine Learning    [15]                        0.743
Mean of IDF-weighted vectors              [13]                        0.691

The proposed approach is compared to other works that used the same dataset, as shown in Table (6). It demonstrates that our suggested approach has a higher PCC than the other research, implying an advantage over those methodologies.




5. Conclusion

In this paper, we proposed two hybrid approaches that combine different semantic similarity techniques for measuring the similarity between two short Arabic texts. The input text is preprocessed with tools such as normalization and removal of noise diacritics, which improved efficiency. Moreover, a lemmatization tool is applied to both the input text and the word embedding model to enrich the search word space. The semantic word spaces extracted by the closest-words algorithm or the fastText module are also lemmatized; applying lemmatization proved to have a great effect on the results. In addition, the permutation tool is applied to overcome the word-order problem, which affects similarity significantly. Finally, the experimental results of the two proposed approaches over two different datasets are satisfying and close to the best values reported by researchers who used the same datasets.

References
[1] W.H.Gomaa and A.A.Fahmy, "A Survey of Text Similarity Approaches," International Journal of Computer Applications, vol.68, pp.13-18, 2013.
[2] N.Y.Habash, "Introduction to Arabic Natural Language Processing," Synthesis Lectures on Human Language Technologies, vol.3(1), pp.1-187, 2010.
[3] S.Ismail, A.Alsammak and T.Elshishtawy, "A generic approach for extracting aspects and opinions of Arabic reviews," In Proceedings of the 10th International Conference on Informatics and Systems, May, pp.173-179, 2016.
[4] M.Farouk, "Measuring sentences similarity: a survey," arXiv preprint arXiv:1910.03940, 2019.
[5] V.Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol.10(8), pp.707-710, 1966.
[6] J.Tian, Z.Zhou, M.Lan and Y.Wu, "ECNU at SemEval-2017 Task 1: Leverage kernel-based traditional NLP features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity," In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), August, pp.191-197, 2017.
[7] J.Henderson, E.Merkhofer, L.Strickhart and G.Zarrella, "MITRE at SemEval-2017 Task 1: Simple semantic similarity," In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), August, pp.185-190, 2017.
[8] H.Wu, H.Y.Huang, P.Jian, Y.Guo and C.Su, "BIT at SemEval-2017 Task 1: Using semantic information space to evaluate semantic textual similarity," In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), August, pp.77-84, 2017.
[9] R.Navigli and S.P.Ponzetto, "BabelNet: Building a very large multilingual semantic network," In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July, pp.216-225, 2010.
[10] B.Hassan, S.AbdelRahman, R.Bahgat and I.Farag, "FCICU at SemEval-2017 Task 1: Sense-based language independent semantic textual similarity approach," In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), August, pp.125-129, 2017.
[11] Dzone, "https://dzone.com/articles/introduction-to-word-vectors", [Date accessed 25/03/2022], 2018.
[12] P.Bojanowski, E.Grave, A.Joulin and T.Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol.5, pp.135-146, 2017.
[13] E.Nagoudi, J.Ferrero and D.Schwab, "LIM-LIG at SemEval-2017 Task 1: Enhancing the semantic similarity for Arabic sentences with vectors weighting," In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp.134-138, 2017.
[14] M.A.Zahran, A.Magooda, A.Y.Mahgoub, H.Raafat, M.Rashwan and A.Atyia, "Word representations in vector space and their applications for Arabic," In International Conference on Intelligent Text Processing and Computational Linguistics, pp.430-443, Springer, Cham, 2015.
[15] M.Alian, A.Awajan, A.Al-Hasan and R.Akuzhia, "Building Arabic Paraphrasing Benchmark based on Transformation Rules," Transactions on Asian and Low-Resource Language Information Processing, vol.20(4), pp.1-17, 2021.
[16] R.Mihalcea, C.Corley and C.Strapparava, "Corpus-based and knowledge-based measures of text semantic similarity," In AAAI, vol.6, pp.775-780, 2006.
[17] S.Fernando and M.Stevenson, "A semantic similarity approach to paraphrase detection," In Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, pp.45-52, 2008.
[18] Z.Wang, H.Mi and A.Ittycheriah, "Sentence similarity learning by lexical decomposition and composition," arXiv preprint arXiv:1602.07019, 2016.



142 Arabic Semantic-Based Textual Similarity

[19] K.Abdalgader and A.Skabar, "Short-text similarity measurement using word sense disambiguation and synonym expansion," In Australasian Joint Conference on Artificial Intelligence, pp.435-444, Springer, Berlin, Heidelberg, 2010.
[20] StanfordNLP Package, "StanfordNLP 0.2.0 - Python NLP Library for Many Human Languages", "https://stanfordnlp.github.io/stanfordnlp/index.html", [Date accessed 25/03/2022].
[21] E.Grave, P.Bojanowski, P.Gupta, A.Joulin and T.Mikolov, "Learning word vectors for 157 languages," arXiv preprint arXiv:1802.06893, 2018.
[22] FastText Model, "Word vectors for 157 languages", "https://fasttext.cc/docs/en/crawl-vectors.html", [Date accessed 25/03/2022].


