Text Similarity Algorithms
ALGORITHMS
GROUP 4
Ling How Wei (S50751)
Tew Eng Yeaw (S51467)
Teoh Yi Yin (S58798)
Contents
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7 Video Presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
References 26
1.2 History
In the 1960s, information retrieval systems found use in business and intelligence
applications. However, the first computer-based search systems were built in the late 1940s,
inspired by pioneering innovations in the first half of the 20th century. The
number of bits of information packed into a square inch of hard drive surface grew from
2,000 bits in 1956 to 100 billion bits in 2005 (Walter, 2005). Through high-speed
networks, the world can quickly access a vast amount of information, and searching is the only
practical way to find relevant items in such large text databases, so IR systems have become
ubiquitous.
Before computer-based IR systems, mechanical and electro-mechanical devices were used for search.
These traditional methods of managing large amounts of information derived
from the discipline of librarianship, where items were indexed using cataloguing schemes.
Catalogue cards punched with holes for each category were aligned with one another to determine
whether the collection contained an entry with a specific combination of categories: a match was
found if light could be seen through the aligned cards. The first person to
build such a system was Emanuel Goldberg, who worked on this problem in the 1920s and
1930s.
In 1948, Holmstrom described to the UK's Royal Society a "machine called the Univac" that could
search for text references associated with subject codes. The codes and text
were stored on magnetic steel tape (Holmstrom, 1948). Holmstrom stated that the
machine could process at a rate of 120 words per minute. This is the first recorded mention of
a computer being used to search for content.
In the 1960s, IR systems were greatly improved. Gerard Salton and his group
produced a large number of technical reports, establishing ideas and concepts that remain
central research topics today. One of these was the formalization of algorithms to
rank documents relative to a query, first proposed by Switzer, in which
documents and queries are treated as vectors in an N-dimensional space. Salton later
suggested measuring the similarity between a document vector and a query vector
using the cosine coefficient (Salton, 1968). Another innovation was the introduction
of relevance feedback, a process that supports iterative search in which previously
retrieved documents can be marked as relevant in the IR system (Sanderson & Croft,
2012). The clustering of documents with similar content was also examined as an IR
enhancement: the statistical association of terms with similar semantics increases the number
of documents matching a query by expanding it with lexical variants or semantically related
words. During this period, commercial search companies emerged from the development of
customized systems built for large companies and government organizations.
In the 1970s, one of the important developments was Luhn's term frequency (tf)
weighting, which was complemented by Spärck Jones's work on the occurrence of words across
a document collection. Her paper on inverse document frequency (idf) introduced the
idea that the frequency of occurrence of a word in a document collection was inversely
proportional to its significance in retrieval (Sanderson & Croft, 2012). In the effort to
formalize the retrieval process, Salton synthesized his group's work on
vectors into the vector space model. Robertson also defined the probability ranking
principle (Robertson & Jones, 1976), which determined how to optimally rank documents
based on probabilistic measures with respect to defined evaluation measures.
Between the 1980s and the mid-1990s, variations of tf-idf weighting schemes were
produced and the formal retrieval models were expanded. Advances on the basic vector
space model were also developed. The most famous is Latent Semantic Indexing (LSI),
in which the dimensionality of the vector space of a document collection is reduced
through singular value decomposition (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990),
and queries are mapped into the reduced space. Deerwester and his colleagues
claimed that the reduction caused words with common semantic meaning to be merged, resulting
in queries matching a wider range of relevant documents (Sanderson & Croft, 2012).
Another technique developed was stemming, the process of matching words to their
lexical variants.
In late 1993, Web search engines began to appear on the World Wide Web, which had been
created by Berners-Lee in late 1990. From the mid-1990s to the present, link analysis and anchor
text search have been two important developments. Both are related to earlier work on
using citation data for bibliometric analysis and searching, and on "spreading activation"
search in hypertext networks. The automatic use of information extracted
from search engine logs was also examined. During this period, the application of search
and the field of information retrieval continued to develop alongside changes in the computing
environment. The development of social search deals with searches involving user
communities and informal information exchange. New research in areas such as user
tagging, filtering and recommendation, and collaborative search has begun to provide
effective new tools for managing personal and social information.
2.1.1 Boolean
In the Boolean model, documents are represented as sets of terms and queries are expressed
as Boolean queries, i.e. combinations of the logical operators
AND, OR and NOT (Pasi & Tecnologie, 1999). The Boolean model is an old model that is
easy to understand. Its disadvantage is that the result is strictly
true or false; the lack of partial matching can be a problem when users search.
The Boolean model also does not rank the retrieved documents, making all retrieved results
equally important. A small sketch of Boolean retrieval is shown below.
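As an illustration of set-based retrieval (a minimal Python sketch with a made-up toy collection; the document names and terms are not from this report), a Boolean query such as "model AND search NOT vector" can be evaluated with set operations over an inverted index:

# Minimal Boolean retrieval sketch over a toy inverted index (illustrative data).
docs = {
    "d1": {"information", "retrieval", "search"},
    "d2": {"boolean", "model", "search"},
    "d3": {"vector", "space", "model"},
}

# Build an inverted index: term -> set of documents containing that term.
index = {}
for doc_id, terms in docs.items():
    for term in terms:
        index.setdefault(term, set()).add(doc_id)

# Boolean query: "model AND search NOT vector"
result = (index.get("model", set()) & index.get("search", set())) - index.get("vector", set())
print(sorted(result))  # ['d2'] -- a document either matches or not; no ranking is produced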
The concept of similarity underpins the Vector Space Model (VSM). The model assumes
that a document's relevance to a query is roughly equivalent to document-query similarity.
The bag-of-words concept is used to represent both documents and queries. Unlike the
Boolean model, the VSM allows the retrieved documents to be ranked and supports
feedback. In the VSM, documents and queries are represented as vectors.
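As a brief illustration of the bag-of-words representation (a hypothetical two-sentence example, not data from this project), a document and a query can each be turned into a term-count vector over a shared vocabulary:

from collections import Counter

# Illustrative document and query (hypothetical text).
document = "information retrieval finds relevant information"
query = "relevant information"

# The shared vocabulary defines the axes of the vector space.
vocabulary = sorted(set(document.split()) | set(query.split()))

# Bag-of-words: each text becomes a vector of term counts over the vocabulary.
doc_counts = Counter(document.split())
query_counts = Counter(query.split())
doc_vector = [doc_counts[t] for t in vocabulary]
query_vector = [query_counts[t] for t in vocabulary]

print(vocabulary)
print(doc_vector)
print(query_vector)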
When searching for information, the user is in most cases uncertain about what exactly is
being sought, and the retrieval of documents is likewise uncertain. The probabilistic
model is a framework that models these uncertainties. In IR, the uncertainty concerns
how the user's query should be understood and how satisfied the user will be with the
retrieved documents. The purpose of the probabilistic model is to
compute the probability that a retrieved document is relevant. The probabilistic model also
provides a ranking for relevance feedback, describing how likely a document is to be relevant
to the query. Given a query q and a collection of documents D, if a
document d in D is relevant to q then R_d,q = 1, otherwise R_d,q = 0.
Probability rules, such as Bayes' rule, are applied to calculate this probability.
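As an illustration of how Bayes' rule enters this framework (a standard textbook formulation rather than one quoted from this report), the probability of relevance can be written as:

P(R_{d,q} = 1 \mid d, q) = \frac{P(d \mid R_{d,q} = 1, q)\; P(R_{d,q} = 1 \mid q)}{P(d \mid q)}

Documents are then ranked by this estimated probability of relevance.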
This study was carried out by Puji Santoso, Pundhi Yuliawati, Ridwan Shalahuddin
and Aji Prasetya Wibawa. Its purpose was to compare the Levenshtein distance
and Damerau-Levenshtein distance algorithms to identify which algorithm
is better for Indonesian spelling correction. The methods used by Santoso et al.
(2019) in this project, shown in the flowchart in Figure 1, were as follows:
1. A dataset of two fairy tale stories was collected. The dataset contains 1266 words with
100 typing errors, collected from ceritadonenegrakyat.com. Data processing was then
performed to remove numbers and punctuation marks from each story.
2. The distance measurement with Damerau-Levenshtein was applied. The Levenshtein
distance algorithm uses almost the same operations as the Damerau-Levenshtein
distance algorithm, namely insertion, deletion and substitution, with Damerau-Levenshtein
adding a transposition step.
3. The suggestions for each wrong word are displayed, and the accuracy of the results is
calculated.
Figure 2 shows the corrections suggested by the two algorithms for the wrong words. Based on
that figure, Damerau-Levenshtein gives a better result than the Levenshtein distance. The
disadvantage of Damerau-Levenshtein is that it cannot correct two words that are joined
together without a space, as shown in Figure 3. Based on Figure 4, the accuracy of the
Damerau-Levenshtein algorithm in correcting the wrong words was 75%, higher than that of the
Levenshtein distance algorithm, which was about 73% (Santoso, Yuliawati, Shalahuddin,
& Wibawa, 2019).
This article was written by Viny Christanti Mawardi, Fendy Augusfian, Jeanny Pragantha,
and Stéphane Bressan, and its main focus is correcting spelling errors
using the Damerau-Levenshtein distance algorithm. According to the authors, when
teachers prepare exam questions, they re-examine what they have typed to
make sure there are no typographical errors in the exam paper. This work becomes difficult
when the questions cover grades 1 to 6 of elementary school.
There is a variety of spell-checking tools on the internet for detecting
spelling errors, but it is not common to find one for the Indonesian language.
To overcome this problem, the authors proposed an Indonesian-language spelling
error checking application to help teachers check the spelling of exam questions.
The data used in the research are exam test scripts. According to the authors,
teachers create a bank to store all assessment, quiz and exam questions. The figure
below shows an example of the exam test scripts used by the authors. An Indonesian
dictionary was used to provide suggestions for each erroneous word.
According to the authors, two types of test were carried out to check the spelling errors.
The first test used 50 sentences containing non-real-word errors, and the second used
15 questions from formatted exam scripts. The data underwent two different test
categories, Manual Correction and Automatic Correction. Manual Correction
lets the user choose the desired suggested word, while Automatic Correction automatically
replaces the error with the first-ranked word in the suggestion list. The figure below shows
the results produced by the algorithm. Manual Correction achieved the highest sentence
accuracy of 88%, while Automatic Correction achieved 70%. The word
accuracy of Manual Correction, at 84%, is also higher than that of Automatic Correction.
The time spent on Automatic Correction is shorter than on Manual Correction
(Christanti Mawardi, Augusfian, Pragantha, & Bressan, 2020).
This study was carried out by Nur Hamidah, Novi Yusliani, and Desty Rodiah.
The purpose of the project was to create a system that detects word errors automatically,
since correcting typos manually takes a long time to ensure that writing is free from typing
errors. In this study, the dictionary lookup method was used to find wrong
words. Pre-processing steps such as case folding (changing upper case to lower case) and
tokenizing (breaking a sentence into words) were carried out. N-grams were used to cut
the words into pieces of characters. Term Frequency-Inverse Document Frequency (TF-IDF)
was used to determine the importance of words in a document. The Damerau-Levenshtein
distance was used to determine the distance between words in order to produce
word candidates, and cosine similarity was then applied to sort them. Lastly, Mean
Reciprocal Rank (MRR) was used to evaluate the search rankings; a value of 1 is returned
when the target is ranked first for all displayed candidate results. The
data used in this project comprise four documents containing 30 deletion-type errors,
30 insertion-type errors, 30 transposition errors, and 30 substitution errors. Based on
Figure 6, insertion-type errors produced the highest percentage, 97.78%, because the
average word candidate was ranked first, while substitution gave the lowest percentage,
86%. The low percentage for the substitution type is perhaps caused by inaccuracy
in the ranking, where the target should have been in the top five (Hamidah, Yusliani, & Rodiah,
2020).
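For reference, Mean Reciprocal Rank is conventionally defined as follows (standard definition, not quoted from the paper), where rank_i is the position of the correct candidate for the i-th query:

\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}

A value of 1 therefore means every target word appeared at rank 1, consistent with the behaviour described above.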
This paper was written by two authors, Ahmed Adeeb Jalal and Basheer Husham
Ali. The project aimed to propose a classification approach that clusters documents
by similar scientific fields, making it easier for other researchers to find relevant research
papers. It addresses the problem that finding a suitable research paper with a normal
search process is challenging and time-consuming, especially when many sources must be
considered. The method used in this project is shown in Figure 7. A dataset
of about 518 research papers published from 2012 to 2019 in the Bulletin of Electrical
Engineering and Informatics (BEEI) journal was collected. The dataset is classified
into five clusters based on titles, abstracts, and keywords. A crawler algorithm is
applied to retrieve the required content from each paper. The topics of each cluster
then undergo text pre-processing to split the sentences into words. TF-IDF is carried
out to extract features and calculate the weight of each word. The authors suggested using
cosine similarity to measure the similarity of the content. Based on the results in Figure
8, the classification approach can classify more than 96% of the research papers correctly
based on similarity. The result was validated using precision and recall, as shown in Figure 9
(Jalal & Ali, 2021).
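For context, the precision and recall used in the validation are conventionally defined as (standard definitions, not taken from the paper):

\mathrm{Precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}, \qquad \mathrm{Recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}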
The keyword weights were computed before being passed to cosine similarity; the formula for
keyword weighting is given below, where W_i denotes the weight of keyword i, W_max denotes the
maximum weight, and W is the resulting normalised weight of the keyword.

W = W_i / W_max

For example, "Adenocarcinoma" has W_max = 9 and "scc" has W_i = 8, so W = 8/9 ≈ 0.89; the weight
of the keyword "scc" is therefore 0.89. The figure below shows the keyword weighting carried out
by the authors.
In the next step, the words contained in both the reference and the source document are
determined in order to perform cosine similarity. The keywords that appear in both documents
are: cell, high, compar, cancer and study. The figure below shows the frequency with which
these keywords appear in both documents.
After obtaining the frequencies of the shared keywords, the authors substitute the values into
the formula to calculate the weight of the reference document, the weight of the second
document, the weight in the reference document and the weight in the second document, which
are 2.11, 1.67, 12.22 and 8.67 respectively. These values are then substituted
into the cosine similarity formula, as shown in the figure below.
The resulting relevance between the two documents is 0.023. In cosine similarity, the less
similar the documents are, the closer the value is to 0. Thus, it can be concluded
that the second document is not similar to the reference document (Gunawan, Sembiring,
& Budiman, 2018).
3.1 Cosine
Cosine similarity measures the similarity between documents using the angle between their
vectors (Alake, 2021). When the angle between the vectors is 90 degrees the value is 0, and
when the angle is 0 degrees the value is 1. The closer the cosine value is to 1, the higher
the similarity between the two vectors. An example is shown below:
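Since the original worked example appears as a figure, the following is a minimal Python sketch of the same idea, using two pairs of hypothetical term-frequency vectors:

import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical term-frequency vectors.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # 1.0 -> same direction, angle 0 degrees
print(cosine_similarity([1, 0, 0], [0, 3, 0]))  # 0.0 -> orthogonal, angle 90 degrees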
3.2 Damerau-Levenshtein
The Damerau-Levenshtein distance measures the minimum number of changes required to turn one
word into another using the operations insertion, deletion, substitution and transposition
(Santoso et al., 2019).
The process of the Damerau-Levenshtein algorithm, illustrated in Table 2, is as follows:
1. Initialise n as the character length of the source and m as the character length of the
target. If both n and m equal 0, the distance returned is 0.
Target      C   A   K   E
Source  0   1   2   3   4
C       1   0   1   2   3
A       2   1   0   1   2
E       3   2   1   1   1
K       4   3   2   1   1
S       5   4   3   2   2
3. The first row (j = 0 . . . m) is initialised with the values 0 to m, and the first column
(i = 0 . . . n) is initialised with the values 0 to n.
4. Compare each character in the source and target. If s[i] = t[j], then cost = 0; otherwise
cost = 1.
5. Calculate min(x, y, z) and place it at position d[i, j], filling the matrix row by row.
Here x denotes the insertion operation, formulated as x = d[i − 1, j] + 1; y denotes the
deletion operation, formulated as y = d[i, j − 1] + 1; and z denotes the substitution
operation, formulated as z = d[i − 1, j − 1] + cost.
6. To transpose two characters, they must be positioned side by side and must already have
been compared in the previous steps. The transposition condition is: if i > 1 and j > 1 and
s[i] = t[j − 1] and s[i − 1] = t[j], then set d[i, j] = min(d[i, j], d[i − 2, j − 2] + cost).
7. Steps 4 to 6 are repeated until the maximum size of the matrix is reached.
8. The value in the bottom-right corner of the matrix is the Damerau-Levenshtein distance
score.
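The steps above can be summarised in a short Python sketch (the optimal-string-alignment form of the Damerau-Levenshtein distance; an illustrative implementation, not code taken from the report):

def damerau_levenshtein(source, target):
    """Optimal-string-alignment Damerau-Levenshtein distance (illustrative sketch)."""
    n, m = len(source), len(target)
    # d[i][j] holds the distance between source[:i] and target[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):          # step 3: first column 0..n
        d[i][0] = i
    for j in range(m + 1):          # step 3: first row 0..m
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1   # step 4
            d[i][j] = min(d[i - 1][j] + 1,          # step 5: x
                          d[i][j - 1] + 1,          # step 5: y
                          d[i - 1][j - 1] + cost)   # step 5: z
            # step 6: transposition of adjacent characters
            if i > 1 and j > 1 and source[i - 1] == target[j - 2] and source[i - 2] == target[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[n][m]                  # step 8: bottom-right corner of the matrix

print(damerau_levenshtein("CAEKS", "CAKE"))  # 2, matching the bottom-right value in Table 2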
4 Methodology
Information Retrieval Architecture
1. Input data
The source of the data used in this project is headlines written by three selected
newspapers: NST, Astro Awani and The Sun Daily. A headline is the text that describes the
content of an article; it is usually printed in large letters at the top of the newspaper,
so that readers of the headline know roughly what is written in the article.
2. Data pre-processing
Raw data is data that has not been processed; using raw data directly will affect the
accuracy of the results. Thus, the raw data needs to be pre-processed before it is used in
any other algorithm. Data pre-processing is the process of transforming raw data into an
understandable form. One operation applied to the raw data is stopword removal, which
filters out less meaningful words such as "is", "the" and "he"; this makes the similarity
analysis more efficient and accurate. After that, the data is stemmed to obtain the root
words. A minimal sketch of this step is given after this list.
3. Similarity Analysis
To determine the similarity of the given data, Cosine Similarity and the Damerau-Levenshtein
Distance were used in this project. Cosine similarity finds the angle between the vectors:
the larger the angle, the less similar the inputs. The Damerau-Levenshtein Distance
determines how many operations (insertion, deletion, substitution and transposition) are
needed to make the source string exactly match the target string.
4. Ranking
Before the results of cosine similarity and the Damerau-Levenshtein Distance (DLD) can be
compared, the DLD distance has to be converted into a similarity score. The formula below is
applied to the distance score, where DamerauLevenshtein(s, t) denotes the distance score and
max(|s|, |t|) denotes the maximum of the two string lengths:

SimilarityScore = 1 − DamerauLevenshtein(s, t) / max(|s|, |t|)

After calculating the similarity score, the value is compared with the cosine similarity.
The ranking of both algorithms is concluded and discussed in the next section.
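As referenced in the pre-processing step above, the following is a minimal sketch of tokenization, stopword removal and stemming in Python; the stopword list, the example headline and the use of NLTK's PorterStemmer are illustrative assumptions, since the report does not specify the exact tools used.

from nltk.stem import PorterStemmer

# A small illustrative stopword list; a full stopword list could be used instead.
STOPWORDS = {"is", "the", "he", "a", "an", "of", "to", "in", "and"}

def preprocess(text):
    # 1. Tokenization: split the headline into lowercase word tokens.
    tokens = text.lower().split()
    # 2. Stopword removal: drop words that carry little meaning.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. Stemming: reduce each remaining word to its root form.
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

# Hypothetical headline, for illustration only.
print(preprocess("The government is planning a new flood mitigation project"))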
The raw data in Table 3 undergoes three processes before it is ready to use.
The raw data is first tokenized, splitting the string into smaller units; Table 4 shows the
result of the data after tokenization.
After tokenization, the data is further processed by stopword removal, which removes the
meaningless words in the text; Table 5 shows the result of the data after stopword removal.
Finally, the data undergoes stemming, the process of reducing each word to its root word;
Table 6 shows the result of the data after stemming.
After data pre-processing, the data is ready for the next steps, in which it is processed
by two different algorithms to determine the similarity between the source texts.
After that, the result of applying TF-IDF is input to the similarity analysis.
Figure 15 shows the result of applying cosine similarity. The information that can be
obtained from this figure is the pairwise similarity between the three sample headlines:
each sub-array corresponds to one of the chosen newspapers, and each element of the array is
a similarity result. In the first row, the value 1 means the data is completely similar to
the first headline. Taking cos⁻¹(1) gives an angle of 0 degrees, which means the vectors
point in the same direction. A sketch of this step is shown below.
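A minimal sketch of this TF-IDF-then-cosine step, assuming scikit-learn and three hypothetical headlines (the actual headlines used in the project are not reproduced here), might look as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical headlines standing in for NST, Astro Awani and The Sun Daily.
headlines = [
    "flood victims evacuated as river levels rise",
    "river levels rise and flood victims are moved to relief centres",
    "new highway project approved by the government",
]

# Vectorize the headlines with TF-IDF, then compare every pair with cosine similarity.
tfidf = TfidfVectorizer().fit_transform(headlines)
scores = cosine_similarity(tfidf)

print(scores.round(4))  # 3x3 matrix; the diagonal is 1.0 because each headline matches itself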
Figure 17 shows the DLD of the NST headline compared with itself. The result is 0, which
means both headlines are identical.
After that, the distance score is converted to a similarity score. The length of the NST
headline is 55 and the length of the Astro Awani headline is 66, so the maximum string length
is 66. Substituting the values into the formula:

SimilarityScore = 1 − 31/66 = 0.5303

The similarity score for the DLD is therefore 0.5303. Comparing cosine similarity = 0.5376
with DLD = 0.5303, cosine similarity gives a slightly higher score than DLD.
6 Conclusion
Information retrieval is the process of retrieving information from a collection of documents
based on a user query. To study cosine similarity and the Damerau-Levenshtein distance
(DLD), five articles on these algorithms were reviewed, each applying them for a different
purpose. Most of the applications involve finding text relevance, such as detecting spelling
errors or computing the similarity score of two documents. News headlines from NST and Astro
Awani were used in this project for the text similarity analysis. The data was first
pre-processed using tokenization, stopword removal and stemming. Next, the data was used in
cosine similarity and DLD. For cosine similarity, the term frequencies have to be determined
before the similarity can be analysed; TF-IDF was used to vectorize the data, and the
vectorized data processed with cosine similarity gives a similarity score of 0.5376. For DLD,
the pre-processed data was input directly into the algorithm, giving a distance score of 31
and a similarity score of 0.5303. The similarity scores are generally much higher when using
pre-processed data than when using the raw data. In most cases, cosine similarity produces a
higher similarity score than DLD.
7 Video Presentation
Youtube Link: https://youtu.be/vgW2yZC12lM
References