
INFORMATION RETRIEVAL: TEXT SIMILARITY ALGORITHMS

July 20, 2021

GROUP 4
Ling How Wei (S50751)
Tew Eng Yeaw (S51467)
Teoh Yi Yin (S58798)
Contents

1 Introduction to Information Retrieval
    1.1 Overview
    1.2 History
2 Research Background and Literature Review
    2.1 Retrieval Models
        2.1.1 Boolean
        2.1.2 Vector Space Model
        2.1.3 Probabilistic Model
    2.2 Existing Works
        2.2.1 Damerau Levenshtein Distance for Indonesian Spelling Correction
        2.2.2 Spelling Correction Application with Damerau-Levenshtein Distance to Help Teachers Examine Typographical Error in Exam Test Scripts
        2.2.3 Spelling Checker using Algorithm Damerau Levenshtein Distance and Cosine Similarity
        2.2.4 Text Documents Clustering Using Data Mining Techniques
        2.2.5 The Implementation of Cosine Similarity to Calculate Text Relevance Between Two Documents
3 Lexical Text Similarity Model
    3.1 Cosine
    3.2 Damerau-Levenshtein
4 Methodology
5 Results and Discussion
    5.1 Cosine Similarity
    5.2 Damerau-Levenshtein Distance (DLD)
    5.3 Result Analysis
6 Conclusion
7 Video Presentation
References


1 Introduction to Information Retrieval


1.1 Overview
Information retrieval (IR) is a field of computer science that deals with collections
of unstructured or semi-structured data. According to the keywords specified in the
user query, the information resources related to the user's information need are obtained
from the data set. The objective of IR is to facilitate the rapid and accurate search of
text (Nadkarni, 2002) based on keywords specified in a user's query. To speed up retrieval,
there are several ways to electronically preprocess documents, all of which fall under the
general term 'indexing'. An index is a structure that helps to quickly locate information
resources of interest.
IR technology is the foundation of Web-based search engines. For decades, IR was an
'orphan' technology researched by scientists, with few commercial products and limited
functionality. With the spread of the World Wide Web (WWW), IR has become essential
because most of the information on the Web is in textual form. Many users rely on web
search engines such as Google, Bing, and Yahoo to find information on virtually any topic,
and the WWW has now replaced libraries as the primary reference tool for most people.
Information retrieval has many applications, one of the most important being blog
search. Searching can be performed on the basis of similarity, the process by which we
determine the relationship between text snippets (Pradhan, Gyanchandani, & Wadhvani, 2015).
Finding similarity between words is the basic building block of text similarity and serves
as the preliminary stage for sentence, paragraph, and document similarity (Gomaa, Fahmy,
et al., 2013). Words can be similar lexically or semantically: if they share similar character
sequences they are lexically similar, and if they share similar meanings they are semantically
similar. Lexical similarity is measured with string-based algorithms, which operate on string
sequences and character composition. Semantic similarity is measured with corpus-based and
knowledge-based algorithms: corpus-based similarity determines the similarity between words
from information obtained from a large corpus, while knowledge-based similarity uses the
information obtained from a semantic network to determine the similarity between words.

1.2 History
In the 1960s, information retrieval systems appeared in business and intelligence
applications. However, the first computer-based search systems were built in the late 1940s
and were inspired by pioneering innovations in the first half of the 20th century. The
number of bits of information packed into a square inch of hard drive surface grew from
2,000 bits in 1956 to 100 billion bits in 2005 (Walter, 2005). Through high-speed
networks, the world can quickly access large amounts of information. The only way to
find related items in these large text databases is to search, so IR systems have become
ubiquitous.
Before computer-based IR systems, mechanical and electro-mechanical devices were used
for search. These were traditional methods of managing large amounts of information that
derived from the discipline of librarianship. Items were indexed using cataloguing schemes.
Punched catalogue cards associated with categories were aligned with each other to determine
whether the collection contained an entry with a specific combination of categories: a match
was found if light could be seen through the aligned holes. The first person to build such a
system was Emanuel Goldberg, who worked on this problem in the 1920s and 1930s.
In 1948, Holmstrom described to the UK's Royal Society a "machine called the Univac"
that could search for text references associated with subject codes. The codes and text
were stored on magnetic steel tape (J.E.Holmstrom, 1948). Holmstrom stated that the
machine could process at a rate of 120 words per minute. This is the first known mention
of a computer being used to search for content.
In the 1960s, IR systems were greatly improved. Gerard Salton and his group produced
a large number of technical reports, establishing ideas and concepts that still form the
main fields of research today. One of these is the formalization of algorithms to rank
documents relative to a query, first proposed by Switzer: an approach in which documents
and queries are treated as vectors in an N-dimensional space. Later, Salton suggested
measuring the similarity between a document and a query vector using the cosine
coefficient (Salton, 1968). Another innovation was the introduction of relevance feedback,
a process that supports iterative search, where previously retrieved documents can be
marked as relevant in the IR system (Sanderson & Croft, 2012). Besides that, the
examination of the clustering of documents with similar content was another IR
enhancement: the statistical association of terms with similar semantics increases the
number of documents matching the query by expanding the query with lexical variants or
semantically related words. During this period, commercial search companies emerged from
the development of customized systems built for large companies or government organizations.
In the 1970s, one important development was Luhn's term frequency (tf) weighting,
which was complemented by Spärck Jones's work on the occurrence of words across the
documents of a collection. Her paper on inverse document frequency (idf) introduced the
idea that the frequency of occurrence of a word in a document collection was inversely
proportional to its significance in retrieval (Sanderson & Croft, 2012). In the study of
formalizing the retrieval process, Salton synthesized the output of his group's work on
vectors to produce the vector space model. Robertson also defined the probability ranking
principle (Robertson & Jones, 1976), which determined how to optimally rank documents
based on probabilistic measures with respect to defined evaluation measures.
Between the 1980s and mid-1990s, variations of tf·idf weighting schemes were produced
and the formal retrieval models were expanded. Advances on the basic vector space model
were developed, the most famous being Latent Semantic Indexing (LSI), in which the
dimensionality of the vector space of a document collection was reduced through
singular-value decomposition (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990).
Queries were mapped into the reduced space. Deerwester and his colleagues claimed the
reduction caused words with common semantic meaning to be merged, resulting in queries
matching a wider range of relevant documents (Sanderson & Croft, 2012). Another technique
developed in this period was stemming, a process of matching words to their lexical variants.
In late 1993, Web search engines began to appear on the World Wide Web, which had been
created by Berners-Lee in late 1990. From the mid-1990s to the present, link analysis and
anchor-text search have been two important developments. Both are related to early work
on using citation data for bibliometric analysis and search, and on "spreading activation"
search in hypertext networks. The automatic utilization of information extracted from
search engine logs was also examined. During this period, the applications of search and
the field of information retrieval continued to develop with changes in the computing
environment. The development of social search deals with searches involving user
communities and informal information exchange. New research in many areas, such as user
tagging, filtering and recommendation, and collaborative search, has begun to provide
effective new tools for managing personal and social information.

2 Research Background and Literature Review


2.1 Retrieval Models
Information retrieval (IR) is the part of computer science whose goal is to enable
users to obtain relevant documents through searching. The difference between IR and data
retrieval is that IR retrieves documents, while data retrieval retrieves data by matching
the keywords of the user query. An IR model is a model made up of algorithms that
determine which information is relevant to the user's need for a given query. IR models
can be classified into three types: the Boolean model, the Vector Space Model and the
Probabilistic Model (Saini, Singh, & Kumar, 2014).

2.1.1 Boolean

In the Boolean model, documents are represented as sets of terms, and queries are
represented as Boolean expressions combining the logical operators AND, OR and NOT
(Pasi & Tecnologie, 1999). The Boolean model is an old model that is easy to understand.
Its disadvantage is that the result is only true or false; the lack of partial matching can
be a problem when users search. The Boolean model also does not rank the retrieved
documents, making all retrieved results equally important.

2.1.2 Vector Space Model

The concept of similarity underpins the Vector Space Model (VSM). The model assumes
that a document's relevance to a query is roughly equivalent to document-query similarity.
The bag-of-words representation is used for both documents and queries. Unlike the Boolean
model, the VSM enables ranking of the retrieved documents and supports relevance feedback.
In the VSM, documents and queries are represented by vectors.

2.1.3 Probabilistic Model

When a user searches for information, in most cases the user is uncertain about what
they are looking for, and the retrieval of documents is likewise uncertain. The
probabilistic model is a framework that models these uncertainties. More specifically,
the uncertainty in IR concerns how the user query is interpreted and how satisfied the
user is with the retrieved documents. The purpose of the probabilistic model is to
compute the probability that a retrieved document is relevant. The probabilistic model
also provides a ranking based on relevance, describing how likely a document is to be
relevant to the query. Given a query q and a collection of documents D, if a document d
in D is relevant to q, then R_{d,q} = 1; otherwise R_{d,q} = 0. Probability rules, such
as Bayes' rule, are applied to calculate this probability.

2.2 Existing Works


2.2.1 Damerau Levenshtein Distance for Indonesian Spelling Correction

This study was carried out by Puji Santoso, Pundhi Yuliawati, Ridwan Shalahuddin
and Aji Prasetya Wibawa. Its purpose was to compare the Levenshtein distance algorithm
and the Damerau-Levenshtein distance algorithm to identify which one is better for
Indonesian spelling correction. The methods used by Santoso et al. (2019), whose flowchart
is shown in Figure 1, were as follows:

1. A dataset of two fairy tale stories was collected. The dataset contains 1266 words with
100 typing errors, collected from ceritadonenegrakyat.com. After that, data processing
was performed to remove numbers and punctuation marks from the stories.

2. Distance measurement with Damerau-Levenshtein was applied. The processes used in the
Levenshtein distance algorithm are almost the same as those in the Damerau-Levenshtein
distance algorithm: insertion, deletion, and substitution as usual, plus an additional
transposition step.

3. The suggestions for the wrong words are displayed, and the accuracy of the results is
calculated.

Figure 2 shows the wrong words suggested by the two algorithms. Based on that figure,
Damerau-Levenshtein gives a better result than the Levenshtein distance. The disadvantage
of Damerau-Levenshtein is that it cannot correct two words that are joined together without
spacing, as shown in Figure 3. Based on Figure 4, the accuracy of the Damerau-Levenshtein
algorithm in correcting the wrong words was 75%, higher than that of the Levenshtein
distance algorithm, which was about 73% (Santoso, Yuliawati, Shalahuddin, & Wibawa, 2019).

Figure 1: Flowchart program

Figure 2: Example of incorrect words

Figure 3: Word that cannot be suggested



Figure 4: Testing result

2.2.2 Spelling Correction Application with Damerau-Levenshtein Distance to Help Teachers Examine Typographical Error in Exam Test Scripts

This article was written by Viny Christanti Mawardi, Fendy Augusfian, Jeanny Pragantha,
and Stéphane Bressan; its main focus is correcting spelling errors using the
Damerau-Levenshtein distance algorithm. According to the authors, when teachers prepare
questions, they re-examine the questions they have typed to make sure there are no
typographical errors in the exam paper. The work becomes difficult when the questions
range from grades 1 to 6 of elementary school. There is a variety of spell-checking tools
on the internet, but it is not common to find one for the Indonesian language. To overcome
this problem, the authors proposed an Indonesian-language spelling-error checking
application to help teachers check the spelling of exam questions. The data used in the
research are exam test scripts; according to the authors, teachers create a bank to store
all assessment, quiz and exam questions. The figure below shows an example of the exam
test scripts used by the authors. An Indonesian dictionary was used to give suggestions
for the erroneous words.
According to the authors, two types of test were carried out to check the spelling errors.
The first test used 50 sentences containing non-real-word errors, and the second test used
15 questions from formatted exam scripts. The data underwent two different test categories,
Manual Correction and Automatic Correction. Manual Correction enables the user to choose
the desired suggested word, while Automatic Correction automatically corrects the error
with the first-ranked word among the suggestions. The figure below shows the results after
processing by the algorithm. Manual Correction achieved the highest sentence accuracy of
88%, while Automatic Correction gave a sentence accuracy of 70%. The word accuracy of
Manual Correction, at 84%, is also higher than that of Automatic Correction. The time
spent on Automatic Correction is shorter than on Manual Correction (Christanti Mawardi,
Viny, Augusfian, Fendy, Pragantha, Jeanny, & Bressan, Stéphane, 2020).

Figure 5: Sample Exam Test Scripts

2.2.3 Spelling Checker using Algorithm Damerau Levenshtein Distance and Cosine Similarity

This study was carried out by Nur Hamidah, Novi Yusliani, and Desty Rodiah.
The purpose of the project was to create a system that detects word errors automatically,
since correcting typos manually takes a long time if the writing is to be free from typing
errors. In this study, the dictionary lookup method was used to find the wrong words.
Pre-processing steps such as case folding (changing upper case to lower case) and
tokenizing (breaking a sentence into words) were carried out. N-grams were used to cut
the words into pieces of characters. Term Frequency-Inverse Document Frequency (TF-IDF)
was used to determine the importance of the words in a document. The Damerau-Levenshtein
distance was used to determine the distance between words and produce word candidates,
and cosine similarity was then applied to sort them. Lastly, Mean Reciprocal Rank (MRR)
was used to evaluate the search rankings; a value of 1 is returned when the target is
ranked first for all the displayed candidate results. The data used in this project
comprise four documents containing 30 deletion-type errors, 30 insertion-type errors,
30 transposition errors, and 30 substitution errors. Based on Figure 6, insertion-type
errors produced the highest percentage, 97.78%, because the average word candidate was
ranked first, while substitution gave the lowest percentage, 86%. The low percentage for
the substitution type is perhaps caused by inaccuracy in the ranking, where the correct
word should be within the top 5 (Hamidah, Yusliani, & Rodiah, 2020).

Figure 6: MRR value chart

2.2.4 Text Documents Clustering Using Data Mining Techniques

This paper was written by two authors, Ahmed Adeeb Jalal and Basheer Husham Ali.
The project aimed to propose a classification approach that clusters documents based on
similar scientific fields, making it easier for other researchers to find relevant research
papers. The motivation is that finding a suitable research paper with a normal search
process is challenging and time-consuming, especially when dealing with many sources.
The method used in the project is shown in Figure 7. A dataset of about 518 research
papers published from 2012 to 2019 in the Bulletin of Electrical Engineering and
Informatics (BEEI) journal was collected. The dataset was classified based on titles,
abstracts, and keywords into five clusters. A crawler algorithm was applied to retrieve
the required content from the papers. Then, the topics of each cluster underwent text
pre-processing to break the sentences into words. TF-IDF was carried out to extract
features and calculate the weights of the words. The authors used cosine similarity to
measure the similarity of the content. Based on the results in Figure 8, the
classification approach can classify more than 96% of the research papers based on
similarity. The results were validated using precision and recall, as shown in Figure 9
(Jalal & Ali, 2021).

Figure 7: Flow diagram of classification approach

Figure 8: Paper classification and distribution

Figure 9: Validation results



2.2.5 The Implementation of Cosine Similarity to Calculate Text Relevance Between Two Documents

This article was written by D. Gunawan, C. A. Sembiring and M. A. Budiman. The main
idea of the paper is to implement the cosine similarity algorithm to determine the text
relevance between two documents. According to the authors, existing work on finding
similarity is implemented in web crawlers, which means that similarity is compared only
between two web pages, not between the texts themselves. Other research reviewed by the
authors used a clustering algorithm to collect similar web pages and identify their
characteristics. The drawback of that particular research is that the keywords are not
determined before the data is passed into the algorithm.
According to the authors, the documents used in this article are divided into two
categories: the reference document and the source document. The reference document is the
document to which the source document is compared to find the similarity score. The data
was first pre-processed before proceeding to the similarity analysis. The first step is to
remove punctuation. After removing the punctuation, the data is transformed to lowercase
by case folding. Next, the data undergoes tokenization, where the strings are cut into
smaller pieces, and stop words are then removed from the data. The last pre-processing
step is stemming, where every word is converted to its stem. After stemming, the
frequency with which each keyword appears in the document is extracted. The figure below
shows the keywords extracted from the reference and source documents. The keywords are
assigned to the "Biology and medical" category.

Figure 10: Keyword Extraction

The keyword weights were calculated before being passed to cosine similarity; the
formula for keyword weighting is given below, where Wi denotes the weight of keyword i,
Wmax denotes the maximum keyword weight, and the result W is the normalised weight of
the keyword.

W = Wi / Wmax

For example, "Adenocarcinoma" has the maximum weight Wmax = 9 and "scc" has the weight
Wi = 8, so W = 8/9 ≈ 0.89. Thus, the weight of the keyword "scc" is 0.89. The figure
below shows the keyword weighting done by the authors.

Figure 11: Keyword Weighting

In the next step, the words contained in both the reference and source documents were
determined in order to perform cosine similarity. The keywords that appear in both
documents are: cell, high, compar, cancer and study. The figure below shows the frequency
with which these keywords appear in both documents.

Figure 12: Same keywords between reference and second document

After obtaining the frequencies of the shared keywords, the authors substituted the values
into the formulas to calculate the weight of the reference document, the weight of the
second document, the weight in the reference document and the weight in the second
document, which are 2.11, 1.67, 12.22 and 8.67 respectively. The resulting values were
then substituted into the cosine similarity formula, as shown in the figure below.

Figure 13: Cosine similarity calculation

The resulting relevance between the two documents is 0.023. For cosine similarity, the
less similar the documents are, the closer the value is to 0. Thus, it can be concluded
that the second document is not similar to the reference document (Gunawan, Sembiring,
& Budiman, 2018).

3 Lexical Text Similarity Model


The lexical text similarity models used in this project are cosine similarity and the
Damerau-Levenshtein distance.

3.1 Cosine
Cosine similarity is a measurement of the similarity between documents based on the
angle between their vectors (Alake, 2021). When the angle between the vectors is
90 degrees, the value is 0, and when the angle is 0 degrees, the value is 1. The closer
the cosine value is to 1, the higher the similarity between the two vectors. An example
is shown below:

Document 1: The burger is tasty.


Document 2: The burger is tasteless.

Step 1: Vectoring the text

Table 1: Text vectoring

Words Document 1 Document 2


The 1 1
Burger 1 1
Is 1 1
Tasty 1 0
Tasteless 0 1

Based on Table 1, define Document 1: [1, 1, 1, 1, 0] as vector A and Document 2:
[1, 1, 1, 0, 1] as vector B.
Step 2: Find the cosine similarity

• For cosine similarity, CS = (A · B) / (||A|| · ||B||)

• Dot product of A and B: 1·1 + 1·1 + 1·1 + 1·0 + 0·1 = 3

• Magnitude of vector A: √(1² + 1² + 1² + 1² + 0²) = 2

• Magnitude of vector B: √(1² + 1² + 1² + 0² + 1²) = 2

• Cosine similarity: 3 / (2 × 2) = 0.75 (75% similarity between the two sentences)
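
The worked example above can be reproduced with a short script. The sketch below is
illustrative only (it is not the project's actual code); it assumes simple lowercasing
and whitespace tokenisation, builds raw term-count vectors and applies the cosine formula.

    # Minimal cosine-similarity sketch over term-count vectors (assumed
    # tokenisation: lowercase + whitespace split), matching the example above.
    from collections import Counter
    from math import sqrt

    def cosine_similarity(text_a: str, text_b: str) -> float:
        a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        vocab = set(a) | set(b)
        dot = sum(a[w] * b[w] for w in vocab)          # A . B
        norm_a = sqrt(sum(v * v for v in a.values()))  # ||A||
        norm_b = sqrt(sum(v * v for v in b.values()))  # ||B||
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    print(cosine_similarity("The burger is tasty", "The burger is tasteless"))  # 0.75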

3.2 Damerau-Levenshtein
The Damerau-Levenshtein distance measures the minimum number of operations (insertion,
deletion, substitution and transposition) required to change one word into another
(Santoso et al., 2019). The process of the Damerau-Levenshtein algorithm, illustrated in
Table 2, is as follows:

1. Initialise n as the character length of the source and m as the character length of
the target. If both n and m equal 0, the distance is 0 (more generally, if one string is
empty, the distance is the length of the other).

Table 2: Example of calculation in damerau Levenshtein

Target C A K E
Source 0 1 2 3 4
C 1 0 1 2 3
A 2 1 0 1 2
E 3 2 1 1 1
K 4 3 2 1 1
S 5 4 3 2 2

2. Create a two-dimensional array d with n+1 rows and m+1 columns.

3. The first column is filled with the values 0 ... n, and the first row with the values
0 ... m.

4. Compare each character in the source and target. If s[i] = t[j], then the cost = 0;
otherwise the cost = 1.

5. Calculate min(x, y, z) and place it at position d[i, j], filling the matrix row by row.
Here x denotes the deletion operation, x = d[i − 1, j] + 1; y denotes the insertion
operation, y = d[i, j − 1] + 1; and z denotes the substitution operation,
z = d[i − 1, j − 1] + cost.

6. To transpose characters, the two characters must be adjacent and must have been
compared in the previous steps. The transposition condition is: if i > 1 and j > 1 and
s[i] = t[j−1] and s[i−1] = t[j], then set d[i, j] = min(d[i, j], d[i−2, j−2] + cost).

7. Steps 4 to 6 are repeated until the whole matrix is filled.

8. The value in the bottom-right corner of the matrix is the Damerau-Levenshtein distance
score.
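
As a concrete illustration of steps 1 to 8, the following sketch (not the project's actual
code) implements the restricted Damerau-Levenshtein distance, also known as the optimal
string alignment distance, and reproduces the Table 2 example, where the distance between
the source "CAEKS" and the target "CAKE" is 2.

    def damerau_levenshtein(source: str, target: str) -> int:
        # Restricted Damerau-Levenshtein (optimal string alignment) distance:
        # insertion, deletion, substitution and adjacent transposition.
        n, m = len(source), len(target)
        if n == 0:
            return m
        if m == 0:
            return n
        # (n+1) x (m+1) matrix, first column 0..n and first row 0..m
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if source[i - 1] == target[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                # transposition of two adjacent characters
                if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                        and source[i - 2] == target[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
        return d[n][m]

    print(damerau_levenshtein("CAEKS", "CAKE"))  # 2, matching Table 2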

4 Methodology
Information Retrieval Architecture

1. Input data
The data used in this project are headlines written by three selected newspapers: NST,
Astro Awani and The Sun Daily. A headline is the text that describes the content of an
article; it is usually printed in large letters at the top of the newspaper. People who
read the headline know roughly what is written in the article.

2. Data pre-processing
Raw data is data that has not been processed. Using raw data directly will affect the
accuracy of the result, so the raw data needs to be pre-processed before it is used in
any other algorithm. Data pre-processing is the process of transforming raw data into an
understandable form. One operation applied to the raw data is stop-word removal: stop
words with little meaning, such as "is", "the" and "he", are filtered out. This makes the
similarity analysis more efficient and accurate. After that, the data is stemmed to obtain
the root words. (A short pre-processing sketch is given after this list.)

3. Similarity Analysis
To determine the similarity of the data provided, Cosine Similarity and Damerau-Levenshtein
Distance were used in this project. Cosine Similarity measures the angle between the
vectors: the larger the angle, the less similar the inputs. Damerau-Levenshtein Distance
determines how many operations (insertion, deletion, substitution and transposition) are
needed to make the source string exactly match the target string.

4. Ranking
Before comparing the results of Cosine Similarity and Damerau-Levenshtein Distance (DLD),
the DLD value must first be converted to a similarity score. The formula below is applied
to the distance score, where DamerauLevenshtein(s, t) denotes the distance score and
max(|s|, |t|) denotes the maximum length of the two strings.

SimilarityScore = 1 − DamerauLevenshtein(s, t) / max(|s|, |t|)

After calculating the similarity score, the value is compared to the cosine similarity.
The rankings of both algorithms are concluded and discussed in the next section.
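
A minimal sketch of the pre-processing step is shown below. It assumes NLTK as the tooling
(the report does not name the exact libraries used), and applies tokenization, stop-word
removal and stemming to one of the sample headlines; the exact tokens produced may differ
slightly from Tables 4 to 6 depending on the tokenizer.

    # Hypothetical pre-processing pipeline using NLTK (assumed tooling).
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    def preprocess(headline: str) -> list:
        tokens = word_tokenize(headline)                          # tokenization
        stops = set(stopwords.words("english"))
        tokens = [t for t in tokens if t.lower() not in stops]    # stop-word removal
        stemmer = PorterStemmer()
        return [stemmer.stem(t) for t in tokens]                  # stemming

    print(preprocess("Malaysian visiting family among 32 new Covid-19 cases in Singapore"))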

5 Results and Discussion


Table 3 shows the sample news headlines that were used in this project. The data was
copied directly from articles published on the same day, 11 April 2021. The data in
Table 3 is not ready to use because it is still raw.

Table 3: News headlines as sample input data

NST | Astro Awani
Malaysian visiting family among 32 new Covid-19 cases in Singapore | COVID: M'sian visiting family member among 32 new cases in S'pore Saturday

The raw data in Table 3 undergoes three processes before it is ready to use. The raw data
is first tokenized, where the string is separated into smaller units. Table 4 shows the
result of tokenization.

Table 4: Tokenization of news headlines

NST: ['Malaysian', 'visiting', 'family', 'among', '32', 'new', 'Covid', '19', 'cases', 'in', 'Singapore']
Astro Awani: ['COVID', 'M', 'sian', 'visiting', 'family', 'member', 'among', '32', 'new', 'cases', 'in', 'S', 'pore', 'Saturday']

After tokenization, the data is further processed by stop-word removal. The purpose of
this step is to remove words that carry little meaning. Table 5 shows the data after
stop-word removal.

Table 5: Stop-word removal of news headlines

NST: ['Malaysian', 'visiting', 'family', 'among', '32', 'new', 'Covid', '19', 'cases', 'Singapore']
Astro Awani: ['COVID', 'M', 'sian', 'visiting', 'family', 'member', 'among', '32', 'new', 'cases', 'S', 'pore', 'Saturday']

Furthermore, the data undergoes stemming, the process of reducing each word to its root
word. Table 6 shows the data after stemming.

Table 6: Stemming of news headlines

NST: ['Malaysian', 'visiting', 'family', 'among', '32', 'new', 'Covid', '19', 'cases', 'Singapore']
Astro Awani: ['COVID', 'M', 'sian', 'visiting', 'family', 'member', 'among', '32', 'new', 'cases', 'S', 'pore', 'Saturday']

After pre-processing, the data is ready for the next steps, in which two different
algorithms are applied to determine the similarity between the source texts.

5.1 Cosine Similarity


Cosine similarity is used to measure how similar the strings are without considering the
size of the text. Before the data is input into the algorithm, the frequency of each word
appearing in the documents needs to be counted. TF-IDF was applied to perform the
frequency counting; in this case, the TfidfVectorizer from the scikit-learn Python library
was used, and its output is a sparse matrix. Figure 14 below shows the result after
processing by TF-IDF. The rows are the terms and the columns are the names of the
documents; a value of 0.000000 means the word does not appear in that particular document.

Figure 14: Result of TF-IDF

After that, the TF-IDF result is input to the similarity analysis. Figure 15 shows the
result after applying cosine similarity: the similarity values obtained by comparing the
three sample headlines. In the first row, the value 1 means the data is identical to the
first headline. In other words, each sub-array corresponds to one of the chosen newspapers
and each element of the array is a similarity result. Using a calculator, cos⁻¹(1) gives
an angle of 0, which means the vectors are close to each other.

Figure 15: Cosine Similarity of the inputted data
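
A short sketch of this step with scikit-learn's TfidfVectorizer is shown below. The
headline strings are taken from Table 6 and joined back into sentences for illustration;
the exact similarity values depend on the vectorizer settings, so only the shape of the
output (a symmetric matrix with 1.0 on the diagonal, as in Figure 15) should be read
from it.

    # TF-IDF vectorisation followed by pairwise cosine similarity (scikit-learn).
    # The headline strings below are illustrative pre-processed text from Table 6,
    # not the project's exact input.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    headlines = [
        "Malaysian visiting family among 32 new Covid 19 cases Singapore",         # NST
        "COVID M sian visiting family member among 32 new cases S pore Saturday",  # Astro Awani
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(headlines)  # sparse document-term matrix (cf. Figure 14)
    print(cosine_similarity(tfidf))              # pairwise similarity matrix (cf. Figure 15)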

5.2 Damerau-Levenshtein Distance (DLD)


To proceed with DLD, the pre-processed data was used in this section. DLD determines the
distance between two headlines: the minimum number of operations (insertion, deletion,
substitution and transposition) needed to make the source text match the target text. In
this case, NST was taken as the source and Astro Awani as the target. Figure 16 shows the
total number of steps taken to transform NST's headline into Astro Awani's headline; the
DLD distance between NST and Astro Awani is 31.

Figure 16: DLD of NST and Astro Awani

Figure 17 shows the DLD of NST compared to NST. The result is 0, which means that both
headlines are identical.

Figure 17: DLD of NST and NST

After that, the distance score is converted to a similarity score. The length of NST's
headline is 55 characters and the length of Astro Awani's headline is 66, so the maximum
string length is 66. Substituting these values into the formula:

SimilarityScore = 1 − DamerauLevenshtein(s, t) / max(|s|, |t|)
SimilarityScore = 1 − 31/66
SimilarityScore = 0.5303

The similarity score for DLD is 0.5303. Comparing cosine similarity = 0.5376 and
DLD = 0.5303, cosine similarity gives a higher score than DLD.
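
The conversion from distance to similarity score can also be written as a small helper
that takes the distance produced by the Section 3.2 sketch (or any DLD implementation)
together with the two string lengths; with the values reported above it gives
approximately 0.5303.

    def dld_similarity_score(distance: int, source_len: int, target_len: int) -> float:
        # SimilarityScore = 1 - DLD(s, t) / max(|s|, |t|), per the formula above.
        return 1 - distance / max(source_len, target_len)

    # Values reported in this section: distance 31, headline lengths 55 and 66.
    print(round(dld_similarity_score(31, 55, 66), 4))  # 0.5303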

5.3 Result Analysis


Sixteen news titles from NST and Astro Awani were collected. Table 7 shows the sample
data from NST and Astro Awani.

Table 7: NST and Astro Awani titles

No. | NST | Astro Awani
1 | Malaysian visiting family among 32 new Covid-19 cases in Singapore | COVID: M'sian visiting family member among 32 new cases in S'pore Saturday
2 | 4 Ah Long syndicates busted; 29 nabbed in Penang and Perak | Police bust four Ah Long syndicates with arrest of 29 in Penang, Perak - Bukit Aman
3 | Kelantan cancels Ramadan bazaars, terawih prayers | No Ramadan bazaars in Kelantan: Dr Izani
4 | Guns and silence to mark Prince Philip's death | Britain's Prince Philip, husband of Queen Elizabeth, dies aged 99
5 | Discuss ways to address loud exhaust issue - Wee | Govt agencies should discuss ways to address loud exhaust issue - Wee
6 | Muslims to begin fasting tomorrow | Malaysian Muslims to begin fasting tomorrow
7 | SUKE tragedy: Use full force of the law, says Lam Thye | SUKE tragedy: Roads along Persiaran Alam Damai to be closed for clean-up work
8 | Japan to release treated Fukushima water into the sea: PM | Japan says to release contaminated Fukushima water into sea
9 | Eddin Syazlee apologises for falling asleep at event | Eddin Syazlee apologises for falling asleep at huffaz graduation ceremony
10 | SRC was a sham set up as Najib's personal ATM | SRC International: Pasukan pembelaan Najib selesai hujah enam hari
11 | Seven districts in Kelantan, five in Sarawak under MCO after spikes in infections | MCO in seven districts in Kelantan from Friday
12 | IPOH: A 31-year-old woman died after she was stabbed by a 23-year-old man believed to be her boyfriend at her home in Taman Perpaduan, Tambun here last night. Perak Criminal Investigations Department chief Senior Assistant Commissioner Anuar Othman said K. Krishna was stabbed in the chest several times with a knife. Anuar said police received a phone call from the woman's neighbour at 8.45pm reporting a commotion at her house. | "IPOH: A single mother is believed to have been stabbed to death in front of her sons in her own house at Taman Perpaduan here last night. Perak Criminal Investigations Department chief SAC Anuar Othman said the police received a call about a fight between the 31-year-old victim and a 23-year-old male suspect at about 8.45 pm last night."
13 | Strict compliance with SOP at Ramadan buffets | Malaysians reminded to follow SOP at Ramadan bazaars
14 | 438,220 individuals complete both doses of Covid-19 vaccine | 438,220 individuals complete both doses of COVID-19 vaccine
15 | KUALA LUMPUR: Police have arrested graphic artist and activist Fahmi Reza. | Fahmi Reza arrested for allegedly insulting Queen
16 | Businessman remanded for assaulting bodyguards over fasting | Two 'bodyguards' were beaten, pointed a gun, threatened with death for fasting

The table below shows the comparison of the similarity values before and after pre-processing.

Table 8: Similarity Value (Cosine Similarity and DLD)

No. Cosine Similarity(Before) DLD(Before) Cosine Similarity(After) DLD(After)


1 0.2967 0.4583 0.5044 0.5161
2 0.2693 0.3976 0.5688 0.4688
3 0.0935 0.2692 0.3809 0.3556
4 0.00 0.2462 0.1598 0.2364
5 0.3609 0.6471 0.7995 0.7885
6 0.8467 0.7674 0.8182 0.7222
7 0.0935 0.3247 0.1191 0.4098
8 0.5884 0.6610 0.6328 0.7556
9 0.6499 0.6849 0.5727 0.6491
10 0.1191 0.3433 0.1598 0.3448
11 0.5803 0.3049 0.4743 0.4340
12 0.2854 0.4038 0.4917 0.4662
13 0.2532 0.4231 0.2258 0.4048
14 1.00 0.9322 1.00 1.00
15 0.2095 0.2667 0.6060 0.8333
16 0.1928 0.2857 0.1909 0.2400

6 Conclusion
Information retrieval is the process of retrieving information from a collection of
documents according to a user query. To study cosine similarity and the Damerau-Levenshtein
distance (DLD), five articles applying these algorithms for different purposes were
reviewed. Most of the applications involve finding text relevance, such as detecting
spelling errors and computing the similarity score of two documents. News headlines from
NST and Astro Awani were used in this project for text similarity analysis. The data was
first pre-processed using tokenization, stop-word removal and stemming. Next, the data was
used in cosine similarity and DLD. For cosine similarity, the term frequencies have to be
determined before the similarity can be analysed; TF-IDF was used to vectorize the data.
The vectorized data was processed by cosine similarity, and the similarity score is 0.5376.
For DLD, the pre-processed data was input directly into the algorithm; the distance score
is 31 and the similarity score is 0.5303. Most of the similarity scores are much higher
when using the pre-processed data than when using the raw data. In most cases, cosine
similarity produces a higher similarity score than DLD.

7 Video Presentation
Youtube Link: https://youtu.be/vgW2yZC12lM
References

Alake, R. (2021, April). Understanding cosine similarity and its application. Retrieved from https://towardsdatascience.com/understanding-cosine-similarity-and-its-application-fd42f585296a

Christanti Mawardi, Viny, Augusfian, Fendy, Pragantha, Jeanny, & Bressan, Stéphane. (2020). Spelling correction application with Damerau-Levenshtein distance to help teachers examine typographical error in exam test scripts. E3S Web Conf., 188, 00027. doi: 10.1051/e3sconf/202018800027

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

Gomaa, W. H., Fahmy, A. A., et al. (2013). A survey of text similarity approaches. International Journal of Computer Applications, 68(13), 13–18.

Gunawan, D., Sembiring, C. A., & Budiman, M. A. (2018, March). The implementation of cosine similarity to calculate text relevance between two documents. Journal of Physics: Conference Series, 978, 012120. doi: 10.1088/1742-6596/978/1/012120

Hamidah, N., Yusliani, N., & Rodiah, D. (2020). Spelling checker using algorithm Damerau Levenshtein distance and cosine similarity. Sriwijaya Journal of Informatics and Applications, 1(1).

Jalal, A. A., & Ali, B. H. (2021). Text documents clustering using data mining techniques. International Journal of Electrical & Computer Engineering (2088-8708), 11(1).

Holmstrom, J. E. (1948). Section III. Opening plenary session. In The Royal Society Scientific Information Conference, 21 June–2 July 1948: report and papers submitted. London: Royal Society.

Nadkarni, P. (2002). An introduction to information retrieval: applications in genomics. The Pharmacogenomics Journal, 2(2), 96–102.

Pasi, G., & Tecnologie, I. (1999, August). A logical formulation of the Boolean model and of weighted Boolean models.

Pradhan, N., Gyanchandani, M., & Wadhvani, R. (2015). A review on text similarity technique used in IR and its application. International Journal of Computer Applications, 120(9).

Robertson, S. E., & Jones, K. S. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129–146.

Saini, B., Singh, V., & Kumar, S. (2014, July). Information retrieval models and searching methodologies: Survey. International Journal of Advance Foundation and Research in Science Engineering (IJAFRSE), 1.

Salton, G. (1968). Automatic information organization and retrieval.

Sanderson, M., & Croft, W. B. (2012). The history of information retrieval research. Proceedings of the IEEE, 100(Special Centennial Issue), 1444–1451.

Santoso, P., Yuliawati, P., Shalahuddin, R., & Wibawa, A. P. (2019). Damerau Levenshtein distance for Indonesian spelling correction.

Walter, C. (2005). Insights: Kryder's law. Scientific American, 01-2005.
