Assignment No. 2: Similarity and Dissimilarity Measures
Text Mining
Ph.D. Scholar
University of Peshawar
1. INTRODUCTION
Similarity or distance measures are vital components for solving many pattern
recognition problems such as classification and clustering. These measures play an
increasingly central role in text-related research and applications, in tasks such as
text classification, topic tracking, question answering, short answer scoring, machine
translation, essay scoring, topic detection, and others. Different clustering algorithms need
a measure for determining how dissimilar two given documents are. This difference is
often quantified by measures such as Euclidean distance and cosine similarity.
These measures can also be used to identify the suitable clustering algorithm for a
specific problem.
Informally, the similarity between two objects is a numerical measure of the degree to
which the two objects are alike. Consequently, similarity is higher for pairs of objects
that are more alike. Similarities are usually non-negative and often lie between zero
(completely dissimilar) and one (practically identical).
On the other hand, the dissimilarity between two objects is the degree to which the two
objects are unlike; dissimilarities are lower for more similar pairs of objects. Commonly,
the term distance is used as a synonym for dissimilarity. Dissimilarities sometimes fall in
the interval [0, 1], but it is also common for them to range from 0 to infinity.
2. Literature review
There have been many surveys of the similarity and distance measures proposed in
different disciplines, since applying suitable measures results in more accurate data analysis.
Vijaymeena and Kavitha [1] discussed the similarity measures that are applied to text,
classifying them into three significant categories: corpus-based, string-based, and
knowledge-based. Choi et al. [2] carried out a comprehensive survey of binary measures:
they collected 76 binary similarity and dissimilarity (distance) measures proposed over the
last century and disclosed their correlations by means of hierarchical clustering.
Weller-Fahy et al. [3] presented an overview of the use of similarity and distance
(dissimilarity) measures within Network Intrusion Anomaly Detection (NIAD) research.
Irani, Pise, and Phatak [4] discussed various clustering techniques and the similarity
measures currently used in distance-based clustering. Shirkhorshidi et al. proposed a
technical framework to analyze, compare, and benchmark the effect of various similarity
measures on the results of distance-based clustering algorithms.
3. Similarity and distance measures
3.1 Levenshtein distance
The Levenshtein distance between two strings is the minimum number of edits required
to convert one string into the other, where the allowable edit operations are the deletion,
insertion, or replacement of a single character [11]. The Levenshtein algorithm computes
the distance between two strings x and y as follows:
1- Initialize a matrix M of size (|x|+1) x (|y|+1).
2- Fill the first column and row: $M_{i,0} = i$ and $M_{0,j} = j$.
3- Recursion:
$$M_{i,j} = \begin{cases} M_{i-1,j-1} & \text{if } x[i] = y[j] \\ 1 + \min(M_{i-1,j},\ M_{i,j-1},\ M_{i-1,j-1}) & \text{otherwise} \end{cases}$$
4- Distance: $\mathrm{lev}(x,y) = M_{|x|,|y|}$.
Levenshtein similarity: $\mathrm{sim}_{\mathrm{levenshtein}}(x,y) = 1 - \dfrac{\mathrm{lev}(x,y)}{\max(|x|,|y|)}$
For example, the Levenshtein distance between "kitten" and "sitting" is 3 (substitute k
with s, substitute e with i, and insert g), so the Levenshtein similarity is $1 - 3/7 \approx 0.57$.
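As a minimal sketch, the dynamic-programming algorithm above can be written in Python as follows (the function names levenshtein_distance and levenshtein_similarity are illustrative, not from any particular library):

```python
def levenshtein_distance(x: str, y: str) -> int:
    """Edit distance via the (|x|+1) x (|y|+1) matrix described above."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    # Borders: M[i][0] = i and M[0][j] = j.
    for i in range(m + 1):
        M[i][0] = i
    for j in range(n + 1):
        M[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:          # x[i] = y[j] in 1-based notation
                M[i][j] = M[i - 1][j - 1]
            else:
                # 1 + min(deletion, insertion, substitution)
                M[i][j] = 1 + min(M[i - 1][j], M[i][j - 1], M[i - 1][j - 1])
    return M[m][n]


def levenshtein_similarity(x: str, y: str) -> float:
    """Normalize the distance into a [0, 1] similarity."""
    if not x and not y:
        return 1.0
    return 1 - levenshtein_distance(x, y) / max(len(x), len(y))


print(levenshtein_distance("kitten", "sitting"))              # 3
print(round(levenshtein_similarity("kitten", "sitting"), 2))  # 0.57
```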
3.2 Jaro distance
The Jaro distance is used to measure the similarity between two strings: the higher the
value, the more similar the two strings are [5]. It computes the similarity between
strings x and y as follows:
1- Search for the characters common to both strings.
2- m: the number of matching characters. Two characters are considered matching only if
they are equal and not farther apart than the search range.
3- Search range for matching characters: $\left\lfloor \dfrac{\max(|x|,|y|)}{2} \right\rfloor - 1$
4- t: the number of transpositions (half the number of matching characters that appear in
a different order in the two strings).
5- $\mathrm{sim}_{\mathrm{jaro}}(x,y) = \dfrac{1}{3}\left(\dfrac{m}{|x|} + \dfrac{m}{|y|} + \dfrac{m-t}{m}\right)$
For example, for "MARTHA" and "MARHTA" we have m = 6 matching characters and
t = 1 transposition (the T/H pair), so $\mathrm{sim}_{\mathrm{jaro}} = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{5}{6}\right) \approx 0.944$.
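A minimal Python sketch of these steps (illustrative, not a library implementation; it assumes the standard Jaro matching-window rule described above):

```python
def jaro_similarity(x: str, y: str) -> float:
    """Jaro similarity following the five steps above."""
    if x == y:
        return 1.0
    if not x or not y:
        return 0.0
    # Search range: characters match only within this window of each other.
    search_range = max(max(len(x), len(y)) // 2 - 1, 0)
    x_matched = [False] * len(x)
    y_matched = [False] * len(y)
    m = 0  # number of matching characters
    for i, ch in enumerate(x):
        lo = max(0, i - search_range)
        hi = min(len(y), i + search_range + 1)
        for j in range(lo, hi):
            if not y_matched[j] and y[j] == ch:
                x_matched[i] = y_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: matched characters that are out of order, halved.
    x_seq = [ch for i, ch in enumerate(x) if x_matched[i]]
    y_seq = [ch for j, ch in enumerate(y) if y_matched[j]]
    t = sum(a != b for a, b in zip(x_seq, y_seq)) / 2
    return (m / len(x) + m / len(y) + (m - t) / m) / 3


print(round(jaro_similarity("MARTHA", "MARHTA"), 4))  # 0.9444
```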
3.3 Q-grams
A q-gram is a substring of length q; q-gram measures compare two strings through the
q-grams they have in common, which makes them another form of soft string matching.
3.4 Euclidean distance
The Euclidean distance between two points is computed from the numerical differences
of their coordinates. It is common to identify a point with its Cartesian coordinates [11].
If we have two points p1 and p2 on the real line, then the distance $\overline{p_1 p_2}$
between them is given by:
$$\overline{p_1 p_2} = |p_1 - p_2|$$
More generally, for two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ in n dimensions:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
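A minimal Python sketch of the general formula (using only the standard library):

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance((1, 2), (4, 6)))  # 5.0 (a 3-4-5 right triangle)
```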
3.5 Manhattan distance
The Manhattan distance measures the distance that would be traveled to get from one
point (X) to the other (Y) if a grid-like path is followed. The distance between two points
is the sum of the absolute differences of their corresponding components [5].
The distance between a point $X = (X_1, X_2, \dots, X_n)$ and a point $Y = (Y_1, Y_2, \dots, Y_n)$ is given by:
$$d = \sum_{i=1}^{n} |X_i - Y_i|$$
where n is the number of variables, and $X_i$ and $Y_i$ are the values of the i-th variable
at points X and Y respectively.
The difference between the Euclidean distance and the Manhattan distance is shown in Figure 1.
Figure (1): The difference between Euclidean distance and Manhattan distance
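A corresponding Python sketch; on the same pair of points as the Euclidean example, the grid path is longer than the straight line:

```python
def manhattan_distance(x, y):
    """Grid-path (city-block) distance between two coordinate sequences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


print(manhattan_distance((1, 2), (4, 6)))  # 7, versus a Euclidean distance of 5.0
```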
3.6 Hamming distance
The Hamming distance is a measure for comparing two binary strings of equal length. It
computes the distance as the number of positions at which the corresponding bits differ
[11]. For example, the Hamming distance between 10101 and 11001 is 2, because the
strings differ in their second and third positions.
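A minimal Python sketch (illustrative):

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("10101", "11001"))  # 2
```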
3.7 Cosine similarity
Cosine similarity is a metric used to compare how similar documents are without
consideration of their size. The cosine of the angle between two vectors is measured to
decide whether the documents' vectors are pointing in the same direction [11].
The cosine similarity is computed as follows:
$$\mathrm{similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$$
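A minimal Python sketch of this formula (using only the standard library; it does not guard against zero vectors):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


# Scaling a vector does not change the similarity (size-independence):
print(round(cosine_similarity((1, 2, 3), (2, 4, 6)), 4))  # 1.0
```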
3.8 Mahalanobis distance
The Mahalanobis distance is a measure of the distance between a vector and a set of
data, or a variant that measures the separation of two vectors drawn from the same dataset
[5]. For two vectors x and y and the covariance matrix S of the dataset, it is calculated as:
$$d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$$
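A minimal sketch assuming NumPy is available; the covariance matrix S is estimated from a small illustrative dataset, which must be non-degenerate for the inverse to exist:

```python
import numpy as np


def mahalanobis_distance(x, y, data):
    """Distance between vectors x and y under the covariance of `data`."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Covariance of the dataset; rows are observations, columns are variables.
    S = np.cov(np.asarray(data, dtype=float), rowvar=False)
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))


data = [[2.0, 2.0], [2.0, 5.0], [6.0, 5.0], [7.0, 3.0], [4.0, 7.0]]
print(round(mahalanobis_distance([2.0, 2.0], [6.0, 5.0], data), 4))
```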
3.11 Lesk
The Lesk measure computes the relatedness of two concepts as a function of the overlap
between their dictionary definitions (glosses).
4. Comparison of similarity and distance measures
In the following table, different similarity and distance measures are compared based
on the following criteria: equation, time complexity, advantages, disadvantages, and the
application areas for which each metric is suitable. As the table shows, each measure has
strengths and weaknesses; for example, Euclidean distance is very common, easy to
compute, and works well with compact datasets, but it is sensitive to outliers. Based on
that, the selection of the similarity measure determines the suitable clustering or pattern
recognition algorithm for a specific problem.
Table 1: Comparison of similarity and distance measures

Levenshtein distance
- Equation: $\mathrm{lev}(x,y) = M_{|x|,|y|}$ (Section 3.1)
- Time complexity: O(n*m)
- Applications: spelling correction, and all applications that benefit from soft matching of words, e.g. information retrieval, machine translation, etc.

Euclidean distance
- Equation: $d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Time complexity: O(n)
- Advantages: very common, easy to compute, and works well with datasets with compact or isolated clusters [7,10]
- Disadvantages: sensitive to outliers [7,10]
- Applications: k-means algorithm, fuzzy c-means algorithm [8]

Manhattan distance
- Equation: $d = \sum_{i=1}^{n} |x_i - y_i|$
- Time complexity: O(n)
- Advantages: common and, like other Minkowski-driven distances, works well with datasets with compact or isolated clusters [7]
- Disadvantages: sensitive to outliers [7,10]
- Applications: k-means algorithm

Chebyshev distance
- Equation: $d = \max_{i} |x_i - y_i|$
- Time complexity: O(n)
- Advantages: requires less time to decide the distances between the datasets [6]
- Disadvantages: requires more space
- Applications: chess, warehouse logistics, electronic CAM applications

Hamming distance
- Equation: number of positions at which two equal-length strings differ (Section 3.6)
- Time complexity: O(n)
- Applications: detection of errors in information transmission and telecommunication

Cosine measure
- Equation: $\frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$
- Time complexity: O(3n)
- Advantages: independent of vector length and invariant to rotation [10]
- Disadvantages: not invariant to linear transformations [10]
- Applications: mostly used in document similarity applications [8,10]

Mahalanobis distance
- Equation: $d(x,y) = \sqrt{(x-y)^T S^{-1} (x-y)}$ (Section 3.8)
- Time complexity: O(3n)
- Advantages: a data-driven measure that can ease the distance distortion caused by a linear combination of attributes [5]
- Disadvantages: can be expensive in terms of computation [10]
- Applications: hyperellipsoidal clustering algorithm [9]
5. Conclusion
In this work we discussed several similarity and distance metrics and compared some of
the similarity measures in terms of equation, time complexity, and other criteria. Each
measure has strengths and weaknesses, so when using any clustering or data mining
algorithm we should decide carefully which measure to use, because that choice will
affect the results we get.
References
[1] Vijaymeena, M. K., and Kavitha, K. (2016). A survey on similarity measures in text
mining. Machine Learning and Applications: An International Journal, 3, 19–28.
doi: 10.5121/mlaij.2016.3103
[2] Choi, S., Cha, S., and Tappert, C. C. (2010). A survey of binary similarity and
distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43–48.
[4] Irani, J., Pise, N., and Phatak, M. (2016). Clustering techniques and the similarity
measures used in clustering: a survey. International Journal of Computer Applications,
134(7), 9–14. doi: 10.5120/ijca2016907841
[6] Dahal, S. Effect of different distance measures in result of cluster analysis.