Assignment No. 2: Similarity and Dissimilarity Measures
Text Mining
Ph.D. Scholar
University of Peshawar
1. INTRODUCTION
Similarity or distance measures are vital components for solving many pattern
recognition problems such as classification and clustering. These measures play an
increasingly central role in text-related research and applications, in tasks such as
text classification, topic tracking, question answering, short answer scoring, machine
translation, essay scoring, topic detection, and others. Different clustering algorithms need
a measure for determining how dissimilar two given documents are. This difference is
often quantified by measures such as Euclidean distance and cosine similarity.
These measures can also be used to identify the suitable clustering algorithm for a
specific problem.
Informally, the similarity between two objects is a numerical measure of the degree to
which the two objects are alike. Consequently, similarity is higher for pairs of objects
that are more alike. Similarities are usually non-negative and often lie between zero
(completely dissimilar) and one (practically identical).
On the other hand, the dissimilarity between two objects is the degree to which the two
objects are unlike; dissimilarities are lower for more similar pairs of objects. Commonly,
the term distance is used as a synonym for dissimilarity. Dissimilarities sometimes fall in
the interval [0, 1], but it is also common for them to range from 0 to infinity.
2. Literature review
There have been many surveys of the similarity and distance measures proposed in
different disciplines, since applying suitable measures results in more accurate data analysis.
Vijaymeena and Kavitha [1] discussed the similarity measures that are applied to text,
classifying them into three significant categories: corpus-based, string-based, and
knowledge-based. Choi et al. [2] carried out a comprehensive survey of binary measures:
they collected 76 binary similarity and dissimilarity (distance) measures proposed over the
last century and disclosed their correlations by means of hierarchical clustering.
Weller-Fahy et al. [3] presented an overview of the use of similarity and distance
(dissimilarity) measures within Network Intrusion Anomaly Detection (NIAD) research.
Irani, Pise, and Phatak [4] discussed various clustering techniques and the similarity
measures currently used in distance-based clustering. Shirkhorshidi et al. proposed a
technical framework to analyze, compare, and benchmark the effect of various similarity
measures on the results of distance-based clustering algorithms.
3. Similarity and distance measures
3.1 Levenshtein distance
The Levenshtein distance between two strings is the minimum number of edits required
to convert one string into the other, where the allowable edit operations are the deletion,
insertion, or replacement of a single character [11]. The Levenshtein algorithm computes
the distance between two strings x and y as follows:
1- Initialize a matrix M of size (|x|+1) x (|y|+1).
2- Fill the first column and row: $M_{i,0} = i$ and $M_{0,j} = j$.
3- Recursion:
$$M_{i,j} = \begin{cases} M_{i-1,j-1} & \text{if } x[i] = y[j] \\ 1 + \min(M_{i-1,j},\ M_{i,j-1},\ M_{i-1,j-1}) & \text{otherwise} \end{cases}$$
4- Distance: $\mathrm{lev}(x,y) = M_{|x|,|y|}$.
Levenshtein similarity: $\mathrm{sim}_{\mathrm{levenshtein}}(x,y) = 1 - \dfrac{\mathrm{lev}(x,y)}{\max(|x|,|y|)}$
For example, the Levenshtein distance between "kitten" and "sitting" is 3 (substitute k
with s, substitute e with i, and insert g), so the Levenshtein similarity is $1 - 3/7 \approx 0.57$.
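As a minimal sketch, the dynamic-programming algorithm above can be written in Python as follows (the function names levenshtein_distance and levenshtein_similarity are illustrative, not from any particular library):

```python
def levenshtein_distance(x: str, y: str) -> int:
    """Edit distance via the (|x|+1) x (|y|+1) matrix described above."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    # Borders: M[i][0] = i and M[0][j] = j.
    for i in range(m + 1):
        M[i][0] = i
    for j in range(n + 1):
        M[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:          # x[i] = y[j] in 1-based notation
                M[i][j] = M[i - 1][j - 1]
            else:
                # 1 + min(deletion, insertion, substitution)
                M[i][j] = 1 + min(M[i - 1][j], M[i][j - 1], M[i - 1][j - 1])
    return M[m][n]


def levenshtein_similarity(x: str, y: str) -> float:
    """Normalize the distance into a [0, 1] similarity."""
    if not x and not y:
        return 1.0
    return 1 - levenshtein_distance(x, y) / max(len(x), len(y))


print(levenshtein_distance("kitten", "sitting"))              # 3
print(round(levenshtein_similarity("kitten", "sitting"), 2))  # 0.57
```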
3.2 Jaro distance
The Jaro distance is used to measure the similarity between two strings: the higher the
value, the more similar the two strings are [5]. It computes the similarity between
strings x and y as follows:
1- Search for the characters common to both strings.
2- m: the number of matching characters. Two characters are considered matching only if
they are equal and not farther apart than the search range.
3- Search range for matching characters: $\left\lfloor \dfrac{\max(|x|,|y|)}{2} \right\rfloor - 1$
4- t: the number of transpositions (half the number of matching characters that appear in
a different order in the two strings).
5- $\mathrm{sim}_{\mathrm{jaro}}(x,y) = \dfrac{1}{3}\left(\dfrac{m}{|x|} + \dfrac{m}{|y|} + \dfrac{m-t}{m}\right)$
For example, for "MARTHA" and "MARHTA" we have m = 6 matching characters and
t = 1 transposition (the T/H pair), so $\mathrm{sim}_{\mathrm{jaro}} = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{5}{6}\right) \approx 0.944$.
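A minimal Python sketch of these steps (illustrative, not a library implementation; it assumes the standard Jaro matching-window rule described above):

```python
def jaro_similarity(x: str, y: str) -> float:
    """Jaro similarity following the five steps above."""
    if x == y:
        return 1.0
    if not x or not y:
        return 0.0
    # Search range: characters match only within this window of each other.
    search_range = max(max(len(x), len(y)) // 2 - 1, 0)
    x_matched = [False] * len(x)
    y_matched = [False] * len(y)
    m = 0  # number of matching characters
    for i, ch in enumerate(x):
        lo = max(0, i - search_range)
        hi = min(len(y), i + search_range + 1)
        for j in range(lo, hi):
            if not y_matched[j] and y[j] == ch:
                x_matched[i] = y_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: matched characters that are out of order, halved.
    x_seq = [ch for i, ch in enumerate(x) if x_matched[i]]
    y_seq = [ch for j, ch in enumerate(y) if y_matched[j]]
    t = sum(a != b for a, b in zip(x_seq, y_seq)) / 2
    return (m / len(x) + m / len(y) + (m - t) / m) / 3


print(round(jaro_similarity("MARTHA", "MARHTA"), 4))  # 0.9444
```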
3.3 Q-grams
A q-gram is a substring of length q; q-gram measures compare two strings through the
q-grams they have in common, which makes them another form of soft string matching.
3.4 Euclidean distance
The Euclidean distance between two points is computed from the numerical differences
of their coordinates. It is common to identify a point with its Cartesian coordinates [11].
If we have two points p1 and p2 on the real line, then the distance $\overline{p_1 p_2}$
between them is given by:
$$\overline{p_1 p_2} = |p_1 - p_2|$$
More generally, for two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ in n dimensions:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
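A minimal Python sketch of the general formula (using only the standard library):

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance((1, 2), (4, 6)))  # 5.0 (a 3-4-5 right triangle)
```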
3.5 Manhattan distance
The Manhattan distance measures the distance that would be traveled to get from one
point (X) to the other (Y) if a grid-like path is followed. The distance between two points
is the sum of the absolute differences of their corresponding components [5].
The distance between a point $X = (X_1, X_2, \dots, X_n)$ and a point $Y = (Y_1, Y_2, \dots, Y_n)$ is given by:
$$d = \sum_{i=1}^{n} |X_i - Y_i|$$
where n is the number of variables, and $X_i$ and $Y_i$ are the values of the i-th variable
at points X and Y respectively.
The difference between the Euclidean distance and the Manhattan distance is shown in Figure 1.
Figure (1): The difference between Euclidean distance and Manhattan distance
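A corresponding Python sketch; on the same pair of points as the Euclidean example, the grid path is longer than the straight line:

```python
def manhattan_distance(x, y):
    """Grid-path (city-block) distance between two coordinate sequences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


print(manhattan_distance((1, 2), (4, 6)))  # 7, versus a Euclidean distance of 5.0
```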
3.6 Hamming distance
The Hamming distance is a measure for comparing two binary strings of equal length. It
computes the distance as the number of positions at which the corresponding bits differ
[11]. For example, the Hamming distance between 10101 and 11001 is 2, because the
strings differ in their second and third positions.
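A minimal Python sketch (illustrative):

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("10101", "11001"))  # 2
```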
3.7 Cosine similarity
Cosine similarity is a metric used to compare how similar documents are without
consideration of their size. The cosine of the angle between two vectors is measured to
decide whether the documents' vectors are pointing in the same direction [11].
The cosine similarity is computed as follows:
$$\mathrm{similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$$
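A minimal Python sketch of this formula (using only the standard library; it does not guard against zero vectors):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


# Scaling a vector does not change the similarity (size-independence):
print(round(cosine_similarity((1, 2, 3), (2, 4, 6)), 4))  # 1.0
```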
3.8 Mahalanobis distance
The Mahalanobis distance is a measure of the distance between a vector and a set of
data, or a variant that measures the separation of two vectors drawn from the same dataset
[5]. For two vectors x and y and the covariance matrix S of the dataset, it is calculated as:
$$d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$$
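A minimal sketch assuming NumPy is available; the covariance matrix S is estimated from a small illustrative dataset, which must be non-degenerate for the inverse to exist:

```python
import numpy as np


def mahalanobis_distance(x, y, data):
    """Distance between vectors x and y under the covariance of `data`."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Covariance of the dataset; rows are observations, columns are variables.
    S = np.cov(np.asarray(data, dtype=float), rowvar=False)
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))


data = [[2.0, 2.0], [2.0, 5.0], [6.0, 5.0], [7.0, 3.0], [4.0, 7.0]]
print(round(mahalanobis_distance([2.0, 2.0], [6.0, 5.0], data), 4))
```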
3.11 Lesk
The Lesk measure computes the relatedness of two concepts as a function of the overlap
between their dictionary definitions (glosses).
4. Comparison of similarity and distance measures
In the following table, different similarity and distance measures are compared based
on the following criteria: equation, time complexity, advantages, disadvantages, and the
application areas for which each metric is suitable. As the table shows, each measure has
strengths and weaknesses; for example, Euclidean distance is very common, easy to
compute, and works well with compact datasets, but it is sensitive to outliers. Based on
that, the selection of the similarity measure determines the suitable clustering or pattern
recognition algorithm for a specific problem.
Table 1: Comparison of similarity and distance measures

Levenshtein distance
- Equation: $\mathrm{lev}(x,y) = M_{|x|,|y|}$ (Section 3.1)
- Time complexity: O(n*m)
- Applications: spelling correction, and all applications that benefit from soft matching of words, e.g. information retrieval, machine translation, etc.

Euclidean distance
- Equation: $d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Time complexity: O(n)
- Advantages: very common, easy to compute, and works well with datasets with compact or isolated clusters [7,10]
- Disadvantages: sensitive to outliers [7,10]
- Applications: k-means algorithm, fuzzy c-means algorithm [8]

Manhattan distance
- Equation: $d = \sum_{i=1}^{n} |x_i - y_i|$
- Time complexity: O(n)
- Advantages: common and, like other Minkowski-driven distances, works well with datasets with compact or isolated clusters [7]
- Disadvantages: sensitive to outliers [7,10]
- Applications: k-means algorithm

Chebyshev distance
- Equation: $d = \max_{i} |x_i - y_i|$
- Time complexity: O(n)
- Advantages: requires less time to decide the distances between the datasets [6]
- Disadvantages: requires more space
- Applications: chess, warehouse logistics, electronic CAM applications

Hamming distance
- Equation: number of positions at which two equal-length strings differ (Section 3.6)
- Time complexity: O(n)
- Applications: detection of errors in information transmission and telecommunication

Cosine measure
- Equation: $\frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$
- Time complexity: O(3n)
- Advantages: independent of vector length and invariant to rotation [10]
- Disadvantages: not invariant to linear transformations [10]
- Applications: mostly used in document similarity applications [8,10]

Mahalanobis distance
- Equation: $d(x,y) = \sqrt{(x-y)^T S^{-1} (x-y)}$ (Section 3.8)
- Time complexity: O(3n)
- Advantages: a data-driven measure that can ease the distance distortion caused by a linear combination of attributes [5]
- Disadvantages: can be expensive in terms of computation [10]
- Applications: hyperellipsoidal clustering algorithm [9]
5. Conclusion
In this work we discussed several similarity and distance metrics and compared some of
the similarity measures in terms of equation, time complexity, and other criteria. Each
measure has strengths and weaknesses, so when using any clustering or data mining
algorithm we should decide carefully which measure to use, because that choice will
affect the results we get.
References
[1] Vijaymeena, M. K., and Kavitha, K. (2016). A survey on similarity measures in text
mining. Machine Learning and Applications: An International Journal, 3, 19–28.
doi: 10.5121/mlaij.2016.3103
[2] Choi, S., Cha, S., and Tappert, C. C. (2010). A survey of binary similarity and
distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43–48.
[4] Irani, J., Pise, N., and Phatak, M. (2016). Clustering techniques and the similarity
measures used in clustering: a survey. International Journal of Computer Applications,
134(7), 9–14. doi: 10.5120/ijca2016907841
[6] Dahal, S. Effect of different distance measures in result of cluster analysis.