
2.8 Data Mining

There are different measures that can be used to calculate similarity between data points, but there is no universally accepted measure. The document discusses calculating similarity between a query data point and database points using Euclidean distance, Manhattan distance, supremum distance, and cosine similarity. It also discusses normalizing the data and calculating Euclidean distance on the normalized data. The results show different rankings depending on the similarity measure used.

2.8 It is important to define or select similarity measures in data analysis. However, there is no commonly accepted subjective similarity measure. Results can vary depending on the similarity measures used. Nonetheless, seemingly different similarity measures may be equivalent after some transformation.

Suppose we have the following 2-D data set:

x1 = (1.5, 1.7)
x2 = (2.0, 1.9)
x3 = (1.6, 1.8)
x4 = (1.2, 1.5)
x5 = (1.5, 1.0)

(a) Consider the data as 2-D data points. Given a new data point, x = (1.4,1.6) as a query, rank the
database points based on similarity with the query using Euclidean distance, Manhattan distance,
supremum distance, and cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean distance on the
transformed data to rank the data points.
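Ans a) Formula for Euclidean distance,

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$

where x_k and y_k are the k-th attribute values of points x and y.

Therefore, d(x,x1)=0.141
d(x,x2)=0.671
d(x,x3)=0.283
d(x,x4)=0.224
d(x,x5)=0.608
Thus, the rank of the data points based on similarity with x using Euclidean distance, from most similar to least similar, is
x1, x4, x3, x5, x2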

Formula for Manhattan distance,

$$d(x, y) = \sum_{k=1}^{n} |x_k - y_k|$$

Therefore, d(x,x1)=0.2
d(x,x2)=0.9
d(x,x3)=0.4
d(x,x4)=0.3
d(x,x5)=0.7
Thus, the rank of the data points based on similarity with x using Manhattan distance, from most similar to least similar, is
x1, x4, x3, x5, x2

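Formula for Supremum distance,

$$d(x, y) = \max_{k} |x_k - y_k|$$

Therefore, d(x,x1)=0.1
d(x,x2)=0.6
d(x,x3)=0.2
d(x,x4)=0.2
d(x,x5)=0.6
Thus, the rank of the data points based on similarity with x using Supremum distance, from most similar to least similar, is
x1, followed by x3 and x4 (tied at 0.2), followed by x2 and x5 (tied at 0.6)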
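
The three distance rankings above can be checked with a short script. This is a minimal sketch, assuming the data points are the ones that appear in the worked calculations of this answer; the function names (`euclidean`, `manhattan`, `supremum`, `rank_by_distance`) are illustrative and not part of the original solution.

```python
import math

# Query point and database points, as used in the calculations of this answer.
query = (1.4, 1.6)
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}

def euclidean(a, b):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def manhattan(a, b):
    # Sum of the absolute coordinate differences.
    return sum(abs(p - q) for p, q in zip(a, b))

def supremum(a, b):
    # Largest absolute coordinate difference.
    return max(abs(p - q) for p, q in zip(a, b))

def rank_by_distance(dist):
    # Round to 4 decimals so floating-point noise does not reorder ties.
    scores = {name: round(dist(query, pt), 4) for name, pt in points.items()}
    # Smaller distance means more similar; the sort is stable, so tied
    # points keep their original order (x3 before x4, x2 before x5).
    ranking = sorted(scores, key=lambda name: scores[name])
    return ranking, scores

for label, dist in [("Euclidean", euclidean),
                    ("Manhattan", manhattan),
                    ("Supremum", supremum)]:
    ranking, scores = rank_by_distance(dist)
    print(label, ranking, scores)
```

Rounding before sorting is a deliberate choice here: differences such as 1.6 - 1.4 are not exactly 0.2 in binary floating point, so without it the tied supremum distances could be ordered arbitrarily.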
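Cosine similarity:

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$

where \|v\| is the Euclidean norm of a vector v = (v_1, v_2, ..., v_n), defined as

$$\|v\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$

$$\cos(x, x_1) = \frac{(1.4)(1.5) + (1.6)(1.7)}{\sqrt{1.4^2 + 1.6^2}\,\sqrt{1.5^2 + 1.7^2}} = \frac{2.1 + 2.72}{4.82004} = \frac{4.82}{4.82004} \approx 0.99999$$

$$\cos(x, x_2) = \frac{(1.4)(2) + (1.6)(1.9)}{\sqrt{1.4^2 + 1.6^2}\,\sqrt{2^2 + 1.9^2}} = \frac{5.84}{5.86491} \approx 0.99575$$

$$\cos(x, x_3) = \frac{(1.4)(1.6) + (1.6)(1.8)}{\sqrt{1.4^2 + 1.6^2}\,\sqrt{1.6^2 + 1.8^2}} = \frac{5.12}{5.12016} \approx 0.99997$$

$$\cos(x, x_4) = \frac{(1.4)(1.2) + (1.6)(1.5)}{\sqrt{1.4^2 + 1.6^2}\,\sqrt{1.2^2 + 1.5^2}} = \frac{4.08}{4.08397} \approx 0.99903$$

$$\cos(x, x_5) = \frac{(1.4)(1.5) + (1.6)(1.0)}{\sqrt{1.4^2 + 1.6^2}\,\sqrt{1.5^2 + 1.0^2}} = \frac{3.70}{3.83275} \approx 0.96536$$

Thus, the rank of the data points based on similarity with x using cosine similarity, from most similar to least similar, is x1, x3, x4, x2, x5.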
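
The cosine ranking can be reproduced in a few lines. This is a minimal sketch, assuming the same query and data points as in the calculations above; `cosine_similarity` is an illustrative helper name, not something taken from the original solution.

```python
import math

# Query point and database points, as used in the calculations of this answer.
query = (1.4, 1.6)
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the two Euclidean norms.
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

similarities = {name: cosine_similarity(query, pt) for name, pt in points.items()}

# A larger cosine means a smaller angle to the query, so sort in descending order.
cosine_ranking = sorted(similarities, key=similarities.get, reverse=True)

for name in cosine_ranking:
    print(name, round(similarities[name], 5))
```

Printing five decimals matters for this data set: x1 and x3 differ only in the fifth decimal place, so a four-decimal printout would show them as equal.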

b) norm(x) = sqrt{(1.4)^2 + (1.6)^2} ≈ 2.126
Normalized x is (1.4/2.126, 1.6/2.126) = (0.6585, 0.7526)
norm(x1) = sqrt{(1.5)^2 + (1.7)^2} ≈ 2.267
Normalized x1 is (1.5/2.267, 1.7/2.267) = (0.6616, 0.7498)
norm(x2) = sqrt{(2)^2 + (1.9)^2} ≈ 2.759
Normalized x2 is (2/2.759, 1.9/2.759) = (0.7250, 0.6887)
norm(x3) = sqrt{(1.6)^2 + (1.8)^2} ≈ 2.408
Normalized x3 is (1.6/2.408, 1.8/2.408) = (0.6644, 0.7474)
norm(x4) = sqrt{(1.2)^2 + (1.5)^2} ≈ 1.921
Normalized x4 is (1.2/1.921, 1.5/1.921) = (0.6247, 0.7809)
norm(x5) = sqrt{(1.5)^2 + (1.0)^2} ≈ 1.803
Normalized x5 is (1.5/1.803, 1.0/1.803) = (0.8321, 0.5547)

Formula for Euclidean distance is,

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$

Applied to the normalized points,
D(x,x1)=0.0041
D(x,x2)=0.0922
D(x,x3)=0.0078
D(x,x4)=0.0441
D(x,x5)=0.2632
Thus, the rank of the data points based on similarity with x using Euclidean distance in normalized form, from most similar to least similar, is
x1, x3, x4, x2, x5.
This is the same ranking as for cosine similarity: for vectors of norm 1, d(x, y)^2 = 2 - 2 cos(x, y), so a larger cosine always means a smaller Euclidean distance.
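
Part (b) can be checked the same way. This is a minimal sketch, assuming the same query and data points as in part (a); `normalize` and `euclidean` are illustrative helper names, not something taken from the original solution.

```python
import math

# Query point and database points, as used in the calculations of this answer.
query = (1.4, 1.6)
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}

def normalize(v):
    # Divide each coordinate by the Euclidean norm so the result has norm 1.
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

unit_query = normalize(query)
unit_points = {name: normalize(pt) for name, pt in points.items()}

# Euclidean distance between the normalized query and each normalized point.
normalized_distances = {name: euclidean(unit_query, pt)
                        for name, pt in unit_points.items()}

# Smaller distance means more similar.
normalized_ranking = sorted(normalized_distances, key=normalized_distances.get)

for name in normalized_ranking:
    print(name, round(normalized_distances[name], 4))
```

The distances to x1 and x3 are both below 0.01, which is why the normalized coordinates need at least four decimals for the ranking to be visible by hand.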
