
Data Mining

Prof. Dr. Nizamettin AYDIN

[email protected]

http://www3.yildiz.edu.tr/~naydin

Data Mining

Similarity and Dissimilarity Measures
• Outline
– Similarity and Dissimilarity between Simple Attributes
– Dissimilarities between Data Objects
– Similarities between Data Objects
– Examples of Proximity
– Mutual Information
– Issues in Proximity
– Selecting the Right Proximity Measure
Similarity and Dissimilarity Measures
• Similarity and dissimilarity are important
because they are used by a number of data
mining techniques, such as clustering, nearest
neighbor classification, and anomaly detection.
• In many cases, the initial data set is not needed
once these similarities or dissimilarities have
been computed.
• Such approaches can be viewed as transforming
the data to a similarity (dissimilarity) space and
then performing the analysis.
Similarity and Dissimilarity Measures
• Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0, upper limit varies
– The term distance is used as a synonym for dissimilarity
• Proximity refers to a similarity or dissimilarity

Transformations
• often applied to convert a similarity to a
dissimilarity, or vice versa, or to transform a
proximity measure to fall within a particular range,
such as [0,1].
– For instance, we may have similarities that range from
1 to 10, but the particular algorithm or software
package that we want to use may be designed to work
only with dissimilarities, or it may work only with
similarities in the interval [0,1]
• Frequently, proximity measures, especially
similarities, are defined or transformed to have
values in the interval [0,1].
Transformations
• Example:
– If the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range [0, 1] by using the transformation s′ = (s − 1)/9, where s and s′ are the original and new similarity values, respectively.
• The transformation of similarities and dissimilarities to the interval [0, 1]:
– s′ = (s − smin)/(smax − smin), where smax and smin are the maximum and minimum similarity values.
– d′ = (d − dmin)/(dmax − dmin), where dmax and dmin are the maximum and minimum dissimilarity values.
Transformations
• However, there can be complications in mapping proximity
measures to the interval [0, 1] using a linear transformation.
– If, for example, the proximity measure originally takes values in the interval [0, ∞), then dmax is not defined and a nonlinear transformation is needed.
– Values will not have the same relationship to one another on the new scale.
• Consider the transformation d′ = d/(1 + d) for a dissimilarity measure that ranges from 0 to ∞.
– Given dissimilarities 0, 0.5, 2, 10, 100, 1000
– Transformed dissimilarities 0, 0.33, 0.67, 0.90, 0.99, 0.999
• Larger values on the original dissimilarity scale are compressed into the range of values near 1, but whether this is desirable depends on the application.
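A minimal Python sketch of both transformations may make this concrete (the example arrays and names are ours, not from the slides):

```python
import numpy as np

# Linear rescaling of similarities from a 1-10 scale to [0, 1]
sims = np.array([1.0, 4.0, 7.0, 10.0])
s_new = (sims - sims.min()) / (sims.max() - sims.min())
print(s_new)   # ≈ [0, 0.33, 0.67, 1]

# Nonlinear squashing of an unbounded dissimilarity to [0, 1)
diss = np.array([0.0, 0.5, 2.0, 10.0, 100.0, 1000.0])
d_new = diss / (1 + diss)
print(d_new)   # ≈ [0, 0.33, 0.67, 0.91, 0.99, 0.999]
```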
Similarity/Dissimilarity for Simple Attributes
• The following table shows the similarity and dissimilarity
between two objects, x and y, with respect to a single,
simple attribute.

Attribute Type      Dissimilarity                            Similarity
Nominal             d = 0 if x = y, d = 1 if x ≠ y           s = 1 if x = y, s = 0 if x ≠ y
Ordinal             d = |x − y|/(n − 1)                      s = 1 − d
                    (values mapped to integers 0 to n − 1)
Interval or Ratio   d = |x − y|                              s = −d, s = 1/(1 + d), or
                                                             s = 1 − (d − dmin)/(dmax − dmin)

• Next, we consider more complicated measures of proximity between objects that involve multiple attributes:
– dissimilarities between data objects
– similarities between data objects
Distances - Euclidean Distance
• The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by

d(x, y) = √( Σk=1..n (xk − yk)² )

– where n is the number of dimensions (attributes) and xk and yk are, respectively, the kth attributes (components) of data objects x and y.
• Standardization is necessary if scales differ.
Distances - Euclidean Distance
[scatter plot of the four points omitted]

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Euclidean distance matrix:

     p1      p2      p3      p4
p1   0       2.828   3.162   5.099
p2   2.828   0       1.414   3.162
p3   3.162   1.414   0       2
p4   5.099   3.162   2       0
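A short numpy sketch (variable names are ours) reproduces the distance matrix above:

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)  # p1..p4

# Pairwise Euclidean distances via broadcasting
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist, 3))
# [[0.    2.828 3.162 5.099]
#  [2.828 0.    1.414 3.162]
#  [3.162 1.414 0.    2.   ]
#  [5.099 3.162 2.    0.   ]]
```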
Distances - Minkowski Distance
• Minkowski Distance is a generalization of
Euclidean Distance, and is given by

d(x, y) = ( Σk=1..n |xk − yk|^r )^(1/r)

– where r is a parameter, n is the number of dimensions (attributes), and xk and yk are, respectively, the kth attributes (components) of data objects x and y.
Distances - Minkowski Distance
• The following are the three most common examples
of Minkowski distances.
– r = 1 , City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this for binary vectors is the Hamming distance,
which is just the number of bits that are different between two binary
vectors
– r = 2 , Euclidean distance (L2 norm)
– r = ∞ , Supremum (Lmax norm, L∞ norm) distance.
• This is the maximum difference between any component of the
vectors
• Do not confuse r with n, i.e., all these distances are
defined for all numbers of dimensions.
Distances - Minkowski Distance
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1 distance matrix (r = 1):

     p1   p2   p3   p4
p1   0    4    4    6
p2   4    0    2    4
p3   4    2    0    2
p4   6    4    2    0

L2 distance matrix (r = 2):

     p1      p2      p3      p4
p1   0       2.828   3.162   5.099
p2   2.828   0       1.414   3.162
p3   3.162   1.414   0       2
p4   5.099   3.162   2       0

L∞ distance matrix (r = ∞):

     p1   p2   p3   p4
p1   0    2    3    5
p2   2    0    1    3
p3   3    1    0    2
p4   5    3    2    0
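All three matrices can be reproduced from the same points; a minimal sketch, assuming numpy (names are ours):

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)
diff = np.abs(points[:, None, :] - points[None, :, :])

L1   = diff.sum(axis=-1)                  # r = 1: city block
L2   = np.sqrt((diff ** 2).sum(axis=-1))  # r = 2: Euclidean
Linf = diff.max(axis=-1)                  # r = ∞: supremum

print(L1, Linf, sep="\n")  # matches the L1 and L∞ matrices above
```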
Distances - Mahalanobis Distance
• Mahalanobis distance is the distance between a point and a distribution (not between two distinct points).
– It is effectively a multivariate equivalent of the Euclidean distance.
• It transforms the columns into uncorrelated variables,
• scales the columns to make their variance equal to 1,
• and finally calculates the Euclidean distance.
• It is defined as

mahalanobis(x, y) = (x − y)′ Σ⁻¹ (x − y)

– where Σ⁻¹ is the inverse of the covariance matrix of the data. (As used in the example two slides ahead, this is the squared form; some texts take the square root.)
Distances - Mahalanobis Distance
• In the Figure, there are 1000 points, whose x and y
attributes have a correlation of 0.6.
– The Euclidean distance
between the two large
points at the opposite
ends of the long axis of
the ellipse is 14.7, but
Mahalanobis distance is
only 6.
• This is because the
Mahalanobis distance
gives less emphasis to
the direction of largest
variance.
Distances - Mahalanobis Distance
• Covariance Matrix:

Σ = [ 0.3  0.2
      0.2  0.3 ]

• Points: A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)

[scatter plot showing A, B, and C omitted]

Mahal(A, B) = 5
Mahal(A, C) = 4
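These values can be checked with a few lines of numpy, using the squared form defined above (the helper name is ours):

```python
import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)

def mahal_sq(x, y, vi):
    """Squared Mahalanobis distance (x - y)' VI (x - y), as on the slide."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return d @ vi @ d

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahal_sq(A, B, cov_inv))  # ≈ 5.0
print(mahal_sq(A, C, cov_inv))  # ≈ 4.0
```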
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some
well-known properties.
• If d(x, y) is the distance between two points, x and y,
then the following properties hold.
– Positivity
• d(x, y) ≥ 0 for all x and y
• d(x, y) = 0 only if x = y
– Symmetry
• d(x, y) = d(y, x) for all x and y
– Triangle Inequality
• d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z
• Measures that satisfy all three properties are known as
metrics
Common Properties of a Similarity
• If s(x, y) is the similarity between points x and y,
then the typical properties of similarities are the
following:
– Positivity
• s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
– Symmetry
• s(x, y) = s(y, x) for all x and y
• For similarities, the triangle inequality typically
does not hold
– However, a similarity measure can be converted to a
metric distance
A Non-symmetric Similarity Measure Example

• Consider an experiment in which people are asked to classify a small set of characters as they flash on a screen.
– The confusion matrix for this experiment records
how often each character is classified as itself, and
how often each is classified as another character.
– Using the confusion matrix, we can define a
similarity measure between a character x and a
character y as the number of times that x is
misclassified as y,
• but note that this measure is not symmetric.

A Non-symmetric Similarity Measure Example

• For example, suppose that “0” appeared 200 times and was classified as a “0” 160 times, but as an “o” 40 times.
• Likewise, suppose that “o” appeared 200 times
and was classified as an “o” 170 times, but as “0”
only 30 times.
– Then, s(0,o) = 40, but s(o, 0) = 30.
• In such situations, the similarity measure can be
made symmetric by setting
– s′(x, y) = s′(y, x) = (s(x, y)+s(y, x))/2,
• where s′ indicates the new similarity measure.
Similarity Measures for Binary Data
• Similarity measures between objects that contain
only binary attributes are called similarity
coefficients, and typically have values between 0 and
1.
• Let x and y be two objects that consist of n binary
attributes.
– The comparison of two binary vectors, leads to the
following quantities (frequencies):
• f00 = the number of attributes where x is 0 and y is 0
• f01 = the number of attributes where x is 0 and y is 1
• f10 = the number of attributes where x is 1 and y is 0
• f11 = the number of attributes where x is 1 and y is 1
Similarity Measures for Binary Data
• Simple Matching Coefficient (SMC)
– One commonly used similarity coefficient is

SMC = (number of matching attribute values) / (number of attributes)
    = (f11 + f00) / (f01 + f10 + f11 + f00)

– This measure counts both presences and absences equally.
• Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.
Similarity Measures for Binary Data
• Jaccard Similarity Coefficient
– frequently used to handle objects consisting of asymmetric binary attributes

J = (number of 1-1 matches) / (number of non-zero attribute values)
  = f11 / (f01 + f10 + f11)

– Unlike the SMC, this measure ignores 0-0 matches.
• Consequently, the Jaccard coefficient is appropriate when 0-0 matches convey little information, as with asymmetric binary attributes (e.g., items absent from both of two market baskets).
SMC versus Jaccard: Example
• Calculate SMC and J for the binary vectors,
x = (1 0 0 0 0 0 0 0 0 0)
y = (0 0 0 0 0 0 1 0 0 1)

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
= (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = (f11) / (f01 + f10 + f11)
= 0 / (2 + 1 + 0) = 0
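The example can be checked with a small Python sketch (names are illustrative):

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = int(((x == 1) & (y == 1)).sum())
f00 = int(((x == 0) & (y == 0)).sum())
f10 = int(((x == 1) & (y == 0)).sum())
f01 = int(((x == 0) & (y == 1)).sum())

smc = (f11 + f00) / (f01 + f10 + f11 + f00)
jac = f11 / (f01 + f10 + f11)   # assumes at least one 1 somewhere
print(smc, jac)                 # 0.7 0.0
```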
Cosine Similarity
• Cosine Similarity is one of the most common
measures of document similarity
• If x and y are two document vectors, then

cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖) = x′y / (‖x‖ ‖y‖)

– where ′ indicates vector or matrix transpose, ⟨x, y⟩ indicates the inner product of the two vectors, ⟨x, y⟩ = Σk=1..n xk yk = x′y, and ‖x‖ is the length of the vector x, ‖x‖ = √( Σk=1..n xk² ) = √(x′x).
Cosine Similarity
• Cosine similarity really is a measure of the
(cosine of the) angle between x and y.
– Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for length.
– If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms (words).
• It can also be written as

cos(x, y) = ⟨ x/‖x‖ , y/‖y‖ ⟩

– i.e., the inner product of the two vectors after each has been normalized to unit length.
Cosine Similarity - Example
• Cosine Similarity between two document vectors
• This example calculates the cosine similarity for the
following two data objects, which might represent
document vectors:
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

28
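In code, assuming numpy (our own names):

```python
import numpy as np

x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(cos, 2))  # 0.31
```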
Extended Jaccard Coefficient
• Also known as Tanimoto Coefficient
• The extended Jaccard coefficient can be used for document data; it reduces to the Jaccard coefficient in the case of binary attributes.
• This coefficient, which we shall represent as EJ, is defined by the following equation:

EJ(x, y) = ⟨x, y⟩ / ( ‖x‖² + ‖y‖² − ⟨x, y⟩ )
Correlation
• used to measure the linear relationship between
two sets of values that are observed together.
– Thus, correlation can measure the relationship
between two variables (height and weight) or between
two objects (a pair of temperature time series).
• Correlation is used much more frequently to
measure the similarity between attributes
– since the values in two data objects come from
different attributes, which can have very different
attribute types and scales.
• There are many types of correlation
Correlation - Pearson’s correlation
• between two sets of numerical values, i.e., two vectors, x
and y, is defined by:

corr(x, y) = covariance(x, y) / (standard_deviation(x) · standard_deviation(y)) = sxy / (sx sy)

– where the following standard statistical notation and definitions are used:

covariance(x, y) = sxy = (1/(n − 1)) Σk=1..n (xk − x̄)(yk − ȳ)
standard_deviation(x) = sx = √( (1/(n − 1)) Σk=1..n (xk − x̄)² )
standard_deviation(y) = sy = √( (1/(n − 1)) Σk=1..n (yk − ȳ)² )
x̄ = (1/n) Σk=1..n xk is the mean of x, and ȳ = (1/n) Σk=1..n yk is the mean of y
Correlation – Example (Perfect Correlation)

• Correlation is always in the range −1 to 1.
– A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship;
• that is, xk = a yk + b, where a and b are constants.
• The following two pairs of vectors illustrate cases where the correlation is −1 and +1, respectively.

corr(x, y) = −1 (xk = −3 yk):  x = (−3, 6, 0, 3, −6),  y = (1, −2, 0, −1, 2)
corr(x, y) = +1 (xk = 3 yk):   x = (3, 6, 0, 3, 6),    y = (1, 2, 0, 1, 2)
Correlation – Example (Nonlinear Relationships)

• If the correlation is 0, then there is no linear relationship between the two sets of values.
– However, nonlinear relationships can still exist.
• In the following example, yk = xk², but their correlation is 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

mean(x) = 0, mean(y) = 4
std(x) = 2.16, std(y) = 3.74

[scatter plot of y versus x omitted]
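This is quick to verify with numpy (variable names are ours):

```python
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2                        # exact quadratic (nonlinear) relationship

print(np.corrcoef(x, y)[0, 1])    # 0.0: no linear relationship
```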
Visually Evaluating Correlation
• Scatter plots
showing the
similarity
from –1 to 1.

Correlation vs Cosine vs Euclidean Distance
• Compare the three proximity measures according to their
behavior under variable transformation
– scaling: multiplication by a value
– translation: adding a constant
Property                                Cosine   Correlation   Euclidean Distance
Invariant to scaling (multiplication)   Yes      Yes           No
Invariant to translation (addition)     No       Yes           No

• Consider the example
– x = (1, 2, 4, 3, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0)
– ys = y × 2 = (2, 4, 6, 8, 0, 0, 0), yt = y + 5 = (6, 7, 8, 9, 5, 5, 5)

Measure              (x, y)    (x, ys)   (x, yt)
Cosine               0.9667    0.9667    0.7940
Correlation          0.9429    0.9429    0.9429
Euclidean Distance   1.4142    5.8310    14.2127
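A sketch for experimenting with these invariance properties, assuming numpy (helper names are ours; printed values may be rounded differently than the table entries):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def correlation(a, b):
    return np.corrcoef(a, b)[0, 1]

def euclidean(a, b):
    return np.linalg.norm(a - b)

x  = np.array([1, 2, 4, 3, 0, 0, 0], dtype=float)
y  = np.array([1, 2, 3, 4, 0, 0, 0], dtype=float)
ys = y * 2   # scaled
yt = y + 5   # translated

for name, f in [("cosine", cosine), ("correlation", correlation), ("euclidean", euclidean)]:
    print(name, round(f(x, y), 4), round(f(x, ys), 4), round(f(x, yt), 4))
# cosine is unchanged by scaling but not translation;
# correlation is unchanged by both; Euclidean distance by neither.
```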
Correlation vs Cosine vs Euclidean Distance
• Choice of the right proximity measure depends on the
domain
• What is the correct choice of proximity measure for the
following situations?
– Comparing documents using the frequencies of words
• Documents are considered similar if the word frequencies are similar
– Comparing the temperature in Celsius of two locations
• Two locations are considered similar if the temperatures are similar
in magnitude
– Comparing two time series of temperature measured in
Celsius
• Two time series are considered similar if their shape is similar,
– i.e., they vary in the same way over time, achieving minimums and
maximums at similar times, etc.
Comparison of Proximity Measures
• Domain of application
– Similarity measures tend to be specific to the type of attribute
and data
– Record data, images, graphs, sequences, 3D-protein structure,
etc. tend to have different measures
• However, one can talk about various properties that you
would like a proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
• The measure must be applicable to the data and produce results that agree with domain knowledge
Information Based Measures
• Information theory is a well-developed and fundamental discipline with broad applications
• Some similarity measures are based on information theory
– Mutual information in various versions
– Maximal Information Coefficient (MIC) and related measures
– General and can handle non-linear relationships
– Can be complicated and time intensive to compute
Information and Probability
• Information relates to possible outcomes of an event
– transmission of a message, flip of a coin, or measurement of a piece of data
• The more certain an outcome, the less information it contains, and vice versa
– For example, if a coin has two heads, then an outcome of heads provides no information
– More quantitatively, the information is related to the probability of an outcome
• The smaller the probability of an outcome, the more information it provides, and vice versa
– Entropy is the commonly used measure
Entropy
• For
– a variable (event), X,
– with n possible values (outcomes), x1, x2 …, xn
– each outcome having probability, p1, p2 …, pn
– the entropy of X, H(X), is given by

H(X) = − Σi=1..n pi log2(pi)

• Entropy is between 0 and log2(n) and is measured in bits
– Thus, entropy is a measure of how many bits it takes to represent an observation of X on average
Entropy Examples
• For a coin with probability p of heads and
probability q = 1 − p of tails

H = − p log2(p) − q log2(q)

– For p = 0.5, q = 0.5 (fair coin), H = 1
– For p = 1 or q = 1, H = 0

• What is the entropy of a fair four-sided die?
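A minimal entropy helper in Python makes these examples easy to check (the function name is ours); the last line answers the four-sided-die question:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                              # 0 * log(0) is treated as 0
    return -(p * np.log2(p)).sum() + 0.0      # + 0.0 avoids printing -0.0

print(entropy([0.5, 0.5]))    # 1.0  (fair coin)
print(entropy([1.0, 0.0]))    # 0.0  (two-headed coin)
print(entropy([0.25] * 4))    # 2.0  (fair four-sided die)
```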
Entropy for Sample Data: Example

Hair Color   Count   p      −p log2 p
Black        75      0.75   0.3113
Brown        15      0.15   0.4105
Blond        5       0.05   0.2161
Red          0       0.00   0
Other        5       0.05   0.2161
Total        100     1.00   1.1540

• Maximum entropy is log2(5) = 2.3219
Entropy for Sample Data
• Suppose we have
– a number of observations (m) of some attribute, X,
e.g., the hair color of students in the class,
– where there are n different possible values
– and the number of observations in the ith category is mi
– Then, for this sample,

H(X) = − Σi=1..n (mi/m) log2(mi/m)

• For continuous data, the calculation is harder
Mutual Information
• used as a measure of similarity between two sets of
paired values that is sometimes used as an alternative
to correlation, particularly when a nonlinear
relationship is suspected between the pairs of values.
– This measure comes from information theory, which is the
study of how to formally define and quantify information.
– It is a measure of how much information one set of values
provides about another, given that the values come in
pairs, e.g., height and weight.
• If the two sets of values are independent, i.e., the value of one
tells us nothing about the other, then their mutual information is
0.

Mutual Information
• Mutual information is the information one variable provides about another.
• Formally, I(X, Y) = H(X) + H(Y) − H(X, Y), where H(X, Y) is the joint entropy of X and Y:

H(X, Y) = − Σi Σj pij log2(pij)

– where pij is the probability that the ith value of X and the jth value of Y occur together
• For discrete variables, this is easy to compute
• Maximum mutual information for discrete variables is log2(min(nX, nY)), where nX (nY) is the number of values of X (Y)
Mutual Information Example
• Evaluating Nonlinear Relationships with Mutual Information
– Recall the earlier example where yk = xk², but the correlation of x and y was 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

I(x, y) = H(x) + H(y) − H(x, y) = 2.8074 + 1.9502 − 2.8074 = 1.9502

[the slide's entropy tables for x, for y, and for the joint distribution are omitted here]
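A small sketch of this computation using only the standard library (the helper name is ours):

```python
from collections import Counter
from math import log2

def H(values):
    """Entropy (bits) of a sequence of observed values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

x = (-3, -2, -1, 0, 1, 2, 3)
y = tuple(v * v for v in x)
xy = list(zip(x, y))                  # joint observations

print(round(H(x) + H(y) - H(xy), 4))  # 1.9502
```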
Mutual Information Example
Student Status   Count   p      −p log2 p
Undergrad        45      0.45   0.5184
Grad             55      0.55   0.4744
Total            100     1.00   0.9928

Grade   Count   p      −p log2 p
A       35      0.35   0.5301
B       50      0.50   0.5000
C       15      0.15   0.4105
Total   100     1.00   1.4406

Student Status   Grade   Count   p      −p log2 p
Undergrad        A       5       0.05   0.2161
Undergrad        B       30      0.30   0.5211
Undergrad        C       10      0.10   0.3322
Grad             A       30      0.30   0.5211
Grad             B       20      0.20   0.4644
Grad             C       5       0.05   0.2161
Total                    100     1.00   2.2710

• Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624
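The same computation from the joint counts, as a Python sketch (names are ours):

```python
from math import log2

# Joint counts of (student status, grade) from the table above
joint = {("Undergrad", "A"): 5,  ("Undergrad", "B"): 30, ("Undergrad", "C"): 10,
         ("Grad", "A"): 30,      ("Grad", "B"): 20,      ("Grad", "C"): 5}
n = sum(joint.values())

def H(counts):
    return -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)

status, grade = {}, {}
for (s, g), c in joint.items():       # marginal counts
    status[s] = status.get(s, 0) + c
    grade[g]  = grade.get(g, 0) + c

print(round(H(status) + H(grade) - H(joint), 4))
# ≈ 0.1625 (the slide's 0.1624 comes from the rounded entropies above)
```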
Maximal Information Coefficient
• Applies mutual information to two continuous variables
• Consider the possible binnings of the variables into
discrete categories
– nX × nY ≤ N^0.6, where
• nX is the number of bins for X
• nY is the number of bins for Y
• N is the number of samples (observations, data objects)
• Compute the mutual information
– Normalized by log2(min(nX, nY))
• Take the highest value over all binnings

• Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel associations in large data sets." Science 334, no. 6062 (2011): 1518-1524.
General Approach for Combining Similarities

• Sometimes attributes are of many different types, but an overall similarity is needed.
– For the kth attribute, compute a similarity, sk(x, y), in the range [0, 1].
– Define an indicator variable, δk, for the kth attribute as follows:
• δk = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
• δk = 1 otherwise
– Compute the overall similarity as

similarity(x, y) = Σk=1..n δk sk(x, y) / Σk=1..n δk
Using Weights to Combine Similarities
• We may not want to treat all attributes the same.
– Use non-negative weights wk that sum to 1:

similarity(x, y) = Σk=1..n wk δk sk(x, y) / Σk=1..n wk δk

• Can also define a weighted form of the Minkowski distance:

d(x, y) = ( Σk=1..n wk |xk − yk|^r )^(1/r)
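A small sketch combining the last two slides, assuming numpy; the function and argument names are ours:

```python
import numpy as np

def combined_similarity(s, asym_zero, missing, w=None):
    """Overall similarity from per-attribute similarities s_k in [0, 1].

    asym_zero: True where the attribute is asymmetric and both objects are 0.
    missing:   True where either object is missing a value for the attribute.
    w:         optional non-negative attribute weights.
    """
    s = np.asarray(s, dtype=float)
    delta = (~(np.asarray(asym_zero) | np.asarray(missing))).astype(float)
    w = np.ones_like(s) if w is None else np.asarray(w, dtype=float)
    return (w * delta * s).sum() / (w * delta).sum()

# Three attributes; the second is an asymmetric 0-0 match, so it is skipped:
print(combined_similarity([0.8, 1.0, 0.4],
                          asym_zero=[False, True, False],
                          missing=[False, False, False]))  # (0.8 + 0.4) / 2 = 0.6
```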