DMi 03-Proximity
http://www3.yildiz.edu.tr/~naydin
Data Mining
Transformations
• often applied to convert a similarity to a
dissimilarity, or vice versa, or to transform a
proximity measure to fall within a particular range,
such as [0,1].
– For instance, we may have similarities that range from
1 to 10, but the particular algorithm or software
package that we want to use may be designed to work
only with dissimilarities, or it may work only with
similarities in the interval [0,1]
• Frequently, proximity measures, especially
similarities, are defined or transformed to have
values in the interval [0,1]. 5
Transformations
• Example:
– If the similarities between objects range from 1 (not at all similar)
to 10 (completely similar), we can make them fall within the range
[0, 1] by using the transformation s′ = (s − 1)/9, where s and s′ are
the original and new similarity values, respectively.
• The transformations of similarities and dissimilarities to the
interval [0, 1]:
– s′ = (s − smin)/(smax − smin), where smax and smin are the maximum
and minimum similarity values.
– d′ = (d − dmin)/(dmax − dmin), where dmax and dmin are the maximum
and minimum dissimilarity values.
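These min-max mappings can be sketched in a few lines of Python (the function names here are ours, not from the slides):

```python
# Sketch of the slide's transformations: the fixed rescaling s' = (s - 1)/9
# for similarities in [1, 10], and the general min-max mapping to [0, 1].
def rescale_1_to_10(s):
    """Map a similarity in [1, 10] to [0, 1] via s' = (s - 1)/9."""
    return (s - 1) / 9

def min_max(values):
    """Map each value v to (v - min)/(max - min), as in s' and d' above."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(rescale_1_to_10(10))     # -> 1.0
print(min_max([2, 4, 6, 10]))  # -> [0.0, 0.25, 0.5, 1.0]
```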
Transformations
• However, there can be complications in mapping proximity
measures to the interval [0, 1] using a linear transformation.
– If, for example, the proximity measure originally takes values in the
interval [0,∞], then dmax is not defined and a nonlinear
transformation is needed.
– Values will not have the same relationship to one another on the
new scale.
• Consider the transformation d=d/(1+d) for a dissimilarity
measure that ranges from 0 to ∞.
– Given dissimilarities 0, 0.5, 2, 10, 100, 1000
– Transformed dissimilarities 0, 0.33, 0.67, 0.90, 0.99, 0.999.
• Larger values on the original dissimilarity scale are
compressed into the range of values near 1, but whether this is
desirable depends on the application. 8
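The compression effect is easy to see by applying the transformation to the dissimilarities listed above:

```python
# The slide's nonlinear transformation d' = d/(1 + d) for
# dissimilarities in [0, infinity).
def squash(d):
    return d / (1 + d)

dissims = [0, 0.5, 2, 10, 100, 1000]
print([round(squash(d), 3) for d in dissims])
# large original values are compressed toward 1
```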
Similarity/Dissimilarity for Simple Attributes
• The following table shows the similarity and dissimilarity between
two objects, x and y, with respect to a single, simple attribute.

Attribute Type     Dissimilarity                      Similarity
Nominal            d = 0 if x = y, d = 1 if x ≠ y     s = 1 if x = y, s = 0 if x ≠ y
Ordinal            d = |x − y|/(n − 1)                s = 1 − d
                   (values mapped to integers 0 to
                   n − 1, where n is the number of
                   values)
Interval or Ratio  d = |x − y|                        s = −d, s = 1/(1 + d), or
                                                      s = 1 − (d − dmin)/(dmax − dmin)
Distances - Minkowski Distance
• Minkowski Distance is a generalization of Euclidean Distance, and is
given by
  d(x, y) = ( Σ_{k=1..n} |x_k − y_k|^r )^(1/r)
  where r is a parameter, n is the number of dimensions (attributes),
  and x_k and y_k are, respectively, the kth attributes of x and y.
Distances - Minkowski Distance
• The following are the three most common examples of Minkowski
distances.
– r = 1: City block (Manhattan, taxicab, L1 norm) distance.
  • A common example of this for binary vectors is the Hamming
  distance, which is just the number of bits that differ between the
  two binary vectors.
– r = 2: Euclidean distance (L2 norm).
– r = ∞: Supremum (Lmax norm, L∞ norm) distance.
  • This is the maximum difference between any component of the
  vectors.
• Do not confuse r with n, i.e., all these distances are defined for
all numbers of dimensions.
Distances - Minkowski Distance
• Example: four points in two dimensions.

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1     p2     p3     p4
p1    0      4      4      6
p2    4      0      2      4
p3    4      2      0      2
p4    6      4      2      0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1     p2     p3     p4
p1    0      2      3      5
p2    2      0      1      3
p3    3      1      0      2
p4    5      3      2      0

Distance Matrix
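The three distance matrices can be checked by computing them directly from the four points; a minimal sketch in NumPy (the `minkowski_matrix` helper is ours, not from the slides):

```python
import numpy as np

# Points p1..p4 from the slide's example
points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

def minkowski_matrix(X, r):
    """Pairwise Minkowski distances d(x, y) = (sum_k |x_k - y_k|^r)^(1/r).
    r = np.inf gives the supremum (L-infinity) distance."""
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            diff = np.abs(X[i] - X[j])
            D[i, j] = diff.max() if np.isinf(r) else (diff ** r).sum() ** (1 / r)
    return D

L1   = minkowski_matrix(points, 1)       # city block
L2   = minkowski_matrix(points, 2)       # Euclidean
Linf = minkowski_matrix(points, np.inf)  # supremum
```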
Distances - Mahalanobis Distance
• Mahalanobis distance is the distance between a point and a
distribution (not between two distinct points).
– It is effectively a multivariate equivalent of the Euclidean
distance:
• it transforms the columns into uncorrelated variables,
• scales the columns to make their variance equal to 1,
• and finally calculates the Euclidean distance.
• It is defined as
  mahalanobis(x, y) = (x − y) Σ^(−1) (x − y)^T
  where Σ^(−1) is the inverse of the covariance matrix of the data.
• Example (from the slide's scatter plot, showing points A,
B = (0, 1), and C = (1.5, 1.5)):
– Mahal(A, B) = 5
– Mahal(A, C) = 4
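The scatter plot's point A and the covariance matrix did not survive extraction, so the values below are assumptions chosen to reproduce Mahal(A, B) = 5 and Mahal(A, C) = 4; treat them as an illustrative sketch rather than the slide's actual data:

```python
import numpy as np

# Assumed values (not in the surviving text): a covariance matrix and
# point A picked so that Mahal(A, B) = 5 and Mahal(A, C) = 4 hold.
Sigma = np.array([[0.3, 0.2], [0.2, 0.3]])
A = np.array([0.5, 0.5])
B = np.array([0.0, 1.0])
C = np.array([1.5, 1.5])

def mahalanobis(x, y, cov):
    """Squared-form Mahalanobis distance (x - y) Sigma^-1 (x - y)^T,
    matching the definition used on the slide."""
    diff = x - y
    return float(diff @ np.linalg.inv(cov) @ diff)

print(mahalanobis(A, B, Sigma))  # 5.0 (approximately)
print(mahalanobis(A, C, Sigma))  # 4.0 (approximately)
```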
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some
well-known properties.
• If d(x, y) is the distance between two points, x and y,
then the following properties hold.
– Positivity
• d(x, y) ≥ 0 for all x and y
• d(x, y) = 0 only if x = y
– Symmetry
• d(x, y) = d(y, x) for all x and y
– Triangle Inequality
• d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z
• Measures that satisfy all three properties are known as
metrics
Common Properties of a Similarity
• If s(x, y) is the similarity between points x and y,
then the typical properties of similarities are the
following:
– Positivity
• s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
– Symmetry
• s(x, y) = s(y, x) for all x and y
• For similarities, the triangle inequality typically
does not hold
– However, a similarity measure can be converted to a
metric distance
A Non-symmetric Similarity Measure Example
Similarity Measures for Binary Data
• Let f00, f01, f10, and f11 be the number of attribute positions where
x and y take the values (0, 0), (0, 1), (1, 0), and (1, 1),
respectively.
• Simple Matching Coefficient
– SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
• Jaccard Similarity Coefficient
– frequently used to handle objects consisting of asymmetric binary
attributes
– J = f11 / (f01 + f10 + f11)
SMC versus Jaccard: Example
• Calculate SMC and J for the binary vectors,
x = (1 0 0 0 0 0 0 0 0 0)
y = (0 0 0 0 0 0 1 0 0 1)
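The requested calculation can be worked out directly from the f-counts; a minimal sketch:

```python
# Worked solution for the slide's binary vectors
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))  # 0
f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))  # 7
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))  # 1
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))  # 2

smc = (f11 + f00) / (f00 + f01 + f10 + f11)  # 7/10 = 0.7
jaccard = f11 / (f01 + f10 + f11)            # 0/3  = 0.0
```

Note that SMC is fairly high even though x and y share no 1s; Jaccard ignores the 0-0 matches, which is why it suits asymmetric binary attributes.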
Cosine Similarity
• For vectors x and y,
  cos(x, y) = (x · y) / (‖x‖ ‖y‖)
  where · indicates the vector dot product and ‖x‖ is the length of
  vector x.
• Cosine similarity really is a measure of the (cosine of the) angle
between x and y.
– Thus, if the cosine similarity is 1, the angle between x and y is 0°,
and x and y are the same except for length.
– If the cosine similarity is 0, then the angle between x and y is 90°,
and they do not share any terms (words).
• It can also be written as
  cos(x, y) = (x/‖x‖) · (y/‖y‖),
  i.e., the dot product of the two vectors after they have been
  normalized to unit length.
Cosine Similarity - Example
• Cosine Similarity between two document vectors
• This example calculates the cosine similarity for the
following two data objects, which might represent
document vectors:
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
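Working the example through: x · y = 3·1 + 2·1 = 5, ‖x‖ = √42 ≈ 6.481, ‖y‖ = √6 ≈ 2.449, so cos(x, y) ≈ 0.31. A minimal sketch:

```python
import math

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(x, y))     # x . y = 5
norm_x = math.sqrt(sum(a * a for a in x))  # ||x|| = sqrt(42) ~ 6.481
norm_y = math.sqrt(sum(b * b for b in y))  # ||y|| = sqrt(6)  ~ 2.449
cos_xy = dot / (norm_x * norm_y)           # ~ 0.31
```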
Extended Jaccard Coefficient
• Also known as the Tanimoto Coefficient.
• The extended Jaccard coefficient can be used for document data and
reduces to the Jaccard coefficient in the case of binary attributes.
• This coefficient, which we shall represent as EJ, is defined by the
following equation:
  EJ(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y)
Correlation
• Correlation is used to measure the linear relationship between two
sets of values that are observed together.
– Thus, correlation can measure the relationship between two variables
(e.g., height and weight) or between two objects (e.g., a pair of
temperature time series).
• Correlation is used much more frequently to measure the similarity
between attributes than between objects,
– since the values in two data objects come from different attributes,
which can have very different attribute types and scales.
• There are many types of correlation.
Correlation - Pearson’s correlation
• Pearson’s correlation between two sets of numerical values, i.e., two
vectors x and y, is defined by:
  corr(x, y) = covariance(x, y) / (std(x) · std(y)) = s_xy / (s_x s_y)
  where s_xy = (1/(n − 1)) Σ_k (x_k − x̄)(y_k − ȳ) is the sample
  covariance, and s_x and s_y are the sample standard deviations of x
  and y.
Correlation – Example (Perfect Correlation)
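The body of this slide did not survive extraction. As an illustration (with vectors of our own choosing, not necessarily the slide's), correlation is exactly −1 or +1 precisely when y is a perfect negative or positive linear function of x:

```python
import numpy as np

# Illustrative vectors (assumed): y1 = -x/3, y2 = x/3
x  = np.array([-3, 6, 0, 3, -6])
y1 = np.array([ 1, -2, 0, -1, 2])  # perfect negative linear relationship
y2 = -y1                           # perfect positive linear relationship

r_neg = np.corrcoef(x, y1)[0, 1]   # -1.0
r_pos = np.corrcoef(x, y2)[0, 1]   # +1.0
```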
Correlation – Example (Nonlinear Relationships)
• Correlation only measures linear relationships; it can be zero even
when one variable is completely determined by the other.
• Example: x = (−3, −2, −1, 0, 1, 2, 3) and y = x² = (9, 4, 1, 0, 1, 4, 9)
– mean(x) = 0, mean(y) = 4
– std(x) = 2.16, std(y) = 3.74
– corr(x, y) = 0
[Scatter plot of y versus x, showing the parabola y = x²]
Visually Evaluating Correlation
• Scatter plots showing the similarity from −1 to 1.
Correlation vs Cosine vs Euclidean Distance
• Compare the three proximity measures according to their behavior
under variable transformation:
– scaling: multiplication by a value
– translation: adding a constant

Property                               Cosine  Correlation  Euclidean Distance
Invariant to scaling (multiplication)  Yes     Yes          No
Invariant to translation (addition)    No      Yes          No

• Consider the example:
– x = (1, 2, 4, 3, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0)
– ys = y × 2 = (2, 4, 6, 8, 0, 0, 0), yt = y + 5 = (6, 7, 8, 9, 5, 5, 5)
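The invariance table can be verified on the example vectors; a minimal sketch:

```python
import numpy as np
from numpy.linalg import norm

x  = np.array([1, 2, 4, 3, 0, 0, 0], dtype=float)
y  = np.array([1, 2, 3, 4, 0, 0, 0], dtype=float)
ys = y * 2   # scaling
yt = y + 5   # translation

cos  = lambda a, b: a @ b / (norm(a) * norm(b))
corr = lambda a, b: np.corrcoef(a, b)[0, 1]
dist = lambda a, b: norm(a - b)

# Cosine is unchanged by scaling but not translation;
# correlation is unchanged by both; Euclidean distance by neither.
```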
Entropy for Sample Data: Example
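The slide's worked example did not survive extraction. For sample data, entropy is computed from observed frequencies, H = −Σ (n_i/n) log2(n_i/n); a minimal sketch with hypothetical data of our own:

```python
import math
from collections import Counter

def entropy(sample):
    """Entropy of a sample: H = -sum (n_i/n) log2 (n_i/n),
    where n_i is the count of each observed value."""
    n = len(sample)
    return -sum((c / n) * math.log2(c / n) for c in Counter(sample).values())

# Hypothetical samples (not from the slide):
print(entropy(["a", "b", "c", "d"]))  # -> 2.0 (four equally likely values)
print(entropy(["a", "a", "a", "a"]))  # 0 bits: no uncertainty
```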
Mutual Information
• Mutual information measures how much information one variable
provides about another.
• Formally, I(X, Y) = H(X) + H(Y) − H(X, Y), where H(X, Y) is the joint
entropy of X and Y.
• Example: x = (−3, −2, −1, 0, 1, 2, 3), y = (9, 4, 1, 0, 1, 4, 9)
– H(x) = log2 7 = 2.8074 (all seven values are distinct)
– H(y) = 1.9502 (the entropy for y)
– H(x, y) = 2.8074 (each (x, y) pair occurs exactly once)
– I(x, y) = H(x) + H(y) − H(x, y) = 1.9502
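Assuming H denotes the entropy of the empirical (observed-frequency) distribution, the value I(x, y) = 1.9502 can be checked directly; a minimal sketch:

```python
import math
from collections import Counter

def H(values):
    """Entropy of the empirical distribution of the observed values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

x = (-3, -2, -1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

# I(x, y) = H(x) + H(y) - H(x, y); the joint entropy uses (x, y) pairs
mi = H(x) + H(y) - H(list(zip(x, y)))   # ~1.9502, which equals H(y)
```

Here I(x, y) = H(y) because y is completely determined by x, so knowing x removes all uncertainty about y.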
Mutual Information Example

Student Status  Count  p     −p·log2(p)
Undergrad       45     0.45  0.5184
Grad            55     0.55  0.4744
Total           100    1.00  0.9928

Grade  Count  p     −p·log2(p)
A      35     0.35  0.5301
B      50     0.50  0.5000
C      15     0.15  0.4105
Total  100    1.00  1.4406

Student Status  Grade  Count  p     −p·log2(p)
Undergrad       A      5      0.05  0.2161
Undergrad       B      30     0.30  0.5211
Undergrad       C      10     0.10  0.3322
Grad            A      30     0.30  0.5211
Grad            B      20     0.20  0.4644
Grad            C      5      0.05  0.2161
Total           100    1.00   2.2710

• I(Status, Grade) = 0.9928 + 1.4406 − 2.2710 = 0.1624
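The three entropies and the mutual information can be recomputed from the joint counts alone, since the marginal counts are just row and column sums; a minimal sketch:

```python
import math

# Joint counts from the slide's table: (student status, grade) -> count
counts = {("Undergrad", "A"): 5, ("Undergrad", "B"): 30, ("Undergrad", "C"): 10,
          ("Grad", "A"): 30, ("Grad", "B"): 20, ("Grad", "C"): 5}
n = sum(counts.values())   # 100

def H(probs):
    """Entropy -sum p log2 p over a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal counts via row/column sums
status, grade = {}, {}
for (s, g), c in counts.items():
    status[s] = status.get(s, 0) + c
    grade[g] = grade.get(g, 0) + c

H_status = H([c / n for c in status.values()])  # ~0.9928
H_grade  = H([c / n for c in grade.values()])   # ~1.4406
H_joint  = H([c / n for c in counts.values()])  # ~2.2710
mi = H_status + H_grade - H_joint               # ~0.1624
```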
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an overall
similarity is still needed. A common approach:
1. For the kth attribute, compute a similarity sk(x, y) in the range
[0, 1].
2. Define an indicator variable δk for the kth attribute: δk = 0 if the
kth attribute is an asymmetric attribute and both objects have a value
of 0, or if one of the objects has a missing value for the kth
attribute; δk = 1 otherwise.
3. Compute similarity(x, y) = Σk δk sk(x, y) / Σk δk.
Using Weights to Combine Similarities
• We may not want to treat all attributes the same.
– Use non-negative weights ωk that sum to 1:
  similarity(x, y) = Σk ωk δk sk(x, y) / Σk δk
  (where δk indicates whether the kth attribute contributes, e.g., it
  is 0 for an asymmetric attribute when both values are 0 or when a
  value is missing).
• The Minkowski distance can be generalized in the same way:
  d(x, y) = ( Σk ωk |x_k − y_k|^r )^(1/r)