0% found this document useful (0 votes)

26 views15 pages

On The Surprising Behavior of Distance Metrics

The document discusses how distance metrics behave unexpectedly in high dimensional spaces, finding that metrics like Manhattan distance (L1 norm) are more meaningful than Euclidean distance (L2 norm) for high dimensional data mining applications. It analyzes the Lk norm theoretically and empirically, introducing fractional distance metrics which better preserve proximity and improve clustering algorithms like k-means. The results provide guidance on choosing effective distance metrics for high dimensional data mining problems.

Uploaded by

Yacoov Tovim

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

26 views15 pages

On The Surprising Behavior of Distance Metrics

Uploaded by

Yacoov Tovim

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

On the Surprising Behavior of Distance Metrics

in High Dimensional Space

Charu C. Aggarwal1 , Alexander Hinneburg2 , and Daniel A. Keim2

1
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA.
[email protected]
2
Institute of Computer Science, University of Halle
Kurt-Mothes-Str.1, 06120 Halle (Saale), Germany
{ hinneburg, keim }@informatik.uni-halle.de

Abstract. In recent years, the effect of the curse of high dimensionality

has been studied in great detail on several problems such as clustering,
nearest neighbor search, and indexing. In high dimensional space the data
becomes sparse, and traditional indexing and algorithmic techniques fail
from a efficiency and/or effectiveness perspective. Recent research results
show that in high dimensional space, the concept of proximity, distance
or nearest neighbor may not even be qualitatively meaningful. In this
paper, we view the dimensionality curse from the point of view of the di-
stance metrics which are used to measure the similarity between objects.
We specifically examine the behavior of the commonly used Lk norm
and show that the problem of meaningfulness in high dimensionality is
sensitive to the value of k. For example, this means that the Manhat-
tan distance metric (L1 norm) is consistently more preferable than the
Euclidean distance metric (L2 norm) for high dimensional data mining
applications. Using the intuition derived from our analysis, we introduce
and examine a natural extension of the Lk norm to fractional distance
metrics. We show that the fractional distance metric provides more mea-
ningful results both from the theoretical and empirical perspective. The
results show that fractional distance metrics can significantly improve
the effectiveness of standard clustering algorithms such as the k-means
algorithm.

1 Introduction

In recent years, high dimensional search and retrieval have become very well
studied problems because of the increased importance of data mining applica-
tions [1], [2], [3], [4], [5], [8], [10], [11]. Typically, most real applications which
require the use of such techniques comprise very high dimensional data. For such
applications, the curse of high dimensionality tends to be a major obstacle in the
development of data mining techniques in several ways. For example, the per-
formance of similarity indexing structures in high dimensions degrades rapidly,
so that each query requires the access of almost all the data [1].

J. Van den Bussche and V. Vianu (Eds.): ICDT 2001, LNCS 1973, pp. 420–434, 2001.
c Springer-Verlag Berlin Heidelberg 2001
On the Surprising Behavior of Distance Metrics 421

It has been argued in [6], that under certain reasonable assumptions on the
data distribution, the ratio of the distances of the nearest and farthest neighbors
to a given target in high dimensional space is almost 1 for a wide variety of data
distributions and distance functions. In such a case, the nearest neighbor problem
becomes ill defined, since the contrast between the distances to different data
points does not exist. In such cases, even the concept of proximity may not
be meaningful from a qualitative perspective: a problem which is even more
fundamental than the performance degradation of high dimensional algorithms.
In most high dimensional applications the choice of the distance metric is
not obvious; and the notion for the calculation of similarity is very heuristical.
Given the non-contrasting nature of the distribution of distances to a given
query point, different measures may provide very different orders of proximity
of points to a given query point. There is very little literature on providing
guidance for choosing the correct distance measure which results in the most
meaningful notion of proximity between two records. Many high dimensional
indexing structures and algorithms use the euclidean distance metric as a natural
extension of its traditional use in two- or three-dimensional spatial applications.
In this paper, we discuss the general
Pd behavior of the commonly used Lk norm
(x, y ∈ Rd , k ∈ Z, Lk (x, y) = i=1 (kxi − y i kk )1/k ) in high dimensional space.
The Lk norm distance function is also susceptible to the dimensionality curse
for many classes of data distributions [6]. Our recent results [9] seem to suggest
that the Lk -norm may be more relevant for k = 1 or 2 than values of k ≥ 3. In
this paper, we provide some surprising theoretical and experimental results in
analyzing the dependency of the Lk norm on the value of k. More specifically,
we show that the relative contrasts of the distances to a query point depend
heavily on the Lk metric used. This provides considerable evidence that the
meaningfulness of the Lk norm worsens faster with increasing dimensionality for
higher values of k. Thus, for a given problem with a fixed (high) value of the
dimensionality d, it may be preferable to use lower values of k. This means that
the L1 distance metric (Manhattan Distance metric) is the most preferable for
high dimensional applications, followed by the Euclidean Metric (L2 ), then the
L3 metric, and so on. Encouraged by this trend, we examine the behavior of
fractional distance metrics, in which k is allowed to be a fraction smaller than 1.
We show that this metric is even more effective at preserving the meaningfulness
of proximity measures. We back up our theoretical results with empirical tests on
real and synthetic data showing that the results provided by fractional distance
metrics are indeed practically useful. Thus, the results of this paper have strong
implications for the choice of distance metrics for high dimensional data mining
problems. We specifically show the improvements which can be obtained by
applying fractional distance metrics to the standard k-means algorithm.
This paper is organized as follows. In the next section, we provide a theo-
retical analysis of the behavior of the Lk norm in very high dimensionality. In
section 3, we discuss fractional distance metrics and provide a theoretical analy-
sis of their behavior. In section 4, we provide the empirical results, and section
5 provides summary and conclusions.
422 C.C. Aggarwal, A. Hinneburg, and D.A. Keim

2 Behavior of the Lk -Norm in High Dimensionality

In order to present our convergence results, we first establish some notations and
definitions in Table 1.
Table 1. Notations and Basic Definitions

Notation Definition
d Dimensionality of the data space
N Number of data points
F 1-dimensional data distribution in (0, 1)
Xd Data point from F d with each coordinate drawn from F
distkd (x, y) Distance between (x1 , . . . xd ) and (y 1 , . . . y d )
Pd
using Lk metric = i=1 [(xi1 − xi2 )k ]1/k
k · kk Distance of a vector to the origin (0, . . . , 0)
using the function distkd (·, ·)
k
Dmaxd = max{kXd kk } Farthest distance of the N points
to the origin using the distance metric Lk
Dminkd = min{kXd kk } Nearest distance of the N points
to the origin using the distance metric Lk
E[X], var[X] Expected value and variance of a random variable X
Yd →p c A vector sequence Y1 , . . . , Yd converges in probability to a
constant vector c if: ∀ > 0 limd→∞ P [distd (Yd , c) ≤ ] = 1

Theorem 1. Beyer
et. al. (Adapted for Lk metric)
kXd kk Dmaxk k
d −Dmind
If limd→∞ var E[kXd kk ] = 0 , then Dmink
→p 0.
d

Proof. See [6] for proof of a more general version of this result.

The result of the theorem [6] shows that the difference between the maxi-
mum and minimum distances to a given query point 1 does not increase as fast
as the nearest distance to any point in high dimensional space. This makes a
proximity query meaningless and unstable because there is poor discrimination
between the nearest and furthest neighbor. Henceforth, we will refer to the ratio
Dmaxk k
d −Dmind
Dmink
as the relative contrast.
d
Dmaxk −Dmink
The results in [6] use the value of d
Dmink
d
as an interesting criterion
d
for meaningfulness. In order to provide more insight, in the following we analyze
the behavior for different distance metrics in high-dimensional space. We first
assume a uniform distribution of data points and show our results for N = 2
points. Then, we generalize the results to an arbitrary number of points and
arbitrary distributions.
1
In this paper, we consistently use the origin as the query point. This choice does not
affect the generality of our results, though it simplifies our algebra considerably.
On the Surprising Behavior of Distance Metrics 423

Lemma 1. Let F be uniform distribution of r N = 2 points. For an Lk metric,

h i
Dmaxk −Dmin k
1 1
limd→∞ E d
d
1/k−1/2
d
= C · (k+1) 1/k 2·k+1 , where C is some con-
stant.

Proof. Let Ad and Bd be the two points in a d dimensional data distribu-

tion such that each coordinate is independently drawn from a 1-dimensional
data distribution F with finite mean and standard deviation. Specifically Ad =
(P1 . . . Pd ) and Bd = (Q1 . . . Qd ) with Pi and Qi being drawn from F. Let
Pd
P Ad = { i=1 (Pi )k }1/k be the distance of Ad to the origin using the Lk metric
Pd
and P Bd = { i=1 (Qi )k }1/k the distance of Bd . The difference of distances is
Pd Pd
P Ad − P Bd = { i=1 (Pi )k }1/k − { i=1 (Qi )k }1/k .
1
It can be shown 2 that the random variable (Pi )k has mean k+1 and standard
r k
k 1 (P A ) 1 (P Bd )k
deviation k+1 2·k+1 . This means that d
d
→p (k+1) , d →p
1
(k+1) and therefore
1/k 1/k
P Ad 1 P Bd 1
→p , →p (1)
d1/k k+1 d1/k k+1
r
|P Ad −P Bd | 1 2
We intend to show that d1/k−1/2
→p (k+1)1/k 2·k+1 . We can express
|P Ad − P Bd | in the following numerator/denominator form which we will use in
order to examine the convergence behavior of the numerator and denominator
individually.
|(P Ad )k − (P Bd )k |
|P Ad − P Bd | = Pk−1 (2)
k−r−1 (P B )r
r=0 (P Ad ) d

Dividing both sides by d1/k−1/2 and regrouping the right-hand-side we get:

√
|P Ad − P Bd | |((P Ad )k − (P Bd )k )|/ d
=P (3)
d1/k−1/2 k−1 P Ad k−r−1 P Bd r
r=0 d1/k d1/k

Consequently, using Slutsky’s theorem 3 and the results of Equation 1 we obtain

X P Ad k−r−1 P Bd r
k−1
1
(k−1)/k
· →p k · (4)
r=0
d1/k d1/k k+1

Having characterized the convergence behavior of the denominator of the right

hand side of Equation √ 3, letPusd now examine the √ behaviorPof the numerator:
√
d
|(P Ad )k − (P Bd )k |/ d = | i=1 ((Pi )k − (Qi )k )|/ d = | i=1 Ri |/ d. Here
Ri is the new random variable defined by ((Pi )k − (Qi )k ) ∀i ∈ {1,√. . . d}. This
random variable has zero mean and standard deviation which is 2 · σ where
2
This is because E[Pik ] = 1/(k + 1) and E[Pi2k ] = 1/(2 · k + 1).
3
Slutsky’s Theorem: Let Y1 . . . Yd . . . be a sequence of random vectors and h(·) be
a continuous function. If Yd →p c then h(Yd ) →p h(c).
424 C.C. Aggarwal, A. Hinneburg, and D.A. Keim

σ is the standard deviation of (Pi )k . The sum of different values of Ri over d

dimensions√will converge
√ to a normal distribution with mean 0 and standard
deviation 2 · σ · d because of the central limit theorem. Consequently, the
mean average deviation of this distribution will be C · σ for some constant C.
Therefore, we have: r
|(P Ad )k − (P Bd )k | k 1
limd→∞ E √ =C· (5)
d k + 1 2 · k +1
Since the denominator of Equation 3 shows probabilistic convergence, we can
combine the results of Equations 4 and 5 to obtain
r
|P Ad − P Bd | 1 1
limd→∞ E =C· (6)
d 1/k−1/2 (k + 1)1/k 2·k+1
We can easily generalize the result for a database of N uniformly distributed
points. The following Corollary provides the result.
Corollary r 1. Let F be the uniform distribution of N = n points.rThen,
h i
C 1 Dmaxk k
d −Dmind C·(n−1) 1
(k+1) 1/k 2·k+1 ≤ limd→∞ E d1/k−1/2 ≤ (k+1) 1/k 2·k+1 .

Proof. This is because if L is the expected difference between the maximum and
minimum of two randomly drawn points, then the same value for n points drawn
from the same distribution must be in the range (L, (n − 1) · L).
The results can be modified for arbitrary distributions of N points in a data-
base by introducing the constant factor Ck . In that case, the general dependency
1 1
of Dmax − Dmin on d k − 2 remains unchanged. A detailed proof is provided in
the Appendix; a short outline of the reasoning behind the result is available in
[9].
Lemma 2.
h [9] Let F be ian arbitrary distribution of N = 2 points. Then,
Dmaxk k
d −Dmind
limd→∞ E d1/k−1/2
= Ck , where Ck is some constant dependent on k.

Corollary 2. Let F be the arbitrary distribution of N = n points. Then,

Dmaxkd − Dminkd
Ck ≤ limd→∞ E ≤ (n − 1) · Ck .
d1/k−1/2

Thus, this result shows that in high dimensional space Dmaxkd − Dminkd in-
creases at the rate of d1/k−1/2 , independent of the data distribution. This means
that for the manhattan distance metric, the value of this expression diverges to
∞; for the Euclidean distance metric, the expression is bounded by constants
whereas for all other distance metrics, it converges to 0 (see Figure 1). Further-
more, the convergence is faster when the value of k of the Lk metric increases.
This provides the insight that higher norm parameters provide poorer contrast
between the furthest and nearest neighbor. Even more insight may be obtai-
ned by examining the exact behavior of the relative contrast as opposed to the
absolute distance between the furthest and nearest point.
On the Surprising Behavior of Distance Metrics 425

1.15 1.9 25
p=2 p=2 p=1
1.1 1.8
1.05 1.7 20

1 1.6
15
0.95 1.5
0.9 1.4
10
0.85 1.3
0.8 1.2 5
0.75 1.1
0.7 1 0
20 40 60 80 100 120 140 160 180 200 20 40 60 80 100 120 140 160 180 200 20 40 60 80 100 120 140 160 180 200

(a) k = 3 (b) k = 2 (c) k = 1

400 1.6e+07
p=2/3 p=2/5
350 1.4e+07
300 1.2e+07
250 1e+07

200 8e+06
150 6e+06
100 4e+06
50 2e+06

0 0
20 40 60 80 100 120 140 160 180 200 20 40 60 80 100 120 140 160 180 200

(d) k = 2/3 (e) k = 2/5

Fig. 1. |Dmax − Dmin| depending on d for different metrics (uniform data)

Table 2. Effect of dimensionality on relative (L1 and L2 ) behavior of relative contrast

Dimensionality P [Ud < Td ] Dimensionality P [Ud < Td ]

1 Both metrics are the same 10 95.6%
2 85.0% 15 96.1%
3 88.7% 20 97.1%
4 91.3% 100 98.2%

Theoremh2. Let F be theuniform distribution

q of N = 2 points. Then,
Dmaxk k √ i
d −Dmind 1
limd→∞ E Dmink
· d = C · 2·k+1 .
d

Proof. Let Ad , Bd , P1 . . . Pd , Q1 . . . Qd , P Ad , P Bd be defined as in the proof

of Lemma 1. We have shown in the proof of the previous result that Pd1/k Ad
→
1/k
1
k+1 . Using Slutsky’s theorem we can derive that:

1/k
P Ad P Bd 1
min{ 1/k , 1/k } → (7)
d d k+1

We have also shown in the previous result that:

s
|P Ad − P Bd | 1 1
limd→∞ E = C · (8)
d1/k−1/2 (k + 1)1/k 2·k+1

We can combine the results in Equation 7 and 8 to obtain:

√ |P Ad − P Bd | p
limd→∞ E d· = C · 1/(2 · k + 1) (9)
min{P Ad , P Bd }

Note that the above results confirm

√ of the results in [6] because it shows that
the relative contrast degrades as 1/ d for the different distance norms. Note
426 C.C. Aggarwal, A. Hinneburg, and D.A. Keim

RELATIVE CONTRAST FOR UNIFORM DISTRIBUTION

4.5

N=10,000
N=1,000
4
N=100
1
0.8 f=1
3.5 f=0.75
0.6 f=0.5
3 f=0.25
0.4
RELATIVE CONTRAST

2.5 0.2

2
0
-0.2
1.5
-0.4
1 -0.6
-0.8
0.5
-1
0 -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1
0 1 2 3 4 5 6 7 8 9 10
PARAMETER OF DISTANCE NORM

Fig. 3. Unit spheres for different fractio-

Fig. 2. Relative contrast variation with
nal metrics (2D)
norm parameter for the uniform distribu-
tion

that for values of d in thepreasonable range of data mining applications, the

norm dependent factor of 1/(2 · k + 1) may play a valuable role in affecting
the relative contrast. For such cases, even the relative rate of degradation of
the different distance metrics for a given data set in the same value of the
dimensionality may be important. In the Figure 2 we have illustrated the relative
contrast created by an artificially generated data set drawn from a uniform
distribution in d = 20 dimensions. Clearly, the relative contrast
p decreases with
increasing value of k and also follows the same trend as 1/(2 · k + 1).
Another interesting aspect which can be explored to improve nearest neigh-
bor and clustering algorithms in high-dimensional space is the effect of k on the
relative contrast. Even though the expected relative contrast always decreases
with increasing dimensionality, this may not necessarily be true for a given data
set and different k. To show this, we performed the following experiment
on the

Dmax2d −Dmin2d
Manhattan (L1 ) and Euclidean (L2 ) distance metric: Let Ud = Dmind2

Dmax1d −Dmin1d
and Td = Dmin1d
. We performed some empirical tests to calculate
the value of P [Ud < Td ] for the case of the Manhattan (L1 ) and Euclidean
(L2 ) distance metrics for N = 10 points drawn from a uniform distribution. In
each trial, Ud and Td were calculated from the same set of N = 10 points, and
P [Ud < Td ] was calculated by finding the fraction of times Ud was less than Td
in 1000 trials. The results of the experiment are given in Table 2. It is clear that
with increasing dimensionality d, the value of P [Ud < Td ] continues to increase.
Thus, for higher dimensionality, the relative contrast provided by a norm with
smaller parameter k is more likely to dominate another with a larger parameter.
For dimensionalities of 20 or higher it is clear that the manhattan distance me-
tric provides a significantly higher relative contrast than the Euclidean distance
metric with very high probability. Thus, among the distance metrics with inte-
gral norms, the manhattan distance metric is the method of choice for providing
the best contrast between the different points. This result of our analysis can be
directly used in a number of different applications.
On the Surprising Behavior of Distance Metrics 427

3 Fractional Distance Metrics

The result of the previous section that the Manhattan metric (k = 1) provides
the best discrimination in high-dimensional data spaces is the motivation for
looking into distance metrics with k < 1. We call these metrics fractional distance
metrics. A fractional distance metric distfd (Lf norm) for f ∈ (0, 1) is defined
as:
d
X i 1/f
distfd (x, y) = (x − y i )f .
i=1

To give a intuition of the behavior of the fractional distance metric we plotted

in Figure 3 the unit spheres for different fractional metrics in R2 .
We will prove most of our results in this section assuming that f is of the form
1/l, where l is some integer. The reason that we show the results for this special
case is that we are able to use nice algebraic tricks for the proofs. The natural
conjecture from the smooth continuous variation of distfd with f is that the
results are also true for arbitrary values of f . 4 . Our results provide considerable
insights into the behavior of the fractional distance metric and its relationship
with the Lk -norm for integral values of k.

Lemma 3. Let F be the uniform distribution of N = 2 points and f = 1/l for

some integer
l. Then, r
Dmaxfd −Dminfd 1 1
limd→∞ E d1/f −1/2
= C · (f +1)1/f 2·f +1 .

Proof. Let Ad , Bd , P1 . . . Pd , Q1 . . . Qd , P Ad , P Bd be defined using the Lf metric

as they were defined in Lemma 1 for the Lk metric. Let further QAd = (P Ad )f =
Pd Pd
(P Ad )1/l = i=1 (Pi )f and QBd = (P Bd )f = (P Bd )1/l = i=1 (Qi )f . Analo-
gous to Lemma 1, QA 1
d →p f +1 ,
d QBd
d →p f +1 .
1

h i r
|P Ad −P Bd | 1 1
We intend to show that E dl−1/2 = C · (f +1) 1/f 2·f +1 . The
Pd P d
difference of distances is |P Ad − P Bd | = { i=1 (Pi )f }1/f − { i=1 (Qi )f }1/f
Pd P d
= { i=1 (Pi )f }l − { i=1 (Qi )f }l . Note that the above expression is of the form
Pl−1
|al − bl | = |a − b| · ( r=0 ar · bl−r−1 ). Therefore, |P Ad − P Bd | can be written as
Pd Pl−1
{ i=1 |(Pi )f − (Qi )f |} · { r=0 (QAd )r · (QBd )l−r−1 }. By dividing both sides by
d1/f −1/2 and regrouping the right hand side we get:
Pd Xl−1 r l−r−1
|P Ad − P Bd | i=1 |(Pi )f − (Qi )f | QAd QBd
→p { √ }·{ · } (10)
d1/f −1/2 d r=0
d d

By using the results in Equation 10, we can derive that:

Pd
|P Ad − P Bd | i=1 |(Pi )f − (Qi )f | 1
→p { √ } · {l · } (11)
d1/f −1/2 d (1 + f )l−1
4
Empirical simulations of the relative contrast show this is indeed the case.
428 C.C. Aggarwal, A. Hinneburg, and D.A. Keim

f f
This
√ random variable (Pi ) − (Qi ) has zero mean fand standard deviation which
is 2 · σ where σ is the standard deviation of (Pi ) . The sum of different values
of (Pi )f − (Qi )f over d dimensions will √ converge to normal distribution with
mean 0 and standard deviation 2 · σ · d because of the central limit theorem.
Consequently,
√ the expected mean average deviation of this normal distribution
is C · σ · d for some constant C. Therefore, we have:
s
|(P Ad )f − (P Bd )f | f 1
limd→∞ E √ =C ·σ =C · . (12)
d f +1 2·f +1

Combining the results of Equations 12 and 11, we get:

s
|P Ad − P Bd | C 1
limd→∞ E = (13)
d1/f −1/2 (f + 1)1/f 2·f +1

An direct consequence of the above result is the following generalization to

N = n points.

Corollary 3. When F is the uniform distribution of N = n points and f = 1/l

for some integer l. Then, for someconstant C we have:
r r
C 1 Dmaxfd −Dminfd C·(n−1) 1
(f +1)1/f 2·f +1 ≤ limd→∞ E d1/f −1/2
≤ (f +1)1/f 2·f +1 .

Proof. Similar to corollary 1.

The above result shows that the absolute difference between the maximum
and minimum for the fractional distance metric increases at the rate of d1/f −1/2 .
Thus, the smaller the fraction, the greater the rate of absolute divergence bet-
ween the maximum and minimum value. Now, we will examine the relative
contrast of the fractional distance metric.

Theorem 3. Let F be the uniform distribution of N = 2 points and f = 1/l

for someinteger l. Then,
√ q
Dmaxfd −Dminfd 1
limd→∞ Dmin f d = C · 2·f +1 for some constant C.
d

Proof. Analogous to the proof of Theorem 2.

The following is the direct generalization to N = n points.

Corollary 4. Let F be the uniform distribution of N = n points, and f = 1/l

for some integer l. Then, for some constant
C
q q
1 Dmaxfd −Dminfd 1
C · 2·f +1 ≤ limd→∞ E Dminf
≤ C · (n − 1) · 2·f +1 .
d

Proof. Analogous to the proof of Corollary 1.

On the Surprising Behavior of Distance Metrics 429

This result is true for the case of arbitrary values f (not just f = 1/l) and
N , but the use of these specific values of f helps considerably in simplification of
the proof of the result. The empirical simulation in Figure 2, shows the behavior
for arbitrary values of f and N . The curve for each value of N is different but all
curves fit the general trend of reduced contrast with increased value of f . Note
that the value of the relative contrast for both, the case of integral distance
metric Lk and fractional distance metric Lf is the same in the boundary case
when f = k = 1.
The above results show that fractional distance metrics provide better con-
trast than integral distance metrics both in terms of the absolute distributions
of points to a given query point and relative distances. This is a surprising result
in light of the fact that the Euclidean distance metric is traditionally used in
a large variety of indexing structures and data mining applications. The wide-
spread use of the Euclidean distance metric stems from the natural extension
of applicability to spatial database systems (many multidimensional indexing
structures were initially proposed in the context of spatial systems). However,
from the perspective of high dimensional data mining applications, this natural
interpretability in 2 or 3-dimensional spatial systems is completely irrelevant.
Whether the theoretical behavior of the relative contrast also translates into
practically useful implications for high dimensional data mining applications is
an issue which we will examine in greater detail in the next section.

4 Empirical Results

In this section, we show that our surprising findings can be directly applied to
improve existing mining techniques for high-dimensional data. For the experi-
ments, we use synthetic and real data. The synthetic data consists of a number
of clusters (data inside the clusters follow a normal distribution and the cluster
centers are uniformly distributed). The advantage of the synthetic data sets is
that the clusters are clearly separated and any clustering algorithm should be
able to identify them correctly. For our experiments we used one of the most wi-
dely used standard clustering algorithms - the k-means algorithm. The data set
used in the experiments consists of 6 clusters with 10000 data points each and
no noise. The dimensionality was chosen to be 20. The results of our experiments
show that the fractional distance metrics provides a much higher classification
rate which is about 99% for the fractional distance metric with f = 0.3 versus
89% for the Euclidean metric (see figure 4). The detailed results including the
confusion matrices obtained are provided in the appendix.
For the experiments with real data sets, we use some of the classification
problems from the UCI machine learning repository 5 . All of these problems
are classification problems which have a large number of feature variables, and
a special variable which is designated as the class label. We used the following
simple experiment: For each of the cases that we tested on, we stripped off the
5
http : //www.cs.uci.edu/˜mlearn
430 C.C. Aggarwal, A. Hinneburg, and D.A. Keim

100
95
90

Classification Rate
85
80
75
70
65
60
55
50
0 0.5 1 1.5 2 2.5 3
Distance Parameter

Fig. 4. Effectiveness of k-Means

class variable from the data set and considered the feature variables only. The
query points were picked from the original database, and the closest l neighbors
were found to each target point using different distance metrics. The technique
was tested using the following two measures:
1. Class Variable Accuracy: This was the primary measure that we used
in order to test the quality of the different distance metrics. Since the class va-
riable is known to depend in some way on the feature variables, the proximity
of objects belonging to the same class in feature space is evidence of the mea-
ningfulness of a given distance metric. The specific measure that we used was
the total number of the l nearest neighbors that belonged to the same class as
the target object over all the different target objects. Needless to say, we do not
intend to propose this rudimentary unsupervised technique as an alternative to
classification models, but use the classification performance only as an evidence
of the meaningfulness (or lack of meaningfulness) of a given distance metric. The
class labels may not necessarily always correspond to locality in feature space;
therefore the meaningfulness results presented are evidential in nature. However,
a consistent effect on the class variable accuracy with increasing norm parameter
does tend to be a powerful way of demonstrating qualitative trends.
2. Noise Stability: How does the quality of the distance metric vary with
more or less noisy data? We used noise masking in order to evaluate this aspect.
In noise masking, each entry in the database was replaced by a random entry
with masking probability pc . The random entry was chosen from a uniform
distribution centered at the mean of that attribute. Thus, when pc is 1, the data
is completely noisy. We studied how each of the two problems were affected by
noise masking.
In Table 3, we have illustrated some examples of the variation in performance
for different distance metrics. Except for a few exceptions, the major trend in
this table is that the accuracy performance decreases with increasing value of the
norm parameter. We have show the table in the range L0.1 to L10 because it was
easiest to calculate the distance values without exceeding the numerical ranges in
the computer representation. We have also illustrated the accuracy performance
when the L∞ metric is used. One interesting observation is that the accuracy
with the L∞ distance metric is often worse than the accuracy value by picking
a record from the database at random and reporting the corresponding target
On the Surprising Behavior of Distance Metrics 431

Table 3. Number of correct class label matches between nearest neighbor and target

Data Set L0.1 L0.5 L1 L2 L4 L10 L∞ Random

Machine 522 474 449 402 364 353 341 153
Musk 998 893 683 405 301 272 163 140
Breast Cancer (wdbc) 5299 5268 5196 5052 4661 4172 4032 3021
Segmentation 1423 1471 1377 1210 1103 1031 300 323
Ionosphere 2954 3002 2839 2430 2062 1836 1769 1884

4 3.5

L(0.1)
3.5 L(1)
3
L(10)
ACCURACY RATIO TO RANDOM MATCHING

3
2.5

2.5
ACCURACY RATIO 2

1.5
1.5

1
1 ACCURACY OF RANDOM MATCHING
ACCURACY OF RANDOM MATCHING

0.5
0.5

0 0
0 1 2 3 4 5 6 7 8 9 10 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
PARAMETER OF DISTANCE NORM USED NOISE MASKING PROBABILITY

Fig. 5. Accuracy depending on the norm Fig. 6. Accuracy depending on noise mas-
parameter king

value. This trend is observed because of the fact that the L∞ metric only looks
at the dimension at which the target and neighbor are furthest apart. In high
dimensional space, this is likely to be a very poor representation of the nearest
neighbor. A similar argument is true for Lk distance metrics (for high values of
k) which provide undue importance to the distant (sparse/noisy) dimensions.
It is precisely this aspect which is reflected in our theoretical analysis of the
relative contrast, which results in distance metrics with high norm parameters
to be poorly discriminating between the furthest and nearest neighbor.
In Figure 5, we have shown the variation in the accuracy of the class variable
matching with k, when the Lk norm is used. The accuracy on the Y -axis is
reported as the ratio of the accuracy to that of a completely random matching
scheme. The graph is averaged over all the data sets of Table 3. It is easy to see
that there is a clear trend of the accuracy worsening with increasing values of
the parameter k.
We also studied the robustness of the scheme to the use of noise masking.
For this purpose, we have illustrated the performance of three distance metrics
in Figure 6: L0.1 , L1 , and L10 for various values of the masking probability on
the machine data set. On the X-axis, we have denoted the value of the masking
probability, whereas on the Y -axis we have the accuracy ratio to that of a com-
pletely random matching scheme. Note that when the masking probability is 1,
then any scheme would degrade to a random method. However, it is interesting
to see from Figure 6 that the L10 distance metric degrades much faster to the
432 C.C. Aggarwal, A. Hinneburg, and D.A. Keim

random performance (at a masking probability of 0.4), whereas the L1 degrades

to random at 0.6. The L0.1 distance metric is most robust to the presence of
noise in the data set and degrades to random performance at the slowest rate.
These results are closely connected to our theoretical analysis which shows the
rapid lack of discrimination between the nearest and furthest distances for high
values of the norm-parameter because of undue weighting being given to the
noisy dimensions which contribute the most to the distance.

5 Conclusions and Summary

In this paper, we showed some surprising results of the qualitative behavior of

the different distance metrics for measuring proximity in high dimensionality.
We demonstrated our results in both a theoretical and empirical setting. In the
past, not much attention has been paid to the choice of distance metrics used
in high dimensional applications. The results of this paper are likely to have a
powerful impact on the particular choice of distance metric which is used from
problems such as clustering, categorization, and similarity search; all of which
depend upon some notion of proximity.

References

1. Weber R., Schek H.-J., Blott S.: A Quantitative Analysis and Performance Study
for Similarity-Search Methods in High-Dimensional Spaces. VLDB Conference Pro-
ceedings, 1998.
2. Bennett K. P., Fayyad U., Geiger D.: Density-Based Indexing for Approximate
Nearest Neighbor Queries. ACM SIGKDD Conference Proceedings, 1999.
3. Berchtold S., Böhm C., Kriegel H.-P.: The Pyramid Technique: Towards Breaking
the Curse of Dimensionality. ACM SIGMOD Conference Proceedings, June 1998.
4. Berchtold S., Böhm C., Keim D., Kriegel H.-P.: A Cost Model for Nearest Neighbor
Search in High Dimensional Space. ACM PODS Conference Proceedings, 1997.
5. Berchtold S., Ertl B., Keim D., Kriegel H.-P. Seidl T.: Fast Nearest Neighbor
Search in High Dimensional Spaces. ICDE Conference Proceedings, 1998.
6. Beyer K., Goldstein J., Ramakrishnan R., Shaft U.: When is Nearest Neighbors
Meaningful? ICDT Conference Proceedings, 1999.
7. Shaft U., Goldstein J., Beyer K.: Nearest Neighbor Query Performance for Unsta-
ble Distributions. Technical Report TR 1388, Department of Computer Science,
University of Wisconsin at Madison.
8. Guttman, A.: R-Trees: A Dynamic Index Structure for Spatial Searching. ACM
SIGMOD Conference Proceedings, 1984.
9. Hinneburg A., Aggarwal C., Keim D.: What is the nearest neighbor in high dimen-
sional spaces? VLDB Conference Proceedings, 2000.
10. Katayama N., Satoh S.: The SR-Tree: An Index Structure for High Dimensional
Nearest Neighbor Queries. ACM SIGMOD Conference Proceedings, 1997.
11. Lin K.-I., Jagadish H. V., Faloutsos C.: The TV-tree: An Index Structure for High
Dimensional Data. VLDB Journal, Volume 3, Number 4, pages 517–542, 1992.
On the Surprising Behavior of Distance Metrics 433

Appendix
Here we provide a detailed proof of Lemma 2, which proves our modified conver-
gence results for arbitrary distributions of points. This Lemma shows that the
asymptotical rate of convergence of the absolute difference of distances between
the nearest and furthest points is dependent on the distance norm used. To re-
cap, we restate Lemma 2.
Lemma 2: h Let Fk be ankarbitrary
i distribution of N = 2 points. Then,
Dmaxd −Dmind
limd→∞ E d1/k−1/2
= Ck , where Ck is some constant dependent on k.

Proof. Let Ad and Bd be the two points in a d dimensional data distribution

such that each coordinate is independently drawn from the data distribution F.
Specifically Ad = (P1 . . . Pd ) and Bd = (Q1 . . . Qd ) with Pi and Qi being drawn
Pd
from F. Let P Ad = { i=1 (Pi )k }1/k be the distance of Ad to the origin using
Pd
the Lk metric and P Bd = { i=1 (Qi )k }1/k the distance of Bd .
We assume that the kth power of a random variable drawn from the dis-
tribution F has mean µF ,k and standard deviation σF ,k . This means that:
P Ak P Bdk
d
d
→p µF ,k , d →p µF ,k and therefore:
1/k 1/k
P Ad /d1/k →p (µF ,k ) , P Bd /d1/k →p (µF ,k ) . (14)

We intend to show that |PdA1/k−1/2

d −P Bd |
→p Ck for some constant Ck depending
on k. We express |P Ad − P Bd | in the following numerator/denominator form
which we will use in order to examine the convergence behavior of the numerator
and denominator individually.

|(P Ad )k − (P Bd )k |
|P Ad − P Bd | = Pk−1 (15)
k−r−1 (P B )r
r=0 (P Ad ) d

Dividing both sides by d1/k−1/2 and regrouping on right-hand-side we get

√
|P Ad − P Bd | |(P Ad )k − (P Bd )k |/ d
=P (16)
d1/k−1/2 k−1 P Ad k−r−1 P Bd r
r=0 d1/k d1/k

Consequently, using Slutsky’s theorem and the results of Equation 14 we have:

X
k−1 k−r−1 r
(k−1)/k
P Ad /d1/k · P Bd /d1/k →p k · (µF ,k ) (17)
r=0

Having characterized the convergence behavior of the denominator of the right-

hand-side of Equation √ 16, let
Pdus now examine the √ behavior
Pdof the numerator:
√
|(P Ad )k − (P Bd )k |/ d = | i=1 ((Pi )k − (Qi )k )|/ d = | i=1 Ri |/ d.
Here Ri is the new random variable defined by ((Pi )k − (Qi )k ) ∀i ∈ {1, √ . . . d}.
This random variable has zero mean and standard deviation which is 2 · σF ,k
where σF ,k is the standard deviation of (Pi )k . Then, the sum of different values
434 C.C. Aggarwal, A. Hinneburg, and D.A. Keim

of Ri over d dimensions √will converge

√ to a normal distribution with mean 0
and standard deviation 2 · σF ,k · d because of the central limit theorem.
Consequently, the mean average deviation of this distribution will be C · σF ,k
for some constant C. Therefore, we have:

|(P Ad )k − (P Bd )k |
limd→∞ E √ = C · σF ,k (18)
d
Since the denominator of Equation 16 shows probabilistic convergence, we can
combine the results of Equations 17 and 18 to obtain:

|P Ad − P Bd | σF ,k
limd→∞ E 1/k−1/2
=C· (k−1)/k
(19)
d k·µ F ,k

The result follows.

Confusion Matrices. We have illustrated the confusion matrices for two dif-
ferent values of p below. As illustrated, the confusion matrix for using the value
p = 0.3 is significantly better than the one obtained using p = 2.

Table 4. Confusion Matrix- p=2, (rows for prototype, colums for cluster)

1208 82 9711 4 10 14
0 2 0 0 6328 4
1 9872 104 32 11 0
8750 8 74 9954 1 18
39 0 10 8 8 9948
2 36 101 2 3642 16

Table 5. Confusion Matrix- p=0.3, (rows for prototype, colums for cluster)

51 115 9773 10 37 15
0 17 24 0 9935 14
15 10 9 9962 0 4
1 9858 66 5 19 1
8 0 9 3 9 9956
9925 0 119 20 0 10

Ap Human Geography Vocab
50% (2)
Ap Human Geography Vocab
20 pages
Chapter 2 - BCE - Surveying
No ratings yet
Chapter 2 - BCE - Surveying
103 pages
Summative Test in Basic Calculus
100% (4)
Summative Test in Basic Calculus
3 pages
Technical Drawing Unit Plan
No ratings yet
Technical Drawing Unit Plan
46 pages
HCO Oriented Core Procedure Updated
No ratings yet
HCO Oriented Core Procedure Updated
38 pages
FE Surveying
100% (1)
FE Surveying
69 pages
The Dirac Equation in Rindler Space - A Pedagogical Introduction
No ratings yet
The Dirac Equation in Rindler Space - A Pedagogical Introduction
21 pages
[Lecture Notes in Computer Science 2341] Cui Yu (eds.) - High-Dimensional Indexing_ Transformational Approaches to High-Dimensional Range and Similarity Searches (2003, Springer-Verlag Berlin Heidelberg) - li.pdf
No ratings yet
[Lecture Notes in Computer Science 2341] Cui Yu (eds.) - High-Dimensional Indexing_ Transformational Approaches to High-Dimensional Range and Similarity Searches (2003, Springer-Verlag Berlin Heidelberg) - li.pdf
159 pages
5 3 3 2 + + - Where K Is An Integer To Be Found. (3) : 1 Without Using A Calculator, Show That
No ratings yet
5 3 3 2 + + - Where K Is An Integer To Be Found. (3) : 1 Without Using A Calculator, Show That
13 pages
Geometric Algebras For Euclidean Geometry: Charles Gunn
No ratings yet
Geometric Algebras For Euclidean Geometry: Charles Gunn
25 pages
Example Calculation of P-Y Curve in Sand For Laterally Loaded Pile PDF
100% (2)
Example Calculation of P-Y Curve in Sand For Laterally Loaded Pile PDF
5 pages
OCR Maths M2 Topic Questions From Papers Circular Motion
No ratings yet
OCR Maths M2 Topic Questions From Papers Circular Motion
12 pages
Shape
No ratings yet
Shape
22 pages
Solutions 3 - 640
No ratings yet
Solutions 3 - 640
3 pages
Clustering
No ratings yet
Clustering
104 pages
Vectors and Matrices
No ratings yet
Vectors and Matrices
22 pages
Problem Solving Using Geogebra - The Case of Geometry: Kaja Maricic
No ratings yet
Problem Solving Using Geogebra - The Case of Geometry: Kaja Maricic
10 pages
Minkowski's Theorem
No ratings yet
Minkowski's Theorem
14 pages
Distance Metric Learning: A Comprehensive Survey: Liu Yang Advisor: Rong Jin May 8th, 2006
No ratings yet
Distance Metric Learning: A Comprehensive Survey: Liu Yang Advisor: Rong Jin May 8th, 2006
51 pages
Paper - 29 - Presentation
No ratings yet
Paper - 29 - Presentation
40 pages
Advanced Calculus (MATH 104) - Polar Coordinates: Harish Bhandari
No ratings yet
Advanced Calculus (MATH 104) - Polar Coordinates: Harish Bhandari
14 pages
Delaunay Triangulation
No ratings yet
Delaunay Triangulation
10 pages
Cómo Explicar Física Cuántica Con Un Gato Zombi
No ratings yet
Cómo Explicar Física Cuántica Con Un Gato Zombi
23 pages
Trigonometry at A Glance
No ratings yet
Trigonometry at A Glance
5 pages
Reachable Distance Function For KNN Classification
No ratings yet
Reachable Distance Function For KNN Classification
152 pages
14 Partial Derivatives 2
No ratings yet
14 Partial Derivatives 2
23 pages
Efficient Nearest Neighbor Search in High Dimensional Hamming Space
No ratings yet
Efficient Nearest Neighbor Search in High Dimensional Hamming Space
11 pages
0606 s14 QP 12
No ratings yet
0606 s14 QP 12
16 pages
PhySci - Free Fall Module
No ratings yet
PhySci - Free Fall Module
5 pages
Similarity Measures For Categorical Data
No ratings yet
Similarity Measures For Categorical Data
12 pages
MIT8 01SC Problems25 Soln
100% (1)
MIT8 01SC Problems25 Soln
17 pages
Questions Make It Right Paper 2 Sm015
No ratings yet
Questions Make It Right Paper 2 Sm015
2 pages
L3 KNN
No ratings yet
L3 KNN
17 pages
Reachable Distance Function For KNN Classification
No ratings yet
Reachable Distance Function For KNN Classification
15 pages
B.Sc. I Semester Physics:: Paper I Mechanics and Oscillations:: Imp Questions
100% (4)
B.Sc. I Semester Physics:: Paper I Mechanics and Oscillations:: Imp Questions
2 pages
Standardization and Its Effects On K-Means Clustering Algorithm
No ratings yet
Standardization and Its Effects On K-Means Clustering Algorithm
6 pages
When Is "Nearest Neighbor" Meaningful?: Abstract. We Explore The Effect of Dimensionality On The "Nearest Neigh
No ratings yet
When Is "Nearest Neighbor" Meaningful?: Abstract. We Explore The Effect of Dimensionality On The "Nearest Neigh
19 pages
K-Nearest Neighbors: Nipun Batra July 5, 2020
No ratings yet
K-Nearest Neighbors: Nipun Batra July 5, 2020
66 pages
A Gentle Introduction To The Kernel Distance: 1 Definitions
No ratings yet
A Gentle Introduction To The Kernel Distance: 1 Definitions
9 pages
Lecture 8-9 - Clustering
No ratings yet
Lecture 8-9 - Clustering
43 pages
Alam Uri 2014
No ratings yet
Alam Uri 2014
8 pages
Differential Equation of The Elastic Curve
No ratings yet
Differential Equation of The Elastic Curve
6 pages
Lesson 6 Similarities KNN
No ratings yet
Lesson 6 Similarities KNN
25 pages
IJRET - Scalable and Efficient Cluster-Based Framework For Multidimensional Indexing
No ratings yet
IJRET - Scalable and Efficient Cluster-Based Framework For Multidimensional Indexing
5 pages
An MDL Framework For Data Clustering: Petri Kontkanen, Petri Myllym Aki, Wray Buntine, Jorma Rissanen and Henry Tirri
No ratings yet
An MDL Framework For Data Clustering: Petri Kontkanen, Petri Myllym Aki, Wray Buntine, Jorma Rissanen and Henry Tirri
35 pages
Pattern Recognition - Clustering - Classification
No ratings yet
Pattern Recognition - Clustering - Classification
177 pages
ML Co4 Session 29
No ratings yet
ML Co4 Session 29
36 pages
Unit-V Cluster Analysis?: Unsupervised Classification Stand-Alone Tool Preprocessing Step
No ratings yet
Unit-V Cluster Analysis?: Unsupervised Classification Stand-Alone Tool Preprocessing Step
24 pages
Mbict 111 - 162 - 2021 - 11 - 14032021 - 3236
No ratings yet
Mbict 111 - 162 - 2021 - 11 - 14032021 - 3236
30 pages
Am 101 A PDF
No ratings yet
Am 101 A PDF
3 pages
Lecture 6 Clustring
No ratings yet
Lecture 6 Clustring
7 pages
Chapter 4
No ratings yet
Chapter 4
8 pages
Instance Based Learning
No ratings yet
Instance Based Learning
20 pages
Clustering Hierarchical Algorithms
100% (1)
Clustering Hierarchical Algorithms
21 pages
p117 Andoni
No ratings yet
p117 Andoni
6 pages
TM3 ch07 Clustering
No ratings yet
TM3 ch07 Clustering
47 pages
Clustering
No ratings yet
Clustering
104 pages
Geometry-Grade 7 Reviewer
100% (1)
Geometry-Grade 7 Reviewer
4 pages
Similarty and Dissimilarity
No ratings yet
Similarty and Dissimilarity
11 pages
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection
No ratings yet
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection
4 pages
Lec09 466 PDF
No ratings yet
Lec09 466 PDF
5 pages
K-Nearest Neighbour Classifiers
No ratings yet
K-Nearest Neighbour Classifiers
18 pages
III Clustering
No ratings yet
III Clustering
87 pages
4.4-InstanceBasedLearning Part 1
No ratings yet
4.4-InstanceBasedLearning Part 1
16 pages
Mini Project
No ratings yet
Mini Project
62 pages
01 Basics 02knn 02
No ratings yet
01 Basics 02knn 02
7 pages
Introduction To Classification - KNN
No ratings yet
Introduction To Classification - KNN
29 pages
2 Similarity Disimilarity Measure
No ratings yet
2 Similarity Disimilarity Measure
35 pages
Distance and Similarity
No ratings yet
Distance and Similarity
33 pages
Similarity and Dissimilarity Measures: Distance
No ratings yet
Similarity and Dissimilarity Measures: Distance
50 pages
Module 4 ML
No ratings yet
Module 4 ML
11 pages
Clustering Algorithm With Learnable Distance For Categorical Data With Nominal and Ordinal Attributes
No ratings yet
Clustering Algorithm With Learnable Distance For Categorical Data With Nominal and Ordinal Attributes
5 pages
A Comparative Study On Distance Measuring Approach
No ratings yet
A Comparative Study On Distance Measuring Approach
3 pages
DMi 03-Proximity
No ratings yet
DMi 03-Proximity
51 pages
Chem320 Conformational Analysis Lecture Notes 1
No ratings yet
Chem320 Conformational Analysis Lecture Notes 1
23 pages
01 Basics 02knn 03
No ratings yet
01 Basics 02knn 03
9 pages
A Comparison Study On Similarity and Dissimilarity Measures in Clustering Continuous Data
No ratings yet
A Comparison Study On Similarity and Dissimilarity Measures in Clustering Continuous Data
20 pages
Similarity Based Learning (Part 2)
No ratings yet
Similarity Based Learning (Part 2)
15 pages
ML Unit 2
No ratings yet
ML Unit 2
11 pages
K Nearest Neighbour - Algorithm
No ratings yet
K Nearest Neighbour - Algorithm
29 pages
Evaluation of K Nearest Neighbour Classifier Performance For Heterogeneous Data Sets
No ratings yet
Evaluation of K Nearest Neighbour Classifier Performance For Heterogeneous Data Sets
15 pages
Chapter 2
No ratings yet
Chapter 2
70 pages
Distance and Similarity
No ratings yet
Distance and Similarity
33 pages
DMi 03 Proximity
No ratings yet
DMi 03 Proximity
9 pages
Lecture 4
No ratings yet
Lecture 4
33 pages
Wang 2019
No ratings yet
Wang 2019
11 pages
Formulas at A Glance - IDS
No ratings yet
Formulas at A Glance - IDS
5 pages
05 KNN
No ratings yet
05 KNN
49 pages
Elpis
No ratings yet
Elpis
12 pages

On The Surprising Behavior of Distance Metrics

Uploaded by

On The Surprising Behavior of Distance Metrics

Uploaded by

On the Surprising Behavior of Distance Metrics

in High Dimensional Space

Charu C. Aggarwal1 , Alexander Hinneburg2 , and Daniel A. Keim2

Abstract. In recent years, the effect of the curse of high dimensionality

2 Behavior of the Lk -Norm in High Dimensionality

Lemma 1. Let F be uniform distribution of r N = 2 points. For an Lk metric,

Proof. Let Ad and Bd be the two points in a d dimensional data distribu-

Dividing both sides by d1/k−1/2 and regrouping the right-hand-side we get:

Consequently, using Slutsky’s theorem 3 and the results of Equation 1 we obtain

Having characterized the convergence behavior of the denominator of the right

σ is the standard deviation of (Pi )k . The sum of different values of Ri over d

Corollary 2. Let F be the arbitrary distribution of N = n points. Then,

(a) k = 3 (b) k = 2 (c) k = 1

(d) k = 2/3 (e) k = 2/5

Fig. 1. |Dmax − Dmin| depending on d for different metrics (uniform data)

Dimensionality P [Ud < Td ] Dimensionality P [Ud < Td ]

Theoremh2. Let F be theuniform distribution

Proof. Let Ad , Bd , P1 . . . Pd , Q1 . . . Qd , P Ad , P Bd be defined as in the proof

We have also shown in the previous result that:

We can combine the results in Equation 7 and 8 to obtain:

Note that the above results confirm

RELATIVE CONTRAST FOR UNIFORM DISTRIBUTION

Fig. 3. Unit spheres for different fractio-

that for values of d in thepreasonable range of data mining applications, the

3 Fractional Distance Metrics

To give a intuition of the behavior of the fractional distance metric we plotted

Lemma 3. Let F be the uniform distribution of N = 2 points and f = 1/l for

Proof. Let Ad , Bd , P1 . . . Pd , Q1 . . . Qd , P Ad , P Bd be defined using the Lf metric

By using the results in Equation 10, we can derive that:

Combining the results of Equations 12 and 11, we get:

An direct consequence of the above result is the following generalization to

Corollary 3. When F is the uniform distribution of N = n points and f = 1/l

Proof. Similar to corollary 1.

Theorem 3. Let F be the uniform distribution of N = 2 points and f = 1/l

Proof. Analogous to the proof of Theorem 2.

The following is the direct generalization to N = n points.

Corollary 4. Let F be the uniform distribution of N = n points, and f = 1/l

Proof. Analogous to the proof of Corollary 1.

Fig. 4. Effectiveness of k-Means

Data Set L0.1 L0.5 L1 L2 L4 L10 L∞ Random

random performance (at a masking probability of 0.4), whereas the L1 degrades

5 Conclusions and Summary

In this paper, we showed some surprising results of the qualitative behavior of

Proof. Let Ad and Bd be the two points in a d dimensional data distribution

We intend to show that |PdA1/k−1/2

Dividing both sides by d1/k−1/2 and regrouping on right-hand-side we get

Consequently, using Slutsky’s theorem and the results of Equation 14 we have:

Having characterized the convergence behavior of the denominator of the right-

of Ri over d dimensions √will converge

The result follows.

You might also like

Theoremh2. Let F be theuniform distribution