
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 4, APRIL 2005

Antipole Tree Indexing to Support Range Search and K-Nearest Neighbor Search in Metric Spaces

Domenico Cantone, Alfredo Ferro, Alfredo Pulvirenti, Diego Reforgiato Recupero, and Dennis Shasha

Abstract—Range and k-nearest neighbor searching are core problems in pattern recognition. Given a database S of objects in a metric space M and a query object q in M, in a range searching problem the goal is to find the objects of S within some threshold distance to q, whereas in a k-nearest neighbor searching problem, the k elements of S closest to q must be produced. These problems can obviously be solved with a linear number of distance calculations, by comparing the query object against every object in the database. However, the goal is to solve such problems much faster. We combine and extend ideas from the M-Tree, the Multivantage Point structure, and the FQ-Tree to create a new structure in the "bisector tree" class, called the Antipole Tree. Bisection is based on the proximity to an "Antipole" pair of elements generated by a suitable linear randomized tournament. The final winners a, b of such a tournament are far enough apart to approximate the diameter of the splitting set. If dist(a, b) is larger than the chosen cluster diameter threshold, then the cluster is split. The proposed data structure is an indexing scheme suitable for (exact and approximate) best match searching on generic metric spaces. The Antipole Tree outperforms by a factor of approximately two existing structures such as List of Clusters, M-Trees, and others and, in many cases, it achieves better clustering properties.

Index Terms—Indexing methods, similarity measures, information search and retrieval.

1 INTRODUCTION
Then, one recursively constructs the tree rooted in ci
S EARCHING is a basic problem in metric spaces. Hence, much
efforts have been spent both in clustering algorithms,
which are often included in the searching process as a
associated with the partition set of the elements closer to
ci , for i ¼ 1; 2.
preliminary step (see BIRCH [53], DBSCAN [24], CLIQUE [3], A good choice for the pair ðc1 ; c2 Þ of splitting points
BIRCH* [27], WaveClusters [46], CURE [32], and CLARANS consists of maximizing their distance. For this purpose,
[41]), and in the development of new indexing techniques we propose a simple approximate algorithm based on
(see, for instance, MVP-Tree [9], M-Tree [22], SLIM-Tree [48], tournaments of the type described in [6]. Our tournament
FQ-Tree [4], List of Clusters [16], and SAT [40]; the reader is is played as follows: At each round, the winners of the
also referred to [18] for a survey on this subject). For the previous round are randomly partitioned into subsets of a
special case of Euclidean spaces, one can see [2], [29], [8], fixed size  and their 1-medians1 are discarded. Rounds
X-Tree [7], and CHILMA [47]. are played until one is left with less than 2 elements.
We combine and extend ideas from the M-Tree, MVP-Tree, The farthest pair of points in the final set is our Antipole
and FQ-Tree structures together with randomized techniques pair of elements.
coming from the approximate algorithms community [6], to The paper is organized as follows: In the next section, we
design a simple and efficient indexing scheme called Antipole give the basic definitions of range search and k-nearest
Tree. This data structure is able to support range queries and neighbor queries in general metric spaces and we briefly
k-nearest neighbor queries in generic metric spaces. review relevant previous work, with special emphasis on
The Antipole Tree belongs to the class of “bisector trees” those structures which have been shown to be the most
[18], [13], [42], which are binary trees whose nodes effective, such as List of Clusters [16], M-Trees [22], and
represent sets of elements to be clustered. Its construction MVP-Trees [9]. The Antipole Tree is described in Section 3.
begins by first allocating a root r and then selecting two Techniques to compute the approximate 1-Median and the
splitting points c1 , c2 in the input set, which become the diameter of a subset of a generic metric space are
children of r. Subsequently, the points in the input set are illustrated, respectively, in Sections 3.1 and 3.2. In
partitioned according to their proximity to the points c1 , c2 . Section 4, we present a procedure for range searching on
the Antipole Tree. Section 5 presents an algorithm for the
exact k-nearest neighbor problem. The Antipole Tree is
. D. Cantone, A. Ferro, A. Pulvirenti, and D.R. Reforgiato are with the experimentally compared with List of Clusters, M-Tree, and
Dipartimento di Matematica e Informatica, Università degli Studi di MVP-Tree in Section 6. In particular, cluster diameter
Catania, Italy, Viale Andrea Doria n. 6 95125 Cantania. threshold tuning is discussed. An approximate k-nearest
E-mail: {cantone, ferro, apulvirenti, diegoref}@dmi.unict.it.
. D. Shasha is with the Computer Science Department, New York neighbor algorithm is also introduced in Section 7 and a
University, 251 Mercer Street, New York, NY 10012. comparison with the version for approximate search of List
E-mail: [email protected]. of Clusters [12] is given with a precision-recall analysis. In
Manuscript received 27 Aug. 2003; revised 9 Apr. 2004; accepted 14 Sept. Section 8, we deal with the problem of the curse of
2004; published online 17 Feb. 2005.
For information on obtaining reprints of this article, please send e-mail to: 1. We recall that the 1-median of a set of points S in a metric space is an
[email protected], and reference IEEECS Log Number TKDE-0160-0803. element of S whose average distance from all points of S is minimal.
1041-4347/05/$20.00 ß 2005 IEEE Published by the IEEE Computer Society

2 BASIC DEFINITIONS AND RELATED WORK

Let M be a nonempty set of objects and let dist : M × M → ℝ be a function such that the following properties hold:

1. (∀x, y ∈ M) dist(x, y) ≥ 0 (positiveness);
2. (∀x, y ∈ M) dist(x, y) = dist(y, x) (symmetry);
3. (∀x ∈ M) dist(x, x) = 0 (reflexivity) and (∀x, y ∈ M) (x ≠ y → dist(x, y) > 0) (strict positiveness);
4. (∀x, y, z ∈ M) dist(x, y) ≤ dist(x, z) + dist(z, y) (triangle inequality);

then the pair (M, dist) is called a metric space and dist is called its metric function. Well-known metric functions include the Manhattan distance, the Euclidean distance, the string edit distance, and the shortest path distance through a graph. Our goal is to build a low-cost data structure for the range search problem and for k-nearest neighbor searching in metric spaces.

Definition 2.1 (Range query). Given a query object q, a database S, and a threshold t, the Range Search Problem is to find all objects {o ∈ S | dist(o, q) ≤ t}.

Definition 2.2 (k-Nearest Neighbor query). Given a query object q and an integer k > 0, the k-Nearest Neighbor Problem is to retrieve the k closest elements to q in S.

Our basic cost measure is the number of distance calculations, since these are often expensive in metric spaces, e.g., when computing the editing distance among strings.
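As a concrete illustration of such a costly metric, here is a minimal, self-contained C sketch of the string edit (Levenshtein) distance, which is one of the distance functions used in the paper's experiments; the implementation details (single-row dynamic program, the 256-column cap) are ours, not the authors':

```c
#include <stdio.h>
#include <string.h>

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Classic dynamic program: minimum number of insertions, deletions,
 * and substitutions turning s into t. It satisfies all four metric
 * axioms above, including the triangle inequality. */
int edit_distance(const char *s, const char *t) {
    int n = (int)strlen(s), m = (int)strlen(t);
    int row[256], prev, tmp;              /* assumes strlen(t) < 256 */
    for (int j = 0; j <= m; j++) row[j] = j;
    for (int i = 1; i <= n; i++) {
        prev = row[0];                    /* old row[j-1] */
        row[0] = i;
        for (int j = 1; j <= m; j++) {
            tmp = row[j];                 /* save old row[j] */
            row[j] = min3(row[j] + 1,                  /* delete s[i-1]   */
                          row[j-1] + 1,                /* insert t[j-1]   */
                          prev + (s[i-1] != t[j-1]));  /* (mis)match      */
            prev = tmp;
        }
    }
    return row[m];
}

int main(void) {
    printf("%d\n", edit_distance("antipole", "antipodes")); /* prints 2 */
    return 0;
}
```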
Three main sources of ideas have contributed to our work. The FQ-Tree [4], an example of a structure using pivots (see [18] for an extended survey), organizes the items of a collection ranging over a metric space into the leaves of a tree data structure. Viewed abstractly, FQ-Trees consist of a vector of reference objects r_1, …, r_k and a distance vector v_o associated with each object o such that v_o[i] = dist(o, r_i). A query object q computes a distance to each reference object, thus obtaining a vector v_q. Object o cannot be within a threshold distance t from q if, for any i, v_q[i] > v_o[i] + t: by the triangle inequality, dist(q, o) ≥ v_q[i] − v_o[i] > t.

We use a similar idea, except that our reference objects are the centroids of clusters.
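A minimal sketch of this pruning test, with our own naming (note that the symmetric case v_o[i] > v_q[i] + t prunes as well, since the bound really is |v_q[i] − v_o[i]| ≤ dist(q, o)):

```c
/* Pivot-based exclusion: object o can be skipped without computing
 * dist(o, q) whenever |vq[i] - vo[i]| > t for some reference object r_i,
 * because the triangle inequality then guarantees dist(o, q) > t.
 * k is the number of reference objects. */
int can_prune(const double *vq, const double *vo, int k, double t) {
    for (int i = 0; i < k; i++) {
        double diff = vq[i] - vo[i];
        if (diff > t || -diff > t)        /* |vq[i] - vo[i]| > t */
            return 1;                     /* o is provably out of range */
    }
    return 0;                             /* must compute dist(o, q) */
}
```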
M-Trees [22], [20] are dynamically balanced trees. Nodes of an M-Tree store several items of the collection, provided that they are "close" and "not too numerous." If one of these conditions is violated, the node is split and a suitable subtree originating in the node is recursively constructed. In the M-Tree, each parent node corresponds to a cluster with a radius and every child of that node corresponds to a subcluster with a smaller radius. If a centroid x has a distance dist(x, q) from the query object and the radius of the cluster is r, then the entire cluster corresponding to x can be discarded if dist(x, q) > t + r.

We take from the M-Tree the idea that a parent node corresponds to a cluster and its children nodes are subclusters of that parent cluster. The main differences between our algorithm and the M-Tree are the construction method, the fact that clusters in the M-Tree must have a limited number of elements, and the search strategy, as our algorithm produces a binary tree data structure.

VP-Trees ([49], [52]) organize items coming from a metric space into a binary tree. The items are stored both in the leaves and in the internal nodes of the tree. The items stored in the internal nodes are the "vantage points." Processing a query requires the computation of the distance between the query point and some of the vantage points. The construction of a VP-Tree partitions a data set according to the distances that the objects have with respect to a reference point. The median value of these distances is used as a separator to partition objects into two balanced subsets (those as close or closer than the median and those farther than the median). The same procedure can recursively be applied to each of the two subsets.

The Multi-Vantage-Point tree [9] is an intellectual descendant of the vantage point tree and of the GNAT [10] structure. The MVP-Tree appears to be superior to the previous methods. The fundamental idea is that, given a point p, one can partition all objects into m partitions based on their distances from p, where the first partition consists of those points within distance d_1 from p, the second consists of those points whose distance is greater than d_1 and less than or equal to d_2, etc. Given two points, p_a and p_b, the partitions a_1, …, a_m based on p_a and the partitions b_1, …, b_m based on p_b can be created. One can then intersect all possible a- and b-partitions (i.e., a_i intersect b_j for 1 ≤ i ≤ m and 1 ≤ j ≤ m) to get m² partitions. In an MVP-Tree, each node in the tree corresponds to two objects (vantage points) and m² children, where m is a parameter of the construction algorithm and each child corresponds to a partition. When searching for objects within distance t of query point q, the algorithm does the following: given a parent node having vantage points p_a and p_b, if some partition Z has the property that for every object z ∈ Z, dist(z, p_a) < d_z, and dist(q, p_a) > d_z + t, then Z can be discarded. There are other reasons for discarding clusters, also based on the triangle inequality. Using multiple vantage points together with precomputed distances reduces the number of distance computations at query time. Like the MVP-Tree, our structure makes aggressive use of the triangle inequality.

Another relevant recent work, due to Chávez and Navarro [16], proposes a structure called List of Clusters. Such a list is constructed in the following way: starting from a random point, a cluster with bounded diameter (or limited number of objects) centered in that random point is constructed. Then, such a process is iterated by selecting a new point, for example, the farthest from the previous one, and constructing another cluster around it. The process terminates when no more points are left. The authors experimentally show that their structure outperforms other existing methods when parameters are chosen in a suitable way.

Other sources of inspiration include [11], [23], [26], [30], [45], [44], [48], [40].

Fig. 1. The 1-Median algorithm.

3 THE ANTIPOLE TREE

Let (M, dist) be a finite metric space, let S be a subset of M, and suppose that we aim to split it into the minimum possible number of clusters whose radii should not exceed a given threshold σ. This problem has been studied by Hochbaum and Maass [35] for Euclidean spaces. Their approximation algorithm has been improved by Gonzalez in [31]. Similar ideas are used by Feder and Greene [25] (see [43] for an extended survey on clustering methods in Euclidean spaces).

The Antipole clustering of bounded radius σ is performed by a recursive top-down procedure starting from the given finite set of points S and checking at each step whether a given splitting condition is satisfied. If this is not the case, then splitting is not performed, the given subset is a cluster, and a centroid having distance approximately less than σ from every other node in the cluster is computed by the procedure described in Section 3.1. Otherwise, if the splitting condition is satisfied, then a pair of points {A, B} of S, called the Antipole pair, is generated by the algorithm described in Section 3.2 and is used to split S into two subsets S_A and S_B, obtained by assigning each point p of S to the subset containing the endpoint of the Antipole {A, B} closest to p. The splitting condition states that dist(A, B) is greater than the cluster diameter threshold corrected by the error coming from the Euclidean case analysis described in the Appendix, which can be found on the Computer Society Digital Library at http://computer.org/tkde/archives.htm. Indeed, the diameter threshold is based on a statistical analysis of the pairwise distances of the input set (see Section 6.2), which can be used to evaluate the intrinsic dimension [18] of the metric space. The tree obtained by the above procedure is called an Antipole Tree. All nodes are annotated with the Antipole endpoints and the corresponding cluster radius; each leaf also contains the 1-median of the corresponding final cluster. Its implementation is described in Section 3.3.

3.1 1-Median

In this section, we review a randomized algorithm for approximate 1-median selection [14], an important subroutine in our Antipole Tree construction. It is based on a tournament played among the elements of the input set S. At each round, the elements which passed the preceding turn are randomly partitioned into subsets, say X_1, …, X_k. Then, each subset X_i is locally processed by a procedure which computes its exact 1-median x_i. The elements x_1, …, x_k move to the next round. The tournament terminates when we are left with a single element x, the final winner. The winner approximates the exact 1-median of S. Fig. 1 contains the pseudocode of this algorithm. The local optimization procedure 1-MEDIAN(X) returns the exact 1-median in X. A running time analysis (see [14] for details) shows that the above procedure takes time (t/2)·n + o(n) in the worst case, where t is the tournament size.
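The following is a compact C sketch of this tournament, under stated assumptions: dist() is an arbitrary metric supplied by the caller, TAU is the tournament size, and the termination threshold of Fig. 1 is fixed at a single element here. Names are illustrative; the paper's actual pseudocode is in Fig. 1.

```c
#include <stdlib.h>

#define TAU 3

extern double dist(int a, int b);     /* assumed metric on object ids */

/* exact 1-median of ids[0..n-1]: element minimizing the distance sum */
static int exact_1median(const int *ids, int n) {
    int best = ids[0];
    double best_sum = 1e300;
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++) s += dist(ids[i], ids[j]);
        if (s < best_sum) { best_sum = s; best = ids[i]; }
    }
    return best;
}

/* Rounds of random TAU-subsets; one local 1-median survives per subset,
 * until a single element, the approximate 1-median, remains. */
int approx_1median(int *ids, int n) {
    while (n > 1) {
        /* random partition: shuffle, then take consecutive blocks */
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1), t = ids[i];
            ids[i] = ids[j]; ids[j] = t;
        }
        int kept = 0;
        for (int off = 0; off < n; off += TAU) {
            int sz = n - off < TAU ? n - off : TAU;
            ids[kept++] = exact_1median(ids + off, sz);
        }
        n = kept;
    }
    return ids[0];
}
```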
3.2 The Diameter (Antipole) Computation

Let (M, dist) be a metric space with distance function dist : M × M → ℝ and let S be a finite subset of M. The diameter computation problem, or furthest pair problem, is to find the pair of points A, B in S such that dist(A, B) ≥ dist(x, y) for all x, y ∈ S.

As observed in [36], we can construct a metric space where all distances among objects are set to 1 except for one (randomly chosen) which is set to 2. In this case, any algorithm that tries to guarantee an approximation factor greater than 1/2 must examine all pairs, so a randomized algorithm will not necessarily find that pair.

Nevertheless, we expect a good outcome in nearly all cases. Here, we introduce a randomized algorithm inspired by the one proposed for the 1-median computation [14] and reviewed in the preceding section. In this case, each subset X_i is locally processed by a procedure LOCAL_WINNER which computes its exact 1-median x_i and then returns the set X̄_i obtained by removing the element x_i from X_i. The elements in X̄_1 ∪ X̄_2 ∪ … ∪ X̄_k are used in the subsequent step. The tournament terminates when we are left with a single set, X, from which we extract the final winners A, B as the furthest points in X. The pair A, B is called the Antipole pair and their distance represents the approximate diameter of the set S.

The pseudocode of the Antipole algorithm APPROX_ANTIPOLE, similar to that of the 1-Median algorithm in Fig. 1, is given in Fig. 2.

A faster (but less accurate) variant of APPROX_ANTIPOLE can be used. Such a variant, called FAST_APPROX_ANTIPOLE, consists of taking X̄_i as the farthest pair of X_i. Its pseudocode can therefore be obtained simply by replacing in APPROX_ANTIPOLE each call to LOCAL_WINNER by a call to FIND_ANTIPOLE. In the next section, we will prove that both variants have a running time linear in the number of elements. We will also show that FAST_APPROX_ANTIPOLE is also linear in the tournament size τ, whereas APPROX_ANTIPOLE is quadratic with respect to τ.

For tournaments of size 3, the two variants plainly coincide. Thus, since in the rest of the paper only tournaments of size 3 will be considered, by referring to the faster variant we will not lose any accuracy.
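A sketch of the faster variant, under the same assumptions as the 1-median sketch above (the random shuffling of ids before partitioning is omitted here for brevity, and all names are ours):

```c
#include <stdlib.h>

#define TAU 3

extern double dist(int a, int b);            /* assumed metric */

/* FIND_ANTIPOLE: farthest pair inside ids[0..n-1], written to out[0..1] */
static void find_antipole(const int *ids, int n, int out[2]) {
    double best = -1.0;
    out[0] = ids[0]; out[1] = ids[n > 1 ? 1 : 0];
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (dist(ids[i], ids[j]) > best) {
                best = dist(ids[i], ids[j]);
                out[0] = ids[i]; out[1] = ids[j];
            }
}

/* FAST_APPROX_ANTIPOLE: each subset contributes its farthest pair; the
 * tournament stops when fewer than 2*TAU elements remain, and the
 * farthest pair of the final set is the Antipole pair. */
void fast_approx_antipole(int *ids, int n, int antipole[2]) {
    while (n >= 2 * TAU) {
        int kept = 0;
        for (int off = 0; off < n; off += TAU) {
            int sz = n - off < TAU ? n - off : TAU, pair[2];
            if (sz < 2) { ids[kept++] = ids[off]; continue; }
            find_antipole(ids + off, sz, pair);
            ids[kept++] = pair[0];
            ids[kept++] = pair[1];
        }
        n = kept;
    }
    find_antipole(ids, n, antipole);
}
```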

Fig. 2. The Antipole Algorithm.

3.2.1 Running Time Analysis of the Antipole Computation

Two fundamental parameters present in the algorithm reported in Fig. 2 (and also in Fig. 1), namely, the splitting factor τ (also referred to as the tournament size) and the parameter threshold, need to be tuned.

The splitting factor τ is used to set the size of each subset X processed by procedure LOCAL_WINNER, with the only exception of one subset per round of the tournament (whose size is at most 2τ − 1) and of the argument of the last call to FIND_ANTIPOLE (whose size is at most equal to threshold). It is clear that the larger the value of τ, the better the output quality and the higher the computational cost. In many cases, a satisfying output quality can be obtained even with small values of τ.

A good trade-off between output quality and computational cost is obtained by choosing as value for τ one unit more than the dimension that characterizes the investigated metric space [18]. This suggestion lies on intuitive grounds developed in the case of a Euclidean metric space ℝ^m and is largely confirmed by the experiments reported in [14]. The parameter threshold controls the termination of the tournament. Again, larger values of threshold ensure better output quality, though at increasing cost. Observe that the value τ² − 1 for threshold forces the property that the last set of elements, from which the final winner is selected, must contain at least τ elements, provided that |S| ≥ τ. Moreover, in order to ensure a linear computational complexity of the algorithm, the threshold value needs to be O(√|S|). Consequently, a good choice is threshold = min{τ² − 1, √|S|}.

The algorithm APPROX_ANTIPOLE given in Fig. 2 is characterized by its simplicity and, hence, it is expected to be very efficient from the computational point of view, at least when the parameters τ and threshold are taken small enough. In fact, we will show below that our algorithm has a worst-case complexity of (τ(τ−1)/2)·n + o(n) in the input size n, provided that threshold is o(√n).

Plainly, the complexity of the algorithm APPROX_ANTIPOLE is dominated by the number of distances computed by it within calls to procedure LOCAL_WINNER. We shall estimate below such a number.

Let W(n, τ, ϑ) be the number of calls to procedure LOCAL_WINNER made within the while-loops by APPROX_ANTIPOLE, with an input of size n and parameters τ ≥ 3 and threshold ϑ ≥ 1. Plainly, W(n, τ, ϑ) ≤ W(n, τ, 1) for any ϑ ≥ 1, thus it will suffice to find an upper bound for W(n, τ, 1). For notational convenience, let us put W_1(n) = W(n, τ, 1), where τ has been fixed. It can easily be seen that W_1(n) satisfies the following recurrence relation:

\[
W_1(n) =
\begin{cases}
0 & \text{if } 0 \le n \le 2,\\
1 & \text{if } 3 \le n < 2\tau,\\
\lfloor n/\tau \rfloor + W_1\!\big((\tau-1)\cdot\lfloor n/\tau\rfloor\big) & \text{if } n \ge 2\tau.
\end{cases}
\]

By induction on n, we can show that W_1(n) ≤ n. For n < 2τ, our estimate is trivially true. Thus, let n ≥ 2τ. Then, by inductive hypothesis, we have

\[
W_1(n) = \Big\lfloor \frac{n}{\tau} \Big\rfloor + W_1\Big((\tau-1)\cdot\Big\lfloor \frac{n}{\tau}\Big\rfloor\Big)
\le \Big\lfloor \frac{n}{\tau} \Big\rfloor + (\tau-1)\cdot\Big\lfloor \frac{n}{\tau}\Big\rfloor
= \Big\lfloor \frac{n}{\tau} \Big\rfloor \big(1 + (\tau-1)\big) \le n.
\]

The number of distance computations made by a call LOCAL_WINNER(X) is equal to \(\sum_{i=1}^{|X|}(i-1) = |X|(|X|-1)/2\). At each round of the tournament, all the calls to procedure LOCAL_WINNER have an argument of size τ, with the possible exception of the last call, which can have an argument of size between τ + 1 and 2τ − 1. We notice that the last call to procedure FIND_ANTIPOLE, made within the return instruction of APPROX_ANTIPOLE, has an argument of size at most ϑ. Since there are ⌈log_{τ/(τ−1)} n⌉ rounds, it follows that the total number of distances computed by a call of APPROX_ANTIPOLE(S), with |S| = n, tournament size τ, and threshold ϑ, is majorized by the expression

\[
W(n,\tau,\vartheta) \le \frac{\tau(\tau-1)}{2}\,n
+ \Big\lceil \log_{\tau/(\tau-1)} n \Big\rceil
\left( \frac{(2\tau-1)(2\tau-2)}{2} - \frac{\tau(\tau-1)}{2} \right)
+ \frac{\vartheta(\vartheta-1)}{2}
= \frac{\tau(\tau-1)}{2}\,n + O(\log n + \vartheta^2).
\]

By taking ϑ = o(√n), the above expression is easily seen to be (τ(τ−1)/2)·n + o(n).

Summing up, we have:

Theorem 3.1. Given an input set of size n ∈ ℕ, a constant tournament size τ ≥ 3, and a threshold ϑ = o(√n), the algorithm APPROX_ANTIPOLE performs (τ(τ−1)/2)·n + o(n) distance computations.

Concerning the complexity of the faster variant FAST_APPROX_ANTIPOLE, we have the following recurrence relation: W̄_1(n) = ⌊n/τ⌋ + W̄_1(2·⌊n/τ⌋), for n ≥ 2τ. By

induction on n, we can show that the number of calls to the subroutine FIND_ANTIPOLE is W̄_1(n) ≤ ⌈n/(τ−2)⌉. For n < 2τ, our estimate is trivially true. Thus, let n ≥ 2τ. Then, by inductive hypothesis, we have

\[
\bar W_1(n) = \Big\lfloor \frac{n}{\tau} \Big\rfloor + \bar W_1\Big(2\cdot\Big\lfloor \frac{n}{\tau}\Big\rfloor\Big)
\le \Big\lfloor \frac{n}{\tau} \Big\rfloor + \frac{2\,\lfloor n/\tau\rfloor}{\tau-2}
\le \frac{n}{\tau}\Big(1 + \frac{2}{\tau-2}\Big) = \frac{n}{\tau-2}.
\]

Finally, much by the same arguments as those preceding Theorem 3.1, we can show that the following holds:

Theorem 3.2. Given an input set of size n ∈ ℕ, a constant tournament size τ ≥ 3, and a threshold ϑ = o(√n), the algorithm FAST_APPROX_ANTIPOLE performs (τ(τ−1)/(2(τ−2)))·n + o(n) distance computations.
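As a quick arithmetic check (ours, not in the paper's text): for the size-3 tournaments used in the rest of the paper, the leading constants of Theorems 3.1 and 3.2 agree, which is consistent with the remark at the end of Section 3.2 that the two variants coincide when τ = 3:

```latex
\[
\left.\frac{\tau(\tau-1)}{2}\,n\right|_{\tau=3} = \frac{3\cdot 2}{2}\,n = 3n,
\qquad
\left.\frac{\tau(\tau-1)}{2(\tau-2)}\,n\right|_{\tau=3} = \frac{3\cdot 2}{2\cdot 1}\,n = 3n .
\]
```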
Fig. 3. (a) A generic object in the Antipole data structure. (b) A generic cluster in the Antipole data structure.

3.3 The Antipole Tree Data Structure in General Metric Spaces

The Antipole Tree data structure can be used in a generic metric space (M, dist), where dist is the distance metric. Each element of the metric space, along with its related data, constitutes a type called object. An object O (Fig. 3a) in the Antipole data structure contains the following information: an element x, an array DV storing the distances between x and all its ancestors (the Antipole pairs) in the tree, and a variable DC containing the distance from the centroid C of x's cluster. A data set S is a collection of objects drawn from M. Each cluster (Fig. 3b) stores the following information:

- centroid, C, the element that minimizes the sum of the distances from the other cluster members;
- radius, Radius, containing the distance from C to the farthest object;
- member list, CList, storing the catalog of the objects contained in the cluster;
- size of CList, Size, stored in the cluster.

The Antipole data structure has internal nodes and leaf nodes:

- An internal node stores 1) the identities of two Antipole objects A and B, called the Antipole pair, at distance at least 2σ apart, 2) the radii Rad_A and Rad_B of the two subsets S_A and S_B obtained by splitting S based on their proximity to A and B, respectively, and 3) pointers to the left and right subtrees.
- A leaf node stores a cluster.
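The description above maps directly onto C structures. The following layout is an illustrative sketch: the field names follow the text (DV, DC, CList, and so on), while the concrete types are our assumption, not the authors' code.

```c
typedef struct object {
    void   *element;    /* the element x of the metric space           */
    double *DV;         /* distances from x to all its ancestors       */
    double  DC;         /* distance from the centroid of x's cluster   */
} object_t;

typedef struct cluster {
    object_t  *centroid;  /* C: minimizes the sum of in-cluster distances */
    double     radius;    /* distance from C to the farthest member       */
    object_t **CList;     /* catalog of the cluster's objects             */
    int        size;      /* number of entries in CList                   */
} cluster_t;

typedef struct node {
    object_t *A, *B;            /* Antipole pair, dist(A, B) > 2*sigma */
    double    radA, radB;       /* radii of the subsets S_A and S_B    */
    struct node *left, *right;  /* internal node: two subtrees         */
    cluster_t  *leaf;           /* leaf node: the stored cluster       */
} node_t;
```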
To build such a data structure, the procedure BUILD (see Fig. 4) takes as input the data set S, a target cluster radius σ, and a set Q (empty at the beginning). The algorithm starts by checking if Q is empty and, if so, it calls the subroutine ADAPTED_APPROX_ANTIPOLE,² which returns an Antipole pair. Then, the Antipole pair is inserted into Q. Next, the algorithm checks if the splitting condition is true. If this is the case, the set S is divided into S_A and S_B, where the objects closer to A are put in S_A, and symmetrically for B. Otherwise, a cluster is generated. The other subroutine used in BUILD is CHECK, which checks whether there is an object O in S_A (or S_B) that may become the Antipole of A (or B), by using the distances already computed and cached. If an Antipole is found, it is inserted into Q, and then the recursive call in BUILD skips the computation of another Antipole pair.

The routine MAKE_CLUSTER (Fig. 4) creates a cluster of objects with bounded radius. This procedure computes the cluster centroid C with the randomized algorithm APPROX_1_MEDIAN and then computes the distance between each object O in the cluster and C.

The data structure resulting from BUILD is a binary tree whose leaves contain a set of clusters, each of which has an approximate centroid and whose radius, based on that centroid, is less than σ. Fig. 5a shows the evolution of the data set during the construction of the tree. At the first step, the pair A, B is found by the algorithm ADAPTED_APPROX_ANTIPOLE, then the input data set is split into the subsets S_A and S_B. The second step proceeds as the first for the subset containing A while, for the subset containing B, it produces a cluster, since its diameter is less than 2σ. The third and final step produces the final clusters for the subsets containing A_1 and B_1. Fig. 5b shows the corresponding Antipole data structure.

3.3.1 Construction Time Analysis

Let us compute the running time of each routine. Building the Antipole Tree takes quadratic time in the worst case. For example, let us consider a metric space in which the distance between any pair of distinct objects is 2σ + 1. In this case, if the subsets S_A and S_B have size 1 and |S| − i, respectively, where i is the ith recursive call, then the complexity becomes O(n²). Notice that ADAPTED_APPROX_ANTIPOLE will take constant computational time in this case because all the pairwise distances are supposed to be strictly greater than 2σ.

2. Notice that this algorithm is a variation of FIND_ANTIPOLE that stops as soon as a pair of objects with distance greater than 2σ is found; otherwise, it returns an empty set.
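A minimal sketch of the recursive construction just analyzed, assuming the illustrative struct layout above and treating the tournament and cluster routines as black boxes; the CHECK caching optimization, the DV bookkeeping, and the computation of radA/radB are omitted:

```c
#include <stdlib.h>

extern double dist_obj(object_t *a, object_t *b);        /* assumed metric */
extern void   antipole_pair(object_t **S, int n,
                            object_t **A, object_t **B); /* Section 3.2   */
extern cluster_t *make_cluster(object_t **S, int n);     /* MAKE_CLUSTER  */

node_t *build(object_t **S, int n, double sigma) {
    node_t *v = calloc(1, sizeof *v);
    object_t *A, *B;
    if (n < 2) { v->leaf = make_cluster(S, n); return v; }
    antipole_pair(S, n, &A, &B);
    if (dist_obj(A, B) <= 2.0 * sigma) {      /* splitting condition fails */
        v->leaf = make_cluster(S, n);
        return v;
    }
    /* partition S by proximity to the Antipole endpoints */
    object_t **SA = malloc(n * sizeof *SA), **SB = malloc(n * sizeof *SB);
    int na = 0, nb = 0;
    for (int i = 0; i < n; i++) {
        if (dist_obj(S[i], A) <= dist_obj(S[i], B)) SA[na++] = S[i];
        else                                        SB[nb++] = S[i];
    }
    v->A = A; v->B = B;
    v->left  = build(SA, na, sigma);
    v->right = build(SB, nb, sigma);
    free(SA); free(SB);
    return v;
}
```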
Fig. 4. The algorithm Build Antipole Tree and the routine MakeCluster.

Fig. 5. A clustering example (a) in a generic metric space and (b) the corresponding Antipole data structure.

Fig. 6. The Range Search algorithm.

4 RANGE SEARCH ALGORITHM

The range search algorithm takes as input the Antipole Tree T, the query object q, and the threshold t, and returns the result of the range search of the database with threshold t. The search algorithm recursively descends all branches of the tree until either it reaches a leaf representing a cluster to be visited or it detects a subtree that is certainly out of range and, therefore, may be pruned out. Such branches are filtered by applying the triangle inequality. Notice that the triangle inequality is used both for exclusion and for inclusion. The use for exclusion establishes that an object can be pruned, thus avoiding the computation of the distance between such an object and the query. The other usage establishes that an object must be inserted because the object is close to its cluster's centroid and the centroid is very close to the query object (see Figs. 6 and 7 for the pseudocode).

Fig. 7. The Visit Cluster algorithm.

5 K-NEAREST NEIGHBOR ALGORITHM

The k-nearest neighbor search algorithm takes as input the Antipole Tree T, the query object q, and the parameter k indicating the number of objects requested. It returns the set of objects in S which are the k nearest neighbors of q.

Hjaltason and Samet in [34] propose a method called Incremental Nearest Neighbor to perform k-nearest neighbor search in spatial databases. Their approach uses a priority queue storing the subtrees that should be visited, ordered by their distance from the query object. The authors claim that their approach can be applied to all hierarchical data structures. Here, we propose an application of such a method to the Antipole Tree.

The algorithm described below uses two different priority queues. The first one stores the subtrees of the Antipole data structure which may be visited during the search (left subtree, right subtree, or leaf); the second one keeps track of the objects that will be returned as output.

The incremental nearest-neighbor algorithm starts by putting the root of the Antipole Tree in the priority queue pQueue. Then, it proceeds by extracting the minimum from the priority queue. If the extracted node is a leaf (cluster), it visits it. Otherwise, it decides whether to visit each of its subtrees on the basis of the subtree's radius, the distance of the Antipole endpoint from the query, and a threshold t, by applying the triangle inequality. The threshold t, which is initialized to ∞, stores the largest distance from the query q to any of the current k nearest neighbors. Subtrees which need to be visited are put in the priority queue. All current k nearest neighbors found are stored in another heap, outQueue, in order to optimize the dynamic operations (such as insertions, deletions, and updates). Figs. 8 and 9 summarize the pseudocode.

Fig. 8. The incremental k-nearest neighbor search algorithm.

Fig. 9. A procedure for checking whether the object O should be added to the OUT set.
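The control loop can be sketched as follows, under stated assumptions: pq_* is a min-priority queue of tree nodes keyed by a lower bound on their distance to q, and out_* is a bounded max-heap of the k best answers found so far; neither helper is shown and both are our assumptions, not the paper's code (Figs. 8 and 9).

```c
#include <math.h>

extern double dist_obj(object_t *a, object_t *b);
/* min-queue of nodes keyed by a lower bound on their distance to q */
extern void    pq_push(node_t *v, double key);
extern node_t *pq_pop_min(double *key);
extern int     pq_empty(void);
/* max-heap holding at most k answers; out_kth() returns the current
 * kth-best distance, or HUGE_VAL while fewer than k have been seen */
extern void    out_insert(object_t *o, double d, int k);
extern double  out_kth(void);

void knn_search(node_t *root, object_t *q, int k) {
    pq_push(root, 0.0);
    while (!pq_empty()) {
        double key, t = out_kth();        /* current search threshold */
        node_t *v = pq_pop_min(&key);
        if (key > t) break;               /* nothing closer remains   */
        if (v->leaf != NULL) {            /* visit the cluster        */
            cluster_t *c = v->leaf;
            for (int i = 0; i < c->size; i++) {
                double d = dist_obj(q, c->CList[i]);
                if (d <= out_kth()) out_insert(c->CList[i], d, k);
            }
        } else {                          /* push promising subtrees  */
            double da = dist_obj(q, v->A), db = dist_obj(q, v->B);
            if (da - v->radA <= out_kth())
                pq_push(v->left,  da - v->radA > 0 ? da - v->radA : 0);
            if (db - v->radB <= out_kth())
                pq_push(v->right, db - v->radB > 0 ? db - v->radB : 0);
        }
    }
}
```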
6 EXPERIMENTAL ANALYSIS

In this section, we evaluate the efficiency of constructing and searching through an Antipole Tree. We have implemented the structure using the C programming language under the Linux operating system. The experiments use synthetic and real data sets. The synthetic data sets are based on the ones used in [9]:

- uniform 10-dimensional Euclidean space (sets of 100,000; 200,000; …; 500,000 objects uniformly distributed in [0,1]^10);
- clustered 20-dimensional Euclidean space. More precisely, a set of 100,000 objects obtained in the following way: by using uniform distributions, take 100 random spheres and select 1,000 random points in each of them.

The real data sets are, respectively:

- a set of 45,000 strings chosen from the Linux dictionary with the editing distance;
- a set of 42,000 images chosen from the Corel image database with the metric L2;
- high-dimensional Euclidean space sets of points corresponding to textures of the VISTEX database [50] with the metric L2.

For each experiment, we ran 100 random queries: half of them were chosen in the input set, the remaining ones in its complement.

6.1 Construction Time

We measure construction time in terms of the number of distance computations and CPU time on uniformly distributed objects in [0,1]^10, as described above. Fig. 10a illustrates a comparison between the Antipole Tree, the MVP-Tree, and the M-Tree, showing the distances needed during the construction. Data were taken again in [0,1]^10 with size from 100,000 to 500,000 elements. The cluster radius used was σ = 0.625, as found by our estimation algorithm described below. We used the parameter settings for MVP-Trees and M-Trees suggested by the authors [9], [20]. Fig. 10a also shows that building the Antipole Tree requires fewer distance computations than the M-Tree but more than the MVP-Tree. The difference is roughly a factor of 1.5. Fig. 11 shows that the difference in construction costs can be compensated by faster range queries on less than 0.2 percent of the entire input database.
Fig. 10. (a) Construction complexity using uniformly generated data, measured by the number of distance computations needed by the Antipole Tree with cluster diameter 1.25 versus the M-Tree and MVP-Tree. (b) CPU time in seconds needed to build the Antipole Tree.

Thus, unless queries are very rare, the Antipole Tree recovers in query cost what it loses in construction. Experiments proving this fact are reported in Section 6.3.

Fig. 10b shows the CPU time needed to bulk load the proposed data structure; it also shows that the CPU time needed to construct the Antipole Tree grows linearly in many cases. Because the MVP-Tree entails sorting, it requires at least O(n log n) operations (though not distance calculations) to build the data structure.

6.2 Choosing the Best Cluster Diameter

In this section, we discuss how to tune the Antipole Tree for range queries. We measure the cost by the number of distance calculations among objects of the underlying metric space.

Before the Antipole data structure can be used, it needs to be tuned. To tune the Antipole Tree, we must choose the radius σ of the clusters very carefully by analyzing the data set properties. In what follows, we will show that the optimal cluster radius depends on the intrinsic dimensionality of the underlying metric space.

We performed, as described before, our experiments in 10 and 20-dimensional spaces with uniform and clustered distributions having size 100,000. However, the methodology for finding the optimal diameter can be applied to other dimensions and arbitrary data sizes.

Figs. 12 (Uniform) and (Clustered) show that, across different values of the threshold t of the range search, the best choice of the cluster diameter is 0.625 for the uniform data set and 2.5 for the clustered one.

Experiments with real and synthetic data showed that choosing the cluster diameter 10 percent less than the median pairwise distance value gives, regardless of the range search threshold, a quite surprising result.

6.3 Range Search Analysis and Comparisons

In this section, we present an extensive comparison among the Antipole Tree, the MVP-Tree, the M-Tree, and List of Clusters in terms of the number of distance computations for range queries. The number of distance computations required by each query has been estimated as the average value over a set of 100 queries. In order to perform a fair comparison with the three competing data structures, MVP-Tree, M-Tree, and List of Clusters, we have set their implementation parameters to the best values according to the ones suggested by the authors. For the MVP-Tree, in [9] it is shown that its best performance is achieved by setting the parameters in the following way:

1. Two vantage points in every internal node, v1 and v2.
2. m² = 4 partition classes: four children for each pair of vantage points.
3. k = 13, the maximum number of objects in a leaf node.
4. p unbounded, the size of the vector storing the distances between the objects in a leaf and their ancestors in the tree (the vantage points). Such a vector is used during the range search to discard objects without having to compute their distance from the query object. Notice that the higher the dimension of such a vector, the more distances from vantage points can be used to prune candidates, and this improves the performance of the MVP-Tree in terms of distance computations. For this reason, we have set this parameter to its maximum value: the height of the MVP-Tree.³

For the M-Tree implementation, we made use of the BulkLoading⁴ algorithm [20]. The two parameters needed to tune the data structure in order to obtain better performance are the minimum node utilization and the secondary memory page size. The best performance observed during the search was obtained with minimum node utilization 0.2 and page size 8K.

Fig. 11. Number of range queries, as a fraction of the data set size, which are sufficient to recover the higher cost of Antipole Tree construction with respect to MVP-Tree construction.

3. The authors are grateful to T. Bozkaya and M. Ozsoyoglu for providing them the program to generate the input for the clustered data set.
4. The authors are grateful to P. Ciaccia, M. Patella, and P. Zezula for providing them the source code of the M-Tree.
Fig. 12. Diameter tuning using uniformly and clustered generated points in dimensions 10 and 20, respectively.

Concerning List of Clusters, we used a fixed bucket size according to the heuristics p3 and p5 suggested by the authors in [16]. p3 consists in choosing the center of the ith cluster as the furthest element from the (i−1)th center, whereas p5 picks the element which maximizes the sum of the distances from the previous centers.

In the first experiment (Fig. 13), we compare the four data structures on a uniform data set taken from [0,1]^n with n = 10, varying the query threshold from 0.1 to 0.8, and using a data set of size 300,000. For the Antipole, we used two different cluster radii σ: 0.5 and 0.625, respectively. The Antipole Tree performs better than the other three data structures, computing fewer distances during the search. Notice that using a query threshold from 0.1 to 0.7, we capture in the output data set from 0% to 1% of the elements of the entire data set (0.8 captures 3% of the entire set). Fig. 14 shows that with query thresholds from 0.4 to 0.6, we save between 10 percent and 70 percent of the distance computations, which, in the figure, is indicated as the gain percentage.

The next set of experiments (see Fig. 15) was designed to compare the four data structures in different metric spaces: the clustered Euclidean space ℝ^20, a string space under an editing distance metric, and an image histogram space with an L2 distance metric. The corresponding data sets are: 100,000 clustered points, 45,000 strings from the Linux dictionary, and 42,000 image histograms from the Corel image database,⁵ respectively. Results show a 30 percent savings in distance computations.

Since List of Clusters reportedly works well in high dimension, in Fig. 16 we show a comparison of range search in very high-dimensional Euclidean spaces ℝ^147 and ℝ^267, with a database of size 3,000 obtained from the VISTEX [50] texture database. Notice that, using the query thresholds depicted in Fig. 16, the output set captures from 0 percent to 5 percent of the elements of the entire data set in ℝ^147 and from 0 percent to 10 percent of the elements of the entire data set in ℝ^267. The Antipole Tree shows a better behavior with regard to List of Clusters tuned with the best fixed bucket size we observed.

6.4 K-Nearest Neighbor Comparisons

In Fig. 17, we present a set of experiments in which the K_NEAREST_NEIGHBOR algorithm is compared with the M-Tree and the List of Clusters. Notice that we compared the Antipole Tree with just the M-Tree and List of Clusters because k-nearest neighbor search is not discussed for the MVP-Tree (see [9]). As described in Section 6.3, we chose uniform and clustered data in ℝ^10 and ℝ^20. Each data set has size 100,000. We ran the K_NEAREST_NEIGHBOR algorithm with k = 1, 2, 4, 6, 8, 10, 15, 20, using 100 queries for each experiment (half belonging to the data structure and half not). Using the Antipole Tree, we save up to 85 percent of the distance computations.

5. Obtained from the UCI Knowledge Discovery in Databases Archive, http://kdd.ics.uci.edu.

Fig. 13. Comparisons in ℝ^10 using 300,000 randomly generated vectors. The query threshold goes from 0.1 to 0.8.
Fig. 14. Each picture shows the number of distances computed by the compared data structures using thresholds from 0.4 to 0.6. The respective gain percentage (percentage of distances saved) of the Antipole Tree with regard to the MVP-Tree, the M-Tree, and the List of Clusters is also plotted.

Concerning experiments in very high dimension, in Fig. 18 we show a comparison with List of Clusters using a data set of 3,000 elements in Euclidean ℝ^147 and ℝ^267 from VISTEX [50]. The Antipole Tree clearly outperforms List of Clusters.

7 APPROXIMATE K-NEAREST NEIGHBOR SEARCH VIA ANTIPOLE TREE

When the dimension of the space becomes very high (say ≥ 50), all existing data structures perform poorly on range and k-nearest neighbor searches. This is due to the well-known problem of the curse of dimensionality [37]. Lower bounds [19] show that the search complexity grows exponentially with the space dimension. For generic metric spaces, following [17] and [18], we introduce the concept of intrinsic dimensionality:

Definition 7.1. Let (M, dist) be a metric space, and let S ⊆ M. The intrinsic dimension of S is

\[
\rho = \frac{\mu_S^2}{2\,\sigma_S^2},
\]

where μ_S and σ_S² are the mean and the variance of its histogram of distances.
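Definition 7.1 translates into a few lines of C; this sketch takes a precomputed sample of pairwise distances (how the sample is drawn is left to the caller and is our assumption):

```c
/* rho = mu^2 / (2 * sigma^2), computed from m sampled pairwise
 * distances d[0..m-1] of the set S (Definition 7.1). */
double intrinsic_dimension(const double *d, int m) {
    double mu = 0.0, var = 0.0;
    for (int i = 0; i < m; i++) mu += d[i];
    mu /= m;
    for (int i = 0; i < m; i++) var += (d[i] - mu) * (d[i] - mu);
    var /= m;
    return (mu * mu) / (2.0 * var);
}
```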
A promising approach to alleviate at least the curse of dimensionality is to consider approximate and probabilistic algorithms for k-nearest neighbor search. In some applications, such algorithms give acceptable results. Several interesting algorithms have been proposed in the literature [17], [21], [39], [28]. One of the most successful data structures seems to be Tree-Structured Vector Quantization (TSVQ). Here, we will show how to use the Antipole Tree to design a suitable approximate search algorithm for the nearest neighbor search. A first simple algorithm, called BEST_PATH_SEARCH, follows the best path in the tree from the root to a leaf, and returns the centroid stored in the leaf node. This algorithm uses the same strategy as TSVQ to quickly find an approximate nearest neighbor of a query object.

In what follows, we present a set of experiments where TSVQ and the Antipole Tree are compared. The experiments refer to uniformly generated objects in spaces whose dimension ranges from 10 to 50. For each input data set, 100 queries were executed. In order to evaluate the quality of the results, we run the exact search first. Then, the error ε is computed in the following way:

\[
\varepsilon = \frac{\lvert \mathrm{dist}(O_{\mathrm{opt}}, q) - \mathrm{dist}(O_{\mathrm{TSVQ/Antipole}}, q) \rvert}{\mathrm{dist}(O_{\mathrm{opt}}, q)} .
\]

In Fig. 19a, the errors introduced by the two approximate algorithms on uniformly generated sets of points (upper figures) and clustered sets of points (lower figures) are depicted. On the other hand, Figs. 19b and 19d show the number of distances computed by the two algorithms. The experiments clearly show that the Antipole Tree improves on TSVQ. We think that this is due to the better position of the Antipole pairs.

A more sophisticated approximation algorithm to solve the k-nearest neighbor problem can be obtained by using the K_NEAREST_NEIGHBOR algorithm. The idea is the following: for each cluster reached during the search, the algorithm compares the query object with the cluster centroid without taking into consideration the objects inside it. This search is slower than BEST_PATH_SEARCH, but it is more precise and can be used to perform k-nearest neighbor search. Fig. 20a shows a set of experiments done in uniform spaces in dimension 30 with radius σ set to 1 and 1.5.
Fig. 15. (top) Comparisons of Antipole Tree versus MVP-Tree, M-Tree, and List of Clusters in a clustered space from ℝ^20, varying the query threshold from 0.1 to 1, with cluster radius 2. (middle) Antipole Tree versus MVP-Tree, M-Tree, and List of Clusters using an editing distance metric with cluster radius 5. (bottom) Antipole Tree versus MVP-Tree, M-Tree, and List of Clusters using a set of image histograms with cluster radius 0.4.

Fig. 16. A comparison between Antipole Tree and List of Clusters using a real database in ℝ^147 (left) and ℝ^267 (right).

In approximate matching, precision and recall [38] are important metrics. Following [38], we call the k nearest neighbor elements of a query q the k golden results. Then, the recall after quota distances can be defined as the fraction of the k top golden elements retrieved while fixing a bound, called quota, on the number of distances that can be computed during the search. The precision is the number of golden elements retrieved over the number of distances computed. On the other hand, if the recall R is fixed (e.g., 50 percent), the R-precision (precision after R recalls) gives the number of distances which must be computed to obtain such a recall.

We performed a precision-recall analysis between the Antipole Tree and the approximate version of List of Clusters [12]. The experiments in Fig. 22 made use of 100,000 elements of dimension 30. We fixed several quotas and recalls, ranging from 7,000 to 42,000 and from 0.5 to 0.9, respectively. Results clearly show that the Antipole Tree gives precision-recall factors better than List of Clusters (with fixed bucket size).
Fig. 17. k-nearest neighbor comparisons. (a) 100,000 uniformly generated points in [0,1]^10. (b) 100,000 points from ℝ^20 generated in clusters. (c) Comparisons using the image histogram database.

Fig. 18. k-nearest neighbor search using real data from the VISTEX database in dimensions ℝ^147 and ℝ^267.

Fig. 21a makes the same comparison using the image histogram database; Fig. 21b illustrates the effect of the curse of dimensionality on the precision-recall factors for the Antipole Tree, using uniformly distributed objects in Euclidean spaces of dimension ranging from 30 to 50.

8 A COMPARISON WITH LINEAR SCAN

In this section, we present a set of experiments in which we compare the proposed data structure with a naive linear scan. We used a set of very high-dimensional Euclidean data sets. Such data sets were obtained from a set of textures taken from the VISTEX database [50]. Starting from a given texture, the data sets of tuples were built in the following way: for each pixel p in the texture, we considered, per color channel, half of its h × h neighborhood (see [51] for more details).
Fig. 19. A comparison between the approximate Antipole search and TSVQ search. (a) The average error introduced by the two algorithms on uniformly generated points with σ = 0.5, varying the space dimension from 10 to 50. (b) The number of distances computed. (c) The average error introduced using points generated in clusters of space dimension 20, varying the cluster radius σ. (d) The corresponding number of distances needed.

Fig. 20. An experiment with the approximate k-nearest neighbor algorithm in dimension 30. In (a), the average error is shown. (b) depicts the gain percentage in the number of distance computations.

Fig. 21. (a) Analysis of the curse of dimensionality using the Antipole Tree from dimension 30 to 50: the number of distances needed for a fixed recall. (b) Comparisons using the image histogram database between the Antipole Tree and List of Clusters with regard to approximate k-nearest neighbor: the recall varying the quota is depicted.

Fig. 22. Comparing Antipole Tree and List of Clusters with regard to approximate k-nearest neighbor. In (a), the recall varying the quota is depicted. In (b), the number of distance computations with fixed recall is shown.

Fig. 23. Comparing Antipole Tree and linear scan with regard to k-nearest neighbor (left side) and range search (right side) in ℝ^267 (top), ℝ^147 (middle), and ℝ^63 (bottom).
We obtained data sets of dimension ranging from 63 to 267. Results, which are plotted in Fig. 23, show that the proposed data structure outperforms the linear scan on such high-dimensional data sets. We have also noticed that the intrinsic dimension of these spaces goes from 5 to 10.

9 CONCLUSIONS

We extended the ideas of the most successful best-match retrieval data structures, such as the M-Tree, MVP-Tree, FQ-Tree, and List of Clusters, by using pivots based on the farthest pairs (Antipoles) in data sets. The resulting Antipole Tree is a bisector tree using pivot-based clustering with bounded diameter. Both range and k-nearest neighbor searches are performed by eliminating those clusters which cannot contain the result of the query. Antipoles are found by playing a linear-time randomized tournament among the elements of the input set.

Proliferation of clusters is limited by using a suitable diameter threshold, which is determined through a statistical analysis of the set of distances. Moreover, an estimate of the ratio between the pseudodiameter (Antipole length) and the real diameter is used to determine when a split is needed. Since no guaranteed approximation algorithm for diameter computation in general metric spaces can exist, we used the approximation ratio given by a very efficient algorithm for diameter computation in Euclidean spaces together with the intrinsic dimension of the given metric space (see the Appendix, which can be found on the Computer Society Digital Library at http://computer.org/tkde/archives.htm).

By using a tournament size equal to 3 or d + 1, where d is the intrinsic dimension of the metric space, we obtained good experimental results. However, we are currently investigating from a theoretical point of view how to determine an optimal value for the tournament size parameter. Extensive experiments have been performed on both synthetic and real data sets, with normal and clustered distributions. All the experiments have shown that our proposed structure outperforms the most successful data structures for best-match search by a factor ranging between 1.5 and 2.5.

ACKNOWLEDGMENTS

The authors are grateful to the anonymous reviewers for useful suggestions and comments.

REFERENCES

[1] P. Agarwal, J. Matousek, and S. Suri, "Farthest Neighbors, Maximum Spanning Trees, and Related Problems in Higher Dimensions," Computational Geometry: Theory and Applications, vol. 1, pp. 189-201, 1991.
[2] C. Aggarwal, J.L. Wolf, P.S. Yu, and M. Epelman, "Using Unbalanced Trees for Indexing Multidimensional Objects," Knowledge and Information Systems, vol. 1, no. 3, pp. 157-192, 1999.
[3] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," Proc. ACM SIGMOD, pp. 94-105, 1998.
[4] R. Baeza-Yates, W. Cunto, U. Manber, and S. Wu, "Proximity Matching Using Fixed-Queries Trees," Proc. Combinatorial Pattern Matching, Fifth Ann. Symp., pp. 198-212, 1994.
[5] G. Barequet and S. Har-Peled, "Efficiently Approximating the Minimum-Volume Bounding Box of a Point Set in Three Dimensions," Proc. 10th Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 82-91, 1999.
[6] S. Battiato, D. Cantone, D. Catalano, G. Cincotti, and M. Hofri, "An Efficient Algorithm for the Approximate Median Selection Problem," Proc. Fourth Italian Conf. Algorithms and Complexity, pp. 226-238, 2000.
[7] S. Berchtold, D.A. Keim, and H.-P. Kriegel, "The X-Tree: An Index Structure for High-Dimensional Data," Proc. 22nd Int'l Conf. Very Large Databases, pp. 28-39, 1996.
[8] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is Nearest Neighbor Meaningful?" Proc. Seventh Int'l Conf. Database Theory, vol. 1540, pp. 217-235, 1999.
[9] T. Bozkaya and M. Ozsoyoglu, "Indexing Large Metric Spaces for Similarity Search Queries," ACM Trans. Database Systems, vol. 24, no. 3, pp. 361-404, 1999.
[10] S. Brin, "Near Neighbor Search in Large Metric Spaces," Proc. 21st Int'l Conf. Very Large Data Bases, pp. 574-584, 1995.
[11] W.A. Burkhard and R.M. Keller, "Some Approaches to Best-Match File Searching," Comm. ACM, vol. 16, no. 4, pp. 230-236, 1973.
[12] B. Bustos and G. Navarro, "Probabilistic Proximity Searching Algorithms Based on Compact Partitions," Proc. Symp. String Processing and Information Retrieval, pp. 284-297, 2002.
[13] I. Calantari and G. McDonald, "A Data Structure and an Algorithm for the Nearest Point Problem," IEEE Trans. Software Eng., vol. 9, no. 5, pp. 631-634, 1983.
[14] D. Cantone, G. Cincotti, A. Ferro, and A. Pulvirenti, "An Efficient Algorithm for the 1-Median Problem," SIAM J. Optimization, to appear.
[15] T.M. Chan, "Approximating the Diameter, Width, Smallest Enclosing Cylinder, and Minimum-Width Annulus," Int'l J. Computational Geometry and Applications, vol. 12, nos. 1-2, pp. 67-85, 2002.
[16] E. Chávez and G. Navarro, "An Effective Clustering Algorithm to Index High Dimensional Metric Spaces," Proc. 11th Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 75-86, 2000.
[17] E. Chávez and G. Navarro, "A Probabilistic Spell for the Curse of Dimensionality," Proc. Third Workshop Algorithm Eng. and Experimentation (ALENEX '01), pp. 147-160, 2001.
[18] E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín, "Searching in Metric Spaces," ACM Computing Surveys, vol. 33, no. 3, pp. 273-321, 2001.
[19] B. Chazelle, "Computational Geometry: A Retrospective," Proc. 26th Ann. ACM Symp. Theory of Computing, pp. 75-94, May 1994.
[20] P. Ciaccia and M. Patella, "Bulk Loading the M-Tree," Proc. Ninth Australasian Database Conf. (ADC), pp. 15-26, 1998.
[21] P. Ciaccia and M. Patella, "PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-Dimensional and Metric Spaces," Proc. 16th Int'l Conf. Data Eng., pp. 244-255, 2000.
[22] P. Ciaccia, M. Patella, and P. Zezula, "M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces," Proc. 23rd Int'l Conf. Very Large Data Bases, pp. 426-435, 1997.
[23] K. Clarkson, "Nearest Neighbor Queries in Metric Spaces," Proc. 29th Ann. ACM Symp. Theory of Computing, pp. 609-617, May 1997.
[24] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. Second Int'l Conf. Knowledge Discovery in Databases and Data Mining, pp. 226-231, 1996.
[25] T. Feder and D. Greene, "Optimal Algorithms for Approximate Clustering," Proc. 20th Ann. ACM Symp. Theory of Computing, pp. 434-444, 1988.
[26] A.W.-C. Fu, P.M. Chan, Y.-L. Cheung, and Y. Moon, "Dynamic VP-Tree Indexing for n-Nearest Neighbor Search Given Pair-Wise Distances," The VLDB J., vol. 9, no. 2, pp. 154-173, 2000.
[27] V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French, "Clustering Large Datasets in Arbitrary Metric Spaces," Proc. IEEE 15th Int'l Conf. Data Eng., pp. 502-511, 1999.
[28] A. Gersho and R. Gray, Vector Quantization and Signal Compression. Kluwer Academic, 1992.
[29] A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions via Hashing," Proc. 25th Int'l Conf. Very Large Data Bases, pp. 518-529, 1999.
[30] T. Gonzalez, "Clustering to Minimize the Maximum Intercluster Distance," Theoretical Computer Science, vol. 38, pp. 293-306, 1985.
[31] T. Gonzalez, "Covering a Set of Points in Multidimensional Space," Information Processing Letters, vol. 40, pp. 181-188, 1991.
[32] S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Proc. ACM SIGMOD, pp. 73-84, 1998.
[33] S. Har-Peled, "A Practical Approach for Computing the Diameter of a Point Set," Proc. 17th Symp. Computational Geometry, pp. 177-186, 2001.
[34] G.R. Hjaltason and H. Samet, "Distance Browsing in Spatial Databases," ACM Trans. Database Systems, vol. 24, no. 2, pp. 265-318, 1999.
[35] D.S. Hochbaum and W. Maass, "Approximation Schemes for Covering and Packing Problems in Image Processing and VLSI," J. ACM, vol. 32, no. 1, pp. 130-136, 1985.
[36] P. Indyk, "Sublinear Time Algorithms for Metric Space Problems," Proc. 31st Ann. ACM Symp. Theory of Computing, pp. 428-434, 1999.
[37] P. Indyk and R. Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality," Proc. 30th Ann. ACM Symp. Theory of Computing, pp. 604-613, 1998.
[38] C. Li, E. Chang, H. Garcia-Molina, and G. Wiederhold, "Clustering for Approximate Similarity Search in High-Dimensional Spaces," IEEE Trans. Knowledge and Data Eng., vol. 14, no. 4, pp. 792-808, July-Aug. 2002.
[39] T.M. Mitchell, Machine Learning. McGraw-Hill, 1997.
[40] G. Navarro, "Searching in Metric Spaces by Spatial Approximation," The VLDB J., vol. 11, pp. 28-46, 2002.
[41] R. Ng and J. Han, "CLARANS: A Method for Clustering Objects for Spatial Data Mining," IEEE Trans. Knowledge and Data Eng., vol. 14, no. 5, pp. 1003-1016, Sept./Oct. 2002.
[42] H. Noltemeier, K. Verbarg, and C. Zirkelbach, "Monotonous Bisector* Trees—A Tool for Efficient Partitioning of Complex Scenes of Geometric Objects," Data Structures and Efficient Algorithms, Lecture Notes in Computer Science, vol. 594, pp. 186-203, Springer-Verlag, 1992.
[43] C. Procopiuc, "Geometric Techniques for Clustering: Theory and Practice," PhD dissertation, Duke Univ., 2001.
[44] M. Shapiro, "The Choice of Reference Points in Best-Match File Searching," Comm. ACM, vol. 20, no. 5, pp. 339-343, 1977.
[45] D. Shasha and T.-L. Wang, "New Techniques for Best-Match Retrieval," ACM Trans. Information Systems, vol. 8, no. 2, pp. 140-158, 1990.
[46] G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: A Wavelet Based Clustering Approach for Spatial Data in Very Large Databases," The VLDB J., vol. 8, nos. 3-4, pp. 289-304, 2000.
[47] S. Sumanasekara and M. Ramakrishna, "CHILMA: An Efficient High Dimensional Indexing Structure for Image Databases," Proc. First IEEE Pacific-Rim Conf. Multimedia, pp. 76-79, 2000.
[48] C. Traina Jr., A. Traina, B. Seeger, and C. Faloutsos, "Slim-Trees: High Performance Metric Trees Minimizing Overlap between Nodes," Proc. Seventh Int'l Conf. Extending Database Technology, vol. 1777, pp. 51-65, 2000.
[49] J. Uhlmann, "Satisfying General Proximity/Similarity Queries with Metric Trees," Information Processing Letters, vol. 40, pp. 175-179, 1991.
[50] VisTex, "Texture Synthesis: VisTex Texture," http://graphics.stanford.edu/projects/texture/demo/synthesis_VisTex_192.html, 2004.
[51] L. Wei and M. Levoy, "Texture Synthesis over Arbitrary Manifold Surfaces," Proc. ACM SIGGRAPH '01, pp. 355-360, 2001.
[52] P. Yianilos, "Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces," Proc. Fourth Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 311-321, Jan. 1993.
[53] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proc. ACM SIGMOD Conf. Management of Data, pp. 103-114, 1996.

Domenico Cantone received the BS degree in mathematics from the University of Catania, Italy, in 1982. In 1985 and 1987, he received the MS and PhD degrees in computer science from New York University. In 1990, he became a professor of computer science at the University of L'Aquila, Italy. In 1991, he moved to the University of Catania, where, since 2002, he has been the director of the graduate studies in computer science. Since 1995, he has also been a member of the Board of Directors of the journal Le Matematiche. His main scientific interests include computable set theory, automated deduction in various mathematical theories, and, more recently, string matching and algorithmic engineering. In the field of computable set theory, he has coauthored two monographs, published in 1989 and 2001, and is currently working on a third, to be published in 2005.

Alfredo Ferro received the BS degree in mathematics from Catania University, Italy, in 1973 and the PhD degree in computer science from New York University (NYU) in 1981, earning the Jay Krakauer Award for the best dissertation in the field of sciences at NYU. He is currently a professor of computer science at Catania University and has been the director of graduate studies in computer science for several years. Since 1989, he has been the director of the International School for Computer Science Researchers (Lipari School, http://lipari.cs.unict.it). Together with Michele Purrello, he is the director of the International School in BioMedicine and BioInformatics (http://lipari.cs.unict.it/bio-info/). His research interests include bioinformatics, algorithms for the management of large data sets, data mining, computational logic, and networking.

Alfredo Pulvirenti received the BS degree in computer science from Catania University, Italy, in 1999 and the PhD degree in computer science from Catania University in 2003. He currently holds a postdoctoral position in the Department of Computer Science at Catania University. His research interests include bioinformatics, data structures, approximation algorithms, structured databases, information retrieval, graph theory, and networking.

Diego Reforgiato Recupero received the BS degree in computer science in 2001 from the University of Catania, Italy, where he is currently a PhD candidate in computer science. His research interests include database systems and information technologies. In particular, he has been working on clustering data in high-dimensional spaces, graph clustering, and the graph matching problem.

Dennis Shasha is a professor of computer science in the Courant Institute at New York University, where he works with biologists on
pattern discovery for microarrays, combinatorial design, and network inference; and with physicists and financial people on algorithms for time series. Other areas of interest include database tuning, tree and graph matching, and cryptographic file systems. In his spare time, he has written three books of puzzles, a biography of great computer scientists, and technical books on database tuning, biological pattern recognition, and time series. He also writes the puzzle column for Scientific American and Dr. Dobb's Journal.