
Near Neighbor Search in Large Metric Spaces

Sergey Brin
Department of Computer Science
Stanford University
February 27, 1995

Abstract

Given user data, one often wants to find approximate matches in a large database. A good example of such a task is finding images similar to a given image in a large collection of images. We focus on the important and technically difficult case where each data element is high dimensional or, more generally, is represented by a point in a large metric space [1], and distance calculations are computationally expensive.

In this paper we introduce a data structure to solve this problem, called a GNAT (Geometric Near-neighbor Access Tree). It is based on the philosophy that the data structure should act as a hierarchical geometrical model of the data, as opposed to a simple decomposition of the data which does not use its intrinsic geometry. In experiments, we find that GNAT's outperform previous data structures in a number of applications.

Keywords: near neighbor, metric space, approximate queries, data mining, Dirichlet domains, Voronoi regions.

* Supported by a Fellowship from the NSF.

[1] By a large metric space we mean a space such that the volume of a ball grows very rapidly as its radius increases. High dimensional vector spaces are an example (a ball of radius 2 in a 20-dimensional Euclidean space is over a million times larger than a ball of radius 1). See Section 3.

1 Introduction

The problem of finding the near neighbors of a given point in a large data set has been studied well and has a number of good solutions, if the data is in a simple (e.g., Euclidean), low-dimensional space. However, if the data lies in a large metric space [1], the problem becomes much more difficult. Consider the following examples as a small sample of where this problem occurs:

Information Retrieval: Finding sentences similar to a user's query from a given database of sentences.

Genetics: Finding similar DNA or protein sequences in one of a number of large genetics databases.

Speaker Recognition: Finding similar vocal patterns (e.g., under Fourier transforms) from a database of vocal patterns.

Image Recognition: Finding images similar (using the Hausdorff metric [HKR93]) to a given one from a large image library.

Video Compression: Finding the image blocks of a previous frame that are similar to blocks in a new frame (using a simple L1 or L2 metric, possibly after a DCT transform) to generate motion vectors in MPEG video compression.

Data Mining: Finding approximate time series matches (e.g., stock histories or year-long temperatures).

All of the examples above fit into the model of finding near neighbors in a large metric space. In particular, the first two examples find near neighbors in the metric space of strings under some edit distance function. The speaker, video, and possibly the data mining examples find near neighbors in a high dimensional vector space under the L1 or L2 metric (more sophisticated metrics could be imagined). The image recognition example does not fit as nicely into either of these classes but still qualifies as near-neighbor search in a large metric space.

Every data type above has some degree of correlation in its distribution. While it may be small (i.e., the data resembles random vectors), it must be exploited to get good performance in a near neighbor
search. To do this in an application independent manner requires that the data structure capture the intrinsic geometry of the data. As we will see (Section 4), our data structure, the GNAT, captures the geometry of data collections such as the ones mentioned above by hierarchically breaking them down into regions which try to preserve fundamental geometric structure.

2 Related Work

A very large amount of work has been done to solve specific instances of near-neighbor finding problems. Numerous articles have been written regarding finding similar vectors (e.g., time-series and geographic data), text (files and documents), images, sounds (word recognition), etc. A more limited but still substantial amount of work has addressed the general problem [2]. This work has mostly fallen into two categories. In one category, we assume that distance calculations are so expensive that even an O(n) or O(n log n) search algorithm is acceptable as long as it reduces the number of distance calculations. This is the case as long as the database size is fairly small compared to the range of the search [FS82], or if preprocessing is not allowed and only arbitrary precomputed distances are given [SW90].

[2] Some of the papers we mention below address the problem of finding nearest neighbors. However, their methods can be applied to finding all near neighbors with minimal change.

The other category of solutions is hierarchical; these typically have an O(log n) query time given a sufficiently small range (typically too small to be practical). They are of the following form: the space is broken up hierarchically. At the top node, one or several data points are chosen. Then the distance between each of these and each of the remaining points is computed. Based on these distances, the points are separated into two or several different branches. For each branch, the structure is constructed recursively.

J. K. Uhlmann outlined the foundation for two different methods, generally described as metric trees [Uhl91]. One of these methods, subsequently called vp-trees [3], was implemented by P. N. Yianilos [Yia93]. The basic construction of a vp-tree is to break the space up using spherical cuts. To build it, pick a point in the data set (this is called the vantage point, hence the name vp-tree). Now, consider the median sphere centered at the vantage point, with a radius such that half the remaining points fall inside it and half fall outside. For every other point, put it in one branch if it is inside the sphere and in another branch if it is outside the sphere. Now, recursively construct the lower level branches.

[3] We do not look at the enhancement of vp-trees called vpsb-trees.

This approach has the benefits of requiring only one distance calculation per node and automatically creating balanced trees. However, it suffers from the regions inside and outside the median sphere being very asymmetric, especially in a high-dimensional space. Since volume grows rapidly as the radius of a sphere increases, the outside of the sphere will tend to be very thin, given that there are as many points on the inside as on the outside, thus worsening search performance. In our work, we try to avoid such asymmetries. While the limited branching factor of 2 can also be viewed as a weakness, we have conducted experiments with higher degree variations of vp-trees and find little improvement in performance.

The other method, the generalized hyperplane tree (gh-tree), is constructed as follows. At the top node, pick two points. Then, divide the remaining points based on which of these two they are closer to. Now, recursively build both branches. This method is an improvement in that it is symmetric and the tree structure still tends to be well balanced (assuming sufficiently random selection of the two points). However, it has a weakness in that it requires two computations at every node and is limited to a branching factor of two.

A variation of gh-trees was implemented at ETH Zurich [BFR+93] as monotonous bisector trees (MBT's) to deal specifically with text. However, nothing in the method would have prevented them from dealing with arbitrary metric spaces. The key difference between MBT's and gh-trees is that MBT's only select one new point at each new node. They do this by reusing the point they are associated with in the parent node. As a result, MBT's overcome the first weakness, but the branching factor remains a problem.

The most relevant works, however, are also the oldest. Burkhard and Keller suggested several data structures (and algorithms) [BK73] for approximate search. The first is very similar to vp-trees except that it requires a finite number of discrete distance values. Essentially, for every vantage point, a separate branch is allocated for every possible distance value. This method, however, suffers from the same asymmetry problem as the vp-trees. The other two data structures, which are the closest to the GNAT, break up the space into a number of balls, storing the radii and centers. More specifically, divide the data points into groups using some method (this was left as a parameter). Pick a representative of each group and call it the center of the group. Then, calculate the radius (the maximal distance to another point) from the center for each group; pruning is performed based on these radii. Recursion is briefly mentioned but not analysed. The third method, an enhancement of the second, additionally requires that the diameter (the maximal distance between any two points) of the points in any group be less than a constant, k, and the group is then called a clique. In this case a minimal subset of the set of all maximal cliques is used as the set of all groups. These two schemes act as reasonably good models of the data space they store, and if extended to a hierarchical structure, they have an arbitrary branching factor. However, they have several weaknesses. First, they do not work well with nonhomogeneous data, since we could easily end up with a lot of cliques containing only one point and several cliques containing very many points. Additionally, distance computations are not fully exploited, in that the distance to the center of one clique is not used to prune other cliques. Finally, while we do not focus on the cost of preprocessing in this paper, this cost was reported to be extremely high in the third method.

K. Fukunaga and P. Narendra worked out a very similar scheme, which requires more than just a metric space, to create a tree structure with an arbitrary branching factor in 1975 [FN75], as follows. Divide the data points into k groups. (How this is done is left as a parameter of the structure, but in tests they used a clustering algorithm which requires more than just a metric space.) Then compute the mean [4] of each group (once again a departure from a metric space) and the farthest distance from that mean to a point in the group. Then recursively create the structure for each group. While this method tends to have nice symmetric properties (given a reasonable clustering algorithm) that reflect the space, and it has an arbitrary branching factor, it has several weaknesses. First, it relies on more than just a metric space; second, it requires many distance computations at each node and does not use them fully; and third, it does not deal effectively with balancing.

[4] The mean of a set of points (vectors) in a vector space is simply their sum divided by their number. The concept of a mean is not meaningful for arbitrary metric spaces.

In this paper we present GNAT's, which can be viewed as both a generalization of Fukunaga's method and a generalization of gh-trees. GNAT's provide good query performance by exploiting the geometry of the data. Unfortunately, while query time is reduced, the build time of the data structure is sharply increased. However, if the application is query dominant (or even if there are roughly as many queries as data points), the relative cost of building a GNAT becomes negligible. In tests, we find that GNAT's almost always perform better than both vp-trees and gh-trees, and scale better.

3 Large Metric Spaces

A metric space is a set X with a distance function d : X × X → R such that, for all x, y, z ∈ X:

1. d(x, y) ≥ 0, and d(x, y) = 0 iff x = y. (Positivity)

2. d(x, y) = d(y, x). (Symmetry)

3. d(x, y) + d(y, z) ≥ d(x, z). (Triangle Inequality)

Since we are dealing with arbitrary metric spaces, we assume the following model of computation: there is a large number of data points and a "black box" to compute the distance between them.

The first important observation is that it is impossible to deal efficiently with all metric spaces. In particular, consider the metric space where the distance between two points is 0 if they are the same and 1 if they are different. Then our only option in finding a query point is a linear search, and no fancy data structure will save us. In fact, the more any space resembles such a metric space, the more difficult it will be to search.

Furthermore, the distribution of data in the metric space is more important than the metric space itself. If the data lies on a two-dimensional surface that is embedded in a 50-dimensional space, query times will behave more like those of a two-dimensional space than those of a 50-dimensional space, given an intelligent data structure. In a sense, for a high-dimensional space, the data determines the "geometry" of the space more than the constraints of the space itself.

Since visualizing high-dimensional data is difficult, we look at some simple measures to help us understand the geometry of a given data space. A particularly useful measurement is the distribution of distances between points in the space. While the scales of these distributions vary greatly, we can compare them by considering at what range we would be interested in finding near neighbors. In each of the graphs of the distributions that follow, we have made 5000 random distance calculations in the data space and distributed them into a number of buckets. The y axis represents the number of distances which fell into that bucket divided by the size of the bucket.

The distributions of distances between random, uniformly chosen vectors in 20 and 50 dimensional hypercubes of side 1 under the L1 metric [5] are approximately Gaussian distributions because of the Central Limit Theorem (Figure 1). For the L2 metric, we obtain a Gaussian-like (though not exactly Gaussian) limit distribution (Figure 2). Note that the distributions for 50 dimensions should be viewed in relation to their larger ranges and hence are really quite narrow. The fact that the peaks are narrow indicates that the distance function has low entropy and that it may be difficult to index the data, since arbitrary distance measurements will provide us with little information. However, by wisely choosing the distance computations, we can greatly improve the efficiency of the system.

[5] Recall that the L1 metric is the sum of the absolute values of the differences of corresponding vector dimensions, and the L2, or Euclidean, metric is the square root of the sum of the squares.

Figure 1: Distribution of distances under L1 metric in 20 and 50 dimensions.

Figure 2: Distribution of distances under L2 metric in 20 and 50 dimensions.

Correlated data has somewhat different properties and tends to have a much flatter distance distribution. For example, taking random 16 by 16 or 50 by 50 blocks from an image, treating them as 256 or 2500 dimensional vectors respectively, and taking the L2 distances between them creates a distribution with two major maxima (Figure 3). The first maximum, near 0, indicates a great deal of clustering in the data, since small distances are so probable; the second major maximum is one that is common to all large metric spaces we will investigate, which indicates that average distances are fairly likely.

Figure 3: Distribution of distances under L2 metric in 16 by 16 and 50 by 50 images.

As another example, consider taking lines of text from a large text document ("A Tale of Two Cities" in this example) and using a simple edit distance function. We considered two different such functions. Both counted the minimum number of operations needed to get from one line to the other. The first distance function, InsDel, allowed only inserts and deletes of single characters as operations. The second, Edit, added the operation of replacing
one character, and is consistent with what is usually used to compute distance between strings. Both distance functions have the unfortunate property of requiring O(n^2) steps to compute, where n is the number of characters; hence minimizing such calculations is very important. When we map out their distributions (Figure 4), both turn out to be considerably less correlated than the images and look roughly like the uniformly chosen vectors (Figures 1 and 2).

Figure 4: Distribution of edit distances in lines of text.

4 GNAT's

Our goal when designing GNAT's was to have a data structure that reflected the intrinsic geometry of the underlying data. More specifically, the top node of our hierarchical structure should give us a very brief summary of the data as a metric space, and as we progress down the hierarchy we get a more and more accurate sense of the geometry of the data. This is achieved as a hierarchical, Dirichlet-domain-based structure [6]. Given a number of points, the Dirichlet domain of one of those points is all possible points in the space which are closest to that point. At the top node, several distinguished split points are chosen and the space is broken up into Dirichlet domains. The remaining points are classified into groups depending on what Dirichlet domain they fall in. Each group is then structured recursively. A simple example of such a structure is illustrated in Figure 5. The heavier points represent the split points of the top node and finer points are split points of the subnodes. The thick lines represent the boundaries of the regions of space that belong to the top level split points, and the thin lines are those of the low level split points.

[6] Computer scientists may know these better as the cells of a Voronoi diagram, but these are more generally known as Dirichlet domains.

Figure 5: A Simple GNAT

Another goal of GNAT's is to make full use of the distances we calculate (at least within one node). Therefore, instead of relying on the Dirichlet structure to perform our search, we also prune branches by storing the ranges of distance values from split points to the data points associated with other split points during the build. Then, if query points fall outside these ranges by more than the range of the search, we can prune the split point and everything under it from the search. For example, suppose we know that all points in the region associated with split point q have distance between 1 and 2 from split point p, and we want to search for points within a radius of 0.5 from a query point x. Then if x is more than 2.5 from p, we can apply the triangle inequality (to the points x, y, and p, where y is any point in the region of q) and safely prune the node associated with q.

4.1 A Simplified Algorithm

The basic GNAT construction is as follows:

1. Choose k split points, p1, ..., pk, from the dataset we wish to index (this number, known as the degree, can vary throughout the tree; see Section 4.3). They are chosen randomly, but we make sure they are fairly far apart (see Section 4.2).

2. Associate each of the remaining points in the dataset with the closest split point. Let the set of
points associated with a split point pi be denoted Di.

3. For each pair of split points (pi, pj), calculate range(pi, pj) = [min_ij, max_ij], the minimum and maximum of Dist(pi, x), where x ∈ Dj or x = pj.

4. Recursively build the tree for each of the Di's, possibly using a different degree (see Section 4.3).

A search in a GNAT is performed recursively as follows:

1. Assume we want to find all points with distance ≤ r to a given point x. Let P represent the set of split points of the current node (initially the top node of the GNAT) which possibly contain a near neighbor of x. Initially, P contains all the split points of the current node.

2. Pick a point p in P (never the same one twice). Compute the distance Dist(x, p). If it is less than or equal to r, add p to the result.

3. For all points q ∈ P, if [Dist(x, p) - r, Dist(x, p) + r] ∩ range(p, q) is empty, then remove q from P. We can do this for the following reason. Let y be any point in the region associated with q. If Dist(y, p) < Dist(x, p) - r, then, by the triangle inequality, we have Dist(x, y) + Dist(y, p) ≥ Dist(x, p) and hence Dist(x, y) > r. Alternatively, if Dist(y, p) > Dist(x, p) + r, we can use the triangle inequality, Dist(y, x) + Dist(x, p) ≥ Dist(y, p), to deduce Dist(x, y) > r. Dist(y, p) cannot fall into the range [Dist(x, p) - r, Dist(x, p) + r], because then range(p, q) would intersect it.

4. Repeat steps 2 and 3 until all remaining points in P have been tried.

5. For all remaining pi ∈ P, recursively search Di.

4.2 Selecting Split Points

One of the issues we had to deal with was the selection of split points (step 1 in the construction). We wanted split points to be more or less random, since this way they are likely to be near the centers of clusters. However, we did not want the split points to bunch up, since if they did, distance calculations from them would not have as much information content (distances measured from one member of a bunch would be similar to those of another) and they would not model the geometry of the data well. Furthermore, if several split points were in the same cluster, they would likely divide the cluster at too high a level.

The strategy we settled on is to sample about 3 times the number of split points we wanted, and then pick those that were the farthest apart (according to a greedy algorithm). The number 3 was arrived at empirically. More specifically, this is done as follows: pick one of the sample points at random. Then pick the sample point which is farthest away from this one. Then pick the sample point which is the farthest from these two (by farthest we mean that the minimum distance to the two is the greatest). Then pick the one farthest from these three, and so on, until there are as many split points as desired. A simple dynamic programming type of algorithm can do this in O(nm) time, where n is the number of sample points and m is the eventual number of split points.

4.3 Choosing the Degree of a Node and Balancing the Tree

For our initial experiments, we chose a degree and kept it constant over all the nodes in a tree. This worked reasonably well with uncorrelated data. However, for correlated data, it sometimes unbalanced the tree substantially. Attempts to balance the size of branches (by weighting distances to split points) tended to harm performance more than they helped. So we considered "balancing" the tree by assigning higher degrees to those nodes that contained too many data points. This is done as follows.

The top node is allocated a degree k. Then each of its children is allocated a degree proportional to the number of data points it contains (with a certain maximum and minimum), so that the average is the global degree of k. This process works recursively, so that the children of each node have average degree k (ignoring the maximum and minimum degrees). The minimum used in tests was 2 (or the number of points contained, if that was smaller) and the maximum was 5k or 200, whichever was smaller.

4.4 Space and Time Complexity

Unfortunately, GNAT's are sufficiently complex that they are unwieldy to formal analysis. Therefore the results we present here are weak and limited. We use the following notation:

- N: the number of data points.

- n: the number of nodes. Empty nodes are not counted.

- s: the maximum size (the amount of memory) needed to store a data point.

- d: the maximum degree of a node.

- k: the average degree (equal to N/n).

- k2: the second moment (the average of the squares) of the degree.

- l: the average depth of a point.

We produce the following simple results:

Space (memory): A GNAT takes O(N k2 + N s) space. In practice, k2 does not turn out to be much higher than k^2.

Preprocessing Time: This is the main disadvantage of GNAT's as compared to other data structures. In a perfect scenario, when the tree is completely balanced, we get a preprocessing time of O(N k log_k N) distance calculations, which is a factor of k / log k more than binary trees, which require only one distance calculation per node. In the real world, this can be substantially more due to the fluctuating degrees, and can be bounded only by O(N l d) distance calculations. In tests, l tended to be very close to, though slightly more than, log_k N, and certainly not all nodes were of degree d, so the real preprocessing time lies somewhere between those two extremes.

Query Time: This is the most important and most difficult performance attribute to evaluate. While there have been upper bounds in previous works, they have either been restricted to particular domains or have made assumptions which make them of no practical use. As a result of this and the added complexity of GNAT's, we rely on the experimental results (see Section 6).

5 Implementation

The system used for testing GNAT's and other data structures underwent several major revisions, being implemented in Mathematica, then C, and finally C++. In each of these versions, the benefit of having the data structure rely only on the distance function was a tremendous advantage.

In the final version, the code to handle vectors and text (including generation and/or loading and distance functions) is under a hundred lines each. The code for images is just over a hundred lines, mainly to deal with the file format.

5.1 Data Types Supported

The data types (metric spaces) with which the system works are as follows:

Vectors: The simplest of the data types. These are N-dimensional vectors from a hypercube of side 1 and can be chosen in two different ways: uniformly from R^n, or uniformly from R^2 and then mapped into R^n using a simple continuous function. Both the L1 and L2 metrics are supported.

Image Blocks: These are 16 by 16 blocks chosen randomly from a grayscale image. Two images were used in tests, and they produced very similar results despite being very different: a digitized version of "A Day in the Park" by Seurat and a picture of an SR71 Blackbird. Distance between image blocks is computed very simply (since this is how current MPEG motion vector estimation schemes do it), by considering the blocks as 256 dimensional vectors and using the L1 or L2 distance. Given that the code can handle arbitrary sized blocks, a few experiments were run on 50 by 50 pixel blocks, since they have some interesting clustering properties.

Lines of Text: While the vectors and images both fit into vector spaces, lines of text do not. These were lines taken from a text document. Two different documents were attempted: all five acts of "Hamlet", which unfortunately totaled only about 4000 lines, and Dickens' "A Tale of Two Cities". For Dickens, this was a less meaningful test, since taking lines out of a novel is not particularly reasonable (sentences would have been better, but they were too long to deal with). Both texts were processed before being read by normalizing whitespace and getting rid of very short lines. Additionally, in Hamlet, speaker names were stripped. Results for these two were not as similar as one might expect (see Section 6). Both the InsDel and the Edit distance functions (see Section 3) were implemented and tested.

5.2 Data Structures Supported

The final implementation supports a number of data structures, including GNAT's [7].

[7] Both vpk-trees and opt-trees were developed during this project (though there has been previous work similar to opt-trees).

VP-Trees: See Section 2 for a brief description. In the tests presented here, we did not use any
sampling technique to chose vantage points since Distance Calculations
we could not be sure that we would do it iden- 3000
tically to Yia93]. However, some limited tests vp -tree
2500 vp24-tree
with sampling indicated that savings were in the gh-tree
10% range for images and were negligible for text 2000 mytree 2
and random vectors. 1500 mytree
mytree20
10

VP -Trees A generalization of VP-Trees which dif-


k mytree50
1000 mytree
fers in that at each node, instead of the remain- 100
ing data points being split into two halves based 500 opt-tree
on their distance from the vantage point, they 0
are split into k sections of equal size (also based 0 0.1 0.2 0.3 0.4 0.5
on distance from the vantage point). These Query Range
were found to perform very similarly to vp-trees
(sometimes a little better even) but there was not
a suciently large di erence to warrant further Figure 6: Varying Query Range for 3000 Vectors
investigation.
GH-Trees { See Section 2 for a brief description. Distance Calculations
20000
They are essentially GNAT's of constant degree 2 18000 vp2 -tree
without the sampling for split points and degree 16000 vp4 -tree
variation throughout the tree. Since they per- gh-tree
14000 mytree
form worse than GNAT's of degree 2, not many 12000 mytree102
experiments were performed. 10000 mytree20
8000 mytree50
OPT-Trees { These use a much smaller number of 6000 mytree100
distance computations for queries than any other 4000 opt-tree
structure but they lose out by having far more 2000
costly other computations (even superlinear in 0
the number of data points). The idea here is 0 0.1 0.2 0.3 0.4 0.5
to pick a number of vantage points. Measure Query Range
the distances from each to all the other points
and store these in a table. When a query comes
along, measure its distance to the rst vantage Figure 7: Varying Query Range for 20000 Vectors
point and based on that weed out all of the im-
possible data points. Then take the next vantage far the number which could be run (in a reasonable
point and do the same. Continue until no more amount of time) which in turn exceeded by far the
data points are pruned. Then, check each of the number of test results presented in this section. Note
remaining ones individually. Up to the choice of that the benets of GNAT's varied and in this sec-
vantage points this gives more or less the optimal tion we try to present the range of results that were
performance, in terms of distance calculations. obtained.
This structure can serve as a lower bound for Also, while opt-tree plots are in some of the graphs,
distance calculations but is not a realistic goal it is important to keep in mind that these structures
to shoot for if one wants a scalable structure. are only ecient in terms of distance calculations but
GNAT's { This is the main data structure of this are very inecient otherwise. They are provided just
paper, described in Section 4. to serve as a lower bound.
For each of these, we check how many distance cal-
culations are used to both build the structure and 6.1 Random Vectors
perform queries. A number of tests were performed on random vectors.
The dimension was set at 50 and uniformly chosen
6 Tests and Results vectors were produced and the L2 metric was used..
The range was varied and a number of di erent data
Given the exibility of the testbed system, the num- structures were tested. The number of data points
ber of possible interesting tests to run exceeded by was 3000 (Figure 6) in one test and 20000 (Figure 7)

Figure 8: Varying Query Range for 3000 Lines of Hamlet Using the InsDel Distance

Figure 9: Varying Query Range for 10000 Lines of Dickens Using the Edit Distance
in another. The number of test queries used in every case was 100.

The first thing to note is how difficult it actually is to perform these searches. All of the data structures seemed to struggle with ranges above 0.3 (reasonable queries could easily have ranges considerably above 0.5), looking at more than 50% of the data points in many cases. This is caused by the low information content of the distance calculations, since they tend to return very similar numbers (Figure 2).

Despite the difficulty that all these methods have, high-degree GNAT's come out far ahead. In particular, the GNAT's of degrees 50 and 100 had more than a factor of 3 improvement over vp-trees in many cases.
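The struggle with even modest query ranges can be seen directly: in high dimension, pairwise distances between uniform random vectors concentrate tightly around their mean, so a query ball prunes little. The sketch below (illustrative parameters only, not the paper's testbed; the helper name `l2` is ours) estimates that spread:

```python
import math
import random

def l2(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

random.seed(0)
dim, n = 50, 100
points = [[random.random() for _ in range(dim)] for _ in range(n)]

# Collect all pairwise distances and measure how spread out they are
# relative to their mean.
dists = [l2(points[i], points[j])
         for i in range(n) for j in range(i + 1, n)]
mean = sum(dists) / len(dists)
spread = (max(dists) - min(dists)) / mean

print(f"mean distance  : {mean:.3f}")
print(f"relative spread: {spread:.3f}")
```

The relative spread comes out well below 1, which is why a range much above 0.3 (after normalization) already captures a large fraction of the data.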

Figure 10: Performance for 3000 16 by 16 images

6.2 Text
We ran a number of experiments with text, with varied success. When testing 3000 lines of Hamlet using the InsDel distance, speedups over vp-trees were in the factor-of-2 range (Figure 8). Testing on 10000 lines of "A Tale of Two Cities" with the Edit distance yielded much more dramatic yet varied speedups, ranging from around 50% with a query range of 10 to more than a factor of 6 with a range of 2 (Figure 9).
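Both string metrics here are instances of the classical dynamic-programming distance. The sketch below assumes unit costs and models the InsDel distance as edit distance with substitutions disallowed, so a mismatch costs one deletion plus one insertion; the function name and flag are ours, not the paper's:

```python
def edit_distance(s, t, allow_substitution=True):
    """Dynamic-programming string distance with unit costs.

    With allow_substitution=True this is the usual Edit (Levenshtein)
    distance; with False, only insertions and deletions are counted,
    which is how we model an InsDel-style distance here.
    """
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # row of distances from s[:0] to each t[:j]
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            # Cost of resolving a mismatch in place: one substitution,
            # or (with substitutions disallowed) a delete plus an insert.
            sub = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else
                                 1 if allow_substitution else 2)
            cur[j] = min(prev[j] + 1,     # delete s[i-1]
                         cur[j - 1] + 1,  # insert t[j-1]
                         sub)
        prev = cur
    return prev[n]

print(edit_distance("kitten", "sitting"))                            # 3
print(edit_distance("kitten", "sitting", allow_substitution=False))  # 5
```

Since the InsDel variant counts a mismatch as two operations, it is always at least as large as the Edit distance on the same pair of strings.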
Figure 11: Performance for 3000 50 by 50 images

6.3 Images

The performance of GNAT's versus vp-trees on images varied greatly, but unfortunately the results were not nearly as dramatic as those for random vectors and text. For example, for 16 by 16 block images and the L2 metric, the higher-degree GNAT's gave only about 15% to 25% improvement over vp-trees (Figure 10). When using 50 by 50 blocks, results were a

little better, with improvements in the 15% to 35% range for query ranges above 200 (Figure 11). This is a clear indication that more work needs to be done to deal with clustered data.

7 Conclusion and Future Work

In working with large metric spaces, we have proved our intuition from low-dimensional spaces wrong in many ways. In many cases explored in this paper, the data lies in a very large metric space whose only easily recognizable and readily usable structure is the distance between its points. In other words, the space is so large that it is meaningless to consider and use its geometry, and one should concentrate on the intrinsic geometry of the actual set of data points.

Consequently, it is important to exploit the constraints of the distribution of the data rather than rely on those of the whole space. Therefore, GNAT's try to model the data they are indexing. There are several important issues involved in doing this.

First, does one break up the space by occupation or by population? In other words, if the data is composed of a large cluster and a few outliers, should the top node assign one branch for the cluster and a few branches for the outliers, or should it split the cluster immediately and not worry about the outliers until later? In GNAT's we decided to compromise by first sampling (the by-population approach) and then picking out the points that were far apart (the by-occupation approach) when choosing split points. The current method of selecting points that are far apart can become asymmetric, and some pathological behavior was observed (though it didn't impact query performance much). This remains a problem for future work.

A second issue is how to handle balancing. In our experiments, we found that good balance was not crucial to the performance of the structure. We attempted to improve the structure by using "weighted" Dirichlet domains, but these tended to decrease performance rather than improve it. (They did reduce build time, though.) Intuitively, when the tree structure is altered so that it is balanced rather than so that it reflects the geometry of the space, searches tend to descend down all branches. As a result, we decided to keep the tree depth from varying too much by adjusting the degrees of the nodes.

For future work, we are considering new methods of building the tree. Bottom-up constructions could lead to very good query performance, but their O(n^2) construction cost will not scale well. Consequently, we are considering schemes where a top-down construction is used but is then iteratively improved until it converges to a bottom-up type construction.

Another important research direction is to begin to use approximate distance metrics. For example, in order to compute near neighbors in text using the edit distance (an expensive computation), we can first use the q-gram distance [Ukk92] (a relatively fast computation) to narrow the search quickly and then apply the proper edit distance to complete the search. The key is that the q-gram distance is a lower bound for the edit distance. Similarly, we could linearly project a very high-dimensional space (such as 50 by 50 pixel images) down to a somewhat lower-dimensional space (e.g., by averaging together 2 by 2 pixel blocks) and use that as an approximation (L2 distance in the projection is a lower bound for L2 distance in the original space). All of these techniques, of course, rely on special knowledge of the metric space to construct the approximations. However, given the approximations, a general method could be applied.

8 Acknowledgments

I thank Prof. Michael Brin (my father), Prof. Hector Garcia-Molina, Luis Gravano, and Edouard Bugnion for helpful discussions and for listening to my endless ramblings.

References

[BFR+93] E. Bugnion, S. Fei, T. Roos, P. Widmayer, and F. Widmer. A spatial index for approximate multiple string matching. In Proc. First South American Workshop on String Processing, Belo Horizonte, Brazil, September 1993.

[BK73] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Communications of the ACM, 16(4), April 1973.

[FN75] K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans. Comput., C-24:750-753, 1975.

[FS82] C. D. Feustel and L. G. Shapiro. The nearest neighbor problem in an abstract metric space. Pattern Recognition Letters, December 1982.

[HKR93] Huttenlocher, Klanderman, and Rucklidge. Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 1993.
[SW90] Dennis Shasha and Tsong-Li Wang. New techniques for best-match retrieval. ACM Transactions on Information Systems, 8(2):140, 1990.

[Uhl91] Uhlmann. Satisfying general proximity / similarity queries with metric trees. Information Processing Letters, 40, 1991.

[Ukk92] Ukkonen. Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 92, 1992.

[Yia93] Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms), 1993.
