Near Neighbor Search in Large Metric Spaces
Sergey Brin
Department of Computer Science
Stanford University
February 27, 1995
search. To do this in an application independent manner requires that the data structure capture the intrinsic geometry of the data. As we will see (Section 4), our data structure, the GNAT, captures the geometry of data collections such as the ones mentioned above by hierarchically breaking them down into regions which try to preserve fundamental geometric structure.

2 Related Work

A very large amount of work has been done to solve specific instances of near-neighbor finding problems. Numerous articles have been written regarding finding similar vectors (e.g., time-series and geographic data), text (files and documents), images, sounds (word recognition), etc. A more limited but still substantial amount of work has addressed the general problem[2]. This work has mostly fallen into two categories. In one category, we assume that distance calculations are so expensive that even an O(n) or O(n log n) search algorithm is acceptable as long as it reduces the number of distance calculations. This is the case as long as the database size is fairly small compared to the range of the search [FS82], or if preprocessing is not allowed and only arbitrary precomputed distances are given [SW90].

[2] Some of the papers we mention below address the problem of finding nearest neighbors. However, their methods can be applied to finding all near neighbors with minimal change.

The other category of solutions is hierarchical, and these typically have an O(log n) query time given a sufficiently small range (typically too small to be practical). They are of the following form: The space is broken up hierarchically. At the top node, one or several data points are chosen. Then the distance between each of these and each of the remaining points is computed. Based on these distances, the points are separated into two or several different branches. For each branch, the structure is constructed recursively.

J. K. Uhlmann outlined the foundation for two different methods, generally described as metric trees [Uhl91]. One of these methods, subsequently called vp-trees[3], was implemented by P. N. Yianilos [Yia93]. The basic construction of a vp-tree is to break the space up using spherical cuts. To build it, pick a point in the data set (this is called the vantage point, hence the name vp-tree). Now, consider the median sphere centered at the vantage point with a radius such that half the remaining points fall inside it and half fall outside. For every other point, put it in one branch if it is inside the sphere and in another branch if it is outside the sphere. Now, recursively construct the lower level branches.

[3] We do not look at the enhancement of vp-trees called vpsb-trees.

This approach has the benefits of requiring only one distance calculation per node and automatically creating balanced trees. However, it suffers from the regions inside and outside the median sphere being very asymmetric, especially in a high-dimensional space. Since volume grows rapidly as the radius of a sphere increases, the outside of the sphere will tend to be very thin, given that there are as many points on the inside as on the outside, thus worsening search performance. In our work, we try to avoid such asymmetries. While the limited branching factor of 2 can also be viewed as a weakness, we have conducted experiments with higher degree variations of vp-trees and find little improvement in performance.

The other method, the generalized hyperplane tree (gh-tree), is constructed as follows. At the top node, pick two points. Then, divide the remaining points based on which of these two they are closer to. Now, recursively build both branches. This method is an improvement in that it is symmetric, and the tree structure still tends to be well balanced (assuming sufficiently random selection of the two points). However, it has a weakness in that it requires two computations at every node and is limited to a branching factor of two.

A variation of gh-trees was implemented at ETH Zurich [BFR+93] as monotonous bisector trees (MBT's) to deal specifically with text. However, nothing in the method would have prevented them from dealing with arbitrary metric spaces. The key difference between MBT's and gh-trees is that MBT's only select one new point at each new node. They do this by reusing the point they are associated with in the parent node. As a result, MBT's overcome the first weakness, but the branching factor remains a problem.

The most relevant works, however, are also the oldest. Burkhard and Keller suggested several data structures (and algorithms) [BK73] for approximate search. The first is very similar to vp-trees, except that it requires a finite number of discrete distance values. Essentially, for every vantage point, a separate branch is allocated for every possible distance value. This method, however, suffers from the same asymmetry problem as the vp-trees. The other two data structures, which are the closest to the GNAT, break up the space into a number of balls, storing the radii and centers. More specifically, divide the data points into groups using some method (this was left as a parameter). Pick a representative of each group and call it the center of the group. Then calculate the radius (the maximal distance to another point) from the center for each group; pruning is performed based on these radii. Recursion is briefly mentioned but not analysed. The third method, an enhancement of the second, additionally requires that the diameter (the maximal distance between any two points) of the points in any group be less than a constant, k; the group is then called a clique. In this case a minimal subset of the set of all maximal cliques is used as the set of all groups. These two schemes act as reasonably good models of the data space they store, and if extended to a hierarchical structure, they have an arbitrary branching factor. However, they have several weaknesses. First, they do not work well with nonhomogeneous data, since we could easily end up with a lot of cliques containing only one point and several cliques containing very many points. Additionally, distance computations are not fully exploited, in that the distance to the center of one clique is not used to prune other cliques. Finally, while we do not focus on the cost of preprocessing in this paper, this cost was reported to be extremely high in the third method.

K. Fukunaga and P. Narendra worked out a very similar scheme, which requires more than just a metric space, to create a tree structure with an arbitrary branching factor in 1975 [FN75], as follows. Divide the data points into k groups. (How this is done is left as a parameter of the structure, but in tests they used a clustering algorithm which requires more than just a metric space.) Then compute the mean[4] of each group (once again a departure from a metric space) and the farthest distance from that mean to a point in the group. Then recursively create the structure for each group. While this method tends to have nice symmetric properties (given a reasonable clustering algorithm) that reflect the space, and it has an arbitrary branching factor, it has several weaknesses. First, it relies on more than just a metric space; second, it requires many distance computations at each node and does not use them fully; and third, it does not deal effectively with balancing.

[4] The mean of a set of points (vectors) in a vector space is simply their sum divided by their number. The concept of a mean is not meaningful for arbitrary metric spaces.

In this paper we present GNAT's, which can be viewed as both a generalization of Fukunaga's method and a generalization of gh-trees. GNAT's provide good query performance by exploiting the geometry of the data. Unfortunately, while query time is reduced, the build time of the data structure is sharply increased. However, if the application is query dominant (or even if there are roughly as many queries as data points), the relative cost of building a GNAT becomes negligible. In tests, we find that GNAT's almost always perform better than both vp-trees and gh-trees, and scale better.

3 Large Metric Spaces

A metric space is a set X with a distance function d: X x X -> R such that, for all x, y, z in X:

1. d(x, y) >= 0, and d(x, y) = 0 iff x = y. (Positivity)

2. d(x, y) = d(y, x). (Symmetry)

3. d(x, y) + d(y, z) >= d(x, z). (Triangle Inequality)

Since we are dealing with arbitrary metric spaces, we assume the following model of computation: there is a large number of data points and a "black box" to compute the distance between them.

The first important observation is that it is impossible to deal efficiently with all metric spaces. In particular, consider the metric space where the distance between two points is 0 if they are the same and 1 if they are different. Then our only option in finding a query point is a linear search, and no fancy data structure will save us. In fact, the more any space resembles such a metric space, the more difficult it will be to search.

Furthermore, the distribution of data in the metric space is more important than the metric space itself. If the data lies on a two-dimensional surface that is embedded in a 50-dimensional space, query times will behave more like those of a two-dimensional space than those of a 50-dimensional space, given an intelligent data structure. In a sense, for a high-dimensional space, the data determines the "geometry" of the space more than the constraints of the space itself.

Since visualizing high-dimensional data is difficult, we look at some simple measures to help us understand the geometry of a given data space. A particularly useful measurement is the distribution of distances between points in the space. While the scales of these distributions vary greatly, we can compare them by considering at what range we would be interested in finding near neighbors. In each of the graphs of the distributions that follow, we have made 5000 random distance calculations in the data space and distributed them into a number of buckets. The y axis represents the number of distances which fell into a bucket divided by the size of the bucket.

The distributions of distances between random, uniformly chosen vectors in 20 and 50 dimensional hypercubes of side 1 under the L1 metric[5] tend toward Gaussian distributions because of the Central Limit Theorem (Figure 1). For the L2 metric, we obtain a Gaussian-like (though not exactly Gaussian) limit distribution (Figure 2).

[5] Recall that the L1 metric is the sum of the absolute values of the differences of corresponding vector dimensions, and the L2, or Euclidean, metric is the square root of the sum of the squares.

[Figure 1: Distribution of distances under L1 metric in 20 and 50 dimensions.]

[Figure 2: Distribution of distances under L2 metric in 20 and 50 dimensions.]

Note that the distributions for 50 dimensions should be viewed in relation to their larger ranges, and hence are really quite narrow. The fact that the peaks are narrow indicates that the distance function has low entropy and that it may be difficult to index the data, since arbitrary distance measurements will provide us with little information. However, by wisely choosing the distance computations, we can greatly improve the efficiency of the system.

Correlated data has somewhat different properties and tends to have a much flatter distance distribution. For example, taking random 16 by 16 or 50 by 50 blocks from an image, treating them as 256- or 2500-dimensional vectors respectively, and taking the L2 distances between them creates a distribution with two major maxima (Figure 3). The first maximum, near 0, indicates a great deal of clustering in the data, since small distances are so probable; the second major maximum is one that is common to all large metric spaces we will investigate, and indicates that average distances are fairly likely.

[Figure 3: Distribution of distances under L2 metric in 16 by 16 and 50 by 50 images.]

As another example, consider taking lines of text from a large text document ("A Tale of Two Cities" in this example) and using a simple edit distance function. We considered two different such functions. Both counted the minimum number of operations needed to get from one line to the other. The first distance function, InsDel, allowed only inserts and deletes of single characters as operations. The second, Edit, added the operation of replacing one character with another.
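The sampled distance distributions described above can be estimated with a few lines of code. The sketch below is illustrative only (the function name and parameter defaults are assumptions, not the authors' code); it draws random pairs of points, tallies their distances into equal-width buckets, and divides each count by the bucket width, matching the y axis convention described above.

```python
import random
from collections import Counter

def distance_histogram(points, dist, samples=5000, buckets=40):
    """Estimate the distribution of pairwise distances by sampling random
    pairs of points.  Each sampled distance is tallied into one of
    `buckets` equal-width buckets, and each count is divided by the
    bucket width."""
    ds = [dist(*random.sample(points, 2)) for _ in range(samples)]
    lo, hi = min(ds), max(ds)
    width = (hi - lo) / buckets or 1.0   # guard against all-equal distances
    counts = Counter(min(int((d - lo) / width), buckets - 1) for d in ds)
    # Return (bucket midpoint, normalized count) pairs, ready to plot.
    return [(lo + (k + 0.5) * width, counts[k] / width) for k in range(buckets)]
```

Applied to uniformly chosen high-dimensional vectors under the L1 metric, a sketch like this reproduces the narrow, Gaussian-like peak discussed above.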
[Figure: Distribution of InsDel and Edit distances between lines of text (x axis: Distance, 0 to 100).]
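The two line distances, InsDel and Edit, can be computed with standard dynamic programming. The following minimal sketch is an illustration of the definitions above, not the implementation used in the experiments:

```python
def insdel(a: str, b: str) -> int:
    """InsDel distance: minimum number of single-character inserts and
    deletes needed to turn string a into string b."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))            # row for the empty prefix of a
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1]     # characters match: no operation
            else:
                cur[j] = 1 + min(prev[j], cur[j - 1])  # delete or insert
        prev = cur
    return prev[n]

def edit(a: str, b: str) -> int:
    """Edit distance: like InsDel, but single-character replacement is
    also allowed as an operation (the classic Levenshtein distance)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete a[i-1]
                         cur[j - 1] + 1,      # insert b[j-1]
                         prev[j - 1] + cost)  # replace (or match)
        prev = cur
    return prev[n]
```

Both functions are metrics, and edit(a, b) <= insdel(a, b) for every pair of strings, since any replacement can be simulated by a delete followed by an insert.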
points associated with a split point p_i be denoted D_{p_i}.

3. For each pair of split points (p_i, p_j), calculate range(p_i, p_j) = [min_ij, max_ij], a minimum and a maximum of Dist(p_i, x), where x is in D_{p_j} or x = p_j.

The strategy we settled on is to sample about 3 times the number of split points we wanted, and then pick those that were the farthest apart (according to a greedy algorithm). The number 3 was arrived at empirically. More specifically, this is done as follows:
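A hedged sketch of this sampling-plus-greedy selection is shown below. The function name, the choice of starting point, and the tie-breaking behavior are illustrative assumptions, not the paper's own procedure:

```python
import random

def choose_split_points(points, dist, k):
    """Sample about 3*k candidate points, then greedily keep the k that
    are farthest apart: starting from one sampled point, repeatedly add
    the candidate whose minimum distance to the points chosen so far is
    largest (a farthest-first traversal)."""
    sample = random.sample(points, min(3 * k, len(points)))
    chosen = [sample.pop()]
    while len(chosen) < k and sample:
        best = max(sample, key=lambda x: min(dist(x, c) for c in chosen))
        sample.remove(best)
        chosen.append(best)
    return chosen
```

Greedy farthest-first selection favors well-separated split points, at the cost of occasionally favoring outliers, which matches the asymmetry concern discussed later in the paper.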
k -- the average degree (equal to N/n).

k2 -- the second moment (the average of the squares) of the degree.

l -- the average depth of a point.

s -- the amount of memory needed to store a data point.

5.1 Data Types Supported

The data types (metric spaces) with which the system works are as follows:

Vectors -- The simplest of the data types, these are N-dimensional vectors from a hypercube of side 1 and can be chosen in two different ways -- chosen uniformly from R^n, and chosen uniformly from R^2 and then mapped into R^n using a sim-
sampling technique to choose vantage points, since we could not be sure that we would do it identically to [Yia93]. However, some limited tests with sampling indicated that savings were in the 10% range for images and were negligible for text and random vectors.

[Figure: Distance calculations for vp-tree, gh-tree, and GNAT (mytree) variants of several degrees.]
[Figure 8: Varying Query Range for 3000 Lines of Hamlet Using the InsDel Distance. Series: vp2-tree, mytree10, mytree20, mytree50, mytree100.]

[Figure 9: Varying Query Range for 10000 Lines of Dickens Using the Edit Distance. Series: vp2-tree, mytree10, mytree20, mytree50, mytree100.]
in another. The number of test queries used in every case was 100.

The first thing to note is how difficult it actually is to perform these searches. All of the data structures seemed to struggle with ranges above 0.3 (reasonable queries could easily have ranges considerably above 0.5), looking at more than 50% of the data points in many cases. This is caused by the low information content of the distance calculations, since they tend to return very similar numbers (Figure 2).

Despite the difficulty that all these methods have, high degree GNAT's come out far ahead. In particular, the GNAT's of degrees 50 and 100 had more than a factor of 3 improvement over vp-trees in many cases.

[Figure: Distance calculations vs. query range (0 to 500) for vp2-tree, mytree10, mytree20, mytree50, and mytree100.]
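For reference, the kind of range search these experiments measure can be sketched as follows. This is a deliberately simplified GNAT (random split points, fixed degree, no depth adjustment), and the class layout and names are illustrative assumptions, not the authors' implementation:

```python
import random

class GNATNode:
    """Minimal GNAT sketch: k split points, one subtree per split point
    holding the remaining points closest to it, and for every pair (i, j)
    the [min, max] range of distances from split point i to the points
    stored under split point j (including split point j itself)."""

    def __init__(self, points, dist, k=4):
        self.dist = dist
        self.splits = list(points) if len(points) <= k else random.sample(points, k)
        groups = [[] for _ in self.splits]
        for x in points:
            if x in self.splits:
                continue
            nearest = min(range(len(self.splits)),
                          key=lambda s: dist(x, self.splits[s]))
            groups[nearest].append(x)
        self.ranges = [[None] * len(self.splits) for _ in self.splits]
        for i, p in enumerate(self.splits):
            for j, q in enumerate(self.splits):
                ds = [dist(p, y) for y in groups[j]] + [dist(p, q)]
                self.ranges[i][j] = (min(ds), max(ds))
        self.children = [GNATNode(g, dist, k) if g else None for g in groups]

    def search(self, q, r, out):
        """Append every stored point within distance r of q to out.  Each
        computed distance d = dist(q, p_i) prunes every branch j whose
        stored range cannot intersect [d - r, d + r]; by the triangle
        inequality such a branch cannot contain a hit."""
        alive = list(range(len(self.splits)))
        for i in range(len(self.splits)):
            if i not in alive:
                continue
            d = self.dist(q, self.splits[i])
            if d <= r:
                out.append(self.splits[i])
            alive = [j for j in alive
                     if d - r <= self.ranges[i][j][1]
                     and d + r >= self.ranges[i][j][0]]
        for j in alive:
            if self.children[j] is not None:
                self.children[j].search(q, r, out)
```

Every distance computed at a node is reused to prune all remaining branches, which is the property the degree-50 and degree-100 GNAT's above exploit.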
little better, with improvements in the 15% to 35% range for ranges above 200. This is a clear indication that more work needs to be done to deal with clustered data (Figure 11).

7 Conclusion and Future Work

In working with large metric spaces, we have proved our intuition from low-dimensional spaces wrong in many ways. In many cases explored in this paper, the data lies in a very large metric space whose only easily recognizable and readily usable structure is the distance between its points. In other words, the space is so large that it is meaningless to consider and use its geometry, and one should concentrate on the intrinsic geometry of the actual set of data points.

Consequently, it is important to exploit the constraints of the distribution of data rather than rely on those of the whole space. Therefore, GNAT's try to model the data they are indexing. There are several important issues involved in doing this.

First, does one break up the space by occupation or by population? In other words, if the data is composed of a large cluster and a few outliers, should the top node assign one branch for the cluster and a few branches for the outliers, or should it split the cluster immediately and not worry about the outliers until later? In GNAT's we decided to compromise by first sampling (the by-population approach) and then by picking out the points that were far apart (the by-occupation approach) when choosing split points. The current method of selecting points that are far apart can become asymmetric, and some pathological behavior was observed (though it didn't impact query performance much). This remains a problem for future work.

A second issue is how to handle balancing. In our experiments, we found that good balance was not crucial to the performance of the structure. We attempted to improve the structure by using "weighted" Dirichlet domains, but these tended to decrease performance rather than improve it. (They did reduce build time, though.) Intuitively, when the tree structure is altered so that it is balanced rather than so it reflects the geometry of the space, searches tend to descend down all branches. As a result, we decided to keep the tree depth from varying too much by adjusting the degrees of the nodes.

For future work, we are considering new methods of building the tree. Bottom-up constructions could lead to very good query performance, but their O(n^2) construction cost will not scale well. Consequently, we are considering schemes where a top-down construction is used but then is iteratively improved until it converges to a bottom-up type construction.

Another important research direction is to begin to use approximate distance metrics. For example, in order to compute near neighbors in text using the edit distance (an expensive computation), we can first use the q-gram distance [Ukk92] (a relatively fast computation) to narrow the search quickly, and then apply the proper edit distance to complete the search. The key is that the q-gram distance is a lower bound for the edit distance. Similarly, we could linearly project a very high-dimensional space (such as 50 by 50 pixel images) down to a somewhat lower dimensional space (e.g., by averaging together 2 by 2 pixel blocks) and use that as an approximation (L2 distance in the projection is a lower bound for L2 distance in the original space). All of these techniques, of course, rely on special knowledge of the metric space to construct the approximations. However, given the approximations, a general method could be applied.

8 Acknowledgments

I thank Prof. Michael Brin (my father), Prof. Hector Garcia-Molina, Luis Gravano, and Edouard Bugnion for helpful discussions and for listening to my endless ramblings.

References

[BFR+93] E. Bugnion, S. Fei, T. Roos, P. Widmayer, and F. Widmer. A spatial index for approximate multiple string matching. In Proc. First South American Workshop on String Processing, Belo Horizonte, Brazil, September 1993.

[BK73] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Communications of the ACM, 16(4), April 1973.

[FN75] K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans. Comput., C-24:750-753, 1975.

[FS82] C. D. Feustel and L. G. Shapiro. The nearest neighbor problem in an abstract metric space. Pattern Recognition Letters, December 1982.

[HKR93] Huttenlocher, Klanderman, and Rucklidge. Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 1993.

[SW90] Dennis Shasha and Tsong-Li Wang. New techniques for best-match retrieval. ACM Transactions on Information Systems, 8(2):140, 1990.

[Uhl91] J. K. Uhlmann. Satisfying general proximity / similarity queries with metric trees. Information Processing Letters, 40, 1991.

[Ukk92] E. Ukkonen. Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 92, 1992.

[Yia93] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms), 1993.