
Near-Optimal Hashing Algorithms for

Approximate Nearest Neighbor in High Dimensions


by Alexandr Andoni and Piotr Indyk

Abstract

In this article, we give an overview of efficient algorithms for the approximate and exact nearest
neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given
a new query object, one can quickly return the dataset object that is most similar to the query. The
problem is of significant interest in a wide variety of areas.
The goal of this article is twofold. In the first part, we survey a family of nearest neighbor algorithms that are based on the concept of locality-sensitive hashing. Many of these algorithms have already been successfully applied in a variety of practical scenarios. In the second part of this article, we describe a recently discovered hashing-based algorithm for the case where the objects are points in the d-dimensional Euclidean space. As it turns out, the performance of this algorithm is provably near-optimal in the class of locality-sensitive hashing algorithms.

1 Introduction
The nearest neighbor problem is defined as follows: given a collection of n points, build a data structure which, given any query point, reports the data point that is closest to the query. A particularly interesting and well-studied instance is where the data points live in a d-dimensional space under some (e.g., Euclidean) distance function. This problem is of major importance in several areas; some examples are data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis. Typically, the features of each object of interest (document, image, etc.) are represented as a point in ℝ^d and the distance metric is used to measure the similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to millions. For example, one can represent a 1000 × 1000 image as a vector in a 1,000,000-dimensional space, one dimension per pixel.

There are several efficient algorithms known for the case when the dimension d is low (e.g., up to 10 or 20). The first such data structure, called kd-trees, was introduced in 1975 by Jon Bentley [6], and remains one of the most popular data structures used for searching in multidimensional spaces. Many other multidimensional data structures are known; see [35] for an overview. However, despite decades of intensive effort, the current solutions suffer from either space or query time that is exponential in d. In fact, for large enough d, in theory or in practice, they often provide little improvement over a linear time algorithm that compares a query to each point from the database. This phenomenon is often called "the curse of dimensionality."

In recent years, several researchers have proposed methods for overcoming the running time bottleneck by using approximation (e.g., [5, 27, 25, 29, 22, 28, 17, 13, 32, 1], see also [36, 24]). In this formulation, the algorithm is allowed to return a point whose distance from the query is at most c times the distance from the query to its nearest points; c > 1 is called the approximation factor. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. Moreover, an efficient approximation algorithm can be used to solve the exact nearest neighbor problem by enumerating all approximate nearest neighbors and choosing the closest point.¹

¹ See Section 2.4 for more information about exact algorithms.

In this article, we focus on one of the most popular algorithms for performing approximate search in high dimensions, based on the concept of locality-sensitive hashing (LSH) [25]. The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects that are close to each other than for those that are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point.

The LSH algorithm and its variants have been successfully applied to computational problems in a variety of areas, including web clustering [23], computational biology [10, 11], computer vision (see selected articles in [36]), computational drug design [18] and computational linguistics [34]. A code implementing a variant of this method is available from the authors [2]. For a more theoretically oriented overview of this and related algorithms, see [24].

The purpose of this article is twofold. In Section 2, we describe the basic ideas behind the LSH algorithm and its analysis; we also give an overview of the current library of LSH functions for various distance measures in Section 3. Then, in Section 4, we describe a recently developed LSH family for the Euclidean distance, which achieves a near-optimal separation between the collision probabilities of close and far points. An interesting feature of this family is that it effectively enables the reduction of the approximate nearest neighbor problem for worst-case data to the exact nearest neighbor problem over random (or pseudorandom) point configurations in low-dimensional spaces.

Biographies
Alexandr Andoni ([email protected]) is a Ph.D. candidate in computer science at Massachusetts Institute of Technology, Cambridge, MA.
Piotr Indyk ([email protected]) is an associate professor in the Theory of Computation Group, Computer Science and Artificial Intelligence Lab, at Massachusetts Institute of Technology, Cambridge, MA.



Currently, the new family is mostly of theoretical interest. This is because the asymptotic improvement in the running time achieved via a better separation of collision probabilities makes a difference only for a relatively large number of input points. Nevertheless, it is quite likely that one can design better pseudorandom point configurations which do not suffer from this problem. Some evidence for this conjecture is presented in [3], where it is shown that point configurations induced by the so-called Leech lattice compare favorably with truly random configurations.

2 Preliminaries
2.1 Geometric Normed Spaces
We start by introducing the basic notation used in this article. First, we use P to denote the set of data points and assume that P has cardinality n. The points p from P belong to a d-dimensional space ℝ^d. We use pi to denote the ith coordinate of p, for i = 1…d.

For any two points p and q, the distance between them is defined as

  ‖p – q‖s = (Σ_{i=1}^{d} |pi – qi|^s)^{1/s}

for a parameter s > 0; this distance function is often called the ls norm. The typical cases include s = 2 (the Euclidean distance) or s = 1 (the Manhattan distance).² To simplify notation, we often skip the subscript 2 when we refer to the Euclidean norm, that is, ‖p – q‖ = ‖p – q‖2.

² The name is motivated by the fact that ‖p – q‖1 = Σ_{i=1}^{d} |pi – qi| is the length of the shortest path between p and q if one is allowed to move along only one coordinate at a time.

Occasionally, we also use the Hamming distance, which is defined as the number of positions on which the points p and q differ.
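For concreteness, these distance functions can be computed directly. The following Python snippet is an illustrative sketch (not part of the original article); it evaluates the ls norm of p – q and the Hamming distance for points given as lists of coordinates.

def ls_distance(p, q, s=2.0):
    # ||p - q||_s = (sum over i of |p_i - q_i|^s)^(1/s); s = 2 gives the
    # Euclidean distance and s = 1 the Manhattan distance.
    return sum(abs(pi - qi) ** s for pi, qi in zip(p, q)) ** (1.0 / s)

def hamming_distance(p, q):
    # Number of coordinates on which p and q differ.
    return sum(1 for pi, qi in zip(p, q) if pi != qi)

p, q = [1.0, 2.0, 3.0], [2.0, 2.0, 5.0]
print(ls_distance(p, q, s=2))                         # Euclidean: about 2.236
print(ls_distance(p, q, s=1))                         # Manhattan: 3.0
print(hamming_distance([0, 1, 1, 0], [0, 1, 0, 1]))   # 2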
2.2 Problem Definition
The nearest neighbor problem is an example of an optimization problem: the goal is to find a point which minimizes a certain objective function (in this case, the distance to the query point). In contrast, the algorithms that are presented in this article solve the decision version of the problem. To simplify the notation, we say that a point p is an R-near neighbor of a point q if the distance between p and q is at most R (see Figure 1). In this language, our algorithm either returns one of the R-near neighbors or concludes that no such point exists for some parameter R.

Fig. 1. An illustration of an R-near neighbor query. The nearest neighbor of the query point q is the point p1. However, both p1 and p2 are R-near neighbors of q.

Naturally, the nearest and near neighbor problems are related. It is easy to see that the nearest neighbor problem also solves the R-near neighbor problem: one can simply check if the returned point is an R-near neighbor of the query point. The reduction in the other direction is somewhat more complicated and involves creating several instances of the near neighbor problem for different values of R. During the query time, the data structures are queried in the increasing order of R. The process is stopped when a data structure reports an answer. See [22] for a reduction of this type with theoretical guarantees.

In the rest of this article, we focus on the approximate near neighbor problem. The formal definition of the approximate version of the near neighbor problem is as follows.

Definition 2.1 (Randomized c-approximate R-near neighbor, or (c, R)-NN). Given a set P of points in a d-dimensional space ℝ^d, and parameters R > 0, δ > 0, construct a data structure such that, given any query point q, if there exists an R-near neighbor of q in P, it reports some cR-near neighbor of q in P with probability 1 – δ.

For simplicity, we often skip the word randomized in the discussion. In these situations, we will assume that δ is an absolute constant bounded away from 1 (e.g., 1/2). Note that the probability of success can be amplified by building and querying several instances of the data structure. For example, constructing two independent data structures, each with δ = 1/2, yields a data structure with a probability of failure δ = 1/2 · 1/2 = 1/4.

In addition, observe that we can typically assume that R = 1. Otherwise we can simply divide all coordinates by R. Therefore, we will often skip the parameter R as well and refer to the c-approximate near neighbor problem or c-NN.

We also define a related reporting problem.

Definition 2.2 (Randomized R-near neighbor reporting). Given a set P of points in a d-dimensional space ℝ^d, and parameters R > 0, δ > 0, construct a data structure that, given any query point q, reports each R-near neighbor of q in P with probability 1 – δ.

Note that the latter definition does not involve an approximation factor. Also, unlike the case of the approximate near neighbor, here the data structure can return many (or even all) points if a large fraction of the data points are located close to the query point. As a result, one cannot give an a priori bound on the running time of the algorithm. However, as we point out later, the two problems are intimately related. In particular, the algorithms in this article can be easily modified to solve both c-NN and the reporting problems.
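As a baseline for these definitions, both problems can of course be solved by a linear scan over P; the sketch below (our illustration, not from the article) does exactly that and is what the hashing-based algorithms aim to beat.

def linear_scan_report(P, q, R, dist):
    # R-near neighbor reporting by brute force: return every point of P
    # within distance R of the query q (with no failure probability).
    return [p for p in P if dist(p, q) <= R]

def linear_scan_near(P, q, R, dist):
    # A trivial (and exact) solution to the (c, R)-NN problem: return some
    # R-near neighbor of q if one exists, and None otherwise.  Both scans
    # take time proportional to n times the cost of one distance computation.
    for p in P:
        if dist(p, q) <= R:
            return p
    return None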

2.3 Locality-Sensitive Hashing
The LSH algorithm relies on the existence of locality-sensitive hash functions. Let H be a family of hash functions mapping ℝ^d to some universe U. For any two points p and q, consider a process in which we choose a function h from H uniformly at random, and analyze the probability that h(p) = h(q). The family H is called locality sensitive (with proper parameters) if it satisfies the following condition.

Definition 2.3 (Locality-sensitive hashing). A family H is called (R, cR, P1, P2)-sensitive if for any two points p, q ∈ ℝ^d:

• if ‖p – q‖ ≤ R then PrH[h(q) = h(p)] ≥ P1,

• if ‖p – q‖ ≥ cR then PrH[h(q) = h(p)] ≤ P2.

In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy P1 > P2.


To illustrate the concept, consider the following example. Assume that the data points are binary, that is, each coordinate is either 0 or 1. In addition, assume that the distance between points p and q is computed according to the Hamming distance. In this case, we can use a particularly simple family of functions H which contains all projections of the input point on one of the coordinates, that is, H contains all functions hi from {0, 1}^d to {0, 1} such that hi(p) = pi. Choosing one hash function h uniformly at random from H means that h(p) returns a random coordinate of p (note, however, that different applications of h return the same coordinate of the argument).

To see that the family H is locality-sensitive with nontrivial parameters, observe that the probability PrH[h(q) = h(p)] is equal to the fraction of coordinates on which p and q agree. Therefore, P1 = 1 – R/d, while P2 = 1 – cR/d. As long as the approximation factor c is greater than 1, we have P1 > P2.
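This bit-sampling family is easy to simulate. The following sketch (our illustration; the values of d, R and c are chosen only for the example) draws random coordinate-projection functions and checks that the empirical collision probabilities match P1 = 1 – R/d and P2 = 1 – cR/d.

import random

def sample_bit_hash(d):
    # h_i(p) = p_i for a uniformly random coordinate i.
    i = random.randrange(d)
    return lambda p: p[i]

def collision_rate(p, q, d, trials=100000):
    hits = 0
    for _ in range(trials):
        h = sample_bit_hash(d)
        if h(p) == h(q):
            hits += 1
    return hits / trials

d, R, c = 100, 10, 2
p = [0] * d
q_close = [1] * R + [0] * (d - R)          # Hamming distance R from p
q_far = [1] * (c * R) + [0] * (d - c * R)  # Hamming distance cR from p
print(collision_rate(p, q_close, d))   # about P1 = 1 - R/d  = 0.90
print(collision_rate(p, q_far, d))     # about P2 = 1 - cR/d = 0.80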
2.4 The Algorithm
An LSH family H can be used to design an efficient algorithm for approximate near neighbor search. However, one typically cannot use H as is since the gap between the probabilities P1 and P2 could be quite small. Instead, an amplification process is needed in order to achieve the desired probabilities of collision. We describe this step next, and present the complete algorithm in Figure 2.

Given a family H of hash functions with parameters (R, cR, P1, P2) as in Definition 2.3, we amplify the gap between the high probability P1 and the low probability P2 by concatenating several functions. In particular, for parameters k and L (specified later), we choose L functions gj(q) = (h1,j(q),…,hk,j(q)), where ht,j (1 ≤ t ≤ k, 1 ≤ j ≤ L) are chosen independently and uniformly at random from H. These are the actual functions that we use to hash the data points.

The data structure is constructed by placing each point p from the input set into a bucket gj(p), for j = 1,…,L. Since the total number of buckets may be large, we retain only the nonempty buckets by resorting to (standard) hashing³ of the values gj(p). In this way, the data structure uses only O(nL) memory cells; note that it suffices that the buckets store the pointers to data points, not the points themselves.

³ See [16] for more details on hashing.

To process a query q, we scan through the buckets g1(q),…,gL(q), and retrieve the points stored in them. After retrieving the points, we compute their distances to the query point, and report any point that is a valid answer to the query. Two concrete scanning strategies are possible.

1. Interrupt the search after finding the first L′ points (including duplicates) for some parameter L′.

2. Continue the search until all points from all buckets are retrieved; no additional parameter is required.

The two strategies lead to different behaviors of the algorithms. In particular, Strategy 1 solves the (c, R)-near neighbor problem, while Strategy 2 solves the R-near neighbor reporting problem.

Strategy 1. It is shown in [25, 19] that the first strategy, with L′ = 3L, yields a solution to the randomized c-approximate R-near neighbor problem, with parameters R and δ for some constant failure probability δ < 1. To obtain this guarantee, it suffices to set L to Θ(n^ρ), where ρ = ln(1/P1)/ln(1/P2) [19]. Note that this implies that the algorithm runs in time proportional to n^ρ, which is sublinear in n if P1 > P2. For example, if we use the hash functions for the binary vectors mentioned earlier, we obtain ρ = 1/c [25, 19]. The exponents for other LSH families are given in Section 3.

Strategy 2. The second strategy enables us to solve the randomized R-near neighbor reporting problem. The value of the failure probability δ depends on the choice of the parameters k and L. Conversely, for each δ, one can provide parameters k and L so that the error probability is smaller than δ. The query time is also dependent on k and L. It could be as high as Θ(n) in the worst case, but, for many natural datasets, a proper choice of parameters results in a sublinear query time.

The details of the analysis are as follows. Let p be any R-near neighbor of q, and consider any parameter k. For any function gi, the probability that gi(p) = gi(q) is at least P1^k. Therefore, the probability that gi(p) = gi(q) for some i = 1…L is at least 1 – (1 – P1^k)^L. If we set L = log_{1 – P1^k} δ so that (1 – P1^k)^L ≤ δ, then any R-near neighbor of q is returned by the algorithm with probability at least 1 – δ.

How should the parameter k be chosen? Intuitively, larger values of k lead to a larger gap between the probabilities of collision for close points and far points; the probabilities are P1^k and P2^k, respectively (see Figure 3 for an illustration). The benefit of this amplification is that the hash functions are more selective. At the same time, if k is large then P1^k is small, which means that L must be sufficiently large to ensure that an R-near neighbor collides with the query point at least once.
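To make the parameter setting concrete, the short computation below (our illustration) evaluates ρ = ln(1/P1)/ln(1/P2) and the table count L = log_{1 – P1^k} δ for the bit-sampling family with d = 100, R = 10 and c = 2, matching the formulas above; the particular k and δ are only an example.

import math

d, R, c = 100, 10, 2
P1, P2 = 1 - R / d, 1 - c * R / d        # 0.9 and 0.8
rho = math.log(1 / P1) / math.log(1 / P2)
print(rho)                                # about 0.47, close to 1/c = 0.5

k, delta = 10, 0.1
# Choose L so that (1 - P1^k)^L <= delta, i.e. L = log_{1 - P1^k}(delta).
L = math.ceil(math.log(delta) / math.log(1 - P1 ** k))
print(P1 ** k, L)                         # P1^k ~ 0.35, and L = 6 tables suffice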

Preprocessing:
1. Choose L functions gj, j = 1,…,L, by setting gj = (h1,j, h2,j,…,hk,j), where h1,j,…,hk,j are chosen at random from the LSH family H.
2. Construct L hash tables, where, for each j = 1,…,L, the jth hash table contains the dataset points hashed using the function gj.

Query algorithm for a query point q:

1. For each j = 1, 2,…,L
   i) Retrieve the points from the bucket gj(q) in the jth hash table.
   ii) For each of the retrieved points, compute the distance from q to it, and report the point if it is a correct answer (cR-near neighbor for Strategy 1, and R-near neighbor for Strategy 2).
   iii) (optional) Stop as soon as the number of reported points is more than L′.

Fig. 2. Preprocessing and query algorithms of the basic LSH algorithm.
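The procedures of Figure 2 translate almost line by line into code. The sketch below is ours (it uses the bit-sampling family of Section 2.3 for concreteness; the class and parameter names are illustrative, not from the article); it builds L hash tables keyed on gj(p) and supports both scanning strategies.

import random
from collections import defaultdict

def hamming(p, q):
    return sum(1 for a, b in zip(p, q) if a != b)

class BasicLSH:
    def __init__(self, points, k, L, d):
        # Preprocessing: choose L functions g_j = (h_{1,j}, ..., h_{k,j}),
        # each h_{t,j} a random coordinate projection, and hash every point.
        self.points = points
        self.coords = [[random.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx, p in enumerate(points):
            for j in range(L):
                # Buckets store pointers (indices), not the points themselves.
                self.tables[j][self.g(j, p)].append(idx)

    def g(self, j, p):
        return tuple(p[i] for i in self.coords[j])

    def query(self, q, R, c=1, limit=None):
        # Scan the buckets g_1(q), ..., g_L(q).  Strategy 2 (limit=None)
        # reports every distinct R-near neighbor found; Strategy 1
        # (limit=L') stops after inspecting L' retrieved points (including
        # duplicates) and reports cR-near neighbors.
        reported, inspected = [], 0
        seen = set()
        for j in range(len(self.tables)):
            for idx in self.tables[j].get(self.g(j, q), []):
                inspected += 1
                if idx not in seen and hamming(self.points[idx], q) <= c * R:
                    seen.add(idx)
                    reported.append(self.points[idx])
                if limit is not None and inspected >= limit:
                    return reported
        return reported

# Example usage on random binary vectors.
random.seed(0)
d, n = 100, 1000
data = [[random.randint(0, 1) for _ in range(d)] for _ in range(n)]
index = BasicLSH(data, k=10, L=6, d=d)
print(len(index.query(data[0], R=10)))                     # Strategy 2
print(len(index.query(data[0], R=10, c=2, limit=3 * 6)))   # Strategy 1 with L' = 3L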



(a) The probability that gj(p) = gj(q) for a fixed j. Graphs are shown for several values of k. In particular, the blue function (k = 1) is the probability of collision of points p and q under a single random hash function h from the LSH family.

(b) The probability that gj(p) = gj(q) for some j = 1…L. The probabilities are shown for two values of k and several values of L. Note that the slopes are sharper when k is higher.

Fig. 3. The graphs of the probability of collision of points p and q as a function of the distance between p and q for different values of k and L. The points p and q are d = 100 dimensional binary vectors under the Hamming distance. The LSH family H is the one described in Section 2.3.

A practical approach to choosing k was introduced in the E2LSH package [2]. There the data structure optimized the parameter k as a function of the dataset and a set of sample queries. Specifically, given the dataset, a query point, and a fixed k, one can estimate precisely the expected number of collisions and thus the time for distance computations, as well as the time to hash the query into all L hash tables. The sum of the estimates of these two terms is the estimate of the total query time for this particular query. E2LSH chooses the k that minimizes this sum over a small set of sample queries.
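The parameter-selection idea can be sketched roughly as follows (our much-simplified sketch; the actual E2LSH package uses its own, more refined estimators, and count_collisions, hash_cost and distance_cost below are hypothetical names introduced only for illustration): for each candidate k, estimate the two components of the query time on a few sample queries and keep the k with the smallest total.

def estimate_query_time(index_factory, data, sample_queries, k, L,
                        hash_cost, distance_cost):
    # Build an index with this k, then estimate, per sample query, the time
    # to hash into all L tables plus the time to compute distances to the
    # colliding points.  All costs are in arbitrary (relative) units, and
    # count_collisions is an assumed helper of the index object.
    index = index_factory(data, k, L)
    total = 0.0
    for q in sample_queries:
        collisions = index.count_collisions(q)
        total += L * hash_cost(k) + collisions * distance_cost
    return total / len(sample_queries)

def choose_k(index_factory, data, sample_queries, L, candidate_ks,
             hash_cost=lambda k: k, distance_cost=1.0):
    # Return the candidate k minimizing the estimated total query time.
    return min(candidate_ks,
               key=lambda k: estimate_query_time(index_factory, data,
                                                 sample_queries, k, L,
                                                 hash_cost, distance_cost))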
3 LSH Library
To date, several LSH families have been discovered. We briefly survey them in this section. For each family, we present the procedure of choosing a random function from the respective LSH family as well as its locality-sensitive properties.

Hamming distance. For binary vectors from {0, 1}^d, Indyk and Motwani [25] propose the LSH function hi(p) = pi, where i ∈ {1,…,d} is a randomly chosen index (the sample LSH family from Section 2.3). They prove that the exponent ρ is 1/c in this case.

It can be seen that this family applies directly to M-ary vectors (i.e., with coordinates in {1…M}) under the Hamming distance. Moreover, a simple reduction enables the extension of this family of functions to M-ary vectors under the l1 distance [30]. Consider any point p from {1…M}^d. The reduction proceeds by computing a binary string Unary(p) obtained by replacing each coordinate pi by a sequence of pi ones followed by M – pi zeros. It is easy to see that for any two M-ary vectors p and q, the Hamming distance between Unary(p) and Unary(q) equals the l1 distance between p and q. Unfortunately, this reduction is efficient only if M is relatively small.
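This reduction can be written in a few lines; the sketch below (ours) maps an M-ary vector to its unary encoding, so that the Hamming distance between encodings equals the l1 distance between the original vectors.

def unary(p, M):
    # Replace each coordinate p_i (in {1, ..., M}) by p_i ones followed by
    # M - p_i zeros; the result is a binary string of length M * d.
    bits = []
    for pi in p:
        bits.extend([1] * pi + [0] * (M - pi))
    return bits

def hamming(x, y):
    return sum(1 for a, b in zip(x, y) if a != b)

p, q, M = [3, 1, 5], [2, 4, 5], 5
# Hamming(Unary(p), Unary(q)) equals the l1 distance |3-2| + |1-4| + |5-5| = 4.
assert hamming(unary(p, M), unary(q, M)) == sum(abs(a - b) for a, b in zip(p, q))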
l1 distance. A more direct LSH family for ℝ^d under the l1 distance is described in [4]. Fix a real w ≫ R, and impose a randomly shifted grid with cells of width w; each cell defines a bucket. More specifically, pick random reals s1, s2,…,sd ∈ [0, w) and define hs1,…,sd(x) = (⌊(x1 – s1)/w⌋,…, ⌊(xd – sd)/w⌋). The resulting exponent is equal to ρ = 1/c + O(R/w).
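A sketch of this family in code (ours; the values of w and d are illustrative): each hash function is determined by d random shifts, and a point is mapped to the tuple of grid-cell indices.

import math
import random

def sample_grid_hash(d, w):
    # Randomly shifted grid with cells of width w: pick s_i in [0, w) and
    # map x to (floor((x_1 - s_1)/w), ..., floor((x_d - s_d)/w)).
    s = [random.uniform(0, w) for _ in range(d)]
    return lambda x: tuple(math.floor((x[i] - s[i]) / w) for i in range(d))

h = sample_grid_hash(d=4, w=8.0)
print(h([0.5, 3.0, 7.9, -2.0]))
print(h([1.0, 3.5, 7.0, -2.5]))   # nearby points often land in the same cell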
ls distance. For the Euclidean space, [17] propose the following LSH family. Pick a random projection of ℝ^d onto a 1-dimensional line and chop the line into segments of length w, shifted by a random value b ∈ [0, w). Formally, hr,b(x) = ⌊(r·x + b)/w⌋, where the projection vector r ∈ ℝ^d is constructed by picking each coordinate of r from the Gaussian distribution. The exponent ρ drops strictly below 1/c for some (carefully chosen) finite value of w. This is the family used in the E2LSH package [2].

A generalization of this approach to ls norms for any s ∈ [0, 2) is possible as well; this is done by picking the vector r from a so-called s-stable distribution. Details can be found in [17].
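In code, one function of this family looks as follows (our sketch; the value of w is illustrative). For general ls with s in (0, 2), the Gaussian draws would be replaced by samples from an s-stable distribution.

import math
import random

def sample_stable_hash(d, w):
    # h_{r,b}(x) = floor((r.x + b)/w), with r a vector of i.i.d. Gaussians
    # (2-stable, hence suited to the Euclidean distance) and b in [0, w).
    r = [random.gauss(0.0, 1.0) for _ in range(d)]
    b = random.uniform(0.0, w)
    return lambda x: math.floor((sum(ri * xi for ri, xi in zip(r, x)) + b) / w)

h = sample_stable_hash(d=3, w=4.0)
print(h([1.0, 0.0, 2.0]), h([1.1, 0.1, 2.0]))   # close points usually collide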
Jaccard. To measure the similarity between two sets A, B ⊂ U (containing, e.g., words from two documents), the authors of [9, 8] utilize the Jaccard coefficient. The Jaccard coefficient is defined as s(A, B) = |A ∩ B| / |A ∪ B|. Unlike the Hamming distance, the Jaccard coefficient is a similarity measure: higher values of the Jaccard coefficient indicate higher similarity of the sets. One can obtain the corresponding distance measure by taking d(A, B) = 1 – s(A, B). For this measure, [9, 8] propose the following LSH family, called min-hash. Pick a random permutation π on the ground universe U. Then, define hπ(A) = min{π(a) | a ∈ A}. It is not hard to prove that the probability of collision Prπ[hπ(A) = hπ(B)] = s(A, B). See [7] for further theoretical developments related to such hash functions.
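A direct implementation of min-hash (our sketch) represents the random permutation π explicitly; in practice one would typically use random hash functions rather than store a full permutation of U.

import random

def sample_minhash(universe):
    # Pick a random permutation pi of the ground universe U and map a set A
    # to min{ pi(a) : a in A }.
    ordering = random.sample(list(universe), len(universe))
    rank = {x: i for i, x in enumerate(ordering)}
    return lambda A: min(rank[a] for a in A)

U = set(range(1000))
A, B = set(range(0, 600)), set(range(300, 900))
hits, trials = 0, 2000
for _ in range(trials):
    h = sample_minhash(U)
    if h(A) == h(B):
        hits += 1
print(hits / trials)   # close to the Jaccard coefficient 300/900 = 1/3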
Arccos. For vectors p, q ∈ ℝ^d, consider the distance measure that is the angle between the two vectors, Θ(p, q) = arccos( p·q / (‖p‖·‖q‖) ). For this distance measure, Charikar (inspired by [20]) defines the following LSH family [14]. Pick a random unit-length vector u ∈ ℝ^d and define hu(p) = sign(u·p). The hash function can also be viewed as partitioning the space into two half-spaces by a randomly chosen hyperplane. Here, the probability of collision is Pru[hu(p) = hu(q)] = 1 – Θ(p, q)/π.
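In code (our sketch), a random unit vector can be obtained by normalizing a Gaussian vector, and the collision probability 1 – Θ(p, q)/π can be checked empirically.

import math
import random

def sample_hyperplane_hash(d):
    # Pick a random unit-length vector u (a normalized Gaussian vector is
    # uniform on the sphere) and return h_u(p) = sign(u . p).
    u = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]
    return lambda p: 1 if sum(ui * pi for ui, pi in zip(u, p)) >= 0 else -1

def angle(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return math.acos(max(-1.0, min(1.0, dot / (norm_p * norm_q))))

p, q, d = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0], 3
hits, trials = 0, 20000
for _ in range(trials):
    h = sample_hyperplane_hash(d)
    if h(p) == h(q):
        hits += 1
print(hits / trials, 1 - angle(p, q) / math.pi)   # the two numbers should be close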


l2 distance on a sphere. Terasawa and Tanaka [37] propose an LSH algorithm specifically designed for points that are on a unit hypersphere in the Euclidean space. The idea is to consider a regular polytope, orthoplex for example, inscribed into the hypersphere and rotated at random. The hash function then maps a point on the hypersphere into the closest polytope vertex lying on the hypersphere. Thus, the buckets of the hash function are the Voronoi cells of the polytope vertices lying on the hypersphere. [37] obtain an exponent ρ that is an improvement over [17] and the Leech lattice approach of [3].
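The scheme can be sketched very roughly as follows (our simplification, not the construction of [37]: a random Gaussian matrix stands in for a random rotation, and we use the orthoplex, whose vertices are ±e_i, so the closest vertex is read off from the largest-magnitude rotated coordinate).

import random

def sample_orthoplex_hash(d):
    # Apply a random linear map (a stand-in for a random rotation) and hash
    # a point on the sphere to the nearest orthoplex vertex +/- e_i, i.e.
    # to the index and sign of the largest-magnitude rotated coordinate.
    A = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
    def h(p):
        y = [sum(A[i][j] * p[j] for j in range(d)) for i in range(d)]
        i = max(range(d), key=lambda t: abs(y[t]))
        return (i, 1 if y[i] >= 0 else -1)
    return h

h = sample_orthoplex_hash(d=5)
print(h([1.0, 0.0, 0.0, 0.0, 0.0]), h([0.9, 0.1, 0.0, 0.1, 0.0]))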

4 Near-Optimal LSH Functions for Euclidean Distance
In this section we present a new LSH family, yielding an algorithm with query time exponent ρ(c) = 1/c^2 + O(log log n / log^{1/3} n). For large enough n, the value of ρ(c) tends to 1/c^2. This significantly improves upon the earlier running time of [17]. In particular, for c = 2, our exponent tends to 0.25, while the exponent in [17] was around 0.45. Moreover, a recent paper [31] shows that hashing-based algorithms (as described in Section 2.3) cannot achieve ρ < 0.462/c^2. Thus, the running time exponent of our algorithm is essentially optimal, up to a constant factor.

We obtain our result by carefully designing a family of locality-sensitive hash functions in l2. The starting point of our construction is the line partitioning method of [17]. There, a point p was mapped into ℝ^1 using a random projection. Then, the line ℝ^1 was partitioned into intervals of length w, where w is a parameter. The hash function for p returned the index of the interval containing the projection of p.

An analysis in [17] showed that the query time exponent has an interesting dependence on the parameter w. If w tends to infinity, the exponent tends to 1/c, which yields no improvement over [25, 19]. However, for small values of w, the exponent lies slightly below 1/c. In fact, a unique minimum exists for each c.

In this article, we utilize a "multi-dimensional version" of the aforementioned approach. Specifically, we first perform a random projection into ℝ^t, where t is super-constant, but relatively small (i.e., t = o(log n)). Then we partition the space ℝ^t into cells. The hash function returns the index of the cell which contains the projected point p.

The partitioning of the space ℝ^t is somewhat more involved than its one-dimensional counterpart. First, observe that the natural idea of partitioning using a grid does not work. This is because this process roughly corresponds to hashing using a concatenation of several one-dimensional functions (as in [17]). Since the LSH algorithms perform such concatenation anyway, grid partitioning does not result in any improvement. Instead, we use the method of "ball partitioning", introduced in [15] in the context of embeddings into tree metrics. The partitioning is obtained as follows. We create a sequence of balls B1, B2,…, each of radius w, with centers chosen independently at random. Each ball Bi then defines a cell, containing the points Bi \ ∪_{j<i} Bj.

In order to apply this method in our context, we need to take care of a few issues. First, locating a cell containing a given point could require enumeration of all balls, which would take an unbounded amount of time. Instead, we show that one can simulate this procedure by replacing each ball by a grid of balls. It is not difficult then to observe that a finite (albeit exponential in t) number U of such grids suffices to cover all points in ℝ^t. An example of such a partitioning (for t = 2 and U = 5) is given in Figure 4.

Fig. 4. An illustration of the ball partitioning of the 2-dimensional space.
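The basic ball-partitioning step can be illustrated as follows (our much-simplified sketch: a Gaussian matrix is used for the projection and a fixed, pregenerated list of random ball centers is scanned, rather than the grids of balls used in the actual construction; all parameter values are illustrative).

import random

def make_ball_partition_hash(d, t, w, num_balls, box=10.0):
    # Project R^d into R^t with a random Gaussian matrix, then hash a point
    # to the index of the first ball (radius w, random centers) containing
    # its projection; cell i is B_i minus the earlier balls B_j, j < i.
    A = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(t)]
    centers = [[random.uniform(-box, box) for _ in range(t)] for _ in range(num_balls)]
    def h(p):
        y = [sum(A[i][j] * p[j] for j in range(d)) for i in range(t)]
        for idx, c in enumerate(centers):
            if sum((yi - ci) ** 2 for yi, ci in zip(y, c)) <= w * w:
                return idx
        return None   # not covered; the real construction guarantees coverage
    return h

h = make_ball_partition_hash(d=20, t=2, w=3.0, num_balls=200)
p = [random.gauss(0.0, 0.1) for _ in range(20)]
print(h(p))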
The second and the main issue is the choice of w. Again, it turns out that for large w, the method yields only the exponent of 1/c. Specifically, it was shown in [15] that for any two points p, q ∈ ℝ^t, the probability that the partitioning separates p and q is at most O(√t · ‖p – q‖/w). This formula can be shown to be tight for the range of w where it makes sense as a lower bound, that is, for w = Ω(√t · ‖p – q‖). However, as long as the separation probability depends linearly on the distance between p and q, the exponent ρ is still equal to 1/c. Fortunately, a more careful analysis⁴ shows that, as in the one-dimensional case, the minimum is achieved for a finite w. For that value of w, the exponent tends to 1/c^2 as t tends to infinity.

⁴ Refer to [3] for more details.

5 Related Work
In this section, we give a brief overview of prior work in the spirit of the algorithms considered in this article. We give only high-level simplified descriptions of the algorithms to avoid area-specific terminology. Some of the papers considered a closely related problem of finding all close pairs of points in a dataset. For simplicity, we translate them into the near neighbor framework since they can be solved by performing essentially n separate near neighbor queries.

Hamming distance. Several papers investigated multi-index hashing-based algorithms for retrieving similar pairs of vectors with respect to the Hamming distance. Typically, the hash functions were projecting the vectors on some subset of the coordinates {1…d}, as in the example from an earlier section. In some papers [33, 21], the authors considered the probabilistic model where the data points are chosen uniformly at random, and the query point is a random point close to one of the points in the dataset. A different approach [26] is to assume that the dataset is arbitrary, but almost all points are far from the query point. Finally, the paper [12] proposed an algorithm which did not make any assumption on the input. The analysis of the algorithm was akin to the analysis sketched at the end of Section 2.4: the parameters k and L were chosen to achieve the desired level of sensitivity and accuracy.

Set intersection measure. To measure the similarity between two sets A and B, the authors of [9, 8] considered the Jaccard coefficient s(A, B), proposing a family of hash functions h(A) such that Pr[h(A) = h(B)] = s(A, B) (presented in detail in Section 3). Their main motivation was to construct short similarity-preserving "sketches" of sets, obtained by mapping each set A to a sequence ⟨h1(A), ..., hk(A)⟩. In Section 5.3 of their paper, they briefly mention an algorithm similar to Strategy 2 described at the end of Section 2.4. One of the differences is that, in their approach, the functions hi are sampled without replacement, which made it more difficult to handle small sets.

Acknowledgement
This work was supported in part by NSF CAREER grant CCR-0133849 and the David and Lucille Packard Fellowship.

References
1. Ailon, N. and Chazelle, B. 2006. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the Symposium on Theory of Computing.
2. Andoni, A. and Indyk, P. 2004. E2LSH: Exact Euclidean locality-sensitive hashing. http://web.mit.edu/andoni/www/LSH/.
3. Andoni, A. and Indyk, P. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the Symposium on Foundations of Computer Science.
4. Andoni, A. and Indyk, P. 2006. Efficient algorithms for substring near neighbor problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 1203–1212.
5. Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. 1994. An optimal algorithm for approximate nearest neighbor searching. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 573–582.
6. Bentley, J. L. 1975. Multidimensional binary search trees used for associative searching. Comm. ACM 18, 509–517.
7. Broder, A., Charikar, M., Frieze, A., and Mitzenmacher, M. 1998. Min-wise independent permutations. J. Comput. Sys. Sci.
8. Broder, A., Glassman, S., Manasse, M., and Zweig, G. 1997. Syntactic clustering of the web. In Proceedings of the 6th International World Wide Web Conference. 391–404.
9. Broder, A. 1997. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences. 21–29.
10. Buhler, J. 2001. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinform. 17, 419–428.
11. Buhler, J. and Tompa, M. 2001. Finding motifs using random projections. In Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB).
12. Califano, A. and Rigoutsos, I. 1993. FLASH: A fast look-up algorithm for string homology. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
13. Chakrabarti, A. and Regev, O. 2004. An optimal randomised cell probe lower bound for approximate nearest neighbor searching. In Proceedings of the Symposium on Foundations of Computer Science.
14. Charikar, M. 2002. Similarity estimation techniques from rounding. In Proceedings of the Symposium on Theory of Computing.
15. Charikar, M., Chekuri, C., Goel, A., Guha, S., and Plotkin, S. 1998. Approximating a finite metric by a small number of tree metrics. In Proceedings of the Symposium on Foundations of Computer Science.
16. Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. 2001. Introduction to Algorithms. 2nd Ed. MIT Press.
17. Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the ACM Symposium on Computational Geometry.
18. Dutta, D., Guha, R., Jurs, C., and Chen, T. 2006. Scalable partitioning and exploration of chemical spaces using geometric hashing. J. Chem. Inf. Model. 46.
19. Gionis, A., Indyk, P., and Motwani, R. 1999. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Databases.
20. Goemans, M. and Williamson, D. 1995. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42, 1115–1145.
21. Greene, D., Parnas, M., and Yao, F. 1994. Multi-index hashing for information retrieval. In Proceedings of the Symposium on Foundations of Computer Science. 722–731.
22. Har-Peled, S. 2001. A replacement for Voronoi diagrams of near linear size. In Proceedings of the Symposium on Foundations of Computer Science.
23. Haveliwala, T., Gionis, A., and Indyk, P. 2000. Scalable techniques for clustering the web. WebDB Workshop.
24. Indyk, P. 2003. Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Computational Geometry. CRC Press.
25. Indyk, P. and Motwani, R. 1998. Approximate nearest neighbor: Towards removing the curse of dimensionality. In Proceedings of the Symposium on Theory of Computing.
26. Karp, R. M., Waarts, O., and Zweig, G. 1995. The bit vector intersection problem. In Proceedings of the Symposium on Foundations of Computer Science. 621–630.
27. Kleinberg, J. 1997. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the Symposium on Theory of Computing.
28. Krauthgamer, R. and Lee, J. R. 2004. Navigating nets: Simple algorithms for proximity search. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
29. Kushilevitz, E., Ostrovsky, R., and Rabani, Y. 1998. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proceedings of the Symposium on Theory of Computing. 614–623.
30. Linial, N., London, E., and Rabinovich, Y. 1994. The geometry of graphs and some of its algorithmic applications. In Proceedings of the Symposium on Foundations of Computer Science. 577–591.
31. Motwani, R., Naor, A., and Panigrahy, R. 2006. Lower bounds on locality sensitive hashing. In Proceedings of the ACM Symposium on Computational Geometry.
32. Panigrahy, R. 2006. Entropy-based nearest neighbor algorithm in high dimensions. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
33. Paturi, R., Rajasekaran, S., and Reif, J. The light bulb problem. Inform. Comput. 117, 2, 187–192.
34. Ravichandran, D., Pantel, P., and Hovy, E. 2005. Randomized algorithms and NLP: Using locality sensitive hash functions for high speed noun clustering. In Proceedings of the Annual Meeting of the Association of Computational Linguistics.
35. Samet, H. 2006. Foundations of Multidimensional and Metric Data Structures. Elsevier.
36. Shakhnarovich, G., Darrell, T., and Indyk, P., Eds. Nearest Neighbor Methods in Learning and Vision. Neural Information Processing Series, MIT Press.
37. Terasawa, T. and Tanaka, Y. 2007. Spherical LSH for approximate nearest neighbor search on unit hypersphere. In Proceedings of the Workshop on Algorithms and Data Structures.

