Abstract
In this article, we give an overview of efficient algorithms for the approximate and exact nearest
neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given
a new query object, one can quickly return the dataset object that is most similar to the query. The
problem is of significant interest in a wide variety of areas.
The goal of this article is twofold. In the first part, we survey a family of nearest neighbor algorithms that are based on the concept of locality-sensitive hashing. Many of these algorithms have already been successfully applied in a variety of practical scenarios. In the second part of this article, we describe a recently discovered hashing-based algorithm for the case where the objects are points in the d-dimensional Euclidean space. As it turns out, the performance of this algorithm is provably near-optimal in the class of locality-sensitive hashing algorithms.

Biographies
Alexandr Andoni ([email protected]) is a Ph.D. candidate in computer science at Massachusetts Institute of Technology, Cambridge, MA.
Piotr Indyk ([email protected]) is an associate professor in the Theory of Computation Group, Computer Science and Artificial Intelligence Lab, at Massachusetts Institute of Technology, Cambridge, MA.

1 Introduction
The nearest neighbor problem is defined as follows: given a collection of n points, build a data structure which, given any query point, reports the data point that is closest to the query. A particularly interesting and well-studied instance is where the data points live in a d-dimensional space under some (e.g., Euclidean) distance function. This problem is of major importance in several areas; some examples are data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis. Typically, the features of each object of interest (document, image, etc.) are represented as a point in ℝ^d and the distance metric is used to measure the similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to millions. For example, one can represent a 1000 × 1000 image as a vector in a 1,000,000-dimensional space, one dimension per pixel.

There are several efficient algorithms known for the case when the dimension d is low (e.g., up to 10 or 20). The first such data structure, called kd-trees, was introduced in 1975 by Jon Bentley [6], and remains one of the most popular data structures used for searching in multidimensional spaces. Many other multidimensional data structures are known; see [35] for an overview. However, despite decades of intensive effort, the current solutions suffer from either space or query time that is exponential in d. In fact, for large enough d, in theory or in practice, they often provide little improvement over a linear-time algorithm that compares a query to each point from the database. This phenomenon is often called "the curse of dimensionality."

In recent years, several researchers have proposed methods for overcoming the running time bottleneck by using approximation (e.g., [5, 27, 25, 29, 22, 28, 17, 13, 32, 1], see also [36, 24]). In this formulation, the algorithm is allowed to return a point whose distance from the query is at most c times the distance from the query to its nearest points; c > 1 is called the approximation factor. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. Moreover, an efficient approximation algorithm can be used to solve the exact nearest neighbor problem by enumerating all approximate nearest neighbors and choosing the closest point¹.

In this article, we focus on one of the most popular algorithms for performing approximate search in high dimensions based on the concept of locality-sensitive hashing (LSH) [25]. The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects that are close to each other than for those that are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point.

The LSH algorithm and its variants have been successfully applied to computational problems in a variety of areas, including web clustering [23], computational biology [10, 11], computer vision (see selected articles in [23]), computational drug design [18] and computational linguistics [34]. A code implementing a variant of this method is available from the authors [2]. For a more theoretically oriented overview of this and related algorithms, see [24].

The purpose of this article is twofold. In Section 2, we describe the basic ideas behind the LSH algorithm and its analysis; we also give an overview of the current library of LSH functions for various distance measures in Section 3. Then, in Section 4, we describe a recently developed LSH family for the Euclidean distance, which achieves a near-optimal separation between the collision probabilities of close and far points. An interesting feature of this family is that it effectively enables the reduction of the approximate nearest neighbor problem for worst-case data to the exact nearest neighbor problem over random (or pseudorandom) point configurations in low-dimensional spaces.

¹ See Section 2.4 for more information about exact algorithms.
To illustrate the concept, consider the following example. Assume that the data points are binary, that is, each coordinate is either 0 or 1. In addition, assume that the distance between points p and q is computed according to the Hamming distance. In this case, we can use a particularly simple family of functions H which contains all projections of the input point on one of the coordinates, that is, H contains all functions h_i from {0, 1}^d to {0, 1} such that h_i(p) = p_i. Choosing one hash function h uniformly at random from H means that h(p) returns a random coordinate of p (note, however, that different applications of h return the same coordinate of the argument).

To see that the family H is locality-sensitive with nontrivial parameters, observe that the probability Pr_H[h(q) = h(p)] is equal to the fraction of coordinates on which p and q agree. Therefore, P1 = 1 – R/d, while P2 = 1 – cR/d. As long as the approximation factor c is greater than 1, we have P1 > P2.
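To make the example concrete, here is a small illustrative Python sketch (the function names are ours and do not come from any cited implementation). It samples hash functions from H and estimates the collision probability empirically; the estimate should match the fraction of coordinates on which the two points agree, giving P1 = 1 – R/d and P2 = 1 – cR/d.

```python
import random

def sample_hamming_hash(d):
    """Draw h_i uniformly from H: h_i(p) returns the i-th coordinate of p."""
    i = random.randrange(d)
    return lambda p: p[i]

def estimate_collision_probability(p, q, trials=100_000):
    """Empirical estimate of Pr_H[h(p) = h(q)], which should equal the
    fraction of coordinates on which p and q agree."""
    d = len(p)
    hits = 0
    for _ in range(trials):
        h = sample_hamming_hash(d)
        if h(p) == h(q):
            hits += 1
    return hits / trials

d, R, c = 20, 3, 2
p = [0] * d
q1 = [1] * R + [0] * (d - R)             # Hamming distance R from p
q2 = [1] * (c * R) + [0] * (d - c * R)   # Hamming distance cR from p

print(estimate_collision_probability(p, q1))  # ~ P1 = 1 - R/d  = 0.85
print(estimate_collision_probability(p, q2))  # ~ P2 = 1 - cR/d = 0.70
```

Since c > 1, the near point collides strictly more often than the far one, which is exactly the locality-sensitivity property exploited in the analysis below.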
2.4 The Algorithm
An LSH family H can be used to design an efficient algorithm for approximate near neighbor search. However, one typically cannot use H as is, since the gap between the probabilities P1 and P2 could be quite small. Instead, an amplification process is needed in order to achieve the desired probabilities of collision. We describe this step next, and present the complete algorithm in Figure 2.

Given a family H of hash functions with parameters (R, cR, P1, P2) as in Definition 2.3, we amplify the gap between the high probability P1 and the low probability P2 by concatenating several functions. In particular, for parameters k and L (specified later), we choose L functions g_j(q) = (h_{1,j}(q),…,h_{k,j}(q)), where h_{t,j} (1 ≤ t ≤ k, 1 ≤ j ≤ L) are chosen independently and uniformly at random from H. These are the actual functions that we use to hash the data points.

The data structure is constructed by placing each point p from the input set into a bucket g_j(p), for j = 1,…,L. Since the total number of buckets may be large, we retain only the nonempty buckets by resorting to (standard) hashing³ of the values g_j(p). In this way, the data structure uses only O(nL) memory cells; note that it suffices that the buckets store the pointers to data points, not the points themselves.

To process a query q, we scan through the buckets g_1(q),…,g_L(q), and retrieve the points stored in them. After retrieving the points, we compute their distances to the query point, and report any point that is a valid answer to the query. Two concrete scanning strategies are possible.

1. Interrupt the search after finding the first L′ points (including duplicates) for some parameter L′.

2. Continue the search until all points from all buckets are retrieved; no additional parameter is required.

The two strategies lead to different behaviors of the algorithms. In particular, Strategy 1 solves the (c, R)-near neighbor problem, while Strategy 2 solves the R-near neighbor reporting problem.

Strategy 1. It is shown in [25, 19] that the first strategy, with L′ = 3L, yields a solution to the randomized c-approximate R-near neighbor problem, with parameters R and δ for some constant failure probability δ < 1. To obtain this guarantee, it suffices to set L to Θ(n^ρ), where ρ = ln(1/P1)/ln(1/P2) [19]. Note that this implies that the algorithm runs in time proportional to n^ρ, which is sublinear in n if P1 > P2. For example, if we use the hash functions for the binary vectors mentioned earlier, we obtain ρ = 1/c [25, 19]. The exponents ρ for other LSH families are given in Section 3.

Strategy 2. The second strategy enables us to solve the randomized R-near neighbor reporting problem. The value of the failure probability δ depends on the choice of the parameters k and L. Conversely, for each δ, one can provide parameters k and L so that the error probability is smaller than δ. The query time is also dependent on k and L. It could be as high as Θ(n) in the worst case, but, for many natural datasets, a proper choice of parameters results in a sublinear query time.

The details of the analysis are as follows. Let p be any R-neighbor of q, and consider any parameter k. For any function g_i, the probability that g_i(p) = g_i(q) is at least P1^k. Therefore, the probability that g_i(p) = g_i(q) for some i = 1…L is at least 1 – (1 – P1^k)^L. If we set L = log_{1–P1^k} δ so that (1 – P1^k)^L ≤ δ, then any R-neighbor of q is returned by the algorithm with probability at least 1 – δ.

How should the parameter k be chosen? Intuitively, larger values of k lead to a larger gap between the probabilities of collision for close points and far points; the probabilities are P1^k and P2^k, respectively (see Figure 3 for an illustration). The benefit of this amplification is that the hash functions are more selective. At the same time, if k is large then P1^k is small, which means that L must be sufficiently large to ensure that an R-near neighbor collides with the query point at least once.

³ See [16] for more details on hashing.
Fig. 2. Preprocessing stage of the basic LSH data structure (the query procedure is described in Section 2.4).
Preprocessing:
1. Choose L functions g_j, j = 1,…,L, by setting g_j = (h_{1,j}, h_{2,j},…,h_{k,j}), where h_{1,j},…,h_{k,j} are chosen at random from the LSH family H.
2. Construct L hash tables, where, for each j = 1,…,L, the j-th hash table contains the dataset points hashed using the function g_j.
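To illustrate how the pieces of Section 2.4 fit together, the following Python sketch (a simplified illustration of our own, not the E2LSH implementation from [2]) builds the L hash tables from the preprocessing steps above, using the coordinate-projection family of Section 2.3 as the underlying LSH family H, and answers queries with either scanning strategy.

```python
import random
from collections import defaultdict

class HammingLSHIndex:
    """Basic LSH index over binary vectors: L tables, each keyed by
    g_j(p) = (h_{1,j}(p), ..., h_{k,j}(p)) for k random coordinate projections."""

    def __init__(self, points, d, k, L):
        self.points = points
        # Each g_j is represented by the k coordinates it projects onto.
        self.gs = [[random.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx, p in enumerate(points):                    # preprocessing step
            for g, table in zip(self.gs, self.tables):
                table[tuple(p[i] for i in g)].append(idx)   # store pointers, not points

    def query(self, q, R, c=None, max_candidates=None):
        """Scan the buckets g_1(q), ..., g_L(q).
        Strategy 1: pass max_candidates (e.g. 3*L); stop after that many retrieved
        points and return one cR-near neighbor if one was seen.
        Strategy 2: leave max_candidates as None and report all R-near neighbors found."""
        found = set()
        retrieved = 0
        for g, table in zip(self.gs, self.tables):
            for idx in table[tuple(q[i] for i in g)]:
                dist = sum(a != b for a, b in zip(self.points[idx], q))
                if max_candidates is not None:              # Strategy 1
                    if dist <= c * R:
                        return idx
                    retrieved += 1
                    if retrieved >= max_candidates:
                        return None
                elif dist <= R:                             # Strategy 2
                    found.add(idx)
        return None if max_candidates is not None else sorted(found)

# Tiny usage example on random binary data.
random.seed(0)
d, n, k, L = 50, 1000, 10, 20
data = [[random.randint(0, 1) for _ in range(d)] for _ in range(n)]
index = HammingLSHIndex(data, d, k, L)
q = data[7][:]
q[0] ^= 1                                                   # a query at distance 1 from data[7]
print(index.query(q, R=2, c=2, max_candidates=3 * L))       # Strategy 1
print(index.query(q, R=2))                                   # Strategy 2
```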
Fig. 3. The graphs of the probability of collision of points p and q as a function of the distance between p and q for different values
of k and L. The points p and q are d = 100 dimensional binary vectors under the Hamming distance. The LSH family H is the one
described in Section 2.3.
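The quantities behind these curves are easy to compute directly. The small sketch below (our own illustration, using the Hamming family of Section 2.3 for which P1 = 1 – R/d) evaluates the overall collision probability 1 – (1 – P1^k)^L and the smallest L that brings the failure probability below a target δ.

```python
import math

def collision_curve(dist, d, k, L):
    """Probability that a point at Hamming distance `dist` from the query collides
    with it in at least one of the L tables (the quantity plotted in Fig. 3)."""
    p = 1 - dist / d                      # per-function collision probability
    return 1 - (1 - p ** k) ** L

def tables_needed(P1, k, delta):
    """Smallest L with (1 - P1^k)^L <= delta, i.e. L = ceil(log_{1 - P1^k} delta)."""
    return math.ceil(math.log(delta) / math.log(1 - P1 ** k))

d, R, k = 100, 10, 10
P1 = 1 - R / d
L = tables_needed(P1, k, delta=0.1)
print(L)                                  # number of tables needed for 90% success
print(collision_curve(R, d, k, L))        # an R-near neighbor collides w.p. >= 0.9
print(collision_curve(5 * R, d, k, L))    # a point at distance 5R collides far less often
```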
A practical approach to choosing k was introduced in the E2LSH package [2]. There, the data structure optimizes the parameter k as a function of the dataset and a set of sample queries. Specifically, given the dataset, a query point, and a fixed k, one can estimate precisely the expected number of collisions, and thus the time for distance computations, as well as the time to hash the query into all L hash tables. The sum of the estimates of these two terms is the estimate of the total query time for this particular query. E2LSH chooses the k that minimizes this sum over a small set of sample queries.
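A much-simplified sketch of such a tuning loop is given below; the cost model and the helper count_collisions are our own illustrative assumptions, not the actual E2LSH internals.

```python
def choose_k(candidate_ks, L, sample_queries, count_collisions, hash_cost=1.0):
    """Very simplified sketch of E2LSH-style parameter tuning.

    `count_collisions(k, q)` is an assumed helper returning the number of dataset
    points (with multiplicity) falling into q's buckets when the index uses
    parameters (k, L); E2LSH estimates this from the dataset rather than
    rebuilding the index for every candidate k."""
    best_k, best_cost = None, float("inf")
    for k in candidate_ks:
        total = 0.0
        for q in sample_queries:
            hashing_time = hash_cost * k * L        # evaluating g_1(q), ..., g_L(q)
            distance_time = count_collisions(k, q)  # one distance computation per retrieved point
            total += hashing_time + distance_time
        if total < best_cost:
            best_k, best_cost = k, total
    return best_k
```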
3 LSH Library
To date, several LSH families have been discovered. We briefly survey them in this section. For each family, we present the procedure of choosing a random function from the respective LSH family as well as its locality-sensitive properties.

Hamming distance. For binary vectors from {0, 1}^d, Indyk and Motwani [25] propose the LSH function h_i(p) = p_i, where i ∈ {1,…,d} is a randomly chosen index (the sample LSH family from Section 2.3). They prove that the exponent ρ is 1/c in this case.

It can be seen that this family applies directly to M-ary vectors (i.e., with coordinates in {1,…,M}) under the Hamming distance. Moreover, a simple reduction enables the extension of this family of functions to M-ary vectors under the l1 distance [30]. Consider any point p from {1,…,M}^d. The reduction proceeds by computing a binary string Unary(p) obtained by replacing each coordinate p_i by a sequence of p_i ones followed by M – p_i zeros. It is easy to see that for any two M-ary vectors p and q, the Hamming distance between Unary(p) and Unary(q) equals the l1 distance between p and q. Unfortunately, this reduction is efficient only if M is relatively small.
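For illustration, the unary reduction can be written in a few lines of Python (a sketch; the names are ours).

```python
def unary(p, M):
    """Encode an M-ary vector p (coordinates in {1, ..., M}) as a binary vector of
    length M * len(p): coordinate p_i becomes p_i ones followed by M - p_i zeros."""
    bits = []
    for coord in p:
        bits.extend([1] * coord + [0] * (M - coord))
    return bits

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

p, q, M = [3, 1, 4], [1, 1, 2], 4
l1 = sum(abs(a - b) for a, b in zip(p, q))
print(hamming(unary(p, M), unary(q, M)) == l1)   # True: the reduction preserves l1 distance
```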
l1 distance. A more direct LSH family for ℝ^d under the l1 distance is described in [4]. Fix a real w ≫ R, and impose a randomly shifted grid with cells of width w; each cell defines a bucket. More specifically, pick random reals s1, s2,…,sd ∈ [0, w) and define h_{s1,…,sd}(x) = (⌊(x1 – s1)/w⌋,…,⌊(xd – sd)/w⌋). The resulting exponent is equal to ρ = 1/c + O(R/w).
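An illustrative sketch of this grid-based family (numpy is assumed; the function name is ours):

```python
import numpy as np

def sample_grid_hash(d, w, rng=None):
    """Randomly shifted grid for the l1 distance: shift axis i by s_i ~ U[0, w) and
    map x to the cell (floor((x_1 - s_1)/w), ..., floor((x_d - s_d)/w))."""
    rng = rng or np.random.default_rng()
    s = rng.uniform(0, w, size=d)
    return lambda x: tuple(np.floor((np.asarray(x) - s) / w).astype(int))

h = sample_grid_hash(d=3, w=4.0)
print(h([0.5, 2.0, 10.0]))
print(h([0.6, 2.1, 10.2]))   # nearby points usually land in the same cell
```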
ls distance. For the Euclidean space, [17] propose the following LSH family. Pick a random projection of ℝ^d onto a 1-dimensional line and chop the line into segments of length w, shifted by a random value b ∈ [0, w). Formally, h_{r,b}(x) = ⌊(r·x + b)/w⌋, where the projection vector r ∈ ℝ^d is constructed by picking each coordinate of r from the Gaussian distribution. The exponent ρ drops strictly below 1/c for some (carefully chosen) finite value of w. This is the family used in the [2] package. A generalization of this approach to ls norms for any s ∈ [0, 2) is possible as well; this is done by picking the vector r from a so-called s-stable distribution. Details can be found in [17].
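An illustrative sketch of this projection-based family for the Euclidean case, where each coordinate of r is drawn from a standard Gaussian (a 2-stable distribution); again, the names are ours:

```python
import numpy as np

def sample_euclidean_hash(d, w, rng=None):
    """LSH for l2: project onto a random Gaussian direction r, add a shift b ~ U[0, w),
    and cut the line into segments of length w:  h(x) = floor((r·x + b) / w)."""
    rng = rng or np.random.default_rng()
    r = rng.standard_normal(d)
    b = rng.uniform(0, w)
    return lambda x: int(np.floor((np.dot(r, x) + b) / w))

rng = np.random.default_rng(0)
h = sample_euclidean_hash(d=128, w=4.0, rng=rng)
x = rng.random(128)
print(h(x), h(x + 0.01))   # points close in l2 tend to fall into the same segment
```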
Jaccard. To measure the similarity between two sets A, B ⊂ U (containing, e.g., words from two documents), the authors of [9, 8] utilize the Jaccard coefficient. The Jaccard coefficient is defined as s(A, B) = |A ∩ B| / |A ∪ B|. Unlike the Hamming distance, the Jaccard coefficient is a similarity measure: higher values of the Jaccard coefficient indicate higher similarity of the sets. One can obtain the corresponding distance measure by taking d(A, B) = 1 – s(A, B). For this measure, [9, 8] propose the following LSH family, called min-hash. Pick a random permutation π on the ground universe U. Then, define h_π(A) = min{π(a) | a ∈ A}. It is not hard to prove that the probability of collision Pr_π[h_π(A) = h_π(B)] = s(A, B). See [7] for further theoretical developments related to such hash functions.
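A minimal illustrative implementation of min-hash, representing the random permutation π explicitly over a small universe U = {0,…,|U| – 1}:

```python
import random

def sample_minhash(universe_size):
    """Min-hash: draw a random permutation pi of the universe {0, ..., universe_size - 1}
    and map a set A to min{pi(a) : a in A}."""
    pi = list(range(universe_size))
    random.shuffle(pi)
    return lambda A: min(pi[a] for a in A)

A, B = {0, 2, 5, 7}, {2, 5, 9}
trials = 100_000
hits = sum(1 for _ in range(trials) if (h := sample_minhash(10))(A) == h(B))
print(hits / trials)   # converges to the Jaccard coefficient |A ∩ B| / |A ∪ B| = 2/5
```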
Arccos. For vectors p, q ∈ ℝ^d, consider the distance measure that is the angle between the two vectors, Θ(p, q) = arccos(p·q / (‖p‖·‖q‖)). For this distance measure, Charikar et al. (inspired by [20]) define the following LSH family [14]. Pick a random unit-length vector u ∈ ℝ^d and define h_u(p) = sign(u·p). The hash function can also be viewed as partitioning the space into two half-spaces by a randomly chosen hyperplane. Here, the probability of collision is Pr_u[h_u(p) = h_u(q)] = 1 – Θ(p, q)/π.
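An illustrative sketch of this hyperplane-based family, checking the stated collision probability empirically (our own code, with numpy assumed):

```python
import numpy as np

def sample_hyperplane_hash(d, rng=None):
    """Arccos family: pick a random unit vector u and hash p to sign(u·p), i.e. the
    side of the random hyperplane through the origin on which p falls."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return lambda p: 1 if np.dot(u, p) >= 0 else -1

p = np.array([1.0, 0.0, 0.0])
q = np.array([1.0, 1.0, 0.0])            # angle(p, q) = pi/4
rng = np.random.default_rng(0)
trials = 20_000
hits = sum(1 for _ in range(trials) if (h := sample_hyperplane_hash(3, rng))(p) == h(q))
print(hits / trials)                      # ~ 1 - (pi/4)/pi = 0.75
```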