Locality-Sensitive Hashing Scheme Based On P-Stable Distributions
ABSTRACT
We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the l_p norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the l_2 norm. It also yields the first known provably efficient approximate NN algorithm for the case p < 1. We also show that the algorithm finds the exact near neighbor in O(log n) time for data satisfying a certain "bounded growth" condition. Unlike earlier schemes, our LSH scheme works directly on points in Euclidean space without embeddings. Consequently, the resulting query time bound is free of large factors, and the scheme is simple and easy to implement. Our experiments (on synthetic data sets) show that our data structure is up to 40 times faster than the kd-tree.
General Terms
Algorithms, Experimentation, Design, Performance, Theory
Keywords
Sublinear Algorithm, Approximate Nearest Neighbor, Locality-Sensitive Hashing, p-Stable Distributions
1. INTRODUCTION
This material is based upon work supported by the NSF CAREER grant CCR-0133849.
A similarity search problem involves a collection of objects (documents, images, etc.) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space; given queries in the form of points in this space, we
are required to find the nearest (most similar) object to the query. A particularly interesting and well-studied instance is the d-dimensional Euclidean space. This problem is of major importance to a variety of applications; some examples are: data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis. Typically, the features of the objects of interest (documents, images, etc.) are represented as points in R^d, and a distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands. The low-dimensional case (say, for dimensionality equal to 2 or 3) is well-solved, so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality". Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough d, in theory or in practice, they often provide little improvement over a linear algorithm which compares a query to each point from the database. In particular, it was shown in [28] (both empirically and theoretically) that all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions.

In recent years, several researchers proposed to avoid the running time bottleneck by using approximation (e.g., [3, 22, 19, 24, 15], see also [12]). This is due to the fact that, in many cases, an approximate nearest neighbor is almost as good as the exact one; in particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. In fact, in situations when the quality of the approximate nearest neighbor is much worse than the quality of the actual nearest neighbor, the nearest neighbor problem is unstable, and it is not clear if solving it is at all meaningful [4, 17].

In [19, 14], the authors introduced an approximate high-dimensional similarity search scheme with provably sublinear dependence on the data size. Instead of using tree-like space partitioning, it relied on a new method called locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. In [19, 14] the authors provided such locality-sensitive hash functions for the case when the points live in the binary Hamming space {0, 1}^d. They showed experimentally that the data structure achieves large speedup over several tree-based data structures when
the data is stored on disk. In addition, since LSH is a hashing-based scheme, it can be naturally extended to the dynamic setting, i.e., when insertion and deletion operations also need to be supported. This avoids the complexity of dealing with tree structures when the data is dynamic.

The LSH algorithm has since been used in numerous applied settings, e.g., see [14, 10, 16, 27, 5, 7, 29, 6, 26, 13]. However, it suffers from a fundamental drawback: it is fast and simple only when the input points live in the Hamming space (indeed, almost all of the above applications involved binary data). As mentioned in [19, 14], it is possible to extend the algorithm to the l_2 norm by embedding l_2 space into l_1 space, and then l_1 space into the Hamming space. However, this increases the query time and/or error by a large factor and complicates the algorithm.

In this paper we present a novel version of the LSH algorithm. As with the previous schemes, it works for the c-Near Neighbor (NN) problem, where the goal is to report a point within distance cR from a query point q, if there is a point in the data set within distance R from q. Unlike the earlier algorithm, our algorithm works directly on points in Euclidean space without embeddings. As a consequence, it has the following advantages over the previous algorithm:
- For the l_2 norm, its query time grows as n^rho, where rho < 1/c over the range of c we consider (the inequality is strict; see Figure 1(b)). Thus, for a large range of values of c, the query time exponent is better than the one in [19, 14].
- It is simple and quite easy to implement.
- It works for any l_p norm, as long as p is in (0, 2]. Specifically, we show that for any p in (0, 2] there exists an algorithm for c-NN under l_p^d which uses O(dn + n^{1+rho}) space, with query time dominated by O(n^rho) distance computations, where rho is arbitrarily close to max(1/c, 1/c^p). To our knowledge, this is the only known provable algorithm for the high-dimensional nearest neighbor problem for the case p < 1. Similarity search under such fractional norms has recently attracted interest [1, 11].

Our algorithm also inherits two very convenient properties of LSH schemes. The first one is that it works well on data that is extremely high-dimensional but sparse. Specifically, the running time bound remains unchanged if d denotes the maximum number of non-zero elements in the vectors. To our knowledge, this property is not shared by other known spatial data structures. Thanks to this property, we were able to use our new LSH scheme (specifically, the l_1 norm version) for fast color-based image similarity search [20]. In that context, each image was represented by a point in a high-dimensional space, but only about 100 dimensions were non-zero per point. The use of our LSH scheme enabled order(s) of magnitude speed-up over a linear scan.

The second property is that our algorithm provably reports the exact near neighbor very quickly if the data satisfies a certain bounded growth property. Specifically, for a query point q and c >= 1, let N(q, c) be the number of c-approximate nearest neighbors of q in P. If N(q, c) grows sub-exponentially as a function of c, then the LSH algorithm reports p*, the nearest neighbor, with constant probability within time O(log n), assuming it is given a constant-factor approximation to the distance from q to its nearest neighbor. In particular, we show that if N(q, c) = O(c^b), then the running time is O(log n + 2^{O(b)}). Efficient nearest neighbor algorithms for data sets with polynomial growth properties in general metrics have recently been a focus of several papers [9, 21, 23]. LSH solves an easier problem (near neighbor under the l_2 norm), while working under weaker assumptions about the growth function. It is also somewhat faster, due to the fact that the log n factor in the query time of the earlier schemes is multiplied by a function of the growth parameter, while in our case this factor is additive.

We complement our theoretical analysis with an experimental evaluation of the algorithm on data with a wide range of parameters. In particular, we compare our algorithm to an approximate version of the kd-tree algorithm [2]. We performed the experiments on synthetic data sets containing a planted near neighbor (see Section 5 for more details); a similar model was used earlier in [30]. Our experiments indicate that the new LSH scheme achieves query times up to 40 times better than the query time of the kd-tree algorithm.

2. LOCALITY-SENSITIVE HASHING
An important technique from [19] for solving the c-NN problem is locality-sensitive hashing, or LSH. For a domain S of the point set with distance measure D, an LSH family is defined as follows.
DEFINITION 1. A family H = {h : S -> U} is called (r1, r2, p1, p2)-sensitive for D if for any v, q in S:
if v is in B(q, r1) then Pr_H[h(q) = h(v)] >= p1,
if v is not in B(q, r2) then Pr_H[h(q) = h(v)] <= p2.
In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy the inequalities p1 > p2 and r1 < r2. We briefly describe, following [19], how an LSH family can be used to solve the c-NN problem. We choose r1 = R and r2 = cR. Given a family H of hash functions with parameters (r1, r2, p1, p2) as in Definition 1, we amplify the gap between the "high" probability p1 and the "low" probability p2 by concatenating several functions. In particular, for k specified later, define a function family G = {g : S -> U^k} such that g(v) = (h_1(v), ..., h_k(v)), where h_i is in H. For an integer L we choose L functions g_1, ..., g_L from G, independently and uniformly at random. During preprocessing, we store each point p of the input point set P in the bucket g_j(p), for j = 1, ..., L. Since the total number of buckets may be large, we retain only the non-empty buckets by resorting to hashing. To process a query q, we search all buckets g_1(q), ..., g_L(q); as it is possible (though unlikely) that the total number of points stored in those buckets is large, we interrupt the search after finding the first O(L) points (including duplicates). Let v_1, ..., v_t be the points encountered therein. For each v_j, if v_j is in B(q, r2) then we return YES and v_j, else we return NO. The parameters k and L are chosen so as to ensure that with a constant probability the following two properties hold:
1. If there exists a point p* in B(q, r1), then g_j(p*) = g_j(q) for some j = 1, ..., L.
2. The total number of collisions of q with points from P \ B(q, r2) is O(L), i.e., sum_{j=1}^{L} |(P \ B(q, r2)) intersect g_j^{-1}(g_j(q))| = O(L).
Observe that if properties (1) and (2) hold, then the algorithm is correct. It follows (see [19], Theorem 5 for details) that if we set k = log_{1/p2} n and L = n^rho, where rho = ln(1/p1)/ln(1/p2), then (1) and (2) hold with a constant probability. Thus, we get the following theorem (a slightly different version of Theorem 5 in [19]), which relates the efficiency of solving the c-NN problem to the sensitivity parameters of the LSH family.
THEOREM 1. Suppose there is a (r1, r2, p1, p2)-sensitive family H for a distance measure D. Then there exists an algorithm for c-NN under measure D which uses O(dn + n^{1+rho}) space, with query time dominated by O(n^rho) distance computations and O(n^rho log_{1/p2} n) evaluations of hash functions from H, where rho = ln(1/p1)/ln(1/p2).
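For concreteness, the bucketing scheme just described can be sketched in code. The class, the parameter values, and the toy stand-in hash family below are our own illustrative choices, not fixed by the paper; in practice, the p-stable family of Section 3 would be plugged in as hash_family.

```python
import random
from collections import defaultdict

class LSHIndex:
    """Sketch of the scheme above: L functions g_j, each concatenating k hashes from a family."""

    def __init__(self, points, hash_family, k, L):
        self.points = points
        self.gs = [[hash_family() for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx, p in enumerate(points):
            for g, table in zip(self.gs, self.tables):
                table[tuple(h(p) for h in g)].append(idx)  # store p in bucket g_j(p)

    def query(self, q, R, c, dist, max_checked=None):
        """Decision version of c-NN: return a point within distance c*R of q, or None."""
        checked = 0
        for g, table in zip(self.gs, self.tables):
            for idx in table.get(tuple(h(q) for h in g), []):
                if max_checked is not None and checked >= max_checked:
                    return None  # interrupt after examining O(L) candidates
                checked += 1
                if dist(self.points[idx], q) <= c * R:
                    return self.points[idx]
        return None

# toy usage with a stand-in hash family (random shifted grid on one coordinate)
def toy_family():
    i, b = random.randrange(3), random.random()
    return lambda v, i=i, b=b: int((v[i] + b) // 1.0)

pts = [tuple(random.uniform(0, 10) for _ in range(3)) for _ in range(200)]
index = LSHIndex(pts, toy_family, k=4, L=8)
l2 = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
print(index.query(pts[0], R=0.5, c=2.0, dist=l2))
```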
3. LSH FAMILY BASED ON p-STABLE DISTRIBUTIONS
In this section we present an LSH family based on p-stable distributions that works for all p in (0, 2]. Since we consider points in l_p^d, without loss of generality we can take R = 1, which we assume from now on.
DEFINITION 2. A distribution D over R is called p-stable if there exists p >= 0 such that for any n real numbers v_1, ..., v_n and i.i.d. variables X_1, ..., X_n with distribution D, the random variable sum_i v_i X_i has the same distribution as the variable (sum_i |v_i|^p)^{1/p} X, where X is a random variable with distribution D.
It is known [31] that stable distributions exist for any p in (0, 2]. In particular: a Cauchy distribution, defined by the density function f(x) = (1/pi) * 1/(1 + x^2), is 1-stable; a Gaussian (normal) distribution, defined by the density function g(x) = (1/sqrt(2 pi)) e^{-x^2/2}, is 2-stable.
We note that, from a practical point of view, despite the lack of closed-form density and distribution functions, it is known [8] that one can generate p-stable random variables essentially from two independent variables distributed uniformly over [0, 1]. Stable distributions have found numerous applications in various fields (see the survey [25] for more details). In computer science, stable distributions were used for sketching of high-dimensional vectors by Indyk ([18]) and have since found use in various applications. The main property of p-stable distributions mentioned in the definition above translates directly into a sketching technique for high-dimensional vectors. The idea is to generate a random vector a of dimension d whose entries are chosen independently from a p-stable distribution. Given a vector v of dimension d, the dot product a.v is a random variable which is distributed as (sum_i |v_i|^p)^{1/p} X (i.e., ||v||_p X), where X is a random variable with p-stable distribution. A small collection of such dot products (a.v), corresponding to different a's, is termed the sketch of the vector v and can be used to estimate ||v||_p (see [18] for details). It is easy to see that such a sketch is linearly composable, i.e., a.(v1 - v2) = a.v1 - a.v2.
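The generation from two uniform random variables mentioned above can be illustrated with the Chambers-Mallows-Stuck construction [8]; the function names below and the final sanity check are our own illustrative additions, not the paper's code.

```python
import math
import random

def p_stable_sample(p: float) -> float:
    """One draw from a symmetric p-stable distribution, 0 < p <= 2,
    via the Chambers-Mallows-Stuck construction (uniform angle, exponential weight)."""
    theta = random.uniform(-math.pi / 2, math.pi / 2)
    w = random.expovariate(1.0)
    if abs(p - 1.0) < 1e-12:
        return math.tan(theta)                      # p = 1: Cauchy
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))

def sketch(v, a_vectors):
    """p-stable sketch: each dot product a.v is distributed as ||v||_p times a p-stable variable."""
    return [sum(ai * vi for ai, vi in zip(a, v)) for a in a_vectors]

d, m, p = 16, 8, 1.0
a_vectors = [[p_stable_sample(p) for _ in range(d)] for _ in range(m)]
v = [random.random() for _ in range(d)]
print(sketch(v, a_vectors))
```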
We use p-stable distributions in the following manner: instead of using the dot products (a.v) to estimate the l_p norm, we use them to assign a hash value to each vector v. Each hash function h_{a,b} : R^d -> Z maps a d-dimensional vector onto the set of integers, and is indexed by a choice of a random vector a, whose entries are chosen independently from a p-stable distribution, and a real number b chosen uniformly from the range [0, r]. For fixed a, b the hash value is h_{a,b}(v) = floor((a.v + b)/r). By p-stability, for two vectors v1, v2 the difference of their projections (a.v1 - a.v2) is distributed as c X, where c = ||v1 - v2||_p and X is a p-stable random variable; the collision probability is therefore p(c) = Pr_{a,b}[h_{a,b}(v1) = h_{a,b}(v2)] = integral_0^r (1/c) f_p(t/c)(1 - t/r) dt, where f_p denotes the density function of the absolute value of the p-stable distribution.
For a fixed parameter r, the probability of collision p(c) decreases monotonically with c = ||v1 - v2||_p. Thus, as per Definition 1, the family of hash functions above is (r1, r2, p1, p2)-sensitive for p1 = p(1) and p2 = p(c), with r1 = 1 and r2 = c. In what follows we will bound the ratio rho = ln(1/p1)/ln(1/p2), which, as discussed earlier, is critical to the performance when this hash family is used to solve the c-NN problem. Note that we have not yet specified the parameter r, for it depends on the values of c and p. For every c we would like to choose a finite r that makes rho as small as possible.
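A direct rendering of h_{a,b} in code (the helper name and parameter values are illustrative; only p = 1 and p = 2 are shown, since those admit simple generators):

```python
import math
import random

def make_hash_fn(d: int, r: float, p: float = 2.0):
    """Build one h_{a,b}(v) = floor((a.v + b) / r), with a drawn entrywise from a
    p-stable distribution and b uniform in [0, r]."""
    if p == 2.0:
        a = [random.gauss(0.0, 1.0) for _ in range(d)]                               # Gaussian is 2-stable
    elif p == 1.0:
        a = [math.tan(random.uniform(-math.pi / 2, math.pi / 2)) for _ in range(d)]  # Cauchy is 1-stable
    else:
        raise NotImplementedError("use a general p-stable generator for other p")
    b = random.uniform(0.0, r)
    return lambda v: math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / r)

# nearby points collide more often than distant ones
h = make_hash_fn(d=3, r=4.0)
print(h((0.0, 0.0, 0.0)), h((0.1, -0.1, 0.0)), h((25.0, -13.0, 7.0)))
```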
4. COMPUTATIONAL ANALYSIS OF THE RATIO rho
In this section we focus on the cases p = 1 and p = 2. In these cases the ratio rho = ln(1/p1)/ln(1/p2) can be explicitly evaluated. We compute and plot this ratio and compare it with 1/c. Note that 1/c is the best (smallest) known exponent for n in the space requirement and query time, achieved in [19] for these cases.
4.1 Computing the ratio for special cases
For the special cases p = 1, 2 we can compute the probabilities p1, p2 explicitly, using the density functions mentioned before. A simple calculation gives
p(c) = 2 arctan(r/c)/pi - (1/(pi (r/c))) ln(1 + (r/c)^2) for p = 1 (Cauchy), and
p(c) = 1 - 2 Phi(-r/c) - (2/(sqrt(2 pi) (r/c)))(1 - e^{-r^2/(2 c^2)}) for p = 2 (Gaussian),
where Phi(.) denotes the cumulative distribution function (cdf) of a random variable that is distributed as N(0, 1). The value of p1 can be obtained by substituting c = 1 in the formulas above.

For values of c over a range (in small increments) we compute the minimum value of rho, rho(c) = min_r ln(1/p1)/ln(1/p2), using Matlab. The plot of rho versus c is shown in Figure 1. The crucial observation for the case p = 2 is that the curve corresponding to the optimal ratio rho lies strictly below the curve 1/c. As mentioned earlier, this is a strict improvement over the previous best known exponent from [19]. While we have computed rho(c) here only for c in a bounded range, we believe that rho(c) is strictly less than 1/c for all values of c. For the case p = 1, we observe that the rho curve is very close to 1/c, although it lies above it. The optimal rho was computed using Matlab as mentioned before. The Matlab program has a limit on the number of iterations it performs to compute the minimum of a function, and we reached this limit during the computations. If we could compute the true minimum, we suspect that it would be very close to 1/c, possibly equal to 1/c, and that this minimum might be reached as r tends to infinity.

If one were to implement our LSH scheme, ideally one would want to know the optimal value of r for every c. For p = 1, 2, for a given value of c, we can compute the value of r that gives the optimal value of rho(c); this can be done using programs like Matlab. However, we observe that for a fixed c the value of rho as a function of r is more or less stable after a certain point (see Figure 2). Thus, rho is not very sensitive to r beyond a certain point, and as long as we choose r sufficiently away from 0, the rho value will be close to optimal. Note, however, that we should not choose an r value that is too large: as r increases, both p1 and p2 get closer to 1. This increases the query time, since k = log_{1/p2} n, the number of projections concatenated in each hash function g_j (refer to Section 2), increases as p2 approaches 1. We mention that for the l_2 norm, the optimal value of r appears to be a (finite) function of c.

We also plot rho as a function of c for a few fixed r values (see Figure 3). For p = 1, we observe that for moderate r values the rho curve beats the 1/c curve over a large range of c that is of practical interest. For p = 2, we observe that as r increases the rho curve drops lower and gets closer and closer to the 1/c curve.
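The kind of numerical optimization described above can be reproduced with a few lines of code. The sketch below evaluates the integral form of p(c) from Section 3 for the Gaussian (p = 2) case and performs a crude grid search over r; SciPy is assumed to be available, and the grid and the sample values of c are our own choices, not the paper's.

```python
import math
import numpy as np
from scipy.integrate import quad

def collision_prob(c: float, r: float) -> float:
    """p(c) = integral_0^r (1/c) f_2(t/c) (1 - t/r) dt, f_2 = density of |N(0,1)|."""
    f_abs_normal = lambda x: math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0)
    integrand = lambda t: (1.0 / c) * f_abs_normal(t / c) * (1.0 - t / r)
    val, _ = quad(integrand, 0.0, r)
    return val

def rho(c: float, r: float) -> float:
    """rho = ln(1/p1) / ln(1/p2) with p1 = p(1), p2 = p(c)."""
    p1, p2 = collision_prob(1.0, r), collision_prob(c, r)
    return math.log(1.0 / p1) / math.log(1.0 / p2)

if __name__ == "__main__":
    for c in (1.5, 2.0, 4.0, 10.0):
        best = min(rho(c, r) for r in np.arange(0.5, 20.0, 0.25))  # crude grid search over r
        print(f"c = {c:5.1f}   optimal rho ~ {best:.3f}   1/c = {1.0 / c:.3f}")
```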
5. EXPERIMENTAL EVALUATION
In this section we present an experimental evaluation of our novel LSH scheme. We focus on the Euclidean norm case, since this occurs most frequently in practice. Our data structure is implemented for main memory. In what follows, we briefly discuss some of the issues pertaining to the implementation of our technique, and then report some preliminary performance results based on an empirical comparison of our technique to the kd-tree data structure.
Parameters and Performance Tradeoffs: The three main parameters that affect the performance of our algorithm are: the number of projections per hash value (k), the number of hash tables (L) and the width of the projection (r). In general, one could also introduce another parameter (say M), such that the query procedure stops after retrieving M points. In our analysis, M was set to O(L). In our experiments, however, the query procedure retrieved all points colliding with the query (i.e., we used an unbounded M). This reduces the number of parameters and simplifies the choice of the optimal k and L.
For a given value of k, it is easy to find the optimal value of L which will guarantee that the fraction of false negatives is no more than a user-specified threshold. This process is exactly the same as in an earlier paper by Cohen et al. ([10]) that uses locality-sensitive hashing to find similar column pairs in market-basket data, with the similarity exceeding a certain user-specified threshold. In our experiments we tried a few values of k and below we report the one that gives the best tradeoff for our scenario. The parameter k represents a tradeoff between the time spent computing hash values and the time spent pruning false positives, i.e., computing distances between the query and candidates; a bigger k increases the number of hash computations. In general we could do a binary search over a large range to find the optimal k. This binary search can be avoided if we have a good model of the relative times of hash computations and distance computations for the application at hand. Decreasing the width of the projection (r) decreases the probability of collision for any two points; thus, it has the same effect as increasing k. As a result, we would like to set r as small as possible and in this way decrease the number of projections we need to make. However, decreasing r below a certain threshold increases the quantity rho, thereby requiring us to increase L. Thus we cannot decrease r by too much. For the l_2 norm we found the optimal value of r using Matlab, which we then used in our experiments. Before we report our performance numbers, we describe the data set and query set that we used for testing.

Data Set: We used synthetically generated data sets and query points to test our algorithm. The dimensionality of the underlying space was varied between 50 and 500. We considered generating all the data and query points independently at random, so that for a data point (or query point) its coordinate along every dimension would be chosen independently and uniformly at random from a certain range. However, if we did that, then given a query point all the data points would be sharply concentrated at the same distance from the query point, since we are operating in high dimensions; approximate nearest neighbor search would not make sense on such a data set. Testing approximate nearest neighbor requires that for every query point q there are a few data points within distance R from q, while most of the points are at distance no less than cR. We call this a planted nearest neighbor model. In order to ensure this property we generate our points as follows (a similar approach was used in [30]). We first generate the query points at random, as above. We then generate the data points in such a way that for every query point we guarantee at least a single point within distance R, while all other points are at distance no less than cR. This way of generating data sets ensures that every query point has a few (in our case, just one) approximate nearest neighbors, while most points are far from the query. The resulting data set has several interesting properties. Firstly, it constitutes a worst-case input to LSH (since there is only one correct nearest neighbor, and all other points are "almost correct" nearest neighbors). Moreover, it captures the typical situation occurring in real-life similarity search applications, in which there are a few points that are relatively close to the query point, and most of the database points lie quite far from the query point. For our experiments the coordinate range was fixed.
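One simple way to realize the planted nearest neighbor model just described is sketched below; the constants, the rejection-sampling step, and the placement of the planted point at distance 0.9 R are illustrative choices of ours, not the paper's exact procedure.

```python
import random

def planted_nn_dataset(num_queries, far_per_query, d, R, c, coord_range=1000.0):
    """Uniform random queries; one planted neighbor within distance R of each query;
    all remaining points rejection-sampled to lie at distance >= c*R from every query."""
    l2 = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    rand_pt = lambda: [random.uniform(0.0, coord_range) for _ in range(d)]

    queries = [rand_pt() for _ in range(num_queries)]
    data = []
    for q in queries:
        direction = [random.gauss(0.0, 1.0) for _ in range(d)]
        norm = l2(direction, [0.0] * d) or 1.0
        data.append([qi + di * 0.9 * R / norm for qi, di in zip(q, direction)])  # planted neighbor

    while len(data) < num_queries * (far_per_query + 1):
        pt = rand_pt()
        if all(l2(pt, q) >= c * R for q in queries):  # keep only "far" points
            data.append(pt)
    return queries, data

queries, data = planted_nn_dataset(num_queries=10, far_per_query=100, d=64, R=1.0, c=2.0)
```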
The total number of data points was varied, up to 10^5. Both our algorithm and the kd-tree take as input the approximation factor c. In addition, however, our algorithm also requires as input the value of R, the (upper bound on the) distance to the nearest neighbor. This can be avoided by guessing the value of R and doing a binary search. We feel that for most real-life applications it is easy to guess a range for R that is not too large. As a result, the
[Figure 1: Optimal rho versus the approximation factor c: (a) optimal rho for l_1; (b) optimal rho for l_2; the 1/c curve is shown for comparison.]
additional multiplicative overhead of doing a binary search should not be large and will not cancel the gains that we report.

Experimental Results: We did three sets of experiments to evaluate the performance of our algorithm versus that of the kd-tree: we increased the number of data points, the dimensionality of the data set, and the approximation factor c. In each set of experiments we report the average query processing times for our algorithm and the kd-tree algorithm, and also the ratio of the two ((average query time for kd-tree)/(average query time for our algorithm)), i.e., the speedup achieved by our algorithm. We ran our experiments on a Sun workstation with a 650 MHz UltraSPARC-IIi processor with 512 KB L2 cache, having no special support for vector computations, and with 512 MB of main memory. For all our experiments we used fixed values of the parameters k and L. Moreover, we set the percentage of false negatives that we can tolerate to a small threshold, and indeed for all the experiments that we report below we did not exceed that threshold; in fact, the fraction of false negatives was smaller in most cases. For all the query time graphs that we present, the curve that lies above is that of the kd-tree and the one below is for our algorithm.

For the first experiment we fixed the dimension d, the approximation factor c, and r (the width of the projection), and varied the number of data points n. Figures 4(a) and 4(b) show the processing times and speedup respectively as n is varied. As we see from the figures, the speedup seems to increase linearly with n. For the second experiment we fixed n, c and r, and varied the dimensionality d of the data set from 50 to 500. Figures 5(a) and 5(b) show the processing times and speedup respectively as d is varied. As we see from the figures, the speedup seems to increase with the dimension. For the third experiment we fixed n and d, and varied the approximation factor c; the width r was set appropriately as a function of c. Figures 6(a) and 6(b) show the processing times and speedup respectively as c is varied.

Memory Requirement: The memory requirement for our algorithm equals the memory needed to store the data points themselves plus the memory required to store the hash tables.
If we insert each point in the L hash tables along with its hash value and a pointer to the data point itself, this requires roughly L(k + 1) words (ints) of memory per data point: for each of the L tables, the k-word hash value plus a pointer. We can reduce the memory requirement by not storing the hash value explicitly as a concatenation of k projections, but instead hashing these k values in turn to get a single word for the hash. This would reduce the memory requirement to two words per table, i.e., roughly 2L words per data point. If the data points belong to a high-dimensional space (e.g., with dimension in the hundreds or more), then the overhead of maintaining the hash tables (with the optimization above) is small compared to storing the points themselves. Thus, the memory overhead of our algorithm is small.
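The single-word hashing of the k projections mentioned above can be done with any standard second-level hash of the tuple; a minimal sketch, with all constants illustrative:

```python
import random

def make_fingerprint(k, table_size=1 << 20, prime=(1 << 61) - 1):
    """Second-level hash: collapse a k-tuple of projections into a bucket index plus
    a one-word checksum, so each table entry stores two words (checksum + pointer)."""
    coeffs = [random.randrange(1, prime) for _ in range(k)]
    def fingerprint(projections):
        acc = 0
        for coeff, h in zip(coeffs, projections):
            acc = (acc + coeff * (h % prime)) % prime
        return acc % table_size, acc
    return fingerprint

fp = make_fingerprint(k=4)
print(fp((3, -1, 7, 2)))
```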
6. CONCLUSIONS
In this paper we present a new LSH scheme for similarity search in high-dimensional spaces. The algorithm is easy to implement and generalizes to arbitrary l_p norms, for p in (0, 2]. We provide theoretical, computational and experimental evaluations of the algorithm. Although the experimental comparison of LSH and the kd-tree-based algorithm suggests that the former outperforms the latter, there are several caveats that one needs to keep in mind:
- We used the kd-tree structure "as is"; tweaking its parameters would likely improve its performance.
- LSH solves the decision version of the nearest neighbor problem, while the kd-tree solves the optimization version. Although the latter reduces to the former, the reduction overhead increases the running time.
- One could run the approximate kd-tree algorithm with an approximation parameter that is much larger than the intended approximation. Although the resulting algorithm would provide only a very weak guarantee on the quality of the returned neighbor, typically the actual error is much smaller than the guarantee.
7. REFERENCES
[1] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional spaces. Proceedings of the International Conference on Database Theory, pages 420-434, 2001.
[Figure 2: rho versus r; panels (a) and (b).]
[2] S. Arya and D. Mount. ANN: Library for approximate nearest neighbor searching. Available at http://www.cs.umd.edu/mount/ANN/.
[3] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 573-582, 1994.
[4] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? Proceedings of the International Conference on Database Theory, pages 217-235, 1999.
[5] J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17:419-428, 2001.
[6] J. Buhler. Provably sensitive indexing strategies for biosequence similarity search. Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB02), 2002.
[7] J. Buhler and M. Tompa. Finding motifs using random projections. Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB01), 2001.
[8] J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. J. Amer. Statist. Assoc., 71:340-344, 1976.
[9] K. Clarkson. Nearest neighbor queries in metric spaces. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 609-617, 1997.
[10] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang. Finding interesting associations without support pruning. Proceedings of the 16th International Conference on Data Engineering (ICDE), 2000.
[11] G. Cormode, P. Indyk, N. Koudas, and S. Muthukrishnan. Fast mining of massive tabular data via approximate distance computations. Proceedings of the 18th International Conference on Data Engineering (ICDE), 2002.
[12] T. Darrell, P. Indyk, G. Shakhnarovich, and P. Viola. Approximate nearest neighbors methods for learning and vision. NIPS Workshop, http://www.ai.mit.edu/projects/vip/nips03ann, 2003.
[13] B. Georgescu, I. Shimshoni, and P. Meer. Mean shift based clustering in high dimensions: A texture classification example. Proceedings of the 9th International Conference on Computer Vision, 2003.
[14] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), 1999.
[15] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. Proceedings of the Symposium on Foundations of Computer Science, 2001.
[16] T. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. WebDB Workshop, 2000.
[17] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? Proceedings of the International Conference on Very Large Databases (VLDB), pages 506-515, 2000.
[18] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. Proceedings of the Symposium on Foundations of Computer Science, 2000.
[19] P. Indyk and R. Motwani. Approximate nearest neighbor: towards removing the curse of dimensionality. Proceedings of the Symposium on Theory of Computing, 1998.
[20] P. Indyk and N. Thaper. Fast color image retrieval via embeddings. Workshop on Statistical and Computational Theories of Vision (at ICCV), 2003.
[21] D. Karger and M. Ruhl. Finding nearest neighbors in growth-restricted metrics. Proceedings of the Symposium on Theory of Computing, 2002.
[22] J. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, 1997.
[23] R. Krauthgamer and J. R. Lee. Navigating nets: Simple algorithms for proximity search. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2004.
[24] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. Proceedings of the Thirtieth ACM Symposium on Theory of Computing, pages 614-623, 1998.
[25] J. P. Nolan. An introduction to stable distributions. Available at http://www.cas.american.edu/jpnolan/chap1.ps.
[26] Z. Ouyang, N. Memon, T. Suel, and D. Trendafilov. Cluster-based delta compression of collections of files. Proceedings of the International Conference on Web Information Systems Engineering (WISE), 2002.
[Figure 3: rho versus c for fixed r = 1.5, 3.5, 10, compared with the 1/c curve; panels (a) and (b).]
[27] N. Shivakumar. Detecting digital copyright violations on the Internet (Ph.D. thesis). Department of Computer Science, Stanford University, 2000.
[28] R. Weber, H. J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), 1998.
[29] C. Yang. MACS: Music audio characteristic sequence indexing for similarity retrieval. Proceedings of the Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.
[30] P. N. Yianilos. Locally lifting the curse of dimensionality for nearest neighbor search. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2000.
[31] V. M. Zolotarev. One-Dimensional Stable Distributions. Vol. 65 of Translations of Mathematical Monographs, American Mathematical Society, 1986.
APPENDIX
A.
THEOREM 2. If N(q, c) = O(c^b) for some b, then the single-shot LSH algorithm finds p* with constant probability in expected time O(log n + 2^{O(b)}).
1 Similar guarantees can be proved when we only know a constant approximation to the distance.
Proof: For any point p at distance c from q, the probability that g(p) = g(q) is equal to p(c)^k, where p(.) is the collision probability function defined in Section 3. Therefore . Note that . Moreover, we have , . This implies that the probability that for some constant
[Figure 4: (a) query time versus the number of data points n; (b) speedup versus n.]
or equivalently , for proper constants . Now consider the expected number of points colliding with . Let be a multiset containing all values of over . We have
B.
We prove that for the general case (p in (0, 2)) the ratio rho gets arbitrarily close to max(1/c, 1/c^p). For the case p < 1, our algorithm is the first algorithm to solve this problem, and so there is no existing ratio against which we can compare our result; however, we show that for this case rho is arbitrarily close to c^{-p}. The proof follows from the following two lemmas, which together imply Theorem 3.
THEOREM 3. For any p in (0, 2] there is a (r1, r2, p1, p2)-sensitive family for l_p such that for any c, the ratio rho = ln(1/p1)/ln(1/p2) is arbitrarily close to max(1/c, 1/c^p).
Let , . Then by the following lemma.
LEMMA 1. For and such that , there is such that .
Proof: Noting , the claim is equivalent to . This in turn is equivalent to . This is trivially true for . Furthermore, taking the derivative, we see , which is non-positive for and . Therefore, is non-increasing in the region in which we are interested, and so for all values in this region. If , then we have . Now our goal is to upper bound .
Proof: Using the values of p1, p2 calculated in Sub-section 3.2, followed by a change of variables, we get
[Figure 5: (a) query time versus the dimension d; (b) speedup versus d, for d ranging from 50 to 500.]
Setting
and
we see
Case 1: p > 1. For these p-stable distributions, converges to, say, (since the random variables drawn from those distributions have finite expectations). As is non-negative on , is a monotonically increasing function of which converges to . Thus, for every there is some such that
First, we consider general p and discuss the special cases p = 1 and p = 2 towards the end. We bound . Notice for drawn according to the absolute value of a p-stable distribution with density function . To estimate , we can use the Pareto estimation ([25]) for the cumulative distribution function, which holds for ,
Set
(1)
where . Note that the extra factor 2 is due to the fact that the distribution function is for the absolute value of . For this value of the -stable distribution. Fix let be the in the equation above. If we set we get
Case 2: p < 1. For this case we will choose our parameters so that we can use the Pareto estimation for the density function. Choose large enough so that the Pareto estimation is accurate to within a factor of for . Then for ,
Now we bound . We break the proof down into two cases based on the value of p.
[Figure 6: (a) query time versus the approximation factor c; (b) speedup versus c.]
Since is a constant that depends on , the first term decreases as while the second term decreases as where . Thus for every there is some such that for all , the first term is at most times the second term. We choose . Then for ,
Also, for the case p = 2, i.e. the normal distribution, the computation is straightforward. We use the fact that for this case , where is the normal density function. For large values of , clearly dominates , because decreases exponentially ( ) while decreases as . Thus, we need to approximate as tends to infinity, which is clearly .
and
Notice that, similar to the previous parts, we can find the appropriate such that is at most .
As for the ratio , we can prove the upper bound of using L'Hopital's rule, as follows:
for . We now consider the special cases of p = 1, 2. For the case p = 1, we have the Cauchy distribution, and we can compute the collision probabilities directly. In fact, for the ratio , the previous analysis for general p works here.