Locality-Sensitive Hashing Scheme Based On P-Stable Distributions
ABSTRACT
We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the l_p norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the l_2 norm. It also yields the first known provably efficient approximate NN algorithm for the case p < 1. We also show that the algorithm finds the exact near neighbor in O(log n) time for data satisfying a certain "bounded growth" condition. Unlike earlier schemes, our LSH scheme works directly on points in Euclidean space without embeddings. Consequently, the resulting query time bound is free of large factors, and the scheme is simple and easy to implement. Our experiments (on synthetic data sets) show that our data structure is up to 40 times faster than the kd-tree.
General Terms
Algorithms, Experimentation, Design, Performance, Theory
Keywords
Sublinear Algorithm, Approximate Nearest Neighbor, Locality-Sensitive Hashing, p-Stable Distributions
1. INTRODUCTION
This material is based upon work supported by the NSF CAREER grant CCR-0133849.
A similarity search problem involves a collection of objects (documents, images, etc.) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space; given queries in the form of points in this space, we
are required to find the nearest (most similar) object to the query. A particularly interesting and well-studied instance is the d-dimensional Euclidean space. This problem is of major importance to a variety of applications; some examples are: data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis. Typically, the features of the objects of interest (documents, images, etc.) are represented as points in R^d, and a distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands. The low-dimensional case (say, for dimensionality equal to 2 or 3) is well-solved, so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality". Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough d, in theory or in practice, they often provide little improvement over a linear algorithm which compares a query to each point from the database. In particular, it was shown in [28] (both empirically and theoretically) that all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions.

In recent years, several researchers proposed to avoid the running time bottleneck by using approximation (e.g., [3, 22, 19, 24, 15], see also [12]). This is due to the fact that, in many cases, an approximate nearest neighbor is almost as good as the exact one; in particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. In fact, in situations when the quality of the approximate nearest neighbor is much worse than the quality of the actual nearest neighbor, the nearest neighbor problem is unstable, and it is not clear if solving it is at all meaningful [4, 17].

In [19, 14], the authors introduced an approximate high-dimensional similarity search scheme with provably sublinear dependence on the data size. Instead of using tree-like space partitioning, it relied on a new method called locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. In [19, 14] the authors provided such locality-sensitive hash functions for the case when the points live in the binary Hamming space {0, 1}^d. They showed experimentally that the data structure achieves large speedup over several tree-based data structures when
the data is stored on disk. In addition, since LSH is a hashing-based scheme, it can be naturally extended to the dynamic setting, i.e., when insertion and deletion operations also need to be supported. This avoids the complexity of dealing with tree structures when the data is dynamic.

The LSH algorithm has since been used in numerous applied settings, e.g., see [14, 10, 16, 27, 5, 7, 29, 6, 26, 13]. However, it suffers from a fundamental drawback: it is fast and simple only when the input points live in the Hamming space (indeed, almost all of the above applications involved binary data). As mentioned in [19, 14], it is possible to extend the algorithm to the l_2 norm by embedding l_2 space into l_1 space, and then l_1 space into the Hamming space. However, this increases the query time and/or error by a large factor and complicates the algorithm.

In this paper we present a novel version of the LSH algorithm. As with the previous schemes, it works for the c-Near Neighbor (NN) problem, where the goal is to report a point within distance cR from a query point q, if there is a point in the data set within distance R from q. Unlike the earlier algorithm, our algorithm works directly on points in Euclidean space without embeddings. As a consequence, it has the following advantages over the previous algorithm:
- For the l_2 norm, its query time grows as n^rho, where rho < 1/c over the range of c we consider (the inequality is strict; see Figure 1(b)). Thus, for a large range of values of c, the query time exponent is better than the one in [19, 14].
- It is simple and quite easy to implement.
- It works for any l_p norm, as long as p is in (0, 2]. Specifically, we show that for any p in (0, 2] there exists an algorithm for c-NN under l_p^d which uses O(dn + n^{1+rho}) space, with query time dominated by O(n^rho) distance computations, where rho is arbitrarily close to max(1/c, 1/c^p). To our knowledge, this is the only known provable algorithm for the high-dimensional nearest neighbor problem for the case p < 1. Similarity search under such fractional norms has recently attracted interest [1, 11].

Our algorithm also inherits two very convenient properties of LSH schemes. The first one is that it works well on data that is extremely high-dimensional but sparse. Specifically, the running time bound remains unchanged if d denotes the maximum number of non-zero elements in the vectors. To our knowledge, this property is not shared by other known spatial data structures. Thanks to this property, we were able to use our new LSH scheme (specifically, the l_1 norm version) for fast color-based image similarity search [20]. In that context, each image was represented by a point in a high-dimensional space, but only about 100 dimensions were non-zero per point. The use of our LSH scheme enabled order(s) of magnitude speed-up over a linear scan.

The second property is that our algorithm provably reports the exact near neighbor very quickly if the data satisfies a certain bounded growth property. Specifically, for a query point q and c >= 1, let N(q, c) be the number of c-approximate nearest neighbors of q in P. If N(q, c) grows sub-exponentially as a function of c, then the LSH algorithm reports p*, the nearest neighbor, with constant probability within time O(log n), assuming it is given a constant-factor approximation to the distance from q to its nearest neighbor. In particular, we show that if N(q, c) = O(c^b), then the running time is O(log n + 2^{O(b)}). Efficient nearest neighbor algorithms for data sets with polynomial growth properties in general metrics have recently been a focus of several papers [9, 21, 23]. LSH solves an easier problem (near neighbor under the l_2 norm), while working under weaker assumptions about the growth function. It is also somewhat faster, due to the fact that the log n factor in the query time of the earlier schemes is multiplied by a function of the growth parameter, while in our case this factor is additive.

We complement our theoretical analysis with an experimental evaluation of the algorithm on data with a wide range of parameters. In particular, we compare our algorithm to an approximate version of the kd-tree algorithm [2]. We performed the experiments on synthetic data sets containing a planted near neighbor (see Section 5 for more details); a similar model was used earlier in [30]. Our experiments indicate that the new LSH scheme achieves query times up to 40 times better than the query time of the kd-tree algorithm.

2. LOCALITY-SENSITIVE HASHING
An important technique from [19] for solving the c-NN problem is locality-sensitive hashing, or LSH. For a domain S of the point set with distance measure D, an LSH family is defined as follows.
DEFINITION 1. A family H = {h : S -> U} is called (r1, r2, p1, p2)-sensitive for D if for any v, q in S:
if v is in B(q, r1) then Pr_H[h(q) = h(v)] >= p1,
if v is not in B(q, r2) then Pr_H[h(q) = h(v)] <= p2.
In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy the inequalities p1 > p2 and r1 < r2. We briefly describe, following [19], how an LSH family can be used to solve the c-NN problem. We choose r1 = R and r2 = cR. Given a family H of hash functions with parameters (r1, r2, p1, p2) as in Definition 1, we amplify the gap between the "high" probability p1 and the "low" probability p2 by concatenating several functions. In particular, for k specified later, define a function family G = {g : S -> U^k} such that g(v) = (h_1(v), ..., h_k(v)), where h_i is in H. For an integer L we choose L functions g_1, ..., g_L from G, independently and uniformly at random. During preprocessing, we store each point p of the input point set P in the bucket g_j(p), for j = 1, ..., L. Since the total number of buckets may be large, we retain only the non-empty buckets by resorting to hashing. To process a query q, we search all buckets g_1(q), ..., g_L(q); as it is possible (though unlikely) that the total number of points stored in those buckets is large, we interrupt the search after finding the first O(L) points (including duplicates). Let v_1, ..., v_t be the points encountered therein. For each v_j, if v_j is in B(q, r2) then we return YES and v_j, else we return NO. The parameters k and L are chosen so as to ensure that with a constant probability the following two properties hold:
1. If there exists a point p* in B(q, r1), then g_j(p*) = g_j(q) for some j = 1, ..., L.
2. The total number of collisions of q with points from P \ B(q, r2) is O(L), i.e., sum_{j=1}^{L} |(P \ B(q, r2)) intersect g_j^{-1}(g_j(q))| = O(L).
Observe that if properties (1) and (2) hold, then the algorithm is correct. It follows (see [19], Theorem 5 for details) that if we set k = log_{1/p2} n and L = n^rho, where rho = ln(1/p1)/ln(1/p2), then (1) and (2) hold with a constant probability. Thus, we get the following theorem (a slightly different version of Theorem 5 in [19]), which relates the efficiency of solving the c-NN problem to the sensitivity parameters of the LSH family.
THEOREM 1. Suppose there is a (r1, r2, p1, p2)-sensitive family H for a distance measure D. Then there exists an algorithm for c-NN under measure D which uses O(dn + n^{1+rho}) space, with query time dominated by O(n^rho) distance computations and O(n^rho log_{1/p2} n) evaluations of hash functions from H, where rho = ln(1/p1)/ln(1/p2).
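For concreteness, the bucketing scheme just described can be sketched in code. The class, the parameter values, and the toy stand-in hash family below are our own illustrative choices, not fixed by the paper; in practice, the p-stable family of Section 3 would be plugged in as hash_family.

```python
import random
from collections import defaultdict

class LSHIndex:
    """Sketch of the scheme above: L functions g_j, each concatenating k hashes from a family."""

    def __init__(self, points, hash_family, k, L):
        self.points = points
        self.gs = [[hash_family() for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx, p in enumerate(points):
            for g, table in zip(self.gs, self.tables):
                table[tuple(h(p) for h in g)].append(idx)  # store p in bucket g_j(p)

    def query(self, q, R, c, dist, max_checked=None):
        """Decision version of c-NN: return a point within distance c*R of q, or None."""
        checked = 0
        for g, table in zip(self.gs, self.tables):
            for idx in table.get(tuple(h(q) for h in g), []):
                if max_checked is not None and checked >= max_checked:
                    return None  # interrupt after examining O(L) candidates
                checked += 1
                if dist(self.points[idx], q) <= c * R:
                    return self.points[idx]
        return None

# toy usage with a stand-in hash family (random shifted grid on one coordinate)
def toy_family():
    i, b = random.randrange(3), random.random()
    return lambda v, i=i, b=b: int((v[i] + b) // 1.0)

pts = [tuple(random.uniform(0, 10) for _ in range(3)) for _ in range(200)]
index = LSHIndex(pts, toy_family, k=4, L=8)
l2 = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
print(index.query(pts[0], R=0.5, c=2.0, dist=l2))
```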
3. LSH FAMILY BASED ON p-STABLE DISTRIBUTIONS
In this section we present an LSH family based on p-stable distributions that works for all p in (0, 2]. Since we consider points in l_p^d, without loss of generality we can take R = 1, which we assume from now on.
DEFINITION 2. A distribution D over R is called p-stable if there exists p >= 0 such that for any n real numbers v_1, ..., v_n and i.i.d. variables X_1, ..., X_n with distribution D, the random variable sum_i v_i X_i has the same distribution as the variable (sum_i |v_i|^p)^{1/p} X, where X is a random variable with distribution D.
It is known [31] that stable distributions exist for any p in (0, 2]. In particular: a Cauchy distribution, defined by the density function f(x) = (1/pi) * 1/(1 + x^2), is 1-stable; a Gaussian (normal) distribution, defined by the density function g(x) = (1/sqrt(2 pi)) e^{-x^2/2}, is 2-stable.
We note that, from a practical point of view, despite the lack of closed-form density and distribution functions, it is known [8] that one can generate p-stable random variables essentially from two independent variables distributed uniformly over [0, 1]. Stable distributions have found numerous applications in various fields (see the survey [25] for more details). In computer science, stable distributions were used for sketching of high-dimensional vectors by Indyk ([18]) and have since found use in various applications. The main property of p-stable distributions mentioned in the definition above translates directly into a sketching technique for high-dimensional vectors. The idea is to generate a random vector a of dimension d whose entries are chosen independently from a p-stable distribution. Given a vector v of dimension d, the dot product a.v is a random variable which is distributed as (sum_i |v_i|^p)^{1/p} X (i.e., ||v||_p X), where X is a random variable with p-stable distribution. A small collection of such dot products (a.v), corresponding to different a's, is termed the sketch of the vector v and can be used to estimate ||v||_p (see [18] for details). It is easy to see that such a sketch is linearly composable, i.e., a.(v1 - v2) = a.v1 - a.v2.
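The generation from two uniform random variables mentioned above can be illustrated with the Chambers-Mallows-Stuck construction [8]; the function names below and the final sanity check are our own illustrative additions, not the paper's code.

```python
import math
import random

def p_stable_sample(p: float) -> float:
    """One draw from a symmetric p-stable distribution, 0 < p <= 2,
    via the Chambers-Mallows-Stuck construction (uniform angle, exponential weight)."""
    theta = random.uniform(-math.pi / 2, math.pi / 2)
    w = random.expovariate(1.0)
    if abs(p - 1.0) < 1e-12:
        return math.tan(theta)                      # p = 1: Cauchy
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))

def sketch(v, a_vectors):
    """p-stable sketch: each dot product a.v is distributed as ||v||_p times a p-stable variable."""
    return [sum(ai * vi for ai, vi in zip(a, v)) for a in a_vectors]

d, m, p = 16, 8, 1.0
a_vectors = [[p_stable_sample(p) for _ in range(d)] for _ in range(m)]
v = [random.random() for _ in range(d)]
print(sketch(v, a_vectors))
```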
We use p-stable distributions in the following manner: instead of using the dot products (a.v) to estimate the l_p norm, we use them to assign a hash value to each vector v. Each hash function h_{a,b} : R^d -> Z maps a d-dimensional vector onto the set of integers, and is indexed by a choice of a random vector a, whose entries are chosen independently from a p-stable distribution, and a real number b chosen uniformly from the range [0, r]. For fixed a, b the hash value is h_{a,b}(v) = floor((a.v + b)/r). By p-stability, for two vectors v1, v2 the difference of their projections (a.v1 - a.v2) is distributed as c X, where c = ||v1 - v2||_p and X is a p-stable random variable; the collision probability is therefore p(c) = Pr_{a,b}[h_{a,b}(v1) = h_{a,b}(v2)] = integral_0^r (1/c) f_p(t/c)(1 - t/r) dt, where f_p denotes the density function of the absolute value of the p-stable distribution.
For a fixed parameter r, the probability of collision p(c) decreases monotonically with c = ||v1 - v2||_p. Thus, as per Definition 1, the family of hash functions above is (r1, r2, p1, p2)-sensitive for p1 = p(1) and p2 = p(c), with r1 = 1 and r2 = c. In what follows we will bound the ratio rho = ln(1/p1)/ln(1/p2), which, as discussed earlier, is critical to the performance when this hash family is used to solve the c-NN problem. Note that we have not yet specified the parameter r, for it depends on the values of c and p. For every c we would like to choose a finite r that makes rho as small as possible.
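A direct rendering of h_{a,b} in code (the helper name and parameter values are illustrative; only p = 1 and p = 2 are shown, since those admit simple generators):

```python
import math
import random

def make_hash_fn(d: int, r: float, p: float = 2.0):
    """Build one h_{a,b}(v) = floor((a.v + b) / r), with a drawn entrywise from a
    p-stable distribution and b uniform in [0, r]."""
    if p == 2.0:
        a = [random.gauss(0.0, 1.0) for _ in range(d)]                               # Gaussian is 2-stable
    elif p == 1.0:
        a = [math.tan(random.uniform(-math.pi / 2, math.pi / 2)) for _ in range(d)]  # Cauchy is 1-stable
    else:
        raise NotImplementedError("use a general p-stable generator for other p")
    b = random.uniform(0.0, r)
    return lambda v: math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / r)

# nearby points collide more often than distant ones
h = make_hash_fn(d=3, r=4.0)
print(h((0.0, 0.0, 0.0)), h((0.1, -0.1, 0.0)), h((25.0, -13.0, 7.0)))
```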
4. COMPUTATIONAL ANALYSIS OF THE RATIO rho
In this section we focus on the cases p = 1 and p = 2. In these cases the ratio rho = ln(1/p1)/ln(1/p2) can be explicitly evaluated. We compute and plot this ratio and compare it with 1/c. Note that 1/c is the best (smallest) known exponent for n in the space requirement and query time, achieved in [19] for these cases.
4.1 Computing the ratio for special cases
For the special cases p = 1, 2 we can compute the probabilities p1, p2 explicitly, using the density functions mentioned before. A simple calculation gives
p(c) = 2 arctan(r/c)/pi - (1/(pi (r/c))) ln(1 + (r/c)^2) for p = 1 (Cauchy), and
p(c) = 1 - 2 Phi(-r/c) - (2/(sqrt(2 pi) (r/c)))(1 - e^{-r^2/(2 c^2)}) for p = 2 (Gaussian),
where Phi(.) denotes the cumulative distribution function (cdf) of a random variable that is distributed as N(0, 1). The value of p1 can be obtained by substituting c = 1 in the formulas above.

For values of c over a range (in small increments) we compute the minimum value of rho, rho(c) = min_r ln(1/p1)/ln(1/p2), using Matlab. The plot of rho versus c is shown in Figure 1. The crucial observation for the case p = 2 is that the curve corresponding to the optimal ratio rho lies strictly below the curve 1/c. As mentioned earlier, this is a strict improvement over the previous best known exponent from [19]. While we have computed rho(c) here only for c in a bounded range, we believe that rho(c) is strictly less than 1/c for all values of c. For the case p = 1, we observe that the rho curve is very close to 1/c, although it lies above it. The optimal rho was computed using Matlab as mentioned before. The Matlab program has a limit on the number of iterations it performs to compute the minimum of a function, and we reached this limit during the computations. If we could compute the true minimum, we suspect that it would be very close to 1/c, possibly equal to 1/c, and that this minimum might be reached as r tends to infinity.

If one were to implement our LSH scheme, ideally one would want to know the optimal value of r for every c. For p = 1, 2, for a given value of c, we can compute the value of r that gives the optimal value of rho(c); this can be done using programs like Matlab. However, we observe that for a fixed c the value of rho as a function of r is more or less stable after a certain point (see Figure 2). Thus, rho is not very sensitive to r beyond a certain point, and as long as we choose r sufficiently away from 0, the rho value will be close to optimal. Note, however, that we should not choose an r value that is too large: as r increases, both p1 and p2 get closer to 1. This increases the query time, since k = log_{1/p2} n, the number of projections concatenated in each hash function g_j (refer to Section 2), increases as p2 approaches 1. We mention that for the l_2 norm, the optimal value of r appears to be a (finite) function of c.

We also plot rho as a function of c for a few fixed r values (see Figure 3). For p = 1, we observe that for moderate r values the rho curve beats the 1/c curve over a large range of c that is of practical interest. For p = 2, we observe that as r increases the rho curve drops lower and gets closer and closer to the 1/c curve.
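The kind of numerical optimization described above can be reproduced with a few lines of code. The sketch below evaluates the integral form of p(c) from Section 3 for the Gaussian (p = 2) case and performs a crude grid search over r; SciPy is assumed to be available, and the grid and the sample values of c are our own choices, not the paper's.

```python
import math
import numpy as np
from scipy.integrate import quad

def collision_prob(c: float, r: float) -> float:
    """p(c) = integral_0^r (1/c) f_2(t/c) (1 - t/r) dt, f_2 = density of |N(0,1)|."""
    f_abs_normal = lambda x: math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0)
    integrand = lambda t: (1.0 / c) * f_abs_normal(t / c) * (1.0 - t / r)
    val, _ = quad(integrand, 0.0, r)
    return val

def rho(c: float, r: float) -> float:
    """rho = ln(1/p1) / ln(1/p2) with p1 = p(1), p2 = p(c)."""
    p1, p2 = collision_prob(1.0, r), collision_prob(c, r)
    return math.log(1.0 / p1) / math.log(1.0 / p2)

if __name__ == "__main__":
    for c in (1.5, 2.0, 4.0, 10.0):
        best = min(rho(c, r) for r in np.arange(0.5, 20.0, 0.25))  # crude grid search over r
        print(f"c = {c:5.1f}   optimal rho ~ {best:.3f}   1/c = {1.0 / c:.3f}")
```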
5. EXPERIMENTAL EVALUATION
In this section we present an experimental evaluation of our novel LSH scheme. We focus on the Euclidean norm case, since this occurs most frequently in practice. Our data structure is implemented for main memory. In what follows, we briefly discuss some of the issues pertaining to the implementation of our technique, and then report some preliminary performance results based on an empirical comparison of our technique to the kd-tree data structure.
Parameters and Performance Tradeoffs: The three main parameters that affect the performance of our algorithm are: the number of projections per hash value (k), the number of hash tables (L) and the width of the projection (r). In general, one could also introduce another parameter (say M), such that the query procedure stops after retrieving M points. In our analysis, M was set to O(L). In our experiments, however, the query procedure retrieved all points colliding with the query (i.e., we used an unbounded M). This reduces the number of parameters and simplifies the choice of the optimal k and L.
For a given value of k, it is easy to find the optimal value of L which will guarantee that the fraction of false negatives is no more than a user-specified threshold. This process is exactly the same as in an earlier paper by Cohen et al. ([10]) that uses locality-sensitive hashing to find similar column pairs in market-basket data, with the similarity exceeding a certain user-specified threshold. In our experiments we tried a few values of k and below we report the one that gives the best tradeoff for our scenario. The parameter k represents a tradeoff between the time spent computing hash values and the time spent pruning false positives, i.e., computing distances between the query and candidates; a bigger k increases the number of hash computations. In general we could do a binary search over a large range to find the optimal k. This binary search can be avoided if we have a good model of the relative times of hash computations and distance computations for the application at hand. Decreasing the width of the projection (r) decreases the probability of collision for any two points; thus, it has the same effect as increasing k. As a result, we would like to set r as small as possible and in this way decrease the number of projections we need to make. However, decreasing r below a certain threshold increases the quantity rho, thereby requiring us to increase L. Thus we cannot decrease r by too much. For the l_2 norm we found the optimal value of r using Matlab, which we then used in our experiments. Before we report our performance numbers, we describe the data set and query set that we used for testing.

Data Set: We used synthetically generated data sets and query points to test our algorithm. The dimensionality of the underlying space was varied between 50 and 500. We considered generating all the data and query points independently at random, so that for a data point (or query point) its coordinate along every dimension would be chosen independently and uniformly at random from a certain range. However, if we did that, then given a query point all the data points would be sharply concentrated at the same distance from the query point, since we are operating in high dimensions; approximate nearest neighbor search would not make sense on such a data set. Testing approximate nearest neighbor requires that for every query point q there are a few data points within distance R from q, while most of the points are at distance no less than cR. We call this a planted nearest neighbor model. In order to ensure this property we generate our points as follows (a similar approach was used in [30]). We first generate the query points at random, as above. We then generate the data points in such a way that for every query point we guarantee at least a single point within distance R, while all other points are at distance no less than cR. This way of generating data sets ensures that every query point has a few (in our case, just one) approximate nearest neighbors, while most points are far from the query. The resulting data set has several interesting properties. Firstly, it constitutes a worst-case input to LSH (since there is only one correct nearest neighbor, and all other points are "almost correct" nearest neighbors). Moreover, it captures the typical situation occurring in real-life similarity search applications, in which there are a few points that are relatively close to the query point, and most of the database points lie quite far from the query point. For our experiments the coordinate range was fixed.
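One simple way to realize the planted nearest neighbor model just described is sketched below; the constants, the rejection-sampling step, and the placement of the planted point at distance 0.9 R are illustrative choices of ours, not the paper's exact procedure.

```python
import random

def planted_nn_dataset(num_queries, far_per_query, d, R, c, coord_range=1000.0):
    """Uniform random queries; one planted neighbor within distance R of each query;
    all remaining points rejection-sampled to lie at distance >= c*R from every query."""
    l2 = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    rand_pt = lambda: [random.uniform(0.0, coord_range) for _ in range(d)]

    queries = [rand_pt() for _ in range(num_queries)]
    data = []
    for q in queries:
        direction = [random.gauss(0.0, 1.0) for _ in range(d)]
        norm = l2(direction, [0.0] * d) or 1.0
        data.append([qi + di * 0.9 * R / norm for qi, di in zip(q, direction)])  # planted neighbor

    while len(data) < num_queries * (far_per_query + 1):
        pt = rand_pt()
        if all(l2(pt, q) >= c * R for q in queries):  # keep only "far" points
            data.append(pt)
    return queries, data

queries, data = planted_nn_dataset(num_queries=10, far_per_query=100, d=64, R=1.0, c=2.0)
```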
The total number of data points was varied, up to 10^5. Both our algorithm and the kd-tree take as input the approximation factor c. In addition, however, our algorithm also requires as input the value of R, the (upper bound on the) distance to the nearest neighbor. This can be avoided by guessing the value of R and doing a binary search. We feel that for most real-life applications it is easy to guess a range for R that is not too large. As a result, the
[Figure 1: Optimal rho versus the approximation factor c: (a) optimal rho for l_1; (b) optimal rho for l_2; the 1/c curve is shown for comparison.]
additional multiplicative overhead of doing a binary search should not be large and will not cancel the gains that we report.

Experimental Results: We did three sets of experiments to evaluate the performance of our algorithm versus that of the kd-tree: we increased the number of data points, the dimensionality of the data set, and the approximation factor c. In each set of experiments we report the average query processing times for our algorithm and the kd-tree algorithm, and also the ratio of the two ((average query time for kd-tree)/(average query time for our algorithm)), i.e., the speedup achieved by our algorithm. We ran our experiments on a Sun workstation with a 650 MHz UltraSPARC-IIi processor with 512 KB L2 cache, having no special support for vector computations, and with 512 MB of main memory. For all our experiments we used fixed values of the parameters k and L. Moreover, we set the percentage of false negatives that we can tolerate to a small threshold, and indeed for all the experiments that we report below we did not exceed that threshold; in fact, the fraction of false negatives was smaller in most cases. For all the query time graphs that we present, the curve that lies above is that of the kd-tree and the one below is for our algorithm.

For the first experiment we fixed the dimension d, the approximation factor c, and r (the width of the projection), and varied the number of data points n. Figures 4(a) and 4(b) show the processing times and speedup respectively as n is varied. As we see from the figures, the speedup seems to increase linearly with n. For the second experiment we fixed n, c and r, and varied the dimensionality d of the data set from 50 to 500. Figures 5(a) and 5(b) show the processing times and speedup respectively as d is varied. As we see from the figures, the speedup seems to increase with the dimension. For the third experiment we fixed n and d, and varied the approximation factor c; the width r was set appropriately as a function of c. Figures 6(a) and 6(b) show the processing times and speedup respectively as c is varied.

Memory Requirement: The memory requirement for our algorithm equals the memory needed to store the data points themselves plus the memory required to store the hash tables.
If we insert each point in the L hash tables along with its hash value and a pointer to the data point itself, this requires roughly L(k + 1) words (ints) of memory per data point: for each of the L tables, the k-word hash value plus a pointer. We can reduce the memory requirement by not storing the hash value explicitly as a concatenation of k projections, but instead hashing these k values in turn to get a single word for the hash. This would reduce the memory requirement to two words per table, i.e., roughly 2L words per data point. If the data points belong to a high-dimensional space (e.g., with dimension in the hundreds or more), then the overhead of maintaining the hash tables (with the optimization above) is small compared to storing the points themselves. Thus, the memory overhead of our algorithm is small.
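The single-word hashing of the k projections mentioned above can be done with any standard second-level hash of the tuple; a minimal sketch, with all constants illustrative:

```python
import random

def make_fingerprint(k, table_size=1 << 20, prime=(1 << 61) - 1):
    """Second-level hash: collapse a k-tuple of projections into a bucket index plus
    a one-word checksum, so each table entry stores two words (checksum + pointer)."""
    coeffs = [random.randrange(1, prime) for _ in range(k)]
    def fingerprint(projections):
        acc = 0
        for coeff, h in zip(coeffs, projections):
            acc = (acc + coeff * (h % prime)) % prime
        return acc % table_size, acc
    return fingerprint

fp = make_fingerprint(k=4)
print(fp((3, -1, 7, 2)))
```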
6. CONCLUSIONS
In this paper we present a new LSH scheme for similarity search in high-dimensional spaces. The algorithm is easy to implement and generalizes to arbitrary l_p norms, for p in (0, 2]. We provide theoretical, computational and experimental evaluations of the algorithm. Although the experimental comparison of LSH and the kd-tree-based algorithm suggests that the former outperforms the latter, there are several caveats that one needs to keep in mind:
- We used the kd-tree structure "as is"; tweaking its parameters would likely improve its performance.
- LSH solves the decision version of the nearest neighbor problem, while the kd-tree solves the optimization version. Although the latter reduces to the former, the reduction overhead increases the running time.
- One could run the approximate kd-tree algorithm with an approximation parameter that is much larger than the intended approximation. Although the resulting algorithm would provide only a very weak guarantee on the quality of the returned neighbor, typically the actual error is much smaller than the guarantee.
7. REFERENCES
[1] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional spaces. Proceedings of the International Conference on Database Theory, pages 420-434, 2001.
[Figure 2: rho versus r; panels (a) and (b).]
[2] S. Arya and D. Mount. ANN: Library for approximate nearest neighbor searching. Available at http://www.cs.umd.edu/mount/ANN/.
[3] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 573-582, 1994.
[4] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? Proceedings of the International Conference on Database Theory, pages 217-235, 1999.
[5] J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17:419-428, 2001.
[6] J. Buhler. Provably sensitive indexing strategies for biosequence similarity search. Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB02), 2002.
[7] J. Buhler and M. Tompa. Finding motifs using random projections. Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB01), 2001.
[8] J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. J. Amer. Statist. Assoc., 71:340-344, 1976.
[9] K. Clarkson. Nearest neighbor queries in metric spaces. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 609-617, 1997.
[10] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang. Finding interesting associations without support pruning. Proceedings of the 16th International Conference on Data Engineering (ICDE), 2000.
[11] G. Cormode, P. Indyk, N. Koudas, and S. Muthukrishnan. Fast mining of massive tabular data via approximate distance computations. Proceedings of the 18th International Conference on Data Engineering (ICDE), 2002.
[12] T. Darrell, P. Indyk, G. Shakhnarovich, and P. Viola. Approximate nearest neighbors methods for learning and vision. NIPS Workshop, http://www.ai.mit.edu/projects/vip/nips03ann, 2003.
[13] B. Georgescu, I. Shimshoni, and P. Meer. Mean shift based clustering in high dimensions: A texture classification example. Proceedings of the 9th International Conference on Computer Vision, 2003.
[14] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), 1999.
[15] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. Proceedings of the Symposium on Foundations of Computer Science, 2001.
[16] T. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. WebDB Workshop, 2000.
[17] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? Proceedings of the International Conference on Very Large Databases (VLDB), pages 506-515, 2000.
[18] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. Proceedings of the Symposium on Foundations of Computer Science, 2000.
[19] P. Indyk and R. Motwani. Approximate nearest neighbor: towards removing the curse of dimensionality. Proceedings of the Symposium on Theory of Computing, 1998.
[20] P. Indyk and N. Thaper. Fast color image retrieval via embeddings. Workshop on Statistical and Computational Theories of Vision (at ICCV), 2003.
[21] D. Karger and M. Ruhl. Finding nearest neighbors in growth-restricted metrics. Proceedings of the Symposium on Theory of Computing, 2002.
[22] J. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, 1997.
[23] R. Krauthgamer and J. R. Lee. Navigating nets: Simple algorithms for proximity search. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2004.
[24] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. Proceedings of the Thirtieth ACM Symposium on Theory of Computing, pages 614-623, 1998.
[25] J. P. Nolan. An introduction to stable distributions. Available at http://www.cas.american.edu/jpnolan/chap1.ps.
[26] Z. Ouyang, N. Memon, T. Suel, and D. Trendafilov. Cluster-based delta compression of collections of files. Proceedings of the International Conference on Web Information Systems Engineering (WISE), 2002.
[Figure 3: rho versus c for fixed r = 1.5, 3.5, 10, compared with the 1/c curve; panels (a) and (b).]
[27] N. Shivakumar. Detecting digital copyright violations on the Internet (Ph.D. thesis). Department of Computer Science, Stanford University, 2000.
[28] R. Weber, H. J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), 1998.
[29] C. Yang. MACS: Music audio characteristic sequence indexing for similarity retrieval. Proceedings of the Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.
[30] P. N. Yianilos. Locally lifting the curse of dimensionality for nearest neighbor search. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2000.
[31] V. M. Zolotarev. One-Dimensional Stable Distributions. Vol. 65 of Translations of Mathematical Monographs, American Mathematical Society, 1986.
APPENDIX
A.
THEOREM 2. If N(q, c) = O(c^b) for some b, then the single-shot LSH algorithm finds p* with constant probability in expected time O(log n + 2^{O(b)}).
1 Similar guarantees can be proved when we only know a constant approximation to the distance.
Proof: For any point p at distance c from q, the probability that g(p) = g(q) is equal to p(c)^k, where p(.) is the collision probability function defined in Section 3. Therefore . Note that . Moreover, we have , . This implies that the probability that for some constant
[Figure 4: (a) query time versus the number of data points n; (b) speedup versus n.]
or equivalently , for proper constants . Now consider the expected number of points colliding with . Let be a multiset containing all values of over . We have
B.
We prove that for the general case (p in (0, 2)) the ratio rho gets arbitrarily close to max(1/c, 1/c^p). For the case p < 1, our algorithm is the first algorithm to solve this problem, and so there is no existing ratio against which we can compare our result; however, we show that for this case rho is arbitrarily close to c^{-p}. The proof follows from the following two lemmas, which together imply Theorem 3.
THEOREM 3. For any p in (0, 2] there is a (r1, r2, p1, p2)-sensitive family for l_p such that for any c, the ratio rho = ln(1/p1)/ln(1/p2) is arbitrarily close to max(1/c, 1/c^p).
Let , . Then by the following lemma.
LEMMA 1. For and such that , there is such that .
Proof: Noting , the claim is equivalent to . This in turn is equivalent to . This is trivially true for . Furthermore, taking the derivative, we see , which is non-positive for and . Therefore, is non-increasing in the region in which we are interested, and so for all values in this region. If , then we have . Now our goal is to upper bound .
Proof: Using the values of p1, p2 calculated in Sub-section 3.2, followed by a change of variables, we get
[Figure 5: (a) query time versus the dimension d; (b) speedup versus d, for d ranging from 50 to 500.]
Setting
and
we see
Case 1: p > 1. For these p-stable distributions, converges to, say, (since the random variables drawn from those distributions have finite expectations). As is non-negative on , is a monotonically increasing function of which converges to . Thus, for every there is some such that
First, we consider general p and discuss the special cases p = 1 and p = 2 towards the end. We bound . Notice for drawn according to the absolute value of a p-stable distribution with density function . To estimate , we can use the Pareto estimation ([25]) for the cumulative distribution function, which holds for ,
Set
(1)
where . Note that the extra factor 2 is due to the fact that the distribution function is for the absolute value of . For this value of the -stable distribution. Fix let be the in the equation above. If we set we get
Case 2: p < 1. For this case we will choose our parameters so that we can use the Pareto estimation for the density function. Choose large enough so that the Pareto estimation is accurate to within a factor of for . Then for ,
Now we bound . We break the proof down into two cases based on the value of p.
[Figure 6: (a) query time versus the approximation factor c; (b) speedup versus c.]
Since is a constant that depends on , the first term decreases as while the second term decreases as where . Thus for every there is some such that for all , the first term is at most times the second term. We choose . Then for ,
Also, for the case p = 2, i.e. the normal distribution, the computation is straightforward. We use the fact that for this case , where is the normal density function. For large values of , clearly dominates , because decreases exponentially ( ) while decreases as . Thus, we need to approximate as tends to infinity, which is clearly .
and
Notice that, similar to the previous parts, we can find the appropriate such that is at most .
As for the ratio , we can prove the upper bound of using L'Hopital's rule, as follows:
for . We now consider the special cases of p = 1, 2. For the case p = 1, we have the Cauchy distribution, and we can compute the collision probabilities directly. In fact, for the ratio , the previous analysis for general p works here.