From 2d1f542f05b83d10ed6c159853b082589e1a2c6e Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 26 Aug 2015 10:23:56 +0200 Subject: [PATCH 01/46] Added .travis.yml --- .travis.yml | 2 ++ 1 file changed, 2 insertions(+) create mode 100644 .travis.yml diff --git a/.travis.yml b/.travis.yml new file mode 100644 index 0000000..7545c7c --- /dev/null +++ b/.travis.yml @@ -0,0 +1,2 @@ +language: java +install: mvn install -DskipTests=true -Dmaven.javadoc.skip=true -Dgpg.skip=true From bd61879dc8c9cc2b912f4cca20be746fa65f464e Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 26 Aug 2015 10:27:13 +0200 Subject: [PATCH 02/46] Added travis-ci build badge --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 86ce027..5ee257f 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,5 @@ # java-LSH -[![Maven Central](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh/badge.svg)](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh) +[![Maven Central](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh/badge.svg)](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh) [![Build Status](https://travis-ci.org/tdebatty/java-LSH.svg?branch=master)](https://travis-ci.org/tdebatty/java-LSH) A Java implementation of Locality Sensitive Hashing (LSH). 
From b07f01ae2180c868555f5f8506bc11e76c73f91b Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 26 Aug 2015 11:31:15 +0200 Subject: [PATCH 03/46] Added explanation on S curve for MinHas --- README.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/README.md b/README.md index 5ee257f..9871bfd 100644 --- a/README.md +++ b/README.md @@ -31,6 +31,8 @@ MinHash is a hashing scheme that tents to produce similar signatures for sets th The Jaccard similarity between two sets is the relative number of elements these sets have in common: J(A, B) = |A ∩ B| / |A ∪ B| A MinHash signature is a sequence of numbers produced by multiple hash functions hi. It can be shown that the Jaccard similarity between two sets is also the probability that this hash result is the same for the two sets: J(A, B) = Pr[hi(A) = hi(B)]. Therefore, MinHash signatures can be used to estimate Jaccard similarity between two sets. Moreover, it can be shown that the expected estimation error is O(1 / sqrt(n)), where n is the size of the signature (the number of hash functions that are used to produce the signature). +### Binning + ```java import info.debatty.java.lsh.LSHMinHash; import java.util.Random; @@ -224,6 +226,17 @@ On this figure, the x-axis is the Jaccard similarity between sets, the y-axis is We can clearly recognize the typical S curve of MinHash, with the threshold (the point where the curve is the steepest) located around x = 0.5. +This curve is very important! It shows that if all your sets are similar (similarity above 0.6), all sets will most probably fall in a single bucket. And all other buckets will thus most probably be empty. This can happen for example if your dataset is skewed and presents some sort of principal direction. + +At the opposite, if your sets are all different from each other (similarity below 0.2), the curve is nearly flat. This means that pairs of sets have the same probability of falling in the same bucket, independantly of their similarity. 
The items are then randomly binned into the buckets. If using B buckets and S stages, computing the probability that two items are binned in the same bucket is similar to the problem of rolling S times a dice with B values. The resuling probability is 1 - [(B-1) / B]^S. The computed probability for 10 buckets is presented in table below, and roughly correspond to the above graph. + + +| Stages | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | +|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|----| +| Pr | 0.1 | 0.19 | 0.27 | 0.34 | 0.41 | 0.47 | 0.52 | 0.57 | 0.61 | 0.65 | + +### Signatures + If you simply wish to compute MinHash signatures (witout performing LSH binning), you can directly use the MinHash class: ```java From 6954f4cdf2b51ee066770c1550c2eab7220b426b Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Tue, 29 Sep 2015 10:01:33 +0200 Subject: [PATCH 04/46] Added links to Javadoc --- README.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 9871bfd..806bc8d 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,5 @@ # java-LSH -[![Maven Central](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh/badge.svg)](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh) [![Build Status](https://travis-ci.org/tdebatty/java-LSH.svg?branch=master)](https://travis-ci.org/tdebatty/java-LSH) +[![Maven Central](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh/badge.svg)](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh) [![Build Status](https://travis-ci.org/tdebatty/java-LSH.svg?branch=master)](https://travis-ci.org/tdebatty/java-LSH) 
[![API](http://api123.web-d.be/api123-head.svg)](http://api123.web-d.be/api/java-LSH/head/index.html) A Java implementation of Locality Sensitive Hashing (LSH). @@ -277,6 +277,8 @@ Signature similarity: 0.6767676767676768 Real similarity (Jaccard index)0.6666666666666666 ``` +[Read Javadoc...](http://api123.web-d.be/api/java-LSH/head/index.html) + ##Super-Bit Super-Bit is an improvement of Random Projection LSH. It computes an estimation of cosine similarity. In Super-Bit, the K random vectors are orthogonalized in L batches of N vectors, where @@ -381,3 +383,5 @@ public class MyApp { System.out.println("Real (cosine) similarity: " + cosineSimilarity(v1, v2)); } ``` + +[Read Javadoc...](http://api123.web-d.be/api/java-LSH/head/index.html) From e687abd66464f8f304f5bdd6b891feae5f825686 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 14 Oct 2015 11:02:07 +0200 Subject: [PATCH 05/46] Added MIT license file --- LICENSE.md | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) create mode 100644 LICENSE.md diff --git a/LICENSE.md b/LICENSE.md new file mode 100644 index 0000000..41f935a --- /dev/null +++ b/LICENSE.md @@ -0,0 +1,26 @@ +# License + +This project is licensed under the terms of the **MIT license**. + +https://opensource.org/licenses/MIT + +> Copyright 2015 Thibault Debatty. 
+> +> Permission is hereby granted, free of charge, to any person obtaining +> a copy of this software and associated documentation files (the +> "Software"), to deal in the Software without restriction, including +> without limitation the rights to use, copy, modify, merge, publish, +> distribute, sublicense, and/or sell copies of the Software, and to +> permit persons to whom the Software is furnished to do so, subject to +> the following conditions: +> +> The above copyright notice and this permission notice shall be +> included in all copies or substantial portions of the Software. +> +> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +> EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +> MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +> NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE +> LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION +> OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION +> WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
\ No newline at end of file From e2a82ea111ff65d873d96619e003c98c9322eaf6 Mon Sep 17 00:00:00 2001 From: tibo Date: Thu, 19 Nov 2015 12:56:24 +0100 Subject: [PATCH 06/46] =?UTF-8?q?When=20computing=20MinHash=20signature,?= =?UTF-8?q?=20loop=20over=20'true'=20values,=20instead=20of=20loop=20over?= =?UTF-8?q?=20all=20possible=20values=20of=20dictionary.=20Improvement=20p?= =?UTF-8?q?roposed=20by=20Sven=20Schr=C3=B6der.?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../java/info/debatty/java/lsh/MinHash.java | 142 +++++++++--------- 1 file changed, 75 insertions(+), 67 deletions(-) diff --git a/src/main/java/info/debatty/java/lsh/MinHash.java b/src/main/java/info/debatty/java/lsh/MinHash.java index d19dede..9395d48 100644 --- a/src/main/java/info/debatty/java/lsh/MinHash.java +++ b/src/main/java/info/debatty/java/lsh/MinHash.java @@ -1,52 +1,53 @@ package info.debatty.java.lsh; import java.security.InvalidParameterException; +import java.util.ArrayList; +import java.util.Collections; import java.util.HashSet; +import java.util.List; import java.util.Random; import java.util.Set; import java.util.TreeSet; - /** - * MinHash is a hashing scheme that tents to produce similar signatures for sets + * MinHash is a hashing scheme that tents to produce similar signatures for sets * that have a high Jaccard similarity. - * - * The Jaccard similarity between two sets is the relative number of elements - * these sets have in common: J(A, B) = |A ∩ B| / |A ∪ B| - * A MinHash signature is a sequence of numbers produced by multiple hash - * functions hi. It can be shown that the Jaccard similarity between two sets is - * also the probability that this hash result is the same for the two sets: - * J(A, B) = Pr[hi(A) = hi(B)]. Therefore, MinHash signatures can be used to - * estimate Jaccard similarity between two sets. 
Moreover, it can be shown that - * the expected estimation error is O(1 / sqrt(n)), where n is the size of the - * signature (the number of hash functions that are used to produce the - * signature). - * + * + * The Jaccard similarity between two sets is the relative number of elements + * these sets have in common: J(A, B) = |A ∩ B| / |A ∪ B| A MinHash signature is + * a sequence of numbers produced by multiple hash functions hi. It can be shown + * that the Jaccard similarity between two sets is also the probability that + * this hash result is the same for the two sets: J(A, B) = Pr[hi(A) = hi(B)]. + * Therefore, MinHash signatures can be used to estimate Jaccard similarity + * between two sets. Moreover, it can be shown that the expected estimation + * error is O(1 / sqrt(n)), where n is the size of the signature (the number of + * hash functions that are used to produce the signature). + * * @author Thibault Debatty http://www.debatty.info */ public class MinHash { - + public static double JaccardIndex(Set s1, Set s2) { - Set intersection = new HashSet(s1); + Set intersection = new HashSet(s1); intersection.retainAll(s2); - - Set union = new HashSet(s1); + + Set union = new HashSet(s1); union.addAll(s2); - + if (union.isEmpty()) { return 0; } - + return (double) intersection.size() / union.size(); } - + public static double JaccardIndex(boolean[] s1, boolean[] s2) { if (s1.length != s2.length) { throw new InvalidParameterException("sets must be same size!"); } return JaccardIndex(Convert2Set(s1), Convert2Set(s2)); } - + public static Set Convert2Set(boolean[] array) { Set set = new TreeSet(); for (int i = 0; i < array.length; i++) { @@ -56,64 +57,65 @@ public static Set Convert2Set(boolean[] array) { } return set; } - + /** - * Computes the size of the signature required to achieve a given error - * in similarity estimation (1 / error^2) + * Computes the size of the signature required to achieve a given error in + * similarity estimation 
(1 / error^2) + * * @param error * @return size of the signature */ public static int size(double error) { if (error < 0 && error > 1) { - throw new IllegalArgumentException("error should be in [0 .. 1]"); + throw new IllegalArgumentException("error should be in [0 .. 1]"); } return (int) (1 / (error * error)); } - /** * Signature size */ private int n; - + /** * Random a and b coefficients for the random hash functions */ private int[][] hash_coefs; - + /** * Dictionary size */ private int dict_size; - + /** * Initializes hash functions to compute MinHash signatures for sets built * from a dictionary of dict_size elements - * - * @param size the number of hash functions (and the size of resulting signatures) - * @param dict_size + * + * @param size the number of hash functions (and the size of resulting + * signatures) + * @param dict_size */ - public MinHash (int size, int dict_size) { + public MinHash(int size, int dict_size) { init(size, dict_size); } - + /** * Initializes hash function to compute MinHash signatures for sets built * from a dictionary of dict_size elements, with a given similarity * estimation error + * * @param error * @param dict_size */ - public MinHash (double error, int dict_size) { + public MinHash(double error, int dict_size) { init(size(error), dict_size); } - + /** - * Computes the signature for this set - * The input set is represented as an vector of booleans - * For example the array [true, false, true, true, false] + * Computes the signature for this set The input set is represented as an + * vector of booleans For example the array [true, false, true, true, false] * corresponds to the set {0, 2, 3} - * + * * @param vector * @return the signature */ @@ -121,31 +123,36 @@ public int[] signature(boolean[] vector) { if (vector.length != dict_size) { throw new IllegalArgumentException("Size of array should be dict_size"); } - + return signature(Convert2Set(vector)); } - + /** - * Computes the signature for this set - * For example set = 
{0, 2, 3} + * Computes the signature for this set. For example set = {0, 2, 3} + * * @param set * @return the signature */ public int[] signature(Set set) { int[] sig = new int[n]; - + for (int i = 0; i < n; i++) { sig[i] = Integer.MAX_VALUE; } - + // For each row r: - for (int r = 0; r < dict_size; r++) { - - // if set has 0 in row r, do nothgin - if (!set.contains(r)) { - continue; - } - + //for (int r = 0; r < dict_size; r++) { + // if set has 0 in row r, do nothing + // if (!set.contains(r)) { + // continue; + // } + // Loop over 'true' values, instead of loop over all values of dictionary + // to speedup computation + final List list = new ArrayList(set); + Collections.sort(list); + + for (final int r : list) { + // However, if c has 1 in row r, then for each i = 1, 2, . . . ,n // set SIG(i, c) to the smaller of the current value of // SIG(i, c) and hi(r) @@ -155,45 +162,47 @@ public int[] signature(Set set) { h(i, r)); } } - + return sig; } - - + /** * Computes an estimation of Jaccard similarity (the number of elements in * common) between two sets, using the MinHash signatures of these two sets + * * @param sig1 MinHash signature of set1 - * @param sig2 MinHash signature of set2 (produced using the same coefficients) + * @param sig2 MinHash signature of set2 (produced using the same + * coefficients) * @return the estimated similarity */ public double similarity(int[] sig1, int[] sig2) { if (sig1.length != sig2.length) { throw new IllegalArgumentException("Size of signatures should be the same"); } - + double sim = 0; for (int i = 0; i < sig1.length; i++) { if (sig1[i] == sig2[i]) { sim += 1; } } - + return sim / sig1.length; } - + /** * Computes the expected error of similarity computed using signatures + * * @return the expected error */ public double error() { return 1.0 / Math.sqrt(n); } - + private void init(int size, int dict_size) { this.dict_size = dict_size; this.n = size; - + // h = (a * x) + b // a and b should be randomly generated Random r = 
new Random(); @@ -203,11 +212,10 @@ private void init(int size, int dict_size) { hash_coefs[i][1] = r.nextInt(dict_size); // b } } - - + /** * Computes hi(x) as (a_i * x + b_i) % dict_size. - * + * * @param i * @param x * @return the hashed value of x, using ith hash function @@ -215,7 +223,7 @@ private void init(int size, int dict_size) { private int h(int i, int x) { return (hash_coefs[i][0] * x + hash_coefs[i][1]) % dict_size; } - + public int[][] getCoefficients() { return hash_coefs; } From 4e72977d88ed5ae60056b64eb9960e20af5009da Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Thu, 19 Nov 2015 12:59:00 +0100 Subject: [PATCH 07/46] [maven-release-plugin] prepare release v0.7 --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index e6757a9..7f02569 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.7-SNAPSHOT + 0.7 jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - HEAD + v0.7 From f5f576e69a98ab34a692508c354fb7674cfacfc3 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Thu, 19 Nov 2015 12:59:04 +0100 Subject: [PATCH 08/46] [maven-release-plugin] prepare for next development iteration --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index 7f02569..c79dd13 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.7 + 0.8-SNAPSHOT jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - v0.7 + HEAD From 7c56ae4aa591d394dafdc76b1f05e75441a399b2 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 4 Dec 2015 09:31:15 +0100 Subject: [PATCH 09/46] All LSH classes implement Serializable + added an example of serialization... 
--- README.md | 89 ++++++++++++++++- src/main/java/info/debatty/java/lsh/LSH.java | 4 +- .../java/info/debatty/java/lsh/MinHash.java | 3 +- .../java/lsh/examples/SerializeExample.java | 96 +++++++++++++++++++ 4 files changed, 189 insertions(+), 3 deletions(-) create mode 100644 src/main/java/info/debatty/java/lsh/examples/SerializeExample.java diff --git a/README.md b/README.md index 806bc8d..d7d0696 100644 --- a/README.md +++ b/README.md @@ -5,6 +5,10 @@ A Java implementation of Locality Sensitive Hashing (LSH). Locality Sensitive Hashing (LSH) is a family of hashing methods that tent to produce the same hash (or signature) for similar items. There exist different LSH functions, that each correspond to a similarity metric. For example, the MinHash algorithm is designed for Jaccard similarity (the relative number of elements that two sets have in common). For cosine similarity, the traditional LSH algorithm used is Random Projection, but others exist, like Super-Bit, that deliver better resutls. +LSH functions have two main use cases: +* Compute the signature of large input vectors. These signatures can be used to quickly estimate the similarity between vectors. +* With a given number of buekcts, bin similar vectors together. + This library implements Locality Sensitive Hashing (LSH), as described in Leskovec, Rajaraman & Ullman (2014), "Mining of Massive Datasets", Cambridge University Press. Are currently implemented: @@ -24,7 +28,6 @@ Using maven: Or see the [releases](https://github.com/tdebatty/java-LSH/releases) page. - ##MinHash MinHash is a hashing scheme that tents to produce similar signatures for sets that have a high Jaccard similarity. 
@@ -385,3 +388,87 @@ public class MyApp { ``` [Read Javadoc...](http://api123.web-d.be/api/java-LSH/head/index.html) + +## Serialization + +As the parameters of the hashing function are randomly initialized when the LSH object is instantiated: +* two LSH objects will produce different hashes and signatures for the same input vector; +* two executions of your program will produce different hashes and signatures for the same input vector; +* the signatures produced by two different LSH objects can not be used to estimate the similarity between vectors. + +The solution is to serialize you LSH object so you an reuse it: + +```java +import info.debatty.java.lsh.LSHMinHash; +import java.io.File; +import java.io.FileInputStream; +import java.io.FileOutputStream; +import java.io.IOException; +import java.io.ObjectInputStream; +import java.io.ObjectOutputStream; +import java.util.Random; + +public class SerializeExample { + + public static void main(String[] args) + throws IOException, ClassNotFoundException { + + // Create a single random boolean vector + int n = 100; + double sparsity = 0.75; + boolean[] vector = new boolean[n]; + Random rand = new Random(); + for (int j = 0; j < n; j++) { + vector[j] = rand.nextDouble() > sparsity; + } + + // Create and configure LSH + int stages = 2; + int buckets = 10; + LSHMinHash lsh = new LSHMinHash(stages, buckets, n); + println(lsh.hash(vector)); + + // Create another LSH object + // as the parameters of the hashing function are randomly initialized + // these two LSH objects will produce different hashes for the same + // input vector! + LSHMinHash other_lsh = new LSHMinHash(stages, buckets, n); + println(other_lsh.hash(vector)); + + // Moreover, signatures produced by different LSH objects cannot + // be used to compute estimated similarity! + // The solution is to serialize and save the object, so it can be + // reused later... 
+ File tempfile = File.createTempFile("lshobject", ".ser"); + FileOutputStream fout = new FileOutputStream(tempfile); + ObjectOutputStream oos = new ObjectOutputStream(fout); + oos.writeObject(lsh); + oos.close(); + System.out.println( + "LSH object serialized to " + tempfile.getAbsolutePath()); + + FileInputStream fin = new FileInputStream(tempfile); + ObjectInputStream ois = new ObjectInputStream(fin); + LSHMinHash saved_lsh = (LSHMinHash) ois.readObject(); + println(saved_lsh.hash(vector)); + } + + static void println(int[] array) { + System.out.print("["); + for (int v : array) { + System.out.print("" + v + " "); + } + System.out.println("]"); + } +} +``` + +Will produce something like: +``` +[5 5 ] +[3 1 ] +LSH object serialized to /tmp/lshobject5903174677942358274.ser +[5 5 ] +``` + +[Check the examples](https://github.com/tdebatty/java-LSH/tree/master/src/main/java/info/debatty/java/lsh/examples) or [read Javadoc](http://api123.io/api/java-LSH/head/index.html) diff --git a/src/main/java/info/debatty/java/lsh/LSH.java b/src/main/java/info/debatty/java/lsh/LSH.java index 65bdcb9..746c0c6 100644 --- a/src/main/java/info/debatty/java/lsh/LSH.java +++ b/src/main/java/info/debatty/java/lsh/LSH.java @@ -1,5 +1,7 @@ package info.debatty.java.lsh; +import java.io.Serializable; + /** * Implementation of Locality Sensitive Hashing (LSH) principle, as described in * Leskovec, Rajaraman & Ullman (2014), "Mining of Massive Datasets", @@ -7,7 +9,7 @@ * * @author Thibault Debatty http://www.debatty.info */ -public abstract class LSH { +public abstract class LSH implements Serializable { protected static final long LARGE_PRIME = 433494437; diff --git a/src/main/java/info/debatty/java/lsh/MinHash.java b/src/main/java/info/debatty/java/lsh/MinHash.java index 9395d48..cba021f 100644 --- a/src/main/java/info/debatty/java/lsh/MinHash.java +++ b/src/main/java/info/debatty/java/lsh/MinHash.java @@ -1,5 +1,6 @@ 
package info.debatty.java.lsh; +import java.io.Serializable; import java.security.InvalidParameterException; import java.util.ArrayList; import java.util.Collections; @@ -25,7 +26,7 @@ * * @author Thibault Debatty http://www.debatty.info */ -public class MinHash { +public class MinHash implements Serializable { public static double JaccardIndex(Set s1, Set s2) { Set intersection = new HashSet(s1); diff --git a/src/main/java/info/debatty/java/lsh/examples/SerializeExample.java b/src/main/java/info/debatty/java/lsh/examples/SerializeExample.java new file mode 100644 index 0000000..d0ff329 --- /dev/null +++ b/src/main/java/info/debatty/java/lsh/examples/SerializeExample.java @@ -0,0 +1,96 @@ +/* + * The MIT License + * + * Copyright 2015 Thibault Debatty. + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. 
+ */ +package info.debatty.java.lsh.examples; + +import info.debatty.java.lsh.LSHMinHash; +import java.io.File; +import java.io.FileInputStream; +import java.io.FileOutputStream; +import java.io.IOException; +import java.io.ObjectInputStream; +import java.io.ObjectOutputStream; +import java.util.Random; + +/** + * + * @author Thibault Debatty + */ +public class SerializeExample { + + /** + * @param args the command line arguments + * @throws java.io.IOException + * @throws java.lang.ClassNotFoundException + */ + public static void main(String[] args) + throws IOException, ClassNotFoundException { + + // Create a single random boolean vector + int n = 100; + double sparsity = 0.75; + boolean[] vector = new boolean[n]; + Random rand = new Random(); + for (int j = 0; j < n; j++) { + vector[j] = rand.nextDouble() > sparsity; + } + + // Create and configure LSH + int stages = 2; + int buckets = 10; + LSHMinHash lsh = new LSHMinHash(stages, buckets, n); + println(lsh.hash(vector)); + + // Create another LSH object + // as the parameters of the hashing function are randomly initialized + // these two LSH objects will produce different hashes for the same + // input vector! + LSHMinHash other_lsh = new LSHMinHash(stages, buckets, n); + println(other_lsh.hash(vector)); + + // Moreover, signatures produced by different LSH objects cannot + // be used to compute estimated similarity! + // The solution is to serialize and save the object, so it can be + // reused later... 
+ File tempfile = File.createTempFile("lshobject", ".ser"); + FileOutputStream fout = new FileOutputStream(tempfile); + ObjectOutputStream oos = new ObjectOutputStream(fout); + oos.writeObject(lsh); + oos.close(); + System.out.println( + "LSH object serialized to " + tempfile.getAbsolutePath()); + + FileInputStream fin = new FileInputStream(tempfile); + ObjectInputStream ois = new ObjectInputStream(fin); + LSHMinHash saved_lsh = (LSHMinHash) ois.readObject(); + println(saved_lsh.hash(vector)); + } + + static void println(int[] array) { + System.out.print("["); + for (int v : array) { + System.out.print("" + v + " "); + } + System.out.println("]"); + } +} \ No newline at end of file From 64517a0da8537298aa5a69ddcad01d5d54084a61 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 4 Dec 2015 09:32:44 +0100 Subject: [PATCH 10/46] [maven-release-plugin] prepare release v0.8 --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index c79dd13..744c981 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.8-SNAPSHOT + 0.8 jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - HEAD + v0.8 From 2fea8b4ee37ae816605af7b4b8989277fc7254db Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 4 Dec 2015 09:32:49 +0100 Subject: [PATCH 11/46] [maven-release-plugin] prepare for next development iteration --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index 744c981..d2dc42b 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.8 + 0.9-SNAPSHOT jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - v0.8 + HEAD From 253cb8990829b87922a3c3f13b13e96a8ddc722a Mon Sep 17 00:00:00 2001 From: 
Thibault Debatty Date: Sat, 26 Dec 2015 12:15:29 +0100 Subject: [PATCH 12/46] Update README.md --- README.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index d7d0696..f6f09fb 100644 --- a/README.md +++ b/README.md @@ -7,13 +7,16 @@ Locality Sensitive Hashing (LSH) is a family of hashing methods that tent to pro LSH functions have two main use cases: * Compute the signature of large input vectors. These signatures can be used to quickly estimate the similarity between vectors. -* With a given number of buekcts, bin similar vectors together. +* With a given number of buckets, bin similar vectors together. This library implements Locality Sensitive Hashing (LSH), as described in Leskovec, Rajaraman & Ullman (2014), "Mining of Massive Datasets", Cambridge University Press. Are currently implemented: * MinHash algorithm for Jaccard index; * Super-Bit algorithm for cosine similarity. +* + +The coeficients of hashing functions are randomly choosen when the LSH object is instantiated. You can thus only compare signatures or bucket binning generated by the same LSH object. To reuse your LSH object between executions, you have to serialize it and save it to a file (see below the [example of LSH object serialization](https://github.com/tdebatty/java-LSH#serialization)). ##Download From 782d02b2b6deed630fe45b3029236de92f2ef974 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Sat, 26 Dec 2015 12:16:20 +0100 Subject: [PATCH 13/46] Update README.md --- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index f6f09fb..aca5bf2 100644 --- a/README.md +++ b/README.md @@ -14,7 +14,6 @@ This library implements Locality Sensitive Hashing (LSH), as described in Leskov Are currently implemented: * MinHash algorithm for Jaccard index; * Super-Bit algorithm for cosine similarity. -* The coeficients of hashing functions are randomly choosen when the LSH object is instantiated. 
You can thus only compare signatures or bucket binning generated by the same LSH object. To reuse your LSH object between executions, you have to serialize it and save it to a file (see below the [example of LSH object serialization](https://github.com/tdebatty/java-LSH#serialization)). From f6b6543597f2154c1a8df2fb43683ed13d9d751d Mon Sep 17 00:00:00 2001 From: Nirav Desai Date: Mon, 8 Feb 2016 22:34:27 -0600 Subject: [PATCH 14/46] Expose Set signature in LSHMinHash --- src/main/java/info/debatty/java/lsh/LSHMinHash.java | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/src/main/java/info/debatty/java/lsh/LSHMinHash.java b/src/main/java/info/debatty/java/lsh/LSHMinHash.java index bbf3881..571776b 100644 --- a/src/main/java/info/debatty/java/lsh/LSHMinHash.java +++ b/src/main/java/info/debatty/java/lsh/LSHMinHash.java @@ -24,6 +24,8 @@ package info.debatty.java.lsh; +import java.util.Set; + /** * * @author Thibault Debatty @@ -73,6 +75,10 @@ public LSHMinHash(int s, int b, int n) { public int[] hash(boolean[] vector) { return hashSignature(this.mh.signature(vector)); } + + public int[] hash(Set set) { + return hashSignature(this.mh.signature(set)); + } public int[][] getCoefficients() { return mh.getCoefficients(); From 134577d48893cd82f201f568bf81854e1c026104 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Thu, 19 May 2016 17:50:37 +0200 Subject: [PATCH 15/46] Added support for large vectors. 
Fixes issue #9 --- .../info/debatty/java/lsh/LSHMinHash.java | 28 +++---- .../java/info/debatty/java/lsh/MinHash.java | 13 +-- .../info/debatty/java/lsh/LSHMinHashTest.java | 80 +++++++++++++++++++ 3 files changed, 101 insertions(+), 20 deletions(-) create mode 100644 src/test/java/info/debatty/java/lsh/LSHMinHashTest.java diff --git a/src/main/java/info/debatty/java/lsh/LSHMinHash.java b/src/main/java/info/debatty/java/lsh/LSHMinHash.java index bbf3881..a240f92 100644 --- a/src/main/java/info/debatty/java/lsh/LSHMinHash.java +++ b/src/main/java/info/debatty/java/lsh/LSHMinHash.java @@ -33,31 +33,31 @@ public class LSHMinHash extends LSH { /** * Instantiates a LSH instance that internally uses MinHash, - * with s stages (or bands) and b buckets (per stage), for sets out of a + * with s stages (or bands) and b buckets (per stage), for sets out of a * dictionary of n elements. - * + * * Attention: the number of buckets should be chosen such that we have at * least 100 items per bucket. - * + * * @param s stages * @param b buckets (per stage) * @param n dictionary size */ public LSHMinHash(int s, int b, int n) { super(s, b, n); - + /** * "Mining of Massive Datasets", p.88. 
- * It can be shown that, using MinHash, the probability that the - * signatures of 2 sets with Jaccard similarity s agree in all the - * rows of at least one stage (band), and therefore become a candidate + * It can be shown that, using MinHash, the probability that the + * signatures of 2 sets with Jaccard similarity s agree in all the + * rows of at least one stage (band), and therefore become a candidate * pair, is 1−(1−s^R)^b * where R = signature_size / b (number of rows in a stage/band) - * Thus, the curve that shows the probability that 2 items fall in the - * same bucket for at least one of the stages, as a function of their + * Thus, the curve that shows the probability that 2 items fall in the + * same bucket for at least one of the stages, as a function of their * Jaccard index similarity, has a S shape. - * The threshold (the value of similarity at which the probability of - * becoming a candidate is 1/2) is a function of the number of stages + * The threshold (the value of similarity at which the probability of + * becoming a candidate is 1/2) is a function of the number of stages * (s, or bands b in the book) and the signature size: * threshold ≃ (1/s)^(1/R) * Hence the signature size can be computed as: @@ -69,12 +69,12 @@ public LSHMinHash(int s, int b, int n) { int signature_size = R * s; this.mh = new MinHash(signature_size, n); } - + public int[] hash(boolean[] vector) { return hashSignature(this.mh.signature(vector)); } - - public int[][] getCoefficients() { + + public long[][] getCoefficients() { return mh.getCoefficients(); } } diff --git a/src/main/java/info/debatty/java/lsh/MinHash.java b/src/main/java/info/debatty/java/lsh/MinHash.java index cba021f..6fb2438 100644 --- a/src/main/java/info/debatty/java/lsh/MinHash.java +++ b/src/main/java/info/debatty/java/lsh/MinHash.java @@ -81,7 +81,7 @@ public static int size(double error) { /** * Random a and b coefficients for the random hash functions */ - private int[][] hash_coefs; + private 
long[][] hash_coefs; /** * Dictionary size @@ -154,8 +154,8 @@ public int[] signature(Set set) { for (final int r : list) { - // However, if c has 1 in row r, then for each i = 1, 2, . . . ,n - // set SIG(i, c) to the smaller of the current value of + // However, if c has 1 in row r, then for each i = 1, 2, . . . ,n + // set SIG(i, c) to the smaller of the current value of // SIG(i, c) and hi(r) for (int i = 0; i < n; i++) { sig[i] = Math.min( @@ -207,7 +207,7 @@ private void init(int size, int dict_size) { // h = (a * x) + b // a and b should be randomly generated Random r = new Random(); - hash_coefs = new int[n][2]; + hash_coefs = new long[n][2]; for (int i = 0; i < n; i++) { hash_coefs[i][0] = r.nextInt(dict_size); // a hash_coefs[i][1] = r.nextInt(dict_size); // b @@ -222,10 +222,11 @@ private void init(int size, int dict_size) { * @return the hashed value of x, using ith hash function */ private int h(int i, int x) { - return (hash_coefs[i][0] * x + hash_coefs[i][1]) % dict_size; + return (int) + ((hash_coefs[i][0] * (long) x + hash_coefs[i][1]) % dict_size); } - public int[][] getCoefficients() { + public long[][] getCoefficients() { return hash_coefs; } } diff --git a/src/test/java/info/debatty/java/lsh/LSHMinHashTest.java b/src/test/java/info/debatty/java/lsh/LSHMinHashTest.java new file mode 100644 index 0000000..3e89770 --- /dev/null +++ b/src/test/java/info/debatty/java/lsh/LSHMinHashTest.java @@ -0,0 +1,80 @@ +/* + * The MIT License + * + * Copyright 2016 Thibault Debatty. 
+ * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ +package info.debatty.java.lsh; + +import java.util.Random; +import org.junit.Test; + +/** + * + * @author Thibault Debatty + */ +public class LSHMinHashTest { + + /** + * Test of hash method, of class LSHMinHash. + */ + @Test + public void testHash() { + System.out.println("hash"); + + // proportion of 0's in the vectors + // if the vectors are dense (lots of 1's), the average jaccard similarity + // will be very high (especially for large vectors), and LSH + // won't be able to distinguish them + // as a result, all vectors will be binned in the same bucket... 
+ double sparsity = 0.75; + + // Number and size of vectors + int count = 10000; + int n = 100000; + + int stages = 2; + int buckets = 10; + + // Let's generate some random sets + boolean[][] vectors = new boolean[count][n]; + Random rand = new Random(); + + for (int i = 0; i < count; i++) { + for (int j = 0; j < n; j++) { + vectors[i][j] = rand.nextDouble() > sparsity; + } + } + + LSHMinHash lsh = new LSHMinHash(stages, buckets, n); + int[][] counts = new int[stages][buckets]; + + // Perform hashing + for (boolean[] vector : vectors) { + int[] hash = lsh.hash(vector); + + for (int i = 0; i < hash.length; i++) { + // this will raise an ArrayIndexOutOfBoundsException + // if the bin values are negatives or too large + counts[i][hash[i]]++; + } + } + } +} From 752112bfcf2c50fa7c9869d3020d171c7178fd86 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 20 May 2016 08:02:08 +0200 Subject: [PATCH 16/46] [maven-release-plugin] prepare release v0.9 --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index d2dc42b..03d3467 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.9-SNAPSHOT + 0.9 jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - HEAD + v0.9 From 7519ebcdd6f54b1a2e203d2c899da90264d26dec Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 20 May 2016 08:02:15 +0200 Subject: [PATCH 17/46] [maven-release-plugin] prepare for next development iteration --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index 03d3467..303cb3d 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.9 + 0.10-SNAPSHOT jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - v0.9 + HEAD 
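The change from `int` to `long` coefficients in PATCH 15/46 can be illustrated with a minimal sketch. It assumes the hash has the form `(a * x + b) % dict_size` shown in `MinHash.h`; the class `HashOverflowDemo` and its two methods are hypothetical names used only for this illustration and are not part of java-LSH. With a dictionary of 100000 elements (the vector size used in `LSHMinHashTest`), the 32-bit product can wrap around and give a negative result, which would then be an invalid bucket index.

```java
/**
 * Hypothetical demo class (not part of java-LSH): compares 32-bit and
 * 64-bit arithmetic for a linear hash h(x) = (a * x + b) % dictSize.
 */
public class HashOverflowDemo {

    /** 32-bit arithmetic: a * x can wrap around and become negative. */
    public static int hashInt(int a, int x, int b, int dictSize) {
        return (a * x + b) % dictSize;
    }

    /** 64-bit arithmetic, mirroring the long cast used after the fix. */
    public static int hashLong(long a, int x, long b, int dictSize) {
        return (int) ((a * (long) x + b) % dictSize);
    }

    public static void main(String[] args) {
        int dictSize = 100000; // assumed: same vector size as in the test

        // 50000 * 50000 = 2.5e9 exceeds Integer.MAX_VALUE, so the int
        // product wraps around and the remainder is negative.
        System.out.println("int  arithmetic: "
                + hashInt(50000, 50000, 0, dictSize));

        // The long product does not overflow, so the result stays
        // inside [0, dictSize).
        System.out.println("long arithmetic: "
                + hashLong(50000, 50000, 0, dictSize));
    }
}
```

A negative value from the `int` version would presumably propagate into the signature and from there into the bucket index, which is the failure that the `counts[i][hash[i]]++` line in the test is meant to catch.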
From 9c991f231445a64970d23974165942aad84ab1e0 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 20 May 2016 11:01:59 +0200 Subject: [PATCH 18/46] Added test for parameter values --- .../java/info/debatty/java/lsh/MinHash.java | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/src/main/java/info/debatty/java/lsh/MinHash.java b/src/main/java/info/debatty/java/lsh/MinHash.java index 6fb2438..ad34081 100644 --- a/src/main/java/info/debatty/java/lsh/MinHash.java +++ b/src/main/java/info/debatty/java/lsh/MinHash.java @@ -201,6 +201,25 @@ public double error() { } private void init(int size, int dict_size) { + if (size <= 0) { + throw new InvalidParameterException( + "Signature size should be positive"); + } + + if (dict_size <= 0) { + throw new InvalidParameterException( + "Dictionary size (or vector size) should be positive"); + } + + // In function h(i, x) the largest value could be + // dict_size * dict_size + dict_size + // throw an error if dict_size * dict_size + dict_size > Long.MAX_VALUE + if (dict_size > (Long.MAX_VALUE - dict_size) / dict_size) { + throw new InvalidParameterException( + "Dictionary size (or vector size) is too big and will " + + "cause a multiplication overflow"); + } + this.dict_size = dict_size; this.n = size; From 56328ffe11e5f091f4ca4f9a66658c7117680432 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 20 May 2016 11:12:36 +0200 Subject: [PATCH 19/46] Cleaned MinHash code --- README.md | 2 +- .../java/info/debatty/java/lsh/MinHash.java | 88 +++++++++++++------ .../java/lsh/examples/LSHMinHashExample.java | 2 +- .../java/lsh/examples/MinHashExample.java | 2 +- 4 files changed, 64 insertions(+), 30 deletions(-) diff --git a/README.md b/README.md index aca5bf2..b089b5e 100644 --- a/README.md +++ b/README.md @@ -187,7 +187,7 @@ public class LSHMinHashExample { int[] hash2 = hashes[j]; // We compute the similarity between each pair of sets - double similarity = MinHash.JaccardIndex(vector1, vector2); + 
double similarity = MinHash.jaccardIndex(vector1, vector2); // We count the number of pairs with similarity 0.1, 0.2, // 0.3, etc. diff --git a/src/main/java/info/debatty/java/lsh/MinHash.java b/src/main/java/info/debatty/java/lsh/MinHash.java index ad34081..937f2f2 100644 --- a/src/main/java/info/debatty/java/lsh/MinHash.java +++ b/src/main/java/info/debatty/java/lsh/MinHash.java @@ -28,7 +28,15 @@ */ public class MinHash implements Serializable { - public static double JaccardIndex(Set s1, Set s2) { + /** + * Compute the jaccard index between two sets. + * @param s1 + * @param s2 + * @return + */ + public static double jaccardIndex( + final Set s1, final Set s2) { + Set intersection = new HashSet(s1); intersection.retainAll(s2); @@ -42,14 +50,27 @@ public static double JaccardIndex(Set s1, Set s2) { return (double) intersection.size() / union.size(); } - public static double JaccardIndex(boolean[] s1, boolean[] s2) { + /** + * Compute the exact jaccard index between two sets, represented as + * arrays of booleans. + * @param s1 + * @param s2 + * @return + */ + public static double jaccardIndex(final boolean[] s1, final boolean[] s2) { if (s1.length != s2.length) { throw new InvalidParameterException("sets must be same size!"); } - return JaccardIndex(Convert2Set(s1), Convert2Set(s2)); + return jaccardIndex(convert2Set(s1), convert2Set(s2)); } - public static Set Convert2Set(boolean[] array) { + /** + * Convert a set represented as an array of booleans to a set of integer. + * + * @param array + * @return + */ + public static Set convert2Set(final boolean[] array) { Set set = new TreeSet(); for (int i = 0; i < array.length; i++) { if (array[i]) { @@ -61,12 +82,12 @@ public static Set Convert2Set(boolean[] array) { /** * Computes the size of the signature required to achieve a given error in - * similarity estimation (1 / error^2) + * similarity estimation. 
(1 / error^2) * * @param error * @return size of the signature */ - public static int size(double error) { + public static int size(final double error) { if (error < 0 && error > 1) { throw new IllegalArgumentException("error should be in [0 .. 1]"); } @@ -74,58 +95,61 @@ public static int size(double error) { } /** - * Signature size + * Signature size. */ private int n; /** - * Random a and b coefficients for the random hash functions + * Random a and b coefficients for the random hash functions. */ private long[][] hash_coefs; /** - * Dictionary size + * Dictionary size (is also the size of vectors if the sets are provided + * as vectors). */ private int dict_size; /** * Initializes hash functions to compute MinHash signatures for sets built - * from a dictionary of dict_size elements + * from a dictionary of dict_size elements. * * @param size the number of hash functions (and the size of resulting * signatures) * @param dict_size */ - public MinHash(int size, int dict_size) { + public MinHash(final int size, final int dict_size) { init(size, dict_size); } /** * Initializes hash function to compute MinHash signatures for sets built * from a dictionary of dict_size elements, with a given similarity - * estimation error + * estimation error. * * @param error * @param dict_size */ - public MinHash(double error, int dict_size) { + public MinHash(final double error, final int dict_size) { init(size(error), dict_size); } /** * Computes the signature for this set The input set is represented as an - * vector of booleans For example the array [true, false, true, true, false] + * vector of booleans. 
+ * For example the array [true, false, true, true, false] * corresponds to the set {0, 2, 3} * * @param vector * @return the signature */ - public int[] signature(boolean[] vector) { + public final int[] signature(final boolean[] vector) { if (vector.length != dict_size) { - throw new IllegalArgumentException("Size of array should be dict_size"); + throw new IllegalArgumentException( + "Size of array should be dict_size"); } - return signature(Convert2Set(vector)); + return signature(convert2Set(vector)); } /** @@ -134,7 +158,7 @@ public int[] signature(boolean[] vector) { * @param set * @return the signature */ - public int[] signature(Set set) { + public final int[] signature(final Set set) { int[] sig = new int[n]; for (int i = 0; i < n; i++) { @@ -147,7 +171,7 @@ public int[] signature(Set set) { // if (!set.contains(r)) { // continue; // } - // Loop over 'true' values, instead of loop over all values of dictionary + // Loop over true values, instead of loop over all values of dictionary // to speedup computation final List list = new ArrayList(set); Collections.sort(list); @@ -169,16 +193,17 @@ public int[] signature(Set set) { /** * Computes an estimation of Jaccard similarity (the number of elements in - * common) between two sets, using the MinHash signatures of these two sets + * common) between two sets, using the MinHash signatures of these two sets. 
* * @param sig1 MinHash signature of set1 * @param sig2 MinHash signature of set2 (produced using the same * coefficients) * @return the estimated similarity */ - public double similarity(int[] sig1, int[] sig2) { + public final double similarity(final int[] sig1, final int[] sig2) { if (sig1.length != sig2.length) { - throw new IllegalArgumentException("Size of signatures should be the same"); + throw new IllegalArgumentException( + "Size of signatures should be the same"); } double sim = 0; @@ -192,15 +217,20 @@ public double similarity(int[] sig1, int[] sig2) { } /** - * Computes the expected error of similarity computed using signatures + * Computes the expected error of similarity computed using signatures. * * @return the expected error */ - public double error() { + public final double error() { return 1.0 / Math.sqrt(n); } - private void init(int size, int dict_size) { + /** + * Compute has function coefficients. + * @param size + * @param dict_size + */ + private void init(final int size, final int dict_size) { if (size <= 0) { throw new InvalidParameterException( "Signature size should be positive"); @@ -240,12 +270,16 @@ private void init(int size, int dict_size) { * @param x * @return the hashed value of x, using ith hash function */ - private int h(int i, int x) { + private int h(final int i, final int x) { return (int) ((hash_coefs[i][0] * (long) x + hash_coefs[i][1]) % dict_size); } - public long[][] getCoefficients() { + /** + * Get the coefficients used by hash function hi. 
+ * @return + */ + public final long[][] getCoefficients() { return hash_coefs; } } diff --git a/src/main/java/info/debatty/java/lsh/examples/LSHMinHashExample.java b/src/main/java/info/debatty/java/lsh/examples/LSHMinHashExample.java index 1d6d42f..3fe635d 100644 --- a/src/main/java/info/debatty/java/lsh/examples/LSHMinHashExample.java +++ b/src/main/java/info/debatty/java/lsh/examples/LSHMinHashExample.java @@ -95,7 +95,7 @@ public static void main(String[] args) { int[] hash2 = hashes[j]; // We compute the similarity between each pair of sets - double similarity = MinHash.JaccardIndex(vector1, vector2); + double similarity = MinHash.jaccardIndex(vector1, vector2); // We count the number of pairs with similarity 0.1, 0.2, // 0.3, etc. diff --git a/src/main/java/info/debatty/java/lsh/examples/MinHashExample.java b/src/main/java/info/debatty/java/lsh/examples/MinHashExample.java index adb1957..776b55a 100644 --- a/src/main/java/info/debatty/java/lsh/examples/MinHashExample.java +++ b/src/main/java/info/debatty/java/lsh/examples/MinHashExample.java @@ -53,6 +53,6 @@ public static void main(String[] args) { System.out.println("Signature similarity: " + minhash.similarity(sig1, sig2)); System.out.println("Real similarity (Jaccard index)" + - MinHash.JaccardIndex(MinHash.Convert2Set(vector1), set2)); + MinHash.jaccardIndex(MinHash.convert2Set(vector1), set2)); } } From e154465d7a87313ca73d2aaa3c7cf34bfb00d451 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 20 May 2016 11:16:03 +0200 Subject: [PATCH 20/46] Cleaned LSHMinHash code --- .../info/debatty/java/lsh/LSHMinHash.java | 23 ++++++++++++------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/src/main/java/info/debatty/java/lsh/LSHMinHash.java b/src/main/java/info/debatty/java/lsh/LSHMinHash.java index 88e5572..8857223 100644 --- a/src/main/java/info/debatty/java/lsh/LSHMinHash.java +++ b/src/main/java/info/debatty/java/lsh/LSHMinHash.java @@ -24,14 +24,13 @@ package info.debatty.java.lsh; 
-import java.util.Set; - /** * * @author Thibault Debatty */ public class LSHMinHash extends LSH { private final MinHash mh; + private static final double THRESHOLD = 0.5; /** * Instantiates a LSH instance that internally uses MinHash, @@ -45,7 +44,7 @@ public class LSHMinHash extends LSH { * @param b buckets (per stage) * @param n dictionary size */ - public LSHMinHash(int s, int b, int n) { + public LSHMinHash(final int s, final int b, final int n) { super(s, b, n); /** @@ -66,17 +65,25 @@ public LSHMinHash(int s, int b, int n) { * R = ln(1/s) / ln(threshold) * signature_size = R * b */ - double threshold = 0.5; - int R = (int) Math.ceil(Math.log(1.0/s) / Math.log(threshold)) + 1; - int signature_size = R * s; + int r = (int) Math.ceil(Math.log(1.0 / s) / Math.log(THRESHOLD)) + 1; + int signature_size = r * s; this.mh = new MinHash(signature_size, n); } - public int[] hash(boolean[] vector) { + /** + * Bin this vector to corresponding buckets. + * @param vector + * @return + */ + public final int[] hash(final boolean[] vector) { return hashSignature(this.mh.signature(vector)); } - public long[][] getCoefficients() { + /** + * Get the coefficients used by internal hashing functions. + * @return + */ + public final long[][] getCoefficients() { return mh.getCoefficients(); } } From e84b96041439216d9412f456b9aef7f2d71871c1 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 20 May 2016 11:27:56 +0200 Subject: [PATCH 21/46] Cleaned SuperBit code --- .../java/info/debatty/java/lsh/SuperBit.java | 208 ++++++++++-------- 1 file changed, 115 insertions(+), 93 deletions(-) diff --git a/src/main/java/info/debatty/java/lsh/SuperBit.java b/src/main/java/info/debatty/java/lsh/SuperBit.java index 21e5985..a874223 100644 --- a/src/main/java/info/debatty/java/lsh/SuperBit.java +++ b/src/main/java/info/debatty/java/lsh/SuperBit.java @@ -33,67 +33,73 @@ * Implementation of Super-Bit Locality-Sensitive Hashing. * Super-Bit is an improvement of Random Projection LSH. 
* It computes an estimation of cosine similarity. - * + * * Super-Bit Locality-Sensitive Hashing * Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, Qi Tian * http://papers.nips.cc/paper/4847-super-bit-locality-sensitive-hashing.pdf * Advances in Neural Information Processing Systems 25, 2012 - * + * * Supported input types: * - SparseIntegerVector * - double[] * - others to come... - * + * * @author Thibault Debatty */ public class SuperBit implements Serializable { - + private double[][] hyperplanes; - + private static final int DEFAULT_CODE_LENGTH = 10000; + /** * Initialize SuperBit algorithm. - * Super-Bit depth N must be [1 .. d] and number of Super-Bit L in [1 .. - * The resulting code length K = N * L + * Super-Bit depth n must be [1 .. d] and number of Super-Bit l in [1 .. + * The resulting code length k = n * l * The K vectors are orthogonalized in L batches of N vectors - * + * * @param d data space dimension - * @param N Super-Bit depth [1 .. d] - * @param L number of Super-Bit [1 .. + * @param n Super-Bit depth [1 .. d] + * @param l number of Super-Bit [1 ..
*/ - public SuperBit(int d, int N, int L) { + public SuperBit(final int d, final int n, final int l) { if (d <= 0) { throw new IllegalArgumentException("Dimension d must be >= 1"); } - - if (N < 1 || N > d) { - throw new IllegalArgumentException("Super-Bit depth N must be 1 <= N <= d"); + + if (n < 1 || n > d) { + throw new IllegalArgumentException( + "Super-Bit depth N must be 1 <= N <= d"); } - - if (L < 1) { - throw new IllegalArgumentException("Number of Super-Bit L must be >= 1"); + + if (l < 1) { + throw new IllegalArgumentException( + "Number of Super-Bit L must be >= 1"); } - - // Input: Data space dimension d, Super-Bit depth 1 <= N <= d, number of Super-Bit L >= 1, + + // Input: Data space dimension d, Super-Bit depth 1 <= N <= d, + // number of Super-Bit L >= 1, // resulting code length K = N * L - - // Generate a random matrix H with each element sampled independently from the normal distribution - // N (0, 1), with each column normalized to unit length. Denote H = [v1, v2, ..., vK]. - int K = N * L; - - double[][] v = new double[K][d]; + + // Generate a random matrix H with each element sampled independently + // from the normal distribution + // N (0, 1), with each column normalized to unit length. + // Denote H = [v1, v2, ..., vK]. 
+ int code_length = n * l; + + double[][] v = new double[code_length][d]; Random rand = new Random(); - - for (int i = 0; i < K; i++) { + + for (int i = 0; i < code_length; i++) { double[] vector = new double[d]; for (int j = 0; j < d; j++) { vector[j] = rand.nextGaussian(); } - + normalize(vector); v[i] = vector; } - - + + // for i = 0 to L - 1 do // for j = 1 to N do // w_{iN+j} = v_{iN+j} @@ -104,130 +110,146 @@ public SuperBit(int d, int N, int L) { // end for // end for // Output: H˜ = [w1, w2, ..., wK] - - double[][] w = new double[K][d]; - for (int i = 0; i <= L-1; i++) { - for (int j = 1; j <= N; j++) { + + double[][] w = new double[code_length][d]; + for (int i = 0; i <= l - 1; i++) { + for (int j = 1; j <= n; j++) { java.lang.System.arraycopy( - v[i*N+j-1], + v[i * n + j - 1], 0, - w[i*N+j-1], + w[i * n + j - 1], 0, d); - - for (int k = 1; k <= (j-1); k++) { - w[i*N+j-1] = sub( - w[i*N+j-1], - product(dotProduct(w[i*N+k-1], v[i*N+j-1]), w[i*N+k-1])); + + for (int k = 1; k <= (j - 1); k++) { + w[i * n + j - 1] = sub( + w[i * n + j - 1], + product( + dotProduct( + w[i * n + k - 1], + v[ i * n + j - 1]), + w[i * n + k - 1])); } - - normalize(w[i*N+j-1]); - + + normalize(w[i * n + j - 1]); + } } - + this.hyperplanes = w; } - + /** * Initialize SuperBit algorithm. * With code length K = 10000 * The K vectors are orthogonalized in d batches of 10000/d vectors * The resulting mean error is 0.01 - * @param d + * @param d */ - public SuperBit(int d) { - this(d, d, 10000/d); + public SuperBit(final int d) { + this(d, d, DEFAULT_CODE_LENGTH / d); } - + + /** + * Initialize SuperBit algorithm without parameters + * (used only for serialization). + */ public SuperBit() { - + } - + /** - * Compute the signature of this vector + * Compute the signature of this vector. 
* @param vector - * @return + * @return */ - - public boolean[] signature(SparseIntegerVector vector) { + public final boolean[] signature(final SparseIntegerVector vector) { boolean[] sig = new boolean[this.hyperplanes.length]; for (int i = 0; i < this.hyperplanes.length; i++) { sig[i] = (vector.dotProduct(this.hyperplanes[i]) >= 0); } return sig; } - - public boolean[] signature(SparseDoubleVector vector) { + + /** + * Compute the signature of this vector. + * @param vector + * @return + */ + public final boolean[] signature(final SparseDoubleVector vector) { boolean[] sig = new boolean[this.hyperplanes.length]; for (int i = 0; i < this.hyperplanes.length; i++) { sig[i] = (vector.dotProduct(this.hyperplanes[i]) >= 0); } return sig; } - + /** - * Compute the signature of this vector + * Compute the signature of this vector. * @param vector - * @return + * @return */ - public boolean[] signature(double[] vector) { + public final boolean[] signature(final double[] vector) { boolean[] sig = new boolean[this.hyperplanes.length]; for (int i = 0; i < this.hyperplanes.length; i++) { sig[i] = (dotProduct(this.hyperplanes[i], vector) >= 0); } return sig; } - + /** * Compute the similarity between two signature, which is also an * estimation of the cosine similarity between the two vectors. - * + * * @param sig1 * @param sig2 * @return estimated cosine similarity */ - public double similarity(boolean[] sig1, boolean[] sig2) { - + public final double similarity(final boolean[] sig1, final boolean[] sig2) { + double E = 0; for (int i = 0; i < sig1.length; i++) { E += (sig1[i] == sig2[i] ? 1 : 0); } - + E = E / sig1.length; - + return Math.cos((1 - E) * Math.PI); } - - public double[][] getHyperplanes() { + + /** + * Get the hyperplanes coefficients used to compute signatures. 
+ * @return + */ + public final double[][] getHyperplanes() { return this.hyperplanes; } - + /* ---------------------- STATIC ---------------------- */ - + /** * Computes the cosine similarity, computed as v1 dot v2 / (|v1| * |v2|). * Cosine similarity of two vectors is the cosine of the angle between them. * It ranges between -1 and +1 - * + * * @param v1 * @param v2 - * @return + * @return */ - public static double cosineSimilarity(double[]v1, double[] v2) { - + public static double cosineSimilarity(final double[]v1, final double[] v2) { + return dotProduct(v1, v2) / (norm(v1) * norm(v2)); } - - private static double[] product(double x, double[] v) { + + private static double[] product(final double x, final double[] v) { double[] r = new double[v.length]; for (int i = 0; i < v.length; i++) { r[i] = x * v[i]; } return r; } - - private static double[] sub(double[] a, double[] b) { + + private static double[] sub(final double[] a, final double[] b) { double[] r = new double[a.length]; for (int i = 0; i < a.length; i++) { r[i] = a[i] - b[i]; @@ -235,36 +257,36 @@ private static double[] sub(double[] a, double[] b) { return r; } - private static void normalize(double[] vector) { + private static void normalize(final double[] vector) { double norm = norm(vector); for (int i = 0; i < vector.length; i++) { - vector[i] = vector[i]/ norm; + vector[i] = vector[i] / norm; } - + } - + /** - * Returns the norm L2 : sqrt(Sum_i( v_i^2)) + * Returns the norm L2. 
sqrt(sum_i(v_i^2)) * @param v - * @return + * @return */ - private static double norm(double[] v) { + private static double norm(final double[] v) { double agg = 0; - + for (int i = 0; i < v.length; i++) { agg += (v[i] * v[i]); } - + return Math.sqrt(agg); } - - private static double dotProduct(double[] v1, double[] v2) { + + private static double dotProduct(final double[] v1, final double[] v2) { double agg = 0; - + for (int i = 0; i < v1.length; i++) { agg += (v1[i] * v2[i]); } - + return agg; } } From 47eb99ee87cc8dd80a94132fbf54d62b2bf40db3 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 20 May 2016 11:31:14 +0200 Subject: [PATCH 22/46] Cleaned SuperBit code --- src/main/java/info/debatty/java/lsh/SuperBit.java | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/src/main/java/info/debatty/java/lsh/SuperBit.java b/src/main/java/info/debatty/java/lsh/SuperBit.java index a874223..a1dd527 100644 --- a/src/main/java/info/debatty/java/lsh/SuperBit.java +++ b/src/main/java/info/debatty/java/lsh/SuperBit.java @@ -207,14 +207,16 @@ public final boolean[] signature(final double[] vector) { */ public final double similarity(final boolean[] sig1, final boolean[] sig2) { - double E = 0; + double agg = 0; for (int i = 0; i < sig1.length; i++) { - E += (sig1[i] == sig2[i] ? 
1 : 0); + if (sig1[i] == sig2[i]) { + agg++; + } } - E = E / sig1.length; + agg = agg / sig1.length; - return Math.cos((1 - E) * Math.PI); + return Math.cos((1 - agg) * Math.PI); } /** From bd3bc45e3faa8ce21171a0b8095251467f551441 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 20 May 2016 13:41:31 +0200 Subject: [PATCH 23/46] Code clean in abstract LSH --- src/main/java/info/debatty/java/lsh/LSH.java | 128 +++++++----------- .../info/debatty/java/lsh/LSHMinHash.java | 2 +- .../info/debatty/java/lsh/LSHSuperBit.java | 28 ++-- 3 files changed, 65 insertions(+), 93 deletions(-) diff --git a/src/main/java/info/debatty/java/lsh/LSH.java b/src/main/java/info/debatty/java/lsh/LSH.java index 746c0c6..09ed39d 100644 --- a/src/main/java/info/debatty/java/lsh/LSH.java +++ b/src/main/java/info/debatty/java/lsh/LSH.java @@ -4,72 +4,39 @@ /** * Implementation of Locality Sensitive Hashing (LSH) principle, as described in - * Leskovec, Rajaraman & Ullman (2014), "Mining of Massive Datasets", + * Leskovec, Rajaraman & Ullman (2014), "Mining of Massive Datasets", * Cambridge University Press. - * + * * @author Thibault Debatty http://www.debatty.info */ public abstract class LSH implements Serializable { - + protected static final long LARGE_PRIME = 433494437; - - protected int s = 3; - protected int b = 10; - protected int n; - + private static final int DEFAULT_STAGES = 3; + private static final int DEFAULT_BUCKETS = 10; + + private int stages = DEFAULT_STAGES; + private int buckets = DEFAULT_BUCKETS; + /** - * Instantiates a LSH instance with s stages (or bands) and b buckets (per + * Instantiates a LSH instance with s stages (or bands) and b buckets (per * stage), in a space with n dimensions.
- * - * @param s stages - * @param b buckets (per stage) - * @param n dimensionality - */ - public LSH(int s, int b, int n) { - this.s = s; - this.b = b; - this.n = n; - - } - - public LSH() { - - } - - /** - * - * @return the number of stages (bands) + * + * @param stages stages + * @param buckets buckets (per stage) */ - public int getS() { - return s; + public LSH(final int stages, final int buckets) { + this.stages = stages; + this.buckets = buckets; } /** - * Set the number of stages (also sometimes called bands). - * Default value is 3 - * @param s + * Instantiate an empty LSH instance (useful only for serialization). */ - public void setS(int s) { - this.s = s; - } + public LSH() { - /** - * - * @return the number of buckets (per stage) - */ - public int getB() { - return b; } - /** - * Set the number of buckets per stage. - * Default value is 10. - * @param b - */ - public void setB(int b) { - this.b = b; - } - /** * Hash a signature. * The signature is divided in s stages (or bands). Each stage is hashed to @@ -77,23 +44,25 @@ public void setB(int b) { * @param signature * @return An vector of s integers (between 0 and b-1) */ - public int[] hashSignature(int[] signature) { - + public final int[] hashSignature(final int[] signature) { + // Create an accumulator for each stage - int[] r = new int[s]; - + int[] hash = new int[stages]; + // Number of rows per stage - int rows = signature.length / s; - + int rows = signature.length / stages; + for (int i = 0; i < signature.length; i++) { - int stage = Math.min(i / rows, s-1); - r[stage] = (int) ((r[stage] + (long) signature[i] * LARGE_PRIME) % b); - + int stage = Math.min(i / rows, stages - 1); + hash[stage] = (int) + ((hash[stage] + (long) signature[i] * LARGE_PRIME) + % buckets); + } - - return r; + + return hash; } - + /** * Hash a signature. * The signature is divided in s stages (or bands). 
Each stage is hashed to @@ -101,35 +70,38 @@ public int[] hashSignature(int[] signature) { * @param signature * @return An vector of s integers (between 0 and b-1) */ - public int[] hashSignature(boolean[] signature) { + public final int[] hashSignature(final boolean[] signature) { /*int hashCode = Arrays.hashCode(signature); if (hashCode < 0) { hashCode += Integer.MAX_VALUE; } return new int[] { hashCode % b};*/ - + // Create an accumulator for each stage - long[] acc = new long[s]; - for (int i = 0; i < s; i++) { + long[] acc = new long[stages]; + for (int i = 0; i < stages; i++) { acc[i] = 0; } - + // Number of rows per stage - int rows = signature.length / s; - + int rows = signature.length / stages; + for (int i = 0; i < signature.length; i++) { - long v = (signature[i] ? (i+1) * LARGE_PRIME : 0); - + long v = 0; + if (signature[i]) { + v = (i + 1) * LARGE_PRIME; + } + // current stage - int j = Math.min(i / rows, s-1); + int j = Math.min(i / rows, stages - 1); acc[j] = (acc[j] + v) % Integer.MAX_VALUE; } - - int[] r = new int[s]; - for (int i = 0; i < s; i++) { - r[i] = (int) (acc[i] % b); + + int[] r = new int[stages]; + for (int i = 0; i < stages; i++) { + r[i] = (int) (acc[i] % buckets); } - + return r; } } diff --git a/src/main/java/info/debatty/java/lsh/LSHMinHash.java b/src/main/java/info/debatty/java/lsh/LSHMinHash.java index 8857223..152fee4 100644 --- a/src/main/java/info/debatty/java/lsh/LSHMinHash.java +++ b/src/main/java/info/debatty/java/lsh/LSHMinHash.java @@ -45,7 +45,7 @@ public class LSHMinHash extends LSH { * @param n dictionary size */ public LSHMinHash(final int s, final int b, final int n) { - super(s, b, n); + super(s, b); /** * "Mining of Massive Datasets", p.88. 
diff --git a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java index d07da88..5816b8c 100644 --- a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java +++ b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java @@ -40,20 +40,20 @@ public class LSHSuperBit extends LSH implements Serializable { * b buckets (per stage), in a space with n dimensions. Input vectors with * a high cosine similarity have a high probability of falling in the same * bucket... - * + * * Supported input types: * - double[] * - sparseIntegerVector * - int[] * - others to come... - * + * * @param s stages * @param b buckets (per stage) * @param n dimensionality */ public LSHSuperBit(int s, int b, int n) throws Exception { - super(s, b, n); - + super(s, b); + // SuberBit code length int K = s * b / 2; int superbit; // superbit value @@ -62,42 +62,42 @@ public LSHSuperBit(int s, int b, int n) throws Exception { break; } } - + if (superbit == 0) { throw new Exception("Superbit is 0 with parameters: s=" + s + " b=" + b + " n=" + n); } - + this.sb = new SuperBit(n, superbit, K/superbit); } - + public LSHSuperBit() { - + } /** * Hash (bin) a vector in s stages into b buckets * @param vector - * @return + * @return */ public int[] hash(double[] vector) { return hashSignature(sb.signature(vector)); } - + public int[] hash(SparseIntegerVector vector){ return hashSignature(sb.signature(vector)); } - + public int[] hash(SparseDoubleVector vector) { return hashSignature(sb.signature(vector)); } - + /** * Hash (bin) a vector in s stages into b buckets * @param vector - * @return + * @return */ public int[] hash(int[] vector) { - + double[] d = new double[vector.length]; for (int i = 0; i < vector.length; i++) { d[i] = (double) vector[i]; From 676961655de9bb613553fa714700b5af576c8c5f Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 20 May 2016 13:43:16 +0200 Subject: [PATCH 24/46] Code clean in abstract LSH --- 
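Note (not part of the commit): with the commented-out `Arrays.hashCode` variant removed, `hashSignature(boolean[])` keeps only the per-stage accumulator logic introduced in patch 23. The following is a minimal standalone sketch of that logic, assuming the `LARGE_PRIME` value shown in patch 23. The class name `BandHashSketch`, the static method signature, and the argument check are hypothetical and added here for illustration only; the library method is an instance method that reads `stages` and `buckets` from its fields.

```java
// Hypothetical standalone sketch; the class name and the argument check
// are assumptions made for illustration and are not part of the patch.
final class BandHashSketch {

    // Same constant as LSH.LARGE_PRIME in patch 23.
    static final long LARGE_PRIME = 433494437L;

    /**
     * Hash a boolean signature into one bucket index per stage (band).
     * Mirrors the accumulator loop kept by this patch.
     */
    static int[] hashSignature(
            final boolean[] signature, final int stages, final int buckets) {

        // Guard added for this sketch only: the loop below divides by
        // rows, which would be 0 if the signature were shorter than the
        // number of stages.
        if (stages <= 0 || buckets <= 0 || signature.length < stages) {
            throw new IllegalArgumentException(
                    "Need stages > 0, buckets > 0 and length >= stages");
        }

        // One accumulator per stage
        long[] acc = new long[stages];

        // Number of rows per stage
        int rows = signature.length / stages;

        for (int i = 0; i < signature.length; i++) {
            long v = 0;
            if (signature[i]) {
                v = (i + 1) * LARGE_PRIME;
            }

            // Leftover rows (when length is not a multiple of stages)
            // are folded into the last stage.
            int stage = Math.min(i / rows, stages - 1);
            acc[stage] = (acc[stage] + v) % Integer.MAX_VALUE;
        }

        int[] hash = new int[stages];
        for (int s = 0; s < stages; s++) {
            hash[s] = (int) (acc[s] % buckets);
        }
        return hash;
    }

    public static void main(final String[] args) {
        boolean[] signature = {true, false, true, false};
        // 2 stages of 2 rows each, 10 buckets per stage: prints [7, 1]
        System.out.println(
                java.util.Arrays.toString(hashSignature(signature, 2, 10)));
    }
}
```

Because each set bit contributes `(i + 1) * LARGE_PRIME`, the position of a bit inside its stage changes the bucket, not just the number of set bits.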
src/main/java/info/debatty/java/lsh/LSH.java | 5 ----- 1 file changed, 5 deletions(-) diff --git a/src/main/java/info/debatty/java/lsh/LSH.java b/src/main/java/info/debatty/java/lsh/LSH.java index 09ed39d..aac78ae 100644 --- a/src/main/java/info/debatty/java/lsh/LSH.java +++ b/src/main/java/info/debatty/java/lsh/LSH.java @@ -71,11 +71,6 @@ public final int[] hashSignature(final int[] signature) { * @return An vector of s integers (between 0 and b-1) */ public final int[] hashSignature(final boolean[] signature) { - /*int hashCode = Arrays.hashCode(signature); - if (hashCode < 0) { - hashCode += Integer.MAX_VALUE; - } - return new int[] { hashCode % b};*/ // Create an accumulator for each stage long[] acc = new long[stages]; From 2151e5acc444d5d4fa2ef92fef5939eb76643244 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 20 May 2016 13:49:32 +0200 Subject: [PATCH 25/46] Code clean in LSHSuperBit --- .../info/debatty/java/lsh/LSHSuperBit.java | 60 ++++++++++++------- 1 file changed, 39 insertions(+), 21 deletions(-) diff --git a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java index 5816b8c..63ac41f 100644 --- a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java +++ b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java @@ -36,10 +36,10 @@ public class LSHSuperBit extends LSH implements Serializable { private SuperBit sb; /** - * LSH implementation relying on SuperBit, to bin vectors s times (stages) in - * b buckets (per stage), in a space with n dimensions. Input vectors with - * a high cosine similarity have a high probability of falling in the same - * bucket... + * LSH implementation relying on SuperBit, to bin vectors s times (stages) + * in b buckets (per stage), in a space with n dimensions. Input vectors + * with a high cosine similarity have a high probability of falling in the + * same bucket... 
* * Supported input types: * - double[] @@ -47,56 +47,74 @@ public class LSHSuperBit extends LSH implements Serializable { * - int[] * - others to come... * - * @param s stages - * @param b buckets (per stage) - * @param n dimensionality + * @param stages stages + * @param buckets buckets (per stage) + * @param dimensions dimensionality + * @throws java.lang.Exception if parameters produce a superbit value 0 */ - public LSHSuperBit(int s, int b, int n) throws Exception { - super(s, b); + public LSHSuperBit( + final int stages, final int buckets, final int dimensions) + throws Exception { + + super(stages, buckets); // SuberBit code length - int K = s * b / 2; + int code_length = stages * buckets / 2; int superbit; // superbit value - for (superbit = n; superbit >= 1; superbit--) { - if (K % superbit == 0) { + for (superbit = dimensions; superbit >= 1; superbit--) { + if (code_length % superbit == 0) { break; } } if (superbit == 0) { - throw new Exception("Superbit is 0 with parameters: s=" + s + " b=" + b + " n=" + n); + throw new Exception( + "Superbit is 0 with parameters: s=" + stages + + " b=" + buckets + " n=" + dimensions); } - this.sb = new SuperBit(n, superbit, K/superbit); + this.sb = new SuperBit(dimensions, superbit, code_length / superbit); } + /** + * Empty constructor, used only for serialization. + */ public LSHSuperBit() { - } /** - * Hash (bin) a vector in s stages into b buckets + * Hash (bin) a vector in s stages into b buckets. * @param vector * @return */ - public int[] hash(double[] vector) { + public final int[] hash(final double[] vector) { return hashSignature(sb.signature(vector)); } - public int[] hash(SparseIntegerVector vector){ + /** + * Hash (bin) a vector in s stages into b buckets. + * @param vector + * @return + */ + public final int[] hash(final SparseIntegerVector vector){ return hashSignature(sb.signature(vector)); } - public int[] hash(SparseDoubleVector vector) { + /** + * Hash (bin) a vector in s stages into b buckets. 
+ * @param vector + * @return + */ + public final int[] hash(final SparseDoubleVector vector) { return hashSignature(sb.signature(vector)); } /** - * Hash (bin) a vector in s stages into b buckets + * Hash (bin) a vector in s stages into b buckets. * @param vector * @return */ - public int[] hash(int[] vector) { + public final int[] hash(final int[] vector) { double[] d = new double[vector.length]; for (int i = 0; i < vector.length; i++) { From 9b286537231d4627b2db80ce5addf2dfe8ce99db Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Fri, 20 May 2016 14:06:26 +0200 Subject: [PATCH 26/46] Added maven checkstyle --- checkstyle.xml | 228 ++++++++++++++++++ pom.xml | 50 +++- .../info/debatty/java/lsh/LSHSuperBit.java | 2 +- 3 files changed, 267 insertions(+), 13 deletions(-) create mode 100644 checkstyle.xml diff --git a/checkstyle.xml b/checkstyle.xml new file mode 100644 index 0000000..4953b6d --- /dev/null +++ b/checkstyle.xml @@ -0,0 +1,228 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/pom.xml b/pom.xml index 303cb3d..587d6b6 100644 --- a/pom.xml +++ b/pom.xml @@ -1,28 +1,28 @@ - + 4.0.0 info.debatty java-lsh - 0.10-SNAPSHOT + 0.10-SNAPSHOT jar - + ${project.artifactId} https://github.com/tdebatty/java-LSH A Java implementation of Locality Sensitive Hashing (LSH) - + UTF-8 - - + + MIT License http://www.opensource.org/licenses/mit-license.php - + Thibault Debatty @@ -31,14 +31,14 @@ http://debatty.info - + scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git HEAD - + ossrh @@ -63,7 
+63,7 @@ true - + org.apache.maven.plugins maven-source-plugin @@ -77,7 +77,7 @@ - + org.apache.maven.plugins maven-javadoc-plugin @@ -91,7 +91,7 @@ - + org.apache.maven.plugins maven-gpg-plugin @@ -126,6 +126,32 @@ + + + org.apache.maven.plugins + maven-checkstyle-plugin + 2.17 + + checkstyle.xml + **\/examples\/*.java + false + + + + + test + test + + true + true + + + check + + + + + diff --git a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java index 63ac41f..e967c5c 100644 --- a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java +++ b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java @@ -96,7 +96,7 @@ public final int[] hash(final double[] vector) { * @param vector * @return */ - public final int[] hash(final SparseIntegerVector vector){ + public final int[] hash(final SparseIntegerVector vector) { return hashSignature(sb.signature(vector)); } From f8b8e870f2a141b51ebe0d3bcef4a72f0b05e02c Mon Sep 17 00:00:00 2001 From: kireet Date: Tue, 9 Aug 2016 08:55:24 -0700 Subject: [PATCH 27/46] add alternate constructors that take in random number generator seeds --- .../info/debatty/java/lsh/LSHMinHash.java | 26 ++++++++++- .../info/debatty/java/lsh/LSHSuperBit.java | 45 ++++++++++++++++--- .../java/info/debatty/java/lsh/MinHash.java | 33 ++++++++++++-- .../java/info/debatty/java/lsh/SuperBit.java | 19 +++++++- .../info/debatty/java/lsh/MinHashTest.java | 26 +++++++++++ .../info/debatty/java/lsh/SuperBitTest.java | 29 ++++++++++++ 6 files changed, 166 insertions(+), 12 deletions(-) create mode 100644 src/test/java/info/debatty/java/lsh/MinHashTest.java create mode 100644 src/test/java/info/debatty/java/lsh/SuperBitTest.java diff --git a/src/main/java/info/debatty/java/lsh/LSHMinHash.java b/src/main/java/info/debatty/java/lsh/LSHMinHash.java index 152fee4..ac97319 100644 --- a/src/main/java/info/debatty/java/lsh/LSHMinHash.java +++ b/src/main/java/info/debatty/java/lsh/LSHMinHash.java @@ -46,7 +46,29 @@ 
public class LSHMinHash extends LSH { */ public LSHMinHash(final int s, final int b, final int n) { super(s, b); - + this.mh = buildMinHash(s, n, null); + } + + /** + * Instantiates a LSH instance that internally uses MinHash, + * with s stages (or bands) and b buckets (per stage), for sets out of a + * dictionary of n elements. + * + * Attention: the number of buckets should be chosen such that we have at + * least 100 items per bucket. + * + * @param s stages + * @param b buckets (per stage) + * @param n dictionary size + * @param seed random number generator seed. using the same value will + * guarantee identical hashes across object instantiations + */ + public LSHMinHash(final int s, final int b, final int n, final long seed) { + super(s, b); + this.mh = buildMinHash(s, n, seed); + } + + private MinHash buildMinHash(final int s, final int n, Long seed) { /** * "Mining of Massive Datasets", p.88. * It can be shown that, using MinHash, the probability that the @@ -67,7 +89,7 @@ public LSHMinHash(final int s, final int b, final int n) { */ int r = (int) Math.ceil(Math.log(1.0 / s) / Math.log(THRESHOLD)) + 1; int signature_size = r * s; - this.mh = new MinHash(signature_size, n); + return seed != null ? 
new MinHash(signature_size, n, seed) : new MinHash(signature_size, n); } /** diff --git a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java index e967c5c..ac4aa6d 100644 --- a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java +++ b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java @@ -53,12 +53,43 @@ public class LSHSuperBit extends LSH implements Serializable { * @throws java.lang.Exception if parameters produce a superbit value 0 */ public LSHSuperBit( - final int stages, final int buckets, final int dimensions) - throws Exception { + final int stages, final int buckets, final int dimensions) { super(stages, buckets); - // SuberBit code length + this.sb = buildSuperBit(stages, buckets, dimensions, null); + } + + /** + * LSH implementation relying on SuperBit, to bin vectors s times (stages) + * in b buckets (per stage), in a space with n dimensions. Input vectors + * with a high cosine similarity have a high probability of falling in the + * same bucket... + * + * Supported input types: + * - double[] + * - sparseIntegerVector + * - int[] + * - others to come... + * + * @param stages stages + * @param buckets buckets (per stage) + * @param dimensions dimensionality + * @param seed random number generator seed. 
using the same value will + * guarantee identical hashes across object instantiations + * + * @throws java.lang.Exception if parameters produce a superbit value 0 + */ + public LSHSuperBit( + final int stages, final int buckets, final int dimensions, long seed) { + + super(stages, buckets); + + this.sb = buildSuperBit(stages, buckets, dimensions, seed); + } + + private SuperBit buildSuperBit(final int stages, final int buckets, final int dimensions, Long seed) { + // SuperBit code length int code_length = stages * buckets / 2; int superbit; // superbit value for (superbit = dimensions; superbit >= 1; superbit--) { @@ -68,12 +99,16 @@ public LSHSuperBit( } if (superbit == 0) { - throw new Exception( + throw new IllegalArgumentException( "Superbit is 0 with parameters: s=" + stages + " b=" + buckets + " n=" + dimensions); } - this.sb = new SuperBit(dimensions, superbit, code_length / superbit); + if(seed != null) { + return new SuperBit(dimensions, superbit, code_length / superbit, seed); + } else { + return new SuperBit(dimensions, superbit, code_length / superbit); + } } /** diff --git a/src/main/java/info/debatty/java/lsh/MinHash.java b/src/main/java/info/debatty/java/lsh/MinHash.java index 937f2f2..b671dd8 100644 --- a/src/main/java/info/debatty/java/lsh/MinHash.java +++ b/src/main/java/info/debatty/java/lsh/MinHash.java @@ -119,7 +119,7 @@ public static int size(final double error) { * @param dict_size */ public MinHash(final int size, final int dict_size) { - init(size, dict_size); + init(size, dict_size, new Random()); } /** @@ -131,7 +131,33 @@ public MinHash(final int size, final int dict_size) { * @param dict_size */ public MinHash(final double error, final int dict_size) { - init(size(error), dict_size); + init(size(error), dict_size, new Random()); + } + + /** + * Initializes hash functions to compute MinHash signatures for sets built + * from a dictionary of dict_size elements. 
+ * + * @param size the number of hash functions (and the size of resulting + * signatures) + * @param dict_size + * @param seed random number generator seed. using the same value will + * guarantee identical hashes across object instantiations + */ + public MinHash(final int size, final int dict_size, final long seed) { + init(size, dict_size, new Random(seed)); + } + + /** + * Initializes hash function to compute MinHash signatures for sets built + * from a dictionary of dict_size elements, with a given similarity + * estimation error. + * + * @param error + * @param dict_size + */ + public MinHash(final double error, final int dict_size, final long seed) { + init(size(error), dict_size, new Random(seed)); } /** @@ -230,7 +256,7 @@ public final double error() { * @param size * @param dict_size */ - private void init(final int size, final int dict_size) { + private void init(final int size, final int dict_size, Random r) { if (size <= 0) { throw new InvalidParameterException( "Signature size should be positive"); @@ -255,7 +281,6 @@ private void init(final int size, final int dict_size) { // h = (a * x) + b // a and b should be randomly generated - Random r = new Random(); hash_coefs = new long[n][2]; for (int i = 0; i < n; i++) { hash_coefs[i][0] = r.nextInt(dict_size); // a diff --git a/src/main/java/info/debatty/java/lsh/SuperBit.java b/src/main/java/info/debatty/java/lsh/SuperBit.java index a1dd527..8d89bc9 100644 --- a/src/main/java/info/debatty/java/lsh/SuperBit.java +++ b/src/main/java/info/debatty/java/lsh/SuperBit.java @@ -62,6 +62,24 @@ public class SuperBit implements Serializable { * @param l number of Super-Bit [1 .. */ public SuperBit(final int d, final int n, final int l) { + this(d, n, l, new Random()); + } + + /** + * Initialize SuperBit algorithm. + * Super-Bit depth n must be [1 .. d] and number of Super-Bit l in [1 .. 
+ * The resulting code length k = n * l + * The K vectors are orthogonalized in L batches of N vectors + * + * @param d data space dimension + * @param n Super-Bit depth [1 .. d] + * @param l number of Super-Bit [1 .. + */ + public SuperBit(final int d, final int n, final int l, long seed) { + this(d, n, l, new Random(seed)); + } + + private SuperBit(final int d, final int n, final int l, Random rand) { if (d <= 0) { throw new IllegalArgumentException("Dimension d must be >= 1"); } @@ -87,7 +105,6 @@ public SuperBit(final int d, final int n, final int l) { int code_length = n * l; double[][] v = new double[code_length][d]; - Random rand = new Random(); for (int i = 0; i < code_length; i++) { double[] vector = new double[d]; diff --git a/src/test/java/info/debatty/java/lsh/MinHashTest.java b/src/test/java/info/debatty/java/lsh/MinHashTest.java new file mode 100644 index 0000000..99d845f --- /dev/null +++ b/src/test/java/info/debatty/java/lsh/MinHashTest.java @@ -0,0 +1,26 @@ +package info.debatty.java.lsh; + +import static org.junit.Assert.assertArrayEquals; + +import java.util.HashSet; +import java.util.Random; +import java.util.Set; + +import org.junit.Test; + +public class MinHashTest { + + @Test + public void testSeed() { + MinHash mh = new MinHash(100, 100, 123456); + MinHash mh2 = new MinHash(100, 100, 123456); + + Random r = new Random(); + + Set ints = new HashSet(); + for(int i = 0; i < 50; i++) + ints.add(r.nextInt()); + + assertArrayEquals(mh.signature(ints), mh2.signature(ints)); + } +} diff --git a/src/test/java/info/debatty/java/lsh/SuperBitTest.java b/src/test/java/info/debatty/java/lsh/SuperBitTest.java new file mode 100644 index 0000000..fa12a1c --- /dev/null +++ b/src/test/java/info/debatty/java/lsh/SuperBitTest.java @@ -0,0 +1,29 @@ +package info.debatty.java.lsh; + +import static org.junit.Assert.assertEquals; + +import java.util.Random; + +import org.junit.Test; + +public class SuperBitTest { + + @Test + public void testSeed() { + int d = 50; + 
SuperBit sb = new SuperBit(d, 25, 100, 123456); + SuperBit sb2 = new SuperBit(d, 25, 100, 123456); + + Random r = new Random(); + + double[] vector = new double[d]; + for(int i = 0; i < d; i++) + vector[i] = r.nextDouble(); + + boolean[] sig1 = sb.signature(vector); + boolean[] sig2 = sb2.signature(vector); + + for(int i = 0; i < sig1.length; i++) + assertEquals("pos " + i, sig1[i], sig2[i]); + } +} From 05166d7e01e09b104be47546ead82c8c1656c015 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 10 Aug 2016 11:58:19 +0200 Subject: [PATCH 28/46] Style correction --- .../info/debatty/java/lsh/LSHMinHash.java | 55 ++++++++++--------- .../info/debatty/java/lsh/LSHSuperBit.java | 47 ++++++++++------ .../java/info/debatty/java/lsh/MinHash.java | 13 +++-- .../java/info/debatty/java/lsh/SuperBit.java | 7 ++- .../info/debatty/java/lsh/LSHMinHashTest.java | 4 +- .../info/debatty/java/lsh/MinHashTest.java | 16 ++++-- .../info/debatty/java/lsh/SuperBitTest.java | 24 +++++--- 7 files changed, 102 insertions(+), 64 deletions(-) diff --git a/src/main/java/info/debatty/java/lsh/LSHMinHash.java b/src/main/java/info/debatty/java/lsh/LSHMinHash.java index ac97319..beb5812 100644 --- a/src/main/java/info/debatty/java/lsh/LSHMinHash.java +++ b/src/main/java/info/debatty/java/lsh/LSHMinHash.java @@ -46,9 +46,10 @@ public class LSHMinHash extends LSH { */ public LSHMinHash(final int s, final int b, final int n) { super(s, b); - this.mh = buildMinHash(s, n, null); + int signature_size = computeSignatureSize(s, n); + this.mh = new MinHash(signature_size, n); } - + /** * Instantiates a LSH instance that internally uses MinHash, * with s stages (or bands) and b buckets (per stage), for sets out of a @@ -60,36 +61,38 @@ public LSHMinHash(final int s, final int b, final int n) { * @param s stages * @param b buckets (per stage) * @param n dictionary size - * @param seed random number generator seed. using the same value will + * @param seed random number generator seed. 
using the same value will * guarantee identical hashes across object instantiations */ public LSHMinHash(final int s, final int b, final int n, final long seed) { super(s, b); - this.mh = buildMinHash(s, n, seed); + int signature_size = computeSignatureSize(s, n); + this.mh = new MinHash(signature_size, n, seed); } - - private MinHash buildMinHash(final int s, final int n, Long seed) { - /** - * "Mining of Massive Datasets", p.88. - * It can be shown that, using MinHash, the probability that the - * signatures of 2 sets with Jaccard similarity s agree in all the - * rows of at least one stage (band), and therefore become a candidate - * pair, is 1−(1−s^R)^b - * where R = signature_size / b (number of rows in a stage/band) - * Thus, the curve that shows the probability that 2 items fall in the - * same bucket for at least one of the stages, as a function of their - * Jaccard index similarity, has a S shape. - * The threshold (the value of similarity at which the probability of - * becoming a candidate is 1/2) is a function of the number of stages - * (s, or bands b in the book) and the signature size: - * threshold ≃ (1/s)^(1/R) - * Hence the signature size can be computed as: - * R = ln(1/s) / ln(threshold) - * signature_size = R * b - */ + + /** + * Compute the size of the signature according to "Mining of Massive + * Datasets" p88. + * It can be shown that, using MinHash, the probability that the + * signatures of 2 sets with Jaccard similarity s agree in all the + * rows of at least one stage (band), and therefore become a candidate + * pair, is 1−(1−s^R)^b + * where R = signature_size / b (number of rows in a stage/band) + * Thus, the curve that shows the probability that 2 items fall in the + * same bucket for at least one of the stages, as a function of their + * Jaccard index similarity, has a S shape. 
+ * The threshold (the value of similarity at which the probability of + * becoming a candidate is 1/2) is a function of the number of stages + * (s, or bands b in the book) and the signature size: + * threshold ≃ (1/s)^(1/R) + * Hence the signature size can be computed as: + * R = ln(1/s) / ln(threshold) + * signature_size = R * b + */ + private int computeSignatureSize(final int s, final int n) { + int r = (int) Math.ceil(Math.log(1.0 / s) / Math.log(THRESHOLD)) + 1; - int signature_size = r * s; - return seed != null ? new MinHash(signature_size, n, seed) : new MinHash(signature_size, n); + return r * s; } /** diff --git a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java index ac4aa6d..572bf1b 100644 --- a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java +++ b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java @@ -50,16 +50,18 @@ public class LSHSuperBit extends LSH implements Serializable { * @param stages stages * @param buckets buckets (per stage) * @param dimensions dimensionality - * @throws java.lang.Exception if parameters produce a superbit value 0 */ public LSHSuperBit( final int stages, final int buckets, final int dimensions) { super(stages, buckets); - this.sb = buildSuperBit(stages, buckets, dimensions, null); + int code_length = stages * buckets / 2; + int superbit = computeSuperBit(stages, buckets, dimensions); + + this.sb = new SuperBit(dimensions, superbit, code_length / superbit); } - + /** * LSH implementation relying on SuperBit, to bin vectors s times (stages) * in b buckets (per stage), in a space with n dimensions. Input vectors @@ -75,20 +77,35 @@ public LSHSuperBit( * @param stages stages * @param buckets buckets (per stage) * @param dimensions dimensionality - * @param seed random number generator seed. using the same value will + * @param seed random number generator seed. 
using the same value will * guarantee identical hashes across object instantiations - * - * @throws java.lang.Exception if parameters produce a superbit value 0 + * */ public LSHSuperBit( - final int stages, final int buckets, final int dimensions, long seed) { - + final int stages, + final int buckets, + final int dimensions, + final long seed) { + super(stages, buckets); - - this.sb = buildSuperBit(stages, buckets, dimensions, seed); + + int code_length = stages * buckets / 2; + int superbit = computeSuperBit(stages, buckets, dimensions); + + this.sb = new SuperBit( + dimensions, superbit, code_length / superbit, seed); } - - private SuperBit buildSuperBit(final int stages, final int buckets, final int dimensions, Long seed) { + + /** + * Compute the superbit value. + * @param stages + * @param buckets + * @param dimensions + * @return + */ + private int computeSuperBit( + final int stages, final int buckets, final int dimensions) { + // SuperBit code length int code_length = stages * buckets / 2; int superbit; // superbit value @@ -104,11 +121,7 @@ private SuperBit buildSuperBit(final int stages, final int buckets, final int di + " b=" + buckets + " n=" + dimensions); } - if(seed != null) { - return new SuperBit(dimensions, superbit, code_length / superbit, seed); - } else { - return new SuperBit(dimensions, superbit, code_length / superbit); - } + return superbit; } /** diff --git a/src/main/java/info/debatty/java/lsh/MinHash.java b/src/main/java/info/debatty/java/lsh/MinHash.java index b671dd8..cbe1c1a 100644 --- a/src/main/java/info/debatty/java/lsh/MinHash.java +++ b/src/main/java/info/debatty/java/lsh/MinHash.java @@ -133,7 +133,7 @@ public MinHash(final int size, final int dict_size) { public MinHash(final double error, final int dict_size) { init(size(error), dict_size, new Random()); } - + /** * Initializes hash functions to compute MinHash signatures for sets built * from a dictionary of dict_size elements. 
@@ -141,13 +141,13 @@ public MinHash(final double error, final int dict_size) { * @param size the number of hash functions (and the size of resulting * signatures) * @param dict_size - * @param seed random number generator seed. using the same value will + * @param seed random number generator seed. using the same value will * guarantee identical hashes across object instantiations */ public MinHash(final int size, final int dict_size, final long seed) { init(size, dict_size, new Random(seed)); } - + /** * Initializes hash function to compute MinHash signatures for sets built * from a dictionary of dict_size elements, with a given similarity @@ -155,6 +155,8 @@ public MinHash(final int size, final int dict_size, final long seed) { * * @param error * @param dict_size + * @param seed random number generator seed. using the same value will + * guarantee identical hashes across object instantiations */ public MinHash(final double error, final int dict_size, final long seed) { init(size(error), dict_size, new Random(seed)); @@ -252,11 +254,12 @@ public final double error() { } /** - * Compute has function coefficients. + * Compute hash function coefficients using provided Random. * @param size * @param dict_size + * @param r */ - private void init(final int size, final int dict_size, Random r) { + private void init(final int size, final int dict_size, final Random r) { if (size <= 0) { throw new InvalidParameterException( "Signature size should be positive"); diff --git a/src/main/java/info/debatty/java/lsh/SuperBit.java b/src/main/java/info/debatty/java/lsh/SuperBit.java index 8d89bc9..bb827c3 100644 --- a/src/main/java/info/debatty/java/lsh/SuperBit.java +++ b/src/main/java/info/debatty/java/lsh/SuperBit.java @@ -74,12 +74,13 @@ public SuperBit(final int d, final int n, final int l) { * @param d data space dimension * @param n Super-Bit depth [1 .. d] * @param l number of Super-Bit [1 .. 
+ * @param seed to use for the random number generator */ - public SuperBit(final int d, final int n, final int l, long seed) { + public SuperBit(final int d, final int n, final int l, final long seed) { this(d, n, l, new Random(seed)); } - - private SuperBit(final int d, final int n, final int l, Random rand) { + + private SuperBit(final int d, final int n, final int l, final Random rand) { if (d <= 0) { throw new IllegalArgumentException("Dimension d must be >= 1"); } diff --git a/src/test/java/info/debatty/java/lsh/LSHMinHashTest.java b/src/test/java/info/debatty/java/lsh/LSHMinHashTest.java index 3e89770..3e16660 100644 --- a/src/test/java/info/debatty/java/lsh/LSHMinHashTest.java +++ b/src/test/java/info/debatty/java/lsh/LSHMinHashTest.java @@ -47,8 +47,8 @@ public void testHash() { double sparsity = 0.75; // Number and size of vectors - int count = 10000; - int n = 100000; + int count = 1000; + int n = 10000; int stages = 2; int buckets = 10; diff --git a/src/test/java/info/debatty/java/lsh/MinHashTest.java b/src/test/java/info/debatty/java/lsh/MinHashTest.java index 99d845f..c01d354 100644 --- a/src/test/java/info/debatty/java/lsh/MinHashTest.java +++ b/src/test/java/info/debatty/java/lsh/MinHashTest.java @@ -8,19 +8,27 @@ import org.junit.Test; +/** + * + * @author Thibault Debatty + */ public class MinHashTest { + /** + * Test with initial seed. 
+ */ @Test public void testSeed() { MinHash mh = new MinHash(100, 100, 123456); MinHash mh2 = new MinHash(100, 100, 123456); - + Random r = new Random(); - + Set ints = new HashSet(); - for(int i = 0; i < 50; i++) + for (int i = 0; i < 50; i++) { ints.add(r.nextInt()); - + } + assertArrayEquals(mh.signature(ints), mh2.signature(ints)); } } diff --git a/src/test/java/info/debatty/java/lsh/SuperBitTest.java b/src/test/java/info/debatty/java/lsh/SuperBitTest.java index fa12a1c..f7e9c26 100644 --- a/src/test/java/info/debatty/java/lsh/SuperBitTest.java +++ b/src/test/java/info/debatty/java/lsh/SuperBitTest.java @@ -6,24 +6,34 @@ import org.junit.Test; +/** + * + * @author Thibault Debatty + */ public class SuperBitTest { + /** + * Test with initial seed. + */ @Test - public void testSeed() { + public final void testSeed() { int d = 50; SuperBit sb = new SuperBit(d, 25, 100, 123456); SuperBit sb2 = new SuperBit(d, 25, 100, 123456); - + Random r = new Random(); double[] vector = new double[d]; - for(int i = 0; i < d; i++) + for (int i = 0; i < d; i++) { vector[i] = r.nextDouble(); - + } + boolean[] sig1 = sb.signature(vector); boolean[] sig2 = sb2.signature(vector); - - for(int i = 0; i < sig1.length; i++) - assertEquals("pos " + i, sig1[i], sig2[i]); + + for (int i = 0; i < sig1.length; i++) { + assertEquals( + "Signatures are different at index " + i, sig1[i], sig2[i]); + } } } From e404dd8ff8bc64dc971132a0b00abb8dfedb0734 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 10 Aug 2016 11:59:06 +0200 Subject: [PATCH 29/46] [maven-release-plugin] prepare release v0.10 --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index 587d6b6..4b63e3c 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.10-SNAPSHOT + 0.10 jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - HEAD 
+ v0.10 From cb8652111bfe887be92ac21f953c6e504d9dced4 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 10 Aug 2016 11:59:12 +0200 Subject: [PATCH 30/46] [maven-release-plugin] prepare for next development iteration --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index 4b63e3c..c8e9026 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.10 + 0.11-SNAPSHOT jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - v0.10 + HEAD From 967dc63b0aaca0fdb5e9bdabeef476b039188e33 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 10 Aug 2016 12:24:05 +0200 Subject: [PATCH 31/46] Added initial seed in README and example --- README.md | 60 +++++++++++++++- .../java/lsh/examples/InitialSeed.java | 68 +++++++++++++++++++ 2 files changed, 126 insertions(+), 2 deletions(-) create mode 100644 src/main/java/info/debatty/java/lsh/examples/InitialSeed.java diff --git a/README.md b/README.md index b089b5e..c83990a 100644 --- a/README.md +++ b/README.md @@ -3,6 +3,12 @@ A Java implementation of Locality Sensitive Hashing (LSH). +* [Download](#Download) +* [MinHash](#MinHash) +* [SuperBit](#SuperBit) +* [Comparable signatures](#Comparable-signatures) + + Locality Sensitive Hashing (LSH) is a family of hashing methods that tent to produce the same hash (or signature) for similar items. There exist different LSH functions, that each correspond to a similarity metric. For example, the MinHash algorithm is designed for Jaccard similarity (the relative number of elements that two sets have in common). For cosine similarity, the traditional LSH algorithm used is Random Projection, but others exist, like Super-Bit, that deliver better resutls. 
LSH functions have two main use cases: @@ -391,14 +397,64 @@ public class MyApp { [Read Javadoc...](http://api123.web-d.be/api/java-LSH/head/index.html) -## Serialization +## Comparable signatures + As the parameters of the hashing function are randomly initialized when the LSH object is instantiated: * two LSH objects will produce different hashes and signatures for the same input vector; * two executions of your program will produce different hashes and signatures for the same input vector; * the signatures produced by two different LSH objects can not be used to estimate the similarity between vectors. -The solution is to serialize you LSH object so you an reuse it: +There are two possibilities to produce comparable signatures: provide an initial seed or serialize your hash object. + +### Initial seed + +```java +import info.debatty.java.lsh.MinHash; +import java.util.Random; + +public class InitialSeed { + + public static void main(String[] args) { + + // Initialize two minhash objects, with the same seed + int signature_size = 20; + int dictionary_size = 100; + long initial_seed = 123456; + + MinHash mh = new MinHash(signature_size, dictionary_size, initial_seed); + MinHash mh2 = new MinHash(signature_size, dictionary_size, initial_seed); + + // Create a single vector of size dictionary_size + Random r = new Random(); + boolean[] vector = new boolean[dictionary_size]; + for (int i = 0; i < dictionary_size; i++) { + vector[i] = r.nextBoolean(); + } + + // The two minhash objects will produce the same signature + println(mh.signature(vector)); + println(mh2.signature(vector)); + } + + static void println(final int[] array) { + System.out.print("["); + for (int v : array) { + System.out.print("" + v + " "); + } + System.out.println("]"); + } +} +``` + +Will output: + +``` +[0 0 1 1 3 3 0 1 0 2 0 0 9 1 0 0 0 1 7 0 ] +[0 0 1 1 3 3 0 1 0 2 0 0 9 1 0 0 0 1 7 0 ] +``` + +### Serialization ```java import info.debatty.java.lsh.LSHMinHash; diff 
--git a/src/main/java/info/debatty/java/lsh/examples/InitialSeed.java b/src/main/java/info/debatty/java/lsh/examples/InitialSeed.java new file mode 100644 index 0000000..ca2e09e --- /dev/null +++ b/src/main/java/info/debatty/java/lsh/examples/InitialSeed.java @@ -0,0 +1,68 @@ +/* + * The MIT License + * + * Copyright 2016 Thibault Debatty. + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. 
+ */ + +package info.debatty.java.lsh.examples; + +import info.debatty.java.lsh.MinHash; +import java.util.Random; + +/** + * + * @author Thibault Debatty + */ +public class InitialSeed { + + /** + * @param args the command line arguments + */ + public static void main(String[] args) { + + // Initialize two minhash objects, with the same seed + int signature_size = 20; + int dictionary_size = 100; + long initial_seed = 123456; + + MinHash mh = new MinHash(signature_size, dictionary_size, initial_seed); + MinHash mh2 = new MinHash(signature_size, dictionary_size, initial_seed); + + // Create a single vector of size dictionary_size + Random r = new Random(); + boolean[] vector = new boolean[dictionary_size]; + for (int i = 0; i < dictionary_size; i++) { + vector[i] = r.nextBoolean(); + } + + // The two minhash objects will produce the same signature + println(mh.signature(vector)); + println(mh2.signature(vector)); + } + + static void println(final int[] array) { + System.out.print("["); + for (int v : array) { + System.out.print("" + v + " "); + } + System.out.println("]"); + } +} From fb50b82befaac341e3311174104afa9621fd2ec7 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 10 Aug 2016 12:26:08 +0200 Subject: [PATCH 32/46] Update README.md --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index c83990a..b2705b8 100644 --- a/README.md +++ b/README.md @@ -3,10 +3,10 @@ A Java implementation of Locality Sensitive Hashing (LSH). -* [Download](#Download) -* [MinHash](#MinHash) -* [SuperBit](#SuperBit) -* [Comparable signatures](#Comparable-signatures) +* [Download](#download) +* [MinHash](#minhash) +* [SuperBit](#superbit) +* [Comparable signatures](#comparable-signatures) Locality Sensitive Hashing (LSH) is a family of hashing methods that tent to produce the same hash (or signature) for similar items. There exist different LSH functions, that each correspond to a similarity metric. 
For example, the MinHash algorithm is designed for Jaccard similarity (the relative number of elements that two sets have in common). For cosine similarity, the traditional LSH algorithm used is Random Projection, but others exist, like Super-Bit, that deliver better resutls. From 850eb57ddfce8a4cc0c315149b821b4449418aff Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 10 Aug 2016 12:26:54 +0200 Subject: [PATCH 33/46] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b2705b8..88e5eed 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ A Java implementation of Locality Sensitive Hashing (LSH). * [Download](#download) * [MinHash](#minhash) -* [SuperBit](#superbit) +* [Super-Bit](#super-bit) * [Comparable signatures](#comparable-signatures) From b3c2951e27a83888c022ae2039f603112829bd91 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 6 Sep 2017 07:49:53 +0200 Subject: [PATCH 34/46] javadoc.io --- README.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 88e5eed..c4aa42a 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,6 @@ # java-LSH -[![Maven Central](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh/badge.svg)](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh) [![Build Status](https://travis-ci.org/tdebatty/java-LSH.svg?branch=master)](https://travis-ci.org/tdebatty/java-LSH) [![API](http://api123.web-d.be/api123-head.svg)](http://api123.web-d.be/api/java-LSH/head/index.html) +[![Maven Central](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh/badge.svg)](https://maven-badges.herokuapp.com/maven-central/info.debatty/java-lsh) [![Build 
Status](https://travis-ci.org/tdebatty/java-LSH.svg?branch=master)](https://travis-ci.org/tdebatty/java-LSH) [![Javadocs](http://www.javadoc.io/badge/info.debatty/java-lsh.svg)](http://www.javadoc.io/doc/info.debatty/java-lsh) + A Java implementation of Locality Sensitive Hashing (LSH). @@ -7,6 +8,8 @@ A Java implementation of Locality Sensitive Hashing (LSH). * [MinHash](#minhash) * [Super-Bit](#super-bit) * [Comparable signatures](#comparable-signatures) +* [Initial seed](#initial-seed) +* [Serialization](#serialization) Locality Sensitive Hashing (LSH) is a family of hashing methods that tent to produce the same hash (or signature) for similar items. There exist different LSH functions, that each correspond to a similarity metric. For example, the MinHash algorithm is designed for Jaccard similarity (the relative number of elements that two sets have in common). For cosine similarity, the traditional LSH algorithm used is Random Projection, but others exist, like Super-Bit, that deliver better resutls. From ead693aa86ecf85412525baa6e9dfcefa9283fe9 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 6 Sep 2017 07:51:06 +0200 Subject: [PATCH 35/46] Typos --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index c4aa42a..5989dba 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@ A Java implementation of Locality Sensitive Hashing (LSH). * [Serialization](#serialization) -Locality Sensitive Hashing (LSH) is a family of hashing methods that tent to produce the same hash (or signature) for similar items. There exist different LSH functions, that each correspond to a similarity metric. For example, the MinHash algorithm is designed for Jaccard similarity (the relative number of elements that two sets have in common). 
For cosine similarity, the traditional LSH algorithm used is Random Projection, but others exist, like Super-Bit, that deliver better resutls. +Locality Sensitive Hashing (LSH) is a family of hashing methods that tent to produce the same hash (or signature) for similar items. There exist different LSH functions, that each correspond to a similarity metric. For example, the MinHash algorithm is designed for Jaccard similarity (the relative number of elements that two sets have in common). For cosine similarity, the traditional LSH algorithm used is Random Projection, but others exist, like Super-Bit, that deliver better results. LSH functions have two main use cases: * Compute the signature of large input vectors. These signatures can be used to quickly estimate the similarity between vectors. @@ -26,7 +26,7 @@ Are currently implemented: The coeficients of hashing functions are randomly choosen when the LSH object is instantiated. You can thus only compare signatures or bucket binning generated by the same LSH object. To reuse your LSH object between executions, you have to serialize it and save it to a file (see below the [example of LSH object serialization](https://github.com/tdebatty/java-LSH#serialization)). -##Download +## Download Using maven: ``` @@ -39,7 +39,7 @@ Using maven: Or see the [releases](https://github.com/tdebatty/java-LSH/releases) page. -##MinHash +## MinHash MinHash is a hashing scheme that tents to produce similar signatures for sets that have a high Jaccard similarity. 
From a4194adf6600787dca6f07b94f33f6c4787ffda0 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Wed, 6 Sep 2017 07:52:39 +0200 Subject: [PATCH 36/46] Typos + javadoc --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 5989dba..b85003d 100644 --- a/README.md +++ b/README.md @@ -291,9 +291,9 @@ Signature similarity: 0.6767676767676768 Real similarity (Jaccard index)0.6666666666666666 ``` -[Read Javadoc...](http://api123.web-d.be/api/java-LSH/head/index.html) +[Read Javadoc...](http://www.javadoc.io/doc/info.debatty/java-lsh) -##Super-Bit +## Super-Bit Super-Bit is an improvement of Random Projection LSH. It computes an estimation of cosine similarity. In Super-Bit, the K random vectors are orthogonalized in L batches of N vectors, where * N is called the Super-Bit depth @@ -398,7 +398,7 @@ public class MyApp { } ``` -[Read Javadoc...](http://api123.web-d.be/api/java-LSH/head/index.html) +[Read Javadoc...](http://www.javadoc.io/doc/info.debatty/java-lsh) ## Comparable signatures @@ -532,4 +532,4 @@ LSH object serialized to /tmp/lshobject5903174677942358274.ser [5 5 ] ``` -[Check the examples](https://github.com/tdebatty/java-LSH/tree/master/src/main/java/info/debatty/java/lsh/examples) or [read Javadoc](http://api123.io/api/java-LSH/head/index.html) +[Check the examples](https://github.com/tdebatty/java-LSH/tree/master/src/main/java/info/debatty/java/lsh/examples) or [read Javadoc](http://www.javadoc.io/doc/info.debatty/java-lsh) From 132925ef7dd8e9bdd6020ee749649f0c46ae5a31 Mon Sep 17 00:00:00 2001 From: Joyous Koala Date: Mon, 9 Apr 2018 11:00:04 -0500 Subject: [PATCH 37/46] Replaced bad hash function in MinHash --- src/main/java/info/debatty/java/lsh/MinHash.java | 11 +++++++---- 1 file changed, 7 insertions(+), 4 
deletions(-) diff --git a/src/main/java/info/debatty/java/lsh/MinHash.java b/src/main/java/info/debatty/java/lsh/MinHash.java index cbe1c1a..63ae77e 100644 --- a/src/main/java/info/debatty/java/lsh/MinHash.java +++ b/src/main/java/info/debatty/java/lsh/MinHash.java @@ -28,6 +28,8 @@ */ public class MinHash implements Serializable { + private static final long LARGE_PRIME = 2147483659L; + /** * Compute the jaccard index between two sets. * @param s1 @@ -286,13 +288,13 @@ private void init(final int size, final int dict_size, final Random r) { // a and b should be randomly generated hash_coefs = new long[n][2]; for (int i = 0; i < n; i++) { - hash_coefs[i][0] = r.nextInt(dict_size); // a - hash_coefs[i][1] = r.nextInt(dict_size); // b + hash_coefs[i][0] = r.nextInt(Integer.MAX_VALUE) + 1; // a + hash_coefs[i][1] = r.nextInt(Integer.MAX_VALUE) + 1; // b } } /** - * Computes hi(x) as (a_i * x + b_i) % dict_size. + * Computes hi(x) as (a_i * x + b_i) % LARGE_PRIME % (Integer.MAX_VALUE+1). * * @param i * @param x @@ -300,7 +302,8 @@ private void init(final int size, final int dict_size, final Random r) { */ private int h(final int i, final int x) { return (int) - ((hash_coefs[i][0] * (long) x + hash_coefs[i][1]) % dict_size); + ((hash_coefs[i][0] * (long) x + hash_coefs[i][1]) + % LARGE_PRIME % ((long) Integer.MAX_VALUE + 1)); } /** From aa4730828eb52ccef61823a4fad67ded6bbfcb37 Mon Sep 17 00:00:00 2001 From: Joyous Koala Date: Mon, 9 Apr 2018 11:56:28 -0500 Subject: [PATCH 38/46] Use smaller prime that fits in int to simplify code --- src/main/java/info/debatty/java/lsh/MinHash.java | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/main/java/info/debatty/java/lsh/MinHash.java b/src/main/java/info/debatty/java/lsh/MinHash.java index 63ae77e..b5b8af6 100644 --- a/src/main/java/info/debatty/java/lsh/MinHash.java +++ b/src/main/java/info/debatty/java/lsh/MinHash.java @@ -28,7 +28,7 @@ */ public class MinHash implements Serializable { - private 
static final long LARGE_PRIME = 2147483659L; + private static final int LARGE_PRIME = 2147483647; // = 2^31 - 1 ! /** * Compute the jaccard index between two sets. @@ -285,11 +285,11 @@ private void init(final int size, final int dict_size, final Random r) { this.n = size; // h = (a * x) + b - // a and b should be randomly generated + // a and b should be randomly generated in [1,PRIME-1] hash_coefs = new long[n][2]; for (int i = 0; i < n; i++) { - hash_coefs[i][0] = r.nextInt(Integer.MAX_VALUE) + 1; // a - hash_coefs[i][1] = r.nextInt(Integer.MAX_VALUE) + 1; // b + hash_coefs[i][0] = r.nextInt(LARGE_PRIME - 1) + 1; // a + hash_coefs[i][1] = r.nextInt(LARGE_PRIME - 1) + 1; // b } } @@ -303,7 +303,7 @@ private void init(final int size, final int dict_size, final Random r) { private int h(final int i, final int x) { return (int) ((hash_coefs[i][0] * (long) x + hash_coefs[i][1]) - % LARGE_PRIME % ((long) Integer.MAX_VALUE + 1)); + % LARGE_PRIME); } /** From 34e8859de4d9e7d0c022bb03e56919882275e427 Mon Sep 17 00:00:00 2001 From: Joyous Koala Date: Mon, 9 Apr 2018 12:00:49 -0500 Subject: [PATCH 39/46] Fixed comment --- src/main/java/info/debatty/java/lsh/MinHash.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/main/java/info/debatty/java/lsh/MinHash.java b/src/main/java/info/debatty/java/lsh/MinHash.java index b5b8af6..bc8e5fc 100644 --- a/src/main/java/info/debatty/java/lsh/MinHash.java +++ b/src/main/java/info/debatty/java/lsh/MinHash.java @@ -294,7 +294,7 @@ private void init(final int size, final int dict_size, final Random r) { } /** - * Computes hi(x) as (a_i * x + b_i) % LARGE_PRIME % (Integer.MAX_VALUE+1). + * Computes hi(x) as (a_i * x + b_i) % LARGE_PRIME * * @param i * @param x From f31147b06d723e8f0f779738a2b59678a0c4e10b Mon Sep 17 00:00:00 2001 From: Joyous Koala Date: Mon, 9 Apr 2018 12:06:01 -0500 Subject: [PATCH 40/46] 'First sentence should end with a period.' 
--- src/main/java/info/debatty/java/lsh/MinHash.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/main/java/info/debatty/java/lsh/MinHash.java b/src/main/java/info/debatty/java/lsh/MinHash.java index bc8e5fc..ea98ea0 100644 --- a/src/main/java/info/debatty/java/lsh/MinHash.java +++ b/src/main/java/info/debatty/java/lsh/MinHash.java @@ -294,7 +294,7 @@ private void init(final int size, final int dict_size, final Random r) { } /** - * Computes hi(x) as (a_i * x + b_i) % LARGE_PRIME + * Computes hi(x) as (a_i * x + b_i) % LARGE_PRIME . * * @param i * @param x From 2c9a4ede6a0ed7a4422c62b0334bf559cdbf73a3 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Tue, 10 Apr 2018 20:59:21 +0200 Subject: [PATCH 41/46] [maven-release-plugin] prepare release v0.11 --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index c8e9026..9e223e4 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.11-SNAPSHOT + 0.11 jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - HEAD + v0.11 From 42a520c98f25cd9d1ba2c28e2caa69a64d43f04c Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Tue, 10 Apr 2018 20:59:40 +0200 Subject: [PATCH 42/46] [maven-release-plugin] prepare for next development iteration --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index 9e223e4..d804a76 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.11 + 0.12-SNAPSHOT jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - v0.11 + HEAD From 9f4a7a25a124f6cc9598602d0bc0f74a41665933 Mon Sep 17 00:00:00 2001 From: Steve Gutz Date: Tue, 18 Jun 2019 15:34:10 -0400 Subject: [PATCH 43/46] Remove support for 
sparse vectors --- pom.xml | 5 -- src/main/java/info/debatty/java/lsh/LSH.java | 2 +- .../info/debatty/java/lsh/LSHSuperBit.java | 22 ------ .../java/info/debatty/java/lsh/SuperBit.java | 29 -------- .../java/lsh/examples/LSHMinHashExample.java | 32 ++++---- .../lsh/examples/SuperBitSparseExample.java | 73 ------------------- 6 files changed, 17 insertions(+), 146 deletions(-) delete mode 100644 src/main/java/info/debatty/java/lsh/examples/SuperBitSparseExample.java diff --git a/pom.xml b/pom.xml index d804a76..f919c11 100644 --- a/pom.xml +++ b/pom.xml @@ -161,11 +161,6 @@ 4.10 test - - ${project.groupId} - java-string-similarity - 0.12 - diff --git a/src/main/java/info/debatty/java/lsh/LSH.java b/src/main/java/info/debatty/java/lsh/LSH.java index aac78ae..31c51cc 100644 --- a/src/main/java/info/debatty/java/lsh/LSH.java +++ b/src/main/java/info/debatty/java/lsh/LSH.java @@ -4,7 +4,7 @@ /** * Implementation of Locality Sensitive Hashing (LSH) principle, as described in - * Leskovec, Rajaraman & Ullman (2014), "Mining of Massive Datasets", + * Leskovec, Rajaraman & Ullman (2014), "Mining of Massive Datasets", * Cambridge University Press. * * @author Thibault Debatty http://www.debatty.info diff --git a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java index 572bf1b..c5dbb2d 100644 --- a/src/main/java/info/debatty/java/lsh/LSHSuperBit.java +++ b/src/main/java/info/debatty/java/lsh/LSHSuperBit.java @@ -24,8 +24,6 @@ package info.debatty.java.lsh; -import info.debatty.java.utils.SparseDoubleVector; -import info.debatty.java.utils.SparseIntegerVector; import java.io.Serializable; /** @@ -43,7 +41,6 @@ public class LSHSuperBit extends LSH implements Serializable { * * Supported input types: * - double[] - * - sparseIntegerVector * - int[] * - others to come... 
* @@ -70,7 +67,6 @@ public LSHSuperBit( * * Supported input types: * - double[] - * - sparseIntegerVector * - int[] * - others to come... * @@ -139,24 +135,6 @@ public final int[] hash(final double[] vector) { return hashSignature(sb.signature(vector)); } - /** - * Hash (bin) a vector in s stages into b buckets. - * @param vector - * @return - */ - public final int[] hash(final SparseIntegerVector vector) { - return hashSignature(sb.signature(vector)); - } - - /** - * Hash (bin) a vector in s stages into b buckets. - * @param vector - * @return - */ - public final int[] hash(final SparseDoubleVector vector) { - return hashSignature(sb.signature(vector)); - } - /** * Hash (bin) a vector in s stages into b buckets. * @param vector diff --git a/src/main/java/info/debatty/java/lsh/SuperBit.java b/src/main/java/info/debatty/java/lsh/SuperBit.java index bb827c3..7d03ebe 100644 --- a/src/main/java/info/debatty/java/lsh/SuperBit.java +++ b/src/main/java/info/debatty/java/lsh/SuperBit.java @@ -24,8 +24,6 @@ package info.debatty.java.lsh; -import info.debatty.java.utils.SparseDoubleVector; -import info.debatty.java.utils.SparseIntegerVector; import java.io.Serializable; import java.util.Random; @@ -40,7 +38,6 @@ * Advances in Neural Information Processing Systems 25, 2012 * * Supported input types: - * - SparseIntegerVector * - double[] * - others to come... * @@ -176,32 +173,6 @@ public SuperBit() { } - /** - * Compute the signature of this vector. - * @param vector - * @return - */ - public final boolean[] signature(final SparseIntegerVector vector) { - boolean[] sig = new boolean[this.hyperplanes.length]; - for (int i = 0; i < this.hyperplanes.length; i++) { - sig[i] = (vector.dotProduct(this.hyperplanes[i]) >= 0); - } - return sig; - } - - /** - * Compute the signature of this vector. 
- * @param vector - * @return - */ - public final boolean[] signature(final SparseDoubleVector vector) { - boolean[] sig = new boolean[this.hyperplanes.length]; - for (int i = 0; i < this.hyperplanes.length; i++) { - sig[i] = (vector.dotProduct(this.hyperplanes[i]) >= 0); - } - return sig; - } - /** * Compute the signature of this vector. * @param vector diff --git a/src/main/java/info/debatty/java/lsh/examples/LSHMinHashExample.java b/src/main/java/info/debatty/java/lsh/examples/LSHMinHashExample.java index 3fe635d..4d1e7a1 100644 --- a/src/main/java/info/debatty/java/lsh/examples/LSHMinHashExample.java +++ b/src/main/java/info/debatty/java/lsh/examples/LSHMinHashExample.java @@ -40,40 +40,40 @@ public class LSHMinHashExample { public static void main(String[] args) { // Number of sets int count = 2000; - + // Size of dictionary int n = 100; - + // Number of buckets // Attention: to get relevant results, the number of elements per bucket // should be at least 100 int buckets = 10; - + // Let's generate some random sets boolean[][] vectors = new boolean[count][]; Random r = new Random(); - + // To get some interesting measures, we first generate a single // sparse random vector - vectors[0] = new boolean[n]; + vectors[0] = new boolean[n]; for (int j = 0; j < n; j++) { vectors[0][j] = (r.nextInt(10) == 0); } - - // Then we generate the other vectors, which have a reasonable chance + + // Then we generate the other vectors, which have a reasonable chance // to look like the first one... for (int i = 1; i < count; i++) { vectors[i] = new boolean[n]; - + for (int j = 0; j < n; j++) { vectors[i][j] = (r.nextDouble() <= 0.7 ? 
vectors[0][j] : (r.nextInt(10) == 0)); } } - + // Now we can proceed to LSH binning // We will test multiple stages for (int stages = 1; stages <= 10; stages++) { - + // Compute the LSH hash of each vector LSHMinHash lsh = new LSHMinHash(stages, buckets, n); int[][] hashes = new int[count][]; @@ -83,7 +83,7 @@ public static void main(String[] args) { } // We now have the LSH hash for each input set - // Let's have a look at how similar sets (according to Jaccard + // Let's have a look at how similar sets (according to Jaccard // index) were binned... int[][] results = new int[11][2]; for (int i = 0; i < vectors.length; i++) { @@ -93,15 +93,15 @@ public static void main(String[] args) { for (int j = 0; j < i; j++) { boolean[] vector2 = vectors[j]; int[] hash2 = hashes[j]; - + // We compute the similarity between each pair of sets double similarity = MinHash.jaccardIndex(vector1, vector2); - // We count the number of pairs with similarity 0.1, 0.2, + // We count the number of pairs with similarity 0.1, 0.2, // 0.3, etc. results[(int) (10 * similarity)][0]++; - // Do they fall in the same bucket for one of the stages? + // Do they fall in the same bucket for one of the stages? for (int stage = 0; stage < stages; stage++) { if (hash1[stage] == hash2[stage]) { results[(int) (10 * similarity)][1]++; @@ -116,14 +116,14 @@ public static void main(String[] args) { // in the same bucket for at least one of the stages is y for (int i = 0; i < results.length; i++) { double similarity = (double) i / 10; - + double probability = 0; if (results[i][0] != 0) { probability = (double) results[i][1] / results[i][0]; } System.out.println("" + similarity + "\t" + probability + "\t" + stages); } - + // Separate the series for Gnuplot... 
System.out.print("\n"); } diff --git a/src/main/java/info/debatty/java/lsh/examples/SuperBitSparseExample.java b/src/main/java/info/debatty/java/lsh/examples/SuperBitSparseExample.java deleted file mode 100644 index a6a3784..0000000 --- a/src/main/java/info/debatty/java/lsh/examples/SuperBitSparseExample.java +++ /dev/null @@ -1,73 +0,0 @@ -package info.debatty.java.lsh.examples; - -/* - * The MIT License - * - * Copyright 2015 tibo. - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to deal - * in the Software without restriction, including without limitation the rights - * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell - * copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in - * all copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, - * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN - * THE SOFTWARE. 
- */ - - -import info.debatty.java.lsh.SuperBit; -import info.debatty.java.utils.SparseIntegerVector; -import java.util.Random; - -/** - * - * @author Thibault Debatty - */ -public class SuperBitSparseExample { - - /** - * @param args the command line arguments - */ - public static void main(String[] args) { - - int n = 1000; - - // Initialize SuperBit algorithm for n dimensions - SuperBit sb = new SuperBit(n); - - - // Create some sparse vectors - Random rand = new Random(); - - int[] v = new int[n]; - for (int i = 0; i < n/10; i++) { - v[rand.nextInt(n)] = rand.nextInt(100); - } - SparseIntegerVector v1 = new SparseIntegerVector(v); - - v = new int[n]; - for (int i = 0; i < n/10; i++) { - v[rand.nextInt(n)] = rand.nextInt(100); - } - SparseIntegerVector v2 = new SparseIntegerVector(v); - - boolean[] sig1 = sb.signature(v1); - boolean[] sig2 = sb.signature(v2); - - System.out.println("Signature (estimated) similarity: " + - sb.similarity(sig1, sig2)); - System.out.println("Real cosine similarity: " + v1.dotProduct(v2) / (v1.norm() * v2.norm())); - - } - -} From 353e792f2da6ea62e1e07df784df53772f9c3958 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Mon, 24 Jun 2019 21:18:36 +0200 Subject: [PATCH 44/46] [maven-release-plugin] prepare release v0.12 --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index f919c11..e080aec 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.12-SNAPSHOT + 0.12 jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - HEAD + v0.12 From 7308f87f27a3a9adbe37a337f8aec9eacd7154a7 Mon Sep 17 00:00:00 2001 From: Thibault Debatty Date: Mon, 24 Jun 2019 21:18:43 +0200 Subject: [PATCH 45/46] [maven-release-plugin] prepare for next development iteration --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml 
b/pom.xml index e080aec..d37504e 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 info.debatty java-lsh - 0.12 + 0.13-SNAPSHOT jar ${project.artifactId} @@ -36,7 +36,7 @@ scm:git:git@github.com:tdebatty/java-LSH.git scm:git:git@github.com:tdebatty/java-LSH.git git@github.com:tdebatty/java-LSH.git - v0.12 + HEAD From 2fad13824344018c535e72db9a059f5dee76ef80 Mon Sep 17 00:00:00 2001 From: Jonathan Leitschuh Date: Sat, 19 Nov 2022 03:01:08 +0000 Subject: [PATCH 46/46] vuln-fix: Temporary File Information Disclosure This fixes temporary file information disclosure vulnerability due to the use of the vulnerable `File.createTempFile()` method. The vulnerability is fixed by using the `Files.createTempFile()` method which sets the correct posix permissions. Weakness: CWE-377: Insecure Temporary File Severity: Medium CVSS: 5.5 Detection: CodeQL & OpenRewrite (https://public.moderne.io/recipes/org.openrewrite.java.security.SecureTempFileCreation) Reported-by: Jonathan Leitschuh Signed-off-by: Jonathan Leitschuh Bug-tracker: https://github.com/JLLeitschuh/security-research/issues/18 Co-authored-by: Moderne --- .../info/debatty/java/lsh/examples/SerializeExample.java | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/main/java/info/debatty/java/lsh/examples/SerializeExample.java b/src/main/java/info/debatty/java/lsh/examples/SerializeExample.java index d0ff329..b9c01d4 100644 --- a/src/main/java/info/debatty/java/lsh/examples/SerializeExample.java +++ b/src/main/java/info/debatty/java/lsh/examples/SerializeExample.java @@ -30,6 +30,7 @@ import java.io.IOException; import java.io.ObjectInputStream; import java.io.ObjectOutputStream; +import java.nio.file.Files; import java.util.Random; /** @@ -72,7 +73,7 @@ public static void main(String[] args) // be used to compute estimated similarity! // The solution is to serialize and save the object, so it can be // reused later... 
- File tempfile = File.createTempFile("lshobject", ".ser"); + File tempfile = Files.createTempFile("lshobject", ".ser").toFile(); FileOutputStream fout = new FileOutputStream(tempfile); ObjectOutputStream oos = new ObjectOutputStream(fout); oos.writeObject(lsh); @@ -93,4 +94,4 @@ static void println(int[] array) { } System.out.println("]"); } -} \ No newline at end of file +}
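
The temporary-file change in the last patch can be sketched in isolation. This is a minimal illustration that assumes only the JDK: a file is created with `Files.createTempFile()`, an object is serialized to it and then read back. The class name `SecureTempSerialization`, the helpers `save` and `load`, and the `int[]` standing in for an LSH object are hypothetical and are not part of java-LSH.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical illustration class, not part of java-LSH.
public class SecureTempSerialization {

    // Serialize obj to a fresh temporary file and return its path.
    // Files.createTempFile is the call the patch switches to; per the
    // commit message it creates the file with restrictive permissions.
    public static Path save(Serializable obj) throws IOException {
        Path tmp = Files.createTempFile("lshobject", ".ser");
        try (ObjectOutputStream oos =
                new ObjectOutputStream(Files.newOutputStream(tmp))) {
            oos.writeObject(obj);
        }
        return tmp;
    }

    // Read back whatever object was written to path.
    public static Object load(Path path)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois =
                new ObjectInputStream(Files.newInputStream(path))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a serializable LSH object (assumption).
        int[] coefs = {3, 1, 4, 1, 5};

        Path tmp = save(coefs);
        int[] restored = (int[]) load(tmp);

        System.out.println("Round trip ok: " + Arrays.equals(coefs, restored));

        // Clean up the temporary file once it is no longer needed.
        Files.delete(tmp);
    }
}
```

The sketch uses try-with-resources so that both streams are closed even if serialization fails, and it deletes the temporary file at the end; both are choices of this illustration rather than something the patch itself does.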