Principles of Hash-Based Text Retrieval.
Benno Stein
Faculty of Media, Media Systems
Bauhaus University Weimar
99421 Weimar, Germany
[email protected]
Figure 1: A space partitioned into overlapping regions, hinted as two grids of shaded and outlined hexagons. Each region is characterized by a unique key; points in the same region have a similarity of at least θ. A similarity hash function h_ϕ at level θ assigns a set of region keys to a feature vector x_d, implying the following semantics: if and only if two feature vectors share a region key they are considered as having a similarity of at least θ. In the example h_ϕ(x) = {h_ϕ^(1)(x), h_ϕ^(2)(x)} operationalizes both a precision and a recall of 1. For readability purposes the keys of the shaded regions are shown underlined.
2.4 Controlling Retrieval Properties

The most salient property of hash-based search is the simplification of a continuous similarity function ϕ to the binary concept "similar or not similar": two feature vectors are considered as similar if their hash keys are equal; otherwise they are considered as not similar. This implication is generalized in Equation (1) at the outset; the generalization pertains to two aspects: (i) the equivalence relation refers to a similarity threshold θ, and (ii) the hash function h_ϕ is multivalued.

With the background of the presented hash-based search methods we now continue the discussion of precision and recall from Subsection 1.1. Observe that the probability of a hash collision for two vectors x_d1, x_d2 decreases if the number k of simple hash functions (LSH) or prefix classes (FF) is increased. Each hash function or each prefix class captures additional knowledge of x and hence raises the similarity threshold θ. This can be condensed into the following formula, termed Property 1:

"Code length controls precision."

Being multivalued is a necessary condition for h_ϕ to achieve a recall of 1. A scalar-valued hash function computes one key for one feature vector x at a time, and hence it defines a rigorous partitioning of the feature vector space. Figure 1 illustrates this connection: the scalar-valued hash function h_ϕ^(1), responsible for the shaded partitioning, assigns different keys to the vectors x_d1 and x_d2, despite their high similarity (low distance). With the multivalued hash function h_ϕ = {h_ϕ^(1), h_ϕ^(2)}, which also considers the outlined partitioning, the intersection h_ϕ(x_d1) ∩ h_ϕ(x_d2) is not empty, giving rise to the inference that ϕ(x_d1, x_d2) > θ. In fact, there is a monotonic relationship between the number of hash codes and the achieved recall, which can be condensed into the following formula, termed Property 2:

"Code multiplicity controls recall."

However, there is no free lunch: the improved recall is bought with a decrease in precision.

3. OPTIMALITY AND EMBEDDING

The embedding of the vector space model into a low-dimensional space is inevitably bound up with information loss. The smaller the embedding error is, the better are precision and recall of the constructed hash function, because the affine transformation in Steps 2 and 3 (cf. Subsection 2.2), which maps an embedded vector onto a hash code, is distance-preserving.

The section starts with a derivation of the globally optimum embedding under the cosine similarity measure, and then uncovers the inferiority of this embedding compared to the prefix class embedding of fuzzy-fingerprinting (Subsection 3.2). This observation is explained by the idea of threshold-centered embeddings, for which we introduce the formal underpinning in the form of new error statistics, called precision stress and recall stress at a given similarity threshold θ. By extending the idea toward thresholded similarity matrices we show how optimum embeddings for similarity hashing in closed retrieval situations can be developed (Subsection 3.3).

3.1 Globally Optimum Embeddings

Multidimensional scaling (MDS) designates a class of techniques for embedding a set of objects into a low-dimensional real-valued space, called embedding space here. The embedding error, also called "stress", is computed from the deviations between the original inter-object similarities and the new inter-object similarities in the embedding space.

Given n objects, the related similarity matrix, S, is a symmetric n × n matrix of positive real numbers, whose (i, j)-th entry quantifies the similarity between object i and object j. Let each object be described by an m-dimensional feature vector x ∈ R^m, and let X be the m × n matrix comprised of these vectors.⁶

Without loss of generality we assume each feature vector x to be normalized according to the l2-norm, i.e., ||x||_2 = 1. Then, under the cosine similarity measure, S is defined by the identity S = X^T X, where X^T designates the matrix transpose of X.

An important property of the cosine similarity measure is that under the Frobenius norm an optimum embedding of X can be directly constructed from its singular value decomposition (SVD). With SVD an arbitrary matrix X can be uniquely represented as the product of three matrices:⁷

X = U \Sigma V^T

U is a column orthonormal m × r matrix, Σ is an r × r diagonal matrix with the singular values of X, and V is an n × r matrix, i.e., U^T U = I and V^T V = I, where I designates the identity matrix.

⁶ In IR applications X is the term-document matrix. For applying an MDS only S must be given.
⁷ Unique up to rearrangement of columns and subspace rotations.
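To make the notation concrete, here is a minimal numpy sketch, with an assumed random matrix standing in for a term-document matrix: it normalizes the columns, computes the SVD, and checks the properties of U, Σ, and V stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a term-document matrix X (m terms x n documents);
# columns are l2-normalized so that S = X^T X holds pairwise cosine similarities.
m, n = 50, 8
X = rng.random((m, n))
X /= np.linalg.norm(X, axis=0)

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(sigma) V^T

r = sigma.size
assert np.allclose(U.T @ U, np.eye(r))        # U is column orthonormal: U^T U = I
assert np.allclose(Vt @ Vt.T, np.eye(r))      # V is column orthonormal: V^T V = I
assert np.allclose(U @ np.diag(sigma) @ Vt, X)  # reconstruction X = U Sigma V^T
```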
[Figure 2: the left panel shows similarities in the high-dimensional original space, the right panel similarities in the low-dimensional embedding space, connected by an arrow labeled "Embedding". R_θ collects the pairs with ϕ(x_i, x_j) > θ, R̂_θ the pairs with ϕ̂(y_i, y_j) > θ; the hatched areas mark the similarities primarily responsible for recall stress (left) and for precision stress (right).]

Figure 2: If the original document representations, X, are embedded into a low-dimensional space, the resulting document representations Y resemble the original similarities only imperfectly. Given a particular threshold θ, similarities of the original space may be shifted from above θ to below θ (hatched area left), from below θ to above θ (hatched area right), or still remain in the interval [θ; 1] (green area). The similarities in the hatched areas are responsible for the major part of the embedding stress.
Using these properties the matrix S can be rewritten under both the viewpoint of its singular value decomposition and the viewpoint of similarity computation:

S = X^T X = (U\Sigma V^T)^T U\Sigma V^T = \underbrace{V\Sigma^2 V^T}_{\text{SVD}} = \underbrace{(\Sigma V^T)^T(\Sigma V^T)}_{\text{similarity computation}}

ΣV^T represents a set of points with the same inter-object similarities as the original vectors X. The nature of the cosine similarity measure implies the direct construction of S and, in particular, the identities rank(S) = rank(X) = rank(ΣV^T). Conversely, if we restrict the dimensionality of the embedding space to k, the resulting similarity matrix Ŝ is also of rank k. According to the Eckart-Young theorem the optimum rank-k approximation Ŝ* of S under the Frobenius norm can be obtained from the SVD of S, by restricting the matrix product to the k largest singular values [10]:

\hat{S}^* = V_k \Sigma_k^2 V_k^T = (\Sigma_k V_k^T)^T (\Sigma_k V_k^T)
\;\Rightarrow\; \Sigma_k V_k^T = \operatorname*{argmin}_{\{Y \mid \mathrm{rank}(Y)=k,\ \mathrm{columns}(Y)=n\}} \| S - Y^T Y \|_F

In the information retrieval community the embedding Y_SVD := Σ_k V_k^T of document vectors X is known as the representation in the so-called latent semantic space, spanned by k concepts. The embedding process became popular under the name of latent semantic indexing (LSI) [9].

Remark 1. A common misconception is that LSI projects the document vectors into a subspace in order to represent semantic similarity. Rather, LSI constructs new features to approximate the original document representations. And, if the dimension of the embedding space is properly chosen, then, due to the reduction of noise and the elimination of weak dependencies, this embedding is able to address retrieval problems deriving from the use of synonymous words.

Embeddings obtained this way aim at a global stress minimization, while hash-based search methods concentrate on the high similarities in S in the first place.⁸ The nature of this property is captured by the following definition, which relates the threshold-specific stress of an embedding to the statistical concepts of precision and recall. Figure 2 illustrates the definition.

Definition 1 (precision stress, recall stress). Let D be a set of objects and let X and Y be their representations in the n-dimensional and the k-dimensional space respectively, k < n. Moreover, let ϕ : X × X → [0; 1] and ϕ̂ : Y × Y → [0; 1] be two similarity measures, and let θ ∈ [0; 1] be a similarity threshold.

θ defines two result sets, R_θ and R̂_θ, which are comprised of those pairs {x_i, x_j}, x_i, x_j ∈ D, whose respective representations in X and Y are above the similarity threshold θ:

\{x_i, x_j\} \in R_\theta \Leftrightarrow \varphi(x_i, x_j) > \theta, \quad\text{and likewise:}\quad \{x_i, x_j\} \in \hat{R}_\theta \Leftrightarrow \hat{\varphi}(y_i, y_j) > \theta

Then the set of returned pairs from the embedding space, R̂_θ, defines the precision stress at similarity threshold θ, e_{p_θ}:

e_{p_\theta} = \frac{\sum_{\{x_i, x_j\} \in \hat{R}_\theta} \bigl(\varphi(x_i, x_j) - \hat{\varphi}(y_i, y_j)\bigr)^2}{\sum_{\{x_i, x_j\} \in \hat{R}_\theta} \varphi(x_i, x_j)^2}

Likewise, the set of similar pairs in the original space, R_θ, defines the recall stress at similarity threshold θ, e_{r_θ}:

e_{r_\theta} = \frac{\sum_{\{x_i, x_j\} \in R_\theta} \bigl(\varphi(x_i, x_j) - \hat{\varphi}(y_i, y_j)\bigr)^2}{\sum_{\{x_i, x_j\} \in R_\theta} \varphi(x_i, x_j)^2}
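As an illustration of Definition 1 and of the Eckart-Young construction of Y_SVD, the following sketch computes the rank-k embedding Σ_k V_k^T with numpy and evaluates the two stress statistics. It is a sketch under assumptions: ϕ is taken as the cosine similarity of the original columns and ϕ̂ as the inner product of the embedded vectors (the quantity the Frobenius-norm criterion above approximates); the data is random stand-in material and names such as stress_at_theta are hypothetical.

```python
import numpy as np

def svd_embedding(X, k):
    """Rank-k embedding Y_SVD = Sigma_k V_k^T (Eckart-Young construction)."""
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    return np.diag(sigma[:k]) @ Vt[:k, :]             # k x n embedded vectors

def stress_at_theta(S, S_hat, theta):
    """Precision stress e_p and recall stress e_r of Definition 1.

    Assumes phi     = cosine similarity of the original columns (entries of S),
            phi_hat = inner product of the embedded vectors (entries of S_hat).
    """
    iu = np.triu_indices_from(S, k=1)                 # unordered pairs {x_i, x_j}
    s, s_hat = S[iu], S_hat[iu]
    sq_err = (s - s_hat) ** 2
    in_R_hat = s_hat > theta                          # pairs returned from the embedding
    in_R = s > theta                                  # similar pairs in the original space
    e_p = sq_err[in_R_hat].sum() / (s[in_R_hat] ** 2).sum()
    e_r = sq_err[in_R].sum() / (s[in_R] ** 2).sum()
    return e_p, e_r

# Toy usage with random data standing in for a normalized term-document matrix:
rng = np.random.default_rng(0)
X = rng.random((500, 200))
X /= np.linalg.norm(X, axis=0)
S = X.T @ X
Y = svd_embedding(X, k=20)
e_p, e_r = stress_at_theta(S, Y.T @ Y, theta=0.75)
print(f"precision stress: {e_p:.3f}   recall stress: {e_r:.3f}")
```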
Figure 3: Evolution of the embedding stress against the similarity threshold θ (lower stress is better). The left plot takes the embedded vectors as basis, the right plot the original vectors, corresponding to the precision stress, e_p, and the recall stress, e_r, respectively. At some threshold the embedding of fuzzy-fingerprinting, Y_FF, outperforms the optimum MDS embedding, Y_SVD.
The larger the precision stress is, the higher is the probability that two vectors which are mapped onto the same code have a similarity, s_ij = ϕ(x_i, x_j), that is low. Likewise, the larger the recall stress is, the higher is the probability that two vectors in the original space, x_i, x_j, are mapped onto different codes though their similarity, s_ij, is high.

For the three embeddings, Y_SVD, Y_FF, and Y_LSH, obtained from optimum MDS, fuzzy-fingerprinting, and LSH respectively, we have analyzed the precision stress and the recall stress at various similarity thresholds and with different corpora. The results reflect the predicted behavior:

1. Because of its generality (domain independence) the LSH embedding is consistently worse than the prefix class embedding of fuzzy-fingerprinting.

2. At some break-even point the retrieval performance of the prefix class embedding outperforms the optimum MDS embedding.

Figure 3 illustrates this behavior for a sample of 2000 documents drawn from the Reuters Corpus Volume 1 (RCV1) [24]. With other corpora and other parameter settings for the hash-based search methods this characteristic is observed as well. In this connection we also analyzed specifically compiled corpora whose similarity distribution is significantly skewed towards high similarities: Figure 4 contrasts the similarity distribution in the original Reuters Corpus (hatched light) and in the special corpora (solid dark).

Remark 3. For most retrieval tasks even a high precision stress can be accepted, since the necessary subsequent exact similarity analysis needs to be performed only for a very small fraction |D_q|/|D| of all documents. Remember that the construction methods for the hash-based search methods provide sufficient means to fine-tune the trade-off between the precision stress, e_p, and the recall stress, e_r.

3.3 Threshold-Optimum Embeddings in Closed Retrieval Situations

Threshold-centered embeddings are tailored document models for special retrieval tasks such as near-duplicate detection or high similarity search. They tolerate a large embedding error in the low similarity interval [0, θ] and strive for a high fidelity of similarities from the interval [θ, 1]. This principle forms the rationale of hash-based search.

With Y_SVD, obtained by optimally solving an MDS, an embedding that minimizes the accumulated error over all similarities is at hand. We now introduce a threshold-optimum embedding, Y*, which minimizes the accumulated error with respect to the interval [θ, 1]. The presented ideas address the closed retrieval situation in the first place; for open retrieval situations the construction of an optimum embedding requires a priori knowledge about the term distribution in the collection D.⁹ Though the typical use case for hash-based search is an open retrieval situation, the derivation is useful because (i) it provides additional theoretical insights and (ii) it forms a basis to reason about performance bounds.

The θ-specific retrieval analysis of the preceding subsection suggests the construction principle of Y*. Instead of approximating the original similarity matrix S, a "thresholded" similarity matrix S_θ is taken as basis, introducing this way the binary nature of similarity hashing into the approximation process. For a given threshold θ the matrix S_θ is defined as follows:

S_\theta := \begin{pmatrix} f_\theta(s_{11}) & f_\theta(s_{12}) & \dots & f_\theta(s_{1n}) \\ \vdots & \vdots & \ddots & \vdots \\ f_\theta(s_{n1}) & f_\theta(s_{n2}) & \dots & f_\theta(s_{nn}) \end{pmatrix},
where θ and ϑ define an upper threshold and a lower threshold respectively. Similarity values from [θ; 1] are amplified toward 1, similarity values from [0; θ) are moved toward ϑ. The following rationale reveals the underlying trade-off: with increasing difference θ − ϑ the amplification above θ improves the robustness in the encoding step (cf. Subsection 2.2); with increasing ϑ the contraction toward ϑ reduces the error in the embedding step and hence allows for shorter codes. f_θ can be realized in different ways; within our analyses two consecutive tanh-approximations with the lower threshold ϑ = 0.1 were used.

[Figure 4: distribution of similarity values over similarity intervals (logarithmic scale), contrasting the original Reuters Corpus ("Reuters (original)") with the specially compiled corpora ("Reuters (special)").]
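A possible realization of such an f_θ is sketched below, picking up the paragraph above. The paper states only that two consecutive tanh-approximations with ϑ = 0.1 are used; the exact parameterization is not given in this excerpt, so the snippet substitutes a single smooth tanh step with an assumed sharpness parameter, merely to illustrate how values above θ are amplified toward 1 and values below θ are contracted toward ϑ.

```python
import numpy as np

def f_theta(s, theta=0.6, vartheta=0.1, sharpness=20.0):
    """Smooth thresholding of similarity values (illustrative stand-in).

    Values above theta are amplified toward 1, values below theta are
    contracted toward vartheta; 'sharpness' controls how step-like the
    transition is (a single tanh step instead of the paper's two
    consecutive tanh-approximations).
    """
    step = 0.5 * (1.0 + np.tanh(sharpness * (s - theta)))  # ~0 below, ~1 above theta
    return vartheta + (1.0 - vartheta) * step

def thresholded_similarity_matrix(S, theta=0.6, vartheta=0.1):
    """Apply f_theta elementwise to S to obtain the matrix S_theta."""
    return f_theta(np.asarray(S), theta, vartheta)

S = np.array([[1.00, 0.82, 0.30],
              [0.82, 1.00, 0.50],
              [0.30, 0.50, 1.00]])
print(np.round(thresholded_similarity_matrix(S), 2))
# High similarities (0.82) are pushed toward 1, low ones (0.30, 0.50) toward 0.1.
```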
Figure 5: Near-duplicate detection and plagiarism analysis with hash-based search technology. The plots show recall-at-similarity, precision-at-similarity, and runtime-at-sample-size, using fuzzy-fingerprinting (FF) and locality-sensitive hashing (LSH).
Whether they can be used to develop better search methods is a subject of our research: by construction, Y* outperforms other embeddings. It is unclear to which extent this property can be utilized in similarity search methods designed for open retrieval situations. The theoretical analysis of the trade-off between θ and ϑ as well as Remarks 5 and 6 provide interesting links to follow.
6. REFERENCES

[1] R. Ando and L. Lee. Iterative Residual Rescaling: An Analysis and Generalization of LSI. In Proc. 24th conference on research and development in IR, 2001.
[2] G. Aston and L. Burnard. The BNC Handbook. http://www.natcorp.ox.ac.uk/what/, 1998.
[3] M. Bawa, T. Condie, and P. Ganesan. LSH Forest: Self-Tuning Indexes for Similarity Search. In WWW'05: Proc. of the 14th int. conference on World Wide Web, 2005.
[4] A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic Clustering of the Web. In Selected papers from the sixth int. conference on World Wide Web, 1997.
[5] D. Cai and X. He. Orthogonal Locality Preserving Indexing. In Proc. of the 28th conference on research and development in IR, 2005.
[6] M. S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In STOC'02: Proc. of the thirty-fourth ACM symposium on theory of computing, 2002.
[7] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cambridge, 1990.
[8] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In SCG'04: Proc. of the twentieth symposium on computational geometry, 2004.
[9] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.
[10] C. Eckart and G. Young. The Approximation of one Matrix by Another of Lower Rank. Psychometrika, 1:211–218, 1936.
[11] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In The VLDB Journal, 1999.
[12] X. He, D. Cai, H. Liu, and W.-Y. Ma. Locality Preserving Indexing for Document Representation. In Proc. of the 27th conference on research and development in IR, 2001.
[13] M. Henzinger. Finding Near-Duplicate Web Pages: a Large-Scale Evaluation of Algorithms. In Proc. of the 29th conference on research and development in IR, 2006.
[14] N. Higham. Computing a Nearest Symmetric Positive Semidefinite Matrix. Linear Algebra and its App., 1988.
[15] G. Hinton and R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313:504–507, 2006.
[16] T. Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42:177–196, 2001.
[17] P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. In FOCS'00: Proc. of the 41st symposium on foundations of computer science. IEEE Computer Society, 2000.
[18] P. Indyk and R. Motwani. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. In Proc. of the 30th symposium on theory of computing, 1998.
[19] I. Jolliffe. Principal Component Analysis. Springer, 1996.
[20] J. Kleinberg. Two Algorithms for Nearest-Neighbor Search in High Dimensions. In STOC'97: Proc. of the twenty-ninth ACM symposium on theory of computing, 1997.
[21] J. Kruskal. Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis. Psychometrika, 29(1), 1964.
[22] Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. Int. Journal on Artificial Intelligence Tools, 13(1):157–169, 2004.
[23] J. Nolan. Stable Distributions: Models for Heavy Tailed Data. http://academic2.american.edu/~jpnolan/stable/, 2005.
[24] T. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus Volume 1: From Yesterday's News to Tomorrow's Language Resources. In Proc. of the third int. conference on language resources and evaluation, 2002.
[25] S. Rump. Verification of Positive Definiteness. BIT Numerical Mathematics, 46:433–452, 2006.
[26] B. Stein. Fuzzy-Fingerprints for Text-Based IR. In Proc. of the 5th Int. Conference on Knowledge Management, Graz, Journal of Universal Computer Science, 2005.
[27] B. Stein and S. Meyer zu Eißen. Near Similarity Search and Plagiarism Analysis. In From Data and Information Analysis to Knowledge Engineering. Springer, 2006.
[28] R. Weber, H. Schek, and S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. In Proc. of the 24th VLDB conference, 1998.
[29] H. Yang and J. Callan. Near-Duplicate Detection by Instance-level Constrained Clustering. In Proc. of the 29th conference on research and development in IR, 2006.