
Principles of Hash-based Text Retrieval

Benno Stein
Faculty of Media, Media Systems
Bauhaus University Weimar
99421 Weimar, Germany
[email protected]

ABSTRACT

Hash-based similarity search reduces a continuous similarity relation to the binary concept "similar or not similar": two feature vectors are considered as similar if they are mapped on the same hash key. From its runtime performance this principle is unequaled—while being unaffected by dimensionality concerns at the same time. Similarity hashing is applied with great success for near similarity search in large document collections, and it is considered as a key technology for near-duplicate detection and plagiarism analysis.

This paper reveals the design principles behind hash-based search methods and presents them in a unified way. We introduce new stress statistics that are suited to analyze the performance of hash-based search methods, and we explain the rationale of their effectiveness. Based on these insights, we show how optimum hash functions for similarity search can be derived. We also present new results of a comparative study between different hash-based search methods.

Categories and Subject Descriptors

H.3.1 [INFORMATION STORAGE AND RETRIEVAL]: Content Analysis and Indexing; H.3.3 [INFORMATION STORAGE AND RETRIEVAL]: Information Search and Retrieval; F [Theory of Computation]: MISCELLANEOUS

General Terms

Theory, Performance

Keywords

hash-based similarity search, locality-sensitive hashing, dimension reduction

1. INTRODUCTION AND BACKGROUND

This paper contributes to an aspect of similarity search that receives increasing attention in information retrieval: The use of hashing to significantly speed up similarity search. The hash-based search paradigm has been applied with great success for the following tasks:

• Near-Duplicate Detection. Given a (very large) corpus D, find all documents whose pairwise similarity is close to 1 [29, 13].

• Plagiarism Analysis. Given a candidate document d and a (very large) corpus D, find all documents in D that contain nearly identical passages from d [27].

From the retrieval viewpoint hash-based text retrieval is an incomplete technology. Identical hash keys do not imply high similarity but indicate a high probability of high similarity. This fact suggests the solution strategy for the aforementioned tasks: In a first step a candidate set Dq ⊂ D, |Dq| ≪ |D|, is constructed by a hash-based retrieval method; in a second step Dq is further investigated by a complete method.

The entire retrieval setting can be formalized as follows. Given are (i) a set D = {d1, . . . , dn} of documents each of which is described by an m-dimensional feature vector, x ∈ Rm, and (ii) a similarity measure, ϕ : Rm × Rm → [0; 1], with 0 and 1 indicating no and maximum similarity respectively. ϕ may rely on the l2 norm or on the angle between two feature vectors. For a query document dq, represented by feature vector xdq, and a similarity threshold θ ∈ [0; 1], we are interested in the documents of the θ-neighborhood Dq ⊆ D of dq, which is defined by the following condition:

d ∈ Dq ⇔ ϕ(xdq, xd) > θ,

where xd denotes the feature vector of d. Within information retrieval applications the documents are represented as high-dimensional term vectors with m > 10^4, typically under the vector space model. We distinguish between the real documents, d ∈ D, and their representations as feature vectors, since one and the same document may be analyzed under different models, different representations, and different similarity measures, as will be the case in this paper.

In low-dimensional applications, say, m < 10, the retrieval problem can be efficiently solved with space-partitioning methods like grid-files, KD-trees, or quad-trees, as well as with data-partitioning index trees such as R-trees, Rf-trees, or X-trees. For significantly larger m the construction of Dq cannot be done better than by a linear scan in O(|D|) [28]. However, if one accepts a decrease in recall, the search can be dramatically accelerated with similarity hashing. As will be discussed later on, the effectiveness of similarity hashing results from the fact that the recall is controlled in terms of the similarity threshold θ for a given similarity measure ϕ.
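The linear scan just mentioned is the reference point for everything that follows. A minimal sketch (Python, cosine similarity as ϕ over l2-normalized vectors; the function and variable names are ours) that computes the θ-neighborhood Dq exactly:

import numpy as np

def theta_neighborhood(X, x_q, theta):
    """Linear-scan baseline: indices of all documents whose cosine
    similarity to the query vector x_q exceeds theta.
    X is an (n x m) matrix of l2-normalized document vectors (rows)."""
    sims = X @ x_q            # cosine similarity, since all vectors are normalized
    return np.flatnonzero(sims > theta)

# toy usage: 1000 random "documents" with m = 50 features
rng = np.random.default_rng(0)
X = rng.random((1000, 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)
D_q = theta_neighborhood(X, X[0], theta=0.8)

Each query costs O(|D|) vector operations; the hash-based methods discussed below trade a controlled loss in recall for avoiding exactly this scan.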
To motivate the underlying ideas consider an m-dimensional document representation under the vector space model with a tf-weighting scheme. An—admittedly—very simple similarity hash function hϕ, hϕ : {x1, . . . , xn} → N, could map each term vector x on a single number hϕ(x) that totals the number of those terms in x starting with the letter "a". If hϕ(xd1) = hϕ(xd2) it is assumed that d1 and d2 are similar.

Though this example is simple, it illustrates the principle and the problems of hash-based similarity search:

• If hϕ is too generic it will allegedly claim very dissimilar documents as similar, say, it will return a large number of false positives.

• If hϕ is too specific the understanding of similarity will become too narrow. Take MD5-hashing as an example, which can only be used to model the similarity threshold θ = 1.

If hϕ is purposefully designed and captures the gist of the feature vectors, search queries can be answered in virtually constant time, independent of the dimension of x.

1.1 Perfect Similarity Sensitive Hashing

First, we want to point out that hash-based similarity search is a space partitioning method. Second, it is interesting to note that, at least in theory, for a document set D and a similarity threshold θ a perfect space partitioning for hash-based search can be stated. To make this plausible we have formulated hash-based similarity search as a set covering problem. This generic view differs from the computation-centric descriptions found in the relevant literature.

Consider for this purpose the Rm being partitioned into overlapping regions such that the similarity of any two points of the same region is above θ, where each region is characterized by a unique key κ ∈ N. Moreover, consider a multivalued hash function, h∗ϕ : Rm → P(N), which is "perfectly similarity sensitive" with regard to threshold θ. ∀ d1, d2 ∈ D:

h∗ϕ(xd1) ∩ h∗ϕ(xd2) ≠ ∅  ⇔  ϕ(xd1, xd2) > θ        (1)

The left-hand condition is referred to as α, the right-hand condition as β.

[Footnote 1: For the time being only the existence of such a partitioning along with a hash function is assumed, not its construction.]

Rationale and Utilization. h∗ϕ assigns each feature vector xd a membership set Nd ∈ P(N) of region keys, whereas two sets, Nd1, Nd2, share a key iff xd1 and xd2 have a region in common. Figure 1, which is used later on in a different connection, serves as an illustration.

Based on h∗ϕ we can organize the mapping between all region keys K, K := ∪d∈D Nd, and documents with the same region key as a hash table h, h : K → P(D). Based on h the θ-neighborhood Dq of dq can be constructed in O(|Dq|) runtime:

Dq = ∪κ∈h∗ϕ(xdq) h(κ)        (2)

[Footnote 2: In most practical applications O(|Dq|) is bound by a small constant since |Dq| ≪ |D|. The cost of a hash table access h(κ) is assessed with O(1); experience shows that for a given application hash functions with this property can always be stated [7].]

Observe that h∗ϕ operationalizes both perfect precision and perfect recall. For a set D that is completely known and time-invariant such a function may be found. However, in most cases the equivalence relation of Equation (1), α ⇔ β, cannot be guaranteed:

⇒ If β is not a conclusion of α, Dq contains documents that do not belong to the θ-neighborhood of dq: the precision is < 1.

⇐ If α is not a conclusion of β, Dq does not contain all documents from the θ-neighborhood of dq: the recall is < 1.
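As an illustration (not the paper's implementation), Equations (1) and (2) translate directly into a dictionary-based index; h_phi below stands for any multivalued hash function that returns a set of region keys:

from collections import defaultdict

def build_index(docs, h_phi):
    """Hash table h : region key -> set of documents.
    docs maps a document id to its feature vector; h_phi(x) returns the
    set of region keys assigned to x (a multivalued hash function)."""
    index = defaultdict(set)
    for d, x in docs.items():
        for key in h_phi(x):
            index[key].add(d)
    return index

def theta_neighborhood(index, h_phi, x_q):
    """Equation (2): D_q is the union of the buckets of all region keys of x_q."""
    D_q = set()
    for key in h_phi(x_q):
        D_q |= index.get(key, set())
    return D_q

The query touches only the buckets matching the query's region keys, which is the O(|Dq|) behavior claimed above; precision and recall then depend entirely on how well h_phi approximates the equivalence (1).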
Current research is concerned with the development of similarity hash functions that are robust in their behavior, efficient to be computed, and, most importantly, that provide an adjustable trade-off between precision and recall.

1.2 Contributions of the Paper

Our contributions relate to retrieval technology in general; they have been developed and analyzed with focus on text retrieval tasks under arbitrary classes of vector space models. In detail:

• The construction principles that form the basis of most hash-based search methods are revealed, exemplified, and related to the statistical concepts of precision and recall (Section 2).

• The relation between hash-based search methods and optimum embeddings is analyzed. New stress statistics are presented that give both qualitative and quantitative insights into the effectiveness of similarity hashing (Subsection 3.1 and 3.2).

• Based on a manipulation of the original similarity matrix it is shown how optimum methods for hash-based similarity search can be derived in closed retrieval situations (Subsection 3.3).

• New results of a comparative study between different hash-based search methods are presented (Section 4). This analysis supports the theoretical considerations and the usefulness of the new stress statistics developed in Section 3.

2. HASH-BASED SEARCH METHODS

Despite the use of sophisticated data structures nearest neighbor search in D degrades to a linear search if the dimension of the feature vectors is around 10 or higher. If one sacrifices exactness, that is to say, if one accepts values below 1 for precision and recall, the runtime bottleneck can be avoided by using hash-based search methods. These are specifically designed techniques to approximate near(est) neighbor search within sublinear runtime in the collection size |D|.

2.1 Related Work

Only few hash-based search methods have been developed so far, in particular random projection, locality-sensitive hashing, and fuzzy-fingerprinting [20, 18, 11, 26]; they are discussed in greater detail in Subsection 2.3 and 2.4.

As will be argued in Subsection 2.2, hash-based search methods operationalize—apparently or hidden—a means for embedding high-dimensional vectors into a low-dimensional space. The intended purpose is dimension reduction while retaining as much as possible of the similarity information. In information retrieval embedding technology has been developed for the discovery of hidden semantic structures: a high-dimensional term representation of a document is embedded into a low-dimensional concept space. Known transformation techniques include latent semantic indexing and its variants, probabilistic latent semantic analysis, iterative residual rescaling, or principal component analysis. The concept representation shall provide a better recall in terms of the semantic expansion of queries projected into the concept space.

Embedding is a vital step within the construction of a similarity hash function hϕ. Unfortunately, the mentioned semantic embedding technology cannot be applied in this connection, which is rooted in the nature of the use case. Hash-based search focuses on what is called here "open" retrieval situations, while semantic embedding implies a "closed" or "semi-closed" retrieval situation.
This distinction pertains to the knowledge that is compiled into the retrieval function ρ : Q × D → R, where Q and D designate the computer representations of the sets Q and D of queries and documents respectively.

• Open Retrieval Situation. Q and D are unknown in advance. ρ relies on generic language concepts such as term distribution, term frequency, or sentence length. An example is the vector space model along with the cosine similarity measure.

• Closed Retrieval Situation. Q and D, and hence Q and D are known in advance. ρ models semantic dependencies found in D with respect to Q. An example is an autoencoder neural network applied for category identification in D [15].

• Semi-Closed Retrieval Situation. Q is unknown and D is known in advance. ρ models semantic dependencies of D and expands a query q ∈ Q with respect to the found structure. An example is PLSI.

We propose the scheme in Table 1 to classify embedding methods used in information retrieval. The scheme distinguishes also whether or not domain knowledge is exploited within the embedding procedure (unbiased versus biased). E.g., locality sensitive hashing works on arbitrary data, while fuzzy-fingerprinting as well as shingling exploit the fact that the embedded data is text. A similar argumentation applies to MDS and probabilistic LSI.

             Open retrieval situation           (Semi-)closed retrieval
  Unbiased   Locality sensitive hashing [11]    MDS with cosine similarity [9]
                                                  (latent semantic indexing)
             p-stable LSH [8]                   Non-metric MDS [21]
             LSH forest [3]                     PCA [19]
  Biased     Fuzzy-fingerprinting [26]          Probabilistic LSI [16]
             Vector approximation [28]          Autoencoder NN [15]
             Shingling [4]                      Iterative residual rescaling [1]
                                                Locality preserv. ind., LPI [12]
                                                Orthogonal LPI [5]

Table 1: Classification of embedding paradigms used for indexing in information retrieval.

Aside from their restriction to (semi-)closed retrieval most of the embedding methods in the right column of Table 1 cannot be scaled up for large collections: they employ some form of spectral decomposition, which is computationally expensive.

2.2 Generic Construction Principle of hϕ

We developed a unified view on hash-based search methods by interpreting them as instances of a generic construction principle, which comprises the following steps:

1. Embedding. The m-dimensional feature vectors of the documents in D are embedded in a low-dimensional space, striving for minimum distortion. The resulting k-dimensional feature vectors shall resemble the distance ratios, at least the order of the pairwise inter-document distances, as close as possible.

2. Quantization. The real-valued components of the embedded feature vectors are mapped onto a small number of values.

3. Encoding. From the k quantized components a single number is computed, which serves as hash code.

[Footnote 3: In view of the analysis presented in Section 3, the concept of optimality implied here must be seen in a more differentiated way.]

Some or all of these steps may be repeated for one and the same original feature vector x in order to obtain a set of hash codes for x. The next subsection exemplifies this construction principle for two hash-based search methods: locality-sensitive hashing and fuzzy-fingerprinting. Subsection 2.4 explains the properties of hash-based search methods in terms of the precision and recall semantics.

2.3 A Unified View to Locality-Sensitive Hashing and Fuzzy-Fingerprinting

Locality-sensitive hashing (LSH) is a generic framework for the construction of hash-based search methods. To realize the embedding, a locality-sensitive hash function hϕ employs a family Hϕ of simple hash functions h, h : Rm → N. From Hϕ a set of k functions is chosen by an independent and uniformly distributed random choice, where each function is used to compute one component of the embedding y of an original vector x. Several hash families Hϕ that are applicable for text-based information retrieval have been proposed [6, 8, 3]. Our focus is on the approach of Datar et al. [8], which maps a feature vector x to a real number by computing the dot product a^T · x. a is a random vector whose components are chosen from an α-stable probability distribution.

[Footnote 4: α-stability guarantees locality sensitivity [17, 23]. An example for an α-stable distribution is the Gaussian distribution.]

Quantization is achieved by dividing the real number line into equidistant intervals of width r, each of which having assigned a unique natural number. The result of the dot product is identified with the number of its enclosing interval.

Encoding can happen in different ways and is typically done by summation; the computation of hϕ(ρ) for a set ρ of random vectors a1, . . . , ak reads as follows:

hϕ(ρ)(x) = Σ_{i=1..k} ⌊(ai^T · x + c) / r⌋,

where c ∈ [0, r] is a randomly chosen offset of the real number line. A multivalued hash function repeats the outlined steps for different sets ρ1, . . . , ρl of random vectors.
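A minimal sketch of this hash function (Python; Gaussian random vectors as the α-stable distribution, parameter names ours):

import numpy as np

def lsh_hash(x, A, c, r):
    """h_phi(x) = sum_i floor((a_i . x + c) / r), the summation encoding above.
    A is a (k x m) matrix whose rows a_i are drawn from an alpha-stable
    (here: Gaussian) distribution; c in [0, r] is a random offset."""
    return int(np.sum(np.floor((A @ x + c) / r)))

def make_multivalued_lsh(m, k, r, l, seed=0):
    """Multivalued hash function: l independent sets rho_1, ..., rho_l of
    random vectors, yielding a set of l hash codes per feature vector."""
    rng = np.random.default_rng(seed)
    params = [(rng.standard_normal((k, m)), rng.uniform(0, r)) for _ in range(l)]
    return lambda x: {lsh_hash(x, A, c, r) for A, c in params}

A function returned by make_multivalued_lsh can serve as the h_phi argument of the index sketch shown in Subsection 1.1.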
Fuzzy-fingerprinting (FF) is a hash-based search method specifically designed for text-based information retrieval. Its underlying embedding procedure can be understood as an abstraction of the vector space model and happens by "condensing" an m-dimensional term vector x toward k prefix classes. A prefix class comprises all terms with the same prefix; the components of the embedded feature vector y quantify the normalized expected deviations of the k chosen prefix classes.

[Footnote 5: For the normalization the British National Corpus is used as reference. The BNC is a 100 million word collection of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English [2].]

Quantization is achieved by applying a fuzzification scheme, ρ, which projects the exact deviations y1, . . . , yk on r deviation intervals: ρ : R → {0, . . . , r − 1}.

Encoding is done by computing the smallest number in radix r notation from the fuzzified deviations; the computation of hϕ(ρ) for a particular fuzzification scheme ρ reads as follows:

hϕ(ρ)(x) = Σ_{i=1..k} ρ(yi) · r^(i−1),

where yi is the normalized expected deviation of the i-th prefix class in the original term vector x. Similar to LSH, a multivalued hash function repeats the quantization and encoding steps for different fuzzification schemes, ρ1, . . . , ρl.
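The encoding step can be sketched as follows (Python; the equal-width fuzzification scheme used here is only a stand-in for the deviation-interval schemes of fuzzy-fingerprinting, and the computation of the prefix-class deviations against the BNC is not reproduced):

import numpy as np

def ff_encode(y, r):
    """Radix-r encoding of fuzzified deviations:
    h_phi(x) = sum_i rho(y_i) * r^(i-1).
    y holds the k normalized expected deviations of the prefix classes;
    rho maps each deviation to one of r equal-width intervals over [0, 1]."""
    rho = np.minimum((np.clip(y, 0.0, 1.0) * r).astype(int), r - 1)
    return int(np.sum(rho * r ** np.arange(len(y))))

# toy usage: k = 8 prefix-class deviations, r = 3 deviation intervals
code = ff_encode(np.array([0.1, 0.7, 0.3, 0.0, 0.9, 0.2, 0.5, 0.4]), r=3)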
[Figure 1 omitted: two overlapping hexagonal partitionings (shaded and outlined) of the feature space with numbered region keys; the example key assignments are hϕ(xd1) = {13, 24}, hϕ(xd2) = {14, 24}, hϕ(xd3) = {16, 24}, hϕ(xd4) = {16, 26}.]

Figure 1: A space partitioned into overlapping regions, hinted as two grids of shaded and outlined hexagons. Each region is characterized by a unique key; points in the same region have a similarity of at least θ. A similarity hash function hϕ at level θ assigns a set of region keys to a feature vector xd, implying the following semantics: If and only if two feature vectors share a region key they are considered having a similarity of at least θ. In the example hϕ(x) = {hϕ(1)(x), hϕ(2)(x)} operationalizes both a precision and a recall of 1. For readability purposes the keys of the shaded regions are shown underlined.

2.4 Controlling Retrieval Properties

The most salient property of hash-based search is the simplification of a continuous similarity function ϕ to the binary concept "similar or not similar": two feature vectors are considered as similar if their hash keys are equal; otherwise they are considered as not similar. This implication is generalized in Equation (1) at the outset; the generalization pertains to two aspects: (i) the equivalence relation refers to a similarity threshold θ, and (ii) the hash function hϕ is multivalued.

With the background of the presented hash-based search methods we now continue the discussion of precision and recall from Subsection 1.1. Observe that the probability of a hash collision for two vectors xd1, xd2 decreases if the number k of simple hash functions (LSH) or prefix classes (FF) is increased. Each hash function or each prefix class captures additional knowledge of x and hence raises the similarity threshold θ. This can be broken down to the following formula, termed Property 1:

"Code length controls precision."

Being multivalued is a necessary condition for hϕ to achieve a recall of 1. A scalar-valued hash function computes one key for one feature vector x at a time, and hence it defines a rigorous partitioning of the feature vector space. Figure 1 illustrates this connection: The scalar-valued hash function hϕ(1) responsible for the shaded partitioning assigns different keys to the vectors xd1 and xd2, despite their high similarity (low distance). With the multivalued hash function, hϕ = {hϕ(1), hϕ(2)}, which also considers the outlined partitioning, the intersection hϕ(xd1) ∩ hϕ(xd2) is not empty, giving rise to the inference that ϕ(xd1, xd2) > θ. In fact, there is a monotonic relationship between the number of hash codes and the achieved recall, which can be broken down to the following formula, termed Property 2:

"Code multiplicity controls recall."

However, there is no free lunch: the improved recall is bought with a decrease in precision.
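The two properties can be seen side by side in a small experiment (Python; simplified random-projection hashes, parameter names ours): increasing the code length k makes collisions rarer and precision higher, while increasing the number l of codes per document recovers recall.

import numpy as np
from collections import defaultdict

def make_hash(m, k, r=1.0, rng=None):
    """One hash code of length k: k random projections, quantized and summed.
    Larger k -> more specific codes -> higher precision (Property 1)."""
    rng = rng or np.random.default_rng()
    A, c = rng.standard_normal((k, m)), rng.uniform(0, r)
    return lambda x: int(np.sum(np.floor((A @ x + c) / r)))

def hash_query(X, x_q, k, l, seed=0):
    """Multivalued hashing with l independent codes per document.
    Larger l -> more chances for a key collision -> higher recall (Property 2)."""
    rng = np.random.default_rng(seed)
    hashes = [make_hash(X.shape[1], k, rng=rng) for _ in range(l)]
    index = defaultdict(set)
    for i, x in enumerate(X):
        for j, h in enumerate(hashes):
            index[(j, h(x))].add(i)
    return set().union(*(index[(j, h(x_q))] for j, h in enumerate(hashes)))

Comparing the returned set against the linear-scan θ-neighborhood from Section 1 for different (k, l) settings makes the stated trade-off visible.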
3. OPTIMALITY AND EMBEDDING

The embedding of the vector space model into a low-dimensional space is inevitably bound up with information loss. The smaller the embedding error is, the better are precision and recall of the constructed hash function, because the affine transformation in Step 2 and 3 (cf. Subsection 2.2), which maps an embedded vector onto a hash code, is distance-preserving.

The section starts with a derivation of the globally optimum embedding under the cosine similarity measure, and then uncovers the inferiority of this embedding compared to the prefix class embedding of fuzzy-fingerprinting (Subsection 3.2). This observation is explained by the idea of threshold-centered embeddings, for which we introduce the formal underpinning in the form of new error statistics, called precision stress and recall stress at a given similarity threshold θ. By extending the idea toward thresholded similarity matrices we show how optimum embeddings for similarity hashing in closed retrieval situations can be developed (Subsection 3.3).

3.1 Globally Optimum Embeddings

Multidimensional scaling (MDS) designates a class of techniques for embedding a set of objects into a low-dimensional real-valued space, called embedding space here. The embedding error, also called "stress", is computed from the deviations between the original inter-object similarities and the new inter-object similarities in the embedding space.

Given n objects, the related similarity matrix, S, is a symmetric n × n matrix of positive real numbers, whose (i, j)-th entry quantifies the similarity between object i and object j. Let each object be described by an m-dimensional feature vector x ∈ Rm, and let X be the m × n matrix comprised of these vectors.

[Footnote 6: In IR applications X is the term-document-matrix. For applying an MDS only S must be given.]

Without loss of generality we assume each feature vector x being normalized according to the l2-norm, i.e., ||x||2 = 1. Then, under the cosine similarity measure, S is defined by the identity S = X^T X, where X^T designates the matrix transpose of X.

An important property of the cosine similarity measure is that under the Frobenius norm an optimum embedding of X can be directly constructed from its singular value decomposition (SVD). With SVD an arbitrary matrix X can be uniquely represented as the product of three matrices:

X = UΣV^T

U is a column orthonormal m × r matrix, Σ is an r × r diagonal matrix with the singular values of X, and V is an n × r matrix. I.e., U^T U = I and V^T V = I, where I designates the identity matrix.

[Footnote 7: Unique up to rearrangement of columns and subspace rotations.]

Using these properties the matrix S can be rewritten under both the viewpoint of its singular value decomposition and the viewpoint of similarity computation:

S = X^T X = (UΣV^T)^T UΣV^T = VΣ²V^T = (ΣV^T)^T (ΣV^T),

where the second expression reflects the SVD viewpoint and the last one the similarity-computation viewpoint.

ΣV^T represents a set of points with the same inter-object similarities as the original vectors X. The nature of the cosine similarity measure implies the direct construction of S and, in particular, the identities rank(S) = rank(X) = rank(ΣV^T). Conversely, if we restrict the dimensionality of the embedding space to k, the resulting similarity matrix Ŝ is also of rank k. According to the Eckart-Young Theorem the optimum rank-k approximation Ŝ∗ of S under the Frobenius norm can be obtained from the SVD of S, by restricting the matrix product to the k largest singular values [10]:

Ŝ = Vk Σk² Vk^T = (Σk Vk^T)^T (Σk Vk^T)
⇒ Σk Vk^T = argmin_{Y : rank(Y) = k, columns(Y) = n} ||S − Y^T Y||F

In the information retrieval community the embedding YSVD := Σk Vk^T of document vectors X is known as representation in the so-called latent semantic space, spanned by k concepts. The embedding process became popular under the name of latent semantic indexing (LSI) [9].
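In matrix-library terms the construction is a truncated SVD; a minimal sketch (Python/NumPy, names ours) consistent with the derivation above:

import numpy as np

def svd_embedding(X, k):
    """Globally optimum rank-k embedding under the cosine measure:
    Y_SVD = Sigma_k V_k^T, obtained from the SVD X = U Sigma V^T.
    X is the m x n matrix whose columns are l2-normalized document vectors."""
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    Y = sigma[:k, None] * Vt[:k, :]        # k x n embedded representations
    S_hat = Y.T @ Y                        # rank-k approximation of S = X^T X
    return Y, S_hat

The similarity of two embedded documents is then simply the scalar product of the corresponding columns of Y.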
Remark 1. A common misconception is that LSI projects the document vectors into a subspace in order to represent semantic similarity. Rather, LSI constructs new features to approximate the original document representations. And, if the dimension of the embedding space is properly chosen then, due to the reduction of noise and the elimination of weak dependencies, this embedding is able to address retrieval problems deriving from the use of synonymous words. As a consequence the retrieval performance may be improved in semi-closed retrieval applications. Hofmann argues similarly [16]: the superposition principle underlying LSI is unable to handle polysemy.

3.2 The Rationale of Hash-Based Search: Threshold-Centered Embeddings

Though the embedding YSVD minimizes the embedding error of X, it is not the best starting point for constructing similarity-sensitive hash codes. The main reason is that an MDS strives for a global stress minimization, while hash-based search methods concentrate on the high similarities in S in the first place. The nature of this property is captured by the following definition, which relates the threshold-specific stress of an embedding to the statistical concepts of precision and recall. Figure 2 illustrates the definition.

[Footnote 8: "The similarity threshold controls the effective embedding error." This property complements the two properties of hash-based search methods stated in Subsection 2.4.]

[Figure 2 omitted: two panels, "Similarities in high-dimensional original space" (Rθ: ϕ(xi, xj) > θ) and "Similarities in low-dimensional embedding space" (R̂θ: ϕ̂(yi, yj) > θ), marking the similarities primarily responsible for recall stress and for precision stress.]

Figure 2: If the original document representations, X, are embedded into a low-dimensional space, the resulting document representations Y resemble the original similarities only imperfectly. Given a particular threshold θ, similarities of the original space may be shifted from above θ to below θ (hatched area left), from below θ to above θ (hatched area right), or still remain in the interval [θ; 1] (green area). The similarities in the hatched areas are responsible for the major part of the embedding stress.

Definition 1 (precision stress, recall stress). Let D be a set of objects and let X and Y be their representations in the n-dimensional and the k-dimensional space respectively, k < n. Moreover, let ϕ : X × X → [0; 1] and ϕ̂ : Y × Y → [0; 1] be two similarity measures, and let θ ∈ [0; 1] be a similarity threshold.

θ defines two result sets, Rθ and R̂θ, which are comprised of those pairs {xi, xj}, xi, xj ∈ D, whose respective representations in X and Y are above the similarity threshold θ:

{xi, xj} ∈ Rθ ⇔ ϕ(xi, xj) > θ,   and likewise:   {xi, xj} ∈ R̂θ ⇔ ϕ̂(yi, yj) > θ

Then the set of returned pairs from the embedding space, R̂θ, defines the precision stress at similarity threshold θ, ep,θ:

ep,θ = Σ_{{xi,xj} ∈ R̂θ} (ϕ(xi, xj) − ϕ̂(yi, yj))²  /  Σ_{{xi,xj} ∈ R̂θ} (ϕ(xi, xj))²

Likewise, the set of similar pairs in the original space, Rθ, defines the recall stress at similarity threshold θ, er,θ:

er,θ = Σ_{{xi,xj} ∈ Rθ} (ϕ(xi, xj) − ϕ̂(yi, yj))²  /  Σ_{{xi,xj} ∈ Rθ} (ϕ(xi, xj))²
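Definition 1 translates directly into a few lines of NumPy (a sketch; S and S_hat denote the pairwise similarity matrices of the original and the embedded representations, names ours):

import numpy as np

def stress_statistics(S, S_hat, theta):
    """Precision stress e_p,theta and recall stress e_r,theta of Definition 1."""
    iu = np.triu_indices_from(S, k=1)     # every pair {x_i, x_j} exactly once
    s, s_hat = S[iu], S_hat[iu]
    R_hat = s_hat > theta                 # pairs reported as similar by the embedding
    R = s > theta                         # pairs that are similar in the original space
    e_p = np.sum((s[R_hat] - s_hat[R_hat]) ** 2) / np.sum(s[R_hat] ** 2)
    e_r = np.sum((s[R] - s_hat[R]) ** 2) / np.sum(s[R] ** 2)
    return e_p, e_r

Plotting e_p and e_r over a range of thresholds θ for a given embedding yields curves of the kind shown in Figure 3.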
Remark 2. The precision stress and the recall stress of an embedding Y are statistics that tell us something about the maximum precision and recall that can be achieved with similarity hash codes constructed from Y. The larger the precision stress is the higher is the probability that two embedded vectors, yi, yj, are claimed being similar though their similarity in the original space, sij = ϕ(xi, xj), is low. Likewise, the larger the recall stress is the higher is the probability that two vectors in the original space, xi, xj, are mapped onto different codes though their similarity, sij, is high.

[Figure 3 omitted: precision stress (left) and recall stress (right) plotted against the similarity threshold θ for the embeddings YSVD (optimum MDS), YFF (fuzzy-fingerprinting), and YLSH (locality-sensitive hashing).]

Figure 3: Evolution of the embedding stress against the similarity threshold θ (lower stress is better). The left plot takes the embedded vectors as basis, the right plot the original vectors, corresponding to the precision stress, ep, and the recall stress, er, respectively. At some threshold the embedding of fuzzy-fingerprinting, YFF, outperforms the optimum MDS embedding, YSVD.
For the three embeddings, YSVD, YFF, and YLSH, obtained from optimum MDS, fuzzy-fingerprinting, and LSH respectively, we have analyzed the precision stress and the recall stress at various similarity thresholds and with different corpora. The results reflect the predicted behavior:

1. Because of its generality (domain independence) the LSH embedding is consistently worse than the prefix class embedding of fuzzy-fingerprinting.

2. At some break-even point the retrieval performance of prefix class embedding outperforms the optimum MDS embedding.

Figure 3 illustrates this behavior for a sample of 2000 documents drawn from the Reuters Corpus Volume 1 (RCV1) [24]. With other corpora and other parameter settings for the hash-based search methods this characteristic is observed as well. We analyzed in this connection also specifically compiled corpora whose similarity distribution is significantly skewed towards high similarities: Figure 4 contrasts the similarity distribution in the original Reuters Corpus (hatched light) and in the special corpora (solid dark).

Remark 3. For most retrieval tasks an—even high—precision stress can be accepted, since the necessary subsequent exact similarity analysis needs to be performed only for a very small fraction |Dq|/|D| of all documents. Remember that the construction methods for the hash-based search methods provide sufficient means to fine-tune the trade-off between the precision stress, ep, and the recall stress, er.

[Figure 4 omitted: percentage of similarities per similarity interval, on a logarithmic scale, for Reuters (original) and Reuters (special).]

Figure 4: Similarity distribution in the original Reuters Corpus and in the special compilations with increased high similarities.

3.3 Threshold-Optimum Embeddings in Closed Retrieval Situations

Threshold-centered embeddings are tailored document models for special retrieval tasks such as near duplicate detection or high similarity search. They tolerate a large embedding error in the low similarity interval [0, θ] and strive for a high fidelity of similarities from the interval [θ, 1]. This principle forms the rationale of hash-based search.

With YSVD, obtained by optimally solving an MDS, an embedding that minimizes the accumulated error over all similarities is at hand. We now introduce a threshold-optimum embedding, Y∗, which minimizes the accumulated error with respect to the interval [θ, 1]. The presented ideas address the closed retrieval situation in the first place—for open retrieval situations the construction of an optimum embedding requires a-priori knowledge about the term distribution in the collection D. Though the typical use case for hash-based search is an open retrieval situation, the derivation is useful because (i) it provides additional theoretical insights and (ii) it forms a basis to reason about performance bounds.

[Footnote 9: Remember that YFF is a domain-specific embedding which exploits knowledge about document models and term distributions.]

The θ-specific retrieval analysis of the preceding subsection suggests the construction principle of Y∗. Instead of approximating the original similarity matrix S a "thresholded" similarity matrix Sθ is taken as basis, introducing this way the binary nature of similarity hashing into the approximation process. For a given threshold θ the matrix Sθ is defined as follows:

Sθ := ( fθ(s11)  fθ(s12)  ...  fθ(s1n)
          ...       ...   ...     ...
        fθ(sn1)  fθ(sn2)  ...  fθ(snn) ),

where fθ(s) is a combination of two sigmoidal functions that define an upper threshold θ and a lower threshold ϑ respectively. Similarity values from [θ; 1] are amplified toward 1, similarity values from [0; θ) are moved toward ϑ. The following rationale reveals the underlying trade-off: with increasing difference θ − ϑ the amplification above θ improves the robustness in the encoding step (cf. Subsection 2.2), with increasing ϑ the contraction toward ϑ reduces the error in the embedding step and hence allows for shorter codes. fθ can be realized in different ways; within our analyses two consecutive tanh-approximations with the thresholds ϑ = 0.1 and θ = 0.8 were employed.

Since Sθ is a symmetric matrix it is normal, and hence its Schur decomposition yields a spectral decomposition:

Sθ = ZΛZ^T

Z is an orthogonal matrix comprising the eigenvectors of Sθ, and Λ is a diagonal matrix with the eigenvalues of Sθ. If Sθ is positive definite its unique Cholesky decomposition exists:

Sθ = Z̄Z̄^T

X̄ := Z̄^T can directly be interpreted as matrix of thresholded document representations. As was shown in Subsection 3.1, the dimension of the embedding space, k, prescribes the rank of the approximation Ŝθ of Sθ. Its optimum rank-k-approximation, Ŝ∗θ, is obtained by an SVD of Sθ, which can be expressed in the factors of the rank-k-approximated SVD of X̄. Let O∆Q^T be the SVD of X̄ and hence Ok^T Ok = I. Then it holds:

Ŝ∗θ = X̄k^T X̄k = (Ok ∆k Qk^T)^T Ok ∆k Qk^T = (∆k Qk^T)^T ∆k Qk^T = Y∗^T Y∗

Remark 4. Y∗ := ∆k Qk^T is an embedding of X optimized for similarity hashing. Due to construction, Y∗ is primarily suited to answer binary similarity questions at the a-priori chosen threshold θ. Since Sθ is derived from S by sigmoidal thresholding, the document representations in Y are insusceptible with respect to a rank-k-approximation. This renders Y∗ robust for similarity comparisons under the following interpretation of similarity:

If ⟨yi∗, yj∗⟩ > 0.5 assume ⟨xi, xj⟩ > θ
If ⟨yi∗, yj∗⟩ ≤ 0.5 assume ⟨xi, xj⟩ ≤ θ

where ⟨·, ·⟩ denotes the scalar product. Table 2 illustrates the superiority of Y∗: For the interesting similarity interval [θ, 1] it outperforms the classical embedding as well as the embedding strategies of sophisticated hash-based search methods.

                            Precision
  Embedding   Dim.   (0.8; 0.9]   (0.85; 1.0]   (0.9; 1.0]   (0.95; 1.0]
  Y∗          50        0.58         0.71          0.84         0.95
  YFF         50        0.17         0.45          0.69         0.85
  YSVD        50        0.35         0.45          0.57         0.73
  Y∗          25        0.29         0.38          0.51         0.74
  YFF         25        0.01         0.02          0.09         0.59
  YSVD        25        0.16         0.22          0.34         0.56

Table 2: Results of a near-duplicate retrieval analysis, based on RCV1 and the experimental setup like before. The precision achieved with Y∗ outperforms even the YFF embedding.

Remark 5. To obtain for a new n-dimensional vector x its optimum k-dimensional representation y∗ at similarity threshold θ, a k × n projection matrix P can be stated: y∗ = Px, where P is computed from X̄^T P^T = Y∗^T.

Remark 6. The transformations imply the thresholded similarity matrix Sθ being positive definite. An efficient and robust test for positive definiteness was recently proposed by Rump [25]. If Sθ is not positive definite it can be approximated by a positive definite matrix Sθ+, which should be the nearest symmetric positive definite matrix under the Frobenius norm. As shown by Higham, Sθ+ is given by the following identity [14]:

Sθ+ = (G + H) / 2   with   G = (Sθ + Sθ^T) / 2,

where H is the symmetric polar factor of G.
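The whole construction of Subsection 3.3 fits into a short script (Python/NumPy sketch under our own simplifications: a single tanh step instead of two consecutive tanh-approximations, and eigenvalue clipping as a crude stand-in for the nearest-positive-definite repair of Remark 6):

import numpy as np

def f_theta(S, theta=0.8, lower=0.1, sharpness=20.0):
    """Sigmoidal thresholding: similarities above theta are amplified toward 1,
    similarities below theta are contracted toward the lower threshold (vartheta)."""
    return lower + (1.0 - lower) * 0.5 * (1.0 + np.tanh(sharpness * (S - theta)))

def threshold_optimum_embedding(S, k, theta=0.8):
    """Y* from the spectral decomposition of the thresholded similarity matrix S_theta."""
    S_theta = f_theta(S, theta)
    S_theta = (S_theta + S_theta.T) / 2.0          # keep it symmetric
    eigvals, Z = np.linalg.eigh(S_theta)
    top = np.argsort(eigvals)[::-1][:k]            # k largest eigenvalues
    return np.sqrt(np.clip(eigvals[top], 0, None))[:, None] * Z[:, top].T   # k x n

def similar(y_i, y_j):
    """Binary similarity test of Remark 4: infer a similarity above theta
    iff the scalar product of the embedded vectors exceeds 0.5."""
    return float(y_i @ y_j) > 0.5

The columns of the returned k × n matrix are the thresholded document representations; quantizing and encoding them as in Subsection 2.2 yields hash codes tuned to the chosen θ.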
4. HASH-BASED RETRIEVAL AT WORK

Finally, this section demonstrates the efficiency of locality-sensitive hashing, fuzzy-fingerprinting, and hash-based search in general. We report results from a large-scale experiment on near-duplicate detection and plagiarism analysis, using a collection of 100,000 documents compiled with Yahoo, Google, and AltaVista by performing a focused search on specific topics. To compile the collection a small number of seed documents about a topic was chosen from which 100 keywords were extracted with a co-occurrence analysis [22]. Afterward, search engine queries were generated by choosing up to five keywords, and the highest ranked search results were downloaded and their text content extracted.

To render retrieval results comparable the two hash functions were parameterized in such a way that, on average, small and equally-sized document sets were returned for a query. As described in Section 2.4, this relates to adjusting the recall of the hash functions, which is done with the number of fuzzification schemes and random vector sets respectively: two or three different fuzzification schemes were employed for fuzzy-fingerprinting; between 10 and 20 different random vector sets were employed for locality-sensitive hashing. The precision of fuzzy-fingerprinting is controlled by the number k of prefix classes and the number r of deviation intervals per fuzzification scheme. To improve the precision performance either of them or both can be raised. Note that k is application-dependent; typical values for r range from 2 to 4. The precision of locality-sensitive hashing is controlled by the number k of combined hash functions. For instance, when using the hash family proposed by Datar et al., k corresponds to the number of random vectors per hash function [8]; typical values for k range from 20 to 100.

The plots in Figure 5 contrast performance results. With respect to recall either approach is excellent at high similarity thresholds (> 0.8) compared to a linear search using a cosine measure. However, high recall values at low similarity thresholds are achieved by chance only. With respect to precision fuzzy-fingerprinting is significantly better than locality-sensitive hashing—a fact which directly affects the runtime performance. With respect to runtime performance both hashing approaches perform orders of magnitude faster than a linear search. For reasonably high thresholds θ the similarity distribution (Figure 4) along with the precision stress (Figure 3, left) determine a sublinear increase of the result set size |Dq| for a document query dq (Equation 2).

Remark 7. The computation of the baseline relies on a non-reduced vector space, defined by the dictionary underlying D. Note that a pruned document representation or a cluster-based preprocessing of D, for example, may have exhibited a slower—but yet linear—growth. Moreover, the use of such specialized retrieval models makes the analysis results difficult to interpret.

5. CONCLUSION AND CURRENT WORK

The paper analyzed the retrieval performance and explained the retrieval rationale of hash-based search methods. The starting point was the development of a unified view on these methods, along with the formulation of three properties that capture their design principles. We pointed out the selective nature of hash-based search and introduced new stress statistics to quantify this characteristic.

The concept of tolerating a large embedding error for small similarities while striving for a high fidelity at high similarities can be used to reformulate the original similarity matrix and thus to derive tailored embeddings in closed retrieval situations.

The presented ideas open new possibilities to derive theoretical bounds for the performance of hash-based search methods.
[Figure 5 omitted: recall-at-similarity, precision-at-similarity, and runtime-at-sample-size plots for fuzzy-fingerprinting (YFF), locality-sensitive hashing (YLSH), and a linear search baseline.]

Figure 5: Near-duplicate detection and plagiarism analysis with hash-based search technology. The plots show recall-at-similarity, precision-at-similarity, and runtime-at-sample-sizes, using fuzzy-fingerprinting (FF) and locality-sensitive hashing (LSH).

Whether they can be used to develop better search methods is subject of our research: by construction, Y∗ outperforms other embeddings. It is unclear to which extent this property can be utilized in similarity search methods designed for open retrieval situations. The theoretical analysis of the trade-off between θ and ϑ as well as the Remarks 5 and 6 provide interesting links to follow.

6. REFERENCES

[1] R. Ando and L. Lee. Iterative Residual Rescaling: An Analysis and Generalization of LSI. In Proc. 24th conference on research and development in IR, 2001.
[2] G. Aston and L. Burnard. The BNC Handbook. http://www.natcorp.ox.ac.uk/what/, 1998.
[3] M. Bawa, T. Condie, and P. Ganesan. LSH Forest: Self-Tuning Indexes for Similarity Search. In WWW'05: Proc. of the 14th int. conference on World Wide Web, 2005.
[4] A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic Clustering of the Web. In Selected papers from the sixth int. conference on World Wide Web, 1997.
[5] D. Cai and X. He. Orthogonal Locality Preserving Indexing. In Proc. of the 28th conference on research and development in IR, 2005.
[6] M. S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In STOC'02: Proc. of the thirty-fourth ACM symposium on theory of computing, 2002.
[7] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cambridge, 1990.
[8] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In SCG'04: Proc. of the twentieth symposium on computational geometry, 2004.
[9] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.
[10] C. Eckart and G. Young. The Approximation of one Matrix by Another of Lower Rank. Psychometrika, 1:211–218, 1936.
[11] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In The VLDB Journal, 1999.
[12] X. He, D. Cai, H. Liu, and W.-Y. Ma. Locality Preserving Indexing for Document Representation. In Proc. of the 27th conference on research and development in IR, 2001.
[13] M. Henzinger. Finding Near-Duplicate Web Pages: a Large-Scale Evaluation of Algorithms. In Proc. of the 29th conference on research and development in IR, 2006.
[14] N. Higham. Computing a Nearest Symmetric Positive Semidefinite Matrix. Linear Algebra and its App., 1988.
[15] G. Hinton and R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313:504–507, 2006.
[16] T. Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42:177–196, 2001.
[17] P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. In FOCS'00: Proc. of the 41st symposium on foundations of computer science, 2000. IEEE Computer Society.
[18] P. Indyk and R. Motwani. Approximate Nearest Neighbor – Towards Removing the Curse of Dimensionality. In Proc. of the 30th symposium on theory of computing, 1998.
[19] I. Jolliffe. Principal Component Analysis. Springer, 1996.
[20] J. Kleinberg. Two Algorithms for Nearest-Neighbor Search in High Dimensions. In STOC'97: Proc. of the twenty-ninth ACM symposium on theory of computing, 1997.
[21] J. Kruskal. Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis. Psychometrika, 29(1), 1964.
[22] Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. Int. Journal on Artificial Intelligence Tools, 13(1):157–169, 2004.
[23] J. Nolan. Stable Distributions—Models for Heavy Tailed Data. http://academic2.american.edu/~jpnolan/stable/, 2005.
[24] T. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus Volume 1. From Yesterday's News to Tomorrow's Language Resources. In Proc. of the third int. conference on language resources and evaluation, 2002.
[25] S. Rump. Verification of Positive Definiteness. BIT Numerical Mathematics, 46:433–452, 2006.
[26] B. Stein. Fuzzy-Fingerprints for Text-Based IR. In Proc. of the 5th Int. Conference on Knowledge Management, Graz, Journal of Universal Computer Science, 2005.
[27] B. Stein and S. Meyer zu Eißen. Near Similarity Search and Plagiarism Analysis. In From Data and Information Analysis to Knowledge Engineering. Springer, 2006.
[28] R. Weber, H. Schek, and S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-dimensional Spaces. In Proc. of the 24th VLDB conference, 1998.
[29] H. Yang and J. Callan. Near-Duplicate Detection by Instance-level Constrained Clustering. In Proc. of the 29th conference on research and development in IR, 2006.
