Recursive Hashing and One-Pass, One-Hash n-Gram Count Estimation
Daniel Lemire
Université du Québec à Montréal
and
Owen Kaser
University of New Brunswick
Many applications use sequences of n consecutive symbols (n-grams). We review n-gram hashing and prove that
recursive hash families are pairwise independent at best. We prove that hashing by irreducible polynomials is
pairwise independent whereas hashing by cyclic polynomials is quasi-pairwise independent: we make it pairwise
independent by discarding n − 1 bits. One application of hashing is to estimate the number of distinct n-grams,
a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions,
we desire a statistically unassuming algorithm with universally valid accuracy bounds. Most related work has
focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass one-
hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. For example, replacing pairwise independent hashing with 4-wise independent hashing improves the theoretical bounds on estimation accuracy by a factor of 2. We show that recursive random hashing is sufficiently independent in practice.
Perhaps surprisingly, our experiments showed that hashing by cyclic polynomials, which is only quasi-pairwise independent, sometimes outperformed 10-wise independent hashing while being twice as fast. For comparison, we measured the time needed to obtain exact n-gram counts using suffix arrays and found that, while using hardly any storage, we were an order of magnitude faster. The experiments used a large collection of English text from Project Gutenberg as well as synthetic data.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Content Analysis and Index-
ing; H.2.7 [Database Administration]: Data warehouse and repository
General Terms: Algorithms, Theory, Experimentation
Additional Key Words and Phrases: recursive hashing, view-size estimation, n-grams
1. INTRODUCTION
Consider a sequence of symbols ai ∈ Σ of length N. The data source has high latency: for
example, it is not in a flat binary format or a DBMS, making random access and skipping
impractical. The symbols need not be characters from a natural language: they can be par-
ticular “events” inferred from a sensor or a news feed, they can be financial or biomedical
patterns found in time series, they can be words in a natural language, and so on. The
The first author was supported by NSERC grant 261437 and the second author was supported by NSERC
grant 155967.
Authors’ addresses: Daniel Lemire, Université du Québec à Montréal, 100 Sherbrooke West, Montréal, QC
H2X 3P2 Canada and Owen Kaser, University of New Brunswick, 100 Tucker Park Road, Saint John, NB
E2L 4L1 Canada
number of distinct symbols (|Σ|) could be large (on the order of 10^5 in the case of words
in a typical English dictionary), but it is assumed to be small compared to the amount of
memory available. We make no other assumption about the distribution of these distinct
symbols and n-grams.
An n-gram is a consecutive sequence of n symbols. We use n-grams in language mod-
eling [Gao and Zhang 2001], pattern recognition [Yannakoudakis et al. 1990], predicting
web page accesses [Deshpande and Karypis 2004], information retrieval [Nie et al. 2000],
text categorization and author attribution [Caropreso et al. 2001; Joula et al. 2006; Keselj
and Cercone 2004], speech recognition [Jelinek 1998], multimedia [Paulus and Klapuri
2003], music retrieval [Doraisamy and Rüger 2003], text mining [Losiewicz et al. 2000],
information theory [Shannon 1948], software fault diagnosis [Bose and Srinivasan 2005],
data compression [Bacon and Houde 1984], data mining [Su et al. 2000], indexing [Kim
et al. 2005], On-line Analytical Processing (OLAP) [Keith et al. 2005a], optimal charac-
ter recognition (OCR) [Droettboom 2003], automated translation [Lin and Hovy 2003],
time series segmentation [Cohen et al. 2002], and so on. This paper concerns the use of
previously published hash functions for n-grams, together with recent randomized algo-
rithms for estimating the number of distinct items in a stream of data. Together, they
permit memory-efficient estimation of the number of distinct n-grams and motivate finer
theoretical investigations into efficient n-gram hashing.
The number of distinct n-grams grows large with n: storing Shakespeare's First Folio [Project Gutenberg Literary Archive Foundation 2007] takes up about 4.6 MiB, but we can verify that it has over 3 million distinct 15-grams of characters. If each distinct n-gram can be stored using log(4.6 × 10^6) ≈ 22 bits, then we need over 8 MiB just to store the n-grams (3 × 10^6 × 22/8 ≈ 8.3 × 10^6 bytes), without counting the indexing overhead. Indexing data
structures such as suffix arrays require even more storage. Thus, storing and indexing n-
grams can use up more storage than the original data source. Extrapolating this to the large
corpora used in computational linguistic studies, we see the futility of using brute-force
approaches that store the n-grams in main memory, when n is large. For smaller values of
n, n-grams of English words are also likely to defeat brute-force approaches.
There are two strategies to estimate statistics of a sequence in a single pass [Kearns
et al. 1994; Batu et al. 2002; Guha et al. 2006]. The generative (or black-box) strategy
samples values at random. From the samples, the probability of each value is estimated
by maximum likelihood or other statistical techniques. The evaluative strategy, on the
other hand, probes the exact probabilities or, equivalently, the number of occurrences of
(possibly randomly) chosen values. In one pass, we can randomly probe several n-grams
so we know their exact frequency. For example, given the sequence “aadababbabaacc”, a
random sampling might be “aabc”; from this P(a), P(b), and P(c) can be approximated.
However, one could instead compute the exact frequency of value “b.”
On the one hand, it is difficult to estimate the number of distinct elements from a sam-
pling, without making further assumptions. For example, suppose there is only one distinct
n-gram in 100 samples out of 100,000 n-grams. Should we conclude that there is only one
distinct n-gram overall? Perhaps there are 100 distinct n-grams, but 99 of them only oc-
cur once — thus there is a ≈ 91% probability that we observe only the common one.
While this example is a bit extreme, skewed distributions are quite common, as Zipf’s law
shows. Choosing, a priori, the number of samples we require is a major difficulty. Esti-
mating the probabilities from sampling is a problem that still interests researchers to this
day [McAllester and Schapire 2000; Orlitsky et al. 2003].
On the other hand, distinct count estimates from a probing are statistically easier [Gib-
bons and Tirthapura 2001]. With the example above, with just enough storage budget
to store 100 distinct n-grams, we would get an exact count estimate! On the downside,
probing requires properly randomized hashing.
In the spirit of probing, Gibbons-Tirthapura (GT) [Gibbons and Tirthapura 2001] count estimation goes as follows. We have m distinct items in a stream containing the distinct items x_1, . . . , x_m with possible repetitions. Let h(x_i) be pairwise-independent hash values over [0, 2^L) and let h_t(x_i) be the first t bits of the hash value. We have that E(card(h_t^{−1}(0))) = m/2^t. Given a fixed memory budget M, and setting t = 0, we scan the data. While scanning, we store all distinct items x_i such that h_t(x_i) = 0 in a look-up table H. As soon as size(H) = M + 1, we increment t by 1 and remove all x_i in H such that h_t(x_i) ≠ 0. Typically, at least one element in H is removed, but if not, the process of incrementing t and removing items is repeated until size(H) ≤ M. Then we continue scanning. After the run is completed, we return size(H) × 2^t as the estimate. By choosing M = 576/ε² [Bar-Yossef et al. 2002], we achieve an accuracy of ε, 5 times out of 6 (i.e., P(|size(H) × 2^t − m| > εm) < 1/6), by an application of Chebyshev's inequality. By a Chernoff bound, running the algorithm O(log 1/δ) times and taking the median of the results gives a reliability of 1 − δ instead of 5/6. Bar-Yossef et al. suggest improving the algorithm by storing hash values of the x_i's instead of the x_i's themselves, lowering the memory usage at the cost of some reliability. Notice that our Corollary 2 shows that the estimate of a 5/6 reliability for M = 576/ε² is pessimistic: M = 576/ε² implies a reliability of over 99%. We also prove that replacing pairwise independent by 4-wise independent hashing substantially improves the existing theoretical performance bounds.
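For illustration, a minimal C++ sketch of this GT estimation loop might look as follows; hashNgram stands in for any of the randomized hash families studied later (declared but left unimplemented here), and the remaining names are our own.

#include <cstdint>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

const int L = 19;                                // hash values lie in [0, 2^L)
uint64_t hashNgram(const std::string &gram);     // any randomized hash family

double estimateDistinct(const std::vector<std::string> &grams, size_t M) {
  int t = 0;                                     // leading bits forced to zero
  std::unordered_map<std::string, uint64_t> H;   // kept n-grams and their hashes
  for (const std::string &g : grams) {
    uint64_t h = hashNgram(g);
    if (h >> (L - t)) continue;                  // first t bits are not all zero
    H.emplace(g, h);                             // duplicates are ignored by the map
    while (H.size() > M) {                       // budget exceeded: increment t, re-filter
      ++t;
      for (auto it = H.begin(); it != H.end();)
        it = (it->second >> (L - t)) ? H.erase(it) : std::next(it);
    }
  }
  return double(H.size()) * double(uint64_t(1) << t);  // size(H) x 2^t
}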
Random hashing can be the real bottleneck in probing, but to alleviate this problem
for n-gram hashing, we use recursive hashing [Cohen 1997; Karp and Rabin 1987]: we
leverage the fact that successive n-grams have n − 1 characters in common. We study em-
pirically online n-gram count estimation that uses one pass and hashes each n-gram only
once. We compare several different recursive n-gram hashing algorithms including hashing
by cyclic and irreducible polynomials in the binary Galois Field (GF(2)[x]). We character-
ize the independence of recursive hashing by polynomials whereas only empirical studies
had existed. We prove that hashing by cyclic polynomials is not even uniform, whereas
hashing by irreducible polynomials is pairwise independent. It is interesting to contrast
these new theoretical results with the fact that Cohen reported [Cohen 1997] that hashing
by cyclic polynomials provided a more uniform hashing, experimentally. Fortunately, we
were also able to prove that hashing by cyclic polynomials is quasi-pairwise independent.
The main contributions of this paper are an analytical survey of recursive hashing, a tighter
theoretical bound in count estimation when hashing functions are more than pairwise inde-
pendent, and an experimental validation to demonstrate practical usefulness and to suggest
further theoretical investigation.
3. RELATED WORK
Related work includes reservoir sampling, suffix arrays, and view-size estimation in OLAP.
We can choose randomly, without replacement, k samples in a sequence of unknown length using a single pass through the data by reservoir sampling. Reservoir sampling [Vitter 1985; Kolonko and Wasch 2006; Li 1994] was first introduced by Knuth [Knuth 1969]. All reservoir-sampling algorithms begin by appending the first k samples to an array. In their linear-time (O(N)) form, reservoir-sampling algorithms sequentially visit every symbol, choosing it as a possible sample with probability k/t, where t is the number of symbols read so far. The chosen sample is simply appended at the end of the array while an existing sample is flagged as having been removed. The array has an average size of k(1 + log(N/k)) samples at the end of the run. In their sublinear form (O(k(1 + log(N/k))) expected time), the algorithms skip a random number of data points each time. While these algorithms use a single pass, they assume that the number of required samples k is known a priori, which is difficult without any knowledge of the data distribution.
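As an illustration, here is a minimal C++ sketch of the classic in-place variant (Algorithm R); the append-and-flag variant described above differs only in bookkeeping. All names are ours.

#include <cstddef>
#include <random>
#include <vector>

// Keep k uniform samples, without replacement, from a stream of unknown length.
template <typename T>
class Reservoir {
  std::vector<T> samples;
  std::size_t k, t = 0;                 // capacity and symbols read so far
  std::mt19937 gen{42};
public:
  explicit Reservoir(std::size_t k) : k(k) {}
  void add(const T &symbol) {
    ++t;
    if (samples.size() < k) {           // the first k symbols are all kept
      samples.push_back(symbol);
    } else {                            // replace a random entry with probability k/t
      std::uniform_int_distribution<std::size_t> d(0, t - 1);
      std::size_t j = d(gen);
      if (j < k) samples[j] = symbol;
    }
  }
  const std::vector<T> &get() const { return samples; }
};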
Using suffix arrays [Manber and Myers 1990; 1993] and the length of the maximal common prefix between successive suffixes, Nagao and Mori [Nagao and Mori 1994] proposed a fast algorithm to compute n-gram statistics exactly. However, it cannot be considered an
online algorithm even if we compute the suffix array in one pass: after constructing the
suffix array, one must go through all suffixes at least once more. Their implementation was
later improved by Kit and Wilks [Kit and Wilks 1998]. Compared to uncompressed suf-
fix trees [Giegerich et al. 1999], uncompressed suffix arrays use several times less storage
and their performance does not depend on the size of the alphabet. Suffix arrays can be
constructed in O(N) time using O(N) working space [Hon et al. 2003]. Querying a suffix
array for a given n-gram takes O(log N) time.
By definition, each n-gram is a tuple of length n and can be viewed as a relation to be ag-
gregated. OLAP (On-Line Analytical Processing) [Codd 1993] is a database acceleration
technique used for deductive analysis, typically involving aggregation. To achieve acceler-
ation, one frequently builds data cubes [Gray et al. 1996] where multidimensional relations
are preaggregated in multidimensional arrays. OLAP is commonly used for business pur-
poses with dimensions such as time, location, sales, expenses, and so on. Concerning text,
most work has focused on informetrics/bibliomining, document management and informa-
tion retrieval [McCabe et al. 2000; Mothe et al. 2003; Niemi et al. 2003; Bernard 1995;
Sullivan 2001]. The idea of using OLAP for exploring the text content itself (including
phrases and n-grams) was proposed for the first time by Keith, Kaser and Lemire [Keith
et al. 2005b; 2005a; Kaser et al. 2006]. The estimation of n-gram counts can be viewed
as an OLAP view-size estimation problem which itself “remains an important area of open
research” [Dehne et al. 2006].
A data-agnostic approach to view-size estimation [Shukla et al. 1996], which is likely to be used by database vendors, can be computed almost instantly. For n-gram estimation, the number of attributes is the size of the alphabet |Σ| and η is the number of n-grams with possible repetitions (η = N − n + 1). Given η cells picked uniformly at random, with replacement, in a V = K_1 × K_2 × · · · × K_n space, the probability that any given cell (think "n-gram") is omitted is (1 − 1/V)^η. For n-grams, V = |Σ|^n. Therefore, the expected number of unoccupied cells is (1 − 1/V)^η × V. Similarly, assuming the number of n-grams is known to be m, the same model permits us to estimate the number of (n−1)-grams by m × (1 − (m/V)^{|Σ|}). In practice, this approach systematically overestimates because relations are not uniformly distributed.
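For concreteness, the following C++ fragment evaluates this data-agnostic estimate; the alphabet size and n below are hypothetical values for illustration only.

#include <cmath>
#include <cstdio>

// Expected number of occupied cells after eta uniform draws into V cells:
// V * (1 - (1 - 1/V)^eta), the data-agnostic view-size estimate above.
double expectedDistinct(double V, double eta) {
  return V * (1.0 - std::pow(1.0 - 1.0 / V, eta));
}

int main() {
  double sigma = 26.0;                  // hypothetical alphabet size |Sigma|
  int n = 5;                            // n-gram length
  double N = 4.6e6;                     // stream length, in symbols
  double V = std::pow(sigma, n);        // size of the n-gram space, |Sigma|^n
  double eta = N - n + 1;               // number of n-grams, with repetitions
  std::printf("estimated distinct n-grams: %.3g\n", expectedDistinct(V, eta));
  return 0;
}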
A more sophisticated view-size estimation algorithm used in the context of data ware-
housing and OLAP [Shukla et al. 1996; Kotidis 2002] is logarithmic probabilistic count-
ing [Flajolet and Martin 1985]. This approach requires a single pass and almost no mem-
ory, but it assumes independent hashing for which no algorithm using limited storage is
known [Bar-Yossef et al. 2002]. Alon et al. [Alon et al. 1996, Proposition 2.3] showed
probabilistic counting could work with only pairwise independence, but the error bounds
are large (at least 100%): for any c > 2, the relative error is bounded by c − 1 with reli-
ability 1 − 2/c. Practical results are sometimes disappointing [Dehne et al. 2006]. Other
variants of this approach include linear probabilistic counting [Whang et al. 1990; Shah
et al. 2004] and Loglog counting [Durand and Flajolet 2003], but their theoretical analysis
also assumes independent hashing.
View-size estimation through sampling has been made adaptive by Haas et al. [Haas
et al. 1995]: their strategy is to first attempt to determine whether the distribution is skewed
and then use an appropriate statistical estimator. We can also count the marginal frequen-
cies of each attribute value (or symbol in an n-gram setting) and use them to give estimates
as well as (exact) lower and upper bound on the view size [Yu et al. 2005]. Other re-
searchers make particular assumptions on the distribution of relations [Nadeau and Teorey
2003; Ciaccia et al. 2001; 2003; Faloutsos et al. 1996].
5. HASHING n-GRAMS
In this section, we begin by showing how we can construct a moderately efficient (non-recursive) n-wise independent hash function. We then define three recursive hashing families, CYCLIC, GENERAL, and ID37, and study their properties, being especially attentive to CYCLIC. All four hashing families will later be benchmarked in the context of count estimation using the GT algorithm.
A trivial way to generate an independent hash is to assign a random integer in [0, 2^L) to each new value x. Unfortunately, this requires as much processing and storage as a complete indexing of all values. However, in a multidimensional setting this approach can be put to good use. Suppose that we have tuples in K_1 × K_2 × · · · × K_n such that |K_i| is small for all i. We can construct independent hash functions h_i : K_i → [0, 2^L) for all i and combine them. The hash function h(x_1, x_2, . . . , x_n) = h_1(x_1) ⊕ h_2(x_2) ⊕ · · · ⊕ h_n(x_n) is then n-wise independent (⊕ is the exclusive or). As long as the sets K_i are small, in time O(∑_i |K_i|) we can construct the hash function by generating ∑_i |K_i| random numbers and storing them in a look-up table (see Algorithm 1). With constant-time look-up, hashing an n-gram thus takes O(Ln) time, or O(n) if L is considered a constant.
Unfortunately, this hash function is not recursive. In the n-gram context, we could choose h_1 = h_2 = . . . since Σ = K_1 = K_2 = . . .. While the resulting hash function is recursive over hashed values since
h(x_2, . . . , x_{n+1}) = h_1(x_2) ⊕ · · · ⊕ h_1(x_{n+1}) = h_1(x_1) ⊕ h_1(x_{n+1}) ⊕ h(x_1, . . . , x_n),
it is no longer pairwise independent: any two n-grams that are permutations of one another receive the same hash value.
Algorithm 1 The (non-recursive) n-wise independent family.
Require: n L-bit hash functions h_1, h_2, . . . , h_n over Σ from an independent hash family
1: s ← empty FIFO structure
2: for each character c do
3: append c to s
4: if length(s)= n then
5: yield h1 (s1 ) ⊕ h2 (s2 ) ⊕ . . . ⊕ hn (sn )
{The yield statement returns the value, without terminating the algorithm.}
6: remove oldest character from s
7: end if
8: end for
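A minimal C++ rendering of Algorithm 1 might look as follows, with the per-symbol hash tables h_i filled lazily with random values as new symbols are seen; all names are ours.

#include <cstdint>
#include <deque>
#include <random>
#include <unordered_map>
#include <vector>

// n-wise independent n-gram hashing: XOR of n independent per-symbol hashes.
class NwiseHash {
  int n;
  std::mt19937_64 gen{42};
  std::vector<std::unordered_map<char, uint64_t>> h;  // h_1, ..., h_n
  std::deque<char> s;                                 // the last n symbols
public:
  explicit NwiseHash(int n) : n(n), h(n) {}
  uint64_t lookup(int i, char c) {                    // h_i(c), drawn on first use
    auto it = h[i].find(c);
    if (it == h[i].end()) it = h[i].emplace(c, gen()).first;
    return it->second;
  }
  // Feed one symbol; once the window holds n symbols, yield the n-gram hash.
  bool feed(char c, uint64_t &hash) {
    s.push_back(c);
    if ((int)s.size() < n) return false;
    hash = 0;
    for (int i = 0; i < n; ++i) hash ^= lookup(i, s[i]);
    s.pop_front();                                    // remove oldest character
    return true;
  }
};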
and by the earlier uniformity result, this last probability is equal to 1/2^L. This concludes the proof.
Of the four recursive hashing functions investigated by Cohen [Cohen 1997], GENERAL and CYCLIC were superior both in terms of speed and uniformity, though CYCLIC had a small edge over GENERAL. For large n, the benefits of these recursive hash functions compared to the n-wise independent hash function presented earlier can be substantial: n table look-ups are much more expensive than a single look-up followed by binary shifts. (Recall we assume that Σ is not known in advance.)
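For reference, a minimal C++ sketch of recursive hashing by cyclic polynomials follows; it uses the representation h(a_1 . . . a_n) = x^{n−1}h_1(a_1) + . . . + h_1(a_n) in GF(2)[x]/(x^L + 1) (as in the proof of Proposition 3 below), where multiplication by x^k is a barrel rotation of an L-bit word by k positions. The class and method names are ours.

#include <cstdint>
#include <random>
#include <vector>

// Recursive n-gram hashing by cyclic polynomials in GF(2)[x]/(x^L + 1):
// addition is XOR and multiplying by x^k rotates an L-bit word left by k.
class CyclicHash {
  int L, n;
  uint64_t mask, H = 0;
  std::vector<uint64_t> h1;                  // random L-bit hash of each symbol
public:
  CyclicHash(int L, int n, int alphabetSize)
      : L(L), n(n), mask((uint64_t(1) << L) - 1), h1(alphabetSize) {
    std::mt19937_64 gen(42);
    for (uint64_t &v : h1) v = gen() & mask;
  }
  uint64_t rot(uint64_t w, int k) const {    // multiply w by x^k
    k %= L;
    return k == 0 ? w : (((w << k) | (w >> (L - k))) & mask);
  }
  uint64_t eat(int in) {                     // extend the window by one symbol
    return H = rot(H, 1) ^ h1[in];
  }
  uint64_t update(int out, int in) {         // slide the window: drop out, add in
    return H = rot(H, 1) ^ rot(h1[out], n) ^ h1[in];
  }
};

Theorem 5.1 below implies that discarding n − 1 bits of the returned value recovers pairwise independence.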
5.2 Is CYCLIC almost pairwise independent?
While CYCLIC is not uniform, it was shown empirically to have good uniformity [Cohen 1997]. Hence it is reasonable to expect CYCLIC to be almost uniform and maybe even almost pairwise independent. To illustrate this intuition, consider Table I, which shows that while h(a, a) is not uniform (h(a, a) = 001 is impossible), h(a, a) minus any bit is indeed uniformly distributed. We will prove that this result holds in general.
The next lemma and the next theorem show that CYCLIC is quasi-pairwise independent in the sense that L − n + 1 consecutive bits (e.g., the first or last L − n + 1 bits) are pairwise independent. In other words, CYCLIC is pairwise independent if we are willing to sacrifice n − 1 bits. (We say that n bits are "consecutive modulo L" if the bits are located at indexes i mod L for n consecutive values of i such as i = k, k + 1, . . . , k + n − 1.)
LEMMA 3. If q(x) ∈ GF(2)[x]/(x^L + 1) (with q(x) ≠ 0) has degree n < L, then
—the equation q(x)w = y mod x^L + 1 modulo the first n bits has exactly 2^n solutions for all y;
—more generally, the equation q(x)w = y mod x^L + 1 modulo any n consecutive bits (modulo L) has exactly 2^n solutions for all y.
(Equality modulo some bit positions means that the two polynomials may differ only at the specified positions; for our polynomials, equality modulo the first n bit positions implies that the difference of the two polynomials has degree at most n − 1.)
PROOF. Let P be the set of polynomials of degree at most L − n − 1. Take any p(x) ∈ P; then q(x)p(x) has degree at most L − n − 1 + n = L − 1, and thus if q(x) ≠ 0 and p(x) ≠ 0, then q(x)p(x) ≠ 0 mod x^L + 1. Hence, for any distinct p_1, p_2 ∈ P we have q(x)p_1 ≠ q(x)p_2 mod x^L + 1.
To prove the first item, we begin by showing that there is always exactly one solution in P. Consider that there are 2^{L−n} polynomials p(x) in P, and that all values q(x)p(x) are distinct. Suppose there are p_1, p_2 ∈ P such that q(x)p_1 = q(x)p_2 mod x^L + 1 modulo the first n bits; then q(x)(p_1 − p_2) is a polynomial of degree at most n − 1 while p_1 − p_2 is a polynomial of degree at most L − n − 1 and q(x) is a polynomial of degree n, thus p_1 − p_2 = 0. (If p_1 − p_2 ≠ 0 then degree(q(x)(p_1 − p_2) mod x^L + 1) ≥ degree(q(x)) = n, a contradiction.) Hence, all p(x) in P are mapped to distinct values modulo the first n bits, and since there are 2^{L−n} such distinct values, the result is shown.
Any polynomial of degree at most L − 1 can be decomposed into the form p(x) + x^{L−n}z(x), where z(x) is a polynomial of degree at most n − 1 and p(x) ∈ P. By the preceding result, for distinct p_1, p_2 ∈ P, q(x)(x^{L−n}z(x) + p_1) and q(x)(x^{L−n}z(x) + p_2) must be distinct modulo the first n bits. In other words, the equation q(x)(x^{L−n}z(x) + p) = y modulo the first n bits has exactly one solution p ∈ P for any z(x), and since there are 2^n polynomials z(x) of degree at most n − 1, q(x)w = y (modulo the first n bits) must have 2^n solutions.
To prove the second item, choose j and use the first item to find some w solving q(x)w = yx^j mod x^L + 1 modulo the first n bits. Then wx^{L−j} is a solution to q(x)w = y mod x^L + 1 modulo the bits in positions j, j + 1, . . . , j + n − 1 mod L.
We have the following corollary to Lemma 3.
COROLLARY 1. If w is chosen uniformly at random in GF(2)[x]/(x^L + 1), then P(q(x)w = y mod n − 1 bits) = 1/2^{L−n+1}, where the n − 1 bits are consecutive (modulo L).
THEOREM 5.1. Any L − n + 1 consecutive bits (modulo L) of the L-bit CYCLIC n-gram hash family are pairwise independent.
PROOF. We first show that P(q_1(x)h_1(a_1) + q_2(x)h_1(a_2) + . . . + q_n(x)h_1(a_n) = y mod n − 1 bits) = 1/2^{L−n+1} for any polynomials q_i(x) where at least one is different from zero. It is true when there is a single non-zero polynomial q_i(x) by Corollary 1. Suppose it is true up to k − 1 non-zero polynomials and consider a case where we have k non-zero polynomials. Assuming without loss of generality that q_1(x) ≠ 0, we have P(q_1(x)h_1(a_1) + q_2(x)h_1(a_2) + . . . + q_n(x)h_1(a_n) = y mod n − 1 bits) = P(q_1(x)h_1(a_1) = y − q_2(x)h_1(a_2) − . . . − q_n(x)h_1(a_n) mod n − 1 bits) = ∑_{y′} P(q_1(x)h_1(a_1) = y − y′ mod n − 1 bits) P(q_2(x)h_1(a_2) + . . . + q_n(x)h_1(a_n) = y′ mod n − 1 bits) = ∑_{y′} (1/2^{L−n+1})(1/2^{L−n+1}) = 1/2^{L−n+1} by the induction argument, where the sum is over the 2^{L−n+1} values of y′. Hence the uniformity result is shown.
Consider two distinct sequences a_1, a_2, . . . , a_n and a′_1, a′_2, . . . , a′_n. Write H_a = h(a_1, a_2, . . . , a_n) and H_{a′} = h(a′_1, a′_2, . . . , a′_n). To prove pairwise independence, it suffices to show that P(H_a = y mod n − 1 bits | H_{a′} = y′ mod n − 1 bits) = 1/2^{L−n+1}. Suppose that a_i = a′_j for some i, j; if not, the result follows by the (full) independence of the hashing function h_1. Using Lemma 3, find q(x) such that q(x) ∑_{k | a′_k = a′_j} x^{n−k} = − ∑_{k | a_k = a_i} x^{n−k} mod n − 1 bits; then H_a + q(x)H_{a′} mod n − 1 bits does not depend on the shared value h_1(a_i) = h_1(a′_j).
The hashed values h_1(a_k) for a_k ≠ a_i and h_1(a′_k) for a′_k ≠ a′_j are now relabelled as h_1(b_1), . . . , h_1(b_m). Write H_a + q(x)H_{a′} = ∑_k q_k(x)h_1(b_k) mod n − 1 bits, where the q_k(x) are polynomials in GF(2)[x]/(x^L + 1) (not all q_k(x) are zero). As in the proof of Lemma 2, we have that H_{a′} = y′ mod n − 1 bits and H_a + q(x)H_{a′} = y + q(x)y′ mod n − 1 bits are independent: P(H_{a′} = y′ mod n − 1 bits | y′, b_1, b_2, . . . , b_m) = 1/2^{L−n+1} by Corollary 1, since H_{a′} = y′ can be written as r(x)h_1(a′_j) = y′ − ∑_k r_k(x)h_1(b_k) for some polynomials r(x), r_1(x), . . . , r_m(x). (We use the shorthand notation P(f(x, y) = c | x, y) = b to mean P(f(x, y) = c | x = z_1, y = z_2) = b for all values of z_1, z_2.) Hence, we have
P(H_a = y mod n − 1 bits | H_{a′} = y′ mod n − 1 bits)
= P(H_a + q(x)H_{a′} = y + q(x)y′ mod n − 1 bits | H_{a′} = y′ mod n − 1 bits)
= P(H_a + q(x)H_{a′} = y + q(x)y′ mod n − 1 bits)
= P(∑_k q_k(x)h_1(b_k) = y + q(x)y′ mod n − 1 bits),
and by the earlier uniformity result, this last probability is equal to 1/2^{L−n+1}.
However, Theorem 5.1 may be pessimistic: the hash values of some n-grams may be more uniform and independent. The next lemma and proposition show that, as long as L and n are coprime (i.e., gcd(L, n) = 1), it suffices to drop or ignore one bit of the CYCLIC hash for at least one n-gram (a^n) to be hashed uniformly.
LEMMA 4. If n and L are coprime, then (x^n + 1)w = 0 has only the following solutions in GF(2)[x]/(x^L + 1): w = 0 and w = x^{L−1} + . . . + x^2 + x + 1.
PROOF. Clearly, w = 0 and w = x^{L−1} + . . . + x^2 + x + 1 are always solutions. The proof that these are the only solutions proceeds in three stages: first, we show that if w solves (x^n + 1)w = 0 mod x^L + 1, then w also solves (x^{kn} + 1)w = 0 mod x^L + 1, for all k. Second, we show that this implies (x^i + 1)w = 0 mod x^L + 1, for all i. Finally, we show that this implies all coefficients in w are equal, a property of only w = 0 and w = x^{L−1} + . . . + x^2 + x + 1.
For the first stage, we want to prove that (x^n + 1)w = 0 mod x^L + 1 implies (x^{kn} + 1)w = 0 mod x^L + 1 for any k = 1, 2, . . . by induction. To begin, we have that (x^n + 1)^2 = 1 + x^{2n} over GF(2). Suppose we have (x^{kn} + 1)w = 0 mod x^L + 1 for k = 1, 2, . . . , K − 1; we need that (x^{Kn} + 1)w = 0 mod x^L + 1. We have (x^{(K−1)n} + 1)w = 0 mod x^L + 1 ⇒ (x^n + 1)(x^{(K−1)n} + 1)w = 0 mod x^L + 1 ⇒ (x^{Kn} + 1)w + (x^{(K−1)n} + x^n)w = 0 mod x^L + 1 ⇒ (x^{Kn} + 1)w + x^n(x^{(K−2)n} + 1)w = 0 mod x^L + 1 ⇒ (x^{Kn} + 1)w = 0 mod x^L + 1 by the induction argument, showing the result.
For the next stage, note that (x^{kn} + 1)w mod x^L + 1 = (x^{kn mod L} + 1)w mod x^L + 1. Suppose k_1 n = k_2 n mod L. Then (k_1 − k_2)n = 0 mod L, and so L divides k_1 − k_2 because n and L are coprime. Hence, the sequence of L elements 0, n, 2n, 3n, . . . , (L − 1)n visits every residue modulo L, in particular every value in {1, 2, 3, . . . , L − 1}. Hence, (x^i + 1)w = 0 mod x^L + 1 for all values of i ∈ {1, 2, 3, . . . , L − 1}.
Finally, inspecting (x^i + 1)w = 0 mod x^L + 1 reveals that any two bits of w (the coefficients of any two of the L terms) must sum to zero in GF(2); that is, they must be equal. This proves that 0 and x^{L−1} + . . . + x^2 + x + 1 are the only possible solutions.
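As an illustration (our own sanity check, with hypothetical small parameters), the lemma can be verified by brute force: in GF(2)[x]/(x^L + 1), multiplying by x^n rotates an L-bit word by n positions, so (x^n + 1)w = 0 reads rot_n(w) = w.

#include <cstdint>
#include <cstdio>

int main() {
  const int L = 5, n = 3;                    // hypothetical; gcd(L, n) = 1
  const uint32_t mask = (1u << L) - 1;
  for (uint32_t w = 0; w <= mask; ++w) {
    uint32_t rotated = ((w << n) | (w >> (L - n))) & mask;  // x^n * w
    if (rotated == w)                        // (x^n + 1)w = 0 in GF(2)
      std::printf("solution: 0x%x\n", w);    // prints only 0x0 and 0x1f
  }
  return 0;
}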
PROPOSITION 3. Suppose a randomly chosen member of the L-bit CYCLIC hash family is applied to the n-gram a^n. Then
—If L and n are coprime and n is odd, then all L bits are uniformly distributed.
—If L and n are coprime, any L − 1 bits of the L-bit hash are uniformly distributed.
—If gcd(L, n) > 2, we do not have a uniform distribution of any L − 1 bits.
—When gcd(L, n) = 2, then we have a uniform distribution of any L − 1 bits only when n/2 is odd.
PROOF. We have that h(a_1 a_2 . . . a_n) = x^{n−1}h_1(a_1) + . . . + x h_1(a_{n−1}) + h_1(a_n), and so h(a, a, . . . , a) = (x^{n−1} + . . . + x + 1)h_1(a). Let q(x) = x^{n−1} + . . . + x + 1.
We begin by showing that uniformity would follow if q(x)w = 0 mod x^L + 1 had a unique solution. There are 2^L possible values of h_1(a) and 2^L possible values of h(a, a, . . . , a) = q(x)h_1(a). If q(x)w = z mod x^L + 1 has no solution w for some z, then there must be y such that q(x)w = y mod x^L + 1 has k > 1 solutions w_1, w_2, . . . , w_k. Then q(x)w = 0 mod x^L + 1 has at least k solutions: q(x)0 = 0, q(x)(w_1 − w_2) = 0 mod x^L + 1, q(x)(w_1 − w_3) = 0 mod x^L + 1, . . . , q(x)(w_1 − w_k) = 0 mod x^L + 1. Therefore, if q(x)w = 0 mod x^L + 1 has a unique solution (w = 0), then q(x)w = y mod x^L + 1 has a unique solution for any y and we have uniformity.
Our initial cases have n and L being coprime. Since q(x) = x^{n−1} + . . . + x + 1, we have (x + 1)q(x) = x^n + 1 mod x^L + 1, and so q(x)w = 0 mod x^L + 1 ⇒ (x + 1)q(x)w = 0 mod x^L + 1 ⇒ (x^n + 1)w = 0 mod x^L + 1. Therefore, by Lemma 4:
—when n is odd, w = 0 is the only solution to q(x)w = 0 mod x^L + 1, and thus we have uniformity of all L bits, including any L − 1 bits;
—when n is even, w = 0 and w = x^{L−1} + . . . + x + 1 are the only solutions to q(x)w = 0 mod x^L + 1. We wish to show that the equation q(x)w = y mod x^L + 1 modulo the rth bit has exactly two solutions in GF(2)[x]/(x^L + 1) for any y. This would then imply that the probability that any L − 1 bits of h(a, a, . . . , a) take some value is 2/2^L = 1/2^{L−1}, thus proving uniformity of any L − 1 bits.
Consider q(x)w = 0 mod x^L + 1 modulo the rth bit: if q(x)w ≠ 0 mod x^L + 1, then q(x)w = x^r mod x^L + 1. In such a case, q(x) has an inverse (wx^{−r}), and q(x)w = 0 mod x^L + 1 has only one solution, a contradiction. Therefore, q(x)w = 0 mod x^L + 1 modulo the rth bit implies q(x)w = 0 mod x^L + 1, which implies that w = 0 or w = x^{L−1} + . . . + x + 1. Therefore, if q(x)w_1 = q(x)w_2 mod x^L + 1 modulo the rth bit, then w_1 and w_2 differ by either 0 or x^{L−1} + . . . + x + 1. Thus, for any y there are at most two solutions for q(x)w = y mod x^L + 1 modulo the rth bit. Further, if w is a solution to q(x)w = y mod x^L + 1 modulo the rth bit, then so is w + (x^{L−1} + . . . + x + 1): solutions come in pairs. There are 2^{L−1} solution pairs and at most one solution pair for each of the 2^{L−1} values y ∈ GF(2)[x]/(x^L + 1) modulo the rth bit. A pigeonhole argument shows exactly two solutions per y.
Next, our second cases have L and n not being coprime; let k = gcd(L, n). Since k divides n, we have x^{n−1} + . . . + x + 1 = (x^{k−1} + . . . + x + 1)(x^{n−k} + . . . + x^{2k} + x^k + 1), which can be verified by inspection. Since k divides L, we similarly have x^{L−1} + . . . + x + 1 = (x^{k−1} + . . . + x + 1)(x^{L−k} + . . . + x^{2k} + x^k + 1). Since (x^{k−1} + . . . + x + 1)(x + 1)(x^{L−k} + . . . + x^{2k} + x^k + 1) = (x + 1)(x^{L−1} + . . . + x + 1) = 0 mod x^L + 1, w = (x + 1)(x^{L−k} + . . . + x^{2k} + x^k + 1) is a solution of q(x)w = 0 mod x^L + 1. For k > 2, w = (x + 1)(x^{L−k} + . . . + x^{2k} + x^k + 1) is distinct from 0 and x^{L−1} + . . . + x + 1. Thus, as long as k is larger than two, we have a solution w distinct from 0 and x^{L−1} + . . . + x + 1, preventing any L − 1 bits from being uniform. Indeed, the probability that any L − 1 bits of h(a, a, . . . , a) will be zero is at least 3/2^L > 1/2^{L−1}.
For k = 2, let w = ∑_{i=0,...,L−1} b_i x^i be a solution of q(x)w = 0 mod x^L + 1. We have x^{n−1} + . . . + x + 1 = (x + 1)(x^{n−2} + . . . + x^4 + x^2 + 1). By Lemma 4, we have (x^{n−1} + . . . + x + 1)w = 0 mod x^L + 1 if and only if (x^{n−2} + . . . + x^4 + x^2 + 1)w = 0 or (x^{n−2} + . . . + x^4 + x^2 + 1)w = x^{L−1} + . . . + x^2 + x + 1. Using these two cases, we will show that there are exactly two solutions when n/2 is odd and more when n/2 is even.
We are going to split the problem into odd and even terms. Define w_o = ∑_{i=1,3,...,L−1} b_i x^{(i−1)/2} and w_e = ∑_{i=0,2,...,L−2} b_i x^{i/2}. Then (x^{n−2} + . . . + x^4 + x^2 + 1)w = 0 mod x^L + 1 implies (x^{n/2−1} + . . . + x^2 + x + 1)w_o = 0 mod x^{L/2} + 1 and (x^{n/2−1} + . . . + x^2 + x + 1)w_e = 0 mod x^{L/2} + 1, whereas (x^{n−2} + . . . + x^4 + x^2 + 1)w = x^{L−1} + . . . + x + 1 mod x^L + 1 implies (x^{n/2−1} + . . . + x^2 + x + 1)w_o = x^{L/2−1} + . . . + x + 1 mod x^{L/2} + 1 and (x^{n/2−1} + . . . + x^2 + x + 1)w_e = x^{L/2−1} + . . . + x + 1 mod x^{L/2} + 1. Notice that n/2 and L/2 are necessarily coprime, so we can use the results derived above.
—If n/2 is odd, then (x^{n−2} + . . . + x^4 + x^2 + 1)w = 0 mod x^L + 1 implies w_o = 0 and w_e = 0, and hence w = 0. Using a symmetric argument, (x^{n/2−1} + . . . + x^2 + x + 1)w_o = x^{L/2−1} + . . . + x + 1 mod x^{L/2} + 1 and (x^{n/2−1} + . . . + x^2 + x + 1)w_e = x^{L/2−1} + . . . + x + 1 mod x^{L/2} + 1 imply w_o = w_e = x^{L/2−1} + . . . + x^2 + x + 1. Therefore, when n/2 is odd, only w = 0 and w = x^{L−1} + . . . + x + 1 are possible solutions.
—When n/2 is even, (x^{n−2} + . . . + x^4 + x^2 + 1)w = 0 mod x^L + 1 has two solutions in w_o (w_o = 0 and w_o = x^{L/2−1} + . . . + x^2 + x + 1) and likewise two solutions in w_e (w_e = 0 and w_e = x^{L/2−1} + . . . + x^2 + x + 1). Hence, q(x)w = 0 mod x^L + 1 has at least 4 solutions: w = 0, w = x^{L−2} + . . . + x^4 + x^2 + 1, w = x^{L−1} + . . . + x^3 + x, and w = x^{L−1} + . . . + x + 1.
This concludes the proof.
5.3 Integer-Division hashing (ID37) is not uniform
A variation of the Karp-Rabin hash method is "Hashing by Power-of-2 Integer Division" [Cohen 1997], where h(x_1, . . . , x_n) = ∑_i x_i B^{i−1} mod 2^L. Parameter B needs to be chosen carefully, so that the sequence B^k mod 2^L for k = 1, 2, . . . does not repeat quickly. In particular, the hashcode method of the Java String class uses this approach, with L = 32 and B = 31 [Sun Microsystems 2004]. Note that B is much smaller than the number of Unicode characters (about 99,000 > 2^16) [The Unicode Consortium 2006]. A widely used textbook [Weiss 1999, p. 157] recommends a similar Integer-Division hash function for strings with B = 37. Since such Integer-Division hash functions are recursive, quickly computed, and widely used, it is interesting to seek a randomized version of them. Assume that h_1 is a random hash function over symbols, uniform in [0, 2^L); then define h(x_1, . . . , x_n) = h_1(x_1) + Bh_1(x_2) + B^2 h_1(x_3) + . . . + B^{n−1}h_1(x_n) mod 2^L for some fixed integer B. We choose B = 37 (calling the resulting randomized hash "ID37"; see Algorithm 4).
Observe that ID37 is recursive over h_1. Moreover, by letting h_1 map symbols over a wide range, we intuitively can reduce the undesirable dependence between n-grams sharing a common suffix. However, we fail to achieve uniformity.
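Algorithm 4 is not reproduced here; as a minimal sketch, the recursion can be implemented as follows, weighting the oldest symbol by the highest power of B (an equivalent reindexing of the definition above). The names and the choice L = 32 are ours; unsigned overflow supplies the modulo.

#include <cstdint>
#include <random>
#include <vector>

// Randomized Integer-Division hashing (ID37-style) with L = 32:
// H = B^{n-1} h1(x_1) + ... + B h1(x_{n-1}) + h1(x_n) mod 2^32.
class ID37Hash {
  static const uint32_t B = 37;
  uint32_t Bn = 1;                          // B^n mod 2^32
  std::vector<uint32_t> h1;                 // random hash of each symbol
  uint32_t H = 0;
public:
  ID37Hash(int n, int alphabetSize) : h1(alphabetSize) {
    std::mt19937 gen(42);
    for (uint32_t &v : h1) v = gen();
    for (int i = 0; i < n; ++i) Bn *= B;    // unsigned overflow = mod 2^32
  }
  uint32_t eat(int in) {                    // extend the window by one symbol
    return H = B * H + h1[in];
  }
  uint32_t update(int out, int in) {        // slide: drop symbol out, add in
    return H = B * H - Bn * h1[out] + h1[in];
  }
};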
The independence problems of ID37 are shared by all such randomized Integer-Division hash functions that map n-grams to [0, 2^L). However, the problems are more severe for certain combinations of B and n.
PROPOSITION 4. Randomized Integer-Division (2^L) hashing with B odd is not uniform for n-grams, if n is even. Otherwise, it is uniform, but not pairwise independent.
PROOF. We see that P(h(a^{2k}) = 0) > 2^{−L}, since h(a^{2k}) = h_1(a)(B^0(1 + B) + B^2(1 + B) + . . . + B^{2k−2}(1 + B)) mod 2^L, and since (1 + B) is even, we have P(h(a^{2k}) = 0) ≥ P(h_1(a) = 2^{L−1} ∨ h_1(a) = 0) = 1/2^{L−1}.
For the rest of the result, we begin with n = 2 and B even. If x_1 ≠ x_2, then P(h(x_1, x_2) = y) = P(Bh_1(x_1) + h_1(x_2) = y mod 2^L) = ∑_z P(h_1(x_2) = y − Bz mod 2^L)P(h_1(x_1) = z) = ∑_z P(h_1(x_2) = y − Bz mod 2^L)/2^L = 1/2^L, whereas P(h(x_1, x_1) = y) = P((B + 1)h_1(x_1) = y mod 2^L) = 1/2^L, since (B + 1)x = y mod 2^L has a unique solution x when B is even. Therefore h is uniform. This argument can be extended to any value of n when B is even, and to odd n when B is odd.
To show h is not pairwise independent, first suppose that B is odd. For any string β of length n − 2, consider the n-grams w_1 = βaa and w_2 = βbb for distinct a, b ∈ Σ. Then P(h(w_1) = h(w_2)) = P(B^2 h(β) + Bh_1(a) + h_1(a) = B^2 h(β) + Bh_1(b) + h_1(b) mod 2^L) = P((1 + B)(h_1(a) − h_1(b)) mod 2^L = 0) ≥ P(h_1(a) − h_1(b) = 0) + P(h_1(a) − h_1(b) = 2^{L−1}) = 2/2^L, twice the collision probability of 1/2^L that pairwise independence would imply. Second, if B is even, a similar argument shows P(h(w_3) = h(w_4)) ≥ 2/2^L, where w_3 = βaa and w_4 = βba: P(h(a, a) = h(b, a)) = P(Bh_1(a) + h_1(a) = Bh_1(b) + h_1(a) mod 2^L) = P(B(h_1(a) − h_1(b)) mod 2^L = 0) ≥ P(h_1(a) − h_1(b) = 0) + P(h_1(a) − h_1(b) = 2^{L−1}) = 2/2^L. This argument can be extended to any value of B and n.
These results also hold for any Integer-Division hash where the modulo is by an even number, not necessarily a power of 2. Frequently, such hashes compute their result modulo a prime. However, even if this gave uniformity, the GT algorithm implicitly applies a modulo 2^L because it ignores higher-order bits. It is easy to observe that if h(x) is uniform over [0, p), with p prime, then h′(x) = h(x) mod 2^L cannot be uniform: since 2^L does not divide p, some residues have ⌈p/2^L⌉ preimages while others have only ⌊p/2^L⌋. Whether the lack of uniformity and pairwise independence is just a theoretical defect can be addressed experimentally.
Fig. 1. For reliability 1 − δ = 0.95, we plot the memory usage M versus the accuracy ε for pairwise (p = 2) and
4-wise (p = 4) independent hash functions as per the bound of Proposition 5. Added independence substantially
improves memory requirements for a fixed estimation accuracy according to our theoretical bounds.
By a Chernoff-bound argument over q independent runs, if Y counts the runs whose estimate falls outside the tolerated error, then P(Y > q/2) ≤ e^{−q/30}. Choosing q = 30 ln(1/δ), we have P(Y > q/2) ≤ δ, proving that we can make the median of the X_i's within εµ of µ, 1 − δ of the time, for δ arbitrarily small. On this basis, Bar-Yossef et al. [Bar-Yossef et al. 2002] report that they can estimate a count with relative precision ε and reliability 1 − δ (i.e., the computed count is within εµ of the true answer µ, 1 − δ of the time), using Õ((1/ε² + log m) log(1/δ)) bits of memory and Õ((log m + 1/ε) log(1/δ)) amortized time. Unfortunately, in practice, repeated probing implies rehashing all n-grams 30 ln(1/δ) times, a critical bottleneck. Moreover, in a streaming context, the various runs are made in parallel, and therefore 30 ln(1/δ) different buffers are needed. Whether this is problematic depends on the application and the size of each buffer.
We are able to generalize previous results [Gibbons and Tirthapura 2001; Bar-Yossef
et al. 2002] from pairwise independence to p-wise independence by using the following
generalization of Chebyshev’s inequality due to Schmidt et al. [Schmidt et al. 1993, Theo-
rem 2.4, Equation III].
THEOREM 6.1. Let X_1, . . . , X_m be a sequence of p-wise independent random variables that satisfy |X_i − E(X_i)| ≤ 1. Let X = ∑_i X_i; then, for C = max(p, σ²(X)), we have
P(|X − E(X)| ≥ T) ≤ (pC / (e^{2/3}T²))^{p/2}.
In particular, when p = 2, we have
P(|X − E(X)| ≥ T) ≤ 2C / (e^{2/3}T²).
The next proposition shows that, in order to reduce the memory usage drastically while preserving the same theoretical bounds on the accuracy and reliability, we can increase the independence of the hash functions. In particular, our theory says that we can estimate the count within 10%, 19 times out of 20, by storing respectively 10.5, 2.5, and 2 thousand n-grams, depending on whether we have pairwise independent, 4-wise independent, or 8-wise independent hash values. Hence, there is no need to hash the n-grams more than once if we can assume that hash values are ≈ 4-wise independent in practice (see Fig. 1). Naturally, recursive hashing cannot be more than pairwise independent, but we always have the option of using slower non-recursive hashing. The following proposition is stated for n-grams, but applies to arbitrary items.
PROPOSITION 5. Hashing each n-gram only once, we can estimate the number of distinct n-grams within relative precision ε, with a p-wise independent hash for p ≥ 2, by storing M distinct n-grams (M ≥ 8p) and with reliability 1 − δ, where δ is given by
δ = (p^{p/2} / (e^{p/3} M^{p/2})) (2^{p/2}/ε^p + 8^{p/2}/(2^{p/2} − 1)).
PROOF. We generalize a proof by Bar-Yossef et al. [Bar-Yossef et al. 2002]. Let X_t be the number of distinct elements having only zeros in the first t bits of their hash value. Then X_0 is the number of distinct elements (X_0 = m). For j = 1, . . . , m, let X_{t,j} be the binary random variable with value 1 if the jth distinct element has only zeros in the first t bits of its hash value, and zero otherwise. We have that X_t = ∑_{j=1}^m X_{t,j} and so E(X_t) = ∑_{j=1}^m E(X_{t,j}). Since the hash function is uniform, P(X_{t,j} = 1) = 1/2^t and so E(X_{t,j}) = 1/2^t; hence, E(X_t) = m/2^t, that is, 2^t E(X_t) = m. Therefore X_t can be used to determine m.
Using pairwise independence, we can show that σ²(X_t) ≤ m/2^t because
E((X_t − m/2^t)²) = E((∑_{j=1}^m (X_{t,j} − 1/2^t))²) = σ²(∑_{j=1}^m X_{t,j})
= ∑_{j=1}^m σ²(X_{t,j}) (by 2-wise independence)
= ∑_{j=1}^m ((1/2^t)(1 − 1/2^t)² + (1 − 1/2^t)(0 − 1/2^t)²)
≤ ∑_{j=1}^m 1/2^t = m/2^t.
Let t′ be the value of t when the scan terminates, so that the algorithm returns 2^{t′}X_{t′}. Then
P(|2^{t′}X_{t′} − m| ≥ εm) = ∑_{t=0,...,L} P(|X_t − m/2^t| ≥ εm/2^t) P(t′ = t)
= ∑_{t=0,...,L} P(|X_t − m/2^t| ≥ εm/2^t) P(X_{t−1} > M, X_t ≤ M).
Split the sum at an arbitrary threshold t̄, bounding P(t′ = t) by 1 in the first part and P(|X_t − m/2^t| ≥ εm/2^t) by 1 in the second:
P(|2^{t′}X_{t′} − m| ≥ εm)
≤ ∑_{t=0}^{t̄−1} P(|X_t − m/2^t| ≥ εm/2^t) + ∑_{t=t̄,...,L} P(X_{t−1} > M, X_t ≤ M)
= P(X_{t̄−1} > M) + ∑_{t=0}^{t̄−1} P(|X_t − m/2^t| ≥ εm/2^t)
≤ P(|X_{t̄−1} − m/2^{t̄−1}| > M − m/2^{t̄−1}) + ∑_{t=0}^{t̄−1} p^{p/2}2^{tp/2} / (m^{p/2}ε^p e^{p/3})
≤ (pm / (2^{t̄−1} e^{2/3}(M − m/2^{t̄−1})²))^{p/2} + ∑_{t=0}^{t̄−1} p^{p/2}2^{tp/2} / (m^{p/2}ε^p e^{p/3}),
applying Theorem 6.1 twice. The stated value of δ then follows by choosing the threshold t̄ appropriately.
7. EXPERIMENTAL RESULTS
Experiments are used to assess the accuracy of estimates obtained by several hash functions
(n-wise independent, CYCLIC, GENERAL, and ID37; see Table II) on some input streams using the GT count-estimation algorithm. All hash values are in [0, 2^19).
Since we are mostly interested in comparing n-gram hash families, we chose not to com-
pare variants of GT [Bar-Yossef et al. 2002] or other possible count-estimation algorithms
such as probabilistic counting [Flajolet and Martin 1985]. For an experimental comparison
of several such count-estimation algorithms in the context of multidimensional hashing,
see [Aouiche and Lemire 2007].
The experiments demonstrate that the techniques can be efficiently implemented. Details
are available in a technical report [Lemire and Kaser 2006] together with experimental
results on entropy and frequent-item-count estimations. Our code is written in C++ and is
available upon request.
The 11 test texts are eduha10 (The Education of Henry Adams), utrkj10 (Unbeaten Tracks in Japan), utopi10 (Utopia), remus10 (Uncle Remus His Songs and His Sayings), btwoe10 (Barchester Towers), 00ws110 (Shakespeare's First Folio), hcath10 (History of the Catholic Church), rlchn10 (Religions of Ancient China), esymn10 (Essay on Man), hioaj10 (Impeachment of Andrew Johnson), and wflsh10 (The Way of All Flesh).
Table III. Maximum error rates ε, 19 times out of 20, for various amounts of memory (M) and for p-wise independent hash values, according to Proposition 5.
M       256     1,024   2,048   65,536  262,144 1,048,576
p = 2   86.4%   36.8%   24.7%   3.8%    1.8%    0.9%
p = 4   34.9%   16.1%   11.1%   1.8%    0.9%    0.5%
p = 8   30.0%   14.1%   9.7%    1.6%    0.8%    0.4%
Fig. 2. Count estimate errors over Shakespeare’s First Folio (00ws110.txt), 100 runs estimating 10-grams with
M = 2048.
The "data-agnostic" estimate from Sect. 3 is hopelessly inaccurate: it predicts 4.4 million 5-grams for Shakespeare's First Folio, but the actual number is 13 times smaller.
[Figure: count estimate errors ε by trial rank for ID37, general polynomial, cyclic polynomial, and n-wise independent hashing.]
Fig. 4. For each M and hash function, worst-case 95th -percentile error observed on the 11 test inputs.
Thus far, we had not observed notable differences between the hash functions. Therefore, although we expect typical values of M to be a few hundred to a few thousand, we broaden the range of M examined. Although the theoretical guarantees for tiny M are poor, perhaps typical results will be usable. And even a single buffer with M = 2^20 is inconsequential when a desktop computer has several gibibytes of RAM, and the construction of a hash table or B-tree with such a value of M is still quite affordable. Moreover, with a wider range of M, we start to see differences between some hash functions.
We choose M = 16, 16², 16³ and 16⁴ and analyze the 5-grams in the text 00ws1 (Shakespeare's First Folio). There are approximately 300,000 5-grams, and we selected a larger file because when M = 16⁴, it seems unhelpful to estimate the number of 5-grams unless the number of distinct 5-grams substantially exceeds M.
Fig. 5. 95th -percentile error values (for 5-gram estimates) on 00ws1 for various hash families, over a wide range
of M. Our analysis does not permit prediction of error bounds when M = 16.
Fig. 6. 95th -percentile error for 5-gram estimates on Zipfian data (s=1).
We compared several pseudorandom number generators: the Linux rand() function, the Mersenne Twister (MT) [Matsumoto and Nishimura 1998], and the Marsaglia-Zaman-James (MZJ) generator [Marsaglia and Zaman 1987; James 1990; Bourke 1998]. We also tried using a collection of bytes generated from a random physical process (radio static) [Haahr 1998].
For M = 4096, the 95th-percentile error for text 00ws1 was 4.7% for Linux rand(), 4.3% for MT, and 4.1% for MZJ. These three pseudorandom number generators were no match for truly random numbers, where the 95th-percentile error was only 2.9%. Comparing this final number to Fig. 5, we see that fully independent hashing is only a modest improvement on Cohen's hash functions (which fare better than 5%), despite its stronger theoretical guarantee.
The other hash functions also rely on random-number generation (for h_1 in Cohen's hashes and ID37; for h_1 . . . h_n in the n-wise independent hash). It would be problematic if their performance were heavily affected by the precise random-number generation process. However, when we examined the 95th-percentile errors, we did not observe any appreciable differences from varying the pseudorandom-number generation process or using truly random numbers.
Fig. 7. 95th -percentile error for 5-gram estimates on Zipfian data (s=1.6).
Fig. 8. 95th -percentile error for 5-gram estimates on Zipfian data (s=2).
Note that Table III tells us only that we can expect an error of at most 86%, 95% of the time. This is not a very strong guarantee! Here, we observe 95th-percentile errors of about 25% for CYCLIC. The point is not that we have observed worse-than-pairwise-independent behavior here (we have not); it is that when forced to use too many bits, CYCLIC can become clearly worse than GENERAL.
Fig. 9. Count estimation errors using GT with C YCLIC hashing on 5 random binary sequences, each containing
m = 2000 distinct 13-grams; buffer size (M) was 256. We generated 100 estimates for each binary sequence, and
the relative errors of these estimates are displayed against their rank. Results are plotted for L = 7 and L = 19.
Note that there are 12 points (all for L = 7) off the scale, one at ε = 0.32, six at ε = 1.0, and five at ε = 15.5.
We repeated the experiment with GENERAL, for L = 7 and L = 19. The two differed little, both being nearly the same as obtained with CYCLIC with L = 19. (Actually, CYCLIC with L = 19 was marginally better than GENERAL!)
Finally, we discuss results on more realistic non-binary Zipfian data. For each run, we made a single random choice of h_1. Results, for more than 2000 runs, are shown in Table IV. The table shows ε values (in percent), with boldfacing indicating the cases where one technique had a lower error than the other. It shows a slightly better performance from CYCLIC. Means also slightly favour CYCLIC. This is consistent with the experimental results reported by Cohen [Cohen 1997]. We also ran more extensive tests using an English text (00ws1.txt), where there seems to be no notable distinction; 10,000 test runs were made, and results are shown in Table V.
Overall, we concede there might be a small accuracy improvement from using CYCLIC, provided that L − n + 1 is somewhat larger than log₂(m/M) + 1, the number of bits that we anticipate GT using. However, this effect is never large, and the accuracy penalty from using too many bits can be large. Whether CYCLIC is viable, if used carefully, depends on whether its relative speed advantage is lost after n − 1 bits are discarded.
Table V. Count estimation error (%) using GT with polynomial hashes CYCLIC and GENERAL, data set 00ws1, n = 5.
            CYCLIC            GENERAL
percentile  M=64    M=1024    M=64    M=1024
25          5.79    1.30      5.79    1.30
50          10.7    2.58      10.7    2.69
75          18.2    4.55      19.0    4.55
95          30.6    7.69      30.6    7.69
mean        12.5    3.15      12.6    3.14
7.8 Speed
Speeds were measured on a Dell PowerEdge 6400 multiprocessor server (with four Pentium III Xeon 700 MHz processors having 2 MiB cache each, sharing 2 GiB of 133 MHz RAM). The OS kernel was Linux 2.4.20, and the GNU C++ compiler version 3.2.2 was used with compiler flags -O2 -march=i686 -fexceptions. The STL map class was used to construct look-up tables.
Only one processor was used, and the data set consisted of all the plain-text files on the Project Gutenberg CD, concatenated into a single disk file containing over 400 MiB and approximately 116 million 10-grams. For comparison, this file was too large to process with the Sary suffix-array package [Takabayashi 2005] (version 1.2.0), since the array would have exceeded 2 GiB. However, the first 200 MB was successfully processed by Sary, which took 1886 s to build the suffix array. (Various command-line options were attempted; we report the fastest time achieved.) The SUFARY [Yamashita 2005] (version 2.3.8) package is said to be faster than Sary [Takabayashi 2005]. It processed the 200 MB file in 2640 s and then required 95 s to (exactly) compute the number of 5-grams with more than 100,000 occurrences. Pipelined suffix-array implementations reportedly can process inputs as large as 4 GB in hours [Dementiev et al. 2005].
From Table VI, we see that n-gram estimation can be efficiently implemented. First, comparing results for M = 2^20 to those for M = 2^10, we see that using a larger table costs roughly 140 s in every case. This increase is small considering that M was multiplied by 2^10, and it is consistent with the fact that the computational cost is dominated by the hashing. Comparing different hashes, using a 10-wise independent hash was about twice as slow as using a recursive hash. Hashing with ID37 was 15–25% faster than using Cohen's approaches.
Assuming that we are willing to allocate very large files to create suffix arrays and to use much internal memory, an exact count is still at least 10 times more expensive than an approximation. Where the suffix-array approach would take about an hour to compute n-gram counts over the entire Gutenberg CD, an estimate can be available in about 6 minutes while using very little memory and no permanent storage.
8. CONCLUSION
Considering speed, theoretical guarantees, and actual results, we recommend Cohen's GENERAL. It is fast, has a theoretical performance guarantee, and behaves at least as well as either ID37 or the n-wise independent approach. GENERAL is pairwise independent, so there are meaningful theoretical bounds on its performance. Unlike with Cohen's CYCLIC, one can safely use all of its bits, and the minor speed advantage of CYCLIC does not seem worthwhile. The n-wise independent hashing comes with a stronger theoretical guarantee than either of Cohen's hashes, and thus there can be no unpleasant surprises with its accuracy on any data set; yet there is a significant speed penalty for its use in our implementation. The speed gain of ID37 is worthwhile only for very small values of M. Not only does it lack a theoretical accuracy guarantee, but for larger M it is observed to fall far behind the other hashing approaches in practice. Except where accuracy is far less important than speed, we cannot recommend ID37.
There are various avenues for follow-up work that we are pursuing. First, further improvements to the theoretical bound seem possible, especially for more than 4-wise independent hashes. Generally, the current theory fails to completely explain our experimental results: why do CYCLIC and GENERAL sometimes fare better than n-wise independent hashing at count estimation? Does increased independence improve the accuracy of the GT count-estimation algorithm? Second, each item being counted can have an occurrence count kept for it; this may enable entropy estimation as well as estimates of the number of distinct frequent n-grams [Lemire and Kaser 2006].
REFERENCES
A LON , N., M ATIAS , Y., AND S ZEGEDY, M. 1996. The space complexity of approximating the frequency
moments. In STOC ’96. 20–29.
AOUICHE , K. AND L EMIRE , D. 2007. Unassuming view-size estimation techniques in olap. In ICEIS’07.
BACON , F. L. AND H OUDE , D. J. 1984. Data compression apparatus and method. US Patent 4612532. filed
1984; granted 1986. Assignee Telebyte (later Telcor Systems).
BAR -YOSSEF, Z., JAYRAM , T. S., K UMAR , R., S IVAKUMAR , D., AND T REVISAN , L. 2002. Counting distinct
elements in a data stream. In RANDOM’02. 1–10.
BATU , T., DASGUPTA , S., K UMAR , R., AND RUBINFELD , R. 2002. The complexity of approximating entropy.
In STOC’02. ACM Press, New York, NY, USA, 678–687.
B ERNARD , M. 1995. À juste titre: A lexicometric approach to the study of titles. Literary and Linguistic
Computing 10, 2, 135–141.
B OSE , R. P. J. C. AND S RINIVASAN , S. H. 2005. Data mining approaches to software fault diagnosis. In RIDE
’05. IEEE Computer Society, Washington, DC, USA, 45–52.
B OURKE , P. 1998. Uniform random number generator. online: http://astronomy.swin.edu.au/∼pbourke/
other/random/index.html. checked 2007-05-30.
C ANNY, J. 2002. CS174 lecture notes. http://www.cs.berkeley.edu/∼jfc/cs174/lecs/lec10/lec10.
pdf. checked 2007-05-30.
C AROPRESO , M. F., M ATWIN , S., AND S EBASTIANI , F. 2001. Text Databases & Document Management:
Theory & Practice. Idea Group Publishing, Hershey, PA, USA, Chapter A Learner-Independent Evaluation of
the Usefulness of Statistical Phrases for Automated Text Categorization, 78–102.
C ARTER , L. AND W EGMAN , M. 1979. Universal classes of hash functions. Journal of Computer and System
Sciences 18, 2, 143–154.
C IACCIA , P., G OLFARELLI , M., AND R IZZI , S. 2001. On estimating the cardinality of aggregate views. In
DMDW. 12.1–12.10.
C IACCIA , P., G OLFARELLI , M., AND R IZZI , S. 2003. Bounding the cardinality of aggregate views through
domain-derived constraints. Data & Knowledge Engineering 45, 2, 131–153.
C ODD , E. 1993. Providing OLAP (on-line analytical processing) to user-analysis: an IT mandate. Tech. rep.,
E.F. Codd and Associates.
C OHEN , J. D. 1997. Recursive hashing functions for n-grams. ACM Trans. Inf. Syst. 15, 3, 291–320.
C OHEN , P., H EERINGA , B., AND A DAMS , N. 2002. Unsupervised segmentation of categorical time series into
episodes. In ICDM’02. 99–106.
D EHNE , F., E AVIS , T., AND R AU -C HAPLIN , A. 2006. The cgmCUBE project: Optimizing parallel data cube
generation for ROLAP. Distributed and Parallel Databases 19, 1, 29–62.
D EMENTIEV, R., M EHNERT, J., K ARKKAINEN , J., AND S ANDERS , P. 2005. Better external memory suffix
array construction. In ALENEX05: Workshop on Algorithm Engineering & Experiments.
D ESHPANDE , M. AND K ARYPIS , G. 2004. Selective Markov models for predicting web page accesses. ACM
Transactions on Internet Technology (TOIT) 4, 2, 163–184.
D ORAISAMY, S. AND R ÜGER , S. 2003. Position indexing of adjacent and concurrent n-grams for polyphonic
music retrieval. In ISMIR 2003. 227–228.
D ROETTBOOM , M. 2003. Correcting broken characters in the recognition of historical printed documents. In
Digital Libraries 2003. 364–366.
D URAND , M. AND F LAJOLET, P. 2003. Loglog counting of large cardinalities. In ESA’03. Springer.
FALOUTSOS , C., M ATIAS , Y., AND S ILBERSCHATZ , A. 1996. Modeling skewed distribution using multifractals
and the 80-20 law. In VLDB’96. 307–317.
F LAJOLET, P. AND M ARTIN , G. 1985. Probabilistic counting algorithms for data base applications. Journal of
Computer and System Sciences 31, 2, 182–209.
G AO , J. AND Z HANG , M. 2001. Improving language model size reduction using better pruning criteria. In
ACL’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association
for Computational Linguistics, Morristown, NJ, USA, 176–182.
G IBBONS , P. B. AND T IRTHAPURA , S. 2001. Estimating simple functions on the union of data streams. In
SPAA’01. 281–291.
G IEGERICH , R., K URTZ , S., AND S TOYE , J. 1999. Efficient implementation of lazy suffix trees. In WAE’99.
30–42.
G ONNET, G. H. AND BAEZA -YATES , R. A. 1990. An analysis of the Karp-Rabin string matching algorithm.
Information Processing Letters 34, 5, 271–274.
G RAY, J., B OSWORTH , A., L AYMAN , A., AND P IRAHESH , H. 1996. Data cube: A relational aggregation
operator generalizing group-by, cross-tab, and sub-total. In ICDE ’96. 152–159.
Guha, S., McGregor, A., and Venkatasubramanian, S. 2006. Streaming and sublinear approximation of entropy and information distances. In SODA’06. ACM Press, New York, NY, USA, 733–742.
Haahr, M. 1998. random.org — true random number service. online, http://www.random.org. checked 2007-05-30.
Haas, P., Naughton, J., Seshadri, S., and Stokes, L. 1995. Sampling-based estimation of the number of distinct values of an attribute. In VLDB’95. 311–322.
Hon, W., Sadakane, K., and Sung, W. 2003. Breaking a time-and-space barrier in constructing full-text indices. In FOCS’03. 251–260.
James, F. 1990. A review of pseudorandom number generators. Computer Physics Communications 60, 329–344.
Jelinek, F. 1998. Statistical methods for speech recognition. MIT Press, Cambridge, MA, USA.
Juola, P., Sofko, J., and Brennan, P. 2006. A prototype for authorship attribution studies. Literary and Linguistic Computing 21, 2, 169–178.
Karp, R. and Rabin, M. 1987. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31, 2, 249–260.
Kaser, O., Keith, S., and Lemire, D. 2006. The LitOLAP project: Data warehousing with literature. In CaSTA’06.
Kearns, M., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R. E., and Sellie, L. 1994. On the learnability of discrete distributions. In STOC’94. 273–282.
Keith, S., Kaser, O., and Lemire, D. 2005a. Analyzing large collections of electronic text using OLAP. In APICS 2005.
Keith, S., Kaser, O., and Lemire, D. 2005b. Analyzing large collections of electronic text using OLAP. Tech. Rep. TR-05-001, UNBSJ CSAS. June.
Keselj, V. and Cercone, N. 2004. CNG method with weighted voting. In ad-hoc Authorship Attribution Contest, P. Juola, Ed. AHC/ALLC.
Kim, M.-S., Whang, K.-Y., Lee, J.-G., and Lee, M.-J. 2005. n-Gram/2L: a space and time efficient two-level n-gram inverted index structure. In VLDB’05. VLDB Endowment, 325–336.
K IT, C. AND W ILKS , Y. 1998. The Virtual Corpus approach to deriving n-gram statistics from large scale
corpora. In Proceedings of 1998 International Conference on Chinese Information Processing. 223–229.
K NUTH , D. E. 1969. Seminumerical Algorithms. The Art of Computer Programming, vol. 2. Addison-Wesley.
Kolonko, M. and Wäsch, D. 2006. Sequential reservoir sampling with a nonuniform distribution. ACM Trans. Math. Softw. 32, 2, 257–273.
Kotidis, Y. 2002. Handbook of Massive Data Sets. Kluwer Academic Publishers, Norwell, MA, USA, Chapter Aggregate View Management in Data Warehouses, 711–741.
Lemire, D. and Kaser, O. 2006. One-pass, one-hash n-gram count estimation. Tech. Rep. TR-06-001, Dept. of CSAS, UNBSJ. available from http://arxiv.org/abs/cs.DB/0610010.
Li, K.-H. 1994. Reservoir-sampling algorithms of time complexity O(n(1 + log(N/n))). ACM Trans. Math. Softw. 20, 4, 481–493.
Lin, C.-Y. and Hovy, E. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In NAACL’03. Association for Computational Linguistics, Morristown, NJ, USA, 71–78.
Losiewicz, P., Oard, D. W., and Kostoff, R. N. 2000. Textual data mining to support science and technology management. J. Intell. Inf. Syst. 15, 2, 99–119.
Manber, U. and Myers, G. 1990. Suffix arrays: a new method for on-line string searches. In SODA ’90. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 319–327.
Manber, U. and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 5, 935–948.
Marsaglia, G. and Zaman, A. 1987. Toward a universal random number generator. Tech. Rep. FSU-SCRI-87-50, Florida State University.
Matsumoto, M. and Nishimura, T. 1998. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation 8, 1, 3–30.
McAllester, D. A. and Schapire, R. E. 2000. On the convergence rate of Good-Turing estimators. In COLT’00: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1–6.
McCabe, M. C., Lee, J., Chowdhury, A., Grossman, D., and Frieder, O. 2000. On the design and evaluation of a multi-dimensional approach to information retrieval. In SIGIR ’00. ACM Press, New York, NY, USA, 363–365.
Mothe, J., Chrisment, C., Dousset, B., and Alaux, J. 2003. DocCube: Multi-dimensional visualization and exploration of large document sets. Journal of the American Society for Information Science and Technology 54, 7, 650–659.
Nadeau, T. and Teorey, T. 2003. A Pareto model for OLAP view size estimation. Information Systems Frontiers 5, 2, 137–147.
Nagao, M. and Mori, S. 1994. A new method of n-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese. In COLING’94. 611–615.
Nie, J.-Y., Gao, J., Zhang, J., and Zhou, M. 2000. On the use of words and n-grams for Chinese information retrieval. In IRAL’00: Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages. ACM Press, New York, NY, USA, 141–148.
Niemi, T., Hirvonen, L., and Järvelin, K. 2003. Multidimensional data model and query language for informetrics. Journal of the American Society for Information Science and Technology 54, 10, 939–951.
Orlitsky, A., Santhanam, N., and Zhang, J. 2003. Always Good Turing: asymptotically optimal probability estimation. In FOCS’03. 179–188.
Paulus, J. and Klapuri, A. 2003. Conventional and periodic n-grams in the transcription of drum sequences. In ICME’03. 737–740.
Project Gutenberg Literary Archive Foundation. 2007. Project Gutenberg. http://www.gutenberg.org/. checked 2007-05-30.
Ruskey, F. 2006. The (combinatorial) object server. http://www.theory.cs.uvic.ca/~cos/cos.html. checked 2007-05-30.
Schmidt, J., Siegel, A., and Srinivasan, A. 1993. Chernoff-Hoeffding bounds for applications with limited independence. In SODA’93. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 331–340.
Shah, B., Ramachandran, K., and Raghavan, V. 2004. Storage estimation of multidimensional aggregates in a data warehouse environment. In Proceedings of the World Multi-Conference on Systemics, Cybernetics and Informatics.
Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, 623–656.
Shukla, A., Deshpande, P., Naughton, J., and Ramasamy, K. 1996. Storage estimation for multidimensional aggregates in the presence of hierarchies. In VLDB’96. 522–531.
Su, Z., Yang, Q., Lu, Y., and Zhang, H. 2000. WhatNext: a prediction system for web requests using n-gram sequence models. In Web Information Systems Engineering 2000. 214–221.
Sullivan, D. 2001. Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales. John Wiley & Sons.
Sun Microsystems. 2004. String (Java 2 Platform SE 5.0). online documentation: http://java.sun.com/j2se/1.5.0/docs/api/index.html.
Takabayashi, S. 2005. Sary: A suffix array library and tools. online: http://sary.sourceforge.net/. checked 2007-05-30.
The Unicode Consortium. 2006. Unicode home page. http://unicode.org/. checked 2007-05-30.
Vitter, J. S. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1, 37–57.
Weiss, M. 1999. Data Structures and Algorithm Analysis in Java. Addison Wesley.
Whang, K., Vander-Zanden, B., and Taylor, H. 1990. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems 15, 2, 208–229.
Yamashita, T. 2005. SUFARY. online: http://nais.to/~yto/tools/sufary. checked 2007-05-30.
Yannakoudakis, E. J., Tsomokos, I., and Hutton, P. J. 1990. n-Grams and their implication to natural language understanding. Pattern Recogn. 23, 5, 509–528.
Yu, X., Zuzarte, C., and Sevcik, K. C. 2005. Towards estimating the number of distinct value combinations for a set of attributes. In CIKM’05. ACM Press, New York, NY, USA, 656–663.