0% found this document useful (0 votes)

130 views35 pages

Daniel Lemire and Owen Kaser, Recursive Hashing and One-Pass, One-Hash N-Gram Count Estimation

Abstract: Many applications use sequences of n consecutive symbols (n-grams). We review n-gram hashing and prove that recursive hash families are pairwise independent at best. We prove that hashing by irreducible polynomials is pairwise independent whereas hashing by cyclic polynomials is quasi-pairwise independent: we make it pairwise independent by discarding n-1 bits. One application of hashing is to estimate the number of distinct n-grams, a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire a statistically unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass one-hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. For example, we can improve by a factor of 2 the theoretical bounds on estimation accuracy by replacing pairwise independent hashing by 4-wise independent hashing. We show that recursive random hashing is sufficiently independent in practice. Maybe surprisingly, our experiments showed that hashing by cyclic polynomials, which is only quasi-pairwise independent, sometimes outperformed 10-wise independent hashing while being twice as fast. For comparison, we measured the time to obtain exact n-gram counts using suffix arrays and show that, while we used hardly any storage, we were an order of magnitude faster. The experiments used a large collection of English text from Project Gutenberg as well as synthetic data.

Uploaded by

Daniel Lemire

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

130 views35 pages

Daniel Lemire and Owen Kaser, Recursive Hashing and One-Pass, One-Hash N-Gram Count Estimation

Uploaded by

Daniel Lemire

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 35

Recursive Hashing and One-Pass, One-Hash

n-Gram Count Estimation

Daniel Lemire
Université du Québec à Montréal

and
arXiv:0705.4676v1 [cs.DB] 31 May 2007

Owen Kaser
University of New Brunswick

Many applications use sequences of n consecutive symbols (n-grams). We review n-gram hashing and prove that
recursive hash families are pairwise independent at best. We prove that hashing by irreducible polynomials is
pairwise independent whereas hashing by cyclic polynomials is quasi-pairwise independent: we make it pairwise
independent by discarding n − 1 bits. One application of hashing is to estimate the number of distinct n-grams,
a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions,
we desire a statistically unassuming algorithm with universally valid accuracy bounds. Most related work has
focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass one-
hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. For example, we can
improve by a factor of 2 the theoretical bounds on estimation accuracy by replacing pairwise independent hashing
by 4-wise independent hashing. We show that recursive random hashing is sufficiently independent in practice.
Maybe surprisingly, our experiments showed that hashing by cyclic polynomials, which is only quasi-pairwise
independent, sometimes outperformed 10-wise independent hashing while being twice as fast. For comparison,
we measured the time to obtain exact n-gram counts using suffix arrays and show that, while we used hardly
any storage, we were an order of magnitude faster. The experiments used a large collection of English text from
Project Gutenberg as well as synthetic data.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Content Analysis and Index-
ing; H.2.7 [Database Administration]: Data warehouse and repository
General Terms: Algorithms, Theory, Experimentation
Additional Key Words and Phrases: recursive hashing, view-size estimation, n-grams

1. INTRODUCTION
Consider a sequence of symbols ai ∈ Σ of length N. The data source has high latency: for
example, it is not in a flat binary format or a DBMS, making random access and skipping
impractical. The symbols need not be characters from a natural language: they can be par-
ticular “events” inferred from a sensor or a news feed, they can be financial or biomedical
patterns found in time series, they can be words in a natural language, and so on. The

The first author was supported by NSERC grant 261437 and the second author was supported by NSERC
grant 155967.
Authors’ addresses: Daniel Lemire, Université du Québec à Montréal, 100 Sherbrooke West, Montréal, QC
H2X 3P2 Canada and Owen Kaser, University of New Brunswick, 100 Tucker Park Road, Saint John, NB
E2L 4L1 Canada

ACM Transactions on Computational Logic, Vol. V, No. N, February 2008, Pages 1–35.
number of distinct symbols (|Σ|) could be large (on the order of 105 in the case of words
in a typical English dictionary), but it is assumed to be small compared to the amount of
memory available. We make no other assumption about the distribution of these distinct
symbols and n-grams.
An n-gram is a consecutive sequence of n symbols. We use n-grams in language mod-
eling [Gao and Zhang 2001], pattern recognition [Yannakoudakis et al. 1990], predicting
web page accesses [Deshpande and Karypis 2004], information retrieval [Nie et al. 2000],
text categorization and author attribution [Caropreso et al. 2001; Joula et al. 2006; Keselj
and Cercone 2004], speech recognition [Jelinek 1998], multimedia [Paulus and Klapuri
2003], music retrieval [Doraisamy and Rüger 2003], text mining [Losiewicz et al. 2000],
information theory [Shannon 1948], software fault diagnosis [Bose and Srinivasan 2005],
data compression [Bacon and Houde 1984], data mining [Su et al. 2000], indexing [Kim
et al. 2005], On-line Analytical Processing (OLAP) [Keith et al. 2005a], optimal charac-
ter recognition (OCR) [Droettboom 2003], automated translation [Lin and Hovy 2003],
time series segmentation [Cohen et al. 2002], and so on.This paper concerns the use of
previously published hash functions for n-grams, together with recent randomized algo-
rithms for estimating the number of distinct items in a stream of data. Together, they
permit memory-efficient estimation of the number of distinct n-grams and motivate finer
theoretical investigations into efficient n-gram hashing.
The number of distinct n-grams grows large with n: storing Shakespeare’s First Fo-
lio [Project Gutenberg Literary Archive Foundation 2007] takes up about 4.6 MiB but we
can verify that it has over 3 million distinct 15-grams of characters. If each distinct n-gram
can be stored using log(4.6 × 106 ) ≈ 22 bits, then we need over 8 MiB just to store the n-
grams (3 × 106 × 22/8 ≈ 8.3 × 106 ) without counting the indexing overhead. Indexing data
structures such as suffix arrays require even more storage. Thus, storing and indexing n-
grams can use up more storage than the original data source. Extrapolating this to the large
corpora used in computational linguistic studies, we see the futility of using brute-force
approaches that store the n-grams in main memory, when n is large. For smaller values of
n, n-grams of English words are also likely to defeat brute-force approaches.
There are two strategies to estimate statistics of a sequence in a single pass [Kearns
et al. 1994; Batu et al. 2002; Guha et al. 2006]. The generative (or black-box) strategy
samples values at random. From the samples, the probability of each value is estimated
by maximum likelihood or other statistical techniques. The evaluative strategy, on the
other hand, probes the exact probabilities or, equivalently, the number of occurrences of
(possibly randomly) chosen values. In one pass, we can randomly probe several n-grams
so we know their exact frequency. For example, given the sequence “aadababbabaacc”, a
random sampling might be “aabc”; from this P(a), P(b), and P(c) can be approximated.
However, one could instead compute the exact frequency of value “b.”
On the one hand, it is difficult to estimate the number of distinct elements from a sam-
pling, without making further assumptions. For example, suppose there is only one distinct
n-gram in 100 samples out of 100,000 n-grams. Should we conclude that there is only one
distinct n-gram overall? Perhaps there are 100 distinct n-grams, but 99 of them only oc-
cur once — thus there is a ≈ 91% probability that we observe only the common one.
While this example is a bit extreme, skewed distributions are quite common, as Zipf’s law
shows. Choosing, a priori, the number of samples we require is a major difficulty. Esti-
mating the probabilities from sampling is a problem that still interests researchers to this
day [McAllester and Schapire 2000; Orlitsky et al. 2003].
On the other hand, distinct count estimates from a probing are statistically easier [Gib-
bons and Tirthapura 2001]. With the example above, with just enough storage budget
to store 100 distinct n-grams, we would get an exact count estimate! On the downside,
probing requires properly randomized hashing.
In the spirit of probing, Gibbons-Tirthapura (GT) [Gibbons and Tirthapura 2001] count
estimation goes as follows. We have m distinct items in a stream containing the dis-
tinct items x1 , . . . , xm with possible repetitions. Let h(xi ) be pairwise-independent hash
values over [0, 2L ) and let ht (xi ) be the first t bits of the hash value. We have that
E(card({ht−1 (0)})) = m/2t . Given a fixed memory budget M, and setting t = 0, we scan
the data. While scanning, we store all distinct items xi such that ht (xi ) = 0 in a look-up
table H. As soon as size(H) = M + 1, we increment t by 1 and remove all xi in H such
that ht (xi ) 6= 0. Typically, at least one element in H is removed, but if not, the process
of incrementing t and removing items is repeated until size(H) ≤ M. Then we continue
scanning. After the run is completed, we return size(H) × 2t as the estimate. By choos-
ing M = 576/ε2 [Bar-Yossef et al. 2002], we achieve an accuracy of ε, 5 times out of 6
(i.e., P(|size(H) × 2t − m| > εm) < 1/6), by an application of Chebyshev’s inequality. By
a Chernoff bound, running the algorithm O(log 1/δ) times and taking the median of the
results gives a reliability of δ instead of 5/6. Bar-Yossef et al. suggest to improve the
algorithm by storing hash values of the xi ’s instead of the xi ’s themselves, reducing the
reliability but lowering the memory usage. Notice that our Corollary 2 shows that the es-
timate of a 5/6 reliability for M = 576/ε2 is pessimistic: M = 576/ε2 implies a reliability
of over 99%. We also prove that replacing pairwise independent by 4-wise independent
hashing substantially improves the existing theoretical performance bounds.
Random hashing can be the real bottleneck in probing, but to alleviate this problem
for n-gram hashing, we use recursive hashing [Cohen 1997; Karp and Rabin 1987]: we
leverage the fact that successive n-grams have n − 1 characters in common. We study em-
pirically online n-gram count estimation that uses one pass and hashes each n-gram only
once. We compare several different recursive n-gram hashing algorithms including hashing
by cyclic and irreducible polynomials in the binary Galois Field (GF(2)[x]). We character-
ize the independence of recursive hashing by polynomials whereas only empirical studies
had existed. We prove that hashing by cyclic polynomials is not even uniform, whereas
hashing by irreducible polynomials is pairwise independent. It is interesting to contrast
these new theoretical results with the fact that Cohen reported [Cohen 1997] that hashing
by cyclic polynomials provided a more uniform hashing, experimentally. Fortunately, we
were also able to prove that hashing by cyclic polynomials is quasi-pairwise independent.
The main contributions of this paper are an analytical survey of recursive hashing, a tighter
theoretical bound in count estimation when hashing functions are more than pairwise inde-
pendent, and an experimental validation to demonstrate practical usefulness and to suggest
further theoretical investigation.

2. ORGANIZATION AND CONTRIBUTION

In Section 4, we present multidimensional hashing and specifically, the problem of hashing
n-grams. We review the concepts of uniform hashing as well as k-wise independence. We
present recursive hashing as a fast way to hash all n-grams in a string by repeatedly updat-
ing the previous n-gram. We then prove that recursive hashing, according to our definition,
cannot be pairwise independent (see Proposition 1). Because we wish for pairwise inde-
pendence, we introduce recursive hashing over hash values as weaker form of recursivity.
We then prove that pairwise independence is possible, but no more (see Proposition 2).
In Section 5, we present four hashing techniques for n-grams. We begin by showing how
we can compute non-recursive n-wise independent hash values, as long as the number of
distinct characters or words fits in memory. In Subsection 5.1, we present two hash families
introduced by Cohen [Cohen 1997], C YCLIC and G ENERAL. Both involve very similar
operations in Galois Fields (GF(2)[x]). We prove that G ENERAL is pairwise independent.
Regarding C YCLIC, we show that it is not even uniform. In Subsection 5.2, we take a closer
look at C YCLIC and show that almost all of its bits are pairwise independent. Finally, in
Subsection 5.3, we present a simple recursive, but non uniform, hash technique called
ID37, based on the classical Karp-Rabin hash family.
In Section 6, we review the GT count-estimation algorithm. Since most of the compu-
tational cost of the GT algorithm is in the hashing, we argue that it might be inefficient in
practice to repeatedly hash n-grams to apply a Chernoff bound. In Proposition 5, we offer a
novel result, showing that better bounds on the GT count estimation algorithm are possible
if we use k-wise independent hash values (k > 2) instead of merely pairwise ones.
Finally, in Section 7, we present our experimental results of the GT count-estimation al-
gorithm over the four hash families (n-wise independent, C YCLIC, G ENERAL, and ID37).
Using both synthetic Zipfian data and text from the Gutenberg Project, we show that hash-
ing each n-gram only once is sufficient to achieve good accuracy. The experimental re-
sults deliver some surprises. For example, C YCLIC and G ENERAL have a very similar
performance despite the apparent theoretical edge of G ENERAL which is truly pairwise
independent. We can partly explain this result since the GT count-estimation algorithm
tends to mostly use the least significant bits of each hash value. A more surprising result is
that C YCLIC and G ENERAL sometimes outperform the slower n-wise independent hashing
method we used.

3. RELATED WORK
Related work includes reservoir sampling, suffix arrays, and view-size estimation in OLAP.
We can choose randomly, without replacement, k samples in a sequence of unknown
length using a single pass through the data by reservoir sampling. Reservoir sampling [Vit-
ter 1985; Kolonko and Wasch 2006; Li 1994] was first introduced by Knuth [Knuth
1969]. All reservoir-sampling algorithms begin by appending the first k samples to an
array. In their linear-time (O(N)) form, reservoir-sampling algorithms sequentially visit
every symbol, choosing it as a possible sample with probability k/t where t is the num-
ber of symbols read so far. The chosen sample is simply appended at the end of the
array while an existing sample is flagged as having been removed. The array has an
average size of k (1 + log N/k) samples at the end of the run. In their sublinear form
(O(k(1 + log(N/k)) expected time), the algorithms skip a random number of data points
each time. While these algorithms use a single pass, they assume that the number of re-
quired samples k is known a priori, but this is difficult without any knowledge of the data
distribution.
Using suffix arrays [Manber and Myers 1990; 1993] and the length of the maximal com-
mon prefix between successive prefixes, Nagao and Mori [Nagao and Mori 1994] proposed
a fast algorithm to compute n-gram statistics exactly. However, it cannot be considered an
online algorithm even if we compute the suffix array in one pass: after constructing the
suffix array, one must go through all suffixes at least once more. Their implementation was
later improved by Kit and Wilks [Kit and Wilks 1998]. Compared to uncompressed suf-
fix trees [Giegerich et al. 1999], uncompressed suffix arrays use several times less storage
and their performance does not depend on the size of the alphabet. Suffix arrays can be
constructed in O(N) time using O(N) working space [Hon et al. 2003]. Querying a suffix
array for a given n-gram takes O(log N) time.
By definition, each n-gram is a tuple of length n and can be viewed as a relation to be ag-
gregated. OLAP (On-Line Analytical Processing) [Codd 1993] is a database acceleration
technique used for deductive analysis, typically involving aggregation. To achieve acceler-
ation, one frequently builds data cubes [Gray et al. 1996] where multidimensional relations
are preaggregated in multidimensional arrays. OLAP is commonly used for business pur-
poses with dimensions such as time, location, sales, expenses, and so on. Concerning text,
most work has focused on informetrics/bibliomining, document management and informa-
tion retrieval [McCabe et al. 2000; Mothe et al. 2003; Niemi et al. 2003; Bernard 1995;
Sullivan 2001]. The idea of using OLAP for exploring the text content itself (including
phrases and n-grams) was proposed for the first time by Keith, Kaser and Lemire [Keith
et al. 2005b; 2005a; Kaser et al. 2006]. The estimation of n-gram counts can be viewed
as an OLAP view-size estimation problem which itself “remains an important area of open
research” [Dehne et al. 2006].
A data-agnostic approach to view-size estimation [Shukla et al. 1996], which is likely to
be used by database vendors, can be computed almost instantly. For n-gram estimation,
the number of attributes is the size of the alphabet |Σ| and η is the number of n-grams
with possible repetitions (η = N − n + 1). Given η cells picked uniformly at random, with
replacement, in a V = K1 × K2 × · · · Kn space, the probability that any given cell (think “n-
gram”) is omitted is (1 − V1 )η . For n-grams, V = |Σ|n . Therefore, the expected number of
unoccupied cells is (1 − V1 )η × η. Similarly, assuming the number of n-grams is known to
be m, the same model permits us to estimate the number of n−1-grams by m×(1−( Vm )|Σ| ).
In practice, this approach systematically overestimates because relations are not uniformly
distributed.
A more sophisticated view-size estimation algorithm used in the context of data ware-
housing and OLAP [Shukla et al. 1996; Kotidis 2002] is logarithmic probabilistic count-
ing [Flajolet and Martin 1985]. This approach requires a single pass and almost no mem-
ory, but it assumes independent hashing for which no algorithm using limited storage is
known [Bar-Yossef et al. 2002]. Alon et al. [Alon et al. 1996, Proposition 2.3] showed
probabilistic counting could work with only pairwise independence, but the error bounds
are large (at least 100%): for any c > 2, the relative error is bounded by c − 1 with reli-
ability 1 − 2/c. Practical results are sometimes disappointing [Dehne et al. 2006]. Other
variants of this approach include linear probabilistic counting [Whang et al. 1990; Shah
et al. 2004] and Loglog counting [Durand and Flajolet 2003], but their theoretical analysis
also assumes independent hashing.
View-size estimation through sampling has been made adaptive by Haas et al. [Haas
et al. 1995]: their strategy is to first attempt to determine whether the distribution is skewed
and then use an appropriate statistical estimator. We can also count the marginal frequen-
cies of each attribute value (or symbol in an n-gram setting) and use them to give estimates
as well as (exact) lower and upper bound on the view size [Yu et al. 2005]. Other re-
searchers make particular assumptions on the distribution of relations [Nadeau and Teorey
2003; Ciaccia et al. 2001; 2003; Faloutsos et al. 1996].

4. MULTIDIMENSIONAL RANDOM HASHING

Hashing encodes an object as a fixed-length bit string for comparison. Multidimensional
hashing is a particular form of hashing where the objects can be represented as tuples.
Multidimensional hashing is of general interest since several commonly occurring objects
can be thought of as tuples: 32-bit values can be seen as 8-tuples containing 4-bit values.
For convenience, we consider hash functions mapping keys to [0, 2L ), where the set U of
possible keys is much larger than 2L . A difficulty with hashing is that any particular hash
function h has some “bad inputs” S ⊂ U over which some hash value (such as 0) is either
too frequent or not frequent enough (card(h−1 (0)) 6≈ card(S)/2L ) making count estimates
from hash values difficult. Rather than make assumptions about the probabilities of bad
inputs for a particular fixed hash function, an alternative approach [Carter and Wegman
1979] selects the hash function randomly from some family H of functions, all of which
map U to [0, 2L ).
Clearly, some families H have desirable properties that other families do not have. For
instance, consider a family whose members always map to even numbers — then picking
h randomly in H , we have P(h(x) = i) = 0 for any odd i and any x ∈ U. This would be
an undesirable property for many applications. We now mention some desirable properties
of families. H is uniform if, considering h selected uniformly at random from H and for
all x and y, we have P(h(x) = y) = 1/2L . This condition is weak; the family of constant
functions (h(x) = c) is uniform but would be disastrous when used with the GT algorithm.
We need stronger conditions implying that any particular member h of the family must hash
objects evenly over [0, 2L ). H is pairwise independent or universal [Carter and Wegman
1979] if we have that P(h(x1 ) = y ∧ h(x2 ) = z) = P(h(x1 ) = y)P(h(x2 ) = z) = 1/4L for all
x1 , x2 , y, z with x1 6= x2 . We will refer to such an h ∈ H as a “pairwise independent hash
function” when the family in question can be inferred from the context or is not important.
Pairwise independence implies uniformity.
As previously stated, we are only interested in hash values in [0, 2L ), but let us consider
a more general case as an example. A well-known example of a pairwise-independent
hash function for keys in the range [0, Br+1 ), where B is prime, is computed as follows.
Express key x as xr xr−1 . . . x0 in base B. Randomly choose a number a ∈ [0, Br+1 ) and
express it as ar ar−1 . . . a0 in base B. Then, set h(x) = ∑ri=0 ai xi mod B. The proof that it
is pairwise independent follows from the fact that integers modulo a prime number form a
field (GF(B)).
Gibbons and Tirthapura showed that pairwise independence was sufficient to approx-
imate count statistics accurately [Gibbons and Tirthapura 2001] essentially because the
variance of the sum of pairwise independent variables is just the sum the variances
(Var(X1 + . . . + X j ) = Var(X1 ) + . . . + Var(X j )).
Moreover, the idea of pairwise independence can be generalized: a family of hash func-
tions H is k-wise independent if given distinct x1 , . . . , xk and given h selected uniformly at
random from H , then P(h(x1 ) = y1 ∧ · · · ∧ h(xk ) = yk ) = 1/2kL . Note that k-wise indepen-
dence implies k − 1-wise independence and uniformity. (Fully) independent hash functions
are k-wise independent for arbitrarily large k. This paper contributes better bounds for ap-
proximate count statistics, providing that more fully independent hash functions are used
(4-wise instead of pairwise, for instance).
In the context of n-gram hashing, we seek recursive families of hash functions so that
we can compute new hash values quickly by reusing previous hash values. A hash function
h is recursive if there exists a (fixed) function F over triples such that
h(x2 , x3 , . . . , xn+1 ) = F(h(x1 , x2 , . . . , xn ), x1 , xn+1 ) ∀x1 , x2 , . . . , xn+1 .
The function F must be independent of h. (F is common to all members of the family).
For example, if the values xi are integers, h(x1 , x2 , . . . , xn ) = ∑ni=1 xi mod 2L is a recursive
hash function since we can choose F(a, b, c) = a − b + c mod 2L .
As an example of a common recursive hash function, given tuples (x1 , x2 , . . . , xn ) whose
components are integers taken from [0, B), we can hash by the Karp-Rabin formula
∑i xi Bi−1 mod R, where R is some prime defining the range of the hash function [Karp
and Rabin 1987; Gonnet and Baeza-Yates 1990]. It is a poor hash function for us,
since n-grams with common suffixes all get very similar hash values. For probabilistic-
counting approaches based on the number of trailing zeros of the hashed value, if H =
h(x1 , x2 , . . . , xn−1 , xn ) has many trailing zeros, then we know that h(x10 , x2 , . . . , xn−1 , xn ) will
hash to a value close to H, which therefore cannot also have many trailing zeros (assuming
x1 6= x10 ).
In fact, no recursive hash function can be pairwise independent, even if we focus
only on trailing zeros. Let zeros(x) return the number of trailing zeros (0,1,. . . ,L) of x,
where zeros(0) = L. We say h is trailing-zero uniform if P(zeros(h(x)) ≥ j) = 2− j , for
j = 0, 1, . . . , L and all n-grams x. Clearly, uniformity implies trailing-zero uniformity. Sim-
ilarly, we say h is k-wise trailing-zero independent if P(zeros(h(x1 )) ≥ j1 ∧ zeros(h(x2 )) ≥
j2 ∧ . . . ∧ zeros(h(xk )) ≥ jk ) = 2− j1 − j2 −...− jk , for ji = 0, 1, . . . , L. Again, if h is k-wise in-
dependent, it is k-wise trailing-zero independent. The converse is not true. If h is a k-wise
independent function, consider g ◦ h where g makes zero all bits before the rightmost 1
(e.g., g(0101100) = 0000100). Hash g ◦ h is k-wise trailing-zero independent but not even
uniform (consider that P(g = 0001) = 8P(g = 1000)). The GT algorithm will behave iden-
tically whether using g ◦ h or h, showing that uniformity is not a necessary condition for
good performance.
P ROPOSITION 1. Recursive hash functions are not pairwise trailing-zero independent.
P ROOF. Suppose h is a recursive pairwise independent hash function. Let hi, j =
h(xi , . . . , x j ). Then F(h1,n , x1 , xn+1 ) = h2,n+1 where the function F must be independent
of h.
Fix the values x1 , . . . , xn+1 , then
P(zeros(h1,n ) ≥ L ∧ zeros(h2,n+1 ) ≥ L)
= P(h1,n = 0 ∧ F(0, x1 , xn+1 ) = 0)

P(h1,n = 0) if F(0, x1 , xn+1 ) = 0
= ,
0 otherwise
a contradiction.
Immediately, we have that recursive hash function are not pairwise independent. Hence,
we are motivated to seek a weaker form of recursivity. By extension1 , a hash function h

1 This extended sense resembles Cohen’s use of “recursive” [Cohen 1997].

is recursive over hash values τ(xi ), where τ is a randomized hash function for symbols, if
there is a function F, independent of τ and h, such that
h(x2 , . . . , xn+1 ) = F(h(x1 , . . . , xn ), τ(x1 ), τ(xn+1 )).
A trivial example of a hash function recursive over hash values is h(x1 , x2 , . . . , xn ) =
∑ni=1 τ(xi ) mod 2L . (The multidimensional hashing function h depends implicitly on the
hash function τ.)
As the next proposition shows, being recursive over hashed values, while a weaker re-
quirement, does not allow more than pairwise independence.
P ROPOSITION 2. There is no 3-wise trailing-zero independent hashing function that is
recursive over hashed values.
P ROOF. Consider the string of symbols an bb. Suppose h is 3-wise trailing-zero inde-
pendent, then
P(zeros(h(a, . . . , a)) ≥ L ∧ zeros(h(a, . . . , a, b)) ≥ L ∧ zeros(h(a, . . . , a, b, b)) ≥ L)
= P(h(a, . . . , a) = 0 ∧ F(0, τ(a), τ(b)) = 0 ∧ F(0, τ(a), τ(b)) = 0)
= P(h(a, . . . , a) = 0 ∧ F(0, τ(a), τ(b)) = 0)
= P(zeros(h(a, . . . , a)) ≥ L ∧ zeros(h(a, . . . , a, b)) ≥ L)
= 2−2L by trailing-zero pairwise independence
6= 2−3L as required by trailing-zero 3-wise independence.
Hence, we have a contradiction and no such h can exist.
This also implies that no 3-wise independent hash can be recursive over hashed values.

5. HASHING n-GRAMS
In this section, we begin by showing how we can construct a moderately efficient (non-
recursive) n-wise independent hash function. We then define three recursive hashing fam-
ilies C YCLIC, G ENERAL, and ID37, and study their properties, being especially attentive
to C YCLIC. All four hashing families will later be benchmarked in the context of count
estimation using the GT algorithm.
A trivial way to generate an independent hash is to assign a random integer in [0, 2L )
to each new value x. Unfortunately, this requires as much processing and storage as a
complete indexing of all values. However, in a multidimensional setting this approach can
be put to good use. Suppose that we have tuples in K1 × K2 × · · · × Kn such that |Ki | is small
for all i. We can construct independent hash functions hi : Ki → [0, 2L ) for all i and combine
them. The hash function h(x1 , x2 , . . . , xn ) = h1 (x1 ) ⊕ h2 (x2 ) ⊕ · · · ⊕ hn (xn ) is then n-wise
independent (⊕ is the exclusive or). As long as the sets Ki , are small, in time O(∑i |Ki |)
we can construct the hash function by generating ∑i |Ki | random numbers and storing them
in a look-up table (see Algorithm 1). With constant-time look-up, hashing an n-gram thus
takes O(Ln) time, or O(n) if L is considered a constant.
Unfortunately, this hash function is not recursive. In the n-gram context, we can choose
h1 = h2 = . . . since Σ = K1 = K2 = . . .. While the resulting hash function is recursive over
hashed values since
h(x2 , . . . , xn+1 ) = h1 (x2 ) ⊕ · · · ⊕ h1 (xn+1 )
= h1 (x1 ) ⊕ h1 (xn+1 ) ⊕ h(x1 , . . . , xn ),
Algorithm 1 The (non-recursive) n-wise independent family.
Require: n L-bit hash functions h1 , h1 , . . . , hn over Σ from an independent hash family
1: s ← empty FIFO structure
2: for each character c do
3: append c to s
4: if length(s)= n then
5: yield h1 (s1 ) ⊕ h2 (s2 ) ⊕ . . . ⊕ hn (sn )
{The yield statement returns the value, without terminating the algorithm.}
6: remove oldest character from s
7: end if
8: end for

it is no longer even pairwise independent, since P(h(a, b) = w, h(b, a) = v) = 0 if w 6= v.

5.1 Recursive Hashing by Polynomials

We seek families of hash functions that behave, for practical purposes, as if n-wise in-
dependent while being recursive over hash values. A particularly interesting form of
hashing using the binary Galois field GF(2) is called “Recursive Hashing by Polynomi-
als” and has been attributed to Kubina by Cohen [Cohen 1997]. GF(2) contains only
two values (1 and 0) with the addition (and hence subtraction) defined by “exclusive
or”, a + b = a ⊕ b and the multiplication by “and”, a × b = a ∧ b. GF(2)[x] is the vec-
tor space of all polynomials with coefficients from GF(2). Any integer in binary form
(e.g., c = 1101) can thus be interpreted as an element of GF(2)[x] (e.g., c = x3 + x2 + 1).
If p(x) ∈ GF(2)[x], then GF(2)[x]/p(x) can be thought of as GF(2)[x] modulo p(x). As an
example, if p(x) = x2 , then GF(2)[x]/p(x) is the set of all linear polynomials. For instance,
x3 + x2 + x + 1 = x + 1 mod x2 since, in GF(2)[x], (x + 1) + x2 (x + 1) = x3 + x2 + x + 1.
Interpreting h1 hash values as polynomials in GF(2)[x]/p(x), and with the condition that
degree(p(x)) ≥ n, we define h(a1 , a2 , . . . , an ) = h1 (a1 )xn−1 + h1 (a2 )xn−2 + . . . + h1 (an ). It
is recursive over the sequence h1 (ai ). The combined hash can be computed in constant
time with respect to n by reusing previous hash values:
h(a2 , a3 , . . . , an+1 ) = xh(a1 , a2 , . . . , an ) − h1 (a1 )xn + h1 (an+1 ).
Choosing p(x) = xL + 1 for L ≥ n, for any polynomial q(x) = ∑L−1 i
i=0 qi x , we have

xq(x) = x(qL−1 xL−1 + . . . + q1 x + q0 ) = qL−2 xL−1 + . . . + q0 x + qL−1 .

Thus, we have that multiplication by x is a bitwise rotation, a cyclic left shift. The resulting
hash (see Algorithm 2) is called R ECURSIVE H ASHING BY C YCLIC P OLYNOMIALS [Co-
hen 1997], or (for short) C YCLIC. It was shown empirically to be uniform [Cohen 1997],
but it is not formally so:
L EMMA 1. C YCLIC is not uniform for n even and never pairwise independent.
P ROOF. If n is even, use the fact that xn−1 + . . . + x + 1 is divisible by x + 1 to write
xn−1 + . . . + x + 1 = (x + 1)r(x) for some polynomial r(x). Clearly, r(x)(x + 1)(xL−1 +
x L−2 +. . .+x +1) = 0 mod xL + 1 for any r(x) and so P(h(a1 , a1 , . . . , a1 ) = 0) = P((xn−1 +
. . . + x + 1)h1 (a1 ) = 0) = P((x + 1)r(x)h1 (a1 ) = 0) ≥ P(h1 (a1 ) = 0 ∨ h1 (a1 ) = xL−1 +
xL−2 + . . . + x + 1) = 1/2L−1 . Therefore, C YCLIC is not uniform for n even.
Algorithm 2 The recursive C YCLIC family.
Require: an L-bit hash function h1 over Σ from an independent hash family
1: c ← empty FIFO structure
2: x ← 0 (L-bit integer)
3: z ← 0 (L-bit integer)
4: for each character c do
5: append c to s
6: rotate x left by 1 bit
7: rotate z left by n bits
8: x ← x ⊕ z ⊕ h1 (c)
9: if length(s)= n then
10: yield x
11: remove oldest character y from s
12: z ← h1 (y)
13: end if
14: end for

To show C YCLIC is never pairwise independent, consider n = 3 (for simplicity), then

P(h(a1 , a1 , a2 ) = h(a1 , a2 , a1 )) = P((x + 1)(h1 (a1 ) + h1 (a2 )) = 0) ≥ P(h1 (a1 ) + h1 (a2 ) =
0 ∨ h1 (a1 ) + h1 (a2 ) = xL−1 + xL−2 + . . . + x + 1) = 1/2L−1 , but pairwise independent hash
values are equal with probability 1/2L . The result is shown.
Similarly, to generate hash functions over [0, 2L ), we can choose p(x) to be an irreducible
polynomial of degree L in GF(2)[x]: an irreducible polynomial cannot be factored into
nontrivial polynomials. For L = 19, an example is p(x) = x19 + x18 + x17 + x16 + x12 + x7 +
x6 + x5 + x3 + x2 + 1 [Ruskey 2006]. (With this particular irreducible polynomial, L = 19
and so we require n ≤ 19. Irreducible polynomials of larger degree can be found [Ruskey
2006] if desired.) Computing (a18 x18 + . . . + a0 )x mod p(x) as a polynomial of degree 18
or less, for representation in 19 bits, can be done efficiently. We have (a18 x18 + . . .+a0 )x =
a18 (p(x) − x19 ) + a17 x18 . . . + a0 x mod p(x) and the polynomial on the right-hand-side is
of degree at most 18. In practice, we do a left shift of the hash value and if the value of
the 20th bit is 1, then we apply an exclusive or with the integer 1 + 22 + 23 + 25 + 26 +
27 + 212 + 216 + 217 + 218 + 219 . The resulting hash is called R ECURSIVE H ASHING BY
G ENERAL P OLYNOMIALS [Cohen 1997], or (for short) G ENERAL (see Algorithm 3). To
summarize, both C YCLIC and G ENERAL hash n-grams with the formula h(a1 , a2 , . . . , an ) =
xn−1 h1 (a1 ) + . . . + xh1 (an−1 ) + h1 (an ) but C YCLIC operates in GF(2)[x]/(1 + xL ) whereas
C YCLIC does the computation in GF(2)[x]/p(x) for some irreducible polynomial. The
main benefit of setting p(x) to be an irreducible polynomial is that GF(2)[x]/p(x) is a field;
in particular, it is no longer possible that p1 (x)p2 (x) = 0 mod p(x) unless either p1 (x) = 0
or p2 (x) = 0. The field property allows us to prove that the hash function is pairwise
independent (see Lemma 2), but it is not 3-wise independent because of Proposition 2. In
the sense of Proposition 2, it is an optimal recursive hashing function.
L EMMA 2. G ENERAL is pairwise independent.
P ROOF. If p(x) is irreducible, then any non-zero q(x) ∈ GF(2)[x]/p(x) has an inverse,
noted q−1 (x) since GF(2)[x]/p(x) is a field. Interpret hash values as polynomials in
GF(2)[x]/p(x).
Algorithm 3 The recursive G ENERAL family.
Require: an L-bit hash function h1 over Σ from an independent hash family; an irreducible
polynomial p of degree L in GF(2)[x]
1: c ← empty FIFO structure
2: x ← 0 (L-bit integer)
3: z ← 0 (L-bit integer)
4: for each character c do
5: append c to s
6: x ← shift(x)
7: z ← shiftn (z)
8: x ← x ⊕ z ⊕ h1 (c)
9: if length(s)= n then
10: yield x
11: remove oldest character y from s
12: z ← h1 (y)
13: end if
14: end for
1: function shift
2: input L-bit integer x
3: shift x left by 1 bit, storing result in an L + 1-bit integer x0
4: if leftmost bit of x0 is 1 then
5: x0 ← x0 ⊕ p
6: end if
7: {leftmost bit of x0 is thus always 0}
8: return rightmost L bits of x0

Firstly, we prove that G ENERAL is uniform. In fact, we show a stronger result:

P(q1 (x)h1 (a1 )+q2 (x)h1 (a2 )+. . .+qn (x)h1 (an ) = y) = 1/2L for any polynomials qi where
at least one is different from zero. The result is shown by induction on the number
of non-zero polynomials: it is clearly true where there is a single non-zero polynomial
qi (x), since qi (x)h1 (ai ) = y ⇐⇒ q−1 −1
i (x)qi (x)h1 (ai ) = qi (x)y. Suppose it is true up to
k − 1 non-zero polynomials and consider a case where we have k non-zero polynomials.
Assume without loss of generality that q1 (x) 6= 0, we have P(q1 (x)h1 (a1 ) + q2 (x)h1 (a2 ) +
. . . + qn (x)h1 (an ) = y) = P(h1 (a1 ) = q−1 1 (x)(y − q2 (x)h1 (a2 ) − . . . − qn (x)h1 (an ))) =
∑y0 P(h1 (a1 ) = q−1 1 (x)(y − y0 ))P(q (x)h (a ) + . . . + q (x)h (a ) = y0 ) =
2 1 2 n 1 n ∑y0 21L 21L = 21L
by the induction argument. Hence the uniformity result is shown.
Consider two distinct sequences a1 , a2 , . . . , an and a01 , a02 , . . . , a0n . Write Ha =
h(a1 , a2 , . . . , an ) and Ha0 = h(a01 , a02 , . . . , a0n ). We have that P(Ha = y ∧ Ha0 = y0 ) = P(Ha =
y|Ha0 = y0 )P(Ha0 = y0 ). Hence, to prove pairwise independence, it suffices to show that
P(Ha = y|Ha0 = y0 ) = 1/2L .
Suppose that ai = a0j for some i, j; if not, the result follows since by the (full) in-
dependence of the hashing function h1 , the values Ha and Ha0 are independent. Write
q(x) = −(∑k|ak =ai xn−k )(∑k|a0 =a0j xn−k )−1 , then Ha + q(x)Ha0 is independent from ai = a0j
k
(and h1 (ai ) = h1 (a0j )).
In Ha + q(x)Ha0 , only hashed values h1 (ak ) for ak 6= ai and h1 (a0k ) for a0k 6= a0j remain:
label them h1 (b1 ), . . . , h1 (bm ). The result of the substitution can be written Ha + q(x)Ha0 =
∑k qk (x)h1 (bk ) where qk (x) are polynomials in GF(2)[x]/p(x). All qk (x) are zero if and
only if Ha + q(x)Ha0 = 0 for all values of h1 (a1 ), . . . , h1 (an ) and h1 (a01 ), . . . , h1 (a0n ) (but no-
tice that the value h1 (ai ) = h1 (a0j ) is irrelevant); in particular, it must be true when h1 (ak ) =
1 and h1 (a0k ) = 1 for all k, hence (xn + . . . + x + 1) + q(x)(xn . . . + x + 1) = 0 ⇒ q(x) = −1.
Thus, all qk (x) are zero if and only if Ha = Ha0 for all values of h1 (a1 ), . . . , h1 (an ) and
h1 (a01 ), . . . , h1 (a0n ) which only happens if the sequences a and a0 are identical. Hence, not
all qk (x) are zero.
Write Hy0 ,a0 = (∑k|a0 =a0j xn−k )−1 (y0 − ∑k|a0 6=a0j xn−k h1 (a0k )). On the one hand, the condi-
k k
tion Ha0 = y0 can be rewritten as h1 (a0j ) = Hy0 ,a0 . On the other hand, Ha + q(x)Ha0 = y +
q(x)y0 is independent from h1 (a0j ) = h1 (ai ). Because P(h1 (a0j ) = Hy0 ,a0 ) = 1/2L irrespective
of y0 and h1 (a0k ) for k ∈ {k|a0k 6= a0j }, then P(h1 (a0j ) = Hy0 ,a0 |Ha + q(x)Ha0 = y + q(x)y0 ) =
P(h1 (a0j ) = Hy0 ,a0 ) which implies that h1 (a0j ) = Hy0 ,a0 and Ha + q(x)Ha0 = y + q(x)y0 are
independent. Hence, we have
P(Ha = y|Ha0 = y0 )
= P(Ha + q(x)Ha0 = y + q(x)y0 |h1 (a0j ) = Hy0 ,a0 )
= P(Ha + q(x)Ha0 = y + q(x)y0 )
= P(∑ qk (x)h1 (bk ) = y + q(x)y0 )
k

and by the earlier uniformity result, this last probability is equal to 1/2L . This concludes
the proof.
Of the four recursive hashing functions investigated by Cohen [Cohen 1997], G ENERAL
and C YCLIC were superior both in terms of speed and uniformity, though C YCLIC had
a small edge over G ENERAL. For n large, the benefits of these recursive hash functions
compared to the n-wise independent hash function presented earlier can be substantial:
n table look-ups2 is much more expensive than a single look-up followed by binary shifts.
5.2 Is C YCLIC almost pairwise independent?
While C YCLIC is not uniform, it was shown empirically to have good uniformity [Cohen
1997]. Hence it is reasonable to expect C YCLIC to be almost uniform and maybe even
almost pairwise independent. To illustrate this intuition, consider Table I which shows that
while h(a, a) is not uniform (h(a, a) = 001 is impossible), h(a, a) minus any bit is indeed
uniformly distributed. We will prove that this result holds in general.
The next lemma and the next theorem show that C YCLIC is quasi-pairwise independent
in the sense that L − n + 1 consecutive bits (e.g., the first or last L − n + 1 bits) are pairwise
independent. In other words, C YCLIC is pairwise independent if we are willing to sacrifice
n − 1 bits. (We say that n bits are “consecutive modulo L” if the bits are located at indexes
i mod L for n consecutive values of i such as i = k, k + 1, . . . , k + n − 1.)
L EMMA 3. If q(x) ∈ GF(2)[x]/(xL + 1) (with q(x) 6= 0) has degree n < L, then
—the equation q(x)w = y mod xL + 1 modulo the first n bits3 has exactly 2n solutions for
all y;

2 Recall we assume that Σ is not known in advance. Otherwise for many applications, each table lookup could be

merely an array access.

3 By “equality modulo hsome specified set of bit positionsi”, we mean that the two quantities are bitwise identical,
Table I. C YCLIC hash for various values of h1 (a) (h(a, a) = xh1 (a) + h1 (a) mod 2L + 1)
h1 (a) h(a, a) h(a, a) (first two bits) h(a, a) (last two bits) h(a, a) (first and last bit)
000 000 00 00 00
100 110 11 10 10
010 011 01 11 01
110 101 10 01 11
001 101 10 01 11
101 011 01 11 01
011 110 11 10 10
111 000 00 00 00

—more generally, the equation q(x)w = y mod xL + 1 modulo any consecutive n bits (mod-
ulo L) has exactly 2n solutions for all y.
P ROOF. Let P be the set of polynomials of degree at most L − n − 1. Take any p(x) ∈ P,
then q(x)p(x) has degree at most L − n − 1 + n = L − 1 and thus if q(x) 6= 0 and p(x) 6=
0, then q(x)p(x) 6= 0 mod xL + 1. Hence, for any distinct p1 , p2 ∈ P we have q(x)p1 6=
q(x)p2 mod xL + 1.
To prove the first item, we begin by showing that there is always exactly one solution
in P. Consider that there are 2L−n polynomials p(x) in P, and that all values q(x)p(x)
are distinct. Suppose there are p1 , p2 ∈ P such that q(x)p1 = q(x)p2 mod xL + 1 modulo
the first n bits, then q(x)(p1 − p2 ) is a polynomial of degree at most n − 1 while p1 − p2
is a polynomial of degree at most L − n − 1 and q(x) is a polynomial of degree n, thus
p1 − p2 = 0. (If p1 − p2 6= 0 then degree(q(x)(p1 − p2) mod xL + 1) ≥ degree(q(x)) = n,
a contradiction.) Hence, all p(x) in P are mapped to distinct values modulo the first n bits,
and since there are 2L−n such distinct values, the result is shown.
Any polynomial of degree L −1 can be decomposed into the form p(x)+xL−n z(x) where
z(x) is a polynomial of degree at most n − 1 and p(x) ∈ P. By the preceding result, for
distinct p1 , p2 ∈ P, q(x)(xL−n z(x) + p1 ) and q(x)(xL−n z(x) + p2 ) must be distinct modulo
the first n bits. In other words, the equation q(x)(xL−n z(x) + p) = y modulo the first n bits
has exactly one solution p ∈ P for any z(x) and since there are 2n polynomials z(x) of
degree at most n − 1, then q(x)w = y (modulo the first n bits) must have 2n solutions.
To prove the second item, choose j and use the first item to find any w solving q(x)w =
yx j mod xL + 1 modulo the first n bits. j. Then wxL− j is a solution to q(x)w = y mod xL + 1
modulo the bits in positions j, j + 1, . . . , j + n − 1 mod L.
We have the following corollary to Lemma 3.
C OROLLARY 1. If w is chosen uniformly at random in GF(2)[x]/(xL + 1), then
P(q(x)w = y mod n − 1 bits) = 1/2L−n+1 where the n − 1 bits are consecutive (modulo
L).
T HEOREM 5.1. All consecutive L − n + 1 bits (modulo L) of the L-bit C YCLIC n-gram
hash family are pairwise independent.
P ROOF. We show P(q1 (x)h1 (a1 ) + q2 (x)h1 (a2 ) + . . . + qn (x)h1 (an ) =
y mod n − 1 bits) = 1/2L−n+1 for any polynomials qi where at least one is different
from zero. It is true when there is a single non-zero polynomial qi (x) by Corollary 1.

with exceptions permitted only at the specified positions. For our polynomials, “equality modulo the first n bit
positions” implies the difference of the two polynomials has degree at most n − 1.
Suppose it is true up to k − 1 non-zero polynomials and consider a case where we have
k non-zero polynomials. Assume without loss of generality that q1 (x) 6= 0, we have
P(q1 (x)h1 (a1 ) + q2 (x)h1 (a2 ) + . . . + qn (x)h1 (an ) = y mod n − 1 bits) = P(q1 (x)h1 (a1 ) =
y − q2 (x)h1 (a2 ) − . . . − qn (x)h1 (an ) mod n − 1 bits) = ∑y0 P(q1 (x)h1 (a1 ) =
y − y0 mod n − 1 bits)P(q2 (x)h1 (a2 ) + . . . + qn (x)h1 (an ) = y0 mod n − 1 bits) =
1 1
∑y0 2L−n+1 2L−n+1
= 1/2L−n+1 by the induction argument, where the sum is over 2L−n+1
0
values of y . Hence the uniformity result is shown.
Consider two distinct sequences a1 , a2 , . . . , an and a01 , a02 , . . . , a0n . Write Ha =
h(a1 , a2 , . . . , an ) and Ha0 = h(a01 , a02 , . . . , a0n ). To prove pairwise independence, it suf-
fices to show that P(Ha = y mod n − 1 bits|Ha0 = y0 mod n − 1 bits) = 1/2L−n+1 . Sup-
pose that ai = a0j for some i, j; if not, the result follows by the (full) independence
of the hashing function h1 . Using Lemma 3, find q(x) such that q(x) ∑k|a0 =a0j xn−k =
k
− ∑k|ak =ai xn−k mod n − 1 bits, then Ha + q(x)Ha0 mod n − 1 bits is independent from ai =
a0j (and h1 (ai ) = h1 (a0j )).
The hashed values h1 (ak ) for ak 6= ai and h1 (a0k ) for a0k 6= a0j are now relabelled as
h1 (b1 ), . . . , h1 (bm ). Write Ha + q(x)Ha0 = ∑k qk (x)h1 (bk ) mod n − 1 bits where qk (x) are
polynomials in GF(2)[x]/(xL + 1) (not all qk (x) are zero). As in the proof of Lemma 2,
we have that Ha0 = y0 mod n − 1 bits and Ha + q(x)Ha0 = y + q(x)y0 mod n − 1 bits are
independent4 : P(Ha0 = y0 mod n − 1 bits|y0 , b1 , b2 , . . . , bm ) = 1/2L−n+1 by Corollary 1
since Ha0 = y can be written as r(x)h1 (a0j ) = y − ∑k rk (x)h1 (bk ) for some polynomials
r(x), r1 (x), . . . , rm (x). Hence, we have
P(Ha = y mod n − 1 bits|Ha0 = y0 mod n − 1 bits)
= P(Ha + q(x)Ha0 = y + q(x)y0 mod n − 1 bits|Ha0 = y0 mod n − 1 bits)
= P(Ha + q(x)Ha0 = y + q(x)y0 mod n − 1 bits)
= P(∑ qk (x)h1 (bk ) = y + q(x)y0 mod n − 1 bits)
k

and by the earlier uniformity result, this last probability is equal to 1/2L−n+1 .
However, Theorem 5.1 may be pessimistic: the hash values of some n-grams may be
more uniform and independent. The next lemma and proposition show that as long as L
and n are coprime, it suffices to drop or ignore one bit of the C YCLIC hash so that at least
one n-gram (an ) is hashed uniformly. Recall that L and n are coprime if gcd(L, n) = 1; i.e.,
their greatest common divisor is 1.
L EMMA 4. If n and L are coprime, then (xn + 1)w = 0 has only the following solutions
in GF(2)[x]/(xL + 1): w = 0 and w = xL−1 + . . . + x2 + x + 1.
P ROOF. Clearly, w = 0 and w = xL−1 + . . . + x2 + x + 1 are always solutions. The proof
that these are the only solutions proceeds in three stages: first, we show that if w solves
(xn + 1)w = 0 mod xL + 1, then w also solves (xkn + 1)w = 0 mod xL + 1, for all k. Second,
we show that this implies (xi + 1)w = 0 mod xL + 1, for all i. Finally, we show that this
implies all coefficients in w are equal, a property of only w = 0 and w = xL−1 + . . . + x2 +
x + 1.

4 We use the shorthand notation P( f (x, y) = c|x, y) = b to mean P( f (x, y) = c|x = z1 , y = z2 ) = b for all values of
z1 , z2 .
For the first stage, we want to prove that (xn + 1)w = 0 mod xL + 1 implies (xkn + 1)w =
0 mod xL + 1 for any k = 1, 2, . . . by induction. To begin, we have that (xn + 1)2 = 1 + x2n .
Suppose we have (xkn + 1)w = 0 mod xL + 1 for k = 1, 2, . . . , K − 1, we need that (xKn +
1)w = 0 mod xL + 1. We have (x(K−1)n + 1)w = 0 mod xL + 1 ⇒ (x + 1)(x(K−1)n + 1)w =
0 mod xL + 1 ⇒ (xKn + 1)w + (x(K−1)n + x)w = 0 mod xL + 1 ⇒ (xKn + 1)w + x(x(K−2)n +
1)w = 0 mod xL + 1 ⇒ (xKn + 1)w = 0 mod xL + 1 by the induction argument, showing
the result.
For the next stage, note that (xkn + 1)w mod xL + 1 = (xkn mod L + 1)w mod xL + 1. Sup-
pose k1 n = k2 n mod L. Then (k1 −k2 )n = 0 mod L and so L divides k1 −k2 because n and L
are coprime. Hence, the sequence of L elements 0, n, 2n, 3n, . . . , (L − 1)n visits all values in
{1, 2, 3, . . . , L−1}. Hence, (xi +1)w = 0 mod xL + 1 for all values of i ∈ {1, 2, 3, . . . , L−1}.
Finally, inspecting (xi + 1)w = 0 mod xL + 1 reveals that any two bits of w (the coeffi-
cients of any two of the L terms) must sum to zero in GF(2). That is, they must be equal.
This proves that 0 and xL−1 + . . . + x2 + x + 1 are the only possible solutions.
P ROPOSITION 3. Suppose a randomly chosen member of L-bit C YCLIC hash family is
applied to the n-gram an . Then
—If L and n are coprime and n is odd, then all L bits are uniformly distributed.
—If L and n are coprime, any L − 1 bits of the L-bit hash are uniformly distributed.
—If gcd(L, n) > 2, we do not have a uniform distribution of any L − 1 bits.
—When gcd(L, n) = 2, then we have a uniform distribution of any L − 1 bits only when n/2
is odd.
P ROOF. We have that h(a1 a2 . . . an ) = xn−1 h1 (a1 ) + . . . + xh1 (an−1 ) + h(an ) and so
h(a, a, . . . , a) = (xn−1 + . . . + x + 1)h1 (a). Let q(x) = xn−1 + . . . + x + 1.
We begin by showing that uniformity would follow, if q(x)w = 0 mod xL + 1 had
a unique solution. There are 2L possible values of h1 (a) and 2L possible values of
h(a, a, . . . , a) = q(x)h1 (a). If q(x)w = z mod xL + 1 has no solution w for some z, then
there must be y such that q(x)w = y mod xL + 1 has k > 1 solutions w1 , w2 , . . . , wk . Then
q(x)w = 0 mod xL + 1 has at least k solutions: q(x)0 = 0, q(x)(w1 − w2 ) = 0 mod xL + 1,
q(x)(w1 − w3 ) = 0 mod xL + 1, . . . , q(x)(w1 − wk ) = 0 mod xL + 1. Therefore, if q(x)w =
0 mod xL + 1 has a unique solution (w = 0) then q(x)w = y mod xL + 1 has a unique solu-
tion for any y and we have uniformity.
Our initial cases have n and L being coprime. Since q(x) = xn−1 + . . . + x + 1, then
(x + 1)q(x) = xn + 1 mod xL + 1 and so q(x)w = 0 mod xL + 1 ⇒ (x + 1)q(x)w = 0 mod
xL + 1 ⇒ (xn + 1)w = 0 mod xL + 1. Therefore, by Lemma 4:
—when n is odd, then w = 0 is the only solution to q(x)w = 0 mod xL + 1, and thus, we
have uniformity of all L bits, including any L − 1 bits;
—when n is even, then w = 0 and w = xL−1 + . . . + x + 1 are the only solutions to q(x)w =
0 mod xL + 1. We wish to show that the equation q(x)w = y mod xL + 1 modulo the
rth bit has exactly two solutions in GF(2)[x]/(xL + 1) for any y. This would then imply
the probability that any L − 1 bits of h(a, a, . . . , a) be some value is 2/2L = 1/2L−1 , thus
proving uniformity of any L − 1 bits.
Consider q(x)w = 0 mod xL + 1 modulo the rth bit: if q(x)w 6= 0 mod xL + 1, then
q(x)w = xr mod xL + 1. In such a case, q(x) has an inverse (wx−r ), and q(x)w =
0 mod xL + 1 has only one solution, a contradiction. Therefore, q(x)w = 0 mod xL + 1
modulo the rth bit implies q(x)w = 0 mod xL + 1 which implies that w = 0 or w =
xL−1 + . . . + x + 1. Therefore, if q(x)w1 = q(x)w2 mod xL + 1 modulo the rth bit, then
w1 − w2 differ by either 0 or xL−1 + . . . + x + 1. Thus, for any y there are at most two
solutions for q(x)w = y mod xL + 1 modulo the rth th bit. Further, if w is a solution to
q(x)w = y mod xL + 1 modulo the rth bit, then so is w + (xL−1 + . . . + x + 1): solutions
come in pairs. There are 2L−1 solutions pairs and at most one solution pair for each of
the 2L−1 values y ∈ GF(2)[x]/(xL + 1) modulo the rth bit. A pigeonhole argument shows
exactly two solutions per y.
Next, our second cases have L and n not being coprime, and we let k = gcd(L, n). Since k
divides n, we have xn−1 + . . . + x + 1 = (xk−1 + . . . + x + 1)(xn−k + . . . + x2k + xk + 1), which
can be verified by inspection. Since k divides L, we similarly have xL−1 + . . . + x + 1 =
(xk−1 + . . . + x + 1)(xL−k + . . . + x2k + xk + 1). Since (xk−1 + . . . + x + 1)(x + 1)(xL−k +
. . . + x2k + xk + 1) = (x + 1)(xL−1 + . . . + x + 1) = 0 mod xL + 1, w = (x + 1)(xL−k + . . . +
x2k + xk + 1) is a solution of q(x)w = 0 mod xL + 1. For k > 2, w = (x + 1)(xL−k + . . . +
x2k + xk + 1) is distinct from 0 and xL−1 + . . . + x + 1. Thus, as long as k is larger than two,
we have a solution w distinct from 0 and xL−1 + . . . + x + 1, preventing any L − 1 bits from
being uniform. Indeed, the probability that any L − 1 bits of (h(a, a, . . . , a) will be zero is
at least 3/2L > 2L−1 .
For k = 2, let w = ∑i=0,...,L−1 bi xi be a solution of q(x)w = 0 mod xL + 1. We have
xn−1 + . . . + x + 1 = (x + 1)(xn−2 + . . . + x4 + x2 + 1). By Lemma 4, we have (xn−1 + . . . +
x + 1)w = 0 mod xL + 1 if and only if (xn−2 + . . . + x4 + x2 + 1)w = 0 or (xn−2 + . . . + x4 +
x2 + 1)w = xL−1 + . . . + x2 + x + 1. Using these two solutions, we will show that there are
exactly two solutions when n/2 is odd and more when n/2 is even.
We are going to split the problem into odd and even terms. Define wo =
b x (i−1)/2 and w = b x i/2 . Then (xn−2 + . . . + x4 + x2 + 1)w =
∑i=1,3,...,L−1 i e ∑i=0,2,...,L−2 i
0 mod xL + 1 implies (xn/2−1 + . . . + x2 + x + 1)wo = 0 mod xL/2 + 1 and (xn/2−1 + . . . +
x2 + x + 1)we = 0 mod xL/2 + 1 whereas (xn−2 + . . . + x4 + x2 + 1)w = xL−1 + . . . + x +
1 mod xL + 1 implies (xn/2−1 + . . . + x2 + x + 1)wo = xL/2−1 + . . . + x + 1 mod xL/2 + 1 and
(xn/2−1 + . . . + x2 + x + 1)we = xL/2−1 + . . . + x + 1 mod xL/2 + 1. Notice that n/2 and L/2
are necessarily coprime so we can use the results we have derived above.
—If n/2 is odd, then (xn−2 + . . . + x4 + x2 + 1)w = 0 mod xL + 1 implies wo = 0 and we =
0, and hence w = 0. Using a symmetric argument, (xn/2−1 + . . . + x2 + x + 1)wo =
xL/2−1 + . . . + x + 1 mod xL/2 + 1 and (xn/2−1 + . . . + x2 + x + 1)we = xL/2−1 + . . . + x +
1 mod xL/2 + 1, imply wo = we = xL/2−1 + . . . + x2 + x + 1. Therefore, when n/2 is odd,
only w = 0 and w = xL−1 + . . . + x + 1 are possible solutions.
—When n/2 is even, then (xn−2 + . . . + x4 + x2 + 1)w = 0 mod xL + 1 has two solutions
(wo = 0 and wo = xL/2−1 + . . . + x2 + x + 1) and so does (xn/2−1 + . . . + x2 + x + 1)we =
0 mod xL/2 + 1 (we = 0 and we = xL/2−1 + . . . + x2 + x + 1). Hence, q(x)w = 0 mod
xL + 1 has at least 4 solutions w = 0, w = xL−2 +. . .+x4 +x2 +1, w = xL−1 +. . .+x3 +x,
w = xL−1 + . . . + x + 1.
This concludes the proof.
5.3 Integer-Division hashing (ID37) is not uniform
A variation of the Karp-Rabin hash method is “Hashing by Power-of-2 Integer Divi-
sion” [Cohen 1997], where h(x1 , . . . , xn ) = ∑i xi Bi−1 mod 2L . Parameter B needs to be
chosen carefully, so that the sequence Bk mod 2L for k = 1, 2, . . . does not repeat quickly.
In particular, the hashcode method of the Java String class uses this approach, with L = 32
and B = 31 [Sun Microsystems 2004]. Note that B is much smaller than the number
of Unicode characters (about 99,000> 216 ) [The Unicode Consortium 2006]. A widely
used textbook [Weiss 1999, p. 157] recommends a similar Integer-Division hash func-
tion for strings with B = 37. Since such Integer-Division hash functions are recursive,
quickly computed, and widely used, it is interesting to seek a randomized version of
them. Assume that h1 is random hash function over symbols uniform in [0, 2L ), then de-
fine h(x1 , . . . , xn ) = h1 (x1 ) + Bh1 (x2 ) + B2 h1 (x3 ) + . . . + Bn−1 h1 (xn ) mod 2L for some fixed
integer B. We choose B = 37 (calling the resulting randomized hash “ID37;” see Algo-
rithm 4).

Algorithm 4 The recursive ID37 family.

Require: an L-bit hash function h1 over Σ from an independent hash family
1: B ← 37
2: c ← empty FIFO structure
3: x ← 0 (L-bit integer)
4: z ← 0 (L-bit integer)
5: for each character c do
6: append c to s
7: x ← x − z mod 2L
8: x ← x/B + Bn−1 h1 (c) mod 2L
9: if length(s)= n then
10: yield x
11: remove oldest character y from s
12: z ← h1 (y)
13: end if
14: end for

Observe that ID37 is recursive over h1 . Moreover, by letting h1 map symbols over a wide
range, we intuitively can reduce the undesirable dependence between n-grams sharing a
common suffix. However, we fail to achieve uniformity.
The independence problem with ID37 is shared by all such randomized Integer-Division
hash functions that map n-grams to [0, 2L ). However, they are more severe for certain
combinations of B and n.
P ROPOSITION 4. Randomized Integer-Division (2L ) hashing with B odd is not uniform
for n-grams, if n is even. Otherwise, it is uniform, but not pairwise independent.
P ROOF. We see that P(h(a2k ) = 0) > 2−L since h(a2k ) = h1 (a)(B0 (1+B)+B2 (1+B)+
. . . + B2k−2 (1 + B)) mod 2L and since (1 + B) is even, we have P(h(a2k ) = 0) ≥ P(h1 (x1 ) =
2L−1 ∨ h1 (x1 ) = 0) = 1/2L−1 .
For the rest of the result, we begin with n = 2 and B even. If x1 6= x2 , then P(h(x1 , x2 ) =
y) = P(Bh1 (x1 ) + h1 (x2 ) = y mod 2L ) = ∑z P(h1 (x2 ) = y − Bz mod 2L )P(h1 (x1 ) = z) =
∑z P(h1 (x2 ) = y − Bz mod 2L )/2L = 1/2L , whereas P(h(x1 , x1 ) = y) = P((B + 1)h1 (x1 ) =
y mod 2L ) = 1/2L since (B + 1)x = y mod 2L has a unique solution x when B is even.
Therefore h is uniform. This argument can be extended for any value of n and for n odd, B
even.
To show it is not pairwise independent, first suppose that B is odd. For any string β
of length n − 2, consider n-grams w1 = βaa and w2 = βbb for distinct a, b ∈ Σ. Then
P(h(w1 ) = h(w2 )) = P(B2 h(β) + Bh1 (a) + h1 (a) = B2 h(β) + Bh1 (a) + h1 (a) mod 2L ) =
P((1 + B)(h1 (a) − h1 (b)) mod 2L = 0) ≥ P(h1 (a) − h1 (b) = 0) + P(h1 (a) − h1 (b) =
2L−1 ) = 2/4L . Second, if B is even, a similar argument shows P(h(w3 ) = h(w4 )) ≥ 2/4L ,
where w3 = βaa and w4 = βba. P(h(a, a) = h(b, a)) = P(Bh1 (a) + h1 (a) = Bh1 (b) +
h1 (a) mod 2L ) = P(B(h1 (a) − h1 (b)) mod 2L = 0) ≥ P(h1 (a) − h1 (b) = 0) + P(h1 (a) −
h1 (b) = 2L−1 ) = 2/4L . This argument can be extended for any value of B and n.

These results also hold for any Integer-Division hash where the modulo is by an even
number, not necessarily a power of 2. Frequently, such hashes compute their result modulo
a prime. However, even if this gave uniformity, the GT algorithm implicitly applies a
modulo 2L because it ignores higher-order bits. It is easy to observe that if h(x) is uniform
over [0, p), with p prime, then h0 (x) = h(x) mod 2L cannot be uniform.
Whether the lack of uniformity and pairwise independence is just a theoretical defect
can be addressed experimentally.

6. COUNT ESTIMATION BY PROBING

Count estimates accuracy, using the algorithms of Gibbons and Tirthapura or of Bar-Yossef
et al., depend heavily on the hash function used and the buffer memory allocated to the
algorithms [Gibbons and Tirthapura 2001; Bar-Yossef et al. 2002]. This section shows
that better accuracy bounds from a single run of the algorithm follow if the hash function
is drawn from a family of p-wise independent hash functions (p > 2), than if it is drawn
from a family of merely pairwise independent functions. Moreover, even for pairwise
independent hashing, we improve significantly the previously reported theoretical bounds.
For instance, in Corollary 2 the bound on the reliability is improved from 83% to more than
99%. In turn, this implies that less buffer space can achieve a desired quality of estimates.
Another method for improving the estimates of these techniques is to run them multiple
times (with a different hash function chosen from its family each time). Then, take the
median of the various estimates. For estimating some quantity µ within a relative error
(“precision”) of ε, it is enough to have a sequence of random variables Xi for i = 1, . . . , q
such that P(|Xi − µ| > εµ) < 1/3 where µ = X̄i . The median of all Xi will lie outside
(µ − εµ, µ + εµ) only if more than half the Xi do. This, in turn, can be made very unlikely
simply by considering many different random variables (q large). Let Yi be the random
variable taking the value 1 when |Xi − µ| > εµ and zero otherwise, and furthermore let
q
Y = ∑i=1 Yi . We have that E(Y ) ≤ q/3 and so, 3E(Y )/2 ≤ q/2. Then a Chernoff bound
says that [Canny 2002]

P(Y > q/2) ≤ P(Y > Ȳ (1 + 1/2))

!Ȳ
e1/2
≤
(1 + 1/2)1+1/2
!q/3
e1/2
≤
(1 + 1/2)1+1/2
3e+06
p=2
p=4
2.5e+06

2e+06
M

1.5e+06

1e+06

500000

0
0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11
ε 19/20

Fig. 1. For reliability 1 − δ = 0.95, we plot the memory usage M versus the accuracy ε for pairwise (p = 2) and
4-wise (p = 4) independent hash functions as per the bound of Proposition 5. Added independence substantially
improves memory requirements for a fixed estimation accuracy according to our theoretical bounds.

q
≤ e− 10×3 .
Choosing q = 30 ln 1/δ, we have P(Y > q/2) ≤ δ proving that we can make the me-
dian of the Xi ’s within εµ of µ, 1 − δ of the time for δ arbitrarily small. On this basis,
Bar-Yossef et al. [Bar-Yossef et al. 2002] report that they can estimate a count with rel-
ative precision ε and reliability 1 − δ, using Õ( ε2 + log m log 1δ ) bits of memory and
5 1

Õ( log m + 1ε log 1δ ) amortized time. Unfortunately, in practice, repeated probing im-

plies rehashing all n-grams 30 ln 1/δ times, a critical bottleneck. Moreover, in a streaming
context, the various runs are made in parallel and therefore 30 ln 1/δ different buffers are
needed. Whether this is problematic depends on the application and the size of each buffer.
We are able to generalize previous results [Gibbons and Tirthapura 2001; Bar-Yossef
et al. 2002] from pairwise independence to p-wise independence by using the following
generalization of Chebyshev’s inequality due to Schmidt et al. [Schmidt et al. 1993, Theo-
rem 2.4, Equation III].
T HEOREM 6.1. Let X1 , . . . , Xm be a sequence of p-wise independent random variables
that satisfy |Xi − E(Xi )| ≤ 1. Let X = ∑i Xi , then for C = max(p, σ2 (X)), we have
p/2
pC
P(|X − X̄| ≥ T ) ≤ .
e2/3 T 2
In particular, when p = 2, we have
2C
P(|X − X̄| ≥ T ) ≤
.
e2/3 T 2
The next proposition shows that in order to reduce the memory usage drastically while
preserving the same theoretical bounds on the accuracy and reliability, we can increase the

5 i.e., the computed count is within εµ of the true answer µ, 1 − δ of the time
independence of the hash functions. In particular, our theory says that we can estimate the
count within 10%, 19 times out of 20 by storing respectively 10.5, 2.5, and 2 thousand
n-grams6 depending on whether we have pairwise-independent, 4-wise independent or 8-
wise independent hash values. Hence, there is no need to hash the n-grams more than
once if we can assume that hash values are ≈ 4-wise independent in practice (see Fig. 1).
Naturally, recursive hashing cannot be more than pairwise independent, but we always
have the option of using slower non-recursive hashing. The following proposition is stated
for n-grams, but applies to arbitrary items.

P ROPOSITION 5. Hashing each n-gram only once, we can estimate the number of dis-
tinct n-grams within relative precision ε, with a p-wise independent hash for p ≥ 2 by
storing M distinct n-grams (M ≥ 8p) and with reliability 1 − δ where δ is given by
!
p p/2 p/2 8 p/2
δ = p/3 p/2 2 + p p/2 .
e M ε (2 − 1)

More generally, we have

!
p p/2 α p/2 4 p/2
δ ≤ p/3 p/2 p
+ p/2 p p/2 .
e M (1 − α) α ε (2 − 1)

for 4p/M ≤ α < 1 and any p, M.

P ROOF. We generalize a proof by Bar-Yossef et al. [Bar-Yossef et al. 2002] Let Xt be the
number of distinct elements having only zeros in the first t bits of their hash value. Then X0
is the number of distinct elements (X0 = m). For j = 1, . . . , m, let Xt, j be the binary random
variable with value 1 if the jth distinct element has only zeros in the first t bits of its hash
value, and zero otherwise. We have that Xt = ∑mj=1 Xt, j and so E(Xt ) = ∑mj=1 E(Xt, j ). Since
the hash function is uniform, then P(Xt, j = 1) = 1/2t and so E(Xt, j ) = 1/2t , and hence,
E(Xt ) = m/2t ⇒ 2t E(Xt ) = m. Therefore Xt can be used to determine m.
Using pairwise independence, we can can show that σ2 (Xt ) ≤ 2mt . We have that σ2 (Xt ) ≤
m
2t because
m 2 m
E(|Xt − | ) = E((Xt − t )2 )
2t 2
m m
1 2 2
= E(( ∑ (Xt, j − )) ) = σ ( ∑ Xt, j )
j=1 2t j=1
m
= ∑ σ2 (Xt, j ) (by 2-wise independence)
j=1
m
1 1 1 1
= ∑ 2t (1 − 2t )2 + (1 − 2t )(0 − 2t )2
j=1
m
1 m
≤ ∑ 2t = .
j=1 2t

6 in some dictionary, for instance in a hash table

By Theorem 6.1, we have
p p/2 2t p/2
P(|Xt − m/2t | ≥ εm/2t ) ≤ , as long as σ2 (Xt ) ≥ p.
m p/2 ε p e p/3
Let M be the available memory and suppose that the hash function has L bits such that
we know, a priori, that P(XL > M) ≈ 0. This is necessary since we do not wish to increase
L dynamically. It is reasonable since for L and m fixed, P(XL > M) goes down as 1/M p :
P(XL > M) = P(XL − m/2L > M − m/2L )
≤ P(|XL − m/2L | > M − m/2L )
p/2
p max(m/2L , p)

≤
e2/3 (M − m/2L )2
for M − m/2L > 0 where we used Theorem 6.1. For example, if p = 2, M = 256, L = 19,
P(XL > M) ≤ 4.4 × 10−8 for m = 100, 000, 000.
0
The algorithm returns the value 2t Xt 0 where t 0 is such that Xt 0 ≤ M and Xt 0 −1 > M. Note
that t 0 is itself a random quantity that depends deterministically on the hash function and
the input (the same factors that determine Xt .)
We can bound the error of this estimate as follows:

0
P(|2t Xt 0 − m| ≥ εm)
m m
= ∑ P(|Xt − t | ≥ ε t )P(t 0 = t)
t=0,...,L 2 2
m m
= ∑ P(|Xt − t | ≥ ε t )P(Xt−1 > M, Xt ≤ M).
t=0,...,L 2 2

Splitting the summation into two parts, we get

0
P(|2t Xt 0 − m| ≥ εm)
t¯−1
m m
≤ ∑ P(|Xt − 2t | ≥ ε 2t ) + ∑ P(Xt−1 > M, Xt ≤ M)
t=0 t=t¯,...,L
t¯−1
m m
= P(Xt¯−1 > M) + ∑ P(|Xt − |≥ε t)
t=0 2t 2
t¯−1
p p/2 2t p/2
≤ P(Xt¯−1 − m/2t¯−1 > M − m/2t¯−1 ) + ∑ p/2 ε p e p/3
t=0 m
t¯−1
p p/2 2t p/2
≤ P(|Xt¯−1 − m/2t¯−1 | > M − m/2t¯−1 ) + ∑ p/2 ε p e p/3
t=0 m
p/2 t¯−1
p p/2 2t p/2

pm
≤ +∑
2t¯−1 e2/3 (M − m/2t¯−1 )2 t=0 m
p/2 ε p e p/3

p p/2 m p/2 p p/2 (2t¯p/2 − 1)

≤ +
2(t¯−1)p/2 e p/3 (M − m/2t¯−1 ) p m p/2 ε p e p/3 (2 p/2 − 1)
where we assumed that p ≤ m/2t¯−1 .
m
Choose t¯ ∈ {0, . . . , L} such that αM/4 < 2t¯
≤ αM/2 for some α < 1 satisfying αM/4 ≥
p then
p p/2 m p/2 p p/2 M p/2 α p/2
≤
2(t¯−1)p/2 e p/3 (M − m/2t¯−1 ) p e p/3 (1 − α) p M p
whereas
p p/2 (2t¯p/2 − 1) p p/2 4 p/2
≤ .
m p/2 ε p e p/3 (2 p/2 − 1) ε p α p/2 M p/2 e p/3 (2 p/2 − 1)
Hence, we have
p p p
t0 α2 p2 42
P(|2 Xt 0 − m| ≥ εm) ≤ p p ( p
+ p p ).
e3 M2 (1 − α) p
α 2 ε (2 2 − 1)
Setting α = 1/2, we have
0 p p/2 2 p/2 p p/2 8 p/2
P(|2t Xt 0 − m| ≥ εm) ≤ p/3 p/2
+ p p/2 p/3 p/2 .
e M ε M e (2 − 1)
For different values of p and ε, other values of α can give tighter bounds: the best value of
α can be estimated numerically. The proof is completed.
Again the following corollary applies not only to n-grams but also to arbitrary items. It
follows from setting M = 576/ε2 , p = 2, and α = 1 − ε.
C OROLLARY 2. With M = 576/ε2 and p = 2, we can estimate the count of distinct
n-grams with reliability (1 − δ) of 99% for any ε > 0.
P ROOF. From Proposition 5, consider
p p p
!
p2 α2 42
δ ≤ p p p
+ p p
e3 M2 (1 − α) p
α 2 ε (2 2 − 1)
for α = 1 − ε, M = 576/ε2 , and p = 2, then
2ε2

1−ε 4
δ ≤ 2 2
+
e 3 576 ε (1 − ε)ε2

2 4
≤ 2 1−ε+ .
e 3 576 (1 − ε)
Taking the limit as ε tends to 0, on the basis that the probability of a miss (P(|Xt − m/2t | ≥
2
εm/2t )) can only grow as ε diminishes, we have that δ is bounded by 10/(e 3 576) ≈
0.008.
This significantly improves the value 5/6 for 1 − δ given in Bar-Yossef et al. [Bar-Yossef
et al. 2002], for the same value of M: the bound on the rate of failure goes from 1/6 to less
than 1%.
Because p-wise independence implies p − 1-wise independence, memory usage, accu-
racy and reliability can only improve as p increases. For p large (p 4), the result of
Proposition 5 is no longer useful because of the p p factor. Other inequalities on P(|X − µ|)
for p-wise independent X’s should be used.
Table II. Summary of the 4 n-gram hash families implemented. Recall that we chose p(x) = x19 + x18 + x17 +
x16 + x12 + x7 + x6 + x5 + x3 + x2 + 1 (G ENERAL).
hash family definition (h(a1 , a2 , . . . , an )) recursive independence
n-wise independent h1 (a1 ) ⊕ h1 (a2 ) ⊕ · · · ⊕ h1 (an ) no n-wise ind.
C YCLIC xn−1 h1 (a1 ) + . . . + h1 (an ) in GF(2)[x]/(x19 + 1) yes quasi-pairwise ind.
G ENERAL xn−1 h1 (a1 ) + . . . + +h1 (an ) in GF(2)[x]/p(x) yes pairwise ind.
ID37 h1 (a1 ) + . . . + Bn−1 h1 (an ) mod 219 for B = 37 yes not even uniform

7. EXPERIMENTAL RESULTS
Experiments are used to assess the accuracy of estimates obtained by several hash functions
(n-wise independent, C YCLIC, G ENERAL, and ID37, see Table II) on some input streams
using the GT count-estimation algorithm. All hash values are in [0, 219 ).
Since we are mostly interested in comparing n-gram hash families, we chose not to com-
pare variants of GT [Bar-Yossef et al. 2002] or other possible count-estimation algorithms
such as probabilistic counting [Flajolet and Martin 1985]. For an experimental comparison
of several such count-estimation algorithms in the context of multidimensional hashing,
see [Aouiche and Lemire 2007].
The experiments demonstrate that the techniques can be efficiently implemented. Details
are available in a technical report [Lemire and Kaser 2006] together with experimental
results on entropy and frequent-item-count estimations. Our code is written in C++ and is
available upon request.

7.1 Test Inputs

One complication is that the theoretical predictions are based on worst-case inputs, which
are unknown to us. In fact, the worst-case input may not even correspond to a stream of
n-grams (where n − 1 characters from one n-gram reappear in the following n-gram). As
well, we want to know how these techniques perform on “typical” inputs. As a result,
our experiments used the n-grams from a collection of 11 texts7 from Project Gutenberg.
We also used synthetic data sets generated according to various generalized Zipfian dis-
tributions. Since we are analyzing the performance of several randomized algorithms, we
ran each algorithm 100+ times on each text. As a result, we cannot run tests on inputs as
large as would be appropriate for corpus linguistics studies: to complete the entire suite
of experiments in reasonable time, we must limit ourselves to texts (for instance, Shake-
speare’s First Folio) where one run takes at most a few minutes. Even so, the entire suite
of experiments takes several days to run using current equipment.

7.2 Accuracy of Estimates

We have theoretical bounds relating to the error ε observed with a given reliability (typ-
ically 19/20), when the hash function is taken from a p-wise independent family. (See
Table III.) But how close to this bound do we come when n-grams are drawn from a
“typical” input for a computational-linguistics study? And do hash functions from highly

7 The 11 texts are eduha10 (The Education of Henry Adams), utrkj10 (Unbeaten Tracks in Japan), utopi10
(Utopia), remus10 (Uncle Remus His Songs and His Sayings), btwoe10 (Barchester Towers), 00ws110 (Shake-
speare’s First Folio), hcath10 (History of the Catholic Church), rlchn10 (Religions of Ancient China), esymn10
(Essay on Man), hioaj10 (Impeachment of Andrew Johnson), and wflsh10 (The Way of All Flesh).
Table III. Maximum error rates ε 19 times out of 20 for various amounts of
memory (M) and for p-wise independent hash values according to Proposi-
tion 5 .
256 1,024 2,048 65,536 262,144 1,048,576
p = 2 86.4% 36.8% 24.7% 3.8% 1.8% 0.9%
p = 4 34.9% 16.1% 11.1% 1.8% 0.9% 0.5%
p = 8 30.0% 14.1% 9.7% 1.6% 0.8% 0.4%

0.09
ID37
0.08 general polynomial
cyclic polynomial
0.07 n-wise independent
0.06
0.05
ε

0.04
0.03
0.02
0.01
0
0 10 20 30 40 50 60 70 80 90 100
trial rank

Fig. 2. Count estimate errors over Shakespeare’s First Folio (00ws110.txt), 100 runs estimating 10-grams with
M = 2048.

independent families actually enable more accurate8 estimates?

Figure 2 shows the relative error ε observed from four hash functions (100 estimations
with each). Estimates have been ranked by decreasing ε, and we see ID37 had more of the
poorer runs than the others.
Figure 3 shows a test input (remus) that was the worst of the 11 for several hash func-
tions, when M = 256. ID37 seems to be doing reasonably well, but we see 10-wise inde-
pendent hashing lagging.
To study the effect of varying M, we use the 5th -largest error of 100 runs. This 95th -
percentile error can be related to the theoretical bound for ε with 19/20 reliability. Figure 4
plots the largest 95th -percentile error observed over 11 test inputs. It is apparent that there
is no significant accuracy difference between the hash functions. The n-wise independent
hash alone has a guarantee to be beneath the theoretical bound for the 10-wise independent
hash according to Proposition 5: however, over the eleven Gutenberg texts, the others are
just as accurate, according to our experiments.

7.3 Using a Wider Range of Values for M

An important motivation for using p-wise independent hashing is to obtain a reliable es-
timate while only hashing once, using a small M. Nevertheless, we have thus far not

8 The“data-agnostic” estimate from Sect. 3 is hopelessly inaccurate: it predicts 4.4 million 5-grams for Shake-
speare’s First Folio, but the actual number is 13 times smaller.
0.3
ID37
general polynomial
0.25 cyclic polynomial
n-wise independent
0.2

0.15
ε

0.1

0.05

0
0 10 20 30 40 50 60 70 80 90 100
trial rank

Fig. 3. remus errors, 100 runs estimating 10-grams with M = 256.

10
ID37
general polynomial
cyclic polynomial
n-wise independent
pairwise bound
1 10-wise bound
ε 19/20

0.1

0.01
1000
M

Fig. 4. For each M and hash function, worst-case 95th -percentile error observed on the 11 test inputs.

observed notable differences between the different hash functions. Therefore, although we
expect typical values of M to be a few hundred to a few thousand, we can broaden the range
of M examined. Although the theoretical guarantees for tiny M are poor, perhaps typical
results will be usable. And even a single buffer with M = 220 is inconsequential when a
desktop computer has several gibibytes of RAM, and the construction of a hash table or
B-tree with such a value of M is still quite affordable. Moreover, with a wider range of M,
we start to see differences between some hash functions.
We choose M = 16, 162 , 163 and 164 and analyze the 5-grams in the text 00ws1 (Shake-
speare’s First Folio). There are approximately 300,000 5-grams, and we selected a larger
file because when M = 164 , it seems unhelpful to estimate the number of 5-grams unless
100
ID37
general polynomial
cyclic polynomial
10 n-wise independent
pairwise bound
5-wise bound
ε 19/20 1

0.1

0.01

0.001
10 100 1000 10000 100000
M

Fig. 5. 95th -percentile error values (for 5-gram estimates) on 00ws1 for various hash families, over a wide range
of M. Our analysis does not permit prediction of error bounds when M = 16.

the file contains substantially more 5-grams than M.

Figure 5 shows the 95th -percentile errors for Shakespeare’s First Folio, when 5-grams
are estimated. There are some smaller differences for M = 65536 (surprisingly, the 5-wise
hash function, with a better theoretical guarantee, seems to be slightly worse than C YCLIC
and G ENERAL). However, it is clear that the theoretical deficiencies in ID37 finally have
an effect: it is small when M = 4096 but large at M = 65536. (We observed similar
problems on Zipfian data.) To be fair, this non-uniform hash is still performing better than
the pairwise bound, but the trend appears clear. Does it, however, continue for very large
M?

7.4 Very Large M

Clearly, it is only sensible to measure performance when M m. Therefore, we estimate
the number of 10-grams obtained when all plain-text files in the Gutenberg CD are concate-
nated. When M = 220 , 10-wise independent hashing had an observed 95th -percentile error
of 0.182% and G ENERAL had 0.218%. The ID37 error was somewhat worse, at 0.286%.
(The theoretical pairwise error bound is 0.908% and the 10-wise bound is 0.425%.) Con-
sidering the M = 65536 case from Fig. 5, we see no experimental reason to prefer n-wise
hashing to G ENERAL, but ID37 looks less promising. However, n = 10, B = 37 is a non-
uniform combination for Integer Division.

7.5 Caveats with Random-Number Generators

To observe the effect of fully independent hashing, we implemented the usual (slow and
memory-intensive) scheme where a random value is assigned and stored whenever a key
is first seen. Probabilistic estimation of the number of n-grams is likely to expose de-
ficiencies in the random-number generator and therefore different techniques were tried.
The pseudorandom-number generator in the GNU/Linux C library was tried, as were the
Mersenne Twister (MT) [Matsumoto and Nishimura 1998] and also the Marsaglia-Zaman-
10
ID37
general polynomial
cyclic polynomial
n-wise independent
1 pairwise bound
ε 19/20

5-wise bound

0.1

0.01
10 100 1000 10000 100000
M

Fig. 6. 95th -percentile error for 5-gram estimates on Zipfian data (s=1).

James (MZJ) generator [Marsaglia and Zaman 1987; James 1990; Bourke 1998]. We
also tried using a collection of bytes generated from a random physical process (radio
static) [Haahr 1998].
For M = 4096, the 95th -percentile error for text 00ws1 was 4.7% for Linux rand(),
4.3% for MT and 4.1% for MZJ. These three pseudorandom number generators were no
match for truly random numbers, where the 95% percentile error was only 2.9%. Com-
paring this final number to Fig. 5, we see fully independent hashing is only a modest
improvement on Cohen’s hash functions (which fare better than 5%) despite its stronger
theoretical guarantee.
The other hash functions also rely on random-number generation (for h1 in Cohen’s
hashes and ID37; for h1 . . . hn in the n-wise independent hash). It would be problematic if
their performance were heavily affected by the precise random-number generation process.
However, when we examined the 95th -percentile errors we did not observe any appreciable
differences from varying the the pseudorandom-number generation process or using truly
random numbers.

7.6 95th -Percentile Errors using Zipfian Data

The various hash functions were also tested on synthetic Zipfian data, where the probability
of the kth symbol is proportional to k−s . (We chose s ∈ [0.8, 2.0].) Each data set had
N ≈ 105 , but for larger values of s there were significantly fewer n-grams. Therefore,
measurements for M=65536 would not be meaningful and are omitted.
Some results, for n = 5, are shown in Figs. 6–8. The ID37 method is noteworthy. Its
performance for larger M is badly affected by increasing s. The other hash functions are
nearly indistinguishable in almost all other cases. Results are similar for 5-grams and 10-
grams (which are not shown), except that s can grow slightly bigger for 10-grams before
ID37 fails badly.
10
ID37
general polynomial
cyclic polynomial
n-wise independent
1 pairwise bound
ε 19/20

5-wise bound

0.1

0.01
10 100 1000 10000 100000
M

Fig. 7. 95th -percentile error for 5-gram estimates on Zipfian data (s=1.6).

10
ID37
general polynomial
cyclic polynomial
n-wise independent
1 pairwise bound
ε 19/20

5-wise bound

0.1

0.01
10 100 1000 10000 100000
M

Fig. 8. 95th -percentile error for 5-gram estimates on Zipfian data (s=2).

7.7 Choosing between Cohen’s Hash Functions

Recall that we proved that Cohen’s general polynomial hash function was pairwise uni-
form and no hash function that is recursive over hashed values can be 3-wise independent,
or even 3-wise trailing-zero independent. In a sense, G ENERAL is optimal. Yet Cohen’s
cyclic polynomial hash function can be slightly more efficiently implemented, and under
some conditions (see Section 5.2), it can also be pairwise independent. The practical im-
plication of these results is not clear; indeed the experiments reported so far do not reveal a
significant distinction between C YCLIC and G ENERAL, at least at the 95th percentile error.
There are several possible explanations for these experimental results. First, we know
that the GT algorithm relies on the trailing-zeros properties of the hashes. Perhaps C YCLIC
is pairwise trailing-zeros independent in some cases when it is not pairwise independent.
Alternatively, though we know that neither hash is 3-wise independent, possibly either
of them might be “nearly 3-wise independent” in some sense (e.g., except when three n-
grams had some unusual properties). Finally, our theory considers a worst-case sequence
of n-grams whereas experiments are more easily able to assess average case inputs: the
theoretical bad cases may be extremely rare.
Recall that theory only guarantees pairwise independence for C YCLIC after we discard
n − 1 bits out of the L bits. While we can simply generate more bits than we need, that may
be slow. Perhaps, though, the theoretical penalties from using all bits are not observed in
practice.
Computationally, we can discover the worst-case input of length N and the exact distri-
bution of errors. We enumerated all possible hash functions h1 and all possible inputs of
length N. There are L|Σ| values for h1 and |Σ|N inputs, where |Σ| is the size of the alpha-
bet. Clearly, these exhaustive tests are only feasible for small values of |Σ|, L and N. We
tried to count trigrams over a binary input of length 10 using 7 bits for the hashed values
(|Σ| = 2, L = 7, M = 2, N = 10, n = 3). Each of the 214 choices for h1 leads to an error,
and as usual we are interested in the error at the 95th percentile. Note that there cannot
be more than 8 distinct trigrams (m ≤ N − n + 1) and thus the ratio of the distinct count
over the memory budget (m/M) is small enough that we should rarely use more than the
7 − 2 + 1 = 6 bits guaranteed pairwise independent by Theorem 5.1. Based on the com-
putation, in this environment, on every input, C YCLIC and G ENERAL had exactly the
same 95th -percentile error. This lead to the conjecture there was no practical difference
between Cohen’s two approaches, even when we used more than L − n + 1 bits, and some
initial experiments seemed to back this up. However, the conjecture ultimately proved
false.
Although experiments revealed many cases when C YCLIC behaved well even when
more than L − n + 1 bits were used, we also found cases when violating this limit made
C YCLIC behave poorly relative to G ENERAL9 . Figure 9 shows the errors when 13-grams
were estimated using C YCLIC and either L = 7 or L = 19. The data sets were five binary
sequences, each with an (independent) 70% chance of generating a 0 as the next symbol.
These resemble binary Zipfian sequences, and each sequence’s length was chosen to obtain
2000 distinct 13-grams. The similar behaviors observed on all five sequences suggests we
are probably observing an average-case behavior.
On these sequences, we observed that GT never used more than 4 bits from the G EN -
ERAL hash. Yet when L = 7 we observed five instances when C YCLIC ran out of bits,
indicating that its estimate was already more than 15 times too big. When L = 19, we
have more than enough pairwise-independent bits to estimate the number of distinct 13-
grams. (Recall Theorem 5.1.) Because this bad result is observed for L = 7, it rules out the
possibility that bad behavior can be prevented if L and n are coprime.
The same process was also used to compare the performance of G ENERAL between

9 Notethat Table III tells us only that we can expect an error of at most 86%, 95% of the time. This is not a very
strong guarantee! Here, we observe 95th -percentile errors of about 25% for C YCLIC. The point is not that we
have observed worse-than-pairwise-independent behavior here (we have not); it is that when forced to use too
many bits, C YCLIC can become clearly worse than G ENERAL.
0.3
cyclic, L=19
cyclic, L=7
0.25

0.2
error (ε )
0.15

0.1

0.05

0
0 10 20 30 40 50 60 70 80 90 100
error rank

Fig. 9. Count estimation errors using GT with C YCLIC hashing on 5 random binary sequences, each containing
m = 2000 distinct 13-grams; buffer size (M) was 256. We generated 100 estimates for each binary sequence, and
the relative errors of these estimates are displayed against their rank. Results are plotted for L = 7 and L = 19.
Note that there are 12 points (all for L = 7) off the scale, one at ε = 0.32, six at ε = 1.0, and five at ε = 15.5.

Table IV. Count estimation error (%) using GT with poly-

nomial hashes C YCLIC and G ENERAL, Zipfian data set.
L=19, n=5.
percentile Cyclic Polynomial General Polynomial
M=64 M=1024 M=64 M=1024
25 3.60 0.931 3.60 0.931
50 7.06 1.98 8.49 2.01
75 14.0 3.41 13.7 3.60
95 30.9 6.11 31.2 6.22
mean 10.6 2.45 10.7 2.51

L = 7 and L = 19. They differed little, both being nearly the same as obtained with C YCLIC
with L = 19. (Actually, C YCLIC with L = 19 was marginally better than G ENERAL!).
Finally, we discuss results on realistic non-binary Zipfian data. For each run, we made a
single random choice of h1 . Results, for more than 2000 runs, are shown in Table IV. The
table shows ε values (percents), with boldfacing indicating a case when one technique had
a lower error than the other. It shows a slightly better performance from C YCLIC. Means
also slightly favour C YCLIC. This is consistent with the experimental results reported by
Cohen [Cohen 1997]. We also ran more extensive tests using an English text (00ws1.txt),
where there seems to be no notable distinction. 10,000 test runs were made, and results are
shown in Table V.
Overall, we concede there might be a small accuracy improvement from using C YCLIC,
providing that L − n + 1 is somewhat larger than log2 (m/M) + 1, the number of bits that we
anticipate GT using. However, this effect is never large — and the accuracy penalty from
using too many bits can be large. Whether C YCLIC is viable, if used carefully, depends on
whether its relative speed advantage is lost after n − 1 bits are discarded.
Table V. Count estimation error (%) using GT with poly-
nomial hashes C YCLIC and G ENERAL, data set 00ws1,
n=5
percentile C YCLIC G ENERAL
M=64 M=1024 M=64 M=1024
25 5.79 1.30 5.79 1.30
50 10.7 2.58 10.7 2.69
75 18.2 4.55 19.0 4.55
95 30.6 7.69 30.6 7.69
mean 12.5 3.15 12.6 3.14

Table VI. Time (seconds) to pro-

cess all Gutenberg CD files, 10-
grams.
Hashing M = 210 M = 220
10-wise 794 938
ID37 268 407
Cyclic 345 486
General 352 489

7.8 Speed
Speeds were measured on a Dell Power Edge 6400 multiprocessor server (with four Pen-
tium III Xeon 700 MHz processors having 2 MiB cache each, sharing 2 GiB of 133 MHz
RAM). The OS kernel was Linux 2.4.20 and the GNU C++ compiler version 3.2.2 was
used with relevant compiler flags -O2 -march=i686 -fexceptions. The STL map class
was used to construct look-up tables.
Only one processor was used, and the data set consisted of all the plain text files on
the Project Gutenberg CD, concatenated into a single disk file containing over 400 MiB
and approximately 116 million 10-grams. For comparison, this file was too large to pro-
cess with the Sary suffix array [Takabayashi 2005] package (version 1.2.0), since the array
would have exceeded 2 GiB. However, the first 200 MB was successfully processed by
Sary, which took 1886 s to build the suffix10 array. The SUFARY [Yamashita 2005] (ver-
sion 2.3.8) package is said to be faster than Sary [Takabayashi 2005]. It processed the
200 MB file in 2640 s and then required 95 s to (exactly) compute the number of 5-grams
with more than 100,000 occurrences. Pipelined suffix-array implementations reportedly
can process inputs as large as 4 GB in hours [Dementiev et al. 2005].
From Table VI we see that n-gram estimation can be efficiently implemented. First,
comparing results for M = 220 to those for M = 210 , we see using a larger table costs
roughly 140 s in every case. This increase is small when considering that M was multiplied
by 210 and is consistent with the fact that the computational cost is dominated by the
hashing. Comparing different hashes, using a 10-wise independent hash was about twice as
slow as using a recursive hash. Hashing with ID37 was 15–25% faster than using Cohen’s
approaches.
Assuming that we are willing to allocate very large files to create suffix arrays and use
much internal memory, an exact count is still at least 10 times more expensive than an
approximation. Where the suffix-array approach would take about an hour to compute n-
gram counts over the entire Gutenberg CD, an estimate can be available in about 6 minutes

10 Various command-line options were attempted and the reported time is the fastest achieved.
while using very little memory and no permanent storage.

8. CONCLUSION
Considering speed, theoretical guarantees, and actual results, we recommend Cohen’s
G ENERAL. It is fast, has a theoretical performance guarantee, and behaves at least as well
as either ID37 or the n-wise independent approach. G ENERAL is pairwise independent
so that there are minimal theoretical bounds to its performance. Unlike Cohen’s C YCLIC,
one can safely use all of its bits and the minor speed advantage of C YCLIC does not seem
worthwhile. The n-wise independent hashing comes with a stronger theoretical guaran-
tee than either of Cohen’s hashes, and thus there can be no unpleasant surprises with its
accuracy on any data set. Yet there is a significant speed penalty for its use in our imple-
mentation. The speed gain of ID37 is worthwhile only for very small values of M. Not
only does it lack a theoretical accuracy guarantee, but for larger M it is observed to fall
far behind the other hashing approaches in practice. Except where accuracy is far less
important than speed, we cannot recommend ID37.
There are various avenues for follow-up work that we are pursuing. First, further im-
provements to the theoretical bound seem possible, especially for more than 4-wise inde-
pendent hashes. Generally, the current theory fails to completely explain our experimental
results: why do C YCLIC and G ENERAL sometimes fare better than n-wise independent
hashing at count estimation? Does increased independence improve the accuracy of the
GT count-estimation algorithm? Second, each item being counted can have an occurrence
count kept for it. This may enable entropy estimation as well as estimates of the number
of distinct frequent n-grams [Lemire and Kaser 2006].

REFERENCES
A LON , N., M ATIAS , Y., AND S ZEGEDY, M. 1996. The space complexity of approximating the frequency
moments. In STOC ’96. 20–29.
AOUICHE , K. AND L EMIRE , D. 2007. Unassuming view-size estimation techniques in olap. In ICEIS’07.
BACON , F. L. AND H OUDE , D. J. 1984. Data compression apparatus and method. US Patent 4612532. filed
1984; granted 1986. Assignee Telebyte (later Telcor Systems).
BAR -YOSSEF, Z., JAYRAM , T. S., K UMAR , R., S IVAKUMAR , D., AND T REVISAN , L. 2002. Counting distinct
elements in a data stream. In RANDOM’02. 1–10.
BATU , T., DASGUPTA , S., K UMAR , R., AND RUBINFELD , R. 2002. The complexity of approximating entropy.
In STOC’02. ACM Press, New York, NY, USA, 678–687.
B ERNARD , M. 1995. À juste titre: A lexicometric approach to the study of titles. Literary and Linguistic
Computing 10, 2, 135–141.
B OSE , R. P. J. C. AND S RINIVASAN , S. H. 2005. Data mining approaches to software fault diagnosis. In RIDE
’05. IEEE Computer Society, Washington, DC, USA, 45–52.
B OURKE , P. 1998. Uniform random number generator. online: http://astronomy.swin.edu.au/∼pbourke/
other/random/index.html. checked 2007-05-30.
C ANNY, J. 2002. CS174 lecture notes. http://www.cs.berkeley.edu/∼jfc/cs174/lecs/lec10/lec10.
pdf. checked 2007-05-30.
C AROPRESO , M. F., M ATWIN , S., AND S EBASTIANI , F. 2001. Text Databases & Document Management:
Theory & Practice. Idea Group Publishing, Hershey, PA, USA, Chapter A Learner-Independent Evaluation of
the Usefulness of Statistical Phrases for Automated Text Categorization, 78–102.
C ARTER , L. AND W EGMAN , M. 1979. Universal classes of hash functions. Journal of Computer and System
Sciences 18, 2, 143–154.
C IACCIA , P., G OLFARELLI , M., AND R IZZI , S. 2001. On estimating the cardinality of aggregate views. In
DMDW. 12.1–12.10.
C IACCIA , P., G OLFARELLI , M., AND R IZZI , S. 2003. Bounding the cardinality of aggregate views through
domain-derived constraints. Data & Knowledge Engineering 45, 2, 131–153.
C ODD , E. 1993. Providing OLAP (on-line analytical processing) to user-analysis: an IT mandate. Tech. rep.,
E.F. Codd and Associates.
C OHEN , J. D. 1997. Recursive hashing functions for n-grams. ACM Trans. Inf. Syst. 15, 3, 291–320.
C OHEN , P., H EERINGA , B., AND A DAMS , N. 2002. Unsupervised segmentation of categorical time series into
episodes. In ICDM’02. 99–106.
D EHNE , F., E AVIS , T., AND R AU -C HAPLIN , A. 2006. The cgmCUBE project: Optimizing parallel data cube
generation for ROLAP. Distributed and Parallel Databases 19, 1, 29–62.
D EMENTIEV, R., M EHNERT, J., K ARKKAINEN , J., AND S ANDERS , P. 2005. Better external memory suffix
array construction. In ALENEX05: Workshop on Algorithm Engineering & Experiments.
D ESHPANDE , M. AND K ARYPIS , G. 2004. Selective Markov models for predicting web page accesses. ACM
Transactions on Internet Technology (TOIT) 4, 2, 163–184.
D ORAISAMY, S. AND R ÜGER , S. 2003. Position indexing of adjacent and concurrent n-grams for polyphonic
music retrieval. In ISMIR 2003. 227–228.
D ROETTBOOM , M. 2003. Correcting broken characters in the recognition of historical printed documents. In
Digital Libraries 2003. 364–366.
D URAND , M. AND F LAJOLET, P. 2003. Loglog counting of large cardinalities. In ESA’03. Springer.
FALOUTSOS , C., M ATIAS , Y., AND S ILBERSCHATZ , A. 1996. Modeling skewed distribution using multifractals
and the 80-20 law. In VLDB’96. 307–317.
F LAJOLET, P. AND M ARTIN , G. 1985. Probabilistic counting algorithms for data base applications. Journal of
Computer and System Sciences 31, 2, 182–209.
G AO , J. AND Z HANG , M. 2001. Improving language model size reduction using better pruning criteria. In
ACL’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association
for Computational Linguistics, Morristown, NJ, USA, 176–182.
G IBBONS , P. B. AND T IRTHAPURA , S. 2001. Estimating simple functions on the union of data streams. In
SPAA’01. 281–291.
G IEGERICH , R., K URTZ , S., AND S TOYE , J. 1999. Efficient implementation of lazy suffix trees. In WAE’99.
30–42.
G ONNET, G. H. AND BAEZA -YATES , R. A. 1990. An analysis of the Karp-Rabin string matching algorithm.
Information Processing Letters 34, 5, 271–274.
G RAY, J., B OSWORTH , A., L AYMAN , A., AND P IRAHESH , H. 1996. Data cube: A relational aggregation
operator generalizing group-by, cross-tab, and sub-total. In ICDE ’96. 152–159.
G UHA , S., M C G REGOR , A., AND V ENKATASUBRAMANIAN , S. 2006. Streaming and sublinear approximation
of entropy and information distances. In SODA’06. ACM Press, New York, NY, USA, 733–742.
H AAHR , M. 1998. random.org — true random number service. online, http://www.random.org. checked
2007-05-30.
H AAS , P., NAUGHTON , J., S ESHADRI , S., AND S TOKES , L. 1995. Sampling-based estimation of the number
of distinct values of an attribute. In VLDB’95. 311–322.
H ON , W., S ADAKANE , K., AND S UNG , W. 2003. Breaking a time-and-space barrier in constructing full-text
indices. In FOCS’03. 251–260.
JAMES , F. 1990. A review of pseudorandom number generators. Computer Physics Communications 60, 329–
344.
J ELINEK , F. 1998. Statistical methods for speech recognition. MIT Press Cambridge, MA, USA.
J OULA , P., S OFKO , J., AND B RENNAN , P. 2006. A prototype for authorship attribution studies. Literary and
Linguistic Computing 21, 2, 169–178.
K ARP, R. AND R ABIN , M. 1987. Efficient randomized pattern-matching algorithms. IBM Journal of Research
and Development 31, 2, 249–260.
K ASER , O., K EITH , S., AND L EMIRE , D. 2006. The LitOLAP project: Data warehousing with literature. In
CaSTA’06.
K EARNS , M., M ANSOUR , Y., RON , D., RUBINFELD , R., S CHAPIRE , R. E., AND S ELLIE , L. 1994. On the
learnability of discrete distributions. In STOC’94. 273–282.
K EITH , S., K ASER , O., AND L EMIRE , D. 2005a. Analyzing large collections of electronic text using OLAP. In
APICS 2005.
K EITH , S., K ASER , O., AND L EMIRE , D. 2005b. Analyzing large collections of electronic text using OLAP.
Tech. Rep. TR-05-001, UNBSJ CSAS. June.
K ESELJ , V. AND C ERCONE , N. 2004. CNG method with weighted voting. In ad-hoc Authorship Attribution
Contest, P. Joula, Ed. AHC/ALLC.
K IM , M.-S., W HANG , K.-Y., L EE , J.-G., AND L EE , M.-J. 2005. n-gram/2l: a space and time efficient two-level
n-gram inverted index structure. In VLDB’05. VLDB Endowment, 325–336.
K IT, C. AND W ILKS , Y. 1998. The Virtual Corpus approach to deriving n-gram statistics from large scale
corpora. In Proceedings of 1998 International Conference on Chinese Information Processing. 223–229.
K NUTH , D. E. 1969. Seminumerical Algorithms. The Art of Computer Programming, vol. 2. Addison-Wesley.
KOLONKO , M. AND WASCH , D. 2006. Sequential reservoir sampling with a nonuniform distribution. ACM
Trans. Math. Softw. 32, 2, 257–273.
KOTIDIS , Y. 2002. Handbook of Massive Data Sets. Kluwer Academic Publishers, Norwell, MA, USA, Chapter
Aggregate View Management in Data Warehouses, 711–741.
L EMIRE , D. AND K ASER , O. 2006. One-pass, one-hash n-gram count estimation. Tech. Rep. TR-06-001, Dept.
of CSAS, UNBSJ. available from http://arxiv.org/abs/cs.DB/0610010.
L I , K.-H. 1994. Reservoir-sampling algorithms of time complexity O(n(1 + log(N/n))). ACM Trans. Math.
Softw. 20, 4, 481–493.
L IN , C.-Y. AND H OVY, E. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In
NAACL’03. Association for Computational Linguistics, Morristown, NJ, USA, 71–78.
L OSIEWICZ , P., OARD , D. W., AND KOSTOFF , R. N. 2000. Textual data mining to support science and tech-
nology management. J. Intell. Inf. Syst. 15, 2, 99–119.
M ANBER , U. AND M YERS , G. 1990. Suffix arrays: a new method for on-line string searches. In SODA ’90.
Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 319–327.
M ANBER , U. AND M YERS , G. 1993. Suffix arrays: a new method for on-line string searches. SIAM Journal on
Computing 22, 5, 935–948.
M ARSAGLIA , G. AND Z AMAN , A. 1987. Toward a universal random number generator. Tech. Rep. FSU-SCRI-
87-50, Florida State University.
M ATSUMOTO , M. AND N ISHIMURA , T. 1998. Mersenne Twister: A 623-dimensionally equidistributed uniform
pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation 8, 1, 3–30.
M C A LLESTER , D. A. AND S CHAPIRE , R. E. 2000. On the convergence rate of Good-Turing estimators. In
COLT’00: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 1–6.
M C C ABE , M. C., L EE , J., C HOWDHURY, A., G ROSSMAN , D., AND F RIEDER , O. 2000. On the design and
evaluation of a multi-dimensional approach to information retrieval. In SIGIR ’00. ACM Press, New York,
NY, USA, 363–365.
M OTHE , J., C HRISMENT, C., D OUSSET, B., AND A LAUX , J. 2003. DocCube: Multi-dimensional visualiza-
tion and exploration of large document sets. Journal of the American Society for Information Science and
Technology 54, 7, 650–659.
NADEAU , T. AND T EOREY, T. 2003. A Pareto model for OLAP view size estimation. Information Systems
Frontiers 5, 2, 137–147.
NAGAO , M. AND M ORI , S. 1994. A new method of n-gram statistics for large number of n and automatic
extraction of words and phrases from large text data of Japanese. In COLING’94. 611–615.
N IE , J.-Y., G AO , J., Z HANG , J., AND Z HOU , M. 2000. On the use of words and n-grams for Chinese information
retrieval. In IRAL’00: Proceedings of the Fifth International Workshop on on Information Retrieval with Asian
Languages. ACM Press, New York, NY, USA, 141–148.
N IEMI , T., H IRVONEN , L., AND J ÄRVELIN , K. 2003. Multidimensional data model and query language for
informetrics. Journal of the American Society for Information Science and Technology 54, 10, 939–951.
O RLITSKY, A., S ANTHANAM , N., AND Z HANG , J. 2003. Always Good Turing: asymptotically optimal proba-
bility estimation. In FOCS’03. 179–188.
PAULUS , J. AND K LAPURI , A. 2003. Conventional and periodic n-grams in the transcription of drum sequences.
In ICME’03. 737–740.
P ROJECT G UTENBERG L ITERARY A RCHIVE F OUNDATION. 2007. Project Gutenberg. http://www.
gutenberg.org/. checked 2007-05-30.
RUSKEY, F. 2006. The (combinatorial) object server. http://www.theory.cs.uvic.ca/∼cos/cos.html.
checked 2007-05-30.
S CHMIDT, J., S IEGEL , A., AND S RINIVASAN , A. 1993. Chernoff-Hoeffding bounds for applications with
limited independence. In SODA’93. Society for Industrial and Applied Mathematics Philadelphia, PA, USA,
331–340.
S HAH , B., R AMACHANDRAN , K., AND R AGHAVAN , V. 2004. Storage estimation of multidimensional ag-
gregates in a data warehouse environment. In Proceedings of the World Multi-Conference on Systemics,
Cybernetics and Informatics.
S HANNON , C. E. 1948. A mathematical theory of communications. Bell Syst. Tech. J.
S HUKLA , A., D ESHPANDE , P., NAUGHTON , J., AND R AMASAMY, K. 1996. Storage estimation for multidi-
mensional aggregates in the presence of hierarchies. In VLDB’96. 522–531.
S U , Z., YANG , Q., L U , Y., AND Z HANG , H. 2000. WhatNext: a prediction system for web requests using
n-gram sequence models. In Web Information Systems Engineering 2000. 214–221.
S ULLIVAN , D. 2001. Document Warehousing and Text Mining: Techniques for Improving Business Operations,
Marketing, and Sales. John Wiley & Sons.
S UN M ICROSYSTEMS. 2004. String (Java 2 Platform SE 5.0). online documentation: http://java.sun.com/
j2se/1.5.0/docs/api/index.html.
TAKABAYASHI , S. 2005. Sary: A suffix array library and tools. online: http://sary.sourceforge.net/.
checked 2007-05-30.
T HE U NICODE C ONSORTIUM. 2006. Unicode home page. http://unicode.org/. checked 2007-05-30.
V ITTER , J. S. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1, 37–57.
W EISS , M. 1999. Data Structures and Algorithm Analysis in Java. Addison Wesley.
W HANG , K., VANDER -Z ANDEN , B., AND TAYLOR , H. 1990. A linear-time probabilistic counting algorithm
for database applications. ACM Transactions on Database Systems 15, 2, 208–229.
YAMASHITA , T. 2005. SUFARY. online: http://nais.to/∼yto/tools/sufary. checked 2007-05-30.
YANNAKOUDAKIS , E. J., T SOMOKOS , I., AND H UTTON , P. J. 1990. n-Grams and their implication to natural
language understanding. Pattern Recogn. 23, 5, 509–528.
Y U , X., Z UZARTE , C., AND S EVCIK , K. C. 2005. Towards estimating the number of distinct value combinations
for a set of attributes. In CIKM’05. ACM Press, New York, NY, USA, 656–663.

Daa Unit 3
No ratings yet
Daa Unit 3
91 pages
GP Config
50% (4)
GP Config
56 pages
10 Dictionaries
No ratings yet
10 Dictionaries
11 pages
Lecture03 Hashing
No ratings yet
Lecture03 Hashing
12 pages
Lecture 04 Inaryseachtree
No ratings yet
Lecture 04 Inaryseachtree
20 pages
Theory and Practice of Monotone Minimal Perfect Hashing
No ratings yet
Theory and Practice of Monotone Minimal Perfect Hashing
27 pages
Probabilistic Counting Algorithms For Database Applications - Flajolet
No ratings yet
Probabilistic Counting Algorithms For Database Applications - Flajolet
28 pages
03 Hashing
No ratings yet
03 Hashing
21 pages
07 Hashing
No ratings yet
07 Hashing
73 pages
Brics: Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions
No ratings yet
Brics: Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions
14 pages
11 042) Guided By: Dr. D. C. Jinwala: Prepared By: Parth Patel (P CO
No ratings yet
11 042) Guided By: Dr. D. C. Jinwala: Prepared By: Parth Patel (P CO
29 pages
DS Lecture07
No ratings yet
DS Lecture07
5 pages
History-Independent Cuckoo Hashing: Udi Wieder Moni Naor Gil Segev
No ratings yet
History-Independent Cuckoo Hashing: Udi Wieder Moni Naor Gil Segev
20 pages
23 Hashing
No ratings yet
23 Hashing
14 pages
CSE143-Computer Programming II Programming Assignment #6 Due: Thursday, 11/16/17 9 PM
No ratings yet
CSE143-Computer Programming II Programming Assignment #6 Due: Thursday, 11/16/17 9 PM
3 pages
05 - Dictionaries and Tuples
No ratings yet
05 - Dictionaries and Tuples
61 pages
Unit 3 Hashing
No ratings yet
Unit 3 Hashing
23 pages
Lec1 Bloom Distinctcount
No ratings yet
Lec1 Bloom Distinctcount
76 pages
Loglog Counting of Large Cardinalities
No ratings yet
Loglog Counting of Large Cardinalities
13 pages
Foundations of Sequence Analysis
No ratings yet
Foundations of Sequence Analysis
161 pages
String Matching and Hashing
No ratings yet
String Matching and Hashing
10 pages
1 Overview: Lecture 2 - February 3, 2005
No ratings yet
1 Overview: Lecture 2 - February 3, 2005
6 pages
Hashing and Indexing
No ratings yet
Hashing and Indexing
28 pages
L11 PDF
No ratings yet
L11 PDF
5 pages
AA Exam 2022 Answers
No ratings yet
AA Exam 2022 Answers
5 pages
Rainbow 17
No ratings yet
Rainbow 17
19 pages
Rabin Karp Alorithm For String Search
No ratings yet
Rabin Karp Alorithm For String Search
3 pages
Mit6 857S14 2.1.2
No ratings yet
Mit6 857S14 2.1.2
3 pages
CS 03
No ratings yet
CS 03
22 pages
CS2040 Tutorial4 Ans
No ratings yet
CS2040 Tutorial4 Ans
5 pages
Cuckoo Hashing For Undergraduates: Rasmus Pagh IT University of Copenhagen March 27, 2006
No ratings yet
Cuckoo Hashing For Undergraduates: Rasmus Pagh IT University of Copenhagen March 27, 2006
6 pages
1994 - Graphs, Hypergraphs and Hashing
No ratings yet
1994 - Graphs, Hypergraphs and Hashing
13 pages
Book 160 163
No ratings yet
Book 160 163
4 pages
CH 4
No ratings yet
CH 4
58 pages
Rolling Hash (Rabin-Karp Algorithm) : Objective
No ratings yet
Rolling Hash (Rabin-Karp Algorithm) : Objective
4 pages
4101 Assignment 9
No ratings yet
4101 Assignment 9
5 pages
Language Trees and Zipping: Olume Umber
No ratings yet
Language Trees and Zipping: Olume Umber
4 pages
CME4408 P5 N-Grams Smooting
No ratings yet
CME4408 P5 N-Grams Smooting
43 pages
Hashing
No ratings yet
Hashing
13 pages
Language Modelling
No ratings yet
Language Modelling
8 pages
Applied Combinatorics On Words
No ratings yet
Applied Combinatorics On Words
575 pages
Classical Cryptanalysis: 3 Class - 20 3 2018 Dr. Sattar B. Sadkhan
No ratings yet
Classical Cryptanalysis: 3 Class - 20 3 2018 Dr. Sattar B. Sadkhan
24 pages
Module 2-Data Structures and Algorithms For Retrieval-Cat1
No ratings yet
Module 2-Data Structures and Algorithms For Retrieval-Cat1
133 pages
S S S ... A Z A Z S I I R R HV Sha HV, ,: e Ciently Enumerable
No ratings yet
S S S ... A Z A Z S I I R R HV Sha HV, ,: e Ciently Enumerable
3 pages
Recitation 2
No ratings yet
Recitation 2
4 pages
MIT6 006S20 Ps3-Questions
No ratings yet
MIT6 006S20 Ps3-Questions
4 pages
Rabin-Karp Algorithm
No ratings yet
Rabin-Karp Algorithm
2 pages
Lecture05 Hash Table
No ratings yet
Lecture05 Hash Table
65 pages
Digital Assignment - Design and Analysis of Algorithms: Joe Shannel Aruja 17MES0003
No ratings yet
Digital Assignment - Design and Analysis of Algorithms: Joe Shannel Aruja 17MES0003
23 pages
Objective: 1. File Input
No ratings yet
Objective: 1. File Input
2 pages
Dsa 4
No ratings yet
Dsa 4
55 pages
A Study Using N-Gram Features For Text Categorization
No ratings yet
A Study Using N-Gram Features For Text Categorization
10 pages
hw05 Solution PDF
No ratings yet
hw05 Solution PDF
8 pages
String Matching
No ratings yet
String Matching
16 pages
Hashing
No ratings yet
Hashing
5 pages
Ap 5
No ratings yet
Ap 5
5 pages
1 Hashing: 1.1 Desired Properties
No ratings yet
1 Hashing: 1.1 Desired Properties
8 pages
Scalable Modified Kneser-Ney Language Model Estimation
No ratings yet
Scalable Modified Kneser-Ney Language Model Estimation
7 pages
Haskell Assignment 2024-25
No ratings yet
Haskell Assignment 2024-25
6 pages
6.006 Introduction To Algorithms: Mit Opencourseware
No ratings yet
6.006 Introduction To Algorithms: Mit Opencourseware
5 pages
Collaborative Filtering and Inference Rules For Context-Aware Learning Object Recommendation
No ratings yet
Collaborative Filtering and Inference Rules For Context-Aware Learning Object Recommendation
11 pages
Owen Kaser, Daniel Lemire, and Steven Keith, The LitOLAP Project: Data Warehousing With Literature, CaSTA 2006, Fredericton, 2006.
No ratings yet
Owen Kaser, Daniel Lemire, and Steven Keith, The LitOLAP Project: Data Warehousing With Literature, CaSTA 2006, Fredericton, 2006.
4 pages
Unasssuming View-Size Estimation Techniques in Olap: An Experimental Comparison
No ratings yet
Unasssuming View-Size Estimation Techniques in Olap: An Experimental Comparison
6 pages
Faster Retrieval With A Two-Pass Dynamic-Time-Warping Lower Bound
100% (1)
Faster Retrieval With A Two-Pass Dynamic-Time-Warping Lower Bound
26 pages
Nosqldbs
No ratings yet
Nosqldbs
149 pages
Attribute Value Reordering For Efficient Hybrid Olap: Owen Kaser
No ratings yet
Attribute Value Reordering For Efficient Hybrid Olap: Owen Kaser
32 pages
Kamel Aouiche and Daniel Lemire, A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP, DOLAP 2007, Pp. 17-24, 2007.
No ratings yet
Kamel Aouiche and Daniel Lemire, A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP, DOLAP 2007, Pp. 17-24, 2007.
8 pages
An Optimal Linear Time Algorithm For Quasi-Monotonic Segmentation
No ratings yet
An Optimal Linear Time Algorithm For Quasi-Monotonic Segmentation
4 pages
Steven Keith, Owen Kaser, Daniel Lemire, Analyzing Large Collections of Electronic Text Using OLAP, UNBSJ CSAS Technical Report TR-05-001, June 2005.
No ratings yet
Steven Keith, Owen Kaser, Daniel Lemire, Analyzing Large Collections of Electronic Text Using OLAP, UNBSJ CSAS Technical Report TR-05-001, June 2005.
13 pages
Daniel Lemire, A Better Alternative To Piecewise Linear Time Series Segmentation, SIAM Data Mining 2007, 2007.
No ratings yet
Daniel Lemire, A Better Alternative To Piecewise Linear Time Series Segmentation, SIAM Data Mining 2007, 2007.
12 pages
Hierarchical Bin Buffering: Online Local Moments For Dynamic External Memory Arrays
No ratings yet
Hierarchical Bin Buffering: Online Local Moments For Dynamic External Memory Arrays
32 pages
Streaming Maximum-Minimum Filter Using No More Than Three Comparisons Per Element
No ratings yet
Streaming Maximum-Minimum Filter Using No More Than Three Comparisons Per Element
12 pages
Hazel Webb, Owen Kaser, Daniel Lemire, Pruning Attributes From Data Cubes With Diamond Dicing, IDEAS'08, 2008.
No ratings yet
Hazel Webb, Owen Kaser, Daniel Lemire, Pruning Attributes From Data Cubes With Diamond Dicing, IDEAS'08, 2008.
24 pages
Faster Retrieval With A Two-Pass Dynamic-Time-Warping Lower Bound
100% (1)
Faster Retrieval With A Two-Pass Dynamic-Time-Warping Lower Bound
26 pages
Tag-Cloud Drawing: Algorithms For Cloud Visualization: Owen Kaser Daniel Lemire
No ratings yet
Tag-Cloud Drawing: Algorithms For Cloud Visualization: Owen Kaser Daniel Lemire
10 pages
Collaborative Olap With Tag Clouds: Web 2.0 OLAP Formalism and Experimental Evaluation
No ratings yet
Collaborative Olap With Tag Clouds: Web 2.0 OLAP Formalism and Experimental Evaluation
8 pages
An Optimal Linear Time Algorithm For Quasi-Monotonic Segmentation
No ratings yet
An Optimal Linear Time Algorithm For Quasi-Monotonic Segmentation
16 pages
Removing Manually Generated Boilerplate From Electronic Texts: Experiments With Project Gutenberg E-Books
No ratings yet
Removing Manually Generated Boilerplate From Electronic Texts: Experiments With Project Gutenberg E-Books
24 pages
Ford Navigator 2007-5.4l-Engine
No ratings yet
Ford Navigator 2007-5.4l-Engine
58 pages
Msds Asam Tanin
No ratings yet
Msds Asam Tanin
6 pages
CIM (Computer Integrated Manufacturing)
No ratings yet
CIM (Computer Integrated Manufacturing)
23 pages
8 Batches Field Training
No ratings yet
8 Batches Field Training
77 pages
DC PPT 7 QPSK
No ratings yet
DC PPT 7 QPSK
17 pages
5-Dr Samir Ayad, Egyptian Solar Energy Systems Co
No ratings yet
5-Dr Samir Ayad, Egyptian Solar Energy Systems Co
24 pages
Phase Diagrams in Materials Science-Massalski-MetTrans
No ratings yet
Phase Diagrams in Materials Science-Massalski-MetTrans
30 pages
OpenSceneGraph Cookbook, Chapter 10
No ratings yet
OpenSceneGraph Cookbook, Chapter 10
54 pages
BFP
No ratings yet
BFP
19 pages
Hormann N80 Installation Instructions PDF - Garage Doors
No ratings yet
Hormann N80 Installation Instructions PDF - Garage Doors
38 pages
Project Code Form Surat Masuk
No ratings yet
Project Code Form Surat Masuk
5 pages
Northwestern Polytechnical University Graduate Degree Dissertation Title Selection Report Form
100% (1)
Northwestern Polytechnical University Graduate Degree Dissertation Title Selection Report Form
14 pages
Automation and Robotics AKTU Question Bank
No ratings yet
Automation and Robotics AKTU Question Bank
3 pages
Rigless, A Misnomer?: Applications To Late Well Lifecycles
100% (1)
Rigless, A Misnomer?: Applications To Late Well Lifecycles
32 pages
Tech Quiz #92: by Jim White, Shermco Industries
No ratings yet
Tech Quiz #92: by Jim White, Shermco Industries
1 page
Systemaire Brochure - Manufacturing Facilities Around The World
No ratings yet
Systemaire Brochure - Manufacturing Facilities Around The World
32 pages
Repuestos Grua PDF
No ratings yet
Repuestos Grua PDF
150 pages
WI/FPI/WW Wheel-Nose
No ratings yet
WI/FPI/WW Wheel-Nose
6 pages
Design and Analysis of Reinforced Rubber Membranes For Inflatable Dams
No ratings yet
Design and Analysis of Reinforced Rubber Membranes For Inflatable Dams
12 pages
Distinguished Author Series: Core Analysis For Aid in Reservoir Description
No ratings yet
Distinguished Author Series: Core Analysis For Aid in Reservoir Description
9 pages
9210 Brochure v1
No ratings yet
9210 Brochure v1
24 pages
Flemish Bond Megat Iman
No ratings yet
Flemish Bond Megat Iman
5 pages
Module 3
No ratings yet
Module 3
8 pages
Hunziker Temple
No ratings yet
Hunziker Temple
2 pages
SB 639 Rocker Arm Inspection
No ratings yet
SB 639 Rocker Arm Inspection
3 pages
SCS Final S20.20 2015
No ratings yet
SCS Final S20.20 2015
37 pages
Ugly Construction
No ratings yet
Ugly Construction
14 pages
Digital Code Lock Synopsis
No ratings yet
Digital Code Lock Synopsis
9 pages
Northbound SH 6 Navasota Bridge Work
No ratings yet
Northbound SH 6 Navasota Bridge Work
1 page