Loglog Counting of Large Cardinalities
(Extended Abstract)
Abstract. Using an auxiliary memory smaller than the size of this ab-
stract, the LogLog algorithm makes it possible to estimate in a single
pass and within a few percent the number of different words in the
whole of Shakespeare’s works. In general the LogLog algorithm makes
use of m “small bytes” of auxiliary memory in order to estimate in a
single pass the number of distinct elements (the “cardinality”) in a file,
and it does so with an accuracy that is of the order of 1/√m. The “small
bytes” to be used in order to count cardinalities till Nmax comprise about
log log Nmax bits, so that cardinalities well in the range of billions can be
determined using one or two kilobytes of memory only. The basic version
of the LogLog algorithm is validated by a complete analysis. An opti-
mized version, super–LogLog, is also engineered and tested on real-life
data. The algorithm parallelizes optimally.
1 Introduction
The problem addressed in this note is that of determining the number of distinct
elements, also called the cardinality, of a large file. This problem arises in several
areas of data-mining, database query optimization, and the analysis of traffic in
routers. In such contexts, the data may be either too large to fit at once in core
memory or even too massive to be stored, being a huge continuous flow of data
packets. For instance, Estan et al. [3] report traces of packet headers, produced
at a rate of 0.5GB per hour of compressed data (!), which were collected while
trying to trace a “worm” (Code Red, August 1 to 12, 2001), and on which it
was necessary to count the number of distinct sources passing through the link.
We propose here the LogLog algorithm that estimates cardinalities using only
a very small amount of auxiliary memory, namely m memory units, where a
memory unit, a “small byte”, comprises close to log log Nmax bits, with Nmax
an a priori upperbound on cardinalities. The estimate is (in the sense of mean
values) asymptotically unbiased; the relative accuracy of the estimate (measured
by a standard deviation) is close to 1.05/√m for our best version of the algo-
rithm, Super–LogLog. For instance, estimating cardinalities till Nmax = 2^27 (a
hundred million different records) can be achieved with m = 2048 memory units
of 5 bits each, which corresponds to 1.28 kilobytes of auxiliary storage in total,
the error observed being typically less than 2.5%. Since the algorithm operates
incrementally and in a single pass it can be applied to data flows for which it
provides on-line estimates available at any given time. Advantage can be taken
ghfffghfghgghggggghghheehfhfhhgghghghhfgffffhhhiigfhhffgfiihfhhh
igigighfgihfffghigihghigfhhgeegeghgghhhgghhfhidiigihighihehhhfgg
hfgighigffghdieghhhggghhfghhfiiheffghghihifgggffihgihfggighgiiif
fjgfgjhhjiifhjgehgghfhhfhjhiggghghihigghhihihgiighgfhlgjfgjjjmfl
The LogLog Algorithm with m = 256 condenses the whole of Shakespeare’s works
to a table of 256 “small bytes” of 4 bits each. The estimate of the number of distinct
words is here n◦ = 30897 (true answer: n = 28239), i.e., a relative error of +9.4%.
For LogLog counting, the observable should only be linked to cardinality, and hence
be totally independent of the nature of replications and the ordering of data
present in the file, on which no information at all is available. (Depending on
context, collisions due to hashing can either be neglected or their effect can be
estimated and corrected.)
Whang, Zanden, and Taylor [16] have developed Linear Counting, which dis-
tributes (hashed) values into buckets and only keeps a bitmap indicating which
buckets are hit. Then observing the number of hits in the table leads to an es-
timate of cardinality. Since the number of buckets should not be much smaller
than the cardinalities to be estimated (say, ≥ Nmax /10), the algorithm has space
complexity that is O(Nmax ) (typically, Nmax /10 bits of storage). The linear space
is a drawback whenever large cardinalities, multiple counts, or limited hardware
are the rule. Estan, Varghese, and Fisk [3] have devised a multiscale version of
this principle, where a hierarchical collection of small windows on the bitmap
is kept. From simulation data, their Multiresolution Bitmap algorithm appears
to be about 20% more accurate than Probabilistic Counting (discussed below)
when the same amount of memory is used. The best algorithm of [3] for flows
in routers, Adaptive Bitmap, is reported to be about 3 times more efficient than
either Probabilistic Counting or Multiresolution Bitmap, but it has the dis-
advantage of not being universal, as it makes definite statistical assumptions
(“stationarity”) regarding the data input to the algorithm. (We recommend the
thorough engineering discussion of [3].)
Closer to us is the Probabilistic Counting algorithm of Flajolet and Mar-
tin [7]. This uses a certain observable that has excellent statistical properties
but is relatively costly to maintain in terms of storage. Indeed, Probabilistic
Counting estimates cardinalities with an error close to 0.78/√m given a table
of m “words”, each of size about log2 Nmax.
Yet another possible idea is sampling. One may use any filter on hashed
values with selectivity p ≪ 1, store exactly and without duplicates the data
items filtered, and return as estimate 1/p times the corresponding cardinality.
Wegner’s Adaptive Sampling (described and analyzed in [5]) is an elegant way
to maintain dynamically varying values of p. For m “words” of memory (where
here “word” refers to the space needed by a data item), the accuracy is about
1.20/√m, which is about 50% less efficient than Probabilistic Counting.
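To make the sampling idea concrete, here is a minimal Python sketch of a fixed-selectivity filter (not Wegner’s adaptive variant, whose p evolves dynamically); the use of SHA-1 as the hash function and the choice p = 2^−8 are illustrative assumptions only.

    import hashlib

    def sampled_cardinality(items, p_bits=8):
        """Keep only items whose hash starts with p_bits zero bits
        (selectivity p = 2**-p_bits), then scale the exact distinct count by 1/p."""
        kept = set()
        for item in items:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
            if h >> (64 - p_bits) == 0:      # filter of selectivity 2**-p_bits
                kept.add(h)                  # duplicates collapse automatically
        return len(kept) * 2 ** p_bits
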
An insightful complexity-theoretic discussion of approximate counting is pro-
vided by Alon, Matias, and Szegedy in [1]. The authors discuss a class of
“frequency–moments” statistics which includes ours (as their F0 statistics). Our
LogLog Algorithm has principles that evoke some of those found in the inter-
section of [1] and the earlier [7], but contrary to [1], we develop here a complete
eminently practical algorithmic solution and provide a very precise analysis,
including bias correction, error and risk evaluation, as well as complete dimen-
sioning rules.
We estimate that our LogLog algorithm outperforms the earlier Probabilis-
tic Counting algorithm and the similarly performing Multiresolution Bitmap
of [3] by a factor of 3 at least as it replaces “words” (of 16 to 32 bits) by “small
bytes” of typically 5 bits each, while being based on an observable that has
only slightly higher dispersion than the other two algorithms—this is expressed
by our two formulæ 1.30/√m (LogLog) and 1.05/√m (super–LogLog). This
places our algorithm in the same category as Adaptive Bitmap of [3]. However,
compared to Adaptive Bitmap, the LogLog algorithm has the great advantage
of being universal as it makes no assumptions on the statistical regularity of
data. We thus believe LogLog and its improved version Super–LogLog to be
the best general-purpose algorithmic solution currently known to the problem
of estimating large cardinalities.
Note. The following related references were kindly suggested by a referee: Cormode et
al., in VLDB 2002 (a new counting method based on stable laws) and Bar-Yossef et
al., SODA 2002 (a new application to counting triangles in graphs).
In computing practice, one deals with a multiset of data items, each belonging to
a discrete universe U. For instance, in the case of natural text, U may be the set
of all alphabetic strings of length ≤ 28 (‘antidisestablishmentarianism’), double
floats represented on 64 bits, and so on. A multiset M of elements of U is given
and the problem is to estimate its cardinality, that is, the number of distinct
elements it comprises. Here is the principle of the basic LogLog algorithm.
As observed in [10], it is “impossible to define a hash function that creates random
data from non-random data in actual files. But in practice it is not difficult to produce
a pretty good imitation of random data.” Given this, we formalize our basic problem as follows.
Take U = {0, 1}∞ as the universe of data endowed with the uniform (prod-
uct) probability distribution. An ideal multiset M of cardinality n is a ran-
dom object that is produced by first drawing an n-sequence independently
at random from U, then replicating elements in an arbitrary way, and finally,
applying an arbitrary permutation.
The user is provided with the (extremely large) ideal multiset M, and the goal
is to estimate the (unknown) value of n at a small computational cost.
No information is available, hence no statistical assumption can be made,
regarding the behaviour of the replicator-shuffler daemon.
(The fact that we consider infinite data is a convenient abstraction at this stage;
we discuss its effect, together with needed adjustments, in Section 5 below.)
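For experimentation, an ideal multiset is easy to simulate; the following Python sketch (with arbitrary replication counts and a random shuffle, both chosen here purely for illustration) merely makes the definition concrete.

    import random

    def ideal_multiset(n, max_copies=5, n_bits=64):
        """Draw n random n_bits-bit values (distinct with overwhelming probability),
        replicate each an arbitrary number of times, and shuffle: the cardinality
        to recover is (essentially) n."""
        base = [random.getrandbits(n_bits) for _ in range(n)]
        multiset = [x for x in base for _ in range(random.randint(1, max_copies))]
        random.shuffle(multiset)
        return multiset
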
The basic idea consists in scanning M and observing the patterns of the form
0^k 1 that occur at the beginning of (hashed) records. For a string x ∈ {0, 1}^∞,
let ρ(x) denote the position of its first 1-bit. Thus ρ(1···) = 1, ρ(001···) = 3,
etc. Clearly, we expect about n/2^k amongst the distinct elements of M to have
a ρ-value equal to k. In other words, the quantity

    R(M) := max_{x ∈ M} ρ(x)

can reasonably be hoped to provide a rough indication on the value of log2 n. It
is an “observable” in the sense above since it is totally independent of the order
and the replication structure of the multiset M. In fact, in probabilistic terms,
the quantity R is precisely distributed in the same way as 1 plus the maximum
of n independent geometric variables of parameter 1/2. This is an extensively
researched subject; see, e.g., [14]. It turns out that R estimates log2 n with an
additive bias of 1.33 and a standard deviation of 1.87. Thus, in a sense, the
observed value of R estimates n “logarithmically”, within ±1.87 binary orders
of magnitude. Notice however that the expectation of 2^R is infinite, so that 2^R
cannot in fact be used to estimate n.
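In code, ρ and R are one-liners on hashed values; the hypothetical helpers below assume hashed values are 64-bit integers read from their most significant bit, a convention not imposed by the text.

    def rho(h, bits=64):
        """Position (1-based) of the first 1-bit of h viewed as a bits-bit string."""
        return bits - h.bit_length() + 1 if h else bits + 1

    def R(hashed_values, bits=64):
        """The raw observable: max of rho over the (hashed) multiset, roughly log2(n)."""
        return max(rho(h, bits) for h in hashed_values)
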
The next idea consists in separating elements into m groups also called “buck-
ets”, where m is a design parameter. With m = 2^k, this is easily done by using
the first k bits of x as representing in binary the index of a bucket. One can
then compute the parameter R on each bucket, after discarding the first k bits.
If M^(j) is the (random) value of parameter R on bucket number j, then the
arithmetic mean

    (1/m) Σ_{j=1}^{m} M^(j)

can legitimately be expected to approximate log2(n/m) plus an additive bias.
The estimate of n returned by the LogLog algorithm is accordingly

    E := α_m m 2^{(1/m) Σ_j M^(j)}.     (1)
The constant α_m comes out of our later analysis as

    α_m := ( Γ(−1/m) (1 − 2^{1/m}) / log 2 )^{−m},   where Γ(s) := (1/s) ∫_0^∞ e^{−t} t^s dt.

It precisely corrects the systematic bias of the raw arithmetic mean in the
asymptotic limit. One may also hope for a greater concentration of the estimates,
hence better accuracy, to result from averaging over m ≫ 1 values. The main
characteristics of the estimator E are analyzed next.
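Putting the pieces together, a minimal Python sketch of the basic (unrestricted) algorithm computing the estimate (1) could read as follows; SHA-1 truncated to 64 bits stands in for the idealized hash function, the default k = 10 (m = 1024 registers) is an illustrative choice, and α_m is computed directly from the formula above.

    import hashlib
    import math

    def loglog(items, k=10):
        """Basic LogLog estimate of the number of distinct items, with m = 2**k registers."""
        m = 2 ** k
        M = [0] * m                                   # unbounded integer registers
        for item in items:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
            j = h >> (64 - k)                         # first k bits: bucket index
            w = h & ((1 << (64 - k)) - 1)             # remaining bits feed rho
            r = (64 - k) - w.bit_length() + 1         # position of the first 1-bit
            M[j] = max(M[j], r)                       # duplicates hash alike: registers unaffected
        # alpha_m = (Gamma(-1/m) * (1 - 2**(1/m)) / log 2) ** (-m), as in the text
        alpha_m = (math.gamma(-1.0 / m) * (1.0 - 2.0 ** (1.0 / m)) / math.log(2.0)) ** (-m)
        return alpha_m * m * 2.0 ** (sum(M) / m)      # formula (1)

For instance, on the Shakespeare data of the figure in Section 1, one would take k = 8, i.e., m = 256 registers, as in that figure.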
Throughout this note, the unknown number of distinct values in the data set is
denoted by n. The LogLog algorithm provides an estimator, E, of n. We first
provide formulæ for the expectation and variance of E. Asymptotic analysis
is performed next: The Poissonization paragraph introduces the Poisson model
where n is allowed to vary according to a Poisson law, while the Depoissoniza-
tion paragraph shows the Poisson model to be asymptotically equivalent to the
“fixed–n” model that we need. The expected value of the estimator is found to
be asymptotically n, up to minute fluctuations. This establishes the asymptot-
ically unbiased character of the algorithm as asserted in (i) of Theorem 1. The
standard deviation of the estimator is also proved to be of the order of n with
the proportionality coefficient providing the value of the standard error, hence
the accuracy of the algorithm, as asserted in (ii) of Theorem 1.
² We use ‘∼’ to denote asymptotic expansions in the usual mathematical sense and
reserve the informal ‘≈’ for “approximately equal”.
Fig. 1. The distribution of observed register values for the Pi file, n ≈ 2 · 10^7, with
m = 1024 [left]; the distribution Pν(M = k) of a register M, for ν = 2 · 10^4 [right].
4 Space Requirements
Now that the correctness—the absence of bias as well as accuracy—of the basic
LogLog algorithm has been established, there remains to see that it performs
as promised and only consumes O(log log n) bits of storage if counts till n are
needed³.
³ A counting algorithm exhibiting a log-log feature in a different context is Morris’s
Approximate Counting [11] analyzed in [4].
In its abstract form of Section 1, the LogLog algorithm operates with po-
tentially unbounded integer registers and it consumes m of these. What we call
an ℓ-restricted algorithm is one in which each of the M^(j) registers is made of ℓ
bits, that is, it can store any integer between 0 and 2^ℓ − 1. We state a shallow
result only meant to phrase mathematically the log-log property of the basic
space complexity:
The auxiliary tables maintained by the algorithm then comprise m “small bytes”,
each of size ℓ(n). In other words, the total space required by the algorithm in
order to count till n is m log2 log2 (n/m) (1 + o(1)) bits. The hashing function needs
to hash values from the original data universe onto exactly 2^{ℓ(n)} + log2 m bits.
Observe also that, whenever no discrepancy is present at the value n itself, the
restricted algorithm automatically provides the right answer for all values n′ ≤ n.
The proof of this theorem results from tail properties of the multinomial
distributions and of maxima of geometric random variables.
Assume for instance that we wish to count cardinalities till 2^27, that is, over
a hundred million, with an accuracy of about 4%. By Theorem 1, one should
adopt m = 1024 = 2^10. Then, each bucket is visited roughly n/m = 2^17 times.
One has log2 log2 2^17 ≈ 4.09. Adopt ω = 0.91, so that each register has a size
of ℓ = 5 bits, i.e., a value less than 32. Applying the upper bound on the overall
failure probability shows that an ℓ-restriction will have little incidence on the
result: the probability of a discrepancy is lower than 12%. In summary: The
basic LogLog counting algorithm makes it possible to estimate cardinalities
till 10^8 with a standard error of 4% using 1024 registers of 5 bits each, that is,
a table of 640 bytes in total.
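The dimensioning arithmetic of this example is easy to check mechanically; the small helper below mirrors the computation in the text, with a plain ceiling standing in for the ω-slack rounding used there (an assumption made for simplicity).

    import math

    def dimension(n_max, m):
        """Register size (bits) and table size (bytes) needed to count till n_max with m buckets."""
        ell = math.ceil(math.log2(math.log2(n_max / m)))   # about log2 log2 (n/m), rounded up
        return ell, m * ell / 8                            # bits per register, table size in bytes

    print(dimension(2 ** 27, 1024))                        # -> (5, 640.0): 5-bit registers, 640 bytes
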
5 Algorithmic Engineering
Fig. 2. The evolution of the estimate (divided by the current value of n) provided by
super–LogLog on all of Shakespeare’s works: (left) words; (right) pairs of consecutive
words. Here m = 256 (standard error=6.5%).
Length of the hash function and collisions. The length H of the hash
function—how many bits should it produce?— is guided by previous consid-
erations. There must be log2 m bits reserved for bucketing and the bound on
register values should be at least as large as the quantity B above.
Accordingly, this value H must satisfy H ≥ H0, where H0 := log2 m + ⌈log2 (Nmax/m) + 3⌉. In
case a value too close to H0 is adopted (say 0 ≤ H − H0 ≤ 3), then the effect
of hashing collisions must be compensated for. This is achieved by inverting the
function that gives the expected value of the number of collisions in a hash table
(see [3,16] for an analogous discussion). The estimator is then to be changed
into

    −2^H log( 1 − (α_m m 2^{(1/m) Σ_j M^(j)}) / 2^H ).

(No detectable degradation of performance
results from the last modification of the estimator function, and it can safely be
used in all cases.)
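As a sketch, the collision correction amounts to one line, with E the raw estimate of (1) and H the number of hashed bits (natural logarithm, as in the formula above):

    import math

    def collision_corrected(E, H):
        """Invert the expected-collision count of a 2**H-cell hash table:
        replaces the raw estimate E by -2**H * log(1 - E / 2**H)."""
        return -(2 ** H) * math.log(1.0 - E / 2 ** H)
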
Risk analysis. For the pure LogLog algorithm, the estimate is an empirical
mean of random variables that are approximately identically distributed (up
to statistical fluctuations in bucket sizes). From there, it can be proved that
the quantity (1/m) Σ_j M^(j) is numerically closely approximated by a Gaussian.
Consequently, the estimate returned is very roughly Gaussian: at any rate, it
has exponentially decaying tails. (In principle, a full analysis would be feasible.)
A similar property is expected for the super-LogLog algorithm since it is based
on the same principles. As a consequence, we obtain the following pragmatic
conclusion:
6 Conclusions
(The observed error was found to depend monotonically on memory size (m) in the
case of single runs on a given piece of data.)
As we have strived to demonstrate, the LogLog algorithm in its optimized
version performs quite well. The following table (grossly) summarizes the accu-
racy (measured by standard error σ) in relation to the storage used for the major
methods known. Note that different algorithms operate with different memory
units.
Algorithm            Std. Err. (σ)   Memory units                n = 10^8, σ = 0.02
Adaptive Sampling    1.20/√m         Records (≥ 24-bit words)    10.8 kbytes
Prob. Counting       0.78/√m         Words (24–32 bits)          6.0 kbytes
Multires. Bitmap     ≈ 4.4/√m        Bits                        4.8 kbytes
LogLog               1.30/√m         “Small bytes” (5 bits)      2.1 kbytes
Super-LogLog         1.05/√m         “Small bytes” (5 bits)      1.7 kbytes
The last column is a rough indication of the storage requirement for an accuracy
of 2% and a file of cardinality 10^8. (The formula for Multiresolution Bitmap is
a crude extrapolation based on data of [3].)
Distributing or parallelizing the algorithm is trivial: it suffices to have dif-
ferent processors (sharing the same hash function) operate on different slices of
the data and then “max–merge” their tables of registers. Optimal speed-up is
clearly attained and interprocess communication is limited to just a few kilo-
bytes. Requirements for an embedded hardware design are absolutely minimal
as only addressing, register comparisons, and integer addition are needed.
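The “max–merge” step is literally a component-wise maximum of the register tables; a sketch:

    def max_merge(tables):
        """Combine register tables produced by processors on disjoint slices of the data
        (all using the same hash function and the same m)."""
        return [max(regs) for regs in zip(*tables)]

The merged table is then fed to the same estimator formula (1) as in the single-processor case.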
References
1. Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating
the frequency moments. Journal of Computer and System Sciences 58 (1999), 137–
147.
2. Estan, C., and Varghese, G. New directions in traffic measurement and ac-
counting. In Proceedings of SIGCOMM 2002 (2002), ACM Press. (Also: UCSD
technical report CS2002-0699, February 2002; available electronically.)
3. Estan, C., Varghese, G., and Fisk, M. Bitmap algorithms for counting active
flows on high speed links. Technical Report CS2003-0738, UCSD, Mar. 2003.
4. Flajolet, P. Approximate counting: A detailed analysis. BIT 25 (1985), 113–134.
5. Flajolet, P. On adaptive sampling. Computing 34 (1990), 391–400.
6. Flajolet, P., Gourdon, X., and Dumas, P. Mellin transforms and asymptotics:
Harmonic sums. Theoretical Computer Science 144, 1-2 (1995), 3–58.
7. Flajolet, P., and Martin, G. N. Probabilistic counting algorithms for data base
applications. Journal of Computer and System Sciences 31, 2 (1985), 182–209.
8. Gibbons, P. B., Poosala, V., Acharya, S., Bartal, Y., Matias, Y.,
Muthukrishnan, S., Ramaswamy, S., and Suel, T. AQUA: System and tech-
niques for approximate query answering. Tech. report, Bell Laboratories, Murray
Hill, New Jersey, Feb. 1998.
9. Jacquet, P., and Szpankowski, W. Analytical depoissonization and its appli-
cations. Theoretical Computer Science 201, 1–2 (1998).
10. Knuth, D. E. The Art of Computer Programming, 2nd ed., vol. 3: Sorting and
Searching. Addison-Wesley, 1998.
11. Morris, R. Counting large numbers of events in small registers. Communications
of the ACM 21 (1978), 840–842.
12. Motwani, R., and Raghavan, P. Randomized Algorithms. Cambridge University
Press, 1995.
13. Palmer, C. R., Siganos, G., Faloutsos, M., Faloutsos, C., and Gibbons, P. The
connectivity and fault-tolerance of the Internet topology. In Workshop on Network-Related
Data Management (NRDM-2001).
14. Prodinger, H. Combinatorics of geometrically distributed random variables: Left-
to-right maxima. Discrete Mathematics 153 (1996), 253–270.
15. Szpankowski, W. Average-Case Analysis of Algorithms on Sequences. John Wiley,
New York, 2001.
16. Whang, K.-Y., Zanden, B. T. V., and Taylor, H. M. A linear-time proba-
bilistic counting algorithm for database applications. TODS 15, 2 (1990), 208–229.