Loglog Counting of Large Cardinalities
(Extended Abstract)
Abstract. Using an auxiliary memory smaller than the size of this ab-
stract, the LogLog algorithm makes it possible to estimate in a single
pass and within a few percent the number of different words in the
whole of Shakespeare’s works. In general the LogLog algorithm makes
use of m “small bytes” of auxiliary memory in order to estimate in a
single pass the number of distinct elements (the “cardinality”) in a file,
and it does so with an accuracy that is of the order of 1/√m. The “small
bytes” to be used in order to count cardinalities till Nmax comprise about
log log Nmax bits, so that cardinalities well in the range of billions can be
determined using one or two kilobytes of memory only. The basic version
of the LogLog algorithm is validated by a complete analysis. An opti-
mized version, super–LogLog, is also engineered and tested on real-life
data. The algorithm parallelizes optimally.
1 Introduction
The problem addressed in this note is that of determining the number of distinct
elements, also called the cardinality, of a large file. This problem arises in several
areas of data-mining, database query optimization, and the analysis of traffic in
routers. In such contexts, the data may be either too large to fit at once in core
memory or even too massive to be stored, being a huge continuous flow of data
packets. For instance, Estan et al. [3] report traces of packet headers, produced
at a rate of 0.5GB per hour of compressed data (!), which were collected while
trying to trace a “worm” (Code Red, August 1 to 12, 2001), and on which it
was necessary to count the number of distinct sources passing through the link.
We propose here the LogLog algorithm that estimates cardinalities using only
a very small amount of auxiliary memory, namely m memory units, where a
memory unit, a “small byte”, comprises close to log log Nmax bits, with Nmax
an a priori upperbound on cardinalities. The estimate is (in the sense of mean
values) asymptotically unbiased; the relative accuracy of the estimate (measured
by a standard deviation) is close to 1.05/√m for our best version of the algo-
rithm, Super–LogLog. For instance, estimating cardinalities till Nmax = 2^27 (a
hundred million different records) can be achieved with m = 2048 memory units
of 5 bits each, which corresponds to 1.28 kilobytes of auxiliary storage in total,
the error observed being typically less than 2.5%. Since the algorithm operates
incrementally and in a single pass it can be applied to data flows for which it
provides on-line estimates available at any given time. Advantage can be taken
ghfffghfghgghggggghghheehfhfhhgghghghhfgffffhhhiigfhhffgfiihfhhh
igigighfgihfffghigihghigfhhgeegeghgghhhgghhfhidiigihighihehhhfgg
hfgighigffghdieghhhggghhfghhfiiheffghghihifgggffihgihfggighgiiif
fjgfgjhhjiifhjgehgghfhhfhjhiggghghihigghhihihgiighgfhlgjfgjjjmfl
The LogLog Algorithm with m = 256 condenses the whole of Shakespeare’s works
to a table of 256 “small bytes” of 4 bits each. The estimate of the number of distinct
words is here n◦ = 30897 (true answer: n = 28239), i.e., a relative error of +9.4%.
For LogLog counting, the observable should only be linked to cardinality, and hence
be totally independent of the nature of replications and the ordering of data
present in the file, on which no information at all is available. (Depending on
context, collisions due to hashing can either be neglected or their effect can be
estimated and corrected.)
Whang, Zanden, and Taylor [16] have developed Linear Counting, which dis-
tributes (hashed) values into buckets and only keeps a bitmap indicating which
buckets are hit. Then observing the number of hits in the table leads to an es-
timate of cardinality. Since the number of buckets should not be much smaller
than the cardinalities to be estimated (say, ≥ Nmax /10), the algorithm has space
complexity that is O(Nmax ) (typically, Nmax /10 bits of storage). The linear space
is a drawback whenever large cardinalities, multiple counts, or limited hardware
are the rule. Estan, Varghese, and Fisk [3] have devised a multiscale version of
this principle, where a hierarchical collection of small windows on the bitmap
is kept. From simulation data, their Multiresolution Bitmap algorithm appears
to be about 20% more accurate than Probabilistic Counting (discussed below)
when the same amount of memory is used. The best algorithm of [3] for flows
in routers, Adaptive Bitmap, is reported to be about 3 times more efficient than
either Probabilistic Counting or Multiresolution Bitmap, but it has the dis-
advantage of not being universal, as it makes definite statistical assumptions
(“stationarity”) regarding the data input to the algorithm. (We recommend the
thorough engineering discussion of [3].)
Closer to us is the Probabilistic Counting algorithm of Flajolet and Mar-
tin [7]. This uses a certain observable that has excellent statistical properties
but is relatively costly to maintain in terms of storage. Indeed, Probabilistic
Counting estimates cardinalities with an error close to 0.78/√m given a table
of m “words”, each of size about log2 Nmax.
Yet another possible idea is sampling. One may use any filter on hashed
values with selectivity p ≪ 1, store exactly and without duplicates the data
items filtered, and return as estimate 1/p times the corresponding cardinality.
Wegner’s Adaptive Sampling (described and analyzed in [5]) is an elegant way
to maintain dynamically varying values of p. For m “words” of memory (where
here “word” refers to the space needed by a data item), the accuracy is about
1.20/√m, which is about 50% less efficient than Probabilistic Counting.
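To make the sampling idea concrete, here is a minimal Python sketch of a fixed-selectivity filter (not Wegner’s adaptive variant, whose p evolves dynamically); the use of SHA-1 as the hash function and the choice p = 2^−8 are illustrative assumptions only.

    import hashlib

    def sampled_cardinality(items, p_bits=8):
        """Keep only items whose hash starts with p_bits zero bits
        (selectivity p = 2**-p_bits), then scale the exact distinct count by 1/p."""
        kept = set()
        for item in items:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
            if h >> (64 - p_bits) == 0:      # filter of selectivity 2**-p_bits
                kept.add(h)                  # duplicates collapse automatically
        return len(kept) * 2 ** p_bits
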
An insightful complexity-theoretic discussion of approximate counting is pro-
vided by Alon, Matias, and Szegedy in [1]. The authors discuss a class of
“frequency–moments” statistics which includes ours (as their F0 statistics). Our
LogLog Algorithm has principles that evoke some of those found in the inter-
section of [1] and the earlier [7], but contrary to [1], we develop here a complete
eminently practical algorithmic solution and provide a very precise analysis,
including bias correction, error and risk evaluation, as well as complete dimen-
sioning rules.
We estimate that our LogLog algorithm outperforms the earlier Probabilis-
tic Counting algorithm and the similarly performing Multiresolution Bitmap
of [3] by a factor of 3 at least as it replaces “words” (of 16 to 32 bits) by “small
bytes” of typically 5 bits each, while being based on an observable that has
only slightly higher dispersion than the other two algorithms—this is expressed
by our two formulæ 1.30/√m (LogLog) and 1.05/√m (super–LogLog). This
places our algorithm in the same category as Adaptive Bitmap of [3]. However,
compared to Adaptive Bitmap, the LogLog algorithm has the great advantage
of being universal as it makes no assumptions on the statistical regularity of
data. We thus believe LogLog and its improved version Super–LogLog to be
the best general-purpose algorithmic solution currently known to the problem
of estimating large cardinalities.
Note. The following related references were kindly suggested by a referee: Cormode et
al., in VLDB 2002 (a new counting method based on stable laws) and Bar-Yossef et
al., SODA 2002 (a new application to counting triangles in graphs).
In computing practice, one deals with a multiset of data items, each belonging to
a discrete universe U. For instance, in the case of natural text, U may be the set
of all alphabetic strings of length ≤ 28 (‘antidisestablishmentarianism’), double
floats represented on 64 bits, and so on. A multiset M of elements of U is given
and the problem is to estimate its cardinality, that is, the number of distinct
elements it comprises. Here is the principle of the basic LogLog algorithm.
As observed in [10], it is “impossible to define a hash function that creates random
data from non-random data in actual files. But in practice it is not difficult to produce
a pretty good imitation of random data.” Given this, we formalize our basic problem as follows.
Take U = {0, 1}∞ as the universe of data endowed with the uniform (prod-
uct) probability distribution. An ideal multiset M of cardinality n is a ran-
dom object that is produced by first drawing an n-sequence independently
at random from U, then replicating elements in an arbitrary way, and finally,
applying an arbitrary permutation.
The user is provided with the (extremely large) ideal multiset M, and the goal
is to estimate the (unknown) value of n at a small computational cost.
No information is available, hence no statistical assumption can be made,
regarding the behaviour of the replicator-shuffler daemon.
(The fact that we consider infinite data is a convenient abstraction at this stage;
we discuss its effect, together with needed adjustments, in Section 5 below.)
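For experimentation, an ideal multiset is easy to simulate; the following Python sketch (with arbitrary replication counts and a random shuffle, both chosen here purely for illustration) merely makes the definition concrete.

    import random

    def ideal_multiset(n, max_copies=5, n_bits=64):
        """Draw n random n_bits-bit values (distinct with overwhelming probability),
        replicate each an arbitrary number of times, and shuffle: the cardinality
        to recover is (essentially) n."""
        base = [random.getrandbits(n_bits) for _ in range(n)]
        multiset = [x for x in base for _ in range(random.randint(1, max_copies))]
        random.shuffle(multiset)
        return multiset
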
The basic idea consists in scanning M and observing the patterns of the form
0^k 1 that occur at the beginning of (hashed) records. For a string x ∈ {0, 1}^∞,
let ρ(x) denote the position of its first 1-bit. Thus ρ(1···) = 1, ρ(001···) = 3,
etc. Clearly, we expect about n/2^k amongst the distinct elements of M to have
a ρ-value equal to k. In other words, the quantity

    R(M) := max_{x ∈ M} ρ(x)

can reasonably be hoped to provide a rough indication on the value of log2 n. It
is an “observable” in the sense above since it is totally independent of the order
and the replication structure of the multiset M. In fact, in probabilistic terms,
the quantity R is precisely distributed in the same way as 1 plus the maximum
of n independent geometric variables of parameter 1/2. This is an extensively
researched subject; see, e.g., [14]. It turns out that R estimates log2 n with an
additive bias of 1.33 and a standard deviation of 1.87. Thus, in a sense, the
observed value of R estimates n “logarithmically”, within ±1.87 binary orders
of magnitude. Notice however that the expectation of 2^R is infinite, so that 2^R
cannot in fact be used to estimate n.
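In code, ρ and R are one-liners on hashed values; the hypothetical helpers below assume hashed values are 64-bit integers read from their most significant bit, a convention not imposed by the text.

    def rho(h, bits=64):
        """Position (1-based) of the first 1-bit of h viewed as a bits-bit string."""
        return bits - h.bit_length() + 1 if h else bits + 1

    def R(hashed_values, bits=64):
        """The raw observable: max of rho over the (hashed) multiset, roughly log2(n)."""
        return max(rho(h, bits) for h in hashed_values)
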
The next idea consists in separating elements into m groups also called “buck-
ets”, where m is a design parameter. With m = 2^k, this is easily done by using
the first k bits of x as representing in binary the index of a bucket. One can
then compute the parameter R on each bucket, after discarding the first k bits.
If M^(j) is the (random) value of parameter R on bucket number j, then the
arithmetic mean

    (1/m) Σ_{j=1}^{m} M^(j)

can legitimately be expected to approximate log2(n/m) plus an additive bias.
The estimate of n returned by the LogLog algorithm is accordingly

    E := α_m m 2^{(1/m) Σ_j M^(j)}.     (1)
The constant α_m comes out of our later analysis as

    α_m := ( Γ(−1/m) (1 − 2^{1/m}) / log 2 )^{−m},   where Γ(s) := (1/s) ∫_0^∞ e^{−t} t^s dt.

It precisely corrects the systematic bias of the raw arithmetic mean in the
asymptotic limit. One may also hope for a greater concentration of the estimates,
hence better accuracy, to result from averaging over m ≫ 1 values. The main
characteristics of the estimator E are analyzed next.
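Putting the pieces together, a minimal Python sketch of the basic (unrestricted) algorithm computing the estimate (1) could read as follows; SHA-1 truncated to 64 bits stands in for the idealized hash function, the default k = 10 (m = 1024 registers) is an illustrative choice, and α_m is computed directly from the formula above.

    import hashlib
    import math

    def loglog(items, k=10):
        """Basic LogLog estimate of the number of distinct items, with m = 2**k registers."""
        m = 2 ** k
        M = [0] * m                                   # unbounded integer registers
        for item in items:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
            j = h >> (64 - k)                         # first k bits: bucket index
            w = h & ((1 << (64 - k)) - 1)             # remaining bits feed rho
            r = (64 - k) - w.bit_length() + 1         # position of the first 1-bit
            M[j] = max(M[j], r)                       # duplicates hash alike: registers unaffected
        # alpha_m = (Gamma(-1/m) * (1 - 2**(1/m)) / log 2) ** (-m), as in the text
        alpha_m = (math.gamma(-1.0 / m) * (1.0 - 2.0 ** (1.0 / m)) / math.log(2.0)) ** (-m)
        return alpha_m * m * 2.0 ** (sum(M) / m)      # formula (1)

For instance, on the Shakespeare data of the figure in Section 1, one would take k = 8, i.e., m = 256 registers, as in that figure.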
Throughout this note, the unknown number of distinct values in the data set is
denoted by n. The LogLog algorithm provides an estimator, E, of n. We first
provide formulæ for the expectation and variance of E. Asymptotic analysis
is performed next: The Poissonization paragraph introduces the Poisson model
where n is allowed to vary according to a Poisson law, while the Depoissoniza-
tion paragraph shows the Poisson model to be asymptotically equivalent to the
“fixed–n” model that we need. The expected value of the estimator is found to
be asymptotically n, up to minute fluctuations. This establishes the asymptot-
ically unbiased character of the algorithm as asserted in (i) of Theorem 1. The
standard deviation of the estimator is also proved to be of the order of n with
the proportionality coefficient providing the value of the standard error, hence
the accuracy of the algorithm, as asserted in (ii) of Theorem 1.
² We use ‘∼’ to denote asymptotic expansions in the usual mathematical sense and
reserve the informal ‘≈’ for “approximately equal”.
Fig. 1. The distribution of observed register values for the Pi file, n ≈ 2 · 10^7, with
m = 1024 [left]; the distribution Pν(M = k) of a register M, for ν = 2 · 10^4 [right].
4 Space Requirements
Now that the correctness—the absence of bias as well as accuracy—of the basic
LogLog algorithm has been established, there remains to see that it performs
as promised and only consumes O(log log n) bits of storage if counts till n are
needed³.
³ A counting algorithm exhibiting a log-log feature in a different context is Morris’s
Approximate Counting [11] analyzed in [4].
In its abstract form of Section 1, the LogLog algorithm operates with po-
tentially unbounded integer registers and it consumes m of these. What we call
an ℓ-restricted algorithm is one in which each of the M^(j) registers is made of ℓ
bits, that is, it can store any integer between 0 and 2^ℓ − 1. We state a shallow
result only meant to phrase mathematically the log-log property of the basic
space complexity:
The auxiliary tables maintained by the algorithm then comprise m “small bytes”,
each of size ℓ(n). In other words, the total space required by the algorithm in
order to count till n is m log2 log2 (n/m) (1 + o(1)) bits. The hashing function needs
to hash values from the original data universe onto exactly 2^{ℓ(n)} + log2 m bits.
Observe also that, whenever no discrepancy is present at the value n itself, the
restricted algorithm automatically provides the right answer for all values n′ ≤ n.
The proof of this theorem results from tail properties of the multinomial
distributions and of maxima of geometric random variables.
Assume for instance that we wish to count cardinalities till 2^27, that is, over
a hundred million, with an accuracy of about 4%. By Theorem 1, one should
adopt m = 1024 = 2^10. Then, each bucket is visited roughly n/m = 2^17 times.
One has log2 log2 2^17 ≈ 4.09. Adopt ω = 0.91, so that each register has a size
of ℓ = 5 bits, i.e., a value less than 32. Applying the upper bound on the overall
failure probability shows that an ℓ-restriction will have little incidence on the
result: the probability of a discrepancy is lower than 12%. In summary: The
basic LogLog counting algorithm makes it possible to estimate cardinalities
till 10^8 with a standard error of 4% using 1024 registers of 5 bits each, that is,
a table of 640 bytes in total.
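The dimensioning arithmetic of this example is easy to check mechanically; the small helper below mirrors the computation in the text, with a plain ceiling standing in for the ω-slack rounding used there (an assumption made for simplicity).

    import math

    def dimension(n_max, m):
        """Register size (bits) and table size (bytes) needed to count till n_max with m buckets."""
        ell = math.ceil(math.log2(math.log2(n_max / m)))   # about log2 log2 (n/m), rounded up
        return ell, m * ell / 8                            # bits per register, table size in bytes

    print(dimension(2 ** 27, 1024))                        # -> (5, 640.0): 5-bit registers, 640 bytes
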
5 Algorithmic Engineering
Fig. 2. The evolution of the estimate (divided by the current value of n) provided by
super–LogLog on all of Shakespeare’s works: (left) words; (right) pairs of consecutive
words. Here m = 256 (standard error=6.5%).
Length of the hash function and collisions. The length H of the hash
function—how many bits should it produce?— is guided by previous consid-
erations. There must be log2 m bits reserved for bucketing and the bound on
register values should be at least as large as the quantity B above.
Accordingly, this value H must satisfy H ≥ H0, where H0 := log2 m + ⌈log2 (Nmax/m) + 3⌉. In
case a value too close to H0 is adopted (say 0 ≤ H − H0 ≤ 3), then the effect
of hashing collisions must be compensated for. This is achieved by inverting the
function that gives the expected value of the number of collisions in a hash table
(see [3,16] for an analogous discussion). The estimator is then to be changed
into

    −2^H log( 1 − (α_m m 2^{(1/m) Σ_j M^(j)}) / 2^H ).

(No detectable degradation of performance
results from the last modification of the estimator function, and it can safely be
used in all cases.)
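As a sketch, the collision correction amounts to one line, with E the raw estimate of (1) and H the number of hashed bits (natural logarithm, as in the formula above):

    import math

    def collision_corrected(E, H):
        """Invert the expected-collision count of a 2**H-cell hash table:
        replaces the raw estimate E by -2**H * log(1 - E / 2**H)."""
        return -(2 ** H) * math.log(1.0 - E / 2 ** H)
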
Risk analysis. For the pure LogLog algorithm, the estimate is an empirical
mean of random variables that are approximately identically distributed (up
to statistical fluctuations in bucket sizes). From there, it can be proved that
the quantity (1/m) Σ_j M^(j) is numerically closely approximated by a Gaussian.
Consequently, the estimate returned is very roughly Gaussian: at any rate, it
has exponentially decaying tails. (In principle, a full analysis would be feasible.)
A similar property is expected for the super-LogLog algorithm since it is based
on the same principles. As a consequence, we obtain the following pragmatic
conclusion:
6 Conclusions
(The observed error was found to depend monotonically on memory size (m) in the
case of single runs on a given piece of data.)
As we have strived to demonstrate, the LogLog algorithm in its optimized
version performs quite well. The following table (grossly) summarizes the accu-
racy (measured by standard error σ) in relation to the storage used for the major
methods known. Note that different algorithms operate with different memory
units.
Algorithm            Std. Err. (σ)   Memory units                n = 10^8, σ = 0.02
Adaptive Sampling    1.20/√m         Records (≥ 24-bit words)    10.8 kbytes
Prob. Counting       0.78/√m         Words (24–32 bits)          6.0 kbytes
Multires. Bitmap     ≈ 4.4/√m        Bits                        4.8 kbytes
LogLog               1.30/√m         “Small bytes” (5 bits)      2.1 kbytes
Super-LogLog         1.05/√m         “Small bytes” (5 bits)      1.7 kbytes
The last column is a rough indication of the storage requirement for an accuracy
of 2% and a file of cardinality 10^8. (The formula for Multiresolution Bitmap is
a crude extrapolation based on data of [3].)
Distributing or parallelizing the algorithm is trivial: it suffices to have dif-
ferent processors (sharing the same hash function) operate on different slices of
the data and then “max–merge” their tables of registers. Optimal speed-up is
clearly attained and interprocess communication is limited to just a few kilo-
bytes. Requirements for an embedded hardware design are absolutely minimal
as only addressing, register comparisons, and integer addition are needed.
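The “max–merge” step is literally a component-wise maximum of the register tables; a sketch:

    def max_merge(tables):
        """Combine register tables produced by processors on disjoint slices of the data
        (all using the same hash function and the same m)."""
        return [max(regs) for regs in zip(*tables)]

The merged table is then fed to the same estimator formula (1) as in the single-processor case.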
References
1. Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating
the frequency moments. Journal of Computer and System Sciences 58 (1999), 137–
147.
2. Estan, C., and Varghese, G. New directions in traffic measurement and ac-
counting. In Proceedings of SIGCOMM 2002 (2002), ACM Press. (Also: UCSD
technical report CS2002-0699, February 2002; available electronically.)
3. Estan, C., Varghese, G., and Fisk, M. Bitmap algorithms for counting active
flows on high speed links. Technical Report CS2003-0738, UCSD, Mar. 2003.
4. Flajolet, P. Approximate counting: A detailed analysis. BIT 25 (1985), 113–134.
5. Flajolet, P. On adaptive sampling. Computing 34 (1990), 391–400.
6. Flajolet, P., Gourdon, X., and Dumas, P. Mellin transforms and asymptotics:
Harmonic sums. Theoretical Computer Science 144, 1-2 (1995), 3–58.
7. Flajolet, P., and Martin, G. N. Probabilistic counting algorithms for data base
applications. Journal of Computer and System Sciences 31, 2 (1985), 182–209.
8. Gibbons, P. B., Poosala, V., Acharya, S., Bartal, Y., Matias, Y.,
Muthukrishnan, S., Ramaswamy, S., and Suel, T. AQUA: System and tech-
niques for approximate query answering. Tech. report, Bell Laboratories, Murray
Hill, New Jersey, Feb. 1998.
9. Jacquet, P., and Szpankowski, W. Analytical depoissonization and its appli-
cations. Theoretical Computer Science 201, 1–2 (1998).
10. Knuth, D. E. The Art of Computer Programming, 2nd ed., vol. 3: Sorting and
Searching. Addison-Wesley, 1998.
11. Morris, R. Counting large numbers of events in small registers. Communications
of the ACM 21 (1978), 840–842.
12. Motwani, R., and Raghavan, P. Randomized Algorithms. Cambridge University
Press, 1995.
13. Palmer, C. R., Siganos, G., Faloutsos, M., Faloutsos, C., and Gibbons, P. The
connectivity and fault-tolerance of the Internet topology. In Workshop on Network-Related
Data Management (NRDM-2001).
14. Prodinger, H. Combinatorics of geometrically distributed random variables: Left-
to-right maxima. Discrete Mathematics 153 (1996), 253–270.
15. Szpankowski, W. Average-Case Analysis of Algorithms on Sequences. John Wiley,
New York, 2001.
16. Whang, K.-Y., Zanden, B. T. V., and Taylor, H. M. A linear-time proba-
bilistic counting algorithm for database applications. TODS 15, 2 (1990), 208–229.