Streaming Algorithms
Recall that a data structure is simply a method for laying out data in memory to support efficiently
answering (certain types of) queries, as well as possibly updates. For example, for the union-find
problem the disjoint forest data structure laid out the “data”, i.e. the set of unions that had been
done, in such a way as to allow for efficient answering of Find queries and Union updates. A data
structure is called dynamic if it supports both updates and queries, and static if it only supports
queries.
The term streaming algorithm simply refers to a dynamic data structure (data is updated and
queried) with the connotation that the data structure should only use sublinear memory. That is,
if it takes some number b bits of memory to represent the data, we would like our data structure
to still allow answering the desired queries while only using o(b) bits.
As a simple example, consider the case that some online vendor performs a sequence of sales and
wants to estimate gross revenue. That is, we would like to allow for inserting new sale transactions
into a database while supporting a single query: return the total gross sales so far. Even if there
are n transactions, there is a clear solution using o(n) memory: initialize a counter C to zero, and
every time there is a sale for some price p, perform the update C ← C + p. Thus our total memory
usage is only a single counter, which does not depend on n! This is a very simple example, but as
we shall soon see, there are other examples for which non-trivial solutions exist. In this lecture, we
focus on the problems of approximate counting and estimating the number of distinct elements.¹
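For concreteness, the gross-revenue example amounts to nothing more than the following Python sketch (an illustration only; the class and method names are not from the notes):

    class RevenueCounter:
        """Running total of gross sales: memory is a single number, independent of n."""
        def __init__(self):
            self.C = 0.0

        def sale(self, p):
            # update: a sale at price p
            self.C += p

        def total(self):
            # query: total gross sales so far
            return self.C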
1 Approximate Counting
Problem. Our algorithm must monitor a sequence of events, then at any given time output (an
estimate of) the number of events thus far. More formally, this is a data structure maintaining a
single integer n (initially set to zero) and supporting the following two operations:
• update(): increments n by 1

• query(): returns (an estimate of) n

Thus n is just the number of updates performed so far. Of course a trivial algorithm maintains n
exactly using ⌈log n⌉ bits of memory (a counter).
Our goal is to use much less space than this. It is not too hard to prove that it is impossible to
solve this problem exactly using o(log n) bits of space. More specifically, if the answer may be
any integer between 0 and N, then distinct answers require distinct memory configurations, and
since there are only 2^S distinct memory configurations with S bits, we must have 2^S ≥ N, i.e.
S ≥ ⌈log N⌉. Thus we would like to answer query() with some estimate ñ of n satisfying

    P(|ñ − n| > εn) < δ,                                                         (1)

for some 0 < ε, δ < 1 that are given to the algorithm up front. That is, we relax the requirement for
exact maintenance of n to only approximate maintenance. Here the probability is taken over the
randomness of the algorithm, so that we are trying to develop a randomized Monte Carlo solution
that has some probability of failure.

Figure 1: A deterministic scheme that increments the counter only on every other update.
1. init(): X ← 0, parity ← 0
2. update():
   (a) if parity = 1: X ← X + 1
   (b) parity ← 1 − parity
3. query(): return 2X

¹The exposition below makes heavy use of previous notes on the matter written by Prof. Nelson at
http://people.eecs.berkeley.edu/~minilek/tum2016/index.html
How might we do better than using ⌈log n⌉ bits to store n in memory? Imagine the following
idea: what if we did not store n, but rather stored a counter X = ⌊n/2⌋? Then to answer queries,
we simply return ñ = 2X. Since ⌊n/2⌋ takes one less bit to store than n, we have saved a bit! Plus,
our answer is only ever off by at most an additive 1 (when n is odd). This idea, of course, has a
major problem: to implement it, we need to increment X on every other update; say, we do so on
the even updates but not the odd ones. But how can we know the parity of the number of updates
so far? Well, we could store that in an extra bit (see Figure 1)! But this is quite silly, since now
we are back to using ⌈log n⌉ bits in total. An idea around this is to use randomness:
during each update we simply flip a coin. If heads, we increment X; else if tails, we do nothing.
Then X is a random variable with E X = n/2, so ñ = 2X is equal to n in expectation.
The above idea is a good starting point, but it has two main issues, which we address shortly. The
first issue is that we were only able to save one bit of memory, whereas we want truly sublinear
memory, i.e. o(log n) bits. The second issue is that, as discussed above, we want |n − ñ| ≤ εn. In other
words, we want ñ to be in the range [(1 − ε)n, (1 + ε)n]. Think of ε being say 0.01, so that we
want 1% relative error. The described method cannot possibly achieve this when n is small. For
example, consider what happens after a single update, so that n = 1. Either the first coin flip will
be heads, in which case we have X = 1, or it will be tails, in which case X = 0. Thus we will either
have ñ = 2 or ñ = 0, neither of which is within 1% of the true value n = 1.
1.1 Morris' algorithm

Both these issues were remedied by an algorithm of Robert Morris [2], which provides an estimate ñ
satisfying (1) for some ε, δ that we will analyze shortly. The algorithm works as follows:
1. Initialize X ← 0.

2. For each update, increment X with probability 1/2^X.

3. For a query, output ñ = 2^X − 1.
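To make the scheme concrete, here is a minimal Python sketch of Morris' counter (an illustration, not part of the original notes; the class and method names are arbitrary):

    import random

    class MorrisCounter:
        """Morris' approximate counter: stores only X, roughly log2 of the count."""
        def __init__(self):
            self.X = 0

        def update(self):
            # increment X with probability 1/2^X
            if random.random() < 2.0 ** (-self.X):
                self.X += 1

        def query(self):
            # return the estimate 2^X - 1
            return 2 ** self.X - 1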
Let X_n denote the value of X after n updates, so that the estimator returned by a query is
ñ = 2^{X_n} − 1. The following claim shows that ñ is an unbiased estimator of n.

Claim 1. E 2^{X_n} = n + 1.

Proof. We prove the claim by induction on n. The base case n = 0 is clear, since X_0 = 0 and thus
E 2^{X_0} = 1. For the inductive step, we have

    E 2^{X_{n+1}} = Σ_{j=0}^∞ P(X_n = j) · E(2^{X_{n+1}} | X_n = j)
                  = Σ_{j=0}^∞ P(X_n = j) · [ 2^j · (1 − 1/2^j) + (1/2^j) · 2^{j+1} ]            (2)
                  = Σ_{j=0}^∞ P(X_n = j) · 2^j + Σ_{j=0}^∞ P(X_n = j)
                  = E 2^{X_n} + 1
                  = (n + 1) + 1,

where the two terms in the bracket correspond to not incrementing X (probability 1 − 1/2^j) and
incrementing X (probability 1/2^j), respectively.
To obtain a bound of the form (1), we will use Chebyshev's inequality.

Claim 2 (Chebyshev's inequality). For any random variable X and any λ > 0,

    P(|X − E X| > λ) ≤ E(X − E X)² / λ².                                         (3)

Proof. P(|X − E X| > λ) = P((X − E X)² > λ²), and thus the claim follows by applying Markov's
inequality to the nonnegative random variable (X − E X)².

Recall that the value E(X − E X)² is called the variance of X, Var[X]. Now in order to show (1), we
will also need to control the variance of Morris' estimator. This is because, by Chebyshev's inequality,

    P(|ñ − n| > εn) < (1/(ε²n²)) · E(ñ − n)² = (1/(ε²n²)) · E(2^X − 1 − n)².       (4)
When we expand the above square, we find that we need to control E 2^{2X_n}. The proof of the
following claim is by induction, similar to that of Claim 1.
Claim 3.

    E 2^{2X_n} = (3/2)n² + (3/2)n + 1.                                             (5)

The inductive step is the computation

    E 2^{2X_{n+1}} = Σ_{j=0}^∞ P(X_n = j) · [ 2^{2j} · (1 − 1/2^j) + (1/2^j) · 2^{2(j+1)} ]
                   = E 2^{2X_n} + 3 · E 2^{X_n}
                   = (3/2)n² + (3/2)n + 1 + (3n + 3)
                   = (3/2)(n + 1)² + (3/2)(n + 1) + 1.
This implies E(ñ − n)² = (E ñ²) − (E ñ)² = (1/2)n² − (1/2)n < (1/2)n², and thus

    P(|ñ − n| > εn) < (1/(ε²n²)) · (n²/2) = 1/(2ε²),                               (6)
which is not particularly meaningful since the right hand side is only better than 1/2 failure
probability when ε ≥ 1 (which means the estimator may very well always be 0!).
1.2 Morris+
To decrease the failure probability of Morris' basic algorithm, we instantiate s independent copies of
Morris' algorithm and average their outputs. That is, we obtain independent estimators ñ_1, . . . , ñ_s
from independent instantiations of Morris' algorithm, and our response to a query is

    ñ = (1/s) · Σ_{i=1}^s ñ_i .
By linearity of expectation,
    E ñ = E[ (1/s) · Σ_{i=1}^s ñ_i ]
        = (1/s) · Σ_{i=1}^s E ñ_i
        = (1/s) · s · n
        = n.                                                                       (7)
Also, for independent random variables X, Y and scalars a, b it holds that Var[aX + bY ] =
a² · Var[X] + b² · Var[Y ]. Thus
" s #
1X
Var[ñ] = Var ñi
s
i=1
s
1 X
= 2 Var[ñi ]
s
i=1
1 1 2 1
= 2 ·s· n − n−1
s 2 2
1 1 2
< · n . (8)
s 2
Thus the right hand side of (6) becomes

    P(|ñ − n| > εn) < 1/(2sε²) < δ

for s > 1/(2ε²δ) = Θ(1/(ε²δ)).
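To make the averaging concrete, here is a short Python sketch of Morris+ (again an illustration, not from the original notes), assuming the MorrisCounter class sketched earlier is in scope; the choice s = ⌈1/(2ε²δ)⌉ follows the analysis above:

    import math

    class MorrisPlus:
        """Average s independent Morris counters to push the failure probability below delta."""
        def __init__(self, eps, delta):
            s = math.ceil(1 / (2 * eps ** 2 * delta))  # number of independent copies
            self.counters = [MorrisCounter() for _ in range(s)]

        def update(self):
            for c in self.counters:
                c.update()

        def query(self):
            # average of the individual estimates
            return sum(c.query() for c in self.counters) / len(self.counters)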
Overall space complexity. When we unravel Morris+, it is running a total of s = Θ(1/(ε²δ))
instantiations of the basic Morris algorithm. Now note that once any given Morris counter X
reaches the value lg(sn/δ²), the probability that it is incremented at any given moment is at
most δ²/(ns). Thus the probability that it is incremented at all in the next n increments is
at most δ²/s by a union bound (recall that the union bound states that for any collection of
events E_1, . . . , E_r, P(∪_i E_i) ≤ Σ_i P(E_i)). Thus by another union bound, with probability 1 − δ
none of the s basic Morris instantiations ever stores a value larger than lg(sn/δ²), which takes
O(log log(sn/δ²)) = O(log log(sn/δ)) bits. Thus the total space complexity is, with probability
1 − δ, at most

    O(ε⁻² δ⁻¹ log log(n/(εδ)))
bits. In particular, for constant ε, δ (say each 1/100), the total space complexity is O(log log n)
with constant probability. This is exponentially better than the log n space achieved by storing a
counter!
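As a rough numerical illustration (not from the original notes): for ε = δ = 1/100, the analysis asks for

    s = 1/(2ε²δ) = 1/(2 · 10⁻⁴ · 10⁻²) = 5 · 10⁵

independent counters, each of which, with high probability, never exceeds lg(sn/δ²) and hence fits in O(log log n) bits; for such constant ε, δ it is this 1/(ε²δ) factor, rather than the log log n, that dominates the space in practice.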
2 Distinct Elements

In this section we will consider the distinct elements problem, also known as the F₀ problem, defined
as follows.
Problem. Our sequence of updates is a stream of integers i_1, . . . , i_m ∈ [n], where [n] denotes the
set {1, 2, . . . , n}. We would like to output the number of distinct elements seen in the stream.
As with Morris’ approximate counting algorithm, our goal will be to minimize our space consump-
tion. There are two straightforward solutions as follows:
1. Solution 1: keep a bit array of length n, initialized to all zeroes. Set the ith bit to 1 whenever
i is seen in the stream (n bits of memory).

2. Solution 2: store the whole stream in memory explicitly (⌈m log₂ n⌉ bits of memory).

We can thus solve the problem exactly using min{n, ⌈m log₂ n⌉} bits of memory.
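For reference, a minimal Python sketch of Solution 1 (illustrative names, not from the notes):

    def count_distinct_exact(stream, n):
        """Exact distinct-element count using an n-bit array (Solution 1)."""
        seen = [False] * n           # n bits of memory
        for i in stream:
            seen[i - 1] = True       # elements are drawn from [n] = {1, ..., n}
        return sum(seen)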
Like with Morris’ algorithm, we will instead settle for computing some value t̃ s.t. P(|t− t̃| > εt) < δ,
where t denotes the number of distinct elements in the stream. The first work to show that this is
possible using small memory, i.e. o(n) bits (assuming access to a random function that is free to
store), is due to Flajolet and Martin (FM) [1].
2.1 An idealized algorithm

The idealized FM algorithm works as follows:

1. Pick a uniformly random function h : [n] → [0, 1].

2. Upon each update i, set X ← min{X, h(i)}, so that X is the smallest hash value of any
element seen so far (initially X = 1).

3. For a query, output t̃ = 1/X − 1.

Note this algorithm really is idealized, since we cannot afford to store a truly random such function
h in o(n) bits (first, because there are n independent random variables (h(i))_{i=1}^n, and second because
its outputs are real numbers).
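Here is a short Python sketch of the idealized algorithm (illustration only; the dictionary of lazily drawn random values stands in for the idealized assumption that the random function h is stored for free):

    import random

    class IdealizedFM:
        """Idealized Flajolet-Martin sketch: track the minimum hash value seen so far."""
        def __init__(self):
            self.h = {}     # lazily sampled "random function" h : [n] -> [0, 1]
            self.X = 1.0    # minimum hash value seen so far

        def update(self, i):
            if i not in self.h:
                self.h[i] = random.random()
            self.X = min(self.X, self.h[i])

        def query(self):
            # estimate of the number of distinct elements
            return 1.0 / self.X - 1.0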
Intuition. Note that the value X stored by the algorithm is a random variable that is the
minimum of t i.i.d. Unif(0, 1) random variables. The key claim is then the following, in which we
suppose that the distinct integers appearing in the stream are i_1, . . . , i_t. If X behaves as we expect,
then since X ≈ 1/(t + 1), we naturally obtain the estimator above as t̃ = 1/X − 1 ≈ 1/(1/(t + 1)) − 1 = t.
Claim 4. E X = 1/(t + 1).
Proof.
    E X = ∫_0^∞ P(X > λ) dλ
        = ∫_0^∞ P(∀i ∈ stream, h(i) > λ) dλ
        = ∫_0^∞ ∏_{r=1}^t P(h(i_r) > λ) dλ
        = ∫_0^1 (1 − λ)^t dλ
        = 1/(t + 1).
We will also need the following claim in order to apply Chebyshev's inequality to bound the
failure probability in our final algorithm.

Claim 5. E X² = 2/((t + 1)(t + 2)).
Proof.
    E X² = ∫_0^1 P(X² > λ) dλ
         = ∫_0^1 ( P(X > √λ) + P(X < −√λ) ) dλ        (the second term is 0 since X is nonnegative)
         = ∫_0^1 (1 − √λ)^t dλ
         = 2 ∫_0^1 u^t (1 − u) du                      (substitution u = 1 − √λ)
         = 2/((t + 1)(t + 2)).

This gives Var[X] = E X² − (E X)² = t/((t + 1)²(t + 2)), or the simpler bound Var[X] < (E X)² =
1/(t + 1)².
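As a quick numerical sanity check of Claims 4 and 5 (an illustration, not part of the notes), one can estimate E X and E X² by simulating the minimum of t i.i.d. Unif(0, 1) variables:

    import random

    def min_uniform_moments(t, trials=200_000):
        """Empirical estimates of E[X] and E[X^2] for X = min of t i.i.d. Unif(0,1)."""
        s1 = s2 = 0.0
        for _ in range(trials):
            x = min(random.random() for _ in range(t))
            s1 += x
            s2 += x * x
        return s1 / trials, s2 / trials

    # e.g. min_uniform_moments(9) should return values near 1/10 = 0.1 and 2/110 ≈ 0.018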
2.2 FM+
As with Morris+, we run s independent copies of the idealized FM algorithm (with independent
random functions h_1, . . . , h_s), obtaining values X_1, . . . , X_s, and answer a query with t̃ = 1/Z − 1,
where Z = (1/s) · Σ_{i=1}^s X_i. We have E Z = 1/(t + 1) and

    Var[Z] = (1/s) · t/((t + 1)²(t + 2)) < 1/(s(t + 1)²),

via computations similar to (7) and (8).
Claim 6. For s ≥ 1/(ε²δ), P(|Z − 1/(t + 1)| > ε/(t + 1)) < δ.

Proof. By Chebyshev's inequality,

    P(|Z − 1/(t + 1)| > ε/(t + 1)) < ((t + 1)²/ε²) · (1/(s(t + 1)²)) = 1/(sε²) ≤ δ.
Claim 7. P(|(1/Z − 1) − t| > O(ε)t) ≤ δ.

This follows from Claim 6: when |Z − 1/(t + 1)| ≤ ε/(t + 1) and, say, ε ≤ 1/2, we have
1/Z = (t + 1)(1 ± O(ε)), so |(1/Z − 1) − t| ≤ O(ε)(t + 1) = O(ε)t (for t ≥ 1).
Note that ignoring the space to store h, and with the (unrealistic) assumption that real numbers
fit into a single machine word, the total space is O(1/(ε²δ)).
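Continuing the illustration, here is a sketch of FM+ built on the IdealizedFM class above, with s = ⌈1/(ε²δ)⌉ as suggested by Claim 6 (names illustrative):

    import math

    class FMPlus:
        """Average s independent idealized FM sketches and answer 1/Z - 1."""
        def __init__(self, eps, delta):
            s = math.ceil(1 / (eps ** 2 * delta))  # number of independent copies
            self.copies = [IdealizedFM() for _ in range(s)]

        def update(self, i):
            for fm in self.copies:
                fm.update(i)

        def query(self):
            Z = sum(fm.X for fm in self.copies) / len(self.copies)
            return 1.0 / Z - 1.0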
We will not show proofs here, but it is possible to show that both idealized assumptions can
be removed. First, one can naturally not use continuous random variables but rather discretize
[0, 1] into integer multiples of some small value 1/B. That is, if we have a random function
g : [n] → {0, 1, . . . , B}, we can approximate h(i) as g(i)/B. Next, let’s think about what it means
that g is a random function with the specified domain and range. Let H be the set of all (B + 1)^n
functions mapping [n] to {0, 1, . . . , B}. Then we are picking a uniformly random element of H,
which requires log |H| = n log(B + 1) bits to specify. We will instead not use a uniformly random
function chosen from the set of all functions, but rather from a much smaller set of functions H′ such
that |H′| ≪ |H| (in fact we will have log |H′| = O(log(nB)), so that h can be stored using only
O(log(nB)) bits). Thus h is not truly uniformly random but rather pseudorandom: just random
enough to hopefully still be able to use to prove correctness of our algorithm. In fact one notion of
pseudorandomness that suffices for this application is for H′ to be what is called a universal hash
family; we will not discuss this further in CS 170, but you will see more on universal hashing in
just a couple more lectures!
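To make the discretization concrete, here is an illustrative sketch (not the scheme analyzed in these notes) that replaces the ideal h with h(i) = g(i)/B, where g is drawn from a simple multiply-mod-prime family; treat this particular family as a stand-in for the universal hash families discussed in the upcoming lectures:

    import random

    class DiscretizedFM:
        """FM with h(i) derived from a small pseudorandom g instead of an ideal random function."""
        def __init__(self, B=1 << 20, p=2_147_483_647):  # p is a prime (2^31 - 1), assumed > n and > B
            self.B = B
            self.p = p
            self.a = random.randrange(1, p)  # the pair (a, b) describes g using only O(log(nB)) bits
            self.b = random.randrange(0, p)
            self.X = 1.0

        def _h(self, i):
            g = ((self.a * i + self.b) % self.p) % (self.B + 1)  # g : [n] -> {0, ..., B}
            return (g + 1) / (self.B + 1)                        # shift so that h(i) > 0

        def update(self, i):
            self.X = min(self.X, self._h(i))

        def query(self):
            return 1.0 / self.X - 1.0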
References
[1] Philippe Flajolet, G. Nigel Martin. Probabilistic counting algorithms for data base applica-
tions. J. Comput. Syst. Sci., volume 31, number 2, pages 182–209, 1985.
[2] Robert Morris. Counting Large Numbers of Events in Small Registers. Commun. ACM, volume
21, number 10, pages 840–842, 1978.