Heavy Hitters
• For i = 1 to n:
– If counter == 0:
[In this case, there is no frontrunner.]
∗ current := A[i]
∗ counter++
– else if A[i]==current:
[In this case, our confidence in the current frontrunner goes up.]
∗ counter++
– else
[In this case, our confidence in the current frontrunner goes down.]
∗ counter--
• Return current
For example, suppose the input is the array {2, 1, 1}. The first iteration of the algorithm
makes “2” the current guess of the majority element, and sets the counter to 1. The next
element decreases the counter back to 0 (since 1 ≠ 2). The final iteration resets the current
guess to “1” (with counter value 1), which is indeed the majority element.
More generally, the algorithm correctly computes the majority element of any array that
possesses one. We encourage you to formalize a proof of this statement (e.g., by induction
on n). The intuition is that each entry of A that contains a non-majority-value can only
“cancel out” one copy of the majority value. Since more than n/2 of the entries of A contain
the majority value, there is guaranteed to be a copy of it left standing at the end of the
algorithm.
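A minimal Python rendering of the pseudocode above (the function name is my choice) might look as follows; it returns the majority element whenever one exists, and some arbitrary entry of the array otherwise.

    def majority_candidate(A):
        current, counter = None, 0
        for x in A:
            if counter == 0:        # no frontrunner
                current, counter = x, 1
            elif x == current:      # confidence in the frontrunner goes up
                counter += 1
            else:                   # confidence in the frontrunner goes down
                counter -= 1
        return current

    # e.g., majority_candidate([2, 1, 1]) returns 1, as in the example above.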
But so what? It’s a cute algorithm, but isn’t this just a toy problem? It is, but modest
generalizations of the problem are quite close to problems that people really want to solve
in practice.
1. Computing popular products. For example, A could be all of the page views of products
on amazon.com yesterday. The heavy hitters are then the most frequently viewed
products.
2. Computing frequent search queries. For example, A could be all of the searches on Google yesterday. The heavy hitters are then the searches made most often.

3. Identifying heavy TCP flows. Here, A is a list of data packets passing through a network switch, each annotated with a source-destination pair of IP addresses. The heavy hitters are then the flows that are sending the most traffic. This is useful for, among other things, identifying denial-of-service attacks.

Footnote 1: A similar problem is the "top-k problem," where the goal is to output the k values that occur with the highest frequencies. The algorithmic ideas introduced in this lecture are also relevant for the top-k problem.

Footnote 2: You wouldn't expect there to be a majority element in any of these applications, but you might expect a non-empty set of heavy hitters when k is 100, 1000, or 10000.
Fact 1.1 There is no algorithm that solves the Heavy Hitters problem in one pass while using a sublinear amount of auxiliary space.
We next explain the intuition behind Fact 1.1. We encourage you to devise a formal
proof, which follows the same lines as the intuition.
Set k = n/2, so that our responsibility is to output any values that occur at least twice in the input array A.4

Footnote 3: Rather than thinking of the array A as an input fully specified in advance, we can alternatively think of the elements of A as a "data stream," which are fed to a "streaming algorithm" one element at a time. One-pass algorithms that use small auxiliary space translate to streaming algorithms that need only small working memory. One use case for streaming algorithms is when data arrives at such a fast rate that explicitly storing it is absurd. For example, this is often the reality in the motivating example of data packets traveling through a network switch. A second use case is when, even though data can be stored in its entirety and fully analyzed (perhaps as an overnight job), it's still useful to perform lightweight analysis on the arriving data in real time. The first two applications (popular transactions or search queries) are examples of this.

Footnote 4: A simple modification of this argument extends the impossibility result to all interesting values of k — can you figure it out?

Suppose A has the form

A = (x_1, x_2, . . . , x_{n−1}, y),
where x_1, . . . , x_{n−1} are an arbitrary set S of distinct elements (in {1, 2, . . . , n²}, say) and the
final entry y may or may not be in S. By definition, we need to output y if and only if
y ∈ S. That is, answering membership queries reduces to solving the Heavy Hitters problem.
By the “membership problem,” we mean the task of preprocessing a set S to answer queries
of the form “is y ∈ S”? (A hash table is the most common solution to this problem.) It is
intuitive that you cannot correctly answer all membership queries for a set S without storing
S (thereby using linear, rather than constant, space) — if you throw some of S out, you
might get a query asking about the part you threw out, and you won’t know the answer.
It’s not too hard to make this idea precise using the Pigeonhole Principle.5
What prevents us from taking ε = 0 and solving the exact version of the problem? We allow the space used by a solution to grow as 1/ε, so as ε ↓ 0 the space blows up (as is necessary, by Fact 1.1).

For example, suppose we take ε = 1/(2k). Then, the algorithm outputs every value with frequency count at least n/k, and only values with frequency count at least n/(2k). Thinking back to the motivating examples in Section 1.2, such an approximate solution is essentially as useful as an exact solution. Space usage O(1/ε) = O(k) is also totally palatable; after all, the output of the heavy hitters or ε-HH problem already might be as large as k elements.
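Spelling out the arithmetic behind that guarantee: with ε = 1/(2k), any value the algorithm outputs has frequency count at least

\[
\frac{n}{k} - \epsilon n \;=\; \frac{n}{k} - \frac{n}{2k} \;=\; \frac{n}{2k}.
\]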
Footnote 5: Somewhat more detail: if you always use sublinear space to store the set S, then you need to reuse exactly the same memory contents for two different sets S_1 and S_2. Your membership query answers will be the same in both cases, and in one of these cases some of your answers will be wrong.
Footnote 6: This impossibility result (Fact 1.1) and our response to it (the ε-HH problem) serve as reminders that the skilled algorithm designer is respectful of but undaunted by impossibility results that limit what algorithms can do. For another example, recall your study in CS161 of methods for coping with NP-complete problems.
2 The Count-Min Sketch
2.1 Discussion
This section presents an elegant small-space data structure, the count-min sketch [5], that
can be used to solve the ε-HH problem. There are also several other good solutions to the
problem, including some natural “counter-based” algorithms that extend the algorithm in
Section 1.1 for computing a majority element [7, 6]. We focus on the count-min sketch for a
number of reasons.
1. It has been implemented in real systems. For example, AT&T has used it in network
switches to perform analyses on network traffic using limited memory [4]. At Google, a
precursor of the count-min sketch (called the “count sketch” [3]) has been implemented
on top of their MapReduce parallel processing infrastructure [8]. One of the original
motivations for this primitive was log analysis (e.g., of source code check-ins), but
presumably it is now used for lots of different analyses.
2. The data structure is based on hashing, and as such fits in well with the current course
theme.
3. The data structure introduces a new theme, present in many of the next few lectures,
of “lossy compression.” The goal here is to throw out as much of your data as possible
while still being able to make accurate inferences about it. What you want to keep
depends on the type of inference you want to support. For approximately preserving
frequency counts, the count-min sketch shows that you can throw out almost all of
your data!
We’ll only discuss how to use the count-min sketch to solve the approximate heavy hitters
problem, but it is also useful for other related tasks (see [5] for a start). Another reason
for its current popularity is that its computations parallelize easily — as we discuss its
implementation, you might want to think about this point.
Hash tables also offer a good solution to the membership problem, so why bother with
a bloom filter? The primary motivation is to save space — a bloom filter compresses the
stored set more than a hash table. In fact, the compression is so extreme that a bloom filter
cannot possibly answer all membership queries correctly. That’s right, it’s a data structure
that makes errors. Its errors are “one-sided,” with no false negatives (so if you inserted
an element, the bloom filter will always confirm it) but with some false positives (so there
are “phantom elements” that the data structure claims are present, even though they were
never inserted). For instance, using 8 bits per stored element — well less than the space
required for a pointer, for example — bloom filters can achieve a false positive probability
less than 2%. More generally, bloom filters offer a smooth trade-off between the space used
and the false positive probability. Both the insertion and lookup operations are super-fast
(O(1) time) in a bloom filter, and what little work there is can also be parallelized easily.
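To make this description concrete, here is a minimal Python sketch of a bloom filter; the number of bits, the number of hash functions, and the hashlib-based hashing are illustrative choices on my part, not prescriptions from these notes.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.m = num_bits
            self.k = num_hashes
            self.bits = [0] * num_bits

        def _h(self, i, x):
            # i-th hash function: hash x salted with the index i
            digest = hashlib.sha256(f"{x}#{i}".encode()).hexdigest()
            return int(digest, 16) % self.m

        def insert(self, x):
            for i in range(self.k):
                self.bits[self._h(i, x)] = 1    # set k bits, whether or not already set

        def lookup(self, x):
            # no false negatives: if x was inserted, all k of its bits are set;
            # false positives occur when other insertions happen to set all k bits
            return all(self.bits[self._h(i, x)] == 1 for i in range(self.k))

For example, a filter for roughly 1000 elements at 8 bits per element might be created as BloomFilter(num_bits=8000, num_hashes=5); the choice of 5 hash functions here is an illustrative guess, not a tuned recommendation.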
Bloom filters were invented in 1970 [1], back when space was at a premium for everything,
even spellcheckers. This century, bloom filters have gone viral in the computer networking
community [2]. Saving space is still a big win in many networking applications, for example
by making better use of the scarce main memory at a router or by reducing the amount of
communication required to implement a network protocol.
Bloom filters serve as a role model for the count-min sketch in two senses. First, bloom
filters offer a proof of concept that sacrificing a little correctness can yield significant space
savings. Note this is exactly the trade-off we’re after: Fact 1.1 states that exactly solving the
heavy hitters problem requires linear space, and we’re hoping that by relaxing correctness
— i.e., solving the ε-HH problem instead — we can use far less space. Second, at a technical
level, if you remember how bloom filters are implemented, you’ll recognize the count-min
sketch implementation as a bird of the same feather.
Figure 1: Running Inc(x) on the CMS data structure. Each row corresponds to a hash
function hi .
The point of b is to compress the array A (since b ≪ n). This compression leads to errors. The point of ℓ is to implement a few "independent trials," which allows us to reduce the error. What's important, and kind of amazing, is that these parameters are independent of the length n of the array that we are processing (recall n might be in the billions, or even larger).
The data structure is just an ℓ × b 2-D array CMS of counters (initially all 0). See Figure 1. After choosing ℓ hash functions h_1, . . . , h_ℓ, each mapping the universe of objects to {1, 2, . . . , b}, the code for Inc(x) is simply:

• for i = 1, 2, . . . , ℓ:
  – increment CMS[i][h_i(x)]

Assuming that every hash function can be evaluated in constant time, the running time of the operation is clearly O(ℓ).
To motivate the implementation of Count(x), fix a row i ∈ {1, 2, . . . , ℓ}. Every time Inc(x) is called, the same counter CMS[i][h_i(x)] in this row gets incremented. Since counters are never decremented, we certainly have

CMS[i][h_i(x)] ≥ f_x ,   (1)

where f_x denotes the frequency count of object x. If we're lucky, then equality holds in (1). In general, however, there will be collisions: objects y ≠ x with h_i(y) = h_i(x). (Note with b ≪ n, there will be lots of collisions.) Whenever Inc(y) is called for an object y that collides with x in row i, this will also increment the same counter CMS[i][h_i(x)]. So while CMS[i][h_i(x)] cannot underestimate f_x, it generally overestimates f_x.
The ℓ rows of the count-min sketch give ℓ different estimates of f_x. How should we aggregate these estimates? Later in the course, we'll see scenarios where using the mean or the median is a good way to aggregate. Here, our estimates suffer only one-sided error — all of them can only be bigger than the number f_x we want to estimate, and so it's a no-brainer which estimate we should pay attention to. The smallest of the estimates is clearly the best estimate. Thus, the code for Count(x) is simply:

• return min over i = 1, 2, . . . , ℓ of CMS[i][h_i(x)]

Footnote: one way to implement the hash functions h_1, . . . , h_ℓ from a single hash function h is to define h_i(x) as the hash (under h) of the string formed by x with "i" appended to it.
The running time is again O(ℓ). By (1), the data structure has one-sided error — it only returns overestimates of true frequency counts, never underestimates. The key question is obviously: how large are typical overestimates? The answer depends on how we set the parameters b and ℓ. As b gets bigger, we'll have fewer collisions and hence less error. As ℓ gets bigger, we'll take the minimum over more independent estimates, resulting in tighter estimates. Thus the question is whether or not modest values of b and ℓ are sufficient to guarantee that the overestimates are small. This is a quantitative question that can only be answered with mathematical analysis; we do this in the next section (and the answer is yes!).
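To connect the Inc and Count pseudocode to working code, here is a minimal Python sketch of the data structure. The hashlib-based hash functions (salting a single hash with the row index i, as in the footnote above) and the constructor's parameters are illustrative assumptions on my part; the analysis below explains how b and ℓ should be chosen.

    import hashlib

    class CountMinSketch:
        def __init__(self, b, ell):
            self.b = b                # number of buckets per row
            self.ell = ell            # number of rows (hash functions)
            self.cms = [[0] * b for _ in range(ell)]

        def _h(self, i, x):
            # row i's hash function: hash the string formed by x with i appended
            digest = hashlib.sha256(f"{x}#{i}".encode()).hexdigest()
            return int(digest, 16) % self.b

        def inc(self, x):
            # for i = 1, ..., ell: increment CMS[i][h_i(x)]
            for i in range(self.ell):
                self.cms[i][self._h(i, x)] += 1

        def count(self, x):
            # return the smallest of the ell estimates; each one only overestimates f_x
            return min(self.cms[i][self._h(i, x)] for i in range(self.ell))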
Remark 2.1 (Comparison to Bloom Filters) The implementation details of the count-min sketch are very similar to those of a bloom filter. The latter structure only uses bits, rather than integer-valued counters. When an object is inserted into a bloom filter, ℓ hash functions indicate ℓ bits that should be set to 1 (whether or not they were previously 0 or 1). The count-min sketch, which is responsible for keeping counts rather than just tracking membership, instead increments ℓ counters. Looking up an object in a bloom filter just involves checking the ℓ bits corresponding to that object — if any of them are still 0, then the object has not been previously inserted. Thus Lookup in a bloom filter can be thought of as taking the minimum of ℓ bits, which exactly parallels the Count operation of a count-min sketch. That the count-min sketch only overestimates frequency counts corresponds to the bloom filter's property that it only suffers from false positives.
Fix an object x and a row i ∈ {1, 2, . . . , ℓ}. The counter CMS[i][h_i(x)] equals f_x plus the frequency counts of the objects that collide with it:

CMS[i][h_i(x)] = f_x + Σ_{y∈S} f_y ,   (2)

where S = {y ≠ x : h_i(y) = h_i(x)} denotes the objects that collide with x in the ith row. In (2), f_x and the f_y's are fixed constants (independent of the choice of h_i), while the set S will be different for different choices of the hash function h_i.
Recall that a good hash function spreads out a data set as well as if it were a random
function. With b buckets and a good hash function h_i, we expect x to collide with a roughly 1/b fraction of the other elements y ≠ x under h_i. Thus we expect

CMS[i][h_i(x)] = f_x + (1/b) · Σ_{y≠x} f_y ≤ f_x + n/b ,   (3)
where in the inequality we use that the sum of the frequency counts is exactly the total
number n of increments (each increment adds 1 to exactly one frequency count). See also
Section 2.5 for a formal (non-heuristic) derivation of (3).
We should be pleased with (3). Recall the definition of the ε-approximate heavy hitters problem (Section 1.4): the goal is to identify objects with frequency count at least n/k, without being fooled by any objects with frequency count less than n/k − εn. This means we just need to estimate the frequency count of an object up to additive one-sided error εn. If we take the number of buckets b in the count-min sketch to be equal to 1/ε, then (3) says the expected overestimate of a given object is at most εn. Note that the value of b, and hence the number of counters used by the data structure, is completely independent of n! If you think of ε = .001 and n as in the billions, then this is pretty great.
So why aren’t we done? We’d like to say that, in addition to the expected overestimate
of a frequency count being small, with very large probability the overestimate of a frequency
count is small. (For a role model, recall that typical bloom filters guarantee a false positive
probability of 1-2%.) This requires translating our bound on an expectation to a bound on
a probability.
Next, we observe that (3) implies that the probability that a row's overestimate of x is more than 2n/b is less than 50%. (If not, the expected overestimate would be greater than (1/2) · (2n/b) = n/b, contradicting (3).) This argument is a special case of "Markov's inequality"; see Section 2.5 for details.
A possibly confusing point in this heuristic analysis is: in the observation above, what
is the probability over, exactly? I.e., where is the randomness? There are two morally
equivalent interpretations of the analysis in this section. The first, which is carried out
formally and in detail in Section 2.5, is to assume that the hash function hi is chosen
uniformly at random from a universal family of hash functions (see CS161 for the definition).
The second is to assume that the hash function hi is fixed and that the data is random. If
hi is a well-crafted hash function, then your particular data set will almost always behave
like random data.11
Footnote 11: In an implementation that chooses h_i deterministically as a well-crafted hash function, the error analysis below does not actually hold for an arbitrary data set. (Recall that for every fixed hash function there is a pathological data set where everything collides.) So instead we say that the analysis is "heuristic" in this case, meaning that while not literally true, we nevertheless expect reality to conform to its predictions (because we expect the data to be non-pathological). Whenever you do a heuristic analysis to predict the performance of an implementation, you should always measure the implementation's performance to double-check that it's working as expected. (Of course, you should do this even when you've proved performance bounds rigorously — there can always be unmodeled effects (cache performance, etc.) that cause reality to diverge from your theoretical predictions for it.)
Remember that everything we've done so far is just for a single row i of the hash table. The output of Count(x) exceeds f_x by more than 2n/b only if every row's estimate is too big. Assuming that the hash functions h_i are independent,12 we have
Pr[ min_{i=1,...,ℓ} CMS[i][h_i(x)] > f_x + 2n/b ] = ∏_{i=1}^{ℓ} Pr[ CMS[i][h_i(x)] > f_x + 2n/b ] ≤ (1/2)^ℓ .
To get an overestimate threshold of εn, we can set b = 2/ε (so e.g., 200 when ε = .01). To drive the error probability — that is, the probability of an overestimate larger than this threshold — down to the user-specified value δ, we set

(1/2)^ℓ = δ

and solve to obtain ℓ = log_2(1/δ). (This is between 6 and 7 when δ = .01.) Thus the total number of counters required when δ = ε = .01 is barely over a thousand (no matter how long the array is!). See Section 2.6 for a detailed recap of all of the count-min sketch's properties, and Section 2.5 for a rigorous and optimized version of the heuristic analysis in this section.
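For concreteness, here is that arithmetic written out for ε = δ = .01, rounding ℓ up to an integer (the rounding is my choice; the text above only says ℓ is between 6 and 7):

\[
b = \frac{2}{\epsilon} = 200,
\qquad
\ell = \left\lceil \log_2 \tfrac{1}{\delta} \right\rceil = \lceil \log_2 100 \rceil = 7,
\qquad
b \cdot \ell = 1400 \text{ counters}.
\]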
Fix an object x and a row i ∈ {1, 2, . . . , ℓ}, and let Z_i denote the final value of the counter CMS[i][h_i(x)], so that, as in (2),

Z_i = f_x + Σ_{y∈S} f_y ,   (4)

where S = {y ≠ x : h_i(y) = h_i(x)} denotes the objects that collide with x in the ith row. In (4), f_x and the f_y's are fixed constants (independent of the choice of h_i), while the set S is random (i.e., different for different choices of h_i).
Footnote 12: Don't forget that probabilities factor only for independent events. There are again two interpretations of this step: in the first, we assume that each h_i is chosen independently and randomly from a universal family of hash functions; in the second, we assume that the h_i's are sufficiently well crafted that they almost always behave as if they were independent on real data.
To continue the error analysis, we make the following assumption:
(*) For every pair x, y of distinct objects, Pr[h_i(y) = h_i(x)] ≤ 1/b.
Assumption (*) basically says that, after conditioning on the bucket to which hi assigns an
object x, the bucket hi assigns to some other object y is uniformly random. For example,
the assumption would certainly be satisfied if hi is a completely random function. It is also
satisfied if hi is chosen uniformly at random from a universal family — it is precisely the
definition of such a family (see your CS161 notes).
Before using assumption (*) to analyze (4), we recall linearity of expectation: for any
real-valued random variables X1 , . . . , Xm defined on the same probability space,
" m # m
X X
E Xj = E[Xj ] . (5)
j=1 j=1
That is, the expectation of a sum is just the sum of the expectations, even if the random variables are not independent. The statement is trivial to prove — just expand the expectations and reverse the order of summation — and insanely useful.
To put the pieces together, we first rewrite (4) as

Z_i = f_x + Σ_{y≠x} f_y · 1_y ,   (6)

where 1_y is the indicator random variable that indicates whether or not y collides with x under h_i:

1_y = 1 if h_i(y) = h_i(x), and 1_y = 0 otherwise.
Recalling that f_x and the f_y's are constants, we can apply linearity of expectation to (6) to obtain

E[Z_i ] = f_x + Σ_{y≠x} f_y · E[1_y ] .   (7)

Since 1_y is a 0-1 random variable,

E[1_y ] = Pr[h_i(y) = h_i(x)] ≤ 1/b ,   (8)

where the inequality is assumption (*). Combining (7) and (8) with the fact that the frequency counts sum to n, we conclude that

E[Z_i ] ≤ f_x + (1/b) · Σ_{y≠x} f_y ≤ f_x + n/b .   (9)
Proposition 2.2 (Markov's Inequality) If X is a nonnegative random variable and c > 1 is a constant, then

Pr[X > c · E[X]] ≤ 1/c .
The proof of Markov’s inequality is simple. For example, suppose you have a nonnegative
random variable X with expected value 10. How frequently could it take on a value greater
than 100? (So c = 10.) In principle, it is possible that X equals exactly 100 with probability 10% (and 0 with the remaining probability). But it can't take a value strictly greater than 100 with probability 10% or more — if it did, its expectation would be strictly greater than 10. An
analogous argument applies to nonnegative random variables with any expectation and for
any value of c.
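One way to write the general one-line argument (under the proposition's assumptions that X is nonnegative and c > 1) is:

\[
\mathbf{E}[X] \;\ge\; c \cdot \mathbf{E}[X] \cdot \Pr\bigl[X > c \cdot \mathbf{E}[X]\bigr]
\quad\Longrightarrow\quad
\Pr\bigl[X > c \cdot \mathbf{E}[X]\bigr] \;\le\; \frac{1}{c},
\]

where the first inequality holds because the outcomes with X ≤ c · E[X] contribute a nonnegative amount to the expectation.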
Let's return to our error analysis, for a fixed object x and row i. Define

X = Z_i − f_x ≥ 0

as the amount by which the ith row of the count-min sketch overestimates x's frequency count f_x. By (9), with b = e/ε, the expected value of X is at most εn/e. Since this overestimate is always nonnegative, we can apply Markov's inequality (Proposition 2.2) with E[X] = εn/e and c = e to obtain

Pr[ X > e · (εn/e) ] ≤ 1/e

and hence

Pr[ Z_i > f_x + εn ] ≤ 1/e .
Assuming that the hash functions are chosen independently, we have

Pr[ min_{i=1,...,ℓ} Z_i > f_x + εn ] = ∏_{i=1}^{ℓ} Pr[ Z_i > f_x + εn ] ≤ 1/e^ℓ .   (10)

To achieve a target error probability of δ, we just solve for ℓ in (10) and find that ℓ ≥ ln(1/δ) rows are sufficient. For δ around 1%, ℓ = 5 is good enough.
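Plugging in ε = δ = .01 and rounding b up to an integer (again my choice of rounding), this optimized analysis gives

\[
b = \left\lceil \frac{e}{\epsilon} \right\rceil = \lceil 100e \rceil = 272,
\qquad
\ell = \lceil \ln \tfrac{1}{\delta} \rceil = \lceil \ln 100 \rceil = 5,
\qquad
b \cdot \ell = 1360 \text{ counters},
\]

slightly fewer than the 1400 counters from the earlier heuristic parameter settings.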
• The count-min sketch uses a number of counters that depends only on ε and δ — in particular, you can throw out almost all of your data set and still maintain approximate frequency counts. Contrast this with bloom filters, and pretty much every other data structure that you've seen, where the space grows linearly with the number of processed elements.
• Assuming the hash functions take constant time to evaluate, the Inc and Count operations run in O(ln(1/δ)) time.
• The count-min sketch guarantees one-sided error: no matter how the hash functions h_1, . . . , h_ℓ are chosen, for every object x with frequency count f_x, the count-min sketch returns an estimate Count(x) that is at least f_x.
the final output as well. Ignoring the objects with large errors, the heap contains at most 2k objects at all times (why?), so maintaining the heap requires an extra O(log k) = O(log(1/ε)) amount of work per array entry.
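The beginning of this passage is cut off in these notes, but here is a minimal sketch of the kind of heap-based pass it describes, using the CountMinSketch class sketched in Section 2 and taking ε = 1/(2k); the exact insertion, refresh, and eviction bookkeeping is my own guess at the intended scheme rather than a quote from the notes.

    import heapq
    import math

    def approximate_heavy_hitters(stream, k):
        eps = 1.0 / (2 * k)
        # parameters as in the analysis above: b = e/eps buckets, 5 rows for delta around 1%
        cms = CountMinSketch(b=math.ceil(math.e / eps), ell=5)
        heap, in_heap = [], set()   # min-heap of (estimated count, object)
        t = 0                       # number of stream elements processed so far
        for x in stream:
            t += 1
            cms.inc(x)
            if x not in in_heap and cms.count(x) >= t / k:
                heapq.heappush(heap, (cms.count(x), x))
                in_heap.add(x)
            # evict (or refresh) candidates whose stored estimate fell below the threshold
            while heap and heap[0][0] < t / k:
                _, y = heapq.heappop(heap)
                fresh = cms.count(y)
                if fresh >= t / k:
                    heapq.heappush(heap, (fresh, y))   # key was stale; re-insert
                else:
                    in_heap.discard(y)
        # keep only candidates whose current estimate meets the final threshold n/k
        return [y for (_, y) in heap if cms.count(y) >= t / k]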
3 Lecture Take-Aways
1. Hashing is even cooler and more useful than you had realized.
2. The key ideas behind Bloom Filters extend to other lightweight data structures that
are useful for solving other problems (not just the membership problem).
3. The idea of lossy compression. If you only want to approximately preserve certain
properties, like frequency counts, then sometimes you can get away with throwing away
almost all of your data. We’ll see another example next week: dimension reduction,
where the goal is to approximately preserve pairwise measures (like similarity) between
objects.
4. (Approximate) frequency counts/heavy hitters are exactly what you want in many
applications — traffic at a network switch, click data at a major Web site, streaming
data from a telescope or satellite, etc. It’s worth knowing that this useful primitive
can be solved efficiently at a massive scale, with even less computation and space than
most of the linear-time algorithms that you studied in CS161.
References
[1] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communica-
tions of the ACM, 13(7):422–426, 1970.
[2] A. Broder and M. Mitzenmacher. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4):485–509, 2005.
[3] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams.
Theoretical Computer Science, 312(1):3–15, 2004.
[5] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min
sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
[7] J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming,
2:143–152, 1982.
[8] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Dynamic Grids and Worldwide Computing, 13(4):277–298, 2005.