Algorithms For Big Data
Preface
The emergence of the Internet has allowed people, for the first time, to
access huge amounts of data. Think, for example, of the graph of friendships
in the social network Facebook and the graph of links between Internet
websites. Both these graphs contain more than one billion nodes, and thus,
represent huge datasets. To use these datasets, they must be processed and
analyzed. However, their mere size makes such processing very challenging.
In particular, classical algorithms and techniques, that were developed
to handle datasets of a more moderate size, often require unreasonable
amounts of time and space when faced with such large datasets. Moreover,
in some cases it is not even feasible to store the entire dataset, and thus,
one has to process the parts of the dataset as they arrive and discard each
part shortly afterwards.
The above challenges have motivated the development of new tools and
techniques adequate for handling and processing “big data” (very large
amounts of data). In this book, we take a theoretical computer science
view on this work. In particular, we will study computational models that
aim to capture the challenges raised by computing over “big data” and the
properties of practical solutions developed to answer these challenges. We
will get to know each one of these computational models by surveying a
few classic algorithmic results, including many state-of-the-art results.
This book was designed with two contradicting objectives in mind,
which are as follows:
(i) on the one hand, we try to give a wide overview of the work done in
theoretical computer science in the context of “big data” and
(ii) on the other hand, we strive to do so with sufficient detail to allow the
reader to participate in research work on the topics covered.
While we did our best to meet both goals, we had to compromise in
some aspects. In particular, we had to omit some important “big data”
subjects such as dimension reduction and compressed sensing. To make the
book accessible to a broader population, we also omitted some classical
algorithmic results that involve tedious calculations or very advanced
mathematics. In most cases, the important aspects of these results can
be demonstrated by other, more accessible, results.
Chapter 1
for such uses since they are very cost-effective and allow the storage of huge
amounts of data at a relatively low price. Unfortunately, reading data
from a magnetic tape is efficient only when the data is read sequentially,
i.e., in the order in which it is written on the tape. If
a user requests a piece of information from a magnetic tape which is not
physically close to the piece of information read before, then the tape must
be rewound to the location of the requested information, which is a very
slow operation (on the order of multiple seconds). In view of this issue,
algorithms processing the data of a magnetic tape often read it sequentially.
Just like in the network monitoring scenario, the algorithm can store some
of the information it gets from the tape in the computer’s main memory.
However, typically this memory is large enough to contain only a small
fraction of the tape’s data, and thus, the algorithm has to “guess” the
important parts of the data that should be kept in memory.
algorithms and the properties that we expect from a good data stream
algorithm.
Exercise 1. Write a data stream algorithm that checks whether all the tokens
in its input stream are identical. The algorithm should output TRUE if this
is the case, and FALSE if the input stream contains at least two different
tokens.
This algorithm maintains a counter for every token it views, and these
counters are implicitly assumed to be initially zero. The counter of every
token counts the number of appearances of this token, and the algorithm
can then check which tokens have more than n/k appearances by simply
checking the values of all the counters.
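As an illustration, here is a minimal Python sketch that is consistent with this description; the function name and the choice of a dictionary are our own illustrative choices, not the book's pseudocode.

def frequent_tokens_naive(stream, k):
    counters = {}                      # one counter per distinct token, implicitly zero
    n = 0
    for token in stream:               # single pass over the stream
        counters[token] = counters.get(token, 0) + 1
        n += 1
    # report every token whose counter exceeds n/k
    return [t for t, c in counters.items() if c > n / k]

The obvious drawback, discussed next, is that the dictionary may end up holding one counter for every distinct token of the stream.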
While this algorithm is very simple, it is not very useful in practice
because it might need to keep very many counters if there are many distinct
tokens in the stream. Our 2-pass algorithm (Algorithm 3) avoids this issue
by using one pass to filter out most of the tokens that do not appear many
times in the stream. More specifically, in its first pass, Algorithm 3 produces
a small set F of tokens that includes all the tokens that might appear
many times in the stream. Then, in its second pass, Algorithm 3 mimics
Algorithm 2, but only for the tokens of F . Thus, as long as F is small, the
number of counters maintained by Algorithm 3 is kept small, even when
there are many unique tokens in the stream (technically, Algorithm 3 still
maintains a counter for every token in the stream. However, most of these
counters are zeros at every given time point, so a good implementation does
not require much space for storing all of them).
Let us now explain the pseudocode of Algorithm 3 in more detail. Like
Algorithm 2, Algorithm 3 also implicitly assumes that the counter of each
token is initially zero. During the first pass of this algorithm over the input
stream, it processes each token t by doing two things as follows:
• It increases the counter of t by 1.
• If after this increase there are at least k counters with non-zero values,
then every such counter is decreased by 1.
To make this more concrete, let us consider the behavior of Algorithm 3
during its first pass when given the above input example (i.e., the input
stream “ababcbca” and k = 3). While reading the first 4 tokens of the input
stream, the algorithm increases both the counters of “a” and “b” to 2. Then,
the algorithm gets to the first occurrence of the token “c”, which makes it
increase the counter of “c” to 1 momentarily. However, as the number of
non-zero counters is now equal to k, all these counters are decreased by 1,
which sets the counter of “c” back to zero and also decreases the counters
of “a” and “b” to 1.
After Algorithm 3 completes its first pass, it stores in F the set of tokens
whose counters end up with a non-zero value at the end of the first pass.
Recall that in the intuitive description of Algorithm 3, we claimed that F
is a small set that includes every token that might have many appearances
in the stream. Lemma 1 shows that this is indeed the case. We will prove
this lemma soon.
During its second pass, Algorithm 3 determines the total length of the
input stream and counts the number of times each token of F appears.
Then, it outputs the list of tokens in F that appear more than n/k times
in the input stream. Given Lemma 1, it is easy to see from this description
that Algorithm 3 fulfills its objective, i.e., it outputs exactly the tokens that
appear more than n/k times in the input stream. Thus, Lemma 1 remains
to be proved.
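Before turning to the proof, here is a minimal Python sketch that follows the above description of Algorithm 3; it assumes the stream can be read twice (e.g., it is stored as a list), and the function name and data structures are our own illustrative choices rather than the book's pseudocode.

def frequent_tokens_two_pass(stream, k):
    # First pass: maintain only the non-zero counters.
    counters = {}
    for token in stream:
        counters[token] = counters.get(token, 0) + 1
        if len(counters) >= k:                      # at least k non-zero counters
            # decrease every non-zero counter by 1 and drop the zeros
            counters = {t: c - 1 for t, c in counters.items() if c > 1}
    F = set(counters)                               # tokens whose counters ended up non-zero

    # Second pass: determine n and count the tokens of F exactly.
    n = 0
    exact = {t: 0 for t in F}
    for token in stream:
        n += 1
        if token in exact:
            exact[token] += 1
    return [t for t in F if exact[t] > n / k]

On the example input "abcabcad" with k = 3, this sketch produces F = {a, d} after the first pass and outputs only "a", matching the behavior described in the exercise solutions below.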
Proof of Lemma 1. During the first pass of Algorithm 3, the processing
of each arriving token consists of two steps: increasing the counter of the
token, and then decreasing all the counters if the number of non-zero
counters reached k. The first step might increase the number of non-zero
counters when the counter of the arriving token increases from zero to
one. However, if the number of non-zero counters reaches k following this increase,
then the second step will decrease the counter of the arriving token back
to zero, which will make the number of non-zero counters smaller than k
again. Hence, the number of non-zero counters remains smaller than k following the
processing of each input token, and is never larger than k at any given time.1
To prove the second part of the lemma, we observe that during the first
pass of Algorithm 3 the total number of counter increases is equal to the
number of tokens in the input stream, i.e., n. On the other hand, each time
that the algorithm decreases counters (i.e., executes Line 5), it decreases
k different counters. As no counter ever becomes negative, this means that
the algorithm decreases counters at most n/k times. In particular, every
given counter is decreased by at most n/k throughout the entire first pass.
Hence, the counter of a token with more than n/k appearances in the input
stream will remain positive at the end of the first pass, and thus, will end
up in F .
1 This can be proved more formally by induction on the number of tokens already
processed.
2 Note that we assume here that standard operations on numbers such as addition and
multiplication can be done using a single basic operation. A more accurate analysis
of the token processing time should take into account the time required for these
operations (which is often logarithmic in the number of bits necessary to represent the
numbers). However, in the interest of simplicity we ignore, throughout this book, this
extra complication and assume standard operations on numbers require only a single
basic operation.
Exercise Solutions
Solution 1
One possible solution is given as Algorithm 4.
The first line of the algorithm handles the case in which the input
stream is empty (and outputs TRUE in this case). The rest of the algorithm
compares all the tokens of the stream to the first token. If the algorithm
finds any token in the stream which is different from the first token, then it
outputs FALSE. Otherwise, all the tokens of the stream must be identical,
and thus, the algorithm outputs TRUE.
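In Python, this solution can be rendered as the following sketch (the function name is ours):

def all_tokens_identical(stream):
    seen_first = False
    for token in stream:
        if not seen_first:
            first, seen_first = token, True     # remember the first token
        elif token != first:
            return False                        # a token differs from the first one
    return True                                 # empty stream, or all tokens identical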
Solution 2
Recall that, after reading the first 5 tokens of the input stream, the counters
of the tokens “a”, “b” and “c” in Algorithm 3 have the values 1, 1 and 0,
respectively. The next token that the algorithm reads is “b”, which causes
the algorithm to increase the counter of “b” to 2. Then, the algorithm reads
the second occurrence of the token “c”, which momentarily increases the
counter of “c” to 1. However, the number of non-zero counters is now equal
to k again, which makes the algorithm decrease all the non-zero counters
by 1. This sets the value of the counter of “c” back to zero, and decreases
the counters of “a” and “b” to 0 and 1, respectively. The final token that
the algorithm reads is “a”, which increases the counter of “a” back to 1.
The following table summarizes the values of the counters after the
algorithm processes each one of the input stream tokens. Each column of
the table corresponds to one token of the input stream, and gives the values
of the counters after this token is processed by the algorithm. Note that the
rightmost column of the table corresponds to the last token of the input
stream, and thus, gives the values of the counters at the end of the first
pass of the algorithm.
Token Processed: “a” “b” “a” “b” “c” “b” “c” “a”
Counter of “a”: 1 1 2 2 1 1 0 1
Counter of “b”: 0 1 1 2 1 2 1 1
Counter of “c”: 0 0 0 0 0 0 0 0
Solution 3
Consider the input stream “abcabcad ” and k = 3. One can verify, by
simulating the algorithm, that when Algorithm 3 gets this input stream
and value of k, it produces the set F = {a, d}. Moreover, the values of
the counters of “a” and “d” are both 1 at the end of the first pass of the
algorithm. It can also be observed that “a” appears 3 > n/k times in the
above input stream, while “d” appears only once in this input stream.
Solution 4
Algorithm 3 maintains a counter for every token. However, by Lemma 1
only k of these counters can be non-zero at every given time. Hence, it is
enough to keep in memory only these (up to) k non-zero counters. For each
such counter, we need to maintain the token it is associated with and its
value. Since there are m possible tokens, each token can be specified using
O(log m) bits. Similarly, since the value of each counter is upper bounded
by the length n of the stream, it requires only O(log n) bits. Combining
these bounds we get that maintaining the counters of Algorithm 3 requires
only O(k(log n + log m)) bits.
In addition to its counters, Algorithm 3 uses two additional variables:
n — which counts the number of input tokens, and thus, can be represented
using O(log n) bits, and F — which is a set of up to k tokens, and thus,
can be represented using O(k log m) bits. One can observe, however, that
the space requirements of these two variables are dominated by the space
requirement O(k(log n + log m)) of the counters, and thus, can be ignored.
Solution 5
We assume throughout this solution that Algorithm 3 maintains explicitly
only the values of its non-zero counters. Note that Lemma 1 guarantees that
there are at most k non-zero counters during the first pass of Algorithm 3.
Additionally, Lemma 1 also guarantees |F | ≤ k, which implies that
Algorithm 3 has at most k non-zero counters during its second pass as well.
Since only the non-zero counters are explicitly maintained by Algorithm 3,
these observations imply that it is possible for Algorithm 3 to find the
counter of every given token in O(k) time.
During its first pass, Algorithm 3 performs three steps after reading
each token. First, it finds the counter of the read token, then it increases
this counter, and finally, if there are at least k non-zero counters, it decreases
all the non-zero counters. By the above discussion, steps one and two can
be done in O(k) time. Similarly, the third step can be implemented by
scanning the list of non-zero counters and decreasing each one of them
(when necessary), and thus, also requires only O(k) time.
During its second pass, Algorithm 3 performs a few steps after reading
each token. First, it increases the variable n used to determine the length of
the input stream. Then, if the read token belongs to F , it finds its counter
and increases it. Again, the above discussion implies that these steps can
be done in O(k) time.
In conclusion, after reading each token Algorithm 3 performs O(k)
operations in both passes, and thus, this is its token processing time.
Chapter 2
In this chapter, we take a short break from the data stream model and study
some tools from probability theory that we will use throughout the rest
of the book. The chapter will begin with a review of the basic probability
theory. This review will be quite quick as we assume that the reader already
took a basic probability course at some point and, thus, is familiar with the
reviewed material. After the review, we will study the topic of tail bounds,
which is an essential tool for many of the results we will see in this book,
but is not covered very well by most basic probability courses.
To exemplify the above definition, consider the toss of a fair coin. This
random process has two possible outcomes: “head” and “tail”. Since these
two outcomes have equal probabilities of being realized (for a fair coin),
each one of them is realized with a probability of 1/2. Thus, the discrete
probability space corresponding to this random process is (Ω, P ) for
Ω = {head, tail} and P (x) = 1/2 ∀x ∈ Ω.
In most of the cases we will study in this book, the set Ω is not only
countable but also finite. Accordingly, while all the results we mention in
this chapter apply to general countable Ω, the proofs we give for some of
them work only for the case of a finite Ω.
Consider now the roll of a fair dice. Such a roll has six possible
outcomes: {1, 2, 3, 4, 5, 6}, and the discrete probability space corresponding
to the roll assigns a probability to each one of these outcomes. However,
often we are interested in probabilities of things that are not outcomes. For
example, we might be interested in the probability that the outcome is even
or that it is larger than 4. To study such probabilities formally, we need to
define the notion of events. Every subset of E ⊆ Ω is called an event, and
the probability of an event is the sum of the probabilities of the outcomes
in it.
To exemplify the notion of events, assume that we would like to
determine the probability that the dice shows an even number. To calculate
this probability, we first note that the event that the dice shows an even
number is the set of even possible outcomes, i.e., {2, 4, 6}. The probability
of this event is
Pr[dice shows an even number] = Pr[{2, 4, 6}] = P(2) + P(4) + P(6) = 3 · (1/6) = 1/2,
where the penultimate equality holds since the outcomes 2, 4 and 6 are
realized with probability 1/6 each.1
1 The function P was defined as a function over outcomes, and thus, we use it only when its argument is a single outcome (probabilities of events are denoted by Pr).
Exercise 2. Consider a biased dice whose probability to show each one of its
sides is given by the following table. What is the probability of the event that
this dice shows a number larger than or equal to 4?
Shown Number 1 2 3 4 5 6
Probability 0.05 0.2 0.15 0.25 0.25 0.1
Given two events E1 and E2 with Pr[E2] > 0, the conditional probability of E1 given E2 is defined as

Pr[E1 | E2] = Pr[E1 ∩ E2]/Pr[E2].
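For example, for a roll of a fair dice, the probability that the dice shows 6 given that it shows an even number is Pr[{6} | {2, 4, 6}] = Pr[{6} ∩ {2, 4, 6}]/Pr[{2, 4, 6}] = (1/6)/(1/2) = 1/3.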
Exercise 5. Consider two rolls of a fair dice. Calculate the probability that
the dice shows either five or six in at least one of the rolls. How does your
answer change if the dice is rolled k times (for some positive integer k) rather
than two times?
The tools described above can also be used to prove the following very
useful lemma, which is known as the law of total probability.
Lemma 1 (law of total probability). Let A1 , A2 , . . . , Ak be a set of
k disjoint events such that their union includes all the possible outcomes
and the probability of each individual event Ai is non-zero. Then, for every
event E, it holds that
Pr[E] = ∑_{i=1}^{k} Pr[A_i] · Pr[E|A_i].
The law of total probability is very useful when one has to calculate the
probability of an event E through a case analysis that distinguishes between
multiple disjoint cases. The reader will be asked to apply this technique in
Exercise 7.
Exercise 7. Consider a roll of two fair dices. Calculate the probability that the
sum of the numbers shown by the two dices is even. Hint: For every integer i
between 1 and 6, let Ai be the event that the first dice shows the value of i.
Use the law of total probability with the disjoint events A1 , A2 , . . . , A6 .
Exercise 8. Consider three tosses of a fair coin. For every two distinct values
i, j ∈ {1, 2, 3}, let Eij be the event that tosses i and j have the same outcome.
(a) Prove that the events E12 , E23 and E13 are pairwise independent.
(b) Prove that the above three events are not independent.
word “mutual” is often omitted, and in this book we follow this practice of omission.
is independent, i.e.,
Pr[⋂_{i∈I} E_i] = ∏_{i∈I} Pr[E_i]   for every I ⊆ {1, 2, . . . , h} with |I| ≤ k.
Exercise 9. Consider a roll of two fair dices, and let X be the sum of the
values they show. Calculate X.
3 When Ω is infinite, this might be the sum of an infinite series, in which case the expectation is well defined only when the series converges.
Proof.
(1) Let R be the set of values that either X, Y or X + Y can take. For
every r ∈ R, we have
Pr[X + Y = r] = ∑_{r_X, r_Y ∈ R : r_X + r_Y = r} Pr[X = r_X ∧ Y = r_Y]
Exercise 10. Use the linearity of expectation to get a simpler solution for
Exercise 9.
Proof. We prove the lemma only for the case in which f is a convex function
within C. The proof for the other case is analogous. We begin by proving
by induction on k that for every set of non-negative numbers λ1 , λ2 , . . . , λk
whose sum is 1 and arbitrary numbers y1 , y2 , . . . , yk ∈ C, we have that
f(∑_{i=1}^{k} λ_i y_i) ≤ ∑_{i=1}^{k} λ_i · f(y_i).    (2.1)

λ_1 · f(y_1) + (1 − λ_1) · ∑_{i=2}^{k} [λ_i/(1 − λ_1)] · f(y_i) = ∑_{i=1}^{k} λ_i · f(y_i),
Exercise 11. Find an example of a random variable for which Inequality (2.2)
holds as equality and an example for which this inequality does not hold as
equality.
Exercise 12. Let X be the value obtained by rolling a fair dice, let O be the
event that this number is odd and let E be the event that it is even.
(a) Calculate E[X|O] and E[X|E],
(b) intuitively explain why E[X|O] < E[X|E].
Using the notation of conditional expectation, we can now give the law
of total expectation, which is an analogue of the law of total probability
(Lemma 1) that we have seen before.
E[X] = ∑_{i=1}^{k} Pr[A_i] · E[X|A_i].
Proof. Let us denote by R the set of values that X can take, then
∑_{i=1}^{k} Pr[A_i] · E[X|A_i] = ∑_{i=1}^{k} Pr[A_i] · ∑_{r∈R} (r · Pr[X = r|A_i])
= ∑_{r∈R} r · ∑_{i=1}^{k} Pr[A_i] · Pr[X = r|A_i]
= ∑_{r∈R} r · Pr[X = r] = E[X],
where the second equality follows by changing the order of summation, and
the penultimate equality holds by the law of total probability.
Exercise 13. Consider the following game. A player rolls a dice and gets a
number d. She then rolls the dice d more times. The sum of the values shown
by the dice in these d rolls is the number of points that the player gets in the
game. Calculate the expectation of this sum.
= ∑_{r_X∈R} ∑_{r_Y∈R} (r_X · r_Y) · Pr[X = r_X] · Pr[Y = r_Y]
= (∑_{r_X∈R} r_X · Pr[X = r_X]) · (∑_{r_Y∈R} r_Y · Pr[Y = r_Y])
= E[X] · E[Y],
Exercise 14. Show that Lemma 6 does not necessarily hold when X and Y
are not independent.
In general, the variance does not have nice linear properties like the
expectation. Nevertheless, it does have some properties that resemble
linearity. Two such properties are listed in Lemma 8.
4 We do not encounter in this book any examples of random variables whose variances
do not exist (such variables can be defined only in the context of infinite probability
spaces, which we rarely consider). However, for completeness, we include the technical
requirement that the variance exists in the necessary places.
= ∑_{i=1}^{h} E[X_i²] + 2 · ∑_{i=1}^{h} ∑_{j=i+1}^{h} E[X_i] · E[X_j] − ∑_{i=1}^{h} (E[X_i])² − 2 · ∑_{i=1}^{h} ∑_{j=i+1}^{h} E[X_i] · E[X_j]
= ∑_{i=1}^{h} {E[X_i²] − (E[X_i])²} = ∑_{i=1}^{h} Var[X_i],
Var[X] = ∑_{i=1}^{n} Var[X_i] = ∑_{i=1}^{n} pq = npq.
Figure 2.1. A schematic picture of the distribution of a characteristic random variable.
The x-axis corresponds to the values that the random variable can take, and the height of
the curve above each such value represents the probability that the random variable takes
this value. For many useful random variables, the curve obtained this way has a “bell”
shape having a bulge near the expected value of the random variable. On both sides of
the bulge, the curve has “tails” in which the probability gradually approaches 0 as the
values go further and further from the expected value. The tails are shaped gray in the
above figure. To show that the random variable is concentrated around its expectation,
one should bound the size of the tails and show that they are small compared to the
bulge. This is the reason that inequalities showing concentration are called tail bounds.
only non-negative values. This assumption leads to a very basic tail bound
known as Markov’s inequality.
Specifically, Markov's inequality states that for every non-negative random variable X and every t > 0, Pr[X ≥ t] ≤ E[X]/t; this follows by observing that

E[X] = ∑_{r∈R} r · Pr[X = r] ≥ ∑_{r∈R, r≥t} r · Pr[X = r] ≥ ∑_{r∈R, r≥t} t · Pr[X = r] = t · Pr[X ≥ t].
Pr[(X − E[X])² ≥ t²] ≤ E[(X − E[X])²]/t².

Recalling now that Var[X] was defined as E[(X − E[X])²], we get

Pr[(X − E[X])² ≥ t²] ≤ Var[X]/t²,
which is equivalent to the inequality that we want to prove.
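To see the resulting inequality in action, let X be the value shown by a single roll of a fair dice, so E[X] = 3.5 and Var[X] = E[X²] − (E[X])² = 91/6 − 49/4 = 35/12. Chebyshev's inequality then yields Pr[|X − 3.5| ≥ 2] ≤ (35/12)/2² = 35/48 ≈ 0.73, while the true probability of this event is Pr[X ∈ {1, 6}] = 1/3; as is often the case, the bound is correct but far from tight.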
and

Pr[X ≤ (1 − δ) · E[X]] ≤ (e^{−δ}/(1 − δ)^{1−δ})^{E[X]}   for every δ ∈ (0, 1).
Proof. Observe that the lemma is trivial when E[X] = 0. Thus, we may
assume in the rest of the proof that this is not the case. Let us now denote
by p_i the expectation of X_i for every 1 ≤ i ≤ n, and by R_i the set of values
that X_i can take. Fix an arbitrary value t ≠ 0, and consider the random
variable e^{tX}. Since the variables X_1, X_2, . . . , X_n are independent, we have

E[e^{tX}] = E[e^{t·∑_{i=1}^{n} X_i}] = E[∏_{i=1}^{n} e^{tX_i}] = ∏_{i=1}^{n} E[e^{tX_i}] = ∏_{i=1}^{n} ∑_{r∈R_i} e^{tr} · Pr[X_i = r]
≤ ∏_{i=1}^{n} ∑_{r∈R_i} [r · e^t + (1 − r) · e^0] · Pr[X_i = r] = ∏_{i=1}^{n} [E[X_i] · e^t + (1 − E[X_i]) · e^0]
= ∏_{i=1}^{n} [p_i e^t + (1 − p_i) e^0] = ∏_{i=1}^{n} [p_i(e^t − 1) + 1] ≤ ∏_{i=1}^{n} e^{p_i(e^t−1)} = e^{(e^t−1)·E[X]},

where the first inequality follows from the convexity of the function e^x and
the last inequality holds since e^x ≥ x + 1 for every real value x.
We now observe that e^{tX} is a non-negative random variable, and thus, we can
use Markov's inequality to obtain, for every t > 0,

Pr[X ≥ (1 + δ) · E[X]] = Pr[e^{tX} ≥ e^{(1+δ)t·E[X]}] ≤ E[e^{tX}]/e^{(1+δ)t·E[X]}
≤ e^{(e^t−1)·E[X]}/e^{(1+δ)t·E[X]} = (e^{e^t−1}/e^{(1+δ)t})^{E[X]}.
We now need to find the right value of t to plug into the above
inequality. One can observe that for δ > 0 we can choose t = ln(1 + δ)
because this value is positive, and moreover, plugging this value into the
above inequality yields the first part of the lemma. To prove the other
part of the lemma, we use Markov’s inequality again to get the following
equation for every t < 0:
For δ ∈ (0, 1), we can plug t = ln(1 − δ) < 0 into the above inequality,
which completes the proof of the lemma.
from the range [0, 1]. Then, the random variable X = ∑_{i=1}^{n} X_i obeys

Pr[X ≥ (1 + δ) · E[X]] ≤ e^{−δ²·E[X]/(2+δ)} ≤ e^{−min{δ,δ²}·E[X]/3}   for every δ > 0,

and

Pr[X ≤ (1 − δ) · E[X]] ≤ e^{−δ²·E[X]/2}   for every δ ∈ (0, 1).
Proof. By Lemma 13, to prove the first part of the corollary, it suffices
to show that for every δ > 0, we have
(e^δ/(1 + δ)^{1+δ})^{E[X]} ≤ e^{−δ²·E[X]/(2+δ)}.

Observe that

e^δ/(1 + δ)^{1+δ} ≤ e^{−δ²/(2+δ)} ⇔ δ − (1 + δ) · ln(1 + δ) ≤ −δ²/(2 + δ)
⇔ (1 + δ) · ln(1 + δ) ≥ (2δ + 2δ²)/(2 + δ) ⇔ ln(1 + δ) ≥ 2δ/(2 + δ),
where the first equivalence can be shown by taking the ln of both sides of
the first inequality. It is not difficult to verify using basic calculus that the
last of the above inequalities holds for every δ > 0, which completes the
proof of the first part of the corollary. Similarly, to prove the second part
of the corollary, we need to show that
(e^{−δ}/(1 − δ)^{1−δ})^{E[X]} ≤ e^{−δ²·E[X]/2},

which follows since

e^{−δ}/(1 − δ)^{1−δ} ≤ e^{−δ²/2} ⇔ −δ − (1 − δ) · ln(1 − δ) ≤ −δ²/2
⇔ (1 − δ) · ln(1 − δ) ≥ (δ² − 2δ)/2 ⇔ ln(1 − δ) ≥ (δ² − 2δ)/(2 − 2δ).
The rightmost of these inequalities can again be proved using basic
calculus to hold for every δ ∈ (0, 1).
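To get a feeling for the strength of these bounds, let X be the number of heads observed in n independent tosses of a fair coin, so that X is the sum of n independent indicators and E[X] = n/2. Plugging δ = 0.2 into the first bound of the corollary gives Pr[X ≥ 0.6n] ≤ e^{−(0.2)²·(n/2)/(2+0.2)} = e^{−n/110}, i.e., the probability of seeing 60% or more heads decreases exponentially with the number of tosses.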
Exercise Solutions
Solution 1
(a) The roll of a dice has 6 possible outcomes: 1, 2, 3, 4, 5 and 6. Each
one of these outcomes has an equal probability of being realized for a fair
dice, and is thus realized with a probability of 1/6. Hence, the discrete
probability space corresponding to the roll of a fair dice is (Ω1, P1) for
Ω1 = {1, 2, 3, 4, 5, 6} and P1(x) = 1/6 for every x ∈ Ω1.
(b) Like in the case of a toss of a fair coin, the toss of a biased coin (like the
one we consider here) has two possible outcomes: “head” and “tail”.
In the exercise, it is specified that P (head) = 2/3. Since the sum of
P(head) and P(tail) should be 1, this implies that P(tail) = 1/3. Thus,
the discrete probability space corresponding to this coin toss is (Ω2, P2) for
Ω2 = {head, tail}, P2(head) = 2/3 and P2(tail) = 1/3.
(c) Every outcome of the random process specified in the exercise consists
of an outcome for the dice and an outcome for the coin. Hence, the
set Ω3 of the discrete probability space corresponding to this random
process is Ω3 = {(d, c) : d ∈ Ω1, c ∈ Ω2} = {1, 2, 3, 4, 5, 6} × {head, tail}.
Solution 2
We need to determine the probability of the event that the biased dice falls
on a number which is at least 4. This event is formally denoted by the set
{4, 5, 6} because these are the three possible outcomes that are at least 4.
The probability of this event is Pr[{4, 5, 6}] = P(4) + P(5) + P(6) = 0.25 + 0.25 + 0.1 = 0.6.
Solution 3
By the inclusion–exclusion principle,
Pr[E1 ∪ E2] + Pr[E1 ∩ E2] = ∑_{o∈E1∪E2} P(o) + ∑_{o∈E1∩E2} P(o)
= ∑_{o∈E1} P(o) + ∑_{o∈E2} P(o) = Pr[E1] + Pr[E2].
and this can happen only when at least one of the probabilities Pr[E1 ] or
Pr[E2 ] is zero.
Solution 4
Let Hi be the event that the coin falls on “head” in toss number i, and
let Ti be the event that the coin falls on “tail” in toss number i. Using
this notation, we can write down the probability that we are required to
calculate in the exercise as Pr[(H1 ∩ H2) ∪ (T1 ∩ T2)] = Pr[H1 ∩ H2] + Pr[T1 ∩ T2],
where the equality holds since the event H1 ∩ H2 (the event that the coin
falls twice on “head”) is disjoint from the event T1 ∩ T2 (the event that the
coin falls twice on “tail”). We now observe that the event H1 contains only
information about the first toss of the coin, and the event H2 contains
only information about the second toss of the coin. Since the two tosses are
independent, so are the events H1 and H2 , which implies
Pr[H1 ∩ H2] = Pr[H1] · Pr[H2] = (2/3)² = 4/9.
Combining all the above equalities, we get that the probability that we
need to calculate is
Pr[(H1 ∩ H2) ∪ (T1 ∩ T2)] = Pr[H1 ∩ H2] + Pr[T1 ∩ T2] = 4/9 + 1/9 = 5/9.
Solution 5
Let Ei be the event that the dice shows a number other than 5 or 6 in roll
number i. Since the dice has an equal probability to show every number,
Pr[Ei ] = |{1, 2, 3, 4}|/6 = 2/3. Additionally, since the two rolls of the dice
are independent, Pr[E1 ∩ E2 ] = Pr[E1 ] · Pr[E2 ] = (2/3)2 = 4/9.
At this point it is important to observe that E1 ∩ E2 is the complement
of the event whose probability we would like to calculate. In other words,
we want to calculate the probability that in at least one of the two rolls
the dice shows 5 or 6, and E1 ∩ E2 is the event that neither roll of the dice
shows 5 or 6. This observation implies that the probability that we want to
calculate is
Pr[Ω\(E1 ∩ E2)] = Pr[Ω] − Pr[E1 ∩ E2] = 1 − 4/9 = 5/9,
where the first equality holds since E1 ∩ E2 and Ω\(E1 ∩ E2 ) are disjoint.
Let us consider now the change if the number of rolls is k rather than 2.
In this case, we need to calculate the probability of the event complementing
E1 ∩ E2 ∩ · · · ∩ Ek , which is
Pr[Ω \ ⋂_{i=1}^{k} E_i] = 1 − Pr[⋂_{i=1}^{k} E_i] = 1 − ∏_{i=1}^{k} Pr[E_i] = 1 − (2/3)^k,
where the second equality holds since every roll of the dice is independent
of all the other rolls.
Solution 6
By plugging in the definition of conditional probability, we get
∑_{i=1}^{k} Pr[A_i] · Pr[E|A_i] = ∑_{i=1}^{k} Pr[A_i] · (Pr[E ∩ A_i]/Pr[A_i]) = ∑_{i=1}^{k} Pr[E ∩ A_i]
= Pr[⋃_{i=1}^{k} (E ∩ A_i)] = Pr[E],

where the penultimate equality holds since the events A_1, A_2, . . . , A_k are disjoint, and the last equality holds since their union includes all the possible outcomes.
Solution 7
As suggested by the hint, let Ai be the event that the first dice shows the
number i. Additionally, let E be the event that the sum of the two numbers
shown by the dices is even. We observe that if the first dice shows an even
number i, then the sum of the two numbers is even if and only if the second
dice shows one of the numbers 2, 4 or 6. Thus, for any even i we have
Pr[E|A_i] = |{2, 4, 6}|/6 = 1/2.
Similarly, if the first dice shows an odd number i, then the sum of the
two numbers is even if and only if the second dice shows one of the numbers
1, 3 or 5. Thus, for any odd i, we have
Pr[E|A_i] = |{1, 3, 5}|/6 = 1/2.
Since the events A1 , A2 , . . . , A6 are disjoint, and their union includes
all the possible outcomes, we get by the law of total probability
Pr[E] = ∑_{i=1}^{6} Pr[A_i] · Pr[E|A_i] = ∑_{i=1}^{6} Pr[A_i]/2 = (1/2) · ∑_{i=1}^{6} Pr[A_i] = 1/2.
Solution 8
(a) Due to symmetry, to show that E12 , E23 and E13 are pairwise
independent, it suffices to prove that E12 and E23 are independent.
The event E12 is the event that coin tosses 1 and 2 either both produce
heads or both produce tails. Since these are two of the four possible
outcomes for a pair of tosses, we get Pr[E12 ] = 2/4 = 1/2. Similarly,
we also get Pr[E23 ] = 1/2 and Pr[E13 ] = 1/2.
Let us now consider the event E12 ∩ E23 . One can observe that
this is the event that all three coin tosses result in the same outcome,
i.e., either they all produce heads or they all produce tails. Since these
are two out of the eight possible outcomes for three coin tosses, we get
Pr[E12 ∩E23 ] = 2/8 = 1/4. To verify that E12 and E23 are independent,
it remains to be observed that
Pr[E12] · Pr[E23] = (1/2) · (1/2) = 1/4 = Pr[E12 ∩ E23].
(b) Consider the event E12 ∩ E23 ∩ E13 . One can verify that this event is
again the event that all three coin tosses resulted in the same outcomes,
i.e., either they all produce heads or they all produce tails. Since these are two
of the eight possible outcomes for three coin tosses, Pr[E12 ∩ E23 ∩ E13] = 2/8 = 1/4. Hence,

Pr[E12] · Pr[E23] · Pr[E13] = (1/2) · (1/2) · (1/2) = 1/8 ≠ 1/4 = Pr[E12 ∩ E23 ∩ E13],
which implies that the three events E12 , E23 and E13 are not
independent. To understand on a more intuitive level why that is
the case, note that knowing that any two of these events happened
implies that all three coin tosses produced the same outcome and, thus,
guarantees that the third event happened as well.
Solution 9
Let us denote by (i, j) the outcome in which the first dice shows the number
i and the second dice shows the number j. There are 36 such possible
outcomes, each having a probability of 1/36. We note that there is only a
single outcome (namely (1, 1)) for which X = 2, and thus, Pr[X = 2] =
1/36. Similarly, there are two outcomes ((2, 1) and (1, 2)) for which X = 3,
and thus, Pr[X = 3] = 2/36 = 1/18. Continuing in the same way, we get
the following probabilities for all the possible values of X.
X 2 3 4 5 6 7 8 9 10 11 12
Prob. 1/36 1/18 1/12 1/9 5/36 1/6 5/36 1/9 1/12 1/18 1/36
E[X] = ∑_{i=2}^{12} i · Pr[X = i]
= 2·(1/36) + 3·(1/18) + 4·(1/12) + 5·(1/9) + 6·(5/36) + 7·(1/6) + 8·(5/36) + 9·(1/9) + 10·(1/12) + 11·(1/18) + 12·(1/36)
= 1/18 + 1/6 + 1/3 + 5/9 + 5/6 + 7/6 + 10/9 + 1 + 5/6 + 11/18 + 1/3 = 7.
Solution 10
Let X and Y be the values shown by the first and second dice, respectively.
Since a dice shows every value between 1 and 6 with equal probability, the
expectation of each of these variables is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5. Hence,
by the linearity of expectation, E[X + Y] = E[X] + E[Y] = 3.5 + 3.5 = 7.
Solution 11
If X takes only a single value c with a positive probability, then
E[1/X] = 1/c = 1/E[X].
It can be shown that this is the only case in which Inequality (2.2) holds as
equality. Thus, to give an example for which it does not hold as an equality,
we can take an arbitrary non-constant random variable. In particular, if we
choose as X a random variable that takes the value 2 with probability 1/2
and the value 4 otherwise, then
E[X] = (1/2) · 2 + (1/2) · 4 = 1 + 2 = 3

and

E[1/X] = (1/2) · (1/2) + (1/2) · (1/4) = 1/4 + 1/8 = 3/8,

which is different from 1/E[X] = 1/3.
Solution 12
(a) Consider a number r ∈ {1, 2, 3, 4, 5, 6}. If r is even, then Pr[X =
r|O] = 0 because the event O implies that X takes an odd value,
and Pr[X = r|E] = 1/3 because under the event E the variable X has
equal probability to take each one of the three even numbers 2, 4 or 6.
Similarly, if r is odd, then Pr[X = r|O] = 1/3 and Pr[X = r|E] = 0.
Thus,

E[X|O] = ∑_{r=1}^{6} r · Pr[X = r|O] = (1 + 3 + 5)/3 = 3

and

E[X|E] = ∑_{r=1}^{6} r · Pr[X = r|E] = (2 + 4 + 6)/3 = 4.
(b) Since X is a uniformly random number from the set {1, 2, 3, 4, 5, 6},
E[X] is the average of the numbers in this set. Similarly, E[X|O]
and E[X|E] are the averages of the odd and even numbers in this
set, respectively. Thus, the inequality E[X|O] < E[X|E] holds simply
because the average of the even numbers in {1, 2, 3, 4, 5, 6} happens to
be larger than the average of the odd numbers in this set.
Solution 13
Let X be a random variable of the number of points earned by the player,
and let D be a random variable representing the result in the first roll of
the dice (i.e., the value of d). Since the expected value of the number shown
on the dice in a single roll is 3.5 (see the solution for Exercise 10), by the
linearity of expectation we get that the expected sum of the values shown
by the dice in d rolls is 3.5d. Thus, E[X|D = d] = 3.5d. Using the law of
total expectation, we now get
E[X] = ∑_{d=1}^{6} Pr[D = d] · E[X|D = d] = ∑_{d=1}^{6} 3.5d/6 = 3.5 · 21/6 = 12.25.
Solution 14
Let X be a random variable taking the values 1 and 3 with probability
1/2 each, and let Y = X + 1. Clearly, X and Y are not independent since
knowing the value of one of them suffices for calculating the value of the
other. Now, note that
E[X] = 1 · Pr[X = 1] + 3 · Pr[X = 3] = (1 + 3)/2 = 2,
E[Y] = 2 · Pr[Y = 2] + 4 · Pr[Y = 4] = (2 + 4)/2 = 3,
and
E[X · Y] = 2 · Pr[X = 1, Y = 2] + 12 · Pr[X = 3, Y = 4] = (2 + 12)/2 = 7.
Thus, Lemma 6 does not apply to these X and Y because E[X · Y] = 7 ≠ 6 = E[X] · E[Y].
Solution 15
Using the linearity of expectation, we get
If X and Y are independent (and thus, E[XY ] = E[X] · E[Y ]), then we
also get
Solution 16
As suggested by the hint, let us define for every pair of distinct numbers
i, j ∈ {1, 2, . . . , n} an indicator Xij for the event that this pair is reversed
in π. Since X is the number of reversed pairs in π, we get
X = ∑_{i=1}^{n} ∑_{j=i+1}^{n} X_{ij}.
to appear, and thus, both probabilities are equal to half. Plugging this into
the previous equality yields
E[X] = ∑_{i=1}^{n} ∑_{j=i+1}^{n} Pr[the pair of i and j is reversed] = ∑_{i=1}^{n} ∑_{j=i+1}^{n} 1/2 = (n(n − 1)/2)/2 = n(n − 1)/4.
Solution 17
(a) Consider the random variable Y = X − a. Since Y is always non-
negative, using Markov’s inequality, we get
Pr[X ≥ t] = Pr[Y ≥ t − a] ≤ E[Y]/(t − a) = (E[X] − a)/(t − a).
(b) Consider the random variable Z = b − X. Since Z is always non-negative, using Markov's inequality, we get

Pr[X ≤ t] = Pr[Z ≥ b − t] ≤ E[Z]/(b − t) = (b − E[X])/(b − t).
(c) To prove (a) without using Markov’s inequality, we will lower bound
the expectation of X. Like in the proof of Markov’s inequality, we need
to distinguish between high and low values of X. The high values of
X are the values larger than t. We will lower bound these values by t.
Values of X smaller than t are considered low values, and we will lower
bound them by a. Let R be the set of values that X can take. Then,

E[X] = ∑_{r∈R} r · Pr[X = r] = ∑_{r∈R, r≥t} r · Pr[X = r] + ∑_{r∈R, r<t} r · Pr[X = r]
≥ ∑_{r∈R, r≥t} t · Pr[X = r] + ∑_{r∈R, r<t} a · Pr[X = r] = t · Pr[X ≥ t] + a · Pr[X < t].
We now note that the two probabilities on the rightmost side of the
last inequality add up to 1. Using this observation, we get from the last
inequality E[X] ≥ t · Pr[X ≥ t] + a · (1 − Pr[X ≥ t]), and rearranging this yields
Pr[X ≥ t] ≤ (E[X] − a)/(t − a), which is exactly the bound of part (a).
Solution 18
(a) Since X counts the number of successes in n independent Bernoulli
trials, it takes only non-negative values, thus, we can apply Markov’s
inequality to it. Choosing t = 2 in the alternative form of Markov’s
inequality given by Corollary 1, we get
Pr[X ≥ 2np] = Pr[X ≥ t · E[X]] ≤ 1/t = 1/2.
(b) Using Chebyshev's inequality and the fact that Var[X] = npq, we also get

Pr[X ≥ 2np] ≤ Pr[|X − np| ≥ np] ≤ Var[X]/(np)² = q/(np) ≤ 1/(np).
Solution 19
(a) For δ = 0, the two sides of the inequality

2δ/(2 + δ) ≤ ln(1 + δ)

are both equal to 0. Thus, to prove that the inequality holds for every δ ≥ 0, it suffices to show that for every δ in this range the derivative of the left-hand side of the inequality with respect to δ is upper bounded by the derivative with respect to δ of the right-hand side. The derivative of the right-hand side is 1/(1 + δ), and the derivative of the left-hand side is

[2(2 + δ) − 2δ · 1]/(2 + δ)² = 4/(2 + δ)²,

which is indeed at most 1/(1 + δ) since (2 + δ)² ≥ 4(1 + δ) for every δ ≥ 0.
(b) Similarly, for δ = 0, the two sides of the inequality

(δ² − 2δ)/(2 − 2δ) ≤ ln(1 − δ)
are again equal to 0. Thus, to prove that the inequality holds for δ ∈
[0, 1), it suffices to show that for every δ in this range the derivative of
the left-hand side of the inequality with respect to δ is upper bounded
by the derivative with respect to δ of the right-hand side. The derivative
of the left-hand side is
[(2δ − 2)(2 − 2δ) + 2(δ² − 2δ)]/(2 − 2δ)² = [(8δ − 4δ² − 4) + (2δ² − 4δ)]/(2 − 2δ)² = (4δ − 2δ² − 4)/(2 − 2δ)²,

while the derivative of the right-hand side is −1/(1 − δ). Observe now that

(4δ − 2δ² − 4)/(2 − 2δ)² ≤ −1/(1 − δ) ⇔ (4δ − 2δ² − 4)(1 − δ) ≤ −(2 − 2δ)²
⇔ 8δ + 2δ³ − 6δ² − 4 ≤ −4 + 8δ − 4δ² ⇔ 2δ³ ≤ 2δ²,

and the last inequality clearly holds for every δ ∈ [0, 1).
Solution 20
Observe that the assumption in the exercise about n implies 10/√n ∈ (0, 1).
Using this observation and the fact that E[X] = n/2, we get using the
Chapter 3
Estimation Algorithms
In Chapter 1, we saw the data stream model, and studied a few algorithms
for this model; all of which produced exact solutions. Unfortunately,
producing exact solutions often requires a large space complexity, and
thus, most of the algorithms we will see in this book will be estimation
and approximation algorithms. An estimation algorithm is an algorithm
which produces an estimate for some value (such as the length of the
stream). Similarly, an approximation algorithm is an algorithm which given
an optimization problem finds a solution which approximately optimizes the
objective function of the problem. For example, an approximation algorithm
for the minimum weight spanning tree problem produces a spanning tree
whose weight is approximately minimal.
In this chapter, we will see our first examples of estimation algorithms
for the data stream model, and will learn how to quantify the quality of the
estimate produced by these algorithms. Further discussion and examples of
approximation algorithms can be found in Chapter 8.
state of the algorithm, and argue that it must encode the number of tokens
read so far by the algorithm.
1 In fact, the expected estimation error of the algorithm described increases with n since
the deviation of X from its expectation increases as n becomes larger. However, this
increase is very slow compared to the increase in n itself.
Figure 3.1. A graphical representation of the ladder on which the worker climbs. The
space above each bar represents the number of stream tokens “represented” by this bar,
which is inversely proportional to the probability to climb from this bar to the next
(the larger the space, the lower this probability). Ladder (a) represents the basic setting
in which the probability to move from each bar to the next one is identical. Ladder (b)
represents the more advanced setting in which this probability decreases as we get higher
on the ladder. One can observe that in ladder (b) the space above each bar of the ladder
is proportional to the height of the bar in the ladder, which intuitively means that the
estimation error increases roughly in proportion to n.
One solution for the above drawbacks is to make the probability that
the worker moves from one bar to the next different for different bars.
Specifically, we want this probability to be large for the first bar, and then
decrease gradually as we get to higher and higher bars. Since the low bars
are associated with relatively high probabilities, each such bar “represents”
only a few stream tokens, and thus, we expect the estimation error to be
small when n is small; which solves the second drawback presented above.
Moreover, as the worker gets higher, the probability of him climbing to
the next bar with each token decreases, which also decreases his expected
climbing speed. Hence, the worker is not expected to climb very high even
when n is large, and this should allow us to keep track of the bar where
the worker is currently present using a relatively small space complexity
(solving the first drawback). A graphical explanation of this approach is
given in Figure 3.1.
Morris’s algorithm, the algorithm we promised above, is based on the
last idea, and is given as Algorithm 1. One can think of the variable X used
by this algorithm as representing the current bar of the worker.
of a random variable which takes the value 1 with probability 2^{−X}, and the
value 0 otherwise. The solutions section describes one way in which this can
be done using a fair coin.
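In Python, a minimal sketch that is consistent with this description of Algorithm 1 (Morris's algorithm) is the following; the function name is ours, and random.random() stands in for the random decision of the algorithm.

import random

def morris_counter(stream):
    x = 0
    for _token in stream:
        # increase X with probability 2^(-X)
        if random.random() < 2.0 ** (-x):
            x += 1
    return 2 ** x - 1     # the estimate Y_n - 1 = 2^(X_n) - 1 of the stream length n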
E[Y_0] = E[2^{X_0}] = 1 = 0 + 1.

Next, assume that the lemma holds for some i ≥ 0, and let us prove it for
i + 1. By the law of total probability, for every 2 ≤ j ≤ i, we get

Pr[X_{i+1} = j] = Pr[X_i = j] · (1 − 2^{−j}) + Pr[X_i = j − 1] · 2^{−(j−1)},

and the same equality can be verified to hold also for j = 0, j = 1 and j = i + 1.2 Hence,

E[Y_{i+1}] = E[2^{X_{i+1}}] = ∑_{j=0}^{i+1} 2^j · Pr[X_{i+1} = j] = ∑_{j=0}^{i+1} 2^j · Pr[X_i = j] + ∑_{j=0}^{i+1} (2 · Pr[X_i = j − 1] − Pr[X_i = j]).
Consider the two sums on the rightmost side of the last equality. By
observing that Xi takes with a positive probability only values within the
range 0 to i, the first sum becomes
∑_{j=0}^{i+1} 2^j · Pr[X_i = j] = ∑_{j=0}^{i} 2^j · Pr[X_i = j] = E[2^{X_i}] = E[Y_i],

while the second sum becomes

∑_{j=0}^{i+1} (2 · Pr[X_i = j − 1] − Pr[X_i = j]) = 2 · ∑_{j=0}^{i} Pr[X_i = j] − ∑_{j=0}^{i} Pr[X_i = j] = ∑_{j=0}^{i} Pr[X_i = j] = 1.
Combining the above, we get E[Y_{i+1}] = E[Y_i] + 1 = (i + 1) + 1 = i + 2, where the second equality holds by the induction hypothesis.
2 These cases require a slightly different proof. For the cases of j = 0 and j = 1, such a
proof is required since the conditional probability Pr[Xi+1 = j|Xi = j − 1] might not be
defined because the event it conditions on may have zero probability. Similarly, in the
case of j = i + 1, the conditional probability Pr[Xi+1 = j|Xi = j] is not defined for the
same reason.
Corollary 1. For every ε ≥ 1, Pr[|(Y_n − 1) − n| > εn] ≤ 1/(1 + ε).
In Section 3.2, we will see techniques that can be used to bypass the
above issue and get an estimation guarantee also for ε < 1. However, before
getting there, we prove Lemma 2 which analyzes the space complexity of
Algorithm 1.
3 Inside the big O notation, the base of a log function is not stated (as long as it is
constant) because changing the base from one constant to another is equivalent to
multiplying the log by a constant. This is the reason we can assume, for simplicity,
in this calculation that the base of the log is 2.
Using the last equality, we can now calculate the expectation of Y²_{i+1}:

E[Y²_{i+1}] = E[2^{2X_{i+1}}] = ∑_{j=0}^{i+1} 2^{2j} · Pr[X_{i+1} = j] = ∑_{j=0}^{i+1} 2^{2j} · Pr[X_i = j] + ∑_{j=0}^{i+1} (2^j · 2 · Pr[X_i = j − 1] − 2^j · Pr[X_i = j]).
Consider the two sums on the rightmost side of the last equation. Since Xi
takes with a positive probability only values within the range 0 to i, the
first sum is equal to
∑_{j=0}^{i+1} 2^{2j} · Pr[X_i = j] = ∑_{j=0}^{i} 2^{2j} · Pr[X_i = j] = E[2^{2X_i}] = E[Y_i²],

while the second sum is equal to

∑_{j=0}^{i+1} (2^j · 2 · Pr[X_i = j − 1] − 2^j · Pr[X_i = j]) = 4 · ∑_{j=1}^{i+1} 2^{j−1} · Pr[X_i = j − 1] − ∑_{j=0}^{i} 2^j · Pr[X_i = j] = 3 · ∑_{j=0}^{i} 2^j · Pr[X_i = j] = 3 · E[Y_i].
Combining the above equalities, we get

E[Y²_{i+1}] = ∑_{j=0}^{i+1} 2^{2j} · Pr[X_i = j] + ∑_{j=0}^{i+1} (2^j · 2 · Pr[X_i = j − 1] − 2^j · Pr[X_i = j]) = E[Y_i²] + 3 · E[Y_i].
j=0 j=0
The expression for the variance given by Lemma 3 can be used to derive
an estimation guarantee for Algorithm 1 which is slightly better than the
guarantee given by Corollary 1.
Exercise 4. Use Chebyshev’s inequality and Lemma 3 to show that, for every
ε > 0, Algorithm 1 estimates the length of the stream up to a relative error
of ε with probability at least 1 − ε^{−2}/2.
where the second equality holds since 1/h is a constant. As the addition of
a constant to a random variable does not change its variance, we also have
Var[Z̄] = (1/h²) · ∑_{i=1}^{h} Var[Z_i] = (1/h²) · ∑_{i=1}^{h} Var[Y_n − 1] = (1/h²) · ∑_{i=1}^{h} Var[Y_n] = Var[Y_n]/h = n(n − 1)/(2h),
where the last equality holds by Lemma 3. The lemma now follows by
combining the above equalities.
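The averaging itself is easy to express in code. The following sketch (names are ours) runs h independent copies of a single-copy estimator, such as the morris_counter sketch given earlier in this chapter, and returns the average Z̄ of their outputs; it assumes the stream is stored (e.g., as a list) so the copies can be simulated one after another, whereas a true streaming implementation runs them in parallel during a single pass.

def average_of_copies(run_once, stream, h):
    # run_once is a single-copy estimator, e.g. the morris_counter sketch above
    estimates = [run_once(stream) for _ in range(h)]
    return sum(estimates) / h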
Using Chebyshev’s inequality, we can now get an estimation guarantee
for Algorithm 2 which is much stronger than what we have for Algorithm 1.
Corollary 2. For every ε > 0, Algorithm 2 estimates the length of the
stream up to a relative error of ε with probability at least 3/4.
Proof. By Chebyshev's inequality,

Pr[|Z̄ − n| ≥ εn] ≤ Var[Z̄]/(εn)² = [n(n − 1)/(2h)]/(εn)² < 1/(2hε²) ≤ 1/4.
Exercise 5. Show that, for every ε, δ > 0, changing the number h of copies
of Algorithm 1 used by Algorithm 2 to h = ε^{−2}δ^{−1}/2 makes Algorithm 2
estimate the length of the stream up to a relative error ε with probability at
least 1 − δ.
4 Recall that the median of a list of numbers is any number that is smaller than or equal
to 50% of the numbers in the list and larger than or equal to the other 50%.
Pr[Q ≥ k/2] ≤ δ.
variable with a binomial distribution B(k, p) for some value p ∈ [0, 1/4]. By
the Chernoff bound, we now get
Pr[Q ≥ k/2] = Pr[Q ≥ (1 + (1 − 2p)/(2p)) · pk] = Pr[Q ≥ (1 + (1 − 2p)/(2p)) · E[Q]]
≤ e^{−[(1 − 2p)/(2p)]·E[Q]/3} = e^{[(2p − 1)/6]·k} ≤ e^{−k/12} ≤ e^{−ln δ^{−1}} = δ.
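In code, the median technique is equally short. The sketch below (again with our own names) runs k independent copies of the averaging estimator and outputs the median of their outputs; with h = Θ(ε^{−2}) and k = Θ(ln δ^{−1}), the analysis above then yields a relative error of at most ε with probability at least 1 − δ.

import statistics

def median_of_averages(run_once, stream, h, k):
    # uses the average_of_copies sketch from the previous section
    outputs = [average_of_copies(run_once, stream, h) for _ in range(k)]
    return statistics.median(outputs)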
Exercise 7. Recall that the token processing time of a data stream algorithm
is the maximum number of basic operations that the algorithm might perform
from the moment it reads a token of its input stream (which is not the last
token in the stream) until the moment it reads the next token. Determine
the token processing time of Algorithm 3. Is there a tradeoff between the
token processing time and the quality of the estimation guaranteed by the
algorithm?
6 Note that these techniques can never be used to improve deterministic algorithms
because running multiple copies of a deterministic algorithm in parallel is useless —
the outputs of all the copies will always be identical.
Exercise Solutions
Solution 1
Consider an algorithm outputting the exact length of the input stream.
The algorithm reads the tokens of the stream sequentially, and updates its
internal state after every read. Note that, from the point of view of the
algorithm, every token might be the last token of the stream, and thus,
the algorithm must be ready at every given point to output the number of
tokens read so far based on its internal state. In other words, the internal
state of the algorithm must encode in some way the number of tokens read
so far by the algorithm. This means that the internal state must be able to
encode at least n different values, and therefore, it cannot be represented
using o(log n) bits.
Solution 2
Here, we explain one way to implement, using a fair coin, a random variable
which takes the value 1 with probability 2−X , and the value 0 otherwise.
Let us flip a fair coin X times, and consider a random variable Z which
takes the value 1 if and only if all the coin flips resulted in “heads” (and
the value 0 otherwise).
Let us prove that Z is distributed like the random variable we want to
implement. The probability that a given coin flip results in a “head” is 1/2.
Since the coin flips are independent, the probability that they all result in
“heads” is simply the product of the probabilities that the individual coin
flips result in “heads”, i.e.,

Pr[Z = 1] = ∏_{i=1}^{X} Pr[coin flip i resulted in a “head”] = ∏_{i=1}^{X} (1/2) = 2^{−X}.
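A direct Python rendering of this procedure (the function name is ours, and random.random() < 0.5 plays the role of a fair coin flip) is:

import random

def bernoulli_two_to_minus(x):
    # flip a fair coin x times; return 1 if and only if all flips are heads,
    # which happens with probability (1/2)^x = 2^(-x)
    for _ in range(x):
        if random.random() >= 0.5:    # this flip came up "tails"
            return 0
    return 1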
Solution 3
Consider a random variable X which takes the value 0 with probability 1/2,
and the value 2 otherwise. The expected value of X is
(1/2) · 0 + (1/2) · 2 = 1.
Thus, X, which only takes the values 0 and 2, never falls within the range
(0, 2E[X]) = (0, 2).
Solution 4
Recall that adding a constant to a random variable does not change its
variance. Hence, by Chebyshev's inequality,

Pr[|(Y_n − 1) − n| ≥ εn] = Pr[|Y_n − E[Y_n]| ≥ εn] ≤ Var[Y_n]/(εn)² = [n(n − 1)/2]/(εn)² < 1/(2ε²) = ε^{−2}/2.
Solution 5
Observe that Lemma 4 holds for any value of h. Thus, by choosing h =
ε^{−2}δ^{−1}/2, we get, by Chebyshev's inequality,

Pr[|Z̄ − n| ≥ εn] ≤ Var[Z̄]/(εn)² = [n(n − 1)/(2h)]/(εn)² < 1/(2hε²) = δ.
Solution 6
The only way to guarantee that the median of k independent copies of X
never falls within the range [n/2, 3n/2] is to construct a variable X which
never takes values within this range. Thus, our objective is to construct a
random variable X with expectation n and variance n(n − 1)/2 which never
takes values within the above range.
To make the construction of X simple, we allow it to take only two
values a and b with a non-zero probability. Moreover, we set the probability
of both these values to 1/2. Given these choices, the expectation and
variance of X are given by E[X] = (a + b)/2 and Var[X] = (a − b)²/4. Requiring these to be n and n(n − 1)/2, respectively, we can choose a = n − √(n(n − 1)/2) and b = n + √(n(n − 1)/2).
That a and b are outside the range [n/2, 3n/2] remains to be proven, which
is equivalent to proving
√(n(n − 1)/2) > n/2.
The last inequality is true for n ≥ 3 because

√(n(n − 1)/2) > n/2 ⇔ n(n − 1)/2 > n²/4 ⇔ n(n − 2) > 0.
Solution 7
The only thing Algorithm 3 does from the moment it reads a token from
its input stream which is not the last token until it reads the next token
is passing this token to k copies of Algorithm 2. Similarly, the only thing
Algorithm 2 does from the moment it reads a token from its input stream
which is not the last token until it reads the next token is passing this token
to h copies of Algorithm 1. Thus, the token processing time of Algorithm 3
is simply k · h times the token processing time of Algorithm 1.
Consider now Algorithm 1. After reading a non-last token from its input
stream, Algorithm 1 does only two things until it reads the next token from
the input stream. First, it makes a random decision. Second, based on the
result of the random decision, it might increase X by 1. Hence, the token
processing time of Algorithm 1 is O(1).
The above discussion implies that the token processing time of Algorithm 3 is k · h · O(1) = O(k · h).
Observe that there is indeed a tradeoff between the token processing time
of Algorithm 3 and its estimation guarantee. More specifically, the token
processing time increases as the estimation guarantee improves (i.e., ε and
δ become smaller).
Solution 8
The analysis of Algorithm 4 we present here is very similar to the analysis
of Algorithm 1. In particular, we define for every 0 i n the random
variable Xi in the same way it is defined in the analysis of Algorithm 1.
Recall that this means that Xi is the value of X after Algorithm 4 reads
i tokens from its input stream. Additionally, let Yi = aXi . Observe that
the output of Algorithm 4 can be written as (Yn − 1)/(a − 1). Lemma 5 is
analogous to Lemma 1. In particular, it shows that the expected output of
Algorithm 4 is equal to the length n of the stream.
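Based on this description, a natural reading of Algorithm 4 is the following sketch: X is increased with probability a^{-X} on every arriving token, and the output is (a^X − 1)/(a − 1). This is our reconstruction of the pseudocode (which is not reproduced in this excerpt), with the value a = 1 + 2ε²δ used later in the solution.

import random

def morris_counter_base_a(stream, a):
    x = 0
    for _token in stream:
        if random.random() < a ** (-x):   # increase X with probability a^(-X)
            x += 1
    return (a ** x - 1) / (a - 1)         # estimate of the stream length n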
E[Y_0] = E[a^{X_0}] = 1 = (a − 1) · 0 + 1.

Next, assume that the lemma holds for some i ≥ 0, and let us prove it for
i + 1. By the law of total probability, for every 2 ≤ j ≤ i, we get

Pr[X_{i+1} = j] = Pr[X_i = j] · (1 − a^{−j}) + Pr[X_i = j − 1] · a^{−(j−1)},

and the same equality can be verified to hold also for j = 0, j = 1 and j = i + 1. Hence,

E[Y_{i+1}] = E[a^{X_{i+1}}] = ∑_{j=0}^{i+1} a^j · Pr[X_{i+1} = j] = ∑_{j=0}^{i+1} a^j · Pr[X_i = j] + ∑_{j=0}^{i+1} (a · Pr[X_i = j − 1] − Pr[X_i = j]).
Consider the two sums on the rightmost side of the last equality. By
observing that Xi takes with a positive probability only values within the
range 0 to i, the first sum becomes
∑_{j=0}^{i+1} a^j · Pr[X_i = j] = ∑_{j=0}^{i} a^j · Pr[X_i = j] = E[a^{X_i}] = E[Y_i],

while the second sum becomes

∑_{j=0}^{i+1} (a · Pr[X_i = j − 1] − Pr[X_i = j]) = a · ∑_{j=1}^{i+1} Pr[X_i = j − 1] − ∑_{j=0}^{i} Pr[X_i = j] = (a − 1) · ∑_{j=0}^{i} Pr[X_i = j] = a − 1.
Combining the above equalities, we get

E[Y_{i+1}] = ∑_{j=0}^{i+1} a^j · Pr[X_i = j] + ∑_{j=0}^{i+1} (a · Pr[X_i = j − 1] − Pr[X_i = j])
= E[Y_i] + (a − 1) = [(a − 1)i + 1] + (a − 1) = (a − 1)(i + 1) + 1,

where the penultimate equality holds by the induction hypothesis.
Using the last lemma, we can now analyze the expected space complex-
ity of Algorithm 4.
−log₂ log₂ a = −log₂ log₂(1 + 2ε²δ) ≤ −log₂ [4ε²δ/(2 + 2ε²δ)] ≤ −log₂(ε²δ) = 2·log₂ ε^{−1} + log₂ δ^{−1}.
Next, assume that the lemma holds for some i ≥ 0, and let us prove it
for i + 1. We already know, from the proof of Lemma 5, that for every
0 ≤ j ≤ i + 1,

Pr[X_{i+1} = j] = Pr[X_i = j] · (1 − a^{−j}) + Pr[X_i = j − 1] · a^{−(j−1)}.

Using the last equality, we can now calculate the expectation of Y²_{i+1}:
E[Y²_{i+1}] = E[a^{2X_{i+1}}] = ∑_{j=0}^{i+1} a^{2j} · Pr[X_{i+1} = j] = ∑_{j=0}^{i+1} a^{2j} · Pr[X_i = j] + ∑_{j=0}^{i+1} (a^j · a · Pr[X_i = j − 1] − a^j · Pr[X_i = j]).
Consider the two sums on the rightmost side of the last equation. Since Xi
takes with a positive probability only values within the range 0 to i, the
first sum is equal to
∑_{j=0}^{i+1} a^{2j} · Pr[X_i = j] = ∑_{j=0}^{i} a^{2j} · Pr[X_i = j] = E[a^{2X_i}] = E[Y_i²],

while the second sum is equal to

∑_{j=0}^{i+1} (a^j · a · Pr[X_i = j − 1] − a^j · Pr[X_i = j]) = a² · ∑_{j=1}^{i+1} a^{j−1} · Pr[X_i = j − 1] − ∑_{j=0}^{i} a^j · Pr[X_i = j] = (a² − 1) · ∑_{j=0}^{i} a^j · Pr[X_i = j] = (a² − 1) · E[Y_i].
Combining the above equalities, we get

E[Y²_{i+1}] = ∑_{j=0}^{i+1} a^{2j} · Pr[X_i = j] + ∑_{j=0}^{i+1} (a^j · a · Pr[X_i = j − 1] − a^j · Pr[X_i = j])
= E[Y_i²] + (a² − 1) · E[Y_i]
= [(a² − 1)(a − 1)/2 · i² + (a² − 1)(3 − a)/2 · i + 1] + (a² − 1) · [(a − 1)i + 1]
= (a² − 1)(a − 1)/2 · (i + 1)² + (a² − 1)(3 − a)/2 · (i + 1) + 1,
where the penultimate equality holds by the induction hypothesis.
We are now ready to prove the estimation guarantee of Algorithm 4.
Corollary 4. For every ε, δ > 0, Algorithm 4 estimates the length of the
stream up to a relative error of ε with probability at least 1 − δ.
Proof. By Chebyshev’s inequality, we get
Pr[|(Y_n − 1)/(a − 1) − n| ≥ εn] = Pr[|Y_n − 1 − (a − 1)n| ≥ (a − 1)εn]
= Pr[|Y_n − E[Y_n]| ≥ (a − 1)εn]
≤ Var[Y_n]/[(a − 1)εn]² = [(a − 1)³ · n(n − 1)/2]/[(a − 1)²ε²n²] < (a − 1)/(2ε²) = 2ε²δ/(2ε²) = δ.
Chapter 4
Reservoir Sampling
Many data stream algorithms work in two steps. In the first step, the
algorithm reads its input stream and produces a summary of this stream,
while in the second step the algorithm processes the summary and produces
an output. The summary used by such an algorithm should satisfy two
seemingly contradicting requirements. On the one hand, it should be short
so as to keep the space complexity of the algorithm small. On the other
hand, the summary should capture enough information from the original
stream to allow the algorithm to produce an (approximately) correct answer
based on the summary alone. In many cases, it turns out that a summary
achieving a good balance between these requirements can be produced by
simply taking a random sample of tokens out of the algorithm’s input
stream. In this chapter, we study data stream algorithms for extracting
random samples of tokens from the stream, and present one application of
these algorithms.
73
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch04 page 74
1 Let us explain the intuition behind the name. Assume that we have a hat with n balls,
and we conduct the following procedure: draw a random ball from the hat, note which
ball it is, and then replace the ball in the hat. One can observe that by repeating this
procedure k times, we get a uniform sample of k balls with replacement.
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch04 page 75
Reservoir Sampling 75
sample of k tokens from the stream with replacement means that one can
get such a sample by simply running k copies of Algorithm 1 and combining
their outputs.
The other natural way to understand uniform sampling of k tokens
from the stream is uniform sampling without replacement.2 Here, we want
to pick a set of k distinct tokens from the stream, and we want each such
set to be picked with equal probability. It is important to note that in
the context of uniform sampling without replacement, we treat each one of
the n tokens of the stream as unique. This means that if a token appears
in multiple positions in the stream, then these appearances are considered
different from each other, and thus, can appear together in a sample without
replacement. To make it easier to handle this technical point, whenever we
discuss sampling from this point on we implicitly assume that the tokens
of the stream are all distinct.
Exercise 2 shows that in some cases sampling without replacement can
be reduced to sampling with replacement (which we already know how
to do).
2 Again,the intuition for the name comes from a hat with n balls. Given such a hat, one
can get a uniform sample of k balls without replacement by repeatedly drawing a ball
from the hat, noting which ball it is and then throwing the ball away (so that it cannot
be drawn again later).
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch04 page 76
Reservoir Sampling 77
The token 5 has a relative rank which is ε-close to 1/2 when ε = 0.1,
however, its value is very different from the value of the true median (which
is 1000).
We now begin the analysis of Algorithm 3 by upper bounding the
probability that the token selected by Algorithm 3 has a relative rank
smaller than r − ε. Note that, since Algorithm 3 picks token number rk
in the sorted version S of S, this is equivalent to upper bounding the
probability that S contains at least rk tokens of relative ranks smaller
than r − ε (see Figure 4.1 for a graphical explanation).
Proof. If r < ε, then the lemma is trivial since the relative rank of every
token is non-negative. Thus, we may assume r ε in the rest of the proof.
For every 1 i k, let Xi be an indicator for the event that the ith
token in S is of a relative rank smaller than r − ε. Recall that S is a uniform
sample with replacement, and thus, every token of S is an independent
uniform token from the stream. Hence, the indicators X1 , X2 , . . . , Xk are
independent. Moreover, since the number of tokens in the stream having
a relative rank smaller than r − ε is (r − ε) · (n − 1), each indicator Xi
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch04 page 79
Reservoir Sampling 79
r–ε r r+ε
Figure 4.1. A graphical explanation of the analysis of Algorithm 3. The dots represent
the k tokens of S, and each dot appears at a position representing its relative rank. The
objective of the algorithm is to select a dot between the lines labeled r − ε and r + ε.
The algorithm tries to do this by selecting dot number rk when counting from the left.
Note that the selected dot has a relative rank at least r − ε if and only if the number of
dots to the left of the line marked by r − ε is less than rk. Similarly, the selected dot
has a relative rank at most r + ε if and only if the number of dots to the right of the line
marked by r + ε is at most k − rk.
Reservoir Sampling 81
the same time with probability at least 1 – δ. Exercise 6 shows that this
property can be made true by making the size of S dependent on h.
Exercise 6. Algorithm 4 is a variant of Algorithm 3 which uses a larger sample
S, gets h numbers r1 , r2 , . . . , rh ∈ [0, 1] rather than only one such number as
in Algorithm 3 and outputs h tokens t1 , t2 , . . . , th . Prove that with probability
at least 1−δ, the relative rank of the token ti found by this algorithm is ε-close
to ri for all 1 i h at the same time.
Exercise 7. If the weights of the tokens are all integral, then it is possible
to reduce weighted sampling of a single token from the stream to uniform
sampling of a single token. Describe a way to construct this reduction, i.e.,
explain how an algorithm for uniform sampling of a single token, such as
Algorithm 1, can be used for weighted sampling of a single token when all
weights are integral.
Reservoir Sampling 83
every token t that it reads is the candidate c with probability w(t)/W (n ).
Note that this implies that after the algorithm reads all n tokens of the
stream, each token t of the stream is the candidate c with probability
w(t)/W , which is what we want to prove.
The base case of the induction is the case n = 1. To see why the
claim we want to prove holds in this case, observe that when Algorithm 5
reads the first token t1 , it updates the candidate c to be equal to t1 with
probability w(t1 )/W (1) = 1. In other words, after Algorithm 1 reads a
single token, its candidate c is always equal to this token. Assume now that
the claim we want to prove holds after Algorithm 5 has read n − 1 tokens
(for some n > 1), and let us prove it holds also after the algorithm reads
the next token. Note that when Algorithm 5 reads the token tn , it updates
the candidate c to be this token with probability w(tn )/W (n ). Thus, it
only remains to prove that the probability of every previous token t is the
candidate c at this point is w(t)/W (n ).
Fix some token t which is one of the first n − 1 tokens of the stream. In
order for t to be the candidate c after Algorithm 5 has read n tokens, two
events must happen. First, t has to be the candidate c after the algorithm
has read n − 1 tokens, and second, Algorithm 5 must not change its
candidate after viewing tn . Since these events are independent,
t is the candidate after t is the candidate after
Pr = Pr
n tokens are read n − 1 tokens are read
Reading token tn does not make
· Pr
the algorithm change its candidate
w(t) w(tn ) w(t) w(tn )
= · 1− = · 1−
W (n − 1) W (n ) W (n ) − w(tn ) W (n )
w(t)
= ,
W (n )
where the second equality holds by the induction hypothesis and the above
observation that Algorithm 5 changes its candidate after reading tn with
probability w(tn )/W (n ).
Reservoir Sampling 85
Exercise Solutions
Solution 1
Following the hint, we prove by induction that after Algorithm 1 has read n
tokens, every one of the tokens it has read is the candidate c with probability
1/n . Note that this implies that after the algorithm reads all n tokens of
the stream, the candidate c is a uniformly random token out of the n tokens
of the stream, which is what we want to prove.
The base case of the induction is the case n = 1. To see why the
claim we want to prove holds in this case, observe that when Algorithm 1
reads the first token t1 , it updates the candidate c to be equal to t1 with
probability 1/n = 1. In other words, after Algorithm 1 reads a single token,
its candidate c is always equal to this token. Assume now that the claim
we want to prove holds after Algorithm 1 has read n − 1 tokens (for some
n > 1), and let us prove it holds also after the algorithm reads the next
token. Note that when Algorithm 1 reads the token tn , it updates the
candidate c to be this token with probability 1/n . Thus, it only remains to
be proved that the probability of every previous token to be the candidate
c at this point is also 1/n .
Fix some 1 i < n . In order for ti to be the candidate c after
Algorithm 1 has read n tokens, two events must happen. First, ti has to
be the candidate c after the algorithm has read n − 1 tokens, and second,
Algorithm 1 must not change its candidate after viewing tn . Since these
events are independent,
ti is the candidate after ti is the candidate after
Pr = Pr
n tokens are read n − 1 tokens are read
Reading token tn does not make
· Pr
the algorithm change its candidate
1 1 1
= · 1 − = ,
n −1 n n
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch04 page 86
where the second equality holds by the induction hypothesis and the above
observation that Algorithm 1 changes its candidate after reading tn with
probability 1/n .
Solution 2
(a) Let us name the k tokens of the sample S by s1 , s2 , . . . , sk , and for every
1 i k, let us denote by Ei the event that the tokens s1 , s2 , . . . , si are
all distinct. Clearly, Pr[E1 ] = 1. Moreover, since each one of the tokens
s1 , s2 , . . . , sk is a uniformly random token from the stream which is
independent of the other tokens of the sample S, we get, for every
1 < i k,
i−1
Pr[Ei |Ei−1 ] = 1 − .
n
Note that the event Ei implies the event Ei−1 for every 1 < i k.
Thus, the conditional probabilities calculated above can be combined
to give
k
Pr[Ei ] k
Pr[Ei−1 ∩ Ei ]
Pr[Ek ] = Pr[E1 ] · = Pr[E1 ] ·
i=2
Pr[Ei−1 ] i=2
Pr[Ei−1 ]
k
= Pr[E1 ] · Pr[Ei |Ei−1 ]
i=2
k
k
i−1 k k2
= 1− 1− 1− .
i=2
n i=2
n n
Reservoir Sampling 87
with respect to the different tokens of the stream, and thus, Ŝ must
have an equal probability to be each one of the sets of K.
The other way to prove that Ŝ is a uniformly random set from K is
more explicit. Fix some set S ∈ K. Since S is a uniform sample of k
tokens from the stream with replacement, we get
k!
Pr[Ŝ = S ] = .
nk
Pr[S = S ∧ E] k!/nk
Pr[S = S |E] = =
k
i=2 1 − n
Pr[E] i−1
−1
k! k! · (n − k)! n
=
k = = .
n · i=2 (n − i + 1) n! k
Solution 3
Following the hint, we prove by induction that after Algorithm 2 has read
n k tokens, the reservoir R is a uniform sample of k tokens without
replacement from the set of the first n tokens of the input stream. Note
that this implies that after the algorithm reads all n tokens of the stream,
the reservoir is a uniform sample of k tokens without replacement from the
algorithm’s input stream, which is what we want to prove.
The base case of the induction is the case n = k. In this case, after
Algorithm 2 reads n tokens, its reservoir contains the first k tokens of the
input stream. This is consistent with the claim we want to prove since a
uniform sample of k tokens without replacement from a set of size k always
contains all the tokens of the set. Assume now that the claim we want to
prove holds after Algorithm 2 has read n − 1 tokens (for some n > k), and
let us prove it holds also after the algorithm reads the next token.
Let N be the set of the first n tokens of the stream, (i.e., N =
{t1 , t2 , . . . , tn }), and let R(i) denote the reservoir R after Algorithm 2 has
read i tokens. We need to prove that for every given set S ⊆ N of k tokens,
we obtain
−1
n
Pr[S = R(n )] = .
k
There are two cases to consider. The first case is when tn ∈/ S. In this case,
we can have R(n ) = S only if Algorithm 2 did not change its reservoir
after reading tn . Since the decision to make a change in the reservoir after
reading tn is independent of the content of the reservoir, this implies
Reservoir Sampling 89
k 1 1
· = .
n k n
It now remains to be observed that the probability that both conditions
hold at the same time is
⎡ −1 ⎤ −1
n − 1 1 n
− k k! · (n − k − 1) ! n
⎣(n − k) ·
⎦· = · = .
k n n (n − 1)! k
Solution 4
If r + ε > 1, then Lemma 2 is trivial since the relative rank of every token
is at most 1. Thus, we may assume r + ε 1 in the rest of the proof.
For every 1 i k, let Xi be an indicator for the event that the ith
token in S is of relative rank larger than r + ε. Recall that S is a uniform
sample with replacement, and thus, every token of S is an independent
uniform token from the stream. Hence, the indicators X1 , X2 , . . . , Xk are
independent. Moreover, since the number of tokens in the stream having a
relative rank larger than r + ε is (1 − r − ε) · (n − 1), each indicator Xi
takes the value 1 with probability
(1 − r − ε) · (n − 1) (1 − r − ε) · (n − 1) + 1
n n
1 − (1 − r − ε) ε
= 1−r−ε+ 1−r− .
n 2
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch04 page 90
2 2 −2
ε
− 12(1−r) ε
2 ·k·(1−r− 2 )
ε
− 12(1−r)2 ·(24ε ·ln(2/δ))· 1−r
=e e 2
δ
e− ln(2/δ) = ,
2
where the first inequality holds since X takes only integer values.
Solution 5
Intuitively, a small value for ε reduces the number of tokens whose relative
rank is ε-close to r, and thus, makes life more difficult for an algorithm that
needs to find such a token. We will show that when ε becomes too small
(i.e., ε 1/(2n)), there might be no tokens at all whose relative rank is
ε-close to r, which means that the algorithm cannot find such a token.
The relative rank of every token is of the form i/(n−1) for some integer
0 i < n. This means that for every 1 i < n, the range
i−1 i
,
n−1 n−1
does not contain the relative rank of any token. Since the size of this range
is (n − 1)−1 > 1/n, this means that the range
i − 1/2 1 i − 1/2 1
− , +
n−1 2n n − 1 2n
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch04 page 91
Reservoir Sampling 91
also does not contain the relative rank of any token. Thus, for ε 1/(2n),
there is no token whose relative rank is ε-close to (i − 1/2)/(n − 1).
Solution 6
Assume that Algorithm 4 is executed with some values ε and δ for the
parameters ε and δ, respectively. Then, the size of the sample S it uses
is 24ε2 · ln(2h/δ ). Note that Algorithm 3 uses a sample S of that size
when its parameters ε and δ are set to ε and δ /h, respectively. Thus, one
can repeat the proof of Theorem 1 and get that, for every 1 i h, ti ’s
relative rank is ε -close to ri with probability at least 1 − δ /h. In other
words, the probability that the relative rank of ti is not ε -close to ri is at
most δ /h. By the union bound, we now get that the probability that for
some 1 i h the relative rank of the token ti is not ε -close to ri is at
most δ , which is what we want to prove.
Solution 7
Let ALG be an arbitrary algorithm for uniform sampling of a single token
from the stream, such as Algorithm 1, and consider Algorithm 7. This
algorithm applies ALG to a stream which contains w(t) copies of every
token t in the stream of Algorithm 7, and then returns the output of ALG.
Note that this is possible since w(t) is integral by the assumption of the
exercise.
disjoint event, we get that the probability that any copy of t is selected by
ALG is w(t)/W .
Recall that t was chosen as an arbitrary token from the stream of
Algorithm 7. Thus, we have proven that Algorithm 7 performs weighted
sampling of a single token from its stream using an algorithm ALG for
uniform sampling of a single token.
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch05 page 93
Chapter 5
93
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch05 page 94
Exercise 1. Consider the four functions from the domain {a, b, c, d} to the
range {0, 1} that are specified by the following tables, and let F be the set
of the four functions represented by these tables. Verify that F is a universal
hash functions family.
where the operator ⊕ represents here the XOR operation. In other words,
the value of fb,S (x) is the XOR of b with all the bits of x at the positions
specified by S.
Lemma 1. For every two distinct strings x, y ∈ {0, 1}m, two bits bx ,
by ∈ {0, 1} and a hash function f drawn uniformly at random from FX , it
holds that Pr[f (x) = bx ∧ f (y) = by ] = 1/4.
Looking at these equations, one can observe that regardless of the values
that ⊕j∈S x(j) and ⊕j∈S y (j) happen to have, exactly one of the four
pairs (f0,S (x), f0,S (y)), (f0,S∪{i} (x), f0,S∪{i} (y)), (f1,S (x), f1,S (y)) or
(f1,S∪{i} (x), f1,S∪{i} (y)) is equal to (bx , by ). Thus, out of the four functions
f0,S , f0,S∪{i} , f1,S and f1,S∪{i} , exactly one is an option for f which makes
the event {f (x) = bx ∧ f (y) = by } hold, which implies
1
Pr f (x) = bx ∧ f (y) = by |f ∈ f0,S , f0,S∪{i} , f1,S , f1,S∪{i} = .
4
To complete the proof of the lemma, it remains to be observed that the
sets {f0,S , f0,S∪{i} , f1,S ,f1,S∪{i} } obtained for the various choices of S ⊆
{1, 2, . . . , r}\{i} are disjoint and that their union is the entire family FX .
Thus, by the law of total probability,
FP = {fa,b |a, b ∈ {0, 1}r }. In the rest of the proof, we show that this
family obeys all properties guaranteed for it by Lemma 2.
Every function fa,b in FP is defined by the strings a and b. Since these
two strings each consist of r bits, the representation of the function fa,b
using these strings requires only 2r bits, as guaranteed. Let us now prove
that FP is pairwise independent. Consider two distinct strings d1 , d2 ∈
{0, 1}r and two arbitrary strings r1 , r2 ∈ {0, 1}r . We are interested in
determining the number of (a, b) pairs for which it holds that fa,b (d1 ) = r1
and fa,b (d2 ) = r2 . To do this, we observe that the last two equalities are
equivalent to the following two equations:
Since we fixed a value for r1 , r2 , d1 and d2 , these two equations are a pair of
linear equations in two variables a and b. Moreover, these linear equations
must have exactly one solution since the fact d1 = d2 implies that the
coefficients matrix corresponding to these equations is non-singular. Thus,
there is a single pair (a, b) for which fa,b (d1 ) = r1 and fa,b (d2 ) = r2 , which
implies that for a uniformly random function fa,b ∈ FP , we have
Note that the definition of P (x) does not involve a division by zero because
the strings d1 , d2 , . . . , dk are all distinct by definition. Additionally, P (x)
is a polynomial of degree at most k – 1 over the field F and thus, it is
equal to fa0 ,a1 ,...,ak−1 for some k-tuple (a0 , a1 , . . . , ak−1 ). One can observe
that P (di ) = ri for every integer 1 i k, and thus, the function
fa0 ,a1 ,...,ak−1 equal to P (x) must also obey fa0 ,a1 ,...,ak−1 (di ) = ri for every
integer 1 i k. Hence, there exists at least one k-tuple (a0 , a1 , . . . , ak−1 )
for which these equalities hold.
Assume now toward a contradiction that there exist distinct k-tuples
(a0 , a1 , . . . , ak−1 ) and (b0 , b1 , . . . , bk−1 ) such that fa0 ,a1 ,...,ak−1 (di ) =
fb0 ,b1 ,...,bk−1 (di ) = ri for every integer 1 i k. This assumption implies
that the polynomial
k−1
Q(x) = fa0 ,a1 ,...,ak−1 (x) − fb0 ,b1 ,...,bk−1 (x) = (ai − bi )xi
i=0
Exercise Solutions
Solution 1
To solve this exercise, one should verify that for every pair of distinct
elements x, y ∈ {a, b, c, d}, exactly two out of the four tables map x and y
to the same number. We do this here only for two pairs as a demonstration,
but it is not difficult to verify that this is the case for all the pairs.
• The elements b and c are mapped to the same image by the first and
third tables, and to different images by the two other tables.
• The elements b and d are mapped to the same image by the first two
tables, and to different images by the two other tables.
Solution 2
We need to verify that for every pair of distinct elements x, y ∈ {a, b, c},
a function f chosen uniformly at random from F has an equal probability
(of 1/|R|2 = 1/4) to map x and y to every possible pair of range items r1 ,
r2 ∈ R. There are six possible assignments to x and y, however, it turns
out that in order to verify this claim it suffices to consider only three out
of these assignments because the roles of x and y are symmetric. In other
words, we need to verify the claim only for the assignments (x, y) = (a, b),
(x, y) = (a, c) and (x, y) = (b, c), and this will imply that the claim also
holds for the three other possible assignments. Furthermore, in this solution,
we only verify the claim for the assignment (x, y) = (a, b) because the
verification for the two other assignments that we need to consider is very
similar, and we leave it as an exercise for the reader.
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch05 page 103
Consider now the four tables from Exercise 1 restricted to the lines of
x = a and y = b. These restricted tables are as follows:
Solution 3
We begin by proving that if F is a k-wise independent family, then the
properties stated in the exercise hold. The first property holds because for
every k distinct d, d1 , d2 , . . . , dk−1 ∈ D and an element r ∈ R, we have
Pr[f (d) = r]
= ··· Pr[f (d) = r and ∀1ik−1 f (di ) = ri ]
r1 ∈R r2 ∈R rk−1 ∈R
1 1
= ··· k
= ,
|R| |R|
r1 ∈R r2 ∈R rk−1 ∈R
where the first equality holds since the events {∀1ik−1 f (di ) = ri } are
disjoint for every fixed choice of r1 , r2 , . . . , rk−1 . One consequence of the
property that we have proved is that for every k distinct d1 , d2 , . . . , dk ∈ D
and k element r1 , r2 , . . . , rk ∈ R, it holds that
k
1
Pr [∀1ik f (di ) = ri ] = k
= Pr [f (di ) = ri ],
|R| i=1
which is the second property stated in the exercise (the first equality holds
since F is k-wise independent).
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch05 page 104
k
1
Pr [∀1ik f (di ) = ri ] = Pr [f (di ) = ri ] = k
,
i=1 |R|
where the first equality holds since the first property in the exercise asserts
that the variables in {f (di ) | 1 i k} are k-wise independent, and the
second equality holds since the second property in the exercise asserts that
f (di ) has equal probability to be any element of R (including ri ).
Solution 4
For m = 2, there are 8 functions in FX since there are two options for b
and 2m = 4 options for S. The tables corresponding to these functions are
as follows:
f 1,{2} f 1,{1,2}
Input Image Input Image
00 1 00 1
01 0 01 0
10 1 10 0
11 0 11 1
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch05 page 105
Solution 5
(a) Since G contains a function gf1 ,f2 ,...,fn for every possible choice of n
(not necessarily distinct) functions f1 , f2 , . . . , fn from F , choosing a
function g from G uniformly at random is equivalent to choosing n
functions f1 , f2 , . . . , fn from F uniformly and independently at random.
Thus, assuming g, f1 , f2 , . . . , fn are distributed this way, for every two
distinct elements d1 , d2 ∈ D and two tuples (r1,1 , r1,2 , . . . , r1,n ), (r2,1 ,
r2,2 , . . . , r2,n ) ∈ Dn it holds that
Solution 6
As suggested by the hint, let FP be the hash functions family whose
existence is guaranteed by Lemma 2 when setting r = max{m, n}. Note
that every function in FP is a function from the domain {0, 1}max{n,m} to
the range {0, 1}max{n,m} , and to solve the exercise, we need to convert it into
a function from the domain {0, 1}m to the range {0, 1}n. More formally, we
need to construct using FP a k-wise independent hash functions family F
from the domain {0, 1}m to the range {0, 1}n. In the following paragraphs,
we do this by considering two cases distinguished by the relationship they
assume between m and n.
• The first case we consider is the case, of m n. In this case, the domain
of the functions of FP is already {0, 1}m because max{m, n} = m.
However, to make the range of these functions {0, 1}n, we need to
shorten the images they produce by m − n bits, which we can do (for
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch05 page 106
where the third equality holds since the events {∀1ik fP (di ) = ri xi }
obtained for different choices of x1 , x2 , . . . , xk ∈ {0, 1}m−n are disjoint,
and the penultimate equality holds because FP is k-wise independent.
• The second case we consider is the case of m n. In this case, the
range of the functions of FP is already {0, 1}n because max {m, n} = n.
However, to make the domain of these functions {0, 1}m, we need to
extend input strings from {0, 1}m to strings of {0, 1}n, which we can
do (for example) by adding n − m trailing zeros to the end of the input
string. More formally, we define a function t from {0, 1}m to {0, 1}n as
follows:
Chapter 6
109
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch06 page 110
1 The algorithm got this name since AMS are the initials of the authors of the paper
in which this algorithm was first published. This paper presented also other algorithms,
and unfortunately, these algorithms are also often called the “AMS Algorithm”. Thus,
it is more accurate to refer to the algorithm we study here as the AMS algorithm for
counting distinct tokens.
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch06 page 111
Let us now begin the analysis of the quality of the estimation obtained
by Algorithm 1. This analysis requires some additional notation. Let D
be the set of distinct tokens in the stream, and let d be their number.
Additionally, for every value 0 ≤ i ≤ log2 m, let Zi be the number of
tokens in D whose hash images have at least i trailing zeros. More formally,
Lemma 1. For every 0 ≤ i ≤ log2 m, E[Zi ] = d/2i and Var[Zi ] < d/2i .
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch06 page 112
At this point we need to use the pairwise independence of the hash functions
family H. Note that this pairwise independence means that the random
variables in the sum (6.1) are also pairwise independent as each one of
them depends on a different single value of h. Thus,
−i |D| d
Var [Zi ] = Var [Wi,t ] = 2−i 1 − 2−i < 2 = i = i.
2 2
t∈D t∈D t∈D
Corollary 1. For every constant c ≥ 1 and integer 0 ≤ i ≤ log2 m,
Consider now the case 2i ≤ d/c. In this case, Chebyshev’s inequality implies
d d
Pr [Zi = 0] = Pr Zi ≤ E [Zi ] − i ≤ Pr |Zi − E [Zi ]| ≥ i
2 2
Var [Zi ] 2i 1
≤ 2 < ≤ .
(d/2i ) d c
We are now ready to prove a guarantee for the quality of the answer
produced by Algorithm 1.
Theorem 1. For every c ≥ 2, with probability at least 1−2/c, Algorithm 1
estimates the number d of distinct elements in its data stream up to
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch06 page 113
√
a multiplicative
√ √ factor of 2 · c (i.e., its output is within the range
[d/( 2 · c), 2 · cd]).
Proof. Recall that the output of Algorithm 1 is 2z+1/2 , where z is the
largest value for which Zi is non-zero. Let be the smallest integer such
that 2 > c · d. If ≤ log2 m, then Corollary 1 states that with probability
at least 1 − 1/c, the value of Z is 0, which implies that Zi = 0 for every
i ≥ , and thus, also
√ √
2z+1/2 ≤ 2 · 2−1 ≤ 2 · cd.
√
Additionally, note that the inequality 2z+1/2 ≤ 2 · cd always holds when
√> log2 m because in this case cd ≥ m, while 2z+1/2 is always at most
2 · m since z cannot take a value larger than log2 m.
Similarly, let us now define to be the largest integer such that 2 <
d/c. If ≥ 0, then Corollary 1 states that with probability at least 1 − 1/c,
the value of Z is non-zero, which implies
√ √
2z+1/2 ≥ 2 +1/2 = 2 +1 / 2 ≥ d/( 2 · c).
√
Again, the inequality 2z+1/2 ≥ d/( 2 · c) always holds √ when < 0 because
in this case d/c ≤ 1, while 2z+1/2 is always at least 2 since z cannot take
a value smaller than 0.
By the union bound we now get that with probability at least 1 − 2/c,
the two above inequalities hold at the same time.
Theorem 1 shows that with a constant probability, the estimate
produced by Algorithm 1 is correct up to a constant multiplicative factor.
Unfortunately, this guarantee is quite weak, in the sense that if one wants
the probability that the algorithm succeeds to estimate d up to this
multiplicative factor to be close to 1, then the multiplicative factor itself
must be very large. This weakness can be somewhat alleviated using the
median technique we have seen in Chapter 3.
we would like to end this section with two exercises which further study
Algorithm 1.
the space complexity a bit) is to store more than one token, i.e., store the set
of tokens whose hash images have the largest numbers of trailing zeros. This
intuitive idea is implemented by Algorithm 3. We note that Algorithm 3
gets a parameter ε > 0 controlling the accuracy of its output.
The set of tokens that Algorithm 3 stores is denoted by B. The
algorithm also has a threshold z such that a token is stored in B if and
only if the binary representation of its hash image has at least z trailing
zeros. To keep its space complexity reasonable, the algorithm cannot let the
set B become too large. Thus, whenever the size of B exceeds a threshold
of c/ε2 (c is a constant whose value we later set to 576), the algorithm
increases z and accordingly removes tokens from B.
We begin the study of Algorithm 3 by analyzing its space complexity.
Exercise 4. Prove that the space complexity of Algorithm 3 is O(ε−2 · log m),
and thus, it is a streaming algorithm when ε is considered to be a constant.
s−1
−1
−1
2i 2s −2 d −2 12 1
2d
< 2 =ε · s
≤ε · 2
= ,
i=0
ε ε d 2 ε 12
24 s c
d< ·2 ≤ 2.
ε2 ε
This means that during the execution of Algorithm 3, the size of B can never
become larger than c/ε2 , and thus, the variable z is never increased from its
initial value of 0. Hence, the final value of z, which is denoted by r, is also 0.
Moreover, observe that Z0 counts the number of tokens from D whose hash
images have at least 0 trailing zeros. Clearly, all the tokens of D obey this
criterion, and thus, Z0 = |D| = d. Combining the above observations, we
get that when s ≤ 0, the expression 2r · Zr is deterministically equal to d.
The case of s ≥ 1 remains to be considered. Note that in this case, by
the law of total probability, we have
√
Exercise 6. Prove that, for every ε ∈ (0, 2 − 1), every deterministic data
stream algorithm for
√ estimating the number d of distinct tokens up to a multi-
plicative factor of 2−ε must have a space complexity of at least Ω(m). Hint:
The proof is very similar to the proof of Theorem 3, but uses the collection
whose existence is promised by Lemma 3 instead of the collection 2[m] .
It remains to prove Lemma 3.
Proof. Let us partition the tokens of [m] in an arbitrary way into m/c
buckets of size c each. We can now define a random set R which contains
a uniformly random single token from each bucket. One can observe that
the size of R is always m/c.
Consider now two independent random sets R1 and R2 having the same
distribution as R, and let Xi be an indicator for the event that R1 and R2
contain the same token of bucket i for 1 ≤ i ≤ m/c. Note that
m/c
|R1 ∩ R2 | = Xi .
i=1
Moreover, the variables X1 , X2 , . . . , Xm/c all the take the value 1 with
probability 1/c, independently. Thus, by the Chernoff bound,
⎡ ⎤
m/c
2m 2m
Pr |R1 ∩ R2 | ≥ 2 = Pr ⎣ Xi ≥ 2 ⎦
c i=1
c
⎡ ⎡ ⎤⎤
hP i
m/c
m/c
−E
m/c 2
=⎣ Xi ≥ 2 · E ⎣ Xi ⎦⎦ ≤ e = e−m/(3c ) .
i=1 Xi /3
i=1 i=1
2
If we now take a collection C of 2m/(7c ) independent random sets
such that each one of them is distributed like R, then by the last inequality
and the union bound we get
2m |C| 2
Pr ∃R1 ,R2 ∈C R1 ∩ R2 ≥ 2 ≤ · e−m/(3c )
c 2
2
2 2
m/(6c2 )
2m/(7c ) 2
2 2
≤ · e−m/(3c ) ≤ · e−m/(3c )
2 2
2 2
≤ 2m/(3c ) · e−m/(3c ) < 1,
where the third and last inequalities hold for a large enough m.
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch06 page 122
2 This is an example of a more general method, known as the probabilistic method, for
proving the existence of various objects. In this method, the prover gives a distribution
and shows that an object drawn from this distribution has some properties with a non-
zero probability, and this implies to the existence of an object having these properties.
Note that in the proof given here the random object in fact has the required properties
2 2
with high probability (because 2m/(3c ) · e−m/(3c ) approaches 0 as m increases).
However, in general, the probabilistic method works even when the random object has
the required properties only with a very low probability.
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch06 page 123
Exercise Solutions
Solution 1
To apply the median technique we run multiple parallel copies of Algo-
rithm 1, and output the median of the results produced by these copies.
Formally, the algorithm we get by applying the median technique to
Algorithm 1 is Algorithm 5. We will later choose the value of the parameter
C based on the values of ε and δ.
Recall that the median technique works when each copy has a
probability of more than 1/2 to output an estimate with the required
accuracy. In our case, we want √ the estimation to be correct up to a
multiplicative factor of (4 + ε) 2, and by Theorem 1, the probability that
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch06 page 124
2
In other words, we see that with probability at least 1 − e−2ε C/375 ,
Algorithm√5 outputs a correct estimate for d up to a multiplicative factor
of (4 + ε) 2 (i.e., a value within the range (6.2)). Recall that we want
this probability to be at least 1 − δ, which can be guaranteed by choosing
C = 188 · ε−2 · ln δ −1 .
The space complexity of Algorithm 5 remains to be analyzed. Algo-
rithm 5 uses O(ε−2 · log δ −1 ) copies of Algorithm 1, and each one of
these copies uses O(log m) space by Observation 1. Thus, the total space
complexity of Algorithm 5 is O(ε−2 · log δ −1 · log m). Note that for any
fixed value for ε this space complexity is larger than the space complexity
of Algorithm 1 by a factor of O(log δ −1 ).
Solution 2
Algorithm 1 requires the integer m to have two properties. First, it needs
to be a power of 2, and second, every token of the stream should belong
to the range 1 to m. Note that m has the second property by definition.
However, it might violate the first property. To solve this issue, let us define
m as the smallest power of 2 larger than m. Clearly, m has both the above
properties, and thus, the guarantee of Theorem 1 for Algorithm 1 will still
hold if we replace every reference to m in Algorithm 1 with a reference
to m .
In conclusion, by making the above replacement we get a modified
version of Algorithm 1 which has the same guarantee as the original version,
but does not need to assume that m is a power of 2. The space complexity
of this modified version is
Solution 3
For every value x ∈ [0, m], let Ux be the number of tokens in D whose hash
images are smaller than or equal to x. Additionally, for every token t ∈ D,
let Wx,t be an indicator for the event that h(t) ≤ x. One can observe that
Ux = Wx,t .
t∈D
Each one of the indicators in the last sum takes the value 1 with probability
x/m. Thus, by the linearity of expectation, we get E[Ux ] = dx/m.
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch06 page 126
We are now ready to upper bound the probability that Algorithm 2 makes
a large estimation error. First, let us upper bound the probability that the
algorithm outputs a number larger than cd. If cd > m, then this cannot
happen since the variable u can never take a value smaller than 1. In
contrast, if cd ≤ m, then Algorithm 2 can output a number larger than
cd only when Um/(cd) is non-zero, and thus, it is enough to upper bound
the probability of this event. By Markov’s inequality, we get
Pr Um/(cd) > 0 = Pr Um/(cd) ≥ 1
m
= Pr Um/(cd) ≥ · E Um/(cd)
dm/(cd)
dm/(cd) dm/(cd) 1
≤ ≤ = .
m m c
Next, let us upper bound the probability that the algorithm outputs
a number smaller than d/c. If d/c < 1, then this cannot happen since the
variable u can never take a value larger than m. In contrast, if d/c ≥ 1, then
Algorithm 2 can output a number larger than d/c only when Umc/d = 0.
Hence, it is again enough to upper bound the probability of this event. By
Chebyshev’s inequality, we obtain
d mc/d
Pr Umc/d = 0 = Pr Umc/d ≤ E Smc/d −
m
d mc/d
≤ Pr Umc/d − E Umc/d ≥
m
Var Umc/d d mc/d /m
≤ 2 ≤ 2
(d mc/d /m) (d mc/d /m)
m m 1 1
= ≤ = ≤ .
d mc/d d (mc/d − 1) c − d/m c−1
Using the union bound we now get that the probability that the output
of Algorithm 2 is outside the range [d/c, cd] is at most 2/(c − 1).
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch06 page 127
Solution 4
The space complexity of Algorithm 3 is the space necessary for storing the
variables t and z plus the space required for representing the random hash
function h and the set B. Recall that we assume that m is a power of 2.
Hence, as we saw in Chapter 5, the hash function h can be represented using
only O(log m) bits for an appropriate choice of a pairwise independent hash
functions family H. Moreover, since the largest values that can appear in
the variables t and z are m and log2 m, respectively, both these variables
can also be represented using O(log m) bits. It remains to bound the space
required for representing the set B. The size of this set is never larger than
c/ε2 +1, and each element in it is a number between 1 and m (and thus, can
be represented using O(log m) bits). Hence, representing the set B requires
O(ε−2 · log m) bits, and this term dominates the space complexity of the
entire algorithm.
Solution 5
If d<c/ε2 , then Algorithm 4 stores all the tokens of D in the set B (since
this set never reaches its maximum size c/ε2 ), and then returns the size of
this set as its output. Thus, Algorithm 4 always outputs the correct value
for d when d < c/ε2 . In the rest of the proof we consider the case d ≥ c/ε2 .
Note that in this case the output of Algorithm 4 is mc/(ε2 · h(u)), where
h(u) is the maximum hash image of any token appearing in B when the
algorithm terminates.
We now need two lemmata.
Using the lower bound on p given by (6.5), we can lower bound the
denominator of the rightmost side of the last inequality by
2
2
c c(1 + ε) 1
dp 1 − 2
≥ · 1−
ε dp ε2 1+ε
2
c(1 + ε) ε c
= 2
· = .
ε 1+ε 1+ε
where the last inequality follows again from the value of c and the fact that
ε < 1.
Solution 6
Let c be a constant depending on ε to be determined later. Additionally, let
c be the positive constant whose existence is guaranteed by Lemma 3 for
that value of c, and let m0 be a large enough constant so that Lemma 3 holds
for every m ≥ m0 which is dividable by c. Assume by way of contradiction
that the claim we need to prove is incorrect. In other words, we assume that
there exists a deterministic data stream algorithm ALG which √ estimates
the number d of distinct tokens up to a multiplicative factor of 2−ε whose
space complexity is not Ω(m). In particular, this means that there exists
a value m1 ≥ m0 + 2c such that the space complexity of ALG is less than
m1 /2c ) bits when m = m1 . In the rest of the proof we assume m = m1 .
Let m be the largest integer which is not larger than m1 and is divisible
by c. Clearly, m ≥ m0 , and thus, by Lemma 3, there exists a collection
Cm of size at least 2m /c such that every set in Cm is of size m /c and the
intersection of every two sets in this collection is of size at most 2m /c2 .
For every set S ∈ Cm we denote by σS a data stream consisting of the
elements of S in an arbitrary (but fixed) order and by MS the state of the
memory of ALG after it has received σS .
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch06 page 131
Since the memory used by ALG is less than m1 /(2c ) bits, we get that
the number of different values this memory can take is less than 2m1 /(2c ) ≤
2m /c , where the inequality follows since m > m1 − c and m1 ≥ 2c. Thus,
by the pigeon hole principle, there must exist at least two sets S, T ∈
Cm such that S = T but MS = MT . Consider now two data streams.
Data stream σ1 is obtained by concatenating two copies of σS , while data
stream σ2 is obtained by concatenating σT and σS (in that order). Since
the memory content of ALG is the same after it reads either σS or σT , it
cannot distinguish between the data streams σ1 and σ2 , and must output
the same output for both streams.
We now observe that σ1 contains only tokens of S, and thus, the number
of distinct tokens in it is m /c. On the other hand, since the definition of
Cm implies that the intersection of S and T is of size at most 2m /c2 , the
number of distinct tokens in σ2 is at least
m 2m 2m 1
2· − 2 = 1− .
c c c c
Hence, the ratio between the number of distinct tokens in σ1 and σ2 is
at least 2(1 − 1/c), which means that the single value that ALG outputs
given either σ1 or σ2 cannot estimate √
the number of distinct tokens in both
streams up a multiplicative factor of 2 − ε, unless
√ 2
2
ε
2 − ε ≥ 2 (1 − 1/c) ⇔ 1 − √ ≥ 1 − 1/c
2
1 1
⇔c≤ √ 2 = √ .
1 − 1 − ε/ 2 2 · ε − ε2 /2
Chapter 7
Sketches
133
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch07 page 134
• In the cash register (data stream) model, the frequency change in each
update event must be positive. In other words, one can think of each
update event (t, c) as an arrival of c copies of token t. This model is
called the cash register model because update events in it can increase
the frequencies of tokens, but these frequencies can never decrease;
which is similar to the behavior of a cash register in a shop, which
usually registers only incoming payments.
• In the turnstile (data stream) model, the frequency change in each
update event can be an arbitrary (possibly negative) integer. One can
think of each update event (t, c) with a positive c value as the arrival
of c copies of token t, and of each such event with a negative c value as
the departure of |c| copies of token t. Thus, the frequency of a token t
is equal to the number of its copies that have arrived minus the number
of its copies that have departed.
This point of view gave the turnstile model its name as turnstiles
are often used to count the number of people who are currently inside
a given area. This is done as follows: Every time that a person enters
Sketches 135
the area he must rotate the turnstile in one direction, which causes the
turnstile to increase its count by one. In contrast, when a person leaves
the area he rotates the turnstile in the other direction, which makes
the turnstile decrease its count accordingly.
• In many settings it does not make sense to have a negative frequency
for tokens. In other words, at every given point the number of copies
of a token that departed can be at most equal to the number of copies
of that token that have arrived so far. The strict turnstile model is a
special case of the turnstile model in which we are guaranteed that
the updates obey this requirement (i.e., that at every given point the
frequencies of the tokens are all non-negative).
Proof. Let us upper bound the space complexity necessary for the three
variables used by Algorithm 1. The variable t contains a single token, and
thus, requires only O(log m) bits.2 Next, we observe that the variable 1 is
equal at every given point to the sum of the frequencies of all the tokens at
this point. More formally, if M is the set of all possible tokens, then
|1 | = ft |ft | = f 1 .
t∈M t∈M
2 In fact, the variable t can be dropped altogether from the description of Algorithm 1
Sketches 137
appear more than n/k times in the data stream (where k is a parameter
of the algorithm). For ease of reading, we repeat this algorithm here as
Algorithm 2. We remind the reader that the pseudocode of this algorithm
assumes that for every possible token there exists a counter whose initial
value is zero.
Using the notation defined in this chapter, we can restate the objective
of Algorithm 2 as finding the tokens whose frequency is larger than k −1 ·
f 1 . Restated this way, this objective makes sense also in the context of
the more general models presented in this chapter.
(“a”, 1), (“b”, 2), (“a”, −1), (“b”, 2), (“c”, −1), (“a”, 1).
Ideally, both the data stream algorithm ALG and the combination algo-
rithm COMB should be space efficient, but this is not required by the
formal definition. Exercise 3 gives a mock example of a sketch.
Exercise 3. Explain why the 1 norm is a sketch in the cash register and
strict turnstile models.
Sketches 139
ft f˜t ft + ε · f 1 .
Since we are in the strict turnstile model, all the frequencies are non-
negative, and thus, the last equality implies f˜t ft . It remains to be shown
that with probability at least 1/2, we also have
ft ε · f 1. (7.1)
t ∈[m]\{t}
h(t)=h(t )
For every token t ∈ [m]\{t}, let us define an indicator Xt for the event that
h(t) = h(t ). Since h is selected from a pairwise independent hash functions
family, the probability that Xt takes the value 1 must be k −1 . Using this
notation, we can now rewrite the left-hand side of Inequality (7.1) as
Xt · ft ,
t ∈[m]\{t}
and its expectation can be upper bounded, using the linearity of expecta-
tion, by
ε · f 1
E[Xt ] · ft = k −1 · ft k −1 · f 1 .
2
t ∈[m]\{t} t ∈[m]\{t}
The lemma now follows since the last inequality and Markov’s inequality
together imply that Inequality (7.1) holds with probability at least 1/2.
According to Lemma 1, Algorithm 3 produces a good estimation
for ft with probability at least 1/2. Naturally, we want to increase this
probability, and we will later present algorithms that do that by employing
multiple independent copies of Algorithm 3. However, before getting to
these algorithms, we would like to end our study of Algorithm 3 with
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch07 page 141
Sketches 141
Observation 3, which upper bounds the size of the sketch created by this
algorithm. We note that this observation works only when ε is not too small
(namely, it must be at least (mn)−1 ). For smaller values of ε, the size of the
sketch is larger than the size required for storing the entire data stream,
and thus, using such small values for ε does not make sense.
Observation 3. The size of the sketch created by Algorithm 3 is O(ε−1 ·
log n + log m) whenever ε (mn)−1 .
Proof. Each cell of C contains the total frequency of the tokens mapped
by h to this cell. As this total frequency must be upper bounded by n, we
get that each cell of C requires only O(log n) bits. Hence, the entire array
C requires a space complexity of
It remains to upper bound the space complexity required for the hash
function h. Let m be the smallest power of two which is at least m. As
we saw in Chapter 5, there exists a pairwise independent hash functions
family H of hash functions from [m ] to [k] whose individual functions
can be represented using O(log m + log k) = O(log m + log n) space
(where the equality holds since k < 4/ε 4mn and m < 2m). We
now note that by restricting the domain of the functions of H to [m],
we get a pairwise independent hash functions family H of hash functions
from [m] to [k]. Moreover, the space complexity required for representing
individual functions of H is not larger than the space complexity required
for representing individual functions of H . Hence, for an appropriate choice
of H, it is possible to represent h using O(log m + log n) bits.
We summarize the properties of Algorithm 3 that we have proved in
Theorem 1.
Theorem 1. In the strict turnstile model, Algorithm 3 produces a sketch
such that:
• for a fixed choice of the hash function h, sketches produced for different
streams can be combined,
• the size of the sketch is O(ε−1 · log n + log m) whenever ε (mn)−1 ,
• given the sketch and an arbitrary token t, generating an estimate f˜t for
the frequency ft of t in the way given by Algorithm 3 guarantees that,
with probability at least 1/2, ft f˜t f + ε · f 1 . Moreover, the first
inequality always holds.
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch07 page 142
Theorem 2. In the strict turnstile model, the Count-Min sketch has the
following properties:
Sketches 143
Figure 7.1. A graphical illustration of the Count-Min sketch. The sketch includes
an array C which contains r rows, each of size k. Each row i is associated with an
independently chosen hash function hi .
• given the sketch and an arbitrary token t, generating an estimate f˜t for
the frequency ft of t in the way given by Algorithm 4 guarantees that,
with probability at least 1 − δ, ft f˜t f + ε · f 1.
Proof. The Count-Min sketch consists of two parts. The first part is an
array C which has r rows, each of size k. The second part is a list of r hash
functions, each of which is associated with one row of C (see Figure 7.1).
The important observation to make is that if we restrict our attention to a
single row of C and to its corresponding hash function, then Algorithm 4
updates them in the same way that Algorithm 3 updates its sketch. Thus,
one can view each line of the array C, together with its corresponding hash
function, as an independent sketch of the kind produced by Algorithm 3.
This already proves the first part of the theorem since we know, by
Observation 2, that sketches of the kind produced by Algorithm 3 can be
combined when they are based on the same hash function. Moreover, the
second part of the theorem follows since each sketch of the kind produced
by Algorithm 3 takes O(ε−1 · log n + log m) space by Observation 3 (when
ε (mn)−1 ), and thus, the Count-Min sketch (which consists of r such
sketches) has a space complexity of
The last part of the theorem remains to be proved. Since we treat each row
of C as an independent copy of the sketch produced by Algorithm 3, we
get by Lemma 1 that, for every 1 i r, it holds that ft C[i, hi (t)]
ft + ε · f 1 with probability at least 1/2. Moreover, the first inequality
always holds. Recall now that the estimate f˜t produced by Algorithm 4
for the frequency of t is the minimum min C[i, hi (t)], and thus, it must
1ir
be at least ft because it is the minimum over values that are all at least
ft . Additionally, with probability at least 1 − 2−r , we have C[i, hi (t)]
ft + ε · f 1 for at least some 1 i r, which implies
−1
The theorem now follows since 1 − 2−r 1 − 2− log2 δ = 1 − δ.
Sketches 145
Prove that, in the turnstile model, the Count-Median sketch has the following
properties:
• for a fixed choice of the hash functions h1 , h2 , . . . , hr , sketches produced
for different streams can be combined,
• the size of the sketch is O(log δ −1 · (ε−1 · log n + log m)) whenever ε
(mn)−1 ,
• given the sketch and an arbitrary token t, generating an estimate f˜t for
the frequency ft of t in the way specified by Algorithm 5 guarantees that
Pr[|ft − f˜t | > ε · f 1 ] 1 − δ.
the frequencies of the tokens mapped to a single cell of the sketch tend to
cancel each other rather than build up, yielding a different error term which
is often better.
The Count sketch is constructed by Algorithm 6. One can observe
that its general structure is very similar to the structure of the Count-
Min and Count-Median sketches. Specifically, the Count sketch contains a
two-dimensional array C and associates two independent hash functions hi
and gi with every row i of C.3 The function hi specifies which tokens are
mapped to each cell of row i. In contrast, the function gi (which does not
appear in the previous sketches) specifies whether the frequency of the token
is added or removed from the cell. Note that the randomness of gi means
that some frequencies are added, while others are removed, which gives
the frequencies a fair chance to cancel each other out instead of building
up. Finally, we would like to note that the Count sketch uses the same two
accuracy parameters ε, δ ∈ (0, 1] used by the Count-Min and Count-Median
sketches.
3 Likein the cases of Count-Min and Count-Median, one can treat each row of C and its
two associated hash functions as an independent sketch, and then view the Count sketch
as a collection of such sketches.
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch07 page 147
Sketches 147
Exercise 6. Prove that, in the turnstile model, the Count sketch has the
following properties:
• for a fixed choice of the hash functions h1 , h2 , . . . , hr and g1 , g2 , . . . , gr ,
sketches produced for different streams can be combined,
• the size of the sketch is O(log δ −1 · (ε−2 · log n + log m)) whenever ε
(mn)−1 .
event that t and t are mapped to the same cell of row i, i.e., hi (t) = hi (t ).
t ∈[m]\{t}
gi (t) · C[i, hi (t)] = ft + gi (t) · gi (t ) · Xt,t
i
· ft .
t ∈[m]\{t}
Lemmata 2 and 3 use the last equality to prove properties of gi (t)·C[i, hi (t)].
Proof. Intuitively, the lemma holds since each token t = t has an equal
probability to contribute either ft or –ft to gi (t) · C[i, hi (t)], and thus,
does not affect the expectation of this expression. A more formal argument
is based on the linearity of the expectation. The pairwise independence
of gi implies that gi (t) and gi (t ) are independent whenever t = t .
i
Moreover, both these values are independent from the variable Xt,t since
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch07 page 148
the last variable depends only on the function hi . Thus, by the linearity of
expectation,
⎡ ⎤
E[gi (t) · C[i, hi (t)]] = E ⎣ft + gi (t) · gi (t ) · Xt,t
i
· ft
⎦
t ∈[m]\{t}
= ft + E[gi (t)] · E[gi (t )] · E[Xt,t
i
] · ft .
t ∈[m]\{t}
We now observe that E[gi (t )] = 0 for every token t . Plugging this
observation into the previous equality proves the lemma.
Lemma 3. For every 1 i r, the variance of gi (t) · C[i, hi (t)] is at
most k −1 · t ∈[m] ft2 .
Proof. Since adding a constant does not change the variance, we get
⎡ ⎤
Var[gi · C[i, hi (t)]] = Var ⎣ft + gi (t) · gi (t ) · Xt,t
i
· ft
⎦
t ∈[m]\{t}
⎡ ⎤
= Var ⎣ gi (t) · gi (t ) · Xt,t
i
· ft
⎦.
t ∈[m]\{t}
Thus, we only need to upper bound the variance on the rightmost side of the
last equality. We do that by calculating the expectation of the expression
inside this variance and the expectation of the square of this expression. As
a first step, we note that, by Lemma 2,
⎡ ⎤
E⎣ gi (t) · gi (t ) · Xt,t
i
· ft
⎦ = E[gi · C[i, hi (t)]] − ft = 0.
t ∈[m]\{t}
⎡⎛ ⎞2 ⎤
⎢⎝ ⎠ ⎥
E⎣ gi (t) · gi (t ) · Xt,t
i
· ft ⎦
t ∈[m]\{t}
⎡ ⎤
⎢ ⎥
⎢ 2 ⎥
=E⎢ i
Xt,t · ft + gi (t ) · gi (t ) · Xt,t
i
· Xt,t · ft · ft⎥
i
⎣ ⎦
t ∈[m]\{t} t ,t ∈[m]\{t}
t =t
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch07 page 149
Sketches 149
2
] · ft
i
= E[Xt,t
t ∈[m]\{t}
+ E[gi (t )] · E[gi (t )] · E[Xt,t
i
· Xt,t ] · ft · ft ,
i
t ,t ∈[m]\{t}
t =t
where the second equality holds for two reasons. First, like in the proof
of Lemma 2, the pairwise independence of the hash functions family from
which the function gi is drawn implies that gi (t ) and gi (t ) are independent
whenever t = t , and second, both these values are independent from the
i i
variables Xt,t and Xt,t whose values are determined by the function hi
alone. We now recall that E[gi (t )] = 0 for every token t and observe
i −1
that E[Xt,t ] = k due to the pairwise independence of hi . Plugging these
observations into the last equality gives
⎡⎛ ⎞2 ⎤
⎢ ⎥
E ⎣⎝ gi (t) · gi (t ) · Xt,t
i
· ft ⎠ ⎦ = k −1 · ·ft2 .
t ∈[m]\{t} t ∈[m]\{t}
⎛ ⎡ ⎤⎞2
− ⎝E ⎣ gi (t) · gi (t ) · Xt,t
i
· ft
⎦⎠
t ∈[m]\{t}
= k −1 · ·ft2 k −1 · ·ft2 .
t ∈[m]\{t} t ∈[m]
Corollary 1. For every 1 i r, Pr |gi (t) · C[i, hi (t)] − ft |
2 1
ε· t ∈[m] ft 3 .
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch07 page 150
Var[gi (t) · C[i, hi (t)]] k −1 · t ∈[m] ·ft2 1
2 2 2 ≤ .
2 ε · t ∈[m] ·ft 3
ε· t ∈[m] ft
The expression 2
t ∈[m] ft which appears in the last corollary is known
2
as the norm of the frequency vector f , and is usually denoted by f 2 .
We will discuss the 2 norm in more detail later in this section, but for now
we just note that Corollary 1 can be rephrased using this norm as follows:
Sketches 151
Thus, in the rest of the proof we concentrate on proving this inequality. Note
r
that the probability of i=1 Xi r/2 increases when the probabilities of
the individual indicators X1 , X2 , . . . , Xr taking the value 1 decrease. Thus,
for the purpose of proving (7.2), we can assume that each one of these
indicators takes the value 1 with probability exactly 2/3 (rather than at
least 2/3). By the Chernoff bound, we now get
r
r
Pr Xi
i=1
2
r r
3 2 Pr
= Pr Xi E Xi e−(1/4) ·E[ i=1 Xi ]/2
i=1
4 i=1
−1 −1
= e(2r/3)/32 = e−r/48 = e−48 log2 δ /48
e− log2 δ = δ.
Table 7.1. A comparison of the main sketches presented in this chapter for estimating
token frequencies. For every sketch, the table summarizes the model in which it works,
its space complexity and its guarantee for the difference between the estimated frequency
f˜t and the true frequency ft .
Sketch
Name Model Space Complexity With Probability 1 − δ
Count- Strict O(log δ−1 · (ε−1 · log n + log m)) f˜t − ft ∈ [0, ε · f 1 ]
Min Turnstile
Count- Turnstile O(log δ−1 · (ε−1 · log n + log m)) f˜t − ft ∈ [−ε · f 1 , ε · f 1 ]
Median
Count Turnstile O(log δ−1 · (ε−2 · log n + log m)) f˜t − ft ∈ [−ε · f 2 , ε · f 2 ]
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch07 page 152
Count sketch has a guarantee depending on the 2 norm. Hence, the use of
the Count sketch is preferred, despite its larger space complexity, when the
2 norm is much smaller than the 1 norm.
It is known that for every vector v ∈ Rm the following inequalities hold
(and moreover, there are vectors which make each one of these inequalities
tight)
v1
√ v2 v1 .
m
Hence, the ℓ2 norm is never larger than the ℓ1 norm, and it can sometimes be
much smaller than it. Exercise 7 gives some intuitive idea about the factors
affecting the ratio between the two norms. Specifically, it shows that for
"spread out" vectors the ℓ2 norm is much smaller than the ℓ1 norm, while
for "concentrated" vectors the two norms take similar values.
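For intuition, the following small Python snippet (an illustration only, not part of the book's algorithms) computes the two norms for the two kinds of vectors considered in Exercise 7 below.

import math

def l1_norm(v):
    # The l1 norm is the sum of the absolute values of the coordinates.
    return sum(abs(x) for x in v)

def l2_norm(v):
    # The l2 norm is the square root of the sum of the squared coordinates.
    return math.sqrt(sum(x * x for x in v))

m = 10000
spread_out = [1.0] * m                        # all coordinates equal (Exercise 7(a))
concentrated = [0.0] * (m - 1) + [float(m)]   # a single non-zero coordinate (Exercise 7(b))

# For the spread out vector the ratio is sqrt(m); for the concentrated one it is 1.
print(l1_norm(spread_out) / l2_norm(spread_out))      # 100.0 = sqrt(10000)
print(l1_norm(concentrated) / l2_norm(concentrated))  # 1.0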
Exercise 7. In this exercise, we consider two kinds of vectors and study the
ratio between the ℓ1 and ℓ2 norms of both kinds of vectors.
(a) In vectors of the first kind we consider, all the coordinates are equal. In
other words, in a vector v_a ∈ R^m of this kind, all m coordinates are equal
to some real value a. Prove that for such a vector ‖v_a‖_1 = √m · ‖v_a‖_2.
(b) In vectors of the second kind we consider, only one coordinate has a
non-zero value. Specifically, in a vector u_a ∈ R^m of this kind, all the
coordinates are zero except for one coordinate which takes the value a.
Prove that for such a vector ‖u_a‖_1 = ‖u_a‖_2.
Exercise 8. Verify that you understand why that is the case for the sketches
presented in this chapter.
4 Recall that COMB is the algorithm that, given sketches of two streams σ and σ′, produces
the sketch for the concatenated stream σ · σ′. Such an algorithm must exist for every
type of sketch by definition.
Exercise Solutions
Solution 1
One can observe that Algorithm 1 calculates the sum

\sum_{t \in M} f_t,

where M is the set of possible tokens. This sum is equal to ‖f‖_1 when the
frequencies of the elements are all non-negative. Since this is the case in both
the cash register and the strict turnstile models, we get that Algorithm 1
calculates ‖f‖_1 in these models.5
It remains to be shown that Algorithm 1 might fail to calculate ‖f‖_1
in the general turnstile model. According to the above observation, such
a failure requires an element with a negative frequency. Consider a data
stream consisting of two update events corresponding to the ordered pairs
("a", 1) and ("b", −1) (i.e., in the first update event, one copy of the token
"a" arrives, and in the second update event, one copy of the token "b"
leaves). The output of Algorithm 1 given this data stream is

\sum_{t \in M} f_t = f_a + f_b = 1 + (-1) = 0.
Solution 2
The variant of Algorithm 2 we suggest for the cash register model is given
as Algorithm 7. One can observe that Algorithm 7 updates its internal
state following the arrival of an update event (t, c) in the same way that
Algorithm 2 updates its internal state following the arrival of c copies
of token t in a row. Hence, the analysis of Algorithm 2 from Chapter 1
carries over to Algorithm 7, and shows that the set produced by this
algorithm contains exactly the tokens whose frequency is larger than n/k.
Furthermore, in the cash register model, n is equal to the final value of
‖f‖_1 because ‖f‖_1 can only increase over time in this model. Thus, we get
that the set produced by Algorithm 7 contains exactly the tokens whose
frequency is larger than k^{-1} · ‖f‖_1, as required.
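The following Python sketch illustrates one way Algorithm 7 could process update events (t, c), based only on the description above (counters stored explicitly for at most k tokens, and a decrement d equal to the smallest stored counter); it is not the book's exact pseudocode, and the returned set may contain some extra tokens beyond the ones whose frequency exceeds k^{-1} · ‖f‖_1.

def frequent_tokens(stream, k):
    # Sketch of Algorithm 7 in the cash register model: stream is a sequence
    # of update events (t, c) with c > 0. Only non-zero counters are stored.
    counters = {}
    for t, c in stream:
        counters[t] = counters.get(t, 0) + c
        if len(counters) > k:
            # d is the smallest counter; subtracting it from every counter
            # removes at least one token, so at most k counters remain.
            d = min(counters.values())
            counters = {tok: v - d for tok, v in counters.items() if v - d > 0}
    # Every token whose frequency is larger than ||f||_1 / k is guaranteed to
    # appear here (this simplified sketch may also keep a few other tokens).
    return set(counters)

# Example: "a" has frequency 7 out of ||f||_1 = 12 > 12/2, so "a" must be returned.
print(frequent_tokens([("a", 3), ("b", 2), ("a", 4), ("c", 3)], k=2))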
The space complexity of Algorithm 7 remains to be analyzed. Like in
the case of Algorithm 2, we assume that the implementation of Algorithm 7
stores explicitly only counters with a non-zero value. Moreover, one can
recall from the analysis of Algorithm 2 in Chapter 1 that the size of the set
F maintained by Algorithms 2 and 7 never exceeds k. Hence, Algorithm 7
5 We note that the cash register is in fact a special case of the strict turnstile model, and
thus, it would have been enough to just prove that Algorithm 1 calculates f 1 in the
strict turnstile model.
maintains at most k non-zero counters at every given time. The value of each
such counter is upper bounded by the frequency of the token corresponding
to this counter, and thus, also by n. Therefore, we get that all the counters
used by Algorithm 7 can be represented together using O(k log n) bits.
Moreover, the variable d, which takes the value of one of these counters,
can also be represented using this amount of bits. Next, let us consider
the variable t and the set F . The variable t represents one token, and the
set F contains up to k tokens. As there are at most m possible tokens,
each token can be represented using O(log m) bits, and thus, t and F can
both be represented using at most O(k log m) bits. Finally, we consider the
variables c and n. The variable n is increasing over time, and its final value
is the value of the parameter n. Furthermore, the variable c contains the
increase in the frequency of one token in one update event, and thus, its
value cannot exceed n since n is the sum of the frequencies of all the tokens.
Hence, we get that both variables c and n are always upper bounded by
the parameter n, and thus, can be represented using O(log n) bits.
Combining all the above observations, we get that the space complexity
of Algorithm 7 is O(k(log m + log n)), and thus, Algorithm 7 is a streaming
algorithm when k is considered to be a constant.
Solution 3
We first observe that the ℓ1 norm can be calculated using the data stream
algorithm given by Algorithm 1 in the cash register and strict turnstile
models. Thus, it remains to be seen that, in these models, given only the
two ℓ1 norms n_1 and n_2 of two data streams σ1 and σ2, it is possible to
calculate the ℓ1 norm of σ1 · σ2. For that purpose, the frequency vectors
corresponding to these two data streams are denoted by f¹ and f². Then,
assuming that M is the set of all possible tokens, we get that the ℓ1 norm
of σ1 · σ2 is given by

\sum_{t \in M} |f_t^1 + f_t^2| = \sum_{t \in M} |f_t^1| + \sum_{t \in M} |f_t^2| = n_1 + n_2,

where the first equality holds since frequencies are always non-negative in
the cash register and strict turnstile models.
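Since this sketch is just a single number, its COMB algorithm is particularly simple. The following tiny Python fragment (an illustration only) makes this explicit.

def l1_sketch(stream):
    # In the cash register / strict turnstile models the l1 norm is simply the
    # sum of all frequency updates, as computed by Algorithm 1 of the chapter.
    return sum(c for _, c in stream)

def comb(n1, n2):
    # Combining the sketches of two streams only requires adding the two norms.
    return n1 + n2

s1 = [("a", 2), ("b", 1)]
s2 = [("a", 1), ("c", 3)]
assert comb(l1_sketch(s1), l1_sketch(s2)) == l1_sketch(s1 + s2) == 7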
Solution 4
Recall that in the sketch generated by Algorithm 3 the cell C[h(t)] contains
the total frequency of all the tokens mapped to this cell by the hash function
h, and thus,

\tilde{f}_t = C[h(t)] = \sum_{\substack{t' \in [m] \\ h(t) = h(t')}} f_{t'} = f_t + \sum_{\substack{t' \in [m]\setminus\{t\} \\ h(t) = h(t')}} f_{t'}.   (7.3)

Like in the proof of Lemma 1, we now define for every token t' ∈ [m]\{t}
an indicator X_{t'} for the event that h(t) = h(t'). Since h is selected from a
pairwise independent hash functions family, the probability that X_{t'} takes
the value 1 must be k^{-1}. Using this notation, we can now rewrite Equality
(7.3) as

\tilde{f}_t - f_t = \sum_{t' \in [m]\setminus\{t\}} X_{t'} \cdot f_{t'}.

The sum on the right-hand side of this equality contains both tokens with
positive frequencies and tokens with non-positive frequencies. It is useful to
break it into two sums, one for each of these kinds of tokens. Hence, let us
define M⁺ as the set of tokens from [m]\{t} having a positive frequency,
and let M⁻ be the set of the other tokens from [m]\{t}. Using these sets,
we get

|\tilde{f}_t - f_t| = \Bigg|\sum_{t' \in M^+} X_{t'} \cdot f_{t'} + \sum_{t' \in M^-} X_{t'} \cdot f_{t'}\Bigg| \leq \max\Bigg\{\sum_{t' \in M^+} X_{t'} \cdot f_{t'},\; -\sum_{t' \in M^-} X_{t'} \cdot f_{t'}\Bigg\},

where the inequality holds since the sum \sum_{t' \in M^+} X_{t'} \cdot f_{t'} is always
non-negative, and the sum \sum_{t' \in M^-} X_{t'} \cdot f_{t'} is always non-positive. Our
objective is to bound both sums using Markov's inequality, and toward this
goal we need to bound their expectations as follows:

E\Bigg[\sum_{t' \in M^+} X_{t'} \cdot f_{t'}\Bigg] = \sum_{t' \in M^+} E[X_{t'}] \cdot f_{t'} = k^{-1} \cdot \sum_{t' \in M^+} |f_{t'}| \leq k^{-1} \cdot \|f\|_1 \leq \frac{\varepsilon \cdot \|f\|_1}{6}

and

E\Bigg[-\sum_{t' \in M^-} X_{t'} \cdot f_{t'}\Bigg] = -\sum_{t' \in M^-} E[X_{t'}] \cdot f_{t'} = k^{-1} \cdot \sum_{t' \in M^-} |f_{t'}| \leq k^{-1} \cdot \|f\|_1 \leq \frac{\varepsilon \cdot \|f\|_1}{6}.
Thus, by the union bound, the probability that the value of either
\sum_{t' \in M^+} X_{t'} \cdot f_{t'} or -\sum_{t' \in M^-} X_{t'} \cdot f_{t'} is at least ε · ‖f‖_1 is upper bounded
by 2 · (1/6) = 1/3. Hence,

Pr[|\tilde{f}_t - f_t| \geq \varepsilon \cdot \|f\|_1] \leq Pr\Bigg[\max\Bigg\{\sum_{t' \in M^+} X_{t'} \cdot f_{t'},\; -\sum_{t' \in M^-} X_{t'} \cdot f_{t'}\Bigg\} \geq \varepsilon \cdot \|f\|_1\Bigg] \leq \frac{1}{3}.
Solution 5
Our first objective is to prove that Count-Median sketches produced for
different streams can be combined when the hash functions h1 , h2 , . . . , hr
are fixed. Consider two streams σ1 and σ2 , and, given some fixed choice
for the above hash functions, let C 1 , C 2 and C 12 denote the content of the
array C in the Count-Median sketches corresponding to the streams σ1 ,
σ2 and σ1 · σ2 , respectively. We need to show that C 12 can be calculated
based on C 1 and C 2 alone. For that purpose, we observe that each cell of
C contains the total frequency of all the tokens mapped to this cell by the
hash function hi , where i is the row of the cell. Hence, each cell of C 12 is
equal to the sum of the corresponding cells of C 1 and C 2 , and thus, can be
calculated using the values of these cells alone.
Our next objective is to bound the size of the Count-Median sketch,
assuming ε (mn)−1 . The Count-Median sketch consists of the array
C and r hash functions. By the same argument used in the proof of
Observation 3, we get that each one of the hash functions from the sketch
can be represented using O(log m + log k) bits. Hence, the total space
complexity required for all the hash functions of the sketch is
r \cdot O(\log m + \log k) = O\left(\log \delta^{-1} \cdot \left(\log m + \log \frac{12}{\varepsilon}\right)\right).
Let us now consider the space complexity required for the array C, which
contains rk cells. Each cell of C contains the sum of the frequencies of
the tokens mapped to this cell by the hash function corresponding to its
row, and thus, the absolute value of the cell’s value is upper bounded
by n. Consequently, the entire array C can be represented using a space
complexity of rk · O(log n) = O(log δ^{-1} · ε^{-1} · log n).
Combining the two bounds we proved above on the space required for the
hash functions and the array C, we get that the total space complexity
necessary for the Count-Median sketch is O(log δ^{-1} · (ε^{-1} · log n + log m)).
Consider now the estimation guarantee of the sketch. An argument analogous
to the one used in the solution of Exercise 4 shows that for every 1 ≤ i ≤ r,

Pr[|f_t - C[i, h_i(t)]| > \varepsilon \cdot \|f\|_1] \leq \frac{1}{3}.   (7.4)
Let us denote now by X_i an indicator for the event that |f_t − C[i, h_i(t)]| ≤
ε · ‖f‖_1, and let X = \sum_{i=1}^r X_i. Intuitively, each X_i is an indicator
for the event that the cell C[i, h_i(t)] contains a good estimation for
f_t, and X is the number of such cells containing a good estimation
for f_t. Recall now that Algorithm 5 outputs the median of the cells
C[1, h_1(t)], C[2, h_2(t)], ..., C[r, h_r(t)], and thus, it produces a good estima-
tion for f_t whenever more than a half of these cells contain a good estimation
for f_t. Thus, to prove the inequality that we want to prove, it is enough to
prove

Pr\Big[X > \frac{r}{2}\Big] \geq 1 - \delta,
which is equivalent to

Pr\Big[X \leq \frac{r}{2}\Big] \leq \delta.   (7.5)
Solution 6
Our first objective is to prove that sketches produced for different streams
can be combined when the hash functions h1 , h2 , . . . , hr and g1 , g2 , . . . , gr
are all fixed. Consider two streams σ1 and σ2 , and, given some fixed choice
for the above hash functions, let C 1 , C 2 and C 12 denote the content of
the array C in the Count sketches corresponding to the streams σ1 , σ2 and
σ1 · σ2 , respectively. We need to show that C 12 can be calculated based
on C¹ and C² alone. For that purpose, we observe that the value of cell
number j in row i of C is given by the expression

\sum_{\substack{t \in [m] \\ h_i(t) = j}} g_i(t) \cdot f_t.   (7.6)
Note that this is a linear expression of the frequencies, which implies that
each cell of C 12 is equal to the sum of the corresponding cells of C 1 and
C 2 , and thus, can be calculated using the values of these cells alone.
Our next objective is to bound the size of the Count sketch, assuming
ε (mn)−1 . The Count sketch consists of the array C, r hash functions
from [m] to [k] and r hash functions from [m] to {−1, 1}. By the same
argument used in the proof of Observation 3, we get that each one of the
hash functions from [m] to [k] can be represented using O(log m + log k)
bits. Hence, the space complexity required for all these hash functions is

r \cdot O(\log m + \log k) = O\left(\log \delta^{-1} \cdot \left(\log m + \log \frac{6}{\varepsilon^2}\right)\right).
Consider next the hash functions from [m] to {−1, 1}. One can observe that
any hash function from [m] to [2] can be converted into a hash function
from [m] to {−1, 1} by simply renaming the elements of the function’s
image. Thus, each hash function from [m] to {−1, 1} requires the same space
complexity as a hash function from [m] to [2], and by the argument from
the proof of Observation 3, such a hash function requires O(log m+ log 2) =
O(log m) bits. Since there are r hash functions from [m] to {−1, 1} in the
Count sketch, their total space requirement is r · O(log m) = O(log δ^{-1} · log m).
Let us now consider the space complexity required for the array C, which
contains rk cells. We observe that the absolute value of each cell of C is
upper bounded by n. The reason for that is that the content of cell j in row
i of the array C is given by the sum (7.6), whose absolute value is upper
bounded by n since the frequency of each token appears in it only once
(either as a positive or as a negative term). Consequently, the entire array
C can be represented using a space complexity of rk · O(log n) =
O(log δ^{-1} · ε^{-2} · log n).
Combining the above bounds we got on the space required for the hash
functions and the array C, we get that the total space complexity necessary
for the Count sketch is O(log δ^{-1} · (ε^{-2} · log n + log m)).
Solution 7
(a) Since all m coordinates of v_a are equal to a, we get

\|v_a\|_1 = m \cdot |a| = \sqrt{m} \cdot \sqrt{m \cdot |a|^2} = \sqrt{m} \cdot \sqrt{m \cdot a^2} = \sqrt{m} \cdot \|v_a\|_2.
(b) Since u_a has m − 1 coordinates which take the value 0 and one
coordinate taking the value a, we get

\|u_a\|_1 = |a| = \sqrt{a^2} = \|u_a\|_2.
Solution 8
We explain why the sketch produced by Algorithm 3 has the property that
DS (comp(σ)) can be obtained from DS (σ) by simply complementing all
the numbers in the sketch. The explanation for the other sketches given in
this chapter is similar. Recall that each cell of the array C in Algorithm 3
contains the sum of the frequencies of the tokens mapped to this cell by
the hash function h. Hence, in the sketch DS (σ) the value of the ith cell of
C is given by

C[i] = \sum_{\substack{t \in [m] \\ h(t) = i}} (f_\sigma)_t = -\sum_{\substack{t \in [m] \\ h(t) = i}} (f_{comp(\sigma)})_t,

where the second equality holds since f_σ + f_{comp(σ)} is the vector of all zeros.
Consequently, the value of the ith cell of C in DS(comp(σ)) is exactly −C[i],
i.e., DS(comp(σ)) is obtained from DS(σ) by complementing all the numbers
in the sketch.
Solution 9
The discussion before the exercise shows that given Count-Median sketches
for two streams σ1 and σ2 , one can get information about the difference
between the two streams using a two-step process. In the first step, a
Count-Median sketch for σ1 · comp(σ2 ) is produced from the sketches for
σ1 and σ2 , and in the second step, the sketch for σ1 · comp(σ2 ) is used to
derive an estimate for the frequency of a token t in σ1 · comp(σ2 ).
In this exercise, we are asked whether the same process can also be
applied with the Count-Min and Count sketches. To answer this question,
we first need to determine the properties of the Count-Median sketch that
allow the above two steps to work. The first step relies on the fact that the
Count-Median sketch is a linear sketch, because this linearity is what enables
us to derive a sketch for the stream comp(σ2) from the sketch
July 1, 2020 17:13 Algorithms for Big Data - 9in x 6in b3853-ch07 page 164
of the stream σ2 . Fortunately, both the Count-Min and Count sketches are
linear, so this step can also be performed with these two sketches.
The second step relies on the fact that we can get an estimate for
the frequency of tokens in the stream σ1 · comp(σ2 ) based on the Count-
Median sketch of this stream. This is true for the Count-Median sketch
because it applies to the turnstile model, and thus, can handle any stream.
However, this is not generally true for a sketch that applies only to the strict
turnstile model because σ1 · comp(σ2 ) may contain tokens with a negative
frequency even when the original streams σ1 and σ2 do not contain such a
token. Hence, the second step can be performed with the Count sketch that
applies to the turnstile model, but cannot be performed with the Count-Min
sketch, which applies only to the strict turnstile model.
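As a small illustration of the linearity argument above, the following hedged Python fragment shows how the array of a linear sketch built for σ2 can be negated and combined with the array built for σ1 (both built with the same hash functions) to obtain a sketch of σ1 · comp(σ2). The function names are mine.

def negate(C):
    # By linearity, negating every cell of the array C of a Count-Median (or
    # Count) sketch of sigma yields a sketch of comp(sigma).
    return [[-x for x in row] for row in C]

def combine(C1, C2):
    # Cell-wise sum gives the sketch of the concatenated stream.
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(C1, C2)]

# Given arrays C_sigma1 and C_sigma2 of two sketches built with the same hash
# functions, a sketch of sigma1 . comp(sigma2) is
#     combine(C_sigma1, negate(C_sigma2)),
# and querying it estimates f_t(sigma1) - f_t(sigma2) for any token t.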
Chapter 8
graph data stream algorithms corresponds to the plain vanilla data stream
model because we assume a single edge arrives at every time point, and
additionally, edges are never removed. More general kinds of graph data
stream algorithms corresponding to the other data stream models presented
in Chapter 7 have also been studied, however, they are beyond the scope of
this book.
As is customary with graphs, we use G, V and E to denote the graph
corresponding to the input stream, its set of vertices and its set of edges,
respectively. Additionally, we use n and m to denote the number of vertices
and edges in the graph, respectively. Note that this notation is standard
for graphs, but it is not consistent with the notation we have used so far in
this book. For example, the length of the stream is equal to the number of
edges, and is thus, given now by m; unlike in the previous chapters where
we used n to denote this length (at least in the plain vanilla model). We
adopt the convention that the meaning of n and m should be understood
from the context. Specifically, in the context of data stream algorithms
for graph problems, we assume that n and m represent the sizes of V
and E, respectively, and in the context of non-graph-related data stream
algorithms, we use n and m in the same way we did in the previous chapters.
We are now ready to present our first example of a graph data stream
algorithm, which is an algorithm for determining whether a given graph is
bipartite. This algorithm is given as Algorithm 1.
Algorithm 1 grows a forest F using the following process. Originally, F
is empty. Then, every time that the algorithm gets an edge of the original
graph, it adds this edge to F unless this would create a cycle in F . Observe
that this construction process guarantees that F contains only edges of the
original graph. While growing the forest F , the algorithm tests the original
graph for bipartiteness by considering all the cycles that can be created
by adding a single edge of the graph to F . If all these cycles are of even
length, then the algorithm declares the graph to be bipartite. Otherwise, it
declares that the graph is not bipartite.
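As an illustration of this forest-based test (not the book's pseudocode for Algorithm 1), the following Python sketch keeps, for every vertex, its tree and the parity of its path to the root of that tree; this is exactly the information needed to decide whether an arriving edge closes an odd cycle with the forest F.

class ParityUnionFind:
    # A compact representation of the forest F: parent pointers plus the
    # parity of each vertex's path to the root of its tree.
    def __init__(self, vertices):
        self.parent = {v: v for v in vertices}
        self.parity = {v: 0 for v in vertices}

    def find(self, v):
        p = 0
        while self.parent[v] != v:
            p ^= self.parity[v]
            v = self.parent[v]
        return v, p

    def add_edge(self, u, v):
        # Process one streamed edge; return False iff an odd cycle was found.
        ru, pu = self.find(u)
        rv, pv = self.find(v)
        if ru == rv:
            # (u, v) closes a cycle with F; the cycle is odd iff the path
            # between u and v in F has even length, i.e., equal parities.
            return pu != pv
        # Otherwise the edge joins two trees of the forest.
        self.parent[ru] = rv
        self.parity[ru] = pu ^ pv ^ 1
        return True

def is_bipartite(vertices, edge_stream):
    uf = ParityUnionFind(vertices)
    return all(uf.add_edge(u, v) for u, v in edge_stream)

# A triangle (odd cycle) is detected, while a 4-cycle is not:
print(is_bipartite("abc", [("a", "b"), ("b", "c"), ("c", "a")]))                 # False
print(is_bipartite("abcd", [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]))    # True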
Figure 8.2. An illustration of the analysis of Algorithm 1. The black edges are the
edges of F . The gray edges represent edges of G that have already arrived, but were not
added to F . Finally, the dashed edge (u, v) is a new edge that has arrived now. Note
that (u, v) closes an even cycle with F , and that in any presentation of F as a bipartite
graph (i.e., a partitioning of its nodes into two sides with no edges within a side) u and
v must appear in different sides of the graph since there is an odd length path between
them.
whose weight (i.e., the total weight of the edges in it) is as large as
possible.
One kind of data stream algorithm that has been developed for
the maximum weight matching problem estimates the weight of the maxi-
mum weight matching, without explicitly producing any matching. Such
algorithms are estimation algorithms because, like most of the algorithms
we have seen so far in this book, they simply estimate some numerical
value related to the stream; in this case, the weight of its maximum
weight matching. We note that many estimation algorithms exist for the
maximum weight matching problem, and in some cases they manage to
achieve non-trivial results even using a space complexity which is low
enough to make them streaming algorithms. Nevertheless, in this chapter
we are interested in algorithms that do more than estimating the weight
of the maximum weight matching. Specifically, we want algorithms that
output a matching whose weight is close to the weight of the maximum
weight matching. Such algorithms are called approximation algorithms.
More generally, an approximation algorithm is an algorithm which given
an optimization problem (such as maximum weight matching) finds a
solution which approximately optimizes the objective function of the
problem.
As a simple first example for an approximation algorithm, let us
consider Algorithm 2. Algorithm 2 is an algorithm for the special case of
maximum weight matching in which the weight of all the edges is equal
to 1. In other words, in this special case the objective is to find the largest
matching. We note that this special case of maximum weight matching is
often called the maximum cardinality matching problem.
Algorithm 2 grows its solution matching M using the following process.
Originally, M is empty. Then, whenever the algorithm gets a new edge, it
adds the edge to M unless this will make M an illegal matching. Clearly, this
guarantees that M remains a legal matching throughout the execution of the
algorithm.
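A minimal Python sketch of this greedy rule (an illustration, not the book's exact pseudocode for Algorithm 2) is given below.

def greedy_matching(edge_stream):
    # Keep an arriving edge iff neither of its endpoints is already matched.
    matched = set()
    M = []
    for u, v in edge_stream:
        if u not in matched and v not in matched:
            M.append((u, v))
            matched.update((u, v))
    return M

# The bad arrival order from the path used in Solution 3: the middle edge
# arrives first and blocks both outer edges, giving a matching of size 1.
print(greedy_matching([("b", "c"), ("a", "b"), ("c", "d")]))  # [('b', 'c')]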
Figure 8.3. An illustration of the analysis of Algorithm 2. The middle edge is an edge
of M which is blamed for the exclusion of two edges of OPT (presented as dashed edges).
Note that the middle edge cannot be blamed for the exclusion of any additional edges
of OPT because this will require two edges of OPT to have a common vertex.
• The definition given above for the approximation ratio works only for
maximization problems and deterministic algorithms. Variants of this
definition for minimization problems and randomized algorithms exist
as well, and we will present some of them later in this book when
needed.
• Even for maximization problems and deterministic algorithms, there is
an alternative (slightly different) definition of the approximation ratio
which is occasionally used. Specifically, according to this alternative
definition, the approximation ratio is defined as

\sup_{I \in \mathcal{J}} \frac{OPT(I)}{ALG(I)}.
Note that this definition yields approximation ratios in the range [1, ∞),
which is more convenient in some cases. The two definitions, however,
are very close and one can observe that the approximation ratio
according to one definition is exactly the inverse of the approximation
ratio according to the other definition. For example, Algorithm 2 is a
2-approximation algorithm according to the last definition. To avoid
confusion, in this book we stick to the definition of the approximation
ratio given by Definition 1.
in the stack. Let us now explain each one of these parts in more detail. In
the first part, the algorithm maintains for every vertex u ∈ V a potential pu .
Originally, this potential is set to 0, and it grows as more and more edges
hitting the vertex are added to the stack.2 The algorithm then allows an
edge to be added to the stack only if its weight is significantly larger than
the sum of the potentials of its two endpoints. Intuitively, this makes sense
because it prevents a new edge (u, v) from being added to the stack unless
it is significantly heavier than edges that are already in the stack and hit
either u or v. More formally, whenever Algorithm 3 gets an edge (u, v),
it calculates for this edge a residual weight w'(u, v) by subtracting the
potentials p_u and p_v from the weight of (u, v). If the residual weight
is negligible compared to the original weight, then the algorithm simply
discards the edge. Otherwise, the algorithm pushes the edge into the stack
and increases the potentials of u and v by the edge's residual weight.
We now get to the second part of Algorithm 3, in which the algorithm
creates a matching from the edges of the stack. This is done by simply
popping the edges from the stack one by one, and adding any popped edge
to the matching, unless this addition makes the matching illegal.
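The following Python sketch illustrates the two parts just described, under the assumption that an edge is kept when its residual weight is at least an ε fraction of its weight; the threshold and all names are illustrative choices, not the book's exact pseudocode for Algorithm 3.

from collections import defaultdict

def weighted_matching(edge_stream, eps):
    potential = defaultdict(float)   # p_u, implicitly 0 for unseen vertices
    stack = []
    # Part 1: decide which edges enter the stack and update the potentials.
    for u, v, w in edge_stream:
        residual = w - potential[u] - potential[v]
        if residual >= eps * w:      # otherwise the residual weight is negligible
            stack.append((u, v, w))
            potential[u] += residual
            potential[v] += residual
    # Part 2: pop the stack and greedily build a matching.
    matched, M = set(), []
    while stack:
        u, v, w = stack.pop()
        if u not in matched and v not in matched:
            M.append((u, v, w))
            matched.update((u, v))
    return M

# A heavy edge arriving after two light edges that touch its endpoints wins:
print(weighted_matching([("a", "b", 1), ("b", "c", 1), ("a", "c", 10)], eps=0.1))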
Exercise 5. Use the above assumption to prove that the stack S maintained
by Algorithm 3 contains at no time more than O(ε−1 · n log n) edges. Hint:
First, show that the potential pu of a vertex u grows exponentially with the
number of edges hitting u in the stack S.
Using the last exercise, we can easily upper bound the space complexity
of Algorithm 3.
By Exercise 5, the stack S contains at most O(ε^{-1} · n log n) edges at every
given time, and since each edge can be stored using O(log n) space, all these
edges can be stored using a space complexity of only O(ε^{-1} · n log² n).
In addition to the edges, Algorithm 3 also stores one more thing, which
is the potentials of the vertices. One can observe that the potential of
Lemma 3. \sum_{(u,v) \in S_m} w'(u, v) \geq \frac{1-\varepsilon}{2} \cdot \sum_{(u,v) \in OPT} w(u, v).

Proof. We claim that the expression

2 \cdot \sum_{(u,v) \in S_{m'}} w'(u, v) + \sum_{(u,v) \in OPT \setminus E_{m'}} \big[(1 - \varepsilon) \cdot w(u, v) - p_{u,m'} - p_{v,m'}\big]   (8.1)

is a non-decreasing function of m'.
Before we get to the proof of this claim, let us explain why it proves the
lemma. Since the potentials all start as 0 and S_0 = E_0 = Ø, the value of
(8.1) for m' = 0 is

\sum_{(u,v) \in OPT} (1 - \varepsilon) \cdot w(u, v) = (1 - \varepsilon) \cdot \sum_{(u,v) \in OPT} w(u, v),
while its value for m' = m is 2 · \sum_{(u,v) \in S_m} w'(u, v) because OPT\E_m = Ø.
Hence, proving the claim yields 2 · \sum_{(u,v) \in S_m} w'(u, v) ≥ (1 − ε) · \sum_{(u,v) \in OPT} w(u, v),
which (in its turn) implies the lemma. Thus, it remains to be proven that
(8.1) is indeed a non-decreasing function of m'. Specifically, fix an arbitrary
1 ≤ r ≤ m. Our objective is to prove that the value of (8.1) for m' = r is at
least as large as the value of (8.1) for m' = r − 1. Let (u_r, v_r) be the edge
at place r in the stream, i.e., (u_r, v_r) is the single edge of E_r\E_{r−1}. We now
need to consider a few cases, as follows:
• Case 1 — The first case we consider is that (u_r, v_r) does not belong
to OPT, and additionally, it is not added to the stack S. In this case,
S_r = S_{r−1} and OPT\E_r = OPT\E_{r−1}, which together imply that the
value of (8.1) is identical for both m' = r and m' = r − 1.
• Case 2 — The second case is that (u_r, v_r) belongs to OPT, but is
not added to the stack S. In this case S_r = S_{r−1} and OPT\E_r =
(OPT\E_{r−1})\{(u_r, v_r)}, which together imply that the change in the
value of (8.1) when m' is increased from r − 1 to r is
−[(1 − ε) · w(u_r, v_r) − p_{u_r,r−1} − p_{v_r,r−1}] = ε · w(u_r, v_r) − w'(u_r, v_r) ≥ 0,
where the inequality holds since the fact that (u_r, v_r) was not added to
the stack implies that its residual weight w'(u_r, v_r) is at most ε · w(u_r, v_r).
• Case 3 — The third case is that (u_r, v_r) does not belong to OPT, but
is added to the stack S. In this case the first term of (8.1) increases by
2 · w'(u_r, v_r) when m' is increased from r − 1 to r. Additionally, the
addition of (u_r, v_r) to the stack increases the potentials of u_r and v_r by
w'(u_r, v_r) each, and since each one of these vertices can belong to at most
one edge of OPT, the resulting decrease in the second term of (8.1) when
m' is increased from r − 1 to r is at most 2 · w'(u_r, v_r). Adding all the
above, we get that the total change in the value of (8.1) when m' is
increased from r − 1 to r is non-negative in this case, too.
• Case 4 — The final case is that (u_r, v_r) belongs to OPT, and
additionally, it is added to the stack S. Like in the previous case, we
again get that the change in the first term of (8.1) when m' is increased
from r − 1 to r is 2 · w'(u_r, v_r). Let us now analyze the change in the
second term of (8.1). Since OPT\E_r = (OPT\E_{r−1})\{(u_r, v_r)}, the
decrease in this term is given by (1 − ε) · w(u_r, v_r) − (p_{u_r,r−1} + p_{v_r,r−1}).
Note that this time there is no need to take into account the change
in potentials following the addition of (u_r, v_r) to the stack because the
value of (8.1) for m' ≥ r does not depend on the potentials of u_r and
v_r (no other edge of OPT can involve these vertices). It now remains
to be observed that, by definition,
(1 − ε) · w(u_r, v_r) − p_{u_r,r−1} − p_{v_r,r−1} = w'(u_r, v_r) − ε · w(u_r, v_r),
which implies that, in this case, the total change in the value of (8.1)
when m' is increased from r − 1 to r is
2 · w'(u_r, v_r) − [w'(u_r, v_r) − ε · w(u_r, v_r)] = w'(u_r, v_r) + ε · w(u_r, v_r) ≥ 0,
where the last inequality holds since the fact that (u_r, v_r) is added to
the stack implies that w'(u_r, v_r) is positive.
where the second line follows from the first line since w'(u, v) = w(u, v) −
p_u − p_v. Summing up the last inequality over all the edges of M, we get

\sum_{(u,v) \in M} w(u, v) \geq \sum_{(u,v) \in M} \Bigg[w'(u, v) + \sum_{(u',v') \in B_u} w'(u', v') + \sum_{(u',v') \in B_v} w'(u', v')\Bigg] \geq \sum_{(u,v) \in S_m} w'(u, v),
where the last inequality holds due to two observations: first, that every
edge of Sm \M must be blamed on some endpoint of an edge of M , and
second, that the residual weights of edges in the stack are always positive
(and so are the residual weights of the edges of M because M ⊆ Sm ).
Lemma 5. E[X] = TG .
Proof. Let T(u,v) be the number of triangles in G whose first edge in the
stream is (u, v). Since each triangle has a unique first edge in the stream,

T_G = \sum_{(u,v) \in E} T_{(u,v)}.   (8.2)
Let us explain now how the value of w affects the output of Algorithm 6. If
there is a triangle consisting of the vertices u, v and w which is counted by
T(u,v) , then Algorithm 6 will find the edges (u, w) and (v, w) after (u, v) in
the stream, and will output m(n − 2). Otherwise, Algorithm 6 will not find
both these edges after (u, v) in the stream, and thus, will output 0. Hence,
for a fixed choice of the edge (u, v), Algorithm 6 will output m(n − 2) with
probability T_{(u,v)}/|V\{u, v}| = T_{(u,v)}/(n − 2). More formally, for every edge
(u', v') ∈ E,

E[X \mid (u, v) = (u', v')] = \frac{T_{(u',v')}}{n-2} \cdot [m(n - 2)] + \left(1 - \frac{T_{(u',v')}}{n-2}\right) \cdot 0 = m \cdot T_{(u',v')}.
It remains to be observed that, by the law of total expectation,

E[X] = \sum_{(u',v') \in E} Pr[(u, v) = (u', v')] \cdot E[X \mid (u, v) = (u', v')] = \sum_{(u',v') \in E} \frac{1}{m} \cdot m \cdot T_{(u',v')} = \sum_{(u',v') \in E} T_{(u',v')} = T_G.
\frac{mn \cdot T_G}{h} = \frac{mn \cdot T_G}{B/\varepsilon^2} = \frac{\varepsilon^2 \cdot mn \cdot T_G}{B}.
Let us give the last corollary an intuitive meaning. This corollary states
that when we are given a bound B such that we are guaranteed that the
number of triangles in G is large enough to make mn/TG at most B/3,
then Algorithm 7 is guaranteed to estimate the number of triangles up
to a relative error ε with probability at least 2/3. In other words, if we
think of all the graphs that have enough triangles to guarantee mn/T_G ≤
B/3 as a class of graphs parametrized by B, then Corollary 2 states that
Algorithm 7 produces a good estimate for the number of triangles with a
constant probability when the graph belongs to this class of graphs. Note
that this corresponds well with the beginning of this section, where we
learned that obtaining an estimate for the number of triangles in a graph
using a non-trivial space requires us to assume something on the graph (for
example, that it belongs to some class of graphs).
Observe that increasing the value of the parameter B extends the class
of graphs for which Corollary 2 guarantees a good estimate (with constant
probability). However, there is a cost for this increase in the form of a larger
space complexity.
Exercise 9. Prove that the space complexity of Algorithm 7 is O((n + ε−2 B)·
log n).
• The trivial algorithm for triangle counting stores the entire graph,
which requires a space complexity of Θ(m log n). Hence, Algorithm 7
is only interesting when B is much smaller than m; which is equivalent
to saying that Algorithm 7 is only interesting when we are guaranteed
that the number of triangles in G is much larger than n.
Exercise Solutions
Solution 1
The algorithm we suggest for determining whether a graph is connected is
given as Algorithm 8. Note that this algorithm indeed assumes knowledge
of the set V of vertices as suggested by the exercise.
The main data structure maintained by Algorithm 8 is the collection C.
Observation 2 gives an important property of C. Formally, this observation
can be proved using induction on m .
The space complexity necessary for the two pointers Cu and Cv remains
to be analyzed. By Observation 2, the collection C contains at most n sets
at every given time. Thus, each pointer to one of these sets requires only
O(log n) bits. Combining all these space bounds, we get that the space
complexity required for the entire Algorithm 8 is O(n log n).
Solution 2
Algorithm 2 has to store only two things: the current edge and the
matching M . Since M is a legal matching, it contains at most n/2 edges.
Moreover, since any edge is represented using two vertices, we get that the
space complexity of Algorithm 2 is upper bounded by the space necessary
for storing 2(n/2 + 1) = n + 2 = O(n) vertices. Recall that each vertex can
be represented using a space of O(log n) because there are only n vertices,
which implies that the space complexity of Algorithm 2 is O(n log n).
Figure 8.4. A path consisting of the three edges e1, e2 and e3 (in this order), showing
that the approximation ratio of Algorithm 2 is not better than 1/2.
Solution 3
Consider the graph given in Figure 8.4. Clearly, the maximum size matching
in this graph consists of the edges e1 and e3 . However, if the edge e2 appears
first in the input stream, then Algorithm 2 will add it to its output matching
M , which will prevent any other edge from being added to M later. Hence,
we have found an input for which the ratio between the size of the matching
produced by Algorithm 2 (which is 1) and the size of the largest possible
matching (which is 2) is 1/2, as required.
Solution 4
The pseudocode of Algorithm 3 assumes prior knowledge of the set V of
vertices only for the purpose of initializing the potentials of all the vertices
to 0. However, in an implementation this can be done implicitly. In other
words, an implementation of the algorithm can check every time before
accessing the potential of a vertex if this potential was set before, and treat
the potential as 0 if the answer is negative (i.e., the potential was never
set before). Thus, Algorithm 3 can be implemented without assuming prior
knowledge of V . Note that the algorithm learns, however, the non-isolated
vertices of V since these vertices appear inside the edges that the algorithm
gets during its execution.
Solution 5
Following the hint, we first prove Lemma 8.
Lemma 8. Consider a vertex u, and let S_u denote the number of edges hitting
u that belong to the stack S. If S_u ≥ 1, then p_u ≥ ε · (1 + ε)^{S_u − 1}.

Proof. The proof is by induction on S_u. If S_u = 1, then, denoting by (u, v)
the single edge hitting u in the stack, the fact that (u, v) was added to the
stack implies

p_u ≥ w'(u, v) ≥ ε · w(u, v) ≥ ε · 1 = ε · (1 + ε)^{S_u − 1}.
Assume now that Su > 1 and that the lemma holds for Su − 1, and let us
prove it for Su . Let (u, v) be the last edge hitting u that was added to the
stack S. Before (u, v) was added, the stack contained Su − 1 edges hitting
u, and thus, by the induction hypothesis, the potential of u at this point
(which we denote by pu ) was at least (1 + ε)Su −2 . Additionally, since (u, v)
was added to the stack, we get
Recalling that the addition of (u, v) to the stack increases the potential of
u by w (u, v), we conclude that the new potential of u after the addition of
(u, v) to the stack is at least
As the weight of all edges, including (u, v), is at most nc , this implies that
the residual weight of (u, v) is non-positive, which contradicts the fact that
(u, v) was added to the stack. As a consequence of this contradiction, we get
that the number of edges in the stack hitting every given vertex is at most
log1+ε nc + 2. Moreover, since there are n vertices, and each edge must hit
two of them, we also get the following upper bound on the number of edges
in the stack.
\frac{n \cdot (\log_{1+\varepsilon} n^c + 2)}{2} = \frac{n \cdot \ln n^c}{2\ln(1+\varepsilon)} + n \leq \frac{n \cdot \ln n^c}{\varepsilon} + n = O(\varepsilon^{-1} \cdot n \log n).
Solution 6
Let us begin the solution for this exercise by proving that Algorithm 4 is
a semi-streaming algorithm. Algorithm 4 has to store only two things: the
current edge and the matching M . Since M is a legal matching, it contains
at most n/k edges (each edge of M includes k vertices, and no vertex
can belong to more than one edge in a legal matching). Moreover, since
any edge is represented using k vertices, we get that the space complexity
of Algorithm 2 is upper bounded by the space necessary for storing
k(n/k + 1) = n + k = O(n) vertices. Recall that each vertex can be
represented using a space of O(log n) because there are only n vertices,
which implies that the space complexity of Algorithm 2 is
Solution 7
The analysis of Algorithm 5 is very similar to the analysis of Algorithm 3,
and therefore, we concentrate here only on the places in the analyses where
they differ.
We begin with the analysis of the space complexity of Algorithm 5.
One can verify that the result given by Exercise 5 — i.e., that the number
of edges in the stack S is never larger than O(ε−1 · n log n) — holds for
Algorithm 5 as well. Moreover, the same applies also to the part of the
proof of Corollary 1 using this result to show that Algorithm 5 has to
maintain at every given time at most O(ε−1 · n log n) edges. We now note
that every edge consists of k vertices and a weight. Moreover, each vertex
can be represented using O(log n) space since there are only n vertices, and
each weight can also be represented using O(log nc ) = O(log n) space since
we assume that the weights are integers between 1 and nc (for a constant c).
Thus, each edge can be represented using a space of (k + 1) · O(log n) =
O(log n), which implies that the total space required by Algorithm 5 for
storing edges is only O(ε−1 · n log2 n).
In addition to the edges, Algorithm 5 also stores one more thing, which
is the potentials of the vertices. One can observe that the potential of
each vertex is an integer, and furthermore, it is upper bounded by the
total weight of the edges hitting this vertex. Since there are at most nk−1
such edges, we get that the potential of every vertex is upper bounded by
nk−1 · nc = nc+k−1 , and thus, can be represented using O(log nc+k−1 ) =
O(log n) space. Hence, the potentials of all the vertices together require no
more than O(n log n) space. This completes the proof that Algorithm 5 is
a semi-streaming algorithm (for constant ε and k).
The next step in the analysis of Algorithm 5 is to prove an analog for
Lemma 3. Specifically, we prove Lemma 9.
Lemma 9. \sum_{e \in S_m} w'(e) \geq \frac{1-\varepsilon}{k} \cdot \sum_{e \in OPT} w(e).
Proof. Recall that the proof of Lemma 3 is done by arguing that (8.1) is a
non-decreasing function of m'. The proof of this lemma follows in the same
way from the claim that the next expression is non-decreasing (recall that
p_{u,m'} is the potential of vertex u after the algorithm has processed the first
m' edges and E_{m'} is the set of these first m' edges).

k \cdot \sum_{e \in S_{m'}} w'(e) + \sum_{e \in OPT \setminus E_{m'}} \Bigg[(1 - \varepsilon) \cdot w(e) - \sum_{u \in e} p_{u,m'}\Bigg].
To prove this claim, we need to use the same partition into four cases used
in the proof of Lemma 3. As the proofs of the individual cases here are
identical (up to some technical changes) to the proofs of the corresponding
cases in the proof of Lemma 3, we omit them.
We also need the following analog for Lemma 4. The proof of this analog
differs from the proof of Lemma 4 only in some technical details.
Combining Lemmata 9 and 10, we get that the weight of the matching
produced by Algorithm 5 is at least a fraction of (1 − ε)/k out of the weight
of a maximum weight matching. Thus, Algorithm 5 is a ((1 − ε)/k)-approximation
algorithm, which completes the solution for the exercise.
Solution 8
Assume first that we are given a magical helper that selects a uniformly
random edge from the stream and signals when this edge arrives. It is not
difficult to see that given such a helper Algorithm 6 can be implemented
as a graph data stream algorithm. Specifically, we use the edge selected by
the helper to implement Line 1 of the algorithm, i.e., to sample a uniformly
random edge (u, v). Since the helper signals its choice as soon as (u, v)
arrives, we can also select at this point a vertex w ∈ V \{u, v}, and then
scan the rest of the stream for the edges (u, w) and (v, w) whose presence
or absence from the part of the stream appearing after (u, v) determines
Algorithm 6’s output.
Unfortunately, we do not have such a magical helper. Instead, we can
use a reservoir sampling algorithm to implement Line 1 of Algorithm 6.
The difference between a reservoir sampling algorithm and the above
magical helper is that the reservoir sampling algorithm maintains a sample
that changes occasionally over time. More specifically, every time that
the reservoir sampling algorithm receives an edge, it decides with some
probability to make this edge the new sample, and the final output of the
algorithm is the last edge that was made into a sample this way. This
means that we can never be sure (before the stream terminates) whether
the current sample of the reservoir sampling algorithm will be replaced with
a new edge later on or will end up as the output sample.
To solve this issue, every time after the reservoir sampling algorithm
selects a new edge as its sample, Algorithm 6 should assume that this edge
is the final sample and continue executing based on this assumption. If this
assumption turns out to be false (i.e., the reservoir sampling algorithm picks
a new edge as its sample at a later stage), then Algorithm 6 can simply
discard everything that it did based on the previous sample, and again start
assuming that the new sample is the final edge (u, v).
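For concreteness, here is a minimal Python sketch of reservoir sampling with a single sample; the comment marks the point at which, in the context of Algorithm 6, the computation would be restarted as described above. The names are illustrative.

import random

def reservoir_sample(stream):
    # Each arriving item replaces the current sample with probability 1/i,
    # which makes the final sample uniformly distributed over the stream.
    sample, length = None, 0
    for item in stream:
        length += 1
        if random.randrange(length) == 0:   # probability 1/length
            sample = item
            # In the triangle-counting application, this is the point at which
            # Algorithm 6 would restart: pick a fresh vertex w and begin
            # scanning the rest of the stream for the edges (u, w) and (v, w).
    return sample, length

print(reservoir_sample([("a", "b"), ("b", "c"), ("c", "d")]))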
Recall that the reservoir sampling algorithm stores only the length of
the stream and a constant number of tokens (i.e., edges) at every given time.
The length of the stream is m, and thus, can be stored using O(log m) =
O(log n²) = O(log n) space. Moreover, storing each edge can be done by
storing its two endpoints, which again requires O(log n) space. Hence, the
space complexity necessary for the reservoir sampling algorithm is only
O(log n). In addition to the space required for this algorithm, Algorithm 6
requires only O(log n) additional space for storing the three vertices u, v
and w and the values of the parameters m and n. Thus, we can conclude
that Algorithm 6 can be implemented as a graph data stream algorithm
using O(log n) space.
Solution 9
Algorithm 7 uses h copies of Algorithm 6. Each one of these copies can be
implemented using a space complexity of O(log n) by Exercise 8 assuming
prior knowledge of V , but without taking into account the space necessary
for representing V . The total space complexity required for all these copies
is
h · O(log n) = O(B/ε2 ) · O(log n) = O(ε−2 B · log n).
Adding the space necessary for representing V , which is O(n log n),
we get that the total space complexity necessary for Algorithm 7 is
O((n + ε−2 B) · log n).
Chapter 9
tokens (or events) that appeared in the stream. In Section 9.1, we describe
the sliding window model, which is a formal model capturing the above
point of view.
• If the new token is “b” and the last token was “a”, then the algorithm
has detected a new appearance of “ab” in the active window, which
implies that zero tokens have arrived since the last appearance of this
sequence.
• Otherwise, the algorithm either increases LastAppearance by 1 to
indicate that one more token has arrived after the last appearance of
the sequence "ab", or sets it to "Never" if so many tokens have arrived
since this last appearance that the sequence "ab" is not part of the
active window anymore (a small illustrative implementation of this
update rule appears after this list).
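The following Python sketch is one possible implementation of the update and query rules described by the bullets above; the class and variable names are mine, and the exact pseudocode of Algorithm 1 is not reproduced here.

class ABWindowDetector:
    # Track only the last token and the number of tokens that arrived since
    # the last appearance of "ab" (None plays the role of "Never").
    def __init__(self, W):
        self.W = W
        self.last_token = None
        self.last_appearance = None

    def process(self, token):
        if token == "b" and self.last_token == "a":
            self.last_appearance = 0
        elif self.last_appearance is not None:
            self.last_appearance += 1
            # The "a" of that appearance is last_appearance + 2 positions from
            # the end of the stream, so it left the window once this exceeds W.
            if self.last_appearance > self.W - 2:
                self.last_appearance = None
        self.last_token = token

    def query(self):
        return self.last_appearance is not None

d = ABWindowDetector(W=3)
for tok in "abcc":
    d.process(tok)
print(d.query())  # False: the last 3 tokens are "bcc", which do not contain "ab"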
Figure 9.1. An update performed by Algorithm 2 following the arrival of a single edge.
On the left side we see the forest F before the processing of the new edge (the new edge
appears in it as a dashed line). Beside each edge of F we have a number indicating its age
(the last edge that arrived has an age of 1, the edge that arrived before it has an age of
2, and so on). We assume in this example that W = 10, hence, the edge marked with
the age 10 is about to leave the active window. On the right side we see the forest F
after the processing of the new edge. The addition of the new edge to F created a cycle,
which forced Algorithm 2 to remove the oldest edge of this cycle. Additionally, the age
of all the edges increased by 1, which means that the edge whose age used to be 10 is no
longer part of the active window, and has been, thus, removed.
create a cycle in the forest F , which is fixed by the removal of a single edge
from this cycle. The removed edge is chosen as the oldest edge of the cycle.
Additionally, edges of F that are no longer part of the active window (i.e.,
are too old) also get removed from the forest. See Figure 9.1 for a graphical
illustration of the changes made in F during the processing of a new edge.
Proof. Let Gm be the graph induced by the edges of the active window
after Algorithm 2 has processed m edges (for some positive integer m), and
let Fm be the forest F at this point. Observe that the forest Fm solely
consists of edges of Gm . Thus, when Fm is connected, so must be Gm . To
complete the proof of the lemma, we have to also prove the other direction,
i.e., that Fm is connected whenever Gm is connected. To show that, we
prove by induction on m the stronger claim that for every edge (u, v) in
Gm there must be a path P(u,v) in Fm between u and v such that none of
the edges of P(u,v) is older than (i.e., arrived before) the edge (u, v).
For m = 0 this claim is trivial since Gm contains no edges. Assume
now that the claim holds for Gm−1 , and let us prove it for Gm . Consider an
arbitrary edge (u, v) of Gm . If (u, v) is a new edge that did not appear in
Gm−1 , then we can choose the path P(u,v) as the edge (u, v) itself because
a new edge is always added to F . Otherwise, by the induction hypothesis,
there must have been a path P(u,v) in Fm−1 between u and v such that
none of the edges of P(u,v) is older than (u, v). If P(u,v) appears in Fm as
well, then we are done. Thus, let us assume from now on that this is not
the case. The fact that (u, v) appears in Gm guarantees that all the edges
of P(u,v) , which are not older than (u, v) by definition, are still part of the
active window. Thus, the only explanation for the fact that P(u,v) does not
appear in Fm is that one of its edges was part of the cycle that was created
when edge number m of the stream arrived.
Let us denote this edge (i.e., edge number m of the stream) by (u', v').
Additionally, let C_{(u',v')} denote the cycle that was created when (u', v') was
added to F, and let (u_R, v_R) denote the edge of P_{(u,v)} that belongs also to
C_{(u',v')} and was removed following the arrival of (u', v'). Since Algorithm 2
removes the oldest edge of the cycle, (u_R, v_R) must be the oldest edge of
C_{(u',v')}. Thus, C_{(u',v')}\{(u_R, v_R)} is a path between u_R and v_R in G_m such
that all the edges in this path are younger than (u_R, v_R), and thus, also
younger than (u, v). Adding this path to P_{(u,v)} instead of the edge (u_R, v_R)
creates a new path between u and v which appears in F_m and contains
only edges that are not older than (u, v), which completes the proof of the
lemma (see Figure 9.2 for a graphical illustration).
After proving that Algorithm 2 works correctly, let us now discuss its
time and space efficiency. We begin with the study of the token processing
time of Algorithm 2 (the time used by Algorithm 2 to process a single edge).
This time strongly depends on the data structures we use to handle two
main tasks. The first task is updating the forest F , which includes: adding
the new edge to the forest, detecting the cycle that is created (if it exists),
finding the oldest edge of the cycle, removing the oldest edge of the cycle
Figure 9.2. Illustration of the construction of the path P'_{(u,v)} based on the path P_{(u,v)}
and the cycle C_{(u',v')}. The path P'_{(u,v)} is identical to the path P_{(u,v)}, except that the
edge (u_R, v_R) of P_{(u,v)} is replaced with the other edges of C_{(u',v')}.
and removing edges that are no longer in the active window. The second
task is updating the ages of the edges and detecting edges that are too old to
be in the active window. Using a naı̈ve representation of F , updating F can
take as long as Θ(n) time.1 However, using more involved data structures,
it is possible to significantly decrease the time required for updating F .
As the data structures required for that are quite complex and have little
to do with sliding window algorithms in general, we do not present them
in this book. Instead, we concentrate on the second task mentioned above
(i.e., updating edge ages and detecting edges that are no longer within the
active window).
A naı̈ve way to handle this task is to associate an age counter with every
edge, and let this counter count the number of edges that have arrived after
the edge in question. Unfortunately, having such age counters means that
the algorithm will have to spend Θ(n) time following the arrival of every
new edge on updating these counters because every one of the counters has
to be increased by 1 following the arrival. A more efficient way to maintain
the ages of the edges is to have a single counter C tracking time. This
counter starts with the value 0, and is increased by 1 following the arrival
of every edge. The existence of C allows us to track the ages of edges in
a much easier way. Every time that an edge arrives, we store its arrival
time (i.e., the value of C at that time) in the edge. Then, when we want
to determine the age of a given edge, it is enough to consider the difference
between the current time (value of C) and the time in which the edge
arrived. Moreover, as Algorithm 2 discards edges whose age is more than
W, we can safely make C a counter modulo some value p = Θ(W), i.e.,
it does not need to be an unbounded counter. This observation is important
because it means that the space complexity of C does not need to grow with
the length of the stream (we discuss this point in more detail later).
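A minimal sketch of this timestamping idea, assuming the concrete (and arbitrary) choice p = 2W for the modulus, is given below; it is only an illustration of the counter C described above.

class ModularClock:
    # Time is kept modulo p = 2W, which is enough to recover ages up to W.
    def __init__(self, W):
        self.W = W
        self.p = 2 * W
        self.now = 0

    def tick(self):
        # Called once per arriving edge.
        self.now = (self.now + 1) % self.p

    def stamp(self):
        # Stored together with an edge when it arrives.
        return self.now

    def age(self, arrival_stamp):
        # The difference modulo p equals the true age as long as the age is at
        # most W (older edges are discarded before the counter wraps around).
        return (self.now - arrival_stamp) % self.p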
Exercise 2. The second task presented above consists of two parts: tracking
the ages of the edges and detecting edges whose age is large enough to place
them outside the active window. The last paragraph explains how to efficiently
track the ages of the edges, which is the first part of the task. Explain how
to efficiently implement the second part of the task, i.e., describe an efficient
way to detect edges of F whose age puts them outside the active window.
Figure 9.3. An illustration of the trick used to convert a plain vanilla data stream
algorithm ALG into a sliding window algorithm. A new instance of ALG is created every
time that a token of the stream arrives. The first instance gets the W first tokens of the
stream (here we assume that W = 3). The second instance gets the W tokens starting
at place 2 of the stream, and so on. There are two important observations to make
here. The first observation is that at every given point, one instance of ALG received
exactly the tokens of the active window, and thus, the sliding window algorithm can use
the output of this instance of ALG to answer queries. The second observation is that
we never need to keep more than W instances of ALG. For example, by the time we
receive token number 4 and create the fourth instance of ALG, the first instance can be
discarded because it received in the past a token (token number 1) that has already left
the active window.
which have already left the active window (see Figure 9.3 for a graphical
illustration). However, even after this improvement, the space complexity of
the sliding window algorithm obtained using this trick is larger by a factor
of W compared to the space complexity of ALG because the sliding window
algorithm has to keep W instances of ALG, which is usually unacceptable.
Let us denote by J the set of instances of ALG maintained by the
sliding window algorithm produced by the above trick. In many cases, it
is possible to save space by keeping only a subset J' of the instances of
J, and still produce roughly the same answers when queried. Intuitively,
this is possible because two instances from J that were created at close
time points receive similar streams, and thus, often produce very similar
answers. To use this intuition, we should strive to choose the subset J' in a
way which guarantees that every instance of J\J' started at a close enough
time to one of the instances of J' to guarantee that both instances share
a similar answer. Clearly, if we succeed in doing that, then we can use the
instances of J' to produce a query answer which is similar to the answer
that would have been produced by keeping all the instances of J.
Many existing sliding window algorithms are based on the idea
described by the last paragraph. Often these algorithms use additional
insights that are necessary for the implementation of this idea in the
context of the specific problems solved by these algorithms. Nevertheless,
and thus, analyze the space complexity of Algorithm 4. The second part
of Lemma 2 shows that advancing a single place in the list decreases the
value by no more than a factor of 1 − α (with some exception). This allows
us to estimate f(σ') for every suffix σ' of σ_1 using the following trick.
If σ' = σ_i for some i, then we already have f(σ'). Otherwise, we locate
the i for which σ' is a suffix of σ_{i−1}, but not of σ_i. Clearly, f(σ_{i−1}) ≥
f(σ') ≥ f(σ_i), which means that f(σ_i) can be used as an estimate for f(σ')
because we know that the ratio between f(σ_{i−1}) and f(σ_i) is relatively
small. In particular, we will show that this observation allows Algorithm 4
to estimate the value of the suffix of σ_1 corresponding to the currently active
window.
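To illustrate the idea of keeping only a pruned subset of instances, the following Python sketch applies it to the simple case in which ALG merely sums the (non-negative) tokens it receives; the pruning rule and the variable names follow the spirit of the description above rather than the exact pseudocode of Algorithm 4.

def pruned_window_sum(stream, W, beta):
    # Each kept "instance" is a pair [start_index, sum of the suffix it has seen].
    instances = []
    for idx, x in enumerate(stream):
        instances.append([idx, 0])         # a new instance starts at every arrival
        for inst in instances:
            inst[1] += x                    # every instance receives the new token
        # Keep exactly one instance that started at or before the window start.
        while len(instances) >= 2 and instances[1][0] <= idx - W + 1:
            instances.pop(0)
        # Pruning: if two instances that are two places apart have values within
        # a factor of 1 - beta, the instance between them is redundant.
        i = 0
        while i + 2 < len(instances):
            if (1 - beta) * instances[i][1] <= instances[i + 2][1]:
                instances.pop(i + 1)
            else:
                i += 1
    # instances[0] upper bounds the sum of the last W tokens, and instances[1]
    # (when it exists) lower bounds it; the first value is returned as the estimate.
    return instances[0][1]

print(pruned_window_sum([5, 1, 1, 1, 1, 1], W=3, beta=0.5))  # 3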
Proof of Lemma 2. The first part of the lemma follows immediately from
the fact that Algorithm 4 explicitly looks for instances a_i, a_{i+2} violating the
inequality (1 − β) · f(σ_i) ≥ f(σ_{i+2}), and continues to delete instances from
the list A as long as such a violating pair of instances exists. Proving the
second part of the lemma is a bit more involved. Consider the iteration in
which the instance a_{i+1} of ALG was created, and let a'_{i+1} be the instance of
ALG created by the iteration before that. One can observe that Algorithm
4 never removes from A the last instance of ALG. Thus, a'_{i+1} was still
part of A when the instance a_{i+1} was created. There are now two cases to
consider. The first case is that a_i = a'_{i+1}. This case implies that a_i and a_{i+1}
were created in consecutive iterations of Algorithm 4, and thus, σ_i contains
exactly one more token compared to σ_{i+1}, which completes the proof of the
lemma for this case.
The second case that we need to consider is the case a_i ≠ a'_{i+1}. This
case implies that a_i and a_{i+1} were not adjacent in A when a_{i+1} was first
added to A. Thus, they must have become adjacent when some instance a
that appeared between them was removed. The removal of a could not have
been due to inactivity of a because that would have caused the removal
of a_i as well. Thus, the removal of a must have been done because at the
time of the removal it was true that (1 − β) · f(a_i) < f(a_{i+1}). At this point,
we need to invoke Property 3 of Definition 1. This property guarantees
that if f(σ_{i+1}) was smaller than f(σ_i) by at most a factor of 1 − β at
some point, then adding the new tokens that arrived after this point to
both σ_i and σ_{i+1} cannot make the ratio f(σ_{i+1})/f(σ_i) = f(a_{i+1})/f(a_i)
smaller than 1 − α. In other words, f(a_i) and f(a_{i+1}) must obey now the
inequality (1 − α) · f(a_i) ≤ f(a_{i+1}) because they obeyed the inequality
(1 − β) · f(a_i) < f(a_{i+1}) at some point in the past.
Recall now that, by Lemma 2, every value in the list f (a1 ), f (a2 ), . . . ,
f (a|A| ) must be smaller by at least a factor of 1 − β compared to the
value that appears in the list two places earlier. As the first value in the list
corresponding to an active instance is f (ai ), and the last value in the list
corresponding to an active instance is f (aj ), this implies that the number
of values in the list corresponding to active instances is upper bounded by
2 + 2 \cdot \log_{1-\beta} \frac{f(a_j)}{f(a_i)} = 2 + 2 \cdot \frac{\ln(f(a_j)/f(a_i))}{\ln(1-\beta)} \leq 2 + 2 \cdot \frac{\ln(f(a_i)/f(a_j))}{\beta} = O(\beta^{-1} \cdot (\log W + \log m)),
where the inequality follows from the inequality ln(1 − β) ≤ −β, which
holds for every real value β < 1.
Exercise Solutions
Solution 1
The following table summarizes the changes in the state of Algorithm 1
during its execution. The first row of the table corresponds to the state
of the algorithm immediately after the execution of its initialization part,
while the rows after that correspond to the states of the algorithm after
the processing of each token of the stream (in order). Each row contains
a few pieces of information, including the value of the two variables of the
algorithm at the time corresponding to the row, the content of the active
window at that time (i.e., the last 3 tokens that have arrived) and the
answer of the query part of the algorithm if executed at this time. Note
that the query part indicates correctly whether the sequence “ab” appears
in the active window.
Solution 2
We need to explain how to detect edges of F whose age is large enough
to put them outside the active window. A naı̈ve solution is to maintain a
queue which contains all the edges of the active window. Every time that a
new edge arrives, it should be added to the queue. If the size of the queue
exceeds W following this addition, then we know that the first edge of the
queue is now outside the active window, and we can remove it from both the
queue and F . This naı̈ve solution is very time-efficient, but unfortunately,
it uses a lot of space because it requires the algorithm to store W edges.
To improve the space efficiency of the solution, we need to remove from
the queue edges that are no longer in F . Note that this reduces the size of
the queue to O(n) because a forest over n nodes contains at most n−1 edges.
Moreover, such a removal can be done efficiently if the queue is represented
as a doubly linked list. One should note, however, that, as a consequence
of the removal of these edges, we can no longer determine whether the first
edge of the queue belongs to the active window by simply checking the size
of the queue. However, as explained in the paragraph before this exercise,
we can calculate the age of the first edge of the queue by comparing its
arrival time with the current time as captured by the counter C. Hence,
we can still determine in O(1) time whether the first edge of the queue is
outside the active window (and thus, needs to be removed from the queue
and from F ).
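To make the bookkeeping described in this solution concrete, the following Python sketch tracks the arrival times of the edges currently in F using an ordered dictionary, which plays the role of the doubly linked list mentioned above. The forest API (add/remove) and the method names are illustrative assumptions rather than the book's pseudocode.

```python
from collections import OrderedDict

class WindowForestTracker:
    """Sketch of the bookkeeping in Solution 2: remember when each edge of F arrived,
    so that edges that dropped out of the active window can be detected in O(1) time."""

    def __init__(self, W):
        self.W = W
        self.C = 0                    # the arrival counter from the text
        self.arrival = OrderedDict()  # edge -> arrival time, oldest first (the "doubly linked list")

    def edge_arrived(self, edge, forest):
        self.C += 1
        if forest.add(edge):          # hypothetical forest API: True if the edge was kept in F
            self.arrival[edge] = self.C
        self.expire_old_edges(forest)

    def edge_left_forest(self, edge):
        # called whenever F discards an edge for another reason; keeps the queue at O(n) entries
        self.arrival.pop(edge, None)

    def expire_old_edges(self, forest):
        while self.arrival:
            edge, t = next(iter(self.arrival.items()))
            if self.C - t < self.W:   # the oldest stored edge is still inside the active window
                break
            self.arrival.popitem(last=False)
            forest.remove(edge)       # hypothetical forest API
```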
Solution 3
We begin this solution by analyzing the space complexity of Algorithm 3.
Note that the main difference between the things that Algorithms 2 and 3
have to store is that Algorithm 3 stores two forests F1 and F2 , as opposed
to Algorithm 2 which stores only one forest F . The analysis of the space
complexity of Algorithm 2, as given by the proof of Observation 1, bounds
the space complexity used for storing the forest F by showing two things.
First, that each edge requires O(log n + log W ) space, and second, that F
contains O(n) edges because it is a forest with n vertices. These two things
together imply that F can be represented using O(n(log n + log W )) space.
Clearly, this analysis of the space complexity of F is based on no properties
of F , other than the mere fact that it is a forest. Thus, there is no reason
preventing the same analysis from being applied to F1 and F2 as well. In
other words, the two forests F1 and F2 require O(n(log n+ log W )) space —
just like F . Recalling that the replacement of F with F1 and F2 is the main
difference between the data structures used by Algorithms 2 and 3, we get
that the space complexity bound proved by Observation 1 for Algorithm 2
applies also to Algorithm 3. Hence, we have proved that Algorithm 3 is a
semi-streaming algorithm.
Let us now shift our attention to proving that Algorithm 3 answers
correctly (when queried) whether the graph induced by the edges in its
active window is a 2-edge-connected graph. In other words, we need to
show that the graph induced by the union of the edges of F1 and F2 (which
we denote by H from now on) is 2-edge-connected if and only if the graph
induced by all the edges in the active window (which we denote by G from
now on) is 2-edge-connected. Observe that both F1 and F2 include only
edges from the active window, which implies that H is a subgraph of G.
Thus, the number of edges of G crossing every given cut is at least as large
as the number of such edges of H. In particular, a cut that is crossed by
at least two edges in H is also crossed by at least two edges in G, which
guarantees that G is 2-edge-connected whenever H is 2-edge-connected.
The other direction, namely, that H is 2-edge-connected whenever G is
remains to be proved. For that purpose, let us assume from now on that G
is 2-edge-connected. We will show that the 2-edge-connectivity of H follows
from this assumption.
Consider an arbitrary non-trivial cut between the vertices of a set ∅ ≠ S ⊂ V and the vertices of V \S. Since G is 2-edge-connected, there must
be at least two edges in it crossing this cut. Let us denote the youngest
edge among these by e1 and the second youngest among them by e2 . To
prove that H is 2-edge-connected, it is enough to show that it contains both
edges e1 and e2 because (S, V \S) was chosen as an arbitrary non-trivial cut.
When e1 arrived, Algorithm 3 added it to F1 because it adds every new
edge to F1 . Assume, by way of contradiction, that e1 was removed from
F1 at a later point. Since e1 belongs to G, it is part of the active window,
which implies that its removal from F1 was due to its membership in a
cycle C1 which contained only younger edges than e1 itself. However, such
a cycle C1 must include a second edge e′1 (other than e1 ) that crosses the
cut (S, V \S), which contradicts the definition of e1 because we got that e′1
crosses the cut (S, V \S) and is younger than e1 due to its membership in
C1 (see Figure 9.4 for a graphical illustration). This contradiction implies
that e1 must be part of F1 , and thus, also part of H.
Consider now the edge e2 . When e2 arrived, it was added to F1 by
Algorithm 3. If e2 was never removed from F1 , then it is also part of H, and
we are done. Thus, we can safely assume in the rest of the proof that e2 was
removed from F1 . When that happened, it was added to F2 by Algorithm 3.
Using an analogous argument to the one used in the last paragraph, we can
show that if e2 was removed from F2 later, then there must exist an edge
e′2 which was in F2 at some point, crosses the cut (S, V \S) and is younger
than e2 . However, the existence of such an edge e′2 leads to a contradiction
Figure 9.4. A graphical illustration of the proof that e1 has not been removed from F1 .
The cycle C1 is a hypothetical cycle which caused e1 ’s removal. As such, this cycle must
contain beside e1 only edges younger than it. Additionally, this cycle must cross at least
twice the cut (S, V \S) because e1 crosses this cut and each cycle crosses each cut an
even number of times. Thus, C1 must contain an edge e′1 which is both younger than e1
and crosses the cut (S, V \S), which contradicts e1 ’s definition.
since the fact that e1 remains in F1 implies that e2 is the youngest edge
crossing the cut (S, V \S) that was ever added to F2 . Thus, we have proved
that e2 remains in F2 , and hence, it belongs to H, which completes the
proof that H is 2-edge-connected.
Solution 4
(a) Fix a value ε ∈ (0, 1/2), and let f be the function that given a stream
σ of integers between 1 and c returns the sum of the integers in the
stream. We need to show that f is (ε, ε)-smooth. To do that, we need
to prove that it obeys the four properties given in Definition 1.
Indeed, if σ′ is a suffix of σ and we write σ = ρ · σ′ for the corresponding prefix ρ, then
f(σ) = f(ρ · σ′) = f(ρ) + f(σ′) ≥ f(σ′),
and whenever f(σ′) ≥ (1 − ε) · f(σ), appending an arbitrary stream σ′′ to both σ and σ′ gives
f(σ′ · σ′′) = f(σ′) + f(σ′′) ≥ (1 − ε) · f(σ) + f(σ′′) ≥ (1 − ε) · [f(σ) + f(σ′′)] = (1 − ε) · f(σ · σ′′).
Solution 5
Consider an arbitrary instance a of Algorithm 5 used by SlidingSum. If at
some point a is an active instance, then (by definition) a received at most
W tokens so far. Hence, from the point of view of a, the length of its input
stream is at most W . Combining that with Exercise 4, we get that the space
complexity used by a at this point is O(log W ). Thus, in the rest of this
solution we may assume that a is an inactive instance of Algorithm 5.
According to Observation 2, the inactive instance a is the only inactive
instance in the list A, and it appears as the first instance in this list. Let us
denote the second instance in this list by a2 . Since a2 is an active instance,
it received at most W tokens so far, and thus, its variable s has a value of
at most c · W . Observe now that when an instance of Algorithm 5 reaches
the end of its stream, it returns the value of its variable s. Thus, Lemma 2
implies in this context that one of two things must happen:
• The first case is that the value of the variable s of the instance a is
larger than the value of the variable s of the instance a2 by at most a
factor of (1 − ε)⁻¹ ≤ 2. Combining this with the upper bound we have
on the value of the variable s of a2 , we get that in this case the value
of the variable s of the instance a is at most 2c · W .
• The other case is that a has received exactly one token more than a2 ,
which means that it has received at most W + 1 tokens in total. Thus,
in this case we also get that the value of the variable s of a is upper
bounded by c · (W + 1) ≤ 2c · W .
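For intuition, here is a Python sketch of the kind of smooth-histogram bookkeeping that Solutions 5 and 6 reason about: every instance is just a running sum over a suffix of the stream, only the last inactive instance is kept, and a middle instance is pruned once the (1 − β) condition holds. The class, its method names and its query rule are assumptions made for illustration; the exact details of Algorithms 4–6 are not reproduced in this excerpt and may differ.

```python
class SlidingSum:
    """Sketch: approximate sum of the last W tokens (positive integers of value at most c)."""

    def __init__(self, W, beta):
        self.W = W                 # window length
        self.beta = beta           # smoothness parameter
        self.t = 0                 # number of tokens seen so far
        self.instances = []        # [start_time, running_sum], ordered by start time

    def process(self, x):
        self.t += 1
        for inst in self.instances:
            inst[1] += x                       # every existing instance receives the new token
        self.instances.append([self.t, x])     # a new instance starts at the current token
        first_active = self.t - self.W + 1
        # keep at most one inactive instance (the last instance that became inactive)
        while len(self.instances) >= 2 and self.instances[1][0] < first_active:
            self.instances.pop(0)
        # prune the middle instance of a triple once (1 - beta) * f(sigma_i) < f(sigma_{i+2})
        i = 0
        while i + 2 < len(self.instances):
            if (1 - self.beta) * self.instances[i][1] < self.instances[i + 2][1]:
                self.instances.pop(i + 1)
            else:
                i += 1

    def query(self):
        # return the sum of the first active instance; it never exceeds the true window sum,
        # and the pruning rule keeps it close to it (this is what the chapter's analysis shows)
        for start, s in self.instances:
            if start >= self.t - self.W + 1:
                return s
        return 0
```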
Solution 6
(a) It is enough to prove that Corollary 2 applies to Algorithm 6 as well.
This corollary is based on Observation 2 and on the second part of
Lemma 2, and it can be observed that the proofs of both of these work
for Algorithm 6 without any modification. Thus, the same is also true
for Corollary 2.
(b) Let us begin this solution with a few observations, as follows:
• The first part of Lemma 2 does not apply to Algorithm 6. However,
it is still true that, for every 1 ≤ i ≤ |A| − 2, either (1 − β) · f (σi ) ≥
f (σi+2 ) or σi+1 is marked. The reason for that is that Algorithm 6
explicitly looks for triplets ai , ai+1 , ai+2 which do not have either of
the above properties, and deletes the instance ai+1 whenever it finds
such a triplet.
• A can contain at most one inactive instance because Algorithm 6
deletes all the inactive instances except for the last one.
• At most one of the active instances in A can be marked. The reason
for that is that the active instances are all instances that were created
by Algorithm 6 following the arrival of a token that currently belongs
to the active window, and at most one token out of an active window
of length W can have a location in the stream of the form k · W for
some integer k.
Let us explain now how these observations can be used to bound the
total number of instances in A. Let ai and aj be the first and last active instances in A, respectively. Repeating the calculation used above in the proof of Lemma 2, the number of active instances between ai and aj is upper bounded by
2 + 2 · log_{1−β}(f(aj)/f(ai)) ≤ 2 + 2 · ln(f(ai)/f(aj))/β = O(β⁻¹ · (log W + log m)),
where the inequality follows from the inequality ln(1 − β) ≤ −β, which
holds for every β ∈ (0, 1). Recalling that ai and aj are the first and
last active instances in A, respectively, the last bound implies that there
are at most O(β −1 · (log W + log m)) active instances in A, and thus, at
most 1 + O(β −1 · (log W + log m)) = O(β −1 · (log W + log m)) instances
in A in total.
(c) Assume, by way of a contradiction, that there exists an instance a of
ALG in the list A which has already received more than 2W +1 tokens,
and let am be the first marked instance that was created after a. Let
us denote by t and tm the tokens whose arrival caused Algorithm 6 to
create the instances a and am , respectively. Observe that the location of
tm in the stream can be at most W places after the location of t because
one out of every W instances of ALG created by Algorithm 6 is marked,
Figure 9.5. A graphical illustration of some properties of the instances of ALG used
in part (c) of Solution 6. The upper part of the figure represents the situation at the time
point in which the instance a became inactive. Since t is the token whose arrival triggered
the creation of a, this is also the time point in which t dropped out of the active window.
As the active window is of size W , at this time point there must exist a token tm in
the active window whose arrival triggered the creation of the marked instance am . The
lower part of the figure represents the situation that exists at the moment am becomes
inactive. One can observe that a receives at most 2W + 1 tokens up to this moment.
and the first marked instance created after the arrival of t becomes am .
Additionally, one can recall that am , as a marked instance, cannot
be deleted before it becomes inactive. Thus, it is guaranteed that am
becomes inactive, and it does so at the latest when a receives its (2W + 1)-th
token (see Figure 9.5 for a graphical explanation). Immediately when
this happens, a is removed from A because a and am are both inactive
instances and am appears later in A than a. However, this contradicts
our assumption that a has received more than 2W + 1 tokens.
Chapter 10
Introduction to Sublinear
Time Algorithms
Exercise 1 shows that getting a sublinear time algorithm for the above
problem will require us to give up hope of getting an exact answer.
Fortunately, as it turns out, once we give up this hope, there are very fast
algorithms for the problem. One such algorithm is given as Algorithm 1.
This algorithm gets as input a string w of length n and quality control
parameters ε, δ ∈ (0, 1).
Recall that we assume in this book that standard operations on
numbers use constant time. In particular, we will assume that sampling a
number from a range can be done in constant time. Given this assumption,
we get the following observation.
The probability that the sum Σ_{i=1}^h Xi exceeds its expectation by more than hε is at most
e^{−min{hε, hε²·n/#a}/3} ≤ e^{−hε²/3} ≤ e^{log(δ/2)} = δ/2,
and the probability that it falls below its expectation by more than hε is at most
e^{−(hε²/2)·(n/#a)} ≤ e^{−hε²/2} ≤ e^{log(δ/2)} = δ/2.
Combining all the above, and using the union bound, we get that with
probability at least 1 − δ, the sum Σ_{i=1}^h Xi is equal to its expectation
h·#a/n up to an error of hε. Observe now that the output of Algorithm 1
is equal to this sum times n/h. Thus, the expectation of this output is #a ,
and with probability at least 1 − δ, it is equal to this expectation up to an
error of ε · n.
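A minimal Python sketch of this sampling approach is given below. The sample size h = ⌈3 ln(2/δ)/ε²⌉ is one value that makes the two exponential bounds above at most δ/2 each; the exact constant used by Algorithm 1 may differ.

```python
import math
import random

def estimate_a_count(w, eps, delta):
    """Estimate the number of 'a' characters in the string w up to an additive error
    of eps * len(w), with probability at least 1 - delta."""
    n = len(w)
    h = math.ceil(3 * math.log(2 / delta) / eps ** 2)   # number of sampled positions
    hits = sum(1 for _ in range(h) if w[random.randrange(n)] == 'a')
    return hits * n / h
```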
Given the above discussion, it is clear that we must state explicitly the
assumptions that we make. Thus, we will assume from now on that the
distances in M have the following four properties. In the statement of these
properties, we denote by M (u, v) the distance from point u to point v.
• M(u, v) ≥ 0 for every two points u, v ∈ P (distances are non-negative).
• M(u, u) = 0 for every point u ∈ P.
• M(u, v) = M(v, u) for every two points u, v ∈ P (symmetry).
• M(u, v) ≤ M(u, w) + M(w, v) for every three points u, v, w ∈ P (the triangle inequality).
One can observe that standard Euclidean distances always obey these
properties. However, many other distances also obey them. For example, if
the points of P are the vertices of a connected graph, then the distances
between these vertices according to the graph obey all the above properties
(recall that the distance between vertices of a graph is the length of the
shortest path between them). In general, distances that obey the four above
properties are often called pseudometric.
Before presenting our algorithm for estimating the diameter, let us state
once more the problem that we want to solve. Given a set P of points and
a matrix M of distances between them which form a pseudometric, we are
interested in estimating the diameter of P . The algorithm we suggest for
this problem is given as Algorithm 2.
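A sketch of this algorithm, consistent with the way it is used and analyzed in Solution 3, appears below: a single point u is fixed and its largest distance to the other points is returned. By the triangle inequality, the true diameter is at least this value and at most twice this value.

```python
def approximate_diameter(M, P):
    """Estimate the diameter of the point set P under the pseudometric M.
    M(u, v) returns the distance between u and v; P is a list of points."""
    u = P[0]                              # an arbitrary point of P
    return max(M(u, v) for v in P)        # within a factor of 2 of the true diameter
```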
1 Readers who are familiar with the definition of metrics might observe that this definition is slightly weaker, since it does not require the distance between two distinct points to be strictly positive.
Exercise 4. Are there algorithms whose query complexity is larger than their
time complexity? Give an example of such an algorithm, or explain why such
an algorithm cannot exist.
Exercise Solutions
Solution 1
Assume by way of contradiction that there exists a deterministic algorithm
A that, given a string w, can exactly determine the number of “a” characters
in w using o(|w|) time. In particular, this means that there exists a large
enough n such that algorithm A determines the number of “a” characters
in strings of size n using less than n time. Note that this implies that, given
such an input string, Algorithm A does not read all of it.
Consider now an arbitrary string w of length n. By the above discussion,
there exists a character in w which is not read by A when w is its input
string. Let us denote the location of this character by i, and let w′ be a
string of length n which is equal to w in every location other than i and
contains the character “a” in location i if and only if w does not contain it in
this location. Clearly, w and w′ have a different number of “a” characters.
However, as A does not read location i of w when given w as input, it
will execute in exactly the same manner given either w or w′. Thus, A will
produce the same output given both inputs; which leads to a contradiction.
The result of this exercise can be extended to randomized algorithms
by fixing the random choices of the algorithm. In other words, consider
a randomized algorithm A which uses o(|w|) time and always determines
correctly the number of “a” characters in its input string. One can fix the
random choices of this algorithm as follows. Whenever A asks for a random
bit, we forward it a bit from a predetermined sequence of bits (which could
even be all zeros for the sake of this proof). This fixing of the random choices
converts A into a deterministic algorithm. The important observation now
is that the resulting deterministic version of A represents one possible
execution path of the original algorithm A, and thus, is also guaranteed to
use o(|w|) time and determine correctly the number of “a” characters in its
input string; which contradicts our previous result regarding deterministic
algorithms.
Solution 2
Consider an arbitrary deterministic algorithm A, and assume that it does
not read all of the matrix M when M contains only zeros. Let C be one of
the cells that A does not read when M contains only zeros, and consider
the execution of A on a matrix M′ which contains zeros in all cells other
than C and 1 in the cell C. We now need to make two observations, as
follows.
First, since A does not read the cell C given the matrix M , it will follow
the same execution path given M and M′, and thus, will produce the same
answer given both of them. The second observation is that M corresponds
to a diameter of 0, while M′ corresponds to a diameter of 1 (note that the
diameter is simply the maximum of the matrix entries), and thus, A fails
to distinguish between a diameter of 0 and a diameter of 1.
Solution 3
Figure 10.2 depicts a star over six vertices. We encourage you to use it as
a visual aid while reading this solution.
We begin the solution by observing that the diameter of a star is 2
since this is the distance between any two non-center vertices of the star.
Consider now the output of Algorithm 2 given that it selects the center
vertex of the star as u. Since the distance of the center vertex from any
other vertex of the star is 1, Algorithm 2 outputs in this case the value 1,
which underestimates the true diameter by a factor of 2.
Consider now an execution of Algorithm 2 in which it selects a non-
center vertex as u. Since the distance between any pair of non-center vertices
is 2, the algorithm outputs 2 in this case — i.e., its estimate for the distance
is completely accurate.
Solution 4
Reading a unit of information from the input requires a unit of time. Thus,
the query complexity of an algorithm is always a lower bound on its time
complexity, which implies that the query complexity can never be larger
than the time complexity.
Chapter 11
Property Testing
of the objects, and thus, reduces the number of objects on which we need
to execute the slower exact algorithm. In other cases, the answer of the
property testing algorithm might be useful enough to make the application
of an exact algorithm redundant altogether. Following are two such
cases:
• Some inputs, such as the graph representing the World Wide Web, change
constantly. For such inputs, it does not make much sense to distinguish
between objects that have the property and objects that roughly have the
property, since the input can easily alternate between these two states.
However, the distinction between having the property and being far from
having it is often still meaningful, as it is less likely for the input to switch
between these very distinct states.
• In real-world scenarios, inputs often either have the property or are far
from having it. When this happens, the output of a property testing
algorithm is as good as the output of an exact algorithm.
Later in this chapter we will see a few examples of property testing
algorithms. However, before we can do that, we need to have a more formal
definition of a property testing algorithm.
list of n numbers contains duplicates, we get that the distance between any
two lists of n numbers is the fraction of the positions in which the two
lists differ. For example, given the lists “1, 5, 7, 8” and “6, 5, 1, 8”, one
can observe that these lists differ in positions 1 and 3, which are half of
the positions, and thus, the distance between them is 1/2. Note that this
distance function always outputs a distance between 0 and 1 because the
fraction of the positions in which two lists differ is always a number between
0 and 1. Moreover, since a list does not differ from itself in any position,
the distance between a list and itself is always 0, as required.
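In code, this distance function is simply the fraction of differing positions; the short Python sketch below reproduces the example just given (the two lists “1, 5, 7, 8” and “6, 5, 1, 8” are at distance 1/2).

```python
def list_distance(a, b):
    """Distance in the list model: the fraction of positions in which the two
    equal-length lists differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

assert list_distance([1, 5, 7, 8], [6, 5, 1, 8]) == 0.5
```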
Exercise 1. Describe one natural way in which the problem of testing whether
a graph over n vertices is connected can be formalized.
Intuitively, this distance measures how far o is from having the property P ,
and it becomes 0 when o belongs to P . If the distance d(o, P ) of an element
o from P is at least ε, then we say that o is ε-far from having the property
and the algorithm for the property testing problem should output “No”. In
contrast, if the distance of o from P is smaller than ε, but o does not belong
to P (i.e., the object o is close to having the property, but does not really
have it), then the algorithm is allowed to output either “Yes” or “No”.
A randomized algorithm for a property testing problem is very similar
to a deterministic one, except that it is only required to output the
appropriate answer with probability at least 2/3. In other words, given
an object o ∈ P , the algorithm must output “Yes” with probability at least
2/3, and given an object o which is ε-far from having the property, the
algorithm must output “No” with probability at least 2/3.
One should note that the number 2/3 in the above definition
of randomized algorithms for property testing problems is arbitrary.
Exercise 2 shows that increasing it to any number smaller than 1 will not
make much of a difference.
Exercise 2. Show that given a property testing algorithm ALG and any value
δ ∈ (0, 1/3], one can create a new property testing algorithm ALG δ for the
same problem such that
(a) ALG δ outputs the appropriate answer with probability at least 1 − δ.
(b) The query complexity of ALG δ is larger than the query complexity of
ALG by only an O(log δ −1 ) factor.
Hint: ALG δ should make use of Θ (log δ −1 ) independent executions of ALG.
Proof. Since the list is free of duplicates, any sub-list of it must also be
free of duplicates. In particular, the sub-list S produced by Algorithm 1 is
free of duplicates, which makes the algorithm output “Yes”.
The other (and more complicated) case that we need to consider is the
case in which Algorithm 1 gets a list L which is ε-far from being free of
duplicates. We need to show that in this case Algorithm 1 outputs “No”
with probability at least 2/3. The first step toward proving this is to show
that the fact that L is ε-far from being free of duplicates implies that it
must have some structural property.
Proof. Since L is ε-far from being free of duplicates, one must change at
least εn positions in L to get a duplicates free list. Thus, there must be at
least εn positions in L which contain numbers that also appear in another
position in L. Let us denote the set of these positions by R. Consider now
an arbitrary number m which appears multiple times in L, and let Rm be
the subset of R containing the positions in which m appears in L. Since
|Rm | ≥ 2 by the definition of m, we can create at least ⌊|Rm |/2⌋ ≥ |Rm |/3
disjoint pairs from the positions in Rm . Repeating this for every number
m which appears multiple times in L, we are guaranteed to get at least
|R|/3 ≥ εn/3 disjoint pairs such that the two positions corresponding to
every pair contain the same number in L.
We would like to show that with a probability of at least 2/3, the two
positions of some pair of Q both appear in the list D. Let D1 be the prefix
of D of length √n, and let D2 be the rest of the list D. Additionally,
let Q1 be the subset of Q containing only pairs whose first positions appear
in D1 .
Lemma 2. With probability at least 5/6, |Q1 | ≥ ε·√n/12.
Moreover, since the Yi variables are independent, using the Chernoff bound,
we get
Pr[Σ_{i=1}^{|D1|} Yi < ε·√n/6] ≤ Pr[Σ_{i=1}^{|D1|} Yi < E[Σ_{i=1}^{|D1|} Yi]/2] ≤ e^{−(1/2)²·(ε·√n/3)/2} = e^{−ε·√n/24} ≤ 1/12,
where the last inequality holds when ε·√n is large enough (recall that we
only need to consider the case that ε·√n is larger than some large enough
constant c).
We have proved that with probability at least 11/12, the sum Σ_{i=1}^{|D1|} Yi
is at least ε·√n/6. Unfortunately, that is not good enough to prove the
lemma because D1 might contain repetitions, which implies that the first
position of some pairs in Q might appear multiple times in D1 . To solve
this, we need to show that D1 is not likely to contain many repetitions. Let
Zij be an indicator for the event that locations i and j in D1 contain the
same value. One can observe that the size of Q1 is lower bounded by the
expression
Σ_{i=1}^{|D1|} Yi − Σ_{i=1}^{|D1|} Σ_{j=i+1}^{|D1|} Zij .
We have already seen that the first term is usually large. Let us now show
that the second term is usually small. Clearly, Zij takes the value 1 with
probability 1/n because the locations of D1 are chosen independently and uniformly at random. Thus,
E[Σ_{i=1}^{|D1|} Σ_{j=i+1}^{|D1|} Zij] ≤ |D1|·(|D1| − 1)/(2n) ≤ (√n)²/(2n) = 1/2,
and by Markov’s inequality, the probability that this sum is larger than 12 is at most
(1/2)/12 = 1/24 ≤ 1/12.
By the union bound, we get that with probability at least 5/6 the sum
Σ_{i=1}^{|D1|} Yi is at least ε·√n/6 and the sum Σ_{i=1}^{|D1|} Σ_{j=i+1}^{|D1|} Zij is at most 12.
When that happens, the size of Q1 can be lower bounded by
Σ_{i=1}^{|D1|} Yi − Σ_{i=1}^{|D1|} Σ_{j=i+1}^{|D1|} Zij ≥ ε·√n/6 − 12 ≥ ε·√n/12,
where the last inequality holds when ε·√n is large enough.
Next, let us show that when |Q1 | is large, Algorithm 1 is likely to detect
a duplicate. Note that Algorithm 1 returns “No” whenever D2 contains the
second position of some pair in Q1 because this means that both positions
of this pair appear in D.
Lemma 3. When |Q1 | ≥ ε·√n/12, D2 contains the second position of
some pair in Q1 with probability at least 5/6.
1 − (1 − |Q1|/n)^{|D2|} ≥ 1 − (1 − ε/(12√n))^{22√n/ε} ≥ 1 − e^{−(ε/(12√n))·(22√n/ε)} = 1 − e^{−22/12} ≥ 5/6.
Combining Lemmata 2 and 3, we get that when L is ε-far from being free
of duplicates, Algorithm 1 outputs “No” with probability at least (5/6)² ≥
2/3. We summarize this in Theorem 1.
Theorem 1. Algorithm 1 is a property testing algorithm for testing
whether a list of length n is free of duplicates whose query complexity is
O(√n/ε).
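Algorithm 1 itself is not reproduced in this excerpt, but the analysis above suggests the following birthday-paradox style sketch: sample a list D of about √n + 22√n/ε uniformly random positions and answer “No” exactly when two distinct sampled positions hold the same number. The split of D into D1 and D2 matters only for the analysis, not for the execution, and the exact sample size is an assumption taken from the calculation above.

```python
import math
import random

def test_duplicate_free(L, eps):
    """One-sided tester: never answers "No" on a duplicate-free list."""
    n = len(L)
    d = math.ceil(math.sqrt(n) + 22 * math.sqrt(n) / eps)   # |D1| + |D2|
    seen = {}                                               # value -> a position where it was seen
    for _ in range(d):
        i = random.randrange(n)
        if L[i] in seen and seen[L[i]] != i:
            return "No"          # two different positions holding the same value
        seen[L[i]] = i
    return "Yes"
```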
By definition, a property testing algorithm is guaranteed to output
with probability at least 2/3 “Yes” given an object obeying the property
and “No” given an object which is ε-far from having the property. However,
this definition still allows the algorithm to make a mistake (i.e., produce
the wrong output) for every such input with probability at most 1/3.
We consider the mistake of outputting “No” given an object obeying the
property as one kind of mistake, and the mistake of outputting “Yes” given
an object which is ε-far from having the property as a second kind of
mistake. If the algorithm has a non-zero probability to make both kinds of
mistakes, then we say that it is a two-sided error algorithm. However, some
algorithms can only make one of the above two kinds of mistakes. Such
algorithms are called one-sided error algorithms.
11.3 The List Model and Testing Lists for Being Sorted
Recall that a property testing problem consists of three components: a
set N of possible input objects, a subset P of the objects that obey the
property and a distance function d. Intuitively, the set N and the distance
function d describe the world in which the problem resides, i.e., what kinds
of objects exist, and how is the distance between them defined. In contrast,
the subset P describes the property that we are looking for in objects of
this world (by including exactly the objects having this property). Due to
this intuitive point of view, the pair of N and d is often called the model
to which the problem belongs, and the subset P is often simply referred to
as the property of the problem.
Often there are many interesting properties that one might want to test
within the context of a single model (or world). For example, the problem
of testing a list of n numbers for being free of duplicates lives in a model in
which the set of objects is the set of all lists of n numbers and the distance
July 1, 2020 17:14 Algorithms for Big Data - 9in x 6in b3853-ch11 page 245
between two lists is the fraction of the n positions in which the two lists
differ. This model is often called the list model. Besides the property of being
free of duplicates, there are other properties that one might want to test in
this model. For example, one might be interested in testing in this model
the property of being a sorted list.
Exercise 4. Give an additional example of a natural property one might want
to test in the list model.
In the rest of this section, we study algorithms for testing the property
of being a sorted list in the list model. In other words, we are interested in a
property testing algorithm for testing whether a list of n numbers is sorted
(in a non-decreasing order). Exercise 5 demonstrates that this problem is
not trivial by showing that two very natural algorithms for the problem can
fail with a high probability to detect that a list is not sorted even when it
is very far from being sorted.
Exercise 5. For every one of the following two algorithms, describe a list
which is 1/2-far from being sorted, but the algorithm declares it to be sorted
with a probability of at least 1 − O(1/n).
(a) The algorithm that picks a uniformly random location i between 1 and
n − 1, and declares the list to be sorted if and only if the number at
location i is not larger than the number at location i + 1.
(b) The algorithm that picks a uniformly random pair of locations 1 ≤ i <
j ≤ n, and declares the list to be sorted if and only if the number at
location i is not larger than the number at location j.
given either vi or vj as its input. In both cases, the algorithm selects pivot
values, compares them to its input and acts according to the results of the
comparisons. As long as the comparison of the pivot to vi and vj gives the
same result, the algorithm continues with the same execution path given
either vi or vj . However, this cannot go on forever. To see why that is the
case, note that the fact that vi and vj are good values implies that binary
search manages to find them when it looks for them. Thus, vi is the sole
value remaining in the potential range of binary search until the search
terminates when vi is given as the input, and similarly vj is the sole value
that remains in the potential range until the search terminates when vj is
the input. Thus, at some point binary search must get a different answer
when comparing the pivot to vi or vj .
Let us denote the pivot for which the comparison resulted in a different
answer by p. Thus, either vi < p ≤ vj or vj < p ≤ vi , and let us assume
without loss of generality that the first option is true. When comparing vi
with p, binary search learns that vi is smaller than p, and thus, it removes
the second half of the potential range. However, we already know that vi
remains in the potential range until the end when vi is the input for the
binary search. Thus, vi must appear before p in the list. A similar argument
shows that either p = vj or vj appears after p in the list. Hence, we have
proved that the relative order in the list of vi and vj is consistent with their
order as numbers. Since this is true for every two good values, it proves the
lemma.
Corollary 1. Given a list which is ε-far from being sorted, Algorithm 2
returns “No” with probability at least ε.
Proof. Observe that a list of length n in which there is a subset of m
values that appear in the right order can be sorted by changing only the
other n − m values. Thus, a list which is ε-far from being sorted cannot
contain a subset of more than (1 − ε)n values that appear in the right
order. In particular, since Lemma 4 shows that the good values appear in
the right order, for such a list we must have |VG | ≤ (1 − ε)n. Recall now that
Algorithm 2 outputs “No” whenever the random position it picks contains
a value that does not belong to VG . Thus, the probability of this algorithm
to output “No” given a list which is ε-far from being sorted is at least
(n − |VG |)/n ≥ (n − (1 − ε)n)/n = ε.
Observation 3 and Corollary 1 together show that Algorithm 2 correctly
answers when its input list is sorted, and has a significant probability to
correctly answer when its input list is ε-far from being sorted. However,
to get a property testing algorithm, we need the last probability to be at
least 2/3. This can be achieved by executing Algorithm 2 multiple times,
and outputting that the list is sorted if and only if all the executions of
Algorithm 2 report this answer. A formal presentation of the algorithm we
get in this way is given as Algorithm 3.
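The following Python sketch combines the two algorithms as described above: a single iteration picks a random position and checks whether binary search for its value behaves consistently, and the tester repeats this h = ⌈2/ε⌉ times (a value matching the 1 − (1 − ε)^h calculation below). The sketch assumes distinct values; Solution 6 explains how to lift this assumption by comparing (value, index) pairs.

```python
import math
import random

def binary_search_reaches(L, i):
    """Check whether binary search for the value L[i] terminates at position i,
    i.e., whether L[i] is a 'good' value in the sense of the analysis above."""
    lo, hi = 0, len(L) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if mid == i:
            return True
        if L[mid] < L[i]:
            lo = mid + 1
        elif L[mid] > L[i]:
            hi = mid - 1
        else:
            return False   # an equal value at another position (handled by Solution 6's fix)
    return False

def test_sorted(L, eps):
    h = math.ceil(2 / eps)
    for _ in range(h):
        if not binary_search_reaches(L, random.randrange(len(L))):
            return "No"
    return "Yes"
```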
independent, the probability that at least one of the copies returns “No” is
at least
1 − (1 − ε)^h ≥ 1 − e^{−εh} ≥ 1 − e^{−ε·(2/ε)} = 1 − e^{−2} ≥ 2/3,
where the first inequality follows from the inequality 1 − x ≤ e^{−x}, which
holds for every real value x. Hence, given a list which is ε-far from being
sorted, Algorithm 3 detects that it is not sorted with probability at least
2/3; which completes the proof that it is a one-sided error property testing
algorithm.
It remains to bound the query complexity of Algorithm 3. Consider
first the query complexity of Algorithm 2. This algorithm uses one query to
read vi , and then makes some additional queries through the use of binary
search. However, binary search has a time complexity of O(log n), and
thus, its query complexity cannot be larger than that. Hence, we got that
the query complexity of Algorithm 2 is upper bounded by 1 + O(log n) =
O(log n). Let us consider now Algorithm 3. One can observe that it accesses
its input list only through the h copies of Algorithm 2 that it uses. Thus,
the query complexity of Algorithm 3 is upper bounded by h times the query
complexity of Algorithm 2, i.e., by h · O(log n) = O(ε⁻¹ · log n).
Figure 11.3. A few additional examples of images in the pixel model. The three images
on the left are half-planes, while the rightmost image is not a half-plane since no straight
line can separate the white and black pixels in it.
Figure 11.4. A graphical illustration of the terms “edge” and “end points”. In the
leftmost image, the top and bottom edges of the image are marked in black. Note that
these are simply the top and bottom rows of the image, respectively. In the second image
from the left, the left and right edges of the image are marked in black. Note that these
are the leftmost and rightmost columns of the image, respectively. Finally, in the third
and fourth images we mark in black the end points of the top and left edges, respectively
(the edges themselves are marked in gray). Note that the end points of an edge are its
two pixels that fall on the corners of the image.
Exercise 7. Show that an image having four edges with non-identical end
point colors is never a half-plane.
Figure 11.5. Consider three pixels w, b and m on an edge that have the following
properties: w is white, b is black and m is in the middle between w and b. There are two
possibilities for the color of m, however, this figure demonstrates that regardless of the
color of m, there exists a pair of pixels on the edge that have non-identical colors and
the distance between them is half the distance between w and b. (a) shows that the pair
of m and b has these properties when m is white, and (b) shows that the pair of m and
w has these properties when m is black.
The case that the image has exactly two edges with non-identical
end point colors remains to be considered. Consider one of these edges.
We already know that one of its end points is black, while the other one is
white. In other words, we have two pixels of the edge which have different
colors and the distance between them is n. Let us denote the black pixel
by b and the white pixel by w. If we now pick the middle pixel m between
b and w (or roughly the middle pixel if there is no pixel exactly in the
middle), then it must be either black or white. If it is white, then b and
m are two pixels on the edge which have different colors and the distance
between them is roughly half the distance between b and w. Similarly, if
m is black, then w and m are two pixels on the edge having the above
properties. Thus, regardless of the color of m, given two pixels b and w on
the edge which have different colors and a distance of n, we managed to
get a pair of pixels on the edge which have different colors and a distance
of about half of n (see Figure 11.5 for a graphical illustration). Repeating
this process O(log ε−1 ) times, we end up with a pair of pixels on the edge
which have different colors and a distance of at most max{εn/2, 1}.
Recall now that we assume that the image has two edges whose end
points have non-identical colors. Applying the above process to both these
edges, we get a pair (b1 , w1 ) of pixels on one edge and a pair (b2 , w2 ) of
pixels on the other edge, such that: the distance between the pixels in each
pair is at most max{εn/2, 1}, the pixels b1 and b2 are both black and the
pixels w1 and w2 are both white. We now connect b1 and b2 by a section
Sb , and also w1 and w2 by a section Sw . If the two sections intersect, then
clearly the image is not a half-plane. Thus, we may assume that the two
sections partition the image into three areas: an area W bounded by the section Sw , an area B bounded by the section Sb and an area R between the two sections.
Figure 11.6. The partitioning of an image into the three areas W , R and B based on
the locations of the pixels w1 , w2 , b1 and b2 .
See Figure 11.6 for a graphical illustration of the above three areas.
One can observe that since B is separated from the white pixels w1 and
w2 by a section ending with two black pixels, it cannot include any white
pixels unless the image is not a half-plane. Similarly, the image is not a
half-plane if W contains any black pixels. Thus, one can try to determine
whether the image is a half-plane by looking for pixels in B that are white
and pixels in W that are black. If such pixels can be found, then the image
is certainly not a half-plane. This idea is the basis of the algorithm we
suggest, which is given as Algorithm 4, for determining whether an image
having two edges with non-identical end point colors is a half-plane.
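The pseudocode of Algorithm 4 is not reproduced in this excerpt, but the sampling idea just described can be sketched as follows. The helpers pixel(x, y) and region_of(x, y) (which classifies a pixel into the areas W, R or B determined by the sections Sw and Sb) and the sample size h = ⌈2/ε⌉ are illustrative assumptions; pixels of R are simply ignored here, which is justified only when ε ≥ 2/n (see Exercise 9).

```python
import math
import random

def test_two_edge_halfplane(pixel, region_of, n, eps):
    """Answer "No" if a sampled pixel is white inside B or black inside W."""
    h = math.ceil(2 / eps)
    for _ in range(h):
        x, y = random.randrange(n), random.randrange(n)
        area = region_of(x, y)          # 'W', 'R' or 'B'
        color = pixel(x, y)             # 'white' or 'black'
        if (area == 'B' and color == 'white') or (area == 'W' and color == 'black'):
            return "No"
    return "Yes"
```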
Lemma 6. Assuming ε ≥ 2/n, given an image having two edges with non-
identical end point colors which is ε-far from being a half-plane, Algorithm 4
returns “No” with probability at least 2/3.
Exercise 9. Theorem 3 only works when ε is not too small. Explain why the
above analysis of Algorithm 4 fails when ε < 2/n, and suggest a modification
for the algorithm that will make it work also in this case.
Exercise Solutions
Solution 1
In the problem of testing whether a graph over n vertices is connected, the
set N of possible inputs is the set of all graphs over n vertices, and the set
P is the subset of the connected graphs in N . If we think of the graph as
represented by its adjacency matrix, then it is natural to define the distance
between two graphs as the fraction of the entries in this matrix in which
the two graphs differ (see Figure 11.7 for an example).
Graphs: two graphs on the vertex set {1, 2, 3, 4}.
Adjacency matrices:
0 1 0 1      0 1 1 0
1 0 1 0      1 0 0 1
0 1 0 1      1 0 0 1
1 0 1 0      0 1 1 0
Figure 11.7. Two graphs on 4 vertices and their corresponding adjacency matrices.
The two matrices differ in 8 cells, which are half of the cells of each matrix. Thus, the
distance between the two graphs is 1/2.
Solution 2
Consider the algorithm ALG δ given as Algorithm 5. As this algorithm
accesses the input object only by executing ALG on it h times, it is clear
that the query complexity of ALG δ is larger than the query complexity of
ALG by a factor of h = Θ(log δ −1 ). Thus, it remains to show that ALG δ
outputs the appropriate answer with probability at least 1 − δ.
Consider now the case that o is ε-far from having the property. In this case,
we need to show that ALG δ outputs “No” with probability at least 1 − δ,
which is equivalent to showing that the sum Σ_{i=1}^h Xi is at most h/2 with
that probability. Since ALG is a property testing algorithm, we get that in
this case each indicator Xi takes the value 1 with some probability p ≤ 1/3.
Using the Chernoff bound again, we get
Pr[Σ_{i=1}^h Xi > h/2] = Pr[Σ_{i=1}^h Xi > (1/(2p)) · ph] = Pr[Σ_{i=1}^h Xi > (1/(2p)) · E[Σ_{i=1}^h Xi]]
≤ e^{−min{1/(2p) − 1, (1/(2p) − 1)²} · ph/3}
= e^{−min{1, 1/(2p) − 1} · (1/2 − p)h/3}
≤ e^{−min{1, 1/(2·(1/3)) − 1} · (1/2 − 1/3)h/3}
= e^{−h/36} ≤ e^{log δ} = δ.
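For concreteness, the following Python sketch captures the structure of ALG δ: it runs the given tester h times and answers by majority vote. The constant 36 is the one appearing in the e^{−h/36} ≤ δ step above; any h = Θ(log δ⁻¹) of this form works.

```python
import math

def amplify(alg, obj, delta):
    """Run the property testing algorithm `alg` h times on `obj` and return the majority answer."""
    h = math.ceil(36 * math.log(1 / delta))
    yes_votes = sum(1 for _ in range(h) if alg(obj) == "Yes")
    return "Yes" if 2 * yes_votes > h else "No"
```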
Solution 3
As proved by Observation 2, given a list L which is free of duplicates,
Algorithm 1 always outputs “Yes”. In other words, it never makes the
mistake of outputting “No” given a list obeying the property, and thus, it
is a one-sided error algorithm.
Solution 4
Recall that the objects in the list model consist of all the lists of n numbers.
There are many natural properties of such lists that one might be interested
in testing. A very partial list of them includes the following:
Solution 5
(a) Assume for simplicity that n is even, and consider the list whose first
half consists of the integer numbers n/2 + 1 to n and its second half
consists of the integer numbers 1 to n/2 (for example, if n = 8, then we
get the list “5, 6, 7, 8, 1, 2, 3, 4”). To make this list sorted, one must
either change all the numbers in its first half, or all the numbers in its
second half, and thus, this list is 1/2-far from being sorted.
Consider now the algorithm described by part (a) of the exercise.
This algorithm picks a value i between 1 and n − 1, and then compares
the values of the list in locations i and i + 1. Given the above list, the
numbers in these locations will always be compatible with a sorted list,
unless i = n/2. Thus, the algorithm will declare the list to be sorted
with probability
Pr[i ≠ n/2] = 1 − 1/(n − 1) = 1 − O(1/n).
(b) Assume again that n is even, and consider the list in which, for every integer 1 ≤ k ≤ n/2, location 2k − 1 contains the number 2k and location 2k contains the number 2k − 1 (for example, if n = 8, then we get the list “2, 1, 4, 3, 6, 5, 8, 7”). Making this list
sorted requires changing at least n/2 values in it. Thus, the list is 1/2-far
from being sorted.
Consider now the algorithm described by part (b) of the exercise. This
algorithm picks two random locations i < j, and then compares the values
of the list in these locations. Given the above list, the numbers in locations
i and j will be consistent with a sorted list, unless i = 2k − 1 and j = 2k
for some integer 1 ≤ k ≤ n/2. Thus, the algorithm will declare the list to
be sorted with probability
There is no integer k such
Pr
that i = 2k − 1 and j = 2k
n/2
=1−
#possible i, j values such that 1 i < j n
n/2
=1− = 1 − O(1/n).
n(n − 1)/2
Solution 6
Consider a pre-processing algorithm that takes the input list for Algorithm
3 and replaces every number v in it with an ordered pair (v, i), where i is
the location of v in the original list. Let us define for such ordered pairs a
comparison according to the following lexicographic rule. A pair (v1 , i1 ) is
larger than a pair (v2 , i2 ) if v1 > v2 or v1 = v2 and i1 > i2 . Let us now
make a few observations as follows:
Solution 7
Consider an image whose four edges all have non-identical end-point colors.
In such an image, no two adjacent corners can have the same color, and
thus, it must have two black corners on one diagonal, and two white corners
on the other diagonal. Assume now, by way of contradiction, that this image
is a half-plane. Then, there must be a straight line separating the white and
black pixels in it. Let us study the relationship of this separating line with
the diagonals of the image. There are a few cases to consider as follows:
• The first case is that the separating line does not intersect the diagonals.
This case implies that all the corners of the image appear on one side
of the image, and thus, should have the same color; which leads to a
contradiction.
• The second case is that the separating line intersects one of the diagonals
at a point which is not a corner. In this case, the two corners of this
diagonal appear on different sides of the separating line, and should
have different colors; which again leads to a contradiction.
• The third case is that the separating line intersects the diagonals (and
therefore, also the image) only in the corners. In this case, the two or
three corners of the image that are not intersected by the separating
line appear on the same side of the separating line, and thus, should
have identical colors. This leads to a contradiction once more, because
each diagonal has at least one corner which is not intersected by the
separating line.
Solution 8
Consider a half-plane image having zero edges with non-identical end point
colors. All the corners of this image must have the same colors, which means
that they are all on the same side of the straight line separating the white
and black pixels of the image. Note that this implies that the entire image is
on one side of the separating line, and thus, the entire image must have one
color. In other words, as claimed by the hint, there are only two half-plane
images having zero edges with non-identical end point colors: the image
which is all white, and the image which is all black.
This last observation suggests the following simple algorithm for
determining whether an image having zero edges with non-identical end
point colors is a half-plane. The algorithm first selects a random subset
of pixels, and compares their colors with the color of the corners. If the
algorithm detects any pixel whose color differs from the color of the corners,
then it declares that the image is not a half-plane. However, if all the pixels
selected by the algorithm have the same color as the corners, then the
algorithm declares the image to be a half-plane. A more formal description
of this algorithm is given as Algorithm 6.
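A Python sketch of this simple algorithm is given below. The helper pixel(x, y) returning the color of a pixel and the sample size h = ⌈2/ε⌉ (the value used in the probability calculation of Lemma 7) are assumptions made for illustration.

```python
import math
import random

def test_uniform_corner_halfplane(pixel, n, eps):
    """Tester for images whose four corners all have the same color: compare h random
    pixels with the corner color and answer "No" if any of them differs."""
    corner_color = pixel(0, 0)
    h = math.ceil(2 / eps)
    for _ in range(h):
        x, y = random.randrange(n), random.randrange(n)
        if pixel(x, y) != corner_color:
            return "No"
    return "Yes"
```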
Let us prove that Algorithm 6 is a property testing algorithm with a
query complexity of O(ε−1 ). This follows from the next three claims.
Proof. Algorithm 6 queries exactly h pixels from the image, and thus,
its query complexity is O(h) = O(ε−1 ).
from the image, the color of this pixel will match the colors of the corners
of the image (and the color of every other pixel in the image). Hence, given
an image of the above type, the condition of the “if” statement on Line 4
of Algorithm 6 will never evaluate to “true”, which will make the algorithm
return “Yes”.
Lemma 7. Algorithm 6 returns “No” with probability at least 2/3 given
an image having zero edges with non-identical end point colors which is
ε-far from being a half-plane.
Proof. Consider an image having zero edges with non-identical end point
colors which is ε-far from being a half-plane and assume without loss of
generality that the corners of the image are white. We need to show that
Algorithm 6 returns “No” given this image with probability at least 2/3.
Toward this goal, we claim that there must be at least ε · n² black pixels in
the image. To see why that must be the case, note that if there were less
than ε · n² black pixels in the image, then it could not be ε-far from being a
half-plane, since it could be transformed into the all-white half-plane image
by changing the color of only these less than ε · n² black pixels.
Consider now a single random pixel picked by Algorithm 6. The
probability that this pixel is black is at least ε by the above discussion.
Since the random pixels picked by Algorithm 6 are independent, this means
that the probability that at least one of them is black is at least
1 − (1 − ε)^h ≥ 1 − e^{−εh} ≥ 1 − e^{−ε·(2/ε)} = 1 − e^{−2} ≥ 2/3.
The proof now completes by observing that Algorithm 6 outputs “No”
whenever one of the random pixels it picks is black since we assumed the
corners of the image are white.
Solution 9
One of the first steps of Algorithm 4 is finding two pairs of pixels (one pair
consists of b1 and w1 , and the other of b2 and w2 ) such that the pixels
in each pair have non-identical colors and the distance between them is
upper bounded by nε/2. Naturally, this cannot be done when ε < 2/n,
since this will require finding two disjoint pixels whose distance from each
other is less than 1. The best alternative is to find pairs of the above
kind such that the distance between the pixels in each pair is exactly 1.
Unfortunately, this might result in an area R containing Θ(n) pixels. To
see why that is problematic, one can note that, when ε < 2/n, a property
testing algorithm must detect that an image is not a half-plane even when
it can be made a half-plane by changing as few as εn² ≤ 2n pixels. Thus, in
the regime of ε < 2/n, one cannot ignore a part of the image of size Θ(n).
In particular, to make Algorithm 4 apply to this regime as well, we must
take into consideration the pixels of R in some way.
Assume from now on that we are in the regime ε < 2/n, and consider
the modified version of Algorithm 4 given as Algorithm 7. Algorithm 7
differs from Algorithm 4 in two things. First, it uses pairs (b1 , w1 ) and
(b2 , w2 ) whose distance is 1 rather than εn/2. As discussed above, this
leads to an area R containing O(n) pixels. The second modification is that
the algorithm constructs an image I which is black in the area B (of the
original image), white in the area W and agrees with the original image on
the area R. Constructing I requires the algorithm to read all the pixels of
R from the original image, however, in the regime ε < 2/n this can be done
without making the query complexity of the algorithm exceed O(ε−1 ).
Chapter 12
There are multiple natural representations for graphs. For example, a graph
can be represented by its adjacency matrix, or by storing the adjacency list
of each vertex. Traditionally, the designer of a graph algorithm is free to
assume that the algorithm’s input graph is given in any one of the standard
representations. This assumption often makes sense because the graph
can be easily converted from one representation to another in polynomial
(or often even linear) time. However, in the context of sublinear time
algorithms, the algorithm does not have enough time to convert the input
graph from one representation to another, and thus, the representation in
which the input graph is given can have a major impact on the things that
can be performed in sublinear time. For example, it might be possible to
determine in sublinear time whether a graph is connected when it is given
in one representation, but not when it is given in another representation.
The above issue implies that we should study sublinear time algorithms
for graph problems separately for every one of the standard graph
representations. In this chapter and in Chapter 13, we do this for two such
representations. In particular, in this chapter we study sublinear algorithms
that assume a representation of graphs which is appropriate for graphs with
a bounded degree, and in Chapter 13 we study a sublinear time algorithm
that assumes the adjacency matrix representation.
Figure 12.1. A graph, and its bounded degree representation for d = 3. Cells in the
arrays of neighbors which are empty are designated with the character “-”.
If we now replace nu(i) with n̂u(i) in Algorithm 1, then we get the algorithm
given as Algorithm 2.
The big advantage of Algorithm 2 over Algorithm 1 is that we can
implement it in sublinear time (which is even independent of the number
of nodes in G — see Observation 1). However, unlike Algorithm 1 whose
(the proof for this fact is identical to the proof of Exercise 2). Hence, the
expected output of Algorithm 2 is not necessarily equal to the number of
connected components in G. Exercise 3 solves this issue by proving that the
values of (12.1) and (12.2) do not differ by much, which implies that the
expected output of Algorithm 2 is always close to the number of connected
components in G (which we proved to be equal to (12.1)).
Exercise 3. Prove that 0 ≤ Σ_{u∈V} 1/n̂u − Σ_{u∈V} 1/nu ≤ εn/2.
Proof. Excluding the calculation of the terms n̂u(i) , one can observe
that the time complexity of Algorithm 2 is O(h) given our standard
assumption that arithmetic operations take Θ(1) time. According to the
above discussion, evaluating each one of the terms n̂u(i) requires O(d/ε)
time. Thus, the total time complexity of Algorithm 2 is given by
O(h) + h · O(d/ε) = O(hd/ε) = O((d · log δ⁻¹)/ε³).
= e^{−h²ε²/(12·E[Σ_{i=1}^h Yi])} ≤ e^{−hε²/12} ≤ e^{−12ε⁻²·log(2/δ)·ε²/12} = e^{log(δ/2)} = δ/2.
Similarly,
Pr[Σ_{i=1}^h Yi ≤ E[Σ_{i=1}^h Yi] − hε/2]
= Pr[Σ_{i=1}^h Yi ≤ (1 − hε/(2·E[Σ_{i=1}^h Yi])) · E[Σ_{i=1}^h Yi]]
≤ e^{−(hε/(2·E[Σ_{i=1}^h Yi]))²·E[Σ_{i=1}^h Yi]/2}
= e^{−h²ε²/(8·E[Σ_{i=1}^h Yi])} ≤ e^{−hε²/8} ≤ e^{−12ε⁻²·log(2/δ)·ε²/8} ≤ e^{log(δ/2)} = δ/2.
Combining the last two inequalities using the union bound, we see that
with probability at least 1 − δ the sum Σ_{i=1}^h Yi does not deviate from its
expectation by more than hε/2, which implies
Pr[|(n/h) · Σ_{i=1}^h Yi − E[(n/h) · Σ_{i=1}^h Yi]| ≤ nε/2] ≥ 1 − δ.
It now remains to recall that the expression (n/h) · Σ_{i=1}^h Yi is equal to
the output of Algorithm 2, and thus, its expectation is equal to (12.2).
The lemma now follows by combining this observation with the last
inequality.
Corollary 1. Algorithm 2 estimates the value of (12.1) up to an error of
εn with probability at least 1 − δ.
Proof. Observe that Exercise 3 implies that (12.2) always estimates
(12.1) up to an error of εn/2. The corollary now follows by combining this
observation with Lemma 1, i.e., with the fact that Algorithm 2 estimates
the value of (12.2) up to an error of εn/2 with probability at least 1 − δ.
Theorem 1 summarizes the properties that we have proved for
Algorithm 2.
Theorem 1. Algorithm 2 has a time complexity of O(dε−3 · log δ −1 ),
and with probability at least 1 − δ, it estimates the number of connected
components in G up to an error of εn.
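A Python sketch of Algorithm 2, in the spirit of the analysis above, is given below. The truncation of each BFS after ⌈2/ε⌉ visited vertices (which yields n̂u) and the sample size h = ⌈12 ln(2/δ)/ε²⌉ are choices consistent with the O(d/ε) evaluation cost and the Chernoff calculation of Lemma 1, not necessarily the book's exact constants.

```python
import math
import random
from collections import deque

def estimate_connected_components(adj, n, eps, delta):
    """Estimate the number of connected components of a bounded-degree graph with
    vertices 0, ..., n-1, where adj(v) returns the neighbors of v."""
    h = math.ceil(12 * math.log(2 / delta) / eps ** 2)   # number of sampled vertices
    cap = math.ceil(2 / eps)                             # BFS is truncated after cap vertices
    total = 0.0
    for _ in range(h):
        u = random.randrange(n)
        seen, queue = {u}, deque([u])
        while queue and len(seen) < cap:
            v = queue.popleft()
            for w in adj(v):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
                    if len(seen) >= cap:
                        break
        total += 1.0 / len(seen)                         # this is 1 / n-hat_u
    return (n / h) * total
```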
Theorem 1 shows that with a large probability, the error in the estima-
tion produced by Algorithm 2 is bounded by εn. Note that this error can be
very large compared to the estimated value when G contains few connected
components. One might hope to improve that by finding an algorithm that
estimates the number of connected components up to a relative error of ε,
but unfortunately, Exercise 4 shows that this cannot be done.
Exercise 4. Prove that no sublinear time algorithm can estimate the number
of connected components in a bounded degree graph up to a relative error
of 1/4 with probability at least 1/2 + ε for any constant ε > 0. Hint: Consider the
graph G1 which is a path of length n and the random graph G2 which is
obtained from G1 by removing a uniformly random edge from it. Show that
with a high probability, every sublinear time algorithm must output the same
value given either G1 or G2 .
throughout this section that the input graph is connected and that its edges
have integer weights between 1 and some positive integer w.
One of the standard algorithms for finding a minimum weight spanning
tree is the well-known algorithm of Kruskal. Pseudocode of this algorithm
is given as Algorithm 3.
Exercise 5 proves an important property of the tree constructed by
Kruskal’s algorithm. Let T be the tree produced by Kruskal’s algorithm
for the graph G, and let us denote by Ti the set of the edges of T whose
weight is at most i. Similarly, let us denote by Gi the subgraph obtained
from G by removing all edges whose weight exceeds i. Finally, given a graph
G, we denote by CC (G) the set of connected components of G.
Remark: It can be shown that the claim made by Exercise 5 is true for
every minimum weight spanning tree of G. However, this is not necessary
for our purposes.
Using Exercise 5, it is now possible for us to assign a formula for the
weight of T in terms of the number of connected components in various
subgraphs of G.
n − |CC(G1)| + Σ_{i=2}^w i · (|CC(Gi−1)| − |CC(Gi)|)
= n − w · |CC(Gw)| + Σ_{i=1}^{w−1} |CC(Gi)|
= n − w + Σ_{i=1}^{w−1} |CC(Gi)|,
where the last equality holds since Gw is equivalent to the original
graph G, and we assume that G is connected (i.e., has a single connected
component).
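The formula just derived immediately suggests how a weight estimate can be assembled from estimates of the numbers of connected components of G1, . . . , G_{w−1}. The sketch below assumes a helper graph_up_to_weight(i) giving access to Gi and any estimator cc_estimator (for example, the connected-components sketch shown earlier); how the error and failure parameters are split among the w − 1 calls is left as an assumption.

```python
def estimate_mst_weight(cc_estimator, graph_up_to_weight, n, w, eps_i, delta_i):
    """Estimate the weight of a minimum spanning tree via
    weight(T) = n - w + sum_{i=1}^{w-1} |CC(G_i)|."""
    total = n - w
    for i in range(1, w):
        total += cc_estimator(graph_up_to_weight(i), n, eps_i, delta_i)
    return total
```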
One can now observe that whenever these inequalities hold, we also have
\[
\left|\left(n - w + \sum_{i=1}^{w-1} C_i\right) - \left(n - w + \sum_{i=1}^{w-1}|CC(G_i)|\right)\right|
= \left|\sum_{i=1}^{w-1}\big(C_i - |CC(G_i)|\big)\right|
\leq \sum_{i=1}^{w-1}\big|C_i - |CC(G_i)|\big|
\leq (w-1)\cdot\frac{\varepsilon n}{w-1} = \varepsilon n,
\]
Figure 12.2. An example of a graph and a vertex cover of it. The black vertices form
a vertex cover of this graph because each edge hits at least one black vertex.
\[
\frac{d}{2^{i-1}}\cdot\frac{2^i}{d} = 2,
\]
where ALG(I) is the value of the solution produced by the algorithm ALG
given the instance I, and OPT (I) is the value of the optimal solution for
the instance I — i.e., the solution with the minimal value.
underlying the social network, and then looking up the community in which
the user we are interested in happens to be. While this approach is natural,
it suffers from a very significant disadvantage, namely, that getting the
social community of just a single user requires us to construct a partition
for all the users of the social network. If the social network is small, this
disadvantage is not too problematic, but if the social network is large, the
overhead of partitioning the entire network into social communities can
easily make this approach impractical. An alternative way to describe this
situation in somewhat more abstract terms is as follows. Our algorithm
creates a “global solution” by partitioning the entire social network into
social communities. Then, we discard most of this global solution and keep
only the part of it that we actually need, namely the social community that
includes the user we are interested in. A much more efficient approach
would be to somehow generate only the part of the global solution in
which we are interested, without going through the time-consuming task
of generating the entire global solution. This is exactly the objective that
local computation algorithms try to achieve.
Let us now consider a more specific example. Consider the problem of
finding a minimum weight spanning tree. A local computation algorithm
for this problem gets a graph G and a single edge e in it, and should output
whether e belongs to the minimum weight spanning tree of G (hopefully in
less time than what is necessary for calculating the entire minimum weight
spanning tree). One can observe that as long as G has a single minimum
weight spanning tree, the task faced by the local computation algorithm
1 This definition works for deterministic local computation algorithms, but a slightly more
involved definition can be used to also define randomized local computation algorithms.
already know that CA (G) is a vertex cover which is not much larger than
the minimum vertex cover, an estimate of its size is also an estimate for the
size of the minimum vertex cover. Lemma 4 quantifies the quality of the
estimate that Algorithm 9 gets for the size of CA (G). In this lemma, we
denote by |CA (G)| the number of nodes in the vertex cover CA (G).
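The estimator analyzed in Lemma 4 has a very simple structure: sample s uniformly random vertices, check for each one whether it belongs to CA(G), and scale the number of hits by n/s. The sketch below is our own rendering of this structure rather than the book's listing of Algorithm 9; the oracle in_cover is a hypothetical stand-in for the local simulation of the algorithm A.

    import random

    def estimate_cover_size(n, s, in_cover):
        # n        -- number of vertices (assumed to be 0, 1, ..., n - 1)
        # s        -- number of samples
        # in_cover -- oracle answering whether a given vertex belongs to C_A(G)
        hits = sum(1 for _ in range(s) if in_cover(random.randrange(n)))
        return (n / s) * hits

    # Toy usage with an artificial cover consisting of every third vertex:
    # estimate_cover_size(3000, 500, lambda u: u % 3 == 0) is typically close to 1000.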
Lemma 4. For every ε > 0, with probability at least 2/3, the output of
Algorithm 9 belongs to the range [0.5|CA (G)| − εn, 2|CA (G)| + εn].
\[
\frac{n}{s}\cdot\sum_{i=1}^{s} X_i, \qquad (12.3)
\]
Thus, to prove the lemma we need to show that (12.3) does not deviate from
its expectation by too much. Let us first prove that (12.3) is smaller than
0.5|CA(G)| − εn with probability at most 1/6. If |CA(G)| ≤ 2εn, then this is
trivial since (12.3) is always non-negative. Hence, it remains to consider the
case that |CA (G)| > 2εn. Since the value of each indicator Xi is determined
by the sample ui , and the samples u1 , u2 , . . . , us are independent, we can
use the Chernoff bound in this case to get
\[
\Pr\left[\frac{n}{s}\cdot\sum_{i=1}^{s} X_i \leq 0.5|C_A(G)| - \varepsilon n\right]
\leq \Pr\left[\sum_{i=1}^{s} X_i \leq \frac{s}{2n}\cdot|C_A(G)|\right]
= \Pr\left[\sum_{i=1}^{s} X_i \leq \frac{1}{2}\cdot E\left[\sum_{i=1}^{s} X_i\right]\right]
\leq e^{-E[\sum_{i=1}^{s} X_i]/8} = e^{-s\cdot|C_A(G)|/(8n)} \leq e^{-s\cdot(2\varepsilon n)/(8n)} = e^{-s\varepsilon/4} \leq e^{-3} < \frac{1}{6}.
\]
Next, we would like to prove that (12.3) is larger than 2|CA | + εn with
probability at most 1/6. Let Yi be an indicator that takes the value 1 with
probability max{ε/2, |CA(G)|/n}. Since this probability is an upper bound
The lemma now follows by applying the union bound to the two above
results.
Corollary 3. With probability at least 2/3, the output of Algorithm 9 is
at least OPT/2 − εn, where OPT is the size of the minimum vertex cover,
and at most (2 + 4 · log2 d) · OPT + εn.
Proof. The corollary follows immediately by combining Lemma 4 with
the fact that Algorithm 8 is a (1 + 2 · log2 d)-approximation algorithm by
Theorem 4, and thus, its global solution CA (G) obeys
Figure 12.3. Two bounded degree representations of the single graph (for the same
d = 3). Cells in the representations which are empty are designated with the character
“-”. Note that, despite the fact that both representations correspond to the same graph,
the distance between them is 7/12 because the two representations are different in 7 out
of the 12 cells each of them has.
Proof. We will prove the lemma by showing that a graph G with less than
εdn/2 connected components cannot be ε-far from being connected. Recall
that a graph containing h connected components can be made connected
by adding to it h − 1 edges. Thus, one might assume that any such graph
can be made connected by modifying at most 2(h−1) entries in its bounded
degree representation. As we will see later in this proof, this intuition works
when there are at least two empty entries in the lists of neighbors of the
vertices of every single connected component of the graph (because the
above-mentioned h − 1 edges can be added by modifying these empty
entries). However, things are more complicated when there are connected
components of the graph with less than 2 empty entries in their lists of
neighbors, and our proof will begin by showing how to avoid this case.
Assume that G is a graph with h connected components, and for every
connected component of G, let us say that it has the property Q if there
are at least two empty entries in the lists of neighbors of the vertices of
this connected component. Consider now a connected component C of G
that does not have the property Q, and let T be a spanning tree of C. The
fact that C does not have the property Q implies that it contains at least
two vertices, and thus, T must have at least two leaves: t1 and t2 . Let us
denote now by e1 an edge that hits t1 and does not belong to T (if such an
edge exists), and similarly, let us denote by e2 an edge that hits t2 and does
not belong to T (again, if such an edge exists) — see Figure 12.4(a) for a
graphical illustration of these edges. If neither e1 nor e2 exists, then each
of t1 and t2 has an empty entry in its list of neighbors, which contradicts
our choice of C as a connected component of G which does not have the
property Q. Thus, either e1 or e2 must exist, and let us assume without loss
of generality that e1 exists. We now observe that by removing e1 we can
make C have the property Q, and moreover, this removal will not affect the
connectivity of C because T is connected. Thus, we have found a way to
make C have the property Q by removing a single edge, which requires us to
change at most 2 entries. Repeating this process now for all the connected
components of G which do not have the property Q will result in a graph
G′ in which all the connected components have the property Q, and will
require at most 2h entry modifications.
Our next objective is to explain how G′ can be made connected by
changing at most 2h entries in it. Let us denote the connected components
Figure 12.4. Illustrations for the proof of Lemma 5. (a) describes a single connected
component of G, the tree T within it, the leaves t1 and t2 of this tree and the edges
e1 and e2 that hit these leaves and are not part of T . (b) describes three consecutive
connected components, Ci−1 , Ci and Ci+1 of G and the edges that are added to connect
them in G′.
Exercise 11. Lemma 5 proves that a bounded degree graph which is far from
being connected has many more connected components than a connected
graph (which has a single connected component). Thus, an algorithm
for estimating the number of connected components in a graph, such as
Algorithm 2, can be used to distinguish between such graphs. Show that this
idea leads to a property testing algorithm for determining whether a bounded
degree graph is connected whose time complexity is O(d−2 ε−3 ).
The case that n > 4/(εd) remains to be considered. In this case, the time
complexity of Algorithm 10 is dominated by the time required for the s
executions of BFS that it uses. As seen above, in general, a single BFS
execution can take time that depends on n, but fortunately, Algorithm 10
stops the BFS as soon as it finds 4/(εd) vertices. Since BFS scans only the
edges of vertices that it has already found, stopping the BFS in this fashion
guarantees that its time complexity is only O(4/(εd) · d) = O(ε−1 ). Thus,
the total time complexity of Algorithm 10 is
Exercise Solutions
Solution 1
Consider an arbitrary connected component C of G. We will show that
the nodes of C contribute exactly 1 to the expression (12.1), which proves
that (12.1) is equal to the number of connected components in G since C
is a generic connected component of G. The contribution of the nodes of C
to (12.1) is
\[
\sum_{u\in C}\frac{1}{n_u} = \sum_{u\in C}\frac{1}{|C|} = |C|\cdot\frac{1}{|C|} = 1.
\]
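The identity can also be checked numerically. The sketch below (our own illustration) computes n_u for every vertex u of a small graph and verifies that the sum of 1/n_u over all vertices equals the number of connected components.

    from collections import defaultdict

    def component_sizes(n, edges):
        # Returns, for every vertex u of the graph on vertices 0..n-1, the size n_u of
        # the connected component that contains u.
        adj = defaultdict(list)
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        size, seen = [0] * n, set()
        for start in range(n):
            if start in seen:
                continue
            component, stack = [], [start]
            while stack:
                x = stack.pop()
                if x not in seen:
                    seen.add(x)
                    component.append(x)
                    stack.extend(adj[x])
            for x in component:
                size[x] = len(component)
        return size

    # Two components: a triangle {0, 1, 2} and an edge {3, 4}.
    sizes = component_sizes(5, [(0, 1), (1, 2), (0, 2), (3, 4)])
    assert abs(sum(1 / s for s in sizes) - 2) < 1e-9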
Solution 2
For every 1 ≤ i ≤ h, let Y_i be a random variable equal to 1/n_{u(i)}. Since
u(i) is a uniformly random vertex of G, we get
\[
E[Y_i] = \frac{\sum_{u\in V} 1/n_u}{n}.
\]
Observe now that the output of Algorithm 1 can be written as \((n/h)\cdot\sum_{i=1}^{h} Y_i\).
Thus, by the linearity of expectation, the expected output of the
algorithm is equal to
\[
\frac{n}{h}\cdot\sum_{i=1}^{h} E[Y_i] = \frac{n}{h}\cdot\sum_{i=1}^{h}\frac{\sum_{u\in V} 1/n_u}{n} = \sum_{u\in V}\frac{1}{n_u}.
\]
This completes the proof since the rightmost side of the last equality is
equal to (12.1), which was already proved to be equal to the number of
connected components in G in Exercise 1.
Solution 3
Recall that n̂u = min{nu , 2/ε}. This equality implies that n̂u is never larger
than nu . Thus,
\[
\sum_{u\in V}\frac{1}{\hat n_u} \geq \sum_{u\in V}\frac{1}{n_u} \quad\Rightarrow\quad \sum_{u\in V}\frac{1}{\hat n_u} - \sum_{u\in V}\frac{1}{n_u} \geq 0.
\]
To prove the other inequality of the exercise, we note that whenever n̂u and
nu are not identical, n̂u is equal to 2/ε. Thus, 1/n̂u can be upper bounded
by 1/nu + ε/2. Summing up this observation over all the nodes of G,
we get
\[
\sum_{u\in V}\frac{1}{\hat n_u} - \sum_{u\in V}\frac{1}{n_u} \leq \left(\sum_{u\in V}\frac{1}{n_u} + \frac{n\varepsilon}{2}\right) - \sum_{u\in V}\frac{1}{n_u} = \frac{n\varepsilon}{2}.
\]
Solution 4
Following the hint, let us denote by G1 the path over n vertices, and let
G2 be a random graph obtained from G1 by removing a uniformly random
edge from it. Consider now an arbitrary sublinear time algorithm ALG for
estimating the number of connected components in a graph. We will prove
that with probability 1 − o(1), ALG produces the same output given either
G1 or G2 .
Assume first that ALG is a deterministic algorithm, and let EALG be
the edges of G1 that ALG reads when it gets G1 as its input. Consider the
execution path that ALG follows when it gets G1 as its input, and observe
that removing any edge of G1 outside of EALG cannot change this execution
path because such an edge is never read by ALG. Thus, ALG will follow the
same execution path given G1 or G2 (and will produce the same output)
whenever EALG does not include the single edge that is removed from G1
to get G2 . Since this edge is selected as a uniformly random edge of G1 , we
get that ALG produces the same output given G1 or G2 with probability
at least 1 − |E_ALG|/(n − 1). More formally, if we denote by ALG(G) the output
of ALG given a graph G, then
\[
\Pr[ALG(G_1) = ALG(G_2)] \geq 1 - \frac{|E_{ALG}|}{n-1}.
\]
Recall that the above holds when ALG is deterministic, and let us now
consider the case where ALG is randomized. An equivalent way to view
ALG is to assume that it gets a list of random bits, and that whenever
it needs to make a random decision, it reads the next bit from this list.
Given this point of view, executing ALG consists of two stages. In the first
stage, we create the list of random bits, and in the second stage, we run the
deterministic algorithm that ALG becomes given this list. In other words,
executing ALG (like any other randomized algorithm) is equivalent to
choosing a deterministic algorithm according to an appropriate distribution,
and then executing it. To make things more formal, let us denote by
A a random deterministic algorithm chosen according to the distribution
defining ALG, and observe that ALG(G) has the same distribution as A(G).
Let B be the set of algorithms that A has a positive probability to
be. Since every algorithm A ∈ B is deterministic, our above result for
deterministic algorithms applies to it. Thus, by the law of total probability,
we obtain
\[
\Pr[A(G_1) = A(G_2)] = \sum_{A'\in B}\Pr[A'(G_1) = A'(G_2)\mid A = A']\cdot\Pr[A = A']
\geq \sum_{A'\in B} E\left[1 - \frac{|E_{A'}|}{n-1}\,\middle|\, A = A'\right]\cdot\Pr[A = A']
= 1 - \frac{E[|E_{ALG}|]}{n-1} = 1 - o(1),
\]
where E_{A′} is the set of edges read by algorithm A′ when it gets G1 . Note
that the last equality holds since ALG reads all the edges of EALG , and
thus, its sublinear time complexity upper bounds the size of EALG . Let us
now denote by H2 the set of graphs that G2 has a positive probability to
be. Then, by using again the law of total probability, we get
\[
\sum_{G\in H_2}\Pr[G = G_2]\cdot\Pr[A(G_1) = A(G)] = \Pr[A(G_1) = A(G_2)] \geq 1 - o(1).
\]
Since the sum \(\sum_{G\in H_2}\Pr[G = G_2]\) is equal to 1, the last inequality implies
that there must be a graph G′ ∈ H_2 such that \(\Pr[A(G_1) = A(G')] \geq 1 - o(1)\).
Assume now, by way of contradiction, that ALG estimates the number
of connected components in its input graph up to a relative error of 1/4
with probability at least 1/2 + ε for some constant ε > 0. This means that
given G1 , ALG must output a value of no more than 5/4 with probability
at least 1/2 + ε because G1 has a single connected component. Similarly,
given G′, ALG must output a value of no less than 3/2 with probability at
least 1/2 + ε. Since 3/2 > 5/4, the two claims can be combined as follows:
\[
\Pr[A(G_1) = A(G')] \leq \Pr[A(G_1) = A(G') > 5/4] + \Pr[A(G_1) = A(G') < 3/2]
\leq \Pr[A(G_1) > 5/4] + \Pr[A(G') < 3/2]
= \Pr[ALG(G_1) > 5/4] + \Pr[ALG(G') < 3/2] \leq 2\cdot(1/2 - \varepsilon) = 1 - 2\varepsilon,
\]
which contradicts, for a constant ε > 0 and a sufficiently large n, the earlier conclusion that \(\Pr[A(G_1) = A(G')] \geq 1 - o(1)\).
Solution 5
By definition, Ti only contains edges of G whose weight is at most i, and
thus, all its edges appear in Gi . This implies that all the edges of Ti
appear within connected components of Gi . Thus, it is possible to count
the number of edges in Ti by simply counting the number of edges from
this set within every given connected component of Gi . Thus, let us fix
such a connected component C of Gi .
Since all the edges of Ti belong to the tree T , they cannot contain a
cycle. Thus, the number of such edges within C can be upper bounded by
|C| − 1, where |C| is the number of nodes in the connected component C.
Let us now assume by way of contradiction that this upper bound is not
tight, i.e., that Ti contains less than |C| − 1 edges inside the connected
component C. This implies that C can be partitioned into two subsets C1
and C2 of nodes such that there is no edge of Ti between these subsets.
However, since C is a connected component of Gi , there must be an edge
of weight at most i between C1 and C2 . Let us denote this edge by e (see
Figure 12.5 for a graphical illustration of the notation we use).
Let Te be the set T maintained by Kruskal’s algorithm immediately
before it considers the edge e. Since e has a weight of at most i, but does not
belong to Ti , we get that it does not belong to T either. By definition, the
fact that Kruskal’s algorithm did not add e to T implies that the addition
of e to Te creates a cycle. In other words, Te must already include a path
between the two endpoints of e. Recalling that e connects nodes of C1 and
C2 , this implies the existence of a path in Te between C1 and C2 . Observe
now that the edges of Te must all have a weight of at most i because e
has a weight of at most i (and Kruskal’s algorithm considers the edges in a
Figure 12.5. The connected component C of Gi . Figure (a) presents the case that
this connected component contains |C| − 1 edges of Ti , which implies that these edges
form a spanning tree of the component. Figure (b) presents the case that the component
contains fewer edges of Ti , and thus, it can be partitioned into two parts C1 and C2 such
that no edge of Ti connects them. However, since C is connected in Gi , there must
be an edge e of this graph between the two parts.
Solution 6
Algorithm 4 uses w − 1 executions of Algorithm 2. By Theorem 1, each
one of these executions uses O(d/ε3 · log δ −1 ) time when its input graph
Gi is pre-calculated. Unfortunately, pre-calculating Gi before executing
Algorithm 2 on it requires too much time. However, one can observe that
every time Algorithm 2 attempts to read some cell from the neighbors list of
a vertex u of Gi , we can calculate the content of this cell in constant time
as follows. If the corresponding cell in the representation of G is empty,
then the original cell of Gi should be empty as well. Otherwise, let us
assume that the corresponding cell in the representation of G contains the
neighbor v, and that the edge (u, v) has a weight of j in G. If j > i, then
the edge (u, v) is not a part of Gi , and thus, the original cell of Gi should
be empty. Otherwise, the edge (u, v) is part of Gi , and the original cell of
Gi should contain the same vertex v as the cell of G.
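In code, this lazy simulation amounts to wrapping the neighbor-list oracle of G. The following is a minimal sketch under the assumption that the representation of G is accessed through two hypothetical queries, neighbor(u, k) and weight(u, v); these names are ours, not the book's.

    def make_gi_oracle(neighbor, weight, i):
        # neighbor(u, k) -- content of the k-th cell of u's neighbor list in G (None if empty)
        # weight(u, v)   -- weight of the edge (u, v) in G
        # i              -- the weight threshold defining G_i
        def gi_neighbor(u, k):
            v = neighbor(u, k)
            if v is None or weight(u, v) > i:
                return None        # the cell is empty in G, or the edge is too heavy for G_i
            return v               # the edge (u, v) survives in G_i
        return gi_neighbor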
Summing up the above, we get that since every cell in the representation
of Gi can be calculated in constant time, executing Algorithm 2 should
take the same amount of time (up to constants) as executing this algorithm
on a pre-calculated input graph. Thus, each one of the executions of
Algorithm 2 used by Algorithm 4 takes O(d/ε^3 · log δ^{-1}) time. Since there
are w − 1 such executions, and Algorithm 4 uses only O(w) time for the
calculations it makes outside of these executions, we get that the total time
complexity of Algorithm 4 is
\[
O(w) + (w-1)\cdot O\!\left(\frac{d}{\varepsilon^3}\cdot\log\delta^{-1}\right) = O\!\left(\frac{dw}{\varepsilon^3}\cdot\log\delta^{-1}\right) = O\!\left(\frac{dw^4}{\varepsilon^3}\cdot\log\frac{w}{\delta}\right).
\]
Solution 7
Let ε and δ be two values from (0, 1], and let us assume without loss of
generality that ε−1 is an integer (otherwise, we can replace ε with a value
that obeys this property and belongs to the range [ε/2, ε]). Following the
hint, we would like to round down the weight of every edge to an integer
multiple of ε/2. Then, we can make all the weights integral by multiplying
them with 2/ε. Combining the two steps, we transform a weight i of an
edge into a weight ⌊2i/ε⌋. Let us denote by G′ the weighted graph obtained
from G by this weight transformation.
We now execute Algorithm 4 on G′ with ε′ = 1 and δ′ = δ. Since the
weights of the edges in G′ are integers between 2/ε ≥ 1 and 2w/ε, Theorem
2 guarantees that Algorithm 4 executes in time O(dw^4/ε^4 · log(w/(εδ)))
when G′ is pre-calculated, and moreover, with probability at least 1 − δ it
estimates the weight of the minimum weight spanning tree of G′ up to an
error of ε′n = n. Lemma 7 allows us to translate estimates on the weight
of the minimum weight spanning tree of G′ to estimates on the weight of
the minimum weight spanning tree of G.
which implies
\[
\frac{\varepsilon W}{2} - v(T) \leq \frac{\varepsilon W}{2} - \frac{\varepsilon}{2}\cdot v(T') = \frac{\varepsilon}{2}\cdot[W - v(T')] \leq \frac{\varepsilon}{2}\cdot n = \frac{\varepsilon n}{2}.
\]
Similarly, T′ is a spanning tree of G, and its weight there is at most
εv(T′)/2 + εn/2 (where the term εn/2 is due to the rounding in the definition
of the weights of G′ and the observation that T′ has at most n edges). Hence,
v(T) ≤ εv(T′)/2 + εn/2, which implies
\[
\frac{\varepsilon W}{2} - v(T) \geq \frac{\varepsilon W}{2} - \frac{\varepsilon}{2}\cdot v(T') - \frac{\varepsilon n}{2} = \frac{\varepsilon}{2}\cdot[W - v(T')] - \frac{\varepsilon n}{2} \geq \frac{\varepsilon}{2}\cdot(-n) - \frac{\varepsilon n}{2} = -\varepsilon n.
\]
The above discussion and the last lemma show that the algorithm given
as Algorithm 11 estimates the weight of the minimum weight spanning tree
of G up to an error of εn with probability at least 1 − δ. To determine
the time complexity of Algorithm 11, we observe that it requires only O(1)
time in addition to the time required for executing Algorithm 4 on G′.
Thus, it is enough to bound the last time complexity. When G′ is pre-
calculated, we saw above that the execution of Algorithm 4 takes only
O(dw^4/ε^4 · log(w/(εδ))) time. Unfortunately, calculating G′ takes too much
time, but we can generate in constant time every weight or neighbors list
entry of G′ that Algorithm 4 tries to read, which is as good as having a
pre-calculated copy of G′. Hence, it is possible to implement Algorithm 11
using a time complexity of O(dw^4/ε^4 · log(w/(εδ))).
Solution 8
(a) Assume by way of contradiction that Algorithm 7 produced an illegal
coloring, namely, there is an edge e whose end points u and v have the
same final color c. Then, one of two things must have happened. Either
u and v got the same final color c at the same time, or the end point
u obtained its final color c before v (the reverse can also happen, but
we can ignore it due to symmetry). Let us show that neither of these
cases can happen.
Consider first the case that u and v got the same final color c at the
same time. In the iteration of Algorithm 7 in which this has happened,
both u and v must have picked this color c and then each one of them
has reported to the other one about choosing c. Thus, both of them
were informed that one of their neighbors has also chosen the color c,
and this should have prevented them from finalizing c. Thus, we get a
contradiction.
Consider now the second case in which the end point u of e has chosen
a final color c, and then the other end point v decided to choose the
same final color c at a later time. Consider the iteration of Algorithm 7
in which u has finalized its color. At the end of this iteration, u informs
all its neighbors (including v) that it has finalized its color c, and
consequently, these neighbors deleted c from their corresponding color
lists C. Since a vertex picks colors only from its list C of colors, this
means that v could not pick the color c at any time after u has finalized
its choice of color, which leads us to a contradiction again.
(b) Every iteration of Algorithm 7 involves two rounds of communication.
In the first round, every vertex informs its neighbors about the color it
has picked from its list of colors C, and in the second round every vertex
informs its neighbors whether it has finalized the color it has picked.
Thus, we need to show that with probability at least 1−1/n Algorithm 7
terminates after 2 · log2 n iterations. Fix an arbitrary vertex u of
the input graph, and consider an arbitrary iteration of Algorithm 7
which starts before u has finalized its color. In this iteration, u picks
an arbitrary color cu , and this color is finalized if it is different from
the colors picked by the neighbors of u that did not have their colors
finalized as yet. If we denote by Nu the set of these neighbors of
u (i.e., the neighbors of u that have not yet finalized their colors), by cv the color of a
neighbor v ∈ Nu and by Cu the list of colors that u has at the current
iteration, then the probability that u finalizes its color at the end of this
iteration is
\[
\Pr[c_u \neq c_v\ \ \forall v\in N_u] = \Pr[c_u\notin\{c_v \mid v\in N_u\}] \geq 1 - \frac{|N_u|}{|C_u|}.
\]
The important observation now is that every time that a neighbor of u
finalizes its color, Nu loses a single vertex and Cu loses at most a single
color. Since the original size of Nu is the degree of u, which is at most
d, and the original size of Cu is 2d, this implies that
\[
1 - \frac{|N_u|}{|C_u|} \geq 1 - \frac{d - \#\{\text{neighbors of } u \text{ which have finalized their colors}\}}{2d - \#\{\text{neighbors of } u \text{ which have finalized their colors}\}} \geq 1 - \frac{d}{2d} = \frac{1}{2}.
\]
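For concreteness, the process analyzed above can be simulated sequentially as follows. This is our own sketch of the iteration structure described in this solution (each vertex keeps a palette of 2d colors, picks a tentative color, and finalizes it only if no still-active neighbor picked the same color), not the book's listing of Algorithm 7.

    import random

    def simulate_coloring(adj, d):
        # adj maps every vertex to the list of its neighbors; d bounds the maximum degree.
        colors = {u: None for u in adj}                     # final colors
        palette = {u: set(range(2 * d)) for u in adj}       # the list C of every vertex
        while any(c is None for c in colors.values()):
            # Round 1: every still-active vertex picks a tentative color from its list.
            pick = {u: random.choice(sorted(palette[u]))
                    for u in adj if colors[u] is None}
            # Round 2: a vertex finalizes its pick if no active neighbor picked the same color.
            for u, c in pick.items():
                if all(pick.get(v) != c for v in adj[u]):
                    colors[u] = c
                    for v in adj[u]:                        # neighbors delete the finalized color
                        palette[v].discard(c)
        return colors

    # Example: a path on three vertices with d = 2.
    # simulate_coloring({0: [1], 1: [0, 2], 2: [1]}, 2)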
Solution 9
The subgraph G′(u) constructed by Algorithm 8 contains all the nodes of
G whose distance from u is at most log2 d + 1. Thus, recalling that the
degree of every node in G is upper bounded by d, we can upper bound the
number of nodes in G′(u) by
\[
\sum_{i=0}^{\log_2 d+1} d^i \leq (\log_2 d + 2)\cdot d^{\log_2 d+1} = O(d)^{\log_2 d+2}.
\]
Since both the construction of G′(u) and the execution of Algorithm 6 for
every one of its nodes in parallel require O(d)^{O(log d)} time, this is also the
time complexity of Algorithm 8.
Solution 10
We need to show that, for every ε > 0 and c ∈ (1, 2], when s is set to
6c/[(c−1)^2·ε], then with probability at least 2/3 the output of Algorithm 9
belongs to the range [|CA (G)|/c − εn, c · |CA (G)| + εn]. The proof of this
fact is very similar to the proof of Lemma 4, but we write it fully here for
completeness.
Let Xi be an indicator for the event that ui belongs to CA (G). Recall
that the output of Algorithm 9 is equal to
\[
\frac{n}{s}\cdot\sum_{i=1}^{s} X_i, \qquad (12.4)
\]
We now need to show that (12.4) does not deviate from its expectation by
too much. Let us first prove that (12.4) is smaller than |CA |/c − εn with
probability at most 1/6. If |CA(G)| ≤ c · εn, then this is trivial since (12.4)
is always non-negative. Hence, the case that |CA (G)| > c · εn remains to
be considered. Since the value of each indicator Xi is determined by the
sample ui , and the samples u1 , u2 , . . . , us are independent, we can use the
Chernoff bound in this case to get
\[
\Pr\left[\frac{n}{s}\cdot\sum_{i=1}^{s} X_i \leq \frac{|C_A(G)|}{c} - \varepsilon n\right]
\leq \Pr\left[\sum_{i=1}^{s} X_i \leq \frac{s}{cn}\cdot|C_A(G)|\right]
= \Pr\left[\sum_{i=1}^{s} X_i \leq \frac{1}{c}\cdot E\left[\sum_{i=1}^{s} X_i\right]\right]
\leq e^{-(1-1/c)^2\cdot E[\sum_{i=1}^{s} X_i]/2} = e^{-(1-1/c)^2\cdot s\cdot|C_A(G)|/(2n)}
\leq e^{-s(1-1/c)^2\cdot(c\varepsilon n)/(2n)} = e^{-s(1-1/c)^2\cdot(c\varepsilon)/2} \leq e^{-3} < \frac{1}{6}.
\]
Next, we would like to prove that (12.4) is larger than c|CA | + εn with
probability at most 1/6. Let Yi be an indicator that takes the value 1 with
probability max{ε/c, |CA (G)|/n}. Since this probability is an upper bound
on the probability that Xi takes the value 1, we get
\[
\Pr\left[\frac{n}{s}\cdot\sum_{i=1}^{s} X_i \geq c\cdot|C_A(G)| + \varepsilon n\right]
= \Pr\left[\sum_{i=1}^{s} X_i \geq \frac{cs}{n}\cdot|C_A(G)| + s\varepsilon\right]
\leq \Pr\left[\sum_{i=1}^{s} Y_i \geq \frac{cs}{n}\cdot|C_A(G)| + s\varepsilon\right]
\leq \Pr\left[\sum_{i=1}^{s} Y_i \geq c\cdot E\left[\sum_{i=1}^{s} Y_i\right]\right]
\leq e^{-(c-1)^2\cdot E[\sum_{i=1}^{s} Y_i]/3} \leq e^{-(c-1)^2\cdot s\varepsilon/(3c)} \leq e^{-2} < \frac{1}{6}.
\]
The claim that we need to prove now follows by applying the union bound
to the two above results.
Solution 11
First, we note that when ε ≤ 8/(dn), the time complexity of O(d^{-2}ε^{-3}) allowed
by the exercise for determining whether a bounded degree graph is connected
is at least Ω(dn^3). It is easy to see that this time budget is
larger than the time complexity of BFS, and thus, the problem is trivial.
Hence, in the rest of the solution we assume ε > 8/(dn).
Proof. Since G is ε-far from being connected, it must have at least εdn/2
connected components by Lemma 5. Thus, with probability at least 2/3
the estimate c computed by Algorithm 12 for the number of connected
components of G will be at least εdn/2 − εdn/8 = 3εdn/8 > εdn/4.
Thus, Algorithm 12 will declare G to be unconnected with probability
at least 2/3.
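The proof above fixes the structure of Algorithm 12: estimate the number of connected components up to an additive error of εdn/8 (with constant confidence, for example using Algorithm 2), and declare the graph disconnected exactly when the estimate exceeds εdn/4. The sketch below assumes this reading; the estimator itself is passed in as a parameter.

    def test_connectivity(n, d, eps, estimate_components):
        # estimate_components(error) should return, with probability at least 2/3,
        # a value within +/- error of the true number of connected components.
        estimate = estimate_components(eps * d * n / 8)
        return "connected" if estimate <= eps * d * n / 4 else "not connected"

    # A connected graph has one component, so its estimate is at most 1 + eps*d*n/8,
    # which stays below the threshold whenever eps >= 8/(dn); a graph that is eps-far from
    # being connected has at least eps*d*n/2 components and is rejected with probability 2/3.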
Solution 12
The objective of Algorithm 2 is to estimate the number of connected
components in the graph. To achieve this goal, the algorithm executes
BFS from random vertices of the graph and then averages (in some
sophisticated way) the number of vertices found by these BFS executions.
To make the result of the algorithm trustworthy, the algorithm must use
enough executions to guarantee that the average it calculates is close to its
expectation, which requires many executions.
In contrast, for Algorithm 10 to detect that a graph is not connected,
it is enough for it to have one BFS execution which finds a small number of
vertices. In particular, it is not necessary for the algorithm to have enough
BFS executions so that the fraction of the BFS executions which find a
small number of vertices is close to its expectation. Thus, the algorithm
can work with a significantly smaller number of BFS executions.
Chapter 13
13.1 Model
We begin this chapter by defining the adjacency matrix and the property
testing model we assume. Recall that the adjacency matrix of an (undi-
rected) graph is a matrix which has a single column and a single row for
every vertex of the graph. Each cell of this matrix is an indicator for the
existence of an edge between the vertices corresponding to its column and
row. In other words, the cell in the row corresponding to vertex u and the
column corresponding to vertex v contains the value 1 if and only if there is
an edge between u and v (see Figure 13.1 for an example). We would like to
draw your attention to two properties of the adjacency matrix, as follows:
• The main diagonal of the matrix corresponds to edges between a vertex
and itself, i.e., self-loops. When the graph is simple, which we assume
throughout this chapter, then this diagonal contains only zeros.
• For every pair of vertices u and v, there are two cells of the matrix
which indicate the existence of an edge between u and v: the cell at the
row corresponding to u and the column corresponding to v, and the
        u   v   s   w
    u   0   1   1   0
    v   1   0   1   1
    s   1   1   0   0
    w   0   1   0   0
Figure 13.1. A graph and its adjacency matrix representation. Note that the matrix
is symmetric (because the graph is not directed) and its main diagonal contains only
zero entries (because there are no self-loops in the graph).
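As a quick illustration of this representation and of the distance measure used in the dense graph model, the sketch below (the helper names are ours) builds the adjacency matrix of the graph from Figure 13.1 and measures the fraction of cells in which two matrices differ.

    def adjacency_matrix(vertices, edges):
        # Builds the 0/1 adjacency matrix of an undirected simple graph.
        index = {v: i for i, v in enumerate(vertices)}
        n = len(vertices)
        matrix = [[0] * n for _ in range(n)]
        for u, v in edges:
            matrix[index[u]][index[v]] = 1
            matrix[index[v]][index[u]] = 1    # the matrix is symmetric
        return matrix

    def distance(m1, m2):
        # Fraction of the n^2 cells in which the two matrices differ.
        n = len(m1)
        differing = sum(m1[i][j] != m2[i][j] for i in range(n) for j in range(n))
        return differing / (n * n)

    # The graph of Figure 13.1.
    m = adjacency_matrix(['u', 'v', 's', 'w'],
                         [('u', 'v'), ('u', 's'), ('v', 's'), ('v', 'w')])
    assert m[0] == [0, 1, 1, 0]

    m2 = [row[:] for row in m]
    m2[0][1] = m2[1][0] = 0                   # delete the edge between u and v
    assert distance(m, m2) == 2 / 16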
Exercise 1. Show that no graph is ε-far from being connected in the dense
graph model unless ε < 2n−1 .
\[
2\cdot|E'| < 2\cdot\frac{\varepsilon n^2}{2} = \varepsilon n^2.
\]
Intuitively, the last observation shows that when G is ε-far from being
bipartite, every partition of it is violated by many edges. This means that for
every given partition there should be a high probability that the subgraph
G[V ] selected by Algorithm 1 contains at least a single edge violating this
partition. More formally, given a partition of V into two disjoint sets V1
and V2 , let us denote by A(V1 , V2 ) the event that one of the edges violating
this partition appears in G[V ]. Then, the above intuitive argument claims
that A(V1 , V2 ) should be a high probability event for every given partition
of V .
The last exercise implies that our objective to prove that G[V ] is
unlikely to be bipartite when G is far from being bipartite is equivalent to
proving that, with a significant probability, the event A(V1 , V2 ) occurs for
all possible partitions of V for such graphs G. Since we already argued that
intuitively A(V1 , V2 ) is a high probability event, it is natural to lower bound
the probability of the event that A(V1 , V2 ) occurs for all possible partitions
of V using the union bound. As a first step, let us find a lower bound on
the probability that the event A(V1 , V2 ) occurs for a given partition.
\[
\Pr\left[\forall_{1\leq i\leq \lfloor s/2\rfloor}\ u_{2i-1}u_{2i}\notin E'\right] = \prod_{i=1}^{\lfloor s/2\rfloor}\Pr[u_{2i-1}u_{2i}\notin E'] = \prod_{i=1}^{\lfloor s/2\rfloor}\left(1 - \frac{2|E'|}{n^2}\right)
\leq \prod_{i=1}^{\lfloor s/2\rfloor}(1-\varepsilon) \leq e^{-\varepsilon\lfloor s/2\rfloor} \leq e^{-\varepsilon s/3},
\]
where the first equality holds since the vertices u_1, u_2, \ldots, u_s are independent of each other and the last inequality is true for any s ≥ 2.
As planned, we can now use the union bound to lower bound the
probability that the event A(V1 , V2 ) happens for all possible partitions.
Since there are 2^n − 2 ≤ 2^n possible partitions, and for every given partition
the event A(V1 , V2 ) happens with probability at least 1−e−εs/3 by Lemma 1,
we get that the probability that the event A(V1 , V2 ) happens for all possible
partitions is at least
to all possible partitions of V fails because there are just too many such
partitions.
Lemma 2. Given our assumption that there exists an edge from every
vertex of V \U to a vertex of U, there exists a partition (W1 , W2 ) of W such
that no edge of G[V ] violates (U1 ∪ W1 , U2 ∪ W2 ) if and only if the event
A(U1 ∪ N (U2 ), V \(U1 ∪ N (U2 ))) does not occur.
Proof. We begin by proving the simpler direction of the lemma, i.e., we
assume that (U1 ∪ W1 , U2 ∪ W2 ) is violated by an edge of G[V ] regardless
of the partition (W1 , W2 ) of W chosen, and we show that this implies that
event A(U1 ∪ N (U2 ), V \(U1 ∪ N (U2 ))) occurred. Let us choose W1 = W ∩
N (U2 ), then our assumption that (U1 ∪ W1 , U2 ∪ W2 ) is always violated
by an edge of G[V ] implies that there exists an edge e of G[V ] violating
the partition (U1 ∪ (W ∩ N (U2 )), U2 ∪ (W \N (U2 ))) of V . Recall that this
means that the edge e goes between two vertices belonging to the same
side of this partition, and thus, it must also violate the partition (U1 ∪
N (U2 ), V \(U1 ∪ N (U2 ))) of V because every side of this partition includes
one side of the previous partition. However, the last observation implies
that the event A(U1 ∪ N (U2 ), V \(U1 ∪ N (U2 ))) occurred by definition.
We now need to prove the other direction of the lemma, i.e., we assume
that the event A(U1 ∪N (U2 ), V \(U1 ∪N (U2 ))) occurred and prove that this
implies that (U1 ∪ W1 , U2 ∪ W2 ) is violated by an edge of G[V ] regardless
of the partition (W1 , W2 ) of W chosen. The event A(U1 ∪ N (U2 ), V \(U1 ∪
N (U2 ))) implies that there is an edge e in G[V ] violating its partition. The
edge e must be within one of the two sides of this partition, which gives us
a few cases to consider.
• The first case is that e connects two vertices of U1 . In this case, e clearly
violates (U1 ∪ W1 , U2 ∪ W2 ).
• The second case we consider is that e connects a vertex u1 ∈ U1 with
a vertex v ∈ N (U2 ). By the definition of N (U2 ) there must be an
edge between v and some vertex u2 ∈ U2 . Thus, G[V ] contains a path
of length two between a vertex u1 on one side of the partition (U1 ∪
W1 , U2 ∪ W2 ) and a vertex u2 on the other side of this partition (see
Figure 13.2(a)). Regardless of the side of the partition in which we
place the middle vertex v of this path, we are guaranteed that one of
the edges of the path will be within a single side of the partition, i.e.,
will violate the partition.
• The third case we need to consider is that e connects two vertices v1
and v2 of N (U2 ). By the definition of N (U2 ), there must be an edge
connecting v1 to a vertex u1 ∈ U2 and an edge connecting v2 to a vertex
u2 ∈ U2 (possibly u1 = u2 ). Thus, G[V ] contains a path of length 3 that
Corollary 1. Given our assumption that there exists an edge from every
vertex of V \U to a vertex of U, G[V ] is bipartite if and only if there is a
partition (U1 , U2 ) of U such that the event A(U1 ∪ N (U2 ), V \(U1 ∪ N (U2 )))
does not occur.
partition (U1 , U2 ) of U such that the event A(U1 ∪ N (U2 ), V \(U1 ∪ N (U2 )))
does not occur, which completes the proof of the corollary.
Recall that in Section 13.2 we tried to prove that G[V ] is unlikely to
be bipartite when G is ε-far from being bipartite by showing that for such
a graph G there is a significant probability that the event A(V1 , V2 ) holds
for every partition of V . Despite the fact that the event A(V1 , V2 ) holds for
every given partition of V with high probability, we failed to prove that it
holds for all the partitions at the same time with a significant probability
because we had to consider all the possible partitions of V , and there were
too many of these. Corollary 1 shows that (under our assumption) it is in
fact sufficient to consider only a single partition of V for every partition of
U . Since the number of such partitions is much smaller than the number
of all possible partitions of V , we can use Corollary 1 to prove that G[V ]
is unlikely to be bipartite when G is ε-far from being bipartite. However,
before we can do that, we first need to deal with the technical issue that
the partitions we now consider are defined by U , and thus, we need to show
that the events corresponding to them happen with high probability even
given a fixed choice of U . This is done by Exercise 4.
the probability that the above event happens for all the possible partitions
of U at the same time is at least
Proof. In this proof, we pretend that the algorithm selects the set U
by repeatedly picking a uniformly random vertex of G and adding it to
U , until the set U contains sU vertices (which can require more than sU
iterations if the same vertex is added to U multiple times). We denote by
u1 , u2 , . . . , usU the first sU vertices selected by this process, and observe
that they are independent random vertices of G.
Consider now an arbitrary vertex v of G of degree at least εn/8. The
probability that u_i is a neighbor of v, for any given 1 ≤ i ≤ s_U, is at least
(εn/8)/n = ε/8, and thus, the probability that v belongs to R(U) can be
upper bounded by
\[
\prod_{i=1}^{s_U}\left(1 - \frac{\varepsilon}{8}\right) \leq e^{-\varepsilon s_U/8} \leq e^{-\ln(48/\varepsilon)} = \frac{\varepsilon}{48}.
\]
\[
X\cdot n + (n - X)\cdot\frac{\varepsilon n}{8} \leq X\cdot n + \frac{\varepsilon n^2}{8}.
\]
Recall that with probability at least 5/6 we have X < εn/8, which means
that the last expression evaluates to less than εn2 /4.
Recall that the original assumption we used (i.e., that every vertex
outside U is connected by an edge to a vertex of U ) was useful because it
allowed us to extend every partition of U into a partition of V in a natural
way. Without this assumption, it is not obvious how to assign the vertices of
V \U that are not connected by an edge to U (note that these are exactly the
vertices of R(U )). We circumvent this issue by not assigning these vertices
to any side of the partition, which should not make too much of a difference
when only few edges hit these vertices — i.e., when U has a good coverage.
More formally, a partial partition of V is a pair of disjoint subsets V1 and
V2 of V whose union might not include the entire set V . Given a partition
(U1 , U2 ) of U , we construct a partial partition of V consisting of two sets:
U1 ∪ N (U2 ) and V \(U1 ∪ N (U2 ) ∪ R(U )).
One can naturally extend the notation of violating edges to partial
partitions as follows. Given a partial partition (V1 , V2 ) of V , an edge e
violates this partial partition if it connects two vertices of V1 or two vertices
of V2 . As usual, we define the event A(V1 , V2 ) as the event that some edge
of G[V ] violates the partial partition (V1 , V2 ). Using these definitions, we
can state a weaker version of Lemma 2 that Exercise 5 asks you to prove.
Corollary 3. If the event A(U1 ∪ N (U2 ), V \(U1 ∪ N (U2 ) ∪ R(U ))) occurs
for every partition (U1 , U2 ) of U, then G[V ] is not bipartite.
Proof. Assume by way of contradiction that the event A(U1 ∪ N (U2 ),
V \(U1 ∪ N (U2 ) ∪ R(U ))) occurs for every partition (U1 , U2 ) of U , but
G[V ] is bipartite. The fact that G[V ] is bipartite implies that there exists
a partition (U1 , U2 ) of U and a partition (W1 , W2 ) of W such that no
edge of G[V ] violates the partition (U1 ∪ W1 , U2 ∪ W2 ) of V . However, this
leads to a contradiction because Exercise 5 shows that the existence of such
a partition is excluded by our assumption that the event A(U1 ∪ N (U2 ),
V \(U1 ∪ N (U2 ) ∪ R(U ))) has happened.
We now need to prove that the event stated by Exercise 5 is a
high probability event. In general, this might not be true because many of
the edges violating the partition (U1 ∪ N (U2 ), V \(U1 ∪ N (U2 ))) might not
violate the partition (U1 ∪ N (U2 ), V \(U1 ∪ N (U2 ) ∪ R(U ))); however, there
are only a few such edges when U has a good coverage.
Lemma 4. If s − sU ≥ 2, then, for every given set U of sU vertices that has
a good coverage and a partition (U1 , U2 ) of this set, the event A(U1 ∪N (U2 ),
V \(U1 ∪ N (U2 ) ∪ R(U ))) happens with probability at least 1 − e−ε(s−sU )/24
when G is ε-far from being bipartite.
Before proving the lemma, we would like to point out that its proof is
a slight generalization of the solution for Exercise 4, and thus, readers who
read that solution will find most of the details in the following proof very
familiar.
Proof. Let E be the set of edges violating the partition (U1 ∪ N (U2 ),
V \(U1 ∪ N (U2 ))). If there is an edge of E that connects two vertices of U ,
then this edge is guaranteed to be in G[V ] and also violate the partial
partition (U1 ∪ N (U2 ), V \(U1 ∪ N (U2 ) ∪ R(U ))), which implies that the
event A(U1 ∪ N (U2 ), V \(U1 ∪ N (U2 ) ∪ R(U ))) happens with probability 1.
Thus, we can safely assume from now on that this does not happen, which
allows us to partition E into three sets of edges: the set E1 contains the
edges of E that hit a single vertex of U and a single vertex of V \(U ∪R(U )),
the set E2 contains the edges of E that hit two vertices of V \(U ∪ R(U ))
and the set E3 contains the edges of E that hit any vertices of R(U ).
Note that, by Observation 1, the fact that G is ε-far from being bipartite
guarantees that
\[
\Pr\left[\forall_{1\leq i\leq(s-s_U)/2}\ u_{2i-1}u_{2i}\notin E_2\ \wedge\ \forall_{1\leq i\leq(s-s_U)/2}\ u_{2i}\notin\bigcup_{e\in E_1}U(e)\right]
= \prod_{i=1}^{(s-s_U)/2}\Pr\left[u_{2i-1}u_{2i}\notin E_2\ \wedge\ u_{2i}\notin\bigcup_{e\in E_1}U(e)\right]
\leq \prod_{i=1}^{(s-s_U)/2}\min\left\{\Pr[u_{2i-1}u_{2i}\notin E_2],\ \Pr\left[u_{2i}\notin\bigcup_{e\in E_1}U(e)\right]\right\},
\]
where the equality holds since the vertices u1 , u2 , . . . , us−sU are indepen-
dent. We can now upper bound the two probabilities on the rightmost side
as follows:
\[
\Pr[u_{2i-1}u_{2i}\notin E_2] = 1 - \frac{2|E_2|}{|V\setminus U|^2} \leq 1 - \frac{2|E_2|}{n^2}
\]
and
\[
\Pr\left[u_{2i}\notin\bigcup_{e\in E_1}U(e)\right] = 1 - \frac{\left|\bigcup_{e\in E_1}U(e)\right|}{|V\setminus U|} \leq 1 - \frac{\left|\bigcup_{e\in E_1}U(e)\right|}{n} \leq 1 - \frac{|E_1|}{n^2},
\]
where the last inequality holds since a vertex can belong to at most n – 1
edges. Combining all the inequalities we have proved so far, we get
\[
\prod_{i=1}^{(s-s_U)/2}\min\left\{1 - \frac{2|E_2|}{n^2},\ 1 - \frac{|E_1|}{n^2}\right\}
\leq \prod_{i=1}^{(s-s_U)/2}\left(1 - \frac{|E_1| + |E_2|}{2n^2}\right)
\leq \prod_{i=1}^{(s-s_U)/2}\left(1 - \frac{\varepsilon}{8}\right) \leq e^{-\varepsilon\lfloor(s-s_U)/2\rfloor/8} \leq e^{-\varepsilon(s-s_U)/24},
\]
where the second inequality holds since the minimum of two values is always
upper bounded by their average and the last inequality follows from our
assumption that s − sU ≥ 2.
Corollary 4 implies that given this choice for s, Algorithm 1 detects correctly
with probability at least 2/3 that its input graph is not bipartite when it is
ε-far from being bipartite as long as s ≠ n. Additionally, one can observe
that the same thing is true also when s = n because s = n implies that
the graph G[V ] is identical to the input graph of Algorithm 1. Together
with Exercise 2, which showed that Algorithm 1 always detects correctly
that a bipartite graph is bipartite, we get that Algorithm 1 is a property
testing algorithm for determining whether its input graph is bipartite.
Theorem 1 follows from this observation and from the observation that
Algorithm 1 reads only the part of the representation of G corresponding
to edges between the s vertices of V .
Theorem 1. For an appropriately chosen value s = O(ε−2 · ln ε−1 ),
Algorithm 1 is a one-sided error property testing algorithm for determining
whether its input graph is bipartite whose query complexity is O(s2 ) =
O(ε−4 · ln2 ε−1 ).
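The structure that Theorem 1 attributes to Algorithm 1 is simple: sample s vertices, query only the adjacency-matrix entries between them, and accept if and only if the induced subgraph is bipartite. The sketch below assumes this reading; is_edge is a stand-in for an adjacency-matrix query, and the choice of s follows the theorem.

    import random
    from collections import deque

    def test_bipartite(n, s, is_edge):
        # Accepts iff the subgraph induced by s random vertices admits a proper 2-coloring.
        sample = [random.randrange(n) for _ in range(s)]
        color = {}
        for start in range(len(sample)):
            if start in color:
                continue
            color[start] = 0
            queue = deque([start])
            while queue:                       # BFS 2-coloring of the induced subgraph
                i = queue.popleft()
                for j in range(len(sample)):
                    if j != i and is_edge(sample[i], sample[j]):
                        if j not in color:
                            color[j] = 1 - color[i]
                            queue.append(j)
                        elif color[j] == color[i]:
                            return False       # an odd cycle was found; reject
        return True

Since a bipartite graph has only bipartite induced subgraphs, this structure can err only on graphs that are far from being bipartite, which matches the one-sided error behavior stated in the theorem.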
Exercise Solutions
Solution 1
Any graph over n vertices can be made connected by adding at most n − 1
new edges to it (for example, by fixing an arbitrary simple path going
through all the vertices, and adding its edges to the graph if they are not
already in it). Additionally, we observe that adding an edge to a graph
corresponds to changing two entries in the adjacency matrix of this graph.
Thus, any graph can be made connected by changing at most 2(n−1) entries
in its adjacency matrix. By definition, the distance between the adjacency
matrix of the original graph and the adjacency matrix of the connected
graph obtained after the change of these up to 2(n − 1) entries is at most
\[
\frac{2(n-1)}{n^2} < 2n^{-1},
\]
and thus, no graph is 2n^{-1}-far from a connected graph.
Solution 2
It is easy to see that the claim we are requested to prove in this exercise
follows from Lemma 5.
Solution 3
Assume first that there exists a partition of V into disjoint sets V1 and V2
such that the event A(V1 , V2 ) does not occur, and let us prove this implies
that G[V ] is bipartite. The fact that A(V1 , V2 ) did not occur implies that
G[V ] does not contain any edge within V1 or V2 , and thus, it also does not
contain any edge within the sets V1 ∩ V and V2 ∩ V . If the last two sets are
not empty, then they form a partition of the vertices of G[V ], and thus,
we get that G[V ] is bipartite, as desired. Additionally, if one of the sets
V1 ∩ V or V2 ∩ V is empty, then the observation that no edge is contained
within these sets implies that G[V ] contains no edges, and thus, it is again
bipartite since we assume |V | = s ≥ 2.
Assume now that G[V ] is bipartite, and let us prove that this implies
the existence of a partition of V into disjoint sets V1 and V2 such that
the event A(V1 , V2 ) does not occur. Since G[V ] is bipartite, there exists
a partition (V1 , V2 ) of V such that no edge of G[V ] is within a set of
this partition. Consider now the partition (V \V2 , V2 ) of V , and consider an
arbitrary edge e violating this partition. Since e violates this partition, it
must be within one of the two sets of this partition, however, e cannot be
within V2 because we already know that no edge of G[V ] (and thus, also
of G) is within V2 ⊆ V . Hence, e must be within the set V \V2 . Analogously,
we can also get that the edge e cannot be an edge between two vertices of
V1 , and thus, at least one of the endpoints of e must belong to the set
V \(V1 ∪ V2 ) = V \V ; which implies that e does not belong to G[V ]. Since
e was chosen as an arbitrary edge violating the partition (V \V2 , V2 ), we get
that no edge violating this partition belongs to G[V ], and thus, the event
A(V \V2 , V2 ) does not occur (by definition).
Solution 4
Let E be the set of edges violating the partition (U1 ∪ N (U2 ), V \(U1 ∪
N (U2 ))). If there is an edge of E that connects two vertices of U , then
this edge is guaranteed to be in G[V ], and thus, the event A(U1 ∪ N (U2 ),
V \(U1 ∪ N (U2 ))) happens with probability 1. Thus, we can safely assume
from now on that this does not happen, which allows us to partition E into
two sets of edges E1 and E2 , such that E1 contains the edges of E which
hit a single vertex of V \U , and E2 contains the edges of E which hit two
vertices of V \U . Note that, by Observation 1, the fact that G is ε-far from
being bipartite guarantees that
\[
|E_1| + |E_2| = |E'| \geq \frac{\varepsilon n^2}{2}.
\]
\[
\Pr\left[\forall_{1\leq i\leq(s-s_U)/2}\ u_{2i-1}u_{2i}\notin E_2\ \wedge\ \forall_{1\leq i\leq(s-s_U)/2}\ u_{2i}\notin\bigcup_{e\in E_1}U(e)\right]
= \prod_{i=1}^{(s-s_U)/2}\Pr\left[u_{2i-1}u_{2i}\notin E_2\ \wedge\ u_{2i}\notin\bigcup_{e\in E_1}U(e)\right]
\leq \prod_{i=1}^{(s-s_U)/2}\min\left\{\Pr[u_{2i-1}u_{2i}\notin E_2],\ \Pr\left[u_{2i}\notin\bigcup_{e\in E_1}U(e)\right]\right\},
\]
where the equality holds since the vertices u1 , u2 , . . . , us−sU are indepen-
dent. We can now upper bound the two probabilities on the rightmost side
as follows:
\[
\Pr[u_{2i-1}u_{2i}\notin E_2] = 1 - \frac{2|E_2|}{|V\setminus U|^2} \leq 1 - \frac{2|E_2|}{n^2}
\]
and
\[
\Pr\left[u_{2i}\notin\bigcup_{e\in E_1}U(e)\right] = 1 - \frac{\left|\bigcup_{e\in E_1}U(e)\right|}{|V\setminus U|} \leq 1 - \frac{\left|\bigcup_{e\in E_1}U(e)\right|}{n} \leq 1 - \frac{|E_1|}{n^2},
\]
where the last inequality holds since a vertex can belong to at most n − 1
edges. Combining all the inequalities we have proved so far, we get
\[
\prod_{i=1}^{(s-s_U)/2}\min\left\{1 - \frac{2|E_2|}{n^2},\ 1 - \frac{|E_1|}{n^2}\right\}
\leq \prod_{i=1}^{(s-s_U)/2}\left(1 - \frac{|E_1| + |E_2|}{2n^2}\right)
\leq \prod_{i=1}^{(s-s_U)/2}\left(1 - \frac{\varepsilon}{4}\right) \leq e^{-\varepsilon\lfloor(s-s_U)/2\rfloor/4} \leq e^{-\varepsilon(s-s_U)/12},
\]
where the second inequality holds since the minimum of two values is always
upper bounded by their average and the last inequality follows from our
assumption that s − sU ≥ 2.
Solution 5
Assume that the event A(U1 ∪N (U2 ), V \(U1 ∪N (U2 )∪R(U ))) has happened.
This implies that there is an edge e in G[V ] violating its partial partition.
The edge e must be within one of the two sides of this partition, which
gives us two cases to consider.
• The first case is that e connects two vertices of U1 ∪ N (U2 ). This case
is analogous to the first three cases studied in the proof of Lemma 2,
and the same proofs used in these three cases show that there must
be an edge of G[V ] violating the partition (U1 ∪ W1 , U2 ∪ W2 ) of V
regardless of the chosen partition (W1 , W2 ) of W .
• The second case we need to consider is the case that e connects two
vertices of V \(U1 ∪ N (U2 ) ∪ R(U )). One can observe that V \(U1 ∪
N (U2 ) ∪ R(U )) contains only vertices of U2 and vertices of V \U that
have a neighbor in U \U2 = U1 , and thus, e connects two vertices of
U2 ∪ N (U1 ). This property of e is symmetric to the property of e that
we had in the first case (namely, that both endpoints of e belong to U1 ∪ N (U2 )), and thus, a symmetric
proof can be used to show that also in this case there must be an edge
of G[V ] violating the partition (U1 ∪ W1 , U2 ∪ W2 ) of V regardless of
the chosen partition (W1 , W2 ) of W .
Solution 6
Let G be an ε-far graph for some ε ∈ (0, 1), and consider the events I1 and
I2 from the proof of Corollary 4. We recall that I1 is the event that U has
a good coverage and I2 is the event that Algorithm 1 detects that G is not
bipartite.
Repeating the proof of Lemma 3, we get that the event I1 holds with
probability at least 1 − δ/2 if sU ≥ 8 · ln(16/(δε))/ε. Similarly, by repeating
a part of the proof of Corollary 4, we get that Pr[I2 | I1] ≥ 1 − δ/2 when
s ≥ (c/ε + 1)sU + 24 ln(2/δ)/ε. Thus, by choosing
\[
s_U = \frac{8\cdot\ln(16/(\delta\varepsilon))}{\varepsilon} = \Theta\big(\varepsilon^{-1}\cdot(\ln\varepsilon^{-1} + \ln\delta^{-1})\big)
\]
and
\[
s = \min\left\{\left(\frac{24\cdot\ln 2}{\varepsilon} + 1\right)\cdot s_U + \frac{24\ln(2/\delta)}{\varepsilon},\ n\right\}
\]
Chapter 14
Boolean functions are functions which get (multiple) bits as their input
and produce a (single) output bit based on these input bits. A few very
simple examples of such functions are the standard logical gates of AND,
OR and NOT, but the relationship between the study of Boolean functions
and computer science is much deeper than the superficial relationship
represented by these logical gates. In particular, results about Boolean
functions have found applications in diverse fields of computer science such
as coding theory and complexity.
The importance of Boolean functions has motivated the study of
property testing algorithms that test various properties that such functions
can have. In this chapter, we present algorithms for two such properties.
14.1 Model
We begin the chapter by presenting the Boolean function model, which
is the property testing model that we assume throughout this chapter.
Formally, a Boolean function is a function from {0, 1}m to {0, 1}, where
{0, 1}m is the set of all vectors of m bits (i.e., all vectors of m coordinates
that have either 0 or 1 in each one of these coordinates). Thus, the set N
of objects in the Boolean function model is the set of all possible functions
f : {0, 1}m → {0, 1}. Since we consider arbitrary Boolean functions, it is
natural to think of a function as a truth table (recall that the truth table of
a function is a table that explicitly states the value of the function for every
possible input — see Tables 14.1 and 14.2 for examples). This allows us to
define the distance between two Boolean functions as the fraction of the
331
entries in which their truth tables differ, i.e., the fraction of inputs on which
the two functions produce different outputs. A more formal way to state
this is that the distance between two Boolean functions f : {0, 1}m → {0, 1}
and g: {0, 1}m → {0, 1} is
\[
\Pr_{x\in_R\{0,1\}^m}[f(x) \neq g(x)].
\]
Exercise 1. Explain why the formal and informal definitions given above
for the distance between f and g coincide. In other words, explain why the
probability Pr_{x∈_R{0,1}^m}[f(x) ≠ g(x)] is indeed equal to the fraction of inputs
on which the functions f and g differ.
which shows that the AND function is not linear. Consider now the XOR
function whose truth table is given in Table 14.2. One can verify using this
table that for any vector x ∈ {0, 1}^2 the XOR function obeys XOR(x) =
x1 + x2 .
This observation implies that for every two vectors x, y ∈ {0, 1}2, it
holds that
1 Given standard definitions of linearity for other objects, it is natural to require also
that f (c · x) = c · f (x) for every vector x ∈ {0, 1}m and scalar c ∈ {0, 1}. However, this
is trivial for c = 1, and for c = 0 it becomes f (0̄) = 0, which already follows from our
definition of linearity because this definition implies f (0̄) + f (0̄) = f (0̄ + 0̄) = f (0̄).
Exercise 3. Prove that if η(f ) is at least ε/6 whenever f is ε-far from being
linear, then, given a function f which is ε-far from being linear, Algorithm 1
correctly detects that it is not linear with probability at least 2/3.
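Although Algorithm 1 is presented earlier in the chapter, the single check it repeats is, in this reading, the classical three-query linearity test: pick uniformly random vectors x and y and verify that f(x) + f(y) = f(x + y), where the addition of vectors and of output bits is modulo 2. The sketch below is our own rendering under this assumption; the number s of repetitions is a placeholder.

    import random

    def xor_vectors(x, y):
        # Coordinate-wise addition modulo 2 of two bit vectors.
        return tuple(a ^ b for a, b in zip(x, y))

    def linearity_test(f, m, s):
        # Returns False as soon as a sampled pair violates f(x) + f(y) = f(x + y).
        for _ in range(s):
            x = tuple(random.randint(0, 1) for _ in range(m))
            y = tuple(random.randint(0, 1) for _ in range(m))
            if (f(x) + f(y)) % 2 != f(xor_vectors(x, y)):
                return False       # a witness of non-linearity was found
        return True

    # Example: the XOR of the first two bits is linear, so the test always accepts it.
    assert linearity_test(lambda x: x[0] ^ x[1], m=3, s=100)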
from f is less than ε. In particular, we will show that this is the case for
the function
Lemma 2. For every vector x ∈ {0, 1}m, Pry∈R {0,1}m [g(x) = f (x + y)−
f (y)] ≥ 1 − 2η(f ).
where the second equality is true since the vectors z and y are independent,
and the inequality is true since the definition of g guarantees that p is at
least 1/2. To prove the lemma, it remains to show that the probability on
the leftmost side is at least 1 − 2η(f ). Since addition and subtraction of
bits are equivalent operations (verify!), we get
where the second inequality follows from the union bound and the last
equality holds since x+ z and x+ y are uniformly random vectors of {0, 1}m
whose sum is y + z.
We are now ready to prove that g is a linear function. Note that our
assumption (by way of contradiction) that η(f ) < ε/6, together with the
observation that the distance of a function from linearity can be at most 1,
implies η(f ) < 1/6.
Corollary 1. Since η(f ) < 1/6, g is linear.
Proof. Consider any pair of vectors x and y. We need to show that
g(x)+ g(y) = g(x+ y). Naturally, this requirement is deterministic, which is
problematic since all the guarantees we have on g are randomized. Thus, let
us consider a uniformly random vector z in {0, 1}m. Then, one can observe
that the equality that we want to prove holds whenever the following three
equalities hold.
(i) g(x) = f (x + z) − f (z), (ii) g(y) = f (y + x + z) − f (x + z) and
(iii) g(x + y) = f (x + y + z) − f (z).
Since z is a uniformly random vector, Lemma 2 implies that both
Equalities (i) and (iii), separately, hold with probability at least 1 − 2η(f).
Moreover, since x + z is also a uniformly random vector in {0, 1}m,
Equality (ii) also holds with this probability. Thus, we get by the union
bound that the probability that all three equalities hold at the same time
is at least 1 − 6η(f ). Plugging in our knowledge that η(f ) < 1/6, we get
that there is a positive probability that Equalities (i), (ii) and (iii) all hold
at the same time, which implies that there must be some vector z ∈ {0, 1}m
for which the three inequalities hold.2 As discussed above, the existence of
such a vector z implies that the equality g(x) + g(y) = g(x + y), which we
wanted to prove, is true.
Lemma 1 and Corollary 1 imply together that the distance of f from
the linear function g is at most 2η(f ) < 2(ε/6) < ε, which contradicts our
assumption that f is a Boolean function that is ε-far from being linear while
obeying η(f ) < ε/6. Thus, as intended, we get that every Boolean function
which is ε-far from being linear must obey η(f ) ≥ ε/6. Recall that, by
Exercise 3, this implies that Algorithm 1 detects with probability at least
2/3 that a Boolean function that is ε-far from being linear is not linear.
3 This is not trivial for general vectors. For example, the vectors x = (0, 1) and y = (1, 0)
obey neither x ≤ y nor x ≥ y.
Figure 14.1. The Boolean hypercubes of dimensions 1(a), 2(b) and 3(c).
We will prove Lemma 3 momentarily, but first let us show that it indeed
implies that Algorithm 2 is a property testing algorithm.
\[
\frac{\varepsilon\cdot 2^{m-1}}{m\cdot 2^{m-1}} = \frac{\varepsilon}{m}
\]
because the hypercube of dimension m contains m · 2m−1 edges (an edge
of the hypercube is specified by the choice of the single coordinate in
which the two endpoints of the edge differ and the values of the remaining
m − 1 coordinates). Thus, the probability that no iteration of the algorithm
detects that f is non-monotone is at most
\[
\left(1 - \frac{\varepsilon}{m}\right)^s \leq e^{-(\varepsilon/m)s} \leq e^{-(\varepsilon/m)\cdot(2m/\varepsilon)} = e^{-2} < \frac{1}{3}.
\]
Lemma 4. For every Boolean function f : {0, 1}m → {0, 1} and dimen-
sion 1 ≤ i ≤ m, the following hold.
Figure 14.2. A graphical illustration of the Si operation. The left side of this figure
presents a Boolean function (with m = 2). The value of the Boolean function for every
vector x ∈ {0, 1}2 appears in italics for clarity. The right side of the figures presents the
function obtained by applying the operation S2 to the function given on the left. Note
that the original function has a single non-monotone edge in dimension 2 (the edge from
(0, 0) to (0, 1)), and thus, the function produced by the S2 operation flips the values of f
in the endpoints of this edge, while preserving the values of f for the two other vectors.
• Di (Si (f )) = 0.
• For every other dimension 1 ≤ j ≤ m, Dj (Si (f )) ≤ Dj (f ). In
particular, if Dj (f ) = 0, then Dj (Si (f )) = 0.
Figure 14.3. One of the cases in the proof of Lemma 4. The left rectangle represents one possible assignment of values by f to the vectors of the set A. More specifically, each vertex of this rectangle represents a single vector of A, where the two coordinates beside the vertex give the values of the coordinates i and j (in that order) of the vector in A that the vertex corresponds to, and the number in italics near the vertex gives the value assigned by f to this vector. In a similar way, the right rectangle represents the values assigned by Si(f) to the vectors of A. One can observe that the left rectangle includes a single non-monotone edge in dimension j, and so does the right rectangle. Hence, in the case described by this figure, Si(f) does not make the situation worse in dimension j compared to f.
Figure 14.4. The cases of the proof of Lemma 4 which are not given in Figure 14.3.
There are seven possible assignments of values by f to the four vertices of the set A that
result in at least one non-monotone edge belonging to dimension i within the set. One
of these assignments was depicted in Figure 14.3, and the other six cases are depicted
in this figure (see Figure 14.3 for an explanation about the way in which each case is
depicted). One can observe that in each case the number of edges of dimension j that
are non-monotone with respect to Si (f ) is at most as large as the number of such edges
that are non-monotone with respect to f . Specifically, if we number the cases from left
to right and from top to bottom, then in cases 2, 3 and 6 there are no non-monotone
edges of dimension j with respect to either f or Si (f ), in cases 1 and 4 there is a single
non-monotone edge of dimension j with respect to f , but no such edge with respect to
Si (f ), and finally, in case 5 there is a single non-monotone edge of dimension j with
respect to both f and Si (f ).
that are non-monotone with respect to f . Since each edge of the Boolean
hypercube belongs to exactly one dimension, the sum ∑_{i=1}^{m} Di(f) is exactly equal to the number of edges of the hypercube that are non-monotone with respect to f, which yields

∑_{i=1}^{m} Di(f) < ε · 2^{m−1}.
1. First, pick uniformly at random the dimension i to which the edge will
belong.
2. Pick a uniformly random vector x ∈ {0, 1}m, and let y be the vector
obtained from x by flipping the value of coordinate i.
3. Output x and y as the two endpoints of the sampled edge.
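Building on this sampling procedure, the following Python sketch shows an edge tester in the spirit of Algorithm 2; the ⌈2m/ε⌉ iterations match the bound used in the analysis above, while the function names and the example functions are illustrative assumptions.

```python
import math
import random

def random_hypercube_edge(m):
    """Sample a uniformly random edge of the m-dimensional Boolean hypercube,
    exactly as in the procedure above: a random dimension i, a random vector x,
    and the vector y obtained from x by flipping coordinate i."""
    i = random.randrange(m)
    x = [random.randint(0, 1) for _ in range(m)]
    y = list(x)
    y[i] = 1 - y[i]
    return tuple(x), tuple(y)

def monotonicity_test(f, m, eps):
    """Declare whether f: {0, 1}^m -> {0, 1} looks monotone.

    Each of the ceil(2m/eps) iterations samples a random edge and checks that
    it is monotone with respect to f, i.e., that the smaller endpoint does not
    get a larger value than the larger endpoint.
    """
    for _ in range(math.ceil(2 * m / eps)):
        x, y = random_hypercube_edge(m)
        # The endpoints differ in a single coordinate, so the one with the
        # smaller coordinate sum is the coordinate-wise smaller endpoint.
        low, high = (x, y) if sum(x) < sum(y) else (y, x)
        if f(low) > f(high):
            return False  # a non-monotone edge was found
    return True

if __name__ == "__main__":
    and_all = lambda v: int(all(v))  # monotone
    xor_all = lambda v: sum(v) % 2   # far from monotone, so likely rejected
    print(monotonicity_test(and_all, 8, 0.1))  # expected: True
    print(monotonicity_test(xor_all, 8, 0.1))  # expected: False (with high probability)
```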
Exercise Solutions
Solution 1
When x is a uniformly random vector out of {0, 1}m, the probability
Pr[f (x) = g(x)] is simply the fraction of the vectors x ∈ {0, 1}m which
obey the condition f (x) = g(x), i.e., the fraction of the inputs which make
f and g produce different outputs.
Solution 2
We begin by proving that if f is a Boolean function and a_1, a_2, . . . , a_m ∈ {0, 1} are constants such that f(x) = ∑_{i=1}^{m} a_i x_i for every vector x ∈ {0, 1}m, then f is linear. Let x and y be two arbitrary vectors in {0, 1}m. Then,

f(x) + f(y) = ∑_{i=1}^{m} a_i x_i + ∑_{i=1}^{m} a_i y_i = ∑_{i=1}^{m} a_i(x_i + y_i) = f(x + y),

which proves that f is linear. Assume now, in the other direction, that f is a linear function, and for every 1 ≤ i ≤ m, let e^{(i)} denote the vector of {0, 1}m whose i-th coordinate is 1 and whose other coordinates are all 0. Observe that every vector x ∈ {0, 1}m obeys x = ∑_{i=1}^{m} x_i · e^{(i)}
(verify the equality by plugging in the two possible values for xi ). The
linearity of f now gives us
f(x) = f(∑_{i=1}^{m} x_i · e^{(i)}) = ∑_{i=1}^{m} f(x_i · e^{(i)}) = ∑_{i=1}^{m} x_i · f(e^{(i)}).
One can observe that the last equality completes the proof because x is a
general vector in {0, 1}m and f (e(i) ) is some constant in {0, 1} for every
1 ≤ i ≤ m.
Solution 3
Consider a function f which is ε-far from being linear. By the assumption of
the exercise, we get that η(f ) ≥ ε/6. In other words, every single iteration
of Algorithm 1 has a probability of at least ε/6 to detect that f is not
linear. Since the iterations are independent, and Algorithm 1 declares the
function to be non-linear if a single iteration detects that it is not linear,
we get that the probability that Algorithm 1 declares f to be linear is
at most
(1 − ε/6)^{number of iterations} ≤ (1 − ε/6)^{12/ε} ≤ e^{−(ε/6)·(12/ε)} ≤ e^{−2} < 1/3.
Solution 4
Following the hint, let h be a linear Boolean function which has the
minimum distance to f among all such functions. Since f is ε-close to
being linear for some ε < 1/4, the set U of vectors on which f and h
disagree contains less than 1/4 of the vectors. Moreover, by the linearity of
h, h(x) = h(x + y) − h(y) for every two vectors x, y ∈ {0, 1}m . Hence, for
every vector x ∈ {0, 1}m,
Pr_{y∈{0,1}m}[h(x) ≠ f(x + y) − f(y)]
≤ Pr_{y∈{0,1}m}[h(x + y) ≠ f(x + y) or h(y) ≠ f(y)]
≤ Pr_{y∈{0,1}m}[h(x + y) ≠ f(x + y)] + Pr_{y∈{0,1}m}[h(y) ≠ f(y)] = 2 · |U|/2^m < 1/2,
where the second inequality follows from the union bound. Observe now
that the above calculation shows that h(x) is the value in {0, 1} that is
equal to f (x + y) − f (y) for more vectors y ∈ {0, 1}m, which makes it
identical to g(x) by definition.
Solution 5
We begin the solution by proving that AND(x) is monotone. To do that, we
need to verify that for every pair of vectors x, y ∈ {0, 1}2 such that x ≤ y,
it holds that AND(x) ≤ AND(y). Since this trivially holds when x = y, we
only need to verify it for the case that x and y are distinct vectors, which
is done in Table 14.3. Note that indeed AND(x) ≤ AND(y) in every row of
the table.
We now would like to prove that XOR(x) is not monotone. Consider
the vectors x = (0, 1) and y = (1, 1). Clearly, these are two vectors in {0, 1}2 obeying x ≤ y; however, one can verify that XOR(x) = 1 > 0 = XOR(y), which shows that XOR is not monotone.
Solution 6
We first prove that if f : {0, 1}m → {0, 1} is a monotone Boolean function,
then every edge of the Boolean hypercube of dimension m is monotone with
respect to f . Fix an arbitrary edge (x, y) of this hypercube. As discussed
above, we must have either x ≤ y or y ≤ x, thus, we can assume without
loss of generality that x ≤ y. The monotonicity of f now implies that
f (x) ≤ f (y), and therefore, by definition, the edge (x, y) is monotone with
respect to f .
Proof. The vectors zi−1 and zi agree on all coordinates other than
coordinate i. In coordinate i, the vector zi−1 takes the value of this
coordinate in vector x, while the vector zi takes the value of this coordinate
in vector y. Since x ≤ y, the first value is at most the last value, which
implies the observation.
Solution 7
Consider the function f : {0, 1}2 → {0, 1} given on the left-hand side
of Figure 14.5. One can observe that no edges of dimension 2 are non-
monotone with respect to this function, but both edges of dimension 1 are.
The right-hand side of the same figure gives the function obtained from f by
flipping one of the two (non-monotone) edges of dimension 1. One can note
that the edge between the vectors (1, 0) and (1, 1) is an edge of dimension
2 that is non-monotone with respect to the function resulting from the
flip, despite the fact that no edges of dimension 2 were non-monotone with
respect to the original function f .
Figure 14.5. A graphical illustration of a Boolean function (with m = 2), and the
function obtained from it by flipping the values of the endpoints of a single edge. The
left side of this figure presents the original function by specifying the value of this function
for every vector in {0, 1}2 (the values appear in italics). The right side of the figure gives in
a similar way the function obtained by flipping the values of the endpoints of the edge
between the vectors (0, 0) and (1, 0).
Solution 8
Algorithm 2 has at most s = O(m/ε) iterations, and in each one of these
iterations, it evaluates the function f on two vectors. Thus, its query
complexity is
2s = O(m/ε).
Note that this query complexity is smaller than the time complexity we
have shown for the algorithm by a factor of m.
Chapter 15
Introduction to Map-Reduce
that are parallelizable and the parts of the algorithm that are inherently
sequential. For example, multiplying matrices is an easily parallelizable task
because calculating the value of every cell in the result matrix can be done
independently. In contrast, predicting the change over time in a physical
system (such as the weather) tends to be an inherently sequential task
because it is very difficult to calculate the predicted state of the system
at time t + 1 before the predicted state of the system at time t has been
calculated.
Another issue appears when a parallel algorithm is intended to be
executed on a computer cluster (rather than on a multi-CPU computer).
Despite the very fast communication networks that are available today, the
communication between computers inside the clusters is still significantly
slower than the transfer of information between the components inside
an individual computer. Thus, a parallel algorithm that is implemented
on a computer cluster should strive to minimize the amount of infor-
mation transferred between the computers of the cluster, even (to some
extent) at the cost of performing more internal computations within the
computers.
Implementing a parallel algorithm on a computer cluster also raises an
additional issue, which, unlike the one discussed in the previous paragraph,
is more relevant to the programmer implementing the algorithm than to the
algorithm’s designer. To understand this issue, we note that implementing
a parallel algorithm on a computer cluster involves a large amount of
“administrative” work that needs to be done in addition to the logic of
the algorithm itself. For example, tasks should be assigned to the various
computers of the cluster, and data has to be transferred accordingly and
reliably between the computers. This administrative work is made more
complicated by failures that often occur in components of the cluster (such
as computers and communication links) during the algorithm’s execution.
While these components are usually reasonably reliable, their sheer number
in the cluster makes a failure of some of them a likely event even within a
relatively short time frame, which means that an algorithm implementation
has to handle such failures on a regular basis.
In an attempt to alleviate the repetitive work of handling the above-
mentioned administrative work, Google developed a framework known
as Map-Reduce which handles this work, but requires the algorithm to
have a given structure. A paper about the Map-Reduce framework was
published by Google employees in 2004, which led to a rapid adoption
said Alice,) and round the neck of the bottle was a paper label, with the
words ‘DRINK ME’ beautifully printed on it in large letters.
One can verify that this paragraph contains, for example, 5 occurrences of
the word “the”, 3 occurrences of the word “she” and a single occurrence of
the word “drink”.
Determining the frequency of the words in a text is a simple task, but
if the text is long enough, then it makes sense to use a parallel algorithm
for the job. Thus, we would like to design a Map-Reduce algorithm for
it. The basic unit of information in the Map-Reduce framework is a pair
of a key and a value. Hence, to process data using this framework, the
data must be converted into a representation based on such pairs. The best
way to do that depends of course on the exact data that we would like to
process, but for the sake our example, let us assume that each word in the
input text becomes a pair whose value is the word itself and whose key is
the name of the text from which the word was taken. Thus, the first 18
words in the above text are represented using the following pairs (note that
some of the pairs appear multiple times because the text contains repeating
words).
2 Observe that in a long text there are likely to be many distinct words, which will
require the Map-Reduce framework to assign many reducers. As the number of available
computers is limited, this might require the Map-Reduce framework to assign a single
computer as the reducer of multiple keys. Fortunately, the framework hides this technical
issue, and thus, we assume for simplicity in the description of the framework that there
is a distinct reducer for each key.
Exercise 2. Consider the following pairs (these are the pairs produced by the
map step based on the 18 input pairs listed above). Assuming that these pairs
are the input for the shuffle step, determine how many reducers are assigned
by the shuffle step and the list of pairs that are transferred to each one of
these reducers.
(there, 1) (seemed, 1) (to, 1) (be, 1) (no, 1) (use, 1)
(in, 1) (waiting, 1) (by, 1) (the, 1) (little, 1) (door, 1)
(so, 1) (she, 1) (went, 1) (back, 1) (to, 1) (the, 1)
After the shuffle step, all the pairs having a particular key are located
on a single computer, namely the reducer designated for this key, and thus,
can be processed as a group. This is done by the third and final step of
the Map-Reduce framework — the reduce step. To implement this step,
the user of the system specifies a reduce procedure that gets as input a
key and the values of the pairs with this key, and based on this input
produces a list of new pairs. The Map-Reduce system then executes this
procedure, independently, on every key that appears in some pair. In other
words, each reducer executes the reduce procedure once, and passes to it
as parameters the key of the reducer and the values of the pairs with this
key (which are exactly the pairs transferred to this reducer during the
shuffle step).
In our running example, the keys of the pairs produced by the map
step are the words of the original text. Thus, the reduce procedure is
executed once for every distinct word in this text. Moreover, the execution
corresponding to a word w gets as parameters the word w itself and the
values of the pairs that have w as their key. These values are all 1 by
construction, but by counting them, the reduce procedure can determine
the number of appearances (or frequency) of the word w in the original
text. The reduce procedure then outputs this information in the form
of a new pair (w, c), where w is the word itself and c is its number of
appearances.
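The following Python sketch illustrates the map and reduce procedures just described for the word-frequency example, together with a tiny sequential simulation of a single Map-Reduce round. The simulation and the text name "alice" are assumptions added only to make the sketch self-contained; the real framework performs the three steps in a distributed fashion.

```python
from collections import defaultdict

def map_procedure(key, value):
    """Map step: a (text name, word) pair becomes the single pair (word, 1)."""
    return [(value, 1)]

def reduce_procedure(key, values):
    """Reduce step: a word and the list of 1's sent for it become (word, frequency)."""
    return [(key, len(values))]

def run_round(pairs, map_proc, reduce_proc):
    """Sequentially simulate one Map-Reduce round over a list of (key, value) pairs."""
    # Map step: the map procedure is applied to every input pair independently.
    mapped = [out for pair in pairs for out in map_proc(*pair)]
    # Shuffle step: all pairs sharing a key are collected at the reducer of that key.
    reducers = defaultdict(list)
    for key, value in mapped:
        reducers[key].append(value)
    # Reduce step: the reduce procedure is executed once for every key.
    return [out for key, values in reducers.items() for out in reduce_proc(key, values)]

if __name__ == "__main__":
    words = "there seemed to be no use in waiting by the little door so she went back to the"
    input_pairs = [("alice", w) for w in words.split()]
    print(run_round(input_pairs, map_procedure, reduce_procedure))
```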
Figure 15.1. (a) A graphical illustration of a single Map-Reduce round: the input pairs go through a map step (map procedure), a shuffle step and a reduce step (reduce procedure), which produces the output. (b) Multiple Map-Reduce rounds, where each round has its own map and reduce procedures.
The sequence of a map step, a shuffle step and a reduce step is called
a Map-Reduce round. A graphical illustration of such a round can be found
in Figure 15.1(a). In our running example, a single Map-Reduce round was
sufficient for producing the output that we looked for (i.e., the frequency
of every word in the text). However, in general it might be necessary to use
multiple Map-Reduce rounds, where the output pairs of each round serve
as the input pairs for the next round and each round has its own map and
reduce procedures (see Figure 15.1(b) for a graphical illustration). We will
see examples of such use later in this book.
Figure 15.2. A graphical illustration of the Map-Reduce framework after the removal
of all the map steps but the first one
3 Despite the theoretical argument used above to justify the removal of all the map steps
but the first one, practical Map-Reduce systems keep these steps because they have
practical advantages. The map procedure is executed separately for every pair, while
the reduce procedure is executed only once for all the pairs having the same key. Thus,
work done by a map procedure tends to be partitioned into many small independent
executions, while work done by a reduce procedure is usually partitioned into fewer and
larger executions. Small executions can be better parallelized and also allow for an easier
recovery in case of a fault in a hardware component, and thus, are preferable in practice.
value to the reducer of this key. Given this point of view, one can view
the Map-Reduce execution after the dropping of the map steps as follows.
Every piece of information (pair) of the input is processed independently
by the map procedure, and while doing so, the map procedure can send
arbitrary information to the reducers. The reducers then work in iterations.
In the first iteration, every reducer processes the information it received
from the executions of the map procedure, and can send information to
other reducers and to itself. Similarly, in every other iteration every reducer
processes the information sent to it from the reducers in the previous
iteration, and can send information to the other reducers and to itself for
processing in the next iteration.
Using the above intuitive view of the Map-Reduce execution, we are
now ready to formulate it in a model, which we call in this book the
Map-Reduce model. In this model, there is an infinite number of machines
(or computers),4 one machine associated with every possible “name”. The
computation in this model proceeds in iterations. In the first iteration,
every piece of information is sent to a different machine, and this machine
can process this piece of information in an arbitrary way. While processing
the information, the machine can also send messages to other machines
based on their names. In other words, while sending a message, the sending
machine should specify the information of the message and the name of the
destination machine for the message. In every one of the next iterations,
every machine processes the messages that it received during the previous
iteration, and while doing so can send messages to other machines to be
processed in the next iteration (more formally, in iteration i > 1 every
machine processes the messages sent to it during iteration i − 1 and can
send messages to other machines — and these messages will be processed
during iteration i + 1). In the last iteration, instead of sending messages,
the machines are allowed to produce pieces of the output.
To exemplify the Map-Reduce model, let us explain how to convert to
this model the algorithm described above for determining the frequency
of words in a text. In this problem, the basic pieces of the input are the
words of the text, and thus, we assume that every such word starts on
4 The infinite number of machines in the Map-Reduce model might make it seem very far
from any practical situation. However, one should keep in mind that only a finite set of
these machines make any computation in algorithms that terminate in finite time, and
that multiple such logical machines can be simulated on one real computer. Given these
observations, there is no significant barrier for implementing the Map-Reduce model
using a real-world system.
a different machine. Our plan is to use every machine for counting the
number of instances of the word equal to its name. To do that, during
the first iteration, every machine that received a word should send the
number 1 to the machine whose name is equal to that word (intuitively,
this 1 represents a single copy of the word). A more formal description of
the procedure executed during the first iteration is given as Algorithm 1.
To simplify the exposition, we use in this procedure (and in the rest of this
book) the convention that only machines that got information execute the
procedure; which means, for example, that a machine that did not get any
input word does not execute the procedure of the first iteration.
In the second iteration, every machine whose name is a word that
appeared in the original text gets the message “1” once for every appearance
of this word. Thus, to calculate the frequency of this word, all that remains
to be done is to add up these ones, and output this sum as the frequency
of the word, which is formally done by Algorithm 2.
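A possible Python rendering of these two iterations (in the spirit of Algorithms 1 and 2), together with a small sequential simulation of the message passing of the Map-Reduce model, is given below; the simulation and all names in it are illustrative assumptions and not part of the model.

```python
from collections import defaultdict

def first_iteration(word, send):
    """First iteration: a machine holding an input word sends the message 1
    to the machine whose name is equal to that word."""
    send(word, 1)

def second_iteration(machine_name, messages, output):
    """Second iteration: the machine named after a word adds up the 1's it
    received, which is the frequency of that word, and outputs it."""
    output(machine_name, sum(messages))

def word_frequencies(words):
    """Sequentially simulate the two iterations described above."""
    inbox = defaultdict(list)
    send = lambda name, message: inbox[name].append(message)
    for word in words:                      # every input word starts on its own machine
        first_iteration(word, send)
    results = {}
    emit = lambda name, value: results.__setitem__(name, value)
    for name, messages in inbox.items():    # only machines that got messages execute
        second_iteration(name, messages, emit)
    return results

if __name__ == "__main__":
    print(word_frequencies("so she went back to the little door to the".split()))
```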
to the number of appearances of this word in the input, and thus, the total
running time of all these machines is again proportional to the number of
words in the input. Thus, the work of this algorithm is O(number of words
in the input), which is very good since one can verify that no sequential
algorithm can (exactly) calculate the frequencies of all the words without
reading the entire input at least once.
does not exceed its memory S. To understand this requirement, one should
assume that the messages which a machine would like to send during a
given iteration are stored in its memory until the end of the iteration. Once
the iteration ends, these messages are transferred to the memory of their
destination machines, where they can be accessed by these machines at the
following iteration. Given this description, it is clear that the sizes of the
messages sent or received by a machine should be counted toward its space
complexity, and thus, their sum is restricted by S.
To substantiate the above description of the MPC model, let us consider
a mock problem in which the input consist of n tokens, where every token
has one of k colors for some constant k. The objective of the problem is
to determine the number of tokens of each color. Let us now describe an
algorithm for this problem in the MPC model. Originally, each one of the
M machines is given up to n/M tokens of the input. During the first
iteration, each machine counts the number of tokens of each color that
it was given and forwards these counts to machine number 1 (this is an
arbitrary choice, any other fixed machine could be used instead). In the
second iteration, machine number 1 gets the counts calculated by all the
machines in iteration 1 and then combines them to produce the required
output, i.e., the number of appearances of each one of the k colors in the
input.
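For illustration, the following Python sketch simulates this two-iteration MPC algorithm sequentially; the way tokens are split among the machines and all function names are assumptions made only for the sake of the example.

```python
import random

def first_iteration(tokens, k):
    """Each machine counts, for every color 1..k, how many of its tokens have it.
    The resulting array is the message it forwards to machine number 1."""
    counts = [0] * (k + 1)
    for color in tokens:
        counts[color] += 1
    return counts

def second_iteration(received_counts, k):
    """Machine number 1 combines the counts received from all the machines."""
    totals = [0] * (k + 1)
    for counts in received_counts:
        for color in range(1, k + 1):
            totals[color] += counts[color]
    return {color: totals[color] for color in range(1, k + 1)}

def count_colors(tokens, k, num_machines):
    """Split the n tokens evenly among the machines and run the two iterations."""
    chunk = -(-len(tokens) // num_machines)      # at most ceil(n / M) tokens per machine
    machines = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
    messages_to_machine_1 = [first_iteration(part, k) for part in machines]
    return second_iteration(messages_to_machine_1, k)

if __name__ == "__main__":
    tokens = [random.randint(1, 3) for _ in range(1000)]   # n = 1000 tokens of k = 3 colors
    print(count_colors(tokens, k=3, num_machines=10))
```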
Since the input is originally split evenly among the M machines, they
should have enough memory together to store the entire input. Formally,
this means that we must always have S = Ω(n/M ), where n is the size of
the input. It is desirable to design algorithms which do not require S to
be much larger than this natural lower bound. For example, the solution
of Exercise 6 shows that the algorithm we designed above for our mock
problem can work with S = Θ(n/M ) as long as M = O(n0.5 / log0.5 n).
Exercise Solutions
Solution 1
A pseudocode for the map procedure described by the text before the
exercise is given as Algorithm 3. Recall that the key in the pair that the
procedure gets is the name of the text from which the word was taken, and
the value in this pair is the word itself.
Applying this map procedure to each one of the pairs from the list appearing
on page 3, independently, results in the following pairs. Observe that, like
in the list from page 3, there are repetitions in this list.
Solution 2
The 18 pairs listed have 16 distinct keys (because each one of the words
“the” and “to” serves as the key of two pairs). Accordingly, 16 reducers
are assigned, one for each one of the distinct keys. Each one of these
reducers gets all the pairs whose key matches the key of the reducer. More
specifically, the reducer of the key “the” gets the two identical (the, 1) pairs,
the reducer of the key “to” gets the two identical (to, 1) pairs, and every
one of the other reducers gets the single pair corresponding to its key.
Solution 3
A pseudocode for the reduce procedure described by the text before the
exercise is given as Algorithm 4. Recall that the key which the procedure
gets is a word of the original text and the values are the values of the pairs
with this key. By construction, all these values are 1, but their number
is the number of such pairs, i.e., the number of times that the key word
appeared in the text.
Solution 4
The input for the problem we consider consists of the set of edges of the
graph, and thus, every machine executing the procedure of the first iteration
starts this iteration with a single such edge. We will make this machine send
the message “1” to every one of the end points of this edge (we assume that
the representation of the edge consists of its two end points, which makes
this step possible). Intuitively, this “1” represents a single edge hitting the
end point. Formally, the procedure executed during the first iteration is
given as Algorithm 5.
At the second iteration, every machine named after a vertex of the
graph gets the message “1” once for every edge of the graph hitting this
vertex, and thus, it can calculate the degree of this vertex by simply
summing up these messages. This is formally done by Algorithm 6.
Algorithms 5 and 6 together form the Map-Reduce algorithm that we
are requested to find by the exercise.
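The next Python sketch mirrors the two procedures of this solution (in the spirit of Algorithms 5 and 6) on top of a small sequential simulation of the message passing; the simulation code is an assumption added only to make the sketch runnable.

```python
from collections import defaultdict

def first_iteration(edge, send):
    """First iteration: the machine holding an edge sends the message 1 to the
    machines named after both of its end points."""
    u, v = edge
    send(u, 1)
    send(v, 1)

def second_iteration(vertex, messages, output):
    """Second iteration: the machine named after a vertex sums the 1's it got,
    which is exactly the degree of that vertex."""
    output(vertex, sum(messages))

def vertex_degrees(edges):
    """Sequentially simulate the two iterations over a list of edges."""
    inbox = defaultdict(list)
    send = lambda name, message: inbox[name].append(message)
    for edge in edges:
        first_iteration(edge, send)
    degrees = {}
    for vertex, messages in inbox.items():
        second_iteration(vertex, messages, lambda v, d: degrees.__setitem__(v, d))
    return degrees

if __name__ == "__main__":
    print(vertex_degrees([(1, 2), (1, 3), (2, 3), (3, 4)]))  # {1: 2, 2: 2, 3: 3, 4: 1}
```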
Solution 5
In this solution, we analyze the performance measures of the algorithm
described in the solution for Exercise 4. By definition, this algorithm has
two iterations. We continue the analysis of the algorithm by calculating its
machine space and time complexities. Let us denote by n the number of
vertices in the graph. In the first iteration of the algorithm, every machine
stores one edge, which requires O(log n) space because every vertex can
be represented using O(log n) bits. Additionally, the processing done by
every machine consists of sending two messages, which requires O(1) time.
Consider now the second iteration of the algorithm. In this iteration, every
machine associated with a vertex v gets a message of constant size for every
edge hitting v, and has to count these messages. Thus, both the space used
by the machine and its running time are O(the degree of v). Combining our
findings for the two iterations of the algorithm, we get that the machine
time complexity of the algorithm is O(max{1, d}) = O(d), where d is the
maximum degree of any vertex; and the machine space complexity of the
algorithm is O(max {log n, d}) = O(log n + d).
Our next step is to analyze the total space complexity of the algorithm.
In the first iteration, every machine stores a different edge of the graph, and
thus, the total space used by the machines is O(m log n), where m is the
number of edges in the graph. In the second iteration, the machine named
after each vertex v has to store a single message for every edge hitting this
vertex, and additionally, it has to store its name (i.e., the vertex itself).
Thus, the total space complexity of the machines in this iteration is

O(n log n) + ∑_v O(degree of v) = O(n log n + m).
Combining the bounds we got on the total space complexity used in every iteration, we get that the total space complexity of the algorithm is

O(m log n) + O(n log n + m) = O((n + m) log n).
To complete the solution, we also need to analyze the work done by the
algorithm. In the first iteration, we have one machine for each edge, and
this machine uses a constant processing time, and thus, the work done in
this iteration is O(m). In the second iteration, we have one machine for
each vertex, and this machine does a work proportional to the degree of
this vertex. Thus, the total work done by all the machines in the second
iteration is

∑_v O(degree of v) = O(m).
Hence, the work done by the algorithm, which is the sum of the work done
by its two iterations, is O(m).
Solution 6
Recall that in the first iteration of the algorithm, each machine counts the
number of tokens it received of each color and forwards these counts to
machine 1. A pseudocode representation for this is given as Algorithm 7.
In this algorithm, we assume that the k colors are represented using the
numbers 1 to k.
In the second iteration, machine number 1 should combine all the counts
it receives and produce the output. This is done by the pseudocode of
Algorithm 8. In this pseudocode, we used Cj to denote the array C received
from machine j.
Let us now calculate the minimum space per machine S necessary for
this algorithm to work. In the first iteration, every machine stores up to
n/M tokens that it got plus the array C. If we assume that each token is
represented by its color, then storing a token requires only constant space
since we assumed that k is a constant. Additionally, observe that the array
C can be represented using O(k log n) = O(log n) bits because it consists of
a constant number of cells and each one of these cells stores an integer value
of at most n. Thus, the space complexity per machine required for the first
iteration is at most O(n/M +log n). Note also that this memory is sufficient
for storing the messages sent from the machine during the iteration because
these messages consist of the array C alone.
Consider next the second iteration of the above algorithm. In this
iteration, the algorithm keeps in memory M + 1 arrays, each requiring
O(log n) space, and thus, its space complexity is O(M log n). Once again
we observe that this memory is also large enough to store all the messages
Chapter 16
all the messages resulting from the appearances of w, this single machine
requires a lot of time and space when w is a frequent word, which results
in the potentially very bad machine time and space complexities of the
algorithm.
To improve the algorithm, we need to partition the work of counting
the messages resulting from the appearances of a word w between multiple
machines. Algorithm 1 does that by partitioning this work between ⌈√n⌉ machines named (w, 1), (w, 2), . . . , (w, ⌈√n⌉), where n is the number of words in the input. In the first iteration of this algorithm, which is formally given as Algorithm 1, every machine processes a single word of the input and sends the value 1 (which represents a single appearance) to a random machine out of the ⌈√n⌉ machines assigned to this word.
Let us denote by c_{w,r} the number of messages sent during the first iteration to machine (w, r). One can observe that the frequency of w is equal to the sum ∑_{r=1}^{⌈√n⌉} c_{w,r} because every appearance of the word w in the original list results in a single message being sent to one of the machines (w, 1), (w, 2), . . . , (w, ⌈√n⌉) during the first iteration. Thus, to calculate the frequency of w, we need to evaluate the above sum, which is done by the next two iterations of the algorithm. In the second iteration, each machine (w, r) counts the number c_{w,r} of messages that it has got and forwards the count to the machine named after the word w. Then, in the third iteration, every machine named after a word w gets ⌈√n⌉ counts from the machines (w, 1), (w, 2), . . . , (w, ⌈√n⌉) and sums up these counts to get the frequency of the word w. A formal description of these two iterations appears as Algorithm 2.
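A sequential Python sketch of these three iterations appears below; the dictionaries that route the messages are an assumption used only to simulate the machines, and ⌈√n⌉ intermediate machines are used per word, as in the description above.

```python
import math
import random
from collections import defaultdict

def word_frequencies(words):
    """Simulate the three iterations of the improved word-counting algorithm."""
    n = len(words)
    s = math.isqrt(n - 1) + 1 if n > 0 else 1   # ceil(sqrt(n)) machines (w, 1), ..., (w, s)

    # Iteration 1: every word is sent, as the value 1, to a random machine (w, r).
    first_inbox = defaultdict(list)
    for w in words:
        first_inbox[(w, random.randint(1, s))].append(1)

    # Iteration 2: machine (w, r) counts its messages and forwards the count to machine w.
    second_inbox = defaultdict(list)
    for (w, r), messages in first_inbox.items():
        second_inbox[w].append(len(messages))

    # Iteration 3: machine w sums up the counts it received, which is the frequency of w.
    return {w: sum(counts) for w, counts in second_inbox.items()}

if __name__ == "__main__":
    print(word_frequencies("the cat and the dog and the bird".split()))
```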
The correctness of the algorithm follows from its description, so it
remains to analyze its performance measures. In this analysis, we assume
for simplicity that each individual word takes a constant amount of
space.
We observe that the machine processing word number i (in the first iteration) can send a message to (w, r′) only when r_i = r′, and thus, the sum ∑_{i=1}^{n} X_i is an upper bound on the number of messages received by machine (w, r′) in the second iteration. Moreover, this sum has a binomial distribution, which allows us to bound the probability that it is larger than 2√n using the Chernoff bound.
Pr[∑_{i=1}^{n} X_i > 2√n] = Pr[∑_{i=1}^{n} X_i > (2√n / (n/√n)) · E[∑_{i=1}^{n} X_i]]
≤ e^{−(2√n/(n/√n) − 1) · E[∑_{i=1}^{n} X_i] / 3}
= e^{−(2√n − n/√n)/3} ≤ e^{−√n/3}.
So far we have proved that the probability that any given machine of the second iteration receives more than 2√n messages is at most e^{−√n/3}. We also observe that there are only n⌈√n⌉ machines that might get messages in the second iteration (one for every word of the input and a value that r can get). Combining these facts using the union bound, we get that the probability that some machine of the second iteration gets more than 2√n messages is at most n⌈√n⌉ · e^{−√n/3}, which approaches 0 as n approaches infinity.
Corollary 1. With high probability the machine time and space complexities of the Map-Reduce algorithm described by Algorithms 1 and 2 are O(√n) and O(√n · log n), respectively.
Proof. Let us denote by E the event that every machine of the second iteration receives only O(√n) messages. Lemma 1 shows that the event E
occurs with high probability. Thus, it is enough to prove that the machine
time and space complexities of the algorithm are as promised whenever E
happens. The rest of the proof is devoted to showing that this is indeed
the case.
In the first iteration, every machine stores only the word it processes
and the random value r, and thus, requires only O(log n) space. Moreover,
every such machine uses only constant time, and thus, both its running
time and space usage are consistent with the machine time and space
complexities that we would like to prove.
Consider now the second iteration of the algorithm. Since the time
complexity of every machine in the second iteration is proportional to the
number of messages it receives, we get that the machine time complexity
in the second iteration is O(√n) as required whenever E happens. Additionally, we observe that in the second iteration every machine has to store the counts that it gets as messages and one additional counter. Thus, when E happens, the machine has to store O(√n) counters taking O(log n) space each, which leads to a machine space complexity of O(√n · log n) as required.
Finally, let us consider the last iteration of the algorithm. In this iteration, every machine gets up to ⌈√n⌉ = O(√n) messages and adds them up, which requires O(√n) time. Moreover, since every message is an integer between 1 and n, storing all of them and their sum requires only O(√n · log n) space.
Corollary 1 shows that the machine time and space complexities of the
Map-Reduce algorithm we described for the problem of determining the
frequencies of all the words in the list are roughly the square root of
the length of the list (with high probability). This is of course a significant
improvement over the roughly linear machine time and space complexities
of the algorithm that was given for this problem in Chapter 15. However,
when n is very large, one might want to further reduce the machine time
and space complexities, which is the subject of Exercise 2.
(more formally, s_i = ∑_{j=1}^{i} A[j]). An easy sequential algorithm can calculate
all the sums s1 , s2 , . . . , sn in linear time (see Exercise 3), but the problem
becomes more challenging in parallel computation settings such as the Map-
Reduce model.
(Figure: the tree of machines used by the algorithm, shown here with root (3, 1), internal machines (2, 1), (2, 2), (1, 1), (1, 2) and (1, 3), and leaves (0, 1) through (0, 6).)
• Every machine has at most d children, and the height of the tree is the minimum height necessary for a tree with this property having n leaves, i.e., it is ⌈log_d n⌉.
• Each machine can calculate the pair of its parent machine.
• Each machine can calculate the pairs of its children machines. More
specifically, the children of machine (h, r) of level h > 0 are the
machines out of (h − 1, d(r − 1) + 1), (h − 1, d(r − 1) + 2), . . . , (h − 1, dr)
that exist.
• Every leaf (0, r) forwards its value A[r] to its parent. Note that this
value is indeed equal to the sum of the values in the leaves in the
subtree of (0, r) since (0, r) is the single leaf in this subtree.
• Each internal machine u with c children which is not the root waits until
it receives a value from every one of these children. Let us denote these
values by v1 , v2 , . . . , vc . Then, the machine forwards to its parent the
sum ∑_{i=1}^{c} v_i. Note that since the subtrees of the children are disjoint
and their union includes all the leaves in the subtree of u, this sum
is indeed the sum of the values of the leaves in u’s subtree given that
v1 , v2 , . . . , vc are the corresponding sums for the subtrees of u’s children.
• Every machine u waits until it knows sb(u) . The root knows sb(u) from
the start because it is always 0. Other machines are given sb(u) by their
parent at some point.
In the fourth and final stage of the algorithm, every leaf (0, r) adds its
value A[r] to sb((0,r)) = sr−1 , which produces the value sr that we have
been looking for.
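The following Python sketch simulates the four stages of this algorithm sequentially: an upward sweep that computes the sum of every subtree and a downward sweep that tells every subtree the sum of the values to its left, after which every leaf adds its own value. The recursive formulation and the way the leaves are split among at most d children are assumptions made for brevity and do not follow the machine-by-machine pseudocode.

```python
def all_prefix_sums(values, d=2):
    """Return [s_1, ..., s_n] where s_i is the sum of values[0..i-1] (inclusive),
    computed with the up-sweep / down-sweep structure of the tree-based algorithm."""
    n = len(values)
    result = [0] * n

    def subtree_sum(lo, hi):
        # The value a node forwards to its parent during the upward propagation.
        return sum(values[lo:hi])

    def down(lo, hi, before):
        # 'before' plays the role of s_{b(u)}: the sum of all values to the left
        # of the leaves covered by the current subtree (values[lo:hi]).
        if hi - lo == 1:
            result[lo] = before + values[lo]      # fourth stage: the leaf adds its value
            return
        size = -(-(hi - lo) // d)                 # split the leaves among at most d children
        prefix = before
        for child_lo in range(lo, hi, size):
            child_hi = min(child_lo + size, hi)
            down(child_lo, child_hi, prefix)      # the parent tells the child its 'before' sum
            prefix += subtree_sum(child_lo, child_hi)

    if n > 0:
        down(0, n, 0)
    return result

if __name__ == "__main__":
    print(all_prefix_sums([3, 1, 4, 1, 5, 9], d=3))   # [3, 4, 8, 9, 14, 23]
```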
Remark. Recall that the input for the All-Prefix-Sums problem consists of
n pairs, and each such pair contains a value that can be stored in O(1) space
(by our assumption) and an index which requires Θ(log n) space. Thus, even
just storing the input for this problem requires Θ(n log n) space, which
implies that the total space complexity proved by Lemma 3 is optimal.
Proof. Storing the value received by a node from each one of its children
requires O(log n) bits for the value itself, because the value is at most the
sum of all the n input values, plus O(log d) bits for the identity of the son.
Storing the value received by a node from its parent requires O(log n) bits
again, and thus, the total space required for storing all this information
(which is the machine space complexity of the algorithm) is only

d · O(log n + log d) + O(log n) = O(d log n).
Our next step is to upper bound the total space complexity of the algorithm.
In the first iteration, only the input machines are active, and they use
O(n log n) bits since they store n pairs and each pair requires O(log n) bits.
In the other iterations, only the tree machines are active. Every one of the
internal nodes of the tree requires O(d log n) bits as discussed above, and
the number of such nodes is at most
∑_{i=1}^{⌈log_d n⌉} ⌈n/d^i⌉ ≤ log_d n + 1 + ∑_{i=1}^{∞} n/d^i = log_d n + 1 + (n/d)/(1 − 1/d) = O(n/d).
Additionally, there are n leaves in T , and each one of these leaves requires
only O(log n) bits of space (since they do not have d sons). Thus, the total space complexity required by the algorithm is

O(n log n) + O(n/d) · O(d log n) + n · O(log n) = O(n log n).
Proof. Machines that are not internal nodes of the tree T need only
O(1) time per iteration. In contrast, the internal nodes of the tree are often
required to do some operations with respect to every one of their children,
and thus, they require more time. Moreover, a naı̈ve implementation of
these nodes will require Θ(d2 ) time per iteration because during the down
propagation of information, each internal node is required to calculate up
to d sums (the sum ∑_{j=1}^{i−1} v_j for every 1 ≤ i ≤ c, where c is the number
of children of the node), and the calculation of each such individual sum
might require up to Θ(d) time. However, it is not difficult to see that
the calculation of all these sums together can also be done in O(d) time
because this calculation is an instance of the All-Prefix-Sums problem
itself. Thus, no machine of the algorithm requires more than O(d) time per
iteration.
Recall that there are only 2n machines that are active in the algorithm
and are not internal nodes of the tree T — the n input machines and the
n leaves of the tree. As mentioned above, each one of these machines uses
O(1) time per iteration, and thus, their total work is at most
2n · O(1) · (# of iterations) = O(n) · O(logd n) = O(n logd n).
Next, we recall that in the proof of Lemma 3 we have seen that there
are only O(n/d) internal nodes in the tree T . Since each one of these nodes
requires only O(d) time per iteration, their total work is also at most
O(n/d) · O(d) · (# of iterations) = O(n) · O(logd n) = O(n logd n).
of O(log n), a total space complexity of O(n log n), a machine time com-
plexity of O(1) and a work of O(n log n).
16.3 Indexing
In the description of the All-Prefix-Sums problem, we have assumed that
every input value is accompanied by an index. Such an index is useful for
many problems beside All-Prefix-Sums, but in many cases it is not supplied
as part of the input. Thus, it is interesting to find a procedure that given
a set of n input elements assigns a unique index between 1 and n to every
one of these items. In this section, we will describe and analyze one such
procedure, which is given as Algorithm 3.
The final line of Algorithm 3 makes sense only when every element is
assigned a unique temporary index. Fortunately, Lemma 5 shows that this
happens with high probability.
By the union bound, the probability that at least one of the events Ei
happens is only
Pr[∪_{i=1}^{n} E_i] ≤ ∑_{i=1}^{n} Pr[E_i] ≤ ∑_{i=1}^{n} n^{−2} = n^{−1},
which completes the proof of the lemma since the temporary indexes are
unique when none of the events Ei happen.
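Although the pseudocode of Algorithm 3 is not reproduced here, a sequential sketch of the indexing idea it is based on might look as follows: every element draws a temporary index uniformly at random from a range of size n³ (a size consistent with the n³ pairs mentioned in the solution of Exercise 6 and with the failure probability of Lemma 5), and prefix sums over the non-zero indicator values then convert every temporary index into a final index between 1 and n. The retry on collision and all names below are illustrative assumptions.

```python
import random

def assign_indexes(elements):
    """Assign a unique index between 1 and n to every one of the n input elements."""
    n = len(elements)
    while True:
        # Temporary indexes from a range of size n**3 are all distinct with
        # probability at least 1 - 1/n; retrying on a collision is an addition
        # made here only to keep the sketch self-contained.
        temporary = [random.randint(1, n ** 3) for _ in elements]
        if len(set(temporary)) == n:
            break
    indicator = {t: 1 for t in temporary}      # the n non-zero values out of n**3 positions
    running, rank = 0, {}
    for t in sorted(indicator):                # prefix sums over the non-zero positions only
        running += indicator[t]
        rank[t] = running                      # the prefix sum at t is the final index
    return [rank[t] for t in temporary]

if __name__ == "__main__":
    print(assign_indexes(["a", "b", "c", "d"]))   # a permutation of [1, 2, 3, 4]
```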
Exercise Solutions
Solution 1
By definition, the Map-Reduce algorithm we consider has three iterations.
Let us now determine its total space complexity. In the first iteration, each
machine stores the word that it got and a number r that can be represented
using O(log n) bits. Since there are n active machines in this iteration, the
total space complexity of the machines in this iteration is O(n log n). In
the second iteration, each machine stores three things: the messages that
it got, the count of the number of these messages and its name. Both the
count and the name of the machine are represented using O(log n) bits,
and thus, the counts and the names of all the machines together contribute
at most O(n log n) to the total space complexity of the second iteration
(note that at most n machines were sent messages during the first iteration
and these are the only machines that execute code in this iteration). To
determine the space required for the messages received by the machines
of the second iteration, we note that each such message takes O(1) space
because it always contains only the value 1 and that the number of such
messages is exactly n because one message is generated for every word in
the input. Thus, these messages contribute only O(n) to the total space
complexity of the second iteration, which is dominated by the O(n log n)
required for the counts and names.
Consider now the third iteration of the Map-Reduce algorithm. In this
iteration, each machine stores the counts that it got plus the frequency
that it calculates. The total number of frequencies calculated is at most n
because one frequency is calculated for every distinct word. To determine
the number of counts received, we observe that every count that is received
represents at least one word of the original text because a count is only
generated by machines of the second iteration that received inputs. Thus,
the number of counts received by the machines of the third iteration is also
bounded by n, like the number of frequencies calculated by these machines.
Since both the frequencies and the counts received are integers between 1
and n, each one of them can be stored using O(log n) space, which means
that the total space complexity of the third iteration is O(n log n). Since
this was also the total space complexity of the two previous iterations, it
follows that the total space complexity of the entire algorithm is O(n log n).
To complete the solution, the work of the Map-Reduce algorithm
remains to be determined. In the first iteration, we have n machines,
and each one of them uses constant time; thus, this iteration contributes
O(n) to the work. In the second iteration, the time complexity of each
machine is proportional to the number of messages it receives. Since only n
messages are sent during the first iteration (one for every word of the input),
this means that the total contribution to the work by all the machines
of the second iteration is O(n). Similarly, in the third iteration, the time
complexity of each machine is proportional to the number of counts that it
receives. Above, we argued that the number of these counts is at most n,
and thus, we get once more that the contribution of all the machines of the
third iteration to the work is O(n). Summing up the contributions to the
work of all three iterations, we get that the work of the entire Map-Reduce
algorithm is also O(n).
Solution 2
Before describing the solution for the current exercise, let us discuss
briefly the Map-Reduce algorithm represented by Algorithms 1 and 2.
Specifically, let us fix an arbitrary word w from the input. Figure 16.2
graphically presents all the machines that process the appearances of this
word, either directly or indirectly. The first row in the figure hosts the
machines of the first round that process the individual occurrences of the
word w in the input. The second row in the figure hosts the machines
of the second iteration that are related to this word, i.e., the machines
(w, 1), (w, 2), . . . , (w, ⌈√n⌉). Finally, the third row in the figure hosts the
single machine of the third iteration that is related to the word w, which
is the machine named w. The figure also contains arrows representing the
flow of messages between the above machines.
Figure 16.2. The machines related to the processing of a word w in the Map-Reduce
algorithm described by Algorithms 1 and 2, and the flow of messages between them.
c_{w,r_1,r_2,...,r_{c−i+1}} = ∑_{r_{c−i+2}=1}^{n^{1/c}} c_{w,r_1,r_2,...,r_{c−i+2}}
= ∑_{r_{c−i+2}=1}^{n^{1/c}} |{a ∈ A | ∀_{1≤j≤c−i+2} r_j(a) = r_j}|
= |{a ∈ A | ∀_{1≤j≤c−i+1} r_j(a) = r_j}|,
where the second equality holds by the induction hypothesis, and the last equality holds since the summation index iterates over all the possible values for r_{c−i+2}(a).
This completes the proof by induction. The lemma now follows since for
i = c + 1, we got
Proof. Let us begin the proof by fixing some word w and integers r′_1, r′_2, . . . , r′_{c−1} between 1 and n^{1/c}. We would like to upper bound the probability that machine (w, r′_1, r′_2, . . . , r′_{c−1}) is sent more than 2n^{1/c} messages during the first iteration. For this purpose, let us denote by r_j^i, for every pair of integers 1 ≤ i ≤ n and 1 ≤ j ≤ c − 1, the random value r_j chosen during the first iteration by the machine processing word number i of the list. We also denote by X_i an indicator for the event that r_j^i = r′_j for every 1 ≤ j ≤ c − 1. Since every value r_j^i is an independent uniformly random integer out of the range [1, n^{1/c}], we get

Pr[X_i = 1] = Pr[∀_{1≤j≤c−1} r_j^i = r′_j] = ∏_{j=1}^{c−1} Pr[r_j^i = r′_j] = ∏_{j=1}^{c−1} 1/n^{1/c} = (n^{1/c})^{1−c}.
We observe that the machine processing word number i (in the first iteration) can send a message to (w, r′_1, r′_2, . . . , r′_{c−1}) only when X_i = 1, and thus, the sum ∑_{i=1}^{n} X_i is an upper bound on the number of messages received by machine (w, r′_1, r′_2, . . . , r′_{c−1}) in the second iteration. Moreover, this sum has a binomial distribution, which allows us to bound the probability that it is larger than 2n^{1/c} using the Chernoff bound.
Pr[∑_{i=1}^{n} X_i > 2n^{1/c}] = Pr[∑_{i=1}^{n} X_i > 2n^{1/c−1} · (n^{1/c})^{c−1} · E[∑_{i=1}^{n} X_i]]
≤ e^{−(2n^{1/c−1}·(n^{1/c})^{c−1} − 1) · E[∑_{i=1}^{n} X_i]/3}
= e^{−(2n^{1/c} − n·(n^{1/c})^{1−c})/3} ≤ e^{−n^{1/c}/3}.
Corollary 4. With high probability the machine time and space complex-
ities of Algorithm 4 are O(n1/c ) and O(n1/c · log n), respectively.
Proof. Let us denote by E the event that every machine of the second
iteration receives only O(n1/c ) messages. Lemma 1 shows that the event E
occurs with high probability. Thus, it is enough to prove that the machine
time and space complexities of the algorithm are as promised whenever E
happens. The rest of the proof is devoted to showing that this is indeed the
case, and thus, implicitly assumes that E occurred.
In the first iteration, every machine stores only the word it processes
and the random values r1 , r2 , . . . , rc−1 , and thus, requires only O(c·log n) =
O(log n) space, where the equality holds since we treat c as a constant.
Moreover, every such machine uses only constant time, and thus, both its
running time and space usage are consistent with the machine time and
space complexities that we would like to prove.
Consider now another arbitrary iteration of the algorithm. Observe that the time complexity of every machine in this iteration is proportional to the number of messages it receives, thus, to show that the machine time complexity in this iteration is O(n^{1/c}), it is only required to prove that no machine of the iteration gets more than O(n^{1/c}) messages. For iteration number 2, that follows immediately from our assumption that the event E has occurred. For later iterations, we observe that every machine (w, r′_1, r′_2, . . . , r′_{c−i+1}) of iteration i gets a message only from machines (w, r_1, r_2, . . . , r_{c−i+2}) of iteration i − 1 whose name obeys r_j = r′_j for every 1 ≤ j ≤ c − i + 1. Since there are only n^{1/c} values that r_{c−i+2} can take, there are only n^{1/c} such machines, and therefore, there are also only that many messages that machine (w, r′_1, r′_2, . . . , r′_{c−i+1}) can receive.
Consider now the machine space complexity of iteration i ≥ 2. Every
machine in this iteration has to store its name, the counts that it gets as
messages and one additional counter. As we argued before, each machine
gets at most n1/c messages. We also observe that each value that it
needs to store counts some of the appearances of a word w in the input
list, and thus, is upper bounded by n. Thus, the space required for the
counts stored by every single machine in iteration i is upper bounded
by (n1/c + 1) · O(log n) = O(n1/c · log n). Additionally, the name of the
machine consists of the word w, which requires a constant space, and up
to c − 1 additional values which require O(log n) space each, and thus, the
name can be stored in O(c · log n) = O(log n) space, which is less than the
machine space complexity that we need to prove.
Solution 3
The algorithm we suggest for the problem is given as Algorithm 5. It is
not difficult to see that the time complexity of this algorithm is indeed
O(n). Moreover, one can prove by induction that the sums calculated by
this algorithm obey s_i = ∑_{j=1}^{i} A[j] for every 1 ≤ i ≤ n.
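For completeness, such a sequential algorithm can be written as a single pass that maintains a running sum, as in the following short Python sketch.

```python
def all_prefix_sums(A):
    """Return [s_1, ..., s_n], where s_i is the sum of the first i entries of A,
    using a single O(n)-time pass over the array."""
    sums, running = [], 0
    for value in A:
        running += value
        sums.append(running)
    return sums

if __name__ == "__main__":
    print(all_prefix_sums([3, 1, 4, 1, 5]))   # [3, 4, 8, 9, 14]
```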
Solution 4
The following is a formal presentation of the Map-Reduce algorithm
described before the exercise. The first iteration corresponds to the first
stage of the algorithm, iterations 2 up to ⌈log_d n⌉ + 1 correspond to the second stage of the algorithm, iterations ⌈log_d n⌉ + 2 up to 2⌈log_d n⌉ + 1 correspond to the third stage of the algorithm, and finally, iteration number 2⌈log_d n⌉ + 2 corresponds to the fourth stage of the algorithm.
• First iteration
Description: In this iteration, every machine that got an input pair of
index and value forwards the value to the leaf machine of T corresponding
to the index. Note that we use T (h, r) to denote the machine of T
corresponding to the pair (h, r).
1. The input of the machine in this iteration is a pair (index, value).
2. Forward value to the machine named T (0, index ).
• Second iteration
Description: In this iteration we begin the upward propagation of
information. Specifically, the leaf machines, which are the only machines
that got messages (and thus, are active) in this iteration forward their
values to their parents. Starting from this point, we use the convention
that a machine keeps in a pair (i, v) the value v that it got from its zero-
based ith son. The sole exception to this rule is the leaf machines which
have no sons, and thus keep in (0, v) their own value v as follows:
1. Let T (0, r) be the name of this leaf machine, and let us name the value
it received by v.
2. Forward the pair (0, v) to myself (to keep it) and the pair (r mod d, v)
to my parent.
• Iteration number i for 3 ≤ i ≤ ⌈log_d n⌉ + 1
Description: These iterations complete the upward propagation of infor-
mation. In the beginning of iteration i only the machines in levels
0, 1, . . . , i − 2 of the tree have messages (and thus, are active). They all
forward these messages to themselves to keep them for the next iteration,
and the nodes at level i − 2 forward information also to their parents,
which have not been active so far.
1. Let T (h, r) be the name of this machine.
2. Let us denote by c the number of my children (or 1 if I am a leaf),
then I received (either from myself or from my children) c pairs
(0, v0 ), (1, v1 ), . . . , (c − 1, vc−1 ).
3. Forward all the pairs (0, v0 ), (1, v1 ), . . . , (c − 1, vc−1 ) to myself, to keep
them.
4. if h = i − 2 then
5. Forward the pair (r mod d, ∑_{i=0}^{c−1} v_i) to my parent machine T(h + 1, ⌈r/d⌉).
Solution 5
The term logd n in the bound on the work proved by Lemma 4 follows from
the fact that each machine of the Map-Reduce algorithm might operate
during O(logd n) iterations. However, one can note that only in a constant
number out of these iterations a given machine does any job other than
sending its information to itself in order to preserve it. Iterations in which
a machine only sends information to itself become redundant if one allows a
machine access to the information that it had in previous iterations. Thus,
such access will allow us to reduce the number of iterations in which every
given machine is active to O(1); which will result in the removal of the
logd n term from the bound on the work done by the algorithm.
Solution 6
The algorithm we described in Section 16.2 for the All-Prefix-Sums problem
was designed under the assumption that some value is given for every index
between 1 and n. To lift this assumption, we modify the algorithm a bit.
Specifically, in the stage in which the algorithm propagates information
downwards, we make the nodes forward information only to the children
from which they got information during the upward propagation phase of
the algorithm (rather than sending a message to all their children as in
the original algorithm). One can verify that after this change the algorithm
still produces a value si for every index i for which a value was supplied
as part of the input, and moreover, si is still correct in the sense that it
is the sum of the values belonging to index i and lower indexes. Thus, the
omission of pairs with zero values does not affect the si values produced by
this algorithm for pairs that have not been omitted.
In particular, the above observation implies that, when using the
above modified algorithm for All-Prefix-Sums in an implementation of
Algorithm 3, the implementation can skip the generation of the zero valued
pairs. Since only n out of the n³ pairs generated by Algorithm 3 have a non-zero value (one pair for every input element of Algorithm 3), this greatly
reduces the number of pairs generated by the algorithm. In the rest of this
answer, we refer to the implementation of Algorithm 3 obtained this way
as the “efficient implementation”.
Chapter 17
Graph Algorithms
Up to this point in the book, all the Map-Reduce algorithms we have seen
operated on inputs with very little combinatorial structure (essentially, all
the algorithms operated on either sets or ordered lists of elements). To
demonstrate that the Map-Reduce framework can also handle inputs with a
significant combinatorial structure, we present in this chapter Map-Reduce
algorithms for two graph problems.
edges that appear before e in the order π (because every edge e′ of such a path would have been added to Fi−1 when considered by the algorithm unless Fi−1 had already contained at this point a path between the end points of e′ consisting of edges that appear even earlier than e′ in π). If
e ∈ Ei, then from the facts that e ∉ E(Fi) and that Fi is a minimum weight
spanning tree of (V, Ei ), we get that Fi must contain a path between the
end points of e consisting solely of edges which are at most as heavy as e.
However, such edges must appear before e in the order π by the choice of
this order, which contradicts our previous observation that the end points
of e cannot be connected in Ei−1 by a path of edges that appear before e in
π. Similarly, if e ∉ Ei, then we get that for some 1 ≤ j ≤ pi we must have had e ∈ Ei−1,j \ Fi,j, and since Fi,j is a minimum weight spanning tree of (V, Ei−1,j), this implies that Fi,j contains a path between the end points
of e consisting solely of edges which are at most as heavy as e. Since all
these edges belong to Ei , they all appear before e in the order π, and once
again the existence of this path leads to a contradiction for the same reason
as before.
where all the equalities other than the first one hold due to Lemma 1. Thus,
the spanning tree T is a minimum weight spanning tree of G because it has
the same weight as the minimum weight spanning forest F0 .
Proof. Only edges having u as one of their end points can be sent to
L(u) during the pre-processing step. The observation now follows because
the degree of every node, including u, is upper bounded by n − 1.
$$= e^{-(2M - |E_{i-1}|/p_i)/3} \leq e^{-M/3} \leq e^{-n},$$
where the last two inequalities follow from our assumption that M ≥ 3n.
Corollary 2. With probability 1 − O(ne^{−n} log n), all the machines whose
name is of the form L(i, j) get at most 2M edges each.
Proof. Lemma 2 shows that for every iteration i of Algorithm 1,
conditioned on the random decisions made in the previous iterations, every
given machine L(i, j) gets at most 2M edges with probability 1 − O(e^{−n}).
Thus, by the union bound, conditioned on the same assumption we get that
the probability that some machine L(i, j) gets more than 2M edges in
iteration i is at most
$$p_i \cdot O(e^{-n}) = \left\lceil \frac{|E_{i-1}|}{M} \right\rceil \cdot O(e^{-n}) \leq \frac{n^2}{n} \cdot O(e^{-n}) = O(ne^{-n}),$$
where the inequality follows from our assumption that M ≥ 3n and the
observation that the number m of edges in the graph G is upper bounded
by n². We now note that since we obtained an upper bound on this
probability that applies conditioned on any fixed choice of the random
decisions in the previous iterations, the law of total probability allows us to
remove the condition. Hence, the probability that for a fixed i any machine
where the last equality follows again from the assumption that M ≥ 3n.
Lemma 3. In every iteration of Algorithm 1, the counter machine gets
values from at most O(n) machines.
Proof. We begin the proof by observing that in the first iteration of
Algorithm 1, the counter machine can get values only from the machines
whose name is L(u) for some vertex u. Since there are only n such machines,
they send to the counter only n values during this iteration. Thus, in the rest
of the proof we only need to consider the other iterations of Algorithm 1.
In the beginning of iteration i ≥ 2 of Algorithm 1, the number of
machines that have edges is pi−1 (specifically, these are the machines L(i −
1, 1), L(i − 1, 2), . . . , L(i − 1, pi−1 )). Since every one of these machines sends
a single value to the counter machine during the iteration, the number of
messages received by the counter machine during the iteration is
$$p_{i-1} = \left\lceil \frac{|E_{i-2}|}{M} \right\rceil \leq \frac{n^2}{3n} + 1 = O(n).$$
counter to the counter machine, and each such counter is a number of value
at most n², and thus, can be stored using O(log n) bits. Hence, the total
space required for storing these counters is again O(m log n), and thus, the
total space complexity of Algorithm 1 is bounded by this expression.
Let us now bound the work done by Algorithm 1. Observe that the
time complexity of each machine used by Algorithm 1 is either linear
in the number of edges and counters it gets or dominated by the time
required to find a minimum spanning forest in the graph consisting of the
edges it gets. Thus, if we use Kruskal’s algorithm to find the minimum
spanning forest, then a machine that gets m′ edges has a time complexity
of O(m′ log m′) = O(m′ log n). Together with the above observation that
all the machines together get in a single iteration of Algorithm 1 only
O(m) edges and counters, this implies that Algorithm 1 does O(m log n)
work in each iteration. Using now the bound we have proved above on the
number of iterations that this algorithm makes, we get that the work it does
throughout its execution is upper bounded by O(m log n · log_{M/(2n)} n) =
O(m log n · log_{M/n} n).
Theorem 1 summarizes the properties of Algorithm 1 that we have
proved.
Figure 17.1. A graph with three triangles: {a, b, c}, {a, c, d} and {c, d, e}.
this algorithm indeed outputs every triangle of the graph exactly once, and
that it runs in O(n³) time because every one of its loops iterates at most
n times.
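The pseudocode of Algorithm 2 itself is not reproduced in this excerpt, so the following sketch should be read only as one natural rendering of the brute-force approach discussed here: it scans every triple of vertices once, which makes the O(n³) time bound and the uniqueness of each reported triangle immediate. The example edge set is an assumed encoding of the graph of Figure 17.1.

# A brute-force triangle-listing sketch in the spirit of the discussion above
# (the exact pseudocode of Algorithm 2 is not reproduced in this excerpt).
# Vertices are integers and `edges` is a list of unordered pairs.

from itertools import combinations

def list_triangles_naive(n, edges):
    """List every triangle of the graph exactly once in O(n^3) time."""
    edge_set = {frozenset(e) for e in edges}
    triangles = []
    # Every triple i < j < k is examined exactly once, so no triangle can be
    # reported twice, and each of the three nested loops iterates at most n times.
    for i, j, k in combinations(range(n), 3):
        if (frozenset((i, j)) in edge_set
                and frozenset((i, k)) in edge_set
                and frozenset((j, k)) in edge_set):
            triangles.append((i, j, k))
    return triangles


# One possible edge set consistent with the caption of Figure 17.1, encoding
# a = 0, b = 1, c = 2, d = 3, e = 4.
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (2, 3), (2, 4), (3, 4)]
print(list_triangles_naive(5, edges))  # [(0, 1, 2), (0, 2, 3), (2, 3, 4)]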
Despite the simplicity of Algorithm 2, the dependence of its time
complexity on n is optimal because a graph with n vertices might have
Θ(n³) triangles, and thus, this is the minimal amount of time necessary for
listing them all. One can observe, however, that only very dense graphs
can have so many triangles, which leaves open the possibility of faster
algorithms for counting triangles in sparse graphs. Lemma 4 makes this
intuitive observation more concrete by upper bounding the number of
triangles that a graph with m edges might have.
Let us now consider the triangles of the second type. We logically assign
every such triangle to one of the low-degree vertices it includes (a triangle of
the second type includes such a vertex by definition). One can observe that
the number of triangles in which a vertex u participates is upper bounded
by (deg(u))² — because this expression upper bounds the number of ways
in which we can group the neighbors of u into pairs — and thus, (deg(u))²
also upper bounds the number of triangles that are logically assigned to u.
Since every triangle of the second type is logically assigned to some low-
degree vertex, we can upper bound the number of triangles of this type by
the expression
$$\sum_{\substack{v \in V \\ \deg(v) < \sqrt{m}}} (\deg(v))^2 \leq \sum_{\substack{v \in V \\ \deg(v) < \sqrt{m}}} \sqrt{m} \cdot \deg(v) \leq \sqrt{m} \cdot 2m = O(m^{3/2}).$$
This completes the proof of the lemma since we have shown that the number
of triangles of the first type is also upper bounded by the same asymptotic
expression.
input graph is a star. Recall that a star is a graph in which there is a single
vertex (known as the center of the star) that is connected by edges to all
the other vertices and there are no additional edges.
If the center of the star happens to be toward the end of the array V ,
then the time complexity of Algorithm 3 will be on the order of Θ(n) =
Θ(m) because every vertex has very few neighbors appearing after it in the
array V . In contrast, if the center of the star appears toward the beginning
of the array V , then the center of the star has Θ(n) neighbors appearing
after it in the array, leading to a time complexity on the order of Θ(n²),
which is much larger than O(m^{3/2}) = O(n^{3/2}).
The above discussion shows that Algorithm 3 performs exceptionally
well when the high-degree center of the star appears toward the end of the
array V , but performs much worse when this high-degree vertex appears
early in V . This intuitively suggests that it might be beneficial to sort the
vertices in V according to their degrees because this will imply that the
low-degree vertices will tend to appear at the beginning of V and the high-
degree vertices will tend to appear at the end of V . In Exercise 6, you are
asked to prove that this intuitive suggestion indeed works, and that it allows
the time complexity of Algorithm 3 to roughly match the bound implied
by Lemma 4.
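As a concrete, non-authoritative rendering of this suggestion, the sketch below first sorts the vertices by degree and then, for every vertex, scans only the pairs of its neighbors that appear later in the sorted array V; by the argument of Lemma 4 and Exercise 6, the total number of scanned pairs is O(m^{3/2}). The tie-breaking rule and the data structures are choices made for this sketch.

# A sketch of the degree-ordered variant of the listing procedure discussed
# above. The tie-breaking rule and the data structures are illustrative
# choices; the book's Algorithm 3 itself is not reproduced in this excerpt.

from itertools import combinations

def list_triangles_degree_ordered(edges):
    """List each triangle once; the number of scanned pairs is O(m^{3/2})."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Sort the vertices by degree (ties broken by label); `position` records
    # where each vertex ends up in the sorted array V.
    order = sorted(neighbors, key=lambda u: (len(neighbors[u]), u))
    position = {u: i for i, u in enumerate(order)}

    edge_set = {frozenset(e) for e in edges}
    triangles = []
    for u in order:
        # Only neighbors appearing after u in the sorted array are scanned.
        later = [v for v in neighbors[u] if position[v] > position[u]]
        for v, w in combinations(later, 2):
            if frozenset((v, w)) in edge_set:
                triangles.append({u, v, w})
    return triangles


edges = [(0, 1), (0, 2), (1, 2), (0, 3), (2, 3), (2, 4), (3, 4)]
print(list_triangles_degree_ordered(edges))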
At this point, we would like to use the ideas developed in the above
discussion to get a good Map-Reduce algorithm for triangle listing. As
usual, we assume that each edge of the graph is originally located in a
should be reported as such. Otherwise, the edge (v, w) does not exist, and
thus, any suspected triangle reaching M (v, w) should be simply discarded.
1 Make sure that you understand why that is the case. In particular, note that unlike in
Exercise 6, the work done by the Map-Reduce algorithm does not depend on n. This is
a consequence of the fact that a vertex that does not appear in any edge is completely
missing from the input of the Map-Reduce algorithm.
Exercise Solutions
Solution 1
For every 1 ≤ j ≤ pi, the forest Fi,j contains at most n − 1 edges because
it is a forest with n vertices. Thus,
$$|E_i| = \left|\bigcup_{j=1}^{p_i} E(F_{i,j})\right| = \sum_{j=1}^{p_i} |E(F_{i,j})| \leq p_i \cdot (n-1) \leq \frac{2n}{M} \cdot |E_{i-1}|, \qquad (17.1)$$
where the second equality holds since the forests Fi,1 , Fi,2 , . . . , Fi,pi are
edge-disjoint and the second inequality holds since the fact that Algorithm 1
started the ith iteration implies that the size of Ei−1 was at least M .
Let us now explain why Inequality (17.1) implies an upper bound on
the number of iterations performed by Algorithm 1. Assume by way of
contradiction that Algorithm 1 starts an iteration number î with î − 1 ≥ log_{M/(2n)} m.
Then, repeated application of Inequality (17.1) yields
$$|E_{\hat{\imath}-1}| \leq (2n/M)^{\hat{\imath}-1} \cdot |E_0| \leq (2n/M)^{\log_{M/(2n)} m} \cdot |E_0| = \frac{|E_0|}{m} = \frac{|E|}{m} = 1.$$
However, this implies |Eî−1| < M (since we assume M ≥ 3n), and thus,
contradicts our assumption that the algorithm started iteration number î.
Solution 2
The Map-Reduce implementation of Algorithm 1 repeatedly applies the
following three Map-Reduce iterations. Note that a machine executing the
code of the first of these Map-Reduce iterations stops the algorithm and
generates an output when it gets the message “terminate” from the previous
Map-Reduce iteration. One can also verify that only a single machine
(named L(i, 1) for some i) gets the message “terminate”, and thus, the
algorithm always outputs a single forest.
Iteration 3. Let t be the value obtained from the “counter” machine, and
calculate p = ⌈t/M⌉. Additionally, determine i as follows: if this machine
has a name of the form L(i, j), then use the value of i from this name.
Otherwise, if this machine is an input machine, let i = 0. Then, for every
Solution 3
In this solution, we assume that every machine of the form L(i, j) gets at
most 2M edges, which happens with high probability by Corollary 2.
Observe that every input machine of Algorithm 1 receives a single edge,
and then, it only needs to send this edge to the appropriate machines of
the form L(u). Hence, the space required by the input machine is the space
required for storing a single edge, which is O(log n), and the time required
for the machine is constant.
Consider now a machine whose name is of one of the forms L(u) or
L(i, j). Such a machine gets at most 2M edges by Observation 1 and the
assumption we have made above. Additionally, the machine gets the value
pi , which is a number of value at most n. Thus, all the information that
the machine gets can be stored in O(M log n) space. Since the machine
does not require significant additional space for its calculations, this is also
the space complexity used by this machine. Finally, we observe that the
time complexity of the machine is upper bounded by O(M log n) because
the calculations done by the machine are dominated by the time required
to compute a minimum spanning forest in the graph containing the edges
of the machine, which can be done in O(min{M, m} log min{M, m}) =
O(M log n) time using Kruskal’s algorithm.
The counter machine remains to be considered. Every time that this
machine gets messages, it gets at most n messages by Lemma 3. Since each
one of these messages is a count of edges, they are all numbers of value at
most n2 . Hence, all the values this machine gets can be stored in O(n log n)
space, and so can their sum, which is the value calculated by the counter
machine. We also observe that the time complexity of the counter machine
is linear in the number of messages it gets and, thus, is upper bounded by
O(n).
Combining all the above, we get that under our assumption the machine
space and time complexities of Algorithm 1 are
max{O(log n), O(M log n), O(n log n)} = O(M log n)
and
max{O(1), O(M log n), O(n)} = O(M log n),
respectively.
Solution 4
Algorithm 1 uses randomness to partition in iteration i the set Ei−1 into
few subsets that contain O(M ) edges each. The use of randomness for that
purpose is natural since the algorithm does not assume anything about the
way in which the edges of Ei−1 are partitioned between the machines before
the iteration begins. However, one can observe that our implementation of
Algorithm 1 gives a natural upper bound of n on the number of edges stored
in each one of these machines. Thus, by grouping these machines into groups
of a given size and then combining all the edges of the machines in a group,
we can deterministically get a partition of Ei−1 into few sets of size O(M )
each.
Following is a Map-Reduce algorithm based on the above idea. The
algorithm starts with the pre-process iteration, which is similar to the pre-
process iteration from the implementation of Algorithm 1, but sends the
input edge (u, v) to machine L(0, u) rather than L(u). We assume here that
every vertex u is represented by a number between 1 and n, and thus, the
machines that get edges during this pre-process iteration are the machines
L(0, j) for every integer j between 1 and n.
Pre-process Iteration (performed by the input machines): Send the
input edge (u, v) to the machine L(0, u).
After the pre-process iteration, the algorithm performs repeatedly the
next iteration until one of the machines stops the execution. Intuitively,
this iteration corresponds to the three Map-Reduce iterations from the
implementation of Algorithm 1. We note that we are able to combine here
the three iterations into a single one because the current algorithm can
calculate based on i a bound on the number of machines that have edges,
which removes the need to calculate pi .
Repeated Iteration: Let L(i, j) be the name of this machine, and
calculate a minimum weight spanning forest F of the edges of G that this
machine got at the end of the previous iteration. If i ≥ log_{M/(2n)} n, stop
the algorithm at this point and output the forest F . Otherwise, forward the
edges of F to L(i + 1, ⌈jn/M⌉).
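The following sequential simulation of this deterministic variant may help clarify the grouping step; machine names become dictionary keys, a small Kruskal routine plays the role of each machine's local computation, and the stopping rule i ≥ log_{M/(2n)} n is applied literally. It is a sketch of the idea under these assumptions, not the book's exact implementation.

# A sequential simulation of the deterministic filtering procedure described
# above. Machines L(i, j) become dictionary keys and a small Kruskal routine
# plays the role of each machine's local computation. This is an illustrative
# sketch under these assumptions, not the book's exact implementation.

import math

def spanning_forest(edges):
    """Kruskal's algorithm: keep a minimum weight spanning forest.
    Edges are triples (weight, u, v)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def filtering_mst(edges, n, M):
    """edges: triples (weight, u, v) with u, v in {1, ..., n}; assumes M >= 3n."""
    # Pre-process iteration: the edge (u, v) is sent to machine L(0, u).
    machines = {}
    for w, u, v in edges:
        machines.setdefault(u, []).append((w, u, v))

    i = 0
    limit = math.log(n, M / (2 * n))       # stop once i >= log_{M/(2n)} n
    while True:
        forests = {j: spanning_forest(buf) for j, buf in machines.items()}
        if i >= limit:
            # By the argument in the text a single machine remains here;
            # merging the (usually unique) forest is a harmless safeguard.
            return spanning_forest([e for f in forests.values() for e in f])
        # Repeated iteration: L(i, j) forwards its forest to L(i+1, ceil(jn/M)).
        machines = {}
        for j, forest in forests.items():
            machines.setdefault(math.ceil(j * n / M), []).extend(forest)
        i += 1


edges = [(1.0, 1, 2), (2.0, 2, 3), (3.0, 1, 3), (4.0, 3, 4)]
print(filtering_mst(edges, n=4, M=12))     # [(1.0, 1, 2), (2.0, 2, 3), (4.0, 3, 4)]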
Proof. For i = 0, the lemma holds since machine L(0, u) gets only edges
incident to u, and there can be at most n such edges. Consider now some
integer i ≥ 1, and consider an arbitrary machine L(i, j). This machine gets
edges only from machines L(i − 1, j′) such that
⌈j′n/M⌉ = j.
One can observe that there can be only ⌈M/n⌉ such machines. Moreover,
the number of edges forwarded to L(i, j) from each one of these machines
is upper bounded by n since each machine forwards a forest. Thus, the
number of edges L(i, j) receives is at most ⌈M/n⌉ · n ≤ M + n = O(M).
for larger values of j. Thus, the range of j values for which the machine
L(i, j) gets edges consists of the integers between 1 and n when i = 0 and
reduces by a factor of M/(2n) every time that i increases by 1 unless its size
was already upper bounded by M/n, in which case it reduces to 1. Hence,
for i ≥ log_{M/(2n)} n (and thus, also when the algorithm stops) there is only
a single machine L(i, j) which has edges.
Solution 5
Let n′ be the largest integer such that n′(n′ − 1)/2 ≤ m, and let us assume
that m is large enough to guarantee that n′ ≥ 3 (since the exercise only
asks us for a graph with Ω(m) triangles, it suffices to consider large enough
values of m).
Consider now an arbitrary graph that contains a clique of n′ vertices
and m edges. Clearly, there is such a graph with any large enough number
n of vertices (as long as we allow some vertices to have a degree of zero).
Additionally, since the definition of n′ guarantees that n′(n′ + 1) > 2m, the
number of triangles in the clique of this graph is C(n′, 3) = n′(n′ − 1)(n′ − 2)/6 = Ω(m^{3/2}),
since n′(n′ + 1) > 2m implies n′ = Ω(√m).
Solution 6
Given a vertex u ∈ V , let N (u) be the set of neighbors of u that appear
after u in the array V . Note that since we assume that V is sorted, the
vertices in N (u) must have a degree which is at least as high as the degree
of u. Let us say, like in the proof of Lemma 4, that a vertex is a high-degree
vertex if its degree is at least √m, and a low-degree vertex otherwise.
Our next objective is to prove the inequality
$$|N(u)| \leq \min\{2\deg(u),\ 2\sqrt{m}\}. \qquad (17.2)$$
Solution 7
Following is a detailed implementation in the Map-Reduce model of the
algorithm described before the exercise.
Iteration 4. For every machine whose name is of the form M (u, v), if this
machine got an edge from the previous iteration and at least one individual
vertex, then for every individual vertex w that it received, it outputs the
triangle {u, v, w}.
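Putting the four iterations together, the following sequential sketch simulates the whole Map-Reduce triangle-listing algorithm on a small graph. The dictionaries stand in for the machines M(u) and M(u, v), the order ≺ is taken to be "by degree, ties broken by vertex label", and the exact message format is an assumption of this sketch (the book's version sends each edge, or a part of it, to four machines).

# A sequential sketch of the Map-Reduce triangle-listing algorithm whose last
# iteration is described above. Dictionaries stand in for the machines M(u)
# and M(u, v); the order "≺" is taken to be "by degree, ties broken by label",
# and the exact message format is an assumption of this sketch.

from itertools import combinations

def mapreduce_triangles(edges):
    edges = [tuple(sorted(e)) for e in edges]
    edge_machines = set(edges)                 # M(u, v) holds its edge

    # Iterations 1-2: every M(u) learns its neighbors and their degrees.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    degree = {u: len(vs) for u, vs in neighbors.items()}
    key = lambda u: (degree[u], u)             # the order ≺

    # Iteration 3: M(u) sends u to M(v, w) for every pair of neighbors v, w
    # of u that are larger than u according to ≺ (a "suspected" triangle).
    suspected = {}
    for u, vs in neighbors.items():
        later = [v for v in vs if key(v) > key(u)]
        for v, w in combinations(later, 2):
            suspected.setdefault(tuple(sorted((v, w))), []).append(u)

    # Iteration 4: M(v, w) reports a triangle for every individual vertex it
    # received, provided it also holds the edge (v, w).
    return [{v, w, u}
            for (v, w), vertices in suspected.items() if (v, w) in edge_machines
            for u in vertices]


edges = [(0, 1), (0, 2), (1, 2), (0, 3), (2, 3), (2, 4), (3, 4)]
print(mapreduce_triangles(edges))  # the three triangles of the example graph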
Solution 8
Let us consider the four iterations of the algorithm one after the other as
follows:
• In the first iteration, every input machine forwards either the edge it
got or a part of it to four machines. This requires O(1) time and a space
of O(log n), which is the space required for storing an edge (we assume
here that a vertex can be stored using O(log n) space). The total space
used by all the machines in this iteration is O(m log n) since there are
m input machines, one for every edge of the input graph.
• In the second iteration, the machines whose name is of the form
M (u, v) simply forward to themselves the edge they got, which requires
O(1) time, O(log n) space per machine and O(m log n) space in total.
The machines whose name is of the form M (u) have to store all the
neighbors of their vertex, count them and send the resulting degree to
the machines in charge of these neighbors. This might require O(n log n)
space and O(m) time per machine — note that we bound here the
degree of a vertex by n for the analysis of the space complexity and by
m for the analysis of the time complexity. The total space complexity
of all these machines is O(log n) times the sum of the degrees of all the
vertices of the graph, which is 2m, and thus, it is O(m log n). Adding
up the numbers we got for the two kinds of machines, we get in this
iteration machine time and space complexities of O(m) and O(n log n),
respectively, and a total space complexity of O(m log n).
• In the third iteration, the machines whose name is of the form M (u, v)
again simply forward to themselves the edge they got, which requires
O(1) time and O(log n) space per machine and O(m log n) space in
total. The machines whose name is of the form M (u) have to store the
neighbors of u and the degrees of these neighbors, and then enumerate
over all the pairs of these neighbors that are larger than u according
to ≺. This requires O(n log n) space per machine, and O(m log n) space
in total (because the sum of all the degrees of the vertices in the graph
is 2m).
Determining the amount of time this requires per machine is slightly
more involved. For a vertex u of degree at most √m, the time required
is at most O((deg(u))²) = O(m). Consider now a vertex u of degree at
least √m. Since there can be at most 2√m vertices of degree √m
or more, there can only be 2√m neighbors of u that appear after u
according to ≺. Thus, the machine time complexity of M(u) is upper
bounded by O((2√m)²) = O(m).
Adding up the numbers we got for the two kinds of machines, we
get in this iteration machine time and space complexities of O(m) and
O(n log n), respectively, and a total space complexity of O(m log n).
• In the final iteration, each machine M (u, v) might get a single message
of size O(log n) from every machine M (w). Thus, it uses O(n log n)
space. For every such message, the machine has to output a single
triangle, which requires O(m) time because every machine M (w)
Chapter 18
Locality-Sensitive Hashing
An Internet search engine finds many results for a query. Often some of
these results are near duplicates of other results.1 Given this situation,
most search engines will try to eliminate the near duplicate search results
to avoid producing a very repetitive output. To do so, the search engine
must be able to detect pairs of results that are similar. Consider now an
online shopping service. Such services often try to recommend items to
users, and one of the easiest ways to do this is to recommend an item that
was bought by one user to other users that share a similar taste. However,
to use this strategy, the online shopping service must be able to detect pairs
of users that bought similar items in the past, and thus, can be assumed to
have a similar taste.
The tasks that the search engine and the online shopping service need
to perform in the above scenarios are just two examples of the more general
problem of finding pairs of elements in a set that are similar based on a
certain similarity measure. This general problem is very basic, and captures
many additional practical problems besides the above two examples. Thus,
a lot of research was done on non-trivial techniques to solve it. In this
chapter, we will present one such interesting technique, which is known as
locality-sensitive hashing.
1 Two common examples for this phenomenon are a site which is hosted on multiple
Figure 18.1. The ideal relationship between the distance of two elements e1 , e2 ∈ S
and the probability that they are mapped to the same range item by a random function
from a locality-sensitive hashing functions family. Elements that are close to each other
are very likely to be mapped to the same range item, elements that are far from each
other are very unlikely to be mapped to the same range item, and the “transition zone”
between these two regimes is small.
Exercise 1. Prove that for any two given numbers x1 and x2 , if f is a hash
function drawn uniformly at random from F , then
$$\Pr[f(x_1) = f(x_2)] = \max\left\{0,\; 1 - \frac{|x_1 - x_2|}{10}\right\}.$$
Exercise 2. Show that for every two vectors x and y with n coordinates
and a uniformly random function fi ∈ FH , it holds that Pr[fi (x) = fi (y)] =
1 − distH (x, y)/n.
The Hamming distance measure is defined for any two vectors with the
same number of coordinates. Another distance measure for vectors, known
as the angular distance, is defined for non-zero vectors in vector spaces that
have an inner product. To keep the presentation of this distance measure
simple, we will restrict ourselves to non-zero vectors in Rn — recall that Rn
consists of the vectors with n coordinates whose individual coordinates are
real numbers. For such vectors, the angular distance between two vectors
is simply the angle between them. Let us denote by distθ (x, y) the angular
distance between vectors x and y.
Figure 18.2. Two vectors x and y with a low angle between them and a third vector z.
Note that the angle between x and z is similar to the angle between y and z.
Exercise 3. Prove that Equality (18.1) holds when x and y are arbitrary
non-zero vectors in R2 and fz is a uniformly random function from Fθ .
Let us now consider a distance measure between sets which is known as the
Jaccard distance. The Jaccard distance between two non-empty sets S1 and
S2 is defined as the fraction of the elements of S1 ∪ S2 that do not belong
to both sets. More formally, the Jaccard distance distJ (S1 , S2 ) between S1
and S2 is given by
$$\mathrm{dist}_J(S_1, S_2) = 1 - \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}.$$
The Jaccard distance is very useful in practice because sets are a very
general abstraction that can capture many practical objects. In particular,
the Jaccard distance is often used to determine the distance between
documents (but for this distance measure to make sense for this application,
one has to carefully choose the method used to convert each document into
a set).
From the definition of the Jaccard distance, it is easy to see that 1 −
distJ (S1 , S2 ) is equal to the probability that a random element of S1 ∪ S2
belongs to the intersection of these sets. A natural way to convert this
observation into a hash function family is as follows. Let N be the ground
set that contains all the possible elements. Then, for every element e ∈ N ,
we define a function
$$f_e(S) = \begin{cases} 1 & \text{if } e \in S, \\ 0 & \text{otherwise.} \end{cases}$$
Let FJ = {fe |e ∈ N } be the hash functions family containing all these
functions. Unfortunately, Exercise 4 shows that FJ is not a good locality-
sensitive family because the probability Pr[fe (S1 ) = fe (S2 )] strongly
depends on the size of S1 ∪ S2 (which makes this hash function family treat
small sets as close to each other even when their Jaccard distance is quite
large).
Exercise 4. Prove that for any two non-empty sets S1 and S2 and a uniformly
random function fe ∈ FJ it holds that
$$\Pr[f_e(S_1) = f_e(S_2)] = 1 - \frac{|S_1 \cup S_2| \cdot \mathrm{dist}_J(S_1, S_2)}{|N|}.$$
Intuitively, the failure of the hash functions family FJ to be a good locality-
sensitive family stems from the fact that the element e is a random element
of N rather than a random element of S1 ∪S2 , which makes a large difference
for small sets. Thus, to get a better locality-sensitive hash functions family,
Exercise 5. Observe that, for every two sets S1 and S2 and a permutation
π of N , fπ (S1 ) = fπ (S2 ) if and only if the first element of S1 ∪ S2 according
to the permutation π appears also in the intersection of the two sets. Then,
use this observation to show that
Pr[fπ (S1 ) = fπ (S2 )] = 1 − distJ (S1 , S2 )
when fπ is drawn uniformly at random from FJ .
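The following sketch illustrates the family used in Exercise 5 under the reading given in its solution: fπ(S) is the first element of S according to a uniformly random permutation π of the ground set N. Averaging collisions over many random permutations therefore estimates 1 − distJ(S1, S2); the ground set and the two example sets are arbitrary choices of this sketch.

# An illustration of the permutation-based family of Exercise 5: f_pi(S) is
# the first element of S according to a random permutation pi of the ground
# set N, so collisions happen with probability 1 - dist_J(S1, S2). The ground
# set and the two example sets below are arbitrary choices.

import random

def f_pi(S, rank):
    # rank[e] is the position of element e in the permutation pi.
    return min(S, key=lambda e: rank[e])

def estimate_jaccard_distance(S1, S2, ground_set, trials=100_000):
    collisions = 0
    elements = list(ground_set)
    for _ in range(trials):
        random.shuffle(elements)
        rank = {e: i for i, e in enumerate(elements)}
        collisions += f_pi(S1, rank) == f_pi(S2, rank)
    return 1.0 - collisions / trials

N = set(range(20))
S1, S2 = {1, 2, 3, 4, 5, 6}, {4, 5, 6, 7, 8}
exact = 1 - len(S1 & S2) / len(S1 | S2)           # 1 - 3/8 = 0.625
print(round(estimate_jaccard_distance(S1, S2, N), 3), exact)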
Exercise 6. Prove that the hash functions family FJ defined in Section 18.2
is (1/5, 2/5, 4/5, 3/5)-sensitive.
4 Note that the equality operation defined here is unusual as it is not transitive. This
is fine for the theoretical analysis done in this section, but makes the implementation
of these ideas more involved. Some discussion of this issue appears in the solution of
Exercise 9.
Figure 18.3. The probability that two sets are mapped to the same range item, as a
function of the distance between the sets, by a random hash function from (a) FJ , (b)
F′ and (c) F′′.
Exercise Solutions
Solution 1
Let c be a uniformly random value from the range [0, 10), then the definition
of the family F implies
$$\Pr[f(x_1) = f(x_2)] = \Pr[f_c(x_1) = f_c(x_2)] = \Pr\left[\left\lceil \frac{x_1 - c}{10} \right\rceil = \left\lceil \frac{x_2 - c}{10} \right\rceil\right].$$
To understand the event ⌈(x1 − c)/10⌉ = ⌈(x2 − c)/10⌉, let us assume
that the real line is partitioned into disjoint ranges (10i, 10(i + 1)] for every
integer i. Given this partition, the last event can be interpreted as the event
that x1 − c and x2 − c end up in the same range. If |x1 − x2| ≥ 10, then this
can never happen because the distance between x1 − c and x2 − c is |x1 − x2 |
and the length of each range is 10. Thus, the case of |x1 − x2 | < 10 remains
to be considered. In this case, the event ⌈(x1 − c)/10⌉ = ⌈(x2 − c)/10⌉
happens if and only if the location of x1 − c within the range that includes
it is at distance of at least |x1 − x2 | from the end of the range. Note that
the distribution of c guarantees that the distance of x1 − c from the end of
the range including it is a uniformly random number from the range (0, 10],
and thus, the probability that it is at least |x1 − x2 | is given by
$$1 - \frac{|x_1 - x_2|}{10 - 0} = 1 - \frac{|x_1 - x_2|}{10}.$$
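A quick Monte Carlo experiment can confirm this calculation. The sketch below assumes, consistently with the partition used in this solution, that f_c(x) = ⌈(x − c)/10⌉ for a shift c drawn uniformly from [0, 10) (the floor variant behaves identically), and compares the empirical collision frequency with max{0, 1 − |x1 − x2|/10}.

# A Monte Carlo check of Solution 1. It assumes, consistently with the
# partition used above, that f_c(x) = ceil((x - c)/10) for a shift c drawn
# uniformly from [0, 10); the floor variant behaves identically.

import math
import random

def collision_probability(x1, x2, trials=200_000):
    hits = 0
    for _ in range(trials):
        c = random.uniform(0.0, 10.0)
        hits += math.ceil((x1 - c) / 10) == math.ceil((x2 - c) / 10)
    return hits / trials

for x1, x2 in [(0.0, 3.0), (2.0, 9.5), (1.0, 14.0)]:
    expected = max(0.0, 1.0 - abs(x1 - x2) / 10)
    print(x1, x2, round(collision_probability(x1, x2), 3), expected)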
Solution 2
Observe that fi (x) = fi (y) if and only if the vectors x and y agree on
their i-th coordinate. Thus, when fi is drawn uniformly at random from
FH (which implies that i is drawn uniformly at random from the integers
between 1 and n), we get
$$\Pr[f_i(x) = f_i(y)] = \frac{n - \mathrm{dist}_H(x, y)}{n} = 1 - \frac{\mathrm{dist}_H(x, y)}{n}.$$
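Because a function of FH simply samples one coordinate, the identity of Exercise 2 can also be checked mechanically by averaging over all n choices of i, as in the following small sketch (the example vectors are arbitrary).

# An exact check of Exercise 2 for two example vectors: averaging over the n
# possible choices of the sampled coordinate i gives the collision probability
# of f_i, which should equal 1 - dist_H(x, y)/n.

from fractions import Fraction

def hamming_collision_probability(x, y):
    n = len(x)
    agreements = sum(1 for a, b in zip(x, y) if a == b)
    return Fraction(agreements, n)

x, y = [0, 1, 1, 0, 1], [0, 0, 1, 1, 1]
dist_h = sum(a != b for a, b in zip(x, y))        # Hamming distance = 2
print(hamming_collision_probability(x, y))        # 3/5
print(1 - Fraction(dist_h, len(x)))               # 3/5, matching the formula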
Solution 3
Figure 18.4 depicts the vectors x and y and two regions, one around each
one of these vectors, where the region around a vector consists of all
the vectors whose angle with respect to it is at most 90°. We denote the
region around x by N (x) and the region around y by N (y). One can note
that fz (x) = fz (y) if and only if the vector z is in both these regions or in
neither of them. Thus,
Figure 18.4. Vectors x and y in R2 . Around the vector x there is a region marked with
dots that includes all the vectors whose angle with respect to x is at most 90◦ . Similarly,
around y there is a region marked with lines that includes all the vectors whose angle
with respect to y is at most 90◦ .
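The definition of the family Fθ is not reproduced in this excerpt, so the sketch below relies on two assumptions: that f_z(x) is the sign of the inner product of x with a uniformly random direction z, and that Equality (18.1) states Pr[f_z(x) = f_z(y)] = 1 − distθ(x, y)/π. Under these assumptions, the empirical collision frequency in R² should match the formula.

# An empirical check of the angular-distance family in R^2, assuming that
# f_z(x) = sign(<x, z>) for a uniformly random direction z and that Equality
# (18.1) states Pr[f_z(x) = f_z(y)] = 1 - dist_theta(x, y)/pi.

import math
import random

def sign_dot(z, v):
    return math.copysign(1.0, z[0] * v[0] + z[1] * v[1])

def empirical_collision(x, y, trials=200_000):
    hits = 0
    for _ in range(trials):
        angle = random.uniform(0.0, 2.0 * math.pi)
        z = (math.cos(angle), math.sin(angle))
        hits += sign_dot(z, x) == sign_dot(z, y)
    return hits / trials

x, y = (1.0, 0.0), (1.0, 1.0)                     # the angle between them is pi/4
theta = math.acos((x[0] * y[0] + x[1] * y[1]) /
                  (math.hypot(*x) * math.hypot(*y)))
print(round(empirical_collision(x, y), 3), round(1 - theta / math.pi, 3))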
Solution 4
Note that fe (S1 ) = fe (S2 ) if and only if e belongs to both these sets or to
neither one of them. Thus,
$$\Pr[f_e(S_1) = f_e(S_2)] = \frac{|S_1 \cap S_2|}{|N|} + \left(1 - \frac{|S_1 \cup S_2|}{|N|}\right) = 1 - \frac{|S_1 \cup S_2| \cdot \mathrm{dist}_J(S_1, S_2)}{|N|}.$$
To see that the last equality holds, plug in the definition of distJ .
Solution 5
Recall that fπ (S1 ) is the first element of S1 according to the permutation π,
and fπ (S2 ) is the first element of S2 according to this permutation. Thus,
if fπ (S1 ) = fπ (S2 ), then the element fπ (S1 ) is an element of S1 ∩ S2 that
appears before every other element of S1 ∪S2 in π. This proves one direction
of the first part of the exercise. To prove the other direction, we need to
show that if the element e that is the first element of S1 ∪ S2 according
to π belongs to S1 ∩ S2 , then fπ (S1 ) = fπ (S2 ). Hence, assume that this
is the case, and note that this implies, in particular, that e is an element
of S1 that appears in π before any other element of S1 . Thus, fπ (S1 ) = e.
Similarly, we also get fπ (S2 ) = e, and consequently, fπ (S1 ) = e = fπ (S2 ).
The second part of the exercise remains to be solved. Since π is a
uniformly random permutation of N in this part of the exercise, a symmetry
argument shows that the first element of S1 ∪ S2 according to π (formally
given by fπ(S1 ∪ S2)) is a uniformly random element of S1 ∪ S2. Thus,
$$\Pr[f_\pi(S_1) = f_\pi(S_2)] = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = 1 - \mathrm{dist}_J(S_1, S_2).$$
Solution 6
Recall that by Exercise 5, for two sets S1 and S2 at Jaccard distance d
from each other it holds that Pr[f (S1 ) = f (S2 )] = 1 − d for a random hash
function f from FJ . Thus, for d ≤ d1 = 1/5, we get
Solution 7
Consider a pair e1 and e2 of elements at distance at most d1 . Since F is
(d1 , d2 , p1 , p2 )-sensitive, Pr[f (e1 ) = f (e2 )] ≥ p1 for a uniformly random
function f ∈ F . Consider now a uniformly random function g ∈ G. Since G
contains a function for every choice of r (not necessarily distinct) functions
from F , the uniformly random choice of g implies that it is associated with r
uniformly random functions f1, f2, . . . , fr from F. Thus,
$$\Pr[g(e_1) = g(e_2)] = \Pr[\forall\, 1 \leq i \leq r:\ f_i(e_1) = f_i(e_2)] = \prod_{i=1}^{r} \Pr[f_i(e_1) = f_i(e_2)] \geq \prod_{i=1}^{r} p_1 = p_1^r. \qquad (18.2)$$
Solution 8
The solution of this exercise is very similar to the solution of Exercise 7.
However, for the sake of completeness, we repeat the following necessary
arguments.
Consider a pair e1 and e2 of elements at distance at most d1 . Since F
is (d1 , d2 , p1 , p2 )-sensitive, Pr[f (e1 ) = f (e2 )] ≥ p1 for a uniformly random
function f ∈ F . Consider now a uniformly random function g ∈ G. Since G
contains a function for every choice of r (not necessarily distinct) functions
from F , the random choice of g implies that it is associated with r uniformly
random functions f1 , f2 , . . . , fr from F . Thus,
$$\Pr[g(e_1) = g(e_2)] \geq 1 - \prod_{i=1}^{r} (1 - p_1) = 1 - (1 - p_1)^r. \qquad (18.3)$$
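To see numerically how the two constructions pull the probabilities apart, one can tabulate p^r and 1 − (1 − p)^r for the values p1 = 4/5 and p2 = 3/5 from Exercise 6; the choice r = 5 below is only for illustration.

# The effect of the r-AND-construction (p -> p^r) and the r-OR-construction
# (p -> 1 - (1 - p)^r) on the probabilities p1 = 4/5 and p2 = 3/5 used in
# Exercise 6; r = 5 is chosen only for illustration.

r = 5
for label, p in [("p1 = 4/5", 0.8), ("p2 = 3/5", 0.6)]:
    print(label, "AND:", round(p ** r, 3), "OR:", round(1 - (1 - p) ** r, 3))
# The AND-construction drives p2 toward 0 much faster than p1, while the
# OR-construction drives p1 toward 1 faster than p2.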
Solution 9
(a) The most natural way to distribute the function f is via the following
method, which uses two Map-Reduce iterations. In the first iteration, every input
machine forwards its name to the pre-designated central machine. Then, in
the second Map-Reduce iteration, the central machine forwards f to all the
input machines. Unfortunately, using this natural method as is might result
in large machine time and space complexities because it requires the central
machine to store the names of all the input machines and send the function
f to all of them. To solve this issue, it is necessary to split the work of the
central machine between multiple machines. More specifically, we will use
a tree T of machines with k levels and d children for every internal node.
The root of the tree is the central machine, and then k − 1 Map-Reduce
iterations are used to forward f along the tree T to the leaf machines. Once
the function f gets to the leaves, all the input machines forward their names
to random leaves of T , and each leaf that gets the names of input machines
responds by forwarding to these machines the function f .
Observe that, under the above suggested solution, an internal node
of the tree needs to forward f only to its children in T , and thus, has a
small machine time complexity as long as d is kept moderate. Additionally,
every leaf of the tree gets the names of roughly n/d^{k−1} input machines,
where n is the number of elements, and thus, has small machine time and
space complexities when d^{k−1} is close to n. Combining these observations,
we get that the suggested solution results in small machine time and space
complexity whenever d^{k−1} = Θ(n) and d is small. While these requirements
are somewhat contradictory, they can be made to hold together (even when
we want d to be a constant) by setting k = Θ(log n).
The above paragraphs were quite informal in the way they described the
suggested solution and analyzed it. For the interested readers, we note that
a more formal study of a very similar technique was done in the solution of
Exercise 2 in Chapter 16.
(b) The procedure described by the exercise forwards every element e to the
range item f (e) it is mapped to by f , and then detects that two elements
are mapped to the same range item by noting that they end up on the same
machine. As noted by the exercise, this works only when range items are
considered equal exactly when they are identical, which is not the case in
the OR-construction.
The range of a hash functions family obtained as an r-OR-construction
consists of r-tuples, where two tuples are considered equal if they agree on
some coordinate. Thus, detecting that two tuples are equal is equivalent to
detecting that their values for some coordinate are identical. This suggests
the following modification to the procedure described by the exercise.
Instead of forwarding the element e mapped to an r-tuple (t1 , t2 , . . . , tr ) to
a machine named (t1 , t2 , . . . , tr ), we forward it to the r machines named
(i, ti ) for every 1 ≤ i ≤ r. Then, every machine (i, t) that gets multiple
elements can know that the tuples corresponding to all these elements had
the value t in the ith coordinate of their tuple, and thus, can declare all
these elements as suspected to be close.
One drawback of this approach is that a pair of elements might be
declared as suspected to be close multiple times if their tuples agree on
multiple coordinates. If this is problematic, then one can solve it using the
following trick. A machine (i, t) that detects that a pair of elements e1 and e2
might be close to each other should forward a message to a machine named
(e1 , e2 ). Then, in the next Map-Reduce iteration, each machine (e1 , e2 ) that
got one or more messages will report e1 and e2 as suspected to be close.
This guarantees, at the cost of one additional Map-Reduce iteration, that
every pair of elements is reported at most once as suspected to be close.
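A sequential sketch of the procedure from part (b) is given below: an element whose tuple is (t1, t2, . . . , tr) is sent to the machines (i, ti), each such machine pairs up the elements it received, and a second round keyed by the pair itself guarantees that every suspected pair is reported at most once. The example tuples are arbitrary.

# A sequential sketch of the modified procedure from part (b): an element e
# whose tuple is (t_1, ..., t_r) is sent to the machines (i, t_i); a machine
# that receives several elements declares every pair of them as suspected to
# be close, and a second round keyed by the pair removes duplicate reports.

from itertools import combinations

def suspected_close_pairs(tuples_by_element):
    buckets = {}                        # messages received by machine (i, t)
    for e, tup in tuples_by_element.items():
        for i, t in enumerate(tup):
            buckets.setdefault((i, t), []).append(e)

    reports = set()                     # plays the role of the machines (e1, e2)
    for elements in buckets.values():
        for e1, e2 in combinations(sorted(elements), 2):
            reports.add((e1, e2))
    return sorted(reports)              # each suspected pair appears once


# Example tuples produced by some 3-OR-construction (arbitrary values).
tuples_by_element = {
    "a": (1, 7, 4),
    "b": (1, 2, 9),   # agrees with "a" on coordinate 0
    "c": (5, 7, 4),   # agrees with "a" on coordinates 1 and 2
}
print(suspected_close_pairs(tuples_by_element))   # [('a', 'b'), ('a', 'c')]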
Index
F
false negative, 426–427
false positive, 426–427
filtering technique, 401–410, 415
forest, 166–168, 200–206, 217–218, 401–409, 416–420
frequency
    words, 357–361, 363–368, 377–381
    vector, 11, 133–135, 145, 150, 152, 157

I
image (see also model (property testing): pixel model)
impossibility result, 51, 119–123, 168–169, 181, 186–187, 227, 275, 325, 370
independence
    random variables, 26–31, 35–36, 62–63, 79, 90, 121, 124, 147, 149–150, 161, 184, 242–243, 271, 286, 313

O
one-sided error algorithm, 244, 248–249, 258
OR-construction, 432–435, 439–440

P
parallel algorithm (see also Map-Reduce: algorithm), 355–358
parallelizable, 356
passes (multiple), 5–14, 63–64, 136–137, 156, 186
pixel model (see also model (property testing): pixel model)
probabilistic method, 122, 337
property testing algorithm, 237–265, 288–295, 306–307, 309–351
pseudometric, 231–232, 235

Q
quantile, 77–81
query complexity, 233–236, 240, 244, 248–249, 251, 254–255, 257, 261–262, 264, 324, 338, 346, 351
queue, 217–218

R
random function, 93–94, 97
random variable
    Bernoulli, 30–31, 183

S
sampling
    uniform, 73–74, 85–86, 91, 182, 194–195
    uniform with replacement, 74–75, 78, 81, 84
    uniform without replacement, 75–76, 86–89
    weighted, 81–84, 91–92
semi-streaming algorithm, 169, 171, 175, 186, 189, 192–193, 204, 218
shuffle, 358–362, 366
sketch, 133–154
    Count (see also Count sketch)
    Count-Median (see also Count-Median sketch)
    Count-Min (see also Count-Min sketch)
    linear (see also linear sketch)
smooth histograms method, 208–216, 220–224
smooth function, 208, 212–215, 220
sorted list testing, 245–249, 259–261
spanning tree (minimum weight) (see also minimum weight spanning tree)
stack, 173–181, 190–193
standard deviation, 34
star, 232, 236, 413
streaming algorithm, 9, 51, 77, 84, 109, 111, 115, 135–136, 156, 168–169, 170
sublinear time algorithm, 227–228, 230–234, 237, 267, 272, 276, 285, 288, 295, 297, 309, 332

T
tail bound, 15, 31–38
tie breaking rule, 260–261, 404
trailing zeros, 106, 110–118
token
    counting distinct, 109–131
    processing time, 10, 13–14, 63, 67–68, 202
total space complexity, 366, 373, 385–388, 390, 400, 408–409, 415, 423–424
triangle
    counting, 181–187, 410, 416
    listing, 410–416, 421
two-sided error algorithm, 244

U
union bound, 21–22, 50, 80, 91, 113, 117, 121, 126, 130, 158, 229, 243, 271, 275, 287, 304, 306, 312–313, 317, 323, 337, 349, 380, 388, 395, 408
universal family, 93–94, 102
update event, 133–139, 142, 145–146, 155–156

V
variance, 27–34, 48, 57–60, 63, 66–67, 70, 128–129, 148, 183–184
vertex coloring, 281–282, 303–304
vertex cover, 276–288, 295, 415

W
window
    active, 198–208, 210–219, 221–222, 224
    algorithm, 198–201, 203–208, 213–214
    length, 198–199, 204
    model (see also model (data stream): sliding window)
work, 367–368, 373, 385–389, 391, 399–400, 409–410, 415, 421