Randomised Algorithm
Randomised Algorithm
Sariel Har-Peled¬
April 2, 2024
¬ Departmentof Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL, 61801, USA;
[email protected]; http://sarielhp.org/. Work on this paper was partially supported by a NSF CAREER
award CCR-0132901.
2
Contents
Contents 3
3
3.2.2 Analysis of QuickSelect via conditional expectations . . . . . . . . . . . . . . . . . . 35
7 On k-wise independence 55
7.1 Pairwise independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.1.1 Pairwise independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.1.2 A pairwise independent set of bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.1.3 An application: Max cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.1.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.2 On k-wise independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2.2 On working modulo prime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2.3 Construction of k-wise independence variables . . . . . . . . . . . . . . . . . . . . . 58
7.2.4 Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2.5 Applications of k-wide independent variables . . . . . . . . . . . . . . . . . . . . . . 59
4
7.2.5.1 Product of expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.2.5.2 Application: Using less randomization for a randomized algorithm . . . . . 60
7.3 Higher moment inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
8 Hashing 63
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.2 Universal Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
8.2.1 How to build a 2-universal family . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
8.2.1.1 A quick reminder on working modulo prime . . . . . . . . . . . . . . . . . 66
8.2.1.2 Constructing a family of 2-universal hash functions . . . . . . . . . . . . . 66
8.2.1.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
8.2.1.4 Explanation via pictures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.3 Perfect hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
8.3.1 Some easy calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
8.3.2 Construction of perfect hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.3.2.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.4 Bloom filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
9 Closest Pair 73
9.1 How many times can a minimum change? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
9.2 Closest Pair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
9.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5
12.4 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6
16 Discrepancy and Derandomization 121
16.1 Discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
16.2 The Method of Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
16.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
19 Martingales 131
19.1 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
19.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
19.1.2 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
19.1.2.1 Examples of martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
19.1.2.2 Azuma’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
19.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
20 Martingales II 135
20.1 Filters and Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
20.2 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
20.2.1 Martingales – an alternative definition . . . . . . . . . . . . . . . . . . . . . . . . . . 137
20.3 Occupancy Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
20.3.1 Lets verify this is indeed an improvement . . . . . . . . . . . . . . . . . . . . . . . . 139
7
22 Evaluating And/Or Trees 149
22.1 Evaluating an And/Or Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
22.1.1 Randomized evaluation algorithm for T 2k . . . . . . . . . . . . . . . . . . . . . . . . 150
22.1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
22.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8
26.3 Better estimation for F2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
26.3.1 Pseudo-random k-wide independent sequence of signed bits . . . . . . . . . . . . . . 174
26.3.2 Estimator construction for F2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
26.3.2.1 The basic estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
26.3.3 Improving the estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
26.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
9
30 Random Walks II 197
30.1 Catalan numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
30.2 Walking on the integer line revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
30.2.1 Estimating the middle binomial coefficient . . . . . . . . . . . . . . . . . . . . . . . 198
30.3 Solving 2SAT using random walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
30.3.1 Solving 2SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
30.4 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
10
36.3.2 Computing an r-net in a sparse graph . . . . . . . . . . . . . . . . . . . . . . . . . . 227
36.3.2.1 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
36.3.2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
36.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
11
40.5.1 The bounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
40.5.2 The unbounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
40.6 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
40.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
42 Entropy II 269
42.1 Huffman coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
42.1.1 The algorithm to build Hoffman’s code . . . . . . . . . . . . . . . . . . . . . . . . . 270
42.1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
42.1.3 A formula for the average size of a code word . . . . . . . . . . . . . . . . . . . . . . 271
42.2 Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
42.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
45 Expanders I 287
45.1 Preliminaries on expanders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
45.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
45.2 Tension and expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
46 Expanders II 291
46.1 Bi-tension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
46.2 Explicit construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
46.2.1 Explicit construction of a small expander . . . . . . . . . . . . . . . . . . . . . . . . 293
46.2.1.1 A quicky reminder of fields . . . . . . . . . . . . . . . . . . . . . . . . . . 293
46.2.1.2 The construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
12
47 Expanders III - The Zig Zag Product 297
47.1 Building a large expander with constant degree . . . . . . . . . . . . . . . . . . . . . . . . . 297
47.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
47.1.2 The Zig-Zag product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
47.1.3 Squaring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
47.1.4 The construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
47.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
47.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
13
52 Primality testing 335
52.1 Number theory background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
52.1.1 Modulo arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
52.1.1.1 Prime and coprime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
52.1.1.2 Computing gcd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
52.1.1.3 The Chinese remainder theorem . . . . . . . . . . . . . . . . . . . . . . . . 337
52.1.1.4 Euler totient function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
52.1.2 Structure of the modulo group Zn . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
52.1.2.1 Some basic group theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
52.1.2.2 Subgroups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
52.1.2.3 Cyclic groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
52.1.2.4 Modulo group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
52.1.2.5 Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
52.1.2.6 Z∗p is cyclic for prime numbers . . . . . . . . . . . . . . . . . . . . . . . . 340
52.1.2.7 Z∗n is cyclic for powers of a prime . . . . . . . . . . . . . . . . . . . . . . . 341
52.1.3 Quadratic residues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
52.1.3.1 Quadratic residue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
52.1.3.2 Legendre symbol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
52.1.3.3 Jacobi symbol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
52.1.3.4 Jacobi(a, n): Computing the Jacobi symbol . . . . . . . . . . . . . . . . . 345
52.1.3.5 Subgroups induced by the Jacobi symbol . . . . . . . . . . . . . . . . . . . 346
52.2 Primality testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
52.2.1 Distribution of primes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
52.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
14
56 Some math stuff 371
56.1 Some useful estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
Index 378
15
16
Chapter 1
A randomized algorithm. The randomized algorithm in this case is easy – the player randomly chooses a
number among 1, 2, 3 at every round. Since, at every point in time, there are two coins that have the same side
up, and the other coin is the other side up, a random choice hits the lonely coin, and thus finishes the game,
with probability 1/3 at each step. In particular, the number of iterations of the game till it terminates behaves
like a geometric variable with geometric distribution with probability 1/3 (and thus the expected number of
rounds is 3). Clearly, the probability that the game continues for more than i rounds, when the player uses this
random algorithm, is (2/3)i . In particular, it vanishes to zero relatively quickly.
A deterministic algorithm. The surprise here is that there is no deterministic algorithm that can generate a
winning sequence. Indeed, if the player uses a deterministic algorithm, then the adversary can simulate the
algorithm herself, and know at every stage what coin the player would ask to flip (it is easy to verify that
17
flipping two coins in a step is equivalent to flipping the other coin – so we can restrict ourselves to a single coin
flip at each step). In particular, the adversary can rotate the board in the end of the round, such that the player
(in the next round) flips one of the two coins that are in the same state. Namely, the player never wins.
The shocker. One can play the same game with a board of size 4 (i.e., a square), where at each stage the
player can flip one or two coins, and the adversary can rotate the board by 0, 90, 180, 270 degrees after each
round. Surprisingly, there is a deterministic winning strategy for this case. The interested reader can think what
it is (this is one of these brain teasers that are not immediate, and might take you 15 minutes to solve, or longer
[or much longer]).
The unfair game of the analysis of algorithms. The underlying problem with analyzing algorithm is the
inherent unfairness of worst case analysis. We are given a problem, we propose an algorithm, then an all-
powerful adversary chooses the worst input for our algorithm. Using randomness gives the player (i.e., the
algorithm designer) some power to fight the adversary by being unpredictable.
18
1.2. Examples of randomized algorithms
1.2.1. 2SAT
The input is a 2SAT formula. That is a 2CNF boolean formula – that is, the formula is a conjunction of clauses,
where each clause is made out of two literals, which are ored together. A literal here is either a variable or its
negation. For example, the input formula might be
(Here, ∨ is a boolean or, and ∧ is a boolean and.) Assume that F is using n variables (say x1 , . . . , xn ∈ {0, 1}),
and m clauses. The task at hand is to compute a satisfying assignment for F. That is, determine what values
has to be assigned to x1 , . . . , xn .
This problem can be solved in linear time (i.e., O(n + m)) by a somewhat careful and somewhat clever usage
of directed graphs and strong connected components. Here, we present a much simpler randomized algorithm
– we will present some intuition why this algorithm works. We hopefully will provide a more detailed formal
proof later in the course.
1.2.1.2. Intuition
Fix a specific satisfying assignment Ξ to F. Assume Xi is the number of variables in the assignment in the
beginning of the ith iteration that agree with Ξ. If Xi = n, then the algorithm found Ξ, and it is done. Otherwise,
Xi changes by exactly one at each iteration. That is Xi+1 = Xi + 1 or Xi+1 = Xi − 1. If both variables of Ci
are assigned the “wrong” value (i.e., the negation of their value in Ξ), then Xi+1 = Xi + 1. The other option is
that one of the variables of Ci is assigned the wrong value. The probability that the algorithm guess the right
variable to flip is 1/2. Thus, we have Xi+1 = Xi + 1 with probability 1/2, and Xi+1 = Xi − 1 with probability 1/2.
Thus, the execution of the algorithm is a random process. Starting with X1 being some value in the range
J0 : nK = {0, . . . , n}, the question is how long do we have to run this process till Xi = n? It turns out that the
answer is O(n2 ), because essentially this process is related to the random walk on the line, described next.
19
X = X0 , X1 , . . . is a random walk on the integers. A natural question is how many times would the walk visit
the origin, in the infinite walk X?
Well, the probability of the random walk at time 2n to be in the origin is exactly αn = 2n n
/22n . Indeed,
there are 22n random walks of length 2n. For the walk to be in the origin at time 2n, the walk has to be balanced
– equal number of steps have to be taken
2n to the left and to the right. The number of binary sequences of length
2n that have exactly n 0s and n 1s is n .
√
Exercise 1.2.2. Prove that 2n n
= Θ(22n / n). (An easy proof follows from using Stirling’s formula, but there
is also a not too difficult direct elementary proof).
√ √
As such, we have that c− / n ≤ αn ≤ c+ / n, where c− , c+ are two constants. Thus, the expected number
of times the random walk visits the origin is
X
∞ X∞
1
αn ≥ c− √ = +∞.
n=1 n=1
n
As before, one can ask what is the number of times this random walk visits the origin. Let βn be the
probability of being in the origin at time 2n.
Exercise 1.2.3. Prove that βn = α2n = Θ(1/n). (There is a nifty trick to prove this. See if you can figure it out.)
P∞
Arguing as above, we have that the expected number of times the walk visits the origin is n=1 Θ(1/n) =
+∞. Namely, the walk visit the origin infinite number of times.
20
Theorem 1.2.4. The range JnK = {1, . . . , n} contains Θ(n/ log n) prime numbers.
As a number n can be written using O(log n) digits, that essentially means that a random number with t
digits has probability ≈ 1/t to be a prime number. Namely, primes are quite common.
Fortunately, one can test quickly whether or not a random number is a prime.
Theorem 1.2.5. Given a positive integer n, it can be written using T = log10 n digits. Furthermore, one can
decide in O(T 4 ) = O(log4 n) randomized time if n is prime. More precisely, if n is not prime, the algorithm
would return “not prime” with probability half, if it is prime, it would return “prime”.
A natural way to decide if a number n with t bits is prime, is to run the above algorithm (say), 10t times.
If any of the runs returns that the number is not prime, then we return “not prime”. Otherwise, we return the
number of is a prime. The probability that a composite number would be reported as prime is 1/210t ≤ 1/10t ,
which is a tiny number, for t, say, larger than 512.
This gives us an efficient way to pick random prime numbers – the time to compute such a number is
polynomial in the number of bits its uses. Now, we can deploy RSA as computing large random prime numbers
is the main technical difficulty in computing it.
Theorem 1.2.6. The above algorithm always outputs a cut, and it outputs a min-cut with probability ≥ 2/n2 .
In particular, it turns out that if you run the above algorithm O(n2 log n) times, and returns the smallest cut
computed, then with probability ≥ 1 − 1/nO(1) , the returned cut is the minimum cut! This algorithm has running
time (roughly) O(n4 ) – it can be made faster, but this is already pretty good.
21
22
Chapter 2
As a concrete example, if we are rolling a dice, then Ω = {1, 2, 3, 4, 5, 6} and F would be the power set of
all possible subsets of Ω.
Definition 2.1.3. A probability measure is a mapping P : F → [0, 1] assigning probabilities to events. The
function P needs to have the following properties:
(i) Additive: for X, Y ∈ F disjoint sets, we have that P X ∪ Y = P X + P Y , and
(ii) P[Ω] = 1.
23
Observation 2.1.4. Let C1 , . . . , Cn be random events (not necessarily independent). Than
X
n
n
P ∪i=1Ci ≤ P[Ci ].
i=1
(This is usually referred to as the union bound.) If C1 , . . . , Cn are disjoint events then
X
n
n
P ∪i=1Ci = P[Ci ].
i=1
Definition 2.1.5. A probability space is a triple (Ω, F , P), where Ω is a sample space, F is a σ-algebra defined
over Ω, and P is a probability measure.
Definition 2.1.6. A random variable f is a mapping from Ω into some set G. We require that the probability
of the random variable to take on any value in a given subset
h of values
i is well defined. Formally, for any subset
−1 −1
U ⊆ G, we have that f (U) ∈ F . That is, P f ∈ U = P f (U) is defined.
Going back to the dice example, the number on the top of the dice when we roll it is a random variable.
Similarly, let X be one if the number rolled is larger than 3, and zero otherwise. Clearly X is a random variable.
We denote the probability of a random variable X to get the value x, by P[X = x] (or sometime P[x], if we
are lazy).
Definition 2.1.8 (Conditional Probability.). The conditional probability of X given Y, is the probability that
X = x given that Y = y. We denote this quantity by P X = x | Y = y .
One useful way to think about the conditional probability P[X | Y] is as a function, between the given value
of Y (i.e., y), and the probability of X (to be equal to x) in this case. Since in many cases x and y are omitted in
the notation, it is somewhat confusing.
The conditional probability can be computed using the formula
P (X = x) ∩ (Y = y)
PX=x|Y=y = .
PY=y
For example, let us roll a dice and let X be the number we got. Let Y be the random variable that is true if
the number we get is even. Then, we have that
h i 1
P X = 2 Y = true = .
3
h i
Definition 2.1.9. Two random variables X and Y are independent if P X = x Y = y = P[X = x], for all x and
y.
24
h i
Observation 2.1.10. If X and Y are independent then P X = x Y = y = P[X = x] which is equivalent to
PX=x ∩ Y=y
= P[X = x]. That is, X and Y are independent, if for all x and y, we have that
PY=y
PX=x ∩ Y=y =PX=x PY=y.
Remark. Informally, and not quite correctly, one possible way to think about the conditional probability
P X = x | Y = y is that it measure the benefit of having more information. If we know that Y = y, do we
have any change in the probability of X = x?
Lemma 2.1.11 (Linearity of expectation). Linearity of expectation is the property that for any two random
variables X and Y, we have that E X + Y = E X + E Y .
X X X
Proof: E X + Y = P[ω] X(ω) + Y(ω) = P[ω]X(ω) + P[ω]Y(ω) = E X + E Y . ■
ω∈Ω ω∈Ω ω∈Ω
Lemma 2.1.12. If X and Y are two random independent variables, then E[XY] = E[X] E[Y].
Proof: Let U(X) the sets of all the values that X might have. We have that
X X
E[XY] = xy P X = x and Y = y = xy P[X = x] P Y = y
x∈U(X),y∈U(Y) x∈U(X),y∈U(Y)
X X X X
= xy P[X = x] P Y = y = x P[X = x] yP Y = y
x∈U(X) y∈U(Y) x∈U(X) y∈U(Y)
= E[X] E[Y] . ■
denote the variance of X, where µX = E[X]. Intuitively, this√tells us how concentrated is the distribution of X.
The standard deviation of X, denoted by σX is the quantity V[X].
Observation 2.1.14. (i) For any constant c ≥ 0, we have V cX = c2 V X .
(ii) For X and Y independent variables, we have V X + Y = V X + V Y .
25
Definition 2.2.2 (Binomial distribution). Assume that we repeat a Bernoulli experiment n times (independently!).
Let X1 , . . . , Xn be the resulting random variables, and let X = X1 + · · · + Xn . The variable X has the binomial
distribution with parameters n and p. We denote this fact by X ∼ Bin(n, p). We have
!
n k n−k
b(k; n, p) = P X = k = pq .
k
P P
Also, E[X] = np, and V[X] = V ni=1 Xi = ni=1 V[Xi ] = npq.
Lemma 2.2.4. For a variable X ∼ Geom(p), we have, for all i, that P[X = i] = (1 − p)i−1 p. Furthermore,
E[X] = 1/p and V[X] = (1 − p)/p2 .
Proof: The proof of the expectation and variance is included for the sake of completeness, and the reader is
P
of course encouraged to skip (reading) this proof. So, let f (x) = ∞ ′
i=0 x = 1−x , and observe that f (x) =
i 1
P∞ i−1
i=1 ix = (1 − x)−2 . As such, we have
X
∞
p 1
E[X] = i (1 − p)i−1 p = p f ′ (1 − p) = = ,
i=1 (1 − (1 − p))2 p
h1 X i ∞
1 X 1
∞
and V[X] = E X − 2 = 2
i2 (1 − p)i−1 p − 2 . = p + p(1 − p) i2 (1 − p)i−2 − 2 .
p i=1
p i=2
p
Observe that
X
∞ ′′ 2
f ′′ (x) = i(i − 1)xi−2 = (1 − x)−1 = .
i=2
(1 − x)3
26
2.3. Application of expectation: Approximating 3SAT
Let F be a boolean formula with n variables in CNF form, with m clauses, where each clause has exactly k
literals. We claim that a random assignment for F, where value 0 or 1 is picked with probability 1/2, satisfies
in expectation (1 − 2−k )m of the clauses.
We remind the reader that an instance of 3SAT is a boolean formula, for example F = (x1 + x2 + x3 )(x4 +
x1 + x2 ), and the decision problem is to decide if the formula has a satisfiable assignment. Interestingly, we can
turn this into an optimization problem.
Max 3SAT
Instance: A collection of clauses: C1 , . . . , Cm .
Question: Find the assignment to x1 , ..., xn that satisfies the maximum number of clauses.
Clearly, since 3SAT is NP-Complete it implies that Max 3SAT is NP-Hard. In particular, the formula F
becomes the following set of two clauses:
x1 + x2 + x3 and x4 + x1 + x2 .
Definition 2.3.1. Algorithm Alg for a maximization problem achieves an approximation factor α if for all
inputs, we have:
Alg(G)
≥ α.
Opt(G)
In the following, we present a randomized algorithm – it is allowed to consult with a source of random
numbers in making decisions. A key property we need about random variables, is the linearity of expectation
property defined above.
Definition 2.3.2. For an event E, let X be a random variable which is 1 if E occurred, and 0 otherwise. The
random variable X is an indicator variable.
Theorem 2.3.4. One can achieve (in expectation) (7/8)-approximation to Max 3SAT in polynomial time.
Specifically, consider a 3SAT formula F with n variables and m clauses, and consider the randomized al-
gorithm that assigns each variable value 0 or 1 with equal probability (independently to each variable) . Then
this assignment satisfies (7/8)m clauses in expectation.
Proof: Let x1 , . . . , xn be the n variables used in the given instance. The algorithm works by randomly assigning
values to x1 , . . . , xn , independently, and equal probability, to 0 or 1, for each one of the variables.
Let Yi be the indicator variables which is 1 if (and only if) the ith clause is satisfied by the generated random
assignment, and 0 otherwise, for i = 1, . . . , m. Formally, we have
1 Ci is satisfied by the generated assignment,
Yi =
0 otherwise.
27
Pm
Now, the number of clauses satisfied by the given assignment is Y = i=1 Yi . We claim that E[Y] = (7/8)m,
where m is the number of clauses in the input. Indeed, we have
hX i X
m m
EY =E Yi = E Yi ,
i=1 i=1
by linearity of expectation. The probability that Yi = 0 is exactly the probability that all three literals appearing
in the clause Ci are evaluated to FALSE. Since the three literals, Say ℓ1 , ℓ2 , ℓ3 , are instance of three distinct
variable these three events are independent, and as such the probability for this happening is
1 1 1 1
P Yi = 0 = P[(ℓ1 = 0) ∧ (ℓ2 = 0) ∧ (ℓ3 = 0)] = P[ℓ1 = 0] P[ℓ2 = 0] P[ℓ3 = 0] = ∗ ∗ = .
2 2 2 8
(Another way to see this, is to observe that since Ci has exactly three literals, there is only one possible assign-
ment to the three variables appearing in it, such that the clause evaluates to FALSE. Now, there are eight (8)
possible assignments to this clause, and thus the probability of picking a FALSE assignment is 1/8.) Thus,
7
P Yi = 1 = 1 − P Yi = 0 = ,
8
and
7
E Yi = P Yi = 0 ∗ 0 + P Yi = 1 ∗ 1 = .
8
Pm
Namely, E[# of clauses sat] = E[Y] = i=1 E[Yi ] = (7/8)m. Since the optimal solution satisfies at most m
clauses, the claim follows. ■
Curiously, Theorem 2.3.4 is stronger than what one usually would be able to get for an approximation
algorithm. Here, the approximation quality is independent of how well the optimal solution does (the optimal
can satisfy at most m clauses, as such we get a (7/8)-approximation. Curiouser and curiouser¬ , the algorithm
does not even look on the input when generating the random assignment.
Håstad [Hås01a] proved that one can do no better; that is, for any constant ε > 0, one can not approximate
3SAT in polynomial time (unless P = NP) to within a factor of 7/8 + ε. It is pretty amazing that a trivial
algorithm like the above is essentially optimal.
Remark 2.3.5. For k ≥ 3, the above implies (1 − 2−k )-approximation algorithm, for k-SAT, as long as the
instances are each of length at least k.
28
Proof: Indeed,
X X X
E Y = y P Y = y + y P Y = y ≥ yP Y = y
y≥t y<t y≥t
X
≥ tP Y = y = tP Y ≥ t . ■
y≥t
Exercise 2.4.2. For any (integer) k > 1, define a random positive variable Xk such that P[Xk ≥ k E[Xk ]] = 1/k.
times, and returns the assignment satisfying the largest number of clauses.
Lemma 2.4.3. Given a 3SAT formula with n variables and m clauses, and parameters ε, φ ∈ (0, 1/2), the
above algorithm returns an assignment that satisfies ≥ (1 − ε)(7/8)m clauses of F, with probability ≥ 1 − φ.
The running time of the algorithm is O(ε−1 (n + m) log φ−1 ).
Proof: Let Zi be the number of clauses not satisfied by the ith random assignment considered by the algorithm.
Observe that E[Zi ] = m/8,as the probability of a clause not to be satisfied is 1/23 . The ith iteration fails if
m
m − Zi < (1 − ε)(7/8)m =⇒ Zi > m 1 − (1 − ε)7/8 = 1 + 7ε = 1 + 7ε E[Zi ] .
8
Thus, by Markov’s inequality, the ith iteration fails with probability
E[Zi ] 1
p = P[m − Zi < (1 − ε)(7/8)m] = P Zi > (1 + 7ε) E[Zi ] < = < 1 − ε,
(1 + 7ε) E[Zi ] 1 + 7ε
29
2.4.3. Example: Coloring a graph
Consider a graph G = (V, E) with n vertices and m edges. We would like to color it with k colors. As a
reminder, a coloring of a graph by k colors, is an assignment χ : V → JkK of a color to each vertex of G, out of
the k possible colors JkK = {1, 2, . . . , k}. A coloring of an edge uv ∈ E is valid if χ(u) , χ(v).
Lemma 2.4.4. Consider a random coloring χ of the vertices of a graph G = (V, E), where each vertex is
assigned a color randomly and uniformly from JkK. Then, the expected number of edges with invalid coloring
is m/k, where m = |E(G)| is the number of edges in G.
Proof: Let E = {e1 , . . . , em }. Let Xi be an indicator variable that is 1 ⇐⇒ ei is invalid colored by χ. Let
ei = ui vi . We have that
1
P[Xi = 1] = P χ(ui ) = χ(vi ) = .
k
Indeed, conceptually color ui first, and vi later. The probability that vi would be assigned the same color as ui
P
is 1/k. Let Z be the random variable that is the number of edges that are invalid for χ. We have that Z = i Xi .
By linearity of expectations, and the expectation of an indicator variable, we have
X
m X
m X
m
1 m
E[Z] = E[Xi ] = P[Xi = 1] = = . ■
i=1 i=1 i=1
k k
That is pretty good, but what about an algorithm that always succeeds? The above algorithm might always
somehow gives us a bad coloring. Well, not to worry.
Lemma 2.4.5. The above random coloring of G with k colors, has at most 2m/k invalid edges, with probability
≥ 1/2.
Proof: We have that E[Z] = m/k. As such, by Markov’s inequality, we have that
E[Z] m/k 1
P[Z > 2m/k] ≤ P[Z ≥ 2m/k] ≤ = = .
2m/k 2m/k 2
Thus
1 1
P[Z ≤ 2m/k] = 1 − P[Z > 2m/k] ≥ 1 − = . ■
2 2
In particular, consider he modified algorithm – it randomly colors the graph G. If there are at most 2m/k
invalid edges, it output the coloring, and stops. Otherwise, it retries. The probability of every iteration to
succeeds is p ≥ 1/2, and as such, the number of iterations behaves like a geometric random variable. It
follows, that in expectation, the number of iterations is at most 1/p ≤ 2. Thus, the expected running time of
this algorithm is O(m). Indeed, let R be the number iterations performed by the algorithm. We have that the
expected running time is proportional to
Note, that this is not the full picture – P[R = i] ≤ 1/2i−1 . So the probability of this algorithm tor for long
decreases quickly.
30
2.4.3.1. Getting a valid coloring
√
2.4.3.1.1. A fun algorithm. A natural approach is to run the above algorithm for k = m (assume it is an
integer). Then identify all the invalid edges, and invalidate the color of all the vertices involved. We now repeat
the coloring algorithm on these invalid vertices and invalid edges, again using random coloring, but now using
colors {k + 1, . . . , 2k}. If after this, there is a single invalid edge, we color one of its vertices by the color 2k + 1,
and output this coloring. Otherwise, it fails.
Lemma 2.4.6. The above algorithm succeeds with probability at least 1/2.
Proof: Let Y be the number of invalid edges in the end of the second round. For an edge to be invalid, its
coloring must have failed in both rounds, and the probability for that is exactly (1/k) · (1/k) = 1/m since the
two events are independent. As such, arguing as above, we have E[Y] = 1. By Markov’s inequality, we have
that
E[Y] 1
P algorithm fails = P[Y > 1] = P[Y ≥ 2] ≤ = . ■
2 2
Remark
√ 2.4.7. This is a toy example - it is not hard to come up with a deterministic algorithm that uses (say)
2m + 2 colors (how? think about it). However, this algorithm is a nice distributed algorithm - after three
rounds of communications, it colors the graph in a valid way, with probability at least half.
References
[Hås01a] J. Håstad. Some optimal inapproximability results. J. Assoc. Comput. Mach., 48(4): 798–859,
2001.
31
32
Chapter 3
3.1. QuickSort
Let the input be a set T = {t1 , . . . , tn } of n items to be sorted. We remind the reader, that the QuickSort
algorithm randomly pick a pivot element (uniformly), splits the input into two subarrays of all the elements
smaller than the pivot, and all the elements larger than the pivot, and then it recurses on these two subarrays
(the pivot is not included in these two subproblems). Here we will show that the expected running time of
QuickSort is O(n log n).
Let S 1 , . . . , S n be the elements in their sorted order (i.e., the output order). Let Xi j = 1 be the indicator
variable which is one ⇐⇒ QuickSort compares S i to S j , and let pi j denote the probability that this happens.
P
Clearly, the number of comparisons performed by the algorithm is C = i< j Xi j . By linearity of expectations,
we have
X X h i X
EC =E Xi j =
i< j
E Xi j = pi j .
i< j i< j
We want to bound pi j , the probability that the S i is compared to S j . Consider the last recursive call involving
both S i and S j . Clearly, the pivot at this step must be one of S i , . . . , S j , all equally likely. Indeed, S i and S j
were separated in the next recursive call.
Observe, that S i and S j get compared if and only if pivot is S i or S j . Thus, the probability for that is
2/( j − i + 1). Indeed,
h i 2
pi j = P S i or S j picked picked pivot from S i , . . . , S j = .
j−i+1
33
Thus,
X
n X X
n X X X 2
n n−i+1 X
n X
n
1
pi j = 2/( j − i + 1) = ≤2 ≤ 2nHn ≤ n + 2n ln n,
i=1 j>i i=1 j>i i=1 k=1
k i=1 k=1
k
Pn
where Hn is the harmonic number¬ Hn = i=1 1/i. We thus proved the following result.
Lemma 3.1.1. QuickSort performs in expectation at most n + 2n ln n comparisons, when sorting n elements.
Note, that this holds for all inputs. No assumption on the input is made. Similar bounds holds not only in
expectation, but also with high probability.
This raises the question, of how does the algorithm pick a random element? We assume we have access to
a random source that can get us number between 1 and n uniformly.
Note, that the algorithm always works, but it might take quadratic time in the worst case.
Remark 3.1.2 (Wait, wait, wait). Let us do the key argument in the above more slowly, and more carefully.
Imagine, that before running QuickSort we choose for every element a random priority, which is a real number
in the range [0, 1]. Now, we re-implement QuickSort such that it always pick the element with the lowest
random priority (in the given subproblem) to be the pivot. One can verify that this variant and the standard
implementation have the same running time. Now, ai gets compares to a j if and only if all the elements
ai+1 , . . . , a j−1 have random priority larger than both the random priority of ai and the random priority of a j . But
the probability that one of two elements would have the lowest random-priority out of j − i + 1 elements is
2 ∗ 1/( j − i + 1), as claimed.
hX i X
m−2 X
m−1 X
m−2 X
m−1
2 X 2(m − i − 1)
med−2
α1 = E Xi j = E Xi j = = ≤2 m−2 .
i=1 j=i+1 i=1 j=i+1
m−i+1 i=1
m−i+1
i< j<m
Rn Rn
¬
Using integration to bound summation, we have Hn ≤ 1 + 1
x=1 x
dx ≤ 1 + ln n. Similarly, Hn ≥ 1
x=1 x
dx = ln n.
34
h i
(ii) If m < i < j: Using the same analysis as above, we have that P Xi j = 1 = 2/( j − m + 1). As such,
X n Xj−1 Xn Xj−1 Xn
2( j − m − 1)
α2 = E
Xi j =
2
= ≤2 n−m .
j=m+1 i=m+1 j=m+1 i=m+1
j − m + 1 j=m+1 j − m + 1
Observe, that for a fixed ∆ = j − i + 1, we are going to handle the gap ∆ in the above summation, at most
P
∆ − 2 times. As such, α3 ≤ n∆=3 2(∆ − 2)/∆ ≤ 2n.
Xn h i X n
2
(iv) i = m. We have α4 = E Xi j = = ln n + 1.
j=m+1 j=m+1
j−m+1
X
m−1 h i Xm−1
2
(v) j = m. We have α5 = E Xi j = ≤ ln m + 1.
i=1 i=1
m−i+1
Thus, the expected number of comparisons performed by QuickSelect is bounded by
X
αi ≤ 2(m − 2) + 2(n − m) + 2n + ln n + 1 + ln m = 4n − 2 + ln n + ln m.
i
Theorem 3.2.1. In expectation, QuickSelect performs at most 4n−2+ln n+ln m comparisons, when selecting
the mth element out of n elements.
A different approach can reduce the number of comparisons (in expectation) to 1.5n + o(n). More on that
later in the course.
Theorem 3.2.2. Given a set X of n numbers, and any integer k, the expected running time of QuickSelect(X, n)
is O(n).
Proof: Let X1 = X, and Xi be the set of numbers in the ith level of the recursion. Let yi and ri be the random
element and its rank in Xi , respectively, in the ith iteration of the algorithm. Finally, let ni = |Xi |. Observe that
the probability that the pivot yi is in the “middle” of its subproblem is
h ni 3 i 1
α=P ≤ ri ≤ ni ≥ ,
4 4 2
35
QuickSelect(T J1 : nK , k)
// Input: T J1 : nK array with n numbers, parameter k.
// Assume all numbers in t are distinct.
// Task: Return kth smallest number in T .
y ← random element of T .
r ← rank of y in T .
if r = k then return y
T< = array with all elements in T < than y
T> = all elements in T > than y
// By assumption |T< | + |T> | + 1 = |T |.
if r < k then
return QuickSelect( T> , k − r )
else
return QuickSelect( T< , k )
as desired. ■
36
Chapter 4
Lemma 4.1.1. Let Xi ∈h {−1, +1} √ i probability half for each value, for i = 1, . . . , n (all picked indepen-
with
P
dently). We have that P | i Xi | > t n < 1/t2 .
P
Proof: Let Y = i Xi and Z = Y 2 . We have
X √ hX 2 i h h ii h i
P Xi > t n = P Xi > t2 n = P Y 2 > t2 E Y 2 = P Z > t2 E[Z] ≤ 1/t2 ,
i i
by Markov’s inequality. ■
37
Theorem
√ 4.1.2 (Chebyshev’s inequality). Let X be a real random variable, with µX = E[X], and σX =
V[X]. Then, for any t > 0, we have P |X − µX | ≥ tσX ≤ 1/t .
2
Proof: Let Yi = ψ(ri ) be an indicator variable that is 1 if the ith sample ri has the property ψ, for i = 1, . . . , m.
Observe that
Γ
p = E[Yi ] = .
n
P
Consider the random variable Y = i Yi .
Variance of a binomial distribution. (I am including the following here as a way to remember this formula.) The
variable Y is a binomial distribution with probability p = Γ/n, and m samples; that is, Y ∼ Bin(m, p). Thus, Y is the sum
of m random variables
h i Y1 , . . . , Ym that are independent indicator variables (i.e., Bernoulli distribution), with E[Yi ] = p,
and V[Yi ] = E Yi − E[Yi ]2 = p − p2 = p(1 − p). Since the variance is additive for independent variables, we have
2
hP i P
V[Y] = V i Yi = m i=1 V[T i ] = mp(1 − p).
Thus, we have
Γ m
E[Y] = mp = m · = Γ, and V[Y] = mp(1 − p).
n n
38
The standard deviation of Y is p √
σY = mp(1 − p) ≤ m/2,
p
as p(1 − p) is maximized for p = 1/2.
Consider the estimate Z = (n/m)Y for Γ, and observe that
n nm
E[Z] = E[(n/m)Y] = E[Y] = Γ = Γ.
m mn
By Chebychev’s inequality, we have that P |Y − E[Y]| ≥ tσY ≤ 1/t2 . Since (n/m) E[Y] = E[Z] = Γ, this
implies that
" # " √ # n
n n m n n n
P |Z − Γ| ≥ t √ = P |Z − Γ| ≥ t · ≤ P |Z − Γ| ≥ tσ Y = P Y − E [Y] ≥ tσ Y
2 m m 2 m m m m
1
= P |Y − E[Y]| ≥ tσY ≤ 2 ■
t
Proof: (A) Compute a random sample R of U of size m in O(m) time (assuming the input numbers are given
in an array, say). Next sort the numbers of R in O(m log m) time. Let
$ % & '
k √ k √
ℓ− = m − t m/2 − 1 and ℓ+ = m + t m/2 + 1.
n n
Set r− = R[ℓ− ] and r+ = R[ℓ+ ].
Let Y be the number of elements in the sample R that are ≤ U⟨k⟩ . By Lemma 4.2.1, we have
h √ √ i
P E[Y] − t m/2 ≤ Y ≤ E[Y] + t m/2 ≥ 1 − 1/t .
2
(B) Let g = k − t √nm − 3 mn , and let gR be the number of elements in R that are smaller than U⟨g⟩ . Arguing as
h √ i
above, we have that P gR ≤ gn m + t m/2 ≥ 1 − 1/t2 . Now
!
g √ m n n √ m √ √ m √
m + t m/2 = k−t√ −3 + t m/2 = k − t m − 3 + t m/2 = k − t m/2 − 3 < ℓ− .
n n m m n n
39
This implies that the g smallest numbers in U are outside the interval [r− , r+ ] with probability ≥ 1 − 1/t2 .
Next, let h = k + t √nm + 3 mn . A similar argument, shows that all the n − h largest numbers in U are too large
to be in [r− , r+ ]. This implies that
n n tn
|[r− , r+ ] ∩ U| ≤ h − g + 1 = 6 + 2t √ ≤ 8 √ . ■
m m m
40
√
algorithm failed. The other possibility for failure is that S m is too large. That is, larger than 8tn/ m = O(n3/4 ).
If any of these failures happen, then we rerun this algorithm from scratch.
Otherwise, the algorithm need to compute the element of rank k − |S < | in the set S m , and this can be done
in O(|S m | log |S m |) = O(n3/4 log n) time by using sorting.
4.3.2.2. Analysis
The correctness is easy – the algorithm clearly returns the desired element. As for running time, observe
that by Lemma 4.3.1, by probability ≥ 1 − 1/n1/4 , we succeeded in the first try, and then the running time is
O(n + (m log m)) = O(n). More generally, the probability that the algorithm failed in the first α tries to get a
good interval [r− , r+ ] is at most 1/nα/4 .
One can slightly improve the number of comparisons performed by the algorithm using the following
modifications.
Lemma 4.3.2. Given the numbers r− , r+ , one can compute the sets S < , S m , S > using in expectation (only!)
1.5n + O(n3/4 ) comparisons.
Proof: We need to compute the sets S < , S m , S > . Namely, we need to compare all the numbers of S to r− and r+ .
Since only O(n3/4 ) numbers fall in S m , almost all of the numbers are in either S < or S > . If a number of is in S <
(resp. S > ), then comparing it r− (resp. r+ ) is enough to verify that this is indeed the case. Otherwise, perform
the other comparison and put the element in its proper set (in this case we had to perform two comparisons to
handle the element).
So let us guess, by a coin flip, for each element of S whether they are in S < or S > . If we are right, then
the algorithm would require only one comparison to put them into the right set. Otherwise, it would need two
comparisons. Let X s be the random variable that is the number of comparisons used by this algorithm for an
element s ∈ S . We have that if s ∈ S < ∪ S > then E[X s ] = 1(1/2) + 2(1/2) = 3/2. If s ∈ S m then both
comparisons will be performed, and thus E[X s ] = 2 in this case.
Thus, the expected numbers of comparisons for all the elements of S , by linearity of expectations, is
3
2
(n − |S m |) + 2|S m | = (3/2)n + |S m |/2. ■
Theorem 4.3.3. Given an array S with n numbers and a rank k, one can compute the element of rank k
in S in expected linear time. Formally, the resulting algorithm performs in expectation 1.5n + O(n3/4 log n)
comparisons.
Proof: Let X be the random variable that is the number of iteration till the interval is good. We have that X
is a geometric variable with probability of success ≥ 1 − 1/n1/4 . As such, the expected number of rounds till
h is ≤ 1/p ≤ 1 + 2/n
1/4
success i . As such, the expected number of comparisons performed by the algorithm is
E X · 1.5n + O(n log n) = 1.5n + O(n3/4 log n).
3/4
■
41
42
Chapter 5
43
Now, there are two possibilities:
(A) If α′ . β′ mod 2, then, with probability half, we have ri = 0, and as such α . β mod 2.
(B) If α′ ≡ β′ mod 2, then, with probability half, we have ri = 1, and as such α . β mod 2.
As such, with probability at most half, the algorithm would fail to discover that the two vectors are different.■
5.1.1.1. Amplification
Of course, this is not a satisfying algorithm – it returns the correct answer only with probability half if the
vectors are different. So, let us run the algorithm t times. Let T 1 , . . . , T t be the returned values from all these
executions. If any of the t executions returns that the vectors are different, then we know that they are different.
P Algorithm fails = P v , u, but all t executions return ‘=’
= P T 1 = ‘=’ ∩ T 2 = ‘=’ ∩ · · · ∩ T t = ‘=’
Y1
t
1
= P T 1 = ‘=’ P T 2 = ‘=’ · · · P T t = ‘=’ ≤ = t.
i=1
2 2
5.1.2. Matrices
Given three binary matrices B, C, D of size n × n, we are interested in deciding if BC
= D. Computing BC is
expensive – the fastest known (theoretical!) algorithm has running time (roughly) O n2.37 . On the other hand,
multiplying such a matrix with a vector r (modulo 2, as usual) takes only O(n2 ) time (and this algorithm is
simple).
n×n
Lemma 5.1.3. Given three binary matrices B, C, D ∈ 0, 1 and a confidence parameter δ > 0, a randomized
algorithm can decide if BC = D or not. More precisely the algorithm can return one of the following two
results:
,: Then BC , D.
=: Then BC = D with probability ≥1 − δ.
The running time of the algorithm is O n2 log δ−1 .
Proof: Compute a random vector r = (r1 , . . . , rn ), and compute the quantity x = BCr = B(Cr) in O(n2 ) time,
using the associative property of matrix multiplication. Similarly, compute y = Dr. Now, if x , y then return
‘=’. l m
Now, we execute this algorithm t = lg δ−1 times. If all of these independent runs return that the matrices
are equal then return ‘=’.
44
The algorithm fails only if BC , D, but then, assume the ith row in two matrices BC and D are different.
The probability that the algorithm would not detect that these rows are different is at most 1/2, by Lemma 5.1.1.
As such, the probability that all t runs failed is at most 1/2t ≤ δ, as desired. ■
where degree( fi ) ≤ d − i. Since f is not zero, one of the f s must be non-zero, and let i the maximum value such
that fi , 0.
Now, we randomly choose the values r2 , . . . , rn ∈ S (independently and uniformly). And consider the
polynomial in the single variable x, which is
X
d
g(x) = f j (r2 , . . . , rn )xi .
j=0
Let F be the event that fi (r2 , . . . , rn ) = 0. Let G be the event that g(x) = 0. By induction, we have P[F] ≤
(d − i)/|S |. More interestingly if F does not happen, then degree(g) = i. As such, by induction, we have that
i
P[G | F] = P g(x) = 0 | F ≤ .
|S |
¬
Wikipedia notes that the proof is not algebraic, and it is definitely not fundamental to modern algebra. So maybe it should be
cited as “the theorem formerly known as the fundamental theorem of algebra”.
45
We conclude that
d−i i d
P f (r) = 0 = P[G ∩ F] + P[G ∩ F] ≤ P[F] + P[G | F] ≤ + ≤ . ■
|S | |S | |S |
Remark 5.1.5. Consider the polynomial f (x, y) = (x − 1)2 + (y − 1)2 − 1. The zero set of this polynomial is
the unit circle. So the zero set Z f is infinite in this case. However, note that for any choice of S , the set S 2 is a
grid. The Schwartz-Zippel lemma, tells us that there relatively few grid points that are in the zero set.
5.1.3.2. Applications
5.1.3.2.1. Checking if a polynomial is the zero polynomial. Let f be a polynomial of degree d, with n
variables, over the reals that can be evaluated in O(T ) time. One can check if f zero, by picking randomly a n
numbers from S = Jd3 K. By Lemma 5.1.4, we have that the probability of f to be zero over the chosen values
is ≤ d/d3 , which is a small number. As above, we can do amplification to get a high confidence answer.
5.1.3.2.2. Checking if two polynomials are equal. Given two polynomials f and g, one can now check if
they are equal by checking if f (r) = g(r), for some random input. The proof of correctness follows from the
above, as one interpret the algorithm as checking if f − g is the zero polynomial.
5.1.3.2.3. Verifying polynomials product. Given three polynomials f, g, and h, one can now check if f g = h.
Again, one randomly pick a value r, and check if f (r)g(r) = h(r). The proof of correctness follows from the
above, as one interprets the algorithm as checking if f g − h is the zero polynomial.
Let M be an n × n matrix, where M[i, j] = 0 if ui v j < E, and M[i, j] = xi, j otherwise. Let Π be the set of all
permutations of JnK.
A perfect matching is a permutation π : JnK → JnK, such that for all i, we have ui vπ(i) ∈ E. For such a
permutation π, consider the monomial
Yn
fπ = sign(π) M[i, j],
i=1
where sign is the sign of the permutation (it is either −1 or +1 – for our purpose here we do not care about
the exact definition of this quantity). It is either a polynomial of degree exactly n, or it is zero. Furthermore,
observe that for any two different permutation π, σ ∈ Π, we have that if fπ and fσ are both non-zero, then
fπ , fσ and fπ , − fσ .
Consider the following “crazy” polynomial over the set of variables V:
X
ψ = ψ(V) = det(M) = sign(π) fπ .
π∈Π
If there is perfect matching in G, then there is a permutation π such that fπ , 0. But this implies that ψ , 0
(since it has a non-zero monomials, and the monomials can not cancel each other).
46
In the other direction, if there is no perfect matching in G, then fπ = 0 for all permutation π. This implies
that ψ = 0. Thus, deciding if G has a perfect matching is equivalent to deciding if ψ , 0. The polynomial ψ is
defined via a determinant of a matrix that variables as some of the entries (and zeros otherwise). By the above,
all we need to do is to evaluate ψ over some random values. If we use exact arithmetic, we would just pick a
random number in [0, 1] for each variable, and evaluate ψ for these values of the variable. Namely, we filled
the matrix M with values (so it is all numbers now), and we need to computes its determinant. Via Gaussian
elimination, the determinant can be computed in cubic time. Thus, we can evaluate ψ in cubic time, which
implies that with high probability we can check if G has a perfect matching.
If we do not want to be prisoners of the impreciseness of floating point arithmetic, then one can perform
the above calculations over some finite field (usually, the field is simply working modulo a prime number).
Definition 5.2.4. The class NP consists of all languages L that have a polynomial time algorithm Alg, such
that for any input Σ∗ , we have:
(i) If x ∈ L ⇒ then ∃y ∈ Σ∗ , Alg(x, y) accepts, where |y| (i.e. the length of y) is bounded by a polynomial in
|x|.
(ii) If x < L ⇒ then ∀y ∈ Σ∗ Alg(x, y) rejects.
Definition 5.2.5. For a complexity class C, we define the complementary class co-C as the set of languages
whose complement is in the class C. That is
co−C = L L ∈ C ,
4 where L = Σ∗ \ L.
There is also the internet.
47
It is obvious that P = co−P and P ⊆ NP ∩ co−NP. (It is currently unknown if P = NP ∩ co−NP or whether
NP = co−NP, although both statements are believed to be false.)
Definition 5.2.6. The class RP (for Randomized Polynomial time) consists of all languages L that have a
randomized algorithm Alg with worst case polynomial running time such that for any input x ∈ Σ∗ , we have
(i) If x ∈ L then P Alg(x) accepts ≥ 1/2.
(ii) x < L then P Alg(x) accepts = 0.
An RP algorithm is a Monte Carlo algorithm, but this algorithm can make a mistake only if x ∈ L. As such,
co−RP is all the languages that have a Monte Carlo algorithm that make a mistake only if x < L. A problem
which is in RP ∩ co−RP has an algorithm that does not make a mistake, namely a Las Vegas algorithm.
Definition 5.2.7. The class ZPP (for Zero-error Probabilistic Polynomial time) is the class of languages that
have a Las Vegas algorithm that runs in expected polynomial time.
Definition 5.2.8. The class PP (for Probabilistic Polynomial time) is the class of languages that have a ran-
domized algorithm Alg, with worst case polynomial running time, such that for any input x ∈ Σ∗ , we have
(i) If x ∈ L then P Alg(x) accepts > 1/2.
(ii) If x < L then P Alg(x) accepts < 1/2.
Consider the mind-boggling stupid randomized algorithm that returns either yes or no with probability half.
This algorithm is almost in PP, as it return the correct answer with probability half. An algorithm is in PP needs
to be slightly better, and be correct with probability better than half. However, how much better can be made to
be arbitrarily close to 1/2. In particular, there is no way to do effective amplification with such an algorithm.
Definition 5.2.10. The class BPP (for Bounded-error Probabilistic Polynomial time) is the class of languages
that have a randomized algorithm Alg with worst case polynomial running time such that for any input x ∈ Σ∗ ,
we have
(i) If x ∈ L then P Alg(x) accepts ≥ 3/4.
(ii) If x < L then P Alg(x) accepts ≤ 1/4.
References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
48
Chapter 6
49
10
9
8
7
6
y 5
4
3
2
1
0
−1
−10 −8 −6 −4 −2 0 2 4
x
ex 1+x
Figure 6.1
P xi
by the Taylor expansion of e x = ∞ i=0 i! . This implies that (n/e) ≤ n!, as required.
n
As for the righthand side. The claim holds for n = 0 and n = 1. Let f (n) = (n + 1)n+1 /en , and observe that
by (ii), we have
!n+1 !n+1
f (n) (n + 1)n+1 /en n n + 1 n 1 e
= = = 1+ ≥ n = n.
f (n − 1) n /e
n n−1 e n e n e
Thus, we have
(n + 1)n+1 f (n) f (n − 1) f (1)
= f (n) = · ··· ≥ n · n − 1 · · · 1 = n!
e n f (n − 1) f (n − 2) f (0)
50
npicked uniformly, randomly and independently from JmK = {1, . . . , m}.
Lemma 6.2.1. Let X1 , . . . , Xn be n variables
Then, the expected number of collisions is 2 /m.
h i h i
Proof: Let Yi, j = 1 ⇐⇒ Xi = X j . We have that E Yi, j = P Yi, j = 1 = 1/m. Thus, the expected number of
collisions is n−1 n n−1 n
X X X X h i !
n 1
E Yi, j = E Yi, j = . ■
i=1 j=i+1 i=1 j=i+1
2 m
As such, for birthdays, for m = 364, and n = 28, we have that the expected number of collisions is
!
28 1 378
= > 1.
2 364 364
Proof: Let Ei be the event that Xi is distinct from all the values in X1 , . . . , Xi−1 . Let Bi = ∩ik=1 Ek = Bi−1 ∩ Ei be
the event that all of X1 , . . . , Xi are distinct. Clearly, we have
!
m − (i − 1) i−1 i−1
P[Ei | Bi−1 ] = P[Ei | E1 ∩ · · · ∩ Ei−1 ] = =1− ≤ exp − .
m m m
Observe that
! Yi !
P[Ei ∩ Bi−1 ] i−1 k−1
P[Bi ] = P[Bi−1 ] = P[Bi−1 ] P[Ei | Bi−1 ] ≤ exp − P[Bi−1 ] ≤ exp − .
P[Bi−1 ] m k=1
m
i ! ! !
X k − 1 i(i − 1) 1 i
= exp−
= exp − = exp − /m .
k=1
m 2 m 2
51
Proof: Let Xi be the number of balls in the ith bins, when we throw n balls into n bins (i.e., m = n). Clearly,
Xn
1
E[Xi ] = P The jth ball fall in ith bin = n · = 1,
j=1
n
by linearity of expectation. The probability that the first bin has exactly i balls is
! !i !n−i ! !i i !i i
n 1 1 n 1 ne 1 e
1− ≤ ≤ =
i n n i n i n i
This follows by Lemma 6.1.1 (iv).
Let C j (k) be the event that the jth bin has k or more balls in it. Then,
! k
X e i e k
n
e e2 e 1
P C1 (k) ≤ ≤ 1 + + 2 + ... = .
i=k
i k k k k 1 − e/k
For k∗ = c ln n/ ln ln n, we have
e k∗ !
∗ 1 ∗ ∗ k∗ ln k∗
P C1 (k ) ≤ ∗ ≤ 2 exp k (1 − ln k ) ≤ 2 exp −
k 1 − e/k∗ 2
! !
c ln n c ln n c ln n 1
≤ 2 exp − ln ≤ 2 exp − ≤ 2,
2 ln ln n | ln
{zln }
n 4 n
≈ln ln n
for n and c sufficiently large.
Let us redo this calculation more carefully (yuk!). For k∗ = ⌈(3 ln n)/ ln ln n⌉, we have
e k∗ !k ∗ k∗
∗ 1 e
P 1C (k ) ≤ ≤ 2 = 2 exp 1 − ln
| {z } 3 − ln ln n + ln ln ln n
k∗ 1 − e/k∗ (3 ln n)/ ln ln n
<0
≤ 2exp (− ln ln n + ln ln ln n)k∗
!
ln ln ln n 1
≤ 2 exp −3 ln n + 6 ln n ≤ 2 exp(−2.5 ln n) ≤ 2 ,
ln ln n n
for n large enough. We conclude, that since there are n bins and they have identical distributions that
X
n
∗ 1
P any bin contains more than k balls ≤ Ci (k∗ ) ≤ . ■
i=1
n
Exercise 6.3.3. Show that when throwing m = n ln n balls into n bins, with probability 1 − o(1), every bin has
O(log n) balls.
i=3 i=2
n i=2
n
Y m !
−(i−1)/n m(m − 1)
≤ e ≤ exp − ,
i=2
2n
52
√
thus for m = ⌈ 2n + 1⌉, the probability that all the m balls fall in different bins is smaller than 1/e.
This is sometime referred to as the birthday paradox, which was already mentioned above. You have
m = 30 people in the room, and you ask them for the date (day and month) of their birthday (i.e., n = 365).
The above shows that the probability of all birthdays to be distinct is exp(−30 · 29/730) ≤ 1/e. Namely, there
is more than 50% chance for a birthday collision, a simple but counter-intuitive phenomena.
Furthermore, $X_i$ has a geometric distribution with parameter $p_i$, that is $X_i \sim \mathrm{Geom}(p_i)$, with $p_i = (n-i)/n$. The expectation and variance of $X_i$ are
$$E[X_i] = \frac{1}{p_i} \qquad\text{and}\qquad V[X_i] = \frac{1-p_i}{p_i^2}.$$
Lemma 6.4.1. Let $X$ be the number of rounds till we collect all $n$ coupons. Then $V[X] \approx (\pi^2/6)\,n^2$ and its standard deviation is $\sigma_X \approx (\pi/\sqrt{6})\,n$.
Proof: The probability of $X_i$ to succeed in a trial is $p_i = (n-i)/n$, and $X_i$ has the geometric distribution with parameter $p_i$. As such, $E[X_i] = 1/p_i$ and $V[X_i] = (1-p_i)/p_i^2$.
Thus,
$$E[X] = \sum_{i=0}^{n-1} E[X_i] = \sum_{i=0}^{n-1}\frac{n}{n-i} = nH_n = n(\ln n + \Theta(1)) = n\ln n + O(n),$$
where $H_n = \sum_{i=1}^{n} 1/i$ is the $n$th Harmonic number.
As for the variance, using the independence of $X_0, \ldots, X_{n-1}$, we have
$$V[X] = \sum_{i=0}^{n-1} V[X_i] = \sum_{i=0}^{n-1}\frac{1-p_i}{p_i^2} = \sum_{i=0}^{n-1}\frac{1-(n-i)/n}{\bigl((n-i)/n\bigr)^2} = \sum_{i=0}^{n-1}\frac{i/n}{\bigl((n-i)/n\bigr)^2} = \sum_{i=0}^{n-1}\frac{i}{n}\Bigl(\frac{n}{n-i}\Bigr)^2$$
$$= n\sum_{i=0}^{n-1}\frac{i}{(n-i)^2} = n\sum_{i=1}^{n}\frac{n-i}{i^2} = n\Bigl(\sum_{i=1}^{n}\frac{n}{i^2} - \sum_{i=1}^{n}\frac{1}{i}\Bigr) = n^2\sum_{i=1}^{n}\frac{1}{i^2} - nH_n \approx \frac{\pi^2}{6}n^2,$$
since $\lim_{n\to\infty}\sum_{i=1}^{n} 1/i^2 = \pi^2/6$, and we have $\lim_{n\to\infty} \frac{V[X]}{n^2} = \frac{\pi^2}{6}$. ■
This implies a weak bound on the concentration of $X$. Using Chebyshev's inequality, we have
$$P\Bigl[X \ge n\ln n + n + t\cdot\frac{\pi}{\sqrt{6}}\,n\Bigr] \le P\bigl[\,|X - E[X]| \ge t\sigma_X\bigr] \le \frac{1}{t^2}.$$
Note that this is somewhat approximate, and holds only for $n$ sufficiently large.
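To make these asymptotics concrete, here is a small simulation sketch in Python (ours, not part of the notes): it estimates E[X] and V[X] empirically and compares them against n ln n and (π²/6)n². The function names and parameter choices are ours.

import math
import random

def coupon_collector_rounds(n: int) -> int:
    """Number of draws until all n coupon types have been seen."""
    seen = set()
    rounds = 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        rounds += 1
    return rounds

def estimate(n: int = 200, trials: int = 2000) -> None:
    samples = [coupon_collector_rounds(n) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((x - mean) ** 2 for x in samples) / (trials - 1)
    print(f"empirical E[X] ~ {mean:.1f}   vs  n ln n + O(n) ~ {n * math.log(n):.1f}")
    print(f"empirical V[X] ~ {var:.1f}  vs  (pi^2/6) n^2 ~ {math.pi**2 / 6 * n * n:.1f}")

if __name__ == "__main__":
    estimate()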
6.5. Notes
The material in this note covers parts of [MR95, Sections 3.1, 3.2, 3.6].
References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
Chapter 7
On k-wise independence
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
Namely, the variables are independent if you look at pairs of variables. Compare this to the much stronger
property of independence.
Definition 7.1.2. A set of random variables $X_1, \ldots, X_n$ is independent, if for any $t$, any $t$ values $\alpha_1, \ldots, \alpha_t$, and any $t$ indices $i_1, \ldots, i_t$, we have that
$$P\bigl[X_{i_1} = \alpha_1,\ X_{i_2} = \alpha_2,\ \ldots,\ \text{and } X_{i_t} = \alpha_t\bigr] = \prod_{j=1}^{t} P\bigl[X_{i_j} = \alpha_j\bigr].$$
Proof: We claim that, for any $i$, we have $P[Y_i = 1] = P[Y_i = 0] = 1/2$. So fix $i$, and let $\alpha$ be an index such that $\mathrm{bit}(i,\alpha) = 1$, and observe that this follows readily if we pick all the true random variables $X_0, \ldots, X_{t-1}$ in such an order that $X_\alpha$ is the last one to be set.

Next, consider two distinct indices $i, i'$, and two arbitrary values $v, v'$. We need to prove that
$$P\bigl[Y_i = v \text{ and } Y_{i'} = v'\bigr] = P[Y_i = v]\,P[Y_{i'} = v'] = \frac{1}{4}.$$
To this end, let $B = \{ j \mid \mathrm{bit}(i,j) = 1\}$ and $B' = \{ j \mid \mathrm{bit}(i',j) = 1\}$. If there is an index $\beta \in B \setminus B'$, then we have
$$P\bigl[Y_i = v \mid Y_{i'} = v'\bigr] = P\Bigl[\bigoplus_{j:\,\mathrm{bit}(i,j)=1} X_j = v \;\Big|\; Y_{i'} = v'\Bigr] = P\Bigl[X_\beta \oplus \bigoplus_{j:\,\mathrm{bit}(i,j)=1,\ j\ne\beta} X_j = v \;\Big|\; Y_{i'} = v'\Bigr] = \frac{1}{2}.$$
This implies that $P[Y_i = v \text{ and } Y_{i'} = v'] = P[Y_i = v \mid Y_{i'} = v']\,P[Y_{i'} = v'] = (1/2)(1/2) = 1/4$, as claimed.

A similar argument implies that if there is an index $\beta \in B' \setminus B$, then $P[Y_{i'} = v' \mid Y_i = v] = 1/2$, which implies the claim in this case.

Since $i \ne i'$, one of the two scenarios must happen, implying the claim. ■
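The construction, as we read it, can be sketched in a few lines of Python (ours, not part of the notes): it produces n pairwise independent uniform bits from only about log₂ n truly random bits, by XOR-ing the random bits selected by the binary representation of each index.

import random

def pairwise_independent_bits(n: int) -> list[int]:
    """Return n pairwise independent uniform bits Y_1..Y_n, using only
    t = bit_length(n) truly random bits X_0..X_{t-1}.  Y_i is the XOR of
    the X_j's selected by the binary representation of i (i = 1..n)."""
    t = max(1, n.bit_length())
    X = [random.getrandbits(1) for _ in range(t)]
    Y = []
    for i in range(1, n + 1):
        b = 0
        for j in range(t):
            if (i >> j) & 1:          # bit(i, j) = 1
                b ^= X[j]
        Y.append(b)
    return Y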
$$\bigl(S, \overline{S}\bigr) = (S, V\setminus S) = \bigl\{\, uv \in E \;\big|\; u \in S,\ v \in V\setminus S \,\bigr\}$$
is of maximum cardinality.
7.1.3.0.1. Algorithm. To this end, let $Y_1, \ldots, Y_n$ be the pairwise independent bits of Section 7.1.2. Here, let $S$ be the set of all vertices $v_i \in V$ such that $Y_i = 1$. The algorithm outputs $(S, \overline{S})$ as the candidate cut.
7.1.4. Analysis
Lemma 7.1.4. The expected size of the cut computed by the above algorithm is m/2, where m = |E(G)|.
Proof: Let $Z_{uv}$ be an indicator variable for the event that the edge $uv \in E$ is in the cut $(S, \overline{S})$. We have that
$$E\bigl[\,\bigl|(S,\overline{S})\bigr|\,\bigr] = E\Bigl[\sum_{uv\in E} Z_{uv}\Bigr] = \sum_{uv\in E} E[Z_{uv}] = \sum_{uv\in E} P[Y_u \ne Y_v] = |E|/2,$$
since, by the pairwise independence of $Y_u$ and $Y_v$, we have $P[Y_u \ne Y_v] = 1/2$. ■
Lemma 7.1.5. Given a graph G with n vertices and m edges, say stored in a read only memory, one can compute a cut of G, and the edges in it, using O(log n) random bits, and O(log n) RAM bits. Furthermore, the expected size of the cut is ≥ m/2.

Proof: The algorithm is described above. The pairwise independent bits it uses are also described above, and require only O(log n) random bits, which need to be stored. Otherwise, all we need is to scan the edges of the graph, and for each one decide whether or not it is in the cut. Clearly, this can be done using O(log n) RAM bits. ■
Compare this to the natural randomized algorithm of computing a random subset S . This would require
using n random bits, and n bits of space to store it.
Max cut in the streaming model. Imagine that the edges of the graph are given to you via streaming: You are
told the number of vertices in advance, but then edges arrive one by one. The above enables you to compute the
cut in a streaming fashion using O(log n) bits. Alternatively, you can output the edges in a streaming fashion.
Another way of thinking about it is that, given a set S = {s_1, . . . , s_n} of n elements, we can use the above to select a random sample where every element is selected with probability half, and the choices are pairwise independent. The kicker is that to specify the sample, or to decide if an element is in the sample, we need only O(log n) bits. This is a huge saving compared to the regular n bits required as storage to remember the sample.
It is clear however that we want a stronger concept – where things are k-wise independent.
$$P\bigl[X_{i_1} = v_1 \text{ and } X_{i_2} = v_2 \text{ and } \cdots \text{ and } X_{i_t} = v_t\bigr] = \prod_{j=1}^{t} P\bigl[X_{i_j} = v_j\bigr].$$
Observe, that verifying the above property needs to be done only for t = k.
Proof: (A) If αβ ≡ 0 (mod p), then p must divide αβ, as it divides 0. But α, β are smaller than p, and p is
prime. This implies that either p | α or p | β, which is impossible.
(B) Assume that α > β. Furthermore, for the sake of contradiction, assume that αi ≡ βi (mod p). But then,
(α − β)i ≡ 0 (mod p), which is impossible, by (A).
(C) For any α ∈ {1, . . . , p − 1}, consider the set Lα = {α ∗ 1 mod p, α ∗ 2 mod p, . . . , α ∗ (p − 1) mod p}. By
(A), zero is not in Lα , and by (B), Lα must contain p − 1 distinct values. It follows that Lα = {1, 2, . . . , p − 1}.
As such, there exists exactly one number y ∈ {1, . . . , p − 1}, such that αy ≡ 1 (mod p). ■
Lemma 7.2.4. Consider a prime p, and any numbers x, y ∈ Z_p. If x ≠ y then, for any a, b ∈ Z_p such that a ≠ 0, we have ax + b ≢ ay + b (mod p).

Proof: Assume y > x (the other case is handled similarly). If ax + b ≡ ay + b (mod p), then a(y − x) ≡ 0 (mod p), where a ≠ 0 and y − x ≠ 0. However, since p is prime and 0 < a < p and 0 < y − x < p, this is impossible by (A) of Lemma 7.2.3. ■
Lemma 7.2.5. Consider a prime p, and any numbers x, y ∈ Z_p. If x ≠ y then, for each pair of numbers r, s ∈ Z_p = {0, 1, . . . , p − 1} such that r ≠ s, there is exactly one choice of numbers a, b ∈ Z_p such that ax + b (mod p) = r and ay + b (mod p) = s.
Claim 7.2.6. $\det(V) = \prod_{1\le i<j\le n}(x_j - x_i)$.
Proof: One can prove this in several ways, and we include a proof via properties of polynomials. The determinant $\det(V)$ is a polynomial in the variables $x_1, x_2, \ldots, x_n$. Formally, let $\Pi$ be the set of all permutations of $\llbracket n\rrbracket = \{1, \ldots, n\}$. For a permutation $\pi \in \Pi$, let $\mathrm{sign}(\pi) \in \{-1, +1\}$ denote the sign of this permutation. We have that
$$f(x_1, x_2, \ldots, x_n) = \det(V) = \sum_{\pi\in\Pi}\mathrm{sign}(\pi)\prod_{i=1}^{n} x_i^{\pi(i)-1}.$$
Every monomial in this polynomial has total degree $\sum_{i=1}^{n}(\pi(i)-1) = 0 + 1 + \cdots + (n-1) = n(n-1)/2$. Observe that if we replace $x_j$ by $x_i$, then $f(x_1, \ldots, x_i, \ldots, x_{j-1}, x_i, x_{j+1}, \ldots, x_n)$ is the determinant of a matrix with two identical rows, and such a matrix has a zero determinant. Namely, the polynomial $f$ is zero if $x_i = x_j$. This implies that $x_j - x_i$ divides $f$. We conclude that the polynomial $g \equiv \prod_{1\le i<j\le n}(x_j - x_i)$ divides $f$. Namely, we can write $f = g\cdot h$, where $h$ is some polynomial.

Consider the monomial $x_2 x_3^2\cdots x_n^{n-1}$. It appears in $f$ with coefficient 1. Similarly, it is generated in $g$ by selecting the first term $x_j$ in each factor of $\prod_{1\le i<j\le n}(x_j - x_i)$. It is easy to verify that this is the only way this monomial appears in $g$. Since $f$ and $g$ have the same total degree, $h$ is a constant, and comparing the coefficient of this monomial implies that $h = 1$. We conclude that $f = g$, as claimed. ■
Proof: Let $\overline{\alpha_i} = \bigl(1, \alpha_i, \alpha_i^2, \ldots, \alpha_i^{k-1}\bigr)$. We have that $f(\alpha_i, b) = \langle\overline{\alpha_i}, b\rangle \bmod p$. This translates into the linear system
$$M b = \begin{pmatrix} v_1\\ v_2\\ \vdots\\ v_k\end{pmatrix}, \qquad\text{where}\qquad M = \begin{pmatrix} 1 & \alpha_1 & \alpha_1^2 & \cdots & \alpha_1^{k-1}\\ 1 & \alpha_2 & \alpha_2^2 & \cdots & \alpha_2^{k-1}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & \alpha_k & \alpha_k^2 & \cdots & \alpha_k^{k-1}\end{pmatrix}.$$
The matrix $M$ is the Vandermonde matrix, and by Claim 7.2.7 it is invertible. We thus get that there exists a unique solution to this system of linear equations (modulo $p$). ■
The construction. So, let us pick independently and uniformly $k$ values $b_0, b_1, \ldots, b_{k-1} \in \mathbb{Z}_p$, let $b = (b_0, b_1, \ldots, b_{k-1})$, set $g(x) = \sum_{i=0}^{k-1} b_i x^i \bmod p$, and consider the random variables
$$Y_i = g(i), \qquad \forall i \in \mathbb{Z}_p.$$
Lemma 7.2.9. The variables Y0 , . . . , Y p−1 are uniformly distributed and k-wise independent.
Proof: The uniform distribution for each Yi follows readily by picking b0 last, and observing that each such
choice corresponds to a different value of Yi .
As for the $k$-wise independence, observe that for any set $I = \{i_1, i_2, \ldots, i_k\}$ of indices, and any set of values $v_1, \ldots, v_k \in \mathbb{Z}_p$, the event
$$Y_{i_1} = v_1 \text{ and } Y_{i_2} = v_2 \text{ and } \cdots \text{ and } Y_{i_k} = v_k$$
happens only for a unique choice of $b$, by Lemma 7.2.8. But there are $p^k$ possible choices of $b$. We conclude that the probability of the above event is $1/p^k = \prod_{j=1}^{k} P\bigl[Y_{i_j} = v_j\bigr]$, as desired. ■
Theorem 7.2.10. Let $p$ be a prime number, and pick independently and uniformly $k$ values $b_0, b_1, \ldots, b_{k-1} \in \mathbb{Z}_p$, and let $g(x) = \sum_{i=0}^{k-1} b_i x^i \bmod p$. Then the random variables
$$Y_i = g(i), \qquad i \in \mathbb{Z}_p,$$
are uniformly distributed and $k$-wise independent.
Proof: Immediate. ■
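The construction of Theorem 7.2.10 is a few lines of code. The following Python sketch (ours) samples the k coefficients and evaluates the random polynomial at every point of Z_p; it assumes, but does not check, that p is prime.

import random

def kwise_independent_sequence(k: int, p: int) -> list[int]:
    """Sample Y_0, ..., Y_{p-1}, a k-wise independent sequence over Z_p,
    by evaluating a random polynomial of degree < k at 0, 1, ..., p-1.
    Assumes p is prime (not verified here)."""
    b = [random.randrange(p) for _ in range(k)]   # coefficients b_0, ..., b_{k-1}
    def g(x: int) -> int:
        # Horner evaluation of sum_i b_i x^i modulo p.
        acc = 0
        for coef in reversed(b):
            acc = (acc * x + coef) % p
        return acc
    return [g(i) for i in range(p)]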
7.2.5.2. Application: Using less randomization for a randomized algorithm
We can consider a randomized algorithm to be a deterministic algorithm Alg(x, r) that receives, together with the input x, a random string r of bits that it uses as its source of random bits. Let us redefine RP:

Definition 7.2.12. The class RP (for Randomized Polynomial time) consists of all languages L that have a deterministic algorithm Alg(x, r) with worst case polynomial running time such that for any input x ∈ Σ*,
• x ∈ L =⇒ Alg(x, r) = 1 for at least half the possible values of r.
• x ∉ L =⇒ Alg(x, r) = 0 for all values of r.
Let us assume that we now want to minimize the number of random bits we use in the execution of the algorithm (why?). If we run the algorithm $t$ times independently, the failure probability drops to $2^{-t}$, but this uses $t\log n$ random bits (assuming our randomized algorithm needs only $\log n$ bits in each execution). Alternatively, we could choose two random numbers $a, b$ from $\mathbb{Z}_n$, and run $\mathrm{Alg}(x, a)$ and $\mathrm{Alg}(x, b)$, gaining us only confidence $1/4$ in the correctness of our results, while requiring $2\log n$ bits.

Can we do better? Let us define $r_i = ai + b \bmod n$, where $a, b$ are random values as above (note that we assume that $n$ is prime), for $i = 1, \ldots, t$. Thus $Y = \sum_{i=1}^{t}\mathrm{Alg}(x, r_i)$ is a sum of random variables which are pairwise independent, as the $r_i$ are pairwise independent. Assume that $x \in L$; then we have $E[Y] = t/2$, and $\sigma_Y^2 = V[Y] = \sum_{i=1}^{t} V\bigl[\mathrm{Alg}(x, r_i)\bigr] \le t/4$, so $\sigma_Y \le \sqrt{t}/2$. The probability that all those executions failed corresponds to the event that $Y = 0$, and
$$P\bigl[Y = 0\bigr] \le P\Bigl[\bigl|Y - E[Y]\bigr| \ge \frac{t}{2}\Bigr] = P\Bigl[\bigl|Y - E[Y]\bigr| \ge \frac{\sqrt{t}}{2}\cdot\sqrt{t}\Bigr] \le \frac{1}{t},$$
by Chebyshev's inequality. Thus we were able to "extract" from our random bits much more than one would naturally suspect is possible. We thus get the following result.
Lemma 7.2.13. Given an algorithm Alg in RP that uses lg n random bits, one can run it t times, such that the combined runs result in a new algorithm that fails with probability at most 1/t, while using only 2 lg n random bits.
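A minimal Python sketch of this amplification follows (ours, not from the notes). Here alg is a hypothetical black-box routine with the one-sided RP guarantee, and n is assumed prime; only two truly random numbers are drawn.

import random

def amplify_pairwise(alg, x, n: int, t: int) -> int:
    """Run a one-sided-error routine alg(x, r) on t seeds r_i = (a*i + b) mod n
    that are pairwise independent, using only two truly random numbers a, b in Z_n.
    Assumes n is prime and alg returns 0/1.  Returns 1 if any run accepts."""
    a = random.randrange(n)
    b = random.randrange(n)
    return max(alg(x, (a * i + b) % n) for i in range(1, t + 1))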
Proof: Observe that $E[X] = n\,E[X_1] = 0$. We are interested in computing
$$M_k(X) = E\bigl[X^k\bigr] = E\Bigl[\Bigl(\sum_i X_i\Bigr)^k\Bigr] = E\Bigl[\sum_{i_1=1}^{n}\cdots\sum_{i_k=1}^{n} X_{i_1}X_{i_2}\cdots X_{i_k}\Bigr] = \sum_{i_1=1}^{n}\cdots\sum_{i_k=1}^{n} E\bigl[X_{i_1}X_{i_2}\cdots X_{i_k}\bigr]. \tag{7.1}$$
Consider a term in the above summation where one of the indices (say $i_1$) has a unique value among $i_1, i_2, \ldots, i_k$. By independence, we have
$$E\bigl[X_{i_1}X_{i_2}\cdots X_{i_k}\bigr] = E\bigl[X_{i_1}\bigr]\,E\bigl[X_{i_2}\cdots X_{i_k}\bigr] = 0,$$
since $E[X_{i_1}] = 0$. As such, in the above all terms that have a unique index disappear. A term that does not disappear is going to be of the form
$$E\bigl[X_{i_1}^{\alpha_1}X_{i_2}^{\alpha_2}\cdots X_{i_\ell}^{\alpha_\ell}\bigr] = E\bigl[X_{i_1}^{\alpha_1}\bigr]\,E\bigl[X_{i_2}^{\alpha_2}\bigr]\cdots E\bigl[X_{i_\ell}^{\alpha_\ell}\bigr],$$
where $\alpha_j \ge 2$ and $\sum_j \alpha_j = k$. Observe that
$$E\bigl[X_1^t\bigr] = \begin{cases} 0 & t \text{ is odd,}\\ 1 & t \text{ is even.}\end{cases}$$
As such, all the terms in the summation of Eq. (7.1) that have a value that is not zero, have value one. These terms correspond to tuples $T = (i_1, i_2, \ldots, i_k)$, such that the set of values $I(T) = \{i_1, \ldots, i_k\}$ has at most $k/2$ values, and furthermore, each such value appears an even number of times in $T$ (here $k/2$ is an integer as $k$ is even by assumption). We conclude that the total number of such tuples is at most
$$n^{k/2}(k/2)^k.$$
Note that this is a naive bound – indeed, we choose the $k/2$ values that are in $I(T)$, and then we generate the tuple $T$ by choosing a value for each coordinate separately. We thus conclude that
$$M_k(X) = E\bigl[X^k\bigr] \le n^{k/2}(k/2)^k.$$
Since $k$ is even, we have $E\bigl[X^k\bigr] = E\bigl[|X|^k\bigr]$, and by Lemma 7.3.1, we have
$$P\Bigl[|X| \ge \frac{tk}{2}\sqrt{n}\Bigr] = P\Bigl[|X| \ge t\bigl(n^{k/2}(k/2)^k\bigr)^{1/k}\Bigr] \le P\Bigl[|X| \ge t\,E\bigl[|X|^k\bigr]^{1/k}\Bigr] \le 1/t^k. \qquad\blacksquare$$
Observe that the above proof did not require all the variables to be fully independent – it was enough that they are $k$-wise independent. We readily get the following.

Definition 7.3.4. Given $n$ random variables $X_1, \ldots, X_n$, they are $k$-wise independent, if for any $k$ of them (i.e., $i_1 < i_2 < \cdots < i_k$), and any $k$ values $v_1, \ldots, v_k$, we have
$$P\Bigl[\bigcap_{\ell=1}^{k} X_{i_\ell} = v_\ell\Bigr] = \prod_{\ell=1}^{k} P\bigl[X_{i_\ell} = v_\ell\bigr].$$
Informally, variables are $k$-wise independent if any $k$ of them (on their own) look totally random.

Lemma 7.3.5. Let $k > 0$ be an even integer, and let $X_1, \ldots, X_n$ be $n$ random variables that are $k$-wise independent, such that $P[X_i = -1] = P[X_i = +1] = 1/2$. Let $X = \sum_{i=1}^{n} X_i$. Then, we have
$$P\Bigl[|X| \ge \frac{tk}{2}\sqrt{n}\Bigr] \le \frac{1}{t^k}.$$
References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
Chapter 8
Hashing
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“I tried to read this book, Huckleberry Finn, to my grandchildren, but I couldn’t get past page six because the book is fraught with the n-word. And although they are the deepest-thinking, combat-ready eight- and ten-year-olds I know, I knew my babies weren’t ready to comprehend Huckleberry Finn on its own merits. That’s why I took the liberty to rewrite Mark Twain’s masterpiece. Where the repugnant n-word occurs, I replaced it with warrior and the word slave with dark-skinned volunteer.”
8.1. Introduction
We are interested here in the dictionary data structure. The settings for such a data-structure:
(A) U: universe of keys with total order: numbers, strings, etc.
(B) Data structure to store a subset S ⊆ U
(C) Operations:
(A) search/lookup: given x ∈ U is x ∈ S ?
(B) insert: given x < S add x to S .
(C) delete: given x ∈ S delete x from S
(D) Static structure: S given in advance or changes very infrequently, main operations are lookups.
(E) Dynamic structure: S changes rapidly so inserts and deletes as important as lookups.
Common constructions for such data-structures, include using a static sorted array, where the lookup is a
binary search. Alternatively, one might use a balanced search tree (i.e., red-black tree). The time to perform
an operation like lookup, insert, delete take O(log |S |) time (comparisons).
Naturally, the above are potentially an “overkill”, in the sense that sorting is unnecessary. In particular, the universe U may not be (naturally) totally ordered. The keys may correspond to large objects (images, graphs, etc.) for which comparisons are expensive. Finally, we would like to improve the “average” performance of lookups to O(1) time, even at the cost of extra space or errors with small probability: there are many applications for fast lookups in networking, security, etc.
Hashing and Hash Tables. The hash-table data structure has an associated (hash) table/array T of size m (the table size), and a hash function h : U → {0, . . . , m − 1}. An item x ∈ U hashes to slot h(x) in T.
Given a set S ⊆ U, in a perfect ideal situation, each element x ∈ S hashes to a distinct slot in T, and we store x in the slot h(x). The lookup for an item y ∈ U is then to check if T[h(y)] = y. This takes constant time.
Unfortunately, collisions are unavoidable, and there are several different techniques to handle them. Formally, two items x ≠ y collide if h(x) = h(y).
A standard technique to handle collisions is to use chaining (aka open hashing). Here, we handle collisions as follows:
(A) For each slot i store all items hashed to slot i in a linked list. T[i] points to the linked list.
(B) Lookup: to find if y ∈ U is in T, check the linked list at T[h(y)]. The time is proportional to the size of the linked list.
Other techniques for handling collisions include associating with each element a list of locations where it can be (in a certain order), and checking these locations in this order. Another useful technique is cuckoo hashing, which we will discuss later on: every value has two possible locations. When inserting, insert in one of the locations; otherwise, kick out the stored value to its other location. Repeat till stable; if no stability is reached, rebuild the table.
The relevant questions when designing a hashing scheme, include: (I) Does hashing give O(1) time per
operation for dictionaries? (II) Complexity of evaluating h on a given element? (III) Relative sizes of the
universe U and the set to be stored S . (IV) Size of table relative to size of S . (V) Worst-case vs average-case
vs randomized (expected) time? (VI) How do we choose h?
The load factor of the array T is the ratio n/m, where n = |S| is the number of elements being stored and m = |T| is the size of the array being used. Typically n/m is a small constant smaller than 1.
In the following, we assume that U (the universe the keys are taken from) is large – specifically, N = |U| ≫ m², where m is the size of the table. Consider a hash function h : U → {0, . . . , m − 1}. If we hash the N items of U to the m slots, then by the pigeonhole principle, there is some i ∈ {0, . . . , m − 1} such that at least N/m ≥ m elements of U get hashed to i. In particular, this implies that there is a set S ⊆ U, with |S| = m, such that all of S hashes to the same slot. Oops.
Namely, for every hash function there is a bad set with many collisions.
Observation 8.1.1. Let H be the set of all functions from U to {1, . . . , m}. The number of functions in H is m^N, where N = |U|. As such, specifying a function in H requires log₂ |H| = O(N log m) bits.
As such, picking a truly random hash function requires many random bits, and furthermore, it is not even clear how to evaluate it efficiently (which is the whole point of hashing).
Picking a hash function. Picking a good hash function in practice is a dark art involving many non-trivial
considerations and ideas. For parameters N = |U|, m = |T |, and n = |S |, we require the following:
(A) H is a family of hash functions: each function h ∈ H should be efficient to evaluate (that is, to compute
h(x)).
(B) h is chosen randomly from H (typically uniformly at random). Implicitly assumes that H allows an
efficient sampling.
(C) Require that for any fixed set S ⊆ U, of size m, the expected number of collisions for a function chosen
from H should be “small”. Here the expectation is over the randomness in choice of h.
8.2. Universal Hashing
We would like the hash function to have the following property: for any element x ∈ U, and a random h ∈ H, the value h(x) should have a uniform distribution; that is, P[h(x) = i] = 1/m, for every 0 ≤ i < m. A somewhat stronger property is that for any two distinct elements x, y ∈ U, for a random h ∈ H, the probability of a collision between x and y should be at most 1/m; that is, P[h(x) = h(y)] ≤ 1/m.

Definition 8.2.1. A family H of hash functions is 2-universal if for all distinct x, y ∈ U, we have P[h(x) = h(y)] ≤ 1/m.
Applying a 2-universal family hash function to a set of distinct numbers, results in a 2-wise independent
sequence of numbers.
Lemma 8.2.2. Let S be a set of n elements stored in a hash table of size m using open hashing, where the hash function is picked from a 2-universal family. Then, the expected lookup time, for any element x ∈ U, is O(n/m).

Proof: The number of elements colliding with $x$ is $\ell(x) = \sum_{y\in S} D_y$, where $D_y = 1 \iff x$ and $y$ collide under the hash function $h$. As such, we have
$$E[\ell(x)] = \sum_{y\in S} E\bigl[D_y\bigr] = \sum_{y\in S} P\bigl[h(x) = h(y)\bigr] \le \sum_{y\in S}\frac{1}{m} = |S|/m = n/m. \qquad\blacksquare$$
Remark 8.2.3. The above analysis holds even if we perform a sequence of O(n) insertions/deletions opera-
tions. Indeed, just repeat the analysis with the set of elements being all elements encountered during these
operations.
The worst-case bound is of course much worse – it is not hard to show that in the worst case, the load of a single hash table entry might be Ω(log n/ log log n) (as we have seen in the occupancy problem).

Rehashing, amortization, etc. The above assumed that the set S is fixed. If items are inserted and deleted, then the hash table might become much worse. In particular, if |S| grows to more than cm, for some constant c, then the hash table performance starts degrading. Furthermore, if many insertions and deletions happen, then the initial random hash function is no longer random enough, and the above analysis no longer holds.

A standard solution is to rebuild the hash table periodically. We choose a new table size based on the current number of elements in the table, and a new random hash function, and rehash the elements; then we discard the old table and hash function. In particular, if |S| grows to more than twice the current table size, then we rebuild a new hash table (choosing a new random hash function) with double the current number of elements. One can do a similar shrinking operation if the set size falls below a quarter of the current hash table size.

If the working set |S| stays roughly the same, but more than c|S| operations have been performed on the table, for some chosen constant c (say 10), we rebuild as well.

We amortize the cost of rebuilding against the previously performed operations. Rebuilding ensures that the O(1) expected analysis holds even when S changes. Hence we get a dynamic dictionary data structure with O(1) expected lookup/insert/delete time!
8.2.1. How to build a 2-universal family
8.2.1.1. A quick reminder on working modulo prime
Definition 8.2.4. For a number n, let Z_n = {0, . . . , n − 1}.
For two integer numbers x and y, the quotient of x/y is x div y = ⌊x/y⌋. The remainder of x/y is x mod y = x − y⌊x/y⌋. If x mod y = 0, then y divides x, denoted by y | x. We use α ≡ β (mod p) or α ≡_p β to denote that α and β are congruent modulo p; that is, α mod p = β mod p – equivalently, p | (α − β).

Remark 8.2.5. A quick review of what we already know. Let p be a prime number.
(A) Lemma 7.2.3: For any α, β ∈ {1, . . . , p − 1}, we have that αβ ≢ 0 (mod p).
(B) Lemma 7.2.3: For any α, β, i ∈ {1, . . . , p − 1}, such that α ≠ β, we have that αi ≢ βi (mod p).
(C) Lemma 7.2.3: For any x ∈ {1, . . . , p − 1} there exists a unique y such that xy ≡ 1 (mod p). The number y is the inverse of x, and is denoted by x⁻¹ or 1/x.
(D) Lemma 7.2.4: For any numbers x, y ∈ Z_p with x ≠ y, and any a, b ∈ Z_p such that a ≠ 0, we have ax + b ≢ ay + b (mod p).
(E) Lemma 7.2.5: For any numbers x, y ∈ Z_p with x ≠ y, and each pair of numbers r, s ∈ Z_p = {0, 1, . . . , p − 1} such that r ≠ s, there is exactly one choice of numbers a, b ∈ Z_p such that ax + b (mod p) = r and ay + b (mod p) = s.
8.2.1.3. Analysis
Once we fix a and b, and we are given a value x, we compute the hash value of x in two stages:
(A) Compute: r ← (ax + b) (mod p).
(B) Fold: r′ ← r (mod m)
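The two-stage evaluation translates into a few lines of Python (ours, not from the notes). The class name is ours; it assumes p is a prime larger than the universe of keys and draws a ≠ 0, matching the family analyzed below.

import random

class UniversalHash:
    """One hash function h_{a,b}(x) = ((a*x + b) mod p) mod m drawn from a
    2-universal family.  p must be a prime larger than the universe size."""
    def __init__(self, p: int, m: int):
        self.p = p
        self.m = m
        self.a = random.randrange(1, p)   # a in {1, ..., p-1}
        self.b = random.randrange(p)      # b in {0, ..., p-1}

    def __call__(self, x: int) -> int:
        r = (self.a * x + self.b) % self.p   # (A) compute
        return r % self.m                    # (B) fold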
Lemma 8.2.6. Assume that $p$ is a prime, and $1 < m < p$. The number of pairs $(r, s) \in \mathbb{Z}_p\times\mathbb{Z}_p$, such that $r \ne s$, that are folded to the same number is $\le p(p-1)/m$. Formally, the set of bad pairs
$$B = \bigl\{\, (r, s) \in \mathbb{Z}_p\times\mathbb{Z}_p \;\big|\; r \ne s \text{ and } r \equiv_m s \,\bigr\}$$
is of size at most $p(p-1)/m$.

Proof: Consider a pair $(x, y) \in \{0, 1, \ldots, p-1\}^2$, such that $x \ne y$. For a fixed $x$, there are at most $\lceil p/m\rceil$ values of $y$ that fold into $x$. Indeed, $x \equiv_m y$ if and only if
$$y \in L(x) = \bigl\{\, x + im \;\big|\; i \text{ is an integer} \,\bigr\}\cap\mathbb{Z}_p.$$
The size of $L(x)$ is maximized when $x = 0$, and the number of such elements is at most $\lceil p/m\rceil$ (note that since $p$ is a prime, $p/m$ is fractional). One of the numbers in $L(x)$ is $x$ itself. As such, we have that
$$|B| \le p\bigl(|L(x)| - 1\bigr) \le p\bigl(\lceil p/m\rceil - 1\bigr) \le p(p-1)/m,$$
since $\lceil p/m\rceil - 1 \le (p-1)/m \iff m\lceil p/m\rceil - m \le p - 1 \iff m\lfloor p/m\rfloor \le p - 1 \iff m\lfloor p/m\rfloor < p$, which is true since $p$ is a prime, and $1 < m < p$. ■
Claim 8.2.7. For two distinct numbers x, y ∈ U, a pair a, b is bad if ha,b (x) = ha,b (y). The number of bad pairs
is ≤ p(p − 1)/m.
Proof: Let a, b ∈ Z_p such that a ≠ 0 and h_{a,b}(x) = h_{a,b}(y). Let r = (ax + b) mod p and s = (ay + b) mod p.
By Lemma 7.2.4, we have that r , s. As such, a collision happens if r ≡ s (mod m). By Lemma 8.2.6, the
number of such pairs (r, s) is at most p(p − 1)/m. By Lemma 7.2.5, for each such pair (r, s), there is a unique
choice of a, b that maps x and y to r and s, respectively. As such, there are at most p(p − 1)/m bad pairs. ■
Proof: Fix two distinct numbers x, y ∈ U. We are interested in the probability that they collide if h is picked randomly from H. By Claim 8.2.7 there are M ≤ p(p − 1)/m bad pairs that cause such a collision, and since H contains N = p(p − 1) functions, it follows that the probability of a collision is M/N ≤ 1/m, which implies that H is 2-universal. ■
8.3. Perfect hashing
An interesting special case of hashing is the static case – given a set S of elements, we want to hash S so that we can answer membership queries efficiently (i.e., a dictionary data-structure with no insertions). It is easy to come up with a hashing scheme that is optimal as far as space is concerned.
Lemma 8.3.1. Let S ⊆ U be a set of n elements, and let H be a 2-universal family of hash functions, into a
table of size m ≥ n2 . Then with probability ≤ 1/2, there is a pair of elements of S that collide under a random
hash function h ∈ H.
Lemma 8.3.2. Let $S \subseteq U$ be a set of $n$ elements, and let $H$ be a 2-universal family of hash functions, into a table of size $m \ge cn$, where $c$ is an arbitrary constant. Let $h \in H$ be a random hash function, and let $X_i^h$ be the number of elements of $S$ mapped to the $i$th bucket by $h$, for $i = 0, \ldots, m-1$. Then, we have
$$E\Bigl[\sum_{j=0}^{m-1}\bigl(X_j^h\bigr)^2\Bigr] \le (1 + 1/c)\,n.$$

Proof: Let $s_1, \ldots, s_n$ be the $n$ items of $S$, and let $Z_{i,j} = 1$ if $h(s_i) = h(s_j)$, for $i < j$. Observe that $E\bigl[Z_{i,j}\bigr] = P\bigl[h(s_i) = h(s_j)\bigr] \le 1/m$ (this is the only place we use the property that $H$ is 2-universal). In particular, let $Z(\alpha)$ be the set of all the variables $Z_{i,j}$, for $i < j$, such that $Z_{i,j} = 1$ and $h(s_i) = h(s_j) = \alpha$.

If for some $\alpha$ we have that $X_\alpha = k$, then there are $k$ indices $\ell_1 < \ell_2 < \cdots < \ell_k$, such that $h(s_{\ell_1}) = \cdots = h(s_{\ell_k}) = \alpha$. As such, $z(\alpha) = |Z(\alpha)| = \binom{k}{2}$. In particular, we have
$$X_\alpha^2 = k^2 = 2\binom{k}{2} + k = 2z(\alpha) + X_\alpha.$$
As such, we have
$$E\Bigl[\sum_{\alpha=0}^{m-1} X_\alpha^2\Bigr] = E\Bigl[n + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} Z_{i,j}\Bigr] = n + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} E\bigl[Z_{i,j}\bigr] \le n + \frac{2}{m}\binom{n}{2} = n + \frac{n(n-1)}{m} \le n\Bigl(1 + \frac{n-1}{m}\Bigr) \le n\Bigl(1 + \frac{1}{c}\Bigr),$$
since $m \ge cn$. ■
8.3.2. Construction of perfect hashing
Given a set S of n elements, we build an open hash table T of size, say, 2n. We use a random hash function h that is 2-universal for this hash table, see Theorem 8.2.8. Next, we map the elements of S into the hash table. Let S_j be the list of all the elements of S mapped to the jth bucket, and let X_j = |S_j|, for j = 0, . . . , 2n − 1.

We compute Y = Σ_j X_j². If Y > 6n, then we reject h, and resample a hash function h. We repeat this process till success.

In the second stage, we build secondary hash tables for each bucket. Specifically, for j = 0, . . . , 2n − 1, if the jth bucket contains X_j > 0 elements, then we construct a secondary hash table H_j to store the elements of S_j; this secondary hash table has size X_j², and again we use a random 2-universal hash function h_j for the hashing of S_j into H_j. If any pair of elements of S_j collide under h_j, then we resample the hash function h_j, and try again till success.
8.3.2.1. Analysis
Theorem 8.3.3. Given a (static) set S ⊆ U of n elements, the above scheme, constructs, in expected linear
time, a two level hash-table that can perform search queries in O(1) time. The resulting data-structure uses
O(n) space.
Proof: Given an element x ∈ U, we first compute j = h(x), and then k = h j (x), and we can check whether the
element stored in the secondary hash table H j at the entry k is indeed x. As such, the search time is O(1).
The more interesting issue is the construction time. Let $X_j$ be the number of elements mapped to the $j$th bucket, and let $Y = \sum_{j=0}^{2n-1} X_j^2$. Observe that $E[Y] \le (1 + 1/2)n = (3/2)n$, by Lemma 8.3.2 (here, $m = 2n$ and as such $c = 2$). As such, by Markov's inequality, $P[Y > 6n] \le \frac{(3/2)n}{6n} = 1/4$. In particular, picking a good top level hash function requires in expectation at most $1/(3/4) = 4/3 \le 2$ iterations. Thus the first stage takes $O(n)$ time, in expectation.

For the $j$th bucket, with $X_j$ entries, by Lemma 8.3.1, the construction succeeds with probability $\ge 1/2$. As before, the expected number of iterations till success is at most 2. As such, the expected construction time of the secondary hash table for the $j$th bucket is $O(X_j^2)$.

We conclude that the overall expected construction time is $O\bigl(n + \sum_j X_j^2\bigr) = O(n)$.

As for the space used, observe that it is $O\bigl(n + \sum_j X_j^2\bigr) = O(n)$. ■
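The two-level scheme fits in a short Python sketch (ours, not from the notes). It assumes the keys in S are distinct integers and that p is a prime larger than every key; the helper names are ours.

import random

def make_hash(p: int, m: int):
    """A random member of the 2-universal family ((a*x + b) mod p) mod m."""
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: ((a * x + b) % p) % m

def build_perfect_table(S: list[int], p: int):
    """Two-level perfect hashing sketch for a static, nonempty set S of distinct keys."""
    n = len(S)
    while True:                                   # top level: retry until sum of squares is small
        top = make_hash(p, 2 * n)
        buckets = [[] for _ in range(2 * n)]
        for x in S:
            buckets[top(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 6 * n:
            break
    tables = []
    for b in buckets:                             # secondary tables of size |b|^2, collision free
        size = max(1, len(b) ** 2)
        while True:
            h = make_hash(p, size)
            slots = [None] * size
            ok = True
            for x in b:
                if slots[h(x)] is not None:
                    ok = False
                    break
                slots[h(x)] = x
            if ok:
                break
        tables.append((h, slots))
    return top, tables

def lookup(top, tables, x) -> bool:
    h, slots = tables[top(x)]
    return slots[h(x)] == x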
8.4. Bloom filters

First try. So, let us start silly. Let B[0 . . . m − 1] be an array of bits, and pick a random hash function h : U → Z_m. Initialize B to 0. Next, for every element s ∈ S, set B[h(s)] to 1. Now, given a query, return B[h(x)] as the answer to whether or not x ∈ S. Note that B is an array of bits, and as such it can be bit-packed and stored efficiently.

For the sake of simplicity of exposition, assume that the hash function picked is truly random. As such, the probability of a false positive (i.e., a mistake) for a fixed x ∈ U is roughly n/m. Since we want the size m of the table to be close to n, this is not satisfying.
Using k hash functions. Instead of using a single hash function, let us use $k$ independent hash functions $h_1, \ldots, h_k$. For an element $s \in S$, we set $B[h_i(s)]$ to 1, for $i = 1, \ldots, k$. Given a query $x \in U$, if $B[h_i(x)]$ is zero, for any $i = 1, \ldots, k$, then $x \notin S$. Otherwise, if all these $k$ bits are on, the data-structure returns that $x$ is in $S$.

Clearly, if the data-structure returns that $x$ is not in $S$, then it is correct. The data-structure might make a mistake (i.e., a false positive), if it returns that $x$ is in $S$ (when it is not in $S$).

We interpret the storing of the elements of $S$ in $B$ as an experiment of throwing $kn$ balls into $m$ bins. The probability of a bin to be empty is
$$p = p(m, n) = \Bigl(1 - \frac{1}{m}\Bigr)^{kn} \approx \exp\bigl(-kn/m\bigr).$$
Since the number of empty bins is a martingale, we know the number of empty bins is strongly concentrated around its expectation $pm$, and we can treat $p$ as the true probability of a bin to be empty.

The probability of a mistake is
$$f(k, m, n) = (1 - p)^k.$$
In particular, for $k = (m/n)\ln 2$, we have that $p = p(m, n) \approx 1/2$, and $f(k, m, n) \approx (1/2)^{(m/n)\ln 2} \approx 0.618^{m/n}$.
Example 8.4.1. Of course, the above is fictional, as k has to be an integer. But motivated by these calculations, let m = 3n, and k = 4. We get that p(m, n) = exp(−4/3) ≈ 0.26359, and f(4, 3n, n) ≈ (1 − 0.264)⁴ ≈ 0.294. This is better than the naive k = 1 scheme, where the probability of a false positive is 1/3.
Note that this scheme gets exponentially better than the naive scheme as m/n grows.

Example 8.4.2. Consider the setting m = 8n – this is when we allocate a byte for each element stored (the element of course might be significantly bigger). The above implies we should take k = ⌈(m/n) ln 2⌉ = 6. We then get p(8n, n) = exp(−6/8) ≈ 0.4724, and f(6, 8n, n) ≈ 0.0215. Here, the naive scheme with k = 1 would give a probability of a false positive of 1/8 = 0.125. So this is a significant improvement.
Remark 8.4.3. It is important to remember that Bloom filters are competing with direct hashing of the whole elements. Even if one allocates 8 bits per item, as in the example above, the space used is significantly smaller than with regular hashing. A situation where such a Bloom filter makes sense is for a cache – we might want to decide if an element is in a slow external cache (say an SSD drive). Retrieving an item from the cache is slow, but not so slow that we are unwilling to pay a small overhead due to false positives.
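For concreteness, here is a minimal Bloom filter sketch in Python (ours, not from the notes). It uses k simple random hash functions of the form ((a·x + b) mod p) mod m and assumes p is a prime larger than any key inserted; following Example 8.4.2, one would pick m ≈ 8n and k = ⌈(m/n) ln 2⌉.

import random

class BloomFilter:
    """Minimal Bloom filter sketch.  p is assumed to be a prime larger than
    any key inserted (2^61 - 1 is a Mersenne prime)."""
    def __init__(self, m: int, k: int, p: int = (1 << 61) - 1):
        self.m, self.p = m, p
        self.bits = bytearray(m)                   # one byte per bit, for simplicity
        self.funcs = [(random.randrange(1, p), random.randrange(p)) for _ in range(k)]

    def _slots(self, x: int):
        return (((a * x + b) % self.p) % self.m for a, b in self.funcs)

    def add(self, x: int) -> None:
        for s in self._slots(x):
            self.bits[s] = 1

    def __contains__(self, x: int) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[s] for s in self._slots(x))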
8.5. Bibliographical notes

• Universal hashing is defined for integers. To implement it for other objects, one needs to map objects in some fashion to integers.
• Practical methods for various important cases, such as vectors and strings, are studied extensively. See http://en.wikipedia.org/wiki/Universal_hashing for some pointers.
• A recent important paper bridging theory and practice of hashing is “The power of simple tabulation hashing” by Mikkel Thorup and Mihai Patrascu, 2011. See http://en.wikipedia.org/wiki/Tabulation_hashing
References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
Chapter 9
Closest Pair
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024

The events of September 8 prompted Foch to draft the later legendary signal: “My centre is giving way, my right is in retreat, situation excellent. I attack.” It was probably never sent.
Proof: Consider the indicator variable $X_i$, such that $X_i = 1$ if $c_i \ne c_{i-1}$. The probability for that is $\le 1/i$, since this is the probability that the smallest number among $b_1, \ldots, b_i$ is $b_i$. (Why is this probability not simply equal to $1/i$?) As such, we have $X = \sum_i X_i$, and
$$E[X] = \sum_i E[X_i] \le \sum_{i=1}^{n}\frac{1}{i} = O(\log n). \qquad\blacksquare$$
Note, that every grid cell C of Gr , has a unique ID; indeed, let p = (x, y) be any point in C, and consider the
pair of integer numbers idC = id(p) = (⌊x/r⌋ , ⌊y/r⌋). Clearly, only points inside C are going to be mapped to
idC . This is useful, as one can store a set P of points inside a grid efficiently. Indeed, given a point p, compute
its id(p). We associate with each unique id a data-structure that stores all the points falling into this grid cell
(of course, we do not maintain such data-structures for grid cells which are empty). For our purposes here, the
grid-cell data-structure can simply be a linked list of points. So, once we computed id(p), we fetch the data
structure for this cell, by using hashing. Namely, we store pointers to all those data-structures in a hash table,
where each such data-structure is indexed by its unique id. Since the ids are integer numbers, we can do the
hashing in constant time.
Lemma 9.2.4. Given a set P of n points in the plane, and a distance r, one can verify in linear time whether CP(P) < r or CP(P) ≥ r.
Proof: Indeed, store the points of P in the grid Gr . For every non-empty grid cell, we maintain a linked list
of the points inside it. Thus, adding a new point p takes constant time. Indeed, compute id(p), check if id(p)
already appears in the hash table, if not, create a new linked list for the cell with this ID number, and store p in
it. If a data-structure already exist for id(p), just add p to it.
This takes O(n) time. Now, if any grid cell in G_r(P) contains more than four points of P, then, by Lemma 9.2.3, it must be that CP(P) < r.
Thus, when inserting a point p, the algorithm fetches all the points of P that were already inserted, from the cell of p and the 8 adjacent cells. All those cells must contain at most 4 points of P each (otherwise, we would already have stopped, since the CP(·) of the inserted points is smaller than r). Let S be the set of all those points, and observe that |S| ≤ 4 · 9 = O(1). Thus, we can compute by brute force the closest point to p in S, that is, d(p, S) = min_{s∈S} ∥p − s∥. This takes O(1) time. If d(p, S) < r, we stop and return this distance (together with the two points realizing d(p, S) as a proof that the distance is too short). Otherwise, we continue to the next point.
Overall, this takes O(n) time. As for correctness, first observe that if CP(P) > r then the algorithm would
never make a mistake, since it returns ‘CP(P) < r’ only after finding a pair of points of P with distance smaller
than r. Thus, assume that p, q are the pair of points of P realizing the closest pair, and ∥p − q∥ = CP(P) < r.
Clearly, when the later of them, say p, is being inserted, the set S would contain q, and as such the algorithm
would stop and return “CP(P) < r”. ■
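The decision procedure of Lemma 9.2.4 is short to code. The following Python sketch (ours) inserts the points into a grid with cells of side r and checks the 9 surrounding cells of each new point; it returns a pair at distance < r if one exists, and None otherwise.

import math
from collections import defaultdict

def closer_than(points, r: float):
    """Grid-based decision: return a pair of points at distance < r, or None
    (in which case CP(points) >= r).  points is a list of (x, y) tuples."""
    grid = defaultdict(list)              # cell id -> points stored in that cell
    for p in points:
        cx, cy = int(p[0] // r), int(p[1] // r)
        # Only the cell of p and its 8 neighbours can contain a point within r.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in grid[(cx + dx, cy + dy)]:
                    if math.dist(p, q) < r:
                        return p, q
        grid[(cx, cy)].append(p)
    return None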
Lemma 9.2.4 hints at a natural way to compute CP(P). Indeed, permute the points of P in an arbitrary fashion, and let P = ⟨p_1, . . . , p_n⟩. Next, let r_i = CP({p_1, . . . , p_i}). We can check if r_{i+1} < r_i by just calling the algorithm of Lemma 9.2.4 on P_{i+1} and r_i. If r_{i+1} < r_i, the algorithm of Lemma 9.2.4 would give us back the distance r_{i+1} (with the other point realizing this distance).
So, consider the “good” case where ri+1 = ri = ri−1 . Namely, the length of the shortest pair does not change.
In this case we do not need to rebuild the data structure of Lemma 9.2.4 for each point. We can just reuse
it from the previous iteration. Thus, inserting a single point takes constant time as long as the closest pair
(distance) does not change.
Things become bad, when ri < ri−1 . Because then we need to rebuild the grid, and reinsert all the points of
Pi = ⟨p1 , . . . , pi ⟩ into the new grid Gri (Pi ). This takes O(i) time.
So, if the closest pair radius, in the sequence r1 , . . . , rn , changes only k times, then the running time of the
algorithm would be O(nk). But we can do even better!
Theorem 9.2.5. Let P be a set of n points in the plane. One can compute the closest pair of points of P in
expected linear time.
Proof: Pick a random permutation of the points of P, and let ⟨p1 , . . . , pn ⟩ be this permutation. Let r2 =
∥p1 − p2 ∥, and start inserting the points into the data structure of Lemma 9.2.4. In the ith iteration, if ri = ri−1 ,
then this insertion takes constant time. If ri < ri−1 , then we rebuild the grid and reinsert the points. Namely, we
recompute Gri (Pi ).
To analyze the running time of this algorithm, let $X_i$ be the indicator variable which is 1 if $r_i \ne r_{i-1}$, and 0 otherwise. Clearly, the running time is proportional to
$$R = 1 + \sum_{i=2}^{n}\bigl(1 + X_i\cdot i\bigr).$$
As such, the expected running time is
$$E[R] = 1 + \sum_{i=2}^{n}\bigl(1 + i\cdot E[X_i]\bigr) = 1 + \sum_{i=2}^{n}\bigl(1 + i\cdot P[X_i = 1]\bigr),$$
by linearity of expectation and since for an indicator variable $X_i$, we have that $E[X_i] = P[X_i = 1]$.

Thus, we need to bound $P[X_i = 1] = P[r_i < r_{i-1}]$. To bound this quantity, fix the points of $P_i$, and randomly permute them. A point $u \in P_i$ is critical if $\mathrm{CP}(P_i\setminus\{u\}) > \mathrm{CP}(P_i)$.
(A) If there are no critical points, then $r_{i-1} = r_i$ and then $P[X_i = 1] = 0$.
(B) If there is one critical point, then $P[X_i = 1] = 1/i$, as this is the probability that this critical point would be the last point in a random permutation of $P_i$.
(C) If there are two critical points, then let $p, u$ be this unique pair of points of $P_i$ realizing $\mathrm{CP}(P_i)$. The quantity $r_i$ is smaller than $r_{i-1}$ only if either $p$ or $u$ is $p_i$. But the probability for that is $2/i$ (i.e., the probability, in a random permutation of $i$ objects, that one of two marked objects would be the last element in the permutation).
Observe that there cannot be more than two critical points. Indeed, if $p$ and $u$ are two points that realize the closest distance and $v$ is a third critical point, then $\mathrm{CP}(P_i\setminus\{v\}) = \|p - u\| = \mathrm{CP}(P_i)$, and $v$ is not critical – a contradiction.

We conclude that
$$E[R] \le n + \sum_{i=2}^{n} i\cdot P[X_i = 1] \le n + \sum_{i=2}^{n} i\cdot\frac{2}{i} \le 3n.$$
As such, the expected running time of this algorithm is $O(E[R]) = O(n)$. ■
Theorem 9.2.5 is a surprising result, since it implies that uniqueness (i.e., deciding if n real numbers are all
distinct) can be solved in linear time. However, there is a lower bound of Ω(n log n) on uniqueness, using the
comparison tree model. This reality dysfunction can be easily explained once one realizes that the model of
computation of Theorem 9.2.5 is considerably stronger, using hashing, randomization, and the floor function.
9.3. Bibliographical notes
The closest-pair algorithm follows Golin et al. [GRSS95]. This is in turn a simplification of the celebrated result of Rabin [Rab76]. Smid provides a survey of such algorithms [Smi00]. A generalization of
the closest pair algorithm was provided by Har-Peled and Raichel [HR15].
Surprisingly, Schönhage [Sch79] showed that assuming that the floor function is allowed, and the standard arithmetic operations can be done in constant time, then every problem in PSPACE can be solved in polynomial time. Since PSPACE includes NPC, this is bad news, as it implies that one can solve NPC problems in polynomial time (finally!). The basic idea is that one can pack a huge number of bits into a single number, and the floor function enables one to read a single bit of this number. As such, a real RAM model that allows certain operations, puts no limit on the bit complexity of numbers, and assumes that each operation takes constant time, is not a reasonable model of computation (but we already knew that).
References
[CLRS01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press
/ McGraw-Hill, 2001.
[GRSS95] M. Golin, R. Raman, C. Schwarz, and M. Smid. Simple randomized algorithms for closest pair
problems. Nordic J. Comput., 2: 3–27, 1995.
[HR15] S. Har-Peled and B. Raichel. Net and prune: A linear time algorithm for Euclidean distance
problems. J. Assoc. Comput. Mach., 62(6): 44:1–44:35, 2015.
[Rab76] M. O. Rabin. Probabilistic algorithms. Algorithms and Complexity: New Directions and Recent
Results. Ed. by J. F. Traub. Orlando, FL, USA: Academic Press, 1976, pp. 21–39.
[Sch79] A. Schönhage. On the power of random access machines. Proc. 6th Int. Colloq. Automata Lang.
Prog. (ICALP), vol. 71. 520–529, 1979.
[Smi00] M. Smid. Closest-point problems in computational geometry. Handbook of Computational Ge-
ometry. Ed. by J.-R. Sack and J. Urrutia. Amsterdam, The Netherlands: Elsevier, 2000, pp. 877–
935.
Chapter 10
Lemma 10.1.1. For x ≥ 0, we have 1 − x ≤ exp(−x) and 1 + x ≤ e x . Namely, for all x, we have 1 + x ≤ e x .
Proof: For x = 0 we have equality. Next, computing the derivative on both sides, we have that we need to
prove that −1 ≤ − exp(−x) ⇐⇒ 1 ≥ exp(−x) ⇐⇒ e x ≥ 1, which clearly holds for x ≥ 0.
A similar argument works for the second inequality. ■
Lemma 10.1.2. For any $y \ge 1$, and $|x| \le 1$, we have $\bigl(1 - x^2\bigr)^y \ge 1 - yx^2$.

Proof: Observe that the inequality holds with equality for $x = 0$. So compute the derivative with respect to $x$ of both sides of the inequality. We need to prove that
$$y(-2x)\bigl(1 - x^2\bigr)^{y-1} \ge -2yx \iff \bigl(1 - x^2\bigr)^{y-1} \le 1,$$
which holds as $0 \le 1 - x^2 \le 1$ and $y - 1 \ge 0$. ■
Proof: The right side of the inequality is standard by now. As for the left side, observe that
$$(1 - x^2)e^x \le 1 + x,$$
since dividing both sides by $(1+x)e^x$, we get $1 - x \le e^{-x}$, which we know holds for any $x$. By Lemma 10.1.2, we have
$$\bigl(1 - x^2 y\bigr)e^{xy} \le \bigl(1 - x^2\bigr)^y e^{xy} = \bigl(\bigl(1 - x^2\bigr)e^x\bigr)^y \le (1 + x)^y \le e^{xy}. \qquad\blacksquare$$
for any t.
A stronger bound follows from the following observation. Let $Z_i^r$ denote the event that the $i$th coupon was not picked in the first $r$ trials. Clearly,
$$P\bigl[Z_i^r\bigr] = \Bigl(1 - \frac{1}{n}\Bigr)^r \le \exp\Bigl(-\frac{r}{n}\Bigr).$$
Thus, for $r = \beta n\log n$, we have $P\bigl[Z_i^r\bigr] \le \exp\Bigl(-\frac{\beta n\log n}{n}\Bigr) = n^{-\beta}$. Thus,
$$P\bigl[X > \beta n\log n\bigr] \le P\Bigl[\bigcup_i Z_i^{\beta n\log n}\Bigr] \le n\cdot P\bigl[Z_1^{\beta n\log n}\bigr] \le n^{-\beta+1}.$$
Lemma 10.1.4. Let the random variable $X$ denote the number of trials for collecting each of the $n$ types of coupons. Then, we have $P\bigl[X > n\ln n + cn\bigr] \le e^{-c}$.

Proof: Let $m = n\ln n + cn$. The probability that we fail to pick the first type of coupon in $m$ trials is
$$\alpha = \Bigl(1 - \frac{1}{n}\Bigr)^m \le \exp\Bigl(-\frac{n\ln n + cn}{n}\Bigr) = \frac{\exp(-c)}{n}.$$
As such, by the union bound, the probability that some type of coupon is not picked in the first $m$ trials is bounded by $n\alpha = \exp(-c)$, as claimed. ■
In the following, we show a slightly stronger bound on the probability, which is 1 − exp(−e−c ). To see that
it is indeed stronger, observe that e−c ≥ 1 − exp −e−c .
Proof: By Lemma 10.1.3, we have
$$\Bigl(1 - \frac{k^2 m}{n^2}\Bigr)\exp\Bigl(-\frac{km}{n}\Bigr) \le \Bigl(1 - \frac{k}{n}\Bigr)^m \le \exp\Bigl(-\frac{km}{n}\Bigr).$$
Observe also that $\lim_{n\to\infty}\Bigl(1 - \frac{k^2 m}{n^2}\Bigr) = 1$, and $\exp\Bigl(-\frac{km}{n}\Bigr) = n^{-k}\exp(-ck)$ (recall $m = n\ln n + cn$). Also,
$$\lim_{n\to\infty}\binom{n}{k}\frac{k!}{n^k} = \lim_{n\to\infty}\frac{n(n-1)\cdots(n-k+1)}{n^k} = 1.$$
Thus,
$$\lim_{n\to\infty}\binom{n}{k}\Bigl(1 - \frac{k}{n}\Bigr)^m = \lim_{n\to\infty}\frac{n^k}{k!}\exp\Bigl(-\frac{km}{n}\Bigr) = \lim_{n\to\infty}\frac{n^k}{k!}\,n^{-k}\exp(-ck) = \frac{\exp(-ck)}{k!}. \qquad\blacksquare$$
Theorem 10.1.6. Let the random variable $X$ denote the number of trials for collecting each of the $n$ types of coupons. Then, for any constant $c \in \mathbb{R}$, and $m = n\ln n + cn$, we have $\lim_{n\to\infty} P\bigl[X > m\bigr] = 1 - \exp\bigl(-e^{-c}\bigr)$.

Before delving into the proof, observe that $1 - \exp(-e^{-c}) \approx 1 - (1 - e^{-c}) = e^{-c}$. Namely, in the limit, the upper bound of Lemma 10.1.4 is tight.
Proof: We have $P\bigl[X > m\bigr] = P\bigl[\cup_i Z_i^m\bigr]$. By inclusion-exclusion, we have
$$P\Bigl[\bigcup_i Z_i^m\Bigr] = \sum_{i=1}^{n}(-1)^{i+1} P_i^n, \qquad\text{where}\qquad P_j^n = \sum_{1\le i_1 < i_2 < \cdots < i_j \le n} P\Bigl[\bigcap_{v=1}^{j} Z_{i_v}^m\Bigr].$$
Let $S_k^n = \sum_{i=1}^{k}(-1)^{i+1} P_i^n$. We know that $S_{2k}^n \le P\bigl[\cup_i Z_i^m\bigr] \le S_{2k+1}^n$.

By symmetry,
$$P_k^n = \binom{n}{k} P\Bigl[\bigcap_{v=1}^{k} Z_v^m\Bigr] = \binom{n}{k}\Bigl(1 - \frac{k}{n}\Bigr)^m.$$
Thus, $P_k = \lim_{n\to\infty} P_k^n = \exp(-ck)/k!$, by Lemma 10.1.5. Thus, we have
$$S_k = \sum_{j=1}^{k}(-1)^{j+1} P_j = \sum_{j=1}^{k}(-1)^{j+1}\cdot\frac{\exp(-cj)}{j!}.$$
Observe that $\lim_{k\to\infty} S_k = 1 - \exp(-e^{-c})$ by the Taylor expansion of $\exp(x)$ (for $x = -e^{-c}$). Indeed,
$$\exp(x) = \sum_{j=0}^{\infty}\frac{x^j}{j!} = \sum_{j=0}^{\infty}\frac{(-e^{-c})^j}{j!} = 1 + \sum_{j=1}^{\infty}\frac{(-1)^j\exp(-cj)}{j!}.$$
Clearly, $\lim_{n\to\infty} S_k^n = S_k$ and $\lim_{k\to\infty} S_k = 1 - \exp(-e^{-c})$. Thus (using fluffy math), we have
$$\lim_{n\to\infty} P\bigl[X > m\bigr] = \lim_{n\to\infty} P\bigl[\cup_{i=1}^{n} Z_i^m\bigr] = \lim_{n\to\infty}\lim_{k\to\infty} S_k^n = \lim_{k\to\infty} S_k = 1 - \exp\bigl(-e^{-c}\bigr). \qquad\blacksquare$$
References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
Chapter 11
11.1.1. Concentration from conditional expectation
Lemma 11.1.4. Let $X_1, \ldots, X_n$ be independent random variables that with equal probability are 0 or 1. We have that $P\bigl[\sum_i X_i < n/4\bigr] < 0.9^n$ and $P\bigl[\sum_i X_i > (3/4)n\bigr] < 0.9^n$.

Proof: Let $Y_0 = 1$. If $X_i = 1$, then we set $Y_i = Y_{i-1}$, and if $X_i = 0$, then we set $Y_i = Y_{i-1}/2$. We thus have that
$$E\bigl[Y_i \mid Y_{i-1}\bigr] = \frac{1}{2}\cdot\frac{Y_{i-1}}{2} + \frac{1}{2}\,Y_{i-1} = \frac{3}{4}Y_{i-1}.$$
As such, by Lemma 11.1.2 we have
$$E[Y_i] = E\bigl[E[Y_i \mid Y_{i-1}]\bigr] = E\Bigl[\frac{3}{4}Y_{i-1}\Bigr] = \frac{3}{4}E[Y_{i-1}] = \Bigl(\frac{3}{4}\Bigr)^i.$$
In particular, $E[Y_n] = (3/4)^n$. Now, if $\sum_i X_i > (3/4)n$, then $Y_n = (1/2)^{n - \sum_i X_i} \ge (1/2)^{n/4}$. As such, by Markov's inequality,
$$P\Bigl[\sum_i X_i > \frac{3}{4}n\Bigr] \le P\bigl[Y_n \ge (1/2)^{n/4}\bigr] \le \frac{E[Y_n]}{(1/2)^{n/4}} = \Bigl(\frac{3}{4}\Bigr)^n 2^{n/4} = \Bigl(\frac{3}{4}\cdot 2^{1/4}\Bigr)^n < 0.9^n.$$
The bound on $P\bigl[\sum_i X_i < n/4\bigr]$ follows by a symmetric argument. ■
Chapter 12
12.2. Treaps
Anybody that ever implemented a balanced binary tree, knows that it can be very painful. A natural question,
is whether we can use randomization to get a simpler data-structure with good performance.
12.2.1. Construction
The key observation is that many of the data-structures that offer good performance for balanced binary search trees do so by storing additional information to help in deciding how to balance the tree. As such, the key idea is that for every element x inserted into the data-structure, we randomly choose a priority p(x); that is, p(x) is chosen uniformly and randomly in the range [0, 1].

So, for the set of elements X = {x_1, . . . , x_n}, with (random) priorities p(x_1), . . . , p(x_n), our purpose is to build a binary tree which is "balanced". So, let us pick the element x_k with the lowest priority in X, and make it the root of the tree. Now, we partition X in the natural way:
$$L = \bigl\{\, x_i \in X \;\big|\; x_i < x_k \,\bigr\} \qquad\text{and}\qquad R = \bigl\{\, x_i \in X \;\big|\; x_i > x_k \,\bigr\}.$$
We can now build recursively the trees for L and R, and denote them by T_L and T_R. We build the natural tree, by creating a node for x_k, having T_L as its left child, and T_R as its right child.

We call the resulting tree a treap, as it is a tree over the elements, and a heap over the priorities; that is, treap = tree + heap.
Lemma 12.2.1. Given n elements, the expected depth of a treap T defined over those elements is O(log(n)).
Furthermore, this holds with high probability; namely, the probability that the depth of the treap would exceed
c log n is smaller than δ = n−d , where d is an arbitrary constant, and c is a constant that depends on d.¬
Furthermore, the probability that T has depth larger than ct log(n), for any t ≥ 1, is smaller than n−dt .
Proof: Observe, that every element has equal probability to be in the root of the treap. Thus, the structure
of a treap, is identical to the recursive tree of QuickSort. Indeed, imagine that instead of picking the pivot
uniformly at random, we instead pick the pivot to be the element with the lowest (random) priority. Clearly,
these two ways of choosing pivots are equivalent. As such, the claim follows immediately from our analysis of
the depth of the recursion tree of QuickSort, see Theorem 12.1.1. ■
12.2.2. Operations
The following innocent observation is going to be the key insight in implementing operations on treaps:
Observation 12.2.2. Given n distinct elements, and their (distinct) priorities, the treap storing them is uniquely
defined.
¬
That is, if we want to decrease the probability of failure, that is δ, we need to increase c.
Figure 12.1: RotateRight: a rotate right in action. Importantly, after the rotation the priorities are ordered correctly (at least locally for this subtree).
12.2.2.1. Insertion
Given an element x to be inserted into an existing treap T, insert it in the usual way into T (i.e., treat it as a regular binary search tree). This takes O(height(T)) time. Now, x is a leaf in the treap. Set x's priority p(x) to some random number in [0, 1]. Now, while the new tree is a valid search tree, it is not necessarily still a valid treap, as x's priority might be smaller than its parent's. So, we need to fix the tree around x, so that the priority property holds.
RotateUp(x)
y ← parent(x)
while p(y) > p(x) do
if y.left_child = x then
RotateRight(y)
else
RotateLeft(y)
y ← parent(x)
We call RotateUp(x) to do so. Specifically, if x's parent is y, and p(x) < p(y), we rotate x up so that it becomes the parent of y. We repeat this until x has a larger priority than its parent. The rotation operation takes constant time and plays around with the priorities, and importantly, it preserves the binary search tree order. A rotate right operation RotateRight(D) is depicted in Figure 12.1. RotateLeft is the same tree rewriting operation done in the other direction.

Observe that as x is being rotated upwards, the priority properties are being fixed – in particular, as demonstrated in Figure 12.1, nodes are being hung on nodes that were previously their ancestors, so priorities are still monotonically decreasing along a path.

At the end of this process, both the ordering property and the priority property hold. That is, we have a valid treap that includes all the old elements, and the new element. By Observation 12.2.2, since the treap is uniquely defined, we have updated the treap correctly. Since every time we do a rotation the distance of x from the root decreases by one, it follows that an insertion takes O(height(T)) time.
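The insertion procedure above fits in a short Python sketch (ours, not from the notes). Here the rotate-up step is performed on the way out of a recursive insert, which has the same effect as RotateUp; the class and function names are ours.

import random

class Node:
    __slots__ = ("key", "prio", "left", "right")
    def __init__(self, key):
        self.key = key
        self.prio = random.random()     # priority chosen uniformly in [0, 1)
        self.left = None
        self.right = None

def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y        # x becomes the new subtree root
    return x

def rotate_left(y):
    x = y.right
    y.right, x.left = x.left, y
    return x

def insert(root, key):
    """Insert key as in a plain BST, then restore the heap order on the
    priorities by rotating the new node up towards the root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
        if root.left.prio < root.prio:  # child has smaller priority: rotate it up
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key)
        if root.right.prio < root.prio:
            root = rotate_left(root)
    return root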
12.2.2.2. Deletion
Deletion is just an insertion done in reverse. Specifically, to delete an element x from a treap T, set its priority to +∞, and rotate it down until it becomes a leaf. The only tricky observation is that you should always rotate so that the child with the lower priority becomes the new parent. Once x becomes a leaf, deleting it is trivial – just set the pointer pointing to it in the tree to null.
12.2.2.3. Split
Given an element x stored in a treap T, we would like to split T into two treaps – one treap T_≤ for all the elements smaller than or equal to x, and the other treap T_> for all the elements larger than x. To this end, we set x's priority to −∞, and fix the priorities by rotating x up so it becomes the root of the treap. The right child of x is the treap T_>, and we disconnect it from T by setting x's right child pointer to null. Next, we restore x to its real priority, and rotate it down to its natural location. The resulting treap is T_≤. This again takes time that is proportional to the depth of the treap.
12.2.2.4. Meld
Given two treaps T_L and T_R such that all the elements in T_L are smaller than all the elements in T_R, we would like to merge them into a single treap. Find the largest element x stored in T_L (this is just the last element on the path going only right from the root of the tree). Set x's priority to −∞, and rotate it up the treap so that it becomes the root. Now, x being the largest element in T_L, it has no right child. Attach T_R as the right child of x. Now, restore x's priority to its original priority, and rotate it back down so that the priority property holds.
12.2.3. Summary
Theorem 12.2.3. Let $T$ be a treap, initialized to an empty treap, and undergoing a sequence of $m = n^c$ insertions, where $c$ is some constant. The probability that the depth of the treap at any point in time would exceed $d\log n$ is $\le 1/n^f$, where $d$ is an arbitrary constant, and $f$ is a constant that depends only on $c$ and $d$.

In particular, a treap can handle insertions/deletions in $O(\log n)$ time with high probability.

Proof: The first part of the theorem implies that, with high probability, all these treaps have logarithmic depth, which in turn implies that all operations take logarithmic time, as an operation on a treap takes at most time proportional to the depth of the treap.

As for the first part, let $T_1, \ldots, T_m$ be the sequence of treaps, where $T_i$ is the treap after the $i$th operation. Similarly, let $X_i$ be the set of elements stored in $T_i$. By Lemma 12.2.1, the probability that $T_i$ has large depth is tiny. Specifically, we have that
$$\alpha_i = P\bigl[\mathrm{depth}(T_i) > t c'\log n^c\bigr] = P\Bigl[\mathrm{depth}(T_i) > c'\Bigl(t\cdot\frac{\log n^c}{\log|T_i|}\Bigr)\log|T_i|\Bigr] \le \frac{1}{n^{t\cdot c'}},$$
as a tedious and boring but straightforward calculation shows. Picking $t$ to be sufficiently large, we have that the probability that the $i$th treap is too deep is smaller than $1/n^{f+c}$. By the union bound, since there are $n^c$ treaps in this sequence of operations, it follows that the probability of any of these treaps being too deep is at most $1/n^f$, as desired. ■
The naive algorithm is of course to compare each nut to each bolt, and match them together. This would require a quadratic number of comparisons. Another option is to sort the nuts by size, and the bolts by size, and then "merge" the two ordered sets, matching them by size. The only problem is that we can not sort only the nuts, or only the bolts, since we can not compare them to each other. Instead, we sort the two sets simultaneously, by simulating QuickSort. The resulting algorithm is depicted below.

MatchNuts&Bolts (N: nuts, B: bolts)
    Pick a random nut n_pivot from N
    Find its matching bolt b_pivot in B
    B_L ← All bolts in B smaller than n_pivot
    N_L ← All nuts in N smaller than b_pivot
    B_R ← All bolts in B larger than n_pivot
    N_R ← All nuts in N larger than b_pivot
    MatchNuts&Bolts(N_R, B_R)
    MatchNuts&Bolts(N_L, B_L)
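For concreteness, here is a Python sketch of the procedure (ours, not from the notes). Nuts and bolts are modelled as numbers, and every comparison the code makes is between a nut and a bolt, respecting the constraint of the problem.

import random

def match_nuts_and_bolts(nuts, bolts):
    """QuickSort-style matching: equal value means a nut fits a bolt.
    Returns a list of (nut, bolt) pairs."""
    if not nuts:
        return []
    n_pivot = random.choice(nuts)
    b_pivot = next(b for b in bolts if b == n_pivot)   # the matching bolt
    NL = [x for x in nuts if x < b_pivot]
    NR = [x for x in nuts if x > b_pivot]
    BL = [b for b in bolts if b < n_pivot]
    BR = [b for b in bolts if b > n_pivot]
    return (match_nuts_and_bolts(NL, BL)
            + [(n_pivot, b_pivot)]
            + match_nuts_and_bolts(NR, BR))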
Definition 12.3.3. For a randomized algorithm, we can speak about the expected running time. Namely, we
are interested in bounding the quantity E[RT] for the worst input.
Definition 12.3.4. The expected running-time of a randomized algorithm for input of size $n$ is
$$T(n) = \max_{U:\ |U| = n} E\bigl[\mathrm{RT}(U)\bigr],$$
where $\mathrm{RT}(U)$ is the running time of the algorithm for the input $U$.
Definition 12.3.5. The rank of an element x in a set S , denoted by rank(x), is the number of elements in S of
size smaller or equal to x. Namely, it is the location of x in the sorted list of the elements of S .
Theorem 12.3.6. The expected running time of MatchNuts&Bolts (and thus also of QuickSort) is T (n) =
O(n log n), where n is the number of nuts and bolts. The worst case running time of this algorithm is O(n2 ).
Proof: Clearly, we have that $P\bigl[\mathrm{rank}(n_{\mathrm{pivot}}) = k\bigr] = \frac{1}{n}$. Furthermore, if the rank of the pivot is k then
References
[CS00] S. Cho and S. Sahni. A new weight balanced binary search tree. Int. J. Found. Comput. Sci.,
11(3): 485–513, 2000.
[SA96] R. Seidel and C. R. Aragon. Randomized search trees. Algorithmica, 16: 464–497, 1996.
Chapter 13
Consider the binomial distribution Bin(n, 1/2) for various values of n, as depicted in Figure 13.1 – here we think about the value of the variable as the number of heads in flipping a fair coin n times. Clearly, as the value of n increases, the probability of getting a number of heads that is significantly smaller or larger than n/2 is tiny. Here we are interested in quantifying exactly how far we can deviate from this expected value. Specifically, if $X \sim \mathrm{Bin}(n, 1/2)$, then we would be interested in bounding the probability $P[X > n/2 + \Delta]$, where $\Delta = t\sigma_X = t\sqrt{n}/2$ (i.e., we are t standard deviations away from the expectation). For t > 2, this probability is roughly $2^{-t}$, which is what we prove here.

More surprisingly, if you look only at the middle of the distribution, it looks the same after clipping away the uninteresting tails, see Figure 13.2; that is, it looks more and more like the normal distribution. This is a universal phenomenon known as the central limit theorem – every sum of nicely behaved random variables behaves like the normal distribution. We unfortunately need a more precise quantification of this behavior, thus the following.
The game. Consider the game where a player starts with Y_0 = 1 dollars. At every round, the player can bet a certain amount x (fractions are fine). With probability half she loses her bet, and with probability half she gains an amount equal to her bet. The player is not allowed to go all in – because if she loses then the game is over. So it is natural to ask what her optimal betting strategy is, such that at the end of the game she has as much money as possible.
89
0.16
0.3 0.2 0.14 0.1
0.25 0.12 0.08
0.15 0.1
0.2 0.08 0.06
0.15 0.1 0.06 0.04
0.1 0.05 0.04
0.02 0.02
0.05
0 0 0
0
0
5
10
15
20
25
30
0
10
20
30
40
50
60
0
2
4
6
8
10
12
14
16
0
1
2
3
4
5
6
7
8
n=8 n = 16 n = 32 n = 64
0.08 0.04
0.07 0.05 0.035 0.01
0.06 0.04 0.03
0.05 0.025 0.008
0.04 0.03 0.02 0.006
0.03 0.02 0.015 0.004
0.02 0.01 0.002
0.01 0.01 0.005 0
0 0 0
0
1000
2000
3000
4000
5000
6000
7000
8000
0
100
200
300
400
500
0
20
40
60
80
100
120
0
50
100
150
200
250
Figure 13.1: The binomial distribution for different values of n. It pretty quickly concentrates around its
expectation.
0.16 0.08
0.2 0.14 0.1 0.07
0.12 0.08 0.06
0.15 0.1 0.06 0.05
0.1
0.08 0.04
0.06 0.04 0.03
0.05 0.04 0.02 0.02
0.02 0.01
0 0 0 0
20
25
30
35
40
45
45
50
55
60
65
70
75
80
85
10
15
20
25
10
12
14
16
5
0
2
4
6
8
n = 16 n = 32 n = 64 n = 128
0.04 0.01
0.05 0.035 0.025 0.008
0.04 0.03 0.02 0.006
0.025 0.015
0.03 0.02 0.004
0.02 0.015 0.01
0.01 0.005 0.002
0.01 0.005 0
0 0 0
3950
4000
4050
4100
4150
4200
4250
460
480
500
520
540
560
220
230
240
250
260
270
280
290
300
100
110
120
130
140
150
160
Figure 13.2: The “middle” of the binomial distribution for different values of n. It very quickly converges to
the normal distribution (under appropriate rescaling and translation.
90
Xi ∈ {−1, +1} Xi ∈ {0, 1}
P[Xi = −1] = P[Xi = 1] = 1/2 P[Xi = 0] = P[Xi = 1] = 1/2
P Y ≥ ∆ ≤ exp −∆2 /2n Theorem 13.1.7 P |Y − n/2| ≥ ∆ ≤ 2 exp −2∆2 /n
P Y ≤ −∆ ≤ exp −∆2 /2n Theorem 13.1.7 Corollary 13.1.9
τ
τ≥1 P Y < µ/τ < exp ( − 1 − 1+ln τ
µ) Theorem 13.2.3
P Y − µ ≥ ∆ ≤ exp −2∆ /n
2
∆≥0 Corollary 13.3.5
P Y − µ ≤ −∆ ≤ exp −2∆ /n .
2
Table 13.1: Summary of Chernoff type inequalities covered. Here we have n independent random variables
P
X1 , . . . , Xn , Y = i Xi and µ = E[Y].
91
Is the game pointless? So, let Yi−1 be the money the player has in the end of the (i − 1)th round, and she bets
an amount ψi ≤ Yi−1 in the ith round. As such, in the end of the ith round, she has
Yi−1 − ψi lose: probability half
Yi =
Yi−1 + ψi win: probability half
dollars. This game, in expectation, does not change the amount of money the player has. Indeed, we have
h i 1 1
E Yi Yi−1 = (Yi−1 − ψi ) + (Yi−1 + ψi ) = Yi−1 .
2 2
h h ii
And as such, we have that E Yi = E E Yi Yi−1 = E Yi−1 = · · · = E Y0 = 1. In particular, E[Yn ] = 1 –
namely, on average, independent of the player strategy she is not going to make any money in this game (and
she is allowed to change her bets after every round). Unless, she is lucky¬ ...
What about a lucky player? The player believes she will get lucky and wants to develop a strategy to take
advantage of it. Formally, she believes that she can win, say, at least (1 + δ)/2 fraction of her bets (instead of the
predicted 1/2) – for example, if the bets are in the stock market, she can improve her chances by doing more
research on the companies she is investing in . Unfortunately, the player does not know which rounds she is
going to be lucky in – so she still needs to be careful.
In a search of a good strategy. Of course, there are many safe strategies the player can use, from not playing
at all, to risking only a tiny fraction of her money at each round. In other words, our quest here is to find the
best strategy that extracts the maximum benefit for the player out of her inherent luck.
Here, we restrict ourselves to a simple strategy – at every round, the player would bet β fraction of her
money, where β is a parameter to be determined. Specifically, in the end of the ith round, the player would have
(1 − β)Yi−1 lose
Yi =
(1 + β)Yi−1 win.
By our assumption, the player is going to win in at least M = (1 + δ)n/2 rounds. Our purpose here is to figure
out what the value of β should be so that player gets as rich as possible® . Now, if the player is successful in
≥ M rounds, out of the n rounds of the game, then the amount of money the player has, in the end of the game,
is
n/2−(δ/2)n
Yn ≥ (1 − β)n−M (1 + β) M = (1 − β)n/2−(δ/2)n (1 + β)n/2+(δ/2)n = (1 − β)(1 + β) (1 + β)δn
n/2−(δ/2)n n/2−(δ/2)n
= 1 − β2 (1 + β)δn ≥ exp −2β2 exp(β/2)δn = exp −β2 + β2 δ + βδ/2 n .
To maximize this quantity, we choose β = δ/4 (there is a better choice, see Lemma! 13.1.6,
! but we! use this
δ2
δ3
δ2
δ 2
value for the simplicity of exposition). Thus, we have that Yn ≥ exp − + + n ≥ exp n , proving
16 16 8 16
the following.
Lemma 13.1.1. Consider a Chernoff game with n rounds, starting with one dollar, where the player wins in
≥ (1 + δ)n/2 of the rounds. If the player bets δ/4
fraction
of her current money, at all rounds, then in the end
of the game the player would have at least exp nδ /16 dollars.
2
¬
“I would rather have a general who was lucky than one who was good.” – Napoleon Bonaparte.
“I am a great believer in luck, and I find the harder I work, the more I have of it.” – Thomas Jefferson.
®
This optimal choice is known as Kelly criterion, see Remark 13.1.3.
92
Remark 13.1.2. Note, that Lemma 13.1.1 holds if the player wins any ≥ (1 + δ)n/2 rounds. In particular, the
statement does not require randomness by itself – for our application, however, it is more natural and interesting
to think about the player wins as being randomly distributed.
Remark 13.1.3. Interestingly, the idea of choosing the best fraction to bet is an old and natural question arising
in investments strategies, and the right fraction to use is known as Kelly criterion, going back to Kelly’s work
from 1956 [Kel56].
monetary value, and this value grows quickly with δ. Since we are in a “zero-sum” game settings, this event
should be very rare indeed. Under this interpretation, of course, the player needs to know in advance the value
of δ – so imagine that she guesses it somehow in advance, or she plays the game in parallel with all the possible
values of δ, and she settles on the instance that maximizes her profit.
Can one do better? No, not really. Chernoff inequality is tight (this is a challenging homework exercise) up
to the constant in the exponent. The best bound I know for this version of the inequality has 1/2 instead of
1/16 in the exponent. Note, however, that no real effort was taken to optimize the constants – this is not the
purpose of this write-up.
93
13.1.3. A proof for −1/ + 1 case
Theorem 13.1.7. Let X1 , . . . , Xn be n independent random variables, such that P[Xi = 1] = P[Xi = −1] = 12 ,
P
for i = 1, . . . , n. Let Y = ni=1 Xi . Then, for any ∆ > 0, we have
h i
P Y ≥ ∆ ≤ exp −∆ /2n .
2
by the Taylor expansion of exp(·). Note, that (2k)! ≥ (k!)2k , and thus
!i
X t2i X t2i X 1 t2
∞ ∞ ∞
E exp(tXi ) = ≤ = = exp t2 /2 ,
i=0
(2i)! i=0 2i (i!) i=0 i! 2
again, by the Taylor expansion of exp(·). Next, by the independence of the Xi s, we have
X Y Y
n
Y t2 /2
n
E exp(tY) = Eexp tXi = E
exp(tXi ) = E exp(tXi ) ≤ e = ent /2 .
2
i i i=1 i=1
exp nt2 /2
We have P[Y ≥ ∆] ≤ = exp nt2 /2 − t∆ .
exp(t∆)
Next, by minimizing the above quantity for t, we set t = ∆/n. We conclude,
! !
n ∆ 2 ∆ ∆2
P[Y ≥ ∆] ≤ exp − ∆ = exp − . ■
2 n n 2n
Corollary 13.1.9. Let X1 , . . . , Xn be n independent coin flips, such that P[Xi = 0] = P[X i = 1] = 2 , for i =
1
Pn
1, . . . , n. Let Y = i=1 Xi . Then, for any ∆ > 0, we have P[|Y − n/2| ≥ ∆] ≤ 2 exp −2∆2 /n .
94
Proof: Consider the random variables Zi = 2Xi − 1 ∈ {−1, +1}. We have that
X
n Xn
P [|Y − n/2| ≥ ∆] = P [|2Y − n| ≥ 2∆] = P (2Xi − 1) ≥ 2∆ = P Zi ≥ 2∆ ≤ 2 exp −2∆2 /n ,
i=1 i=1
by Corollary 13.1.8 applied to the independent random variables Z1 , . . . , Zn . ■
Remark 13.1.10. Before going any further, it is might be instrumental to understand what this inequalities
imply.√Consider then case where Xi is either zero or one with probability half. In this case µ = E[Y] = n/2. Set
√
δ = t n ( µ is approximately the standard deviation of X if pi = 1/2). We have by
n √ 2
P Y − ≥ ∆ ≤ 2 exp −2∆ /n = 2 exp −2(t n) /n = 2 exp −2t .
2 2
2
Thus, Chernoff inequality implies exponential decay (i.e., ≤ 2−t ) with t standard deviations, instead of just
polynomial (i.e., ≤ 1/t2 ) by the Chebychev’s inequality.
Namely,
Qn h i Qn Qn
E etXi i=1 (1 − pi )e + pi e
0 t
i=1 1 + pi (et − 1)
P X > (1 + δ)µ < = = .
i=1
et(1+δ)µ et(1+δ)µ et(1+δ)µ
Let y = pi (et − 1). We know that 1 + y < ey (since y > 0). Thus,
Qn P
i=1 exp(pi (e − 1)) exp ni=1 pi (et − 1)
t
P X > (1 + δ)µ < =
et(1+δ)µP et(1+δ)µ !µ
exp (e − 1) i=1 pi
t n
exp (e − 1)µ
t
exp et − 1
= = =
et(1+δ)µ et(1+δ)µ et(1+δ)
!µ
exp(δ)
= ,
(1 + δ)(1+δ)
if we set t = log(1 + δ). ■
95
13.2.1. The lower tail
We need the following low level lemma.
P∞
Proof: For x ∈ [0, 1), we have, by the Taylor expansion, that ln(1 − x) = − i=1 (x
i
/i). As such, we have
X X !
xi X xi+1 X X
∞ ∞ ∞ ∞ ∞
xi xi xi xi
(1 − x) ln(1 − x) = −(1 − x) =− + = −x + − = −x + .
i=1
i i=1
i i=1
i i=2
i−1 i i=2
i(i − 1)
This implies that (1 − x) ln(1 − x) ≥ −x + x2 /2, which implies the claim by exponentiation. ■
Theorem 13.2.3. Let X1 , . . . , Xn be n independent random variables, where P Xi = 1 = pi , P Xi = 0 = qi =
Pn P
1 − pi , for all i. For X = i=1 Xi , its expectation is µ = E X = i pi . We have that
h e−δ iµ
P X < (1 − δ)µ < .
(1 − δ)1−δ
1+ln τ
For any positive τ > 1, we have that P X < µ/τ ≤ exp − 1 − τ
µ.
Proof: We follow the same proof template seen already. For t = − ln(1 − δ) > 0, we have E exp(−tXi ) =
(1 − pi )e0 + pi e−t = 1 − pi + pi (1 − δ) = 1 − pi δ ≤ exp(−pi δ). As such, we have
Qn
i=1 E exp(−tXi )
P X < (1 − δ)µ = P −X > −(1 − δ)µ = P exp(−tX) > exp(−t(1 − δ)µ) ≤
exp(−t(1 − δ)µ)
Pn h −δ iµ
exp − i=1 pi δ e
≤ = .
exp(−t(1 − δ)µ) (1 − δ)1−δ
Lemma 13.2.4. Let X1 , . . . , Xn ∈ {0, 1} be n independent random variables,
with pi = P Xi = 1 , for all i. For
P P
X = ni=1 Xi , and µ = E X = i pi , we have that P X < (1 − δ)µ < Exp −µδ2 /2 .
Proof: This alternative simplified form of Theorem 13.2.3, follows readily from Lemma 13.2.2, since
h e−δ iµ h e−δ iµ
P X < (1 − δ)µ ≤ ≤ ≤ Exp(−µδ2 /2). ■
(1 − δ)1−δ Exp −δ + δ /2
2
96
13.2.2. A more convenient form of Chernoff’s inequality
Lemma 13.2.5. Let X1 , . . . , Xn be n independent Bernoulli trials, where P[Xi = 1] = pi , and P[Xi = 0] = 1− pi ,
P P
for i = 1, . . . , n. Let X = bi=1 Xi , and µ = E X = i pi . For δ, ∈ (0, 1), we have
P X > (1 + δ)µ < exp −µδ /3 .
2
We have
1
f ′ (δ) = 2δ/c − ln(1 + δ). and f ′′ (δ) = 2/c − .
1+δ
For c = 3, we have f ′′ (δ) ≤ 0 for δ ∈ [0, 1/2], and f ′′ (δ) ≥ 0 for δ ∈ [1/2, 1]. Namely, f ′ (δ) achieves its
maximum either at 0 or 1. As f ′ (0) = 0 and f ′ (1) = 2/3 − ln 2 ≈ −0.02 < 0, we conclude that f ′ (δ) ≤ 0.
Namely, f is a monotonically decreasing function in [0, 1], which implies that f (δ) ≤ 0, for all δ in this range,
thus implying the claim. ■
Lemma 13.2.6. Let X1 , . . . , Xn be n independent Bernoulli trials, where P[Xi = 1] = pi , and P[Xi = 0] = 1− pi ,
P P
for i = 1, . . . , n. Let X = bi=1 Xi , and µ = E X = i pi . For δ ∈ (0, 4), we have
P X > (1 + δ)µ < exp −µδ /4 ,
2
Proof: Lemma 13.2.5 implies a stronger bound, so we need to prove the claim only for δ ∈ (1, 4]. Continuing
as in the proof of Lemma 13.2.5, for case c = 4, we have to prove that
f (δ) = δ2 /4 + δ − (1 + δ) ln(1 + δ) ≤ 0,
Lemma 13.2.7. Let X1 , . . . , Xn be n independent random variables, where P[Xi = 1] = pi , and P[Xi = 0] =
P P
1 − pi , for i = 1, . . . , n. Let X = bi=1 Xi , and µ = E X = i pi . For δ ∈ (0, 6), we have
P X > (1 + δ)µ < exp −µδ /5 ,
2
Proof: Lemma 13.2.6 implies a stronger bound, so we need to prove the claim only for δ ∈ (4, 5]. Continuing
as in the proof of Lemma 13.2.5, for case c = 5, we have to prove that
f (δ) = δ2 /5 + δ − (1 + δ) ln(1 + δ) ≤ 0,
97
Lemma 13.2.8. Let X1 , . . . , Xn be n independent Bernoulli trials, where P[Xi = 1] = pi , and P[Xi = 0] = 1− pi ,
P P
for i = 1, . . . , n. Let X = bi=1 Xi , and µ = E X = i pi . For δ > 2e − 1, we have P X > (1 + δ)µ < 2−µ(1+δ) .
Proof: By Theorem 13.2.1, we have
e (1+δ)µ e (1+δ)µ
≤ ≤ 2−(1+δ)µ ,
1+δ 1 + 2e − 1
since δ > 2e − 1. ■
Lemma 13.2.9. Let X1 , . . . , Xn be n independent Bernoulli trials, where P[Xi = 1] = pi , and P[Xi = 0] = 1− pi ,
P P
for i = 1, . . . , n. Let X = bi=1 Xi , and µ = E X = i pi . For δ > e2 , we have P X > (1 + δ)µ < exp − µδ2ln δ .
Proof: Observe that
h i !µ
eδ
P X > (1 + δ)µ < = exp µδ − µ(1 + δ) ln(1 + δ) . (13.1)
(1 + δ)1+δ
As such, we have
h i ! !
1+δ µδ ln δ
P X > (1 + δ)µ < exp −µ(1 + δ) ln(1 + δ) − 1 ≤ exp −µδln ≤ exp − ,
e 2
1+x √ 1 + x ln x
since for x ≥ e2 we have that ≥ x ⇐⇒ ln ≥ . ■
e e 2
−1
since −µ(1 + ξ) > −µξ > µ 3 lnµδφ2 > log2 φ−1 , since δ ∈ (0, 1].
If ξ ≤ 6, then by Lemma 13.2.7, we have
α = P Y > (1 + ξ)µ ≤ exp −µξ2 /5 ≤ φ,
since !2 !
µ µ 3 ln φ−1 µ 3 ln φ−1 6 ln φ
− ξ2 = − δ + >− 2·δ· =− · > − ln φ. ■
5 5 µδ2 5 µδ 2 5 δ
Example 13.2.11. Let X1 , . . . , Xn be n independent Bernoulli trials, where P[Xi = 1] = pi , and P[Xi = 0] =
P P
1 − pi , for i = 1, . . . , n. Let Y = b
Xi , and µ = E[Y] = i pi . Assume that µ ≤ 1/2. Setting δ = 1, We have,
i=1
for t > 6, that " #
3 ln exp(t/3)
P[Y > 1 + t] ≤ P Y > (1 + δ)µ + ≤ exp(−t/3),
δ2
by Lemma 13.2.10.
98
13.3. A special case of Hoeffding’s inequality
In this section, we prove yet another version of Chernoff inequality, where each variable is randomly picked
according to its own distribution in the range [0, 1]. We prove a more general version of this inequality in
Section 13.4, but the version presented here does not follow from this generalization.
P
Theorem 13.3.1. Let X1 , . . . , Xn ∈ [0,
! 1] be n independent
! random variables, let X = ni=1 Xi , and let µ = E[X].
h i µ
µ+η
n−µ
n−µ−η
We have that P X − µ ≥ η ≤ .
µ+η n−µ−η
i=1 i=1
n i=1 n
(µ + η)(n − µ) µn − µ2 + ηn − ηµ
Setting s = = we have that
µ(n − µ − η) µn − µ2 − ηµ
µ ηn µ η n−µ
1 + (s − 1) =1+ · =1+ = .
n µn − µ − ηµ n
2 n−µ−η n−µ−η
Pn h i
Corollary 13.3.3. Let X1 , . . . , Xn ∈ [0, 1] be n independent random variables, let X = X /n, p = E X =
h i
i=1 i
p q
f (t) = (p + t) ln + (q − t) ln . (13.2)
p+t q−t
Pn
Theorem 13.3.4. Let X1 , h. . . , Xn ∈ [0,
i 1] be n independent
let X = ( i=1 Xi )/n, and let
h random variables,
i
p = E[X]. We have that P X − p ≥ t ≤ exp −2nt2 and P X − p ≤ −t ≤ exp −2nt2 .
° Pn √n
The inequality between arithmetic and geometric means: ( i=1 xi )/n ≥ x1 · · · xn .
99
Proof: Let p = µ/n, q = 1 − p, and let f (t) be the function from Eq. (13.2), for t ∈ (−p, q). Now, we have that
!
′ p p+t p q q−t q p q
f (t) = ln + (p + t) − − ln − (q − t) = ln − ln
p+t p (p + t)2 q−t q (q − t)2 p+t q−t
p(q − t)
= ln .
q(p + t)
As for the second derivative, we have
qX
(pX+X
Xt) p (p + t)(−1) − (q − t) −p − t − q + t 1
f ′′ (t) = · · .= =− ≤ −4.
p(q − t) q (p + t)A2 (q − t)(p + t) (q − t)(p + t)
Indeed,
t ∈
(−p, q) and the denominator is minimized for t = (q − p)/2, and as such (q − t)(p + t) ≤
2q − (q − p) 2p + (q − p) /4 = (p + q)2 /4 = 1/4.
f ′′ (x) 2
Now, f (0) = 0 and f ′ (0) = 0, and by Taylor’s expansion, we have that f (t) = f (0)+ f ′ (0)t + t ≤ −2t2 ,
2
where x is between 0 and t.
The first bound now readily follows from plugging this bound into Corollary 13.3.3. The second bound
follows by considering the randomh variants
i Yi = 1 − Xi , for all i, and plugging this into the first bound. Indeed,
for Y = 1 − X, we have that q = E Y , and then X − p ≤ −t ⇐⇒ t ≤ p − X ⇐⇒ t ≤ 1 − q − (1 − Y) = Y − q.
h i h i
Thus, P X − p ≤ −t = P Y − q ≥ t ≤ exp −2nt2 . ■
Pn
Corollary 13.3.5. Let X1 , . . . , Xn ∈ [0, 1] be n independent
random variables, let Y = i=1 Xi , and let µ =
E[X]. For any ∆ > 0, we have P Y − µ ≥ ∆ ≤ exp −2∆ /n and P Y − µ ≤ −∆ ≤ exp −2∆ /n .
2 2
P
Theorem 13.3.6. Let X1 , . . . , Xn ∈ [0, 1] be n independent
random variables, let X = ( ni=1 Xi ), and let µ =
E[X]. We have that P X − µ ≥ εµ ≤ exp −ε µ/4 and P X − µ ≤ −εµ ≤ exp −ε µ/2 .
2 2
Proof: Let p = µ/n, and let g(x) = f (px), for x ∈ [0, 1] and xp < q. As before, computing the derivative of g,
we have
p(q − xp) q − xp 1 px
g′ (x) = p f ′ (xp) = p ln = p ln ≤ p ln ≤− ,
q(p + xp) q(1 + x) 1+x 2
since (q − xp)/q is maximized for x = 0, and ln 1+x1
≤ −x/2, Rfor x ∈ [0, 1], as
R can be easily verified± . Now,
x x
g(0) = f (0) = 0, and by integration, we have that g(x) = y=0 g′ (y)dy ≤ y=0 (−py/2)dy = −px2 /4. Now,
plugging into Corollary 13.3.3, we get that the desired probability P X − µ ≥ εµ is
h i
P X − p ≥ εp ≤ exp n f (εp) = exp ng(ε) ≤ exp −pnε /4 = exp −µε /4 .
2 2
100
since 1 − x ≤ e−x . By integration, as before,
h we conclude
i that h(x) ≤ −px2 /2. Now, plugging into Corol-
lary 13.3.3, we get P X − µ ≤ −εµ = P X − p ≤ −εp ≤ exp n f (−εp) ≤ exp nh(ε) ≤ exp −npε2 /2 ≤
exp −µε2 /2 . ■
Proof: For the sake of simplicity of exposition, assume that X is a discrete random variable, and that there
is a value α ∈ (0, 1/2), such that β = P[X = α] > 0. Consider the modified random variable X ′ , such that
′ ′ ′
h ′ i= 0]h= iP[X = 0] + β/2, and P[X = 2α] = P[X = α] + β/2. Clearly, E[X] = hE[X
P[X i ]. Next, observe that
α
E s − E s = (β/2)(s + s )−βs ≥ 0, by the convexity of s . We conclude that E s achieves its maximum
X X 2α 0 x X
h i
if takes only the values 0 and 1. But then, we have that E sX = P[X = 0]s0 + P[X = 1]s1 = (1− E[X])+ E[X] s =
1 + (s − 1) E[X] , as claimed. ■
Lemma
h i 13.4.1. Let X be a random variable. If E[X] = 0 and a ≤ X ≤ b, then for any s > 0, we have
E e sX
≤ exp s2 (b − a)2 /8 .
Proof: Let a ≤ x ≤ b and observe that x can be written as a convex combination of a and b. In particular, we
have
b−x
x = λa + (1 − λ)b for λ= ∈ [0, 1] .
b−a
Since s > 0, the function exp(sx) is convex, and as such
b − x sa x − a sb
e sx ≤ e + e ,
b−a b−a
since we have that f (λx + (1 − λ)y) ≤ λ f (x) + (1 − λ) f (y) if f (·) is a convex function. Thus, for a random
variable X, by linearity of expectation, we have
h i " #
b − X sa X − a sb b − E[X] sa E[X] − a sb
Ee ≤E e + e = e +
sX
e
b−a b−a b−a b−a
b sa a sb
= e − e ,
b−a b−a
since E[X] = 0.
a a b
Next, set p = − and observe that 1 − p = 1 + = and
b−a b−a b−a
a
−ps(b − a) = − − s(b − a) = sa.
b−a
101
As such, we have
h i
E e ≤ (1 − p)e + pe = (1 − p + pe
sX sa sb s(b−a) sa
)e
= (1 − p + pe s(b−a) )e−ps(b−a)
= exp −ps(b − a) + ln 1 − p + pe s(b−a) = exp(−pu + ln(1 − p + peu )),
1
ϕ(u) = ϕ(0) + uϕ′ (0) + u2 ϕ′′ (θ) (13.3)
2
where θ ∈ [0, u], and notice that ϕ(0) = 0. Furthermore, we have
peu
ϕ′ (u) = −p + ,
1 − p + peu
For any x, y ≥ 0, we have (x + y)2 ≥ 4xy as this is equivalent to (x − y)2 ≥ 0. Setting x = 1 − p and y = peu , we
have that
(1 − p)peu (1 − p)peu 1
ϕ′′ (u) = ≤ = .
(1 − p + pe )
u 2 4(1 − p)pe u 4
Plugging this into Eq. (13.3), we get that
h i !
1 1 1
ϕ(u) ≤ u2 = (s(b − a))2 and E e ≤ exp(ϕ(u)) ≤ exp (s(b − a)) ,
sX 2
8 8 8
as claimed. ■
Lemma 13.4.2. Let X be a random variable. If E[X] = 0 and a ≤ X ≤ b, then for any s > 0, we have
2 2
exp s (b−a)
8
P[X > t] ≤ .
e st
Proof: Using the same technique we used in proving Chernoff’s inequality, we have that
h i 2 2
h i E e sX exp s (b−a)
8
P[X > t] = P e > e ≤ ≤ . ■
sX st
e st e st
102
Theorem 13.4.3 (Hoeffding’s inequality). Let X1 , . . . , Xn be independent random variables, where Xi ∈ [ai , bi ],
for i = 1, . . . , n. Then, for the random variable S = X1 + · · · + Xn and any η > 0, we have
!
2 η2
P S − E[S ] ≥ η ≤ 2 exp − Pn .
i=1 (bi − ai )
2
Pn
Proof: Let Zi = Xi − E[Xi ], for i = 1, . . . , n. Set Z = Zi , and observe that
i=1
h i Eexp(sZ)
PZ≥η =Pe ≥e ≤ ,
sZ sη
exp(sη)
by Markov’s inequality. Arguing as in the proof of Chernoff’s inequality, we have
n !
Y Yn
Y
n
s2 (bi − ai )2
E exp(sZ) = E
exp(sZi ) = E exp(sZi ) ≤ exp ,
i=1 i=1 i=1
8
since the Zi s are independent and by Lemma 13.4.1. This implies that
Yn
2 X
n
e s (bi −ai ) /8 = exp (bi − ai )2 − sη.
2 2 s
P Z ≥ η ≤ exp(−sη)
i=1
8 i=1
P
The upper bound is minimized for s = 4η/ i (bi − ai )2 , implying
!
2η2
P Z ≥ η ≤ exp − P .
(bi − ai )2
The claim now follows by the symmetry of the upper bound (i.e., apply the same proof to −Z). ■
13.6. Exercises
Pn
Exercise 13.6.1 (Chernoff inequality is tight.). Let S = S i be a sum of n independent random variables
i=1
each attaining values +1 and −1 with equal probability. Let P(n, ∆) = P[S > ∆]. Prove that for ∆ ≤ n/C,
!
1 ∆2
P(n, ∆) ≥ exp − ,
C Cn
where C is a suitable constant. That is, the well-known Chernoff bound P(n, ∆) ≤ exp(−∆2 /2n)) is close to the
truth.
Exercise 13.6.2 (Chernoff inequality is tight by direct calculations.). For this question use only basic argu-
mentation – do not use Stirling’s formula, Chernoff inequality or any similar “heavy” machinery.
103
X
n−k !
2n n
(A) Prove that ≤ 2 22n .
i=0
i 4k
Hint: Consider flipping a coin 2n times. Write down explicitly the probability of this coin to have at most
n − k heads, and use Chebyshev inequality.
√
(B) Using (A), prove that 2nn ≥ 22n /4 n (which is a pretty good estimate).
! ! !
2n 2i + 1 2n
(C) Prove that = 1− .
n + i!+ 1 n + i!+ 1 ! n + i
2n −i(i − 1) 2n
(D) Prove that ≤ exp .
n + i! 2n! ! n
2n 8i2 2n
(E) Prove that ≥ exp − .
n+i n n!
2n 22n
(F) Using the above, prove that ≤ c √ for some constant c (I got c = 0.824... but any reasonable
n n
constant will do).
(G) Using the above, prove that √
X n 2n !
(t+1)
≤ c22n exp −t2 /2 .
√ n−i
i=t n+1
√
In particular, conclude that when flipping faircoin 2n times, the probability to get less than n−t n heads
(for t an integer) is smaller than c′ exp −t2 /2 , for some constant c′ .
(H) Let X be the number of heads in 2n coin flips.
Prove that for any integer t > 0 and any δ > 0 sufficiently
small, it holds that P[X < (1 − δ)n] ≥ exp −c δ n , where c′′ is some constant. Namely, the Chernoff
′′ 2
Exercise 13.6.3 (Tail inequality for geometric variables). Let X1 , . . . , Xm be m independent random variables
P
with geometric distribution with probability
p (i.e.,
P Xi = j = (1 − p) j−1 p). Let Y = i Xi , and let µ = E[Y] =
m/p. Prove that P Y ≥ (1 + δ)µ ≤ exp −mδ2 /8 .
References
[DP09] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized
Algorithms. Cambridge University Press, 2009.
[Kel56] J. L. Kelly. A new interpretation of information rate. Bell Sys. Tech. J., 35(4): 917–926, 1956.
[Mat99] J. Matoušek. Geometric Discrepancy. Vol. 18. Algorithms and Combinatorics. Springer, 1999.
[McD89] C. McDiarmid. Surveys in Combinatorics. Ed. by J. Siemons. Cambridge University Press, 1989.
Chap. On the method of bounded differences.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
104
Chapter 14
105
Going back to the QuickSort problem, we have thatpif wep sort n elements, the probability
p that p u will
participate in more than L = (4 + c) lg n = 2 lg n + 4c lg n lg n, is smaller than 2 exp −c lg n lg n ≤
1/nc , by Lemma 14.1.1. There are n elements being sorted, and as such the probability that any element would
participate in more than (4 + c + 1) lg n recursive calls is smaller than 1/nc .
Lemma 14.1.2. For any c > 0, the probability that QuickSort performs more than (6 + c)n lg n, is smaller
than 1/nc .
Lemma 14.2.1. The events E1 , . . . , En are independent (as such, the variables X1 , . . . , Xn are independent).
Proof: Exercise. ■
Theorem 14.2.2. Let Π = π1 . . . πn be a random permutation of 1, . . . , n, and let Z be the number of times, that
πi is the smallest number among π1 , . . . , πi , for i = 1, . . . , n. Then, we have that for t ≥ 2e that P[Z > t ln n] ≤
1/nt ln 2 , and for t ∈ 1, 2e , we have that P Z > t ln n ≤ 1/n(t−1) /4 .
2
P
Proof: Follows readily from Chernoff’s inequality, as Z = i Xi is a sum of independent indicator variables,
and, since by linearity of expectations, we have
Z
X X1
n n+1
1
µ=EZ = E Xi = ≥ dx = ln(n + 1) ≥ ln n.
i i=1
i x=1 x
106
RandomRoute( v0 , . . . , vN−1 )
// vi : Packet at node i to be routed to node d(i).
(i) Pick a random intermediate destination σ(i) from [1, . . . , N]. Packet vi travels to σ(i).
// Here random sampling is done with replacement.
// Several packets might travel to the same destination.
(ii) Wait till all the packets arrive to their intermediate destination.
(iii) Packet vi travels from σ(i) to its destination d(i).
Figure 14.1: The routing algorithm
A routing scheme is oblivious if every node that has to forward a packet, inspect the packet, and depending
only on the content of the packet decides how to forward it. That is, such a routing scheme is local in nature, and
does not take into account other considerations. Oblivious routing is of course a bad idea – it ignores congestion
in the network, and might insist routing things through regions of the hypercube that are “gridlocked”.
Theorem 14.3.1 ([KKT91]). For any deterministic oblivious permutation routing algorithm on a network √ of
N nodes each of out-degree n, there is a permutation for which the routing of the permutation takes Ω N/n
units of time (i.e., ticks).
Proof: (Sketch.) The above is implied by a nice averaging argument – construct, for every possible destination,
the routing tree of all packets to this specific node. Argue that there must be many edges in this tree that are
highly congested in this tree (which is NOT the permutation routing we are looking for!). Now, by averaging,
there must be a single edge that is congested in “many” of these trees. Pick a source-destination pair from each
one of these trees that uses this edge, and complete it into a full permutation in the natural way. Clearly, the
congestion of the resulting permutation is high. For the exact details see [KKT91]. ■
How do we send a packet? We use bit fixing. Namely, the packet from the i node, always go to the current
adjacent node that have the first different bit as we scan the destination string d(i). For example, packet from
(0000) going to (1101), would pass through (1000), (1100), (1101).
The routing algorithm. We assume each edge have a FIFO queue. The routing algorithm is depicted in
Figure 14.1.
14.3.1. Analysis
We analyze only step (i) in the algorithm, as (iii) follows from the same analysis. In the following, let ρi denote
the route taken by vi in (i).
Exercise 14.3.2. Once a packet v j that travel along a path ρ j can not leave a path ρi , and then join it again later.
Namely, ρi ∩ ρ j is (maybe an empty) path.
Lemma 14.3.3. Let the route of a message c follow the sequence of edges π = (e1 , e2 , . . . , ek ). Let S be the set
of packets whose routes pass through at least one of (e1 , . . . , ek ). Then, the delay incurred by c is at most |S |.
Proof: A packet in S is said to leave π at that time step at which it traverses an edge of π for the last time. If a
packet is ready to follow edge e j at time t, we define its lag at time t to be t − j. The lag of c is initially zero, and
107
the delay incurred by c is its lag when it traverse ek . We will show that each step at which the lag of c increases
by one can be charged to a distinct member of S .
We argue that if the lag of c reaches ℓ + 1, some packet in S leaves π with lag ℓ. When the lag of c increases
from ℓ to ℓ + 1, there must be at least one packet (from S ) that wishes to traverse the same edge as c at that
time step, since otherwise c would be permitted to traverse this edge and its lag would not increase. Thus, S
contains at least one packet whose lag reach the value ℓ.
Let τ be the last time step at which any packet in S has lag ℓ. Thus there is a packet d ready to follow edge
eµ at τ, such that τ − µ = ℓ. We argue that some packet of S leaves π at time τ – this establishes the lemma
since once a packet leaves π, it would never join it again and as such will never again delay c.
Since d is ready to follow eµ at time τ, some packet ω (which may be d itself) in S traverses eµ at time τ.
Now ω must leave π at time τ – if not, some packet will follow eµ+1 at step µ + 1 with lag ℓ. But this violates the
maximality of τ. We charge to ω the increase in the lag of c from ℓ to ℓ + 1. Since ω leaves π, it will never be
charged again. Thus, each member of S whose route intersects π is charge for at most one delay, establishing
the lemma. ■
Let Hi j be an indicator variable that is 1 if ρi and ρ j share an edge, and 0 otherwise. The total delay for vi
P
is at most ≤ j Hi j .
Crucially, for a fixed i, the variables Hi1 , . . . , HiN are independent. Indeed, imagine first picking the desti-
nation of vi , and let the associated path be ρi . Now, pick the destinations of all the other packets in the network.
Since the sampling of destinations is done with replacements, whether or not the path ρ j of v j intersects ρi , is
independent of whether ρk intersects ρi . Of course, the probabilities P[Hi j = 1] and P[Hik = 1] are probably
different. Confusingly, however, H11 , . . . , HNN are not independent. Indeed, imagine k and j being close ver-
tices on the hypercube. If Hi j = 1 then intuitively it means that ρi is traveling close to the vertex v j , and as such
there is a higher probability that Hik = 1.
Let
ρi = (e1 , . . . , ek ),
and let T (e) be the number of packets (i.e., paths) that pass through e. We have that
N k
X
N X
k X X
Hi j ≤ T (e j ) and thus E Hi j ≤ E T (e j ) .
j=1 j=1 j=1 j=1
Because of symmetry, the variables T (e) have the same distribution for all the edges of G. On the other hand,
the expected length of a path is n/2, there are N packets, and there are Nn/2 edges® . We conclude
Total length of paths N(n/2)
E[T (e)] = = =1
# of edges in graph N(n/2)
= 1. Thus, for k = |ρi |, we have
hX N i hXk i hX k i X
|ρi |
h i |ρi |
X h i n
µ=E Hi j ≤ E T (e j ) = E E T (e j ) ρi = E E T (e j ) ρ i = E 1 = E |ρi | = .
j=1 j=1 j=1 j=1 j=1
2
By the Chernoff inequality, specifically Lemma 13.2.8, we have
X X
P Hi j > 7n ≤ P Hi j > (1 + 13)µ < 2−13µ ≤ 2−6n .
j j
Since there are N = 2n packets, we know that with probability ≤ 2−5n all packets arrive to their temporary
destination in a delay of most 7n.
®
Indeed, the hypercube has N vertices, all of degree n. As such, the number of edges is Nn/2.
108
Theorem 14.3.4. Each packet arrives to its destination in ≤ 14n stages, in probability at least 1 − 1/N (note
that this is very conservative).
Next, consider generating M such string, where the value of M would be determined shortly. Clearly, the
probability that any pair of strings are at distance at most n/2 − ∆, is
!
M
α≤ exp −2∆2 /n < M 2 exp −2∆2 /n .
2
If this probability is smaller than one, then there is some probability that all the M strings are of distance at
least n/2 − ∆ from each other. Namely, there exists a set of M strings such that every pair of them is far. We
used here the fact that if an event has probability larger than zero, then it exists. Thus, set ∆ = n/4, and observe
that
α < M 2 exp −2n2 /16n = M 2 exp(−n/8).
Thus, for M = exp(n/16), we have that α < 1. We conclude:
Lemma 14.4.1. There exists a set of exp(n/16) binary strings of length n, such that any pair of them is at
Hamming distance at least n/4 from each other.
This is our first introduction to the beautiful technique known as the probabilistic method — we will hear
more about it later in the course.
This√ result has also interesting interpretation in the Euclidean setting. Indeed, consider the sphere S of
radius n/2 centered at (1/2, 1/2, . . . , 1/2) ∈ Rn . Clearly, all the vertices of the binary hypercube {0, 1}n lie on
this sphere. As such, let P be the set ofppoints on S that
√ exists√according to Lemma 14.4.1. A pair p, q of points
of P have Euclidean distance at least dH (p, q) = n/4 = n/2 from each other. We conclude:
Lemma 14.4.2. Consider the unit hypersphere S in Rn . The sphere S contains a set Q of points, such that each
pair of points is at (Euclidean) distance at least one from each other, and |Q| ≥ exp(n/16).
√
Proof: Take the above point set, and scale it down by a factor of n/2. ■
109
14.6. Exercises
Exercise 14.6.1 (More binary strings. More!). To some extent, Lemma 14.4.1 is somewhat silly, as one can
prove a better bound by direct argumentation. Indeed, for a fixed binary string x of length n, show a bound on
the number of strings in the Hamming ball around x of radius n/4 (i.e., binary strings of distance at most n/4
from x). (Hint: interpret the special case of the Chernoff inequality as an inequality over binomial coefficients.)
Next, argue that the greedy algorithm which repeatedly pick a string which is in distance ≥ n/4 from all
strings picked so far, stops after picking at least exp(n/8) strings.
References
[KKT91] C. Kaklamanis, D. Krizanc, and T. Tsantilas. Tight bounds for oblivious routing in the hypercube.
Math. sys. theory, 24(1): 223–232, 1991.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
[MU05] M. Mitzenmacher and U. Upfal. Probability and Computing – randomized algorithms and prob-
abilistic analysis. Cambridge, 2005.
110
Chapter 15
Min Cut
To acknowledge the corn - This purely American expression means to admit the losing of an argument, especially in regard
to a detail; to retract; to admit defeat. It is over a hundred years old. Andrew Stewart, a member of Congress, is said to
have mentioned it in a speech in 1828. He said that haystacks and cornfields were sent by Indiana, Ohio and Kentucky to
Philadelphia and New York. Charles A. Wickliffe, a member from Kentucky questioned the statement by commenting that
haystacks and cornfields could not walk. Stewart then pointed out that he did not mean literal haystacks and cornfields, but
the horses, mules, and hogs for which the hay and corn were raised. Wickliffe then rose to his feet, and said, “Mr. Speaker, I
acknowledge the corn”.
111
Here we are going to look on a very specific variant of this problem. Imagine that starting with a single
male. A male has exactly two children, and one of them is a male with probability half (i.e., the Y-chromosome
is being passed only to its male children). As such, the natural question is what is the probability that h
generations down, there is a male decedent that all his ancestors are male (i.e., it caries the original family
name, and the original Y-chromosome).
f(x)=x - x2/4
1
0.8
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Proof: (Feel free to skip reading.) The proof is by induction. For h = 1, we have ρ1 = 3/4 ≥ 1/(1 + 1).
Observe that ρh = f (ρh−1 ) for f (x) = x − x2 /4, and f ′ (x) = 1 − x/2. As such, f ′ (x) > 0 for x ∈ [0, 1] and
f (x) is increasing in the range [0, 1]. As such, by induction, we have that
!
1 1 1
ρh = f (ρh−1 ) ≥ f = − 2.
(h − 1) + 1 h 4h
112
Lemma 15.1.2. We have that ρh = O(1/h).
Proof: (Feel free to skip reading.) We prove the claim for infinite number of values of h – the claim then fol-
lows for all h by fiddling with the constants. The claim trivially holds for small values of h. For any j > 0, let
h j be the minimal index such that ρh j ≤ 1/2 j . It is easy to verify that ρh j ≥ 1/2 j+1 . We claim (mysteriously)
that
ρh j − ρh j+1
h j+1 − h j ≤ .
(ρh j+1 )2 /4
Indeed, ρk+1 is the number resulting from removing ρ2k /4 from ρk . Namely, the sequence ρ1 , ρ2 , . . . is a mono-
tonically decreasing sequence of numbers in the interval [0, 1], where the gaps between consecutive numbers
2
decreases. In particular, to get from ρh j to ρh j+1 , the gaps used were of size at least ∆ = ρh j+1 , which means
that there are at least (ρh j − ρh j+1 )/∆ − 1 numbers in the series between these two elements. As such, since
ρh j ≤ 1/2 j and ρh j+1 ≥ 1/2 j+2 , we have
ρh j − ρh j+1 1/2 j − 1/2 j+2
h j+1 − h j ≤ ≤ ≤ 22 j+6 /2 j = 2 j+6 .
(ρh j+1 )2 /4 1/22( j+2)+2
This implies that h j ≤ (h j − h j−1 ) + (h j−1 − h j−2 ) + · · · + (h1 − h0 ) ≤ 2 j+6 . As such, we have ρh j ≤ 1/2 j ≤ 26 /2 j+6 ≤
26 /h j , which implies the claim. ■
113
15.3. The Algorithm
Observation 15.3.1. A set of vertices in G/xy corresponds to a set of vertices in the graph G. Thus a cut
in G/xy always corresponds to a valid cut in G. However, there are cuts in G that do not exist in G/xy. For
example, the cut S = {x}, does not exist in G/xy. As such, the size of the minimum cut in G/xy is at least as large
as the minimum cut in G (as long as G/xy has at least one edge). Since any cut in G/xy has a corresponding
cut of the same cardinality in G.
Our algorithm works by repeatedly performing edge contractions. This is beneficial as this shrinks the
underlying graph, and we would compute the cut in the resulting (smaller) graph. An “extreme” example of
this, is shown in Figure 15.4, where we contract the graph into a single edge, which (in turn) corresponds to
a cut in the original graph. (It might help the reader to think about each vertex in the contracted graph, as
corresponding to a connected component in the original graph.)
Figure 15.4 also demonstrates the problem with taking this approach. Indeed, the resulting cut is not the
minimum cut in the graph.
So, why did the algorithm fail to find the minimum cut in this case?¬ The failure occurs because of the
contraction at Figure 15.4 (e), as we had contracted an edge in the minimum cut. In the new graph, depicted in
Figure 15.4 (f), there is no longer a cut of size 3, and all cuts are of size 4 or more. Specifically, the algorithm
succeeds only if it does not contract an edge in the minimum cut.
Observation 15.3.2. Let e1 , . . . , en−2 be a sequence of edges in G, such that none of them is in the minimum cut,
and such that G′ = G/ {e1 , . . . , en−2 } is a single multi-edge. Then, this multi-edge corresponds to a minimum
cut in G.
Note, that the claim in the above observation is only in one direction. We might be able to still compute
a minimum cut, even if we contract an edge in a minimum cut, the reason being that a minimum cut is not
¬
Naturally, if the algorithm had succeeded in finding the minimum cut, this would have been our success.
114
2
2 2
2 2
y
x
(a) (b) (c) (d)
2 2
2 2 4 4
2 2 2 3 3
2 2 52 5
(i) (j)
Figure 15.4: (a) Original graph. (b)–(j) a sequence of contractions in the graph, and (h) the cut in the original
graph, corresponding to the single edge in (h). Note that the cut of (h) is not a mincut in the original graph.
Algorithm MinCut(G)
G0 ← G
i=0
while Gi has more than two vertices do
Pick randomly an edge ei from the edges of Gi
Gi+1 ← Gi /ei
i←i+1
Let (S , V \ S ) be the cut in the original graph
corresponding to the single edge in Gi
return (S , V \ S ).
115
unique. In particular, another minimum cut might survived the sequence of contractions that destroyed other
minimum cuts.
Using Observation 15.3.2 in an algorithm is problematic, since the argumentation is circular, how can we
find a sequence of edges that are not in the cut without knowing what the cut is? The way to slice the Gordian
knot here, is to randomly select an edge at each stage, and contract this random edge.
See Figure 15.5 for the resulting algorithm MinCut.
15.3.1. Analysis
15.3.1.1. The probability of success
Naturally, if we are extremely lucky, the algorithm would never pick an edge in the mincut, and the algorithm
would succeed. The ultimate question here is what is the probability of success. If it is relatively “large” then
this algorithm is useful since we can run it several times, and return the best result computed. If on the other
hand, this probability is tiny, then we are working in vain since this approach would not work.
Lemma 15.3.3. If a graph G has a minimum cut of size k and G has n vertices, then |E(G)| ≥ kn
2
.
Proof: Each vertex degree is at least k, otherwise the vertex itself would form a minimum cut of size smaller
P
than k. As such, there are at least v∈V degree(v)/2 ≥ nk/2 edges in the graph. ■
Lemma 15.3.4. Fix a specific minimum cut C = (S , S ) in the graph. If we pick in random an edge e from a
graph G, uniformly at random, then with probability at most 2/n it belongs to the minimum cut C.
Proof: There are at least nk/2 edges in the graph and exactly k edges in the minimum cut. Thus, the probability
of picking an edge from the minimum cut is smaller then k/(nk/2) = 2/n. ■
The following lemma shows (surprisingly) that MinCut succeeds with reasonable probability.
2
Lemma 15.3.5. MinCut outputs the mincut with probability ≥ .
n(n − 1)
7
Proof: Let Ei be the event that ei is not in the minimum cut of Gi . By Observation 15.3.2, MinCut outputs the
minimum cut if the events E0 , . . . , En−3 all happen (namely, all edges picked are outside the minimum cut).
h i 2 2
By Lemma 15.3.4, it holds P Ei E0 ∩ E1 ∩ . . . ∩ Ei−1 ≥ 1 − =1− . Implying that
|V(Gi )| n−i
h i h i h i
∆ = P[E0 ∩ . . . ∩ En−3 ] = P[E0 ] · P E1 E0 · P E2 E0 ∩ E1 · . . . · P En−3 E0 ∩ . . . ∩ En−4 .
As such, we have
Y
n−3 ! Yn−3
2 n−i−2
∆≥ 1− =
i=0
n−i i=0
n−i
n− 2
n− 3 n−
4 X
n−XX n−
5
6 3 2 1
= ∗ ∗ ∗ ∗ ∗ · · · ∗ ∗ ∗
n n−1 n−2
n−3 n−4 5 4 3
2
= . ■
n(n − 1)
116
15.3.1.2. Running time analysis.
Observation 15.3.6. MinCut runs in O(n2 ) time.
Observation 15.3.7. The algorithm always outputs a cut, and the cut is not smaller than the minimum cut.
Definition 15.3.8. (informal) Amplification is the process of running an experiment again and again till the
things we want to happen, with good probability, do happen.
Let MinCutRep be the algorithm that runs MinCut n(n − 1) times and return the minimum cut computed
in all those independent executions of MinCut.
Lemma 15.3.9. The probability that MinCutRep fails to return the minimum cut is < 0.14.
Proof: The probability of failure of MinCut to output the mincut in each execution is at most 1 − n(n−1)
2
, by
Lemma 15.3.5. Now, MinCutRep fails, only if all the n(n − 1) executions of MinCut fail. But these executions
are independent, as such, the probability to this happen is at most
!n(n−1) !
2 2
1− ≤ exp − · n(n − 1) = exp(−2) < 0.14,
n(n − 1) n(n − 1)
since 1 − x ≤ e−x for 0 ≤ x ≤ 1. ■
4
Theorem 15.3.10.
One
can compute the minimum cut in O(n ) time with constant probability to get a correct
result. In O n4 log n time the minimum cut is returned with high probability.
The analysis. To see that this algorithm is equivalent to MinCut (Figure 15.5), observe that the contraction
algorithm simulates Kruskal’s MST algorithm when run on randomly weighted edges. First, imagine imple-
menting MinCut so that it keeps parallel edges. Then, the edges connecting two vertices that are not contracted
are exactly the edges between the two connected components. Picking a random edge to contract, is equivalent
to picking the edge with the minimum random weight. Thus, the MST algorithm here just simulates MinCut
(or vice versa).
A small optimization. It is possible to compute the heaviest edge in the MST, and the partition it induces in
(deterministic) linear time – it is a nice example of the search and prune technique.
Exercise 15.3.11. Given a graph G with weights on the edges, show how to compute the maximum weight
edge in the MST of G in O(n + m) time, where n are m are the number of vertices and edges of G, respectively.
Thus, this yields a O(n + m) implementation of MinCut. We get the following result.
Lemma 15.3.12. MinCut can implemented to run in O(n + m) time, and it outputs the mincut with probability
2
≥ .
n(n − 1)
117
FastCut(G = (V, E))
G – multi-graph
begin
n ← |V(G)|
Contract ( G, t ) if n ≤ 6 then
begin Compute (via brute force) minimum cut
while |(G)| > t do lof G and return cut.
√ m
Pick a random edge e in G. t ← 1 + n/ 2
G ← G/e H1 ← Contract(G, t)
return G H2 ← Contract(G, t)
end /* Contract is randomized!!! */
X1 ← FastCut(H1 ),
X2 ← FastCut(H2 )
return minimum cut out of X1 and X2 .
end
Figure 15.6: Contract(G, t) shrinks G till it has only t vertices. FastCut computes the minimum cut using
Contract.
Proof: Well, we perform two calls to Contract(G, t) which takes O(n2 ) time. And then we perform two
recursive calls on the resulting graphs. We have
√
T (n) = O(n2 ) + 2T n/ 2 .
The solution to this recurrence is O n2 log n as one can easily (and should) verify. ■
This would require a more involved algorithm, that is life.
118
Exercise 15.4.2. Show that one can modify FastCut so that it uses only O(n2 ) space.
√
Lemma 15.4.3. The probability that Contract G, n/ 2 had not contracted the minimum cut is at least 1/2.
Namely, the probability that the minimum cut in the contracted graph is still a minimum cut in the original
graph is at least 1/2.
l √ m
Proof: Just plug in ν = n − t = n − 1 + n/ 2 into Eq. (15.2). We have
l √ m l √ m
h i t(t − 1) 1 + n/ 2 1 + n/ 2 − 1 1
P E0 ∩ . . . ∩ En−t ≥ = ≥ . ■
n · (n − 1) n(n − 1) 2
The following lemma bounds the probability of success.
Lemma 15.4.4. FastCut finds the minimum cut with probability larger than Ω 1/ log n .
Proof: Let T h be the recursion tree of the algorithm of depth h = Θ(log n). Color an edge of recursion tree by
black if the contraction succeeded. Clearly, the algorithm succeeds if there is a path from the root to a leaf that
is all black. This is exactly the settings of Lemma 15.1.1, and we conclude that the probability of success is at
least 1/(h + 1) = Θ(1/ log n), as desired. ■
Exercise 15.4.5. Prove, that running FastCut repeatedly c · log2 n times, guarantee that the algorithm outputs
the minimum cut with probability ≥ 1 − 1/n2 , say, for c a constant large enough.
Theorem 15.4.6. One can compute the minimum cut in a graph G with n vertices in O(n2 log3 n) time. The
algorithm succeeds with probability ≥ 1 − 1/n2 .
Proof: We do amplification on FastCut by running it O(log2 n) times. The running time bound follows from
Lemma 15.4.1. The bound on the probability follows from Lemma 15.4.4, and using the amplification analysis
as done in Lemma 15.3.9 for MinCutRep. ■
Galton-Watson process. The idea of using coloring of the edges of a tree to analyze FastCut might be new
(i.e., Section 15.1.2).
References
[Gre69] W. Greg. Why are Women Redundant? Trübner, 1869.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
[Ste12] E. Steinlight. Why novels are redundant: sensation fiction and the overpopulation of literature.
ELH, 79(2): 501–535, 2012.
[WG75] H. W. Watson and F. Galton. On the probability of the extinction of families. J. Anthrop. Inst.
Great Britain, 4: 138–144, 1875.
119
120
Chapter 16
16.1. Discrepancy
Consider a set system (X, R), where n = |X|, and R ⊆ 2X . A natural task is to partition X into two sets S , T ,
such that for any range r ∈ R, we have that χ(r) = |S ∩ r| − |T ∩ r| is minimized. In a perfect partition, we
would have that χ(r) = 0 – the two sets S , T partition every range perfectly in half. A natural way to do so, is
to consider this as a coloring problem – an element of X is colored by +1 if it is in S , and −1 if it is in T .
Definition 16.1.1. Consider a set system S = (X, R), and let χ : X → {−1, +1} be a function (i.e., a coloring).
P
The discrepancy of r ∈ R is χ(r) = | x∈r χ(x)|. The discrepancy of χ is the maximum discrepancy over all the
ranges – that is
disc(χ) = max χ(r).
r∈R
The discrepancy of S is
disc(S) = min disc(χ).
χ:X→{−1,+1}
Bounding the discrepancy of a set system is quite important, as it provides a way to shrink the size of
the set system, while introducing small error. Computing the discrepancy of a set system is generally quite
challenging. A rather decent bound follows by using random coloring.
For technical reasons, it is easy to think about the set system as an incidence matrix.
Definition 16.1.3. For a m × n a binary matrix M (i.e., each entry is either 0 or 1), consider a vector b ∈
{−1, +1}n . The discrepancy of b is ∥Mb∥∞ .
121
Theorem 16.1.4. Let M be an n × n binarypmatrix (i.e., each entry is either 0 or 1), then there always exists a
vector b ∈ {−1, +1}n , such that ∥Mb∥∞ ≤ 4 n log n. Specifically, a random coloring provides such a coloring
with high probability.
Proof: Let v = (v1 , . . . , vn ) be a row of M. Chose a random b = (b1 , . . . , bn ) ∈ {−1, +1}n . Let i1 , . . . , iτ be the
indices such that vi j = 1, and let
X
n X
τ X
τ
Y = ⟨v, b⟩ = vi bi = vi j bi j = bi j .
i=1 j=1 j=1
As such Y is the sum of m independent random variables that accept values in {−1, +1}. Clearly,
hX i X X
E[Y] = E ⟨v, b⟩ = E vi bi = E[vi bi ] = vi E[bi ] = 0.
i i i
√
By Chernoff inequality and the symmetry of Y, we have that, for ∆ = 4 n ln m, it holds
hX τ i ! !
∆2 n ln m 2
P |Y| ≥ ∆ = 2 P ⟨v, b⟩ ≥ ∆ = 2 P bi j ≥ ∆ ≤ 2 exp − = 2 exp −8 ≤ 8,
j=1
2τ τ m
√
since τ ≤ n. In words, the probability that any entry in Mb exceeds (in absolute values) 4 n ln, is smaller√ than
2/m7 . Thus, with probability at least 1 − 2/m7 , all the entries of Mb have absolute
√ value smaller than 4 n ln m.
In particular, there exists a vector b ∈ {−1, +1} such that ∥ Mb ∥∞ ≤ 4 n ln m.
n
■
We might spend more time on discrepancy later on – it is a fascinating topic, well worth its own course.
√ Using random assignment and the Chernoff inequality, we showed that there exists v, such that ∥Mv∥∞ ≤
4 n ln n. Can we derandomize this algorithm? Namely, can we come up with an efficient deterministic algo-
rithm that has low discrepancy?
To derandomize our algorithm, construct a computation tree of depth n, where in the ith level we expose
the ith coordinate of v. This tree T has depth n. The root represents all possible random choices, while a
node at depth i, represents all computations when the first i bits are fixed. For a node v ∈ T , let P(v) be the
probability that a random computation starting from v succeeds – here randomly assigning the remaining bits
can be interpreted as a random walk down the tree to a leaf. √
Formally, the algorithm is successful if ends up with a vector v, such that ∥Mv∥∞ ≤ 4 n ln n.
Let vl and vr be the two children of v. Clearly, P(v) = (P(vl ) + P(vr ))/2. In particular, max(P(vl ), P(vr )) ≥
P(v). Thus, if we could compute P(·) quicklyp(and deterministically), then we could derandomize the algorithm.
Let Cm+ be the bad event that rm · v > 4 n log n, where rm is the mth row ofh M. Similarly, iCm− is the bad
p
event that rm · v < −4 n log n, and let Cm = Cm+ ∪ Cm− . Consider the probability, P Cm+ v1 , . . . , vk (namely, the
first k coordinates of v are specified). Let rm = (r1 , . . . , rn ). We have that
+ hX
n
p X
k i h X i h X i
P m 1
C v , . . . , vk = P v r
i i > 4 n log n − v r
i i = P v r
i i > L = P vi > L ,
i=k+1 i=1 i≥k+1,ri ,0 i≥k+1,ri =1
122
p P P
where L = 4 n log n − ki=1 vi ri is a known quantity (since v1 , . . . , vk are known). Let V = i≥k+1,ri =1 1. We
have,
h i hX i "X #
+ vi + 1 L + V
P C m v1 , . . . , vk = P (vi + 1) > L + V = P > ,
i≥k+1 i≥k+1
2 2
αi =1 αi =1
The last quantity is the probability that in V flips of a fair 0/1 coin one gets more than (L + V)/2 heads. Thus,
X ! !
h i 1 X
V V
V 1 V
P+m = P Cm+ v1 , . . . , vk = = .
i=⌈(L+V)/2⌉
i 2n 2n i=⌈(L+V)/2⌉ i
This implies, that we can compute P+m in polynomial time! Indeed, we are adding V ≤ n numbers, each one of
them is a binomial coefficient that has polynomial size representation in n, and can be computed in polynomial
time (why?). One can define in similar fashion P−m , and let Pm = P+m + P−m . Clearly,
h Pm can ibe computed in
− −
polynomial time, by applying a similar argument to the computation of Pm = P Cm v1 , . . . , vk .
For a node hv ∈ T , let
i vv denote the portion of v that was fixed when traversing from the root of T to v. Let
P
P(v) = nm=1 P Cm vv . By the above discussion P(v) can be computed in polynomial time. Furthermore, we
know, by the previous result on discrepancy that P(r) < 1 (that was the bound used to show that there exist a
good assignment).
As before, for any v ∈ T , we have P(v) ≥ min(P(vl ), P(vr )). Thus, we p have a polynomial deterministic
algorithm for computing a set balancing with discrepancy smaller than 4 n log n. Indeed, set v = root(T ).
And start traversing down the tree. At each stage, compute P(vl ) and P(vr ) (in polynomial time), and set v to
the child with lower
p value of P(·). Clearly, after n steps, we reach a leaf, that corresponds to a vector v′ such
that ∥Av′ ∥∞ ≤ 4 n log n.
Note, that this method might fail to find the best assignment.
References
[Mat99] J. Matoušek. Geometric Discrepancy. Vol. 18. Algorithms and Combinatorics. Springer, 1999.
123
124
Chapter 17
I don’t know why it should be, I am sure; but the sight of another man asleep in bed when I am up, maddens me. It seems
to me so shocking to see the precious hours of a man’s life - the priceless moments that will never come back to him again -
being wasted in mere brutish sleep.
598 - Class notes for Randomized Algorithms Jerome K. Jerome, Three men in a boat
Sariel Har-Peled
April 2, 2024
Lemma 17.1.3. Let G = (V, E) be a graph with n vertices, and let dG be the average degree in the graph. We
X 1 n
have that ≥ .
v∈V
1 + d(v) 1 + d G
Proof: Let the ith vertex in G be vi . Set xi = 1 + d(vi ), for all i. By Lemma 17.1.2, we have
X n
1 Xn
1 n n n
= ≥ P = P = . ■
i=1
1 + d(vi ) i=1
xi ( i x i )/n i 1 + d(vi ) /n 1 + d G
125
17.1.2. Statement and proof
Theorem 17.1.4 (Turán’s theorem). Let G = (V, E) be a graph with n vertices. The graph G has an indepen-
n
dent set of size at least , where dG is the average vertex degree in G.
1 + dG
Proof: Let π = (π1 , . . . , πn ) be a random permutation of the vertices of G. Pick the vertex πi into the indepen-
dent set if none of its neighbors appear before it in π. Clearly, v appears in the independent set if and only if
it appears in the permutation before all its d(v) neighbors. The probability for this is 1/(1 + d(v)). Thus, the
expected size of the independent set is (exactly)
X 1
τ= , (17.1)
v∈V
1 + d(v)
by linearity of expectations. Thus, by the probabilistic method, there exists an independent set in G of size at
least τ. The claim now readily follows from Lemma 17.1.3. ■
since d(u) ≥ d(v), for all uv ∈ E. Now, consider the graph H resulting from removing v and its neighbors from
G. Clearly, γ(H) is larger (or equal) to the total charge of the vertices of V(H) in G, as their degree had either
decreased (or remained the same). As such, by induction, we have an independent set in H of size at least γ(H).
Together with v this forms an independent set in G of size at least γ(H) + 1 ≥ γ(G). Implying that there exists
an independent set in G of size
X 1
τ= , (17.2)
v∈V
1 + d(v)
Now, set xv = 1 + d(v), and observe that
2
X X 1 X √
xv √ = n2 ,
1
(n + 2|E|)τ = xv ≥
v∈V
x
v∈V v v∈V
xv
n2 n n
using Cauchy-Schwartz inequality. Namely, τ ≥ = = . ■
n + 2|E| 1 + 2|E|/n 1 + dG
Lemma 17.1.5 (Cauchy-Schwartz inequality). For positive numbers α1 , . . . , αn , β1 , . . . , βn , we have
X sX sX
αi βi ≤ α2i β2i .
i i i
126
17.1.4. An algorithm for the weighted case
In the weighted case, we associate weight w(v) with each vertex of G, and we are interested in the maximum
weight independent set in G. Deploying the algorithm described in the first proof of Theorem 18.1.3, implies
the following.
X w(v)
Lemma 17.1.6. The graph G = (V, E) has an independent set of size ≥ .
v∈V
1 + d(v)
Proof: By linearity of expectations, we have that the expected weight of the independent set computed is equal
to X X w(v)
w(v) · P v in the independent set = , ■
v∈V v∈V
1 + d(v)
References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
127
128
Chapter 18
The algorithm. Assume the algorithm had computed a partial assignment for v1 , . . . , vk , such that αk =
E f (v1 , . . . , vk ) ≥ E f . The algorithm then would compute the two values
αk,0 = E f (v1 , . . . , vk , 0) and αk,1 = E f (v1 , . . . , vk , 1).
Observe that
αk,0 + αk,1
αk = E f (v1 , . . . , vk ) = P[Xk+1 = 0]E f (v1 , . . . , vk , 0) + P[Xk+1 = 1]E f (v1 , . . . , vk , 1) = .
2
As such, there is an i, such that αk,i ≥ αk . The algorithm sets vk+1 = i, and continues to the next iteration.
Correctness. This is hopefully clear. Initially, α0 = E f . In each iteration, the algorithm makes a choice, such
that αk ≥ αk−1 . Thus,
αn = E f (v1 , . . . , vn ) = f (v1 , . . . , vn ) ≥ αn−1 ≥ · · · ≥ α0 = E f.
129
Running time. The algorithm performs 2n invocations of evalE f .
Result.
Theorem 18.1.1. Given a function f (X1 , . . . , Xn ) over n random binary variables, such that one can compute
determinedly E f (v1 , . . . vk ) = E f (X1 , . . . , Xn ) | X1 = v1 , . . . , Xk = vk in T (n) time. Then, one can compute an
assignment v1 , . . . , vn , such that f (v1 , . . . , vn ) ≥ E f = E f (X1 , . . . , Xn ) . The running time of the algorithm is
O n + nT (n) .
18.1.1. Applications
18.1.1.1. Max kSAT
Given a boolean formula F with n variables and m clauses, where each clause has exactly k literals, let
f (X1 , . . . , Xn ) be the number of clauses the assignment X1 , . . . , Xn satisfies. Clearly, one can compute f in
O(mk) time. More generally, given a partial assignment v1 , . . . , vk , one can compute αk = E f (v1 , . . . , vk ). In-
deed, scan F and assign all the literals that depends on the variables X1 , . . . , Xk their values. A literal evaluating
to one satisfies its clause, and we count it as such. What remains are clauses with at most k literals. A literal with
i literals, have probability exactly 1 − 1/2i to be satisfied. Thus, summing these probabilities on these leftover
clauses given use the desired value. This takes O(mk) time. Using Theorem 18.1.1 we get the following.
Lemma 18.1.2. Let F be a kSAT formula with n variables and m clauses. One can compute deterministicly an
assignment that satisfies at least (1 − 1/2k )m clauses of F. This takes O(mnk) time.
Proof: Exercise. ■
References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
130
Chapter 19
Martingales
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
‘After that he always chose out a “dog command” and sent them ahead. It had the task of informing the inhabitants in the village
where we were going to stay overnight that no dog must be allowed to bark in the night otherwise it would be liquidated. I was
also on one of those commands and when we came to a village in the region of Milevsko I got mixed up and told the mayor
that every dog-owner whose dog barked in the night would be liquidated for strategic reasons. The mayor got frightened,
immediately harnessed his horses and rode to headquarters to beg mercy for the whole village. They didn’t let him in, the
sentries nearly shot him and so he returned home, but before we got to the village everybody on his advice had tied rags round
the dogs muzzles with the result that three of them went mad.’
19.1. Martingales
19.1.1. Preliminaries
Let X and Y be two random variables. Let ρ(x, y) = P (X = x) ∩ (Y = y) . Observe that
ρ(x, y) ρ(x, y)
PX=x|Y=y = =P
PY=y z ρ(z, y)
h i X h i P x xρ(x, y) P x xρ(x, y)
and E X Y = y = xP X = x Y = y = P = .
z ρ(z, y) PY=y
x h i
The conditional expectation of X given Y, is the random variable E X Y is the random variable f (y) =
h i
E X Y=y .
As a reminder, for any two random variables X and Y, we have
(I) Lemma 11.1.2: E E[X | Y] = E X .
(II) Lemma 11.1.3: E Y · E[X | Y] = E XY .
19.1.2. Martingales
Intuitively, martingales are a sequence of random variables describing a process, where the only thing that
matters at the beginning of the ith step is where the process was in the end of the (i − 1)th step. That is, it does
not matter how the process arrived to a certain state, only that it is currently at this state.
131
Definition 19.1.1. A sequence of random variables X0 , X1 , . . . , is said to be a martingale sequence if for all
i > 0, we have E[Xi | X0 , . . . , Xi−1 ] = Xi−1 .
In particular, note that for a martingale, we have E[Xi | X0 , . . . , Xi−1 ] = E[Xi | Xi−1 ] = Xi−1 .
Lemma 19.1.2. Let X0 , X1 , . . . , be a martingale sequence. Then, for all i ≥ 0, we have E Xi = E X0 .
Proof: By (I), and the martingale property, we have
E[Xi ] = E E[Xi | Xi−1 ] = E[Xi−1 ] = E[Xi−2 ] = · · · = E[X0 ] . ■
132
Example 19.1.7 (The sheep of Mabinogion). The following is taken from medieval Welsh manuscript based
on Celtic mythology:
“And he came towards a valley, through which ran a river; and the borders of the valley were
wooded, and on each side of the river were level meadows. And on one side of the river he saw
a flock of white sheep, and on the other a flock of black sheep. And whenever one of the white
sheep bleated, one of the black sheep would cross over and become white; and when one of the
black sheep bleated, one of the white sheep would cross over and become black.” – Peredur the
son of Evrawk, from the Mabinogion.
More concretely, we start at time 0 with w0 white sheep, and b0 black sheep. At every iteration, a random
sheep is picked, it bleats, and a sheep of the other color turns to this color. the game stops as soon as all the
sheep have the same color. No sheep dies or get born during the game. Let Xi be the expected number of black
sheep in the end of the game, after the ith iteration. For reasons that we would see later on, this sequence is a
martingale.
The original question is somewhat more interesting – if we are allowed to take a way sheep in the end of
each iteration, what is the optimal strategy to maximize Xi ?
|Xi+1 − Xi | ≤ 1, for i = 0, . . . , m − 1.
h √ i
For any λ > 0, we have P Xm > λ m < exp −λ2 /2 .
√
Proof: Let α = λ/ m. Let Yi = Xi −h Xi−1 , so that |Yi | ≤i 1 and E[Yi | X0 , . . . , Xi−1 ] = 0.
We are interested in bounding E eαYi X0 , . . . , Xi−1 . Note that, for −1 ≤ x ≤ 1, we have
eα + e−α eα − e−α
f (x) = eαx ≤ h(x) = + x,
2 2
as f (x) = eαx is a convex function, h(−1) = e−α = f (−1), h(1) = eα = f (+1), and h(x) is a linear function.
Thus,
h i h i h i
αY
E e i X0 , . . . , Xi−1 ≤ E h(Yi ) X0 , . . . , Xi−1 = h E Yi X0 , . . . , Xi−1
eα + e−α
=h0 =
2
(1 + α + 2! + α3! + · · · ) + (1 − α + α2! − α3! + · · · )
α2 3 2 3
=
2
α2 α4 α6
=1+ + + + ···
2 4! 6!
! !2 !3
1 α2 1 α2 1 α2
≤1+ + + + · · · = exp α2 /2 ,
1! 2 2! 2 3! 2
as (2i)! ≥ 2i i!.
133
We have that
h i hY
m i h i Y
m−1
τ = E eαXm = E eαYi = E g(X0 , . . . , Xm−1 )eαYm , where g(X0 , . . . , Xm−1 ) = eαYi .
i=1 i=1
Example 19.1.10. Let χ(H) be the chromatic number of a graph H. What is chromatic number of a random
graph? How does this random variable behaves? h i
Consider the vertex exposure martingale, and let Xi = E χ(G) Gi . Again, without proving it, we claim that
h √ i
X0 , . . . , Xn = X is a martingale, and as such, we have: P |Xn − X0 | > λ n ≤ e−λ /2 . However, X0 = E χ(G) ,
2
h i
and Xn = E χ(G) Gn = χ(G). Thus,
h √ i −λ2 /2
P χ(G) − E χ(G) > λ n ≤ e .
Namely, the chromatic number of a random graph is highly concentrated! And we do not even (need to) know
what is the expectation of this variable!
References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
134
Chapter 20
Martingales II
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“The Electric Monk was a labor-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for
you, thus saving you the bother of washing them yourself, video recorders watched tedious television for you, thus saving you
the bother of looking at it yourself; Electric Monks believed things for you, thus saving you what was becoming an increasingly
onerous task, that of believing all the things the world expected you to believe.”
Definition 20.1.2. Given a σ-field (Ω, F ), a probability measure P : F → R+ is a function that satisfies the
following conditions.
(A) ∀A ∈ F , 0 ≤ P[A] ≤ 1.
(B) P Ω = 1.
P
(C) For mutually disjoint events C1 , C2 , . . . , we have P ∪iCi = i P Ci .
Definition 20.1.3. A probability space (Ω, F , P) consists of a σ-field (Ω, F ) with a probability measure P
defined on it.
Definition 20.1.4. Given a σ-field (Ω, F ) with F = 2Ω , a filter (also filtration) is a nested sequence F0 ⊆ F1 ⊆
· · · ⊆ Fn of subsets of 2Ω , such that:
(A) F0 = {∅, Ω}.
(B) Fn = 2Ω .
(C) For 0 ≤ i ≤ n, (Ω, Fi ) is a σ-field.
Definition 20.1.5. An elementary event or atomic event is a subset of a sample space that contains only one
element of Ω.
135
Intuitively, when we consider a probability space, we usually consider a random variable X. The value of
X is a function of the elementary event that happens in the probability space. Formally, a random variable is a
mapping X : Ω → R. Thus, each Fi defines a partition of Ω into atomic events. This partition is getting more
and more refined as we progress down the filter.
Example 20.1.6. Consider an algorithm Alg that uses n random bits. As such, the underlying sample space is
Ω = b1 b2 . . . bn b1 , . . . , bn ∈ {0, 1} . That is, the set of all binary strings of length n. Next, let Fi be the σ-field
generated by the partition of Ω into the atomic events Bw , where w ∈ {0, 1}i ; here w is the string encoding the
first i random bits used by the algorithm. Specifically,
n o
Bw = wx ∈ Ω x ∈ {0, 1}n−i ,
n o
and the set of atomic events in Fi is Ai = Bw w ∈ {0, 1}i . The set Fi is the closure of this set of atomic events
under complement and union. In particular, we conclude that F0 , F1 , . . . , Fn form a filter.
As a concrete example, for i = 3, the set A3 contains 23 = 8 sets, and the set F3 would contain all sets
formed by finite unions of these sets (including the empty union). As such, the set F3 would have 22 = 256
3
sets.
Definition 20.1.7. A random variable X is said to be Fi -measurable if for each x ∈ R, the event X ≤ x is in Fi ;
that is, the set ω ∈ Ω X(ω) ≤ x is in Fi .
Example 20.1.8. Let F0 , . . . , Fn be the filter defined in Example 20.1.6. Let X be the parity of the n bits.
Clearly, X = 1 is a valid event only in Fn (why?). Namely, it is only measurable in Fn , but not in Fi , for i < n.
As such, a random variable X is Fi -measurable, only if it is a constant on the elementary events of Fi . This
gives us a new interpretation of what a filter is – its a sequence of refinements of the underlying probability
space, that is achieved by splitting the atomic events of Fi into smaller atomic events in Fi+1 . Putting it
explicitly, an atomic event E of Fi , is a subset of 2Σ . As we move to Fi+1 the event E might now be split
into several atomic (and disjoint events) E1 , . . . , Ek . Now, naturally, the atomic event that really happens is an
atomic event of Fn . As we progress down the filter, we “zoom” into this event.
Definition 20.1.9 (Conditional expectation in a filter). Let (Ω, F ) be any σ-field, and Y any random variable
that takes on distinct values on the elementary events in F . Then E[X | F ] = E[X | Y].
20.2. Martingales
Definition 20.2.1. A sequence of random variables Y1 , Y2 , . . . , is a martingale difference sequence if for all
h i
i ≥ 0, we have E Yi Y1 , . . . , Yi−1 = 0.
136
20.2.1. Martingales – an alternative definition
Definition 20.2.3. Let (Ω, F , P) be a probability space with a filter F0 , F1 , . . . . Suppose that X0 , X1 , . . ., are
random variables such that, for all i ≥ 0, Xi is Fi -measurable. The sequence X0 , . . . , Xn is a martingale provided
that, for all i ≥ 0, we have E Xi+1 | Fi = Xi .
Lemma 20.2.4. Let (Ω, F ) and (Ω, G) be two σ-fields such that F ⊆ G. Then, for any random variable X, we
have E E[X | G] F = E[X | F ] .
h h i i h h i i
Proof: E E X G F = E E X G = g F = f
# X x xP[X=x∩G=g] · PG = g ∩ F = f
P
"P
xP X = x ∩G = g P[G=g]
=E x F= f =
PG=g g∈G
PF= f
P P
x x P[ X=x∩G=g] x x P[ X=x∩G=g]
X P[ G=g ] · P G = g ∩ F = f X P[ G=g] · P G = g
= =
g∈G,g⊆ f
P F = f g∈G,g⊆ f
P F = f
P P P
X x P X = x ∩ G = g x x g∈G,g⊆ f P X = x ∩ G = g
= x
=
g∈G,g⊆ f
P F = f P F = f
P h i
x xP X = x ∩ F = f
= =E X F . ■
PF= f
Theorem 20.2.5. Let (Ω, F , P) be a probability space, and let F0 , . . .h, Fn bei a filter with respect to it. Let X
be any random variable over this probability space and define Xi = E X Fi then, the sequence X0 , . . . , Xn is
a martingale.
h i
Proof: We need to show that E Xi+1 Fi = Xi . Namely,
h h i i h i
E[Xi+1 | Fi ] = E E X Fi+1 Fi = E X Fi = Xi ,
Specifically, a function is c-Lipschitz, if the inequality holds with a constant c (instead of 1).
Definition 20.2.7. Let X1 , . . . , Xn be a sequence of independent random variables, and a function f = f (X1 , . . . , Xn )
defined over them, such that f satisfies the Lipschitz condition. The Doob martingale sequence Y0 , . . . , Ym is
defined by Y0 = E f (X1 , . . . , Xn ) and
Yi = E f (X1 , . . . , Xn ) X1 , . . . , Xi , for i = 1, . . . , n.
137
20.3. Occupancy Revisited
We have m balls thrown independently and uniformly into n bins. Let Z denote the number of bins that remains
empty in the end of the process. Let Xi be the bin chosen in the ith trial, and let Z = F(X1 , . . . , Xm ), where F
returns the number
h of empty bins given
√ i had thrown into bins X1 , . . . , Xm . , By Azuma’s inequality
that m balls
we have that P Z − E[Z] > λ m ≤ 2 exp −λ /2 . 2
The following is an extension of Azuma’s inequality shown in class. We do not provide a proof but it is
similar to what we saw.
Theorem 20.3.1 (Azuma’s Inequality - Stronger Form). Let X0 , X1 , . . . , be a martingale sequence such that
for each k, |Xk − Xk−1 | ≤ ck , where ck may depend on k. Then, for all t ≥ 0, and any λ > 0, we have
λ2
P |Xt − X0 | ≥ λ ≤ 2 exp − Pt .
2 k=1 c2k
Theorem 20.3.2. Let r = m/n,
and
Zend be the number of empty bins when m balls are thrown randomly into n
1 m
bins. Then µ = E Zend = n 1 − n ≈ n exp(−r), and for any λ > 0, we have
h i !
λ2 (n − 1/2)
P Zend − µ ≥ λ ≤ 2 exp − 2 .
n − µ2
Proof: Let z(Y, t) be the expected number of empty bins in the end, if there are Y empty bins in time t. The
probability of an empty bin to remain empty is (1 − 1/n)m−t , and as such
1 m−t
z(Y, t) = Y 1 − .
n
In particular, µ = z(n, 0) = n(1 − 1/n)m .
Let Ft be the σ-field generated
h iby the bins chosen in the first t steps. Let Zend be the number of empty bins
at time m, and let Zt = E Zend Ft . Namely, Zt is the expected number of empty bins after we know where
the first t balls had been placed. The random variables Z0 , Z1 , . . . , Zm form a martingale. Let Yt be the number
of empty bins after t balls where thrown. We have Zt−1 = z(Yt−1 , t − 1). Consider the ball thrown in the t-step.
Clearly:
(A) With probability 1 − Yt−1 /n the ball falls into a non-empty bin. Then Yt = Yt−1 , and Zt = z(Yt−1 , t). Thus,
!m−t !m−t+1
1 1
∆t = Zt − Zt−1 = z(Yt−1 , t) − z(Yt−1 , t − 1) = Yt−1 1 − − 1−
n n
!m−t !m−t
Yt−1 1 1
= 1− ≤ 1− .
n n n
(B) Otherwise, with probability Yt−1 /n the ball falls into an empty bin, and Yt = Yt−1 − 1. Namely, Zt =
z(Yt − 1, t). And we have that
!m−t !m−t+1
1 1
∆t = Zt − Zt−1 = z(Yt−1 − 1, t) − z(Yt−1 , t − 1) = (Yt−1 − 1) 1 − − Yt−1 1 −
n n
!m−t !! !m−t !m−t
1 1 1 Yt−1 1 Yt−1
= 1− Yt−1 − 1 − Yt−1 1 − = 1− −1 + =− 1− 1−
n n n n n n
!m−t
1
≥− 1− .
n
138
m−t
Thus, Z0 , . . . , Zm is a martingale sequence, where |Zt − Zt−1 | ≤ |∆t | ≤ ct , where ct = 1 − 1n . We have
!2(m−t) !2t
X
m X
m
1 X
m−1
1 1 − (1 − 1/n)
2m n2 1 − (1 − 1/n)2m n2 − µ2
c2t = 1− = 1− = = = .
t=1 t=1
n t=0
n 1 − (1 − 1/n)2 2n − 1 2n − 1
139
140
Chapter 21
Consider the problem of throwing n balls into n bins. It is well known that the maximum load is Θ(log n/ log log n)
with high probability. Here we show that if one is allowed to pick d bins for each ball, and throw it into the
bin that contains less balls, then the maximum load of a bin decreases to Θ(log log n/ log d). A variant of this
approach leads to maximum load Θ((log log n)/d).
As a concrete example, for n = 109 , this leads to maximum load 13 in the regular case, compared to
maximum load of 4, with only two-choices – see Figure 21.1.
21.1.2. Analysis
Lemma 21.1.1. Let m = αn balls be thrown into n bins. Let Yend the number of bins that are not empty in the
end of the process (here, we allow more than one ball into a bin).
(A) For α ∈ (0, 1], we have µ = E[Yend ] ≥ (m − α) exp(−α) ≥ αn − α2 n − 1.
(B) If α ≥ 1, then µ = E[Yend ] ≥ n 1 − exp(−α) .
p
(C) We have P |Yend − µ| > 3cm log n ≤ 1/nc .
i−1
Proof: (A) The probability of the ith ball to be the first ball in its bin, is 1 − 1n . To see this we use backward
analysis – throw in the ith ball, and now throw in the earlier i − 1 balls. The probability that none of the earlier
141
balls hit the same bin as the ith ball is as stated. Now, the expected number of non-empty bins is the number of
balls that are first in their bins, which in turn is
X
m−1 !i
1
µ= 1− ≥ m(1 − 1/n)m−1 ≥ (m − α)(1 − 1/n)m−α
i=0
n
= (m − α)(1 − 1/n)α(n−1) ≥ (m − α) exp(−α)
m−α
≥ (m − α)(1 − α) = αn − α2 n − α + α2 ≥ ,
e
using m = αn ≤ n, and (1 − 1/n)n−1 ≥ 1/e, see Lemma 6.1.1.
(B) We repeat the above analysis from the point of view of the bin. The probability of a bin to be empty is
(1 − 1/n)αn . As such, we have that
µ = E[Yend ] = n(1 − (1 − 1/n)αn ) ≥ n 1 − exp(−α) ,
Remark. The reader might be confused by cases (A) and (B) of Lemma 21.1.1 for α = 1, as the two lower
bounds are different. Observe that (A) is loose if α is relatively large and close to 1.
Back to the problem. Let α1 = 1 and n1 = α1 n. For i > 1, inductively, assume that numbers of balls being
thrown in the ith round is p
ni = αi n + O( αi−1 n log n).
By Lemma 21.1.1, with high probability, the number of balls stored in the ith row is
p
si = ni exp(−αi ) ± O( ni log n).
As such, as long as the first term is significantly large than the second therm, we have that si = nαi exp(−αi )(1 ±
o(1)). For the time being, let us ignore the o(1) term. We have that
since exp(−αi ) ≥ 1 − αi .
Observation 21.1.2. Consider the sequence α1 = 1, c = α2 = 1 − 1/e, and αi+1 = α2i , for i > 2. We have that
αi+1 = c2 . In particular, for
i−2
lg n 1
∆ = 3 + lg log1/c n = 3 + lg = 3 + lg lg n − lg lg ≤ 3 + lg lg n.
lg(1/c) 1 − 1/e
∆−2
we have that α∆ = c2 < 1/n.
142
The above observation almost implies that we need ∆ rows. The problem is that the above calculations
(i.e., the high probability guarantee in Lemma 21.1.1) breaks down when ni = O(log n) – that is, when αi =
O((log n)/n). However, if one throws in O(log n) balls into n bins, the probability of a single collision is at most
O((log n)2 /n). In particular, this implies that after roughly additional c rows, the probability of any ball left is
≤ 1/nc .
The above argumentation, done more carefully, implies the following – we omit the details because (essen-
tially) the same analysis for a more involved case is done next (the lower bound stated follows also from the
same argumentation).
Theorem 21.1.3. Consider the process of throwing n balls into n bins in several rounds. Here, a ball that can
not be placed in a round, because their chosen bin is already occupied, are promoted to the next round. The
next round throws all the rejected balls from the previous round into a new row of n empty bins. This process,
with high probability, ends after M = lg lg n + Θ(1) rounds (i.e., after M rounds, all balls are placed in bins).
√
since 2 α ≤ 1. ■
Lemma 21.1.5. Let m = αn balls be thrown into n bins, with d rows, where α > 0. Here every bin can contain
only a single ball, and if inserting the ball into ith row failed, then we throw it in the next row, and so on, till it
finds an empty bin, or it is rejected because it failed on the dth row. Let Y(d, n, m) be the number of balls that
did not get stored in this matrix of bins. We have
(A) For a constant α < 1/4, we have Y(d, n, αn) ≤ nα(2 +1)/2 , with high probability.
d
(B) We E Y(d, n, dn) = O(n log d).
(C) For a constant c > 1, we have E Y(d, n, cn log d) = n/e−d/2 , assuming d is sufficiently large.
Proof: (A) By Lemma 21.1.1, in expectation, at least s1 = nα exp(−α) balls are placed in the first row. As such,
in expectation n2 = nα(1 − exp(−α)) ≤ nα2 balls get thrown into the second row. Using Chenroff inequality, we
get that n2 ≤ 2α2 n, with high probability. Setting γ1 = α, and γi = 2γi−1
2
, we get the claim via Lemma 21.1.4.
(B) As long as we throw Ω(n log d) balls into a row, we expect by Lemma 21.1.1 that at least n(1 − 1/dO(1) )
balls to get stored in this row. As such, let D = O(log d), and observe that the first d − D rows in expectation
contains n(d − D)(1 − 1/dO(1) ) balls. This implies that only O(Dn) are not stored in these first d − D rows, which
implies the claim.
(C) Break the d rows into two groups. The first group of size
D = (c log d − 1)/(1 − 1/e) + 1 = O(log d),
and the second group is the remaining rows. As long as the number of balls arriving to a row is larger than n,
we expect at least n(1 − 1/e) of them to be stored in this row. As such, after the first D rows, we expect the
number of remaining balls to be ≤ n. Indeed, if we have i such rows, then the expected number of balls moving
on to the (i + 1)th row is at most
ni+1 = cn log d − in(1 − 1/e).
143
Solving for ni+1 ≤ n, we have cn log d − in(1 − 1/e) ≤ n =⇒ i(1 − 1/e) ≥ c log d − 1 =⇒ i ≥ (c log d −
1)/(1 − 1/e) ≥ D − 1. As such, nD ≤ n, for i ≥ D.
The same argumentation implies that the number of balls arriving to the D + i row, in expectation, is at most
n/e . In particular, we get that the number of balls failed to be placed is at most n/ed−D ≤ n/ed/2 .
i
■
Some notations:
(A) βi : An upper bound on the number of bins that have load at least i by the end of the process.
(B) h(i): The height of the ith ball.
(C) ⊔≥i (t): Number of bins with load at least i at time t.
(D) o≥i (t): Number of balls with height at least i at time t.
Let |≥i = ⊔≥i (n) be the number of bins, in the end of the process, that have load ≥ i.
Observation 21.2.3. Since every bin counted in |≥i contains at least i balls, and there are n balls, it follows
that |≥i ≤ n/i. In particular, we have |≥4 ≤ n/4.
for i ≥ 4. Let I be the last iteration, such that βI ≥ 16c ln n, where c > 1 is an arbitrary constant. Then, with
probability ≥ 1 − 1/nc , we have that
(A) |≥i ≤ βi , for i = 1, . . . , I.
(B) |≥I+1 ≤ c′ log n, for some constant c′ . h i
(C) For j > 0, and any constant ε > 0, we have P |≥I+1+ j > 0 ≤ O(1/n(d−1−ε) j ).
(D) With probability ≥ 1 − 1/nc , the maximum load of a bin is I + O(c).
Proof: (A) The claim for i = 1, 2, 3, 4 follows readily from Observation 21.2.3.
Let Bi be the bad event that |≥i > βi , for i = 1, . . . , n. The following analysis is conditioned on none of
these bad events happening. Let Gk = ∩ki=1 Bi be the good event. Let Yt be an indicator variable that is one
144
⇐⇒ h(t) ≥ i + 1 conditioned on Gi−1 (for clarity, we omit mentioning this conditioning explicitly). We have
that h i
τ j = P Y j = 1 ≤ pi for pi = (βi /n)d ,
as all d probes must hit bins of height at least i, and there are at most βi such bins. This readily implies
that E o≥i+1 (n) ≤ pi n. The variables Y1 , . . . , Yn are not independent, but consider a variable Y ′j that is 1 if
Y = 1, or if Y j = 0, then Y ′j is 1 with probability pi − τ j . Clearly, the variables Y1′ , . . . , Yn′ are independent, and
Pj ′ P
i Yj ≥ i Yi . For i < I, setting
βi+1 = 2npi = 2n(βi /n)d ,
we have, by Chernoff’s inequality, that
hX i
αi+1 = P[Bi+1 ] = P o≥i+1 > βi+1 = P o≥i+1 > 2npi ≤ P
(n) (n) Yt′ > (1 + 1)npi
i
≤ exp(−npi /4) = exp(−βi+1 /8) < 1/n2c .
(B) We have βI+1 ≤ 16c log n. Setting ∆ = 2e · 16c log n, and conditioning on the good event G1 , consider
the sequence Y1′ , . . . Yn′ as above, where the Yi is the indicator that the ith ball has height ≥ I + 1. Arguing as
P
above, for Y ′ = i Yi′ , we have E[Y] ≤ βi+1 . As such, we have
" #
′ ∆ ′ −∆ 1
P[|≥I+1 > ∆] ≤ P o≥I+1 (n) > ∆ ≤ P Y > ′ E Y ≤ 2 ≤ c,
E[Y ] n
Y
I+1 h i Y
ℓ
P[GI+1 ] = P Bℓ+1 ∩k=1 B1 = (1 − αi ) ≥ 1 − 1/nc−1 ,
ℓ=4 i
since I ≤ n.
(C) Observe that ⊔≥i+1 (n) ≤ ⊔≥i (n). As such, for all j > 0, we have that ⊔≥I+1+ j (n) ≤ o≥I+1 (n) ≤ ∆ =
2e · 16c log n, by (B). As such, we have
h i
E o≥I+1+ j (n) ≤ n(∆/n) = O(log n/n ) = O(1/n ) ≪ 1,
d d d−1 d−1−ε
h i
for ε > 0 an arbitrary constant, and n sufficient large. Using Markov’s inequality, we get that q = P o≥I+1+ j (n) ≥ 1 =
O(1/nd−1−ε ). The probability that the first j such rounds fail (i.e., that o≥I+1+ j (n) > 0) is at most q j , as claimed.
(D) This follows immediately by picking ε = 1/2, and then using (C) with j = O(c). ■
i−4 +1
Lemma 21.2.5. For i = 4, . . . , I, we have that βi ≤ n/2d .
Proof: The proof is by induction. For i = 4, we have β4 ≤ n/4, as claimed. Otherwise, we have
d
βi+1 = 2n(βi /n)d ≤ 2n 1/2d +1 = n/2d +d−1 ≤ n/2d +1 .
i−4 i+1−4 i+1−4
■
Theorem 21.2.6. When throwing n balls into n bins, with d choices, with probability ≥ 1 − 1/nO(1) , we have
that the maximum load of a bin is O(1) + lglglgdn
145
Proof: By Lemma 21.2.4, with the desired probability the βi s bound the load in the bins for i ≤ I. By
Lemma 21.2.5, it follows that for I = O(1) + lglglgdn , we have that βI ≤ o(log n). Thus giving us the desired
bound. ■
It is not hard to verify that our upper bounds (i.e., βi ) are not too badly off, and as such the maximum load
in the worst case is (up to additive constant) the same. We state the result without proof.
Theorem 21.2.7. When throwing n balls into n bins, with d choices (where the ball is placed with the bin with
the least load), with probability ≥ 1 − o(1/n), we have that the maximum load of a bin is at least lglglgdn − O(1).
21.2.2. Applications
As a direct application, we can use this approach for open hashing, where we use two hash functions, and place
an element in the bucket of the hash table with fewer elements. By the above, this improves the worst case
search time from O(log n/ log log n) to O(log log n). This comes at the cost of doubling the time it takes to do
lookup on average.
Theorem 21.2.8. When throwing n balls into n bins, using the always-go-left rule, with d groups of size n/d,
the maximum load of a bin is O(1) + log dlog n , with high probability.
Proof: (Sketch.) We consider each of the d groups to be a row in the matrix being filled. So each row has n/d
entries, and there are d rows. We can now think about the above algorithm as first trying to place the ball in the
first row (if there is an empty bin), otherwise, trying the new row and so on. If all the d locations are full, in the
row filling game we fail to place this ball. By Lemma 21.1.5 (B), we have that the number of unplaced balls is
E Y d, n/d, (n/d)d = O (n/d) log d . Thus, we have that the number of balls that get placed as the first ball in
their bin is
O(log d)
≥n 1− ,
d
and the height of these balls is one.
We now use the same argumentation for balls of height 2 – Lemma 21.1.5 (C) implies that at most dn/e−d/2
balls have height strictly larger than 2.
Lemma 21.1.5 (A) implies that now we can repeat the same analysis as the power of two choices, the
critical difference is that every one of the d groups, behaves like a separate height. Since there are O(log log n)
maximum height in the regular analysis, this implies that we get O((log log n)/d) maximum load, with high
probability. ■
146
# balls in bin Regular 2-choices 2-choices+go left
0 369,899,815 240,525,897 228,976,604
1 365,902,266 528,332,061 546,613,797
2 182,901,437 221,765,420 219,842,639
3 61,604,865 9,369,389 4,566,915
4 15,760,559 7,233 45
5 3,262,678
6 568,919
7 86,265
8 11,685
9 1,347
10 143
11 17
12 2
13 2
Figure 21.1: Simulation of the three schemes described here. This was done with n = 1, 000, 000, 000 balls
thrown into n bins. Since log log n is so small (i.e., ≈ 3 in this case, there does not seem to be any reasonable
cases where the is a significant differences between d-choices and the go-left variant. In the simulations, the
go-left variant always has a somewhat better distribution, as shown above.
147
21.5. Bibliographical notes
The multi-row balls into bins (Section 21.1) is from the work by Broder and Karlin [BK90]. The power of two
choices (Section 21.2) is from Azar et al. [ABKU99].
The restricted d choices structure, the always go-left rule, described in Section 21.2.3, is from [Vöc03].
References
[ABKU99] Y. Azar, A. Broder, A. Karlin, and E. Upfal. Balanced allocations. SIAM Journal on Computing,
29(1): 180–200, 1999.
[BK90] A. Z. Broder and A. R. Karlin. Multilevel adaptive hashing. Proc. 1th ACM-SIAM Sympos. Dis-
crete Algs. (SODA), 43–53, 1990.
[Vöc03] B. Vöcking. How asymmetry helps load balancing. J. ACM, 50(4): 568–589, 2003.
148
Chapter 22
or or or (1) or (1)
0 1 1 1 0 1 1 1
Figure 22.1: The tree T 2 , inputs in the leafs, and the output.
Defined recursively, T 2 is a tree with the root being an AND gate, and its children are OR gates. This tree has
four inputs. More generally, T 2k , is T 2 , with each leaf replaced by T 2k−2 . Let n = 22k .
So the input here is T 2k , together with 22k values stored in each leaf of the tree. Consider here the query
model – instead of read the values in the leafs, the algorithm has to explicitly perform a query to get the value
stored in the leaf. The question thus is can we minimize the number of queries the algorithm needs to perform.
It is straightforward to evaluate such a tree using a recursive algorithm in O(n) time. In particular, it
following is not too difficult to show.
Exercise 22.1.1. Show that any deterministic algorithm, in the worst case, requires Ω(n) time to evaluate a
tree T 2k .
The key observation is that AND (i.e., ∧) gate evaluation can be shortcut – that is, if x = 0 then x ∧ y = 0
independently on what value y has. Similarly, an OR (i.e., ∨) gate evaluation can be shortcut – since if x = 1,
then x ∨ y = 1 independently of what y value is.
149
22.1.1. Randomized evaluation algorithm for T 2k
The algorithm is recursive. If the current node v is a leaf, the algorithm returns the value stored at the leaf.
Otherwise, the algorithm randomly chooses (with equal probability) one of the children of v, and evaluate them
recursively. If the returned value, is sufficient to evaluate the gate at v, then the algorithm shortcut. Otherwise,
the algorithm evaluates recursively the other child, computes the value of the gate and return it.
22.1.2. Analysis
Lemma 22.1.2. The above algorithm when applied to T 2k , in expectation, reads the value of at most 3k leaves,
and this also bounds its running time.
Proof: The proof is by induction. Let start with T 2 . There are two possibilities:
(i) The tree evaluates to 0, then one of the children of the AND gate evaluates zero. But then, with probability
half the algorithm would guess the right child, and evaluate it first. Thus, in this case, the algorithm
would evaluate (in expectation) ≤ (1/2)2 + (1/2)4 = 3 leafs.
(ii) If the output of the tree is 1, then both children of the root must evaluate to 1. Each one of them is an OR
gate. Arguing as above, an OR gate evaluating to one, requires in expectation to read (1/2)1+(1/2)2 = 3/2
leafs to be evaluated by the randomized algorithm. It follows, that in this case, the algorithm would read
(in expectation) 2(3/2) = 3. (Note, that this is an upper bound – if all the four inputs are 1, this algorithm
would read only 2 leafs.)
For k > 1, consider the four grandchildren of the root c1 , c2 , c3 , c4 . By induction, in expectation, evaluating
each of c1 , . . . , c4 , takes 3k−1 leaf evaluations. Let X1 , . . . , X4 be indicator variables that are one if ci is evaluated
by the recursive algorithm. Let Yi be the expected number of leafs read when evaluating ci (i.e., E[Yi ] = 3k ).
P
By the above, we have that E i Xi = 3. Observe that Xi and Yi are independent. (Note, that the Xi are not
independent of each other.) We thus have that the expected number of leafs to be evaluated by the randomized
algorithm is
X X X
E Xi Yi = E[Xi Yi ] = E[Xi ] E[Yi ] ≤ 3 E[Yi ] = 3 · 3 = 3 .
k−1 k
■
i i i
Corollary 22.1.3. Given an AND/OR tree with n leafs, the above algorithm in expectation evaluates
3k = 2k log2 3 = 22k(log2 3)/2 = n(log2 3)/2 = n0.79248
leafs.
References
[Sni85] M. Snir. Lower bounds on probabilistic linear decision trees. Theor. Comput. Sci., 38: 69–82,
1985.
150
Chapter 23
Definition 23.1.1. An (n, d, α, c) OR-concentrator is a bipartite multigraph G(L, R, E), with the independent
sets of vertices L and R each of cardinality n, such that
(i) Every vertex in L has degree at most d.
(ii) Any subset S of vertices of L, with |S | ≤ αn has at least c |S | neighbors in R.
A good (n, d, α, c) OR-concentrator should have d as small as possible¬ , and c as large as possible.
Theorem 23.1.2. There is an integer n0 , such that for all n ≥ n0 , there is an (n, 18, 1/3, 2) OR-concentrator.
Proof: Let every vertex of L choose neighbors by sampling (with replacement) d vertices independently and
uniformly from R. We discard multiple parallel edges in the resulting graph.
Let E s be the event that a subset of s vertices of L has fewer than cs neighbors in R. Clearly,
151
as c = 2 and d = 18. Thus,
X X
P Es ≤ (0.4) s < 1.
s≥1 s≥1
It thus follows that the random graph we generated has the required properties with positive probability. ■
Proof: Let E s be the event that a subset of s vertices of L has fewer than cs neighbors in R. For a choice of
such a set S ⊆ L, and a set T of size cs in R, we have that number of ways to chose a matching such that all the
vertices of S has neighbors in T is cs · (cs − 1) · · · (cs − s + 1) – indeed, we fix an ordering of the items in S ,
and assign them their match in T one by one. As such, we have
!d
n n cs(cs − 1) · · · (cs − s + 1)
Ξ = P Es ≤ .
s cs n(n − 1) · · · (n − s + 1)
s
Using cs
n
· cs−1
n−1
··· · cs−s+1
n−s+1
≤ csn , we have
ne s ne cs cs ds
Ξ≤ .
s cs n
The quantity in the right, in the above inequality, is the same quantity bounded in the proof of Theorem 23.1.2,
and the result follows by the same argumentation. ■
23.1.2. An expander
Definition 23.1.4. An (n, d, c)-expander is a graph G = (V, E) over n vertices, n, such that
(i) Every vertex in G has degree at most d.
(ii) Any subset S of vertices of V, with |S | ≤ n/3 has at least c |S | neighbors.
Proof: Let G be a graph with the set of vertices being JnK. Construction the graph of Theorem 23.1.3, and let
G′ be this graph. For every edge vi u j in G′ create an edge i j in G. Clearly, G has the desired properties. ■
152
Next, we show that using lg2 n bits one can achieve 1/nlg n confidence, compared with the naive 1/n, and
the 1/t confidence achieved by t (dependent) executions of the algorithm using two-point sampling.
2
Theorem 23.2.1. For n large enough, there exists a bipartite graph G(V, R, E) with |V| = n, |R| = 2lg n
such
that:
2
(i) Every subset of n/2 vertices of V has at least 2lg n − n neighbors in R.
(ii) No vertex of R has more than 12 lg2 n neighbors.
2
Proof: Each vertex of V chooses d = 2lg n (4 lg2 n)/n neighbors independently in R. We show that the resulting
graph violate the required properties with probability less than half.®
The probability for a set of n/2 vertices on the left to fail to have enough neighbors, is
!dn/2 lg2 n n !
n 2lg2 n 2 e
n n dn n
τ≤ 1− 2 ≤ 2 exp −
n/2 n 2lg n n 2 2lg2 n
lg2 n n
2 e 2lg2 n (4 lg2 n)/n n2 2
2lg n e
n
≤ 2 exp − 2
≤ exp n + n ln − 2n lg 2
n ,
n 2 2lg n | {z n}
| {z }
∗ ∗
n 2lg2 n x
xe y ¯
2n
2lg
since n/2
≤ 2n and 2
2lg n −n
= n
, and y
≤ y
. Now, we have
2
2lg n e 2
ρ = n ln = n ln 2lg n + ln e − ln n ≤ (ln 2)n lg2 n ≤ 0.7n lg2 n,
n
for n ≥ 3. As such, we have τ ≤ exp n + (0.7 − 2)n lg2 n ≪ 1/4.
As for the second property, note that the expected number of neighbors of a vertex v ∈ R is 4 lg2 n. Indeed,
the probability of a vertex on R to become adjacent to a random edge is ρ = 1/|R|, and this “experiment” is
repeated independently dn times. As such, the expected degree of a vertex is µ E Y = dn/|R| = 4 lg2 n. The
Chernoff bound (Theorem 13.2.1p95 ) implies that
h i h i
α = P Y > 12 lg2 n = P Y > (1 + 2)µ < exp −µ22 /4 = exp −4 lg2 n .
2
Since there are 2lg n vertices in R, we have that the probability
that any vertex
in R has a degree that exceeds
lg2 n
12 lg n, is, by the union bound, at most |R| α ≤ 2 exp −4 lg n ≤ exp −3 lg2 n ≪ 1/4, concluding our
2 2
tedious calculations° .
Thus, with constant positive probability, the random graph has the required property, as the union of the
two bad events has probability ≪ 1/2. ■
We assume that given a vertex (of the above graph) we can compute its neighbors, without computing the
whole graph.
Everybody knows that lg n = log2 n. Everybody knows that the captain lied.
®
Here, we keep parallel edges if they happen – which is unlikely. The reader can ignore this minor technicality, on her way to
ignore this whole write-up.
¯
The reader might want to verify that one can use significantly weaker upper bounds and the result still follows – we are using
the tighter bounds here for educational reasons, and because we can.
°
Once again, our verbosity in applying the Chernoff inequality is for educational reasons – usually such calculations would be
swept under the rag. No wonder than that everybody is afraid to look under the rag.
153
So, we are given an input x. Use lg2 n bits to pick a vertex v ∈ R. We nextidentify the neighbors of v in V:
r1 , . . . , rk . We then compute Alg(x, ri ), for i = 1, . . . k. Note that k = O lg2 n . If all k calls return 0, then we
return that Alg is not in the language. Otherwise, we return that x belongs to V.
If x is in the language, then consider the subset U ⊆ V, such that running Alg on any of the strings of U
returns TRUE . We know that |U| ≥ n/2. The set U is connected to all the vertices of R except for at most
2
|R| − 2lg n − n = n of them. As such, the probability of a failure in this case, is
h i h i n n
P x ∈ L but r1 , r2 , . . . , rk < U = P v not connected to U ≤ ≤ 2 .
|R| 2lg n
We summarize the result.
Lemma 23.2.2. Given an algorithm Alg in RP that uses lg n random bits, and an access explicit access to the
graph of Theorem 23.2.1, one can decide if an input word is in the language of Alg using lg2 n bits, and the
2
probability of f failure is at most n/2lg n .
Let us compare the various results we now have about running an algorithm in RP using lg2 n bits. We have
three options:
(A) Randomly run the algorithm lg n times independently. The probability of failure is at most 1/2lg n = 1/n.
(B) Lemma 23.2.2, which as probability of failure at most 1/2lg n = 1/n.
(C) The third option is to use pairwise independent sampling (see Lemma 7.2.13p60 ). While it is not directly
comparable to the above two options, it is clearly inferior, and is thus less useful.
Unfortunately, there is no explicit construction of the expanders used here. However, there are alternative
techniques that achieve a similar result.
Corollary 23.3.2. Any randomized oblivious algorithm for permutation routing on the hypercube with N = 2n
nodes must use Ω(n) random bits in order to achieve expected running time O(n).
Theorem 23.3.3. For every n, there exists a randomized oblivious scheme for permutation routing on a hyper-
cube with n = 2n nodes that uses 3n random bits and runs in expected time at most 15n.
154
Chapter 24
Dimension Reduction
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
as its density function. We denote that X is distributed according to such distribution, using X ∼ N(0, 1). It is
depicted in Figure 24.1.
Somewhat strangely, it would be convenient to consider two such independent variables X and Y together.
Their probability space (X, Y) is the plane, and it defines a two dimensional density function
1
g(x, y) = f (x) f (y) = exp −(x2 + y2 )/2 . (24.2)
2π
155
0.4
√1 exp(−x2 /2)
2π
0.3
0.2
0.1
0
-4 -3 -2 -1 0 1 2 3 4
Figure 24.1
The key property of this function is that g(x, y) = g(x′ , y′ ) ⇐⇒ ∥(x, y)∥2 = x2 + y2 = ∥(x′ , y′ )∥2 . Namely, g(x, y)
is symmetric around the origin (i.e., all the points in the same distance from the origin have the same density).
We next use this property in verifying that f (·) it is indeed a valid density function.
R∞
Lemma 24.2.1. We have I = −∞ f (x) dx = 1, where f (x) = √12π exp −x2 /2 .
Proof: Observe that
Z ∞ 2 Z ∞ Z ∞ Z ∞ Z ∞
I =
2
f (x) dx = f (x) dx f (y) dy = f (x) f (y) dx dy
x=−∞ x=−∞ y=−∞ x=−∞ y=−∞
Z ∞ Z ∞
= g(x, y) dx dy.
x=−∞ y=−∞
Change the variables to x = r cos α, y = r sin α, and observe that the determinant of the Jacobian is
∂x
∂r
∂x
∂α cos α −r sin α
J = det ∂y ∂y = det = r cos2 α + sin2 α = r.
∂r ∂α
sin α r cos α
As such,
Z ∞ Z 2π r2 Z ∞ Z 2π r2
1 1
I =
2
exp − |J| dα dr = exp − r dα dr
2π 2 2π r=0 α=0 2
Z ∞ r=0 α=02 h 2 ir=∞
r r
= exp − r dr = − exp − = − exp(−∞) − (− exp(0)) = 1. ■
r=0 2 2 r=0
Lemma 24.2.2. For X ∼ N(0, 1), we have that E[X] = 0 and V[X] = 1.
Proof: The density function of X, see Eq. (24.2) is symmetric around 0, which implies that E[X] = 0. As for
the variance, we have
h i h i Z ∞ 1
Z ∞
V[X] = E X − (E[X]) = E X = x P[X = x] dx = √ x2 exp(−x2 /2) dx.
2 2 2 2
x=−∞ 2π x=−∞
Observing that
′
x2 exp −x2 /2 = −x exp(−x2 /2) + exp −x2 /2 ,
implies (using integration by guessing) that
Z ∞
1 h i∞ 1
V[X] = √ −x exp(−x /2) x=−∞ + √ exp(−x2 /2) dx = 0 + 1 = 1. ■
2
2π 2π −∞
156
24.2.1. The standard multi-dimensional normal distribution
The multi-dimensional normal distribution, denoted by Nd , is the distribution in Rd that assigns a point p =
1 1X d
(p1 , . . . , pd ) the density g(p) = exp − p2
i .
(2π)d/2 2 i=1
R
It is easy to verify, using the above, that Rd g(p)dp = 1. Furthermore, we have the following useful but
easy properties.¬
Proof: (A) Let f (·) denote the density function of N(0, 1), and observe that the density function of u is
f (X1 ) f (X2 ) · · · f (Xd ), = √2π exp −X1 /2 · · · √12π exp −Xd2 /2 , which readily implies the claim.
1 2
(B) Readily follows from observing that g(p) = (2π)1d/2 exp − ∥p∥2 /2 .
(C) Let p = (X1 , . . . , Xd ), where X1 , . . . , Xd ∼ N(0, 1). Let v be any unit vector in Rd , and observe that by the
symmetry of the density function, we can (rigidly) rotate the space around the origin in any way we want, and
the measure of sets does not change. In particular rotate space so that v becomes the unit vector (1, 0, . . . , 0).
We have that
P[⟨v, p⟩ ≤ α] = P[⟨(1, 0, . . . , 0), p⟩ ≤ α] = P[X1 ≤ α],
which implies that ⟨v, p⟩ ∼ X1 ∼ N(0, 1). ■
The generalized multi-dimensional distribution is a Gaussian. Fortunately, we only need the simpler notion.
we pick k vectors u1 , . . . , uk independently from the d-dimensional normal distribution Nd . Given a point
p ∈ Rd , its image is
1
h(v) = √ ⟨u1 , p⟩ , · · · , ⟨uk , p⟩ .
k
¬
The normal distribution has such useful properties that it seems that the only thing normal about it is its name.
157
In matrix notation, let
u1
1 u2
M = √ .. .
k .
uk
For every point pi ∈ P, we set ui = h(pi ) = Mpi .
24.3.2. Analysis
24.3.2.1. A single unit vector is preserved
Consider a vector v of length one in Rd . The natural question is what is the value of k needed, so that the length
of h(v) is a good approximation to v. Since ⟨ui , v⟩ ∼ N(0, 1), by Lemma 24.2.3, this question can boil down to
the following: Given k variables X1 , . . . , Xk ∼ N(0, 1), sampled independently, how concentrated is the random
variable
X
k
Y = ∥(X1 , . . . , Xk )∥ =
2
Xi2 .
i=1
h i
We have that E[Y] = k E Xi2 = k V[Xi ] = k, since Xi ∼ N(0, 1), for any i. The distribution of Y is known as the
chi-square distribution with k degrees of freedom.
Lemma 24.3.1. Let φ ∈ (0, 1), and ε ∈ (0, 1/2) be parameters, and let k ≥ 16 ε2
ln φ2 be an integer. Then,
P
for k independent random variables X1 , . . . , Xk ∼ N(0, 1), we have that Z = i Xi2 /k is strongly concentrated.
Formally, we have that P[Z ≤ 1 + ε] ≥ 1 − φ.
Proof: Arguing as in the proof of Chernoff’s inequality, using t = ε/4 < 1/2, we have
h i
E exp(tkZ) Yk
E exp(tXi2 )
P[Z ≥ 1 + ε] ≤ P exp(tkZ) ≥ exp tk(1 + ε) ≤ = .
exp tk(1 + ε) i=1
exp t(1 + ε)
X ε i h1 X ε i i 2
∞ h 1 X ε i i2
∞
1
1−ε/2
= ≤ 1+ ≤ exp .
i
2 2 i=1 2 2 i=1 2
158
As such, we have
∞ k ∞ i k
!
1 X ε i ε ε2 1 X ε kε2 φ
P[Z ≥ 1 + ε] ≤ exp
− (1 + ε) = exp− +
≤ exp − ≤ ,
2 i=1 2 4 8 2 i=3 2 16 2
P
since, for ε < 1/2, we have 21 ∞ i=3 (ε/2) ≤ (ε/2) ≤ ε /16. The last step in the above inequality follows by
i 3 2
√ we require that exp(−x) ≤ φ/2, which implies x = ln(2/φ). We further require that
√ For our purposes,
2 x/k ≤ ε and 2 x/k + 2x/k ≤ ε, which hold for k = 8ε−2 ln φ2 , for ε ≤ 1. We thus get the following result.
Corollary 24.3.3. Let φ ∈ (0, 1), and ε ∈ (0, 1/2) be parameters, and let k ≥ ε82 ln φ2 be an integer. Then, for k
P
independent random variables X1 , . . . , Xk ∼ N(0, 1), we have for Z = i Xi2 /k that that P[1 − ε ≤ Z ≤ 1 + ε] ≥
1 − φ.
Remark 24.3.4. The result of Corollary 24.3.3 is surprising. It says that if we pick a point according
√ to the
k-dimensional normal distribution, then its distance to the origin is strongly concentrated around k. Namely,
the normal distribution “converges” to a sphere, as the dimension increases. The mind boggles.
Lemma 24.3.5. Let v be a unit vector in Rd , then
1
P 1 − ε ≤ ∥Mv∥ ≤ 1 + ε ≥ 1 − 2 .
n
Proof: Observe that if for a number x, if 1 − ε ≤ x2 ≤ 1 + ε, then 1 − ε ≤ x ≤ 1 + ε. As such, the claim holds
if 1 − ε ≤ ∥Mv∥2 ≤ 1 + ε. By Corollary 24.3.3, setting φ = 1/n2 , we need
k ≥ 8ε−2 ln(2/φ) = 8ε−2 ln(2n2 ) = 24ε−2 ln n,
which holds for the value picked for k in Eq. (24.3). ■
159
We thus got the famous JL-Lemma.
Theorem 24.3.7 (The Johnson-Lindenstrauss Lemma). Given a set P of n points in Rd , and a parameter ε,
one can reduce the dimension of P to k = O(ε−2 log n) dimensions, such that all pairwise distances are 1 ± ε
preserved.
n o
Proof: Consider the region in the plane H − = (x, y) ∈ R2 αx + βy ≤ z – this is a halfspace bounded by the
line ℓ ≡ αx + βy = z. This line is orthogonal to the vector (−β, α). We have that ℓ ≡ σα x + σβ y = σz . Observe that
α β
,
σ σ
= 1, which implies that the distance of ℓ from the origin is d = z/σ.
Now, we have Z
−
P[Z ≤ z] = P αX + βY ≤ z = P H = g(x, y) dp,
p=(x,y)∈H −
see Eq. (24.2). Since, the two dimensional density function g is symmetric around the origin. any halfspace
containing the origin, whichnits boundary is in distance
o d from the origin, has the same probability. In particular,
consider the halfspace T = (x, y) ∈ R x ≤ d . We have that
2
Z ! Z z !
− 1 d
x2 1 y2 dx
P[Z ≤ z] = P H = P[T ] = P[X ≤ d] = √ exp − dx = √ exp − 2 dy,
2π −∞ 2 2π y=−∞ 2σ dy
Z z 2
!
1 y
= √ exp − 2 dy,
2πσ y=−∞ 2σ
by change of variables x = y/σ, and observing
that dx/ dy = 1/σ. By Eq. (24.4), the above integral is the
probability of a variable distributed N 0, σ2 to be smaller than z, establishing the claim. ■
160
Lemma 24.4.3. Consider two independent variables X ∼ N µ1 , σ21 and Y ∼ N µ2 , σ22 . We have Z = X + Y ∼
N µ1 + µ2 , σ21 + σ22 ,
References
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of di-
mensionality. Proc. 30th Annu. ACM Sympos. Theory Comput. (STOC), 604–613, 1998.
[JL84] W. B. Johnson and J. Lindenstrauss. Extensions of lipschitz mapping into hilbert space. Contem-
porary Mathematics, 26: 189–206, 1984.
[LM00] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection.
Ann. Statist., 28(5): 1302–1338, 2000.
161
162
Chapter 25
I don’t know why it should be, I am sure; but the sight of another man asleep in bed when I am up, maddens me. It seems
to me so shocking to see the precious hours of a man’s life - the priceless moments that will never come back to him again -
being wasted in mere brutish sleep.
163
Observe that Z n−1
Xn
1 1 n
≤ dx = ln(n − 2) − ln(r − 2) ≈ ln .
i=r
i−1 x=r−2 x r
For r = n/e, we have that P(r) ≈ nr ln nr = 1e ln e = 1/e.
If st ∈ K, then
K \ {st } ⊆ S t−1 ,
t − 1 − (k − 1) kA 1 (t − k)k!(t − 1 − k)! 1
P K = S t = P st was inserted = t−1 = = t ,
t kA (t − 1)!t
and S t−1 \ K thrown out of S t−1 k k
as desired. Indeed, there are t − 1 − (k − 1) subsets of size k of {s1 , . . . , st−1 } that contains K \ {st } – since we fix
k − 1 of the t − 1 elements. ■
164
Observation 25.3.1. For any ε ∈ (0, 1), we have that 1
1−ε
≥ 1 + ε.
Lemma 25.3.2. Let ε ∈ (0, 1/2) be a fixed parameter, and let B be a set of n numbers. Let Z be the median of
the random sample (with replacement) of B of size k. We have that
h i & '
12 2
P B⟨ 1−ε
2 n⟩
≤ Z ≤ B⟨ 1+ε
2 n⟩
≥ 1 − δ, where k ≥ 2 ln .
ε δ
Namely, with probability at least 1 − δ, the returned value Z is (ε/2)n positions away from the true median.
Proof: Let L = B⟨(1−ε)n/2⟩ , and let ei be the ith sample number, for i = 1, . . . , k. Let Xi = 1 if and only if ei ≤ L.
We have that
(1 − ε)n/2 1 − ε
P[Xi = 1] = = .
n 2
P
As such, setting Y = ki=1 Xi , we have
1−ε k 3 2
µ = E[Y] = k ≥ ≥ 2 ln .
2 4 ε δ
The above already implies that we can get a good estimate for the median. We need something somewhat
stronger – we state it without proof since it follows by similarly mucking around with Chernoff’s inequality.
Lemma 25.3.3. Let ε ∈ (0, 1/2), B an array of n elements, and let S = {e1 , . . . , ek } be a set of k samples
l picked
m
uniformly and randomly from B. Then, for some absolute constant c, and an integer k, such that k ≥ εc2 ln 1δ ,
we have that
P S ⟨k− ⟩ ≤ B⟨n/2⟩ ≤ S ⟨k+ ⟩ ≥ 1 − δ.
for k− = ⌊(1 − ε)k/2⌋, and k+ = ⌊(1 + ε)k/2⌋.
One can prove even a stronger statement:
P B⟨(1−2ε)n/2⟩ ≤ S ⟨(1−ε)k/2⟩ ≤ B⟨n/2⟩ ≤ S ⟨(1+ε)k/2⟩ ≤ B⟨(1+2ε)n/2⟩ ≥ 1 − δ
165
25.3.1. A median selection with few comparisons
The above suggests a natural algorithm for computing the median (i.e., the element of rank n/2 in B). Pick a
random sample S of k = O(n2/3 log n) elements. Next, sort S , and pick the elements L and R of ranks (1 − ε)k
and (1 + ε)k in S , respectively. Next, scan the elements, and compare them to L and R, and keep only the
elements that are between. In the end of this process, we have computed:
(A) α: The rank of the number L in the set B.
(B) T = {x ∈ B | L ≤ x ≤ H}.
Compute, by brute force (i.e., sorting) the element of rank n/2 − α in T . Return it as the desired median. If
n/2 − α is negative, then the algorithm failed, and it tries again.
Lemma 25.3.4. The above algorithm performs 2n + O(n2/3 log n) comparisons, and reports the median. This
holds with high probability.
Proof: Set ε = 1/n1/3 , and δ = 1/nO(1) , and observe that Lemma 25.3.3 implies that with probability ≥ 1 − 1/δ,
we have that the desired median is between L and H. In addition, Lemma 25.3.3 also implies that |T | ≤ 4εn ≤
4n2/3 , which readily implies the correctness of the algorithm.
As for the bound on the number of comparisons, we have, with high probability, that the number of com-
parisons is √
O |S | log |S | + |T | log |T | + 2n = O n log2 n + n2/3 log n + 2n,
since deciding if an element is between L and H requires two comparisons. ■
Lemma 25.3.5. The above algorithm can be modified to perform (3/2)n + O(n2/3 log n) comparisons, and
reports the median correctly. This holds with high probability.
Proof: The trick is to randomly compare each element either first to L or first to H with equal probability. For
elements that are either smaller than L or bigger than H, this requires (3/2)n comparisons in expectation. Thus
improving the bound from 2n to (3/2)n. ■
Lemma 25.3.6. Consider a stream B of n numbers, and assume we can make two passes over the data. Then,
one can compute exactly the median of B using:
(I) O(n2/3 ) space.
(II) 1.5n + O(n2/3 log n) comparisons.
The algorithm reports the median correctly, and it succeeds with high probability.
Proof: Implement the above algorithm, using the random sampling from Theorem 25.2.1. ■
Remark 25.3.7. Interestingly, one can do better if one is more careful. The basic idea is to do thinning – given
two sorted sequence of sizes s, consider merging the sets, and then picking all the even rank elements into a
new sequence. Clearly, the element of rank i in the output sequence, has rank 2i in the union of the two original
sequences. A sequence that is the result of i such rounds of thinning is of level i. We maintain O(log n) such
sequences as we read the stream. At any time, we have two buffers of size s, that we fill up from the stream.
Whenever the two buffers fill up completely, we perform the thinning operation on them, creating a sequence
of level 1.
If during this process we have two sequences of the same level, we merge them and perform thinning on
them. As such, we maintain O(log n) buffers sequences each of size s. Assume that our stream has size n, and
n is a power for 2. Then in the end of process, we would have only a single sequence of level h = log2 (n/s).
166
By induction, it is easy to prove that an element of rank r in this sequence, has rank between 2h (r − 1) and 2h r
in the original stream.√ √
Thus, setting s = n, we get that after a√single pass, using O( n log n) space, we have a sorted sequence,
where the rank of the elements is roughly n approximation to the true rank. We pick the two consecutive
elements (or more carefully, the predecessor, and successor), and √ filter the stream again, keeping only the
elements in between√ these two elements. It is to show that O( n) would be kept, and we can extract the
median using O( n log n) time. √
We thus got that one can compute the median in two passes using O( n log n) space. It is not hard to extend
this algorithm to α-passes, where the space required becomes O(n1/α log n).
This elegant algorithm goes back to 1980, and it is by Munro and Paterson [MP80].
167
just insert st+1 to the set, and set its counter to 1. Otherwise, |S t | = k and st+1 < S t . We then decrease all the k
counters of elements in S t by 1. If a counter of an element in S t+1 is zero, then we delete it from the set.
Correctness.
Lemma 25.5.1. The above algorithm, after the insertion of t elements, the set S t+1 would contain all the
elements in the stream that appears at least εt times.
Proof: Conceptually, imagine that the algorithm keeps counters for all the distinct elements seen in the stream.
Whenever a decrease of the counters happens – the algorithm decrease not k counters – but k + 1 counters – the
additional counter being a counter of the new element, which has value one, and goes down to zero. Clearly,
the number of distinct remaining elements in any point in time is at most k – that is, the number of counters
that have a non-zero value. Consider an element e that appears u ≥ εt times in the stream. The counter for e is
going to be increased u times, and decreased at most α time, where α(k + 1) ≤ t. We have that the counter for
u in the end of the stream must have value at least
t t t ⌈1/ε⌉ ε + ε − 1 ε
u− ≥ εt − = εt − ≥t ≥t > 0.
k+1 k+1 ⌈1/ε⌉ + 1 ⌈1/ε⌉ + 1 ⌈1/ε⌉ + 1
This implies that the counter of u is strictly larger than 0, which implies that u appears in S t+1 . ■
References
[MP80] J. I. Munro and M. Paterson. Selection and sorting with limited storage. Theo. Comp. Sci., 12:
315–323, 1980.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
168
Chapter 26
“See? Genuine-sounding indignation. I programmed that myself. It’s the first thing you need in a university environment: the
ability to take offense at any slight, real or imagined.”
598 - Class notes for Randomized Algorithms Robert Sawyer, Factoring Humanity
Sariel Har-Peled
April 2, 2024
Example: Estimating p for a coin. Assume we have a coin that is head with probability p. A natural way to
estimate p is to flip the coin once and return 1 if it is head, and zero otherwise. Let X be the result of the coin
flip, and observe that E[X] = p. But this is not very useful estimator.
169
This would imply by Chebychev’s inequality that
h p i h p i 1
P |Z − ρ| ≥ ερ = P |Z − E[Z] | ≥ 2 (ε2 /4)ρ2 ≤ P |Z − E[Z] | ≥ 2 V[Z] ≤ .
4
Guided by Eq. (26.1), we want this quantity to be smaller than ≤ (ε2 /4)ρ2 . Thus,
& ' & '
ν 4 ν 4 V[X]
≤ (ε2 /4)ρ2 ⇐= α≥ 2 · 2 = .
α ε ρ ε2 E[X]2
Lemma 26.1.1. Let D be a non-negative distribution with ρ = E[D] and ν = V[D], and let ε ∈ (0, 1) be a
l 4 V[D] m Pα
parameter. For α ≥ 2 , consider sampling variables X 1 , . . . , X α ∼ D, and let Z = i=1 Xi /α. Then Z
ε (E[D])2
is a “good” estimator for ρ. Formally, we have
h i 3
P (1 − ε)ρ ≤ Z ≤ (1 + ε)ρ ≥ .
4
1
β = O log
φ
instances of the averaging estimators: Z1 , . . . , Zβ of Lemma 26.1.1. The median estimator returns the median
value of the Zs as the desired estimate.
170
Analysis. Let Ei be the event that Zi ∈ [(1 − ε)ρ, (1 + ε)ρ]. Let Gi be an indicator variable for Ei . By
P
Lemma 26.1.1, P[Ei ] = P[Gi = 1] ≥ 3/4. The median estimator fails if βi=1 Gi < β/2. Using Chernoff
inequality, we get that this happens with probability ≤ φ. We thus get the following.
Theorem 26.1.2. Let D be a non-negative distribution with µ = E[D] and ν = V[D], and let ε, φ ∈ (0, 1)
l 4ν m
be parameters. For some absolute constant c > 0, let M ≥ 24 2 2 ln φ1 , and consider sampling variables
ε µ
X1 , . . . , X M ∼ D. One can compute, in, O(M) time, a quantity Z from the sampled variables, such that
h i
P (1 − ε)µ ≤ Z ≤ (1 + ε)µ ≥ 1 − φ.
l m
Proof: Let m = 4ν/(ε2 µ2 ) and M = 24 ln φ1 . Build M averaging estimators, each one using m samples. That
is let Zi be the average of m samples si,1 , . . . , si,m from D, for i = 1, . . . , M. Formally,
1X
m
Zi = si, j for i = 1, . . . , M.
m j=1
be the kth frequency moment of S. The quantity, F1 = m is the length of the stream S. Similarly, F0 is the
number of distinct elements (where we use the convention that 00 = 0 and any other quantity to the power 0 is
1). It is natural to define F∞ = maxi fi .
Here, we are interested in approximating up to a factor of 1 ± ε the quantity Fk , for k ≥ 1 using small space,
and reading the stream S only once.
171
how many times it see the representative value later on in the stream (the counter is initialized to 1, to account
for the chosen representative itself). In particular, if s p is the chosen representative in the end of the stream
(i.e., the algorithm might change the representative several times), then the counter value is
n o
r = j j ≥ p and s j = s p .
where m is the number of elements seen in the stream. Let V be the random variable that is the value of the
representative in the end of the sequence.
26.2.1.2. Analysis
Lemma 26.2.1. We have E[X] = Fk .
Proof: Observe that since we choose the representative uniformly at random, we have
Xfi
1 mX
fi
m
E[X | V = i] = m jk − ( j − 1)k = jk − ( j − 1)k = fik .
f
j=1 i
fi j=1 fi
P P
As such, we have E[X] = E E[X | V] = i: fi ,0 fi m k
f
m fi i
= i fi
k
= Fk . ■
Remark 26.2.2. In the above, we estimated the function g(x) = xk , over the frequency numbers f1 , . . . , fk , but
the above argumentation, on the expectation of X, would work for any function g(x) such that g(0) = 0, and
g(x) ≥ 0, for all x ≥ 0.
P 2
Lemma 26.2.3. For k > 1, we have ni=1 ik − (i − 1)k ≤ kn2k−1 .
Proof: Observe that for x ≥ 1, we have that xk − (x − 1)k ≤ kxk−1 . As such, we have
X
n X
n X
n
k 2
i − (i − 1) ≤
k
ki i − (i − 1) ≤ kn
k−1 k k k−1
ik − (i − 1)k = knk−1 nk = kn2k−1 . ■
i=1 i=1 i=1
h i
Lemma 26.2.4. We have E X 2 ≤ kmF2k−1 .
h X i
fi
1 2 k 2 m2 2k−1
EX
2
V=i = m j − ( j − 1)k ≤ k fi = m2 k fi2k−2 ,
f
j=1 i
f i
X fi
and thus E[X 2 ] = E E[X 2 | V] = · m2 k fi2k−2 = mkF2k−1 . ■
i: f ,0
m
i
172
Proof: This is immediate from Hölder inequality, but here is a self contained proof. The above is equivalent
P P 1/k P
to proving that i fi /n ≤ ni=1 fik /n . Raising both sides to the power k, we need to show that ( i fi /n)k ≤
Pn k P Pn
i=1 fi /n. Setting g(x) = x , we have g( i fi /n) ≤
k
i=1 g( fi )/n. The last inequality holds by the convexity of
the function g(x) (indeed, g′ (x) = kxk−1 and g′′ (x) = k(k − 1)xk−2 ≥ 0, for x ≥ 0). ■
P P P 2
Lemma 26.2.6. For any n numbers f1 , . . . , fn ≥ 0, we have i fi i fi
2k−1
≤ n1−1/k i fi
k
.
P
Proof: Let M = maxi fi and m = i fi . We have
X X X X (k−1)/k X X (2k−1)/k
fi2k−1 ≤ M k−1 fik ≤ M k(k−1)/k fik ≤ fik fik ≤ fik .
i i i i i i
Pn P 1/k
By Lemma 26.2.5, we have i=1 fi ≤ n(k−1)/k i fik . Multiplying the above two inequality implies the
claim. ■
h i
i z}|{
L26.2.4
h X X z}|{
L26.2.6
Thus, the amount of space this streaming algorithm is using is proportional to M, and we have
! ! !
ν 1 kn1−1/k Fk2 1 kn1−1/k 1
M = O 2 2 ln =O ln =O ln .
εµ φ ε2 Fk2 φ ε2 φ
173
26.3. Better estimation for F2
26.3.1. Pseudo-random k-wide independent sequence of signed bits
In the following, assume that we sample O(log n) bits, such that given an index i, one can compute (quickly!) a
random signed bit b(i) ∈ {−1, +1}. We require that the resulting bits b(1), b(2), . . . , b(n) are 4-wise independent.
To this end, pick a prime p, that is, say bigger than n10 . This can easily be done by sampling a number in the
range [n10 , n11 ], and checking if it is prime (which can done in polynomial time).
P
Once we have such a prime, we generate a random polynomial g(i) = 5i=0 ci xi mod p, by choosing
c0 , . . . , c5 from Z p = 0, . . . , p − 1 . We had seen that g(0), g(1), . . . , g(n) are uniformly distributed in Z p ,
and they are, say, 6-wise independent (see Theorem 7.2.10).
We define
0 g(i) = p − 1
b(i) =
+1 g(i) is odd
−1 g(i) is even.
Clearly, the sequence b(1), . . . , b(n) are 6-wise independent. There is a chance that one of these bits might
be zero, but the probability for that is at most n/p, which is so small, that we just assume it does not happen.
There are known constructions that do not have this issue at all (one of the bits is zero), but they are more
complicated.
Lemma 26.3.1. Given a parameter φ ∈ (0, 1), in polynomial time in O(log(n/φ)), one can construct a function
b(·), requiring O(log(n/φ)) bits of storage (or O(1) words), such that b(1), . . . , b(n) ∈ {−1, +1} with equal
probability, an they are 6-wise independent. Furthermore, given i, one can compute b(i) in O(1) time.
The probability of this sequence to fail having the desired properties is smaller than φ.
Proof: We repeat the above construction, but picking a prime p in the range, say, n10 /φ . . . n11 /φ. ■
X
m X
m
T= b(i) fi = b(s j ),
i=1 j=1
which can be computed on the fly using O(1) words of memory, and O(1) time per time in the stream.
The algorithm returns X = T 2 as the desired estimate.
Analysis.
P
Lemma 26.3.2. We have E[X] = i fi
2
= F2 and V[X] ≤ 2F22 .
174
hPn 2 i
Proof: We have that E[X] = E i=1 b(i) fi , and as such
hX
n X i Xm X X 2
m
E[X] = E (b(i))2 fi2 + 2 b(i)b( j) fi f j = fi2 + 2 fi f j E b(i)b( j) = fi = F 2 ,
i=1 i< j i=1 i< j i=1
h i
since E[b(i)] = 0, E b(i)2 = 1, and E b(i)b( j) = E[b(i)] E b( j) = 0 (assuming the sequence b(1), . . . , b(n)
has not failed), by the 6-wise Independence of the sequence of signed bits.
h i
We next compute E X 2 . To this end, let N = {1, . . . , n}, and Γ = N × N × N × N. We split this set into
several sets, as follows:
n o
(i) Γ0 = (i, i, i, i) ∈ N 4 : All quadruples that are all the same value.
(ii) Γ1 : Set of all quadruples (i, j, k, l) where there is at least one value that appears exactly once.
(iii) Γ2 : Set of all (i, j, k, ℓ) with only two distinct values, each appearing exactly twice.
Clearly, we have N 4 = Γ0 ∪ Γ1 ∪ Γ2 .
h i
For a tuple (i, i, i, i) ∈ Γ0 , we have E[b(i)b(i)b(i)b(i)] = E b(i)4 = 1.
For a tuple (i, j, k, ℓ) ∈ Γ1 with i being the unique value, we have that
E b(i)b( j)b(k)b(ℓ) = E[b(i)] E b( j)b(k)b(ℓ) = 0 E b( j)b(k)b(ℓ) = 0,
using that the signed bits are 4-wise independent. h i h i h i
For a tuple (i, i, j, j) ∈ Γ2 , we have E b(i)b(i)b( j)b( j) = E b(i)2 b( j)2 = E b(i)2 E b( j)2 = 1, and the
same argumentation applies to any tuple of Γ2 . Observe that for any i < j, there are 42 = 6 different tuples in
Γ2 that are made out of i and j. As such, we have
h i hX n 4 i h X i
E X =E b(i) fi = E
2
b(i)b( j)b(k)b(ℓ) fi f j fk fℓ
Xi=1 h i X
(i, j,k,ℓ)∈Γ
X h i
= E b(i)
4
fi4 + fi f j fk fℓ E b(i)b( j)b(k)b(ℓ) + 6 2 2 2 2
E b(i) b( j) fi f j
(i,i,i,i)∈Γ0 (i, j,k,ℓ)∈Γ1 i< j
X
n X
= fi4 + 6 fi2 f j2 .
i=1 i< j
As such, we have
h i X
n X X
m 2 X
V [X] = E X 2
− (E [X])2
= fi
4
+ 6 f 2 2
f
i j − f i
2
= 4 fi2 f j2 ≤ 2F22 . ■
i=1 i< j i=1 i< j
175
Theorem 26.3.3. Given a stream S = s1 , . . . , sm of numbers from {1, . . . , n}, and parameters ε, φ ∈ (0, 1), one
can compute an estimate Z for F2 (S), such that P[|Z − F2 | > εF2 ] ≤ φ. This algorithm requires O(ε−2 log φ−1 )
space (in words), and this is also the time to handle a new element in the stream.
Proof: The scheme is described above. As before, using Chebychev’s inequality, we have that
" #
εF2 p V[Yi ] V[X] /α 2F22 1
P[|Yi − F2 | > εF2 ] = P |Yi − F2 | > √ V[Yi ] ≤ 2 2 = ≤ 2 2 = ,
V[Yi ] ε F2 ε F2
2 2
αε F2 8
by Lemma 26.3.2. Let U be the number of estimators in Y1 , . . . , Yβ that are outside the acceptable range.
Arguing as in Lemma ??, we have
!
1
P[Z is bad] ≤ P U ≥ β/2 = P U ≥ (1 + 3)β/8 ≤ exp(−(β/8)3 /4) ≤ exp − ln = φ,
2
φ
References
[AMS99] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency
moments. J. Comput. Syst. Sci., 58(1): 137–147, 1999.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
176
Chapter 27
598 - Class notes for Randomized Algorithms Robert Sawyer, Factoring Humanity
Sariel Har-Peled
April 2, 2024
Proof: Considering the pdf of X1 being x, and all other Xi s being bigger. We have that this pdf is
h \u i h\u i
g(x) = P (X1 = x) ∩ (Xi > X1 ) = P (Xi > X1 ) X1 = x P[X1 = x] = (1 − x)u−1 .
i=2 i=2
Since every one of the Xi has equal probability to realize Y, we have f (x) = ug(x). ■
h i
Lemma 27.1.2. We have E[Y] = u+1 1
, E Y 2 = (u+1)(u+2)
2
, and V[Y] = (u+1)u2 (u+2) .
177
h (1 − y)u+1 i1 1
= −y(1 − y) − u
= .
u+1 y=0 u+1
Using integration by guessing again, we have
h i Z Z ! Z 1
1
1
u
EY
2
= y P Y = y dy =
2
y · 1(1 − y) dy =
2 u−1
uy2 (1 − y)u−1 dy
y=0 y=0 1 y=0
h 2y(1 − y)u+1
2(1 − y)u+2 i1
2
= −y2 (1 − y)u − − = .
u+1 (u + 1)(u + 2) y=0 (u + 1)(u + 2)
We conclude that
h i !
2 1 1 2 1 u
V[Y] = E X − (E[X]) = − = − = . ■
2 2
(u + 1)(u + 2) (u + 1)2 u + 1 u + 2 u + 1 (u + 1)2 (u + 2)
Explanation. Note, that X is not an estimator for u – instead, as E[X] = 1/(u+1), we are estimating 1/(u+1).
The key observation is that an 1 ± ε estimator for 1/(u + 1), is 1 ± O(ε) estimator for u + 1, which is in turn an
1 ± O(ε) estimator for u.
Lemma 27.1.3. Let ε, hφ ∈ (0, 1) be parameters. Given ia stream S of items from {1, . . . , n} one can return an
estimate X, such that P (1 − ε/4) u+1
1
≤ X ≤ (1 + ε/4) u+1
1
≥ 1 − φ, where u is the number of unique elements in
S. This requires O ε12 log φ1 space.
u+1 1 u+1
−1≥ −1≥ − 1,
1 − ε/4 X 1 + ε/4
which implies
(1 + ε/4)u u + ε/4 1 u+1
(1 + ε)u ≥ ≥ ≥ −1≥ − 1 ≥ (1 − ε)u.
1 − ε/4 1 − ε/4 X 1 + ε/4
Namely, 1/X − 1 is a good estimator for the number of distinct elements.
178
The algorithm revisited. Compute X as above, and output the quantity 1/X − 1.
F0 = F0 (S) = |set(S )|
The sampling algorithm. When the ith arrives si , we compute Bsi . If this bit is 1, then we insert si into the
random sample R (if it is already in R, there is no need to store a second copy, naturally).
This defines a natural random sample
R = {i | Bi = 1 and i ∈ S } ⊆ S .
Lemma 27.2.1. For the above random sample R, let X = |R|. We have that E[X] = pν and V[X] = pν − p2 ν,
where ν = F0 (S) is the number of district elements in S .
h i hX i X h i X h i X !
2 ν
EX
2
= E ( Bi ) =
2
E Bi + 2
2
E Bi B j = pν + 2 E[Bi ] E[B j ] = pν + 2p .
i∈S i∈S i, j∈S , i< j i, j∈S , i< j
2
As such, we have
h i !
ν ν(ν − 1)
V[X] = V[|R|] = E X − (E[X]) = pν + 2p − p2 ν2 = pν + 2p2 − p2 ν 2
2 2 2
2 2
= pν + p ν(ν − 1) − p ν = pν − p ν.
2 2 2 2
■
Lemma 27.2.2. Let ε ∈ (0, 1/4). Given O(1/ε2 ) space, and a parameter N. Consider the task of estimating
the size of F0 = |set(S)|, where F0 > N/4. Then, the algorithm described below outputs one of the following:
(A) F0 > 2N.
(B) Output a number ρ such that (1 − ε)F0 ≤ ρ ≤ (1 + ε)F0 .
(Note, that the two options are not disjoint.) The output of this algorithm is correct, with probability ≥ 7/8.
179
Proof: We set p = Nεc 2 , where c is a constant to be determined shortly. Let T = pN = O(1/ε2 ). We sample a
random sample R from S , by scanning the elements of S , and adding i ∈ S to R if Bi = 1, If the random sample
is larger than 8T , at any point, then the algorithm outputs that |S | > 2N.
In all other cases, the algorithm outputs |R| /p as the estimate for the size of S , together with R.
To bound the failure probability, consider first the case that N/4 < |set(S)|. In this case, we have by the
above, that " #
E[X] p 2 V[X] 1
P[|X − E[X]| > ε E[X]] ≤ P |X − E[X]| > ε √ V[X] ≤ ε ≤ ,
V[X] (E[X]) 2 8
if ε2 (VE[X]
[X])2
≤ 18 , For ν = F0 ≥ N/4, this happens if pν
ε2 p2 ν2
≤ 81 . This in turn is equivalent to 8/ε2 ≤ pν. This is in
turn happens if
c N 8
· ≥ 2,
2
Nε 4 ε
which implies that this holds for c = 32. Namely, the algorithm in this case would output a (1 ± ε)-estimate for
|S |.
If the sample get bigger than 8T , then the above readily implies that with probability at least 7/8, the size
of S is at least (1 − ε)8T/p > 2N, Namely, the output of the algorithm is correct in this case. ■
Lemma 27.2.3. Let ε ∈ (0, 1/4) and φ ∈ (0, 1). Given O(ε−2 log φ−1 ) space, and a parameter N, and the task
is to estimate F0 of S, given that F0 > N/4. Then, there is an algorithm that would output one of the following:
(A) F0 > 2N.
(B) Output a number ρ such that (1 − ε)F0 ≤ ρ ≤ (1 + ε)F0 .
(Note, that the two options are not disjoint.) The output of this algorithm is correct, with probability ≥ 1 − φ.
Proof: We run O(log φ−1 ) copies of the of Lemma 27.2.2. If half of them returns that F0 > 2N, then the
algorithm returns that F0 > 2N. Otherwise, the algorithm returns the median of the estimates returned, and
return it as the desired estimated. The correctness readily follows by a repeated application of Chernoff’s
inequality. ■
Lemma 27.2.4. Let ε ∈ (0, 1/4). Given O(ε−2 log2 n) space, one can read the stream S once, and output a
number ρ, such that (1 − ε)F0 ≤ ρ ≤ (1 + ε)F0 . The estimate is correct with high probability (i.e., ≥ 1 − 1/nO(1) ).
Proof: Let Ni = 2i , for i = 1, . . . , M = lg n . Run M copies of Lemma 27.2.3, for each value of Ni , with
φ = 1/nO(1) . Let Y1 , . . . , Y M be the outputs of these algorithms for the stream. A prefix of these outputs, are
going to be “F0 > 2Ni ”, Let j be the first Y j that is a number. Return this number as the desired estimate.
The correctness is easy – the first estimate that is a number, is a correct estimate with high probability. Since
N M ≥ n, it also follows that Y M must be a number. As such, there is a first number in the sequence, and the
algorithm would output an estimate.
More precisely, there is an index i, such that Ni /4 ≤ F0 ≤ 2F0 , and Yi is a good estimate, with high
probability. If any of the Y j , for j < i, is an estimate, then it is correct (again) with high probability. ■
180
Chapter 28
As we saw previously, to solve the (1 + ε)-ANN problem efficiently, it is sufficient to solve the approximate
near neighbor problem. Namely, given a set P of n points in Hd , a radius r > 0, and parameter ε > 0,
we want to decide for a query point q whether dH q, P ≤ r or dH q, P ≥ (1 + ε)r, where dH q, P =
minp∈P dH q, p .
Definition 28.1.2. For a set P of points, a data-structure D = D≈Near (P, r, (1 + ε)r) solves the approximate near
neighbor problem if, given a query point q, the data-structure works as follows.
• Near: If dH q, P ≤ r, then D outputs a point p ∈ P such that dH p, q ≤ (1 + ε)r.
• Far: If dH q, P ≥ (1 + ε)r, then D outputs “dH q, P ≥ r”.
• Don’t care: If r ≤ d q, P ≤ (1 + ε)r, then D can return either of the above answers.
181
28.1.2. Preliminaries
Definition 28.1.3. Consider a sequence m of k, not necessarily distinct, integers i1 , i2 , . . . , ik ∈ JdK, where JdK =
{1, . . . , d}. For a point p = (p1 , . . . , pd ) ∈ Rd , its projection by m, denoted by mp is the point pi1 , . . . , pik ∈ Rk .
Similarly, the projection of a point set P ⊆ Rd by m is the point set mP = {mp | p ∈ P}.
Given two sequences m = i1 , . . . , ik and u = j1 , . . . , jk′ , let m|u denote the concatenated sequence m|u =
i1 , . . . , ik , j1 , . . . , jk′ . Given a probability φ, a natural way to create such a projection, is to include the ith
coordinate, for i = 1, . . . , d, with probability φ. Let Dφ denote the distribution of such sequences of indices.
Definition 28.1.4. Let DTφ denote the distribution resulting from concatenating T independent sequences sam-
pled from Dφ . The length of a sampled sequence is dT .
Observe that for a point p ∈ {0, 1}d , and M ∈ DTφ , the projection M p might be higher dimensional than the
original point p, as it might contain repeated coordinates of the original point.
28.1.2.1. Algorithm
28.1.2.1.1. Input. The input is a set P of n points in the hypercube {0, 1}d , and parameters r and ε.
28.1.2.1.3. Answering a query. Given a query point q ∈ {0, 1}d , the algorithm computes qi = Mi q, for
i = 1, . . . , L. From each Di , the algorithm retrieves a list ℓi of all the points that collide with qi . The algorithm
scans the points in the lists ℓ1 , . . . , ℓL . If any of these points is in Hamming distance smaller than (1 + ε)r,
the algorithm returns it as the desired near-neighbor (and stops). Otherwise, the algorithm returns that all the
points in P are in distance at least r from q.
28.1.2.2. Analysis
Lemma 28.1.5. Let K be a set of r marked/forbidden coordinates. The probability that a sequence M =
(m1 , . . . , mT ) sampled from DTφ does not sample any of the coordinates of K is 1/nβ . This probability increases
if K contains fewer coordinates.
r
Proof: For any i, the probability that mi does not contain any of these coordinates is (1 − φ)r = e−1/r = 1/e.
Since this experiment is repeated T times, the probability is e−T = e−β ln n = n−β . ■
Lemma 28.1.6. Let p be the nearest-neighbor to q in P. If dH q, p ≤ r then, with high probability, the
data-structure returns a point that is in distance ≤ (1 + ε)r from q.
Proof: The good event here is that p and q collide under one of the sequences of M1 , . . . , ML . However, the
probability that Mi p = Mi q is at least 1/nβ , by Lemma 28.1.5, as this is the probability that Mi does not sample
any of the (at most r) coordinates where p and q disagree. As such, the probability thatall L data-structures
fail (i.e., none of the lists ℓ1 , . . . , ℓL contains p), is at most (1 − 1/nβ )L < 1/nO(1) , as L = O nβ log n . ■
182
Lemma 28.1.7. In expectation, the total number of points in ℓ1 , . . . , ℓL that are in distance ≥ (1 + ε)r from q
is ≤ L.
Proof: Let P≥ be the set of points in P that are in distance ≥ (1 + ε)r from q. For a point u ∈ P≥ , with
∆ = dH u, q , we have that the probability for M ∈ DTφ misses all the ∆ coordinates, where u and q differ, is
(1+ε)rT 1
(1 − φ)∆T ≤ (1 − φ)(1+ε)rT = e−1/r = exp(−(1 + ε)β ln n) = ,
n
as φ = 1 − e−1/r , T = β ln n, and β = 1/(1 + ε). But then, for any i, we have
X 1
E |ℓi | = P Mi p = Mi q ≤ P≥ ≤ 1.
p ∈P
Mi n
≥
As such, the total number of far points in the lists is at most L · 1 = L, implying the claim. ■
28.1.2.3.1. Improving the performance (a bit). Observe that for M ∈ DTφ , and any two points p, u ∈ {0, 1}d ,
all the algorithm cares about is whether M p = M u. As such, if a coordinate is probed many times by M, we
might as well probe this coordinate only once. In particular, for a sequence M ∈ DTφ , let M ′ = uniq(M) be the
projection sequence resulting from removing replications in M. Significantly, M ′ is only of length ≤ d, and
as such, computing M ′ p, for a point p, takes only O(d) time. It is not hard to verify that one can also sample
directly uniq(M), for M ∈ DTφ , in O(d) time. This improves the query and processing by a logarithmic factor.
O(dn1+1/(1+ε) log n)
time and space, such that given a query point q, the algorithm returns, in expected O(dn1/(1+ε) log n) time, one
of the following:
(A) a point p ∈ P such that dH q, p ≤ (1 + ε)r, or
(B) the distance of q from P is larger than r.
The algorithm may return either result if the distance of q from P is in the range [r, (1 + ε)r]. The algorithm
succeeds with high probability (per query).
One can also get a high-probability guarantee on the query time. For a parameter δ > 0, create O(log δ−1 )
LSH data-structures as above. Perform the query as above, except that when the query time exceeds (say) twice
the expected time, move on to redo the query in the next LSH data-structure. The probability that the query had
failed on one of these LSH data-structures is ≤ 1/2, by Markov’s inequality. As such, overall, the query time
becomes O(dn1/(1+ε) log n log δ−1 ), with probability ≥ 1 − δ.
183
28.2. Testing for good items
Imagine that we have n items. One of the items is good the rest are bad. We have two tests to check if an item
is good – we have a cheap test, a really expensive test. We would like to use the expensive test as few times as
possible, and classify correctly all the items. Let T (x) ∈ {good, bad} denote the result of the cheap test on item
x. We have that
P T (x) = good x is good ≥ α > β ≥ P T (x) = good x is bad .
Repeating, the experiment k times, we create a k-test where we turn an item is good if all k tests return “good”.
We then have h i h i
P T (x) = good x is good ≥ α > β ≥ P T (x) = good x is bad .
k k k k
We need to make sure we discover the good item, so let us repeat the k-test enough times, till we discover it
with good probability. A natural value would be to repeat the k-test for each item M = (1/αk ) ln φ−1 times, so
that the probability we fail to discover the good item is
(1 − αk ) M ≤ exp −αk M < φ.
As for the bad items, how many “false positive” would we have? Every k-test is going to return in expecta-
tion at most
(n − 1)βk ≤ nβk
items. As such, the total number of false positives over the M repeated k-tests is going to be
nβk M = O(n(β/α)k log φ−1 ).
If everything is for free, that we will set k to be quite large, so that the number of false positives is practically
zero. For our purposes it would be enough if every k-test returns (in expectation) one false positive. That is,
we will require that
βk n ≤ 1.
This would set up the values we need for k and M.
184
28.3.1. A simple sensitive family
A priori it is not even clear such a sensitive family exists, but it turns out that the family randomly exposing
one coordinate is sensitive.
Lemma 28.3.2. Let fi (p) denote the function that returns the ith coordinate of p, for i = 1, . . . , d. Consider
the family of functions F = { f1 , . . . , fd }. Then, for any r > 0 and ε, the family F is (r, (1 + ε)r, α, β)-sensitive,
where α = 1 − r/d and β = 1 − r (1 + ε)/d.
Proof: If u, v ∈ {0, 1}d are within distance smaller than r from each other (under the Hamming distance), then
they differ in at most r coordinates. The probability that a random h ∈ F would project into a coordinate that
u and v agree on is ≥ 1 − r/d.
Similarly, if dH (u, v) ≥ (1 + ε)r, then the probability that a random h ∈ F would map into a coordinate that
u and v agree on is ≤ 1 − (1 + ε)r/d. ■
Proof: For two fixed points u, v ∈ Hd such that dH (u, v) ≤ r, we have that for a random h ∈ F , we have
P[h(u) = h(v)] ≥ α. As such, for a random g ∈ G, we have that
h i
P g(u) = g(v) = P f (u) = f (v) and f (u) = f (v) and . . . and f (u) = f (v)
1 1 2 2 k k
Y
k h i
= P f (u) = f (v) ≥ α .
i i k
i=1
Q h i
Similarly, if dH (u, v) > R, then P g(u) = g(v) = ki=1 P f i (u) = f i (v) ≤ βk . ■
The above lemma implies that we can build a family that has a gap between the lower and upper sensitivities;
namely, αk /βk = (α/β)k is arbitrarily large. The problem is that if αk is too small, then we will have to use too
many functions to detect whether or not there is a point close to the query point.
Nevertheless, consider the task of building a data-structure that finds all the points of P = {p1 , . . . , pn } that
are equal, under a given function g ∈ G = combine(F , k), to a query point. To this end, we compute the strings
g(p1 ), . . . , g(pn ) and store them (together with their associated point) in a hash table (or a prefix tree). Now,
given a query point q, we compute g( q) and fetch from this data-structure all the strings equal to it that are
stored in it. Clearly, this is a simple and efficient data-structure. All the points colliding with q would be the
natural candidates to be the nearest neighbor to q.
By not storing the points explicitly, but using a pointer to the original input set, we get the following easy
result.
Lemma 28.3.4. Given a function g ∈ G = combine(F , k) (see Lemma 28.3.3) and a set P ⊆ Hd of n points,
one can construct a data-structure, in O(nk) ntime and using O(nk)
o additional space, such that given a query
point q, one can report all the points in X = p ∈ P g(p) = g( q) in O(k + |X|) time.
185
28.3.3. Amplifying sensitivity
Our task is now to amplify the sensitive family we currently have. To this end, for two τ-dimensional points
x and y, let x ≎ y be the Boolean function that returns true if there exists an index i such that xi = yi and
false otherwise. Now, the regular “=” operator requires vectors to be equal in all coordinates (i.e., it is equal to
T S
i (xi = yi )) while x ≎ y is i (xi = yi ). The previous construction of Lemma 28.3.3 using this alternative equal
operator provides us with the required amplification.
Lemma 28.3.5. Given an r, R, αk , βk -sensitive family G, the family H≎ = combine(G, τ) if one uses the ≎
operator to check for equality is r, R, 1 − (1 − αk )τ , 1 − (1 − βk )τ -sensitive.
Proof: For two fixed points u, v ∈ Hd such that dH (u, v) ≤ r, we have, for a random g ∈ G, that P g(u) = g(v)
≥ αk . As such, for a random h ∈ H≎ , we have that
h i
τ τ
P[h(u) ≎ h(v)] = P g (u) = g (v) or g (u) = g (v) or . . . or g (u) = g (v)
1 1 2 2
Y τ h i
k τ
=1− P g (u) , g (v) ≥ 1 − 1 − α .
i i
i=1
Y
τ h i
k τ
P h(u) ≎ h(v) = 1 − P g (u) , g (v) ≤ 1 − 1 − β . ■
i i
i=1
To see the effect of Lemma 28.3.5, it is useful to play with a concrete example. Consider an (r, R, αk , βk )-
√ β = α /2 and yet α is very small. Setting τ = 1/α , the resulting family is (roughly)
k k k k
sensitive family where
(r, R, 1 − 1/e, 1 − 1/ e)-sensitive. Namely, the gap shrank, but the threshold sensitivity is considerably higher.
In particular, it is now a constant, and the gap is also a constant.
Using Lemma 28.3.5 as a data-structure to store P is more involved than before. Indeed, for a random
function h = g , . . . , gτ ∈ H≎ = combine(G, τ) building the associated data-structure requires us to build
1
τ data-structures for each one of the functions g1 , . . . , gτ , using Lemma 28.3.4. Now, given a query point,
we retrieve all the points of P that collide with each one of these functions, by querying each of these data-
structures.
Lemma 28.3.6. Given a function h ∈ H≎ = combine(G, τ) (see Lemma 28.3.5) and a set P ⊆ Hd of n points,
one can construct a data-structure, in O(nkτ)ntime and using O(nkτ)
o additional space, such that given a query
point q, one can report all the points in X = p ∈ P h(p) ≎ h( q) in O(kτ + |X|) time.
186
We are left with the task of fine-tuning the parameters τ and k to get the fastest possible query time, while
the data-structure has reasonable probability to succeed. Figuring the right values is technically tedious, and
we do it next.
Proof: Consider the points in P \ b( q, R). We would like to bound the number of points of this set that collide
with the query point. Observe that in this case, the probability of a point p ∈ P \ b( q, R) to collide with the
query point is
τ β k
≤ ψ = 1 − 1 − βk = βk 1 + (1 − βk ) + (1 − βk )2 + . . . + (1 − βk )τ−1 ≤ βk τ ≤ 8 ,
α
l m
as τ = 4 1/αk and α, β ∈ (0, 1). Namely, the expected number of points of P \ b( q, R) colliding with the query
point is ≤ ψn. ■
By Lemma 28.3.6, extracting the O(L) points takes O(kτ + L) time. Computing the distance of the query
time for each one of these points takes O(kτ + Ld) time. As such, by Lemma 28.3.7, the query time is
O(kτ + Ld) = O kτ + nd(β/α)k .
To minimize this query time, we “approximately” solve the equation requiring the two terms, in the above
bound, to be equal (we ignore d since, intuitively, it should be small compared to n). We get that
k βk
kτ = n(β/α)k ⇝ ≈ n =⇒ k ≈ nβk ⇝ 1/βk ≈ n =⇒ k ≈ ln1/β n.
αk αk
Thus, setting k = ln1/β n, we have that βk = 1/n and, by Eq. (28.1), that
l m !
ln n ln 1/α
τ = 4 1/α = exp
k
ln 1/α = O(nρ ), for ρ = . (28.2)
ln 1/β ln 1/β
ln(1 − x) 1
Lemma 28.3.8. (A) For x ∈ [0, 1) and t ≥ 1 such that 1 − tx > 0 we have ≤ .
ln(1 − tx) t
ln 1/α 1
(B) For α = 1 − r/d and β = 1 − r (1 + ε)/d, we have that ρ = ≤ .
ln 1/β 1 + ε
187
Proof: (A) Since ln(1 − tx) < 0, it follows that the claim is equivalent to t ln(1 − x) ≥ ln(1 − tx). This in turn is
equivalent to
This is trivially true for x = 0. Furthermore, taking the derivative, we see g′ (x) = −t + t (1 − x)t−1 , which is
non-positive for x ∈ [0, 1) and t > 0. Therefore, g is non-increasing in the interval of interest, and so g(x) ≤ 0
for all values in this interval.
ln 1/α ln α ln d−r ln 1 − r
d 1
(B) Indeed ρ = = = d−(1+ε)r
d
= ≤ , by part (A). ■
ln 1/β ln β ln ln 1 − (1 + ε) r 1+ε
d d
In the following, it would be convenient to consider d to be considerably larger than r. This can be ensured
by (conceptually) padding the points with fake coordinates that are all zero. It is easy to verify that this “hack”
would not affect the algorithm’s performance in any way and it is just a trick to make our analysis simpler. In
particular, we assume that d > 2(1 + ε)r.
Lemma 28.3.9. For α = 1 − r/d, β = 1 − r(1 + ε)/d, n and d as above, we have that I. τ = O n1/(1+ε) ,
II. k = O(ln n), and III. L = O(n1/(1+ε) ).
l m
Proof: By Eq. (28.1), τ = 4 1/αk = O(nρ ) = O(n1/(1+ε) ), by Lemma 28.3.8(B).
Now, β = 1 − r(1 + ε)/d ≤ 1/2, since we assumed that d > 2(1 + ε)r. As such, we have k = ln1/β n =
ln n
= O(ln n).
ln 1/β
By Lemma 28.3.7, L = O n(β/α)k . Now βk = 1/n and as such L = O(1/αk ) = O(τ) = O n1/(1+ε) . ■
Proof: Our building block is the data-structure described above. By Markov’s inequality, the probability that
the algorithm has to abort because of too many collisions with points of P \ b( q, (1 + ε)r) is bounded by 1/4
(since the algorithm tries 4L + 1 points). Also, if there is a point inside b( q, r), the algorithm would find it with
probability ≥ 3/4, by Eq. (28.1). As such, with probability at least 1/2 this data-structure returns the correct
answer in this case. By Lemma 28.3.6, the query time is O(kτ + Ld).
This data-structure succeeds only with constant probability. To achieve high probability, we construct
O(log n) such data-structures and perform the near neighbor query in each one of them. As such, the query
time is
O (kτ + Ld) log n = O n1/(1+ε) log2 n + dn1/(1+ε) log n = O dn1/(1+ε) log n ,
by Lemma 28.3.9 and since d = Ω lg n if P contains n distinct points of Hd .
188
As for the preprocessing time, by Lemma 28.3.6 and Lemma 28.3.9, we have
O nkτ log n = O n1+1/(1+ε) log2 n .
Finally, this data-structure requires O(dn) space to store the input points. Specifically, by Lemma 28.3.6,
we need an additional O nkτ log n = O n1+1/(1+ε) log2 n space. ■
In the hypercube case, when d = nO(1) , we can build M = O log1+ε d = O(ε−1 log d) such data-structures
such that (1 + ε)-ANN can be answered using binary search on those data-structures which correspond to radii
r1 , . . . , r M , where ri = (1 + ε)i , for i = 1, . . . , M.
Theorem 28.3.11. Given a set P of n points on the hypercube Hd (where d = nO(1) ) and a parameter ε > 0,
one can build a data-structure to answer approximate nearest neighbor queries (under the Hamming distance)
using O dn + n1/(1+ε) ε−1 log2 n log d space, such that given a query point q, one can return a (1 + ε)-ANN in P
(under the Hamming distance) in O(dn1/(1+ε) log n log(ε−1 log d)) time. The result returned is correct with high
probability.
Remark 28.3.12. The result of Theorem 28.3.11 needs to be oblivious to the queries used. Indeed, for any
instantiation of the data-structure of Theorem 28.3.11 there exist query points for which it would fail.
In particular, formally, if we perform a sequence of ANN queries using such a data-structure, where the
queries depend on earlier returned answers, then the guarantee of a high probability of success is no longer
implied by the above analysis (it might hold because of some other reasons, naturally).
Proof: By Lemma 24.2.3p157 the point X has multi-dimensional normal distribution Nd . As such, if ∥v∥ = 1,
then this holds by the symmetry of the normal distribution. Indeed, let e1 = (1, 0, . . . , 0). By the symmetry of
the d-dimensional normal distribution, we have that ⟨v, X⟩ ∼ ⟨e1 , X⟩ = X1 ∼ N.
Otherwise, ⟨v, X⟩ / ∥v∥ ∼ N, and as such ⟨v, X⟩ ∼ N 0, ∥v∥2 , which is indeed the distribution of ∥v∥ Z. ■
Definition 28.4.2. A distribution D over R is called p-stable if there exists p ≥ 0 such that for any n real
P
numbers v1 , . . . , vn and n independent variables X1 , . . . , Xn with distribution D, the random variable i vi Xi has
P
the same distribution as the variable ( i |vi | p )1/p X, where X is a random variable with distribution D.
189
Assume that the distance between p and u is η and the distance between the projection of the two points to
the direction v is β. Then, the probability that p and u get the same hash value is max(1 − β/r, 0), since this
is the probability that the random sliding will not separate them. Indeed, consider the line through v to be the
x-axis, and assume u is projected to r and v is projected to r − β (assuming r ≥ β). Clearly, u and v get mapped
to the same value by h(·) if and only if t ∈ [0, r − β], as claimed.
As such, we have that the probability of collusion is
Z r
β
α(η, r) = P h(p) = h( q) = P |⟨p, v⟩ − ⟨u, v⟩| = β 1 − dβ.
β=0 r
Since we are considering the absolute value of the variable, we need to multiply this by two. Thus, we have
Z r !
2 β2 β
α(η, r) = √ exp − 2 1 − dβ,
β=0 2πη 2η r
by plugging in the density of the normal distribution for Z. Intuitively, we care about the difference α(1 + ε, r) −
α(1, r), and we would like to maximize it as much as possible (by choosing the right value of r). Unfortunately,
this integral is unfriendly, and we have to resort to numerical computation.
Now, we are going to use this hashing scheme for constructing locality sensitive hashing, as in the hyper-
cube case, and as such we care about the ratio
log(1/α(1, r))
ρ(1 + ε) = min ;
r log(1/α(1 + ε, r))
Lemma 28.4.3 implies that the hash functions defined by Eq. (28.3) are (1, 1 + ε, α′ , β′ )-sensitive and,
log(1/α′ )
furthermore, ρ = log(1/β′ ) ≤ 1+ε 1
, for some values of α′ and β′ . As such, we can use this hashing family
to construct an approximate near neighbor data-structure D≈Near (P, r, (1 + ε)r) for the set P of points in Rd .
Following the same argumentation of Theorem 28.3.10, we have the following.
Theorem 28.4.4. Given a set P of n points in Rd and parameters ε > 0 and r > 0, one can build a D≈Near =
D≈Near (P, r, (1 + ε)r), such that given a query point q, one can decide:
(A) If b(q, r) ∩ P , ∅, then D≈Near returns a point u ∈ P, such that dH (u, q) ≤ (1 + ε)r.
(B) If b(q, (1 + ε)r) ∩ P = ∅, then D≈Near returns the result that no point is within distance ≤ r from q.
1/(1+ε)
In any other case, any of the answers is correct. The query time is O(dn log n) and the space used is
O dn + n 1+1/(1+ε)
log n . The result returned is correct with high probability.
190
28.4.3.1. The result
Plugging the above into known reduction from approximate nearest-neighbor to near-neighbor queries, yields
the following:
Corollary 28.4.5. Given a set P of n points in Rd , one can construct a data-structure D that answers (1 + ε)-
ANN queries, by performing O(log(n/ε)) (1 + ε)-approximate near neighbor queries. The total number of
points stored at these approximate near neighbor data-structure of D is O(nε−1 log(n/ε)).
Theorem 28.4.6. Given a set P of n points in Rd and parameters ε > 0 and r > 0, one can build an ANN
data-structure using
O dn + n1+1/(1+ε) ε−2 log3 (n/ε)
space, such that given a query point q, one can returns a (1 + ε)-ANN in P in
n
O dn1/(1+ε) log n log
ε
time. The result returned is correct
with high probability.
1+1/(1+ε) −2
The construction time is O dn ε log3 (n/ε) .
191
′
space into ℓ1k (this is the space Rk , where distances are the L1 metric instead of the regular L2 metric), where
k′ = O(k/ε2 ). This can be done with distortion (1 + ε).
′
Let Q′ be the resulting set of points in Rk . We want to solve approximate near neighbor queries on this
set of points, for radius r. As a first step, we partition the space into cells by taking a grid with sidelength
(say) k′ r and randomly translating it, clipping the points inside each grid cell. It is now sufficient to solve
the approximate near neighbor problem inside this grid cell (which has bounded diameter as a function of
r), since with small probability the result would be correct. We amplify the probability by repeating this a
polylogarithmic number of times.
′
Thus, we can assume that P is contained inside a cube of sidelength ≤ k′ nr, it is in Rk , and the distance
metric is the L1 metric. We next snap the points of P to a grid of sidelength (say) εr/k′ . Thus, every point
of P now has an integer coordinate, which is bounded by a polynomial in log n and 1/ε. Next, we write the
coordinates of the points of P using unary notation. (Thus, a point (2, 5) would be written as (00011, 11111)
assuming the number of bits for each coordinates is 5.) It is now easy to verify that the Hamming distance on
the resulting strings is equivalent to the L1 distance between the points.
Thus, we can solve the near neighbor problem for points in Rd by solving it on the hypercube under the
Hamming distance.
See Indyk and Motwani [IM98] for more details.
We have only scratched the surface of proximity problems in high dimensions. The interested reader is
referred to the survey by Indyk [Aga04] for more information.
References
[Aga04] P. K. Agarwal. Range searching. Handbook of Discrete and Computational Geometry. Ed. by
J. E. Goodman and J. O’Rourke. 2nd. Boca Raton, FL, USA: CRC Press LLC, 2004. Chap. 36,
pp. 809–838.
[AI06] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in
high dimensions. Proc. 47th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 459–468, 2006.
[AI08] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in
high dimensions. Commun. ACM, 51(1): 117–122, 2008.
[DIIM04] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based
on p-stable distributions. Proc. 20th Annu. Sympos. Comput. Geom. (SoCG), 253–262, 2004.
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of di-
mensionality. Proc. 30th Annu. ACM Sympos. Theory Comput. (STOC), 604–613, 1998.
[KOR00] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor
in high dimensional spaces. SIAM J. Comput., 2(30): 457–474, 2000.
[MNP06] R. Motwani, A. Naor, and R. Panigrahi. Lower bounds on locality sensitive hashing. Proc. 22nd
Annu. Sympos. Comput. Geom. (SoCG), 154–157, 2006.
192
Chapter 29
Random Walks I
598 - Class notes for Randomized Algorithms
“A drunk man will find his way home; a drunk bird may
Sariel Har-Peled
wander forever.”
April 2, 2024
Anonymous,
29.1. Definitions
Let G = G(V, En) be an undirected
o connected graph. For v ∈ V, let Γ(v) denote the set of neighbors of v in G;
that is, Γ(v) = u vu ∈ E(G) . A random walk on G is the following process: Starting from a vertex v0 , we
randomly choose one of the neighbors of v0 , and set it to be v1 . We continue in this fashion, in the ith step
choosing vi , such that vi ∈ Γ(vi−1 ). It would be interesting to investigate the random walk process. Questions of
interest include:
(A) How long does it take to arrive from a vertex v to a vertex u in G?
(B) How long does it take to visit all the vertices in the graph.
(C) If we start from an arbitrary vertex v0 , how long the random walk has to be such that the location of the
random walk in the ith step is uniformly (or near uniformly) distributed on V(G)?
Example 29.1.1. In the complete graph Kn , visiting all the vertices takes in expectation O(n log n) time, as this
is the coupon collector problem with n − 1 coupons. Indeed, the probability we did not visit a specific vertex v
by the ith step of the random walk is ≤ (1 − 1/n)i−1 ≤ e−(i−1)/n ≤ 1/n10 , for i = Ω(n log n). As such, with high
probability, the random walk visited all the vertex of Kn . Similarly, arriving from u to v, takes in expectation
n − 1 steps of a random walk, as the probability of visiting v at every step of the walk is p = 1/(n − 1), and the
length of the walk till we visit v is a geometric random variable with expectation 1/p.
193
Figure 29.1: A walk in the integer grid, when rotated by 45 degrees, results, in two independent walks on one
dimension.
!
22i 2i 22i
since √ ≤ ≤ √ [MN98, p. 84]. This can also be verified using the Stirling formula, and the resulting
2 i i 2i
sequence diverges. ■
Lemma 29.1.4. Consider the infinite random walk on the two dimensional integer grid Z2 , starting from (0, 0).
The expected number of times that such a walk visits the origin is unbounded.
Proof: Rotate the grid by 45 degrees, and consider the two new axes X ′ and Y ′ , see Figure 29.1.. Let xi be
′
the projection of the location of the ith
√ step of the random walk on the X -axis, and define √ yi in a similar
fashion. Clearly, xi are of the √form j/ 2, where
√ j is an integer. By scaling by a factor of 2, consider the
′ ′
resulting random walks xi = 2xi and yi = 2yi . Clearly, xi and yi are random walks on the integer grid,
and furthermore, they are independent. As such, the probability that we visit the origin at the 2ith step is
h i h i2 2
′
P x2i = 0 ∩ y′2i = 0 = P x2i ′
= 0 = 212i 2ii ≥ 1/4i. We conclude, that the infinite random walk on the grid
Z visits the origin in expectation
2
X∞
′ X 1
∞
′
P xi = 0 ∩ yi = 0 ≥ = ∞,
i=0 i=0
4i
194
In particular, we have !
X n 1
1 = (1/3 + 1/3 + 1/3) =
n n
. (29.1)
a+b+c=n, a,b,c≥0
a b c 3n
Lemma 29.1.5. Consider the infinite random walk on the three dimensional integer grid Z3 , starting from
(0, 0, 0). The expected number of times that such a walk visits the origin is bounded.
Proof: The probability of a neighbor of a point (x, y, z) to be the next point in the walk is 1/6. Assume that
we performed a walk for 2i steps, and decided to perform 2a steps parallel to the x-axis, 2b steps parallel to
the y-axis, and 2c steps parallel to the z-axis, where a + b + c = i. Furthermore, the walk on each dimension is
balanced, that is we perform a steps to the left on the x-axis, and a steps to the right on the x-axis. Clearly, this
corresponds to the only walks in 2i steps that arrives to the origin.
(2i)!
Next, the number of different ways we can perform such a walk is a!a!b!b!c!c! , and the probability to perform
such a walk, summing over all possible values of a, b and c, is
X ! !2 !2i ! ! !i 2
(2i)! 1 2i 1 X i! 1 2i 1 X i 1
αi = 2i
= 2i
= 2i
a+b+c=i
a!a!b!b!c!c! 6 i 2 a+b+c=i a! b! c! 3 i 2 a+b+c=i a b c 3
a,b,c≥0 a,b,c≥0 a,b,c≥0
Consider the case where i = 3m. We have that i
a b c
≤ i
m m m
. As such, we have
! !i ! X ! !i
2i 1 1 i i 1
αi ≤ 2i
.
i 2 3 m m m a+b+c=i a b c 3
a,b,c≥0
| {z }
=1 by Eq. (29.1)
Finally, observe that α6m ≥ (1/6)2 α6m−2 and α6m ≥ (1/6)4 α6m−4 . Thus,
X
∞
αm = O(1). ■
m=1
195
References
[MN98] J. Matoušek and J. Nešetřil. Invitation to Discrete Mathematics. Oxford Univ Press, 1998.
[Nor98] J. R. Norris. Markov Chains. Statistical and Probabilistic Mathematics. Cambridge Press, 1998.
196
Chapter 30
Random Walks II
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“Then you must begin a reading program immediately so that you man understand the crises of our age," Ignatius said solemnly.
"Begin with the late Romans, including Boethius, of course. Then you should dip rather extensively into early Medieval. You
may skip the Renaissance and the Enlightenment. That is mostly dangerous propaganda. Now, that I think about of it, you had
better skip the Romantics and the Victorians, too. For the contemporary period, you should study some selected comic books.”
“You’re fantastic.”
“I recommend Batman especially, for he tends to transcend the abysmal society in which he’s found himself. His morality is
rather rigid, also. I rather respect Batman.”
Definition 30.1.1. A sequence/word σ of length 2n elements/characters made out of two symbols X and Y, is
balanced, if
(I) X appears n times (i.e., #(σ, X) = n),
(II) Y appears n times (i.e., #(σ, Y) = n),
(III) In any prefix of the string, the number of Xs is at least as large as the number of Ys.
Such a string is known as a Dyck word. If X and Y are the open and close parenthesis characters, respectively,
then the word is a balanced/valid parenthesis pattern.
Definition 30.1.2. The Catalan number, denoted by Cn , is the number of balanced strings of length 2n.
Definition 30.1.3. A sequence σ made out of two symbols X and Y is dominating, if for any non-empty prefix
of σ, the number of Xs is strictly larger than the number of Ys.
Lemma 30.1.4. Let σ be a cyclic sequence made out symbols X and Y, where n = #(σ, X) and m = #(σ, y),
with n > m. Then there are exactly n − m locations where cutting the cyclic sequence at these locations, results
in a dominating sequence.
197
Proof: Consider a location in σ that contains X, and the next location contains Y. Clearly, such a location
can not be a start for a dominating sequence. Of course, the next location can also not be a start position for a
dominating sequence. As such, these two locations must be interior to a dominating sequence, and deleting both
of these symbols from σ, results in a new cyclic sequence, where every dominating start location corresponds
to a dominating start location in the original sequence. Observe, that as long as the number of Xs is larger
than the number of Ys, there must be such a location with XY as the prefix. Repeatedly deleting XY substring,
results in a string of length n − m, where every location is a good start of a dominating sequence. We conclude
that there are exactly n − m such locations. ■
Observation 30.1.5. The number of distinct
cyclic sequences of length m + n, with m appearances of X, and
n appearances of Y is m!n! = n+m n , since there are (n + m − 1)! different cyclic ways to arrange n + m
(n+m−1)! 1 n+m
distinct values.
Theorem 30.1.6. For n ≥ 1, we have that the Catalan number Cn = n+1 1 2n
n
.
Proof: Consider a dominating sequence σ of length 2n + 1 with #(σ, X) = n + 1, and #(σ, Y) = n. Such a
sequence must start with an X, and if we remove the leading X, then what remains is a balanced sequence.
Such a sequence σ can be interpreted as a cyclic sequence. By the above lemma, there is a unique shift that is
dominating. As such, the number of such cyclic sequence is the Catalan number Cn . By the above observation,
the number of such cyclic sequences is
!
(n + m − 1)! (n + n + 1 − 1)! 1 2n! 1 2n
= = = . ■
m!n! (n + 1)!n! n + 1 n!n! n + 1 n
! !
Y i − τ + k Y τ τ τ
τ τ
τ2 τ2 1
α= = 1− ≥ 1− ≥ 1 − 2 τ exp − ≥ ,
k=1
i+k k=1
i+k i i i 3
√ √ √ 2i 2i
for τ ≤ i, and i ≥ 112 . Namely, for any k, such that − i ≤ k ≤ i, we have i+k ≥ i /3. We thus have that
√
Xi ! √ ! !
1 2i 2 i 2i 2i 2 22i
1 ≥ 2i ≥ =⇒ ≤ · √.
2 √ i+k 3 · 22i i i 3 i
k=− i+1
√
√ Vi[X] = 2i(1/2)(1/2) = i/2. Let
Let ∆ = i − 1 and X ∼ bin(2i, 1/2). We have thath E[X] = i,√ and
1 P∆
β = 22i k=−∆ i+k . By Chebychev, we have that 1 − β = P |X − i| ≥ 2 i/2 ≤ 1/2. which implies β ≥ 1/2.
2i
We have ! ! !
1 X 2i
∆
1 2∆ + 1 2i 2i 22i 22i
≤ β ≤ 2i ≤ =⇒ ≥ ≥ √. ■
2 2 k=−∆ i + k 22i i i 2(2∆ + 1) 4 i
198
Lemma 30.2.2. In a random walk on the line starting at zero, in expectation, after 48n2 steps, the walk had
visited either −n or +n.
√
Proof: √By Lemma 30.2.1, the probability that after 2i steps, for i = 16n2 , the walk is in the range {− i +
1, . . . , i − 1} is at most
1 2 22i 2 1 1
2n 2i · · √ = 2n · = .
2 3 i 3 4n 3
Namely, the walk arrived to either −n or +n during the first 32n2 steps (note that n ≤ i/2) with probability
≥ 2/3. If this did not happen, we continue the walk. As i ≥ 2n, the same argumentation essentially implies
that every 32n2 steps, the walk terminates with probability at least 2/3. As such, in expectation, after 3/2 such
epochs, the walk would terminate. ■
Proof: For simplicity of exposition assume that n is divisible by 4. Consider the random walk on the integer
line, starting from zero, where we go to the left with probability 1/2, and to the right probability 1/2. Let Yi be
the location of the walk at the i step. Clearly, E[Yi ] ≥ E[Xi ]. By defining the random walk on the integer line
more carefully, one can ensure that Yi ≤ Xi . Thus, the expected number of steps till Yi is equal to n is an upper
bound on the required quantity.
For an i, Y2i is an even number. Thus, consider the event that Y2i = 2∆ ≥ n, let Y2i = R2i − L2i , where R2i is
the number of steps to the right, and L2i is the number of steps to the left. Observe that
R2i − L2i = 2∆
R2i =i+∆
=⇒
R2i + L2i = 2i L2i = i − R2i = i − ∆.
Thus, for i ≥ n/2, we have that the probability that in the 2ith step we have Y2i ≥ n is
Xi
1 2i
ρ= .
∆=n/2
22i i + ∆
199
√ √
Lemma 30.3.2 below, tells us that for ρ > 1/3, is implied if ∆ ≤ i/6. That is, n/2 ≤ i/6, which holds for
i = 9n2 .
Next, if X2i fails to arrive to n at the first µ steps, we will reset Yµ = Xµ and continue the random walk,
repeating this process as many phases as necessary. The probability that the number of phases exceeds i is
≤ (2/3)i . As such, the expected number of steps in the walk is at most
X
c′ n2 i(1 − ρ)i = O(n2 ),
i
as claimed. ■
X2i !
1 2i 1
Lemma 30.3.2. We have √
2i
≥ .
k=i+ i/6 2 k 3
2i √
Proof: It is known¬ that i
≤ 22i / i (better constants are known). As such, since 2ii ≥ 2i
m
, for all m, we
have by symmetry that
X ! X ! !
2i
1 2i
2i
1 2i √ 1 2i 1 √ 1 22i 1
≥ − i/6 ≥ − i/6 · √ = . ■
√ 22i k k=i+1
2 2i k 2 2i i 2 2 2i
i 3
k=i+ i/6
The Markov chain start at an initial state X0 , and at each point in time moves according to the transition
probabilities. This form a sequence of states {Xt }. We have a distribution over those sequences. Such a sequence
would be referred to as a history.
Similar to Martingales, the behavior of a Markov chain in the future, depends only on its location Xt at time
t, and does not depends on the earlier stages that the Markov chain went through. This is the memorylessness
property of the Markov chain, and it follows as Pi j is independent of time. Formally, the memorylessness
property is
P Xt+1 = j X0 = i0 , X1 = i1 , . . . , Xt−1 = it−1 , Xt = i = P Xt+1 = j | Xt = i = Pi j .
The initial state of the Markov chain might also be chosen randomly in some cases.
¬
Probably because you got it as a homework problem, if not wikipedia knows, and if you are bored you can try and prove it
yourself.
200
h i
For states i, j ∈ S, the t-step transition probability is P(t)ij = P Xt = j X 0 = i . The probability that we visit
j for the first time, starting from i after t steps, is denoted by
h i
r(t)
ij = P X t = j and X 1 , j, X2 , j, . . . , X t−1 , j X 0 = i .
P
Let fi j = t>0 r(t)
i j denote the probability that the Markov chain visits state j, at any point in time, starting from
state i. The expected number of steps to arrive to state j starting from i is
X
hi j = t · ri(t)j .
t>0
Of course, if fi j < 1, then there is a positive probability that the Markov chain never arrives to j, and as such
hi j = ∞ in this case.
Definition 30.4.1. A state i ∈ S for which fii < 1 (i.e., the chain has positive probability of never visiting i
again), is a transient state. If fii = 1 then the state is persistent.
A state i that is persistent but hii = ∞ is null persistent. A state i that is persistent and hii , ∞ is non null
persistent.
Example 30.4.2. Consider the state 0 in the random walk on the integers. We already know that in expectation
the random walk visits the origin infinite number of times, so this hints that this is a persistent state. Let figure
00 . To this end, consider a walk X0 , X1 , . . . , X2n that starts at 0 and return to 0 only in the
out the probability r(2n)
2n step. Let S i = Xi − Xi−1 , for all i. Clearly, we have S i ∈ {−1, +1} (i.e., move left or move right). Assume
the walk starts by S 1 = +1 (the case −1 is handled similarly). Clearly, the walk S 2 , . . . , S 2n−1 must be prefix
balanced; that is, the number of 1s is always bigger (or equal) for any prefix of this sequence.
are known as Dyck words, and the number of such words of length 2m is the
Strings with this property
Catalan number Cm = m+1 m . As such, the probability of the random walk to visit 0 for the first time (starting
1 2m
In finite Markov chains there are no null persistent states (this requires a proof, which is left as an exercise).
There is a natural directed graph associated with a Markov chain. The states are the vertices, and the transition
probability Pi j is the weight assigned to the edge (i → j). Note that we include only edges with Pi j > 0.
Definition 30.4.3. A strong component (or a strong connected component) of a directed graph G is a maximal
subgraph C of G such that for any pair of vertices i and j in the vertex set of C, there is a directed path from i
to j, as well as a directed path from j to i.
Definition 30.4.4. A strong component C is a final strong component if there is no edge going from a vertex
in C to a vertex that is not in C.
201
In a finite Markov chain, there is positive probability to arrive from any vertex on C to any other vertex of
C in a finite number of steps. If C is a final strong component, then this probability is 1, since the Markov chain
can never leave C once it enters it . It follows that a state is persistent if and only if it lies in a final strong
component.
Definition 30.4.5. A Markov chain is irreducible if its underlying graph consists of a single strong component.
Definition 30.4.7. A stationary distribution for a Markov chain with the transition matrix P is a probability
distribution π such that π = πP.
In general, stationary distribution does not necessarily exist. We will mostly be interested in Markov chains
that have stationary distribution. Intuitively it is clear that if a stationary distribution exists, then the Markov
chain, given enough time, will converge to the stationary distribution.
Definition 30.4.8. The periodicity of a state i is the maximum integer T for which there exists an initial distri-
i > 0 then t belongs to the arithmetic
bution q(0) and positive integer a such that, for all t, if at time t we have q(t)
progression {a + T i | i ≥ 0}. A state is said to be periodic if it has periodicity greater than 1, and is aperiodic
otherwise. A Markov chain in which every state is aperiodic is aperiodic.
Example 30.4.9. The easiest example maybe of a periodic Markov chain is a directed cycle.
v2
For example, the Markov chain on the right, has periodicity of three. In particular, the initial
v1 state probability vector q(0) = (1, 0, 0) leads to the following sequence of state probability vectors
v3
q(0) = (1, 0, 0) =⇒ q(1) = (0, 1, 0) =⇒ q(2) = (0, 0, 1) =⇒ q(3) = (1, 0, 0) =⇒ . . . .
Note, that this chain still has a stationary distribution, that is (1/3, 1/3, 1/3), but unless you start from this
distribution, you are going to converge to it.
A neat trick that forces a Markov chain to be aperiodic, is to shrink all the probabilities by a factor of 2,
and make every state to have a transition probability to itself equal to 1/2. Clearly, the resulting Markov chain
is aperiodic.
The following theorem is the fundamental property of Markov chains that we will need. The interested
reader, should check the proof in [Nor98] (the proof is not hard).
Think about it as hotel California.
202
Theorem 30.4.11 (Fundamental theorem of Markov chains). Any irreducible, finite, and aperiodic Markov
chain has the following properties.
(i) All states are ergodic.
(ii) There is a unique stationary distribution π such that, for 1 ≤ i ≤ n, we have πi > 0.
(iii) For 1 ≤ i ≤ n, we have fii = 1 and hii = 1/πi .
(iv) Let N(i, t) be the number of times the Markov chain visits state i in t steps. Then
N(i, t)
lim = πi .
t→∞ t
Namely, independent of the starting distribution, the process converges to the stationary distribution.
References
[Nor98] J. R. Norris. Markov Chains. Statistical and Probabilistic Mathematics. Cambridge Press, 1998.
203
204
Chapter 31
and this holds for all v. We only need to verify the claimed solution, since there is a unique stationary distribu-
tion. Indeed,
d(v) X d(u) 1 d(v)
= πv = [πP]v = = ,
2m uv
2m d(u) 2m
205
as claimed. ■
Definition 31.1.2. The hitting time huv is the expected number of steps in a random walk that starts at u and
ends upon first reaching v.
The commute time between u and v is denoted by CTuv = huv + hvu .
Let Cu (G) denote the expected length of a walk that starts at u and ends upon visiting every vertex in G at
least once. The cover time of G denotes by C(G) is defined by C(G) = maxu Cu (G).
assuming i is a power of two (why not?). As such, T n = nT 1 +Θ(n2 ). Since T 1 = Θ(n2 ), we have that T n = Θ(n3 ).
Definition 31.1.6. A n × n matrix M is stochastic if all its entries are non-negative and for each row i, it holds
P P
k Mik = 1. It is doubly stochastic if in addition, for any i, it holds k Mki = 1.
Lemma 31.1.7. Let MC be a Markov chain, such that transition probability matrix P is doubly stochastic.
Then, the distribution u = (1/n, 1/n, . . . , 1/n) is stationary for MC.
X
n
Pki 1
Proof: [uP]i = = . ■
k=1
n n
We can interpret every edge in G as corresponding to two directed edges. In particular, imagine performing
a random walk in G, but remembering not only the current vertex in the walk, but also the (directed) edge used
the walk to arrive to this vertex. One can interpret this as a random walk on the (directed) edges. Observe, that
there are 2m directed edges. Furthermore, a vertex u of degree d(u), has stationary distribution πu = d(u)/2m.
As such, the probability that the random walk would use any of the d(u) outgoing edges from u is exactly
206
α = πu /d(u) = 1/2m. Namely, if we interpret the walk on the graph as walk on the directed edges, the stationary
distribution is uniform. This readily implies that if (u → v) is in the graph, then h(u→v)(u→v) is 1/α = 2m. This
readily implies that the expected time to go from u to v and back to u is at most 2m. Next, we provide a more
formal (and somewhat different) proof of this.
(Note, that (u → v) being an edge in the graph is crucial. Indeed, without it a significantly worst case bound
holds, see Theorem 31.2.1.)
Proof: Consider a new Markov chain defined by the edges of the graph (where every edge is taken twice as
two directed edges), where the current state is the last (directed) edge visited. There are 2m edges in the new
Markov chain, and the new transition matrix, has Q(u→v),(v→w) = Pvw = d(v) 1
. This matrix is doubly stochastic,
meaning that not only do the rows sum to one, but the columns sum to one as well. Indeed, for an edge (v → w)
we have
X X X 1
Q(x→y),(v→w) = Q(u→v),(v→w) = Pvw = d(v) × = 1.
x∈V,y∈Γ(x) u∈Γ(v) u∈Γ(v)
d(v)
Thus, the stationary distribution for this Markov chain is uniform, by Lemma 31.1.7. Namely, the stationary
distribution of e = (u → v) is hee = πe = 1/(2m). Thus, the expected time between successive traversals of e is
1/πe = 2m, by Theorem 30.4.11 (iii).
Consider huv + hvu and interpret this as the time to go from u to v and then return to u. Conditioned on the
event that the initial entry into u was via the edge (v → u), we conclude that the expected time to go from there
to v and then finally use (v → u) is 2m. The memorylessness property of a Markov chains now allows us to
remove the conditioning: since how we arrived to u is not relevant. Thus, the expected time to travel from u to
v and back is at most 2m. ■
The effective resistance between nodes u and v is the voltage difference between u and v when one ampere
is injected into u and removed from v (or injected into v and removed from u). The effective resistance is always
bounded by the branch resistance, but it can be much lower.
Given an undirected graph G, let N(G) be the electrical network defined over G, associating one ohm
resistance on the edges of N(G).
You might now see the connection between a random walk on a graph and electrical network. Intuitively
(used in the most unscientific way possible), the electricity, is made out of electrons each one of them is doing
a random walk on the electric network. The resistance of an edge, corresponds to the probability of taking the
edge. The higher the resistance, the lower the probability that we will travel on this edge. Thus, if the effective
resistance Ruv between u and v is low, then there is a good probability that travel from u to v in a random walk,
and huv would be small.
207
31.2.1. A tangent on parallel and series resistors
Consider having n resistors in parallel with resistance R1 , . . . , Rn , connecting two nodes u and v. As follows:
v
R1 R2 Rn
u
R1 R2 Rn
Then, the effective resistance between u and v is
Ruv = R1 + · · · + Rn .
In particular, if R1 = · · · = Rn , then Ruv = nR.
Proof: Let ϕuv denote the voltage at u in N(G) with respect to v, where d(x) amperes of current are injected
into each node x ∈ V, and 2m amperes are removed from v. We claim that
huv = ϕuv .
Note, that the voltage on an edge xy is ϕ xy = ϕ xv − ϕyv . Thus, using Kirchhoff’s Law and Ohm’s Law, we obtain
that X X X
ϕ xw
x ∈ V \ {v} d(x) = current(xw) = = (ϕ xv − ϕwv ), (31.1)
w∈Γ(x) w∈Γ(x)
resistance(xw) w∈Γ(x)
since the resistance of every edge is 1 ohm. (We also have the “trivial” equality that ϕvv = 0.) Furthermore, we
have only n variables in this system; that is, for every x ∈ V, we have the variable ϕ xv .
208
Now, for the random walk interpretation – by the definition of expectation, we have
1 X X X
x ∈ V \ {v} h xv = (1 + hwv ) ⇐⇒ d(x) h xv = 1+ hwv
d(x) w∈Γ(x) w∈Γ(x) w∈Γ(x)
X X X
⇐⇒ 1 = d(x) h xv − hwv = (h xv − hwv ).
w∈Γ(x) w∈Γ(x) w∈Γ(x)
P
Since d(x) = w∈Γ(x) 1, this is equivalent to
X
x ∈ V \ {v} d(x) = (h xv − hwv ). (31.2)
w∈Γ(x)
Again, we also have the trivial equality hvv = 0.¬ Note, that this system also has n equalities and n variables.
Eq. (31.1) and Eq. (31.2) show two systems of linear equalities. Furthermore, if we identify huv with ϕ xv
then they are exactly the same system of equalities. Furthermore, since Eq. (31.1) represents a physical system,
we know that it has a unique solution. This implies that ϕ xv = h xv , for all x ∈ V.
Imagine the network where u is injected with 2m amperes, and for all nodes w remove d(w) units from w.
In this new network, hvu = −ϕ′vu = ϕ′uv . Now, since flows behaves linearly, we can superimpose them (i.e., add
them up). We have that in this new network 2m unites are being injected at u, and 2m units are being extracted
at v, all other nodes the charge cancel itself out. The voltage difference between u and v in the new network is
b
ϕ = ϕuv + ϕ′uv = huv + hvu = CTuv . Now, in the new network there are 2m amperes going from u to v, and by
Ohm’s law, we have
b
ϕ = voltage = resistance ∗ current = 2mRuv ,
as claimed. ■
n vertices
u
x1
x2
v = xn
Figure 31.1: Lollipop again.
Example 31.2.2. Recall the lollipop Ln from Exercise 31.1.4, see Figure 31.1. Let u be the connecting vertex
between the clique and the stem (i.e., the path). The effective resistance between u and v is n since there are n
n= n.
resistors in series along the stem. That is Ruv
The number of edges in the lollipop is 2 + n = n(n − 1)/2 + n = n(n + 1)/2. As such, the commute time
hvu + huv = CTuv = 2mRuv = 2 n(n + 1)/2 n = n2 (n + 1).
We already know that hvu = Θ(n2 ). This implies that huv = CTuv − hvu = Θ(n3 ).
¬
In previous lectures, we interpreted hvv as the expected length of a walk starting at v and coming back to v.
209
Lemma 31.2.3. For any n vertex connected graph G, and for all u, v ∈ V(G), we have CTuv < n3 .
Proof: The effective resistance between any two nodes in the network is bounded by the length of the shortest
path between the two nodes, which is at most n − 1. As such, plugging this into Theorem 31.2.1, yields the
bound, since m < n2 . ■
References
[DS00] P. G. Doyle and J. L. Snell. Random walks and electric networks. ArXiv Mathematics e-prints,
2000. eprint: math/0001057.
210
Chapter 32
Random Walks IV
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“Do not imagine, comrades, that leadership is a pleasure! On the contrary, it is a deep and heavy responsibility. No one believes
more firmly than Comrade Napoleon that all animals are equal. He would be only too happy to let you make your decisions
for yourselves. But sometimes you might make the wrong decisions, comrades, and then where should we be? Suppose you
had decided to follow Snowball, with his moonshine of windmills-Snowball, who, as we now know, was no better than a
criminal?”
Proof: (Sketch.) Construct a spanning tree T of G, and consider the time to walk around T . The expected time
to travel on this edge on both directions is CTuv = huv + hvu , which is smaller than 2m, by Lemma 31.1.8. Now,
just connect up those bounds, to get the expected time to travel around the spanning tree. Note, that the bound
is independent of the starting vertex. ■
Definition 32.1.2. The resistance of G is R(G) = maxu,v∈V(G) Ruv ; namely, it is the maximum effective resis-
tance in G.
Proof: Consider the vertices u and v realizing R(G), and observe that max(huv , hvu ) ≥ CTuv /2, and CTuv =
2mRuv by Theorem 31.2.1. Thus, C(G) ≥ CTuv /2 ≥ mR(G).
As for the upper bound. Consider a random walk, and divide it into epochs, where a epoch is a random
walk of length 2e3 mR(G). For any vertex v, the expected time to hit u is hvu ≤ 2mR(G), by Theorem 31.2.1.
Thus, the probability that u is not visited in a epoch is 1/e3 by the Markov inequality. Consider a random walk
with t = ln n epochs. We have that the probability of nnot visiting u is ≤ (1/e ) ≤ 1/n . Thus, all vertices
3 ln n 3
are visited after ln n epochs, with probability ≥ 1 − 2 /n3 ≥ 1 − 1/n. Otherwise, after this walk, we perform
a random walk till we visit all vertices. The length of this (fix-up) random walk is ≤ 2n3 , by Theorem 32.1.1.
Thus, expected length of the walk is ≤ 2e3 mR(G) ln n + 2n3 (1/n2 ). ■
211
32.1.1. Rayleigh’s Short-cut Principle.
Observe that effective resistance is never raised by lowering the resistance on an edge, and it is never lowered
by raising the resistance on an edge. Similarly, resistance is never lowered by removing a vertex.
Interestingly, effective resistance comply with the triangle inequality.
Observation 32.1.4. For a graph with minimum degree d, we have R(G) ≥ 1/d (collapse all vertices except
the minimum-degree vertex into a single vertex).
Lemma 32.1.5. Suppose that G contains p edge-disjoint paths of length at most ℓ from s to t. Then R st ≤ ℓ/p.
Theorem 32.2.2. Let USTCON denote the problem of deciding if a vertex s is connected to a vertex t in an
undirected graph. Then USTCON ∈ RLP.
Proof: Perform a random walk of length 2n3 in the input graph G, starting from s. Stop as soon as the random
walk hit t. If u and v are in the same connected component, then h st ≤ n3 . Thus, by the Markov inequality, the
algorithm works. It is easy to verify that it can be implemented in O(log n) space. ■
Given such a universal traversal sequence, we can construct (a non-uniform) Turing machine that can solve
USTCON for such d-regular graphs, by encoding the sequence in the machine.
Let F denote a family of graphs, and let U(F ) denote the length of the shortest universal traversal sequence
for all the labeled graphs in F . Let R(F ) denote the maximum resistance of graphs in this family.
Theorem 32.2.4. Let F be a family of d-regular graphs with n vertices, then U(F ) ≤ 5mR(F ) lg(n |F |).
Proof: Same old, same old. Break the string into epochs, each of length L = 2mR(G). Now, start random
walks from all the possible vertices, for all graphs in F . Continue the walks till all vertices are being visited.
Initially, there are n2 |F | vertices that need to visited. In expectation, in each epoch half the vertices get visited.
There are n |F | walks, each of them needs to visit n vertices. As such, the number of vertices waiting to be
visited is bounded by |F | n . As such, after 1 + lg2 n |F | epochs, the expected number of vertices still need
2 2
212
Let U(d, n) denote the length of the shortest universal traversal sequence of connected, labeled n-vertex,
d-regular graphs.
Lemma 32.2.5. The number of labeled n-vertex graphs that are d-regular is (nd)O(nd) .
Proof: Such a graph has dn/2 edges overall. Specifically, we encode this by listing for every vertex its d
neighbors – there are N = n−1 d
≤ nd possibilities. As such, there are at most N n ≤ nnd choices for edges in
the graph¬ . Every vertex has d! possible labeling of the edges adjacent to it, thus there are (d!)n ≤ dnd possible
labelings. ■
Lemma 32.2.6. U(d, n) = O n3 d log n .
Proof: The diameter of every connected n-vertex, d-regular graph is O(n/d). Indeed, consider the path realizing
the diameter of the graph, and assume it has t vertices. Number the vertices along the path consecutively, and
consider all the vertices that their number is a multiple of three. There are α ≥ ⌊t/3⌋ such vertices. No pair
of these vertices can share a neighbor, and as such, the graph has at least (d + 1)α vertices. We conclude that
n ≥ (d + 1)α = (d + 1)(t/3 − 1). We conclude that t ≤ d+13
(n + 1) ≤ 3n/d.
And so, this also bounds the resistance of such a graph. The number of edges is m = nd/2. Now, combine
Lemma 32.2.5 and Theorem 32.2.4. ■
This is, as mentioned before, not a uniform algorithm. There is by now a known log-space deterministic
algorithm for this problem, which is uniform.
Proof: (Sketch.) The basic idea is simple – start a random walk from s, if it fails to arrive to t after a certain
number of steps, then restart. The only challenging thing is that the number of times we need to repeat this
is exponentially large. Indeed, the probability of a random walk from s to arrive to t in n steps, is at least
p = 1/nn−1 ≥ nn if s is connected to t.
As such, we need to repeat this walk N = O((1/p) log δ) = O(nn+1 ) times, for δ ≥ 1/2n . If have all of these
walks fail, then with probability ≥ 1 − δ, there is no path from s to t.
We can do the walk using logarithmic space. However, how do we count to N (reliably) using only loga-
rithmic space? We leave this as an exercise to the reader, see Exercise 32.2.8, ■
Exercise 32.2.8. Let N be a large integer number. Show a randomized algorithm, that with, high probability,
counts from 1 to M, where M ≥ N, and always stops. The algorithm should use only O(log log N) bits.
¬
This is a callous upper bound – better analysis is possible. But never analyze things better than you have to - it usually a waste
of time.
213
214
Chapter 33
Lemma 33.1.1. The eigenvalues of a symmetric real matrix N ∈ Rn×n are real numbers.
Pn
Proof: Observe that for any real vector v = (v1 , . . . , vn ) ∈ Rn , we have that 2
i=1 vi = ⟨v, v⟩ ≥ 0. As such, for a
vector v with eigenvalue λ, we have
Lemma 33.1.2. Let N ∈ Rn×n be a matrix. Consider two eigenvectors v1 , v2 that corresponds to two eigenval-
ues λ1 , λ2 , where λ1 , λ2 . Then v1 and v2 are orthogonal.
Proof: Indeed, vT1 Nv2 = λ2 vT1 v2 . Similarly, we have vT1 Nv2 = (NT v1 )T v2 ) = λ1 vT1 v2 . We conclude that either
λ1 = λ2 , or v1 and v2 are orthogonal (i.e., vT1 v2 = 0). ■
215
33.1.2. Eigenvalues and eigenvectors of a graph
Since N = M(G) the adjacency matrix of an undirected graph is symmetric, all its eigenvalues exists and are
real numbers λ1 ≥ λ2 · · · ≥ λn , and their corresponding orthonormal basis vectors are e1 , . . . , en .
We will need the following theorem.
Theorem 33.1.3 (Fundamental theorem of algebraic graph theory). Let G = G(V, E) be an undirected (multi)graph
with maximum degree d and with n vertices. Let λ1 ≥ λ2 ≥ · · · ≥ λn be the eigenvalues of M(G) and the corre-
sponding orthonormal eigenvectors are e1 , . . . , en . The following holds.
(i) If G is connected then λ2 < λ1 .
(ii) For i = 1, . . . , n, we have |λi | ≤ d.
(iii) d is an eigenvalue if and only if G is regular.
(iv) If G is d-regular then the eigenvalue λ1 = d has the eigenvector e1 = √1n (1, 1, 1, . . . , 1).
(v) The graph G is bipartite if and only if for every eigenvalue λ there is an eigenvalue −λ of the same
multiplicity.
(vi) Suppose that G is connected. Then G is bipartite if and only if −λ1 is an eigenvalue.
(vii) If G is d-regular and bipartite, then λn = d and en = √1n (1, 1, . . . , 1, −1, . . . , −1), where there are equal
numbers of 1s and −1s in en .
References
[Bol98] B. Bollobas. Modern Graph Theory. Springer-Verlag, 1998.
[Wes01] D. B. West. Intorudction to Graph Theory. 2ed. Prentice Hall, 2001.
216
Chapter 34
Random Walks V
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“Is there anything in the Geneva Convention about the rules of war in peacetime?” Stanko wanted to know, crawling back
toward the truck. “Absolutely nothing,” Caulec assured him. “The rules of war apply only in wartime. In peacetime, anything
goes.”
Definition 34.1.1. And (n, d, c)-expander is a d-regular bipartite graph G = (X, Y, E), where |X| = |Y| = n/2.
Here, we require that for any S ⊆ X, we have
!!
2|S |
|Γ(S )| ≥ 1 + c 1 − |S | .
n
The Margulis-Gabber-Galil expander. For a positive m, let n = 2m2 . Each vertex in X and Y above, is
interpreted as a pair (a, b), where a, b ∈ Zm = {0, . . . , m − 1}. A vertex (a, b) ∈ X is connected to the vertices
Specrtral gap and expansion. We remind the reader that a d-regular graph, then its adjacency matrix M(G)
has (as its biggest eigenvalue) the eigenvalue λ1 = d. In particular, let |λ1 | ≥ |λ2 | ≥ . . . ≥ |λn | be the eigenvalues
of M(G). We then have the following:
c2
Theorem 34.1.3. If G is an (n, d, c)-expander then M(G) has |λ2 | ≤ d − .
1024 + 2c2
2dε − ε2
Theorem 34.1.4. If M(G) has |λ2 | ≤ d − ε, then G is (n, d, c)-expander with c ≥ .
d2
217
34.2. Rapid mixing for expanders
Here is another equivalent definition of an expander.
Definition 34.2.1. Let G = (V, E) be an undirected d-regular graph. The graph G is a (n, d, c)-expander (or
just c-expander), for every set S ⊆ V of size at most |V| /2, there are at least cd |S | edges connecting S and
S = V \ S ; that is e S , S ≥ cd |S |,
Guaranteeing aperiodicity. Let G be a (n, d, c)-expander. We would like to perform a random walk on G.
The graph G is connected, but it might be periodic (i.e., bipartite). To overcome this, consider the random walk
on G that either stay in the current state with probability 1/2 or traverse one of the edges. Clearly, the resulting
Markov Chain (MC) is aperiodic. The resulting transition matrix is
Q = M/2d + I/2,
where M is the adjacency matrix of G and I is the identity n × n matrix. Clearly Q is doubly stochastic.
Furthermore, if λbi is an eigenvalue of M, with eigenvector vi , then
!
1 M bi
1 λ
Qvi = + I vi = + 1vi .
2 d 2 d
As such, λbi /d + 1/2 is an eigenvalue of Q. Namely, if there is a spectral gap in the graph G, there would also
be a similar spectral gap in the resulting MC. This MC can be generated by adding to each vertex d self loops,
ending up with a 2d-regular graph. Clearly, this graph is still an expander if the original graph is an expander,
and the random walk on it is aperiodic.
From this point on, we would just assume our expander is aperiodic.
i − πi
q(t)
∆(t) = max .
i πi
Namely, if ∆(t) approaches zero then q(t) approaches π.
We remind the reader that we saw a construction of a constant degree expander with constant expansion. In
its transition matrix Q, we have that λb1 = 1, and −1 ≤ λb2 < 1, and furthermore the spectral gap λb1 − λb2 was a
constant (the two properties are equivalent, but we proved only one direction of this).
We need a slightly stronger property (that does hold for our expander construction). We have that λb2 ≥
maxni=2 λbi .
Theorem 34.2.3. Let Q be the transition matrix of an aperiodic (n, d, c)-expander. Then, for any initial distri-
bution q(0) , we have that
t
∆(t) ≤ n3/2 λb2 .
Since λb2 is a constant smaller than 1, the distance ∆(t) drops exponentially with t.
218
Proof: We have that q(t) = q(0) Qt . Let B(Q) = ⟨v1 , . . . , vn ⟩ denote the orthonormal eigenvector basis of Q (see
P
Definition 45.2.3p289 ), and write q(0) = ni=1 αi vi . Since λb1 = 1, we have that
Xn X n X
n
t b t bi t vi .
q =q Q =
(t) (0) t
αi vi Q = αi λi vi = α1 v1 + αi λ
i=1 i=1 i=2
√ √
Since v1 = 1/ n, . . . , 1/ n , and λbi ≤ λb2 < 1, for i > 1, we have that limt→∞ λbi t = 0, and thus
Xn t
π = lim q = α1 v1 +
(t) b
αi lim λi vi = α1 v1 .
t→∞ t→∞
i=2
v
t
X
n X
n
Now, since v1 , . . . , vn is an orthonormal basis, and q (0)
= αi vi , we have that ∥q ∥2 =
(0)
α2i . Thus implies
i=1 i=1
that
v
t
X
n t √ Xn
√ X
n
2
q −π
(t)
= q − α1 v1
(t)
= bi vi
αi λ ≤ n bi )t vi
αi ( λ = n bi )t
αi ( λ
1 1
i=2 1 i=2 2 i=2
v
t
√ X
n
√ t (0) √ t √ t
≤ n( λb2 )t (αi )2 ≤ n λb2 q 2
≤ n λb2 q(0) 1
= n λb2 ,
i=2
We assume that Alg requires a random bit string of length n. So, we have a constant degree expander G
(say of degree d) that has at least 200 · 2n vertices. In particular, let
U = |V(G)| ,
and since our expander construction grow exponentially in size (but the base of the exponent is a constant),
we have that U = O(2n ). (Translation: We can not quite get an expander with a specific number of vertices.
Rather, we can guarantee an expander that has more vertices than we need, but not many more.)
We label the vertices of G with all the binary strings of length n, in a round robin fashion (thus, each binary
string of length n appears either |V(G)| /2n or |V(G)| /2n times). For a vertex v ∈ V(G), let s(v) denote the
binary string associated with v.
Consider a string x that we would like to decide if it is in L or not. We know that at least 99/100U vertices
of G are labeled with “random” strings that would yield the right result if we feed them into Alg (the constant
here deteriorated from 199/200 to 99/100 because the number of times a string appears is not identically the
same for all strings).
219
The algorithm. We perform a random walk of length µ = αβk on G, where α and β are constants to be
determined shortly, and k is a parameter. To this end, we randomly choose a starting vertex X0 (this would
require n + O(1) bits). Every step in the random walk, would require O(1) random bits, as the expander is a
constant degree expander, and as such overall, this would require n + O(k) random bits.
Now, lets X0 , X1 , . . . , Xµ be the resulting random walk. We compute the result of
The real thing. Let Q be the transition matrix of G. We assume, as usual, that the random walk on G is
aperiodic (if not, we can easily fix it using standard tricks), and thus ergodic. Let B = Qβ be the transition
matrix of the random walk of the states we use in the algorithm. Note, that the eigenvalues (except the first
one) of B “shrink”. In particular, by picking β to be a sufficiently large constant, we have that
bi B| ≤ 1 ,
λb1 B = 1 and |λ for i = 2, . . . , U.
10
For the input string x, let W be the matrix that has 1 in the diagonal entry Wii , if and only Alg(x, s(i)) returns
the right answer, for i = 1, . . . , U. (We remind the reader that s(i) is the string associated with the ith vertex,
and U = |V(G)|.) The matrix W is zero everywhere else. Similarly, let W = I − W be the “complement” matrix
having 1 at Wii iff Alg(x, s(i)) is incorrect. We know that W is a U × U matrix, that has at least (99/100)U ones
on its diagonal.
Lemma 34.3.1. Let Q be a symmetric transition matrix, then all its eigenvalues of Q are in the range [−1, 1].
Proof: Let p ∈ Rn be an eigenvector with eigenvalue λ. Let pi be the coordinate with the maximum absolute
value in p. We have that
X
U X
U X
U
λpi = pQ i = p j Q ji ≤ p j Q ji ≤ |pi | Q ji = pi .
j=1 j=1 j=1
220
Lemma 34.3.2. Let Q be a symmetric transition matrix, then for any p ∈ Rn , we have that ∥pQ∥2 ≤ ∥p∥2 .
Proof: Let B(Q) = ⟨v1 , . . . , vn ⟩ denote the orthonormal eigenvector basis of Q, with eigenvalues 1 = λ1 , . . . , λn .
P
Write p = i αi vi , and observe that
X X sX sX
pQ 2 = αi vi Q = αi λi vi = αi λi ≤
2 2
α2i = p 2 ,
i 2 i 2 i i
Lemma 34.3.3. Let B = Qβ be the transition matrix of the graph Gβ . For all vectors p ∈ Rn , we have:
(i) ∥pBW∥ ≤ ∥p∥, and
(ii) pBW ≤ ∥p∥ /5.
Proof: (i) Since multiplying a vector by W has the effect of zeroing out some coordinates, its clear that it can
not enlarge the norm of a matrix. As such, ∥pBW∥2 ≤ ∥pB∥2 ≤ ∥p∥2 by Lemma 34.3.2.
P
(ii) Write p = i αi vi , where v1 , . . . , vn is the orthonormal basis
√ of Q (and thus also of B), with eigenvalues
b b
1 = λ1 , . . . , λn . We remind the reader that v1 = (1, 1, . . . , 1)/ n. Since W zeroes out at least 99/100 of the
entries of a vectors it is multiplied by (and copy the rest as they are), we have that
q
√
∥v1 W∥ ≤ (n/100)(1/ n)2 ≤ 1/10 = ∥v1 ∥ /10.
Consider the strings r0 , . . . , rν . For each one of these strings, we can write down whether its a “good”
string (i.e., Alg return the correct result), or a bad string. This results in a binary pattern b0 , . . . , bk . Given a
distribution p ∈ RU on the states of the graph, its natural to ask what is the probability of being in a “good”
state. Clearly, this is the quantity ∥pW∥1 . Thus, if we are interested in the probability of a specific pattern,
then we should start with the initial distribution p0 , truncate away the coordinates that represent an invalid
state, apply the transition matrix, again truncate away forbidden coordinates, and repeat in this fashion till we
exhaust the pattern. Clearly, the ℓ1 -norm of the resulting vector is the probability of this pattern. To this end,
given a pattern b0 , . . . , bk , let S = ⟨S 0 , . . . , S ν ⟩ denote the corresponding sequence of “truncating” matrices (i.e.,
S i is either W or W). Formally, we set S i = W if Alg(x, ri ) returns the correct answer, and set S i = W otherwise.
The above argument implies the following lemma.
Lemma 34.3.4. For any fixed pattern b0 , . . . , bν the probability of the random walk to generate this pattern of
random strings is ∥p(0) S 0 BS 1 . . . BS ν ∥1 , where S = ⟨S 0 , . . . , S ν ⟩ is the sequence of W and W encoded by this
pattern.
221
Theorem 34.3.5. The probability that the majority of the outputs Alg(x, r0 ), Alg(x, r1 ), . . . , Alg(x, rk ) is incor-
rect is at most 1/2k .
Proof: The majority is wrong, only if (at least) half the elements of the sequence S = ⟨S 0 , . . . , S ν ⟩ belong to
W. Fix such a “bad” sequence S, and observe that the distributions we work with are vectors in RU . As such,
if p0 is the initial distribution, then we have that
√ √ 1
P S = p S 0 BS 1 . . . BS ν ≤ U p(0) S 0 BS 1 . . . BS ν ≤ ,
(0)
1 2
U ν/2
p(0) 2
5
by Lemma 34.3.6 below (i.e., Cauchy-Schwarz inequality) and by repeatedly applying Lemma 34.3.3, since
half of the sequence
√ S are W, and the rest are W. The distribution p(0) was uniform, which implies that
p 2 = 1/ U. As such, let S be the set of all bad patterns (there are 2ν−1 such “bad” patterns). We have
(0)
h i √ 1 1
ν
P majority is bad ≤ 2 U ν/2 p
(0)
= (4/5)ν/2 = (4/5)αk/2 ≤ ,
5 2 2k
for α = 7. ■
Proof: We can safely assume all the coordinates of v are positive. Now,
v
t d v
t d
Xd X d X X √
∥v∥1 = vi = vi · 1 = |v · (1, 1, . . . , 1)| ≤ v2i 12 = d v ,
i=1 i=1 i=1 i=1
References
[GG81] O. Gabber and Z. Galil. Explicit constructions of linear-sized superconcentrators. J. Comput.
Syst. Sci., 22(3): 407–420, 1981.
222
Chapter 35
Complexity classes
598 - Class notes for Randomized Algorithms
“I’m a simple man, a guileless man,” Panin answered.
Sariel Har-Peled
“There is a norm. The norm is five gravities. My simple,
April 2, 2024
uncomplicated organism cannot bear anything exceeding
the norm. My organism tried six once, and got carried
out at six minutes some seconds. With me along.”
Definition 35.2.2. The class NP consists of all languages L that have a polynomial time algorithm Alg, such
that for any input Σ∗ , we have:
(i) If x ∈ L ⇒ then ∃y ∈ Σ∗ , Alg(x, y) accepts, where |y| (i.e. the length of y) is bounded by a polynomial in
|x|.
¬
There is also the internet.
223
(ii) If x < L ⇒ then ∀y ∈ Σ∗ Alg(x, y) rejects.
Definition 35.2.3. For a complexity class C, we define the complementary class co-C as the set of languages
whose complement is in the class C. That is
co−C = L L ∈ C ,
4 where L = Σ∗ \ L.
It is obvious that P = co−P and P ⊆ NP ∩ co−NP. (It is currently unknown if P = NP ∩ co−NP or whether
NP = co−NP, although both statements are believed to be false.)
Definition 35.2.4. The class RP (for Randomized Polynomial time) consists of all languages L that have a
randomized algorithm Alg with worst case polynomial running time such that for any input x ∈ Σ∗ , we have
(i) If x ∈ L then P Alg(x) accepts ≥ 1/2.
(ii) x < L then P Alg(x) accepts = 0.
An RP algorithm is a Monte Carlo algorithm, but this algorithm can make a mistake only if x ∈ L. As such,
co−RP is all the languages that have a Monte Carlo algorithm that make a mistake only if x < L. A problem
which is in RP ∩ co−RP has an algorithm that does not make a mistake, namely a Las Vegas algorithm.
Definition 35.2.5. The class ZPP (for Zero-error Probabilistic Polynomial time) is the class of languages that
have a Las Vegas algorithm that runs in expected polynomial time.
Definition 35.2.6. The class PP (for Probabilistic Polynomial time) is the class of languages that have a ran-
domized algorithm Alg, with worst case polynomial running time, such that for any input x ∈ Σ∗ , we have
(i) If x ∈ L then P Alg(x) accepts > 1/2.
(ii) If x < L then P Alg(x) accepts < 1/2.
The class PP is not very useful. Why?
Exercise 35.2.7. Provide a PP algorithm for 3SAT.
Consider the mind-boggling stupid randomized algorithm that returns either yes or no with probability half.
This algorithm is almost in PP, as it return the correct answer with probability half. An algorithm is in PP needs
to be slightly better, and be correct with probability better than half. However, how much better can be made to
be arbitrarily close to 1/2. In particular, there is no way to do effective amplification with such an algorithm.
Definition 35.2.8. The class BPP (for Bounded-error Probabilistic Polynomial time) is the class of languages
that have a randomized algorithm Alg with worst case polynomial running time such that for any input x ∈ Σ∗ ,
we have
(i) If x ∈ L then P Alg(x) accepts ≥ 3/4.
(ii) If x < L then P Alg(x) accepts ≤ 1/4.
References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
224
Chapter 36
Backwards analysis
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
The idea of backwards analysis (or backward analysis) is a technique to analyze randomized algorithms by
imagining as if it was running backwards in time, from output to input. Most of the more interesting applica-
tions of backward analysis are in Computational Geometry, but nevertheless, there are some other applications
that are interesting and we survey some of them here.
by induction. ■
Theorem 36.1.2. Let Π = π1 . . . πn be a random permutation of 1, . . . , n, and let Z be the number of times, that
πi is the smallest number among π1 , . . . , πi , for i = 1, . . . , n. Then, we have that for t ≥ 2e that P Z > t ln n ≤
1/nt ln 2 , and for t ∈ 1, 2e , we have that P Z > t ln n ≤ 1/n(t−1) /4 .
2
¬
The answer, my friend, is blowing in the permutation.
225
P
Proof: Follows readily from Chernoff’s inequality, as Z = i Xi is a sum of independent indicator variables,
and, since by linearity of expectations, we have
h i X h i X n Z n+1
1 1
µ=EZ = E Xi = ≥ dx = ln(n + 1) ≥ ln n.
i i=1
i x=1 x
36.2.2. Analysis
h i
Lemma 36.2.1. The above algorithm computes a permutation π, such that E |L| = O(n log n), and the ex-
pected running time of the algorithm is O (n log n + m) log n , where n = |V(G)| and m = |E(G)|. Note, that
both bounds also hold with high probability.
Proof: Fix a vertex v ∈ V = {v1 , . . . , vn }. Consider the set of n numbers {dG (v, v1 ), . . . , dG (v, vn )}. Clearly,
dG (v, π1 ), . . . , dG (v, πn ) is a random permutation of this set, and by Lemma 36.1.1 the random permutation π
changes this minimum O(log n) time in expectations (and also with high probability). This readily implies that
|Lv | = O(log n) both in expectations and high probability.
The more interesting claim is the running time. Consider an edge uv ∈ E(G), and observe that δ(u) or δ(v)
changes O(log n) times. As such, an edge gets visited O(log n) times, which implies overall running time of
O(n log2 n + m log n), as desired.
Indeed, overall there are O(n log n) changes in the value of δ(·). Each such change might require one delete-
min operation from the queue, which takes O(log n) time operation. Every edge, by the above, might trigger
O(log n) decrease-key operations. Using Fibonacci heaps, each such operation takes O(1) time. ■
226
36.3. Computing nets
36.3.1. Basic definitions
Definition 36.3.1. A metric space is a pair (X, d) where X is a set and d : X × X → [0, ∞) is a metric satisfying
the following axioms: (i) d(x, y) = 0 if and only if x = y, (ii) d(x, y) = d(y, x), and (iii) d(x, y) + d(y, z) ≥ d(x, z)
(triangle inequality).
For example, R2 with the regular Euclidean distance is a metric space. In the following, we assume that we
are given black-box access to dM . Namely, given two points p, u ∈ X, we assume that d(p, u) can be computed
in constant time.
Another standard example for a finite metric space is a graph G with non-negative weights ω(·) defined
on its edges. Let dG (x, y) denote the shortest path (under the given weights) between any x, y ∈ V(G). It is
easy to verify that dG (·, ·) is a metric. In fact, any finite metric (i.e., a metric defined over a finite set) can be
represented by such a weighted graph.
36.3.1.1. Nets
Definition 36.3.2. For a point set P in a metric space with a metric d, and a parameter r > 0, an r-net of P is a
subset C ⊆ P, such that
(i) for every p, u ∈ C, p , u, we have that d(p, u) ≥ r, and
(ii) for all p ∈ P, we have that minu∈C d(p, u) < r.
36.3.2.2. Analysis
While the analysis here does not directly uses backward analysis, it is inspired to a large extent by such an
analysis as in Section 36.2p226 .
227
Lemma 36.3.3. The set N is an r-net in G.
Proof: By the end of the algorithm, each v ∈ V has δ(v) < r, for δ(v) is monotonically decreasing, and if it
were larger than r when v was visited then v would have been added to the net.
An induction shows that if ℓ = δ(v), for some vertex v, then the distance of v to the set N is at most ℓ.
Indeed, for the sake of contradiction, let j be the (end of) the first iteration where this claim is false. It must be
that π j ∈ N, and it is the nearest vertex in N to v. But then, consider the shortest path between π j and v. The
modified Dijkstra must have visited all the vertices on this path, thus computing δ(v) correctly at this iteration,
which is a contradiction.
Finally, observe that every two points in N have distance ≥ r. Indeed, when the algorithm handles vertex
v ∈ N, its distance from all the vertices currently in N is ≥ r, implying the claim. ■
Lemma 36.3.4. Consider an execution of the algorithm, and any vertex v ∈ V. The expected number of times
the algorithm updates the value of δ(v) during its execution is O(log n), and more strongly the number of
updates is O(log n) with high probability.
Proof: For simplicity of exposition, assume all distances in G are distinct. Let S i be the set of all the vertices
x ∈ V, such that the following two properties both hold:
Interestingly, in the above proof, all we used was the monotonicity of the sets S 1 , . . . , S n , and that if δ(v)
changes in an iteration then the size of the set S i shrinks by a constant factor with good probability in this
iteration. This implies that there is some flexibility in deciding whether or not to initiate Dijkstra’s algorithm
from each vertex of the permutation, without damaging the number of times of the values of δ(v) are updated.
Theorem 36.3.5. Given a graph G = (V, E), with n vertices and m edges, the above algorithm computes an
r-net of G in O((n log n + m) log n) expected time.
Proof: By Lemma 36.3.4, the two δ values associated with the endpoints of an edge get updated O(log n)
times, in expectation, during the algorithm’s execution. As such, a single edge creates O(log n) decrease-key
operations in the heap maintained by the algorithm. Each such operation takes constant time if we use Fibonacci
heaps to implement the algorithm. ■
228
36.4. Bibliographical notes
Backwards analysis was invented/discovered by Raimund Seidel, and the QuickSort example is taken from
Seidel [Sei93]. The number of changes of the minimum result of Section 36.1 is by now folklore.
The good ordering of Section 36.2 is probably also folklore, although a similar idea was used by Mendel
and Schwob [MS09] for a different problem.
Computing a net in a sparse graph, Section 36.3.2, is from [EHS14]. While backwards analysis fails to hold
in this case, it provide a good intuition for the analysis, which is slightly more complicated and indirect.
References
[EHS14] D. Eppstein, S. Har-Peled, and A. Sidiropoulos. On the Greedy Permutation and Counting Dis-
tances. manuscript. 2014.
[MS09] M. Mendel and C. Schwob. Fast c-k-r partitions of sparse graphs. Chicago J. Theor. Comput.
Sci., 2009, 2009.
[Sei93] R. Seidel. Backwards analysis of randomized geometric algorithms. New Trends in Discrete and
Computational Geometry. Ed. by J. Pach. Vol. 10. Algorithms and Combinatorics. Springer-
Verlag, 1993, pp. 37–68.
229
230
Chapter 37
If there is an expert that is never wrong. This situation is easy – initially start with all n experts as being
viable – to this end, we assign W(i) ← 1, for all i. If an expert prediction turns out to be wrong, we set its
weight to zero (i.e., it is no longer active). Clearly, if you follow the majority vote of the still viable experts,
then at most log2 n mistakes would be made, before one isolates the infallible experts.
Intuition. The algorithm keeps track of the quality of the experts. The useless experts would have weights
very close to zero.
231
P∞ i+1 xi
Proof: For x ∈ (−1, 1), the Taylor expansion of ln(1 + x) is i=1 (−1) i
. As such, for x ∈ [0, 1/2] we have
X∞
xi x2 x3
ln(1 − x) = − = −x − − · · · ≥ −x − x2 ,
i=1
i 2 3
since x2+i /(2 + i) ≤ x2 /2i ⇐⇒ xi /(2 + i) ≤ 1/2i , which is obviously true as x ≤ 1/2. ■
Lemma 37.2.2. Let assume we have N experts. Let βt be the number of the mistakes the algorithm performs,
and let βt (i) be the number of mistakes made by the ith expert, for i ∈ JnK (both till time t). Then, if we run this
algorithm for T rounds, we have
2 log N
∀i ∈ JnK βT ≤ 2(1 + ε)βT (i) + .
ε
Proof: Let Φt be the total weight of the experts at the beginning of round t. Observe that Φ1 = N, and if a
mistake was made in the t round, then
Φt+1 ≤ (1 − ε/2)Φt ≤ exp(−εβt+1 /2)N.
On the other hand, an expert i made βi (t) mistakes in the first t rounds, and as such its weight, at this point in
time, is (1 − ε)βt (i) . We thus have, at time T , and for any i, that
!
βT (i) εβT
exp − ε + ε βT (i) ≤ (1 − ε)
2
≤ ΦT ≤ exp − N.
2
εβ
Taking ln of both sides, we have − ε + ε2 βT (i) ≤ − 2T + ln N. ⇐⇒ βT ≤ 2(1 + ε)βT (i) + 2 lnεN . ■
232
37.4. Bibliographical notes
233
234
Chapter 38
In this chapter we will try to quantify the notion of geometric complexity. It is intuitively clear that a a (i.e.,
disk) is a simpler shape than an c (i.e., ellipse), which is in turn simpler than a - (i.e., smiley). This becomes
even more important when we consider several such shapes and how they interact with each other. As these
examples might demonstrate, this notion of complexity is somewhat elusive.
To this end, we show that one can capture the structure of a distribution/point set by a small subset. The
size here would depend on the complexity of the shapes/ranges we care about, but surprisingly it would be
independent of the size of the point set.
38.1. VC dimension
Definition 38.1.1. A range space S is a pair (X, R), where X is a ground set (finite or infinite) and R is a (finite
or infinite) family of subsets of X. The elements of X are points and the elements of R are ranges.
Our interest is in the size/weight of the ranges in the range space. For technical reasons, it will be easier to
consider a finite subset x as the underlining ground set.
Definition 38.1.2. Let S = (X, R) be a range space, and let x be a finite (fixed) subset of X. For a range r ∈ R,
its measure is the quantity
|r ∩ x|
m(r) = .
|x|
While x is finite, it might be very large. As such, we are interested in getting a good estimate to m(r) by
using a more compact set to represent the range space.
235
Definition 38.1.3. Let S = (X, R) be a range space. For a subset N (which might be a multi-set) of x, its
estimate of the measure of m(r), for r ∈ R, is the quantity
|r ∩ N|
s(r) = .
|N|
The main purpose of this chapter is to come up with methods to generate a sample N, such that m(r) ≈ s(r),
for all the ranges r ∈ R.
It is easy to see that in the worst case, no sample can capture the measure of all ranges. Indeed, given a
sample N, consider the range x \ N that is being completely missed by N. As such, we need to concentrate
on range spaces that are “low dimensional”, where not all subsets are allowable ranges. The notion of VC
dimension (named after Vapnik and Chervonenkis [VC71]) is one way to limit the complexity of a range space.
Definition 38.1.4. Let S = (X, R) be a range space. For Y ⊆ X, let
R|Y = r ∩ Y r ∈ R (38.1)
denote the projection of R on Y. The range space S projected to Y is S|Y = Y, R|Y .
If R|Y contains all subsets of Y (i.e., if Y is finite, we have R|Y = 2|Y| ), then Y is shattered by R (or
equivalently Y is shattered by S).
The Vapnik-Chervonenkis dimension (or VC dimension) of S, denoted by dimVC (S), is the maximum
cardinality of a shattered subset of X. If there are arbitrarily large shattered subsets, then dimVC (S ) = ∞.
38.1.1. Examples
Intervals. Consider the set X to be the real line, and consider R to be the set of all intervals on the 1 2
real line. Consider the set Y = {1, 2}. Clearly, one can find four intervals that contain all possible
subsets of Y. Formally, the projection R|Y = {{ } , {1} , {2} , {1, 2}}. The intervals realizing each of
these subsets are depicted on the right.
p q s
However, this is false for a set of three points B = {p, u, v}, since there is no interval that can
contain the two extreme points p and v without also containing u. Namely, the subset {p, v} is not realizable
for intervals, implying that the largest shattered set by the range space (real line, intervals) is of size two. We
conclude that the VC dimension of this space is two.
Disks. Let X = R2 , and let R be the set of disks in the plane. Clearly, for any three
points in the plane (in general position), denoted by p, u, and v, one can find eight
disks that realize all possible 23 different subsets. See the figure on the right. p
But can disks shatter a set with four points? Consider such a set P of four points. If t
the convex hull of P has only three points on its boundary, then the subset X having
q
only those three vertices (i.e., it does not include the middle point) is impossible, by
{p.q}
convexity. Namely, there is no disk that contains only the points of X without the
middle point.
d
Alternatively, if all four points are vertices of the convex hull and they are a, b, c, d
along the boundary of the convex hull, either the set {a, c} or the set {b, d} is not
realizable. Indeed, if both options are realizable, then consider the two disks D1 a
c
and D2 that realize those assignments. Clearly, ∂D1 and ∂D2 must intersect in four
points, but this is not possible, since two circles have at most two intersection points.
See the figure on the left. Hence the VC dimension of this range space is 3. b
236
Convex sets. Consider the range space S = (R2 , R), where R is the set of all (closed)
convex sets in the plane. We claim that dimVC (S) = ∞. Indeed, consider a set U of n
points p1 , . . . , pn all lying on the boundary of the unit circle in the plane. Let V be any CH(V)
subset of U, and consider the convex hull CH(V). Clearly, CH(V) ∈ R, and furthermore,
CH(V) ∩ U = V. Namely, any subset of U is realizable by S. Thus, S can shatter sets of
arbitrary size, and its VC dimension is unbounded.
Complement. Consider the range space S = (X, R) with δ = dimVC (S). Next, consider the complement space,
S = X, R , where
R= X\r r∈R .
Namely, the ranges of S are the complement of the ranges in S. What is the VC dimension of S? Well, a set
B ⊆ X is shattered by S if and only if it is shattered by S. Indeed, if S shatters B, then for any Z ⊆ B, we have
that (B \ Z) ∈ R|B , which implies that Z = B \ (B \ Z) ∈ R|B . Namely, R|B contains all the subsets of B, and S
shatters B. Thus, dimVC S = dimVC (S).
Lemma 38.1.5. For a range space S = (X, R) we have that dimVC (S) = dimVC S , where S is the complement
range space.
38.1.1.1. Halfspaces
Let S = (X, R), where X = Rd and R is the set of all (closed) halfspaces in Rd . We need the following technical
claim.
Claim 38.1.6. Let P = {p1 , . . . , pd+2 } be a set of d + 2 points in Rd . There are real numbers β1 , . . . , βd+2 , not
P P
all of them zero, such that i βi pi = 0 and i βi = 0.
Proof: Indeed, set ui = (pi , 1), for i = 1, . . . , d + 2. Now, the points u1 , . . . , ud+2 ∈ Rd+1 are linearly dependent,
P
and there are coefficients β1 , . . . , βd+2 , not all of them zero, such that d+2 i=1 βi ui = 0. Considering only the first
Pd+2
d coordinates of these points implies that i=1 βi pi = 0. Similarly, by considering only the (d + 1)st coordinate
P
of these points, we have that d+2 i=1 βi = 0. ■
To see what the VC dimension of halfspaces in Rd is, we need the following result of Radon. (For a
reminder of the formal definition of convex hulls, see Definition 38.5.1.)
Theorem 38.1.7 (Radon’s theorem). Let P = {p1 , . . . , pd+2 } be a set of d + 2 points in Rd . Then, there exist
two disjoint subsets C and D of P, such that CH(C) ∩ CH(D) , ∅ and C ∪ D = P.
P
Proof: By Claim 38.1.6 there are real numbers β1 , . . . , βd+2 , not all of them zero, such that i βi pi = 0 and
P
i βi = 0.
Assume, for the sake of simplicity of exposition, that β1 , . . . , βk ≥ 0 and βk+1 , . . ., βd+2 < 0. Furthermore,
P P
let µ = ki=1 βi = − d+2
i=k+1 βi . We have that
X
k X
d+2
βi pi = − β i pi .
i=1 i=k+1
P
In particular, v = ki=1 (βi /µ)pi is a point in CH({p1 , . . . , pk }). Furthermore, for the same point v we have
P
v = d+2
i=k+1 −(βi /µ)pi ∈ CH({pk+1 , . . . , pd+2 }). We conclude that v is in the intersection of the two convex hulls,
as required. ■
237
The following is a trivial observation, and yet we provide a proof to demonstrate it is true.
Lemma 38.1.8. Let P ⊆ Rd be a finite set, let v be any point in CH(P), and let h+ be a halfspace of Rd
containing v. Then there exists a point of P contained inside h+ .
n o
Proof: The halfspace h+ can be written as h+ = x ∈ Rd ⟨x, v⟩ ≤ c . Now v ∈ CH(P) ∩ h+ , and as such there
P P
are numbers α1 , . . . , αm ≥ 0 and points p1 , . . . , pm ∈ P, such that i αi = 1 and i αi pi = v. By the linearity of
the dot product, we have that
DX
m E X
m
v ∈ h+ =⇒ ⟨v, v⟩ ≤ c =⇒ αi pi , v ≤ c =⇒ β = αi ⟨pi , v⟩ ≤ c.
i=1 i=1
Setting βi = ⟨pi , v⟩, for i = 1, . . . , m, the above implies that β is a weighted average of β1 , . . . , βm . In particular,
there must be a βi that is no larger than the average. That is βi ≤ c. This implies that ⟨pi , v⟩ ≤ c. Namely,
pi ∈ h+ as claimed. ■
Let S be the range space having Rd as the ground set and all the close halfspaces as ranges. Radon’s
theorem implies that if a set Q of d + 2 points is being shattered by S, then we can partition this set Q into
two disjoint sets Y and Z such that CH(Y) ∩ CH(Z) , ∅. In particular, let v be a point in CH(Y) ∩ CH(Z).
If a halfspace h+ contains all the points of Y, then CH(Y) ⊆ h+ , since a halfspace is a convex set. Thus, any
halfspace h+ containing all the points of Y will contain the point v ∈ CH(Y). But v ∈ CH(Z) ∩ h+ , and this
implies that a point of Z must lie in h+ , by Lemma 38.1.8. Namely, the subset Y ⊆ Q cannot be realized by a
halfspace, which implies that Q cannot be shattered. Thus dimVC (S ) < d + 2. It is also easy to verify that the
regular simplex with d + 1 vertices is shattered by S. Thus, dimVC (S ) = d + 1.
Lemma 38.2.1 (Sauer’s lemma). If (X, R) is a range space of VC dimension δ with |X| = n, then |R| ≤ Gδ (n).
Observe that |R| = |R x | + |R \ x|. Indeed, we charge the elements of R to their corresponding element in R \ x.
The only bad case is when there is a range r such that both r ∪ {x} ∈ R and r \ {x} ∈ R, because then these two
¬
Here is a cute (and standard) counting argument: Gδ (n) is just the number of different subsets of size at most δ out of n elements.
Now, we either decide to not include the first element in these subsets (i.e., Gδ (n − 1)) or, alternatively, we include the first element in
these subsets, but then there are only δ − 1 elements left to pick (i.e., Gδ−1 (n − 1)).
238
distinct ranges get mapped to the same range in R \ x. But such ranges contribute exactly one element to R x .
Similarly, every element of R x corresponds to two such “twin” ranges in R.
Observe that (X \ {x} , R x ) has VC dimension δ − 1, as the largest set that can be shattered is of size δ − 1.
Indeed, any set B ⊂ X \ {x} shattered by R x implies that B ∪ {x} is shattered in R.
Thus, we have
|R| = |R x | + |R \ x| ≤ Gδ−1 (n − 1) + Gδ (n − 1) = Gδ (n),
by induction. ■
Definition 38.2.3 (Shatter function). Given a range space S = (X, R), its shatter function πS (m) is the maxi-
mum number of sets that might be created by S when restricted to subsets of size m. Formally,
Our arch-nemesis in the following is the function x/ ln x. The following lemma states some properties of
this function, and its proof is left as an exercise.
Lemma 38.2.4. For the function f (x) = x/ ln x the following hold.
(A) f (x) is monotonically increasing for x ≥ e.
(B) f (x) ≥ e,√for x > 1.
(C) For u ≥ √e, if f (x) ≤ u, then x ≤ 2u ln u.
(D) For u ≥ e, if x > 2u ln u, then f (x) > u.
(E) For u ≥ e, if f (x) ≥ u, then x ≥ u ln u.
Proof: As a warm-up exercise, we prove a somewhat weaker bound here of O((δ+δ′ ) log(δ + δ′ )). The stronger
bound follows from Theorem 38.2.6 below. Let B be a set of n points in X that are shattered by b S. There are
′
at most Gδ (n) and Gδ′ (n) different ranges of B in the range sets R|B and R|B , respectively, by Lemma 38.2.1.
Every subset C of B realized by b r∈b R is a union of two subsets B ∩ r and B ∩ r′ , where r ∈ R and r′ ∈ R′ ,
respectively. Thus, the number of different subsets of B realized by b S is bounded by Gδ (n)Gδ′ (n). Thus,
δ δ′
2 ≤ n n , for δ, δ > 1. We conclude that n ≤ (δ + δ ) lg n, which implies that n = O (δ + δ′ ) log(δ + δ′ ) , by
n ′ ′
Lemma 38.2.4(C). ■
Interestingly, one can prove a considerably more general result with tighter bounds. The required compu-
tations are somewhat more painful.
239
Theorem 38.2.6. Let S1 = X, R1 , . . . , Sk = X, Rk be range spaces with VC dimension δ1 , . . . , δk , respec-
tively. Next, let f (r1 , . . . , rk ) be a function that maps any k-tuple of sets r1 ∈ R1 , . . . , rk ∈ Rk into a subset of
X. Here, the function f is restricted to be defined by a sequence of set operations like complement, intersection
and union. Consider the range set
R′ = f (r1 , . . . , rk ) r1 ∈ R1 , . . . , rk ∈ Rk
and the associated range space T = (X, R′ ). Then, the VC dimension of T is bounded by O kδ lg k , where
δ = maxi δi .
by Lemma 38.2.1 and Lemma 38.2.2. On the other hand, since Y is being shattered by R′ , this implies that
k
R|Y′ = 2t . Thus, we have the inequality 2t ≤ 2(te/δ)δ , which implies t ≤ k 1 + δ lg(te/δ) . Assume that
t ≥ e and δ lg(te/δ) ≥ 1 since otherwise the claim is trivial, and observe that t ≤ k 1 + δ lg(te/δ) ≤ 3kδ lg(t/δ).
Setting x = t/δ, we have
t ln(t/δ) t x
≤ 3k ≤ 6k ln =⇒ ≤ 6k =⇒ x ≤ 2 · 6k ln(6k) =⇒ x ≤ 12k ln(6k),
δ ln 2 δ ln x
by Lemma 38.2.4(C). We conclude that t ≤ 12δk ln(6k), as claimed. ■
Corollary 38.2.7. Let S = (X, R) and T = (X, R′ ) be two range spaces of VC dimension δ and δ′ , respec-
tively, where δ, δ′ > 1. Let b S = (X, b
R = r ∩ r′ r ∈ R, r′ ∈ R′ . Then, for the range space b R), we have that
b ′
dimVC (S) = O(δ + δ ).
Corollary 38.2.8. Any finite sequence of combining range spaces with finite VC dimension (by intersecting,
complementing, or taking their union) results in a range space with a finite VC dimension.
m(r) − s(r) ≤ ε,
where m(r) = |x ∩ r| / |x| is the measure of r (see Definition 38.1.2) and s(r) = |C ∩ r| / |C| is the estimate of r
(see Definition 38.1.3). (Here C might be a multi-set, and as such |C ∩ r| is counted with multiplicity.)
As such, an ε-sample is a subset of the ground set x that “captures” the range space up to an error of ε.
Specifically, to estimate the fraction of the ground set covered by a range r, it is sufficient to count the points
of C that fall inside r.
If X is a finite set, we will abuse notation slightly and refer to C as an ε-sample for S.
240
To see the usage of such a sample, consider x = X to be, say, the population of a country (i.e., an element
of X is a citizen). A range in R is the set of all people in the country that answer yes to a question (i.e., would
you vote for party Y?, would you buy a bridge from me?, questions like that). An ε-sample of this range space
enables us to estimate reliably (up to an error of ε) the answers for all these questions, by just asking the people
in the sample.
The natural question of course is how to find such a subset of small (or minimal) size.
Theorem 38.3.2 (ε-sample theorem, [VC71]). There is a positive constant c such that if (X, R) is any range
space with VC dimension at most δ, x ⊆ X is a finite subset and ε, φ > 0, then a random subset C ⊆ x of
cardinality
!
c δ 1
s = 2 δ log + log
ε ε φ
is an ε-sample for x with probability at least 1 − φ.
(In the above theorem, if s > |x|, then we can just take all of x to be the ε-sample.)
Sometimes it is sufficient to have (hopefully smaller) samples with a weaker property – if a range is “heavy”,
then there is an element in our sample that is in this range.
Definition 38.3.3 (ε-net). A set N ⊆ x is an ε-net for x if for any range r ∈ R, if m(r) ≥ ε (i.e., |r ∩ x| ≥ ε |x|),
then r contains at least one point of N (i.e., r ∩ N , ∅).
Theorem 38.3.4 (ε-net theorem, [HW87]). Let (X, R) be a range space of VC dimension δ, let x be a finite
subset of X, and suppose that 0 < ε ≤ 1 and φ < 1. Let N be a set obtained by m random independent draws
from x, where !
4 4 8δ 16
m ≥ max lg , lg . (38.3)
ε φ ε ε
Then N is an ε-net for x with probability at least 1 − φ.
(We remind the reader that lg = log2 .)
The proofs of the above theorems are somewhat involved and we first turn our attention to some applications
before presenting the proofs.
Remark 38.3.5. The above two theorems also hold for spaces with shattering dimension at most δ, in which !
1 1 δ δ
case the sample size is slightly larger. Specifically, for Theorem 38.3.4, the sample size needed is O lg + lg .
ε φ ε ε
Remark 38.3.6. The ε-net theorem is a relatively easy consequence (up to constants) of the ε-sample theorem
– see bibliographical notes for details.
241
38.3.2.2. Learning a concept
Assume that we have a function f defined in the plane that returns ‘1’ in- Dunknown
side an (unknown) disk Dunknown and ‘0’ outside it. There is some distribution D
defined over the plane, and we pick points from this distribution. Furthermore,
we can compute the function for these labels (i.e., we can compute f for certain
values, but it is expensive). For a mystery value ε > 0, to be explained shortly,
Theorem 38.3.4 tells us to pick (roughly) O((1/ε) log(1/ε)) random points in a
sample R from this distribution and to compute the labels for the samples. This
is demonstrated in the figure on the right, where black dots are the sample points
for which f (·) returned 1.
So, now we have positive examples and negative examples. We would like to find
a hypothesis that agrees with all the samples we have and that hopefully is close to
the true unknown disk underlying the function f . To this end, compute the smallest
D
disk D that contains the sample labeled by ‘1’ and does not contain any of the ‘0’
points, and let g : R2 → {0, 1} be the function g that returns ‘1’ inside the disk
and ‘0’ otherwise. We claim that g classifies correctly all but an ε-fraction of the
points (i.e., the probability of misclassifying a point picked according to the given
distribution is smaller than ε); that is, Prp∈D f (p) , g(p) ≤ ε.
Geometrically, the region where g and f disagree is all the points in the symmet- Dunknown
ric difference between the two disks. That is, E = D ⊕ Dunknown ; see the figure on the D ⊕ Dunknown
right.
Thus, consider the range space S having the plane as the ground set and the
symmetric difference between any two disks as its ranges. By Corollary 38.2.8, this
range space has finite VC dimension. Now, consider the (unknown) disk D′ that
induces f and the region r = Dunknown ⊕ D. Clearly, the learned classifier g returns
incorrect answers only for points picked inside r. D
Thus, the probability of a mistake in the classification is the measure of r under the distribution D. So,
if PD [r] > ε (i.e., the probability that a sample point falls inside r), then by the ε-net theorem (i.e., Theo-
rem 38.3.4) the set R is an ε-net for S (ignore for the time being the possibility that the random sample fails to
be an ε-net) and as such, R contains a point u inside r. But, it is not possible for g (which classifies correctly
all the sampled points of R) to make a mistake on u, a contradiction, because by construction, the range r is
where g misclassifies points. We conclude that PD r ≤ ε, as desired.
Little lies. The careful reader might be tearing his or her hair out because of the above description. First,
Theorem 38.3.4 might fail, and the above conclusion might not hold. This is of course true, and in real appli-
cations one might use a much larger sample to guarantee that the probability of failure is so small that it can be
practically ignored. A more serious issue is that Theorem 38.3.4 is defined only for finite sets. Nowhere does it
speak about a continuous distribution. Intuitively, one can approximate a continuous distribution to an arbitrary
precision using a huge sample and apply the theorem to this sample as our ground set. A formal proof is more
tedious and requires extending the proof of Theorem 38.3.4 to continuous distributions. This is straightforward
and we will ignore this topic altogether.
242
Lemma 38.4.1. For any positive integer n, the following hold.
(i) (1 + 1/n)n ≤ e. (ii) (1 − 1/n)n−1 ≥ e−1 . ! k
n k n ne
(iii) n! ≥ (n/e) .n
(iv) For any k ≤ n, we have ≤ ≤ .
k k k
Proof: (i) Indeed, 1 + 1/n ≤ exp(1/n), since 1 + x ≤ e x , for x ≥ 0. As such (1 + 1/n)n ≤ exp(n(1/n)) = e.
n−1
(ii) Rewriting the inequality, we have that we need to prove n−1 ≥ 1e . This is equivalent to proving
n−1 n
1 n−1
e ≥ n−1n
= 1 + n−1 , which is our friend from (i).
(iii) Indeed,
nn X ni
∞
≤ = en ,
n! i=0
i!
P xi
by the Taylor expansion of e x = ∞ i=0 i! . This implies that (n/e) ≤ n!, as required.
n
n δ ne δ Xδ !
n
Lemma 38.2.2 restated. For n ≥ 2δ and δ ≥ 1, we have ≤ Gδ (n) ≤ 2 , where Gδ (n) = .
δ δ i=0
i
! δ
X X ne i
δ
n
Proof: Note that by Lemma 38.4.1 (iv), we have Gδ (n) = ≤ 1+ . This series behaves like a
i=0
i i=1
i
geometric series with constant larger than 2, since
ne i ne i−1 ne i − 1 !i−1 ne 1
!i−1
ne 1 n n
/ = = 1− ≥ = ≥ ≥ 2,
i i−1 i i i i i e i δ
by Lemma 38.4.1. As such, this series is bounded by twice the largest element in the series, implying the
claim. ■
243
References
[HW87] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete Comput. Geom., 2: 127–
151, 1987.
[VC71] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of
events to their probabilities. Theory Probab. Appl., 16: 264–280, 1971.
244
Chapter 39
Double sampling
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“What does not work when you apply force, would work when you apply even more force.”
, Anonymous
245
39.1.1. Disagreement between samples on a specific set
We provide three proofs of the following lemma – the constants are somewhat different for each version.
Lemma 39.1.3. Let R1 and R2 be two ρ-samples from a ground set S, and consider a fixed set T ⊆ S. We have
that h i
P |R1 ∩ T | − |R2 ∩ T | > ερ ≤ 3 exp −ε ρ/2 .
2
246
39.1.3. Moments of the sample size
Lemma 39.1.5. Let R an m-sample. And let f (t) ≤ αtβ , where α ≥ 1 and β ≥ 1 are constants, such that
m ≥ 16β. Then U(m) = E f |R| ≤ 2α(2m)β .
Proof: The proof follows from Chernoff’s inequality and some tedious but straightforward calculations. The
reader is as such encouraged to skip reading it.
Let X = |R|. This is a sum of 0/1 random variables with expectation m. As such, we have
X X
∞ ∞
β
ν = E f |R| ≤ P[X = i] f (i) ≤ α P[X = i]i .
i=0 i=0
We bound the last summation using Chernoff’s inequality (see Theorem 13.2.1), we have
X
5
X
∞
β β
τ= P X ≥ jm ( j + 1) + P X ≥ jm ( j + 1)
j=2 j=6
X
5 ! X ∞
m( j − 1) 2
≤ exp − ( j + 1)β + 2− jm ( j + 1)β
j=2
4 j=6
m X∞
≤ exp − 3β + exp(−m)4β + exp(−2m)5β + exp(−4m)6β + 2− jm ( j + 1)β < 1,
4 j=6
247
39.2. Proof of the ε-net theorem
Here we are working in the unweighted settings (i.e., the weight of a single element is one).
Theorem 39.2.1 (ε-net theorem, [HW87]). Let (X, R) be a range space of VC dimension δ, let x be a finite
subset of X, and suppose that 0 < ε ≤ 1 and φ < 1. Let N be an m-sample from x (see Definition 39.1.1), where
!
8 4 16δ 16
m ≥ max lg , lg . (39.2)
ε φ ε ε
Then N is an ε-net for x with probability at least 1 − φ.
h i h i
if εm ≥ 8. Thus, for r ∈ E1 , we have P[E2 ]/ P[E1 ] ≥ P |r ∩ T | ≥ εm2
= 1 − P |r ∩ T | < εm
2
≥ 21 . ■
Claim 39.2.2 implies that to bound the probability of E1 , it is enough to bound the probability of E2 . Let
εm
′
E2 = ∃r ∈ R r ∩ N = ∅ and |r ∩ T | ≥ .
2
Clearly, E2 ⊆ E′2 . Thus, bounding the probability of E′2 is enough to prove Theorem 39.2.1. Note, however, that
a shocking thing happened! We no longer have x participating in our event. Namely, we turned bounding an
event that depends on a global quantity (i.e., the ground set x) into bounding a quantity that depends only on a
local quantity/experiment (involving only N and T ). This is the crucial idea in this proof.
248
39.2.1.2. Using double sampling to finish the proof
h i
Claim 39.2.3. P E2 ≤ P E′2 ≤ 2−εm/2Gδ (2m).
Proof: We fix the content of R = N ∪ T . The range space (R, R|R ) has Gδ (|R|) ranges. Fix a range r in this
range space. Let Th = r ∩ R. If b = |T | < εm/2 theni the E′2 can not happened. Otherwise, the probability that r
is a bad range is P T ⊆ T and T ∩ N = ∅ T ⊆ R ≤ 21b , by Lemma 39.1.2. In particular, by the union bound
h i
over all ranges, we have P E′2 | R ≤ 2−εm/2 Gδ (|R|). As such, we have
′ X ′ X h i
E
P 2 = E
P 2 | R P [R] ≤ 2−εm/2 Gδ |R| P[R] ≤ 2−εm/2 E Gδ |R| ≤ 2−εm/2Gδ (2m).
R R
by Lemma 39.1.8. ■
Proof of Theorem 39.2.1. By Claim 39.2.2 and Claim 39.2.3, we have that P[E1 ] ≤ 2 · 2−εm/2Gδ (2m). It thus
remains to verify that if m satisfies Eq. (39.2), then the above is smaller than φ. Which is equivalent to
!δ !
−εm/2 −εm/2 4em εm 4em 1
2·2 Gδ (2m) ≤ φ ⇐⇒ 16 · 2 ≤ φ ⇐⇒ −4 + − δ lg ≥ lg
δ 2 δ φ
! !
εm 4e εm 1 εm m
⇐⇒ − 4 − δ lg + − lg + − δ lg ≥0
8 δ 8 φ 4 δ
We remind the reader that the value of m we pick is such that m ≥ max 8ε lg φ4 , 16δ lg 16
. In particular, m ≥
ε ε
64δ/ε and −4 − δ lg δ ≥ −4 − 4δ ≤ −8δ ≥ −εm/8. Similarly, by the choice of m, we have εm/8 ≥ lg φ1 . As
4e
such, we need to show that εm 4
≥ δ lg m
δ
⇐⇒ m ≥ 4δε lg mδ , and one can verify using some easy but tedious
calculations that this holds if m ≥ 16δ
ε
lg 16ε . ■
References
[Har11] S. Har-Peled. Geometric Approximation Algorithms. Vol. 173. Math. Surveys & Monographs.
Boston, MA, USA: Amer. Math. Soc., 2011.
[HW87] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete Comput. Geom., 2: 127–
151, 1987.
249
250
Chapter 40
Definition 40.1.2. Let (X, d) be an n-point metric space. We denote the open ball of radius r about x ∈ X, by
b(x, r) = {y ∈ X | d(x, y) < r}.
Underling our discussion of metric spaces are algorithmic applications. The hardness of various computa-
tional problems depends heavily on the structure of the finite metric space. Thus, given a finite metric space,
and a computational task, it is natural to try to map the given metric space into a new metric where the task at
hand becomes easy.
Example 40.1.3. Computing the diameter of a point set is not trivial in two dimensions (if one wants near
linear running time), but is easy in one dimension. Thus, if we could map points in two dimensions into
points in one dimension, such that the diameter is preserved, then computing the diameter becomes easy. This
approach yields an efficient approximation algorithm, see Exercise 40.7.3 below.
Of course, this mapping from one metric space to another, is going to introduce error. Naturally, one would
like to minimize the error introduced by such a mapping.
251
Definition 40.1.4. Let (X, dX ) and (Y, dY ) be two metric spaces. A mapping f : X → Y is an embedding, and
it is C-Lipschitz if dY f (x), f (y) ≤ C · dX (x, y) for all x, y ∈ X. The mapping f is K-bi-Lipschitz if there exists
a C > 0 such that
CK −1 · dX (x, y) ≤ dY f (x), f (y) ≤ C · dX (x, y),
for all x, y ∈ X.
The least K for which f is K-bi-Lipschitz is the distortion of f , and is denoted dist( f ). The least distortion
with which X may be embedded in Y is denoted cY (X).
Informally, if f : X → Y has distortion K, then the distances in X and f (X) ⊆ Y are the same up to a factor
of K (one might need to scale up the distances by some constant C).
There are several powerful results about low distortion embeddings that would be presented:
(I) Probabilistic trees. Every finite metric can be randomly embedded into a tree such that the “expected”
distortion for a specific pair of points is O(log n).
(II) Bourgain embedding. Any n-point metric space can be embedded into (finite dimensional) euclidean
metric space with O(log n) distortion.
(III) Johnson-Lindenstrauss lemma. Any n-point set in Euclidean space with the regular Euclidean distance
can be embedded into Rk with distortion (1 + ε), where k = O(ε−2 log n).
40.2. Examples
What is distortion? When considering a mapping f : X → Rd of a metric space (X, d) to Rd , it would
useful to observe that since Rd can be scaled, we can consider f to be an expansion (i.e., no distances shrink).
Furthermore, we can assume that there is at least one pair of points x, y ∈ X, such that d(x, y) = ∥x − y∥. As
such, we have dist( f ) = max x,y d∥x−y∥
(x,y)
.
Why is distortion necessary? Consider the a graph G = (V, E) with one vertex s connected
to three other vertices a, b, c, where the weights on the edges are all one (i.e., G is the star graph s
a b
with√ three leafs). We claim that G can not be embedded into Euclidean space with distortion
≤ 2. Indeed, consider the associated metric space (V, dG ) and an (expansive) embedding
c
f : V → Rd .
Consider the triangle formed by △ = a′ b′ c′ , where a′ = f (a), b′ = f (b) and c′ = f (c). Next, consider the
following quantity max(∥a′ − s′ ∥ , ∥b′ − s′ ∥ , ∥c′ − s′ ∥) which lower bounds the distortion of f . This quantity is
minimized when r = ∥a′ − s′ ∥ = ∥b′ − s′ ∥ = ∥c′ − s′ ∥. Namely, s′ is the center of the smallest enclosing circle
of △. However, r is minimized √ when all the edges of △ are of equal length, and are of length dG (a, b) = 2. It
follows that dist( f ) ≥ r ≥ 2/ 3.
This quantity is minimized when r = ∥a′ − s′ ∥ = ∥b′ − s′ ∥ = ∥c′ − s′ ∥. Namely, s′ is the c0
center of the smallest enclosing circle of △. However, r is minimized when all the edges
of △ are of equal length and are of length dG√(a, b) = 2. Observe that the height of the 2
√ with sidelength 2 is h = 3, and the radius of its inscribing circle
equilateral triangle √ is
r = (2/3)h = 2/ 3; see the figure on the right. As such, it follows that dist( f ) ≥ r = 2/ 3.
a0 1 b0
Note that the above argument is independent of the target dimension d. A packing
argument shows that embedding the star graph with n leaves into Rd requires distortion Ω n1/d ; see Exercise ??.
It is known that Ω(log n) distortion is necessary in the worst case when embedding a graph into Euclidean space
(this is shown using expanders). A proof of distortion Ω log n/ log log n is sketched in the bibliographical
notes.
252
40.2.1. Hierarchical Tree Metrics
The following metric is quite useful in practice, and nicely demonstrate why algorithmically finite metric spaces
are useful.
Definition 40.2.1. Hierarchically well-separated tree (HST) is a metric space defined on the leaves of a rooted
tree T . To each vertex u ∈ T there is associated a label ∆u ≥ 0 such that ∆u = 0 if and only if u is a leaf of T .
The labels are such that if a vertex u is a child of a vertex v then ∆u ≤ ∆v . The distance between two leaves
x, y ∈ T is defined as ∆lca(x,y) , where lca(x, y) is the least common ancestor of x and y in T .
A HST T is a k-HST if for a vertex v ∈ T , we have that ∆v ≤ ∆p(v) /k, where p(v) is the parent of v in T .
Note that a HST is a very limited metric. For example, consider the cycle G = Cn of n vertices, with weight
one on the edges, and consider an expansive embedding f of G into a HST HST. It is easy to verify, that there
must be two consecutive nodes of the cycle, which are mapped to two different subtrees of the root r of HST.
Since HST is expansive, it follows that ∆r ≥ n/2. As such, dist( f ) ≥ n/2. Namely, HSTs fail to faithfully
represent even very simple metrics.
40.2.2. Clustering
One natural problem we might want to solve on a graph (i.e., finite metric space) (X, d) is to partition it into
clusters. One such natural clustering is the k-median clustering, where we would like to choose a set C ⊆ X
P
of k centers, such that νC (X, d) = u∈X d(u, C) is minimized, where d(u, C) = minc∈C d(u, c) is the distance of
u to its closest center in C.
It is known that finding the optimal k-median clustering in a (general weighted) graph is NP-complete. As
such, the best we can hope for is an approximation algorithm. However, if the structure of the finite metric
space (X, d) is simple, then the problem can be solved efficiently. For example, if the points of X are on the real
line (and the distance between a and b is just |a − b|), then k-median can be solved using dynamic programming.
Another interesting case is when the metric space (X, d) is a HST. Is not too hard to prove the following
lemma. See Exercise 40.7.1.
Lemma 40.2.2. Let (X, d) be a HST defined over n points, and let k > 0 be an integer. One can compute the
optimal k-median clustering of X in O(k2 n) time.
Thus, if we can embed a general graph G into a HST HST, with low distortion, then we could approximate
the k-median clustering on G by clustering the resulting HST, and “importing” the resulting partition to the
original space. The quality of approximation, would be bounded by the distortion of the embedding of G into
HST.
253
Figure 40.1: An example of the partition of a square (induced by a set of points) as described in Section 40.3.1.
40.3.2. Properties
The following lemma quantifies the probability of a (crystal) ball of radius t centered at a point x is fully
contained in one of the clusters of the partition? (Otherwise, the crystal ball is of course broken.)
254
Figure 40.2: The resulting partition.
Lemma 40.3.1. Let (X, d) be a finite metric space, ∆ = 2u a prescribed parameter, and let P be the partition
of X generated by the above random partition. Then the following holds:
(i) For any C ∈ P, we have diam(C) ≤ ∆.
(ii) Let x be any point of X, and t a parameter ≤ ∆/8. Then,
h i 8t b
P b(x, t) ⊈ P(x) ≤ ln ,
∆ a
where a = |b(x, ∆/8)|, and b = |b(x, ∆)|.
Proof: Since Cy ⊆ b(y, R), we have that diam(Cy ) ≤ ∆, and thus the first claim holds.
Let U be the set of points of b(x, ∆), such that w ∈ U iff b(w, R) ∩ b(x, t) , ∅. Arrange the points of
U in increasing distance from x, and let w1 , . . . , wb′ denote the resulting order, where b′ = |U|. Let Ik =
[d(x, wk ) − t, d(x, wk ) + t] and write Ek for the event that wk is the first point in π such that b(x, t) ∩ Cwk , ∅, and
yet b(x, t) ⊈ Cwk . Note that if wk ∈ b(x, ∆/8), then P[Ek ] = 0 since b(x, t) ⊆ b(x, ∆/8) ⊆ b(wk , ∆/4) ⊆ b(wk , R).
In particular, w1 , . . . , wa ∈ b(x, ∆/8) and as such P[E1 ] = · · · = P[Ea ] = 0. Also, note that if d(x, wk ) < R − t
then b(wk , R) contains b(x, t) and as such Ek can not happen. Similarly, if d(x, wk ) > R+t then b(wk , R)∩b(x, t) =
∅ and Ek can not happen. As such, if Ek happen then R − t ≤ d(x, wk ) ≤ R + t. Namely, if Ek happen then R ∈ Ik .
Namely, P[Ek ] = P[Ek ∩ (R ∈ Ik )] = P[R ∈ Ik ] · P[Ek | R ∈ Ik ]. Now, R is uniformly distributed in the interval
[∆/4, ∆/2], and Ik is an interval of length 2t. Thus, P[R ∈ Ik ] ≤ 2t/(∆/4) = 8t/∆.
Next, to bound P[Ek | R ∈ Ik ], we observe that w1 , . . . , wk−1 are closer to x than wk and their distance to b(x, t)
is smaller than R. Thus, if any of them appear before wk in π then Ek does not happen. Thus, P[Ek | R ∈ Ik ] is
bounded by the probability that wk is the first to appear in π out of w1 , . . . , wk . But this probability is 1/k, and
thus P[Ek | R ∈ Ik ] ≤ 1/k.
We are now ready for the kill. Indeed,
h i Xb′ X
b′ X
b′
P b(x, t) ⊈ P(x) = P[Ek ] = P[Ek ] = P[R ∈ Ik ] · P[Ek | R ∈ Ik ]
k=1 k=a+1 k=a+1
Xb′
8t 1 8t b′ 8t b
≤ · ≤ ln ≤ ln ,
k=a+1
∆ k ∆ a ∆ a
Pb Rb
since 1
k=a+1 k ≤ a
dx
x
= ln ba and b′ ≤ b. ■
255
a randomized algorithm that embed (X, d) into a tree. Let T be the resulting tree, and consider two points
x, y ∈ X. Consider the random variable dT (x, y). We constructed the tree T suchh dthat idistances never shrink; i.e.
d(x, y) ≤ dT (x, y). The probabilistic distortion of this embedding is max x,y E dT(x,y) . Somewhat surprisingly,
(x,y)
Theorem 40.4.1. Given n-point metric (X, d) one can randomly embed it into a 2-HST with probabilistic dis-
tortion ≤ 24 ln n.
Proof: The construction is recursive. Let diam(P), and compute a random partition of X with cluster diameter
diam(P)/2, using the construction of Section 40.3.1. We recursively construct a 2-HST for each cluster, and
hang the resulting clusters on the root node v, which is marked by ∆v = diam(P). Clearly, the resulting tree is
a 2-HST.
For a node v ∈ T , let X(v) be the set of points of X contained in the subtree of v.
For the analysis, assume diam(P) = 1, and consider two points x, y ∈ X. We consider a node v ∈ T to be
in level i if level(v) = lg ∆v = i. The two points x and y correspond to two leaves in T , and let b u be the least
common ancestor of x and y in t. We have dT (x, y) ≤ 2 level(v)
. Furthermore, note that along a path the levels are
strictly monotonically increasing.
Being more conservative, let w be the first ancestor of x, such that b = b x, d(x, y) is not completely
contained in X(u1 ), . . . , X(um ), where u1 , . . . , um are the children of w. Clearly, level(w) > level b u . Thus,
dT (x, y) ≤ 2level(w) .
Consider the path σ from the root of T to x, and let Ei be the event that b is not fully contained in X(vi ),
where vi is the node of σ of level i (if such a node exists). Furthermore, let Yi be the indicator variable which is
P
1 if Ei is the first to happened out of the sequence of events E0 , E−1 , . . .. Clearly, dT (x, y) ≤ Yi 2i .
Let t = d(x, y) and j = lg d(x, y) , and ni = b(x, 2i ) for i = 0, . . . , −∞. We have
X X h i X
0 0 0
8t ni
E dT (x, y) ≤ E[Yi ] 2 ≤ 2i P Ei ∩ Ei−1 ∩ Ei−1 · · · E0 ≤ 2i · i ln ,
i
i= j i= j i= j
2 ni−3
Theorem 40.4.2. Let (X, d) be a n-point metric space. One can compute in polynomial time a k-median
clustering of X which has expected price O α log n , where α is the price of the optimal k-median clustering of
(X, d).
256
Figure 40.3: Examples of the sets resulting from the partition of Figure 40.1 and taking clusters into a set with
probability 1/2.
Proof: The algorithm is described above, and the fact that its running time is polynomial can be easily be
verified. To prove the bound on the quality of the clustering, for any point p ∈ X, let cen(p) denote the closest
point in Copt to p according to d, where Copt is the set of k-medians in the optimal clustering. Let C be the set
of k-medians returned by the algorithm, and let HST be the HST used by the algorithm. We have
X X
β = νC (X, d) ≤ νC (X, dHST ) ≤ νCopt (X, dHST ) ≤ dHST (p, Copt ) ≤ dHST (p, cen(p)).
p∈X p∈X
Proof: Indeed, let x′ and y′ be the closet points of Y, to x and y, respectively. Observe that
by the triangle inequality. Thus, f (x) − f (y) ≤ d(x, y). By symmetry, we have f (y) − f (x) ≤ d(x, y). Thus,
| f (x) − f (y)| ≤ d(x, y). ■
257
Theorem 40.5.2. √Given a n-point metric Y = (X, d), with spread Φ, one can embed it into Euclidean space Rk
with distortion O ln Φ ln n , where k = O(ln Φ ln n).
Proof: Assume that diam(Y) = Φ (i.e., the smallest distance in Y is 1), and let ri = 2i−2 , for i = 1, . . . , α, where
α = lg Φ . Let Pi, j be a random partition of P with diameter ri , using Theorem 40.4.1, for i = 1, . . . , α and
j = 1, . . . , β, where β = c log n and c is a large enough constant to be determined shortly.
For each cluster of Pi, j randomly toss a coin, and let Vi, j be the all the points of X that belong to clusters in
Pi, j that got ’T ’ in their coin toss. For a point u ∈ x, let
fi, j (x) = d(x, X \ Vi, j ) = min d(x, v),
v∈X\Vi, j
Next, consider two points x, y ∈ X, with distance ϕ = d(x, y). Let k be an integer such that ru ≤ ϕ/2 ≤ ru+1 .
Clearly, in any partition of Pu,1 , . . . , Pu,β the points x and y belong to different clusters. Furthermore, with
probability half x ∈ Vu, j and y < Vu, j or x < Vu, j and y ∈ Vu, j , for 1 ≤ j ≤ β.
Let E j denote the event that b(x, ρ) ⊆ Vu, j and y < Vu, j , for j = 1, . . . , β, where ρ = ϕ/(64 ln n). By
Lemma 40.3.1, we have
h i 8ρ ϕ
P b(x, ρ) ⊈ Pu, j (x) ≤ ln n ≤ ≤ 1/2.
ru 8ru
Thus,
h i h i
P E j = P b(x, ρ) ⊆ Pu, j (x) ∩ x ∈ Vu, j ∩ y < Vu, j
h i h i h i
= P b(x, ρ) ⊆ Pu, j (x) · P x ∈ Vu, j · P y < Vu, j ≥ 1/8,
since those three events are independent. Notice, that if E j happens, than fu, j (x) ≥ ρ and fu, j (y) = 0.
P
Let X jh be aniindicator variable which is 1 if E j happens, for j = 1, . . . , β. Let Z = j X j , and we have µ =
P
E[Z] = E j X j ≥ β/8. Thus, the probability that only β/16 of E1 , . . . , Eβ happens, is P[Z < (1 − 1/2) E[Z]].
By the Chernoff inequality, we have P[Z < (1 − 1/2) E[Z]] ≤ exp −µ1/(2 · 22 ) = exp(−β/64) ≤ 1/n10 , if we
set c = 640.
Thus, with high probability
v
u
t β r
X 2 √
β p ρ β
∥F(x) − F(y)∥ ≥ fu, j (x) − fu, j (y) ≥ ρ 2 = β =ϕ· .
j=1
16 4 256 ln n
On the other hand, fi, j (x) − fi, j (y) ≤ d(x, y) = ϕ ≤ 64ρ ln n. Thus,
q p p
∥F(x) − F(y)∥ ≤ αβ(64ρ ln n)2 ≤ 64 αβρ ln n = αβ · ϕ.
Thus, setting G(x) = F(x) 256√lnβ n , we get a mapping that maps two points of distance ϕ from each other
h √ i
to two points with distance in the range ϕ, ϕ · αβ · 256√lnβ n . Namely, G(·) is an embedding with distortion
√ √
O( α ln n) = O( ln Φ ln n).
The probability that G fails on one of the pairs, is smaller than (1/n10 ) · n2 < 1/n8 . In particular, we can
check the distortion of G for all n2 pairs, and if any of them fail (i.e., the distortion is too big), we restart the
process. ■
258
40.5.2. The unbounded spread case
Our next task, is to extend Theorem 40.5.2 to the case of unbounded spread. Indeed, let (X, d) be a n-point
metric, such that diam(X) ≤ 1/2. Again, we look on the different resolutions r1 , r2 , . . ., where ri = 1/2i−1 . For
each one of those resolutions ri , we can embed this resolution into β coordinates, as done for the bounded case.
Then we concatenate the coordinates together.
There are two problems with this approach: (i) the number of resulting coordinates is infinite, and (ii) a pair
x, y, might be distorted a “lot” because it contributes to all resolutions, not only to its “relevant” resolutions.
Both problems can be overcome with careful tinkering. Indeed, for a resolution ri , we are going to modify
the metric, so that it ignores short distances (i.e., distances ≤ ri /n2 ). Formally, for each resolution ri , let
Gi = (X, E bi ) be the graph where two points x and y are connected if d(x, y) ≤ ri /n2 . Consider a connected
component C ∈ Gi . For any two points x, y ∈ C, we have d(x, y) ≤ n(ri /n2 ) ≤ ri /n. Let Xi be the set of
connected components of Gi , and define the distances between two connected components C, C ′ ∈ Xi , to be
di (C, C ′ ) = d(C, C ′ ) = minc∈C,c′ ∈C ′ d(c, c′ ).
It is easy to verify that (Xi , di ) is a metric space (see Exercise 40.7.2). Furthermore, we can naturally
embed (X, d) into (Xi , di ) by mapping a point x ∈ X to its connected components in Xi . Essentially (Xi , di )
is a snapped version of the metric (X, d), with the advantage that Φ((X, di )) = O(n2 ). We now embed Xi
into β = O(log n) coordinates. Next, for any point of X we embed it into those β coordinates, by using the
embedding of its connected component in Xi . Let Ei be the embedding for resolution ri . Namely, Ei (x) =
( fi,1 (x), fi,2 (x), . . . , fi,β (x)), where fi, j (x) = min(di (x, X \ Vi, j ), 2ri ). The resulting embedding is F(x) = ⊕Ei (x) =
(E1 (x), E2 (x), . . . , ).
Since we slightly modified the definition of fi, j (·), we have to show that fi, j (·) is nonexpansive. Indeed,
consider two points x, y ∈ Xi , and observe that
fi, j (x) − fi, j (y) ≤ di (x, Vi, j ) − di (y, Vi, j ) ≤ di (x, y) ≤ d(x, y),
259
and y such that ri /n2 ≤ d(x, y) ≤ ri n2 . Thus, for every pair of distances there are O(log n) relevant resolutions.
Thus, there are at most η = O(n2 β log n) = O(n2 log2 n) relevant coordinates, and we can ignore all the other
coordinates. Next, consider the affine subspace h that spans F(P). Clearly, it is n − 1 dimensional, and consider
the projection G : Rη → Rn−1 that projects a point to its closest point in h. Clearly, G(F(·)) is an embedding
with the same distortion for P, and the target space is of dimension n − 1.
Note, that all this process succeeds with high probability. If it fails, we try again. We conclude:
Theorem 40.5.3 (Low quality Bourgain theorem). Given a n-point metric M, one can embed it into Eu-
clidean space of dimension n − 1, such that the distortion of the embedding is at most O(log3/2 n).
Using the Johnson-Lindenstrauss lemma, the dimension can be further reduced to O(log n). Being more
careful in the proof, it is possible to reduce the dimension to O(log n) directly.
Embedding into spanning trees. The above embeds the graph into a Steiner tree. A more useful represen-
tation, would be a random embedding into a spanning tree. Surprisingly, this can be done, as shown by Emek
et al. [EEST08]. This was improved to O(log n · log log n · (log log log n)3 ) by Abraham et al. [ABN08a,
ABN08b].
Alternative proof of the tree embedding result. Interestingly, if one does not care about the optimal dis-
tortion, one can get similar result (for embedding into probabilistic trees), by first embedding the metric into
Euclidean space, then reduce the dimension by the Johnson-Lindenstrauss lemma, and finally, construct an
HST by constructing a quadtree over the points. The “trick” is to randomly translate the quadtree. It is easy
to verify that this yields O(log4 n) distortion. See the survey by Indyk [Ind01] for more details. This random
shifting of quadtrees is a powerful technique that was used in getting several result, and it is a crucial ingredient
in Arora [Aro98] approximation algorithm for Euclidean TSP.
40.7. Exercises
Exercise 40.7.1 (Clustering for HST). Let (X, d) be a HST defined over n points, and let k > 0 be an integer.
Provide an algorithm that computes the optimal k-median clustering of X in O(k2 n) time.
[Transform the HST into a tree where every node has only two children. Next, run a dynamic programming
algorithm on this tree.]
Truely a polyglot of logs.
260
Exercise 40.7.2 (Partition induced metric).
(a) Give a counter example to the following claim: Let (X, d) be a metric space, and let P be a partition of
X. Then, the pair (P, d′ ) is a metric, where d′ (C, C ′ ) = d(C, C ′n) = min x∈C,y∈C ′ d(x, y) and C, C ′ ∈ P.
o
(b) Let (X, d) be a n-point metric space, and consider the set U = i 2i ≤ d(x, y) ≤ 2i+1 , for x, y ∈ X . Prove
that |U| = O(n). Namely, there are only n different resolutions that “matter” for a finite metric space.
Acknowledgments
The presentation in this write-up follows closely the insightful suggestions of Manor Mendel.
References
[ABN08a] I. Abraham, Y. Bartal, and O. Neiman. Nearly tight low stretch spanning trees. Proc. 49th Annu.
IEEE Sympos. Found. Comput. Sci. (FOCS), 781–790, 2008.
[ABN08b] I. Abraham, Y. Bartal, and O. Neiman. Nearly tight low stretch spanning trees. CoRR, abs/0808.2017,
2008. arXiv: 0808.2017.
[AKPW95] N. Alon, R. M. Karp, D. Peleg, and D. West. A graph-theoretic game and its application to the
k-server problem. SIAM J. Comput., 24(1): 78–100, 1995.
[Aro98] S. Arora. Polynomial time approximation schemes for Euclidean TSP and other geometric prob-
lems. J. Assoc. Comput. Mach., 45(5): 753–782, 1998.
[Bar96] Y. Bartal. Probabilistic approximations of metric space and its algorithmic application. Proc.
37th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 183–193, 1996.
[Bar98] Y. Bartal. On approximating arbitrary metrics by tree metrics. Proc. 30th Annu. ACM Sympos.
Theory Comput. (STOC), 161–168, 1998.
[CKR04] G. Călinescu, H. J. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension prob-
lem. SIAM J. Comput., 34(2): 358–372, 2004.
[EEST08] M. Elkin, Y. Emek, D. A. Spielman, and S. Teng. Lower-stretch spanning trees. SIAM J. Comput.,
38(2): 608–628, 2008.
[FRT04] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by
tree metrics. J. Comput. Sys. Sci., 69(3): 485–497, 2004.
261
[Gup00] A. Gupta. Embeddings of Finite Metrics. PhD thesis. University of California, Berkeley, 2000.
[Ind01] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. Proc. 42nd Annu.
IEEE Sympos. Found. Comput. Sci. (FOCS), Tutorial. 10–31, 2001.
[KLMN05] R. Krauthgamer, J. R. Lee, M. Mendel, and A. Naor. Measured descent: a new embedding
method for finite metric spaces. Geom. funct. anal. (GAFA), 15(4): 839–858, 2005.
[Mat02] J. Matoušek. Lectures on Discrete Geometry. Vol. 212. Grad. Text in Math. Springer, 2002.
262
Chapter 41
The function H(p) is a concave symmetric around 1/2 on the interval [0, 1] and achieves its maximum at
1/2. For a concrete example, consider H(3/4) ≈ 0.8113 and H(7/8) ≈ 0.5436. Namely, a coin that has 3/4
probably to be heads have higher amount of “randomness” in it than a coin that has probability 7/8 for heads.
Writing lg n = (ln n)/ ln 2, we have that
1
H(p) = −p ln p − (1 − p) ln(1 − p)
ln 2 !
′ 1 p 1− p 1− p
and H (p) = − ln p − − (−1) ln(1 − p) − (−1) = lg .
ln 2 p 1− p p
Deploying our amazing ability to compute derivative of simple functions once more, we get that
!
′′ 1 p p(−1) − (1 − p) 1
H (p) = =− .
ln 2 1 − p p 2 p(1 − p) ln 2
263
H(p) = −p lg p − (1 − p) lg(1 − p)
1
0.8
0.6
0.4
0.2
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Since ln 2 ≈ 0.693, we have that H′′ (p) ≤ 0, for all p ∈ (0, 1), and the H(·) is concave in this range. Also,
H′ (1/2) = 0, which implies that H(1/2) = 1 is a maximum of the binary entropy. Namely, a balanced coin has
the largest amount of randomness in it.
Example 41.1.2. A random variable X that has probability 1/n to be i, for i = 1, . . . , n, has entropy H(X) =
Pn
− 1
i=1 n lg 1n = lg n.
Note, that the entropy is oblivious to the exact values that the random variable can have, and it is sensitive
only to the probability distribution. Thus, a random variables that accepts −1, +1 with equal probability has the
same entropy (i.e., 1) as a fair coin.
Lemma 41.1.3. Let X and Y be two independent random variables, and let Z be the random variable (X, T ).
Then H(Z) = H(X) + H(Y).
Proof: In the following, summation are over all possible values that the variables can have. By the indepen-
dence of X and Y we have
X 1
H(Z) = P (X, Y) = (x, y) lg
x,y
P (X, Y) = (x, y)
X 1
= P[X = x] P Y = y lg
x,y
P[X = x] P Y = y
XX 1
= P[X = x] P Y = y lg
x y
P[X = x]
XX 1
+ P[X = x] P Y = y lg
y x
P Y = y
X 1 X 1
= P[X = x] lg + P Y = y lg = H(X) + H(Y). ■
x
P [X = x] y
P Y = y
!
2nH(q) n
Lemma 41.1.4. Suppose that nq is integer in the range [0, n]. Then ≤ ≤ 2nH(q) .
n+1 nq
264
Proof: This trivially holds if q = 0 or q = 1, so assume 0 < q < 1. We know that
!
n nq
q (1 − q)n−nq ≤ (q + (1 − q))n = 1
nq
!
n
=⇒ ≤ q−nq (1 − q)−n(1−q) = 2n (−q lg q−(1−q) lg(1−q)) = 2nH(q) .
nq
and the sign of this quantity is the sign of (k + 1)(1 − q) − (n − k)q = k + 1 − kq − q − nq + kq = 1 + k − q − nq.
Namely, ∆k ≥ 0 when k ≥ nq + q − 1, and ∆k < 0 otherwise. Namely, µ(k) < µ(k + 1), for k < nq, and
P
µ(k) ≥ µ(k + 1) for k ≥ nq. Namely,
µ(nq) is the largest term in nk=0 µ(k) = 1, and as such it is larger than the
average. We have µ(nq) = nqn qnq (1 − q)n−nq ≥ n+1 1
, which implies
!
n 1 −nq 1 nH(q)
≥ q (1 − q)−(n−nq) = 2 . ■
nq n+1 n+1
Lemma 41.1.4 can be extended to handle non-integer values of q. This is straightforward, and we omit the
easy details.
The bounds of Lemma 41.1.4 and Corollary 41.1.5 are loose but sufficient for our purposes. As a sanity
check, consider the case when we generate a sequence of n bits using a coin with probability q for head, then
by the Chernoff
n inequality, we will get roughly nq heads in this sequence. As such, the generated
sequence Y
belongs to nq ≈ 2nH(q) possible sequences that have similar probability. As such, H(Y) ≈ lg nqn = nH(q), by
Example 41.1.2, this also readily follows from Lemma 41.1.3.
265
0123
0 1 2 3 4 5 6 7 8 9 10 12 14 0 1 2 3 4 5 6 7 8 9 10 12 14 0 1 2 3 4 5 6 7 8 9 10 12 14
11 13 11 13 11 13
Figure 41.2: (A) m = 15. (B) The block decomposition. (C) If X = 10, then the extraction output is 2 in base
2, using 2 bits – that is 10.
Idea. We break the J0 : m − 1K into consecutive blocks that are powers of two. Given the value of X, we find
which block contains it, and we output a binary representation of the location of X in the block containing it,
where if a block is length 2k , then we output k bits.
Entropy can be interpreted as the amount of unbiased random coin flips can be extracted from a random
variable.
Definition 41.2.1. An extraction function Ext takes as input the value of a random variable X and outputs a
sequence of bits y, such that P Ext(X) = y |y| = k = 1/2k . whenever P |y| = k ≥ 0, where |y| denotes the
length of y.
As a concrete (easy) example, consider X to be a uniform random integer variable out of 0, . . . , 7. All that
Ext(x) has to do in this case, is just to compute the binary representation of x.
The definition of the extraction function has two subtleties:
(A) It requires that all extracted sequences of the same length (say k), have the same probability to be output
(i.e., 1/2k ).
(B) If the extraction function can output a sequence of length k, then it needs to be able to output all 2k such
binary sequences.
Thus, for X a uniform random integer variable in the range 0, . . . , 11, the function Ext(x) can output the
binary representation for x if 0 ≤ x ≤ 7. However, what do we do if x is between 8 and 11? The idea is to
output the binary hrepresentation of x − 8 as a two
i bit number. Clearly, Definition 41.2.1 holds for this extraction
function, since P Ext(X) = 00 |Ext(X)| = 2 = 1/4. as required. This scheme can be of course extracted for
any range.
Tedium 41.2.2. For x ≤ y positive integers, and any positive integer ∆, we have that
x x+∆
≤ ⇐⇒ x(y + ∆) ≤ y(x + ∆) ⇐⇒ x∆ ≤ y∆ ⇐⇒ x ≤ y.
y y+∆
Theorem 41.2.3. Suppose that the value of a random variable X is chosen uniformly at random from the
integers {0, . . . , m − 1}. Then there is an extraction function for X that outputs on average (i.e., in expectation)
at least lg m − 1 = ⌊H(X)⌋ − 1 independent and unbiased bits.
P
Proof: We represent m as a sum of unique powers of 2, namely m = i ai 2i , where ai ∈ {0, 1}. Thus, we
decomposed {0, . . . , m − 1} into a disjoint union of blocks that have sizes which are distinct powers of 2. If
a number falls inside such a block, we output its relative location in the block, using binary representation of
the appropriate length (i.e., k if the block is of size 2k ). It is not difficult to verify that this function fulfills the
conditions of Definition 41.2.1, and it is thus an extraction function.
Now, observe that the claim holds if m is a power of two, by Example 41.1.2 (i.e., if m = 2k , then H(X) = k).
Thus, if m is not a power of 2, then in the decomposition if there is a block of size 2k , and the X falls inside this
block, then the entropy is k.
266
The remainder of the proof is by induction – assume the claim holds if the range used by the random
variable is strictly smaller than m. In particular, let K = 2k be the largest power of 2 that is smaller than m, and
let U = 2u be the largest power of two such that U ≤ m − K ≤ 2U.
If the random number X ∈ J0 : K − 1K, then the scheme outputs k bits. Otherwise, we can think about the
extraction function as being recursive and extracting randomness from a random variable X ′ = X − K that is
uniformly distributed in J0 : m − KK.
By Tedium 41.2.2, we have that
m − K m − K + (2U + K − m) 2U
≤ =
m m + (2U + K − m) 2U + K
Let Y be the random variable which is the number of random bits extracted. We have that
<0
K m−K m−K m−K m − K z }| {
E[Y] ≥ k + lg(m − K) − 1 = k − k+ (u − 1) = k + (u − k − 1)
m m m m m
2U 2U
≥k− (u − k − 1) = k − (1 + k − u).
2U + K 2U + K
If u = k − 1, then H(X) ≥ k − 12 · 2 = k − 1, as required. If u = k − 2 then H(X) ≥ k − 13 · 3 = k − 1. Finally, if
u < k − 2 then
2U 2U k−u+1
E[Y] ≥ k − (1 + k − u) ≥ k − (1 + k − u) = k − (k−u+1)−2 ≥ k − 1,
2U + K K 2
since k − u + 1 ≥ 4 and i/2i−2 ≤ 1 for i ≥ 4. ■
Theorem 41.2.4. Consider a coin that comes up heads with probability p > 1/2. For any constant δ > 0 and
for n sufficiently large:
(A) One can extract, from an input of a sequence of n flips, an output sequence of (1 − δ)nH(p) (unbiased)
independent random bits.
(B) One can not extract more than nH(p) bits from such a sequence.
Proof: There are nj input sequences with exactly j heads, and each has probability p j (1 − p)n− j . We map this
n o
sequence to the corresponding number in the set 0, . . . , nj − 1 . Note, that this, conditional distribution on
j, is uniform on this set, and we can apply the extraction algorithm of Theorem 41.2.3. Let Z be the random
variables which is the number of heads in the input, and let B be the number of random bits extracted. We have
X
n h i
E[B] = P [Z = k] E B Z = k ,
k=0
h $ i !%
n
and by Theorem 41.2.3, we have E B Z = k ≥ lg − 1. Let ε < p − 1/2 be a constant to be determined
k
shortly. For n(p − ε) ≤ k ≤ n(p + ε), we have
! !
n n 2nH(p+ε)
≥ ≥ ,
k ⌊n(p + ε)⌋ n+1
by Corollary 41.1.5 (iii). We have
X
⌈n(p−ε)⌉
h i X $
⌈n(p−ε)⌉ !% !
n
E[B] ≥ P[Z = k] E B Z = k ≥ P[Z = k] lg −1
k=⌊n(p−ε)⌋ k=⌊n(p−ε)⌋
k
267
X
⌈n(p−ε)⌉ !
2nH(p+ε)
≥ P[Z = k] lg −2
k=⌊n(p−ε)⌋
n+1
= nH(p + ε) − lg(n + 1) P |Z − np| ≤ εn
!!
nε2
≥ nH(p + ε) − lg(n + 1) 1 − 2 exp − ,
4p
h i 2
ε np ε 2
since µ = E[Z] = np and P |Z − np| ≥ p pn ≤ 2 exp − 4 p = 2 exp − nε4p
, by the Chernoff inequality. In
particular, fix ε > 0, such that H(p + ε) > (1 − δ/4)H(p), and since p is fixed nH(p) = Ω(n), in particular,
2 for
δ
n sufficiently large, we have − lg(n + 1) ≥ − 10 nH(p). Also, for n sufficiently large, we have 2 exp − nε 4p
≤ 10δ .
Putting it together, we have that for n large enough, we have
δ δ δ
E [B] ≥ 1 − − nH(p) 1 − ≥ (1 − δ)nH(p),
4 10 10
as claimed.
As for the upper bound, observe that if an input sequence x has probability q, then the output sequence
y = Ext(x) has probability to be generated which is at least q. Now, all sequences of length |y| have equal
probability to be generated. Thus, we have the following (trivial) inequality 2|Ext(x)| q ≤ 2|Ext(x)| P y = Ext(X) ≤
1, implying that |Ext(x)| ≤ lg(1/q). Thus,
X X 1
E[B] = P[X = x] |Ext(x)| ≤ P[X = x] lg = H(X). ■
x x
P[X = x]
References
[MU05] M. Mitzenmacher and U. Upfal. Probability and Computing – randomized algorithms and prob-
abilistic analysis. Cambridge, 2005.
268
Chapter 42
Entropy II
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
The memory of my father is wrapped up in white paper, like sandwiches taken for a day at work. Just as a magician takes
towers and rabbits out of his hat, he drew love from his small body, and the rivers of his hands overflowed with good deeds.
269
as possible. Specifically, given an array frequency counts f [1 . . . n], we want to compute a prefix-free binary
code that minimizes the total encoded length of the message. That is we would like to compute a tree T that
minimizes
X
n
cost(T ) = f [i] ∗ len(code(i)), (42.1)
i=1
where code(i) is the binary string encoding the ith character and len(s) is the length (in bits) of the binary string
s.
A nice property of this problem is that given two trees for some parts of the alphabet, we can easily put
them together into a larger tree by just creating a new node and hanging the trees from this common node. For
example, putting two characters together, we have the following.
•....
... ...
... .. . ...
... ...
...
... ...
... ..
⇒
.
M U M U
Similarly, we can put together two subtrees.
A . .
... .....
B
... ...
•
.........
..... ..........
.....
... ... .. ..... .....
.....
... ... ... ... .....
.....
.....
.. ... ... ... ...
.. .....
.. ..
. ... ..... .....
... .....
... ...
⇒
... ... .
.. .....
.. .. . ...
.
.. .... .. .
.
....................................... .........................................
A . .
..
B. .
... ..... ... .....
... ... ... ...
... ... ... ...
.
... ... .
... ...
... ... ... ...
. ... . ...
. .
.
..........................................
..
. .
.
..........................................
..
42.1.2. Analysis
Lemma 42.1.1. Let T be an optimal code tree. Then T is a full binary tree (i.e., every node of T has either 0
or 2 children). In particular, if the height of T is d, then there are leafs nodes of height d that are sibling.
Proof: If there is an internal node in T that has one child, we can remove this node from T , by connecting
its only child directly with its parent. The resulting code tree is clearly a better compressor, in the sense of
Eq. (42.1).
As for the second claim, consider a leaf u with maximum depth d in T , and consider it parent v = p(u).
The node v has two children, and they are both leafs (otherwise u would not be the deepest node in the tree), as
claimed. ■
Lemma 42.1.2. Let x and y be the two least frequent characters (breaking ties between equally frequent char-
acters arbitrarily). There is an optimal code tree in which x and y are siblings.
270
Proof: More precisely, there is an optimal code in which x and y are siblings and have the largest depth of any
leaf. Indeed, let T be an optimal code tree with depth d. The tree T has at least two leaves at depth d that are
siblings, by Lemma 42.1.1.
Now, suppose those two leaves are not x and y, but some other characters α and β. Let U be the code tree
obtained by swapping x and α. The depth of x increases by some amount ∆, and the depth of α decreases by
the same amount. Thus,
cost(U) = cost(T ) − ( f [α] − f [x])∆.
By assumption, x is one of the two least frequent characters, but α is not, which implies that f [α] > f [x]. Thus,
swapping x and α does not increase the total cost of the code. Since T was an optimal code tree, swapping x
and α does not decrease the cost, either. Thus, U is also an optimal code tree (and incidentally, f [α] actually
equals f [x]). Similarly, swapping y and b must give yet another optimal code tree. In this final optimal code
tree, x and y as maximum-depth siblings, as required. ■
Proof: If the message has only one or two different characters, the theorem is trivial. Otherwise, let f [1 . . . n]
be the original input frequencies, where without loss of generality, f [1] and f [2] are the two smallest. To keep
things simple, let f [n + 1] = f [1] + f [2]. By the previous lemma, we know that some optimal code for f [1..n]
has characters 1 and 2 as siblings. Let Topt be this optimal tree, and consider the tree formed by it by removing
′
1 and 2 as it leaves. We remain with a tree Topt that has as leafs the characters 3, . . . , n and a “special” character
n + 1 (which is the parent of 1 and 2 in Topt ) that has frequency f [n + 1]. Now, since f [n + 1] = f [1] + f [2], we
have
X n
cost Topt = f [i]depthTopt (i)
i=1
X
n+1
= f [i]depthTopt (i) + f [1]depthTopt (1) + f [2]depthTopt (2) − f [n + 1]depthTopt (n + 1)
i=3
′
= cost Topt + ( f [1] + f [2])depth Topt − ( f [1] + f [2]) depth Topt − 1
′
= cost Topt + f [1] + f [2]. (42.2)
′ ′
This implies that minimizing the cost of Topt is equivalent to minimizing the cost of Topt . In particular, Topt must
be an optimal coding tree for f [3 . . . n + 1]. Now, consider the Huffman tree UH constructed for f [3, . . . , n + 1]
and the overall Huffman tree T H constructed for f [1, . . . , n]. By the way the construction algorithm works, we
have that UH is formed by removing the leafs of 1 and 2 from T . Now, by induction, we know that the Huffman
′
tree generated for f [3, . . . , n + 1] is optimal; namely, cost Topt = cost(UH ). As such, arguing as above, we have
′
cost(T H ) = cost(UH ) + f [1] + f [2] = cost Topt + f [1] + f [2] = cost Topt ,
by Eq. (42.2). Namely, the Huffman tree has the same cost as the optimal tree. ■
271
Now, we can use these probabilities instead of frequencies to build a Huffman tree. The natural question is
what is the length of the codewords assigned to characters as a function of their probabilities?
In general this question does not have a trivial answer, but there is a simple elegant answer, if all the
probabilities are power of 2.
Lemma 42.1.4. Let 1, . . . , n be n symbols, such that the probability for the ith symbol is pi , and furthermore,
there is an integer li ≥ 0, such that pi = 1/2li . Then, in the Huffman coding for this input, the code for i is of
length li .
Proof: The proof is by easy induction of the Huffman algorithm. Indeed, for n = 2 the claim trivially holds
since there are only two characters with probability 1/2. Otherwise, let i and j be the two characters with lowest
P
probability. It must hold that pi = p j (otherwise, k pk can not be equal to one). As such, Huffman’s merges
this two letters, into a single “character” that have probability 2pi , which would have encoding of length li − 1,
by induction (on the remaining n − 1 symbols). Now, the resulting tree encodes i and j by code words of length
(li − 1) + 1 = li , as claimed. ■
In particular, we have that li = lg 1/pi . This implies that the average length of a code word is
X 1
pi lg .
i
pi
If we consider X to be a random variable that takes a value i with probability pi , then this formula is
X 1
H(X) = P[X = i] lg ,
i
P [X = i]
Proof: The trick is to replace pi , which might not be a power of 2, by qi = 2⌊lg pi ⌋ . We have that qi ≤ pi ≤ 2qi ,
P
and qi is a power of 2, for all i. The leftover of this coding is ∆ = 1 − i qi . We write ∆ as a sum of
powers of 2 (since the frequencies are fractions of the form i/m [since the input string is of length m] – this
P
requires at most τ = O(log m) numbers): ∆ = n+τ j=n+1 q j . We now create a Huffman code T for the frequencies
q1 , . . . , qn , qn+1 , . . . , qn+τ . The output length to encode the input string using this code, by Lemma 42.1.4, is
X n X n ! Xn
1 1 1
L=m pi lg ≤ m pi 1 + lg ≤m+m pi lg = m + mH(X).
i=1
qi i=1
pi i=1
pi
One can now restrict T to be a prefix tree only for the first n symbols. Indeed, delete the τ “fake”
leafs/symbols, and repeatedly remove internal nodes that have only a single child. In the end of this pro-
cess, we get a valid prefix tree for the first n symbols, and encoding the input string using this tree would
require at most L bits, since process only shortened the code words. Finally, let V be the resulting tree.
Now, consider the Huffman tree code for the n input symbols using the original frequencies p1 , . . . pn . The
resulting tree U is a better encoder for the input string than V, by Theorem 42.1.3. As such, the compressed
string, would have at most L bits – thus establishing the claim. ■
272
42.2. Compression
In this section, we consider the problem of how to compress a binary string. We map each binary string, into a
new string (which is hopefully shorter). In general, by using a simple counting argument, one can show that no
such mapping can achieve real compression (when the inputs are adversarial). However, the hope is that there
is an underling distribution on the inputs, such that some strings are considerably more common than others.
Definition 42.2.1. A compression function Compress takes as input a sequence of n coin flips, given as an
element of {H, T }n , and outputs a sequence of bits such that each input sequence of n flips yields a distinct
output sequence.
Lemma 42.2.2. If a sequence S 1 is more likely than S 2 then the compression function that minimizes the
expected number of bits in the output assigns a bit sequence to S 2 which is at least as long as S 1 .
Note, that this is weak. Usually, we would like the function to output a prefix code, like the Huffman code.
Theorem 42.2.3. Consider a coin that comes up heads with probability p > 1/2. For any constant δ > 0,
when n is sufficiently large, the following holds.
(i) There exists a compression function Compress such that the expected number of bits output by Compress
on an input sequence of n independent coin flips (each flip gets heads with probability p) is at most
(1 + δ)nH(p); and
(ii) The expected number of bits output by any compression function on an input sequence of n independent
coin flips is at least (1 − δ)nH(p).
Proof: Let ε > 0 be a constant such that p − ε > 1/2. The first bit output by the compression procedure is ’1’
if the output string is just a copy of the input (using n + 1 bits overall in the output), and ’0’ if it is compressed.
We compress only if the number of ones in the input sequence, denoted
by X is larger than (p − ε)n. By the
Chernoff inequality, we know that P X < (p − ε)n ≤ exp −nε /2p . 2
If there are more than (p − ε)n ones in the input, and since p − ε > 1/2, we have that
X
n ! X
n !
n n n
≤ ≤ 2nH(p−ε) ,
j=⌈n(p−ε)⌉
j j=⌈n(p−ε)⌉
⌈n(p − ε)⌉ 2
by Corollary 41.1.5. As such, we can assign each such input sequence a number in the range 0 . . . n2 2nH(p−ε) ,
and this requires (with the flag bit) 1 + lg n + nH(p − ε) random bits.
Thus, the expected number of bits output is bounded by
(n + 1) exp −nε2 /2p + 1 + lg n + nH(p − ε) ≤ (1 + δ)nH(p),
by carefully setting ε and n being sufficiently large. Establishing the upper bound.
As for the lower bound, observe that at least one of the sequences having exactly τ = ⌊(p + ε)n⌋ heads,
must be compressed into a sequence having
!
n 2nH(p+ε)
lg − 1 ≥ lg − 1 = nH(p − ε) − lg(n + 1) − 1 = µ,
⌊(p + ε)n⌋ n+1
273
by Corollary 41.1.5. Now, any input string with less than τ heads has lower probability to be generated.
Indeed, for a specific strings with α < τ ones the probability to generate them is pα (1 − p)n−α and pτ (1 − p)n−τ ,
respectively. Now, observe that
τ−α
!τ−α
α τ n−τ (1 − p) τ n−τ 1 − p
p (1 − p) = p (1 − p) ·
n−α
= p (1 − p) < pτ (1 − p)n−τ ,
pτ−α p
Again, by carefully choosing ε and n sufficiently large, we have that the average output length of an optimal
compressor is at least (1 − δ)nH(p). ■
References
[MU05] M. Mitzenmacher and U. Upfal. Probability and Computing – randomized algorithms and prob-
abilistic analysis. Cambridge, 2005.
274
Chapter 43
Translation: Every bit transmitted have the same probability to be flipped by the channel. The question is
how much information can we send on the channel with this level of noise. Naturally, a channel would have
some capacity constraints (say, at most 4,000 bits per second can be sent on the channel), and the question is
how to send the largest amount of information, so that the receiver can recover the original information sent.
Now, its important to realize that noise handling is unavoidable in the real world. Furthermore, there are
tradeoffs between channel capacity and noise levels (i.e., we might be able to send considerably more bits
on the channel but the probability of flipping (i.e., p) might be much larger). In designing a communication
protocol over this channel, we need to figure out where is the optimal choice as far as the amount of information
sent.
Definition 43.1.2. A (k, n) encoding function Enc : {0, 1}k → {0, 1}n takes as input a sequence of k bits and
outputs a sequence of n bits. A (k, n) decoding function Dec : {0, 1}n → {0, 1}k takes as input a sequence of n
bits and outputs a sequence of k bits.
Thus, the sender would use the encoding function to send its message, and the decoder would use the
received string (with the noise in it), to recover the sent message. Thus, the sender starts with a message with
k bits, it blow it up to n bits, using the encoding function, to get some robustness to noise, it send it over the
(noisy) channel to the receiver. The receiver, takes the given (noisy) message with n bits, and use the decoding
function to recover the original k bits of the message.
275
Naturally, we would like k to be as large as possible (for a fixed n), so that we can send as much information
as possible on the channel. Naturally, there might be some failure probability; that is, the receiver might be
unable to recover the original string, or recover an incorrect string.
The following celebrated result of Shannon¬ in 1948 states exactly how much information can be sent on
such a channel.
Theorem 43.1.3 (Shannon’s theorem). For a binary symmetric channel with parameter p < 1/2 and for any
constants δ, γ > 0, where n is sufficiently large, the following holds:
(i) For an k ≤ n(1 − H(p) − δ) there exists (k, n) encoding and decoding functions such that the probability
the receiver fails to obtain the correct message is at most γ for every possible k-bit input messages.
(ii) There are no (k, n) encoding and decoding functions with k ≥ n(1 − H(p) + δ) such that the probability
of decoding correctly is at least γ for a k-bit input message chosen uniformly at random.
Our scheme would be simple. Pick k ≤ n(1 − H(p) − δ). For any number i = 0, . . . , K b = 2k+1 − 1, randomly
generate a binary string Yi made out of n bits, each one chosen independently and uniformly. Let Y0 , . . . , YKb
denote these codewords.
For each of these codewords we will compute the probability that if we send this codeword, the receiver
would fail. Let X0 , . . . , XK , where K = 2k − 1, be the K codewords with the lowest probability of failure.
We assign these words to the 2k messages we need to encode in an arbitrary fashion. Specifically, for i =
0, . . . , 2k − 1, we encode i as the string Xi .
The decoding of a message w is done by going over all the codewords, and finding all the codewords that
are in (Hamming) distance in the range [p(1 − ε)n, p(1 + ε)n] from w. If there is only a single word Xi with this
property, we return i as the decoded word. Otherwise, if there are no such word or there is more than one word
then the decoder stops and report an error.
Intuition. Each code Yi corresponds to a region that looks like a ring. The “ring” r = pn
for Yi is all the strings in Hamming distance between (1 − ε)r and (1 + ε)r from Y2
Yi , where r = pn. Clearly, if we transmit a string Yi , and the receiver gets a string
inside the ring of Yi , it is natural to try to recover the received string to the original Y0
code corresponding to Yi . Naturally, there are two possible bad events here:
2εpn
(A) The received string is outside the ring of Yi . Y1
(B) The received string is contained in several rings of different Ys, and it is not clear which one should the
receiver decode the string to. These bad regions are depicted as the darker regions in the figure on the
right.
¬
Claude Elwood Shannon (April 30, 1916 - February 24, 2001), an American electrical engineer and mathematician, has been
called “the father of information theory”.
276
Let S i = S(Yi ) be all the binary strings (of length n) such that if the receiver gets this word, it would decipher
it to be the original string assigned to Yi (here are still using the extended set of codewords Y0 , . . . , YKb). Note,
that if we remove some codewords from consideration, the set S(Yi ) just increases in size (i.e., the bad region
in the ring of Yi that is covered multiple times shrinks). Let Wi be the probability that Yi was sent, but it was
not deciphered correctly. Formally, let r denote the received word. We have that
X
Wi = P[r was received when Yi was sent]. (43.1)
r<S i
To bound this quantity, let ∆(x, y) denote the Hamming distance between the binary strings x and y. Clearly, if
x was sent the probability that y was received is
As such, we have
P[r received when Yi was sent] = w(Yi , r).
Definition 43.2.1. Let S i,r be an indicator variable which is 1 if r < S i . It is one if the receiver gets r, and does
not decode it to Yi (either because of failure, or because r is too close/far from Yi ).
The value of Wi is a random variable over the choice of Y0 , . . . , YKb. As such, its natural to ask what is the
expected value of Wi .
Consider the ring
ring(r) = x ∈ {0, 1}n (1 − ε)np ≤ ∆(x, r) ≤ (1 + ε)np ,
where ε > 0 is a small enough constant. Observe that x ∈ ring(y) if and only if y ∈ ring(x). Suppose, that the
code word Yi was sent, and r was received. The decoder returns the original code associated with Yi , if Yi is the
only codeword that falls inside ring(r).
Lemma 43.2.2. Given that Yi was sent, and r was received and furthermore r ∈ ring(Yi ), then the probability
of the decoder failing, is
γ
τ = P r < S i r ∈ ring(Yi ) ≤ ,
8
where γ is the parameter of Theorem 43.1.3.
Proof: The decoder fails here, only if ring(r) contains some other codeword Y j ( j , i) in it. As such,
h i h i X h i
τ = P r < S i r ∈ ring(Yi ) ≤ P Y j ∈ ring(r), for any j , i ≤ P Y j ∈ ring(r) .
j,i
Now, we remind the reader that the Y j s are generated by picking each bit randomly and independently, with
probability 1/2. As such, we have
n !
h i ring(r) X
(1+ε)np
m n n
P Y j ∈ ring(r) = = ≤ n ,
|{0, 1}n | m=(1−ε)np
2n 2 ⌊(1 + ε)np⌋
277
since (1 + ε)p < 1/2 (for ε sufficiently small), and as such the last binomial coefficient in this summation is the
largest. By Corollary 41.1.5 (i), we have
h i n !
n n
P Y j ∈ ring(r) ≤ n ≤ n 2nH((1+ε)p) = n2n(H((1+ε)p)−1) .
2 ⌊(1 + ε)np⌋ 2
As such, we have
h i X h i
τ = P r < S i r ∈ ring(Yi ) ≤ P j
Y ∈ ring(r) ≤ b
K P 1
Y ∈ ring(r) ≤ 2 k+1
n2n H((1+ε)p)−1
j,i
n 1−H(p)−δ + 1 + n (H((1+ε)p)−1)
≤ n2 ≤ n2n H((1+ε)p)−H(p)−δ +1
since k ≤ n(1 − H(p) − δ). Now, we choose ε to be a small enough constant, so that the quantity H((1 + ε)p) −
H(p) − δ is equal to some (absolute) negative (constant), say −β, where β > 0. Then, τ ≤ n2−βn+1 , and choosing
n large enough, we can make τ smaller than γ/8, as desired. As such, we just proved that
h i γ
τ = P r < S i r ∈ ring(Yi ) ≤ . ■
8
Lemma 43.2.3. Consider the situation where Yi is sent, and the received string is r. We have that
X γ
P r < ring(Yi ) = w(Yi , r) ≤ ,
r < ring(Y )
8
i
Proof: This quantity, is the probability of sending Yi when every bit is flipped with probability p, and receiving
a string r such that more than pn + εpn bits where flipped (or less than pn − εpn). But this quantity can be
bounded using the Chernoff inequality. Indeed, let Z = ∆(Yi , r), and observe that E[Z] = pn, and it is the sum
of n independent indicator variables. As such
X !
ε2 γ
w(Yi , r) = P |Z − E[Z]| > εpn ≤ 2 exp − pn < ,
r < ring(Y )
4 4
i
We remind the reader that S i,r is an indicator variable that is one if receiving r (when sending Yi ) is “bad”,
see Definition 43.2.1. Importantly, this indicator variable also depends on all the other codewords – as they
might cause some regions in the ring of Yi to be covered multiple times.
P
Lemma 43.2.4. We have that f (Yi ) = r < ring(Yi ) E S i,r w(Yi , r) ≤ γ/8 (the expectation is over all the choices of
the Ys excluding Yi ).
Proof: Observe that S i,r w(Yi , r) ≤ w(Yi , r) and for fixed Yi and r we have that E[w(Yi , r)] = w(Yi , r). As such,
we have that X h i X X γ
f (Yi ) = E i,r
S w(Yi , r) ≤ E [w(Yi , r)] = w(Yi , r) ≤ ,
r < ring(Y ) r < ring(Y ) r < ring(Y )
8
i i i
by Lemma 43.2.3. ■
278
X h i
Lemma 43.2.5. We have that g(Yi ) = E S i,r w(Yi , r) ≤ γ/8 (the expectation is over all the choices of
r ∈ ring(Yi )
the Ys excluding Yi ).
Proof: We have that S i,r w(Yi , r) ≤ S i,r , as 0 ≤ w(Yi , r) ≤ 1. As such, we have that
X h i X h i X
g(Yi ) = E i,r
S w(Yi , r) ≤ E i,r
S = P[r < S i ]
r ∈ ring(Y ) r ∈ ring(Y ) r ∈ ring(Yi )
X i
i
= P r < S i ∩ r ∈ ring(Yi )
X
r
h i
= P r < S i r ∈ ring(Yi ) P r ∈ ring(Yi )
r
Xγ γ
≤ P r ∈ ring(Yi ) ≤ ,
r
8 8
by Lemma 43.2.2. ■
Lemma 43.2.6. For any i, we have µ = E[Wi ] ≤ γ/4, where γ is the parameter of Theorem 43.1.3, where Wi
is the probability of failure to recover Yi if it was sent, see Eq. (43.1).
P
Proof: We have by Eq. (43.2) that Wi = r S i,r w(Yi , r). For a fixed value of Yi , we have by linearity of
expectation, that
hX i X h i
E[Wi | Yi ] = E S i,r w(Yi , r) Yi = E S i,r w(Yi , r) Yi
r r
X h i X h i γ γ γ
= E S i,r w(Yi , r) Yi + E S i,r w(Yi , r) Yi = g(Yi ) + f (Yi ) ≤ + = ,
r ∈ ring(Yi ) r < ring(Y )
8 8 4
i
by Lemma 43.2.4 and Lemma 43.2.5. Now E[Wi ] = E E[Wi | Yi ] ≤ E γ/4 ≤ γ/4. ■
In the following, we need the following trivial (but surprisingly deep) observation.
Observation 43.2.7. For a random variable X, if E[X] ≤ ψ, then there exists an event in the probability space,
that assigns X a value ≤ ψ.
Lemma 43.2.8. For the codewords X0 , . . . , XK , the probability of failure in recovering them when sending them
over the noisy channel is at most γ.
Proof: We just proved that when using Y0 , . . . , YKb, the expected probability of failure when sending Yi , is
b = 2k+1 − 1. As such, the expected total probability of failure is
E[Wi ] ≤ γ/4, where K
b b
hX
K iXK
γ k+1
E Wi = E Wi ≤ 2 ≤ γ2 ,
k
i=0 i=0
4
by Lemma 43.2.6. As such, by Observation 43.2.7, there exist a choice of Yi s, such that
b
X
K
Wi ≤ 2k γ.
i=0
279
Now, we use a similar argument used in proving Markov’s inequality. Indeed, the Wi are always positive, and
it can not be that 2k of them have value larger than γ, because in the summation, we will get that
b
X
K
Wi > 2k γ.
i=0
Which is a contradiction. As such, there are 2k codewords with failure probability smaller than γ. We set the
2k codewords X0 , . . . , XK to be these words, where K = 2k − 1. Since we picked only a subset of the codewords
for our code, the probability of failure for each codeword shrinks, and is at most γ. ■
Lemma 43.2.8 concludes the proof of the constructive part of Shannon’s theorem.
References
[MU05] M. Mitzenmacher and U. Upfal. Probability and Computing – randomized algorithms and prob-
abilistic analysis. Cambridge, 2005.
280
Chapter 44
We had encountered in the previous lecture examples of using rounding techniques for approximating discrete
optimization problems. So far, we had seen such techniques when the relaxed optimization problem is a linear
program. Interestingly, it is currently known how to solve optimization problems that are considerably more
general than linear programs. Specifically, one can solve convex programming. Here the feasible region is
convex. How to solve such an optimization problems is outside the scope of this course. It is however natural
to ask what can be done if one assumes that one can solve such general continuous optimization problems
exactly.
In the following, we show that (optimization problem) max cut can be relaxed into a weird continuous
optimization problem. Furthermore, this semi-definite program can be solved exactly efficiently. Maybe more
surprisingly, we can round this continuous solution and get an improved approximation.
281
subject to: vi ∈ S(n) ∀i ∈ V,
where S(n) is the n dimensional unit sphere in Rn+1 . This is an instance of semi-definite programming, which is a
special case of convex programming, which can be solved in polynomial time (solved here means approximated
within a factor of (1 + ε) of optimal, for any arbitrarily small ε > 0, in polynomial time). Namely, the solver
finds a feasible solution with a the target function being arbitrarily close to the optimal solution. Observe that
(P) is a relaxation of (Q), and as such the optimal solution of (P) has value larger than the optimal value of (Q).
The intuition is that vectors that correspond to vertices that should be on one side of the cut, and vertices on
the other sides, would have vectors which are faraway from each other in (P). Thus, we compute the optimal
solution for (P), and we uniformly generate a random vector r on the unit sphere S(n) . This induces a hyperplane
h which passes through the origin and is orthogonal to r. We next assign all the vectors that are on one side of
h to S , and the rest to S .
Summarizing, the algorithm is as follows: First, we solve (P), next, we pick a random vector r uniformly
on the unit sphere S(n) . Finally, we set
S = {vi | ⟨vi , r⟩ ≥ 0} .
44.1.1. Analysis
The intuition of the above rounding procedure, is that with good probability, vectors in the solution of (P) that
have large angle between them would be separated by this cut.
1
Lemma 44.1.1. We have P sign ⟨vi , r⟩ , sign ⟨v j , r⟩ = arccos ⟨vi , v j ⟩ .
π
Proof: Let us think about the vectors vi , v j and r as being in the plane.
vi
To see why this is a reasonable assumption, consider the plane g spanned by vi and vi
v j , and observe that for the random events we consider, only the direction of r matter,
which can be decided by projecting r on g, and normalizing it to have length 1. Now, the
sphere is symmetric, and as such, sampling r randomly from S(n) , projecting it down to
g, and then normalizing it, is equivalent to just choosing uniformly a vector from the unit
circle.
Now, sign(⟨vi , r⟩) , sign(⟨v j , r⟩) happens only if r falls in the double wedge formed by the lines perpendic-
ular to vi and v j . The angle of this double wedge is exactly the angle between vi and v j . Now, since vi and v j
are unit vectors, we have ⟨vi , v j ⟩ = cos(τ), where τ = ∠vi v j .
Thus,
h i 2τ 1
P sign ⟨vi , r⟩ , sign ⟨v j , r⟩ = = arccos ⟨vi , v j ⟩ ,
2π π
as claimed. ■
Theorem 44.1.2. Let W be the random variable which is the weight of the cut generated by the algorithm. We
have
1X
E[W] = ωi j arccos(⟨vi , v j ⟩).
π i< j
Proof: Let Xi j be an indicator variable which is 1 if and only if the edge i j is in the cut. We have
h i h i 1
E Xi j = P sign(⟨vi , r ⟩) , sign(⟨v j , r ⟩) = arccos(⟨vi , v j ⟩),
π
282
P
by Lemma 44.1.1. Clearly, W = i< j ωi j Xi j , and by linearity of expectation, we have
X h i 1X
E[W] = ωi j E Xi j = ωi j arccos(⟨vi , v j ⟩). ■
i< j
π i< j
arccos(y) 1
Lemma 44.1.3. For −1 ≤ y ≤ 1, we have ≥ α · (1 − y), where
π 2
2 ψ
α = min . (44.1)
0≤ψ≤π π 1 − cos(ψ)
Proof: Set y = cos(ψ). The inequality now becomes ψπ ≥ α 12 (1 − cos ψ). Reorganizing, the inequality becomes
ψ
2
π 1−cos ψ
≥ α, which trivially holds by the definition of α. ■
10 0.87856
2 ψ
π 1−cos(ψ)
8
Proof: Using simple calculus, one can see that α achieves its value for ψ = 2.331122..., the nonzero root of
cos ψ + ψ sin ψ = 1. ■
Theorem 44.1.5. The above algorithm computes in expectation a cut with total weight α · Opt ≥ 0.87856Opt,
where Opt is the weight of the maximum weight cut.
Proof: Consider the optimal solution to (P), and lets its value be γ ≥ Opt. We have
1X X 1
E[W] = ωi j arccos(⟨vi , v j ⟩) ≥ ωi j α (1 − ⟨vi , v j ⟩) = αγ ≥ α · Opt,
π i< j i< j
2
by Lemma 44.1.3. ■
283
44.2. Semi-definite programming
Let us define a variable xi j = ⟨vi , v j ⟩, and consider the n by n matrix M formed by those variables, where xii = 1
for i = 1, . . . , n. Let V be the matrix having v1 , . . . , vn as its columns. Clearly, M = V T V. In particular, this
implies that for any non-zero vector v ∈ Rn , we have vT Mv = vT AT Av = (Av)T (Av) ≥ 0. A matrix that has this
property, is called positive semidefinite. Interestingly, any positive semidefinite matrix P can be represented as
a product of a matrix with its transpose; namely, P = BT B. Furthermore, given such a matrix P of size n × n,
we can compute B such that P = BT B in O(n3 ) time. This is know as Cholesky decomposition.
Observe, that if a semidefinite matrix P = BT B has a diagonal where all the entries are one, then B has
columns which are unit vectors. Thus, if we solve (P) and get back a semi-definite matrix, then we can recover
the vectors realizing the solution, and use them for the rounding.
In particular, (P) can now be restated as
1X
(S D) max ωi j (1 − xi j )
2 i< j
subject to: xii = 1 for i = 1, . . . , n
xi j is a positive semi-definite matrix.
i=1,...,n, j=1,...,n
We are trying to find the optimal value of a linear function over a set which is the intersection of linear con-
straints and the set of positive semi-definite matrices.
Lemma 44.2.1. Let U be the set of n × n positive semidefinite matrices. The set U is convex.
Proof: Consider A, B ∈ U, and observe that for any t ∈ [0, 1], and vector v ∈ Rn , we have:
Positive semidefinite matrices corresponds to ellipsoids. Indeed, consider the set xT Ax = 1: the set of
vectors that solve this equation is an ellipsoid. Also, the eigenvalues of a positive semidefinite matrix are all
non-negative real numbers. Thus, given a matrix, we can in polynomial time decide if it is positive semidefinite
or not (by computing the eigenvalues of the matrix).
Thus, we are trying to optimize a linear function over a convex domain. There is by now machinery to
approximately solve those problems to within any additive error in polynomial time. This is done by using the
interior point method, or the ellipsoid method. See [BV04, GLS93] for more details. The key ingredient that is
required to make these methods work, is the ability to decide in polynomial time, given a solution, whether its
feasible or not. As demonstrated above, this can be done in polynomial time.
284
conjecture was recently proved by Mossel et al. [MOO05]. However, it is not clear if the “Unique Games
Conjecture” is true, see the discussion in [KKMO04].
The work of Goemans and Williamson was quite influential and spurred wide research on using SDP for
approximation algorithms. For an extension of the MAX CUT problem where negative weights are allowed
and relevant references, see the work by Alon and Naor [AN04].
References
[AN04] N. Alon and A. Naor. Approximating the cut-norm via grothendieck’s inequality. Proc. 36th
Annu. ACM Sympos. Theory Comput. (STOC), 72–80, 2004.
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge, 2004.
[GLS93] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimiza-
tion. 2nd. Vol. 2. Algorithms and Combinatorics. Berlin Heidelberg: Springer-Verlag, 1993.
[GW95] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut
and satisfiability problems using semidefinite programming. J. Assoc. Comput. Mach., 42(6):
1115–1145, 1995.
[Hås01b] J. Håstad. Some optimal inapproximability results. J. ACM, 48(4): 798–859, 2001.
[KKMO04] S. Khot, G. Kindler, E. Mossel, and R. O’Donnell. Optimal inapproximability results for max
cut and other 2-variable csps. Proc. 45th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), To
appear in SICOMP. 146–154, 2004.
[MOO05] E. Mossel, R. O’Donnell, and K. Oleszkiewicz. Noise stability of functions with low influences
invariance and optimality. Proc. 46th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 21–30,
2005.
285
286
Chapter 45
Expanders I
598 - Class notes for Randomized Algorithms
“Mr. Matzerath has just seen fit to inform me that this
Sariel Har-Peled
partisan, unlike so many of them, was an authentic
April 2, 2024
partisan. For - to quote the rest of my patient’s lecture -
there is no such thing as a part-time partisan. Real
partisans are partisans always and as long as they live.
They put fallen governments back in power and over
throw governments that have just been put in power with
the help of partisans. Mr. Matzerath contended - and this
thesis struck me as perfectly plausible - that among all
those who go in for politics your incorrigible partisan,
who undermines what he has just set up, is closest to the
artiest because he consistently rejects what he has just
created.”
287
We denote the eigenvalues of M by λb1 ≥ λb2 ≥ · · · λbn , and the eigenvalues of Q by λb1 ≥ λb2 ≥ · · · λbn . Note,
that for a d-regular graph, the eigenvalues of Q are the eigenvalues of M scaled down by a factor of 1/d; that is
bi = λ
λ bi /d.
Lemma 45.1.1. Let G be an undirected graph, and let ∆ denote the maximum degree in G. Then, λb1 (G) =
λb1 (M) = ∆ if and only one connected component of G is ∆-regular. The multiplicity of ∆ as an eigenvector is
bi (G) ≤ ∆, for all i.
the number of ∆-regular connected components. Furthermore, we have λ
Proof: The ith entry of M1n is the degree of the ith vertex vi of G (i.e., M1n = d(vi ), where 1n = (1, 1, . . . , 1) ∈
Rn . So, let x be an eigenvector of M with eigenvalue λ, and let x j , 0 be the coordinate with the largest
(absolute value) among all coordinates of x corresponding to a connect component H of G. We have that
X
|λ| x j = (M x) j = xi ≤ ∆ x j ,
vi ∈N(v j )
where N(v j ) are the neighbors of vi in G. Thus, all the eigenvalues of G have λ bi ≤ ∆, for i = 1, . . . , n. If
λ = ∆, then this implies that xi = x j if vi ∈ N(v j ), and d(v j ) = ∆. Applying this argument to the vertices of
N(v j ), implies that H must be ∆-regular, and furthermore, x j = xi , if xi ∈ V(H). Clearly, the dimension of the
subspace with eigenvalue (in absolute value) ∆ is exactly the number of such connected components. ■
The following is also known. We do not provide a proof since we do not need it in our argumentation.
Lemma 45.1.2. If G is bipartite, then if λ is eigenvalue of M(G) with multiplicity k, then −λ is also its eigen-
value also with multiplicity k.
Intuitively, the tension captures how close is estimating the variance of a function defined over the vertices
of G, by just considering the edges of G. Note, that a disconnected graph would have infinite tension, and the
clique has tension 1.
Surprisingly, tension is directly related to expansion as the following lemma testifies.
Lemma 45.2.2. Let G = (V, E) be a given connected d-regular graph with n vertices. Then, G is a δ-expander,
1
where δ ≥ and γ(G) is the tension of G.
2γ(G)
Proof: Consider a set S ⊆ V, where |S| ≤ n/2. Let fS (v) be the function assigning 1 if v ∈ S , and zero
otherwise. Observe that if (u, v) ∈ S × S ∪ S × S then | fS (u) − fS (v)| = 1, and | fS (u) − fS (v)| = 0 otherwise.
As such, we have
2 |S | (n − |S |) h i h i e S , S
= E | fS (x) − fS (y)|2 ≤ γ(G) E | fS (x) − fS (y)|2 = γ(G) ,
n2 x,y∈V xy∈E |E|
288
by Lemma 45.2.4. Now, since G is d-regular, we have that |E| = nd/2. Furthermore, n − |S | ≥ n/2, which
implies that
2 |E| · |S | (n − |S |) 2(nd/2)(n/2) |S | 1
e S, S ≥ = = d |S | .
γ(G)n 2 γ(G)n 2 2γ(G)
which implies the claim (see Eq. (45.1)). ■
Now, a clique has tension 1, and it has the best expansion possible. As such, the smaller the tension of a
graph, the better expander it is.
Definition 45.2.3. Given a random walk matrix Q associated with a d-regular graph, let B(Q) = ⟨v1 , . . . , vn ⟩
denote the orthonormal eigenvector basis defined by Q. √That is, v1 , . . . , vn is an orthonormal basis for Rn ,
where all these vectors are eigenvectors of Q and v1 = 1n / n. Furthermore, let λ bi denote the ith eigenvalue of
Q, associated with the eigenvector vi , such that λb1 ≥ λb2 ≥ · · · ≥ λbn .
Lemma 45.2.4. Let G = (V, E) be a given connected d-regular graph with n vertices. Then γ(G) = 1
1−λb2
, where
λb2 = λ2 /d is the second largest eigenvalue of Q.
Proof: Let f : V → R. Since in Eq. (45.2), we only look on the difference between two values of f , we
can add a constant to f , and would not change the quantities involved in Eq. (45.2). As such, we assume that
E f (x) = 0. As such, we have that
h i h i h i
E | f (x) − f (y)| = E ( f (x) − f (y)) = E ( f (x)) − 2 f (x) f (y) + ( f (y))
2 2 2 2
(45.3)
x,y∈V x,y∈V x,y∈V
h i h i
= E ( f (x))2 − 2 E f (x) f (y) + E ( f (y))2
x,y∈V x,y∈V x,y∈V
h i h i h i
= E ( f (x)) − 2 E f (x) E f (y) + E ( f (y))2 = 2 E ( f (x))2 .
2
x∈V x∈V y∈V y∈V x∈V
Now, let I be the n × n identity matrix (i.e., one on its diagonal, and zero everywhere else). We have that
1X 1 X X X 2X
ρ= ( f (x) − f (y))2 = d( f (x))2 − 2 f (x) f (y) = ( f (x))2 − f (x) f (y)
d xy∈E d x∈V xy∈E x∈V
d xy∈E
X
= (I − Q) xy f (x) f (y).
x,y∈V
Note, that 1n is an eigenvector of Q with eigenvalue 1, and this is the largest eigenvalue of Q. Let B(Q) =
⟨v1 , . . . , vn ⟩ be the orthonormal eigenvector basis defined by Q, with eigenvalues λb1 ≥ λb2 ≥ · · · ≥ λbn , respec-
P
tively. Write f = ni=1 αi vi , and observe that
* + *X +
X f (i)
n
v1 v1 1 α1
0 = E f (x) = = f, √ = αi vi , √ = √ ⟨α1 v1 , v1 ⟩ = √ ,
i=1
n n i
n n n
289
Now, we have that
X *" # +
xth row of
(I − Q) xy v j (y) = , v j = (I − Q)v j (x) = 1 − λbj v j (x) = 1 − λbj v j (x),
(I − Q)
y∈V
Pn
since v j is eigenvector of Q with eigenvalue λbj . Since v1 , . . . , vn is an orthonormal basis, and f = αi vi , we
P i=1
have that ∥ f ∥2 = j α2j . Going back to ρ, we have that
X X X X
ρ= αi α j vi (x) 1 − λbj v j (x) = αi α j 1 − λbj vi (x)v j (x)
i, j x∈V i, j x∈V
X D E X
n D E
= αi α j 1 − λbj vi , v j = α2j 1 − λbj v j , v j
i, j j=1
X
n X 2 X
n X
n
≥ 1 − λb2 α2j v j (x) = 1 − λb2 α2j = 1 − λb2 ∥ f ∥2 = 1 − λb2 ( f (x))2 (45.4)
j=2 x∈V j=2 j=1
h i
= n 1 − λb2 E ( f (x))2 ,
x∈V
1
This implies that γ(G) ≤ . Observe, that the inequality in our analysis, had risen from Eq. (45.4), but if
1 − λb2
we take f = v2 , then the inequality there holds with equality, which implies that γ(G) ≥ 1−1λb , which implies
2
the claim. ■
Lemma 45.2.2 together with the above lemma, implies that the expansion δ of a d-regular graph G is at
least δ = 1/2γ(G) = (1 − λ2 /d)/2, where λ2 is the second eigenvalue of the adjacency matrix of G. Since the
tension of a graph is direct function of its second eigenvalue, we could either argue about the tension of a graph
or its second eigenvalue when bounding the graph expansion.
290
Chapter 46
Expanders II
598 - Class notes for Randomized Algorithms
Be that as it may, it is to night school that I owe what
Sariel Har-Peled
education I possess; I am the first to own that it doesn’t
April 2, 2024
amount to much, though there is something rather
grandiose about the gaps in it.
Definition 46.1.1. For a graph G, let γ2 (G) denote the bi-tension of G; that is, the smallest constant, such that
for any two function f, g : V(G) → R, we have that
h i h i
E | f (x) − g(y)| ≤ γ2 (G) E | f (x) − g(y)| .
2 2
(46.1)
x,y∈V (x→y)∈e
E
The proof of the following lemma is similar to the proof of Lemma 45.2.4. The proof is provided for the
sake of completeness, but there is little new in it.
1
Lemma 46.1.2. Let G = (V, E) be a connected d-regular graph with n vertices. Then γ2 (G) = , where
1 −bλ
b
λ = bλ(G), where b λ(G) = max λb2 , −λbn , where λ
bi is the ith largest eigenvalue of the random walk matrix
associated with G.
Proof: We can assume that E f (x) = 0. As such, we have that
h i h i h i h i h i
E | f (x) − g(y)| = E ( f (x)) − 2 E f (x)g(y) + E (g(y)) = E ( f (x)) + E (g(y)) .
2 2 2 2 2
(46.2)
x,y∈V x,y∈V x,y∈V y∈V x,y∈V y∈V
Let Q be the matrix associated with the random walk on G (each entry is either zero or 1/d), we have
h i 1 X 1 X
ρ= E | f (x) − g(y)|2 = ( f (x) − g(y))2 = Q xy ( f (x) − g(y))2
(x→y)∈e
E nd e
n x,y∈V
(x→y)∈E
1 X 2 X
= ( f (x))2 + (g(x))2 − Q xy f (x)g(y).
n x∈V n x,y∈V
291
Let B(Q) = ⟨v1 , . . . , vn ⟩ be the orthonormal eigenvector basis defined by Q (see Definition 45.2.3), with eigen-
P P
values λb1 ≥ λb2 ≥ · · · ≥ λbn , respectively. Write f = ni=1 αi vi and g = ni=1 βi vi . Since E f (x) = 0, we have that
α1 = 0. Now, Q xy = Qyx , and we have
X X X X X X X
Q xy f (x)g(y) = Qyx αi vi (x) β j v j (y) = αi β j v j (y) Qyx vi (x)
x,y∈V x,y∈V i j i, j y∈V x∈V
X X X D E Xn X
= αi β j v j (y) λbi vi (y) = bi v j , vi =
αi β j λ bi
αi βi λ (vi (y))2
i, j y∈V i, j i=2 y∈V
X
n
α2i + β2i X b
λ X
n X
≤b
λ (vi (y))2 ≤ (αi vi (y))2 + (βi vi (y))2
i=2
2 y∈V
2 i=1 y∈V
λ X
b
= ( f (y))2 + (g(y))2
2 y∈V
As such,
h i 1 X 1 X 1 X 2 f (x)g(y)
E | f (x) − g(y)|2 = | f (x) − g(y)|2 = ( f (y))2 + (g(y))2 −
(x→y)∈e
E nd e
n y∈V n x,y∈V d
(x→y)∈E
1 X 2 X
= ( f (y))2 + (g(y))2 − Q xy f (x)g(y)
n y∈V n x,y∈V
1 2 b λ X h i h i!
≥ − · 2 2 b
( f (y)) + (g(y)) = 1 − λ E ( f (y)) + E (g(y))
2 2
n n 2 y∈V y∈V y∈V
h i
= 1 −b λ E | f (x) − g(y)|2 ,
x,y∈V
by Eq. (46.2). This implies that γ2 (G) ≤ 1/ 1 − b
λ . Again, by trying either f = g = v2 or f = vn and g = −vn ,
we get that the inequality above holds with equality, which implies γ2 (G) ≥ 1/ 1 − b λ . Together, the claim
now follows. ■
Our main interest would be in the second largest eigenvalue of M. Formally, let
⟨xM, x⟩
λ2 (G) = max .
x⊥1 ,x,0
n
⟨x, x⟩
We state the following result but do not prove it since we do not need it for our nafarious purposes (however,
we did prove the left side of the inequality).
292
Theorem 46.2.2. Let G be a δ-expander with adjacency matrix M and let λ2 = λ2 (G) be the second-largest
eigenvalue of M. Then r
1 λ2 λ2
1− ≤δ≤ 2 1− .
2 d d
What the above theorem says, is that the expansion of a [n, d, δ]-expander is a function of how far is its
second eigenvalue (i.e., λ2 ) from its first eigenvalue (i.e., d). This is usually referred to as the spectral gap.
We will start by explicitly constructing an expander that has “many” edges, and then we will show to reduce
its degree till it become a constant degree expander.
The nice property of Fq is that addition can be interpreted as a xor operation. That is, for any x, y ∈ Fq , we
have that x + y + y = x and x − y − y = x. The key properties of Fq we need is that multiplications and addition
can be computed in it in polynomial time in t, and it is a field (i.e., each non-zero element has a unique inverse).
46.2.1.1.1. Computing multiplication in Fq . Consider two elements α, β ∈ Fq . Multiply the two polyno-
mials poly(α) by poly(β), let poly(γ) be the resulting polynomial (of degree at most 2t − 2), and compute the
remainder poly(β) when dividing it by the irreducible polynomial p(x). For this remainder polynomial, nor-
malize the coefficients by computing their modules base 2. The resulting polynomial is the product of α and
β.
293
For more details on this field, see any standard text on abstract algebra.
Theorem 46.2.3. For any t > 0, r > 0 and q = 2t , where r < q, we have that LD(q, r) is a graph with qr+1
vertices. Furthermore, λ1 (LD(q, r)) = q2 , and λh i (LD(q, r))i ≤ rq, for i = 2, . . . , n.
In particular, if r ≤ q/2, then LD(q, r) is a qr+1 , q2 , 14 -expander.
Proof: Let M be the N × N adjacency matrix of LD(q, r). Let L : Fq → {0, 1} be a linear map which is onto. It
is easy to verify that L−1 (0) = L−1 (1) ¬
We are interested in the eigenvalues of the matrix M. To this end, we consider vectors in RN . The ith row
an ith column of M is associated with a unique element bi ∈ G. As such, for a vector v ∈ RN , we denote by
v[bi ] the ith coordinate of v. In particular, for α = (α0 , . . . , αr ) ∈ G, let vα ∈ RN denote the vector, where its
β = (β0 , . . . , βr ) ∈ G coordinate is
Pr
vα β = (−1)L( i=0 αi βi ) .
n o
Let V = vα α ∈ G . For α , α′ ∈ V, observe that
X Pr Pr X Pr X
αi βi ) α′i βi ) (αi +α′i ) βi ) =
⟨vα , vα′ ⟩ = (−1)L( i=0 · (−1)L( i=0 = (−1)L( i=0 vα+α′ β .
β∈G β∈G β∈G
So, consider ψ = α + α′ , 0. Assume, for the simplicity of exposition that all the coordinates of ψ are non-zero.
We have, by the linearity of L that
X Pr X X
⟨vα , vα′ ⟩ = (−1)L( i=0 αi βi ) = (−1)L(ψ0 β0 +···+ψr−1 βr−1 ) (−1)L(ψr βr ) .
β∈G β0 ∈Fq ,...,βr−1 ∈Fq βr ∈Fq
n o
¬
Indeed, if Z = L−1 (0), and L(x) = 1, then L(y) = 1, for all y ∈ U = x + z z ∈ Z . Now, its clear that |Z| = |U|.
294
n o P
However, since ψr , 0, the quantity ψr βr βr ∈ Fq = Fq . Thus, the summation βr ∈Fq (−1)L(ψr βr ) gets L−1 (0)
terms that are 1, and L−1 (0) terms that are −1. As such, this summation is zero, implying that ⟨vα , vα′ ⟩ = 0.
Namely, the vectors of V are orthogonal.
Observe, that for α, β, ψ ∈ G, we have vα β + ψ = vα β vα ψ . For α ∈ G, consider the vector Mvα . We
have, for β ∈ G, that
X X X
(Mvα ) β = Mβψ · vα ψ = vα ψ = vα β + y(1, x, . . . , xr )
ψ∈G x,y ∈ Fq x,y ∈ Fq
ψ=ρ(β,x,y)
X
= vα y(1, x, . . . , xr ) · vα β .
x,y ∈ Fq
P
Thus, setting λ(α) = x,y ∈ Fq vα y(1, x, . . . , xr ) ∈ R, we have that Mvα = λ(α) · vα . Namely, vα is an eigenvector,
with eigenvalue λ(α).
P
Let pα (x) = ri=0 αi xi , and let
X X
λ(α) = vα y(1, x, . . . , xr ) ∈ R = (−1)L(ypα (x))
x,y ∈ Fq x,y∈Fq
X X
= (−1) L(y pα (x))
+ (−1) L(y pα (x))
.
x,y∈Fq x,y∈Fq
pα (x)=0 pα (x),0
If pα (x) = 0 then (−1)L(y pα (x)) = 1, for all y. As such, each such x contributes q to λ(α).
If pα (x) , 0 then y pα (x) takes all the values of Fq , and as such, L(y pα (x)) is 0 for half of these values, and
1 for the other half. Implying that these kind terms contribute 0 to λ(α). But pα (x) is a polynomial of degree
r, and as such there could be at most r values of x for which the first term is taken. As such, if α , 0 then
λ(α) ≤ rq. If α = 0 then λ(α) = q2 , which implies the theorem. ■
This construction provides an expander with constant degree only if the number of vertices is a constant.
Indeed, if we want an expander with constant degree, we have to take q to be as small as possible. We get
the relation n = qr+1 ≤ qq , since r ≤ r, which implies that q = Ω(log n/ log log n). Now, the expander of
Theorem 46.2.3 is q2 -regular, which means that it is not going to provide us with a constant degree expander.
However, we are going to use it as our building block in a construction that would start with this expander
and would inflate it up to the desired size.
295
296
Chapter 47
297
resulting graph. In this case, replacing a vertex with a path is a potential “disaster”, since every such subpath
increases the lengths of the paths of the original graph by a factor of d (and intuitively, a good expander have
“short” paths between any pair of vertices).
Consider a “large” (n, D)-graph G and a “small” (D, d)-graph H. As a G
first stage, we replace every vertex of G by a copy of H. The new graph K H
has JnK × JDK as a vertex set. Here, the edge vu ∈ V(G), where u = v[i] and
v = u[ j], is replaced by the edge connecting (v, i) ∈ V(K) with (u, j) ∈ V(K).
We will refer to this resulting edge (v, i)(u, j) as a long edge. Also, we copy
all the edges of the small graph to each one of its copies. That is, for each
i ∈ JnK, and uv ∈ E(H), we add the edge (i, u)(i, v) to K, which is a short edge.
We will refer to K, which is a (nD, d + 1)-graph, as a replacement product
of G and H, denoted by G O r H. See figure on the right for an example.
Again, intuitively, we are losing because the expan- GrH
sion of the resulting graph had deteriorated too much. To e3
overcome this problem, we will perform local shortcuts
to shorten the paths in the resulting graph (and thus im-
prove its expansion properties). A zig-zag-zig path in the e2
replacement product graph K, is a three edge path e1 e2 e3 ,
where e1 and e3 are short edges, and the middle edge e2
is a long edge. That is, if e1 = (i, u)(i, v), e2 = (i, v)( j, v′ ),
and e3 = ( j, v′ )( j, u′ ), then e1 , e2 , e3 ∈ E(K), i j ∈ E(G),
uv ∈ E(H) and v′ u′ ∈ E(H). Intuitively, you can think e1
GrH
about e1 as a small “zig” step in H, e2 is a long “zag”
step in G, and finally e3 is a “zig” step in H.
Another way of representing a zig-zag-zig path v1 v2 v3 v4 starting at the vertex v1 = (i, v) ∈ V(F), is to
parameterize it by two integers ℓ, ℓ′ ∈ JdK, where
v1 = (i, v), v2 = (i, vH [ℓ]) v3 = (iG [vH [ℓ]] , vH [ℓ]) v4 = iG [vH [ℓ]] , (vH [ℓ])H ℓ′ .
Let Z be the set of all (unordered) pairs of vertices of K connected by such a zig-zag-zig path. Note, that
every vertex (i, v) of K has d2 paths having (i, v) as an end point. Consider the graph F = (V(K), Z). The graph F
has nD vertices, and it is d2 regular. Furthermore, since we shortcut all these zig-zag-zig paths in K, the graph
F is a much better expander (intuitively) than K. We will refer to the graph F as the zig-zag product of G and H.
Definition 47.1.1. The zig-zag product of (n, D)-graph G and a (D, d)-graph H, is the (nD, d2 ) graph F =
GOz H, where the set of vertices is JnK × JDK and for any v ∈ JnK, i ∈ JDK, and ℓ, ℓ′ ∈ JdK we have in F the edge
connecting the vertex (i, v) with the vertex (iG [vH [ℓ]] , (vH [ℓ])H [ℓ′ ]).
Remark 47.1.2. We need the resulting zig-zag graph to have consistent labeling. For the sake of simplicity of
exposition, we are just going to assume this property.
We next bound the tension of the zig-zag product graph.
Theorem 47.1.3. We have γ(G O z H) ≤ γ2 (G)(γ2 (H))2 . and γ2 (G O
z H) ≤ γ2 (G)(γ2 (H))2 .
Proof: Let G = JnK, E be a (n, D)-graph and H = JDK, E ′ be a (D, d)-graph. Fix any function f : JnK ×
JDK → R, and observe that
h i " h i#
ψ = E | f (u, k) − f (v, ℓ)| = E
2
E | f (u, k) − f (v, ℓ)|
2
u,v∈JnK k,ℓ∈JDK u,v∈JnK
k,ℓ∈JDK
298
"
h i#
2
≤ E γ2 (G) E | f (u, k) − f (v, ℓ)|2 = γ2 (G) E E f (u, k) − f u p , ℓ .
k,ℓ∈JDK uv∈E(G) k,ℓ∈JDK u∈JnK
p∈JDK
| {z }
=∆1
Now,
" # " #
2 2
∆1 = E E f (u, k) − f u p , ℓ ≤ E γ2 (H) E f (u, k) − f u p , ℓ
u∈JnK k,p∈JDK u∈JnK kp∈E(H)
ℓ∈JDK ℓ∈JDK
2
= γ2 (H) E E f u, p j − f u p , ℓ .
u∈JnK p∈JDK
ℓ∈JDK j∈JdK
| {z }
=∆2
Now,
2 2
∆2 = E E f u, p j − f u p , ℓ = E E f v p , p j − f (v, ℓ)
j∈JdK u∈JnK j∈JdK v∈JnK
ℓ∈JDK p∈JDK ℓ∈JDK p∈JDK
2
= E E f v p , p j − f (v, ℓ)
j∈JdK p∈JDK
v∈JnK ℓ∈JDK
" #
2
= γ2 (H) E E f v p , p j − f (v, ℓ) .
j∈JdK pℓ∈E(H)
v∈JnK
| {z }
=∆3
Now, we have
2
∆3 = E E f v p , p j − f (v, p[i]) = E | f (u, k) − f (ℓ, v)| ,
j∈JdK p∈JDK (u,k)(ℓ,v)∈E(G Oz H)
v∈JnK i∈JdK
as v p , p j is adjacent to v p , p (a short edge), which is in turn adjacent to (v, p) (a long edge), which is
adjacent to (v, p[i]) (a short edge). Namely, v p , p j and (v, p[i]) form the endpoints of a zig-zag path in the
replacement product of G and H. That is, these two endpoints are connected by an edge in the zig-zag product
graph. Furthermore, it is easy to verify that each zig-zag edge get accounted for in this representation exactly
once, implying the above inequality. Thus, we have ψ ≤ γ2 (G)(γ2 (H))2 ∆3 , which implies the claim.
The second claim follows by similar argumentation. ■
47.1.3. Squaring
The last component in our construction, is squaringsquaring!graph a graph. Given a (n, d)-graph G, consider
the multigraph G2 formed by connecting any vertices connected in G by a path of length 2. Clearly, if M is the
adjacency matrix of G, then the adjacency matrix of G is the matrix M . Note, that M2 is the number of
2 2
ij
distinct paths of length 2 in G from i to j. Note, that the new graph might have self loops, which does not effect
our analysis, so we keep them in.
299
(γ2 (G))2
Lemma 47.1.4. Let G be a (n, d)-graph. The graph G2 is a (n, d2 )-graph. Furthermore γ2 G2 = 2γ2 (G)−1
.
2 2
Proof: The graph G2 has eigenvalues λb1 (G) , . . . , λb1 (G) for its matrix Q2 . As such, we have that
b
λ G2 = max λb2 G2 , −λbn G2 .
2 2
Now, λb1 G2 = 1. Now, if λb2 (G) ≥ λbn (G) < 1 then b
λ G2 = λb2 G2 = λb2 (G) = b λ(G) .
2 2
If λb2 (G) < λbn (G) then b
λ G2 = λb2 G2 = λbn (G) = b λ(G) ..
2
Thus, in either case b λ G2 = bλ(G) . Now, By Lemma 46.1.2 γ2 (G) = 1−bλ(1 G) , which implies that b
λ(G) =
1 − 1/γ2 (G), and thus
ni = ni−1 N
Theorem 47.1.5. For any i ≥ 0, one can compute deterministically a graph Gi with ni = (d4 + 1)d4i vertices,
which is d2 regular, where d = 256. The graph Gi is a (1/10)-expander.
Proof: The construction is described above. As for the expansion, since the bi-tension bounds the tension of
a graph, we have that γ(Gi ) ≤ γ2 (Gi ) ≤ 5. Now, by Lemma 45.2.2, we have that Gi is a δ-expander, where
δ ≥ 1/(2γ(Gi )) ≥ 1/10. ■
300
47.2. Bibliographical notes
A good survey on expanders is the monograph by Hoory et al. [HLW06]. The small expander construction
is from the paper by Alon et al. [ASS08] (but its originally from the work by Along and Roichman [AR94]).
The work by Alon et al. [ASS08] contains a construction of an expander that is constant degree, which is of
similar complexity to the one we presented here. Instead, we used the zig-zag expander construction from the
influential work of Reingold et al. [RVW02]. Our analysis however, is from an upcoming paper by Mendel
and Naor [MN08]. This analysis is arguably reasonably simple (as simplicity is in the eye of the beholder, we
will avoid claim that its the simplest), and (even better) provide a good intuition and a systematic approach to
analyzing the expansion.
We took a creative freedom in naming notations, and the name tension and bi-tension are the author’s own
invention.
47.3. Exercises
Exercise 47.3.1 (Expanders made easy.). By considering a random bipartite three-regular graph on 2n vertices
obtained by picking three random permutations between the two sides of the bipartite graph, prove that there is
a c > 0 such that for every n there exits a (2n, 3, c)-expander. (What is the value of c in your construction?)
Exercise 47.3.2 (Is your consistency in vain?). In the construction, we assumed that the graphs we are dealing
with when building expanders have consistent labeling. This can be enforced by working with bipartite graphs,
which implies modifying the construction slightly.
(A) Prove that a d-regular bipartite graph always has a consistent labeling (hint: consider matchings in this
graph).
(B) Prove that if G is bipartite so is the graph G3 (the cubed graph).
(C) Let G be a (n, D)-graph and let H be a (D, d)-graph. Prove that if G is bipartite then GG O z H is bipartite.
(D) Describe in detail a construction of an expander that is: (i) bipartite, and (ii) has consistent labeling at
every stage of the construction (prove this property if necessary). For the ith graph in your series, what
is its vertex degree, how many vertices it has, and what is the quality of expansion it provides?
Acknowledgments
Much of the presentation was followed suggestions by Manor Mendel. He also contributed some of the figures.
References
[AR94] N. Alon and Y. Roichman. Random cayley graphs and expanders. Random Struct. Algorithms,
5(2): 271–285, 1994.
[ASS08] N. Alon, O. Schwartz, and A. Shapira. An elementary construction of constant-degree expanders.
Combin. Probab. Comput., 17(3): 319–327, 2008.
301
[HLW06] S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bulletin Amer.
Math. Soc., 43: 439–561, 2006.
[MN08] M. Mendel and A. Naor. Towards a calculus for non-linear spectral gaps. manuscript. 2008.
[RVW02] O. Reingold, S. Vadhan, and A. Wigderson. Entropy waves, the zig-zag graph product, and new
constant-degree expanders and extractors. Annals Math., 155(1): 157–187, 2002.
302
Chapter 48
48.1. Introduction
The probabilistic method is a combinatorial technique to use probabilistic algorithms to create objects having
desirable properties, and furthermore, prove that such objects exist. The basic technique is based on two basic
observations:
2. If the probability of event E is larger than zero, then E exists and it is not empty.
The surprising thing is that despite the elementary nature of those two observations, they lead to a powerful
technique that leads to numerous nice and strong results. Including some elementary proofs of theorems that
previously had very complicated and involved proofs.
The main proponent of the probabilistic method, was Paul Erdős. An excellent text on the topic is the book
by Noga Alon and Joel Spencer [AS00].
This topic is worthy of its own course. The interested student is refereed to the course “Math 475 — The
Probabilistic Method”.
48.1.1. Examples
48.1.1.1. Max cut
Computing the maximum cut (i.e., max cut) in a graph is a NP-Complete problem, which is APX-Hard (i.e.,
no better than a constant approximation is possible if P , NP). We present later on a better approximation
algorithm, but the following simple algorithm already gives a pretty good approximation.
303
Theorem 48.1.1. For any undirected graph G = (V, E) with n vertices and m edges, there is a partition of the
m
vertex set V into two sets S and T , such that |(S , T )| = |{uv ∈ E | u ∈ S and v ∈ T }| ≥ . One can compute a
2
partition, in O(n) time, such that E |(S , T )| = m/2.
Proof: Consider the following experiment: randomly assign each vertex to S or T , independently and equal
probability.
For an edge e = uv, the probability that one endpoint is in S , and the other in T is 1/2, and let Xe be the
indicator variable with value 1 if this happens. Clearly,
h i X X 1 m
E uv ∈ E (u, v) ∈ S × T ∪ T × S = E[Xe ] = = .
e∈E(G) e∈E(G)
2 2
Thus, there must be an execution of the algorithm that computes a cut that is at least as large as the expectation
– namely, a partition of V that satisfies the realizes a cut with ≥ m/2 edges. ■
X
m
max zj
j=1
X X
subject to xi + (1 − xi ) ≥ z j for all j
i∈C +j i∈C −j
304
We relax this into the following linear program:
X
m
max zj
j=1
sub ject to 0 ≤ yi , z j ≤ 1 for all i, j
X X
yi + (1 − yi ) ≥ z j for all j.
i∈C +j i∈C −j
Which can be solved in polynomial time. Let b t denote the values assigned to the variable t by the linear-
P
programming solution. Clearly, mj=1 zbj is an upper bound on the number of clauses of I that can be satisfied.
We set the variable yi to 1 with probability b
yi . This is an instance randomized rounding.
Lemma 48.2.2. Let C j be a clause with k literals. The probability that it is satisfied by randomized rounding
is at least βk zbj ≥ (1 − 1/e)b
z j , where
!k
1 1
βk = 1 − 1 − ≈1− .
k e
Proof: Assume C j = y1 ∨ v2 . . . ∨ vk . By the LP, we have yb1 + · · · + ybk ≥ zbj . Furthermore, the probability
Q Q
that C j is not satisfied is ki=1 1 − b yi . Note that 1 − ki=1 1 − b
yi is minimized when all the byi ’s are equal (by
symmetry). Namely, when b yi = zbj /k. Consider the function f (x) = 1 − (1 − x/k) . This function is larger than
k
The second part of the inequality, follows from the fact that βk ≥ 1 − 1/e, for all k ≥ 0. Indeed, for k = 1, 2
the claim trivially holds. Furthermore,
!k !k
1 1 1 1
1− 1− ≥1− ⇔ 1− ≤ ,
k e k e
k
but this holds since 1 − x ≤ e−x implies that 1 − 1k ≤ e−1/k , and as such 1 − 1k ≤ e−k/k = 1/e. ■
Theorem 48.2.4. Given an instance I of MAX-SAT, the expected number of clauses satisfied by linear pro-
gramming and randomized rounding is at least (1 − 1/e) ≈ 0.632mopt (I), where mopt (I) is the maximum number
of clauses that can be satisfied on that instance.
305
Theorem 48.2.5. Given an instance I of MAX-SAT, let n1 be the expected number of clauses satisfied by
randomized assignment, and let n2 be the expected number of clauses satisfied by linear programming followed
P
by randomized rounding. Then, max(n1 , n2 ) ≥ (3/4) j zbj ≥ (3/4)mopt (I).
P
Proof: It is enough to show that (n1 + n2 )/2 ≥ 34 j zbj . Let S k denote the set of clauses that contain k literals.
We know that
X X X X
n1 = 1 − 2−k ≥ 1 − 2−k zbj .
k C j ∈S k k C j ∈S k
P P
By Lemma 48.2.2 we have n2 ≥ k C j ∈S k βk zbj . Thus,
n1 + n2 X X 1 − 2−k + βk
≥ zbj .
2 k C ∈S
2
j k
One can verify that 1 − 2−k + βk ≥ 3/2, for all k. ¬
Thus, we have
n1 + n2 3 X X 3X
≥ zbj = zbj . ■
2 4 k C ∈S 4 j
j k
References
[AS00] N. Alon and J. H. Spencer. The Probabilistic Method. 2nd. Wiley InterScience, 2000.
¬
Indeed, by the proof of Lemma 48.2.2, we have that βk ≥ 1 − 1/e. Thus, 1 − 2−k + βk ≥ 2 − 1/e − 2−k ≥ 3/2 for k ≥ 3. Thus,
we only need to check the inequality for k = 1 and k = 2, which can be done directly.
306
Chapter 49
As for (ii), we already saw it and used it in the minimum cut algorithm lecture. ■
Definition 49.1.2. An event E is mutually independent of a set of events C, if for any subset U ⊆ C, we have
T T
that P E ∩ E′ ∈U E′ = P[E] P E′ ∈U E′ .
Let E1 , . . . , En be events. A dependency graph for these events n is a directedo graph G = (V, E), where
{1, . . . , n}, such that Ei is mutually independent of all the events in E j (i, j) < E .
Intuitively, an edge (i, j) in a dependency graph indicates that Ei and E j have (maybe) some dependency
between them. We are interested in settings where this dependency is limited enough, that we can claim
something about the probability of all these events happening simultaneously.
Lemma 49.1.3 (Lovász Local Lemma). Let G(V, E) be a dependency graph for events E1 , . . . , En . Suppose
Y h i Yn
that there exist xi ∈ [0, 1], for 1 ≤ i ≤ n such that P[Ei ] ≤ xi 1 − x j . Then P ∩i=1 Ei ≥
n
(1 − xi ).
(i, j)∈E i=1
307
We need the following technical lemma.
Lemma 49.1.4. Let G(V, E) be a dependency
Y graph for events E1 , . . . , En . Suppose that there exist xi ∈ [0, 1],
for 1 ≤ i ≤ n such that P[Ei ] ≤ xi 1 − x j . Now, let S be a subset of the vertices from {1, . . . , n}, and let i
(i, j)∈E
be an index not in S . We have that
P Ei ∩ j∈S E j ≤ xi . (49.1)
since Ei is mutually independent of C(R). As for the denominator, let N = { j1 , . . . , jr }. We have, by Lemma 49.1.1
(ii), that
P E j1 ∩ . . . ∩ E jr ∩m∈R Em = P E j1 ∩m∈R Em P E j2 E j1 ∩ ∩m∈R Em
· · · P E jr E j1 ∩ . . . ∩ E jr−1 ∩ ∩m∈R Em
= 1 − P E j1 ∩m∈R Em 1 − P E j2 E j1 ∩ ∩m∈R Em
· · · 1 − P E jr E j1 ∩ . . . ∩ E jr−1 ∩ ∩m∈R Em
Y
≥ 1 − x j1 · · · 1 − x jr ≥ 1 − xj ,
(i, j)∈E
Proof of Lovász local lemma (Lemma 49.1.3): Using Lemma 49.1.4, we have that
h i Y
n
P ∩i=1 Ei = (1 − P[E1 ]) 1 − P E2 E1 · · · 1 − P En ∩i=1 Ei ≥ (1 − xi ).
n n−1
i=1
308
Corollary 49.1.5. Let E1 , . . . , En be events, with P[Ei ] ≤ p for allh i. If each
i event is mutually independent of
all other events except for at most d, and if ep(d + 1) ≤ 1, then P ∩i=1 Ei > 0.
n
Proof: If d = 0 the result is trivial, as the events are independent. Otherwise, there is a dependency graph, with
every vertex having degree at most d. Apply Lemma 49.1.3 with xi = d+1 1
. Observe that
!d
1 1 1 1
xi (1 − xi ) =
d
1− > · ≥ p,
d+1 d+1 d+1 e
d
by assumption and the since 1 − d+1 1
> 1/e, see Lemma 49.1.6 below. ■
The following is standard by now, and we include it only for the sake of completeness.
!n
1 1
Lemma 49.1.6. For any n ≥ 1, we have 1 − > .
n+1 e
n n
Proof: This is equivalent to n+1n
> 1e . Namely, we need to prove e > n+1 . But this obvious, since
n n
1 n
n+1
n
= 1 + n < exp(n(1/n)) = e. ■
309
49.2.1.1. Analysis
A clause had survived if it is not satisfied by the variables fixed in the first stage. Note, that a clause that
survived must have a dangerous clause as a neighbor in the dependency graph G. Not that I ′ , the instance
remaining from I after the first stage, has at least k/2 unspecified variables in each clause. Furthermore, every
clause of I ′ has at most d = k2k/50 neighbors in G′ , where G′ is the dependency graph for I ′ . It follows, that
again, we can apply Lovász local lemma to conclude that I ′ has a satisfying assignment.
Definition 49.2.2. Two connected graphs G1 = (V1 , E1 ) and G2 = (V2 , E2 ), where V1 , V2 ⊆ {1, . . . , n} are
unique if V1 , V2 .
Lemma 49.2.3. Let G be a graph with degree at most d and with n vertices. Then, the number of unique
subgraphs of G having r vertices is at most nd2r .
Proof: Consider a unique subgraph G b of G, which by definition is connected. Let H be a connected subtree of
G spanning G.b Duplicate every edge of H, and let H ′ denote the resulting graph. Clearly, H ′ is Eulerian, and as
such posses a Eulerian path π of length at most 2(r − 1), which can be specified, by picking a starting vertex v,
and writing down for the i-th vertex of π which of the d possible neighbors, is the next vertex in π. Thus, there
are st most nd2(r−1) ways of specifying π, and thus, there are at most nd2(r−1) unique subgraphs in G of size r.■
Lemma 49.2.4. With probability 1 − o(1), all connected components of G′ have size at most O(log m), where
G′ denote the dependency graph for I ′ .
Proof: Let G4 be a graph formed from G by connecting any pair of vertices of G of distance exactly 4 from
each other. The degree of a vertex of G4 is at most O(d4 ).
Let U be a set of r vertices of G, such that every pair is in distance at least 4 from each other in G. We are
interested in bounding the probability that all the clauses of U survive the first stage.
The probability of a clause to be dangerous is at most 2−k/2 , as we assign (random) values to half of the
variables of this clause. Now, a clause survive only if it is dangerous or one of its neighbors is dangerous. Thus,
the probability that a clause survive is bounded by 2−k/2 (d + 1).
Furthermore, the survival of two clauses Ei and E j in U is an independent event, as no neighbor of Ei shares
a variable with a neighbor of E j (because of the distance 4 requirement). We conclude, that the probability that
all the vertices of U to appear in G′ is bounded by
r
2−k/2 (d + 1) .
In fact, we are interested in sets U that induce a connected subgraphs of G4 . The number of unique such
sets of size r is bounded by the number of unique subgraphs of G4 of size r, which is bounded by md8r , by
Lemma 49.2.3. Thus, the probability of any connected subgraph of G4 of size r = log2 m to survive in G′ is
smaller than
r 8r r
md8r 2−k/2 (d + 1) = m k2k/50 2−k/2 (k2k/50 + 1) ≤ m2kr/5 · 2−kr/4 = m2−kr/20 = o(1),
since k ≥ 50. (Here, a subgraph survive of G4 survive, if all its vertices appear in G′ .) Note, however, that if a
connected component of G′ has more than L vertices, than there must be a connected component having L/d3
vertices in G4 that had survived in G′ . We conclude, that with probability o(1), no connected component of G′
has more than O(d3 log m) = O(log m) vertices (note, that we consider k to be a constant, and thus, also d). ■
Thus, after the first stage, we are left with fragments of (k/2)-SAT, where every fragment has size at most
O(log m), and thus having at most O(log m) variables. Thus, we can by brute force find the satisfying assign-
ment to each such fragment in time polynomial in m. We conclude:
Theorem 49.2.5. The above algorithm finds a satisfying truth assignment for any instance of k-SAT containing
m clauses, which each variable is contained in at most 2k/50 clauses, in expected time polynomial in m.
310
Chapter 50
Problem 50.1.1 (Set Balancing/Discrepancy). Given a binary matrix M of size n × n, find a vector v ∈
{−1, +1}n , such that ∥Mv∥∞ is minimized.
√ Using random assignment and the Chernoff inequality, we showed that there exists v, such that ∥Mv∥∞ ≤
4 n ln n. Can we derandomize this algorithm? Namely, can we come up with an efficient deterministic algo-
rithm that has low discrepancy?
To derandomize our algorithm, construct a computation tree of depth n, where in the ith level we expose
the ith coordinate of v. This tree T has depth n. The root represents all possible random choices, while a
node at depth i, represents all computations when the first i bits are fixed. For a node v ∈ T , let P(v) be the
probability that a random computation starting from v succeeds – here randomly assigning the remaining bits
can be interpreted as a random walk down the tree to a leaf. √
Formally, the algorithm is successful if ends up with a vector v, such that ∥Mv∥∞ ≤ 4 n ln n.
Let vl and vr be the two children of v. Clearly, P(v) = (P(vl ) + P(vr ))/2. In particular, max(P(vl ), P(vr )) ≥
P(v). Thus, if we could compute P(·) quicklyp(and deterministically), then we could derandomize the algorithm.
Let Cm+ be the bad event that rm · v > 4 n log n, where rm is the mth row ofh M. Similarly, iCm− is the bad
p
event that rm · v < −4 n log n, and let Cm = Cm+ ∪ Cm− . Consider the probability, P Cm+ v1 , . . . , vk (namely, the
first k coordinates of v are specified). Let rm = (r1 , . . . , rn ). We have that
n
h i
X p X
k
X X
vi ri = P vi ri > L = P vi > L,
P Cm v1 , . . . , vk = P
+
vi ri > 4 n log n −
i=k+1 i=1 i≥k+1,ri ,0 i≥k+1,ri =1
311
p P P
where L = 4 n log n − ki=1 vi ri is a known quantity (since v1 , . . . , vk are known). Let V = i≥k+1,ri =1 1. We
have,
h i
X
X vi + 1 L + V
(vi + 1) > L + V = P ,
P Cm v1 , . . . , vk = P
+
>
i≥k+1 i≥k+1
2 2
αi =1 αi =1
The last quantity is the probability that in V flips of a fair 0/1 coin one gets more than (L + V)/2 heads. Thus,
X ! !
h i 1 X
V V
V 1 V
P+m = P Cm+ v1 , . . . , vk = = .
i=⌈(L+V)/2⌉
i 2n 2n i=⌈(L+V)/2⌉ i
This implies, that we can compute P+m in polynomial time! Indeed, we are adding V ≤ n numbers, each one of
them is a binomial coefficient that has polynomial size representation in n, and can be computed in polynomial
time (why?). One can define in similar fashion P−m , and let Pm = P+m + P−m . Clearly,
h Pm can ibe computed in
polynomial time, by applying a similar argument to the computation of P−m = P Cm− v1 , . . . , vk .
For a node hv ∈ T , let
i vv denote the portion of v that was fixed when traversing from the root of T to v. Let
Pn
P(v) = m=1 P Cm vv . By the above discussion P(v) can be computed in polynomial time. Furthermore, we
know, by the previous result on discrepancy that P(r) < 1 (that was the bound used to show that there exist a
good assignment).
As before, for any v ∈ T , we have P(v) ≥ min(P(vl ), P(vr )). Thus, we p have a polynomial deterministic
algorithm for computing a set balancing with discrepancy smaller than 4 n log n. Indeed, set v = root(T ).
And start traversing down the tree. At each stage, compute P(vl ) and P(vr ) (in polynomial time), and set v to
the child with lower
p value of P(·). Clearly, after n steps, we reach a leaf, that corresponds to a vector v′ such
that ∥Av′ ∥∞ ≤ 4 n log n.
Theorem 50.1.2. Using the method of conditional
p probabilities, one can compute in polynomial time in n, a
vector v ∈ {−1, 1}n , such that ∥Av∥∞ ≤ 4 n log n.
Note, that this method might fail to find the best assignment.
Proof: Consider a random permutation of the vertices, and in the ith iteration add the vertex πi to the indepen-
dent set if none of its neighbors are in the independent set. Let I be the resulting independent set. We have for
a vertex v ∈ JnK that
1
P[v ∈ I] ≥ .
d(i) + 1
As such, the expected size of the computed independent set is
X
n X
n
1
Γ= P[i ∈ I] ≥ .
i=1 i=1
d(i) + 1
312
Observe that for x > 0, and α ≥ x, we have that
α
1/(1 + x) + 1/(1 + α − x) = .
(1 + x)(1 + α − x)
achieves its minimum when x = α/2.
P
As such, ni=1 d(i)+1
1
is minimized when all the d(·) are equal. Which means that
X
n
1 X
n
1 n
Γ≥ .≥ .= ,
i=1
d(i) + 1 i=1
(2m/n) + 1 (2m/n) + 1
as claimed. ■
We have that
h i ! !! x !! x x
n ( x
) p(x − 1) 3
P α(G) ≥ x ≤ (1 − p) 2 < n exp − < n exp − ln n < o(1) = o(1).
x 2 2
Let n be sufficiently large so that both these events have probability less than 1/2. Then there is a specific G
with less than n/2 cycles of length at most L and with α(G) < 3n1−µ ln n + 1.
Remove from G a vertex from each cycle of length at most L. This gives a graph G∗ with at least n/2
vertices. G∗ has girth greater than L and α(G∗ ) ≤ α(G) (any independent set in G∗ is also an independent set in
G). Thus
∗ |V(G∗ )| n/2 nµ
χ(G ) ≥ ≥ 1−µ ≥ .
α(G∗ ) 3n ln n 12 ln n
To complete the proof, let n be sufficiently large so that this is greater than K. ■
313
50.3.2. Crossing Numbers and Incidences
The following problem has a long and very painful history. It is truly amazing that it can be solved by such a
short and elegant proof.
And embedding of a graph G = (V, E) in the plane is a planar representation of it, where each vertex is rep-
resented by a point in the plane, and each edge uv is represented by a curve connecting the points corresponding
to the vertices u and v. The crossing number of such an embedding is the number of pairs of intersecting curves
that correspond to pairs of edges with no common endpoints. The crossing number cr(G) of G is the minimum
possible crossing number in an embedding of it in the plane.
|E|3
Theorem 50.3.3. The crossing number of any simple graph G = (V, E) with |E| ≥ 4 |V| is ≥ .
64 |V|2
Proof: By Euler’s formula any simple planar graph with n vertices has at most 3n−6 edges. (Indeed, f −e+v = 2
in the case with maximum number of edges, we have that every face, has 3 edges around it. Namely, 3 f = 2e.
Thus, (2/3)e − e + v = 2 in this case. Namely, e = 3v − 6.) This implies that the crossing number of any simple
graph with n vertices and m edges is at least m − 3n + 6 > m − 3n. Let G = (V, E) be a graph with |E| ≥ 4 |V|
embedded in the plane with t = cr(G) crossings. Let H be the random induced subgraph of G obtained by
picking each vertex of G randomly and independently, to be a vertex of H with probabilistic p (where P will
be specified shortly). The expected number of vertices of H is p |V|, the expected number of its edges is p2 |E|,
and the expected number of crossings in the given embedding is p4 t, implying that the expected value of its
crossing number is at most p4 t. Therefore, we have p4 t ≥ p2 |E| − 3p |V|, implying that
|E| 3 |V|
cr(G) ≥ − 3 ,
p2 p
let p = 4 |V| / |E| < 1, and we have cr(G) ≥ (1/16 − 3/64) |E|3 / |V|2 = |E|3 /(64 |V|2 ). ■
Theorem 50.3.4. Let P be a set of n distinct points in the plane, and let L be a set of m distinct lines. Then, the
number of incidences between the lines of L (that is, the number of pairs (p, ℓ) with p ∈ P,
the points of P and
ℓ ∈ L, and p ∈ ℓ) is at most c m n + m + n , for some absolute constant c.
2/3 2/3
Proof: Let I denote the number of such incidences. Let G = (V, E) be the graph whose vertices are all the
points of P, where two are adjacent if and only if they are consecutive points of P on some line in L. Clearly
|V| = n, and |E| = I − m. Note that G is already given embedded in the plane, where the edges are presented by
segments of the corresponding lines of L.
Either, we can not apply Theorem 50.3.3, implying that I − m = |E| < 4 |V| = 4n. Namely, I ≤ m + 4n. Or
alliteratively,
!
|E|3 (I − m)3 m m2
= ≤ cr(G) ≤ ≤ .
64 |V|2 64n2 2 2
Implying that I ≤ (32)1/3 m2/3 n2/3 + m. In both cases, I ≤ 4(m2/3 n2/3 + m + n). ■
This technique has interesting and surprising results, as the following theorem shows.
Theorem 50.3.5. For any three sets A, B and C of s real numbers each, we have
n o
|A · B + C| = ab + c a ∈ A, b ∈ B, mc ∈ C ≥ Ω s3/2 .
314
n o n o
Proof: Let R = A · B + C, |R| = r and define P = (a, t) a ∈ A, t ∈ R , and L = y = bx + c b ∈ B, c ∈ C .
Clearly n = |P|n = sr, and m = |L| = os2 . Furthermore, a line y = bx + c of L is incident with s points
of R, namely with (a, t) a ∈ A, t = ab + c . Thus, the overall number of incidences is at least s3 . By Theo-
rem 50.3.4, we have
2/3
s ≤4 m n +m+n =4 s
3 2/3 2/3 2
(sr) + s + sr = 4 s2 r2/3 + s2 + sr .
2/3 2
For r < s3 , we have that sr ≤ s2 r2/3 . Thus, for r < s3 , we have s3 ≤ 12s2 r2/3 , implying that s3/2 ≤ 12r. Namely,
|R| = Ω(s3/2 ), as claimed. ■
Among other things, the crossing number technique implies a better bounds for k-sets in the plane than
what was previously known. The k-set problem had attracted a lot of research, and remains till this day one of
the major open problems in discrete geometry.
Proof: Pick a random sample R of L, by picking each line to be in the sample with probability 1/k. Observe
that
n
E[|R|] = .
k
Let L≤k = L≤k L be the set of all vertices of A(L) of level at most k, for k > 1. For a vertex p ∈ L≤k , let Xp
( )
be an indicator variable which is 1 if p is a vertex of the 0-level of A(R). The probability that p is in the 0-level
of A(R) is the probability that none of the j lines below it are picked to be in the sample, and the two lines that
define it do get selected to be in the sample. Namely,
h i ! j !2 !k !
1 1 1 1 k 1 1
P Xp = 1 = 1 − ≥ 1− 2
≥ exp −2 2
= 2 2
k k k k k k ek
315
On the other hand, the number of vertices on the 0-level of R is at most |R| − 1. As such,
X
Xp ≤ |R| − 1.
p∈L≤k
|L≤k | n
Putting these two inequalities together, we get that ≤ . Namely, |L≤k | ≤ e2 nk. ■
e2 k2 k
316
Chapter 51
317
To this end, a vertical trapezoid is a quadrangle with two vertical sides. The breaking of the faces into such
trapezoids is the vertical decomposition of the arrangement A S .
Formally, for a subset R ⊆ S, let A| R denote the vertical decomposition of the plane formed by the
arrangement A R of the segments of R. This is the partition of the plane into interior disjoint vertical trapezoids
formed by erecting vertical walls through each vertex of A| R .
A vertex of A| R is either an endpoint of a segment of R or an intersection
point of two of its segments. From each such vertex we shoot up (similarly, down)
a vertical ray till it hits a segment of R or it continues all the way to infinity. See the
figure on the right. σ
Note that a vertical trapezoid is defined by at most four segments: two segments
defining its ceiling and floor and two segments defining the two intersection points
that induce the two vertical walls on its boundary. Of course, a vertical trapezoid
might be degenerate and thus be defined by fewer segments (i.e., an unbounded vertical trapezoid or a triangle
with a vertical segment as one of its sides).
Vertical decomposition breaks the faces of the arrangement that might be arbitrarily complicated into en-
tities (i.e., vertical trapezoids) of constant complexity. This makes handling arrangements (decomposed into
vertical trapezoid) much easier computationally.
In the following, we assume that the n segments of S have k pairwise intersection points overall, and we want to compute the arrangement A = A(S); namely, compute the edges, vertices, and faces of A(S). One possible way is the following: Compute a random permutation of the segments of S: S = ⟨s1, . . . , sn⟩. Let Si = ⟨s1, . . . , si⟩ be the prefix of length i of S. Compute A|(Si) from A|(Si−1), for i = 1, . . . , n. Clearly, A|(S) = A|(Sn), and we can extract A(S) from it. Namely, in the ith iteration, we insert the segment si into the arrangement A|(Si−1).
This technique of building the arrangement by inserting the segments one by one is called randomized
incremental construction.
Who needs these pesky arrangements anyway? The reader might wonder who needs arrangements. As a concrete example, consider a situation where you are given several maps of a city containing different layers of information (i.e., streets map, sewer map, electric lines map, train lines map, etc.). We would like to compute the overlay map formed by putting all these maps on top of each other. For example, we might be interested in figuring out if there are any buildings lying on a planned train line, etc.
More generally, think about a set of general constraints in Rd . Each constraint is bounded by a surface, or a
patch of a surface. The decomposition of Rd formed by the arrangement of these surfaces gives us a description
of the parametric space in a way that is algorithmically useful. For example, finding if there is a point inside
all the constraints, when all the constraints are induced by linear inequalities, is linear programming. Namely,
arrangements are a useful way to think about any parametric space partitioned by various constraints.
Figure 51.1
To facilitate this, we need to compute the trapezoids of Bi−1 that intersect si . This is done by maintaining a
conflict graph. Each trapezoid σ ∈ A| Si−1 maintains a conflict list cl(σ) of all the segments of S that intersect
its interior. In particular, the conflict list of σ cannot contain any segment of Si−1 , and as such it contains only
the segments of S \ Si−1 that intersect its interior. We also maintain a similar structure for each segment, listing
all the trapezoids of A| Si−1 that it currently intersects (in its interior). We maintain those lists with cross
pointers, so that given an entry (σ, s) in the conflict list of σ, we can find the entry (s, σ) in the conflict list of s
in constant time.
Thus, given si , we know what trapezoids need to be split (i.e., all the trapezoids in cl(si )).
Splitting a trapezoid σ by a segment si is the operation of computing a set of (at most) four
trapezoids that cover σ and have si on their boundary. We compute those new trapezoids, and
next we need to compute the conflict lists of the new trapezoids. This can be easily done by
taking the conflict list of a trapezoid σ ∈ cl(si ) and distributing its segments among the O(1)
s
new trapezoids that cover σ. Using careful implementation, this requires a linear time in the i
size of the conflict list of σ.
Note that only trapezoids that intersect si in their interior get split. Also, we need to update the conflict lists
for the segments (that were not inserted yet).
We next sketch the low-level details involved in maintaining these conflict lists. For a segment s that
intersects the interior of a trapezoid σ, we maintain the pair (s, σ). For every trapezoid σ, in the current
vertical decomposition, we maintain a doubly linked list of all such pairs that contain σ. Similarly, for each
segment s we maintain the doubly linked list of all such pairs that contain s. Finally, each such pair contains
two pointers to the location in the two respective lists where the pair is being stored.
It is now straightforward to verify that using this data-structure we can implement the required operations
in linear time in the size of the relevant conflict lists.
In the above description, we ignored the need to merge adjacent trapezoids if they have identical floor and
ceiling – this can be done by a somewhat straightforward and tedious implementation of the vertical decom-
position data-structure, by providing pointers between adjacent vertical trapezoids and maintaining the conflict
list sorted (or by using hashing) so that merge operations can be done quickly. In any case, this can be done in
linear time in the input/output size involved, as can be verified.
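To make the bookkeeping concrete, here is a minimal sketch (not from the original text) of randomized incremental construction with conflict lists in the simplest possible setting: inserting points on the real line, where the "vertical decomposition" degenerates to the partition of the line into intervals. The function name ric_intervals and the data layout are made up for illustration; splitting the interval containing the newly inserted point and redistributing its conflict list plays the role of splitting a trapezoid and distributing its conflict list.

```python
import random

def ric_intervals(points):
    """Randomized incremental construction of the partition of the line induced by
    `points`, maintaining conflict lists (a 1D stand-in for vertical decomposition)."""
    pts = list(points)
    random.shuffle(pts)                                # random insertion order
    INF = float('inf')
    # An interval is a pair (left, right); initially one interval covers everything,
    # and its conflict list contains all (not yet inserted) points.
    intervals = {(-INF, INF): set(range(len(pts)))}
    total_conflict_work = 0
    for i, p in enumerate(pts):
        # Locate the interval whose conflict list contains the new point.  (Here we scan;
        # the real algorithm keeps cross pointers so this lookup is constant time.)
        home = next(I for I, cl in intervals.items() if i in cl)
        cl = intervals.pop(home)
        cl.discard(i)
        total_conflict_work += len(cl)                 # work charged to the destroyed list
        # Split the interval at p and redistribute its conflict list.
        intervals[(home[0], p)] = {j for j in cl if pts[j] < p}
        intervals[(p, home[1])] = {j for j in cl if pts[j] > p}
    return intervals, total_conflict_work

if __name__ == "__main__":
    random.seed(0)
    _, work = ric_intervals([random.random() for _ in range(1000)])
    print("total conflict-list work:", work)           # grows like O(n log n) in expectation
```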
51.1.1.1. Analysis
Claim 51.1.1. The (amortized) running time of constructing Bi from Bi−1 is proportional to the size of the
conflict lists of the vertical trapezoids in Bi \ Bi−1 (and the number of such new trapezoids).
Proof: Observe that we can charge all the work involved in the ith iteration to either the conflict lists of the
newly created trapezoids or the deleted conflict lists. Clearly, the running time of the algorithm in the ith
iteration is linear in the total size of these conflict lists. Observe that every conflict gets charged twice – when
it is being created and when it is being deleted. As such, the (amortized) running time in the ith iteration is
proportional to the total length of the newly created conflict lists. ■
Thus, to bound the running time of the algorithm, it is enough to bound the expected size of the destroyed conflict lists in the ith iteration (and sum this bound over the n iterations carried out by the algorithm), or, alternatively, to bound the expected size of the conflict lists created in the ith iteration.
Lemma 51.1.2. Let S be a set of n segments (in general position¬) with k intersection points. Let Si be the first i segments in a random permutation of S. The expected size of Bi = A|(Si), denoted by τ(i) (i.e., the number of trapezoids in Bi), is O(i + k(i/n)²).
Proof: Consider an intersection point p = s ∩ s′, where s, s′ ∈ S. The probability that p is present in A|(Si) is equivalent to the probability that both s and s′ are in Si. This probability is
α = \binom{n−2}{i−2} / \binom{n}{i} = [(n − 2)!/((i − 2)! (n − i)!)] · [i! (n − i)!/n!] = i(i − 1)/(n(n − 1)).
For each intersection point p in A(S) define an indicator variable Xp, which is 1 if the two segments defining p are in the random sample Si and 0 otherwise. We have that E[Xp] = α, and as such, by linearity of expectation, the expected number of intersection points in the arrangement A(Si) is
E[∑_{p∈V} Xp] = ∑_{p∈V} E[Xp] = ∑_{p∈V} α = kα,
where V is the set of k intersection points of A(S). Also, every segment of Si contributes its two endpoints to the arrangement A(Si). Thus, we have that the expected number of vertices in A(Si) is
2i + (i(i − 1)/(n(n − 1))) k.
Now, the number of trapezoids in A| Si is proportional to the number of vertices of A(Si ), which implies the
claim. ■
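The probability α is just the chance that two fixed segments both land in a uniformly random i-element prefix of a random permutation; the following small sanity check (illustration only, not part of the analysis) compares an empirical estimate against i(i − 1)/(n(n − 1)).

```python
import random

def prob_both_in_prefix(n, i, trials=200_000):
    """Empirical probability that two fixed items (0 and 1) both appear in a
    uniformly random i-element prefix of a random permutation of n items."""
    hits = 0
    items = list(range(n))
    for _ in range(trials):
        random.shuffle(items)
        prefix = set(items[:i])
        hits += (0 in prefix and 1 in prefix)
    return hits / trials

if __name__ == "__main__":
    n, i = 30, 10
    print("empirical:          ", prob_both_in_prefix(n, i))
    print("i(i-1)/(n(n-1)) =   ", i * (i - 1) / (n * (n - 1)))
```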
We are interested in bounding the expected size of Ci , since this is (essentially) the amount of work done by
the algorithm in this iteration. Observe that the structure of Bi is defined independently of the permutation Si
and depends only on the (unordered) set Si = {s1 , . . . , si }. So, fix Si . What is the probability that si is a specific
segment s of Si ? Clearly, this is 1/i since this is the probability of s being the last element in a permutation of
the i elements of Si (i.e., we consider a random permutation of Si ).
Now, consider a trapezoid σ ∈ Bi . If σ was created in the ith iteration, then si must be one of the (at most
four) segments that define it. Indeed, if si is not one of the segments that define σ, then σ existed in the vertical
decomposition before si was inserted. Since Bi is independent of the internal ordering of Si , it follows that
P[σ ∈ (Bi \ Bi−1)] ≤ 4/i. In particular, the overall size of the conflict lists at the end of the ith iteration is
Wi = ∑_{σ∈Bi} |cl(σ)|.
As such, the expected overall size of the conflict lists created in the ith iteration is
E[Ci | Bi] ≤ ∑_{σ∈Bi} (4/i) |cl(σ)| ≤ (4/i) Wi.
By Lemma 51.1.2, the expected size of Bi is O(i + k i²/n²). Let us guess (for the time being) that on average the size of the conflict list of a trapezoid of Bi is about O(n/i). In particular, assume that we know that
E[Wi] = O((n/i)(i + k i²/n²)) = O(n + k i/n),
by Lemma 51.1.2, implying
E[Ci] = E[E[Ci | Bi]] ≤ E[(4/i) Wi] = (4/i) E[Wi] = O((4/i)(n + k i/n)) = O(n/i + k/n),   (51.1)
using Lemma 11.1.2. In particular, the expected (amortized) amount of work in the ith iteration is proportional to E[Ci]. Thus, the overall expected running time of the algorithm is
E[∑_{i=1}^{n} Ci] = O(∑_{i=1}^{n} (n/i + k/n)) = O(n log n + k).
Theorem 51.1.3. Given a set S of n segments in the plane with k intersections, one can compute the vertical decomposition of A(S) in expected O(n log n + k) time.
Intuition and discussion. What remains to be seen is how we came up with the guess that the average size of
a conflict list of a trapezoid of Bi is about O(n/i). Note that using ε-nets implies that the bound O((n/i) log i)
holds with constant probability (see Theorem 38.3.4) for all trapezoids in this arrangement. As such, this result
is only slightly surprising. To prove this, we present in the next section a “strengthening” of ε-nets to geometric
settings.
The notation used by the framework, with the vertical decomposition of segments as the running example:
(A) S: a set of n objects (here: segments).
(B) R ⊆ S: a subset of the objects.
(C) σ: notation for a region induced by some objects of S (here: a vertical trapezoid).
(D) D(σ) ⊆ S: the defining set of σ, the minimal set of objects inducing σ (here: the subset of segments defining σ; see Figure 51.3).
(E) K(σ) ⊆ S: the stopping set of σ, all objects in S that prevent σ from being created (here: all segments in S that intersect the interior of the vertical trapezoid σ; see Figure 51.3).
(F) d: the combinatorial dimension, the maximum size of a defining set (here: d = 4, as every vertical trapezoid is defined by at most four segments).
(G) ω(σ) = |K(σ)|: the weight of σ.
(H) F(R): the decomposition, the set of regions defined by R (here: the set of vertical trapezoids defined by R).
(I) T = T(S): the set of all possible regions defined by subsets of S (here: all vertical trapezoids that can be induced by the segments of S).
(J) ρ_{r,n}(d, k): the probability of a region σ ∈ T to appear in the decomposition of a random sample R ⊆ S of size r, where σ is defined by d objects and its stopping set is of size k.
(K) A region σ ∈ F(R) is t-heavy if ω(σ) ≥ tn/r, where r = |R|.
(L) F_{≥t}(R): the set of all t-heavy regions of F(R).
(M) Ef(r) = E[|F(R)|]: the expected complexity of the decomposition for a sample of size r.
(N) Ef_{≥t}(r) = E[|F_{≥t}(R)|]: the expected number of regions that are t-heavy in the decomposition of a random sample of size r.

To get some intuition on how we came up with this guess, consider a set P of n points on the line and a random sample R of i points from P. Let Î be the partition of the real line into (maximal) open intervals by the endpoints of R, such that these intervals do not contain points of R in their interior.
Consider an interval (i.e., a one-dimensional trapezoid) of Î. It is intuitively clear that this interval (in expectation) would contain O(n/i) points. Indeed, fix a point x on the real line, and imagine that we pick each point with probability i/n to be in the random sample. The random variable which is the number of points of P we have to scan starting from x and going to the right of x till we "hit" a point that is in the random sample behaves like a geometric variable with probability i/n, and as such its expected value is n/i. The same argument works if we scan P to the left of x. We conclude that the number of points of P in the interval of Î that contains x but does not contain any point of R is O(n/i) in expectation.
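This intuition is easy to check numerically. The sketch below (illustration only; the parameters are arbitrary) samples each point with probability i/n and measures how many points land in the gap containing a fixed query point x; the average is O(n/i), as claimed.

```python
import random

def avg_gap_weight(n=10_000, i=100, x=0.5, trials=300):
    """Average number of points of P in the interval of the sample-induced partition
    that contains x, when each point is sampled independently with probability i/n."""
    total = 0
    for _ in range(trials):
        P = [random.random() for _ in range(n)]
        R = [p for p in P if random.random() < i / n]
        left = max((q for q in R if q <= x), default=float('-inf'))
        right = min((q for q in R if q > x), default=float('inf'))
        total += sum(1 for p in P if left < p < right)
    return total / trials

if __name__ == "__main__":
    random.seed(1)
    n, i = 10_000, 100
    print("average gap weight:", avg_gap_weight(n, i))
    # The answer is O(n/i): roughly twice n/i, one geometric-like tail on each side of x.
    print("n/i =", n / i)
```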
Of course, the vertical decomposition case is more involved, as each vertical trapezoid is defined by four
input segments. Furthermore, the number of possible vertical trapezoids is larger. Instead of proving the
required result for this special case, we will prove a more general result which can be applied in a lot of other
settings.
51.2.1. Notation
Let S be a set of objects. For a subset R ⊆ S, we define a collection of ‘regions’ denoted by F (R). For the
case of vertical decomposition of segments (i.e., Theorem 51.1.3), the objects are segments, the regions are
trapezoids, and F(R) is the set of vertical trapezoids in A|(R). Let
T = T(S) = ⋃_{R⊆S} F(R).
Definition 51.2.1 (Framework axioms). Let S, F (R), D(σ), and K(σ) be such that for any subset R ⊆ S, the
set F (R) satisfies the following axioms:
(i) For any σ ∈ F (R), we have D(σ) ⊆ R and R ∩ K(σ) = ∅.
(ii) If D(σ) ⊆ R and K(σ) ∩ R = ∅, then σ ∈ F (R).
(D) Edges of the convex-hull in 3d. Let S be a set of points in three dimensions. An edge e of the convex-hull of a set R ⊆ S of points in R³ is defined by two vertices of S, and it can be certified as being on the convex hull CH(R) by the two faces f, f′ adjacent to e. If all the points of R are on the "right" side of both these two faces then e is an edge of the convex hull of R. Computing all the certified edges of S is equivalent to computing the convex-hull of S.
In the following, assume that each face of any convex-hull of a subset of points of S is a triangle. As such, a face of the convex-hull is defined by three points. Formally, the butterfly of an edge e of CH(R) is (e, p, u), where p, u ∈ R, and such that all the points of R are on the same side as u of the plane spanned by e and p (with the symmetric condition requiring that all the points of R are on the same side as p of the plane spanned by e and u).
For a set R ⊆ S, let F(R) be its set of butterflies. Clearly, computing all the butterflies of S (i.e., F(S)) is equivalent to computing the convex-hull of S.
For a butterfly σ = (e, p, u) ∈ F(R), its defining set (i.e., D(σ)) is a set of four points (i.e., the two points defining its edge e, and the two additional vertices p and u defining the two faces f and f′ adjacent to it). Its stopping set K(σ) is the set of all the points of S \ R that lie on a different side of the plane spanned by e and p (resp. e and u) than u (resp. p) [here, the stopping set is the union of these two sets].
(E) Delaunay triangles in 2d. Let S be a set of n points in the plane, and consider a subset R ⊆ S. A Delaunay circle of R is a disc D that has three points p1, p2, p3 of R on its boundary and no points of R in its interior. Naturally, these three points define a Delaunay triangle △ = △p1p2p3. The defining set is D(△) = {p1, p2, p3}, and the stopping set K(△) is the set of all points in S that are contained in the interior of the disk D.
51.2.2. Analysis
In the following, S is a set of n objects complying with (i) and (ii) of Definition 51.2.1.
The challenge. What makes the analysis not easy is that there are dependencies between the defining set of a
region and its stopping set (i.e., conflict list). In particular, we have the following difficulties
(A) The defining set might be of different sizes depending on the region σ being considered.
(B) Even if all the regions have a defining set of the same size d (say, 4 as in the case of vertical trapezoids),
it is not true that every d objects define a valid region. For example, for the case of segments, the four
segments might be vertically separated from each other (i.e., think about them as being four disjoint
intervals on the real line), and they do not define a vertical trapezoid together. Thus, our analysis is going to be a bit loopy loop: we are going to assume we know how many regions exist (in expectation) for a random sample of a certain size, and use this to derive the desired bounds.
Let ρ_{r,n}(d, k) denote the probability that a region σ ∈ T appears in F(R), where its defining set is of size d, its stopping set is of size k, R is a random sample of size r from a set S, and n = |S|. Specifically, σ is a feasible region that might be created by an algorithm computing F(R).
The sampling model. For describing algorithms it is usually easier to work with samples created by picking
a subset of a certain size (without repetition) from the original set of objects. Usually, in the algorithmic
applications this would be done by randomly permuting the objects and interpreting a prefix of this permutation
as a random sample. Insisting on analyzing this framework in the “right” sampling model creates some non-
trivial technical pain.
Lemma 51.2.2. We have that ρ_{r,n}(d, k) ≈ (1 − r/n)^k (r/n)^d. Formally,
(1/2^{2d}) (1 − 4r/n)^k (r/n)^d ≤ ρ_{r,n}(d, k) ≤ 2^{2d} (1 − r/(2n))^k (r/n)^d.   (51.2)
Proof: Let σ be the region under consideration that is defined by d objects and has k stoppers (i.e., k = |K(σ)|). We are interested in the probability of σ being created when taking a sample of size r (without repetition) from a set S of n objects. Clearly, this probability is ρ_{r,n}(d, k) = \binom{n−d−k}{r−d} / \binom{n}{r}, as we have to pick the d defining objects into the random sample and avoid picking any of the k stoppers. A tedious but careful calculation, delegated to Section 51.4, implies Eq. (51.2).
Instead, here is an elegant argument for why this estimate is correct in a slightly different sampling model.
We pick every element of S into the sample R with probability r/n, and this is done independently for each
object. In expectation, the random sample is of size r, and clearly the probability that σ is created is the
probability that we pick its d defining objects (that is, (r/n)d ) multiplied by the probability that we did not pick
any of its k stoppers (that is, (1 − r/n)k ). ■
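The two sampling models can be compared directly using the exact expression ρ_{r,n}(d, k) = \binom{n−d−k}{r−d}/\binom{n}{r} from the proof. A small sketch (illustration only):

```python
from math import comb

def rho_exact(n, r, d, k):
    """Probability that a region with d definers and k stoppers appears in F(R),
    for R a uniformly random r-subset of an n-element set."""
    if r < d or n - d - k < r - d:
        return 0.0
    return comb(n - d - k, r - d) / comb(n, r)

def rho_bernoulli(n, r, d, k):
    """Estimate in the model where each object is kept independently with prob. r/n."""
    return (1 - r / n) ** k * (r / n) ** d

if __name__ == "__main__":
    n, d = 10_000, 4
    for r in (100, 500):
        for k in (n // r, 5 * n // r):
            print(f"r={r:4d} k={k:4d}  exact={rho_exact(n, r, d, k):.3e}"
                  f"  bernoulli={rho_bernoulli(n, r, d, k):.3e}")
```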
Remark 51.2.3. The bounds of Eq. (51.2) hold only when r, d, and k are in certain (reasonable) ranges. For
the sake of simplicity of exposition we ignore this minor issue. With care, all our arguments work when one
pays careful attention to this minor technicality.
Consider a region σ of weight ω(σ) ≥ t(n/r). Intuitively, such a region has probability roughly (1 − r/n)^{t n/r} ≈ e^{−t} to be created, since (1 − r/n)^{n/r} ≈ 1/e. Namely, a t-heavy region has an exponentially lower probability to be created than a region of weight n/r. We next formalize this argument.
Lemma 51.2.4. Let r ≤ n and let t be parameters, such that 1 ≤ t ≤ r/d. Furthermore, let R be a sample of size r, and let R′ be a sample of size r′ = ⌊r/t⌋, both from S. Let σ ∈ T be a region with weight ω(σ) ≥ t(n/r).® Then, P[σ ∈ F(R)] = O(t^d exp(−t/2)) · P[σ ∈ F(R′)].
® These are the regions that are at least t times overweight. Speak about an obesity problem.
Proof: For the sake of simplicity of exposition, assume that k = ω(σ) = t(n/r). By Lemma 51.2.2 (i.e., Eq. (51.2)) we have
P[σ ∈ F(R)] / P[σ ∈ F(R′)] = ρ_{r,n}(d, k) / ρ_{r′,n}(d, k)
≤ [2^{2d} (1 − r/(2n))^k (r/n)^d] / [(1/2^{2d}) (1 − 4r′/n)^k (r′/n)^d]
≤ 2^{4d} exp(−kr/(2n)) (1 + 8r′/n)^k (r/r′)^d
≤ 2^{4d} exp(8kr′/n − kr/(2n)) (r/r′)^d
= 2^{4d} exp(8 t n ⌊r/t⌋/(n r) − t n r/(2 n r)) (r/⌊r/t⌋)^d = O(exp(−t/2) t^d),
since 1/(1 − x) ≤ 1 + 2x for x ≤ 1/2 and 1 + y ≤ exp(y), for all y. (The constant in the above O(·) depends exponentially on d.) ■
Let
Ef (r) = E[|F (R)|] and Ef≥t (r) = E[|F≥t (R)|] ,
where the expectation is over random subsets R ⊆ S of size r. Note that Ef (r) = Ef≥0 (r) is the expected
number of regions created by a random sample of size r. In words, Ef≥t (r) is the expected number of regions
in a structure created by a sample of r random objects, such that these regions have weight which is t times
larger than the “expected” weight (i.e., n/r). In the following, we assume that Ef (r) is a monotone increasing
function.
Lemma 51.2.5 (The exponential decay lemma). Given a set S of n objects and parameters r ≤ n and 1 ≤
t ≤ r/d, where d = maxσ∈T (S) |D(σ)|, if axioms (i) and (ii) above hold for any subset of S, then
Ef_{≥t}(r) = O(t^d exp(−t/2) Ef(r)).   (51.3)
Proof: Let R be a random sample of size r from S and let R′ be a random sample of size r′ = ⌊r/t⌋ from S. Let H = ⋃_{X⊆S, |X|=r} F_{≥t}(X) denote the set of all t-heavy regions that might be created by a sample of size r. In the following, the expectation is taken over the contents of the random samples R and R′.
For a region σ, let X_σ be the indicator variable that is 1 if and only if σ ∈ F(R). By linearity of expectation and since E[X_σ] = P[σ ∈ F(R)], we have
Ef_{≥t}(r) = E[|F_{≥t}(R)|] = E[∑_{σ∈H} X_σ] = ∑_{σ∈H} E[X_σ] = ∑_{σ∈H} P[σ ∈ F(R)]
= O(t^d exp(−t/2)) ∑_{σ∈H} P[σ ∈ F(R′)] ≤ O(t^d exp(−t/2)) ∑_{σ∈T} P[σ ∈ F(R′)]
= O(t^d exp(−t/2)) Ef(r′) = O(t^d exp(−t/2)) Ef(r),
by Lemma 51.2.4 and since Ef(·) is monotone increasing and r′ ≤ r. ■
Theorem 51.2.6 (Bounded moments theorem). Let R ⊆ S be a random subset of size r. Let Ef(r) = E[|F(R)|] and let c ≥ 1 be an arbitrary constant. Then,
E[∑_{σ∈F(R)} (ω(σ))^c] = O((n/r)^c Ef(r)).
Proof: Let R ⊆ S be a random sample of size r. Observe that all the regions with weight in the range [(t − 1)(n/r), t(n/r)) are in the set F_{≥t−1}(R) \ F_{≥t}(R). As such, we have by Lemma 51.2.5 that
W = E[∑_{σ∈F(R)} ω(σ)^c] ≤ E[∑_{t≥1} (t n/r)^c (|F_{≥t−1}(R)| − |F_{≥t}(R)|)] ≤ E[∑_{t≥1} (t n/r)^c |F_{≥t−1}(R)|]
≤ (n/r)^c ∑_{t≥0} (t + 1)^c E[|F_{≥t}(R)|] = (n/r)^c ∑_{t≥0} (t + 1)^c Ef_{≥t}(r) = O((n/r)^c ∑_{t≥0} (t + 1)^{c+d} exp(−t/2) Ef(r))
= O((n/r)^c Ef(r) ∑_{t≥0} (t + 1)^{c+d} exp(−t/2)) = O((n/r)^c Ef(r)),
since the series ∑_{t≥0} (t + 1)^{c+d} exp(−t/2) converges to a constant that depends only on c and d. ■
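In the one-dimensional setting used for intuition earlier (the regions are the gaps induced by a random r-subset R of n points on a line, and ω(σ) is the number of points in a gap), the bounded moments theorem predicts E[∑_σ ω(σ)^c] = O((n/r)^c · r). A quick simulation (illustration only) agrees up to constants:

```python
import random

def moment_sum(n=2000, r=40, c=2, trials=100):
    """Average over trials of sum over gaps of (number of points of P in the gap)^c,
    where the gaps are induced by a uniformly random r-subset R of the n points."""
    total = 0.0
    for _ in range(trials):
        P = sorted(random.random() for _ in range(n))
        Rset = set(random.sample(P, r))
        weights, count = [], 0
        for p in P:
            if p in Rset:
                weights.append(count)      # close the gap ending at this sample point
                count = 0
            else:
                count += 1
        weights.append(count)              # last (unbounded) gap
        total += sum(w ** c for w in weights)
    return total / trials

if __name__ == "__main__":
    random.seed(3)
    n, r, c = 2000, 40, 2
    print("E[sum w^c]       ~", moment_sum(n, r, c))
    print("(n/r)^c * (r+1)  =", (n / r) ** c * (r + 1))
```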
51.3. Applications
51.3.1. Analyzing the RIC algorithm for vertical decomposition
We remind the reader that the input of the algorithm of Section 51.1.2 is a set S of n segments with k in-
tersections, and it uses randomized incremental construction to compute the vertical decomposition of the
arrangement A S .
Lemma 51.1.2 shows that the number of vertical trapezoids in the randomized incremental construction is in expectation Ef(i) = O(i + k(i/n)²). Thus, by Theorem 51.2.6 (used with c = 1), we have that the total expected size of the conflict lists of the vertical decomposition computed in the ith step is
E[Wi] = E[∑_{σ∈Bi} ω(σ)] = O((n/i) Ef(i)) = O(n + k i/n).
This is the missing piece in the analysis of Section 51.1.2. Indeed, the amortized work in the ith step of the
algorithm is O(Wi /i) (see Eq. (51.1)), and as such, the expected running time of this algorithm is
E[∑_{i=1}^{n} O(Wi/i)] = O(∑_{i=1}^{n} (1/i)(n + k i/n)) = O(n log n + k).
51.3.2. Cuttings
Let S be a set of n lines in the plane, and let r be an arbitrary parameter. A (1/r)-cutting of S is a partition of
the plane into constant complexity regions such that each region intersects at most n/r lines of S. It is natural
to try to minimize the number of regions in the cutting, as cuttings are a natural tool for performing “divide and
conquer”.
A neat proof of the existence of suboptimal cuttings follows readily from the exponential decay lemma.
Lemma 51.3.1. Let S be a set of n segments in the plane, and let R be a random sample from S of size ℓ = cr ln r, where c is a sufficiently large constant. Then, with probability ≥ 1 − 1/r^{O(1)}, the vertical decomposition of R is a (1/r)-cutting of size O(r² log² r).
Proof: In our case, the vertical decomposition has complexity Ef(ℓ) = O(ℓ²), as ℓ segments have at most \binom{ℓ}{2} intersections. For t = c ln r, a vertical trapezoid σ in A|(R) is bad if ω(σ) > n/r = t(n/ℓ). But such a trapezoid is t-heavy. Let X be the random variable that is the number of bad trapezoids in A|(R). The exponential decay lemma (Lemma 51.2.5) states that
E[X] = Ef_{≥t}(ℓ) = O(t² exp(−t/2) Ef(ℓ)) = O((c ln r)² exp(−(c ln r)/2) ℓ²) = O((ln r)² r^{−c/2} (r log r)²) < 1/r^{c/4},
if c is sufficiently large. As such, we have P[X ≥ 1] ≤ 1/r^{c/4} by Markov's inequality. ■
Lemma 51.3.2. Let S be a set of n lines in the plane, and let r be a parameter. There exists a (1/r)-cutting of S of size O(r² log² r).

Proof: Consider the range space having S as its ground set and vertical trapezoids as its ranges (i.e., given a vertical trapezoid σ, its corresponding range is the set of all lines of S that intersect the interior of σ). This range space has a VC dimension which is a constant, as can be easily verified. Let X ⊆ S be an ε-net for this range space, for ε = 1/r. By Theorem 38.3.4 (ε-net theorem), there exists such an ε-net X of this range space of size O((1/ε) log(1/ε)) = O(r log r). In fact, Theorem 38.3.4 states that an appropriate random sample is an ε-net with non-zero probability, which implies, by the probabilistic method, that such a net (of this size) exists.
Consider the vertical decomposition A|(X), where X is as above. We claim that this collection of trapezoids is the desired cutting.
The bound on the size is immediate, as the complexity of A|(X) is O(|X|²) and |X| = O(r log r).
As for correctness, consider a vertical trapezoid σ in the arrangement A|(X). It does not intersect any of the lines of X in its interior, since it is a trapezoid in the vertical decomposition A|(X). Now, if σ intersected more than n/r lines of S in its interior, where n = |S|, then it must be that the interior of σ intersects one of the lines of X, since X is an ε-net for S, a contradiction.
It follows that σ intersects at most εn = n/r lines of S in its interior. ■
Claim 51.3.3. Any (1/r)-cutting in the plane of n lines contains at least Ω(r²) regions.

Proof: An arrangement of n lines (in general position) has M = \binom{n}{2} intersections. However, the number of intersections of the lines intersecting a single region of the cutting is at most m = \binom{n/r}{2}. This implies that any cutting must be of size at least M/m = Ω(n²/(n/r)²) = Ω(r²). ■
We can get cuttings of size matching the above lower bound using the moments technique.
Theorem 51.3.4. Let S be a set of n lines in the plane, and let r be a parameter. One can compute a (1/r)-cutting of S of size O(r²).
Proof: Let R ⊆ S be a random sample of size r, and consider its vertical decomposition A| R . If a vertical
trapezoid σ ∈ A| R intersects at most n/r lines of S, then we can add it to the output cutting. The other
possibility is that a σ intersects t(n/r) lines of S, for some t > 1, and let cl(σ) ⊂ S be the conflict list of σ (i.e.,
the list of lines of S that intersect the interior of σ). Clearly, a (1/t)-cutting for the set cl(σ) forms a vertical
decomposition (clipped inside σ) such that each trapezoid in this cutting intersects at most n/r lines of S. Thus,
we compute such a cutting inside each such “heavy” trapezoid using the algorithm (implicit in the proof) of
Lemma 51.3.2, and we add these subtrapezoids to the resulting cutting. Clearly, the size of the resulting cutting inside σ is O((t log t)²) = O(t⁴). The resulting two-level partition is clearly the required cutting. By Theorem 51.2.6 (used with c = 4), the expected size of the resulting cutting is
O(Ef(r)) + E[∑_{σ∈F(R)} O((ω(σ)/(n/r))⁴)] = O(Ef(r) + (r/n)⁴ E[∑_{σ∈F(R)} ω(σ)⁴]) = O(Ef(r) + (r/n)⁴ (n/r)⁴ Ef(r)) = O(Ef(r)) = O(r²),
as Ef(r) = O(r²) for an arrangement of r lines. ■
Proof: So, consider a region σ with d defining objects in D(σ) and k detractors in K(σ). We have to pick the d
defining objects of D(σ) to be in the random sample R of size r but avoid picking any of the k objects of K(σ)
to be in R.
The second part follows since \binom{n}{r} = \binom{n}{r−d} \binom{n−(r−d)}{d} / \binom{r}{d}. Indeed, for the right-hand side first pick a sample of size r − d and then a sample of size d from the remaining objects. Merging the two random samples, we get a random sample of size r. However, since we do not care if an object is in the first sample or the second sample, we observe that every such random sample is being counted \binom{r}{d} times.
The third part is easier, as it follows from \binom{n}{r−d} \binom{n−(r−d)}{d} = \binom{n}{d} \binom{n−d}{r−d}. The two sides count the different ways to pick two disjoint subsets of a set of size n, the first one of size d and the second one of size r − d. ■
Lemma 51.4.2. For M ≥ m ≥ t ≥ 0, we have ((m − t)/(M − t))^t ≤ \binom{m}{t} / \binom{M}{t} ≤ (m/M)^t.

Proof: We have that
α = \binom{m}{t} / \binom{M}{t} = [m!/((m − t)! t!)] · [(M − t)! t!/M!] = (m/M) · ((m − 1)/(M − 1)) ⋯ ((m − t + 1)/(M − t + 1)).
Now, since M ≥ m, we have that (m − i)/(M − i) ≤ m/M, for all i ≥ 0. As such, the maximum (resp. minimum) fraction on the right-hand side is m/M (resp. (m − t + 1)/(M − t + 1)). As such, we have ((m − t)/(M − t))^t ≤ ((m − t + 1)/(M − t + 1))^t ≤ α ≤ (m/M)^t. ■
Lemma 51.4.3. Let 0 ≤ X, Y ≤ N. We have that (1 − X/N)^Y ≤ (1 − Y/(2N))^X.

Proof: Since 1 − α ≤ exp(−α) ≤ 1 − α/2, for 0 ≤ α ≤ 1, it follows that
(1 − X/N)^Y ≤ exp(−XY/N) = (exp(−Y/N))^X ≤ (1 − Y/(2N))^X. ■
51.5. Bibliographical notes

(In some formulations of the framework, axiom (ii) is replaced by the following condition: if σ ∈ F(R) and R′ is a subset of R with D(σ) ⊆ R′, then σ ∈ F(R′).)
Interestingly, Clarkson [Cla88] did not prove Theorem 51.2.6 using the exponential decay lemma but gave
a direct proof. In fact, his proof implicitly contains the exponential decay lemma. We chose the current
exposition since it is more modular and provides a better intuition of what is really going on and is hopefully
slightly simpler. In particular, Lemma 51.2.2 is inspired by the work of Sharir [Sha03].
The exponential decay lemma (Lemma 51.2.5) was proved by Chazelle and Friedman [CF90]. The work of
Agarwal et al. [AMS98] is a further extension of this result. Another analysis was provided by Clarkson et al.
[CMS93].
Another way to reach similar results is using the technique of Mulmuley [Mul94], which relies on a direct
analysis on ‘stoppers’ and ‘triggers’. This technique is somewhat less convenient to use but is applicable to
some settings where the moments technique does not apply directly. Also, his concept of the omega function
might explain why randomized incremental algorithms perform better in practice than their worst case analysis
[Mul89].
Backwards analysis in geometric settings was first used by Chew [Che86] and was formalized by Seidel
[Sei93]. It is similar to the “leave one out” argument used in statistics for cross validation. The basic idea was
probably known to the Greeks (or Russians or French) at some point in time.
(Naturally, our summary of the development is cursory at best and not necessarily accurate, and all possible
disclaimers apply. A good summary is provided in the introduction of [Sei93].)
Sampling model. As a rule of thumb all the different sampling approaches are similar and yield similar results.
For example, we used such an alternative sampling approach in the “proof” of Lemma 51.2.2. It is a good idea
to use whichever sampling scheme is the easiest to analyze in figuring out what’s going on. Of course, a formal
proof requires analyzing the algorithm in the sampling model it uses.
Lazy randomized incremental construction. If one wants to compute a single face that contains a marked point in an arrangement of curves, then the problem with using randomized incremental construction is that as you add curves, the region of interest shrinks, and regions that were being maintained should be ignored. One option is to perform flooding in the vertical decomposition to figure out which trapezoids are still reachable from the marked point, and to maintain only these trapezoids in the conflict graph. Doing it in each iteration
is way too expensive, but luckily one can use a lazy strategy that performs this cleanup only a logarithmic
number of times (i.e., you perform a cleanup in an iteration if the iteration number is, say, a power of 2). This
strategy complicates the analysis a bit; see [BDS95] for more details on this lazy randomized incremental
construction technique. An alternative technique was suggested by the author for the (more restricted) case of
planar arrangements; see [Har00b]. The idea is to compute only what the algorithm really needs to compute the
output, by computing the vertical decomposition in an exploratory online fashion. The details are unfortunately
overwhelming although the algorithm seems to perform quite well in practice.
Cuttings. The concept of cuttings was introduced by Clarkson. The first optimal size cuttings were constructed
by Chazelle and Friedman [CF90], who proved the exponential decay lemma to this end. Our elegant proof
follows the presentation by de Berg and Schwarzkopf [BS95]. The problem with this approach is that the
constant involved in the cutting size is awful¯. Matoušek [Mat98] showed that there are (1/r)-cuttings with 8r² + 6r + 4 trapezoids, by using level approximation. A different approach was taken by the author [Har00a], who
showed how to get cuttings which seem to be quite small (i.e., constant-wise) in practice. The basic idea is
to do randomized incremental construction but at each iteration greedily add all the trapezoids with conflict
list small enough to the cutting being output. One can prove that this algorithm also generates O(r2 ) cuttings,
¯
This is why all computations related to cuttings should be done on a waiter’s bill pad. As Douglas Adams put it: “On a waiter’s
bill pad, reality and unreality collide on such a fundamental level that each becomes the other and anything is possible, within certain
parameters.”
but the details are not trivial as the framework described in this chapter is not applicable for analyzing this
algorithm.
Cuttings also can be computed in higher dimensions for hyperplanes. In the plane, cuttings can also be
computed for well-behaved curves; see [SA95].
Another fascinating concept is shallow cuttings. These are cuttings covering only portions of the arrange-
ment that are in the “bottom” of the arrangement. Matoušek came up with the concept [Mat92]. See [AES99,
CCH09] for extensions and applications of shallow cuttings.
Even more on randomized algorithms in geometry. We have only scratched the surface of this fascinating
topic, which is one of the cornerstones of “modern” computational geometry. The interested reader should have
a look at the books by Mulmuley [Mul94], Sharir and Agarwal [SA95], Matoušek [Mat02], and Boissonnat
and Yvinec [BY98].
51.6. Exercises
Exercise 51.6.1 (Convex hulls incrementally). Let P be a set of n points in the plane.
(A) Describe a randomized incremental algorithm for computing the convex hull CH(P). Bound the expected
running time of your algorithm.
(B) Assume that for any subset of P, its convex hull has complexity t (i.e., the convex hull of the subset has t
edges). What is the expected running time of your algorithm in this case? If your algorithm is not faster
for this case (for example, think about the case where t = O(log n)), describe a variant of your algorithm
which is faster for this case.
Exercise 51.6.2 (Compressed quadtree made incremental). Given a set P of n points in Rd , describe a ran-
domized incremental algorithm for building a compressed quadtree for P that works in expected O(dn log n)
time. Prove the bound on the running time of your algorithm.
References
[AES99] P. K. Agarwal, A. Efrat, and M. Sharir. Vertical decomposition of shallow levels in 3-dimensional
arrangements and its applications. SIAM J. Comput., 29: 912–953, 1999.
[AMS98] P. K. Agarwal, J. Matoušek, and O. Schwarzkopf. Computing many faces in arrangements of
lines and segments. SIAM J. Comput., 27(2): 491–505, 1998.
[BCKO08] M. de Berg, O. Cheong, M. J. van Kreveld, and M. H. Overmars. Computational Geometry:
Algorithms and Applications. 3rd. Santa Clara, CA, USA: Springer, 2008.
[BDS95] M. de Berg, K. Dobrindt, and O. Schwarzkopf. On lazy randomized incremental construction.
Discrete Comput. Geom., 14: 261–286, 1995.
[BS95] M. de Berg and O. Schwarzkopf. Cuttings and applications. Int. J. Comput. Geom. Appl., 5: 343–
355, 1995.
[BY98] J.-D. Boissonnat and M. Yvinec. Algorithmic Geometry. Cambridge University Press, 1998.
[CCH09] C. Chekuri, K. L. Clarkson., and S. Har-Peled. On the set multi-cover problem in geometric
settings. Proc. 25th Annu. Sympos. Comput. Geom. (SoCG), 341–350, 2009.
[CF90] B. Chazelle and J. Friedman. A deterministic view of random sampling and its use in geometry.
Combinatorica, 10(3): 229–249, 1990.
[Che86] L. P. Chew. Building Voronoi diagrams for convex polygons in linear expected time. Technical
Report PCS-TR90-147. Hanover, NH: Dept. Math. Comput. Sci., Dartmouth College, 1986.
[Cla87] K. L. Clarkson. New applications of random sampling in computational geometry. Discrete Com-
put. Geom., 2: 195–222, 1987.
[Cla88] K. L. Clarkson. Applications of random sampling in computational geometry, II. Proc. 4th Annu.
Sympos. Comput. Geom. (SoCG), 1–11, 1988.
[CMS93] K. L. Clarkson, K. Mehlhorn, and R. Seidel. Four results on randomized incremental construc-
tions. Comput. Geom. Theory Appl., 3(4): 185–212, 1993.
[CS89] K. L. Clarkson and P. W. Shor. Applications of random sampling in computational geometry, II.
Discrete Comput. Geom., 4(5): 387–421, 1989.
[Har00a] S. Har-Peled. Constructing planar cuttings in theory and practice. SIAM J. Comput., 29(6): 2016–
2039, 2000.
[Har00b] S. Har-Peled. Taking a walk in a planar arrangement. SIAM J. Comput., 30(4): 1341–1367, 2000.
[Mat02] J. Matoušek. Lectures on Discrete Geometry. Vol. 212. Grad. Text in Math. Springer, 2002.
[Mat92] J. Matoušek. Reporting points in halfspaces. Comput. Geom. Theory Appl., 2(3): 169–186, 1992.
[Mat98] J. Matoušek. On constants for cuttings in the plane. Discrete Comput. Geom., 20: 427–448, 1998.
[Mul89] K. Mulmuley. An efficient algorithm for hidden surface removal. Comput. Graph., 23(3): 379–
388, 1989.
[Mul94] K. Mulmuley. Computational Geometry: An Introduction Through Randomized Algorithms. En-
glewood Cliffs, NJ: Prentice Hall, 1994.
[SA95] M. Sharir and P. K. Agarwal. Davenport-Schinzel Sequences and Their Geometric Applications.
New York: Cambridge University Press, 1995.
[Sei93] R. Seidel. Backwards analysis of randomized geometric algorithms. New Trends in Discrete and
Computational Geometry. Ed. by J. Pach. Vol. 10. Algorithms and Combinatorics. Springer-
Verlag, 1993, pp. 37–68.
[Sha03] M. Sharir. The Clarkson-Shor technique revisited and extended. Comb., Prob. & Comput., 12(2):
191–201, 2003.
Chapter 52
Primality testing
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“The world is what it is; men who are nothing, who allow themselves to become nothing, have no
place in it.”
— Bend in the river, V.S. Naipaul
For integers x and y > 0, the quantity x mod y = x − y ⌊x/y⌋ is the remainder of x/y. The remainder x mod y is sometimes referred to as a residue.
Lemma 52.1.2. For any integers α, β > 0, one can compute, in polynomial time, integers x and y such that gcd(α, β) = xα + yβ.

Proof: If α = β then the claim trivially holds. Otherwise, assume that α > β (otherwise, swap them), and observe that gcd(α, β) = gcd(α mod β, β). In particular, by induction, there are integers x′, y′, such that gcd(α mod β, β) = x′ (α mod β) + y′ β. However, α mod β = α − β ⌊α/β⌋. As such, we have
gcd(α, β) = gcd(α mod β, β) = x′ (α − β ⌊α/β⌋) + y′ β = x′ α + (y′ − x′ ⌊α/β⌋) β,
as claimed. The running time follows immediately by modifying EuclidGCD to compute these numbers. ■
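A minimal sketch of the extended Euclidean algorithm implicit in this proof (illustration only; the function name ext_gcd is made up):

```python
def ext_gcd(a, b):
    """Return (g, x, y) with g = gcd(a, b) = x*a + y*b, for non-negative integers a, b."""
    if b == 0:
        return a, 1, 0
    g, x1, y1 = ext_gcd(b, a % b)
    # gcd(a, b) = x1*b + y1*(a mod b) = y1*a + (x1 - (a//b)*y1)*b
    return g, y1, x1 - (a // b) * y1

if __name__ == "__main__":
    g, x, y = ext_gcd(240, 46)
    assert g == 2 and x * 240 + y * 46 == g
    print(g, x, y)
```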
We use α ≡ β (mod n) or α ≡n β to denote that α and β aren congruent omodulo n; that is α mod n =
β mod n. Or put differently, we have n | (α − β). The set Zn = 0, . . . , n − 1 form a group under addition
nmodulo n (see Definition 52.1.9p338 for a formal o definition of a group). The more interesting creature is Z∗n =
x x ∈ {1, . . . , n} , x > 0, and gcd(x, n) = 1 , which is a group modulo n under multiplication.
Remark 52.1.3. Observe that Z∗1 = {1}, while for n > 1, Z∗n does not contain n.
Lemma 52.1.4. For any element α ∈ Z∗n, there exists a unique inverse element β = α^{−1} ∈ Z∗n such that α ∗ β ≡_n 1. Furthermore, the inverse can be computed in polynomial time¬.
Proof: Since α ∈ Z∗n , we have that gcd(α, n) = 1. As such, by Lemma 52.1.2, there exists x and y integers,
such that xα + yn = 1. That is xα ≡ 1 (mod n), and clearly β := x mod n is the desired inverse, and it can be
computed in polynomial time by Lemma 52.1.2.
As for uniqueness, assume that there are two inverses β, β′ of α, such that β < β′ < n. But then βα ≡_n β′α ≡_n 1, which implies that n | (β′ − β)α. Since gcd(α, n) = 1, this implies that n | β′ − β, which is impossible as 0 < β′ − β < n. ■
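A sketch of the inverse computation via the extended Euclidean algorithm (illustration only; recent Python versions expose the same computation as pow(a, -1, n)):

```python
def mod_inverse(a, n):
    """Inverse of a modulo n, assuming gcd(a, n) = 1 (Lemma 52.1.4)."""
    def ext_gcd(a, b):
        if b == 0:
            return a, 1, 0
        g, x, y = ext_gcd(b, a % b)
        return g, y, x - (a // b) * y
    g, x, _ = ext_gcd(a, n)
    if g != 1:
        raise ValueError("a is not invertible modulo n")
    return x % n

if __name__ == "__main__":
    print(mod_inverse(7, 40))   # 23, since 7*23 = 161 = 4*40 + 1
    print(pow(7, -1, 40))       # same result, using the built-in (Python 3.8+)
```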
It is now straightforward, but somewhat tedious, to verify the following (the interested reader who has not encountered this material before may want to spend some time proving it).
Lemma 52.1.5. The set Zn under the + operation modulo n is a group, as is Z∗n under multiplication modulo
n. More importantly, for a prime number p, Z p forms a field with the +, ∗ operations modulo p (see Defini-
tion 52.1.17p340 ).
¬
Again, as is everywhere in this chapter, the polynomial time is in the number of bits needed to specify the input.
52.1.1.3. The Chinese remainder theorem
Theorem 52.1.6 (Chinese remainder theorem). Let n1 , . . . , nk be coprime numbers, and let n = n1 n2 · · · nk .
For any residues r1 ∈ Zn1 , . . . , rk ∈ Znk , there is a unique r ∈ Zn , which can be computed in polynomial time,
such that r ≡ ri (mod ni ), for i = 1, . . . , k.
Proof: By the coprime property of the ni's it follows that gcd(ni, n/ni) = 1. As such, n/ni ∈ Z∗ni, and it has a unique inverse mi modulo ni; that is, (n/ni) mi ≡ 1 (mod ni). So set r = (∑_i ri mi (n/ni)) mod n. Observe that for i ≠ j, we have that nj | (n/ni), and as such ri mi (n/ni) ≡ 0 (mod nj). As such, we have
r mod nj = (∑_i ri mi (n/ni)) mod nj = (rj mj (n/nj)) mod nj = rj · 1 mod nj = rj.
As for uniqueness, if there is another such number r′, such that r < r′ < n, then (r′ − r) mod ni = 0, implying that ni | r′ − r, for all i. Since all the ni's are coprime, this implies that n | r′ − r, which is of course impossible. ■
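A sketch of the constructive proof (illustration only; the helper relies on pow(·, -1, ·) for the modular inverses mi, available in Python 3.8+):

```python
def crt(residues, moduli):
    """Given pairwise coprime moduli n_1..n_k and residues r_i in Z_{n_i}, return the
    unique r in Z_n (n = n_1 * ... * n_k) with r = r_i (mod n_i), as in the proof above."""
    n = 1
    for ni in moduli:
        n *= ni
    r = 0
    for ri, ni in zip(residues, moduli):
        Ni = n // ni
        mi = pow(Ni, -1, ni)      # inverse of n/n_i modulo n_i
        r += ri * mi * Ni
    return r % n

if __name__ == "__main__":
    r = crt([2, 3, 2], [3, 5, 7])     # the classical example: r = 23
    assert r % 3 == 2 and r % 5 == 3 and r % 7 == 2
    print(r)
```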
Lemma 52.1.7 (Fast exponentiation). Given numbers b, c, n, one can compute b^c mod n in polynomial time. Namely, computing b^c mod n can be reduced to recursively computing b^{⌊c/2⌋} mod n and a constant number of operations (on numbers that are smaller than n). Clearly, the depth of the recursion is O(log c). ■
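A sketch of the recursion (illustration only; Python's built-in pow(b, c, n) implements the same repeated-squaring idea):

```python
def mod_pow(b, c, n):
    """Compute b^c mod n by repeated squaring; O(log c) multiplications modulo n."""
    if c == 0:
        return 1 % n
    half = mod_pow(b, c // 2, n)
    result = (half * half) % n
    if c % 2 == 1:
        result = (result * b) % n
    return result

if __name__ == "__main__":
    assert mod_pow(3, 1000, 97) == pow(3, 1000, 97)
    print(mod_pow(3, 1000, 97))
```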
Lemma 52.1.8. Let n = p1^{k1} ⋯ pt^{kt} be the prime factorization of n. Then ϕ(n) = |Z∗n| = ∏_{i=1}^{t} pi^{ki − 1}(pi − 1).

Proof: Observe that ϕ(1) = 1 (see Remark 52.1.3), and for a prime number p, we have that ϕ(p) = p − 1. Now, for k > 1 and p prime we have that ϕ(p^k) = p^{k−1}(p − 1), as a number x ≤ p^k is coprime with p^k if and only if x mod p ≠ 0, and a (p − 1)/p fraction of the numbers in this range have this property.
Now, if n and m are relatively prime, then gcd(x, nm) = 1 ⟺ gcd(x, n) = 1 and gcd(x, m) = 1. In particular, there are ϕ(n)ϕ(m) pairs (α, β) ∈ Z∗n × Z∗m, such that gcd(α, n) = 1 and gcd(β, m) = 1. By the Chinese remainder theorem (Theorem 52.1.6), each such pair corresponds to a unique number in the range 1, . . . , nm that is coprime with nm, implying that ϕ(nm) = ϕ(n)ϕ(m).
Now, the claim follows by easy induction on the prime factorization of the given number. ■
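A sketch computing ϕ(n) from the prime factorization, exactly as in the induction above (illustration only; trial division suffices for small n):

```python
def phi(n):
    """Euler's totient via the prime factorization: phi(n) = prod p^(k-1)*(p-1)."""
    result = 1
    m, p = n, 2
    while p * p <= m:
        if m % p == 0:
            k = 0
            while m % p == 0:
                m //= p
                k += 1
            result *= p ** (k - 1) * (p - 1)
        p += 1
    if m > 1:                      # a single leftover prime factor
        result *= m - 1
    return result

if __name__ == "__main__":
    assert phi(1) == 1 and phi(12) == 4 and phi(97) == 96
    print(phi(1_000_000))          # 400000
```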
52.1.2. Structure of the modulo group Zn
52.1.2.1. Some basic group theory
Definition 52.1.9. A group is a set, G, together with an operation × that combines any two elements a and b
to form another element, denoted a × b or ab. To qualify as a group, the set and operation, (G, ×), must satisfy
the following:
(A) (Closure) For all a, b ∈ G, the result of the operation, a × b ∈ G.
(B) (Associativity) For all a, b, c ∈ G, we have (a × b) × c = a × (b × c).
(C) (Identity element) There exists an element i ∈ G, called the identity element, such that for every element
a ∈ G, the equation i × a = a × i = a holds.
(D) (Inverse element) For each a ∈ G, there exists an element b ∈ G such that a × b = b × a = i.
A group is abelian (aka, commutative group) if for all a, b ∈ G, we have that a × b = b × a.
In the following we restrict our attention to abelian groups since it makes the discussion somewhat simpler. In particular, some of the claims below hold even without the restriction to abelian groups.
The identity element is unique. Indeed, if both f, g ∈ G are identity elements, then f = f × g = g.
Similarly, for every element x ∈ G there exists a unique inverse y = x−1 . Indeed, if there was another inverse z,
then y = y × i = y × (x × z) = (y × x) × z = i × z = z.
52.1.2.2. Subgroups
For a group G, a subset H ⊆ G that is also a group (under the same operation) is a subgroup.
−1 −1
For x, y ∈ G, let
us define
x ∼ y if x/y ∈ H. Here x/y = xy and y is the inverse of y in G. Observe
that (y/x)(x/y) = yx−1 xy−1 = i. That is y/x is the inverse of x/y, and it is in H. But that implies that
x ∼ y =⇒ y ∼ x. Now, if x ∼ y and y ∼ z, then x/y, y/z ∈ H. But then x/y × y/z ∈ H, and furthermore
x/y × y/z = xy−1 yz−1 = xz−1 = x/z. that is x ∼ z. Together, this implies that ∼ is an equivalence relationship.
Furthermore, observe that if x/y n= x/z then y−1o = x−1 (x/y) = x−1 (x/z) = z−1 , that is y = z. In particular, the
equivalence class of x ∈ G, is [x] = z ∈ G x ∼ z . Observe that if x ∈ H then i/x = i x−1 = x−1 ∈ H, and thus
i ∼ x. That is H = [x]. The following is now easy.
Lemma 52.1.10. Let G be an abelian group, and let H ⊆ G be a subgroup. Consider the set G/H = {[x] | x ∈ G}. We claim that |[x]| = |[y]| for any x, y ∈ G. Furthermore, G/H is a group (that is, the quotient group), with [x] × [y] = [x × y].

Proof: Pick an element α ∈ [x] and β ∈ [y], and consider the mapping f(z) = zα^{−1}β. We claim that f is one to one and onto from [x] to [y]. For any γ ∈ [x], we have that γα^{−1} = γ/α ∈ H. As such, f(γ) = γα^{−1}β ∈ [β] = [y]. Now, for any γ, γ′ ∈ [x] such that γ ≠ γ′, we have that if f(γ) = γα^{−1}β = γ′α^{−1}β = f(γ′), then by multiplying by β^{−1}α, we have that γ = γ′. That is, f is one to one, implying that |[x]| = |[y]|.
The second claim follows by carefully but tediously checking that the conditions in the definition of a group hold. ■
Lemma 52.1.11. For a finite abelian group G and a subgroup H ⊆ G, we have that |H| divides |G|.
52.1.2.3. Cyclic groups
Lemma 52.1.12. For a finite group G, and any element g ∈ G, the set ⟨g⟩ = {g^i | i ≥ 0} is a group.

Proof: Since G is finite, there are integers i > j ≥ 1 such that g^i = g^j, but then g^j × g^{i−j} = g^i = g^j. That is, g^{i−j} is the identity element, and, by definition, we have g^{i−j} ∈ ⟨g⟩. It is now straightforward to verify that the other properties of a group hold for ⟨g⟩. ■
In particular, for an element g ∈ G, we define its order as ord(g) = |⟨g⟩|, which clearly is the minimum positive integer m such that g^m = i. Indeed, for j > m, observe that g^j = g^{j mod m} ∈ X = {i, g, g², . . . , g^{m−1}}, which implies that ⟨g⟩ = X.
A group G is cyclic, if there is an element g ∈ G, such that ⟨g⟩ = G. In such a case g is a generator of G.
Lemma 52.1.13. For any finite abelian group G, and any g ∈ G, we have that ord(g) divides |G|, and g^{|G|} = i.

Proof: By Lemma 52.1.12, the set ⟨g⟩ is a subgroup of G. By Lemma 52.1.11, we have that ord(g) = |⟨g⟩| divides |G|. As such, g^{|G|} = (g^{ord(g)})^{|G|/ord(g)} = i^{|G|/ord(g)} = i. ■
Lemma 52.1.14. Consider the group Zn under addition modulo n. For any x ∈ Zn, x > 0, we have ord(x) = lcm(n, x)/x = n/gcd(n, x). In particular, Zn is cyclic, and x is a generator of Zn if and only if gcd(n, x) = 1.

Proof: We are working modulo n here under addition, and the identity element is 0. As such, x · ord(x) ≡_n 0, which implies that n | x·ord(x). By definition, ord(x) is the minimal positive number with this property, implying that ord(x) = lcm(n, x)/x. Now, lcm(n, x) = nx/gcd(n, x). The second claim is now easy. ■
Theorem 52.1.15. (Euler's theorem) For all n and x ∈ Z∗n, we have x^{ϕ(n)} ≡ 1 (mod n).
(Fermat's theorem) If p is a prime then for all x ∈ Z∗p we have x^{p−1} ≡ 1 (mod p).

Proof: The group Z∗n is abelian and has ϕ(n) elements, with 1 being the identity element (duh!). As such, by Lemma 52.1.13, we have that x^{ϕ(n)} = x^{|Z∗n|} ≡ 1 (mod n), as claimed.
The second claim follows by setting n = p and recalling that ϕ(p) = p − 1, if p is a prime. ■
One might be tempted to think that Lemma 52.1.14 implies that if p is a prime then Z∗p is a cyclic group,
but this does not follow, as the cardinality of Z∗p is ϕ(p) = p − 1, which is not a prime number (for p > 2). To
prove that Z∗p is cyclic, let us go back shortly to the totient function.
Lemma 52.1.16. For any n > 0, we have ∑_{d|n} ϕ(d) = n.

Proof: For any g > 0, let Vg = {x | x ∈ {1, . . . , n} and gcd(x, n) = g}. Now, x ∈ Vg ⟺ gcd(x, n) = g ⟺ gcd(x/g, n/g) = 1 ⟺ x/g ∈ Z∗_{n/g}. Since V1, V2, . . . , Vn form a partition of {1, . . . , n}, it follows that
n = ∑_g |Vg| = ∑_{g|n} |Z∗_{n/g}| = ∑_{g|n} ϕ(n/g) = ∑_{d|n} ϕ(d). ■
52.1.2.5. Fields
Definition 52.1.17. A field is an algebraic structure ⟨F, +, ∗, 0, 1⟩ consisting of two abelian groups:
(A) F under +, with 0 being the identity element.
(B) F \ {0} under ∗, with 1 as the identity element (here 0 ≠ 1).
Also, the following property (distributivity of multiplication over addition) holds:
∀a, b, c ∈ F a ∗ (b + c) = (a ∗ b) + (a ∗ c).
We need the following: a polynomial p of degree k over a field F has at most k roots. Indeed, if p has the root α then it can be written as p(x) = (x − α)q(x), where q(x) is a polynomial of degree one lower. To see this, divide p(x) by the polynomial (x − α), and observe that p(x) = (x − α)q(x) + β; but clearly β = 0 since p(α) = 0. As such, if p had t roots α1, . . . , αt, then p(x) = q(x) ∏_{i=1}^{t} (x − αi), which implies that p would have degree at least t.
Lemma 52.1.18. Let p be a prime, and for k ≥ 1 let Rk denote the set of all elements of Z∗p of order k. Then |Rk| ≤ ϕ(k).

Proof: Clearly, all the elements of Rk are roots of the polynomial x^k − 1 ≡ 0 (mod p). By the above, this polynomial has at most k roots. Now, if Rk is not empty, then it contains an element x ∈ Rk of order k, which implies that for all i < j ≤ k, we have that x^i ≢ x^j (mod p), as the order of x is the size of ⟨x⟩, i.e., the minimum k such that x^k ≡ 1 (mod p). In particular, we have that Rk ⊆ ⟨x⟩, as for y = x^j we have that y^k ≡_p x^{jk} ≡_p 1^j ≡_p 1.
Observe that for y = x^i, if g = gcd(k, i) > 1, then y^{k/g} ≡_p x^{i(k/g)} ≡_p x^{lcm(i,k)} ≡_p 1; that is, ord(y) ≤ k/g < k, and y ∉ Rk. As such, Rk contains only elements x^i with gcd(i, k) = 1; that is, Rk ⊆ {x^i | i ∈ Z∗k}. The claim now readily follows as |Z∗k| = ϕ(k). ■
Theorem 52.1.19. For any prime p, the group Z∗p is cyclic.

Proof: For p = 2 the claim trivially holds, so assume p > 2. If the set R_{p−1}, from Lemma 52.1.18, is not empty, then there is g ∈ R_{p−1}; it has order p − 1, and it is a generator of Z∗p, as |Z∗p| = p − 1, implying that Z∗p = ⟨g⟩ and this group is cyclic.
Now, by Lemma 52.1.13, for any y ∈ Z∗p we have that ord(y) | p − 1 = |Z∗p|. This implies that Rk is empty if k does not divide p − 1. On the other hand, R1, . . . , R_{p−1} form a partition of Z∗p. As such, we have that
p − 1 = |Z∗p| = ∑_{k|p−1} |Rk| ≤ ∑_{k|p−1} ϕ(k) = p − 1,
by Lemma 52.1.18 and Lemma 52.1.16p339, implying that the inequality in the above display is an equality, and for all k | p − 1, we have that |Rk| = ϕ(k). In particular, |R_{p−1}| = ϕ(p − 1) > 0, and by the above the claim follows. ■
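For small primes the theorem is easy to verify by brute force: g generates Z∗p exactly when g^{(p−1)/q} ≢ 1 (mod p) for every prime q dividing p − 1, a standard consequence of the fact that ord(g) divides p − 1. A sketch (illustration only):

```python
def prime_factors(m):
    """Set of the distinct prime factors of m (trial division)."""
    fs, p = set(), 2
    while p * p <= m:
        while m % p == 0:
            fs.add(p)
            m //= p
        p += 1
    if m > 1:
        fs.add(m)
    return fs

def primitive_root(p):
    """Smallest generator of Z_p^* for a prime p."""
    qs = prime_factors(p - 1)
    for g in range(2, p):
        if all(pow(g, (p - 1) // q, p) != 1 for q in qs):
            return g
    return None

if __name__ == "__main__":
    print(primitive_root(7), primitive_root(101))   # 3 and 2
```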
52.1.2.7. Z∗n is cyclic for powers of a prime
Lemma 52.1.20. Consider any odd prime p, and any integer c ≥ 1, then the group Z∗n is cyclic, where n = pc .
Proof: Let g be a generator of Z∗p. Observe that g^{p−1} ≡ 1 (mod p). The number g < p, and as such p does not divide g; also p does not divide g^{p−2}, and p does not divide p − 1. As such, p² does not divide ∆ = (p − 1) g^{p−2} p; that is, ∆ ≢ 0 (mod p²). As such, we have that
(g + p)^{p−1} ≡ g^{p−1} + \binom{p−1}{1} g^{p−2} p ≡ g^{p−1} + ∆ ≢ g^{p−1} (mod p²)
⟹ (g + p)^{p−1} ≢ 1 (mod p²) or g^{p−1} ≢ 1 (mod p²).
Renaming g + p to be g, if necessary, we have that g^{p−1} ≢ 1 (mod p²), but by Theorem 52.1.15p339, g^{p−1} ≡ 1 (mod p). As such, g^{p−1} = 1 + βp, where p does not divide β. Now, we have
g^{p(p−1)} = (1 + βp)^p = 1 + \binom{p}{1} βp + p³ ⟨whatever⟩ = 1 + γ1 p²,
where γ1 is an integer (the p³ is not a typo: the binomial coefficient contributes at least one factor of p; here we are using that p > 2). In particular, as p does not divide β, it follows that p does not divide γ1 either. Let us apply this argumentation again to
g^{p²(p−1)} = (1 + γ1 p²)^p = 1 + γ1 p³ + p⁴ ⟨whatever⟩ = 1 + γ2 p³,
where again p does not divide γ2. Repeating this argument, for i = 1, . . . , c − 2, we have
αi = g^{p^i (p−1)} = (g^{p^{i−1}(p−1)})^p = (1 + γ_{i−1} p^i)^p = 1 + γ_{i−1} p^{i+1} + p^{i+2} ⟨whatever⟩ = 1 + γi p^{i+1},
where p does not divide γi. In particular, this implies that α_{c−2} = 1 + γ_{c−2} p^{c−1} and p does not divide γ_{c−2}. This in turn implies that α_{c−2} ≢ 1 (mod p^c).
Now, the order of g in Z∗n, denoted by k, must divide |Z∗n| by Lemma 52.1.13p339. Now |Z∗n| = ϕ(n) = p^{c−1}(p − 1), see Lemma 52.1.8p337. So, k | p^{c−1}(p − 1). Also, α_{c−2} ≢ 1 (mod p^c) implies that k does not divide p^{c−2}(p − 1). It follows that p^{c−1} | k. So, let us write k = p^{c−1} k′, where k′ ≤ p − 1. This, by definition, implies that g^k ≡ 1 (mod p^c), and hence g^k ≡ 1 (mod p). Now, g^p ≡ g (mod p), by Fermat's theorem. As such, with δ = c − 1, we have that
g^k ≡_p g^{p^δ k′} ≡_p (g^p)^{p^{δ−1} k′} ≡_p g^{p^{δ−1} k′} ≡_p ⋯ ≡_p g^{k′}.
Namely, g^{k′} ≡ 1 (mod p). As g is a generator of Z∗p, its order is p − 1, and thus (p − 1) | k′. Since 1 ≤ k′ ≤ p − 1, it follows that k′ = p − 1. We conclude that k = p^{c−1}(p − 1); that is, Z∗n is cyclic. ■
Theorem 52.1.22 (Euler’s criterion). Let p be an odd prime, and α ∈ Z∗p . We have that
(A) α^{(p−1)/2} ≡_p ±1.
(B) If α is a quadratic residue, then α^{(p−1)/2} ≡_p 1.
(C) If α is not a quadratic residue, then α^{(p−1)/2} ≡_p −1.
Proof: (A) Let γ = α^{(p−1)/2}, and observe that γ² ≡_p α^{p−1} ≡_p 1, by Fermat's theorem (Theorem 52.1.15p339), which implies that γ is either +1 or −1, as the polynomial x² − 1 has at most two roots over a field.
(B) Let α ≡_p β². Again by Fermat's theorem, we have α^{(p−1)/2} ≡_p β^{p−1} ≡_p 1.
(C) Let X be the set of elements in Z∗p that are not quadratic residues, and consider α ∈ X. Since Z∗p is a group, for any x ∈ Z∗p there is a unique y ∈ Z∗p such that xy ≡_p α. As such, we partition Z∗p into pairs C = {{x, y} | x, y ∈ Z∗p and xy ≡_p α} (note that x ≠ y in each pair, as x = y would make α ≡_p x² a quadratic residue). We have that
τ ≡_p ∏_{β∈Z∗p} β ≡_p ∏_{{x,y}∈C} xy ≡_p ∏_{{x,y}∈C} α ≡_p α^{(p−1)/2}.
Let us consider a similar set of pairs, but this time for 1: D = {{x, y} | x, y ∈ Z∗p, x ≠ y and xy ≡_p 1}. Clearly, the pairs of D do not contain −1 and 1, but all other elements of Z∗p appear in them. As such,
τ ≡_p ∏_{β∈Z∗p} β ≡_p (−1) · 1 · ∏_{{x,y}∈D} xy ≡_p −1 · ∏_{{x,y}∈D} 1 ≡_p −1. ■
Dividing both sides by (−1)^n ((p − 1)/2)!, we have that (a | p) ≡ a^{(p−1)/2} ≡ (−1)^n (mod p), as claimed. ■
Lemma 52.1.26. If p is an odd prime, a is odd, and gcd(a, p) = 1, then (a | p) = (−1)^∆, where ∆ = ∑_{j=1}^{(p−1)/2} ⌊ja/p⌋. Furthermore, we have (2 | p) = (−1)^{(p²−1)/8}.
= (∆ + n)p + ∑_{j=1}^{(p−1)/2} j − 2 ∑_{y∈L} y.
Rearranging, and observing that ∑_{j=1}^{(p−1)/2} j = ((p − 1)/2) · ((p − 1)/2 + 1)/2 = (p² − 1)/8, we have that
(a − 1)(p² − 1)/8 = (∆ + n)p − 2 ∑_{y∈L} y  ⟹  (a − 1)(p² − 1)/8 ≡ (∆ + n)p (mod 2).   (52.1)
Observe that p ≡ 1 (mod 2), and for any x we have that x ≡ −x (mod 2). As such, if a is odd, then the above implies that n ≡ ∆ (mod 2). Now the claim readily follows from Lemma 52.1.25.
As for (2 | p), setting a = 2, observe that ⌊ja/p⌋ = 0, for j = 0, . . . , (p − 1)/2, and as such ∆ = 0. Now, Eq. (52.1) implies that (p² − 1)/8 ≡ n (mod 2), and the claim follows from Lemma 52.1.25. ■
Theorem 52.1.27 (Law of quadratic reciprocity). If p and q are distinct odd primes, then
(p | q) = (−1)^{((p−1)/2)((q−1)/2)} (q | p).

Proof: Let S = {(x, y) | 1 ≤ x ≤ (p − 1)/2 and 1 ≤ y ≤ (q − 1)/2}. As lcm(p, q) = pq, it follows that there are no (x, y) ∈ S such that qx = py, as all such numbers are strictly smaller than pq. Now, let
S1 = {(x, y) ∈ S | qx > py}  and  S2 = {(x, y) ∈ S | qx < py}.
Now, (x, y) ∈ S1 ⟺ 1 ≤ x ≤ (p − 1)/2 and 1 ≤ y ≤ ⌊qx/p⌋. As such, we have |S1| = ∑_{x=1}^{(p−1)/2} ⌊qx/p⌋, and similarly |S2| = ∑_{y=1}^{(q−1)/2} ⌊py/q⌋. We have
τ = ((p − 1)/2) · ((q − 1)/2) = |S| = |S1| + |S2| = ∑_{x=1}^{(p−1)/2} ⌊qx/p⌋ + ∑_{y=1}^{(q−1)/2} ⌊py/q⌋ = τ1 + τ2.
The claim now readily follows by Lemma 52.1.26, as (−1)^τ = (−1)^{τ1}(−1)^{τ2} = (q | p)(p | q). ■
Claim 52.1.29. For odd integers n1, . . . , nk, we have that ∑_{i=1}^{k} (ni − 1)/2 ≡ (∏_{i=1}^{k} ni − 1)/2 (mod 2).

Proof: We prove it for two odd integers x and y, and apply this repeatedly to get the claim. Indeed, we have
(x − 1)/2 + (y − 1)/2 ≡ (xy − 1)/2 (mod 2) ⟺ 0 ≡ (xy − 1 − (x − 1) − (y − 1))/2 (mod 2) ⟺ 0 ≡ (xy − x − y + 1)/2 (mod 2) ⟺ 0 ≡ (x − 1)(y − 1)/2 (mod 2),
which is obviously true, as 4 divides (x − 1)(y − 1). ■
Lemma 52.1.30 (Law of quadratic reciprocity). For n and m positive odd integers, we have that Jn | mK = (−1)^{((n−1)/2)((m−1)/2)} Jm | nK.

Proof: Let n = ∏_{i=1}^{ν} pi and m = ∏_{j=1}^{μ} qj be the prime factorizations of the two numbers (allowing repeated factors). If they share a common factor p, then both Jn | mK and Jm | nK contain a zero term when expanded, as (n | p) = (m | p) = 0. Otherwise, we have
Jn | mK = ∏_{i=1}^{ν} ∏_{j=1}^{μ} Jpi | qjK = ∏_{i=1}^{ν} ∏_{j=1}^{μ} (pi | qj) = ∏_{i=1}^{ν} ∏_{j=1}^{μ} (−1)^{((qj−1)/2)((pi−1)/2)} (qj | pi)
= (∏_{i=1}^{ν} ∏_{j=1}^{μ} (−1)^{((qj−1)/2)((pi−1)/2)}) · ∏_{i=1}^{ν} ∏_{j=1}^{μ} (qj | pi) = s · Jm | nK,
where s = (−1)^{(∑_i (pi−1)/2)(∑_j (qj−1)/2)}. By Claim 52.1.29, ∑_i (pi − 1)/2 ≡ (n − 1)/2 (mod 2) and ∑_j (qj − 1)/2 ≡ (m − 1)/2 (mod 2), and therefore s = (−1)^{((n−1)/2)((m−1)/2)}, implying the claim. ■
Lemma 52.1.31. For odd integers n and m, we have that (n² − 1)/8 + (m² − 1)/8 ≡ (n²m² − 1)/8 (mod 2).

Proof: For an odd integer n, we have that either (i) 2 | n − 1 and 4 | n + 1, or (ii) 4 | n − 1 and 2 | n + 1. As such, 8 | n² − 1 = (n − 1)(n + 1). In particular, 64 | (n² − 1)(m² − 1). We thus have that
(n² − 1)(m² − 1)/8 ≡ 0 (mod 2) ⟺ (n²m² − n² − m² + 1)/8 ≡ 0 (mod 2) ⟺ (n²m² − 1)/8 ≡ (n² + m² − 2)/8 (mod 2) ⟺ (n² − 1)/8 + (m² − 1)/8 ≡ (n²m² − 1)/8 (mod 2). ■
Lemma 52.1.32. Let m, n be odd integers, and a, b be any integers. We have the following:
(A) Jab | nK = Ja | nK Jb | nK.
(B) Ja | nmK = Ja | nK Ja | mK.
(C) If a ≡ b (mod n) then Ja | nK = Jb | nK.
(D) If gcd(a, n) > 1 then Ja | nK = 0.
(E) J1 | nK = 1.
(F) J2 | nK = (−1)^{(n²−1)/8}.
(G) Jn | mK = (−1)^{((n−1)/2)((m−1)/2)} Jm | nK.
∆ ≡ ∑_{i=1}^{t} (pi² − 1)/8 ≡ ((∏_{i=1}^{t} pi)² − 1)/8 ≡ (n² − 1)/8 (mod 2),
Ignoring the recursive calls, all the operations take polynomial time. Clearly, computing Jacobi(2, n) takes polynomial time. Otherwise, observe that Jacobi reduces its input size by, say, one bit at least every two recursive calls, and except for the a = 2 case, it always performs only a single recursive call. Thus, it follows that its running time is polynomial. We thus get the following.
Lemma 52.1.33. Given integers a and n, where n is odd, then Ja | nK can be computed in polynomial time.
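A sketch of a standard iterative computation of Ja | nK, driven by the rules of Lemma 52.1.32 (illustration only; it is not necessarily identical to the Jacobi(a, n) procedure whose running time is analyzed above):

```python
def jacobi(a, n):
    """Jacobi symbol [a | n] for odd n > 0, via the rules of Lemma 52.1.32."""
    assert n > 0 and n % 2 == 1
    a %= n                                   # rule (C): reduce a modulo n
    result = 1
    while a != 0:
        while a % 2 == 0:                    # pull out factors of 2, rule (F)
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                          # quadratic reciprocity, rule (G)
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0           # gcd(a, n) > 1 gives 0, rule (D)

if __name__ == "__main__":
    # For an odd prime p, [a | p] agrees with Euler's criterion a^((p-1)/2) mod p.
    p = 131
    for a in range(1, p):
        e = pow(a, (p - 1) // 2, p)
        assert jacobi(a, p) == (1 if e == 1 else -1)
    print("ok")
```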
52.1.3.5. Subgroups induced by the Jacobi symbol
For an odd integer n, consider the set
Jn = {a ∈ Z∗n | Ja | nK ≡ a^{(n−1)/2} mod n}.   (52.2)
Claim 52.1.34. The set Jn is a subgroup of Z∗n.

Proof: For a, b ∈ Jn, we have that Jab | nK ≡ Ja | nK Jb | nK ≡ a^{(n−1)/2} b^{(n−1)/2} ≡ (ab)^{(n−1)/2} mod n, implying that ab ∈ Jn. Now, J1 | nK = 1, so 1 ∈ Jn. Next, for a ∈ Jn, let a^{−1} be the inverse of a (which is a number in Z∗n). Observe that a · a^{−1} = kn + 1, for some k, and as such, we have
1 = J1 | nK = Jkn + 1 | nK = Jaa^{−1} | nK = Ja | nK Ja^{−1} | nK.
And modulo n, we have
1 ≡ Ja | nK Ja^{−1} | nK ≡ a^{(n−1)/2} Ja^{−1} | nK mod n,
which implies that (a^{−1})^{(n−1)/2} ≡ Ja^{−1} | nK mod n. That is, a^{−1} ∈ Jn.
Namely, Jn contains the identity and is closed under inverse and multiplication, and it is now easy to verify that it fulfills the other requirements to be a group. ■
Lemma 52.1.35. Let n be an odd composite integer. Then |Jn| ≤ |Z∗n| /2.

Proof: Let n = ∏_{i=1}^{t} p_i^{k_i} be the prime factorization of n. Let q = p_1^{k_1}, and m = n/q = ∏_{i=2}^{t} p_i^{k_i}. By Lemma 52.1.20p341, the group Z∗q is cyclic; let g be a generator of it. Consider the element a ∈ Z∗n such that
a ≡ g (mod q)    and    a ≡ 1 (mod m).
Such a number a exists and is unique, by the Chinese remainder theorem (Theorem 52.1.6p337). In particular, for all i ≥ 2, we have a ≡ 1 (mod p_i), as p_i | m. As such, writing the Jacobi symbol explicitly, we have
Ja | nK = Ja | qK ∏_{i=2}^{t} (a | p_i)^{k_i} = Ja | qK ∏_{i=2}^{t} (1 | p_i)^{k_i} = Ja | qK = Jg | qK,
since a ≡ g (mod q), see Lemma 52.1.32p344 (C). At this point there are two possibilities:
(A) If k_1 = 1, then q = p_1, and Jg | qK = (g | q) ≡ g^{(q−1)/2} (mod q). But g is a generator of Z∗q, and its order is q − 1. As such g^{(q−1)/2} ≡ −1 (mod q), see Definition 52.1.23p342. We conclude that Ja | nK = −1. If we assume that Jn = Z∗n, then Ja | nK ≡ a^{(n−1)/2} ≡ −1 (mod n). Now, as m | n, we have
a^{(n−1)/2} ≡ (a^{(n−1)/2} mod n) ≡ −1 (mod m).
But this contradicts the choice of a, as a ≡ 1 (mod m).
(B) If k_1 > 1 then q = p_1^{k_1}. Arguing as above, we have that Ja | nK = (−1)^{k_1}. Thus, if we assume that Jn = Z∗n, then a^{(n−1)/2} ≡ −1 (mod n) or a^{(n−1)/2} ≡ 1 (mod n). This implies that a^{n−1} ≡ 1 (mod n), and thus a^{n−1} ≡ 1 (mod q).
Now a ≡ g (mod q), and thus g^{n−1} ≡ 1 (mod q). This implies that the order of g in Z∗q must divide n − 1. That is, ord(g) = ϕ(q) | n − 1. Now, since k_1 ≥ 2, we have that p_1 | ϕ(q) = p_1^{k_1−1}(p_1 − 1), see Lemma 52.1.8p337. We conclude that p_1 | n − 1 and p_1 | n, which is of course impossible, as p_1 > 1.
We conclude that Jn must be a proper subgroup of Z∗n. But, by Lemma 52.1.11p338, it must be that |Jn| divides |Z∗n|, and this implies that |Jn| ≤ |Z∗n| /2. ■
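As a sanity check, here is a small hedged Python sketch (relying on the jacobi routine from Section 52.1.3.4; the helper names are ours) that enumerates Jn for a small odd composite n and verifies that it occupies at most half of Z∗n:

from math import gcd

def J_set(n):
    """Enumerate J_n = { a in Z_n^* : Ja | nK = a^((n-1)/2) mod n }."""
    members = []
    for a in range(1, n):
        if gcd(a, n) != 1:
            continue                                   # a is not in Z_n^*
        if jacobi(a, n) % n == pow(a, (n - 1) // 2, n):
            members.append(a)
    return members

n = 15
Zn_star = [a for a in range(1, n) if gcd(a, n) == 1]
print(len(J_set(n)), len(Zn_star))   # for n = 15 this prints 2 and 8

Indeed, for n = 15 only a = 1 and a = 14 lie in J_15, well below |Z∗15|/2 = 4.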
52.2. Primality testing
The primality test is now easy (one could even say “trivial”, with a heavy Russian accent). Indeed, given a number n, first check if it is even (duh!). Otherwise, randomly pick a number r ∈ {2, . . . , n − 1}. If gcd(r, n) > 1 then the number is composite. Otherwise, check if r ∈ Jn (see Eq. (52.2)p346), by computing x = Jr | nK in polynomial time, see Section 52.1.3.4p345, and x′ = r^{(n−1)/2} mod n (see Lemma 52.1.7p337). If x ≡ x′ (mod n) then the algorithm returns that n is prime, otherwise it returns that it is composite.
Theorem 52.2.1. Given a number n and a parameter δ > 0, there is a randomized algorithm that decides if the given number is prime or composite. The running time of the algorithm is O((log n)^c log(1/δ)), where c is some constant. If the algorithm returns that n is composite then it is. If the algorithm returns that n is prime, then it is wrong with probability at most δ.

Proof: Run the above algorithm m = O(log(1/δ)) times. If any of the runs returns that the number is composite, then the algorithm returns that n is composite; otherwise, it returns that n is prime.
The algorithm can fail only if n is composite. Let r_1, . . . , r_m be the random numbers the algorithm picked. The algorithm fails only if r_1, . . . , r_m ∈ Jn, but since |Jn| ≤ |Z∗n|/2, by Lemma 52.1.35p346, this happens with probability at most ( |Jn| / |Z∗n| )^m ≤ 1/2^m ≤ δ, as claimed. ■
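A hedged Python sketch of this test (commonly known as the Solovay–Strassen test) is given below; it uses the jacobi routine from Section 52.1.3.4, and the function names are ours.

import random
from math import gcd

def is_probably_prime(n, rounds=40):
    """The randomized primality test described above: n is declared composite as
    soon as some random r is found to lie outside J_n; otherwise, after the given
    number of rounds, n is declared prime.  For a composite n the error
    probability is at most 2**(-rounds), by Lemma 52.1.35."""
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    for _ in range(rounds):
        r = random.randrange(2, n)
        if gcd(r, n) > 1:
            return False                     # found a nontrivial common factor
        x = jacobi(r, n) % n                 # map -1 to n - 1, so we can compare mod n
        if x != pow(r, (n - 1) // 2, n):
            return False                     # r is a witness: r is not in J_n
    return True

Taking rounds = ⌈lg(1/δ)⌉ gives the guarantee of Theorem 52.2.1.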
Proof: Let ∆ denote the product of all the prime numbers between m and 2m, and let X be the product of all the composite numbers in this range. We have
\binom{2m}{m} = ( 2m · (2m − 1) · · · (m + 2) · (m + 1) ) / ( m · (m − 1) · · · 2 · 1 ) = ( X · ∆ ) / ( m · (m − 1) · · · 2 · 1 ).
Since none of the numbers between 2 and m divides any of the (prime) factors of ∆, and \binom{2m}{m} is an integer, it must be that X/(m · (m − 1) · · · 2 · 1) is an integer. Therefore \binom{2m}{m} = c · ∆, for some integer c > 0, implying the claim. ■
Lemma 52.2.3. The number of prime numbers between m and 2m is O(m/ ln m).

This implies that the total number of primes smaller than n, denoted by Π(n), is also O(n/ ln n). Indeed, by Lemma 52.2.3, there exist positive constants C and N, such that for all n ≥ N we have Π(2n) − Π(n) ≤ C · n/ ln n; namely, Π(2n) ≤ C · n/ ln n + Π(n). Thus,
Π(2n) ≤ Σ_{i=0}^{⌈lg n⌉} ( Π(2n/2^i) − Π(2n/2^{i+1}) ) + O(1) ≤ Σ_{i=0}^{⌈lg n⌉} C · (n/2^i) / ln(n/2^i) + O(1) = O( n/ ln n ),
by observing that the summation behaves like a decreasing geometric series (the terms with n/2^i < N contribute only O(1) in total).
Lemma 52.2.5. For integers m, k and a prime p, if p^k divides \binom{2m}{m}, then p^k ≤ 2m.

Proof: Let T(p, m) be the number of times p appears in the prime factorization of m!. Formally, T(p, m) is the highest power k such that p^k divides m!. We claim that T(p, m) = Σ_{i=1}^{∞} ⌊m/p^i⌋. Indeed, consider an integer β ≤ m, such that β = p^t γ, where γ is an integer that is not divisible by p. Observe that β contributes exactly to the first t terms of the summation – namely, its contribution to m!, as far as powers of p are concerned, is counted correctly.
Let α be the maximum number such that p^α divides \binom{2m}{m} = (2m)!/(m! m!). Clearly,
α = T(p, 2m) − 2 T(p, m) = Σ_{i=1}^{∞} ( ⌊2m/p^i⌋ − 2⌊m/p^i⌋ ).
It is easy to verify that for any positive integers x and y, we have that 0 ≤ ⌊2x/y⌋ − 2⌊x/y⌋ ≤ 1. In particular, let ℓ be the largest index such that ⌊2m/p^ℓ⌋ − 2⌊m/p^ℓ⌋ = 1, and observe that α ≤ ℓ, as only the first ℓ terms of the above summation can be non-zero, and each of them is at most one. But ⌊2m/p^ℓ⌋ ≥ 1 implies that p^ℓ ≤ 2m. Now, if p^k divides \binom{2m}{m} then k ≤ α ≤ ℓ, and thus p^k ≤ p^ℓ ≤ 2m, as desired. ■
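A small hedged Python sketch verifying this computation (the function names are ours):

from math import comb

def T(p, m):
    """Exponent of the prime p in m!, i.e., the sum of floor(m / p^i) over i >= 1."""
    total, q = 0, p
    while q <= m:
        total += m // q
        q *= p
    return total

def check(m, p):
    """alpha = T(p, 2m) - 2 T(p, m) is the exponent of p in binomial(2m, m);
    the lemma asserts that p**alpha <= 2m."""
    alpha = T(p, 2 * m) - 2 * T(p, m)
    assert comb(2 * m, m) % p**alpha == 0 and p**alpha <= 2 * m
    return alpha

for p in (2, 3, 5, 7):
    print(p, check(100, p))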
Theorem 52.2.7. Let π(n) be the number of distinct prime numbers between 1 and n. We have that π(n) =
Θ(n/ ln n).
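A quick empirical illustration of this bound, as a hedged Python sketch (the theorem, of course, only asserts the Θ behavior):

import math

def primes_up_to(n):
    """Count primes <= n with a simple sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return sum(sieve)

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, primes_up_to(n), round(n / math.log(n)))   # pi(n) versus n / ln n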
References
[Mil76] G. L. Miller. Riemann’s hypothesis and tests for primality. J. Comput. Sys. Sci., 13(3): 300–317,
1976.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
[Rab80] M. O. Rabin. Probabilistic algorithm for testing primality. J. Number Theory, 12(1): 128–138,
1980.
Chapter 53
Talagrand’s Inequality
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
At an archaeological site I saw fragments of precious vessels, well cleaned and groomed and oiled and spoiled. And
beside it I saw a heap of discarded dust which wasn’t even good for thorns and thistles to grow on.
I asked: What is this gray dust which has been pushed around and sifted and tortured and then thrown away?
I answered in my heart: This dust is people like us, who during their lifetime lived separated from copper and gold and
marble stones and all other precious things - and they remained so in death. We are this heap of dust, our bodies, our souls, all
the words in our mouths, all hopes.
53.1. Introduction
Here, we want to prove a strong concentration inequality that is stronger than Azuma’s inequality because it is
independent of the underlying dimension of the process. This inequality is quite subtle, so we will need a quite
elaborate way to get to it – be patient.
To capture this intuition, we consider the convex-hull of these sets:
C(p, S ) = CH {H(p, u) | u ∈ S } .
Observation 53.1.1. An easy upper bound on the T -distance of p to a set S (i.e., ρ(p, S )) is the minimum
number of coordinates one has to change in p to get to a point in S . As the next example shows, however, things
are more subtle – if there are many different ways to get from p to a point in S , then the T -distance is going to
be significantly smaller.
Example 53.1.2. It would be useful to understand this somewhat mysterious T-distance. To this end, consider the ball b in R^d of radius 100d centered at the origin, and let S = ∂b be its boundary sphere. For a point p ∈ int b, we have that
H = H(p, S) = {0, 1}^d \ {(0, 0, . . . , 0)}.
As such, C = C(p, S) is the convex-hull of all the hypercube vertices, excluding the origin. It is easy to check that the closest point in C to the origin is the point u = (1/d, 1/d, . . . , 1/d). As such, we have that
ρ(p, S) = ∥u∥ = √(1/d) = 1/√d.
In particular, by monotonicity this implies that for any set T in R^d we have that ρ(p, T) is either 0 (i.e., p ∈ T), or alternatively, ρ(p, T) ≥ 1/√d. Similarly, ρ(p, T) ≤ √d, as this is the maximum distance from the origin to any vertex of the hypercube {0, 1}^d.
As a concrete example, for the set S = ∂b and the point p = (200d, . . . , 200d), we have ρ(p, S) = √d.
In the following, think about the dimension d as being quite large. As such, the distance 1/√d is quite small. In particular, for a set S ⊆ R^d, let
S_t = { p ∈ R^d : ρ(p, S) ≤ t }
be the expansion of S, obtained by including all the points in T-distance at most t from S.
Since we are interested in probability here, consider R^d to be the product of d probability spaces. Formally, let Ωi be a probability space, and consider the product probability space Ω = ∏_{i=1}^{d} Ωi. As we are given a probability measure on each Ωi, this defines a natural probability measure on Ω. That is, a point of Ω is generated by picking each of its coordinates independently from the corresponding Ωi, for i = 1, . . . , d.
The volume of a set S ⊆ Ω is thus P[S]. We are now ready to state Talagrand's inequality (not that it is going to help us much).
Theorem 53.1.3 (Talagrand’s inequality). For any set S ⊆ Ω, we have
P[S] · P[Ω \ S_t] = P[S] ( 1 − P[S_t] ) ≤ exp(−t²/4).
Example 53.1.4. To see why this inequality is interesting, consider Ω = [0, 100]^d with the uniform distribution on each coordinate. The probability measure of a set S ⊆ Ω is P[S] = vol(S)/100^d. Let
S = { p = (p_1, . . . , p_d) ∈ [0, 100]^d : Σ_i p_i ≤ 100d/2 }.
It is easy to verify that vol(S) = vol([0, 100]^d)/2. Let t = 4√(ln d), and consider the set S_t. Intuitively (and not quite correctly), its complement is the set of all points in [0, 100]^d such that one needs to change more than 4 ln d coordinates before one can get to a point of S. These points are t-far from being in S.
By Talagrand's inequality, we have that P[Ω \ S_t]/2 = P[S](1 − P[S_t]) ≤ exp(−t²/4) = 1/d⁴. Namely, only a tiny fraction of the cube is in T-distance more than 4√(ln d) from S!
Let us try to restate this – for any set S that occupies half the volume of the hypercube [0, 100]^d, all but a tiny fraction of the hypercube lies within T-distance 4√(ln d) of S.
Proof: The proof is by induction on the dimension d. For d = 1, we have ρ(p, S) = 0 if p ∈ S, and ρ(p, S) = 1 if p ∉ S. As such, we have
γ = E[ exp( ρ²(p, S)/4 ) ] = e^{0²/4} P[S] + e^{1²/4} (1 − P[S]) = P[S] + e^{1/4}(1 − P[S]) = f(P[S]),
where f(x) = x + e^{1/4}(1 − x). An easy argument (see Tedium 53.1.6) shows that f(x) ≤ 1/x, which implies that γ = f(P[S]) ≤ 1/P[S], as claimed.
For the inductive step, assume the claim holds in dimension d, and consider dimension d + 1. Let O = ∏_{i=1}^{d} Ωi and N = Ω_{d+1}. Clearly, Ω = ∏_{i=1}^{d+1} Ωi = O × N.
In particular, for s, s′ as above, we have (by convexity) that for any λ ∈ [0, 1], the point
h(λ) = (1 − λ)(s, 1) + λ(s′, 0) = ( (1 − λ)s + λs′, 1 − λ ) ∈ C(z, S) ⊆ [0, 1]^{d+1}.
The function ĥ(λ) = ∥(1 − λ)s + λs′∥² is convex, see Tedium 53.1.7. We thus have
ρ²(z, S) = min_{p ∈ C(z,S)} ∥p∥² ≤ ∥h(λ)∥² = ∥(1 − λ)s + λs′∥² + (1 − λ)² ≤ (1 − λ)∥s∥² + λ∥s′∥² + (1 − λ)².
We are still at liberty to choose s and s′. Let s be the point realizing ρ(p, S_O) – that is, the closest point of C(p, S_O) to the origin (i.e., ∥s∥ = ρ(p, S_O)). Similarly, let s′ be the point realizing ρ(p, S(ν)). Plugging these two points into the above inequality, we have
ρ²(z, S) ≤ (1 − λ) ρ(p, S_O)² + λ ρ(p, S(ν))² + (1 − λ)².
Now, fix ν, and consider the following little integral:
F(ν) = ∫_p exp( ρ²((p, ν), S)/4 ) ≤ ∫_p exp( [ (1 − λ)ρ(p, S_O)² + λρ(p, S(ν))² + (1 − λ)² ] / 4 )
 ≤ e^{(1−λ)²/4} ∫_p [ exp( ρ(p, S_O)²/4 ) ]^{1−λ} [ exp( ρ(p, S(ν))²/4 ) ]^{λ}
 ≤ e^{(1−λ)²/4} [ ∫_p exp( ρ(p, S_O)²/4 ) ]^{1−λ} [ ∫_p exp( ρ(p, S(ν))²/4 ) ]^{λ}    (by Hölder’s inequality, Eq. (53.4))
 ≤ e^{(1−λ)²/4} ( 1/P[S_O] )^{1−λ} ( 1/P[S(ν)] )^{λ} = e^{(1−λ)²/4} (1/P[S_O]) ( P[S(ν)]/P[S_O] )^{−λ}    (by induction)
 = (1/P[S_O]) · e^{(1−λ)²/4} r^{−λ},    for r = P[S(ν)]/P[S_O].
Observe that P[S_O] ≥ P[S(ν)], and thus r ≤ 1. To minimize the above, consider the function f_3(λ, r) = exp( (1 − λ)²/4 ) r^{−λ}. An easy calculation shows that f_3(λ, r) is minimized, for a fixed r, by choosing
λ(r) = 1 + 2 ln r for r ∈ [e^{−1/2}, 1],    and    λ(r) = 0 for r ∈ [0, e^{−1/2}],
see Tedium 53.1.9 (A). Furthermore, for this choice of λ, easy calculations show that f_4(r) = f_3(λ(r), r) ≤ 2 − r, see Tedium 53.1.9 (B). As such, we have
F(ν) ≤ (1/P[S_O]) f_4(r) ≤ (1/P[S_O]) ( 2 − P[S(ν)]/P[S_O] ).
We remind the reader that our purpose is to bound
∫_z exp( ρ²(z, S)/4 ) = ∫_{ν∈N} ∫_{p∈O} exp( ρ²((p, ν), S)/4 ) ≤ ∫_{ν∈N} F(ν) ≤ ∫_{ν∈N} (1/P[S_O]) ( 2 − P[S(ν)]/P[S_O] )
 = (1/P[S_O]) ( 2 − ( ∫_{ν∈N} P[S(ν)] ) / P[S_O] ) = (1/P[S_O]) ( 2 − P[S]/P[S_O] ) = (1/P[S]) · ( P[S]/P[S_O] ) ( 2 − P[S]/P[S_O] ) ≤ 1/P[S],
since for x = P[S]/P[S_O], we have x(2 − x) ≤ 1, for any value of x (see Tedium 53.1.10). ■
Tedium 53.1.7. For any p, u ∈ R^d, the function f(λ) = ∥(1 − λ)p + λu∥² is convex. Indeed, let f_i(λ) = ((1 − λ)p_i + λu_i)², for i = 1, . . . , d. Observe that f(λ) = Σ_i f_i(λ), and as such it is sufficient to prove that each f_i is convex. We have f_i′(λ) = 2(u_i − p_i)((1 − λ)p_i + λu_i), and f_i″(λ) = 2(u_i − p_i)² ≥ 0, which implies convexity.
Fact 53.1.8 (Hölder’s inequality). Let p, q ≥ 1 be two numbers such that 1/p + 1/q = 1. Then, for any two functions f, g, we have ∥fg∥_1 ≤ ∥f∥_p ∥g∥_q. Explicitly, stated as integrals, Hölder’s inequality is
∫ |f(x)g(x)| dx ≤ ( ∫ |f(x)|^p dx )^{1/p} ( ∫ |g(x)|^q dx )^{1/q}.
In particular, for λ ∈ (0, 1), p = 1/(1 − λ) and q = 1/λ, we have, for non-negative f and g, that
∫ f^{1−λ}(x) g^{λ}(x) dx ≤ ( ∫ f(x) dx )^{1−λ} ( ∫ g(x) dx )^{λ}.    (53.4)
Tedium 53.1.9. (A) We need to find the minimum of the function f(λ) = exp((1 − λ)²/4) r^{−λ} = exp( (1 − λ)²/4 − λ ln r ). We have f′(λ) = f(λ) ( −(1 − λ)/2 − ln r ). Solving f′(λ) = 0, we get (1 − λ)/2 = −ln r, that is λ = 1 + 2 ln r, which lies in [0, 1] as long as r ≥ e^{−1/2}. Otherwise, we set λ = 0.
(B) For r ≤ e^{−1/2}, we have, by the above, that f(0) = e^{1/4} ≈ 1.28 ≤ 1.39 ≈ 2 − e^{−1/2} ≤ 2 − r. For r > e^{−1/2}, by the above, λ = 1 + 2 ln r, and thus
f(λ) = exp( (1 − λ)²/4 − λ ln r ) = exp( ln² r − (1 + 2 ln r) ln r ) = exp( −ln r − ln² r ) ≤ 2 − r,
where the last inequality holds for all r ∈ (e^{−1/2}, 1], as a tedious but routine calculation verifies.
Proof: Consider a random point p ∈ Ω. We are interested in the probability that p ∉ S_t. To this end, consider the random variable X = ρ(p, S). By definition, p ∉ S_t ⇐⇒ X > t. As such, by Markov’s inequality, we have
P[ Ω \ S_t ] ≤ P[X ≥ t] = P[ exp(X²/4) ≥ exp(t²/4) ] ≤ E[ exp(X²/4) ] / exp(t²/4) ≤ exp(−t²/4) / P[S],
by Theorem 53.1.5. ■
Example 53.2.3. In Example 53.2.1, the function h (i.e., the number of bins that are not empty) is f-certifiable, where f(k) = k.

Example 53.2.4. Consider the random graph G(n, p) over n vertices, created by picking every edge with probability p. One can interpret such a graph as a random binary vector with \binom{n}{2} coordinates, where the ith coordinate is 1 ⇐⇒ the ith edge is in the graph (for some canonical ordering of all \binom{n}{2} possible edges).
A triangle in a graph G is a triple of vertices i, j, k, such that ij, jk, ki ∈ E(G). For a graph G, let h(G) be the number of distinct triangles in G. In the above interpretation of a graph as a vector x ∈ {0, 1}^{\binom{n}{2}}, it is easy to verify that if h(G) ≥ k then this can be certified by 3k coordinates. As such, the number of triangles in a graph is f-certifiable, for f(k) = 3k.
Note that the certificate is only for the lower bound on the value of the function.
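To make the certifiability concrete, here is a hedged Python sketch that samples G(n, p), counts triangles, and extracts a 3k-coordinate certificate (namely, the edges of the triangles); the helper names are ours.

import random
from itertools import combinations

def sample_gnp(n, p, rng=random.Random(0)):
    """Random graph G(n, p) as a set of edges (i, j) with i < j."""
    return {(i, j) for i, j in combinations(range(n), 2) if rng.random() < p}

def triangles(edges, n):
    """All triangles (i, j, k) with i < j < k; each is certified by its three edges."""
    return [(i, j, k) for i, j, k in combinations(range(n), 3)
            if (i, j) in edges and (j, k) in edges and (i, k) in edges]

edges = sample_gnp(50, 0.2)
tris = triangles(edges, 50)
certificate = {e for i, j, k in tris for e in ((i, j), (j, k), (i, k))}
print(len(tris), len(certificate))   # at most 3k coordinates certify h(G) >= k triangles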
Lemma 53.2.5. Consider a set S ⊆ R^d and a point p ∈ R^d. We have that ρ(p, S) ≤ t ⇐⇒ for every x = (x_1, . . . , x_d) ∈ R^d with ∥x∥ = 1, there exists h ∈ H(p, S), such that ⟨x, h⟩ ≤ t.

Proof: The quantity ℓ = ρ(p, S) is the distance from the origin to the convex polytope C(p, S). Let y be the closest point to the origin in this polytope, and observe that ℓ = ∥y∥ = ⟨y, y/∥y∥⟩. In particular, for any vector x with ∥x∥ = 1, we have ⟨y, x⟩ ≤ ∥y∥ = ℓ. Since y is in the convex-hull of H(p, S), it follows that there is h ∈ H(p, S) such that ⟨h, x⟩ ≤ ⟨y, x⟩ ≤ ℓ. Thus, if ρ(p, S) ≤ t, then this h satisfies ⟨x, h⟩ ≤ t.
As for the other direction, assume that ℓ = ρ(p, S) > t, and let y ∈ C(p, S) be the point realizing this distance. Since y is the closest point of the convex set C(p, S) to the origin, every point z ∈ C(p, S) satisfies ⟨z, y/∥y∥⟩ ≥ ∥y∥. In particular, for the direction x = y/∥y∥ and any vertex h ∈ H(p, S), we have ⟨h, x⟩ ≥ ∥y∥ = ℓ > t. ■
Theorem 53.2.6. Consider a probability space Ω = ∏_{i=1}^{m} Ωi, and let h : Ω → R be 1-Lipschitz and f-certifiable, for some function f. Consider the random variable X = h(x), for x picked randomly from Ω. Then, for any positive real numbers b and t, we have
P[ X ≤ b − t√(f(b)) ] · P[X ≥ b] ≤ exp(−t²/4).
If h is k-Lipschitz then P[ X ≤ b − tk√(f(b)) ] · P[X ≥ b] ≤ exp(−t²/4).

Proof: Set S = { p ∈ Ω : h(p) < b − t√(f(b)) }. Consider a point u such that h(u) ≥ b, and assume, for the sake of contradiction, that u ∈ S_t. Let I ⊆ JmK be the certificate, of size at most f(b), that h(u) ≥ b, and consider the vector x = (x_1, . . . , x_m), such that x_i = 1/√|I| if i ∈ I, and x_i = 0 otherwise. Observe that ∥x∥² = |I| · (1/|I|) = 1, and thus ∥x∥ = 1. By Lemma 53.2.5, there exists a vector w ∈ H(u, S), such that ⟨x, w⟩ ≤ t, since by assumption ρ(u, S) ≤ t. Let v ∈ S be the point realizing w – that is, H(u, v) = w.
Let J ⊆ I be the set of indices of coordinates in I on which u and v differ. We have, by the definition of x, that |J|/√|I| ≤ ⟨x, w⟩ ≤ t, which implies that |J| ≤ t√|I| ≤ t√(f(b)).
Let u′ be the point that agrees with u on the coordinates of I, and agrees with v on all the other coordinates. Since u′ agrees with u on the certificate I, we have h(u′) ≥ b. The points u′ and v disagree only on coordinates of I, and these coordinates of disagreement are exactly the coordinates (in I) where u disagrees with v – which is the set J. As such, by the 1-Lipschitz condition, we have that
h(v) ≥ h(u′) − |J| ≥ b − t√(f(b)),
but then, by the definition of S, we have v ∉ S, which is a contradiction, as v ∈ S.
We conclude that h(u) ≥ b implies u ∉ S_t. As such, we have P[X ≥ b] ≤ P[Ω \ S_t]. By Talagrand's inequality, we have
P[ X < b − t√(f(b)) ] · P[X ≥ b] ≤ P[S] · P[Ω \ S_t] ≤ exp(−t²/4).
The “<” on the left side can be replaced by “≤”, as in the statement of the theorem, by using the value t + ε instead of t, and taking the limit as ε → 0.
The k-Lipschitz version follows by applying the above inequality to the function h(·)/k. ■
As ν = O(n^{1/4}), we get the following (this requires some further tedious calculations, which we omit):
P[ |h(x) − ν| ≥ t c n^{1/4} ] ≤ 4 exp(−t²/4).
53.3.2. Largest convex subset
A set of points P is in convex position if all of its points are vertices of the convex-hull of P.

Lemma 53.3.3. Let P be a set of n points picked randomly and uniformly in the unit square [0, 1]². Let Y be the size of the largest subset of points of P that are in convex position. Then, we have that E[Y] = Ω(n^{1/3}).

Proof: Let p = (1/2, 1/2), and consider the regular N-gon Q, for N = ⌈n^{1/3}⌉, whose vertices lie on the circle of radius r = 1/2 centered at p. Consider the triangle △_i formed by connecting three consecutive vertices p_{2i−1}, p_{2i}, p_{2i+1} of Q. The angle subtended at p by an edge of Q is α = 2π/N, and we pick n large enough so that α ≤ 1/4. We remind the reader that 1 − x²/4 ≥ cos x ≥ 1 − x²/2, for x ∈ (0, 1/4). As such, we have that α²/4 ≤ 1 − cos α ≤ α²/2. In particular, this implies that the height of △_i is h = r(1 − cos α), and we have α²/8 = rα²/4 ≤ h ≤ rα²/2.
Let ℓ = ∥p_{2i−1} − p_{2i+1}∥ = 2r sin α; since x/2 ≤ sin x ≤ x, we have that α/2 ≤ ℓ ≤ α. As such, we have that area(△_i) = ℓh/2 ≥ (α/2)(α²/8)/2 = α³/32 ≥ c/n, for some absolute constant c ≥ 7. In particular, the probability that △_i does not contain a point of P is at most (1 − c/n)^n ≤ exp(−c). We conclude that, in expectation, at least (1 − exp(−c))⌊N/2⌋ = Ω(n^{1/3}) of the triangles △_1, △_2, . . . contain points of P. Selecting a point of P from each such triangle results in a subset in convex position, which implies the claim. ■
It is not hard to show that Y = Ω(n^{1/3}), with high probability, see Exercise 53.5.1. This readily implies that med(Y) = Ω(n^{1/3}). It is significantly harder, but known, that E[Y] = O(n^{1/3}), see [Val95]. We provide a weaker but easier upper bound next.
Lemma 53.3.4. Let P be a set of n points picked randomly and uniformly in the unit square [0, 1]². Let Y be the size of the largest subset of points of P that are in convex position. Then Y = O(n^{1/3} log n/ log log n), with high probability.

Proof: Let V be a set of directions of size O(n^c), where c is some constant, such that for any unit vector u there is a vector v ∈ V such that the angle between u and v is at most 1/n^c. For a vector v ∈ V, consider the grid G(v) with direction v and orthogonal direction v⊥. Every cell of this grid is a rectangle with sidelength 1/n^{1/3} in the direction of v, and 1/n^{2/3} in the orthogonal direction. In addition, the origin is a vertex of G(v). This grid is uniquely defined, and every cell of it has area 1/n. The number of grid cells of this grid intersecting the unit square is O(n), as can be easily verified.
Let F be the set of all rectangles, in all these grids, that intersect the unit square. Clearly, the number of such cells is O(|V| n) = O(n^{c+1}). Each rectangle in F has area 1/n, and as such, in expectation, it contains at most one point of P (the inequality is there because the rectangle might be partially outside the unit square). A standard application of Chernoff's inequality implies that the probability that a rectangle of F contains more than 10c log n/ log log n points of P is at most 1/n^{2c}. As such, with high probability, no rectangle of F contains more than O(log n/ log log n) points of P.
Consider any convex body C ⊆ [0, 1]². The key observation is that ∂C can be covered by O(n^{1/3}) rectangles of F. Indeed, the perimeter of C is at most 4. As such, place O(n^{1/3}) points along ∂C that are at distance at most 1/(10n^{1/3}) from each other. Similarly, place additional O(n^{1/3}) points on ∂C, such that the angle between the tangents at two consecutive points is at most 1/n^{1/3} (in radians) [a vertex of C might be picked repeatedly]. Let Q be the resulting set of points. Consider two consecutive points p, u ∈ Q along ∂C, and observe that the distance between them is at most 1/(10n^{1/3}), and the angle between their two tangents is at most α = 1/n^{1/3}. Consider the triangle △ formed by the two tangents to ∂C at p and u, and the segment pu. This triangle has height bounded by ∥p − u∥ sin α ≤ 1/(10n^{2/3}). It is now straightforward, if somewhat tedious, to argue that one of the rectangles of F must contain △.
Now we are almost done – if the maximum cardinality convex subset Q ⊆ P were larger than c′n^{1/3} log n/ log log n, for some constant c′, then let C be the convex-hull of this large subset. All the points of Q are vertices of C, and by the above, ∂C is covered by O(n^{1/3}) rectangles of F. This would imply that one of the rectangles of F contains at least Ω(c′ log n/ log log n) points of P, but this does not happen with high probability, for c′ sufficiently large, implying the claim. ■
Proof: Observe that Y is 1-Lipschitz (i.e., changing the location of one point of P can decrease or increase the value of Y by at most 1). In addition, Y is f-certifiable for f(k) = k, since we only need to list the points that form the convex subset. As such, Theorem 53.2.6 applies. Setting b = med(Y), we have by the above that med(Y) = Ω(n^{1/3}) and med(Y) = O(n^{1/3} log n). As such, we have
P[ Y ≤ med(Y) − t√(c n^{1/3} log n) ] · P[Y ≥ med(Y)] ≤ exp(−t²/4).
Similarly, setting b = med(Y) + t√(c n^{1/3} log n) ≤ 2 med(Y), we have
P[Y ≤ med(Y)] · P[ Y ≥ med(Y) + t√(c n^{1/3} log n) ] ≤ exp(−t²/4).
Proof: Let p = 1/b. A specific ball falls into a bin with exactly i balls if exactly i − 1 balls, of the remaining n − 1 balls, fall into the same bin. As such, the probability for that is γ_i = \binom{n−1}{i−1} p^{i−1} (1 − p)^{n−i}. Let α = Σ_{j≥i} γ_j denote the probability that a specific ball is i-heavy. As such, we have E[h≥i] = nα = Θ( n (n/b)^{i−1} ).
If a ball is in a bin with exactly j balls, for j ≥ i, then it collides directly with j − 1 other i-heavy balls. Thus, the expected number of collisions that a specific ball has with i-heavy balls is Σ_{j=i}^{n} (j − 1)γ_j = Σ_{j=i−1}^{n−1} j γ_{j+1}. Summing over all balls, and dividing by two, as every i-heavy collision is counted twice, we have that the expected overall number of such collisions is
β_i = (n/2) Σ_{j=i−1}^{n−1} j γ_{j+1} = (n/2) Σ_{j=i−1}^{n−1} j \binom{n−1}{j} p^{j} (1 − p)^{n−j−1} = O( n i ( en/(ib) )^{i−1} ). ■
Lemma 53.3.7. Consider throwing n balls into b bins, where b ≥ 3n. Let i be a small constant integer, let h≥i be the number of i-heavy balls, and let ν_i = med(h≥i). Assume that ν_i ≥ 16i²c log n, where c is some arbitrary constant. Then, for some constant c′, we have that |ν_i − E[h≥i]| ≤ c′ i √ν_i, and
P[ |h≥i − ν_i| ≥ 6i√(c ν_i ln n) ] ≤ 1/n^c    and    P[ |h≥i − E[h≥i]| ≥ c′ i √ν_i + 6i√(c ν_i ln n) ] ≤ 1/n^c.

Proof: Observe that h≥i is f-certifiable for f(k) = k – indeed, the certificate is the list of indices of all the balls that are contained in bins with i or more balls. The variable h≥i is also i-Lipschitz: changing the location of a single ball can turn one bin that contains i balls into a bin that contains only i − 1 balls, thus decreasing h≥i by i (and, symmetrically, it can increase h≥i by at most i).
We require that t i √ν_i ≤ ν_i/2, that is, t ≤ √ν_i /(2i). Theorem 53.2.6, applied with b = ν_i (and using that P[h≥i ≥ ν_i] ≥ 1/2, as ν_i is the median), implies that
P[ h≥i ≤ ν_i − t i √ν_i ] ≤ 2 exp(−t²/4).    (53.5)
Setting b = ν_i + 2 t i √ν_i, we have that
b − t i √b ≥ b − t i √(ν_i + 2 t i √ν_i) ≥ b − t i √(2ν_i) = ν_i + 2 t i √ν_i − t i √(2ν_i) ≥ ν_i.
This implies that P[h≥i ≥ b]/2 ≤ P[h≥i ≤ ν_i] · P[h≥i ≥ b] ≤ P[ h≥i ≤ b − t i √b ] · P[h≥i ≥ b] ≤ exp(−t²/4). We conclude that
P[ h≥i ≥ ν_i + 2 t i √ν_i ] ≤ 2 exp(−t²/4).    (53.6)
Combining the above, we get that
P[ |h≥i − ν_i| ≥ 2 t i √ν_i ] ≤ 4 exp(−t²/4).
We require that 4 exp(−t²/4) ≤ 1/n^c, which holds for t = 3√(c ln n). We get the inequality P[ |h≥i − ν_i| ≥ 6i√(c ν_i ln n) ] ≤ 1/n^c, as claimed. This in turn translates into the requirement that 3√(c ln n) ≤ √ν_i /(2i), that is, 6i√(c ln n) ≤ √ν_i, that is, 36i²c ln n ≤ ν_i.
Next, we estimate the expectation. We have that
E[h≥i] ≥ ν_i − Σ_{t=1}^{∞} t i √ν_i · P[ h≥i ≤ ν_i − (t − 1) i √ν_i ] ≥ ν_i − i √ν_i Σ_{t=1}^{∞} 2t exp(−(t − 1)²/4) ≥ ν_i − 10 i √ν_i,
and a symmetric computation, using Eq. (53.6), bounds E[h≥i] from above by ν_i + O(i √ν_i). Together, these imply the claim. ■
Example 53.3.8. Consider throwing n balls into b = n^{4/3} bins. Lemma 53.3.6 implies that e^{−2} F_i ≤ E[h≥i] ≤ 6e^{i−1} F_i, where F_i = n/(i n^{1/3})^{i−1}. As such, E[h≥2] = Θ(n^{2/3}), E[h≥3] = Θ(n^{1/3}), and E[h≥4] = Θ(1).
Applying Lemma 53.3.7, we get that the number of balls that collide (i.e., h≥2) is strongly concentrated around some value ν_2 = Θ(n^{2/3}), with the interval where it lies being of length O( √(n^{2/3} log n) ) = O( n^{1/3} √(log n) ).
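A hedged Python sketch that empirically illustrates this concentration (the parameters and helper names are ours):

import random
from collections import Counter

def heavy_balls(n, b, i, rng):
    """Throw n balls into b bins; return h_{>=i}, the number of balls landing
    in bins that contain at least i balls."""
    bins = Counter(rng.randrange(b) for _ in range(n))
    return sum(count for count in bins.values() if count >= i)

rng = random.Random(0)
n = 10**5
b = round(n ** (4 / 3))
samples = sorted(heavy_balls(n, b, 2, rng) for _ in range(30))
print(samples[0], samples[len(samples) // 2], samples[-1])
# the samples cluster around a value of order n^{2/3}, with spread of order n^{1/3}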
53.5. Problems
Exercise 53.5.1. Elaborating on the argument of Lemma 53.3.3, prove that, with high probability, a random set of n points picked uniformly in the unit square contains a convex subset of size Ω(n^{1/3}).
References
[AS00] N. Alon and J. H. Spencer. The Probabilistic Method. 2nd. Wiley InterScience, 2000.
[HJ18] S. Har-Peled and M. Jones. On separating points by lines. Proc. 29th ACM-SIAM Sympos. Dis-
crete Algs. (SODA), 918–932, 2018.
[Val95] P. Valtr. Probability that n random points are in convex position. Discrete Comput. Geom., 13(3):
637–643, 1995.
Chapter 54
LP in d dimensions: (H, →−c )
H – a set of n closed half spaces in R^d.
→−c – a vector in d dimensions.
Find p ∈ R^d s.t. ∀h ∈ H we have p ∈ h, and f(p) is maximized, where f(p) = ⟨p, →−c ⟩. Each half space h ∈ H is defined by a single linear inequality of the form
a_1 x_1 + a_2 x_2 + · · · + a_d x_d ≤ b.
One difficulty that we ignored earlier, is that the optimal solution for the LP might be unbounded, see
Figure 54.1.
Namely, we can find a solution with value ∞ to the target function.
For a half space h let η(h) denote the normal of h directed into the feasible region. Let µ(h) denote the closed
half space, resulting from h by translating it so that it passes through the origin. Let µ(H) be the resulting set
of half spaces from H. See Figure 54.1 (b).
The new set of constraints µ(H) is depicted in Figure 54.1 (c).
[Figure 54.1: a set of constraints H and the objective direction →−c ; (b) a constraint h and its translated copy µ(h) passing through the origin; (c) the resulting set µ(H) and its feasible region.]
Proof: Let ρ be an unbounded ray in the feasible region of (H, →−c ), and let ρ′ be the parallel ray emanating from the origin. Clearly, ρ′ is contained in the feasible region of (µ(H), →−c ), and it is unbounded in the direction of →−c if and only if ρ is. Conversely, translating an unbounded ray of (µ(H), →−c ) so that it starts at a feasible point of (H, →−c ) yields an unbounded feasible ray for (H, →−c ). See Figure 54.2 (a). ■
Lemma 54.1.2. Deciding if (µ(H), →−c ) is bounded can be done by solving a (d − 1)-dimensional LP. Furthermore, if it is bounded, then we get a set of d constraints such that their intersection proves this. Furthermore, the corresponding set of d constraints in H testifies that (H, →−c ) is bounded.

Proof: Rotate space such that →−c is the vector (0, 0, . . . , 0, 1), and consider the hyperplane g ≡ (x_d = 1). Clearly, (µ(H), →−c ) is unbounded if and only if the region g ∩ ( ∩_{h∈µ(H)} h ) is non-empty. Deciding if this region is non-empty amounts to solving the following (d − 1)-dimensional LP: L′ = (H′, (1, 0, . . . , 0)), where
H′ = { g ∩ h : h ∈ µ(H) }.
[Figure 54.2: deciding boundedness by intersecting the constraints of µ(H) with the hyperplane g; Figure 54.3: the optimal vertices v_i and v_{i+1}.]
(In the above example, µ(H) ∩ g is infeasible, because the intersection of µ(h_2) ∩ g and µ(h_1) ∩ g is empty, which implies that h_1 ∩ h_2 is bounded in the direction →−c that we care about – the positive y direction in this figure.) ■
We are now ready to describe the algorithm for solving the LP L = (H, →−c ). By solving a (d − 1)-dimensional LP we decide whether L is unbounded. If it is unbounded, we are done (we also found the unbounded solution, if you go carefully through the details). See Figure 54.3 (a). (In the above figure, we computed the point p.)
In fact, we just computed a set of constraints h_1, . . . , h_d such that their intersection is bounded in the direction of →−c (that is what the boundedness check returned).
Let us randomly permute the remaining half spaces of H, and let h_1, h_2, . . . , h_d, h_{d+1}, . . . , h_n be the resulting permutation.
Let v_i be the vertex realizing the optimal solution for the LP
L_i = ( {h_1, . . . , h_i}, →−c ).
There are two possibilities:
1. v_i = v_{i+1}. This happens if v_i ∈ h_{i+1}, and it can be checked in constant time.
2. v_i ≠ v_{i+1}. It must be that v_i ∉ h_{i+1}, and then the new optimum v_{i+1} must lie on the boundary of h_{i+1} – this is what is depicted in Figure 54.3 (b).
Let B be the set of d constraints that define v_{i+1}. If h_{i+1} ∉ B then v_i = v_{i+1}. As such, the probability of v_i ≠ v_{i+1} is roughly d/i, because this is the probability that h_{i+1} is one of the elements of B. Indeed, fix the first i + 1 elements, and observe that there are d elements that are marked (those are the elements of B). Thus, we are asking for the probability that one of the d marked elements is the last one in a random permutation of h_{d+1}, . . . , h_{i+1}, which is exactly d/(i + 1 − d).
Note that if some of the elements of B are among h_1, . . . , h_d then the above expression only decreases (as there are fewer marked elements).
Well, let us restrict our attention to ∂h_{i+1}. Clearly, the optimal solution to L_{i+1} on ∂h_{i+1} is the required v_{i+1}. Namely, we solve the LP L_{i+1} restricted to ∂h_{i+1} using recursion.
This takes T(i + 1, d − 1) time. What is the probability that v_{i+1} ≠ v_i?
Well, one of the d constraints defining v_{i+1} has to be h_{i+1}. The probability for that is at most 1 for i ≤ 2d − 1, and it is at most
d/(i + 1 − d)
otherwise.
Summarizing everything, we have:
T(n, d) = O(n) + T(n, d − 1) + Σ_{i=d+1}^{2d} T(i, d − 1) + Σ_{i=2d+1}^{n} ( d/(i + 1 − d) ) T(i, d − 1).
What is the solution of this monster? Well, one essentially has to guess the solution and verify it. To guess the solution, let us “simplify” (incorrectly) the recursion to:
T(n, d) = O(n) + T(n, d − 1) + Σ_{i=2d+1}^{n} ( d/(i + 1 − d) ) T(i, d − 1).
So think about the recursion tree. Every element in the sum is going to contribute a near-constant factor, because we divide it by (roughly) i + 1 − d, and also we are guessing that the optimal solution is linear/near linear.
In every level of the recursion we are going to be penalized by a multiplicative factor of d. Thus, it is natural to conjecture that T(n, d) ≤ (3d)^{3d} n.
This can be verified by tedious substitution into the recurrence, and is left as an exercise.
BTW, we are being a bit conservative about the constant. In fact, one can prove that the running time is O(d! n), which is still exponential in d.
SolveLP((H, →−c ))
    /* initialization */
    Rotate (H, →−c ) s.t. →−c = (0, . . . , 0, 1).
    Solve recursively the (d − 1)-dim LP L′ ≡ µ(H) ∩ (x_d = 1).
    if L′ has a solution then
        return “Unbounded”
    /* h_1, . . . , h_d: the d constraints certifying boundedness */
    v_d ← the vertex of h_1 ∩ · · · ∩ h_d extreme in the direction →−c
    Randomly permute the remaining constraints, getting h_{d+1}, . . . , h_n
    for i = d, . . . , n − 1 do
        if v_i ∈ h_{i+1} then
            v_{i+1} ← v_i
        else
            v_{i+1} ← solve the (d − 1)-dim LP L_{i+1} restricted to ∂h_{i+1}
            if this LP is infeasible then return “Infeasible”
    return v_n
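To make the incremental step concrete, here is a hedged Python sketch of the two-dimensional case (Seidel-style), under the simplifying assumption that a large bounding box is added to play the role of the boundedness certificate; all names are ours, and the constraint normals are assumed to be non-zero.

import random

EPS = 1e-9

def solve_lp_2d(halfplanes, c, M=1e6, rng=random.Random(0)):
    """Maximize <c, x> subject to <a, x> <= b for every (a, b) in halfplanes,
    inside the box |x|, |y| <= M.  Returns an optimal vertex, or None if infeasible."""
    box = [((1, 0), M), ((-1, 0), M), ((0, 1), M), ((0, -1), M)]
    cons = list(halfplanes)
    rng.shuffle(cons)
    dot = lambda u, w: u[0] * w[0] + u[1] * w[1]
    # optimum of the box alone: the corner extreme in the direction c
    v = (M if c[0] > 0 else -M, M if c[1] > 0 else -M)
    for i, (a, b) in enumerate(cons):
        if dot(a, v) <= b + EPS:
            continue                     # v_i is still feasible: v_{i+1} = v_i
        # otherwise the new optimum lies on the line <a, x> = b: solve a 1-dim LP
        p = (a[0] * b / dot(a, a), a[1] * b / dot(a, a))   # a point on the line
        d = (-a[1], a[0])                                  # direction of the line
        lo, hi = -float("inf"), float("inf")
        for (a2, b2) in box + cons[:i]:
            coef, rhs = dot(a2, d), b2 - dot(a2, p)
            if abs(coef) < EPS:
                if rhs < -EPS:
                    return None          # the line misses this halfplane entirely
            elif coef > 0:
                hi = min(hi, rhs / coef)
            else:
                lo = max(lo, rhs / coef)
        if lo > hi + EPS:
            return None                  # infeasible
        t = hi if dot(c, d) > 0 else lo if dot(c, d) < 0 else max(lo, min(hi, 0.0))
        v = (p[0] + t * d[0], p[1] + t * d[1])
    return v

# maximize x + y subject to x + y <= 4, x <= 3, y <= 3 (optimal value 4):
print(solve_lp_2d([((1, 1), 4), ((1, 0), 3), ((0, 1), 3)], (1, 1)))

The violated-constraint test costs O(1) per step, and the one-dimensional subproblem scans the earlier constraints, matching the analysis of the recursion above.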
54.3. References
The description in these class notes is loosely based on the description of low dimensional LP in the book of de Berg et al. [BCKO08].
References
[BCKO08] M. de Berg, O. Cheong, M. J. van Kreveld, and M. H. Overmars. Computational Geometry:
Algorithms and Applications. 3rd. Santa Clara, CA, USA: Springer, 2008.
Chapter 55
Task. Our purpose is to compute an assignment to the variables of P, such that none of the bad events of B
happens.
Algorithm. Initially, the algorithm assigns the variables of P random values. As long as there is a violated
event B ∈ B, resample all the variables of vbl(B) (independently according to their own distributions) – this is
a resampling of B. The algorithm repeats this till no event is violated.
Finer details. We fix some arbitrary strategy (randomized or deterministic) of how to pick the next violated event. This now fully specifies the algorithm.
Remark. Of course, it is not clear that the algorithm always succeeds. Let us just assume, for the time being, that we are in a case where the algorithm always finds a good assignment (which always exists).
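The following is a minimal Python sketch of this resampling algorithm for the special case of k-CNF formulas, where each clause is a bad event over the variables appearing in it; the encoding and function names are ours.

import random

def moser_tardos_ksat(n_vars, clauses, rng=random.Random(0)):
    # A literal is +i or -i (1-indexed): +i requires variable i to be True,
    # -i requires it to be False.  The bad event for a clause is "all of its
    # literals are falsified".
    assign = [rng.random() < 0.5 for _ in range(n_vars)]

    def falsified(lit):
        value = assign[abs(lit) - 1]
        return (not value) if lit > 0 else value

    while True:
        bad = [cl for cl in clauses if all(falsified(l) for l in cl)]
        if not bad:
            return assign
        # the text allows any fixed strategy for picking the violated event;
        # here we take the first one, and resample all of its variables
        for lit in bad[0]:
            assign[abs(lit) - 1] = rng.random() < 0.5

# e.g., a tiny satisfiable 3-CNF instance over 4 variables:
print(moser_tardos_ksat(4, [[1, 2, 3], [-1, 2, 4], [-2, -3, 4], [1, -4, 3]]))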
55.1.1. Analysis
Let L(i) ∈ B be the event that was resampled in the ith iteration of the algorithm, for i > 0. The sequence
formed by L is the log of the execution.
A witness tree T = (T, σT ) is a rooted tree T together with a labeling σT . Here, every node v ∈ T is labeled
by an event σT (v) ∈ B. If a node u is a child of v in T , then σT (u) ∈ Γ+ (v). Two nodes in a tree are siblings if
they have a common parent. If all siblings have distinct labels then the witness tree is proper.
For a vertex v of T , let [v] = σT (v).
Theorem 55.1.1. Let P be a finite set of independent random variables in a probability space, and let B be a set of (bad) events determined by these variables. If there is an assignment x : B → (0, 1), such that
∀B ∈ B:   P[B] ≤ x(B) ∏_{C ∈ Γ(B)} ( 1 − x(C) ),
then there exists an assignment for the variables of P such that no event in B happens. Furthermore, in expectation, the algorithm described above resamples any event B ∈ B at most x(B)/(1 − x(B)) times. Overall, the expected number of resampling steps is at most Σ_{B∈B} x(B)/(1 − x(B)).
Chapter 56
Proof (of the inequality (1 − 1/n)^m ≥ 1 − m/n, for positive integers m and n): The claim follows by induction on m. Indeed, for m = 1 it holds with equality. For m ≥ 2, we have
(1 − 1/n)^m = (1 − 1/n)(1 − 1/n)^{m−1} ≥ (1 − 1/n)( 1 − (m − 1)/n ) = 1 − m/n + (m − 1)/n² ≥ 1 − m/n. ■
Bibliography
[ABKU99] Y. Azar, A. Broder, A. Karlin, and E. Upfal. Balanced allocations. SIAM Journal on Computing,
29(1): 180–200, 1999.
[ABN08a] I. Abraham, Y. Bartal, and O. Neiman. Nearly tight low stretch spanning trees. Proc. 49th Annu.
IEEE Sympos. Found. Comput. Sci. (FOCS), 781–790, 2008.
[ABN08b] I. Abraham, Y. Bartal, and O. Neiman. Nearly tight low stretch spanning trees. CoRR, abs/0808.2017,
2008. arXiv: 0808.2017.
[AES99] P. K. Agarwal, A. Efrat, and M. Sharir. Vertical decomposition of shallow levels in 3-dimensional
arrangements and its applications. SIAM J. Comput., 29: 912–953, 1999.
[Aga04] P. K. Agarwal. Range searching. Handbook of Discrete and Computational Geometry. Ed. by
J. E. Goodman and J. O’Rourke. 2nd. Boca Raton, FL, USA: CRC Press LLC, 2004. Chap. 36,
pp. 809–838.
[AI06] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in
high dimensions. Proc. 47th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 459–468, 2006.
[AI08] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in
high dimensions. Commun. ACM, 51(1): 117–122, 2008.
[AKPW95] N. Alon, R. M. Karp, D. Peleg, and D. West. A graph-theoretic game and its application to the
k-server problem. SIAM J. Comput., 24(1): 78–100, 1995.
[AMS98] P. K. Agarwal, J. Matoušek, and O. Schwarzkopf. Computing many faces in arrangements of
lines and segments. SIAM J. Comput., 27(2): 491–505, 1998.
[AMS99] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency
moments. J. Comput. Syst. Sci., 58(1): 137–147, 1999.
[AN04] N. Alon and A. Naor. Approximating the cut-norm via grothendieck’s inequality. Proc. 36th
Annu. ACM Sympos. Theory Comput. (STOC), 72–80, 2004.
[AR94] N. Alon and Y. Roichman. Random cayley graphs and expanders. Random Struct. Algorithms,
5(2): 271–285, 1994.
[Aro98] S. Arora. Polynomial time approximation schemes for Euclidean TSP and other geometric prob-
lems. J. Assoc. Comput. Mach., 45(5): 753–782, 1998.
[AS00] N. Alon and J. H. Spencer. The Probabilistic Method. 2nd. Wiley InterScience, 2000.
[ASS08] N. Alon, O. Schwartz, and A. Shapira. An elementary construction of constant-degree expanders.
Combin. Probab. Comput., 17(3): 319–327, 2008.
[Bar96] Y. Bartal. Probabilistic approximations of metric space and its algorithmic application. Proc.
37th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 183–193, 1996.
[Bar98] Y. Bartal. On approximating arbitrary metrics by tree metrics. Proc. 30th Annu. ACM Sympos.
Theory Comput. (STOC), 161–168, 1998.
[BCKO08] M. de Berg, O. Cheong, M. J. van Kreveld, and M. H. Overmars. Computational Geometry:
Algorithms and Applications. 3rd. Santa Clara, CA, USA: Springer, 2008.
[BDS95] M. de Berg, K. Dobrindt, and O. Schwarzkopf. On lazy randomized incremental construction.
Discrete Comput. Geom., 14: 261–286, 1995.
[BK90] A. Z. Broder and A. R. Karlin. Multilevel adaptive hashing. Proc. 1th ACM-SIAM Sympos. Dis-
crete Algs. (SODA), 43–53, 1990.
[Bol98] B. Bollobas. Modern Graph Theory. Springer-Verlag, 1998.
[BS95] M. de Berg and O. Schwarzkopf. Cuttings and applications. Int. J. Comput. Geom. Appl., 5: 343–
355, 1995.
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge, 2004.
[BY98] J.-D. Boissonnat and M. Yvinec. Algorithmic Geometry. Cambridge University Press, 1998.
[CCH09] C. Chekuri, K. L. Clarkson., and S. Har-Peled. On the set multi-cover problem in geometric
settings. Proc. 25th Annu. Sympos. Comput. Geom. (SoCG), 341–350, 2009.
[CF90] B. Chazelle and J. Friedman. A deterministic view of random sampling and its use in geometry.
Combinatorica, 10(3): 229–249, 1990.
[Che86] L. P. Chew. Building Voronoi diagrams for convex polygons in linear expected time. Technical
Report PCS-TR90-147. Hanover, NH: Dept. Math. Comput. Sci., Dartmouth College, 1986.
[CKR04] G. Călinescu, H. J. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension prob-
lem. SIAM J. Comput., 34(2): 358–372, 2004.
[Cla87] K. L. Clarkson. New applications of random sampling in computational geometry. Discrete Com-
put. Geom., 2: 195–222, 1987.
[Cla88] K. L. Clarkson. Applications of random sampling in computational geometry, II. Proc. 4th Annu.
Sympos. Comput. Geom. (SoCG), 1–11, 1988.
[CLRS01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press
/ McGraw-Hill, 2001.
[CMS93] K. L. Clarkson, K. Mehlhorn, and R. Seidel. Four results on randomized incremental construc-
tions. Comput. Geom. Theory Appl., 3(4): 185–212, 1993.
[CS00] S. Cho and S. Sahni. A new weight balanced binary search tree. Int. J. Found. Comput. Sci.,
11(3): 485–513, 2000.
[CS89] K. L. Clarkson and P. W. Shor. Applications of random sampling in computational geometry, II.
Discrete Comput. Geom., 4(5): 387–421, 1989.
[DIIM04] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based
on p-stable distributions. Proc. 20th Annu. Sympos. Comput. Geom. (SoCG), 253–262, 2004.
[DP09] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized
Algorithms. Cambridge University Press, 2009.
[DS00] P. G. Doyle and J. L. Snell. Random walks and electric networks. ArXiv Mathematics e-prints,
2000. eprint: math/0001057.
[EEST08] M. Elkin, Y. Emek, D. A. Spielman, and S. Teng. Lower-stretch spanning trees. SIAM J. Comput.,
38(2): 608–628, 2008.
[EHS14] D. Eppstein, S. Har-Peled, and A. Sidiropoulos. On the Greedy Permutation and Counting Dis-
tances. manuscript. 2014.
[FRT04] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by
tree metrics. J. Comput. Sys. Sci., 69(3): 485–497, 2004.
[GG81] O. Gabber and Z. Galil. Explicit constructions of linear-sized superconcentrators. J. Comput.
Syst. Sci., 22(3): 407–420, 1981.
[GLS93] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimiza-
tion. 2nd. Vol. 2. Algorithms and Combinatorics. Berlin Heidelberg: Springer-Verlag, 1993.
[Gre69] W. Greg. Why are Women Redundant? Trübner, 1869.
[GRSS95] M. Golin, R. Raman, C. Schwarz, and M. Smid. Simple randomized algorithms for closest pair
problems. Nordic J. Comput., 2: 3–27, 1995.
[Gup00] A. Gupta. Embeddings of Finite Metrics. PhD thesis. University of California, Berkeley, 2000.
[GW95] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut
and satisfiability problems using semidefinite programming. J. Assoc. Comput. Mach., 42(6):
1115–1145, 1995.
[Har00a] S. Har-Peled. Constructing planar cuttings in theory and practice. SIAM J. Comput., 29(6): 2016–
2039, 2000.
[Har00b] S. Har-Peled. Taking a walk in a planar arrangement. SIAM J. Comput., 30(4): 1341–1367, 2000.
[Har11] S. Har-Peled. Geometric Approximation Algorithms. Vol. 173. Math. Surveys & Monographs.
Boston, MA, USA: Amer. Math. Soc., 2011.
[Hås01a] J. Håstad. Some optimal inapproximability results. J. Assoc. Comput. Mach., 48(4): 798–859,
2001.
[Hås01b] J. Håstad. Some optimal inapproximability results. J. ACM, 48(4): 798–859, 2001.
[HJ18] S. Har-Peled and M. Jones. On separating points by lines. Proc. 29th ACM-SIAM Sympos. Dis-
crete Algs. (SODA), 918–932, 2018.
[HLW06] S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bulletin Amer.
Math. Soc., 43: 439–561, 2006.
[HR15] S. Har-Peled and B. Raichel. Net and prune: A linear time algorithm for Euclidean distance
problems. J. Assoc. Comput. Mach., 62(6): 44:1–44:35, 2015.
[HW87] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete Comput. Geom., 2: 127–
151, 1987.
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of di-
mensionality. Proc. 30th Annu. ACM Sympos. Theory Comput. (STOC), 604–613, 1998.
[Ind01] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. Proc. 42nd Annu.
IEEE Sympos. Found. Comput. Sci. (FOCS), Tutorial. 10–31, 2001.
[JL84] W. B. Johnson and J. Lindenstrauss. Extensions of lipschitz mapping into hilbert space. Contem-
porary Mathematics, 26: 189–206, 1984.
[Kel56] J. L. Kelly. A new interpretation of information rate. Bell Sys. Tech. J., 35(4): 917–926, 1956.
[KKMO04] S. Khot, G. Kindler, E. Mossel, and R. O’Donnell. Optimal inapproximability results for max
cut and other 2-variable csps. Proc. 45th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), To
appear in SICOMP. 146–154, 2004.
[KKT91] C. Kaklamanis, D. Krizanc, and T. Tsantilas. Tight bounds for oblivious routing in the hypercube.
Math. sys. theory, 24(1): 223–232, 1991.
[KLMN05] R. Krauthgamer, J. R. Lee, M. Mendel, and A. Naor. Measured descent: a new embedding
method for finite metric spaces. Geom. funct. anal. (GAFA), 15(4): 839–858, 2005.
[KOR00] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor
in high dimensional spaces. SIAM J. Comput., 2(30): 457–474, 2000.
[LM00] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection.
Ann. Statist., 28(5): 1302–1338, 2000.
[Mat02] J. Matoušek. Lectures on Discrete Geometry. Vol. 212. Grad. Text in Math. Springer, 2002.
[Mat92] J. Matoušek. Reporting points in halfspaces. Comput. Geom. Theory Appl., 2(3): 169–186, 1992.
[Mat98] J. Matoušek. On constants for cuttings in the plane. Discrete Comput. Geom., 20: 427–448, 1998.
[Mat99] J. Matoušek. Geometric Discrepancy. Vol. 18. Algorithms and Combinatorics. Springer, 1999.
[McD89] C. McDiarmid. Surveys in Combinatorics. Ed. by J. Siemons. Cambridge University Press, 1989.
Chap. On the method of bounded differences.
[Mil76] G. L. Miller. Riemann’s hypothesis and tests for primality. J. Comput. Sys. Sci., 13(3): 300–317,
1976.
[MN08] M. Mendel and A. Naor. Towards a calculus for non-linear spectral gaps. manuscript. 2008.
[MN98] J. Matoušek and J. Nešetřil. Invitation to Discrete Mathematics. Oxford Univ Press, 1998.
[MNP06] R. Motwani, A. Naor, and R. Panigrahi. Lower bounds on locality sensitive hashing. Proc. 22nd
Annu. Sympos. Comput. Geom. (SoCG), 154–157, 2006.
[MOO05] E. Mossel, R. O’Donnell, and K. Oleszkiewicz. Noise stability of functions with low influences
invariance and optimality. Proc. 46th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 21–30,
2005.
[MP80] J. I. Munro and M. Paterson. Selection and sorting with limited storage. Theo. Comp. Sci., 12:
315–323, 1980.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
[MS09] M. Mendel and C. Schwob. Fast c-k-r partitions of sparse graphs. Chicago J. Theor. Comput.
Sci., 2009, 2009.
[MU05] M. Mitzenmacher and E. Upfal. Probability and Computing – randomized algorithms and probabilistic analysis. Cambridge, 2005.
[Mul89] K. Mulmuley. An efficient algorithm for hidden surface removal. Comput. Graph., 23(3): 379–
388, 1989.
[Mul94] K. Mulmuley. Computational Geometry: An Introduction Through Randomized Algorithms. En-
glewood Cliffs, NJ: Prentice Hall, 1994.
[Nor98] J. R. Norris. Markov Chains. Statistical and Probabilistic Mathematics. Cambridge Press, 1998.
[Rab76] M. O. Rabin. Probabilistic algorithms. Algorithms and Complexity: New Directions and Recent
Results. Ed. by J. F. Traub. Orlando, FL, USA: Academic Press, 1976, pp. 21–39.
[Rab80] M. O. Rabin. Probabilistic algorithm for testing primality. J. Number Theory, 12(1): 128–138,
1980.
[RVW02] O. Reingold, S. Vadhan, and A. Wigderson. Entropy waves, the zig-zag graph product, and new
constant-degree expanders and extractors. Annals Math., 155(1): 157–187, 2002.
[SA95] M. Sharir and P. K. Agarwal. Davenport-Schinzel Sequences and Their Geometric Applications.
New York: Cambridge University Press, 1995.
[SA96] R. Seidel and C. R. Aragon. Randomized search trees. Algorithmica, 16: 464–497, 1996.
[Sch79] A. Schönhage. On the power of random access machines. Proc. 6th Int. Colloq. Automata Lang.
Prog. (ICALP), vol. 71. 520–529, 1979.
[Sei93] R. Seidel. Backwards analysis of randomized geometric algorithms. New Trends in Discrete and
Computational Geometry. Ed. by J. Pach. Vol. 10. Algorithms and Combinatorics. Springer-
Verlag, 1993, pp. 37–68.
[Sha03] M. Sharir. The Clarkson-Shor technique revisited and extended. Comb., Prob. & Comput., 12(2):
191–201, 2003.
[Smi00] M. Smid. Closest-point problems in computational geometry. Handbook of Computational Ge-
ometry. Ed. by J.-R. Sack and J. Urrutia. Amsterdam, The Netherlands: Elsevier, 2000, pp. 877–
935.
[Sni85] M. Snir. Lower bounds on probabilistic linear decision trees. Theor. Comput. Sci., 38: 69–82,
1985.
[Ste12] E. Steinlight. Why novels are redundant: sensation fiction and the overpopulation of literature.
ELH, 79(2): 501–535, 2012.
[Val95] P. Valtr. Probability that n random points are in convex position. Discrete Comput. Geom., 13(3):
637–643, 1995.
[VC71] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of
events to their probabilities. Theory Probab. Appl., 16: 264–280, 1971.
[Vöc03] B. Vöcking. How asymmetry helps load balancing. J. ACM, 50(4): 568–589, 2003.
[Wes01] D. B. West. Introduction to Graph Theory. 2nd. Prentice Hall, 2001.
[WG75] H. W. Watson and F. Galton. On the probability of the extinction of families. J. Anthrop. Inst.
Great Britain, 4: 138–144, 1875.
377
Index
378
Andoni, Alexandr, 191 Building Voronoi diagrams for convex polygons in lin-
Applications of random sampling in computational ge- ear expected time, 331
ometry, II, 330, 331 butterfly, 324
approximate near neighbor, 181
approximate nearest neighbor, 181, 189–191 Călinescu, G., 253, 260
Approximate Nearest Neighbors: Towards Removing Catalan number, 197, 201
the Curse of Dimensionality, 161, 191, 192 central limit theorem, 89
Approximating the cut-norm via Grothendieck’s inequal- certified vertex, 323
ity, 285 chaining, 64
approximation characteristic vector, 292
maximization problem, 27 Chazelle, B., 331
Approximation Algorithms for the 0-Extension Prob- Chekuri, C., 332
lem, 253, 260 Chernoff inequality, 95
approximation factor, 304 simplified form, 95
APX Chervonenkis, A. Y., 236, 241
Hard, 303 Chew, L. P., 331
Aragon, C. R., 87 chi-square distribution with k degrees of freedom, 158
Arora, S., 260 Cho, S., 87
arrangement, 315, 317 Cholesky decomposition, 284
atomic event, 23, 135 Clarkson, K. L., 330, 331
average-case analysis, 18 Clarkson, Kenneth L., 330
Azar, Y., 148
Clarkson., K. L., 332
backwards analysis, 225 clause
Backwards Analysis of Randomized Geometric Algo- dangerous, 309
rithms, 229, 331 survived, 310
bad, 67 Closest-Point Problems in computational geometry, 76
balanced, 197 clusters, 253
Balanced Allocations, 148 CNF, 27
ball, 251 collide, 64
Bartal, Y., 260 coloring, 30
Bartal, Yair, 260 combinatorial dimension, 323
Berg, M. de, 330, 331, 367 commutative group, 338
Bernoulli distribution, 25 commute time, 206
bi-tension, 291 Complexity
binary code, 269 co−, 47, 224
binary symmetric channel, 275 BPP, 48, 224
binomial NP, 47, 223
estimates, 243 PP, 48, 224
binomial distribution, 26 P, 47, 223
bit fixing, 107 RP, 48, 224
black-box access, 227 ZPP, 48, 224
Boissonnat, J.-D., 330, 332 Computational Geometry: Algorithms and Applica-
Bollobas, B., 216 tions, 330, 367
bounded differences, 133 Computational Geometry: An Introduction Through
Boyd, S., 284 Randomized Algorithms, 331, 332
Broder, Andrei Z., 148 Compute, 66
379
Computing Many Faces in Arrangements of Lines and distance
Segments, 330, 331 Hamming, 181
Concentration of Measure for the Analysis of Random- distortion, 252
ized Algorithms, 103 distribution
conditional expectation, 81, 131 normal, 155, 189, 190
conditional probability, 24, 113 multi-dimensional, 157
confidence, 44 stable
conflict graph, 319 2, 189
conflict list, 319 p, 189
congruent modulo n, 336 distributivity of multiplication over addition, 340
congruent modulo p, 57, 66 divides, 57, 66
consistent labeling, 297 Dobrindt, K., 331
Constructing planar cuttings in theory and practice, dominating, 197
330, 331 Doob martingale, 137
contraction doubly stochastic, 206
edge, 114 Doyle, P. G., 210
convex hull, 243 Dubhashi, Devdatt P., 103
Convex Optimization, 284 Dyck word, 197
convex position, 358 Dyck words, 201
convex programming, 281 Dynamic, 63
convex-hull, 352
coprime, 335 edge, 315, 317
Cormen, T. H., 73 effective resistance, 207
cover time, 206 Efficient Search for Approximate Nearest Neighbor in
critical, 75 High Dimensional Spaces, 191
crossing number, 314 Efrat, A., 332
cuckoo hashing, 64 eigenvalue, 215, 287
cumulative distribution function, 177 eigenvalues, 215
cut, 113 eigenvector, 287
minimum, 113 electrical network, 207
cuts, 113 elementary event, 23, 135
cutting, 327 Elkin, Michael, 260
Cuttings and applications, 331 embedding, 252, 314
cyclic, 339 Embeddings of Finite Metrics, 260
CNF, 304 entropy, 263, 272
binary, 263
Datar, M., 190, 191 Entropy waves, the zig-zag graph product, and new
defining set, 323 constant-degree expanders and extractors, 301
degree, 45 epochs, 211
Delaunay Eppstein, D., 229
circle, 324 escalated choices, 147
triangle, 324 estimate, 236
dependency graph, 307, 369 Euler totient function, 337
dimension event, 23
combinatorial, 323 expander
discrepancy, 121 [n, d, δ]-expander, 287
discrepancy of χ, 121 [n, d, c]-expander, 218
380
c, 218 grid cell, 73
Expander Graphs and Their Applications, 301 grid cluster, 73
expectation, 24 ground set, 235
Explicit Constructions of Linear-Sized Superconcen- group, 336, 338
trators, 217 growth function, 238, 247
Extensions of Lipschitz mapping into Hilbert space, Gupta, A., 260
161 gcd, 335
extraction function, 266
Håstad, J., 284
face, 317 Hamming distance, 181
faces, 315 harmonic number, 34
Fakcharoenphol, J., 260 Har-Peled, S., 229, 247, 330–332
family, 64 Har-Peled, Sariel, 76, 361
Fast C-K-R Partitions of Sparse Graphs, 229 Haussler, D., 241, 248
field, 293, 340 heavy
filter, 135 t-heavy, 325
filtration, 135 height, 144
final strong component, 201 Hierarchically well-separated tree, 253
finite metric, 227 history, 200
first order statistic, 177 hitting time, 206
Fold, 66 Hoeffding’s inequality, 103
Four results on randomized incremental constructions, Hoory, S., 301
331 How Asymmetry Helps Load Balancing, 148
Friedman, J., 331 HST, 253
fully explicit, 220 HST, 253, 256, 257
function Huffman coding, 270
sensitive, 184 hypercube, 106
d-dimensional hypercube, 181
Gabber, Ofer, 217 Håstad, J., 28
Galil, Zvi, 217
identity element, 338
Galton, F., 111
Improved Approximation Algorithms for Maximum Cut
Gaussian, 157
and Satisfiability Problems Using Semidefinite
generator, 339
Programming, 284
Geometric Algorithms and Combinatorial Optimiza-
independent, 24, 55
tion, 284
indicator variable, 27
Geometric Approximation Algorithms, 247
Indyk, P., 161, 191, 192, 260
Geometric Discrepancy, 103, 123 Indyk, Piotr, 191
geometric distribution, 26 inequality
Goemans, M. X., 284 Hoeffding, 103
Golin, M., 76 Intorudction to Graph Theory, 216
Grötschel, M., 284 Introduction to Algorithms, 73
graph inverse, 57, 66
d-regular, 287 Invitation to Discrete Mathematics, 194
labeled, 212 irreducible, 202
lollipop, 206
Greg, W.R., 111 Jacobi symbol, 343
grid, 73 Johnson, W. B., 161
381
Jones, Mitchell, 361 vertex exposure, 132
martingale difference, 136
Kaklamanis, C., 107 martingale sequence, 132
Karlin, Anna R., 148 Massart, P., 159
Karloff, H. J., 253, 260 Matias, Yossi, 176
Kelly criterion, 93 Matoušek, J., 103, 123, 194, 260, 330–332
Kelly, J. L., 93 max cut, 303
Khot, S., 284, 285 maximization problem, 27
Kirchhoff’s law, 207 maximum cut, 303
Krauthgamer, R., 260 maximum cut problem, 281
Krizanc, D., 107 McDiarmid, C., 103
Kushilevitz, E., 191 measure, 235
Laurent, B., 159 Measured descent: A new embedding method for finite
Law of quadratic reciprocity, 343, 344 metric spaces, 260
lazy randomized incremental construction, 331 median, 357
Lectures on Discrete Geometry, 260, 332 median estimator, 170
Legendre symbol, 342 Mehlhorn, K., 331
level, 166, 315 memorylessness property, 200
k-level, 315 Mendel, M., 229, 301
Lindenstrauss, J., 161 metric, 227, 251
Linearity of expectation, 25 metric space, 227, 251–261
Linial, N., 301 Miller, G. L., 348
Lipschitz, 252 mincut, 113
bi-Lipschitz, 252 Mitzenmacher, M., 109, 268, 274, 280
Lipschitz condition, 137 Modern Graph Theory, 216
load, 144 moments technique, 330
load factor, 64 all regions, 323
Locality-sensitive hashing scheme based on p-stable monomial, 45
distributions, 190, 191 Mossel, E., 285
log, 369 Motwani, R., 48, 54, 79, 103, 109, 119, 134, 161, 191,
lollipop graph, 206 192, 224, 348
long, 298 Mulmuley, K., 331, 332
Lovász, L., 284 multi-dimensional normal distribution, 157
Lower bounds on locality sensitive hashing, 191 Multilevel Adaptive Hashing, 148
Lower Bounds on Probabilistic Linear Decision Trees, Munro, J. I., 167
150
Lower-Stretch Spanning Trees, 260 Naor, A., 191, 285, 301
lucky, 228 Nešetřil, J., 194
lcm, 335 near neighbor
LSH, 183, 184, 189, 191 data-structure
approximate, 181, 184, 188, 190
Markov chain, 200 Near-Optimal Hashing Algorithms for Approximate Near-
aperiodic, 202 est Neighbor in High Dimensions, 191
ergodic, 202 Near-optimal hashing algorithms for approximate near-
Markov Chains, 195, 202 est neighbor in high dimensions, 191
martingale, 137 Nearly Tight Low Stretch Spanning Trees, 260
edge exposure, 132 Neiman, Ofer, 260
382
net, 227
    ε-net, 241
    ε-net theorem, 241, 248
Net and Prune: A Linear Time Algorithm for Euclidean Distance Problems, 76
New applications of random sampling in computational geometry, 330
Noise stability of functions with low influences invariance and optimality, 285
normal distribution, 155, 160, 189, 190
    multi-dimensional, 157
Norris, J. R., 195, 202
NP, 28, 284, 303
    complete, 27, 76, 281, 303, 304
    hard, 27, 281
O’Donnell, R., 285
oblivious, 107
Ohm’s law, 207
Oleszkiewicz, K., 285
On approximating arbitrary metrics by tree metrics, 260
On constants for cuttings in the plane, 331
On lazy randomized incremental construction, 331
On Separating Points by Lines, 361
On the Greedy Permutation and Counting Distances, 229
On the Power of Random Access Machines, 76
On the Probability of the Extinction of Families, 111
On the set multi-cover problem in geometric settings, 332
On the uniform convergence of relative frequencies of events to their probabilities, 236, 241
open ball, 251
Optimal inapproximability results for Max Cut and other 2-variable CSPs, 284, 285
OR-concentrator, 151
order, 339
orthonormal eigenvector basis, 289
Ostrovsky, R., 191
Polynomial time approximation schemes for Euclidean TSP and other geometric problems, 260
positive semidefinite, 284
prefix code, 269
prefix-free, 269
prime, 66, 335
prime factorization, 337
Probabilistic algorithm for testing primality, 348
Probabilistic algorithms, 76
Probabilistic approximations of metric space and its algorithmic application, 260
probabilistic distortion, 256
probabilities, 23
Probability
    Amplification, 117
probability, 24
Probability and Computing – randomized algorithms and probabilistic analysis, 109, 268, 274, 280
probability density function, 177
probability measure, 23, 135
probability space, 24, 135
Probability that n random points are in convex position, 358
Problem
    3SAT
        3SAT Max, 27
problem
    3SAT, 27, 28
    Max 3SAT, 27
    MAX CUT, 281
    MAX-SAT, 304–306
    Sorting Nuts and Bolts, 86
projection, 182
proper, 369
quadratic residue, 341
quotation
    Carl XIV Johan, King of Sweden, 363
quotient, 57, 66, 335
quotient group, 338
random graphs, 132
random incremental construction, 318, 323, 327
    lazy, 331
random sample, 241, 242, 248, 315, 320, 321, 324, 325, 327–330
    ε-sample, 241
random variable, 24, 256
random walk, 193
Random Walks and Electric Networks, 210
Randomized Algorithms, 48, 54, 79, 103, 109, 119, 134, 224, 348
randomized rounding, 305
Randomized search trees, 87
range, 235
Range searching, 192
range space, 235
    projection, 236
rank, 35, 39, 87
Rao, S., 260
Reingold, O., 301
relative pairwise distance, 218
remainder, 57, 66, 336
replacement product, 298
Reporting points in halfspaces, 332
resampling, 369
residue, 336
resistance, 207, 211
Riemann’s Hypothesis and Tests for Primality, 348
Roichman, Y., 301
running-time
    expected, 87
Sahni, S., 87
sample
    ε-sample, 241
    ε-sample theorem, 241
    ε-sample, 240
sample space, 23
Schönhage, Arnold, 76
Schrijver, A., 284
Schwartz, O., 301
Schwarzkopf, O., 330, 331
Schwob, C., 229
Seidel, R., 87, 229, 331
Selection and Sorting with Limited Storage, 167
sensitive function, 184
set
    defining, 323
    stopping, 323
shallow cuttings, 332
Shapira, A., 301
Sharir, M., 331, 332
shatter function, 239
shattered, 236
Shor, Peter W., 330
short, 298
siblings, 369
Sidiropoulos, A., 229
sign, 46
Simple randomized algorithms for closest pair problems, 76
size, 63
Smid, M., 76
Snell, J. L., 210
Snir, Marc, 150
Some optimal inapproximability results, 28, 284
spectral gap, 218, 293
Spencer, J. H., 303, 361
spread, 257
squaring, 299
standard deviation, 25
standard normal distribution, 155
state
    aperiodic, 202
    ergodic, 202
    non null, 201
    null persistent, 201
    periodic, 202
    persistent, 201
    transient, 201
state probability vector, 202
Static, 63
stationary distribution, 202
Steinlight, E., 111
stochastic, 206
stopping set, 323
streaming, 167
strong component, 201
sub martingale, 136
subgraph
    unique, 310
subgroup, 338
successful, 122, 311
super martingale, 136
Surveys in Combinatorics, 103
symmetric, 215
Szegedy, Mario, 176
Taking a Walk in a Planar Arrangement, 331
Talwar, K., 260
tension, 288
The Probabilistic Method, 303, 361
The Space Complexity of Approximating the Frequency Moments, 176
The Clarkson-Shor Technique Revisited and Extended, 331
theorem
    ε-net, 241, 248
    Radon’s, 237
    ε-sample, 241
Tight bounds for oblivious routing in the hypercube, 107
Towards a calculus for non-linear spectral gaps, 301
transition matrix, 218, 287
transition probabilities matrix, 200
transition probability, 200
traverse, 212
treap, 84
tree
    code trees, 269
    prefix tree, 269
triangle, 356
true, 186
Tsantilas, T., 107
Turing machine
    log space, 212
union bound, 24
uniqueness, 75
universal traversal sequence, 212
Upfal, E., 109, 268, 274, 280
vertical decomposition, 318
    vertex, 318
Vertical decomposition of shallow levels in 3-dimensional arrangements and its applications, 332
vertical trapezoid, 318
vertices, 317
violates, 369
volume, 352
walk, 212
Watson, H. W., 111
weight
    region, 323
Welzl, E., 241, 248
West, D. B., 216
Why are Women Redundant?, 111
Why Novels are Redundant: Sensation Fiction and the Overpopulation of Literature, 111
width, 73
Wigderson, A., 301
Williamson, D. P., 284
witness tree, 369
word, 173
Yvinec, M., 330, 332
zero, 45
zero set, 45
zig-zag, 298
zig-zag product, 298
zig-zag-zig path, 298