
Class notes for Randomized Algorithms

Sariel Har-Peled¹

April 2, 2024

¹ Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL, 61801, USA;
[email protected]; http://sarielhp.org/. Work on this paper was partially supported by an NSF CAREER
award CCR-0132901.
Contents

Contents 3

1 Introduction to Randomized Algorithms 17


1.1 What are randomized algorithms? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.1 The benefits of unpredictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.2 Back to randomized algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.1.3 Randomized vs average-case analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2 Examples of randomized algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.1 2SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.1.1 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.1.2 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.2 Walk on the grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.2.1 Walk on the line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.2.2 Walk on the two dimensional grid . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.2.3 Walk on the three dimensional grid . . . . . . . . . . . . . . . . . . . . 20
1.2.3 RSA and primality testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.4 Min cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2 Probability and Expectation 23


2.1 Basic probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.1 Formal basic definitions: Sample space, σ-algebra, and probability . . . . . . . . . . . 23
2.1.2 Expectation and conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.3 Variance and standard deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Some distributions and their moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.1 Bernoulli distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 Geometric distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Application of expectation: Approximating 3SAT . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Markov’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.1 Markov’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.2 Example: A good approximation to kSAT with good probability . . . . . . . . . . . . 29
2.4.3 Example: Coloring a graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.3.1 Getting a valid coloring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3 Analyzing QuickSort and QuickSelect via Expectation 33


3.1 QuickSort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 QuickSelect: Median selection in linear time . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Analysis via expectation and indicator variables . . . . . . . . . . . . . . . . . . . . . 34

3.2.2 Analysis of QuickSelect via conditional expectations . . . . . . . . . . . . . . . . . . 35

4 Chebyshev, Sampling and Selection 37


4.1 Chebyshev’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.1 Example: A better inequality via moments . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.2 Chebyshev’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Estimation via sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Randomized selection – Using sampling to learn the world . . . . . . . . . . . . . . . . . . . 39
4.3.1 Inverse estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.1.1 Inverse estimation – intuition . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.2 Randomized selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.2.1 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5 Verifying Identities, and Some Complexity 43


5.1 Verifying equality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1.1 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1.1.1 Amplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.2 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1.3 Checking identity for polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.3.1 The Schwartz-Zippel lemma . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.3.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1.4 Checking if a bipartite graph has a perfect matching . . . . . . . . . . . . . . . . . . 46
5.2 Las Vegas and Monte Carlo algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2.1 Complexity Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6 The Birthday Paradox, Occupancy and the Coupon Collector Problem 49


6.1 Some needed math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.2 The birthday paradox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.3 Occupancy Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3.1 The probability that all bins have exactly one ball . . . . . . . . . . . . . . . . . . . . 52
6.4 The Coupon Collector’s Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.5 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

7 On k-wise independence 55
7.1 Pairwise independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.1.1 Pairwise independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.1.2 A pairwise independent set of bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.1.3 An application: Max cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.1.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.2 On k-wise independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2.2 On working modulo prime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2.3 Construction of k-wise independence variables . . . . . . . . . . . . . . . . . . . . . 58
7.2.4 Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2.5 Applications of k-wise independent variables . . . . . . . . . . . . . . . . . . . . . . 59

7.2.5.1 Product of expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.2.5.2 Application: Using less randomization for a randomized algorithm . . . . . 60
7.3 Higher moment inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

8 Hashing 63
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.2 Universal Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
8.2.1 How to build a 2-universal family . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
8.2.1.1 A quick reminder on working modulo prime . . . . . . . . . . . . . . . . . 66
8.2.1.2 Constructing a family of 2-universal hash functions . . . . . . . . . . . . . 66
8.2.1.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
8.2.1.4 Explanation via pictures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.3 Perfect hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
8.3.1 Some easy calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
8.3.2 Construction of perfect hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.3.2.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.4 Bloom filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

9 Closest Pair 73
9.1 How many times can a minimum change? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
9.2 Closest Pair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
9.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

10 Coupon Collector’s Problem II 77


10.1 The Coupon Collector’s Problem Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.1.1 Some technical lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.1.2 Back to the coupon collector’s problem . . . . . . . . . . . . . . . . . . . . . . . . . 78
10.1.3 An asymptotically tight bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
10.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

11 Conditional Expectation and Concentration 81


11.1 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
11.1.1 Concentration from conditional expectation . . . . . . . . . . . . . . . . . . . . . . . 82

12 Quick Sort with High Probability 83


12.1 QuickSort runs in O(n log n) time with high probability . . . . . . . . . . . . . . . . . . . . . 83
12.2 Treaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
12.2.1 Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
12.2.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
12.2.2.1 Insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
12.2.2.2 Deletion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
12.2.2.3 Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
12.2.2.4 Meld . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
12.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
12.3 Extra: Sorting Nuts and Bolts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
12.3.1 Running time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

12.4 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

13 Concentration of Random Variables – Chernoff’s Inequality 89


13.1 Concentration of mass and Chernoff’s inequality . . . . . . . . . . . . . . . . . . . . . . . . 89
13.1.1 Example: Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
13.1.2 A restricted case of Chernoff inequality via games . . . . . . . . . . . . . . . . . . . 89
13.1.2.1 Chernoff games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
13.1.2.2 Chernoff’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
13.1.2.3 Some low level boring calculations . . . . . . . . . . . . . . . . . . . . . . 93
13.1.3 A proof for the −1/+1 case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
13.2 The Chernoff Bound — General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
13.2.1 The lower tail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
13.2.2 A more convenient form of Chernoff’s inequality . . . . . . . . . . . . . . . . . . . . 97
13.2.2.1 Bound when the expectation is small . . . . . . . . . . . . . . . . . . . . . 98
13.3 A special case of Hoeffding’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
13.3.1 Some technical lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
13.4 Hoeffding’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
13.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
13.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

14 Applications of Chernoff’s Inequality 105


14.1 QuickSort is Quick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
14.2 How many times can the minimum change? . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
14.3 Routing in a parallel computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
14.3.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
14.4 Faraway Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
14.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
14.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

15 Min Cut 111


15.1 Branching processes – Galton-Watson Process . . . . . . . . . . . . . . . . . . . . . . . . . . 111
15.1.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
15.1.2 On coloring trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
15.2 Min Cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
15.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
15.2.2 Some Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
15.3 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
15.3.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
15.3.1.1 The probability of success . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
15.3.1.2 Running time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
15.3.2 An alternative implementation using MST . . . . . . . . . . . . . . . . . . . . . . . . 117
15.4 A faster algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
15.5 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

16 Discrepancy and Derandomization 121
16.1 Discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
16.2 The Method of Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
16.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

17 Independent set – Turán’s theorem 125


17.1 Turán’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
17.1.1 Some silly helper lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
17.1.2 Statement and proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
17.1.3 An alternative proof of Turán’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . 126
17.1.4 An algorithm for the weighted case . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

18 Derandomization using Conditional Expectations 129


18.1 Method of conditional expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
18.1.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
18.1.1.1 Max kSAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
18.1.1.2 Max cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
18.1.1.3 Turán theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

19 Martingales 131
19.1 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
19.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
19.1.2 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
19.1.2.1 Examples of martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
19.1.2.2 Azuma’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
19.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

20 Martingales II 135
20.1 Filters and Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
20.2 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
20.2.1 Martingales – an alternative definition . . . . . . . . . . . . . . . . . . . . . . . . . . 137
20.3 Occupancy Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
20.3.1 Let’s verify this is indeed an improvement . . . . . . . . . . . . . . . . . . . . . . . 139

21 The power of two choices 141


21.1 Balls and bins with many rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
21.1.1 The game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
21.1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
21.1.3 With only d rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
21.2 The power of two choices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
21.2.1 Upper bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
21.2.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
21.2.3 The power of restricted d choices: Always go left . . . . . . . . . . . . . . . . . . . . 146
21.3 Avoiding terrible choices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
21.4 Escalated choices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
21.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

22 Evaluating And/Or Trees 149
22.1 Evaluating an And/Or Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
22.1.1 Randomized evaluation algorithm for T2,k . . . . . . . . . . . . . . . . . . . . . . . 150
22.1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
22.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

23 The Probabilistic Method II 151


23.1 Expanding Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
23.1.1 An alternative construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
23.1.2 An expander . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
23.2 Probability Amplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
23.3 Oblivious routing revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

24 Dimension Reduction 155


24.1 Introduction to dimension reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
24.2 Normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
24.2.1 The standard multi-dimensional normal distribution . . . . . . . . . . . . . . . . . . . 157
24.3 Dimension reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
24.3.1 The construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
24.3.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
24.3.2.1 A single unit vector is preserved . . . . . . . . . . . . . . . . . . . . . . . 158
24.3.3 All pairwise distances are preserved . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
24.4 Even more on the normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
24.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

25 Streaming and the Multipass Model 163


25.1 The secretary problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
25.2 Reservoir sampling: Fishing a sample from a stream . . . . . . . . . . . . . . . . . . . . . . 164
25.3 Sampling and median selection revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
25.3.1 A median selection with few comparisons . . . . . . . . . . . . . . . . . . . . . . . . 166
25.4 Big data and the streaming model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
25.5 Heavy hitters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
25.5.1 A randomized algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
25.5.2 A deterministic algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

26 Frequency Estimation over a Stream 169


26.1 The art of estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
26.1.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
26.1.2 Averaging estimator: Success with constant probability . . . . . . . . . . . . . . . . . 169
26.1.2.1 The challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
26.1.2.2 Taming of the variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
26.1.3 Median estimator: Success with high probability . . . . . . . . . . . . . . . . . . . . 170
26.2 Frequency estimation over a stream for the kth moment . . . . . . . . . . . . . . . . . . . . . 171
26.2.1 An estimator for the kth moment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
26.2.1.1 Basic estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
26.2.1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
26.2.2 An improved estimator: Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

26.3 Better estimation for F2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
26.3.1 Pseudo-random k-wise independent sequence of signed bits . . . . . . . . . . . . . . 174
26.3.2 Estimator construction for F2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
26.3.2.1 The basic estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
26.3.3 Improving the estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
26.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

27 Approximating the Number of Distinct Elements in a Stream 177


27.1 Counting number of distinct elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
27.1.1 First order statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
27.1.2 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
27.2 Sampling from a stream with “low quality” randomness . . . . . . . . . . . . . . . . . . . . . 179
27.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

28 Approximate Nearest Neighbor (ANN) Search in High Dimensions 181


28.1 ANN on the hypercube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
28.1.1 ANN for the hypercube and the Hamming distance . . . . . . . . . . . . . . . . . . . 181
28.1.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
28.1.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
28.1.2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
28.1.2.3 Running time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
28.2 Testing for good items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
28.3 LSH for the hypercube: An elaborate construction . . . . . . . . . . . . . . . . . . . . . . . . 184
28.3.0.1 On sense and sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
28.3.1 A simple sensitive family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
28.3.2 A family with a large sensitivity gap . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
28.3.3 Amplifying sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
28.3.4 The near neighbor data-structure and handling a query . . . . . . . . . . . . . . . . . 186
28.3.4.1 Setting the parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
28.3.5 The result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
28.4 LSH and ANN in Euclidean space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
28.4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
28.4.2 Locality sensitive hashing (LSH) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
28.4.3 ANN in high-dimensional Euclidean space . . . . . . . . . . . . . . . . . . . . . . . 190
28.4.3.1 The result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
28.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

29 Random Walks I 193


29.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
29.1.1 Walking on grids and lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
29.1.1.1 Walking on the line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
29.1.1.2 Walking on two dimensional grid . . . . . . . . . . . . . . . . . . . . . . . 194
29.1.1.3 Walking on three dimensional grid . . . . . . . . . . . . . . . . . . . . . . 194
29.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

30 Random Walks II 197
30.1 Catalan numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
30.2 Walking on the integer line revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
30.2.1 Estimating the middle binomial coefficient . . . . . . . . . . . . . . . . . . . . . . . 198
30.3 Solving 2SAT using random walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
30.3.1 Solving 2SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
30.4 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

31 Random Walks III 205


31.1 Random walks on graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
31.2 Electrical networks and random walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
31.2.1 A tangent on parallel and series resistors . . . . . . . . . . . . . . . . . . . . . . . . . 208
31.2.2 Back to random walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
31.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

32 Random Walks IV 211


32.1 Cover times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
32.1.1 Rayleigh’s short-cut principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
32.2 Graph Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
32.2.1 Directed graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

33 A Bit on Algebraic Graph Theory 215


33.1 Graphs and Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
33.1.1 Eigenvalues and eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
33.1.2 Eigenvalues and eigenvectors of a graph . . . . . . . . . . . . . . . . . . . . . . . . . 216
33.2 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

34 Random Walks V 217


34.1 Explicit expander construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
34.2 Rapid mixing for expanders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
34.2.1 Bounding the mixing time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
34.3 Probability amplification by random walks on expanders . . . . . . . . . . . . . . . . . . . . 219
34.3.1 The analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
34.3.2 Some standard inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

35 Complexity classes 223


35.1 Las Vegas and Monte Carlo algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
35.2 Complexity classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
35.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

36 Backwards analysis 225


36.1 How many times can the minimum change? . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
36.2 Computing a good ordering of the vertices of a graph . . . . . . . . . . . . . . . . . . . . . . 226
36.2.1 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
36.2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
36.3 Computing nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
36.3.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
36.3.1.1 Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

36.3.2 Computing an r-net in a sparse graph . . . . . . . . . . . . . . . . . . . . . . . . . . 227
36.3.2.1 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
36.3.2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
36.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

37 Multiplicative Weight Update: Expert Selection 231


37.1 The problem: Expert selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
37.2 Majority vote . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
37.3 Randomized weighted majority . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
37.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233

38 On Complexity, Sampling, and ε-Nets and ε-Samples 235


38.1 VC dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
38.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
38.1.1.1 Halfspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
38.2 Shattering dimension and the dual shattering dimension . . . . . . . . . . . . . . . . . . . . . 238
38.2.1 Mixing range spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
38.3 On ε-nets and ε-sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
38.3.1 ε-nets and ε-samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
38.3.2 Some applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
38.3.2.1 Range searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
38.3.2.2 Learning a concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
38.4 A better bound on the growth function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
38.5 Some required definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

39 Double sampling 245


39.1 Double sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
39.1.1 Disagreement between samples on a specific set . . . . . . . . . . . . . . . . . . . . . 246
39.1.2 Exponential decay for a single set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
39.1.3 Moments of the sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
39.1.4 Growth function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
39.2 Proof of the ε-net theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
39.2.1 The proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
39.2.1.1 Reduction to double sampling . . . . . . . . . . . . . . . . . . . . . . . . . 248
39.2.1.2 Using double sampling to finish the proof . . . . . . . . . . . . . . . . . . 249

40 Finite Metric Spaces and Partitions 251


40.1 Finite Metric Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
40.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
40.2.1 Hierarchical Tree Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
40.2.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
40.3 Random Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
40.3.1 Constructing the partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
40.3.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
40.4 Probabilistic embedding into trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
40.4.1 Application: approximation algorithm for k-median clustering . . . . . . . . . . . . . 256
40.5 Embedding any metric space into Euclidean space . . . . . . . . . . . . . . . . . . . . . . . . 257

40.5.1 The bounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
40.5.2 The unbounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
40.6 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
40.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

41 Entropy, Randomness, and Information 263


41.1 The entropy function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
41.2 Extracting randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
41.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

42 Entropy II 269
42.1 Huffman coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
42.1.1 The algorithm to build Huffman’s code . . . . . . . . . . . . . . . . . . . . . . . . . 270
42.1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
42.1.3 A formula for the average size of a code word . . . . . . . . . . . . . . . . . . . . . . 271
42.2 Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
42.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

43 Entropy III - Shannon’s Theorem 275


43.1 Coding: Shannon’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
43.2 Proof of Shannon’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
43.2.1 How to encode and decode efficiently . . . . . . . . . . . . . . . . . . . . . . . . . . 276
43.2.1.1 The scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
43.2.1.2 The proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
43.2.2 Lower bound on the message size . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
43.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280

44 Approximate Max Cut 281


44.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
44.1.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
44.2 Semi-definite programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
44.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284

45 Expanders I 287
45.1 Preliminaries on expanders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
45.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
45.2 Tension and expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288

46 Expanders II 291
46.1 Bi-tension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
46.2 Explicit construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
46.2.1 Explicit construction of a small expander . . . . . . . . . . . . . . . . . . . . . . . . 293
46.2.1.1 A quicky reminder of fields . . . . . . . . . . . . . . . . . . . . . . . . . . 293
46.2.1.2 The construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294

47 Expanders III - The Zig Zag Product 297
47.1 Building a large expander with constant degree . . . . . . . . . . . . . . . . . . . . . . . . . 297
47.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
47.1.2 The Zig-Zag product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
47.1.3 Squaring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
47.1.4 The construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
47.2 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
47.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

48 The Probabilistic Method 303


48.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
48.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
48.1.1.1 Max cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
48.2 Maximum Satisfiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

49 The Probabilistic Method III 307


49.1 The Lovász Local Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
49.2 Application to k-SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
49.2.1 An efficient algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
49.2.1.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310

50 The Probabilistic Method IV 311


50.1 The Method of Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
50.2 Independent set in a graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
50.3 A Short Excursion into Combinatorics via the Probabilistic Method . . . . . . . . . . . . . . 313
50.3.1 High Girth and High Chromatic Number . . . . . . . . . . . . . . . . . . . . . . . . 313
50.3.2 Crossing Numbers and Incidences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
50.3.3 Bounding the at most k-level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315

51 Sampling and the Moments Technique 317


51.1 Vertical decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
51.1.1 Randomized incremental construction (RIC) . . . . . . . . . . . . . . . . . . . . . . 318
51.1.1.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
51.1.2 Backward analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
51.2 General settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
51.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
51.2.1.1 Examples of the general framework . . . . . . . . . . . . . . . . . . . . . . 323
51.2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
51.2.2.1 On the probability of a region to be created . . . . . . . . . . . . . . . . . . 324
51.2.2.2 On exponential decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
51.2.2.3 Bounding the moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
51.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
51.3.1 Analyzing the RIC algorithm for vertical decomposition . . . . . . . . . . . . . . . . 327
51.3.2 Cuttings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
51.4 Bounds on the probability of a region to be created . . . . . . . . . . . . . . . . . . . . . . . 329
51.5 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
51.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332

52 Primality testing 335
52.1 Number theory background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
52.1.1 Modulo arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
52.1.1.1 Prime and coprime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
52.1.1.2 Computing gcd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
52.1.1.3 The Chinese remainder theorem . . . . . . . . . . . . . . . . . . . . . . . . 337
52.1.1.4 Euler totient function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
52.1.2 Structure of the modulo group Zn . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
52.1.2.1 Some basic group theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
52.1.2.2 Subgroups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
52.1.2.3 Cyclic groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
52.1.2.4 Modulo group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
52.1.2.5 Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
52.1.2.6 Z∗p is cyclic for prime numbers . . . . . . . . . . . . . . . . . . . . . . . . 340
52.1.2.7 Z∗n is cyclic for powers of a prime . . . . . . . . . . . . . . . . . . . . . . . 341
52.1.3 Quadratic residues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
52.1.3.1 Quadratic residue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
52.1.3.2 Legendre symbol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
52.1.3.3 Jacobi symbol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
52.1.3.4 Jacobi(a, n): Computing the Jacobi symbol . . . . . . . . . . . . . . . . . 345
52.1.3.5 Subgroups induced by the Jacobi symbol . . . . . . . . . . . . . . . . . . . 346
52.2 Primality testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
52.2.1 Distribution of primes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
52.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348

53 Talagrand’s Inequality 351


53.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
53.1.1 Talagrand’s inequality, and the T -distance . . . . . . . . . . . . . . . . . . . . . . . 351
53.1.2 On the way to proving Talagrand’s inequality . . . . . . . . . . . . . . . . . . . . . . 353
53.1.2.1 The low level details used in the above proof . . . . . . . . . . . . . . . . . 354
53.1.3 Proving Talagrand’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
53.2 Concentration via certification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
53.3 Some examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
53.3.1 Longest increasing subsequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
53.3.2 Largest convex subset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
53.3.3 Balls into bins revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
53.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
53.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361

54 Low Dimensional Linear Programming 363


54.1 Linear programming in constant dimension (d > 2) . . . . . . . . . . . . . . . . . . . . . . . 363
54.2 Handling Infeasible Linear Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
54.3 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367

55 Algorithmic Version of Lovász Local Lemma 369


55.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
55.1.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369

56 Some math stuff 371
56.1 Some useful estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371

Index 378

Chapter 1

Introduction to Randomized Algorithms


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024

People tell me it’s a sin
To know and feel too much within.
I still believe she was my twin, but I lost the ring.
She was born in spring, but I was born too late.
Blame it on a simple twist of fate.

A little twist of fate, Bob Dylan

1.1. What are randomized algorithms?


Randomized algorithms are algorithms that make random decisions during their execution. Specifically, they
are allowed to use variables whose values are drawn from some random distribution. It is not immediately
clear why adding the ability to use randomness helps an algorithm, but it turns out that the benefits are quite
substantial. Before listing them, let us start with an example.

1.1.1. The benefits of unpredictability


Consider the following game. The adversary has an equilateral triangle, with three coins on the vertices of the
triangle (numbered, say, 1, 2, 3). Initially, the adversary sets each of the three coins to be either heads or tails,
as she sees fit.
At each round of the game, the player can ask to flip certain coins (say, flip the coins at vertices 1 and 3). If after
the flips all three coins have the same side up, then the game stops. Otherwise, the adversary is allowed to rotate
the board by 0, 120 or −120 degrees, as she sees fit, and the game continues from this point on. To make
things interesting, the player does not see the board at all, and does not know the initial configuration of the
coins.

A randomized algorithm. The randomized algorithm in this case is easy – the player randomly chooses a
number among 1, 2, 3 at every round, and flips that coin. Since, at every point in time, two of the coins have the
same side up and the remaining coin has the other side up, a random choice hits the lonely coin, and thus finishes
the game, with probability 1/3 at each step. In particular, the number of rounds till the game terminates is a
random variable with geometric distribution with parameter 1/3 (and thus the expected number of
rounds is 3). Clearly, the probability that the game continues for more than i rounds, when the player uses this
randomized algorithm, is (2/3)^i . In particular, it vanishes to zero relatively quickly.
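To see the 1/3-per-round behavior concretely, here is a small simulation (a Python sketch, not part of the original notes; the modeling choices are ours – the player flips a single uniformly random coin, and since a rotation does not change the distribution of such a flip, the adversary is modeled by a random rotation):

```python
import random

def play_game(rng):
    """Play one game with the randomized player; return the number of rounds.

    The player flips a uniformly random coin each round. Against such a
    player, the adversary's rotation has no effect, so it is modeled here
    as a random rotation of the board.
    """
    coins = [1, 1, 0]  # two coins agree, one is the "lonely" coin
    rounds = 0
    while True:
        rounds += 1
        coins[rng.randrange(3)] ^= 1        # flip a random coin
        if coins[0] == coins[1] == coins[2]:
            return rounds                   # all three agree: game over
        r = rng.randrange(3)                # adversary rotates by 0, 120, or -120
        coins = coins[r:] + coins[:r]

rng = random.Random(0)
mean = sum(play_game(rng) for _ in range(100000)) / 100000
print(mean)  # close to 3, the expected number of rounds
```

The empirical average over many games hovers around 3, matching the geometric distribution with parameter 1/3.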

A deterministic algorithm. The surprise here is that there is no deterministic algorithm that can generate a
winning sequence. Indeed, if the player uses a deterministic algorithm, then the adversary can simulate the
algorithm herself, and know at every stage what coin the player would ask to flip (it is easy to verify that
flipping two coins in a step is equivalent to flipping the other coin – so we can restrict ourselves to a single coin
flip at each step). In particular, the adversary can rotate the board at the end of the round, such that the player
(in the next round) flips one of the two coins that are in the same state. Namely, the player never wins.

The shocker. One can play the same game with a board of size 4 (i.e., a square), where at each stage the
player can flip one or two coins, and the adversary can rotate the board by 0, 90, 180, 270 degrees after each
round. Surprisingly, there is a deterministic winning strategy for this case. The interested reader can think what
it is (this is one of these brain teasers that are not immediate, and might take you 15 minutes to solve, or longer
[or much longer]).

The unfair game of the analysis of algorithms. The underlying problem with analyzing algorithms is the
inherent unfairness of worst case analysis. We are given a problem, we propose an algorithm, and then an
all-powerful adversary chooses the worst input for our algorithm. Using randomness gives the player (i.e., the
algorithm designer) some power to fight the adversary by being unpredictable.

1.1.2. Back to randomized algorithms


(A) Best. There are cases where only randomized algorithms are known or possible, especially for games.
For example, consider the 3 coins example given above.
(B) Speed. In some cases randomized algorithms are considerably faster than any deterministic algorithm.
(C) Simplicity. Even if a randomized algorithm is not faster, often it is considerably simpler than its deterministic
counterpart.
(D) Derandomization. Some deterministic algorithms arise from derandomizing randomized algorithms,
and these are the only algorithms we know for these problems (e.g., discrepancy).
(E) Adversary arguments and lower bounds. The standard worst case analysis relies on the idea that
the adversary can select the input on which the algorithm performs worst. Inherently, the adversary is
more powerful than the algorithm, since the algorithm is completely predictable. By using a randomized
algorithm, we can make the algorithm unpredictable and break the adversary lower bound.
Namely, randomness makes the algorithm vs. adversary game a more balanced game, by giving the
algorithm additional power against the adversary.

1.1.3. Randomized vs average-case analysis


Randomized algorithms are not the same as average-case analysis. In average-case analysis, one assumes that
the input is drawn from some given distribution, and one analyzes the execution of an algorithm on such an input.
On the other hand, randomized algorithms do not assume random inputs – inputs can be arbitrary. As such,
randomized algorithm analysis is more widely applicable, and more general.
While there is a lot of average-case analysis in the literature, the problem is that it is hard to find distributions
on inputs that are meaningful in comparison to real-world inputs. In particular, in numerous cases, the average-case
analysis exposes structure that does not exist in real-world inputs.

1.2. Examples of randomized algorithms
1.2.1. 2SAT
The input is a 2SAT formula; that is, a 2CNF boolean formula – a conjunction of clauses, where each clause is
an or of two literals. A literal here is either a variable or its
negation. For example, the input formula might be

F = (x1 ∨ x2 ) ∧ (x2 ∨ x3 ) ∧ · · · ∧ (x1 ∨ x17 ).

(Here, ∨ is a boolean or, and ∧ is a boolean and.) Assume that F uses n variables (say x1 , . . . , xn ∈ {0, 1})
and m clauses. The task at hand is to compute a satisfying assignment for F. That is, determine what values
have to be assigned to x1 , . . . , xn .
This problem can be solved in linear time (i.e., O(n + m)) by a somewhat careful and somewhat clever usage
of directed graphs and strongly connected components. Here, we present a much simpler randomized algorithm,
together with some intuition why it works. We will hopefully provide a more detailed formal
proof later in the course.

1.2.1.1. The algorithm


The algorithm starts with an arbitrary assignment to the variables of F. If F evaluates to TRUE, then the
algorithm is done. Otherwise, there must be a clause, say Ci = ℓi ∨ ℓi′ , that is not satisfied. The algorithm
randomly chooses (with equal probability) one of the literals of Ci , and flips the value assigned to the variable in
this literal. Thus, if the algorithm chose ℓi′ = x17 , then the algorithm would flip the value of x17 . The algorithm
then continues to the next iteration.

Claim 1.2.1 (Proof later in the course). If F has a satisfying assignment, then the above algorithm performs
O(n^2) iterations in expectation, till it finds a satisfying assignment. Thus, the expected running time of this
algorithm is O(n^2 m).
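The algorithm is short enough to state in code. A possible Python sketch (the input format is our own convention: a literal is an integer i, with −i denoting the negation of xi):

```python
import random

def random_2sat(clauses, n, max_iters=None):
    """Random-walk 2SAT solver. clauses is a list of pairs of literals,
    where literal i > 0 stands for x_i and i < 0 for its negation.
    Returns a satisfying assignment (dict: variable -> bool), or None if
    none was found within max_iters iterations."""
    if max_iters is None:
        max_iters = 100 * n * n  # comfortably above the O(n^2) expectation
    assign = {v: random.random() < 0.5 for v in range(1, n + 1)}

    def satisfied(lit):
        return assign[abs(lit)] == (lit > 0)

    for _ in range(max_iters):
        unsat = [c for c in clauses if not (satisfied(c[0]) or satisfied(c[1]))]
        if not unsat:
            return assign
        clause = random.choice(unsat)   # some unsatisfied clause
        lit = random.choice(clause)     # flip one of its two literals at random
        assign[abs(lit)] = not assign[abs(lit)]
    return None
```

Since the expected number of iterations is O(n^2) for a satisfiable formula, running for (say) 100n^2 iterations fails only with small probability.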

1.2.1.2. Intuition
Fix a specific satisfying assignment Ξ of F. Let Xi be the number of variables in the assignment at the
beginning of the ith iteration that agree with Ξ. If Xi = n, then the algorithm found Ξ, and it is done. Otherwise,
Xi changes by exactly one at each iteration. That is, Xi+1 = Xi + 1 or Xi+1 = Xi − 1. If both variables of Ci
are assigned the “wrong” value (i.e., the negation of their value in Ξ), then Xi+1 = Xi + 1. The other option is
that exactly one of the variables of Ci is assigned the wrong value. The probability that the algorithm guesses the right
variable to flip is 1/2. Thus, we have Xi+1 = Xi + 1 with probability (at least) 1/2, and Xi+1 = Xi − 1 with probability (at most) 1/2.
Thus, the execution of the algorithm is a random process. Starting with X1 being some value in the range
J0 : nK = {0, . . . , n}, the question is how long do we have to run this process till Xi = n. It turns out that the
answer is O(n^2), because essentially this process is related to the random walk on the line, described next.

1.2.2. Walk on the grid


1.2.2.1. Walk on the line
Let Z denote the set of all integer numbers. Consider the random process that starts at time zero, with the
“player” being at position X0 = 0. In the ith step of the game, the player randomly chooses, with probability half,
to go left – that is, to move to Xi = Xi−1 − 1 – or, with equal probability, to go right (i.e., Xi = Xi−1 + 1). The sequence
X = X0 , X1 , . . . is a random walk on the integers. A natural question is how many times would the walk visit
the origin, in the infinite walk X?
Well, the probability of the random walk being at the origin at time 2n is exactly αn = (2n choose n)/2^{2n}. Indeed,
there are 2^{2n} random walks of length 2n. For the walk to be at the origin at time 2n, the walk has to be balanced
– an equal number of steps have to be taken to the left and to the right. The number of binary sequences of length
2n that have exactly n 0s and n 1s is (2n choose n).
Exercise 1.2.2. Prove that (2n choose n) = Θ(2^{2n}/√n). (An easy proof follows from using Stirling’s formula, but there
is also a not too difficult direct elementary proof.)
As such, we have that c−/√n ≤ αn ≤ c+/√n, where c− , c+ are two positive constants. Thus, the expected number
of times the random walk visits the origin is

∑_{n=1}^{∞} αn ≥ ∑_{n=1}^{∞} c−/√n = +∞.

(Why the last argument is valid would be explained in following lectures.)
Namely, the random walk visits the origin an infinite number of times.
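The Θ(1/√n) behavior of αn is easy to check numerically (a quick Python sketch, not part of the original notes; the limiting constant happens to be 1/√π ≈ 0.564, though the argument above only needs that it is some constant):

```python
from math import comb, sqrt

def alpha(n):
    """Exact probability that the 1D walk is at the origin at time 2n."""
    return comb(2 * n, n) / 4 ** n  # (2n choose n) / 2^(2n)

# alpha(n) * sqrt(n) should hover around a constant (namely 1/sqrt(pi)).
for n in (10, 100, 1000):
    print(n, alpha(n) * sqrt(n))
```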

1.2.2.2. Walk on the two dimensional grid


The same question can be asked when the underlying set is Z × Z – that is, the two dimensional integer grid.
Here, the walk starts at the origin X0 = (0, 0), and in the ith step, the walk moves with (equal) probability to
one of the four adjacent locations. That is, if Xi−1 = (xi−1 , yi−1 ), then Xi is one of the following four locations,
with equal probability:

(xi−1 − 1, yi−1 ), (xi−1 + 1, yi−1 ), (xi−1 , yi−1 − 1), and (xi−1 , yi−1 + 1).

As before, one can ask what is the expected number of times this random walk visits the origin. Let βn be the
probability of being at the origin at time 2n.

Exercise 1.2.3. Prove that βn = αn^2 = Θ(1/n). (There is a nifty trick to prove this. See if you can figure it out.)

Arguing as above, we have that the expected number of times the walk visits the origin is ∑_{n=1}^{∞} Θ(1/n) =
+∞. Namely, the walk visits the origin an infinite number of times.
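One can gain confidence in the identity of Exercise 1.2.3 by computing βn exactly for small n via dynamic programming over the distribution of positions, and comparing it to αn^2 (a small Python sketch, not part of the original notes):

```python
from fractions import Fraction
from math import comb

def beta(n):
    """Exact probability that the 2D walk is at the origin at time 2n,
    by dynamic programming over the distribution of positions."""
    dist = {(0, 0): Fraction(1)}
    for _ in range(2 * n):
        nxt = {}
        for (x, y), p in dist.items():
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                q = (x + dx, y + dy)
                nxt[q] = nxt.get(q, Fraction(0)) + p / 4
        dist = nxt
    return dist[(0, 0)]

def alpha(n):
    """(2n choose n) / 2^(2n), as an exact fraction."""
    return Fraction(comb(2 * n, n), 4 ** n)

for n in range(1, 5):
    assert beta(n) == alpha(n) ** 2
```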

1.2.2.3. Walk on the three dimensional grid


The same question can be asked when the underlying set is Z × Z × Z – as before, the walk starts at the origin,
and at each step the walk goes to one of the six adjacent cells. It turns out that the probability of being at
the origin at time (say) 6n is Θ(1/n^{3/2}) (the proof is not clean or easy in this case), and as such, the expected
number of times this walk visits the origin is ∑_{n=1}^{∞} Θ(1/n^{3/2}) = O(1). Surprise!

1.2.3. RSA and primality testing


Oversimplifying the basic idea, RSA works as follows. Compute two huge random primes p and q, and release
n = pq as the public key. Given n one can encrypt a message, but to decrypt it, one needs both p and q. So, we
rely here on the computational hardness of factoring.
Using RSA thus boils down to computing large prime numbers. Fortunately, the following is known (and,
somewhat surprisingly, is not difficult to prove):
20
Theorem 1.2.4. The range JnK = {1, . . . , n} contains Θ(n/ log n) prime numbers.

As a number n can be written using O(log n) digits, that essentially means that a random number with t
digits has probability ≈ 1/t to be a prime number. Namely, primes are quite common.
Fortunately, one can test quickly whether or not a random number is a prime.
Theorem 1.2.5. Given a positive integer n, it can be written using T = ⌈log10 n⌉ digits. Furthermore, one can
decide in O(T^4) = O(log^4 n) randomized time if n is prime. More precisely, if n is not prime, the algorithm
returns “not prime” with probability at least half; if it is prime, it always returns “prime”.

A natural way to decide if a number n with t bits is prime is to run the above algorithm (say) 10t times.
If any of the runs returns that the number is not prime, then we return “not prime”. Otherwise, we return that the
number is a prime. The probability that a composite number would be reported as prime is 1/2^{10t} ≤ 1/10^t ,
which is a tiny number, for t, say, larger than 512.
This gives us an efficient way to pick random prime numbers – the time to compute such a number is
polynomial in the number of bits it uses. Now we can deploy RSA, as computing large random prime numbers
is the main technical difficulty in implementing it.
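For concreteness, here is a sketch of how such a prime sampler might look. The test below is the standard Miller–Rabin test (not necessarily the exact algorithm the theorem refers to); a single round catches a composite with probability at least 3/4, so repeating it drives the error down exponentially, exactly as in the amplification argument above:

```python
import random

def is_probable_prime(n, rounds=40):
    """Miller-Rabin test: always says True for a prime; a composite
    slips through a single round with probability at most 1/4."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:          # write n - 1 = d * 2^s with d odd
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False       # a is a witness that n is composite
    return True

def random_prime(bits):
    """Sample random odd numbers of the given bit length until one passes."""
    while True:
        candidate = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate
```

Since a random t-digit number is prime with probability roughly 1/t, the sampler above makes only polynomially many attempts in expectation.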

1.2.4. Min cut


In the most basic version of the min-cut problem, you are given an undirected graph G with n vertices and
m edges, and the task is to compute the minimum number of edges one has to delete so the graph becomes
disconnected.
Consider the following algorithm – it randomly assigns the edges of G weights from the range [0, 1]. It then
computes the MST T of this graph (according to the random weights on the edges). Let e be the heaviest edge
in T . Removing e breaks T into two subtrees, with two disjoint sets of vertices S and S ′ . Let (S , S ′ ) denote the
set of all the edges in G that have one endpoint in S and one in S ′ . The algorithm outputs the edges of (S , S ′ )
as the candidate to be the minimum cut.
The following result is quite surprising.

Theorem 1.2.6. The above algorithm always outputs a cut, and it outputs a min-cut with probability ≥ 2/n^2 .

In particular, it turns out that if you run the above algorithm O(n^2 log n) times, and return the smallest cut
computed, then with probability ≥ 1 − 1/n^{O(1)} , the returned cut is the minimum cut! This algorithm has running
time (roughly) O(n^4) – it can be made faster, but this is already pretty good.
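One round of this algorithm is easy to implement with Kruskal’s algorithm and a union-find structure (a Python sketch with our own input conventions: vertices are 0, …, n−1, the graph is connected, and edges is a list of pairs):

```python
import random
from collections import defaultdict

def random_mst_cut(n, edges):
    """One round: random edge weights, MST via Kruskal, cut at the
    heaviest MST edge. Returns the list of cut edges (S, S')."""
    order = sorted(edges, key=lambda e: random.random())  # random weights
    parent = list(range(n))

    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for u, v in order:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((u, v))

    # Kruskal adds edges by increasing weight, so the heaviest MST edge is
    # the last one added; remove it, and let S be the component of one of
    # its endpoints in the remaining forest.
    adj = defaultdict(list)
    for u, v in mst[:-1]:
        adj[u].append(v)
        adj[v].append(u)
    S, stack = {mst[-1][0]}, [mst[-1][0]]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in S:
                S.add(y)
                stack.append(y)
    return [(u, v) for u, v in edges if (u in S) != (v in S)]

def min_cut(n, edges, rounds):
    """Repeating many times finds a min cut with high probability."""
    return min((random_mst_cut(n, edges) for _ in range(rounds)), key=len)
```

On a graph made of two triangles joined by a single bridge, repeated rounds quickly isolate the bridge as the minimum cut.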

Chapter 2

Probability and Expectation


Everybody knows that the dice are loaded
Everybody rolls with their fingers crossed
Everybody knows the war is over
Everybody knows the good guys lost
Everybody knows the fight was fixed
The poor stay poor, the rich get rich
That’s how it goes
Everybody knows

Everybody knows, Leonard Cohen

2.1. Basic probability


Here we recall some definitions about probability. The reader already familiar with these definitions can happily
skip this section.

2.1.1. Formal basic definitions: Sample space, σ-algebra, and probability


A sample space Ω is the set of all possible outcomes of an experiment. We also have a set of events F , where
every member of F is a subset of Ω. Formally, we require that F is a σ-algebra.

Definition 2.1.1. A single element of Ω is an elementary event or an atomic event.

Definition 2.1.2. A set F of subsets of Ω is a σ-algebra if:


(i) F is not empty,
(ii) if X ∈ F then X̄ = (Ω \ X) ∈ F , and
(iii) if X, Y ∈ F then X ∪ Y ∈ F .
More generally, we require that if Xi ∈ F , for i ∈ Z, then ∪i Xi ∈ F . A member of F is an event.

As a concrete example, if we are rolling a dice, then Ω = {1, 2, 3, 4, 5, 6} and F would be the power set of Ω,
i.e., the collection of all possible subsets of Ω.

Definition 2.1.3. A probability measure is a mapping P : F → [0, 1] assigning probabilities to events. The
function P needs to have the following properties:
(i) Additive: for disjoint sets X, Y ∈ F , we have that P[X ∪ Y] = P[X] + P[Y], and
(ii) P[Ω] = 1.

Observation 2.1.4. Let C1 , . . . , Cn be random events (not necessarily independent). Then

P[∪_{i=1}^{n} Ci ] ≤ ∑_{i=1}^{n} P[Ci ].

(This is usually referred to as the union bound.) If C1 , . . . , Cn are disjoint events then

P[∪_{i=1}^{n} Ci ] = ∑_{i=1}^{n} P[Ci ].

Definition 2.1.5. A probability space is a triple (Ω, F , P), where Ω is a sample space, F is a σ-algebra defined
over Ω, and P is a probability measure.

Definition 2.1.6. A random variable f is a mapping from Ω into some set G. We require that the probability
of the random variable to take on any value in a given subset of values is well defined. Formally, for any subset
U ⊆ G, we have that f^{−1}(U) ∈ F . That is, P[ f ∈ U] = P[ f^{−1}(U)] is defined.

Going back to the dice example, the number on the top of the die when we roll it is a random variable.
Similarly, let X be one if the number rolled is larger than 3, and zero otherwise. Clearly X is a random variable.
We denote the probability of a random variable X to get the value x by P[X = x] (or sometimes P[x], if we
are lazy).

2.1.2. Expectation and conditional probability


Definition 2.1.7 (Expectation). The expectation of a random variable X is its average. Formally, the expectation
of X is

E[X] = ∑_x x · P[X = x].

Definition 2.1.8 (Conditional Probability). The conditional probability of X given Y is the probability that
X = x given that Y = y. We denote this quantity by P[X = x | Y = y].

One useful way to think about the conditional probability P[X | Y] is as a function from the given value
of Y (i.e., y) to the probability of X (to be equal to x) in this case. Since in many cases x and y are omitted in
the notation, it is somewhat confusing.
The conditional probability can be computed using the formula

P[X = x | Y = y] = P[(X = x) ∩ (Y = y)] / P[Y = y].

For example, let us roll a die and let X be the number we got. Let Y be the random variable that is true if
the number we get is even. Then, we have that

P[X = 2 | Y = true] = 1/3.
Definition 2.1.9. Two random variables X and Y are independent if P[X = x | Y = y] = P[X = x], for all x and
y.

Observation 2.1.10. If X and Y are independent then P[X = x | Y = y] = P[X = x], which is equivalent to
P[(X = x) ∩ (Y = y)] / P[Y = y] = P[X = x]. That is, X and Y are independent, if for all x and y, we have that

P[(X = x) ∩ (Y = y)] = P[X = x] · P[Y = y].

Remark. Informally, and not quite correctly, one possible way to think about the conditional probability
P[X = x | Y = y] is that it measures the benefit of having more information. If we know that Y = y, does the
probability of X = x change?

Lemma 2.1.11 (Linearity of expectation). Linearity of expectation is the property that for any two random
variables X and Y, we have that E[X + Y] = E[X] + E[Y].

Proof: E[X + Y] = ∑_{ω∈Ω} P[ω] (X(ω) + Y(ω)) = ∑_{ω∈Ω} P[ω] X(ω) + ∑_{ω∈Ω} P[ω] Y(ω) = E[X] + E[Y]. ■

Lemma 2.1.12. If X and Y are two independent random variables, then E[XY] = E[X] E[Y].

Proof: Let U(X) be the set of all values that X might take. We have that

E[XY] = ∑_{x∈U(X), y∈U(Y)} xy · P[X = x and Y = y] = ∑_{x∈U(X), y∈U(Y)} xy · P[X = x] P[Y = y]
= ∑_{x∈U(X)} ∑_{y∈U(Y)} xy · P[X = x] P[Y = y] = (∑_{x∈U(X)} x P[X = x]) (∑_{y∈U(Y)} y P[Y = y])
= E[X] E[Y]. ■

2.1.3. Variance and standard deviation


Definition 2.1.13 (Variance and Standard Deviation). For a random variable X, let

V[X] = E[(X − µX)²] = E[X²] − µX²

denote the variance of X, where µX = E[X]. Intuitively, this tells us how concentrated the distribution of X is.
The standard deviation of X, denoted by σX, is the quantity √V[X].

Observation 2.1.14. (i) For any constant c ≥ 0, we have V[cX] = c² V[X].
(ii) For X and Y independent variables, we have V[X + Y] = V[X] + V[Y].

2.2. Some distributions and their moments


2.2.1. Bernoulli distribution
Definition 2.2.1 (Bernoulli distribution). Assume that one flips a coin and gets 1 (i.e., head) with probability p,
and 0 (i.e., tail) with probability q = 1 − p. Let X be this random variable. The variable X has a Bernoulli
distribution with parameter p.
We have that E[X] = 1 · p + 0 · (1 − p) = p, and

V[X] = E[X²] − µX² = E[X²] − p² = p − p² = p(1 − p) = pq.

Definition 2.2.2 (Binomial distribution). Assume that we repeat a Bernoulli experiment n times (independently!).
Let X1 , . . . , Xn be the resulting random variables, and let X = X1 + · · · + Xn. The variable X has the binomial
distribution with parameters n and p. We denote this fact by X ∼ Bin(n, p). We have

b(k; n, p) = P[X = k] = C(n, k) p^k q^{n−k},

where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient.
Also, E[X] = np, and V[X] = V[∑_{i=1}^n X_i] = ∑_{i=1}^n V[X_i] = npq.
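Since the pmf above is a finite sum, the formulas E[X] = np and V[X] = npq can be checked exactly by direct enumeration. The following is a small illustrative sketch (the function names are my own, not from the text):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """b(k; n, p) = C(n, k) * p^k * q^(n - k), with q = 1 - p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_moments(n: int, p: float):
    """Exact mean and variance of Bin(n, p), computed from the pmf."""
    mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
    var = sum((k - mean) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))
    return mean, var
```

For example, `binomial_moments(10, 0.3)` evaluates (up to floating-point error) to np = 3 and npq = 2.1.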

2.2.2. Geometric distribution


Definition 2.2.3. Consider a sequence X1 , X2 , . . . of independent Bernoulli trials with probability p for success.
Let X be the number of trials one has to perform till encountering the first success. The distribution of X is a
geometric distribution with parameter p. We denote this by X ∼ Geom(p).

Lemma 2.2.4. For a variable X ∼ Geom(p), we have, for all i, that P[X = i] = (1 − p)i−1 p. Furthermore,
E[X] = 1/p and V[X] = (1 − p)/p2 .

Proof: The proof of the expectation and variance is included for the sake of completeness, and the reader is
of course encouraged to skip (reading) this proof. So, let f(x) = ∑_{i=0}^∞ x^i = 1/(1 − x), and observe that
f′(x) = ∑_{i=1}^∞ i x^{i−1} = (1 − x)^{−2}. As such, we have

E[X] = ∑_{i=1}^∞ i (1 − p)^{i−1} p = p f′(1 − p) = p/(1 − (1 − p))² = 1/p,

and

V[X] = E[X²] − 1/p² = ∑_{i=1}^∞ i² (1 − p)^{i−1} p − 1/p² = p + p(1 − p) ∑_{i=2}^∞ i² (1 − p)^{i−2} − 1/p².

Observe that

f″(x) = ∑_{i=2}^∞ i(i − 1) x^{i−2} = ((1 − x)^{−1})″ = 2/(1 − x)³.

As such, we have that

∆(x) = ∑_{i=2}^∞ i² x^{i−2} = ∑_{i=2}^∞ i(i − 1) x^{i−2} + ∑_{i=2}^∞ i x^{i−2} = f″(x) + (1/x) ∑_{i=2}^∞ i x^{i−1} = f″(x) + (1/x)(f′(x) − 1)
= 2/(1 − x)³ + (1/x)(1/(1 − x)² − 1) = 2/(1 − x)³ + (1/x) · (1 − (1 − x)²)/(1 − x)² = 2/(1 − x)³ + (1/x) · x(2 − x)/(1 − x)²
= 2/(1 − x)³ + (2 − x)/(1 − x)².

As such, we have that

V[X] = p + p(1 − p)∆(1 − p) − 1/p² = p + p(1 − p)(2/p³ + (1 + p)/p²) − 1/p² = p + 2(1 − p)/p² + (1 − p²)/p − 1/p²
= (p³ + 2(1 − p) + p − p³ − 1)/p² = (1 − p)/p². ■
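The closed forms E[X] = 1/p and V[X] = (1 − p)/p² can also be sanity-checked empirically by simulating the trials-until-first-success experiment. A minimal sketch (function names are illustrative):

```python
import random

def geometric_sample(p: float) -> int:
    """Count independent Bernoulli(p) trials until the first success."""
    trials = 1
    while random.random() >= p:  # each trial succeeds with probability p
        trials += 1
    return trials

def empirical_moments(p: float, runs: int = 200_000, seed: int = 0):
    """Empirical mean and variance of Geom(p) over many simulated runs."""
    random.seed(seed)
    samples = [geometric_sample(p) for _ in range(runs)]
    mean = sum(samples) / runs
    var = sum((x - mean) ** 2 for x in samples) / runs
    return mean, var
```

For p = 1/4, the empirical mean and variance land near 1/p = 4 and (1 − p)/p² = 12.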

2.3. Application of expectation: Approximating 3SAT
Let F be a boolean formula with n variables in CNF form, with m clauses, where each clause has exactly k
literals. We claim that a random assignment for F, where each variable is independently set to 0 or 1 with
probability 1/2, satisfies in expectation (1 − 2^{−k})m of the clauses.
We remind the reader that an instance of 3SAT is a boolean formula, for example F = (x1 + x2 + x3 )(x4 +
x1 + x2 ), and the decision problem is to decide if the formula has a satisfiable assignment. Interestingly, we can
turn this into an optimization problem.

Max 3SAT
Instance: A collection of clauses: C1 , . . . , Cm .
Question: Find the assignment to x1 , ..., xn that satisfies the maximum number of clauses.

Clearly, since 3SAT is NP-Complete it implies that Max 3SAT is NP-Hard. In particular, the formula F
becomes the following set of two clauses:

x1 + x2 + x3 and x4 + x1 + x2 .

Note, that Max 3SAT is a maximization problem.

Definition 2.3.1. Algorithm Alg for a maximization problem achieves an approximation factor α if for all
inputs G, we have:

Alg(G) / Opt(G) ≥ α.

In the following, we present a randomized algorithm – it is allowed to consult with a source of random
numbers in making decisions. A key property we need about random variables, is the linearity of expectation
property defined above.

Definition 2.3.2. For an event E, let X be a random variable which is 1 if E occurred, and 0 otherwise. The
random variable X is an indicator variable.

Observation 2.3.3. For an indicator variable X of an event E, we have

E[X] = 0 · P[X = 0] + 1 · P[X = 1] = P[X = 1] = P[E].

Theorem 2.3.4. One can achieve (in expectation) a (7/8)-approximation to Max 3SAT in polynomial time.
Specifically, consider a 3SAT formula F with n variables and m clauses, and consider the randomized
algorithm that assigns each variable the value 0 or 1 with equal probability (independently for each variable).
Then this assignment satisfies (7/8)m clauses in expectation.

Proof: Let x1 , . . . , xn be the n variables used in the given instance. The algorithm works by randomly assigning
values to x1 , . . . , xn, independently and with equal probability, to 0 or 1, for each one of the variables.
Let Yi be the indicator variable which is 1 if (and only if) the ith clause is satisfied by the generated random
assignment, and 0 otherwise, for i = 1, . . . , m. Formally, we have

Yi = 1 if Ci is satisfied by the generated assignment, and Yi = 0 otherwise.

Now, the number of clauses satisfied by the given assignment is Y = ∑_{i=1}^m Yi. We claim that E[Y] = (7/8)m,
where m is the number of clauses in the input. Indeed, we have

E[Y] = E[∑_{i=1}^m Yi] = ∑_{i=1}^m E[Yi],

by linearity of expectation. The probability that Yi = 0 is exactly the probability that all three literals appearing
in the clause Ci are evaluated to FALSE. Since the three literals, say ℓ1 , ℓ2 , ℓ3, are instances of three distinct
variables, these three events are independent, and as such the probability for this happening is

P[Yi = 0] = P[(ℓ1 = 0) ∧ (ℓ2 = 0) ∧ (ℓ3 = 0)] = P[ℓ1 = 0] · P[ℓ2 = 0] · P[ℓ3 = 0] = (1/2) · (1/2) · (1/2) = 1/8.
(Another way to see this, is to observe that since Ci has exactly three literals, there is only one possible assign-
ment to the three variables appearing in it, such that the clause evaluates to FALSE. Now, there are eight (8)
possible assignments to this clause, and thus the probability of picking a FALSE assignment is 1/8.) Thus,
P[Yi = 1] = 1 − P[Yi = 0] = 7/8,

and

E[Yi] = P[Yi = 0] · 0 + P[Yi = 1] · 1 = 7/8.

Namely, E[# of clauses satisfied] = E[Y] = ∑_{i=1}^m E[Yi] = (7/8)m. Since the optimal solution satisfies at most m
clauses, the claim follows. ■

Curiously, Theorem 2.3.4 is stronger than what one would usually be able to get for an approximation
algorithm. Here, the approximation quality is independent of how well the optimal solution does (the optimal
can satisfy at most m clauses, and as such we get a (7/8)-approximation). Curiouser and curiouser¬, the algorithm
does not even look at the input when generating the random assignment.
Håstad [Hås01a] proved that one can do no better; that is, for any constant ε > 0, one cannot approximate
3SAT in polynomial time (unless P = NP) to within a factor of 7/8 + ε. It is pretty amazing that a trivial
algorithm like the above is essentially optimal.
Remark 2.3.5. For k ≥ 3, the above implies a (1 − 2^{−k})-approximation algorithm for k-SAT, as long as each
clause has at least k literals.
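The algorithm of Theorem 2.3.4 is short enough to spell out. Below is an illustrative sketch; the clause encoding (signed integers for literals) is a common convention and not taken from the text:

```python
import random

def random_assignment_satisfied(clauses, n, rng):
    """Draw one uniformly random assignment of n boolean variables and
    count how many clauses it satisfies.  A clause is a tuple of nonzero
    integers: literal +i stands for x_i, and -i for its negation."""
    assignment = [rng.random() < 0.5 for _ in range(n + 1)]  # index 0 unused
    return sum(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )
```

For clauses with three distinct variables, averaging this count over many independent random assignments concentrates around the expectation (7/8)m.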

2.4. Markov’s inequality


2.4.1. Markov’s inequality
We remind the reader that for a random variable Y assuming real values, its expectation is E[Y] = ∑_y y · P[Y = y].
Similarly, for a function f (·), we have E[f(Y)] = ∑_y f(y) · P[Y = y].
Theorem 2.4.1 (Markov’s Inequality). Let Y be a random variable assuming only non-negative values. Then
for all t > 0, we have

P[Y ≥ t] ≤ E[Y] / t.
¬
“Curiouser and curiouser!” cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good
English). – Alice in Wonderland, Lewis Carroll

Proof: Indeed,

E[Y] = ∑_{y≥t} y P[Y = y] + ∑_{y<t} y P[Y = y] ≥ ∑_{y≥t} y P[Y = y] ≥ ∑_{y≥t} t P[Y = y] = t P[Y ≥ t]. ■

Markov’s inequality is tight, as the following exercise testifies.

Exercise 2.4.2. For any (integer) k > 1, define a random positive variable Xk such that P[Xk ≥ k E[Xk ]] = 1/k.

2.4.2. Example: A good approximation to kSAT with good probability


In Section 2.3 we saw a surprisingly simple algorithm that, for a 3SAT formula F with n variables and m
clauses, finds (in linear time) an assignment that in expectation satisfies (7/8)m of the clauses (for simplicity,
here we set k = 3).
The problem is that the guarantee is only in expectation – the assignment output by the algorithm
might satisfy far fewer clauses. Namely, we would like to convert a guarantee that holds in expectation into a
guarantee that holds with good probability. So, let ε, φ ∈ (0, 1/2) be two parameters. We would like an algorithm that outputs
an assignment that satisfies (say) (1 − ε)(7/8)m clauses, with probability ≥ 1 − φ.
To this end, the new algorithm runs the previous algorithm

u = ⌈(1/ε) ln(1/φ)⌉

times, and returns the assignment satisfying the largest number of clauses.

Lemma 2.4.3. Given a 3SAT formula with n variables and m clauses, and parameters ε, φ ∈ (0, 1/2), the
above algorithm returns an assignment that satisfies ≥ (1 − ε)(7/8)m clauses of F, with probability ≥ 1 − φ.
The running time of the algorithm is O(ε−1 (n + m) log φ−1 ).

Proof: Let Zi be the number of clauses not satisfied by the ith random assignment considered by the algorithm.
Observe that E[Zi] = m/8, as the probability of a clause not being satisfied is 1/2³. The ith iteration fails if

m − Zi < (1 − ε)(7/8)m ⟹ Zi > m(1 − (1 − ε)7/8) = (m/8)(1 + 7ε) = (1 + 7ε) E[Zi].

Thus, by Markov’s inequality, the ith iteration fails with probability

p = P[m − Zi < (1 − ε)(7/8)m] = P[Zi > (1 + 7ε) E[Zi]] < E[Zi] / ((1 + 7ε) E[Zi]) = 1/(1 + 7ε) < 1 − ε,

since (1 + 7ε)(1 − ε) = 1 + 6ε − 7ε² > 1, for ε < 1/2.
For the algorithm to fail, all u iterations must fail. Since 1 − x ≤ exp(−x), the probability of that is at most

p^u ≤ (1 − ε)^u ≤ exp(−εu) ≤ exp(−ε ⌈(1/ε) ln(1/φ)⌉) ≤ φ. ■
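The amplified algorithm of Lemma 2.4.3 can be sketched as follows: run the single-shot algorithm u = ⌈(1/ε) ln(1/φ)⌉ times and keep the best assignment. Names and the clause encoding (signed integers for literals) are illustrative, not from the text:

```python
import math
import random

def count_satisfied(clauses, assignment):
    """Number of satisfied clauses; literal +i is x_i, -i is its negation."""
    return sum(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def best_of_many(clauses, n, eps, phi, rng):
    """Run the random-assignment algorithm u = ceil((1/eps) ln(1/phi))
    times and return the assignment satisfying the most clauses."""
    u = math.ceil(math.log(1 / phi) / eps)
    best, best_count = None, -1
    for _ in range(u):
        assignment = [rng.random() < 0.5 for _ in range(n + 1)]  # index 0 unused
        c = count_satisfied(clauses, assignment)
        if c > best_count:
            best, best_count = assignment, c
    return best, best_count
```

By the lemma, the returned assignment satisfies at least (1 − ε)(7/8)m clauses with probability ≥ 1 − φ.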

2.4.3. Example: Coloring a graph
Consider a graph G = (V, E) with n vertices and m edges. We would like to color it with k colors. As a
reminder, a coloring of a graph by k colors is an assignment χ : V → ⟦k⟧ of a color to each vertex of G, out of
the k possible colors ⟦k⟧ = {1, 2, . . . , k}. The coloring of an edge uv ∈ E is valid if χ(u) ≠ χ(v).

Lemma 2.4.4. Consider a random coloring χ of the vertices of a graph G = (V, E), where each vertex is
assigned a color randomly and uniformly from JkK. Then, the expected number of edges with invalid coloring
is m/k, where m = |E(G)| is the number of edges in G.

Proof: Let E = {e1 , . . . , em}. Let Xi be an indicator variable that is 1 ⇐⇒ ei is colored invalidly by χ. Let
ei = ui vi. We have that

P[Xi = 1] = P[χ(ui) = χ(vi)] = 1/k.

Indeed, conceptually color ui first, and vi later. The probability that vi would be assigned the same color as ui
is 1/k. Let Z be the random variable that is the number of edges that are invalid for χ. We have that Z = ∑_i Xi.
By linearity of expectation, and the expectation of an indicator variable, we have

E[Z] = ∑_{i=1}^m E[Xi] = ∑_{i=1}^m P[Xi = 1] = ∑_{i=1}^m 1/k = m/k. ■

That is pretty good, but what about an algorithm that always succeeds? The above algorithm might always
somehow give us a bad coloring. Well, not to worry.

Lemma 2.4.5. The above random coloring of G with k colors, has at most 2m/k invalid edges, with probability
≥ 1/2.

Proof: We have that E[Z] = m/k. As such, by Markov’s inequality, we have that

P[Z > 2m/k] ≤ P[Z ≥ 2m/k] ≤ E[Z] / (2m/k) = (m/k) / (2m/k) = 1/2.

Thus

P[Z ≤ 2m/k] = 1 − P[Z > 2m/k] ≥ 1 − 1/2 = 1/2. ■

In particular, consider the modified algorithm – it randomly colors the graph G. If there are at most 2m/k
invalid edges, it outputs the coloring and stops. Otherwise, it retries. Each iteration succeeds with probability
p ≥ 1/2, and as such the number of iterations behaves like a geometric random variable. It follows that, in
expectation, the number of iterations is at most 1/p ≤ 2. Thus, the expected running time of this algorithm is
O(m). Indeed, let R be the number of iterations performed by the algorithm. The expected running time is
proportional to

E[Rm] = m E[R] ≤ 2m.

Note that this is not the full picture – P[R = i] ≤ 1/2^{i−1}. So the probability of this algorithm running for long
decreases quickly.
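The retrying algorithm just described can be sketched as follows (function names are my own):

```python
import random

def invalid_edges(edges, coloring):
    """The edges whose two endpoints received the same color."""
    return [(u, v) for (u, v) in edges if coloring[u] == coloring[v]]

def color_with_few_conflicts(n, edges, k, rng):
    """Repeatedly draw a uniform random k-coloring of the n vertices until
    at most 2m/k edges are invalid.  By Lemma 2.4.5 each attempt succeeds
    with probability >= 1/2, so the expected number of attempts is <= 2."""
    m = len(edges)
    while True:
        coloring = [rng.randrange(k) for _ in range(n)]
        if len(invalid_edges(edges, coloring)) <= 2 * m / k:
            return coloring
```

The loop always terminates with a coloring meeting the 2m/k guarantee; only its running time is random.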

2.4.3.1. Getting a valid coloring

2.4.3.1.1. A fun algorithm. A natural approach is to run the above algorithm with k = √m colors (assume it is an
integer). Then identify all the invalid edges, and invalidate the colors of all the vertices involved. We now repeat
the coloring algorithm on these invalid vertices and invalid edges, again using random coloring, but now using
the colors {k + 1, . . . , 2k}. If after this there is a single invalid edge, we color one of its vertices by the color 2k + 1,
and output this coloring. Otherwise, the algorithm fails.

Lemma 2.4.6. The above algorithm succeeds with probability at least 1/2.

Proof: Let Y be the number of invalid edges at the end of the second round. For an edge to be invalid, its
coloring must have failed in both rounds, and the probability for that is exactly (1/k) · (1/k) = 1/m, since the
two events are independent. As such, arguing as above, we have E[Y] = 1. By Markov’s inequality, we have
that

P[algorithm fails] = P[Y > 1] = P[Y ≥ 2] ≤ E[Y]/2 = 1/2. ■

Remark 2.4.7. This is a toy example – it is not hard to come up with a deterministic algorithm that uses (say)
√(2m) + 2 colors (how? think about it). However, this algorithm is a nice distributed algorithm – after three
rounds of communication, it colors the graph in a valid way, with probability at least half.

References
[Hås01a] J. Håstad. Some optimal inapproximability results. J. Assoc. Comput. Mach., 48(4): 798–859,
2001.

Chapter 3

Analyzing QuickSort and QuickSelect via Expectation
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two
weapons are fear and surprise...and ruthless efficiency.... Our three weapons are fear, surprise, and ruthless efficiency...and an
almost fanatical devotion to the Pope.... Our four...no... Amongst our weapons.... Amongst our weaponry...are such elements
as fear, surprise....

The Spanish Inquisition, Monty Python

3.1. QuickSort
Let the input be a set T = {t1 , . . . , tn} of n items to be sorted. We remind the reader that the QuickSort
algorithm picks a pivot element uniformly at random, splits the input into two subarrays (all the elements
smaller than the pivot, and all the elements larger than the pivot), and then recurses on these two subarrays
(the pivot is not included in these two subproblems). Here we will show that the expected running time of
QuickSort is O(n log n).
Let S1 , . . . , Sn be the elements in their sorted order (i.e., the output order). Let Xij be the indicator
variable which is one ⇐⇒ QuickSort compares Si to Sj, and let pij denote the probability that this happens.
Clearly, the number of comparisons performed by the algorithm is C = ∑_{i<j} Xij. By linearity of expectation,
we have

E[C] = E[∑_{i<j} Xij] = ∑_{i<j} E[Xij] = ∑_{i<j} pij.

We want to bound pij, the probability that Si is compared to Sj. Consider the last recursive call involving
both Si and Sj. Clearly, the pivot at this step must be one of Si , . . . , Sj, all equally likely. Indeed, Si and Sj
are separated in the next recursive call.
Observe that Si and Sj get compared if and only if the pivot is Si or Sj. Thus, the probability for that is
2/(j − i + 1). Indeed,

pij = P[Si or Sj picked | pivot picked from Si , . . . , Sj] = 2/(j − i + 1).

Thus,

∑_{i=1}^n ∑_{j>i} pij = ∑_{i=1}^n ∑_{j>i} 2/(j − i + 1) = ∑_{i=1}^n ∑_{k=2}^{n−i+1} 2/k ≤ 2 ∑_{i=1}^n ∑_{k=2}^n 1/k = 2n(Hn − 1) ≤ n + 2n ln n,

where Hn = ∑_{i=1}^n 1/i is the nth harmonic number¬. We thus proved the following result.

Lemma 3.1.1. QuickSort performs in expectation at most n + 2n ln n comparisons, when sorting n elements.

Note that this holds for all inputs – no assumption on the input is made. Similar bounds hold not only in
expectation, but also with high probability.
This raises the question of how the algorithm picks a random element. We assume we have access to
a random source that can give us a number between 1 and n uniformly.
Note that the algorithm always works, but it might take quadratic time in the worst case.

Remark 3.1.2 (Wait, wait, wait). Let us do the key argument in the above more slowly, and more carefully.
Imagine that, before running QuickSort, we choose for every element a random priority, which is a real number
in the range [0, 1]. Now, we re-implement QuickSort such that it always picks the element with the lowest
random priority (in the given subproblem) to be the pivot. One can verify that this variant and the standard
implementation have the same running time. Now, ai gets compared to aj if and only if all the elements
ai+1 , . . . , aj−1 have random priority larger than both the random priority of ai and the random priority of aj. But
the probability that one of two specific elements has the lowest random priority out of j − i + 1 elements is
2 · 1/(j − i + 1), as claimed.
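The QuickSort variant analyzed above can be sketched as follows, assuming distinct items (out-of-place partitioning is used for brevity; the classical algorithm partitions in place):

```python
import random

def quicksort(items, rng=random):
    """Randomized QuickSort: pick a uniformly random pivot, partition
    into the elements smaller and larger than it, and recurse on both
    sides (the pivot itself joins neither subproblem)."""
    if len(items) <= 1:
        return list(items)
    pivot = rng.choice(items)
    smaller = [x for x in items if x < pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller, rng) + [pivot] + quicksort(larger, rng)
```

Whatever the random choices, the output is always sorted; only the number of comparisons is random.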

3.2. QuickSelect: Median selection in linear time


3.2.1. Analysis via expectation and indicator variables
We remind the reader that QuickSelect receives an array T[1 . . . n] of n real numbers, and a number k, and
returns the element of rank k in the sorted order of the elements of T, see Figure 3.1. We can of course use
QuickSort, and just return the kth element in the sorted array, but a more efficient algorithm would be to modify
QuickSort, so that it recurses only on the subproblem that contains the element we are interested in. Formally,
QuickSelect chooses a random pivot and splits the array according to the pivot. This implies that we now know
the rank of the pivot; if it is equal to the rank we are looking for, we return it. Otherwise, we recurse on the
subproblem containing the required element (updating the target rank as we go down the recursion). Namely,
QuickSelect is a modification of QuickSort performing only a single recursive call (instead of two).
As before, to bound the expected running time, we will bound the expected number of comparisons. Let
S1 , . . . , Sn be the elements of T in their sorted order, and let m denote the rank of the element we are selecting.
Now, for i < j, let Xij be the indicator variable that is one if Si is compared to Sj during the execution of
QuickSelect. There are several possibilities to consider:
(i) If i < j < m: Here, Si is compared to Sj if and only if the first pivot picked in the range Si , . . . , Sm is
either Si or Sj. The probability for that is 2/(m − i + 1). As such, we have that

α1 = E[∑_{i<j<m} Xij] = ∑_{i=1}^{m−2} ∑_{j=i+1}^{m−1} E[Xij] = ∑_{i=1}^{m−2} ∑_{j=i+1}^{m−1} 2/(m − i + 1) = ∑_{i=1}^{m−2} 2(m − i − 1)/(m − i + 1) ≤ 2(m − 2).
¬
Using integration to bound the summation, we have Hn ≤ 1 + ∫_{x=1}^n (1/x) dx ≤ 1 + ln n. Similarly, Hn ≥ ∫_{x=1}^n (1/x) dx = ln n.
(ii) If m < i < j: Using the same analysis as above, we have that P[Xij = 1] = 2/(j − m + 1). As such,

α2 = E[∑_{j=m+1}^n ∑_{i=m+1}^{j−1} Xij] = ∑_{j=m+1}^n ∑_{i=m+1}^{j−1} 2/(j − m + 1) = ∑_{j=m+1}^n 2(j − m − 1)/(j − m + 1) ≤ 2(n − m).

(iii) If i < m < j: Here, we compare Si to Sj if and only if the first pivot picked in the range Si , . . . , Sj is either Si
or Sj. As such, E[Xij] = P[Xij = 1] = 2/(j − i + 1), and we have

α3 = E[∑_{i=1}^{m−1} ∑_{j=m+1}^n Xij] = ∑_{i=1}^{m−1} ∑_{j=m+1}^n 2/(j − i + 1).

Observe that, for a fixed value of ∆ = j − i + 1, the term 2/∆ appears in the above summation at most
∆ − 2 times. As such, α3 ≤ ∑_{∆=3}^n 2(∆ − 2)/∆ ≤ 2n.
(iv) If i = m: We have α4 = ∑_{j=m+1}^n E[Xmj] = ∑_{j=m+1}^n 2/(j − m + 1) ≤ ln n + 1.

(v) If j = m: We have α5 = ∑_{i=1}^{m−1} E[Xim] = ∑_{i=1}^{m−1} 2/(m − i + 1) ≤ ln m + 1.

Thus, the expected number of comparisons performed by QuickSelect is bounded by

∑_i αi ≤ 2(m − 2) + 2(n − m) + 2n + (ln n + 1) + (ln m + 1) = 4n − 2 + ln n + ln m.

Theorem 3.2.1. In expectation, QuickSelect performs at most 4n−2+ln n+ln m comparisons, when selecting
the mth element out of n elements.

A different approach can reduce the number of comparisons (in expectation) to 1.5n + o(n). More on that
later in the course.
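A minimal sketch of QuickSelect as described above, with a single recursive call per level (assuming distinct items and 1 ≤ k ≤ n):

```python
import random

def quickselect(items, k, rng=random):
    """Return the element of rank k (1-based) among distinct items,
    recursing only on the side of the pivot that contains that rank."""
    pivot = rng.choice(items)
    smaller = [x for x in items if x < pivot]
    larger = [x for x in items if x > pivot]
    r = len(smaller) + 1  # the rank of the pivot in items
    if k == r:
        return pivot
    if k < r:
        return quickselect(smaller, k, rng)
    return quickselect(larger, k - r, rng)
```

Note how the target rank is shifted by r whenever the recursion descends into the elements larger than the pivot, exactly as in the pseudo-code of Figure 3.1.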

3.2.2. Analysis of QuickSelect via conditional expectations


Consider the problem of, given a set X of n numbers and a parameter k, outputting the kth smallest number
(which is the number with rank k in X). This can easily be done by modifying QuickSort to perform
only one recursive call. See Figure 3.1 for pseudo-code of the resulting algorithm.

Theorem 3.2.2. Given a set X of n numbers, and any integer k, the expected running time of QuickSelect(X, k)
is O(n).

Proof: Let X1 = X, and let Xi be the set of numbers in the ith level of the recursion. Let yi and ri be the random
pivot and its rank in Xi, respectively, in the ith iteration of the algorithm. Finally, let ni = |Xi|. Observe that
the probability that the pivot yi is in the “middle” of its subproblem is

α = P[ni/4 ≤ ri ≤ (3/4)ni] ≥ 1/2,

QuickSelect(T[1 : n], k)
// Input: An array T[1 : n] with n numbers, and a parameter k.
// Assume all numbers in T are distinct.
// Task: Return the kth smallest number in T.
y ← random element of T.
r ← rank of y in T.
if r = k then return y
T< ← array with all the elements of T that are < y
T> ← array with all the elements of T that are > y
// By assumption |T<| + |T>| + 1 = |T|.
if r < k then
return QuickSelect( T> , k − r )
else
return QuickSelect( T< , k )

Figure 3.1: QuickSelect pseudo-code.

and if this happens then

ni+1 ≤ max(ri − 1, ni − ri) ≤ (3/4)ni.

We conclude that

E[ni+1 | ni] ≤ P[yi in the middle] · (3/4)ni + P[yi not in the middle] · ni ≤ α(3/4)ni + (1 − α)ni = ni(1 − α/4) ≤ ni(1 − (1/2)/4) = (7/8)ni.

Now, setting mi = E[ni], we have that

mi+1 = E[ni+1] = E[E[ni+1 | ni]] ≤ E[(7/8)ni] = (7/8) E[ni] = (7/8)mi ≤ (7/8)^i m1 = (7/8)^i n,

since for any two random variables we have that E[X] = E[E[X | Y]]. In particular, the expected running time
of QuickSelect is proportional to

E[∑_i ni] = ∑_i E[ni] ≤ ∑_i mi ≤ ∑_i (7/8)^{i−1} n = O(n),

as desired. ■

Chapter 4

Chebychev, Sampling and Selection

During a native rebellion in German East Africa, the Imperial Ministry in Berlin issued the following order to
its representatives on the ground: The natives are to be instructed that on pain of harsh penalties, every rebellion
must be announced, in writing, six weeks before it breaks out.

Dead Funny: Humor in Hitler’s Germany, Rudolph Herzog

4.1. Chebyshev’s inequality


4.1.1. Example: A better inequality via moments
P
Let Xi ∈ {−1, +1} with probability half for each value, for i = 1, . . . , n (all picked independently). Let Y = i Xi .
We have that
  hX i X  
E Y = E Xi = E Xi = n · 0 = 0.
i i
A more interesting quantity is
h i hX 2 i hX X i X h i hX i X h i
EY =E Xi = E Xi2 + 2 Xi X j = E Xi + 2 E Xi X j = n + 2
2 2
E Xi X j
i i i< j i i< j i< j
X h i
=n+2 E[Xi ] E X j = n.
i< j

Lemma 4.1.1. Let Xi ∈ {−1, +1} with probability half for each value, for i = 1, . . . , n (all picked independently). We have that P[|∑_i Xi| > t√n] ≤ 1/t².

Proof: Let Y = ∑_i Xi and Z = Y². We have

P[|∑_i Xi| > t√n] = P[(∑_i Xi)² > t²n] = P[Y² > t² E[Y²]] = P[Z > t² E[Z]] ≤ 1/t²,

by Markov’s inequality. ■

4.1.2. Chebychev’s inequality


As a reminder, the variance of a random variable X is V[X] = E[(X − µX)²] = E[X²] − µX².

Theorem 4.1.2 (Chebyshev’s inequality). Let X be a real random variable, with µX = E[X], and σX = √V[X].
Then, for any t > 0, we have P[|X − µX| ≥ tσX] ≤ 1/t².

Proof: Set Y = (X − µX)², and observe that

σX² = V[X] = E[Y] = E[(X − µX)²] = E[X²] − µX².

As such, we have that

P[|X − µX| ≥ tσX] = P[(X − µX)² ≥ t²σX²] = P[Y ≥ t² E[Y]] ≤ 1/t²,

by Markov’s inequality. ■
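As a quick empirical companion to Lemma 4.1.1 (which is Chebyshev's inequality applied to the sum Y = ∑_i Xi), one can estimate the tail probability by simulation and compare it with the 1/t² bound. A sketch, with illustrative names:

```python
import random

def random_sign_sum(n, rng):
    """The sum of n independent uniform +/-1 random variables."""
    return sum(1 if rng.random() < 0.5 else -1 for _ in range(n))

def tail_estimate(n, t, runs, seed=1):
    """Empirical estimate of P[|X_1 + ... + X_n| > t * sqrt(n)], which
    Lemma 4.1.1 bounds by 1/t^2."""
    rng = random.Random(seed)
    threshold = t * n ** 0.5
    hits = sum(abs(random_sign_sum(n, rng)) > threshold for _ in range(runs))
    return hits / runs
```

The empirical tail is usually far below 1/t², since Chebyshev only uses the second moment; stronger tools (Chernoff-type bounds) appear later in the notes.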

4.2. Estimation via sampling


One of the big advantages of randomized algorithms is that they sample the world; that is, they learn what the
input looks like without reading all of it. For example, consider the following problem: We are given a set U
of n objects u1 , . . . , un, and we want to compute the number of elements of U that have some property. Assume
that one can check whether this property holds, in constant time, for a single object, and let ψ(u) be the function
that returns 1 if the property holds for the element u, and zero otherwise. Now, let Γ be the number of objects
in U that have this property. We want to reliably estimate Γ without computing the property for all the elements
of U.
A natural approach would be to pick a random sample R of m objects, r1 , . . . , rm, from U (with replacement),
and compute Y = ∑_{i=1}^m ψ(ri). The estimate for Γ is Z = (n/m)Y. It is natural to ask how far Z is from the true
value Γ.
Lemma 4.2.1. Let U be a set of n elements, with Γ of them having a certain property ψ. Let R be a uniform
random sample from U (with repetition) of size m, let Y be the number of elements in R that have the
property ψ, and let Z = (n/m)Y be the estimate for Γ. Then, for any t ≥ 1, we have that

P[Γ − t·n/(2√m) ≤ Z ≤ Γ + t·n/(2√m)] ≥ 1 − 1/t².

Similarly, we have that P[E[Y] − t√m/2 ≤ Y ≤ E[Y] + t√m/2] ≥ 1 − 1/t².

Proof: Let Yi = ψ(ri) be an indicator variable that is 1 if the ith sample ri has the property ψ, for i = 1, . . . , m.
Observe that

p = E[Yi] = Γ/n.

Consider the random variable Y = ∑_i Yi.
Variance of a binomial distribution. (I am including the following here as a way to remember this formula.) The
variable Y has a binomial distribution with probability p = Γ/n and m samples; that is, Y ∼ Bin(m, p). Thus, Y is the sum
of m independent indicator variables Y1 , . . . , Ym (i.e., each has a Bernoulli distribution), with E[Yi] = p,
and V[Yi] = E[Yi²] − E[Yi]² = p − p² = p(1 − p). Since the variance is additive for independent variables, we have
V[Y] = V[∑_i Yi] = ∑_{i=1}^m V[Yi] = mp(1 − p).

Thus, we have

E[Y] = mp = m · Γ/n = (m/n)Γ, and V[Y] = mp(1 − p).

The standard deviation of Y is

σY = √(mp(1 − p)) ≤ √m/2,

as √(p(1 − p)) is maximized for p = 1/2.
Consider the estimate Z = (n/m)Y for Γ, and observe that

E[Z] = E[(n/m)Y] = (n/m) E[Y] = (n/m)(m/n)Γ = Γ.

By Chebychev’s inequality, we have that P[|Y − E[Y]| ≥ tσY] ≤ 1/t². Since (n/m) E[Y] = E[Z] = Γ, this
implies that

P[|Z − Γ| ≥ t·n/(2√m)] = P[|Z − Γ| ≥ t · (n/m) · (√m/2)] ≤ P[|Z − Γ| ≥ t · (n/m) · σY]
= P[(n/m)|Y − E[Y]| ≥ t · (n/m) · σY] = P[|Y − E[Y]| ≥ tσY] ≤ 1/t². ■
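The estimator Z = (n/m)Y from Lemma 4.2.1 is a one-liner. A sketch (names are illustrative):

```python
import random

def estimate_count(universe, has_property, m, rng):
    """Estimate Gamma = #{u in universe : has_property(u)} from a sample
    of m elements drawn uniformly with replacement: Z = (n/m) * Y."""
    n = len(universe)
    y = sum(1 for _ in range(m) if has_property(rng.choice(universe)))
    return n * y / m
```

By the lemma, the estimate Z is within t·n/(2√m) of the true count Γ with probability at least 1 − 1/t², for any t ≥ 1.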

4.3. Randomized selection – Using sampling to learn the world


4.3.1. Inverse estimation
We are given a set U = {u1 , . . . , un} of n distinct numbers. Let U⟨i⟩ denote the ith smallest number in U – that is,
U⟨i⟩ is the number of rank i in U.
Lemma 4.3.1. Given a set U of n numbers, a number k, and parameters t ≥ 1 and m ≥ 1, one can compute, in
O(m log m) time, two numbers r−, r+ ∈ U, such that:
(A) The number of rank k in U is in the interval I = [r−, r+].
(B) There are at most 8tn/√m numbers of U in I.
The above two properties hold with probability ≥ 1 − 3/t².
(Namely, as t increases, the interval I becomes bigger, and the probability that it contains the desired element
increases.)

Proof: (A) Compute a random sample R of U of size m in O(m) time (assuming the input numbers are given
in an array, say). Next, sort the numbers of R in O(m log m) time. Let

ℓ− = ⌊(k/n)m − t√m/2⌋ − 1 and ℓ+ = ⌈(k/n)m + t√m/2⌉ + 1.

Set r− = R[ℓ−] and r+ = R[ℓ+].
Let Y be the number of elements in the sample R that are ≤ U⟨k⟩. By Lemma 4.2.1, we have

P[E[Y] − t√m/2 ≤ Y ≤ E[Y] + t√m/2] ≥ 1 − 1/t².

In particular, if this happens, then r− ≤ U⟨k⟩ ≤ r+.

(B) Let g = k − tn/√m − 3n/m, and let gR be the number of elements in R that are smaller than U⟨g⟩. Arguing as
above, we have that P[gR ≤ (g/n)m + t√m/2] ≥ 1 − 1/t². Now

(g/n)m + t√m/2 = (k − tn/√m − 3n/m)(m/n) + t√m/2 = (km/n) − t√m − 3 + t√m/2 = (km/n) − t√m/2 − 3 < ℓ−.

This implies that the g smallest numbers in U are outside the interval [r−, r+] with probability ≥ 1 − 1/t².
Next, let h = k + tn/√m + 3n/m. A similar argument shows that the n − h largest numbers in U are too large
to be in [r−, r+]. This implies that

|[r−, r+] ∩ U| ≤ h − g + 1 = 2tn/√m + 6n/m + 1 ≤ 8tn/√m,

for t ≥ 1 and m ≤ n. ■

4.3.1.1. Inverse estimation – intuition


Here we try to give some intuition for the proof of the previous lemma. Feel free to skip this part if you
feel you already understand what is going on.
Given k, we are interested in estimating sk = U⟨k⟩ quickly. So, let us take a sample R of size m. Let R≤sk be
the set of all the numbers in R that are ≤ sk. For Y = |R≤sk|, we have that µ = E[Y] = m·k/n. Furthermore, for any
t ≥ 1, Lemma 4.2.1 implies that P[µ − t√m/2 ≤ Y ≤ µ + t√m/2] ≥ 1 − 1/t². In particular, with probability
≥ 1 − 1/t², the number r− = R⟨ℓ−⟩, for ℓ− = ⌊µ − t√m/2⌋ − 1, is smaller than sk, and similarly, the number
r+ = R⟨ℓ+⟩ of rank ℓ+ = ⌈µ + t√m/2⌉ + 1 in R is larger than sk.
One can conceptually think about the interval I(k) = [r−, r+] as a confidence interval – we know that
sk ∈ I(k) with probability ≥ 1 − 1/t². But how heavy is this interval? Namely, how many elements are there in
I(k) ∩ U?
To this end, consider the interval of ranks, in the sample, that might contain the kth element. By the above,
this is I(k, t) = [k(m/n) − t√m/2 − 1, k(m/n) + t√m/2 + 1]. In particular, consider the maximum ν ≤ k such that
I(ν, t) and I(k, t) are disjoint. We have the condition that ν(m/n) + t√m/2 + 1 ≤ k(m/n) − t√m/2 − 1, which holds
if ν ≤ k − tn/√m − 2n/m. So let g = k − tn/√m − 2n/m and h = k + tn/√m + 2n/m. It is easy to verify (using the
same argumentation as above) that, with probability at least 1 − 3/t², the three confidence intervals I(g), I(k)
and I(h) do not intersect. As such, we have |I(k) ∩ U| ≤ h − g ≤ 2tn/√m + 4n/m.

4.3.2. Randomized selection


4.3.2.1. The algorithm
l m
Given an array S of n numbers, and a rank k, the algorithm needs to compute S⟨k⟩ . To this end, set t = ⌈n^{1/8}⌉
and m = ⌈n^{3/4}⌉.
Using the algorithm of Lemma 4.3.1, in O(m log m) time, we get two numbers r− and r+ , such that S⟨k⟩ ∈
[r− , r+ ], and

|S ∩ (r− , r+ )| = O(tn/√m) = O(n^{1/8} · n/n^{3/8}) = O(n^{3/4}).

To this end, we break S into three sets:


(i) S < = {s ∈ S | s ≤ r− },
(ii) S m = {s ∈ S | r− < s < r+ },
(iii) S > = {s ∈ S | r+ ≤ s}.
This three-way partition can be done using 2n comparisons, in linear time. We can now readily compute
the rank of r− in S (it is |S< |) and the rank of r+ in S (it is |S< | + |Sm | + 1). If the rank of r− in S is larger than
k, or the rank of r+ in S is smaller than k, then the algorithm failed. The other possibility for failure is that Sm
is too large – that is, larger than 8tn/√m = O(n^{3/4}). If any of these failures happen, then we rerun this
algorithm from scratch.
Otherwise, the algorithm needs to compute the element of rank k − |S< | in the set Sm , and this can be done
in O(|Sm | log |Sm |) = O(n^{3/4} log n) time by sorting.
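The algorithm above can be sketched in Python. The function name `rank_select` and the exact constants are mine, and the sample here is drawn with replacement, which is fine for a sketch:

```python
import math
import random

def rank_select(S, k):
    """Return the element of rank k in S (1-based), by sampling.

    Sketch of the algorithm above: sample m ~ n^{3/4} elements, sort the
    sample, read off r_minus and r_plus around the expected rank of S<k>,
    partition S, and finish by sorting the (small) middle set."""
    n = len(S)
    t, m = math.ceil(n ** 0.125), math.ceil(n ** 0.75)
    while True:
        R = sorted(random.choices(S, k=m))      # sample (with replacement)
        mu = k * m / n                          # expected rank of S<k> in R
        lo = max(0, math.floor(mu - t * math.sqrt(m) / 2) - 1)
        hi = min(m - 1, math.ceil(mu + t * math.sqrt(m) / 2) + 1)
        r_minus, r_plus = R[lo], R[hi]
        S_less = [s for s in S if s < r_minus]
        S_mid = [s for s in S if r_minus <= s <= r_plus]
        # success: S<k> landed in [r_minus, r_plus] and S_mid is small
        if len(S_less) < k <= len(S_less) + len(S_mid) \
                and len(S_mid) <= 8 * t * n / math.sqrt(m):
            return sorted(S_mid)[k - len(S_less) - 1]
        # otherwise the sample was unlucky -- rerun from scratch
```

Note that the retry loop only returns once the rank checks succeed, so the answer, when produced, is always correct; only the running time is random.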

4.3.2.2. Analysis
The correctness is easy – the algorithm clearly returns the desired element. As for the running time, observe
that by Lemma 4.3.1, with probability ≥ 1 − 1/n^{1/4}, we succeed in the first try, and then the running time is
O(n + m log m) = O(n). More generally, the probability that the algorithm fails in each of the first α tries to get a
good interval [r− , r+ ] is at most 1/n^{α/4}.
One can slightly improve the number of comparisons performed by the algorithm using the following
modifications.

Lemma 4.3.2. Given the numbers r− , r+ , one can compute the sets S< , Sm , S> using, in expectation, only
1.5n + O(n^{3/4}) comparisons.

Proof: We need to compute the sets S< , Sm , S> . Namely, we need to compare all the numbers of S to r− and r+ .
Since only O(n^{3/4}) numbers fall in Sm , almost all of the numbers are in either S< or S> . If a number is in S<
(resp. S> ), then comparing it to r− (resp. r+ ) is enough to verify that this is indeed the case. Otherwise, we perform
the other comparison and put the element in its proper set (in this case we had to perform two comparisons to
handle the element).
So let us guess, by a coin flip, for each element of S , whether it is in S< or S> . If we are right, then
the algorithm would require only one comparison to put them into the right set. Otherwise, it would need two
comparisons. Let X s be the random variable that is the number of comparisons used by this algorithm for an
element s ∈ S . We have that if s ∈ S < ∪ S > then E[X s ] = 1(1/2) + 2(1/2) = 3/2. If s ∈ S m then both
comparisons will be performed, and thus E[X s ] = 2 in this case.
Thus, the expected number of comparisons for all the elements of S , by linearity of expectation, is

(3/2)(n − |Sm |) + 2|Sm | = (3/2)n + |Sm |/2. ■
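The coin-flip trick of the lemma can be sketched as follows (the function name is mine); only elements that land in Sm , or whose coin came up wrong, pay for a second comparison:

```python
import random

def partition_count(S, r_minus, r_plus):
    """Partition S into S_less = {s <= r_minus}, S_mid = {r_minus < s < r_plus},
    S_greater = {s >= r_plus}, guessing each element's side by a coin flip
    so that a correctly guessed element costs only one comparison.
    Returns the three sets and the total number of comparisons made."""
    S_less, S_mid, S_greater, comps = [], [], [], 0
    for s in S:
        if random.random() < 0.5:       # guess: s lands in S_less
            comps += 1
            if s <= r_minus:
                S_less.append(s)        # guess was right: one comparison
                continue
            comps += 1                  # guess was wrong: second comparison
            (S_mid if s < r_plus else S_greater).append(s)
        else:                           # guess: s lands in S_greater
            comps += 1
            if s >= r_plus:
                S_greater.append(s)
                continue
            comps += 1
            (S_mid if s > r_minus else S_less).append(s)
    return S_less, S_mid, S_greater, comps
```

The number of comparisons is always between n and 2n, and in expectation it is (3/2)n + |Sm|/2, as in the lemma.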

Theorem 4.3.3. Given an array S with n numbers and a rank k, one can compute the element of rank k
in S in expected linear time. Formally, the resulting algorithm performs in expectation 1.5n + O(n^{3/4} log n)
comparisons.

Proof: Let X be the random variable that is the number of iterations until the interval is good. We have that X
is a geometric variable with probability of success p ≥ 1 − 1/n^{1/4}. As such, the expected number of rounds till
success is 1/p ≤ 1 + 2/n^{1/4}. As such, the expected number of comparisons performed by the algorithm is

E[X] · (1.5n + O(n^{3/4} log n)) = 1.5n + O(n^{3/4} log n). ■

Chapter 5

Verifying Identities, and Some Complexity

The events of September 8 prompted Foch to draft the later legendary signal: “My centre is giving way, my
right is in retreat, situation excellent. I attack.” It was probably never sent.

John Keegan, The first world war


5.1. Verifying equality
5.1.1. Vectors
You are given two binary vectors v = (v1 , . . . , vn ), u = (u1 , . . . , un ) ∈ {0, 1}ⁿ and you would like to decide if
they are equal or not. Unfortunately, the only access you have to the two vectors is via a black-box that enables
you to compute the dot-product of two binary vectors over ℤ₂. Formally, given two binary vectors as above,
their dot-product is ⟨v, u⟩ = Σᵢ₌₁ⁿ vᵢuᵢ (which is a non-negative integer number). Their dot product modulo 2 is
⟨v, u⟩ mod 2 (i.e., it is 1 if ⟨v, u⟩ is odd and 0 otherwise).
Naturally, we could use the black-box to read the vectors (using 2n calls), but since we are interested
only in deciding if they are equal or not, this should require fewer calls to the black-box (which is expensive).
 n
Lemma 5.1.1. Given two binary vectors v, u ∈ {0, 1}ⁿ, a randomized algorithm can, using two computations
of dot-product modulo 2, decide if v is equal to u or not. The algorithm may return one of the following two
values:
≠: Then v ≠ u.
=: Then the probability that the algorithm made a mistake (i.e., the vectors are different) is at most 1/2.
The running time of the algorithm is O(n + B(n)), where B(n) is the time to compute a single dot-product of
vectors of length n.
Proof: Pick a random vector r = (r1 , . . . , rn ) ∈ {0, 1}ⁿ by picking each coordinate independently with probability
1/2. Compute the two dot-products ⟨v, r⟩ and ⟨u, r⟩.
(A) If ⟨v, r⟩ ≡ ⟨u, r⟩ (mod 2) ⇒ the algorithm returns ‘=’.
(B) If ⟨v, r⟩ ≢ ⟨u, r⟩ (mod 2) ⇒ the algorithm returns ‘≠’.
Clearly, if ‘≠’ is returned then v ≠ u.
So, assume that the algorithm returned ‘=’ but v ≠ u. For the sake of simplicity of exposition, assume that
they differ on the nth bit: un ≠ vn . We then have that

α = ⟨v, r⟩ = Σᵢ₌₁ⁿ⁻¹ vᵢrᵢ + vn rn = α′ + vn rn    and    β = ⟨u, r⟩ = Σᵢ₌₁ⁿ⁻¹ uᵢrᵢ + un rn = β′ + un rn .

Now, there are two possibilities:
(A) If α′ ≢ β′ (mod 2), then, with probability half, we have rn = 0, and as such α ≢ β (mod 2).
(B) If α′ ≡ β′ (mod 2), then, with probability half, we have rn = 1, and as such α ≢ β (mod 2).
As such, with probability at most half, the algorithm would fail to discover that the two vectors are different. ■
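A sketch of one round of this test in Python, where `dot_mod2` plays the role of the black-box (both function names are mine):

```python
import random

def dot_mod2(v, u):
    """The black-box: dot-product of two 0/1 vectors, modulo 2."""
    return sum(vi * ui for vi, ui in zip(v, u)) % 2

def probably_equal(v, u):
    """One round of the test. Returns False only when v != u for certain;
    True may be wrong, but only with probability at most 1/2."""
    r = [random.randint(0, 1) for _ in v]
    return dot_mod2(v, r) == dot_mod2(u, r)
```

Repeating the round drives the one-sided error down geometrically, which is exactly the amplification argument that follows.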

5.1.1.1. Amplification
Of course, this is not a satisfying algorithm – it returns the correct answer only with probability half if the
vectors are different. So, let us run the algorithm t times. Let T 1 , . . . , T t be the returned values from all these
executions. If any of the t executions returns that the vectors are different, then we know that they are different.
   
P[Algorithm fails] = P[v ≠ u, but all t executions return ‘=’]
= P[T1 = ‘=’ ∩ T2 = ‘=’ ∩ · · · ∩ Tt = ‘=’]
= P[T1 = ‘=’] P[T2 = ‘=’] · · · P[Tt = ‘=’] ≤ ∏ᵢ₌₁ᵗ (1/2) = 1/2ᵗ.

We thus get the following result.


Lemma 5.1.2. Given two binary vectors v, u ∈ {0, 1}ⁿ and a confidence parameter δ > 0, a randomized
algorithm can decide if v is equal to u or not. More precisely, the algorithm may return one of the two
following results:
≠: Then v ≠ u.
=: Then, with probability ≥ 1 − δ, we have v = u.
The running time of the algorithm is O((n + B(n)) ln δ⁻¹), where B(n) is the time to compute a single dot-product
of two vectors of length n.

Proof: Follows from the above by setting t = ⌈lg(1/δ)⌉. ■

5.1.2. Matrices
Given three binary matrices B, C, D of size n × n, we are interested in deciding if BC = D. Computing BC is
expensive – the fastest known (theoretical!) algorithm has running time (roughly) O(n^{2.37}). On the other hand,
multiplying such a matrix with a vector r (modulo 2, as usual) takes only O(n²) time (and this algorithm is
simple).
 n×n
Lemma 5.1.3. Given three binary matrices B, C, D ∈ 0, 1 and a confidence parameter δ > 0, a randomized
algorithm can decide if BC = D or not. More precisely the algorithm can return one of the following two
results:
,: Then BC , D.
=: Then BC = D with probability ≥1 − δ. 
The running time of the algorithm is O n2 log δ−1 .

Proof: Compute a random vector r = (r1 , . . . , rn ), and compute the quantity x = BCr = B(Cr) in O(n2 ) time,
using the associative property of matrix multiplication. Similarly, compute y = Dr. Now, if x , y then return
‘=’. l m
Now, we execute this algorithm t = lg δ−1 times. If all of these independent runs return that the matrices
are equal then return ‘=’.

The algorithm fails only if BC ≠ D; in that case, assume that the ith rows of the matrices BC and D are different.
The probability that the algorithm would not detect that these rows are different is at most 1/2, by Lemma 5.1.1.
As such, the probability that all t runs fail is at most 1/2ᵗ ≤ δ, as desired. ■
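This is Freivalds' technique; here is a sketch in Python over ℤ₂, never computing the product BC itself (function names are mine):

```python
import random

def mat_vec_mod2(M, r):
    """Multiply an n x n 0/1 matrix by a 0/1 vector, modulo 2: O(n^2) time."""
    return [sum(row[j] * r[j] for j in range(len(r))) % 2 for row in M]

def verify_product(B, C, D, t=20):
    """Return '!=' if BC != D (over Z_2) for certain; otherwise return '=',
    which is wrong with probability at most 1/2^t. Each round is O(n^2)."""
    n = len(B)
    for _ in range(t):
        r = [random.randint(0, 1) for _ in range(n)]
        x = mat_vec_mod2(B, mat_vec_mod2(C, r))   # B(Cr), never forming BC
        y = mat_vec_mod2(D, r)
        if x != y:
            return '!='
    return '='
```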

5.1.3. Checking identity for polynomials


5.1.3.1. The Schwartz–Zippel lemma

Let F be a field (e.g., the real numbers). Let F[x1 , . . . , xn ] denote the set of polynomials over the n variables
x1 , . . . , xn over F. Such a polynomial is a sum of monomials, where a monomial has the form c · x1^{i1} · x2^{i2} · · · xn^{in},
where c ∈ F. The degree of this monomial is i1 + i2 + · · · + in . Thus, a polynomial of degree d over n variables
has potentially up to roughly nᵈ monomials (the exact bound is messier).
For a polynomial f ∈ F[x1 , . . . , xn ], the zero set of f is the set Z_f = {(x1 , . . . , xn ) | f(x1 , . . . , xn ) = 0}.
Intuitively, Z_f = Fⁿ only if f is the zero polynomial, but otherwise (i.e., f ≠ 0) Z_f should be much “smaller”.
Specifically, a polynomial in a single variable of degree d is either zero everywhere, or has at most d roots
(i.e., |Z f | ≤ d). This is known as the fundamental theorem of algebra¬ . The picture gets much messier once
one deals with multi-variate polynomials, but fortunately there is a simple and elegant lemma that bounds the
number of zeros if we pick the values from the right set of values.
Lemma 5.1.4 (Schwartz–Zippel). Let f ∈ F[x1 , . . . , xn ] be a non-zero polynomial of total degree d ≥ 0, over a
field F. Let S ⊆ F be finite. Let r = (r1 , . . . , rn ) be randomly and uniformly chosen from Sⁿ. Then

P[ f(r) = 0 ] ≤ d/|S|.

Equivalently, we have |Z_f ∩ Sⁿ| ≤ d|S|ⁿ⁻¹.
Proof: The proof is by induction on n. For n = 1, by the theorem, formally known as the fundamental theorem
of algebra, |Z_f ∩ S| ≤ |Z_f| ≤ d. So assume the theorem holds for n − 1. Since f is non-zero, it can be written as
a sum of polynomials in n − 1 variables. That is, f can be written as

f(x1 , . . . , xn ) = Σᵢ₌₀ᵈ x1ⁱ fᵢ(x2 , . . . , xn ),

where degree(fᵢ) ≤ d − i. Since f is not zero, one of the fᵢ must be non-zero; let i be the maximum value such
that fᵢ ≠ 0.
Now, we randomly choose the values r2 , . . . , rn ∈ S (independently and uniformly), and consider the
polynomial in the single variable x1 , which is

g(x1) = Σⱼ₌₀ᵈ fⱼ(r2 , . . . , rn ) x1ʲ.

Let F be the event that fᵢ(r2 , . . . , rn ) = 0, and let G be the event that g(r1) = 0, where r1 is also chosen uniformly
from S. By induction, we have P[F] ≤ (d − i)/|S|, as degree(fᵢ) ≤ d − i. More interestingly, if F does not happen,
then degree(g) = i. As such, by the single-variable case, we have that

P[G | F̄] = P[ g(r1) = 0 | F̄ ] ≤ i/|S|.
|S |
¬
Wikipedia notes that the proof is not algebraic, and it is definitely not fundamental to modern algebra. So maybe it should be
cited as “the theorem formerly known as the fundamental theorem of algebra”.

We conclude that

P[ f(r) = 0 ] = P[G ∩ F] + P[G ∩ F̄] ≤ P[F] + P[G | F̄] ≤ (d − i)/|S| + i/|S| = d/|S|. ■

Remark 5.1.5. Consider the polynomial f(x, y) = (x − 1)² + (y − 1)² − 1. The zero set of this polynomial is
the unit circle, so Z_f is infinite in this case. However, note that for any choice of S, the set S² is a
grid. The Schwartz–Zippel lemma tells us that relatively few grid points are in the zero set.

5.1.3.2. Applications
5.1.3.2.1. Checking if a polynomial is the zero polynomial. Let f be a polynomial of degree d, with n
variables, over the reals, that can be evaluated in O(T) time. One can check if f is zero by picking n random
numbers from S = ⟦d³⟧. By Lemma 5.1.4, the probability that a non-zero f evaluates to zero on the chosen values
is ≤ d/d³ = 1/d², which is a small number. As above, we can use amplification to get a high-confidence answer.

5.1.3.2.2. Checking if two polynomials are equal. Given two polynomials f and g, one can now check if
they are equal by checking if f(r) = g(r), for some random input. The proof of correctness follows from the
above, as one interprets the algorithm as checking if f − g is the zero polynomial.

5.1.3.2.3. Verifying polynomials product. Given three polynomials f, g, and h, one can now check if f g = h.
Again, one randomly picks a value r, and checks if f(r)g(r) = h(r). The proof of correctness follows from the
above, as one interprets the algorithm as checking if f g − h is the zero polynomial.
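All three applications reduce to evaluating a candidate zero polynomial at a random point of Sⁿ. A sketch in Python (the interface, with polynomials given as callables and a known degree bound, is an assumption of mine):

```python
import random

def polys_probably_equal(f, g, d, n_vars, trials=10):
    """Schwartz-Zippel identity test. f and g are callables on n_vars
    arguments, promised to be polynomials of total degree <= d. A 'False'
    answer is always correct; a single trial misses a difference with
    probability <= d/|S| = 1/d^2 here, by Lemma 5.1.4."""
    S = range(1, d ** 3 + 1)               # |S| = d^3, as suggested above
    for _ in range(trials):
        r = [random.choice(S) for _ in range(n_vars)]
        if f(*r) != g(*r):
            return False                   # f - g is not the zero polynomial
    return True
```

Verifying a product f·g = h is the same call with the left side being the callable lambda x: f(x) * g(x).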

5.1.4. Checking if a bipartite graph has a perfect matching


Let G = (L ∪ R, E) be a bipartite graph. Let L = {u1 , . . . , un } and R = {v1 , . . . , vn }. Consider the set of variables

V = { xᵢ,ⱼ | uᵢvⱼ ∈ E }.

Let M be an n × n matrix, where M[i, j] = 0 if uᵢvⱼ ∉ E, and M[i, j] = xᵢ,ⱼ otherwise. Let Π be the set of all
permutations of ⟦n⟧.
A perfect matching is a permutation π : ⟦n⟧ → ⟦n⟧, such that for all i, we have uᵢvπ(i) ∈ E. For such a
permutation π, consider the monomial

fπ = sign(π) ∏ᵢ₌₁ⁿ M[i, π(i)],

where sign(π) is the sign of the permutation (it is either −1 or +1 – for our purpose here we do not care about
the exact definition of this quantity). It is either a polynomial of degree exactly n, or it is zero. Furthermore,
observe that for any two different permutations π, σ ∈ Π, we have that if fπ and fσ are both non-zero, then
fπ ≠ fσ and fπ ≠ −fσ .
Consider the following “crazy” polynomial over the set of variables V:

ψ = ψ(V) = det(M) = Σ_{π∈Π} fπ .

If there is a perfect matching in G, then there is a permutation π such that fπ ≠ 0. But this implies that ψ ≠ 0
(since it has a non-zero monomial, and the monomials cannot cancel each other).
In the other direction, if there is no perfect matching in G, then fπ = 0 for all permutations π. This implies
that ψ = 0. Thus, deciding if G has a perfect matching is equivalent to deciding if ψ ≠ 0. The polynomial ψ is
defined via the determinant of a matrix that has variables as some of its entries (and zeros otherwise). By the above,
all we need to do is to evaluate ψ over some random values. If we use exact arithmetic, we would just pick a
random number in [0, 1] for each variable, and evaluate ψ for these values of the variables. Namely, we fill
the matrix M with values (so it is all numbers now), and we need to compute its determinant. Via Gaussian
elimination, the determinant can be computed in cubic time. Thus, we can evaluate ψ in cubic time, which
implies that with high probability we can check if G has a perfect matching.
If we do not want to be prisoners of the impreciseness of floating point arithmetic, then one can perform
the above calculations over some finite field (usually, this simply means working modulo a prime number).
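A sketch of the whole scheme in Python, working modulo a prime as suggested above (the prime choice and the function names are mine; note the one-sided error: a ‘yes’ answer is always correct):

```python
import random

P = (1 << 31) - 1        # a convenient prime; we work over the field Z_P

def det_mod_p(M):
    """Determinant of a square matrix over Z_P, by Gaussian elimination,
    in O(n^3) time."""
    M = [row[:] for row in M]
    n, det = len(M), 1
    for c in range(n):
        piv = next((r for r in range(c, n) if M[r][c]), None)
        if piv is None:
            return 0                      # a zero column: singular matrix
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            det = -det                    # a row swap flips the sign
        det = det * M[c][c] % P
        inv = pow(M[c][c], P - 2, P)      # inverse via Fermat's little theorem
        for r in range(c + 1, n):
            f = M[r][c] * inv % P
            for j in range(c, n):
                M[r][j] = (M[r][j] - f * M[c][j]) % P
    return det % P

def has_perfect_matching(edges, n, trials=5):
    """edges: set of pairs (i, j) meaning u_i v_j is an edge (0-based).
    'True' is always correct; 'False' can be wrong, but only with
    probability <= (n/P)^trials, by the Schwartz-Zippel lemma."""
    for _ in range(trials):
        M = [[random.randrange(1, P) if (i, j) in edges else 0
              for j in range(n)] for i in range(n)]
        if det_mod_p(M) != 0:
            return True
    return False
```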

5.2. Las Vegas and Monte Carlo algorithms


Definition 5.2.1. A Las Vegas algorithm is a randomized algorithm that always returns the correct result. The
only variability is in its running time, which might change between executions.

An example for a Las Vegas algorithm is the QuickSort algorithm.


Definition 5.2.2. A Monte Carlo algorithm is a randomized algorithm that might output an incorrect result.
However, the probability of error can be diminished by repeated executions of the algorithm.

The matrix product verification algorithm above is an example of a Monte Carlo algorithm.

5.2.1. Complexity Classes


I assume people know what Turing machines, NP, NPC, RAM machines, the uniform model, the logarithmic
model, PSPACE, and EXP are. If you do not know what these are, you should read about them. Some
of this is covered in the randomized algorithms book, and some of it is covered in any basic text on
complexity theory­.
Definition 5.2.3. The class P consists of all languages L that have a polynomial time algorithm Alg, such that
for any input x ∈ Σ∗ , we have
(A) x ∈ L ⇒ Alg(x) accepts,
(B) x ∉ L ⇒ Alg(x) rejects.

Definition 5.2.4. The class NP consists of all languages L that have a polynomial time algorithm Alg, such
that for any input x ∈ Σ∗ , we have:
(i) If x ∈ L, then ∃y ∈ Σ∗ such that Alg(x, y) accepts, where |y| (i.e., the length of y) is bounded by a polynomial in
|x|.
(ii) If x ∉ L, then ∀y ∈ Σ∗ , Alg(x, y) rejects.

Definition 5.2.5. For a complexity class C, we define the complementary class co−C as the set of languages
whose complement is in the class C. That is

co−C = { L̄ | L ∈ C },

where L̄ = Σ∗ \ L.
­
There is also the internet.

It is obvious that P = co−P and P ⊆ NP ∩ co−NP. (It is currently unknown if P = NP ∩ co−NP or whether
NP = co−NP, although both statements are believed to be false.)

Definition 5.2.6. The class RP (for Randomized Polynomial time) consists of all languages L that have a
randomized algorithm Alg with worst case polynomial running time such that for any input x ∈ Σ∗ , we have
(i) If x ∈ L then P[Alg(x) accepts] ≥ 1/2.
(ii) If x ∉ L then P[Alg(x) accepts] = 0.

An RP algorithm is a Monte Carlo algorithm, but this algorithm can make a mistake only if x ∈ L. As such,
co−RP is the class of all languages that have a Monte Carlo algorithm that makes a mistake only if x ∉ L. A problem
which is in RP ∩ co−RP has an algorithm that does not make a mistake, namely a Las Vegas algorithm.

Definition 5.2.7. The class ZPP (for Zero-error Probabilistic Polynomial time) is the class of languages that
have a Las Vegas algorithm that runs in expected polynomial time.

Definition 5.2.8. The class PP (for Probabilistic Polynomial time) is the class of languages that have a ran-
domized algorithm Alg, with worst case polynomial running time, such that for any input x ∈ Σ∗ , we have
(i) If x ∈ L then P[Alg(x) accepts] > 1/2.
(ii) If x ∉ L then P[Alg(x) accepts] < 1/2.

The class PP is not very useful. Why?

Exercise 5.2.9. Provide a PP algorithm for 3SAT.

Consider the mind-bogglingly stupid randomized algorithm that returns either yes or no, each with probability half.
This algorithm is almost in PP, as it returns the correct answer with probability half. An algorithm in PP needs
to be slightly better, and be correct with probability strictly better than half; however, this advantage over half
can be arbitrarily small. In particular, there is no way to do effective amplification with such an algorithm.

Definition 5.2.10. The class BPP (for Bounded-error Probabilistic Polynomial time) is the class of languages
that have a randomized algorithm Alg with worst case polynomial running time such that for any input x ∈ Σ∗ ,
we have
(i) If x ∈ L then P[Alg(x) accepts] ≥ 3/4.
(ii) If x ∉ L then P[Alg(x) accepts] ≤ 1/4.

5.3. Bibliographical notes


Section 5.1 follows [MR95, Section 1.5].

References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.

Chapter 6

The Birthday Paradox, Occupancy and the Coupon Collector Problem

I built on the sand
And it tumbled down,
I built on a rock
And it tumbled down.
Now when I build, I shall begin
With the smoke from the chimney

Leopold Staff, Foundations


6.1. Some needed math
Lemma 6.1.1. For any positive integer n, we have:
(i) 1 + x ≤ eˣ and 1 − x ≤ e⁻ˣ, for all x.
(ii) (1 + 1/n)ⁿ ≤ e ≤ (1 + 1/n)ⁿ⁺¹.
(iii) (1 − 1/n)ⁿ ≤ 1/e ≤ (1 − 1/n)ⁿ⁻¹.
(iv) (n/e)ⁿ ≤ n! ≤ (n + 1)ⁿ⁺¹/eⁿ.
(v) For any k ≤ n, we have: (n/k)ᵏ ≤ C(n, k) ≤ (ne/k)ᵏ, where C(n, k) = n!/((n − k)! k!) denotes the binomial
coefficient.
Proof: (i) Let h(x) = eˣ − 1 − x. Observe that h′(x) = eˣ − 1, and h′′(x) = eˣ > 0, for all x. That is, h(x) is a
convex function. It achieves its minimum at h′(x) = 0 ⟹ eˣ = 1, which is true for x = 0. For x = 0, we have
that h(0) = e⁰ − 1 − 0 = 0. That is, h(x) ≥ 0 for all x, which implies that eˣ ≥ 1 + x; see Figure 6.1.
(ii, iii) Indeed, 1 + 1/n ≤ exp(1/n) and 1 − 1/n ≤ exp(−1/n), by (i). As such

(1 + 1/n)ⁿ ≤ exp(n(1/n)) = e    and    (1 − 1/n)ⁿ ≤ exp(n(−1/n)) = 1/e,

which implies the left sides of (ii) and (iii). These are equivalent to

1/e ≤ (n/(n + 1))ⁿ = (1 − 1/(n + 1))ⁿ    and    e ≤ (1 + 1/(n − 1))ⁿ,

which are the right side of (iii) [by replacing n + 1 by n], and the right side of (ii) [by replacing n by n + 1].
(iv) Indeed,

nⁿ/n! ≤ Σᵢ₌₀^∞ nⁱ/i! = eⁿ,
[Figure 6.1: the graphs of y = eˣ and y = 1 + x.]

by the Taylor expansion of eˣ = Σᵢ₌₀^∞ xⁱ/i!. This implies that (n/e)ⁿ ≤ n!, as required.
As for the right-hand side, the claim holds for n = 0 and n = 1. Let f(n) = (n + 1)ⁿ⁺¹/eⁿ, and observe that
by (ii), we have

f(n)/f(n − 1) = ((n + 1)ⁿ⁺¹/eⁿ) / (nⁿ/eⁿ⁻¹) = (n/e) · ((n + 1)/n)ⁿ⁺¹ = (n/e)(1 + 1/n)ⁿ⁺¹ ≥ (n/e) · e = n.

Thus, we have

(n + 1)ⁿ⁺¹/eⁿ = f(n) = (f(n)/f(n − 1)) · (f(n − 1)/f(n − 2)) · · · (f(1)/f(0)) · f(0) ≥ n · (n − 1) · · · 1 = n!.

(v) For any k ≤ n, we have n/k ≤ (n − 1)/(k − 1), since kn − n = n(k − 1) ≤ k(n − 1) = kn − k. As such,
n/k ≤ (n − i)/(k − i), for 1 ≤ i ≤ k − 1. As such,

(n/k)ᵏ ≤ (n/k) · ((n − 1)/(k − 1)) · · · ((n − i)/(k − i)) · · · ((n − k + 1)/1) = n!/((n − k)! k!) = C(n, k).

As for the other direction, we have C(n, k) ≤ nᵏ/k! ≤ nᵏ/(k/e)ᵏ = (ne/k)ᵏ, by (iv). ■

6.2. The birthday paradox


Consider a group of n people, and assume their birthdays are uniformly distributed over the dates in the year (this
assumption is not quite true, but close enough). We are interested in the question of how large n has to be till
we get a collision – that is, two people with the same birthday. Intuitively, since the year has m = 364 days, the
probability of a person to land on a specific birthday is p = 1/364. So the natural guess would be that n needs to
be approximately 364. Surprisingly, the answer is much smaller.

Lemma 6.2.1. Let X1 , . . . , Xn be n variables picked uniformly, randomly and independently from ⟦m⟧ = {1, . . . , m}.
Then, the expected number of collisions is C(n, 2)/m, where C(n, 2) = n(n − 1)/2 is the number of pairs.

Proof: Let Yᵢ,ⱼ be the indicator of the event Xᵢ = Xⱼ. We have that E[Yᵢ,ⱼ] = P[Yᵢ,ⱼ = 1] = 1/m. Thus, the expected
number of collisions is

E[ Σᵢ₌₁ⁿ⁻¹ Σⱼ₌ᵢ₊₁ⁿ Yᵢ,ⱼ ] = Σᵢ₌₁ⁿ⁻¹ Σⱼ₌ᵢ₊₁ⁿ E[Yᵢ,ⱼ] = C(n, 2) · (1/m). ■

As such, for birthdays, for m = 364 and n = 28, we have that the expected number of collisions is

C(28, 2)/364 = 378/364 > 1.

This seems weird, but is it the truth?
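It is easy to check empirically; the following short simulation (the function name is mine) estimates the expected number of colliding pairs for n = 28 and m = 364:

```python
import random

def avg_collisions(n, m, trials=5000):
    """Empirical mean of the number of colliding pairs among n draws,
    uniform from {1, ..., m}; Lemma 6.2.1 predicts C(n, 2)/m."""
    total = 0
    for _ in range(trials):
        days = [random.randrange(m) for _ in range(n)]
        total += sum(days[i] == days[j]
                     for i in range(n) for j in range(i + 1, n))
    return total / trials
```

The empirical average lands close to 378/364 ≈ 1.04, matching the lemma.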

Lemma 6.2.2. Let X1 , . . . , Xn be n variables picked uniformly, randomly and independently from ⟦m⟧ = {1, . . . , m}.
Then, the probability that no collision happened is at most exp(−C(n, 2)/m).

Proof: Let Eᵢ be the event that Xᵢ is distinct from all the values X1 , . . . , Xᵢ₋₁. Let Bᵢ = ∩ₖ₌₁ⁱ Eₖ = Bᵢ₋₁ ∩ Eᵢ be
the event that all of X1 , . . . , Xᵢ are distinct. Clearly, we have

P[Eᵢ | Bᵢ₋₁] = P[Eᵢ | E1 ∩ · · · ∩ Eᵢ₋₁] = (m − (i − 1))/m = 1 − (i − 1)/m ≤ exp(−(i − 1)/m).

Observe that

P[Bᵢ] = P[Bᵢ₋₁] P[Eᵢ | Bᵢ₋₁] ≤ exp(−(i − 1)/m) P[Bᵢ₋₁] ≤ ∏ₖ₌₁ⁱ exp(−(k − 1)/m)
= exp(−Σₖ₌₁ⁱ (k − 1)/m) = exp(−i(i − 1)/(2m)) = exp(−C(i, 2)/m),

which implies the desired claim for i = n. ■

6.3. Occupancy Problems


Problem 6.3.1. We are throwing m balls into n bins randomly (i.e., for every ball we randomly and uniformly
pick a bin from the n available bins, and place the ball in the bin picked). There are many natural questions one
can ask here:
(A) What is the maximum number of balls in any bin?
(B) What is the number of bins which are empty?
(C) How many balls do we have to throw, such that all the bins are non-empty, with reasonable probability?
Theorem 6.3.2. With probability at least 1 − 1/n, no bin has more than k = ⌈3 ln n / ln ln n⌉ balls in it.

Proof: Let Xᵢ be the number of balls in the ith bin, when we throw n balls into n bins (i.e., m = n). Clearly,

E[Xᵢ] = Σⱼ₌₁ⁿ P[the jth ball falls in the ith bin] = n · (1/n) = 1,

by linearity of expectation. The probability that the first bin has exactly i balls is

C(n, i) (1/n)ⁱ (1 − 1/n)ⁿ⁻ⁱ ≤ C(n, i) (1/n)ⁱ ≤ (ne/i)ⁱ (1/n)ⁱ = (e/i)ⁱ,

where the last inequality follows by Lemma 6.1.1 (v).
Let Cⱼ(k) be the event that the jth bin has k or more balls in it. Then,

P[C₁(k)] ≤ Σᵢ₌ₖⁿ (e/i)ⁱ ≤ (e/k)ᵏ (1 + e/k + e²/k² + · · ·) = (e/k)ᵏ · 1/(1 − e/k).

For k∗ = c ln n / ln ln n, we have

P[C₁(k∗)] ≤ 2 (e/k∗)^{k∗} ≤ 2 exp(k∗(1 − ln k∗)) ≤ 2 exp(−(k∗ ln k∗)/2)
≤ 2 exp(−(c ln n)/(2 ln ln n) · ln((c ln n)/(ln ln n))) ≤ 2 exp(−(c ln n)/4) ≤ 1/n²,

using ln((c ln n)/(ln ln n)) ≈ ln ln n, for n and c sufficiently large.

Let us redo this calculation more carefully (yuk!). For k∗ = ⌈(3 ln n)/ln ln n⌉, we have

P[C₁(k∗)] ≤ (e/k∗)^{k∗} · 1/(1 − e/k∗) ≤ 2 (e/k∗)^{k∗} = 2 exp((1 − ln k∗) k∗),

and since ln k∗ ≥ ln 3 + ln ln n − ln ln ln n, the exponent satisfies 1 − ln k∗ ≤ 1 − ln 3 − ln ln n + ln ln ln n ≤
−ln ln n + ln ln ln n < 0. As such,

P[C₁(k∗)] ≤ 2 exp((−ln ln n + ln ln ln n) k∗) ≤ 2 exp(−3 ln n + 6 ln n · (ln ln ln n)/(ln ln n)) ≤ 2 exp(−2.5 ln n) ≤ 1/n²,

for n large enough. We conclude, since there are n bins and they all have the same distribution, that

P[any bin contains more than k∗ balls] ≤ Σᵢ₌₁ⁿ P[Cᵢ(k∗)] ≤ n · (1/n²) = 1/n. ■
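A quick empirical check of the theorem (function name mine): for n = 10000, the bound ⌈3 ln n / ln ln n⌉ comfortably dominates the observed maximum load:

```python
import math
import random
from collections import Counter

def max_load(n):
    """Throw n balls into n bins, uniformly; return the fullest bin's load."""
    return max(Counter(random.randrange(n) for _ in range(n)).values())
```

In practice the maximum load is even smaller, around ln n / ln ln n; the constant 3 is what buys the 1 − 1/n probability guarantee.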

Exercise 6.3.3. Show that when throwing m = n ln n balls into n bins, with probability 1 − o(1), every bin has
O(log n) balls.

6.3.1. The Probability of all bins to have exactly one ball


Next, we are interested in the probability that all m balls fall into distinct bins. Let Xᵢ be the event that the ith ball
fell into a bin distinct from those of the first i − 1 balls. We have:

P[∩ᵢ₌₂ᵐ Xᵢ] = P[X₂] ∏ᵢ₌₃ᵐ P[Xᵢ | ∩ⱼ₌₂ⁱ⁻¹ Xⱼ] ≤ ∏ᵢ₌₂ᵐ (n − i + 1)/n = ∏ᵢ₌₂ᵐ (1 − (i − 1)/n)
≤ ∏ᵢ₌₂ᵐ e^{−(i−1)/n} ≤ exp(−m(m − 1)/(2n)),

thus for m = ⌈√(2n) + 1⌉, the probability that all the m balls fall into different bins is smaller than 1/e.
This is sometimes referred to as the birthday paradox, which was already mentioned above. You have
m = 30 people in the room, and you ask them for the date (day and month) of their birthday (i.e., n = 365).
The above shows that the probability of all birthdays being distinct is at most exp(−30 · 29/730) ≤ 1/e. Namely,
there is more than a 50% chance for a birthday collision, a simple but counter-intuitive phenomenon.

6.4. The Coupon Collector’s Problem


There are n types of coupons, and at each trial one coupon is picked at random. How many trials does one have to
perform before picking all coupons? Let m be the number of trials performed. We would like to bound the
probability that m exceeds a certain number, and we still did not pick all coupons.
Let Cᵢ ∈ {1, . . . , n} be the coupon picked in the ith trial. The jth trial is a success if Cⱼ was not picked
before in the first j − 1 trials. Let Xᵢ denote the number of trials from the ith success until after the (i + 1)th
success. Clearly, the number of trials performed is

X = Σᵢ₌₀ⁿ⁻¹ Xᵢ.

Furthermore, Xᵢ has a geometric distribution with parameter pᵢ, that is Xᵢ ∼ Geom(pᵢ), with pᵢ = (n − i)/n. The
expectation and variance of Xᵢ are

E[Xᵢ] = 1/pᵢ    and    V[Xᵢ] = (1 − pᵢ)/pᵢ².
 
Lemma 6.4.1. Let X be the number of rounds till we collect all n coupons. Then, V[X] ≈ (π²/6) n² and its
standard deviation is σX ≈ (π/√6) n.

Proof: The probability of Xᵢ to succeed in a trial is pᵢ = (n − i)/n, and Xᵢ has the geometric distribution with
parameter pᵢ. As such E[Xᵢ] = 1/pᵢ, and V[Xᵢ] = (1 − pᵢ)/pᵢ².
Thus,

E[X] = Σᵢ₌₀ⁿ⁻¹ E[Xᵢ] = Σᵢ₌₀ⁿ⁻¹ n/(n − i) = nHₙ = n(ln n + Θ(1)) = n ln n + O(n),

where Hₙ = Σᵢ₌₁ⁿ 1/i is the nth Harmonic number.
As for the variance, using the independence of X₀, . . . , Xₙ₋₁, we have

V[X] = Σᵢ₌₀ⁿ⁻¹ V[Xᵢ] = Σᵢ₌₀ⁿ⁻¹ (1 − pᵢ)/pᵢ² = Σᵢ₌₀ⁿ⁻¹ (i/n) · (n/(n − i))² = Σᵢ₌₀ⁿ⁻¹ n i/(n − i)²
= n Σᵢ₌₁ⁿ (n − i)/i² = n (Σᵢ₌₁ⁿ n/i² − Σᵢ₌₁ⁿ 1/i) = n² Σᵢ₌₁ⁿ 1/i² − nHₙ ≈ (π²/6) n²,

since Σᵢ₌₁^∞ 1/i² = π²/6; in particular, lim_{n→∞} V[X]/n² = π²/6. ■

This implies a weak bound on the concentration of X: using Chebyshev's inequality, we have

P[ X ≥ n ln n + n + t · (π/√6) n ] ≤ P[ |X − E[X]| ≥ t σX ] ≤ 1/t².

Note that this is somewhat approximate, and holds for n sufficiently large.
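As a quick sanity check of these estimates, here is a small simulation of the coupon collector process (function name mine); the average number of draws matches n ln n + O(n) well:

```python
import random

def coupon_rounds(n):
    """Draw coupons uniformly from {0, ..., n-1} until all n types have
    been seen; return the number of draws."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws
```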

6.5. Notes
The material in this chapter covers parts of [MR95, Sections 3.1, 3.2, 3.6].

References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.

Chapter 7

On k-wise independence

7.1. Pairwise independence


7.1.1. Pairwise independence
Definition 7.1.1. A set of random variables X1 , . . . , Xn is pairwise independent if for any pair of values α, β,
and any two distinct indices i, j, we have that

P[Xᵢ = α and Xⱼ = β] = P[Xᵢ = α] P[Xⱼ = β].

Namely, the variables are independent if you look at pairs of variables. Compare this to the much stronger
property of independence.
Definition 7.1.2. A set of random variables X1 , . . . , Xn is independent if for any t, any t values α1 , . . . , αt ,
and any t distinct indices i1 , . . . , it , we have that

P[Xᵢ₁ = α1 , Xᵢ₂ = α2 , . . . , and Xᵢₜ = αt] = ∏ⱼ₌₁ᵗ P[Xᵢⱼ = αⱼ].

7.1.2. A pairwise independent set of bits


Let n be a number which is a power of two. As such, t = log₂ n = lg n is an integer. Let X0 , . . . , Xt−1 be truly
independent random bits, each one of them being 1 with probability 1/2.
For a non-negative integer number x, let bit(x, j) ∈ {0, 1} be the jth bit in the binary representation of x.
That is, we have x = Σⱼ bit(x, j) 2ʲ.
For an index i = 1, . . . , 2ᵗ − 1, we define Yᵢ = ⊕_{j : bit(i, j)=1} Xⱼ, where ⊕ is the xor operator.

Lemma 7.1.3. The random variables Y1 , Y2 , . . . , Yn−1 are pairwise independent.

Proof: We claim that, for any i, we have P[Yᵢ = 1] = P[Yᵢ = 0] = 1/2. So fix i, and let α be an index such that
bit(i, α) = 1; the claim follows readily if we pick the truly random variables X0 , . . . , Xt−1 in an order in which
Xα is the last one to be set.
Next, consider two distinct indices i, i′, and two arbitrary values v, v′. We need to prove that

P[Yᵢ = v and Yᵢ′ = v′] = P[Yᵢ = v] P[Yᵢ′ = v′] = 1/4.

To this end, let B = { j | bit(i, j) = 1 } and B′ = { j | bit(i′, j) = 1 }. If there is an index β ∈ B \ B′, then we have

P[Yᵢ = v | Yᵢ′ = v′] = P[ ⊕_{j∈B} Xⱼ = v | Yᵢ′ = v′ ] = P[ Xβ = v ⊕ (⊕_{j∈B, j≠β} Xⱼ) | Yᵢ′ = v′ ] = 1/2,

since Xβ appears neither in Yᵢ′ nor in the xor on the right-hand side. This implies that
P[Yᵢ = v and Yᵢ′ = v′] = P[Yᵢ = v | Yᵢ′ = v′] P[Yᵢ′ = v′] = (1/2)(1/2) = 1/4, as claimed.
A similar argument implies that if there is an index β ∈ B′ \ B, then P[Yᵢ′ = v′ | Yᵢ = v] = 1/2, which implies
the claim in this case.
Since i ≠ i′, one of the two scenarios must happen, implying the claim. ■
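The construction is easy to implement, and for small t one can even verify pairwise independence exactly by enumerating all 2ᵗ seeds (function names are mine):

```python
import random

def stretch(X):
    """Given t truly random bits X, return Y_1, ..., Y_{2^t - 1}, where
    Y_i is the xor of the X_j with bit(i, j) = 1."""
    t = len(X)
    return [sum(X[j] for j in range(t) if (i >> j) & 1) % 2
            for i in range(1, 1 << t)]

def pairwise_bits(t):
    """n - 1 = 2^t - 1 pairwise independent bits from t random bits."""
    return stretch([random.randint(0, 1) for _ in range(t)])
```

The test below checks that, over all 2ᵗ seeds, every pair (Yᵢ, Yⱼ) takes each of the four value pairs equally often, which is exactly pairwise independence.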

7.1.3. An application: Max cut


Given a graph G = (V, E) with n vertices and m edges, consider the problem of computing the max-cut. That
is, computing a set of vertices S , such that the cut

(S , S̄) = (S , V \ S ) = {uv ∈ E | u ∈ S , v ∈ V \ S }

is of maximum cardinality.

7.1.3.0.1. Algorithm. To this end, let Y1 , . . . , Yn be the pairwise independent bits of Section 7.1.2. Let
S be the set of all vertices vᵢ ∈ V such that Yᵢ = 1. The algorithm outputs (S , S̄) as the candidate cut.

7.1.4. Analysis
Lemma 7.1.4. The expected size of the cut computed by the above algorithm is m/2, where m = |E(G)|.

Proof: Let Z_uv be an indicator variable for the event that the edge uv ∈ E is in the cut (S , S̄). We have that

E[|(S , S̄)|] = E[Σ_{uv∈E} Z_uv] = Σ_{uv∈E} E[Z_uv] = Σ_{uv∈E} P[Y_u ≠ Y_v] = |E|/2,

using linearity of expectation and pairwise independence. ■

Lemma 7.1.5. Given a graph G with n vertices and m edges, say stored in a read-only memory, one can compute a cut of G, and the edges in it, using O(log n) random bits and O(log n) RAM bits, such that the expected size of the cut is ≥ m/2.

Proof: The algorithm is described above. The pairwise independent bits are also described above, and require only O(log n) truly random bits, which need to be stored. Otherwise, all we need is to scan the edges of the graph, and for each one decide whether or not it is in the cut. Clearly, this can be done using O(log n) RAM bits. ■

Compare this to the natural randomized algorithm of computing a uniformly random subset S. That would require using n random bits, and n bits of space to store the subset.

Max cut in the streaming model. Imagine that the edges of the graph are given to you via streaming: You are
told the number of vertices in advance, but then edges arrive one by one. The above enables you to compute the
cut in a streaming fashion using O(log n) bits. Alternatively, you can output the edges in a streaming fashion.
Another way of thinking about it is that, given a set S = {s_1, . . . , s_n} of n elements, we can use the above to select a random sample where every element is selected with probability half, and the selections are pairwise independent. The kicker is that to specify the sample, or to decide if an element is in the sample, we need only O(log n) bits. This is a huge saving compared to the n bits of storage normally required to remember the sample.
It is clear, however, that we want a stronger concept – where things are k-wise independent.

7.2. On k-wise independence


7.2.1. Definition
Definition 7.2.1. A set of variables X_1, . . . , X_n is k-wise independent if for any set I = {i_1, i_2, . . . , i_t} of indices, with t ≤ k, and any set of values v_1, . . . , v_t, we have that

    P[ X_{i_1} = v_1 and X_{i_2} = v_2 and · · · and X_{i_t} = v_t ] = Π_{j=1}^{t} P[ X_{i_j} = v_j ].

Observe that verifying the above property needs to be done only for t = k.

7.2.2. On working modulo prime



Definition 7.2.2. For a number n, let Z_n = {0, . . . , n − 1}.
For two integer numbers x and y, the quotient of x/y is x div y = ⌊x/y⌋. The remainder of x/y is x mod y = x − y⌊x/y⌋. If x mod y = 0, then y divides x, denoted by y | x. We use α ≡ β (mod p) or α ≡_p β to denote that α and β are congruent modulo p; that is, α mod p = β mod p – equivalently, p | (α − β).

Lemma 7.2.3. Let p be a prime number.
(A) For any α, β ∈ {1, . . . , p − 1}, we have that αβ ≢ 0 (mod p).
(B) For any α, β, i ∈ {1, . . . , p − 1}, such that α ≠ β, we have that αi ≢ βi (mod p).
(C) For any x ∈ {1, . . . , p − 1} there exists a unique y such that xy ≡ 1 (mod p). The number y is the inverse of x, and is denoted by x^{−1} or 1/x.

Proof: (A) If αβ ≡ 0 (mod p), then p must divide αβ, as it divides 0. Since p is prime, this implies that either p | α or p | β, which is impossible, as both α and β are positive and smaller than p.
(B) Assume that α > β. Furthermore, for the sake of contradiction, assume that αi ≡ βi (mod p). But then (α − β)i ≡ 0 (mod p), which is impossible, by (A).
(C) For any α ∈ {1, . . . , p − 1}, consider the set L_α = {α·1 mod p, α·2 mod p, . . . , α·(p − 1) mod p}. By (A), zero is not in L_α, and by (B), L_α must contain p − 1 distinct values. It follows that L_α = {1, 2, . . . , p − 1}. As such, there exists exactly one number y ∈ {1, . . . , p − 1}, such that αy ≡ 1 (mod p). ■

Lemma 7.2.4. Consider a prime p, and any numbers x, y ∈ Z_p. If x ≠ y then, for any a, b ∈ Z_p, such that a ≠ 0, we have ax + b ≢ ay + b (mod p).

Proof: Assume x > y (the other case is handled similarly). If ax + b ≡ ay + b (mod p), then a(x − y) ≡ 0 (mod p), with a ≠ 0 and x − y ≠ 0. This is impossible by Lemma 7.2.3 (A), since 0 < a < p and 0 < x − y < p. ■

Lemma 7.2.5. Consider a prime p, and any numbers x, y ∈ Z_p. If x ≠ y then, for each pair of numbers r, s ∈ Z_p = {0, 1, . . . , p − 1}, such that r ≠ s, there is exactly one unique choice of numbers a, b ∈ Z_p such that ax + b (mod p) = r and ay + b (mod p) = s.

Proof: Solve the system of equations

    ax + b ≡ r (mod p)  and  ay + b ≡ s (mod p).

Subtracting, a(x − y) ≡ r − s (mod p), and thus a = (r − s)(x − y)^{−1} (mod p) and b = r − ax (mod p). ■
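As a quick sanity check of this proof, the unique (a, b) can be computed directly with modular arithmetic; the numbers below are toy values I picked, and the three-argument `pow(·, -1, p)` (which computes the modular inverse, Python 3.8+) exists since p is prime and x ≠ y.

```python
# Solving for the unique (a, b) of Lemma 7.2.5, with toy values.
p = 11             # a small prime
x, y = 3, 7        # distinct inputs
r, s = 5, 2        # distinct targets
a = ((r - s) * pow(x - y, -1, p)) % p   # a = (r - s)(x - y)^{-1} mod p
b = (r - a * x) % p                     # b = r - a x mod p
assert (a * x + b) % p == r
assert (a * y + b) % p == s
```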

7.2.3. Construction of k-wise independent variables


7.2.4. Construction
Consider the following matrix, aka the Vandermonde matrix, defined by n variables:

        | 1   x_1   x_1^2   ...   x_1^{n−1} |
        | 1   x_2   x_2^2   ...   x_2^{n−1} |
    V = | 1   x_3   x_3^2   ...   x_3^{n−1} |
        | .    .      .     ...       .     |
        | 1   x_n   x_n^2   ...   x_n^{n−1} |

Claim 7.2.6. det(V) = Π_{1 ≤ i < j ≤ n} (x_j − x_i).

Proof: One can prove this in several ways; we include a proof via properties of polynomials. The determinant det(V) is a polynomial in the variables x_1, x_2, . . . , x_n. Formally, let Π be the set of all permutations of ⟦n⟧ = {1, . . . , n}. For a permutation π ∈ Π, let sign(π) ∈ {−1, +1} denote the sign of this permutation. We have that

    f(x_1, x_2, . . . , x_n) = det(V) = Σ_{π ∈ Π} sign(π) Π_{i=1}^{n} x_i^{π(i) − 1}.

Every monomial in this polynomial has total degree Σ_{i=1}^{n} (π(i) − 1) = 0 + 1 + · · · + (n − 1) = n(n − 1)/2. Observe that if we replace x_j by x_i, then f(x_1, . . . , x_i, . . . , x_{j−1}, x_i, x_{j+1}, . . . , x_n) is the determinant of a matrix with two identical rows, and such a matrix has a zero determinant. Namely, the polynomial f is zero if x_i = x_j. This implies that x_j − x_i divides f. We conclude that the polynomial g ≡ Π_{1 ≤ i < j ≤ n} (x_j − x_i) divides f. Namely, we can write f = g·h, where h is some polynomial.
Consider the monomial x_2 x_3^2 · · · x_n^{n−1}. It appears in f with coefficient 1. Similarly, it is generated in g by selecting the first term x_j in each factor (x_j − x_i), and it is easy to verify that this is the only way this monomial arises in g. This implies that h = 1. We conclude that f = g, as claimed. ■

Claim 7.2.7. If x_1, . . . , x_n are distinct, then the Vandermonde matrix V is invertible.

Proof: By Claim 7.2.6, the determinant of V is det(V) = Π_{1 ≤ i < j ≤ n} (x_j − x_i). This quantity is non-zero if the x_i's are distinct, and a matrix with a non-zero determinant is invertible. ■
Lemma 7.2.8. For a vector b = (b_0, . . . , b_{k−1}) ∈ Z_p^k, consider the associated polynomial f(x, b) = Σ_{i=0}^{k−1} b_i x^i mod p. For any k distinct values α_1, . . . , α_k ∈ Z_p, and k values v_1, . . . , v_k ∈ Z_p, there is a unique choice of b, such that f(α_i, b) = v_i, for i = 1, . . . , k.

Proof: Let α_i = (1, α_i, α_i^2, . . . , α_i^{k−1}). We have that f(α_i, b) = ⟨α_i, b⟩ mod p. This translates into the linear system

    M b^T = (v_1, . . . , v_k)^T,   where

        | 1   α_1   α_1^2   ...   α_1^{k−1} |
    M = | 1   α_2   α_2^2   ...   α_2^{k−1} |
        | .    .      .     ...       .     |
        | 1   α_k   α_k^2   ...   α_k^{k−1} |

The matrix M is a Vandermonde matrix, and by Claim 7.2.7 it is invertible (modulo p, its determinant Π_{i<j} (α_j − α_i) is non-zero, since the α_i are distinct elements of Z_p and p is prime). We thus get that there exists a unique solution to this system of linear equations (modulo p). ■

The construction. So, let us pick independently and uniformly k values b_0, b_1, . . . , b_{k−1} ∈ Z_p, let b = (b_0, b_1, . . . , b_{k−1}), let g(x) = Σ_{i=0}^{k−1} b_i x^i mod p, and consider the random variables

    Y_i = g(i),    ∀i ∈ Z_p.

Lemma 7.2.9. The variables Y_0, . . . , Y_{p−1} are uniformly distributed and k-wise independent.

Proof: The uniform distribution of each Y_i follows readily by picking b_0 last, and observing that each choice of b_0 corresponds to a different value of Y_i.
As for the k-wise independence, observe that for any set I = {i_1, i_2, . . . , i_k} of k indices, and any set of values v_1, . . . , v_k ∈ Z_p, the event

    Y_{i_1} = v_1 and Y_{i_2} = v_2 and · · · and Y_{i_k} = v_k

happens for exactly one choice of b, by Lemma 7.2.8. Since there are p^k equally likely choices of b, we conclude that the probability of the above event is 1/p^k = Π_{j=1}^{k} P[Y_{i_j} = v_j], as desired. ■

We summarize the result for later use.

Theorem 7.2.10. Let p be a prime number, and pick independently and uniformly k values b_0, b_1, . . . , b_{k−1} ∈ Z_p, and let g(x) = Σ_{i=0}^{k−1} b_i x^i mod p. Then the random variables

    Y_0 = g(0), . . . , Y_{p−1} = g(p − 1)

are uniformly distributed in Z_p and are k-wise independent.
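A minimal sketch of this construction (the names are mine): pick k random coefficients in Z_p and evaluate the polynomial by Horner's rule, modulo p.

```python
import random

def kwise_generator(p, k, rng):
    """Pick a random polynomial g of degree < k over Z_p (Theorem 7.2.10);
    the values Y_i = g(i), for i in Z_p, are uniform and k-wise independent."""
    b = [rng.randrange(p) for _ in range(k)]    # coefficients b_0, ..., b_{k-1}
    def g(x):
        acc = 0
        for coef in reversed(b):                # Horner's rule, modulo p
            acc = (acc * x + coef) % p
        return acc
    return g

g = kwise_generator(101, 4, random.Random(1))   # 4-wise independent Y_0, ..., Y_100
Y = [g(i) for i in range(101)]
```

Note that describing the whole sequence Y_0, . . . , Y_{p−1} takes only the k numbers b_0, . . . , b_{k−1}, i.e., O(k log p) bits.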

7.2.5. Applications of k-wise independent variables


7.2.5.1. Product of expectations

Lemma 7.2.11. If X1 , . . . , Xk are k-wise independent, then E[X1 · · · Xk ] = E[X1 ] · · · E[Xk ].

Proof: Immediate. ■

7.2.5.2. Application: Using less randomization for a randomized algorithm
We can consider a randomized algorithm to be a deterministic algorithm Alg(x, r) that receives, together with the input x, a random string r of bits, which it uses as its source of random bits. Let us redefine RP:
Definition 7.2.12. The class RP (for Randomized Polynomial time) consists of all languages L that have a
deterministic algorithm Alg(x, r) with worst case polynomial running time such that for any input x ∈ Σ∗ ,
• x ∈ L =⇒ Alg(x, r) = 1 for half the possible values of r.
• x < L =⇒ Alg(x, r) = 0 for all values of r.

Let us assume that we now want to minimize the number of random bits we use in the execution of the algorithm (why? random bits can be expensive to obtain). If we run the algorithm t times independently, the probability that all runs fail is at most 2^{−t}, while using t log n random bits (assuming our random algorithm needs only log n bits in each execution). Alternatively, we can choose two random numbers a, b from Z_n, and run Alg(x, a) and Alg(x, b), which fails with probability at most 1/4, while requiring only 2 log n bits.
Can we do better? Let us define r_i = ai + b mod n, where a, b are random values as above (note that we assume that n is prime), for i = 1, . . . , t. Thus Y = Σ_{i=1}^{t} Alg(x, r_i) is a sum of random variables which are pairwise independent, as the r_i are pairwise independent. Assume that x ∈ L. Then we have E[Y] = t/2, and σ_Y^2 = V[Y] = Σ_{i=1}^{t} V[Alg(x, r_i)] ≤ t/4, so σ_Y ≤ √t/2. The event that all these executions fail corresponds to the event that Y = 0, and

    P[Y = 0] ≤ P[ |Y − E[Y]| ≥ t/2 ] = P[ |Y − E[Y]| ≥ √t · (√t/2) ] ≤ 1/t,

by Chebyshev's inequality. Thus we were able to “extract” from our random bits much more than one would naturally suspect is possible. We thus get the following result.
Lemma 7.2.13. Given an algorithm Alg in RP that uses lg n random bits, one can run it t times, resulting in a new algorithm that fails with probability at most 1/t, and uses only 2 lg n random bits.
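The amplification of Lemma 7.2.13 can be sketched as follows. The helper name and the toy algorithm below are mine; `alg(x, r)` stands for a hypothetical one-sided-error algorithm as in Definition 7.2.12, and only the two numbers a, b are truly random.

```python
import random

def amplify(alg, x, p, t, rng):
    """Run a one-sided-error algorithm on t pairwise independent seeds
    r_i = (a*i + b) mod p, spending only two truly random numbers a, b.
    If x is a yes-instance, Chebyshev bounds the failure probability by 1/t."""
    a, b = rng.randrange(p), rng.randrange(p)
    return any(alg(x, (a * i + b) % p) for i in range(1, t + 1))
```

A no-instance is never accepted (the underlying algorithm never errs on those), while a yes-instance is missed with probability at most 1/t.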

7.3. Higher moment inequalities


The following is the higher moment variant of Chebyshev's inequality.

Lemma 7.3.1. For a random variable X, and any integer k ≥ 1 and t > 0, we have that

    P[ |X − E[X]| ≥ t ( E[ |X − E[X]|^k ] )^{1/k} ] ≤ 1/t^k.

Proof: Setting Z = |X − E[X]|^k, and raising the inequality to the kth power, we have

    P[ |X − E[X]| ≥ t ( E[ |X − E[X]|^k ] )^{1/k} ] = P[ Z^{1/k} ≥ t E[Z]^{1/k} ] = P[ Z ≥ t^k E[Z] ] ≤ 1/t^k,

by Markov's inequality. ■
The problem is that computing (or even bounding) the kth moment M_k(X) = E[ |X − E[X]|^k ] is usually not easy. Let us do it for one interesting example.
Lemma 7.3.2. Let k be an even integer, and let X_1, . . . , X_n be n independent random variables such that P[X_i = −1] = P[X_i = +1] = 1/2. Let X = Σ_{i=1}^{n} X_i. Then, we have

    P[ |X| ≥ (tk/2) √n ] ≤ 1/t^k.
Proof: Observe that E[X] = n E[X_1] = 0. We are interested in computing

    M_k(X) = E[X^k] = E[ ( Σ_i X_i )^k ] = Σ_{i_1=1}^{n} · · · Σ_{i_k=1}^{n} E[ X_{i_1} X_{i_2} · · · X_{i_k} ].    (7.1)

Consider a term in the above summation where one of the indices (say i_1) has a unique value among i_1, i_2, . . . , i_k. By independence, we have

    E[ X_{i_1} X_{i_2} · · · X_{i_k} ] = E[X_{i_1}] E[ X_{i_2} · · · X_{i_k} ] = 0,

since E[X_{i_1}] = 0. As such, in the above all terms that have a unique index disappear. A term that does not disappear is going to be of the form

    E[ X_{i_1}^{α_1} X_{i_2}^{α_2} · · · X_{i_ℓ}^{α_ℓ} ] = E[ X_{i_1}^{α_1} ] E[ X_{i_2}^{α_2} ] · · · E[ X_{i_ℓ}^{α_ℓ} ],

where the indices i_1, . . . , i_ℓ are distinct, α_j ≥ 2 for all j, and Σ_j α_j = k. Observe that

    E[X_1^t] = 0 if t is odd, and E[X_1^t] = 1 if t is even.

As such, all the terms in the summation of Eq. (7.1) that are non-zero have value one. These terms correspond to tuples T = (i_1, i_2, . . . , i_k), such that the set of values I(T) = {i_1, . . . , i_k} has at most k/2 values, and furthermore, each such value appears an even number of times in T (here k/2 is an integer as k is even by assumption). We conclude that the total number of such tuples is at most

    n^{k/2} (k/2)^k.

Note that this is a naive bound – indeed, we choose the (at most) k/2 values that are in I(T), and then we generate the tuple T by choosing a value for each coordinate separately. We thus conclude that

    M_k(X) = E[X^k] ≤ n^{k/2} (k/2)^k.

Since k is even, we have E[X^k] = E[|X|^k], and by Lemma 7.3.1, we have

    P[ |X| ≥ (tk/2) √n ] = P[ |X| ≥ t √n (k/2) ] ≤ P[ |X| ≥ t ( E[|X|^k] )^{1/k} ] ≤ 1/t^k. ■

Corollary 7.3.3. Let k be an even integer, and let X_1, . . . , X_n be n independent random variables such that P[X_i = −1] = P[X_i = +1] = 1/2. For X = Σ_{i=1}^{n} X_i, we have P[ |X| ≥ k √n ] ≤ 1/2^k (set t = 2 in Lemma 7.3.2).
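One can check Corollary 7.3.3 empirically: the observed frequency of the event |X| ≥ k√n should fall well below the bound 1/2^k. This little simulation (parameter values are mine) uses fully independent signs.

```python
import math
import random

rng = random.Random(42)
n, k, trials = 2500, 4, 2000
threshold = k * math.sqrt(n)        # the corollary: P[|X| >= k sqrt(n)] <= 1/2^k
hits = 0
for _ in range(trials):
    X = sum(1 if rng.random() < 0.5 else -1 for _ in range(n))
    if abs(X) >= threshold:
        hits += 1
assert hits / trials <= 1 / 2 ** k  # bound is 1/16; the true probability is far smaller
```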

Observe that the above proof did not require the variables to be fully independent – it was enough that they are k-wise independent. We readily get the following.
Definition 7.3.4. Given n random variables X_1, . . . , X_n, they are k-wise independent, if for any k of them (i.e., indices i_1 < i_2 < · · · < i_k), and any k values v_1, . . . , v_k, we have

    P[ ∩_{ℓ=1}^{k} ( X_{i_ℓ} = v_ℓ ) ] = Π_{ℓ=1}^{k} P[ X_{i_ℓ} = v_ℓ ].

Informally, variables are k-wise independent if any k of them (on their own) look totally random.
Lemma 7.3.5. Let k > 0 be an even integer, and let X_1, . . . , X_n be n random variables that are k-wise independent, such that P[X_i = −1] = P[X_i = +1] = 1/2. Let X = Σ_{i=1}^{n} X_i. Then, we have

    P[ |X| ≥ (tk/2) √n ] ≤ 1/t^k.
Chapter 8

Hashing
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“I tried to read this book, Huckleberry Finn, to my grandchildren, but I couldn't get past page six because the book is fraught with the n-word. And although they are the deepest-thinking, combat-ready eight- and ten-year-olds I know, I knew my babies weren't ready to comprehend Huckleberry Finn on its own merits. That's why I took the liberty to rewrite Mark Twain's masterpiece. Where the repugnant n-word occurs, I replaced it with warrior and the word slave with dark-skinned volunteer.”

Paul Beatty, The Sellout

8.1. Introduction
We are interested here in the dictionary data structure. The settings for such a data-structure:
(A) U: universe of keys with total order: numbers, strings, etc.
(B) Data structure to store a subset S ⊆ U
(C) Operations:
(A) search/lookup: given x ∈ U is x ∈ S ?
(B) insert: given x ∉ S add x to S.
(C) delete: given x ∈ S delete x from S
(D) Static structure: S given in advance or changes very infrequently, main operations are lookups.
(E) Dynamic structure: S changes rapidly so inserts and deletes as important as lookups.

Common constructions for such data-structures include using a static sorted array, where a lookup is a binary search, or a balanced search tree (e.g., a red-black tree). Operations like lookup, insert, and delete then take O(log |S|) time (comparisons).
Naturally, the above are potentially an “overkill”, in the sense that sorting is unnecessary. In particular, the universe U may not be (naturally) totally ordered, or the keys may correspond to large objects (images, graphs, etc.) for which comparisons are expensive. Finally, we would like to improve the “average” performance of lookups to O(1) time, even at the cost of extra space or errors with small probability: there are many applications for fast lookups in networking, security, etc.

Hashing and Hash Tables. The hash-table data structure has an associated (hash) table/array T of size m (the table size), and a hash function h : U → {0, . . . , m − 1}. An item x ∈ U hashes to slot h(x) in T.
Given a set S ⊆ U, in a perfect ideal situation, each element x ∈ S hashes to a distinct slot in T, and we store x in the slot h(x). A lookup for an item y ∈ U then just checks if T[h(y)] = y. This takes constant time.

Figure 8.1: Open hashing.

Unfortunately, collisions are unavoidable, and there are several different techniques to handle them. Formally, two items x ≠ y collide if h(x) = h(y).
A standard technique to handle collisions is to use chaining (aka open hashing). Here, we handle collisions as follows:
(A) For each slot i store all items hashed to slot i in a linked list. T[i] points to the linked list.
(B) Lookup: to find if y ∈ U is in T, check the linked list at T[h(y)]. The time is proportional to the size of the linked list.
Other techniques for handling collisions include associating a list of locations where an element can be (in a certain order), and checking these locations in this order. Another useful technique is cuckoo hashing, which we will discuss later on: every value has two possible locations. When inserting, insert in one of the locations; otherwise, kick out the stored value to its other location, and repeat till stable. If no stable configuration is reached, rebuild the table.
The relevant questions when designing a hashing scheme include: (I) Does hashing give O(1) time per operation for dictionaries? (II) What is the complexity of evaluating h on a given element? (III) What are the relative sizes of the universe U and the set S to be stored? (IV) What is the size of the table relative to the size of S? (V) Worst-case vs average-case vs randomized (expected) time? (VI) How do we choose h?
The load factor of the array T is the ratio n/m, where n = |S| is the number of elements being stored and m = |T| is the size of the array being used. Typically n/m is a small constant smaller than 1.
In the following, we assume that U (the universe the keys are taken from) is large – specifically, N = |U| ≫ m^2, where m is the size of the table. Consider a hash function h : U → {0, . . . , m − 1}. If we hash all N items of U to the m slots, then by the pigeonhole principle, there is some i ∈ {0, . . . , m − 1} such that at least N/m ≥ m elements of U get hashed to i. In particular, this implies that there is a set S ⊆ U, where |S| = m, such that all of S hashes to the same slot. Oops.
Namely, for every hash function there is a bad set with many collisions.

Observation 8.1.1. Let H be the set of all functions from U to {0, . . . , m − 1}. The number of functions in H is m^N, where N = |U|. As such, specifying a function in H would require log_2 |H| = O(N log m) bits.
As such, picking a truly random hash function requires many random bits, and furthermore, it is not even clear how to evaluate it efficiently (which is the whole point of hashing).

Picking a hash function. Picking a good hash function in practice is a dark art involving many non-trivial considerations and ideas. For parameters N = |U|, m = |T|, and n = |S|, we require the following:
(A) H is a family of hash functions, where each function h ∈ H should be efficient to evaluate (that is, to compute h(x)).
(B) h is chosen randomly from H (typically uniformly at random). Implicitly, this assumes that H allows efficient sampling.
(C) For any fixed set S ⊆ U of size n, the expected number of collisions for a function chosen from H should be “small”. Here the expectation is over the randomness in the choice of h.

8.2. Universal Hashing
We would like the hash function to have the following property: for any element x ∈ U, and a random h ∈ H, h(x) should have a uniform distribution; that is, P[h(x) = i] = 1/m, for every 0 ≤ i < m. A somewhat stronger property is that for any two distinct elements x, y ∈ U, for a random h ∈ H, the probability of a collision between x and y should be at most 1/m; that is, P[h(x) = h(y)] ≤ 1/m.

Definition 8.2.1. A family H of hash functions is 2-universal if for all distinct x, y ∈ U, we have P[h(x) = h(y)] ≤ 1/m, where h is picked uniformly at random from H.

Applying a hash function picked from a 2-universal family to a set of distinct numbers results in a 2-wise independent sequence of numbers.

Lemma 8.2.2. Let S be a set of n elements stored using open hashing in a hash table of size m, where the hash function is picked from a 2-universal family. Then, the expected lookup time, for any element x ∈ U, is O(1 + n/m).

Proof: The number of elements colliding with x is ℓ(x) = Σ_{y ∈ S} D_y, where D_y = 1 ⇐⇒ x and y collide under the hash function h. As such, we have

    E[ℓ(x)] = Σ_{y ∈ S} E[D_y] = Σ_{y ∈ S} P[h(x) = h(y)] ≤ Σ_{y ∈ S} 1/m = |S|/m = n/m. ■

Remark 8.2.3. The above analysis holds even if we perform a sequence of O(n) insertion/deletion operations. Indeed, just repeat the analysis with the set of elements being all elements encountered during these operations.
The worst-case bound is of course much worse – it is not hard to show that, in the worst case, the load of a single hash table entry might be Ω(log n/ log log n) (as we saw in the occupancy problem).

Rehashing, amortization, etc. The above assumed that the set S is fixed. If items are inserted and deleted, then the hash table might become much worse. In particular, if |S| grows to more than cm, for some constant c, then the hash table performance starts degrading. Furthermore, if many insertions and deletions happen, then the initial random hash function is no longer random enough, and the above analysis no longer holds.
A standard solution is to rebuild the hash table periodically. We choose a new table size based on the current number of elements in the table, and a new random hash function, and rehash the elements; then we discard the old table and hash function. In particular, if |S| grows to more than twice the current table size, then we rebuild a new hash table (choosing a new random hash function) with double the current number of elements. One can do a similar shrinking operation if the set size falls below a quarter of the current hash table size.
If the working set |S| stays roughly the same, but more than c|S| operations are performed on the table, for some chosen constant c (say 10), we rebuild.
We amortize the cost of rebuilding against the previously performed operations. Rebuilding ensures that the O(1) expected analysis holds even when S changes. Hence we get a dynamic dictionary data structure with O(1) expected time per lookup/insert/delete!

8.2.1. How to build a 2-universal family
8.2.1.1. A quick reminder on working modulo a prime

Definition 8.2.4. For a number n, let Z_n = {0, . . . , n − 1}.
For two integer numbers x and y, the quotient of x/y is x div y = ⌊x/y⌋. The remainder of x/y is x mod y = x − y⌊x/y⌋. If x mod y = 0, then y divides x, denoted by y | x. We use α ≡ β (mod p) or α ≡_p β to denote that α and β are congruent modulo p; that is, α mod p = β mod p – equivalently, p | (α − β).
Remark 8.2.5. A quick review of what we already know. Let p be a prime number.
(A) Lemma 7.2.3: For any α, β ∈ {1, . . . , p − 1}, we have that αβ ≢ 0 (mod p).
(B) Lemma 7.2.3: For any α, β, i ∈ {1, . . . , p − 1}, such that α ≠ β, we have that αi ≢ βi (mod p).
(C) Lemma 7.2.3: For any x ∈ {1, . . . , p − 1} there exists a unique y such that xy ≡ 1 (mod p). The number y is the inverse of x, and is denoted by x^{−1} or 1/x.
(D) Lemma 7.2.4: For any numbers x, y ∈ Z_p: if x ≠ y then, for any a, b ∈ Z_p, such that a ≠ 0, we have ax + b ≢ ay + b (mod p).
(E) Lemma 7.2.5: For any numbers x, y ∈ Z_p: if x ≠ y then, for each pair of numbers r, s ∈ Z_p = {0, 1, . . . , p − 1}, such that r ≠ s, there is exactly one unique choice of numbers a, b ∈ Z_p such that ax + b (mod p) = r and ay + b (mod p) = s.

8.2.1.2. Constructing a family of 2-universal hash functions

For parameters N = |U|, m = |T|, and n = |S|, choose a prime number p ≥ N, and let

    H = { h_{a,b} | a, b ∈ Z_p and a ≠ 0 },

where h_{a,b}(x) = ((ax + b) (mod p)) (mod m). Note that |H| = p(p − 1).
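This family is short enough to sketch directly; the snippet below (function name and parameter values are mine) samples h_{a,b} and estimates the collision probability of two fixed distinct keys, which Theorem 8.2.8 below bounds by 1/m.

```python
import random

def make_hash(p, m, rng):
    """Sample h_{a,b}(x) = ((a x + b) mod p) mod m uniformly from H."""
    a = rng.randrange(1, p)            # a != 0
    b = rng.randrange(p)
    return lambda x: ((a * x + b) % p) % m

# Estimate the collision probability of two fixed distinct keys (p = 97, m = 10):
p, m = 97, 10
x, y = 12, 57
rng = random.Random(0)
collisions = 0
for _ in range(5000):
    h = make_hash(p, m, rng)
    if h(x) == h(y):
        collisions += 1
# The empirical collision frequency should be at most about 1/m = 0.1.
```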

8.2.1.3. Analysis
Once we fix a and b, and we are given a value x, we compute the hash value of x in two stages:
(A) Compute: r ← (ax + b) (mod p).
(B) Fold: r′ ← r (mod m)
Lemma 8.2.6. Assume that p is a prime, and 1 < m < p. The number of pairs (r, s) ∈ Z_p × Z_p, such that r ≠ s, that are folded to the same number is ≤ p(p − 1)/m. Formally, the set of bad pairs

    B = { (r, s) ∈ Z_p × Z_p | r ≠ s and r ≡_m s }

is of size at most p(p − 1)/m.
Proof: Consider a pair (x, y) ∈ {0, 1, . . . , p − 1}^2, such that x ≠ y. For a fixed x, there are at most ⌈p/m⌉ values of y that fold into x. Indeed, x ≡_m y if and only if

    y ∈ L(x) = { x + im | i is an integer } ∩ Z_p.

The size of L(x) is maximized when x = 0, and the number of such elements is at most ⌈p/m⌉ (note that since p is a prime, p/m is fractional). One of the numbers in L(x) is x itself. As such, we have that

    |B| ≤ p ( |L(x)| − 1 ) ≤ p ( ⌈p/m⌉ − 1 ) ≤ p(p − 1)/m,

since ⌈p/m⌉ − 1 ≤ (p − 1)/m ⇐⇒ m⌈p/m⌉ − m ≤ p − 1 ⇐⇒ m⌊p/m⌋ ≤ p − 1 ⇐⇒ m⌊p/m⌋ < p, which is true since p is a prime, and 1 < m < p. ■

Figure 8.2: Explanation of the hashing scheme via figures.

Claim 8.2.7. For two distinct numbers x, y ∈ U, say that a pair (a, b) is bad if h_{a,b}(x) = h_{a,b}(y). The number of bad pairs is ≤ p(p − 1)/m.

Proof: Let a, b ∈ Z_p, such that a ≠ 0 and h_{a,b}(x) = h_{a,b}(y). Let

    r = (ax + b) mod p  and  s = (ay + b) mod p.

By Lemma 7.2.4, we have that r ≠ s. As such, a collision happens only if r ≡ s (mod m). By Lemma 8.2.6, the number of such pairs (r, s) is at most p(p − 1)/m. By Lemma 7.2.5, for each such pair (r, s), there is a unique choice of (a, b) that maps x and y to r and s, respectively. As such, there are at most p(p − 1)/m bad pairs. ■

Theorem 8.2.8. The hash family H is a 2-universal hash family.

Proof: Fix two distinct numbers x, y ∈ U. We are interested in the probability that they collide if h is picked randomly from H. By Claim 8.2.7 there are M ≤ p(p − 1)/m bad pairs that cause such a collision, and since H contains N = p(p − 1) functions, it follows that the probability of a collision is M/N ≤ 1/m, which implies that H is 2-universal. ■

8.2.1.4. Explanation via pictures


Consider a pair (x, y) ∈ Z_p^2, such that x ≠ y. This pair (x, y) corresponds to a cell in the natural “grid” Z_p^2 that is off the main diagonal. See Figure 8.2.
The mapping f_{a,b}(x) = (ax + b) mod p takes the pair (x, y), and maps it randomly and uniformly to some other pair x′ = f_{a,b}(x) and y′ = f_{a,b}(y) (where x′, y′ are again off the main diagonal).
Now consider the smaller grid Z_m × Z_m. The main diagonal of this subgrid is bad – it corresponds to a collision. One can think about the last step, of computing h_{a,b}(x) = f_{a,b}(x) mod m, as tiling the larger grid by the smaller grid, in the natural way. Any diagonal that is at distance mi from the main diagonal gets marked as bad. At most a 1/m fraction of the off-diagonal cells get marked as bad. See Figure 8.2.
As such, the random mapping of (x, y) to (x′, y′) causes a collision only if we map the pair to a badly marked pair, and the probability for that is ≤ 1/m.

8.3. Perfect hashing
An interesting special case of hashing is the static case – given a set S of elements, we want to hash S so that we can answer membership queries efficiently (i.e., a dictionary data-structure with no insertions). It is easy to come up with a hashing scheme that is optimal as far as space is concerned.

8.3.1. Some easy calculations


The first observation is that if the hash table is quadratically large, then there is a good (constant) probability to have no collisions (this is also the threshold for the birthday paradox).

Lemma 8.3.1. Let S ⊆ U be a set of n elements, and let H be a 2-universal family of hash functions into a table of size m ≥ n^2. Then, with probability ≤ 1/2, there is a pair of elements of S that collide under a random hash function h ∈ H.

Proof: For a pair x, y ∈ S, the probability they collide is at most 1/m, by definition. As such, by the union bound over the n(n − 1)/2 pairs, the probability of any collision is at most n(n − 1)/(2m) ≤ 1/2. ■

We now need a second moment bound on the sizes of the buckets.

Lemma 8.3.2. Let S ⊆ U be a set of n elements, and let H be a 2-universal family of hash functions into a table of size m ≥ cn, where c is an arbitrary constant. Let h ∈ H be a random hash function, and let X_i be the number of elements of S mapped to the ith bucket by h, for i = 0, . . . , m − 1. Then, we have

    E[ Σ_{j=0}^{m−1} X_j^2 ] ≤ (1 + 1/c) n.

Proof: Let s_1, . . . , s_n be the n items of S, and let Z_{i,j} = 1 if h(s_i) = h(s_j), for i < j. Observe that E[Z_{i,j}] = P[h(s_i) = h(s_j)] ≤ 1/m (this is the only place we use the property that H is 2-universal). For α ∈ Z_m, let Z(α) be the set of all the variables Z_{i,j}, for i < j, such that Z_{i,j} = 1 and h(s_i) = h(s_j) = α.
If for some α we have that X_α = k, then there are k indices ℓ_1 < ℓ_2 < · · · < ℓ_k, such that h(s_{ℓ_1}) = · · · = h(s_{ℓ_k}) = α. As such, z(α) = |Z(α)| = k(k − 1)/2. In particular, we have

    X_α^2 = k^2 = 2 · k(k − 1)/2 + k = 2 z(α) + X_α.

This implies that

    Σ_{α=0}^{m−1} X_α^2 = Σ_{α=0}^{m−1} ( 2 z(α) + X_α ) = 2 Σ_{α=0}^{m−1} z(α) + Σ_{α=0}^{m−1} X_α = n + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} Z_{i,j}.

Now, by linearity of expectation, we have

    E[ Σ_{α=0}^{m−1} X_α^2 ] = n + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} E[Z_{i,j}] ≤ n + (2/m) · n(n − 1)/2 = n + n(n − 1)/m ≤ n ( 1 + (n − 1)/m ) ≤ n ( 1 + 1/c ),

since m ≥ cn. ■

8.3.2. Construction of perfect hashing
Given a set S of n elements, we build an open hash table T of size, say, 2n. We use a random hash function h that is 2-universal for this hash table, see Theorem 8.2.8. Next, we map the elements of S into the hash table. Let S_j be the list of all the elements of S mapped to the jth bucket, and let X_j = |S_j|, for j = 0, . . . , 2n − 1.
We compute Y = Σ_{j=0}^{2n−1} X_j^2. If Y > 6n, then we reject h, and resample a hash function h. We repeat this process till success.
In the second stage, we build secondary hash tables for each bucket. Specifically, for j = 0, . . . , 2n − 1, if the jth bucket contains X_j > 0 elements, then we construct a secondary hash table H_j to store the elements of S_j; this secondary hash table has size X_j^2, and again we use a random 2-universal hash function h_j for the hashing of S_j into H_j. If any pair of elements of S_j collide under h_j, then we resample the hash function h_j, and try again till success.

8.3.2.1. Analysis
Theorem 8.3.3. Given a (static) set S ⊆ U of n elements, the above scheme constructs, in expected linear time, a two-level hash-table that can perform search queries in O(1) time. The resulting data-structure uses O(n) space.

Proof: Given an element x ∈ U, we first compute j = h(x), and then k = h_j(x), and we check whether the element stored in the secondary hash table H_j at the entry k is indeed x. As such, the search time is O(1).
The more interesting issue is the construction time. Let X_j be the number of elements mapped to the jth bucket, and let Y = Σ_{j=0}^{2n−1} X_j^2. Observe that E[Y] ≤ (1 + 1/2)n = (3/2)n, by Lemma 8.3.2 (here m = 2n, and as such c = 2). As such, by Markov's inequality, P[Y > 6n] ≤ (3/2)n / (6n) = 1/4. In particular, picking a good top-level hash function requires in expectation at most 1/(3/4) = 4/3 ≤ 2 iterations. Thus the first stage takes O(n) time, in expectation.
For the jth bucket, with X_j entries, by Lemma 8.3.1 the construction succeeds with probability ≥ 1/2, as the secondary table has size X_j^2. As before, the expected number of iterations till success is at most 2. As such, the expected construction time of the secondary hash table for the jth bucket is O(X_j^2).
We conclude that the overall expected construction time is O(n + Σ_j X_j^2) = O(n).
As for the space used, observe that it is O(n + Σ_j X_j^2) = O(n). ■
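The whole two-level scheme fits in a short sketch. This is an illustrative implementation under my own naming and parameter choices (p is a prime far larger than the test keys, and S is assumed non-empty); the retry loops mirror the rejection steps of the construction above.

```python
import random

def make_hash(p, m, rng):
    # h_{a,b}(x) = ((a x + b) mod p) mod m, from the 2-universal family of Section 8.2.
    a, b = rng.randrange(1, p), rng.randrange(p)
    return lambda x: ((a * x + b) % p) % m

def build_perfect(S, p=10**9 + 7, rng=None):
    """Two-level perfect hash table for a static non-empty set S of integers < p."""
    rng = rng or random.Random()
    n, m = len(S), 2 * len(S)
    while True:                                    # top level: retry until sum X_j^2 <= 6n
        h = make_hash(p, m, rng)
        buckets = [[] for _ in range(m)]
        for s in S:
            buckets[h(s)].append(s)
        if sum(len(b) ** 2 for b in buckets) <= 6 * n:
            break
    tables = []
    for bucket in buckets:                         # second level: quadratic, collision-free
        size = max(len(bucket) ** 2, 1)
        while True:
            hj = make_hash(p, size, rng)
            T = [None] * size
            ok = True
            for s in bucket:
                slot = hj(s)
                if T[slot] is not None:            # collision inside bucket: resample hj
                    ok = False
                    break
                T[slot] = s
            if ok:
                break
        tables.append((hj, T))
    return h, tables

def lookup(structure, x):
    """O(1) worst-case membership test."""
    h, tables = structure
    hj, T = tables[h(x)]
    return T[hj(x)] == x
```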

8.4. Bloom filters


Consider an application where we have a set S ⊆ U of n elements, and we want to be able to decide for a query x ∈ U whether or not x ∈ S. Naturally, we can use hashing. However, here we are interested in a data-structure that is more efficient as far as space is concerned. In exchange, we allow the data-structure to make a mistake (i.e., say that an element is in S, when it is not).

First try. So, let us start with a naive scheme. Let B[0 . . . m − 1] be an array of bits, and pick a random hash function h : U → Zm.
Initialize B to 0. Next, for every element s ∈ S, set B[h(s)] to 1. Now, given a query x, return B[h(x)] as the
answer to whether or not x ∈ S. Note that B is an array of bits, and as such it can be bit-packed and stored
efficiently.
For the sake of simplicity of exposition, assume that the hash function picked is truly random. As such,
we have that the probability of a false positive (i.e., a mistake) for a fixed x ∈ U is at most n/m. Since we want the
size of the table m to be close to n, this is not satisfying.

Using k hash functions. Instead of using a single hash function, let us use k independent hash functions
h_1, . . . , h_k. For an element s ∈ S, we set B[h_i(s)] to 1, for i = 1, . . . , k. Given a query x ∈ U, if B[h_i(x)] is zero,
for any i = 1, . . . , k, then x ∉ S. Otherwise, if all these k bits are on, the data-structure returns that x is in S.
Clearly, if the data-structure returns that x is not in S, then it is correct. The data-structure might make a
mistake (i.e., a false positive), if it returns that x is in S (when it is not in S).
We interpret the storing of the elements of S in B, as an experiment of throwing kn balls into m bins. The
probability of a bin to be empty is

p = p(m, n) = (1 − 1/m)kn ≈ exp(−k(n/m)).

Since the number of empty bins is a martingale, we know the number of empty bins is strongly concentrated
around the expectation pm, and we can treat p as the true probability of a bin to be empty.
The probability of a mistake is

f(k, m, n) = (1 − p)^k.

In particular, for k = (m/n) ln 2, we have that p = p(m, n) ≈ 1/2, and f(k, m, n) ≈ (1/2)^{(m/n) ln 2} ≈ 0.6185^{m/n}.

Example 8.4.1. Of course, the above is fictional, as k has to be an integer. But motivated by these calculations,
let m = 3n, and k = 4. We get that p(m, n) = exp(−4/3) ≈ 0.2636, and f(4, 3n, n) ≈ (1 − 0.2636)^4 ≈ 0.294.
This is better than the naive k = 1 scheme, where the probability of a false positive is roughly 1/3.

Note, that this scheme gets exponentially better over the naive scheme as m/n grows.

Example 8.4.2. Consider the setting m = 8n – this is when we allocate a byte for each element stored (the
element itself, of course, might be significantly bigger). The above implies we should take k = ⌈(m/n) ln 2⌉ = 6. We
then get p(8n, n) = exp(−6/8) ≈ 0.4724, and f(6, 8n, n) ≈ (1 − 0.4724)^6 ≈ 0.0215. Here, the naive scheme with k = 1 would
give a false positive probability of 1/8 = 0.125. So this is a significant improvement.

Remark 8.4.3. It is important to remember that Bloom filters are competing with direct hashing of the whole
elements. Even if one allocates 8 bits per item, as in the example above, the space used is significantly
smaller than regular hashing. A situation where such a Bloom filter makes sense is a cache – we might want
to decide if an element is in a slow external cache (say, an SSD drive). Retrieving an item from the cache is slow, but
not so slow that we are unwilling to pay a small overhead caused by false positives.

8.5. Bibliographical notes


Practical issues. Hashing is typically used for integers, vectors, strings, etc.

• Universal hashing is defined for integers. To implement it for other objects, one needs to map the objects in
some fashion to integers.

• Practical methods for various important cases, such as vectors and strings, are studied extensively. See
http://en.wikipedia.org/wiki/Universal_hashing for some pointers.

• A recent important paper bridging theory and practice of hashing is "The power of simple tabulation hashing"
by Mikkel Thorup and Mihai Patrascu, 2011. See http://en.wikipedia.org/wiki/Tabulation_hashing.

References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.

Chapter 9

Closest Pair
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024

The events of September 8 prompted Foch to draft the later legendary signal: "My centre is giving way, my right is in retreat, situation excellent. I attack." It was probably never sent.

John Keegan, The first world war


9.1. How many times can a minimum change?
Let a1, . . . , an be a set of n numbers, and let us randomly permute them into the sequence b1, . . . , bn. Next,
let ci = min_{k=1}^{i} bk, and let X be the random variable which is the number of distinct values that appear in the
sequence c1, . . . , cn. What is the expectation of X?

Lemma 9.1.1. In expectation, the number of times the minimum of a prefix of n randomly permuted numbers
changes is O(log n). That is, E[X] = O(log n).

Proof: Consider the indicator variable Xi, such that Xi = 1 if ci ≠ ci−1. The probability for that is ≤ 1/i, since
this is the probability that the smallest number of b1, . . . , bi is bi. (Why is this probability not simply equal to
1/i?) As such, we have X = Σ_i Xi, and

E[X] = Σ_{i=1}^{n} E[Xi] ≤ Σ_{i=1}^{n} 1/i = O(log n). ■
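A quick simulation matches the lemma: averaging the number of prefix-minimum changes over random permutations gives a value close to H_n = Σ 1/i ≈ ln n + 0.577. The snippet below is an illustration, with n and the number of trials chosen arbitrarily.

```python
import random
from math import log

def count_min_changes(values):
    """Number of times the prefix minimum changes while scanning values."""
    changes, cur = 0, float("inf")
    for v in values:
        if v < cur:
            cur, changes = v, changes + 1
    return changes

random.seed(0)
n, trials = 10_000, 200
perm = list(range(n))
total = 0
for _ in range(trials):
    random.shuffle(perm)
    total += count_min_changes(perm)
avg = total / trials     # should be close to H_n ≈ ln n + 0.577 ≈ 9.79
```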

9.2. Closest Pair


Assumption 9.2.1. Throughout the discourse, we are going to assume that every hashing operation takes
(worst case) constant time. This is quite a reasonable assumption when true randomness is available (using
for example perfect hashing [CLRS01]). We will revisit this issue later in the course.

For a real positive number r and a point p = (x, y) in R², define

Gr(p) := (r⌊x/r⌋, r⌊y/r⌋) ∈ R².

The number r is the width of the grid Gr. Observe that Gr partitions the plane into square regions, which are
grid cells. Formally, for any i, j ∈ Z, the intersection of the half-planes x ≥ ri, x < r(i + 1), y ≥ rj and
y < r(j + 1) is a grid cell. Further, a grid cluster is a block of 3 × 3 contiguous grid cells.
For a point set P, and a parameter r, the partition of P into subsets by the grid Gr , is denoted by Gr (P). More
formally, two points p, u ∈ P belong to the same set in the partition Gr (P), if both points are being mapped to
the same grid point or equivalently belong to the same grid cell.

Note, that every grid cell C of Gr , has a unique ID; indeed, let p = (x, y) be any point in C, and consider the
pair of integer numbers idC = id(p) = (⌊x/r⌋ , ⌊y/r⌋). Clearly, only points inside C are going to be mapped to
idC . This is useful, as one can store a set P of points inside a grid efficiently. Indeed, given a point p, compute
its id(p). We associate with each unique id a data-structure that stores all the points falling into this grid cell
(of course, we do not maintain such data-structures for grid cells which are empty). For our purposes here, the
grid-cell data-structure can simply be a linked list of points. So, once we computed id(p), we fetch the data
structure for this cell, by using hashing. Namely, we store pointers to all those data-structures in a hash table,
where each such data-structure is indexed by its unique id. Since the ids are integer numbers, we can do the
hashing in constant time.

We are interested in solving the following problem.


Problem 9.2.2. Given a set P of n points in the plane, find the pair of points closest to each other. Formally,
return the pair of points realizing CP(P) = min_{p≠u∈P} ∥p − u∥.

We need the following easy packing lemma.


Lemma 9.2.3. Let P be a set of points contained inside a square □, such that the
sidelength of □ is α = CP(P). Then |P| ≤ 4.

Proof: Partition □ into four equal squares □1, . . . , □4, and observe that each of these
squares has diameter √2 · α/2 < α, and as such each can contain at most one point of P;
that is, the disk of radius α centered at a point p ∈ P completely covers the subsquare
containing it; see the figure on the right.
Note that the set P can have four points if it is the four corners of □. ■

Lemma 9.2.4. Given a set P of n points in the plane, and a distance r, one can verify in linear time, whether
or not CP(P) < r or CP(P) ≥ r.

Proof: Indeed, store the points of P in the grid Gr . For every non-empty grid cell, we maintain a linked list
of the points inside it. Thus, adding a new point p takes constant time. Indeed, compute id(p), check if id(p)
already appears in the hash table, if not, create a new linked list for the cell with this ID number, and store p in
it. If a data-structure already exist for id(p), just add p to it.
This takes O(n) time overall. Now, if any grid cell in Gr(P) contains more than four points of P, then, by
Lemma 9.2.3, it must be that CP(P) < r.
Thus, when inserting a point p, the algorithm fetches all the points of P that were already inserted into the cell
of p, and the 8 adjacent cells. All those cells must contain at most 4 points of P each (otherwise, we would already
have stopped, since the CP(·) of the inserted points is smaller than r). Let S be the set of all those points, and
observe that |S| ≤ 4 · 9 = O(1). Thus, we can compute by brute force the closest point to p in S, where
d(p, S) = min_{s∈S} ∥p − s∥. This takes O(1) time. If d(p, S) < r, we stop and return this distance (together with
the two points realizing d(p, S) as a proof that the distance is too short). Otherwise, we continue to the next point.
Overall, this takes O(n) time. As for correctness, first observe that if CP(P) > r then the algorithm would
never make a mistake, since it returns "CP(P) < r" only after finding a pair of points of P with distance smaller
than r. Thus, assume that p, q is the pair of points of P realizing the closest pair, and ∥p − q∥ = CP(P) < r.
Clearly, when the latter of them, say p, is being inserted, the set S would contain q, and as such the algorithm
would stop and return "CP(P) < r". ■

Lemma 9.2.4 hints at a natural way to compute CP(P). Indeed, permute the points of P in an arbitrary
fashion, and let P = ⟨p1, . . . , pn⟩. Next, let ri = CP({p1, . . . , pi}). We can check if ri+1 < ri, by just calling the
algorithm for Lemma 9.2.4 on Pi+1 and ri . If ri+1 < ri , the algorithm of Lemma 9.2.4, would give us back the
distance ri+1 (with the other point realizing this distance).
So, consider the “good” case where ri+1 = ri = ri−1 . Namely, the length of the shortest pair does not change.
In this case we do not need to rebuild the data structure of Lemma 9.2.4 for each point. We can just reuse
it from the previous iteration. Thus, inserting a single point takes constant time as long as the closest pair
(distance) does not change.
Things become bad, when ri < ri−1 . Because then we need to rebuild the grid, and reinsert all the points of
Pi = ⟨p1 , . . . , pi ⟩ into the new grid Gri (Pi ). This takes O(i) time.
So, if the closest pair radius, in the sequence r1 , . . . , rn , changes only k times, then the running time of the
algorithm would be O(nk). But we can do even better!
Theorem 9.2.5. Let P be a set of n points in the plane. One can compute the closest pair of points of P in
expected linear time.
Proof: Pick a random permutation of the points of P, and let ⟨p1 , . . . , pn ⟩ be this permutation. Let r2 =
∥p1 − p2 ∥, and start inserting the points into the data structure of Lemma 9.2.4. In the ith iteration, if ri = ri−1 ,
then this insertion takes constant time. If ri < ri−1 , then we rebuild the grid and reinsert the points. Namely, we
recompute Gri (Pi ).
To analyze the running time of this algorithm, let Xi be the indicator variable which is 1 if ri ≠ ri−1, and 0
otherwise. Clearly, the running time is proportional to

R = 1 + Σ_{i=2}^{n} (1 + Xi · i).

Thus, the expected running time is

E[R] = 1 + E[Σ_{i=2}^{n} (1 + Xi · i)] = n + Σ_{i=2}^{n} E[Xi] · i = n + Σ_{i=2}^{n} i · P[Xi = 1],

by linearity of expectation and since for an indicator variable Xi , we have that E[Xi ] = P[Xi = 1].
Thus, we need to bound P[Xi = 1] = P[ri < ri−1 ]. To bound this quantity, fix the points of Pi , and randomly
permute them. A point u ∈ Pi is critical if CP(Pi \ {u}) > CP(Pi ).
(A) If there are no critical points, then ri−1 = ri and then P[Xi = 1] = 0.
(B) If there is one critical point, then P[Xi = 1] = 1/i, as this is the probability that this critical point would
be the last point in a random permutation of Pi.
(C) If there are two critical points, then let p, u be the unique pair of points of Pi realizing CP(Pi). The
quantity ri is smaller than ri−1, if either p or u is pi. But the probability for that is 2/i (i.e., the probability
in a random permutation of i objects, that one of two marked objects would be the last element in the
permutation).
Observe that there can not be more than two critical points. Indeed, if p and u are two points that realize the
closest distance, then for any third point v we have CP(Pi \ {v}) = ∥p − u∥, and so v is not critical.
We conclude that

E[R] = n + Σ_{i=2}^{n} i · P[Xi = 1] ≤ n + Σ_{i=2}^{n} i · (2/i) ≤ 3n.
As such, the expected running time of this algorithm is O(E[R]) = O(n). ■
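The whole algorithm of Theorem 9.2.5 fits in a few lines. The sketch below returns only the closest distance (not the pair of points), assumes the input points are distinct, and uses a Python dict in place of the constant-time hash table of Assumption 9.2.1; it rebuilds the grid whenever the closest distance shrinks, exactly as described above.

```python
import random
from math import dist, floor, inf

def grid_id(p, r):
    return (floor(p[0] / r), floor(p[1] / r))

def closest_pair(points):
    """Closest-pair distance in expected linear time (Theorem 9.2.5).
    Assumes distinct points; a dict plays the hash table of Assumption 9.2.1."""
    pts = points[:]
    random.shuffle(pts)              # the random permutation of the analysis
    r = dist(pts[0], pts[1])
    grid = {}

    def insert(p):
        grid.setdefault(grid_id(p, r), []).append(p)

    insert(pts[0])
    insert(pts[1])
    for i in range(2, len(pts)):
        p = pts[i]
        gx, gy = grid_id(p, r)
        best = inf                   # any point within r of p lies in the cluster
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in grid.get((gx + dx, gy + dy), ()):
                    best = min(best, dist(p, q))
        if best < r:                 # the closest pair shrank: rebuild the grid
            r = best
            grid.clear()
            for q in pts[: i + 1]:
                insert(q)
        else:
            insert(p)
    return r
```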
Theorem 9.2.5 is a surprising result, since it implies that uniqueness (i.e., deciding if n real numbers are all
distinct) can be solved in linear time. However, there is a lower bound of Ω(n log n) on uniqueness in the
comparison tree model. This reality dysfunction can be easily explained once one realizes that the model of
computation of Theorem 9.2.5 is considerably stronger, using hashing, randomization, and the floor function.

9.3. Bibliographical notes
The closest-pair algorithm follows Golin et al. [GRSS95]. This is in turn a simplification of a result of the
celebrated result of Rabin [Rab76]. Smid provides a survey of such algorithms [Smi00]. A generalization of
the closest pair algorithm was provided by Har-Peled and Raichel [HR15].
Surprisingly, Schönhage [Sch79] showed that if the floor function is allowed, and the standard
arithmetic operations can be done in constant time, then every problem in PSPACE can be solved in polynomial
time. Since PSPACE includes NPC, this is bad news, as it implies that one can solve NPC problems in polynomial
time (finally!). The basic idea is that one can pack a huge number of bits into a single number, and the
floor function enables one to read a single bit of this number. As such, a real RAM model that allows certain
operations, puts no limit on the bit complexity of numbers, and assumes that each operation takes constant
time, is not a reasonable model of computation (but we already knew that).

References
[CLRS01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press
/ McGraw-Hill, 2001.
[GRSS95] M. Golin, R. Raman, C. Schwarz, and M. Smid. Simple randomized algorithms for closest pair
problems. Nordic J. Comput., 2: 3–27, 1995.
[HR15] S. Har-Peled and B. Raichel. Net and prune: A linear time algorithm for Euclidean distance
problems. J. Assoc. Comput. Mach., 62(6): 44:1–44:35, 2015.
[Rab76] M. O. Rabin. Probabilistic algorithms. Algorithms and Complexity: New Directions and Recent
Results. Ed. by J. F. Traub. Orlando, FL, USA: Academic Press, 1976, pp. 21–39.
[Sch79] A. Schönhage. On the power of random access machines. Proc. 6th Int. Colloq. Automata Lang.
Prog. (ICALP), vol. 71. 520–529, 1979.
[Smi00] M. Smid. Closest-point problems in computational geometry. Handbook of Computational Ge-
ometry. Ed. by J.-R. Sack and J. Urrutia. Amsterdam, The Netherlands: Elsevier, 2000, pp. 877–
935.

Chapter 10

Coupon’s Collector Problems II


There is not much talking now. A silence falls upon them all. This is no time to talk of hedges and fields, or the beauties of any
country. Sadness and fear and hate, how they well up in the heart and mind, whenever one opens the pages of these messengers
of doom. Cry for the broken tribe, for the law and custom that is gone. Aye, and cry aloud for the man who is dead, for the
woman and children bereaved. Cry, the beloved country, these things are not yet at an end. The sun pours down on the earth,
on the lovely land that man cannot enjoy. He knows only the fear of his heart.

Alan Paton, Cry, the beloved country

10.1. The Coupon Collector’s Problem Revisited


10.1.1. Some technical lemmas
Unfortunately, in Randomized Algorithms, many of the calculations are awful¬ . As such, one has to be dexter-
ous in approximating such calculations. We present quickly a few of these estimates.

Lemma 10.1.1. For x ≥ 0, we have 1 − x ≤ exp(−x) and 1 + x ≤ e x . Namely, for all x, we have 1 + x ≤ e x .

Proof: For x = 0 we have equality. Next, computing the derivative on both sides, we have that we need to
prove that −1 ≤ − exp(−x) ⇐⇒ 1 ≥ exp(−x) ⇐⇒ e x ≥ 1, which clearly holds for x ≥ 0.
A similar argument works for the second inequality. ■
Lemma 10.1.2. For any y ≥ 1, and |x| ≤ 1, we have (1 − x²)^y ≥ 1 − yx².

Proof: Observe that the inequality holds with equality for x = 0. So compute the derivative with respect to x of both sides
of the inequality. We need to prove that

y(−2x)(1 − x²)^{y−1} ≥ −2yx ⟺ (1 − x²)^{y−1} ≤ 1,

which holds since 1 − x² ≤ 1, and y − 1 ≥ 0. ■


 
Lemma 10.1.3. For any y ≥ 1, and |x| ≤ 1, we have (1 − x²y) e^{xy} ≤ (1 + x)^y ≤ e^{xy}.
¬
"In space travel," repeated Slartibartfast, "all the numbers are awful." – Life, the Universe, and Everything Else, Douglas Adams.

Proof: The right side of the inequality is standard by now. As for the left side, observe that

(1 − x²)e^x ≤ 1 + x,

since dividing both sides by (1 + x)e^x, we get 1 − x ≤ e^{−x}, which we know holds for any x. By Lemma 10.1.2,
we have

(1 − x²y) e^{xy} ≤ (1 − x²)^y e^{xy} = ((1 − x²) e^x)^y ≤ (1 + x)^y ≤ e^{xy}. ■

10.1.2. Back to the coupon collector’s problem


There are n types of coupons, and at each trial one coupon is picked at random. How many trials must one
perform before picking all coupons? Let m be the number of trials performed. We would like to bound the
probability that m exceeds a certain number, and we still did not pick all coupons.
In the previous lecture, we showed that
" #
P[# of trials ≥ n log n + n + t · n · π/√6] ≤ 1/t²,

for any t.
A stronger bound follows from the following observation. Let Zi^r denote the event that the ith coupon was
not picked in the first r trials. Clearly,

P[Zi^r] = (1 − 1/n)^r ≤ exp(−r/n).

Thus, for r = βn log n, we have P[Zi^r] ≤ exp(−(βn log n)/n) = n^{−β}. Thus,

P[X > βn log n] ≤ P[∪_i Zi^{βn log n}] ≤ n · P[Z1^{βn log n}] ≤ n^{−β+1}.
Lemma 10.1.4. Let the random variable X denote the number of trials for collecting each of the n types of
coupons. Then, we have P[X > n ln n + cn] ≤ e^{−c}.

Proof: Set m = n ln n + cn. The probability we fail to pick the first type of coupon within m trials is α = (1 − 1/n)^m ≤ exp(−(n ln n + cn)/n) = exp(−c)/n.
As such, using the union bound, the probability we fail to pick all n types of coupons is bounded by nα =
exp(−c), as claimed. ■
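A simulation agrees with Lemma 10.1.4. The parameters below (n = 200, c = 2, 2000 runs) are arbitrary choices made here; the empirical frequency of exceeding n ln n + cn should stay below e^{−c} ≈ 0.135, up to simulation noise.

```python
import random
from math import exp, log

def trials_to_collect(n, rng):
    """Number of uniform random draws until all n coupon types are seen."""
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        t += 1
    return t

rng = random.Random(0)
n, c, runs = 200, 2.0, 2000
threshold = n * log(n) + c * n
fail = sum(trials_to_collect(n, rng) > threshold for _ in range(runs)) / runs
# Lemma 10.1.4 predicts fail <= e^{-c} ≈ 0.135 (up to simulation noise).
```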

In the following, we show a slightly stronger bound on the probability, which is 1 − exp(−e^{−c}). To see that
it is indeed stronger, observe that e^{−c} ≥ 1 − exp(−e^{−c}).

10.1.3. An asymptotically tight bound


Lemma 10.1.5. Let c > 0 be a constant, and let m = n ln n + cn for a positive integer n. Then for any constant k, we
have

lim_{n→∞} (n choose k) (1 − k/n)^m = exp(−ck)/k!.
Proof: By Lemma 10.1.3, we have

(1 − (k²/n²)m) exp(−km/n) ≤ (1 − k/n)^m ≤ exp(−km/n).

Observe also that lim_{n→∞} (1 − (k²/n²)m) = 1, and exp(−km/n) = n^{−k} exp(−ck). Also,

lim_{n→∞} (n choose k) · (k!/n^k) = lim_{n→∞} n(n − 1) · · · (n − k + 1)/n^k = 1.

Thus,

lim_{n→∞} (n choose k) (1 − k/n)^m = lim_{n→∞} (n^k/k!) exp(−km/n) = lim_{n→∞} (n^k/k!) · n^{−k} exp(−ck) = exp(−ck)/k!. ■
Theorem 10.1.6. Let the random variable X denote the number of trials for collecting each of the n types of
coupons. Then, for any constant c ∈ R, and m = n ln n + cn, we have lim_{n→∞} P[X > m] = 1 − exp(−e^{−c}).

Before delving into the proof, observe that 1 − exp(−e^{−c}) ≈ 1 − (1 − e^{−c}) = e^{−c}. Namely, in the limit, the
upper bound of Lemma 10.1.4 is tight.
Proof: We have P[X > m] = P[∪_i Zi^m]. By inclusion-exclusion, we have

P[∪_i Zi^m] = Σ_{j=1}^{n} (−1)^{j+1} Pj^n,

where Pj^n = Σ_{1≤i1<i2<...<ij≤n} P[Z_{i1}^m ∩ · · · ∩ Z_{ij}^m]. Let Sk^n = Σ_{j=1}^{k} (−1)^{j+1} Pj^n. We know that S_{2k}^n ≤ P[∪_i Zi^m] ≤ S_{2k+1}^n.
By symmetry,

Pk^n = (n choose k) · P[Z1^m ∩ · · · ∩ Zk^m] = (n choose k) (1 − k/n)^m.

Thus, Pk = lim_{n→∞} Pk^n = exp(−ck)/k!, by Lemma 10.1.5. Thus, we have

Sk = Σ_{j=1}^{k} (−1)^{j+1} Pj = Σ_{j=1}^{k} (−1)^{j+1} · exp(−cj)/j!.

Observe that lim_{k→∞} Sk = 1 − exp(−e^{−c}), by the Taylor expansion of exp(x) (for x = −e^{−c}). Indeed,

exp(x) = Σ_{j=0}^{∞} x^j/j!,  and thus  exp(−e^{−c}) = Σ_{j=0}^{∞} (−e^{−c})^j/j! = 1 + Σ_{j=1}^{∞} (−1)^j exp(−cj)/j!.

Clearly, lim_{n→∞} Sk^n = Sk and lim_{k→∞} Sk = 1 − exp(−e^{−c}). Thus, (using fluffy math), we have

lim_{n→∞} P[X > m] = lim_{n→∞} P[∪_{i=1}^{n} Zi^m] = lim_{n→∞} lim_{k→∞} Sk^n = lim_{k→∞} Sk = 1 − exp(−e^{−c}). ■

10.2. Bibliographical notes


Our presentation follows, as usual, Motwani and Raghavan [MR95].

References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.

Chapter 11

Conditional Expectation and Concentration


“You see, dogs aren’t enough any more. People feel so damned lonely, they need company, they need something bigger,
stronger, to lean on, something that can really stand up to it all. Dogs aren’t enough, what we need is elephants...”

The roots of heaven, Romain Gary

11.1. Conditional expectation


Definition 11.1.1. For two random variables X and Y, let E[X | Y] denote the expected value of X, if the value
of Y is specified. Formally, we have

E[X | Y = y] = Σ_{x∈Ω} x · P[X = x | Y = y].
 
The expression E[X | Y], which is a shorthand for E[X | Y = y], is the conditional expectation of X given Y.
As such, one can think of the conditional expectation as a function f(y) = E[X | Y = y], mapping a value y of Y
to the average value of X given that Y = y.
   
Lemma 11.1.2. For any two random variables X and Y, we have E[E[X | Y]] = E[X].

Proof: We have

E[E[X | Y]] = E_y[E[X | Y = y]] = Σ_y P[Y = y] · E[X | Y = y]
= Σ_y P[Y = y] · (Σ_x x · P[X = x ∩ Y = y]) / P[Y = y]
= Σ_y Σ_x x · P[X = x ∩ Y = y] = Σ_x x · Σ_y P[X = x ∩ Y = y]
= Σ_x x · P[X = x] = E[X]. ■
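A small exact instance of Lemma 11.1.2, with X the sum of two fair dice and Y the first die (a toy choice made here): computing E[E[X | Y]] with rational arithmetic recovers E[X] = 7.

```python
from fractions import Fraction
from itertools import product

# X = sum of two fair dice, Y = the first die; verify E[E[X | Y]] = E[X].
omega = list(product(range(1, 7), repeat=2))

def cond_exp_X_given_Y(y):
    vals = [a + b for (a, b) in omega if a == y]
    return Fraction(sum(vals), len(vals))   # = y + 7/2

E_X = Fraction(sum(a + b for (a, b) in omega), len(omega))
E_of_cond = sum(Fraction(1, 6) * cond_exp_X_given_Y(y) for y in range(1, 7))
```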
   
Lemma 11.1.3. For any two random variables X and Y, we have E[Y · E[X | Y]] = E[XY].

Proof: We have

E[Y · E[X | Y]] = Σ_y P[Y = y] · y · E[X | Y = y]
= Σ_y P[Y = y] · y · (Σ_x x · P[X = x ∩ Y = y]) / P[Y = y]
= Σ_y Σ_x xy · P[X = x ∩ Y = y] = E[XY]. ■

11.1.1. Concentration from conditional expectation
Lemma 11.1.4. Let X1, . . . , Xn be independent random variables, that with equal probability are 0 or 1. We
have that P[Σ_i Xi < n/4] < 0.9^n and P[Σ_i Xi > (3/4)n] < 0.9^n.

Proof: Let Y0 = 1. If Xi = 1, then we set Yi = Yi−1, and if Xi = 0, then we set Yi = Yi−1/2. We thus have that

E[Yi | Yi−1] = (1/2) · Yi−1 + (1/2) · (Yi−1/2) = (3/4) · Yi−1.

As such, by Lemma 11.1.2 we have

E[Yi] = E[E[Yi | Yi−1]] = E[(3/4) · Yi−1] = (3/4) · E[Yi−1] = (3/4)^i.

In particular, E[Yn] = (3/4)^n. Now, if Σ_i Xi > (3/4)n, then fewer than n/4 of the Xi are zero, and as such

Yn ≥ (1/2)^{n/4}.

We are now ready for our conclusions. By Markov's inequality,

P[Σ_i Xi > (3/4)n] ≤ P[Yn ≥ (1/2)^{n/4}] ≤ E[Yn]/(1/2)^{n/4} = (3/4)^n · 2^{n/4} = (2^{1/4} · 3/4)^n ≤ 0.9^n.

By symmetry, we have P[Σ_i Xi < (1/4)n] = P[Σ_i Xi > (3/4)n] < 0.9^n. ■
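The 0.9^n bound is quite loose, as comparing it with the exact binomial tail shows; the snippet below checks P[Σ_i Xi > 3n/4] against 0.9^n for a few (arbitrarily chosen) values of n where 3n/4 is an integer.

```python
from math import comb

def tail_above(n):
    """Exact P[sum_i X_i > 3n/4] for n independent fair 0/1 bits."""
    return sum(comb(n, k) for k in range(3 * n // 4 + 1, n + 1)) / 2 ** n
```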

Chapter 12

Quick Sort with High Probability



12.1. QuickSort runs in O(n log n) time with high probability


Consider a set T of the n items to be sorted, and consider a specific element t ∈ T. Let Xi be the size of the
input in the ith level of recursion that contains t. We know that X0 = n, and

E[Xi | Xi−1] ≤ (1/2) · (3/4) · Xi−1 + (1/2) · Xi−1 ≤ (7/8) · Xi−1.

Indeed, with probability 1/2 the pivot is in the middle of the subproblem; that is, its rank is between Xi−1/4 and
(3/4)Xi−1 (and then the subproblem containing t has size ≤ (3/4)Xi−1), and with probability 1/2 the subproblem might
not have shrunk significantly (i.e., we pretend it did not shrink at all).
Now, observe that for any two random variables we have that E[X] = E_y[E[X | Y = y]], see Lemma 11.1.2.
As such, we have that

E[Xi] = E[E[Xi | Xi−1]] ≤ (7/8) · E[Xi−1] ≤ (7/8)^i · E[X0] = (7/8)^i · n.

In particular, consider M = 8 log_{8/7} n. We have that

μ = E[X_M] ≤ (7/8)^M · n = n^{−8} · n = 1/n^7.
Of course, t participates in more than M recursive calls, if and only if X_M ≥ 1. However, by Markov's
inequality (Theorem 2.4.1), we have that

P[element t participates in more than M recursive calls] ≤ P[X_M ≥ 1] ≤ E[X_M]/1 ≤ 1/n^7,

as desired.
as desired. That is, we proved that the probability that any element of the input T participates in more than M
recursive calls is at most n(1/n7 ) ≤ 1/n6 .
Theorem 12.1.1. For n elements, QuickSort runs in O(n log n) time, with high probability.
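The bound M = 8 log_{8/7} n is very generous in practice, as a quick simulation shows. The sketch below tracks only the subproblem sizes (an abstraction of QuickSort, not a full sort), with the input size and number of trials chosen arbitrarily.

```python
import random
from math import log

def qs_depth(n, depth=0):
    """Maximum recursion depth of randomized QuickSort on an input of
    size n; only subproblem sizes matter for the depth, so we track sizes."""
    if n <= 1:
        return depth
    pivot_rank = random.randrange(n)   # rank of the random pivot
    return max(qs_depth(pivot_rank, depth + 1),
               qs_depth(n - pivot_rank - 1, depth + 1))

random.seed(0)
n = 10_000
M = 8 * log(n) / log(8 / 7)           # the bound from the proof, ≈ 552
worst = max(qs_depth(n) for _ in range(20))
```

The observed maximum depth typically comes out around 4 ln n, far below M, consistent with the theorem's high-probability guarantee.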

12.2. Treaps
Anybody that ever implemented a balanced binary tree, knows that it can be very painful. A natural question,
is whether we can use randomization to get a simpler data-structure with good performance.

(Figure: a treap – the root xk with priority p(xk), and subtrees TL and TR.)
TL TR

12.2.1. Construction
The key observation is that many data-structures that offer good performance for balanced binary search
trees do so by storing additional information to help decide how to balance the tree. As such, the key idea is that
for every element x inserted into the data-structure, we randomly choose a priority p(x); that is, p(x) is chosen
uniformly at random in the range [0, 1].
So, for the set of elements X = {x1 , . . . , xn }, with (random) priorities p(x1 ), . . . , p(xn ), our purpose is to
build a binary tree which is “balanced”. So, let us pick the element xk with the lowest priority in X, and make
it the root of the tree. Now, we partition X in the natural way:

(A) L: set of all the numbers smaller than xk in X, and


(B) R: set of all the numbers larger than xk in X.

We can now build recursively the trees for L and R, and denote them by TL and TR. We build the natural
tree by creating a node for xk, having TL as its left child, and TR as its right child.
We call the resulting tree a treap. As it is a tree over the elements, and a heap over the priorities; that is,
treap = tree + heap.

Lemma 12.2.1. Given n elements, the expected depth of a treap T defined over those elements is O(log(n)).
Furthermore, this holds with high probability; namely, the probability that the depth of the treap would exceed
c log n is smaller than δ = n−d , where d is an arbitrary constant, and c is a constant that depends on d.¬
Furthermore, the probability that T has depth larger than ct log(n), for any t ≥ 1, is smaller than n−dt .

Proof: Observe, that every element has equal probability to be in the root of the treap. Thus, the structure
of a treap, is identical to the recursive tree of QuickSort. Indeed, imagine that instead of picking the pivot
uniformly at random, we instead pick the pivot to be the element with the lowest (random) priority. Clearly,
these two ways of choosing pivots are equivalent. As such, the claim follows immediately from our analysis of
the depth of the recursion tree of QuickSort, see Theorem 12.1.1. ■

12.2.2. Operations
The following innocent observation is going to be the key insight in implementing operations on treaps:

Observation 12.2.2. Given n distinct elements, and their (distinct) priorities, the treap storing them is uniquely
defined.
¬
That is, if we want to decrease the probability of failure, that is δ, we need to increase c.

Figure 12.1: RotateRight in action – the node D with priority 0.3 is rotated right, so that its child x with priority 0.2 becomes the root of the subtree (the subtrees A, C, E, with priorities 0.6, 0.5, 0.4, are rehung accordingly). Importantly, after the rotation the priorities are ordered correctly (at least locally for this subtree).

12.2.2.1. Insertion

Given an element x to be inserted into an existing treap T, insert it in the usual way into T (i.e., treat it as a regular
binary search tree). This takes O(height(T)) time. Now, x is a leaf in the treap. Set x's priority p(x) to some random
number in [0, 1]. Now, while the new tree is a valid search tree, it is not necessarily still a valid treap, as x's
priority might be smaller than its parent's. So, we need to fix the tree around x, so that the priority property
holds.
RotateUp(x)
    y ← parent(x)
    while y exists and p(y) > p(x) do
        if y.left_child = x then
            RotateRight(y)
        else
            RotateLeft(y)
        y ← parent(x)
We call RotateUp(x) to do so. Specifically, if x's parent is y, and p(x) < p(y), we rotate x up so that it
becomes the parent of y. We repeatedly do this until x has a larger priority than its parent. The rotation operation
takes constant time and plays around with priorities, and importantly, it preserves the binary search tree order.
A rotate right operation RotateRight(D) is depicted in Figure 12.1. RotateLeft is the same tree rewriting
operation done in the other direction.
Observe that as x is being rotated upwards, the priority properties are being fixed – in particular, as demon-
strated in Figure 12.1, nodes are being hanged on nodes that were previously their ancestors, so priorities are
still monotonically decreasing along a path.
At the end of this process, both the ordering property and the priority property hold. That is, we have a
valid treap that includes all the old elements, and the new element. By Observation 12.2.2, since the treap is
uniquely defined, we have updated the treap correctly. Since every rotation decreases the distance of x from
the root by one, it follows that insertion takes O(height(T)) time.

12.2.2.2. Deletion

Deletion is just an insertion done in reverse. Specifically, to delete an element x from a treap T, set its priority
to +∞, and rotate it down until it becomes a leaf. The only tricky observation is that one should always rotate so
that the child with the lower priority becomes the new parent. Once x becomes a leaf, deleting it is trivial – just
set the pointer pointing to it in the tree to null.

12.2.2.3. Split
Given an element x stored in a treap T, we would like to split T into two treaps – one treap T≤ for all the
elements smaller than or equal to x, and the other treap T> for all the elements larger than x. To this end, we set
x's priority to −∞, and fix the priorities by rotating x up so it becomes the root of the treap. The right child of x
is the treap T>, and we disconnect it from T by setting x's right child pointer to null. Next, we restore x to its
real priority, and rotate it down to its natural location. The resulting treap is T≤. This again takes time that is
proportional to the depth of the treap.

12.2.2.4. Meld
Given two treaps TL and TR such that all the elements in TL are smaller than all the elements in TR, we would
like to merge them into a single treap. Find the largest element x stored in TL (this is just the element stored
on the path going only right from the root of the tree). Set x's priority to −∞, and rotate it up the treap so that it
becomes the root. Now, x being the largest element in TL, it has no right child. Attach TR as the right child of x.
Now, restore x's priority to its original value, and rotate it back down so that the priority properties hold.
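A compact way to implement a treap is via split and merge themselves, rather than the rotation-based routines described above (a different but equivalent mechanism; the class and function names below are choices made here). Priorities are drawn uniformly from [0, 1], with the lowest priority at the root, and distinct keys are assumed, matching Observation 12.2.2.

```python
import random

class Node:
    __slots__ = ("key", "prio", "left", "right")
    def __init__(self, key):
        self.key, self.prio = key, random.random()
        self.left = self.right = None

def split(t, key):
    """Split treap t into (keys <= key, keys > key)."""
    if t is None:
        return None, None
    if t.key <= key:
        t.right, big = split(t.right, key)
        return t, big
    small, t.left = split(t.left, key)
    return small, t

def merge(a, b):
    """Meld treaps a and b; every key in a is smaller than every key in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.prio < b.prio:            # the lowest priority sits at the root
        a.right = merge(a.right, b)
        return a
    b.left = merge(a, b.left)
    return b

def insert(t, key):
    small, big = split(t, key)
    return merge(merge(small, Node(key)), big)

def inorder(t):
    return [] if t is None else inorder(t.left) + [t.key] + inorder(t.right)

def is_min_heap(t):
    """Check the treap's priority (heap) property."""
    if t is None:
        return True
    for ch in (t.left, t.right):
        if ch is not None and ch.prio < t.prio:
            return False
    return is_min_heap(t.left) and is_min_heap(t.right)
```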

12.2.3. Summary
Theorem 12.2.3. Let T be a treap, initialized to an empty treap, and undergoing a sequence of m = n^c insertions,
where c is some constant. The probability that the depth of the treap at any point in time would exceed
d log n is ≤ 1/n^f, where d is an arbitrary constant, and f is a constant that depends only on c and d.
In particular, a treap can handle insertion/deletion in O(log n) time with high probability.
Proof: The first part of the theorem implies that, with high probability, all these treaps have logarithmic
depth, which in turn implies that all operations take logarithmic time, as an operation on a treap takes at most
time proportional to the depth of the treap.
As for the first part, let T1, . . . , Tm be the sequence of treaps, where Ti is the treap after the ith operation.
Similarly, let Xi be the set of elements stored in Ti. By Lemma 12.2.1, the probability that Ti has large depth is
tiny. Specifically, we have that

αi = P[depth(Ti) > tc′ log n] ≤ 1/n^{t·c′},

as a tedious and boring but straightforward calculation shows. Picking t to be sufficiently large, we have that
the probability that the ith treap is too deep is smaller than 1/n^{f+c}. By the union bound, since there are n^c treaps
in this sequence of operations, it follows that the probability of any of these treaps being too deep is at most
1/n^f, as desired. ■

12.3. Extra: Sorting Nuts and Bolts


Problem 12.3.1 (Sorting Nuts and Bolts). You are given a set of n nuts and n bolts. Every nut has a matching bolt, and all n pairs of nuts and bolts have different sizes. Unfortunately, you get the nuts and bolts separated from each other, and you have to match the nuts to the bolts. Furthermore, given a nut and a bolt, all you can do is try to match one bolt against a nut (i.e., you cannot compare two nuts to each other, or two bolts to each other).
When comparing a nut to a bolt, either they match, or one is smaller than the other (and you know the relationship after the comparison).
How can one match the n nuts to the n bolts quickly? Namely, while performing a small number of comparisons.

The naive algorithm is of course to compare each nut to each bolt, and match them together. This would require a quadratic number of comparisons. Another option is to sort the nuts by size, and the bolts by size, and then "merge" the two ordered sets, matching them by size. The only problem is that we cannot sort only the nuts, or only the bolts, since we cannot compare them to each other. Instead, we sort the two sets simultaneously, by simulating QuickSort. The resulting algorithm is as follows.

MatchNuts&Bolts(N: nuts, B: bolts)
    Pick a random nut n_pivot from N
    Find its matching bolt b_pivot in B
    B_L ← all bolts in B smaller than n_pivot
    N_L ← all nuts in N smaller than b_pivot
    B_R ← all bolts in B larger than n_pivot
    N_R ← all nuts in N larger than b_pivot
    MatchNuts&Bolts(N_R, B_R)
    MatchNuts&Bolts(N_L, B_L)
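To make the algorithm concrete, here is a minimal Python sketch (my own, not from the notes). It models nuts and bolts as numbers, where a nut matches the bolt of equal value; the code performs only nut-versus-bolt comparisons, although Python itself does not enforce the restriction.

```python
import random

def match_nuts_and_bolts(nuts, bolts):
    """Return a list of (nut, bolt) pairs, matching each nut to its bolt.

    Only nut-versus-bolt comparisons are performed; two nuts (or two
    bolts) are never compared against each other.
    """
    if not nuts:
        return []
    n_pivot = random.choice(nuts)                      # pick a random nut
    b_pivot = next(b for b in bolts if b == n_pivot)   # find its matching bolt
    B_L = [b for b in bolts if b < n_pivot]            # bolts smaller than the pivot nut
    B_R = [b for b in bolts if b > n_pivot]            # bolts larger than the pivot nut
    N_L = [x for x in nuts if x < b_pivot]             # nuts smaller than the pivot bolt
    N_R = [x for x in nuts if x > b_pivot]             # nuts larger than the pivot bolt
    return (match_nuts_and_bolts(N_L, B_L) + [(n_pivot, b_pivot)]
            + match_nuts_and_bolts(N_R, B_R))
```

Note that, exactly as in randomized QuickSort, the pivot nut partitions the bolts, and the pivot's matching bolt then partitions the nuts, so the two recursions work on consistent subsets.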

12.3.1. Running time analysis


Definition 12.3.2. Let RT denote the random variable which is the running time of the algorithm. Note that the running time is a random variable, as it might be different between different executions on the same input.

Definition 12.3.3. For a randomized algorithm, we can speak about the expected running time. Namely, we
are interested in bounding the quantity E[RT] for the worst input.

Definition 12.3.4. The expected running time of a randomized algorithm for inputs of size n is
\[ T(n) = \max_{U \text{ is an input of size } n} \mathrm{E}\bigl[\mathrm{RT}(U)\bigr], \]
where RT(U) is the running time of the algorithm for the input U.

Definition 12.3.5. The rank of an element x in a set S, denoted by rank(x), is the number of elements in S of size smaller than or equal to x. Namely, it is the location of x in the sorted list of the elements of S.

Theorem 12.3.6. The expected running time of MatchNuts&Bolts (and thus also of QuickSort) is T (n) =
O(n log n), where n is the number of nuts and bolts. The worst case running time of this algorithm is O(n2 ).
Proof: Clearly, we have that P[rank(n_pivot) = k] = 1/n. Furthermore, if the rank of the pivot is k, then the algorithm spends O(n) time on partitioning, and recurses on subproblems of sizes k − 1 and n − k. As such,
\[
T(n) = \mathop{\mathrm{E}}_{k = \mathrm{rank}(n_{\mathrm{pivot}})}\bigl[O(n) + T(k-1) + T(n-k)\bigr]
     = O(n) + \sum_{k=1}^{n} \Pr\bigl[\mathrm{rank}(n_{\mathrm{pivot}}) = k\bigr]\cdot\bigl(T(k-1) + T(n-k)\bigr)
     = O(n) + \sum_{k=1}^{n} \frac{1}{n}\bigl(T(k-1) + T(n-k)\bigr),
\]
by the definition of expectation. It is not hard to verify that the solution to this recurrence is T(n) = O(n log n). ■
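The recurrence can also be checked numerically. The following sketch (assuming, for concreteness, that the O(n) term is exactly n) solves the recurrence exactly and compares it against n ln n; the ratio stays bounded by a small constant, consistent with the claimed O(n log n) solution.

```python
import math

# Solve the recurrence T(n) = n + (2/n) * sum_{j < n} T(j), with T(0) = 0,
# which is the recurrence above after taking the O(n) term to be exactly n
# (and using the symmetry T(k-1) + T(n-k) summed over k = twice the prefix sum).
def solve_recurrence(limit):
    T = [0.0] * (limit + 1)
    prefix = 0.0                     # running sum T(0) + ... + T(n-1)
    for n in range(1, limit + 1):
        T[n] = n + 2.0 * prefix / n
        prefix += T[n]
    return T

T = solve_recurrence(4000)
for n in (100, 1000, 4000):
    print(n, round(T[n] / (n * math.log(n)), 3))   # ratio T(n) / (n ln n)
```

The ratios remain below 2 for these values of n, matching the known 2n ln n + O(n) behavior of the QuickSort recurrence.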

12.4. Bibliographical Notes


Treaps were invented by Seidel and Aragon [SA96]. Experimental evidence suggests that treaps perform reasonably well in practice, despite their simplicity; see for example the comparison carried out by Cho and Sahni [CS00]. Implementations of treaps are readily available. An old implementation I wrote in C is available here: http://valis.cs.uiuc.edu/blog/?p=6060.

References
[CS00] S. Cho and S. Sahni. A new weight balanced binary search tree. Int. J. Found. Comput. Sci.,
11(3): 485–513, 2000.
[SA96] R. Seidel and C. R. Aragon. Randomized search trees. Algorithmica, 16: 464–497, 1996.

Chapter 13

Concentration of Random Variables – Chernoff's Inequality

13.1. Concentration of mass and Chernoff’s inequality

13.1.1. Example: Binomial distribution

Consider the binomial distribution Bin(n, 1/2) for various values of n as depicted in Figure 13.1 – here we
think about the value of the variable as the number of heads in flipping a fair coin n times. Clearly, as the
value of n increases, the probability of getting a number of heads that is significantly smaller or larger than n/2 becomes tiny. Here we are interested in quantifying exactly how far we can deviate from this expected value. Specifically, if X ∼ Bin(n, 1/2), then we would be interested in bounding the probability P[X > n/2 + ∆], where ∆ = tσ_X = t√n/2 (i.e., we are t standard deviations away from the expectation). For t > 2, this probability is roughly 2^{−t²}, which is what we prove here.
More surprisingly, if you look only at the middle of the distribution, it looks the same after clipping away the uninteresting tails, see Figure 13.2; that is, it looks more and more like the normal distribution. This is a universal phenomenon known as the central limit theorem – every sum of nicely behaved random variables behaves like the normal distribution. We unfortunately need a more precise quantification of this behavior, thus the following.

13.1.2. A restricted case of Chernoff inequality via games

13.1.2.1. Chernoff games

The game. Consider the game where a player starts with Y_0 = 1 dollars. At every round, the player can bet a certain amount x (fractions are fine). With probability half she loses her bet, and with probability half she gains an amount equal to her bet. The player is not allowed to go all in – because if she loses, then the game is over. So it is natural to ask what her optimal betting strategy is, such that at the end of the game she has as much money as possible.

[Eight histogram panels: n = 8, 16, 32, 64, 128, 256, 512, 8192.]

Figure 13.1: The binomial distribution for different values of n. It pretty quickly concentrates around its
expectation.

[Eight histogram panels: n = 16, 32, 64, 128, 256, 512, 1024, 8192.]
Figure 13.2: The "middle" of the binomial distribution for different values of n. It very quickly converges to the normal distribution (under appropriate rescaling and translation).

X_i ∈ {−1, +1}, with P[X_i = −1] = P[X_i = 1] = 1/2:
    ∆ ≥ 0:   P[Y ≥ ∆] ≤ exp(−∆²/2n)  and  P[Y ≤ −∆] ≤ exp(−∆²/2n)              Theorem 13.1.7

X_i ∈ {0, 1}, with P[X_i = 0] = P[X_i = 1] = 1/2:
    ∆ ≥ 0:   P[|Y − n/2| ≥ ∆] ≤ 2 exp(−2∆²/n)                                   Corollary 13.1.9

X_i ∈ {0, 1}, with P[X_i = 1] = p_i, P[X_i = 0] = 1 − p_i:
    δ ≥ 0:             P[Y > (1+δ)µ] < (e^δ / (1+δ)^{1+δ})^µ                    Theorem 13.2.1
    δ ∈ (0, 1):        P[Y > (1+δ)µ] < exp(−µδ²/3)                              Lemma 13.2.5
    δ ∈ (0, 4):        P[Y > (1+δ)µ] < exp(−µδ²/4)                              Lemma 13.2.6
    δ ∈ (0, 6):        P[Y > (1+δ)µ] < exp(−µδ²/5)                              Lemma 13.2.7
    δ ≥ 2e − 1:        P[Y > (1+δ)µ] < 2^{−µ(1+δ)}                              Lemma 13.2.8
    δ ≥ e²:            P[Y > (1+δ)µ] < exp(−(µδ/2) ln δ)                        Lemma 13.2.9
    δ ≥ 0, φ ∈ (0,1]:  P[Y > (1+δ)µ + 3 ln(φ^{−1})/δ²] < φ                      Lemma 13.2.10
    δ ≥ 0:             P[Y < (1−δ)µ] < (e^{−δ} / (1−δ)^{1−δ})^µ                 Theorem 13.2.3
                       P[Y < (1−δ)µ] < exp(−µδ²/2)                              Lemma 13.2.4
    ∆ ≥ 0:             P[Y − µ ≥ ∆] ≤ exp(−2∆²/n),  P[Y − µ ≤ −∆] ≤ exp(−2∆²/n) Corollary 13.3.5
    τ ≥ 1:             P[Y < µ/τ] < exp(−(1 − (1 + ln τ)/τ) µ)                  Theorem 13.2.3

X_i ∈ [0, 1], arbitrary independent distributions:
    δ ∈ [0, 1]:        P[Y ≥ (1+δ)µ] ≤ exp(−δ²µ/4),  P[Y ≤ (1−δ)µ] ≤ exp(−δ²µ/2)  Theorem 13.3.6
    ∆ ≥ 0:             P[Y − µ ≥ ∆] ≤ exp(−2∆²/n),  P[Y − µ ≤ −∆] ≤ exp(−2∆²/n) Corollary 13.3.5

X_i ∈ [a_i, b_i], arbitrary independent distributions:
    ∆ ≥ 0:             P[|Y − µ| ≥ ∆] ≤ 2 exp(−2∆² / Σ_{i=1}^n (b_i − a_i)²)    Theorem 13.4.3

Table 13.1: Summary of Chernoff type inequalities covered. Here we have n independent random variables X_1, ..., X_n, Y = Σ_i X_i, and µ = E[Y].

Is the game pointless? So, let Y_{i−1} be the amount of money the player has at the end of the (i − 1)th round, and suppose she bets an amount ψ_i ≤ Y_{i−1} in the ith round. As such, at the end of the ith round, she has
\[ Y_i = \begin{cases} Y_{i-1} - \psi_i & \text{lose: probability half,} \\ Y_{i-1} + \psi_i & \text{win: probability half,} \end{cases} \]
dollars. This game, in expectation, does not change the amount of money the player has. Indeed, we have
\[ \mathrm{E}\bigl[Y_i \bigm| Y_{i-1}\bigr] = \frac{1}{2}(Y_{i-1} - \psi_i) + \frac{1}{2}(Y_{i-1} + \psi_i) = Y_{i-1}. \]
And as such, we have that E[Y_i] = E[E[Y_i | Y_{i−1}]] = E[Y_{i−1}] = ··· = E[Y_0] = 1. In particular, E[Y_n] = 1 – namely, on average, independently of the player's strategy, she is not going to make any money in this game (and she is allowed to change her bets after every round). Unless she is lucky¬ ...

What about a lucky player? The player believes she will get lucky and wants to develop a strategy to take advantage of it. Formally, she believes that she can win, say, at least a (1 + δ)/2 fraction of her bets (instead of the predicted 1/2) – for example, if the bets are in the stock market, she can improve her chances by doing more research on the companies she is investing in­. Unfortunately, the player does not know which rounds she is going to be lucky in – so she still needs to be careful.

In search of a good strategy. Of course, there are many safe strategies the player can use, from not playing at all, to risking only a tiny fraction of her money at each round. In other words, our quest here is to find the best strategy that extracts the maximum benefit for the player out of her inherent luck.
Here, we restrict ourselves to a simple strategy – at every round, the player bets a β fraction of her money, where β is a parameter to be determined. Specifically, at the end of the ith round, the player has
\[ Y_i = \begin{cases} (1-\beta)Y_{i-1} & \text{lose,} \\ (1+\beta)Y_{i-1} & \text{win.} \end{cases} \]
By our assumption, the player is going to win in at least M = (1 + δ)n/2 rounds. Our purpose here is to figure out what the value of β should be so that the player gets as rich as possible®. Now, if the player is successful in ≥ M rounds, out of the n rounds of the game, then the amount of money the player has, at the end of the game, is
\[
Y_n \ge (1-\beta)^{n-M}(1+\beta)^{M} = (1-\beta)^{n/2-(\delta/2)n}(1+\beta)^{n/2+(\delta/2)n} = \bigl((1-\beta)(1+\beta)\bigr)^{n/2-(\delta/2)n}(1+\beta)^{\delta n}
\]
\[
= \bigl(1-\beta^{2}\bigr)^{n/2-(\delta/2)n}(1+\beta)^{\delta n} \ge \exp\bigl(-2\beta^{2}\bigr)^{n/2-(\delta/2)n}\exp(\beta/2)^{\delta n} = \exp\Bigl(\bigl(-\beta^{2}+\beta^{2}\delta+\beta\delta/2\bigr)n\Bigr).
\]
To maximize this quantity, we choose β = δ/4 (there is a better choice, see Lemma 13.1.6, but we use this value for the simplicity of exposition). Thus, we have that
\[ Y_n \ge \exp\Bigl(\Bigl(-\frac{\delta^{2}}{16}+\frac{\delta^{3}}{16}+\frac{\delta^{2}}{8}\Bigr)n\Bigr) \ge \exp\Bigl(\frac{\delta^{2}}{16}\,n\Bigr), \]
proving the following.
Lemma 13.1.1. Consider a Chernoff game with n rounds, starting with one dollar, where the player wins in ≥ (1 + δ)n/2 of the rounds. If the player bets a δ/4 fraction of her current money in every round, then at the end of the game the player has at least exp(nδ²/16) dollars.
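Since the guarantee of Lemma 13.1.1 is a deterministic statement about the final wealth (given the number of wins), it can be checked directly. The following sketch does so for a few values of n and δ.

```python
import math

# Deterministic check of Lemma 13.1.1: if the player wins M >= (1+delta)n/2
# rounds and bets a beta = delta/4 fraction each round, her final wealth
# (1-beta)^(n-M) * (1+beta)^M is at least exp(n * delta^2 / 16).
def final_wealth(n, wins, beta):
    return (1.0 - beta) ** (n - wins) * (1.0 + beta) ** wins

for n in (100, 1000, 10000):
    for delta in (0.1, 0.2, 0.4):
        M = math.ceil((1 + delta) * n / 2)   # number of winning rounds
        Y = final_wealth(n, M, delta / 4)
        bound = math.exp(n * delta * delta / 16)
        assert Y >= bound, (n, delta, Y, bound)
print("lemma bound holds on all test cases")
```

The actual wealth is noticeably larger than the bound, since the inequalities 1 − x ≥ e^{−2x} and 1 + x ≥ e^{x/2} used in the derivation are loose.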

¬
“I would rather have a general who was lucky than one who was good.” – Napoleon Bonaparte.
­
“I am a great believer in luck, and I find the harder I work, the more I have of it.” – Thomas Jefferson.
®
This optimal choice is known as Kelly criterion, see Remark 13.1.3.

Remark 13.1.2. Note that Lemma 13.1.1 holds whenever the player wins ≥ (1 + δ)n/2 rounds. In particular, the statement does not require randomness by itself – for our application, however, it is more natural and interesting to think about the player's wins as being randomly distributed.
Remark 13.1.3. Interestingly, the idea of choosing the best fraction to bet is an old and natural question arising
in investments strategies, and the right fraction to use is known as Kelly criterion, going back to Kelly’s work
from 1956 [Kel56].

13.1.2.2. Chernoff’s inequality


The above implies that if a player is lucky, then she is going to become filthy rich¯ . Intuitively, this should be
a pretty rare event – because if the player is rich, then (on average) many other people have to be poor. We are
thus ready for the kill.
Theorem 13.1.4 (Chernoff's inequality). Let X_1, ..., X_n be n independent random variables, where X_i = 0 or X_i = 1 with equal probability. Then, for any δ ∈ (0, 1/2), we have that
\[ \Pr\Bigl[\sum_i X_i \ge (1+\delta)\frac{n}{2}\Bigr] \le \exp\Bigl(-\frac{\delta^{2}}{16}\,n\Bigr). \]
Proof: Imagine that we are playing the Chernoff game above, with β = δ/4, starting with 1 dollar, and let Y_i be the amount of money at the end of the ith round. Here X_i = 1 indicates that the player won the ith round. We have, by Lemma 13.1.1 and Markov's inequality, that
\[ \Pr\Bigl[\sum_i X_i \ge (1+\delta)\frac{n}{2}\Bigr] \le \Pr\Bigl[Y_n \ge \exp\Bigl(\frac{n\delta^{2}}{16}\Bigr)\Bigr] \le \frac{\mathrm{E}[Y_n]}{\exp\bigl(n\delta^{2}/16\bigr)} = \frac{1}{\exp\bigl(n\delta^{2}/16\bigr)} = \exp\Bigl(-\frac{\delta^{2}}{16}\,n\Bigr). \qquad\blacksquare \]

This is crazy – so intuition maybe? If the player is (1 + δ)/2-lucky, then she can make a lot of money; specifically, at least f(δ) = exp(nδ²/16) dollars by the end of the game. Namely, beating the odds has significant monetary value, and this value grows quickly with δ. Since we are in a "zero-sum" game setting, this event should be very rare indeed. Under this interpretation, of course, the player needs to know in advance the value of δ – so imagine that she guesses it somehow in advance, or she plays the game in parallel with all the possible values of δ, and she settles on the instance that maximizes her profit.

Can one do better? No, not really. Chernoff's inequality is tight (this is a challenging homework exercise) up to the constant in the exponent. The best bound I know for this version of the inequality has 1/2 instead of 1/16 in the exponent. Note, however, that no real effort was taken to optimize the constants – this is not the purpose of this write-up.

13.1.2.3. Some low level boring calculations


Above, we used the following well known facts.
Lemma 13.1.5. (A) Markov's inequality: for any positive random variable X and t > 0, we have P[X ≥ t] ≤ E[X]/t.
(B) For any two random variables X and Y, we have that E[X] = E[E[X | Y]].
(C) For x ∈ (0, 1), 1 + x ≥ e^{x/2}.
(D) For x ∈ (0, 1/2), 1 − x ≥ e^{−2x}.
  
Lemma 13.1.6. The quantity exp((−β² + β²δ + βδ/2)n) is maximal for β = δ/(4(1 − δ)).
Proof: We have to maximize f(β) = −β² + β²δ + βδ/2 by choosing the correct value of β (as a function of δ, naturally). Setting the derivative to zero, f′(β) = −2β + 2βδ + δ/2 = 0 ⟺ 2(δ − 1)β = −δ/2 ⟺ β = δ/(4(1 − δ)). ■
¯ Not that there is anything wrong with that – many of my friends are filthy rich.

13.1.3. A proof for the −1/+1 case
Theorem 13.1.7. Let X_1, ..., X_n be n independent random variables, such that P[X_i = 1] = P[X_i = −1] = 1/2, for i = 1, ..., n. Let Y = Σ_{i=1}^n X_i. Then, for any ∆ > 0, we have
\[ \Pr[Y \ge \Delta] \le \exp\bigl(-\Delta^{2}/2n\bigr). \]

Proof: Clearly, for an arbitrary t, to be specified shortly, we have
\[ \Pr[Y \ge \Delta] = \Pr\bigl[\exp(tY) \ge \exp(t\Delta)\bigr] \le \frac{\mathrm{E}[\exp(tY)]}{\exp(t\Delta)}, \]
where the first step follows from the fact that exp(·) preserves ordering, and the second step follows from Markov's inequality.
Observe that, by the Taylor expansion of exp(·),
\[ \mathrm{E}[\exp(tX_i)] = \frac{1}{2}e^{t} + \frac{1}{2}e^{-t} = \frac{e^{t} + e^{-t}}{2}
 = \frac{1}{2}\Bigl(1 + \frac{t}{1!} + \frac{t^{2}}{2!} + \frac{t^{3}}{3!} + \cdots\Bigr) + \frac{1}{2}\Bigl(1 - \frac{t}{1!} + \frac{t^{2}}{2!} - \frac{t^{3}}{3!} + \cdots\Bigr)
 = 1 + \frac{t^{2}}{2!} + \frac{t^{4}}{4!} + \cdots + \frac{t^{2k}}{(2k)!} + \cdots. \]
Note that (2k)! ≥ 2^k (k!), and thus
\[ \mathrm{E}[\exp(tX_i)] = \sum_{i=0}^{\infty} \frac{t^{2i}}{(2i)!} \le \sum_{i=0}^{\infty} \frac{t^{2i}}{2^{i}(i!)} = \sum_{i=0}^{\infty} \frac{1}{i!}\Bigl(\frac{t^{2}}{2}\Bigr)^{i} = \exp\bigl(t^{2}/2\bigr), \]
again by the Taylor expansion of exp(·). Next, by the independence of the X_i s, we have
\[ \mathrm{E}[\exp(tY)] = \mathrm{E}\Bigl[\exp\Bigl(\sum_i tX_i\Bigr)\Bigr] = \mathrm{E}\Bigl[\prod_i \exp(tX_i)\Bigr] = \prod_{i=1}^{n} \mathrm{E}[\exp(tX_i)] \le \prod_{i=1}^{n} e^{t^{2}/2} = e^{nt^{2}/2}. \]
We thus have
\[ \Pr[Y \ge \Delta] \le \frac{\exp\bigl(nt^{2}/2\bigr)}{\exp(t\Delta)} = \exp\bigl(nt^{2}/2 - t\Delta\bigr). \]
Next, minimizing the above quantity over t, we set t = ∆/n. We conclude that
\[ \Pr[Y \ge \Delta] \le \exp\Bigl(\frac{n}{2}\Bigl(\frac{\Delta}{n}\Bigr)^{2} - \frac{\Delta}{n}\,\Delta\Bigr) = \exp\Bigl(-\frac{\Delta^{2}}{2n}\Bigr). \qquad\blacksquare \]

By the symmetry of Y, we get the following:


Corollary 13.1.8. Let X_1, ..., X_n be n independent random variables, such that P[X_i = 1] = P[X_i = −1] = 1/2, for i = 1, ..., n. Let Y = Σ_{i=1}^n X_i. Then, for any ∆ > 0, we have P[|Y| ≥ ∆] ≤ 2 exp(−∆²/2n).

Corollary 13.1.9. Let X_1, ..., X_n be n independent coin flips, such that P[X_i = 0] = P[X_i = 1] = 1/2, for i = 1, ..., n. Let Y = Σ_{i=1}^n X_i. Then, for any ∆ > 0, we have P[|Y − n/2| ≥ ∆] ≤ 2 exp(−2∆²/n).

Proof: Consider the random variables Z_i = 2X_i − 1 ∈ {−1, +1}. We have that
\[ \Pr\bigl[|Y - n/2| \ge \Delta\bigr] = \Pr\bigl[|2Y - n| \ge 2\Delta\bigr] = \Pr\Bigl[\Bigl|\sum_{i=1}^{n}(2X_i - 1)\Bigr| \ge 2\Delta\Bigr] = \Pr\Bigl[\Bigl|\sum_{i=1}^{n} Z_i\Bigr| \ge 2\Delta\Bigr] \le 2\exp\Bigl(-\frac{(2\Delta)^{2}}{2n}\Bigr) = 2\exp\bigl(-2\Delta^{2}/n\bigr), \]
by Corollary 13.1.8 applied to the independent random variables Z_1, ..., Z_n. ■

Remark 13.1.10. Before going any further, it might be instructive to understand what these inequalities imply. Consider the case where X_i is either zero or one with probability half. In this case µ = E[Y] = n/2. Set ∆ = t√n (observe that σ_Y = √n/2, so ∆ corresponds to 2t standard deviations away from the expectation). We then have
\[ \Pr\Bigl[\Bigl|Y - \frac{n}{2}\Bigr| \ge \Delta\Bigr] \le 2\exp\bigl(-2\Delta^{2}/n\bigr) = 2\exp\bigl(-2(t\sqrt{n})^{2}/n\bigr) = 2\exp\bigl(-2t^{2}\bigr). \]
Thus, Chernoff's inequality implies exponential decay (i.e., ≤ 2^{−t²}) with t standard deviations, instead of just polynomial decay (i.e., ≤ 1/t²) implied by Chebyshev's inequality.
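As a Monte Carlo sanity check (not a proof), one can compare the bound of Theorem 13.1.7 against the empirical tail of a sum of random signs; the bound is quite loose for small t, and both decay rapidly.

```python
import math
import random

# Empirical tail of Y = sum of n random +-1 signs, versus the bound
# exp(-Delta^2 / (2n)) of Theorem 13.1.7.
random.seed(1)
n, trials = 100, 5000
for t in (1.0, 2.0, 3.0):
    Delta = t * math.sqrt(n)
    hits = sum(
        sum(random.choice((-1, 1)) for _ in range(n)) >= Delta
        for _ in range(trials)
    )
    empirical = hits / trials
    bound = math.exp(-Delta * Delta / (2 * n))
    assert empirical <= bound
    print(f"t={t}: empirical tail {empirical:.4f}, Chernoff bound {bound:.4f}")
```

For t = 1 the bound is about 0.61 while the true tail is roughly 0.16, illustrating that the constant in the exponent of this simple proof is not tight.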

13.2. The Chernoff Bound — General Case


Here we present the Chernoff bound in a more general setting.

Theorem 13.2.1. Let X_1, ..., X_n be n independent variables, where P[X_i = 1] = p_i and P[X_i = 0] = q_i = 1 − p_i, for all i. Let X = Σ_{i=1}^n X_i, and µ = E[X] = Σ_i p_i. For any δ > 0, we have
\[ \Pr\bigl[X > (1+\delta)\mu\bigr] < \Bigl(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\Bigr)^{\mu}. \]
Proof: We have P[X > (1 + δ)µ] = P[e^{tX} > e^{t(1+δ)µ}]. By Markov's inequality, we have
\[ \Pr\bigl[X > (1+\delta)\mu\bigr] < \frac{\mathrm{E}\bigl[e^{tX}\bigr]}{e^{t(1+\delta)\mu}}. \]
On the other hand, by independence,
\[ \mathrm{E}\bigl[e^{tX}\bigr] = \mathrm{E}\bigl[e^{t(X_1+X_2+\cdots+X_n)}\bigr] = \mathrm{E}\bigl[e^{tX_1}\bigr]\cdots\mathrm{E}\bigl[e^{tX_n}\bigr]. \]
Namely,
\[ \Pr\bigl[X > (1+\delta)\mu\bigr] < \frac{\prod_{i=1}^{n} \mathrm{E}\bigl[e^{tX_i}\bigr]}{e^{t(1+\delta)\mu}} = \frac{\prod_{i=1}^{n}\bigl((1-p_i)e^{0} + p_i e^{t}\bigr)}{e^{t(1+\delta)\mu}} = \frac{\prod_{i=1}^{n}\bigl(1 + p_i(e^{t}-1)\bigr)}{e^{t(1+\delta)\mu}}. \]
Let y = p_i(e^t − 1). We know that 1 + y < e^y (since y > 0). Thus,
\[ \Pr\bigl[X > (1+\delta)\mu\bigr] < \frac{\prod_{i=1}^{n}\exp\bigl(p_i(e^{t}-1)\bigr)}{e^{t(1+\delta)\mu}} = \frac{\exp\bigl(\sum_{i=1}^{n} p_i(e^{t}-1)\bigr)}{e^{t(1+\delta)\mu}} = \frac{\exp\bigl((e^{t}-1)\mu\bigr)}{e^{t(1+\delta)\mu}} = \Bigl(\frac{\exp(e^{t}-1)}{e^{t(1+\delta)}}\Bigr)^{\mu} = \Bigl(\frac{\exp(\delta)}{(1+\delta)^{1+\delta}}\Bigr)^{\mu}, \]
if we set t = log(1 + δ). ■
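For identical p_i = p, the tail probability can be computed exactly, so the bound of Theorem 13.2.1 can be verified numerically; a sketch:

```python
import math

# Exact check of Theorem 13.2.1 for identical p_i = p: compare the exact
# binomial tail P[X > (1+delta)*mu] with the bound (e^delta/(1+delta)^(1+delta))^mu.
def binom_tail(n, p, threshold):
    """Exact P[Bin(n, p) > threshold]."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if k > threshold)

n, p = 200, 0.3
mu = n * p
for delta in (0.25, 0.5, 1.0):
    exact = binom_tail(n, p, (1 + delta) * mu)
    bound = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
    assert exact <= bound
    print(f"delta={delta}: exact={exact:.3e}  bound={bound:.3e}")
```

The gap between the exact tail and the bound widens as δ grows, which is consistent with the bound being obtained by a single exponential-moment argument.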

13.2.1. The lower tail
We need the following low level lemma.

Lemma 13.2.2. For x ∈ [0, 1), we have (1 − x)^{1−x} ≥ exp(−x + x²/2).

Proof: For x ∈ [0, 1), we have, by the Taylor expansion, that ln(1 − x) = −Σ_{i=1}^∞ x^i/i. As such, we have
\[ (1-x)\ln(1-x) = -(1-x)\sum_{i=1}^{\infty}\frac{x^{i}}{i} = -\sum_{i=1}^{\infty}\frac{x^{i}}{i} + \sum_{i=1}^{\infty}\frac{x^{i+1}}{i} = -x + \sum_{i=2}^{\infty}\Bigl(\frac{x^{i}}{i-1} - \frac{x^{i}}{i}\Bigr) = -x + \sum_{i=2}^{\infty}\frac{x^{i}}{i(i-1)}. \]
This implies that (1 − x) ln(1 − x) ≥ −x + x²/2, which implies the claim by exponentiation. ■

   
Theorem 13.2.3. Let X_1, ..., X_n be n independent random variables, where P[X_i = 1] = p_i, P[X_i = 0] = q_i = 1 − p_i, for all i. For X = Σ_{i=1}^n X_i, its expectation is µ = E[X] = Σ_i p_i. We have that
\[ \Pr\bigl[X < (1-\delta)\mu\bigr] < \Bigl(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\Bigr)^{\mu}. \]
Furthermore, for any τ > 1, we have that
\[ \Pr\bigl[X < \mu/\tau\bigr] \le \exp\Bigl(-\Bigl(1 - \frac{1+\ln\tau}{\tau}\Bigr)\mu\Bigr). \]

 
Proof: We follow the same proof template seen already. For t = −ln(1 − δ) > 0, we have E[exp(−tX_i)] = (1 − p_i)e^0 + p_i e^{−t} = 1 − p_i + p_i(1 − δ) = 1 − p_i δ ≤ exp(−p_i δ). As such, we have
\[ \Pr\bigl[X < (1-\delta)\mu\bigr] = \Pr\bigl[-X > -(1-\delta)\mu\bigr] = \Pr\bigl[\exp(-tX) > \exp(-t(1-\delta)\mu)\bigr] \le \frac{\prod_{i=1}^{n}\mathrm{E}\bigl[\exp(-tX_i)\bigr]}{\exp(-t(1-\delta)\mu)} \le \frac{\exp\bigl(-\sum_{i=1}^{n} p_i\delta\bigr)}{\exp(-t(1-\delta)\mu)} = \Bigl(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\Bigr)^{\mu}. \]
For the last inequality of the theorem, set δ = 1 − 1/τ, and observe that
\[ \Pr\bigl[X < (1-\delta)\mu\bigr] \le \Bigl(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\Bigr)^{\mu} = \Bigl(\frac{\exp(-1+1/\tau)}{(1/\tau)^{1/\tau}}\Bigr)^{\mu} = \exp\Bigl(-\Bigl(1 - \frac{1+\ln\tau}{\tau}\Bigr)\mu\Bigr). \qquad\blacksquare \]

 
Lemma 13.2.4. Let X_1, ..., X_n ∈ {0, 1} be n independent random variables, with p_i = P[X_i = 1], for all i. For X = Σ_{i=1}^n X_i, and µ = E[X] = Σ_i p_i, we have that P[X < (1 − δ)µ] < exp(−µδ²/2).

Proof: This alternative simplified form of Theorem 13.2.3 follows readily from Lemma 13.2.2, since
\[ \Pr\bigl[X < (1-\delta)\mu\bigr] \le \Bigl(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\Bigr)^{\mu} \le \Bigl(\frac{e^{-\delta}}{\exp\bigl(-\delta+\delta^{2}/2\bigr)}\Bigr)^{\mu} = \exp\bigl(-\mu\delta^{2}/2\bigr). \qquad\blacksquare \]

13.2.2. A more convenient form of Chernoff’s inequality
Lemma 13.2.5. Let X_1, ..., X_n be n independent Bernoulli trials, where P[X_i = 1] = p_i, and P[X_i = 0] = 1 − p_i, for i = 1, ..., n. Let X = Σ_{i=1}^n X_i, and µ = E[X] = Σ_i p_i. For δ ∈ (0, 1), we have
\[ \Pr\bigl[X > (1+\delta)\mu\bigr] < \exp\bigl(-\mu\delta^{2}/3\bigr). \]

Proof: By Theorem 13.2.1, it is sufficient to prove, for δ ∈ [0, 1], that
\[ \Bigl(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\Bigr)^{\mu} \le \exp\Bigl(-\frac{\mu\delta^{2}}{c}\Bigr) \iff \mu\bigl(\delta - (1+\delta)\ln(1+\delta)\bigr) \le -\mu\delta^{2}/c \iff f(\delta) = \delta^{2}/c + \delta - (1+\delta)\ln(1+\delta) \le 0. \]
We have
\[ f'(\delta) = 2\delta/c - \ln(1+\delta) \qquad\text{and}\qquad f''(\delta) = 2/c - \frac{1}{1+\delta}. \]
For c = 3, we have f″(δ) ≤ 0 for δ ∈ [0, 1/2], and f″(δ) ≥ 0 for δ ∈ [1/2, 1]. Namely, f′(δ) achieves its maximum on [0, 1] either at 0 or at 1. As f′(0) = 0 and f′(1) = 2/3 − ln 2 ≈ −0.02 < 0, we conclude that f′(δ) ≤ 0. Namely, f is a monotonically decreasing function in [0, 1], which implies that f(δ) ≤ 0, for all δ in this range, thus implying the claim. ■

Lemma 13.2.6. Let X_1, ..., X_n be n independent Bernoulli trials, where P[X_i = 1] = p_i, and P[X_i = 0] = 1 − p_i, for i = 1, ..., n. Let X = Σ_{i=1}^n X_i, and µ = E[X] = Σ_i p_i. For δ ∈ (0, 4), we have
\[ \Pr\bigl[X > (1+\delta)\mu\bigr] < \exp\bigl(-\mu\delta^{2}/4\bigr). \]

Proof: Lemma 13.2.5 implies a stronger bound for δ ∈ (0, 1], so we need to prove the claim only for δ ∈ (1, 4]. Continuing as in the proof of Lemma 13.2.5, for the case c = 4, we have to prove that
\[ f(\delta) = \delta^{2}/4 + \delta - (1+\delta)\ln(1+\delta) \le 0, \]
where f″(δ) = 1/2 − 1/(1+δ).
For δ > 1, we have f″(δ) > 0. Namely, f(·) is convex for δ ≥ 1, and it achieves its maximum on the interval [1, 4] at the endpoints. In particular, f(1) ≈ −0.13 and f(4) ≈ −0.047, which implies the claim. ■

Lemma 13.2.7. Let X_1, ..., X_n be n independent random variables, where P[X_i = 1] = p_i, and P[X_i = 0] = 1 − p_i, for i = 1, ..., n. Let X = Σ_{i=1}^n X_i, and µ = E[X] = Σ_i p_i. For δ ∈ (0, 6), we have
\[ \Pr\bigl[X > (1+\delta)\mu\bigr] < \exp\bigl(-\mu\delta^{2}/5\bigr). \]

Proof: Lemma 13.2.6 implies a stronger bound for δ ∈ (0, 4), so we need to prove the claim only for δ ∈ [4, 6). Continuing as in the proof of Lemma 13.2.5, for the case c = 5, we have to prove that
\[ f(\delta) = \delta^{2}/5 + \delta - (1+\delta)\ln(1+\delta) \le 0, \]
where f″(δ) = 2/5 − 1/(1+δ). For δ ≥ 4, we have f″(δ) > 0. Namely, f(·) is convex for δ ≥ 4, and it achieves its maximum on the interval [4, 6] at the endpoints. In particular, f(4) ≈ −0.84 and f(6) ≈ −0.42, which implies the claim. ■

Lemma 13.2.8. Let X_1, ..., X_n be n independent Bernoulli trials, where P[X_i = 1] = p_i, and P[X_i = 0] = 1 − p_i, for i = 1, ..., n. Let X = Σ_{i=1}^n X_i, and µ = E[X] = Σ_i p_i. For δ > 2e − 1, we have P[X > (1 + δ)µ] < 2^{−µ(1+δ)}.

Proof: By Theorem 13.2.1, we have
\[ \Pr\bigl[X > (1+\delta)\mu\bigr] < \Bigl(\frac{e}{1+\delta}\Bigr)^{(1+\delta)\mu} \le \Bigl(\frac{e}{1+2e-1}\Bigr)^{(1+\delta)\mu} = 2^{-(1+\delta)\mu}, \]
since δ > 2e − 1. ■
Lemma 13.2.9. Let X_1, ..., X_n be n independent Bernoulli trials, where P[X_i = 1] = p_i, and P[X_i = 0] = 1 − p_i, for i = 1, ..., n. Let X = Σ_{i=1}^n X_i, and µ = E[X] = Σ_i p_i. For δ > e², we have P[X > (1 + δ)µ] < exp(−(µδ/2) ln δ).

Proof: Observe that
\[ \Pr\bigl[X > (1+\delta)\mu\bigr] < \Bigl(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\Bigr)^{\mu} = \exp\bigl(\mu\delta - \mu(1+\delta)\ln(1+\delta)\bigr). \qquad (13.1) \]
As such, we have
\[ \Pr\bigl[X > (1+\delta)\mu\bigr] < \exp\Bigl(-\mu(1+\delta)\bigl(\ln(1+\delta) - 1\bigr)\Bigr) \le \exp\Bigl(-\mu\delta\ln\frac{1+\delta}{e}\Bigr) \le \exp\Bigl(-\frac{\mu\delta\ln\delta}{2}\Bigr), \]
since for x ≥ e² we have that (1 + x)/e ≥ √x ⟺ ln((1 + x)/e) ≥ (ln x)/2. ■
e e 2

13.2.2.1. Bound when the expectation is small


Lemma 13.2.10. Let X_1, ..., X_n be n independent Bernoulli trials, where P[X_i = 1] = p_i, and P[X_i = 0] = 1 − p_i, for i = 1, ..., n. Let Y = Σ_{i=1}^n X_i, and µ = E[Y] = Σ_i p_i. For δ ∈ (0, 1], and φ ∈ (0, 1], we have
\[ \Pr\Bigl[Y > (1+\delta)\mu + \frac{3\ln\varphi^{-1}}{\delta^{2}}\Bigr] < \varphi. \]

Proof: Let ξ = δ + (3 ln φ^{−1})/(µδ²). If ξ ≥ 2e − 1 ≈ 4.43, then by Lemma 13.2.8, we have
\[ \alpha = \Pr\Bigl[Y > (1+\delta)\mu + \frac{3\ln\varphi^{-1}}{\delta^{2}}\Bigr] = \Pr\bigl[Y > (1+\xi)\mu\bigr] \le 2^{-\mu(1+\xi)} < \varphi, \]
since µ(1 + ξ) > µξ ≥ µ · (3 ln φ^{−1})/(µδ²) ≥ 3 ln φ^{−1} > log₂ φ^{−1}, as δ ∈ (0, 1].
If ξ ≤ 6, then by Lemma 13.2.7, we have
\[ \alpha = \Pr\bigl[Y > (1+\xi)\mu\bigr] \le \exp\bigl(-\mu\xi^{2}/5\bigr) \le \varphi, \]
since
\[ \frac{\mu}{5}\xi^{2} = \frac{\mu}{5}\Bigl(\delta + \frac{3\ln\varphi^{-1}}{\mu\delta^{2}}\Bigr)^{2} \ge \frac{\mu}{5}\cdot 2\cdot\delta\cdot\frac{3\ln\varphi^{-1}}{\mu\delta^{2}} = \frac{6}{5}\cdot\frac{\ln\varphi^{-1}}{\delta} \ge \ln\varphi^{-1}. \qquad\blacksquare \]
5 5 µδ2 5 µδ 2 5 δ

Example 13.2.11. Let X_1, ..., X_n be n independent Bernoulli trials, where P[X_i = 1] = p_i, and P[X_i = 0] = 1 − p_i, for i = 1, ..., n. Let Y = Σ_{i=1}^n X_i, and µ = E[Y] = Σ_i p_i. Assume that µ ≤ 1/2. Setting δ = 1, we have, for t > 6, that
\[ \Pr[Y > 1+t] \le \Pr\Bigl[Y > (1+\delta)\mu + \frac{3\ln\exp(t/3)}{\delta^{2}}\Bigr] \le \exp(-t/3), \]
by Lemma 13.2.10 (with φ = exp(−t/3)).

13.3. A special case of Hoeffding’s inequality
In this section, we prove yet another version of the Chernoff inequality, where each variable is picked randomly according to its own distribution in the range [0, 1]. We prove a more general version of this inequality in Section 13.4, but the version presented here does not follow from that generalization.

Theorem 13.3.1. Let X_1, ..., X_n ∈ [0, 1] be n independent random variables, let X = Σ_{i=1}^n X_i, and let µ = E[X]. We have that
\[ \Pr\bigl[X - \mu \ge \eta\bigr] \le \Bigl(\frac{\mu}{\mu+\eta}\Bigr)^{\mu+\eta}\Bigl(\frac{n-\mu}{n-\mu-\eta}\Bigr)^{n-\mu-\eta}. \]

Proof: Let s ≥ 1 be some arbitrary parameter. By the standard argument, we have
\[ \gamma = \Pr[X \ge \mu+\eta] = \Pr\bigl[s^{X} \ge s^{\mu+\eta}\bigr] \le \frac{\mathrm{E}\bigl[s^{X}\bigr]}{s^{\mu+\eta}} = s^{-\mu-\eta}\prod_{i=1}^{n}\mathrm{E}\bigl[s^{X_i}\bigr]. \]
By a direct calculation, see Lemma 13.3.7 below, one can show that E[s^{X_i}] ≤ 1 + (s − 1)E[X_i]. As such, by the AM-GM inequality°, we have that
\[ \prod_{i=1}^{n}\mathrm{E}\bigl[s^{X_i}\bigr] \le \prod_{i=1}^{n}\bigl(1 + (s-1)\mathrm{E}[X_i]\bigr) \le \Bigl(\frac{1}{n}\sum_{i=1}^{n}\bigl(1 + (s-1)\mathrm{E}[X_i]\bigr)\Bigr)^{n} = \Bigl(1 + (s-1)\frac{\mu}{n}\Bigr)^{n}. \]
Setting
\[ s = \frac{(\mu+\eta)(n-\mu)}{\mu(n-\mu-\eta)} = \frac{\mu n - \mu^{2} + \eta n - \eta\mu}{\mu n - \mu^{2} - \eta\mu}, \]
we have that
\[ 1 + (s-1)\frac{\mu}{n} = 1 + \frac{\eta n}{\mu n - \mu^{2} - \eta\mu}\cdot\frac{\mu}{n} = 1 + \frac{\eta}{n-\mu-\eta} = \frac{n-\mu}{n-\mu-\eta}. \]
As such, we have that
\[ \gamma \le s^{-\mu-\eta}\prod_{i=1}^{n}\mathrm{E}\bigl[s^{X_i}\bigr] = \Bigl(\frac{\mu(n-\mu-\eta)}{(\mu+\eta)(n-\mu)}\Bigr)^{\mu+\eta}\Bigl(\frac{n-\mu}{n-\mu-\eta}\Bigr)^{n} = \Bigl(\frac{\mu}{\mu+\eta}\Bigr)^{\mu+\eta}\Bigl(\frac{n-\mu}{n-\mu-\eta}\Bigr)^{n-\mu-\eta}. \qquad\blacksquare \]

Remark 13.3.2. Setting s = (µ + η)/µ in the proof of Theorem 13.3.1, we have
\[ \Pr\bigl[X - \mu \ge \eta\bigr] \le \Bigl(\frac{\mu}{\mu+\eta}\Bigr)^{\mu+\eta}\Bigl(1 + \Bigl(\frac{\mu+\eta}{\mu} - 1\Bigr)\frac{\mu}{n}\Bigr)^{n} = \Bigl(\frac{\mu}{\mu+\eta}\Bigr)^{\mu+\eta}\Bigl(1 + \frac{\eta}{n}\Bigr)^{n}. \]

Corollary 13.3.3. Let X_1, ..., X_n ∈ [0, 1] be n independent random variables, let X = (Σ_{i=1}^n X_i)/n, p = E[X] = µ/n, and q = 1 − p. Then, we have that P[X − p ≥ t] ≤ exp(n f(t)), for
\[ f(t) = (p+t)\ln\frac{p}{p+t} + (q-t)\ln\frac{q}{q-t}. \qquad (13.2) \]
Theorem 13.3.4. Let X_1, ..., X_n ∈ [0, 1] be n independent random variables, let X = (Σ_{i=1}^n X_i)/n, and let p = E[X]. We have that P[X − p ≥ t] ≤ exp(−2nt²) and P[X − p ≤ −t] ≤ exp(−2nt²).
° The inequality between the arithmetic and geometric means: (Σ_{i=1}^n x_i)/n ≥ (x_1 ⋯ x_n)^{1/n}.

Proof: Let p = µ/n, q = 1 − p, and let f(t) be the function from Eq. (13.2), for t ∈ (−p, q). Now, we have that
\[ f'(t) = \Bigl(\ln\frac{p}{p+t} - 1\Bigr) + \Bigl(1 - \ln\frac{q}{q-t}\Bigr) = \ln\frac{p(q-t)}{q(p+t)}. \]
As for the second derivative, we have
\[ f''(t) = -\frac{1}{q-t} - \frac{1}{p+t} = -\frac{p+q}{(q-t)(p+t)} = -\frac{1}{(q-t)(p+t)} \le -4. \]
Indeed, t ∈ (−p, q), and the denominator is maximized for t = (q − p)/2, and as such
\[ (q-t)(p+t) \le \Bigl(q - \frac{q-p}{2}\Bigr)\Bigl(p + \frac{q-p}{2}\Bigr) = \Bigl(\frac{p+q}{2}\Bigr)^{2} = \frac{1}{4}. \]
Now, f(0) = 0 and f′(0) = 0, and by Taylor's expansion, we have that
\[ f(t) = f(0) + f'(0)\,t + \frac{f''(x)}{2}\,t^{2} \le -2t^{2}, \]
where x is some value between 0 and t.
The first bound now readily follows from plugging this bound into Corollary 13.3.3. The second bound follows by considering the random variables Y_i = 1 − X_i, for all i, and plugging them into the first bound. Indeed, for Y = 1 − X, we have that q = E[Y], and then X − p ≤ −t ⟺ t ≤ p − X ⟺ t ≤ 1 − q − (1 − Y) = Y − q. Thus, P[X − p ≤ −t] = P[Y − q ≥ t] ≤ exp(−2nt²). ■
Corollary 13.3.5. Let X_1, ..., X_n ∈ [0, 1] be n independent random variables, let Y = Σ_{i=1}^n X_i, and let µ = E[Y]. For any ∆ > 0, we have P[Y − µ ≥ ∆] ≤ exp(−2∆²/n) and P[Y − µ ≤ −∆] ≤ exp(−2∆²/n).

Proof: For X = Y/n, p = µ/n, and t = ∆/n, by Theorem 13.3.4, we have
\[ \Pr\bigl[Y - \mu \ge \Delta\bigr] = \Pr\bigl[X - p \ge t\bigr] \le \exp\bigl(-2nt^{2}\bigr) = \exp\bigl(-2\Delta^{2}/n\bigr). \]
The second bound follows in the same way. ■

Theorem 13.3.6. Let X_1, ..., X_n ∈ [0, 1] be n independent random variables, let X = Σ_{i=1}^n X_i, and let µ = E[X]. We have that P[X − µ ≥ εµ] ≤ exp(−ε²µ/4) and P[X − µ ≤ −εµ] ≤ exp(−ε²µ/2).

Proof: Let p = µ/n, and let g(x) = f(px), for x ∈ [0, 1] and xp < q, where f is the function from Eq. (13.2). As before, computing the derivative of g, we have
\[ g'(x) = p f'(xp) = p\ln\frac{p(q-xp)}{q(p+xp)} = p\ln\frac{q-xp}{q(1+x)} \le p\ln\frac{1}{1+x} \le -\frac{px}{2}, \]
since (q − xp)/q is maximized for x = 0, and ln(1/(1+x)) ≤ −x/2 for x ∈ [0, 1], as can be easily verified±. Now, g(0) = f(0) = 0, and by integration, we have that
\[ g(x) = \int_{y=0}^{x} g'(y)\,dy \le \int_{y=0}^{x}\bigl(-py/2\bigr)\,dy = -px^{2}/4. \]
Now, plugging into Corollary 13.3.3, we get that the desired probability P[X − µ ≥ εµ] is
\[ \Pr\bigl[X/n - p \ge \varepsilon p\bigr] \le \exp\bigl(n f(\varepsilon p)\bigr) = \exp\bigl(n g(\varepsilon)\bigr) \le \exp\bigl(-pn\varepsilon^{2}/4\bigr) = \exp\bigl(-\mu\varepsilon^{2}/4\bigr). \]
As for the other inequality, set h(x) = g(−x) = f(−xp). Then
\[ h'(x) = -p f'(-xp) = -p\ln\frac{p(q+xp)}{q(p-xp)} = p\ln\frac{q(1-x)}{q+xp} = p\ln\frac{q-xq}{q+xp} = p\ln\Bigl(1 - x\,\frac{p+q}{q+xp}\Bigr) = p\ln\Bigl(1 - \frac{x}{q+xp}\Bigr) \le p\ln(1-x) \le -px, \]
since 1 − x ≤ e^{−x}. By integration, as before, we conclude that h(x) ≤ −px²/2. Now, plugging into Corollary 13.3.3, we get
\[ \Pr\bigl[X - \mu \le -\varepsilon\mu\bigr] = \Pr\bigl[X/n - p \le -\varepsilon p\bigr] \le \exp\bigl(n f(-\varepsilon p)\bigr) = \exp\bigl(n h(\varepsilon)\bigr) \le \exp\bigl(-np\varepsilon^{2}/2\bigr) = \exp\bigl(-\mu\varepsilon^{2}/2\bigr). \qquad\blacksquare \]

± Indeed, ln(1/(1+x)) ≤ −x/2 is equivalent to 1/(1 + x) ≤ e^{−x/2} ⟺ e^{x/2} ≤ 1 + x, which readily holds for x ∈ [0, 1].

13.3.1. Some technical lemmas


Lemma 13.3.7. Let X ∈ [0, 1] be a random variable, and let s ≥ 1. Then E[s^X] ≤ 1 + (s − 1)E[X].

Proof: For the sake of simplicity of exposition, assume that X is a discrete random variable, and that there is a value α ∈ (0, 1/2), such that β = P[X = α] > 0. Consider the modified random variable X′, such that P[X′ = α] = 0, P[X′ = 0] = P[X = 0] + β/2, and P[X′ = 2α] = P[X = 2α] + β/2, with X′ agreeing with X otherwise. Clearly, E[X] = E[X′]. Next, observe that E[s^{X′}] − E[s^X] = (β/2)(s^0 + s^{2α}) − βs^α ≥ 0, by the convexity of s^x. We conclude that E[s^X] achieves its maximum if X takes only the values 0 and 1. But then, we have that E[s^X] = P[X = 0]s^0 + P[X = 1]s^1 = (1 − E[X]) + E[X]s = 1 + (s − 1)E[X], as claimed. ■
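As a quick numeric spot-check of the lemma for one particular distribution (my own example, not from the notes): if X is uniform on [0, 1], then E[s^X] = ∫₀¹ s^x dx = (s − 1)/ln s and E[X] = 1/2, so the closed form should be at most 1 + (s − 1)/2.

```python
import math

# Spot-check Lemma 13.3.7 for X uniform on [0, 1]:
# E[s^X] = (s - 1) / ln(s) in closed form, and E[X] = 1/2.
for s in (1.5, 2.0, 5.0, 20.0):
    exact = (s - 1) / math.log(s)     # E[s^X] for X ~ U[0, 1]
    bound = 1 + (s - 1) * 0.5         # 1 + (s - 1) E[X]
    assert exact <= bound
    print(f"s={s}: E[s^X] = {exact:.4f} <= {bound:.4f}")
```

The bound corresponds to the chord of the convex function s^x between x = 0 and x = 1, which is exactly the geometric picture behind the proof.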

13.4. Hoeffding’s inequality


In this section, we prove a generalization of Chernoff’s inequality. The proof is considerably more tedious, and
it is included here for the sake of completeness.

Lemma 13.4.1. Let X be a random variable. If E[X] = 0 and a ≤ X ≤ b, then for any s > 0, we have
\[ \mathrm{E}\bigl[e^{sX}\bigr] \le \exp\bigl(s^{2}(b-a)^{2}/8\bigr). \]

Proof: Let a ≤ x ≤ b, and observe that x can be written as a convex combination of a and b. In particular, we have
\[ x = \lambda a + (1-\lambda)b \qquad\text{for}\qquad \lambda = \frac{b-x}{b-a} \in [0,1]. \]
Since s > 0, the function exp(sx) is convex, and as such
\[ e^{sx} \le \frac{b-x}{b-a}\,e^{sa} + \frac{x-a}{b-a}\,e^{sb}, \]
since we have that f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) if f(·) is a convex function. Thus, for a random variable X, by linearity of expectation, we have
\[ \mathrm{E}\bigl[e^{sX}\bigr] \le \mathrm{E}\Bigl[\frac{b-X}{b-a}\,e^{sa} + \frac{X-a}{b-a}\,e^{sb}\Bigr] = \frac{b-\mathrm{E}[X]}{b-a}\,e^{sa} + \frac{\mathrm{E}[X]-a}{b-a}\,e^{sb} = \frac{b}{b-a}\,e^{sa} - \frac{a}{b-a}\,e^{sb}, \]
since E[X] = 0.
Next, set
\[ p = -\frac{a}{b-a}, \]
and observe that
\[ 1-p = 1 + \frac{a}{b-a} = \frac{b}{b-a} \qquad\text{and}\qquad -ps(b-a) = -\Bigl(-\frac{a}{b-a}\Bigr)s(b-a) = sa. \]

As such, we have
h i
E e ≤ (1 − p)e + pe = (1 − p + pe
sX sa sb s(b−a) sa
)e
= (1 − p + pe s(b−a) )e−ps(b−a)
  
= exp −ps(b − a) + ln 1 − p + pe s(b−a) = exp(−pu + ln(1 − p + peu )),

for u = s(b − a). Setting

ϕ(u) = −pu + ln(1 − p + peu ),


h i
we thus have E e sX ≤ exp(ϕ(u)). To prove the claim, we will show that ϕ(u) ≤ u2 /8 = s2 (b − a)2 /8.
To see that, expand ϕ(u) about zero using Taylor’s expansion. We have

1
ϕ(u) = ϕ(0) + uϕ′ (0) + u2 ϕ′′ (θ) (13.3)
2
where θ ∈ [0, u], and notice that ϕ(0) = 0. Furthermore, we have

peu
ϕ′ (u) = −p + ,
1 − p + peu

and as such ϕ′ (0) = −p + p


1−p+p
= 0. Now,

′′ (1 − p + peu )peu − (peu )2 (1 − p)peu


ϕ (u) = = .
(1 − p + peu )2 (1 − p + peu )2

For any x, y ≥ 0, we have (x + y)² ≥ 4xy, as this is equivalent to (x − y)² ≥ 0. Setting x = 1 − p and y = p e^u, we have that

    ϕ′′(u) = (1 − p) p e^u/(1 − p + p e^u)² ≤ (1 − p) p e^u/(4(1 − p) p e^u) = 1/4.

Plugging this into Eq. (13.3), we get that

    ϕ(u) ≤ (1/8)u² = (1/8)(s(b − a))²    and    E[e^{sX}] ≤ exp(ϕ(u)) ≤ exp((1/8)(s(b − a))²),

as claimed. ■
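Lemma 13.4.1 is easy to sanity-check numerically. The sketch below (illustrative code, not part of the original notes) compares E[e^{sX}] against the bound exp(s²(b − a)²/8) for a two-point random variable with mean zero — the extremal distribution implicit in the convexity step of the proof above.

```python
import math

def mgf_two_point(a, b, s):
    # X takes the value a with probability b/(b - a) and the value b
    # otherwise (a < 0 < b), so that E[X] = 0.
    p = b / (b - a)
    return p * math.exp(s * a) + (1 - p) * math.exp(s * b)

def mgf_bound(a, b, s):
    # The bound of Lemma 13.4.1.
    return math.exp(s * s * (b - a) ** 2 / 8)

# The lemma asserts mgf <= bound for every s > 0.
for s in (0.1, 0.5, 1.0, 2.0):
    for a, b in ((-1.0, 1.0), (-1.0, 3.0), (-5.0, 2.0)):
        assert mgf_two_point(a, b, s) <= mgf_bound(a, b, s)
```

For a = −1, b = 1 the two-point variable is a fair ±1 coin, and the comparison reduces to the familiar cosh(s) ≤ exp(s²/2).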

Lemma 13.4.2. Let X be a random variable. If E[X] = 0 and a ≤ X ≤ b, then for any s > 0, we have

    P[X > t] ≤ exp(s²(b − a)²/8) / e^{st}.

Proof: Using the same technique we used in proving Chernoff's inequality, we have that

    P[X > t] = P[e^{sX} > e^{st}] ≤ E[e^{sX}]/e^{st} ≤ exp(s²(b − a)²/8)/e^{st}.    ■

Theorem 13.4.3 (Hoeffding's inequality). Let X1, . . . , Xn be independent random variables, where Xi ∈ [ai, bi], for i = 1, . . . , n. Then, for the random variable S = X1 + · · · + Xn and any η > 0, we have

    P[|S − E[S]| ≥ η] ≤ 2 exp(−2η² / ∑_{i=1}^{n} (bi − ai)²).

Proof: Let Zi = Xi − E[Xi], for i = 1, . . . , n. Set Z = ∑_{i=1}^{n} Zi, and observe that

    P[Z ≥ η] = P[e^{sZ} ≥ e^{sη}] ≤ E[exp(sZ)]/exp(sη),

by Markov's inequality. Arguing as in the proof of Chernoff's inequality, we have

    E[exp(sZ)] = E[∏_{i=1}^{n} exp(sZi)] = ∏_{i=1}^{n} E[exp(sZi)] ≤ ∏_{i=1}^{n} exp(s²(bi − ai)²/8),

since the Zi s are independent and by Lemma 13.4.1. This implies that

    P[Z ≥ η] ≤ exp(−sη) ∏_{i=1}^{n} e^{s²(bi − ai)²/8} = exp((s²/8) ∑_{i=1}^{n} (bi − ai)² − sη).

The upper bound is minimized for s = 4η/∑_i (bi − ai)², implying

    P[Z ≥ η] ≤ exp(−2η² / ∑_i (bi − ai)²).

The claim now follows by the symmetry of the upper bound (i.e., apply the same proof to −Z). ■
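A small Monte Carlo experiment (a hypothetical illustration; the uniform distributions and all parameters below are arbitrary choices, not from the notes) shows the inequality in action — the empirical tail is typically far below the bound:

```python
import math
import random

def hoeffding_bound(eta, ranges):
    # Theorem 13.4.3: P[|S - E[S]| >= eta] <= 2 exp(-2 eta^2 / sum (b_i - a_i)^2).
    return 2.0 * math.exp(-2.0 * eta * eta / sum((b - a) ** 2 for a, b in ranges))

def empirical_tail(eta, ranges, trials=20000, seed=0):
    # Estimate P[|S - E[S]| >= eta] for independent uniform X_i on [a_i, b_i].
    rng = random.Random(seed)
    mean = sum((a + b) / 2 for a, b in ranges)
    hits = sum(1 for _ in range(trials)
               if abs(sum(rng.uniform(a, b) for a, b in ranges) - mean) >= eta)
    return hits / trials

ranges = [(0.0, 1.0)] * 50
eta = 5.0
assert empirical_tail(eta, ranges) <= hoeffding_bound(eta, ranges)
```

Here the bound equals 2e^{−1} ≈ 0.74, while the true tail (a sum of 50 uniforms deviating by 5 from its mean 25) is on the order of a percent.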

13.5. Bibliographical notes


Some of the exposition here follows more or less the exposition in [MR95]. Exercise 13.6.1 (without the hint) is from [Mat99]. McDiarmid [McD89] provides a survey of Chernoff-type inequalities; Theorem 13.3.6 and Section 13.3 are taken from there (our proof has somewhat weaker constants).
A more general treatment of such inequalities and tools is provided by Dubhashi and Panconesi [DP09].

13.6. Exercises
Exercise 13.6.1 (Chernoff inequality is tight.). Let S = ∑_{i=1}^{n} Si be a sum of n independent random variables, each attaining the values +1 and −1 with equal probability. Let P(n, ∆) = P[S > ∆]. Prove that for ∆ ≤ n/C,

    P(n, ∆) ≥ (1/C) exp(−∆²/(Cn)),

where C is a suitable constant. That is, the well-known Chernoff bound P(n, ∆) ≤ exp(−∆²/(2n)) is close to the truth.

Exercise 13.6.2 (Chernoff inequality is tight by direct calculations.). For this question use only basic argumentation – do not use Stirling's formula, the Chernoff inequality, or any similar "heavy" machinery.

(A) Prove that ∑_{i=0}^{n−k} \binom{2n}{i} ≤ (n/(4k²)) 2^{2n}.
    Hint: Consider flipping a coin 2n times. Write down explicitly the probability of this coin to have at most n − k heads, and use the Chebyshev inequality.
(B) Using (A), prove that \binom{2n}{n} ≥ 2^{2n}/(4√n) (which is a pretty good estimate).
(C) Prove that \binom{2n}{n+i+1} = (1 − (2i + 1)/(n + i + 1)) · \binom{2n}{n+i}.
(D) Prove that \binom{2n}{n+i} ≤ exp(−i(i − 1)/(2n)) · \binom{2n}{n}.
(E) Prove that \binom{2n}{n+i} ≥ exp(−8i²/n) · \binom{2n}{n}.
(F) Using the above, prove that \binom{2n}{n} ≤ c · 2^{2n}/√n for some constant c (I got c = 0.824... but any reasonable constant will do).
(G) Using the above, prove that

        ∑_{i = t√(n+1)}^{(t+1)√n} \binom{2n}{n−i} ≤ c 2^{2n} exp(−t²/2).

    In particular, conclude that when flipping a fair coin 2n times, the probability to get less than n − t√n heads (for t an integer) is smaller than c′ exp(−t²/2), for some constant c′.
(H) Let X be the number of heads in 2n coin flips. Prove that for any integer t > 0 and any δ > 0 sufficiently small, it holds that P[X < (1 − δ)n] ≥ exp(−c′′δ²n), where c′′ is some constant. Namely, the Chernoff inequality is tight in the worst case.

Exercise 13.6.3 (Tail inequality for geometric variables). Let X1, . . . , Xm be m independent random variables with geometric distribution with parameter p (i.e., P[Xi = j] = (1 − p)^{j−1} p). Let Y = ∑_i Xi, and let µ = E[Y] = m/p. Prove that P[Y ≥ (1 + δ)µ] ≤ exp(−mδ²/8).

References
[DP09] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized
Algorithms. Cambridge University Press, 2009.
[Kel56] J. L. Kelly. A new interpretation of information rate. Bell Sys. Tech. J., 35(4): 917–926, 1956.
[Mat99] J. Matoušek. Geometric Discrepancy. Vol. 18. Algorithms and Combinatorics. Springer, 1999.
[McD89] C. McDiarmid. Surveys in Combinatorics. Ed. by J. Siemons. Cambridge University Press, 1989.
Chap. On the method of bounded differences.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.

Chapter 14

Applications of Chernoff’s Inequality


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024

14.1. QuickSort is Quick


We revisit QuickSort. We remind the reader that the running time of QuickSort is proportional to the number of comparisons performed by the algorithm. Next, consider an arbitrary element u being sorted. Consider the ith level recursive subproblem that contains u, and let S_i be the set of elements in this subproblem. We consider u to be successful in the ith level if |S_{i+1}| ≤ |S_i|/2. Namely, if u is successful, then the next level in the recursion involving u would include a considerably smaller subproblem. Let X_i be the indicator variable which is 1 if u is successful in the ith level, and zero otherwise. Note that the X_i s are independent, and P[X_i = 1] = 1/2.
We first observe that if QuickSort is applied to an array with n elements, then u can be successful at most T = ⌈lg n⌉ times before the subproblem it participates in is of size one, and the recursion stops.
If u participates in v levels, then we have the random variables X_1, X_2, . . . , X_v. To make things simpler, we will extend this series by adding independent random variables, such that P[X_i = 1] = 1/2, for i ≥ v. Thus, we have an infinite sequence of independent 0/1 random variables, each equal to 1 with probability 1/2. The question is how many elements in the sequence we need to read till we get T ones.
Lemma 14.1.1. Let X1, X2, . . . be an infinite sequence of independent 0/1 random variables, each being 1 with probability 1/2. Let M be an arbitrary parameter. Then the probability that we need to read more than L = 2M + 4t√M variables of this sequence till we collect M ones is at most 2 exp(−t²), for t ≤ √M. If t ≥ √M, then this probability is at most 2 exp(−t√M).

Proof: Consider the random variable Y = ∑_{i=1}^{L} Xi, where L = 2M + 4t√M. Its expectation is L/2, and using the Chernoff inequality, we get

    α = P[Y ≤ M] ≤ P[|Y − L/2| ≥ L/2 − M] ≤ 2 exp(−2(L/2 − M)²/L) = 2 exp(−2(2t√M)²/L) = 2 exp(−8t²M/L),

by Corollary 13.1.9, since L/2 − M = 2t√M. For t ≤ √M we have that L = 2M + 4t√M ≤ 8M, and as such in this case P[Y ≤ M] ≤ 2 exp(−8t²M/(8M)) = 2 exp(−t²).
If t ≥ √M, then L = 2M + 4t√M ≤ 6t√M, and

    α = 2 exp(−8t²M/(2M + 4t√M)) ≤ 2 exp(−8t²M/(6t√M)) ≤ 2 exp(−t√M).    ■

Going back to the QuickSort problem: if we sort n elements, then applying Lemma 14.1.1 with M = ⌈lg n⌉ and t = c√M (for c ≥ 1), the probability that u participates in more than L = (2 + 4c)⌈lg n⌉ = 2⌈lg n⌉ + 4c√⌈lg n⌉ · √⌈lg n⌉ recursive calls is smaller than 2 exp(−c√⌈lg n⌉ · √⌈lg n⌉) = 2 exp(−c⌈lg n⌉) ≤ 1/n^c. There are n elements being sorted, and as such, by the union bound, the probability that any element participates in more than (2 + 4(c + 1))⌈lg n⌉ recursive calls is smaller than n · (1/n^{c+1}) = 1/n^c.

Lemma 14.1.2. For any c > 0, the probability that QuickSort performs more than (6 + c)n lg n comparisons is smaller than 1/n^c.
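The claimed concentration is easy to observe experimentally. The sketch below (illustrative code, not part of the notes) measures the maximum recursion depth of a random-pivot QuickSort, which by the discussion above should be O(lg n) with high probability:

```python
import random

def quicksort_depth(items, rng, depth=0):
    # Maximum recursion depth over all elements, with a uniformly random pivot.
    if len(items) <= 1:
        return depth
    pivot = rng.choice(items)
    left = [x for x in items if x < pivot]
    right = [x for x in items if x > pivot]
    return max(quicksort_depth(left, rng, depth + 1),
               quicksort_depth(right, rng, depth + 1))

rng = random.Random(42)
n = 1024  # lg n = 10
worst = max(quicksort_depth(list(range(n)), rng) for _ in range(20))
# The analysis predicts O(lg n) depth with high probability; 7 lg n = 70
# leaves a comfortable margin for n = 1024 (the typical value is around 25).
assert worst <= 70
```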

14.2. How many times can the minimum change?


Let Π = π1 . . . πn be a random permutation of {1, . . . , n}. Let Ei be the event that πi is the minimum number seen so far as we read Π; that is, Ei is the event that πi = min_{k=1}^{i} πk. Let Xi be the indicator variable that is one if Ei happens. We have already seen, and it is easy to verify, that E[Xi] = 1/i. We are interested in how many times the minimum might change¬; that is, Z = ∑_i Xi, and in how concentrated the distribution of Z is. The following is perhaps surprising.

Lemma 14.2.1. The events E1 , . . . , En are independent (as such, the variables X1 , . . . , Xn are independent).

Proof: Exercise. ■

Theorem 14.2.2. Let Π = π1 . . . πn be a random permutation of 1, . . . , n, and let Z be the number of times that πi is the smallest number among π1, . . . , πi, for i = 1, . . . , n. Then, we have that for t ≥ 2e that P[Z > t ln n] ≤ 1/n^{t ln 2}, and for t ∈ (1, 2e], we have that P[Z > t ln n] ≤ 1/n^{(t−1)²/4}.

Proof: Follows readily from Chernoff's inequality, as Z = ∑_i Xi is a sum of independent indicator variables, and, by linearity of expectation, we have

    µ = E[Z] = ∑_{i=1}^{n} E[Xi] = ∑_{i=1}^{n} 1/i ≥ ∫_{x=1}^{n+1} dx/x = ln(n + 1) ≥ ln n.

Next, we set δ = t − 1, and use Theorem 13.2.1. ■
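The sketch below (illustrative code, not from the notes) counts how many times the prefix minimum changes in random permutations, and checks that the average is close to ln n, as computed in the proof above:

```python
import math
import random

def prefix_minima_changes(perm):
    # Z = number of indices i such that perm[i] = min(perm[0..i]).
    count, cur_min = 0, float('inf')
    for x in perm:
        if x < cur_min:
            cur_min = x
            count += 1
    return count

rng = random.Random(1)
n, trials = 10000, 200
avg = sum(prefix_minima_changes(rng.sample(range(n), n))
          for _ in range(trials)) / trials
# E[Z] = H_n = ln n + O(1), roughly 9.8 for n = 10000; concentration keeps
# the empirical average very close to this value.
assert abs(avg - math.log(n)) < 3.0
```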

14.3. Routing in a parallel computer


Let G be a graph of a network, where every node is a processor. The processors communicate by sending packets on the edges. Let {0, . . . , N − 1} denote the vertices (i.e., processors) of G, where N = 2^n, and G is the hypercube. As such, each processor is identified by a binary string b1 b2 . . . bn ∈ {0, 1}^n. Two nodes are connected if their binary strings differ in a single bit. Namely, G is the binary hypercube over n bits.
We want to investigate the best routing strategy for this network topology. We assume that every processor needs to send a message to a single other processor. This is represented by a permutation π, and we would like to figure out how to send the messages encoded by the permutation while creating minimum delay/congestion.
Specifically, in our model, every edge has a FIFO queue­ of the packets it has to transmit. At every clock tick, the message in front of the queue gets sent. All the processors start sending their packets (to the destinations specified by the permutation) at the same time.
¬
The answer, my friend, is blowing in the permutation.
­
First in, first out queue. I sure hope you already knew that.

RandomRoute( v0 , . . . , vN−1 )
// vi : Packet at node i to be routed to node d(i).
(i) Pick a random intermediate destination σ(i) from [1, . . . , N]. Packet vi travels to σ(i).
// Here random sampling is done with replacement.
// Several packets might travel to the same destination.
(ii) Wait till all the packets arrive to their intermediate destination.
(iii) Packet vi travels from σ(i) to its destination d(i).
Figure 14.1: The routing algorithm

A routing scheme is oblivious if every node that has to forward a packet inspects the packet, and decides how to forward it depending only on the content of the packet. That is, such a routing scheme is local in nature, and does not take into account other considerations. Oblivious routing is of course a bad idea – it ignores congestion in the network, and might insist on routing packets through regions of the hypercube that are "gridlocked".

Theorem 14.3.1 ([KKT91]). For any deterministic oblivious permutation routing algorithm on a network of N nodes each of out-degree n, there is a permutation for which the routing of the permutation takes Ω(√(N/n)) units of time (i.e., ticks).

Proof: (Sketch.) The above is implied by a nice averaging argument – construct, for every possible destination,
the routing tree of all packets to this specific node. Argue that there must be many edges in this tree that are
highly congested in this tree (which is NOT the permutation routing we are looking for!). Now, by averaging,
there must be a single edge that is congested in “many” of these trees. Pick a source-destination pair from each
one of these trees that uses this edge, and complete it into a full permutation in the natural way. Clearly, the
congestion of the resulting permutation is high. For the exact details see [KKT91]. ■

How do we send a packet? We use bit fixing. Namely, the packet from node i always goes to the adjacent node that fixes the first bit (scanning left to right) in which the current node differs from the destination string d(i). For example, a packet from (0000) going to (1101) would pass through (1000), (1100), (1101).
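The bit-fixing rule can be written in a few lines (illustrative code, not from the notes; here a node is an integer whose n-bit binary representation is its processor label, an encoding chosen just for this sketch):

```python
def bit_fixing_route(src, dst, n):
    # Repeatedly flip the leftmost bit in which the current node differs
    # from the destination, recording the nodes visited along the way.
    route = [src]
    cur = src
    for bit in range(n - 1, -1, -1):  # scan bits left to right
        mask = 1 << bit
        if (cur ^ dst) & mask:
            cur ^= mask
            route.append(cur)
    return route

# The example from the text: 0000 -> 1101 passes through 1000, 1100, 1101.
assert bit_fixing_route(0b0000, 0b1101, 4) == [0b0000, 0b1000, 0b1100, 0b1101]
```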

The routing algorithm. We assume each edge has a FIFO queue. The routing algorithm is depicted in Figure 14.1.

14.3.1. Analysis
We analyze only step (i) in the algorithm, as (iii) follows from the same analysis. In the following, let ρi denote
the route taken by vi in (i).

Exercise 14.3.2. Show that once a packet v_j traveling along a path ρ_j leaves a path ρ_i, it cannot join ρ_i again later. Namely, ρ_i ∩ ρ_j is a (possibly empty) path.

Lemma 14.3.3. Let the route of a message c follow the sequence of edges π = (e1 , e2 , . . . , ek ). Let S be the set
of packets whose routes pass through at least one of (e1 , . . . , ek ). Then, the delay incurred by c is at most |S |.

Proof: A packet in S is said to leave π at that time step at which it traverses an edge of π for the last time. If a
packet is ready to follow edge e j at time t, we define its lag at time t to be t − j. The lag of c is initially zero, and

the delay incurred by c is its lag when it traverses ek. We will show that each step at which the lag of c increases
by one can be charged to a distinct member of S .
We argue that if the lag of c reaches ℓ + 1, some packet in S leaves π with lag ℓ. When the lag of c increases
from ℓ to ℓ + 1, there must be at least one packet (from S ) that wishes to traverse the same edge as c at that
time step, since otherwise c would be permitted to traverse this edge and its lag would not increase. Thus, S
contains at least one packet whose lag reaches the value ℓ.
Let τ be the last time step at which any packet in S has lag ℓ. Thus there is a packet d ready to follow edge
eµ at τ, such that τ − µ = ℓ. We argue that some packet of S leaves π at time τ – this establishes the lemma
since once a packet leaves π, it would never join it again and as such will never again delay c.
Since d is ready to follow eµ at time τ, some packet ω (which may be d itself) in S traverses eµ at time τ.
Now ω must leave π at time τ – if not, some packet will follow eµ+1 at step µ + 1 with lag ℓ. But this violates the
maximality of τ. We charge to ω the increase in the lag of c from ℓ to ℓ + 1. Since ω leaves π, it will never be
charged again. Thus, each member of S whose route intersects π is charged for at most one delay, establishing
the lemma. ■
Let Hi j be an indicator variable that is 1 if ρi and ρ j share an edge, and 0 otherwise. The total delay for vi is at most ∑_j Hi j.
Crucially, for a fixed i, the variables Hi1 , . . . , HiN are independent. Indeed, imagine first picking the desti-
nation of vi , and let the associated path be ρi . Now, pick the destinations of all the other packets in the network.
Since the sampling of destinations is done with replacements, whether or not the path ρ j of v j intersects ρi , is
independent of whether ρk intersects ρi . Of course, the probabilities P[Hi j = 1] and P[Hik = 1] are probably
different. Confusingly, however, H11 , . . . , HNN are not independent. Indeed, imagine k and j being close ver-
tices on the hypercube. If Hi j = 1 then intuitively it means that ρi is traveling close to the vertex v j , and as such
there is a higher probability that Hik = 1.
Let ρi = (e1, . . . , ek), and let T(e) be the number of packets (i.e., paths) that pass through e. We have that

    ∑_{j=1}^{N} Hi j ≤ ∑_{j=1}^{k} T(e j)    and thus    E[∑_{j=1}^{N} Hi j] ≤ E[∑_{j=1}^{k} T(e j)].

Because of symmetry, the variables T(e) have the same distribution for all the edges of G. On the other hand, the expected length of a path is n/2, there are N packets, and there are Nn/2 edges®. We conclude

    E[T(e)] = (total length of paths)/(# of edges in graph) = N(n/2)/(N(n/2)) = 1.

Thus, for k = |ρi|, we have

    µ = E[∑_{j=1}^{N} Hi j] ≤ E[∑_{j=1}^{k} T(e j)] = E[E[∑_{j=1}^{|ρi|} T(e j) | ρi]] = E[∑_{j=1}^{|ρi|} E[T(e j) | ρi]] = E[|ρi|] = n/2.

By the Chernoff inequality, specifically Lemma 13.2.8, we have

    P[∑_j Hi j > 7n] ≤ P[∑_j Hi j > (1 + 13)µ] < 2^{−13µ} ≤ 2^{−6n}.

Since there are N = 2^n packets, the union bound implies that with probability at least 1 − 2^{−5n} all packets arrive to their temporary destinations with delay at most 7n.
®
Indeed, the hypercube has N vertices, all of degree n. As such, the number of edges is Nn/2.

Theorem 14.3.4. Each packet arrives to its destination within 14n stages, with probability at least 1 − 1/N (note that this is very conservative).

14.4. Faraway Strings


Consider the Hamming distance between binary strings. It is natural to ask how many strings of length n one can have, such that any pair of them is of Hamming distance at least t from each other. Consider two random strings, generated by picking each bit randomly and independently. Thus, E[dH(x, y)] = n/2, where dH(x, y) denotes the Hamming distance between x and y. In particular, using the Chernoff inequality, specifically Corollary 13.1.9, we have that

    P[dH(x, y) ≤ n/2 − ∆] ≤ exp(−2∆²/n).

Next, consider generating M such strings, where the value of M would be determined shortly. Clearly, the probability that some pair of strings is at distance at most n/2 − ∆ is

    α ≤ \binom{M}{2} exp(−2∆²/n) < M² exp(−2∆²/n).

If this probability is smaller than one, then there is some probability that all the M strings are of distance at least n/2 − ∆ from each other. Namely, there exists a set of M strings such that every pair of them is far. We used here the fact that if an event has probability larger than zero, then there is an outcome where it happens. Thus, set ∆ = n/4, and observe that

    α < M² exp(−2(n/4)²/n) = M² exp(−n/8).

Thus, for M = exp(n/16), we have that α < 1. We conclude:

Lemma 14.4.1. There exists a set of exp(n/16) binary strings of length n, such that any pair of them is at
Hamming distance at least n/4 from each other.

This is our first introduction to the beautiful technique known as the probabilistic method — we will hear
more about it later in the course.
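The concentration driving Lemma 14.4.1 is visible already for modest parameters. The sketch below (illustrative code; n = 64 and m = 20 are arbitrary choices, far below the exp(n/16) guarantee) samples random strings and inspects their pairwise Hamming distances; to keep the experiment robust it asserts only the much weaker threshold n/8:

```python
import random

def hamming(x, y):
    # Hamming distance between two equal-length bit lists.
    return sum(a != b for a, b in zip(x, y))

rng = random.Random(7)
n, m = 64, 20  # string length, number of random strings
strings = [[rng.randrange(2) for _ in range(n)] for _ in range(m)]
dists = [hamming(strings[i], strings[j])
         for i in range(m) for j in range(i + 1, m)]
# Pairwise distances concentrate around n/2 = 32; a distance below n/4 is
# already unlikely, and a distance below n/8 essentially never happens.
assert abs(sum(dists) / len(dists) - n / 2) <= 4
assert min(dists) > n // 8
```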
This result also has an interesting interpretation in the Euclidean setting. Indeed, consider the sphere S of radius √n/2 centered at (1/2, 1/2, . . . , 1/2) ∈ R^n. Clearly, all the vertices of the binary hypercube {0, 1}^n lie on this sphere. As such, let P be the set of points on S that exists according to Lemma 14.4.1. A pair p, q of points of P have Euclidean distance at least √(dH(p, q)) ≥ √(n/4) = √n/2 from each other. We conclude:

Lemma 14.4.2. Consider the unit hypersphere S in Rn . The sphere S contains a set Q of points, such that each
pair of points is at (Euclidean) distance at least one from each other, and |Q| ≥ exp(n/16).

Proof: Take the above point set, and scale it down by a factor of √n/2. ■

14.5. Bibliographical notes


Section 14.3 is based on Section 4.2 in [MR95]. A similar result to Theorem 14.3.4 is known for the case of the
wrapped butterfly topology (which is similar to the hypercube topology but every node has a constant degree,
and there is no clear symmetry). The interested reader is referred to [MU05].

14.6. Exercises
Exercise 14.6.1 (More binary strings. More!). To some extent, Lemma 14.4.1 is somewhat silly, as one can prove a better bound by direct argumentation. Indeed, for a fixed binary string x of length n, show a bound on the number of strings in the Hamming ball around x of radius n/4 (i.e., binary strings of distance at most n/4 from x). (Hint: interpret the special case of the Chernoff inequality as an inequality over binomial coefficients.)
Next, argue that the greedy algorithm, which repeatedly picks a string at distance ≥ n/4 from all strings picked so far, picks at least exp(n/8) strings before stopping.

References
[KKT91] C. Kaklamanis, D. Krizanc, and T. Tsantilas. Tight bounds for oblivious routing in the hypercube.
Math. sys. theory, 24(1): 223–232, 1991.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
[MU05] M. Mitzenmacher and U. Upfal. Probability and Computing – randomized algorithms and prob-
abilistic analysis. Cambridge, 2005.

Chapter 15

Min Cut
To acknowledge the corn - This purely American expression means to admit the losing of an argument, especially in regard
to a detail; to retract; to admit defeat. It is over a hundred years old. Andrew Stewart, a member of Congress, is said to
have mentioned it in a speech in 1828. He said that haystacks and cornfields were sent by Indiana, Ohio and Kentucky to
Philadelphia and New York. Charles A. Wickliffe, a member from Kentucky questioned the statement by commenting that
haystacks and cornfields could not walk. Stewart then pointed out that he did not mean literal haystacks and cornfields, but
the horses, mules, and hogs for which the hay and corn were raised. Wickliffe then rose to his feet, and said, “Mr. Speaker, I
acknowledge the corn”.

Funk, Earle, A Hog on Ice and Other Curious Expressions



15.1. Branching processes – Galton-Watson Process


15.1.1. The problem
In the 19th century, Victorians were worried that aristocratic surnames were disappearing, as family names
passed on only through the male children. As such, a family with no male children had its family name
disappear. So, imagine the number of male children of a person is an independent random variable X ∈
{0, 1, 2, . . .}. Starting with a single person, its family (as far as male children are concerned) is a random tree
with the degree of a node being distributed according to X. We continue recursively in constructing this tree,
again, sampling the number of children for each current leaf according to the distribution of X. It is not hard to
see that a family disappears if E[X] ≤ 1, and it has a constant probability of surviving if E[X] > 1.
Francis Galton asked the question of what is the probability of such a blue-blood family name to survive,
and this question was answered by Henry William Watson [WG75]. The Victorians were worried about strange
things, see [Gre69] for a provocatively titled article from the period, and [Ste12] for a more recent take on this
issue.
Of course, since infant mortality is dramatically down (as is the number of aristocrat males dying to main-
tain the British empire), the probability of family names to disappear is now much lower than it was in the 19th
century. Interestingly, countries where family names were introduced a long time ago have very few surnames (e.g., Koreans have about 250 surnames, and three surnames form 45% of the population). On the other hand, countries that introduced surnames more recently have dramatically more surnames (for example, the Dutch have had surnames only for the last 200 years, and there are 68,000 different family names).

Here we are going to look at a very specific variant of this problem. Imagine starting with a single male. A male has exactly two children, and each of them is a male with probability half (i.e., the Y-chromosome is passed only to male children). As such, the natural question is what is the probability that h generations down, there is a male descendant all of whose ancestors are male (i.e., he carries the original family name, and the original Y-chromosome).

15.1.2. On coloring trees


Let T h be a complete binary tree of height h. We randomly color its edges by black and white. Namely, for each
edge we independently choose its color to be either black or white, with equal probability (say, black indicates
the child is male). We are interested in the event that there exists a path from the root of T h to one of its leafs,
that is all black. Let Eh denote this event, and let ρh = P[Eh ]. Observe that ρ0 = 1 and ρ1 = 3/4 (see below).
To bound this probability, consider the root u of T_h and its two children u_ℓ and u_r. The probability that there is a black path from u_ℓ to one of the leaves of its subtree is ρ_{h−1}, and as such, the probability that there is a black path from u through u_ℓ to a leaf of the subtree of u_ℓ is P[the edge uu_ℓ is colored black] · ρ_{h−1} = ρ_{h−1}/2. As such, the probability that there is no black path through u_ℓ is 1 − ρ_{h−1}/2. Therefore, the probability of not having a black path from u to a leaf (through either child) is (1 − ρ_{h−1}/2)². The desired probability is the complement; that is,

    ρ_h = 1 − (1 − ρ_{h−1}/2)² = (ρ_{h−1}/2)(2 − ρ_{h−1}/2) = ρ_{h−1} − ρ²_{h−1}/4 = f(ρ_{h−1})    for    f(x) = x − x²/4.

The starting values are ρ_0 = 1, and ρ_1 = 3/4.
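The recurrence is easy to iterate numerically. The sketch below (illustrative code, not from the notes) computes ρ_h and checks it against the bounds established next — the lower bound 1/(h + 1) of Lemma 15.1.1 and, with a generous constant, the O(1/h) upper bound of Lemma 15.1.2:

```python
def survival_probability(h):
    # Iterate rho_h = rho_{h-1} - rho_{h-1}^2 / 4, starting from rho_0 = 1.
    rho = 1.0
    for _ in range(h):
        rho -= rho * rho / 4.0
    return rho

for h in range(1, 200):
    rho = survival_probability(h)
    assert rho >= 1.0 / (h + 1)   # Lemma 15.1.1
    assert rho <= 8.0 / h         # rho_h = O(1/h), with a generous constant
```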


Figure 15.1: A graph of the function f (x) = x − x2 /4.

Lemma 15.1.1. We have that ρ_h ≥ 1/(h + 1).

Proof: (Feel free to skip reading.) The proof is by induction. For h = 1, we have ρ_1 = 3/4 ≥ 1/(1 + 1).
Observe that ρ_h = f(ρ_{h−1}) for f(x) = x − x²/4, and f′(x) = 1 − x/2. As such, f′(x) > 0 for x ∈ [0, 1] and f(x) is increasing in the range [0, 1]. As such, by induction, we have that

    ρ_h = f(ρ_{h−1}) ≥ f(1/((h − 1) + 1)) = 1/h − 1/(4h²).

We need to prove that ρ_h ≥ 1/(h + 1), which is implied by the above if

    1/h − 1/(4h²) ≥ 1/(h + 1)  ⇔  4h(h + 1) − (h + 1) ≥ 4h²  ⇔  4h² + 4h − h − 1 ≥ 4h²  ⇔  3h ≥ 1,

which trivially holds. ■

Lemma 15.1.2. We have that ρ_h = O(1/h).

Proof: (Feel free to skip reading.) We prove the claim for an infinite number of values of h – the claim then follows for all h by fiddling with the constants. The claim trivially holds for small values of h. For any j > 0, let h_j be the minimal index such that ρ_{h_j} ≤ 1/2^j. It is easy to verify that ρ_{h_j} ≥ 1/2^{j+1}. We claim (mysteriously) that

    h_{j+1} − h_j ≤ (ρ_{h_j} − ρ_{h_{j+1}}) / (ρ²_{h_{j+1}}/4).

Indeed, ρ_{k+1} is the number resulting from removing ρ²_k/4 from ρ_k. Namely, the sequence ρ_1, ρ_2, . . . is a monotonically decreasing sequence of numbers in the interval [0, 1], where the gaps between consecutive numbers decrease. In particular, to get from ρ_{h_j} down to ρ_{h_{j+1}}, the gaps used were of size at least ∆ = ρ²_{h_{j+1}}/4, which means that there are at most (ρ_{h_j} − ρ_{h_{j+1}})/∆ numbers in the series between these two elements. As such, since ρ_{h_j} ≤ 1/2^j and ρ_{h_{j+1}} ≥ 1/2^{j+2}, we have

    h_{j+1} − h_j ≤ (ρ_{h_j} − ρ_{h_{j+1}}) / (ρ²_{h_{j+1}}/4) ≤ (1/2^j − 1/2^{j+2}) / (1/2^{2(j+2)+2}) ≤ 2^{2j+6}/2^j = 2^{j+6}.

This implies that h_j ≤ (h_j − h_{j−1}) + (h_{j−1} − h_{j−2}) + · · · + (h_1 − h_0) ≤ 2^{j+6}. As such, we have ρ_{h_j} ≤ 1/2^j ≤ 2^6/2^{j+6} ≤ 2^6/h_j, which implies the claim. ■

15.2. Min Cut


15.2.1. Problem Definition

Let G = (V, E) be an undirected graph with n vertices and m edges. We are interested in cuts in G.

Definition 15.2.1. A cut in G is a partition of the vertices of V into two sets S and V \ S, where the edges of the cut are

    (S, V \ S) = { uv | u ∈ S, v ∈ V \ S, and uv ∈ E },

where S ≠ ∅ and V \ S ≠ ∅. We will refer to the number of edges in the cut (S, V \ S) as the size of the cut. For an example of a cut, see the figure on the right.

We are interested in the problem of computing the minimum cut (i.e., mincut), that is, the cut in the graph with minimum cardinality. Specifically, we would like to find a set S ⊆ V such that (S, V \ S) is as small as possible, and neither S nor V \ S is empty.

15.2.2. Some Definitions


We remind the reader of the following concepts. The conditional probability of X given Y is

    P[X = x | Y = y] = P[(X = x) ∩ (Y = y)] / P[Y = y].

An equivalent, useful restatement of this is that

    P[(X = x) ∩ (Y = y)] = P[X = x | Y = y] · P[Y = y].    (15.1)

The following is easy to prove by induction using Eq. (15.1).

Lemma 15.2.2. Let E1, . . . , En be n events which are not necessarily independent. Then,

    P[∩_{i=1}^{n} Ei] = P[E1] · P[E2 | E1] · P[E3 | E1 ∩ E2] · · · P[En | E1 ∩ . . . ∩ En−1].

15.3. The Algorithm

The basic operation used by the algorithm is edge contraction, depicted in Figure 15.2. We take an edge e = xy in G and merge the two vertices into a single vertex. The new resulting graph is denoted by G/xy. Note that we remove self loops created by the contraction. However, since the resulting graph is no longer a regular graph, it has parallel edges – namely, it is a multi-graph. We represent a multi-graph as a regular graph with multiplicities on the edges. See Figure 15.3.

Figure 15.2: (a) A contraction of the edge xy. (b) The resulting graph.

The edge contraction operation can be implemented in O(n) time for a graph with n vertices. This is done by merging the adjacency lists of the two vertices being contracted, and then using hashing to do the fix-ups (i.e., we need to fix the adjacency lists of the vertices that are connected to the two contracted vertices).

Figure 15.3: (a) A multi-graph. (b) A minimum cut in the resulting multi-graph.

Note that the cut is now computed counting multiplicities (i.e., if e is in the cut and it has weight w, then the contribution of e to the cut weight is w).

Observation 15.3.1. A set of vertices in G/xy corresponds to a set of vertices in the graph G. Thus a cut in G/xy always corresponds to a valid cut in G. However, there are cuts in G that do not exist in G/xy. For example, the cut S = {x} does not exist in G/xy. As such, the size of the minimum cut in G/xy is at least as large as the minimum cut in G (as long as G/xy has at least one edge), since any cut in G/xy has a corresponding cut of the same cardinality in G.

Our algorithm works by repeatedly performing edge contractions. This is beneficial as this shrinks the
underlying graph, and we would compute the cut in the resulting (smaller) graph. An “extreme” example of
this, is shown in Figure 15.4, where we contract the graph into a single edge, which (in turn) corresponds to
a cut in the original graph. (It might help the reader to think about each vertex in the contracted graph, as
corresponding to a connected component in the original graph.)
Figure 15.4 also demonstrates the problem with taking this approach. Indeed, the resulting cut is not the
minimum cut in the graph.
So, why did the algorithm fail to find the minimum cut in this case?¬ The failure occurs because of the
contraction at Figure 15.4 (e), as we had contracted an edge in the minimum cut. In the new graph, depicted in
Figure 15.4 (f), there is no longer a cut of size 3, and all cuts are of size 4 or more. Specifically, the algorithm
succeeds only if it does not contract an edge in the minimum cut.

Observation 15.3.2. Let e1 , . . . , en−2 be a sequence of edges in G, such that none of them is in the minimum cut,
and such that G′ = G/ {e1 , . . . , en−2 } is a single multi-edge. Then, this multi-edge corresponds to a minimum
cut in G.

Note, that the claim in the above observation is only in one direction. We might be able to still compute
a minimum cut, even if we contract an edge in a minimum cut, the reason being that a minimum cut is not
¬
Naturally, if the algorithm had succeeded in finding the minimum cut, this would have been our success.


Figure 15.4: (a) Original graph. (b)–(j) a sequence of contractions in the graph, and (h) the cut in the original
graph, corresponding to the single edge in (h). Note that the cut of (h) is not a mincut in the original graph.

Algorithm MinCut(G)
G0 ← G
i=0
while Gi has more than two vertices do
Pick randomly an edge ei from the edges of Gi
Gi+1 ← Gi /ei
i←i+1
Let (S , V \ S ) be the cut in the original graph
corresponding to the single edge in Gi
return (S , V \ S ).

Figure 15.5: The minimum cut algorithm.

unique. In particular, another minimum cut might survived the sequence of contractions that destroyed other
minimum cuts.
Using Observation 15.3.2 in an algorithm is problematic, since the argumentation is circular, how can we
find a sequence of edges that are not in the cut without knowing what the cut is? The way to slice the Gordian
knot here, is to randomly select an edge at each stage, and contract this random edge.
See Figure 15.5 for the resulting algorithm MinCut.

15.3.1. Analysis
15.3.1.1. The probability of success
Naturally, if we are extremely lucky, the algorithm would never pick an edge in the mincut, and the algorithm would succeed. The ultimate question here is what is the probability of success. If it is relatively “large”, then this algorithm is useful, since we can run it several times and return the best result computed. If, on the other hand, this probability is tiny, then we are working in vain, since this approach would not work.
Lemma 15.3.3. If a graph G has a minimum cut of size k, and G has n vertices, then |E(G)| ≥ kn/2.

Proof: Each vertex degree is at least k, as otherwise the vertex itself (together with its incident edges) would form a cut of size smaller than k. As such, there are at least Σ_{v∈V} degree(v)/2 ≥ nk/2 edges in the graph. ■

Lemma 15.3.4. Fix a specific minimum cut C = (S, V \ S) in the graph G. If we pick an edge e uniformly at random from G, then with probability at most 2/n it belongs to the minimum cut C.

Proof: There are at least nk/2 edges in the graph, and exactly k edges in the minimum cut. Thus, the probability of picking an edge from the minimum cut is at most k/(nk/2) = 2/n. ■

The following lemma shows (surprisingly) that MinCut succeeds with reasonable probability.

Lemma 15.3.5. MinCut outputs the mincut with probability ≥ 2/(n(n − 1)).
Proof: Let Ei be the event that ei is not in the minimum cut of Gi. By Observation 15.3.2, MinCut outputs the minimum cut if the events E0, . . . , En−3 all happen (namely, all edges picked are outside the minimum cut). By Lemma 15.3.4, it holds that

    P[Ei | E0 ∩ E1 ∩ · · · ∩ Ei−1] ≥ 1 − 2/|V(Gi)| = 1 − 2/(n − i).

Implying that

    Δ = P[E0 ∩ · · · ∩ En−3] = P[E0] · P[E1 | E0] · P[E2 | E0 ∩ E1] · · · P[En−3 | E0 ∩ · · · ∩ En−4].

As such, we have

    Δ ≥ ∏_{i=0}^{n−3} (1 − 2/(n − i)) = ∏_{i=0}^{n−3} (n − i − 2)/(n − i)
      = (n−2)/n · (n−3)/(n−1) · (n−4)/(n−2) · (n−5)/(n−3) · · · 3/5 · 2/4 · 1/3
      = 2/(n(n − 1)).    ■

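The contraction process is compact enough to sketch directly in code. The following Python sketch is our own illustration (all function names are ours, and it assumes the input graph is connected): the multigraph is an edge list, super-vertices are tracked with a union-find structure, and picking a uniformly random surviving edge is implemented by picking a uniformly random original edge and skipping it if it has become a self-loop, which induces the same distribution on the non-loop edges.

```python
import random

def contract_once(edges, n, rng):
    """One run of MinCut on a multigraph given as a list of (u, v) pairs
    over vertices 0..n-1.  Returns the size of the resulting cut."""
    parent = list(range(n))          # union-find over super-vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    components = n
    while components > 2:
        u, v = rng.choice(edges)     # uniform among the original edges...
        ru, rv = find(u), find(v)
        if ru != rv:                 # ...self-loops are skipped, so this is
            parent[ru] = rv          # uniform among the surviving edges
            components -= 1
    # Parallel edges between the two final super-vertices form the cut.
    return sum(1 for u, v in edges if find(u) != find(v))

def min_cut_rep(edges, n, repetitions, seed=0):
    """Amplification (MinCutRep): repeat and keep the smallest cut found."""
    rng = random.Random(seed)
    return min(contract_once(edges, n, rng) for _ in range(repetitions))
```

For a graph made of two triangles joined by a single bridge edge, the minimum cut has size one, and repeating the contraction a few hundred times finds it with overwhelming probability, matching Lemma 15.3.9.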
15.3.1.2. Running time analysis.
Observation 15.3.6. MinCut runs in O(n2 ) time.

Observation 15.3.7. The algorithm always outputs a cut, and the cut is not smaller than the minimum cut.

Definition 15.3.8 (informal). Amplification is the process of repeating an experiment again and again until the desired event happens, with good probability.

Let MinCutRep be the algorithm that runs MinCut n(n − 1) times and returns the minimum cut computed over all those independent executions of MinCut.
Lemma 15.3.9. The probability that MinCutRep fails to return the minimum cut is < 0.14.

Proof: The probability that a single execution of MinCut fails to output the mincut is at most 1 − 2/(n(n − 1)), by Lemma 15.3.5. Now, MinCutRep fails only if all the n(n − 1) executions of MinCut fail. But these executions are independent; as such, the probability of this happening is at most

    (1 − 2/(n(n − 1)))^{n(n−1)} ≤ exp(−(2/(n(n − 1))) · n(n − 1)) = exp(−2) < 0.14,

since 1 − x ≤ e^{−x} for 0 ≤ x ≤ 1. ■
Theorem 15.3.10. One can compute the minimum cut in O(n⁴) time with constant probability of getting a correct result. In O(n⁴ log n) time, the minimum cut is returned with high probability.

15.3.2. An alternative implementation using MST


The algorithm. The above algorithm can be restated as follows. Randomly assign weights to the edges of G
(say, by picking numbers in [0, 1]). Next, compute the MST T of the graph according to these weights. Remove
the heaviest edge in the MST. The resulting partition of T into two trees, corresponds to a cut in the original
graph. Return this cut as a candidate to be the minimum cut.

The analysis. To see that this algorithm is equivalent to MinCut (Figure 15.5), observe that the contraction algorithm simulates Kruskal's MST algorithm when run on the randomly weighted edges. First, imagine implementing MinCut so that it keeps parallel edges. Then, the edges connecting two vertices that were not contracted together are exactly the edges between the two corresponding connected components. Picking a random edge to contract is equivalent to picking the edge with the minimum random weight. Thus, the MST algorithm here just simulates MinCut (or vice versa).
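A sketch of this MST-based variant (our own illustration, assuming a connected input graph; the function name is ours): shuffling the edge list is equivalent to sorting by i.i.d. random weights, Kruskal's algorithm then adds edges in increasing weight, and the last MST edge added is the heaviest one.

```python
import random

def mst_variant_cut(edges, n, rng):
    """One run of the MST reformulation of MinCut: random weights,
    Kruskal's algorithm, and a cut at the heaviest MST edge."""
    order = list(edges)
    rng.shuffle(order)               # equivalent to i.i.d. random weights

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for u, v in order:               # Kruskal: add edges in weight order
        if find(u) != find(v):
            parent[find(u)] = find(v)
            mst.append((u, v))

    # Remove the heaviest (= last added) MST edge: rebuild the forest
    # without it, and count the original edges crossing the two trees.
    parent = list(range(n))
    for u, v in mst[:-1]:
        parent[find(u)] = find(v)
    return sum(1 for u, v in edges if find(u) != find(v))
```

Each run produces exactly the same distribution over candidate cuts as a run of MinCut, so the same 2/(n(n − 1)) success bound applies.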

A small optimization. It is possible to compute the heaviest edge in the MST, and the partition it induces in
(deterministic) linear time – it is a nice example of the search and prune technique.
Exercise 15.3.11. Given a graph G with weights on the edges, show how to compute the maximum weight edge in the MST of G in O(n + m) time, where n and m are the number of vertices and edges of G, respectively.

Thus, this yields an O(n + m) implementation of MinCut. We get the following result.

Lemma 15.3.12. MinCut can be implemented to run in O(n + m) time, and it outputs the mincut with probability ≥ 2/(n(n − 1)).

Contract(G, t)
begin
    while |V(G)| > t do
        Pick a random edge e in G.
        G ← G/e
    return G
end

FastCut(G = (V, E))
    G – multi-graph
begin
    n ← |V(G)|
    if n ≤ 6 then
        Compute (via brute force) minimum cut of G and return cut.
    t ← ⌈1 + n/√2⌉
    H1 ← Contract(G, t)
    H2 ← Contract(G, t)
    /* Contract is randomized!!! */
    X1 ← FastCut(H1), X2 ← FastCut(H2)
    return minimum cut out of X1 and X2.
end

Figure 15.6: Contract(G, t) shrinks G till it has only t vertices. FastCut computes the minimum cut using Contract.

15.4. A faster algorithm


The algorithm presented in the previous section is extremely simple. This raises the question of whether we can get a faster algorithm­?
So, why does MinCutRep need so many executions? Well, the probability of success in the first ν iterations is

    P[E0 ∩ · · · ∩ Eν−1] ≥ ∏_{i=0}^{ν−1} (1 − 2/(n − i)) = ∏_{i=0}^{ν−1} (n − i − 2)/(n − i)
      = (n−2)/n · (n−3)/(n−1) · (n−4)/(n−2) · · · = (n − ν)(n − ν − 1)/(n · (n − 1)).    (15.2)

Namely, this probability deteriorates very quickly toward the end of the execution, when the graph becomes small enough. (To see this, observe that for ν = n/2, the probability of success is roughly 1/4, but for ν = n − √n the probability of success is only roughly 1/n.)
So, the key observation is that as the graph gets smaller, the probability of making a bad choice increases. So, instead of doing the amplification from the outside of the algorithm, we will run the new algorithm more times when the graph is smaller. Namely, we put the amplification directly into the algorithm. The basic new operation we use is Contract, depicted in Figure 15.6, which also depicts the new algorithm FastCut.
 
Lemma 15.4.1. The running time of FastCut(G) is O(n² log n), where n = |V(G)|.

Proof: Well, we perform two calls to Contract(G, t), which takes O(n²) time. And then we perform two recursive calls on the resulting graphs. We have

    T(n) = O(n²) + 2T(n/√2).

The solution to this recurrence is O(n² log n), as one can easily (and should) verify. ■
­
This would require a more involved algorithm, that is life.

Exercise 15.4.2. Show that one can modify FastCut so that it uses only O(n2 ) space.
Lemma 15.4.3. The probability that Contract(G, n/√2) did not contract the minimum cut is at least 1/2. Namely, the probability that the minimum cut in the contracted graph is still a minimum cut in the original graph is at least 1/2.

Proof: Just plug ν = n − t = n − ⌈1 + n/√2⌉ into Eq. (15.2). We have

    P[E0 ∩ · · · ∩ Eν−1] ≥ t(t − 1)/(n · (n − 1)) = ⌈1 + n/√2⌉ (⌈1 + n/√2⌉ − 1)/(n(n − 1)) ≥ 1/2. ■
The following lemma bounds the probability of success.

Lemma 15.4.4. FastCut finds the minimum cut with probability Ω(1/ log n).

Proof: Let Th be the recursion tree of the algorithm, of depth h = Θ(log n). Color an edge of the recursion tree black if the corresponding call to Contract succeeded (i.e., did not contract an edge of the minimum cut). Clearly, the algorithm succeeds if there is a path from the root to a leaf that is all black. This is exactly the setting of Lemma 15.1.1, and we conclude that the probability of success is at least 1/(h + 1) = Θ(1/ log n), as desired. ■
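The two routines of Figure 15.6 can be sketched as follows (our own illustration, using the same edge-list representation as before and assuming a connected graph; the brute-force base case enumerates all 2-colorings of at most six super-vertices).

```python
import math
import random

def contract_to(edges, n, t, rng):
    """Contract(G, t): contract random edges until t super-vertices remain.
    Returns the contracted multigraph as (relabeled edge list, t)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    comps = n
    while comps > t:
        u, v = rng.choice(edges)        # self-loops are skipped
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            comps -= 1
    label, out = {}, []
    for u, v in edges:                  # keep only edges between components
        ru, rv = find(u), find(v)
        if ru != rv:
            out.append((label.setdefault(ru, len(label)),
                        label.setdefault(rv, len(label))))
    return out, comps

def fast_cut(edges, n, rng):
    """FastCut: two independent partial contractions, recurse, keep the best."""
    if n <= 6:
        best = len(edges)
        for mask in range(1, 2 ** (n - 1)):    # all nontrivial 2-colorings
            cut = sum(1 for u, v in edges
                      if ((mask >> u) & 1) != ((mask >> v) & 1))
            best = min(best, cut)
        return best
    t = math.ceil(1 + n / math.sqrt(2))
    h1, n1 = contract_to(edges, n, t, rng)
    h2, n2 = contract_to(edges, n, t, rng)
    return min(fast_cut(h1, n1, rng), fast_cut(h2, n2, rng))
```

On the complete graph K8, for instance, every cut has size k(8 − k) ≥ 7, and since the contracted graphs always keep a singleton super-vertex, the returned value is always exactly 7.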
Exercise 15.4.5. Prove that running FastCut repeatedly c · log² n times guarantees that the algorithm outputs the minimum cut with probability ≥ 1 − 1/n², say, for c a large enough constant.
Theorem 15.4.6. One can compute the minimum cut in a graph G with n vertices in O(n2 log3 n) time. The
algorithm succeeds with probability ≥ 1 − 1/n2 .
Proof: We do amplification on FastCut by running it O(log2 n) times. The running time bound follows from
Lemma 15.4.1. The bound on the probability follows from Lemma 15.4.4, and using the amplification analysis
as done in Lemma 15.3.9 for MinCutRep. ■

15.5. Bibliographical Notes


The MinCut algorithm was developed by David Karger during his PhD thesis in Stanford. The fast algorithm
is a joint work with Clifford Stein. The basic algorithm of the mincut is described in [MR95, pages 7–9], the
faster algorithm is described in [MR95, pages 289–295].

Galton-Watson process. The idea of using coloring of the edges of a tree to analyze FastCut might be new
(i.e., Section 15.1.2).

References
[Gre69] W. Greg. Why are Women Redundant? Trübner, 1869.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
[Ste12] E. Steinlight. Why novels are redundant: sensation fiction and the overpopulation of literature.
ELH, 79(2): 501–535, 2012.
[WG75] H. W. Watson and F. Galton. On the probability of the extinction of families. J. Anthrop. Inst.
Great Britain, 4: 138–144, 1875.

Chapter 16

Discrepancy and Derandomization


“Shortly after the celebration of the four thousandth anniversary of the opening of space, Angary J. Gustible discovered
Gustible’s planet. The discovery turned out to be a tragic mistake.
Gustible’s planet was inhabited by highly intelligent life forms. They had moderate telepathic powers. They immediately mind-
read Angary J. Gustible’s entire mind and life history, and embarrassed him very deeply by making up an opera concerning
his recent divorce.”

Gustible’s Planet, Cordwainer Smith

16.1. Discrepancy
Consider a set system (X, R), where n = |X|, and R ⊆ 2X . A natural task is to partition X into two sets S , T ,
such that for any range r ∈ R, we have that χ(r) = |S ∩ r| − |T ∩ r| is minimized. In a perfect partition, we
would have that χ(r) = 0 – the two sets S , T partition every range perfectly in half. A natural way to do so, is
to consider this as a coloring problem – an element of X is colored by +1 if it is in S , and −1 if it is in T .

Definition 16.1.1. Consider a set system S = (X, R), and let χ : X → {−1, +1} be a function (i.e., a coloring). The discrepancy of r ∈ R is χ(r) = |Σ_{x∈r} χ(x)|. The discrepancy of χ is the maximum discrepancy over all the ranges – that is,

    disc(χ) = max_{r∈R} χ(r).

The discrepancy of S is

    disc(S) = min_{χ : X→{−1,+1}} disc(χ).

Bounding the discrepancy of a set system is quite important, as it provides a way to shrink the size of
the set system, while introducing small error. Computing the discrepancy of a set system is generally quite
challenging. A rather decent bound follows by using random coloring.

Definition 16.1.2. For a vector v = (v1 , . . . , vn ) ∈ Rn , ∥v∥∞ = maxi |vi |.

For technical reasons, it is easy to think about the set system as an incidence matrix.

Definition 16.1.3. For an m × n binary matrix M (i.e., each entry is either 0 or 1), consider a vector b ∈ {−1, +1}ⁿ. The discrepancy of b is ∥Mb∥∞.

Theorem 16.1.4. Let M be an n × n binary matrix (i.e., each entry is either 0 or 1). Then there always exists a vector b ∈ {−1, +1}ⁿ, such that ∥Mb∥∞ ≤ 4√(n log n). Specifically, a random coloring provides such a coloring with high probability.

Proof: In the following, m = n denotes the number of rows of M. Let v = (v1, . . . , vn) be a row of M. Choose a random b = (b1, . . . , bn) ∈ {−1, +1}ⁿ. Let i1, . . . , iτ be the indices such that v_{i_j} = 1, and let

    Y = ⟨v, b⟩ = Σ_{i=1}^{n} vi bi = Σ_{j=1}^{τ} v_{i_j} b_{i_j} = Σ_{j=1}^{τ} b_{i_j}.

As such, Y is the sum of τ independent random variables that take values in {−1, +1}. Clearly,

    E[Y] = E[⟨v, b⟩] = E[Σ_i vi bi] = Σ_i E[vi bi] = Σ_i vi E[bi] = 0.

By the Chernoff inequality and the symmetry of Y, we have that, for ∆ = 4√(n ln m), it holds

    P[|Y| ≥ ∆] = 2 P[⟨v, b⟩ ≥ ∆] = 2 P[Σ_{j=1}^{τ} b_{i_j} ≥ ∆] ≤ 2 exp(−∆²/(2τ)) = 2 exp(−8(n ln m)/τ) ≤ 2/m⁸,

since τ ≤ n. In words, the probability that any fixed entry of Mb exceeds (in absolute value) 4√(n ln m) is smaller than 2/m⁸, and thus, by the union bound over the m rows, with probability at least 1 − 2/m⁷, all the entries of Mb have absolute value smaller than 4√(n ln m). In particular, there exists a vector b ∈ {−1, +1}ⁿ such that ∥Mb∥∞ ≤ 4√(n ln m). ■
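A quick experiment (ours, not part of the notes) confirms that a uniformly random coloring already stays well below the 4√(n ln n) bound for a random square 0/1 matrix:

```python
import math
import random

def discrepancy(M, b):
    """||Mb||_inf for a 0/1 matrix M (list of rows) and b in {-1,+1}^n."""
    return max(abs(sum(m_i * b_i for m_i, b_i in zip(row, b))) for row in M)

def random_coloring(n, rng):
    """The random coloring from the proof of Theorem 16.1.4."""
    return [rng.choice((-1, 1)) for _ in range(n)]
```

For n = 128 the bound is about 99.7, while a typical random coloring of a random matrix lands around 25 to 35, so the high-probability statement has plenty of slack.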

We might spend more time on discrepancy later on – it is a fascinating topic, well worth its own course.

16.2. The Method of Conditional Probabilities


In previous lectures, we encountered the following problem.
Problem 16.2.1 (Set Balancing/Discrepancy). Given a binary matrix M of size n × n, find a vector v ∈
{−1, +1}n , such that ∥Mv∥∞ is minimized.

Using a random assignment and the Chernoff inequality, we showed that there exists v such that ∥Mv∥∞ ≤ 4√(n ln n). Can we derandomize this algorithm? Namely, can we come up with an efficient deterministic algorithm that has low discrepancy?
To derandomize our algorithm, construct a computation tree of depth n, where in the ith level we expose the ith coordinate of v. This tree T has depth n. The root represents all possible random choices, while a node at depth i represents all computations in which the first i bits are fixed. For a node u ∈ T, let P(u) be the probability that a random computation starting from u succeeds – here, randomly assigning the remaining bits can be interpreted as a random walk down the tree to a leaf. Formally, the algorithm is successful if it ends up with a vector v such that ∥Mv∥∞ ≤ 4√(n ln n).
Let ul and ur be the two children of a node u. Clearly, P(u) = (P(ul) + P(ur))/2. In particular, max(P(ul), P(ur)) ≥ P(u). Thus, if we could compute P(·) quickly (and deterministically), then we could derandomize the algorithm.

Let C⁺ₘ be the bad event that rₘ · v > 4√(n log n), where rₘ is the mth row of M. Similarly, C⁻ₘ is the bad event that rₘ · v < −4√(n log n), and let Cₘ = C⁺ₘ ∪ C⁻ₘ. Consider the probability P[C⁺ₘ | v1, . . . , vk] (namely, the first k coordinates of v are specified). Let rₘ = (r1, . . . , rn). We have that

    P[C⁺ₘ | v1, . . . , vk] = P[Σ_{i=k+1}^{n} vi ri > 4√(n log n) − Σ_{i=1}^{k} vi ri] = P[Σ_{i≥k+1, ri≠0} vi ri > L] = P[Σ_{i≥k+1, ri=1} vi > L],

where L = 4√(n log n) − Σ_{i=1}^{k} vi ri is a known quantity (since v1, . . . , vk are known). Let V = Σ_{i≥k+1, ri=1} 1. We have

    P[C⁺ₘ | v1, . . . , vk] = P[Σ_{i≥k+1, ri=1} (vi + 1) > L + V] = P[Σ_{i≥k+1, ri=1} (vi + 1)/2 > (L + V)/2].

The last quantity is the probability that in V flips of a fair 0/1 coin one gets more than (L + V)/2 heads. Thus,

    P⁺ₘ = P[C⁺ₘ | v1, . . . , vk] = (1/2^V) Σ_{i=⌈(L+V)/2⌉}^{V} (V choose i).

This implies that we can compute P⁺ₘ in polynomial time! Indeed, we are adding V ≤ n numbers, each of them a binomial coefficient that has a polynomial size representation in n, and can be computed in polynomial time (why?). One can define in a similar fashion P⁻ₘ = P[C⁻ₘ | v1, . . . , vk], and let Pₘ = P⁺ₘ + P⁻ₘ. Clearly, Pₘ can be computed in polynomial time, by applying a similar argument to the computation of P⁻ₘ.

For a node u ∈ T, let v_u denote the portion of v that was fixed when traversing from the root of T to u. Let P(u) = Σ_{m=1}^{n} P[Cₘ | v_u]. By the above discussion, P(u) can be computed in polynomial time. Furthermore, we know, by the previous result on discrepancy, that P(root(T)) < 1 (that was the bound used to show that there exists a good assignment).

As before, for any u ∈ T, we have P(u) ≥ min(P(ul), P(ur)). Thus, we have a polynomial time deterministic algorithm for computing a set balancing with discrepancy smaller than 4√(n log n). Indeed, set u = root(T), and start traversing down the tree. At each stage, compute P(ul) and P(ur) (in polynomial time), and set u to the child with the lower value of P(·). Clearly, after n steps, we reach a leaf that corresponds to a vector v′ such that ∥Mv′∥∞ ≤ 4√(n log n).

Theorem 16.2.2. Using the method of conditional probabilities, one can compute, in polynomial time in n, a vector v ∈ {−1, 1}ⁿ such that ∥Mv∥∞ ≤ 4√(n log n).

Note, that this method might fail to find the best assignment.
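The walk down the computation tree is easy to implement exactly, since the pessimistic estimator is a sum of binomial tails. The sketch below is our own illustration (function names are ours; the bad event for a row is taken as |rₘ · v| > Δ): the algorithm keeps the exact total conditional failure probability below one, which forces the final assignment to violate no row.

```python
import math

def binom_tail(V, k):
    """P[Binomial(V, 1/2) >= k]."""
    if k <= 0:
        return 1.0
    if k > V:
        return 0.0
    return sum(math.comb(V, i) for i in range(k, V + 1)) / 2.0 ** V

def tail_gt(V, x):
    """P[S > x] where S is a sum of V independent +/-1 signs (S = 2h - V)."""
    return binom_tail(V, math.floor((V + x) / 2) + 1)

def failure_prob(M, v, delta):
    """Sum over rows of P[|row . b| > delta], given that b starts with v."""
    n, k, total = len(M[0]), len(v), 0.0
    for row in M:
        fixed = sum(row[i] * v[i] for i in range(k))   # settled part of row . b
        V = sum(row[i] for i in range(k, n))           # entries still random
        # |fixed + S| > delta  <=>  S > delta - fixed or -S > delta + fixed,
        # and by symmetry of S both are upper binomial tails.
        total += tail_gt(V, delta - fixed) + tail_gt(V, delta + fixed)
    return total

def derandomized_balance(M):
    """Fix v one coordinate at a time, always moving to the child with the
    smaller (exact) conditional failure probability."""
    n = len(M[0])
    delta = 4.0 * math.sqrt(n * math.log(len(M)))
    v = []
    for _ in range(n):
        v.append(1 if failure_prob(M, v + [1], delta)
                      <= failure_prob(M, v + [-1], delta) else -1)
    return v
```

Since each row's conditional failure probability is the average of its two children's, the total stays below one at every step, and once all bits are fixed it must be exactly zero.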

16.3. Bibliographical Notes


There is a lot of nice work on discrepancy in geometric settings. See the books [c-dmr-01, Mat99].

References
[Mat99] J. Matoušek. Geometric Discrepancy. Vol. 18. Algorithms and Combinatorics. Springer, 1999.

Chapter 17

Independent set – Turán’s theorem

I don’t know why it should be, I am sure; but the sight of another man asleep in bed when I am up, maddens me. It seems
to me so shocking to see the precious hours of a man’s life - the priceless moments that will never come back to him again -
being wasted in mere brutish sleep.

Jerome K. Jerome, Three Men in a Boat

17.1. Turán’s theorem


17.1.1. Some silly helper lemmas
We will need the following well-known inequality.
Lemma 17.1.1 (AM-GM inequality: arithmetic and geometric means inequality). For any x1, . . . , xn ≥ 0, we have

    (x1 + x2 + · · · + xn)/n ≥ ⁿ√(x1 x2 · · · xn).

This inequality readily implies the “inverse” inequality:

    1/ⁿ√(x1 x2 · · · xn) ≥ n/(x1 + x2 + · · · + xn).
P
Lemma 17.1.2. Let x1, . . . , xn ≥ 0 be n numbers. We have that Σ_{i=1}^{n} 1/xi ≥ n/((Σ_i xi)/n).

Proof: By the AM-GM inequality and then its “inverse” form, we have

    (Σ_{i=1}^{n} 1/xi)/n = (1/x1 + 1/x2 + · · · + 1/xn)/n ≥ ⁿ√((1/x1)(1/x2) · · · (1/xn)) = 1/ⁿ√(x1 x2 · · · xn) ≥ n/(x1 + x2 + · · · + xn). ■

Lemma 17.1.3. Let G = (V, E) be a graph with n vertices, and let d_G be the average degree in the graph. We have that

    Σ_{v∈V} 1/(1 + d(v)) ≥ n/(1 + d_G).

Proof: Let the ith vertex of G be vi. Set xi = 1 + d(vi), for all i. By Lemma 17.1.2, we have

    Σ_{i=1}^{n} 1/(1 + d(vi)) = Σ_{i=1}^{n} 1/xi ≥ n/((Σ_i xi)/n) = n/((Σ_i (1 + d(vi)))/n) = n/(1 + d_G). ■

17.1.2. Statement and proof
Theorem 17.1.4 (Turán’s theorem). Let G = (V, E) be a graph with n vertices. The graph G has an indepen-
n
dent set of size at least , where dG is the average vertex degree in G.
1 + dG
Proof: Let π = (π1 , . . . , πn ) be a random permutation of the vertices of G. Pick the vertex πi into the indepen-
dent set if none of its neighbors appear before it in π. Clearly, v appears in the independent set if and only if
it appears in the permutation before all its d(v) neighbors. The probability for this is 1/(1 + d(v)). Thus, the
expected size of the independent set is (exactly)
X 1
τ= , (17.1)
v∈V
1 + d(v)

by linearity of expectations. Thus, by the probabilistic method, there exists an independent set in G of size at
least τ. The claim now readily follows from Lemma 17.1.3. ■
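The proof is constructive, and the resulting one-pass algorithm is tiny. A sketch (our own illustration; the graph is given as an edge list):

```python
import random

def permutation_independent_set(edges, n, rng):
    """Pick v into the set iff v appears before all of its neighbors in a
    uniformly random permutation of the vertices 0..n-1."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    pos = list(range(n))
    rng.shuffle(pos)            # pos[v] = rank of v in the permutation
    return [v for v in range(n)
            if all(pos[u] > pos[v] for u in adj[v])]
```

The output is always independent: two adjacent vertices cannot both precede each other. On the Petersen graph (10 vertices, all of degree 3), the expected size is 10/4 = 2.5, so by the probabilistic method some permutation yields an independent set of size at least three.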

17.1.3. An alternative proof of Turán’s theorem


Following a post of this write-up on my blog, readers suggested two modifications. We present an alternative
proof incorporating both suggestions.
Alternative proof of Theorem 17.1.4: We associate a charge of size 1/(d(v) + 1) with each vertex v of G. Let γ(G) denote the total charge of the vertices of G. We prove, using induction, that there is always an independent set in G of size at least γ(G). If G is the empty graph, then the claim trivially holds. Otherwise, assume that it holds for any graph with at most n − 1 vertices, and consider the vertex v of lowest degree in G. The total charge of v and its neighbors is

    1/(d(v) + 1) + Σ_{uv∈E} 1/(d(u) + 1) ≤ 1/(d(v) + 1) + Σ_{uv∈E} 1/(d(v) + 1) = (d(v) + 1)/(d(v) + 1) = 1,

since d(u) ≥ d(v), for all uv ∈ E. Now, consider the graph H resulting from removing v and its neighbors from G. Clearly, γ(H) is larger than (or equal to) the total charge of the vertices of V(H) in G, as their degrees have either decreased or remained the same. As such, by induction, there is an independent set in H of size at least γ(H). Together with v, this forms an independent set in G of size at least γ(H) + 1 ≥ γ(G). This implies that there exists an independent set in G of size

    τ = Σ_{v∈V} 1/(1 + d(v)).    (17.2)

Now, set x_v = 1 + d(v), and observe that

    (n + 2|E|)τ = (Σ_{v∈V} x_v)(Σ_{v∈V} 1/x_v) ≥ (Σ_{v∈V} √x_v · (1/√x_v))² = n²,

using the Cauchy-Schwarz inequality. Namely, τ ≥ n²/(n + 2|E|) = n/(1 + 2|E|/n) = n/(1 + d_G). ■
Lemma 17.1.5 (Cauchy-Schwarz inequality). For positive numbers α1, . . . , αn, β1, . . . , βn, we have

    Σ_i αi βi ≤ √(Σ_i αi²) · √(Σ_i βi²).

17.1.4. An algorithm for the weighted case
In the weighted case, we associate a weight w(v) with each vertex v of G, and we are interested in the maximum weight independent set in G. Deploying the algorithm described in the first proof of Theorem 17.1.4 implies the following.

Lemma 17.1.6. The graph G = (V, E) has an independent set of weight ≥ Σ_{v∈V} w(v)/(1 + d(v)).

Proof: By linearity of expectations, the expected weight of the independent set computed is equal to

    Σ_{v∈V} w(v) · P[v in the independent set] = Σ_{v∈V} w(v)/(1 + d(v)). ■


Chapter 18

Derandomization using Conditional Expectations


Yes, my guard stood hard when abstract threats
Too noble to neglect
Deceived me into thinking
I had something to protect
Good and bad, I define these terms
Quite clear, no doubt, somehow
Ah, but I was so much older then
I'm younger than that now
My Back Pages, Bob Dylan

18.1. Method of conditional expectations


Imagine that we have a randomized algorithm that uses n random bits X1, . . . , Xn as input, and outputs a solution of quality f (X1, . . . , Xn). Assume that, given values v1, . . . , vk ∈ {0, 1}, one can compute, efficiently and deterministically, the quantity

    E f (v1, . . . , vk) = E[f (v1, . . . , vk, Xk+1, . . . , Xn)] = E[f (X1, . . . , Xn) | X1 = v1, . . . , Xk = vk],

by a given procedure evalE f. In such settings, one can compute, efficiently and deterministically, an assignment v1, . . . , vn such that

    f (v1, . . . , vn) ≥ E f, where E f = E[f (X1, . . . , Xn)].

Alternatively, one can find an assignment u1, . . . , un such that f (u1, . . . , un) ≤ E[f (X1, . . . , Xn)].

The algorithm. Assume the algorithm has computed a partial assignment v1, . . . , vk, such that αk = E f (v1, . . . , vk) ≥ E f. The algorithm then computes the two values

    αk,0 = E f (v1, . . . , vk, 0)  and  αk,1 = E f (v1, . . . , vk, 1).

Observe that

    αk = E f (v1, . . . , vk) = P[Xk+1 = 0] E f (v1, . . . , vk, 0) + P[Xk+1 = 1] E f (v1, . . . , vk, 1) = (αk,0 + αk,1)/2.

As such, there is an i such that αk,i ≥ αk. The algorithm sets vk+1 = i, and continues to the next iteration.

Correctness. This is hopefully clear. Initially, α0 = E f . In each iteration, the algorithm makes a choice, such
that αk ≥ αk−1 . Thus,
αn = E f (v1 , . . . , vn ) = f (v1 , . . . , vn ) ≥ αn−1 ≥ · · · ≥ α0 = E f.

Running time. The algorithm performs 2n invocations of evalE f .

Result.

Theorem 18.1.1. Given a function f (X1, . . . , Xn) over n random binary variables, such that one can compute deterministically E f (v1, . . . , vk) = E[f (X1, . . . , Xn) | X1 = v1, . . . , Xk = vk] in T(n) time, one can compute an assignment v1, . . . , vn such that f (v1, . . . , vn) ≥ E f = E[f (X1, . . . , Xn)]. The running time of the algorithm is O(n + nT(n)).

18.1.1. Applications
18.1.1.1. Max kSAT
Given a boolean formula F with n variables and m clauses, where each clause has exactly k literals, let f (X1, . . . , Xn) be the number of clauses the assignment X1, . . . , Xn satisfies. Clearly, one can compute f in O(mk) time. More generally, given a partial assignment v1, . . . , vt, one can compute αt = E f (v1, . . . , vt). Indeed, scan F and assign all the literals that depend on the variables X1, . . . , Xt their values. A literal evaluating to one satisfies its clause, and we count the clause as satisfied. What remains are clauses with at most k unassigned literals each. A clause with i unassigned literals (and no satisfied literal) is satisfied with probability exactly 1 − 1/2ⁱ. Thus, summing these probabilities over the leftover clauses gives us the desired value. This takes O(mk) time. Using Theorem 18.1.1 we get the following.

Lemma 18.1.2. Let F be a kSAT formula with n variables and m clauses. One can compute deterministically an assignment that satisfies at least (1 − 1/2^k)m clauses of F. This takes O(mnk) time.
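A sketch of this derandomization (our own illustration; literals are encoded as nonzero integers, with +i for x_i and −i for its negation, variables 1-indexed, and each clause is assumed to mention distinct variables):

```python
def cond_expectation(clauses, assign):
    """E[#satisfied clauses | X_1..X_t fixed], where assign is a list of
    booleans for the first t variables and the rest are uniform coin flips."""
    t = len(assign)
    total = 0.0
    for clause in clauses:
        satisfied, unset = False, 0
        for lit in clause:
            var = abs(lit)
            if var <= t:
                if (lit > 0) == assign[var - 1]:
                    satisfied = True
            else:
                unset += 1
        # A not-yet-satisfied clause with i free literals is satisfied
        # with probability 1 - 2^-i.
        total += 1.0 if satisfied else 1.0 - 0.5 ** unset
    return total

def derandomized_max_sat(clauses, n):
    """Fix the variables one by one, never letting the conditional
    expectation drop (the method of conditional expectations)."""
    assign = []
    for _ in range(n):
        if cond_expectation(clauses, assign + [True]) >= \
           cond_expectation(clauses, assign + [False]):
            assign.append(True)
        else:
            assign.append(False)
    return assign
```

Since the conditional expectation is the exact average of its two children at every step, the final (fully determined) count is at least the initial expectation (1 − 1/2^k)m.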

18.1.1.2. Max cut


18.1.1.3. Turán theorem
Lemma 18.1.3 (Turán’s theorem). Let G = (V, E) be a graph with n vertices and m edges. One can compute
n
determinedly, in O(nm) time, an independent set of size at least .
1 + 2m/n

Proof: Exercise. ■


Chapter 19

Martingales
‘After that he always chose out a “dog command” and sent them ahead. It had the task of informing the inhabitants in the village
where we were going to stay overnight that no dog must be allowed to bark in the night otherwise it would be liquidated. I was
also on one of those commands and when we came to a village in the region of Milevsko I got mixed up and told the mayor
that every dog-owner whose dog barked in the night would be liquidated for strategic reasons. The mayor got frightened,
immediately harnessed his horses and rode to headquarters to beg mercy for the whole village. They didn’t let him in, the
sentries nearly shot him and so he returned home, but before we got to the village everybody on his advice had tied rags round
the dogs muzzles with the result that three of them went mad.’

The good soldier Svejk, Jaroslav Hasek

19.1. Martingales
19.1.1. Preliminaries
 
Let X and Y be two random variables. Let ρ(x, y) = P[(X = x) ∩ (Y = y)]. Observe that

    P[X = x | Y = y] = ρ(x, y)/P[Y = y] = ρ(x, y)/Σ_z ρ(z, y)

and

    E[X | Y = y] = Σ_x x P[X = x | Y = y] = Σ_x x ρ(x, y)/Σ_z ρ(z, y) = (Σ_x x ρ(x, y))/P[Y = y].

The conditional expectation of X given Y is the random variable E[X | Y]; that is, the random variable f (Y), where f (y) = E[X | Y = y].

As a reminder, for any two random variables X and Y, we have
(I) Lemma 11.1.2: E[E[X | Y]] = E[X].
(II) Lemma 11.1.3: E[Y · E[X | Y]] = E[XY].

19.1.2. Martingales
Intuitively, martingales are a sequence of random variables describing a process, where the only thing that
matters at the beginning of the ith step is where the process was in the end of the (i − 1)th step. That is, it does
not matter how the process arrived to a certain state, only that it is currently at this state.

Definition 19.1.1. A sequence of random variables X0 , X1 , . . . , is said to be a martingale sequence if for all
i > 0, we have E[Xi | X0 , . . . , Xi−1 ] = Xi−1 .
In particular, note that for a martingale, we have E[Xi | X0 , . . . , Xi−1 ] = E[Xi | Xi−1 ] = Xi−1 .
   
Lemma 19.1.2. Let X0, X1, . . . , be a martingale sequence. Then, for all i ≥ 0, we have E[Xi] = E[X0].

Proof: By (I) and the martingale property, we have

    E[Xi] = E[E[Xi | Xi−1]] = E[Xi−1] = E[Xi−2] = · · · = E[X0]. ■

19.1.2.1. Examples of martingales


Example 19.1.3. Consider the sum of money after participating in a sequence of fair bets. That is, let Xi be
the amount of money a gambler has after playing i rounds. In each round it either gains one dollar, or loses one
dollar (with equal probability). Clearly, we have
1 1
E[Xi | X0 , . . . , Xi−1 ] = E[Xi | Xi−1 ] = Xi−1 + · (+1) + · (−1) = Xi−1 .
2 2
Example 19.1.4. Let Yi = Xi² − i, where Xi is as defined in the above example. We claim that Y0, Y1, . . . is a martingale. Let us verify that this is true. Given Yi−1, we have Yi−1 = (Xi−1)² − (i − 1). We have that

    E[Yi | Yi−1] = E[Xi² − i | (Xi−1)² − (i − 1)] = (1/2)((Xi−1 + 1)² − i) + (1/2)((Xi−1 − 1)² − i)
      = (Xi−1)² + 1 − i = (Xi−1)² − (i − 1) = Yi−1,

which implies that indeed it is a martingale.
Example 19.1.5. Let U be an urn with b black balls and w white balls. We repeatedly select a ball uniformly at random and replace it by c balls having the same color. Let Xi be the fraction of black balls after the first i trials. We claim that the sequence X0, X1, . . . is a martingale.

Indeed, let ni = b + w + i(c − 1) be the number of balls in the urn after the ith trial. Clearly,

    E[Xi | Xi−1, . . . , X0] = Xi−1 · ((c − 1) + Xi−1 ni−1)/ni + (1 − Xi−1) · (Xi−1 ni−1)/ni
      = (Xi−1 (c − 1) + Xi−1 ni−1)/ni = Xi−1 · (c − 1 + ni−1)/ni = Xi−1 · ni/ni = Xi−1.
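This urn process (a Pólya urn) is easy to simulate, and the martingale property E[Xi] = E[X0] shows up as the empirical mean of the final fraction staying at the initial fraction b/(b + w). A sketch (ours, not from the notes):

```python
import random

def urn_fraction(b, w, c, steps, rng):
    """Run the urn process for `steps` trials and return the final
    fraction of black balls.  Each trial draws a uniform ball and
    replaces it by c balls of the same color."""
    black, total = b, b + w
    for _ in range(steps):
        if rng.random() < black / total:   # drew a black ball
            black += c - 1
        total += c - 1
    return black / total
```

Averaging many independent runs started from b = w = 1 should give a mean very close to 1/2, even though any single run can drift far from it.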
Example 19.1.6. Let G be a random graph on the vertex set V = {1, . . . , n} obtained by independently choosing
to include each possible edge with probability p. The underlying probability space over random graphs is
denoted by Gn,p . Arbitrarily label the m = n(n − 1)/2 possible edges with the sequence 1, . . . , m. For 1 ≤ j ≤ m,
define the indicator random variable I j , which takes values 1 if the edge j is present in G, and has value 0
otherwise. These indicator variables are independent and each takes value 1 with probability p.
Consider any real valued function f defined over the space of all graphs, e.g., the clique number, which is defined as the size of the largest complete subgraph. The edge exposure martingale is the sequence of random variables X0, . . . , Xm such that

    Xi = E[f (G) | I1, . . . , Ii],

while X0 = E[f (G)] and Xm = f (G). That this sequence of random variables is a martingale follows immediately from a theorem that will be described in the next lecture.
One can define similarly a vertex exposure martingale, where the graph Gi is the graph induced on the first
i vertices of the random graph G.

Example 19.1.7 (The sheep of Mabinogion). The following is taken from medieval Welsh manuscript based
on Celtic mythology:
“And he came towards a valley, through which ran a river; and the borders of the valley were
wooded, and on each side of the river were level meadows. And on one side of the river he saw
a flock of white sheep, and on the other a flock of black sheep. And whenever one of the white
sheep bleated, one of the black sheep would cross over and become white; and when one of the
black sheep bleated, one of the white sheep would cross over and become black.” – Peredur the
son of Evrawk, from the Mabinogion.
More concretely, we start at time 0 with w0 white sheep and b0 black sheep. At every iteration, a random sheep is picked; it bleats, and a sheep of the other color turns to this color. The game stops as soon as all the sheep have the same color. No sheep dies or is born during the game. Let Xi be the expected number of black sheep at the end of the game, given the state after the ith iteration. For reasons that we will see later on, this sequence is a martingale.

The original question is somewhat more interesting – if we are allowed to take away sheep at the end of each iteration, what is the optimal strategy to maximize Xi?

19.1.2.2. Azuma’s inequality


A sequence of random variables X0 , X1 , . . . has bounded differences if |Xi − Xi−1 | ≤ ∆, for some ∆.
Theorem 19.1.8 (Azuma’s Inequality.). Let X0 , . . . , Xm be a martingale with X0 = 0, and

|Xi+1 − Xi | ≤ 1, for i = 0, . . . , m − 1.
h √ i  
For any λ > 0, we have P Xm > λ m < exp −λ2 /2 .


Proof: Let α = λ/√m. Let Yi = Xi − Xi−1, so that |Yi| ≤ 1 and E[Yi | X0, . . . , Xi−1] = 0.

We are interested in bounding E[e^{αYi} | X0, . . . , Xi−1]. Note that, for −1 ≤ x ≤ 1, we have

    f(x) = e^{αx} ≤ h(x) = (e^α + e^{−α})/2 + ((e^α − e^{−α})/2) x,

as f(x) = e^{αx} is a convex function, h(−1) = e^{−α} = f(−1), h(1) = e^α = f(+1), and h(x) is a linear function. Thus,

    E[e^{αYi} | X0, . . . , Xi−1] ≤ E[h(Yi) | X0, . . . , Xi−1] = h(E[Yi | X0, . . . , Xi−1]) = h(0) = (e^α + e^{−α})/2
      = ((1 + α + α²/2! + α³/3! + · · ·) + (1 − α + α²/2! − α³/3! + · · ·))/2
      = 1 + α²/2 + α⁴/4! + α⁶/6! + · · ·
      ≤ 1 + (1/1!)(α²/2) + (1/2!)(α²/2)² + (1/3!)(α²/2)³ + · · · = exp(α²/2),

as (2i)! ≥ 2^i i!.

We have that

    τ = E[e^{αXm}] = E[∏_{i=1}^{m} e^{αYi}] = E[g(X0, . . . , Xm−1) e^{αYm}], where g(X0, . . . , Xm−1) = ∏_{i=1}^{m−1} e^{αYi}.

By the martingale property, we have that

    E[Ym | X0, . . . , Xm−1] = E[Ym | g(X0, . . . , Xm−1)] = 0.

By the above, this implies that E[e^{αYm} | g(X0, . . . , Xm−1)] ≤ exp(α²/2). Hence, by Lemma 11.1.3, we have that

    τ = E[g(X0, . . . , Xm−1) e^{αYm}] = E[g(X0, . . . , Xm−1) E[e^{αYm} | g(X0, . . . , Xm−1)]] ≤ e^{α²/2} E[g(X0, . . . , Xm−1)] ≤ exp(mα²/2),

where the last inequality follows by repeating the same argument for the remaining m − 1 factors of g. Therefore, by Markov's inequality, we have

    P[Xm > λ√m] = P[e^{αXm} > e^{αλ√m}] ≤ E[e^{αXm}]/e^{αλ√m} = e^{mα²/2 − αλ√m} = exp(m(λ/√m)²/2 − (λ/√m)λ√m) = e^{−λ²/2},

implying the result. ■


Here is an alternative form.

Theorem 19.1.9 (Azuma’s Inequality). Let X_0, . . . , X_m be a martingale sequence such that |X_{i+1} − X_i| ≤ 1
for all 0 ≤ i < m. Let λ > 0 be arbitrary. Then P[|X_m − X_0| > λ√m] < 2 exp(−λ²/2).
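As a sanity check, one can compare Azuma’s bound against a simulated fair ±1 random walk, which is a martingale with X_0 = 0 and differences bounded by 1 (a simulation sketch, not part of the analysis above):

```python
import math
import random

def walk_tail(m, lam, trials, seed=0):
    """Empirical P[X_m > lam * sqrt(m)] for a fair +-1 random walk,
    a martingale with X_0 = 0 and unit bounded differences."""
    rng = random.Random(seed)
    thresh = lam * math.sqrt(m)
    hits = sum(
        1
        for _ in range(trials)
        if sum(rng.choice((-1, 1)) for _ in range(m)) > thresh
    )
    return hits / trials

m, lam = 200, 1.5
emp = walk_tail(m, lam, trials=5000)
bound = math.exp(-lam * lam / 2)     # Azuma's bound exp(-lam^2 / 2)
print(emp, "<=", bound)
```

For a fair walk the true tail is roughly the Gaussian tail, noticeably smaller than Azuma’s bound, which holds for every martingale with unit bounded differences.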

Example 19.1.10. Let χ(H) be the chromatic number of a graph H. What is the chromatic number of a random
graph? How does this random variable behave?
Consider the vertex exposure martingale, and let X_i = E[χ(G) | G_i]. Again, without proving it, we claim that
X_0, . . . , X_n = X is a martingale, and as such, by Theorem 19.1.9, we have P[|X_n − X_0| > λ√n] < 2e^{−λ²/2}. However, X_0 = E[χ(G)],
and X_n = E[χ(G) | G_n] = χ(G). Thus,

P[|χ(G) − E[χ(G)]| > λ√n] < 2e^{−λ²/2}.

Namely, the chromatic number of a random graph is highly concentrated! And we do not even (need to) know
what the expectation of this variable is!

19.2. Bibliographical notes


Our presentation follows [MR95].

References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.

Chapter 20

Martingales II
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“The Electric Monk was a labor-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for
you, thus saving you the bother of washing them yourself, video recorders watched tedious television for you, thus saving you
the bother of looking at it yourself; Electric Monks believed things for you, thus saving you what was becoming an increasingly
onerous task, that of believing all the things the world expected you to believe.”

Dirk Gently’s Holistic Detective Agency, Douglas Adams

20.1. Filters and Martingales


Definition 20.1.1. A σ-field (Ω, F ) consists of a sample space Ω (i.e., the atomic events) and a collection of
subsets F satisfying the following conditions:
(A) ∅ ∈ F .
(B) C ∈ F ⇒ C̄ ∈ F , where C̄ = Ω \ C is the complement of C.
(C) C_1, C_2, . . . ∈ F ⇒ C_1 ∪ C_2 ∪ . . . ∈ F .

Definition 20.1.2. Given a σ-field (Ω, F ), a probability measure P : F → R⁺ is a function that satisfies the
following conditions:
(A) ∀A ∈ F , 0 ≤ P[A] ≤ 1.
(B) P[Ω] = 1.
(C) For mutually disjoint events C_1, C_2, . . . , we have P[∪_i C_i] = Σ_i P[C_i].

Definition 20.1.3. A probability space (Ω, F , P) consists of a σ-field (Ω, F ) with a probability measure P
defined on it.

Definition 20.1.4. Given a σ-field (Ω, F ) with F = 2^Ω, a filter (also filtration) is a nested sequence F_0 ⊆ F_1 ⊆
· · · ⊆ F_n of subsets of 2^Ω, such that:
(A) F_0 = {∅, Ω}.
(B) F_n = 2^Ω.
(C) For 0 ≤ i ≤ n, (Ω, F_i) is a σ-field.

Definition 20.1.5. An elementary event or atomic event is a subset of a sample space that contains only one
element of Ω.

Intuitively, when we consider a probability space, we usually consider a random variable X. The value of
X is a function of the elementary event that happens in the probability space. Formally, a random variable is a
mapping X : Ω → R. Thus, each Fi defines a partition of Ω into atomic events. This partition is getting more
and more refined as we progress down the filter.

Example 20.1.6. Consider an algorithm Alg that uses n random bits. As such, the underlying sample space is
Ω = {b_1 b_2 . . . b_n | b_1, . . . , b_n ∈ {0, 1}}. That is, the set of all binary strings of length n. Next, let F_i be the σ-field
generated by the partition of Ω into the atomic events B_w, where w ∈ {0, 1}^i; here w is the string encoding the
first i random bits used by the algorithm. Specifically,

B_w = {wx ∈ Ω | x ∈ {0, 1}^{n−i}},

and the set of atomic events in F_i is A_i = {B_w | w ∈ {0, 1}^i}. The set F_i is the closure of this set of atomic events
under complement and union. In particular, we conclude that F_0, F_1, . . . , F_n form a filter.
As a concrete example, for i = 3, the set A_3 contains 2³ = 8 sets, and the set F_3 would contain all sets
formed by finite unions of these sets (including the empty union). As such, the set F_3 would have 2^{2³} = 2⁸ = 256
sets.
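The partition into the atomic events B_w is easy to enumerate explicitly. The following sketch (illustrative code; the function name is ours) builds A_i for a small n and confirms |A_i| = 2^i; the size 2^{2^i} of F_i is only printed, not enumerated:

```python
from itertools import product

def atomic_events(n, i):
    """Partition {0,1}^n into the blocks B_w of strings sharing their first i bits."""
    omega = ["".join(bits) for bits in product("01", repeat=n)]
    blocks = {}
    for s in omega:
        blocks.setdefault(s[:i], []).append(s)   # key w = first i bits
    return blocks

n = 4
for i in range(n + 1):
    blocks = atomic_events(n, i)
    assert len(blocks) == 2 ** i                 # |A_i| = 2^i atomic events
    # F_i is the closure under union/complement: it has 2^(2^i) members
    print(i, len(blocks), 2 ** (2 ** i))
```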

Definition 20.1.7. A random variable X is said to be F_i-measurable if for each x ∈ R, the event {X ≤ x} is in F_i;
that is, the set {ω ∈ Ω | X(ω) ≤ x} is in F_i.

Example 20.1.8. Let F_0, . . . , F_n be the filter defined in Example 20.1.6. Let X be the parity of the n bits.
Clearly, {X = 1} is a valid event only in F_n (why?). Namely, X is measurable in F_n, but not in F_i, for i < n.

As such, a random variable X is F_i-measurable only if it is constant on the elementary events of F_i. This
gives us a new interpretation of what a filter is – it is a sequence of refinements of the underlying probability
space, achieved by splitting the atomic events of F_i into smaller atomic events in F_{i+1}. Putting it
explicitly, an atomic event E of F_i is a subset of Ω. As we move to F_{i+1}, the event E might now be split
into several atomic (and disjoint) events E_1, . . . , E_k. Now, naturally, the atomic event that really happens is an
atomic event of F_n. As we progress down the filter, we “zoom” into this event.

Definition 20.1.9 (Conditional expectation in a filter). Let (Ω, F ) be any σ-field, and Y any random variable
that takes on distinct values on the elementary events in F . Then E[X | F ] = E[X | Y].

20.2. Martingales

Definition 20.2.1. A sequence of random variables Y_1, Y_2, . . . , is a martingale difference sequence if for all
i, we have E[Y_i | Y_1, . . . , Y_{i−1}] = 0.

Clearly, X_1, . . . , is a martingale sequence if and only if Y_1, Y_2, . . . , is a martingale difference sequence, where
Y_i = X_i − X_{i−1}.

Definition 20.2.2. A sequence of random variables Y_1, Y_2, . . . , is

a super martingale sequence if ∀i: E[Y_i | Y_1, . . . , Y_{i−1}] ≤ Y_{i−1},
and a sub martingale sequence if ∀i: E[Y_i | Y_1, . . . , Y_{i−1}] ≥ Y_{i−1}.

20.2.1. Martingales – an alternative definition
Definition 20.2.3. Let (Ω, F , P) be a probability space with a filter F0 , F1 , . . . . Suppose that X0 , X1 , . . ., are
random variables such that, for all i ≥ 0, Xi is Fi -measurable. The sequence X0 , . . . , Xn is a martingale provided
 
that, for all i ≥ 0, we have E Xi+1 | Fi = Xi .

Lemma 20.2.4. Let (Ω, F ) and (Ω, G) be two σ-fields such that F ⊆ G. Then, for any random variable X, we
have E[E[X | G] | F ] = E[X | F ].

Proof: Fix an atomic event F = f of F . Since G refines F , every atomic event G = g of G that intersects f
satisfies g ⊆ f , and for such g we have P[G = g ∩ F = f ] = P[G = g]. Thus,

E[E[X | G] | F = f ] = Σ_{g∈G, g⊆f} E[X | G = g] · P[G = g ∩ F = f ] / P[F = f ]
  = Σ_{g∈G, g⊆f} (Σ_x x · P[X = x ∩ G = g] / P[G = g]) · P[G = g] / P[F = f ]
  = Σ_x x · (Σ_{g∈G, g⊆f} P[X = x ∩ G = g]) / P[F = f ]
  = Σ_x x · P[X = x ∩ F = f ] / P[F = f ] = E[X | F = f ]. ■

Theorem 20.2.5. Let (Ω, F , P) be a probability space, and let F_0, . . . , F_n be a filter with respect to it. Let X
be any random variable over this probability space, and define X_i = E[X | F_i]. Then, the sequence X_0, . . . , X_n is
a martingale.

Proof: We need to show that E[X_{i+1} | F_i] = X_i. Namely,

E[X_{i+1} | F_i] = E[E[X | F_{i+1}] | F_i] = E[X | F_i] = X_i,

by Lemma 20.2.4 and by the definition of X_i. ■
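Lemma 20.2.4 (and hence Theorem 20.2.5) can be checked mechanically on a tiny probability space. Below, Ω is the uniform space of 3-bit strings, F corresponds to knowing the first bit (coarser), and G to knowing the first two bits (finer), so F ⊆ G (an illustrative sketch, not part of the notes):

```python
import itertools

# Uniform sample space of 3-bit strings.
omega = ["".join(b) for b in itertools.product("01", repeat=3)]
X = {w: int(w, 2) for w in omega}   # an arbitrary random variable on omega

def cond_exp(val, known):
    """E[val | first `known` bits], returned as a function on omega."""
    out = {}
    for w in omega:
        block = [u for u in omega if u[:known] == w[:known]]
        out[w] = sum(val[u] for u in block) / len(block)
    return out

lhs = cond_exp(cond_exp(X, 2), 1)   # E[ E[X | G] | F ]
rhs = cond_exp(X, 1)                # E[X | F]
print(lhs == rhs)                   # → True
```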

Definition 20.2.6. Let f : D_1 × · · · × D_n → R be a real-valued function with n arguments from possibly
distinct domains. The function f is said to satisfy the Lipschitz condition if for any x_1 ∈ D_1, . . . , x_n ∈ D_n, any
i ∈ {1, . . . , n}, and any y_i ∈ D_i, we have

|f(x_1, . . . , x_{i−1}, x_i, x_{i+1}, . . . , x_n) − f(x_1, . . . , x_{i−1}, y_i, x_{i+1}, . . . , x_n)| ≤ 1.

Specifically, a function is c-Lipschitz if the inequality holds with a constant c (instead of 1).

Definition 20.2.7. Let X_1, . . . , X_n be a sequence of independent random variables, and f = f(X_1, . . . , X_n) a
function defined over them, such that f satisfies the Lipschitz condition. The Doob martingale sequence Y_0, . . . , Y_n is
defined by Y_0 = E[f(X_1, . . . , X_n)] and

Y_i = E[f(X_1, . . . , X_n) | X_1, . . . , X_i], for i = 1, . . . , n.

Clearly, a Doob martingale Y_0, . . . , Y_n is a martingale, by Theorem 20.2.5. Furthermore, since f satisfies the
Lipschitz condition, we have |Y_i − Y_{i−1}| ≤ 1, and we can use Azuma’s inequality on such a sequence.

20.3. Occupancy Revisited

We have m balls thrown independently and uniformly into n bins. Let Z denote the number of bins that remain
empty at the end of the process. Let X_i be the bin chosen in the ith trial, and let Z = F(X_1, . . . , X_m), where F
returns the number of empty bins given that the m balls had been thrown into bins X_1, . . . , X_m. The function F is
1-Lipschitz, so the associated Doob martingale has unit bounded differences, and by Azuma’s inequality
we have that P[|Z − E[Z]| > λ√m] ≤ 2 exp(−λ²/2).

The following is an extension of Azuma’s inequality shown in class. We do not provide a proof, but it is
similar to what we saw.

Theorem 20.3.1 (Azuma’s Inequality – Stronger Form). Let X_0, X_1, . . . , be a martingale sequence such that
for each k, |X_k − X_{k−1}| ≤ c_k, where c_k may depend on k. Then, for all t ≥ 0, and any λ > 0, we have

P[|X_t − X_0| ≥ λ] ≤ 2 exp(−λ² / (2 Σ_{k=1}^t c_k²)).

Theorem 20.3.2. Let r = m/n, and Z_end be the number of empty bins when m balls are thrown randomly into n
bins. Then μ = E[Z_end] = n(1 − 1/n)^m ≈ n exp(−r), and for any λ > 0, we have

P[|Z_end − μ| ≥ λ] ≤ 2 exp(−λ²(n − 1/2) / (n² − μ²)).
Proof: Let z(Y, t) be the expected number of empty bins at the end, if there are Y empty bins at time t. The
probability of an empty bin remaining empty to the end is (1 − 1/n)^{m−t}, and as such

z(Y, t) = Y(1 − 1/n)^{m−t}.

In particular, μ = z(n, 0) = n(1 − 1/n)^m.
Let F_t be the σ-field generated by the bins chosen in the first t steps. Let Z_end be the number of empty bins
at time m, and let Z_t = E[Z_end | F_t]. Namely, Z_t is the expected number of empty bins after we know where
the first t balls had been placed. The random variables Z_0, Z_1, . . . , Z_m form a martingale. Let Y_t be the number
of empty bins after t balls were thrown. We have Z_{t−1} = z(Y_{t−1}, t − 1). Consider the ball thrown in the tth step.
Clearly:
(A) With probability 1 − Y_{t−1}/n the ball falls into a non-empty bin. Then Y_t = Y_{t−1}, and Z_t = z(Y_{t−1}, t). Thus,

∆_t = Z_t − Z_{t−1} = z(Y_{t−1}, t) − z(Y_{t−1}, t − 1) = Y_{t−1}((1 − 1/n)^{m−t} − (1 − 1/n)^{m−t+1})
  = (Y_{t−1}/n)(1 − 1/n)^{m−t} ≤ (1 − 1/n)^{m−t}.

(B) Otherwise, with probability Y_{t−1}/n the ball falls into an empty bin, and Y_t = Y_{t−1} − 1. Namely, Z_t =
z(Y_{t−1} − 1, t). And we have that

∆_t = Z_t − Z_{t−1} = z(Y_{t−1} − 1, t) − z(Y_{t−1}, t − 1) = (Y_{t−1} − 1)(1 − 1/n)^{m−t} − Y_{t−1}(1 − 1/n)^{m−t+1}
  = (1 − 1/n)^{m−t}(Y_{t−1} − 1 − Y_{t−1}(1 − 1/n)) = −(1 − 1/n)^{m−t}(1 − Y_{t−1}/n)
  ≥ −(1 − 1/n)^{m−t}.

Thus, Z_0, . . . , Z_m is a martingale sequence, where |Z_t − Z_{t−1}| = |∆_t| ≤ c_t, for c_t = (1 − 1/n)^{m−t}. We have

Σ_{t=1}^m c_t² = Σ_{t=1}^m (1 − 1/n)^{2(m−t)} = Σ_{t=0}^{m−1} (1 − 1/n)^{2t} = (1 − (1 − 1/n)^{2m}) / (1 − (1 − 1/n)²)
  = n²(1 − (1 − 1/n)^{2m}) / (2n − 1) = (n² − μ²) / (2n − 1).

Now, deploying Azuma’s inequality (Theorem 20.3.1) yields the result. ■

20.3.1. Let us verify this is indeed an improvement

Consider the case where m = n ln n. Then, μ = n(1 − 1/n)^m ≤ 1. Using the “weak” Azuma’s inequality
implies that

P[|Z_end − μ| ≥ λ√n] = P[|Z_end − μ| ≥ (λ√(n/m))√m] ≤ 2 exp(−λ²n/(2m)) = 2 exp(−λ²/(2 ln n)),

which is interesting only if λ > 2 ln n. On the other hand, Theorem 20.3.2 implies that

P[|Z_end − μ| ≥ λ√n] ≤ 2 exp(−λ²n(n − 1/2)/(n² − μ²)) ≤ 2 exp(−λ²/2),

which is interesting for any λ ≥ 1 (say).
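To see the concentration numerically, the following sketch throws m = n ln n balls into n bins a few times and reports the number of empty bins next to μ = n(1 − 1/n)^m (simulation code, not part of the notes):

```python
import math
import random

def empty_bins(n, m, rng):
    """Throw m balls uniformly into n bins; return the number of empty bins."""
    load = [0] * n
    for _ in range(m):
        load[rng.randrange(n)] += 1
    return load.count(0)

n = 10000
m = int(n * math.log(n))          # the m = n ln n regime discussed above
mu = n * (1 - 1 / n) ** m         # exact expectation of Z_end
rng = random.Random(42)
samples = [empty_bins(n, m, rng) for _ in range(20)]
print("mu =", mu, " samples:", samples)
```

Here μ ≈ 1, and indeed the observed counts of empty bins hover around 1, in line with the strong concentration bound.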

Chapter 21

The power of two choices


The Peace of Olivia. How sweet and peaceful it sounds! There the great powers noticed for the first time that the land of the
Poles lends itself admirably to partition.

The Tin Drum, Günter Grass

Consider the problem of throwing n balls into n bins. It is well known that the maximum load is Θ(log n/ log log n)
with high probability. Here we show that if one is allowed to pick d bins for each ball, and throw it into the
bin (among these d) that contains the fewest balls, then the maximum load of a bin decreases to Θ(log log n/ log d). A variant of this
approach leads to maximum load Θ((log log n)/d).
As a concrete example, for n = 10⁹, this leads to maximum load 13 in the regular case, compared to a
maximum load of 4 with only two choices – see Figure 21.1.

21.1. Balls and bins with many rows

21.1.1. The game

Consider throwing n balls into n bins. Every bin can contain a single ball. As such, as we throw the balls, some
balls would be rejected because their assigned bin already contains a ball. We collect all the rejected balls, and
throw them again into a second row of n bins. We repeat this process until all the balls have found a good and
loving home (i.e., an empty bin). How many rows does one need before this process is completed?
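The process is easy to simulate. The sketch below (illustrative code) plays the game once and reports the number of rows used, which Theorem 21.1.3 below predicts to be lg lg n + Θ(1):

```python
import math
import random

def rows_needed(n, seed=0):
    """Play the game: throw the balls into a row of n single-slot bins;
    rejected balls move on to a fresh row. Return the number of rows used."""
    rng = random.Random(seed)
    balls, rows = n, 0
    while balls > 0:
        occupied = [False] * n
        rejected = 0
        for _ in range(balls):
            b = rng.randrange(n)
            if occupied[b]:
                rejected += 1          # bin taken: ball retried in next row
            else:
                occupied[b] = True
        balls, rows = rejected, rows + 1
    return rows

n = 1 << 16
rows = rows_needed(n)
print("rows used:", rows, " lg lg n =", math.log2(math.log2(n)))
```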

21.1.2. Analysis

Lemma 21.1.1. Let m = αn balls be thrown into n bins. Let Y_end be the number of bins that are not empty at the
end of the process (here, we allow more than one ball in a bin).
(A) For α ∈ (0, 1], we have μ = E[Y_end] ≥ (m − α) exp(−α) ≥ αn − α²n − 1.
(B) If α ≥ 1, then μ = E[Y_end] ≥ n(1 − exp(−α)).
(C) We have P[|Y_end − μ| > √(3cm log n)] ≤ 1/n^c.

Proof: (A) The probability of the ith ball being the first ball in its bin is (1 − 1/n)^{i−1}. To see this we use backward
analysis – throw in the ith ball, and now throw in the earlier i − 1 balls. The probability that none of the earlier
balls hit the same bin as the ith ball is as stated. Now, the expected number of non-empty bins is the expected
number of balls that are first in their bins, which in turn is

μ = Σ_{i=0}^{m−1} (1 − 1/n)^i ≥ m(1 − 1/n)^{m−1} ≥ (m − α)(1 − 1/n)^{m−α} = (m − α)(1 − 1/n)^{α(n−1)}
  ≥ (m − α) exp(−α) ≥ (m − α)(1 − α) = αn − α²n − α + α² ≥ αn − α²n − 1,

using m = αn ≤ n, m − α = α(n − 1), and (1 − 1/n)^{n−1} ≥ 1/e, see Lemma 6.1.1. (As α ≤ 1, this also gives
μ ≥ (m − α)/e.)
(B) We repeat the above analysis from the point of view of the bin. The probability of a bin being empty is
(1 − 1/n)^{αn}. As such, we have that

μ = E[Y_end] = n(1 − (1 − 1/n)^{αn}) ≥ n(1 − exp(−α)),

using 1 − 1/n ≤ exp(−1/n).
(C) Let X_i be the index of the bin the ith ball picked. Let Y_i = E[Y_end | X_1, . . . , X_i]. This is a Doob martingale,
with |Y_i − Y_{i−1}| ≤ 1. As such, Azuma’s inequality implies, for λ = √(3cm ln n), that

P[|Y_end − E[Y_end]| ≥ λ] ≤ 2 exp(−λ²/2m) ≤ 1/n^c. ■

Remark. The reader might be confused by cases (A) and (B) of Lemma 21.1.1 for α = 1, as the two lower
bounds are different. Observe that (A) is loose if α is relatively large and close to 1.

Back to the problem. Let α_1 = 1 and n_1 = α_1 n. For i > 1, inductively assume that the number of balls being
thrown in the ith round is

n_i = α_i n + O(√(α_{i−1} n log n)).

By Lemma 21.1.1, with high probability, the number of balls stored in the ith row is

s_i = n_i exp(−α_i) ± O(√(n_i log n)).

As such, as long as the first term is significantly larger than the second term, we have that s_i = nα_i exp(−α_i)(1 ±
o(1)). For the time being, let us ignore the o(1) term. We have that

n_{i+1} = n_i − s_i = n(α_i − α_i exp(−α_i)) ≤ n(α_i − α_i(1 − α_i)) = nα_i²,

since exp(−α_i) ≥ 1 − α_i.

Definition. For a number x > 0, we use lg x = log₂ x.

Observation 21.1.2. Consider the sequence α_1 = 1, c = α_2 = 1 − 1/e, and α_{i+1} = α_i², for i ≥ 2. We have that
α_{i+1} = c^{2^{i−1}}. In particular, for

∆ = 3 + lg log_{1/c} n = 3 + lg(lg n / lg(1/c)) = O(1) + lg lg n,

we have that α_∆ = c^{2^{∆−2}} < 1/n.

The above observation almost implies that we need ∆ rows. The problem is that the above calculations
(i.e., the high probability guarantee in Lemma 21.1.1) break down when n_i = O(log n) – that is, when α_i =
O((log n)/n). However, if one throws O(log n) balls into n bins, the probability of a single collision is at most
O((log n)²/n). In particular, this implies that after roughly c additional rows, the probability of any ball being left is
≤ 1/n^c.
The above argumentation, done more carefully, implies the following – we omit the details because (essen-
tially) the same analysis for a more involved case is done next (the lower bound stated follows also from the
same argumentation).

Theorem 21.1.3. Consider the process of throwing n balls into n bins in several rounds. Here, a ball that can
not be placed in a round, because its chosen bin is already occupied, is promoted to the next round. The
next round throws all the rejected balls from the previous round into a new row of n empty bins. This process,
with high probability, ends after M = lg lg n + Θ(1) rounds (i.e., after M rounds, all balls are placed in bins).

21.1.3. With only d rows

Lemma 21.1.4. For α ∈ (0, 1/4], let γ_1 = α, and γ_i = 2γ_{i−1}². We have that γ_{d+1} ≤ α^{(2^d+1)/2}.

Proof: The proof, minimal as it may be, is by induction:

γ_{i+1} = 2γ_i² ≤ 2(α^{(2^{i−1}+1)/2})² = 2α^{(2^i+2)/2} ≤ α^{(2^i+1)/2},

since 2√α ≤ 1. ■

Lemma 21.1.5. Let m = αn balls be thrown into n bins, with d rows, where α > 0. Here every bin can contain
only a single ball, and if inserting the ball into the ith row fails, then we throw it into the next row, and so on, until it
finds an empty bin, or it is rejected because it failed on the dth row. Let Y(d, n, m) be the number of balls that
did not get stored in this matrix of bins. We have
(A) For a constant α < 1/4, we have Y(d, n, αn) ≤ nα^{(2^d+1)/2}, with high probability.
(B) We have E[Y(d, n, dn)] = O(n log d).
(C) For a constant c > 1, we have E[Y(d, n, cn log d)] ≤ n/e^{d/2}, assuming d is sufficiently large.

Proof: (A) By Lemma 21.1.1, in expectation, at least s_1 = nα exp(−α) balls are placed in the first row. As such,
in expectation n_2 = nα(1 − exp(−α)) ≤ nα² balls get thrown into the second row. Using Chernoff’s inequality, we
get that n_2 ≤ 2α²n, with high probability. Setting γ_1 = α, and γ_i = 2γ_{i−1}², we get the claim via Lemma 21.1.4.
(B) As long as we throw Ω(n log d) balls into a row, we expect, by Lemma 21.1.1, at least n(1 − 1/d^{O(1)})
balls to get stored in this row. As such, let D = O(log d), and observe that the first d − D rows in expectation
contain n(d − D)(1 − 1/d^{O(1)}) balls. This implies that only O(Dn) balls are not stored in these first d − D rows, which
implies the claim.
(C) Break the d rows into two groups. The first group has size

D = ⌈(c log d − 1)/(1 − 1/e)⌉ + 1 = O(log d),

and the second group is the remaining rows. As long as the number of balls arriving at a row is larger than n,
we expect at least n(1 − 1/e) of them to be stored in this row. As such, after the first D rows, we expect the
number of remaining balls to be ≤ n. Indeed, if we have i such rows, then the expected number of balls moving
on to the (i + 1)th row is at most

n_{i+1} = cn log d − in(1 − 1/e).

Solving for n_{i+1} ≤ n, we have cn log d − in(1 − 1/e) ≤ n ⟹ i(1 − 1/e) ≥ c log d − 1 ⟹ i ≥ (c log d −
1)/(1 − 1/e) ≥ D − 1. As such, n_i ≤ n, for i ≥ D.
The same argumentation implies that the number of balls arriving at the (D + i)th row, in expectation, is at most
n/e^i. In particular, we get that the number of balls that failed to be placed is at most n/e^{d−D} ≤ n/e^{d/2}. ■

21.2. The power of two choices

Making d choices. Let us throw n balls into n bins. For each ball, we first pick randomly d ≥ 2 bins, and
place the ball in the bin (among these d bins) that currently contains the smallest number of balls (here, a bin
might contain an arbitrary number of balls). If there are several bins with the same minimum number of balls,
we resolve the tie arbitrarily.
Here, we will show the surprising result that the maximum number of balls in any bin is bounded by
O(log log n / log d) with high probability at the end of this process. For d = 1, which is the regular balls into bins setting,
we have already seen that this quantity is Θ(log n / log log n), so this result is quite surprising.
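A direct simulation makes the gap visible (illustrative sketch; ties in min() are broken toward the earlier probe, which is fine since ties may be resolved arbitrarily):

```python
import random

def max_load(n, d, seed=0):
    """Throw n balls into n bins; each ball picks d random bins (with
    repetition) and goes into the least loaded one."""
    rng = random.Random(seed)
    load = [0] * n
    for _ in range(n):
        probes = [rng.randrange(n) for _ in range(d)]
        best = min(probes, key=load.__getitem__)
        load[best] += 1
    return max(load)

n = 100000
m1 = max_load(n, 1)
m2 = max_load(n, 2)
print("1 choice :", m1)   # ~ log n / log log n
print("2 choices:", m2)   # ~ log log n
```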

21.2.1. Upper bound


Definition 21.2.1. The load of a bin is the number of balls in it. The height of a ball is the load of the bin it
was inserted into, just after it was inserted.

Some notations:
(A) βi : An upper bound on the number of bins that have load at least i by the end of the process.
(B) h(i): The height of the ith ball.
(C) ⊔≥i (t): Number of bins with load at least i at time t.
(D) o≥i (t): Number of balls with height at least i at time t.

Observation 21.2.2. ⊔≥i (t) ≤ o≥i (t).

Let |≥i = ⊔≥i (n) be the number of bins, in the end of the process, that have load ≥ i.

Observation 21.2.3. Since every bin counted in |≥i contains at least i balls, and there are n balls, it follows
that |≥i ≤ n/i. In particular, we have |≥4 ≤ n/4.

Lemma 21.2.4. Let β_1 = n, β_2 = n/2, β_3 = n/3, and β_4 = n/4, and let

β_{i+1} = 2n(β_i/n)^d,

for i ≥ 4. Let I be the last iteration such that β_I ≥ 16c ln n, where c > 1 is an arbitrary constant. Then, with
probability ≥ 1 − 1/n^c, we have that
(A) |≥i ≤ β_i, for i = 1, . . . , I.
(B) |≥I+1 ≤ c′ log n, for some constant c′.
(C) For j > 0, and any constant ε > 0, we have P[|≥I+1+j > 0] ≤ O(1/n^{(d−1−ε)j}).
(D) With probability ≥ 1 − 1/n^c, the maximum load of a bin is I + O(c).

Proof: (A) The claim for i = 1, 2, 3, 4 follows readily from Observation 21.2.3.
Let B_i be the bad event that |≥i > β_i, for i = 1, . . . , n. The following analysis is conditioned on none of
these bad events happening. Let G_k = ∩_{i=1}^k B̄_i be the good event (here B̄_i is the complement of B_i). Let Y_t be an indicator variable that is one

⇐⇒ h(t) ≥ i + 1, conditioned on G_{i−1} (for clarity, we omit mentioning this conditioning explicitly). We have
that

τ_j = P[Y_j = 1] ≤ p_i, for p_i = (β_i/n)^d,

as all d probes must hit bins of height at least i, and there are at most β_i such bins. This readily implies
that E[o≥i+1(n)] ≤ p_i n. The variables Y_1, . . . , Y_n are not independent, but consider a variable Y′_j that is 1 if
Y_j = 1, and, if Y_j = 0, is 1 with probability (p_i − τ_j)/(1 − τ_j) (so that P[Y′_j = 1] = p_i). Clearly, the variables Y′_1, . . . , Y′_n are independent, and
Σ_j Y′_j ≥ Σ_j Y_j. For i < I, setting

β_{i+1} = 2np_i = 2n(β_i/n)^d,

we have, by Chernoff’s inequality, that

α_{i+1} = P[B_{i+1}] = P[o≥i+1(n) > β_{i+1}] = P[o≥i+1(n) > 2np_i] ≤ P[Σ_t Y′_t > (1 + 1)np_i]
  ≤ exp(−np_i/4) = exp(−β_{i+1}/8) < 1/n^{2c},

as β_{i+1} ≥ 16c ln n.
(B) We have β_{I+1} ≤ 16c log n. Setting ∆ = 2e · 16c log n, and conditioning on the good event G_I, consider
the sequence Y′_1, . . . , Y′_n as above, where Y_i is the indicator that the ith ball has height ≥ I + 1. Arguing as
above, for Y′ = Σ_i Y′_i, we have E[Y′] ≤ β_{I+1}. As such, we have

P[|≥I+1 > ∆] ≤ P[o≥I+1(n) > ∆] ≤ P[Y′ > (∆/E[Y′]) E[Y′]] ≤ 2^{−∆} ≤ 1/n^c,

by Lemma 13.2.8, as E[Y′] ≤ β_{I+1}, and ∆/β_{I+1} > 2e.
As for the conditioning used in the above, we have that

P[G_{I+1}] = ∏_{ℓ=4}^{I+1} P[B̄_ℓ | B̄_1 ∩ · · · ∩ B̄_{ℓ−1}] = ∏_i (1 − α_i) ≥ 1 − 1/n^{c−1},

since I ≤ n.
(C) Observe that ⊔≥i+1(n) ≤ ⊔≥i(n). As such, for all j > 0, we have that ⊔≥I+1+j(n) ≤ o≥I+1(n) ≤ ∆ =
2e · 16c log n, by (B). As such, we have

E[o≥I+1+j(n)] ≤ n(∆/n)^d = O(log^d n / n^{d−1}) = O(1/n^{d−1−ε}) ≪ 1,

for ε > 0 an arbitrary constant, and n sufficiently large. Using Markov’s inequality, we get that q = P[o≥I+1+j(n) ≥ 1] =
O(1/n^{d−1−ε}). The probability that the first j such rounds fail (i.e., that o≥I+1+j(n) > 0) is at most q^j, as claimed.
(D) This follows immediately by picking ε = 1/2, and then using (C) with j = O(c). ■
Lemma 21.2.5. For i = 4, . . . , I, we have that β_i ≤ n/2^{d^{i−4}+1}.

Proof: The proof is by induction. For i = 4, we have β_4 ≤ n/4, as claimed. Otherwise, we have

β_{i+1} = 2n(β_i/n)^d ≤ 2n(1/2^{d^{i−4}+1})^d = n/2^{d^{i+1−4}+d−1} ≤ n/2^{d^{i+1−4}+1}. ■

Theorem 21.2.6. When throwing n balls into n bins, with d choices, with probability ≥ 1 − 1/n^{O(1)}, we have
that the maximum load of a bin is O(1) + (lg lg n)/(lg d).

Proof: By Lemma 21.2.4, with the desired probability, the β_i s bound the load in the bins for i ≤ I. By
Lemma 21.2.5, it follows that for I = O(1) + (lg lg n)/(lg d), we have that β_I ≤ o(log n), thus giving us the desired
bound. ■

It is not hard to verify that our upper bounds (i.e., β_i) are not too badly off, and as such the maximum load
in the worst case is (up to an additive constant) the same. We state the result without proof.

Theorem 21.2.7. When throwing n balls into n bins, with d choices (where the ball is placed in the bin with
the least load), with probability ≥ 1 − o(1/n), we have that the maximum load of a bin is at least (lg lg n)/(lg d) − O(1).

21.2.2. Applications
As a direct application, we can use this approach for open hashing, where we use two hash functions, and place
an element in the bucket of the hash table with fewer elements. By the above, this improves the worst case
search time from O(log n/ log log n) to O(log log n). This comes at the cost of doubling the time it takes to do
lookup on average.

21.2.3. The power of restricted d choices: Always go left

The always-go-left rule. Consider throwing a ball into n bins (which might already have some balls in them) as
follows – you pick uniformly a number X_i ∈ ⟦n/d⟧, for i = 1, . . . , d. Next, you try locations Y_1, . . . , Y_d, where
Y_j = X_j + j(n/d), for j = 1, . . . , d. Let L_j be the load of bin Y_j, for j = 1, . . . , d, and let L = min_j L_j be the
minimum load among these bins. Let τ be the minimum index such that L_τ = L. We throw the ball into Y_τ.
What the above scheme does is partition the n bins into d groups, each of size n/d, placed from left to
right. We pick a bin uniformly from each group, and always throw the ball into the leftmost location that realizes
the minimum load.
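The rule above translates into a short simulation (illustrative sketch; Python's min() returns the first minimizer, which implements the leftmost tie-breaking for free):

```python
import random

def max_load_go_left(n, d, seed=0):
    """n bins split into d groups of size n//d; probe one random bin per
    group, left to right, and place the ball in the leftmost bin that
    realizes the minimum load among the probes."""
    g = n // d
    rng = random.Random(seed)
    load = [0] * (g * d)
    for _ in range(g * d):
        probes = [j * g + rng.randrange(g) for j in range(d)]  # one per group
        best = min(probes, key=load.__getitem__)  # first minimum = leftmost
        load[best] += 1
    return max(load)

r = max_load_go_left(100000, 2)
print("go-left, d=2:", r)
```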
The following proof is informal for the sake of simplicity.

Theorem 21.2.8. When throwing n balls into n bins, using the always-go-left rule, with d groups of size n/d,
the maximum load of a bin is O(1) + (log log n)/d, with high probability.

Proof: (Sketch.) We consider each of the d groups to be a row in the matrix being filled. So each row has n/d
entries, and there are d rows. We can now think about the above algorithm as first trying to place the ball in the
first row (if there is an empty bin), otherwise trying the next row, and so on. If all the d locations are full, then in the
row-filling game we fail to place this ball. By Lemma 21.1.5 (B), we have that the number of unplaced balls is
E[Y(d, n/d, (n/d)d)] = O((n/d) log d). Thus, we have that the number of balls that get placed as the first ball in
their bin is

≥ n(1 − O(log d)/d),

and the height of these balls is one.
We now use the same argumentation for balls of height 2 – Lemma 21.1.5 (C) implies that at most dn/e^{d/2}
balls have height strictly larger than 2.
Lemma 21.1.5 (A) implies that now we can repeat the same analysis as in the power of two choices; the
critical difference is that every one of the d groups behaves like a separate height. Since there are O(log log n)
maximum heights in the regular analysis, this implies that we get O((log log n)/d) maximum load, with high
probability. ■

# balls in bin Regular 2-choices 2-choices+go left
0 369,899,815 240,525,897 228,976,604
1 365,902,266 528,332,061 546,613,797
2 182,901,437 221,765,420 219,842,639
3 61,604,865 9,369,389 4,566,915
4 15,760,559 7,233 45
5 3,262,678
6 568,919
7 86,265
8 11,685
9 1,347
10 143
11 17
12 2
13 2

Figure 21.1: Simulation of the three schemes described here. This was done with n = 1,000,000,000 balls
thrown into n bins. Since log log n is so small (i.e., ≈ 3 in this case), there do not seem to be any reasonable
cases where there is a significant difference between d-choices and the go-left variant. In the simulations, the
go-left variant always has a somewhat better distribution, as shown above.

21.3. Avoiding terrible choices

Interestingly, one can prove that two choices are not really necessary. Indeed, consider the variant where the ith
ball randomly chooses a random location r_i. The ball is then placed in the bin with the least load among the bins r_i
and r_{i−1} (the first ball inspects only a single bin – r_1). It is not difficult to show that the above analysis applies
in this setting, and the maximum load is O(log log n) – despite making only n random choices for n balls. Intuitively,
what is going on is that the power of two choices lies in the ability to avoid following a horrible, no good,
terrible choice, by having an alternative. This alternative choice does not have to be quite of the same quality
as the original choice – it can be stolen from the previous ball, etc.
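This variant is a one-line change to the two-choices simulation (a sketch; breaking ties toward the fresh choice r_i is an arbitrary decision of this sketch):

```python
import random

def max_load_prev(n, seed=0):
    """Each ball makes ONE fresh random choice r_i, but may instead use the
    previous ball's choice r_{i-1} if that bin is currently less loaded."""
    rng = random.Random(seed)
    load = [0] * n
    prev = None
    for _ in range(n):
        r = rng.randrange(n)
        b = r if prev is None or load[r] <= load[prev] else prev
        load[b] += 1
        prev = r
    return max(load)

r = max_load_prev(100000)
print("previous-ball variant:", r)
```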

21.4. Escalated choices

A variant that seems to work even better in practice is the following escalated choices algorithm. The idea is
to try more than one bin only if you need to. To this end, try a random bin. If it is empty, then the algorithm
stores the ball in it. Otherwise, the algorithm tries harder. In the jth iteration, for j > 1, the algorithm picks an additional
random location. If any of the j locations probed so far has load < ⌈ j/2⌉, then the algorithm places the ball in the min-load
bin among these. Otherwise, the algorithm continues to the next iteration.
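A sketch of the escalated-choices algorithm as just described (illustrative code; the reported averages will vary slightly from run to run):

```python
import random

def escalated(n, seed=0):
    """Probe random bins one at a time; after the j-th probe, place the ball
    in the least loaded probed bin as soon as its load is below ceil(j/2)."""
    rng = random.Random(seed)
    load = [0] * n
    probes = 0
    for _ in range(n):
        seen = []
        j = 0
        while True:
            j += 1
            seen.append(rng.randrange(n))
            probes += 1
            best = min(seen, key=load.__getitem__)
            if load[best] < (j + 1) // 2:   # threshold ceil(j/2)
                load[best] += 1
                break
    return max(load), probes / n

ml, avg = escalated(100000)
print("max load:", ml, " avg probes per ball:", avg)
```

The loop always terminates: the threshold ⌈ j/2⌉ grows with j, while the minimum load over the probed bins can only decrease as more bins are probed.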
Experiments show that, on average, this algorithm probes only 1.96 bins per ball (thus making fewer probes
than 2-choices). In this setting, the experiments show that 4-choices with go-left does better, but if one uses
the threshold < ⌈ j/3⌉, then the average number of probes is 2.30179, while again having a better performance.
The intuition is that a sequence of really bad choices is rare, and one can afford to try harder in such cases to
get out of them.
A theoretical analysis of this variant should be interesting.
(I “invented” this variant, but it might already be known.)

21.5. Bibliographical notes
The multi-row balls into bins (Section 21.1) is from the work by Broder and Karlin [BK90]. The power of two
choices (Section 21.2) is from Azar et al. [ABKU99].
The restricted d choices structure, the always go-left rule, described in Section 21.2.3, is from [Vöc03].

References
[ABKU99] Y. Azar, A. Broder, A. Karlin, and E. Upfal. Balanced allocations. SIAM Journal on Computing,
29(1): 180–200, 1999.
[BK90] A. Z. Broder and A. R. Karlin. Multilevel adaptive hashing. Proc. 1st ACM-SIAM Sympos. Dis-
crete Algs. (SODA), 43–53, 1990.
[Vöc03] B. Vöcking. How asymmetry helps load balancing. J. ACM, 50(4): 568–589, 2003.

Chapter 22

Evaluating And/Or Trees


“That’s all a prophet is good for – to admit somebody else is an ass or a whore.”

The Violent Bear It Away, Flannery O’Connor

22.1. Evaluating an And/Or Tree

Let T_{2k} denote a complete binary tree of height 2k – this tree has n = 2^{2k} leaves. The inputs to the tree are
boolean values stored in the leaves, and the internal nodes are AND/OR gates, alternating by level. The task at hand is to evaluate
the tree – the value of an internal node is the operation associated with the node, applied to the values
returned from evaluating its two children.

[Figure 22.1: The tree T_2, with inputs in the leaves, and the value computed at each gate.]

Defined recursively, T_2 is a tree with the root being an AND gate, whose children are OR gates. This tree has
four inputs. More generally, T_{2k} is T_2 with each leaf replaced by T_{2k−2}. Let n = 2^{2k}.
So the input here is T_{2k}, together with 2^{2k} values stored in the leaves of the tree. Consider here the query
model – instead of reading the values in the leaves, the algorithm has to explicitly perform a query to get the value
stored in a leaf. The question thus is whether we can minimize the number of queries the algorithm needs to perform.
It is straightforward to evaluate such a tree using a recursive algorithm in O(n) time. In particular, the
following is not too difficult to show.

Exercise 22.1.1. Show that any deterministic algorithm, in the worst case, requires Ω(n) time to evaluate a
tree T_{2k}.

The key observation is that an AND (i.e., ∧) gate evaluation can be shortcut – that is, if x = 0 then x ∧ y = 0
independently of what value y has. Similarly, an OR (i.e., ∨) gate evaluation can be shortcut – since if x = 1,
then x ∨ y = 1 independently of what value y has.
22.1.1. Randomized evaluation algorithm for T 2k
The algorithm is recursive. If the current node v is a leaf, the algorithm returns the value stored at the leaf.
Otherwise, the algorithm randomly chooses (with equal probability) one of the children of v, and evaluates it
recursively. If the returned value is sufficient to determine the value of the gate at v, then the algorithm shortcuts, skipping the other child. Otherwise,
the algorithm evaluates the other child recursively, computes the value of the gate, and returns it.
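The recursive algorithm above can be sketched in a few lines of Python. This is our own illustrative sketch, not code from the notes: the tree is stored implicitly as a flat list of leaf values, the helper names are ours, and the gate types are derived from the recursion depth.

```python
import random

def evaluate(leaves, lo, hi, depth, counter):
    """Randomized evaluation of the AND/OR tree stored over leaves[lo:hi].

    Gates alternate by depth: AND at even depths (the root), OR at odd
    ones.  A random child is evaluated first, and the second child is
    skipped whenever the first value already determines the gate (0 for
    an AND gate, 1 for an OR gate).  Leaf queries are tallied in
    counter[0]."""
    if hi - lo == 1:                        # a leaf: one query
        counter[0] += 1
        return leaves[lo]
    mid = (lo + hi) // 2
    halves = [(lo, mid), (mid, hi)]
    random.shuffle(halves)                  # pick a random child to go first
    shortcut = 0 if depth % 2 == 0 else 1   # AND stops on 0, OR stops on 1
    value = evaluate(leaves, halves[0][0], halves[0][1], depth + 1, counter)
    if value == shortcut:
        return value
    return evaluate(leaves, halves[1][0], halves[1][1], depth + 1, counter)

def run(leaves):
    """Evaluate the whole tree; returns (value, number of leaves read)."""
    counter = [0]
    value = evaluate(leaves, 0, len(leaves), 0, counter)
    return value, counter[0]
```

Whatever random choices are made, the returned value always equals the deterministic evaluation of the tree; only the number of leaves read is random.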

22.1.2. Analysis
Lemma 22.1.2. The above algorithm, when applied to T_{2k}, reads the values of at most 3^k leaves in expectation,
and this also bounds its expected running time.

Proof: The proof is by induction. Let us start with T_2. There are two possibilities:
(i) The tree evaluates to 0; then one of the children of the AND gate evaluates to zero. With probability
half the algorithm guesses this child and evaluates it first. Thus, in this case, the algorithm
evaluates (in expectation) at most (1/2)2 + (1/2)4 = 3 leaves.
(ii) If the output of the tree is 1, then both children of the root must evaluate to 1. Each one of them is an OR
gate. Arguing as above, an OR gate evaluating to one requires the randomized algorithm to read, in expectation, (1/2)1 + (1/2)2 = 3/2
leaves. It follows that in this case the algorithm reads
(in expectation) 2(3/2) = 3 leaves. (Note that this is an upper bound – if all four inputs are 1, the algorithm
reads only 2 leaves.)

For k > 1, consider the four grandchildren c_1, c_2, c_3, c_4 of the root. By induction, in expectation, evaluating
each of c_1, . . . , c_4 takes 3^{k−1} leaf evaluations. Let X_1, . . . , X_4 be indicator variables that are one if c_i is evaluated
by the recursive algorithm, and let Y_i be the number of leaves read when evaluating c_i (i.e., E[Y_i] = 3^{k−1}).
By the above, we have that E[Σ_i X_i] = 3. Observe that X_i and Y_i are independent. (Note that the X_i are not
independent of each other.) We thus have that the expected number of leaves evaluated by the randomized
algorithm is

E[Σ_i X_i Y_i] = Σ_i E[X_i Y_i] = Σ_i E[X_i] E[Y_i] = 3^{k−1} E[Σ_i X_i] = 3 · 3^{k−1} = 3^k. ■

Corollary 22.1.3. Given an AND/OR tree with n leaves, the above algorithm evaluates, in expectation,

3^k = 2^{k log₂ 3} = 2^{2k (log₂ 3)/2} = n^{(log₂ 3)/2} ≈ n^{0.7925}

leaves.

22.2. Bibliographical notes


The AND/OR tree algorithm is from the work of Marc Snir [Sni85]. One can show a lower bound using Yao's minimax
principle, which is implied by the minimax theorem for zero-sum games.

References
[Sni85] M. Snir. Lower bounds on probabilistic linear decision trees. Theor. Comput. Sci., 38: 69–82,
1985.

Chapter 23

The Probabilistic Method II


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“Today I know that everything watches, that nothing goes unseen, and that even wallpaper has a better memory than ours. It
isn’t God in His heaven that sees all. A kitchen chair, a coat-hanger, a half-filled ash tray, or the wood replica of a woman named
Niobe, can perfectly well serve as an unforgetting witness to every one of our acts.”

Günter Grass, The Tin Drum

23.1. Expanding Graphs


In this lecture, we are going to discuss expanding graphs.

Definition 23.1.1. An (n, d, α, c) OR-concentrator is a bipartite multigraph G(L, R, E), with the independent
sets of vertices L and R each of cardinality n, such that
(i) Every vertex in L has degree at most d.
(ii) Any subset S of vertices of L, with |S | ≤ αn has at least c |S | neighbors in R.

A good (n, d, α, c) OR-concentrator should have d as small as possible¬ , and c as large as possible.

Theorem 23.1.2. There is an integer n0 , such that for all n ≥ n0 , there is an (n, 18, 1/3, 2) OR-concentrator.

Proof: Let every vertex of L choose neighbors by sampling (with replacement) d vertices independently and
uniformly from R. We discard multiple parallel edges in the resulting graph.
Let E_s be the event that some subset of s vertices of L has fewer than cs neighbors in R. Clearly,

P[E_s] ≤ C(n, s) · C(n, cs) · (cs/n)^{ds} ≤ (ne/s)^s (ne/(cs))^{cs} (cs/n)^{ds} = [ (s/n)^{d−c−1} e^{1+c} c^{d−c} ]^s,

since C(n, k) ≤ (ne/k)^k. Setting α = 1/3, using s ≤ αn, and c = 2, we have

P[E_s] ≤ [ (1/3)^{d−c−1} e^{1+c} c^{d−c} ]^s = [ 3^{1+c} c^{−c} e^{1+c} (c/3)^d ]^s ≤ [ (3e)^{1+c} (c/3)^d ]^s ≤ [ (3e)^3 (2/3)^{18} ]^s ≤ (0.4)^s,
3 3
¬ Or smaller!

as c = 2 and d = 18. Thus,
Σ_{s≥1} P[E_s] ≤ Σ_{s≥1} (0.4)^s < 1.

It thus follows that the random graph we generated has the required properties with positive probability. ■

23.1.1. An alternative construction


Theorem 23.1.3. Consider a bipartite graph over left and right vertex sets L and R, with n = |L| = |R|. Let G be
the random graph formed by the union of d = 18 random perfect matchings between L and R. Then G is an
(n, 18, 1/3, 2) OR-concentrator. Furthermore, every vertex of G has degree at most d.

Proof: Let E_s be the event that some subset of s vertices of L has fewer than cs neighbors in R. For a choice of
such a set S ⊆ L, and a set T of size cs in R, the number of ways to choose a single matching in which all the
vertices of S have their neighbors in T is cs · (cs − 1) · · · (cs − s + 1) – indeed, fix an ordering of the items of S,
and assign them their matches in T one by one. As such, we have

Ξ = P[E_s] ≤ C(n, s) · C(n, cs) · ( cs(cs − 1) · · · (cs − s + 1) / ( n(n − 1) · · · (n − s + 1) ) )^d.

Using (cs/n) · ((cs − 1)/(n − 1)) · · · ((cs − s + 1)/(n − s + 1)) ≤ (cs/n)^s, we have

Ξ ≤ (ne/s)^s (ne/(cs))^{cs} (cs/n)^{ds}.

The quantity on the right of the above inequality is the same quantity bounded in the proof of Theorem 23.1.2,
and the result follows by the same argumentation. ■
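A minimal sketch of this construction in Python (our own illustration, not code from the notes). The brute-force check of the OR-concentrator property is exponential and only feasible for tiny n, but it makes the definition concrete:

```python
import random
from itertools import combinations

def random_matchings_graph(n, d, seed=None):
    """Bipartite graph on L = R = {0,...,n-1}: the union of d random
    perfect matchings, as in Theorem 23.1.3.  adj[i] is the set of
    right-side neighbors of left vertex i; parallel edges collapse."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for _ in range(d):
        right = list(range(n))
        rng.shuffle(right)          # a uniformly random perfect matching
        for left, r in enumerate(right):
            adj[left].add(r)
    return adj

def is_or_concentrator(adj, alpha, c):
    """Brute-force verification that every S subset of L with
    |S| <= alpha*n has at least c|S| neighbors in R.  Exponential time;
    for sanity checks on tiny n only."""
    n = len(adj)
    for s in range(1, int(alpha * n) + 1):
        for S in combinations(range(n), s):
            if len(set().union(*(adj[i] for i in S))) < c * s:
                return False
    return True
```

For example, with n = 9 and d = 18 the check `is_or_concentrator(random_matchings_graph(9, 18, seed=1), 1/3, 2)` essentially always holds, in line with the theorem.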

23.1.2. An expander
Definition 23.1.4. An (n, d, c)-expander is a graph G = (V, E) over n vertices, such that
(i) Every vertex in G has degree at most d.
(ii) Any subset S of vertices of V, with |S| ≤ n/3, has at least c |S| neighbors.

Theorem 23.1.5. One can construct an (n, 36, 2)-expander.

Proof: Let G be a graph with vertex set ⟦n⟧. Construct the graph of Theorem 23.1.3, and let
G′ be this graph. For every edge v_i u_j in G′, create the edge ij in G. Clearly, G has the desired properties. ■

23.2. Probability Amplification


Let Alg be an algorithm in RP, such that given x, Alg picks a random number r from the range Z_n =
{0, . . . , n − 1}, for a suitable choice of a prime n, and computes a binary value Alg(x, r) with the following
properties:
(A) If x ∈ L, then Alg(x, r) = 1 for at least half the possible values of r.
(B) If x ∉ L, then Alg(x, r) = 0 for all possible choices of r.

Next, we show that using lg² n bits­ one can achieve 1/n^{lg n} confidence, compared with the naive 1/n, and
the 1/t confidence achieved by t (dependent) executions of the algorithm using two-point sampling.
Theorem 23.2.1. For n large enough, there exists a bipartite graph G(V, R, E) with |V| = n, |R| = 2^{lg² n} such
that:
(i) Every subset of n/2 vertices of V has at least 2^{lg² n} − n neighbors in R.
(ii) No vertex of R has more than 12 lg² n neighbors.

Proof: Each vertex of V chooses d = 2^{lg² n} (4 lg² n)/n neighbors independently in R. We show that the resulting
graph violates the required properties with probability less than half.®
The probability for a set of n/2 vertices on the left to fail to have enough neighbors is

τ ≤ C(n, n/2) · C(2^{lg² n}, n) · (1 − n/2^{lg² n})^{dn/2}
  ≤ 2^n · (2^{lg² n} e/n)^n · exp( −(dn/2) · (n/2^{lg² n}) )
  = 2^n · (2^{lg² n} e/n)^n · exp( −2n lg² n )
  ≤ exp( n + n ln(2^{lg² n} e/n) − 2n lg² n ),

since C(n, n/2) ≤ 2^n, (dn/2)(n/2^{lg² n}) = 2n lg² n, and C(x, y) ≤ (xe/y)^y.¯ Now, we have

ρ = n ln(2^{lg² n} e/n) = n( ln 2^{lg² n} + ln e − ln n ) ≤ (ln 2) n lg² n ≤ 0.7n lg² n,

for n ≥ 3. As such, we have τ ≤ exp( n + (0.7 − 2)n lg² n ) ≪ 1/4.
As for the second property, note that the expected number of neighbors of a vertex v ∈ R is 4 lg² n. Indeed,
the probability of a vertex of R to become adjacent to a random edge is 1/|R|, and this “experiment” is
repeated independently dn times. As such, the expected degree of a vertex is µ = E[Y] = dn/|R| = 4 lg² n. The
Chernoff bound (Theorem 13.2.1, page 95) implies that

α = P[Y > 12 lg² n] = P[Y > (1 + 2)µ] < exp(−µ 2²/4) = exp(−4 lg² n).

Since there are 2^{lg² n} vertices in R, the probability that any vertex of R has a degree that exceeds
12 lg² n is, by the union bound, at most |R| α ≤ 2^{lg² n} exp(−4 lg² n) ≤ exp(−3 lg² n) ≪ 1/4, concluding our
tedious calculations°.
Thus, with constant positive probability, the random graph has the required property, as the union of the
two bad events has probability ≪ 1/2. ■

We assume that given a vertex (of the above graph) we can compute its neighbors, without computing the
whole graph.
­ Everybody knows that lg n = log₂ n. Everybody knows that the captain lied.
® Here, we keep parallel edges if they happen – which is unlikely. The reader can ignore this minor technicality, on her way to
ignoring this whole write-up.
¯ The reader might want to verify that one can use significantly weaker upper bounds and the result still follows – we are using
the tighter bounds here for educational reasons, and because we can.
° Once again, our verbosity in applying the Chernoff inequality is for educational reasons – usually such calculations would be
swept under the rug. No wonder then that everybody is afraid to look under the rug.

So, we are given an input x. Use lg² n bits to pick a vertex v ∈ R. We next identify the neighbors of v in V:
r_1, . . . , r_k. We then compute Alg(x, r_i), for i = 1, . . . , k. Note that k = O(lg² n). If all k calls return 0, then we
return that x is not in the language. Otherwise, we return that x belongs to the language.
If x is in the language, then consider the subset U ⊆ V, such that running Alg on any of the strings of U
returns TRUE. We know that |U| ≥ n/2. The set U is connected to all the vertices of R except for at most
|R| − (2^{lg² n} − n) = n of them. As such, the probability of a failure in this case is

P[x ∈ L but r_1, r_2, . . . , r_k ∉ U] = P[v is not connected to U] ≤ n/|R| = n/2^{lg² n}.
We summarize the result.

Lemma 23.2.2. Given an algorithm Alg in RP that uses lg n random bits, and explicit access to the
graph of Theorem 23.2.1, one can decide if an input word is in the language of Alg using lg² n random bits, and the
probability of failure is at most n/2^{lg² n}.
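The decision procedure of the lemma is simple to phrase in code. The following Python sketch is ours: `alg`, `neighbors_of`, and `r_size` are hypothetical stand-ins for the one-sided-error RP test, the explicit neighborhood oracle of the graph of Theorem 23.2.1, and |R|, respectively.

```python
import random

def amplified_decide(x, alg, neighbors_of, r_size):
    """One-shot amplification in the spirit of Lemma 23.2.2: spend the
    lg^2 n random bits choosing a right vertex v, run alg(x, r) on each
    of v's O(lg^2 n) neighbors, and accept if any run accepts.

    Since an RP algorithm never accepts a word outside the language, a
    single 1 certifies membership; a false reject happens only when the
    randomly chosen v avoids the set U of witnesses."""
    v = random.randrange(r_size)          # the lg^2 n random bits
    return any(alg(x, r) for r in neighbors_of(v))
```

Note that the code only captures the control flow; the whole difficulty of the lemma lies in the (non-explicit) graph that `neighbors_of` is assumed to represent.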

Let us compare the various results we now have about running an algorithm in RP using lg² n bits. We have
three options:
(A) Run the algorithm lg n times independently. The probability of failure is at most 1/2^{lg n} = 1/n.
(B) Use Lemma 23.2.2, with probability of failure at most n/2^{lg² n} ≪ 1/n.
(C) The third option is to use pairwise independent sampling (see Lemma 7.2.13). While it is not directly
comparable to the above two options, it is clearly inferior, and is thus less useful.

Unfortunately, there is no explicit construction of the expanders used here. However, there are alternative
techniques that achieve a similar result.

23.3. Oblivious routing revisited


Theorem 23.3.1. Consider any randomized oblivious algorithm for permutation routing on the hypercube with
N = 2^n nodes. If this algorithm uses k random bits, then its expected running time is Ω( 2^{−k} √(N/n) ).

Corollary 23.3.2. Any randomized oblivious algorithm for permutation routing on the hypercube with N = 2^n
nodes must use Ω(n) random bits in order to achieve expected running time O(n).

Theorem 23.3.3. For every n, there exists a randomized oblivious scheme for permutation routing on a hyper-
cube with N = 2^n nodes that uses 3n random bits and runs in expected time at most 15n.

Chapter 24

Dimension Reduction

24.1. Introduction to dimension reduction


Given a set P of n points in R^d, we need nd numbers to describe them. In many scenarios, d might be quite
large, or even larger than n (in some applications, where the access to the points is given only through dot
products, it is useful to think about the dimension as being unbounded). If we care only about the distances
between pairs of points, then all we need to store are the pairwise distances between the points. This would
require roughly n² numbers, if we just write down the distance matrix.
But can we do better? (I.e., use less space.) A natural idea is to reduce the dimension of the points. Namely,
replace the ith point p_i ∈ P by a point u_i ∈ R^k, where k ≪ d and k ≪ n. We would like k to be small. If we
can do that, then we compress the data from size dn to size kn, which might be a significant compression.
Of course, one cannot do such a compression of information without losing some information. In particular, we
are willing to let the distances be a bit off. Formally, we would like to have the property that (1 − ε)∥p_i − p_j∥ ≤
∥u_i − u_j∥ ≤ (1 + ε)∥p_i − p_j∥, for all i, j, where u_i is the image of p_i ∈ P after the dimension reduction.
To this end, we generate a random matrix M of dimensions k × d, where k = Θ(ε^{−2} log n) (the exact details
of how to generate this matrix are below, but informally every entry is going to be picked from a normal
distribution and scaled appropriately). We then set u_i = M p_i, for all p_i ∈ P.
Before delving into the details, we need to better understand the normal distribution.

24.2. Normal distribution


The standard normal distribution has

f(x) = (1/√(2π)) exp(−x²/2)        (24.1)

as its density function. We denote that X is distributed according to such a distribution by X ∼ N(0, 1). It is
depicted in Figure 24.1.
Somewhat strangely, it would be convenient to consider two such independent variables X and Y together.
Their probability space (X, Y) is the plane, and it defines a two dimensional density function

g(x, y) = f(x) f(y) = (1/(2π)) exp(−(x² + y²)/2).        (24.2)

[Figure 24.1: The density f(x) = (1/√(2π)) exp(−x²/2) of the standard normal distribution.]

The key property of this function is that g(x, y) = g(x′, y′) ⇐⇒ ∥(x, y)∥² = x² + y² = (x′)² + (y′)² = ∥(x′, y′)∥². Namely, g(x, y)
is symmetric around the origin (i.e., all the points at the same distance from the origin have the same density).
We next use this property in verifying that f(·) is indeed a valid density function.

Lemma 24.2.1. We have I = ∫_{−∞}^{∞} f(x) dx = 1, where f(x) = (1/√(2π)) exp(−x²/2).
Proof: Observe that

I² = ( ∫_{x=−∞}^{∞} f(x) dx )² = ( ∫_{x=−∞}^{∞} f(x) dx )( ∫_{y=−∞}^{∞} f(y) dy ) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} f(x) f(y) dx dy
   = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x, y) dx dy.

Change the variables to x = r cos α, y = r sin α, and observe that the determinant of the Jacobian is

J = det [ ∂x/∂r  ∂x/∂α ; ∂y/∂r  ∂y/∂α ] = det [ cos α  −r sin α ; sin α  r cos α ] = r( cos² α + sin² α ) = r.

As such,

I² = (1/(2π)) ∫_{r=0}^{∞} ∫_{α=0}^{2π} exp(−r²/2) |J| dα dr = (1/(2π)) ∫_{r=0}^{∞} ∫_{α=0}^{2π} exp(−r²/2) r dα dr
   = ∫_{r=0}^{∞} exp(−r²/2) r dr = [ −exp(−r²/2) ]_{r=0}^{r=∞} = −exp(−∞) − (−exp(0)) = 1. ■

Lemma 24.2.2. For X ∼ N(0, 1), we have that E[X] = 0 and V[X] = 1.

Proof: The density function of X, see Eq. (24.1), is symmetric around 0, which implies that E[X] = 0. As for
the variance, we have

V[X] = E[X²] − (E[X])² = E[X²] = ∫_{x=−∞}^{∞} x² f(x) dx = (1/√(2π)) ∫_{x=−∞}^{∞} x² exp(−x²/2) dx.

Observing that

x² exp(−x²/2) = ( −x exp(−x²/2) )′ + exp(−x²/2)

implies (using integration by guessing) that

V[X] = (1/√(2π)) [ −x exp(−x²/2) ]_{x=−∞}^{∞} + (1/√(2π)) ∫_{−∞}^{∞} exp(−x²/2) dx = 0 + 1 = 1. ■

24.2.1. The standard multi-dimensional normal distribution
The multi-dimensional normal distribution, denoted by N^d, is the distribution over R^d that assigns a point p =
(p_1, . . . , p_d) the density

g(p) = (1/(2π)^{d/2}) exp( −(1/2) Σ_{i=1}^{d} p_i² ).

It is easy to verify, using the above, that ∫_{R^d} g(p) dp = 1. Furthermore, we have the following useful but
easy properties.¬

Lemma 24.2.3. We have the following properties:


(A) For d independent variables X_1, . . . , X_d ∼ N(0, 1), the point u = (X_1, . . . , X_d) has the multi-
dimensional normal distribution N^d.
(B) The multi-dimensional normal distribution is symmetric. For any two points p, u ∈ Rd such that ∥p∥ =
∥u∥, we have that g(p) = g(u), where g(·) is the density function of the multi-dimensional normal distri-
bution Nd .
(C) The projection of the normal distribution on any direction (i.e., any vector of length 1) is a one-dimensional
normal distribution.

Proof: (A) Let f(·) denote the density function of N(0, 1), and observe that the density function of u is

f(X_1) f(X_2) · · · f(X_d) = (1/√(2π)) exp(−X_1²/2) · · · (1/√(2π)) exp(−X_d²/2) = (1/(2π)^{d/2}) exp(−∥u∥²/2),

which readily implies the claim.
(B) Readily follows from observing that g(p) = (1/(2π)^{d/2}) exp(−∥p∥²/2).
(C) Let p = (X1 , . . . , Xd ), where X1 , . . . , Xd ∼ N(0, 1). Let v be any unit vector in Rd , and observe that by the
symmetry of the density function, we can (rigidly) rotate the space around the origin in any way we want, and
the measure of sets does not change. In particular rotate space so that v becomes the unit vector (1, 0, . . . , 0).
We have that
P[⟨v, p⟩ ≤ α] = P[⟨(1, 0, . . . , 0), p⟩ ≤ α] = P[X1 ≤ α],
which implies that ⟨v, p⟩ ∼ X1 ∼ N(0, 1). ■

The generalized multi-dimensional distribution is a Gaussian. Fortunately, we only need the simpler notion.

24.3. Dimension reduction


24.3.1. The construction
The input is a set P ⊆ R^d of n points (where d is potentially very large), and let ε > 0 be an approximation
parameter. For

k = ⌈24 ε^{−2} ln n⌉        (24.3)

we pick k vectors u_1, . . . , u_k independently from the d-dimensional normal distribution N^d. Given a point
p ∈ R^d, its image is

h(p) = (1/√k) ( ⟨u_1, p⟩, . . . , ⟨u_k, p⟩ ).
¬
The normal distribution has such useful properties that it seems that the only thing normal about it is its name.

In matrix notation, M is the k × d matrix whose ith row is the vector u_i, scaled by 1/√k:

M = (1/√k) [ u_1 ; u_2 ; . . . ; u_k ].

For every point p_i ∈ P, we set u_i = h(p_i) = M p_i.
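As a concrete (if naive) illustration, the construction can be written in pure Python using `random.gauss` for the entries of M. This is our own sketch, not optimized code from the notes:

```python
import math
import random

def jl_matrix(k, d, rng=random):
    """The k x d matrix M: row i is a random vector u_i drawn from the
    d-dimensional normal distribution, scaled by 1/sqrt(k)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
            for _ in range(k)]

def image(M, p):
    """h(p) = Mp, the image of the point p in R^k."""
    return [sum(m * x for m, x in zip(row, p)) for row in M]

def dist(u, v):
    """Euclidean distance between two points given as lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For k around 24 ε⁻² ln n, the ratio `dist(image(M, p), image(M, q)) / dist(p, q)` concentrates in [1 − ε, 1 + ε], which is exactly the guarantee analyzed below.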

24.3.2. Analysis
24.3.2.1. A single unit vector is preserved
Consider a vector v of length one in R^d. The natural question is what value of k is needed, so that the length
of h(v) is a good approximation to the length of v. Since ⟨u_i, v⟩ ∼ N(0, 1), by Lemma 24.2.3, this question boils down to
the following: given k variables X_1, . . . , X_k ∼ N(0, 1), sampled independently, how concentrated is the random
variable

Y = ∥(X_1, . . . , X_k)∥² = Σ_{i=1}^{k} X_i².

We have that E[Y] = k E[X_i²] = k V[X_i] = k, since X_i ∼ N(0, 1), for any i. The distribution of Y is known as the
chi-square distribution with k degrees of freedom.
 
Lemma 24.3.1. Let φ ∈ (0, 1) and ε ∈ (0, 1/2) be parameters, and let k ≥ (16/ε²) ln(2/φ) be an integer. Then,
for k independent random variables X_1, . . . , X_k ∼ N(0, 1), the variable Z = Σ_i X_i²/k is strongly concentrated.
Formally, we have that P[Z ≤ 1 + ε] ≥ 1 − φ.

Proof: Arguing as in the proof of Chernoff's inequality, using t = ε/4 < 1/2, we have

P[Z ≥ 1 + ε] ≤ P[ exp(tkZ) ≥ exp(tk(1 + ε)) ] ≤ E[exp(tkZ)] / exp(tk(1 + ε)) = Π_{i=1}^{k} ( E[exp(tX_i²)] / exp(t(1 + ε)) ).

Using the substitution x = y/√(1 − 2t) and dx = dy/√(1 − 2t), we have

E[exp(tX_i²)] = ∫_{x=−∞}^{∞} ( exp(tx²)/√(2π) ) exp(−x²/2) dx = (1/√(2π)) ∫_{x=−∞}^{∞} exp( −(1 − 2t) x²/2 ) dx
   = (1/√(1 − 2t)) · (1/√(2π)) ∫_{y=−∞}^{∞} exp(−y²/2) dy = 1/√(1 − 2t).

We have that 1/(1 − z) = Σ_{i=0}^{∞} z^i, for 0 ≤ z < 1, and thus

1/(1 − ε/2) = Σ_{i=0}^{∞} (ε/2)^i ≤ ( 1 + (1/2) Σ_{i=1}^{∞} (ε/2)^i )² ≤ ( exp( (1/2) Σ_{i=1}^{∞} (ε/2)^i ) )².

Since t = ε/4, we have

E[exp(tX_i²)] = 1/√(1 − 2t) = 1/√(1 − ε/2) ≤ exp( (1/2) Σ_{i=1}^{∞} (ε/2)^i ).

As such, we have

P[Z ≥ 1 + ε] ≤ ( exp( (1/2) Σ_{i=1}^{∞} (ε/2)^i − (ε/4)(1 + ε) ) )^k = ( exp( −ε²/8 + (1/2) Σ_{i=3}^{∞} (ε/2)^i ) )^k ≤ exp(−kε²/16) ≤ φ/2,

since, for ε < 1/2, we have (1/2) Σ_{i=3}^{∞} (ε/2)^i ≤ (ε/2)³ ≤ ε²/16. The last step in the above inequality follows by
substituting in the lower bound on the value of k. ■


The other direction we need follows in a similar fashion. We state the needed result without proof [LM00,
Lemma 1] (which also yields better constants):
P
Lemma 24.3.2. Let Y1 , . . . , Yk be k independent random variables with Yi ∼ N(0, 1). Let Z = ki=1 Yi2 /k. For
any x > 0, we have that
h p i h p i
P Z ≤ 1 − 2 x/k ≤ exp(−x) and P Z ≥ 1 + 2 x/k + 2x/k ≤ exp(−x).

For our purposes, we require that exp(−x) ≤ φ/2, which implies x = ln(2/φ). We further require that
2√(x/k) ≤ ε and 2√(x/k) + 2x/k ≤ ε, which hold for k = (8/ε²) ln(2/φ), for ε ≤ 1. We thus get the following result.

Corollary 24.3.3. Let φ ∈ (0, 1) and ε ∈ (0, 1/2) be parameters, and let k ≥ (8/ε²) ln(2/φ) be an integer. Then, for k
independent random variables X_1, . . . , X_k ∼ N(0, 1), we have for Z = Σ_i X_i²/k that P[1 − ε ≤ Z ≤ 1 + ε] ≥
1 − φ.
Remark 24.3.4. The result of Corollary 24.3.3 is surprising. It says that if we pick a point according to the
k-dimensional normal distribution, then its distance to the origin is strongly concentrated around √k. Namely,
the normal distribution “converges” to a sphere, as the dimension increases. The mind boggles.
Lemma 24.3.5. Let v be a unit vector in R^d. Then

P[ 1 − ε ≤ ∥Mv∥ ≤ 1 + ε ] ≥ 1 − 1/n².

Proof: Observe that for a number x ≥ 0, if 1 − ε ≤ x² ≤ 1 + ε, then 1 − ε ≤ x ≤ 1 + ε. As such, the claim holds
if 1 − ε ≤ ∥Mv∥² ≤ 1 + ε. By Corollary 24.3.3, setting φ = 1/n², we need

k ≥ (8/ε²) ln(2/φ) = (8/ε²) ln(2n²),

and since ln(2n²) ≤ 3 ln n for n ≥ 2, this holds for the value picked for k in Eq. (24.3). ■

24.3.3. All pairwise distances are preserved


Lemma 24.3.6. With probability at least half, for all points p, p′ ∈ P, we have that

(1 − ε) ∥p − p′∥ ≤ ∥Mp − Mp′∥ ≤ (1 + ε) ∥p − p′∥.

Proof: The key observation is that M is a linear operator. As such, let v = (p − p′)/∥p − p′∥ be a unit vector,
and observe that

(1 − ε) ∥p − p′∥ ≤ ∥Mp − Mp′∥ = ∥M(p − p′)∥ ≤ (1 + ε) ∥p − p′∥
   ⇐⇒ 1 − ε ≤ ∥ M (p − p′)/∥p − p′∥ ∥ ≤ 1 + ε
   ⇐⇒ (1 − ε) ∥v∥ ≤ ∥Mv∥ ≤ (1 + ε) ∥v∥.

The probability that the latter condition does not hold is at most 1/n², by Lemma 24.3.5. As such, over all
C(n, 2) possible pairs of points, the probability of failure is at most C(n, 2) · (1/n²) ≤ 1/2, as claimed. ■
We thus got the famous JL-Lemma.
Theorem 24.3.7 (The Johnson-Lindenstrauss Lemma). Given a set P of n points in Rd , and a parameter ε,
one can reduce the dimension of P to k = O(ε−2 log n) dimensions, such that all pairwise distances are 1 ± ε
preserved.

24.4. Even more on the normal distribution


The following is not used anywhere in the above, and is provided as additional information about the normal
distribution.
Lemma 24.4.1. Let X ∼ N(0, 1), and let σ > 0 and µ be two real numbers. The random variable Y = σX + µ
has the density function

f_{µ,σ}(x) = (1/(√(2π) σ)) exp( −(x − µ)²/(2σ²) ).        (24.4)

The variable Y has the normal distribution with variance σ² and expectation µ, denoted by Y ∼ N(µ, σ²).
  h i R (α−µ)/σ  
Proof: We have P[Y ≤ α] = P[σX + µ ≤ α] = P[X ≤ (α − µ)/σ] = ∫_{y=−∞}^{(α−µ)/σ} f(y) dy, where f(x) = (1/√(2π)) exp(−x²/2).
Substituting y = (x − µ)/σ, and observing that dy/dx = 1/σ, we have

P[Y ≤ α] = ∫_{x=−∞}^{α} f( (x − µ)/σ ) (1/σ) dx = (1/(√(2π) σ)) ∫_{x=−∞}^{α} exp( −(x − µ)²/(2σ²) ) dx,

as claimed.
As for the second part, observe that E[Y] = E[σX + µ] = σ E[X] + µ = µ and V[Y] = V[σX + µ] =
V[σX] = σ² V[X] = σ². ■

Lemma 24.4.2. Consider two independent variables X ∼ N(0, 1) and Y ∼ N(0, 1). For α, β > 0, we have
Z = αX + βY ∼ N(0, σ²), where σ = √(α² + β²).

Proof: Consider the region in the plane H⁻ = { (x, y) ∈ R² | αx + βy ≤ z } – this is a halfspace bounded by the
line ℓ ≡ αx + βy = z. This line is orthogonal to the vector (α, β). We have that ℓ ≡ (α/σ)x + (β/σ)y = z/σ. Observe that
∥(α/σ, β/σ)∥ = 1, which implies that the distance of ℓ from the origin is d = z/σ.
Now, we have

P[Z ≤ z] = P[αX + βY ≤ z] = P[H⁻] = ∫_{p=(x,y)∈H⁻} g(x, y) dp,

see Eq. (24.2). Since the two dimensional density function g is symmetric around the origin, any halfspace
whose boundary is at distance d from the origin has the same probability. In particular,
consider the halfspace T = { (x, y) ∈ R² | x ≤ d }. We have that

P[Z ≤ z] = P[H⁻] = P[T] = P[X ≤ d] = (1/√(2π)) ∫_{−∞}^{d} exp(−x²/2) dx
   = (1/√(2π)) ∫_{y=−∞}^{z} exp( −y²/(2σ²) ) (dx/dy) dy = (1/(√(2π) σ)) ∫_{y=−∞}^{z} exp( −y²/(2σ²) ) dy,

by the change of variables x = y/σ, observing that dx/dy = 1/σ. By Eq. (24.4), the above integral is the
probability of a variable distributed N(0, σ²) to be smaller than z, establishing the claim. ■
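The lemma is easy to check empirically. The following sketch (ours, not from the notes) samples Z = αX + βY and compares its standard deviation to σ = √(α² + β²):

```python
import random
import statistics

def sample_sum(alpha, beta, m, rng):
    """m independent samples of Z = alpha*X + beta*Y with X, Y ~ N(0,1)
    independent; by Lemma 24.4.2, Z ~ N(0, alpha^2 + beta^2)."""
    return [alpha * rng.gauss(0, 1) + beta * rng.gauss(0, 1)
            for _ in range(m)]
```

For α = 3 and β = 4, the sample standard deviation hovers around σ = 5, and the sample mean around 0.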

   
Lemma 24.4.3. Consider two independent variables X ∼ N(µ_1, σ_1²) and Y ∼ N(µ_2, σ_2²). We have Z = X + Y ∼
N(µ_1 + µ_2, σ_1² + σ_2²).

Proof: Let X̂ ∼ N(0, 1) and Ŷ ∼ N(0, 1), and observe that we can write X = σ_1 X̂ + µ_1 and Y = σ_2 Ŷ + µ_2. As
such, we have

Z = X + Y = σ_1 X̂ + σ_2 Ŷ + µ_1 + µ_2.

The variable W = σ_1 X̂ + σ_2 Ŷ ∼ N(0, σ_1² + σ_2²), by Lemma 24.4.2. Adding µ_1 + µ_2 to W just shifts its
expectation, implying the claim. ■

24.5. Bibliographical notes


The original result is due to Johnson and Lindenstrauss [JL84]. By now there are many proofs of this lemma.
Our proof follows class notes of Anupam Gupta, which in turn follow Indyk and Motwani [IM98].

References
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of di-
mensionality. Proc. 30th Annu. ACM Sympos. Theory Comput. (STOC), 604–613, 1998.
[JL84] W. B. Johnson and J. Lindenstrauss. Extensions of lipschitz mapping into hilbert space. Contem-
porary Mathematics, 26: 189–206, 1984.
[LM00] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection.
Ann. Statist., 28(5): 1302–1338, 2000.

Chapter 25

Streaming and the Multipass Model

I don’t know why it should be, I am sure; but the sight of another man asleep in bed when I am up, maddens me. It seems
to me so shocking to see the precious hours of a man’s life - the priceless moments that will never come back to him again -
being wasted in mere brutish sleep.

Jerome K. Jerome, Three Men in a Boat



25.1. The secretary problem


Assume that we see n applicants α_1, . . . , α_n, arriving one by one in a uniformly random order. We would like
to hire the best one, but we have to make an immediate decision – when we see
candidate α_i, we either hire them (and then we are stuck with them), or we let them go and see the next
candidate. We win the game if we hire the best candidate out of the n candidates.
The question is what is a natural strategy to win the game. Let P(r) be the probability that we win, when
the strategy is to see the first r − 1 candidates, and then hire the first candidate we see in α_r, . . . , α_n that is better
than all the candidates seen in α_1, . . . , α_{r−1}. We have that

P(r) = Σ_{i=1}^{n} P[ applicant i is selected ∩ applicant i is the best ]
     = Σ_{i=1}^{n} P[ applicant i is selected | applicant i is the best ] · P[ applicant i is the best ]
     = ( Σ_{i=1}^{r−1} 0 + Σ_{i=r}^{n} P[ the best of the first i − 1 applicants is among the first r − 1 applicants ] ) · (1/n)
     = ( Σ_{i=r}^{n} (r − 1)/(i − 1) ) · (1/n)
     = ((r − 1)/n) Σ_{i=r}^{n} 1/(i − 1).

163
Observe that

Σ_{i=r}^{n} 1/(i − 1) ≤ ∫_{x=r−2}^{n−1} (1/x) dx = ln(n − 1) − ln(r − 2) ≈ ln(n/r).

For r = n/e, we have that P(r) ≈ (r/n) ln(n/r) = (1/e) ln e = 1/e.
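A quick simulation (our own sketch) of the skip-the-first-(r − 1) strategy reproduces the ≈ 1/e success probability:

```python
import math
import random

def secretary_wins(n, r, rng):
    """Play one round on a uniformly random arrival order of candidate
    qualities 0..n-1 (n-1 is the best).  Skip the first r-1 candidates,
    then hire the first one better than everything seen so far; return
    True iff the hire is the overall best candidate."""
    ranks = list(range(n))
    rng.shuffle(ranks)
    threshold = max(ranks[:r - 1], default=-1)  # best of the skipped prefix
    for i in range(r - 1, n):
        if ranks[i] > threshold:
            return ranks[i] == n - 1            # hired; did we get the best?
    return False                                # nobody beat the prefix
```

With n = 50 and r ≈ n/e, the empirical win rate lands close to 1/e ≈ 0.368.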

25.2. Reservoir sampling: Fishing a sample from a stream


Imagine that you are given a stream of elements s_1, s_2, . . ., and you need to sample k numbers from this stream
(say, without repetition) – assume that you do not know the length of the stream in advance, and furthermore,
you have only O(k) space available. How to do that efficiently?
There are two natural schemes:
(A) Whenever an element arrives, generate a random number for it in the range [0, 1]. Maintain a heap
of the k elements with the lowest priorities. Implemented naively this requires O(log k) comparisons
after each insertion, but it is not difficult to improve this to O(1) comparisons, in the amortized sense, per
insertion. Clearly, the resulting set is the desired random sample.
(B) Let S_t be the random sample maintained after the tth iteration. When the ith element arrives, the algorithm
flips a coin that is heads with probability min(1, k/i). If the coin comes up heads, then the algorithm inserts s_i into S_{i−1} to get S_i;
if S_{i−1} already has k elements, it first deletes one of them at random.
Theorem 25.2.1. Given a stream of elements, one can uniformly sample k elements (without repetition), from
the stream using O(k) space, where O(1) time is spent for handling each incoming element.
Proof: We implement scheme (B) above. We only need to argue that this yields a uniform random sample. The
claim trivially holds for i = k. So assume the claim holds for i < t, and we need to prove that the sample after getting the
tth element is still a uniform random sample.
So, consider a specific set K ⊆ {s_1, . . . , s_t} of k elements. The probability of K to be a random sample of
size k from a set of t elements is 1/C(t, k). We need to argue that this probability is realized by this scheme.
So, if s_t ∉ K, then we have

P[K = S_t] = P[K = S_{t−1} and s_t was not inserted] = (1/C(t − 1, k)) (1 − k/t) = k!(t − 1 − k)!(t − k) / ((t − 1)! t) = 1/C(t, k).

If s_t ∈ K, then

P[K = S_t] = P[ K \ {s_t} ⊆ S_{t−1}, s_t was inserted, and S_{t−1} \ K was thrown out of S_{t−1} ]
           = ( (t − 1 − (k − 1)) / C(t − 1, k) ) · (k/t) · (1/k) = (t − k) k! (t − 1 − k)! / ((t − 1)! t) = 1/C(t, k),

as desired. Indeed, there are t − 1 − (k − 1) = t − k subsets of size k of {s_1, . . . , s_{t−1}} that contain K \ {s_t} – since we fix
k − 1 of the t − 1 elements. ■
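Scheme (B) fits in a few lines of Python (our sketch; replacing a uniformly random slot of the reservoir is the same as deleting a random element and then inserting the newcomer):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Maintain a uniform random k-subset of the prefix of the stream
    seen so far, using O(k) space and O(1) work per element."""
    S = []
    for i, s in enumerate(stream, start=1):
        if len(S) < k:
            S.append(s)                  # the first k elements are kept
        elif rng.random() < k / i:       # insert s_i with probability k/i
            S[rng.randrange(k)] = s      # ...evicting a random element
    return S
```

Each element of the stream ends up in the final sample with probability exactly k/n, which is easy to confirm empirically.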

25.3. Sampling and median selection revisited


Let B[1, . . . , n] be an array of n numbers. We would like to estimate the median without computing it outright. A
natural idea would be to pick k elements e_1, . . . , e_k randomly from B, and return their median as the guess for
the median of B.
In the following, let B⟨t⟩ be the tth smallest number in the array B.

Observation 25.3.1. For any ε ∈ (0, 1), we have that 1/(1 − ε) ≥ 1 + ε.

Lemma 25.3.2. Let ε ∈ (0, 1/2) be a fixed parameter, and let B be a set of n numbers. Let Z be the median of
a random sample (with replacement) of B of size k. We have that

P[ B_{⟨(1−ε)n/2⟩} ≤ Z ≤ B_{⟨(1+ε)n/2⟩} ] ≥ 1 − δ,   where k ≥ ⌈ (12/ε²) ln(2/δ) ⌉.

Namely, with probability at least 1 − δ, the returned value Z is at most (ε/2)n positions away from the true median.

Proof: Let L = B_{⟨(1−ε)n/2⟩}, and let e_i be the ith sampled number, for i = 1, . . . , k. Let X_i = 1 if and only if e_i ≤ L.
We have that

P[X_i = 1] = ((1 − ε)n/2)/n = (1 − ε)/2.

As such, setting Y = Σ_{i=1}^{k} X_i, we have

µ = E[Y] = (1 − ε)k/2 ≥ k/4 ≥ (3/ε²) ln(2/δ).

One case of failure of the algorithm is if Y ≥ k/2. Since 1/(1 − ε) ≥ 1 + ε, we have that

P[Y ≥ k/2] = P[ Y ≥ (1/(1 − ε)) µ ] ≤ P[ Y ≥ (1 + ε)µ ] ≤ exp(−ε²µ/3) ≤ exp( −(ε²/3)(3/ε²) ln(2/δ) ) ≤ δ/2,

by Chernoff's inequality (see Lemma 13.2.5).
This implies that P[ B_{⟨(1−ε)n/2⟩} > Z ] ≤ δ/2. The claim now follows by realizing that by symmetry (i.e.,
reversing the order), we have that

P[ Z > B_{⟨(1+ε)n/2⟩} ] ≤ δ/2,

and putting these two inequalities together. ■

The above already implies that we can get a good estimate for the median. We need something somewhat
stronger – we state it without proof since it follows by similarly mucking around with Chernoff’s inequality.

Lemma 25.3.3. Let ε ∈ (0, 1/2), B an array of n elements, and let S = {e_1, . . . , e_k} be a set of k samples picked
uniformly and randomly from B. Then, for some absolute constant c, and an integer k such that k ≥ ⌈(c/ε²) ln(1/δ)⌉,
we have that

P[ S_{⟨k−⟩} ≤ B_{⟨n/2⟩} ≤ S_{⟨k+⟩} ] ≥ 1 − δ,

for k− = ⌊(1 − ε)k/2⌋ and k+ = ⌊(1 + ε)k/2⌋.
One can prove an even stronger statement:

P[ B_{⟨(1−2ε)n/2⟩} ≤ S_{⟨(1−ε)k/2⟩} ≤ B_{⟨n/2⟩} ≤ S_{⟨(1+ε)k/2⟩} ≤ B_{⟨(1+2ε)n/2⟩} ] ≥ 1 − δ

(the constant c would have to be slightly bigger).

25.3.1. A median selection with few comparisons
The above suggests a natural algorithm for computing the median (i.e., the element of rank n/2 in B). Pick a random sample S of k = O(n^{2/3} log n) elements. Next, sort S, and pick the elements L and H of ranks (1 − ε)k/2 and (1 + ε)k/2 in S, respectively. Next, scan the elements of B, comparing each to L and H, and keep only the elements that lie between them. At the end of this process, we have computed:
(A) α: the rank of the number L in the set B.
(B) T = {x ∈ B | L ≤ x ≤ H}.
Compute, by brute force (i.e., sorting), the element of rank n/2 − α in T. Return it as the desired median. If n/2 − α is negative (or exceeds |T|), then the algorithm failed, and it tries again.
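A minimal sketch of this algorithm in code (the function name, the small-input cutoff, and retry-by-recursion are choices made here, not from the text; the sample size follows the k = O(n^{2/3} log n) above):

```python
import math
import random

def approx_median_select(B, eps=None):
    """Median selection by random sampling, as sketched above.

    Draws k = O(n^{2/3} log n) samples (with replacement), takes the sample
    elements L and H of ranks ~(1 - eps)k/2 and ~(1 + eps)k/2, keeps only
    the elements of B between them, and selects by rank in that small set.
    On an unlucky sample it simply retries, so the answer is always exact.
    """
    n = len(B)
    if n <= 100:                                   # tiny input: just sort
        return sorted(B)[n // 2]
    if eps is None:
        eps = n ** (-1.0 / 3.0)
    k = int(4 * n ** (2.0 / 3.0) * math.log2(n))
    S = sorted(random.choices(B, k=k))
    L = S[max(0, int((1 - eps) * k / 2))]
    H = S[min(k - 1, int((1 + eps) * k / 2))]
    alpha = sum(1 for x in B if x < L)             # rank of L in B
    T = sorted(x for x in B if L <= x <= H)        # elements between L and H
    r = n // 2 - alpha
    if 0 <= r < len(T):
        return T[r]
    return approx_median_select(B, eps)            # sample failed; try again
```

Since a failed sample simply triggers a retry, the returned value is always the exact element of rank n/2; the randomness only affects the running time.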

Lemma 25.3.4. The above algorithm performs 2n + O(n^{2/3} log n) comparisons, and reports the median. This holds with high probability.

Proof: Set ε = 1/n^{1/3} and δ = 1/n^{O(1)}, and observe that Lemma 25.3.3 implies that, with probability ≥ 1 − δ, the desired median is between L and H. In addition, Lemma 25.3.3 also implies that |T| ≤ 4εn ≤ 4n^{2/3}, which readily implies the correctness of the algorithm.

As for the bound on the number of comparisons, we have, with high probability, that the number of comparisons is

O(|S| log |S| + |T| log |T|) + 2n = O(n^{2/3} log² n + n^{2/3} log n) + 2n,

since deciding if an element is between L and H requires two comparisons. ■

Lemma 25.3.5. The above algorithm can be modified to perform (3/2)n + O(n^{2/3} log n) comparisons, and report the median correctly. This holds with high probability.

Proof: The trick is to compare each element first to L or first to H, chosen randomly with equal probability. An element that is smaller than L or bigger than H is then resolved after a single comparison with probability 1/2, and thus requires 1.5 comparisons in expectation; only the (few) elements that fall between L and H always require two comparisons. This improves the bound from 2n to (3/2)n. ■
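The randomized comparison order can be demonstrated directly; a small sketch (names mine) that classifies a sequence against [L, H] and counts comparisons:

```python
import random

def classify_count(B, L, H):
    """Classify each element against [L, H], comparing first with L or H
    chosen at random; returns (elements inside, comparisons used).

    An element outside [L, H] is resolved by one comparison with
    probability 1/2, hence 1.5 expected comparisons; elements inside
    always need two.
    """
    inside, comps = [], 0
    for x in B:
        if random.random() < 0.5:          # compare with L first
            comps += 1
            if x < L:
                continue
            comps += 1
            if x <= H:
                inside.append(x)
        else:                              # compare with H first
            comps += 1
            if x > H:
                continue
            comps += 1
            if x >= L:
                inside.append(x)
    return inside, comps
```

On a sequence of 1000 elements with ~100 of them inside [L, H], the count concentrates around 1.5·900 + 2·100 ≈ 1550, rather than 2000.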

Lemma 25.3.6. Consider a stream B of n numbers, and assume we can make two passes over the data. Then, one can compute exactly the median of B using:
(I) O(n^{2/3}) space.
(II) 1.5n + O(n^{2/3} log n) comparisons.
The algorithm reports the median correctly, and it succeeds with high probability.

Proof: Implement the above algorithm, using the random sampling from Theorem 25.2.1. ■

Remark 25.3.7. Interestingly, one can do better if one is more careful. The basic idea is to do thinning – given two sorted sequences, each of size s, consider merging them, and then picking every even-rank element into a new sequence. Clearly, the element of rank i in the output sequence has rank 2i in the union of the two original sequences. A sequence that is the result of i such rounds of thinning is of level i. We maintain O(log n) such sequences as we read the stream. At any point in time, we have two buffers of size s that we fill up from the stream. Whenever the two buffers fill up completely, we perform the thinning operation on them, creating a sequence of level 1.

If during this process we have two sequences of the same level, we merge them and perform thinning on the result. As such, we maintain O(log n) sequences, each of size s. Assume that our stream has size n, and that n is a power of 2. Then, at the end of the process, we have only a single sequence of level h = log₂(n/s).

By induction, it is easy to prove that an element of rank r in this sequence has rank between 2^h(r − 1) and 2^h r in the original stream.

Thus, setting s = √n, we get that after a single pass, using O(√n log n) space, we have a sorted sequence where the rank of each element is, roughly, a √n-approximation of its true rank. We pick the two consecutive elements bracketing the median (or, more carefully, its predecessor and successor), and filter the stream again, keeping only the elements in between these two elements. It is easy to show that only O(√n) elements would be kept, and we can then extract the median using O(√n log n) time.

We thus get that one can compute the median in two passes using O(√n log n) space. It is not hard to extend this algorithm to α passes, where the space required becomes O(n^{1/α} log n). This elegant algorithm goes back to 1980, and is due to Munro and Paterson [MP80].
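The thinning step is simple to state in code; a hypothetical sketch (the name `thin` is mine):

```python
import heapq

def thin(a, b):
    """One thinning round: merge two sorted sequences of equal size and
    keep every second (even-rank) element, halving the total size.

    The element of (one-based) rank i in the output has rank 2i in the
    merged input, which is what drives the rank bounds in the remark.
    """
    assert len(a) == len(b)
    merged = list(heapq.merge(a, b))
    return merged[1::2]       # ranks 2, 4, 6, ... of the merged sequence
```

Applying `thin` repeatedly to same-level sequences yields the level-i sequences described above.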

25.4. Big data and the streaming model


Here, we are interested in performing computational tasks when the amount of data we have to handle is quite large (think terabytes or larger). The main challenge in many of these cases is that even reading the data once is expensive, and running times of O(n log n) might not be acceptable. Furthermore, in many cases, we cannot load all the data into memory.

In the streaming model, one reads the data as it comes in, but one cannot afford to keep all of it. A natural example would be an internet router, which has gazillions of packets going through it every minute. We might still be interested in natural questions about these packets, but we want to answer them without storing all the packets.

25.5. Heavy hitters


The problem. Imagine a stream s1, s2, . . ., where elements might repeat, and we would like to maintain a list of the elements that appear at least εn times, where n is the number of elements seen so far and ε ∈ (0, 1) is some parameter. The purpose here is to do this using as little space as possible.

25.5.1. A randomized algorithm


An easy randomized algorithm would maintain a random sample of size m = ⌈(1/ε) ln(1/φ)⌉, using reservoir sampling, where φ ∈ (0, 1) is the allowed failure probability. The probability that the sample fails to contain a given heavy hitter after t insertions is

(1 − ε)^m ≤ exp(−εm) ≤ exp(−ε⌈(1/ε) ln(1/φ)⌉) ≤ exp(−ln(1/φ)) = φ.
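Reservoir sampling itself, which the above relies on, can be sketched as follows (a standard formulation, not code from the text):

```python
import random

def reservoir_sample(stream, m):
    """Maintain a uniform random sample of size m over a stream.

    The ith element (one-based) enters the sample with probability m/i,
    evicting a uniformly chosen current member; a standard induction
    shows every prefix element stays sampled with probability m/i.
    """
    R = []
    for i, x in enumerate(stream, start=1):
        if len(R) < m:
            R.append(x)
        else:
            j = random.randrange(i)        # uniform in {0, ..., i-1}
            if j < m:
                R[j] = x
    return R
```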

25.5.2. A deterministic algorithm


Disclaimer: The following is a deterministic algorithm, but it is too elegant to hold this against it, and we will
present it anyway.

The algorithm. To this end, let

k = ⌈1/ε⌉.

At each point in time, we maintain a set S of at most k elements, with a counter for each element. Let S_t be the version of S after t elements were inserted. When s_{t+1} arrives, we increase its counter if it is already in S_t. If |S_t| < k, then we just insert s_{t+1} into the set, and set its counter to 1. Otherwise, |S_t| = k and s_{t+1} ∉ S_t. We then decrease all the k counters of elements in S_t by 1. If the counter of an element drops to zero, we delete the element from the set.
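A minimal sketch of this counter scheme (it is the classical Misra–Gries algorithm; the dictionary bookkeeping below is one possible phrasing):

```python
import math

def heavy_hitters(stream, eps):
    """Counter scheme above, with k = ceil(1/eps) counters.

    Returns the final candidate set (element -> counter value); it
    contains every element appearing at least eps * t times among the
    first t stream elements.
    """
    k = math.ceil(1 / eps)
    S = {}
    for x in stream:
        if x in S:
            S[x] += 1
        elif len(S) < k:
            S[x] = 1
        else:                              # decrease all k counters by one
            for y in list(S):
                S[y] -= 1
                if S[y] == 0:
                    del S[y]
    return S
```

Note that the returned set may contain false positives, but, by the correctness argument below, never misses a true heavy hitter.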

Correctness.

Lemma 25.5.1. After the insertion of t elements, the set S_t maintained by the above algorithm contains all the elements that appear at least εt times in the stream.

Proof: Conceptually, imagine that the algorithm keeps counters for all the distinct elements seen in the stream. Whenever a decrease of the counters happens, the algorithm decreases not k counters but k + 1 counters – the additional counter being that of the new element, which has value one and goes down to zero. Clearly, the number of distinct remaining elements at any point in time – that is, the number of counters with a non-zero value – is at most k. Consider an element e that appears u ≥ εt times in the stream. The counter for e is going to be increased u times, and decreased at most α times, where α(k + 1) ≤ t. We have that the counter for e at the end of the stream must have value at least

u − t/(k + 1) ≥ εt − t/(k + 1) = εt − t/(⌈1/ε⌉ + 1) = t·(⌈1/ε⌉ε + ε − 1)/(⌈1/ε⌉ + 1) ≥ t·ε/(⌈1/ε⌉ + 1) > 0.

This implies that the counter of e is strictly larger than 0, which implies that e appears in S_t. ■

References
[MP80] J. I. Munro and M. Paterson. Selection and sorting with limited storage. Theo. Comp. Sci., 12:
315–323, 1980.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.

Chapter 26

Frequency Estimation over a Stream

“See? Genuine-sounding indignation. I programmed that myself. It’s the first thing you need in a university environment: the ability to take offense at any slight, real or imagined.”
Robert Sawyer, Factoring Humanity

598 - Class notes for Randomized Algorithms
Sariel Har-Peled, April 2, 2024

26.1. The art of estimation


26.1.1. The problem
Assume we would like to obtain a good estimate of some quantity ρ > 0 – specifically, for a fixed parameter ε ∈ (0, 1), we would like to compute a quantity ρ′ such that ρ′ ∈ [(1 − ε)ρ, (1 + ε)ρ] with good probability. To this end, assume we have access to a distribution D, such that if we sample X according to this distribution (i.e., X ∼ D), we have that E[X] = ρ. We can use X to estimate our desired quantity, but a single sample might not provide the desired estimation.

Example: Estimating p for a coin. Assume we have a coin that comes up heads with probability p. A natural way to estimate p is to flip the coin once, and return 1 if it is heads, and zero otherwise. Let X be the result of the coin flip, and observe that E[X] = p. But this is not a very useful estimator.

26.1.2. Averaging estimator: Success with constant probability


26.1.2.1. The challenge
The basic problem is that X ∼ D might be much bigger than ρ. Or, more specifically, its variance might be huge, where D is a distribution we have access to. Let

ρ = E[D] and ν = V[D].

We would like to generate a variable Z, such that

E[Z] = ρ and V[Z] ≤ (ε²/4)ρ². (26.1)

This would imply, by Chebyshev’s inequality, that

P[|Z − ρ| ≥ ερ] = P[|Z − E[Z]| ≥ 2√((ε²/4)ρ²)] ≤ P[|Z − E[Z]| ≥ 2√V[Z]] ≤ 1/4.

26.1.2.2. Taming of the variance


The basic idea is to take α independent variables X1, . . . , Xα ∼ D, for an α to be determined shortly, and let Y = Σ_i Xi/α. We have, by linearity of expectation, that

E[Y] = Σ_i E[Xi]/α = E[X] = ρ.

Using the independence of X1, . . . , Xα, we have

V[Y] = V[Σ_i Xi/α] = (1/α²) V[Σ_i Xi] = (1/α²) Σ_i V[Xi] = (1/α²)·αν = ν/α.

Guided by Eq. (26.1), we want this quantity to be at most (ε²/4)ρ². Thus,

ν/α ≤ (ε²/4)ρ² ⇐ α ≥ (4/ε²)·(ν/ρ²) = (4/ε²)·(V[X]/E[X]²).

We thus summarize the result.

Lemma 26.1.1. Let D be a non-negative distribution with ρ = E[D] and ν = V[D], and let ε ∈ (0, 1) be a parameter. For α ≥ ⌈4V[D]/(ε²(E[D])²)⌉, consider sampling variables X1, . . . , Xα ∼ D, and let Z = Σ_{i=1}^α Xi/α. Then Z is a “good” estimator for ρ. Formally, we have

P[(1 − ε)ρ ≤ Z ≤ (1 + ε)ρ] ≥ 3/4.

26.1.3. Median estimator: Success with high probability


We would like to get a better estimator, where the probability of success is high. Formally, we have a parameter φ, and we would like the estimator to succeed with probability ≥ 1 − φ. A natural approach is to try to use Chernoff’s inequality to bound the probability of failure of the averaging estimator. This would work in some cases, but it is limited to the case when Z lies in a small bounded range, and would not work in general if sampling from D might return a huge value with tiny probability. Instead, we are going to boost the averaging estimator. Assume we generate

β = O(log(1/φ))

instances of the averaging estimator, Z1, . . . , Zβ, of Lemma 26.1.1. The median estimator returns the median value of the Zi s as the desired estimate.

Analysis. Let Ei be the event that Zi ∈ [(1 − ε)ρ, (1 + ε)ρ], and let Gi be an indicator variable for Ei. By Lemma 26.1.1, P[Ei] = P[Gi = 1] ≥ 3/4. The median estimator fails only if Σ_{i=1}^β Gi < β/2. Using Chernoff’s inequality, we get that this happens with probability ≤ φ. We thus get the following.

Theorem 26.1.2. Let D be a non-negative distribution with µ = E[D] and ν = V[D], and let ε, φ ∈ (0, 1) be parameters. For M ≥ 24⌈4ν/(ε²µ²)⌉ ln(1/φ), consider sampling variables X1, . . . , XM ∼ D. One can compute, in O(M) time, a quantity Z from the sampled variables, such that

P[(1 − ε)µ ≤ Z ≤ (1 + ε)µ] ≥ 1 − φ.
Proof: Let m = ⌈4ν/(ε²µ²)⌉ and M = 24⌈ln(1/φ)⌉. Build M averaging estimators, each one using m samples. That is, let Zi be the average of the m samples si,1, . . . , si,m from D, for i = 1, . . . , M. Formally,

Zi = (1/m) Σ_{j=1}^m si,j, for i = 1, . . . , M.

The estimate returned is the value median(Z1, . . . , ZM).

By Lemma 26.1.1, each one of the averaging estimators is in the “good” range with probability ≥ 3/4. As such, let Xi, for i = 1, . . . , M, be an indicator variable that is 1 if the ith averaging estimator is in the range [(1 − ε)µ, (1 + ε)µ]. Let Y = Σ_{i=1}^M Xi. We have that E[Y] ≥ (3/4)M. As such, by Chernoff’s inequality (Lemma ??), we have

P[bad output] = P[Y < (1/2)M] ≤ P[Y < (1 − 1/3) E[Y]] ≤ exp(−((1/3)²/2) E[Y]).

The latter quantity is bounded by exp(−(1/18)·(3/4)M) = exp(−M/24) = exp(−24⌈ln(1/φ)⌉/24) ≤ φ. ■
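The proof’s median-of-means construction is easy to sketch; a hypothetical helper (names are mine; `sample` is any zero-argument function drawing one value from D):

```python
import random
import statistics

def median_of_means(sample, m, M):
    """Median of M averages, each over m independent draws from D.

    With m = ceil(4*nu/(eps^2 * mu^2)) and M = 24*ceil(ln(1/phi)), the
    result is within (1 +- eps) of mu = E[D] with probability >= 1 - phi.
    """
    Z = [sum(sample() for _ in range(m)) / m for _ in range(M)]
    return statistics.median(Z)
```

For example, for the uniform distribution on [0, 1] (mean 1/2, variance 1/12), ε = 0.1 and φ = 0.01 give m = 134 and M = 120.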

26.2. Frequency estimation over a stream for the kth moment


Let S = (s1, . . . , sm) be a stream (i.e., sequence) of m elements from N = {1, . . . , n}. Let fi be the number of times the number i appears in S. For k ≥ 0, let

Fk = Σ_{i=1}^n fi^k

be the kth frequency moment of S. The quantity F1 = m is the length of the stream S. Similarly, F0 is the number of distinct elements (where we use the convention that 0⁰ = 0, and any other quantity to the power 0 is 1). It is natural to define F∞ = max_i fi.

Here, we are interested in approximating the quantity Fk, for k ≥ 1, up to a factor of 1 ± ε, using small space, and reading the stream S only once.

26.2.1. An estimator for the kth moment


26.2.1.1. Basic estimator
One can pick a representative element from a stream uniformly at random by using reservoir sampling. That is, the ith element si becomes the representative with probability 1/i. Once sampled, the algorithm counts how many times it sees the representative value later on in the stream (the counter is initialized to 1, to account for the chosen representative itself). In particular, if s_p is the chosen representative at the end of the stream (i.e., the algorithm might change the representative several times), then the counter value is

r = |{ j | j ≥ p and s_j = s_p }|.

The output of the algorithm is the quantity

X = m(r^k − (r − 1)^k),

where m is the number of elements seen in the stream. Let V be the random variable that is the value of the representative at the end of the sequence.
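A hedged sketch of this basic estimator, folding the reservoir choice of the representative and the trailing count into one pass (names are mine):

```python
import random

def fk_basic_estimate(stream, k):
    """One-pass basic estimator for the kth frequency moment F_k.

    Chooses a representative position uniformly via reservoir sampling;
    r counts how often the representative's value appears from that
    position onward.  Returns X = m * (r^k - (r-1)^k), with E[X] = F_k.
    """
    m, rep, r = 0, None, 0
    for x in stream:
        m += 1
        if random.randrange(m) == 0:     # prob. 1/m: new representative
            rep, r = x, 1
        elif x == rep:
            r += 1
    return m * (r ** k - (r - 1) ** k)
```

For k = 1 the output is always exactly m, and for larger k the output is an unbiased (if noisy) estimate of F_k.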

26.2.1.2. Analysis
Lemma 26.2.1. We have E[X] = Fk.

Proof: Observe that since we choose the representative uniformly at random, we have

E[X | V = i] = Σ_{j=1}^{fi} (1/fi)·m(j^k − (j − 1)^k) = (m/fi) Σ_{j=1}^{fi} (j^k − (j − 1)^k) = (m/fi) fi^k.

As such, we have E[X] = E[E[X | V]] = Σ_{i: fi≠0} (fi/m)·(m/fi) fi^k = Σ_i fi^k = Fk. ■

Remark 26.2.2. In the above, we estimated the function g(x) = x^k over the frequency numbers f1, . . . , fn, but the above argumentation, on the expectation of X, would work for any function g(x) such that g(0) = 0 and g(x) ≥ 0, for all x ≥ 0.
Lemma 26.2.3. For k > 1, we have Σ_{i=1}^n (i^k − (i − 1)^k)² ≤ kn^{2k−1}.

Proof: Observe that for x ≥ 1, we have that x^k − (x − 1)^k ≤ kx^{k−1}. As such, we have

Σ_{i=1}^n (i^k − (i − 1)^k)² ≤ Σ_{i=1}^n ki^{k−1}(i^k − (i − 1)^k) ≤ kn^{k−1} Σ_{i=1}^n (i^k − (i − 1)^k) = kn^{k−1}·n^k = kn^{2k−1}. ■

Lemma 26.2.4. We have E[X²] ≤ kmF_{2k−1}.

Proof: By Lemma 26.2.3 (applied to the first fi integers), we have

E[X² | V = i] = Σ_{j=1}^{fi} (1/fi)·m²(j^k − (j − 1)^k)² ≤ (m²/fi)·k fi^{2k−1} = m²k fi^{2k−2},

and thus E[X²] = E[E[X² | V]] = Σ_{i: fi≠0} (fi/m)·m²k fi^{2k−2} = mkF_{2k−1}. ■
i

Lemma 26.2.5. For any non-negative numbers f1, . . . , fn, and k ≥ 1, we have

Σ_{i=1}^n fi ≤ n^{(k−1)/k} (Σ_{i=1}^n fi^k)^{1/k}.

Proof: This is immediate from Hölder’s inequality, but here is a self-contained proof. The above is equivalent to proving that Σ_i fi/n ≤ (Σ_{i=1}^n fi^k/n)^{1/k}. Raising both sides to the power k, we need to show that (Σ_i fi/n)^k ≤ Σ_{i=1}^n fi^k/n. Setting g(x) = x^k, this becomes g(Σ_i fi/n) ≤ Σ_{i=1}^n g(fi)/n. The last inequality holds by the convexity of the function g(x) (indeed, g′(x) = kx^{k−1} and g″(x) = k(k − 1)x^{k−2} ≥ 0, for x ≥ 0). ■
Lemma 26.2.6. For any n numbers f1, . . . , fn ≥ 0, we have (Σ_i fi)(Σ_i fi^{2k−1}) ≤ n^{1−1/k} (Σ_i fi^k)².

Proof: Let M = max_i fi. We have

Σ_i fi^{2k−1} ≤ M^{k−1} Σ_i fi^k = (M^k)^{(k−1)/k} Σ_i fi^k ≤ (Σ_i fi^k)^{(k−1)/k} Σ_i fi^k = (Σ_i fi^k)^{(2k−1)/k}.

By Lemma 26.2.5, we have Σ_{i=1}^n fi ≤ n^{(k−1)/k} (Σ_i fi^k)^{1/k}. Multiplying the above two inequalities implies the claim. ■

Lemma 26.2.7. We have V[X] ≤ kn^{1−1/k} Fk².

Proof: Since m = Σ_i fi, Lemma 26.2.4 and Lemma 26.2.6 together imply that

V[X] = E[X²] − (E[X])² ≤ E[X²] ≤ kmF_{2k−1} = k(Σ_i fi)(Σ_i fi^{2k−1}) ≤ kn^{1−1/k} Fk². ■

26.2.2. An improved estimator: Plugin


We have an estimator for Fk using O(1) space. Specifically, µ = E[X] = Fk (see Lemma 26.2.1), and ν = V[X] ≤ kn^{1−1/k} Fk² (Lemma 26.2.7). Let

M = 24⌈4ν/(ε²µ²)⌉ ln(1/φ).

We compute M estimators as above (in parallel on the stream), and combine them as specified by Theorem 26.1.2, to get a new estimate Z. We have that

P[(1 − ε)µ ≤ Z ≤ (1 + ε)µ] ≥ 1 − φ.

Thus, the amount of space this streaming algorithm uses is proportional to M, and we have

M = O((ν/(ε²µ²)) ln(1/φ)) = O(((kn^{1−1/k} Fk²)/(ε² Fk²)) ln(1/φ)) = O((kn^{1−1/k}/ε²) ln(1/φ)).

We thus proved the following.


In the following, we consider a computer word to be sufficiently large to contain lg n or lg m bits. The above readily implies the following.

Theorem 26.2.8. Let S = (s1, . . . , sm) be a stream of numbers from the set {1, . . . , n}, and let k ≥ 1 be a parameter. Given ε, φ ∈ (0, 1), one can build a data-structure using O(kn^{1−1/k} ε^{−2} log φ^{−1}) words, such that one can (1 ± ε)-approximate the kth moment of the elements in the stream; that is, the algorithm outputs a number Z, such that (1 − ε)Fk ≤ Z ≤ (1 + ε)Fk, where Fk = Σ_{i=1}^n fi^k, and fi is the number of times i appears in the stream S. The algorithm succeeds with probability ≥ 1 − φ.

26.3. Better estimation for F2
26.3.1. Pseudo-random k-wide independent sequence of signed bits
In the following, assume that we sample O(log n) bits, such that given an index i, one can compute (quickly!) a random signed bit b(i) ∈ {−1, +1}. We require that the resulting bits b(1), b(2), . . . , b(n) are 4-wise independent. To this end, pick a prime p that is, say, bigger than n^{10}. This can easily be done by sampling a number in the range [n^{10}, n^{11}], and checking if it is prime (which can be done in polynomial time).

Once we have such a prime, we generate a random polynomial g(x) = Σ_{j=0}^5 c_j x^j mod p, by choosing c0, . . . , c5 from Z_p = {0, . . . , p − 1}. We have seen that g(0), g(1), . . . , g(n) are uniformly distributed in Z_p, and they are, say, 6-wise independent (see Theorem 7.2.10). We define

b(i) = 0 if g(i) = p − 1, b(i) = +1 if g(i) is odd, and b(i) = −1 if g(i) is even (and not p − 1).

Clearly, the bits b(1), . . . , b(n) are 6-wise independent. There is a chance that one of these bits might be zero, but the probability for that is at most n/p, which is so small that we just assume it does not happen. There are known constructions that do not have this issue at all (that one of the bits is zero), but they are more complicated.

Lemma 26.3.1. Given a parameter φ ∈ (0, 1), in time polynomial in O(log(n/φ)), one can construct a function b(·), requiring O(log(n/φ)) bits of storage (i.e., O(1) words), such that b(1), . . . , b(n) ∈ {−1, +1} with equal probability, and they are 6-wise independent. Furthermore, given i, one can compute b(i) in O(1) time.
The probability that this sequence fails to have the desired properties is smaller than φ.

Proof: We repeat the above construction, but picking a prime p in the range, say, [n^{10}/φ, n^{11}/φ]. ■

26.3.2. Estimator construction for F2


26.3.2.1. The basic estimator
As before, we have the stream S = s1, . . . , sm of numbers from the set {1, . . . , n}. We compute the 6-wise independent sequence of random bits of Lemma 26.3.1, and in the following we assume this sequence is good (i.e., it contains only −1s and +1s). We compute the quantity

T = Σ_{i=1}^n b(i) fi = Σ_{j=1}^m b(s_j),

which can be computed on the fly, using O(1) words of memory and O(1) time per item in the stream.
The algorithm returns X = T² as the desired estimate.
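A hedged sketch of this estimator, with the signed bits built from a random degree-5 polynomial as in Lemma 26.3.1 (the fixed Mersenne prime 2³¹ − 1 stands in for the sampled prime, and all names are mine):

```python
import random

def make_signs(p=2147483647):
    """Signed bits b(i) from a random degree-5 polynomial over Z_p (here
    p = 2^31 - 1, a prime assumed larger than n), as in Section 26.3.1.
    We ignore the tiny (~n/p) chance that some g(i) lands on p - 1."""
    c = [random.randrange(p) for _ in range(6)]
    def b(i):
        g = sum(cj * pow(i, j, p) for j, cj in enumerate(c)) % p
        return 1 if g % 2 == 1 else -1
    return b

def f2_basic_estimate(stream):
    """One-pass tug-of-war estimator: T = sum_j b(s_j), output X = T^2."""
    b = make_signs()
    T = sum(b(x) for x in stream)
    return T * T
```

In a real streaming implementation only the six coefficients and the running sum T are stored, i.e., O(1) words.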

Analysis.

Lemma 26.3.2. We have E[X] = Σ_i fi² = F2 and V[X] ≤ 2F2².

Proof: We have that E[X] = E[(Σ_{i=1}^n b(i) fi)²], and as such

E[X] = E[Σ_{i=1}^n (b(i))² fi² + 2 Σ_{i<j} b(i)b(j) fi fj] = Σ_{i=1}^n fi² + 2 Σ_{i<j} fi fj E[b(i)b(j)] = Σ_{i=1}^n fi² = F2,

since E[b(i)] = 0, E[b(i)²] = 1, and E[b(i)b(j)] = E[b(i)] E[b(j)] = 0 (assuming the sequence b(1), . . . , b(n) has not failed), by the 6-wise independence of the sequence of signed bits.

We next compute E[X²]. To this end, let N = {1, . . . , n}, and Γ = N × N × N × N. We split this set into several sets, as follows:
(i) Γ0 = {(i, i, i, i) ∈ N⁴}: all quadruples that are all the same value.
(ii) Γ1: the set of all quadruples (i, j, k, ℓ) where at least one value appears exactly once.
(iii) Γ2: the set of all quadruples (i, j, k, ℓ) with only two distinct values, each appearing exactly twice.
Clearly, we have N⁴ = Γ0 ∪ Γ1 ∪ Γ2.

For a tuple (i, i, i, i) ∈ Γ0, we have E[b(i)b(i)b(i)b(i)] = E[b(i)⁴] = 1.

For a tuple (i, j, k, ℓ) ∈ Γ1 with i being the unique value, we have that

E[b(i)b(j)b(k)b(ℓ)] = E[b(i)] E[b(j)b(k)b(ℓ)] = 0 · E[b(j)b(k)b(ℓ)] = 0,

using that the signed bits are 4-wise independent.

For a tuple (i, i, j, j) ∈ Γ2, we have E[b(i)b(i)b(j)b(j)] = E[b(i)² b(j)²] = E[b(i)²] E[b(j)²] = 1, and the same argumentation applies to any tuple of Γ2. Observe that for any i < j, there are C(4, 2) = 6 different tuples in Γ2 that are made out of i and j. As such, we have

E[X²] = E[(Σ_{i=1}^n b(i) fi)⁴] = E[Σ_{(i,j,k,ℓ)∈Γ} b(i)b(j)b(k)b(ℓ) fi fj fk fℓ]
= Σ_{(i,i,i,i)∈Γ0} E[b(i)⁴] fi⁴ + Σ_{(i,j,k,ℓ)∈Γ1} fi fj fk fℓ E[b(i)b(j)b(k)b(ℓ)] + 6 Σ_{i<j} E[b(i)² b(j)²] fi² fj²
= Σ_{i=1}^n fi⁴ + 6 Σ_{i<j} fi² fj².

As such, we have

V[X] = E[X²] − (E[X])² = Σ_{i=1}^n fi⁴ + 6 Σ_{i<j} fi² fj² − (Σ_{i=1}^n fi²)² = 4 Σ_{i<j} fi² fj² ≤ 2F2². ■

26.3.3. Improving the estimator


We repeat the same scheme as above. Let φ, ε ∈ (0, 1) be parameters. In the following, let

α = 16/ε² and β = 4 ln(1/φ).

Let Xi,j be a basic estimator for F2, using the estimator of Section 26.3.2.1, for i = 1, . . . , β and j = 1, . . . , α. Let Yi = Σ_{j=1}^α Xi,j/α, for i = 1, . . . , β. Let Z be the median of Y1, . . . , Yβ; the algorithm returns Z as the estimator.

Theorem 26.3.3. Given a stream S = s1, . . . , sm of numbers from {1, . . . , n}, and parameters ε, φ ∈ (0, 1), one can compute an estimate Z for F2(S), such that P[|Z − F2| > εF2] ≤ φ. This algorithm requires O(ε^{−2} log φ^{−1}) space (in words), and this is also the time to handle a new element in the stream.

Proof: The scheme is described above. As before, using Chebyshev’s inequality, we have that

P[|Yi − F2| > εF2] = P[|Yi − F2| > (εF2/√V[Yi])·√V[Yi]] ≤ V[Yi]/(ε²F2²) = (V[X]/α)/(ε²F2²) ≤ 2F2²/(αε²F2²) = 1/8,

by Lemma 26.3.2. Let U be the number of estimators among Y1, . . . , Yβ that are outside the acceptable range. Arguing as in Lemma ??, we have

P[Z is bad] ≤ P[U ≥ β/2] = P[U ≥ (1 + 3)β/8] ≤ exp(−(β/8)·3²/4) ≤ exp(−ln(1/φ)) = φ,

by Chernoff’s inequality (Lemma 13.2.6). ■

26.4. Bibliographical notes


The beautiful results of this chapter are from a paper by Alon et al. [AMS99].

References
[AMS99] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency
moments. J. Comput. Syst. Sci., 58(1): 137–147, 1999.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.

Chapter 27

Approximating the Number of Distinct Elements in a Stream

27.1. Counting number of distinct elements


27.1.1. First order statistic
Let X1, . . . , Xu be u random variables uniformly distributed in [0, 1]. Let Y = min(X1, . . . , Xu). The value Y is the first order statistic of X1, . . . , Xu.

For a continuous variable X, the probability density function (i.e., pdf) is the “probability” of X having a given value. Since this is not well defined, one looks at the cumulative distribution function F(x) = P[X ≤ x]. The pdf is then the derivative of the cdf. Somewhat abusing notation, the pdf of the Xi s is P[Xi = x] = 1. The following proof is somewhat dense; check any standard text on probability for more details.

Lemma 27.1.1. The probability density function of Y is f(x) = u(1 − x)^{u−1}.

Proof: Consider the pdf of X1 being x, and all other Xi s being bigger. This pdf is

g(x) = P[(X1 = x) ∩ ∩_{i=2}^u (Xi > X1)] = P[∩_{i=2}^u (Xi > X1) | X1 = x] · P[X1 = x] = (1 − x)^{u−1}.

Since every one of the Xi has equal probability to realize Y, we have f(x) = u·g(x). ■
Lemma 27.1.2. We have E[Y] = 1/(u+1), E[Y²] = 2/((u+1)(u+2)), and V[Y] = u/((u+1)²(u+2)).

Proof: Using integration by guessing, we have

E[Y] = ∫_{y=0}^1 y·u(1 − y)^{u−1} dy = [−y(1 − y)^u − (1 − y)^{u+1}/(u+1)]_{y=0}^1 = 1/(u+1).

Using integration by guessing again, we have

E[Y²] = ∫_{y=0}^1 y²·u(1 − y)^{u−1} dy = [−y²(1 − y)^u − 2y(1 − y)^{u+1}/(u+1) − 2(1 − y)^{u+2}/((u+1)(u+2))]_{y=0}^1 = 2/((u+1)(u+2)).

We conclude that

V[Y] = E[Y²] − (E[Y])² = 2/((u+1)(u+2)) − 1/(u+1)² = (1/(u+1))·(2/(u+2) − 1/(u+1)) = u/((u+1)²(u+2)). ■

27.1.2. The algorithm


A single estimator. Assume that we have a perfectly random hash function h that maps N = {1, . . . , n} into [0, 1]. Assume that the stream has u unique numbers from N. Then the set {h(s1), . . . , h(sm)} contains u random numbers uniformly distributed in [0, 1]. The algorithm, as such, computes X = min_i h(si).

Explanation. Note that X is not an estimator for u – instead, as E[X] = 1/(u+1), we are estimating 1/(u+1). The key observation is that a 1 ± ε estimator for 1/(u+1) is a 1 ± O(ε) estimator for u + 1, which is in turn a 1 ± O(ε) estimator for u.

Lemma 27.1.3. Let ε, φ ∈ (0, 1) be parameters. Given a stream S of items from {1, . . . , n}, one can return an estimate X, such that

P[(1 − ε/4)/(u + 1) ≤ X ≤ (1 + ε/4)/(u + 1)] ≥ 1 − φ,

where u is the number of unique elements in S. This requires O((1/ε²) log(1/φ)) space.

Proof: The basic estimator Y has µ = E[Y] = 1/(u + 1) and ν = V[Y] = u/((u + 1)²(u + 2)). We now plug this estimator into the mean/median framework. By Lemma 27.1.2, for some absolute constant c, this requires maintaining M estimators, where M is larger than

c·(4·16ν/(ε²µ²)) log(1/φ) = O((u²/(ε²u²)) log(1/φ)) = O((1/ε²) log(1/φ)). ■

Observe that if (1 − ε/4)/(u + 1) ≤ X ≤ (1 + ε/4)/(u + 1), then

(u + 1)/(1 − ε/4) − 1 ≥ 1/X − 1 ≥ (u + 1)/(1 + ε/4) − 1,

which implies

(1 + ε)u ≥ (1 + ε/4)u/(1 − ε/4) ≥ (u + ε/4)/(1 − ε/4) ≥ 1/X − 1 ≥ (u + 1)/(1 + ε/4) − 1 ≥ (1 − ε)u.

Namely, 1/X − 1 is a good estimator for the number of distinct elements.

The algorithm revisited. Compute X as above, and output the quantity 1/X − 1.

This immediately implies the following.


Lemma 27.1.4. Under the unreasonable assumption that we can sample perfectly random functions from {1, . . . , n} to [0, 1], and that storing such a function requires O(1) words, one can estimate the number of unique elements in a stream using O(ε^{−2} log φ^{−1}) words.
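Putting the pieces together, a sketch of the resulting estimator (only the averaging part, without the median boosting; names are mine). The “hash” is a memoised table of uniform [0, 1] values, which makes the perfectly-random-function assumption explicit but is honest only about the estimator, not the space bound:

```python
import random

def distinct_estimate(stream, trials=400):
    """Estimate F_0 as 1/X - 1, where X averages the minimum hash value
    over several independent 'perfectly random' hash functions.

    Each hash is a memoised table of uniform [0,1] values -- the
    idealised random function the lemma assumes.
    """
    mins = []
    for _ in range(trials):
        h, lo = {}, 1.0
        for x in stream:
            if x not in h:
                h[x] = random.random()
            lo = min(lo, h[x])
        mins.append(lo)
    X = sum(mins) / len(mins)          # averaging estimator for 1/(u+1)
    return 1.0 / X - 1.0
```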

27.2. Sampling from a stream with “low quality” randomness


Assume that we have a stream of elements S = s1, . . . , sm, all taken from the set {1, . . . , n}. In the following, let set(S) denote the set of values that appear in S. That is,

F0 = F0(S) = |set(S)|

is the number of distinct values in the stream S.

Assume that we have a random sequence of bits B ≡ B1, . . . , Bn, such that P[Bi = 1] = p, for some p, and that we can compute Bi efficiently. Assume that the bits of B are pairwise independent.

The sampling algorithm. When the ith element si arrives, we compute B_{si}. If this bit is 1, then we insert si into the random sample R (if it is already in R, there is no need to store a second copy, naturally).
This defines a natural random sample

R = {i | Bi = 1 and i ∈ S} ⊆ S.
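A hypothetical construction of such a bit sequence B, via a random linear polynomial over a prime field (a standard pairwise-independent family; the prime 2³¹ − 1 and the names are my choices, not from the text):

```python
import random

def make_bits(prob, p=2147483647):
    """Pairwise-independent bits B_i with P[B_i = 1] ~= prob, from a
    random linear map over Z_p (p = 2^31 - 1 here; any prime > n works).

    Since (a*i + b) mod p is uniform and pairwise independent over
    distinct i in {1, ..., n} for n < p, so are the thresholded bits.
    """
    a = random.randrange(1, p)
    b = random.randrange(p)
    thresh = int(prob * p)
    return lambda i: 1 if (a * i + b) % p < thresh else 0
```

Storing such a function takes only the two words a and b, which is what makes the sampling scheme space-efficient.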

Lemma 27.2.1. For the above random sample R, let X = |R|. We have that E[X] = pν and V[X] = pν − p²ν, where ν = F0(S) is the number of distinct elements in S.

Proof: Let X = |R|, and we have

E[X] = E[Σ_{i∈set(S)} Bi] = Σ_{i∈set(S)} E[Bi] = pν.

As for E[X²], we have

E[X²] = E[(Σ_{i∈set(S)} Bi)²] = Σ_i E[Bi²] + 2 Σ_{i<j} E[Bi Bj] = pν + 2p²·C(ν, 2),

using the pairwise independence of the bits. As such, we have

V[X] = V[|R|] = E[X²] − (E[X])² = pν + 2p²·C(ν, 2) − p²ν² = pν + p²ν(ν − 1) − p²ν² = pν − p²ν. ■
Lemma 27.2.2. Let ε ∈ (0, 1/4). Given O(1/ε²) space and a parameter N, consider the task of estimating F0 = |set(S)|, where F0 > N/4. Then, the algorithm described below outputs one of the following:
(A) “F0 > 2N”.
(B) A number ρ such that (1 − ε)F0 ≤ ρ ≤ (1 + ε)F0.
(Note that the two options are not disjoint.) The output of this algorithm is correct with probability ≥ 7/8.

Proof: We set p = c/(Nε²), where c is a constant to be determined shortly. Let T = pN = O(1/ε²). We sample a random sample R from S, by scanning the elements of S, and adding i ∈ S to R if Bi = 1. If the random sample becomes larger than 8T at any point, then the algorithm outputs that F0 > 2N.
In all other cases, the algorithm outputs |R|/p as the estimate for F0, together with R.

To bound the failure probability, consider first the case that N/4 < F0. In this case, we have, by the above, that

P[|X − E[X]| > ε E[X]] ≤ P[|X − E[X]| > (ε E[X]/√V[X])·√V[X]] ≤ V[X]/(ε²(E[X])²) ≤ 1/8,

if V[X]/(ε²(E[X])²) ≤ 1/8. For ν = F0 ≥ N/4, this happens if pν/(ε²p²ν²) ≤ 1/8, which is equivalent to 8/ε² ≤ pν. This in turn holds if

(c/(Nε²))·(N/4) ≥ 8/ε²,

which holds for c = 32. Namely, the algorithm in this case outputs a (1 ± ε)-estimate for F0.
If the sample gets bigger than 8T, then the above readily implies that, with probability at least 7/8, we have F0 ≥ (1 − ε)8T/p > 2N. Namely, the output of the algorithm is correct in this case. ■

Lemma 27.2.3. Let ε ∈ (0, 1/4) and φ ∈ (0, 1). Given O(ε^{−2} log φ^{−1}) space and a parameter N, consider the task of estimating F0 of S, given that F0 > N/4. Then, there is an algorithm that outputs one of the following:
(A) “F0 > 2N”.
(B) A number ρ such that (1 − ε)F0 ≤ ρ ≤ (1 + ε)F0.
(Note that the two options are not disjoint.) The output of this algorithm is correct with probability ≥ 1 − φ.

Proof: We run O(log φ^{−1}) copies of the algorithm of Lemma 27.2.2. If half of them return that F0 > 2N, then the algorithm returns that F0 > 2N. Otherwise, the algorithm returns the median of the numeric estimates returned. The correctness readily follows by a repeated application of Chernoff’s inequality. ■

Lemma 27.2.4. Let ε ∈ (0, 1/4). Given O(ε^{−2} log² n) space, one can read the stream S once, and output a number ρ such that (1 − ε)F0 ≤ ρ ≤ (1 + ε)F0. The estimate is correct with high probability (i.e., ≥ 1 − 1/n^{O(1)}).

Proof: Let Ni = 2^i, for i = 1, . . . , M = ⌈lg n⌉. Run M copies of the algorithm of Lemma 27.2.3, one for each value of Ni, with φ = 1/n^{O(1)}. Let Y1, . . . , YM be the outputs of these algorithms for the stream. A prefix of these outputs is going to be “F0 > 2Ni”. Let Yj be the first output that is a number; return this number as the desired estimate.
The correctness is easy – the first estimate that is a number is a correct estimate, with high probability. Since NM ≥ n, it also follows that YM must be a number. As such, there is a first number in the sequence, and the algorithm would output an estimate.
More precisely, there is an index i such that Ni/4 ≤ F0 ≤ 2Ni, and Yi is a good estimate, with high probability. If any of the Yj, for j < i, is an estimate, then it is correct (again) with high probability. ■

27.3. Bibliographical notes



Chapter 28

Approximate Nearest Neighbor (ANN) Search in High Dimensions

“Possession of anything new or expensive only reflected a person’s lack of theology and geometry; it could even cast doubts upon one’s soul.”
A Confederacy of Dunces, John Kennedy Toole

28.1. ANN on the hypercube


28.1.1. ANN for the hypercube and the Hamming distance
Definition 28.1.1. The set of points Hd = {0, 1}d is the d-dimensional hypercube. A point p = (p1 , . . . , pd ) ∈
Hd can be interpreted, naturally, as a binary string p1 p2 . . . pd . The Hamming distance dH (p, u) between
p, u ∈ Hd is the number of coordinates where p and u disagree.
It is easy to verify that the Hamming distance, being the L1 -norm, complies with the triangle inequality and
is thus a metric.

As we saw previously, to solve the (1 + ε)-ANN problem efficiently, it is sufficient to solve the approximate near neighbor problem. Namely, given a set P of n points in H_d, a radius r > 0, and a parameter ε > 0, we want to decide for a query point q whether d_H(q, P) ≤ r or d_H(q, P) ≥ (1 + ε)r, where d_H(q, P) = min_{p ∈ P} d_H(q, p).

Definition 28.1.2. For a set P of points, a data-structure D = D≈Near(P, r, (1 + ε)r) solves the approximate near neighbor problem if, given a query point q, the data-structure works as follows.
• Near: If d_H(q, P) ≤ r, then D outputs a point p ∈ P such that d_H(p, q) ≤ (1 + ε)r.
• Far: If d_H(q, P) ≥ (1 + ε)r, then D outputs "d_H(q, P) ≥ r".
• Don't care: If r ≤ d_H(q, P) ≤ (1 + ε)r, then D can return either of the above answers.

Given such a data-structure, one can answer an approximate nearest neighbor query using O(log(ε^{-1} log d)) approximate near neighbor queries. Indeed, the desired distance d_H(q, P) is an integer in the range 0, 1, . . . , d. We build a D≈Near data-structure for the distances (1 + ε)^i, for i = 1, . . . , M, where M = O(ε^{-1} log d). Performing a binary search over these distances resolves the approximate nearest neighbor query and requires O(log M) near neighbor queries.
As such, in the following, we concentrate on constructing the approximate near neighbor data-structure (i.e., D≈Near).
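The reduction above is a plain binary search over the M candidate radii. Here is a sketch, where `is_near(i)` is a hypothetical wrapper around the i-th D≈Near data-structure (answering whether the query is within distance roughly (1 + ε)^i of P); for simplicity the sketch assumes the answers are monotone in i.

```python
def ann_by_binary_search(is_near, M):
    """Find the smallest index i with is_near(i) == True, using
    O(log M) near-neighbor queries (sketch of the reduction)."""
    lo, hi = 0, M  # invariant (assumed): is_near(M) holds
    while lo < hi:
        mid = (lo + hi) // 2
        if is_near(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```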

28.1.2. Preliminaries
Definition 28.1.3. Consider a sequence m of k, not necessarily distinct, integers i_1, i_2, . . . , i_k ∈ ⟦d⟧, where ⟦d⟧ = {1, . . . , d}. For a point p = (p_1, . . . , p_d) ∈ R^d, its projection by m, denoted by m p, is the point (p_{i_1}, . . . , p_{i_k}) ∈ R^k. Similarly, the projection of a point set P ⊆ R^d by m is the point set m P = {m p | p ∈ P}.

Given two sequences m = i_1, . . . , i_k and u = j_1, . . . , j_{k′}, let m|u denote the concatenated sequence m|u = i_1, . . . , i_k, j_1, . . . , j_{k′}. Given a probability φ, a natural way to create such a projection is to include the ith coordinate, for i = 1, . . . , d, independently with probability φ. Let D_φ denote the resulting distribution over sequences of indices.

Definition 28.1.4. Let D^T_φ denote the distribution resulting from concatenating T independent sequences sampled from D_φ. The length of a sampled sequence is at most dT.

Observe that for a point p ∈ {0, 1}^d and M ∈ D^T_φ, the projection M p might be higher dimensional than the original point p, as it might contain repeated coordinates of the original point.
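A minimal sketch of these definitions in Python (using 0-based coordinates instead of ⟦d⟧):

```python
import random

def sample_projection(d, phi, rng=random):
    """Sample from D_phi: keep coordinate i with probability phi."""
    return [i for i in range(d) if rng.random() < phi]

def sample_projection_T(d, phi, T, rng=random):
    """Sample from D^T_phi: concatenate T independent D_phi samples.
    Coordinates may repeat, so a projected point can be higher
    dimensional than the original point."""
    return [i for _ in range(T) for i in sample_projection(d, phi, rng)]

def project(m, p):
    """The projection of the point p by the sequence m."""
    return tuple(p[i] for i in m)
```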

28.1.2.1. Algorithm
28.1.2.1.1. Input. The input is a set P of n points in the hypercube {0, 1}d , and parameters r and ε.

28.1.2.1.2. Preprocessing. We set the parameters as follows:

β = 1/(1 + ε) ∈ (0, 1),  φ = 1 − exp(−1/r) ≈ 1/r,  T = β ln n,  and  L = O(n^β log n).

We randomly and independently pick L sequences M_1, . . . , M_L ∈ D^T_φ. Next, the algorithm computes the point sets Q_i = M_i P, for i = 1, . . . , L, and stores each of them in a hash table, denoted by D_i, for i = 1, . . . , L.

28.1.2.1.3. Answering a query. Given a query point q ∈ {0, 1}d , the algorithm computes qi = Mi q, for
i = 1, . . . , L. From each D_i, the algorithm retrieves the list ℓ_i of all the points that collide with q_i. The algorithm scans the points in the lists ℓ_1, . . . , ℓ_L. If any of these points is in Hamming distance at most (1 + ε)r from q, the algorithm returns it as the desired near-neighbor (and stops). Otherwise, the algorithm reports that all the points of P are in distance at least r from q.
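Putting the preprocessing and the query together, here is a hedged end-to-end sketch of the scheme above; the constants hidden in T and L are guessed as 1, since the text only fixes them up to O(·).

```python
import math, random

def build_lsh(P, r, eps, rng=random.Random(0)):
    """Pick L random sequences from D^T_phi and hash the projected
    point set into L tables (a sketch; constants are illustrative)."""
    d, n = len(P[0]), len(P)
    beta = 1.0 / (1.0 + eps)
    phi = 1.0 - math.exp(-1.0 / r)              # roughly 1/r
    T = max(1, round(beta * math.log(n)))       # T = beta ln n
    L = max(1, round(n ** beta * math.log(n)))  # L = O(n^beta log n)
    tables = []
    for _ in range(L):
        m = [i for _ in range(T) for i in range(d) if rng.random() < phi]
        table = {}
        for p in P:
            table.setdefault(tuple(p[i] for i in m), []).append(p)
        tables.append((m, table))
    return tables

def query_lsh(tables, q, r, eps):
    """Scan the collision lists; return a point within (1+eps)r of q,
    or None, reporting that all of P is at distance at least r."""
    for m, table in tables:
        for p in table.get(tuple(q[i] for i in m), []):
            if sum(a != b for a, b in zip(p, q)) <= (1 + eps) * r:
                return p
    return None
```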

28.1.2.2. Analysis
Lemma 28.1.5. Let K be a set of r marked/forbidden coordinates. The probability that a sequence M = (m_1, . . . , m_T) sampled from D^T_φ does not sample any of the coordinates of K is n^{−β}. This probability only increases if K contains fewer coordinates.

Proof: For any i, the probability that m_i does not contain any of these coordinates is (1 − φ)^r = (e^{−1/r})^r = 1/e. Since this experiment is repeated T times independently, the probability is e^{−T} = e^{−β ln n} = n^{−β}. ■

Lemma 28.1.6. Let p be the nearest neighbor to q in P. If d_H(q, p) ≤ r then, with high probability, the data-structure returns a point that is in distance ≤ (1 + ε)r from q.

Proof: The good event here is that p and q collide under one of the sequences M_1, . . . , M_L. The probability that M_i p = M_i q is at least n^{−β}, by Lemma 28.1.5, as this is the probability that M_i does not sample any of the (at most r) coordinates where p and q disagree. As such, the probability that all L data-structures fail (i.e., none of the lists ℓ_1, . . . , ℓ_L contains p) is at most (1 − n^{−β})^L < 1/n^{O(1)}, as L = O(n^β log n). ■

Lemma 28.1.7. In expectation, the total number of points in ℓ_1, . . . , ℓ_L that are in distance ≥ (1 + ε)r from q is ≤ L.

Proof: Let P_≥ be the set of points of P that are in distance ≥ (1 + ε)r from q. For a point u ∈ P_≥, with ∆ = d_H(u, q), the probability that M ∈ D^T_φ misses all the ∆ coordinates where u and q differ is

(1 − φ)^{∆T} ≤ (1 − φ)^{(1+ε)rT} = (e^{−1/r})^{(1+ε)rT} = exp(−(1 + ε)β ln n) = 1/n,

as φ = 1 − e^{−1/r}, T = β ln n, and β = 1/(1 + ε). But then, for any i, we have

E[|ℓ_i ∩ P_≥|] = Σ_{u ∈ P_≥} P[M_i u = M_i q] ≤ |P_≥| · (1/n) ≤ 1.

As such, the total number of far points in the lists is at most L · 1 = L, implying the claim. ■

28.1.2.3. Running time


For each i, the query computes M_i q, and that takes O(dT) = O(d log n) time. Repeated L times, this takes O(Ld log n) time overall. Let X be the random variable that is the number of points in the extracted lists that are in distance ≥ (1 + ε)r from the query point. The time to scan the lists is O(d(X + 1)), since the algorithm stops as soon as it finds a near point. As such, by Lemma 28.1.7, the expected query time is O(Ld log n + Ld) = O(d n^{1/(1+ε)} log² n).

28.1.2.3.1. Improving the performance (a bit). Observe that for M ∈ D^T_φ and any two points p, u ∈ {0, 1}^d, all the algorithm cares about is whether M p = M u. As such, if a coordinate is probed many times by M, we might as well probe it only once. In particular, for a sequence M ∈ D^T_φ, let M′ = uniq(M) be the projection sequence resulting from removing repetitions in M. Significantly, M′ is of length ≤ d, and as such, computing M′ p, for a point p, takes only O(d) time. It is not hard to verify that one can also sample uniq(M), for M ∈ D^T_φ, directly in O(d) time. This improves the query and preprocessing time by a logarithmic factor.
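The direct sampling of uniq(M) is easy to sketch: over the T concatenated rounds, coordinate i survives into uniq(M) if and only if at least one round picks it, which happens independently with probability 1 − (1 − φ)^T.

```python
import random

def sample_uniq_projection(d, phi, T, rng=random):
    """Sample uniq(M), for M ~ D^T_phi, directly in O(d) time."""
    q = 1.0 - (1.0 - phi) ** T  # P[coordinate i is sampled at least once]
    return [i for i in range(d) if rng.random() < q]
```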

We thus get the following result.


Theorem 28.1.8. Given a set P of n points in {0, 1}^d, and parameters r and ε, one can preprocess P in

O(d n^{1+1/(1+ε)} log n)

time and space, such that given a query point q, the algorithm returns, in expected O(d n^{1/(1+ε)} log n) time, one of the following:
(A) a point p ∈ P such that d_H(q, p) ≤ (1 + ε)r, or
(B) the distance of q from P is larger than r.
The algorithm may return either result if the distance of q from P is in the range [r, (1 + ε)r]. The algorithm succeeds with high probability (per query).

One can also get a high-probability guarantee on the query time. For a parameter δ > 0, create O(log δ^{-1}) LSH data-structures as above. Perform the query as above, except that when the query time exceeds (say) twice the expected time, move on and redo the query in the next LSH data-structure. By Markov's inequality, the probability that the query fails (i.e., exceeds this time limit) on any single one of these LSH data-structures is ≤ 1/2. As such, overall, the query time becomes O(d n^{1/(1+ε)} log n log δ^{-1}), with probability ≥ 1 − δ.

28.2. Testing for good items
Imagine that we have n items. One of the items is good; the rest are bad. We have two tests to check if an item is good: a cheap test and a really expensive test. We would like to use the expensive test as few times as possible, and still classify all the items correctly. Let T(x) ∈ {good, bad} denote the result of the cheap test on item x. We have that

P[T(x) = good | x is good] ≥ α > β ≥ P[T(x) = good | x is bad].

Repeating the experiment k times, we create a k-test, where we declare an item good only if all k (independent) tests return "good". We then have

P[T_k(x) = good | x is good] ≥ α^k > β^k ≥ P[T_k(x) = good | x is bad].

We need to make sure we discover the good item, so let us repeat the k-test enough times, till we discover it with good probability. A natural choice is to repeat the k-test, for each item, M = (1/α^k) ln φ^{-1} times, so that the probability we fail to discover the good item is

(1 − α^k)^M < exp(−α^k M) = φ.

As for the bad items, how many "false positives" would we get? Every k-test is going to return in expectation at most

(n − 1)β^k ≤ nβ^k

false positives. As such, the total number of false positives over the M repeated k-tests is going to be

nβ^k M = O(n(β/α)^k log φ^{-1}).
If everything were free, we would set k to be quite large, so that the number of false positives is practically zero. For our purposes it is enough if every k-test returns (in expectation) at most one false positive. That is, we require that

β^k n ≤ 1.

This determines the values we need for k and M.
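A small sketch of this parameter setting: pick k so that β^k n ≤ 1 (one expected false positive per k-test), and then M = (1/α^k) ln φ^{-1} repetitions.

```python
import math

def ktest_parameters(n, alpha, beta, phi):
    """Choose k with beta**k * n <= 1, and the repetition count M so
    the good item is missed with probability at most phi (sketch)."""
    k = math.ceil(math.log(n) / math.log(1.0 / beta))  # beta^k <= 1/n
    M = math.ceil((1.0 / alpha ** k) * math.log(1.0 / phi))
    return k, M
```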

28.3. LSH for the hypercube: An elaborate construction


We next present a similar scheme in a more systematic fashion – this would provide some intuition for how we came up with the above construction.

28.3.0.1. On sense and sensitivity


Let P = {p1 , . . . , pn } be a subset of vertices of the hypercube in d dimensions. In the following we assume
that d = nO(1) . Let r, ε > 0 be two prespecified parameters. We are interested in building an approximate near
neighbor data-structure (i.e., D≈Near ) for balls of radius r in the Hamming distance.

Definition 28.3.1. A family F of functions (defined over H_d) is (r, R, α̂, β̂)-sensitive if for any q, v ∈ H_d, we have the following:
(A) If v ∈ b(q, r), then P[f(q) = f(v)] ≥ α̂.
(B) If v ∉ b(q, R), then P[f(q) = f(v)] ≤ β̂.
Here, f is a randomly picked function from F, r < R, and α̂ > β̂.

Intuitively, if we can construct an (r, R, α, β)-sensitive family, then we can distinguish between two points which are close together and two points which are far away from each other. Of course, the probabilities α and β might be very close to each other, and we need a way to amplify the gap between them.

28.3.1. A simple sensitive family
A priori it is not even clear that such a sensitive family exists, but it turns out that the family of functions that each expose a single random coordinate is sensitive.

Lemma 28.3.2. Let f_i(p) denote the function that returns the ith coordinate of p, for i = 1, . . . , d. Consider the family of functions F = {f_1, . . . , f_d}. Then, for any r > 0 and ε, the family F is (r, (1 + ε)r, α, β)-sensitive, where α = 1 − r/d and β = 1 − r(1 + ε)/d.

Proof: If u, v ∈ {0, 1}^d are within distance at most r from each other (under the Hamming distance), then they differ in at most r coordinates. The probability that a random f ∈ F projects onto a coordinate that u and v agree on is ≥ 1 − r/d.
Similarly, if d_H(u, v) ≥ (1 + ε)r, then the probability that a random f ∈ F projects onto a coordinate that u and v agree on is ≤ 1 − (1 + ε)r/d. ■

28.3.2. A family with a large sensitivity gap


Let k be a parameter to be specified shortly, and consider the family of functions G that concatenates k of the given functions. Formally, let

G = combine(F, k) = { g | g(p) = (f_1(p), . . . , f_k(p)), for f_1, . . . , f_k ∈ F }

be the set of all such functions.


Lemma 28.3.3. For an (r, R, α, β)-sensitive family F, the family G = combine(F, k) is (r, R, α^k, β^k)-sensitive.

Proof: For two fixed points u, v ∈ H_d such that d_H(u, v) ≤ r, we have that for a random f ∈ F, P[f(u) = f(v)] ≥ α. As such, for a random g ∈ G, we have that

P[g(u) = g(v)] = P[f_1(u) = f_1(v) and f_2(u) = f_2(v) and . . . and f_k(u) = f_k(v)]
             = ∏_{i=1}^{k} P[f_i(u) = f_i(v)] ≥ α^k.

Similarly, if d_H(u, v) > R, then P[g(u) = g(v)] = ∏_{i=1}^{k} P[f_i(u) = f_i(v)] ≤ β^k. ■

The above lemma implies that we can build a family that has a gap between the lower and upper sensitivities; namely, α^k/β^k = (α/β)^k can be made arbitrarily large. The problem is that if α^k is too small, then we will have to use too many functions to detect whether or not there is a point close to the query point.
Nevertheless, consider the task of building a data-structure that finds all the points of P = {p1 , . . . , pn } that
are equal, under a given function g ∈ G = combine(F , k), to a query point. To this end, we compute the strings
g(p1 ), . . . , g(pn ) and store them (together with their associated point) in a hash table (or a prefix tree). Now,
given a query point q, we compute g( q) and fetch from this data-structure all the strings equal to it that are
stored in it. Clearly, this is a simple and efficient data-structure. All the points colliding with q would be the
natural candidates to be the nearest neighbor to q.
By not storing the points explicitly, but using a pointer to the original input set, we get the following easy
result.
Lemma 28.3.4. Given a function g ∈ G = combine(F, k) (see Lemma 28.3.3) and a set P ⊆ H_d of n points, one can construct a data-structure, in O(nk) time and using O(nk) additional space, such that given a query point q, one can report all the points in X = {p ∈ P | g(p) = g(q)} in O(k + |X|) time.
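The data-structure of Lemma 28.3.4 is just a hash table keyed by g(p), storing pointers (indices) into P. A minimal sketch:

```python
def build_g_table(P, g):
    """Hash table of Lemma 28.3.4: key g(p), value indices into P."""
    table = {}
    for idx, p in enumerate(P):
        table.setdefault(g(p), []).append(idx)  # pointers, not copies
    return table

def collisions(table, P, g, q):
    """All points of P colliding with the query q under g."""
    return [P[i] for i in table.get(g(q), [])]
```

Here g would be the concatenation of k coordinate functions, e.g. `g = lambda p: (p[0], p[2])` for k = 2.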

28.3.3. Amplifying sensitivity
Our task is now to amplify the sensitive family we currently have. To this end, for two τ-dimensional points x and y, let x ≎ y be the Boolean function that returns true if there exists an index i such that x_i = y_i, and false otherwise. Now, the regular "=" operator requires the vectors to be equal in all coordinates (i.e., it is ∧_i (x_i = y_i)), while x ≎ y is ∨_i (x_i = y_i). The previous construction of Lemma 28.3.3 using this alternative equality operator provides us with the required amplification.

Lemma 28.3.5. Given an (r, R, α^k, β^k)-sensitive family G, the family H≎ = combine(G, τ), if one uses the ≎ operator to check for equality, is (r, R, 1 − (1 − α^k)^τ, 1 − (1 − β^k)^τ)-sensitive.
 
Proof: For two fixed points u, v ∈ H_d such that d_H(u, v) ≤ r, we have, for a random g ∈ G, that P[g(u) = g(v)] ≥ α^k. As such, for a random h ∈ H≎, we have that

P[h(u) ≎ h(v)] = P[g_1(u) = g_1(v) or g_2(u) = g_2(v) or . . . or g_τ(u) = g_τ(v)]
             = 1 − ∏_{i=1}^{τ} P[g_i(u) ≠ g_i(v)] ≥ 1 − (1 − α^k)^τ.

Similarly, if d_H(u, v) > R, then

P[h(u) ≎ h(v)] = 1 − ∏_{i=1}^{τ} P[g_i(u) ≠ g_i(v)] ≤ 1 − (1 − β^k)^τ. ■

To see the effect of Lemma 28.3.5, it is useful to play with a concrete example. Consider an (r, R, α^k, β^k)-sensitive family where β^k = α^k/2 and yet α^k is very small. Setting τ = 1/α^k, the resulting family is (roughly) (r, R, 1 − 1/e, 1 − 1/√e)-sensitive. Namely, the gap shrank, but the threshold sensitivity is considerably higher. In particular, it is now a constant, and the gap is also a constant.
Using Lemma 28.3.5 as a data-structure to store P is more involved than before. Indeed, for a random function h = (g_1, . . . , g_τ) ∈ H≎ = combine(G, τ), building the associated data-structure requires us to build τ data-structures, one for each of the functions g_1, . . . , g_τ, using Lemma 28.3.4. Now, given a query point, we retrieve all the points of P that collide with each one of these functions, by querying each of these data-structures.
Lemma 28.3.6. Given a function h ∈ H≎ = combine(G, τ) (see Lemma 28.3.5) and a set P ⊆ H_d of n points, one can construct a data-structure, in O(nkτ) time and using O(nkτ) additional space, such that given a query point q, one can report all the points in X = {p ∈ P | h(p) ≎ h(q)} in O(kτ + |X|) time.
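In code, the data-structure of Lemma 28.3.6 is τ copies of the previous hash table, and a query takes the union of the τ collision lists (the ≎ of the lemma); a sketch:

```python
def build_or_tables(P, gs):
    """One hash table per function g_1, ..., g_tau (Lemma 28.3.6)."""
    tables = []
    for g in gs:
        t = {}
        for idx, p in enumerate(P):
            t.setdefault(g(p), []).append(idx)
        tables.append(t)
    return tables

def or_collisions(tables, P, gs, q):
    """Points p colliding with q under SOME g_i (the ≎ operator)."""
    hits = set()
    for g, t in zip(gs, tables):
        hits.update(t.get(g(q), []))
    return [P[i] for i in sorted(hits)]
```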

28.3.4. The near neighbor data-structure and handling a query


We construct the data-structure D of Lemma 28.3.6, with parameters k and τ to be determined shortly, for a random function h ∈ H≎. Given a query point q, we retrieve all the points that collide with it under h, and scan these points one by one, computing their distance to q. As soon as the scan encounters a point v ∈ P such that d_H(q, v) ≤ R, the data-structure returns true together with v.

Assume we know that the expected number of points of P \ b(q, R) (i.e., R = (1 + ε)r) that collide with q in D is L (we will figure out the value of L below). To ensure a worst-case bound on the query time, the query aborts after checking 4L + 1 points and returns false. Naturally, the data-structure also returns false if all the points encountered have distance larger than R from q.

Clearly, the query time of this data-structure is O(kτ + dL).

We are left with the task of fine-tuning the parameters τ and k to get the fastest possible query time, while
the data-structure has reasonable probability to succeed. Figuring the right values is technically tedious, and
we do it next.

28.3.4.1. Setting the parameters



If there exists p ∈ Psuch that
τ dH q, p ≤ r, then the probability of this point to collide with q under the
function h is ϕ ≥ 1 − 1 − αk . Let us demand that this data-structure succeeds with probability ≥ 3/4. To this
end, we set l m τ 
τ = 4 1/αk =⇒ ϕ ≥ 1 − 1 − αk ≥ 1 − exp −αk τ ≥ 1 − exp(−4) ≥ 3/4, (28.1)
since 1 − x ≤ exp(−x), for x ≥ 0.

Lemma 28.3.7. The expected number of points of P \ b(q, R) colliding with the query point is L = O(n(β/α)^k), where R = (1 + ε)r.

Proof: Consider the points in P \ b(q, R). We would like to bound the number of points of this set that collide with the query point. Observe that the probability of a point p ∈ P \ b(q, R) to collide with the query point is at most

ψ = 1 − (1 − β^k)^τ = β^k (1 + (1 − β^k) + (1 − β^k)² + · · · + (1 − β^k)^{τ−1}) ≤ β^k τ ≤ 8(β/α)^k,

as τ = ⌈4/α^k⌉ and α, β ∈ (0, 1). Namely, the expected number of points of P \ b(q, R) colliding with the query point is ≤ ψn. ■

By Lemma 28.3.6, extracting the O(L) colliding points takes O(kτ + L) time. Computing the distance from the query point to each of these points takes O(Ld) time. As such, by Lemma 28.3.7, the query time is

O(kτ + Ld) = O(kτ + nd(β/α)^k).

To minimize this query time, we "approximately" solve the equation requiring the two terms in the above bound to be equal (we ignore d since, intuitively, it should be small compared to n). We get that

kτ = n(β/α)^k  ⇝  k/α^k ≈ n β^k/α^k  =⇒  k ≈ nβ^k  ⇝  1/β^k ≈ n  =⇒  k ≈ ln_{1/β} n.

Thus, setting k = ln_{1/β} n, we have that β^k = 1/n and, by Eq. (28.1), that

τ = ⌈4/α^k⌉ = O(exp((ln n / ln(1/β)) ln(1/α))) = O(n^ρ),  for  ρ = ln(1/α) / ln(1/β).   (28.2)

As such, to minimize the query time, we need to minimize ρ.

Lemma 28.3.8. (A) For x ∈ [0, 1) and t ≥ 1 such that 1 − tx > 0, we have ln(1 − x)/ln(1 − tx) ≤ 1/t.
(B) For α = 1 − r/d and β = 1 − r(1 + ε)/d, we have that ρ = ln(1/α)/ln(1/β) ≤ 1/(1 + ε).

Proof: (A) Since ln(1 − tx) < 0, the claim is equivalent to t ln(1 − x) ≥ ln(1 − tx). This in turn is equivalent to

g(x) ≡ (1 − tx) − (1 − x)^t ≤ 0.

This is trivially true for x = 0. Furthermore, taking the derivative, we see that g′(x) = −t + t(1 − x)^{t−1}, which is non-positive for x ∈ [0, 1) and t ≥ 1. Therefore, g is non-increasing in the interval of interest, and so g(x) ≤ 0 for all values in this interval.

(B) Indeed,

ρ = ln(1/α) / ln(1/β) = ln α / ln β = ln((d − r)/d) / ln((d − (1 + ε)r)/d) = ln(1 − r/d) / ln(1 − (1 + ε)r/d) ≤ 1/(1 + ε),

by part (A), applied with x = r/d and t = 1 + ε. ■

In the following, it would be convenient to consider d to be considerably larger than r. This can be ensured by (conceptually) padding the points with fake coordinates that are all zero. It is easy to verify that this "hack" does not affect the algorithm's performance in any way; it is just a trick to make our analysis simpler. In particular, we assume that d > 2(1 + ε)r.

Lemma 28.3.9. For α = 1 − r/d, β = 1 − r(1 + ε)/d, and n and d as above, we have that: I. τ = O(n^{1/(1+ε)}), II. k = O(ln n), and III. L = O(n^{1/(1+ε)}).
Proof: By Eq. (28.1), τ = ⌈4/α^k⌉ = O(n^ρ) = O(n^{1/(1+ε)}), by Lemma 28.3.8(B).

Now, β = 1 − r(1 + ε)/d ≤ 1/2, since we assumed that d > 2(1 + ε)r. As such, we have k = ln_{1/β} n = ln n / ln(1/β) = O(ln n).

By Lemma 28.3.7, L = O(n(β/α)^k). Now, β^k = 1/n, and as such L = O(1/α^k) = O(τ) = O(n^{1/(1+ε)}). ■
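The parameter choices of this section are easy to compute explicitly; the following sketch evaluates α, β, k, τ and ρ for given n, d, r, ε, so one can check the claims of Lemma 28.3.9 numerically.

```python
import math

def lsh_parameters(n, d, r, eps):
    """alpha, beta from Lemma 28.3.2; k = log_{1/beta} n (so that
    beta^k = 1/n); tau = ceil(4/alpha^k); rho = ln(1/alpha)/ln(1/beta)."""
    alpha = 1.0 - r / d
    beta = 1.0 - (1.0 + eps) * r / d
    k = math.log(n) / math.log(1.0 / beta)
    tau = math.ceil(4.0 / alpha ** k)
    rho = math.log(1.0 / alpha) / math.log(1.0 / beta)
    return alpha, beta, k, tau, rho
```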

28.3.5. The result


Theorem 28.3.10. Given a set P of n points on the hypercube Hd and parameters ε > 0 and r > 0, one
can build a data-structure D = D≈Near (P, r, (1 + ε)r) that solves the approximate near neighbor problem (see
Definition 28.1.2). The data-structure answers a query successfully with high probability. In addition we have
the following:

(A) The query time is O(d n^{1/(1+ε)} log n).
(B) The preprocessing time to build this data-structure is O(n^{1+1/(1+ε)} log² n).
(C) The space required to store this data-structure is O(nd + n^{1+1/(1+ε)} log² n).

Proof: Our building block is the data-structure described above. By Markov’s inequality, the probability that
the algorithm has to abort because of too many collisions with points of P \ b( q, (1 + ε)r) is bounded by 1/4
(since the algorithm tries 4L + 1 points). Also, if there is a point inside b( q, r), the algorithm would find it with
probability ≥ 3/4, by Eq. (28.1). As such, with probability at least 1/2 this data-structure returns the correct
answer in this case. By Lemma 28.3.6, the query time is O(kτ + Ld).
This data-structure succeeds only with constant probability. To achieve high probability, we construct
O(log n) such data-structures and perform the near neighbor query in each one of them. As such, the query
time is
    
O (kτ + Ld) log n = O n1/(1+ε) log2 n + dn1/(1+ε) log n = O dn1/(1+ε) log n ,

by Lemma 28.3.9 and since d = Ω lg n if P contains n distinct points of Hd .

As for the preprocessing time, by Lemma 28.3.6 and Lemma 28.3.9, we have that it is O(nkτ log n) = O(n^{1+1/(1+ε)} log² n).

Finally, this data-structure requires O(dn) space to store the input points. Additionally, by Lemma 28.3.6, we need O(nkτ log n) = O(n^{1+1/(1+ε)} log² n) space. ■

In the hypercube case, when d = n^{O(1)}, we can build M = O(log_{1+ε} d) = O(ε^{-1} log d) such data-structures, so that a (1 + ε)-ANN query can be answered by a binary search over the data-structures corresponding to the radii r_1, . . . , r_M, where r_i = (1 + ε)^i, for i = 1, . . . , M.
Theorem 28.3.11. Given a set P of n points on the hypercube H_d (where d = n^{O(1)}) and a parameter ε > 0, one can build a data-structure to answer approximate nearest neighbor queries (under the Hamming distance) using O(dn + n^{1+1/(1+ε)} ε^{-1} log² n log d) space, such that given a query point q, one can return a (1 + ε)-ANN in P (under the Hamming distance) in O(d n^{1/(1+ε)} log n log(ε^{-1} log d)) time. The result returned is correct with high probability.

Remark 28.3.12. The result of Theorem 28.3.11 needs to be oblivious to the queries used. Indeed, for any
instantiation of the data-structure of Theorem 28.3.11 there exist query points for which it would fail.
In particular, formally, if we perform a sequence of ANN queries using such a data-structure, where the
queries depend on earlier returned answers, then the guarantee of a high probability of success is no longer
implied by the above analysis (it might hold because of some other reasons, naturally).

28.4. LSH and ANN in Euclidean space


28.4.1. Preliminaries
Lemma 28.4.1. Let X = (X_1, . . . , X_d) be a vector of d independent variables with the normal distribution N, and let v = (v_1, . . . , v_d) ∈ R^d. We have that ⟨v, X⟩ = Σ_i v_i X_i is distributed as ∥v∥ Z, where Z ∼ N.

Proof: By Lemma 24.2.3, the point X has the multi-dimensional normal distribution N^d. If ∥v∥ = 1, then the claim holds by the symmetry of this distribution. Indeed, let e_1 = (1, 0, . . . , 0). By the symmetry of the d-dimensional normal distribution, we have that ⟨v, X⟩ ∼ ⟨e_1, X⟩ = X_1 ∼ N.

Otherwise, ⟨v/∥v∥, X⟩ ∼ N, and as such ⟨v, X⟩ ∼ N(0, ∥v∥²), which is indeed the distribution of ∥v∥ Z. ■

Definition 28.4.2. A distribution D over R is called p-stable if there exists p ≥ 0 such that, for any n real numbers v_1, . . . , v_n and n independent variables X_1, . . . , X_n with distribution D, the random variable Σ_i v_i X_i has the same distribution as the variable (Σ_i |v_i|^p)^{1/p} X, where X is a random variable with distribution D.

By Lemma 28.4.1, the normal distribution is a 2-stable distribution.

28.4.2. Locality sensitive hashing (LSH)


Let p and u be two points in R^d. We want to perform an experiment to decide if ∥p − u∥ ≤ 1 or ∥p − u∥ ≥ η, where η = 1 + ε. We will randomly choose a vector v from the d-dimensional normal distribution N^d (which is 2-stable). Next, let r be a parameter, and let t be a random number chosen uniformly from the interval [0, r]. For p ∈ R^d, consider the random hash function

h(p) = ⌊(⟨p, v⟩ + t)/r⌋.   (28.3)
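A direct sketch of the hash function of Eq. (28.3), using Python's Gaussian sampler for the 2-stable direction v:

```python
import math
import random

def make_euclidean_hash(d, r, rng=random.Random(42)):
    """One random hash of Eq. (28.3): project onto a random Gaussian
    direction v, shift by t ~ Uniform[0, r], and bucket by width r."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    t = rng.uniform(0.0, r)
    def h(p):
        return math.floor((sum(pi * vi for pi, vi in zip(p, v)) + t) / r)
    return h
```

Two points at distance well below r usually land in the same bucket, while points at distance well above r usually do not, which is exactly the sensitivity analyzed next.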

Assume that the distance between p and u is η, and that the distance between the projections of the two points onto the direction v is β. Then, the probability that p and u get the same hash value is max(1 − β/r, 0), since this is the probability that the random shift does not separate them. Indeed, consider the line spanned by v to be the x-axis, and assume p gets projected to x-coordinate r and u gets projected to r − β (assuming r ≥ β). Clearly, p and u get mapped to the same value by h(·) if and only if t falls in an interval of length r − β, as claimed. As such, the probability of collision is

α(η, r) = P[h(p) = h(u)] = ∫_{β=0}^{r} P[|⟨p, v⟩ − ⟨u, v⟩| = β] (1 − β/r) dβ.

However, since v is chosen from a 2-stable distribution, we have that

Z = ⟨p, v⟩ − ⟨u, v⟩ = ⟨p − u, v⟩ ∼ N(0, ∥p − u∥²).

Since we are considering the absolute value of this variable, we need to multiply the density by two. Thus, we have

α(η, r) = ∫_{β=0}^{r} (2/(√(2π) η)) exp(−β²/(2η²)) (1 − β/r) dβ,

by plugging in the density of the normal distribution for Z. Intuitively, we care about the difference α(1 + ε, r) −
α(1, r), and we would like to maximize it as much as possible (by choosing the right value of r). Unfortunately,
this integral is unfriendly, and we have to resort to numerical computation.
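The numerical computation alluded to is straightforward quadrature; the sketch below evaluates α(η, r) by the midpoint rule, together with the resulting exponent for a given bucket width r (any standard quadrature routine would do).

```python
import math

def alpha_collision(eta, r, steps=10000):
    """Midpoint-rule evaluation of
    alpha(eta, r) = int_0^r 2/(sqrt(2 pi) eta) e^{-b^2/(2 eta^2)} (1 - b/r) db."""
    total, db = 0.0, r / steps
    for i in range(steps):
        b = (i + 0.5) * db
        dens = 2.0 / (math.sqrt(2.0 * math.pi) * eta) \
               * math.exp(-b * b / (2.0 * eta * eta))
        total += dens * (1.0 - b / r) * db
    return total

def lsh_rho(eps, r):
    """The ratio log(1/alpha(1, r)) / log(1/alpha(1 + eps, r))."""
    return math.log(1.0 / alpha_collision(1.0, r)) / \
           math.log(1.0 / alpha_collision(1.0 + eps, r))
```

For instance, with r = 4 and ε = 1 this evaluates to a ratio below 1/2, consistent with Lemma 28.4.3.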
Now, we are going to use this hashing scheme for constructing locality sensitive hashing, as in the hyper-
cube case, and as such we care about the ratio

ρ(1 + ε) = min_r  log(1/α(1, r)) / log(1/α(1 + ε, r));

see Eq. (28.2). The following is verified using numerical calculations.

Lemma 28.4.3 ([DIIM04]). One can choose r such that ρ(1 + ε) ≤ 1/(1 + ε).

Lemma 28.4.3 implies that the hash functions defined by Eq. (28.3) are (1, 1 + ε, α′, β′)-sensitive and, furthermore, ρ = log(1/α′)/log(1/β′) ≤ 1/(1 + ε), for some values of α′ and β′. As such, we can use this hashing family to construct an approximate near neighbor data-structure D≈Near(P, r, (1 + ε)r) for a set P of points in R^d. Following the same argumentation as in Theorem 28.3.10, we have the following.

Theorem 28.4.4. Given a set P of n points in R^d and parameters ε > 0 and r > 0, one can build a D≈Near = D≈Near(P, r, (1 + ε)r), such that given a query point q, one can decide:
(A) If b(q, r) ∩ P ≠ ∅, then D≈Near returns a point u ∈ P such that ∥u − q∥ ≤ (1 + ε)r.
(B) If b(q, (1 + ε)r) ∩ P = ∅, then D≈Near returns the result that no point is within distance ≤ r from q.
In any other case, either of the answers is correct. The query time is O(d n^{1/(1+ε)} log n) and the space used is O(dn + n^{1+1/(1+ε)} log n). The result returned is correct with high probability.

28.4.3. ANN in high-dimensional euclidean space


Unlike the binary hypercube case, where we could just do direct binary search on the distances, here we need
to use the reduction from ANN to near neighbor queries.

28.4.3.1. The result
Plugging the above into known reduction from approximate nearest-neighbor to near-neighbor queries, yields
the following:

Corollary 28.4.5. Given a set P of n points in R^d, one can construct a data-structure D that answers (1 + ε)-ANN queries by performing O(log(n/ε)) (1 + ε)-approximate near neighbor queries. The total number of points stored at the approximate near neighbor data-structures of D is O(nε^{-1} log(n/ε)).

This in turn leads to the following:

Theorem 28.4.6. Given a set P of n points in R^d and parameters ε > 0 and r > 0, one can build an ANN data-structure using

O(dn + n^{1+1/(1+ε)} ε^{-2} log³(n/ε))

space, such that given a query point q, one can return a (1 + ε)-ANN in P in

O(d n^{1/(1+ε)} log n log(n/ε))

time. The result returned is correct with high probability. The construction time is O(d n^{1+1/(1+ε)} ε^{-2} log³(n/ε)).

28.5. Bibliographical notes


Section 28.1 follows the exposition of Indyk and Motwani [IM98]. Kushilevitz et al. [KOR00] offered an
alternative data-structure with somewhat inferior performance. It is quite surprising that one can perform
approximate nearest neighbor queries in high dimensions in time and space polynomial in the dimension (which
is sublinear in the number of points). One can reduce the approximate near neighbor in Euclidean space to the
same question on the hypercube “directly” (we show the details below). However, doing the LSH directly on
the Euclidean space is more efficient.
The value of the results shown in this chapter depends to a large extent on the reader's perspective. Indeed, for a small value of ε > 0, the query time O(d n^{1/(1+ε)}) is very close to linear in n and is almost equivalent to just scanning the points. Thus, from the low-dimensional perspective, where ε is assumed to be small, this result is only slightly sublinear. On the other hand, if one is willing to pick ε to be large (say 10), then the result is clearly better than the naive algorithm, yielding a running time for an ANN query of (roughly) O(n^{1/11}).
The idea of doing locality sensitive hashing directly on the Euclidean space, as done in Section 28.4,
is not shocking after seeing the Johnson-Lindenstrauss lemma. Our description follows the paper of Datar
et al. [DIIM04]. In particular, the current analysis which relies on computerized estimates is far from being
satisfactory. It would be nice to have a simpler and more elegant scheme for this case. This is an open problem
for further research.
Currently, the best LSH construction in R^d is due to Andoni and Indyk [AI06]. Its space usage is bounded by O(dn + n^{1+1/(1+ε)²+o(1)}), and its query time is bounded by O(d n^{1/(1+ε)²+o(1)}). This (almost) matches the lower bound of Motwani et al. [MNP06]. For a nice survey on LSH, see [AI08].
From approximate near neighbor in Rd to approximate near neighbor on the hypercube.
The reduction is quite involved, and we only sketch the details. Let P be a set of n points in Rd . We
first reduce the dimension to k = O(ε−2 log n) using the Johnson-Lindenstrauss lemma. Next, we embed this


space into ℓ_1^{k′} (this is the space R^{k′}, where distances are measured in the L_1 metric instead of the regular L_2 metric), where k′ = O(k/ε²). This can be done with distortion (1 + ε).

Let Q′ be the resulting set of points in R^{k′}. We want to solve approximate near neighbor queries on this set of points, for radius r. As a first step, we partition the space into cells by taking a grid with sidelength (say) k′r, randomly translated, and clip the points to the grid cells. It is now sufficient to solve the approximate near neighbor problem inside each grid cell (which has bounded diameter as a function of r), since only with small probability would the query point and its near neighbor be separated by the grid. We amplify the success probability by repeating this a polylogarithmic number of times.

Thus, we can assume that P is contained inside a cube of sidelength ≤ k′nr, that it lies in R^{k′}, and that the distance metric is the L_1 metric. We next snap the points of P to a grid of sidelength (say) εr/k′. Thus, every point of P now has integer coordinates, bounded by a polynomial in log n and 1/ε. Next, we write the coordinates of the points of P using unary notation. (Thus, a point (2, 5) would be written as (00011, 11111), assuming the number of bits for each coordinate is 5.) It is now easy to verify that the Hamming distance on the resulting strings is equal to the L_1 distance between the points.
Thus, we can solve the near neighbor problem for points in Rd by solving it on the hypercube under the
Hamming distance.
See Indyk and Motwani [IM98] for more details.

We have only scratched the surface of proximity problems in high dimensions. The interested reader is
referred to the survey by Indyk [Aga04] for more information.

References
[Aga04] P. K. Agarwal. Range searching. Handbook of Discrete and Computational Geometry. Ed. by
J. E. Goodman and J. O’Rourke. 2nd. Boca Raton, FL, USA: CRC Press LLC, 2004. Chap. 36,
pp. 809–838.
[AI06] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in
high dimensions. Proc. 47th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 459–468, 2006.
[AI08] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in
high dimensions. Commun. ACM, 51(1): 117–122, 2008.
[DIIM04] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based
on p-stable distributions. Proc. 20th Annu. Sympos. Comput. Geom. (SoCG), 253–262, 2004.
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of di-
mensionality. Proc. 30th Annu. ACM Sympos. Theory Comput. (STOC), 604–613, 1998.
[KOR00] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor
in high dimensional spaces. SIAM J. Comput., 30(2): 457–474, 2000.
[MNP06] R. Motwani, A. Naor, and R. Panigrahy. Lower bounds on locality sensitive hashing. Proc. 22nd
Annu. Sympos. Comput. Geom. (SoCG), 154–157, 2006.

Chapter 29

Random Walks I
“A drunk man will find his way home; a drunk bird may wander forever.”
    Anonymous
29.1. Definitions
Let G = G(V, E) be an undirected connected graph. For v ∈ V, let Γ(v) denote the set of neighbors of v in G;
that is, Γ(v) = { u | vu ∈ E(G) }. A random walk on G is the following process: Starting from a vertex v0 , we
randomly choose one of the neighbors of v0 , and set it to be v1 . We continue in this fashion, in the ith step
choosing vi , such that vi ∈ Γ(vi−1 ). It would be interesting to investigate the random walk process. Questions of
interest include:
(A) How long does it take to arrive from a vertex v to a vertex u in G?
(B) How long does it take to visit all the vertices in the graph?
(C) If we start from an arbitrary vertex v0 , how long does the random walk have to be so that the location of
the random walk in the ith step is uniformly (or near uniformly) distributed on V(G)?
Example 29.1.1. In the complete graph Kn , visiting all the vertices takes in expectation O(n log n) time, as this
is the coupon collector problem with n − 1 coupons. Indeed, the probability we did not visit a specific vertex v
by the ith step of the random walk is ≤ (1 − 1/n)^{i−1} ≤ e^{−(i−1)/n} ≤ 1/n^{10} , for i = Ω(n log n). As such, with high
probability, the random walk visited all the vertices of Kn . Similarly, arriving from u to v takes in expectation
n − 1 steps of a random walk, as the probability of visiting v at every step of the walk is p = 1/(n − 1), and the
length of the walk till we visit v is a geometric random variable with expectation 1/p.
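The coupon-collector bound above is easy to check empirically. Below is a quick simulation sketch (our illustration, not from the notes; the function name and the choice of n are arbitrary) that measures the average number of steps a random walk on K_n needs before it has visited every vertex; the answer should be close to (n − 1) H_{n−1} ≈ n ln n.

```python
import random

def cover_time_Kn(n, rng):
    """Steps a random walk on the complete graph K_n needs to see all vertices."""
    current, seen, steps = 0, {0}, 0
    while len(seen) < n:
        nxt = rng.randrange(n - 1)      # uniform over the other n-1 vertices
        if nxt >= current:
            nxt += 1
        current = nxt
        seen.add(current)
        steps += 1
    return steps

rng = random.Random(42)
n, trials = 100, 200
avg = sum(cover_time_Kn(n, rng) for _ in range(trials)) / trials
# Coupon collecting predicts about (n-1) H_{n-1}, roughly 512 for n = 100.
```

(At every step the chance of hitting a fresh vertex depends only on how many are still unseen, which is exactly the coupon-collector setup with n − 1 coupons.)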

29.1.1. Walking on grids and lines



Lemma 29.1.2 (Stirling’s formula). For any integer n ≥ 1, it holds n! ≈ √(2πn) (n/e)^n .

29.1.1.1. Walking on the line


Lemma 29.1.3. Consider the infinite random walk on the integer line, starting from 0. Here, the vertices are
the integer numbers, and from a vertex k, one walks with probability 1/2 either to k − 1 or k + 1. The expected
number of times that such a walk visits 0 is unbounded.
 
Proof: The probability that in the 2ith step we visit 0 is C(2i, i)/2^{2i} , where C(2i, i) denotes the binomial
coefficient. As such, the expected number of times we visit the origin is

    ∑_{i=1}^∞ C(2i, i)/2^{2i} ≥ ∑_{i=1}^∞ 1/(2√i) = ∞,

Figure 29.1: A walk on the integer grid, when rotated by 45 degrees, results in two independent one-dimensional
walks.

since 2^{2i}/(2√i) ≤ C(2i, i) ≤ 2^{2i}/√(2i) [MN98, p. 84]. This can also be verified using the Stirling formula,
and the resulting series diverges. ■
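The divergence of this sum is easy to see numerically. The sketch below (our illustration, not part of the notes) accumulates the exact per-step return probabilities C(2i, i)/4^i using the ratio recurrence C(2i, i)/4^i = ((2i − 1)/2i) · C(2i − 2, i − 1)/4^{i−1} ; the partial sums grow like √T, that is, without bound.

```python
def expected_visits(T):
    """Sum of C(2i,i)/4^i for i = 1..T: expected number of visits to 0
    during the first 2T steps of the walk, via the ratio recurrence."""
    term = 1.0            # C(0,0)/4^0
    total = 0.0
    for i in range(1, T + 1):
        term *= (2 * i - 1) / (2 * i)   # now term == C(2i,i)/4^i
        total += term
    return total

v_100 = expected_visits(100)
v_10000 = expected_visits(10000)
# Each term is roughly 1/sqrt(pi*i), so the sum grows like sqrt(T): unbounded.
```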

29.1.1.2. Walking on two dimensional grid


A random walk on the integer grid Zd starts from a point of this integer grid, and at each step, if it is at
point (i1 , i2 , . . . , id ), it chooses a coordinate and either increases it by one, or decreases it by one, with equal
probability.

Lemma 29.1.4. Consider the infinite random walk on the two dimensional integer grid Z2 , starting from (0, 0).
The expected number of times that such a walk visits the origin is unbounded.

Proof: Rotate the grid by 45 degrees, and consider the two new axes X′ and Y′ , see Figure 29.1. Let xi be
the projection of the location of the ith step of the random walk on the X′ -axis, and define yi in a similar
fashion. Clearly, the xi are of the form j/√2, where j is an integer. Scaling by a factor of √2, consider the
resulting random walks x′i = √2 xi and y′i = √2 yi . Clearly, x′i and y′i are random walks on the integer line,
and furthermore, they are independent. As such, the probability that we visit the origin at the 2ith step is

    P[ x′2i = 0 ∩ y′2i = 0 ] = P[ x′2i = 0 ]^2 = ( C(2i, i)/2^{2i} )^2 ≥ 1/(4i).

We conclude, that the infinite random walk on the grid Z^2 visits the origin in expectation

    ∑_{i=1}^∞ P[ x′2i = 0 ∩ y′2i = 0 ] ≥ ∑_{i=1}^∞ 1/(4i) = ∞,

as this series diverges. ■

29.1.1.3. Walking on three dimensional grid


In the following, let C(i; a, b, c) = i!/(a! b! c!) denote the multinomial coefficient. The multinomial theorem
states that

    (x1 + x2 + · · · + xm )^n = ∑_{k1+k2+···+km = n, k1,k2,...,km ≥ 0} C(n; k1 , k2 , . . . , km ) ∏_{t=1}^m xt^{kt} .

In particular, we have

    1 = (1/3 + 1/3 + 1/3)^n = ∑_{a+b+c=n, a,b,c≥0} C(n; a, b, c) (1/3^n ).      (29.1)

Lemma 29.1.5. Consider the infinite random walk on the three dimensional integer grid Z3 , starting from
(0, 0, 0). The expected number of times that such a walk visits the origin is bounded.

Proof: The probability of a neighbor of a point (x, y, z) to be the next point in the walk is 1/6. Assume that
we performed a walk for 2i steps, and decided to perform 2a steps parallel to the x-axis, 2b steps parallel to
the y-axis, and 2c steps parallel to the z-axis, where a + b + c = i. Furthermore, the walk on each dimension is
balanced; that is, we perform a steps to the left on the x-axis, and a steps to the right on the x-axis (and similarly
for the other two axes). Clearly, these are exactly the walks of 2i steps that arrive back at the origin.
Next, the number of different ways we can perform such a walk is (2i)!/(a! a! b! b! c! c!), and the probability to
perform such a walk, summing over all possible values of a, b and c, is

    αi = ∑_{a+b+c=i, a,b,c≥0} ( (2i)!/(a! a! b! b! c! c!) ) (1/6)^{2i}
       = ( C(2i, i)/2^{2i} ) ∑_{a+b+c=i, a,b,c≥0} ( i!/(a! b! c!) )^2 (1/3)^{2i}
       = ( C(2i, i)/2^{2i} ) ∑_{a+b+c=i, a,b,c≥0} ( C(i; a, b, c) (1/3)^i )^2 .
   
Consider the case where i = 3m. We have that C(i; a, b, c) ≤ C(i; m, m, m). As such, we have

    αi ≤ ( C(2i, i)/2^{2i} ) (1/3)^i C(i; m, m, m) ∑_{a+b+c=i, a,b,c≥0} C(i; a, b, c) (1/3)^i ,

where the last sum is equal to 1 by Eq. (29.1).

By the Stirling formula, we have

    C(i; m, m, m) ≈ √(2πi) (i/e)^i / ( √(2πi/3) (i/(3e))^{i/3} )^3 = c 3^i /i ,

for some constant c. As such, αi = O( (1/√i) (1/3)^i (3^i /i) ) = O( 1/i^{3/2} ). Thus,

    ∑_{m=1}^∞ α3m = ∑_{m=1}^∞ O( 1/m^{3/2} ) = O(1).

Finally, observe that α3m ≥ (1/6)^2 α3m−1 and α3m ≥ (1/6)^4 α3m−2 , as a walk returning to the origin can always
be extended by one or two back-and-forth moves. Thus,

    ∑_{m=1}^∞ αm = O(1). ■
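One can contrast the two and three dimensional cases numerically. The sketch below (our illustration; it assumes nothing beyond the formulas derived in the two proofs above) sums the exact return probabilities: in Z^2 the partial sums keep growing, while in Z^3 they flatten out quickly.

```python
from math import comb

def p_return_2d(i):
    """P[the 2D walk is back at the origin after 2i steps] = (C(2i,i)/4^i)^2."""
    return (comb(2 * i, i) / 4**i) ** 2

def p_return_3d(i):
    """P[the 3D walk is back at the origin after 2i steps]: the alpha_i of
    the proof above, computed via the multinomial sum."""
    fact = [1] * (i + 1)
    for k in range(1, i + 1):
        fact[k] = fact[k - 1] * k
    third = 3.0 ** i
    total = 0.0
    for a in range(i + 1):
        for b in range(i - a + 1):
            c = i - a - b
            total += (fact[i] / (fact[a] * fact[b] * fact[c]) / third) ** 2
    return comb(2 * i, i) / 4**i * total

sum2d_100 = sum(p_return_2d(i) for i in range(1, 101))
sum2d_400 = sum(p_return_2d(i) for i in range(1, 401))
sum3d_30 = sum(p_return_3d(i) for i in range(1, 31))
sum3d_120 = sum(p_return_3d(i) for i in range(1, 121))
# The 2D partial sums keep growing (like ln T); the 3D ones barely move.
```

As a sanity check, p_return_3d(1) equals 1/6: a two-step 3D walk returns to the origin exactly when the second step undoes the first.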

29.2. Bibliographical notes


The presentation here follows [Nor98].

References
[MN98] J. Matoušek and J. Nešetřil. Invitation to Discrete Mathematics. Oxford Univ Press, 1998.
[Nor98] J. R. Norris. Markov Chains. Statistical and Probabilistic Mathematics. Cambridge Press, 1998.

Chapter 30

Random Walks II
“Then you must begin a reading program immediately so that you can understand the crises of our age," Ignatius said solemnly.
"Begin with the late Romans, including Boethius, of course. Then you should dip rather extensively into early Medieval. You
may skip the Renaissance and the Enlightenment. That is mostly dangerous propaganda. Now that I think about it, you had
better skip the Romantics and the Victorians, too. For the contemporary period, you should study some selected comic books.”
“You’re fantastic.”
“I recommend Batman especially, for he tends to transcend the abysmal society in which he’s found himself. His morality is
rather rigid, also. I rather respect Batman.”

John Kennedy Toole, A Confederacy of Dunces

30.1. Catalan numbers


For a sequence σ of symbols, let #(σ, X) be the number of times the symbol X appears in σ.

Definition 30.1.1. A sequence/word σ of 2n elements/characters, made out of two symbols X and Y, is
balanced, if
(I) X appears n times (i.e., #(σ, X) = n),
(II) Y appears n times (i.e., #(σ, Y) = n),
(III) In any prefix of the string, the number of Xs is at least as large as the number of Ys.
Such a string is known as a Dyck word. If X and Y are the open and close parenthesis characters, respectively,
then the word is a balanced/valid parenthesis pattern.

Definition 30.1.2. The Catalan number, denoted by Cn , is the number of balanced strings of length 2n.

There are many other equivalent definitions of Catalan number.

Definition 30.1.3. A sequence σ made out of two symbols X and Y is dominating, if for any non-empty prefix
of σ, the number of Xs is strictly larger than the number of Ys.

Lemma 30.1.4. Let σ be a cyclic sequence made out of symbols X and Y, where n = #(σ, X) and m = #(σ, Y),
with n > m. Then there are exactly n − m locations where cutting the cyclic sequence at these locations, results
in a dominating sequence.

Proof: Consider a location in σ that contains X, and the next location contains Y. Clearly, such a location
can not be a start for a dominating sequence. Of course, the next location can also not be a start position for a
dominating sequence. As such, these two locations must be interior to a dominating sequence, and deleting both
of these symbols from σ, results in a new cyclic sequence, where every dominating start location corresponds
to a dominating start location in the original sequence. Observe, that as long as the number of Xs is larger
than the number of Ys, there must be such a location with XY as the prefix. Repeatedly deleting XY substring,
results in a string of length n − m, where every location is a good start of a dominating sequence. We conclude
that there are exactly n − m such locations. ■
Observation 30.1.5. The number of distinct cyclic sequences of length m + n, with m appearances of X, and
n appearances of Y, is (n + m − 1)!/(m! n!) = (1/(n + m)) C(n + m, n), since there are (n + m − 1)! different
cyclic ways to arrange n + m distinct values.
 
Theorem 30.1.6. For n ≥ 1, we have that the Catalan number Cn = (1/(n + 1)) C(2n, n).
Proof: Consider a dominating sequence σ of length 2n + 1 with #(σ, X) = n + 1, and #(σ, Y) = n. Such a
sequence must start with an X, and if we remove the leading X, then what remains is a balanced sequence.
Such a sequence σ can be interpreted as a cyclic sequence. By the above lemma, there is a unique shift that is
dominating. As such, the number of such cyclic sequence is the Catalan number Cn . By the above observation,
the number of such cyclic sequences is
    (n + m − 1)!/(m! n!) = (n + n + 1 − 1)!/((n + 1)! n!) = (1/(n + 1)) · (2n)!/(n! n!) = (1/(n + 1)) C(2n, n). ■
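The formula of Theorem 30.1.6 is easy to sanity-check by brute force. A small sketch (ours, not from the notes): it enumerates all words over {X, Y} of length 2n and counts the balanced ones.

```python
from math import comb
from itertools import product

def is_balanced(word):
    """Dyck-word test: in every prefix #X >= #Y, and the totals are equal."""
    depth = 0
    for ch in word:
        depth += 1 if ch == "X" else -1
        if depth < 0:
            return False
    return depth == 0

def catalan_by_counting(n):
    return sum(1 for w in product("XY", repeat=2 * n) if is_balanced(w))

def catalan_formula(n):
    return comb(2 * n, n) // (n + 1)
```

For example, catalan_formula(4) gives 14, matching the count over all 2^8 words.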

30.2. Walking on the integer line revisited


30.2.1. Estimating the middle binomial coefficient
Lemma 30.2.1. For i ≥ 11^2 , we have (1/4) · 2^{2i}/√i ≤ C(2i, i) ≤ (2/3) · 2^{2i}/√i , and
C(2i, i + √i) ≥ (1/12) · 2^{2i}/√i .
Proof: Observe that C(2i, i) ≥ C(2i, j), for any j. We assume that i ≥ 11^2 , and that √i is an integer. Observe
that

    C(2i, i + τ) = (2i)!/((i + τ)! (i − τ)!) = ( (2i)!/(i! i!) ) · ( (i − τ + 1) · · · (i − 1) i )/( (i + 1) · · · (i + τ) )
                = C(2i, i) ∏_{k=1}^τ (i − τ + k)/(i + k).

Now, by Lemma 10.1.3, we have

    α = ∏_{k=1}^τ (i − τ + k)/(i + k) = ∏_{k=1}^τ ( 1 − τ/(i + k) ) ≥ ( 1 − τ/i )^τ ≥ ( 1 − (τ^2/i^2) τ ) exp( −τ^2/i ) ≥ 1/3,

for τ ≤ √i, and i ≥ 11^2 . Namely, for any k, such that −√i ≤ k ≤ √i, we have C(2i, i + k) ≥ C(2i, i)/3. We thus
have that

    1 ≥ (1/2^{2i}) ∑_{k=−√i+1}^{√i} C(2i, i + k) ≥ ( 2√i/(3 · 2^{2i}) ) C(2i, i)   ⟹   C(2i, i) ≤ (2/3) · 2^{2i}/√i .

Let ∆ = √i − 1 and X ∼ bin(2i, 1/2). We have that E[X] = i, and V[X] = 2i(1/2)(1/2) = i/2. Let
β = (1/2^{2i}) ∑_{k=−∆}^{∆} C(2i, i + k). By Chebychev, we have that 1 − β = P[ |X − i| ≥ √i ] ≤ (i/2)/i = 1/2,
which implies β ≥ 1/2. We have

    1/2 ≤ β ≤ (1/2^{2i}) ∑_{k=−∆}^{∆} C(2i, i + k) ≤ ( (2∆ + 1)/2^{2i} ) C(2i, i)   ⟹   C(2i, i) ≥ 2^{2i}/(2(2∆ + 1)) ≥ 2^{2i}/(4√i). ■

Lemma 30.2.2. In a random walk on the line starting at zero, in expectation, after 48n2 steps, the walk had
visited either −n or +n.

Proof: By Lemma 30.2.1, the probability that after 2i steps, for i = 16n^2 , the walk is in the range {−n +
1, . . . , n − 1} is at most

    2n · (1/2^{2i}) · (2/3) · (2^{2i}/√i) = 2n · (2/3) · (1/(4n)) = 1/3,

since √i = 4n. Namely, the walk arrived to either −n or +n during the first 32n^2 steps with probability
≥ 2/3. If this did not happen, we continue the walk. The same argumentation implies that in every epoch of
32n^2 steps, the walk terminates with probability at least 2/3. As such, in expectation, after 3/2 such epochs
the walk would terminate, and (3/2) · 32n^2 = 48n^2 . ■
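A quick simulation (ours; the choices of n, seed, and trial count are arbitrary) suggests the bound of Lemma 30.2.2 is quite loose: the empirical average is close to n^2, well below 48n^2.

```python
import random

def steps_to_escape(n, rng):
    """Random-walk steps from 0 until the walk first hits -n or +n."""
    pos, steps = 0, 0
    while abs(pos) < n:
        pos += rng.choice((-1, 1))
        steps += 1
    return steps

rng = random.Random(7)
n, trials = 20, 300
avg = sum(steps_to_escape(n, rng) for _ in range(trials)) / trials
# The exact expectation is n^2 = 400 (optional stopping on X_t^2 - t),
# comfortably inside the 48 n^2 bound of the lemma.
```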

30.3. Solving 2SAT using random walk


Let G = G(V, E) be an undirected connected graph. For v ∈ V, let Γ(v) denote the neighbors of v in G. A random
walk on G is the following process: Starting from a vertex v0 , we randomly choose one of the neighbors of v0 ,
and set it to be v1 . We continue in this fashion, such that vi ∈ Γ(vi−1 ). It would be interesting to investigate the
process of the random walk. For example, questions like: (i) how long does it take to arrive from a vertex v to
a vertex u in G? and (ii) how long does it take to visit all the vertices in the graph.

30.3.1. Solving 2SAT


Consider a 2SAT formula F with m clauses defined over n variables. Start from an arbitrary assignment to the
variables, and consider a non-satisfied clause in F. Randomly pick one of the clause's two variables, and change its
value. Repeat this until you arrive at a satisfying assignment.
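The algorithm above fits in a few lines. The following is our illustrative sketch (the function names, the step cap, and the example formula are ours, not from the notes):

```python
import random

def random_walk_2sat(clauses, n, rng, max_steps=None):
    """Random-walk 2SAT solver sketched above.  A clause is a pair of
    literals; literal +i / -i stands for x_i / not-x_i (1-based).
    Returns a satisfying assignment (dict) or None if max_steps runs out."""
    if max_steps is None:
        max_steps = 100 * n * n
    assign = {i: False for i in range(1, n + 1)}

    def sat(lit):
        return assign[abs(lit)] == (lit > 0)

    for _ in range(max_steps):
        unsat = [c for c in clauses if not (sat(c[0]) or sat(c[1]))]
        if not unsat:
            return assign
        # Flip one variable of some unsatisfied clause; with probability
        # >= 1/2 the flip agrees with a fixed satisfying assignment.
        lit = rng.choice(rng.choice(unsat))
        assign[abs(lit)] = not assign[abs(lit)]
    return None

rng = random.Random(0)
# (x1 or x2) and (not x1 or x3) and (not x2 or not x3) and (x1 or not x3)
clauses = [(1, 2), (-1, 3), (-2, -3), (1, -3)]
result = random_walk_2sat(clauses, 3, rng)
```

The step cap is only a safety net; by the analysis below, O(n^2) steps suffice in expectation on satisfiable formulas.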
Consider the random variable Xi , which is the number of variables assigned the correct value (according to
some fixed satisfying assignment) after the ith step. Clearly, with probability at least half, Xi = Xi−1 + 1.
Thus, we can think about this algorithm as performing a random walk on the numbers 0, 1, . . . , n, where at
each step, we go to the right with probability at least half. The question is, how long does it take to arrive at n
in such a setting.
Theorem 30.3.1. If F is satisfiable, then the expected number of steps until the above algorithm arrives at a satisfying assignment is O(n^2 ).

Proof: For simplicity of exposition assume that n is divisible by 4. Consider the random walk on the integer
line, starting from zero, where we go to the left with probability 1/2, and to the right with probability 1/2. Let Yi be
the location of the walk at the ith step. Clearly, E[Yi ] ≤ E[Xi ]. By defining the random walk on the integer line
more carefully (coupling it with the algorithm), one can ensure that Yi ≤ Xi . Thus, the expected number of steps till Yi is equal to n is an upper
bound on the required quantity.
For any i, Y2i is an even number. Thus, consider the event that Y2i = 2∆ ≥ n, and let Y2i = R2i − L2i , where R2i is
the number of steps to the right, and L2i is the number of steps to the left. Observe that

    R2i − L2i = 2∆ and R2i + L2i = 2i   ⟹   R2i = i + ∆ and L2i = i − ∆.

Thus, for i ≥ n/2, we have that the probability that in the 2ith step we have Y2i ≥ n is

    ρ = ∑_{∆=n/2}^{i} (1/2^{2i}) C(2i, i + ∆).

Lemma 30.3.2 below tells us that ρ > 1/3 is implied if n/2 ≤ √i/6; that is, it holds for i = 9n^2 .
Next, if X2i fails to arrive to n in the first µ = 2i = 18n^2 steps, we will reset Yµ = Xµ and continue the random walk,
repeating this process as many phases as necessary. The probability that the number of phases exceeds i is
≤ (2/3)^i . As such, the expected number of steps in the walk is at most

    ∑_i c′ n^2 i (1 − ρ)^i = O(n^2 ),

as claimed. ■
Lemma 30.3.2. We have ∑_{k=i+√i/6}^{2i} (1/2^{2i}) C(2i, k) ≥ 1/3.

Proof: It is known¬ that C(2i, i) ≤ 2^{2i}/√i (better constants are known). As such, since C(2i, i) ≥ C(2i, m),
for all m, we have by symmetry that

    ∑_{k=i+√i/6}^{2i} (1/2^{2i}) C(2i, k) ≥ ∑_{k=i}^{2i} (1/2^{2i}) C(2i, k) − (√i/6) · (1/2^{2i}) C(2i, i) ≥ 1/2 − (√i/6) · (1/√i) = 1/3. ■

30.4. Markov chains


Let S denote a state space, which is either finite or countable. A Markov chain is at one state at any given
time. There is a transition probability Pi j , which is the probability to move to the state j, if the Markov chain is
currently at state i. As such, ∑_j Pi j = 1 and, for all i, j, we have 0 ≤ Pi j ≤ 1. The matrix P = { Pi j }_{i j} is the
transition probabilities matrix.
 
(In the matrix P, the entry Pi j appears in the ith row and the jth column.)
The Markov chain starts at an initial state X0 , and at each point in time moves according to the transition
probabilities. This forms a sequence of states {Xt }. We have a distribution over those sequences. Such a sequence
would be referred to as a history.
Similar to Martingales, the behavior of a Markov chain in the future, depends only on its location Xt at time
t, and does not depend on the earlier stages that the Markov chain went through. This is the memorylessness
property of the Markov chain, and it follows as Pi j is independent of time. Formally, the memorylessness
property is
   
    P[ Xt+1 = j | X0 = i0 , X1 = i1 , . . . , Xt−1 = it−1 , Xt = i ] = P[ Xt+1 = j | Xt = i ] = Pi j .

The initial state of the Markov chain might also be chosen randomly in some cases.
¬
Probably because you got it as a homework problem, if not wikipedia knows, and if you are bored you can try and prove it
yourself.

For states i, j ∈ S, the t-step transition probability is P^{(t)}_{ij} = P[ Xt = j | X0 = i ]. The probability that we visit
j for the first time, starting from i, after t steps, is denoted by

    r^{(t)}_{ij} = P[ Xt = j and X1 ≠ j, X2 ≠ j, . . . , Xt−1 ≠ j | X0 = i ].

Let fij = ∑_{t>0} r^{(t)}_{ij} denote the probability that the Markov chain visits state j, at any point in time, starting
from state i. The expected number of steps to arrive to state j starting from i is

    hij = ∑_{t>0} t · r^{(t)}_{ij} .

Of course, if fi j < 1, then there is a positive probability that the Markov chain never arrives to j, and as such
hi j = ∞ in this case.
Definition 30.4.1. A state i ∈ S for which fii < 1 (i.e., the chain has positive probability of never visiting i
again), is a transient state. If fii = 1 then the state is persistent.
A state i that is persistent but hii = ∞ is null persistent. A state i that is persistent and hii ≠ ∞ is non-null
persistent.

Example 30.4.2. Consider the state 0 in the random walk on the integers. We already know that in expectation
the random walk visits the origin an infinite number of times, so this hints that this is a persistent state. Let us
figure out the probability r^{(2n)}_{00} . To this end, consider a walk X0 , X1 , . . . , X2n that starts at 0 and returns
to 0 only in the 2nth step. Let Si = Xi − Xi−1 , for all i. Clearly, we have Si ∈ {−1, +1} (i.e., move left or move
right). Assume the walk starts by S1 = +1 (the case −1 is handled similarly). Clearly, the walk S2 , . . . , S2n−1
must be prefix balanced; that is, the number of +1s is always bigger than (or equal to) the number of −1s in any
prefix of this sequence. Strings with this property are known as Dyck words, and the number of such words of
length 2m is the Catalan number Cm = (1/(m + 1)) C(2m, m). As such, the probability of the random walk to
visit 0 for the first time (starting

from 0) after 2n steps, is

    r^{(2n)}_{00} = 2 · (1/2^{2n}) · (1/n) C(2n − 2, n − 1) = Θ( (1/n) · (1/√n) ) = Θ( 1/n^{3/2} )

(the 2 here is because the other option is that the sequence starts with −1), using that C(2n, n) = Θ( 2^{2n}/√n ).
Observe that f00 = ∑_{n=1}^∞ r^{(2n)}_{00} = O(1). However, one can be more precise – that is, f00 = 1 (this requires a
trick)! On the other hand, we have that

    h00 = ∑_{t>0} t · r^{(t)}_{00} ≥ ∑_{n=1}^∞ 2n r^{(2n)}_{00} = ∑_{n=1}^∞ Θ( 1/√n ) = ∞.

Namely, 0 (and indeed all integers) are null persistent.
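The computation above can be checked numerically. The sketch below (ours, not from the notes) uses r^{(2n)}_{00} = 2 C_{n−1}/2^{2n}: the partial sums of the first-return probabilities tend to 1, while the partial sums defining h_{00} keep growing with the cutoff.

```python
from math import comb

def first_return_prob(n):
    """r_00^{(2n)} = 2 * C_{n-1} / 2^{2n}, from the Dyck-word argument above."""
    catalan = comb(2 * n - 2, n - 1) // n      # the Catalan number C_{n-1}
    return 2 * catalan / 4**n

f_2000 = sum(first_return_prob(n) for n in range(1, 2001))
h_200 = sum(2 * n * first_return_prob(n) for n in range(1, 201))
h_2000 = sum(2 * n * first_return_prob(n) for n in range(1, 2001))
# f creeps up toward 1 (persistence), while the h partial sums grow roughly
# like the square root of the cutoff and never settle (null persistence).
```

As a sanity check, first_return_prob(1) is 1/2: the walk returns at step 2 exactly when the second step undoes the first.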

In finite Markov chains there are no null persistent states (this requires a proof, which is left as an exercise).
There is a natural directed graph associated with a Markov chain. The states are the vertices, and the transition
probability Pi j is the weight assigned to the edge (i → j). Note that we include only edges with Pi j > 0.
Definition 30.4.3. A strong component (or a strong connected component) of a directed graph G is a maximal
subgraph C of G such that for any pair of vertices i and j in the vertex set of C, there is a directed path from i
to j, as well as a directed path from j to i.

Definition 30.4.4. A strong component C is a final strong component if there is no edge going from a vertex
in C to a vertex that is not in C.

In a finite Markov chain, there is positive probability to arrive from any vertex on C to any other vertex of
C in a finite number of steps. If C is a final strong component, then this probability is 1, since the Markov chain
can never leave C once it enters it­ . It follows that a state is persistent if and only if it lies in a final strong
component.

Definition 30.4.5. A Markov chain is irreducible if its underlying graph consists of a single strong component.

Clearly, if a Markov chain is irreducible, then all states are persistent.


 
Definition 30.4.6. Let q^{(t)} = ( q^{(t)}_1 , q^{(t)}_2 , . . . , q^{(t)}_n ) be the state probability vector (also known as the
distribution of the chain at time t); that is, the row vector whose ith component is the probability that the chain
is in state i at time t.

The key observation is that


q(t) = q(t−1) P = q(0) Pt .
Namely, a Markov chain is fully defined by q(0) and P.

Definition 30.4.7. A stationary distribution for a Markov chain with the transition matrix P is a probability
distribution π such that π = πP.

In general, a stationary distribution does not necessarily exist. We will mostly be interested in Markov chains
that have a stationary distribution. Intuitively, it is clear that if a stationary distribution exists, then the Markov
chain, given enough time, will converge to it.

Definition 30.4.8. The periodicity of a state i is the maximum integer T for which there exists an initial distri-
bution q^{(0)} and positive integer a such that, for all t, if at time t we have q^{(t)}_i > 0 then t belongs to the
arithmetic progression {a + T j | j ≥ 0}. A state is said to be periodic if it has periodicity greater than 1, and is
aperiodic otherwise. A Markov chain in which every state is aperiodic is aperiodic.

Example 30.4.9. Maybe the easiest example of a periodic Markov chain is a directed cycle, say on the three
vertices v1 → v2 → v3 → v1 . Such a chain has periodicity three. In particular, the initial
state probability vector q^{(0)} = (1, 0, 0) leads to the following sequence of state probability vectors:

    q^{(0)} = (1, 0, 0) =⇒ q^{(1)} = (0, 1, 0) =⇒ q^{(2)} = (0, 0, 1) =⇒ q^{(3)} = (1, 0, 0) =⇒ . . . .

Note, that this chain still has a stationary distribution, that is (1/3, 1/3, 1/3), but unless you start from this
distribution, you are not going to converge to it.

A neat trick that forces a Markov chain to be aperiodic is to shrink all the probabilities by a factor of 2,
and make every state have a transition probability to itself equal to 1/2. Clearly, the resulting Markov chain
is aperiodic.
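The lazy-chain trick is easy to see in action. The sketch below (our illustration, not from the notes) power-iterates the 3-cycle of Example 30.4.9 and its lazy version: the former keeps rotating among the unit vectors, while the latter converges to (1/3, 1/3, 1/3).

```python
def step(q, P):
    """One Markov-chain step: the row vector q goes to qP."""
    n = len(P)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]

cycle = [[0.0, 1.0, 0.0],       # v1 -> v2 -> v3 -> v1: periodicity three
         [0.0, 0.0, 1.0],
         [1.0, 0.0, 0.0]]
lazy = [[0.5 if i == j else 0.5 * cycle[i][j] for j in range(3)]
        for i in range(3)]      # stay put with probability 1/2

q_cycle = [1.0, 0.0, 0.0]
q_lazy = [1.0, 0.0, 0.0]
for _ in range(99):
    q_cycle = step(q_cycle, cycle)
    q_lazy = step(q_lazy, lazy)
# q_cycle is still a unit vector; q_lazy is essentially (1/3, 1/3, 1/3).
```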

Definition 30.4.10. An ergodic state is aperiodic and (non-null) persistent.


An ergodic Markov chain is one in which all states are ergodic.

The following theorem is the fundamental property of Markov chains that we will need. The interested
reader, should check the proof in [Nor98] (the proof is not hard).
­
Think about it as hotel California.

Theorem 30.4.11 (Fundamental theorem of Markov chains). Any irreducible, finite, and aperiodic Markov
chain has the following properties.
(i) All states are ergodic.
(ii) There is a unique stationary distribution π such that, for 1 ≤ i ≤ n, we have πi > 0.
(iii) For 1 ≤ i ≤ n, we have fii = 1 and hii = 1/πi .
(iv) Let N(i, t) be the number of times the Markov chain visits state i in t steps. Then
    lim_{t→∞} N(i, t)/t = πi .
Namely, independent of the starting distribution, the process converges to the stationary distribution.

References
[Nor98] J. R. Norris. Markov Chains. Statistical and Probabilistic Mathematics. Cambridge Press, 1998.

Chapter 31

Random Walks III


“I gave the girl my protection, offering in my equivocal way to be her father. But I came too late, after she had ceased to
believe in fathers. I wanted to do what was right, I wanted to make reparation: I will not deny this decent impulse, however
mixed with more questionable motives: there must always be a place for penance and reparation. Nevertheless, I should never
have allowed the gates of the town to be opened to people who assert that there are higher considerations that those of decency.
They exposed her father to her naked and made him gibber with pain, they hurt her and he could not stop them (on a day I
spent occupied with the ledgers in my office). Thereafter she was no longer fully human, sister to all of us. Certain sympathies
died, certain movements of the heart became no longer possible to her. I too, if I live long enough in this cell with its ghost
not only of the father and the daughter but of the man who even by lamplight did not remove the black discs from his eyes
and the subordinate whose work it was to keep the brazier fed, will be touched with the contagion and turned into a creature that
believes in nothing.”

J. M. Coetzee, Waiting for the Barbarians

31.1. Random walks on graphs


Let G = (V, E) be a connected, non-bipartite, undirected graph, with n vertices. We define the natural Markov
chain on G, where the transition probability is



    Puv = 1/d(u) if uv ∈ E, and Puv = 0 otherwise,
where d(w) is the degree of vertex w. Clearly, the resulting Markov chain MG is irreducible. Note, that the
graph must have an odd cycle (being non-bipartite), and it has a closed walk of length 2 (going back and forth along an edge). Thus, the gcd of the lengths of its closed walks is 1.
Namely, MG is aperiodic. Now, by the Fundamental theorem of Markov chains, MG has a unique stationary
distribution π.
Lemma 31.1.1. For all v ∈ V, we have πv = d(v)/2m.
Proof: Since π is stationary, by the definition of Puv , we get

    πv = [πP]v = ∑_{uv∈E} πu Puv ,

and this holds for all v. We only need to verify the claimed solution, since there is a unique stationary distribu-
tion. Indeed,

    πv = [πP]v = ∑_{uv∈E} (d(u)/2m) · (1/d(u)) = d(v)/2m ,

as claimed. ■
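Lemma 31.1.1 is easy to verify numerically on a small example. The sketch below (ours; the particular graph is an arbitrary choice) power-iterates the natural random walk and compares the result to d(v)/2m.

```python
def walk_stationary(adj, iters=2000):
    """Power-iterate the natural random walk on an undirected graph given by
    adjacency lists, returning the (approximate) limiting distribution."""
    n = len(adj)
    q = [1.0 / n] * n
    for _ in range(iters):
        nq = [0.0] * n
        for u in range(n):
            for v in adj[u]:
                nq[v] += q[u] / len(adj[u])
        q = nq
    return q

# Triangle 0-1-2 plus a pendant vertex 3 hanging off 0 (connected, non-bipartite).
adj = [[1, 2, 3], [0, 2], [0, 1], [0]]
m = sum(len(a) for a in adj) // 2          # 4 edges
pi = walk_stationary(adj)
# Lemma 31.1.1 predicts pi = (3/8, 2/8, 2/8, 1/8).
```

The odd cycle guarantees aperiodicity, so the iteration indeed converges to the unique stationary distribution.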

Definition 31.1.2. The hitting time huv is the expected number of steps in a random walk that starts at u and
ends upon first reaching v.
The commute time between u and v is denoted by CTuv = huv + hvu .
Let Cu (G) denote the expected length of a walk that starts at u and ends upon visiting every vertex in G at
least once. The cover time of G, denoted by C(G), is defined by C(G) = maxu Cu (G).

Lemma 31.1.3. For all v ∈ V, we have hvv = 1/πv = 2m/d(v).


Example 31.1.4 (Lollipop). Let L2n be the 2n-vertex lollipop graph; this graph consists
of a clique on n vertices, and a path on the remaining n vertices. There is a vertex u in the
clique to which the path is attached, and the path vertices are x1 , . . . , xn ; let v = xn denote the far end of the path.
Taking a random walk from u to v requires in expectation O(n^2 ) steps, as we already saw in class.
This, however, ignores the probability of escape – that is, with probability (n − 1)/n,
when at u, we enter the clique Kn (instead of the path). As such, it turns out that huv =
Θ(n^3 ), and hvu = Θ(n^2 ). (Thus, hitting times are not symmetric!)
Note, that the cover time is not monotone decreasing with the number of edges. In-
deed, the path of length n has cover time O(n^2 ), but the larger graph L2n has cover time
Ω(n^3 ).
Example 31.1.5 (More on walking on the Lollipop). To see why huv = Θ(n3 ), number the vertices on the stem
x1 , . . . , xn . Let T i be the expected time to arrive to the vertex xi when starting a walk from u. Observe, that
surprisingly, T 1 = Θ(n2 ). Indeed, the walk has to visit the vertex u about n times in expectation, till the walk
would decide to go to x1 instead of falling back into the clique. The time between visits to u is in expectation
O(n) (assuming the walk is inside the clique).
Now, observe that T2i = Ti + Θ(i^2 ) + (1/2) T2i . Indeed, starting at xi , it takes in expectation Θ(i^2 ) steps of
the walk to either arrive (with equal probability) at x2i (good), or to get back to u (oops). In the latter case, the
game begins from scratch. As such, we have that

    T2i = 2Ti + Θ(i^2 ) = 2( 2Ti/2 + Θ((i/2)^2 ) ) + Θ(i^2 ) = · · · = 2^{1+log2 i} T1 + Θ(i^2 ),

assuming i is a power of two (why not?). As such, Tn = nT1 + Θ(n^2 ). Since T1 = Θ(n^2 ), we have that Tn = Θ(n^3 ).
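The asymmetry huv = Θ(n^3) versus hvu = Θ(n^2) shows up already for small n. Below is a simulation sketch (ours; the vertex numbering is an arbitrary labeling of L_{2n}, with the clique on {0, ..., n−1}, u = n−1, and the path running n−1, n, ..., 2n−1 with v = 2n−1):

```python
import random

def lollipop_neighbors(n, v):
    """Neighbors in L_{2n}: clique {0..n-1}, u = n-1, path n-1, n, ..., 2n-1."""
    if v < n - 1:                       # strictly inside the clique
        return [w for w in range(n) if w != v]
    if v == n - 1:                      # u: clique neighbors plus first path vertex
        return list(range(n - 1)) + [n]
    if v < 2 * n - 1:                   # interior path vertex
        return [v - 1, v + 1]
    return [v - 1]                      # the far end v

def hitting_time(n, start, target, rng, trials=200):
    total = 0
    for _ in range(trials):
        cur = start
        while cur != target:
            cur = rng.choice(lollipop_neighbors(n, cur))
            total += 1
    return total / trials

rng = random.Random(1)
n = 8
h_uv = hitting_time(n, n - 1, 2 * n - 1, rng)   # u to the far path end v
h_vu = hitting_time(n, 2 * n - 1, n - 1, rng)   # v back to u
```

Already at n = 8 the walk from u to v takes roughly an order of magnitude longer than the walk back.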

Definition 31.1.6. An n × n matrix M is stochastic if all its entries are non-negative and, for each row i, it holds
that ∑_k Mik = 1. It is doubly stochastic if, in addition, for each column i, it holds that ∑_k Mki = 1.

Lemma 31.1.7. Let MC be a Markov chain, such that transition probability matrix P is doubly stochastic.
Then, the distribution u = (1/n, 1/n, . . . , 1/n) is stationary for MC.
Proof: [uP]i = ∑_{k=1}^n Pki /n = 1/n. ■

We can interpret every edge in G as corresponding to two directed edges. In particular, imagine performing
a random walk in G, but remembering not only the current vertex in the walk, but also the (directed) edge used
the walk to arrive to this vertex. One can interpret this as a random walk on the (directed) edges. Observe, that
there are 2m directed edges. Furthermore, a vertex u of degree d(u), has stationary distribution πu = d(u)/2m.
As such, the probability that the random walk would use any of the d(u) outgoing edges from u is exactly
α = πu /d(u) = 1/2m. Namely, if we interpret the walk on the graph as walk on the directed edges, the stationary
distribution is uniform. This readily implies that if (u → v) is in the graph, then h(u→v)(u→v) is 1/α = 2m. This
readily implies that the expected time to go from u to v and back to u is at most 2m. Next, we provide a more
formal (and somewhat different) proof of this.

Lemma 31.1.8. For any edge (u → v) ∈ E, we have huv + hvu ≤ 2m.

(Note, that (u → v) being an edge in the graph is crucial. Indeed, without it, a significantly worse bound
holds, see Theorem 31.2.1.)

Proof: Consider a new Markov chain defined by the edges of the graph (where every edge is taken twice as
two directed edges), where the current state is the last (directed) edge visited. There are 2m edges in the new
Markov chain, and the new transition matrix has Q(u→v),(v→w) = Pvw = 1/d(v). This matrix is doubly stochastic,
meaning that not only do the rows sum to one, but the columns sum to one as well. Indeed, for an edge (v → w)
we have
    ∑_{x∈V, y∈Γ(x)} Q(x→y),(v→w) = ∑_{u∈Γ(v)} Q(u→v),(v→w) = ∑_{u∈Γ(v)} Pvw = d(v) × (1/d(v)) = 1.

Thus, the stationary distribution for this Markov chain is uniform, by Lemma 31.1.7. Namely, the stationary
distribution of e = (u → v) is πe = 1/(2m). Thus, the expected time between successive traversals of e is
hee = 1/πe = 2m, by Theorem 30.4.11 (iii).
Consider huv + hvu and interpret this as the time to go from u to v and then return to u. Conditioned on the
event that the initial entry into u was via the edge (v → u), we conclude that the expected time to go from there
to v and then finally use (v → u) is 2m. The memorylessness property of Markov chains now allows us to
remove the conditioning, since how we arrived at u is not relevant. Thus, the expected time to travel from u to
v and back is at most 2m. ■

31.2. Electrical networks and random walks


A resistive electrical network is an undirected graph. Each edge has a branch resistance associated with it. The electrical flow is determined by two laws: Kirchhoff's law (preservation of flow: all the flow coming into a node leaves it) and Ohm's law (the voltage across a resistor equals the product of the resistance and the current through it). Explicitly, Ohm's law states

voltage = resistance × current.

The effective resistance between nodes u and v is the voltage difference between u and v when one ampere is injected into u and removed from v (or injected into v and removed from u). The effective resistance is always bounded by the branch resistance, but it can be much lower.
Given an undirected graph G, let N(G) be the electrical network defined over G, associating a one ohm resistance with each edge of N(G).
You might now see the connection between a random walk on a graph and an electrical network. Intuitively (used in the most unscientific way possible), the electricity is made out of electrons, each of which performs a random walk on the electrical network. The resistance of an edge corresponds to the probability of taking the edge: the higher the resistance, the lower the probability that we travel on this edge. Thus, if the effective resistance R_uv between u and v is low, then there is a good probability that a random walk travels from u to v, and h_uv would be small.

31.2.1. A tangent on parallel and series resistors
Consider n resistors in parallel with resistances R_1, . . . , R_n, connecting two nodes u and v. The effective resistance between u and v is

    R_uv = 1 / (1/R_1 + 1/R_2 + · · · + 1/R_n).

In particular, if R_1 = · · · = R_n = R, then we have that R_uv = 1/(1/R + · · · + 1/R) = 1/(n/R) = R/n.


Similarly, if we have n resistors in series, with resistances R_1, R_2, . . . , R_n, between u and v, then the effective resistance between u and v is

    R_uv = R_1 + · · · + R_n.

In particular, if R_1 = · · · = R_n = R, then R_uv = nR.
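The two rules above can be packaged as two one-line helpers (an illustrative sketch, not part of the notes):

```python
def parallel(resistances):
    """Effective resistance of resistors in parallel: 1 / (sum of 1/R_i)."""
    return 1.0 / sum(1.0 / r for r in resistances)

def series(resistances):
    """Effective resistance of resistors in series: sum of R_i."""
    return float(sum(resistances))

# n equal resistors R in parallel give R/n; in series they give n*R.
print(parallel([2.0, 2.0, 2.0, 2.0]))  # 0.5
print(series([2.0, 2.0, 2.0, 2.0]))    # 8.0
```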

31.2.2. Back to random walks


Theorem 31.2.1. For any two vertices u and v in G, the commute time CTuv = 2mRuv , where Ruv is the effective
resistance between u and v.

Proof: Let ϕ_uv denote the voltage at u in N(G) with respect to v, where d(x) amperes of current are injected into each node x ∈ V, and 2m amperes are removed from v. We claim that h_uv = ϕ_uv.
Note that the voltage on an edge xy is ϕ_xy = ϕ_xv − ϕ_yv. Thus, using Kirchhoff's law and Ohm's law, we obtain that

    ∀x ∈ V \ {v}:   d(x) = Σ_{w∈Γ(x)} current(xw) = Σ_{w∈Γ(x)} ϕ_xw / resistance(xw) = Σ_{w∈Γ(x)} (ϕ_xv − ϕ_wv),    (31.1)

since the resistance of every edge is 1 ohm. (We also have the "trivial" equality that ϕ_vv = 0.) Furthermore, we have only n variables in this system; that is, for every x ∈ V, we have the variable ϕ_xv.

Now, for the random walk interpretation: by the definition of expectation, we have

    ∀x ∈ V \ {v}:   h_xv = (1/d(x)) Σ_{w∈Γ(x)} (1 + h_wv)  ⇐⇒  d(x) h_xv = Σ_{w∈Γ(x)} (1 + h_wv)
                                                            ⇐⇒  d(x) = d(x) h_xv − Σ_{w∈Γ(x)} h_wv.

Since d(x) = Σ_{w∈Γ(x)} 1, this is equivalent to

    ∀x ∈ V \ {v}:   d(x) = Σ_{w∈Γ(x)} (h_xv − h_wv).    (31.2)

Again, we also have the trivial equality h_vv = 0.¬ Note that this system also has n equalities and n variables.
Eq. (31.1) and Eq. (31.2) are two systems of linear equalities. Furthermore, if we identify h_xv with ϕ_xv then they are exactly the same system of equalities. Furthermore, since Eq. (31.1) represents a physical system, we know that it has a unique solution. This implies that ϕ_xv = h_xv, for all x ∈ V.
Imagine the network where u is injected with 2m amperes, and, for every node w, d(w) units are removed from w. Let ϕ′ denote the voltages in this new network. This network is the negation of the network where d(x) amperes are injected into each node x and 2m amperes are removed from u; by the claim above (applied with u in the role of v), we get h_vu = −ϕ′_vu = ϕ′_uv. Now, since flows behave linearly, we can superimpose the two networks (i.e., add them up). In the superimposed network, 2m units are being injected at u, and 2m units are being extracted at v; at all other nodes the charges cancel out. The voltage difference between u and v in the superimposed network is

    ϕ̂ = ϕ_uv + ϕ′_uv = h_uv + h_vu = CT_uv.

Now, in the superimposed network there are 2m amperes going from u to v, and by Ohm's law, we have

    ϕ̂ = voltage = resistance × current = 2m R_uv,

as claimed. ■
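Theorem 31.2.1 makes commute times computable by plain linear algebra: the effective resistance of the unit-resistance network N(G) can be read off the Moore-Penrose pseudoinverse L⁺ of the graph Laplacian L = D − M, via R_uv = L⁺_uu + L⁺_vv − 2L⁺_uv. (This is a standard fact, not proved in these notes.) A small sketch:

```python
import numpy as np

def effective_resistance(A, u, v):
    """R_uv of the unit-resistance network N(G), read off the Moore-Penrose
    pseudoinverse of the graph Laplacian L = D - A:
        R_uv = L+[u, u] + L+[v, v] - 2 * L+[u, v]."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    Lp = np.linalg.pinv(L)
    return Lp[u, u] + Lp[v, v] - 2 * Lp[u, v]

# Path on 4 vertices: three unit resistors in series, so R_03 = 3.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
# Triangle: one resistor in parallel with two in series, so R_01 = 2/3.
tri = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(effective_resistance(path, 0, 3))  # about 3.0, so CT_03 = 2*3*3 = 18
print(effective_resistance(tri, 0, 1))   # about 2/3, so CT_01 = 2*3*(2/3) = 4
```

Both values agree with the series/parallel rules from the previous subsection, and the commute times then follow directly from the theorem.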

Figure 31.1: Lollipop again. (The clique on n vertices, with the stem x_1, x_2, . . . , x_n = v attached at the vertex u.)

Example 31.2.2. Recall the lollipop L_n from Exercise 31.1.4, see Figure 31.1. Let u be the connecting vertex between the clique and the stem (i.e., the path). The effective resistance between u and v is n, since there are n unit resistors in series along the stem. That is, R_uv = n.
The number of edges in the lollipop is m = n(n − 1)/2 + n = n(n + 1)/2 (the clique contributes n(n − 1)/2 edges, and the stem contributes n). As such, the commute time is

    h_vu + h_uv = CT_uv = 2m R_uv = 2 (n(n + 1)/2) n = n²(n + 1).

We already know that h_vu = Θ(n²). This implies that h_uv = CT_uv − h_vu = Θ(n³).
¬
In previous lectures, we interpreted hvv as the expected length of a walk starting at v and coming back to v.

Lemma 31.2.3. For any n-vertex connected graph G, and for all u, v ∈ V(G), we have CT_uv < n³.

Proof: The effective resistance between any two nodes in the network is bounded by the length of the shortest path between the two nodes, which is at most n − 1. As such, plugging this into Theorem 31.2.1 yields the bound, since m < n²/2. ■

31.3. Bibliographical Notes


A nice survey of the material covered here is available online at http://arxiv.org/abs/math.PR/0001057 [DS00].

References
[DS00] P. G. Doyle and J. L. Snell. Random walks and electric networks. ArXiv Mathematics e-prints,
2000. eprint: math/0001057.

Chapter 32

Random Walks IV
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“Do not imagine, comrades, that leadership is a pleasure! On the contrary, it is a deep and heavy responsibility. No one believes
more firmly than Comrade Napoleon that all animals are equal. He would be only too happy to let you make your decisions
for yourselves. But sometimes you might make the wrong decisions, comrades, and then where should we be? Suppose you
had decided to follow Snowball, with his moonshine of windmills-Snowball, who, as we now know, was no better than a
criminal?”

Animal Farm, George Orwell

32.1. Cover times


We remind the reader that the cover time of a graph is the expected time to visit all the vertices in the graph,
starting from an arbitrary vertex (i.e., worst vertex). The cover time is denoted by C(G).
Theorem 32.1.1. Let G be an undirected connected graph, then C(G) ≤ 2m(n − 1), where n = |V(G)| and
m = |E(G)|.

Proof: (Sketch.) Construct a spanning tree T of G, and consider a walk that traverses T, crossing each of its n − 1 edges once in each direction. For each edge uv of T, the expected time to travel over it in both directions is CT_uv = h_uv + h_vu, which is at most 2m, by Lemma 31.1.8. Now, summing these bounds over the edges of T bounds the expected time to travel around the spanning tree. Note that the bound is independent of the starting vertex. ■
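A quick sanity check of the bound (a toy simulation, not from the notes): on the cycle C_8 we have n = m = 8, so the theorem gives C(G) ≤ 2m(n − 1) = 112, while the true cover time of the n-cycle is n(n − 1)/2 = 28.

```python
import random

def cover_time(adj, start, rng):
    """Number of steps until a random walk from start has visited every vertex."""
    seen, v, steps = {start}, start, 0
    while len(seen) < len(adj):
        v = rng.choice(adj[v])
        seen.add(v)
        steps += 1
    return steps

n = 8
cycle = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}  # n = m = 8
rng = random.Random(1)
trials = 5000
est = sum(cover_time(cycle, 0, rng) for _ in range(trials)) / trials
bound = 2 * n * (n - 1)  # 2m(n - 1), with m = n for the cycle
print(est, "<=", bound)  # the estimate is near n(n - 1)/2 = 28
```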

Definition 32.1.2. The resistance of G is R(G) = maxu,v∈V(G) Ruv ; namely, it is the maximum effective resis-
tance in G.

Theorem 32.1.3. mR(G) ≤ C(G) ≤ 2e³ mR(G) ln n + 2n.

Proof: Consider the vertices u and v realizing R(G), and observe that max(h_uv, h_vu) ≥ CT_uv/2, and CT_uv = 2mR_uv by Theorem 31.2.1. Thus, C(G) ≥ CT_uv/2 ≥ mR(G).
As for the upper bound, consider a random walk, and divide it into epochs, where an epoch is a random walk of length 2e³mR(G). For any vertex v, the expected time to hit u is h_vu ≤ 2mR(G), by Theorem 31.2.1. Thus, the probability that u is not visited in an epoch is at most 1/e³, by the Markov inequality. Consider a random walk with t = ln n epochs. The probability of not visiting u is ≤ (1/e³)^{ln n} ≤ 1/n³. Thus, all vertices are visited after ln n epochs, with probability ≥ 1 − n/n³ ≥ 1 − 1/n². Otherwise, after this walk, we perform a random walk till we visit all vertices. The length of this (fix-up) random walk is ≤ 2n³, by Theorem 32.1.1. Thus, the expected length of the walk is ≤ 2e³mR(G) ln n + 2n³(1/n²) = 2e³mR(G) ln n + 2n. ■

32.1.1. Rayleigh’s Short-cut Principle.
Observe that the effective resistance is never raised by lowering the resistance on an edge, and it is never lowered by raising the resistance on an edge. Similarly, the effective resistance is never lowered by removing a vertex.
Interestingly, effective resistance complies with the triangle inequality.

Observation 32.1.4. For a graph with minimum degree d, we have R(G) ≥ 1/d (collapse all vertices except
the minimum-degree vertex into a single vertex).

Lemma 32.1.5. Suppose that G contains p edge-disjoint paths of length at most ℓ from s to t. Then R st ≤ ℓ/p.

32.2. Graph Connectivity


Definition 32.2.1. A probabilistic log-space Turing machine for a language L is a Turing machine using space O(log n) and running in time O(poly(n)), where n is the input size. A problem A is in RLP if there exists a probabilistic log-space Turing machine M such that M accepts x ∈ L(A) with probability larger than 1/2, and if x ∉ L(A) then M(x) always rejects.

Theorem 32.2.2. Let USTCON denote the problem of deciding if a vertex s is connected to a vertex t in an
undirected graph. Then USTCON ∈ RLP.

Proof: Perform a random walk of length 2n³ in the input graph G, starting from s. Stop as soon as the random walk hits t. If s and t are in the same connected component, then h_st ≤ n³. Thus, by the Markov inequality, the algorithm answers correctly with probability ≥ 1/2 in this case. It is easy to verify that it can be implemented in O(log n) space. ■
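The decision procedure from the proof can be sketched as follows (illustration only; a true log-space implementation would of course not store the adjacency lists explicitly):

```python
import random

def ustcon(adj, s, t, rng):
    """One-sided-error test for s-t connectivity in an undirected graph:
    walk 2n^3 steps from s and answer YES iff the walk reaches t.
    A YES answer is always correct; a NO answer errs with probability
    at most 1/2 (by the Markov inequality)."""
    if s == t:
        return True
    n = len(adj)
    v = s
    for _ in range(2 * n ** 3):
        v = rng.choice(adj[v])
        if v == t:
            return True
    return False

# Two components: the triangle {0, 1, 2} and the edge {3, 4}.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
rng = random.Random(2)
print(ustcon(adj, 0, 2, rng))  # True: s and t are connected
print(ustcon(adj, 0, 3, rng))  # False: a NO answer here is always correct
```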

Definition 32.2.3. A graph is d-regular, if all its vertices are of degree d.


A d-regular graph is labeled if at each vertex of the graph, each of the d edges incident on that vertex has a
unique label in {1, . . . , d}.
Any sequence of symbols σ = (σ1 , σ2 , . . .) from {1, . . . , d} together with a starting vertex s in a labeled
graph describes a walk in the graph. For our purposes, such a walk would almost always be finite.
A sequence σ is said to traverse a labeled graph if the walk visits every vertex of G regardless of the starting vertex. A sequence σ is said to be a universal traversal sequence for a class of labeled graphs if it traverses every graph in this class.

Given such a universal traversal sequence, we can construct (a non-uniform) Turing machine that can solve
USTCON for such d-regular graphs, by encoding the sequence in the machine.
Let F denote a family of graphs, and let U(F ) denote the length of the shortest universal traversal sequence
for all the labeled graphs in F . Let R(F ) denote the maximum resistance of graphs in this family.

Theorem 32.2.4. Let F be a family of d-regular graphs with n vertices, then U(F ) ≤ 5mR(F ) lg(n |F |).

Proof: Same old, same old. Break the string into epochs, each of length L = 2mR(F). Consider random walks starting from all possible vertices, of all graphs in F, and continue the walks until every walk has visited all the vertices of its graph. There are n|F| walks, and each of them needs to visit n vertices; so, initially, there are n²|F| pairs of (walk, vertex) waiting to be visited. For any pair of vertices v and u, we have h_vu ≤ 2mR(F) = L, and thus, by the Markov inequality, in each epoch every waiting pair is resolved with probability at least half. In expectation, in each epoch, (at least) half the waiting pairs get visited. As such, after 1 + lg(n²|F|) epochs, the expected number of pairs still waiting to be visited is ≤ 1/2. Namely, with constant probability we are done. ■

Let U(d, n) denote the length of the shortest universal traversal sequence of connected, labeled n-vertex,
d-regular graphs.

Lemma 32.2.5. The number of labeled n-vertex graphs that are d-regular is (nd)^{O(nd)}.

Proof: Such a graph has dn/2 edges overall. Specifically, we encode such a graph by listing, for every vertex, its d neighbors; for a single vertex there are N = C(n − 1, d) ≤ n^d possibilities. As such, there are at most N^n ≤ n^{nd} choices for the edges of the graph¬. Every vertex has d! possible labelings of the edges adjacent to it; thus, there are (d!)^n ≤ d^{nd} possible labelings. ■
 
Lemma 32.2.6. U(d, n) = O(n³ d log n).

Proof: The diameter of every connected n-vertex, d-regular graph is O(n/d). Indeed, consider the path realizing the diameter of the graph, and assume it has t vertices. Number the vertices along the path consecutively, and consider all the vertices whose number is a multiple of three. There are α ≥ ⌊t/3⌋ such vertices. No pair of these vertices can share a neighbor, and as such, the graph has at least (d + 1)α vertices. We conclude that n ≥ (d + 1)α ≥ (d + 1)(t/3 − 1), and thus t ≤ 3(n + d + 1)/(d + 1) = O(n/d).
This also bounds the resistance of such a graph: R(F) = O(n/d). The number of edges is m = nd/2. Now, combine Lemma 32.2.5 and Theorem 32.2.4. ■

This is, as mentioned before, not a uniform algorithm. There is by now a known log-space deterministic
algorithm for this problem, which is uniform.

32.2.1. Directed graphs


Theorem 32.2.7. One can solve the directed STCON problem, for a given directed graph with n vertices, using a log-space randomized algorithm that always outputs NO if there is no path from s to t, and outputs YES with probability at least 1/2 if there is a path from s to t.

Proof: (Sketch.) The basic idea is simple: start a random walk from s; if it fails to arrive at t after a certain number of steps, then restart. The only challenging thing is that the number of times we need to repeat this is exponentially large. Indeed, the probability that a random walk from s arrives at t within n steps is at least p = 1/n^{n−1} ≥ n^{−n}, if s is connected to t.
As such, we need to repeat this walk N = O((1/p) log(1/δ)) = O(n^{n+1}) times, for δ = 1/2^n. If all of these walks fail, then, with probability ≥ 1 − δ, there is no path from s to t.
We can do the walk using logarithmic space. However, how do we count to N (reliably) using only logarithmic space? We leave this as an exercise to the reader, see Exercise 32.2.8. ■

Exercise 32.2.8. Let N be a large integer number. Show a randomized algorithm that, with high probability, counts from 1 to M, where M ≥ N, and always stops. The algorithm should use only O(log log N) bits.

¬
This is a callous upper bound; better analysis is possible. But never analyze things better than you have to: it is usually a waste of time.

Chapter 33

A Bit on Algebraic Graph Theory


“The Party told you to reject the evidence of your eyes and ears. It was their final, most essential command.”

1984, George Orwell

33.1. Graphs and Eigenvalues


Consider an undirected graph G = G(V, E) with n vertices. The adjacency matrix M(G) of G is the n × n
symmetric matrix where Mi j = M ji is the number of edges between the vertices vi and v j . If G is bipartite, we
assume that V is made out of two independent sets X and Y. In this case the matrix M(G) can be written in
block form.

33.1.1. Eigenvalues and eigenvectors


A non-zero vector v is an eigenvector of M if there is a value λ, known as the eigenvalue of v, such that Mv = λv. That is, the vector v is mapped to zero by the matrix N = M − λI. This happens only if N is not of full rank, which in turn implies that det(N) = 0. We have that f(λ) = det(M − λI) is a polynomial of degree n. It has n roots (not necessarily real), which are the eigenvalues of M. A matrix N ∈ R^{n×n} is symmetric if N^T = N.

Lemma 33.1.1. The eigenvalues of a symmetric real matrix N ∈ Rn×n are real numbers.
Proof: Observe that for any real vector v = (v_1, . . . , v_n) ∈ R^n, we have that ⟨v, v⟩ = Σ_{i=1}^n v_i² ≥ 0. As such, for an eigenvector v with eigenvalue λ, we have

    0 ≤ ⟨Nv, Nv⟩ = (Nv)^T Nv = (λv)^T λv = λ² ⟨v, v⟩.

Namely, λ² is a non-negative number, which implies that λ is a real number. ■

Lemma 33.1.2. Let N ∈ R^{n×n} be a symmetric matrix. Consider two eigenvectors v_1, v_2 that correspond to two eigenvalues λ_1, λ_2, where λ_1 ≠ λ_2. Then v_1 and v_2 are orthogonal.

Proof: Indeed, v_1^T N v_2 = λ_2 v_1^T v_2. Similarly, by symmetry, v_1^T N v_2 = (N^T v_1)^T v_2 = (N v_1)^T v_2 = λ_1 v_1^T v_2. We conclude that either λ_1 = λ_2, or v_1 and v_2 are orthogonal (i.e., v_1^T v_2 = 0). ■

33.1.2. Eigenvalues and eigenvectors of a graph
Since the adjacency matrix N = M(G) of an undirected graph is symmetric, all its eigenvalues exist and are real numbers λ_1 ≥ λ_2 ≥ · · · ≥ λ_n, and their corresponding orthonormal basis vectors are e_1, . . . , e_n.
We will need the following theorem.

Theorem 33.1.3 (Fundamental theorem of algebraic graph theory). Let G = G(V, E) be an undirected (multi)graph
with maximum degree d and with n vertices. Let λ1 ≥ λ2 ≥ · · · ≥ λn be the eigenvalues of M(G) and the corre-
sponding orthonormal eigenvectors are e1 , . . . , en . The following holds.
(i) If G is connected then λ2 < λ1 .
(ii) For i = 1, . . . , n, we have |λi | ≤ d.
(iii) d is an eigenvalue if and only if G is regular.
(iv) If G is d-regular then the eigenvalue λ_1 = d has the eigenvector e_1 = (1/√n)(1, 1, 1, . . . , 1).
(v) The graph G is bipartite if and only if for every eigenvalue λ there is an eigenvalue −λ of the same
multiplicity.
(vi) Suppose that G is connected. Then G is bipartite if and only if −λ1 is an eigenvalue.
(vii) If G is d-regular and bipartite, then λ_n = −d and e_n = (1/√n)(1, 1, . . . , 1, −1, . . . , −1), where there are equal numbers of 1s and −1s in e_n.
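Several items of the theorem can be checked numerically on a small example, say the 6-cycle, which is connected, 2-regular, and bipartite (a toy check, not part of the notes):

```python
import numpy as np

# Adjacency matrix of the 6-cycle: connected, 2-regular, bipartite.
n, d = 6, 2
M = np.zeros((n, n))
for i in range(n):
    M[i, (i + 1) % n] = M[i, (i - 1) % n] = 1

ev = np.sort(np.linalg.eigvalsh(M))[::-1]   # 2, 1, 1, -1, -1, -2
print(np.round(ev, 6))
# (ii): all |eigenvalues| <= d; (iii)/(iv): d is an eigenvalue (regularity);
# (v)/(vii): the spectrum is symmetric around 0 and -d appears (bipartite).
```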

33.2. Bibliographical Notes


A nice survey of algebraic graph theory appears in [Wes01] and in [Bol98].

References
[Bol98] B. Bollobas. Modern Graph Theory. Springer-Verlag, 1998.
[Wes01] D. B. West. Introduction to Graph Theory. 2nd ed. Prentice Hall, 2001.

Chapter 34

Random Walks V
“Is there anything in the Geneva Convention about the rules of war in peacetime?” Stanko wanted to know, crawling back
toward the truck. “Absolutely nothing,” Caulec assured him. “The rules of war apply only in wartime. In peacetime, anything
goes.”

Romain Gary, Gasp

34.1. Explicit expander construction


We state here a few facts about expander graphs without proofs (see also Theorem 33.1.3).

Definition 34.1.1. An (n, d, c)-expander is a d-regular bipartite graph G = (X, Y, E), where |X| = |Y| = n/2. Here, we require that for any S ⊆ X, we have

    |Γ(S)| ≥ (1 + c(1 − 2|S|/n)) |S|.

The Margulis-Gabber-Galil expander. For a positive integer m, let n = 2m². Each vertex in X and Y above is interpreted as a pair (a, b), where a, b ∈ Z_m = {0, . . . , m − 1}. A vertex (a, b) ∈ X is connected to the vertices

(a, b), (a, a + b), (a, a + b + 1), (a + b, b), and (a + b + 1, b),

in Y, where the addition is done modulo m.
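The neighbor lists of this construction are trivial to compute, which is the point of an explicit construction. A sketch (the function name is ours; some neighbor lists repeat entries, e.g. when a = 0, so the construction is really a multigraph):

```python
def gg_neighbors(a, b, m):
    """The five neighbors in Y of the vertex (a, b) in X of the
    Margulis-Gabber-Galil expander; all arithmetic is modulo m."""
    return [(a, b),
            (a, (a + b) % m),
            (a, (a + b + 1) % m),
            ((a + b) % m, b),
            ((a + b + 1) % m, b)]

m = 4  # n = 2 * m**2 = 32 vertices overall
print(gg_neighbors(1, 2, m))  # [(1, 2), (1, 3), (1, 0), (3, 2), (0, 2)]
```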

Theorem 34.1.2 ([GG81]). The above graph is 5-regular, and it is an (n, 5, (2 − √3)/4)-expander.

Spectral gap and expansion. We remind the reader that if G is a d-regular graph, then its adjacency matrix M(G) has λ_1 = d as its largest eigenvalue. In particular, let |λ_1| ≥ |λ_2| ≥ . . . ≥ |λ_n| be the eigenvalues of M(G). We then have the following:

Theorem 34.1.3. If G is an (n, d, c)-expander, then M(G) has |λ_2| ≤ d − c²/(1024 + 2c²).

Theorem 34.1.4. If M(G) has |λ_2| ≤ d − ε, then G is an (n, d, c)-expander with c ≥ (2dε − ε²)/d².

34.2. Rapid mixing for expanders
Here is another equivalent definition of an expander.
Definition 34.2.1. Let G = (V, E) be an undirected d-regular graph. The graph G is an (n, d, c)-expander (or just c-expander) if, for every set S ⊆ V of size at most |V|/2, there are at least cd|S| edges connecting S and S̄ = V \ S; that is, e(S, S̄) ≥ cd|S|.

Guaranteeing aperiodicity. Let G be an (n, d, c)-expander. We would like to perform a random walk on G. The graph G is connected, but it might be periodic (i.e., bipartite). To overcome this, consider the random walk on G that either stays in the current state with probability 1/2 or traverses one of the edges. Clearly, the resulting Markov chain (MC) is aperiodic. The resulting transition matrix is

    Q = M/(2d) + I/2,

where M is the adjacency matrix of G and I is the n × n identity matrix. Clearly, Q is doubly stochastic. Furthermore, if λ̂_i is an eigenvalue of M, with eigenvector v_i, then

    Q v_i = (1/2)(M/d + I) v_i = (1/2)(λ̂_i/d + 1) v_i.

As such, (1/2)(λ̂_i/d + 1) is an eigenvalue of Q. Namely, if there is a spectral gap in the graph G, there would also be a similar spectral gap in the resulting MC. This MC can be generated by adding d self loops to each vertex, ending up with a 2d-regular graph. Clearly, this graph is still an expander if the original graph is an expander, and the random walk on it is aperiodic.
From this point on, we just assume our expander is aperiodic.

34.2.1. Bounding the mixing time



For an MC with n states, we denote by π = (π_1, . . . , π_n) its stationary distribution. We consider only nicely behaved MCs that fall under Theorem 30.4.11. As such, no state in the MC has zero stationary probability.
Definition 34.2.2. Let q^(t) denote the state probability vector of a Markov chain defined by a transition matrix Q at time t ≥ 0, given an initial distribution q^(0). The relative pairwise distance of the Markov chain at time t is

    ∆(t) = max_i |q_i^(t) − π_i| / π_i.

Namely, if ∆(t) approaches zero then q^(t) approaches π.

We remind the reader that we saw a construction of a constant degree expander with constant expansion. For its transition matrix Q, we have that λ̂_1 = 1 and −1 ≤ λ̂_2 < 1, and furthermore the spectral gap λ̂_1 − λ̂_2 is a constant (the two properties are equivalent, but we proved only one direction of this).
We need a slightly stronger property (that does hold for our expander construction): we assume that λ̂_2 ≥ max_{i=2,...,n} |λ̂_i|.

Theorem 34.2.3. Let Q be the transition matrix of an aperiodic (n, d, c)-expander. Then, for any initial distribution q^(0), we have that

    ∆(t) ≤ n^{3/2} (λ̂_2)^t.

Since λ̂_2 is a constant smaller than 1, the distance ∆(t) drops exponentially with t.

Proof: We have that q^(t) = q^(0) Q^t. Let B(Q) = ⟨v_1, . . . , v_n⟩ denote the orthonormal eigenvector basis of Q (see Definition 45.2.3), and write q^(0) = Σ_{i=1}^n α_i v_i. Since λ̂_1 = 1, we have that

    q^(t) = q^(0) Q^t = (Σ_{i=1}^n α_i v_i) Q^t = Σ_{i=1}^n α_i (λ̂_i)^t v_i = α_1 v_1 + Σ_{i=2}^n α_i (λ̂_i)^t v_i.

Since v_1 = (1/√n, . . . , 1/√n), and |λ̂_i| ≤ λ̂_2 < 1 for i > 1, we have that lim_{t→∞} (λ̂_i)^t = 0, and thus

    π = lim_{t→∞} q^(t) = α_1 v_1 + Σ_{i=2}^n α_i (lim_{t→∞} (λ̂_i)^t) v_i = α_1 v_1.

Now, since v_1, . . . , v_n is an orthonormal basis, and q^(0) = Σ_{i=1}^n α_i v_i, we have that ∥q^(0)∥_2² = Σ_{i=1}^n α_i². This implies that

    ∥q^(t) − π∥_1 = ∥q^(t) − α_1 v_1∥_1 = ∥Σ_{i=2}^n α_i (λ̂_i)^t v_i∥_1 ≤ √n ∥Σ_{i=2}^n α_i (λ̂_i)^t v_i∥_2
                 = √n (Σ_{i=2}^n α_i² (λ̂_i)^{2t})^{1/2} ≤ √n (λ̂_2)^t (Σ_{i=2}^n α_i²)^{1/2} ≤ √n (λ̂_2)^t ∥q^(0)∥_2 ≤ √n (λ̂_2)^t ∥q^(0)∥_1 = √n (λ̂_2)^t,

since q^(0) is a distribution (the first inequality is the Cauchy-Schwarz inequality). Now, since π_i = 1/n, we have

    ∆(t) = max_i |q_i^(t) − π_i| / π_i = n max_i |q_i^(t) − π_i| ≤ n ∥q^(t) − π∥_1 ≤ n √n (λ̂_2)^t. ■

34.3. Probability amplification by random walks on expanders


We are interested in performing probability amplification for an algorithm that is a BPP algorithm (see Definition 35.2.8). It would be convenient to work with an algorithm which is already somewhat amplified. That is, we assume that we are given a BPP algorithm Alg for a language L, such that

(A) If x ∈ L then P[Alg(x) accepts] ≥ 199/200.
(B) If x ∉ L then P[Alg(x) accepts] ≤ 1/200.

We assume that Alg requires a random bit string of length n. So, we have a constant degree expander G (say of degree d) that has at least 200 · 2^n vertices. In particular, let U = |V(G)|, and since our expander construction grows exponentially in size (but the base of the exponent is a constant), we have that U = O(2^n). (Translation: we cannot quite get an expander with a specific number of vertices. Rather, we can guarantee an expander that has more vertices than we need, but not many more.)
We label the vertices of G with all the binary strings of length n, in a round robin fashion (thus, each binary string of length n appears either ⌊|V(G)|/2^n⌋ or ⌈|V(G)|/2^n⌉ times). For a vertex v ∈ V(G), let s(v) denote the binary string associated with v.
Consider a string x that we would like to decide whether it is in L or not. We know that at least (99/100)U vertices of G are labeled with "random" strings that would yield the right result if we feed them into Alg (the constant here deteriorated from 199/200 to 99/100 because the number of times a string appears is not identically the same for all strings).

The algorithm. We perform a random walk of length µ = αβk on G, where α and β are constants to be determined shortly, and k is a parameter. To this end, we randomly choose a starting vertex X_0 (this requires n + O(1) bits). Every step of the random walk requires O(1) random bits, as the expander is a constant degree expander; as such, overall, this requires n + O(k) random bits.
Now, let X_0, X_1, . . . , X_µ be the resulting random walk. We compute the results

    Y_i = Alg(x, r_i), for i = 0, . . . , ν, where ν = αk,

and r_i = s(X_{iβ}). Specifically, we use the strings associated with nodes that are at distance β from each other along the path of the random walk. We return the majority of the bits Y_0, . . . , Y_ν as the decision of whether x ∈ L or not.
We assume here that we have a fully explicit construction of an expander. That is, given a vertex of an expander, we can compute all its neighbors in polynomial time (in the length of the index of the vertex). While the construction of the expander shown earlier is only explicit, it can be made fully explicit with more effort.
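The overall structure of the algorithm can be sketched as follows (a toy stand-in, not the real thing: a circulant graph plays the role of the expander, and the "bad" random strings are a hypothetical 1% of the vertices):

```python
import random

# Toy parameters: N plays the role of U = |V(G)|; the shift set makes an
# 8-regular circulant graph standing in for the constant-degree expander.
N = 10000
shifts = [1, -1, 7, -7, 49, -49, 343, -343]
bad = set(range(0, N, 100))            # hypothetical 1% of "bad" strings

def walk_majority(k, beta, rng):
    """Walk ~beta*k steps; use every beta-th vertex as a random string and
    return True iff a majority of the sampled vertices are "good"."""
    v = rng.randrange(N)               # choosing X_0: n + O(1) random bits
    good_votes = 0
    for _ in range(k):
        for _ in range(beta):          # each step costs O(1) random bits
            v = (v + rng.choice(shifts)) % N
        good_votes += v not in bad
    return good_votes > k // 2

rng = random.Random(3)
results = [walk_majority(k=21, beta=10, rng=rng) for _ in range(200)]
print(sum(results), "out of 200 runs reached the correct majority")
```

Since bad vertices are rare and the walk mixes quickly, a majority of bad samples essentially never occurs; the analysis below makes this quantitative for actual expanders.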

34.3.1. The analysis


Intuition. Skipping every β nodes in the random walk corresponds to performing a random walk on the graph G^β; that is, we raise the graph to the power β. This new graph is a much better expander (but its degree has deteriorated). Now, consider a specific input x, and mark the bad vertices for it in the graph G. Clearly, we mark at most a 1/100 fraction of the vertices. Conceptually, think about these vertices as being uniformly spread in the graph and far apart. For the execution of the algorithm to fail, the random walk needs to visit αk/2 bad vertices in the random walk on G^β. However, the probability for that is extremely small: why would the random walk keep stumbling into bad vertices, when they are so infrequent?

The real thing. Let Q be the transition matrix of G. We assume, as usual, that the random walk on G is aperiodic (if not, we can easily fix it using standard tricks), and thus ergodic. Let B = Q^β be the transition matrix of the random walk of the states we use in the algorithm. Note that the eigenvalues (except the first one) of B "shrink". In particular, by picking β to be a sufficiently large constant, we have that

    λ̂_1(B) = 1 and |λ̂_i(B)| ≤ 1/10, for i = 2, . . . , U.

For the input string x, let W be the matrix that has 1 in the diagonal entry W_ii if and only if Alg(x, s(i)) returns the right answer, for i = 1, . . . , U. (We remind the reader that s(i) is the string associated with the ith vertex, and U = |V(G)|.) The matrix W is zero everywhere else. Similarly, let W̄ = I − W be the "complement" matrix, having 1 at W̄_ii iff Alg(x, s(i)) is incorrect. We know that W is a U × U matrix that has at least (99/100)U ones on its diagonal.
Lemma 34.3.1. Let Q be a symmetric transition matrix. Then all the eigenvalues of Q are in the range [−1, 1].

Proof: Let p ∈ R^U be an eigenvector with eigenvalue λ, and let p_i be the coordinate of p with the maximum absolute value. We have that

    |λ| |p_i| = |(pQ)_i| = |Σ_{j=1}^U p_j Q_ji| ≤ Σ_{j=1}^U |p_j| Q_ji ≤ |p_i| Σ_{j=1}^U Q_ji = |p_i|.

This implies that |λ| ≤ 1.

(We used the symmetry of the matrix in concluding that the columns of Q sum to one, and in implying that the eigenvalues of Q are all real numbers.) ■

Lemma 34.3.2. Let Q be a symmetric transition matrix. Then, for any p ∈ R^n, we have that ∥pQ∥_2 ≤ ∥p∥_2.

Proof: Let B(Q) = ⟨v_1, . . . , v_n⟩ denote the orthonormal eigenvector basis of Q, with eigenvalues 1 = λ_1, λ_2, . . . , λ_n. Write p = Σ_i α_i v_i, and observe that

    ∥pQ∥_2 = ∥Σ_i α_i v_i Q∥_2 = ∥Σ_i α_i λ_i v_i∥_2 = (Σ_i α_i² λ_i²)^{1/2} ≤ (Σ_i α_i²)^{1/2} = ∥p∥_2,

since |λ_i| ≤ 1, for i = 1, . . . , n, by Lemma 34.3.1. ■

Lemma 34.3.3. Let B = Q^β be the transition matrix of the graph G^β. For all vectors p ∈ R^U, we have:
(i) ∥pBW∥_2 ≤ ∥p∥_2, and
(ii) ∥pBW̄∥_2 ≤ ∥p∥_2 / 5.

Proof: (i) Since multiplying a vector by W has the effect of zeroing out some coordinates, it is clear that it cannot enlarge the norm of a vector. As such, ∥pBW∥_2 ≤ ∥pB∥_2 ≤ ∥p∥_2, by Lemma 34.3.2.
(ii) Write p = Σ_i α_i v_i, where v_1, . . . , v_U is the orthonormal basis of Q (and thus also of B), with eigenvalues 1 = λ̂_1, λ̂_2, . . . , λ̂_U. We remind the reader that v_1 = (1, 1, . . . , 1)/√n. Since W̄ zeroes out at least 99/100 of the entries of a vector it is multiplied by (and copies the rest as they are), we have that

    ∥v_1 W̄∥_2 ≤ ((n/100)(1/√n)²)^{1/2} ≤ 1/10 = ∥v_1∥_2 / 10.

Now, for any x ∈ R^U, we have ∥xW̄∥_2 ≤ ∥x∥_2. As such, we have that

    ∥pBW̄∥_2 = ∥Σ_i α_i v_i B W̄∥_2 ≤ ∥α_1 v_1 B W̄∥_2 + ∥Σ_{i=2}^U α_i v_i B W̄∥_2
            = ∥α_1 v_1 W̄∥_2 + ∥Σ_{i=2}^U α_i (λ̂_i)^β v_i W̄∥_2 ≤ |α_1|/10 + ∥Σ_{i=2}^U α_i (λ̂_i)^β v_i∥_2
            = |α_1|/10 + (Σ_{i=2}^U α_i² (λ̂_i)^{2β})^{1/2} ≤ |α_1|/10 + (1/10)(Σ_{i=2}^U α_i²)^{1/2} ≤ ∥p∥_2/10 + ∥p∥_2/10 ≤ ∥p∥_2/5,

since |λ̂_i|^β ≤ 1/10, for i = 2, . . . , U. ■

Consider the strings r_0, . . . , r_ν. For each one of these strings, we can write down whether it is a "good" string (i.e., Alg returns the correct result) or a bad string. This results in a binary pattern b_0, . . . , b_ν. Given a distribution p ∈ R^U on the states of the graph, it is natural to ask what is the probability of being in a "good" state. Clearly, this is the quantity ∥pW∥_1. Thus, if we are interested in the probability of a specific pattern, then we should start with the initial distribution p^(0), truncate away the coordinates that represent an invalid state, apply the transition matrix, again truncate away forbidden coordinates, and repeat in this fashion till we exhaust the pattern. Clearly, the ℓ_1-norm of the resulting vector is the probability of this pattern. To this end, given a pattern b_0, . . . , b_ν, let S = ⟨S_0, . . . , S_ν⟩ denote the corresponding sequence of "truncating" matrices (i.e., S_i is either W or W̄). Formally, we set S_i = W if Alg(x, r_i) returns the correct answer, and set S_i = W̄ otherwise. The above argument implies the following lemma.

Lemma 34.3.4. For any fixed pattern b_0, . . . , b_ν, the probability of the random walk to generate this pattern of random strings is ∥p^(0) S_0 B S_1 · · · B S_ν∥_1, where S = ⟨S_0, . . . , S_ν⟩ is the sequence of W and W̄ encoded by this pattern.

Theorem 34.3.5. The probability that the majority of the outputs Alg(x, r_0), Alg(x, r_1), . . . , Alg(x, r_ν) is incorrect is at most 1/2^k.

Proof: The majority is wrong only if (at least) half the elements of the sequence S = ⟨S_0, . . . , S_ν⟩ are W̄. Fix such a "bad" sequence S, and observe that the distributions we work with are vectors in R^U. As such, if p^(0) is the initial distribution, then we have that

    P[S] = ∥p^(0) S_0 B S_1 · · · B S_ν∥_1 ≤ √U ∥p^(0) S_0 B S_1 · · · B S_ν∥_2 ≤ √U (1/5)^{ν/2} ∥p^(0)∥_2,

by Lemma 34.3.6 below (i.e., the Cauchy-Schwarz inequality) and by repeatedly applying Lemma 34.3.3, since half of the sequence S are W̄, and the rest are W. The distribution p^(0) is uniform, which implies that ∥p^(0)∥_2 = 1/√U. As such, letting S be the set of all bad patterns (there are at most 2^ν such "bad" patterns), we have

    P[majority is bad] ≤ 2^ν √U (1/5)^{ν/2} ∥p^(0)∥_2 = 2^ν (1/5)^{ν/2} = (4/5)^{ν/2} = (4/5)^{αk/2} ≤ 1/2^k,

for α = 7. ■

34.3.2. Some standard inequalities



Lemma 34.3.6. For any vector v = (v_1, . . . , v_d) ∈ R^d, we have that ∥v∥_1 ≤ √d ∥v∥_2.

Proof: We can safely assume all the coordinates of v are positive. Now,

    ∥v∥_1 = Σ_{i=1}^d v_i = Σ_{i=1}^d v_i · 1 = |v · (1, 1, . . . , 1)| ≤ (Σ_{i=1}^d v_i²)^{1/2} (Σ_{i=1}^d 1²)^{1/2} = √d ∥v∥_2,

by the Cauchy-Schwarz inequality. ■

References
[GG81] O. Gabber and Z. Galil. Explicit constructions of linear-sized superconcentrators. J. Comput.
Syst. Sci., 22(3): 407–420, 1981.

Chapter 35

Complexity classes
“I’m a simple man, a guileless man,” Panin answered. “There is a norm. The norm is five gravities. My simple, uncomplicated organism cannot bear anything exceeding the norm. My organism tried six once, and got carried out at six minutes some seconds. With me along.”

Almost the same, Arkady and Boris Strugatsky


35.1. Las Vegas and Monte Carlo algorithms
Definition 35.1.1. A Las Vegas algorithm is a randomized algorithm that always returns the correct result. The only thing that varies between executions is its running time.

An example of a Las Vegas algorithm is the QuickSort algorithm.
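A minimal sketch of randomized QuickSort (illustration only): the returned list is always correctly sorted; only the pivot choices, and hence the running time, are random.

```python
import random

def quicksort(arr, rng=random.Random(4)):
    """Las Vegas QuickSort: the output is always correctly sorted; only the
    pivot choices (and hence the running time) are random."""
    if len(arr) <= 1:
        return list(arr)
    pivot = rng.choice(arr)
    less = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    greater = [x for x in arr if x > pivot]
    return quicksort(less, rng) + equal + quicksort(greater, rng)

print(quicksort([5, 3, 8, 1, 9, 2, 7]))  # [1, 2, 3, 5, 7, 8, 9]
```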


Definition 35.1.2. A Monte Carlo algorithm is a randomized algorithm that might output an incorrect result.
However, the probability of error can be diminished by repeated executions of the algorithm.

The matrix multiplication algorithm is an example of a Monte Carlo algorithm.
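The reference here is presumably to verifying a matrix product, i.e., Freivalds' technique: to check whether AB = C, multiply both sides by a random 0/1 vector, which is much cheaper than recomputing AB. A sketch (illustrative, not from the notes):

```python
import random

def freivalds(A, B, C, reps, rng):
    """Monte Carlo check of whether AB = C: pick a random 0/1 vector r and
    test A(Br) = Cr, in O(n^2) time per round. If AB != C, a single round
    misses with probability <= 1/2, so `reps` rounds drive the error
    probability below 2^-reps."""
    n = len(A)
    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    for _ in range(reps):
        r = [rng.randrange(2) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False     # certainly AB != C
    return True              # probably AB = C

rng = random.Random(5)
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]      # the true product AB
wrong = [[19, 23], [43, 50]]  # one corrupted entry
print(freivalds(A, B, C, 20, rng))      # True
print(freivalds(A, B, wrong, 20, rng))  # False (w.h.p.)
```

Repeating the test is exactly the "diminish the error by repeated executions" feature of Monte Carlo algorithms from the definition above.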

35.2. Complexity classes


I assume people know what Turing machines, NP, NPC, RAM machines, the uniform model, the logarithmic model, PSPACE, and EXP are. If you do not know what those things are, you should read about them. Some of that is covered in the randomized algorithms book, and some other stuff is covered in any basic text on complexity theory¬.
Definition 35.2.1. The class P consists of all languages L that have a polynomial time algorithm Alg, such that for any input x ∈ Σ∗, we have
(A) x ∈ L ⇒ Alg(x) accepts,
(B) x ∉ L ⇒ Alg(x) rejects.

Definition 35.2.2. The class NP consists of all languages L that have a polynomial time algorithm Alg, such that for any input x ∈ Σ∗, we have:
(i) If x ∈ L, then ∃y ∈ Σ∗ such that Alg(x, y) accepts, where |y| (i.e., the length of y) is bounded by a polynomial in |x|.
¬ There is also the internet.

(ii) If x ∉ L, then ∀y ∈ Σ∗, Alg(x, y) rejects.
Definition 35.2.3. For a complexity class C, we define the complementary class co−C as the set of languages whose complement is in the class C. That is,

co−C = { L̄ | L ∈ C },

where L̄ = Σ∗ \ L.
It is obvious that P = co−P and P ⊆ NP ∩ co−NP. (It is currently unknown if P = NP ∩ co−NP or whether
NP = co−NP, although both statements are believed to be false.)
Definition 35.2.4. The class RP (for Randomized Polynomial time) consists of all languages L that have a randomized algorithm Alg with worst case polynomial running time, such that for any input x ∈ Σ∗, we have:
(i) If x ∈ L, then P[Alg(x) accepts] ≥ 1/2.
(ii) If x ∉ L, then P[Alg(x) accepts] = 0.
An RP algorithm is a Monte Carlo algorithm, but this algorithm can make a mistake only if x ∈ L. As such, co−RP is the class of all languages that have a Monte Carlo algorithm that makes a mistake only if x ∉ L. A problem which is in RP ∩ co−RP has an algorithm that does not make a mistake, namely a Las Vegas algorithm.
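To make the amplification claim concrete, here is a toy sketch: repeating a one-sided-error test k times, and accepting if any run accepts, drives the error on yes-instances down to 2^(−k). The membership test below is artificial, chosen only to exhibit the RP-style one-sided error:

```python
import random

random.seed(1)

def rp_once(x_in_L):
    """Toy one-sided test: accepts a yes-instance with probability 1/2,
    and never accepts a no-instance (the RP guarantee)."""
    return x_in_L and random.random() < 0.5

def rp_amplified(x_in_L, k=30):
    """Accept if any of k independent runs accepts.
    The error probability on yes-instances drops to 2**-k."""
    return any(rp_once(x_in_L) for _ in range(k))

assert rp_amplified(True)       # yes-instance: rejected with prob. 2**-30
assert not rp_amplified(False)  # no-instance: never accepted
```

This is exactly the amplification that is unavailable for PP, where the advantage over 1/2 may shrink with the input size.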
Definition 35.2.5. The class ZPP (for Zero-error Probabilistic Polynomial time) is the class of languages that
have a Las Vegas algorithm that runs in expected polynomial time.
Definition 35.2.6. The class PP (for Probabilistic Polynomial time) is the class of languages that have a randomized algorithm Alg, with worst case polynomial running time, such that for any input x ∈ Σ∗, we have:
(i) If x ∈ L, then P[Alg(x) accepts] > 1/2.
(ii) If x ∉ L, then P[Alg(x) accepts] < 1/2.
The class PP is not very useful. Why?
Exercise 35.2.7. Provide a PP algorithm for 3SAT.
Consider the mind-bogglingly stupid randomized algorithm that returns either yes or no, each with probability half. This algorithm is almost in PP, as it returns the correct answer with probability half. An algorithm in PP needs to be slightly better, and be correct with probability better than half. However, how much better it is can be arbitrarily close to 1/2. In particular, there is no way to do effective amplification with such an algorithm.
Definition 35.2.8. The class BPP (for Bounded-error Probabilistic Polynomial time) is the class of languages that have a randomized algorithm Alg with worst case polynomial running time, such that for any input x ∈ Σ∗, we have:
(i) If x ∈ L, then P[Alg(x) accepts] ≥ 3/4.
(ii) If x ∉ L, then P[Alg(x) accepts] ≤ 1/4.

35.3. Bibliographical notes


Section 35.1 follows [MR95, Section 1.5].

References
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.

Chapter 36

Backwards analysis

The idea of backwards analysis (or backward analysis) is a technique to analyze randomized algorithms by imagining that they run backwards in time, from output to input. Most of the more interesting applications of backwards analysis are in Computational Geometry, but nevertheless, there are some other applications that are interesting, and we survey some of them here.

36.1. How many times can the minimum change?


Let Π = π₁ . . . πₙ be a random permutation of {1, . . . , n}. Let Ei be the event that πi is the minimum number seen so far as we read Π; that is, Ei is the event that πi = min_{k=1}^{i} πk. Let Xi be the indicator variable that is one if Ei happens. We have already seen, and it is easy to verify, that E[Xi] = 1/i. We are interested in how many times the minimum might change¬; that is, in Z = Σ_i Xi, and in how concentrated the distribution of Z is. The following is maybe surprising.
Lemma 36.1.1. The events E1 , . . . , En are independent (as such, variables X1 , . . . , Xn are independent).
Proof: The trick is to think about the sampling process in a different way, and then the result readily follows. Indeed, we randomly pick a permutation of the given numbers, and set the first number to be πₙ. We then, again, pick a random permutation of the remaining numbers and set the first number as the penultimate number (i.e., πₙ₋₁) in the output permutation. We repeat this process till we generate the whole permutation.

Now, consider 1 ≤ i₁ < i₂ < . . . < i_k ≤ n, and observe that P[E_{i₁} | E_{i₂} ∩ . . . ∩ E_{i_k}] = P[E_{i₁}], since by our thought experiment, E_{i₁} is determined after all the other events E_{i₂}, . . . , E_{i_k}. In particular, the event E_{i₁} is inherently not affected by these events happening or not. As such, we have

P[E_{i₁} ∩ E_{i₂} ∩ . . . ∩ E_{i_k}] = P[E_{i₁} | E_{i₂} ∩ . . . ∩ E_{i_k}] · P[E_{i₂} ∩ . . . ∩ E_{i_k}] = P[E_{i₁}] · P[E_{i₂} ∩ . . . ∩ E_{i_k}] = Π_{j=1}^{k} P[E_{i_j}] = Π_{j=1}^{k} 1/i_j,

by induction. ■
Theorem 36.1.2. Let Π = π₁ . . . πₙ be a random permutation of 1, . . . , n, and let Z be the number of times that πi is the smallest number among π₁, . . . , πi, for i = 1, . . . , n. Then, for t ≥ 2e, we have that P[Z > t ln n] ≤ 1/n^{t ln 2}, and for t ∈ (1, 2e), we have that P[Z > t ln n] ≤ 1/n^{(t−1)²/4}.

¬ The answer, my friend, is blowing in the permutation.

Proof: Follows readily from Chernoff’s inequality, as Z = Σ_i Xi is a sum of independent indicator variables, and, by linearity of expectation, we have

μ = E[Z] = Σ_i E[Xi] = Σ_{i=1}^{n} 1/i ≥ ∫_{1}^{n+1} dx/x = ln(n + 1) ≥ ln n.

Next, we set δ = t − 1, and use Chernoff’s inequality. ■
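A quick simulation of the quantity Z analyzed above (the choice of n and the number of trials is illustrative): the number of prefix-minimum changes of a random permutation concentrates around E[Z] = H_n = 1 + 1/2 + · · · + 1/n, which lies between ln n and ln n + 1:

```python
import random
from math import log

random.seed(0)

def count_min_changes(n):
    """Number of times the prefix minimum of a random permutation changes."""
    perm = list(range(n))
    random.shuffle(perm)
    changes, cur_min = 0, float("inf")
    for x in perm:
        if x < cur_min:          # the event E_i: a new prefix minimum
            cur_min = x
            changes += 1
    return changes

n, trials = 1000, 200
avg = sum(count_min_changes(n) for _ in range(trials)) / trials
# E[Z] = H_n, which satisfies ln n <= H_n <= ln n + 1.
assert log(n) - 1 < avg < log(n) + 2
```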

36.2. Computing a good ordering of the vertices of a graph


Let G = (V, E) be an edge-weighted graph with n vertices and m edges. The task is to compute an ordering π = ⟨π₁, . . . , πₙ⟩ of the vertices, and for every vertex v ∈ V, the list of vertices Lv, such that πi ∈ Lv if πi is the closest vertex to v in the ith prefix ⟨π₁, . . . , πi⟩.
This situation can arise, for example, in a streaming scenario, where we install servers in a network. In the ith stage there are i servers installed, and every client in the network wants to know its closest server. As we install more and more servers (ultimately, every node is going to be a server), each client needs to maintain its current closest server.
The purpose is to minimize the total size of these lists L = Σ_{v∈V} |Lv|.

36.2.1. The algorithm


Take a random permutation π1 , . . . , πn of the vertices V of G. Initially, we set δ(v) = +∞, for all v ∈ V.
In the ith iteration, set δ(πi ) to 0, and start Dijkstra from the ith vertex πi . The Dijkstra propagates only if
it improves the current distance associated with a vertex. Specifically, in the ith iteration, we update δ(u) to
dG (πi , u) if and only if dG (πi , u) < δ(u) before this iteration started. If δ(u) is updated, then we add πi to Lu .
Note that this Dijkstra propagation process might visit only small portions of the graph in some iterations, since it improves the current distance for only a few vertices.
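The process above can be sketched as follows, under an assumed adjacency-list representation of the graph (the example graph and names are illustrative; this sketch uses a binary heap with lazy deletion rather than Fibonacci heaps):

```python
import heapq
import random

def good_ordering(adj, seed=0):
    """adj: {v: [(u, w), ...]}, an undirected weighted graph.
    Returns the random permutation pi and the lists L_v of Section 36.2."""
    rnd = random.Random(seed)
    perm = list(adj)
    rnd.shuffle(perm)
    delta = {v: float("inf") for v in adj}
    L = {v: [] for v in adj}
    for s in perm:                       # the i-th round starts at pi_i
        delta[s] = 0.0
        L[s].append(s)
        heap = [(0.0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > delta[u]:
                continue                 # stale queue entry
            for v, w in adj[u]:
                if d + w < delta[v]:     # propagate only on improvement
                    delta[v] = d + w
                    if not L[v] or L[v][-1] != s:
                        L[v].append(s)
                    heapq.heappush(heap, (d + w, v))
    return perm, L

# A path a - b - c with unit edge weights.
adj = {"a": [("b", 1)], "b": [("a", 1), ("c", 1)], "c": [("b", 1)]}
perm, L = good_ordering(adj)
# Once v itself is installed, it is its own closest server forever.
assert all(L[v][-1] == v for v in adj)
```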

36.2.2. Analysis
Lemma 36.2.1. The above algorithm computes a permutation π, such that E[|L|] = O(n log n), and the expected running time of the algorithm is O((n log n + m) log n), where n = |V(G)| and m = |E(G)|. Note that both bounds also hold with high probability.

Proof: Fix a vertex v ∈ V = {v₁, . . . , vₙ}. Consider the set of n numbers {dG(v, v₁), . . . , dG(v, vₙ)}. Clearly, dG(v, π₁), . . . , dG(v, πₙ) is a random permutation of this set, and by Lemma 36.1.1 the random permutation π changes this minimum O(log n) times in expectation (and also with high probability). This readily implies that |Lv| = O(log n), both in expectation and with high probability.
The more interesting claim is the running time. Consider an edge uv ∈ E(G), and observe that δ(u) or δ(v)
changes O(log n) times. As such, an edge gets visited O(log n) times, which implies overall running time of
O(n log2 n + m log n), as desired.
Indeed, overall there are O(n log n) changes in the value of δ(·). Each such change might require one delete-min operation from the queue, which takes O(log n) time. Every edge, by the above, might trigger O(log n) decrease-key operations. Using Fibonacci heaps, each such operation takes O(1) time. ■

36.3. Computing nets
36.3.1. Basic definitions
Definition 36.3.1. A metric space is a pair (X, d) where X is a set and d : X × X → [0, ∞) is a metric satisfying
the following axioms: (i) d(x, y) = 0 if and only if x = y, (ii) d(x, y) = d(y, x), and (iii) d(x, y) + d(y, z) ≥ d(x, z)
(triangle inequality).

For example, R2 with the regular Euclidean distance is a metric space. In the following, we assume that we are given black-box access to d. Namely, given two points p, u ∈ X, we assume that d(p, u) can be computed in constant time.
Another standard example for a finite metric space is a graph G with non-negative weights ω(·) defined
on its edges. Let dG (x, y) denote the shortest path (under the given weights) between any x, y ∈ V(G). It is
easy to verify that dG (·, ·) is a metric. In fact, any finite metric (i.e., a metric defined over a finite set) can be
represented by such a weighted graph.

36.3.1.1. Nets
Definition 36.3.2. For a point set P in a metric space with a metric d, and a parameter r > 0, an r-net of P is a subset C ⊆ P, such that
(i) for every p, u ∈ C, p ≠ u, we have that d(p, u) ≥ r, and
(ii) for all p ∈ P, we have that min_{u∈C} d(p, u) < r.

Intuitively, an r-net represents P in resolution r.

36.3.2. Computing an r-net in a sparse graph


Let G = (V, E) be an edge-weighted graph with n vertices and m edges, and let r > 0 be a parameter. We are interested in the problem of computing an r-net for G. That is, a set of vertices of G that complies with Definition 36.3.2p227 .

36.3.2.1. The algorithm


We compute an r-net in a sparse graph using a variant of Dijkstra’s algorithm with the sequence of starting
vertices chosen in a random permutation.
Let πi be the ith vertex in a random permutation π of V. For each vertex v we initialize δ(v) to +∞. In the
ith iteration, we test whether δ(πi ) ≥ r, and if so we do the following steps:
(A) Add πi to the resulting net N.
(B) Set δ(πi ) to zero.
(C) Perform Dijkstra’s algorithm starting from πi , modified to avoid adding a vertex u to the priority queue
unless its tentative distance is smaller than the current value of δ(u). When such a vertex u is expanded,
we set δ(u) to be its computed distance from πi , and relax the edges adjacent to u in the graph.
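Steps (A)–(C) can be sketched as follows (same assumed adjacency-list representation; the example graph and r are illustrative):

```python
import heapq
import random

def r_net(adj, r, seed=0):
    """Compute an r-net of the weighted graph adj = {v: [(u, w), ...]}.
    Returns the net and the final distances delta to the net."""
    rnd = random.Random(seed)
    perm = list(adj)
    rnd.shuffle(perm)
    delta = {v: float("inf") for v in adj}
    net = []
    for s in perm:
        if delta[s] < r:
            continue                    # s is already covered by the net
        net.append(s)                   # step (A)
        delta[s] = 0.0                  # step (B)
        heap = [(0.0, s)]               # step (C): pruned Dijkstra
        while heap:
            d, u = heapq.heappop(heap)
            if d > delta[u]:
                continue                # stale queue entry
            for v, w in adj[u]:
                if d + w < delta[v]:    # propagate only on improvement
                    delta[v] = d + w
                    heapq.heappush(heap, (d + w, v))
    return net, delta

# Path a - b - c - d with unit weights, and r = 2.
adj = {"a": [("b", 1)], "b": [("a", 1), ("c", 1)],
       "c": [("b", 1), ("d", 1)], "d": [("c", 1)]}
net, delta = r_net(adj, 2)
pos = {"a": 0, "b": 1, "c": 2, "d": 3}
assert all(delta[v] < 2 for v in adj)                        # covering
assert all(abs(pos[x] - pos[y]) >= 2                         # packing
           for x in net for y in net if x != y)
```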

36.3.2.2. Analysis
While the analysis here does not directly use backwards analysis, it is inspired to a large extent by such an analysis, as in Section 36.2p226 .

Lemma 36.3.3. The set N is an r-net in G.

Proof: By the end of the algorithm, each v ∈ V has δ(v) < r, for δ(v) is monotonically decreasing, and if it
were larger than r when v was visited then v would have been added to the net.
An induction shows that if ℓ = δ(v), for some vertex v, then the distance of v to the set N is at most ℓ.
Indeed, for the sake of contradiction, let j be the (end of) the first iteration where this claim is false. It must be
that π j ∈ N, and it is the nearest vertex in N to v. But then, consider the shortest path between π j and v. The
modified Dijkstra must have visited all the vertices on this path, thus computing δ(v) correctly at this iteration,
which is a contradiction.
Finally, observe that every two points in N have distance ≥ r. Indeed, when the algorithm handles vertex
v ∈ N, its distance from all the vertices currently in N is ≥ r, implying the claim. ■

Lemma 36.3.4. Consider an execution of the algorithm, and any vertex v ∈ V. The expected number of times
the algorithm updates the value of δ(v) during its execution is O(log n), and more strongly the number of
updates is O(log n) with high probability.

Proof: For simplicity of exposition, assume all distances in G are distinct. Let Si be the set of all the vertices x ∈ V, such that the following two properties both hold:
(A) dG(x, v) < dG(v, Πi), where Πi = {π₁, . . . , πi}.
(B) If πi+1 = x, then δ(v) would change in the (i + 1)th iteration.
Let si = |Si|. Observe that S₁ ⊇ S₂ ⊇ · · · ⊇ Sₙ, and |Sₙ| = 0.


In particular, let Ei+1 be the event that δ(v) changed in iteration (i + 1) – we will refer to such an iteration
as being active. If iteration (i + 1) is active then one of the points of S i is πi+1 . However, πi+1 has a uniform
distribution over the vertices of S i , and in particular, if Ei+1 happens then si+1 ≤ si /2, with probability at least
half, and we will refer to such an iteration as being lucky. (It is possible that si+1 < si even if Ei+1 does not
happen, but this is only to our benefit.) After O(log n) lucky iterations the set S i is empty, and we are done.
Clearly, if both the ith and jth iteration are active, the events that they are each lucky are independent of each
 
other. By the Chernoff inequality, after c log n active iterations, at least log2 n iterations were lucky with high
probability, implying the claim. Here c is a sufficiently large constant. ■

Interestingly, in the above proof, all we used was the monotonicity of the sets S₁, . . . , Sₙ, and that if δ(v) changes in an iteration, then the size of the set Si shrinks by a constant factor with good probability in this iteration. This implies that there is some flexibility in deciding whether or not to initiate Dijkstra's algorithm from each vertex of the permutation, without damaging the bound on the number of times the values of δ(v) are updated.

Theorem 36.3.5. Given a graph G = (V, E), with n vertices and m edges, the above algorithm computes an
r-net of G in O((n log n + m) log n) expected time.

Proof: By Lemma 36.3.4, the two δ values associated with the endpoints of an edge get updated O(log n)
times, in expectation, during the algorithm’s execution. As such, a single edge creates O(log n) decrease-key
operations in the heap maintained by the algorithm. Each such operation takes constant time if we use Fibonacci
heaps to implement the algorithm. ■

228
36.4. Bibliographical notes
Backwards analysis was invented/discovered by Raimund Seidel, and the QuickSort example is taken from
Seidel [Sei93]. The number of changes of the minimum result of Section 36.1 is by now folklore.
The good ordering of Section 36.2 is probably also folklore, although a similar idea was used by Mendel
and Schwob [MS09] for a different problem.
Computing a net in a sparse graph, Section 36.3.2, is from [EHS14]. While backwards analysis fails to hold in this case, it provides a good intuition for the analysis, which is slightly more complicated and indirect.

References
[EHS14] D. Eppstein, S. Har-Peled, and A. Sidiropoulos. On the Greedy Permutation and Counting Dis-
tances. manuscript. 2014.
[MS09] M. Mendel and C. Schwob. Fast c-k-r partitions of sparse graphs. Chicago J. Theor. Comput.
Sci., 2009, 2009.
[Sei93] R. Seidel. Backwards analysis of randomized geometric algorithms. New Trends in Discrete and
Computational Geometry. Ed. by J. Pach. Vol. 10. Algorithms and Combinatorics. Springer-
Verlag, 1993, pp. 37–68.

Chapter 37

Multiplicative Weight Update: Expert Selection


Possession of anything new or expensive only reflected a person’s lack of theology and geometry; it could even cast doubts upon one’s soul.

A Confederacy of Dunces, John Kennedy Toole

37.1. The problem: Expert selection


We are given N experts JNK = {1, 2, . . . , N}. At each time t, each expert i makes a prediction of what is going to happen at this time slot. To make things simple, assume the prediction is one of two values, say, 0 or 1. You are going to play this game for a while: at each iteration you are going to get the advice of the N experts, and you are going to select one of the two decisions as your own prediction. The purpose here is to come up with a strategy that minimizes the overall number of wrong predictions made.

Assume first that there is an expert that is never wrong. This situation is easy. Initially, start with all N experts as being viable; to this end, we assign W(i) ← 1, for all i. If an expert's prediction turns out to be wrong, we set its weight to zero (i.e., it is no longer active). Clearly, if you follow the majority vote of the still viable experts, then at most log₂ N mistakes would be made before one isolates the infallible expert.

37.2. Majority vote


The algorithm. Unfortunately, we are unlikely to be in the above scenario: experts make mistakes. Throwing away an expert because of a single mistake is a sure way to have no expert remaining. Instead, we are going to moderate our strategy. If expert i is wrong in a round, we are going to decrease its weight; to be precise, we set W(i) ← (1 − ε)W(i), where ε is some parameter. Note that this weight update is done every round, independently of the decision output in the round. It is now natural, in each round, to compute the total weight of the experts predicting 0, and the total weight of the experts predicting 1, and return the prediction that has the heavier total weight supporting it.

Intuition. The algorithm keeps track of the quality of the experts. The useless experts would have weights
very close to zero.
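The weighted majority strategy above can be sketched as follows (the experts, the outcome sequence, and the choice ε = 0.1 are all illustrative):

```python
import math

def weighted_majority(predictions, outcomes, eps=0.1):
    """predictions[t][i]: the 0/1 prediction of expert i at time t.
    Returns the number of mistakes the weighted majority vote makes."""
    n = len(predictions[0])
    w = [1.0] * n
    mistakes = 0
    for preds, truth in zip(predictions, outcomes):
        weight_one = sum(wi for wi, p in zip(w, preds) if p == 1)
        guess = 1 if 2 * weight_one >= sum(w) else 0
        if guess != truth:
            mistakes += 1
        # every wrong expert is penalized, regardless of our own guess
        w = [wi * (1 - eps) if p != truth else wi
             for wi, p in zip(w, preds)]
    return mistakes

# Expert 0 is always right, expert 1 always wrong, expert 2 predicts 1.
outcomes = [1, 0] * 10
predictions = [[o, 1 - o, 1] for o in outcomes]
m = weighted_majority(predictions, outcomes)
# Lemma 37.2.2 with beta_T(0) = 0: m <= 2 ln(3)/eps, roughly 22.
assert m <= 2 * math.log(3) / 0.1
```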

Analysis. We need the following easy calculation.

Lemma 37.2.1. For x ∈ [0, 1/2], we have 1 − x ≥ exp(−x − x²).

Proof: For x ∈ (−1, 1), the Taylor expansion of ln(1 + x) is Σ_{i=1}^{∞} (−1)^{i+1} x^i/i. As such, for x ∈ [0, 1/2] we have

ln(1 − x) = −Σ_{i=1}^{∞} x^i/i = −x − x²/2 − x³/3 − · · · ≥ −x − x²,

since x²/2 + Σ_{i=1}^{∞} x^{2+i}/(2 + i) ≤ x²/2 + Σ_{i=1}^{∞} x²/2^{i+1} = x², as x^{2+i}/(2 + i) ≤ x² x^i/2 ≤ x²/2^{i+1} for x ∈ [0, 1/2]. ■
Lemma 37.2.2. Assume we have N experts. Let βt be the number of mistakes the algorithm performs, and let βt(i) be the number of mistakes made by the ith expert, for i ∈ JNK (both till time t). Then, if we run this algorithm for T rounds, we have

∀i ∈ JNK: βT ≤ 2(1 + ε)βT(i) + (2 ln N)/ε.

Proof: Let Φt be the total weight of the experts at the beginning of round t. Observe that Φ₁ = N, and if a mistake was made in the tth round, then at least half the total weight belonged to experts that predicted incorrectly, and this weight got scaled by 1 − ε. As such,

Φt+1 ≤ (1 − ε/2)Φt, which implies Φt+1 ≤ exp(−ε βt+1/2) N.

On the other hand, expert i made βt(i) mistakes in the first t rounds, and as such its weight, at this point in time, is (1 − ε)^{βt(i)}. We thus have, at time T, and for any i, that

exp(−(ε + ε²) βT(i)) ≤ (1 − ε)^{βT(i)} ≤ ΦT ≤ exp(−ε βT/2) N.

Taking ln of both sides, we have −(ε + ε²) βT(i) ≤ −ε βT/2 + ln N ⟺ βT ≤ 2(1 + ε) βT(i) + (2 ln N)/ε. ■

37.3. Randomized weighted majority


Let Wt(i) be the weight assigned to the ith expert at the beginning of the tth round. We modify the algorithm to choose expert i, at round t, with probability Wt(i)/Φt. That is, the algorithm randomly chooses an expert to follow according to the weights. As before, all the experts that are wrong in a round get a weight decrease.
Lemma 37.3.1. For any i ∈ JNK, the randomized algorithm satisfies E[βT] ≤ (1 + ε)βT(i) + (ln N)/ε.

Proof: We have that Φt = Σ_{i=1}^{N} Wt(i). Let mt(i) be an indicator variable that is one if and only if expert i made a mistake at round t. Similarly, let mt = 1 if and only if the algorithm made a mistake at round t. By definition, we have that

E[mt] = Σ_{i=1}^{N} P[ith expert chosen] · mt(i) = Σ_{i=1}^{N} (Wt(i)/Φt) · mt(i).

We then have that

Wt+1(i) = (1 − ε mt(i)) Wt(i).

As such, we have Φt+1 = Σ_{i=1}^{N} Wt+1(i), and

Φt+1 = Σ_{i=1}^{N} (1 − ε mt(i)) Wt(i) = Φt − ε Σ_{i=1}^{N} mt(i) Wt(i) = Φt − ε Φt Σ_{i=1}^{N} (Wt(i)/Φt) mt(i) = (1 − ε E[mt]) Φt.

We now follow the same argument as before:

(1 − ε)^{βT(i)} ≤ ΦT ≤ N Π_{t=1}^{T} (1 − ε E[mt]) ≤ N exp(−ε E[βT]) ⟹ −(ε + ε²) βT(i) ≤ ln N − ε E[βT] ⟹ E[βT] ≤ (1 + ε) βT(i) + (ln N)/ε. ■
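The randomized rule can be sketched as follows (the setup is illustrative): in each round a single expert is followed, drawn with probability Wt(i)/Φt, and every wrong expert is penalized.

```python
import random

def randomized_weighted_majority(predictions, outcomes, eps=0.1, seed=0):
    """Follow a random expert, chosen with probability W_t(i)/Phi_t;
    every wrong expert is penalized by a (1 - eps) factor."""
    rnd = random.Random(seed)
    n = len(predictions[0])
    w = [1.0] * n
    mistakes = 0
    for preds, truth in zip(predictions, outcomes):
        i = rnd.choices(range(n), weights=w)[0]
        if preds[i] != truth:
            mistakes += 1
        w = [wi * (1 - eps) if p != truth else wi
             for wi, p in zip(w, preds)]
    return mistakes

# Expert 0 is perfect, expert 1 always wrong, expert 2 predicts 1.
outcomes = [1, 0] * 50
predictions = [[o, 1 - o, 1] for o in outcomes]
m = randomized_weighted_majority(predictions, outcomes)
# The bound E[beta_T] <= (1 + eps) * 0 + ln(3)/eps is roughly 11.
assert m <= 50
```

Note how the factor 2 of the deterministic bound disappears: the algorithm's expected loss tracks the weighted average loss, not the majority.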

37.4. Bibliographical notes

Chapter 38

On Complexity, Sampling, and ε-Nets and ε-Samples
“I’ve never touched the hard stuff, only smoked grass a few times with the boys to be polite, and that’s all, though ten is the age when the big guys come around teaching you all sorts of things. But happiness doesn’t mean much to me, I still think life is better. Happiness is a mean son of a bitch and needs to be put in his place. Him and me aren’t on the same team, and I’m cutting him dead. I’ve never gone in for politics, because somebody always stands to gain by it, but happiness is an even crummier racket, and there ought to be laws to put it out of business.”

Momo, Emile Ajar

In this chapter we will try to quantify the notion of geometric complexity. It is intuitively clear that a disk is a simpler shape than an ellipse, which is in turn simpler than a smiley. This becomes even more important when we consider several such shapes and how they interact with each other. As these examples might demonstrate, this notion of complexity is somewhat elusive.
To this end, we show that one can capture the structure of a distribution/point set by a small subset. The
size here would depend on the complexity of the shapes/ranges we care about, but surprisingly it would be
independent of the size of the point set.

38.1. VC dimension
Definition 38.1.1. A range space S is a pair (X, R), where X is a ground set (finite or infinite) and R is a (finite
or infinite) family of subsets of X. The elements of X are points and the elements of R are ranges.

Our interest is in the size/weight of the ranges in the range space. For technical reasons, it will be easier to consider a finite subset x as the underlying ground set.

Definition 38.1.2. Let S = (X, R) be a range space, and let x be a finite (fixed) subset of X. For a range r ∈ R,
its measure is the quantity
m(r) = |r ∩ x| / |x|.

While x is finite, it might be very large. As such, we are interested in getting a good estimate to m(r) by
using a more compact set to represent the range space.

Definition 38.1.3. Let S = (X, R) be a range space. For a subset N (which might be a multi-set) of x, its estimate of the measure m(r), for r ∈ R, is the quantity

s(r) = |r ∩ N| / |N|.
The main purpose of this chapter is to come up with methods to generate a sample N, such that m(r) ≈ s(r),
for all the ranges r ∈ R.
It is easy to see that in the worst case, no sample can capture the measure of all ranges. Indeed, given a
sample N, consider the range x \ N that is being completely missed by N. As such, we need to concentrate
on range spaces that are “low dimensional”, where not all subsets are allowable ranges. The notion of VC
dimension (named after Vapnik and Chervonenkis [VC71]) is one way to limit the complexity of a range space.
Definition 38.1.4. Let S = (X, R) be a range space. For Y ⊆ X, let

R|Y = { r ∩ Y | r ∈ R }  (38.1)

denote the projection of R on Y. The range space S projected to Y is S|Y = (Y, R|Y).
If R|Y contains all subsets of Y (i.e., if Y is finite, we have |R|Y| = 2^{|Y|}), then Y is shattered by R (or equivalently, Y is shattered by S).
The Vapnik–Chervonenkis dimension (or VC dimension) of S, denoted by dimVC(S), is the maximum cardinality of a shattered subset of X. If there are arbitrarily large shattered subsets, then dimVC(S) = ∞.

38.1.1. Examples
Intervals. Consider the set X to be the real line, and consider R to be the set of all intervals on the real line. Consider the set Y = {1, 2}. Clearly, one can find four intervals that contain all possible subsets of Y. Formally, the projection R|Y = {{ }, {1}, {2}, {1, 2}}.
However, this is false for a set of three points B = {p, u, v}, since there is no interval that can contain the two extreme points p and v without also containing u. Namely, the subset {p, v} is not realizable for intervals, implying that the largest set shattered by the range space (real line, intervals) is of size two. We conclude that the VC dimension of this space is two.
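The interval example can be checked mechanically; a small brute-force sketch, using the observation that a subset of points on the line is realizable by an interval exactly when it is a contiguous run in sorted order:

```python
def interval_projection(points):
    """All subsets of `points` realizable by some interval [a, b]:
    exactly the contiguous runs in sorted order, plus the empty set."""
    pts = sorted(points)
    realized = {frozenset()}
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            realized.add(frozenset(pts[i:j + 1]))
    return realized

def shattered(points):
    return len(interval_projection(points)) == 2 ** len(points)

assert shattered([1, 2])          # pairs are shattered ...
assert not shattered([1, 2, 3])   # ... but no triple is: {1, 3} is missed
```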
Disks. Let X = R², and let R be the set of disks in the plane. Clearly, for any three points in the plane (in general position), denoted by p, u, and v, one can find eight disks that realize all possible 2³ different subsets.
But can disks shatter a set with four points? Consider such a set P of four points. If the convex hull of P has only three points on its boundary, then the subset X having only those three vertices (i.e., it does not include the middle point) is impossible, by convexity. Namely, there is no disk that contains only the points of X without the middle point.
Alternatively, if all four points are vertices of the convex hull and they are a, b, c, d along the boundary of the convex hull, either the set {a, c} or the set {b, d} is not realizable. Indeed, if both options are realizable, then consider the two disks D₁ and D₂ that realize those assignments. Clearly, ∂D₁ and ∂D₂ must intersect in four points, but this is not possible, since two circles have at most two intersection points. Hence the VC dimension of this range space is 3.

Convex sets. Consider the range space S = (R², R), where R is the set of all (closed) convex sets in the plane. We claim that dimVC(S) = ∞. Indeed, consider a set U of n points p₁, . . . , pₙ all lying on the boundary of the unit circle in the plane. Let V be any subset of U, and consider the convex hull CH(V). Clearly, CH(V) ∈ R, and furthermore, CH(V) ∩ U = V. Namely, any subset of U is realizable by S. Thus, S can shatter sets of arbitrary size, and its VC dimension is unbounded.

Complement. Consider the range space S = (X, R) with δ = dimVC(S). Next, consider the complement space S̄ = (X, R̄), where

R̄ = { X \ r | r ∈ R }.

Namely, the ranges of S̄ are the complements of the ranges in S. What is the VC dimension of S̄? Well, a set B ⊆ X is shattered by S̄ if and only if it is shattered by S. Indeed, if S shatters B, then for any Z ⊆ B, we have that (B \ Z) ∈ R|B, which implies that Z = B \ (B \ Z) ∈ R̄|B. Namely, R̄|B contains all the subsets of B, and S̄ shatters B. Thus, dimVC(S̄) = dimVC(S).

Lemma 38.1.5. For a range space S = (X, R), we have that dimVC(S) = dimVC(S̄), where S̄ is the complement range space.

38.1.1.1. Halfspaces
Let S = (X, R), where X = Rd and R is the set of all (closed) halfspaces in Rd . We need the following technical
claim.

Claim 38.1.6. Let P = {p₁, . . . , p_{d+2}} be a set of d + 2 points in R^d. There are real numbers β₁, . . . , β_{d+2}, not all of them zero, such that Σ_i βi pi = 0 and Σ_i βi = 0.

Proof: Indeed, set ui = (pi, 1), for i = 1, . . . , d + 2. Now, the points u₁, . . . , u_{d+2} ∈ R^{d+1} are linearly dependent, and there are coefficients β₁, . . . , β_{d+2}, not all of them zero, such that Σ_{i=1}^{d+2} βi ui = 0. Considering only the first d coordinates of these points implies that Σ_{i=1}^{d+2} βi pi = 0. Similarly, by considering only the (d + 1)st coordinate of these points, we have that Σ_{i=1}^{d+2} βi = 0. ■

To see what the VC dimension of halfspaces in Rd is, we need the following result of Radon. (For a
reminder of the formal definition of convex hulls, see Definition 38.5.1.)

Theorem 38.1.7 (Radon’s theorem). Let P = {p1 , . . . , pd+2 } be a set of d + 2 points in Rd . Then, there exist
two disjoint subsets C and D of P, such that CH(C) ∩ CH(D) , ∅ and C ∪ D = P.
Proof: By Claim 38.1.6 there are real numbers β₁, . . . , β_{d+2}, not all of them zero, such that Σ_i βi pi = 0 and Σ_i βi = 0.
Assume, for the sake of simplicity of exposition, that β₁, . . . , β_k ≥ 0 and β_{k+1}, . . . , β_{d+2} < 0. Furthermore, let μ = Σ_{i=1}^{k} βi = −Σ_{i=k+1}^{d+2} βi. We have that

Σ_{i=1}^{k} βi pi = −Σ_{i=k+1}^{d+2} βi pi.

In particular, v = Σ_{i=1}^{k} (βi/μ) pi is a point in CH({p₁, . . . , p_k}). Furthermore, for the same point v we have v = Σ_{i=k+1}^{d+2} −(βi/μ) pi ∈ CH({p_{k+1}, . . . , p_{d+2}}). We conclude that v is in the intersection of the two convex hulls, as required. ■

The following is a trivial observation, and yet we provide a proof to demonstrate it is true.

Lemma 38.1.8. Let P ⊆ Rd be a finite set, let v be any point in CH(P), and let h+ be a halfspace of Rd
containing v. Then there exists a point of P contained inside h+ .
Proof: The halfspace h⁺ can be written as h⁺ = { x ∈ R^d | ⟨x, w⟩ ≤ c }, for some vector w and constant c. Now v ∈ CH(P) ∩ h⁺, and as such there are numbers α₁, . . . , α_m ≥ 0 and points p₁, . . . , p_m ∈ P, such that Σ_i αi = 1 and Σ_i αi pi = v. By the linearity of the dot product, we have that

v ∈ h⁺ ⟹ ⟨v, w⟩ ≤ c ⟹ ⟨Σ_{i=1}^{m} αi pi, w⟩ ≤ c ⟹ β = Σ_{i=1}^{m} αi ⟨pi, w⟩ ≤ c.

Setting βi = ⟨pi, w⟩, for i = 1, . . . , m, the above implies that β is a weighted average of β₁, . . . , β_m. In particular, there must be a βi that is no larger than the average. That is, βi ≤ c. This implies that ⟨pi, w⟩ ≤ c. Namely, pi ∈ h⁺, as claimed. ■

Let S be the range space having Rd as the ground set and all the closed halfspaces as ranges. Radon’s
theorem implies that if a set Q of d + 2 points is being shattered by S, then we can partition this set Q into
two disjoint sets Y and Z such that CH(Y) ∩ CH(Z) , ∅. In particular, let v be a point in CH(Y) ∩ CH(Z).
If a halfspace h+ contains all the points of Y, then CH(Y) ⊆ h+ , since a halfspace is a convex set. Thus, any
halfspace h+ containing all the points of Y will contain the point v ∈ CH(Y). But v ∈ CH(Z) ∩ h+ , and this
implies that a point of Z must lie in h+ , by Lemma 38.1.8. Namely, the subset Y ⊆ Q cannot be realized by a
halfspace, which implies that Q cannot be shattered. Thus dimVC (S ) < d + 2. It is also easy to verify that the
regular simplex with d + 1 vertices is shattered by S. Thus, dimVC (S ) = d + 1.

38.2. Shattering dimension and the dual shattering dimension


The main property of a range space with bounded VC dimension is that the number of ranges for a set of n
elements grows polynomially in n (with the power being the dimension) instead of exponentially. Formally, let
the growth function be
Gδ(n) = Σ_{i=0}^{δ} (n choose i) ≤ Σ_{i=0}^{δ} n^i/i! ≤ n^δ,  (38.2)
for δ > 1 (the cases where δ = 0 or δ = 1 are not interesting and we will just ignore them). Note that for all
n, δ ≥ 1, we have Gδ (n) = Gδ (n − 1) + Gδ−1 (n − 1)¬ .

Lemma 38.2.1 (Sauer’s lemma). If (X, R) is a range space of VC dimension δ with |X| = n, then |R| ≤ Gδ (n).

Proof: The claim trivially holds for δ = 0 or n = 0.


Let x be any element of X, and consider the sets

Rx = { r \ {x} | r ∪ {x} ∈ R and r \ {x} ∈ R }   and   R \ x = { r \ {x} | r ∈ R }.

Observe that |R| = |R x | + |R \ x|. Indeed, we charge the elements of R to their corresponding element in R \ x.
The only bad case is when there is a range r such that both r ∪ {x} ∈ R and r \ {x} ∈ R, because then these two
¬ Here is a cute (and standard) counting argument: Gδ (n) is just the number of different subsets of size at most δ out of n elements.
Now, we either decide to not include the first element in these subsets (i.e., Gδ (n − 1)) or, alternatively, we include the first element in
these subsets, but then there are only δ − 1 elements left to pick (i.e., Gδ−1 (n − 1)).

distinct ranges get mapped to the same range in R \ x. But such ranges contribute exactly one element to R x .
Similarly, every element of R x corresponds to two such “twin” ranges in R.
Observe that (X \ {x} , R x ) has VC dimension δ − 1, as the largest set that can be shattered is of size δ − 1.
Indeed, any set B ⊂ X \ {x} shattered by R x implies that B ∪ {x} is shattered in R.
Thus, we have
|R| = |R x | + |R \ x| ≤ Gδ−1 (n − 1) + Gδ (n − 1) = Gδ (n),
by induction. ■

Interestingly, Lemma 38.2.1 is tight.


Next, we show pretty tight bounds on Gδ (n). The proof is technical and not very interesting, and it is
delegated to Section 38.4.
Lemma 38.2.2. For n ≥ 2δ and δ ≥ 1, we have
\[
\left(\frac{n}{\delta}\right)^{\delta} \le G_\delta(n) \le 2\left(\frac{ne}{\delta}\right)^{\delta},
\qquad\text{where}\qquad
G_\delta(n) = \sum_{i=0}^{\delta} \binom{n}{i}.
\]

Definition 38.2.3 (Shatter function). Given a range space S = (X, R), its shatter function πS(m) is the maximum number of sets that might be created by S when restricted to subsets of size m. Formally,
\[
\pi_S(m) = \max_{\substack{B \subset X \\ |B| = m}} \left| R_{|B} \right|;
\]
see Eq. (38.1).
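As a quick illustration (code added here, not in the original notes), the shatter function of a small range space can be computed by brute force; for half-lines on the line one gets πS(m) = m + 1, matching a shattering dimension of 1.

```python
from itertools import combinations

def shatter_function(points, ranges, m):
    # pi_S(m): max number of distinct projections onto an m-point subset.
    best = 0
    for B in combinations(points, m):
        proj = {frozenset(r & set(B)) for r in ranges}
        best = max(best, len(proj))
    return best

# Range space of half-lines (-inf, a] over a small ground set.
X = list(range(8))
halflines = [frozenset(x for x in X if x <= a) for a in range(-1, 8)]

for m in range(1, 6):
    # Projections are exactly the prefixes of B (plus the empty set).
    assert shatter_function(X, halflines, m) == m + 1
```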

Our arch-nemesis in the following is the function x/ ln x. The following lemma states some properties of
this function, and its proof is left as an exercise.
Lemma 38.2.4. For the function f(x) = x/ln x the following hold.
(A) f(x) is monotonically increasing for x ≥ e.
(B) f(x) ≥ e, for x > 1.
(C) For u ≥ √e, if f(x) ≤ u, then x ≤ 2u ln u.
(D) For u ≥ √e, if x > 2u ln u, then f(x) > u.
(E) For u ≥ e, if f(x) ≥ u, then x ≥ u ln u.

38.2.1. Mixing range spaces


Lemma 38.2.5. Let S = (X, R) and T = (X, R′) be two range spaces of VC dimension δ and δ′, respectively, where δ, δ′ > 1. Let R̂ = { r ∪ r′ | r ∈ R, r′ ∈ R′ }. Then, for the range space Ŝ = (X, R̂), we have that dimVC(Ŝ) = O(δ + δ′).

Proof: As a warm-up exercise, we prove a somewhat weaker bound here of O((δ + δ′) log(δ + δ′)). The stronger bound follows from Theorem 38.2.6 below. Let B be a set of n points in X that are shattered by Ŝ. There are at most Gδ(n) and Gδ′(n) different ranges of B in the range sets R|B and R′|B, respectively, by Lemma 38.2.1. Every subset C of B realized by r̂ ∈ R̂ is a union of two subsets B ∩ r and B ∩ r′, where r ∈ R and r′ ∈ R′, respectively. Thus, the number of different subsets of B realized by Ŝ is bounded by Gδ(n)Gδ′(n). Thus, 2ⁿ ≤ n^δ n^{δ′}, for δ, δ′ > 1. We conclude that n ≤ (δ + δ′) lg n, which implies that n = O((δ + δ′) log(δ + δ′)), by Lemma 38.2.4(C). ■

Interestingly, one can prove a considerably more general result with tighter bounds. The required compu-
tations are somewhat more painful.

   
Theorem 38.2.6. Let S₁ = (X, R₁), …, Sₖ = (X, Rₖ) be range spaces with VC dimension δ₁, …, δₖ, respectively. Next, let f(r₁, …, rₖ) be a function that maps any k-tuple of sets r₁ ∈ R₁, …, rₖ ∈ Rₖ into a subset of X. Here, the function f is restricted to be defined by a sequence of set operations like complement, intersection and union. Consider the range set
\[
R' = \bigl\{\, f(r_1, \ldots, r_k) \,\bigm|\, r_1 \in R_1, \ldots, r_k \in R_k \,\bigr\}
\]
and the associated range space T = (X, R′). Then, the VC dimension of T is bounded by O(kδ lg k), where δ = maxᵢ δᵢ.

Proof: Assume a set Y ⊆ X of size t is being shattered by R′, and observe that
\[
\left|R'_{|Y}\right|
\le \left|\bigl\{ (r_1, \ldots, r_k) \bigm| r_1 \in R_{1|Y}, \ldots, r_k \in R_{k|Y} \bigr\}\right|
\le \left|R_{1|Y}\right| \cdots \left|R_{k|Y}\right|
\le G_{\delta_1}(t) \cdot G_{\delta_2}(t) \cdots G_{\delta_k}(t)
\le \bigl(G_\delta(t)\bigr)^k
\le \Bigl(2\bigl(te/\delta\bigr)^{\delta}\Bigr)^k,
\]
by Lemma 38.2.1 and Lemma 38.2.2. On the other hand, since Y is being shattered by R′, this implies that |R′_{|Y}| = 2ᵗ. Thus, we have the inequality 2ᵗ ≤ 2ᵏ(te/δ)^{δk}, which implies t ≤ k(1 + δ lg(te/δ)). Assume that t ≥ e and δ lg(te/δ) ≥ 1, since otherwise the claim is trivial, and observe that t ≤ k(1 + δ lg(te/δ)) ≤ 3kδ lg(t/δ). Setting x = t/δ, we have
\[
\frac{t}{\delta} \le 3k\,\frac{\ln(t/\delta)}{\ln 2} \le 6k \ln\frac{t}{\delta}
\implies \frac{x}{\ln x} \le 6k
\implies x \le 2 \cdot 6k \ln(6k)
\implies x \le 12k \ln(6k),
\]
by Lemma 38.2.4(C). We conclude that t ≤ 12δk ln(6k), as claimed. ■

Corollary 38.2.7. Let S = (X, R) and T = (X, R′) be two range spaces of VC dimension δ and δ′, respectively, where δ, δ′ > 1. Let R̂ = { r ∩ r′ | r ∈ R, r′ ∈ R′ }. Then, for the range space Ŝ = (X, R̂), we have that dimVC(Ŝ) = O(δ + δ′).

Corollary 38.2.8. Any finite sequence of combining range spaces with finite VC dimension (by intersecting,
complementing, or taking their union) results in a range space with a finite VC dimension.
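The corollary is easy to probe by brute force. The sketch below (illustrative Python; the ground set and ranges are made-up toy choices) computes the VC dimension directly from the definition and confirms that taking unions of pairs of intervals keeps the dimension finite; it grows from 2 to 4.

```python
from itertools import combinations

def vc_dimension(points, ranges):
    # Largest k such that some k-subset of points is shattered by the ranges.
    best = 0
    for k in range(1, len(points) + 1):
        for B in combinations(points, k):
            proj = {frozenset(r & set(B)) for r in ranges}
            if len(proj) == 2 ** k:     # all 2^k subsets of B are realized
                best = k
                break
    return best

X = list(range(7))
intervals = [frozenset(range(i, j)) for i in range(8) for j in range(i, 8)]
unions = [a | b for a in intervals for b in intervals]

assert vc_dimension(X, intervals) == 2   # intervals shatter two points
assert vc_dimension(X, unions) == 4      # unions of two intervals: four
```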

38.3. On ε-nets and ε-sampling


38.3.1. ε-nets and ε-samples
Definition 38.3.1 (ε-sample). Let S = (X, R) be a range space, and let x be a finite subset of X. For 0 ≤ ε ≤ 1,
a subset C ⊆ x is an ε-sample for x if for any range r ∈ R, we have

\[
\bigl| m(r) - s(r) \bigr| \le \varepsilon,
\]
where m(r) = |x ∩ r| / |x| is the measure of r (see Definition 38.1.2) and s(r) = |C ∩ r| / |C| is the estimate of r (see Definition 38.1.3). (Here C might be a multi-set, and as such |C ∩ r| is counted with multiplicity.)

As such, an ε-sample is a subset of the ground set x that “captures” the range space up to an error of ε.
Specifically, to estimate the fraction of the ground set covered by a range r, it is sufficient to count the points
of C that fall inside r.
If X is a finite set, we will abuse notation slightly and refer to C as an ε-sample for S.

To see the usage of such a sample, consider x = X to be, say, the population of a country (i.e., an element
of X is a citizen). A range in R is the set of all people in the country that answer yes to a question (i.e., would
you vote for party Y?, would you buy a bridge from me?, questions like that). An ε-sample of this range space
enables us to estimate reliably (up to an error of ε) the answers for all these questions, by just asking the people
in the sample.
The natural question of course is how to find such a subset of small (or minimal) size.
Theorem 38.3.2 (ε-sample theorem, [VC71]). There is a positive constant c such that if (X, R) is any range
space with VC dimension at most δ, x ⊆ X is a finite subset and ε, φ > 0, then a random subset C ⊆ x of
cardinality
\[
s = \frac{c}{\varepsilon^2}\left(\delta \log\frac{\delta}{\varepsilon} + \log\frac{1}{\varphi}\right)
\]
is an ε-sample for x with probability at least 1 − φ.
(In the above theorem, if s > |x|, then we can just take all of x to be the ε-sample.)
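The guarantee can be probed empirically. The sketch below (illustrative Python; the concrete sizes are arbitrary choices, not the theorem's constants) draws 2,000 points with replacement from a ground set of 10,000 points on a line, and checks that every interval in a coarse grid of intervals has its mass estimated within ε = 0.1.

```python
import random
from bisect import bisect_left

random.seed(0)
n, s, eps = 10_000, 2_000, 0.1
ground = list(range(n))
sample = sorted(random.choices(ground, k=s))   # a multi-set, as allowed

def discrepancy(i, j):
    # |measure - estimate| for the interval [i, j) of ground elements.
    measure = (j - i) / n
    estimate = (bisect_left(sample, j) - bisect_left(sample, i)) / s
    return abs(measure - estimate)

# Check a coarse grid of intervals (all of them would also work, just slower).
worst = max(discrepancy(i, j)
            for i in range(0, n + 1, 100)
            for j in range(i, n + 1, 100))
assert worst <= eps   # the sample is an eps-sample for these intervals
```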
Sometimes it is sufficient to have (hopefully smaller) samples with a weaker property – if a range is “heavy”,
then there is an element in our sample that is in this range.
Definition 38.3.3 (ε-net). A set N ⊆ x is an ε-net for x if for any range r ∈ R, if m(r) ≥ ε (i.e., |r ∩ x| ≥ ε |x|),
then r contains at least one point of N (i.e., r ∩ N , ∅).
Theorem 38.3.4 (ε-net theorem, [HW87]). Let (X, R) be a range space of VC dimension δ, let x be a finite
subset of X, and suppose that 0 < ε ≤ 1 and φ < 1. Let N be a set obtained by m random independent draws
from x, where
\[
m \ge \max\left(\frac{4}{\varepsilon}\lg\frac{4}{\varphi},\; \frac{8\delta}{\varepsilon}\lg\frac{16}{\varepsilon}\right). \tag{38.3}
\]
Then N is an ε-net for x with probability at least 1 − φ.
(We remind the reader that lg = log2 .)
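The ε-net guarantee can be checked the same way. The following sketch (illustrative Python; δ = 2 for intervals, and ε, φ are arbitrary choices) draws m points according to Eq. (38.3) and verifies that every interval containing at least εn ground points is hit.

```python
import random
from bisect import bisect_left
from math import ceil, log2

random.seed(1)
n, eps, delta, phi = 10_000, 0.05, 2, 0.1
# Sample size per Eq. (38.3); delta = 2 is the VC dimension of intervals.
m = ceil(max((4 / eps) * log2(4 / phi), (8 * delta / eps) * log2(16 / eps)))
net = sorted(random.choices(range(n), k=m))

# Every interval with at least eps*n ground points must contain a net point;
# it suffices to check the minimal heavy intervals [i, i + eps*n).
w = int(eps * n)
for i in range(n - w + 1):
    assert bisect_left(net, i + w) - bisect_left(net, i) > 0
```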
The proofs of the above theorems are somewhat involved and we first turn our attention to some applications
before presenting the proofs.
Remark 38.3.5. The above two theorems also hold for spaces with shattering dimension at most δ, in which case the sample size is slightly larger. Specifically, for Theorem 38.3.4, the sample size needed is
\[
O\left(\frac{1}{\varepsilon}\lg\frac{1}{\varphi} + \frac{\delta}{\varepsilon}\lg\frac{\delta}{\varepsilon}\right).
\]
Remark 38.3.6. The ε-net theorem is a relatively easy consequence (up to constants) of the ε-sample theorem
– see bibliographical notes for details.

38.3.2. Some applications


We mention two (easy) applications of these theorems, which (hopefully) demonstrate their power.

38.3.2.1. Range searching


So, consider a (very large) set of points P in the plane. We would like to be able to quickly decide how
many points are included inside a query rectangle. Let us assume that we allow ourselves 1% error. What
Theorem 38.3.2 tells us is that there is a subset of constant size (that depends only on ε) that can be used to
perform this estimation, and it works for all query rectangles (we used here the fact that rectangles in the plane
have finite VC dimension). In fact, a random sample of this size works with constant probability.

38.3.2.2. Learning a concept

Assume that we have a function f defined in the plane that returns ‘1’ inside an (unknown) disk Dunknown and ‘0’ outside it. There is some distribution D defined over the plane, and we pick points from this distribution. Furthermore, we can compute the function for these points (i.e., we can compute f for certain values, but it is expensive). For a mystery value ε > 0, to be explained shortly, Theorem 38.3.4 tells us to pick (roughly) O((1/ε) log(1/ε)) random points in a sample R from this distribution and to compute the labels for the samples. This is demonstrated in the figure on the right, where black dots are the sample points for which f(·) returned 1.
So, now we have positive examples and negative examples. We would like to find a hypothesis that agrees with all the samples we have and that hopefully is close to the true unknown disk underlying the function f. To this end, compute the smallest disk D that contains the samples labeled by ‘1’ and does not contain any of the ‘0’ points, and let g : R² → {0, 1} be the function that returns ‘1’ inside the disk and ‘0’ otherwise. We claim that g classifies correctly all but an ε-fraction of the points (i.e., the probability of misclassifying a point picked according to the given distribution is smaller than ε); that is, Pr_{p∈D}[ f(p) ≠ g(p) ] ≤ ε.
Geometrically, the region where g and f disagree is the set of all points in the symmetric difference between the two disks. That is, E = D ⊕ Dunknown; see the figure on the right.
Thus, consider the range space S having the plane as the ground set and the symmetric difference between any two disks as its ranges. By Corollary 38.2.8, this range space has finite VC dimension. Now, consider the (unknown) disk Dunknown that induces f and the region r = Dunknown ⊕ D. Clearly, the learned classifier g returns incorrect answers only for points picked inside r.
Thus, the probability of a mistake in the classification is the measure of r under the distribution D. So, if Pr_D[r] > ε (i.e., the probability that a sample point falls inside r), then by the ε-net theorem (i.e., Theorem 38.3.4) the set R is an ε-net for S (ignore for the time being the possibility that the random sample fails to be an ε-net), and as such R contains a point u inside r. But it is not possible for g (which classifies correctly all the sampled points of R) to make a mistake on u, a contradiction, because by construction, the range r is where g misclassifies points. We conclude that Pr_D[r] ≤ ε, as desired.
Little lies. The careful reader might be tearing his or her hair out because of the above description. First,
Theorem 38.3.4 might fail, and the above conclusion might not hold. This is of course true, and in real appli-
cations one might use a much larger sample to guarantee that the probability of failure is so small that it can be
practically ignored. A more serious issue is that Theorem 38.3.4 is defined only for finite sets. Nowhere does it
speak about a continuous distribution. Intuitively, one can approximate a continuous distribution to an arbitrary
precision using a huge sample and apply the theorem to this sample as our ground set. A formal proof is more
tedious and requires extending the proof of Theorem 38.3.4 to continuous distributions. This is straightforward
and we will ignore this topic altogether.
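The learning procedure above fits in a few lines of Python (purely illustrative; in particular, instead of the smallest disk consistent with all samples, we use a cruder hypothesis, the disk centered at the centroid of the positive examples with the farthest positive as radius, which contains all positives but, unlike the hypothesis in the text, is not guaranteed to exclude every negative sample):

```python
import math
import random

random.seed(2)
center, radius = (0.3, -0.2), 0.7                 # the unknown disk
f = lambda p: math.dist(p, center) <= radius      # the labeling function

def draw(k):   # the distribution D: uniform on the square [-1, 1]^2
    return [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(k)]

# Labeled training sample.
train = draw(2_000)
pos = [p for p in train if f(p)]

# Hypothesis: a disk containing all the positive samples (centered at their
# centroid, with the farthest positive as radius).
cx = sum(p[0] for p in pos) / len(pos)
cy = sum(p[1] for p in pos) / len(pos)
r = max(math.dist(p, (cx, cy)) for p in pos)
g = lambda p: math.dist(p, (cx, cy)) <= r

# The error region is the symmetric difference of the two disks; estimate
# its probability mass on fresh samples.
test = draw(20_000)
err = sum(f(p) != g(p) for p in test) / len(test)
assert err <= 0.1
```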

38.4. A better bound on the growth function


In this section, we prove Lemma 38.2.2. Since the proof is straightforward but tedious, the reader can safely
skip reading this section.

Lemma 38.4.1. For any positive integer n, the following hold.
(i) (1 + 1/n)ⁿ ≤ e.
(ii) (1 − 1/n)^{n−1} ≥ e^{−1}.
(iii) n! ≥ (n/e)ⁿ.
(iv) For any k ≤ n, we have
\[
\left(\frac{n}{k}\right)^{k} \le \binom{n}{k} \le \left(\frac{ne}{k}\right)^{k}.
\]

Proof: (i) Indeed, 1 + 1/n ≤ exp(1/n), since 1 + x ≤ eˣ, for x ≥ 0. As such, (1 + 1/n)ⁿ ≤ exp(n(1/n)) = e.
(ii) Rewriting the inequality, we need to prove \((\frac{n-1}{n})^{n-1} \ge \frac{1}{e}\). This is equivalent to proving \(e \ge (\frac{n}{n-1})^{n-1} = (1 + \frac{1}{n-1})^{n-1}\), which is our friend from (i).
(iii) Indeed,
\[
\frac{n^n}{n!} \le \sum_{i=0}^{\infty} \frac{n^i}{i!} = e^n,
\]
by the Taylor expansion of \(e^x = \sum_{i=0}^{\infty} x^i/i!\). This implies that (n/e)ⁿ ≤ n!, as required.
(iv) Indeed, for any k ≤ n, we have \(\frac{n}{k} \le \frac{n-1}{k-1}\), as can be easily verified. As such, \(\frac{n}{k} \le \frac{n-i}{k-i}\), for 1 ≤ i ≤ k − 1. As such,
\[
\left(\frac{n}{k}\right)^{k} \le \frac{n}{k} \cdot \frac{n-1}{k-1} \cdots \frac{n-k+1}{1} = \binom{n}{k}.
\]
As for the other direction, by (iii), we have
\[
\binom{n}{k} \le \frac{n^k}{k!} \le \frac{n^k}{(k/e)^k} = \left(\frac{ne}{k}\right)^{k}. \qquad\blacksquare
\]
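These inequalities are easy to sanity-check numerically (a small Python sketch added here, not part of the notes):

```python
from math import e, factorial, comb

for n in range(1, 40):
    assert (1 + 1 / n) ** n <= e                 # (i)
    assert (1 - 1 / n) ** (n - 1) >= 1 / e       # (ii)
    assert factorial(n) >= (n / e) ** n          # (iii)
    for k in range(1, n + 1):
        assert (n / k) ** k <= comb(n, k) <= (n * e / k) ** k   # (iv)
```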

Lemma 38.2.2 restated. For n ≥ 2δ and δ ≥ 1, we have
\[
\left(\frac{n}{\delta}\right)^{\delta} \le G_\delta(n) \le 2\left(\frac{ne}{\delta}\right)^{\delta},
\qquad\text{where}\qquad
G_\delta(n) = \sum_{i=0}^{\delta} \binom{n}{i}.
\]

Proof: Note that by Lemma 38.4.1(iv), we have
\[
G_\delta(n) = \sum_{i=0}^{\delta} \binom{n}{i} \le 1 + \sum_{i=1}^{\delta} \left(\frac{ne}{i}\right)^{i}.
\]
This series behaves like a geometric series with constant larger than 2, since
\[
\left(\frac{ne}{i}\right)^{i} \Bigm/ \left(\frac{ne}{i-1}\right)^{i-1}
= \frac{ne}{i}\left(\frac{i-1}{i}\right)^{i-1}
= \frac{ne}{i}\left(1 - \frac{1}{i}\right)^{i-1}
\ge \frac{ne}{i} \cdot \frac{1}{e}
= \frac{n}{i} \ge \frac{n}{\delta} \ge 2,
\]
by Lemma 38.4.1. As such, this series is bounded by twice the largest element in the series, implying the upper bound. (The lower bound follows as Gδ(n) ≥ \(\binom{n}{\delta}\) ≥ (n/δ)^δ, again by Lemma 38.4.1(iv).) ■

38.5. Some required definitions


Definition 38.5.1 (Convex hull). The convex hull of a set R ⊆ Rᵈ is the set of all convex combinations of points of R; that is,
\[
\mathcal{CH}(R) = \left\{ \sum_{i=1}^{m} \alpha_i v_i \;\middle|\; \forall i\;\; v_i \in R,\; \alpha_i \ge 0, \text{ and } \sum_{i=1}^{m} \alpha_i = 1 \right\}.
\]

References
[HW87] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete Comput. Geom., 2:127–151, 1987.
[VC71] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264–280, 1971.

Chapter 39

Double sampling
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“What does not work when you apply force, would work when you apply even more force.”

– Anonymous

39.1. Double sampling


Double sampling is the idea that two random independent samples should look similar, and should not be
completely different in the way they intersect a certain set. We use the following sampling model, which makes
the computations somewhat easier.
Definition 39.1.1. Let S = { f₁, …, fₙ } be a set of objects, where the ith object has weight ωᵢ > 0, for all i. Let W = Σᵢ ωᵢ. For a target size ρ, a ρ-sample is a random sample R ⊆ S, where object fᵢ is picked independently with probability ρωᵢ/W. To simplify the discussion, we assume that ρωᵢ/W < 1. Handling the more general case is easy if somewhat tedious.
Lemma 39.1.2. Let R₁ and R₂ be two ρ-samples, and consider the merged sample R = R₁ ∪ R₂. Let T ⊆ S be a set of m elements. Then, we have that
\[
\Pr\bigl[ T \subseteq R_1 \bigm| T \subseteq R \bigr] \ge \frac{1}{2^m}
\qquad\text{and}\qquad
\Pr\bigl[ T \subseteq R_1 \text{ and } T \cap R_2 = \emptyset \bigm| T \subseteq R \bigr] \le \frac{1}{2^m}.
\]
 
Proof: Consider an object f ∈ T, and observe that Pr[ f ∈ R₁ or f ∈ R₂ | f ∈ R ] = 1. As such, by symmetry,
\[
\Pr[f \in R_1 \mid f \in R] = \Pr[f \in R_2 \mid f \in R] \ge 1/2.
\]
Now, let T = { f₁, …, fₘ }. Since R₁ and R₂ are independent, and each element is being picked independently, we have that
\[
\Pr[T \subseteq R_1 \mid T \subseteq R]
= \Pr[f_1, \ldots, f_m \in R_1 \mid f_1, \ldots, f_m \in R]
= \prod_{i=1}^{m} \Pr[f_i \in R_1 \mid f_1, \ldots, f_m \in R]
= \prod_{i=1}^{m} \Pr[f_i \in R_1 \mid f_i \in R]
\ge \frac{1}{2^m}.
\]
For the second claim, observe that again, by symmetry, we have that
\[
\Pr[f \in R_1 \text{ and } f \notin R_2 \mid f \in R]
= \Pr[f \notin R_1 \text{ and } f \in R_2 \mid f \in R] \le 1/2,
\]
as the two events are disjoint. Now, the claim follows by arguing as above. ■
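A quick Monte Carlo check of the lemma (illustrative Python; uniform weights and made-up parameters): with n = 30, ρ = 10 and |T| = 3, the two conditional probabilities are estimated empirically and compared against 1/2^m.

```python
import random

random.seed(3)
n, rho, m, trials = 30, 10, 3, 50_000
T = set(range(m))            # a fixed set of m objects
p = rho / n                  # uniform weights: each f_i picked w.p. rho*w_i/W

hit = hit_and_miss = total = 0
for _ in range(trials):
    R1 = {i for i in range(n) if random.random() < p}
    R2 = {i for i in range(n) if random.random() < p}
    if T <= (R1 | R2):       # condition on T being contained in R = R1 u R2
        total += 1
        hit += T <= R1
        hit_and_miss += (T <= R1) and not (T & R2)

assert hit / total >= 1 / 2 ** m           # first inequality of the lemma
assert hit_and_miss / total <= 1 / 2 ** m  # second inequality of the lemma
```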

39.1.1. Disagreement between samples on a specific set
We provide two proofs of the following lemma – the constants are somewhat different for each version.
Lemma 39.1.3. Let R₁ and R₂ be two ρ-samples from a ground set S, and consider a fixed set T ⊆ S. We have that
\[
\Pr\Bigl[ \bigl|\, |R_1 \cap T| - |R_2 \cap T| \,\bigr| > \varepsilon\rho \Bigr] \le 3\exp\bigl(-\varepsilon^2 \rho/2\bigr).
\]

Proof: (Simplest proof.) By Chernoff’s inequality, we have
\[
\Pr\bigl[ \bigl|\, |R_1| - \rho \,\bigr| \ge (\varepsilon/2)\rho \bigr]
\le 2\exp\bigl(-(\varepsilon/2)^2 \rho/4\bigr)
= 2\exp\bigl(-\varepsilon^2 \rho/16\bigr).
\]
The same holds for R₂, and as such we have
\[
\Pr\bigl[ \bigl|\, |R_1| - |R_2| \,\bigr| \ge \varepsilon\rho \bigr]
\le \Pr\bigl[ \bigl|\, |R_1| - \rho \,\bigr| + \bigl|\, \rho - |R_2| \,\bigr| \ge \varepsilon\rho \bigr]
\le \Pr\bigl[ \bigl|\, |R_1| - \rho \,\bigr| \ge (\varepsilon/2)\rho \bigr] + \Pr\bigl[ \bigl|\, \rho - |R_2| \,\bigr| \ge (\varepsilon/2)\rho \bigr]
\le 4\exp\bigl(-\varepsilon^2 \rho/16\bigr). \qquad\blacksquare
\]

Proof: For an object fᵢ ∈ S, let Xᵢ be a random variable, where
\[
X_i = \begin{cases}
+1 & f_i \in R_1 \text{ and } f_i \notin R_2 \\
-1 & f_i \notin R_1 \text{ and } f_i \in R_2 \\
0 & \text{otherwise.}
\end{cases}
\]
We have that pᵢ = Pr[Xᵢ = 1] = Pr[Xᵢ = −1] = (ρωᵢ/W)(1 − ρωᵢ/W) and E[Xᵢ] = 0. Applying the regular concentration inequalities in this case is not immediate, since there are many Xᵢs that are zero. To overcome this, let N be a random variable that is the number of variables among X₁, …, Xₙ that are non-zero. (We use N here, rather than the T of the original text, since T already denotes the fixed set.) The variable N is a sum of n independent 0/1 random variables, with E[N] = Σᵢ 2pᵢ ≤ 2ρ. In particular, by Chernoff’s inequality, we have that
\[
q_1 = \Pr\bigl[ N > (1+\varepsilon)2\rho \bigr] \le \exp\bigl(-2\rho\varepsilon^2/4\bigr) = \exp\bigl(-\rho\varepsilon^2/2\bigr),
\]
and assume that this did not happen. In particular, let Z₁, …, Z_N be the non-zero variables among X₁, …, Xₙ, and observe that Pr[Zᵢ = 1] = Pr[Zᵢ = −1] = 1/2. Let Y = Σᵢ Xᵢ = Σᵢ Zᵢ. Observe that E[Y] = 0, and by Chernoff’s inequality, we have that
\[
q_2 = \Pr\Bigl[ \bigl|\, |R_1 \cap T| - |R_2 \cap T| \,\bigr| > \varepsilon\rho \Bigr]
= \Pr\bigl[ |Y - \mathbb{E}[Y]| \ge \varepsilon\rho \bigr]
\le \Pr\Bigl[ \Bigl| \sum_i Z_i - 0 \Bigr| \ge \varepsilon\rho \Bigr]
\le 2\exp\left(-2\frac{(\varepsilon\rho)^2}{2N}\right)
\le 2\exp\left(-2\frac{(\varepsilon\rho)^2}{2(1+\varepsilon)\rho}\right) + q_1
= 2\exp\left(-\frac{\varepsilon^2\rho}{1+\varepsilon}\right) + q_1
\le 3\exp\bigl(-\varepsilon^2\rho/2\bigr),
\]
using N ≤ (1 + ε)2ρ. ■

39.1.2. Exponential decay for a single set


Lemma 39.1.4. Consider a set S of m objects, where every object fᵢ ∈ S has weight ωᵢ > 0, and W = Σ_{i=1}^{m} ωᵢ. Next, consider a set r ⊆ S such that ω(r) ≥ tW/ρ (such a set is t-heavy). Let R be a ρ-sample from S. Then, the probability that R misses r is at most e^{−t}. Formally, we have Pr[r ∩ R = ∅] ≤ exp(−t).

Proof: Let r = { f₁, …, fₖ }. Clearly, the probability that R fails to pick any one of these objects is bounded by
\[
\Pr[r \cap R = \emptyset]
= \Pr\bigl[\forall i \in \{1, \ldots, k\}:\; f_i \notin R\bigr]
= \prod_{i=1}^{k} \left(1 - \frac{\rho\omega_i}{W}\right)
\le \prod_{i=1}^{k} \exp\left(-\frac{\rho\omega_i}{W}\right)
= \exp\left(-\frac{\rho}{W}\sum_i \omega_i\right)
\le \exp\left(-\frac{\rho}{W} \cdot t\frac{W}{\rho}\right) = \exp(-t). \qquad\blacksquare
\]
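The bound is easy to confirm by simulation (illustrative Python with uniform weights; the parameters are arbitrary): a set that is exactly t-heavy is missed by a ρ-sample at most an e^{−t} fraction of the time.

```python
import random
from math import exp

random.seed(4)
n, rho, t, trials = 100, 20, 2.0, 100_000
p = rho / n                           # uniform weights: pick each w.p. rho/n
# A t-heavy set needs weight >= t*W/rho, i.e. t*n/rho = 10 elements here.
heavy = range(int(t * n / rho))

misses = sum(all(random.random() >= p for _ in heavy) for _ in range(trials))
assert misses / trials <= exp(-t)     # the e^{-t} bound of the lemma
```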

39.1.3. Moments of the sample size
Lemma 39.1.5. Let R be an m-sample, and let f(t) ≤ αt^β, where α ≥ 1 and β ≥ 1 are constants, such that m ≥ 16β. Then U(m) = E[ f(|R|) ] ≤ 2α(2m)^β.
Proof: The proof follows from Chernoff’s inequality and some tedious but straightforward calculations. The reader is as such encouraged to skip reading it.
Let X = |R|. This is a sum of 0/1 random variables with expectation m. As such, we have
\[
\nu = \mathbb{E}\bigl[f(|R|)\bigr] \le \sum_{i=0}^{\infty} \Pr[X = i]\, f(i) \le \alpha \sum_{i=0}^{\infty} \Pr[X = i]\, i^\beta.
\]
Considering the last sum, we have
\[
\sum_{i=0}^{\infty} \Pr[X = i]\, i^\beta
\le \sum_{j=0}^{\infty} \Pr[X \ge jm]\, \bigl((j+1)m\bigr)^\beta
\le (2m)^\beta + m^\beta \sum_{j=2}^{\infty} \Pr[X \ge jm]\, (j+1)^\beta.
\]
We bound the last summation using Chernoff’s inequality (see Theorem 13.2.1):
\[
\tau = \sum_{j=2}^{5} \Pr[X \ge jm]\, (j+1)^\beta + \sum_{j=6}^{\infty} \Pr[X \ge jm]\, (j+1)^\beta
\le \sum_{j=2}^{5} \exp\left(-\frac{m(j-1)^2}{4}\right)(j+1)^\beta + \sum_{j=6}^{\infty} 2^{-jm}(j+1)^\beta
\le \exp\left(-\frac{m}{4}\right)3^\beta + \exp(-m)4^\beta + \exp(-2m)5^\beta + \exp(-4m)6^\beta + \sum_{j=6}^{\infty} 2^{-jm}(j+1)^\beta < 1,
\]
since m ≥ 16β. We conclude that ν ≤ α(2m)^β + αm^β τ ≤ 2α(2m)^β. ■


Remark 39.1.6. The constant 16 in the above lemma is somewhat strange. A better constant can be derived by
breaking the range of sizes into smaller intervals and using the right Chernoff inequality. Since this is somewhat
tangential to the point of this write-up, we leave it as is (i.e., this constant is not critical to our discussion).

39.1.4. Growth function


The growth function Gδ(n) is the maximum number of ranges in a range space with VC dimension δ and with n elements. By Sauer’s lemma, it is known that
\[
G_\delta(n) = \sum_{i=0}^{\delta} \binom{n}{i} \le \sum_{i=0}^{\delta} \frac{n^i}{i!} \le n^\delta. \tag{39.1}
\]
The following is well known (the estimates are somewhat tedious to prove):
Lemma 39.1.7 ([Har11]). For n ≥ 2δ and δ ≥ 1, we have
\[
\left(\frac{n}{\delta}\right)^{\delta} \le G_\delta(n) \le 2\left(\frac{ne}{\delta}\right)^{\delta},
\qquad\text{where}\qquad
G_\delta(n) = \sum_{i=0}^{\delta} \binom{n}{i}.
\]
 
Lemma 39.1.8. Let R and R′ be two independent m-samples from x. Assume that m ≥ δ. Then E[ Gδ(|R| + |R′|) ] ≤ Gδ(2m), where Gδ(2m) = 4(4em/δ)^δ.

Proof: We set α = 2(e/δ)^δ, β = δ, and f(n) = αn^β; note that Gδ(n) ≤ f(n), by Lemma 39.1.7. Duplicate every element in x, and let x′ be the resulting set. Clearly, the size of a 2m-sample R̂ from x′ has the same distribution as |R| + |R′|. As such, we have
\[
\mathbb{E}\bigl[G_\delta(|R| + |R'|)\bigr] \le \mathbb{E}\bigl[f(|\widehat{R}|)\bigr] \le 2\alpha(4m)^\beta \le 4\left(\frac{4em}{\delta}\right)^{\delta},
\]
where the second inequality follows from Lemma 39.1.5. ■

39.2. Proof of the ε-net theorem
Here we are working in the unweighted settings (i.e., the weight of a single element is one).
Theorem 39.2.1 (ε-net theorem, [HW87]). Let (X, R) be a range space of VC dimension δ, let x be a finite subset of X, and suppose that 0 < ε ≤ 1 and φ < 1. Let N be an m-sample from x (see Definition 39.1.1), where
\[
m \ge \max\left(\frac{8}{\varepsilon}\lg\frac{4}{\varphi},\; \frac{16\delta}{\varepsilon}\lg\frac{16}{\varepsilon}\right). \tag{39.2}
\]
Then N is an ε-net for x with probability at least 1 − φ.

39.2.1. The proof


39.2.1.1. Reduction to double sampling
Let n = |x|. Let N be the m-sample from x. Let E₁ be the event that N fails to be an ε-net. Namely,
\[
E_1 = \Bigl[ \exists r \in R:\;\; |r \cap x| \ge \varepsilon n \text{ and } r \cap N = \emptyset \Bigr].
\]
(Namely, there exists a “heavy” range r that does not contain any point of N.) To complete the proof, we must show that Pr[E₁] ≤ φ. Let T be another m-sample generated in a similar fashion to N. Let E₂ be the event that N fails but T “works”. Formally,
\[
E_2 = \Bigl[ \exists r \in R:\;\; |r \cap x| \ge \varepsilon n,\; r \cap N = \emptyset, \text{ and } |r \cap T| \ge \frac{\varepsilon m}{2} \Bigr].
\]
Intuitively, since E[ |r ∩ T| ] ≥ εm, we have that for the range r that N fails for, it follows with “good” probability that |r ∩ T| ≥ εm/2. Namely, E₁ and E₂ have more or less the same probability.
Claim 39.2.2. Pr[E₂] ≤ Pr[E₁] ≤ 2 Pr[E₂].
Proof: Clearly, E₂ ⊆ E₁, and thus Pr[E₂] ≤ Pr[E₁]. As for the other part, note that by the definition of conditional probability, we have
\[
\Pr[E_2 \mid E_1] = \Pr[E_2 \cap E_1]/\Pr[E_1] = \Pr[E_2]/\Pr[E_1].
\]
It is thus enough to show that Pr[E₂ | E₁] ≥ 1/2.
Assume that E₁ occurs. Then there is r ∈ R, such that |r ∩ x| > εn and r ∩ N = ∅. The required probability is at least the probability that for this specific r, we have X = |r ∩ T| ≥ εm/2. The variable X is a sum of t = |r ∩ x| ≥ εn independent 0/1 random variables, each one having probability m/n to be one. Setting μ = E[X] = tm/n ≥ εm and ξ = 1/2, we have by Chernoff’s inequality that
\[
\Pr\bigl[ |r \cap T| \le \varepsilon m/2 \bigr] \le \Pr\bigl[ X < (1-\xi)\mu \bigr] \le \exp\bigl(-\mu\xi^2/2\bigr) = \exp(-\varepsilon m/8) < 1/2,
\]
if εm ≥ 8. Thus, for this r, we have
\[
\Pr[E_2]/\Pr[E_1] \ge \Pr\Bigl[ |r \cap T| \ge \frac{\varepsilon m}{2} \Bigr] = 1 - \Pr\Bigl[ |r \cap T| < \frac{\varepsilon m}{2} \Bigr] \ge \frac{1}{2}. \qquad\blacksquare
\]
Claim 39.2.2 implies that to bound the probability of E₁, it is enough to bound the probability of E₂. Let
\[
E_2' = \Bigl[ \exists r \in R:\;\; r \cap N = \emptyset \text{ and } |r \cap T| \ge \frac{\varepsilon m}{2} \Bigr].
\]
Clearly, E₂ ⊆ E₂′. Thus, bounding the probability of E₂′ is enough to prove Theorem 39.2.1. Note, however, that a shocking thing happened! We no longer have x participating in our event. Namely, we turned bounding an event that depends on a global quantity (i.e., the ground set x) into bounding a quantity that depends only on a local quantity/experiment (involving only N and T). This is the crucial idea in this proof.

39.2.1.2. Using double sampling to finish the proof
Claim 39.2.3. Pr[E₂] ≤ Pr[E₂′] ≤ 2^{−εm/2} Gδ(2m).

Proof: We fix the content of R = N ∪ T. The range space (R, R|R) has at most Gδ(|R|) ranges. Fix a range r in this range space, and let Y = r ∩ R and b = |Y|. If b < εm/2, then r can not cause E₂′ to happen. Otherwise, the probability that r is a bad range is
\[
\Pr\bigl[ Y \subseteq T \text{ and } Y \cap N = \emptyset \bigm| Y \subseteq R \bigr] \le \frac{1}{2^b} \le 2^{-\varepsilon m/2},
\]
by Lemma 39.1.2. In particular, by the union bound over all ranges, we have Pr[E₂′ | R] ≤ 2^{−εm/2} Gδ(|R|). As such, we have
\[
\Pr[E_2'] = \sum_{R} \Pr[E_2' \mid R]\,\Pr[R]
\le \sum_{R} 2^{-\varepsilon m/2}\, G_\delta(|R|)\,\Pr[R]
\le 2^{-\varepsilon m/2}\, \mathbb{E}\bigl[G_\delta(|R|)\bigr]
\le 2^{-\varepsilon m/2}\, G_\delta(2m),
\]
by Lemma 39.1.8. ■

Proof of Theorem 39.2.1. By Claim 39.2.2 and Claim 39.2.3, we have that Pr[E₁] ≤ 2 · 2^{−εm/2} Gδ(2m). It thus remains to verify that if m satisfies Eq. (39.2), then the above is smaller than φ. This is equivalent to
\[
2 \cdot 2^{-\varepsilon m/2}\, G_\delta(2m) \le \varphi
\iff 16 \cdot 2^{-\varepsilon m/2} \left(\frac{4em}{\delta}\right)^{\delta} \le \varphi
\iff -4 + \frac{\varepsilon m}{2} - \delta \lg \frac{4em}{\delta} \ge \lg \frac{1}{\varphi}
\]
\[
\iff \left(\frac{\varepsilon m}{8} - 4 - \delta \lg \frac{4e}{\delta}\right) + \left(\frac{\varepsilon m}{8} - \lg \frac{1}{\varphi}\right) + \left(\frac{\varepsilon m}{4} - \delta \lg \frac{m}{\delta}\right) \ge 0.
\]
We remind the reader that the value of m we pick is such that m ≥ max( (8/ε) lg(4/φ), (16δ/ε) lg(16/ε) ). In particular, m ≥ 64δ/ε, and −4 − δ lg(4e/δ) ≥ −4 − 4δ ≥ −8δ ≥ −εm/8. Similarly, by the choice of m, we have εm/8 ≥ lg(1/φ). As such, we need to show that εm/4 ≥ δ lg(m/δ) ⟺ m ≥ (4δ/ε) lg(m/δ), and one can verify using some easy but tedious calculations that this holds if m ≥ (16δ/ε) lg(16/ε). ■

References
[Har11] S. Har-Peled. Geometric Approximation Algorithms. Vol. 173. Math. Surveys & Monographs. Boston, MA, USA: Amer. Math. Soc., 2011.
[HW87] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete Comput. Geom., 2:127–151, 1987.

Chapter 40

Finite Metric Spaces and Partitions



40.1. Finite Metric Spaces


Definition 40.1.1. A metric space is a pair (X, d) where X is a set and d : X×X → [0, ∞) is a metric, satisfying
the following axioms:
(i) d(x, y) = 0 ⇐⇒ x = y,
(ii) d(x, y) = d(y, x), and
(iii) d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality).

The plane, R², with the regular Euclidean distance is a metric space.
Of special interest is the finite case, where X is an n-point set. Then, the function d can be specified by \(\binom{n}{2}\) real numbers. Alternatively, one can think about (X, d) as a weighted complete graph, where positive weights are specified on the edges, and these weights comply with the triangle inequality.
Finite metric spaces rise naturally from (sparse) graphs. Indeed, let G = (X, E) be an undirected weighted
graph defined over X, and let dG (x, y) be the length of the shortest path between x and y in G. It is easy to verify
that (X, dG ) is a finite metric space. As such if the graph G is sparse, it provides a compact representation to
the finite space (X, dG ).
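To make this concrete, here is a small Python sketch (with a hypothetical four-vertex graph, added here as an illustration) that computes dG via Floyd–Warshall and verifies the three metric axioms.

```python
from itertools import product

INF = float("inf")
# A hypothetical small weighted undirected graph.
V = ["a", "b", "c", "d"]
E = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 4.0, ("c", "d"): 1.5}

# d_G = shortest-path distances, via Floyd-Warshall.
d = {(u, v): (0.0 if u == v else INF) for u in V for v in V}
for (u, v), weight in E.items():
    d[u, v] = d[v, u] = min(d[u, v], weight)
for k, u, v in product(V, repeat=3):
    d[u, v] = min(d[u, v], d[u, k] + d[k, v])

# (X, d_G) satisfies the three metric axioms.
for u, v, w in product(V, repeat=3):
    assert d[u, v] == d[v, u]                      # symmetry
    assert (d[u, v] == 0) == (u == v)              # identity
    assert d[u, w] <= d[u, v] + d[v, w] + 1e-12    # triangle inequality
```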

Definition 40.1.2. Let (X, d) be an n-point metric space. We denote the open ball of radius r about x ∈ X, by
b(x, r) = {y ∈ X | d(x, y) < r}.

Underlying our discussion of metric spaces are algorithmic applications. The hardness of various computational problems depends heavily on the structure of the finite metric space. Thus, given a finite metric space and a computational task, it is natural to try to map the given metric space into a new metric where the task at hand becomes easy.

Example 40.1.3. Computing the diameter of a point set is not trivial in two dimensions (if one wants near
linear running time), but is easy in one dimension. Thus, if we could map points in two dimensions into
points in one dimension, such that the diameter is preserved, then computing the diameter becomes easy. This
approach yields an efficient approximation algorithm, see Exercise 40.7.3 below.

Of course, this mapping from one metric space to another, is going to introduce error. Naturally, one would
like to minimize the error introduced by such a mapping.

Definition 40.1.4. Let (X, dX) and (Y, dY) be two metric spaces. A mapping f : X → Y is an embedding, and it is C-Lipschitz if dY( f(x), f(y) ) ≤ C · dX(x, y) for all x, y ∈ X. The mapping f is K-bi-Lipschitz if there exists a C > 0 such that
\[
C K^{-1} \cdot d_X(x, y) \le d_Y\bigl(f(x), f(y)\bigr) \le C \cdot d_X(x, y),
\]
for all x, y ∈ X.
The least K for which f is K-bi-Lipschitz is the distortion of f, and is denoted dist( f ). The least distortion with which X may be embedded in Y is denoted c_Y(X).

Informally, if f : X → Y has distortion K, then the distances in X and f (X) ⊆ Y are the same up to a factor
of K (one might need to scale up the distances by some constant C).
There are several powerful results about low distortion embeddings that would be presented:
(I) Probabilistic trees. Every finite metric can be randomly embedded into a tree such that the “expected”
distortion for a specific pair of points is O(log n).
(II) Bourgain embedding. Any n-point metric space can be embedded into (finite dimensional) euclidean
metric space with O(log n) distortion.
(III) Johnson-Lindenstrauss lemma. Any n-point set in Euclidean space with the regular Euclidean distance
can be embedded into Rk with distortion (1 + ε), where k = O(ε−2 log n).

40.2. Examples
What is distortion? When considering a mapping f : X → Rᵈ of a metric space (X, d) to Rᵈ, it would be useful to observe that since Rᵈ can be scaled, we can consider f to be an expansion (i.e., no distances shrink). Furthermore, we can assume that there is at least one pair of points x, y ∈ X, such that d(x, y) = ∥ f(x) − f(y)∥. As such, we have
\[
\operatorname{dist}(f) = \max_{x \ne y} \frac{\bigl\| f(x) - f(y) \bigr\|}{d(x, y)}.
\]

Why is distortion necessary? Consider the graph G = (V, E) with one vertex s connected to three other vertices a, b, c, where the weights on the edges are all one (i.e., G is the star graph with three leaves). We claim that G can not be embedded into Euclidean space with distortion smaller than 2/√3. Indeed, consider the associated metric space (V, dG) and an (expansive) embedding f : V → Rᵈ.
Consider the triangle formed by △ = a′b′c′, where a′ = f(a), b′ = f(b) and c′ = f(c), and let s′ = f(s). Next, consider the quantity r = max(∥a′ − s′∥, ∥b′ − s′∥, ∥c′ − s′∥), which lower bounds the distortion of f. This quantity is minimized when r = ∥a′ − s′∥ = ∥b′ − s′∥ = ∥c′ − s′∥; namely, s′ is the center of the smallest enclosing circle of △. However, r is minimized when all the edges of △ are of equal length, and are of length dG(a, b) = 2. Observe that the height of the equilateral triangle with side length 2 is h = √3, and the radius of the circle passing through its vertices is r = (2/3)h = 2/√3. As such, it follows that dist( f ) ≥ r = 2/√3.
a0 1 b0
Note that the above argument is independent of the target dimension d. A packing  
argument shows that embedding the star graph with n leaves into Rd requires distortion Ω n1/d ; see Exercise ??.
It is known that Ω(log n) distortion is necessary in the worst case when embedding a graph into Euclidean space

(this is shown using expanders). A proof of distortion Ω log n/ log log n is sketched in the bibliographical
notes.

40.2.1. Hierarchical Tree Metrics
The following metric is quite useful in practice, and nicely demonstrates why finite metric spaces are algorithmically useful.

Definition 40.2.1. A hierarchically well-separated tree (HST) is a metric space defined on the leaves of a rooted tree T. To each vertex u ∈ T there is associated a label ∆u ≥ 0 such that ∆u = 0 if and only if u is a leaf of T. The labels are such that if a vertex u is a child of a vertex v, then ∆u ≤ ∆v. The distance between two leaves x, y ∈ T is defined as ∆_{lca(x,y)}, where lca(x, y) is the least common ancestor of x and y in T.
A HST T is a k-HST if for every vertex v ∈ T, we have that ∆v ≤ ∆_{p(v)}/k, where p(v) is the parent of v in T.
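The following sketch (illustrative Python; the tree and its labels are made up) computes HST distances directly from the definition, via the label of the least common ancestor, and checks that the resulting metric is in fact an ultrametric.

```python
def hst_distances(node, dist):
    # node is (label, children) for an internal node, or a leaf name (str).
    if isinstance(node, str):
        return [node]
    label, children = node
    groups = [hst_distances(c, dist) for c in children]
    leaves = []
    for i, g in enumerate(groups):
        for h in groups[i + 1:]:
            for a in g:
                for b in h:
                    # this node is the least common ancestor of a and b
                    dist[a, b] = dist[b, a] = label
        leaves.extend(g)
    return leaves

# A 2-HST: each child label is at most half its parent's label.
tree = (8, [(4, ["x1", "x2"]), (2, ["x3", "x4"])])
dist = {}
leaves = hst_distances(tree, dist)
for a in leaves:
    dist[a, a] = 0

assert dist["x1", "x2"] == 4 and dist["x3", "x4"] == 2
assert dist["x1", "x3"] == 8
# HST metrics satisfy the strong (ultrametric) triangle inequality.
for a in leaves:
    for b in leaves:
        for c in leaves:
            assert dist[a, c] <= max(dist[a, b], dist[b, c])
```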

Note that a HST is a very limited metric. For example, consider the cycle G = Cₙ on n vertices, with weight one on the edges, and consider an expansive embedding f of G into a HST H. It is easy to verify that there must be two consecutive nodes of the cycle which are mapped to two different subtrees of the root r of H. Since f is expansive, it follows that ∆r ≥ n/2. As such, dist( f ) ≥ n/2. Namely, HSTs fail to faithfully represent even very simple metrics.

40.2.2. Clustering
One natural problem we might want to solve on a graph (i.e., finite metric space) (X, d) is to partition it into clusters. One such natural clustering is the k-median clustering, where we would like to choose a set C ⊆ X of k centers, such that
\[
\nu_C(X, d) = \sum_{u \in X} d(u, C)
\]
is minimized, where d(u, C) = min_{c∈C} d(u, c) is the distance of u to its closest center in C.
It is known that finding the optimal k-median clustering in a (general weighted) graph is NP-complete. As
such, the best we can hope for is an approximation algorithm. However, if the structure of the finite metric
space (X, d) is simple, then the problem can be solved efficiently. For example, if the points of X are on the real
line (and the distance between a and b is just |a − b|), then k-median can be solved using dynamic programming.
Another interesting case is when the metric space (X, d) is a HST. It is not too hard to prove the following lemma; see Exercise 40.7.1.

Lemma 40.2.2. Let (X, d) be a HST defined over n points, and let k > 0 be an integer. One can compute the
optimal k-median clustering of X in O(k2 n) time.

Thus, if we can embed a general graph G into a HST HST, with low distortion, then we could approximate
the k-median clustering on G by clustering the resulting HST, and “importing” the resulting partition to the
original space. The quality of approximation, would be bounded by the distortion of the embedding of G into
HST.

40.3. Random Partitions


Let (X, d) be a finite metric space. Given a partition P = {C1 , . . . , Cm } of X, we refer to the sets Ci as clusters.
We write PX for the set of all partitions of X. For x ∈ X and a partition P ∈ PX we denote by P(x) the unique
cluster of P containing x. Finally, the set of all probability distributions on PX is denoted DX .
The following partition scheme is due to [CKR04].

Figure 40.1: An example of the partition of a square (induced by a set of points) as described in Section 40.3.1.

40.3.1. Constructing the partition


Consider a given metric space (X, d), where X is a set of n points.
Let ∆ = 2^u be a prescribed parameter, which is the required diameter of the resulting clusters. Choose,
uniformly at random, a permutation π of X and a random value α ∈ [1/4, 1/2]. Let R = α∆, and observe that it
is uniformly distributed in the interval [∆/4, ∆/2].
The partition is now defined as follows: A point x ∈ X is assigned to the cluster Cy of y, where y is the first
point in the permutation in distance ≤ R from x. Formally,
C_y = { x ∈ X : x ∈ b(y, R) and π(y) ≤ π(z) for all z ∈ X with x ∈ b(z, R) }.
Let P = {Cy }y∈X denote the resulting partition.
Here is a somewhat more intuitive explanation: Once we fix the radius of the clusters R, we start scooping
out balls of radius R centered at the points of the random permutation π. At the ith stage, we scoop out only the
remaining mass at the ball of radius R centered at x_i, where x_i is the ith point in the random permutation.
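In code, the scheme can be sketched as follows (a minimal Python sketch; the interface — a list of points, a metric function `dist`, and the diameter parameter `delta` — is our own choice, not part of the original scheme):

```python
import random

def ckr_partition(points, dist, delta):
    """Random partition with cluster diameter <= delta: pick R uniformly in
    [delta/4, delta/2] and a random permutation pi, then assign every point
    to the first point of pi within distance R of it."""
    R = random.uniform(delta / 4, delta / 2)
    pi = list(points)
    random.shuffle(pi)                      # the random permutation pi
    center = {}                             # center[x] = the y whose cluster claimed x
    for y in pi:                            # scoop out the ball b(y, R)
        for x in points:
            if x not in center and dist(x, y) <= R:
                center[x] = y
    return center
```

Each point x lands in the cluster of the first point y (in the order of π) with d(x, y) ≤ R; since every cluster is contained in a ball of radius R ≤ ∆/2, its diameter is at most ∆.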

40.3.2. Properties
The following lemma quantifies the probability that a (crystal) ball of radius t centered at a point x is fully
contained in one of the clusters of the partition. (Otherwise, the crystal ball is of course broken.)

Figure 40.2: The resulting partition.

Lemma 40.3.1. Let (X, d) be a finite metric space, ∆ = 2^u a prescribed parameter, and let P be the partition
of X generated by the above random partition. Then the following holds:
(i) For any C ∈ P, we have diam(C) ≤ ∆.
(ii) Let x be any point of X, and t a parameter ≤ ∆/8. Then,

P[ b(x, t) ⊈ P(x) ] ≤ (8t/∆) ln(b/a),

where a = |b(x, ∆/8)|, and b = |b(x, ∆)|.
Proof: Since C_y ⊆ b(y, R), we have that diam(C_y) ≤ ∆, and thus the first claim holds.
Let U be the set of points of b(x, ∆), such that w ∈ U iff b(w, R) ∩ b(x, t) ≠ ∅. Arrange the points of
U in increasing distance from x, and let w_1, . . . , w_{b′} denote the resulting order, where b′ = |U|. Let I_k =
[d(x, w_k) − t, d(x, w_k) + t] and write E_k for the event that w_k is the first point in π such that b(x, t) ∩ C_{w_k} ≠ ∅, and
yet b(x, t) ⊈ C_{w_k}. Note that if w_k ∈ b(x, ∆/8), then P[E_k] = 0, since b(x, t) ⊆ b(x, ∆/8) ⊆ b(w_k, ∆/4) ⊆ b(w_k, R).
In particular, w_1, . . . , w_a ∈ b(x, ∆/8) and as such P[E_1] = · · · = P[E_a] = 0. Also, note that if d(x, w_k) < R − t
then b(w_k, R) contains b(x, t) and as such E_k cannot happen. Similarly, if d(x, w_k) > R + t then b(w_k, R) ∩ b(x, t) =
∅ and E_k cannot happen. As such, if E_k happens then R − t ≤ d(x, w_k) ≤ R + t; namely, if E_k happens then R ∈ I_k.
Thus, P[E_k] = P[E_k ∩ (R ∈ I_k)] = P[R ∈ I_k] · P[E_k | R ∈ I_k]. Now, R is uniformly distributed in the interval
[∆/4, ∆/2], and I_k is an interval of length 2t. Thus, P[R ∈ I_k] ≤ 2t/(∆/4) = 8t/∆.
Next, to bound P[E_k | R ∈ I_k], we observe that w_1, . . . , w_{k−1} are closer to x than w_k and their distance to b(x, t)
is smaller than R. Thus, if any of them appear before w_k in π then E_k does not happen. Thus, P[E_k | R ∈ I_k] is
bounded by the probability that w_k is the first to appear in π out of w_1, . . . , w_k. But this probability is 1/k, and
thus P[E_k | R ∈ I_k] ≤ 1/k.
We are now ready for the kill. Indeed,

P[ b(x, t) ⊈ P(x) ] = Σ_{k=1}^{b′} P[E_k] = Σ_{k=a+1}^{b′} P[E_k] = Σ_{k=a+1}^{b′} P[R ∈ I_k] · P[E_k | R ∈ I_k]
≤ Σ_{k=a+1}^{b′} (8t/∆) · (1/k) ≤ (8t/∆) ln(b′/a) ≤ (8t/∆) ln(b/a),

since Σ_{k=a+1}^{b} 1/k ≤ ∫_a^b dx/x = ln(b/a) and b′ ≤ b. ■

40.4. Probabilistic embedding into trees


In this section, given an n-point finite metric (X, d), we would like to embed it into a HST. As mentioned above,
one can verify that for any embedding into a HST, the distortion in the worst case is Ω(n). Thus, we define
a randomized algorithm that embeds (X, d) into a tree. Let T be the resulting tree, and consider two points
x, y ∈ X. Consider the random variable d_T(x, y). We construct the tree T such that distances never shrink; i.e.,
d(x, y) ≤ d_T(x, y). The probabilistic distortion of this embedding is max_{x,y} E[ d_T(x, y)/d(x, y) ]. Somewhat surprisingly,
one can find such an embedding with logarithmic probabilistic distortion.

Theorem 40.4.1. Given an n-point metric (X, d), one can randomly embed it into a 2-HST with probabilistic
distortion ≤ 24 ln n.

Proof: The construction is recursive. Let ∆ = diam(P), and compute a random partition of X with cluster diameter
∆/2, using the construction of Section 40.3.1. We recursively construct a 2-HST for each cluster, and
hang the resulting clusters on the root node v, which is marked by ∆_v = ∆. Clearly, the resulting tree is
a 2-HST.
For a node v ∈ T , let X(v) be the set of points of X contained in the subtree of v.
For the analysis, assume diam(P) = 1, and consider two points x, y ∈ X. We consider a node v ∈ T to be
in level i if level(v) = ⌈lg ∆_v⌉ = i. The two points x and y correspond to two leaves in T, and let û be the least
common ancestor of x and y in T. We have d_T(x, y) ≤ 2^{level(û)}. Furthermore, note that along a path from a leaf
to the root the levels are strictly monotonically increasing.
Being more conservative, let w be the first ancestor of x, such that b = b(x, d(x, y)) is not completely
contained in X(u_1), . . . , X(u_m), where u_1, . . . , u_m are the children of w. Clearly, level(w) > level(û). Thus,
d_T(x, y) ≤ 2^{level(w)}.
Consider the path σ from the root of T to x, and let E_i be the event that b is not fully contained in X(v_i),
where v_i is the node of σ of level i (if such a node exists). Furthermore, let Y_i be the indicator variable which is
1 if E_i is the first to happen out of the sequence of events E_0, E_{−1}, . . .. Clearly, d_T(x, y) ≤ Σ_i Y_i 2^i.
 
Let t = d(x, y) and j = ⌊lg d(x, y)⌋, and let n_i = |b(x, 2^i)|, for i = 0, −1, −2, . . .. We have

E[ d_T(x, y) ] ≤ Σ_{i=j}^{0} E[Y_i] 2^i ≤ Σ_{i=j}^{0} 2^i P[E_i] ≤ Σ_{i=j}^{0} 2^i · (8t/2^i) ln(n_i/n_{i−3}),

by Lemma 40.3.1. Thus,

E[ d_T(x, y) ] ≤ 8t ln( Π_{i=j}^{0} n_i/n_{i−3} ) ≤ 8t ln( n_0 · n_{−1} · n_{−2} ) ≤ 24t ln n,

since the product telescopes. It thus follows that the expected distortion for x and y is ≤ 24 ln n. ■
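The recursive construction above can be sketched in code (a rough Python sketch, assuming the input points are distinct; the pair representation (∆_v, children) for internal nodes is our own choice, and the random partition of Section 40.3.1 is inlined):

```python
import random

def random_hst(points, dist):
    """Recursively build a 2-HST: partition with cluster diameter diam/2
    (so the partition radius R is uniform in [diam/8, diam/4]) and recurse.
    A leaf is the point itself; an internal node is (diam, children)."""
    if len(points) == 1:
        return points[0]
    diam = max(dist(p, q) for p in points for q in points)
    R = random.uniform(diam / 8, diam / 4)
    pi = list(points)
    random.shuffle(pi)
    clusters = {}
    for x in points:                      # first pi-point within distance R of x
        y = next(z for z in pi if dist(x, z) <= R)
        clusters.setdefault(y, []).append(x)
    return (diam, [random_hst(c, dist) for c in clusters.values()])
```

Since every cluster has diameter at most 2R ≤ diam/2, each recursive call strictly halves the diameter, so the recursion bottoms out at singletons after O(log Φ) levels.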

40.4.1. Application: approximation algorithm for k-median clustering


Let (X, d) be an n-point metric space, and let k be an integer. We would like to compute the optimal
k-median clustering; namely, find a subset C_opt ⊆ X, such that ν_{C_opt}(X, d) is minimized, see Section 40.2.2. To
this end, we randomly embed (X, d) into a HST HST using Theorem 40.4.1. Next, using Lemma 40.2.2, we
compute the optimal k-median clustering of HST. Let C be the set of centers computed. We return C together
with the partition of X it induces as the required clustering.

Theorem 40.4.2. Let (X, d) be an n-point metric space. One can compute in polynomial time a k-median
clustering of X which has expected price O(α log n), where α is the price of the optimal k-median clustering of
(X, d).

Figure 40.3: Examples of the sets resulting from the partition of Figure 40.1 and taking clusters into a set with
probability 1/2.

Proof: The algorithm is described above, and the fact that its running time is polynomial can be easily
verified. To prove the bound on the quality of the clustering, for any point p ∈ X, let cen(p) denote the closest
verified. To prove the bound on the quality of the clustering, for any point p ∈ X, let cen(p) denote the closest
point in Copt to p according to d, where Copt is the set of k-medians in the optimal clustering. Let C be the set
of k-medians returned by the algorithm, and let HST be the HST used by the algorithm. We have
β = ν_C(X, d) ≤ ν_C(X, d_HST) ≤ ν_{C_opt}(X, d_HST) = Σ_{p∈X} d_HST(p, C_opt) ≤ Σ_{p∈X} d_HST(p, cen(p)).

Thus, in expectation we have

E[β] ≤ E[ Σ_{p∈X} d_HST(p, cen(p)) ] = Σ_{p∈X} E[ d_HST(p, cen(p)) ] = Σ_{p∈X} O( d(p, cen(p)) log n )
= O( (log n) Σ_{p∈X} d(p, cen(p)) ) = O( ν_{C_opt}(X, d) log n ),

by linearity of expectation and Theorem 40.4.1. ■

40.5. Embedding any metric space into Euclidean space


Lemma 40.5.1. Let (X, d) be a metric, and let Y ⊂ X. Consider the mapping f : X → R, where f (x) =
d(x, Y) = miny∈Y d(x, y). Then for any x, y ∈ X, we have | f (x) − f (y)| ≤ d(x, y). Namely f is nonexpansive.

Proof: Indeed, let x′ and y′ be the closest points of Y to x and y, respectively. Observe that

f (x) = d(x, x′ ) ≤ d(x, y′ ) ≤ d(x, y) + d(y, y′ ) = d(x, y) + f (y)

by the triangle inequality. Thus, f (x) − f (y) ≤ d(x, y). By symmetry, we have f (y) − f (x) ≤ d(x, y). Thus,
| f (x) − f (y)| ≤ d(x, y). ■
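The lemma is easy to sanity-check numerically; in the following sketch the planar point set and the Euclidean metric are arbitrary illustrative choices:

```python
import random

def f(x, Y, dist):
    """The nonexpansive map f(x) = d(x, Y) = min over y in Y of d(x, y)."""
    return min(dist(x, y) for y in Y)

# Check |f(x) - f(y)| <= d(x, y) on random points in the plane.
dist = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
pts = [(random.random(), random.random()) for _ in range(40)]
Y = pts[:5]
for a in pts:
    for b in pts:
        assert abs(f(a, Y, dist) - f(b, Y, dist)) <= dist(a, b) + 1e-12
```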

40.5.1. The bounded spread case


Let (X, d) be an n-point metric. The spread of X, denoted by

Φ(X) = diam(X) / min_{x,y∈X, x≠y} d(x, y),

is the ratio between the diameter of X and the distance between the closest pair of points.

Theorem 40.5.2. Given an n-point metric Y = (X, d), with spread Φ, one can embed it into Euclidean space R^k
with distortion O( √(ln Φ) · ln n ), where k = O(ln Φ ln n).
Proof: Assume that diam(Y) = Φ (i.e., the smallest distance in Y is 1), and let r_i = 2^{i−2}, for i = 1, . . . , α, where
α = ⌈lg Φ⌉. Let P_{i,j} be a random partition of P with diameter r_i, using the construction of Section 40.3.1, for i = 1, . . . , α and
j = 1, . . . , β, where β = c log n and c is a large enough constant to be determined shortly.
For each cluster of P_{i,j} randomly toss a coin, and let V_{i,j} be all the points of X that belong to clusters in
P_{i,j} that got 'T' in their coin toss. For a point x ∈ X, let

f_{i,j}(x) = d(x, X \ V_{i,j}) = min_{v ∈ X \ V_{i,j}} d(x, v),

for i = 1, . . . , α and j = 1, . . . , β. Let F : X → R^{α·β} be the embedding, such that

F(x) = ( f_{1,1}(x), f_{1,2}(x), . . . , f_{1,β}(x), f_{2,1}(x), . . . , f_{2,β}(x), . . . , f_{α,1}(x), . . . , f_{α,β}(x) ),

where the first β coordinates form the block of the first resolution, the next β the second resolution, and so on.

Next, consider two points x, y ∈ X, with distance ϕ = d(x, y). Let u be an integer such that r_u ≤ ϕ/2 ≤ r_{u+1}.
Clearly, in any of the partitions P_{u,1}, . . . , P_{u,β} the points x and y belong to different clusters. Furthermore, with
probability half, x ∈ V_{u,j} and y ∉ V_{u,j}, or x ∉ V_{u,j} and y ∈ V_{u,j}, for 1 ≤ j ≤ β.
Let E_j denote the event that b(x, ρ) ⊆ V_{u,j} and y ∉ V_{u,j}, for j = 1, . . . , β, where ρ = ϕ/(64 ln n). By
Lemma 40.3.1, we have

P[ b(x, ρ) ⊈ P_{u,j}(x) ] ≤ (8ρ/r_u) ln n ≤ ϕ/(8 r_u) ≤ 1/2.
Thus,

P[E_j] = P[ (b(x, ρ) ⊆ P_{u,j}(x)) ∩ (x ∈ V_{u,j}) ∩ (y ∉ V_{u,j}) ]
= P[ b(x, ρ) ⊆ P_{u,j}(x) ] · P[ x ∈ V_{u,j} ] · P[ y ∉ V_{u,j} ] ≥ 1/8,

since those three events are independent. Notice that if E_j happens, then f_{u,j}(x) ≥ ρ and f_{u,j}(y) = 0.
Let X_j be an indicator variable which is 1 if E_j happens, for j = 1, . . . , β. Let Z = Σ_j X_j, and we have µ =
E[Z] = Σ_j E[X_j] ≥ β/8. Thus, the probability that only β/16 of E_1, . . . , E_β happen is at most P[Z < (1 − 1/2) E[Z]].
By the Chernoff inequality, we have P[Z < (1 − 1/2) E[Z]] ≤ exp( −µ (1/2)²/2 ) ≤ exp(−β/64) ≤ 1/n^{10}, if we
set c = 640.
Thus, with high probability,

∥F(x) − F(y)∥ ≥ √( Σ_{j=1}^{β} ( f_{u,j}(x) − f_{u,j}(y) )² ) ≥ √( (β/16) ρ² ) = (ρ/4) √β = ϕ · √β/(256 ln n).

On the other hand, | f_{i,j}(x) − f_{i,j}(y)| ≤ d(x, y) = ϕ = 64ρ ln n. Thus,

∥F(x) − F(y)∥ ≤ √( αβ (64ρ ln n)² ) = 64 √(αβ) ρ ln n = √(αβ) · ϕ.

Thus, setting G(x) = F(x) · (256 ln n)/√β, we get a mapping that maps two points at distance ϕ from each other
to two points with distance in the range [ϕ, ϕ · √α · 256 ln n]. Namely, G(·) is an embedding with distortion
O( √α ln n ) = O( √(ln Φ) ln n ).
The probability that G fails on one of the pairs is smaller than (1/n^{10}) · n² < 1/n^8. In particular, we can
check the distortion of G for all pairs, and if any of them fail (i.e., the distortion is too big), we restart the
process. ■

40.5.2. The unbounded spread case
Our next task is to extend Theorem 40.5.2 to the case of unbounded spread. Indeed, let (X, d) be an n-point
metric, such that diam(X) ≤ 1/2. Again, we look at the different resolutions r_1, r_2, . . ., where r_i = 1/2^{i−1}. For
each one of those resolutions ri , we can embed this resolution into β coordinates, as done for the bounded case.
Then we concatenate the coordinates together.
There are two problems with this approach: (i) the number of resulting coordinates is infinite, and (ii) a pair
x, y, might be distorted a “lot” because it contributes to all resolutions, not only to its “relevant” resolutions.
Both problems can be overcome with careful tinkering. Indeed, for a resolution r_i, we are going to modify
the metric, so that it ignores short distances (i.e., distances ≤ r_i/n²). Formally, for each resolution r_i, let
G_i = (X, Ê_i) be the graph where two points x and y are connected if d(x, y) ≤ r_i/n². Consider a connected
component C of G_i. For any two points x, y ∈ C, we have d(x, y) ≤ n(r_i/n²) ≤ r_i/n. Let X_i be the set of
connected components of G_i, and define the distance between two connected components C, C′ ∈ X_i to be
d_i(C, C′) = d(C, C′) = min_{c∈C, c′∈C′} d(c, c′).
It is easy to verify that (X_i, d_i) is a metric space (see Exercise 40.7.2). Furthermore, we can naturally
embed (X, d) into (X_i, d_i) by mapping a point x ∈ X to its connected component in X_i. Essentially (X_i, d_i)
is a snapped version of the metric (X, d), with the advantage that Φ((X_i, d_i)) = O(n²). We now embed X_i
into β = O(log n) coordinates. Next, for any point of X we embed it into those β coordinates, by using the
embedding of its connected component in X_i. Let E_i be the embedding for resolution r_i. Namely, E_i(x) =
( f_{i,1}(x), f_{i,2}(x), . . . , f_{i,β}(x) ), where f_{i,j}(x) = min( d_i(x, X \ V_{i,j}), 2r_i ). The resulting embedding is F(x) = ⊕E_i(x) =
(E_1(x), E_2(x), . . .).
Since we slightly modified the definition of f_{i,j}(·), we have to show that f_{i,j}(·) is still nonexpansive. Indeed,
consider two points x, y ∈ X, and observe that

| f_{i,j}(x) − f_{i,j}(y) | ≤ | d_i(x, X \ V_{i,j}) − d_i(y, X \ V_{i,j}) | ≤ d_i(x, y) ≤ d(x, y),

as a simple case analysis¬ shows.


Consider a pair x, y ∈ X, and let ϕ = d(x, y). To see that F(·) is the required embedding (up to scaling), observe
that, by the same argumentation as in Theorem 40.5.2, we have that with high probability

∥F(x) − F(y)∥ ≥ ϕ · √β/(256 ln n).
To get an upper bound on this distance, observe that for i such that r_i > ϕn², we have E_i(x) = E_i(y). Thus,

∥F(x) − F(y)∥² = Σ_i ∥E_i(x) − E_i(y)∥² = Σ_{i: r_i < ϕn²} ∥E_i(x) − E_i(y)∥²
= Σ_{i: ϕ/n² < r_i < ϕn²} ∥E_i(x) − E_i(y)∥² + Σ_{i: r_i < ϕ/n²} ∥E_i(x) − E_i(y)∥²
≤ βϕ² lg(n⁴) + Σ_{i: r_i < ϕ/n²} (2r_i)² β ≤ 4βϕ² lg n + 4ϕ²β/n⁴ ≤ 5βϕ² lg n.
Thus, ∥F(x) − F(y)∥ ≤ ϕ √(5β lg n). We conclude that, with high probability, F(·) is an embedding of X into
Euclidean space with distortion ( ϕ √(5β lg n) ) / ( ϕ · √β/(256 ln n) ) = O(log^{3/2} n).
We still have to handle the infinite number of coordinates problem. However, the above proof shows that
we care about a resolution r_i (i.e., it contributes to the estimates in the above proof) only if there is a pair x
and y such that r_i/n² ≤ d(x, y) ≤ r_i n². Thus, for every pair of distances there are O(log n) relevant resolutions.
¬
Indeed, if f_{i,j}(x) < d_i(x, X \ V_{i,j}) and f_{i,j}(y) < d_i(y, X \ V_{i,j}) then f_{i,j}(x) = 2r_i and f_{i,j}(y) = 2r_i, which implies the above inequality. If
f_{i,j}(x) = d_i(x, X \ V_{i,j}) and f_{i,j}(y) = d_i(y, X \ V_{i,j}) then the inequality trivially holds. The other option is handled in a similar fashion.
Thus, there are at most η = O(n² β log n) = O(n² log² n) relevant coordinates, and we can ignore all the other
coordinates. Next, consider the affine subspace h that spans F(P). Clearly, it is (n − 1)-dimensional; consider
the projection G : R^η → R^{n−1} that projects a point to its closest point in h. Clearly, G(F(·)) is an embedding
with the same distortion for P, and the target space is of dimension n − 1.
Note that all of this process succeeds with high probability. If it fails, we try again. We conclude:

Theorem 40.5.3 (Low quality Bourgain theorem). Given an n-point metric M, one can embed it into Euclidean
space of dimension n − 1, such that the distortion of the embedding is at most O(log^{3/2} n).

Using the Johnson-Lindenstrauss lemma, the dimension can be further reduced to O(log n). Being more
careful in the proof, it is possible to reduce the dimension to O(log n) directly.

40.6. Bibliographical notes


The partitions we use are due to Calinescu et al. [CKR04]. The idea of embedding into spanning trees is due to
Alon et al. [AKPW95], who showed that one can get a probabilistic distortion of 2^{O(√(log n log log n))}. Yair Bartal
realized that by allowing trees with additional vertices, one can get a considerably better result. In particular,
he showed [Bar96] that probabilistic embedding into trees can be done with polylogarithmic average distortion.
He later improved the distortion to O(log n log log n) in [Bar98]. Improving this result was an open question,
culminating in the work of Fakcharoenphol et al. [FRT04] which achieves the optimal O(log n) distortion.
Our proof of Lemma 40.3.1 (which is originally from [FRT04]) is taken from [KLMN05]. The proof of
Theorem 40.5.3 is by Gupta [Gup00].
A good exposition of metric spaces is available in Matoušek [Mat02].

Embedding into spanning trees. The above embeds the graph into a Steiner tree. A more useful representation
would be a random embedding into a spanning tree. Surprisingly, this can be done, as shown by Elkin
et al. [EEST08]. This was improved to O(log n · log log n · (log log log n)³)­ by Abraham et al. [ABN08a,
ABN08b].

Alternative proof of the tree embedding result. Interestingly, if one does not care about the optimal dis-
tortion, one can get similar result (for embedding into probabilistic trees), by first embedding the metric into
Euclidean space, then reduce the dimension by the Johnson-Lindenstrauss lemma, and finally, construct an
HST by constructing a quadtree over the points. The “trick” is to randomly translate the quadtree. It is easy
to verify that this yields O(log⁴ n) distortion. See the survey by Indyk [Ind01] for more details. This random
shifting of quadtrees is a powerful technique that was used in getting several results, and it is a crucial ingredient
in Arora's [Aro98] approximation algorithm for Euclidean TSP.

40.7. Exercises
Exercise 40.7.1 (Clustering for HST). Let (X, d) be a HST defined over n points, and let k > 0 be an integer.
Provide an algorithm that computes the optimal k-median clustering of X in O(k²n) time.
[Transform the HST into a tree where every node has only two children. Next, run a dynamic programming
algorithm on this tree.]
­ Truly a polyglot of logs.

Exercise 40.7.2 (Partition induced metric).
(a) Give a counterexample to the following claim: Let (X, d) be a metric space, and let P be a partition of
X. Then, the pair (P, d′) is a metric, where d′(C, C′) = d(C, C′) = min_{x∈C, y∈C′} d(x, y) and C, C′ ∈ P.
(b) Let (X, d) be an n-point metric space, and consider the set U = { i | 2^i ≤ d(x, y) ≤ 2^{i+1}, for some x, y ∈ X }. Prove
that |U| = O(n). Namely, there are only O(n) different resolutions that “matter” for a finite metric space.

Exercise 40.7.3 (Computing the diameter via embeddings).
(a) (h:1) Let ℓ be a line in the plane, and consider the embedding f : R² → ℓ, which is the projection of the
plane onto ℓ. Prove that f is 1-Lipschitz, but it is not K-bi-Lipschitz for any constant K.
(b) (h:3) Prove that one can find a family of projections F of size O(1/√ε), such that for any two points
x, y ∈ R², for one of the projections f ∈ F we have d( f(x), f(y) ) ≥ (1 − ε)d(x, y).
(c) (h:1) Given a set P of n points in the plane, give an O(n/√ε) time algorithm that outputs two points x, y ∈ P,
such that d(x, y) ≥ (1 − ε)diam(P), where diam(P) = max_{z,w∈P} d(z, w) is the diameter of P.
(d) (h:2) Given P, show how to extract, in O(n) time, a set Q ⊆ P of size O(ε^{−2}), such that diam(Q) ≥
(1 − ε/2)diam(P). (Hint: Construct a grid of appropriate resolution.)
In particular, give a (1 − ε)-approximation algorithm to the diameter of P that works in O(n + ε^{−2.5}) time.
(There are slightly faster approximation algorithms known for approximating the diameter.)

Acknowledgments
The presentation in this write-up follows closely the insightful suggestions of Manor Mendel.

References
[ABN08a] I. Abraham, Y. Bartal, and O. Neiman. Nearly tight low stretch spanning trees. Proc. 49th Annu.
IEEE Sympos. Found. Comput. Sci. (FOCS), 781–790, 2008.
[ABN08b] I. Abraham, Y. Bartal, and O. Neiman. Nearly tight low stretch spanning trees. CoRR, abs/0808.2017,
2008. arXiv: 0808.2017.
[AKPW95] N. Alon, R. M. Karp, D. Peleg, and D. West. A graph-theoretic game and its application to the
k-server problem. SIAM J. Comput., 24(1): 78–100, 1995.
[Aro98] S. Arora. Polynomial time approximation schemes for Euclidean TSP and other geometric prob-
lems. J. Assoc. Comput. Mach., 45(5): 753–782, 1998.
[Bar96] Y. Bartal. Probabilistic approximations of metric space and its algorithmic application. Proc.
37th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 183–193, 1996.
[Bar98] Y. Bartal. On approximating arbitrary metrics by tree metrics. Proc. 30th Annu. ACM Sympos.
Theory Comput. (STOC), 161–168, 1998.
[CKR04] G. Călinescu, H. J. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension prob-
lem. SIAM J. Comput., 34(2): 358–372, 2004.
[EEST08] M. Elkin, Y. Emek, D. A. Spielman, and S. Teng. Lower-stretch spanning trees. SIAM J. Comput.,
38(2): 608–628, 2008.
[FRT04] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by
tree metrics. J. Comput. Sys. Sci., 69(3): 485–497, 2004.

[Gup00] A. Gupta. Embeddings of Finite Metrics. PhD thesis. University of California, Berkeley, 2000.
[Ind01] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. Proc. 42nd Annu.
IEEE Sympos. Found. Comput. Sci. (FOCS), Tutorial. 10–31, 2001.
[KLMN05] R. Krauthgamer, J. R. Lee, M. Mendel, and A. Naor. Measured descent: a new embedding
method for finite metric spaces. Geom. funct. anal. (GAFA), 15(4): 839–858, 2005.
[Mat02] J. Matoušek. Lectures on Discrete Geometry. Vol. 212. Grad. Text in Math. Springer, 2002.

Chapter 41

Entropy, Randomness, and Information


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“If only once - only once - no matter where, no matter before what audience - I could better the record of the great Rastelli and
juggle with thirteen balls, instead of my usual twelve, I would feel that I had truly accomplished something for my country.
But I am not getting any younger, and although I am still at the peak of my powers there are moments - why deny it? - when I
begin to doubt - and there is a time limit on all of us.”

Romain Gary, The talent scout

41.1. The entropy function


Definition 41.1.1. The entropy in bits of a discrete random variable X is given by

H(X) = − Σ_x P[X = x] lg P[X = x],

where lg x is the logarithm base 2 of x. Equivalently, H(X) = E[ lg (1/P[X]) ].
The binary entropy function H(p) for a random binary variable that is 1 with probability p, is

H(p) = −p lg p − (1 − p) lg(1 − p).

We define H(0) = H(1) = 0.

The function H(p) is concave, symmetric around 1/2 on the interval [0, 1], and achieves its maximum at
1/2. For a concrete example, consider H(3/4) ≈ 0.8113 and H(7/8) ≈ 0.5436. Namely, a coin that has probability 3/4
to be heads has a higher amount of “randomness” in it than a coin that has probability 7/8 for heads.
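The quoted values are easy to reproduce; a small Python sketch (the function name is ours):

```python
from math import log2

def bin_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0 by convention."""
    if p in (0, 1):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

# The values quoted in the text:
assert abs(bin_entropy(3 / 4) - 0.8113) < 1e-4
assert abs(bin_entropy(7 / 8) - 0.5436) < 1e-4
assert bin_entropy(1 / 2) == 1.0   # a fair coin maximizes the entropy
```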
Writing lg n = (ln n)/ln 2, we have that

H(p) = (1/ln 2) ( −p ln p − (1 − p) ln(1 − p) )

and

H′(p) = (1/ln 2) ( −ln p − p/p − (−1) ln(1 − p) − (−1)(1 − p)/(1 − p) ) = lg( (1 − p)/p ).

Deploying our amazing ability to compute derivatives of simple functions once more, we get that

H″(p) = (1/ln 2) · ( p/(1 − p) ) · ( p(−1) − (1 − p) )/p² = −1/( p(1 − p) ln 2 ).

Figure 41.1: The binary entropy function.

Since ln 2 ≈ 0.693 > 0, we have that H″(p) < 0, for all p ∈ (0, 1), and thus H(·) is concave in this range. Also,
H′ (1/2) = 0, which implies that H(1/2) = 1 is a maximum of the binary entropy. Namely, a balanced coin has
the largest amount of randomness in it.
Example 41.1.2. A random variable X that has probability 1/n to be i, for i = 1, . . . , n, has entropy H(X) =
−Σ_{i=1}^{n} (1/n) lg(1/n) = lg n.

Note that the entropy is oblivious to the exact values that the random variable can have, and it is sensitive
only to the probability distribution. Thus, a random variable that takes the values −1, +1 with equal probability has the
same entropy (i.e., 1) as a fair coin.
Lemma 41.1.3. Let X and Y be two independent random variables, and let Z be the random variable (X, Y).
Then H(Z) = H(X) + H(Y).

Proof: In the following, summations are over all possible values that the variables can have. By the independence
of X and Y we have

H(Z) = Σ_{x,y} P[(X, Y) = (x, y)] lg ( 1/P[(X, Y) = (x, y)] )
= Σ_{x,y} P[X = x] P[Y = y] lg ( 1/( P[X = x] P[Y = y] ) )
= Σ_x Σ_y P[X = x] P[Y = y] lg ( 1/P[X = x] ) + Σ_y Σ_x P[X = x] P[Y = y] lg ( 1/P[Y = y] )
= Σ_x P[X = x] lg ( 1/P[X = x] ) + Σ_y P[Y = y] lg ( 1/P[Y = y] ) = H(X) + H(Y). ■
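The additivity of entropy for independent variables can be checked numerically; the two distributions below are arbitrary illustrative choices:

```python
from math import log2

def entropy(dist):
    """Entropy in bits of a distribution given as a {value: probability} map."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

X = {0: 0.25, 1: 0.75}
Y = {'a': 0.5, 'b': 0.5}
# The joint distribution of the pair (X, Y), assuming independence:
Z = {(x, y): px * py for x, px in X.items() for y, py in Y.items()}
assert abs(entropy(Z) - (entropy(X) + entropy(Y))) < 1e-9
```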

Lemma 41.1.4. Suppose that nq is an integer in the range [0, n]. Then

2^{nH(q)}/(n + 1) ≤ (n choose nq) ≤ 2^{nH(q)}.

Proof: This trivially holds if q = 0 or q = 1, so assume 0 < q < 1. We know that

(n choose nq) q^{nq} (1 − q)^{n−nq} ≤ (q + (1 − q))^n = 1
⟹ (n choose nq) ≤ q^{−nq} (1 − q)^{−n(1−q)} = 2^{n(−q lg q − (1−q) lg(1−q))} = 2^{nH(q)}.

As for the other direction, let

µ(k) = (n choose k) q^k (1 − q)^{n−k}.

The claim is that µ(nq) is the largest term in Σ_{k=0}^{n} µ(k) = 1. Indeed,

∆_k = µ(k) − µ(k + 1) = (n choose k) q^k (1 − q)^{n−k} ( 1 − ((n − k)/(k + 1)) · (q/(1 − q)) ),

and the sign of this quantity is the sign of (k + 1)(1 − q) − (n − k)q = 1 + k − q − nq. Namely, ∆_k ≥ 0 when
k ≥ nq + q − 1, and ∆_k < 0 otherwise. Namely, µ(k) < µ(k + 1), for k < nq, and µ(k) ≥ µ(k + 1) for k ≥ nq.
Thus, µ(nq) is the largest term in Σ_{k=0}^{n} µ(k) = 1, and as such it is larger than the average. We have
µ(nq) = (n choose nq) q^{nq} (1 − q)^{n−nq} ≥ 1/(n + 1), which implies

(n choose nq) ≥ (1/(n + 1)) q^{−nq} (1 − q)^{−(n−nq)} = 2^{nH(q)}/(n + 1). ■

Lemma 41.1.4 can be extended to handle non-integer values of q. This is straightforward, and we omit the
easy details.
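Both bounds are straightforward to verify numerically, e.g. (the values of n and q below are arbitrary):

```python
from math import comb, log2

def H(q):
    """Binary entropy in bits, for 0 < q < 1."""
    return -q * log2(q) - (1 - q) * log2(1 - q)

n, q = 100, 0.3                    # nq = 30 is an integer
k = round(n * q)
# 2^{nH(q)}/(n+1) <= C(n, nq) <= 2^{nH(q)}
assert 2 ** (n * H(q)) / (n + 1) <= comb(n, k) <= 2 ** (n * H(q))
```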

Corollary 41.1.5. We have:
(i) q ∈ [0, 1/2] ⇒ (n choose ⌊nq⌋) ≤ 2^{nH(q)}.
(ii) q ∈ [1/2, 1] ⇒ (n choose ⌈nq⌉) ≤ 2^{nH(q)}.
(iii) q ∈ [1/2, 1] ⇒ 2^{nH(q)}/(n + 1) ≤ (n choose ⌊nq⌋).
(iv) q ∈ [0, 1/2] ⇒ 2^{nH(q)}/(n + 1) ≤ (n choose ⌈nq⌉).

The bounds of Lemma 41.1.4 and Corollary 41.1.5 are loose but sufficient for our purposes. As a sanity
check, consider the case when we generate a sequence of n bits using a coin with probability q for heads; then
by the Chernoff inequality, we will get roughly nq heads in this sequence. As such, the generated sequence Y
belongs to (n choose nq) ≈ 2^{nH(q)} possible sequences that have similar probability. As such, H(Y) ≈ lg (n choose nq) = nH(q), by
Example 41.1.2; this also readily follows from Lemma 41.1.3.

41.2. Extracting randomness


The problem. We are given a random variable X that is chosen uniformly at random from J0 : m − 1K =
{0, . . . , m − 1}. Our purpose is to build an algorithm that, given X, outputs a binary string, such that the bits in the
binary string can be interpreted as the coin flips of a fair balanced coin. That is, the probability of the ith bit of
the output (if it exists) to be 0 (or 1) is exactly half, and the different bits of the output are independent.


Figure 41.2: (A) m = 15. (B) The block decomposition. (C) If X = 10, then the extraction output is 2 in base
2, using 2 bits – that is 10.

Idea. We break J0 : m − 1K into consecutive blocks whose sizes are powers of two. Given the value of X, we find
which block contains it, and we output the binary representation of the location of X in the block containing it,
where if a block is of length 2^k, then we output k bits.

Entropy can be interpreted as the amount of unbiased random coin flips that can be extracted from a random
variable.

Definition 41.2.1. An extraction function Ext takes as input the value of a random variable X and outputs a
sequence of bits y, such that P[ Ext(X) = y | |y| = k ] = 1/2^k, whenever P[ |y| = k ] > 0, where |y| denotes the
length of y.

As a concrete (easy) example, consider X to be a uniform random integer variable out of 0, . . . , 7. All that
Ext(x) has to do in this case is to compute the binary representation of x.
The definition of the extraction function has two subtleties:
(A) It requires that all extracted sequences of the same length (say k), have the same probability to be output
(i.e., 1/2k ).
(B) If the extraction function can output a sequence of length k, then it needs to be able to output all 2k such
binary sequences.
Thus, for X a uniform random integer variable in the range 0, . . . , 11, the function Ext(x) can output the
binary representation of x if 0 ≤ x ≤ 7. However, what do we do if x is between 8 and 11? The idea is to
output the binary representation of x − 8 as a two bit number. Clearly, Definition 41.2.1 holds for this extraction
function, since P[ Ext(X) = 00 | |Ext(X)| = 2 ] = 1/4, as required. This scheme can of course be extended to
any range.
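For concreteness, here is a short Python sketch of this extraction scheme (the function name and interface are our own):

```python
def ext(x, m):
    """Extract fair bits from x drawn uniformly from {0, ..., m-1}: split the
    range into blocks whose sizes are distinct powers of two, and output the
    offset of x inside its block, as a bit string of the matching length."""
    assert 0 <= x < m
    k = m.bit_length() - 1          # largest power of two that is <= m
    while True:
        if x < 2 ** k:              # x lies in the block of size 2^k
            return format(x, 'b').zfill(k) if k > 0 else ''
        x, m = x - 2 ** k, m - 2 ** k
        k = m.bit_length() - 1
```

For the range 0, . . . , 11 discussed above, ext(x, 12) outputs the 3-bit representation of x for x ≤ 7, and the 2-bit representation of x − 8 otherwise.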

Tedium 41.2.2. For positive integers x ≤ y, and any positive integer ∆, we have that

x/y ≤ (x + ∆)/(y + ∆) ⟺ x(y + ∆) ≤ y(x + ∆) ⟺ x∆ ≤ y∆ ⟺ x ≤ y.

Theorem 41.2.3. Suppose that the value of a random variable X is chosen uniformly at random from the
integers {0, . . . , m − 1}. Then there is an extraction function for X that outputs on average (i.e., in expectation)
at least ⌊lg m⌋ − 1 = ⌊H(X)⌋ − 1 independent and unbiased bits.
Proof: We represent m as a sum of distinct powers of 2, namely m = Σ_i a_i 2^i, where a_i ∈ {0, 1}. Thus, we
decompose {0, . . . , m − 1} into a disjoint union of blocks whose sizes are distinct powers of 2. If
a number falls inside such a block, we output its relative location in the block, using binary representation of
the appropriate length (i.e., k if the block is of size 2^k). It is not difficult to verify that this function fulfills the
conditions of Definition 41.2.1, and it is thus an extraction function.
Now, observe that the claim holds if m is a power of two, by Example 41.1.2 (i.e., if m = 2^k, then H(X) = k,
and the scheme outputs exactly k bits). Thus, assume that m is not a power of 2; if X falls inside a block of
size 2^k of the decomposition, then the scheme outputs k bits.

266
The remainder of the proof is by induction – assume the claim holds when the range used by the random
variable is strictly smaller than m. In particular, let K = 2^k be the largest power of 2 that is smaller than m, and
let U = 2^u be the largest power of two such that U ≤ m − K ≤ 2U.
If the random number X ∈ J0 : K − 1K, then the scheme outputs k bits. Otherwise, we can think about the
extraction function as being recursive, extracting randomness from a random variable X′ = X − K that is
uniformly distributed in J0 : m − K − 1K.
By Tedium 41.2.2, we have that

(m − K)/m ≤ ( m − K + (2U + K − m) )/( m + (2U + K − m) ) = 2U/(2U + K).

Let Y be the random variable which is the number of random bits extracted. Since ⌊lg(m − K)⌋ ≥ u, and since
u − k − 1 < 0, we have that

E[Y] ≥ (K/m) k + ((m − K)/m) ( ⌊lg(m − K)⌋ − 1 ) ≥ (K/m) k + ((m − K)/m) (u − 1)
= k + ((m − K)/m) (u − k − 1) ≥ k − ( 2U/(2U + K) ) (1 + k − u).

If u = k − 1, then E[Y] ≥ k − (1/2) · 2 = k − 1, as required. If u = k − 2 then E[Y] ≥ k − (1/3) · 3 = k − 1. Finally, if
u < k − 2 then

E[Y] ≥ k − ( 2U/(2U + K) ) (1 + k − u) ≥ k − (2U/K) (1 + k − u) = k − (k − u + 1)/2^{(k−u+1)−2} ≥ k − 1,

since k − u + 1 ≥ 4 and i/2^{i−2} ≤ 1 for i ≥ 4. ■

Theorem 41.2.4. Consider a coin that comes up heads with probability p > 1/2. For any constant δ > 0 and
for n sufficiently large:
(A) One can extract, from an input of a sequence of n flips, an output sequence of (1 − δ)nH(p) (unbiased)
independent random bits in expectation.
(B) One cannot extract more than nH(p) bits in expectation from such a sequence.

Proof: There are (n choose j) input sequences with exactly j heads, and each has probability p^j (1 − p)^{n−j}. We map such
a sequence to the corresponding number in the set {0, . . . , (n choose j) − 1}. Note that this conditional distribution, given
j, is uniform on this set, and we can apply the extraction algorithm of Theorem 41.2.3. Let Z be the random
variable which is the number of heads in the input, and let B be the number of random bits extracted. We have

E[B] = Σ_{k=0}^{n} P[Z = k] E[B | Z = k],

and by Theorem 41.2.3, we have E[B | Z = k] ≥ ⌊lg (n choose k)⌋ − 1. Let ε < p − 1/2 be a constant to be determined
shortly. For n(p − ε) ≤ k ≤ n(p + ε), we have

(n choose k) ≥ (n choose ⌊n(p + ε)⌋) ≥ 2^{nH(p+ε)}/(n + 1),

by Corollary 41.1.5 (iii). We have
X
⌈n(p−ε)⌉
h i X $
⌈n(p−ε)⌉ !% !
n
E[B] ≥ P[Z = k] E B Z = k ≥ P[Z = k] lg −1
k=⌊n(p−ε)⌋ k=⌊n(p−ε)⌋
k

267
X
⌈n(p−ε)⌉ !
2nH(p+ε)
≥ P[Z = k] lg −2
k=⌊n(p−ε)⌋
n+1
  
= nH(p + ε) − lg(n + 1) P |Z − np| ≤ εn
!!
 nε2
≥ nH(p + ε) − lg(n + 1) 1 − 2 exp − ,
4p
h i     2
ε np ε 2
since µ = E[Z] = np and P |Z − np| ≥ p pn ≤ 2 exp − 4 p = 2 exp − nε4p
, by the Chernoff inequality. In
particular, fix ε > 0, such that H(p + ε) > (1 − δ/4)H(p), and since p is fixed nH(p) = Ω(n), in particular,
 2  for
δ
n sufficiently large, we have − lg(n + 1) ≥ − 10 nH(p). Also, for n sufficiently large, we have 2 exp − nε 4p
≤ 10δ .
Putting it together, we have that for n large enough, we have
 δ δ  δ
E [B] ≥ 1 − − nH(p) 1 − ≥ (1 − δ)nH(p),
4 10 10
as claimed.
As for the upper bound, observe that if an input sequence x has probability q, then the output sequence y = Ext(x) is generated with probability at least q. Now, all output sequences of length |y| have equal probability to be generated. Thus, we have the following (trivial) inequality: 2^{|Ext(x)|} q ≤ 2^{|Ext(x)|} P[y = Ext(X)] ≤ 1, implying that |Ext(x)| ≤ lg(1/q). Thus,

E[B] = Σ_x P[X = x] · |Ext(x)| ≤ Σ_x P[X = x] · lg(1/P[X = x]) = H(X). ■
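The mapping used at the start of the proof, sending a sequence with exactly j heads to a number in {0, . . . , C(n, j) − 1}, can be made concrete by combinatorial ranking. The following is an illustrative sketch (names are mine); conditioned on j, the returned rank is uniform, so the extractor of Theorem 41.2.3 can then be applied to it.

```python
from math import comb

def rank_sequence(flips):
    """Map a 0/1 sequence (1 = heads) to (j, r): j is the number of heads,
    and r is the lexicographic rank of the sequence among all C(n, j)
    sequences with exactly j heads."""
    n, j = len(flips), sum(flips)
    r, heads_left = 0, j
    for i, bit in enumerate(flips):
        if bit:
            # sequences with a 0 here (and the same prefix) come first:
            # they fit all heads_left remaining heads in the last n-i-1 slots
            r += comb(n - i - 1, heads_left)
            heads_left -= 1
    return j, r
```

For example, among the length-3 sequences with one head, 001, 010, 100 get ranks 0, 1, 2 respectively.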

41.3. Bibliographical Notes


The presentation here follows [MU05, Sec. 9.1-Sec 9.3].

References
[MU05] M. Mitzenmacher and U. Upfal. Probability and Computing – randomized algorithms and prob-
abilistic analysis. Cambridge, 2005.

Chapter 42

Entropy II
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
The memory of my father is wrapped up in white paper, like sandwiches taken for a day at work. Just as a magician takes
towers and rabbits out of his hat, he drew love from his small body, and the rivers of his hands overflowed with good deeds.

Yehuda Amichai, My Father

42.1. Huffman coding


A binary code assigns a string of 0s and 1s to each character in the alphabet. A code assigns for each symbol
in the input a codeword over some other alphabet. Such a coding is necessary, for example, for transmitting
messages over a wire, were you can send only 0 or 1 on the wire (i.e., for example, consider the good old
telegraph and Morse code). The receiver gets a binary stream of bits and needs to decode the message sent. A
prefix code, is a code where one can decipher the message, a character by character, by reading a prefix of the
input binary string, matching it to a code word (i.e., string), and continuing to decipher the rest of the stream.
Such a code is a prefix code.
A binary code (or a prefix code) is prefix-free if no code is a prefix of any other. ASCII and Unicode’s
UTF-8 are both prefix-free binary codes. Morse code is a binary code (and also a prefix code), but it is not
prefix-free; for example, the code for S (· · · ) includes the code for E (·) as a prefix. (Hopefully the receiver
knows that when it gets · · · that it is extremely unlikely that this should be interpreted as EEE, but rather S.
Any prefix-free binary code can be visualized as a binary tree with the encoded characters stored at the leaves. The code word for any symbol is given by the path from the root to the corresponding leaf: 0 for left, 1 for right. The length of a codeword for a symbol is the depth of the corresponding leaf. Such trees are usually referred to as prefix trees or code trees.
(Figure: a code tree over the characters a, b, c, d, realizing the code a = 00, b = 010, c = 011, d = 1.)
The beauty of prefix trees (and thus of prefix codes) is that decoding is easy. As a concrete example, consider the tree above. Given a string '010100', we traverse down the tree from the root, going left if we get a '0' and right if we get a '1'. Whenever we get to a leaf, we output the character stored in the leaf, and we jump back to the root for the next character we are about to read. For the example '010100', after reading '010' our traversal in the tree leads us to the leaf marked with 'b'; we jump back to the root and read the next input digit, which is '1', and this leads us to the leaf marked with 'd', which we output, and jump back to the root. Finally, '00' leads us to the leaf marked by 'a', which the algorithm outputs. Thus, the binary string '010100' encodes the string “bda”.
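The decoding walk just described can be sketched as follows; this is an illustrative Python sketch (names are mine), using the code realized by the example tree (a = 00, b = 010, c = 011, d = 1).

```python
def decode(bits, code):
    """Decode a bit string with a prefix-free code ({symbol: codeword}).

    Extend the current prefix bit by bit; the moment it matches a
    codeword, output the symbol and restart from the root -- exactly the
    root-to-leaf traversal described above."""
    by_word = {word: sym for sym, word in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in by_word:
            out.append(by_word[cur])
            cur = ""
    assert cur == "", "input ended in the middle of a codeword"
    return "".join(out)

# the code of the example tree above
example = {"a": "00", "b": "010", "c": "011", "d": "1"}
```

Here decode('010100', example) returns 'bda', matching the hand trace above.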
Suppose we want to encode messages in an n-character alphabet so that the encoded message is as short as possible. Specifically, given an array of frequency counts f[1 . . . n], we want to compute a prefix-free binary code that minimizes the total encoded length of the message. That is, we would like to compute a tree T that minimizes

cost(T) = Σ_{i=1}^{n} f[i] · len(code(i)),     (42.1)

where code(i) is the binary string encoding the ith character and len(s) is the length (in bits) of the binary string s.
A nice property of this problem is that given two trees for some parts of the alphabet, we can easily put
them together into a larger tree by just creating a new node and hanging the trees from this common node. For
example, putting two characters M and U together yields a tree with a new root whose two children are the leaves M and U. Similarly, we can put together two subtrees A and B, by hanging them from a new common root.
42.1.1. The algorithm to build Huffman's code

This suggests a simple algorithm that takes the two least frequent characters in the current frequency table, merges them into a tree, and puts the merged tree back into the table (instead of the two old trees). The algorithm stops when there is a single tree. The intuition is that infrequent characters would participate in a large number of merges, and as such would be low in the tree – they would be assigned a long code word.
This algorithm is due to David Huffman, who developed it in 1952. Shockingly, this code is the best one can do. Namely, the resulting code asymptotically gives the best possible compression of the data (of course, one can do better compression in practice using additional properties of the data and careful hacking). Huffman coding is used widely and is the basic building block used by numerous other compression algorithms.
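The merging algorithm above can be sketched with a min-heap over the current frequency table; this is an illustrative implementation (names are mine), not the notes' own code. Leaves are assumed not to be tuples, since tuples here mark internal nodes.

```python
import heapq
from itertools import count

def huffman(freq):
    """Build a Huffman code from {symbol: frequency}; returns {symbol: codeword}.

    Repeatedly pop the two least frequent trees, merge them under a new
    node, and push the merged tree back, until a single tree remains."""
    order = count()  # unique tie-breaker so trees are never compared
    heap = [(f, next(order), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:               # degenerate one-symbol alphabet
        (_, _, sym), = heap
        return {sym: "0"}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(order), (t1, t2)))
    (_, _, tree), = heap
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: 0 = left, 1 = right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf: a symbol
            code[node] = prefix
    walk(tree, "")
    return code
```

For frequencies a:4, b:2, c:1, d:1 the algorithm first merges c and d, then the merged pair with b, then with a, giving codeword lengths 1, 2, 3, 3.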

42.1.2. Analysis
Lemma 42.1.1. Let T be an optimal code tree. Then T is a full binary tree (i.e., every node of T has either 0 or 2 children). In particular, if the height of T is d, then there are leaf nodes of depth d that are siblings.

Proof: If there is an internal node in T that has one child, we can remove this node from T, by connecting its only child directly with its parent. The resulting code tree is clearly a better compressor, in the sense of Eq. (42.1).
As for the second claim, consider a leaf u with maximum depth d in T, and consider its parent v = p(u). The node v has two children, and they are both leaves (otherwise u would not be the deepest node in the tree), as claimed. ■

Lemma 42.1.2. Let x and y be the two least frequent characters (breaking ties between equally frequent char-
acters arbitrarily). There is an optimal code tree in which x and y are siblings.

Proof: More precisely, there is an optimal code in which x and y are siblings and have the largest depth of any leaf. Indeed, let T be an optimal code tree with depth d. The tree T has at least two leaves at depth d that are siblings, by Lemma 42.1.1.
Now, suppose those two leaves are not x and y, but some other characters α and β. Let U be the code tree obtained by swapping x and α. The depth of x increases by some amount ∆, and the depth of α decreases by the same amount. Thus,

cost(U) = cost(T) − (f[α] − f[x])∆.

By assumption, x is one of the two least frequent characters, but α is not, which implies that f[α] ≥ f[x]. Thus, swapping x and α does not increase the total cost of the code. Since T was an optimal code tree, swapping x and α does not decrease the cost either. Thus, U is also an optimal code tree (and incidentally, f[α] actually equals f[x]). Similarly, swapping y and β must give yet another optimal code tree. In this final optimal code tree, x and y are maximum-depth siblings, as required. ■

Theorem 42.1.3. Huffman codes are optimal prefix-free binary codes.

Proof: If the message has only one or two different characters, the theorem is trivial. Otherwise, let f[1 . . . n] be the original input frequencies, where without loss of generality, f[1] and f[2] are the two smallest. To keep things simple, let f[n + 1] = f[1] + f[2]. By the previous lemma, we know that some optimal code for f[1..n] has characters 1 and 2 as siblings. Let T_opt be this optimal tree, and consider the tree T′_opt formed from it by removing the leaves 1 and 2. We are left with a tree T′_opt whose leaves are the characters 3, . . . , n and a “special” character n + 1 (which is the parent of 1 and 2 in T_opt) that has frequency f[n + 1]. Now, since f[n + 1] = f[1] + f[2], we have

cost(T_opt) = Σ_{i=1}^{n} f[i] · depth_{T_opt}(i)
    = Σ_{i=3}^{n+1} f[i] · depth_{T′_opt}(i) + f[1] · depth_{T_opt}(1) + f[2] · depth_{T_opt}(2) − f[n + 1] · depth_{T′_opt}(n + 1)
    = cost(T′_opt) + (f[1] + f[2]) · depth(T_opt) − (f[1] + f[2]) · (depth(T_opt) − 1)
    = cost(T′_opt) + f[1] + f[2].     (42.2)

This implies that minimizing the cost of T_opt is equivalent to minimizing the cost of T′_opt. In particular, T′_opt must be an optimal coding tree for f[3 . . . n + 1]. Now, consider the Huffman tree U_H constructed for f[3, . . . , n + 1] and the overall Huffman tree T_H constructed for f[1, . . . , n]. By the way the construction algorithm works, we have that U_H is formed by removing the leaves 1 and 2 from T_H. Now, by induction, we know that the Huffman tree generated for f[3, . . . , n + 1] is optimal; namely, cost(T′_opt) = cost(U_H). As such, arguing as above, we have

cost(T_H) = cost(U_H) + f[1] + f[2] = cost(T′_opt) + f[1] + f[2] = cost(T_opt),

by Eq. (42.2). Namely, the Huffman tree has the same cost as the optimal tree. ■

42.1.3. A formula for the average size of a code word


Assume that our input is made out of n characters, where the ith character is a pi fraction of the input (one can think about pi as the probability of seeing the ith character, if we were to pick a random character from the input).
Now, we can use these probabilities instead of frequencies to build a Huffman tree. The natural question is what is the length of the codewords assigned to characters, as a function of their probabilities?
In general this question does not have a trivial answer, but there is a simple elegant answer if all the probabilities are powers of 2.

Lemma 42.1.4. Let 1, . . . , n be n symbols, such that the probability for the ith symbol is pi, and furthermore, there is an integer li ≥ 0, such that pi = 1/2^{li}. Then, in the Huffman coding for this input, the code for i is of length li.

Proof: The proof is by an easy induction on the Huffman algorithm. Indeed, for n = 2 the claim trivially holds since there are only two characters, each with probability 1/2. Otherwise, let i and j be the two characters with lowest probability. It must hold that pi = pj (otherwise, Σ_k pk can not be equal to one). As such, the Huffman algorithm merges these two letters into a single “character” that has probability 2pi, which would have an encoding of length li − 1, by induction (on the remaining n − 1 symbols). Now, the resulting tree encodes i and j by code words of length (li − 1) + 1 = li, as claimed. ■

In particular, we have that li = lg(1/pi). This implies that the average length of a code word is

Σ_i pi lg(1/pi).

If we consider X to be a random variable that takes value i with probability pi, then this formula is

H(X) = Σ_i P[X = i] lg(1/P[X = i]),

which is the entropy of X.
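For a dyadic distribution, the claim of Lemma 42.1.4 can be checked numerically; the following is a small illustrative sketch (names are mine).

```python
from math import log2

def entropy(ps):
    """H(X) = sum_i p_i * lg(1 / p_i); terms with p_i = 0 contribute 0."""
    return sum(p * log2(1 / p) for p in ps if p > 0)

# Every p_i below is a power of two, so by Lemma 42.1.4 the Huffman
# codeword for symbol i has length exactly l_i = lg(1 / p_i), and the
# average codeword length coincides with the entropy.
ps = [1 / 2, 1 / 4, 1 / 8, 1 / 8]       # codeword lengths 1, 2, 3, 3
avg_len = sum(p * log2(1 / p) for p in ps)
```

Here both avg_len and entropy(ps) equal 1.75 bits.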


Theorem 42.1.5. Consider an input sequence S of m characters, where the characters are taken from an alphabet set Σ of size n. In particular, let fi be the number of times the ith character of Σ appears in S, for i = 1, . . . , n. Consider the compression of this string using Huffman's code. Then, the total length of the compressed string (ignoring the space needed to store the code itself) is ≤ m(H(X) + 1), where X is a random variable that returns i with probability pi = fi/m.

Proof: The trick is to replace pi, which might not be a power of 2, by qi = 2^{⌊lg pi⌋}. We have that qi ≤ pi ≤ 2qi, and qi is a power of 2, for all i. The leftover of this coding is ∆ = 1 − Σ_i qi. We write ∆ as a sum of powers of 2 (since the frequencies are fractions of the form i/m [since the input string is of length m], this requires at most τ = O(log m) numbers): ∆ = Σ_{j=n+1}^{n+τ} qj. We now create a Huffman code T for the frequencies q1, . . . , qn, qn+1, . . . , qn+τ. The output length to encode the input string using this code, by Lemma 42.1.4, is

L = m Σ_{i=1}^{n} pi lg(1/qi) ≤ m Σ_{i=1}^{n} pi (1 + lg(1/pi)) = m + m Σ_{i=1}^{n} pi lg(1/pi) = m + mH(X).

One can now restrict T to be a prefix tree only for the first n symbols. Indeed, delete the τ “fake” leaves/symbols, and repeatedly remove internal nodes that have only a single child. In the end of this process, we get a valid prefix tree for the first n symbols, and encoding the input string using this tree would require at most L bits, since the process only shortened the code words. Finally, let V be the resulting tree.
Now, consider the Huffman tree code for the n input symbols using the original frequencies p1, . . . , pn. The resulting tree U is a better encoder for the input string than V, by Theorem 42.1.3. As such, the compressed string would have at most L bits – thus establishing the claim. ■

42.2. Compression
In this section, we consider the problem of how to compress a binary string. We map each binary string into a new string (which is hopefully shorter). In general, by using a simple counting argument, one can show that no such mapping can achieve real compression (when the inputs are adversarial). However, the hope is that there is an underlying distribution on the inputs, such that some strings are considerably more common than others.

Definition 42.2.1. A compression function Compress takes as input a sequence of n coin flips, given as an
element of {H, T }n , and outputs a sequence of bits such that each input sequence of n flips yields a distinct
output sequence.

The following is easy to verify.

Lemma 42.2.2. If a sequence S1 is more likely than S2, then the compression function that minimizes the expected number of bits in the output assigns S2 a bit sequence that is at least as long as the one it assigns S1.

Note that this is weak; usually, we would like the function to output a prefix code, like the Huffman code.

Theorem 42.2.3. Consider a coin that comes up heads with probability p > 1/2. For any constant δ > 0,
when n is sufficiently large, the following holds.
(i) There exists a compression function Compress such that the expected number of bits output by Compress
on an input sequence of n independent coin flips (each flip gets heads with probability p) is at most
(1 + δ)nH(p); and
(ii) The expected number of bits output by any compression function on an input sequence of n independent
coin flips is at least (1 − δ)nH(p).

Proof: Let ε > 0 be a constant such that p − ε > 1/2. The first bit output by the compression procedure is '1' if the output string is just a copy of the input (using n + 1 bits overall in the output), and '0' if it is compressed. We compress only if the number of ones in the input sequence, denoted by X, is larger than (p − ε)n. By the Chernoff inequality, we know that P[X < (p − ε)n] ≤ exp(−nε²/2p).
If there are more than (p − ε)n ones in the input, then since p − ε > 1/2, we have that

Σ_{j=⌈n(p−ε)⌉}^{n} C(n, j) ≤ (n/2) · C(n, ⌈n(p − ε)⌉) ≤ (n/2) · 2^{nH(p−ε)},

by Corollary 41.1.5. As such, we can assign each such input sequence a number in the range 0 . . . (n/2) · 2^{nH(p−ε)}, and this requires (with the flag bit) at most 1 + lg n + nH(p − ε) bits.
Thus, the expected number of bits output is bounded by

(n + 1) exp(−nε²/2p) + 1 + lg n + nH(p − ε) ≤ (1 + δ)nH(p),

by carefully setting ε and taking n sufficiently large, establishing the upper bound.
As for the lower bound, observe that at least one of the sequences having exactly τ = ⌊(p + ε)n⌋ heads must be compressed into a sequence of length at least

lg C(n, ⌊(p + ε)n⌋) − 1 ≥ lg(2^{nH(p+ε)} / (n + 1)) − 1 = nH(p + ε) − lg(n + 1) − 1 = µ,

by Corollary 41.1.5. Now, any input string with fewer than τ heads has lower probability to be generated. Indeed, for a specific string with α < τ ones, the probability of generating it is p^α (1 − p)^{n−α}, while for a string with τ ones it is p^τ (1 − p)^{n−τ}. Now, observe that

p^α (1 − p)^{n−α} = p^τ (1 − p)^{n−τ} · ((1 − p)/p)^{τ−α} < p^τ (1 − p)^{n−τ},

as 1 − p < 1/2 < p implies that (1 − p)/p < 1.
As such, Lemma 42.2.2 implies that all the input strings with fewer than τ ones must be compressed into strings of length at least µ, by an optimal compressor. Now, the Chernoff inequality implies that P[X ≤ τ] ≥ 1 − exp(−nε²/12p), implying that an optimal compressor outputs on average at least (1 − exp(−nε²/12p)) · µ bits. Again, by carefully choosing ε and taking n sufficiently large, we have that the average output length of an optimal compressor is at least (1 − δ)nH(p). ■

42.3. Bibliographical Notes


The presentation here follows [MU05, Sec. 9.1-Sec 9.3].

References
[MU05] M. Mitzenmacher and U. Upfal. Probability and Computing – randomized algorithms and prob-
abilistic analysis. Cambridge, 2005.

Chapter 43

Entropy III - Shannon’s Theorem


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024

The memory of my father is wrapped up in white paper, like sandwiches taken for a day at work. Just as a magician takes towers and rabbits out of his hat, he drew love from his small body, and the rivers of his hands overflowed with good deeds.

– Yehuda Amichai, My Father


43.1. Coding: Shannon’s Theorem
We are interested in the problem of sending messages over a noisy channel. We will assume that the channel noise is “nicely” behaved.
Definition 43.1.1. The input to a binary symmetric channel with parameter p is a sequence of bits x1, x2, . . . , and the output is a sequence of bits y1, y2, . . . , such that P[xi = yi] = 1 − p independently for each i.
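A binary symmetric channel is easy to simulate; the following illustrative sketch (names are mine) flips each bit independently with probability p.

```python
import random

def bsc(bits, p, rng=random):
    """Binary symmetric channel with parameter p: each transmitted bit is
    flipped independently with probability p, so P[x_i = y_i] = 1 - p."""
    return [b ^ (rng.random() < p) for b in bits]
```

With p = 0 the channel is the identity; as p approaches 1/2 the output carries less and less information about the input.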

Translation: every bit transmitted has the same probability of being flipped by the channel. The question is how much information we can send on the channel with this level of noise. Naturally, a channel would have some capacity constraints (say, at most 4,000 bits per second can be sent on the channel), and the question is how to send the largest amount of information, so that the receiver can recover the original information sent.
Now, it is important to realize that noise handling is unavoidable in the real world. Furthermore, there are tradeoffs between channel capacity and noise levels (i.e., we might be able to send considerably more bits on the channel, but the probability of flipping (i.e., p) might be much larger). In designing a communication protocol over this channel, we need to figure out where the optimal choice is, as far as the amount of information sent.
Definition 43.1.2. A (k, n) encoding function Enc : {0, 1}k → {0, 1}n takes as input a sequence of k bits and
outputs a sequence of n bits. A (k, n) decoding function Dec : {0, 1}n → {0, 1}k takes as input a sequence of n
bits and outputs a sequence of k bits.

Thus, the sender would use the encoding function to send its message, and the decoder would use the received string (with the noise in it) to recover the sent message. That is, the sender starts with a message of k bits, blows it up to n bits using the encoding function, to get some robustness to noise, and sends it over the (noisy) channel to the receiver. The receiver takes the received (noisy) message of n bits, and uses the decoding function to recover the original k bits of the message.

Naturally, we would like k to be as large as possible (for a fixed n), so that we can send as much information as possible on the channel. Of course, there might be some failure probability; that is, the receiver might be unable to recover the original string, or might recover an incorrect string.
The following celebrated result of Shannon¬ in 1948 states exactly how much information can be sent on
such a channel.
Theorem 43.1.3 (Shannon’s theorem). For a binary symmetric channel with parameter p < 1/2 and for any
constants δ, γ > 0, where n is sufficiently large, the following holds:
(i) For any k ≤ n(1 − H(p) − δ) there exist (k, n) encoding and decoding functions such that the probability that the receiver fails to obtain the correct message is at most γ, for every possible k-bit input message.
(ii) There are no (k, n) encoding and decoding functions with k ≥ n(1 − H(p) + δ) such that the probability
of decoding correctly is at least γ for a k-bit input message chosen uniformly at random.

43.2. Proof of Shannon’s theorem


The proof is not hard, but requires some care, and we will break it into parts.

43.2.1. How to encode and decode efficiently


43.2.1.1. The scheme

Our scheme is simple. Pick k ≤ n(1 − H(p) − δ). For each number i = 0, . . . , K̂, where K̂ = 2^{k+1} − 1, randomly generate a binary string Yi made out of n bits, each one chosen independently and uniformly. Let Y0, . . . , Y_K̂ denote these codewords.
For each of these codewords we will compute the probability that, if we send this codeword, the receiver would fail. Let X0, . . . , XK, where K = 2^k − 1, be the 2^k codewords with the lowest probability of failure. We assign these words to the 2^k messages we need to encode in an arbitrary fashion. Specifically, for i = 0, . . . , 2^k − 1, we encode i as the string Xi.
The decoding of a message w is done by going over all the codewords, and finding all the codewords that are within (Hamming) distance in the range [p(1 − ε)n, p(1 + ε)n] from w. If there is only a single word Xi with this property, we return i as the decoded word. Otherwise, if there is no such word or there is more than one such word, then the decoder stops and reports an error.

43.2.1.2. The proof

Intuition. Each codeword Yi corresponds to a region that looks like a ring: the “ring” of Yi is all the strings in Hamming distance between (1 − ε)r and (1 + ε)r from Yi, where r = pn. Clearly, if we transmit a string Yi, and the receiver gets a string inside the ring of Yi, it is natural to try to recover the received string to the original code corresponding to Yi. Naturally, there are two possible bad events here:
(A) The received string is outside the ring of Yi.
(B) The received string is contained in several rings of different Y's, and it is not clear which one the receiver should decode the string to. (In the original figure, these bad regions are the darker overlaps between neighboring rings.)
¬ Claude Elwood Shannon (April 30, 1916 - February 24, 2001), an American electrical engineer and mathematician, has been called “the father of information theory”.

Let Si = S(Yi) be all the binary strings (of length n) such that if the receiver gets such a word, it would decipher it to be the original string assigned to Yi (here we are still using the extended set of codewords Y0, . . . , Y_K̂). Note that if we remove some codewords from consideration, the set S(Yi) just increases in size (i.e., the bad region in the ring of Yi that is covered multiple times shrinks). Let Wi be the probability that Yi was sent, but it was not deciphered correctly. Formally, let r denote the received word. We have that

Wi = Σ_{r ∉ Si} P[r was received when Yi was sent].     (43.1)

To bound this quantity, let ∆(x, y) denote the Hamming distance between the binary strings x and y. Clearly, if x was sent, the probability that y was received is

w(x, y) = p^{∆(x,y)} (1 − p)^{n−∆(x,y)}.

As such, we have

P[r received when Yi was sent] = w(Yi, r).

Definition 43.2.1. Let S_{i,r} be an indicator variable which is 1 if r ∉ Si. It is one if the receiver gets r and does not decode it to Yi (either because of failure, or because r is too close/far from Yi).

We have that the failure probability when sending Yi is

Wi = Σ_{r ∉ Si} P[r received when Yi was sent] = Σ_{r ∉ Si} w(Yi, r) = Σ_r S_{i,r} w(Yi, r).     (43.2)

The value of Wi is a random variable over the choice of Y0, . . . , Y_K̂. As such, it is natural to ask what is the expected value of Wi.
Consider the ring

ring(r) = { x ∈ {0, 1}^n | (1 − ε)np ≤ ∆(x, r) ≤ (1 + ε)np },

where ε > 0 is a small enough constant. Observe that x ∈ ring(y) if and only if y ∈ ring(x). Suppose that the codeword Yi was sent, and r was received. The decoder returns the original code associated with Yi if Yi is the only codeword that falls inside ring(r).

Lemma 43.2.2. Given that Yi was sent and r was received, with r ∈ ring(Yi), the probability of the decoder failing is

τ = P[r ∉ Si | r ∈ ring(Yi)] ≤ γ/8,

where γ is the parameter of Theorem 43.1.3.

Proof: The decoder fails here only if ring(r) contains some other codeword Yj (j ≠ i) in it. As such,

τ = P[r ∉ Si | r ∈ ring(Yi)] ≤ P[Yj ∈ ring(r) for some j ≠ i] ≤ Σ_{j≠i} P[Yj ∈ ring(r)].

Now, we remind the reader that the Yj's are generated by picking each bit randomly and independently, with probability 1/2. As such, we have

P[Yj ∈ ring(r)] = |ring(r)| / |{0, 1}^n| = Σ_{m=⌈(1−ε)np⌉}^{⌊(1+ε)np⌋} C(n, m) / 2^n ≤ (n / 2^n) · C(n, ⌊(1 + ε)np⌋),

since (1 + ε)p < 1/2 (for ε sufficiently small), and as such the last binomial coefficient in this summation is the largest. By Corollary 41.1.5 (i), we have

P[Yj ∈ ring(r)] ≤ (n / 2^n) · C(n, ⌊(1 + ε)np⌋) ≤ (n / 2^n) · 2^{nH((1+ε)p)} = n · 2^{n(H((1+ε)p)−1)}.

As such, we have

τ ≤ Σ_{j≠i} P[Yj ∈ ring(r)] ≤ K̂ · P[Y1 ∈ ring(r)] ≤ 2^{k+1} · n · 2^{n(H((1+ε)p)−1)} ≤ n · 2^{n(1−H(p)−δ) + 1 + n(H((1+ε)p)−1)} ≤ n · 2^{n(H((1+ε)p)−H(p)−δ) + 1},

since k ≤ n(1 − H(p) − δ). Now, we choose ε to be a small enough constant, so that the quantity H((1 + ε)p) − H(p) − δ is equal to some (absolute) negative constant, say −β, where β > 0. Then, τ ≤ n · 2^{−βn+1}, and choosing n large enough, we can make τ smaller than γ/8, as desired. As such, we just proved that

τ = P[r ∉ Si | r ∈ ring(Yi)] ≤ γ/8. ■

Lemma 43.2.3. Consider the situation where Yi is sent, and the received string is r. We have that

P[r ∉ ring(Yi)] = Σ_{r ∉ ring(Yi)} w(Yi, r) ≤ γ/8,

where γ is the parameter of Theorem 43.1.3.

Proof: This quantity is the probability that, when sending Yi and flipping every bit with probability p, the received string r has more than pn + εpn bits flipped (or fewer than pn − εpn). This quantity can be bounded using the Chernoff inequality. Indeed, let Z = ∆(Yi, r), and observe that E[Z] = pn, and Z is the sum of n independent indicator variables. As such,

Σ_{r ∉ ring(Yi)} w(Yi, r) = P[|Z − E[Z]| > εpn] ≤ 2 exp(−(ε²/4) pn) ≤ γ/8,

since ε is a constant, and n is sufficiently large. ■

We remind the reader that S_{i,r} is an indicator variable that is one if receiving r (when sending Yi) is “bad”; see Definition 43.2.1. Importantly, this indicator variable also depends on all the other codewords – as they might cause some regions in the ring of Yi to be covered multiple times.

Lemma 43.2.4. We have that f(Yi) = Σ_{r ∉ ring(Yi)} E[S_{i,r} w(Yi, r)] ≤ γ/8 (the expectation is over all the choices of the Y's excluding Yi).

Proof: Observe that S_{i,r} w(Yi, r) ≤ w(Yi, r), and for fixed Yi and r we have that E[w(Yi, r)] = w(Yi, r). As such, we have that

f(Yi) = Σ_{r ∉ ring(Yi)} E[S_{i,r} w(Yi, r)] ≤ Σ_{r ∉ ring(Yi)} E[w(Yi, r)] = Σ_{r ∉ ring(Yi)} w(Yi, r) ≤ γ/8,

by Lemma 43.2.3. ■

Lemma 43.2.5. We have that g(Yi) = Σ_{r ∈ ring(Yi)} E[S_{i,r} w(Yi, r)] ≤ γ/8 (the expectation is over all the choices of the Y's excluding Yi).

Proof: We have that S_{i,r} w(Yi, r) ≤ S_{i,r}, as 0 ≤ w(Yi, r) ≤ 1. As such, we have that

g(Yi) = Σ_{r ∈ ring(Yi)} E[S_{i,r} w(Yi, r)] ≤ Σ_{r ∈ ring(Yi)} E[S_{i,r}] = Σ_{r ∈ ring(Yi)} P[r ∉ Si]
    = Σ_r P[(r ∉ Si) ∩ (r ∈ ring(Yi))]
    = Σ_r P[r ∉ Si | r ∈ ring(Yi)] · P[r ∈ ring(Yi)]
    ≤ Σ_r (γ/8) · P[r ∈ ring(Yi)] ≤ γ/8,

by Lemma 43.2.2. ■

Lemma 43.2.6. For any i, we have µ = E[Wi] ≤ γ/4, where γ is the parameter of Theorem 43.1.3 and Wi is the probability of failure to recover Yi if it was sent; see Eq. (43.1).

Proof: We have by Eq. (43.2) that Wi = Σ_r S_{i,r} w(Yi, r). For a fixed value of Yi, we have by linearity of expectation that

E[Wi | Yi] = E[Σ_r S_{i,r} w(Yi, r) | Yi] = Σ_r E[S_{i,r} w(Yi, r) | Yi]
    = Σ_{r ∈ ring(Yi)} E[S_{i,r} w(Yi, r) | Yi] + Σ_{r ∉ ring(Yi)} E[S_{i,r} w(Yi, r) | Yi] = g(Yi) + f(Yi) ≤ γ/8 + γ/8 = γ/4,

by Lemma 43.2.4 and Lemma 43.2.5. Now, E[Wi] = E[E[Wi | Yi]] ≤ E[γ/4] = γ/4. ■

In the following, we need the following trivial (but surprisingly deep) observation.

Observation 43.2.7. For a random variable X, if E[X] ≤ ψ, then there exists an event in the probability space that assigns X a value ≤ ψ.

Lemma 43.2.8. For the codewords X0, . . . , XK, the probability of failure in recovering them when sending them over the noisy channel is at most γ.

Proof: We just proved that when using Y0, . . . , Y_K̂, the expected probability of failure when sending Yi is E[Wi] ≤ γ/4, where K̂ = 2^{k+1} − 1. As such, the expected total probability of failure is

E[Σ_{i=0}^{K̂} Wi] = Σ_{i=0}^{K̂} E[Wi] ≤ (γ/4) · 2^{k+1} ≤ γ 2^k,

by Lemma 43.2.6. As such, by Observation 43.2.7, there exists a choice of the Yi's such that

Σ_{i=0}^{K̂} Wi ≤ γ 2^k.

Now, we use a similar argument to the one used in proving Markov's inequality. Indeed, the Wi are always positive, and it can not be that 2^k of them have value larger than γ, because then the summation would give

Σ_{i=0}^{K̂} Wi > γ 2^k,

which is a contradiction. As such, there are 2^k codewords with failure probability smaller than γ. We set the 2^k codewords X0, . . . , XK to be these words, where K = 2^k − 1. Since we picked only a subset of the codewords for our code, the probability of failure for each codeword only shrinks, and is at most γ. ■

Lemma 43.2.8 concludes the proof of the constructive part of Shannon’s theorem.

43.2.2. Lower bound on the message size

We omit the proof of this part. It follows similar argumentation, showing that for the ring associated with each codeword, most of it must be covered only by this ring (otherwise, there is no hope for recovery). Then an easy packing argument implies the claim.

43.3. Bibliographical Notes


The presentation here follows [MU05, Sec. 9.1-Sec 9.3].

References
[MU05] M. Mitzenmacher and U. Upfal. Probability and Computing – randomized algorithms and prob-
abilistic analysis. Cambridge, 2005.

Chapter 44

Approximate Max Cut


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024

We had encountered in the previous lecture examples of using rounding techniques for approximating discrete
optimization problems. So far, we had seen such techniques when the relaxed optimization problem is a linear
program. Interestingly, it is currently known how to solve optimization problems that are considerably more
general than linear programs. Specifically, one can solve convex programming. Here the feasible region is
convex. How to solve such an optimization problems is outside the scope of this course. It is however natural
to ask what can be done if one assumes that one can solve such general continuous optimization problems
exactly.
In the following, we show that the max cut optimization problem can be relaxed into a weird continuous
optimization problem, a so-called semi-definite program, which can be solved efficiently. Maybe more
surprisingly, we can round this continuous solution and get an improved approximation.

44.1. Problem Statement


Given an undirected graph G = (V, E) and nonnegative weights ωij , for all ij ∈ E, the maximum cut problem
(MAX CUT) is that of finding the set of vertices S that maximizes the weight of the edges in the cut (S , S̄ ); that
is, the weight of the edges with one endpoint in S and the other in S̄ . For simplicity, we usually set ωij = 0 for
ij ∉ E, and denote the weight of a cut (S , S̄ ) by

    w(S , S̄ ) = ∑_{i∈S , j∈S̄} ωij .

This problem is NP-Complete, and hard to approximate within a certain constant.
Given a graph with vertex set V = {1, . . . , n} and nonnegative weights ωij , the weight of the maximum cut
w(S , S̄ ) is given by the following integer quadratic program:

    (Q)   max (1/2) ∑_{i<j} ωij (1 − yi yj)
          subject to: yi ∈ {−1, 1}   ∀i ∈ V.

Indeed, set S = { i | yi = 1 }. Clearly, ω(S , S̄ ) = (1/2) ∑_{i<j} ωij (1 − yi yj).
Solving quadratic integer programming is of course NP-Hard. Thus, we will relax it, by thinking about
the numbers yi as unit vectors in a higher dimensional space. If so, the multiplication of the two numbers is
replaced by the dot product of the two vectors. We have:

    (P)   max γ = (1/2) ∑_{i<j} ωij (1 − ⟨vi , vj⟩)
          subject to: vi ∈ S(n)   ∀i ∈ V,

where S(n) is the n dimensional unit sphere in Rn+1 . This is an instance of semi-definite programming, which is a
special case of convex programming, which can be solved in polynomial time (solved here means approximated
within a factor of (1 + ε) of optimal, for any arbitrarily small ε > 0, in polynomial time). Namely, the solver
finds a feasible solution with the target function being arbitrarily close to the optimal solution. Observe that
(P) is a relaxation of (Q), and as such the optimal solution of (P) has value larger than the optimal value of (Q).
The intuition is that vertices that should be on opposite sides of the cut would be assigned, in (P), vectors
that are far away from each other. Thus, we compute the optimal
solution for (P), and we uniformly generate a random vector r on the unit sphere S(n) . This induces a hyperplane
h which passes through the origin and is orthogonal to r. We next assign all the vectors that are on one side of
h to S , and the rest to S .
Summarizing, the algorithm is as follows: first, we solve (P); next, we pick a random vector r uniformly
on the unit sphere S(n) . Finally, we set

    S = { vi | ⟨vi , r⟩ ≥ 0 } .
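As an illustration, here is a minimal sketch of this rounding step (not part of the notes; `round_cut` and the toy vectors are made-up names). A uniform direction r is obtained by normalizing a vector of independent Gaussians.

```python
import math
import random

def random_unit_vector(dim, rng):
    # normalizing a vector of i.i.d. Gaussians gives a uniform direction
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def round_cut(vectors, rng):
    """Hyperplane rounding: S = { i : <v_i, r> >= 0 } for a random r."""
    r = random_unit_vector(len(vectors[0]), rng)
    return {i for i, v in enumerate(vectors)
            if sum(a * b for a, b in zip(v, r)) >= 0.0}

# toy solution of (P): v0 and v2 coincide, v1 is antipodal to them,
# so any hyperplane through the origin separates v1 from v0 and v2
vecs = [(1.0, 0.0), (-1.0, 0.0), (1.0, 0.0)]
S = round_cut(vecs, random.Random(42))
```

Antipodal vectors always land on opposite sides of the random hyperplane, while identical vectors always land on the same side.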

44.1.1. Analysis
The intuition of the above rounding procedure, is that with good probability, vectors in the solution of (P) that
have large angle between them would be separated by this cut.
Lemma 44.1.1. We have P[ sign(⟨vi , r⟩) ≠ sign(⟨vj , r⟩) ] = (1/π) arccos(⟨vi , vj⟩).
π
Proof: Let us think about the vectors vi , vj and r as being in the plane.
To see why this is a reasonable assumption, consider the plane g spanned by vi and
vj , and observe that for the random events we consider, only the direction of r matters,
which can be decided by projecting r on g, and normalizing it to have length 1. Now, the
sphere is symmetric, and as such, sampling r randomly from S(n) , projecting it down to
g, and then normalizing it, is equivalent to just choosing uniformly a vector from the unit
circle.
Now, sign(⟨vi , r⟩) ≠ sign(⟨vj , r⟩) happens only if r falls in the double wedge formed by the lines perpendicular
to vi and vj . The angle of this double wedge is exactly the angle between vi and vj . Now, since vi and vj
are unit vectors, we have ⟨vi , vj⟩ = cos(τ), where τ = ∠vi vj .
Thus,

    P[ sign(⟨vi , r⟩) ≠ sign(⟨vj , r⟩) ] = 2τ/(2π) = (1/π) arccos(⟨vi , vj⟩),
as claimed. ■
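The lemma is easy to check numerically. The sketch below (illustrative, not from the notes) estimates the probability that a uniformly random direction in the plane separates two unit vectors at angle θ, and compares it with the predicted θ/π = (1/π) arccos(⟨vi , vj⟩).

```python
import math
import random

def split_probability(theta, trials=20000, seed=1):
    """Estimate P[sign<v_i,r> != sign<v_j,r>] for unit vectors at angle theta."""
    rng = random.Random(seed)
    vi = (1.0, 0.0)
    vj = (math.cos(theta), math.sin(theta))
    split = 0
    for _ in range(trials):
        phi = rng.uniform(0.0, 2.0 * math.pi)   # r uniform on the unit circle
        r = (math.cos(phi), math.sin(phi))
        side_i = vi[0] * r[0] + vi[1] * r[1] >= 0.0
        side_j = vj[0] * r[0] + vj[1] * r[1] >= 0.0
        split += side_i != side_j
    return split / trials

est_right_angle = split_probability(math.pi / 2)   # Lemma 44.1.1 predicts 1/2
est_sixty = split_probability(math.pi / 3)         # prediction: 1/3
```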

Theorem 44.1.2. Let W be the random variable which is the weight of the cut generated by the algorithm. We
have

    E[W] = (1/π) ∑_{i<j} ωij arccos(⟨vi , vj⟩).

Proof: Let Xi j be an indicator variable which is 1 if and only if the edge i j is in the cut. We have
    E[Xij] = P[ sign(⟨vi , r⟩) ≠ sign(⟨vj , r⟩) ] = (1/π) arccos(⟨vi , vj⟩),

by Lemma 44.1.1. Clearly, W = ∑_{i<j} ωij Xij , and by linearity of expectation, we have

    E[W] = ∑_{i<j} ωij E[Xij] = (1/π) ∑_{i<j} ωij arccos(⟨vi , vj⟩). ■

Lemma 44.1.3. For −1 ≤ y ≤ 1, we have arccos(y)/π ≥ α · (1/2)(1 − y), where

    α = min_{0≤ψ≤π} (2/π) · ψ/(1 − cos(ψ)).                         (44.1)

Proof: Set y = cos(ψ). The inequality now becomes ψ/π ≥ α (1/2)(1 − cos ψ). Reorganizing, the inequality
becomes (2/π) · ψ/(1 − cos ψ) ≥ α, which trivially holds by the definition of α. ■

Figure 44.1: The function of Eq. (44.1), (2/π) · ψ/(1 − cos(ψ)), plotted over 0 ≤ ψ ≤ π; its minimum value is
≈ 0.87856.

Lemma 44.1.4. α > 0.87856.

Proof: Using simple calculus, one can see that α achieves its minimum at ψ = 2.331122 . . ., the nonzero root of
cos ψ + ψ sin ψ = 1. ■
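The constant is easy to confirm numerically; the grid search below (an illustrative sketch, not part of the notes) minimizes the function of Eq. (44.1) over (0, π].

```python
import math

def h(psi):
    # the function minimized in Eq. (44.1)
    return (2.0 / math.pi) * psi / (1.0 - math.cos(psi))

# h blows up like (4/pi)/psi near 0, so a fine grid over (0, pi] suffices
step = 1e-5
alpha, argmin = min((h(step * k), step * k)
                    for k in range(1, int(math.pi / step) + 1))
```

The minimum lands at ψ ≈ 2.3311 with value ≈ 0.878567, matching Lemma 44.1.4.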

Theorem 44.1.5. The above algorithm computes, in expectation, a cut with total weight at least α · Opt ≥
0.87856 · Opt, where Opt is the weight of the maximum weight cut.

Proof: Consider the optimal solution to (P), and let its value be γ ≥ Opt. We have

    E[W] = (1/π) ∑_{i<j} ωij arccos(⟨vi , vj⟩) ≥ ∑_{i<j} ωij α (1/2)(1 − ⟨vi , vj⟩) = αγ ≥ α · Opt,

by Lemma 44.1.3. ■

44.2. Semi-definite programming
Let us define a variable xij = ⟨vi , vj⟩, and consider the n by n matrix M formed by those variables, where xii = 1
for i = 1, . . . , n. Let V be the matrix having v1 , . . . , vn as its columns. Clearly, M = V T V. In particular, this
implies that for any non-zero vector v ∈ Rn , we have vT Mv = vT V T Vv = (Vv)T (Vv) ≥ 0. A matrix that has this
property, is called positive semidefinite. Interestingly, any positive semidefinite matrix P can be represented as
a product of a matrix with its transpose; namely, P = BT B. Furthermore, given such a matrix P of size n × n,
we can compute B such that P = BT B in O(n3 ) time. This is known as Cholesky decomposition.
Observe, that if a semidefinite matrix P = BT B has a diagonal where all the entries are one, then B has
columns which are unit vectors. Thus, if we solve (P) and get back a semi-definite matrix, then we can recover
the vectors realizing the solution, and use them for the rounding.
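The recovery step can be sketched as follows (a plain textbook Cholesky, illustrative only): given a positive definite matrix P with unit diagonal, the rows of the Cholesky factor L are unit vectors whose dot products reproduce the entries of P.

```python
import math

def cholesky(P):
    """Return lower-triangular L with P = L L^T (P assumed positive definite)."""
    n = len(P)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(P[i][i] - s)
            else:
                L[i][j] = (P[i][j] - s) / L[j][j]
    return L

# a small Gram matrix with unit diagonal (chosen to be positive definite)
P = [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
vectors = cholesky(P)   # row i is the unit vector v_i with <v_i, v_j> = P[i][j]
```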
In particular, (P) can now be restated as

    (S D)   max (1/2) ∑_{i<j} ωij (1 − xij)
            subject to: xii = 1   for i = 1, . . . , n
                        ( xij )_{i=1,...,n, j=1,...,n} is a positive semi-definite matrix.

We are trying to find the optimal value of a linear function over a set which is the intersection of linear con-
straints and the set of positive semi-definite matrices.

Lemma 44.2.1. Let U be the set of n × n positive semidefinite matrices. The set U is convex.

Proof: Consider A, B ∈ U, and observe that for any t ∈ [0, 1], and vector v ∈ Rn , we have:

vT (tA + (1 − t)B)v = vT (tAv + (1 − t)Bv) = t vT Av + (1 − t) vT Bv ≥ 0 + 0 = 0,

since A and B are positive semidefinite. ■
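Lemma 44.2.1 can be sanity-checked numerically; the sketch below (illustrative, not from the notes) builds two PSD matrices of the form VᵀV, takes a convex combination, and verifies that random quadratic forms are nonnegative.

```python
import random

def quad_form(M, v):
    n = len(M)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))

def random_psd(n, rng):
    # V^T V is positive semidefinite for any real matrix V
    V = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[sum(V[k][i] * V[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

rng = random.Random(7)
n, t = 4, 0.3
A, B = random_psd(n, rng), random_psd(n, rng)
C = [[t * A[i][j] + (1.0 - t) * B[i][j] for j in range(n)] for i in range(n)]
worst = min(quad_form(C, [rng.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(200))
```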

Positive semidefinite matrices correspond to ellipsoids. Indeed, consider the set xT Ax = 1: the set of
vectors that solve this equation is an ellipsoid. Also, the eigenvalues of a positive semidefinite matrix are all
non-negative real numbers. Thus, given a matrix, we can in polynomial time decide if it is positive semidefinite
or not (by computing the eigenvalues of the matrix).
Thus, we are trying to optimize a linear function over a convex domain. There is by now machinery to
approximately solve those problems to within any additive error in polynomial time. This is done by using the
interior point method, or the ellipsoid method. See [BV04, GLS93] for more details. The key ingredient that is
required to make these methods work, is the ability to decide in polynomial time, given a solution, whether it is
feasible or not. As demonstrated above, this can be done in polynomial time.

44.3. Bibliographical Notes


The approximation algorithm presented is from the work of Goemans and Williamson [GW95]. Håstad
[Hås01b] showed that MAX CUT can not be approximated within a factor of 16/17 ≈ 0.941176. Recently,
Khot et al. [KKMO04] showed a hardness result that matches the constant of Goemans and Williamson (i.e.,
one can not approximate it better than α, unless P = NP). However, this relies on two conjectures, the first
one is the “Unique Games Conjecture”, and the other one is “Majority is Stablest”. The “Majority is Stablest”

conjecture was recently proved by Mossel et al. [MOO05]. However, it is not clear if the “Unique Games
Conjecture” is true, see the discussion in [KKMO04].
The work of Goemans and Williamson was quite influential and spurred wide research on using SDP for
approximation algorithms. For an extension of the MAX CUT problem where negative weights are allowed
and relevant references, see the work by Alon and Naor [AN04].

References
[AN04] N. Alon and A. Naor. Approximating the cut-norm via Grothendieck’s inequality. Proc. 36th
Annu. ACM Sympos. Theory Comput. (STOC), 72–80, 2004.
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge, 2004.
[GLS93] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimiza-
tion. 2nd. Vol. 2. Algorithms and Combinatorics. Berlin Heidelberg: Springer-Verlag, 1993.
[GW95] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut
and satisfiability problems using semidefinite programming. J. Assoc. Comput. Mach., 42(6):
1115–1145, 1995.
[Hås01b] J. Håstad. Some optimal inapproximability results. J. ACM, 48(4): 798–859, 2001.
[KKMO04] S. Khot, G. Kindler, E. Mossel, and R. O’Donnell. Optimal inapproximability results for MAX
CUT and other 2-variable CSPs. Proc. 45th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS),
146–154, 2004. To appear in SICOMP.
[MOO05] E. Mossel, R. O’Donnell, and K. Oleszkiewicz. Noise stability of functions with low influences:
invariance and optimality. Proc. 46th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 21–30,
2005.

Chapter 45

Expanders I
“Mr. Matzerath has just seen fit to inform me that this
partisan, unlike so many of them, was an authentic
partisan. For - to quote the rest of my patient’s lecture -
there is no such thing as a part-time partisan. Real
partisans are partisans always and as long as they live.
They put fallen governments back in power and
overthrow governments that have just been put in power
with the help of partisans. Mr. Matzerath contended - and
this thesis struck me as perfectly plausible - that among
all those who go in for politics your incorrigible partisan,
who undermines what he has just set up, is closest to the
artist because he consistently rejects what he has just
created.”

Gunter Grass, The tin drum


45.1. Preliminaries on expanders
45.1.1. Definitions
Let G = (V, E) be an undirected graph, where V = {1, . . . , n}. A d-regular graph is a graph where all vertices
have degree d. A d-regular graph G = (V, E) is a δ-edge expander (or just, δ-expander) if for every set S ⊆ V
of size at most |V| /2, there are at least δd |S | edges connecting S and S̄ = V \ S ; that is,

    e(S , S̄ ) ≥ δd |S | ,                         (45.1)

where

    e(X, Y) = |{ uv ∈ E | u ∈ X, v ∈ Y }| .

A graph is an [n, d, δ]-expander if it is an n vertex, d-regular, δ-expander.
An (n, d)-graph G is a connected d-regular undirected (multi) graph. We will consider the set of vertices of
such a graph to be the set JnK = {1, . . . , n}.
For a (multi) graph G with n nodes, its adjacency matrix is an n × n matrix M, where Mij is the number of
edges between i and j. It would be convenient to work with the transition matrix Q associated with the random
walk on G. If G is d-regular then Q = M(G)/d and it is doubly stochastic.
A vector x is an eigenvector of a matrix M with eigenvalue µ, if xM = µx. In particular, by taking the dot
product of both sides with x, we get ⟨xM, x⟩ = ⟨µx, x⟩, which implies µ = ⟨xM, x⟩ / ⟨x, x⟩. Since the adjacency
matrix M of G is symmetric, all its eigenvalues are real numbers (this is a special case of the spectral theorem
from linear algebra). Two eigenvectors with different eigenvalues are orthogonal to each other.

We denote the eigenvalues of M by λ1 ≥ λ2 ≥ · · · ≥ λn , and the eigenvalues of Q by λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂n . Note,
that for a d-regular graph, the eigenvalues of Q are the eigenvalues of M scaled down by a factor of 1/d; that is,
λ̂i = λi /d.

Lemma 45.1.1. Let G be an undirected graph, and let ∆ denote the maximum degree in G. Then, λ1 (G) =
λ1 (M) = ∆ if and only if at least one connected component of G is ∆-regular. The multiplicity of ∆ as an
eigenvalue is the number of ∆-regular connected components. Furthermore, we have |λi (G)| ≤ ∆, for all i.

Proof: The ith entry of M1n is the degree of the ith vertex vi of G; that is, (M1n )i = d(vi ), where 1n =
(1, 1, . . . , 1) ∈ Rn . So, let x be an eigenvector of M with eigenvalue λ, and let x j ≠ 0 be the coordinate with the
largest absolute value among all coordinates of x corresponding to a connected component H of G. We have that

    |λ| |x j | = |(M x) j | = | ∑_{vi ∈N(v j )} xi | ≤ ∆ |x j | ,

where N(v j ) are the neighbors of v j in G. Thus, all the eigenvalues of G satisfy |λi | ≤ ∆, for i = 1, . . . , n. If
λ = ∆, then this implies that xi = x j if vi ∈ N(v j ), and d(v j ) = ∆. Applying this argument to the vertices of
N(v j ), implies that H must be ∆-regular, and furthermore, xi = x j , for all vi ∈ V(H). Clearly, the dimension of
the subspace with eigenvalue (in absolute value) ∆ is exactly the number of such connected components. ■
The following is also known. We do not provide a proof since we do not need it in our argumentation.
Lemma 45.1.2. If G is bipartite, and λ is an eigenvalue of M(G) with multiplicity k, then −λ is also an
eigenvalue of M(G) with multiplicity k.

45.2. Tension and expansion


Let G = (V, E), where V = {1, . . . , n} and G is a d-regular graph.
Definition 45.2.1. For a graph G, let γ(G) denote the tension of G; that is, the smallest constant, such that for
any function f : V(G) → R, we have that

    E_{x,y∈V} [ | f (x) − f (y)|² ] ≤ γ(G) · E_{xy∈E} [ | f (x) − f (y)|² ] .                         (45.2)

Intuitively, the tension captures how closely one can estimate the variance of a function defined over the
vertices of G by considering only the edges of G. Note, that a disconnected graph would have infinite tension,
and the clique has tension 1.
Surprisingly, tension is directly related to expansion as the following lemma testifies.
Lemma 45.2.2. Let G = (V, E) be a given connected d-regular graph with n vertices. Then, G is a δ-expander,
where δ ≥ 1/(2γ(G)) and γ(G) is the tension of G.
Proof: Consider a set S ⊆ V, where |S | ≤ n/2. Let fS (v) be the function assigning 1 if v ∈ S , and zero
otherwise. Observe that if (u, v) ∈ (S × S̄ ) ∪ (S̄ × S ) then | fS (u) − fS (v)| = 1, and | fS (u) − fS (v)| = 0 otherwise.
As such, we have

    2 |S | (n − |S |) / n² = E_{x,y∈V} [ | fS (x) − fS (y)|² ] ≤ γ(G) E_{xy∈E} [ | fS (x) − fS (y)|² ] = γ(G) · e(S , S̄ ) / |E| ,

by the definition of tension (Eq. (45.2)). Now, since G is d-regular, we have that |E| = nd/2. Furthermore,
n − |S | ≥ n/2, which implies that

    e(S , S̄ ) ≥ 2 |E| · |S | (n − |S |) / (γ(G) n²) = 2 (nd/2)(n/2) |S | / (γ(G) n²) = (1/(2γ(G))) d |S | ,

which implies the claim (see Eq. (45.1)). ■

Now, a clique has tension 1, and it has the best expansion possible. As such, the smaller the tension of a
graph, the better expander it is.
Definition 45.2.3. Given a random walk matrix Q associated with a d-regular graph, let B(Q) = ⟨v1 , . . . , vn ⟩
denote the orthonormal eigenvector basis defined by Q. That is, v1 , . . . , vn is an orthonormal basis for Rn ,
where all these vectors are eigenvectors of Q and v1 = 1n /√n. Furthermore, let λ̂i denote the ith eigenvalue of
Q, associated with the eigenvector vi , such that λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂n .

Lemma 45.2.4. Let G = (V, E) be a given connected d-regular graph with n vertices. Then γ(G) = 1/(1 − λ̂2 ),
where λ̂2 = λ2 /d is the second largest eigenvalue of Q.

Proof: Let f : V → R. Since in Eq. (45.2) we only look at the difference between two values of f , we can
add a constant to f without changing the quantities involved in Eq. (45.2). As such, we assume that
E[ f (x)] = 0. We thus have that

    E_{x,y∈V} [ | f (x) − f (y)|² ] = E_{x,y∈V} [ ( f (x) − f (y))² ] = E_{x,y∈V} [ ( f (x))² − 2 f (x) f (y) + ( f (y))² ]         (45.3)
        = E_{x,y∈V} [ ( f (x))² ] − 2 E_{x,y∈V} [ f (x) f (y) ] + E_{x,y∈V} [ ( f (y))² ]
        = E_{x∈V} [ ( f (x))² ] − 2 E_{x∈V} [ f (x) ] E_{y∈V} [ f (y) ] + E_{y∈V} [ ( f (y))² ] = 2 E_{x∈V} [ ( f (x))² ] .

Now, let I be the n × n identity matrix (i.e., one on its diagonal, and zero everywhere else). We have that

    ρ = (1/d) ∑_{xy∈E} ( f (x) − f (y))² = (1/d) ( ∑_{x∈V} d ( f (x))² − 2 ∑_{xy∈E} f (x) f (y) )
      = ∑_{x∈V} ( f (x))² − (2/d) ∑_{xy∈E} f (x) f (y) = ∑_{x,y∈V} (I − Q)xy f (x) f (y).

Note, that 1n is an eigenvector of Q with eigenvalue 1, and this is the largest eigenvalue of Q. Let B(Q) =
⟨v1 , . . . , vn ⟩ be the orthonormal eigenvector basis defined by Q, with eigenvalues λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂n , respec-
tively. Write f = ∑_{i=1}^n αi vi , and observe that

    0 = E[ f (x)] = ∑_{i=1}^n f (i)/n = (1/√n) ⟨ f, v1 ⟩ = (1/√n) ⟨ ∑_i αi vi , v1 ⟩ = (1/√n) ⟨α1 v1 , v1 ⟩ = α1 /√n ,

since vi ⊥v1 for i ≥ 2. Hence α1 = 0, and we have


    ρ = ∑_{x,y∈V} (I − Q)xy f (x) f (y) = ∑_{x,y∈V} (I − Q)xy ( ∑_{i=2}^n αi vi (x) ) ( ∑_{j=1}^n α j v j (y) )
      = ∑_{i, j} αi α j ∑_{x∈V} vi (x) ∑_{y∈V} (I − Q)xy v j (y).

Now, we have that
    ∑_{y∈V} (I − Q)xy v j (y) = ⟨ (xth row of (I − Q)), v j ⟩ = ((I − Q)v j )(x) = (1 − λ̂ j ) v j (x),

since v j is an eigenvector of Q with eigenvalue λ̂ j . Since v1 , . . . , vn is an orthonormal basis, and f = ∑_{i=1}^n αi vi ,
we have that ∥ f ∥² = ∑_j α²j . Going back to ρ, we have that
    ρ = ∑_{i, j} αi α j (1 − λ̂ j ) ∑_{x∈V} vi (x) v j (x) = ∑_{i, j} αi α j (1 − λ̂ j ) ⟨vi , v j ⟩ = ∑_{j=1}^n α²j (1 − λ̂ j ) ⟨v j , v j ⟩
      ≥ (1 − λ̂2 ) ∑_{j=2}^n α²j ∑_{x∈V} (v j (x))² = (1 − λ̂2 ) ∑_{j=1}^n α²j = (1 − λ̂2 ) ∥ f ∥² = (1 − λ̂2 ) ∑_{x∈V} ( f (x))²        (45.4)
      = n (1 − λ̂2 ) E_{x∈V} [ ( f (x))² ] ,

since α1 = 0 and λb1 ≥ λb2 ≥ · · · ≥ λbn .


We are now ready for the kill. Indeed, by Eq. (45.3), and the above, we have that
    E_{x,y∈V} [ | f (x) − f (y)|² ] = 2 E_{x∈V} [ ( f (x))² ] ≤ (2/(n(1 − λ̂2 ))) ρ = (2/(dn(1 − λ̂2 ))) ∑_{xy∈E} ( f (x) − f (y))²
        = (1/(1 − λ̂2 )) · (1/|E|) ∑_{xy∈E} ( f (x) − f (y))² = (1/(1 − λ̂2 )) E_{xy∈E} [ | f (x) − f (y)|² ] .

This implies that γ(G) ≤ 1/(1 − λ̂2 ). Observe, that the only inequality in our analysis arose from Eq. (45.4),
but if we take f = v2 , then the inequality there holds with equality, which implies that γ(G) ≥ 1/(1 − λ̂2 ), which
implies the claim. ■

Lemma 45.2.2 together with the above lemma, implies that the expansion δ of a d-regular graph G is at
least 1/(2γ(G)) = (1 − λ2 /d)/2, where λ2 is the second eigenvalue of the adjacency matrix of G. Since the
tension of a graph is a direct function of its second eigenvalue, we can either argue about the tension of a graph
or about its second eigenvalue when bounding the graph expansion.
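As a concrete check (illustrative, not from the notes), one can brute-force the edge expansion of the cycle C₈, whose second adjacency eigenvalue is known in closed form (λ₂ = 2 cos(2π/n) for Cₙ), and compare it with the bound δ ≥ (1 − λ₂/d)/2:

```python
import math
from itertools import combinations

n, d = 8, 2
edges = [(i, (i + 1) % n) for i in range(n)]      # the cycle C_8, 2-regular
lam2 = 2.0 * math.cos(2.0 * math.pi / n)          # second eigenvalue of C_n
spectral_bound = (1.0 - lam2 / d) / 2.0           # lower bound on the expansion

def cut_size(S):
    return sum(1 for u, v in edges if (u in S) != (v in S))

# the expansion delta is the worst ratio e(S, S-bar) / (d |S|) over |S| <= n/2
delta = min(cut_size(set(S)) / (d * len(S))
            for k in range(1, n // 2 + 1)
            for S in combinations(range(n), k))
```

For C₈ the worst set is a contiguous arc of four vertices (two cut edges), giving δ = 1/4, comfortably above the spectral bound (1 − √2/2)/2 ≈ 0.146.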

Chapter 46

Expanders II
Be that as it may, it is to night school that I owe what
education I possess; I am the first to own that it doesn’t
amount to much, though there is something rather
grandiose about the gaps in it.

Gunter Grass, The tin drum


46.1. Bi-tension
Our construction of good expanders would use the idea of composing graphs together. To this end, in our
analysis, we will need the notion of bi-tension. Let Ẽ(G) be the set of directed edges of G; that is, every edge
xy ∈ E(G) appears twice, as (x → y) and (y → x), in Ẽ(G).

Definition 46.1.1. For a graph G, let γ2 (G) denote the bi-tension of G; that is, the smallest constant, such that
for any two functions f, g : V(G) → R, we have that

    E_{x,y∈V} [ | f (x) − g(y)|² ] ≤ γ2 (G) · E_{(x→y)∈Ẽ} [ | f (x) − g(y)|² ] .                         (46.1)

The proof of the following lemma is similar to the proof of Lemma 45.2.4. The proof is provided for the
sake of completeness, but there is little new in it.
Lemma 46.1.2. Let G = (V, E) be a connected d-regular graph with n vertices. Then γ2 (G) = 1/(1 − λ̂), where
λ̂ = λ̂(G) = max( λ̂2 , −λ̂n ), and λ̂i is the ith largest eigenvalue of the random walk matrix associated with G.
 
Proof: We can assume that E[ f (x)] = 0. As such, we have that

    E_{x,y∈V} [ | f (x) − g(y)|² ] = E_{x∈V} [ ( f (x))² ] − 2 E_{x,y∈V} [ f (x) g(y) ] + E_{y∈V} [ (g(y))² ]
                                   = E_{x∈V} [ ( f (x))² ] + E_{y∈V} [ (g(y))² ] .                         (46.2)

Let Q be the matrix associated with the random walk on G (each entry is either zero or 1/d). We have

    ρ = E_{(x→y)∈Ẽ} [ | f (x) − g(y)|² ] = (1/(nd)) ∑_{(x→y)∈Ẽ} ( f (x) − g(y))² = (1/n) ∑_{x,y∈V} Qxy ( f (x) − g(y))²
      = (1/n) ∑_{x∈V} ( ( f (x))² + (g(x))² ) − (2/n) ∑_{x,y∈V} Qxy f (x) g(y).

Let B(Q) = ⟨v1 , . . . , vn ⟩ be the orthonormal eigenvector basis defined by Q (see Definition 45.2.3), with eigen-
values λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂n , respectively. Write f = ∑_{i=1}^n αi vi and g = ∑_{i=1}^n βi vi . Since E[ f (x)] = 0, we have
that α1 = 0. Now, Qxy = Qyx , and we have

    ∑_{x,y∈V} Qxy f (x) g(y) = ∑_{x,y∈V} Qyx ( ∑_i αi vi (x) ) ( ∑_j β j v j (y) ) = ∑_{i, j} αi β j ∑_{y∈V} v j (y) ∑_{x∈V} Qyx vi (x)
      = ∑_{i, j} αi β j ∑_{y∈V} v j (y) λ̂i vi (y) = ∑_{i, j} αi β j λ̂i ⟨v j , vi ⟩ = ∑_{i=2}^n αi βi λ̂i ∑_{y∈V} (vi (y))²
      ≤ λ̂ ∑_{i=2}^n ((α²i + β²i )/2) ∑_{y∈V} (vi (y))² ≤ (λ̂/2) ∑_{i=1}^n ∑_{y∈V} ( (αi vi (y))² + (βi vi (y))² )
      = (λ̂/2) ∑_{y∈V} ( ( f (y))² + (g(y))² ) .

As such,

    E_{(x→y)∈Ẽ} [ | f (x) − g(y)|² ] = (1/n) ∑_{y∈V} ( ( f (y))² + (g(y))² ) − (2/n) ∑_{x,y∈V} Qxy f (x) g(y)
      ≥ ( 1/n − (2/n) · (λ̂/2) ) ∑_{y∈V} ( ( f (y))² + (g(y))² ) = (1 − λ̂) ( E_{y∈V} [ ( f (y))² ] + E_{y∈V} [ (g(y))² ] )
      = (1 − λ̂) E_{x,y∈V} [ | f (x) − g(y)|² ] ,

by Eq. (46.2). This implies that γ2 (G) ≤ 1/(1 − λ̂). Again, by trying either f = g = v2 , or f = vn and g = −vn ,
we get that the inequality above holds with equality, which implies γ2 (G) ≥ 1/(1 − λ̂). Together, the claim
now follows. ■

46.2. Explicit construction


For a set U ⊆ V of vertices, its characteristic vector, denoted by x = χU , is the n dimensional vector, where
xi = 1 if and only if i ∈ U.
The following is an easy consequence of Lemma 45.1.1.
Lemma 46.2.1. For a d-regular graph G, the vector 1n = (1, 1, . . . , 1) is the only eigenvector with eigenvalue d
(of the adjacency matrix M(G)) if and only if G is connected. Furthermore, we have |λi | ≤ d, for all i.

Our main interest would be in the second largest eigenvalue of M. Formally, let
    λ2 (G) = max_{x⊥1n , x≠0} ⟨xM, x⟩ / ⟨x, x⟩ .
We state the following result but do not prove it, since we do not need it for our nefarious purposes (however,
we did prove the left side of the inequality).

Theorem 46.2.2. Let G be a δ-expander with adjacency matrix M and let λ2 = λ2 (G) be the second-largest
eigenvalue of M. Then

    (1/2) (1 − λ2 /d) ≤ δ ≤ √( 2 (1 − λ2 /d) ) .
What the above theorem says, is that the expansion of an [n, d, δ]-expander is a function of how far its
second eigenvalue (i.e., λ2 ) is from its first eigenvalue (i.e., d). This gap is usually referred to as the spectral gap.
We will start by explicitly constructing an expander that has “many” edges, and then we will show how to
reduce its degree till it becomes a constant degree expander.
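The spectral gap of a small graph can be estimated without any linear algebra library. The sketch below (illustrative, not part of the notes) runs power iteration on M + 2I, restricted to the space orthogonal to 1ₙ, for the cycle C₈, whose second eigenvalue is 2 cos(2π/8) = √2.

```python
import math

n = 8
# adjacency matrix of the cycle C_8
M = [[1 if (i - j) % n in (1, n - 1) else 0 for j in range(n)] for i in range(n)]

def apply_shifted(x):
    # apply (M + 2I); the shift makes the spectrum nonnegative, so power
    # iteration in the space orthogonal to 1_n converges to lambda_2 + 2
    return [sum(M[i][j] * x[j] for j in range(n)) + 2.0 * x[i]
            for i in range(n)]

x = [math.sin(2.0 * math.pi * i / n) + 0.1 * i for i in range(n)]
for _ in range(200):
    mean = sum(x) / n
    x = [xi - mean for xi in x]          # project out the all-ones direction
    x = apply_shifted(x)
    norm = math.sqrt(sum(xi * xi for xi in x))
    x = [xi / norm for xi in x]
# Rayleigh quotient of the (normalized) limit vector, shifted back
lam2 = sum(xi * yi for xi, yi in zip(x, apply_shifted(x))) - 2.0
```

The bound of Theorem 46.2.2 then gives δ ≥ (1 − √2/2)/2 ≈ 0.146 for this graph.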

46.2.1. Explicit construction of a small expander


46.2.1.1. A quicky reminder of fields
A field is a set F together with two operations, called addition and multiplication, and denoted by + and ·,
respectively, such that the following axioms hold:
(i) Closure: ∀x, y ∈ F, we have x + y ∈ F and x · y ∈ F.
(ii) Associativity: ∀x, y, z ∈ F, we have x + (y + z) = (x + y) + z and (x · y) · z = x · (y · z).
(iii) Commutativity: ∀x, y ∈ F, we have x + y = y + x and x · y = y · x.
(iv) Identity: There exist two distinct special elements 0, 1 ∈ F, such that ∀x ∈ F it holds that x + 0 = x and
x · 1 = x.
(v) Inverse: ∀x ∈ F there exists an element −x ∈ F, such that x + (−x) = 0.
Similarly, ∀x ∈ F, x ≠ 0, there exists an element y = x−1 = 1/x ∈ F such that x · y = 1.
(vi) Distributivity: ∀x, y, z ∈ F we have x · (y + z) = x · y + x · z.
Let q = 2t , and r > 0 be an integer. Consider the finite field Fq . It is the field of polynomials of degree at
most t − 1, where the coefficients are over Z2 (i.e., all calculations are done modulo 2). Formally, consider the
polynomial

    p(x) = xt + x + 1.

It is irreducible over F2 = {0, 1} (i.e., p(0) = p(1) ≠ 0). We can now do polynomial arithmetic over polynomials
(with coefficients from F2 ), where we do the calculations modulo p(x). Note, that any irreducible polynomial
of degree t yields the same field up to isomorphism. Intuitively, we are introducing the t distinct roots of p(x)
into F by creating an extension field of F with those roots.
An element of Fq = F2t can be interpreted as a binary string b = b0 b1 . . . bt−1 of length t, where the
corresponding polynomial is

    poly(b) = ∑_{i=0}^{t−1} bi x^i .

The nice property of Fq is that addition can be interpreted as a xor operation. That is, for any x, y ∈ Fq , we
have that x + y + y = x and x − y − y = x. The key properties of Fq we need are that multiplication and addition
can be computed in polynomial time in t, and that it is a field (i.e., each non-zero element has a unique inverse).

46.2.1.1.1. Computing multiplication in Fq . Consider two elements α, β ∈ Fq . Multiply the two polyno-
mials poly(α) and poly(β), and let poly(γ) be the resulting polynomial (of degree at most 2t − 2). Compute the
remainder of poly(γ) when dividing it by the irreducible polynomial p(x). For this remainder polynomial, nor-
malize the coefficients by computing them modulo 2. The resulting polynomial is the product of α and
β.

For more details on this field, see any standard text on abstract algebra.
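As a concrete sketch (illustrative; the notes themselves contain no code), this arithmetic can be implemented with bit operations. Here t = 3, for which p(x) = x³ + x + 1 is indeed irreducible over F₂.

```python
T = 3            # work in F_8 = F_{2^3}
POLY = 0b1011    # bit encoding of p(x) = x^3 + x + 1

def gf_add(a, b):
    # addition in F_{2^t} is coefficient-wise mod 2, i.e., xor
    return a ^ b

def gf_mul(a, b):
    """Multiply two elements of F_{2^T}, encoded as bit strings."""
    prod = 0
    while b:                 # carry-less polynomial multiplication
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    for shift in range(2 * T - 2, T - 1, -1):   # reduce modulo p(x)
        if prod & (1 << shift):
            prod ^= POLY << (shift - T)
    return prod
```

For instance, x · x² reduces to x + 1, and multiplication by any fixed nonzero element permutes the nonzero elements, so inverses exist.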

46.2.1.2. The construction


Let q = 2t , and r > 0 be an integer. Consider the linear space G = F_q^{r+1} . Here, a member α = (α0 , . . . , αr ) ∈ G
can be thought of as being a string (of length r + 1) over Fq , or alternatively, as a binary string of length
n = t(r + 1). For α = (α0 , . . . , αr ) ∈ G, and x, y ∈ Fq , define the operator

    ρ(α, x, y) = α + y · (1, x, x² , . . . , x^r ) = ( α0 + y, α1 + yx, α2 + yx² , . . . , αr + yx^r ) ∈ G.

Since addition over Fq is equivalent to a xor operation, we have that

    ρ(ρ(α, x, y), x, y) = ( α0 + y + y, α1 + yx + yx, α2 + yx² + yx² , . . . , αr + yx^r + yx^r )
                        = (α0 , α1 , α2 , . . . , αr ) = α.

Furthermore, if (x, y) ≠ (x′ , y′ ) and y, y′ ≠ 0, then ρ(α, x, y) ≠ ρ(α, x′ , y′ ).
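The xor cancellation can be verified exhaustively for a tiny instance. The sketch below (illustrative) takes q = 4 (via the irreducible p(x) = x² + x + 1) and r = 1, and checks that ρ(ρ(α, x, y), x, y) = α for every α, x, y.

```python
from itertools import product

T = 2            # F_4, with p(x) = x^2 + x + 1 irreducible over F_2
POLY = 0b111
Q = 1 << T       # q = 4
R = 1            # vectors alpha have r + 1 = 2 coordinates

def gf_mul(a, b):
    prod = 0
    while b:                 # carry-less multiplication
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    for shift in range(2 * T - 2, T - 1, -1):   # reduce modulo p(x)
        if prod & (1 << shift):
            prod ^= POLY << (shift - T)
    return prod

def gf_pow(x, k):
    out = 1
    for _ in range(k):
        out = gf_mul(out, x)
    return out

def rho(alpha, x, y):
    # rho(alpha, x, y) = alpha + y (1, x, ..., x^r); '+' is coordinate-wise xor
    return tuple(a ^ gf_mul(y, gf_pow(x, i)) for i, a in enumerate(alpha))

involution = all(rho(rho(alpha, x, y), x, y) == alpha
                 for alpha in product(range(Q), repeat=R + 1)
                 for x in range(Q) for y in range(Q))
```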


We now define a graph LD(q, r) = (G, E), where

    E = { αβ | α ∈ G, x, y ∈ Fq , β = ρ(α, x, y) } .

Note, that this graph is well defined, as ρ(β, x, y) = α. The degree of a vertex of LD(q, r) is |Fq |² = q² , and
LD(q, r) has N = |G| = q^{r+1} = 2^{t(r+1)} = 2^n vertices.

Theorem 46.2.3. For any t > 0, r > 0 and q = 2t , where r < q, we have that LD(q, r) is a graph with q^{r+1}
vertices. Furthermore, λ1 (LD(q, r)) = q² , and |λi (LD(q, r))| ≤ rq, for i = 2, . . . , N.
In particular, if r ≤ q/2, then LD(q, r) is a [ q^{r+1} , q² , 1/4 ]-expander.

Proof: Let M be the N × N adjacency matrix of LD(q, r). Let L : Fq → {0, 1} be a linear map which is onto. It
is easy to verify that |L−1 (0)| = |L−1 (1)|.¬
We are interested in the eigenvalues of the matrix M. To this end, we consider vectors in R^N . The ith row
and ith column of M are associated with a unique element bi ∈ G. As such, for a vector v ∈ R^N , we denote by
v[bi ] the ith coordinate of v. In particular, for α = (α0 , . . . , αr ) ∈ G, let vα ∈ R^N denote the vector, whose
β = (β0 , . . . , βr ) ∈ G coordinate is

    vα [β] = (−1)^{L( ∑_{i=0}^r αi βi )} .

Let V = { vα | α ∈ G }. For α ≠ α′ ∈ V, observe that

    ⟨vα , vα′ ⟩ = ∑_{β∈G} (−1)^{L( ∑_{i=0}^r αi βi )} · (−1)^{L( ∑_{i=0}^r α′i βi )} = ∑_{β∈G} (−1)^{L( ∑_{i=0}^r (αi +α′i ) βi )} = ∑_{β∈G} vα+α′ [β] .
β∈G β∈G β∈G

So, consider ψ = α + α′ ≠ 0. Assume, for the simplicity of exposition, that all the coordinates of ψ are non-zero.
We have, by the linearity of L, that

    ⟨vα , vα′ ⟩ = ∑_{β∈G} (−1)^{L( ∑_{i=0}^r ψi βi )} = ∑_{β0 ,...,βr−1 ∈Fq} (−1)^{L( ψ0 β0 +···+ψr−1 βr−1 )} ∑_{βr ∈Fq} (−1)^{L( ψr βr )} .
¬ Indeed, if Z = L−1 (0), and L(x) = 1, then L(y) = 1, for all y ∈ U = { x + z | z ∈ Z }. Now, it is clear that |Z| = |U|.

However, since ψr ≠ 0, we have { ψr βr | βr ∈ Fq } = Fq . Thus, the summation ∑_{βr ∈Fq} (−1)^{L( ψr βr )} gets
|L−1 (0)| terms that are 1, and |L−1 (1)| terms that are −1. As such, this summation is zero, implying that
⟨vα , vα′ ⟩ = 0. Namely, the vectors of V are orthogonal.
     
Observe, that for α, β, ψ ∈ G, we have vα [β + ψ] = vα [β] vα [ψ]. For α ∈ G, consider the vector Mvα . We
have, for β ∈ G, that

    (Mvα )[β] = ∑_{ψ∈G} Mβψ · vα [ψ] = ∑_{x,y∈Fq , ψ=ρ(β,x,y)} vα [ψ] = ∑_{x,y∈Fq} vα [ β + y(1, x, . . . , x^r ) ]
              = ( ∑_{x,y∈Fq} vα [ y(1, x, . . . , x^r ) ] ) · vα [β] .

Thus, setting λ(α) = ∑_{x,y∈Fq} vα [ y(1, x, . . . , x^r ) ] ∈ R, we have that Mvα = λ(α) · vα . Namely, vα is an
eigenvector, with eigenvalue λ(α).
Let pα (x) = ∑_{i=0}^r αi x^i , and observe that

    λ(α) = ∑_{x,y∈Fq} vα [ y(1, x, . . . , x^r ) ] = ∑_{x,y∈Fq} (−1)^{L( y pα (x) )}
         = ∑_{x,y∈Fq : pα (x)=0} (−1)^{L( y pα (x) )} + ∑_{x,y∈Fq : pα (x)≠0} (−1)^{L( y pα (x) )} .

If pα (x) = 0 then (−1)^{L( y pα (x) )} = 1, for all y. As such, each such x contributes q to λ(α).
If pα (x) ≠ 0 then y pα (x) takes all the values of Fq , and as such, L(y pα (x)) is 0 for half of these values, and
1 for the other half, implying that such terms contribute 0 to λ(α). But pα (x) is a polynomial of degree at most
r, and as such there can be at most r values of x for which pα (x) = 0. As such, if α ≠ 0 then
λ(α) ≤ rq. If α = 0 then λ(α) = q² , which implies the theorem. ■

This construction provides an expander with constant degree only if the number of vertices is a constant.
Indeed, if we want an expander with constant degree, we have to take q to be as small as possible. We get
the relation n = q^{r+1} ≤ q^q , since r < q, which implies that q = Ω(log n/ log log n). Now, the expander of
Theorem 46.2.3 is q²-regular, which means that it is not going to provide us with a constant degree expander.
However, we are going to use it as our building block in a construction that would start with this expander
and inflate it up to the desired size.

Chapter 47

Expanders III - The Zig Zag Product


Gradually, but not as gradually as it seemed to some parts of his brain, he began to infuse his tones with a sarcastic wounding
bitterness. Nobody outside a madhouse, he tried to imply, could take seriously a single phrase of this conjectural, nugatory,
deluded, tedious rubbish. Within quite a short time he was contriving to sound like an unusually fanatical Nazi trooper in
charge of a book-burning reading out to the crowd excerpts from a pamphlet written by a pacifist, Jewish, literate Communist.
A growing mutter, half-amused, half-indignant, arose about him, but he closed his ears to it and read on. Almost unconsciously
he began to adopt an unnameable foreign accent and to read faster and faster, his head spinning. As if in a dream he heard
Welch stirring, then whispering, then talking at his side. He began punctuating his discourse with smothered snorts of derision.
He read on, spitting out the syllables like curses, leaving mispronunciations, omissions, spoonerisms uncorrected, turning over
the pages of his script like a score-reader following a presto movement, raising his voice higher and higher. At last he found
his final paragraph confronting him, stopped, and looked at his audience.

Kingsley Amis, Lucky Jim

47.1. Building a large expander with constant degree


47.1.1. Notations
For a vertex v ∈ V(G), we will denote by v_G[i] = v[i] the ith neighbor of v in the graph G (the neighbors of a vertex are ordered in some fixed but arbitrary order).
The regular graphs we discuss next have a consistent labeling. For a regular graph G, this means that if u is the ith neighbor of v, then v is the ith neighbor of u. Formally, v[i][i] = v, for all v and i. This is a non-trivial property, but it is easy to verify that the low quality expander of Theorem 46.2.3 has this property. It is also easy to verify that the complete graph can easily be made to have a consistent labeling (exercise). These two graphs would be sufficient for our construction.

47.1.2. The Zig-Zag product


At this point, we know how to construct a good "small" expander. The question is how to build a large expander (i.e., one with a large number of vertices) of constant degree.
The intuition of the construction is the following: It is easy to improve the expansion qualities of a graph by squaring it. The problem is that the resulting graph has a degree which is too large. To overcome this, we will replace every vertex in the squared graph by a copy of a small graph that is connected and has low degree. For example, we could replace every vertex of degree d by a path having d vertices; every vertex of the path is then in charge of one original edge of the graph. Naturally, such a replacement operation reduces the quality of the expansion of the resulting graph. In this case, replacing a vertex with a path is a potential "disaster", since every such subpath increases the lengths of the paths of the original graph by a factor of d (and intuitively, a good expander has "short" paths between any pair of vertices).
Consider a "large" (n, D)-graph G and a "small" (D, d)-graph H. As a first stage, we replace every vertex of G by a copy of H. The new graph K has ⟦n⟧ × ⟦D⟧ as its vertex set. Here, an edge vu ∈ E(G), where u = v[i] and v = u[j], is replaced by the edge connecting (v, i) ∈ V(K) with (u, j) ∈ V(K). We will refer to this resulting edge (v, i)(u, j) as a long edge. Also, we copy all the edges of the small graph to each one of its copies; that is, for each i ∈ ⟦n⟧ and uv ∈ E(H), we add the edge (i, u)(i, v) to K, which is a short edge. We will refer to K, which is an (nD, d + 1)-graph, as the replacement product of G and H, denoted by G ⓡ H. See the figure on the right for an example.
Again, intuitively, we are losing because the expansion of the resulting graph has deteriorated too much. To overcome this problem, we will perform local shortcuts to shorten the paths in the resulting graph (and thus improve its expansion properties). A zig-zag-zig path in the replacement product graph K is a three edge path e₁e₂e₃, where e₁ and e₃ are short edges, and the middle edge e₂ is a long edge. That is, if e₁ = (i, u)(i, v), e₂ = (i, v)(j, v′), and e₃ = (j, v′)(j, u′), then e₁, e₂, e₃ ∈ E(K), ij ∈ E(G), uv ∈ E(H) and v′u′ ∈ E(H). Intuitively, you can think about e₁ as a small "zig" step in H, e₂ as a long "zag" step in G, and finally e₃ as another "zig" step in H.
Another way of representing a zig-zag-zig path v₁v₂v₃v₄ starting at the vertex v₁ = (i, v) ∈ V(K) is to parameterize it by two integers ℓ, ℓ′ ∈ ⟦d⟧, where

    v₁ = (i, v),   v₂ = (i, v_H[ℓ]),   v₃ = (i_G[v_H[ℓ]], v_H[ℓ]),   v₄ = (i_G[v_H[ℓ]], (v_H[ℓ])_H[ℓ′]).
Let Z be the set of all (unordered) pairs of vertices of K connected by such a zig-zag-zig path. Note that every vertex (i, v) of K has d² such paths having (i, v) as an endpoint. Consider the graph F = (V(K), Z). The graph F has nD vertices, and it is d²-regular. Furthermore, since we shortcut all these zig-zag-zig paths in K, the graph F is (intuitively) a much better expander than K. We will refer to the graph F as the zig-zag product of G and H.
Definition 47.1.1. The zig-zag product of an (n, D)-graph G and a (D, d)-graph H is the (nD, d²)-graph F = G ⓩ H, whose set of vertices is ⟦n⟧ × ⟦D⟧, and where, for any i ∈ ⟦n⟧, v ∈ ⟦D⟧, and ℓ, ℓ′ ∈ ⟦d⟧, we have in F the edge connecting the vertex (i, v) with the vertex (i_G[v_H[ℓ]], (v_H[ℓ])_H[ℓ′]).
Remark 47.1.2. We need the resulting zig-zag graph to have consistent labeling. For the sake of simplicity of
exposition, we are just going to assume this property.
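To see Definition 47.1.1 in action, here is a minimal Python sketch (not from the notes). The test graphs, K_{4,4} with the port labeling c ↦ (c − u) mod 4 and the 4-cycle C₄, are illustrative choices of small consistently labeled graphs, not the graphs used later in the construction:

```python
from collections import Counter

def zigzag_edges(nbrG, nbrH):
    """Zig-zag product of a labeled (n, D)-graph G with a (D, d)-graph H.

    nbrG[i][p] is the p-th neighbor of i in G; nbrH[v][l] likewise in H.
    Returns the list of directed zig-zag edges on the vertex set [n] x [D].
    """
    n, D, d = len(nbrG), len(nbrH), len(nbrH[0])
    edges = []
    for i in range(n):
        for v in range(D):
            for l in range(d):
                for lp in range(d):
                    w = nbrH[v][l]        # zig: short step inside cloud i
                    j = nbrG[i][w]        # zag: long edge out of port w
                    u = nbrH[w][lp]       # zig: short step inside cloud j
                    edges.append(((i, v), (j, u)))
    return edges

# G = K_{4,4} (8 vertices, 4-regular) with a consistent labeling.
nbrG = [[4 + ((c - u) % 4) for c in range(4)] for u in range(4)] + \
       [[(c - w) % 4 for c in range(4)] for w in range(4)]
# H = C_4 (4 vertices, 2-regular) with a consistent labeling.
nbrH = [[1, 3], [0, 2], [3, 1], [2, 0]]

E = zigzag_edges(nbrG, nbrH)
assert len(E) == 8 * 4 * 2 * 2           # one path per (i, v, l, l')
deg = Counter(a for a, _ in E)
assert set(deg.values()) == {4}          # the product is d^2 = 4 regular
assert Counter(E) == Counter((b, a) for a, b in E)  # symmetric edge multiset
```

Because both labelings are consistent, reversing the parameters (ℓ, ℓ′) of a path yields its reversal, so the directed edge multiset is symmetric.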
We next bound the tension of the zig-zag product graph.
Theorem 47.1.3. We have γ(G ⓩ H) ≤ γ(G) (γ(H))² and γ₂(G ⓩ H) ≤ γ₂(G) (γ₂(H))².
 
Proof: Let G = JnK, E be a (n, D)-graph and H = JDK, E ′ be a (D, d)-graph. Fix any function f : JnK ×
JDK → R, and observe that
h i " h i#
ψ = E | f (u, k) − f (v, ℓ)| = E
2
E | f (u, k) − f (v, ℓ)|
2
u,v∈JnK k,ℓ∈JDK u,v∈JnK
k,ℓ∈JDK

298
 
"  
h i#   
   2 
≤ E γ2 (G) E | f (u, k) − f (v, ℓ)|2 = γ2 (G) E  E f (u, k) − f u p , ℓ  .
k,ℓ∈JDK uv∈E(G) k,ℓ∈JDK   u∈JnK 
p∈JDK
| {z }
=∆1

Now,

    Δ₁ = E_{u ∈ ⟦n⟧, ℓ ∈ ⟦D⟧}[ E_{k,p ∈ ⟦D⟧}[ |f(u, k) − f(u[p], ℓ)|² ] ]
       ≤ E_{u ∈ ⟦n⟧, ℓ ∈ ⟦D⟧}[ γ₂(H) · E_{kp ∈ E(H)}[ |f(u, k) − f(u[p], ℓ)|² ] ]
       = γ₂(H) · E_{u ∈ ⟦n⟧, ℓ ∈ ⟦D⟧}[ E_{p ∈ ⟦D⟧, j ∈ ⟦d⟧}[ |f(u, p[j]) − f(u[p], ℓ)|² ] ],

and denote the last (double) expectation by Δ₂, so that Δ₁ ≤ γ₂(H) Δ₂.

Now, using the consistent labeling (substituting v = u[p], so that u = v[p]), we have

    Δ₂ = E_{j ∈ ⟦d⟧, ℓ ∈ ⟦D⟧}[ E_{u ∈ ⟦n⟧, p ∈ ⟦D⟧}[ |f(u, p[j]) − f(u[p], ℓ)|² ] ]
       = E_{j ∈ ⟦d⟧, ℓ ∈ ⟦D⟧}[ E_{v ∈ ⟦n⟧, p ∈ ⟦D⟧}[ |f(v[p], p[j]) − f(v, ℓ)|² ] ]
       = E_{j ∈ ⟦d⟧, v ∈ ⟦n⟧}[ E_{p,ℓ ∈ ⟦D⟧}[ |f(v[p], p[j]) − f(v, ℓ)|² ] ]
       ≤ γ₂(H) · E_{j ∈ ⟦d⟧, v ∈ ⟦n⟧}[ E_{pℓ ∈ E(H)}[ |f(v[p], p[j]) − f(v, ℓ)|² ] ],

and denote the last (double) expectation by Δ₃, so that Δ₂ ≤ γ₂(H) Δ₃.

Now, we have

    Δ₃ = E_{j ∈ ⟦d⟧, v ∈ ⟦n⟧}[ E_{p ∈ ⟦D⟧, i ∈ ⟦d⟧}[ |f(v[p], p[j]) − f(v, p[i])|² ] ]
       = E_{(x,y)(x′,y′) ∈ E(G ⓩ H)}[ |f(x, y) − f(x′, y′)|² ],

as (v[p], p[j]) is adjacent to (v[p], p) (a short edge), which is in turn adjacent to (v, p) (a long edge), which is adjacent to (v, p[i]) (a short edge). Namely, (v[p], p[j]) and (v, p[i]) form the endpoints of a zig-zag-zig path in the replacement product of G and H; that is, these two endpoints are connected by an edge in the zig-zag product graph. Furthermore, it is easy to verify that each zig-zag edge gets accounted for in this representation exactly once, implying the above equality. Thus, we have ψ ≤ γ₂(G) (γ₂(H))² Δ₃, which implies the claim.
The other claim follows by similar argumentation. ■

47.1.3. Squaring
The last component in our construction is squaring a graph. Given an (n, d)-graph G, consider the multigraph G² formed by connecting any pair of vertices connected in G by a path of length 2. Clearly, if M is the adjacency matrix of G, then the adjacency matrix of G² is the matrix M². Note that (M²)_{ij} is the number of distinct paths of length 2 in G from i to j. Note that the new graph might have self loops, which does not affect our analysis, so we keep them in.

Lemma 47.1.4. Let G be an (n, d)-graph. The graph G² is an (n, d²)-graph. Furthermore, γ₂(G²) = (γ₂(G))² / (2γ₂(G) − 1).

 2  2
Proof: The graph G2 has eigenvalues λb1 (G) , . . . , λb1 (G) for its matrix Q2 . As such, we have that
      
b
λ G2 = max λb2 G2 , −λbn G2 .

       2  2
Now, λb1 G2 = 1. Now, if λb2 (G) ≥ λbn (G) < 1 then b
λ G2 = λb2 G2 = λb2 (G) = b λ(G) .
     2  2
If λb2 (G) < λbn (G) then b
λ G2 = λb2 G2 = λbn (G) = b λ(G) ..
   2
Thus, in either case b λ G2 = bλ(G) . Now, By Lemma 46.1.2 γ2 (G) = 1−bλ(1 G) , which implies that b
λ(G) =
1 − 1/γ2 (G), and thus

  1 1 1 γ2 (G) (γ2 (G))2


γ2 G 2
=  =   =  2 = = . ■
1 −b 2 − γ21(G) 2γ2 (G) − 1
2
λG2
1− b
λ(G) 1− 1− 1
γ2 (G)

47.1.4. The construction


So, let us build an expander using Theorem 46.2.3, with parameters r = 7 and q = 2⁵ = 32. Let d = q² = 1024. The resulting graph H has N = q^{r+1} = q⁸ = d⁴ vertices, and it is d = q² regular. Furthermore, λ̂ᵢ ≤ r/q = 7/32, for all i ≥ 2. As such, we have

    γ(H) = γ₂(H) = 1/(1 − 7/32) = 32/25.

Let G₀ be any graph whose square is the complete graph over n₀ = N + 1 vertices. Observe that G₀² is d⁴-regular. Set Gᵢ = (Gᵢ₋₁²) ⓩ H. Clearly, the graph Gᵢ has

    nᵢ = nᵢ₋₁ N

vertices. The graph (Gᵢ₋₁²) ⓩ H is d²-regular. As for the bi-tension, let αᵢ = γ₂(Gᵢ). By Theorem 47.1.3 and Lemma 47.1.4, we have that

    αᵢ ≤ γ₂(Gᵢ₋₁²) (γ₂(H))² = ((αᵢ₋₁)² / (2αᵢ₋₁ − 1)) · (32/25)² ≤ 1.64 · (αᵢ₋₁)² / (2αᵢ₋₁ − 1).

It is now easy to verify that αᵢ cannot be bigger than 5.
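That last claim is a one-line calculation; the sketch below (not from the notes) checks numerically that the recurrence bound a ↦ 1.64·a²/(2a − 1), which is increasing on [1, 5], never exceeds 5 on that range:

```python
def step(a):
    # One application of the bi-tension recurrence bound.
    return 1.64 * a * a / (2.0 * a - 1.0)

# The map is increasing on [1, 5], so its maximum there is at a = 5.
assert step(5.0) <= 5.0                      # 1.64 * 25 / 9 ~= 4.56
grid = [1.0 + 4.0 * t / 10000 for t in range(10001)]
assert all(step(a) <= 5.0 for a in grid)     # stays at most 5 on [1, 5]
```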

Theorem 47.1.5. For any i ≥ 0, one can compute deterministically a graph Gᵢ with nᵢ = (d⁴ + 1)d⁴ⁱ vertices, which is d²-regular, where d = 1024. The graph Gᵢ is a (1/10)-expander.

Proof: The construction is described above. As for the expansion, since the bi-tension bounds the tension of a graph, we have that γ(Gᵢ) ≤ γ₂(Gᵢ) ≤ 5. Now, by Lemma 45.2.2, we have that Gᵢ is a δ-expander, where δ ≥ 1/(2γ(Gᵢ)) ≥ 1/10. ■

47.2. Bibliographical notes
A good survey on expanders is the monograph by Hoory et al. [HLW06]. The small expander construction is from the paper by Alon et al. [ASS08] (but it originates in the work by Alon and Roichman [AR94]). The work by Alon et al. [ASS08] contains a construction of a constant-degree expander of similar complexity to the one we presented here. Instead, we used the zig-zag expander construction from the influential work of Reingold et al. [RVW02]. Our analysis, however, is from an upcoming paper by Mendel and Naor [MN08]. This analysis is arguably reasonably simple (as simplicity is in the eye of the beholder, we will avoid claiming that it is the simplest), and (even better) it provides good intuition and a systematic approach to analyzing the expansion.
We took creative freedom in naming notions; the names tension and bi-tension are the author's own invention.

47.3. Exercises
Exercise 47.3.1 (Expanders made easy.). By considering a random bipartite three-regular graph on 2n vertices obtained by picking three random permutations between the two sides of the bipartite graph, prove that there is a c > 0 such that for every n there exists a (2n, 3, c)-expander. (What is the value of c in your construction?)

Exercise 47.3.2 (Is your consistency in vain?). In the construction, we assumed that the graphs we are dealing with when building expanders have consistent labeling. This can be enforced by working with bipartite graphs, which implies modifying the construction slightly.
(A) Prove that a d-regular bipartite graph always has a consistent labeling (hint: consider matchings in this graph).
(B) Prove that if G is bipartite then so is the graph G³ (the cubed graph).
(C) Let G be an (n, D)-graph and let H be a (D, d)-graph. Prove that if G is bipartite then G ⓩ H is bipartite.
(D) Describe in detail a construction of an expander that is: (i) bipartite, and (ii) has consistent labeling at every stage of the construction (prove this property if necessary). For the ith graph in your series, what is its vertex degree, how many vertices does it have, and what quality of expansion does it provide?

Exercise 47.3.3 (Tension and bi-tension.). [30 points]
Disprove (i.e., give a counterexample) that there exists a universal constant c, such that for any connected graph G, we have that γ(G) ≤ γ₂(G) ≤ cγ(G).

Acknowledgments
Much of the presentation followed suggestions by Manor Mendel. He also contributed some of the figures.

References
[AR94] N. Alon and Y. Roichman. Random Cayley graphs and expanders. Random Struct. Algorithms, 5(2): 271–285, 1994.
[ASS08] N. Alon, O. Schwartz, and A. Shapira. An elementary construction of constant-degree expanders.
Combin. Probab. Comput., 17(3): 319–327, 2008.

[HLW06] S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bulletin Amer.
Math. Soc., 43: 439–561, 2006.
[MN08] M. Mendel and A. Naor. Towards a calculus for non-linear spectral gaps. manuscript. 2008.
[RVW02] O. Reingold, S. Vadhan, and A. Wigderson. Entropy waves, the zig-zag graph product, and new
constant-degree expanders and extractors. Annals Math., 155(1): 157–187, 2002.

Chapter 48

The Probabilistic Method


“Shortly after the celebration of the four thousandth anniversary of the opening of space, Angary J. Gustible discovered
Gustible’s planet. The discovery turned out to be a tragic mistake.
Gustible’s planet was inhabited by highly intelligent life forms. They had moderate telepathic powers. They immediately mind-
read Angary J. Gustible’s entire mind and life history, and embarrassed him very deeply by making up an opera concerning
his recent divorce.”

Gustible’s Planet, Cordwainer Smith

48.1. Introduction
The probabilistic method is a combinatorial technique that uses probabilistic arguments and algorithms to create objects having desirable properties, and furthermore, to prove that such objects exist. The technique rests on two elementary observations:

1. If E[X] = µ, then there exists a value x attained by X with positive probability such that x ≥ µ.

2. If the probability of an event E is larger than zero, then E is non-empty; that is, an object with the property defining E exists.

The surprising thing is that, despite the elementary nature of these two observations, they lead to a powerful technique yielding numerous nice and strong results, including elementary proofs of theorems that previously had very complicated and involved proofs.
The main proponent of the probabilistic method was Paul Erdős. An excellent text on the topic is the book by Noga Alon and Joel Spencer [AS00].
This topic is worthy of its own course. The interested student is referred to the course "Math 475 — The Probabilistic Method".

48.1.1. Examples
48.1.1.1. Max cut
Computing the maximum cut (i.e., max cut) in a graph is an NP-Complete problem, which is APX-Hard (i.e., no better than some constant approximation is possible if P ≠ NP). We present a better approximation algorithm later on, but the following simple algorithm already gives a pretty good approximation.

Theorem 48.1.1. For any undirected graph G = (V, E) with n vertices and m edges, there is a partition of the vertex set V into two sets S and T, such that |(S, T)| = |{uv ∈ E | u ∈ S and v ∈ T}| ≥ m/2. One can compute a partition, in O(n) time, such that E[|(S, T)|] = m/2.

Proof: Consider the following experiment: randomly assign each vertex to S or T, independently and with equal probability.
For an edge e = uv, the probability that one endpoint is in S and the other in T is 1/2; let Xₑ be the indicator variable with value 1 if this happens. By linearity of expectation,

    E[ |(S, T)| ] = Σ_{e ∈ E(G)} E[Xₑ] = Σ_{e ∈ E(G)} 1/2 = m/2.

Thus, there must be an execution of the algorithm that computes a cut at least as large as the expectation; namely, a partition of V that realizes a cut with ≥ m/2 edges. ■
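This experiment is a few lines of Python (a sketch, not from the notes; the sample graph is an arbitrary illustrative choice). Since some assignment achieves the expectation, repeating the experiment quickly finds a concrete cut of size at least m/2:

```python
import random

def random_cut(n, edges, rng):
    """One run of the experiment: each vertex joins S or T with prob. 1/2."""
    side = [rng.randrange(2) for _ in range(n)]
    return sum(1 for u, v in edges if side[u] != side[v])

def half_cut(n, edges, trials=300, seed=0):
    """Repeat the experiment; some run must reach the expectation m/2."""
    rng = random.Random(seed)
    return max(random_cut(n, edges, rng) for _ in range(trials))

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2), (1, 3)]  # m = 7
assert 2 * half_cut(5, edges) >= len(edges)  # a cut of size >= m/2 found
```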

48.2. Maximum Satisfiability


In the MAX-SAT problem, we are given a boolean formula F in CNF (conjunctive normal form), and we would like to find an assignment that satisfies as many clauses of F as possible; for example, F = (x ∨ y) ∧ (x ∨ z). Of course, an assignment satisfying all the clauses of the formula, and thus F itself, would be even better, but deciding whether one exists is of course NP-Complete. As such, we ask how well we can do when we relax the problem to maximizing the number of satisfied clauses.
Theorem 48.2.1. For any set of m clauses, there is a truth assignment of variables that satisfies at least m/2 clauses.

Proof: Assign every variable a random value. A clause with k variables is satisfied with probability 1 − 2⁻ᵏ. Using linearity of expectation, and the fact that every clause has at least one variable, it follows that E[X] ≥ m/2, where X is the random variable counting the number of satisfied clauses. In particular, there exists an assignment for which X ≥ m/2. ■
For an instance I, let m_opt(I) denote the maximum number of clauses that can be satisfied by the best assignment. For an algorithm Alg, let m_Alg(I) denote the number of clauses satisfied by the assignment computed by Alg. The approximation factor of Alg is m_Alg(I)/m_opt(I). Clearly, the algorithm of Theorem 48.2.1 provides a 1/2-approximation.
For every clause Cⱼ in the given instance, let zⱼ ∈ {0, 1} be a variable indicating whether Cⱼ is satisfied or not. Similarly, let xᵢ = 1 if the ith variable is assigned the value TRUE. Let C⁺ⱼ be the set of indices of the variables that appear in Cⱼ positively, and C⁻ⱼ the set of indices of the variables that appear negated. Clearly, to solve MAX-SAT, we need to solve the integer program:

    max  Σⱼ₌₁ᵐ zⱼ
    subject to:  Σ_{i ∈ C⁺ⱼ} xᵢ + Σ_{i ∈ C⁻ⱼ} (1 − xᵢ) ≥ zⱼ,  for all j,
                 xᵢ, zⱼ ∈ {0, 1},  for all i, j.

We relax this into the following linear program:

    max  Σⱼ₌₁ᵐ zⱼ
    subject to:  0 ≤ yᵢ, zⱼ ≤ 1,  for all i, j,
                 Σ_{i ∈ C⁺ⱼ} yᵢ + Σ_{i ∈ C⁻ⱼ} (1 − yᵢ) ≥ zⱼ,  for all j.

This LP can be solved in polynomial time. Let t̂ denote the value assigned to the variable t by the linear-programming solution. Clearly, Σⱼ₌₁ᵐ ẑⱼ is an upper bound on the number of clauses of I that can be satisfied. We now set the variable xᵢ to 1 with probability ŷᵢ. This is an instance of randomized rounding.
Lemma 48.2.2. Let Cⱼ be a clause with k literals. The probability that it is satisfied by randomized rounding is at least βₖ ẑⱼ ≥ (1 − 1/e) ẑⱼ, where

    βₖ = 1 − (1 − 1/k)ᵏ ≈ 1 − 1/e.

Proof: Assume Cⱼ = v₁ ∨ v₂ ∨ ... ∨ vₖ (by renaming, we may assume all literals appear positively). By the LP, we have ŷ₁ + ... + ŷₖ ≥ ẑⱼ. Furthermore, the probability that Cⱼ is not satisfied is ∏ᵢ₌₁ᵏ (1 − ŷᵢ). Note that 1 − ∏ᵢ₌₁ᵏ (1 − ŷᵢ) is minimized when all the ŷᵢ's are equal (by symmetry); namely, when ŷᵢ = ẑⱼ/k. Consider the function f(x) = 1 − (1 − x/k)ᵏ. This function is larger than g(x) = βₖ x, for all 0 ≤ x ≤ 1, as can be easily verified (see Tedium 48.2.3). Thus,

    P[Cⱼ is satisfied] = 1 − ∏ᵢ₌₁ᵏ (1 − ŷᵢ) ≥ f(ẑⱼ) ≥ βₖ ẑⱼ.

The second part of the inequality follows from the fact that βₖ ≥ 1 − 1/e, for all k ≥ 1. Indeed, for k = 1, 2 the claim trivially holds. Furthermore,

    1 − (1 − 1/k)ᵏ ≥ 1 − 1/e   ⟺   (1 − 1/k)ᵏ ≤ 1/e,

but this holds since 1 − x ≤ e⁻ˣ implies 1 − 1/k ≤ e^{−1/k}, and as such (1 − 1/k)ᵏ ≤ e^{−k/k} = 1/e. ■

Tedium 48.2.3. Consider the two functions

    f(x) = 1 − (1 − x/k)ᵏ   and   g(x) = (1 − (1 − 1/k)ᵏ) x.

We have f′(x) = (1 − x/k)^{k−1} and f″(x) = −((k − 1)/k) (1 − x/k)^{k−2}. That is, f″(x) ≤ 0, for x ∈ [0, 1]; as such, f is a concave function.
Observe that f(0) = 0 = g(0) and f(1) = 1 − (1 − 1/k)ᵏ = g(1). Since f is concave and g is linear, it follows that f(x) ≥ g(x), for all x ∈ [0, 1].

Theorem 48.2.4. Given an instance I of MAX-SAT, the expected number of clauses satisfied by linear programming and randomized rounding is at least (1 − 1/e) m_opt(I) ≈ 0.632 m_opt(I), where m_opt(I) is the maximum number of clauses that can be satisfied on that instance.

Theorem 48.2.5. Given an instance I of MAX-SAT, let n₁ be the expected number of clauses satisfied by random assignment, and let n₂ be the expected number of clauses satisfied by linear programming followed by randomized rounding. Then max(n₁, n₂) ≥ (3/4) Σⱼ ẑⱼ ≥ (3/4) m_opt(I).

Proof: It is enough to show that (n₁ + n₂)/2 ≥ (3/4) Σⱼ ẑⱼ. Let Sₖ denote the set of clauses that contain k literals. We know that

    n₁ = Σₖ Σ_{Cⱼ ∈ Sₖ} (1 − 2⁻ᵏ) ≥ Σₖ Σ_{Cⱼ ∈ Sₖ} (1 − 2⁻ᵏ) ẑⱼ.

By Lemma 48.2.2, we have n₂ ≥ Σₖ Σ_{Cⱼ ∈ Sₖ} βₖ ẑⱼ. Thus,

    (n₁ + n₂)/2 ≥ Σₖ Σ_{Cⱼ ∈ Sₖ} ((1 − 2⁻ᵏ + βₖ)/2) ẑⱼ.

One can verify that 1 − 2⁻ᵏ + βₖ ≥ 3/2, for all k.¬ Thus, we have

    (n₁ + n₂)/2 ≥ (3/4) Σₖ Σ_{Cⱼ ∈ Sₖ} ẑⱼ = (3/4) Σⱼ ẑⱼ. ■
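The inequality 1 − 2⁻ᵏ + βₖ ≥ 3/2 used in the last step, together with βₖ ≥ 1 − 1/e from Lemma 48.2.2, is easy to confirm numerically (a sanity check, not from the notes):

```python
import math

def beta(k):
    # beta_k = 1 - (1 - 1/k)^k, the LP-rounding guarantee for k-literal clauses
    return 1.0 - (1.0 - 1.0 / k) ** k

for k in range(1, 201):
    assert beta(k) >= 1.0 - 1.0 / math.e - 1e-12     # beta_k >= 1 - 1/e
    assert 1.0 - 2.0 ** -k + beta(k) >= 1.5 - 1e-12  # the combined bound
```

Note that equality holds at k = 1 and k = 2, so the constant 3/4 is tight for this argument.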

References
[AS00] N. Alon and J. H. Spencer. The Probabilistic Method. 2nd. Wiley InterScience, 2000.

 
¬
Indeed, by the proof of Lemma 48.2.2, we have that βk ≥ 1 − 1/e. Thus, 1 − 2−k + βk ≥ 2 − 1/e − 2−k ≥ 3/2 for k ≥ 3. Thus,
we only need to check the inequality for k = 1 and k = 2, which can be done directly.

Chapter 49

The Probabilistic Method III


At other times you seemed to me either pitiable or contemptible, eunuchs, artificially confined to an eternal childhood, childlike
and childish in your cool, tightly fenced, neatly tidied playground and kindergarten, where every nose is carefully wiped and
every troublesome emotion is soothed, every dangerous thought repressed, where everyone plays nice, safe, bloodless games
for a lifetime and every jagged stirring of life, every strong feeling, every genuine passion, every rapture is promptly checked,
deflected and neutralized by meditation therapy.

The Glass Bead Game, Hermann Hesse

49.1. The Lovász Local Lemma


Lemma 49.1.1. (i) P[A | B ∩ C] = P[A ∩ B | C] / P[B | C].
(ii) Let η₁, ..., ηₙ be n events, which are not necessarily independent. Then,

    P[∩_{i=1}^{n} ηᵢ] = P[η₁] · P[η₂ | η₁] · P[η₃ | η₁ ∩ η₂] ⋯ P[ηₙ | η₁ ∩ ... ∩ ηₙ₋₁].

Proof: (i) We have that

    P[A ∩ B | C] / P[B | C] = (P[A ∩ B ∩ C] / P[C]) / (P[B ∩ C] / P[C]) = P[A ∩ B ∩ C] / P[B ∩ C] = P[A | B ∩ C].

As for (ii), we already saw it and used it in the minimum cut algorithm lecture. ■
Definition 49.1.2. An event E is mutually independent of a set of events C if, for any subset U ⊆ C, we have that P[E ∩ (∩_{E′ ∈ U} E′)] = P[E] · P[∩_{E′ ∈ U} E′].
Let E₁, ..., Eₙ be events. A dependency graph for these events is a directed graph G = (V, E), with V = {1, ..., n}, such that Eᵢ is mutually independent of all the events in {Eⱼ | (i, j) ∉ E}.
Intuitively, an edge (i, j) in a dependency graph indicates that Eᵢ and Eⱼ might have some dependency between them. We are interested in settings where this dependency is limited enough that we can claim something about the probability of all these events happening simultaneously.
Lemma 49.1.3 (Lovász Local Lemma). Let G = (V, E) be a dependency graph for events E₁, ..., Eₙ. Suppose that there exist xᵢ ∈ [0, 1], for 1 ≤ i ≤ n, such that

    P[Eᵢ] ≤ xᵢ ∏_{(i,j) ∈ E} (1 − xⱼ).

Then

    P[∩_{i=1}^{n} Ēᵢ] ≥ ∏_{i=1}^{n} (1 − xᵢ),

where Ēᵢ denotes the complement of Eᵢ.

We need the following technical lemma.

Lemma 49.1.4. Let G = (V, E) be a dependency graph for events E₁, ..., Eₙ. Suppose that there exist xᵢ ∈ [0, 1], for 1 ≤ i ≤ n, such that P[Eᵢ] ≤ xᵢ ∏_{(i,j) ∈ E} (1 − xⱼ). Now, let S be a subset of {1, ..., n}, and let i be an index not in S. We have that

    P[Eᵢ | ∩_{j ∈ S} Ēⱼ] ≤ xᵢ.    (49.1)

Proof: The proof is by induction on k = |S|.
For k = 0, we have by assumption that P[Eᵢ] ≤ xᵢ ∏_{(i,j) ∈ E} (1 − xⱼ) ≤ xᵢ.
Thus, let N = {j ∈ S | (i, j) ∈ E}, and let R = S \ N. If N = ∅, then Eᵢ is mutually independent of the events of C(R) = {Ēⱼ | j ∈ R}. Thus, P[Eᵢ | ∩_{j ∈ S} Ēⱼ] = P[Eᵢ | ∩_{j ∈ R} Ēⱼ] = P[Eᵢ] ≤ xᵢ, by arguing as above.
Otherwise, by Lemma 49.1.1 (i), we have that

    P[Eᵢ | ∩_{j ∈ S} Ēⱼ] = P[Eᵢ ∩ (∩_{j ∈ N} Ēⱼ) | ∩_{m ∈ R} Ēₘ] / P[∩_{j ∈ N} Ēⱼ | ∩_{m ∈ R} Ēₘ].

We bound the numerator by

    P[Eᵢ ∩ (∩_{j ∈ N} Ēⱼ) | ∩_{m ∈ R} Ēₘ] ≤ P[Eᵢ | ∩_{m ∈ R} Ēₘ] = P[Eᵢ] ≤ xᵢ ∏_{(i,j) ∈ E} (1 − xⱼ),

since Eᵢ is mutually independent of C(R). As for the denominator, let N = {j_1, ..., j_r}. We have, by Lemma 49.1.1 (ii), that

    P[Ē_{j_1} ∩ ... ∩ Ē_{j_r} | ∩_{m ∈ R} Ēₘ]
      = P[Ē_{j_1} | ∩_{m ∈ R} Ēₘ] · P[Ē_{j_2} | Ē_{j_1} ∩ (∩_{m ∈ R} Ēₘ)] ⋯ P[Ē_{j_r} | Ē_{j_1} ∩ ... ∩ Ē_{j_{r−1}} ∩ (∩_{m ∈ R} Ēₘ)]
      = (1 − P[E_{j_1} | ∩_{m ∈ R} Ēₘ]) (1 − P[E_{j_2} | Ē_{j_1} ∩ (∩_{m ∈ R} Ēₘ)]) ⋯ (1 − P[E_{j_r} | Ē_{j_1} ∩ ... ∩ Ē_{j_{r−1}} ∩ (∩_{m ∈ R} Ēₘ)])
      ≥ (1 − x_{j_1}) ⋯ (1 − x_{j_r}) ≥ ∏_{(i,j) ∈ E} (1 − xⱼ),

by Eq. (49.1) and induction, as every probability term in the above expression conditions on fewer than |S| events.
It thus follows that P[Eᵢ | ∩_{j ∈ S} Ēⱼ] ≤ xᵢ. ■

Proof of Lovász local lemma (Lemma 49.1.3): Using Lemma 49.1.4, we have that

    P[∩_{i=1}^{n} Ēᵢ] = (1 − P[E₁]) (1 − P[E₂ | Ē₁]) ⋯ (1 − P[Eₙ | ∩_{i=1}^{n−1} Ēᵢ]) ≥ ∏_{i=1}^{n} (1 − xᵢ). ■

Corollary 49.1.5. Let E₁, ..., Eₙ be events, with P[Eᵢ] ≤ p for all i. If each event is mutually independent of all other events except for at most d of them, and if ep(d + 1) ≤ 1, then P[∩_{i=1}^{n} Ēᵢ] > 0.

Proof: If d = 0 the result is trivial, as the events are independent. Otherwise, there is a dependency graph with every vertex having out-degree at most d. Apply Lemma 49.1.3 with xᵢ = 1/(d + 1). Observe that

    xᵢ (1 − xᵢ)ᵈ = (1/(d + 1)) (1 − 1/(d + 1))ᵈ > (1/(d + 1)) · (1/e) ≥ p,

by assumption and since (1 − 1/(d + 1))ᵈ > 1/e; see Lemma 49.1.6 below. ■
The following is standard by now, and we include it only for the sake of completeness.

Lemma 49.1.6. For any n ≥ 1, we have (1 − 1/(n + 1))ⁿ > 1/e.

Proof: This is equivalent to (n/(n + 1))ⁿ > 1/e; namely, we need to prove e > ((n + 1)/n)ⁿ. But this is obvious, since ((n + 1)/n)ⁿ = (1 + 1/n)ⁿ < exp(n · (1/n)) = e. ■

49.2. Application to k-SAT


We are given an instance I of k-SAT, where every clause contains k literals, there are m clauses, and every one of the n variables appears in at most 2^{k/50} clauses.
Consider a random assignment, and let Eᵢ be the event that the ith clause is not satisfied. We know that p = P[Eᵢ] = 2⁻ᵏ, and furthermore, Eᵢ depends on at most d = k·2^{k/50} other events (each of the k variables of the ith clause appears in at most 2^{k/50} clauses). Since

    ep(d + 1) = e·2⁻ᵏ (k·2^{k/50} + 1) < 1,

for k ≥ 4, we conclude by Corollary 49.1.5 that

    P[I has a satisfying assignment] = P[∩ᵢ Ēᵢ] > 0.
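The condition of Corollary 49.1.5 for these parameters can be confirmed numerically (a sanity check, not from the notes):

```python
import math

for k in range(4, 301):
    p = 2.0 ** -k                  # probability a fixed clause is unsatisfied
    d = k * 2.0 ** (k / 50.0)      # max number of dependent clause events
    assert math.e * p * (d + 1.0) < 1.0  # the symmetric LLL condition holds
```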

49.2.1. An efficient algorithm


The above just proves that a satisfying assignment exists. We next show a polynomial-time algorithm (in m) for the computation of such an assignment (the algorithm will not be polynomial in k).
Let G be the dependency graph for I, where the vertices are the clauses of I, and two clauses are connected if they share a variable. In the first stage of the algorithm, we assign values to the variables one by one, in an arbitrary order. In the beginning of this process all variables are unspecified; at each step, we randomly assign a variable either 0 or 1 with equal probability.

Definition 49.2.1. A clause Eᵢ is dangerous if both of the following conditions hold:
(i) k/2 literals of Eᵢ have been fixed.
(ii) Eᵢ is still unsatisfied.

After assigning each value, we discover all the dangerous clauses, and we defer ("freeze") all the unassigned variables participating in such clauses. We continue in this fashion till all the unspecified variables are frozen. This completes the first stage of the algorithm.
At the second stage of the algorithm, we will compute a satisfying assignment to the variables using brute
force. This would be done by taking the surviving formula I ′ and breaking it into fragments, so that each
fragment does not share any variable with any other fragment (naively, it might be that all of I ′ is one fragment).
We can find a satisfying assignment to each fragment separately, and if each such fragment is “small” the
resulting algorithm would be “fast”.
We need to show that I ′ has a satisfying assignment and that the fragments are indeed small.
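The first stage can be sketched in a few lines of Python (illustrative, not the notes' code; the clause encoding, with literals as signed 1-based variable indices, and the sample instance are assumptions of this sketch). After it finishes, every unsatisfied clause still has at least k/2 unassigned variables:

```python
import random

def first_stage(clauses, k, rng):
    """Assign variables one by one, freezing the variables of dangerous clauses.

    clauses: list of k-tuples of signed 1-based variable indices.
    Returns the partial assignment (None = unassigned/frozen).
    """
    nvars = max(abs(l) for c in clauses for l in c)
    value = [None] * (nvars + 1)
    frozen = [False] * (nvars + 1)

    def satisfied(c):
        return any(value[abs(l)] == (l > 0) for l in c)

    def fixed_count(c):
        return sum(1 for l in c if value[abs(l)] is not None)

    for x in range(1, nvars + 1):
        if frozen[x]:
            continue
        value[x] = bool(rng.randrange(2))
        for c in clauses:                    # freeze dangerous clauses
            if not satisfied(c) and 2 * fixed_count(c) >= k:
                for l in c:
                    if value[abs(l)] is None:
                        frozen[abs(l)] = True
    return value

rng = random.Random(1)
clauses = [(1, 2, 3, 4), (-1, 2, -5, 6), (3, -4, 5, -6), (-2, -3, 4, 7)]
val = first_stage(clauses, 4, rng)
for c in clauses:  # every surviving clause keeps >= k/2 free variables
    if not any(val[abs(l)] == (l > 0) for l in c):
        assert sum(1 for l in c if val[abs(l)] is None) >= 2
```

The invariant asserted at the end holds for any random choices: a clause is frozen the moment k/2 of its literals are fixed while it is unsatisfied, and frozen variables are never assigned afterwards.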

49.2.1.1. Analysis
A clause survived if it is not satisfied by the variables fixed in the first stage. Note that a clause that survived must have a dangerous clause as a neighbor in the dependency graph G. Note also that I′, the instance remaining from I after the first stage, has at least k/2 unspecified variables in each clause. Furthermore, every clause of I′ has at most d = k·2^{k/50} neighbors in G′, where G′ is the dependency graph for I′. It follows that, again, we can apply the Lovász local lemma to conclude that I′ has a satisfying assignment.
Definition 49.2.2. Two connected graphs G₁ = (V₁, E₁) and G₂ = (V₂, E₂), where V₁, V₂ ⊆ {1, ..., n}, are unique if V₁ ≠ V₂.

Lemma 49.2.3. Let G be a graph with maximum degree d and with n vertices. Then, the number of unique connected subgraphs of G having r vertices is at most nd²ʳ.

Proof: Consider a unique subgraph Ĝ of G, which by definition is connected. Let H be a spanning tree of Ĝ. Duplicate every edge of H, and let H′ denote the resulting graph. Clearly, H′ is Eulerian, and as such possesses an Eulerian path π of length at most 2(r − 1), which can be specified by picking a starting vertex v, and writing down, for the ith vertex of π, which of the d possible neighbors is the next vertex in π. Thus, there are at most nd^{2(r−1)} ways of specifying π, and thus there are at most nd^{2(r−1)} unique subgraphs of G of size r. ■
Lemma 49.2.4. With probability 1 − o(1), all connected components of G′ have size at most O(log m), where G′ denotes the dependency graph for I′.

Proof: Let G4 be the graph formed from G by connecting any pair of vertices of G at distance exactly 4 from each other. The degree of a vertex of G4 is at most O(d⁴).
Let U be a set of r vertices of G, such that every pair of them is at distance at least 4 from each other in G. We are interested in bounding the probability that all the clauses of U survive the first stage.
The probability of a clause to become dangerous is at most 2^{−k/2}, as we assign (random) values to half of the variables of this clause. Now, a clause survives only if it is dangerous or one of its neighbors is dangerous. Thus, the probability that a clause survives is bounded by 2^{−k/2}(d + 1).
Furthermore, the survival of two clauses Eᵢ and Eⱼ in U are independent events, as no neighbor of Eᵢ shares a variable with a neighbor of Eⱼ (because of the distance-4 requirement). We conclude that the probability that all the vertices of U appear in G′ is bounded by

    (2^{−k/2} (d + 1))^r.

In fact, we are interested in sets U that induce connected subgraphs of G4. The number of unique such sets of size r is bounded by the number of unique subgraphs of G4 of size r, which is bounded by md⁸ʳ, by Lemma 49.2.3. Thus, the probability of any connected subgraph of G4 of size r = log₂ m to survive in G′ is smaller than

    md⁸ʳ (2^{−k/2} (d + 1))^r = m (k·2^{k/50})^{8r} (2^{−k/2} (k·2^{k/50} + 1))^r ≤ m·2^{kr/5} · 2^{−kr/4} = m·2^{−kr/20} = o(1),

since k ≥ 50. (Here, a subgraph of G4 survives if all its vertices appear in G′.) Note, however, that if a connected component of G′ has more than L vertices, then there must be a connected subgraph having L/d³ vertices in G4 that survived in G′. We conclude that, with probability 1 − o(1), no connected component of G′ has more than O(d³ log m) = O(log m) vertices (note that we consider k to be a constant, and thus also d). ■
Thus, after the first stage, we are left with fragments of (k/2)-SAT, where every fragment has size at most O(log m), and thus has at most O(log m) variables. Thus, we can find by brute force the satisfying assignment to each such fragment in time polynomial in m. We conclude:

Theorem 49.2.5. The above algorithm finds a satisfying truth assignment for any instance of k-SAT containing m clauses, in which each variable is contained in at most 2^{k/50} clauses, in expected time polynomial in m.

Chapter 50

The Probabilistic Method IV


Once I sat on the steps by a gate of David’s Tower, I placed my two heavy baskets at my side. A group of tourists was standing
around their guide and I became their target marker. “You see that man with the baskets? Just right of his head there’s an arch
from the Roman period. Just right of his head.” “But he’s moving, he’s moving!” I said to myself: redemption will come only
if their guide tells them, “You see that arch from the Roman period? It’s not important: but next to it, left and down a bit, there
sits a man who’s bought fruit and vegetables for his family.”

Yehuda Amichai, Tourists

50.1. The Method of Conditional Probabilities


In previous lectures, we encountered the following problem.

Problem 50.1.1 (Set Balancing/Discrepancy). Given a binary matrix M of size n × n, find a vector v ∈ {−1, +1}ⁿ, such that ∥Mv∥∞ is minimized.

Using random assignment and the Chernoff inequality, we showed that there exists v such that ∥Mv∥∞ ≤ 4√(n ln n). Can we derandomize this algorithm? Namely, can we come up with an efficient deterministic algorithm that has low discrepancy?
To derandomize our algorithm, construct a computation tree of depth n, where in the ith level we expose the ith coordinate of v. This tree T has depth n. The root represents all possible random choices, while a node at depth i represents all computations where the first i bits are fixed. For a node v ∈ T, let P(v) be the probability that a random computation starting from v succeeds; here, randomly assigning the remaining bits can be interpreted as a random walk down the tree to a leaf.
Formally, the algorithm is successful if it ends up with a vector v such that ∥Mv∥∞ ≤ 4√(n ln n).
Let vl and vr be the two children of v. Clearly, P(v) = (P(vl) + P(vr))/2. In particular, max(P(vl), P(vr)) ≥ P(v). Thus, if we could compute P(·) quickly (and deterministically), then we could derandomize the algorithm.
Let C⁺ₘ be the bad event that rₘ · v > 4√(n log n), where rₘ is the mth row of M. Similarly, C⁻ₘ is the bad event that rₘ · v < −4√(n log n), and let Cₘ = C⁺ₘ ∪ C⁻ₘ. Consider the probability P[C⁺ₘ | v₁, ..., vₖ] (namely, the probability of C⁺ₘ when the first k coordinates of v are specified). Let rₘ = (r₁, ..., rₙ). We have that

    P[C⁺ₘ | v₁, ..., vₖ] = P[ Σ_{i=k+1}^{n} vᵢrᵢ > 4√(n log n) − Σ_{i=1}^{k} vᵢrᵢ ] = P[ Σ_{i ≥ k+1, rᵢ=1} vᵢ > L ],
311
p P P
where L = 4 n log n − ki=1 vi ri is a known quantity (since v1 , . . . , vk are known). Let V = i≥k+1,ri =1 1. We
have,
   
   
h i 
 X 
 
 X vi + 1 L + V 
 (vi + 1) > L + V  = P ,
P Cm v1 , . . . , vk = P
+
  > 
i≥k+1 i≥k+1
2 2 
αi =1 αi =1

The last quantity is the probability that in V flips of a fair 0/1 coin one gets more than (L + V)/2 heads. Thus,

    P+m = P[Cm+ | v1 , . . . , vk ] = (1/2^V) ∑_{i=⌈(L+V)/2⌉}^{V} binom(V, i).

This implies that we can compute P+m in polynomial time! Indeed, we are adding V ≤ n numbers, each one of them a binomial coefficient that has a polynomial size representation in n, and can be computed in polynomial time (why?). One can define in a similar fashion P−m , and let Pm = P+m + P−m . Clearly, Pm can be computed in polynomial time, by applying a similar argument to the computation of P−m = P[Cm− | v1 , . . . , vk ].
For a node v ∈ T , let vv denote the portion of v that was fixed when traversing from the root of T to v. Let P(v) = ∑_{m=1}^{n} P[Cm | vv ]. By the above discussion, P(v) can be computed in polynomial time. Furthermore, we know, by the previous result on discrepancy, that P(root(T )) < 1 (that was the bound used to show that there exists a good assignment).
As before, for any v ∈ T , we have P(v) ≥ min(P(vl ), P(vr )). Thus, we have a polynomial time deterministic algorithm for computing a set balancing with discrepancy smaller than 4√(n log n). Indeed, set v = root(T ), and start traversing down the tree. At each stage, compute P(vl ) and P(vr ) (in polynomial time), and set v to the child with the lower value of P(·). Clearly, after n steps, we reach a leaf that corresponds to a vector v′ such that ∥Mv′ ∥∞ ≤ 4√(n log n).
Theorem 50.1.2. Using the method of conditional probabilities, one can compute, in time polynomial in n, a vector v ∈ {−1, 1}n , such that ∥Mv∥∞ ≤ 4√(n log n).

Note that this method might fail to find the best assignment.
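The descent above can be sketched in code. The following is a minimal Python sketch (ours, not from the notes; all function names are illustrative): it computes the conditional failure bound P(·) exactly via binomial tails, as in the derivation above, and walks down the computation tree picking the child with the smaller bound. It assumes a 0/1 matrix given as a list of rows and the 4√(n ln n) threshold.

```python
# A sketch of the method of conditional probabilities for set balancing.
# Assumptions: M is a 0/1 matrix (list of rows); the discrepancy target
# 4*sqrt(n*ln n) and the tail formula follow the derivation above.
from math import comb, sqrt, log, ceil

def tail_prob(V, threshold):
    """P[at least ceil(threshold) heads in V fair coin flips]."""
    lo = max(0, ceil(threshold))
    if lo > V:
        return 0.0
    return sum(comb(V, i) for i in range(lo, V + 1)) / 2 ** V

def failure_bound(M, signs):
    """The bound P(v) = sum_m (P_m^+ + P_m^-) after fixing signs[0..k-1]."""
    n, k = len(M), len(signs)
    D = 4 * sqrt(n * log(n))
    total = 0.0
    for row in M:
        partial = sum(v * r for v, r in zip(signs, row))
        V = sum(row[k:])  # free +-1 coordinates that hit this row
        total += tail_prob(V, (D - partial + V) / 2)  # P_m^+
        total += tail_prob(V, (D + partial + V) / 2)  # P_m^- (by symmetry)
    return total

def derandomized_balancing(M):
    signs = []
    for _ in range(len(M)):
        # descend to the child of the computation tree with smaller P(.)
        signs.append(min((+1, -1),
                         key=lambda s: failure_bound(M, signs + [s])))
    return signs
```

The inner `tail_prob` is the sum of binomial coefficients from the derivation; each call is polynomial, so the whole descent is polynomial time, matching Theorem 50.1.2.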

50.2. Independent set in a graph


Theorem 50.2.1. Consider a graph G = (JnK , E), with n vertices and m edges. Then G contains an independent set of size at least f (n, m) = n/(2m/n + 1).
In particular, a randomized algorithm can compute an independent set of expected size Ω( f (n, m)).

Proof: Consider a random permutation π of the vertices, and in the ith iteration add the vertex πi to the independent set if none of its neighbors are already in the independent set. Let I be the resulting independent set. A vertex i ∈ JnK is added to I whenever it appears in π before all of its d(i) neighbors, where d(i) denotes the degree of i, and this happens with probability 1/(d(i) + 1). We thus have

    P[i ∈ I] ≥ 1/(d(i) + 1).

As such, the expected size of the computed independent set is

    Γ = ∑_{i=1}^{n} P[i ∈ I] ≥ ∑_{i=1}^{n} 1/(d(i) + 1).

Observe that, for x > 0 and α ≥ x, the quantity

    1/(1 + x) + 1/(1 + α − x) = (2 + α)/((1 + x)(1 + α − x))

achieves its minimum when x = α/2, as the denominator (1 + x)(1 + α − x) is then maximized.
As such, since ∑_{i=1}^{n} d(i) = 2m, the sum ∑_{i=1}^{n} 1/(d(i) + 1) is minimized when all the d(·) are equal, namely d(i) = 2m/n. Which means that

    Γ ≥ ∑_{i=1}^{n} 1/(d(i) + 1) ≥ ∑_{i=1}^{n} 1/((2m/n) + 1) = n/((2m/n) + 1),

as claimed. ■
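The proof is constructive; here is a small Python sketch (ours, not from the notes) of the permutation-based greedy. Averaging over all orders of a tiny graph realizes the expectation bound exactly; the randomized algorithm would simply use one shuffled order.

```python
# Greedy independent set from a vertex ordering, as in the proof of
# Theorem 50.2.1. A randomized run would use random.shuffle(order).
from itertools import permutations

def greedy_independent_set(n, edges, order):
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    I = set()
    for v in order:
        if not (adj[v] & I):  # no neighbor was added earlier
            I.add(v)
    return I

# sanity check on a path with n = 4, m = 3: the guarantee is
# f(n, m) = n/(2m/n + 1) = 1.6, and sum_i 1/(d(i)+1) = 5/3
path = [(0, 1), (1, 2), (2, 3)]
sizes = [len(greedy_independent_set(4, path, p))
         for p in permutations(range(4))]
avg = sum(sizes) / len(sizes)  # exceeds 1.6
```

Note that the greedy always returns a maximal independent set: any vertex left out had a neighbor already chosen when it was scanned.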

50.3. A Short Excursion into Combinatorics via the Probabilistic Method


In this section, we provide some additional examples of the Probabilistic Method to prove some results in
combinatorics and discrete geometry. While the results are not directly related to our main course, their beauty,
hopefully, will speak for itself.

50.3.1. High Girth and High Chromatic Number


Definition 50.3.1. For a graph G, let α(G) be the cardinality of the largest independent set in G, let χ(G) denote the chromatic number of G, and let girth(G) denote the length of the shortest cycle in G.
Theorem 50.3.2. For all K, L there exists a graph G with girth(G) > L and χ(G) > K.
Proof: Fix µ < 1/L, and let G ≈ G(n, p) with p = n^{µ−1} ; namely, G is a random graph on n vertices chosen by picking each pair of vertices to be an edge in G, randomly and independently with probability p. Let X be the number of cycles of size at most L. Then

    E[X] = ∑_{i=3}^{L} (n!/(n − i)!) · (1/2i) · p^i ≤ ∑_{i=3}^{L} (n^i/2i) · (n^{µ−1})^i = ∑_{i=3}^{L} n^{µi}/2i = o(n),

as µL < 1, and since the number of different sequences of i vertices is n!/(n − i)!, and every cycle is being counted in this sequence 2i times.
In particular, P[X ≥ n/2] = o(1).
Let x = ⌈(3/p) ln n⌉ + 1. We remind the reader that α(G) denotes the size of the largest independent set in G. We have that

    P[α(G) ≥ x] ≤ binom(n, x) (1 − p)^{binom(x, 2)} < ( n exp(−p(x − 1)/2) )^x ≤ ( n exp(−(3/2) ln n) )^x = (n^{−1/2})^x = o(1).
Let n be sufficiently large so that both these events have probability less than 1/2. Then there is a specific G with fewer than n/2 cycles of length at most L and with α(G) < 3n^{1−µ} ln n + 1.
Remove from G a vertex from each cycle of length at most L. This gives a graph G∗ with at least n/2
vertices. G∗ has girth greater than L and α(G∗ ) ≤ α(G) (any independent set in G∗ is also an independent set in
G). Thus
    χ(G∗ ) ≥ |V(G∗ )|/α(G∗ ) ≥ (n/2)/(3n^{1−µ} ln n + 1) ≥ n^µ/(12 ln n).
To complete the proof, let n be sufficiently large so that this is greater than K. ■
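The construction can also be played with numerically. Below is a small, hedged illustration (ours, not from the notes): sample G(n, p) with p = n^(µ−1) and count the cycles of length at most L by brute force. The counter enumerates vertex sequences and divides by the 2i overcount, exactly as in the computation of E[X] above; it is exponential and only meant for tiny n.

```python
# Counting cycles of length <= L by enumerating vertex sequences; each
# cycle of length i appears 2i times (i rotations x 2 orientations).
import random
from itertools import permutations

def count_short_cycles(n, edges, L):
    adj = [[False] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = True
    total = 0
    for i in range(3, L + 1):
        closed = sum(1 for seq in permutations(range(n), i)
                     if all(adj[seq[j]][seq[(j + 1) % i]] for j in range(i)))
        total += closed // (2 * i)
    return total

def gnp(n, p, rng):
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

# with mu = 0.2, L = 4 and p = n**(mu - 1), short cycles are scarce:
rng = random.Random(0)
n, mu, L = 18, 0.2, 4
X = count_short_cycles(n, gnp(n, n ** (mu - 1), rng), L)
```

On K4 the counter gives 4 triangles and 3 four-cycles, which matches the n!/((n − i)! 2i) counting used in the proof.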

50.3.2. Crossing Numbers and Incidences
The following problem has a long and very painful history. It is truly amazing that it can be solved by such a
short and elegant proof.
An embedding of a graph G = (V, E) in the plane is a planar representation of it, where each vertex is represented by a point in the plane, and each edge uv is represented by a curve connecting the points corresponding
to the vertices u and v. The crossing number of such an embedding is the number of pairs of intersecting curves
that correspond to pairs of edges with no common endpoints. The crossing number cr(G) of G is the minimum
possible crossing number in an embedding of it in the plane.

Theorem 50.3.3. The crossing number of any simple graph G = (V, E) with |E| ≥ 4 |V| is at least |E|^3 /(64 |V|^2 ).

Proof: By Euler’s formula any simple planar graph with n vertices has at most 3n − 6 edges. (Indeed, f − e + v = 2, and in the case with the maximum number of edges every face has 3 edges on its boundary, so 3 f = 2e. Thus, (2/3)e − e + v = 2, namely e = 3v − 6.) This implies that the crossing number of any simple graph with n vertices and m edges is at least m − 3n + 6 > m − 3n. Let G = (V, E) be a graph with |E| ≥ 4 |V|
embedded in the plane with t = cr(G) crossings. Let H be the random induced subgraph of G obtained by
picking each vertex of G randomly and independently, to be a vertex of H with probability p (where p will
be specified shortly). The expected number of vertices of H is p |V|, the expected number of its edges is p^2 |E|, and the expected number of crossings in the given embedding is p^4 t, implying that the expected value of its crossing number is at most p^4 t. Therefore, we have p^4 t ≥ p^2 |E| − 3p |V|, implying that

    cr(G) ≥ |E|/p^2 − 3 |V|/p^3 .

Let p = 4 |V| / |E| ≤ 1, and we have cr(G) ≥ (1/16 − 3/64) |E|^3 / |V|^2 = |E|^3 /(64 |V|^2 ). ■

Theorem 50.3.4. Let P be a set of n distinct points in the plane, and let L be a set of m distinct lines. Then, the number of incidences between the points of P and the lines of L (that is, the number of pairs (p, ℓ) with p ∈ P, ℓ ∈ L, and p ∈ ℓ) is at most c( m^{2/3} n^{2/3} + m + n ), for some absolute constant c.

Proof: Let I denote the number of such incidences. Let G = (V, E) be the graph whose vertices are all the
points of P, where two are adjacent if and only if they are consecutive points of P on some line in L. Clearly
|V| = n, and |E| = I − m. Note that G is already given embedded in the plane, where the edges are represented by segments of the corresponding lines of L.
Either we cannot apply Theorem 50.3.3, implying that I − m = |E| < 4 |V| = 4n, namely I ≤ m + 4n. Or, alternatively,

    |E|^3 /(64 |V|^2 ) = (I − m)^3 /(64n^2 ) ≤ cr(G) ≤ binom(m, 2) ≤ m^2 /2,

as two lines cross at most once. This implies that I ≤ (32)^{1/3} m^{2/3} n^{2/3} + m. In both cases, I ≤ 4( m^{2/3} n^{2/3} + m + n ). ■
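As a hedged sanity check (our code, not from the notes), one can count incidences by brute force on a small point-line configuration and compare with the bound of Theorem 50.3.4 with c = 4. Exact rational arithmetic avoids floating-point errors that could miss incidences.

```python
# Brute-force incidence counting for points and non-vertical lines
# y = a*x + b, checked against 4(m^(2/3) n^(2/3) + m + n).
from fractions import Fraction

def count_incidences(points, lines):
    return sum(1 for (x, y) in points for (a, b) in lines
               if Fraction(y) == a * Fraction(x) + b)

points = [(x, y) for x in range(4) for y in range(4)]            # 4x4 grid
lines = [(Fraction(a), Fraction(b)) for a in range(3) for b in range(4)]
I = count_incidences(points, lines)                              # 32 here
n, m = len(points), len(lines)
bound = 4 * ((m * m * n * n) ** (1 / 3) + m + n)
```

Grids with many lines through them are in fact the extremal configurations for the Szemerédi–Trotter bound, which is why a small grid makes a natural test case.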

This technique has interesting and surprising results, as the following theorem shows.

Theorem 50.3.5. For any three sets A, B and C of s real numbers each, we have

    |A · B + C| = |{ ab + c | a ∈ A, b ∈ B, c ∈ C }| ≥ Ω( s^{3/2} ).

Proof: Let R = A · B + C, |R| = r, and define P = { (a, t) | a ∈ A, t ∈ R } and L = { y = bx + c | b ∈ B, c ∈ C }. Clearly n = |P| = sr, and m = |L| = s^2 . Furthermore, a line y = bx + c of L is incident with s points of P, namely with { (a, t) | a ∈ A, t = ab + c }. Thus, the overall number of incidences is at least s^3 . By Theorem 50.3.4, we have

    s^3 ≤ 4( m^{2/3} n^{2/3} + m + n ) = 4( (s^2 )^{2/3} (sr)^{2/3} + s^2 + sr ) = 4( s^2 r^{2/3} + s^2 + sr ).

For r < s^3 , we have that sr ≤ s^2 r^{2/3} . Thus, for r < s^3 , we have s^3 ≤ 12 s^2 r^{2/3} , implying that s^{3/2} ≤ 12^{3/2} r. Namely, |R| = Ω(s^{3/2} ), as claimed. ■
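A quick numeric illustration (ours, and only a sanity check, not part of the proof): even for A = B = C = {1, . . . , s}, the set A · B + C is already much larger than the s^{3/2} threshold, in line with the theorem.

```python
# |A.B + C| for small integer sets, compared against the s^(3/2) bound
# (with the constant from the proof folded in).
def product_plus_set(A, B, C):
    return {a * b + c for a in A for b in B for c in C}

s = 20
A = B = C = range(1, s + 1)
r = len(product_plus_set(A, B, C))  # far above s**1.5 / 12 here
```

For integer sets the values ab + c all lie in [2, s^2 + s], so r is trivially at most s^2 + s − 1; the theorem's content is the lower bound.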

Among other things, the crossing number technique implies better bounds for k-sets in the plane than what was previously known. The k-set problem has attracted a lot of research, and remains till this day one of the major open problems in discrete geometry.

50.3.3. Bounding the at most k-level


Let L be a set of n lines in the plane. Assume, without loss of generality, that no three lines of L pass through a common point, and none of them is vertical. The complement of the union of the lines of L breaks the plane into regions known as faces. An intersection of two lines is a vertex, and a maximal interval on a line between two vertices is an edge. The whole structure of vertices, edges and faces induced by L is known as the arrangement of L, denoted by A(L).
A point p ∈ ∪_{ℓ∈L} ℓ is of level k if there are k lines of L strictly below it. The k-level is the closure of the set of points of level k. Namely, the k-level is an x-monotone curve along the lines of L.
The 0-level is the boundary of the “bottom” face of the arrangement of L (i.e., the face containing the negative y-axis). It is easy to verify that the 0-level has at most n − 1 vertices, as each line might contribute at most one segment to the 0-level (which is an unbounded convex polygon). [Figure: an arrangement with its 0-level, 1-level and 3-level marked.]
It is natural to ask what the number of vertices at the k-level is (i.e., what the combinatorial complexity of the polygonal chain forming the k-level is). This is a surprisingly hard question, but the same question on the complexity of the at most k-level is considerably easier.
Theorem 50.3.6. The number of vertices of level at most k in an arrangement of n lines in the plane is O(nk).

Proof: Pick a random sample R of L, by picking each line to be in the sample with probability 1/k. Observe that E[|R|] = n/k.
Let L≤k = L≤k (L) be the set of all vertices of A(L) of level at most k, for k > 1. For a vertex p ∈ L≤k , let Xp be an indicator variable which is 1 if p is a vertex of the 0-level of A(R). The probability that p is in the 0-level of A(R) is the probability that none of the j ≤ k lines below it are picked to be in the sample, and the two lines that define it do get selected to be in the sample. Namely,

    P[Xp = 1] = (1 − 1/k)^j (1/k)^2 ≥ (1/k^2 ) (1 − 1/k)^k ≥ (1/k^2 ) exp(−2) = 1/(e^2 k^2 ),

since j ≤ k and 1 − x ≥ e^{−2x} for 0 < x ≤ 1/2.

On the other hand, the number of vertices on the 0-level of A(R) is at most |R| − 1. As such,

    ∑_{p∈L≤k} Xp ≤ |R| − 1.

Moreover this, of course, also holds in expectation, implying

    E[ ∑_{p∈L≤k} Xp ] ≤ E[|R| − 1] ≤ n/k.

On the other hand, by linearity of expectation, we have

    E[ ∑_{p∈L≤k} Xp ] = ∑_{p∈L≤k} E[Xp ] ≥ |L≤k |/(e^2 k^2 ).

Putting these two inequalities together, we get that |L≤k |/(e^2 k^2 ) ≤ n/k. Namely, |L≤k | ≤ e^2 nk. ■

Chapter 51

Sampling and the Moments Technique


Sun and rain and bush had made the site look old, like the site of a dead civilization. The ruins, spreading over so many acres,
seemed to speak of a final catastrophe. But the civilization wasn’t dead. It was the civilization I existed in and in fact was still
working towards. And that could make for an odd feeling: to be among the ruins was to have your time-sense unsettled. You
felt like a ghost, not from the past, but from the future. You felt that your life and ambition had already been lived out for you
and you were looking at the relics of that life. You were in a place where the future had come and gone.

A bend in the river, V. S. Naipaul

51.1. Vertical decomposition


Given a set S of n segments in the plane, its arrangement, denoted by A(S), is the decomposition of the plane into faces, edges, and vertices. The vertices of A(S) are the endpoints, or the intersection points of the segments of S, the edges are the maximal connected portions of the segments not containing any vertex, and the faces are the connected components of the complement of the union of the segments of S. [Figure: an arrangement of segments, with a vertex, an edge, and a face marked.] These definitions are depicted in the figure.
For numerical reasons (and also conceptually), a symbolic representation would be better than a numerical one. Thus, an intersection vertex would be represented by two pointers to the segments whose intersection is this vertex. Similarly, an edge would be represented
as a pointer to the segment that contains it, and two pointers to the vertices forming its endpoints.
Naturally, we are assuming here that we have geometric primitives that can resolve any decision problem
of interest that involve a few geometric entities. For example, for a given segment s and a point p, we would be
interested in deciding if p lies vertically below s. From a theoretical point of view, all these primitives require
a constant amount of computation, and are “easy”. In the real world, numerical issues and degeneracies make
implementing these primitives surprisingly challenging. We are going to ignore this major headache here, but
the reader should be aware of it.

We will be interested in computing the arrangement A(S) and a representation of it that makes it easy to
manipulate. In particular, we would like to be able to quickly resolve questions of the type
(i) are two points in the same face?
(ii) can one traverse from one point to the other without crossing any segment?
The naive representation of each face as polygons (potentially with holes) is not conducive to carrying out such
tasks, since a polygon might be arbitrarily complicated. Instead, we will prefer to break the arrangement into
smaller canonical tiles.

To this end, a vertical trapezoid is a quadrangle with two vertical sides. The breaking of the faces into such trapezoids is the vertical decomposition of the arrangement A(S).

Formally, for a subset R ⊆ S, let A|(R) denote the vertical decomposition of the plane formed by the arrangement A(R) of the segments of R. This is the partition of the plane into interior disjoint vertical trapezoids formed by erecting vertical walls through each vertex of A(R).
A vertex of A|(R) is either an endpoint of a segment of R or an intersection point of two of its segments. From each such vertex we shoot up (similarly, down) a vertical ray till it hits a segment of R or it continues all the way to infinity. [Figure: a vertical decomposition, with one trapezoid σ marked.]
Note that a vertical trapezoid is defined by at most four segments: two segments
defining its ceiling and floor and two segments defining the two intersection points
that induce the two vertical walls on its boundary. Of course, a vertical trapezoid
might be degenerate and thus be defined by fewer segments (i.e., an unbounded vertical trapezoid or a triangle
with a vertical segment as one of its sides).
Vertical decomposition breaks the faces of the arrangement that might be arbitrarily complicated into en-
tities (i.e., vertical trapezoids) of constant complexity. This makes handling arrangements (decomposed into
vertical trapezoids) much easier computationally.
In the following, we assume that the n segments of S have k pairwise intersection points overall, and we want to compute the arrangement A = A(S); namely, compute the edges, vertices, and faces of A(S). One possible way is the following: Compute a random permutation of the segments of S: S = ⟨s1 , . . . , sn ⟩. Let Si = ⟨s1 , . . . , si ⟩ be the prefix of length i of S. Compute A|(Si ) from A|(Si−1 ), for i = 1, . . . , n. Clearly, A|(S) = A|(Sn ), and we can extract A(S) from it. Namely, in the ith iteration, we insert the segment si into the arrangement A|(Si−1 ).
This technique of building the arrangement by inserting the segments one by one is called randomized
incremental construction.

Who needs these pesky arrangements anyway? The reader might wonder who needs arrangements. As a concrete example, consider a situation where you are given several maps of a city containing different layers of information (i.e., streets map, sewer map, electric lines map, train lines map, etc.). We would like to compute the overlay map formed by putting all these maps on top of each other. For example, we might be interested in figuring out if there are any buildings lying on a planned train line, etc.
More generally, think about a set of general constraints in Rd . Each constraint is bounded by a surface, or a
patch of a surface. The decomposition of Rd formed by the arrangement of these surfaces gives us a description
of the parametric space in a way that is algorithmically useful. For example, finding if there is a point inside
all the constraints, when all the constraints are induced by linear inequalities, is linear programming. Namely,
arrangements are a useful way to think about any parametric space partitioned by various constraints.

51.1.1. Randomized incremental construction (RIC)



Imagine that we had computed the arrangement Bi−1 = A|(Si−1 ). In the ith iteration we compute Bi by inserting
si into the arrangement Bi−1 . This involves splitting some trapezoids (and merging some others).
As a concrete example, consider Figure 51.1. Here we insert s in the arrangement. To this end we split the
“vertical trapezoids” △pux and △yux, each into three trapezoids. The two trapezoids σ′ and σ′′ now need to be
merged together to form the new trapezoid which appears in the vertical decomposition of the new arrangement.
(Note that the figure does not show all the trapezoids in the vertical decomposition.)

[Figure 51.1: inserting the segment s into the vertical decomposition; the trapezoids σ′ and σ′′ get merged.]

To facilitate this, we need to compute the trapezoids of Bi−1 that intersect si . This is done by maintaining a

conflict graph. Each trapezoid σ ∈ A|(Si−1 ) maintains a conflict list cl(σ) of all the segments of S that intersect
its interior. In particular, the conflict list of σ cannot contain any segment of Si−1 , and as such it contains only
the segments of S \ Si−1 that intersect its interior. We also maintain a similar structure for each segment, listing

all the trapezoids of A|(Si−1 ) that it currently intersects (in its interior). We maintain those lists with cross
pointers, so that given an entry (σ, s) in the conflict list of σ, we can find the entry (s, σ) in the conflict list of s
in constant time.
Thus, given si , we know what trapezoids need to be split (i.e., all the trapezoids in cl(si )).
Splitting a trapezoid σ by a segment si is the operation of computing a set of (at most) four
trapezoids that cover σ and have si on their boundary. We compute those new trapezoids, and
next we need to compute the conflict lists of the new trapezoids. This can be easily done by
taking the conflict list of a trapezoid σ ∈ cl(si ) and distributing its segments among the O(1) new trapezoids that cover σ. Using a careful implementation, this requires linear time in the size of the conflict list of σ.
Note that only trapezoids that intersect si in their interior get split. Also, we need to update the conflict lists
for the segments (that were not inserted yet).
We next sketch the low-level details involved in maintaining these conflict lists. For a segment s that
intersects the interior of a trapezoid σ, we maintain the pair (s, σ). For every trapezoid σ, in the current
vertical decomposition, we maintain a doubly linked list of all such pairs that contain σ. Similarly, for each
segment s we maintain the doubly linked list of all such pairs that contain s. Finally, each such pair contains
two pointers to the location in the two respective lists where the pair is being stored.
It is now straightforward to verify that using this data-structure we can implement the required operations
in linear time in the size of the relevant conflict lists.
In the above description, we ignored the need to merge adjacent trapezoids if they have identical floor and
ceiling – this can be done by a somewhat straightforward and tedious implementation of the vertical decom-
position data-structure, by providing pointers between adjacent vertical trapezoids and maintaining the conflict
list sorted (or by using hashing) so that merge operations can be done quickly. In any case, this can be done in
linear time in the input/output size involved, as can be verified.
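The machinery is easier to digest in one dimension. The sketch below (our illustration, not the notes' data structure) runs the randomized incremental construction on points of the real line: intervals play the role of trapezoids, and each interval carries a conflict list of the not-yet-inserted points in its interior. A real implementation would use the cross-pointer lists described above instead of the linear search used here.

```python
# 1D randomized incremental construction with conflict lists: intervals
# stand in for vertical trapezoids.
import random

def ric_intervals(points, rng=None):
    pts = list(points)
    (rng or random).shuffle(pts)
    # interval (lo, hi) -> conflict list of uninserted points inside it
    intervals = {(float("-inf"), float("inf")): list(pts)}
    work = 0  # total size of all conflict lists ever created
    for p in pts:
        # locate the interval whose conflict list contains p
        key = next(k for k, cl in intervals.items() if p in cl)
        lo, hi = key
        cl = intervals.pop(key)
        left = [q for q in cl if q < p]    # split the conflict list
        right = [q for q in cl if q > p]   # among the two new intervals
        intervals[(lo, p)] = left
        intervals[(p, hi)] = right
        work += len(left) + len(right)
    return intervals, work
```

Inserting the ith point splits one interval into two and redistributes its conflict list in time linear in the list's size, mirroring the trapezoid-splitting step.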

51.1.1.1. Analysis
Claim 51.1.1. The (amortized) running time of constructing Bi from Bi−1 is proportional to the size of the
conflict lists of the vertical trapezoids in Bi \ Bi−1 (and the number of such new trapezoids).

Proof: Observe that we can charge all the work involved in the ith iteration to either the conflict lists of the
newly created trapezoids or the deleted conflict lists. Clearly, the running time of the algorithm in the ith
iteration is linear in the total size of these conflict lists. Observe that every conflict gets charged twice – when

it is being created and when it is being deleted. As such, the (amortized) running time in the ith iteration is
proportional to the total length of the newly created conflict lists. ■

Thus, to bound the running time of the algorithm, it is enough to bound the expected size of the destroyed
conflict lists in ith iteration (and sum this bound on the n iterations carried out by the algorithm). Or alterna-
tively, bound the expected size of the conflict lists created in the ith iteration.

Lemma 51.1.2. Let S be a set of n segments (in general position¬ ) with k intersection points. Let Si be the first i segments in a random permutation of S. The expected size of Bi = A|(Si ), denoted by τ(i) (i.e., the number of trapezoids in Bi ), is O( i + k(i/n)^2 ).


Proof: Consider­ an intersection point p = s ∩ s′ , where s, s′ ∈ S. The probability that p is present in A|(Si ) is equivalent to the probability that both s and s′ are in Si . This probability is

    α = binom(n−2, i−2) / binom(n, i) = ( (n − 2)!/((i − 2)! (n − i)!) ) · ( i! (n − i)!/n! ) = i(i − 1)/(n(n − 1)).

For each intersection point p in A(S) define an indicator variable Xp , which is 1 if the two segments defining p are in the random sample Si and 0 otherwise. We have that E[Xp ] = α, and as such, by linearity of expectation, the expected number of intersection points in the arrangement A(Si ) is

    E[ ∑_{p∈V} Xp ] = ∑_{p∈V} E[Xp ] = ∑_{p∈V} α = kα,

where V is the set of k intersection points of A(S). Also, every segment of Si contributes its two endpoints to the arrangement A(Si ). Thus, we have that the expected number of vertices in A(Si ) is

    2i + (i(i − 1)/(n(n − 1))) k.

Now, the number of trapezoids in A|(Si ) is proportional to the number of vertices of A(Si ), which implies the claim. ■
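The probability α can also be double-checked mechanically; the snippet below (ours) verifies the closed form with exact rationals.

```python
# Probability that two fixed objects both land in a uniform random
# i-subset of n objects: C(n-2, i-2) / C(n, i) = i(i-1) / (n(n-1)).
from fractions import Fraction
from math import comb

def both_in_sample(n, i):
    return Fraction(comb(n - 2, i - 2), comb(n, i))

assert all(both_in_sample(n, i) == Fraction(i * (i - 1), n * (n - 1))
           for n in range(3, 12) for i in range(2, n + 1))
```

Exact rationals sidestep the floating-point noise that a ratio of large factorials would otherwise introduce.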

51.1.2. Backward analysis


In the following, we would like to consider the total amount of work involved in the ith iteration of the algo-
rithm. The way to analyze these iterations is (conceptually) to run the algorithm for the first i iterations and
then run “backward” the last iteration.
So, imagine that the overall size of the conflict lists of the trapezoids of Bi is Wi and the total size of the
conflict lists created only in the ith iteration is Ci .
¬
In this case, no two intersection points of input segments are the same, no two intersection points (or vertices) have the same
x-coordinate, no two segments lie on the same line, etc. Making the geometric algorithm work correctly for all degenerate inputs
is a huge task that can usually be handled by tedious and careful implementation. Thus, we will always assume general position
of the input. In other words, in theory all geometric inputs are inherently good, while in practice they are all evil (as anybody who
tried to implement geometric algorithms can testify). The reader is encouraged not to use this to draw any conclusions on the human
condition.
­
The proof is provided in excruciating detail to get the reader used to this kind of argumentation. I would apologize for this pain,
but it is a minor trifle, not to be mentioned, when compared to the other offenses in this book.

We are interested in bounding the expected size of Ci , since this is (essentially) the amount of work done by
the algorithm in this iteration. Observe that the structure of Bi is defined independently of the permutation Si
and depends only on the (unordered) set Si = {s1 , . . . , si }. So, fix Si . What is the probability that si is a specific
segment s of Si ? Clearly, this is 1/i since this is the probability of s being the last element in a permutation of
the i elements of Si (i.e., we consider a random permutation of Si ).
Now, consider a trapezoid σ ∈ Bi . If σ was created in the ith iteration, then si must be one of the (at most
four) segments that define it. Indeed, if si is not one of the segments that define σ, then σ existed in the vertical
decomposition before si was inserted. Since Bi is independent of the internal ordering of Si , it follows that
P[σ ∈ (Bi \ Bi−1 )] ≤ 4/i. In particular, the overall size of the conflict lists in the end of the ith iteration is

    Wi = ∑_{σ∈Bi} |cl(σ)|.

As such, the expected overall size of the conflict lists created in the ith iteration is
    E[ Ci | Bi ] ≤ ∑_{σ∈Bi} (4/i) |cl(σ)| = (4/i) Wi .

 
By Lemma 51.1.2, the expected size of Bi is O( i + ki^2 /n^2 ). Let us guess (for the time being) that on average the size of the conflict list of a trapezoid of Bi is about O(n/i). In particular, assume that we know that

    E[Wi ] = O( ( i + k i^2 /n^2 ) · (n/i) ) = O( n + k i/n ),

by Lemma 51.1.2, implying

    E[Ci ] = E[ E[ Ci | Bi ] ] ≤ E[ (4/i) Wi ] = (4/i) E[Wi ] = O( (1/i)( n + k i/n ) ) = O( n/i + k/n ),    (51.1)
using Lemma 11.1.2. In particular, the expected (amortized) amount of work in the ith iteration is proportional
 
to E[Ci ]. Thus, the overall expected running time of the algorithm is
    E[ ∑_{i=1}^{n} Ci ] = ∑_{i=1}^{n} O( n/i + k/n ) = O( n log n + k ).

Theorem 51.1.3. Given a set S of n segments in the plane with k intersections, one can compute the vertical decomposition of A(S) in expected O(n log n + k) time.

Intuition and discussion. What remains to be seen is how we came up with the guess that the average size of
a conflict list of a trapezoid of Bi is about O(n/i). Note that using ε-nets implies that the bound O((n/i) log i)
holds with constant probability (see Theorem 38.3.4) for all trapezoids in this arrangement. As such, this result
is only slightly surprising. To prove this, we present in the next section a “strengthening” of ε-nets to geometric
settings.
To get some intuition on how we came up with this guess, consider a set P of n points on the line and a
random sample R of i points from P. Let Î be the partition of the real line into (maximal) open intervals by the points of R, such that these intervals do not contain points of R in their interior.
Consider an interval (i.e., a one-dimensional trapezoid) of Î. It is intuitively clear that this interval (in
expectation) would contain O(n/i) points. Indeed, fix a point x on the real line, and imagine that we pick each
point with probability i/n to be in the random sample. The random variable which is the number of points of

Notation     What it means                                       Example: vertical decomposition of segments
S            Set of n objects.                                   Segments.
R ⊆ S        Subset of objects.
σ            Notation for a region induced by some objects       A vertical trapezoid.
             of S.
D(σ) ⊆ S     Defining set of σ: minimal set of objects           Subset of segments defining σ. See Figure 51.3.
             inducing σ.
K(σ) ⊆ S     Stopping set of σ: all objects in S that prevent    All segments in S that intersect the interior of
             σ from being created.                               the vertical trapezoid σ. See Figure 51.3.
d            Combinatorial dimension: max size of a              d = 4: every vertical trapezoid is defined by at
             defining set.                                       most four segments.
ω(σ)         ω(σ) = |K(σ)|: weight of σ.
F (R)        Decomposition: set of regions defined by R.         Set of vertical trapezoids defined by R.
T = T (S)    Set of all possible regions defined by subsets      Set of all vertical trapezoids that can be induced
             of S.                                               by the segments of S.
ρr,n (d, k)  Probability of a region σ ∈ T to appear in the decomposition of a random sample R ⊆ S of
             size r, where σ is defined by d objects, and its stopping set is of size k.
t-heavy      A region σ ∈ F (R) is t-heavy if ω(σ) ≥ tn/r, where r = |R|.
F≥t (R)      Set of all t-heavy regions of F (R).
Ef (r)       Ef (r) = E[|F (R)|]: expected complexity of the decomposition for a sample of size r.
Ef≥t (r)     Ef≥t (r) = E[|F≥t (R)|]: expected number of regions that are t-heavy in the decomposition
             of a random sample of size r.

Figure 51.2: Notation used in the analysis.

P we have to scan starting from x and going to the right of x till we “hit” a point that is in the random sample
behaves like a geometric variable with probability i/n, and as such its expected value is n/i. The same argument
works if we scan P to the left of x. We conclude that the number of points of P in the interval of Î that contains x but does not contain any point of R is O(n/i) in expectation.
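This intuition can be checked exactly for tiny inputs by enumerating all samples (our code, not the notes'). With the query point x = 0.5 sitting left of all of P = {1, . . . , n}, the average count works out to (n − i)/(i + 1), comfortably within the O(n/i) bound.

```python
# Exact average, over all i-subsets R of P = {1..n}, of the number of
# points of P \ R in the interval of the induced partition containing x.
from itertools import combinations
from math import comb

def avg_points_in_gap(n, i, x=0.5):
    P = list(range(1, n + 1))
    total = 0
    for R in combinations(P, i):
        lo = max((q for q in R if q < x), default=float("-inf"))
        hi = min((q for q in R if q > x), default=float("inf"))
        total += sum(1 for q in P if lo < q < hi and q not in R)
    return total / comb(n, i)
```

The (n − i)/(i + 1) value follows from the standard fact that the minimum of a uniform i-subset of {1, . . . , n} has expectation (n + 1)/(i + 1); for a query point in the middle, the two-sided scan of the text gives the same O(n/i) behavior.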
Of course, the vertical decomposition case is more involved, as each vertical trapezoid is defined by four
input segments. Furthermore, the number of possible vertical trapezoids is larger. Instead of proving the
required result for this special case, we will prove a more general result which can be applied in a lot of other
settings.

51.2. General settings

51.2.1. Notation
Let S be a set of objects. For a subset R ⊆ S, we define a collection of ‘regions’ denoted by F (R). For the
case of vertical decomposition of segments (i.e., Theorem 51.1.3), the objects are segments, the regions are


trapezoids, and F (R) is the set of vertical trapezoids in A|(R). Let
    T = T (S) = ∪_{R⊆S} F (R)

denote the set of all possible regions defined by subsets of S. In the vertical trapezoids case, the set T is the set of all vertical trapezoids that can be defined by any subset of the given input segments.
We associate two subsets D(σ), K(σ) ⊆ S with each region σ ∈ T . The defining set D(σ) of σ is the subset of S defining the region σ (the precise requirements from this set are specified in the axioms below). We assume that for every σ ∈ T , |D(σ)| ≤ d for a (small) constant d. The constant d is sometimes referred to as the combinatorial dimension. In the case of Theorem 51.1.3, each trapezoid σ is defined by at most four segments (or lines) of S that define the region covered by the trapezoid σ, and this set of segments is D(σ). See Figure 51.3.
[Figure 51.3: a vertical trapezoid σ with D(σ) = {b, c, d, e} and K(σ) = { f }.]
The stopping set K(σ) of σ is the set of objects of S such that including any object of K(σ) in R prevents σ
from appearing in F (R). In many applications K(σ) is just the set of objects intersecting the cell σ; this is also
the case in Theorem 51.1.3, where K(σ) is the set of segments of S intersecting the interior of the trapezoid
σ (see Figure 51.3). Thus, the stopping set of a region σ, in many cases, is just the conflict list of this region,
when it is being created by an RIC algorithm. The weight of σ is ω(σ) = |K(σ)|.

Definition 51.2.1 (Framework axioms). Let S, F (R), D(σ), and K(σ) be such that for any subset R ⊆ S, the
set F (R) satisfies the following axioms:
(i) For any σ ∈ F (R), we have D(σ) ⊆ R and R ∩ K(σ) = ∅.
(ii) If D(σ) ⊆ R and K(σ) ∩ R = ∅, then σ ∈ F (R).

51.2.1.1. Examples of the general framework


(A) Vertical decomposition. Discussed above.
(B) Points on a line. Let S be a set of n points on the real line. For a set R ⊆ S, let F (R) be the set of atomic
intervals of the real lines formed by R; that is, the partition of the real line into maximal connected sets
(i.e., intervals and rays) that do not contain a point of R in their interior.
Clearly, in this case, for an interval I ∈ F(R), the defining set of I (i.e., D(I)) is the set containing the (one or two) endpoints of I in R. The stopping set of an interval I is K(I), the set of all points of S contained in the interior of I.
(C) Vertices of the convex-hull in 2d. Consider a set S of n points in the plane. A vertex on the convex hull
is defined by the point defining the vertex, and the two edges before and after it on the convex hull. To
this end, a certified vertex of the convex hull (say this vertex is u) is a triplet (p, u, v), such that p, u and
v are consecutive vertices of CH(S) (say, in clockwise order). Observe, that computing the convex-hull
of S is equivalent to computing the set of certified vertices of S.
For a set R ⊆ S, let F(R) denote the set of certified vertices of R (i.e., this is equivalent to the set of vertices of the convex-hull of R). For a certified vertex σ ∈ F(R), its defining set is the set of three vertices p, u, v that (surprise, surprise) define it. Its stopping set is the set of all points in S that are either on the “wrong” side of the line spanning pu, or on the “wrong” side of the line spanning uv. Equivalently, K(σ) is the set of all points x ∈ S \ R, such that the convex-hull of p, u, v, and x does not form a convex quadrilateral.
(D) Edges of the convex-hull in 3d. Let S be a set of points in three dimensions. An edge e of the convex-
hull of a set R ⊆ S of points in R3 is defined by two vertices of S, and it can be certified as being on the
convex hull CH(R), by the two faces f, f′ adjacent to e. If all the points of R are on the “right” side of
both these two faces then e is an edge of the convex hull of R. Computing all the certified edges of S is
equivalent to computing the convex-hull of S.
In the following, assume that each face of any convex-hull of a subset of points of S is a triangle. As such, a face of the convex-hull is defined by three points. Formally, the butterfly of an edge e of CH(R) is (e, p, u), where p, u ∈ R, such that all the points of R are on the same side as u of the plane spanned by e and p (with the symmetric condition requiring that all the points of R are on the same side as p of the plane spanned by e and u).
For a set R ⊆ S, let F(R) be its set of butterflies. Clearly, computing all the butterflies of S (i.e., F(S)) is equivalent to computing the convex-hull of S.
For a butterfly σ = (e, p, u) ∈ F(R), its defining set (i.e., D(σ)) is a set of four points (i.e., the two points defining its edge e, and the two additional vertices defining the two faces f and f′ adjacent to it). Its stopping set K(σ) is the set of all the points of S \ R that lie on a different side of the plane spanned by e and p (resp. e and u) than u (resp. p) [here, the stopping set is the union of these two sets].
(E) Delaunay triangles in 2d. Let S be a set of n points in the plane, and consider a subset R ⊆ S. A Delaunay
circle of R is a disc D that has three points p1 , p2 , p3 of R on its boundary, and no points of R in its
interior. Naturally, these three points define a Delaunay triangle △ = △p1 p2 p3 . The defining set is
D(△) = {p1 , p2 , p3 }, and the stopping set K(△) is the set of all points in S that are contained in the interior
of the disk D.
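The points-on-a-line example (B) is simple enough to code up directly. The following Python sketch (ours, not from the text; the function names are hypothetical) computes F(R), D(·), and K(·) straight from the definitions, and checks axioms (i) and (ii) exhaustively on a tiny instance:

```python
from itertools import combinations

def regions(R):
    # F(R): atomic intervals induced by the sample R, with +/- infinity
    # sentinels for the two unbounded rays.
    pts = sorted(R)
    bounds = [float("-inf")] + pts + [float("inf")]
    return list(zip(bounds, bounds[1:]))

def defining_set(I, S):
    # D(I): the (one or two) finite endpoints of the interval I.
    return {p for p in I if p in S}

def stopping_set(I, S):
    # K(I): the points of S lying in the open interior of I.
    a, b = I
    return {p for p in S if a < p < b}

S = {1, 3, 5, 7, 9}
for R in map(set, combinations(S, 2)):
    for I in regions(R):
        D, K = defining_set(I, S), stopping_set(I, S)
        assert D <= R and not (K & R)   # axiom (i)
        assert len(D) <= 2              # combinatorial dimension d = 2

# Axiom (ii): D(I) subset of R and K(I) disjoint from R imply I in F(R).
assert (3, 7) in regions({1, 3, 7})     # D = {3, 7} in R, K = {5} misses R
print("axioms verified on all 2-element samples of", sorted(S))
```

Note that the conflict list K(I) depends only on S and the region itself, while membership of I in F(R) is decided by axioms (i) and (ii) — exactly the modularity the framework exploits.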

51.2.2. Analysis
In the following, S is a set of n objects complying with (i) and (ii) of Definition 51.2.1.

The challenge. What makes the analysis difficult is that there are dependencies between the defining set of a region and its stopping set (i.e., conflict list). In particular, we have the following difficulties:
(A) The defining set might be of different sizes depending on the region σ being considered.
(B) Even if all the regions have a defining set of the same size d (say, 4 as in the case of vertical trapezoids), it is not true that every d objects define a valid region. For example, for the case of segments, the four segments might be vertically separated from each other (i.e., think about them as being four disjoint intervals on the real line), in which case they do not define a vertical trapezoid together. Thus, our analysis is going to be a bit loopy loop – we are going to assume we know how many regions exist (in expectation) for a random sample of a certain size, and use this to derive the desired bounds.

51.2.2.1. On the probability of a region to be created


Inherently, to analyze a randomized algorithm using this framework, we will be interested in the probability
that a certain region would be created. Thus, let

ρr,n (d, k)

denote the probability that a region σ ∈ T appears in F (R), where its defining set is of size d, its stopping set
is of size k, R is a random sample of size r from a set S, and n = |S|. Specifically, σ is a feasible region that
might be created by an algorithm computing F (R).
The sampling model. For describing algorithms it is usually easier to work with samples created by picking
a subset of a certain size (without repetition) from the original set of objects. Usually, in the algorithmic
applications this would be done by randomly permuting the objects and interpreting a prefix of this permutation
as a random sample. Insisting on analyzing this framework in the “right” sampling model creates some non-
trivial technical pain.

Lemma 51.2.2. We have that ρ_{r,n}(d, k) ≈ (1 − r/n)^k (r/n)^d. Formally,
\[
\frac{1}{2^{2d}} \left(1 - 4\,\frac{r}{n}\right)^{k} \left(\frac{r}{n}\right)^{d}
\;\le\; \rho_{r,n}(d,k) \;\le\;
2^{2d} \left(1 - \frac{1}{2}\cdot\frac{r}{n}\right)^{k} \left(\frac{r}{n}\right)^{d}.
\tag{51.2}
\]

Proof: Let σ be the region under consideration that is defined by d objects and has k stoppers (i.e., k = |K(σ)|). We are interested in the probability of σ being created when taking a sample of size r (without repetition) from a set S of n objects. Clearly, this probability is
\[
\rho_{r,n}(d,k) = \binom{n-d-k}{r-d} \Big/ \binom{n}{r},
\]
as we have to pick the d defining objects into the random sample and avoid picking any of the k stoppers. A tedious but careful calculation, delegated to Section 51.4, implies Eq. (51.2).
Instead, here is an elegant argument for why this estimate is correct in a slightly different sampling model.
We pick every element of S into the sample R with probability r/n, and this is done independently for each
object. In expectation, the random sample is of size r, and clearly the probability that σ is created is the
probability that we pick its d defining objects (that is, (r/n)d ) multiplied by the probability that we did not pick
any of its k stoppers (that is, (1 − r/n)k ). ■

Remark 51.2.3. The bounds of Eq. (51.2) hold only when r, d, and k are in certain (reasonable) ranges. For
the sake of simplicity of exposition we ignore this minor issue. With care, all our arguments work when one
pays careful attention to this minor technicality.
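The exact formula from the proof is easy to compare numerically against the independent-sampling estimate. In the following sketch (ours; the parameter values are chosen arbitrarily), the two agree to within a small constant factor, well inside the slack of Eq. (51.2):

```python
from math import comb

def rho_exact(n, r, d, k):
    # Exact probability (sampling without repetition) that a region with
    # d defining objects and k stoppers appears: C(n-d-k, r-d) / C(n, r).
    return comb(n - d - k, r - d) / comb(n, r)

def rho_approx(n, r, d, k):
    # Heuristic estimate from the independent-sampling model.
    return (1 - r / n) ** k * (r / n) ** d

n, d = 10_000, 4
for r in (100, 500):
    for k in (0, n // r, 5 * n // r):
        ratio = rho_exact(n, r, d, k) / rho_approx(n, r, d, k)
        print(f"r={r:4} k={k:4}  exact/approx = {ratio:.3f}")
```

For r much smaller than n the printed ratios stay close to 1, which is why the simpler Bernoulli-sampling argument in the proof gives the right intuition.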

51.2.2.2. On exponential decay


For any natural number r and a number t > 0, consider R to be a random sample of size r from S without repetition. A region σ ∈ F(R) is t-heavy if ω(σ) ≥ t · (n/r). Let F_{≥t}(R) denote all the t-heavy regions of F(R).®
Intuitively, and somewhat incorrectly, we expect the average weight of a region of F (R) to be roughly n/r.
We thus expect the size of this set to drop fast as t increases. Indeed, Lemma 51.2.2 tells us that a trapezoid of weight t(n/r) has probability
\[
\rho_{r,n}\Bigl(d,\, t\,\frac{n}{r}\Bigr)
\approx \Bigl(1-\frac{r}{n}\Bigr)^{t(n/r)} \Bigl(\frac{r}{n}\Bigr)^{d}
\approx \exp(-t) \cdot \Bigl(\frac{r}{n}\Bigr)^{d}
\approx \exp(-t+1) \cdot \Bigl(1-\frac{r}{n}\Bigr)^{n/r} \Bigl(\frac{r}{n}\Bigr)^{d}
\approx \exp(-t+1) \cdot \rho_{r,n}(d, n/r)
\]
to be created, since (1 − r/n)^{n/r} ≈ 1/e. Namely, a t-heavy region has exponentially lower probability to be created than a region of weight n/r. We next formalize this argument.

Lemma 51.2.4. Let r ≤ n and t be parameters, such that 1 ≤ t ≤ r/d. Furthermore, let R be a sample of size r, and let R′ be a sample of size r′ = ⌊r/t⌋, both from S. Let σ ∈ T be a region with weight ω(σ) ≥ t(n/r). Then,
\[
\Pr\bigl[\sigma \in F(R)\bigr] = O\Bigl(\exp(-t/2)\, t^{d}\, \Pr\bigl[\sigma \in F(R')\bigr]\Bigr).
\]
® These are the regions that are at least t times overweight. Speak about an obesity problem.
Proof: For the sake of simplicity of exposition, assume that k = ω(σ) = t(n/r). By Lemma 51.2.2 (i.e., Eq. (51.2)) we have
\[
\frac{\Pr[\sigma\in F(R)]}{\Pr[\sigma\in F(R')]}
= \frac{\rho_{r,n}(d,k)}{\rho_{r',n}(d,k)}
\le \frac{2^{2d}\bigl(1-\frac{1}{2}\cdot\frac{r}{n}\bigr)^{k}\bigl(\frac{r}{n}\bigr)^{d}}
       {\frac{1}{2^{2d}}\bigl(1-4\,\frac{r'}{n}\bigr)^{k}\bigl(\frac{r'}{n}\bigr)^{d}}
\le 2^{4d} \exp\Bigl(-\frac{kr}{2n}\Bigr)\Bigl(1+8\,\frac{r'}{n}\Bigr)^{k}\Bigl(\frac{r}{r'}\Bigr)^{d}
\]
\[
\le 2^{4d} \exp\Bigl(8\,\frac{kr'}{n}-\frac{kr}{2n}\Bigr)\Bigl(\frac{r}{r'}\Bigr)^{d}
= 2^{4d} \exp\Bigl(8\,\frac{t(n/r)\lfloor r/t\rfloor}{n}-\frac{t}{2}\Bigr)\Bigl(\frac{r}{\lfloor r/t\rfloor}\Bigr)^{d}
= O\bigl(\exp(-t/2)\,t^{d}\bigr),
\]
since 1/(1 − x) ≤ 1 + 2x for x ≤ 1/2 and 1 + y ≤ exp(y), for all y. (The constant in the above O(·) depends exponentially on d.) ■

Let
Ef (r) = E[|F (R)|] and Ef≥t (r) = E[|F≥t (R)|] ,
where the expectation is over random subsets R ⊆ S of size r. Note that Ef (r) = Ef≥0 (r) is the expected
number of regions created by a random sample of size r. In words, Ef≥t (r) is the expected number of regions
in a structure created by a sample of r random objects, such that these regions have weight which is t times
larger than the “expected” weight (i.e., n/r). In the following, we assume that Ef (r) is a monotone increasing
function.
Lemma 51.2.5 (The exponential decay lemma). Given a set S of n objects and parameters r ≤ n and 1 ≤ t ≤ r/d, where d = max_{σ∈T(S)} |D(σ)|, if axioms (i) and (ii) above hold for any subset of S, then
\[
\mathrm{Ef}_{\ge t}(r) = O\bigl(t^{d} \exp(-t/2)\, \mathrm{Ef}(r)\bigr).
\tag{51.3}
\]

Proof: Let R be a random sample of size r from S and let R′ be a random sample of size r′ = ⌊r/t⌋ from S. Let
\[
H = \bigcup_{X \subseteq S,\, |X| = r} F_{\ge t}(X)
\]
denote the set of all t-heavy regions that might be created by a sample of size r. In the following, the expectation is taken over the contents of the random samples R and R′.
For a region σ, let X_σ be the indicator variable that is 1 if and only if σ ∈ F(R). By linearity of expectation and since E[X_σ] = P[σ ∈ F(R)], we have
\[
\mathrm{Ef}_{\ge t}(r) = \mathbb{E}\bigl[|F_{\ge t}(R)|\bigr]
= \mathbb{E}\Bigl[\sum_{\sigma\in H} X_\sigma\Bigr]
= \sum_{\sigma\in H} \mathbb{E}[X_\sigma]
= \sum_{\sigma\in H} \Pr[\sigma\in F(R)]
\]
\[
= O\bigl(t^{d}\exp(-t/2)\bigr) \sum_{\sigma\in H} \Pr\bigl[\sigma\in F(R')\bigr]
\le O\bigl(t^{d}\exp(-t/2)\bigr) \sum_{\sigma\in \mathcal{T}} \Pr\bigl[\sigma\in F(R')\bigr]
= O\bigl(t^{d}\exp(-t/2)\bigr)\, \mathrm{Ef}(r')
= O\bigl(t^{d}\exp(-t/2)\bigr)\, \mathrm{Ef}(r),
\]

by Lemma 51.2.4 and since Ef (r) is a monotone increasing function. ■
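The decay predicted by the lemma is easy to observe in the points-on-a-line setting, where a region's weight is the number of unsampled points inside an atomic interval. The following Monte Carlo sketch (ours, under the sampling-without-repetition model) estimates the expected number of t-heavy regions for increasing t:

```python
import random

def avg_heavy_count(n, r, t, trials=2000):
    # Points-on-a-line instance: sample r of n points without repetition;
    # an atomic interval is t-heavy if it contains >= t*(n/r) unsampled
    # points. Returns the average number of t-heavy intervals per sample.
    threshold = t * n / r
    total = 0
    for _ in range(trials):
        sample = sorted(random.sample(range(n), r))
        bounds = [-1] + sample + [n]
        weights = (b - a - 1 for a, b in zip(bounds, bounds[1:]))
        total += sum(1 for w in weights if w >= threshold)
    return total / trials

random.seed(0)
n, r = 10_000, 100
for t in (1, 2, 4, 8):
    print(f"t={t}:  E[# t-heavy regions] ~ {avg_heavy_count(n, r, t):.2f}")
```

The printed counts drop by roughly a factor of e per unit increase in t, consistent with the exp(−Θ(t)) decay of Eq. (51.3) (here Ef(r) = r + 1 and the polynomial t^d factor is negligible).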

51.2.2.3. Bounding the moments


Consider a different randomized algorithm that in a first round samples r objects, R ⊆ S (say, segments), computes the arrangement induced by these r objects (i.e., A|_R), and then inside each region σ it computes the arrangement of the ω(σ) objects intersecting the interior of this region, using an algorithm that takes O(ω(σ)^c) time, where c > 0 is some fixed constant. The overall expected running time of this algorithm is
\[
\mathbb{E}\Bigl[\sum_{\sigma\in F(R)} \omega(\sigma)^{c}\Bigr].
\]
We are now able to bound this quantity.
Theorem 51.2.6 (Bounded moments theorem). Let R ⊆ S be a random subset of size r, let Ef(r) = E[|F(R)|], and let c ≥ 1 be an arbitrary constant. Then,
\[
\mathbb{E}\Bigl[\sum_{\sigma\in F(R)} \omega(\sigma)^{c}\Bigr] = O\Bigl(\mathrm{Ef}(r) \Bigl(\frac{n}{r}\Bigr)^{c}\Bigr).
\]

Proof: Let R ⊆ S be a random sample of size r. Observe that all the regions with weight in the range [(t−1)(n/r), t·(n/r)) are in the set F_{≥t−1}(R) \ F_{≥t}(R). As such, we have by Lemma 51.2.5 that
\[
W = \mathbb{E}\Bigl[\sum_{\sigma\in F(R)} \omega(\sigma)^{c}\Bigr]
\le \mathbb{E}\Bigl[\sum_{t\ge 1} \Bigl(t\,\frac{n}{r}\Bigr)^{c} \bigl(|F_{\ge t-1}(R)| - |F_{\ge t}(R)|\bigr)\Bigr]
\le \mathbb{E}\Bigl[\sum_{t\ge 1} \Bigl(t\,\frac{n}{r}\Bigr)^{c} |F_{\ge t-1}(R)|\Bigr]
\]
\[
\le \Bigl(\frac{n}{r}\Bigr)^{c} \sum_{t\ge 0} (t+1)^{c}\, \mathbb{E}\bigl[|F_{\ge t}(R)|\bigr]
= \Bigl(\frac{n}{r}\Bigr)^{c} \sum_{t\ge 0} (t+1)^{c}\, \mathrm{Ef}_{\ge t}(r)
= O\Bigl(\Bigl(\frac{n}{r}\Bigr)^{c} \sum_{t\ge 0} (t+1)^{c+d} \exp(-t/2)\, \mathrm{Ef}(r)\Bigr)
\]
\[
= O\Bigl(\mathrm{Ef}(r) \Bigl(\frac{n}{r}\Bigr)^{c} \sum_{t\ge 0} (t+1)^{c+d} \exp(-t/2)\Bigr)
= O\Bigl(\mathrm{Ef}(r) \Bigl(\frac{n}{r}\Bigr)^{c}\Bigr),
\]
since c and d are both constants. ■
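In the points-on-a-line setting (where Ef(r) = r + 1), the theorem predicts that the c-th moment of the weights, normalized by Ef(r)(n/r)^c, stays bounded by a constant as r varies. A quick simulation (ours, not from the text) is consistent with this:

```python
import random

def avg_moment(n, r, c, trials=200):
    # Average of sum over atomic intervals of (interval weight)^c,
    # for a random r-sample of {0, ..., n-1} without repetition.
    total = 0.0
    for _ in range(trials):
        sample = sorted(random.sample(range(n), r))
        bounds = [-1] + sample + [n]
        total += sum((b - a - 1) ** c for a, b in zip(bounds, bounds[1:]))
    return total / trials

random.seed(1)
n, c = 20_000, 2
for r in (50, 100, 200):
    # Normalize by Ef(r)*(n/r)^c, with Ef(r) = r + 1 atomic intervals.
    ratio = avg_moment(n, r, c) / ((r + 1) * (n / r) ** c)
    print(f"r={r:4}  normalized {c}-nd moment ~ {ratio:.2f}")
```

The normalized values hover around the same constant for every r, even though the raw moment itself varies by orders of magnitude.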

51.3. Applications
51.3.1. Analyzing the RIC algorithm for vertical decomposition
We remind the reader that the input of the algorithm of Section 51.1.2 is a set S of n segments with k intersections, and it uses randomized incremental construction to compute the vertical decomposition of the arrangement A(S).
Lemma 51.1.2 shows that the number of vertical trapezoids in the randomized incremental construction is in expectation Ef(i) = O(i + k(i/n)²). Thus, by Theorem 51.2.6 (used with c = 1), the total expected size of the conflict lists of the vertical decomposition computed in the ith step is
\[
\mathbb{E}[W_i] = \mathbb{E}\Bigl[\sum_{\sigma\in B_i} \omega(\sigma)\Bigr]
= O\Bigl(\frac{n}{i}\, \mathrm{Ef}(i)\Bigr)
= O\Bigl(n + k\,\frac{i}{n}\Bigr).
\]

This is the missing piece in the analysis of Section 51.1.2. Indeed, the amortized work in the ith step of the algorithm is O(W_i/i) (see Eq. (51.1)), and as such, the expected running time of this algorithm is
\[
\mathbb{E}\Bigl[O\Bigl(\sum_{i=1}^{n} \frac{W_i}{i}\Bigr)\Bigr]
= O\Bigl(\sum_{i=1}^{n} \frac{1}{i}\Bigl(n + k\,\frac{i}{n}\Bigr)\Bigr)
= O(n \log n + k).
\]

This implies Theorem 51.1.3.

51.3.2. Cuttings
Let S be a set of n lines in the plane, and let r be an arbitrary parameter. A (1/r)-cutting of S is a partition of
the plane into constant complexity regions such that each region intersects at most n/r lines of S. It is natural
to try to minimize the number of regions in the cutting, as cuttings are a natural tool for performing “divide and
conquer”.
A neat proof of the existence of suboptimal cuttings follows readily from the exponential decay lemma.
Lemma 51.3.1. Let S be a set of n segments in the plane, and let R be a random sample from S of size ℓ =
cr ln r, where c is a sufficiently large constant. Then, with probability ≥ 1 − 1/rO(1) , the vertical decomposition
of R is a cutting of size O(r2 log2 r).

Proof: In our case, the vertical decomposition complexity is Ef(ℓ) = O(ℓ²) – as ℓ segments have at most \binom{ℓ}{2} intersections. For t = c ln r, a vertical trapezoid σ in A|(R) is bad if ω(σ) > n/r = t(n/ℓ). But such a trapezoid is t-heavy. Let X be the random variable that is the number of bad trapezoids in A|(R). The exponential decay lemma (Lemma 51.2.5) states that
\[
\mathbb{E}[X] = \mathrm{Ef}_{\ge t}(\ell)
= O\bigl(t^{d}\exp(-t/2)\, \mathrm{Ef}(\ell)\bigr)
= O\bigl((c\ln r)^{d} \exp(-c\ln r/2)\, \ell^{2}\bigr)
= O\bigl((c\ln r)^{d}\, r^{-c/2} (cr\log r)^{2}\bigr)
< \frac{1}{r^{c/4}},
\]
if c is sufficiently large. As such, we have P[X ≥ 1] ≤ 1/r^{c/4} by Markov’s inequality. ■

We provide an alternative proof of the above using the ε-net theorem.

Lemma 51.3.2. There exists a (1/r)-cutting of a set of lines S in the plane of size O((r log r)²).

Proof: Consider the range space having S as its ground set and vertical trapezoids as its ranges (i.e., given a
vertical trapezoid σ, its corresponding range is the set of all lines of S that intersect the interior of σ). This
range space has a VC dimension which is a constant as can be easily verified. Let X ⊆ S be an ε-net for this
range space, for ε = 1/r. By Theorem 38.3.4 (ε-net theorem), there exists such an ε-net X of this range space,
of size O((1/ε) log(1/ε)) = O(r log r). In fact, Theorem 38.3.4 states that an appropriate random sample is an
ε-net with non-zero probability, which implies, by the probabilistic method, that such a net (of this size) exists.

Consider the vertical decomposition A|_X, where X is as above. We claim that this collection of trapezoids is the desired cutting.
The bound on the size is immediate, as the complexity of A|_X is O(|X|²) and |X| = O(r log r).
As for correctness, consider a vertical trapezoid σ in the arrangement A|_X. It does not intersect any of the lines of X in its interior, since it is a trapezoid in the vertical decomposition A|_X. Now, if σ intersected more than n/r lines of S in its interior, where n = |S|, then it must be that the interior of σ intersects one of the lines of X, since X is an ε-net for S, a contradiction.
It follows that σ intersects at most εn = n/r lines of S in its interior. ■
 
Claim 51.3.3. Any (1/r)-cutting in the plane of n lines contains at least Ω(r²) regions.

Proof: An arrangement of n lines (in general position) has M = \binom{n}{2} intersections. However, the number of intersections of the lines intersecting a single region in the cutting is at most m = \binom{n/r}{2}. This implies that any cutting must be of size at least M/m = Ω(n²/(n/r)²) = Ω(r²). ■

We can get cuttings of size matching the above lower bound using the moments technique.
Theorem 51.3.4. Let S be a set of n lines in the plane, and let r be a parameter. One can compute a (1/r)-
cutting of S of size O(r2 ).

Proof: Let R ⊆ S be a random sample of size r, and consider its vertical decomposition A|_R. If a vertical trapezoid σ ∈ A|_R intersects at most n/r lines of S, then we can add it to the output cutting. The other possibility is that σ intersects t(n/r) lines of S, for some t > 1; let cl(σ) ⊂ S be the conflict list of σ (i.e., the list of lines of S that intersect the interior of σ). Clearly, a (1/t)-cutting for the set cl(σ) forms a vertical
decomposition (clipped inside σ) such that each trapezoid in this cutting intersects at most n/r lines of S. Thus, we compute such a cutting inside each such “heavy” trapezoid using the algorithm (implicit in the proof) of Lemma 51.3.2, and add these subtrapezoids to the resulting cutting. Clearly, the size of the resulting cutting inside σ is O(t² log² t) = O(t⁴). The resulting two-level partition is clearly the required cutting. By Theorem 51.2.6, the expected size of the cutting is

\[
O\Bigl(\mathrm{Ef}(r) + \mathbb{E}\Bigl[\sum_{\sigma\in F(R)} \Bigl(\frac{\omega(\sigma)}{n/r}\Bigr)^{4}\Bigr]\Bigr)
= O\Bigl(\mathrm{Ef}(r) + \Bigl(\frac{r}{n}\Bigr)^{4}\, \mathbb{E}\Bigl[\sum_{\sigma\in F(R)} \omega(\sigma)^{4}\Bigr]\Bigr)
\]
\[
= O\Bigl(\mathrm{Ef}(r) + \Bigl(\frac{r}{n}\Bigr)^{4} \Bigl(\frac{n}{r}\Bigr)^{4} \mathrm{Ef}(r)\Bigr)
= O(\mathrm{Ef}(r)) = O(r^{2}),
\]
since Ef(r) is proportional to the complexity of A(R), which is O(r²). ■
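The two-level idea is easiest to see in one dimension, where “regions” are atomic intervals and a (1/r)-cutting should consist of O(r) intervals, each containing at most n/r points. The following sketch is our 1-d analog, not the planar algorithm: sample r points, and refine each heavy interval directly so that every piece holds at most n/r points.

```python
import random

def cutting_1d(points, r):
    # First level: r random sample points split the line into atomic
    # intervals. Second level: every interval with more than n/r points
    # inside is split further into pieces with at most n/r points each.
    n = len(points)
    pts = sorted(points)
    sample = sorted(random.sample(pts, r))
    bounds = [float("-inf")] + sample + [float("inf")]
    cutting = []
    for a, b in zip(bounds, bounds[1:]):
        inside = [p for p in pts if a < p < b]
        if len(inside) <= n / r:
            cutting.append((a, b))
        else:
            step = max(1, n // r)
            cuts = [a] + inside[step - 1 :: step] + [b]
            cutting.extend(zip(cuts, cuts[1:]))
    return cutting

random.seed(2)
pts = list(range(10_000))
r = 50
C = cutting_1d(pts, r)
# Every region of the cutting contains at most n/r points in its interior.
assert all(sum(1 for p in pts if a < p < b) <= len(pts) / r for a, b in C)
print(len(C), "regions for r =", r)
```

The expected number of pieces added by the refinement is proportional to the total weight of the heavy intervals divided by n/r, which is O(r), mirroring the moments-based bound in the proof above.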

51.4. Bounds on the probability of a region to be created


Here we prove Lemma 51.2.2 in the “right” sampling model. The casual reader is encouraged to skip this
section, as it contains mostly tedious (and not very insightful) calculations.
Let S be a given set of n objects. Let ρr,n (d, k) be the probability that a region σ ∈ T whose defining set
is of size d and whose stopping set is of size k appears in F (R), where R is a random sample from S of size r
(without repetition).
Lemma 51.4.1. We have
\[
\rho_{r,n}(d,k) = \frac{\binom{n-d-k}{r-d}}{\binom{n}{r}}
= \frac{\binom{n-d-k}{r-d}}{\binom{n}{r-d}} \cdot \frac{\binom{r}{d}}{\binom{n-(r-d)}{d}}
= \frac{\binom{n-d-k}{r-d}}{\binom{n-d}{r-d}} \cdot \frac{\binom{r}{d}}{\binom{n}{d}}.
\]

Proof: So, consider a region σ with d defining objects in D(σ) and k detractors in K(σ). We have to pick the d defining objects of D(σ) to be in the random sample R of size r but avoid picking any of the k objects of K(σ) to be in R.
The second part follows since
\[
\binom{n}{r} = \binom{n}{r-d} \binom{n-(r-d)}{d} \Big/ \binom{r}{d}.
\]
Indeed, for the right-hand side first pick a sample of size r − d and then a sample of size d from the remaining objects. Merging the two random samples, we get a random sample of size r. However, since we do not care if an object is in the first sample or the second sample, we observe that every such random sample is being counted \binom{r}{d} times.
The third part is easier, as it follows from
\[
\binom{n}{r-d} \binom{n-(r-d)}{d} = \binom{n}{d} \binom{n-d}{r-d}.
\]
The two sides count the different ways to pick two subsets from a set of size n, the first one of size d and the second one of size r − d. ■
Lemma 51.4.2. For M ≥ m ≥ t ≥ 0, we have
\[
\Bigl(\frac{m-t}{M-t}\Bigr)^{t} \le \binom{m}{t} \Big/ \binom{M}{t} \le \Bigl(\frac{m}{M}\Bigr)^{t}.
\]

Proof: We have that
\[
\alpha = \binom{m}{t} \Big/ \binom{M}{t}
= \frac{m!}{(m-t)!\,t!} \cdot \frac{(M-t)!\,t!}{M!}
= \frac{m}{M} \cdot \frac{m-1}{M-1} \cdots \frac{m-t+1}{M-t+1}.
\]
Now, since M ≥ m, we have that (m − i)/(M − i) ≤ m/M, for all i ≥ 0. As such, the maximum (resp. minimum) fraction on the right-hand side is m/M (resp. (m−t+1)/(M−t+1)). As such, we have
\[
\Bigl(\frac{m-t}{M-t}\Bigr)^{t} \le \Bigl(\frac{m-t+1}{M-t+1}\Bigr)^{t} \le \alpha \le \Bigl(\frac{m}{M}\Bigr)^{t}. \qquad\blacksquare
\]
Lemma 51.4.3. Let 0 ≤ X, Y ≤ N. We have that
\[
\Bigl(1 - \frac{X}{N}\Bigr)^{Y} \le \Bigl(1 - \frac{Y}{2N}\Bigr)^{X}.
\]

Proof: Since 1 − α ≤ exp(−α) ≤ 1 − α/2, for 0 ≤ α ≤ 1, it follows that
\[
\Bigl(1-\frac{X}{N}\Bigr)^{Y} \le \exp\Bigl(-\frac{XY}{N}\Bigr)
= \Bigl(\exp\Bigl(-\frac{Y}{N}\Bigr)\Bigr)^{X}
\le \Bigl(1-\frac{Y}{2N}\Bigr)^{X}. \qquad\blacksquare
\]

Lemma 51.4.4. For 2d ≤ r ≤ n/8 and k ≤ n/2, we have that
\[
\frac{1}{2^{2d}} \Bigl(1 - 4\,\frac{r}{n}\Bigr)^{k} \Bigl(\frac{r}{n}\Bigr)^{d}
\le \rho_{r,n}(d,k) \le
2^{2d} \Bigl(1 - \frac{1}{2}\cdot\frac{r}{n}\Bigr)^{k} \Bigl(\frac{r}{n}\Bigr)^{d}.
\]
Proof: By Lemma 51.4.1, Lemma 51.4.2, and Lemma 51.4.3 we have
\[
\rho_{r,n}(d,k)
= \frac{\binom{n-d-k}{r-d}}{\binom{n-d}{r-d}} \cdot \frac{\binom{r}{d}}{\binom{n}{d}}
\le \Bigl(\frac{n-d-k}{n-d}\Bigr)^{r-d} \Bigl(\frac{r}{n}\Bigr)^{d}
\le \Bigl(1-\frac{k}{n}\Bigr)^{r-d} \Bigl(\frac{r}{n}\Bigr)^{d}
\le 2^{d} \Bigl(1-\frac{k}{n}\Bigr)^{r} \Bigl(\frac{r}{n}\Bigr)^{d}
\le 2^{d} \Bigl(1-\frac{r}{2n}\Bigr)^{k} \Bigl(\frac{r}{n}\Bigr)^{d},
\]
since k ≤ n/2. As for the other direction, by similar argumentation, we have
\[
\rho_{r,n}(d,k)
= \frac{\binom{n-d-k}{r-d}}{\binom{n}{r-d}} \cdot \frac{\binom{r}{d}}{\binom{n-(r-d)}{d}}
\ge \Bigl(\frac{n-d-k-(r-d)}{n-(r-d)}\Bigr)^{r-d} \Bigl(\frac{r-d}{n-(r-d)-d}\Bigr)^{d}
\]
\[
= \Bigl(1 - \frac{d+k}{n-(r-d)}\Bigr)^{r-d} \Bigl(\frac{r-d}{n-r}\Bigr)^{d}
\ge \Bigl(1 - \frac{d+k}{n/2}\Bigr)^{r} \Bigl(\frac{r/2}{n}\Bigr)^{d}
\ge \frac{1}{2^{d}} \Bigl(1 - 4\,\frac{r}{n}\Bigr)^{d+k} \Bigl(\frac{r}{n}\Bigr)^{d}
\ge \frac{1}{2^{2d}} \Bigl(1 - 4\,\frac{r}{n}\Bigr)^{k} \Bigl(\frac{r}{n}\Bigr)^{d},
\]
by Lemma 51.4.3 (setting N = n/4, X = r, and Y = d + k) and since r ≥ 2d and 4r/n ≤ 1/2. ■

51.5. Bibliographical notes


The technique described in this chapter is generally attributed to the work by Clarkson and Shor [CS89], which
is historically inaccurate as the technique was developed by Clarkson [Cla88]. Instead of mildly confusing the
matter by referring to it as the Clarkson technique, we decided to make sure to really confuse the reader and
refer to it as the moments technique. The Clarkson technique [Cla88] is in fact more general and implies a
connection between the number of “heavy” regions and “light” regions. The general framework can be traced
back to the earlier paper [Cla87]. This implies several beautiful results, some of which we cover later in the
book.
For the full details of the algorithm of Section 51.1, the interested reader is referred to the books [BCKO08, BY98]. Interestingly, in some cases the merging stage can be skipped; see [Har00a].
Agarwal et al. [AMS98] presented a slightly stronger variant than the original version of Clarkson [Cla88]
that allows a region to disappear even if none of the members of its stopping set are in the random sample. This
stronger setting is used in computing the vertical decomposition of a single face in an arrangement (instead of
the whole arrangement). Here an insertion of a faraway segment of the random sample might cut off a portion
of the face of interest. In particular, in the settings of Agarwal et al. Axiom (ii) is replaced by the following:
(ii) If σ ∈ F (R) and R′ is a subset of R with D(σ) ⊆ R′ , then σ ∈ F (R′ ).

Interestingly, Clarkson [Cla88] did not prove Theorem 51.2.6 using the exponential decay lemma but gave
a direct proof. In fact, his proof implicitly contains the exponential decay lemma. We chose the current
exposition since it is more modular and provides a better intuition of what is really going on and is hopefully
slightly simpler. In particular, Lemma 51.2.2 is inspired by the work of Sharir [Sha03].
The exponential decay lemma (Lemma 51.2.5) was proved by Chazelle and Friedman [CF90]. The work of
Agarwal et al. [AMS98] is a further extension of this result. Another analysis was provided by Clarkson et al.
[CMS93].
Another way to reach similar results is using the technique of Mulmuley [Mul94], which relies on a direct
analysis on ‘stoppers’ and ‘triggers’. This technique is somewhat less convenient to use but is applicable to
some settings where the moments technique does not apply directly. Also, his concept of the omega function
might explain why randomized incremental algorithms perform better in practice than their worst case analysis
[Mul89].
Backwards analysis in geometric settings was first used by Chew [Che86] and was formalized by Seidel
[Sei93]. It is similar to the “leave one out” argument used in statistics for cross validation. The basic idea was
probably known to the Greeks (or Russians or French) at some point in time.
(Naturally, our summary of the development is cursory at best and not necessarily accurate, and all possible
disclaimers apply. A good summary is provided in the introduction of [Sei93].)
Sampling model. As a rule of thumb all the different sampling approaches are similar and yield similar results.
For example, we used such an alternative sampling approach in the “proof” of Lemma 51.2.2. It is a good idea to use whichever sampling scheme is the easiest to analyze in figuring out what’s going on. Of course, a formal proof requires analyzing the algorithm in the sampling model it uses.
Lazy randomized incremental construction. If one wants to compute a single face that contains a marking
point in an arrangement of curves, then the problem in using randomized incremental construction is that
as you add curves, the region of interest shrinks, and regions that were maintained should be ignored. One
option is to perform flooding in the vertical decomposition to figure out what trapezoids are still reachable
from the marking point and maintaining only these trapezoids in the conflict graph. Doing it in each iteration
is way too expensive, but luckily one can use a lazy strategy that performs this cleanup only a logarithmic
number of times (i.e., you perform a cleanup in an iteration if the iteration number is, say, a power of 2). This
strategy complicates the analysis a bit; see [BDS95] for more details on this lazy randomized incremental
construction technique. An alternative technique was suggested by the author for the (more restricted) case of
planar arrangements; see [Har00b]. The idea is to compute only what the algorithm really needs to compute the
output, by computing the vertical decomposition in an exploratory online fashion. The details are unfortunately
overwhelming although the algorithm seems to perform quite well in practice.
Cuttings. The concept of cuttings was introduced by Clarkson. The first optimal size cuttings were constructed
by Chazelle and Friedman [CF90], who proved the exponential decay lemma to this end. Our elegant proof
follows the presentation by de Berg and Schwarzkopf [BS95]. The problem with this approach is that the
constant involved in the cutting size is awful¯. Matoušek [Mat98] showed that there are (1/r)-cuttings with 8r² + 6r + 4 trapezoids, by using level approximation. A different approach was taken by the author [Har00a], who
showed how to get cuttings which seem to be quite small (i.e., constant-wise) in practice. The basic idea is
to do randomized incremental construction but at each iteration greedily add all the trapezoids with conflict
list small enough to the cutting being output. One can prove that this algorithm also generates O(r2 ) cuttings,
¯
This is why all computations related to cuttings should be done on a waiter’s bill pad. As Douglas Adams put it: “On a waiter’s
bill pad, reality and unreality collide on such a fundamental level that each becomes the other and anything is possible, within certain
parameters.”
but the details are not trivial as the framework described in this chapter is not applicable for analyzing this
algorithm.
Cuttings also can be computed in higher dimensions for hyperplanes. In the plane, cuttings can also be
computed for well-behaved curves; see [SA95].
Another fascinating concept is shallow cuttings. These are cuttings covering only portions of the arrange-
ment that are in the “bottom” of the arrangement. Matoušek came up with the concept [Mat92]. See [AES99,
CCH09] for extensions and applications of shallow cuttings.
Even more on randomized algorithms in geometry. We have only scratched the surface of this fascinating
topic, which is one of the cornerstones of “modern” computational geometry. The interested reader should have
a look at the books by Mulmuley [Mul94], Sharir and Agarwal [SA95], Matoušek [Mat02], and Boissonnat
and Yvinec [BY98].

51.6. Exercises
Exercise 51.6.1 (Convex hulls incrementally). Let P be a set of n points in the plane.
(A) Describe a randomized incremental algorithm for computing the convex hull CH(P). Bound the expected
running time of your algorithm.
(B) Assume that for any subset of P, its convex hull has complexity t (i.e., the convex hull of the subset has t
edges). What is the expected running time of your algorithm in this case? If your algorithm is not faster
for this case (for example, think about the case where t = O(log n)), describe a variant of your algorithm
which is faster for this case.

Exercise 51.6.2 (Compressed quadtree made incremental). Given a set P of n points in Rd , describe a ran-
domized incremental algorithm for building a compressed quadtree for P that works in expected O(dn log n)
time. Prove the bound on the running time of your algorithm.

References
[AES99] P. K. Agarwal, A. Efrat, and M. Sharir. Vertical decomposition of shallow levels in 3-dimensional
arrangements and its applications. SIAM J. Comput., 29: 912–953, 1999.
[AMS98] P. K. Agarwal, J. Matoušek, and O. Schwarzkopf. Computing many faces in arrangements of
lines and segments. SIAM J. Comput., 27(2): 491–505, 1998.
[BCKO08] M. de Berg, O. Cheong, M. J. van Kreveld, and M. H. Overmars. Computational Geometry:
Algorithms and Applications. 3rd. Santa Clara, CA, USA: Springer, 2008.
[BDS95] M. de Berg, K. Dobrindt, and O. Schwarzkopf. On lazy randomized incremental construction.
Discrete Comput. Geom., 14: 261–286, 1995.
[BS95] M. de Berg and O. Schwarzkopf. Cuttings and applications. Int. J. Comput. Geom. Appl., 5: 343–
355, 1995.
[BY98] J.-D. Boissonnat and M. Yvinec. Algorithmic Geometry. Cambridge University Press, 1998.
[CCH09] C. Chekuri, K. L. Clarkson, and S. Har-Peled. On the set multi-cover problem in geometric settings. Proc. 25th Annu. Sympos. Comput. Geom. (SoCG), 341–350, 2009.
[CF90] B. Chazelle and J. Friedman. A deterministic view of random sampling and its use in geometry.
Combinatorica, 10(3): 229–249, 1990.
[Che86] L. P. Chew. Building Voronoi diagrams for convex polygons in linear expected time. Technical
Report PCS-TR90-147. Hanover, NH: Dept. Math. Comput. Sci., Dartmouth College, 1986.
[Cla87] K. L. Clarkson. New applications of random sampling in computational geometry. Discrete Com-
put. Geom., 2: 195–222, 1987.
[Cla88] K. L. Clarkson. Applications of random sampling in computational geometry, II. Proc. 4th Annu.
Sympos. Comput. Geom. (SoCG), 1–11, 1988.
[CMS93] K. L. Clarkson, K. Mehlhorn, and R. Seidel. Four results on randomized incremental construc-
tions. Comput. Geom. Theory Appl., 3(4): 185–212, 1993.
[CS89] K. L. Clarkson and P. W. Shor. Applications of random sampling in computational geometry, II.
Discrete Comput. Geom., 4(5): 387–421, 1989.
[Har00a] S. Har-Peled. Constructing planar cuttings in theory and practice. SIAM J. Comput., 29(6): 2016–
2039, 2000.
[Har00b] S. Har-Peled. Taking a walk in a planar arrangement. SIAM J. Comput., 30(4): 1341–1367, 2000.
[Mat02] J. Matoušek. Lectures on Discrete Geometry. Vol. 212. Grad. Text in Math. Springer, 2002.
[Mat92] J. Matoušek. Reporting points in halfspaces. Comput. Geom. Theory Appl., 2(3): 169–186, 1992.
[Mat98] J. Matoušek. On constants for cuttings in the plane. Discrete Comput. Geom., 20: 427–448, 1998.
[Mul89] K. Mulmuley. An efficient algorithm for hidden surface removal. Comput. Graph., 23(3): 379–
388, 1989.
[Mul94] K. Mulmuley. Computational Geometry: An Introduction Through Randomized Algorithms. En-
glewood Cliffs, NJ: Prentice Hall, 1994.
[SA95] M. Sharir and P. K. Agarwal. Davenport-Schinzel Sequences and Their Geometric Applications.
New York: Cambridge University Press, 1995.
[Sei93] R. Seidel. Backwards analysis of randomized geometric algorithms. New Trends in Discrete and
Computational Geometry. Ed. by J. Pach. Vol. 10. Algorithms and Combinatorics. Springer-
Verlag, 1993, pp. 37–68.
[Sha03] M. Sharir. The Clarkson-Shor technique revisited and extended. Comb., Prob. & Comput., 12(2):
191–201, 2003.

Chapter 52

Primality testing
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“The world is what it is; men who are nothing, who allow themselves to become nothing, have no
place in it.”
— Bend in the river, V.S. Naipaul

Introduction – how to read this write-up


In this note, we present a simple randomized algorithm for primality testing. The challenge is that it requires a
non-trivial amount of number theory, which is not the purpose of this course. Nevertheless, this note is more or
less self contained, and all necessary background is provided (assuming some basic mathematical familiarity
with groups, fields and modulo arithmetic). It is however not really necessary to understand all the number
theory material needed, and the reader can take it as given. In particular, I recommend reading the number theory background part without reading all of the proofs (at least on first reading). Naturally, for a complete and total understanding of this material one needs to read everything carefully.
The description of the primality testing algorithm in this write-up is not minimal – there are shorter descrip-
tions out there. However, it is modular – assuming the number theory machinery used is correct, the algorithm
description is relatively straightforward.

52.1. Number theory background


52.1.1. Modulo arithmetic
52.1.1.1. Prime and coprime
For integer numbers x and y, let x | y denote that x divides y. The greatest common divisor (gcd) of two
numbers x and y, denoted by gcd(x, y), is the largest integer that divides both x and y. The least common
multiple (lcm) of x and y, denoted by lcm(x, y) = xy/ gcd(x, y), is the smallest integer α, such that x | α and
y | α. An integer number p > 0 is prime if it is divisible only by 1 and itself (we will consider 1 not to be
prime).
Some standard definitions:

x, y are coprime ⇐⇒ gcd(x, y) = 1,
quotient of x/y ⇐⇒ x div y = ⌊x/y⌋,
remainder of x/y ⇐⇒ x mod y = x − y⌊x/y⌋.
The remainder x mod y is sometimes referred to as residue.

52.1.1.2. Computing gcd

Computing the gcd of two numbers is done by a classical algorithm:

    EuclidGCD(a, b):
        if b = 0 then return a
        return EuclidGCD(b, a mod b)

Proving that it indeed returns the right result follows by an easy induction. It is easy to verify that
if the input is made out of log n bits, then this algorithm takes O(poly(log n)) time (i.e., it is polynomial in
the input size). Indeed, doing basic operations on numbers (i.e., multiplication, division, addition, subtraction,
etc.) with a total of ℓ bits takes O(ℓ^2) time (naively – faster algorithms are known).
Exercise 52.1.1. Show that gcd(Fn , Fn−1 ) = 1, where Fi is the ith Fibonacci number. Argue that for two
consecutive Fibonacci numbers EuclidGCD(Fn , Fn−1 ) takes O(n) time, if every operation takes O(1) time.
Lemma 52.1.2. For all integers α, β > 0, there are integer numbers x and y, such that gcd(α, β) = αx + βy,
which can be computed in polynomial time; that is, in time O(poly(log α + log β)).

Proof: If α = β then the claim trivially holds. Otherwise, assume that α > β (otherwise, swap them), and
observe that gcd(α, β) = gcd(α mod β, β). In particular, by induction, there are integers x′, y′, such that
gcd(α mod β, β) = x′(α mod β) + y′β. However, α mod β = α − β⌊α/β⌋. As such, we have

gcd(α, β) = gcd(α mod β, β) = x′(α − β⌊α/β⌋) + y′β = x′α + (y′ − x′⌊α/β⌋)β,

as claimed. The running time follows immediately by modifying EuclidGCD to compute these numbers. ■
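The proof is constructive, and the construction is usually called the extended Euclidean algorithm. A minimal Python sketch (the function name ext_gcd is ours, not from the text):

```python
def ext_gcd(a, b):
    """Return (g, x, y) such that g = gcd(a, b) = a*x + b*y (Lemma 52.1.2)."""
    if b == 0:
        return a, 1, 0
    # gcd(a, b) = gcd(b, a mod b) = b*x' + (a - b*(a // b))*y'
    g, x, y = ext_gcd(b, a % b)
    return g, y, x - (a // b) * y

g, x, y = ext_gcd(240, 46)
assert g == 2 and 240 * x + 46 * y == 2
```

The recursion depth matches that of EuclidGCD, so the running time is polynomial in the bit length of the input.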
We use α ≡ β (mod n) or α ≡n β to denote that α and β are congruent modulo n; that is, α mod n =
β mod n. Put differently, we have n | (α − β). The set Zn = {0, . . . , n − 1} forms a group under addition
modulo n (see Definition 52.1.9 for a formal definition of a group). The more interesting creature is Z∗n =
{ x | x ∈ {1, . . . , n}, x > 0, and gcd(x, n) = 1 }, which is a group under multiplication modulo n.
Remark 52.1.3. Observe that Z∗1 = {1}, while for n > 1, Z∗n does not contain n.
Lemma 52.1.4. For any element α ∈ Z∗n , there exists a unique inverse element β = α−1 ∈ Z∗n such that
α ∗ β ≡n 1. Furthermore, the inverse can be computed in polynomial time¬ .
Proof: Since α ∈ Z∗n, we have that gcd(α, n) = 1. As such, by Lemma 52.1.2, there exist integers x and y,
such that xα + yn = 1. That is, xα ≡ 1 (mod n), and clearly β := x mod n is the desired inverse, and it can be
computed in polynomial time by Lemma 52.1.2.
As for uniqueness, assume that there are two inverses β, β′ of α, such that β < β′ < n. But then
βα ≡n β′α ≡n 1, which implies that n | (β′ − β)α, which implies that n | β′ − β, which is impossible as
0 < β′ − β < n. ■
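A sketch of the computation in Lemma 52.1.4, via the extended Euclidean algorithm; as a sanity check, Python's built-in pow(a, -1, n) (available since Python 3.8) computes the same inverse:

```python
def inverse_mod(a, n):
    """Inverse of a in Z*_n, via the construction of Lemma 52.1.2."""
    def ext_gcd(a, b):              # returns (g, x, y) with g = a*x + b*y
        if b == 0:
            return a, 1, 0
        g, x, y = ext_gcd(b, a % b)
        return g, y, x - (a // b) * y
    g, x, _ = ext_gcd(a, n)
    if g != 1:
        raise ValueError("a is not in Z*_n")
    return x % n

# 7 * 8 = 56 ≡ 1 (mod 11); matches the built-in modular inverse
assert inverse_mod(7, 11) == 8 == pow(7, -1, 11)
```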
It is now straightforward, but somewhat tedious, to verify the following (the interested reader that had not
encountered this stuff before can spend some time proving this).
Lemma 52.1.5. The set Zn under the + operation modulo n is a group, as is Z∗n under multiplication modulo
n. More importantly, for a prime number p, Z p forms a field with the +, ∗ operations modulo p (see Defini-
tion 52.1.17p340 ).
¬ Again, as is everywhere in this chapter, the polynomial time is in the number of bits needed to specify the input.

52.1.1.3. The Chinese remainder theorem
Theorem 52.1.6 (Chinese remainder theorem). Let n1 , . . . , nk be coprime numbers, and let n = n1 n2 · · · nk .
For any residues r1 ∈ Zn1 , . . . , rk ∈ Znk , there is a unique r ∈ Zn , which can be computed in polynomial time,
such that r ≡ ri (mod ni ), for i = 1, . . . , k.

Proof: By the coprime property of the ni s, it follows that gcd(ni, n/ni) = 1. As such, n/ni ∈ Z∗_{ni}, and it has a
unique inverse mi modulo ni; that is, (n/ni)·mi ≡ 1 (mod ni). So set r = Σ_i ri·mi·(n/ni). Observe that for i ≠ j, we
have that nj | (n/ni), and as such ri·mi·(n/ni) ≡ 0 (mod nj). As such, we have

r mod nj = (Σ_i ri·mi·(n/ni)) mod nj = (rj·mj·(n/nj)) mod nj = rj · 1 mod nj = rj.

As for uniqueness, if there is another such number r′, with r < r′ < n, then r′ − r ≡ 0 (mod ni), implying
that ni | r′ − r, for all i. Since all the ni s are coprime, this implies that n | r′ − r, which is of course impossible. ■

Lemma 52.1.7 (Fast exponentiation). Given numbers b, c, n, one can compute b^c mod n in polynomial time.

Proof: The key property we need is that

xy mod n = ((x mod n)·(y mod n)) mod n.

Now, if c is even, then we can compute

b^c mod n = (b^{c/2})^2 mod n = (b^{c/2} mod n)^2 mod n.

Similarly, if c is odd, we have

b^c mod n = (b·(b^{(c−1)/2})^2) mod n = ((b mod n)·(b^{(c−1)/2} mod n)^2) mod n.

Namely, computing b^c mod n can be reduced to recursively computing b^{⌊c/2⌋} mod n, plus a constant number of
operations (on numbers that are smaller than n). Clearly, the depth of the recursion is O(log c). ■
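The recursion of Lemma 52.1.7, sketched in Python (the built-in pow(b, c, n) does the same thing):

```python
def mod_exp(b, c, n):
    """Compute b**c mod n by repeated squaring; recursion depth O(log c)."""
    if c == 0:
        return 1 % n
    half = mod_exp(b, c // 2, n)      # b**(c // 2) mod n
    sq = (half * half) % n
    return sq if c % 2 == 0 else (sq * (b % n)) % n

assert mod_exp(3, 45, 7) == pow(3, 45, 7)
```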

52.1.1.4. Euler totient function


The Euler totient function ϕ(n) = |Z∗n| is the number of positive integers that are at most n and are coprime
with n. If n is prime then ϕ(n) = n − 1.

Lemma 52.1.8. Let n = p1^{k1} · · · pt^{kt}, where the pi s are prime numbers and the ki s are positive integers (this is
the prime factorization of n). Then ϕ(n) = ∏_{i=1}^{t} pi^{ki−1}(pi − 1), and this quantity can be computed in polynomial
time if the factorization is given.

Proof: Observe that ϕ(1) = 1 (see Remark 52.1.3), and for a prime number p, we have that ϕ(p) = p − 1. Now,
for k > 1, and p prime, we have that ϕ(p^k) = p^{k−1}(p − 1), as a number x ≤ p^k is coprime with p^k if and only if
x mod p ≠ 0, and a (p − 1)/p fraction of the numbers in this range have this property.
Now, if n and m are relatively prime, then gcd(x, nm) = 1 ⇐⇒ gcd(x, n) = 1 and gcd(x, m) = 1. In
particular, there are ϕ(n)ϕ(m) pairs (α, β) ∈ Z∗n × Z∗m, such that gcd(α, n) = 1 and gcd(β, m) = 1. By the Chinese
remainder theorem (Theorem 52.1.6), each such pair represents a unique number in the range 1, . . . , nm, as
desired.
Now, the claim follows by easy induction on the prime factorization of the given number. ■
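Under the (strong) assumption that the factorization is given, Lemma 52.1.8 is a one-liner; a sketch, with the factorization passed as a dictionary:

```python
from math import prod

def phi_from_factorization(factors):
    """phi(n) for n = prod of p**k, with the factorization given as {p: k}."""
    return prod(p ** (k - 1) * (p - 1) for p, k in factors.items())

# 12 = 2^2 * 3, and phi(12) = 2 * 2 = 4 (the coprime residues are 1, 5, 7, 11)
assert phi_from_factorization({2: 2, 3: 1}) == 4
```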

52.1.2. Structure of the modulo group Zn
52.1.2.1. Some basic group theory
Definition 52.1.9. A group is a set, G, together with an operation × that combines any two elements a and b
to form another element, denoted a × b or ab. To qualify as a group, the set and operation, (G, ×), must satisfy
the following:
(A) (Closure) For all a, b ∈ G, the result of the operation, a × b ∈ G.
(B) (Associativity) For all a, b, c ∈ G, we have (a × b) × c = a × (b × c).
(C) (Identity element) There exists an element i ∈ G, called the identity element, such that for every element
a ∈ G, the equation i × a = a × i = a holds.
(D) (Inverse element) For each a ∈ G, there exists an element b ∈ G such that a × b = b × a = i.
A group is abelian (aka, commutative group) if for all a, b ∈ G, we have that a × b = b × a.

In the following we restrict our attention to abelian groups, since it makes the discussion somewhat simpler;
some of the claims below hold even without this restriction.
The identity element is unique. Indeed, if both f, g ∈ G are identity elements, then f = f × g = g.
Similarly, for every element x ∈ G there exists a unique inverse y = x^{-1}. Indeed, if there were another inverse z,
then y = y × i = y × (x × z) = (y × x) × z = i × z = z.

52.1.2.2. Subgroups
For a group G, a subset H ⊆ G that is also a group (under the same operation) is a subgroup.
For x, y ∈ G, let us define x ∼ y if x/y ∈ H, where x/y = xy^{-1} and y^{-1} is the inverse of y in G. Observe
that (y/x)(x/y) = yx^{-1}xy^{-1} = i. That is, y/x is the inverse of x/y, and it is in H. But that implies that
x ∼ y =⇒ y ∼ x. Now, if x ∼ y and y ∼ z, then x/y, y/z ∈ H. But then x/y × y/z ∈ H, and furthermore
x/y × y/z = xy^{-1}yz^{-1} = xz^{-1} = x/z; that is, x ∼ z. Together, this implies that ∼ is an equivalence relationship.
Furthermore, observe that if x/y = x/z then y^{-1} = x^{-1}(x/y) = x^{-1}(x/z) = z^{-1}, that is y = z. In particular, the
equivalence class of x ∈ G is [x] = { z ∈ G | x ∼ z }. Observe that if x ∈ H then i/x = i·x^{-1} = x^{-1} ∈ H, and thus
i ∼ x. That is, H = [i]. The following is now easy.

Lemma 52.1.10. Let G be an abelian group, and let H ⊆ G be a subgroup. Consider the set G/H =
{ [x] | x ∈ G }. We claim that |[x]| = |[y]| for any x, y ∈ G. Furthermore, G/H is a group (that is, the quotient
group), with [x] × [y] = [x × y].

Proof: Pick an element α ∈ [x] and β ∈ [y], and consider the mapping f(γ) = γα^{-1}β. We claim that f is one-to-
one and onto from [x] to [y]. For any γ ∈ [x], we have that γα^{-1} = γ/α ∈ H. As such, f(γ) = γα^{-1}β ∈ [β] = [y].
Now, for any γ, γ′ ∈ [x] such that γ ≠ γ′, we have that if f(γ) = γα^{-1}β = γ′α^{-1}β = f(γ′), then by multiplying
by β^{-1}α, we have that γ = γ′. That is, f is one-to-one, implying that |[x]| = |[y]|.
The second claim follows by carefully but tediously checking that the conditions in the definition of a group
hold. ■

Lemma 52.1.11. For a finite abelian group G and a subgroup H ⊆ G, we have that |H| divides |G|.

Proof: By Lemma 52.1.10, we have that |G| = |H| · |G/H|, as H = [i]. ■

52.1.2.3. Cyclic groups
Lemma 52.1.12. For a finite group G, and any element g ∈ G, the set ⟨g⟩ = { g^i | i ≥ 0 } is a group.

Proof: Since G is finite, there are integers i > j ≥ 1, such that g^i = g^j, but then g^j × g^{i−j} = g^i = g^j.
That is, g^{i−j} = i and, by definition, we have g^{i−j} ∈ ⟨g⟩. It is now straightforward to verify that the other properties
of a group hold for ⟨g⟩. ■

In particular, for an element g ∈ G, we define its order as ord(g) = |⟨g⟩|, which clearly is the minimum
positive integer m, such that g^m = i. Indeed, for j > m, observe that g^j = g^{j mod m} ∈ X = { i, g, g^2, . . . , g^{m−1} },
which implies that ⟨g⟩ = X.
A group G is cyclic, if there is an element g ∈ G, such that ⟨g⟩ = G. In such a case g is a generator of G.

Lemma 52.1.13. For any finite abelian group G, and any g ∈ G, we have that ord(g) divides |G|, and g^{|G|} = i.

Proof: By Lemma 52.1.12, the set ⟨g⟩ is a subgroup of G. By Lemma 52.1.11, we have that ord(g) = |⟨g⟩| divides |G|.
As such, g^{|G|} = (g^{ord(g)})^{|G|/ord(g)} = i^{|G|/ord(g)} = i. ■

52.1.2.4. Modulo group


Lemma 52.1.14. For any integer n, consider the additive group Zn. Then, for any x ∈ Zn, we have that
x · ord(x) = lcm(x, n). In particular, ord(x) = lcm(n, x)/x = n/gcd(n, x). If n is prime and x ≠ 0, then ord(x) =
|Zn| = n, and Zn is a cyclic group.

Proof: We are working modulo n here under addition, and the identity element is 0. As such, x · ord(x) ≡n 0,
which implies that n | x·ord(x). By definition, ord(x) is the minimal number that has this property, implying
that ord(x) = lcm(n, x)/x. Now, lcm(n, x) = nx/gcd(n, x). The second claim is now easy. ■
x
Theorem 52.1.15. (Euler’s theorem) For all n and x ∈ Z∗n, we have x^{ϕ(n)} ≡ 1 (mod n).
(Fermat’s theorem) If p is a prime, then for all x ∈ Z∗p, we have x^{p−1} ≡ 1 (mod p).

Proof: The group Z∗n is abelian and has ϕ(n) elements, with 1 being the identity element (duh!). As such, by
Lemma 52.1.13, we have that x^{ϕ(n)} = x^{|Z∗n|} ≡ 1 (mod n), as claimed.
The second claim follows by setting n = p, and recalling that ϕ(p) = p − 1, if p is a prime. ■

One might be tempted to think that Lemma 52.1.14 implies that if p is a prime then Z∗p is a cyclic group,
but this does not follow, as the cardinality of Z∗p is ϕ(p) = p − 1, which is not a prime number (for p > 2). To
prove that Z∗p is cyclic, let us go back shortly to the totient function.
Lemma 52.1.16. For any n > 0, we have Σ_{d|n} ϕ(d) = n.

Proof: For any g > 0, let Vg = { x | x ∈ {1, . . . , n} and gcd(x, n) = g }. Now, x ∈ Vg ⇐⇒ gcd(x, n) = g
⇐⇒ gcd(x/g, n/g) = 1 ⇐⇒ x/g ∈ Z∗_{n/g}. Since V1, V2, . . . , Vn form a partition of {1, . . . , n}, it follows that

n = Σ_g |Vg| = Σ_{g|n} |Z∗_{n/g}| = Σ_{g|n} ϕ(n/g) = Σ_{d|n} ϕ(d). ■

52.1.2.5. Fields
Definition 52.1.17. A field is an algebraic structure ⟨F, +, ∗, 0, 1⟩ consisting of two abelian groups:
(A) F under +, with 0 being the identity element.
(B) F \ {0} under ∗, with 1 as the identity element (here 0 ≠ 1).
Also, the following property (distributivity of multiplication over addition) holds:

∀a, b, c ∈ F    a ∗ (b + c) = (a ∗ b) + (a ∗ c).

We need the following: a polynomial p of degree k over a field F has at most k roots. Indeed, if p has the
root α, then it can be written as p(x) = (x − α)q(x), where q(x) is a polynomial of one degree lower. To see
this, we divide p(x) by the polynomial (x − α), and observe that p(x) = (x − α)q(x) + β; but clearly β = 0, since
p(α) = 0. As such, if p had t roots α1, . . . , αt, then p(x) = q(x)·∏_{i=1}^{t}(x − αi), which implies that p has
degree at least t.

52.1.2.6. Z∗p is cyclic for prime numbers


For a prime number p, the group Z∗p has size ϕ(p) = p − 1, which is not a prime number for p > 2. As such,
Lemma 52.1.13 does not imply that there must be an element in Z∗p that has order p − 1 (and thus Z∗p is cyclic).
Instead, our argument is going to be more involved and less direct.
Lemma 52.1.18. For k ≤ p − 1, let Rk = { x ∈ Z∗p | ord(x) = k } be the set of all numbers in Z∗p that are of order k.
We have that |Rk| ≤ ϕ(k).

Proof: Clearly, all the elements of Rk are roots of the polynomial x^k − 1 over Zp. By the above, this
polynomial has at most k roots. Now, if Rk is not empty, then it contains an element x of order k, which
implies that for all 1 ≤ i < j ≤ k, we have that x^i ≢ x^j (mod p), as the order of x is the size of ⟨x⟩, and k is the
minimum number such that x^k ≡ 1 (mod p). Every element y = x^j ∈ ⟨x⟩ satisfies y^k ≡p x^{jk} ≡p 1^j ≡p 1;
that is, ⟨x⟩ consists of k distinct roots of the above polynomial, and it follows that Rk ⊆ ⟨x⟩.
Observe that for y = x^i, if g = gcd(k, i) > 1, then y^{k/g} ≡p x^{ik/g} ≡p x^{lcm(i,k)} ≡p 1; that is, ord(y) ≤ k/g < k,
and y ∉ Rk. As such, Rk contains only elements x^i with gcd(i, k) = 1. The claim now
readily follows, as the number of such i is |Z∗k| = ϕ(k). ■

Lemma 52.1.19. For any prime p, the group Z∗p is cyclic.

Proof: For p = 2 the claim trivially holds, so assume p > 2. If the set R_{p−1}, from Lemma 52.1.18, is not empty,
then there is g ∈ R_{p−1}; it has order p − 1, and it is a generator of Z∗p, as |Z∗p| = p − 1, implying that Z∗p = ⟨g⟩ and
this group is cyclic.
Now, by Lemma 52.1.13, for any y ∈ Z∗p, we have that ord(y) | p − 1 = |Z∗p|. This implies that
Rk is empty if k does not divide p − 1. On the other hand, R1, . . . , R_{p−1} form a partition of Z∗p. As such, we
have that

p − 1 = |Z∗p| = Σ_{k|p−1} |Rk| ≤ Σ_{k|p−1} ϕ(k) = p − 1,

by Lemma 52.1.18 and Lemma 52.1.16, implying that the inequality in the above display is an equality, and
for all k | p − 1, we have that |Rk| = ϕ(k). In particular, |R_{p−1}| = ϕ(p − 1) > 0, and by the above the claim
follows. ■
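The lemma is non-constructive – it only shows a generator exists. For small primes one can find it by brute force; a sketch (this helper is ours, and it is exponential in the bit length of p, so it is only for experimentation):

```python
def find_generator(p):
    """Smallest generator of Z*_p for a prime p, by brute force."""
    for g in range(2, p):
        x, seen = 1, set()
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:      # <g> covers all of Z*_p
            return g
    return None

# powers of 3 modulo 7: 3, 2, 6, 4, 5, 1 -- all of Z*_7
assert find_generator(7) == 3
```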

52.1.2.7. Z∗n is cyclic for powers of a prime
Lemma 52.1.20. Consider any odd prime p, and any integer c ≥ 1. Then the group Z∗n is cyclic, where n = p^c.

Proof: Let g be a generator of Z∗p. Observe that g^{p−1} ≡ 1 (mod p). The number g is smaller than p, and as such
p divides neither g, nor g^{p−2}, nor p − 1. As such, p^2 does not divide ∆ = (p − 1)g^{p−2}p; that is, ∆ ≢ 0 (mod p^2).
As such, we have that

(g + p)^{p−1} ≡ g^{p−1} + (p − 1)g^{p−2}p ≡ g^{p−1} + ∆ ≢ g^{p−1} (mod p^2)
=⇒ (g + p)^{p−1} ≢ 1 (mod p^2) or g^{p−1} ≢ 1 (mod p^2).

Renaming g + p to be g, if necessary, we have that g^{p−1} ≢ 1 (mod p^2); but by Theorem 52.1.15, g^{p−1} ≡ 1
(mod p). As such, g^{p−1} = 1 + βp, where p does not divide β. Now, we have

g^{p(p−1)} = (1 + βp)^p = 1 + pβp + p^3·⟨whatever⟩ = 1 + γ1·p^2,

where γ1 is an integer (the p^3 is not a typo – each later binomial coefficient contributes at least one extra factor
of p; here we are using that p > 2). In particular, as p does not divide β, it follows that p does not divide γ1 either.
Let us apply this argument again:

g^{p^2(p−1)} = (1 + γ1·p^2)^p = 1 + γ1·p^3 + p^4·⟨whatever⟩ = 1 + γ2·p^3,

where again p does not divide γ2. Repeating this argument, for i = 1, . . . , c − 2, we have

αi = g^{p^i(p−1)} = (g^{p^{i−1}(p−1)})^p = (1 + γ_{i−1}·p^i)^p = 1 + γ_{i−1}·p^{i+1} + p^{i+2}·⟨whatever⟩ = 1 + γi·p^{i+1},

where p does not divide γi. In particular, this implies that α_{c−2} = 1 + γ_{c−2}·p^{c−1}, with p not dividing γ_{c−2}.
This in turn implies that α_{c−2} ≢ 1 (mod p^c).
Now, the order k of g in Z∗n must divide |Z∗n|, by Lemma 52.1.13, and |Z∗n| = ϕ(n) = p^{c−1}(p − 1), see
Lemma 52.1.8. So k | p^{c−1}(p − 1). Also, α_{c−2} ≢ 1 (mod p^c) implies that k does not divide p^{c−2}(p − 1).
It follows that p^{c−1} | k. So, let us write k = p^{c−1}·k′, where k′ ≤ p − 1. By definition, g^k ≡ 1 (mod p^c), and
thus g^k ≡ 1 (mod p). Now, g^p ≡ g (mod p), by Fermat’s theorem. As such, we have that

1 ≡ g^k ≡ g^{p^{c−1}k′} ≡ (g^p)^{p^{c−2}k′} ≡ g^{p^{c−2}k′} ≡ · · · ≡ g^{k′} (mod p).

Namely, g^{k′} ≡ 1 (mod p). As g is a generator of Z∗p, its order modulo p is p − 1, and thus (p − 1) | k′. Since
1 ≤ k′ ≤ p − 1, we conclude that k′ = p − 1, and thus k = p^{c−1}(p − 1) = |Z∗n|; that is, Z∗n is cyclic. ■

52.1.3. Quadratic residues


52.1.3.1. Quadratic residue
Definition 52.1.21. An integer α is a quadratic residue modulo a positive integer n, if gcd(α, n) = 1 and for
some integer β, we have α ≡ β^2 (mod n).

Theorem 52.1.22 (Euler’s criterion). Let p be an odd prime, and α ∈ Z∗p. We have that
(A) α^{(p−1)/2} ≡p ±1.
(B) If α is a quadratic residue, then α^{(p−1)/2} ≡p 1.
(C) If α is not a quadratic residue, then α^{(p−1)/2} ≡p −1.
Proof: (A) Let γ = α^{(p−1)/2}, and observe that γ^2 ≡p α^{p−1} ≡p 1, by Fermat’s theorem (Theorem 52.1.15),
which implies that γ is either +1 or −1, as the polynomial x^2 − 1 has at most two roots over a field.
(B) Let α ≡p β^2; again by Fermat’s theorem, we have α^{(p−1)/2} ≡p β^{p−1} ≡p 1.
(C) Let X be the set of elements in Z∗p that are not quadratic residues, and consider α ∈ X. Since Z∗p is
a group, for any x ∈ Z∗p there is a unique y ∈ Z∗p such that xy ≡p α; moreover y ≠ x, as α is not a quadratic
residue. As such, we can partition Z∗p into the pairs C = { {x, y} | x, y ∈ Z∗p and xy ≡p α }, and for
τ = ∏_{β∈Z∗p} β we have

τ ≡p ∏_{{x,y}∈C} xy ≡p ∏_{{x,y}∈C} α ≡p α^{(p−1)/2},

as |C| = (p − 1)/2. Let us consider a similar set of pairs, but this time for 1: D = { {x, y} | x, y ∈ Z∗p, x ≠ y and xy ≡p 1 }.
Clearly, the pairs of D do not cover −1 and 1 (each of these two elements is its own inverse), but all other
elements of Z∗p are covered. As such,

τ ≡p (−1) · 1 · ∏_{{x,y}∈D} xy ≡p (−1) · ∏_{{x,y}∈D} 1 ≡p −1.

Combining the two computations of τ yields α^{(p−1)/2} ≡p −1. ■

52.1.3.2. Legendre symbol


For an odd prime p, and an integer a with gcd(a, n) = 1, the Legendre symbol (a | p) is one if a is a quadratic
residue modulo p, and −1 otherwise (if p | a, we define (a | p) = 0). Euler’s criterion (Theorem 52.1.22)
implies the following equivalent definition.
Definition 52.1.23. The Legendre symbol, for a prime number p, and a ∈ Z∗p, is
(a | p) = a^{(p−1)/2} (mod p).
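This definition combines directly with fast exponentiation to evaluate the Legendre symbol; a sketch:

```python
def legendre(a, p):
    """Legendre symbol (a | p) for an odd prime p, via Euler's criterion."""
    a %= p
    if a == 0:
        return 0
    return 1 if pow(a, (p - 1) // 2, p) == 1 else -1

# the quadratic residues modulo 11 are exactly 1, 3, 4, 5, 9
assert [a for a in range(1, 11) if legendre(a, 11) == 1] == [1, 3, 4, 5, 9]
```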
The following is easy to verify.
Lemma 52.1.24. Let p be an odd prime, and let a, b be integer numbers. We have:
(i) (−1 | p) = (−1)^{(p−1)/2}.
(ii) (a | p)(b | p) = (ab | p).
(iii) If a ≡p b then (a | p) = (b | p).
Lemma 52.1.25 (Gauss’ lemma). Let p be an odd prime and let a be an integer that is not divisible by p. Let
X = { αj = ja (mod p) | j = 1, . . . , (p − 1)/2 }, and L = { x ∈ X | x > p/2 } ⊆ X. Then (a | p) = (−1)^n, where
n = |L|.
Proof: Observe that for any distinct i, j, such that 1 ≤ i < j ≤ (p − 1)/2, having ja ≡ ia (mod p) implies
that (j − i)a ≡ 0 (mod p), which is impossible as j − i < p and gcd(a, p) = 1. As such, all the elements of
X are distinct, and |X| = (p − 1)/2. We have a somewhat stronger property: ja ≡ −ia (mod p) implies
(j + i)a ≡ 0 (mod p), which is also impossible. That is, setting S = X \ L, the sets S and L̄ = { p − ℓ | ℓ ∈ L }
are disjoint, and S ∪ L̄ = {1, . . . , (p − 1)/2}. As such,

((p − 1)/2)! ≡ ∏_{x∈S} x · ∏_{y∈L} (p − y) ≡ (−1)^n ∏_{x∈S} x · ∏_{y∈L} y ≡ (−1)^n ∏_{j=1}^{(p−1)/2} ja ≡ (−1)^n a^{(p−1)/2} ((p − 1)/2)! (mod p).

Dividing both sides by (−1)^n ((p − 1)/2)!, we have that (a | p) ≡ a^{(p−1)/2} ≡ (−1)^n (mod p), as claimed. ■
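Gauss' lemma is easy to check numerically against Euler's criterion; a small sketch (the helper name is ours):

```python
def legendre_gauss(a, p):
    """(a | p) via Gauss' lemma: count the values ja mod p that exceed p/2."""
    n = sum(1 for j in range(1, (p - 1) // 2 + 1) if (j * a) % p > p / 2)
    return (-1) ** n

# agrees with Euler's criterion a^((p-1)/2) mod p, e.g. for p = 13
assert all(legendre_gauss(a, 13) == (1 if pow(a, 6, 13) == 1 else -1)
           for a in range(1, 13))
```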

Lemma 52.1.26. If p is an odd prime, a is odd, and gcd(a, p) = 1, then (a | p) = (−1)^∆, where
∆ = Σ_{j=1}^{(p−1)/2} ⌊ja/p⌋. Furthermore, we have (2 | p) = (−1)^{(p^2−1)/8}.

Proof: Using the notation of Lemma 52.1.25, we have

Σ_{j=1}^{(p−1)/2} ja = Σ_{j=1}^{(p−1)/2} (⌊ja/p⌋·p + (ja mod p)) = ∆p + Σ_{x∈S} x + Σ_{y∈L} y
= (∆ + n)p + Σ_{x∈S} x − Σ_{y∈L} (p − y) = (∆ + n)p + Σ_{j=1}^{(p−1)/2} j − 2 Σ_{y∈L} (p − y).

Rearranging, and observing that Σ_{j=1}^{(p−1)/2} j = ((p−1)/2)·((p−1)/2 + 1)/2 = (p^2 − 1)/8, we have that

(a − 1)·(p^2 − 1)/8 = (∆ + n)p − 2 Σ_{y∈L} (p − y)   =⇒   (a − 1)·(p^2 − 1)/8 ≡ (∆ + n)p (mod 2).   (52.1)

Observe that p ≡ 1 (mod 2), and for any x we have that x ≡ −x (mod 2). As such, if a is odd, then the
above implies that n ≡ ∆ (mod 2). Now the claim readily follows from Lemma 52.1.25.
As for (2 | p), setting a = 2, observe that ⌊ja/p⌋ = 0, for j = 1, . . . , (p − 1)/2, and as such ∆ = 0. Now,
Eq. (52.1) implies that (p^2 − 1)/8 ≡ n (mod 2), and the claim follows from Lemma 52.1.25. ■

Theorem 52.1.27 (Law of quadratic reciprocity). If p and q are distinct odd primes, then

(p | q) = (−1)^{((p−1)/2)·((q−1)/2)} (q | p).

Proof: Let S = { (x, y) | 1 ≤ x ≤ (p − 1)/2 and 1 ≤ y ≤ (q − 1)/2 }. As lcm(p, q) = pq, there is no (x, y) ∈ S
such that qx = py, as such a number would be a common multiple of p and q that is strictly smaller than pq.
Now, let

S1 = { (x, y) ∈ S | qx > py }   and   S2 = { (x, y) ∈ S | qx < py }.

Now, (x, y) ∈ S1 ⇐⇒ 1 ≤ x ≤ (p − 1)/2 and 1 ≤ y ≤ ⌊qx/p⌋. As such, we have |S1| = Σ_{x=1}^{(p−1)/2} ⌊qx/p⌋, and
similarly |S2| = Σ_{y=1}^{(q−1)/2} ⌊py/q⌋. We have

τ = ((p − 1)/2)·((q − 1)/2) = |S| = |S1| + |S2| = τ1 + τ2,

where τ1 = Σ_{x=1}^{(p−1)/2} ⌊qx/p⌋ and τ2 = Σ_{y=1}^{(q−1)/2} ⌊py/q⌋. The claim now readily follows by
Lemma 52.1.26, as (−1)^τ = (−1)^{τ1}·(−1)^{τ2} = (q | p)·(p | q). ■

52.1.3.3. Jacobi symbol


Definition 52.1.28. For any integer a, and an odd number n with prime factorization n = p1^{k1} · · · pt^{kt}, its Jacobi
symbol is

Ja | nK = ∏_{i=1}^{t} (a | pi)^{ki}.

Claim 52.1.29. For odd integers n1, . . . , nk, we have that Σ_{i=1}^{k} (ni − 1)/2 ≡ ((∏_{i=1}^{k} ni) − 1)/2 (mod 2).

Proof: We prove the claim for two odd integers x and y, and then apply it repeatedly to get the general statement.
Indeed, we have

(x − 1)/2 + (y − 1)/2 ≡ (xy − 1)/2 (mod 2)
⇐⇒ 0 ≡ (xy − 1 − (x − 1) − (y − 1))/2 = (xy − x − y + 1)/2 (mod 2)
⇐⇒ 0 ≡ (x − 1)(y − 1)/2 (mod 2),

which is obviously true, as both x − 1 and y − 1 are even, and thus 4 | (x − 1)(y − 1). ■
Lemma 52.1.30 (Law of quadratic reciprocity). For positive odd integers n and m, we have that Jn | mK =
(−1)^{((n−1)/2)·((m−1)/2)} Jm | nK.

Proof: Let n = ∏_{i=1}^{ν} pi and m = ∏_{j=1}^{µ} qj be the prime factorizations of the two numbers (allowing repeated
factors). If they share a common prime factor p, then both Jn | mK and Jm | nK contain a zero term when expanded, as
(n | p) = (m | p) = 0. Otherwise, we have

Jn | mK = ∏_{i=1}^{ν} ∏_{j=1}^{µ} (pi | qj) = ∏_{i=1}^{ν} ∏_{j=1}^{µ} (−1)^{((qj−1)/2)·((pi−1)/2)} (qj | pi)
= (∏_{i=1}^{ν} ∏_{j=1}^{µ} (−1)^{((qj−1)/2)·((pi−1)/2)}) · Jm | nK = s · Jm | nK,

by Theorem 52.1.27. As for the value of s, observe that

s = ∏_{i=1}^{ν} (∏_{j=1}^{µ} (−1)^{(qj−1)/2})^{(pi−1)/2} = ∏_{i=1}^{ν} ((−1)^{(m−1)/2})^{(pi−1)/2} = ((−1)^{(m−1)/2})^{Σ_i (pi−1)/2} = (−1)^{((n−1)/2)·((m−1)/2)},

by repeated usage of Claim 52.1.29. ■

Lemma 52.1.31. For odd integers n and m, we have that (n^2 − 1)/8 + (m^2 − 1)/8 ≡ (n^2 m^2 − 1)/8 (mod 2).

Proof: For an odd integer n, we have that either (i) 2 | n − 1 and 4 | n + 1, or (ii) 4 | n − 1 and 2 | n + 1. As
such, 8 | n^2 − 1 = (n − 1)(n + 1). In particular, 64 | (n^2 − 1)(m^2 − 1). We thus have that

(n^2 − 1)(m^2 − 1)/8 ≡ 0 (mod 2)
⇐⇒ (n^2 m^2 − n^2 − m^2 + 1)/8 ≡ 0 (mod 2)
⇐⇒ (n^2 m^2 − 1)/8 ≡ (n^2 − 1)/8 + (m^2 − 1)/8 (mod 2). ■
8 8 8

Lemma 52.1.32. Let m, n be odd integers, and a, b be any integers. We have the following:
(A) Jab | nK = Ja | nK Jb | nK.
(B) Ja | nmK = Ja | nK Ja | mK.
(C) If a ≡ b (mod n) then Ja | nK = Jb | nK.
(D) If gcd(a, n) > 1 then Ja | nK = 0.
(E) J1 | nK = 1.
(F) J2 | nK = (−1)^{(n^2−1)/8}.
(G) Jn | mK = (−1)^{((n−1)/2)·((m−1)/2)} Jm | nK.

Proof: (A) Follows immediately, as (ab | pi) = (a | pi)(b | pi), see Lemma 52.1.24.
(B) Immediate from the definition.
(C) Follows readily from Lemma 52.1.24 (iii).
(D) Indeed, if p | gcd(a, n) and p > 1, then (a | p)^k = (0 | p)^k = 0 appears as a term in Ja | nK.
(E) Obvious by definition.
(F) By Lemma 52.1.26, for a prime p, we have (2 | p) = (−1)^{(p^2−1)/8}. As such, writing n = ∏_{i=1}^{t} pi as a
product of primes (allowing repeated primes), we have

J2 | nK = ∏_{i=1}^{t} (2 | pi) = ∏_{i=1}^{t} (−1)^{(pi^2−1)/8} = (−1)^∆,

where ∆ = Σ_{i=1}^{t} (pi^2 − 1)/8. As such, we need to compute ∆ (mod 2), which by Lemma 52.1.31 is

∆ ≡ Σ_{i=1}^{t} (pi^2 − 1)/8 ≡ ((∏_{i=1}^{t} pi)^2 − 1)/8 ≡ (n^2 − 1)/8 (mod 2),

and as such J2 | nK = (−1)^∆ = (−1)^{(n^2−1)/8}.
(G) This is Lemma 52.1.30. ■

52.1.3.4. Jacobi(a, n): Computing the Jacobi symbol


Given a and n (n is an odd number), we are interested in computing (in polynomial time) the Jacobi symbol
Ja | nK. The algorithm Jacobi(a, n) works as follows:
(A) If a = 0 then return 0 // Since J0 | nK = 0.
(B) If a > n then return Jacobi(a (mod n), n) // Lemma 52.1.32 (C)
(C) If gcd(a, n) > 1 then return 0 // Lemma 52.1.32 (D)
(D) If a = 2 then
(I) Compute ∆ = n^2 − 1 (mod 16).
(II) Return (−1)^{∆/8}. // As (n^2 − 1)/8 ≡ ∆/8 (mod 2), and by Lemma 52.1.32 (F)
(E) If 2 | a then return Jacobi(2, n) * Jacobi(a/2, n). // Lemma 52.1.32 (A)
// It must be that a and n are both odd, a < n, and they are coprime.
(F) a′ := a mod 4, n′ := n mod 4, β := (a′ − 1)(n′ − 1)/4.
Return (−1)^β * Jacobi(n, a). // By Lemma 52.1.32 (G)
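The recursion above is usually flattened into an iterative loop; a Python sketch applying the same rules of Lemma 52.1.32:

```python
def jacobi(a, n):
    """Jacobi symbol [a | n] for an odd n > 0."""
    assert n > 0 and n % 2 == 1
    a %= n
    t = 1
    while a != 0:
        while a % 2 == 0:                 # pull out factors of 2: rule (F)
            a //= 2
            if n % 8 in (3, 5):           # [2 | n] = -1 iff n = +-3 (mod 8)
                t = -t
        a, n = n, a                       # reciprocity: rule (G)
        if a % 4 == 3 and n % 4 == 3:
            t = -t
        a %= n                            # rule (C)
    return t if n == 1 else 0             # n > 1 at the end means gcd(a, n) > 1: rule (D)

# 21 = 3 * 7, and [5 | 21] = (5 | 3)(5 | 7) = (-1)(-1) = 1
assert jacobi(5, 21) == 1
```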

Ignoring the recursive calls, all the operations take polynomial time. Clearly, computing Jacobi(2, n)
takes polynomial time. Otherwise, observe that Jacobi reduces its input size by, say, one bit at least every two
recursive calls, and except for the a = 2 case, it always performs only a single recursive call. It follows that its
running time is polynomial. We thus get the following.

Lemma 52.1.33. Given integers a and n, where n is odd, then Ja | nK can be computed in polynomial time.

52.1.3.5. Subgroups induced by the Jacobi symbol
For an odd integer n, consider the set

Jn = { a ∈ Z∗n | Ja | nK ≡ a^{(n−1)/2} (mod n) }.   (52.2)
Claim 52.1.34. The set Jn is a subgroup of Z∗n .
Proof: For a, b ∈ Jn, we have that Jab | nK ≡ Ja | nK Jb | nK ≡ a^{(n−1)/2} b^{(n−1)/2} ≡ (ab)^{(n−1)/2} (mod n), implying
that ab ∈ Jn. Now, J1 | nK = 1, so 1 ∈ Jn. Next, for a ∈ Jn, let a^{-1} be the inverse of a (which is a number in Z∗n).
Observe that a·a^{-1} = kn + 1, for some k, and as such, we have

1 = J1 | nK = Jkn + 1 | nK = Ja·a^{-1} | nK = Ja | nK Ja^{-1} | nK.

And modulo n, we have

1 ≡ Ja | nK Ja^{-1} | nK ≡ a^{(n−1)/2} Ja^{-1} | nK (mod n),

which implies that (a^{-1})^{(n−1)/2} ≡ Ja^{-1} | nK (mod n). That is, a^{-1} ∈ Jn.
Namely, Jn contains the identity, and is closed under inverse and multiplication; it is now easy to verify
that it fulfills the other requirements of being a group. ■
Lemma 52.1.35. Let n be an odd integer that is composite. Then |Jn| ≤ |Z∗n| /2.
Proof: Let n = ∏_{i=1}^{t} pi^{ki} be the prime factorization of n. Let q = p1^{k1}, and m = n/q. By Lemma 52.1.20, the
group Z∗q is cyclic; let g be its generator. Consider the element a ∈ Z∗n such that

a ≡ g (mod q)   and   a ≡ 1 (mod m).

Such a number a exists and is unique, by the Chinese remainder theorem (Theorem 52.1.6). In particular, as
m = ∏_{i=2}^{t} pi^{ki}, we have, for all i ≥ 2, that a ≡ 1 (mod pi), as pi | m. As such, writing the Jacobi
symbol explicitly, we have

Ja | nK = Ja | qK · ∏_{i=2}^{t} (a | pi)^{ki} = Ja | qK · ∏_{i=2}^{t} (1 | pi)^{ki} = Ja | qK = Jg | qK,

since a ≡ g (mod q), by Lemma 52.1.32 (C). At this point there are two possibilities:
(A) If k1 = 1, then q = p1, and Jg | qK = (g | q) = g^{(q−1)/2} (mod q). But g is a generator of Z∗q, and its order
is q − 1. As such, g^{(q−1)/2} ≡ −1 (mod q), see Definition 52.1.23. We conclude that Ja | nK = −1. If we
assume that Jn = Z∗n, then Ja | nK ≡ a^{(n−1)/2} ≡ −1 (mod n). Now, as m | n, we have

a^{(n−1)/2} ≡ −1 (mod m).

But this contradicts the choice of a, as a ≡ 1 (mod m) (note that m > 1, as otherwise n = p1 would be prime).
(B) If k1 > 1 then q = p1^{k1}. Arguing as above, we have that Ja | nK = Jg | qK = (g | p1)^{k1} = (−1)^{k1}
(note that g mod p1 generates Z∗_{p1}, and is thus a quadratic non-residue modulo p1). Thus, if we assume that Jn = Z∗n,
then a^{(n−1)/2} ≡ −1 (mod n) or a^{(n−1)/2} ≡ 1 (mod n). This implies that a^{n−1} ≡ 1 (mod n), and thus a^{n−1} ≡ 1
(mod q).
Now a ≡ g (mod q), and thus g^{n−1} ≡ 1 (mod q). This implies that the order of g in Z∗q must divide
n − 1. That is, ord(g) = ϕ(q) | n − 1. Now, since k1 ≥ 2, we have that p1 | ϕ(q) = p1^{k1−1}(p1 − 1), see
Lemma 52.1.8. We conclude that p1 | n − 1 and p1 | n, which is of course impossible, as p1 > 1.
We conclude that Jn must be a proper subgroup of Z∗n. But, by Lemma 52.1.11, it must be that |Jn| divides |Z∗n|,
and this implies that |Jn| ≤ |Z∗n| /2. ■

52.2. Primality testing
The primality test is now easy­. Indeed, given a number n, first check if it is even (duh!). Otherwise, randomly
pick a number r ∈ {2, . . . , n − 1}. If gcd(r, n) > 1 then the number is composite. Otherwise, check if r ∈ Jn (see
Eq. (52.2)), by computing x = Jr | nK in polynomial time, see Section 52.1.3.4, and x′ = r^{(n−1)/2} mod n
(see Lemma 52.1.7). If x ≡ x′ (mod n) then the algorithm returns that n is prime; otherwise it returns that n is composite.

Theorem 52.2.1. Given a number n, and a parameter δ > 0, there is a randomized algorithm that decides whether
the given number is prime or composite. The running time of the algorithm is O((log n)^c · log(1/δ)), where c is
some constant. If the algorithm returns that n is composite, then n is composite. If the algorithm returns that n
is prime, then it is wrong with probability at most δ.

Proof: Run the above algorithm m = O(log(1/δ)) times. If any of the runs returns that the number is composite,
then the algorithm returns that n is composite; otherwise, the algorithm returns that it is prime.
The algorithm can fail only if n is composite; let r1, . . . , rm be the random numbers the algorithm picked.
The algorithm fails only if r1, . . . , rm ∈ Jn, but since |Jn| ≤ |Z∗n| /2, by Lemma 52.1.35, this
happens with probability at most (|Jn| / |Z∗n|)^m ≤ 1/2^m ≤ δ, as claimed. ■

52.2.1. Distribution of primes


In the following, let π(n) denote the number of primes between 1 and n. Here, we prove that π(n) = Θ(n/ log n).
 
Lemma 52.2.2. Let ∆ be the product of all the prime numbers p, where m < p ≤ 2m. We have that ∆ ≤ (2m choose m).

Proof: Let X be the product of all the composite numbers between m + 1 and 2m. We have

(2m choose m) = (2m · (2m − 1) · · · (m + 2) · (m + 1)) / (m · (m − 1) · · · 2 · 1) = (X · ∆) / (m · (m − 1) · · · 2 · 1).

Since none of the numbers between 2 and m divides any of the factors of ∆, and (2m choose m) is an integer, it
must be that X/(m · (m − 1) · · · 2 · 1) is an integer. Therefore, (2m choose m) = c · ∆, for some integer c > 0,
implying the claim. ■

Lemma 52.2.3. The number of prime numbers between m and 2m is O(m/ ln m).

Proof: Let us denote all primes between m and 2m by p1 < p2 < · · · < pk. Since p1 ≥ m, it follows from
Lemma 52.2.2 that m^k ≤ ∏_{i=1}^{k} pi ≤ (2m choose m) ≤ 2^{2m}. Now, taking lg of both sides, we have
k lg m ≤ 2m. Namely, k ≤ 2m/ lg m. ■

Lemma 52.2.4. π(n) = O(n/ ln n).

Proof: Let Π(n) denote the number of primes less than n. By Lemma 52.2.3, there exist some positive constant
C and some N, such that for all n ≥ N, we have Π(2n) − Π(n) ≤ C · n/ ln n. Namely, Π(2n) ≤ C · n/ ln n + Π(n). Thus,

Π(2n) ≤ Σ_{i=0}^{⌈lg n⌉} (Π(2n/2^i) − Π(2n/2^{i+1})) + O(1) ≤ Σ_{i=0}^{⌈lg n⌉} C · (n/2^i)/ln(n/2^i) + O(1) = O(n/ ln n),

by observing that the summation behaves like a decreasing geometric series. ■
­
One could even say “trivial” with heavy Russian accent.

Lemma 52.2.5. For integers m, k and a prime p, if p^k divides (2m choose m), then p^k ≤ 2m.

Proof: Let T(p, m) be the number of times p appears in the prime factorization of m!. Formally, T(p, m) is the highest number k such that p^k divides m!. We claim that T(p, m) = Σ_{i=1}^∞ ⌊m/p^i⌋. Indeed, consider an integer β ≤ m, such that β = p^t γ, where γ is an integer that is not divisible by p. Observe that β contributes exactly to the first t terms of the summation of T(p, m) – namely, its contribution to m!, as far as powers of p go, is counted correctly.
Let α be the maximum number such that p^α divides (2m choose m) = (2m)! / (m! m!). Clearly,
    α = T(p, 2m) − 2T(p, m) = Σ_{i=1}^∞ ( ⌊2m/p^i⌋ − 2⌊m/p^i⌋ ).
It is easy to verify that for any integers x, y, we have that 0 ≤ ⌊2x/y⌋ − 2⌊x/y⌋ ≤ 1. In particular, let k be the largest number such that ⌊2m/p^k⌋ − 2⌊m/p^k⌋ = 1, and observe that α ≤ k, as only the first k terms in the summation for α might be non-zero. But this implies that ⌊2m/p^k⌋ ≥ 1, which in turn implies that p^k ≤ 2m, as desired. ■

Lemma 52.2.6. π(n) = Ω(n/ ln n).


Proof: Assume (2m choose m) has k distinct prime factors, so that it can be written as (2m choose m) = ∏_{i=1}^k pi^{ni}. By Lemma 52.2.5, we have pi^{ni} ≤ 2m. Of course, the above product might not include some prime numbers between 1 and 2m, and as such k is a lower bound on the number of primes in this range; that is, k ≤ π(2m). This implies
    2^{2m} / (2m) ≤ (2m choose m) = ∏_{i=1}^k pi^{ni} ≤ (2m)^k.
By taking lg of both sides, we have (2m − lg(2m)) / lg(2m) ≤ k ≤ π(2m). ■

We summarize the result.

Theorem 52.2.7. Let π(n) be the number of distinct prime numbers between 1 and n. We have that π(n) =
Θ(n/ ln n).
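A quick empirical sanity check of the theorem (our own illustration, not part of the notes): compute π(n) with a sieve and compare it to n/ln n.

```python
import math

def primes_up_to(n):
    # Sieve of Eratosthenes; returns all primes in [2, n].
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, math.isqrt(n) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, is_p in enumerate(sieve) if is_p]

def prime_counting(n):
    # pi(n): the number of primes between 1 and n.
    return len(primes_up_to(n))
```

The ratio π(n) / (n/ln n) tends to 1 quite slowly – at n = 10⁵ it is still about 1.1 – but the Θ(n/ln n) behavior is plainly visible.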

52.3. Bibliographical notes


Miller [Mil76] presented a primality testing algorithm that runs in deterministic polynomial time, but relies on the Riemann Hypothesis (which is still open). Later on, Rabin [Rab80] showed how to convert this algorithm into a randomized algorithm that does not rely on the Riemann Hypothesis.
This write-up is based on various sources – starting with the description in [MR95], and then filling in some
details from various sources on the web.
What is currently missing from the write-up is a description of the RSA encryption system. This would
hopefully be added in the future. There are of course typos in these notes – let me know if you find any.

References
[Mil76] G. L. Miller. Riemann’s hypothesis and tests for primality. J. Comput. Sys. Sci., 13(3): 300–317,
1976.

[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
[Rab80] M. O. Rabin. Probabilistic algorithm for testing primality. J. Number Theory, 12(1): 128–138,
1980.

Chapter 53

Talagrand’s Inequality
598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
At an archaeological site I saw fragments of precious vessels, well cleaned and groomed and oiled and spoiled. And
beside it I saw a heap of discarded dust which wasn’t even good for thorns and thistles to grow on.
I asked: What is this gray dust which has been pushed around and sifted and tortured and then thrown away?
I answered in my heart: This dust is people like us, who during their lifetime lived separated from copper and gold and
marble stones and all other precious things - and they remained so in death. We are this heap of dust, our bodies, our souls, all
the words in our mouths, all hopes.

At an archaeological site, Yehuda Amichai

53.1. Introduction
Here, we want to prove a strong concentration inequality that is stronger than Azuma’s inequality because it is
independent of the underlying dimension of the process. This inequality is quite subtle, so we will need a quite
elaborate way to get to it – be patient.

53.1.1. Talagrand's inequality, and the T-distance


For two numbers x, y, let [x ≠ y] be 1 if x ≠ y, and 0 otherwise. For two points p = (p1, ..., pd), u = (q1, ..., qd) ∈ R^d, let H(p, u) be the binary vector in {0, 1}^d that encodes the coordinates where they differ. Formally, we have
    H(p, u) = ( [p1 ≠ q1], [p2 ≠ q2], ..., [pd ≠ qd] ).    (53.1)
For example, H((1, 2, 3), (0.1, 2, −1)) = (1, 0, 1). Given a set S ⊆ R^d, and a point p ∈ R^d, let
    H(p, S) = { H(p, u) | u ∈ S }.    (53.2)
To understand this mysterious set, consider a point p ∈ Rd . If p ∈ S , then (0, . . . , 0) = H(p, p) ∈ H(p, S )
(which would be an uninteresting case). Otherwise, every binary point x ∈ H(p, S ) specifies which coordinates
in p one has to change, so that one can move to a point that belongs to S .
A natural measure of the distance of p from S , is then to ask for the vector that minimizes the Hamming
distance from the origin to H(p, S ) – that is, the minimum number of coordinates one has to change in p to get
to a point of S .
This distance measure, however, is not informative enough. Think about H(p, S) = {(0, 1)} and H(p, S′) = {(0, 1), (1, 0)}. In this case, the Hamming measure would rank both sets as being of equal quality (i.e., 1). But clearly, S′ is closer – there are two different ways to get from p to some point of S′ by changing a single coordinate.

To capture this intuition, we consider the convex hull of these sets:
    C(p, S) = CH( { H(p, u) | u ∈ S } ),
and the corresponding T-distance
    ρ(p, S) = min_{u ∈ C(p,S)} ∥u∥.    (53.3)

Observation 53.1.1. An easy upper bound on the T-distance of p to a set S (i.e., ρ(p, S)) is the square root of the minimum number of coordinates one has to change in p to get to a point of S. As the next example shows, however, things are more subtle – if there are many different ways to get from p to a point of S, then the T-distance is going to be significantly smaller.

Example 53.1.2. It would be useful to understand this somewhat mysterious T-distance. To this end, consider the ball b in R^d of radius 100d centered at the origin, and let S = ∂b be its boundary sphere. For a point p ∈ int(b), we have that
    H = H(p, S) = {0, 1}^d \ {(0, 0, ..., 0)}.
As such, C = C(p, S) is the convex hull of all the hypercube vertices, excluding the origin. It is easy to check that the closest point of C to the origin is the point u = (1/d, 1/d, ..., 1/d). As such, we have that ρ(p, S) = ∥u∥ = √(d · (1/d)²) = 1/√d.
In particular, by monotonicity, this implies that for any set T in R^d, the quantity ρ(p, T) is either 0 (i.e., p ∈ T), or alternatively, ρ(p, T) ≥ 1/√d. Similarly, ρ(p, T) ≤ √d, as this is the maximum distance from the origin to any vertex of the hypercube {0, 1}^d.
As a concrete example, for the set S = ∂b and the point p = (200d, ..., 200d), we have ρ(p, S) = √d.
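For small point sets, the T-distance of Eq. (53.3) can actually be computed: it is a min-norm problem over a polytope whose vertex set H(p, S) we can enumerate, so the Frank–Wolfe method applies (using Frank–Wolfe here is our own choice of method, not something the text prescribes). A minimal sketch:

```python
import math

def t_distance(p, S, iters=4000):
    # rho(p, S): the min-norm point of CH(H(p, S)), via Frank-Wolfe.
    # Enumerate the distinct Hamming-pattern vectors H(p, u), u in S.
    verts = list({tuple(int(pi != ui) for pi, ui in zip(p, u)) for u in S})
    x = list(map(float, verts[0]))
    for t in range(iters):
        # Linear minimization oracle: the gradient of ||x||^2 is 2x,
        # and minimizing <x, v> over the finite vertex set is a min.
        s = min(verts, key=lambda v: sum(xi * vi for xi, vi in zip(x, v)))
        gamma = 2.0 / (t + 2)
        x = [(1 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return math.sqrt(sum(xi * xi for xi in x))
```

For instance, with p at the origin and S the non-zero 0/1 vectors in R³ (so that H(p, S) is exactly the set of hypercube vertices minus the origin, as in the example above), the computed value approaches 1/√3.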

In the following, think about the dimension d as being quite large. As such, the distance 1/√d is quite small. In particular, for a set S ⊆ R^d, let
    S_t = { p ∈ R^d | ρ(p, S) ≤ t }
be the expansion of S, formed by including all the points in distance at most t from S under the T-distance.
Since we are interested in probability here, consider R^d to be the product of d probability spaces. Formally, let Ωi be a probability space, and consider the product probability space Ω = ∏_{i=1}^d Ωi. As we are given a probability measure on each Ωi, this defines a natural probability measure on Ω. That is, a point of Ω is generated by picking each of its coordinates independently from Ωi, for i = 1, ..., d.
The volume of a set S ⊆ Ω is thus P[S]. We are now ready to state Talagrand's inequality (not that it is going to help us much).
Theorem 53.1.3 (Talagrand's inequality). For any set S ⊆ Ω, and any t > 0, we have
    P[S] · P[S̄_t] = P[S] · (1 − P[S_t]) ≤ exp(−t²/4),
where S̄_t = Ω \ S_t.

Example 53.1.4. To see why this inequality is interesting, consider Ω = [0, 100]^d with the uniform distribution on each coordinate. The probability measure of a set S ⊆ Ω is P[S] = vol(S)/100^d. Let
    S = { p = (p1, ..., pd) ∈ [0, 100]^d | Σ_i pi ≤ 100d/2 }.
It is easy to verify that vol(S) = vol([0, 100]^d)/2. Let t = 4√(ln d), and consider the set S_t. Intuitively, and not quite correctly, the complement of S_t consists of all the points of [0, 100]^d for which one needs to change more than (roughly) t² coordinates before one can get to a point of S. These points are t-far from being in S.

By Talagrand's inequality, we have that P[S̄_t]/2 = P[S](1 − P[S_t]) ≤ exp(−t²/4) = 1/d⁴. Namely, only a tiny fraction of the cube is at T-distance more than 4√(ln d) from S!
Let us try to restate this – for any set S that occupies half the volume of the hypercube [0, 100]^d, the set of points of this hypercube that are not within T-distance 4√(ln d) of S is tiny.

53.1.2. On the way to proving Talagrand’s inequality


The following helper result is the core of the proof of Talagrand’s inequality. The reader might want to skip
reading the proof of this claim, at least at first reading.
Theorem 53.1.5. For any set S ⊆ Ω = ∏_{i=1}^d Ωi, we have
    E[ exp( ρ²(p, S)/4 ) ] = ∫_{p∈Ω} exp( ρ²(p, S)/4 ) dp ≤ 1/P[S].

Proof: The proof is by induction on the dimension d. For d = 1, we have ρ(p, S) = 0 if p ∈ S, and ρ(p, S) = 1 if p ∉ S. As such, we have
    γ = E[ exp( ρ²(p, S)/4 ) ] = e^{0²/4} P[S] + e^{1²/4} (1 − P[S]) = P[S] + e^{1/4}(1 − P[S]) = f(P[S]),
where f(x) = x + e^{1/4}(1 − x). An easy argument (see Tedium 53.1.6) shows that f(x) ≤ 1/x, which implies that γ = f(P[S]) ≤ 1/P[S], as claimed.
Assume next that the claim holds for dimension d, and consider dimension d + 1. Let O = ∏_{i=1}^d Ωi and N = Ωd+1. Clearly, Ω = ∏_{i=1}^{d+1} Ωi = O × N. Let
    S_O = { p ∈ O | (p, y) ∈ S, for some y ∈ N },
and, for ν ∈ N, let
    S(ν) = { p ∈ O | (p, ν) ∈ S } ⊆ S_O.
Given a point z = (p, ν) ∈ Ω, we can get to a point of S either by changing the new coordinate and then moving inside the old space O, or, alternatively, by keeping the new coordinate ν fixed and moving only in the old coordinates. In particular, we have that
    s ∈ H(p, S_O) ⊆ {0, 1}^d  ⟹  (s, 1) ∈ H(z, S)    (see Eq. (53.2)),
and
    s′ ∈ H(p, S(ν))  ⟹  (s′, 0) ∈ H(z, S).
And similarly, for the corresponding convex hulls, we have
    s ∈ C(p, S_O) ⟹ (s, 1) ∈ C(z, S)    and    s′ ∈ C(p, S(ν)) ⟹ (s′, 0) ∈ C(z, S).

In particular, for s, s′ as above, we have (by convexity) that for any λ ∈ [0, 1], the point
    h(λ) = (1 − λ)(s, 1) + λ(s′, 0) = ( (1 − λ)s + λs′, 1 − λ ) ∈ C(z, S) ⊆ [0, 1]^{d+1}.
The function ĥ(λ) = ∥(1 − λ)s + λs′∥² is convex, see Tedium 53.1.7. We thus have
    ρ²(z, S) = min_{u ∈ C(z,S)} ∥u∥² ≤ ∥h(λ)∥² = ∥(1 − λ)s + λs′∥² + (1 − λ)² ≤ (1 − λ)∥s∥² + λ∥s′∥² + (1 − λ)².

We are still at liberty to choose s and s′. Let s be the point realizing ρ(p, S_O) – this is the closest point of C(p, S_O) to the origin (i.e., ∥s∥ = ρ(p, S_O)). Similarly, let s′ be the point realizing ρ(p, S(ν)). Plugging these two points into the above inequality, we have
    ρ²(z, S) ≤ (1 − λ)ρ(p, S_O)² + λρ(p, S(ν))² + (1 − λ)².
Now, fix ν, and consider the following little integral:
    F(ν) = ∫_{p∈O} exp( ρ²((p, ν), S)/4 )
         ≤ ∫_p exp( [ (1 − λ)ρ(p, S_O)² + λρ(p, S(ν))² + (1 − λ)² ] / 4 )
         = e^{(1−λ)²/4} ∫_p [ exp( ρ(p, S_O)²/4 ) ]^{1−λ} [ exp( ρ(p, S(ν))²/4 ) ]^{λ}
         ≤ e^{(1−λ)²/4} [ ∫_p exp( ρ(p, S_O)²/4 ) ]^{1−λ} [ ∫_p exp( ρ(p, S(ν))²/4 ) ]^{λ}    (by Hölder's inequality, Eq. (53.4))
         ≤ e^{(1−λ)²/4} (1/P[S_O])^{1−λ} (1/P[S(ν)])^{λ}    (by induction)
         = (1/P[S_O]) · e^{(1−λ)²/4} (P[S(ν)]/P[S_O])^{−λ} = (1/P[S_O]) · e^{(1−λ)²/4} r^{−λ},    for r = P[S(ν)]/P[S_O].
Observe that P[S_O] ≥ P[S(ν)], and thus r ≤ 1. To minimize the above, consider the function f3(λ, r) = exp((1 − λ)²/4) r^{−λ}. An easy calculation shows that, for a fixed r, f3(λ, r) is minimized by choosing
    λ(r) = 1 + 2 ln r  for r ∈ [e^{−1/2}, 1],    and    λ(r) = 0  for r ∈ [0, e^{−1/2}],
see Tedium 53.1.9 (A). Furthermore, for this choice of λ, an easy calculation shows that f4(r) = f3(λ(r), r) ≤ 2 − r, see Tedium 53.1.9 (B). As such, we have
    F(ν) ≤ (1/P[S_O]) f4(r) ≤ (1/P[S_O]) ( 2 − P[S(ν)]/P[S_O] ).
We remind the reader that our purpose is to bound
    ∫_{z∈Ω} exp( ρ²(z, S)/4 ) = ∫_{ν∈N} ∫_{p∈O} exp( ρ²((p, ν), S)/4 ) = ∫_{ν∈N} F(ν) ≤ ∫_{ν∈N} (1/P[S_O]) ( 2 − P[S(ν)]/P[S_O] )
    = (1/P[S_O]) ( 2 − ( ∫_{ν∈N} P[S(ν)] ) / P[S_O] ) = (1/P[S_O]) ( 2 − P[S]/P[S_O] ) = (1/P[S]) · (P[S]/P[S_O]) ( 2 − P[S]/P[S_O] ) ≤ 1/P[S],
since for x = P[S]/P[S_O], we have x(2 − x) ≤ 1, for any value of x (see Tedium 53.1.10). ■

53.1.2.1. The low level details used in the above proof


Tedium 53.1.6. Let f(x) = x + e^{1/4}(1 − x). We claim that, for x ∈ (0, 1], f(x) ≤ 1/x. Indeed, set g(x) = 1/x, and observe that f(1) = 1 = 1/1 = g(1). We have that f′(x) = 1 − e^{1/4} ≈ −0.284 and g′(x) = −1/x². In particular, g′(x) ≤ f′(x), for all x ∈ (0, 1). Since f(1) = g(1), it follows that f(x) ≤ g(x), for x ∈ (0, 1].

Tedium 53.1.7. For any p = (p1, ..., pd), u = (q1, ..., qd) ∈ R^d, the function f(λ) = ∥(1 − λ)p + λu∥² is convex. Indeed, let fi(λ) = ((1 − λ)pi + λqi)², for i = 1, ..., d. Observe that f(λ) = Σ_i fi(λ), and as such it is sufficient to prove that each fi is convex. We have fi′(λ) = 2(qi − pi)((1 − λ)pi + λqi), and fi″(λ) = 2(qi − pi)² ≥ 0, which implies convexity.

Fact 53.1.8 (Hölder's inequality). Let p, q ≥ 1 be two numbers such that 1/p + 1/q = 1. Then, for any two functions f, g, we have ∥fg∥₁ ≤ ∥f∥_p ∥g∥_q. Explicitly, stated as integrals, Hölder's inequality is ∫ |f(x)g(x)| dx ≤ ( ∫ |f(x)|^p dx )^{1/p} ( ∫ |g(x)|^q dx )^{1/q}. In particular, for λ ∈ (0, 1), p = 1/(1 − λ) and q = 1/λ, we have, for non-negative functions f and g, that
    ∫ f^{1−λ}(x) g^{λ}(x) dx ≤ ( ∫ f(x) dx )^{1−λ} ( ∫ g(x) dx )^{λ}.    (53.4)

Tedium 53.1.9. (A) We need to find the minimum of the function f(λ) = exp((1 − λ)²/4) r^{−λ} = exp( (1 − λ)²/4 − λ ln r ). We have f′(λ) = f(λ) · ( −(1 − λ)/2 − ln r ). Solving f′(λ) = 0, we get (1 − λ)/2 = −ln r ⟹ λ = 1 + 2 ln r, which works as long as λ ≥ 0; that is, r ≥ e^{−1/2}. Otherwise, we set λ = 0.
(B) For r ≤ e^{−1/2}, we have, by the above, that f(0) = e^{1/4} ≈ 1.28 ≤ 1.39 ≈ 2 − e^{−1/2} ≤ 2 − r. For r > e^{−1/2}, by the above, λ = 1 + 2 ln r, and thus
    g(r) = f(λ) = exp( (1 − λ)²/4 − λ ln r ) = exp( ln²r − (1 + 2 ln r) ln r ) = exp( −ln r − ln²r ).
Setting t = −ln r ∈ [0, 1/2), this reads exp(t − t²) ≤ 2 − e^{−t} = 2 − r; one can verify that this inequality holds on this range (the difference (2 − e^{−t}) − exp(t − t²) has Taylor expansion t³ + O(t⁴) at zero, and stays non-negative up to t = 1/2), so g(r) ≤ 2 − r.
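As a numeric sanity check of these calculations (our own addition, using λ(r) = 1 + 2 ln r for r ≥ e^{−1/2} as in the main text), one can verify f3(λ(r), r) ≤ 2 − r on a grid:

```python
import math

def f3(lam, r):
    # f3(lambda, r) = exp((1 - lambda)^2 / 4) * r^(-lambda).
    return math.exp((1 - lam) ** 2 / 4) * r ** (-lam)

def lam_star(r):
    # The minimizing choice of lambda from the main text.
    return 1 + 2 * math.log(r) if r >= math.exp(-0.5) else 0.0

def check_f4_bound(steps=1000):
    # Verify f3(lam_star(r), r) <= 2 - r for r on a grid in (0, 1].
    return all(
        f3(lam_star(k / steps), k / steps) <= 2 - k / steps + 1e-12
        for k in range(1, steps + 1)
    )
```

The inequality is tight at r = 1, where both sides equal 1.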

Tedium 53.1.10. The function f(x) = x(2 − x) = 2x − x² is a parabola with its maximum at x = 1. Hence, for all y, f(y) ≤ f(1) = 1.

53.1.3. Proving Talagrand’s inequality


Proving Talagrand’s inequality is now easy peasy.

Talagrand's inequality restatement (Theorem 53.1.3). For any set S ⊆ Ω, we have
    P[S] · P[S̄_t] = P[S] ( 1 − P[S_t] ) ≤ exp(−t²/4).

Proof: Consider a random point p ∈ Ω. We are interested in the probability that p ∉ S_t. To this end, consider the random variable X = ρ(p, S). By definition, p ∉ S_t ⟺ X > t. As such, by Markov's inequality, we have
    P[S̄_t] = P[X > t] ≤ P[ exp(X²/4) ≥ exp(t²/4) ] ≤ E[ exp(X²/4) ] / exp(t²/4) ≤ exp(−t²/4) / P[S],
by Theorem 53.1.5. ■

53.2. Concentration via certification


Example 53.2.1. Consider the process of throwing m balls into n bins. The ith ball Xi is uniformly distributed in Ωi = ⟦n⟧. For x = (X1, ..., Xm) ∈ Ω = ∏_{i=1}^m Ωi, let h(x) be the number of bins that are not empty. If h(x) ≥ k, then there is a set I = {i1, ..., ik} of k indices, such that for any two distinct i, j ∈ I, we have that Xi ≠ Xj. Namely, I is a "compact" proof/certificate that h(x) ≥ k. Furthermore, if for y = (Y1, ..., Ym) ∈ Ω we have that Xα = Yα, for all α ∈ I, then h(y) ≥ k. Here, the certificate for a value k is a set of size k.
Definition 53.2.2. Let Ω = ∏_{i=1}^m Ωi. A function h : Ω → N is f-certifiable, for a function f : N → N, if whenever h(x) ≥ k, there exists a set I ⊆ ⟦m⟧, with |I| ≤ f(k), such that, for any y ∈ Ω, if y agrees with x on the coordinates of I, then h(y) ≥ k.

Example 53.2.3. In Example 53.2.1, the function h (i.e., number of bins that are not empty) is f -certifiable,
where f (k) = k.

Example 53.2.4. Consider the random graph G(n, p) over n vertices, created by picking every edge with probability p. One can interpret such a graph as a random binary vector with (n choose 2) coordinates, where the ith coordinate is 1 ⟺ the ith edge is in the graph (for some canonical ordering of all (n choose 2) possible edges).
A triangle in a graph G is a triple of vertices i, j, k, such that ij, jk, ki ∈ E(G). For a graph G, let h(G) be the number of distinct triangles in G. In the above interpretation of a graph as a vector x ∈ {0, 1}^{(n choose 2)}, it is easy to verify that if h(G) ≥ k then it can be certified by 3k coordinates. As such, the number of triangles in a graph is f-certifiable, for f(k) = 3k.
Note that the certificate is only for the lower bound on the value of the function.
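The 3k-coordinate certificate for triangle counting can be constructed explicitly – a small illustration of our own:

```python
import itertools

def triangle_certificate(n, edges):
    # Count the triangles of a graph on vertices 0..n-1, and return a
    # certificate: a set of at most 3k edge-coordinates whose presence
    # alone forces at least k triangles in any graph agreeing on them.
    E = {frozenset(e) for e in edges}
    k, cert = 0, set()
    for a, b, c in itertools.combinations(range(n), 3):
        tri = [frozenset((a, b)), frozenset((b, c)), frozenset((c, a))]
        if all(e in E for e in tri):
            k += 1
            cert |= set(tri)
    return k, cert
```

On the complete graph K4, this reports k = 4 triangles, certified by its 6 edges (and indeed 6 ≤ 3 · 4; the certificate is usually much smaller than the worst-case bound, since triangles share edges).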

We need the following reinterpretation of the T -distance.

Lemma 53.2.5. Consider a set S ⊆ R^d and a point p ∈ R^d. We have that ρ(p, S) ≤ t ⟺ for every x = (x1, ..., xd) ∈ R^d, with ∥x∥ = 1, there exists h ∈ H(p, S), such that ⟨x, h⟩ ≤ t.

Proof: The quantity ℓ = ρ(p, S) is the distance from the origin to the convex polytope C(p, S). In particular, let y be the closest point to the origin in this polytope, and observe that ℓ = ∥y∥. For any unit vector x, the Cauchy–Schwarz inequality gives ⟨y, x⟩ ≤ ∥y∥ ∥x∥ = ℓ. Since y is in the convex hull of H(p, S), it is a convex combination of points of H(p, S), and it follows that there is h ∈ H(p, S) with ⟨h, x⟩ ≤ ⟨y, x⟩ ≤ ℓ ≤ t.
As for the other direction, assume that ℓ = ρ(p, S) > t, let y ∈ C(p, S) be the point realizing this distance, and take x = y/∥y∥. Since y is the closest point of the convex set C(p, S) to the origin, for every h ∈ H(p, S) ⊆ C(p, S) we have ⟨h, y⟩ ≥ ∥y∥²; that is, ⟨h, x⟩ ≥ ∥y∥ = ℓ > t. ■
Theorem 53.2.6. Consider a probability space Ω = ∏_{i=1}^m Ωi, and let h : Ω → R be a 1-Lipschitz, f-certifiable function, for some function f. Consider the random variable X = h(x), for x picked randomly from Ω. Then, for any positive real numbers b and t, we have
    P[ X ≤ b − t√f(b) ] · P[X ≥ b] ≤ exp(−t²/4).
If h is k-Lipschitz, then P[ X ≤ b − tk√f(b) ] · P[X ≥ b] ≤ exp(−t²/4).
Proof: Set S = { p ∈ Ω | h(p) < b − t√f(b) }. Consider a point u, such that h(u) ≥ b. Assume, for the sake of contradiction, that u ∈ S_t. Let I ⊆ ⟦m⟧ be the certificate of size ≤ f(b) that h(u) ≥ b, and consider the vector x = (x1, ..., xm), such that xi = 1/√|I| if i ∈ I, and xi = 0 otherwise. Observe that ∥x∥² = |I| · (1/|I|) = 1, and thus ∥x∥ = 1. By Lemma 53.2.5, there exists h ∈ H(u, S), such that ⟨x, h⟩ ≤ t, since by assumption ρ(u, S) ≤ t. Let v ∈ S be the point realizing h – that is, H(u, v) = h.
Let J ⊆ I be the set of indices of coordinates in I where u and v differ. We have, by the definition of x, that |J|/√|I| ≤ ⟨x, h⟩ ≤ t, which implies that |J| ≤ t√|I| ≤ t√f(b).
Let u′ be the point that agrees with u on the coordinates of I, and agrees with v on the other coordinates. By the definition of the certificate I, we have h(u′) ≥ b. The points u′ and v disagree only on coordinates in I, but such coordinates of disagreement are exactly the coordinates (in I) where u disagrees with v – which is the set J of coordinates. As such, by the 1-Lipschitz condition, we have that
    h(v) ≥ h(u′) − |J| ≥ b − t√f(b),
but then, by the definition of S, we have v ∉ S, which is a contradiction, as v ∈ S.

 
We conclude that h(u) ≥ b implies u ∉ S_t. As such, we have P[X ≥ b] ≤ P[S̄_t] = 1 − P[S_t]. By Talagrand's inequality, we have
    P[ X < b − t√f(b) ] · P[X ≥ b] ≤ P[S] · P[S̄_t] ≤ exp(−t²/4).
The "<" on the left side can be replaced by "≤", as in the statement of the theorem, by using the value t + ε instead of t, and taking the limit as ε → 0.
The k-Lipschitz version follows by applying the above inequality to the function h(·)/k. ■

53.3. Some examples


Definition 53.3.1. For a random variable X ∈ R, let med(X) denote the maximum number m, such that
P[X < m] ≤ 1/2 and P[X > m] ≤ 1/2. The number med(X) is the median of X.

53.3.1. Longest increasing subsequence


Let x = (X1, ..., Xn) be a vector of n numbers picked randomly and uniformly from [0, 1]. Let h(x) be the length of the longest increasing subsequence of the associated sequence.

Lemma 53.3.2. We have h(x) = Θ(√n) with high probability. Furthermore, for some constant c and any t > 0, we have that P[ |h(x) − med(h(x))| ≥ tcn^{1/4} ] ≤ 4 exp(−t²/4). Namely, the random variable h(x) is strongly concentrated.
Proof: Partition the sequence into √n consecutive blocks of length √n each, where the ith block is x[i] ≡ X_{(i−1)√n+1}, ..., X_{i√n}. Let Yi be an indicator variable that is 1 ⟺ x[i] contains a number in the interval J(i) = [ (i−1)/√n, i/√n ]. We have P[Yi = 1] = 1 − (1 − 1/√n)^{√n} ≥ 1 − 1/e ≥ 1/2, since (1 − 1/m)^m ≤ 1/e. If Yi happens, then we can take a number of x[i] that falls in J(i), and add it to the generated sequence. As such, the length of the generated sequence, which is increasing, is at least Y = Σ_{i=1}^{√n} Yi. In particular, E[Y] ≥ √n/2, and Chernoff's inequality implies that P[ Y ≤ (1 − δ)√n/2 ] ≤ exp(−δ²√n/8).
The upper bound is more interesting. The probability that a specific subsequence of t indices i1 < i2 < ... < it forms an increasing subsequence X_{i1} < X_{i2} < ... < X_{it} is 1/t!. As such, the expected number of increasing subsequences of length ≥ ℓ is bounded by
    α = Σ_{t=ℓ}^{n} (n choose t) (1/t!) ≤ Σ_{t=ℓ}^{n} (ne/t)^t (1/t!) ≤ Σ_{t=ℓ}^{n} (ne/t)^t / (t/e)^t = Σ_{t=ℓ}^{n} n^t e^{2t} / t^{2t},
using Lemma 6.1.1. In particular, for ℓ = 4e√n, we have
    α ≤ Σ_{t=ℓ}^{n} n^t e^{2t} / (4e√n)^{2t} = Σ_{t=ℓ}^{n} 1/4^{2t} ≤ 2/4^{8e√n} ≪ 1.
By Markov's inequality, this implies that P[ h(x) ≥ 4e√n ] ≤ 2/4^{8e√n}.

The above readily implies that ν = med(h(x)) = Θ(√n). Furthermore, h(x) is f-certifiable, for f(k) = k (the certificate being the indices of an increasing subsequence), and it is 1-Lipschitz. Theorem 53.2.6, applied with b = ν, now implies that
    P[ h(x) ≤ ν − t√ν ] / 2 ≤ P[ h(x) ≤ ν − t√ν ] · P[ h(x) ≥ ν ] ≤ exp(−t²/4).
As √ν = O(n^{1/4}), we get the following (this requires some further tedious calculations, which we omit):
    P[ |h(x) − ν| ≥ tcn^{1/4} ] ≤ 4 exp(−t²/4),
where c is some constant. ■
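The quantity h(x) is cheap to compute via patience sorting, so the Θ(√n) behavior is easy to observe (the code and the experiment are our own addition; in fact, E[h(x)] is known to be ≈ 2√n):

```python
import bisect
import random

def lis_length(seq):
    # Patience sorting: tails[k] is the smallest possible last element
    # of an increasing subsequence of length k + 1.
    tails = []
    for v in seq:
        j = bisect.bisect_left(tails, v)
        if j == len(tails):
            tails.append(v)
        else:
            tails[j] = v
    return len(tails)

rng = random.Random(0)
n = 10_000
L = lis_length([rng.random() for _ in range(n)])
# L / sqrt(n) should come out close to 2 (here sqrt(n) = 100).
```

The O(n log n) running time makes it easy to probe the concentration claim empirically by repeating the experiment.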

53.3.2. Largest convex subset
A set of points P is in convex position if they are all vertices of the convex-hull of P.

Lemma 53.3.3. Let P be a set of n points picked randomly and uniformly in the unit square [0, 1]². Let Y be the size of the largest subset of points of P that are in convex position. Then, we have that E[Y] = Ω(n^{1/3}).

Proof: Let p = (1/2, 1/2), and consider the regular N-gon Q, for N = n^{1/3}, whose vertices lie on the circle of radius r = 1/2 centered at p. Consider the triangle △i formed by connecting three consecutive vertices p_{2i−1}, p_{2i}, p_{2i+1} of Q. Let α = 2π/N, and pick n large enough so that α ≤ 1/4. We remind the reader that 1 − x²/4 ≥ cos x ≥ 1 − x²/2, for x ∈ (0, 1/4). As such, we have that α²/4 ≤ 1 − cos α ≤ α²/2. In particular, this implies that the height of △i is h = r(1 − cos α), and we have α²/8 = rα²/4 ≤ h ≤ rα²/2.
Let ℓ = ∥p_{2i−1} − p_{2i+1}∥ = 2r sin α; since x/2 ≤ sin x ≤ x, we have that α/2 ≤ ℓ ≤ α. As such, we have that
    area(△i) = hℓ/2 ≥ (α²/8)(α/2)/2 = α³/32 = (2π/N)³/32 ≥ 7/n.
In particular, the probability that △i does not contain a point of P is at most (1 − area(△i))^n ≤ (1 − 7/n)^n ≤ exp(−7). We conclude that, in expectation, at least (1 − exp(−7)) N/2 of these triangles contain points of P. Selecting a point of P from each such triangle results in a set in convex position, which implies the claim. ■
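Checking that a candidate subset is in convex position reduces to a convex hull computation – a point set is in convex position iff every one of its points is a hull vertex. A self-contained sketch (Andrew's monotone chain; the implementation is our own addition):

```python
def convex_hull(points):
    # Andrew's monotone chain; returns the hull vertices in ccw order.
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for chain, seq in ((lower, pts), (upper, list(reversed(pts)))):
        for p in seq:
            # Pop points that would make a non-left (clockwise) turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
    return lower[:-1] + upper[:-1]

def in_convex_position(points):
    # True iff every point is a vertex of the hull (collinear triples
    # are rejected, since a middle point is not a hull vertex here).
    return len(convex_hull(points)) == len(set(points))
```

This is exactly the check one would run on the certificate used later (Theorem 53.3.5), where the certificate for Y ≥ k is the list of k points forming the convex subset.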

It is not hard to show that Y = Ω(n1/3 ), with high probability, see Exercise 53.5.1. This readily implies that
med(Y) = Ω(n1/3 ). It is significantly harder, but known, that E[Y] = O(n1/3 ), see [Val95]. We provide a weaker
but easier upper bound next.

Lemma 53.3.4. Let P be a set of n points picked randomly and uniformly in the unit square [0, 1]². Let Y be the size of the largest subset of points of P that are in convex position. Then, Y = O(n^{1/3} log n / log log n), with high probability.

Proof: Let V be a set of directions of size O(n^c), where c is some constant, such that for any unit vector u there is a vector v ∈ V, such that the angle between u and v is at most 1/n^c. For a vector v ∈ V, consider the grid G(v) with direction v and orthogonal direction v⊥. Every cell of this grid is a rectangle with sidelength 1/n^{1/3} in the direction of v, and 1/n^{2/3} in the orthogonal direction. In addition, the origin is a vertex of G(v). This grid is uniquely defined, and every cell in it has area 1/n. The number of cells of this grid intersecting the unit square is O(n), as can be easily verified.
Let F be the set of all rectangles in all these grids that intersect the unit square. Clearly, the number of such cells is O(|V| n) = O(n^{c+1}). Each rectangle in F has area 1/n, and as such, in expectation, it contains ≤ 1 point of P (the inequality is there because the rectangle might be partially outside the unit square). A standard application of Chernoff's inequality implies that the probability that a rectangle of F contains more than 10c log n / log log n points of P is ≤ 1/n^{2c}. As such, with high probability, no rectangle of F contains more than O(log n / log log n) points of P.
Consider any convex body C ⊆ [0, 1]². The key observation is that ∂C can be covered by O(n^{1/3}) rectangles of F. Indeed, the perimeter of C is at most 4. As such, place O(n^{1/3}) points along ∂C that are at distance at most 1/(10n^{1/3}) from each other. Similarly, place additional O(n^{1/3}) points on ∂C, such that the angle between the tangents at two consecutive points is at most 1/n^{1/3} (in radians) [a vertex of C might be picked repeatedly]. Let Q be the resulting set of points. Consider two consecutive points p, u ∈ Q along ∂C, and observe that the distance between them is at most 1/(10n^{1/3}), and the angle between their two tangents is at most α = 1/n^{1/3}. Consider the triangle △ formed by the two tangents to ∂C at p and u, and the segment pu. This triangle has height bounded by ∥p − u∥ sin α ≤ 1/(10n^{2/3}). It is now straightforward, if somewhat tedious, to argue that one of the rectangles of F must contain △.

Now we are almost done – if the maximum cardinality convex subset Q ⊆ P were larger than c′n^{1/3} log n / log log n, for some constant c′, then let C be the convex hull of this large subset. The above would imply that one of the rectangles of F must contain at least Ω(c′ log n / log log n) points of P, but this does not happen with high probability, for c′ sufficiently large, thus implying the claim. ■

In particular, the above implies that med(Y) = O(n1/3 log n).


Theorem 53.3.5. Let P be a set of n points picked randomly and uniformly in the unit square [0, 1]². Let Y be the size of the largest subset of points of P that are in convex position. Then, for any t > 0, we have
    P[ |Y − med(Y)| ≥ t c n^{1/6} log^{1/2} n ] ≤ 4 exp(−t²/4),
for some constant c.

Proof: Observe that Y is 1-Lipschitz (i.e., changing the location of one point of P can decrease or increase the value of Y by at most 1). In addition, Y is f-certifiable, for f(k) = k, since we only need to list the points that form the convex subset. As such, Theorem 53.2.6 applies. By the above, med(Y) = Ω(n^{1/3}) and med(Y) = O(n^{1/3} log n). Setting b = med(Y), we have
    P[ Y ≤ med(Y) − t√(c n^{1/3} log n) ] · P[Y ≥ med(Y)] ≤ exp(−t²/4).
Similarly, setting b = med(Y) + t√(c n^{1/3} log n) ≤ 2 med(Y), we have
    P[Y ≤ med(Y)] · P[ Y ≥ med(Y) + t√(c n^{1/3} log n) ] ≤ exp(−t²/4).
Since P[Y ≥ med(Y)] ≥ 1/2 and P[Y ≤ med(Y)] ≥ 1/2, putting the two inequalities together, we get
    P[ |Y − med(Y)| ≥ t√(c n^{1/3} log n) ] ≤ 4 exp(−t²/4). ■

53.3.3. Balls into bins revisited


Given n balls, one throws them into b bins, where b ≥ n. A ball that falls into a bin with i or more balls is i-heavy. Let h≥i be the number of i-heavy balls. It turns out that a strong concentration bound on h≥i follows readily from Talagrand's inequality.

Lemma 53.3.6. Consider throwing n balls into b bins, where b ≥ 3n. Then, e^{−2} Fi ≤ E[h≥i] ≤ 6e^{i−1} Fi, where h≥i is the number of i-heavy balls, and Fi = n(n/(ib))^{i−1}. Let βi denote the expected number of pairs of i-heavy balls that are colliding. We have that βi = O( ni (en/(ib))^{i−1} ).

Proof: Let p = 1/b. A specific ball falls into a bin with exactly i balls if exactly i − 1 of the remaining n − 1 balls fall into the same bin. As such, the probability for that is γi = (n−1 choose i−1) p^{i−1} (1 − p)^{n−i}. As such, a specific ball is i-heavy with probability
    α = Σ_{j=i}^{n} γj = Σ_{j=i−1}^{n−1} (n−1 choose j) p^j (1 − p)^{n−j−1} ≤ Σ_{j=i−1}^{n−1} ( e(n−1)/(jb) )^j ≤ 2( en/(b(i−1)) )^{i−1} ≤ 6( en/(ib) )^{i−1},
as (n/i)^i ≤ (n choose i) ≤ (en/i)^i. Since (1 − p)^{n−j−1} ≥ (1 − 1/b)^{b−1} ≥ 1/e, we have
    α ≥ (1/e) Σ_{j=i−1}^{n−1} (n−1 choose j) (1/b)^j ≥ (1/e) ( (n−1)/((i−1)b) )^{i−1} ≥ (1/e²) ( n/(ib) )^{i−1}.
As such, we have E[h≥i] = nα = Θ( n(n/b)^{i−1} ), for constant i.
If a ball is in a bin with exactly j balls, for j ≥ i, then it collides directly with j − 1 other i-heavy balls. Thus, the expected number of collisions that a specific ball has with i-heavy balls is Σ_{j=i}^{n} (j − 1)γj = Σ_{j=i−1}^{n−1} j γ_{j+1}. Summing over all balls, and dividing by two, as every i-heavy collision is counted twice, we have that the expected overall number of such collisions is
    βi = (n/2) Σ_{j=i−1}^{n−1} j γ_{j+1} = (n/2) Σ_{j=i−1}^{n−1} j (n−1 choose j) p^j (1 − p)^{n−j−1} = O( ni ( en/(ib) )^{i−1} ). ■

Lemma 53.3.7. Consider throwing n balls into b bins, where b ≥ 3n. Let i be a small constant integer, let h≥i be the number of i-heavy balls, and let νi = med(h≥i). Assume that νi ≥ 36i²c ln n, where c is some arbitrary constant. Then, for some constant c′, we have that |νi − E[h≥i]| ≤ c′i√νi, and
    P[ |h≥i − νi| ≥ 6i√(cνi ln n) ] ≤ 1/n^c    and    P[ |h≥i − E[h≥i]| ≥ c′i√νi + 6i√(cνi ln n) ] ≤ 1/n^c.
Proof: Observe that h≥i is f-certifiable, for f(k) = k – indeed, the certificate is the list of indices of all the balls that are contained in bins with i or more balls. The variable h≥i is also i-Lipschitz: changing the location of a single ball can turn one bin that contains i balls into a bin that contains only i − 1 balls, thus decreasing h≥i by i.
We require that ti√νi ≤ νi/2; that is, t ≤ √νi/(2i). Theorem 53.2.6, applied with b = νi (and using P[h≥i ≥ νi] ≥ 1/2), implies that
    P[ h≥i ≤ νi − ti√νi ] ≤ 2 exp(−t²/4).    (53.5)
Setting b = νi + 2ti√νi, we have that
    b − ti√b ≥ b − ti√(νi + 2ti√νi) ≥ b − ti√(2νi) = νi + 2ti√νi − ti√(2νi) ≥ νi.
This implies that P[h≥i ≥ b]/2 ≤ P[h≥i ≤ νi] · P[h≥i ≥ b] ≤ P[ h≥i ≤ b − ti√b ] · P[h≥i ≥ b] ≤ exp(−t²/4). We conclude that
    P[ h≥i ≥ νi + 2ti√νi ] ≤ 2 exp(−t²/4).    (53.6)
Combining the above, we get that
    P[ |h≥i − νi| ≥ 2ti√νi ] ≤ 4 exp(−t²/4).
We require that 4 exp(−t²/4) ≤ 1/n^c, which holds for t = 3√(c ln n). We get the inequality P[ |h≥i − νi| ≥ 6i√(cνi ln n) ] ≤ 1/n^c, as claimed. The constraint t ≤ √νi/(2i) translates into the requirement that 6i√(c ln n) ≤ √νi; that is, 36i²c ln n ≤ νi.
Next, we estimate the expectation. We have that
    E[h≥i] ≥ νi − i√νi Σ_{t=1}^{∞} t · P[ h≥i ≤ νi − (t − 1)i√νi ] ≥ νi − i√νi Σ_{t=1}^{∞} 2t exp(−(t − 1)²/4) ≥ νi − 10i√νi,
by Eq. (53.5). Similarly, by Eq. (53.6), we have
    E[h≥i] ≤ νi + 2i√νi Σ_{t=1}^{∞} t · P[ h≥i ≥ νi + 2(t − 1)i√νi ] ≤ νi + 2i√νi Σ_{t=1}^{∞} 2t exp(−(t − 1)²/4) ≤ νi + 20i√νi.
As such, we have that |E[h≥i] − νi| ≤ 30i√νi; namely, c′ ≤ 30. Combining the above inequalities implies the statement of the lemma. ■

Example 53.3.8. Consider throwing n balls into b = n^{4/3} bins. Lemma 53.3.6 implies that e^{−2}Fi ≤ E[h≥i] ≤ 6e^{i−1}Fi, where Fi = n / (i n^{1/3})^{i−1}. As such, E[h≥2] = Θ(n^{2/3}), E[h≥3] = Θ(n^{1/3}), and E[h≥4] = Θ(1).
Applying Lemma 53.3.7, we get that the number of balls that collide (i.e., h≥2) is strongly concentrated around some value ν2 = Θ(n^{2/3}), with the interval where it lies being of length O( √(n^{2/3} log n) ) = O( n^{1/3} √(log n) ).
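The quantities in this example are easy to probe by simulation (the experiment below is our own addition):

```python
import random
from collections import Counter

def heavy_balls(n, b, i, rng):
    # Throw n balls into b bins; return h_{>=i}, the number of balls
    # landing in bins that receive at least i balls.
    counts = Counter(rng.randrange(b) for _ in range(n))
    return sum(c for c in counts.values() if c >= i)

rng = random.Random(0)
n = 1000
h2 = heavy_balls(n, round(n ** (4 / 3)), 2, rng)
# h2 should be on the order of n^(2/3) = 100 here.
```

Repeating the experiment shows h≥2 fluctuating in a window of width roughly n^{1/3}√(log n) around its median, as the example predicts.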

53.4. Bibliographical notes


Our presentation follows closely Alon and Spencer [AS00]. Section 53.3.3 is from Har-Peled and Jones [HJ18].

53.5. Problems
Exercise 53.5.1. Elaborating on the argument of Lemma 53.3.3, prove that, with high probability, a set of n points picked uniformly at random in the unit square contains a convex subset of size Ω(n^{1/3}).

References
[AS00] N. Alon and J. H. Spencer. The Probabilistic Method. 2nd. Wiley InterScience, 2000.
[HJ18] S. Har-Peled and M. Jones. On separating points by lines. Proc. 29th ACM-SIAM Sympos. Dis-
crete Algs. (SODA), 918–932, 2018.
[Val95] P. Valtr. Probability that n random points are in convex position. Discrete Comput. Geom., 13(3):
637–643, 1995.

Chapter 54

Low Dimensional Linear Programming


598 - Class notes for Randomized Algorithms
Sariel Har-Peled
April 2, 2024
“Napoleon has not been conquered by man. He was greater than all of us. But god punished him because he relied on
his own intelligence alone, until that prodigious instrument was strained to breaking point. Everything breaks in the
end.”
– Carl XIV Johan, King of Sweden.

54.1. Linear programming in constant dimension (d > 2)


Let us assume that we have a set H of n linear inequalities defined over d variables, where d is a small constant. Every inequality in H defines a closed half space in R^d. Given a vector →c = (c1, . . . , cd), we want to find a point p = (p1, . . . , pd) ∈ R^d which lies in all the half spaces h ∈ H, such that f(p) = Σi ci pi is maximized. Formally:

LP in d dimensions: (H, →c)
    H – a set of n closed half spaces in R^d
    →c – a vector in d dimensions
    Find p ∈ R^d s.t. ∀h ∈ H we have p ∈ h, and f(p) = ⟨p, →c⟩ is maximized.

A closed half space in d dimensions is defined by an inequality of the form

    a1 x1 + a2 x2 + · · · + ad xd ≤ b.

One difficulty that we ignored earlier is that the optimal solution for the LP might be unbounded; see Figure 54.1. Namely, we can find a solution with value ∞ for the target function.
For a half space h, let η(h) denote the normal of h directed into the feasible region. Let µ(h) denote the closed half space resulting from h by translating it so that it passes through the origin, and let µ(H) be the resulting set of half spaces from H; see Figure 54.1 (b). The new set of constraints µ(H) is depicted in Figure 54.1 (c).

Lemma 54.1.1. (H, →c) is unbounded if and only if (µ(H), →c) is unbounded.

Figure 54.1: (a) Unbounded LP. (b). (c).

Figure 54.2: (a). (b). (c).

Proof: Let ρ be an unbounded ray in the feasible region of (H, →c), and let ρ′ be its translate such that the line containing it passes through the origin. Clearly, ρ′ is an unbounded ray in the feasible region of (µ(H), →c), and the same argument also works in the other direction. See Figure 54.2 (a). ■

Lemma 54.1.2. Deciding if (µ(H), →c) is bounded can be done by solving a (d − 1)-dimensional LP. Furthermore, if it is bounded, then we get a set of d constraints whose intersection proves this. Furthermore, the corresponding set of d constraints in H testifies that (H, →c) is bounded.

Proof: Rotate space such that →c is the vector (0, 0, . . . , 0, 1), and consider the hyperplane g ≡ (xd = 1). Clearly, (µ(H), →c) is unbounded if and only if the region g ∩ (∩_{h∈µ(H)} h) is non-empty. Deciding if this region is non-empty is equivalent to solving the following LP: L′ = (H′, (1, 0, . . . , 0)), where

    H′ = { g ∩ h | h ∈ µ(H) }.

For h ≡ a1 x1 + · · · + ad xd ≤ 0, the region corresponding to g ∩ h is a1 x1 + · · · + a_{d−1} x_{d−1} ≤ −ad, which is a half space in d − 1 dimensions. See Figure 54.2 (b). Thus L′ is a (d − 1)-dimensional LP, because everything happens on the hyperplane xd = 1.
Notice that if (µ(H), →c) is bounded (which happens if and only if (H, →c) is bounded), then L′ is infeasible, and solving L′ would return us a set of d constraints whose intersection is empty. Interpreting those constraints in the original LP results in a set of constraints whose intersection is bounded in the direction of →c. See Figure 54.2 (c).

Figure 54.3: (a). (b). (c).

(In the above example, µ(H) ∩ g is infeasible because the intersection of µ(h2) ∩ g and µ(h1) ∩ g is empty, which implies that h1 ∩ h2 is bounded in the direction →c that we care about; namely, the positive y direction in this figure.) ■

We are now ready to describe the algorithm for solving the LP L = (H, →c). By solving a (d − 1)-dimensional LP we decide whether L is unbounded. If it is unbounded, we are done (we also found the unbounded solution, if you go carefully through the details). See Figure 54.3 (a), where the computed point is p.
In fact, we just computed a set h1, . . . , hd of constraints such that their intersection is bounded in the direction of →c (that is what the boundedness check returned).
Let us randomly permute the remaining half spaces of H, and let h1, h2, . . . , hd, hd+1, . . . , hn be the resulting permutation.

Let vi be the vertex realizing the optimal solution for the LP

    Li = ({h1, . . . , hi}, →c).

There are two possibilities:

1. vi = vi+1. This happens if and only if vi ∈ hi+1, and it can be checked in constant time.

2. vi ≠ vi+1. Then it must be that vi ∉ hi+1, and the new optimal vertex vi+1 must lie on ∂hi+1, as depicted in Figure 54.3 (b).

Let B be the set of d constraints that define vi+1. If hi+1 ∉ B then vi = vi+1. As such, the probability of vi ≠ vi+1 is roughly d/i, because this is the probability that one of the elements of B is hi+1. Indeed, fix the first i + 1 elements, and observe that there are d elements that are marked (those are the elements of B). Thus, we are asking for the probability that one of the d marked elements is the last one in a random permutation of hd+1, . . . , hi+1, which is exactly d/(i + 1 − d).
Note that if some of the elements of B are among h1, . . . , hd, then the above expression only decreases (as there are fewer marked elements).
So, let us restrict our attention to ∂hi+1. Clearly, the optimal solution to Li+1 on ∂hi+1 is the required vi+1. Namely, we solve the LP Li+1 restricted to ∂hi+1 using recursion. This takes T(i + 1, d − 1) time. What is the probability that vi+1 ≠ vi?

Well, one of the d constraints defining vi+1 has to be hi+1. The probability for that is at most 1 for i ≤ 2d − 1, and it is at most

    d / (i + 1 − d)

otherwise. Summarizing everything, we have:

    T(n, d) = O(n) + T(n, d − 1) + Σ_{i=d+1}^{2d} T(i, d − 1) + Σ_{i=2d+1}^{n} (d / (i + 1 − d)) · T(i, d − 1).

What is the solution of this monster? Well, one essentially has to guess the solution and verify it. To guess the solution, let us “simplify” (incorrectly) the recursion to

    T(n, d) = O(n) + T(n, d − 1) + d Σ_{i=2d+1}^{n} T(i, d − 1) / (i + 1 − d).

So think about the recursion tree. Now, every element in the sum is going to contribute a near-constant factor, because we divide it by (roughly) i + 1 − d, and also we are guessing that the optimal solution is linear/near linear.
In every level of the recursion we are going to be penalized by a multiplicative factor of d. Thus, it is natural to conjecture that T(n, d) ≤ (3d)^{3d} n, which can be verified by tedious substitution into the recurrence, and is left as an exercise.
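The conjectured bound can at least be sanity-checked numerically. The sketch below (not part of the notes, and no substitute for the substitution proof) evaluates the recurrence with the O(n) term taken to be n and with T(n, 1) = n, and compares against (3d)^{3d} n for small n and d.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def T(n, d):
    """Evaluate the running-time recurrence, taking O(n) = n and T(n, 1) = n."""
    if d == 1:
        return float(n)
    total = n + T(n, d - 1)
    total += sum(T(i, d - 1) for i in range(d + 1, min(2 * d, n) + 1))
    total += sum(d / (i + 1 - d) * T(i, d - 1) for i in range(2 * d + 1, n + 1))
    return total

for d in (2, 3, 4):
    for n in (50, 200, 800):
        assert T(n, d) <= (3 * d) ** (3 * d) * n  # the conjectured bound holds
```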

Theorem 54.1.3. Given a d-dimensional LP (H, →c), it can be solved in expected O((3d)^{3d} n) time (the constant in the O is independent of the dimension).

By the way, we are being a bit conservative about the constant. In fact, one can prove that the running time is O(d! n), which is still exponential in d.

SolveLP((H, →c))
    /* initialization */
    Rotate (H, →c) s.t. →c = (0, . . . , 0, 1)
    Solve recursively the (d − 1)-dim LP L′ ≡ µ(H) ∩ (xd = 1)
    if L′ has a solution then
        return “Unbounded”
    Let g1, . . . , gd be the set of constraints of L′ that testifies that L′ is infeasible
    Let h1, . . . , hd be the hyperplanes of H corresponding to g1, . . . , gd
    Permute H s.t. h1, . . . , hd are first
    vd ← ∂h1 ∩ ∂h2 ∩ · · · ∩ ∂hd
    /* vd is a vertex that testifies that (H, →c) is bounded */

    /* the algorithm itself */
    for i ← d + 1 to n do
        if vi−1 ∈ hi then
            vi ← vi−1
        else
            vi ← SolveLP((Hi−1 ∩ ∂hi, →c)), where Hi−1 = {h1, . . . , hi−1}    (*)
    return vn
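To make the scheme concrete, here is a minimal Python sketch (not part of the notes) of the d = 2 case. Instead of running the boundedness reduction, it lets a large bounding box play the role of the initial constraints certifying boundedness, and when a new constraint is violated it recurses to the one-dimensional case on the constraint's boundary, as in line (*). The function names and the box size are our own choices.

```python
import random

def solve_1d(cons, c):
    """Maximize c*t over {t : a*t <= b for (a, b) in cons}.
    Returns the optimal t, or None if the constraints are infeasible."""
    lo, hi = -float("inf"), float("inf")
    for a, b in cons:
        if a > 0:
            hi = min(hi, b / a)
        elif a < 0:
            lo = max(lo, b / a)
        elif b < 0:            # 0*t <= b with b < 0: infeasible
            return None
    if lo > hi:
        return None
    return hi if c > 0 else lo

def solve_2d(halfplanes, c, box=1e6):
    """Maximize c.x over the half planes a.x <= b, for (a, b) in halfplanes.
    A bounding box [-box, box]^2 certifies boundedness.
    Raises ValueError if the LP is infeasible."""
    cx, cy = c
    seen = [((1.0, 0.0), box), ((-1.0, 0.0), box),
            ((0.0, 1.0), box), ((0.0, -1.0), box)]
    v = (box if cx >= 0 else -box, box if cy >= 0 else -box)
    rest = list(halfplanes)
    random.shuffle(rest)                       # the random permutation
    for (ax, ay), b in rest:
        if ax * v[0] + ay * v[1] <= b + 1e-9:  # v_i = v_{i-1}
            seen.append(((ax, ay), b))
            continue
        # v violates the new constraint: the new optimum lies on a.x = b.
        # Parameterize the line as p + t*d and solve a 1-dim LP (line (*)).
        norm2 = ax * ax + ay * ay
        p = (ax * b / norm2, ay * b / norm2)   # a point on the line
        d = (-ay, ax)                          # a direction along the line
        cons = []
        for (ux, uy), ub in seen:
            # u.(p + t*d) <= ub  becomes  (u.d) * t <= ub - u.p
            cons.append((ux * d[0] + uy * d[1],
                         ub - (ux * p[0] + uy * p[1])))
        t = solve_1d(cons, cx * d[0] + cy * d[1])
        if t is None:
            raise ValueError("LP is infeasible")
        v = (p[0] + t * d[0], p[1] + t * d[1])
        seen.append(((ax, ay), b))
    return v
```

For instance, maximizing x + 2y subject to x ≤ 2, y ≤ 3 and x + y ≤ 4 via solve_2d([((1, 0), 2), ((0, 1), 3), ((1, 1), 4)], (1, 2)) yields the vertex (1, 3). The expected number of one-dimensional solves is O(log n) by the backwards analysis above, so the expected running time is linear in n.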

54.2. Handling Infeasible Linear Programs


In the above discussion, we glossed over the question of how to handle LPs which are infeasible. This requires slightly modifying our algorithm, and we describe only the required modifications.
First, consider the simplest case, where we are given an LP L which is one dimensional (i.e., defined over one variable). Clearly, we can solve this LP in linear time (verify!), and furthermore, if there is no solution, we can return two input inequalities ax ≤ b and cx ≥ d that have no common solution (i.e., these two constraints testify that the LP is not satisfiable).
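The one-dimensional case can be spelled out explicitly. A sketch (not part of the notes; the names are our own): keep the tightest upper-bound and lower-bound constraints; if they cross, this pair is exactly the witness of infeasibility.

```python
def solve_lp_1d(constraints, maximize=True):
    """Each constraint (a, b), with a != 0, means a*x <= b.
    Returns ("opt", x) for an optimal solution (possibly infinite if the
    LP is unbounded), or ("infeasible", (c1, c2)) with two witness
    constraints whose intersection is already empty."""
    hi, lo = float("inf"), -float("inf")
    best_hi = best_lo = None       # tightest constraint of each kind
    for a, b in constraints:
        if a > 0 and b / a < hi:
            hi, best_hi = b / a, (a, b)
        elif a < 0 and b / a > lo:
            lo, best_lo = b / a, (a, b)
    if lo > hi:
        return "infeasible", (best_lo, best_hi)
    return "opt", (hi if maximize else lo)
```

For example, the constraints x ≤ 5, −x ≤ 2, 2x ≤ 8 give the maximum x = 4, while x ≤ 0 together with −x ≤ −1 is reported infeasible with the witness pair ((−1, −1), (1, 0)).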
Next, assume that when the algorithm SolveLP is called on a (d − 1)-dimensional LP L′, if L′ is not feasible it returns the d constraints of L′ that together have an empty intersection. Namely, those constraints are the witnesses that L′ is infeasible.
The only place where we can get such an answer is when computing vi (in the line (*) of the algorithm). Let h′1, . . . , h′d be the corresponding set of d constraints of Hi−1 that testify that (Hi−1 ∩ ∂hi, →c) is an infeasible LP. Clearly, h′1, . . . , h′d, hi is a set of d + 1 constraints that together are infeasible, and that is what SolveLP returns.

54.3. References
The description in these class notes is loosely based on the description of low dimensional LP in the book of de Berg et al. [BCKO08].

References
[BCKO08] M. de Berg, O. Cheong, M. J. van Kreveld, and M. H. Overmars. Computational Geometry:
Algorithms and Applications. 3rd. Santa Clara, CA, USA: Springer, 2008.

Chapter 55

Algorithmic Version of Lovász Local Lemma


As for me, they took away my toy merchant, wishing with him to banish all toys from the world.
    – The Tin Drum, Günter Grass
55.1. Introduction
Let P be a collection of independent random variables in some probability space Ω. We are interested in (bad) events that are determined by these variables. Specifically, for an event B, there is a set of variables S ⊆ P that determines if B happens. The minimal such set of variables is denoted by vbl(B) (note that it is unique). We assume that the set vbl(B) is either easily computable or available, for any event of interest.
Consider a (finite) family B of such (bad) events. The dependency graph G = G(B) has the events of B as its vertices, and two events B1, B2 ∈ B are connected by an edge ⇐⇒ vbl(B1) ∩ vbl(B2) ≠ ∅. Let Γ(B) be the set of all neighbors of B in G, and let Γ+(B) = {B} ∪ Γ(B). Observe that B is independent of the events in B \ ({B} ∪ Γ(B)).
A specific assignment to the variables of S violates B if it causes it to happen (remember, it is a bad event).

Task. Our purpose is to compute an assignment to the variables of P, such that none of the bad events of B
happens.

Algorithm. Initially, the algorithm assigns the variables of P random values. As long as there is a violated
event B ∈ B, resample all the variables of vbl(B) (independently according to their own distributions) – this is
a resampling of B. The algorithm repeats this till no event is violated.

Finer details. We fix some arbitrary strategy (randomized or deterministic) of how to pick the next event. This now fully specifies the algorithm.
Remark. Of course, it is not clear that the algorithm always succeeds. Let us just assume, for the time being, that we are in a case where the algorithm always finds a good assignment (which always exists).
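The resampling algorithm is short enough to implement directly. Below is a sketch (not part of the notes) for CNF formulas, where the bad event of a clause is "the clause is falsified" and vbl(B) is the clause's set of variables; the strategy for picking the next violated event is to pick one uniformly at random, and the instance, names, and round cap are our own.

```python
import random

def moser_tardos(n_vars, clauses, rng, max_rounds=100_000):
    """clauses: list of clauses, each a list of literals (v, sign) meaning the
    literal is true when variable v has value sign.  Returns an assignment
    (list of bools) under which every clause has a true literal."""
    x = [rng.random() < 0.5 for _ in range(n_vars)]

    def violated(cl):
        return all(x[v] != sign for v, sign in cl)   # every literal false

    for _ in range(max_rounds):
        bad = [cl for cl in clauses if violated(cl)]
        if not bad:
            return x
        for v, _ in rng.choice(bad):                 # resample vbl(B)
            x[v] = rng.random() < 0.5
    raise RuntimeError("did not converge; the LLL condition may fail")

rng = random.Random(1)
clauses = [[(0, True), (1, False), (2, True)],
           [(2, False), (3, True), (4, True)],
           [(4, False), (5, True), (0, False)]]
x = moser_tardos(6, clauses, rng)
assert not any(all(x[v] != s for v, s in cl) for cl in clauses)
```

For a k-CNF in which every clause shares variables with at most 2^k/e − 1 other clauses, the condition of Theorem 55.1.1 below holds, and the expected number of resamplings is linear in the number of clauses.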

55.1.1. Analysis
Let L(i) ∈ B be the event that was resampled in the ith iteration of the algorithm, for i > 0. The sequence
formed by L is the log of the execution.
A witness tree T = (T, σT) is a rooted tree T together with a labeling σT. Here, every node v ∈ T is labeled by an event σT(v) ∈ B. If a node u is a child of v in T, then σT(u) ∈ Γ+(σT(v)). Two nodes in a tree are siblings if they have a common parent. If all siblings have distinct labels then the witness tree is proper.

For a vertex v of T , let [v] = σT (v).

Theorem 55.1.1. Let P be a finite set of independent random variables in a probability space, and let B be a set of (bad) events determined by these variables. If there is an assignment x : B → (0, 1), such that

    ∀B ∈ B:   P[B] ≤ x(B) ∏_{C ∈ Γ(B)} (1 − x(C)),

then there exists an assignment for the variables of P such that no event in B happens. Furthermore, in expectation, the algorithm described above resamples any event B ∈ B at most x(B)/(1 − x(B)) times. Overall, the expected number of resampling steps is at most Σ_{B∈B} x(B)/(1 − x(B)).

Chapter 56

Some math stuff



56.1. Some useful estimates


Lemma 56.1.1. For any n ≥ 2 and m ≥ 1, we have that (1 − 1/n)^m ≥ 1 − m/n.

Proof: Follows by induction. Indeed, for m = 1 the claim is immediate. For m ≥ 2, we have

    (1 − 1/n)^m = (1 − 1/n) (1 − 1/n)^{m−1} ≥ (1 − 1/n) (1 − (m − 1)/n) = 1 − m/n + (m − 1)/n² ≥ 1 − m/n. ■

This implies the following.

Lemma 56.1.2. For any m ≤ n, we have that 1 − m/n ≤ (1 − 1/n)^m ≤ exp(−m/n).
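These two inequalities are easy to check numerically; a quick verification (not part of the notes):

```python
import math

# check 1 - m/n <= (1 - 1/n)^m <= exp(-m/n) for a range of n and m
for n in (2, 10, 1000):
    for m in range(1, n + 1):
        low = 1 - m / n
        middle = (1 - 1 / n) ** m
        high = math.exp(-m / n)
        assert low <= middle <= high + 1e-12, (n, m)
```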

Bibliography

[ABKU99] Y. Azar, A. Broder, A. Karlin, and E. Upfal. Balanced allocations. SIAM Journal on Computing,
29(1): 180–200, 1999.
[ABN08a] I. Abraham, Y. Bartal, and O. Neiman. Nearly tight low stretch spanning trees. Proc. 49th Annu.
IEEE Sympos. Found. Comput. Sci. (FOCS), 781–790, 2008.
[ABN08b] I. Abraham, Y. Bartal, and O. Neiman. Nearly tight low stretch spanning trees. CoRR, abs/0808.2017,
2008. arXiv: 0808.2017.
[AES99] P. K. Agarwal, A. Efrat, and M. Sharir. Vertical decomposition of shallow levels in 3-dimensional
arrangements and its applications. SIAM J. Comput., 29: 912–953, 1999.
[Aga04] P. K. Agarwal. Range searching. Handbook of Discrete and Computational Geometry. Ed. by
J. E. Goodman and J. O’Rourke. 2nd. Boca Raton, FL, USA: CRC Press LLC, 2004. Chap. 36,
pp. 809–838.
[AI06] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in
high dimensions. Proc. 47th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 459–468, 2006.
[AI08] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in
high dimensions. Commun. ACM, 51(1): 117–122, 2008.
[AKPW95] N. Alon, R. M. Karp, D. Peleg, and D. West. A graph-theoretic game and its application to the
k-server problem. SIAM J. Comput., 24(1): 78–100, 1995.
[AMS98] P. K. Agarwal, J. Matoušek, and O. Schwarzkopf. Computing many faces in arrangements of
lines and segments. SIAM J. Comput., 27(2): 491–505, 1998.
[AMS99] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency
moments. J. Comput. Syst. Sci., 58(1): 137–147, 1999.
[AN04] N. Alon and A. Naor. Approximating the cut-norm via grothendieck’s inequality. Proc. 36th
Annu. ACM Sympos. Theory Comput. (STOC), 72–80, 2004.
[AR94] N. Alon and Y. Roichman. Random cayley graphs and expanders. Random Struct. Algorithms,
5(2): 271–285, 1994.
[Aro98] S. Arora. Polynomial time approximation schemes for Euclidean TSP and other geometric prob-
lems. J. Assoc. Comput. Mach., 45(5): 753–782, 1998.
[AS00] N. Alon and J. H. Spencer. The Probabilistic Method. 2nd. Wiley InterScience, 2000.
[ASS08] N. Alon, O. Schwartz, and A. Shapira. An elementary construction of constant-degree expanders.
Combin. Probab. Comput., 17(3): 319–327, 2008.
[Bar96] Y. Bartal. Probabilistic approximations of metric space and its algorithmic application. Proc.
37th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 183–193, 1996.

[Bar98] Y. Bartal. On approximating arbitrary metrics by tree metrics. Proc. 30th Annu. ACM Sympos.
Theory Comput. (STOC), 161–168, 1998.
[BCKO08] M. de Berg, O. Cheong, M. J. van Kreveld, and M. H. Overmars. Computational Geometry:
Algorithms and Applications. 3rd. Santa Clara, CA, USA: Springer, 2008.
[BDS95] M. de Berg, K. Dobrindt, and O. Schwarzkopf. On lazy randomized incremental construction.
Discrete Comput. Geom., 14: 261–286, 1995.
[BK90] A. Z. Broder and A. R. Karlin. Multilevel adaptive hashing. Proc. 1th ACM-SIAM Sympos. Dis-
crete Algs. (SODA), 43–53, 1990.
[Bol98] B. Bollobas. Modern Graph Theory. Springer-Verlag, 1998.
[BS95] M. de Berg and O. Schwarzkopf. Cuttings and applications. Int. J. Comput. Geom. Appl., 5: 343–
355, 1995.
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge, 2004.
[BY98] J.-D. Boissonnat and M. Yvinec. Algorithmic Geometry. Cambridge University Press, 1998.
[CCH09] C. Chekuri, K. L. Clarkson., and S. Har-Peled. On the set multi-cover problem in geometric
settings. Proc. 25th Annu. Sympos. Comput. Geom. (SoCG), 341–350, 2009.
[CF90] B. Chazelle and J. Friedman. A deterministic view of random sampling and its use in geometry.
Combinatorica, 10(3): 229–249, 1990.
[Che86] L. P. Chew. Building Voronoi diagrams for convex polygons in linear expected time. Technical
Report PCS-TR90-147. Hanover, NH: Dept. Math. Comput. Sci., Dartmouth College, 1986.
[CKR04] G. Călinescu, H. J. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension prob-
lem. SIAM J. Comput., 34(2): 358–372, 2004.
[Cla87] K. L. Clarkson. New applications of random sampling in computational geometry. Discrete Com-
put. Geom., 2: 195–222, 1987.
[Cla88] K. L. Clarkson. Applications of random sampling in computational geometry, II. Proc. 4th Annu.
Sympos. Comput. Geom. (SoCG), 1–11, 1988.
[CLRS01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press
/ McGraw-Hill, 2001.
[CMS93] K. L. Clarkson, K. Mehlhorn, and R. Seidel. Four results on randomized incremental construc-
tions. Comput. Geom. Theory Appl., 3(4): 185–212, 1993.
[CS00] S. Cho and S. Sahni. A new weight balanced binary search tree. Int. J. Found. Comput. Sci.,
11(3): 485–513, 2000.
[CS89] K. L. Clarkson and P. W. Shor. Applications of random sampling in computational geometry, II.
Discrete Comput. Geom., 4(5): 387–421, 1989.
[DIIM04] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based
on p-stable distributions. Proc. 20th Annu. Sympos. Comput. Geom. (SoCG), 253–262, 2004.
[DP09] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized
Algorithms. Cambridge University Press, 2009.
[DS00] P. G. Doyle and J. L. Snell. Random walks and electric networks. ArXiv Mathematics e-prints,
2000. eprint: math/0001057.

[EEST08] M. Elkin, Y. Emek, D. A. Spielman, and S. Teng. Lower-stretch spanning trees. SIAM J. Comput.,
38(2): 608–628, 2008.
[EHS14] D. Eppstein, S. Har-Peled, and A. Sidiropoulos. On the Greedy Permutation and Counting Dis-
tances. manuscript. 2014.
[FRT04] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by
tree metrics. J. Comput. Sys. Sci., 69(3): 485–497, 2004.
[GG81] O. Gabber and Z. Galil. Explicit constructions of linear-sized superconcentrators. J. Comput.
Syst. Sci., 22(3): 407–420, 1981.
[GLS93] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimiza-
tion. 2nd. Vol. 2. Algorithms and Combinatorics. Berlin Heidelberg: Springer-Verlag, 1993.
[Gre69] W. Greg. Why are Women Redundant? Trübner, 1869.
[GRSS95] M. Golin, R. Raman, C. Schwarz, and M. Smid. Simple randomized algorithms for closest pair
problems. Nordic J. Comput., 2: 3–27, 1995.
[Gup00] A. Gupta. Embeddings of Finite Metrics. PhD thesis. University of California, Berkeley, 2000.
[GW95] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut
and satisfiability problems using semidefinite programming. J. Assoc. Comput. Mach., 42(6):
1115–1145, 1995.
[Har00a] S. Har-Peled. Constructing planar cuttings in theory and practice. SIAM J. Comput., 29(6): 2016–
2039, 2000.
[Har00b] S. Har-Peled. Taking a walk in a planar arrangement. SIAM J. Comput., 30(4): 1341–1367, 2000.
[Har11] S. Har-Peled. Geometric Approximation Algorithms. Vol. 173. Math. Surveys & Monographs.
Boston, MA, USA: Amer. Math. Soc., 2011.
[Hås01a] J. Håstad. Some optimal inapproximability results. J. Assoc. Comput. Mach., 48(4): 798–859,
2001.
[Hås01b] J. Håstad. Some optimal inapproximability results. J. ACM, 48(4): 798–859, 2001.
[HJ18] S. Har-Peled and M. Jones. On separating points by lines. Proc. 29th ACM-SIAM Sympos. Dis-
crete Algs. (SODA), 918–932, 2018.
[HLW06] S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bulletin Amer.
Math. Soc., 43: 439–561, 2006.
[HR15] S. Har-Peled and B. Raichel. Net and prune: A linear time algorithm for Euclidean distance
problems. J. Assoc. Comput. Mach., 62(6): 44:1–44:35, 2015.
[HW87] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete Comput. Geom., 2: 127–
151, 1987.
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of di-
mensionality. Proc. 30th Annu. ACM Sympos. Theory Comput. (STOC), 604–613, 1998.
[Ind01] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. Proc. 42nd Annu.
IEEE Sympos. Found. Comput. Sci. (FOCS), Tutorial. 10–31, 2001.
[JL84] W. B. Johnson and J. Lindenstrauss. Extensions of lipschitz mapping into hilbert space. Contem-
porary Mathematics, 26: 189–206, 1984.
[Kel56] J. L. Kelly. A new interpretation of information rate. Bell Sys. Tech. J., 35(4): 917–926, 1956.

[KKMO04] S. Khot, G. Kindler, E. Mossel, and R. O’Donnell. Optimal inapproximability results for max
cut and other 2-variable csps. Proc. 45th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), To
appear in SICOMP. 146–154, 2004.
[KKT91] C. Kaklamanis, D. Krizanc, and T. Tsantilas. Tight bounds for oblivious routing in the hypercube.
Math. sys. theory, 24(1): 223–232, 1991.
[KLMN05] R. Krauthgamer, J. R. Lee, M. Mendel, and A. Naor. Measured descent: a new embedding
method for finite metric spaces. Geom. funct. anal. (GAFA), 15(4): 839–858, 2005.
[KOR00] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor
in high dimensional spaces. SIAM J. Comput., 2(30): 457–474, 2000.
[LM00] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection.
Ann. Statist., 28(5): 1302–1338, 2000.
[Mat02] J. Matoušek. Lectures on Discrete Geometry. Vol. 212. Grad. Text in Math. Springer, 2002.
[Mat92] J. Matoušek. Reporting points in halfspaces. Comput. Geom. Theory Appl., 2(3): 169–186, 1992.
[Mat98] J. Matoušek. On constants for cuttings in the plane. Discrete Comput. Geom., 20: 427–448, 1998.
[Mat99] J. Matoušek. Geometric Discrepancy. Vol. 18. Algorithms and Combinatorics. Springer, 1999.
[McD89] C. McDiarmid. Surveys in Combinatorics. Ed. by J. Siemons. Cambridge University Press, 1989.
Chap. On the method of bounded differences.
[Mil76] G. L. Miller. Riemann’s hypothesis and tests for primality. J. Comput. Sys. Sci., 13(3): 300–317,
1976.
[MN08] M. Mendel and A. Naor. Towards a calculus for non-linear spectral gaps. manuscript. 2008.
[MN98] J. Matoušek and J. Nešetřil. Invitation to Discrete Mathematics. Oxford Univ Press, 1998.
[MNP06] R. Motwani, A. Naor, and R. Panigrahi. Lower bounds on locality sensitive hashing. Proc. 22nd
Annu. Sympos. Comput. Geom. (SoCG), 154–157, 2006.
[MOO05] E. Mossel, R. O’Donnell, and K. Oleszkiewicz. Noise stability of functions with low influences
invariance and optimality. Proc. 46th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), 21–30,
2005.
[MP80] J. I. Munro and M. Paterson. Selection and sorting with limited storage. Theo. Comp. Sci., 12:
315–323, 1980.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University
Press, 1995.
[MS09] M. Mendel and C. Schwob. Fast c-k-r partitions of sparse graphs. Chicago J. Theor. Comput.
Sci., 2009, 2009.
[MU05] M. Mitzenmacher and E. Upfal. Probability and Computing – randomized algorithms and probabilistic analysis. Cambridge, 2005.
[Mul89] K. Mulmuley. An efficient algorithm for hidden surface removal. Comput. Graph., 23(3): 379–
388, 1989.
[Mul94] K. Mulmuley. Computational Geometry: An Introduction Through Randomized Algorithms. En-
glewood Cliffs, NJ: Prentice Hall, 1994.
[Nor98] J. R. Norris. Markov Chains. Statistical and Probabilistic Mathematics. Cambridge Press, 1998.

[Rab76] M. O. Rabin. Probabilistic algorithms. Algorithms and Complexity: New Directions and Recent
Results. Ed. by J. F. Traub. Orlando, FL, USA: Academic Press, 1976, pp. 21–39.
[Rab80] M. O. Rabin. Probabilistic algorithm for testing primality. J. Number Theory, 12(1): 128–138,
1980.
[RVW02] O. Reingold, S. Vadhan, and A. Wigderson. Entropy waves, the zig-zag graph product, and new
constant-degree expanders and extractors. Annals Math., 155(1): 157–187, 2002.
[SA95] M. Sharir and P. K. Agarwal. Davenport-Schinzel Sequences and Their Geometric Applications.
New York: Cambridge University Press, 1995.
[SA96] R. Seidel and C. R. Aragon. Randomized search trees. Algorithmica, 16: 464–497, 1996.
[Sch79] A. Schönhage. On the power of random access machines. Proc. 6th Int. Colloq. Automata Lang.
Prog. (ICALP), vol. 71. 520–529, 1979.
[Sei93] R. Seidel. Backwards analysis of randomized geometric algorithms. New Trends in Discrete and
Computational Geometry. Ed. by J. Pach. Vol. 10. Algorithms and Combinatorics. Springer-
Verlag, 1993, pp. 37–68.
[Sha03] M. Sharir. The Clarkson-Shor technique revisited and extended. Comb., Prob. & Comput., 12(2):
191–201, 2003.
[Smi00] M. Smid. Closest-point problems in computational geometry. Handbook of Computational Ge-
ometry. Ed. by J.-R. Sack and J. Urrutia. Amsterdam, The Netherlands: Elsevier, 2000, pp. 877–
935.
[Sni85] M. Snir. Lower bounds on probabilistic linear decision trees. Theor. Comput. Sci., 38: 69–82,
1985.
[Ste12] E. Steinlight. Why novels are redundant: sensation fiction and the overpopulation of literature.
ELH, 79(2): 501–535, 2012.
[Val95] P. Valtr. Probability that n random points are in convex position. Discrete Comput. Geom., 13(3):
637–643, 1995.
[VC71] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of
events to their probabilities. Theory Probab. Appl., 16: 264–280, 1971.
[Vöc03] B. Vöcking. How asymmetry helps load balancing. J. ACM, 50(4): 568–589, 2003.
[Wes01] D. B. West. Introduction to Graph Theory. 2nd. Prentice Hall, 2001.
[WG75] H. W. Watson and F. Galton. On the probability of the extinction of families. J. Anthrop. Inst.
Great Britain, 4: 138–144, 1875.

Index

(k, n) decoding function, 275 Abraham, Ittai, 260


(k, n) encoding function, 275 active, 228
(n, d)-graph, 287 Adaptive estimation of a quadratic functional by model
(n, d, c)-expander, 152 selection, 159
2-universal, 65 adjacency matrix, 287
C-Lipschitz, 252 Agarwal, P. K., 192, 330–332
K-bi-Lipschitz, 252 algorithm
T -distance, 352 Alg, 27, 47, 48, 60, 136, 152, 154, 219–224, 304
δ-expander, 287 Alg, 27
Fi -measurable, 136 Contract, 118, 119
ρ-sample, 245 decrease-key, 226
σ-algebra, 23 delete-min, 226
σ-field, 135 EuclidGCD, 336
c-Lipschitz, 137 EuclidGCD, 336
regular, d, 212 FastCut, 118, 119
f -certifiable, 355 Jacobi, 345
i-heavy, 359 Las Vegas, 47, 223
k-HST, 253 Lookup, 63, 64
k-median clustering, 253
lookup, 63
k-test, 184
MinCut, 115–117, 119
k-wise independent, 57, 61
MinCutRep, 117–119
kth frequency moment, 171
Monte Carlo, 47, 223
t-heavy, 246
QuickSort, 33–35, 47, 83, 84, 87, 105, 106, 223,
t-step transition probability, 201
229
ε-nets and simplex range queries, 241, 248
QuickSelect, 34–36
pdf, 177
randomized, 27
Davenport-Schinzel Sequences and Their Geometric
RandomRoute, 107
Applications, 332
Algorithmic applications of low-distortion geometric
A deterministic view of random sampling and its use embeddings, 260
in geometry, 331 Algorithmic Geometry, 330, 332
A graph-theoretic game and its application to the k- Alon, N., 260, 285, 301, 303, 361
server problem, 260 Alon, Noga, 176
A New Interpretation of Information Rate, 93 amortize, 65
A New Weight Balanced Binary Search Tree, 87 An efficient algorithm for hidden surface removal, 331
A tight bound on approximating arbitrary metrics by An elementary construction of constant-degree expanders,
tree metrics, 260 301
abelian, 338 Andoni, A., 191

Andoni, Alexandr, 191 Building Voronoi diagrams for convex polygons in lin-
Applications of random sampling in computational ge- ear expected time, 331
ometry, II, 330, 331 butterfly, 324
approximate near neighbor, 181
approximate nearest neighbor, 181, 189–191 Călinescu, G., 253, 260
Approximate Nearest Neighbors: Towards Removing Catalan number, 197, 201
the Curse of Dimensionality, 161, 191, 192 central limit theorem, 89
Approximating the cut-norm via Grothendieck’s inequal- certified vertex, 323
ity, 285 chaining, 64
approximation characteristic vector, 292
maximization problem, 27 Chazelle, B., 331
Approximation Algorithms for the 0-Extension Prob- Chekuri, C., 332
lem, 253, 260 Chernoff inequality, 95
approximation factor, 304 simplified form, 95
APX Chervonenkis, A. Y., 236, 241
Hard, 303 Chew, L. P., 331
Aragon, C. R., 87 chi-square distribution with k degrees of freedom, 158
Arora, S., 260 Cho, S., 87
arrangement, 315, 317 Cholesky decomposition, 284
atomic event, 23, 135 Clarkson, K. L., 330, 331
average-case analysis, 18 Clarkson, Kenneth L., 330
Azar, Y., 148
Clarkson., K. L., 332
backwards analysis, 225 clause
Backwards Analysis of Randomized Geometric Algo- dangerous, 309
rithms, 229, 331 survived, 310
bad, 67 Closest-Point Problems in computational geometry, 76
balanced, 197 clusters, 253
Balanced Allocations, 148 CNF, 27
ball, 251 collide, 64
Bartal, Y., 260 coloring, 30
Bartal, Yair, 260 combinatorial dimension, 323
Berg, M. de, 330, 331, 367 commutative group, 338
Bernoulli distribution, 25 commute time, 206
bi-tension, 291 Complexity
binary code, 269 co−, 47, 224
binary symmetric channel, 275 BPP, 48, 224
binomial NP, 47, 223
estimates, 243 PP, 48, 224
binomial distribution, 26 P, 47, 223
bit fixing, 107 RP, 48, 224
black-box access, 227 ZPP, 48, 224
Boissonnat, J.-D., 330, 332 Computational Geometry: Algorithms and Applica-
Bollobas, B., 216 tions, 330, 367
bounded differences, 133 Computational Geometry: An Introduction Through
Boyd, S., 284 Randomized Algorithms, 331, 332
Broder, Andrei Z., 148 Compute, 66

Computing Many Faces in Arrangements of Lines and distance
Segments, 330, 331 Hamming, 181
Concentration of Measure for the Analysis of Random- distortion, 252
ized Algorithms, 103 distribution
conditional expectation, 81, 131 normal, 155, 189, 190
conditional probability, 24, 113 multi-dimensional, 157
confidence, 44 stable
conflict graph, 319 2, 189
conflict list, 319 p, 189
congruent modulo n, 336 distributivity of multiplication over addition, 340
congruent modulo p, 57, 66 divides, 57, 66
consistent labeling, 297 Dobrindt, K., 331
Constructing planar cuttings in theory and practice, dominating, 197
330, 331 Doob martingale, 137
contraction doubly stochastic, 206
edge, 114 Doyle, P. G., 210
convex hull, 243 Dubhashi, Devdatt P., 103
Convex Optimization, 284 Dyck word, 197
convex position, 358 Dyck words, 201
convex programming, 281 Dynamic, 63
convex-hull, 352
coprime, 335 edge, 315, 317
Cormen, T. H., 73 effective resistance, 207
cover time, 206 Efficient Search for Approximate Nearest Neighbor in
critical, 75 High Dimensional Spaces, 191
crossing number, 314 Efrat, A., 332
cuckoo hashing, 64 eigenvalue, 215, 287
cumulative distribution function, 177 eigenvalues, 215
cut, 113 eigenvector, 287
minimum, 113 electrical network, 207
cuts, 113 elementary event, 23, 135
cutting, 327 Elkin, Michael, 260
Cuttings and applications, 331 embedding, 252, 314
cyclic, 339 Embeddings of Finite Metrics, 260
CNF, 304 entropy, 263, 272
binary, 263
Datar, M., 190, 191 Entropy waves, the zig-zag graph product, and new
defining set, 323 constant-degree expanders and extractors, 301
degree, 45 epochs, 211
Delaunay Eppstein, D., 229
circle, 324 escalated choices, 147
triangle, 324 estimate, 236
dependency graph, 307, 369 Euler totient function, 337
dimension event, 23
combinatorial, 323 expander
discrepancy, 121 [n, d, δ]-expander, 287
discrepancy of χ, 121 [n, d, c]-expander, 218

    c, 218
Expander Graphs and Their Applications, 301
expectation, 24
Explicit Constructions of Linear-Sized Superconcentrators, 217
Extensions of Lipschitz mapping into Hilbert space, 161
extraction function, 266
face, 317
faces, 315
Fakcharoenphol, J., 260
family, 64
Fast C-K-R Partitions of Sparse Graphs, 229
field, 293, 340
filter, 135
filtration, 135
final strong component, 201
finite metric, 227
first order statistic, 177
Fold, 66
Four results on randomized incremental constructions, 331
Friedman, J., 331
fully explicit, 220
function
    sensitive, 184
Gabber, Ofer, 217
Galil, Zvi, 217
Galton, F., 111
Gaussian, 157
generator, 339
Geometric Algorithms and Combinatorial Optimization, 284
Geometric Approximation Algorithms, 247
Geometric Discrepancy, 103, 123
geometric distribution, 26
Goemans, M. X., 284
Golin, M., 76
Grötschel, M., 284
graph
    d-regular, 287
    labeled, 212
    lollipop, 206
Greg, W. R., 111
grid, 73
grid cell, 73
grid cluster, 73
ground set, 235
group, 336, 338
growth function, 238, 247
Gupta, A., 260
gcd, 335
Håstad, J., 284
Hamming distance, 181
harmonic number, 34
Har-Peled, S., 229, 247, 330–332
Har-Peled, Sariel, 76, 361
Haussler, D., 241, 248
heavy
    t-heavy, 325
height, 144
Hierarchically well-separated tree, 253
history, 200
hitting time, 206
Hoeffding's inequality, 103
Hoory, S., 301
How Asymmetry Helps Load Balancing, 148
HST, 253
HST, 253, 256, 257
Huffman coding, 270
hypercube, 106
    d-dimensional, 181
Håstad, J., 28
identity element, 338
Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming, 284
independent, 24, 55
indicator variable, 27
Indyk, P., 161, 191, 192, 260
Indyk, Piotr, 191
inequality
    Hoeffding, 103
Introduction to Algorithms, 73
Introduction to Graph Theory, 216
Invitation to Discrete Mathematics, 194
inverse, 57, 66
Jacobi symbol, 343
Johnson, W. B., 161
Jones, Mitchell, 361
Kaklamanis, C., 107
Karlin, Anna R., 148
Karloff, H. J., 253, 260
Kelly criterion, 93
Kelly, J. L., 93
Khot, S., 284, 285
Kirchhoff's law, 207
Krauthgamer, R., 260
Krizanc, D., 107
Kushilevitz, E., 191
Laurent, B., 159
Law of quadratic reciprocity, 343, 344
lazy randomized incremental construction, 331
Lectures on Discrete Geometry, 260, 332
Legendre symbol, 342
level, 166, 315
    k-level, 315
Lindenstrauss, J., 161
Linearity of expectation, 25
Linial, N., 301
Lipschitz, 252
    bi-Lipschitz, 252
Lipschitz condition, 137
load, 144
load factor, 64
Locality-sensitive hashing scheme based on p-stable distributions, 190, 191
log, 369
lollipop graph, 206
long, 298
Lovász, L., 284
Lower bounds on locality sensitive hashing, 191
Lower Bounds on Probabilistic Linear Decision Trees, 150
Lower-Stretch Spanning Trees, 260
lucky, 228
lcm, 335
LSH, 183, 184, 189, 191
Markov chain, 200
    aperiodic, 202
    ergodic, 202
Markov Chains, 195, 202
martingale, 137
    edge exposure, 132
    vertex exposure, 132
martingale difference, 136
martingale sequence, 132
Massart, P., 159
Matias, Yossi, 176
Matoušek, J., 103, 123, 194, 260, 330–332
max cut, 303
maximization problem, 27
maximum cut, 303
maximum cut problem, 281
McDiarmid, C., 103
measure, 235
Measured descent: A new embedding method for finite metric spaces, 260
median, 357
median estimator, 170
Mehlhorn, K., 331
memorylessness property, 200
Mendel, M., 229, 301
metric, 227, 251
metric space, 227, 251–261
Miller, G. L., 348
mincut, 113
Mitzenmacher, M., 109, 268, 274, 280
Modern Graph Theory, 216
moments technique, 330
    all regions, 323
monomial, 45
Mossel, E., 285
Motwani, R., 48, 54, 79, 103, 109, 119, 134, 161, 191, 192, 224, 348
Mulmuley, K., 331, 332
multi-dimensional normal distribution, 157
Multilevel Adaptive Hashing, 148
Munro, J. I., 167
Naor, A., 191, 285, 301
Nešetřil, J., 194
near neighbor
    data-structure
        approximate, 181, 184, 188, 190
Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions, 191
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, 191
Nearly Tight Low Stretch Spanning Trees, 260
Neiman, Ofer, 260
net, 227
    ε-net, 241
ε-net theorem, 241, 248
Net and Prune: A Linear Time Algorithm for Euclidean Distance Problems, 76
New applications of random sampling in computational geometry, 330
Noise stability of functions with low influences invariance and optimality, 285
normal distribution, 155, 160, 189, 190
    multi-dimensional, 157
Norris, J. R., 195, 202
NP, 28, 284, 303
    complete, 27, 76, 281, 303, 304
    hard, 27, 281
O'Donnell, R., 285
oblivious, 107
Ohm's law, 207
Oleszkiewicz, K., 285
On approximating arbitrary metrics by tree metrics, 260
On constants for cuttings in the plane, 331
On lazy randomized incremental construction, 331
On Separating Points by Lines, 361
On the Greedy Permutation and Counting Distances, 229
On the Power of Random Access Machines, 76
On the Probability of the Extinction of Families, 111
On the set multi-cover problem in geometric settings, 332
On the uniform convergence of relative frequencies of events to their probabilities, 236, 241
open ball, 251
Optimal inapproximability results for Max Cut and other 2-variable CSPs, 284, 285
OR-concentrator, 151
order, 339
orthonormal eigenvector basis, 289
Ostrovsky, R., 191
pairwise independent, 55
Panconesi, Alessandro, 103
Panigrahi, R., 191
Paterson, M., 167
perfect matching, 46
periodicity, 202
Polynomial time approximation schemes for Euclidean TSP and other geometric problems, 260
positive semidefinite, 284
prefix code, 269
prefix-free, 269
prime, 66, 335
prime factorization, 337
Probabilistic algorithm for testing primality, 348
Probabilistic algorithms, 76
Probabilistic approximations of metric space and its algorithmic application, 260
probabilistic distortion, 256
probabilities, 23
Probability
    Amplification, 117
probability, 24
Probability and Computing – randomized algorithms and probabilistic analysis, 109, 268, 274, 280
probability density function, 177
probability measure, 23, 135
probability space, 24, 135
Probability that n random points are in convex position, 358
Problem
    3SAT
        3SAT Max, 27
problem
    3SAT, 27, 28
    Max 3SAT, 27
    MAX CUT, 281
    MAX-SAT, 304–306
    Sorting Nuts and Bolts, 86
projection, 182
proper, 369
quadratic residue, 341
quotation
    Carl XIV Johan, King of Sweden, 363
quotient, 57, 66, 335
quotient group, 338
Rabani, Y., 191, 253, 260
Rabin, M. O., 76, 348
Radon's theorem, 237
Raghavan, P., 48, 54, 79, 103, 109, 119, 134, 224, 348
Raichel, Benjamin, 76
Random Cayley Graphs and Expanders, 301
random graphs, 132
random incremental construction, 318, 323, 327
    lazy, 331
random sample, 241, 242, 248, 315, 320, 321, 324, 325, 327–330
    ε-sample, 241
random variable, 24, 256
random walk, 193
Random Walks and Electric Networks, 210
Randomized Algorithms, 48, 54, 79, 103, 109, 119, 134, 224, 348
randomized rounding, 305
Randomized search trees, 87
range, 235
Range searching, 192
range space, 235
    projection, 236
rank, 35, 39, 87
Rao, S., 260
Reingold, O., 301
relative pairwise distance, 218
remainder, 57, 66, 336
replacement product, 298
Reporting points in halfspaces, 332
resampling, 369
residue, 336
resistance, 207, 211
Riemann's Hypothesis and Tests for Primality, 348
Roichman, Y., 301
running-time
    expected, 87
Sahni, S., 87
sample
    ε-sample, 241
    ε-sample theorem, 241
    ε-sample, 240
sample space, 23
Schönhage, Arnold, 76
Schrijver, A., 284
Schwartz, O., 301
Schwarzkopf, O., 330, 331
Schwob, C., 229
Seidel, R., 87, 229, 331
Selection and Sorting with Limited Storage, 167
sensitive function, 184
set
    defining, 323
    stopping, 323
shallow cuttings, 332
Shapira, A., 301
Sharir, M., 331, 332
shatter function, 239
shattered, 236
Shor, Peter W., 330
short, 298
siblings, 369
Sidiropoulos, A., 229
sign, 46
Simple randomized algorithms for closest pair problems, 76
size, 63
Smid, M., 76
Snell, J. L., 210
Snir, Marc, 150
Some optimal inapproximability results, 28, 284
spectral gap, 218, 293
Spencer, J. H., 303, 361
spread, 257
squaring, 299
standard deviation, 25
standard normal distribution, 155
state
    aperiodic, 202
    ergodic, 202
    non null, 201
    null persistent, 201
    periodic, 202
    persistent, 201
    transient, 201
state probability vector, 202
Static, 63
stationary distribution, 202
Steinlight, E., 111
stochastic, 206
stopping set, 323
streaming, 167
strong component, 201
sub martingale, 136
subgraph
    unique, 310
subgroup, 338
successful, 122, 311
super martingale, 136
Surveys in Combinatorics, 103
symmetric, 215
Szegedy, Mario, 176
Taking a Walk in a Planar Arrangement, 331
Talwar, K., 260
tension, 288
The Probabilistic Method, 303, 361
The Space Complexity of Approximating the Frequency Moments, 176
The Clarkson-Shor Technique Revisited and Extended, 331
theorem
    ε-net, 241, 248
    Radon's, 237
    ε-sample, 241
Tight bounds for oblivious routing in the hypercube, 107
Towards a calculus for non-linear spectral gaps, 301
transition matrix, 218, 287
transition probabilities matrix, 200
transition probability, 200
traverse, 212
treap, 84
tree
    code trees, 269
    prefix tree, 269
triangle, 356
true, 186
Tsantilas, T., 107
Turing machine
    log space, 212
union bound, 24
uniqueness, 75
universal traversal sequence, 212
Upfal, E., 109, 268, 274, 280
Vöcking, Berthold, 148
Vadhan, S., 301
Valtr, P., 358
Vandenberghe, L., 284
Vandermonde matrix, 58
Vapnik, V. N., 236, 241
variance, 25
VC
    dimension, 236
vertex, 315
vertical decomposition, 318
    vertex, 318
Vertical decomposition of shallow levels in 3-dimensional arrangements and its applications, 332
vertical trapezoid, 318
vertices, 317
violates, 369
volume, 352
walk, 212
Watson, H. W., 111
weight
    region, 323
Welzl, E., 241, 248
West, D. B., 216
Why are Women Redundant?, 111
Why Novels are Redundant: Sensation Fiction and the Overpopulation of Literature, 111
width, 73
Wigderson, A., 301
Williamson, D. P., 284
witness tree, 369
word, 173
Yvinec, M., 330, 332
zero, 45
zero set, 45
zig-zag, 298
zig-zag product, 298
zig-zag-zig path, 298