
COS 521: A Graduate Course in Algorithm Design and Analysis

Sanjeev Arora
Department of Computer Science
Princeton University

March 22, 2015


Abstract

These are lecture notes from a graduate course for Computer Science graduate students at
Princeton University in Fall 2013 and Fall 2014. (The course also attracts undergrads and
non-CS grads and total enrollment in 2014 was 35.) The course assumes prior knowledge
of algorithms at the undergraduate level.
This new course arose out of thinking about the right algorithms training for CS grads to-
day, since the environment for algorithms design and use has greatly changed since the 1980s
when the canonical grad algorithms courses were designed. The course topics are somewhat
nontraditional, and some of the homeworks involve simple programming assignments that
encourage students to play with algorithms using simple environments like Matlab and
Scipy.
Since this is the last theory course many of my students (grad or undergrad) might take
for the rest of their lives, I think of the scope as more than just algorithms; my goal is to
make them look at the world anew with a mathematical/algorithmic eye. For instance, I
discovered many holes in my students’ undergraduate CS education: information/coding
theory, economic utility and game theory, decision-making under uncertainty, cryptography
(anything beyond the RSA cryptosystem), etc. So I created space for these topics as well,
figuring that the value added by this was greater than by, say, presenting detailed analysis
of the Ellipsoid algorithm (which I sketch instead).
Programming assignments went out of fashion in most algorithms courses in the past
few decades, but I think it is time to bring them back. First, CS students are used to a
hands-on learning experience; an algorithm becomes real only once they see it run on real
data. Second, the computer science world increasingly relies on off-the-shelf packages and
library routines, and this is how algorithms are implemented in industry. One can write a
few lines of code in matlab and scipy, and run it within minutes on datasets of millions or
billions of numbers. No JAVA or C++ needed! Algorithms education should give students
at least a taste of such powerful tools. Finally, even for theory students it can be very
beneficial to play with algorithms and data a bit; this will help them develop a different
kind of theory.
The course gives students a choice between taking a 48-hour final, or doing a collab-
orative term project. Some sample term projects can be found at the course home page.
http://www.cs.princeton.edu/courses/archive/fall14/cos521/
This course is very much a work in progress, and I welcome your feedback and sugges-
tions.
I thank numerous colleagues for useful suggestions during the design of this course.
Above all, I thank my students for motivating me to teach them better; their feedback and
questions have helped shape these course notes.

Sanjeev Arora
March 2015
Contents

1 Hashing 9
1.1 Hashing: Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Hash Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 2-Universal Hash Families . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 Karger’s Min Cut Algorithm 14


2.1 Analysis of Karger’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Improvement by Karger-Stein . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Large deviations bounds and applications 18


3.1 Three progressively stronger tail bounds . . . . . . . . . . . . . . . . . . . . 18
3.1.1 Markov’s Inequality (aka averaging) . . . . . . . . . . . . . . . . . . 18
3.1.2 Chebyshev’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.3 Large deviation bounds . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Application 1: Sampling/Polling . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Balls and Bins revisited: Load balancing . . . . . . . . . . . . . . . . . . . . 21
3.4 What about the median? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Hashing with real numbers and their big-data applications 23


4.1 Estimating the cardinality of a set that’s too large to store . . . . . . . . . 24
4.2 Estimating document similarity . . . . . . . . . . . . . . . . . . . . . . . . . 25

5 Stable matchings, stable marriages and price of anarchy 27

6 Linear Thinking 28
6.1 Simplest example: Solving systems of linear equations . . . . . . . . . . . . 28
6.2 Systems of linear inequalities and linear programming . . . . . . . . . . . . 29
6.3 Linear modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.4 Meaning of polynomial-time . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

7 Provable Approximation via Linear Programming 35


7.1 Deterministic Rounding (Weighted Vertex Cover) . . . . . . . . . . . . . . . 35
7.2 Simple randomized rounding: MAX-2SAT . . . . . . . . . . . . . . . . . . . 36
7.3 Dependent randomized rounding: Virtual circuit routing . . . . . . . . . . . 37

8 Decision-making under uncertainty: Part 1 40
8.1 Decision-making as dynamic programming . . . . . . . . . . . . . . . . . . . 41
8.2 Markov Decision Processes (MDPs) . . . . . . . . . . . . . . . . . . . . . . . 42
8.3 Optimal MDP policies via LP . . . . . . . . . . . . . . . . . . . . . . . . . 44

9 Decision-making under total uncertainty: the multiplicative weight algorithm 46
9.1 Motivating example: weighted majority algorithm . . . . . . . . . . . . . . 46
9.1.1 Randomized version . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.2 The Multiplicative Weights algorithm . . . . . . . . . . . . . . . . . . . . . 49

10 Applications of multiplicative weight updates: LP solving, Portfolio Management 52
10.1 Solving systems of linear inequalities . . . . . . . . . . . . . . . . . . . . . . 52
10.1.1 Duality Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
10.2 Portfolio Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

11 High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction 57
11.1 Number of almost-orthogonal vectors . . . . . . . . . . . . . . . . . . . . . . 58
11.2 Curse of dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
11.3 Dimension Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
11.3.1 Locality preserving hashing . . . . . . . . . . . . . . . . . . . . . . . 61
11.3.2 Dimension reduction for efficiently learning a linear classifier . . . . 61

12 Random walks, Markov chains, and how to analyse them 63


12.1 Recasting a random walk as linear algebra . . . . . . . . . . . . . . . . . . . 65
12.1.1 Mixing Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
12.2 Upper bounding the mixing time (undirected d-regular graphs) . . . . . . . 67
12.3 Analysis of Mixing Time for General Markov Chains . . . . . . . . . . . . . 68

13 Intrinsic dimensionality of data and low-rank approximations: SVD 71


13.1 View 1: Inherent dimensionality of a dataset . . . . . . . . . . . . . . . . . 71
13.2 View 2: Low rank matrix approximations . . . . . . . . . . . . . . . . . . . 72
13.3 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 74
13.3.1 General matrices: Singular values . . . . . . . . . . . . . . . . . . . . 75
13.4 View 3: Directions of Maximum Variance . . . . . . . . . . . . . . . . . . . 75

14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) 77
14.1 SVD computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
14.1.1 The power method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
14.2 Recovering planted bisections . . . . . . . . . . . . . . . . . . . . . . . . . . 78
14.2.1 Eigenvalues of random matrices . . . . . . . . . . . . . . . . . . . . . 80

15 Semidefinite Programs (SDPs) and Approximation Algorithms 83
15.1 Max Cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
15.2 0.878-approximation for MAX-2SAT . . . . . . . . . . . . . . . . . . . . . . 85

16 Going with the slope: offline, online, and randomly 87


16.1 Gradient descent for convex functions: univariate case . . . . . . . . . . . . 88
16.2 Convex multivariate functions . . . . . . . . . . . . . . . . . . . . . . . . . . 89
16.3 Gradient Descent for Constrained Optimization . . . . . . . . . . . . . . . . 91
16.4 Online Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
16.5 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
16.6 Portfolio Management via Online gradient descent . . . . . . . . . . . . . . 95
16.7 Hints of more advanced ideas . . . . . . . . . . . . . . . . . . . . . . . . . . 96

17 Oracles, Ellipsoid method and their uses in convex optimization 98


17.1 Linear programs too big to write down . . . . . . . . . . . . . . . . . . . . . 98
17.2 A general formulation of convex programming . . . . . . . . . . . . . . . . . 99
17.2.1 Presenting a convex body: separation oracles . . . . . . . . . . . . . 100
17.3 Ellipsoid Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

18 Duality and MinMax Theorem 104


18.1 Linear Programming and Farkas’ Lemma . . . . . . . . . . . . . . . . . . . 104
18.2 LP Duality Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
18.3 Example: Max Flow Min Cut theorem in graphs . . . . . . . . . . . . . . . 107
18.4 Game theory and the minmax theorem . . . . . . . . . . . . . . . . . . . . . 108

19 Equilibria and algorithms 110


19.1 Nonzero sum games and Nash equilibria . . . . . . . . . . . . . . . . . . . . 110
19.2 Multiplayer games and Bandwidth Sharing . . . . . . . . . . . . . . . . . . 112
19.3 Correlated equilibria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

20 Protecting against information loss: coding theory 116


20.1 Shannon’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
20.2 Finite fields and polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . 118
20.3 Reed Solomon codes and their decoding . . . . . . . . . . . . . . . . . . . . 119
20.4 Code concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

21 Counting and Sampling Problems 121


21.1 Counting vs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
21.1.1 Monte Carlo method . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
21.2 Dyer’s algorithm for counting solutions to KNAPSACK . . . . . . . . . . . 124

22 Taste of cryptography: Secret sharing and secure multiparty computation 126


22.1 Shamir’s secret sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
22.2 Multiparty computation: the model . . . . . . . . . . . . . . . . . . . . . . 127
22.3 Easy protocol: linear combinations of inputs . . . . . . . . . . . . . . . . . . 128
22.4 General protocol: + and × suffice . . . . . . . . . . . . . . . . . . . . . . . 128

23 Real-life environments for big-data computations (MapReduce etc.) 130
23.1 Parallel Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
23.2 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

24 Heuristics: Algorithms we don’t know how to analyze 133


24.1 Davis-Putnam procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
24.2 Local search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
24.3 Difficult instances of 3SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
24.4 Random SAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
24.5 Metropolis-Hastings and Computational statistics . . . . . . . . . . . . . . . 137

About this course
Algorithms are integral to computer science: every computer scientist (even as an un-
dergrad) has designed some. So has many a physicist, electrical engineer, mathematician
etc. This course is meant to be your one-stop shop to learn how to design a variety of
algorithms. The operative word is “variety.” In other words you will avoid the blinders that
one often sees in domain experts. A Bayesian needs to see priors on the data before he
can begin designing algorithms; an optimization expert needs to cast all problems as con-
vex optimization; a systems designer has never seen any problem that cannot be solved by
hashing. (OK, mostly kidding but there is some truth in these stereotypes.) These and
more domain-specific ideas make an appearance in our course, but we will learn to not be
wedded to any single approach.
The primary skill you will learn in this course is how to analyse algorithms: prove their
correctness and their running time and any other relevant properties. Learning to analyse a
variety of algorithms (designed by others) will let you design better algorithms later in life.
I will try to fill the course with beautiful algorithms. Be prepared for frequent rose-smelling
stops, in other words.

Difference between undergrad algorithms and this course


Undergrad algorithms is largely about algorithms discovered before 1990; grad algo-
rithms is a lot about algorithms discovered since 1990. OK, I picked 1990 as an arbitrary
cutoff. Maybe it is 1985, or 1995. What happened in 1990 that caused this change, you
may ask? Nothing. It was no single event but just a gradual shift in the emphasis and goals
of computer science as it became a more mature field.
In the first few decades of computer science, algorithms research was driven by the goal
of designing basic components of a computer: operating systems, compilers, networks, etc.
Other motivations were classical problems in discrete mathematics, operations research,
graph theory. The algorithmic ideas that came out of these quests form the core of un-
dergraduate course: data structures, graph traversal, string matching, parsing, network
flows, etc. Starting around 1990 theoretical computer science broadened its horizons and
started looking at new problems: algorithms for bioinformatics, algorithms and mechanism
design for e-commerce, algorithms to understand big data or big networks. This changed
algorithms research and the change is ongoing. One big change is that it is often unclear
what the algorithmic problem even is— identifying it is part of the challenge. Thus good
modeling is important. This in turn is shaped by understanding what is possible (given our
understanding of computational complexity) and what is reasonable given the limitations
of the type of inputs we are given.
Some examples of this change:

The changing graph. In undergrad algorithms the graph is given and arbitrary (worst-
case). In grad algorithms we are willing to look at the domain (social network, computer
vision etc.) that the graph came from since the properties of graphs in those domains may
be germane to designing a good algorithm. (This is not a radical idea of course but we will
see that formulating good graph models is not easy. This is why you see a lot of heuristic
work in practice, without any mathematical proofs of correctness.)

Changing data structures: In undergrad algorithms the data structures were simple
and often designed to hold data generated by other algorithms (and hence under our con-
trol). A stack allows you to hold vertices during depth-first search traversal of a graph, or
instances of a recursive call to a procedure. A heap is useful for sorting and searching.
But in the newer applications, data often comes from sources we don’t control. Thus it
may be noisy, or inexact, or both. It may be high dimensional. Thus something like heaps
will not work, and we need more advanced data structures.
We will encounter the “curse of dimensionality,” which constrains algorithm design for
high-dimensional data.

Changing notion of input/output: Algorithms in your undergrad course have a simple


input/output model. But increasingly we see a more nuanced interpretation of what the
input is: datastreams (useful in analytics involving routers and webservers), online (sequence
of requests), social network graphs, etc. And there is a corresponding subtlety in settling
on what an appropriate output is, since we have to balance output quality with algorithmic
efficiency. In fact, design of a suitable algorithm often goes hand in hand with understanding
what kind of output is reasonable to hope for.

Type of analysis: In undergrad algorithms the algorithms were often exact and work on
all (i.e., worst-case) inputs. In grad algorithms we are willing to relax these requirements.

Chapter 1

Hashing

Today we briefly study hashing, both because it is such a basic data structure, and because
it is a good setting to develop some fluency in probability calculations.

1.1 Hashing: Preliminaries


Hashing can be thought of as a way to rename an address space. For instance, a router at
the internet backbone may wish to have a searchable database of destination IP addresses
of packets that are whizzing by. An IP address is 128 bits, so the number of possible IP
addresses is 2^128, which is too large to let us have a table indexed by IP addresses. Hashing
allows us to rename each IP address by fewer bits. Furthermore, this renaming is done
probabilistically, and the renaming scheme is decided in advance before we have seen the
actual addresses. In other words, the scheme is oblivious to the actual addresses.
Formally, we want to store a subset S of a large universe U (where |U| = 2^128 in the
above example). And |S| = m is a relatively small subset. For each x ∈ U , we want to
support 3 operations:

• insert(x). Insert x into S.

• delete(x). Delete x from S.

• query(x). Check whether x ∈ S.

A hash table can support all these 3 operations. We design a hash function

h : U −→ {0, 1, . . . , n − 1} (1.1)

such that x ∈ U is placed in T [h(x)], where T is a table of size n.


Since |U| ≫ n, multiple elements can be mapped into the same location in T, and we
deal with these collisions by constructing a linked list at each location in the table.
One natural question to ask is: how long is the linked list at each location?
This can be analysed under two kinds of assumptions:

1. Assume the input is random.


Figure 1.1: Hash table. x is placed in T [h(x)].

2. Assume the input is arbitrary, but the hash function is random.

Assumption 1 may not be valid for many applications.


Hashing is a concrete method towards Assumption 2. We designate a set of hash func-
tions H, and when it is time to hash S, we choose a random function h ∈ H and hope
that on average we will achieve good performance for S. This is a frequent benefit of a
randomized approach: no single hash function works well for every input, but the average
hash function may be good enough.

1.2 Hash Functions


Say we have a family of hash functions H, and for each h ∈ H, h : U −→ [n].¹ What do
we mean if we say these functions are random?
For any x_1, x_2, . . . , x_m ∈ S (x_i ≠ x_j when i ≠ j), and any a_1, a_2, . . . , a_m ∈ [n], ideally a
random H should satisfy:

• Pr_{h∈H}[h(x_1) = a_1] = 1/n.

• Pr_{h∈H}[h(x_1) = a_1 ∧ h(x_2) = a_2] = 1/n^2. Pairwise independence.

• Pr_{h∈H}[h(x_1) = a_1 ∧ h(x_2) = a_2 ∧ · · · ∧ h(x_k) = a_k] = 1/n^k. k-wise independence.

• Pr_{h∈H}[h(x_1) = a_1 ∧ h(x_2) = a_2 ∧ · · · ∧ h(x_m) = a_m] = 1/n^m. Full independence (recall
that |S| = m).

Generally speaking, we encounter a tradeoff. The more random H is, the greater the
number of random bits needed to generate a function h from this class, and the higher the
cost of computing h.
For example, if H is a fully random family, there are n^m possible h, since each of the
m elements of S has n possible locations it can hash to. So we need log |H| = m log n
bits to represent each hash function. Since m is usually very large, this is not practical.
¹We use [n] to denote the set {0, 1, . . . , n − 1}.

But the advantage of a random hash function is that it ensures very few collisions with
high probability. Let Lx be the length of the linked list containing x; this is just the number
of elements with the same hash value as x. Define the random variable

I_y = 1 if h(y) = h(x), and 0 otherwise.    (1.2)

So L_x = 1 + \sum_{y ∈ S, y ≠ x} I_y, and

E[L_x] = 1 + \sum_{y ∈ S, y ≠ x} E[I_y] = 1 + \frac{m − 1}{n}.    (1.3)

Usually we choose n > m, so this expected length is less than 2. Later we will analyse
this in more detail, asking how likely is Lx to exceed say 100.
The expectation calculation above doesn’t need full independence; pairwise indepen-
dence would actually suffice. This motivates the next idea.

1.3 2-Universal Hash Families


Definition 1 (Carter-Wegman 1979) A family H of hash functions is 2-universal if for
any x ≠ y ∈ U,

Pr_{h∈H}[h(x) = h(y)] ≤ 1/n.    (1.4)

Note that this property is even weaker than pairwise (2-wise) independence.
We can design 2-universal hash families in the following way. Choose a prime p ∈
{|U |, . . . , 2|U |}, and let

f_{a,b}(x) = ax + b mod p,   where a, b ∈ [p] and a ≠ 0,    (1.5)

and let

h_{a,b}(x) = f_{a,b}(x) mod n.    (1.6)
Lemma 1
For any x_1 ≠ x_2 and s ≠ t, the following system

a x_1 + b = s mod p    (1.7)
a x_2 + b = t mod p    (1.8)

has exactly one solution.

Since [p] constitutes a finite field, we have a = (x_1 − x_2)^{−1}(s − t) and b = s − a x_1.
Since we have p(p − 1) different hash functions in H in this case,

Pr_{a,b}[f_{a,b}(x_1) = s ∧ f_{a,b}(x_2) = t] = \frac{1}{p(p − 1)}.    (1.9)
Claim: H = {h_{a,b} : a, b ∈ [p], a ≠ 0} is 2-universal.

Proof: For any x_1 ≠ x_2,

Pr[h_{a,b}(x_1) = h_{a,b}(x_2)]    (1.10)
  = \sum_{s,t ∈ [p], s ≠ t} δ(s = t mod n) · Pr[f_{a,b}(x_1) = s ∧ f_{a,b}(x_2) = t]    (1.11)
  = \frac{1}{p(p − 1)} \sum_{s,t ∈ [p], s ≠ t} δ(s = t mod n)    (1.12)
  ≤ \frac{1}{p(p − 1)} · \frac{p(p − 1)}{n}    (1.13)
  = \frac{1}{n},    (1.14)

where δ is the indicator (delta) function. Equation (1.13) follows because for each s ∈ [p], we
have at most (p − 1)/n different t such that s ≠ t and s = t mod n. □
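To make this concrete, here is a small Python sketch of the family H above; the particular prime, table size, and the sanity check at the end are illustrative choices of mine, not part of the notes.

import random
from collections import defaultdict

P = 4294967311          # a prime a little larger than 2**32, so [P] covers a 32-bit universe
N = 2000                # hash table size n

def draw_hash(n=N, p=P):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod n at random from the 2-universal family H."""
    a = random.randrange(1, p)       # a != 0
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % n

# Sanity check: hash m = 1000 arbitrary keys; by (1.15) below the expected number of
# colliding pairs is at most (m choose 2)/n, which is about 250 here.
h = draw_hash()
keys = random.sample(range(2**32), 1000)
buckets = defaultdict(list)
for x in keys:
    buckets[h(x)].append(x)
print(sum(len(b) * (len(b) - 1) // 2 for b in buckets.values()), "colliding pairs")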
Can we design a collision-free hash table then? Say we have m elements, and the hash
table is of size n. Since for any x_1 ≠ x_2, Pr_h[h(x_1) = h(x_2)] ≤ 1/n, the expected number of
total collisions is just

E[\sum_{x_1 ≠ x_2} 1\{h(x_1) = h(x_2)\}] = \sum_{x_1 ≠ x_2} E[1\{h(x_1) = h(x_2)\}] ≤ \binom{m}{2} \frac{1}{n}.    (1.15)

Let’s pick n ≥ m^2. Then

E[number of collisions] ≤ 1/2,    (1.16)

and so

Pr_{h∈H}[∃ a collision] ≤ 1/2.    (1.17)

So if the size of the hash table is large enough (n ≥ m^2), we can easily find a collision-free
hash function. But in reality, such a large table is often unrealistic. We may use a two-layer
hash table to avoid this problem.
Specifically, let s_i denote the number of elements hashed to location i. If we construct a
second-layer table of size s_i^2 for location i, we can easily find a collision-free hash function to store those s_i
elements. Thus the total size of the second-layer hash tables is \sum_i s_i^2.
Note that \sum_i s_i(s_i − 1) counts the collisions calculated in Equation (1.15) (each colliding pair twice),
so

E[\sum_i s_i^2] = E[\sum_i s_i(s_i − 1)] + E[\sum_i s_i] = \frac{m(m − 1)}{n} + m ≤ 2m.    (1.18)
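The two-layer idea is also easy to prototype. Below is a minimal sketch (essentially FKS-style perfect hashing); the redraw-until-collision-free loop and the specific hash family are my choices, consistent with the analysis above but not code from the notes.

import random

P = 4294967311   # a prime a little larger than 2**32

def draw_hash(n, p=P):
    a, b = random.randrange(1, p), random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % n

def build_two_layer(keys):
    """First layer: n = len(keys) buckets. Second layer: a bucket holding s_i keys gets
    its own table of size s_i**2, redrawing its hash until it is collision-free."""
    n = len(keys)
    h1 = draw_hash(n)
    buckets = [[] for _ in range(n)]
    for x in keys:
        buckets[h1(x)].append(x)
    layers = []
    for bucket in buckets:
        size = max(1, len(bucket) ** 2)
        while True:                                  # O(1) expected redraws since size >= s_i**2
            h2 = draw_hash(size)
            if len({h2(x) for x in bucket}) == len(bucket):
                break
        table = [None] * size
        for x in bucket:
            table[h2(x)] = x
        layers.append((h2, table))
    return h1, layers

def contains(x, h1, layers):
    h2, table = layers[h1(x)]
    return table[h2(x)] == x

keys = random.sample(range(2**32), 500)
h1, layers = build_two_layer(keys)
assert all(contains(x, h1, layers) for x in keys)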

1.4 Load Balancing


Now we think a bit about how large the linked lists (i.e., the number of collisions) can get. Let
us think for simplicity about hashing n keys in a hash table of size n. This is the famous
Figure 1.2: Two layer hash tables.

balls-and-bins calculation, also called load balance problem. We have n balls and n bins,
and we randomly put the balls into bins. Then for a given i,
 
Pr[bin i gets more than k elements] ≤ \binom{n}{k} \cdot \frac{1}{n^k} ≤ \frac{1}{k!}.    (1.19)

By Stirling’s formula,

k! ∼ \sqrt{2\pi k}\,(k/e)^k.    (1.20)

If we choose k = O(\frac{\log n}{\log\log n}), we can make \frac{1}{k!} ≤ \frac{1}{n^2}. Then

Pr[∃ a bin with ≥ k balls] ≤ n \cdot \frac{1}{n^2} = \frac{1}{n}.    (1.21)

So with probability larger than 1 − 1/n,²

max load ≤ O(\frac{\log n}{\log\log n}).    (1.22)

Aside: The above load balancing is not bad; no more than O(log n / log log n) balls in a bin with
high probability. Can we modify the method of throwing balls into bins to improve the load
balancing? We use an idea that you use at the supermarket checkout: instead of going to
a random checkout counter you try to go to the counter with the shortest queue. In the
load balancing case this is computationally too expensive: one has to check all n queues.
A much simpler version is the following: when the ball comes in, pick 2 random bins, and
place the ball in the one that has fewer balls. Turns out this modified rule ensures that the
maximal load drops to O(log log n), which is a huge improvement. This is called the power
of two choices.

²This can easily be improved to 1 − 1/n^c for any constant c.
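Both effects are easy to see in a quick simulation; the little sketch below (with arbitrary parameters) is just an illustration of mine, not part of the notes.

import random

def max_load(n, choices=1):
    """Throw n balls into n bins; each ball inspects `choices` random bins and goes to the emptiest."""
    bins = [0] * n
    for _ in range(n):
        candidates = [random.randrange(n) for _ in range(choices)]
        best = min(candidates, key=lambda i: bins[i])
        bins[best] += 1
    return max(bins)

n = 100000
print("one random choice :", max_load(n, 1))   # typically Theta(log n / log log n)
print("two random choices:", max_load(n, 2))   # typically O(log log n), noticeably smaller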
Chapter 2

Karger’s Min Cut Algorithm

Today’s topic is simple but gorgeous: Karger’s min cut algorithm and its extension. It is a
simple randomized algorithm for finding the minimum cut in a graph: a subset of vertices
S in which the set of edges leaving S, denoted E(S, S̄), has minimum size among all subsets.
You may have seen a polynomial-time algorithm for this problem in your undergrad class
that uses maximum flow. Karger’s algorithm is much more elementary and a great
introduction to randomized algorithms.
The algorithm is this: Pick a random edge, and merge its endpoints into a single “su-
pernode.” Repeat until the graph has only two supernodes, which are output as our guess for
the min-cut. (As you continue, the supernodes may develop parallel edges; these are allowed.
Selfloops are ignored.) See Figure 2.1.
Note that if you pick a random edge, it is more likely to come from parts of the graph that
contain more edges in the first place. Thus this algorithm looks like a great heuristic to try on
all kinds of real-life graphs, where one wants to cluster the nodes into “tightly-knit”portions.
For example, social networks may cluster into communities; graphs where edges capture
similarity of pixels may cluster to give different portions of the image (sky, grass, road etc.).
Thus instead of continuing Karger’s algorithm until you have two supernodes left, you could
stop it when there are k supernodes and try to understand whether these correspond to a
reasonable clustering. (Aside: There are much better clustering algorithms out there.)
Today we will first see that the above version of the algorithm yields the optimum min
cut with probability at least 2/n^2. Thus we can repeat it say 20n^2 times, and output the
smallest cut seen in any iteration. The probability that the optimum cut is not seen in any
repetition is at most (1 − 2/n^2)^{20n^2} < 0.01. Unfortunately, this simple version has running
time about n^4 which is not great. So then we see a better version with a simple tweak that
brings the running time down to closer to n^2.
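Here is a compact Python sketch of one run of the contraction algorithm plus the repetition wrapper. The edge-list representation and the union-find bookkeeping for supernodes are implementation choices of mine, and the graph is assumed connected; this is an illustration, not code from the notes.

import random

def karger_cut(n, edges):
    """One run of Karger's contraction algorithm on a connected multigraph with
    vertices 0..n-1 and edge list `edges`. Returns the size of the resulting cut."""
    parent = list(range(n))

    def find(x):                          # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    supernodes = n
    while supernodes > 2:
        # Picking a random original edge and skipping self-loops is the same as
        # picking a uniformly random edge of the current contracted multigraph.
        u, v = random.choice(edges)
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv               # contract: merge the two supernodes
            supernodes -= 1
    return sum(1 for u, v in edges if find(u) != find(v))

def min_cut(n, edges, reps=None):
    reps = reps if reps is not None else 3 * n * n   # on the order of n^2 repetitions, as discussed above
    return min(karger_cut(n, edges) for _ in range(reps))

# Toy example: two triangles joined by one bridge edge; the minimum cut has size 1.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(min_cut(6, edges))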

2.1 Analysis of Karger’s algorithm


Clearly, the two supernodes at the end correspond to a cut of the original graph, so the
algorithm does always return a cut.
Main Claim: The cut at the end is a minimum cut of the original graph with probability at
least 2/n(n − 1).


Thus repeating the algorithm K times where K = n(n − 1)/2 and taking the smallest
cut ever discovered in these repetitions will yield a minimum cut with chance at least
1 − (1 − 1/K)^K ≈ 1 − 1/e. (Aside: the approximation (1 − 1/K)^K ≈ 1/e for large K is
very useful and will reappear in later lectures.) It is relatively easy using data structures
you learnt in undergrad algorithms to implement each repetition of the algorithm in O(n^2)
time, so the overall running time is O(n^4).
The analysis is rather simple. First, recall that the sum of node degrees in an undirected
graph G = (V, E) is exactly 2 |E| (since adding the degrees counts each edge twice). Thus
if |V | = n, there exists a node of degree at most 2 |E| /n. Putting this vertex on one side
of the cut and all other vertices on the other side gives a cut of size at most 2 |E| /n. Thus
the minimum cut cannot have any more than 2 |E| /n edges.
Let (S, S̄) be any minimum cut. Then the probability that a random edge picked at the
first step by Karger’s algorithm lies in this particular cut is at most (2 |E| /n)/|E| = 2/n. If
it doesn’t lie in the cut, then contracting the edge maintains (S, S̄) as a viable cut in the
graph.
Since each edge contraction reduces the number of nodes by 1, the total number of edge
pickings is n − 1 and the probability that (S, S) survives all of them is

Pr[first edge not in this cut]×Pr[second edge not in this cut | first edge was not in the cut]×· · · ,

which is at least
2 2 2 3 1
(1 − )(1 − )(1 − ) × ··· × ×
n n−1 n−2 4 2
n−2 n−3 n−4 3 1
=( )( )( ) × ··· × × (Note: telescoping!!)
2 n−1 n−2 4 2
2
=
n(n − 1)

Aside: We have proven a stronger result than we had needed to: every minimum cut
remains at the end with probability at least 2/n(n − 1). This implies in particular that
the number of minimum cuts in an undirected graph is at most n(n − 1)/2. (Note that
the number of cuts in the graph is the number of all nonempty subsets, which is 2^n − 1, so this
implies only a tiny number of all cuts are minimum cuts.) This upper bound has had great
impact in subsequent theory of algorithms, though we will not have occasion to explore that
in this course.

2.2 Improvement by Karger-Stein


Karger and Stein improved the algorithm to run in O(n^2 log^2 n) time.
The idea is roughly that repetition ensures fault tolerance. The real-life advice of making
two backups of your hard drive is related to this: the probability that both fail is much
smaller than the probability that one does. In case of Karger’s algorithm, the overall probability
of success is too low, at 2/n(n−1). But if we run it part of the way, until the graph has n/√2 supernodes, then
the same calculation as before shows that the probability that the mincut has survived (i.e.
no edge has been picked in it) is at least 1/2. So you make two independent runs that go
down to n/√2 supernodes, and recursively solve both of these with the same Karger-Stein
algorithm. Then return the smaller of the two cuts returned by the recursive calls.
The running time for such an algorithm satisfies

T(n) = O(n^2) + 2T(n/√2),

which the Master theorem of ugrad algorithms¹ shows to yield T(n) = O(n^2 log n). As you
might suspect, this is not the end of the story but improvements beyond this get more hairy.
If anybody is interested I can give more pointers.
Claim: The probability the algorithm returns a minimum cut is at least 1/ log n.
Thus repeating the algorithm O(log n) times gives a success probability at least 0.9 (say)
and a running time of O(n^2 log^2 n).
To prove the claim we note that if P(n) is the probability that the procedure returns a
minimum cut, then

P(n) ≥ 1 − (1 − \frac{1}{2} P(n/√2))^2,

where the term \frac{1}{2} P(n/√2) represents the probability of the event that a minimum cut sur-
vived in the shrinkage to n/√2 vertices, and the recursive call then recovered this minimum
cut.
To see that this solves to P(n) ≥ 1/\log n we can do a simple induction, where the
inductive step needs to verify that

\frac{1}{\log n} ≤ 1 − (1 − \frac{1}{2} \cdot \frac{1}{\log n − 0.5})^2 = \frac{1}{\log n − 0.5} − \frac{1}{4(\log n − 0.5)^2},

which is true using the approximation

\frac{1}{\log n − 0.5} ≈ \frac{1}{\log n} + \frac{0.5}{\log^2 n}.

Bibliography
1) Global min-cuts in RNC, and other ramifications of a simple min-cut algorithm. David
Karger, Proc. ACM-SIAM SODA 1993.
2) A new approach to the minimum cut problem. David Karger and Cliff Stein, JACM
43(4):601–640, 1996.

¹Hush, hush, don’t tell anybody, but most researchers don’t use the Master theorem, even though it
was stressed a lot in undergrad algorithms. When we need to solve such recurrences, we just unwrap the
recurrence a few times and see that there are O(log n) levels, and each involves O(n^2) work, for a total of
O(n^2 log n).
Figure 2.1: Illustration of Karger’s Algorithm (borrowed from lecture notes of Sanjoy Das-
gupta)

Chapter 3

Large deviations bounds and applications

Today’s topic is deviation bounds: what is the probability that a random variable deviates
from its mean by a lot? Recall that a random variable X is a mapping from a probability
space to R. The expectation or mean is denoted E[X] or sometimes as µ.
In many settings we have a set of n random variables X1 , X2 , X3 , . . . , Xn defined on
the same probability space. To give an example, the probability space could be that of all
possible outcomes of n tosses of a fair coin, and Xi is the random variable that is 1 if the
ith toss is a head, and is 0 otherwise, which means E[Xi ] = 1/2.
The first observation we make is that of the Linearity of Expectation, viz.
E[\sum_i X_i] = \sum_i E[X_i].
It is important to realize that linearity holds regardless of whether or not the random
variables are independent.
Can we say something about E[X1 X2 ]? In general, nothing much but if X1 , X2 are
independent events (formally, this means that for all a, b Pr[X1 = a, X2 = b] = Pr[X1 =
a] Pr[X2 = b]) then E[X1 X2 ] = E[X1 ] E[X2 ].
Note that if the X_i’s are pairwise independent (i.e., each pair is mutually independent)
then var[\sum_i X_i] = \sum_i var[X_i].

3.1 Three progressively stronger tail bounds


Now we give three methods that give progressively stronger bounds.

3.1.1 Markov’s Inequality (aka averaging)


The first of a number of inequalities presented today, Markov’s inequality says that any
non-negative random variable X satisfies
Pr[X ≥ k E[X]] ≤ \frac{1}{k}.


Note that this is just another way to write the trivial observation that E[X] ≥ k ·Pr[X ≥ k].
Can we give any meaningful upperbound on Pr[X < c · E[X]] where c < 1, in other
words the probability that X is a lot less than its expectation? In general we cannot.
However, if we know an upperbound on X then we can. For example, if X ∈ [0, 1] and
E[X] = µ then for any c < 1 we have (simple exercise)
Pr[X ≤ cµ] ≤ \frac{1 − µ}{1 − cµ}.
Sometimes this is also called an averaging argument.
Example 1 Suppose you took a lot of exams, each scored from 1 to 100. If your average
score was 90 then in at least half the exams you scored at least 80.

3.1.2 Chebyshev’s Inequality


The variance of a random variable X is one measure (there are others too) of how “spread
out” it is around its mean. It is defined as E[(X − µ)^2] = E[X^2] − µ^2.
A more powerful inequality, Chebyshev’s inequality, says

Pr[|X − µ| ≥ kσ] ≤ \frac{1}{k^2},

where µ and σ^2 are the mean and variance of X. Recall that σ^2 = E[(X − µ)^2] = E[X^2] − µ^2.
Actually, Chebyshev’s inequality is just a special case of Markov’s inequality: by definition,

E[|X − µ|^2] = σ^2,

and so,

Pr[|X − µ|^2 ≥ k^2 σ^2] ≤ \frac{1}{k^2}.
Here is a simple fact that’s used a lot: If Y_1, Y_2, . . . , Y_t are iid (which is jargon for inde-
pendent and identically distributed) then the variance of their average \frac{1}{t}\sum_i Y_i is exactly 1/t
times the variance of one of them. Using Chebyshev’s inequality, this already implies that
the average of iid variables converges sort-of strongly to the mean.

Example: Load balancing


Suppose we toss m balls into n bins. You can think of m jobs being randomly assigned to
n processors. Let X = number of balls assigned to the first bin. Then E[X] = m/n. What
is the chance that X > 2m/n? Markov’s inequality says this is less than 1/2.
To use Chebyshev we need to compute the variance of X. For this let Y_i be the indicator
random variable that is 1 iff the ith ball falls in the first bin. Then X = \sum_i Y_i. Hence

E[X^2] = E[\sum_i Y_i^2 + 2\sum_{i<j} Y_i Y_j] = \sum_i E[Y_i^2] + 2\sum_{i<j} E[Y_i Y_j].

Now for independent random variables E[Y_i Y_j] = E[Y_i] E[Y_j], so E[X^2] = \frac{m}{n} + \frac{m(m−1)}{n^2}.
Hence the variance is very close to m/n, and thus Chebyshev implies that
Pr[X > 2\frac{m}{n}] < \frac{n}{m}. When m > 3n, say, this is stronger than Markov.

3.1.3 Large deviation bounds


When we toss a coin many times, the expected number of heads is half the number of tosses.
How tightly is this distribution concentrated? Should we be very surprised if after 1000
tosses we have 625 heads?
The Central Limit Theorem says that the sum of n independent random variables (with
bounded mean and variance) converges to the famous Gaussian distribution (popularly
known as the Bell Curve). This is very useful in algorithm design: we maneuver to de-
sign algorithms so that the analysis boils down to estimating the sum of independent (or
somewhat independent) random variables.
To do a back of the envelope calculation, if all n coin tosses are fair (Heads has probability
1/2) then the Gaussian approximation implies that the probability of seeing N heads where
|N − n/2| > a√n is at most e^{−a^2/2}. The chance of seeing at least 625 heads in 1000 tosses
of an unbiased coin is less than 5.3 × 10^{−7}. These are pretty strong bounds!
This kind of back-of-the-envelope calculation will get most of the credit in homeworks.
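(If you want to check such a figure exactly rather than via the Gaussian approximation, scipy does it in one line; this is just a sanity check of mine, not something from the notes.)

from scipy.stats import binom

# Exact probability of seeing at least 625 heads in 1000 tosses of a fair coin.
print(binom.sf(624, 1000, 0.5))   # sf(k) = Pr[X > k]; the answer is indeed far below 5.3e-7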
Of course, for finite n the sum of n random variables need not be an exact Gaussian
and that’s where Chernoff bounds come in. (By the way these bounds are also known by
other names in different fields since they have been independently discovered.)
First we give an inequality that works for general variables that are real-valued in [−1, 1].
(To apply it to more general bounded variables just scale them to [−1, 1] first.)
Theorem 2 (Quantitative version of the CLT, due to H. Chernoff; slightly inexact version but good enough for us)
If X_1, X_2, . . . , X_n are independent random variables and each X_i ∈ [−1, 1], let µ_i = E[X_i]
and σ_i^2 = var[X_i]. Then X = \sum_i X_i satisfies

Pr[|X − µ| > kσ] ≤ 2 \exp(−\frac{k^2}{4}),

where µ = \sum_i µ_i and σ^2 = \sum_i σ_i^2, and k ≤ σ/2 (say).

Instead of proving the above we prove a simpler theorem for binary valued variables
which showcases the basic idea.
Theorem 3
Let X_1, X_2, . . . , X_n be independent 0/1-valued random variables and let p_i = E[X_i], where
0 < p_i < 1. Then the sum X = \sum_{i=1}^n X_i, which has mean µ = \sum_{i=1}^n p_i, satisfies

Pr[X ≥ (1 + δ)µ] ≤ (c_δ)^µ

where c_δ is shorthand for \frac{e^δ}{(1+δ)^{(1+δ)}}.

Remark: There is an analogous inequality that bounds the probability of deviation below
the mean, whereby δ becomes negative and the ≥ in the probability becomes ≤ and the cδ
is very similar.
Proof: Surprisingly, this inequality also is proved using the Markov inequality, albeit
applied to a different random variable.
We introduce a positive dummy variable t and observe that

E[\exp(tX)] = E[\exp(t \sum_i X_i)] = E[\prod_i \exp(tX_i)] = \prod_i E[\exp(tX_i)],    (3.1)

where the last equality holds because the X_i r.v.s are independent. Now,

E[\exp(tX_i)] = (1 − p_i) + p_i e^t,

therefore,

\prod_i E[\exp(tX_i)] = \prod_i [1 + p_i(e^t − 1)] ≤ \prod_i \exp(p_i(e^t − 1)) = \exp(\sum_i p_i(e^t − 1)) = \exp(µ(e^t − 1)),    (3.2)

as 1 + x ≤ e^x. Finally, apply Markov’s inequality to the random variable \exp(tX), viz.

Pr[X ≥ (1 + δ)µ] = Pr[\exp(tX) ≥ \exp(t(1 + δ)µ)] ≤ \frac{E[\exp(tX)]}{\exp(t(1 + δ)µ)} = \frac{\exp((e^t − 1)µ)}{\exp(t(1 + δ)µ)},

using lines (3.1) and (3.2) and the fact that t is positive. Since t is a dummy variable, we can
choose any positive value we like for it. The right hand side is minimized if t = \ln(1+δ)—just
differentiate—and this leads to the theorem statement. □

3.2 Application 1: Sampling/Polling


Opinion polls and statistical sampling rely on tail bounds. Suppose there are n arbitrary
numbers in [0, 1]. If we pick t of them randomly (with replacement!) then the sample mean
is within (1 ± ε) of the true mean with probability at least 1 − δ if t > Ω(\frac{1}{ε^2} \log 1/δ). (Verify
this calculation!)
In general, Chernoff bounds imply that taking k independent estimates and taking
their mean ensures that the value is highly concentrated about the mean; large deviations
happen with exponentially small probability.
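A quick simulation sketch of the polling statement above; the particular n, t, and ε below are arbitrary choices of mine, intended only as an illustration.

import random

nums = [random.random() for _ in range(10**6)]     # n arbitrary numbers in [0, 1]
true_mean = sum(nums) / len(nums)

eps, t, trials = 0.05, 4000, 200                    # t is on the order of 1/eps^2
good = sum(
    abs(sum(random.choices(nums, k=t)) / t - true_mean) <= eps * true_mean
    for _ in range(trials)
)
print("fraction of trials within (1 +/- eps) of the true mean:", good / trials)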

3.3 Balls and Bins revisited: Load balancing


Suppose we toss m balls into n bins. You can think of m jobs being randomly assigned to
n processors. Then the expected number of balls in each bin is m/n. When m = n this
expectation is 1 but we saw in Lecture 1 that the most overloaded bin has Ω(log n/ log log n)
balls. However, if m = cn log n then the expected number of balls in each bin is c log n.
Thus Chernoff bounds imply that the chance of seeing less than 0.5c log n or more than
1.5c log n balls in a given bin is less than γ^{c log n} for some constant γ < 1 (which depends on the 0.5, 1.5 etc.), which
can be made less than say 1/n^2 by choosing c to be a large constant.
Moral: if an office boss is trying to allocate work fairly, he/she should first create more
work and then do a random assignment.

3.4 What about the median?


Given n numbers in [0, 1] can we approximate the median via sampling? This will be part
of your homework.

Exercise: Show that it is impossible to estimate the value of the median within say 1.1
factor with o(n) samples.
But what is possible is to produce a number that is an approximate median: it is greater
than at least n/2 − n/t of the numbers and less than at least n/2 − n/t of the numbers. The
idea is to take a random sample of a certain size and take the median of that sample. (Hint:
Use balls and bins.)
One can use the approximate median algorithm to describe a version of quicksort with
very predictable performance. Say we are given n numbers in an array. Recall that (random)
quicksort is the sorting algorithm where you randomly pick one of the n numbers as a pivot,
then partition the numbers into those that are bigger than and smaller than the pivot (which
takes O(n) time). Then you recursively sort the two subsets.
This procedure works in expected O(n log n) time as you may have learnt in an undergrad
course. But its performance is uneven because the pivot may not divide the instance into
two exactly equal pieces. For instance the chance that the running time exceeds 10n log n
time is quite high.
A better way to run quicksort is to first do a quick estimation of the median and then
do a pivot. This algorithm runs in very close to n log n time, which is optimal.
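A sketch of that idea in Python; the sample size and the small-array cutoff are arbitrary choices of mine, not prescribed by the notes.

import random

def quicksort(a, sample_size=101):
    """Quicksort whose pivot is the median of a small random sample, so with high
    probability each pivot is an approximate median and the recursion depth is O(log n)."""
    if len(a) <= 32:
        return sorted(a)
    sample = random.choices(a, k=sample_size)
    pivot = sorted(sample)[sample_size // 2]
    smaller = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    larger  = [x for x in a if x > pivot]
    return quicksort(smaller, sample_size) + equal + quicksort(larger, sample_size)

data = [random.random() for _ in range(100000)]
assert quicksort(data) == sorted(data)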
Chapter 4

Hashing with real numbers and their big-data applications

Using only memory equivalent to 5 lines of printed text, you can estimate with a
typical accuracy of 5 per cent and in a single pass the total vocabulary of Shake-
speare. This wonderfully simple algorithm has applications in data mining, esti-
mating characteristics of huge data flows in routers, etc. It can be implemented
by a novice, can be fully parallelized with optimal speed-up and only need minimal
hardware requirements. There’s even a bit of math in the middle!
Opening lines of a paper by Durand and Flajolet, 2003.

As we saw in Lecture 1, hashing can be thought of as a way to rename an address space.


For instance, a router at the internet backbone may wish to have a searchable database of
destination IP addresses of packets that are whizzing by. An IP address is 128 bits, so the
number of possible IP addresses is 2^128, which is too large to let us have a table indexed
by IP addresses. Hashing allows us to rename each IP address by fewer bits. In Lecture 1
this hash was a number in a finite field (integers modulo a prime p). In recent years large
data algorithms have used hashing in interesting ways where the hash is viewed as a real
number. For instance, we may hash IP addresses to real numbers in the unit interval [0, 1].
Example 2 (Dart-throwing method of estimating areas) Suppose someone gives you a piece
of paper of irregular shape and you wish to determine its area. You can do so by pinning
it on a piece of graph paper. Say, it lies completely inside the unit square. Then throw a
dart n times on the unit square and observe the fraction of times it falls on the irregularly
shaped paper. This fraction is an estimator for the area of the paper.
Of course, the digital analog of throwing a dart n times on the unit square is to take a
random hash function from {1, . . . , n} to [0, 1] × [0, 1].
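In code, the dart-throwing estimator looks like this; here the "irregular shape" is a quarter disc of area π/4 so the answer can be checked, a choice of mine for illustration.

import random

n = 10**6
inside = sum(random.random()**2 + random.random()**2 <= 1 for _ in range(n))
print(inside / n)   # estimate of the quarter-disc area, pi/4 ≈ 0.785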

Strictly speaking, one cannot hash to a real number since computers lack infinite preci-
sion. Instead, one hashes to rational numbers in [0, 1]. For instance, hash IP addresses to
the set [p] as before, and then think of the number “i mod p” as the rational number i/p. This
works OK so long as our method doesn’t use too many bits of precision in the real-valued
hash.


A general note about sampling. As pointed out in Lecture 3 using the random variable
”Number of ears,” the expectation of a random variable may never be attained at any point
in the probability space. But if we draw a random sample, then we know by Chebyshev’s
inequality that the sample has chance at least 1 − 1/k^2 of taking a value in the interval
[µ − kσ, µ + kσ] where µ, σ denote the mean and variance respectively. Thus to get any
reasonable idea of µ we need σ to be less than µ. But if we take t independent samples
(even pairwise independent will do) then the variance of the mean of these samples is σ 2 /t.
Hence by increasing t we can get a better estimate of µ.

4.1 Estimating the cardinality of a set that’s too large to store
Continuing with the router example, suppose the router wishes to maintain a count of the
number of distinct IP addresses seen in the past hour. It would be too wasteful to actually
store all the IP addresses; an approximate count is fine. This is also the application alluded
to in the quote at the start of the lecture.
An idea: Pick k random hash functions h1 , h2 , . . . , hk that map a 128-bit address to
a random real number in [0, 1]. (For now let’s suppose that these are actually random
functions.) Now maintain k registers, initialized to 0. Whenever a packet whizzes by, and
its IP address is x, compute hi (x) for each i. If hi (x) is less than the number currently
stored in the ith register, then write hi (x) in the ith register.
Let Yi be the random variable denoting the contents of the ith register at the end. (It
is a random variable because the hash function was chosen randomly. The packet addresses
are not random.) Realize that Yi is nothing but the lowest value of hi (x) among all IP
addresses seen so far.
Suppose the number of distinct IP addresses seen is N . This is what we are trying to
estimate.
Fact: E[Y_i] = \frac{1}{N+1} and the variance of Y_i is 1/(N + 1)^2.
The expectation looks intuitively about right: the minimum of N random elements in
[0, 1] should be around 1/N .
Let’s do the expectation calculation. The probability that Yi is z is the probability that
one of the IP addresses mapped to z and all the others mapped to numbers greater than z.
E[Y_i] = \int_{z=0}^{1} Pr[Y_i > z]\,dz = \int_{z=0}^{1} (1 − z)^N dz = \frac{1}{N + 1}.

(Here’s a slick alternative proof of the 1/(N + 1) calculation. Imagine picking N + 1


random numbers in [0, 1] and consider the chance that the N + 1th element is the smallest.
By symmetry this chance is 1/(N + 1). But this chance is exactly the expected value of the
minimum of the first N numbers. QED. )
Since we picked k random hash functions, the Y_i’s are iid. Let Y be their mean. Then
the variance of Y is 1/(k(N + 1)^2), in other words, k times lower than the variance of each
individual Y_i. Thus if 1/k is less than ε^2 the standard deviation is less than ε/(N + 1),
whereas the mean is 1/(N + 1). Thus with constant probability the estimate 1/Y is within
a (1 + ε) factor of N.
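A small sketch of the k-register scheme; I simulate the "random real-valued hash" with the linear hash ((ax + b) mod p)/p suggested at the start of this chapter, and the stream, k, and other parameters are made up for illustration.

import random

P = 4294967311   # a prime a little larger than 2**32

def real_hash(p=P):
    """h(x) = ((a*x + b) mod p) / p, a hash into [0, 1) as described above."""
    a, b = random.randrange(1, p), random.randrange(0, p)
    return lambda x: ((a * x + b) % p) / p

def estimate_distinct(stream, k=64):
    hashes = [real_hash() for _ in range(k)]
    registers = [1.0] * k                     # register i holds the minimum of h_i over the stream
    for x in stream:
        for i, h in enumerate(hashes):
            v = h(x)
            if v < registers[i]:
                registers[i] = v
    y_bar = sum(registers) / k                # E[Y_i] = 1/(N+1), so 1/Y_bar - 1 estimates N
    return 1.0 / y_bar - 1

distinct = random.sample(range(2**32), 2000)                  # 2000 distinct "addresses"
stream = [random.choice(distinct) for _ in range(50000)]      # a stream with many repeats
print(estimate_distinct(stream))   # a rough estimate of 2000; relative error on the order of 1/sqrt(k)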

All this assumed that the hash functions are random functions from 128-bit numbers to
[0, 1]. Let’s now show that it suffices to pick hash functions from a pairwise independent
family, albeit now yielding an estimate that is only correct up to some constant factor.
Specifically, the algorithm will take k pairwise independent hashes and see if the majority
of the min values are contained in some interval of the type [1/(3x), 3/x]. Then x is our
estimate for N , the number of elements. This estimate will be correct up to a factor 3 with
probability at least 1 − 1/k.
What is the probability that we hash N different elements using such a hash function
and the smallest hash value is less than 1/(3N)? For each element x, Pr[h(x) < 1/(3N)] is at
most 1/(3N), so by the union bound, the probability in question is at most N × 1/(3N) = 1/3.
Similarly, Pr[∃x : h(x) ≤ 1/N] can be lower bounded by the inclusion-
exclusion bound.
Lemma 4 (inclusion-exclusion bound)
Pr[A1 ∨ A2 . . . ∨ An ], the probability that at least one of the events A1 , A2 , . . . , An happens,
satisfies

\sum_i Pr[A_i] − \sum_{i<j} Pr[A_i ∧ A_j] ≤ Pr[A_1 ∨ A_2 . . . ∨ A_n] ≤ \sum_i Pr[A_i].

Since our events are pairwise independent we obtain


 
Pr[∃x : h(x) ≤ 1/N] ≥ N × \frac{1}{N} − \binom{N}{2} \frac{1}{N^2} ≥ \frac{1}{2}.

Using a little more work it can be shown that with probability at least 0.6 the minimum
hash is in the interval [1/(3N), 3/N]. (NB: These calculations can be improved if the hash is
from a 4-wise independent family.) Thus if we repeat with k hashes, the probability that
the majority of min values are not contained in [1/(3N), 3/N] drops as O(1/k).

4.2 Estimating document similarity


One of the aspects of the data deluge on the web is that often one finds duplicate copies of
the same thing. Sometimes the copies may not be exactly identical: for example mirrored
copies of the same page but some are out of date. The same news article or blog post may
be reposted many times, sometimes with editorial comments. By detecting duplicates and
near-duplicates internet companies can often save on storage by an order of magnitude.
We present a technique called similarity hashing that allows this approximately. It
is a hashing method such that the hash preserves some ”sketch” of the document. Two
documents’ similarity can be estimated by comparing their hashes. This is an example of a
burgeoning research area of hashing while preserving some semantic information. In general
finding similar items in databases is a big part of data mining (find customers with similar
purchasing habits, similar tastes, etc.). Today’s simple hash is merely a way to dip our toes
in these waters.
So think of a document as a set: the set of words appearing in it. The Jaccard similarity
of documents/sets A, B is defined to be |A ∩ B| / |A ∪ B|. This is 1 iff A = B and 0 iff the
sets are disjoint.

Basic idea: Pick a random hash function mapping the underlying universe of elements
to [0, 1]. Define the hash of a set A to be the minimum of h(x) over all x ∈ A. Then
by symmetry, Pr[hash(A) = hash(B)] is exactly the Jaccard similarity. (Note that if two
elements x, y are different then Pr[h(x) = h(y)] is 0 when the hash is real-valued. Thus the
only possibility of a collision arises from elements in the intersection of A, B.) Thus one
could pick a single random hash function and use the indicator of hash(A) = hash(B) as an
estimator of the Jaccard similarity: it has the right expectation. Repeating with k independent
hash functions and taking the fraction of collisions gives a better, concentrated estimate.
The analysis goes as follows. Suppose we are interested in flagging pairs of documents
whose Jaccard-similarity is at least 0.9. Then we compute k hashes and flag the pair if at
least a 0.9 − ε fraction of the hashes collide. Chernoff bounds imply that if k = Ω(1/ε²) this
flags all document pairs that have similarity at least 0.9 and does not flag any pairs with
similarity less than 0.9 − 3ε.
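Here is a similarly minimal sketch (again not from the notes, with made-up documents): compute k min-hashes per document and compare the signatures coordinate-wise.

# A sketch of MinHash-based Jaccard estimation; helper names are my own.
import hashlib

def hash01(x, salt):
    h = hashlib.sha256(f"{salt}:{x}".encode()).hexdigest()
    return int(h, 16) / 2**256

def minhash_signature(words, k=200):
    # Signature = minimum hash value of the set under each of k hash functions.
    return [min(hash01(w, i) for w in words) for i in range(k)]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of coordinates where the two minima collide.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = set("the quick brown fox jumps over the lazy dog".split())
doc2 = set("the quick brown fox sleeps near the lazy dog".split())
true_jaccard = len(doc1 & doc2) / len(doc1 | doc2)
s1, s2 = minhash_signature(doc1), minhash_signature(doc2)
print("true:", true_jaccard, "estimated:", estimated_jaccard(s1, s2))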
To make this method more realistic we need to replace the idealized random hash func-
tion with a real one and analyse it. That is beyond the scope of this lecture. Indyk showed
that it suffices to use a k-wise independent hash function for k = Ω(log(1/ε)) to let us es-
timate Jaccard-similarity up to error ε. Thorup recently showed how to do the estimation
with pairwise independent functions. This analysis seems rather sophisticated; let me know
if you happen to figure it out.

Bibliography

1. Broder, Andrei Z. (1997). On the resemblance and containment of documents. Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997.

2. Broder, Andrei Z.; Charikar, Moses; Frieze, Alan M.; Mitzenmacher, Michael (1998). Min-wise independent permutations. Proc. 30th ACM Symposium on Theory of Computing (STOC '98).

3. Manku, Gurmeet Singh; Das Sarma, Anish (2007). Detecting near-duplicates for web crawling. Proceedings of the 16th International Conference on World Wide Web, ACM.

4. Indyk, P. (1999). A small approximately min-wise independent family of hash functions. Proc. ACM-SIAM SODA.

5. Thorup, M. (2013). http://arxiv.org/abs/1303.5479.


Chapter 5

Stable matchings, stable marriages


and price of anarchy

Guest lecture by Mark Braverman. Handwritten scribe notes available from course website.
http://www.cs.princeton.edu/courses/archive/fall14/cos521/

Chapter 6

Linear Thinking

According to conventional wisdom, linear thinking describes a thought process that is logical
or step-by-step (i.e., each step must be completed before the next one is undertaken).
Nonlinear thinking, on the other hand, is the opposite of linear: creative, original, capable
of leaps of inference, etc.
From a complexity-theoretic viewpoint, conventional wisdom turns out to be startlingly
right in this case: linear problems are generally computationally easy, and nonlinear prob-
lems are generally not.

Example 3 Solving linear systems of equations is easy. Let’s show solving quadratic sys-
tems of equations is NP-hard. Consider the vertex cover problem, which is NP-hard:
Given a graph G = (V, E) and an integer k, we need to determine if there is a subset of vertices
S of size k such that for each edge {i, j}, at least one of i, j is in S.
We can rephrase this as a problem involving solving a system of nonlinear equations,
where xi = 1 stands for “i is in the vertex cover.”

(1 − xi)(1 − xj) = 0    ∀ {i, j} ∈ E
xi(1 − xi) = 0    ∀ i ∈ V
∑_i xi = k.

Not all nonlinear problems are difficult, but the ones that turn out to be easy are
generally those that can leverage linear algebra (eigenvalues, singular value decomposition,
etc.)
In mathematics, too, linear algebra is simple and easy to understand. The goal of much
of higher mathematics seems to be to reduce the study of complicated (nonlinear!) objects to
the study of linear algebra.

6.1 Simplest example: Solving systems of linear equations


The following is a simple system of equations.


2x1 − 3x2 = 5
3x1 + 4x2 = 6

More generally we represent a linear system of m equations in n variables as Ax = b


where A is an m × n coefficient matrix, x is a vector of n variables, and b is a vector of m
real numbers. In your linear algebra course you learnt that this system is feasible iff b is in
the span of the column vectors of A; in other words, the matrix A|b (i.e., the matrix where b is
tacked on as a new column of A) has exactly the same rank as A. The solution is computed
via matrix inversion. One subtlety not addressed in most linear algebra courses is whether
this procedure runs in polynomial time. You may protest that they actually point out that the
system can be solved in O(n³) operations. Yes, but this misses a crucial point, which we
will address before the end of the lecture.
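As a quick sanity check, here is how one might solve the small 2 × 2 system above numerically (a sketch using numpy; not part of the notes):

# Solve the system 2x1 - 3x2 = 5, 3x1 + 4x2 = 6 and check feasibility via ranks.
import numpy as np

A = np.array([[2.0, -3.0],
              [3.0,  4.0]])
b = np.array([5.0, 6.0])

# Feasible iff rank(A|b) == rank(A), i.e., b lies in the column span of A.
feasible = np.linalg.matrix_rank(np.column_stack([A, b])) == np.linalg.matrix_rank(A)
x = np.linalg.solve(A, b)      # here A is invertible, so the solution is unique
print(feasible, x)             # x = [38/17, -3/17]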

6.2 Systems of linear inequalities and linear programming


If we replace some or all of the = signs with ≥ or ≤ in a system of linear equations we
obtain a system of linear inequalities.

Figure 6.1: A system of linear inequalities and its feasible region

The feasible region has sharp corners; it is a convex region and is called a polytope.
In general, a region of space is called convex if for every pair of points x, y in it, the line
segment joining x, y, namely, {λ · x + (1 − λ) · y : λ ∈ [0, 1]}, lies in the region.
In Linear Programming one is trying to optimize (i.e., maximize or minimize) a linear
function over the set of feasible values. The general form of an LP is

min c^T x    (6.1)
subject to Ax ≥ b    (6.2)

Here ≥ denotes componentwise "greater than."



Figure 6.2: Convex and non-convex regions

This form is very flexible. To express maximization instead of minimization, just replace
c by −c. To include an inequality of the form a·x ≤ bi just write it as −a·x ≥ −bi . To include
an equation a · x = bi as a constraint just replace with two inequalities a · x ≥ bi , a · x ≤ bi .

Solving LPs: In Figure 6.1 we see the convex feasible region of an LP. The objective
function is linear, so it is clear that the optimum of the linear program is attained at some
vertex of the feasible region. Thus a trivial algorithm to find the optimum is to enumerate
all vertices of the feasible region and take the one with the lowest value of the objective.
This method (sometimes taught in high schools) of graphing the inequalities and their
feasible region does not scale well with n, m. The number of vertices of the feasible region
grows roughly as m^{n/2} in general, so the enumeration algorithm takes exponential time. The famous
simplex method is a clever method to enumerate these vertices one by one, ensuring that
the objective keeps decreasing at each step. It works well in practice. The first polynomial-
time method to determine feasibility of linear inequalities was only discovered in 1979 by
Khachiyan, a Soviet mathematician. We will discuss the core ideas of this method later in
the course. For now, we just assume polynomial-time solvability and see how to use LP as
a tool.
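As a tool, LP is easy to use from code. Here is a tiny sketch (not from the notes; the particular LP is made up) using scipy's linprog, which expects constraints in "≤" form, so we negate A and b:

# Solve min c^T x subject to Ax >= b, x >= 0 with scipy.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])                 # objective: minimize x1 + 2*x2
A = np.array([[1.0, 1.0],                # x1 + x2 >= 1
              [1.0, -1.0]])              # x1 - x2 >= -0.5
b = np.array([1.0, -0.5])

res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None), (0, None)], method="highs")
print(res.x, res.fun)   # the optimum is attained at a vertex of the feasible polytope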

Example 4 (Assignment Problem) Suppose n jobs have to be assigned to n factories. Each


job has its attendant requirements and raw materials. Suppose all of these are captured by
a single number: cij is the cost of assigning job i to factory j. Let xij be a variable that
corresponds to assigning job i to factory j. We hope this variable is either 0 or 1 but that
is not expressible in the LP so we relax this to the constraint

xij ≥ 0 and xij ≤ 1 for each i, j.


Each job must be assigned to exactly one factory so we have the constraint ∑_j xij = 1 for
each job i. Then we must ensure each factory obtains one job, so we include the constraint
∑_i xij = 1 for each factory j. Finally, we want to minimize overall cost so the objective is

min ∑_{i,j} cij xij.

Fact: the solution to this LP has the property that all xij variables are either 0 or 1.
(Maybe this will be a future homework.) Thus solving the LP actually solves the assignment
problem.
In general one doesn’t get so lucky: solutions to LPs end up being nonintegral no matter
how hard we pray for the opposite outcome. Next lecture we will discuss what to do if that
happens.2
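For concreteness, here is a quick sketch (not from the notes) of the assignment problem on a random cost matrix. It uses scipy's linear_sum_assignment, which solves the problem exactly with a combinatorial algorithm; the optimum it returns is integral, consistent with the Fact above about the LP relaxation.

# Assignment problem on a random cost matrix; the solution is a 0/1 assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
C = rng.integers(1, 10, size=(4, 4))     # C[i, j] = cost of assigning job i to factory j

rows, cols = linear_sum_assignment(C)    # optimal one-to-one assignment
print(C)
print(list(zip(rows, cols)), "total cost:", C[rows, cols].sum())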

In fact linear programming was invented in 1939 by Kantorovich, a Russian mathematician,


to enable efficient organization of industrial production and other societal processes (such
as the assignment problem). The premise of communist economic system in the 1940s and
1950s was that centralized planning —using linear programming!— would enable optimum
use of a society’s resources and help avoid the messy “inefficiencies”of the market system!
The early developers of linear programming were awarded the Nobel prize in economics!
Alas, linear programming has not proved sufficient to ensure a perfect economic system.
Nevertheless it is extremely useful and popular in optimizing flight schedules, trucking
operations, traffic control, manufacturing methods, etc. At one point it was estimated
that 50% of all computation in the world was devoted to LP solving. Then youtube was
invented...

6.3 Linear modeling


At the heart of mathematical modeling is the notion of a system of variables: some variables
are mathematically expressed in terms of others. In general this mathematical expression
may not be succinct or even finite (think of the infinite processes captured in the quantum
theory of elementary particles). A linear model is a simple way to express interrelationships
that are linear.

y = 0.1x1 + 9.1x2 − 3.2x3 + 7.

Example 5 (Diet) You wish to balance meat, sugar, veggies, and grains in your diet. You
have a certain dollar budget and a certain calorie goal. You don’t like these foodstuffs
equally; you can give them a score between 1 and 10 according to how much you like them.
Let lm , ls , lv , lg denote your score for meat, sugar, veggies and grains respectively. Assuming
your overall happiness is given by

m × lm + g × lg + v × lv + s × ls ,

where m, g, v, s denote your consumption of meat, grain, veggies and sugar respectively
(note: this is a modeling assumption about you) then the problem of maximizing your
happiness subject to a dollar and calorie budget is a linear program. 2

Example 6 (`1 regression) This example is from Bob Vanderbei’s book on linear program-
ming. You are given data containing grades in different courses for various students; say
Gij is the grade of student i in course j. (Of course, Gij is not defined for all i, j since
each student has only taken a few courses.) You can try to come up with a model for
explaining these scores. You hypothesize that a student’s grade in a course is determined

by the student’s innate aptitude, and the difficulty of the course. One could try various
functional forms for how the grade is determined by these factors, but the simplest form to
try is linear. Of course, such a simple relationship will not completely explain the data so
you must allow for some error. This linear model hypothesizes that

Gij = aptitudei + easinessj + εij,    (6.3)

where εij is an error term.

Clearly, the error could be positive or negative. A good model is one that has a low
value of ∑_{i,j} |εij|. Thus the best model is one that minimizes this quantity.
We can solve this model for the aptitude and easiness scores using an LP. We have
the constraints in (6.3) for each student i and course j. Then for each i, j we have the
constraints
sij ≥ 0 and −sij ≤ εij ≤ sij.

Finally, the objective is min ∑_{i,j} sij.
This method of minimizing the sum of absolute values is called ℓ1-regression because
the ℓ1 norm of a vector x is ∑_i |xi|. 2

Just as LP is the tool of choice to squeeze out inefficiencies of production and planning,
linear modeling is the bedrock of data analysis in science and even social science.

Example 7 (Econometric modeling) Econometrics is the branch of economics dealing with


analysis of empirical data and understanding the interrelationships of the underlying eco-
nomic variables —also useful in sociology, political science, etc. It often relies upon modeling
dependencies among variables using linear expressions. Usually the variables have a time
dependency. For instance it may posit a relationship of the form

Growth(T + 1) = α · Interest rate(T) + β · Deficit(T − 1) + ε(T),

where Interest rate(T ) denotes say the interest rate at time T , etc. Here α, β may not be
constant and may be probabilistic variables (e.g., a random variable uniformly distributed in
[0.5, 0.8]) since future growth may not be a deterministic function of the current variables.
Often these models are solved (i.e., for α, β in this case) by regression methods related
to the previous example, or more complicated probabilistic inference methods that we may
study later in the course.2

Example 8 (Perceptrons and Support Vector Machines in machine learning) Suppose you
have a bunch of images labeled by whether or not they contain a car. These are data
points of the form (x, y) where x is n-dimensional (n = number of pixels in the image) and
y ∈ {0, 1}, where 1 denotes that the image contains a car. You are trying to train an algorithm
to recognize cars in other unlabeled images. There is a general method called SVM’s that
allows you to find some kind of a linear model. (Aside: such simple linear models don’t
work for finding cars in images; this is an example.) This involves hypothesizing that there
is an unknown set of coefficients α0, α1, α2, . . . , αn such that

∑_i αi xi ≥ α0 + errorx    if x is an image containing a car,

∑_i αi xi ≤ 0.5 α0 + errorx    if x does not contain a car,
where errorx is required to be nonpositive for each x. Then finding such αi's while minimiz-
ing the sum of the absolute values of the error terms is a linear program. After finding these
αi's, given a new image the program tries to predict whether it has a car by just checking
whether ∑_i αi xi ≥ α0 or ≤ 0.5 α0. (There is nothing magical about the 0.5 gap here; one
usually stipulates a gap or margin between the yes and no cases.)
This technique is related to the so-called support vector machines in machine learning
(and an older model called perceptrons), though we're dropping a few technical details
(ℓ2-regression, regularization, etc.). Also, in practice it could be that the linear explanation
is a good fit only after you first apply a nonlinear transformation to the x's. This is the
idea in kernel SVMs. For instance let z be the vector whose ith coordinate is zi = φ(xi) =
exp(−xi²/2). You then find a linear predictor using the z's. (How to choose such nonlinear
transformations is an art.) 2

One reason for the popularity of linear models is that the mathematics is simple, elegant,
and most importantly, efficient. Thus if the number of variables is large, a linear model is
easiest to solve.
A theoretical justification for linear modeling is Taylor expansion, according to which
every "well-behaved" function is expressible as an infinite series of terms involving its deriva-
tives. Here is the Taylor series for an m-variate function f:

f(x1, x2, . . . , xm) = f(0, 0, . . . , 0) + ∑_i xi (∂f/∂xi)(0) + ∑_{i1,i2} xi1 xi2 (∂²f/∂xi1 ∂xi2)(0) + · · · .

If we assume the higher order terms are negligible, we obtain a linear expression.
Whenever you see an article in the newspaper describing certain quantitative relation-
ships —e.g., the effect of more policing on crime, or the effect of a certain economic policy
on interest rates—chances are it was obtained via a linear model and ℓ1
regression (or the related ℓ2 regression). So don't put blind faith in those numbers; they
are necessarily rough approximations to the complex behavior of a complex world.

6.4 Meaning of polynomial-time


Of course, the goal in this course is designing polynomial-time algorithms. When a problem
definition involves numbers, the correct definition of polynomial-time is “polynomial in the
number of bits needed to represent the input. ”
Thus the input size of an m × n system Ax = b is not mn but the number of bits used to
represent A, b, which is at most mnL where L denotes the number of bits used to represent
each entry of A, b. We assume that the numbers in A, b are rational, and in fact by clearing
denominators we may assume wlog they are integer.
Let’s return to the question we raised earlier: is Gaussian elimination a polynomial-
time procedure? The answer is yes. The reason this is nontrivial is that conceivably during
Gaussian elimination we may produce a number that is too large to represent. We have to
show it runs in poly(m, n, L) time.

Towards this end, first note that standard arithmetic operations +, −, × run in time
polynomial in the input size (e.g., multiplying two k-bit integers takes time at most O(k²)
even using the grade-school algorithm).
Next, note that by Cramer’s rule for solving linear systems, the numbers produced during
the algorithm are related to determinants of submatrices of A. For example if A is
invertible then the solution to Ax = b is x = A^{−1}b, and the (i, j) entry of A^{−1} is Cij/det(A),
where Cij is a cofactor, i.e., (up to sign) the determinant of an (n−1)×(n−1) submatrix of A.
The determinant of an n × n matrix whose entries are L-bit integers is at most n! 2^{Ln}.
This follows from the formula for the determinant of an n × n matrix, which is

det(A) = ∑_σ sgn(σ) ∏_i A_{i,σ(i)},

where σ ranges over all permutations of n elements.


The number of bits used to represent the determinant is the log of this, which is about
n log n + Ln, which is indeed polynomial. Thus doing arithmetic operations on these numbers is also
polynomial-time.
The above calculation has some consequence for linear programming as well. Recall
that the optimum of a linear program is attained at a vertex of the polytope. The vertex
is defined as the solution of all the equations obtained from the inequalities that are tight
there. We conclude that each vertex of the polytope can be represented by n log n+Ln bits.
This at least shows that the solution can be written down in polynomial time (a necessary
precondition for being able to compute it in polynomial time!).
Chapter 7

Provable Approximation via Linear


Programming

One of the running themes in this course is the notion of approximate solutions. Of course,
this notion is tossed around a lot in applied work: whenever the exact solution seems hard to
achieve, you do your best and call the resulting solution an approximation. In theoretical
work, approximation has a more precise meaning whereby you prove that the computed
solution is close to the exact or optimum solution in some precise metric. We saw some
earlier examples of approximation in sampling-based algorithms; for instance our hashing-
based estimator for set size. It produces an answer that is whp within (1 + ε) of the true
answer. Today we will see many other examples that rely upon linear programming (LP).
Recall that most NP-hard optimization problems involve finding 0/1 solutions. Using
LP one can find fractional solutions, where the relevant variables are constrained to take
real values in [0, 1].
Recall the example of the assignment problem from last time, which is also a 0/1 problem
(a job is either assigned to a particular factory or it is not) but the LP relaxation magically
produces a 0/1 solution (although we didn’t prove this in class). Whenever the LP produces
a solution in which all variables are 0/1, then this must be the optimum 0/1 solution as well
since it is the best fractional solution, and the class of fractional solutions contains every
0/1 solution. Thus the assignment problem is solvable in polynomial time.
Needless to say, we don’t expect this magic to repeat for NP-hard problems. So the
LP relaxation yields a fractional solution in general. Then we give a way to round the
fractional solutions to 0/1 solutions. This is accompanied by a mathematical proof that the
new solution is provably approximate.

7.1 Deterministic Rounding (Weighted Vertex Cover)


First we give an example of the most trivial rounding of fractional solutions to 0/1 solutions:
round variables < 1/2 to 0 and ≥ 1/2 to 1. Surprisingly, this is good enough in some settings.
In the weighted vertex cover problem, which is NP-hard, we are given a graph G = (V, E)
and a weight for each node; the nonnegative weight of node i is wi . The goal is to find a
vertex cover, which is a subset S of vertices such that every edge contains at least one vertex


of S. Furthermore, we wish to find such a subset of minimum total weight. Let VCmin be
this minimum weight. The following is the LP relaxation:

min ∑_i wi xi
0 ≤ xi ≤ 1    ∀i
xi + xj ≥ 1    ∀ {i, j} ∈ E.
Let OPTf be the optimum value of this LP. It is no more than VCmin since every 0/1
solution (including in particular the 0/1 solution of minimum cost) is also an acceptable
fractional solution.
Applying deterministic rounding, we can produce a new set S: every node i with xi ≥
1/2 is placed in S and every other i is left out of S.
Claim 1: S is a vertex cover.
Reason: For every edge {i, j} we know xi + xj ≥ 1, and thus at least one of the xi ’s is at
least 1/2. Hence at least one of i, j must be in S.
Claim 2: The weight of S is at most 2 OPTf.
Reason: OPTf = ∑_i wi xi, and we are only picking those i's for which xi ≥ 1/2. 2
Thus we have constructed a vertex cover whose cost is within a factor 2 of the optimum
cost even though we don't know the optimum cost per se.
Exercise: Show that for the complete graph the above method indeed computes a set of
size no better than 2 times OPTf.
Remark: This 2-approximation was discovered a long time ago, and despite myriad attempts
we still don’t know if it can be improved. Using the so-called PCP Theorems Dinur and
Safra showed (improving a long line of work) that computing a 1.36-approximation is NP-hard. Khot
and Regev showed that computing a (2 − ε)-approximation is UG-hard, which is a new form
of hardness popularized in recent years. The bibliography mentions a popular article on
UG-hardness.
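Here is a sketch of the whole pipeline on a made-up instance, using scipy's LP solver for the relaxation and then the rounding rule above (the graph and weights are arbitrary):

# LP relaxation of weighted vertex cover, then deterministic rounding at 1/2.
import numpy as np
from scipy.optimize import linprog

w = np.array([3.0, 2.0, 4.0, 1.0])                  # node weights
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)]

# min w.x  s.t.  x_i + x_j >= 1 for every edge, 0 <= x <= 1
A_ub = np.zeros((len(edges), len(w)))
for row, (i, j) in enumerate(edges):
    A_ub[row, i] = A_ub[row, j] = -1.0              # -(x_i + x_j) <= -1
b_ub = -np.ones(len(edges))

res = linprog(w, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(w), method="highs")
cover = [i for i, xi in enumerate(res.x) if xi >= 0.5 - 1e-9]   # Claim 1: a valid cover
print("fractional optimum:", res.fun, "rounded cover:", cover,
      "cover weight:", w[cover].sum())              # Claim 2: at most 2x fractional optimum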

7.2 Simple randomized rounding: MAX-2SAT


Simple randomized rounding is as follows: if a variable xi is a fraction then toss a coin which
comes up heads with probability xi . (In Homework 1 you figured out how to do this given a
binary representation of xi .) If the coin comes up heads, make the variable 1 and otherwise
let it be 0. The expectation of this new variable is exactly xi . Furthermore, linearity of
expectations implies that if the fractional solution satisfied some linear constraint cT x = d
then the new variable vector satisfies the same constraint in the expectation. But in the
analysis that follows we will in fact do something more.
A 2CNF formula consists of n boolean variables x1 , x2 , . . . , xn and clauses of the type
y ∨ z where each of y, z is a literal, i.e., either a variable or its negation. The goal in
MAX2SAT is to find an assignment that maximises the number of satisfied clauses. (Aside:
If we wish to satisfy all the clauses, then in polynomial time we can check if such an
assignment exists. Surprisingly, the maximization version is NP-hard.) The following is
the LP relaxation where J is the set of clauses and yj1 , yj2 are the two literals in clause j.
We have a variable zj for each clause j, where the intended meaning is that it is 1 if the
assignment decides to satisfy that clause and 0 otherwise. (Of course the LP can choose to
give zj a fractional value.)

max ∑_{j∈J} zj
0 ≤ xi ≤ 1    ∀i
0 ≤ zj ≤ 1    ∀j ∈ J
yj1 + yj2 ≥ zj    ∀j ∈ J

where yj1 is shorthand for xi if the first literal in the jth clause is the ith variable, and
shorthand for 1 − xi if the literal is the negation of the ith variable. (Similarly for yj2.)
If MAX-2SAT denotes the number of clauses satisfied by the best assignment, then it is
no more than OP Tf , the value of the above LP. Let us apply randomized rounding to the
fractional solution to get a 0/1 assignment. How good is it?
Claim: E[number of clauses satisfied] ≥ (3/4) × OPTf.
We show that the probability that the jth clause is satisfied is at least 3zj/4 and then
the claim follows by linearity of expectation.
If the clause is of size 1, say xr , then the probability it gets satisfied is xr , which is at
least zj . Since the LP contains the constraint xr ≥ zj , the probability is certainly at least
3zj /4.
Suppose the clause is xr ∨ xs. Then zj ≤ xr + xs and in fact it is easy to see that
zj = min{1, xr + xs} at the optimum solution: after all, why would the LP not make zj as
large as allowed; its goal is to maximize ∑_j zj. The probability that randomized rounding
satisfies this clause is exactly 1 − (1 − xr)(1 − xs) = xr + xs − xr xs.
But xr xs ≤ (1/4)(xr + xs)² (prove this!) so we conclude that the probability that clause j
is satisfied is at least zj − zj²/4 ≥ 3zj/4. 2
Remark: This algorithm is due to Goemans-Williamson, but the original 3/4-approximation
is due to Yannakakis. The 3/4 factor has been improved by other methods to 0.94.
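Here is a sketch on a toy 2CNF instance (not from the notes; the bookkeeping that turns literals into LP rows is my own, but it follows the relaxation above):

# LP relaxation of MAX-2SAT plus simple randomized rounding.
# A clause is a list of literals (variable index, is_positive).
import numpy as np
from scipy.optimize import linprog

clauses = [[(0, True), (1, True)], [(0, False), (2, True)],
           [(1, False), (2, False)], [(0, True), (2, False)]]
n, m = 3, len(clauses)

# Variables: x_0..x_{n-1}, then z_0..z_{m-1}. Maximize sum z_j = minimize -sum z_j.
c = np.concatenate([np.zeros(n), -np.ones(m)])
A_ub, b_ub = [], []
for j, clause in enumerate(clauses):
    # Constraint y_{j1} + y_{j2} >= z_j, where y is x_i or 1 - x_i.
    row = np.zeros(n + m)
    offset = 0.0
    for (i, positive) in clause:
        if positive:
            row[i] -= 1.0          # -x_i moves to the "<=" side
        else:
            row[i] += 1.0          # +x_i; the constant 1 moves to the right-hand side
            offset += 1.0
    row[n + j] = 1.0               # +z_j
    A_ub.append(row); b_ub.append(offset)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, 1)] * (n + m), method="highs")
x_frac = res.x[:n]
print("LP value OPT_f =", -res.fun)

rng = np.random.default_rng(0)
assignment = rng.random(n) < x_frac                  # randomized rounding of the x's
satisfied = sum(any(assignment[i] == pos for (i, pos) in cl) for cl in clauses)
print("clauses satisfied by one rounding:", satisfied, "of", m)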

7.3 Dependent randomized rounding: Virtual circuit routing


Often a simple randomized rounding produces a solution that makes no sense. Then one
must resort to a more dependent form of rounding whereby chunks of variables may be
rounded up or down in a correlated way. Now we see an example of this from a classic
paper of Raghavan and Thompson.
In networks that break up messages into packets, a virtual circuit is sometimes used
to provide quality of service guarantees between endpoints. A fixed path is identified and
reserved between the two desired endpoints, and all messages are sped over that fixed path
with minimum processing overhead.
Given the capacity of all network edges, and a set of endpoint pairs (i1 , j1 ), (i2 , j2 ), . . . , (ik , jk )
it is NP-hard to determine if there is a set of paths which provide a unit capacity link be-
tween each of these pairs and which together fit into the capacity constraints of the network.
Now we give an approximation algorithm where we assume that (a) a unit-capacity path
is desired between each given endpoint pair (b) the total capacity cuv of each edge is at
least d log n, where d is a sufficiently large constant.
We give a somewhat funny approximation. Assuming there exists an integral solution
that connects all k endpoint pairs and which uses at most 0.9 fraction of each edge’s capacity,
we give an integral solution that connects at least (1 − 1/e) fraction of the endpoints pairs
and does not exceed any edge’s capacity.

The idea is to write an LP. For each endpoint pair i, j that has to be connected and
each edge e = (u, v) we have a variable x^{i,j}_{uv} that is supposed to be 1 if the path from i to j
passes through (u, v), and 0 otherwise. (Note that edges are directed.) Then for each edge
(u, v) we can add a capacity constraint

∑_{i,j: endpoints} x^{i,j}_{uv} ≤ cuv.
But since we can't require variables to be 0/1 in an LP, we relax to 0 ≤ x^{i,j}_{uv} ≤ 1. This
allows a path to be split over many paths (this will remind you of network flow if you have
seen it in undergrad courses). Of course, this seems all wrong since avoiding such splitting
was the whole point in the problem! Be patient just a bit more.
Furthermore we need the so-called flow conservation constraints. These say that the
fractional amount of paths leaving i and arriving at j is 1, and that paths never get stranded
in between.
∑_v x^{ij}_{uv} = ∑_v x^{ij}_{vu}    ∀ u ≠ i, j
∑_v x^{ij}_{uv} − ∑_v x^{ij}_{vu} = 1    u = i
∑_v x^{ij}_{vu} − ∑_v x^{ij}_{uv} = 1    u = j
Under our hypothesis about the problem, this LP is feasible and we get a fractional
solution {x^{i,j}_{uv}}. These values can be seen as bits and pieces of paths lying strewn about
the network.
Let us first see that neither deterministic rounding nor simple randomized rounding is
a good idea. Consider a node u where x^{ij}_{uv} is 1/3 on three incoming edges and 1/2 on two
outgoing edges. Then deterministic rounding would round the incoming edges to 0 and the
outgoing edges to 1, creating a bad situation where the path never enters u but leaves it on
two edges! Simple randomized rounding will also create a similar bad situation with Ω(1)
(i.e., constant) probability. Clearly, it would be much better to round along entire paths
instead of piecemeal.
Flow decomposition: For each endpoint pair i, j we create a finite set of paths p1, p2, . . .
from i to j, as well as associated weights wp1, wp2, . . . that lie in [0, 1] and sum up to 1.
Furthermore, for each edge (u, v): x^{i,j}_{uv} = sum of weights of all paths among these that
contain (u, v).
Flow decomposition is easily accomplished via depth first search. Just repeatedly find a
path from i to j in the weighted graph defined by the x^{ij}_{uv}'s: the flow conservation constraints
imply that this path can leave every vertex it arrives at, except possibly at j. After you
find such a path from i to j, subtract from all edges on it the minimum x^{ij}_{uv} value along this
path. This ensures that at least one x^{ij}_{uv} gets zeroed out at every step, so the process is
finite.
Randomized rounding: For each endpoint pair i, j pick a path from the above decom-
position randomly by picking it with probability proportional to its weight.
Part 1: We show that this satisfies the edge capacities approximately.
This follows from Chernoff bounds. The expected number of paths that use an edge
{u, v} is

∑_{i,j: endpoints} x^{i,j}_{u,v}.

The LP constraint says this is at most cuv, and since cuv > d log n this is a sum of at least
d log n random variables. Chernoff bounds (see our earlier lecture) imply that this is at most
(1 + ε) times its expectation for all edges with high probability. Chernoff bounds similarly
imply that the overall number of paths is pretty close to k.
Part 2: We show that in the expectation, (1 − 1/e) fraction of endpoints get connected
by paths. Consider any endpoint pair. Suppose they are connected by t fractional paths
p1, p2, . . . with weights w1, w2, . . ., etc. Then ∑_i wi = 1 since the endpoints were fractionally
connected. The probability that the randomized rounding will round all these paths down
to 0 is

∏_i (1 − wi) ≤ ((∑_i (1 − wi))/t)^t    (geometric mean ≤ arithmetic mean)
            = (1 − 1/t)^t ≤ 1/e.

The downside of this rounding is that some of the endpoint pairs may end up with zero
paths, whereas others may end up with 2 or more. We can of course discard extra paths.
(There are better variations of this approximation but covering them is beyond the scope
of this lecture.)
Remark: We have only computed the expectation here, but one can check using Markov’s
inequality that the algorithm gets arbitrarily close to this expectation with probability at
least 1/n (say).

Bibliography

1. New 3/4-approximation to MAX-SAT, by M. X. Goemans and D. P. Williamson. SIAM J. Discrete Math, pp. 656-666, 1994.

2. Randomized rounding: A technique for provably good algorithms and algorithmic proofs, by P. Raghavan and C. D. Thompson. Combinatorica, pp. 365-374, 1987.

3. On the hardness of approximating minimum vertex cover, by I. Dinur and S. Safra. Annals of Math, pp. 439-485, 2005.

4. Approximately hard: the Unique Games Conjecture, by E. Klarreich. Popular article on https://www.simonsfoundation.org/
Chapter 8

Decision-making under
uncertainty: Part 1

This lecture is an introduction to decision theory, which gives tools for making rational
choices in face of uncertainty. It is useful in all kinds of disciplines from electrical engineering
to economics. In computer science, a compelling setting to consider is an autonomous
vehicle or robot navigating in a new environment. It may have some prior notions about
the environment but inevitably it encounters many different situations and must respond
to them. The actions it chooses (drive over the object on the road or drive around it?)
change the set of future events it will see, and thus its choice of the immediate action must
necessarily take into account the continuing effects of that choice far into the future. You
can immediately see that the same issues arise in any kind of decision-making in real life:
save your money in stocks or bonds; go to grad school or get a job; marry the person you
are dating now, or wait a few more years?
Of course, italicized terms in the previous paragraph are all very loaded. What is a
rational choice? What is “uncertainty”? In everyday life uncertainty can be interpreted in
many ways: risk, ignorance, probability, etc.
Decision theory suggests some answers —perhaps simplistic, but a good start. The first
element of this theory is its probabilistic interpretation of uncertainty: there is a probability
distribution on future events that the decision maker is assumed to know. The second
element is quantifying “rational choice.” It is assumed that each outcome has some utility
to the decisionmaker, which is a number. The decision-making is said to be rational if it
maximises the expected utility.
Example 9 Say your utility involves job satisfaction quantified in some way. If you decide
to go for a PhD the distribution of your utility is given by random variable X0 . If you
decide to take a job instead, your return is a random variable X1 . Decision theory assumes
that you (i.e.,the decision-maker) know and understand these two random variables. You
choose to get a PhD if E[X0 ] > E[X1 ].

Example 10 17th century mathematician Blaise Pascal’s famous wager is an early example
of an argument recognizable as modern decision theory. He tried to argue that it is the
rational choice for humans to believe in God (he meant Christian god, of course). If you


choose to be a disbeliever and sin all your life, you may have infinite loss if God exists
(eternal damnation). If you choose to believe and live your life in virtue, and God doesn’t
exist it is all for naught. Therefore if you think that the probability that God exists is
nonzero, you must choose to live as a believer to avoid an infinite expected loss. (Aside:
how convincing is this argument to you?) 2

We will not go into a precise definition of utility (wikipedia moment) but illustrate it
with an example. You can think of it as a quantification of “satisfaction ”. In computer
science we also use payoff, reward etc.

Example 11 (Meaning of utility) You have bought a cake. On any single day, if you eat
x percent of the cake your utility is √x. (This happiness is sublinear because the 5th bite
of the cake brings less happiness than the first.) The cake reaches its expiration date in 5
days and if any is still left at that point you might as well finish it (since there is no payoff
from throwing away cake).
What schedule of cake eating will maximise your total utility over 5 days? Your optimal
choice is to eat 20% of the cake each day, since it yields a payoff of 5 × √20, which is a
lot more than any of the alternatives. For instance, eating it all on day 1 would produce a
much lower payoff of √(5 × 20) = 10.
This example is related to Modigliani’s Life cycle hypothesis, which suggests that con-
sumers consume wealth in a way that evens out consumption over their lifetime. (For
instance, it is rational to take a loan early in life to get an education or buy a house, be-
cause it lets you enjoy a certain quality of life, and pay for it later in life when your earnings
are higher.)

In our class discussion some of you were unconvinced about the axiom about maximising
expected utility. (And the existence of lotteries in real life suggests you are on to something.)
Others objected that one doesn’t truly know —at least very precisely—the distribution of
outcomes, as in the PhD vs job example. Very true. (The financial crash of 2008 relates
to some of this, but that’s a story for another day.) It is important to understand the
limitations of this powerful theory.

8.1 Decision-making as dynamic programming


Often you can think of decision-making under uncertainty as playing a game against a
random opponent, and the optimum policy can be computed via dynamic programming.

Example 12 (Cake eating revisited) Let’s now complicate the cake-eating problem. In
addition to the expiration date, your decision must contend with actions of your housemates,
who tend to eat small amounts of cake when you are not looking. On each day with
probability 1/2 they eat 10% of the cake.
Assume that each day the amount you eat as a percentage of the original is a multiple
of 10. You have to compute the cake eating schedule that maximises your expected utility.
Now you can draw a tree of depth 5 that describes all possible outcomes. (For instance
the first level consists of an 11-way choice between eating 0%, 10%, . . . , 100%.) Computing
your optimum cake-eating schedule is a simple dynamic programming over this tree. 2
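Here is a sketch of that dynamic programming for one particular reading of the example (my assumptions: you eat first each day, then the housemates strike with probability 1/2; all amounts are multiples of 10%; eating x percent on one day gives utility √x; on day 5 you finish whatever remains):

# Memoized dynamic programming for the cake-eating schedule.
from functools import lru_cache
from math import sqrt

DAYS = 5

@lru_cache(maxsize=None)
def value(day, left):
    """Maximum expected utility from 'day' onward with 'left' (in 10% units) remaining."""
    if left == 0:
        return 0.0
    if day == DAYS:                      # expiration: eat everything that is left
        return sqrt(10 * left)
    best = 0.0
    for eat in range(left + 1):          # choose how many 10%-units to eat today
        after = left - eat
        nibbled = max(after - 1, 0)      # housemates eat one unit with probability 1/2
        future = 0.5 * value(day + 1, after) + 0.5 * value(day + 1, nibbled)
        best = max(best, sqrt(10 * eat) + future)
    return best

print("expected utility of optimal schedule:", round(value(1, 10), 3))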

The above cake-eating examples can be seen as a metaphor for all kinds of decision-
making in life: e.g., how should you spend/save throughout your life to maximize overall
happiness1 ?
Decision choice theory says that all such decisions can be made by an appropriate
dynamic programming over some tree. Say you think of time as discrete and you have
a finite choice of actions at each step: say, two actions labeled 0 and 1. The environment
then responds with a coin toss. (In cake-eating, if the coin comes up heads, 10%
of the cake disappears.) Then you receive some payoff/utility, which is a real number, and
depends upon the sequence of T moves made so far. If this goes on for T steps, we can
represent this entire game as a tree of depth T .
Then the best decision at each step involves a simple dynamic programming where the
operation at each action node is max and the operation at each probabilistic node is average.
If the node is a leaf it just returns its value. Note that this takes time exponential in T.2
Interestingly, dynamic programming was invented by R. Bellman in this decision-theory
context. (If you ever wondered what the “dynamic”in dynamic programming refers to, well
now you know. Check out wikipedia for the full story.) The dynamic programming is also
related to the game-theoretic notion of backwards induction.
The cake example had a finite horizon of 5 days and often such a finite horizon is imposed
on the problem to make it tractable.
But one can consider a process that goes on for ever and still make it tractable using
discounted payoffs. The payoff is being accumulated at every step, but the decision-maker
discounts the value of payoffs at time t as γ t where γ is the discount factor. This notion is
based upon the observation that most people, given a choice between getting 10 dollars now
versus 11 a year from now, will choose the former. This means that they discount payoffs
made a year from now by 10/11 at least.
Since γ t → 0 as t gets large, discounting ensures that payoffs obtained a large time from
now are perceived as almost zero. Thus it is a “soft ”way to impose a finite horizon.
Aside: Children tend to be fairly shortsighted in their decisions, and don’t understand
the importance of postponement of gratification. Is growing up a process of adjusting your
γ to a higher value? There is evidence that people are born with different values of γ, and
this is known to correlate with material success later in life. (See the wikipedia page on the
Stanford marshmallow experiment.)

8.2 Markov Decision Processes (MDPs)


This is the version of decision-making most popular in AI and robotics, and is used in
autonomous vehicles, drones etc. (Of course, the difficult “engineering”part is figuring out
the correct MDP description.) The literature on this topic is also vast.
The MDP framework is a way to succinctly represent the decision-maker’s interaction
with the environment. The decision-maker has a finite number of states and a finite number
Footnote 1: Several Nobel prizes were awarded for figuring out the implications of this theory for explaining economic behavior, and even phenomena like marriage/divorce.
Footnote 2: In fact in a reasonable model where each node of the tree can be computed in time polynomial in the description of the node, Papadimitriou showed that the problem of computing the optimum policy is PSPACE-complete, and hence exp(T) time is unavoidable.

Figure 8.1: An MDP (from S. Thrun’s notes)

of actions it is allowed to take in each state. (For example, a state for an autonomous vehicle
could be defined using a finite set of variables: its speed, what lane it is in, whether or not
there is a vehicle in front/back/left/right, whether or not one of them is getting closer at
a fast rate.) Upon taking an action the decision-maker gets a reward and then “nature”or
“chance”transitions him probabilistically to another state. The optimal policy is defined as
one that maximises the total reward (or discounted reward).
For simplicity assume the set of states is labeled by integers 1, . . . , n, the possible actions
in each state are 0/1. For each action b there is a probability p(i, b, j) of transitioning to
state j if this action is taken in that state. Such a transition brings an immediate reward
of R(i, b, j). Note that this process goes forever; the decision-maker keeps taking actions,
which affect the sequence of states it passes through and the rewards it gets.
The name Markov: This refers to the memoryless aspect of the above setup: the reward
and transition probabilities do not depend upon the past history.
Example 13 If the decision-maker always takes action 0 and s1 , s2 , . . . , are the random
variables denoting the states it passes through, then its total reward is

∑_{t≥1} R(st, 0, st+1).

Furthermore, the distribution of st is completely determined (as described above) given st−1
(i.e., we don’t need to know the earlier sequence of states that were visited).
This sum of rewards is typically going to be infinite, so if we use a discount factor γ
then the discounted reward of the above sequence is

∑_{t≥1} γ^t R(st, 0, st+1).
2

8.3 Optimal MDP policies via LP


A policy is a strategy for the decision-maker to choose its actions in the MDP. You can think
of it as the driver of the hardware whose workings are described by the MDP. One idea —
based upon the discussion above—is to let the policy be dictated by a dynamic programming
that is limited to lookahead T steps ahead. But this is computationally difficult for even
moderate T . Ideally we would want a simple precomputed answer.
The problem with precomputed answers is that in general the optimal action in a state
at a particular time could depend upon the precise sequence of states traversed in the past.
Dynamic programming allows this possibility.
We are interested in history-independent policies: each time the decision-maker enters
the state it takes the same action. This is computationally trivial to implement in real-time.
The above example contained a very simple history-independent policy: always take the
action 0. In general such a policy is a mapping π : {1, . . . , n} → {0, 1}. So there are 2^n
possible policies. Are they any good?
For each fixed policy the MDP turns into a simple (but infinite) random walk on states,
where the probability of transitioning from i to j is p(i, π(i), j). To talk sensibly about an
optimum policy one has to make the total reward finite, so we assume a discount factor
γ < 1. Then the expression for reward is

∑_{t≥1} γ^t (reward at time t).

Clearly this converges. Under some technical condition it can be shown that the optimum
policy is history-independent.3
To compute the rewards from the optimum policy one ignores transient effects as the
random walk settles down, and looks at the final steady state. This computation can be
done via linear programming.
Let Vi be the expected reward of following the optimum policy if one starts in state i.
In the first step the policy takes action π(i) ∈ {0, 1} and transitions to another state j.
The subpolicy that kicks in after this transition must also be optimal, though its
contribution is attenuated by γ. So Vi must satisfy

Vi = ∑_{j=1}^{n} p(i, π(i), j) (R(i, π(i), j) + γ Vj).    (8.1)
Thus if the allowed actions are 0, 1 the optimum policy must satisfy:

Vi ≥ ∑_{j=1}^{n} p(i, 0, j) (R(i, 0, j) + γ Vj),

and

Vi ≥ ∑_{j=1}^{n} p(i, 1, j) (R(i, 1, j) + γ Vj).
Footnote 3: This condition has to do with the ergodicity of the MDP. For each fixing of the policy the MDP turns into a simple random walk on the state space. One needs this to converge to a stationary distribution whereby each state i appears during the walk some pi fraction of times.

The objective is to minimize ∑_i Vi subject to the above constraints. (Note that the
constraints for other states will have Vi on the other side of the inequality, which will
constrain it also.) So the LP is really solving for

Vi = max_{b∈{0,1}} ∑_{j=1}^{n} p(i, b, j) (R(i, b, j) + γ Vj).

After solving the LP one has to look at which of the above two inequalities involving Vi is
tight to figure out whether the optimum action π(i) is 0 or 1.
In practice solving via LP is considered too slow (since the number of states could be
100,000 or more) and iterative methods are used instead. We’ll see some iterative methods
later in the course in other contexts.
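For a flavor of those iterative methods, here is a sketch (not from the notes) of value iteration on a tiny random MDP. It repeatedly applies the max-over-actions recurrence above until the values stop changing; the instance itself is made up.

# Value iteration on a random MDP with states {0,...,n-1} and actions {0,1}.
# p[i][b] is a probability vector over next states; R[i][b][j] is the reward.
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 5, 0.9
p = rng.random((n, 2, n)); p /= p.sum(axis=2, keepdims=True)   # normalize transitions
R = rng.random((n, 2, n))

V = np.zeros(n)
for _ in range(1000):
    # Bellman update: V_i = max_b sum_j p(i,b,j) * (R(i,b,j) + gamma * V_j)
    Q = (p * (R + gamma * V)).sum(axis=2)      # Q[i, b]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)                      # which inequality is tight in each state
print("V:", np.round(V, 3), "policy:", policy)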
Bibliography

1. C. Papadimitriou, Games against Nature. JCSS 31, 288-301 (1985)


Chapter 9

Decision-making under total


uncertainty: the multiplicative
weight algorithm

(Today’s notes below are largely lifted with minor modifications from a survey by Arora,
Hazan, Kale in Theory of Computing journal, Volume 8 (2012), pp. 121-164.)
Today we study decision-making under total uncertainty: there is no a priori distri-
bution on the set of possible outcomes. (This line will cause heads to explode among
devout Bayesians, but it makes sense in many computer science settings. One reason is
computational complexity or general lack of resources: the decision-maker usually lacks the
computational power to construct the tree of all exp(T ) outcomes possible in the next T
steps, and the resources to do enough samples/polls/surveys to figure out their distribution.
Or the algorithm designer may not be a Bayesian.)
Such decision-making (usually done with efficient algorithms) is studied in the field of
online computation, which takes the view that the algorithm is responding to a sequence of
requests that arrive one by one. The algorithm must take an action as each request arrives,
and it may discover later, after seeing more requests, that its past actions were suboptimal.
But past actions cannot be changed.
See the book by Borodin and El-Yaniv for a fuller introduction to online algorithms.
This lecture and the next covers one such success story: an online optimization tool called
the multiplicative weight update method. The power of the method arises from the very
minimalistic assumptions, which allow it to be plugged into various settings (as we will do
in next lecture).

9.1 Motivating example: weighted majority algorithm


Now we briefly illustrate the general idea in a simple and concrete setting. This is known
as the Prediction from Expert Advice problem.
Imagine the process of picking good times to invest in a stock. For simplicity, assume
that there is a single stock of interest, and its daily price movement is modeled as a sequence


of binary events: up/down. (Below, this will be generalized to allow non-binary events.)
Each morning we try to predict whether the price will go up or down that day; if our
prediction happens to be wrong we lose a dollar that day, and if it’s correct, we lose nothing.
The stock movements can be arbitrary and even adversarial.1 To balance out this
pessimistic assumption, we assume that while making our predictions, we are allowed to
watch the predictions of n “experts”. These experts could be arbitrarily correlated, and
they may or may not know what they are talking about. The algorithm’s goal is to limit its
cumulative losses (i.e., bad predictions) to roughly the same as the best of these experts. At
first sight this seems an impossible goal, since it is not known until the end of the sequence
who the best expert was, whereas the algorithm is required to make predictions all along.
For example, the first algorithm one thinks of is to compute each day’s up/down pre-
diction by going with the majority opinion among the experts that day. But this algorithm
doesn’t work because a majority of experts may be consistently wrong on every single day,
while some single expert in this crowd happens to be right every time.
The weighted majority algorithm corrects the trivial algorithm. It maintains a weighting
of the experts. Initially all have equal weight. As time goes on, some experts are seen as
making better predictions than others, and the algorithm increases their weight proportion-
ately. The algorithm’s prediction of up/down for each day is computed by going with the
opinion of the weighted majority of the experts for that day.

Weighted majority algorithm


Initialization: Fix an η ≤ 1/2. For each expert i, associate the weight wi(1) := 1.
For t = 1, 2, . . . , T :

1. Make the prediction that is the weighted majority of the experts’ predictions based
on the weights w1 (t) , . . . , wn (t) . That is, predict “up” or “down” depending on which
prediction has a higher total weight of experts advising it (breaking ties arbitrarily).

2. For every expert i who predicts wrongly, decrease his weight for the next round by
multiplying it by a factor of (1 − η):

wi (t+1) = (1 − η)wi (t) (update rule). (9.1)

Theorem 5
After T steps, let mi (T ) be the number of mistakes of expert i and M (T ) be the number of
mistakes our algorithm has made. Then we have the following bound for every i:
2 ln n
M (T ) ≤ 2(1 + η)mi (T ) + .
η

In particular, this holds for i which is the best expert, i.e. having the least mi (T ) .
Footnote 1: Note that finance experts have studied stock movements for over a century and there are all kinds of stochastic models fitted to them. But we are doing computer science here, and we will see that this adversarial view will help us apply the same idea to a variety of other settings.

Proof: A simple induction shows that wi(t+1) = (1 − η)^{mi(t)}. Let Φ(t) = ∑_i wi(t) (“the
potential function”). Thus Φ(1) = n. Each time we make a mistake, the weighted majority
of experts also made a mistake, so at least half the total weight decreases by a factor 1 − η.
Thus, the potential function decreases by a factor of at least (1 − η/2):

Φ(t+1) ≤ Φ(t) (1/2 + (1/2)(1 − η)) = Φ(t)(1 − η/2).

Thus simple induction gives Φ(T+1) ≤ n(1 − η/2)^{M(T)}. Finally, since Φ(T+1) ≥ wi(T+1) for
all i, the claimed bound follows by comparing the above two expressions and using the fact
that − ln(1 − η) ≤ η + η² since η < 1/2. 2
The beauty of this analysis is that it makes no assumption about the sequence of events:
they could be arbitrarily correlated and could even depend upon our current weighting of
the experts. In this sense, the algorithm delivers more than initially promised, and this lies
at the root of why (after obvious generalization) it can give rise to the diverse algorithms
mentioned earlier. In particular, the scenario where the events are chosen adversarially
resembles a zero-sum game, which we will study in a future lecture.
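Here is a small simulation sketch of the algorithm (my own toy setup, with one good expert and random rather than adversarial outcomes):

# Deterministic weighted majority: n experts predict up(1)/down(0) each day.
import random

random.seed(0)
n, T, eta = 10, 500, 0.3
w = [1.0] * n
my_mistakes, expert_mistakes = 0, [0] * n

for t in range(T):
    outcome = random.randint(0, 1)
    preds = [outcome if random.random() < 0.9 else 1 - outcome]      # expert 0 is good
    preds += [random.randint(0, 1) for _ in range(n - 1)]            # the rest guess
    up_weight = sum(wi for wi, p in zip(w, preds) if p == 1)
    guess = 1 if up_weight >= sum(w) / 2 else 0                      # weighted majority
    my_mistakes += (guess != outcome)
    for i in range(n):
        if preds[i] != outcome:
            expert_mistakes[i] += 1
            w[i] *= (1 - eta)                                        # update rule (9.1)

print("algorithm:", my_mistakes, "best expert:", min(expert_mistakes))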

9.1.1 Randomized version


The above algorithm is deterministic. When mi(T) ≫ (2 ln n)/η we see from the statement of
Theorem 5 that the number of mistakes made by the algorithm is bounded from above
by roughly 2(1 + η)mi(T), i.e., approximately twice the number of mistakes made by the
best expert. This is tight for any deterministic algorithm (Exercise: prove this!). However,
the factor of 2 can be removed by substituting the above deterministic algorithm by a
randomized algorithm that predicts according to the majority opinion with probability
proportional to its weight. (In other words, if the total weight of the experts saying “up”
is 3/4 then the algorithm predicts “up” with probability 3/4 and “down” with probability
1/4.) Then the number of mistakes after T steps is a random variable and the claimed
upper bound holds for its expectation. Now we give this calculation.
First note that the randomized algorithm can be restated as picking an expert i with
probability proportional to its weight and using that expert’s prediction. Note that the
probability of picking the expert is
pi(t) := wi(t) / ∑_j wj(t) = wi(t) / Φ(t).

Now let's slightly change notation: let mi(t) be 1 if expert i makes a wrong prediction at
time t and 0 otherwise. (Thus mi(t) is the cost incurred by this expert at that time.) Then the
probability the algorithm makes a mistake at time t is simply ∑_i pi(t) mi(t), which we will
write as the inner product of the m and p vectors: m(t) · p(t). Thus the expected number
of mistakes by our algorithm at the end is

∑_{t=0}^{T−1} m(t) · p(t).

Now let's compute the change in potential Φ(t) = ∑_i wi(t):

Φ(t+1) = ∑_i wi(t+1)
       = ∑_i wi(t) (1 − η mi(t))
       = Φ(t) − η Φ(t) ∑_i mi(t) pi(t)
       = Φ(t) (1 − η m(t) · p(t))
       ≤ Φ(t) exp(−η m(t) · p(t)).

Note that this potential drop is not a random variable; it is a deterministic quantity
that depends only on the loss vector m(t) and the current expert weights (which in turn are
determined by the loss vectors of the previous steps).
We conclude by induction that the final potential is at most

Φ(0) ∏_t exp(−η m(t) · p(t)) = Φ(0) exp(−η ∑_t m(t) · p(t)).

For each i this final potential is at least the final weight of the ith expert, which is

∏_t (1 − η mi(t)) ≥ (1 − η)^{∑_t mi(t)}.
Thus taking logs and using the fact that − log(1 − η) ≤ η(1 + η), we conclude that ∑_{t=0}^{T−1} m(t) · p(t)
(which is also the expected number of mistakes by our algorithm) is at most (1 + η) times
the number of mistakes by expert i, plus the same old additive factor 2 log n/η.

9.2 The Multiplicative Weights algorithm


(Now we give a more general result that was not done in class but is completely analogous.
We will use the statement in the next class; you can find the proof in the AHK survey if
you like.)
In the general setting, we have a choice of n decisions in each round, from which we
are required to select one. (The precise details of the decision are not important here:
think of them as just indexed from 1 to n.) In each round, each decision incurs a certain
cost, determined by nature or an adversary. All the costs are revealed after we choose our
decision, and we incur the cost of the decision we chose. For example, in the prediction
from expert advice problem, each decision corresponds to a choice of an expert, and the
cost of an expert is 1 if the expert makes a mistake, and 0 otherwise.
To motivate the Multiplicative Weights (MW) algorithm, consider the naïve strategy
that, in each iteration, simply picks a decision at random. The expected penalty will be
that of the “average” decision. Suppose now that a few decisions are clearly better in the
long run. This is easy to spot as the costs are revealed over time, and so it is sensible to

reward them by increasing their probability of being picked in the next round (hence the
multiplicative weight update rule).
Intuitively, being in complete ignorance about the decisions at the outset, we select them
uniformly at random. This maximum entropy starting rule reflects our ignorance. As we
learn which ones are the good decisions and which ones are bad, we lower the entropy to
reflect our increased knowledge. The multiplicative weight update is our means of skewing
the distribution.
We now set up some notation. Let t = 1, 2, . . . , T denote the current round, and let i
be a generic decision. In each round t, we select a distribution p(t) over the set of decisions,
and select a decision i randomly from it. At this point, the costs of all the decisions are
revealed by nature in the form of the vector m(t) such that decision i incurs cost mi (t) .
We assume that the costs lie in the range [−1, 1]. This is the only assumption we make on
the costs; nature is completely free to choose the cost vector as long as these bounds are
respected, even with full knowledge of the distribution that we choose our decision from.
The expected cost to the algorithm for sampling a decision i from the distribution p(t)
is
E_{i∼p(t)}[mi(t)] = m(t) · p(t).
The total expected cost over all rounds is therefore ∑_{t=1}^{T} m(t) · p(t). Just as before, our
goal is to design an algorithm which achieves a total expected cost not too much more
than the cost of the best decision in hindsight, viz. min_i ∑_{t=1}^{T} mi(t). Consider the following
algorithm, which we call the Multiplicative Weights Algorithm. This algorithm has been
studied before as the prod algorithm of Cesa-Bianchi, Mansour, and Stoltz.

Multiplicative Weights algorithm


Initialization: Fix an η ≤ 1/2. For each decision i, associate the weight wi(1) := 1.
For t = 1, 2, . . . , T :

1. Choose decision i with probability proportional to its weight wi(t). I.e., use the dis-
tribution over decisions p(t) = {w1(t)/Φ(t), . . . , wn(t)/Φ(t)} where Φ(t) = ∑_i wi(t).

2. Observe the costs of the decisions m(t) .

3. Penalize the costly decisions by updating their weights as follows: for every decision
i, set
wi (t+1) = wi (t) (1 − ηmi (t) ) (9.2)

Figure 9.1: The Multiplicative Weights algorithm.

The following theorem —completely analogous to Theorem 5— bounds the total ex-
pected cost of the Multiplicative Weights algorithm (given in Figure 9.1) in terms of the
total cost of the best decision:
Theorem 6
Assume that all costs $m_i^{(t)} \in [-1, 1]$ and $\eta \le 1/2$. Then the Multiplicative Weights algorithm guarantees that after $T$ rounds, for any decision $i$, we have
$$\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \le \sum_{t=1}^{T} m_i^{(t)} + \eta \sum_{t=1}^{T} \big|m_i^{(t)}\big| + \frac{\ln n}{\eta}.$$

Note that we have not addressed the optimal choice of $\eta$ thus far. Firstly, it should be small enough that all calculations in the analysis hold, say $\eta \cdot m_i^{(t)} \le 1/2$ for all $i, t$. Typically this is done by rescaling the payoffs to lie in $[-1, 1]$, which means that $\sum_{t=1}^{T} |m_i^{(t)}| \le T$. Then setting $\eta \approx \sqrt{\ln n/T}$ gives the tightest upper bound on the right hand side in Theorem 6, by reducing the additive error to about $\sqrt{T \ln n}$. Of course, this is a safe choice; in practice the best $\eta$ depends upon the actual sequence of events, but those are not known in advance.
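To make the above concrete, here is a minimal numpy sketch of the algorithm in Figure 9.1. The random cost vectors below are synthetic stand-ins for whatever costs nature reveals each round, and the choice of $\eta$ follows the discussion above.

```python
# A minimal sketch of the Multiplicative Weights algorithm of Figure 9.1.
# The cost vectors here are synthetic; in applications they come from nature/an adversary.
import numpy as np

def multiplicative_weights(costs, eta):
    """costs: T x n array with entries in [-1, 1]; returns the expected cost in each round."""
    T, n = costs.shape
    w = np.ones(n)                           # w_i^{(1)} = 1
    expected_costs = []
    for t in range(T):
        p = w / w.sum()                      # distribution p^{(t)}
        expected_costs.append(p @ costs[t])  # m^{(t)} . p^{(t)}
        w = w * (1 - eta * costs[t])         # w_i^{(t+1)} = w_i^{(t)} (1 - eta m_i^{(t)})
    return np.array(expected_costs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, n = 10000, 50
    costs = rng.uniform(-1, 1, size=(T, n))
    eta = min(0.5, np.sqrt(np.log(n) / T))
    alg_total = multiplicative_weights(costs, eta).sum()
    best_total = costs.sum(axis=0).min()     # best single decision in hindsight
    print(alg_total, best_total, alg_total - best_total)  # gap should be O(sqrt(T log n))
```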
Bibliography
1. S. Arora, E. Hazan, S. Kale. The multiplicative weights update method: A meta algorithm and its applications. Theory of Computing, Volume 8 (2012), pp. 121-164.
2. A. Borodin and R. El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 1998.
Chapter 10

Applications of multiplicative
weight updates: LP solving,
Portfolio Management

Today we see how to use the multiplicative weight update method to solve other problems.
In many settings there is a natural way to make local improvements that “make sense.” The
multiplicative weight updates analysis from last time (via a simple potential function) allows
us to understand and analyse the net effect of such sensible improvements. (Formally, what
we are doing in many settings is analysing an algorithm called gradient descent which we’ll
encounter more formally later in the course.)

10.1 Solving systems of linear inequalities


We encountered systems of linear inequalities in Lecture 6. Today we study a version that
seems slightly more restricted but is nevertheless as powerful as general linear programming.
(Exercise!)

system 1
$$a_1 \cdot x \ge b_1$$
$$a_2 \cdot x \ge b_2$$
$$\vdots$$
$$a_m \cdot x \ge b_m$$
$$x_i \ge 0 \quad \forall i = 1, 2, \ldots, n$$
$$\sum_i x_i = 1.$$

In high school you learnt the “graphical” method to solve linear inequalities, and as we discussed in Lecture 6, such methods can take $m^{n/2}$ time. Here we design an algorithm that,

given an error parameter ε > 0, runs in O(mL/ε) time and either tells us that the original
system is infeasible, or gives us a solution x satisfying the last two lines of the above system,
and
aj · x ≥ bj − ε ∀j = 1, . . . , m.
(Note that this allows the possibility that the system is infeasible per se and nevertheless
the algorithm returns such an approximate solution. In that case we have to be happy with
the approximate solution.) Here L is an instance-specific parameter that will be clarified
below; roughly speaking it is the maximum absolute value of any coefficient. (Recall that
the dependence would need to be poly(log L) to be considered polynomial time. We will
study such a method later on in the course.)
What is a way to certify to somebody that the system is infeasible? The following is
sufficient: Come up with a system of nonnegative weights w1 , w2 , . . . , wm , one per inequality,
such that the following linear program has a negative value:

system 2
$$\max \sum_j w_j (a_j \cdot x - b_j)$$
$$x_i \ge 0 \quad \forall i = 1, 2, \ldots, n$$
$$\sum_i x_i = 1.$$

Note: the wj ’s are fixed constants. So this linear program has only two nontrivial constraints
(not counting the constraints xi ≥ 0) so it is trivial to find a solution quickly, as we saw in
class.

Example 14 The system of inequalities x1 + x2 ≥ 1, x1 − 5x2 ≥ 5 is infeasible when


combined with the constraints x1 + x2 = 1, x1 ≥ 0, x2 ≥ 0 since we can multiply the first
inequality by 5 and the second by 1 and add to obtain 6x1 ≥ 10. Note that 6x1 − 10 cannot
take a positive value when x1 ≤ 1.

This method of certifying infeasibility is eminently sensible and the weighting of in-
equalities is highly reminiscent of the weighting of experts in the last lecture. So we can try
to leverage it into a precise algorithm. It will have the following guarantee: (a) either it finds a set of nonnegative weights certifying infeasibility, or (b) it finds a solution $x^{(f)}$ that approximately satisfies the system, in that $a_j \cdot x^{(f)} - b_j \ge -\varepsilon$ for all $j$. Note that conditions (a) and (b) are not disjoint; if a system satisfies both conditions, the algorithm can do either (a) or (b).
We use the meta theorem on MW (Theorem 2) from Lecture 8, where experts have
positive or negative costs (where negative costs can be seen as payoffs) and the algorithm
seeks to minimize costs by adaptively decreasing the weights of experts with larger cost.
The meta theorem says that the algorithm’s payoff over many steps tracks —within (1 + ε)
multiplicative factor—the cost incurred by the best player, plus an additive term O(log n/ε).
We identify $m$ “experts,” one per inequality. We maintain a weighting of experts, with $w_1^{(t)}, w_2^{(t)}, \ldots, w_m^{(t)}$ denoting the weights at step $t$. (At $t = 0$ all weights are 1.) Solve system 2 using these weights. If it turns out to have a negative value, we have proved the infeasibility of system 1 and can HALT right away. Otherwise take any solution, say $x^{(t)}$, and think of it as imposing a “cost” of $m_j^{(t)} = a_j \cdot x^{(t)} - b_j$ on the $j$th expert. (In particular, the first line of system 2 is merely, up to scaling by the sum of weights, the expected cost for our MW algorithm, and it is positive.) Thus the MW update rule will update the experts' weights as:
$$w_j^{(t+1)} \leftarrow w_j^{(t)}\big(1 - \eta\, m_j^{(t)}\big).$$
We continue thus for some number $T$ of steps and if we never found a certificate of the infeasibility of system 1 we output the solution $x^{(f)} = \frac{1}{T}\big(x^{(1)} + x^{(2)} + \cdots + x^{(T)}\big)$, which is the average of all the solution vectors found at various steps. Now let $L$ denote the maximum possible absolute value of any $a_j \cdot x - b_j$ subject to the final two lines of system 2.
Claim: If $T > L^2 \log n/\varepsilon^2$ then $x^{(f)}$ satisfies $a_j \cdot x^{(f)} - b_j \ge -\varepsilon$ for all $j$.
The proof involves the MW meta theorem, which requires us to rescale (multiplying by $1/L$) so all costs lie in $[-1, 1]$, and setting $\varepsilon = \sqrt{\log n/T}$. We wish to make $T$ large enough so that the per-step additive error $\sqrt{\log n/T} < \varepsilon/L$, which implies $T > L^2 \log n/\varepsilon^2$.
Then we can reason as follows: (a) The expected per-step cost of the MW algorithm
was positive (in fact it was positive in each step). (b) The quantity aj · x(f ) − bj is simply
the average cost for expert j per step. (c) The total number of steps is large enough that
our MW theorem says that (a) cannot be ε more than (b).
Here is another intuitive explanation that suggests why this algorithm makes sense independent of the experts idea. Vectors $x^{(1)}, x^{(2)}, \ldots, x^{(T)}$ represent simplistic attempts to find a solution to system 1. If $a_j \cdot x^{(t)} - b_j$ is positive (resp., negative) this means that the $j$th constraint was satisfied (resp., unsatisfied) and thus designating it as a cost (resp., reward) ensures that the constraint is given less (resp., more) weight in the next round. Thus the multiplicative update rule is a reasonable way to search for a weighting of constraints that gives us the best shot at proving infeasibility.
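Here is a rough numpy sketch of the procedure just described, run on the instance from Example 14. The oracle step exploits the fact that system 2 has only the simplex constraints, so its optimum is attained at a vertex $e_i$ of the simplex; the constants and the value of $L$ below are illustrative choices, not tuned.

```python
# A sketch of the MW-based feasibility procedure described above, on the
# instance of Example 14. Constants are illustrative.
import numpy as np

def approx_solve(A, b, eps, L):
    """Try to find x in the simplex with A x >= b - eps (coordinatewise)."""
    m, n = A.shape
    T = int(np.ceil(L**2 * np.log(m) / eps**2))   # the claim: this many steps suffice
    eta = np.sqrt(np.log(m) / T)
    w = np.ones(m)                                # one expert per inequality
    x_sum = np.zeros(n)
    for _ in range(T):
        i = np.argmax(w @ A)                      # best vertex of the simplex for system 2
        x = np.zeros(n); x[i] = 1.0
        if w @ (A @ x - b) < 0:                   # system 2 has negative value
            return None, w                        # these weights certify infeasibility
        cost = (A @ x - b) / L                    # rescale costs to lie in [-1, 1]
        w = w * (1 - eta * cost)                  # MW update on the experts
        x_sum += x
    return x_sum / T, None                        # x^{(f)} = average of the iterates

A = np.array([[1.0, 1.0], [1.0, -5.0]])
b = np.array([1.0, 5.0])
x, w = approx_solve(A, b, eps=0.1, L=10.0)
if x is None:
    print("infeasible; certifying weights:", w)
else:
    print("approximate solution:", x)
```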
Remarks: See the AHK survey on multiplicative weights for the history of this algorithm,
which is actually a quantitative version of an older algorithm called Lagrangian relaxation.

10.1.1 Duality Theorem


The duality theorem for linear programming says that our method of showing infeasibility
of system 1 —namely, show for some weighting that system 2 has negative value–is not
just sufficient but also necessary.
This follows by imagining letting ε go to 0. If the system is infeasible, then there is
some ε0 (depending upon the number of constraints and the coefficient values) such that
there is no ε-close solution with the claimed properties of x(f ) for ε < ε0 . Hence at one of
the steps we must have failed to find a positive solution for system 2.
We’ll further discuss LP duality in a later lecture.

10.2 Portfolio Management


Now we return to a more realistic version of the stock-picking problem that motivated
our MW algorithm. (You will study this further in a future homework.) There is a set of n
stocks (e.g., the 500 stocks in the S&P 500) and you wish to manage an investment portfolio
using them. You wish to do at least as well as the best stock in hindsight, and also better
than index funds, which keep a fixed proportion of wealth in each stock. Let ci (t) be the
price of stock i at the end of day t.
If you have $P_i^{(t)}$ fraction of your wealth invested in stock $i$ then on the $t$th day your portfolio will rise in value by a multiplicative factor $\sum_i P_i^{(t)} c_i^{(t)}/c_i^{(t-1)}$. Looks familiar? Let $r_i^{(t)}$ be shorthand for $c_i^{(t)}/c_i^{(t-1)}$.
If you invested all your money in stock $i$ on day 0 then the rise in wealth at the end is
$$\frac{c_i^{(T)}}{c_i^{(0)}} = \prod_{t=1}^{T} r_i^{(t)}.$$

Since $\log(ab) = \log a + \log b$, this gives us the idea to set up the MW algorithm as follows. We run it by looking at $n$ imagined experts, each corresponding to one of the stocks. The payoff for expert $i$ on day $t$ is $\log r_i^{(t)}$. Then as noted above, the total payoff for expert $i$ over all days is $\sum_t \log r_i^{(t)} = \log\big(c_i^{(T)}/c_i^{(0)}\big)$. This is simply the log of the multiplicative factor by which our wealth would increase in $T$ days if we had just invested all of it in stock $i$ on the first day. (This is the jackpot we are shooting for: imagine the money we could have made if we'd put all our savings in Google stock on the day of its IPO.)
Our algorithm plays the canonical MW strategy from last lecture with a suitably small $\eta$ and with the probability distribution $P^{(t)}$ on experts at time $t$ being interpreted as follows: $P_i^{(t)}$ is the fraction of wealth invested in stock $i$ at the start of day $t$. Thus we are no longer thinking of picking one expert to follow at each time step; the distribution on experts is the way of splitting our money into the $n$ stocks. In particular on day $t$ our portfolio increases in value by a factor $\sum_i P_i^{(t)} r_i^{(t)}$.
Note that we are playing the MW strategy that involves maximising payoffs, not minimizing costs. (That is, increase the weight of experts if they get positive payoff, and reduce weight in case of negative payoff.) The MW theorem says that the total payoff of the MW strategy, namely $\sum_t \sum_i P_i^{(t)} \log r_i^{(t)}$, is at least $(1-\varepsilon)$ times the payoff of the best expert provided $T$ is large enough.
It only remains to make sense of the total payoff for the MW strategy, namely $\sum_t \sum_i P_i^{(t)} \log r_i^{(t)}$, since thus far it is just an abstract quantity in a mental game that doesn't make sense per se in terms of actual money made.
Since the logarithm is a concave function (i.e., $\frac{1}{2}(\log x + \log y) \le \log \frac{x+y}{2}$) and $\sum_i P_i^{(t)} = 1$, simple calculus shows that
$$\sum_i P_i^{(t)} \log r_i^{(t)} \le \log\Big(\sum_i P_i^{(t)} r_i^{(t)}\Big).$$

The right hand side is exactly the logarithm of the rise in value of the portfolio of the MW
strategy on day t. Thus we conclude that the total payoff over all days lower bounds the
sum of the logarithms of these rises, which of course is the log of the ratio (final value of
the portfolio)/(initial value).
All of this requires that the number of steps $T$ should be large enough. Specifically, if $\log r_i^{(t)} \le 1$ (i.e., no stock changes value by more than a factor 2 on a single day) then the total difference between the desired payoff and the actual payoff is $\sqrt{\log n/T}$ times $\max_i \sum_t \log r_i^{(t)}$, as noted in Lecture 8. This performance can be improved by other variations of the method (see the paper by Hazan and Kale). In practice this method doesn't work very well; we'll later explore a better algorithm.
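Here is a small numpy sketch of this portfolio strategy on synthetic daily price ratios; the data, the value of $\eta$, and the horizon are illustrative rather than any kind of serious backtest.

```python
# A sketch of the MW portfolio strategy described above, on synthetic price
# ratios r_i^{(t)}. Parameters and data are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
T, n = 2000, 10
r = np.exp(rng.normal(0.0002, 0.01, size=(T, n)))   # daily ratios c_i^{(t)}/c_i^{(t-1)}

eta = 0.1
w = np.ones(n)
wealth = 1.0
for t in range(T):
    P = w / w.sum()                   # fraction of wealth in each stock on day t
    wealth *= P @ r[t]                # portfolio grows by sum_i P_i^{(t)} r_i^{(t)}
    payoff = np.log(r[t])             # payoff of expert i is log r_i^{(t)}
    w = w * (1 + eta * payoff)        # maximizing payoffs: reward experts with gains

best_single_stock = np.exp(np.log(r).sum(axis=0)).max()
print("MW wealth:", wealth, "best single stock in hindsight:", best_single_stock)
```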
Remark: One limitation of this strategy is that we have ignored trading costs (i.e., dealer's commissions). As you can imagine, researchers have also incorporated trading costs in this framework (see Blum and Kalai). Perhaps the bigger limitation of the MW strategy is that it assumes nothing about price movements, whereas there is a lot known about the (random-like) behavior of the stock market. Traditional portfolio management theory assumes such stochastic models, and is more akin to the decision theory we studied two lectures ago. But stochastic models of the stock market fail sometimes (even catastrophically) and so ideally one wants to combine the stochastic models with the more pessimistic viewpoint taken in the MW method. See the paper by Hazan and Kale. See also a recent interesting paper by Abernethy et al. that suggests that the standard stochastic model arises from optimal actions of market players.
Thomas Cover was the originator of the notion of managing a portfolio against an
adversarial market. His strategy is called universal portfolio.
Bibliography

1. A. Blum and A. Kalai. Efficient Algorithms for Universal Portfolios. J. Machine Learning Research, 2002.

2. E. Hazan and S. Kale. On Stochastic and Worst-case Models for Investing. Proc. NIPS 2009.

3. J. Abernethy, R. Frongillo, A. Wibisono. Minimax Option Pricing Meets Black-Scholes in the Limit. Proc. ACM STOC 2012.
Chapter 11

High Dimensional Geometry,


Curse of Dimensionality,
Dimension Reduction

High-dimensional vectors are ubiquitous in applications (gene expression data, set of movies
watched by Netflix customer, etc.) and this lecture seeks to introduce some common prop-
erties of these vectors. We encounter the so-called curse of dimensionality which refers to
the fact that algorithms are simply harder to design in high dimensions and often have a
running time exponential in the dimension. We also encounter the blessings of dimensional-
ity, which allows us to reason about higher dimensional geometry using tools like Chernoff
bounds. We also show the possibility of dimension reduction — it is sometimes possible to
reduce the dimension of a dataset, for some purposes.
Notation: For a vector $x \in \Re^d$ its $\ell_2$-norm is $|x|_2 = (\sum_i x_i^2)^{1/2}$ and the $\ell_1$-norm is $|x|_1 = \sum_i |x_i|$. For any two vectors $x, y$ their Euclidean distance refers to $|x - y|_2$ and Manhattan distance refers to $|x - y|_1$.
High dimensional geometry is inherently different from low-dimensional geometry.

Example 15 Consider how many almost orthogonal unit vectors we can have in space, such
that all pairwise angles lie between 88 degrees and 92 degrees.
In <2 the answer is 2. In <3 it is 3. (Prove these to yourself.)
In <d the answer is exp(cd) where c > 0 is some constant.

Example 16 Another example is the ratio of the volume of the unit sphere to its circumscribing cube (i.e., cube of side 2). In $\Re^2$ it is $\pi/4$ or about 0.78. In $\Re^3$ it is $\pi/6$ or about 0.52. In $d$ dimensions it is $\exp(-c \cdot d \log d)$.

Let’s start with useful generalizations of some geometric objects to higher dimensional
geometry:

• The n-cube in <n : {(x1 ...xn ) : 0 ≤ xi ≤ 1}. To visualize this in <4 , think of yourself
as looking at one of the faces, say x1 = 1. This is a cube in <3 and if you were


Figure 11.1: Number of almost-orthogonal vectors in $\Re^2$, $\Re^3$, $\Re^d$

able to look in the fourth dimension you would see a parallel cube at x1 = 0. The
visualization in <n is similar.
The volume of the n-cube is 1.
• The unit ball $B_d$ in $\Re^d$: $B_d := \{(x_1, \ldots, x_d) : \sum_i x_i^2 \le 1\}$. Again, to visualize the ball in $\Re^4$, imagine you have sliced through it with a hyperplane, say $x_1 = 1/2$. This slice is a ball in $\Re^3$ of radius $\sqrt{1 - 1/2^2} = \sqrt{3}/2$. Every parallel slice also gives a ball.
The volume of $B_d$ is $\frac{\pi^{d/2}}{(d/2)!}$ (assume $d$ is even if the previous expression bothers you), which is $\frac{1}{d^{\Theta(d)}}$.

• In $\Re^2$, if we slice the unit ball (i.e., disk) with a line at distance 1/2 from the center then a significant fraction of the ball's volume lies on each side. In $\Re^d$ if we do the same with a hyperplane, then the radius of the $(d-1)$-dimensional slice is $\sqrt{3}/2$, and so the volume on the other side is negligible. In fact a constant fraction of the volume lies within a slice at distance $1/\sqrt{d}$ from the center, and for any $c > 1$, a $(1 - 1/c)$-fraction of the volume of the $d$-ball lies in a strip of width $O\big(\sqrt{\tfrac{\log c}{d}}\big)$ around the center.

• A good approximation to picking a random point on the surface of $B_n$ is by choosing random $x_i \in \{-1, 1\}$ independently for $i = 1, \ldots, n$ and normalizing to get $\frac{1}{\sqrt{n}}(x_1, \ldots, x_n)$. An exact way to pick a random point on the surface of $B_n$ is to choose $x_i$ from the standard normal distribution for $i = 1, \ldots, n$, and to normalize: $\frac{1}{\ell}(x_1, \ldots, x_n)$, where $\ell = (\sum_i x_i^2)^{1/2}$.
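A small numpy sketch of the two sampling procedures in the last bullet; the dimension chosen is arbitrary.

```python
# The two ways of picking (nearly) random points on the unit sphere, as in the
# last bullet above: random signs (approximate) vs. normalized Gaussians (exact).
import numpy as np

rng = np.random.default_rng(0)
n = 1000

x = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)   # approximate: random signs / sqrt(n)

g = rng.normal(size=n)
y = g / np.linalg.norm(g)                          # exact: normalized standard normals

print(np.linalg.norm(x), np.linalg.norm(y))        # both are unit vectors
```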

11.1 Number of almost-orthogonal vectors


Now we show there are $\exp(d)$ vectors in $\Re^d$ that are almost-orthogonal. Recall that the angle between two vectors $x, y$ is given by $\cos(\theta) = \langle x, y\rangle/(|x|_2\,|y|_2)$.
Lemma 7
Suppose $a$ is a unit vector in $\Re^n$. Let $x = (x_1, \ldots, x_n) \in \Re^n$ be chosen from the surface of $B_n$ by choosing each coordinate at random from $\{1, -1\}$ and normalizing by factor $\frac{1}{\sqrt{n}}$. Denote by $X$ the random variable $a \cdot x = \sum_i a_i x_i$. Then:
$$\Pr(|X| > t) < e^{-nt^2}.$$
Proof: We have
$$\mu = E(X) = E\Big(\sum_i a_i x_i\Big) = 0,$$
$$\sigma^2 = E\Big[\Big(\sum_i a_i x_i\Big)^2\Big] = E\Big[\sum_{i,j} a_i a_j x_i x_j\Big] = \sum_{i,j} a_i a_j E[x_i x_j] = \sum_i \frac{a_i^2}{n} = \frac{1}{n}.$$
Using the Chernoff bound, we see that
$$\Pr(|X| > t) < e^{-(t/\sigma)^2} = e^{-nt^2}. \qquad \Box$$

Corollary 8
If $x, y$ are chosen at random from $\{-1,1\}^n$, and the angle between them is $\theta_{x,y}$, then
$$\Pr\left[|\cos(\theta_{x,y})| > \sqrt{\frac{\log c}{n}}\right] < \frac{1}{c}.$$
Hence if we pick, say, $\sqrt{c}/2$ random vectors in $\{-1,1\}^n$, the union bound says that the chance that some pair makes an angle whose cosine exceeds $\sqrt{\frac{\log c}{n}}$ in absolute value is less than 1/2. Hence we can make $c = \exp(0.01n)$ and still have the vectors be almost-orthogonal (i.e., cosine is a very small constant).
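The following numpy snippet is a quick numerical check of this phenomenon: it draws a few hundred random sign vectors and measures the largest pairwise cosine. The dimension and the number of vectors are arbitrary illustrative choices.

```python
# Numerical check: random sign vectors in {-1,1}^n are nearly orthogonal.
import numpy as np

rng = np.random.default_rng(0)
n, k = 2000, 200                                         # dimension, number of vectors
V = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(n)    # k random unit vectors

cosines = V @ V.T                                        # pairwise inner products = cosines
off_diag = cosines[~np.eye(k, dtype=bool)]
print("max |cos|:", np.abs(off_diag).max(),
      "compare to sqrt(log k / n):", np.sqrt(np.log(k) / n))
```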

11.2 Curse of dimensionality


Suppose we have a set of vectors in d dimensions and given another vector we wish to
determine its closest neighbor (in `2 norm) to it. Designing fast algorithms for this in the
plane (i.e., <2 ) uses the fact that in the plane there are only O(1) distinct points whose
pairwise distance is about 1 ± ε. In <d there can be exp(d) such points.
Thus most algorithms —nearest neighbor, minimum spanning tree, point location etc.—
have a running time depending upon exp(d). This is the curse of dimensionality in algo-
rithms. (The term was invented by R. Bellman, who as we saw earlier had a knack for
giving memorable names.)
In machine learning and statistics sometimes the term refers to the fact that available
data is too sparse in high dimensions; it takes $\exp(d)$ amount of data (say, points on the sphere) to ensure that each new sample is close to an existing sample. This is a different
take on the same underlying phenomenon.
I hereby coin a new term: Blessing of dimensionality. This refers to the fact that many
phenomena become much clearer and easier to think about in high dimensions because one
can use simple rules of thumb (e.g., Chernoff bounds, measure concentration) which don’t
hold in low dimensions.

11.3 Dimension Reduction


Now we describe a central result of high-dimensional geometry (at least when distances are measured in the $\ell_2$ norm). Problem: Given $n$ points $z^1, z^2, \ldots, z^n$ in $\Re^n$, we would like to find $n$ points $u^1, u^2, \ldots, u^n$ in $\Re^m$ where $m$ is of low dimension (compared to $n$) and the metric restricted to the points is almost preserved, namely:
$$\|z^i - z^j\|_2 \le \|u^i - u^j\|_2 \le (1+\varepsilon)\|z^i - z^j\|_2 \quad \forall i, j. \tag{11.1}$$

The following main result is by Johnson and Lindenstrauss:

Theorem 9
In order to ensure (11.1), $m = O\big(\frac{\log n}{\varepsilon^2}\big)$ suffices, and in fact the mapping can be a linear mapping.

The following ideas do not work to prove this theorem (as we discussed in class): (a)
take a random sample of m coordinates out of n. (b) Partition the n coordinates into m
subsets of size about n/m and add up the values in each subset to get a new coordinate.
Proof: Choose $m$ vectors $x^1, \ldots, x^m \in \Re^n$ at random by choosing each coordinate randomly from $\Big\{\sqrt{\tfrac{1+\varepsilon}{m}}, -\sqrt{\tfrac{1+\varepsilon}{m}}\Big\}$. Then consider the mapping from $\Re^n$ to $\Re^m$ given by
$$z \longrightarrow (z \cdot x^1, z \cdot x^2, \ldots, z \cdot x^m).$$
In other words $u^i = (z^i \cdot x^1, z^i \cdot x^2, \ldots, z^i \cdot x^m)$ for $i = 1, \ldots, n$. We want to show that with positive probability, $u^1, \ldots, u^n$ has the desired properties. This would mean that there exists at least one choice of $u^1, \ldots, u^n$ satisfying inequality (11.1). To show this, first we write the expression $\|u^i - u^j\|^2$ explicitly:
$$\|u^i - u^j\|^2 = \sum_{k=1}^{m} \Big( \sum_{l=1}^{n} (z^i_l - z^j_l) x^k_l \Big)^2.$$
Denote by $z$ the vector $z^i - z^j$, and by $u$ the vector $u^i - u^j$. So we get:
$$\|u\|^2 = \|u^i - u^j\|^2 = \sum_{k=1}^{m} \Big( \sum_{l=1}^{n} z_l x^k_l \Big)^2.$$
Let $X_k$ be the random variable $\big(\sum_{l=1}^{n} z_l x^k_l\big)^2$. Its expectation is $\mu = \frac{1+\varepsilon}{m} \|z\|^2$ (this can be seen similarly to the proof of Lemma 7). Therefore, the expectation of $\|u\|^2$ is $(1+\varepsilon)\|z\|^2$. If we show that $\|u\|^2$ is concentrated enough around its mean, then it would prove the theorem. More formally, this is done in the following Chernoff bound lemma.

Lemma 10
There exist constants $c_1 > 0$ and $c_2 > 0$ such that:
1. $\Pr[\|u\|^2 > (1+\beta)\mu] < e^{-c_1 \beta^2 m}$
2. $\Pr[\|u\|^2 < (1-\beta)\mu] < e^{-c_2 \beta^2 m}$
Therefore there is a constant $c$ such that the probability of a "bad" case is bounded by:
$$\Pr\big[(\|u\|^2 > (1+\beta)\mu) \vee (\|u\|^2 < (1-\beta)\mu)\big] < e^{-c\beta^2 m}.$$
Now, we have $\binom{n}{2}$ random variables of the type $\|u^i - u^j\|^2$. Choose $\beta = \frac{\varepsilon}{2}$. Using the union bound, we get that the probability that any of these random variables is not within $(1 \pm \frac{\varepsilon}{2})$ of their expected value is bounded by
$$\binom{n}{2} e^{-c\frac{\varepsilon^2}{4} m}.$$

So if we choose $m > \frac{8(\log n + \log c)}{\varepsilon^2}$, we get that with positive probability, all the variables are close to their expectation within factor $(1 \pm \frac{\varepsilon}{2})$. This means that for all $i, j$:
$$\big(1 - \tfrac{\varepsilon}{2}\big)(1+\varepsilon)\|z^i - z^j\|^2 \le \|u^i - u^j\|^2 \le \big(1 + \tfrac{\varepsilon}{2}\big)(1+\varepsilon)\|z^i - z^j\|^2.$$
Therefore,
$$\|z^i - z^j\|^2 \le \|u^i - u^j\|^2 \le (1+\varepsilon)^2\|z^i - z^j\|^2,$$
and taking square roots:
$$\|z^i - z^j\| \le \|u^i - u^j\| \le (1+\varepsilon)\|z^i - z^j\|,$$
as required.
Question: The above dimension reduction preserves (approximately) `2 -distances. Can
we do dimension reduction that preserves `1 distance? This was an open problem for many
years until Brinkman and Charikar showed in 2004 that no such dimension reduction is
possible.
Question: Is the theorem tight, or can we reduce the dimension even further below
O(log n/ε2 )? Alon has shown that this is essentially tight.
Finally, we note that there is a now-extensive literature on more efficient techniques
for JL-style dimension reduction, with a major role played by a 2006 paper of Ailon and
Chazelle. Do a google search for “Fast Johnson Lindenstrauss Transforms.”
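A minimal numpy sketch of the linear map used in the proof above (random sign projections scaled by $\sqrt{(1+\varepsilon)/m}$), checking how well pairwise distances are preserved on synthetic Gaussian points; the constants and data are illustrative.

```python
# Random-sign Johnson-Lindenstrauss projection, checked on synthetic points.
import numpy as np

rng = np.random.default_rng(0)
n_points, n, eps = 100, 5000, 0.2
m = int(np.ceil(8 * np.log(n_points) / eps**2))     # O(log n / eps^2) dimensions

Z = rng.normal(size=(n_points, n))                  # the original points z^1 .. z^n
X = rng.choice([-1.0, 1.0], size=(n, m)) * np.sqrt((1 + eps) / m)
U = Z @ X                                           # u^i = (z^i . x^1, ..., z^i . x^m)

# Ratios ||u^i - u^j|| / ||z^i - z^j|| should lie roughly in [1, 1 + eps].
i, j = np.triu_indices(n_points, k=1)
ratios = np.linalg.norm(U[i] - U[j], axis=1) / np.linalg.norm(Z[i] - Z[j], axis=1)
print(ratios.min(), ratios.max())
```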

11.3.1 Locality preserving hashing


Suppose we wish to hash high-dimensional vectors so that nearby vectors tend to hash into the same bucket. To do this we can do a random projection into, say, the cube in 5 dimensions. We discretise the cube into smaller cubes of size $\varepsilon$. Then there are $1/\varepsilon^5$ smaller cubes; these can be the buckets.
This is simplistic; more complicated schemes have been constructed. Things get even
more interesting when we are interested in `1 -distance.

11.3.2 Dimension reduction for efficiently learning a linear classifier


Suppose we are given a set of m data points in <d , each labeled with 0 or 1. For example the
data points may represent emails (represented by the vector giving frequencies of various

Figure 11.2: Margin of a linear classifier with respect to some labeled points

words in them) and the label indicates whether or not the user labeled them as spam. We
are trying to learn the rule (or “classifier”) that separates the 1's from 0's.
The simplest classifier is a halfspace. Finding whether there exists a halfspace $\sum_i a_i x_i \ge b$ that separates the 0's from 1's is solvable via Linear Programming. This LP has $n + 1$ variables and $m$ constraints.
However, there is no guarantee in general that the halfspace that separates the training data will generalize to new examples. ML theory suggests conditions under which the
classifier does generalize, and the simplest is margin. Suppose the data points are unit
vectors. We say the halfspace has margin ε if every datapoint has distance at least ε to the
halfspace.
In the next homework you will show that if such a margin exists then dimension reduction to $O(\log n/\varepsilon^2)$ dimensions at most halves the margin. Hence the LP to find it only has $O(\log n/\varepsilon^2)$ variables instead of $n + 1$.

Bibliography

1. B. Brinkman and M. Charikar. On the impossibility of Dimension Reduction in $\ell_1$. IEEE FOCS 2003.
Chapter 12

Random walks, Markov chains,


and how to analyse them

Today we study random walks on graphs. When the graph is allowed to be directed and
weighted, such a walk is also called a Markov Chain. These are ubiquitous in modeling
many real-life settings.

Example 17 (Drunkard’s walk) There is a sequence of 2n + 1 pubs on a street. A


drunkard starts at the middle house. At every time step, if he is at pub number i, then
with probability 1/2 he goes to pub number i − 1 and with probability 1/2 to pub i + 1.
How many time steps does it take him to reach either the first or the last pub?
Thinking a bit, we quickly realize that the first m steps correspond to m coin tosses,
and the distance from the starting point is simply the difference between the number of
heads and the number of tails. We need this difference to be n. Recall that the number
of heads is distributed like a normal distribution with mean $m/2$ and standard deviation $\sqrt{m}/2$. Thus $m$ needs to be of the order of $n^2$ before there is a good chance of this random variable taking the value $(m+n)/2$.
Thus being drunk slows down the poor guy by a quadratic factor.

Example 18 (Exercise) Suppose the drunkard does his random walk in a city that’s
designed like a grid. At each step he goes North/South/East/West by one block with
probability 1/4. How many steps does it take him to get to his intended address, which is
n blocks north and n blocks east away?

Random walks in space are sometimes called Brownian motion, after botanist Robert
Brown, who in 1826 peered at a drop of water using a microscope and observed tiny par-
ticles (such as pollen grains and other impurities) in it performing strange random-looking
movements. He probably saw motion similar to the one in the above figure. Explaining
this movement was a big open problem. In 1905, during his ”miraculous year” (when he
solved 3 famous open problems in physics) Einstein explained Brownian motion as a ran-
dom walk in space caused by the little momentum being imparted to the pollen in random
directions by the (invisible) molecules of water. This theoretical prediction was soon ex-
perimentally confirmed and seen as a “proof”of the existence of molecules. Today random


walks and brownian motion are used to model the movements of many systems, including
stock prices.

Figure 12.1: A 2D Random Walk

Example 19 (Random walks on graph) We can consider a random walk on a d-regular


graph G = (V, E) instead of in physical space. The particle starts at some vertex v0 and at
each step, if it is at a vertex u, it picks a random edge of u with probability 1/d and then
moves to the other vertex in that edge. There is also a lazy version of this walk where he stays at $u$ with probability 1/2 and moves to each neighbor with probability $1/(2d)$.
Thus the drunkard’s walk can be viewed as a random walk on a line graph.
One can similarly consider random walks on directed graph (randomly pick an outgoing
edge out of u to leave from) and walks on weighted graph (pick an edge with probability
proportional to its weight). Walks on directed weighted graphs are called Markov Chains.

In a random walk, the next step does not depend upon the previous history of steps, only
on the current position/state of the moving particle. In general, the term markovian refers
to systems with a “memoryless”property. In an earlier lecture we encountered Markov
Decision Processes, which also had this memoryless property.
Example 20 (Bigram and trigram models in speech recognition) Language recog-
nition systems work by constantly predicting what’s coming next. Having heard the first
i words they try to generate a prediction of the i + 1th word1 . This is a very complicated
piece of software, and one underlying idea is to model language generation as a markov
chain. (This is not an exact model; language is known to not be markovian, at least in the
simple way described below.)
The simplest idea would be to model this as a markov chain on the words of a dictionary.
Recall that everyday English has about 5, 000 words. A simple markovian model consists
of thinking of a piece of text as a random walk on a space with 5000 states (= words).
A state corresponds to the last word that was just seen. For each word pair w1 , w2 there
is a probability pw1 ,w2 of going from w1 to w2 . According to this Markovian model, the
probability of generating a sentence with the words w1 , w2 , w3 , w4 is qw1 pw1 w2 pw2 w3 pw3 w4
where qw1 is the probability that the first word is w1 .
1
You can see this in the typing box on smartphones, which always display their guesses of the next word
you are going to type. This lets you save time by clicking the correct guess.

To actually fit such a model to real-life text data, we have to estimate 5,000 probabilities $q_{w_1}$ for all words and $(5{,}000)^2$ probabilities $p_{w_1 w_2}$ for all word pairs. Here
$$p_{w_1 w_2} = \Pr[w_2 \mid w_1] = \frac{\Pr[w_2 w_1]}{\Pr[w_1]},$$

namely, the probability that word w2 is the next word given that the last word was w1 .
One can derive empirical values of these probabilities using a sufficiently large text
corpus. (Realize that we have to estimate 25 million numbers, which requires either a very
large text corpus or using some shortcuts.)
An even better model in practice is a trigram model which uses the previous two words
to predict the next word. This involves a markov chain containing one state for every pair of
words. Thus the model is specified by $(5{,}000)^3$ numbers of the form $\Pr[w_3 \mid w_2 w_1]$. Fitting
such a model is beyond the reach of current computers but we won’t discuss the shortcuts
that need to be taken.
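A toy sketch of fitting such a bigram model by counting consecutive word pairs; the "corpus" below is a made-up eleven-word string, just to show how the empirical estimate of $\Pr[w_2 \mid w_1]$ is formed.

```python
# A toy sketch of fitting the bigram model: estimate p_{w1 w2} = Pr[w2 | w1]
# by counting consecutive word pairs in a (hypothetical, tiny) corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the rat".split()

pair_counts = Counter(zip(corpus, corpus[1:]))
word_counts = Counter(corpus[:-1])                 # how often w1 appears as a first word

p = defaultdict(dict)
for (w1, w2), c in pair_counts.items():
    p[w1][w2] = c / word_counts[w1]                # empirical Pr[w2 | w1]

print(p["the"])    # {'cat': 0.5, 'mat': 0.25, 'rat': 0.25}
```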

12.1 Recasting a random walk as linear algebra


A Markov chain is a discrete-time stochastic process on n states defined in terms of a
transition probability matrix (M ) with rows i and columns j.

M = Mij

where Mij corresponds to the probability that the state at time step t + 1 will be j, given
that the state at time t is i. This process is memoryless in the sense that this transition
probability does not depend upon the history of previous transitions.
Therefore, each row in the matrix $M$ is a distribution, implying $M_{ij} \ge 0$ for all $i, j \in S$ and $\sum_j M_{ij} = 1$. The bigram or trigram models are examples of Markov chains.
Using a slight twist in the viewpoint we can use linear algebra to analyse random walks.
Instead of thinking of the drunkard as being at a specific point in the state space, we think
of the vector that specifies his probability of being at point i ∈ S. Then the randomness
goes away and this vector evolves according to deterministic rules. Let us understand this
evolution.
Let the initial distribution be given by the row vector $x \in \Re^n$, with $x_i \ge 0$ and $\sum_i x_i = 1$. After one step, the probability of being at state $i$ is $\sum_j x_j M_{ji}$, which corresponds to a new distribution $xM$. It is easy to see that $xM$ is again a distribution.
Sometimes it is useful to think of x as describing the amount of probability fluid sitting
at each node, such that the sum of the amounts is 1. After one step, the fluid sitting at
node i distributes to its neighbors, such that Mij fraction goes to j.
Suppose we take two steps in this Markov chain. The memoryless property implies that the probability of going from $i$ to $j$ is $\sum_k M_{ik} M_{kj}$, which is just the $(i,j)$th entry of the matrix $M^2$. In general taking $t$ steps in the Markov chain corresponds to the matrix $M^t$, and the state at the end is $xM^t$. This motivates the following definition.

Definition 2 A distribution π for the Markov chain M is a stationary distribution if


πM = π.

Example 21 (Drunkard’s walk on n-cycle) Consider a Markov chain defined by the


following random walk on the nodes of an n-cycle. At each step, stay at the same node
with probability 1/2. Go left with probability 1/4 and right with probability 1/4.
The uniform distribution, which assigns probability 1/n to each node, is a stationary
distribution for this chain, since it is unchanged after applying one step of the chain.
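A quick numpy check of this example: build the transition matrix of the lazy walk on the $n$-cycle, verify that the uniform distribution is stationary, and watch an arbitrary starting distribution approach it (the values of $n$ and the number of steps are illustrative).

```python
# The lazy random walk on the n-cycle from Example 21: the uniform distribution
# is stationary, and xM^t converges to it from any start.
import numpy as np

n = 20
M = np.zeros((n, n))
for i in range(n):
    M[i, i] = 0.5
    M[i, (i - 1) % n] = 0.25
    M[i, (i + 1) % n] = 0.25

pi = np.ones(n) / n
print(np.allclose(pi @ M, pi))        # True: uniform distribution is stationary

x = np.zeros(n); x[0] = 1.0           # start with all probability at node 0
for t in range(5000):
    x = x @ M
print(np.abs(x - pi).sum())           # l_1 distance to uniform is tiny
```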

Definition 3 A Markov chain $M$ is ergodic if there exists a unique stationary distribution $\pi$ and for every (initial) distribution $x$ the limit $\lim_{t\to\infty} xM^t = \pi$.

In other words, no matter what initial distribution you choose, if you let it evolve long
enough the distribution converges to the stationary distribution. Some basic questions
are when stationary distributions exist, whether or not they are unique, and how fast the
Markov chain converges to the stationary distribution.
Does Definition 2 remind you of something? Almost all of you know about eigenvalues,
and you can see that the definition requires π to be an eigenvector which has all nonnegative
coordinates and whose corresponding eigenvalue is 1.
In today’s lecture we will be interested in Markov chains corresponding to undirected
d-regular graphs, where the math is easier because the underlying matrix is symmetric:
Mij = Mji .

Eigenvalues. Recall that if M ∈ <n×n is a square symmetric matrix of n rows and


columns then an eigenvalue of $M$ is a scalar $\lambda \in \Re$ such that there exists a vector $x \in \Re^n$ for which $M \cdot x = \lambda \cdot x$. The vector $x$ is called the eigenvector corresponding to the eigenvalue $\lambda$. $M$ has $n$ real eigenvalues denoted $\lambda_1 \le \ldots \le \lambda_n$. (The multiset of eigenvalues is called the spectrum.) The eigenvectors associated with these eigenvalues form an orthogonal basis for the vector space $\Re^n$ (for any two such vectors the inner product is zero and all vectors are linearly independent). The word eigenvector comes from German, and it means “one's own vector.” The eigenvectors are $n$ preferred directions $u_1, u_2, \ldots, u_n$ for the matrix, such that applying the matrix to these directions amounts to simple scaling by the corresponding eigenvalue. Furthermore these eigenvectors span $\Re^n$ so every vector $x$ can be written as a linear combination of them.

Example 22 We show that every eigenvalue $\lambda$ of $M$ is at most 1. Suppose $\vec{e}$ is the corresponding eigenvector. Say the largest coordinate is $e_i$. Then $\lambda e_i = \sum_{j:\{i,j\}\in E} \frac{1}{d} e_j$ by definition. If $\lambda > 1$ then at least one of the neighbors must have $e_j > e_i$, which is a contradiction. By a similar argument we conclude that every eigenvalue of $M$ is at least $-1$; in other words, every eigenvalue is at most 1 in absolute value.

12.1.1 Mixing Time


Informally, the mixing time of a Markov chain is the time it takes to reach “nearly station-
ary” distribution from any arbitrary starting distribution.

Definition 4 The mixing time of an ergodic Markov chain $M$ is $t$ if for every starting distribution $x$, the distribution $xM^t$ satisfies $|xM^t - \pi|_1 \le 1/4$. (Here $|\cdot|_1$ denotes the $\ell_1$ norm and the constant “1/4” is arbitrary.)

Example 23 (Mixing time of drunkard’s walk on a cycle) Let us consider the mix-
ing time of the walk in Example 21. Suppose the initial distribution concentrates all prob-
ability at state 0. Then 2t steps correspond to about t random steps (= coin tosses) since
with probability 1/2 the drunk does not move. Thus the location of the drunk is

(#(Heads) − #(Tails)) (mod n).

As argued earlier, it takes $\Omega(n^2)$ steps for the walk to reach the other half of the circle with any reasonable probability, which implies that the mixing time is $\Omega(n^2)$. We will soon see that this lower bound is fairly tight; the walk takes about $O(n^2 \log n)$ steps to mix well.

12.2 Upper bounding the mixing time (undirected d-regular


graphs)
For simplicity we restrict attention to random walks on regular graphs. Let $M$ be a Markov chain on a $d$-regular undirected graph with adjacency matrix $A$. Then, clearly $M = \frac{1}{d}A$. Clearly, $\frac{1}{n}\vec{1}$ is a stationary distribution, which means it is an eigenvector of $M$. What is the mixing time? In other words, if we start in the initial distribution $x$ then how fast does $xM^t$ converge to $\frac{1}{n}\vec{1}$?
First, let’s identify two hurdles that would prevent such convergence, and in fact prevent
the graph from having a unique stationary distribution. (a) Being disconnected: if the walk
starts in a vertex in one connected component, it never visits another component, and
vice versa. So two walks starting in the two components cannot converge to the same
distribution, no matter how long we run them. (b) Being bipartite: This means the graph
consists of two sets A, B such that there are no edges within A and within B; all edges go
between A and B. Then the walk starting in A will bounce back and forth between A and
B and thus not converge.

Example 24 (Exercise: ) Show that if the graph is connected, then every eigenvalue of M
apart from the first one is strictly less than 1. However, the value −1 is still possible. Show
that if −1 is an eigenvalue then the graph is bipartite.

Note that if $x$ is a distribution, $x$ can be written as
$$x = \frac{1}{n}\vec{1} + \sum_{i=2}^{n} \alpha_i e_i$$
where the $e_i$ are the eigenvectors of $M$, which form an orthogonal basis, and $\vec{1}$ is the first eigenvector with eigenvalue 1. (Clearly, $x$ can be written as a combination of the eigenvectors; the observation here is that the coefficient in front of the first eigenvector $\vec{1}$ is $\vec{1} \cdot x / \|\vec{1}\|^2$, which is $\frac{1}{n}\sum_i x_i = \frac{1}{n}$.)

$$M^t x = M^{t-1}(Mx) = M^{t-1}\Big(\frac{1}{n}\vec{1} + \sum_{i=2}^{n} \alpha_i \lambda_i e_i\Big) = M^{t-2}\Big(M\Big(\frac{1}{n}\vec{1} + \sum_{i=2}^{n} \alpha_i \lambda_i e_i\Big)\Big) = \cdots = \frac{1}{n}\vec{1} + \sum_{i=2}^{n} \alpha_i \lambda_i^t e_i.$$
Also
$$\Big\|\sum_{i=2}^{n} \alpha_i \lambda_i^t e_i\Big\|_2 \le \lambda_{max}^t$$
where $\lambda_{max}$ is the second largest eigenvalue of $M$ in absolute value. (Note that we are using the fact that the total $\ell_2$ norm of any distribution is $\sum_i x_i^2 \le \sum_i x_i = 1$.)
Thus we have proved $\|M^t x - \frac{1}{n}\vec{1}\|_2 \le \lambda_{max}^t$. Mixing times were defined using $\ell_1$ distance, but the Cauchy-Schwarz inequality relates the $\ell_2$ and $\ell_1$ distances: $|p|_1 \le \sqrt{n}\,|p|_2$. Since $\lambda_{max}^t \le e^{-t(1-\lambda_{max})}$, we have proved:
Theorem 11
The mixing time is at most $O\big(\frac{\log n}{1 - \lambda_{max}}\big)$.

Note also that if we let the Markov chain run for $O(k \log n/(1-\lambda_{max}))$ steps then the distance to the uniform distribution drops to $\exp(-k)$. This is why we were not very fussy about the constant 1/4 in the definition of the mixing time earlier.
Remark: What if $\lambda_{max}$ is 1 (i.e., $-1$ is an eigenvalue)? This breaks the proof and in fact the walk may not be ergodic. However, we can get around this problem by modifying the random walk to be lazy, by adding a self-loop at each node that ensures that the walk stays at the node with probability 1/2. Then the matrix describing the new walk is $\frac{1}{2}(I + M)$, and its eigenvalues are $\frac{1}{2}(1 + \lambda_i)$. Now all eigenvalues are nonnegative, and all except the first are strictly less than 1. This is a general technique for making walks ergodic.

Example 25 (Exercise) Compute the eigenvalues of the drunkard's walk on the $n$-cycle and show that its mixing time is $O(n^2 \log n)$.

12.3 Analysis of Mixing Time for General Markov Chains


We did not do this in class; this is extra reading for those who are interested.
In the class we only analysed random walks on d-regular graphs and showed that they
converge exponentially fast with rate given by the second largest eigenvalue of the transition
matrix. Here, we prove the same fact for general ergodic Markov chains.
Theorem 12
The following are necessary and sufficient conditions for ergodicity:

1. connectivity: $\forall i, j : M^t(i, j) > 0$ for some $t$.

2. aperiodicity: $\forall i : \gcd\{t : M^t(i, i) > 0\} = 1$.
Remark 1 Clearly, these conditions are necessary. If the Markov chain is disconnected it cannot have a unique stationary distribution; there is a different stationary distribution for each connected component. Similarly, a bipartite graph does not have a unique distribution: if the initial distribution places all probability on one side of the bipartite graph, then the distribution at time $t$ oscillates between the two sides depending on whether $t$ is odd or even. Note that in a bipartite graph $\gcd\{t : M^t(i,i) > 0\} \ge 2$. The sufficiency of these conditions is proved using eigenvalue techniques (for inspiration see the analysis of mixing time later on).
Both conditions are easily satisfied in practice. In particular, any Markov chain can be made aperiodic by adding self-loops assigned probability 1/2.
Definition 5 An ergodic Markov chain is reversible if the stationary distribution π satisfies
for all i, j, πi Pij = πj Pji .
We need a lemma first.
Lemma 13
Let $M$ be the transition matrix of an ergodic Markov chain with stationary distribution $\pi$ and eigenvalues $\lambda_1 (= 1) \ge \lambda_2 \ge \ldots \ge \lambda_n$, corresponding to (left) eigenvectors $v_1 (= \pi), v_2, \ldots, v_n$. Then for any $k \ge 2$, $v_k \vec{1} = 0$.
Proof: We have $v_k M = \lambda_k v_k$. Multiplying by $\vec{1}$ and noting that $M\vec{1} = \vec{1}$, we get
$$v_k \vec{1} = \lambda_k v_k \vec{1}.$$
Since the Markov chain is ergodic, $\lambda_k \neq 1$, so $v_k \vec{1} = 0$ as required. $\Box$
We are now ready to prove the main result concerning the exponentially fast convergence of a general ergodic Markov chain:
Theorem 14
In the setup of the lemma above, let $\lambda = \max\{|\lambda_2|, |\lambda_n|\}$. Then for any initial distribution $x$, we have
$$\|xM^t - \pi\|_2 \le \lambda^t \|x\|_2.$$
Proof: Write $x$ in terms of $v_1, v_2, \ldots, v_n$ as
$$x = \alpha_1 \pi + \sum_{i=2}^{n} \alpha_i v_i.$$
Multiplying the above equation by $\vec{1}$, we get $\alpha_1 = 1$ (since $x\vec{1} = \pi\vec{1} = 1$). Therefore $xM^t = \pi + \sum_{i=2}^{n} \alpha_i \lambda_i^t v_i$, and hence
$$\|xM^t - \pi\|_2 \le \Big\|\sum_{i=2}^{n} \alpha_i \lambda_i^t v_i\Big\|_2 \tag{12.1}$$
$$\le \lambda^t \sqrt{\alpha_2^2 + \cdots + \alpha_n^2} \tag{12.2}$$
$$\le \lambda^t \|x\|_2, \tag{12.3}$$
as needed. $\Box$
Chapter 13

Intrinsic dimensionality of data


and low-rank approximations: SVD

Today’s topic is a technique called singular value decomposition or SVD. We’ll take two
views of it, and then encounter a surprising algorithm for it, which in turn leads to a third
interesting view.

13.1 View 1: Inherent dimensionality of a dataset


In many settings we have a set of m vectors v1 , v2 , . . . , vm in <n . Think of n as large,
and maybe m also. We would like to represent vi ’s using fewer number of dimensions, say
k. We saw one technique in an earlier lecture, namely, Johnson-Lindenstrauss dimension
reduction, which achieves k = O(log n/ε2 ). As explored in HW 3, JL-dimension reduction
is relevant only where we only care about preserving all pairwise `2 distances among the
vectors. Its advantage is that it works for all datasets. But to many practitioners, that
is also a huge disadvantage: since it is oblivious to the dataset, it cannot be tweaked to
leverage properties of the data at hand.
Today we are interested in datasets where the vi ’s do have a special structure: they
are well-approximated by some low-dimensional set of vectors. By this we mean that for
some small k, there are vectors u1 , u2 , . . . , uk ∈ <n such that every vi is close to the span
of u1 , u2 , . . . , uk . In many applications k is fairly small, even 3 or 4, and JL dimension
reduction is of no use.
Let's attempt to formalize the problem at hand. We are looking for $k$ vectors $u_1, u_2, \ldots, u_k \in \Re^n$ and $mk$ coefficients $\alpha_{i1}, \ldots, \alpha_{ik} \in \Re$ such that $\|v_i - \sum_j \alpha_{ij} u_j\|_2^2$ is small. But of course any real-life data set has outliers, for which this may not hold. But if most vectors fit the conjectured structure, then we expect
$$\sum_i \Big\|v_i - \sum_j \alpha_{ij} u_j\Big\|_2^2 \approx \text{small} \tag{13.1}$$

This problem is nonlinear and nonconvex as stated. Today we will try to understand it


more and learn how to solve it. We will find that it is actually easy (which I find one of the
miracles of math: one of few natural nonlinear problems that are solvable in polynomial
time).
But first some examples of why this problem arises in practice.

Example 26 (Understanding shopping data) Suppose a marketer is trying to assess


shopping habits. He observes the shopping behaviour of m shoppers with respect to n
goods: how much of each good did they buy? This gives m vectors in <n .
The simplest model for this would be: every shopper starts with a budget, and allocates it equally among all $n$ items. Then if $B_i$ is the budget of shopper $i$ and $p_j$ is the price of item $j$, the $i$th vector is $\frac{1}{n}\big(\frac{B_i}{p_1}, \frac{B_i}{p_2}, \ldots, \frac{B_i}{p_n}\big)$. Denoting by $\vec{u}$ the vector of price inverses, namely $(1/p_1, 1/p_2, \ldots, 1/p_n)$, this is just $\frac{B_i}{n}\vec{u}$. We conclude that the data is 1-dimensional: just scalar multiples of $\vec{u}$.
But maybe the above model is too unrealistic and doesn't fit the data well. Then one could try another model. We assume that the goods partition into $k$ categories (like produce, canned goods, etc.) $S_1, S_2, \ldots, S_k$. These categories are unknown to us. Assume furthermore that the $i$th shopper designates a budget $B_{it}$ for the $t$th category, and then divides this budget equally among goods in that category. Let $u_t \in \Re^n$ denote the vector whose coordinate is 0 for goods not in $S_t$ and the inverse price for goods in $S_t$. Then the quantities of each good purchased by shopper $i$ are given by the vector $\sum_{t=1}^{k} \frac{B_{it}}{|S_t|} u_t$. In other words, this model predicts that the dataset is $k$-dimensional.
Of course, no model is exact so the data set will only be approximately k-dimensional,
and thus the problem in (13.1) is a possible formulation.
One can consider alternative probabilistic models of data generation where the shopper
picks items randomly from each category. You’ll analyse that in the next homework.

Example 27 (Understanding microarray data in biology) The number of genes in


your cell is rather large, and their activity levels —which depend both upon your genetic
code and environmental factors— determine your body's functioning. Microarrays are tiny “chips” of chemical sites that can screen the activity levels (aka gene expression levels) of a large number of genes in one go, say $n = 10{,}000$ genes. Typically these genes would have
been chosen because they are suspected to be related to the phenomenon being studied,
say a particular disease, immune reaction etc. After testing m individuals, one obtains m
vectors in <n .
In practice it is found that this gene expression data is low-dimensional in the sense
of (13.1). This means that there are say 4 directions u1 , u2 , u3 , u4 such that most of the
vectors are close to their span. These new axis directions usually have biological meaning; eg
they help identify genes whose expression (up or down) is controlled by common regulatory
mechanisms.

13.2 View 2: Low rank matrix approximations


We have an $m \times n$ matrix $M$. We suspect it is actually a noisy version of a rank-$k$ matrix, say $\tilde{M}$. We would like to find out $\tilde{M}$. One natural idea is to solve the following optimization problem
$$\min \sum_{ij} \big(M_{ij} - \tilde{M}_{ij}\big)^2 \quad \text{s.t. } \tilde{M} \text{ is a rank-}k\text{ matrix} \tag{13.2}$$

Again, this seems like a hopeless nonlinear optimization problem. Peer a little harder and you realize that, first, a rank-$k$ matrix is just one whose rows are linear combinations of $k$ independent vectors, and second, if you let $M_i$ denote the $i$th row of $M$ then you are trying to solve nothing but problem (13.1)!
Example 28 (Planted bisection/Hidden Bisection) Graph bisection is the problem where we are given a graph $G = (V, E)$ and wish to partition $V$ into two equal sets $S, \bar{S}$ such that we minimize the number of edges between $S$ and $\bar{S}$. It is NP-complete. Let's consider the following average case version.
Nature creates a random graph on $n$ nodes as follows. It partitions the nodes into $S_1, S_2$. Within $S_1$ and within $S_2$ it puts each edge with probability $p$, and between $S_1, S_2$ it puts each edge with probability $q$, where $q < p$. Now this graph is given to the algorithm. Note that the algorithm doesn't know $S_1, S_2$. It has to find the optimum bisection.
It is possible to show using Chernoff bounds that if $q = \Omega(\frac{\log n}{n})$ then with high probability the optimum bisection in the graph is the planted one, namely $S_1, S_2$. How can the algorithm recover this partition?

Figure 13.1: Planted Bisection problem: Edge probability is p within S1 , S2 and q between
S1 , S2 where q < p. On the right hand side is the adjacency matrix. If we somehow knew
S1 , S2 and grouped the corresponding rows and columns together, and squint at the matrix
from afar, we’d see more density of edges within S1 , S2 and less density between S1 , S2 .
Thus from a distance the adjacency matrix looks like a rank 2 matrix.

The observation in Figure 13.1 suggests that the adjacency matrix is close to the rank 2 matrix shown there: the blocks within $S_1$ and within $S_2$ have value $p$ in each entry; the blocks between $S_1, S_2$ have $q$ in each entry.
Maybe if we can solve (13.2) with $k = 2$ we are done? This turns out to be correct, as we will see in the next lecture.
One can study planted versions of many other NP-hard problems as well.

Many practical problems involve graph partitioning. For instance, image recognition
involves first partitioning the image into its component pieces (sky, ground, tree, etc.); a

process called image segmentation in computer vision. This is done by graph partitioning
on a graph defined on pixels where edges denote pixel-pixel similarity. Perhaps planted
graphs are a better model for such real-life settings than worst-case graphs.

13.3 Singular Value Decomposition


Now we describe the tool that lets us solve the above problems.
For simplicity let’s start with a symmetric matrix M . Suppose its eigenvalues are
λ1 , . . . , λn in decreasing order by absolute value, and the corresponding eigenvectors (scaled
to be unit vectors) are e1 , e2 , . . . , en . (These are column vectors.) Then M has the following
alternative representation.
Theorem 15 (Spectral decomposition)
$M = \sum_i \lambda_i e_i e_i^T$.

Proof: At first sight, the equality does not even seem to pass a “typecheck”; a matrix on
the left and vectors on the right. But then we realize that ei eTi is actually an n × n matrix
(it has rank 1 since every column is a multiple of ei ). So the right hand side is indeed a
matrix. Let us call it B.
Any matrix can be specified completely by describing how it acts on an orthonor-
mal basis. By definition, M is the matrix that acts as follows on the orthonormal set
$\{e_1, e_2, \ldots, e_n\}$: $Me_j = \lambda_j e_j$. How does $B$ act on this orthonormal set? We have
$$Be_j = \Big(\sum_i \lambda_i e_i e_i^T\Big) e_j = \sum_i \lambda_i e_i (e_i^T e_j) = \lambda_j e_j$$
(using distributivity and associativity of matrix multiplication), since $e_i^T e_j = \langle e_i, e_j\rangle$ is 1 if $i = j$ and 0 otherwise. We conclude that $B = M$. $\Box$

Theorem 16 (Best rank k approximation)


The solution M̃ to (13.2) is simply the sum of the first k terms in the previous Theorem.

The proof of this theorem uses the following, which is not too hard to prove from the
spectral decomposition using definitions.
Theorem 17 (Courant-Fisher)
If e1 , e2 , . . . , en are the eigenvectors as above then:

1. $e_1$ is the unit vector $x$ that maximizes $|Mx|_2^2$.

2. $e_{i+1}$ is the unit vector $x$ that is orthogonal to $e_1, e_2, \ldots, e_i$ and maximizes $|Mx|_2^2$.

Let's prove Theorem 16 for $k = 1$ by verifying that the first term of the spectral decomposition gives the best rank 1 approximation to $M$. A rank 1 matrix is one each of whose rows is a multiple of some unit vector $x$; in other words, each row lies on the line defined by $x$. Denote the rows of $M$ as $M_1, M_2, \ldots, M_n$. Then the multiple of $x$ that is closest to $M_i$ is simply its projection, namely $\langle M_i, x\rangle x$. Thus the matrix approximation consists of finding a unit vector $x$ so as to minimize
$$\sum_i |M_i - \langle M_i, x\rangle x|^2 = \sum_i |M_i|^2 - \sum_i |\langle M_i, x\rangle|^2.$$
This minimization is tantamount to maximising
$$\sum_i |\langle M_i, x\rangle|^2 = |Mx|^2, \tag{13.3}$$
which by the Courant-Fisher theorem happens for $x = e_1$. Thus the best rank 1 approximation to $M$ is the matrix whose $i$th row is $\langle M_i, e_1\rangle e_1^T$, which of course is $\lambda_1 e_{1i} e_1^T$. Thus the rank 1 matrix approximation is $\lambda_1 e_1 e_1^T$, which proves the theorem for $k = 1$. The proof of Theorem 16 for general $k$ follows similarly by induction and is left as an exercise.

13.3.1 General matrices: Singular values


Now we look at general matrices that are not symmetric. The notion of eigenvalues and
eigenvectors have to be modified. The following theorem is proved similarly as in the
symmetric case but with a bit more tedium.
Theorem 18 (Singular Value Decomposition and best rank-$k$ approximation)
Every $m \times n$ real matrix $M$ has $t \le \min\{m, n\}$ nonnegative real numbers $\sigma_1, \sigma_2, \ldots, \sigma_t$ (called singular values) and two sets of unit vectors $U = \{u_1, u_2, \ldots, u_t\}$ which are in $\Re^m$ and $V = \{v_1, v_2, \ldots, v_t\} \subseteq \Re^n$ (all vectors are column vectors) where $U, V$ are orthonormal sets and
$$u_i^T M = \sigma_i v_i^T \quad \text{and} \quad M v_i = \sigma_i u_i. \tag{13.4}$$
Furthermore, $M$ can be represented as
$$M = \sum_i \sigma_i u_i v_i^T. \tag{13.5}$$
The best rank $k$ approximation to $M$ consists of taking the first $k$ terms of (13.5) and discarding the rest.

This solves problems (13.1) and (13.2). Next time we’ll go into some detail of the
algorithm for computing them. In practice you can just use matlab or another package.
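In numpy this looks as follows; a minimal sketch, with a synthetic noisy low-rank matrix standing in for real data.

```python
# Best rank-k approximation via numpy's built-in SVD (in practice one simply
# calls such a library routine).
import numpy as np

def best_rank_k(M, k):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = sum_i s_i u_i v_i^T
    return (U[:, :k] * s[:k]) @ Vt[:k, :]              # keep only the first k terms

rng = np.random.default_rng(0)
# a noisy rank-3 matrix, as in the "noisy version of a rank-k matrix" view
M = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 80)) + 0.01 * rng.normal(size=(100, 80))
M3 = best_rank_k(M, 3)
print(np.linalg.matrix_rank(M3), np.linalg.norm(M - M3) / np.linalg.norm(M))
```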

13.4 View 3: Directions of Maximum Variance


The above proof of Theorem 16, especially the subcase k = 1 we proved, also shows yet
another view of SVD which is sometimes useful in data analysis. Let us again see this in the
case of symmetric matrices. Suppose we shift the given points $M_1, M_2, \ldots, M_n$ so that their mean $\frac{1}{n}\sum_i M_i$ is the origin. Then the rank-1 SVD corresponds to the direction $x$ where the projections of the given data points (a sequence of $n$ real numbers) have maximum variance. Since the mean is 0, this variance is exactly the quantity in (13.3). The second SVD direction corresponds to the direction with maximum variance after we have removed the component along the first direction, and so on.
Bibliography

1. O. Alter, P. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. PNAS, August 29, 2000, vol. 97, no. 18.

2. Relevant chapter of the Hopcroft-Kannan book on data science (link on course website).


Chapter 14

SVD, Power method, and Planted


Graph problems (+ eigenvalues of
random matrices)

Today we continue the topic of low-dimensional approximation to datasets and matrices.


Last time we saw the singular value decomposition of matrices.

14.1 SVD computation


Recall this theorem from last time.
Theorem 19 (Singular Value Decomposition and best rank-$k$ approximation)
An $m \times n$ real matrix $M$ has $t \le \min\{m, n\}$ nonnegative real numbers $\sigma_1, \sigma_2, \ldots, \sigma_t$ (called singular values) and two sets of unit vectors $U = \{u_1, u_2, \ldots, u_t\}$ which are in $\Re^m$ and $V = \{v_1, v_2, \ldots, v_t\} \subseteq \Re^n$ (all vectors are column vectors) where $U, V$ are orthonormal sets and
$$u_i^T M = \sigma_i v_i^T \quad \text{and} \quad M v_i = \sigma_i u_i. \tag{14.1}$$
(When $M$ is symmetric, each $u_i = v_i$ and the $\sigma_i$'s are eigenvalues and can be negative.)
Furthermore, $M$ can be represented as
$$M = \sum_i \sigma_i u_i v_i^T. \tag{14.2}$$
The best rank $k$ approximation to $M$ consists of taking the first $k$ terms of (14.2) and discarding the rest (where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_t$).

Taking the best rank k approximation is also called Principal Component Analysis or
PCA.
You probably have seen eigenvalue and eigenvector computations in your linear algebra
course, so you know how to compute the PCA for symmetric matrices. The nonsymmetric


case reduces to the symmetric one by using the following observation. If $M$ is the matrix in (14.2) then
$$MM^T = \Big(\sum_i \sigma_i u_i v_i^T\Big)\Big(\sum_i \sigma_i v_i u_i^T\Big) = \sum_i \sigma_i^2 u_i u_i^T,$$
since $v_i^T v_j = 1$ if $i = j$ and 0 otherwise. Thus we can recover the $u_i$'s and $\sigma_i$'s by computing the eigenvalues and eigenvectors of $MM^T$, and then recover $v_i$ by using (14.1).
Another application of singular vectors is the Pagerank algorithm for ranking webpages.

14.1.1 The power method


The eigenvalue computation you saw in your linear algebra course takes at least $n^3$ time. Often we are only interested in the top few eigenvectors, in which case there's a method that can work much faster (especially when the matrix is sparse, i.e., has few nonzero entries).
As usual, we first look at the subcase of symmetric matrices. To compute the largest eigenvector of matrix $M$ we do the following. Pick a random unit vector $x$. Then repeat the following a few times: replace $x$ by $Mx$. We show this works under the following gap assumption: there is a gap of $\gamma$ between the top two eigenvalues: $|\lambda_1| - |\lambda_2| = \gamma$.
The analysis is the same calculation as the one we used to analyse Markov chains. We can write $x$ as $\sum_i \alpha_i e_i$ where the $e_i$'s are the eigenvectors and the $\lambda_i$'s are numbered in decreasing order by absolute value. Then $t$ iterations produce $M^t x = \sum_i \alpha_i \lambda_i^t e_i$. Since $x$ is a unit vector, $\sum_i \alpha_i^2 = 1$.
Since $|\lambda_i| \le |\lambda_1| - \gamma$ for $i \ge 2$, we have
$$\sum_{i \ge 2} |\alpha_i|\,|\lambda_i|^t \le n\,\alpha_{max} (|\lambda_1| - \gamma)^t = n\,\alpha_{max} |\lambda_1|^t (1 - \gamma/|\lambda_1|)^t,$$
where $\alpha_{max}$ is the largest coefficient in magnitude.
Furthermore, since $x$ was a random unit vector (and recalling that its projection $\alpha_1$ on the fixed vector $e_1$ is normally distributed), the probability is at least 0.99 that $\alpha_1 > 1/(10n)$. Thus setting $t = O(|\lambda_1| \log n/\gamma)$ the components for $i \ge 2$ become minuscule and $M^t x \approx \alpha_1 \lambda_1^t e_1$. Thus rescaling to make it a unit vector, we get $e_1$ up to some error. Then we can project all vectors to the subspace perpendicular to $e_1$ and continue with the process to find the remaining eigenvectors and eigenvalues.
This process works under the above gap assumption. What if the gap assumption does
not hold? Say, the first 3 eigenvalues are all close together, and separated by a gap from the
fourth. Then the above process ends up with some random vector in the subspace spanned
by the top three eigenvectors. For real-life matrices the gap assumption often holds.
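
To make this concrete, here is a minimal numpy sketch of the power method for a symmetric matrix. The test matrix, the iteration count, and the function name are illustrative choices, not part of the lecture; the Rayleigh quotient at the end estimates the top eigenvalue.

    import numpy as np

    def power_method(M, t=200):
        # Repeatedly apply M to a random unit vector and rescale;
        # under the gap assumption this converges to the top eigenvector.
        n = M.shape[0]
        x = np.random.randn(n)
        x /= np.linalg.norm(x)
        for _ in range(t):
            x = M @ x
            x /= np.linalg.norm(x)
        return x @ M @ x, x          # Rayleigh quotient estimates the top eigenvalue

    np.random.seed(0)
    A = np.random.randn(50, 50)
    M = (A + A.T) / 2                # a random symmetric test matrix
    lam, v = power_method(M)
    print(lam, np.max(np.abs(np.linalg.eigvalsh(M))))   # the two numbers should agree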

14.2 Recovering planted bisections


Now we return to the planted bisection problem, also introduced last time.
The observation in Figure 14.1 suggests that the adjacency matrix is close to the rank 2 matrix shown there: the blocks within S_1 and within S_2 have value p in each entry; the blocks between S_1 and S_2 have q in each entry. This is rank 2 since it has only two distinct column vectors.

Figure 14.1: Planted Bisection problem: Edge probability is p within S1 , S2 and q between
S1 , S2 where q < p. On the right hand side is the adjacency matrix. If we somehow knew
S1 , S2 and grouped the corresponding rows and columns together, and squint at the matrix
from afar, we’d see more density of edges within S1 , S2 and less density between S1 , S2 .
Thus from a distance the adjacency matrix looks like a rank 2 matrix.

Now we sketch why the best rank-2 approximation to the adjacency matrix will more or
less recover the planted bisection. Specifically, the idea is to find the rank 2 approximation;
with very high probability its columns can be cleanly clustered into 2 clusters. This gives a
grouping of the vertices into 2 groups as well, which turns out to be the planted bisection.
Why this works has to do with the properties of rank k approximations. First we define
two norms of a matrix.

Definition 6 (Frobenius and spectral norm) If M is an n × n matrix then its Frobenius norm |M|_F is (Σ_{ij} M_ij^2)^{1/2} and its spectral norm |M|_2 is the maximum value of |Mx|_2 over all unit vectors x ∈ ℜ^n. (By Courant-Fischer, the spectral norm is also the highest eigenvalue.) For matrices that are not symmetric the definition of Frobenius norm is analogous and the spectral norm is the highest singular value.

Last time we defined the best rank k approximation to M as the matrix M̃ that has rank k and minimizes |M − M̃|_F^2. The following shows that we could have defined it equivalently using the spectral norm.
Lemma 20
The matrix M̃ defined above also satisfies |M − M̃|_2 ≤ |M − B|_2 for all B that have rank k.
Theorem 21
If M̃ is the best rank-k approximation to M, then for every rank k matrix C:

    |M̃ − C|_F^2 ≤ 5k |M − C|_2^2.

Proof: Follows by the spectral decomposition and the Courant-Fischer theorem, and the fact that the column vectors in M̃ and C together span a space of dimension at most 2k. Thus |M̃ − C|_F^2 involves a matrix of rank at most 2k. The rest of the details are cut and pasted from Hopcroft-Kannan in Figure 14.2. □
Returning to planted graph bisection, let M be the adjacency matrix of the graph with
planted bisection. Let C be the rank-2 matrix that we think is a good approximation to M ,
namely, the one in Figure 14.1. Let M̃ be the true rank 2 approximation found via SVD.
In general M̃ is not the same as C. But Theorem 21 implies that we can upper bound the average coordinate-wise squared difference of M̃ and C by the quantity on the right hand side, which is a constant times the squared spectral norm (i.e., largest eigenvalue) of M − C.
Notice, M − C is a random matrix whose each coordinate is one of four values 1 −
p, −p, 1 − q, −q. More importantly, the expectation of each coordinate is 0 (since the entry
of M is a coin toss whose expected value is the corresponding entry of C). The study
of eigenvalues of such random matrices is a famous subfield of science with unexpected
connections to number theory (including the famous Riemann hypothesis), quantum physics
(quantum gravity, quantum chaos), etc. We show below that |M − C|_2^2 is at most O(np). We conclude that corresponding columns of M̃ and C (each of which has squared norm about np) differ on average by O(p) in squared distance. Thus intuitively, clustering the columns of M̃ into two groups will find us the bipartition. Actually showing this requires more work, which we will not do.
Here is a generic clustering algorithm into two clusters: Pick a random column of M̃
and put into one cluster all columns whose distance from it is at most 10p. Put all other
columns in the other cluster.
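
The following numpy sketch carries out this whole plan on a synthetic instance: plant a bisection, compute the rank-2 approximation via SVD, and cluster the columns by distance to a random column. The parameters n, p, q and the median threshold are illustrative choices (the lecture's "distance at most 10p" rule is replaced by thresholding at the median for simplicity).

    import numpy as np

    np.random.seed(1)
    n, p, q = 400, 0.5, 0.1                     # illustrative sizes and edge probabilities
    sides = np.array([0]*(n//2) + [1]*(n//2))   # the planted bisection
    probs = np.where(sides[:, None] == sides[None, :], p, q)
    upper = np.triu(np.random.rand(n, n) < probs, 1)
    M = (upper | upper.T).astype(float)         # symmetric 0/1 adjacency matrix

    U, s, Vt = np.linalg.svd(M)
    M2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]  # best rank-2 approximation

    ref = M2[:, np.random.randint(n)]           # pick a random column of M2
    dists = np.linalg.norm(M2 - ref[:, None], axis=0)
    cluster = (dists > np.median(dists)).astype(int)
    agreement = max(np.mean(cluster == sides), np.mean(cluster != sides))
    print("fraction of vertices consistent with the planted bisection:", agreement)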

14.2.1 Eigenvalues of random matrices


We sketch a proof of the following classic theorem to give a taste of this beautiful area.
Theorem 22
Let R be a random matrix such that the R_ij's are independent random variables in [−1, 1] of expectation 0 and variance at most σ^2. Then with probability 1 − exp(−n) the largest eigenvalue of R is at most O(σ√n).

For simplicity we prove this for σ = 1.


Proof: Recalling that the largest eigenvalue is max_x x^T R x, we break the proof as follows.

Idea 1) For any fixed unit vector x ∈ ℜ^n, x^T R x ≤ O(√n) with probability 1 − exp(−Cn), where C is an arbitrarily large constant. This follows from Chernoff-type bounds. Note that x^T R x = Σ_ij R_ij x_i x_j. By Chernoff bounds (Hoeffding's inequality) the probability that this exceeds t is at most

    exp(−t^2 / Σ_ij x_i^2 x_j^2) ≤ exp(−Ω(t^2)),

since Σ_ij x_i^2 x_j^2 = (Σ_i x_i^2)^2 = 1.
Idea 2) There is a set of exp(n) special directions x^(1), x^(2), ..., that approximately "cover" the set of unit vectors in ℜ^n. Namely, for every unit vector v, there is at least one x^(i) such that ⟨v, x^(i)⟩ > 0.9.
First, note that ⟨v, x^(i)⟩ > 0.9 iff

    |v − x^(i)|^2 = |v|^2 + |x^(i)|^2 − 2⟨v, x^(i)⟩ ≤ 0.2.

In other words we are trying to cover the unit sphere with spheres of radius 0.2.
Try to pick this set greedily. Pick x(1) arbitrarily, and throw out the unit sphere of
radius 0.2 around it. Then pick x(2) arbitrarily out of the remaining sphere, and throw out
the unit sphere of radius 0.2 around it. And so on.
How many points did we end up with? By construction, each point that was picked
has distance at least 0.2 from every other point that was picked, so the spheres of radius
0.1 around the picked points are mutually disjoint. Thus the maximum number of points
we could have picked is at most the number of disjoint spheres of radius 0.1 that fit inside a ball of radius 1.1. Denoting by B(r) the volume of a sphere of radius r, this is at most B(1.1)/B(0.1) = 11^n = exp(O(n)).
Idea 3) Combining Ideas 1 and 2 with the union bound, we have with high probability that (x^(i))^T R x^(i) ≤ O(√n) for all the special directions.

Idea 4) If v is the eigenvector corresponding to the largest eigenvalue, then there is some special direction satisfying (x^(i))^T R x^(i) ≥ 0.4 v^T R v.
By the covering property, there is some special direction x^(i) that is close to v. Represent it as αv + βu where u ⊥ v is a unit vector. So α ≥ 0.9 and β^2 ≤ 0.19. Then (x^(i))^T R x^(i) = α^2 v^T R v + β^2 u^T R u, since the cross terms vanish (Rv is a multiple of v and u ⊥ v). Since v^T R v is the largest eigenvalue in magnitude, |u^T R u| ≤ v^T R v, and we conclude (x^(i))^T R x^(i) ≥ (0.81 − 0.19) v^T R v ≥ 0.4 v^T R v, as claimed.
The theorem now follows from Ideas 3 and 4. □

Bibliography

1. F. McSherry. Spectral partitioning of random graphs. In IEEE FOCS 2001 Proceedings.
Figure 14.2: Proof of Theorem 21 from Hopcroft-Kannan book

Chapter 15

Semidefinite Programs (SDPs) and Approximation Algorithms

Recall that a set of points K is convex if for every two x, y ∈ K the line segment joining x, y, i.e., {λx + (1 − λ)y : λ ∈ [0, 1]}, lies entirely inside K. A function f : ℜ^n → ℜ is convex if f((x + y)/2) ≤ (f(x) + f(y))/2. It is called concave if the previous inequality goes the other way. A linear function is both convex and concave. A convex program consists of a convex function f and a convex body K, and the goal is to minimize f(x) subject to x ∈ K. It is a vast generalization of linear programming and, like LP, can be solved in polynomial time under fairly general conditions on f, K. Today's lecture is about a special type of convex program called a semidefinite program.
Recall that a symmetric n × n matrix M is positive semidefinite (PSD for short) iff it
can be written as M = AAT for some real-valued matrix A (need not be square). It is a
simple exercise that this happens iff every eigenvalue is nonnegative. Another equivalent
characterization is that there are n vectors u1 , u2 , . . . , un such that Mij = hui , uj i. Given
a PSD matrix M one can compute such n vectors in polynomial time using a procedure
called Cholesky decomposition.
Lemma 23
The set of all n × n PSD matrices is a convex set in ℜ^{n^2}.

Proof: It is easily checked that if M_1 and M_2 are PSD then so is M_1 + M_2, and hence so is (M_1 + M_2)/2. □
Now we are ready to define semidefinite programs. These are very useful in a variety
of optimization settings as well as control theory. We will use them for combinatorial
optimization, specifically to compute approximations to some NP-hard problems. In this
respect SDPs are more powerful than LPs.
View 1: A linear program in n2 real valued variables Yij where 1 ≤ i, j ≤ n, with the
additional constraint “Y is a PSD matrix.”
View 2: A vector program where we are seeking n vectors u1 , u2 , . . . , un ∈ <n such that
their inner products hui , uj i satisfy some set of linear constraints.
Clearly, these views are equivalent.


Exercise: Show that every LP can be rewritten as a (slightly larger) SDP. The idea is
that a diagonal matrix, i.e., a matrix whose offdiagonal entries are 0, is PSD iff the entries
are nonnegative.
Question: Can the vectors u1 , . . . , un in View 2 be required to be in <d for d < n?
Answer: This is not known and imposing such a constraint makes the program nonconvex.
(The reason is that the sum of two matrices of rank d can have rank higher than d.)

15.1 Max Cut


Given an n-vertex graph G = (V, E), find a cut (S, S̄) that maximises |E(S, S̄)|, the number of edges crossing the cut.
The exact formulation of this problem is to find x_1, x_2, ..., x_n ∈ {−1, 1} (which thus represent a cut) so as to maximise

    Σ_{{i,j}∈E} (1/4) |x_i − x_j|^2.

This works since an edge contributes 1 to the objective iff the endpoints have opposite signs.
The SDP relaxation is to find vectors u_1, u_2, ..., u_n such that |u_i|_2^2 = 1 for all i, so as to maximise

    Σ_{{i,j}∈E} (1/4) |u_i − u_j|^2.

This is a relaxation since every ±1 solution to the problem is also a vector solution where every u_i is ±v_0 for some fixed unit vector v_0.
Thus when we solve this SDP we get n vectors, and the value of the objective OPT_SDP is at least as large as the capacity of the max cut. How do we get a cut out of these vectors? The following is the simplest rounding one can think of. Pick a random vector z. If ⟨u_i, z⟩ is positive, put vertex i in S, and otherwise in S̄. Note that this is the same as picking a random hyperplane passing through the origin and partitioning the vertices according to which side of the hyperplane they lie on.


Figure 15.1: SDP solutions are unit vectors and they are rounded to ±1 by using a random
hyperplane through the origin. The probability that i, j end up on opposite sides of the cut
is proportional to Θij , the angle between them.

Theorem 24 (Goemans-Williamson’94)
The expected number of edges in the cut produced by this rounding is at least 0.878.. times
OP TSDP .

Proof: The rounding is essentially picking a random hyperplane through the origin and
vertices i, j fall on opposite sides of the cut iff ui , uj lie on opposite sides of the hyperplane.
Let’s estimate the probability they end up on opposite sides. This may seem a difficult n-
dimensional calculation, until we realize that there is a 2-dimensional subspace defined by
ui , uj , and all that matters is the intercept of the random hyperplane with this 2-dimensional
subspace, which is a random line in this subspace. Specifically θij be the angle between ui
and uj . Then the probability that they fall on opposite sides of this random line is θij /π.
Thus by linearity of expectations,
X θij
E[Number of edges in cut] = . (15.1)
π
{i,j}∈E

How do we relate this to OPT_SDP? We use the fact that ⟨u_i, u_j⟩ = cos θ_ij to rewrite the objective as

    Σ_{{i,j}∈E} (1/4)|u_i − u_j|^2 = Σ_{{i,j}∈E} (1/4)(|u_i|^2 + |u_j|^2 − 2⟨u_i, u_j⟩) = Σ_{{i,j}∈E} (1/2)(1 − cos θ_ij).     (15.2)

This seems hopeless to analyse for us mortals: we know almost nothing about the graph or the set of vectors. Luckily Goemans and Williamson had the presence of mind to verify the following in Matlab: each term of (15.1) is at least 0.878.. times the corresponding term of (15.2)! Specifically, Matlab shows that

    2θ / (π(1 − cos θ)) ≥ 0.878   ∀θ ∈ [0, π].     (15.3)

□
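
Here is a minimal numpy sketch of the rounding step (the helper name hyperplane_round is an illustrative choice, and solving the SDP itself, which would require an SDP solver, is not shown), together with a numerical check of (15.3) in the spirit of the original Matlab verification.

    import numpy as np

    def hyperplane_round(U, edges):
        # U has one unit vector per vertex (as rows); edges is a list of pairs (i, j).
        z = np.random.randn(U.shape[1])            # random direction defining the hyperplane
        side = (U @ z >= 0)                        # which side each vertex falls on
        cut = sum(side[i] != side[j] for (i, j) in edges)
        return side, cut

    # Numerical check of (15.3):
    theta = np.linspace(1e-6, np.pi, 100000)
    print(np.min(2 * theta / (np.pi * (1 - np.cos(theta)))))   # about 0.8785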

The saga of 0.878... The GW paper came on the heels of the PCP Theorem (1992) which
established that there is a constant

15.2 0.878-approximation for MAX-2SAT


We earlier designed approximation algorithms for MAX-2SAT using LP. The SDP relaxation gives a much tighter approximation than the 3/4 we achieved back then. Given a 2CNF formula on n variables with m clauses, we can express MAX-2SAT as a quadratic optimization problem. We want x_i^2 = 1 for all i (hence x_i is ±1, where +1 corresponds to setting the variable y_i to true), and we can write a quadratic expression for each clause expressing that it is satisfied. For instance if the clause is y_i ∨ y_j then the expression is 1 − (1/4)(1 − x_i)(1 − x_j). It is 1 if either of x_i, x_j is 1 and 0 else.
Representing this expression directly as we did for MAX-CUT is tricky because of the
”1” appearing in it. Instead we are going to look for n + 1 vectors u0 , u1 , . . . , un . The first
vector u0 is a dummy vector that stands for ”1”. If ui = u0 then we think of this variable
being set to True and if ui = −u0 we think of the variable being set to False. Of course, in
general hui , u0 i need not be ±1 in the optimum solution.
So the SDP is to find these vectors satisfying |u_i|^2 = 1 for all i so as to maximize Σ_{clauses l} v_l, where v_l is the expression for the lth clause. For instance if the clause is y_i ∨ y_j then the expression is

    1 − (1/4)(u_0 − u_i)·(u_0 − u_j) = (1/4)(1 + u_0·u_j) + (1/4)(1 + u_0·u_i) + (1/4)(1 − u_i·u_j).
This is a very Goemans-Williamson like expression, except we have expressions like 1 + u_0·u_i whereas in MAX-CUT we have 1 − u_i·u_j. Now we do Goemans-Williamson rounding. The key insight is that since we round to ±1, each term 1 + u_i·u_j becomes 2 with probability 1 − θ_ij/π = (π − θ_ij)/π and is 0 otherwise. Similarly, 1 − u_i·u_j becomes 2 with probability θ_ij/π and 0 else.
Now the term-by-term analysis used for MAX-CUT works again once we realize that (15.3) also implies (by substituting π − θ for θ in the expression) that 2(π − θ)/(π(1 + cos θ)) ≥ 0.878 for θ ∈ [0, π]. We conclude that the expected number of satisfied clauses is at least 0.878 times OPT_SDP.
Chapter 16

Going with the slope: offline, online, and randomly

This lecture is about gradient descent, a popular method for continuous optimization (es-
pecially nonlinear optimization).
We start by recalling that allowing nonlinear constraints in optimization leads to NP-
hard problems in general. For instance the following single constraint can be used to force
all variables to be 0/1:

    Σ_i x_i^2 (1 − x_i)^2 = 0.

Notice, this constraint is nonconvex. We saw in earlier lectures that the Ellipsoid method
can solve convex optimization problems efficiently under fairly general conditions. But it is
slow in practice.
Gradient descent is a popular alternative because it is simple and it gives some kind
of meaningful result for both convex and nonconvex optimization. It tries to improve the
function value by moving in a direction related to the gradient (i.e., the first derivative).
For convex optimization it gives the global optimum under fairly general conditions. For
nonconvex optimization it arrives at a local optimum.

Figure 16.1: For nonconvex functions, a local optimum may be different from the global
optimum


We will first study unconstrained gradient descent where we are simply optimizing a
function f (·). Recall that the function is convex if f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)
for all x, y and λ ∈ [0, 1].

16.1 Gradient descent for convex functions: univariate case


The gradient of a univariate function f is simply the derivative f'(x). If this is negative, the value decreases if we increase x a little, and increases if we decrease x. Gradient descent consists of evaluating the derivative and moving a small amount to the right (i.e., increasing x) if f'(x) < 0, and moving to the left otherwise. Thus the basic iteration is x ← x − ηf'(x)
for a tiny η called step size.
The function is convex if between every two points x, y the graph of the function lies
below the line joining (x, f (x)) and (y, f (y)). It need not be differentiable everywhere but
when all derivatives exist we can do the Taylor expansion:

    f(x + η) = f(x) + ηf'(x) + (η^2/2) f''(x) + (η^3/3!) f'''(x) + ··· .     (16.1)
If f''(x) ≥ 0 for all x then the function is convex. This is because f'(x) is an increasing function of x. The minimum is attained at the x where f'(x) = 0, since f'(x) is positive to the right of it and negative to the left. Thus moving either left or right of this point increases f and it never drops. The function is concave if f''(x) ≤ 0 for all x; such functions have a unique maximum.
Examples of convex functions: ax + b for any a, b ∈ <; exp(ax) for any a ∈ <; xα for
x ≥ 0, α ≥ 1 or α ≤ 0. Another interesting example is the negative entropy: x log x for
x ≥ 0.
Examples of concave functions: ax + b for any a, b ∈ <; xα for α ∈ [0, 1] and x ≥ 0; log x
for x ≥ 0.

Figure 16.2: Concave and Convex Function

To minimize a convex function by gradient descent we start at some x_0 and at step i update x_i to x_{i+1} = x_i + ηf'(x_i) for some small η < 0. In other words, move in the direction where f decreases. If we ignore terms that involve η^3 or higher, then

    f(x_{i+1}) = f(x_i) + ηf'(x_i) + (η^2/2) f''(x_i),

and the best value for η (which gives the most reduction in one step) is η = −f'(x_i)/f''(x_i), which gives

    f(x_{i+1}) = f(x_i) − (f'(x_i))^2 / (2f''(x_i)).

Thus the algorithm makes progress so long as f''(x_i) > 0. Convex functions that satisfy f''(x) > 0 for all x are called strongly convex.
The above calculation is the main idea in Newton's method, which you may have seen in calculus. Proving convergence requires further assumptions.
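
A minimal sketch in Python of the two update rules on a simple strongly convex function; the function, starting point, step size, and iteration count are illustrative choices.

    # f(x) = x^4 + x^2 has f'' > 0 everywhere.
    f   = lambda x: x**4 + x**2
    df  = lambda x: 4*x**3 + 2*x       # f'
    d2f = lambda x: 12*x**2 + 2        # f''

    x_gd, x_newton = 3.0, 3.0
    for _ in range(500):
        x_gd -= 0.01 * df(x_gd)                        # fixed small step size
        x_newton -= df(x_newton) / d2f(x_newton)       # the eta = -f'/f'' (Newton) step
    print(x_gd, x_newton)                              # both approach the minimizer x = 0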

16.2 Convex multivariate functions


A convex function on ℜ^n, if it is differentiable, satisfies the following basic inequality, which says that the function lies "above" the tangent plane at any point:

    f(x + z) ≥ f(x) + ∇f(x) · z   ∀x, z.     (16.2)

Here ∇f(x) is the vector of first order derivatives, whose ith coordinate is ∂f/∂x_i; it is called the gradient. Sometimes we restate it equivalently as

    f(x) − f(y) ≤ ∇f(x) · (x − y)   ∀x, y.     (16.3)


Figure 16.3: A differentiable convex function lies above the tangent plane f(x) + ∇f(x) · (y − x)
If higher derivatives also exist, the multivariate Taylor expansion for an n-variate function f is

    f(x + y) = f(x) + ∇f(x) · y + (1/2) y^T ∇^2 f(x) y + ··· .     (16.4)

Here ∇^2 f(x) denotes the n × n matrix whose (i, j) entry is ∂^2 f/∂x_i ∂x_j; it is called the Hessian. It can be checked that f is convex if the Hessian is positive semidefinite; this means y^T ∇^2 f y ≥ 0 for all y.

Example 29 The following are some examples of convex functions.

• Norms. Every ℓ_p norm is convex on ℜ^n. The reason is that a norm satisfies the triangle inequality: |x + y| ≤ |x| + |y| ∀x, y.

Figure 16.4: The Hessian

• f(x) = log(e^{x_1} + e^{x_2} + ··· + e^{x_n}) is convex on ℜ^n. This fact is used in practice as an analytic approximation of the max function since

    max{x_1, ..., x_n} ≤ f(x) ≤ max{x_1, ..., x_n} + log n.

Turns out this fact is at the root of the multiplicative weight update method; the algorithm for approximately solving LPs that we saw in Lecture 10 can be seen as doing gradient descent on this function, where the x_i's are the slacks of the linear constraints. (For a linear constraint a^T z ≥ b the slack is a^T z − b.)

• f(x) = x^T A x = Σ_ij A_ij x_i x_j where A is positive semidefinite. Its Hessian is 2A, which is PSD.
Some important examples of concave functions are: the geometric mean (Π_{i=1}^n x_i)^{1/n} and the log-determinant (defined for X ∈ ℜ^{n^2} as log det(X), where X is interpreted as an n × n matrix). Many famous inequalities in mathematics (such as Cauchy-Schwarz) are derived using convex functions. □
Example 30 (Linear equations with PSD constraint matrix) In linear algebra you learnt that the method of choice to solve a system of equations Ax = b is Gaussian elimination. In many practical settings its O(n^3) running time may be too high. Instead one does gradient descent on the function (1/2) x^T A x − b^T x, whose local optimum satisfies Ax = b. If A is positive semidefinite this function is also convex since the Hessian is A, and gradient descent will actually find the solution. (Actually in real life these are optimized using more advanced methods such as conjugate gradient.) Also, if A is diagonally dominant, a stronger constraint than PSD, then Spielman and Teng (2003) have shown how to solve this problem in time that is near linear in the number of nonzero entries. This has had surprising applications to basic algorithmic problems like max-flow.
Example 31 (Least squares) In some settings we are given a set of points a1 , a2 , . . . , am ∈
<n and some data values b1 , b2 , . . . , bm taken at these points by some function of interest.
We suspect that the unknown function is a line, except the data values have a little error in
them. One standard technique is to find a least squares fit: a line that minimizes the sum of squares of the distances of the datapoints to the line. The objective function is min |Ax − b|_2^2
where A ∈ <m×n is the matrix whose rows are the ai ’s. (We saw in an earlier lecture that
the solution is also the first singular vector.) This objective is just xT AT Ax−2(Ax)T b+bT b,
which is convex.
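
As a concrete illustration, here is a minimal numpy sketch of gradient descent on the least-squares objective; the data, step size, and iteration count are illustrative, and the result is checked against numpy's built-in least-squares solver.

    import numpy as np

    np.random.seed(0)
    m, n = 100, 5
    A = np.random.randn(m, n)
    b = A @ np.random.randn(n) + 0.1 * np.random.randn(m)   # noisy linear data

    x = np.zeros(n)
    eta = 0.5 / np.linalg.norm(A.T @ A, 2)   # step size based on the top eigenvalue of A^T A
    for _ in range(500):
        x -= eta * 2 * A.T @ (A @ x - b)     # gradient of |Ax - b|^2 is 2 A^T (Ax - b)

    print(np.linalg.norm(x - np.linalg.lstsq(A, b, rcond=None)[0]))   # essentially zero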

In the univariate case, gradient descent has a choice of only two directions to move in: right
or left. In n dimensions, it can move in any direction in <n . The most direct analog of the
univariate method is to move diametrically opposite from the gradient.
The most direct analogue of our univariate analysis would be to assume a lower bound on y^T ∇^2 f y for all unit vectors y (in other words, a lower bound on the eigenvalues of ∇^2 f). This will be explored in the homework. In the rest of the lecture we will only assume (16.2).

16.3 Gradient Descent for Constrained Optimization


As studied in previous lectures, constrained optimization consists of solving the following
where K is a convex set and f (·) is a convex function.

min f (x) s.t. x ∈ K.

Example 32 (Spam classification via SVMs) This example will run through the en-
tire lecture. Support Vector Machine is the name in machine learning for a linear classifier;
we saw these before in Lecture 6 (Linear Thinking). Suppose we wish to train the classifier
to classify emails as spam/nonspam. Each email is represented using a vector in <n that
gives the frequencies of various words in it ("bag of words" model). Say a_1, a_2, ..., a_N are the emails, and for each there is a corresponding bit b_i ∈ {−1, 1} where b_i = 1 means a_i is spam. SVMs use a linear classifier to separate spam from nonspam. If spam were perfectly identifiable by a linear classifier, there would be a function W · x such that W · a_i ≥ 1 if a_i is spam, and W · a_i ≤ −1 if not. In other words,

1 − bi W · ai ≤ 0 ∀i (16.5)

Of course, in practice a linear classifier makes errors, so we have to allow for the possibility
that (16.5) is violated by some ai ’s. The obvious thing to try is to find a W that satisfies
as many of the constraints as possible, but that leads to a nonconvex NP-hard problem.
(Even approximating this weakly is NP-hard.) Thus a more robust version of this problem
is
    min Σ_i Loss(1 − W · (b_i a_i))     (16.6)
    s.t. |W|_2^2 ≤ n   (scaling constraint)

where Loss(·) is a function that penalizes unsatisfied constraints according to the amount
by which they are unsatisfied. (Note that W is the vector of variables, and the scaling
constraint gives meaning to the separation of “1 ”in (16.5) by saying that W is a vector in
the sphere of radius n, which is a convex constraint.) The most obvious loss function would
be to count the number of unsatisfied constraints but that is nonconvex. For this lecture
we focus on convex loss functions; the simplest is the hinge loss: Loss(t) = max {0, t}.
Applying it to 1 − W · (b_i a_i) ensures that correctly classified emails contribute 0 to the loss,
and incorrectly classified emails contribute as much to the loss as the amount by which they
fail the inequality. The function in (16.6) is convex because the function inside Loss() is
linear and thus convex, and Loss() preserves convexity since it can only lift the value of the
linear function even further.

If x ∈ K is the current point and we use the gradient to step to x − η∇f(x), then in general this new point will not be in K. Thus one needs to do a projection.
Definition 7 The projection of a point y on K is the x ∈ K that minimizes |y − x|_2. (It is also possible to use norms other than ℓ_2 to define projections.)
A projection oracle for the convex body K is a black box that, for every point y, returns its projection on K.

Often convex sets used in applications are simple to project to.


Example 33 If K = unit sphere, then the projection of y is y/ |y|2 .

Here is a simple algorithm for solving the constrained optimization problem. The algo-
rithm only needs to access f via a gradient oracle and K via a projection oracle.
Definition 8 (Gradient Oracle) A gradient oracle for a function f is a black box that, for every point z, returns ∇f(z), the gradient evaluated at the point z. (Notice, this defines a linear function of the form g · x where g is the vector of partial derivatives evaluated at z.)

The same value of η will be used throughout.

Gradient Descent for Constrained Optimization
Let η = D/(G√T).
Repeat for i = 0 to T:
    y^(i+1) ← x^(i) − η∇f(x^(i))
    x^(i+1) ← Projection of y^(i+1) on K.
At the end output z = (1/T) Σ_i x^(i).
Let us analyse this algorithm as follows. Let x* be the point where the optimum is attained. Let G denote an upper bound on |∇f(x)|_2 for any x ∈ K, and let D = max_{x,y∈K} |x − y|_2 be the so-called diameter of K. To ensure that the output z satisfies f(z) ≤ f(x*) + ε we will use T = 4D^2 G^2/ε^2.
Since x^(i+1) is the projection of y^(i+1) on K, we have

    |x^(i+1) − x*|^2 ≤ |y^(i+1) − x*|^2
                     = |x^(i) − x* − η∇f(x^(i))|^2
                     = |x^(i) − x*|^2 + η^2 |∇f(x^(i))|^2 − 2η ∇f(x^(i)) · (x^(i) − x*).

Reorganizing and using the definition of G we obtain:

    ∇f(x^(i)) · (x^(i) − x*) ≤ (1/2η)(|x^(i) − x*|^2 − |x^(i+1) − x*|^2) + (η/2) G^2.

Using (16.3), we can lower bound the left hand side by f(x^(i)) − f(x*). We conclude that

    f(x^(i)) − f(x*) ≤ (1/2η)(|x^(i) − x*|^2 − |x^(i+1) − x*|^2) + (η/2) G^2.     (16.7)

Now sum the previous inequality over i = 1, 2, ..., T and use the telescoping cancellations to obtain

    Σ_{i=1}^T (f(x^(i)) − f(x*)) ≤ (1/2η)(|x^(0) − x*|^2 − |x^(T) − x*|^2) + (Tη/2) G^2.

Finally, by convexity f((1/T) Σ_i x^(i)) ≤ (1/T) Σ_i f(x^(i)), so we conclude that the point z = (1/T) Σ_i x^(i) satisfies

    f(z) − f(x*) ≤ D^2/(2ηT) + (η/2) G^2.

Now set η = D/(G√T) to get an upper bound on the right hand side of 2DG/√T. Since T = 4D^2 G^2/ε^2 we see that f(z) ≤ f(x*) + ε.
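
Here is a minimal numpy sketch of the algorithm above, minimizing a convex quadratic over the unit ball; the objective, the choices of G, D, T, and the helper names are illustrative (the projection is the ball analogue of Example 33 below).

    import numpy as np

    def project_ball(y):
        # Projection onto the unit ball: keep y if inside, otherwise rescale.
        r = np.linalg.norm(y)
        return y if r <= 1 else y / r

    def constrained_gd(grad, project, x0, G, D, T):
        eta = D / (G * np.sqrt(T))            # the step size from the analysis above
        x, total = x0.copy(), np.zeros_like(x0)
        for _ in range(T):
            x = project(x - eta * grad(x))
            total += x
        return total / T                      # output the average iterate z

    # Illustrative objective: f(x) = |x - c|^2 with c outside K, so the optimum is c/|c|.
    c = np.array([2.0, 1.0])
    z = constrained_gd(lambda x: 2 * (x - c), project_ball, np.zeros(2), G=10.0, D=2.0, T=10000)
    print(z, c / np.linalg.norm(c))           # z should be close to c/|c|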

16.4 Online Gradient Descent


In online gradient descent we deal with the following scenario. There is a convex set K given
via a projection oracle. For i = 1, 2, . . . , T we are presented at step i a convex function
fi . At step i we have to put forth our guess solution x(i) ∈ K but the catch is that we do
not know the functions that will be presented in the future. So our online decisions have to be made such that if x* is the point w that minimizes Σ_i f_i(w) (i.e., the point that we would have chosen in hindsight after all the functions were revealed) then the following quantity (called regret) should stay small:

    Σ_i f_i(x^(i)) − f_i(x*).

This notion should remind you of multiplicative weights, except here we may have general
convex functions as “payoffs.”

Example 34 (Spam classification against adaptive adversaries) We return to the


spam classification problem of Example 32, with the new twist that this classifier changes
over time, as spammers learn to evade the current classifier. Thus there is no fixed distri-
bution of spam emails and it is fruitless to train the classifier at one go. It is better to have
it improve and adapt itself as new emails arrive. At step t the optimum classifier f_t may not be known and is presented using a gradient oracle. This function just corresponds to the term in (16.6) for the latest email that was classified as spam/nonspam. The goal is to do as well as the best single classifier we would want to use in hindsight.

Zinkevich noticed that the analysis of gradient descent applies to this much more general
scenario. Specifically, modify the above gradient descent algorithm to this problem by
replacing ∇f(x^(i)) by ∇f_i(x^(i)). This algorithm is called Online Gradient Descent. The
earlier analysis works essentially unchanged, once we realize that the left hand side of
(16.7) has the regret for trial i. Summing over i gives the total regret on the left side, and
the right hand side is analysed and upperbounded as before. Thus we have shown:

Theorem 25 (Zinkevich 2003)
If D is the diameter of K and G is an upper bound on the norm of the gradient of any of the presented functions, and η is set to D/(G√T), then the regret per step after T steps is at most 2DG/√T.

16.5 Stochastic Gradient Descent


Stochastic gradient descent is a variant of the algorithm in Section 16.3 that works with convex functions presented using an even weaker notion: an expected gradient oracle. Given a point z, this oracle returns a vector g (equivalently, a linear function g · x) drawn from a probability distribution D_z such that the expectation E_{g∈D_z}[g] is exactly ∇f(z), the gradient of f at z.
Example 35 (Spam classification using SGD) Returning to the spam classification
problem of Example 32, we see that the function in (16.6) is a sum of many similar terms.
If we randomly pick a single term and compute just its gradient (which is very quick to do!)
then by linearity of expectations, the expectation of this gradient is just the true gradient.
Thus the expected gradient oracle may be a much faster computation than the gradient
oracle (a million times faster if the number of email examples is a million!). In fact this
setting is not atypical; often the convex function of interest is a sum of many similar terms.
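
A minimal sketch of this idea on synthetic data: each step picks one random email, takes a step along the (sub)gradient of only its hinge-loss term, and projects back onto the ball |W|^2 ≤ n. The data, the fixed step size, and the iteration count are illustrative choices.

    import numpy as np

    np.random.seed(0)
    N, n = 1000, 10
    a = np.random.randn(N, n)                    # the "emails" as feature vectors
    b = np.sign(a @ np.random.randn(n))          # labels in {-1, +1}

    W, eta = np.zeros(n), 0.01
    for _ in range(20000):
        i = np.random.randint(N)                 # one random term: the expected gradient oracle
        if 1 - b[i] * (W @ a[i]) > 0:            # this hinge loss term is active
            W += eta * b[i] * a[i]               # step opposite to its subgradient
        r = np.linalg.norm(W)
        if r > np.sqrt(n):                       # project back onto the ball |W|^2 <= n
            W *= np.sqrt(n) / r

    print("training accuracy:", np.mean(np.sign(a @ W) == b))   # most points classified correctly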

Stochastic gradient descent can be analysed using Online Gradient Descent (OGD). Let g_i · x be the gradient function at step i. Then we use this function —which is a linear function and hence convex— as f_i in the ith step of OGD. Let z = (1/T) Σ_{i=1}^T x^(i). Let x* be the point in K where f attains its minimum value.
Theorem 26
E[f(z)] ≤ f(x*) + 2DG/√T, where D is the diameter as before and G is an upper bound on the norm of any gradient vector ever output by the oracle.

Proof:

    E[f(z) − f(x*)] ≤ E[(1/T) Σ_i (f(x^(i)) − f(x*))]          (by convexity of f)
                    ≤ (1/T) Σ_i E[∇f(x^(i)) · (x^(i) − x*)]     (using (16.2))
                    = (1/T) Σ_i E[g_i · (x^(i) − x*)]           (expected gradient is the true gradient)
                    = (1/T) Σ_i E[f_i(x^(i)) − f_i(x*)]         (definition of f_i)
                    = E[(1/T) Σ_i (f_i(x^(i)) − f_i(x*))]

and the theorem now follows since the expression inside E[·] is just the regret per step, which is always upper bounded by the quantity given in Zinkevich's theorem, so the same upper bound holds for the expectation. □

16.6 Portfolio Management via Online gradient descent


(This was actually covered at the start of Lecture 17)
Let’s return to the portfolio management problem discussed in context of multiplicative
weights. We are trying to invest in a set of n stocks and maximise our wealth. For t =
1, 2, . . . , let r(t) be the vector of relative price increase on day t, in other words

    r_i^(t) = (Price of stock i on day t) / (Price of stock i on day t − 1).

Some thought shows (confirming conventional wisdom) that it can be very suboptimal
to put all money in a single stock. A strategy that works better in practice is Constant
Rebalanced Portfolio (CRB): decide upon a fixed proportion of money to put into each stock,
and buy/sell individual stocks each day to maintain this proportion.

Example 36 Say there are only two assets, stocks and bonds. One CRB strategy is to split money equally between these two. Notice what this implies: if an asset's price falls, you
tend to buy more of it, and if the price rises, you tend to sell it. Thus this strategy roughly
implements the age-old advice to “buy low, sell high.”Concretely, suppose the prices each
day fluctuate as follows.

Stock r(t) Bond r(t)


Day 1 4/3 3/4
Day 2 3/4 4/3
Day 3 4/3 3/4
Day 4 3/4 4/3
... ... ...

Note that the prices go up and down by the same ratio on alternate days, so money
parked fully in stocks or fully in bonds earns nothing in the long run. (Aside: This kind of
fluctuation is not unusual; it is generally observed that bonds and stocks move in opposite
directions.) And what happens if you split your money equally between these two assets?
Each day it increases by a factor 0.5 × (4/3 + 3/4) = 0.5 × 25/12 ≈ 1.04. Thus your money
grows exponentially!
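
A few lines of numpy confirm the arithmetic of this example (the 200-day horizon is an illustrative choice): fully holding either asset goes nowhere, while the 50-50 rebalanced portfolio grows exponentially.

    import numpy as np

    r = np.array([[4/3, 3/4],       # odd days: stock up, bond down
                  [3/4, 4/3]])      # even days: stock down, bond up
    days = np.tile(r, (100, 1))     # 200 alternating days

    all_stock = np.prod(days[:, 0])                     # all money in stocks: 1.0
    all_bond = np.prod(days[:, 1])                      # all money in bonds: 1.0
    fifty_fifty = np.prod(days @ np.array([0.5, 0.5]))  # 50-50 CRB: (25/24)^200, about 3500
    print(all_stock, all_bond, fifty_fifty)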
Exercise: Modify the price increases in the above example so that keeping all money
in stocks or bonds alone will cause it to drop exponentially, but the 50-50 CRB increases
money at an exponential rate.

CRB uses a fixed split among n assets, but what is this split? Wouldn’t it be great to
have an angel whisper in our ears on day 1 what this magic split is? Online optimization
is precisely such an angel. Suppose the algorithm uses the vector x(t) at time t; the ith
coordinate gives the proportion of money in stock i at the start of the tth day. Then the

algorithm's wealth increases on day t by a factor r^(t) · x^(t). Thus the goal is to find x^(t)'s to maximize the final wealth, which is

    Π_t r^(t) · x^(t).

Taking logs, this becomes

    Σ_t log(r^(t) · x^(t)).     (16.8)

For any fixed r^(1), r^(2), ... this function happens to be concave, but that is fine since we are interested in maximization. Now we can try to run online gradient descent on this objective. By Zinkevich's theorem, the quantity in (16.8) converges to

    Σ_t log(r^(t) · x*)     (16.9)

where x∗ is the best money allocation in hindsight.


This analysis needs to assume very little about the r(t) ’s, except a bound on the norm
of the gradient at each step, which translates into a weak condition on price movements. In
the next homework you will apply this simple algorithm on real stock data.

16.7 Hints of more advanced ideas


Gradient descent algorithms come in dozens of flavors. (The Boyd-Vandenberghe book is a good resource, and Nesterov's lecture notes are terser but still have a lot of intuition.)
We know the optimal running time (i.e., number of iterations) of gradient descent in
the oracle model; see the books by Hazan and Bubeck.
Surprisingly, just going along the gradient (more precisely, diametrically opposite direc-
tion from gradient) is not always the best strategy. Steepest descent direction is defined by
quantifying the best decrease in the objective function obtainable via a step of unit length.
The catch is that different norms can be used to define “unit length.”For example, if dis-
tance is measured using `1 norm, then the best reduction happens by picking the largest
coordinate of the gradient vector and reducing the corresponding coordinate in x (coordi-
nate descent). The classical Newton method is a subcase where distance is measured using
the ellipsoidal norm defined using the Hessian.
Gradient descent ideas underlie recent advances in algorithms for problems like Spielman-
Teng style solver for Laplacian systems, near-linear time approximation algorithms for
maximum flow in undirected graphs, and Madry’s faster algorithm for maximum weight
matching.
Bibliography

1. Convex Optimization, S. Boyd and L. Vandenberghe. Cambridge University Press. (pdf available online.)

2. Introductory Lectures on Convex Optimization: A Basic Course. Y. Nesterov. Springer 2004.

3. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. M. Zinkevich, ICML 2003.

4. Book draft Online convex optimization. Elad Hazan.

5. Lecture notes on online optimization. S. Bubeck.


Chapter 17

Oracles, Ellipsoid method and their uses in convex optimization

Oracle: A person or agency considered to give wise counsel or prophetic predictions or precognition of the future, inspired by the gods.

Recall that Linear Programming is the following problem:

maximize cT x
Ax ≤ b
x≥0

where A is an m × n real constraint matrix and x, c ∈ R^n. Recall that if the number of bits
to represent the input is L, a polynomial time solution to the problem is allowed to have a
running time of poly(n, m, L).
The Ellipsoid algorithm for linear programming is a specific application of the ellipsoid
method developed by Soviet mathematicians Shor(1970), Yudin and Nemirovskii(1975).
Khachiyan(1979) applied the ellipsoid method to derive the first polynomial time algorithm
for linear programming. Although the algorithm is theoretically better than the Simplex
algorithm, which has an exponential running time in the worst case, it is very slow practically
and not competitive with Simplex. Nevertheless, it is a very important theoretical tool for
developing polynomial time algorithms for a large class of convex optimization problems,
which are much more general than linear programming.
In fact we can use it to solve convex optimization problems that are even too large to
write down.

17.1 Linear programs too big to write down


Often we want to solve linear programs that are too large to even write down in polynomial
time.


Example 37 Semidefinite programming (SDP) uses the convex set of n × n PSD matrices. This set is defined by the following infinite set of constraints: a^T X a ≥ 0 ∀a ∈ ℜ^n. For each fixed a this is really a linear constraint on the X_ij's:

    Σ_ij X_ij a_i a_j ≥ 0.

Thus this set is defined by infinitely many linear constraints.

Example 38 (Held-Karp relaxation for TSP) In the traveling salesman problem (TSP)
we are given n points and distances dij between every pair. We have to find a salesman
tour, which is a sequence of hops among the points such that each point is visited exactly
once and the total distance covered is minimized.
An integer programming formulation of this problem is:

    min Σ_ij d_ij X_ij
    X_ij ∈ {0, 1}   ∀i, j
    Σ_{i∈S, j∉S} X_ij ≥ 2   ∀S ⊆ V, S ≠ ∅, V   (subtour elimination)

The last constraint is needed because without it the solution could be a disjoint union of
subtours, and hence these constraints are called subtour elimination constraints. The Held-
Karp relaxation relaxes the first constraint to 0 ≤ Xij ≤ 1. Now this is a linear program,
but it has 2^n + n^2 constraints! We cannot afford to write them down (for then we might as
well use the trivial exponential time algorithm for TSP).

Clearly, we would like to solve such large (or infinite) programs, but we need a different
paradigm than the usual one that examines the entire input.

17.2 A general formulation of convex programming


A convex set K in <n is a subset such that for every x, y ∈ K and λ ∈ [0, 1] the point
λx + (1 − λ)y is in K. (In other words, the line joining x, y lies in K.) If it is compact and
bounded we call it a convex body. It follows that if K1 , K2 are both convex bodies then so
is K1 ∩ K2 .
A general formulation of convex programming is

min cT x
x∈K

where K is a convex body.


Example 39 Linear programming is exactly this problem where K is simply the polytope
defined by the constraints.

Example 40 Some lectures ago we were interested in semidefinite programming, where K


= set of PSD matrices. This is convex since if X, Y are psd matrices then so is (X + Y )/2.
The set of PSD matrices is a convex set but extends to ∞. In the examples last time it
was finite since we had a constraint like Xii = 1 for all i, which implies that |Xij | ≤ 1 for
all i, j. Usually in most settings of interest we can place some a priori upper bound on the
desired solution that ensures K is a finite body.
In fact, since we can use binary search to reduce optimization to a decision problem, we can replace the objective by a constraint c^T x ≤ c_0. Then we are looking for a point in the convex body K ∩ {x : c^T x ≤ c_0}, which is another convex body K'. We conclude that convex programming boils down to finding a single point in a convex body (where we may repeat this basic operation multiple times with different convex bodies).
Here are other examples of convex sets and bodies.
1. The whole space R^n is trivially an infinite convex set.

2. The hypercube of side length l is the set of all x such that 0 ≤ x_i ≤ l, 1 ≤ i ≤ n.

3. The ball of radius r around the origin is the set of all x such that Σ_{i=1}^n x_i^2 ≤ r^2.

17.2.1 Presenting a convex body: separation oracles


We need a way to work with a convex body K without knowing its full description. The
simplest way to present a body to the algorithm is via a membership oracle: a blackbox
program that, given a point x, tells us if x ∈ K. We will work with a stronger version of
the oracle, which relies upon the following fact.

Figure 17.1: Farkas’s Lemma: Between every convex body and a point outside it, there’s a
hyperplane

Farkas’s Lemma: If K ⊆ Rn is a convex set and p ∈ Rn is a point, then one of the


following holds
(i) p ∈ K
(ii) there is a hyperplane that separates p from K. (Recall that a hyperplane is the set of points satisfying a linear equation of the form a · x = b where a, x ∈ R^n and b ∈ R.)
This Lemma is intuitively clear but the proof takes a little formal math and is omitted.
This prompts the following definition of a polynomial time Separating Oracle.
Definition 9 A polynomial time Separation Oracle for a convex set K is a procedure
which given p, either tells that p ∈ K or returns a hyperplane that separates p and all of K.
The procedure runs in polynomial time.

Example 41 Consider the polytope defined by the Held-Karp relaxation. We are given a
candidate solution P = (Pij ). Suppose P12 = 1.1. Then it violates the constraint X12 ≤ 1,
and thus the hyperplane X12 = 1 separates the polytope from P .
Thus to check that it lies in the polytope defined by all the constraints, we first check that Σ_j P_ij = 2 for all i. This can be done in polynomial time. If the equality is violated for any i then that is a separating hyperplane.
If all the other constraints are satisfied, we finally turn to the subtour elimination
constraints. We construct the weighted graph on n nodes where the weight of edge {i, j}
is Pij . We compute the minimum cut in this weighted graph. The subtour elimination
constraints are all satisfied iff the minimum cut (S, S̄) has capacity ≥ 2. If the mincut (S, S̄) has capacity less than 2 then the hyperplane

    Σ_{i∈S, j∉S} X_ij = 2

has P on the < 2 side and the Held-Karp polytope on the ≥ 2 side.
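
A sketch of this separation oracle, assuming the networkx library is available for the global minimum cut (the function name, the return convention, and the assumption that the support graph is connected are illustrative; the degree and box-constraint checks are omitted).

    import networkx as nx

    def subtour_separation(P):
        # P is a symmetric n x n candidate solution with entries in [0, 1].
        n = len(P)
        G = nx.Graph()
        G.add_nodes_from(range(n))
        for i in range(n):
            for j in range(i + 1, n):
                if P[i][j] > 0:
                    G.add_edge(i, j, weight=P[i][j])
        cut_value, (S, _) = nx.stoer_wagner(G)   # global minimum cut of the weighted graph
        if cut_value < 2:
            return S        # the subtour constraint for this S is violated by P
        return None         # P satisfies all the subtour elimination constraints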

Thus you can think of a separation oracle as providing a “letter of rejection”to the point
outside it explaining why it is not in the body K.

Example 42 For the set of PSD matrices, the separation oracle is given a matrix P. It computes eigenvalues and eigenvectors to check if P only has nonnegative eigenvalues. If not, then it takes an eigenvector a corresponding to a negative eigenvalue and returns the hyperplane Σ_ij X_ij a_i a_j = 0. (Note that the a_i's are constants here.) Then the PSD matrices are on the ≥ 0 side and P is on the < 0 side.
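
In numpy this oracle is a few lines; the return convention (None means "P is in the PSD cone") and the tolerance are illustrative choices.

    import numpy as np

    def psd_separation(P, tol=1e-9):
        # Returns None if the symmetric matrix P is PSD, otherwise a vector a defining
        # the separating hyperplane sum_ij X_ij a_i a_j = 0.
        eigvals, eigvecs = np.linalg.eigh(P)     # eigenvalues in ascending order
        if eigvals[0] >= -tol:
            return None
        return eigvecs[:, 0]                     # eigenvector of the most negative eigenvalue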

17.3 Ellipsoid Method


The Ellipsoid algorithm solves the basic problem of finding a point in a convex body K.
The basic idea is divide and conquer. At each step the algorithm asks the separation oracle
about a particular point p. If p is in K then the algorithm can declare success. Otherwise
the algorithm is able to divide the space into two (using the hyperplane provided by the
separation oracle) and recurse on the correct side. (To quote the classic GLS text: How
do you catch a lion in the Sahara? Fence the Sahara down the middle. Gaze on one side
and see if you spot the lion on the left. If so, continue on the left side, else continue on the
right.)
The only problem is to make sure that the algorithm makes progress at every step.
After all, space is infinite and the body could be anywhere in it. Cutting down an infinite set
into two still leaves infinite sets. For this we use the notion of the containing Ellipsoid of a
convex body.
An axis-aligned ellipsoid is the set of all x such that

    Σ_{i=1}^n x_i^2 / λ_i^2 ≤ 1,

where the λ_i's are nonzero reals. In 3D this is an egg-like object where λ_1, λ_2, λ_3 are the radii along the three axes (see Figure 17.2). A general ellipsoid in R^n can be represented as

(x − a)T B(x − a) ≤ 1,

where B is a positive semidefinite matrix. (Being positive semidefinite means B can be


written as B = AAT for some n×n real matrix A. This is equivalent to saying B = Q−1 DQ,
where Q is a unitary and D is a diagonal matrix with all positive entries.)

Figure 17.2: 3D-Ellipsoid and its axes

The convex body K is presented by a separation oracle, and we are told that the body lies somewhere inside some ellipsoid E_0 whose description is given to us. At the ith iteration the algorithm maintains the invariant that the body is inside some ellipsoid E_i. The iteration is very simple.
Let p = the center of E_i. Ask the oracle if p ∈ K. If it says "Yes," declare success. Else the oracle returns some halfspace a^T x ≥ b that contains K whereas p lies on the other side. Let E_{i+1} = the minimum ellipsoid containing the convex body E_i ∩ {x : a^T x ≥ b}.

Figure 17.3: Couple of runs of the Ellipsoid method showing the tiny convex set in blue
and the containing ellipsoids. The separating hyperplanes do not pass through the centers
of the ellipsoids in this figure.

The running time of each iteration depends on the running time of the separation oracle
and the time required to find Ei+1 . For linear programming, the separation oracle runs in

O(mn) time as all we need to do is check whether p satisfies all the constraints, and return
a violating constraint as the halfspace (if it exists). The time needed to find Ei+1 is also
polynomial by the following non-trivial lemma from convex geometry.
Lemma 27
The minimum volume ellipsoid surrounding a half ellipsoid (i.e., E_i ∩ H^+ where H^+ is a halfspace as above) can be calculated in polynomial time, and

    Vol(E_{i+1}) ≤ (1 − 1/(2n)) Vol(E_i).

Thus after t steps the volume of the enclosing ellipsoid has dropped by a factor (1 − 1/(2n))^t ≤ exp(−t/(2n)).
Technically speaking, there are many fine points one has to address. (i) The Ellipsoid
method can never say unequivocally that the convex body was empty; it can only say after
T steps that the volume is less than exp(−T/(2n)). In many settings we know a priori that the volume of K, if nonempty, is at least exp(−n^2) or some such number, so this is good
enough. (ii) The convex body may be low-dimensional. Then its n-dimensional volume is
0 and the containing ellipsoid continues to shrink forever. At some point the algorithm has
to take notice of this, and identify the lower dimensional subspace that the convex body
lies in, and continue in that subspace.
As for linear programming, it can be shown that for a linear program which requires L bits to represent the input, it suffices to start with an ellipsoid E_0 of volume 2^{c_2 nL} (since the solution can be written in c_2 nL bits, it fits inside an ellipsoid of about this size) and to finish when the volume of E_t is 2^{−c_1 nL} for some constants c_1, c_2, which implies t = O(n^2 L). Therefore, after O(n^2 L) iterations, the containing ellipsoid is so small that the algorithm can easily "round" it to some vertex of the polytope. (This number of iterations can be improved to O(nL) with some work.) Thus the overall running time is poly(n, m, L). For a detailed proof of the
above lemma and other derivations, please refer to Santosh Vempala’s notes linked from the
webpage. The classic [GLS] text is a very readable yet authoritative account of everything
related (and there’s a lot) to the Ellipsoid method and its variants.

Bibliography

[GLS] M. Groetschel, L. Lovasz, A. Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer 1993.
Chapter 18

Duality and MinMax Theorem

We are used to the concept of duality in life: yin and yang, Mars and Venus, etc. In
mathematics duality refers to the phenomenon whereby two objects that look very different
are actually the same in a technical sense.
Today we first see LP duality, which will then be explored a bit more in the homeworks.
Duality has several equivalent statements.

1. If K is a polytope and p is a point outside it, then there is a hyperplane separating p


from K.

2. The following system of inequalities

    a_1 · X ≥ b_1
    a_2 · X ≥ b_2
       ⋮                (18.1)
    a_m · X ≥ b_m
    X ≥ 0

is infeasible iff using nonnegative linear combinations of the inequalities it is possible to derive −1 ≥ 0, i.e. there exist λ_1, λ_2, ..., λ_m ≥ 0 such that

    Σ_{i=1}^m λ_i a_i ≤ 0   and   Σ_{i=1}^m λ_i b_i > 0.

This statement is called Farkas’s Lemma.

18.1 Linear Programming and Farkas’ Lemma


In courses and texts duality is taught in context of LPs. Say the LP looks as follows:


Given: vectors c, a1 , a2 , . . . am ∈ Rn , and real numbers b1 , b2 , . . . bm .


Objective: find X ∈ Rn to minimize c · X, subject to:

a1 · X ≥ b1
a2 · X ≥ b2
   ⋮                (18.2)
am · X ≥ bm
X ≥ 0

The notation X ≥ Y simply means that X is componentwise at least Y. Now we represent the system in (18.2) more compactly using matrix notation. Let A be the m × n matrix whose rows are a_1^T, a_2^T, ..., a_m^T, and let b = (b_1, b_2, ..., b_m)^T.

Then the Linear Program (LP for short) can be rewritten as:

min cT X :
AX ≥ b (18.3)
X ≥0

This form is general enough to represent any possible linear program. For instance,
if the linear program involves a linear equality a · X = b then we can replace it by two
inequalities
a · X ≥ b and − a · X ≥ −b.
If the variable X_i is unconstrained, then we can replace each occurrence by X_i^+ − X_i^− where
Xi+ , Xi− are two new non-negative variables.

18.2 LP Duality Theorem


With every LP we can associate another LP called its dual. The original LP is called the
primal. If the primal has n variables and m constraints, then the dual has m variables and
n constraints. Thus there is a primal variable corresponding to each dual constraint, and a
dual variable for each primal constraint.

    Primal:  min c^T X   s.t.  AX ≥ b,  X ≥ 0
    Dual:    max Y^T b   s.t.  Y^T A ≤ c^T,  Y ≥ 0     (18.4)
(Aside: if the primal contains an equality constraint instead of inequality then the
corresponding dual variable is unconstrained.)
It is an easy exercise that the dual of the dual is just the primal.
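
A quick numerical sanity check of the theorem below on a small random instance, using scipy.optimize.linprog (which minimizes c^T x subject to "≤" constraints, so the primal's "≥" constraints are negated); the instance is an illustrative one chosen so that both programs are feasible and bounded.

    import numpy as np
    from scipy.optimize import linprog

    np.random.seed(0)
    m, n = 4, 6
    A = np.random.rand(m, n) + 0.1     # positive data keeps both programs feasible and bounded
    b = np.random.rand(m) + 0.1
    c = np.random.rand(n) + 0.1

    # Primal: min c.X  s.t.  AX >= b, X >= 0   (flip signs since linprog uses "<=")
    primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None)] * n)
    # Dual:   max Y.b  s.t.  A^T Y <= c, Y >= 0 (linprog minimizes, so minimize -b.Y)
    dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(0, None)] * m)
    print(primal.fun, -dual.fun)       # the two optima coincide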

Theorem 28
The Duality Theorem. If both the Primal and the Dual of an LP are feasible, then the
two optima coincide.

Proof: The proof involves two parts:


1. Primal optimum ≥ Dual optimum.
This is the easy part. Suppose X∗ , Y∗ are the respective optima. This implies that

AX∗ ≥ b.

Now, since Y* ≥ 0, the product Y*^T AX* is a non-negative linear combination of the


rows of AX∗ , so the inequality

Y∗ T AX∗ ≥ Y∗ T b

holds. Again, since X∗ ≥ 0 and cT ≥ Y∗ T A, we obtain the inequality

cT X∗ ≥ (Y∗ T A)X∗ .

Examining the previous two lines we conclude cT X∗ ≥ Y∗ T b, which completes the


proof of this part.

2. Dual optimum ≥ Primal optimum.


Let k be the optimum value of the primal. Since the primal is a minimization problem,
the following set of linear inequalities is infeasible for any ε > 0:

−cT X ≥ −(k − ε)
AX ≥ b (18.5)
X ≥0

Here, ε is a small positive quantity. Therefore, by Farkas' Lemma, there exist λ_0, λ_1, ..., λ_m ≥ 0 such that

    −λ_0 c + Σ_{i=1}^m λ_i a_i ≤ 0     (18.6)
    −λ_0 (k − ε) + Σ_{i=1}^m λ_i b_i > 0.     (18.7)

Note that λ_0 > 0, since omitting the first inequality in (18.5) leaves a feasible system by assumption about the primal. Thus, consider the nonnegative vector

    Λ = (λ_1/λ_0, ..., λ_m/λ_0)^T.

The inequality (18.6) implies that Λ^T A ≤ c^T. So Λ is a feasible solution to the Dual. The inequality (18.7) implies that Λ^T b > (k − ε), and since the Dual is a maximization problem, this implies that the Dual optimum is bigger than k − ε. Since this holds for every ε > 0, by compactness we conclude that there is a Dual feasible solution of value k. Thus, this part is proved, too. Hence the Duality Theorem is proved.

□
My thoughts on this business:
(1) Usually textbooks bundle the case of infeasible systems into the statement of the Duality
theorem. This muddies the issue for the student. Usually all applications of LPs fall into
two cases: (a) We either know (for trivial reasons) that the system is feasible, and are only
interested in the value of the optimum or (b) We do not know if the system is feasible and
that is precisely what we want to determine. Then it is best to just use Farkas’ Lemma.
(2) The proof of the Duality theorem is interesting. The first part shows that for any
dual feasible solution Y the various Yi ’s can be used to obtain a weighted sum of primal
inequalities, and thus obtain a lowerbound on the primal. The second part shows that
this method of taking weighted sums of inequalities is sufficient to obtain the best possible
lowerbound on the primal: there is no need to do anything fancier (e.g., taking products of
inequalities or some such thing).

18.3 Example: Max Flow Min Cut theorem in graphs


The input is a directed graph G(V, E) with one source s and one sink t. Each edge e has
a capacity ce . The flow on any edge must be less than its capacity, and at any node apart
from s and t, flow must be conserved: total incoming flow must equal total outgoing flow.
We wish to maximize the flow we can send from s to t. The maximum flow problem can be
formulated as a Linear Program as follows:
Let P denote the set of all (directed) paths from s to t. Then the max flow problem
becomes:
    max Σ_{P∈P} f_P     (18.8)
    ∀P ∈ P : f_P ≥ 0     (18.9)
    ∀e ∈ E : Σ_{P: e∈P} f_P ≤ c_e     (18.10)

Since P could contain exponentially many paths, this is an LP with exponentially many
variables. Luckily duality tells us how to solve it using the Ellipsoid method.
Going over to the dual, we get:
    min Σ_{e∈E} c_e y_e     (18.11)
    ∀e ∈ E : y_e ≥ 0     (18.12)
    ∀P ∈ P : Σ_{e∈P} y_e ≥ 1     (18.13)

Notice that the dual in fact represents the fractional min s − t cut problem: think of
each edge e being picked up to a fraction ye . The constraints say that a total weight of 1
must be picked on each path. Thus the usual s-t min cut problem simply involves 0 − 1
solutions to the ye ’s in the dual.

Exercise 1 Prove that the optimum solution does have ye ∈ {0, 1}, and thus the solution
to the dual is the best s-t min cut.

Thus, LP duality implies max-st-flow = (capacity of) min-cut.

Polynomial-time algorithms? The primal has exponentially many variables! (Aside:


turns out it is equivalent to a more succinct LP but let's proceed with this one.) Nevertheless
we can use the Ellipsoid method by applying it to the dual, which has m variables and
exponentially many constraints. As we saw last time, we only need to show a polynomial-
time separation oracle for the dual. Namely, for each candidate vector (ye ) we need to check
if it satisfies all the dual constraints. This can be done by creating a weighted version of
the graph where the weight on edge e is ye . Then compute the shortest path from s to t
in this weighted graph. If the shortest path has length < 1 then we have found a violated
constraint.
Of course, for Max Flow we know of much faster algorithms than the Ellipsoid method
(e.g., the algorithms you saw in your undergrad course), but there are other LPs with
exponentially many variables for which the only known polynomial time algorithms go via
the Ellipsoid method.

18.4 Game theory and the minmax theorem


In the 1930s, polymath John von Neumann (professor at IAS, now buried in the cemetery
close to downtown) was interested in applying mathematical reasoning to understand strate-
gic interactions among people —or for that matter, nations, corporations, political parties,
etc. He was a founder of game theory, which models rational choice in these interactions as
maximization of some payoff function.
A starting point of this theory is the zero-sum game. There are two players, 1 and 2,
where 1 has a choice of m possible moves, and 2 has a choice of n possible moves. When
player 1 plays his ith move and player 2 plays her jth move, the outcome is that player 1
pays Aij to player 2. Thus the game is completely described by an m × n payoff matrix.

Figure 18.1: Payoff matrix for Rock/Paper/Scissor

This setting is called zero sum because what one player wins, the other loses. By
contrast, war (say) is a setting where both parties may lose material and men. Thus their
combined worth at the end may be lower than at the start. (Aside: An important stimulus

for development of game theory in the 1950s was the US government’s desire to behave
“strategically” in matters of national defence, e.g. the appropriate tit-for-tat policy for
waging war —whether nuclear or conventional or cold.)
von Neumann was interested in a notion of equilibrium. In physics, chemistry etc. an
equilibrium is a stable state for the system that results in no further change. In game theory
it is a pair of strategies g1 , g2 for the two players such that each is the optimum response
to the other.
Let’s examine this for zero sum games. If player 1 announces he will play the ith move,
then the rational move for player 2 is the move j that maximises Aij . Conversely, if player
2 announces she will play the jth move, player 1 will respond with move i0 that minimizes
Ai0 j . In general, there may be no equilibrium in such announcements: the response of player
1 to player 2’s response to his announced move i will not be i in general:

    min_i max_j A_ij  ≠  max_j min_i A_ij.

In fact there is no such equilibrium in Rock/paper/scissors either, as every child knows.


von Neumann realized that this lack of equilibrium disappears if one allows players'
announced strategy to be a distribution on moves, a so-called mixed strategy. Player 1's
distribution is x ∈ ℝ^m satisfying x_i ≥ 0 and Σ_i x_i = 1; Player 2's distribution is y ∈ ℝ^n
satisfying y_j ≥ 0 and Σ_j y_j = 1. Clearly, the expected payoff from Player 1 to Player 2
then is Σ_{ij} x_i A_ij y_j = x^T A y.
But has this fixed the problem about nonexistence of equilibrium? If Player 1 announces
first the payoff is min_x max_y x^T A y, whereas if Player 2 announces first it is max_y min_x x^T A y.
The next theorem says that it doesn't matter who announces first; neither player has an
incentive to change strategies after seeing the other's announcement.
Theorem 29 (Famous Min-Max Theorem of Von Neumann)
min_x max_y x^T A y = max_y min_x x^T A y.

Turns out this result is a simple consequence of LP duality and is equivalent to it. You
will explore it further in the homework.
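As an aside, the value of a small zero-sum game is easy to compute by machine from the LP view. The following sketch is my own illustration (not from the notes, and assuming scipy is installed); the function name zero_sum_value and the 0/1 Rock/Paper/Scissors payoff matrix are just for the demo.

import numpy as np
from scipy.optimize import linprog

def zero_sum_value(A):
    """Value of the zero-sum game with payoff matrix A (A[i][j] = amount the row
    player pays the column player), via the LP
        min v  s.t.  sum_i x_i A[i][j] <= v for all j,  sum_i x_i = 1,  x >= 0.
    Returns (value, optimal mixed strategy x of the row player)."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    c = np.zeros(m + 1); c[-1] = 1.0                   # minimize v
    A_ub = np.hstack([A.T, -np.ones((n, 1))])          # A^T x - v <= 0, one row per column j
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1)); A_eq[0, :m] = 1.0     # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]          # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:m]

# Rock/Paper/Scissors with 0/1 payoffs: the value is 1/3, attained by uniform play.
A = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
print(zero_sum_value(A))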
What if the game is not zero sum? Defining an equilibrium for it was an open problem
until John Nash at Princeton managed to define it in the early 1950s; this solution is called
a Nash equilibrium. We’ll return to it in a future lecture. BTW, you can still sometimes
catch a glimpse of Nash around campus.
Chapter 19

Equilibria and algorithms

Economic and game-theoretic reasoning —specifically, how agents respond to economic in-
centives as well as to each other’s actions– has become increasingly important in algorithm
design. Examples: (a) Protocols for networking have to allow for sharing of network re-
sources among users, companies etc., who may be mutually cooperating or competing. (b)
Algorithm design at Google, Facebook, Netflix etc.—what ads to show, which things to
recommend to users, etc.—not only has to be done using objective functions related to eco-
nomics, but also with an eye to how users and customers change their behavior in response
to the algorithms and to each other.
Algorithm design mindful of economic incentives and strategic behavior is studied in a
new field called Algorithmic Game Theory. (See the book by Nisan et al., or many excellent
lecture notes on the web.)
Last lecture we encountered zero sum games, a simple setting. Today we consider more
general games.

19.1 Nonzero sum games and Nash equilibria


Recall that a 2-player game is zero sum if the amount won by one player is the same as
the amount lost by the other. Today we relax this. Thus if player 1 has n possible actions
and player 2 has m, then specifying the game requires two n × m matrices A, B such that
when they play actions i, j respectively then the first player wins Aij and the second wins
Bij . (For zero sum games, Aij = −Bij .)
A Nash equilibrium is defined similarly to the equilibrium we discussed for zero sum
games: a pair of strategies, one for each player, such that each is the optimal response to
the other. In other words, if they both announce their strategies, neither has an incentive
to deviate from his/her announced strategy. The equilibrium is pure if the strategy consists
of deterministically playing a single action.

Example 43 (Prisoners’ Dilemma) This is a classic example that people in myriad dis-
ciplines have discussed for over six decades. Two people suspected of having committed a
crime have been picked up by the police. In line with usual practice, they have been placed
in separate cells and offered the standard deal: help with the investigation, and you’ll be


treated with leniency. How should each prisoner respond: Cooperate (i.e., stick to the story
he and his accomplice decided upon in advance), or Defect (rat on his accomplice and get
a reduced term)?
Let's describe their incentives as a 2 × 2 matrix, where the first entry in each cell is the
payoff for the player whose actions determine the row.

                 Cooperate   Defect
    Cooperate      3, 3       0, 4
    Defect         4, 0       1, 1

If they both cooperate, the police can't prove much and they get off with fairly light
sentences after which they can enjoy their loot
(payoff of 3). If one defects and the other cooperates, then the defector goes scot free and
has a high payoff of 4 whereas the other one has a payoff of 0 (long prison term, plus anger
at his accomplice).
The only pure Nash equilibrium is (Defect, Defect), with both receiving payoff 1. In
every other scenario, the player who's cooperating can improve his payoff by switching to
Defect. This is much worse for both of them than (Cooperate, Cooperate), which is also
the social optimum: the sum of their payoffs is highest there, at 6. Thus in particular the
social optimum solution is not a Nash equilibrium. (OK, we are talking about criminals
here so maybe the social optimum is (Defect, Defect) after all. But read on.)
One can imagine other games with similar payoff structure. For instance, two companies
in a small town deciding whether to be polluters or to go green. Going green requires
investment of money and effort. If one does it and the other doesn’t, then the one who is
doing it has incentive to also become a polluter. Or, consider two people sharing an office.
Being organized and neat takes effort, and if both do it, then the office is neat and both are
fairly happy. If one is a slob and the other is neat, then the neat person has an incentive
to become a slob (saves a lot of effort, and the end result is not much worse).
Such games are actually ubiquitous if you think about it, and it is a miracle that humans
(and animals) cooperate as much as they do. Social scientists have long pondered how to
cope with this paradox. For instance, how can one change the game definition (e.g. a wise
governing body changes the payoff structure via fines or incentives) so that cooperating
with each other —the socially optimal solution—becomes a Nash equilibrium? The game
can also be studied via the repeated game interpretation, whereby people realize that they
participate in repeated games through their lives, and playing nice may well be a Nash
equilibrium in that setting. As you can imagine, many books have been written. 2

Example 44 (Chicken) This dangerous game was supposedly popular among bored teenagers
in American towns in the 1950s (as per some classic movies). Two kids would drive their
cars at high speed towards each other on a collision course. The one who swerved away first
to avoid a collision was the “chicken.” How should we assign payoffs in this game? Each
player has two possible actions, Chicken or Dare. If both play Dare, they wreck their cars
and risk injury or death. Let's call this a payoff of 0 to each. If both go Chicken, they both
live and have not lost face, so let’s call it a payoff of 5 for each. But if one goes Chicken and
the other goes Dare, then the one who went Dare looks like the tough one (and presumably

attracts more dates), whereas the Chicken is better off being alive than dead but lives in
shame. So we get the payoff table:

                 Chicken   Dare
    Chicken       5, 5     1, 6
    Dare          6, 1     0, 0

This has two pure Nash equilibria: (Dare, Chicken) and (Chicken, Dare). We may
think of this as representing two types of behavior: the reckless type may play Dare and
the careful type may play Chicken.
Note that the socially optimal solution —both players play chicken, which maximises
their total payoff—is not a Nash equilibrium.

Many games do not have any pure Nash equilibrium. Nash’s great insight during his
grad school years in Princeton was to consider what happens if we allow players to play a
mixed strategy, which is a probability distribution over actions. An equilibrium now is a
pair of mixed strategies x, y such that each strategy is the optimum response (in terms of
maximising expected payoff) to the other.
Theorem 30 (Nash 1950)
For every pair of payoff matrices A, B there is an odd number (hence nonzero) of mixed
equilibria.

Unfortunately, Nash’s proof doesn’t yield an efficient algorithm for computing an equi-
librium: when the number of possible actions is n, computation may require exp(n) time.
Recent work has shown that this may be inherent: computing Nash equilibria is PPAD-
complete (Chen and Deng’06).
The Chicken game has a mixed equilibrium: play each of Chicken and Dare with prob-
ability 1/2. This has expected payoff (1/4)(5 + 1 + 6 + 0) = 3 for each, and a simple calculation
shows that neither can improve his payoff against the other by changing to a different
strategy.
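That simple calculation takes two lines to verify numerically (an illustrative sketch, not from the notes):

import numpy as np

A = np.array([[5, 1], [6, 0]])   # row player's Chicken payoffs: action 0 = Chicken, 1 = Dare
y = np.array([0.5, 0.5])         # opponent mixes 1/2-1/2

# Expected payoff of each pure action against y is 3 in both cases, so no
# deviation helps, and (y, y) is indeed a mixed Nash equilibrium.
print(A @ y)                     # [3. 3.]
print(y @ A @ y)                 # 3.0, the equilibrium payoff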

19.2 Multiplayer games and Bandwidth Sharing


One can define multiplayer games and equilibria analogously to two-player games. One
can also define games where each player’s set of moves comes from a continuous set like the
interval [0, 1]. Now we do this in a simple setting: multiple users sharing a single link of
fixed bandwidth, say 1 unit. They have different utilities for internet speed, and different
budgets. Hence the owner of the link can try to allocate bandwidth using a game-theoretic
view, which we study using a game introduced by Frank Kelly.

1. There are n users. If user i gets x units of bandwidth by paying w dollars, his/her
   utility is U_i(x) − w, where the utility function U_i is nonnegative, increasing, concave,
   and differentiable. (Concavity implies that going from 0 units to 1 brings more
   happiness than going from 1 to 2, which in turn brings more happiness than going
   from 2 to 3. For twice-differentiable functions, concavity means the second derivative
   is negative.) If a unit of bandwidth is priced at p, this utility describes the amount
   of bandwidth desired by a utility-maximizing user: the ith user demands the x_i that
   maximises U_i(x_i) − p·x_i. This maximum can be computed by calculus.

Figure 19.1: Sharing a fixed bandwidth link among many users

2. The game is as follows: user i offers to pay a sum of w_i. The link owner allocates
   a w_i / Σ_j w_j portion of the bandwidth to user i. Thus the entire bandwidth is used up
   and the effective price for the entire bandwidth is Σ_j w_j.

What n-tuple of strategies w_1, w_2, . . . , w_n is a Nash equilibrium? Note that this n-tuple
implies a per-unit price p = Σ_j w_j, and for each i his received amount is optimal at this
price if x_i = w_i / Σ_j w_j is the solution to max U_i(x_i) − w_i, which requires (by the chain
rule of differentiation):

    U_i'(x_i) (1/p − w_i/p²) = 1
    ⇒  U_i'(x_i) (1 − x_i) = p.

This implicitly defines x_i in terms of p. Furthermore, the left hand side is easily checked to
be a decreasing function of x_i. (Specifically, its derivative is (1 − x_i)U_i''(x_i) − U_i'(x_i), whose
first term is negative by concavity and the second because U_i'(x_i) ≥ 0 by our assumption
that U_i is an increasing function.) Thus Σ_i x_i is a decreasing function of p. When p = +∞,
the x_i's that maximise utility are all 0, whereas for p = 0 the x_i's are all 1, which violates
the constraint Σ_i x_i = 1. By the intermediate value theorem, there must exist a choice of p
between 0 and +∞ where Σ_i x_i = 1, and the corresponding values of the w_i's then constitute
a Nash equilibrium.
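This argument also gives a simple way to find the equilibrium numerically: bisect on the price p, since the total demand Σ_i x_i decreases in p. The sketch below is my own illustration (not from the notes); the logarithmic utilities U_i(x) = c_i log(1+x) are a hypothetical choice made only for the demo.

def x_of_p(Uprime, p, tol=1e-12):
    """Solve U'(x)(1 - x) = p for x in [0, 1); the left side is decreasing in x."""
    if Uprime(0.0) <= p:                 # even the first sliver is not worth price p
        return 0.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Uprime(mid) * (1.0 - mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def nash_price(Uprimes, tol=1e-10):
    """Bisect on the price until the demanded bandwidths sum to 1."""
    lo, hi = 1e-12, max(U(0.0) for U in Uprimes)   # at hi nobody demands anything
    while hi - lo > tol:
        p = 0.5 * (lo + hi)
        total = sum(x_of_p(U, p) for U in Uprimes)
        if total > 1.0:
            lo = p          # price too low: demand exceeds the link capacity
        else:
            hi = p
    return 0.5 * (lo + hi)

# Hypothetical utilities U_i(x) = c_i * log(1 + x), so U_i'(x) = c_i / (1 + x).
cs = [1.0, 2.0, 4.0]
Uprimes = [lambda x, c=c: c / (1.0 + x) for c in cs]
p = nash_price(Uprimes)
xs = [x_of_p(U, p) for U in Uprimes]
print(p, xs, sum(xs))       # the x_i sum to (approximately) 1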
Is this equilibrium socially optimal? Let p* be the socially optimal price. At this price
the ith user desires a bandwidth x_i that maximises U_i(x_i) − p*·x_i, which is the unique x_i
that satisfies U_i'(x_i) = p*. Furthermore these x_i's must sum to 1.
By contrast, the Nash equilibrium price p_N corresponds to solving U_i'(x_i)(1 − x_i) = p_N.
If the number of users is large (and the utility functions not “too different” so that the x_i's
are not too different) then each x_i is small and 1 − x_i ≈ 1. Thus the Nash equilibrium price
is close to but not the same as the socially optimal choice.

Price of Anarchy
One of the notions highlighted by algorithmic game theory is price of anarchy, which is the
ratio between the cost of the Nash equilibrium and the social optimum. The idea behind
this name is that Nash equilibrium is what would be achieved in a free market, whereas
social optimum is what could be achieved by a planner who knows everybody’s utilities.
One identifies a family of games, such as bandwidth sharing, and looks at the maximum of
this ratio over all choices of the players’ utilities. The price of anarchy for the bandwidth
sharing game happens to be 4/3. Please see the chapter on inefficiency of equilibria in the
AGT book.

19.3 Correlated equilibria


In HW 3 you were asked to simulate two strategies that repeatedly play Rock-Paper-Scissors
while minimizing regret. The payoffs were as follows:

                 Rock    Paper   Scissor
    Rock         0, 0    0, 1    1, 0
    Paper        1, 0    0, 0    0, 1
    Scissor      0, 1    1, 0    0, 0

Possibly you originally guessed that they would converge to playing Rock, Paper, Scissor
uniformly at random. However, this is not regret minimizing since it leads to payoff 0 every third
round in expectation. What you probably saw in your simulation was that the players
converged to a correlated strategy that guarantees one of them a payoff every other round.
Thus they learnt to game the system together and maximise their profits.
This is a subcase of a more general phenomenon, whereby playing low-regret strategies
in general leads to a different type of equilibrium, called correlated equilibrium.

Example 45 In the game of Chicken, the following is a correlated equilibrium: each of
the three pairs of moves other than (Dare, Dare) is played with probability 1/3. This is a correlated
strategy: there is a global random string (or higher agency) that tells the players what to
do. Neither player knows what the other has chosen.
Suppose we think of the game being played between two cars approaching a traffic
intersection from two directions. Then the correlated equilibrium of the previous paragraph
has a nice interpretation: a traffic light! Actually, it is what a traffic light would look like if
there were no traffic police to enforce the laws. The traffic light would be programmed to
repeatedly pick one of three states with equal probability: (Red, Red), (Green, Red), and
(Red, Green). (By contrast, real-life lights cycle between (Red, Green), and (Green, Red);
where we are ignoring Yellow for now.) If a motorist arriving at the intersection sees Green,
he knows that the other motorist sees Red and so can go through without hesitation. If
he sees Red on the other hand, he only knows that there is equal chance that the other
motorist sees Red or Green. So acting rationally he will come to a stop since otherwise he
has probability 1/2 of getting into an accident. Note that this means that when the light
is (Red, Red) then the traffic would be sitting at a halt in both directions.

The previous example illustrates the notion of correlated equilibrium, and we won’t
define it more precisely. The main point is that it can be arrived at using a simple algorithm,
namely, multiplicative weights (this statement also has caveats; see the relevant chapter in
the AGT book). Unfortunately, correlated equilibria are also not guaranteed to maximise
social welfare.

Bibliography

1. Algorithmic Game Theory. Nisan, Roughgarden, Tardos, Vazirani (eds.), Cambridge


University Press 2007.

2. The mathematics of traffic in networks. Frank Kelly. In Princeton Companion to


Mathematics (T. Gowers, Ed.). PU Press 2008.

3. Settling the Complexity of 2-Player Nash Equilibrium. X. Chen and X. Deng. IEEE
FOCS 2006.
Chapter 20

Protecting against information loss: coding theory

Computer and information systems are prone to data loss—lost packets, crashed or cor-
rupted hard drives, noisy transmissions, etc.—and it is important to prevent actual loss of
important information when this happens. Today's lecture concerns error correcting codes,
a stepping stone to many other ideas, including a big research area (usually based in EE de-
partments) called information theory. This area started with a landmark paper by Claude
Shannon in 1948, whose key insight was that data transmission is possible despite noise and
errors if the data is encoded in some redundant way.

Example 46 (Elementary ways of introducing redundancy) The simplest way to


introduce redundancy is to repeat each bit, say 5 times. The cons are (a) large inefficiency
(b) no resistance to bursty error, which may wipe out all 5 copies.
Another simple method is checksums. For instance suppose we transmit 3 bits b1 , b2 , b3
as b1 , b2 , b3 , b1 ⊕ b2 ⊕ b3 where the last bit is the parity of the first three. Then if one of the
bits gets flipped, the parity will be incorrect. However, if two bits get corrupted, the parity
becomes correct again! Thus this method can detect when a single bit has been corrupted.
It is useful in settings where errors are rare: if an error in the checksum is detected, the
entire information/packet can be retransmitted.
A cleverer checksum method used by some cloud services is to store three bits b1 , b2 , b3
as 7 bits on 7 servers: b1 , b2 , b3 , b1 ⊕ b2 , b1 ⊕ b3 , b2 ⊕ b3 , b1 ⊕ b2 ⊕ b3 . It is easily checked that:
if up to three servers fail, each bit is still recoverable, and in fact by querying at most 2
servers. A cleverer design of such data storage codes recently saved Microsoft 13% space on
its cloud servers.

Example 47 (Generalized Checksums) A trivial extension of the checksum idea is to


encode k bits using 2^k checksums: take the parity of all possible subsets. This works to
protect the data even if close to half the bits get flipped (though we won’t prove it; requires
some Fourier analysis).
Another form of checksums is to designate some random subsets of {1, 2, . . . , k}, say
S1 , S2 , . . . , Sm . Then encode any k bit vector using the m checksums corresponding to


these subsets. This works against Ω(m) errors but we don’t know of an efficient decoding
algorithm. (Decoding in exp(k) time is no problem.)

20.1 Shannon’s Theorem


Shannon considered the following problem: a message x ∈ {0,1}^n has to be sent over a
channel which flips every bit with probability p. How can we ensure that the message is
recovered correctly at the other end? A couple of years later Hamming introduced a related
notion whereby the channel flips up to p fraction of bits —and can adversarially decide
which subset of bits to flip. He was concerned that real channels exhibit burstiness: make a
lot of errors in one go and then no errors for long periods. By Chernoff bounds, Shannon’s
channel is a subcase (whp) of the Hamming channel since the chance of flipping more than
p + ε fraction of bits in total is exp(−Θ(n)). Both kinds of channels have been studied since
then and we will actually use Hamming’s notion today.
Shannon suggested that the message be encoded using a function E : {0,1}^n → {0,1}^m
and at the other end it should be decoded using a function D : {0,1}^m → {0,1}^n with the
property that D(E(x) ⊕ η) = x for any noise vector η ∈ {0,1}^m that is 1 in at most pm
indices and 0 in the rest. (Here ⊕ of two bit vectors denotes bitwise parity.)
Clearly, such a decoding is possible if for every two messages x, x′ their encodings differ
in more than 2pm bits: then E(x) ⊕ η_1 will not be confused with E(x′) ⊕ η_2 for any two noise
vectors η_1, η_2 that are only nonzero in pm bits. We say such a code has minimum distance
at least 2pm.
The famous entropy function appearing in the following theorem is graphed below. (The
notion of Entropy used in the 2nd law of thermodynamics is closely related.)

Figure 20.1: The graph of H(X) as a function of X.

Theorem 31
Such E, D do not exist if m < n/(1 − H(p)), and do exist for p ≤ 1/4 if m > n/(1 − H(2p)). Here
H(p) = p log₂(1/p) + (1 − p) log₂(1/(1 − p)) is the so-called entropy function.

Proof: We only prove existence; the method does not give efficient algorithms to en-
code/decode. For any string y ∈ {0,1}^m let Ball(y) denote the set of strings that differ
from y in at most 2pm indices. Its size is

    \binom{m}{0} + \binom{m}{1} + \cdots + \binom{m}{2pm},

which is at most 2^{H(2p)m} by Stirling's approximation.
Define the encoding function E using the following greedy procedure. Number the
strings in {0,1}^n from 1 to 2^n and one by one assign to each string x its encoding E(x) as
follows. The first string is assigned an arbitrary string in {0,1}^m. At step i the ith string is
assigned an arbitrary string that lies outside Ball(E(x)) for all x ≤ i − 1.
By design, such an encoding function satisfies that E(x) and E(x′) differ in more than
2pm coordinates. Thus we only need to show that the greedy procedure succeeds in assigning
an encoding to each string. To do this it suffices to note that if 2^m > 2^n · 2^{H(2p)m} then the
greedy procedure never runs out of strings to assign as encodings.
The nonexistence is proved in a similar way. Now for y ∈ {0,1}^m let Ball′(y) be the
set of strings that differ from y in at most pm indices. By a similar calculation as above,
this has cardinality about 2^{H(p)m}. If an encoding function exists, then Ball′(E(x)) and
Ball′(E(x′)) must be disjoint for all x ≠ x′ (since otherwise any string in the intersection
would not have an unambiguous encoding). Hence 2^n × 2^{H(p)m} < 2^m, which implies that
m > n/(1 − H(p)). 2

20.2 Finite fields and polynomials


Below we will design error correcting codes using polynomials over finite fields. Here finite
field will refer to Zq , the integers modulo a prime q. Recall that one can define +, ×, ÷ over
these numbers, and that x × y = 0 iff at least one of x, y is 0. A degree d polynomial p(x)
has the form
a_0 + a_1 x + a_2 x^2 + · · · + a_d x^d.
It can be seen as a function that maps x ∈ Zq to p(x).
Lemma 32 (Polynomial Interpolation)
For any set of n + 1 pairs (x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ) where the xi ’s are distinct elements
of Zq , there is a unique degree n polynomial g(x) satisfying g(xi ) = yi for each i.
Proof: Let a_0, a_1, . . . , a_n be the coefficients of the desired polynomial. Then the constraint
g(x_i) = y_i corresponds to the linear system shown in Figure 20.2.
This system has a unique solution iff the matrix on the left is invertible, i.e., has nonzero
determinant. This is nothing but the famous Vandermonde matrix, whose determinant is
∏_{i<j} (x_j − x_i). This is nonzero since the x_i's are distinct. Thus the system has
a solution. Actually the solution has a nice description via the Lagrange interpolation
formula:

    g(x) = Σ_{i=0}^{n} y_i · ∏_{j≠i} (x − x_j)/(x_i − x_j).

2
Corollary 33
If a degree d polynomial has more than d roots (i.e., points where it takes the value zero) then it is the zero
polynomial.

Figure 20.2: Linear system corresponding to polynomial interpolation; matrix on left side
is Vandermonde.
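A short sketch of Lagrange interpolation over Z_q (illustrative code, not from the notes; inverses are computed via Fermat's little theorem, and the prime 17 in the demo is arbitrary):

def interpolate(points, q):
    """Given n+1 pairs (x_i, y_i) with distinct x_i, return the coefficients
    a_0, ..., a_n of the unique degree-n polynomial g with g(x_i) = y_i (mod q)."""
    n = len(points) - 1
    coeffs = [0] * (n + 1)
    for i, (xi, yi) in enumerate(points):
        num, denom = [1], 1               # numerator polynomial prod_{j != i} (x - x_j)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            shifted = [0] + num           # multiply the current numerator by x
            num = [(shifted[k] - xj * (num[k] if k < len(num) else 0)) % q
                   for k in range(len(num) + 1)]
            denom = (denom * (xi - xj)) % q
        scale = (yi * pow(denom, q - 2, q)) % q        # Fermat inverse of the denominator
        for k in range(len(num)):
            coeffs[k] = (coeffs[k] + scale * num[k]) % q
    return coeffs

# Recover the polynomial 3 + 2x + 5x^2 over Z_17 from three of its values.
print(interpolate([(1, 10), (2, 10), (5, 2)], 17))     # [3, 2, 5]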

20.3 Reed Solomon codes and their decoding


The Reed Solomon code from 1960 is ubiquitous, having been used in a host of settings
including data transmission by NASA vehicles and the storage standard for music CDs. It
is simple and inspired by Lemma 32. The idea is to break up a message into chunks of
blog qc bits, where each chunk is interpreted as an element of the field Zq . If the message
has (d + 1)blog qc bits then it can be interpreted as coefficients of a degree d polynomial
p(x). The encoding consists of evaluating this polynomial at n points u1 , u2 , . . . , vn ∈ Zq
and defining the encoding to be p(u1 ), p(u2 ), . . . , p(un ).
Suppose the channel corrupts k of these values, where n − k ≥ d + 1. Let v1 , v2 , . . . , vn
denote the received values. If we knew which values are uncorrupted, the decoder could use
polynomial interpolation to recover p. Trouble is, the decoder has no idea which received
value has been corrupted. We show how to recover p if k < n−d 2 − 1.
Lemma 34
There exists a nonzero degree k polynomial e(x) and a polynomial c(x) of degree at most
d + k such that
    c(u_i) = e(u_i) v_i    for i = 1, 2, . . . , n.                (20.1)
Proof: Let I ⊆ {1, 2, . . . , n}, with |I| = k, be the subset of indices i such that v_i has
been corrupted. Then (20.1) is satisfied by e(x) = ∏_{i∈I} (x − u_i) and c(x) = e(x)p(x), since
e(u_i) = 0 for each i ∈ I and nonzero outside I. 2
The polynomial e in the previous proof is called the error locator polynomial. Now note
that if we let the coefficients of c, e be unknowns, then (20.1) is a system of n equations in
d+2k +2 unknowns. This system is overdetermined since the number of constraints exceeds
the number of variables. But Lemma 34 guarantees this system is feasible, and thus can be
solved in polynomial time by Gaussian elimination.
We will need the notion of a polynomial dividing another. For instance x^2 + 2 divides
x^3 + x^2 + 2x + 2 since x^3 + x^2 + 2x + 2 = (x^2 + 2)(x + 1). The algorithm to divide one
polynomial by another is the obvious analog of integer division.
Lemma 35
If n > d + 2k + 1 then any solution c(x), e(x) to the system of Lemma 34 satisfies (i) e(x)
divides c(x) as a polynomial (ii) c(x)/e(x) is p(x).

Proof: The polynomial c(x) − e(x)p(x) has a root at u_i whenever v_i is uncorrupted, since
then p(u_i) = v_i. So this polynomial, which has degree at most d + k, has at least n − k roots. Thus if
n − k > d + k + 1 this polynomial is identically 0. 2
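Here is a compact sketch of the whole decoder (my own illustration, not from the notes; the prime q = 97 and the toy message are arbitrary). It sets up the linear system of Lemma 34 with e forced to be monic of degree k, solves it by Gaussian elimination mod q, and divides c by e as in Lemma 35.

def solve_mod(A, b, q):
    """Gaussian elimination over Z_q (q prime): return one solution of A x = b.
    Assumes the system is consistent, which Lemma 34 guarantees here."""
    n, m = len(A), len(A[0])
    M = [row[:] + [b[i] % q] for i, row in enumerate(A)]
    pivot_row, r = [None] * m, 0
    for col in range(m):
        piv = next((i for i in range(r, n) if M[i][col] % q != 0), None)
        if piv is None:
            continue                              # free variable, left at 0
        M[r], M[piv] = M[piv], M[r]
        inv = pow(M[r][col], q - 2, q)            # Fermat inverse of the pivot
        M[r] = [(a * inv) % q for a in M[r]]
        for i in range(n):
            if i != r and M[i][col] % q != 0:
                f = M[i][col]
                M[i] = [(a - f * p) % q for a, p in zip(M[i], M[r])]
        pivot_row[col], r = r, r + 1
    return [M[pivot_row[col]][m] if pivot_row[col] is not None else 0
            for col in range(m)]

def divide_by_monic(c, e, q):
    """Quotient of c by the monic polynomial e over Z_q (exact by Lemma 35)."""
    c, k = c[:], len(e) - 1
    quot = [0] * (len(c) - k)
    for i in range(len(quot) - 1, -1, -1):
        quot[i] = c[i + k] % q
        for j in range(k + 1):
            c[i + j] = (c[i + j] - quot[i] * e[j]) % q
    return quot

def rs_decode(points, d, k, q):
    """points = list of (u_i, v_i), at most k of the v_i corrupted, len(points) > d+2k+1.
    Unknowns: coefficients c_0..c_{d+k} of c, and e_0..e_{k-1} of e (e forced monic)."""
    A, b = [], []
    for u, v in points:
        row = [pow(u, j, q) for j in range(d + k + 1)]           # c(u_i) terms
        row += [(-v * pow(u, j, q)) % q for j in range(k)]       # -v_i e(u_i) terms
        A.append(row)
        b.append((v * pow(u, k, q)) % q)                         # v_i u_i^k from the monic term
    sol = solve_mod(A, b, q)
    c, e = sol[:d + k + 1], sol[d + k + 1:] + [1]
    return divide_by_monic(c, e, q)

# Toy demo: degree-2 message polynomial over Z_97, eight evaluations, two corrupted.
q, d, k = 97, 2, 2
msg = [5, 3, 7]
pts = [(u, sum(a * pow(u, j, q) for j, a in enumerate(msg)) % q) for u in range(1, 9)]
pts[1] = (pts[1][0], (pts[1][1] + 11) % q)
pts[5] = (pts[5][0], (pts[5][1] + 40) % q)
print(rs_decode(pts, d, k, q))        # recovers [5, 3, 7]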

20.4 Code concatenation


Technically speaking, the Reed-Solomon code only works if the error rate of the channel is
less than 1/ log2 q, since otherwise the channel could corrupt one bit in every value of the
polynomial.
To allow error rate Ω(1) one uses code concatenation. This means that we encode each
value of p —which is a string of t = ⌈log₂ q⌉ bits—with another code that maps t bits to
O(t) bits and has minimum distance Ω(t). Wait a minute, you might say: if we had such a
code all along then why go to the trouble of defining the Reed-Solomon code?
The reason is that we do have such a code by Shannon’s construction (or by trivial
checksums; see Example 47): but since we are only applying it on strings of size t it can be
encoded and decoded in exp(t) time, which is only q. Thus if q is polynomial in the message
size, we still get encoding/decoding in polynomial time.
This technique is called code concatenation. One can also use any other error correcting
code instead of Shannon’s trivial code.
Chapter 21

Counting and Sampling Problems

Today’s topic of counting and sampling problems is motivated by computational problems


involving multivariate statistics and estimation, which arise in many fields. For instance, we
may have a probability density function φ(x) where x ∈ ℝ^n. Then we may want to compute
moments or other parameters of the distribution, e.g. ∫ x_3 φ(x) dx. Or, we may have a model
for how links develop faults in a network, and we seek to compute the probability that two
nodes i, j stay connected under this model. This is a complicated probability calculation.
In general, such problems can be intractable (eg, NP-hard). The simple-looking problem
of integrating a multivariate function is NP-hard in the worst case, even when we have
an explicit expression for the function f (x1 , x2 , . . . , xn ) that allows f to be computed in
polynomial (in n) time.
    ∫_{x_1=0}^{1} ∫_{x_2=0}^{1} · · · ∫_{x_n=0}^{1} f(x_1, x_2, . . . , x_n) dx_1 dx_2 . . . dx_n.

In fact even approximating such integrals can be NP-hard, as shown by Koutis (2003).
Valiant (1979) showed that the computational heart of such problems is combinatorial
counting problems. The goal in such problems is to compute the size of a set S where we
can test membership in S in polynomial time. The class of such problems is called #P.

Example 48 #SAT is the problem where, given a boolean formula ϕ, we have to compute
the number of satisfying assignments to ϕ. Clearly it is NP-hard since if we can solve it, we
can in particular solve the decision problem: decide if the number of satisfying assignments
is at least 1.
#CYCLE is the problem where, given a graph G = (V, E), we have to compute the
number of cycles in G. Here the decision problem (“is G acyclic?”) is easily solvable using
breadth first search. Nevertheless, the counting problem turns out to be NP-hard.
#SPANNINGTREE is the problem where, given a graph G = (V, E), we have to compute
the number of spanning trees in G. This is known to be solvable using a simple determinant
computation (Kirchhoff's matrix-tree theorem) since the 19th century.
Valiant's class #P captures most interesting counting problems. Many of these are NP-
hard, but not all. You can learn more about them in COS 522: Computational Complexity,
usually taught in the spring semester. 2


It is easy to see that the above integration problem can be reduced to a counting problem
with some loss of precision. First, recall that integration basically involves summation: we
appropriately discretize the space and then take the sum of the integrand values (assuming
in each cell of space the integrand doesn’t vary much). Thus the integration reduces to
some sum of the form

    Σ_{x_1∈[N], x_2∈[N], ..., x_n∈[N]}  g(x_1, x_2, . . . , x_n),

where [N] denotes the set of integers in 0, 1, . . . , N. Now assuming g(·) ≥ 0 this is easily
estimated using the sizes of the following sets:

    {(x, c) : x ∈ [N]^n ; c ≤ g(x) ≤ c + ε}.

Note if g is computable in polynomial time then we can test membership in this set in
polynomial time given (x, c, ε), so we've shown that integration is a #P problem.
We will also be interested in sampling a random element of a set S. In fact, this will
turn out to be intimately related to the problem of counting.

21.1 Counting vs Sampling


We say that an algorithm is an approximation scheme for a counting problem if for every
ε > 0 it can output an estimate of the size of the set that is correct within a multiplicative
factor (1 + ε). We say it is a randomized fully polynomial approximation scheme (FPRAS)
if it is randomized and it runs in poly(n, 1/ε, log 1/δ) time and has probability at least
(1 − δ) of outputting such an answer. We will assume δ < 1/poly(n) so we can ignore the
probability of outputting an incorrect answer.
A fully polynomial-time approximate sampler for S is one that runs in poly(n, 1/ε, log 1/δ)
time and outputs a sample u ∈ S such that Σ_{u∈S} | Pr[u is output] − 1/|S| | ≤ ε.

Theorem 36 (Jerrum, Valiant, Vazirani 1986)


For “nicely behaved” counting problems (the technical term is “downward self-reducible”)
sampling in the above sense is equivalent to counting (i.e., an algorithm for one task can be
converted into one for the other).

Proof: For concreteness, let's prove this for the problem of counting the number of satisfy-
ing assignments to a boolean formula. Let #ϕ denote the number of satisfying assignments
to formula ϕ.
Sampling ⇒ Approximate counting: Suppose we have an algorithm that is an ap-
proximate sampler for the set of satisfying assignments for any formula. For now assume
it is an exact sampler instead of approximate. Take m samples from it and let p0 be the
fraction that have a 0 in the first bit x_1, and p_1 be the fraction that have a 1. Assume
p_0 ≥ 1/2. Then the estimate of p_0 is correct up to a factor (1 + 1/√m) by Chernoff bounds.
But denoting by ϕ|_{x_1=0} the formula obtained from ϕ by fixing x_1 to 0, we have

    p_0 = #ϕ|_{x_1=0} / #ϕ.

Since we have a good estimate of p_0, to get a good estimate of #ϕ it suffices to have a good
estimate of #ϕ|_{x_1=0}. So produce the formula ϕ|_{x_1=0} obtained from ϕ by fixing x_1 to 0, then
use the same algorithm recursively on this smaller formula to estimate N_0, the value of
#ϕ|_{x_1=0}. Then output N_0/p_0 as your estimate of #ϕ. (Base case n = 1 can be solved exactly
of course.)
Thus if Err_n is the error in the estimate for formulae with n variables, this satisfies

    Err_n ≤ (1 + 1/√m) · Err_{n−1},

which solves to Err_n ≤ (1 + 1/√m)^n. By picking m ≫ n²/ε² this error can be made less
than 1 + ε. It is easily checked that if the sampler is not exact but only approximate, the
algorithm works essentially unchanged, except the sampling error also enters the expression
for the error in estimating p0 .
Approximate counting ⇒ Sampling: This involves reversing the above reasoning.
Given an approximate counting algorithm we are trying to generate a random satisfying
assignment. First use the counting algorithm to approximate #ϕ|_{x_1=0} and #ϕ and take the
ratio to get a good estimate of p0 , the fraction of assignments that have 0 in the first bit.
(If p0 is too small, then we have a good estimate of p1 = 1 − p0 .) Now toss a coin with
Pr[heads] = p0 . If it comes up heads, output 0 as the first bit of the assignment and then
recursively use the same algorithm on ϕ|x1 =0 to generate the remaining n − 1 bits. If it
comes up tails, output 1 as the first bit of the assignment and then recursively use the same
algorithm on ϕ|x1 =1 to generate the remaining n − 1 bits.
Note that the quality ε of the approximation suffers a bit in going between counting
and sampling. 2
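To see the counting-to-sampling direction in action on toy formulas, here is a sketch that uses a brute-force exact counter in place of the approximate counter (illustrative code only, exponential in n, meant just to mirror the reduction; the clause encoding is DIMACS-style and hypothetical):

import random

def count_sat(clauses, n, partial=None):
    """Count satisfying assignments extending the partial assignment, by brute force.
    Literals: +i means x_i, -i means not x_i.  Stands in for the approximate counter."""
    partial = partial or {}
    v = next((i for i in range(1, n + 1) if i not in partial), None)
    if v is None:
        return int(all(any(partial[abs(l)] == (l > 0) for l in c) for c in clauses))
    return sum(count_sat(clauses, n, {**partial, v: val}) for val in (False, True))

def sample_sat(clauses, n):
    """Counting => sampling: fix the variables one at a time with the right bias."""
    partial = {}
    for v in range(1, n + 1):
        c0 = count_sat(clauses, n, {**partial, v: False})
        c1 = count_sat(clauses, n, {**partial, v: True})
        if c0 + c1 == 0:
            raise ValueError("formula is unsatisfiable")
        partial[v] = random.random() < c1 / (c0 + c1)
    return partial

# (x1 or x2 or not x3) and (not x1 or x3): 5 satisfying assignments, sampled uniformly.
phi = [[1, 2, -3], [-1, 3]]
print(count_sat(phi, 3), sample_sat(phi, 3))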

21.1.1 Monte Carlo method


The classical method to do counting via sampling is the Monte Carlo method. A simple
example is the ancient method to estimate the area of a circle of unit radius. Draw the
circle in a square of side 2. Now throw darts at the square and measure the fraction that
fall in the circle. Multiply that fraction by 4 to get the area of the circle.

Figure 21.1: Monte Carlo (dart throwing) method to estimate the area of a circle. The
fraction of darts that fall inside the disk is π/4.

Now replace “circle” with any set S and “square” with any set Ω that contains S and can
be sampled in polynomial time. Then just take many samples from Ω and observe the
fraction that are in S; multiplying this fraction by |Ω| gives an estimate for |S|. The problem with this method is that
usually the obvious Ω is much bigger than S, and we need |Ω| / |S| samples to get any that
lie in S. (For instance the obvious Ω for computing #ϕ is the set of all possible assignments,
which may be exponentially bigger.)
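The dart-throwing estimate in one short illustrative sketch:

import random

def circle_area_estimate(samples=100000):
    """Sample the bounding square [-1,1]^2 (area 4) and multiply the hit
    fraction by 4 to estimate the area of the unit disk."""
    hits = sum(1 for _ in range(samples)
               if random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 <= 1)
    return 4.0 * hits / samples

print(circle_area_estimate())   # close to pi = 3.14159...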

21.2 Dyer’s algorithm for counting solutions to KNAPSACK


The Knapsack problem models the problem faced by a kid who is given a knapsack and
told to buy any number of toys that fit in the knapsack. The problem is that not all toys
give him the same happiness, so he has to trade off the happiness received from each toy
with its size; toys with high happiness/size ratio are preferred. Turns out this problem is
NP-hard if the numbers are given in binary. We are interested in a counting version of the
problem that uses just the sizes.
Definition 10 Given n weights w_1, w_2, . . . , w_n and a target weight W, a feasible solution
to the knapsack problem is a subset T such that Σ_{i∈T} w_i ≤ W.
We wish to approximately count the number of feasible solutions. This had been the subject
of some very technical papers, until M. Dyer gave a very elementary solution in 2003.
First, we note that the counting problem can be solved exactly in O(nW ) time, though
of course this is not polynomial since W is given to us in binary, i.e. using log W bits.
The idea is dynamic programming. Let Count(i, U ) denote the number of feasible solutions
involving only the first i numbers, and whose total weight is at most U . The dynamic
programming follows by observing that there are two types of solutions: those that involve
the ith element, and those that don't. Thus

    Count(i, U) = Count(i − 1, U) + Count(i − 1, U − w_i)    if w_i ≤ U
    Count(i, U) = Count(i − 1, U)                            if w_i > U
    Count(0, U) = 1    (only the empty solution)
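The dynamic program in a few lines of Python (a sketch; the instance at the end is made up):

from functools import lru_cache

def count_knapsack(weights, W):
    """Exact O(nW) dynamic program for the number of feasible knapsack solutions
    (subsets of total weight at most W), following the recurrence above."""
    n = len(weights)

    @lru_cache(maxsize=None)
    def count(i, U):
        if i == 0:
            return 1                        # only the empty set
        w = weights[i - 1]
        total = count(i - 1, U)             # solutions avoiding item i
        if w <= U:
            total += count(i - 1, U - w)    # solutions containing item i
        return total

    return count(n, W)

print(count_knapsack([3, 5, 8, 9], 12))     # 8 subsets of {3,5,8,9} have weight <= 12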
Denoting by S the set of feasible solutions, |S| = Count(n, W ). But as observed,
computing this exactly is computationally expensive and not polynomial-time. Dyer’s next
idea is to find a set Ω containing S but at most n times bigger. This set Ω can be exactly
counted as well as sampled from. So then by the Monte Carlo method we can estimate the
size of S in polynomial time by drawing samples from Ω.
Ω is simply the set of solutions to a Knapsack instance in which the weights have been
rounded to lie in [0, n²]. Specifically, let w_i′ = ⌊w_i n²/W⌋ and W′ = n². Then Ω is the set of
feasible solutions to this modified knapsack problem.
Claim 1: S ⊆ Ω. (Consequently, |S| ≤ |Ω|.)
This follows since if T ∈ S is a feasible solution for the original problem, then Σ_{i∈T} w_i′ ≤
Σ_{i∈T} w_i n²/W ≤ n², and so T is a feasible solution for the rounded problem.
Claim 2: |Ω| ≤ n |S|.
To prove this we give a mapping g from Ω to S that is at most n-to-1:

    g(T′) = T′           if T′ ∈ S
    g(T′) = T′ \ {j}     otherwise, where j is the index of the element in T′ with the highest value of w_j′

In the second case note that this element j satisfies w_j > W/n, which implies w_j′ ≥ n.
Clearly, g is at most n-to-1 since a set T in S can have at most n pre-images under g.
Now let's verify that T = g(T′) lies in S:

    Σ_{i∈T} w_i  ≤  (W/n²) Σ_{i∈T} (w_i′ + 1)
                 ≤  (W/n²) × (W′ − w_j′ + n − 1)
                 ≤  W          (since W′ = n² and w_j′ ≥ n)

which implies T ∈ S. 2

Sampling algorithm for Ω   To sample from Ω, use our earlier equivalence of approximate
counting and sampling. That algorithm needs an approximate count not only for |Ω| but
also for the subset of Ω consisting of solutions that contain the first element. This is another
knapsack problem and can thus be solved by Dyer's dynamic programming. And the same
is true for instances obtained in the recursion.

Bibliography

1. On the Hardness of Approximate Multivariate Integration. I. Koutis. Proc. APPROX-
RANDOM 2003. Springer Verlag.

2. The complexity of enumeration and reliability problems. L. Valiant. SIAM J. Com-


puting, 8:3 (1979), pp.410-421.
Chapter 22

Taste of cryptography: Secret sharing and secure multiparty computation

Cryptography is the ancient art/science of sending messages so they cannot be deciphered


by somebody who intercepts them. This field was radically transformed in the 1970s using
ideas from computational complexity. Encryption schemes were designed whose decryption
by an eavesdropper requires solving computational problems (such as integer factoring)
that’re believed to be intractable. You may have seen the famous RSA cryptosystem at
some point. It is a system for giving everybody a pair of keys (currently each is a 1024-
bit integer) called a public key and a private key. The public key is published on a public
website; the private key is known only to its owner. Person x can look up person y’s public-
key and encrypt a message using it. Only y has the private key necessary to decode it;
everybody else will gain no information from seeing the encrypted message.
Since the 1980s though, the purview of cryptography greatly expanded. In inventions
that anticipated threats that wouldn’t materialize for another couple of decades, cryptogra-
phers designed solutions such as private multiparty computation, proofs that yield nothing
but their validity, digital signatures, digital cash, etc. Today’s lecture is about one such
invention due to Ben-or, Goldwasser and Wigderson (1988), secure multiparty computation,
which builds upon the Reed Solomon codes studied last time.
The model is the following. There are n players, each holding a private number (say,
their salary, or their vote in an election). The ith player holds si . They wish to compute
a joint function of their inputs f (s1 , s2 , . . . , sn ) such that nobody learns anything about
anybody else’s secret input (except of course what can be inferred from the value of f ).
The function f is known to everybody in advance (e.g., s21 + s22 + · · · + s2n ).
Admittedly, this sounds impossible when you first hear it.

22.1 Shamir’s secret sharing


We first consider a static version of the problem that introduces some of the ideas.


Say we want to distribute a secret, say a number a_0, among n people. (For example, a_0 could be the secret
key to decrypt an important message.) We want the following property: every subset of
t + 1 people should be able to pool their information and recover the secret, but no subset
of t people should be able to pool their information to recover any information at all
about the secret.
For simplicity interpret a_0 as a number in a finite field Z_q. Then pick t random numbers
a_1, a_2, . . . , a_t in Z_q, construct the polynomial p(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_t x^t,
and evaluate it at n distinct nonzero points x_1, x_2, . . . , x_n that are known to all of them. Then give p(x_i) to
person i.
Notice, the set of shares are t-wise independent random variables. (Each subset of t
shares is distributed like a random t-tuple over Zq .) This follows from polynomial interpo-
lation (which we explained last time using the Vandermonde determinant): for every t-tuple
of people and every t-tuple of values y1 , y2 , . . . , yt ∈ Zq , there is a unique polynomial whose
constant term is a0 and which takes these values for those people. Thus every t-tuple of
values is equally likely, irrespective of a0 , and gives no information about a0 .
Furthermore, since p has degree t, each subset of t + 1 shares can be used to reconstruct
p(x) and hence also the secret a0 .
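A sketch of Shamir's scheme over Z_q (illustrative code, not from the notes; the prime and the secret in the demo are arbitrary):

import random

def share(secret, t, n, q):
    """Shamir (t, n) secret sharing over Z_q: any t+1 shares determine the secret,
    any t shares are jointly uniform and reveal nothing."""
    coeffs = [secret] + [random.randrange(q) for _ in range(t)]
    poly = lambda x: sum(c * pow(x, k, q) for k, c in enumerate(coeffs)) % q
    return [(i, poly(i)) for i in range(1, n + 1)]      # distinct nonzero evaluation points

def reconstruct(shares, q):
    """Recover p(0) = the secret from any t+1 shares by Lagrange interpolation at 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if j != i:
                num = (num * (-xj)) % q
                den = (den * (xi - xj)) % q
        secret = (secret + yi * num * pow(den, q - 2, q)) % q
    return secret

q = 2**31 - 1                        # a prime, large enough for the demo
shares = share(secret=123456, t=3, n=6, q=q)
print(reconstruct(shares[:4], q))    # any 4 = t+1 of the 6 shares recover 123456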

22.2 Multiparty computation: the model


Multiparty computation vastly generalizes Shamir’s idea, allowing the players to do arbi-
trary algebraic computation on the secret input using their “shares.”
Player i holds secret si and the goal is for everybody to know f (s1 , s2 , . . . , sn ) at the
end, where f is a publicly known function (everybody has the code). No subset of t players
can pool their information to get any information about anybody else’s input that is not
implicit in the output f (s1 , s2 , . . . , sn ). (Note that if f () just outputs its first coordinate,
then there is no way for the first player’s secret s1 to not become public at the end.)
We are given a secret channel between each pair of players, which cannot be eavesdropped
upon by anybody else. Such a secret channel can be ensured using, for example, a public-
key infrastructure. If everybody’s public keys are published, player i can look up player j’s
public-key and encrypt a message using it. Only player j has the private key necessary to
decode it; everybody else will gain no information from seeing the encrypted message.
The result only applies to algebraic computations.

Definition 11 (Algebraic programs) A size m algebraic straight line program with


inputs x1 , x2 , . . . , xn ∈ Zq is a sequence of m lines of the form

yi ← yi1 op yi2 ,

where i_1, i_2 < i; op is “+”, “×”, or “−”; and y_i = x_i for i = 1, 2, . . . , n. The output of this


straight line program is defined to be ym .

A simple induction shows that a straight line program with inputs x1 , x2 , . . . xn computes
a multivariate polynomial in these variables. The degree can be rather high, about 2^m. So
this is a powerful model.

(Aside: Straight line programs are sometimes called algebraic circuits. If you replace
the arithmetic operations with boolean operations ∨, ¬, ∧ you get a model that can do any
computation at all.)

22.3 Easy protocol: linear combinations of inputs


First we describe a simple protocol that allows the players to compute f(s_1, s_2, . . . , s_n) =
Σ_i c_i s_i for any coefficients c_1, c_2, . . . , c_n ∈ Z_q known to all of them.
Let α1 , α2 , . . . , αn be n distinct nonzero values in Zq known to all.
Each player does a version of Shamir’s secret sharing. Player i picks t random numbers
a_i1, a_i2, . . . , a_it ∈ Z_q and evaluates the polynomial p_i(x) = s_i + a_i1 x + a_i2 x^2 + · · · + a_it x^t at
α1 , α2 , . . . , αn , and sends those values to the respective n players (keeping the value at αi
for himself) using the secret channels. Let γij be the secret sent by player i to player j.
After all these shares have been sent around, the players get down to computing f, i.e.,
Σ_i c_i s_i. This is easy. Player k computes Σ_i c_i γ_ik. In other words, he treats the shares he
received from the others as proxies for their inputs.
Observation: The numbers computed by the kth player correspond to the value of the following
polynomial at x = α_k:

    Σ_i c_i s_i + Σ_r ( Σ_i c_i a_ir ) x^r.
Thus the first t + 1 players can now send their computed numbers to everybody else. Then
everybody has at least t + 1 values of this polynomial, allowing them to reconstruct it and thus also
reconstruct the constant term, which is the desired output.

22.4 General protocol: + and × suffice


The above protocol for + seems rather trivial. But our definition of Algebraic programs
shows that if we can design a protocol that allows multiplying of secret values, then that is
good enough to implement any algebraic computation. Let the variables in the algebraic
program be y1 , y2 , . . . , ym .

Definition 12 ((t, n)-secretsharing) If a_0 ∈ Z_q then its (t, n)-secretsharing is a se-
quence of n numbers β_1, β_2, . . . , β_n obtained as in Section 22.1 by using a polynomial of the
form a_0 + Σ_{i=1}^{t} a_i x^i, where a_1, a_2, . . . , a_t are random numbers in Z_q.

The general invariant maintained by the protocol is the following: At the end of step i, the
n players hold the n values in some (t, n)-secretsharing of the value of yi .
Clearly, at the start of the protocol such a secretsharing for the values of the n input
variables x1 , x2 , . . . , xn has been divided among the players. So the invariant is true for
i ≤ n. Assuming it is true for i we show how to maintain it for i + 1. If yi+1 is the +
of two earlier variables, then the simple protocol of Section 22.3 allows the invariant to be
maintained.
So assume y_{i+1} is the × of two earlier variables. If these two earlier variables were
secretshared using polynomials g(x) = Σ_{r=0}^{t} g_r x^r and h(x) = Σ_{r=0}^{t} h_r x^r, then the values being
secretshared are g_0, h_0 and the obvious polynomial to secretshare their product is

    π(x) = g(x)h(x) = Σ_{r=0}^{2t} x^r Σ_{j≤r} g_j h_{r−j}.

The constant term in this polynomial is g_0 h_0, which is
indeed the desired product. Secretsharing this polynomial means everybody takes their
share of g and h respectively and multiplies them. Nothing more to do.
Unfortunately, this polynomial π has two problems: the degree is 2t instead of t and,
more seriously, its coefficients are not random numbers in Zq . Thus it is not a (t, n)-
secretsharing of g0 h0 .
The degree problem is easy to solve: just drop the higher degree terms and stay with
the first t terms. Dropping terms is a linear operation and can be done using the simple
protocol of Section 22.3. We won’t go into details.
To solve the problem about the coefficients not being random numbers, each of the
players does the following. The kth player picks a random degree 2t polynomial rk (x) whose
constant term is 0. Then he secret shares this polynomial among all the other players. Now
the players can compute their secretshares of the polynomial
    π(x) + Σ_{k=1}^{n} r_k(x),

and the constant term in this polynomial is still g_0 h_0. Then they apply the truncation step to
this polynomial to drop the higher order terms. Thus at the end the players have a (t, n)-
secretsharing of the value yi+1 , thus maintaining the invariant.

Subtleties   The above description assumes that all players follow the protocol.
In general the t malicious players may not follow the protocol in an attempt to learn things
they otherwise can’t. Modifying the protocol to handle this —and proving it works—is
more nontrivial.
Bibliography

1. M. BenOr, S. Goldwasser, and A. Wigderson. Completeness Theorems for Non-


Cryptographic Fault Tolerant Distributed Computation Proceedings of the 20th An-
nual ACM Symposium on Theory of Computing (STOC’88), Chicago, Illinois, pages
1-10, May 1988.

2. A. Shamir. How to share a secret. Communications of the ACM 22 (11): 612–613,
1979.
Chapter 23

Real-life environments for big-data computations (MapReduce etc.)

First 2/3rd based upon the guest lecture of Kai Li, and the other 1/3rd upon Sanjeev’s
lecture

23.1 Parallel Processing


These days many algorithms need to be run on huge inputs; gigabytes still can fit in RAM
on a single computer, but terabytes or more invariably require a multiprocessor architecture.
This state of affairs seems here to stay since (a) Moore’s law has slowed a lot, and there is
no solution in sight. So processing speeds and RAM sizes are no longer growing as fast as
in the past. (b) Data sets are growing very fast.
Multiprocessors (aka parallel computers) involve multiple CPUs operating on some form
of distributed memory. This often means that processors may compete in writing to the
same memory location, and the system design has to take this into account. Parallel
systems have been around for many decades, and continue to evolve with changing needs
and technology.
There are three major types of systems based upon their design:
Shared memory multiprocessor. Multiple processors operate on the same memory;
there are explicit mechanisms for handling conflicting updates. The underlying mem-
ory architecture may be complicated but the abstraction presented to the programmer
is that of a single memory. The programming abstraction often involves threads that
use synchronization primitives to handle conflicting updates to a memory location.
Pros: The programming is relatively easy; data structures are familiar. Cons: Hard
to scale to very large sizes. In particular, cannot handle hardware failure (all hell can
break loose otherwise).
Message passing models: Different processors control their own memory; data move-
ment is via message passing (this provides implicit synchronization). Pros: Such
systems are easier to scale. Can use checkpoint/recovery to deal with node failures.
Cons: No clean data structures.


Commodity clusters: Large number of off-the-shelf computers (with their own memories)
linked together with a LAN. There is no shared memory or storage. Pros: Easy to
scale; can easily handle the petabyte-size or larger data sets. Cons: Programming
model has to deal explicitly with failures.

Tech companies and data centers have gravitated towards commodity clusters with tens
of thousands or more processors. The power consumption may approach that of a small
town. In such massive systems failures —processor, power supplies, hard drives etc.—are
inevitable. The software must be designed to provide reliability on top of such frequent
failures. Some techniques: (a) replicate data on multiple disks/machines (b) replicate com-
putation by splitting into smaller subtasks (c) use good data placement to avoid long latency.
Google pioneered many such systems for their data centers and released some of these
for general use. MapReduce is a notable example. The open source community then came
up with its own versions of such systems, such as Hadoop. SPARK is another programming
environment developed for ML applications.

23.2 MapReduce
MapReduce is Google's programming interface for commodity computing. It evolved from
older ideas in functional programming and databases. It is easy to pick up, but achieving
high performance requires mastery of the system.
It abstracts away issues of data replication, processor failure/retry etc. from the pro-
grammer. One consequence is that there is no guarantee on running time.
The programming abstraction is rather simple: the data resides in an unsorted clump
of (key, value) pairs. We call this a database to ease exposition. (The programmer has to
write a mapper function that produces this database from the data.) Starting with such
a database, the system applies a sort that moves all pairs with the same key to the same
physical location. Then it applies a reduce operation –provided by the programmer—that
takes a bunch of pairs with the same key and applies some combiner function to produce
a new single pair with that key and whose value is some specified combination of the old
values.

Example 49 (Word Count) The analog to the usual Hello World program in the MapRe-
duce world is the program to count the number of repetitions of each word. The programmer
provides the following.
mapper Input: a text corpus. Output: for each word w, produce the pair (w, 1). This
gives a database.
reduce: Given a bunch of pairs of type (w, count) produces a pair of type (w, C) where
C is the sum of all the counts.
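To see the abstraction concretely, here is a toy single-machine simulation of map, shuffle and reduce for word count (this is only my illustration of the programming model, not how an actual MapReduce cluster is driven):

from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Tiny local simulation of MapReduce: map every record to (key, value) pairs,
    shuffle the pairs by key, then reduce each group."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return {key: reducer(key, values) for key, values in shuffled.items()}

# Word count, as in the example above.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return sum(counts)

corpus = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_mapreduce(corpus, mapper, reducer))   # {'the': 3, 'fox': 2, ...}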

Example 50 (Matrix Vector Multiplication)


Mapper: Input is an n × n matrix M, and an n × 1 vector V. Output: pairs (i, m_ij · v_j)
for all (i, j) for which m_ij ≠ 0.
Reducer: Again, just adds up all pairs with the same key and sum up their values.

One can similarly do other linear algebra operations.



Some other examples of mapreduce programs appear in Jelani Nelson’s notes


http://people.seas.harvard.edu/~minilek/cs229r/lec/lec24.pdf
The MapReduce paradigm was introduced in the following paper:
MapReduce: Simplified Data Processing on Large Clusters by Dean and Ghemawat.
OSDI 2004.
While it has been very influential, it is not suited for all applications. A critical appraisal
appears in a blog post MapReduce: a major step backwards, by DeWitt and Stonebraker.
Chapter 24

Heuristics: Algorithms we don't know how to analyze

Any smart teenager who knows how to program can come up with a new algorithm.
Analysing algorithms, by contrast, is not easy and usually beyond the teenager’s skillset. In
fact, if the algorithm is complicated enough, proving things about it (i.e., whether or not it
works) becomes very difficult for even the best experts. Thus not all algorithms that have
been designed have been analyzed. The algorithms we study today are called heuristics:
for most of them we know that they do not work on worst-case instances, but there is good
evidence that they work very well on many instances of practical interest. Explaining this
discrepancy theoretically is an interesting and challenging open problem.
Though the heuristics apply to many problems, for pedagogical reasons, throughout the
lecture we use the same problem as an example: 3SAT. Recall that the input to this problem
consists of clauses which are ∨ (i.e., logical OR) of three literals, where a literal is one of
n variables x1 , x2 , . . . , xn , or its negation. For example: (x1 ∨ ¬x4 ∨ x7 ) ∧ (x2 ∨ x3 ∨ ¬x4 ).
The goal is to find an assignment to the variables that makes all clauses evaluate to true.
This is the canonical NP-complete problem: every other NP problem can be reduced
to 3SAT (Cook-Levin Theorem, early 1970s). More importantly, problems in a host of
areas are actually solved this way: convert the instance to an instance of 3SAT, and use
an algorithm for 3SAT. In AI this is done for problems such as constraint satisfaction and
motion planning. In hardware and software verification, the job of verifying some property
of a piece of code or circuit is also reduced to 3SAT.
Let's get the simplest algorithm for 3SAT out of the way: try all assignments. This
has the disadvantage that it takes 2^n time on instances that have few (or no) satisfying
assignments. But there are more clever algorithms, which run very fast and often solve
3SAT instances arising in practice, even on hundreds of thousand variables. The codes for
these are publicly available, and whenever faced with a difficult problem you should try to
represent it as 3SAT and use these solvers.


24.1 Davis-Putnam procedure


The Davis-Putnam procedure from the 1950s is very simple. It involves assigning values to
variables one by one, and simplifying the formula at each step. For instance, if it contains
a clause x3 ∨ x5 and we have just assigned x5 to T (i.e., true) then the clause becomes
true and can be removed. Conversely, if we assign x5 the value F, then the only way the remaining
variables can satisfy the formula is if x3 = T . Thus x5 = F forces x3 = T . We call these
effects the simplification of the formula.
Say the input is ϕ. Pick a variable, say xi . Substitute xi = T in ϕ and simplify it.
Recursively check the simplified formula for satisfiability. If it turns out to be unsatisfiable,
then substitute xi = F in ϕ, simplify it, and recursively check that formula for satisfiability.
If that also turns out unsatisfiable, then declare ϕ unsatisfiable.
When implementing this algorithm schema one has various choices. For instance, which
variable to pick? Random, or one which appears in the most clauses, etc. Similarly, whether
to try the value T first or F ? What data structure to use to keep track of the variables
and clauses? Many such variants have been studied and surprisingly, they do very well in
practice. Hardware and software verification today relies upon the ability to solve instances
with hundreds of thousands of variables.
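To make the schema concrete, here is a minimal Python sketch of the basic recursion (no clause learning, no unit propagation, no clever data structures; the representation of a CNF formula as a list of sets of nonzero integer literals is my own convention for illustration).

```python
def simplify(clauses, lit):
    """Set literal `lit` to true: drop satisfied clauses and shrink the rest.
    Returns None if some clause becomes empty (i.e., a contradiction)."""
    new_clauses = []
    for clause in clauses:
        if lit in clause:
            continue                       # clause is satisfied, remove it
        reduced = clause - {-lit}          # the negated literal can no longer help
        if not reduced:
            return None                    # empty clause: this branch is dead
        new_clauses.append(reduced)
    return new_clauses

def dpll(clauses):
    """Return True iff the CNF formula (a list of sets of literals) is satisfiable."""
    if not clauses:
        return True                        # no clauses left: all satisfied
    var = abs(next(iter(clauses[0])))      # branching choice: a variable in the first clause
    for literal in (var, -var):            # try x = T first, then x = F
        simplified = simplify(clauses, literal)
        if simplified is not None and dpll(simplified):
            return True
    return False

# (x1 or not x4 or x7) and (x2 or x3 or not x4)
print(dpll([{1, -4, 7}, {2, 3, -4}]))      # True
```

The choice points mentioned above (which variable to branch on, which value to try first) sit in the two lines marked "branching choice" and "try x = T first"; real solvers spend most of their engineering effort there.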
Clause learning. The most successful variants of this algorithm involve learning from
experience. Suppose the formula had clauses (x1 ∨ x7 ∨ x9 ) and (x1 ∨ ¬x9 ∨ ¬x6 ) and along
some branch the algorithm tried x1 = F, x7 = F, x6 = T , which led to a contradiction since
x9 is being forced to both T and F . Then the algorithm has learnt that this combination is
forbidden, not only at this point but on every other branch it will explore in future. This
knowledge can be added in the form of a new clause x1 ∨ x7 ∨ ¬x6 , since every satisfying
assignment has to satisfy it. As can be imagined, clause learning comes in myriad variants,
depending upon what rule is used to infer and add new clauses.
One of you asked why adding clauses (i.e., more constraints) simplifies the problem instead of making it harder. The answer is that the learned clauses can be seen as guidance towards a satisfying assignment (if one exists). The clauses can be used in making the crucial decision in DPLL procedures about which variable to set, and how to set it (T or F). A wrong decision can incur a huge cost, so anything that lowers the probability of a wrong decision by even a bit could drastically change the running time.

24.2 Local search


The above procedures set variables one by one. A different family of algorithms instead maintains a complete assignment and modifies it step by step. A typical example is Papadimitriou's Walksat algorithm: Start with a random assignment. At each step, pick a random variable and switch its value. If this increases the number of satisfied clauses, make this the new assignment. Continue this way until the number of satisfied clauses cannot be increased. Papadimitriou showed that this algorithm solves 2SAT with high probability.
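A minimal Python sketch of this kind of local search (the cap on the number of steps and the flat clause representation, matching the DPLL sketch above, are my own choices; the acceptance rule follows the description above: keep a flip only if it increases the number of satisfied clauses).

```python
import random

def num_satisfied(clauses, assignment):
    """Count clauses with at least one true literal; assignment maps variable -> bool."""
    return sum(any(assignment[abs(l)] == (l > 0) for l in clause) for clause in clauses)

def local_search_sat(clauses, n_vars, max_steps=10000):
    """Start from a random assignment; repeatedly flip a random variable,
    keeping the flip only if the number of satisfied clauses goes up."""
    assignment = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
    best = num_satisfied(clauses, assignment)
    for _ in range(max_steps):
        if best == len(clauses):
            return assignment                  # all clauses satisfied
        v = random.randint(1, n_vars)
        assignment[v] = not assignment[v]      # tentative flip
        new = num_satisfied(clauses, assignment)
        if new > best:
            best = new                         # keep the improving flip
        else:
            assignment[v] = not assignment[v]  # undo it
    return assignment                          # may be stuck at a local optimum

print(local_search_sat([{1, -4, 7}, {2, 3, -4}], n_vars=7))
```

Practical Walksat-style solvers also sometimes accept non-improving flips (for example, flipping a variable chosen from a random unsatisfied clause), which helps them escape local optima.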
Such algorithms fit into a more general paradigm called local search, which can be described as
follows.

Figure 24.1: Local search algorithms try to improve the solution by looking for small changes
that improve it.

Maintain a solution at each step. If the current solution is x, look for a solution y in a neighborhood Ball(x, r) of radius r around x (that is, all solutions that differ from x by at most some small amount r). If you find such a y that improves over x (in terms of the objective being optimized) then replace x by y. Stop if no such y was found.
Clearly, when the algorithm stops, the current solution is optimal in its neighborhood (i.e., locally optimal). One can think of this as a discrete analog of gradient descent. An example of a nonlocal method is any of the global optimization algorithms like the Ellipsoid method.
Thus local search is a formalization of improvement strategies that we come up with intuitively, e.g., improving ourselves by making small incremental changes. The Japanese have a name for it: kaizen.¹
Example 51 Local search is a popular and effective heuristic for many other problems including traveling salesman and graph partitioning. For instance, one local search strategy (which even students in my freshman seminar were able to quickly invent) is to start with a tour, and at each step try to improve it by changing up to two edges (2-OPT) or k edges (k-OPT). We can find the best local improvement in polynomial time (there are only n-choose-2 ways to choose 2 edges in a tour) but the number of local improvement steps may be exponential in n. So the overall running time may be exponential.
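Here is a sketch of a single 2-OPT improvement step in Python (the tour is a list of city indices and dist a symmetric distance matrix; these conventions are mine, for illustration only).

```python
def two_opt_step(tour, dist):
    """Try all ways of reversing a contiguous segment of the tour (a 2-OPT move)
    and return the first improving tour found, or None if none exists."""
    n = len(tour)
    def length(t):
        return sum(dist[t[i]][t[(i + 1) % n]] for i in range(n))
    current = length(tour)
    for i in range(n - 1):
        for j in range(i + 2, n):
            # Remove edges (tour[i], tour[i+1]) and (tour[j], tour[j+1]),
            # reconnect by reversing the segment in between.
            candidate = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
            if length(candidate) < current:
                return candidate
    return None

# Repeat improving moves until a local optimum is reached (this may take many steps):
#     while (better := two_opt_step(tour, dist)) is not None:
#         tour = better
```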
These procedures often do well in practice, though theoretical results are few and far
between. One definitive study is
The traveling salesman problem: A case study in local optimization, by D. Johnson and
C. McGeoch. 1997

Example 52 Evolution à la Darwin can be seen as a local search procedure. Mutations occur spontaneously and can be seen as exploring a small neighborhood of the organism's genome. The environment gives feedback on the quality of mutations. If the mutation is good, the descendants thrive and the mutation becomes more common in the gene pool. (Thus the mutated genome becomes the new solution y in the local search.) If the mutation is harmful the descendants die out and the mutation is thus removed from the gene pool.
¹ I would like to know if Japanese magazines have cover stories on new kaizen ideas just as cover stories in US magazines promote radical makeovers.

24.3 Difficult instances of 3SAT


We do know of instances of 3SAT that are hard for such heuristics. A simple family of examples uses the fact that there are small logical circuits (i.e., acyclic digraphs using nodes labeled with the gates ∨, ∧, ¬) for integer multiplication. The circuit for multiplying two n-bit numbers has size about O(n log² n). So take a circuit C that multiplies two 1000-bit numbers. Feed two random prime numbers p, q into it and evaluate it to get a result r. Now construct a boolean formula with 2n + O(|C|) variables corresponding to the input bits and the internal gates of C, and whose clauses capture the computation of each gate that results in the output r. (Note that the bits of r are "hardcoded" into the formula, but the bits of p, q as well as the values of all the internal gates correspond to variables.) Thus finding a satisfying assignment for this formula would also give the factors of r. (Recall that factoring a product of two random primes is the hard problem underlying public-key cryptosystems.) The above SAT solvers have difficulty with such instances.
Other families of difficult formulae correspond to simple math theorems. A simple one is: every partial order on a finite set has a maximal element. A partial order on n elements is a relation ≺ satisfying: (a) xi ⊀ xi for all i; (b) xi ≺ xj and xj ≺ xk implies xi ≺ xk (transitivity); (c) xi ≺ xj implies xj ⊀ xi (anti-symmetry).
For example, the relationship "is a divisor of" is a partial order among integers. We can represent a partial order by a directed acyclic graph.

Figure 24.2: The relation “is a divisor of”is a partial order among integers.

Clearly, for every partial order on a finite set, there is a maximal element i such that
i ⊀ j for all j (namely, any leaf of the directed acyclic graph.) This simple mathematical
statement can be represented as an unsatisfiable formula. However, the above heuristics
seem to have difficulty detecting that it is unsatisfiable.
This formula has a variable xij for every ordered pair of elements i, j (intuitively, xij stands for i ≺ j). There is a family of clauses representing the properties of a partial order.

¬xii ∀i
¬xij ∨ ¬xjk ∨ xik ∀i, j, k
¬xij ∨ ¬xji ∀i, j

Finally, there is a family of clauses saying that no i is a maximal element:

xi1 ∨ xi2 ∨ · · · ∨ xin    ∀i

These clauses don't have size 3 but can be rewritten as clauses of size 3 using new variables.
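These clauses are easy to generate mechanically. The following Python sketch writes them out for a given n (variables x_{ij} are encoded as integers; for simplicity the long "i is not maximal" clauses are emitted as-is rather than broken into 3-clauses with new variables).

```python
def partial_order_formula(n):
    """Clauses asserting that ≺ is a partial order on {0,...,n-1} with no
    maximal element; their conjunction is unsatisfiable."""
    def var(i, j):                  # integer code for the variable x_{ij} ("i ≺ j")
        return i * n + j + 1
    clauses = []
    for i in range(n):
        clauses.append([-var(i, i)])                                 # not x_ii
    for i in range(n):
        for j in range(n):
            for k in range(n):
                clauses.append([-var(i, j), -var(j, k), var(i, k)])  # transitivity
            clauses.append([-var(i, j), -var(j, i)])                 # anti-symmetry
    for i in range(n):
        clauses.append([var(i, j) for j in range(n)])                # i is not maximal
    return clauses
```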

24.4 Random SAT


One popular test-bed for 3SAT algorithms is random instances. A random formula with m clauses is picked by choosing each clause independently as follows: pick three variables at random, and then toss a coin for each to decide whether it appears negated or unnegated. It turns out that if m < 3.9n or so, then Davis-Putnam type procedures usually find a satisfying assignment. If m > 4.3n these procedures usually fail. There is a different algorithm called survey propagation that finds satisfying assignments for m up to roughly 4.3n. It is conjectured that there is a phase transition around m = 4.3n whereby the formula goes from being satisfiable with probability close to 1 to being unsatisfiable with probability close to 1. But this conjecture is unproven, as is the conjecture that survey propagation works up to this threshold.
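A sketch of this random instance generator in Python (I pick three distinct variables per clause, a common convention; the clause representation matches the earlier sketches).

```python
import random

def random_3sat(n, m):
    """m random clauses over variables 1..n: each clause picks three distinct
    variables and negates each one independently with probability 1/2."""
    clauses = []
    for _ in range(m):
        chosen = random.sample(range(1, n + 1), 3)
        clauses.append({v if random.random() < 0.5 else -v for v in chosen})
    return clauses

# Near the conjectured threshold m ≈ 4.3n:
formula = random_3sat(n=100, m=430)
```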
Now we show that if m > 5.2n then the formula is unsatisfiable with high probability. This follows since the expected number of satisfying assignments in such a formula is 2^n (7/8)^m (by linearity of expectation: there are 2^n possible assignments, and any fixed assignment satisfies each of the m independently chosen clauses with probability 7/8, hence all of them with probability (7/8)^m). For m > 5.2n this number is at most (2 · (7/8)^5.2)^n ≈ (0.999)^n, which is exponentially small, so by Markov's inequality the probability that it is ≥ 1 (i.e., that a satisfying assignment exists) is tiny.
Note that we do not know how to prove in polynomial time, given such a formula with m > 5.2n, that it is unsatisfiable. In fact it is known that for m > Cn for some large constant C, the simple DP-style algorithms take exponential time.

24.5 Metropolis-Hastings and Computational statistics


Now we turn to counting problems and statistical estimation, discussed earlier in Lecture 21. Recall the Monte Carlo method for estimating the area of a region: throw darts and see what fraction lands in the region.

Figure 24.3: Monte Carlo (dart throwing) method to estimate the area of a circle. The
fraction of darts that fall inside the disk is π/4.

Now suppose we are trying to integrate a nonnegative valued function f over the region. Then we should throw a dart which lands at x with probability proportional to f(x). We'll examine how to throw such a dart.

First note that this is an example of sampling from a probability distribution for which only the density function is known. Say the distribution is defined on {0, 1}^n and we have a goodness function f(x) that is nonnegative and computable in polynomial time given x ∈ {0, 1}^n. Then we wish to sample from the distribution where the probability of getting x is proportional to f(x). Since probabilities must sum to 1, we conclude that this probability is f(x)/N where N = Σ_{x∈{0,1}^n} f(x) is the so-called partition function. The main problem here is that N is in general hard to compute; it is complete for the class #P mentioned in the earlier lecture.

Example 53 The dart throwing/integration problem arises in machine learning (more generally, in statistical procedures). For instance, if there is a density p(x, y) and we wish to compute p(x|y) using Bayes' rule, then we need p(x, y)/p(y), where

p(y) = ∫ p(x, y) dx.

Let's note that if one could do such dart throwing in general, then 3SAT becomes easy. Suppose the formula has n variables and m clauses. For any assignment x define f(x) = 2^{2n·f_x} where f_x = number of clauses satisfied by x. Then if the formula has a satisfying assignment, N > 2^{2nm}, whereas if the formula is unsatisfiable then N ≤ 2^n · 2^{2n(m−1)} < 2^{2nm}. In particular, the mass f(x) of a satisfying assignment exceeds the total mass of all unsatisfying assignments. So the ability to sample from the distribution would yield a satisfying assignment with high probability.
The Metropolis-Hastings algorithm (named after its inventors) is a heuristic for sampling from such a distribution. Define the following random walk on {0, 1}^n. At every step the walk is at some x ∈ {0, 1}^n. (At the beginning use an arbitrary x.) At every step, toss a coin. If it comes up heads, stay at x. (In other words, there is a self-loop of probability at least 1/2.) If the coin comes up tails, then randomly pick a neighbor x′ of x and move to x′ with probability min{1, f(x′)/f(x)}. (In other words, if f(x′) ≥ f(x), definitely move; otherwise move with probability given by the ratio f(x′)/f(x).)
Claim: If f(x) > 0 for all x, then the stationary distribution of this Markov chain is exactly f(x)/N, the desired distribution.
Proof: The Markov chain defined by this random walk is ergodic, since f(x) > 0 for all x implies it is connected, and the self-loops rule out periodicity. Thus it suffices to show that the (unique) stationary distribution has the form f(x)/K for some scale factor K, and then it follows that K is the partition function N. To do so it suffices to verify that such a distribution is stationary, i.e., in one step the probability flowing out of a vertex equals its inflow. For any x, let L be the neighbors with a lower f value and H be the neighbors with f value at least as high. Ignoring the common factor 1/n for picking a particular neighbor, the outflow of probability per step is

(f(x)/2K) · ( Σ_{x′∈L} f(x′)/f(x) + Σ_{x′∈H} 1 ),

whereas the inflow is

(1/2) · ( Σ_{x′∈L} (f(x′)/K) · 1 + Σ_{x′∈H} (f(x′)/K) · (f(x)/f(x′)) ),

and both are equal to (1/2K)( Σ_{x′∈L} f(x′) + Σ_{x′∈H} f(x) ). □

Note: The advantage of the random walk method is that it can in principle explore a space of exponential size while using only the space needed to store the current x. In this sense it is like local search. In fact it is like a probabilistic version of local search on the objective f(x). In local search one would move from x to x′ only if that improves f, whereas here the move is made with a probability depending upon the ratio f(x′)/f(x), so every possible move has a nonzero probability.
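The walk itself is only a few lines of code. Here is a Python sketch on {0, 1}^n, where f is any strictly positive goodness function supplied by the caller (the bit-vector representation and step count are illustrative choices of mine).

```python
import random

def metropolis_walk(f, n, steps):
    """Lazy Metropolis random walk on {0,1}^n whose stationary distribution
    is proportional to f(x); f must be strictly positive everywhere."""
    x = [random.randint(0, 1) for _ in range(n)]        # arbitrary starting point
    for _ in range(steps):
        if random.random() < 0.5:
            continue                                     # self-loop with probability 1/2
        i = random.randrange(n)                          # random neighbor: flip one bit
        y = x[:]
        y[i] = 1 - y[i]
        if random.random() < min(1.0, f(y) / f(x)):      # accept with prob min(1, f(y)/f(x))
            x = y
    return x

# Example: bias the walk toward vectors with many ones.
sample = metropolis_walk(lambda x: 2.0 ** sum(x), n=20, steps=10000)
```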

Simulated Annealing. If we use the suggested goodness function for 3SAT, f(x) = 2^{2n·f_x}, then this Markov chain can be shown to mix poorly. So a variant is to use a Markov chain that updates itself. The goodness function is initialized to 2^{γ·f_x} with γ = 1, and the chain is allowed to mix. This stationary distribution may put too little weight on the satisfying assignments. So then slowly increase γ from 1 to 2n, allowing the chain to mix for a while at each step. This family of algorithms is called simulated annealing, named after the physical process of annealing.
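A sketch of the annealing loop in Python (the particular schedule of γ values and the number of steps per stage are arbitrary illustrative choices, not tuned recommendations).

```python
import random

def sat_count(clauses, x):
    """Number of clauses satisfied by the 0/1 vector x (variable i is x[i-1])."""
    return sum(any((x[abs(l) - 1] == 1) == (l > 0) for l in clause) for clause in clauses)

def anneal_3sat(clauses, n, gammas=(1, 2, 4, 8, 16), steps_per_gamma=5000):
    """Metropolis walk with goodness 2^(gamma * sat_count), sharpening gamma in stages."""
    x = [random.randint(0, 1) for _ in range(n)]
    for gamma in gammas:                                  # slowly increase gamma
        for _ in range(steps_per_gamma):
            if random.random() < 0.5:
                continue                                  # lazy self-loop
            i = random.randrange(n)
            y = x[:]
            y[i] = 1 - y[i]
            # acceptance ratio f(y)/f(x) = 2^(gamma * (sat(y) - sat(x)))
            ratio = 2.0 ** (gamma * (sat_count(clauses, y) - sat_count(clauses, x)))
            if random.random() < min(1.0, ratio):
                x = y
    return x
```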

For further information see this survey and its list of references.

Satisfiability Solvers, by C.P. Gomes, H. Kautz, A. Sabharwal, and B. Selman. Handbook


of Knowledge Representation, Elsevier 2008.


princeton university fall ’14 cos 521:Advanced Algorithms


Homework 1
Out: Sep 25 Due: Oct 2

You can collaborate with your classmates, but be sure to list your collaborators with your
answer. If you get help from a published source (book, paper etc.), cite that. The answer
must be written by you and you should not be looking at any other source while writing it.
Also, limit your answers to one page, preferably less —you just need to give enough detail
to convince the grader.
Typeset your answer in latex (if you don't know latex, you can write your answer by hand but scan it into pdf form before submitting). There are scanners in the mailroom, and most smartphones can also scan.

§1 The simplest model for a random graph consists of n vertices, and tossing a fair coin
for each pair {i, j} to decide whether this edge should be present in the graph. Call
this G(n, 1/2). A triangle is a set of 3 vertices with an edge between each pair.
What is the expected number of triangles? What is the variance? Use the Chebyshev
inequality to show that the number is concentrated around the expectation and give
an expression for the exact decay in probability. Is it possible to use Chernoff bounds
in this setting?

§2 (Part 1): You are given a fair coin, and a program that generates the binary expansion of p up to any desired accuracy. Formally describe a procedure to simulate a biased coin that comes up heads with probability p. (This was sketched in class.) (Part 2) Now, show how to do the reverse: generate a fair coin toss using a biased coin whose bias is unknown.

§3 A cut is said to be a B-approximate min cut if the number of edges in it is at most B times that of the minimum cut. Show that a graph has at most (2n)^{2B} cuts that are B-approximate. (Hint: Run Karger's algorithm until it has 2B + 1 supernodes. What is the chance that a particular B-approximate cut is still available? How many possible cuts does this collapsed graph have?)

§4 Show that given n numbers in [0, 1] it is impossible to estimate the value of the median
within say 1.1 factor with o(n) samples. (Hint: to show an impossibility result you
show two different sets of n numbers that have very different medians but which
generate —whp—identical samples of size o(n).)
Now calculate the sample size needed (as a function of t) so that the following is true:
with high probability, the median of the sample has at least n/2 − t numbers less than
it and at least n/2 − t numbers more than it.

§5 Consider the following process for matching n jobs to n processors. In each step, every
job picks a processor at random. The jobs that have no contention on the processors
they picked get executed, and all the other jobs back off and then try again. Jobs only
take one round of time to execute, so in every round all the processors are available.
Show that all the jobs finish executing whp after O(log log n) steps.

§6 In class we saw a hash-based method to estimate the size of a set. Change it to estimate frequencies. Thus there is a stream of packets each containing a key, and you wish to maintain a data structure which allows us to give, at the end, an estimate of the number of times each key appeared in the stream. The size of the data structure should not depend upon the number of distinct keys in the stream but can depend upon the success probability, approximation error, etc. Just shoot for the following kind of approximation: if a_k is the true number of times that key k appeared in the stream, then your estimate should be a_k ± ε(Σ_k a_k). In other words, the estimate is going to be accurate only for keys that appear frequently ("heavy hitters") in the stream. (This is useful in detecting anomalies or malicious attacks.) Hint: Think in terms of maintaining m1 × m2 counts using as many independent hash functions, where each key updates m2 of them.

§7 In Matlab or another suitable programming environment implement a pairwise in-


dependent hash function and use it to map {100, 200, 300, ..., 100n} to a set of size
around n. (Use n = 105 for starters.) Report the largest bucket size you noticed.
Then make up a hash function of your own design (could involve crazy stuff like tak-
ing XOR of bits, etc.) and repeat the experiment with it and report the largest bucket
size. Include your code with your answer and brief description of any design decisions.


princeton university fall ’14 cos 521:Advanced Algorithms


Homework 2
Out: Oct 7 Due: Oct 16

You can collaborate with your classmates, but be sure to list your collaborators with your
answer. If you get help from a published source (book, paper etc.), cite that. The answer
must be written by you and you should not be looking at any other source while writing it.
Also, limit your answers to one page, preferably less —you just need to give enough detail
to convince the grader.
Typeset your answer in latex (if you don’t know latex, scan your handwritten work into
pdf form before submitting). To make things easier to grade, submit answers in the numbered
order listed below, and also make sure your name appears on every page.
§1 Draw the full tree of possibilities for the cake-eating problem discussed in class, and
compute the optimum cake-eating schedule. To keep the tree size manageable, draw
it with the following slight changes: The amount you eat each day has to be an integer
multiple of 1/3, and on each day your roommates will with probability 1/2 eat 1/3
of the cake. (If multiple branches or sub-branches are identical, you may label the
branch with a variable and use the variable in lieu of re-drawing the branch. You may
also omit branches where you eat 0 cake.)
§2 (Stable Matchings with Real-Valued Utilities) We saw stable matchings in a guest lec-
ture. Another formulation of the bipartite stable matching problem has each agent i
submit a real number ui (j) for each element j in the opposite partition, representing
the utility of being matched with that element. We then define a perfect matching M
to be stable if there does not exist a pair (v, w) ∉ M such that both u_v(w) > u_v(v′) and u_w(v) > u_w(w′), where both (v, v′) and (w, w′) are in M.

(a) Prove that if the two partitions of the graph are of equal size, u_v(w) = u_w(v) for all pairs (v, w), and u_v(w) ≠ u_{v′}(w′) for all {v, w} ≠ {v′, w′}, then there exists a unique stable matching among the agents.
(b) Show by example that if we remove the final condition (that utilities are unique
between different pairs of agents) from part (a), then the instance can contain
multiple stable matchings.

§3 In ℓ2 regression you are given datapoints x_1, x_2, . . . , x_n ∈ ℝ^k and some values y_1, y_2, . . . , y_n ∈ ℝ and wish to find the "best" linear function that fits this dataset. A frequent choice for best fit is the one with least squared error, i.e., find a ∈ ℝ^k that minimizes

Σ_{i=1}^{n} |y_i − a · x_i|².

Show how to solve this problem in polynomial time (hint: reduce to solving linear equations; at some point you may need a certain matrix to be invertible, which you can assume).

§4 (Firehouse location) Suppose we model a city as an m-point finite metric space with d(x, y) denoting the distance between points x, y. These m-choose-2 distances (which satisfy the triangle inequality) are given as part of the input. The city has n houses located at points v_1, v_2, . . . , v_n in this metric space. The city wishes to build k firehouses and asks you to help find the best locations c_1, c_2, . . . , c_k for them, which can be located at any of the m points in the city. The happiness of a town resident with the final locations depends upon his distance from the closest firehouse. So you decide to minimize the cost function Σ_{i=1}^{n} d(v_i, u_i) where u_i ∈ {c_1, c_2, . . . , c_k} is the firehouse closest to v_i. Describe an LP-based algorithm that runs in poly(m) time and solves this problem approximately. If OPT is the optimum cost of a solution with k firehouses, your solution is allowed to use O(k log m) firehouses and have cost at most (1 + ε)OPT.

§5 In class we designed a 3/4-approximation for MAX-2SAT using LP rounding. Extend


it to a 3/4-approximation for MAX-SAT (i.e., where clauses can have 1 or more
variables). Hint: you may also need the following idea: if a clause has size k and
we randomly assign values to the variables (i.e., 0/1 with equal probability) then the
probability we satisfy it is 1 − 1/2^k.

§6 You are given data containing grades in different courses for 5 students. As discussed
in Lecture 5, we are trying to ”explain” the grades as a linear function of the student’s
aptitude, the easiness of the course and some error term. Denoting by Gradeij the
grade of student i in course j this linear model hypothesizes that

Gradeij = aptitudei + easinessj + εij ,

where εij is an error term.


As we saw in class, the problem of finding the best model that minimizes the sum of
the |εij |’s can be solved by an LP. Your goal is to use any standard package for linear
programming (Matlab/CVX, Freemat, Sci-Python, Excel etc.; we recommend CVX
on matlab) to fit the best model to this data. Include a printout of your code, and
the calculated easiness values of all the courses and the aptitudes of all the students.
MAT CHE ANT REL POL ECO COS
Alex C+ A B+ A- C+
Billy B+ A- A- B B
Chris B B+ A A- B+
David A B- A A-
Elise B- C B+ B B C
Assume A = 10, B = 8 and so on. Let B+ = 9 and A− = 9.5. (If you use a different
numerical conversion please state it clearly.)

§7 (Optimal life partners via MDP) Your friend is trying to find a life partner by going
on dates with n people selected for her by an online dating service. After each date
she has two choices: select the latest person she dated and stop the process, or reject
this person and continue to date. She has asked you to suggest the optimum stopping
rule. You can assume that the n persons are all linearly orderable (i.e. given a choice
between any two, she is not indifferent and prefers one over the other). The dating
service presents the n chosen people in a random order, and her goal is to maximise the
chance of ending up with the person that she will like the most among these n. (Thus
ending up even with her second favorite person out of the n counts as failure; she’s a
perfectionist.) Represent her actions as an MDP, compute the optimum strategy for
her and the expected probability of success by following this strategy.
(Hint: The optimal rule is of the form: date γn people and decide beforehand to pass on them. After that, select the first person who is preferable to all people seen so far. You may also need that Σ_{k=t_1}^{t_2} 1/k ≈ ln(t_2/t_1).)

§8 (extra credit) In question 4 try to design an algorithm that uses k firehouses but
has cost O(OPT). (Needs a complicated dependent rounding; you can also try other
ideas.) Partial credit available for partial progress.


princeton university fall ’14 cos 521:Advanced Algorithms


Homework 3
Out: Oct 23 Due: Nov 10

1. Compute the mixing time (both upper and lower bounds) of a graph on 2n nodes
that consists of two complete graphs on n nodes joined by a single edge. (Hint: Use
elementary probability calculations and reasoning about “probability fluid”; no need
for eigenvalues.)
2. Let M be the Markov chain of a connected 5-regular undirected graph, where each node has a self-loop taken with probability 1/2. We saw in class that 1 is an eigenvalue with eigenvector ~1 (the all-ones vector). Show that every other eigenvalue has magnitude at most 1 − 1/(10n²). (Hint: First show that a connected graph cannot have 2 eigenvalues that are 1.)
What does this imply about the mixing time for a random walk on this graph from an arbitrary starting point?
3. (Game-playing equilibria) Recall the game of Rock, Paper, Scissors. Let's make it quantitative by saying that the winning player wins $1 whereas the loser gets $0. (In other words, the game is not zero sum.) A draw results in both getting 0. Suppose
we make two copies of the multiplicative weight update algorithm to play each other
over many iterations. Both start using the uniformly random strategy (i.e., play each
of Rock/paper/scissors with probability 1/3) and learn from experience using the
MW rule. One imagines that repeated play causes them to converge to some kind
of equilibrium. (a) Predict by just calculation/introspection what this equilibrium
is. (Be honest; it’s Ok to be wrong!). (b) Run this experiment on Matlab or any
other programming environment and report what you discovered and briefly explain
it. (We’ll discuss the result in class.)
4. This question will study how mixing can be much slower on directed graphs. Describe
an n-node directed graph (with max indegree and outdegree at most 5) that is fully
connected but where the random walk takes exp(Ω(n)) time to mix (and the walk
ultimately does mix). Argue carefully.
5. Describe an example (i.e., an appropriate set of n points in ℝ^n) that shows that the Johnson-Lindenstrauss dimension reduction method (precisely the transformation described in lecture) does not preserve ℓ1 distances within even a factor of 2. (Extra credit: Show that no linear transformation suffices, let alone J-L.)
6. (Dimension reduction for SVM's with margin) Suppose we are given two sets P, N of unit vectors in ℝ^n with the guarantee that there exists a hyperplane a · x = 0 such that every point in P is on one side and every point in N is on the other. Furthermore, the ℓ2 distance of each point in P and N to this hyperplane is at least ε. Then show using the Johnson-Lindenstrauss lemma (hint: you can use it as a black box) that after a random linear mapping to O(log n/ε²) dimensions, with high probability the points are still separable by a hyperplane with margin ε/2.
7. Suppose you are trying to convince your friend that there is no perfect randomness in
his head. One way to do it would be to show that if you ask him to write down 100
random bits (say) then his last 20 are fairly predictable after you see the first 80.
Describe the design of such a predictor using a Markovian model, carefully describing
any assumptions. Implement the predictor in any suitable environment and submit
the code with your answer. Report the results from a couple of experiments of the
following form. Ask a couple of friends to input 100 bits quickly (or 200 if he is
patient), and see how well the model predicts the last 20 (or 50) bits. The metric for
the model’s success in prediction is
Number of correct guesses − Number of incorrect guesses.
In order to do better than random guessing this number should be fairly positive.

8. (Extra credit) Calculate the eigenvectors and eigenvalues of the n-dimensional boolean hypercube, which is the graph with vertex set {−1, 1}^n where x, y are connected by an edge iff they differ in exactly one of the n coordinates. (Hint: Use symmetry extensively.)


princeton university fall ’14 cos 521:Advanced Algorithms


Homework 4
Out: Nov 13 Due: Dec 2

1. Implement the portfolio management algorithm appearing in the notes for Lecture 16 in any programming environment and check its performance on S&P stock data (download from http://ocobook.cs.princeton.edu/links.htm). Include your code as well as the final performance (i.e., the percentage gain achieved by your strategy).

2. Consider a set of n objects (images, songs etc.) and suppose somebody has designed a
distance function d(·) among them where d(i, j) is the distance between objects i and
j. We are trying to find a geometric realization of these distances. Of course, exact
realization may be impossible and we are willing to tolerate a factor 2 approximation.
We want n vectors u1 , u2 , . . . , un such that d(i, j) ≤ |ui − uj |2 ≤ 2d(i, j) for all pairs
i, j. Describe a polynomial-time algorithm that determines whether such ui ’s exist.

3. The course webpage links to a grayscale photo. Interpret it as an n×m matrix and run
SVD on it. What is the value of k such that a rank k approximation gives a reasonable
approximation (visually) to the image? What value of k gives an approximation that
looks high quality to your eyes? Attach both pictures and your code. (In matlab you
need mat2gray function.) Extra credit: Try to explain from first principles why SVD
works for image compression at all.

4. Suppose we have a set of n images and for some multiset E of image pairs we have been
told whether they are similar (denoted +edges in E) or dissimilar (denoted −edges).
These ratings were generated by different users and may not be mutually consistent
(in fact the same pair may be rated as + as well as −). We wish to partition them
into clusters S1 , S2 , S3 , . . . so as to maximise:

(# of +edges that lie within clusters) + (# of −edges that lie between clusters).

Show that the following SDP is an upper bound on this, where w_+(ij) and w_−(ij) are the number of times pair i, j has been rated + and − respectively.

max Σ_{(i,j)∈E} w_+(ij)(x_i · x_j) + w_−(ij)(1 − x_i · x_j)
subject to |x_i|_2² = 1 ∀i
           x_i · x_j ≥ 0 ∀ i ≠ j.

5. For the problem in the previous question describe a clustering into 4 clusters that
achieves an objective value 0.75 times the SDP value. (Hint: Use Goemans-Williamson
style rounding but with two random hyperplanes instead of one. You may need a quick
matlab calculation just like GW.)
6. Suppose you are given m halfspaces in ℝ^n with rational coefficients. Describe a polynomial-time algorithm to find the largest sphere that is contained inside the polyhedron defined by these halfspaces.

7. Let f be an n-variate convex function such that for every x, every eigenvalue of ∇²f(x) lies in [m, M]. Show that the optimum value of f is lower bounded by f(x) − (1/2m)|∇f(x)|_2² and upper bounded by f(x) − (1/2M)|∇f(x)|_2², where x is any point. In other words, if the gradient at x is small, then the value of f at x is near-optimal. (Hint: By the mean value theorem, f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(z)(y − x), where z is some point on the line segment joining x, y.)


princeton university fall ’14 cos 521:Advanced Algorithms


Homework 5
Out: Dec 5 Due: Dec 13

1. Prove von Neumann’s min max theorem. You can assume LP duality.

2. (Braess's paradox; well known to transportation planners) Figure (a) depicts a simple network of roads (each is one-way for simplicity) from point s to t. The number on an edge is the time to traverse that road. When we say the travel time is x, we mean that the time scales linearly with the amount of traffic on it.

Figure 24.4: Braess’s paradox

One unit of traffic (a large number of individual drivers) needs to travel from s to t. (Actually, assume it is just a tiny bit less than one unit.) Each driver's choice of route can be seen as a move in a multiplayer game. What is the Nash equilibrium and what is each driver's travel time to t in this equilibrium?
Figure (b) depicts the same network with a new superfast highway constructed from
v to w. What is the new Nash equilibrium and the new travel time?

3. Show that approximating the number of simple cycles within a factor 100 in a directed
graph is NP-hard. (Hint: Show that if there is a polynomial-time algorithm for this
task, then we can solve the Hamiltonian cycle problem in directed graphs, which is
NP-hard. Here the exact constant 100 is not important, and can even be replaced by,
say, n.)

4. (Extra credit) (Sudan's list decoding) Let (a_1, b_1), (a_2, b_2), . . . , (a_n, b_n) ∈ F² where F = GF(q) and q ≫ n. We say that a polynomial p(x) describes k of these pairs if p(a_i) = b_i for k values of i. This question concerns an algorithm that recovers p even if k < n/2 (in other words, a majority of the values are wrong).


(a) Show that there exists a bivariate polynomial Q(z, x) of degree at most ⌈√n⌉ + 1 in z and in x such that Q(b_i, a_i) = 0 for each i = 1, . . . , n. Show also that there is an efficient (poly(n) time) algorithm to construct such a Q.
(b) Show that if R(z, x) is a bivariate polynomial and g(x) a univariate polynomial then z − g(x) divides R(z, x) iff R(g(x), x) is the 0 polynomial.
(c) Suppose p(x) is a degree d polynomial that describes k of the points. Show that if d is an integer and k > (d + 1)(⌈√n⌉ + 1) then z − p(x) divides the bivariate polynomial Q(z, x) described in part (a). (Aside: Note that this places an upper bound on the number of such polynomials. Can you improve this upper bound by other methods?)

(There is a randomized polynomial time algorithm due to Berlekamp that factors a


bivariate polynomial. Using this we can efficiently recover all the polynomials p of
the type described in (c). This completes the description of Sudan’s algorithm for list
decoding.)

Princeton University
COS 521: Advanced Algorithms
Final Exam Fall 2014
Sanjeev Arora
Due electronically by Jan 19 5pm.

Instructions: The test has 6 questions. Finish the test within 48 hours after first
reading it. You can consult any notes/handouts from this class and feel free to quote,
without proof, any results from there. You cannot consult any other source or person in
any way.
Do not read the test before you are ready to work on it.

Write and sign the honor code pledge on your exam (The pledge is “I pledge
my honor that I have not violated the honor code during this examination.”)

Sanjeev, Kevin, and Siyu will be available Jan 11–19 via email and piazza to answer
any questions. We will also offer to call you if your confusion does not clear up. In case
of unresolved doubt, try to explain your confusion as part of the answer and maybe you
will receive partial credit. In general, stating clearly what you are trying to do can get you
partial credit.

1. Consider an undirected weighted graph (no self-loops) where w_ij denotes the weight on edge {i, j}. Its Laplacian is the matrix whose (i, i) entry is Σ_{j≠i} w_ij and whose (i, j) entry is −w_ij. Then prove that the Laplacian can be written as U Uᵀ for some n × n matrix U.
Now consider the positive Laplacian, which is like the above except that the (i, j) entry is w_ij. Show that this can be written as U Uᵀ where every entry of U is nonnegative.
2. Streaming algorithms are used in scientific or networking settings when a very long stream of numbers is whizzing by, and the algorithm lacks the space to store them all or the time to process them offline. Suppose the stream consists of M integers in {1, . . . , N} and let m_i be the number of times i appears in the stream. Gini's homogeneity index G is defined as G = Σ_{i=1}^{N} m_i². You have to design a randomized streaming algorithm that uses O((log(1/ε)/δ²) · log(M + N)) bits of storage and computes a number in [(1 − δ)G, (1 + δ)G] with probability at least 1 − ε.
(Hint: Consider the estimator (Σ_i s_i m_i)² where s_1, s_2, . . . , s_N are 4-wise independent random variables in {−1, 1}. Argue carefully about storage. Generous credit given if you get approximately the right bound.)
3. Suppose we are given a source that produces a vector-valued signal of the following form, where u, v ∈ ℝ^n are two orthogonal unit vectors. With probability 1/2 the source outputs a vector of the form bu + η where b is randomly chosen from [4 log n, 8 log n], and with probability 1/2 it outputs bv + η. Here η ∼ N(0, I) is a random spherical gaussian vector, each of whose coordinates is independently distributed as a univariate gaussian of variance 1 and mean 0.
Describe how to recover in polynomial time both u, v with error at most 1/n in each coordinate. Your algorithm can draw as many samples from this source as it needs. (Hint: Try to first identify the subspace spanned by u, v, and then recover u and v.)
4. A randomized algorithm that uses r random bits can be thought of as a distribution
over 2r deterministic algorithms, which allows it in principle to be exponentially more
powerful than a deterministic algorithm. Yao’s Principle is a way to prove lower
bounds on the performance of randomized algorithms. Suppose the set of all possible
deterministic algorithms running with a specified resource bound is finite, and denoted
A. The set of all inputs of a certain size is denoted I. For x ∈ I and A ∈ A
there is some cost of running A on input x, denoted cost(A, x). Then a randomized algorithm is a distribution R on A and the expected cost of running it on input x is E_{A∼R}[cost(A, x)].
Now prove Yao's principle:

min_R max_{x∈I} E_{A∼R}[cost(A, x)] = max_{D: distrib. on I} min_{A∈A} E_{x∼D}[cost(A, x)].

Use this principle to show that every randomized algorithm that distinguishes between the following two cases with probability at least 0.9 must examine Ω(1/ε) bits in the n-bit input. (a) Case 1: The input contains all 1's. (b) Case 2: At least an ε fraction of the bits are 0's.

5. In class we showed the existence of good error correcting codes. Now let us see that there actually exist good error correcting codes with linear structure. A linear error correcting code over GF(2) is an m × n matrix M such that the encoding of a column vector x ∈ GF(2)^n is the m-bit vector Mx. Show that if m(1 − H(p)) > n then there exists such a linear error correcting code in which E(x), E(y) disagree in at least pm of the bits for every pair x ≠ y.

6. A k-coloring of a graph G = (V, E) is an assignment of one of the colors {1, 2, . . . , k}


to each vertex, so that every two adjacent vertices get different colors. (Aside: Note
that nodes that get the same color must be independent, i.e. have no edges amongst
themselves.) This question explores an approximation algorithm for coloring using
SDPs.

(a) Show that if the graph is 3-colorable then the following SDP is feasible:

⟨v_i, v_j⟩ ≤ −1/2    ∀ {i, j} ∈ E
⟨v_i, v_i⟩ = 1    ∀ i

(b) Consider the following rounding algorithm: pick a random unit vector u and a threshold τ > 0, and select the set S = {i : ⟨u, v_i⟩ > τ}. Argue that there is a probability p such that for every i the probability that it lies in S is p.
(c) Show that for every edge {i, j} the probability that both i, j lie in S is at most p⁴. (Hint: Reason about the plane spanned by v_i and v_j.)
(d) Now argue that if the maximum degree is d then there is an efficient algorithm that, given a 3-colorable graph with maximum degree d, finds an independent set of size Ω(n/d^{1/3}). Argue that this can be turned into an algorithm that colors the graph with O(d^{1/3} log n) colors. (Full credit also given if you have extra factors of log n in any of these bounds.)
