A Graduate Course in Algorithm Design and Analysis
Sanjeev Arora
Department of Computer Science
Princeton University
These are lecture notes from a graduate course for Computer Science graduate students at
Princeton University in Fall 2013 and Fall 2014. (The course also attracts undergrads and
non-CS grads and total enrollment in 2014 was 35.) The course assumes prior knowledge
of algorithms at the undergraduate level.
This new course arose out of thinking about the right algorithms training for CS grads to-
day, since the environment for algorithms design and use has greatly changed since the 1980s
when the canonical grad algorithms courses were designed. The course topics are somewhat
nontraditional, and some of the homeworks involve simple programming assignments that
encourage students to play with algorithms using simple environments like Matlab and
Scipy.
Since this is the last theory course many of my students (grad or undergrad) might take
for the rest of their lives, I think of the scope as more than just algorithms; my goal is to
make them look at the world anew with a mathematical/algorithmic eye. For instance, I
discovered many holes in my students’ undergraduate CS education: information/coding
theory, economic utility and game theory, decision-making under uncertainty, cryptography
(anything beyond the RSA cryptosystem), etc. So I created space for these topics as well,
figuring that the value added by this was greater than by, say, presenting detailed analysis
of the Ellipsoid algorithm (which I sketch instead).
Programming assignments went out of fashion in most algorithms courses in the past
few decades, but I think it is time to bring them back. First, CS students are used to a
hands-on learning experience; an algorithm becomes real only once they see it run on real
data. Second, the computer science world increasingly relies on off-the-shelf packages and
library routines, and this is how algorithms are implemented in industry. One can write a
few lines of code in Matlab and Scipy, and run it within minutes on datasets of millions or
billions of numbers. No Java or C++ needed! Algorithms education should give students
at least a taste of such powerful tools. Finally, even for theory students it can be very
beneficial to play with algorithms and data a bit; this will help them develop a different
kind of theory.
The course gives students a choice between taking a 48-hour final, or doing a collab-
orative term project. Some sample term projects can be found at the course home page.
http://www.cs.princeton.edu/courses/archive/fall14/cos521/
This course is very much a work in progress, and I welcome your feedback and sugges-
tions.
I thank numerous colleagues for useful suggestions during the design of this course.
Above all, I thank my students for motivating me to teach them better; their feedback and
questions have helped shape these course notes.
Sanjeev Arora
March 2015
Contents
1 Hashing
1.1 Hashing: Preliminaries
1.2 Hash Functions
1.3 2-Universal Hash Families
1.4 Load Balancing
6 Linear Thinking
6.1 Simplest example: Solving systems of linear equations
6.2 Systems of linear inequalities and linear programming
6.3 Linear modeling
6.4 Meaning of polynomial-time
8 Decision-making under uncertainty: Part 1
8.1 Decision-making as dynamic programming
8.2 Markov Decision Processes (MDPs)
8.3 Optimal MDP policies via LP
15 Semidefinite Programs (SDPs) and Approximation Algorithms
15.1 Max Cut
15.2 0.878-approximation for MAX-2SAT
23 Real-life environments for big-data computations (MapReduce etc.)
23.1 Parallel Processing
23.2 MapReduce
About this course
Algorithms are integral to computer science: every computer scientist (even as an un-
dergrad) has designed some. So has many a physicist, electrical engineer, mathematician
etc. This course is meant to be your one-stop shop to learn how to design a variety of
algorithms. The operative word is “variety.” In other words you will avoid the blinders that
one often sees in domain experts. A Bayesian needs to see priors on the data before he
can begin designing algorithms; an optimization expert needs to cast all problems as convex
optimization; a systems designer has never seen any problem that cannot be solved by
hashing. (OK, mostly kidding but there is some truth in these stereotypes.) These and
more domain-specific ideas make an appearance in our course, but we will learn to not be
wedded to any single approach.
The primary skill you will learn in this course is how to analyse algorithms: prove their
correctness and their running time and any other relevant properties. Learning to analyse a
variety of algorithms (designed by others) will let you design better algorithms later in life.
I will try to fill the course with beautiful algorithms. Be prepared for frequent rose-smelling
stops, in other words.
The changing graph. In undergrad algorithms the graph is given and arbitrary (worst-
case). In grad algorithms we are willing to look at the domain (social network, computer
vision etc.) that the graph came from since the properties of graphs in those domains may
be germane to designing a good algorithm. (This is not a radical idea of course but we will
see that formulating good graph models is not easy. This is why you see a lot of heuristic
work in practice, without any mathematical proofs of correctness.)
Changing data structures: In undergrad algorithms the data structures were simple
and often designed to hold data generated by other algorithms (and hence under our con-
trol). A stack allows you to hold vertices during depth-first search traversal of a graph, or
instances of a recursive call to a procedure. A heap is useful for sorting and searching.
But in the newer applications, data often comes from sources we don’t control. Thus it
may be noisy, or inexact, or both. It may be high dimensional. Thus something like heaps
will not work, and we need more advanced data structures.
We will encounter the “curse of dimensionality” which constrains algorithm design for
high-dimensional data.
Type of analysis: In undergrad algorithms the algorithms were often exact and work on
all (i.e., worst-case) inputs. In grad algorithms we are willing to relax these requirements.
Chapter 1
Hashing
Today we briefly study hashing, both because it is such a basic data structure, and because
it is a good setting to develop some fluency in probability calculations.
A hash table supports three basic operations: insert, delete, and lookup. We design a hash function
h : U −→ {0, 1, . . . , n − 1} (1.1)
[Figure: a hash function h maps elements of the universe U into a table with n locations.]
• Pr_{h∈H}[h(x₁) = a₁] = 1/n.
• Pr_{h∈H}[h(x₁) = a₁ ∧ h(x₂) = a₂] = 1/n². Pairwise independence.
• Pr_{h∈H}[h(x₁) = a₁ ∧ h(x₂) = a₂ ∧ · · · ∧ h(x_k) = a_k] = 1/n^k. k-wise independence.
• Pr_{h∈H}[h(x₁) = a₁ ∧ h(x₂) = a₂ ∧ · · · ∧ h(x_m) = a_m] = 1/n^m. Full independence (note
that |U| = m).
Generally speaking, we encounter a tradeoff. The more random H is, the greater the
number of random bits needed to generate a function h from this class, and the higher the
cost of computing h.
For example, if H is a fully random family, there are n^m possible h, since each of the
m elements of S has n possible locations it can hash to. So we need log |H| = m log n
bits to represent each hash function. Since m is usually very large, this is not practical.
¹ We use [n] to denote the set {0, 1, . . . , n − 1}.
But the advantage of a random hash function is that it ensures very few collisions with
high probability. Let Lx be the length of the linked list containing x; this is just the number
of elements with the same hash value as x. Let random variable
I_y = 1 if h(y) = h(x), and I_y = 0 otherwise.   (1.2)
So L_x = 1 + Σ_{y∈S, y≠x} I_y, and
E[L_x] = 1 + Σ_{y∈S, y≠x} E[I_y] = 1 + (m − 1)/n.   (1.3)
Usually we choose n > m, so this expected length is less than 2. Later we will analyse
this in more detail, asking how likely L_x is to exceed, say, 100.
The expectation calculation above doesn’t need full independence; pairwise indepen-
dence would actually suffice. This motivates the next idea.
Pick a prime p ≥ |U| and for a, b ∈ [p] define f_{a,b}(x) = ax + b mod p. And let
h_{a,b}(x) = f_{a,b}(x) mod n.   (1.6)
Lemma 1
For any x₁ ≠ x₂ and s ≠ t, the system a x₁ + b = s (mod p), a x₂ + b = t (mod p) has exactly one solution (a, b).
Since [p] constitutes a finite field, we have that a = (x₁ − x₂)⁻¹(s − t) and b = s − a x₁.
Since we have p(p − 1) different hash functions in H in this case,
Pr_{h∈H}[h(x₁) = s ∧ h(x₂) = t] = 1/(p(p − 1)).   (1.9)
Claim: H = {h_{a,b} : a, b ∈ [p] ∧ a ≠ 0} is 2-universal.
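For concreteness, here is a minimal Python sketch of this construction (not part of the original notes; the prime p = 2³¹ − 1 and the table size n are arbitrary illustrative choices):

    import random

    P = (1 << 31) - 1          # a prime larger than the universe size (illustrative choice)

    def make_hash(n):
        """Sample h_{a,b}(x) = ((a*x + b) mod p) mod n from the 2-universal family."""
        a = random.randrange(1, P)     # a in [1, p-1], i.e. a != 0
        b = random.randrange(0, P)     # b in [0, p-1]
        return lambda x: ((a * x + b) % P) % n

    # Usage: a chained hash table with n buckets.
    n = 16
    h = make_hash(n)
    buckets = [[] for _ in range(n)]
    for key in (12, 99, 2**20 + 7, 31337):
        buckets[h(key)].append(key)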
[Figure: two-level hashing. The first-level table has locations 0, 1, . . . , n − 1; a bucket i containing s_i elements is given s_i² second-level locations.]
This is the classic balls-and-bins calculation, also called the load balancing problem. We have n balls and n bins,
and we randomly put the balls into bins. Then for a given i,
Pr[bin i gets more than k elements] ≤ (n choose k) · (1/n^k) ≤ 1/k!.   (1.19)
By Stirling's formula,
k! ∼ √(2πk) (k/e)^k.   (1.20)
If we choose k = O(log n / log log n), we can ensure 1/k! ≤ 1/n². Then
Pr[∃ a bin with ≥ k balls] ≤ n · (1/n²) = 1/n.   (1.21)
So with probability larger than 1 − 1/n,²
max load ≤ O(log n / log log n).   (1.22)
Aside: The above load balancing is not bad; no more than O(log n / log log n) balls in a bin with
high probability. Can we modify the method of throwing balls into bins to improve the load
balancing? We use an idea that you use at the supermarket checkout: instead of going to
a random checkout counter you try to go to the counter with the shortest queue. In the
load balancing case this is computationally too expensive: one has to check all n queues.
A much simpler version is the following: when a ball comes in, pick 2 random bins, and
place the ball in the one that has fewer balls. It turns out this modified rule ensures that the
maximum load drops to O(log log n), which is a huge improvement. This is called the power of
two choices.
² This can easily be improved to 1 − 1/n^c for any constant c.
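The following small simulation (a sketch, not from the notes; n and the comparison rule are the only ingredients, the specific value of n is arbitrary) lets you observe the gap between the two placement rules:

    import random

    def max_load_one_choice(n):
        bins = [0] * n
        for _ in range(n):
            bins[random.randrange(n)] += 1
        return max(bins)

    def max_load_two_choices(n):
        bins = [0] * n
        for _ in range(n):
            i, j = random.randrange(n), random.randrange(n)
            # place the ball in the less loaded of the two sampled bins
            bins[i if bins[i] <= bins[j] else j] += 1
        return max(bins)

    n = 100_000
    print("one choice :", max_load_one_choice(n))    # typically around log n / log log n
    print("two choices:", max_load_two_choices(n))   # typically O(log log n), much smaller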
Chapter 2
Today’s topic is simple but gorgeous: Karger’s min cut algorithm and its extension. It is a
simple randomized algorithm for finding the minimum cut in a graph: a subset of vertices
S for which the set of edges leaving S, denoted E(S, S̄), has minimum size among all subsets.
You may have seen a polynomial-time algorithm for this problem in your undergrad class
that uses maximum flow. Karger’s algorithm is much more elementary and a great
introduction to randomized algorithms.
The algorithm is this: pick a random edge, and merge its endpoints into a single “supernode.”
Repeat until the graph has only two supernodes, and output the set of edges between them as
our guess for the min cut. (As you continue, the supernodes may develop parallel edges; these
are allowed. Self-loops are ignored.) See Figure 2.1.
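Here is a minimal Python sketch of a single run of the contraction algorithm (my own illustration, assuming the input graph is connected; supernodes are tracked with a union-find structure):

    import random

    def karger_one_run(n, edges):
        # One run of Karger's contraction on a connected multigraph with vertices 0..n-1.
        # edges is a list of (u, v) pairs; returns the size of the cut found.
        parent = list(range(n))               # union-find representatives of supernodes

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        supernodes = n
        while supernodes > 2:
            u, v = random.choice(edges)       # pick a uniformly random edge
            ru, rv = find(u), find(v)
            if ru == rv:
                continue                      # self-loop inside a supernode: ignore it
            parent[ru] = rv                   # contract: merge the two supernodes
            supernodes -= 1
        # the cut consists of edges whose endpoints lie in different supernodes
        return sum(1 for u, v in edges if find(u) != find(v))

    # Repeat many times and keep the smallest cut seen.
    edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)]
    print(min(karger_one_run(6, edges) for _ in range(200)))   # min cut here is 1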
Note that if you pick a random edge, it is more likely to come from parts of the graph that
contain more edges in the first place. Thus this algorithm looks like a great heuristic to try on
all kinds of real-life graphs, where one wants to cluster the nodes into “tightly-knit” portions.
For example, social networks may cluster into communities; graphs where edges capture
similarity of pixels may cluster to give different portions of the image (sky, grass, road etc.).
Thus instead of continuing Karger’s algorithm until you have two supernodes left, you could
stop it when there are k supernodes and try to understand whether these correspond to a
reasonable clustering. (Aside: There are much better clustering algorithms out there.)
Today we will first see that the above version of the algorithm yields the optimum min
cut with probability at least 2/n². Thus we can repeat it say 20n² times, and output the
smallest cut seen in any iteration. The probability that the optimum cut is not seen in any
repetition is at most (1 − 2/n²)^(20n²) < 0.01. Unfortunately, this simple version has running
time about n⁴, which is not great. So then we see a better version with a simple tweak that
brings the running time down to closer to n².
Thus repeating the algorithm K times where K = n(n − 1)/2 and taking the smallest
cut ever discovered in these repetitions will yield a minimum cut with chance at least
1 − (1 − 1/K)^K ≈ 1 − 1/e. (Aside: the approximation (1 − 1/K)^K ≈ 1/e for large K is
very useful and will reappear in later lectures.) It is relatively easy, using data structures
you learnt in undergrad algorithms, to implement each repetition of the algorithm in O(n²)
time, so the overall running time is O(n⁴).
The analysis is rather simple. First, recall that the sum of node degrees in an undirected
graph G = (V, E) is exactly 2 |E| (since adding the degrees counts each edge twice). Thus
if |V | = n, there exists a node of degree at most 2 |E| /n. Putting this vertex on one side
of the cut and all other vertices on the other side gives a cut of size at most 2 |E| /n. Thus
the minimum cut cannot have any more than 2 |E| /n edges.
Let (S, S̄) be any minimum cut. Then the probability that a random edge picked at the
first step by Karger’s algorithm lies in this particular cut is at most (2|E|/n)/|E| = 2/n. If
it doesn’t lie in the cut, then contracting the edge maintains (S, S̄) as a viable cut in the
graph.
Since each edge contraction reduces the number of supernodes by 1, the total number of edge
pickings is n − 2, and the probability that (S, S̄) survives all of them is
Pr[first edge not in this cut] × Pr[second edge not in this cut | first edge was not in the cut] × · · · ,
which is at least
(1 − 2/n)(1 − 2/(n−1))(1 − 2/(n−2)) × · · · × (1 − 2/4)(1 − 2/3)
= ((n−2)/n)((n−3)/(n−1))((n−4)/(n−2)) × · · · × (2/4) × (1/3)   (Note: telescoping!!)
= 2/(n(n − 1)).
Aside: We have proven a stronger result than we needed to: every minimum cut
remains at the end with probability at least 2/(n(n − 1)). This implies in particular that
the number of minimum cuts in an undirected graph is at most n(n − 1)/2. (Note that
the number of nonempty subsets of vertices, and hence an upper bound on the number of cuts,
is 2^n − 1, so this implies only a tiny number of all cuts are minimum cuts.) This upper bound has had great
impact in subsequent theory of algorithms, though we will not have occasion to explore that
in this course.
The Karger–Stein improvement starts from the observation that the probability that a fixed
minimum cut survives (i.e., no edge in it has been picked) while the graph is contracted down
to n/√2 supernodes is at least 1/2. So you make two independent runs that go
down to n/√2 supernodes, and recursively solve both of these with the same Karger–Stein
algorithm. Then return the smaller of the two cuts returned by the recursive calls.
The running time for such an algorithm satisfies
T(n) = O(n²) + 2T(n/√2),
which the Master theorem of ugrad algorithms¹ shows to yield T(n) = O(n² log n). As you
might suspect, this is not the end of the story, but improvements beyond this get more hairy.
If anybody is interested I can give more pointers.
Claim: The probability the algorithm returns a minimum cut is at least 1/ log n.
Thus repeating the algorithm O(log n) times gives a success probability at least 0.9 (say)
and a running time of O(n² log² n).
To prove the claim we note that if P(n) is the probability that the procedure returns a
minimum cut, then
P(n) ≥ 1 − (1 − (1/2) P(n/√2))²,
where the term (1/2) P(n/√2) represents the probability of the event that a minimum cut
survived in the shrinkage to n/√2 vertices, and the recursive call then recovered this minimum
cut.
To see that this solves to P(n) ≥ 1/log n we can do a simple induction, where the
inductive step needs to verify that
1/log n ≤ 1 − (1 − (1/2) · 1/(log n − 0.5))² = 1/(log n − 0.5) − 1/(4(log n − 0.5)²).
Bibliography
1) Global min-cuts in RNC, and other ramifications of a simple min-cut algorithm. David
Karger, Proc. ACM-SIAM SODA 1993.
2) A new approach to the minimum cut problem. David Karger and Cliff Stein, JACM
43(4):601–640, 1996.
¹ Hush, hush, don’t tell anybody, but most researchers don’t use the Master theorem, even though it
was stressed a lot in undergrad algorithms. When we need to solve such recurrences, we just unwrap the
recurrence a few times and see that there are O(log n) levels, and each involves O(n²) work, for a total of
O(n² log n).
Figure 2.1: Illustration of Karger’s Algorithm (borrowed from lecture notes of Sanjoy Dasgupta).
Chapter 3
Today’s topic is deviation bounds: what is the probability that a random variable deviates
from its mean by a lot? Recall that a random variable X is a mapping from a probability
space to R. The expectation or mean is denoted E[X] or sometimes as µ.
In many settings we have a set of n random variables X1 , X2 , X3 , . . . , Xn defined on
the same probability space. To give an example, the probability space could be that of all
possible outcomes of n tosses of a fair coin, and Xi is the random variable that is 1 if the
ith toss is a head, and is 0 otherwise, which means E[Xi ] = 1/2.
The first observation we make is the Linearity of Expectation, viz.
E[Σ_i X_i] = Σ_i E[X_i].
It is important to realize that linearity holds regardless of whether or not the random
variables are independent.
Can we say something about E[X₁X₂]? In general, nothing much, but if X₁, X₂ are
independent (formally, this means that for all a, b, Pr[X₁ = a, X₂ = b] = Pr[X₁ = a] Pr[X₂ = b])
then E[X₁X₂] = E[X₁] E[X₂].
Note that if the X_i’s are pairwise independent (i.e., each pair is mutually independent)
then var[Σ_i X_i] = Σ_i var[X_i].
Note that this is just another way to write the trivial observation that E[X] ≥ k ·Pr[X ≥ k].
Can we give any meaningful upperbound on Pr[X < c · E[X]] where c < 1, in other
words the probability that X is a lot less than its expectation? In general we cannot.
However, if we know an upperbound on X then we can. For example, if X ∈ [0, 1] and
E[X] = µ then for any c < 1 we have (simple exercise)
Pr[X ≤ cµ] ≤ (1 − µ)/(1 − cµ).
Sometimes this is also called an averaging argument.
Example 1 Suppose you took a lot of exams, each scored from 1 to 100. If your average
score was 90 then in at least half the exams you scored at least 80.
and so,
Pr[|X − µ|² ≥ k²σ²] ≤ 1/k².
Here is a simple fact that’s used a lot: if Y₁, Y₂, . . . , Y_t are iid (which is jargon for
independent and identically distributed) then the variance of their average (1/t) Σ_i Y_i is exactly 1/t
times the variance of one of them. Using Chebyshev’s inequality, this already implies that
the average of iid variables converges sort-of strongly to the mean.
Now for independent random variables E[Y_iY_j] = E[Y_i] E[Y_j], so E[X²] = m/n + m(m−1)/n².
Hence the variance is very close to m/n, and thus Chebyshev implies that
Pr[X > 2m/n] < n/m. When m > 3n, say, this is stronger than Markov.
Instead of proving the above we prove a simpler theorem for binary valued variables
which showcases the basic idea.
Theorem 3
Let X₁, X₂, . . . , X_n be independent 0/1-valued random variables and let p_i = E[X_i], where
0 < p_i < 1. Then the sum X = Σ_{i=1}^n X_i, which has mean µ = Σ_{i=1}^n p_i, satisfies
Pr[X ≥ (1 + δ)µ] ≤ (c_δ)^µ,   where c_δ = e^δ / (1 + δ)^(1+δ).
Remark: There is an analogous inequality that bounds the probability of deviation below
the mean, whereby δ becomes negative and the ≥ in the probability becomes ≤ and the cδ
is very similar.
Proof: Surprisingly, this inequality also is proved using the Markov inequality, albeit
applied to a different random variable.
We introduce a positive dummy variable t and observe that
E[exp(tX)] = E[exp(t Σ_i X_i)] = E[∏_i exp(tX_i)] = ∏_i E[exp(tX_i)],   (3.1)
where the last equality holds because the X_i r.v.s are independent. Now,
E[exp(tX_i)] = (1 − p_i) + p_i e^t,
therefore,
∏_i E[exp(tX_i)] = ∏_i [1 + p_i(e^t − 1)] ≤ ∏_i exp(p_i(e^t − 1)) = exp(Σ_i p_i(e^t − 1)) = exp(µ(e^t − 1)),   (3.2)
Exercise: Show that it is impossible to estimate the value of the median within, say, a factor
1.1 using o(n) samples.
But what is possible is to produce a number that is an approximate median: it is greater
than at least n/2 − n/t of the numbers and smaller than at least n/2 − n/t of the numbers. The
idea is to take a random sample of a certain size and take the median of that sample. (Hint:
Use balls and bins.)
One can use the approximate median algorithm to describe a version of quicksort with
very predictable performance. Say we are given n numbers in an array. Recall that (random)
quicksort is the sorting algorithm where you randomly pick one of the n numbers as a pivot,
then partition the numbers into those that are bigger than and smaller than the pivot (which
takes O(n) time). Then you recursively sort the two subsets.
This procedure works in expected O(n log n) time as you may have learnt in an undergrad
course. But its performance is uneven because the pivot may not divide the instance into
two exactly equal pieces. For instance the chance that the running time exceeds 10n log n
time is quite high.
A better way to run quicksort is to first do a quick estimation of the median and then
do a pivot. This algorithm runs in very close to n log n time, which is optimal.
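A small Python sketch of this idea (my own illustration; the sample size 25 and the base-case cutoff are arbitrary choices): the pivot is the median of a small random sample, which with high probability is an approximate median of the whole array.

    import random

    def quicksort_sampled_pivot(a):
        if len(a) <= 16:
            return sorted(a)
        # pivot = median of a small random sample; w.h.p. an approximate median
        k = min(len(a), 25)                       # illustrative sample size
        pivot = sorted(random.sample(a, k))[k // 2]
        smaller = [x for x in a if x < pivot]
        equal   = [x for x in a if x == pivot]
        larger  = [x for x in a if x > pivot]
        return quicksort_sampled_pivot(smaller) + equal + quicksort_sampled_pivot(larger)

    print(quicksort_sampled_pivot([random.random() for _ in range(1000)])[:5])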
Chapter 4
Using only memory equivalent to 5 lines of printed text, you can estimate with a
typical accuracy of 5 per cent and in a single pass the total vocabulary of Shake-
speare. This wonderfully simple algorithm has applications in data mining, esti-
mating characteristics of huge data flows in routers, etc. It can be implemented
by a novice, can be fully parallelized with optimal speed-up and only needs minimal
hardware requirements. There’s even a bit of math in the middle!
Opening lines of a paper by Durand and Flajolet, 2003.
Strictly speaking, one cannot hash to a real number since computers lack infinite preci-
sion. Instead, one hashes to rational numbers in [0, 1]. For instance, hash IP addresses to
the set [p] as before, and then think of the number “i mod p” as the rational number i/p. This
works OK so long as our method doesn’t use too many bits of precision in the real-valued
hash.
A general note about sampling. As pointed out in Lecture 3 using the random variable
”Number of ears,” the expectation of a random variable may never be attained at any point
in the probability space. But if we draw a random sample, then we know by Chebysev’s
inequality that the sample has chance at least 1 − 1/k 2 of taking a value in the interval
[µ − kσ, µ + kσ] where µ, σ denote the mean and variance respectively. Thus to get any
reasonable idea of µ we need σ to be less than µ. But if we take t independent samples
(even pairwise independent will do) then the variance of the mean of these samples is σ 2 /t.
Hence by increasing t we can get a better estimate of µ.
All this assumed that the hash functions are random functions from 128-bit numbers to
[0, 1]. Let’s now show that it suffices to pick hash functions from a pairwise independent
family, albeit now yielding an estimate that is only correct up to some constant factor.
Specifically, the algorithm will take k pairwise independent hashes and see if the majority
of the min values are contained in some interval of the type [1/(3x), 3/x]. Then x is our
estimate for N , the number of elements. This estimate will be correct up to a factor 3 with
probability at least 1 − 1/k.
What is the probability that we hash N different elements using such a hash function
and the smallest hash value is less than 1/(3N)? For each element x, Pr[h(x) < 1/(3N)] is at
most 1/(3N), so by the union bound, the probability in question is at most N × 1/(3N) = 1/3.
Similarly, Pr[∃x : h(x) ≤ 1/N] can be lowerbounded by the inclusion-exclusion bound.
Lemma 4 (inclusion-exclusion bound)
Pr[A₁ ∨ A₂ ∨ · · · ∨ A_n], the probability that at least one of the events A₁, A₂, . . . , A_n happens,
satisfies
Σ_i Pr[A_i] − Σ_{i<j} Pr[A_i ∧ A_j] ≤ Pr[A₁ ∨ A₂ ∨ · · · ∨ A_n] ≤ Σ_i Pr[A_i].
Using a little more work it can be shown that with probability at least 0.6 the minimum
hash is in the interval [1/(3N), 3/N]. (NB: These calculations can be improved if the hash is
from a 4-wise independent family.) Thus if we repeat with k hashes, the probability that
the majority of the min values are not contained in [1/(3N), 3/N] drops as O(1/k).
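A sketch of such an estimator in Python (my own simplified variant: instead of the interval/majority test described above it just outputs the median of the reciprocals of the k minima; the prime and k are illustrative choices):

    import random

    P = (1 << 31) - 1                  # a prime larger than the universe (illustrative)

    def random_hash():
        a, b = random.randrange(1, P), random.randrange(P)
        return lambda x: (((a * x + b) % P) + 1) / P   # "real-valued" hash in (0, 1]

    def estimate_distinct(stream, k=30):
        # Keep, for each of k hashes, the minimum hash value seen; output the median of 1/min.
        hashes = [random_hash() for _ in range(k)]
        mins = [1.0] * k
        for x in stream:
            for i, h in enumerate(hashes):
                v = h(x)
                if v < mins[i]:
                    mins[i] = v
        return sorted(1.0 / m for m in mins)[k // 2]   # correct up to an O(1) factor

    stream = [random.randrange(10**6) for _ in range(5000)]
    print(len(set(stream)), round(estimate_distinct(stream)))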
Basic idea: Pick a random hash function mapping the underlying universe of elements
to [0, 1]. Define the hash of a set A to be the minimum of h(x) over all x ∈ A. Then
by symmetry, Pr[hash(A) = hash(B)] is exactly the Jaccard similarity |A ∩ B| / |A ∪ B|. (Note that if two
elements x, y are different then Pr[h(x) = h(y)] is 0 when the hash is real-valued. Thus the
only possibility of a collision arises from elements in the intersection of A, B.) Thus one
could pick k random hash functions and take the fraction of instances of hash(A) = hash(B)
as an estimate of the Jaccard similarity. This has the right expectation but we need to repeat
with k different hash functions to get a better estimate.
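A minimal sketch of that estimator (my own illustration; the idealized real-valued random hash is simulated with a salted built-in hash, and k = 200 is arbitrary):

    import random

    def jaccard_estimate(A, B, k=200):
        seeds = [random.random() for _ in range(k)]
        hashes = [lambda x, s=s: hash((s, x)) for s in seeds]   # stand-ins for random hashes
        sigA = [min(h(x) for x in A) for h in hashes]           # min-hash signature of A
        sigB = [min(h(x) for x in B) for h in hashes]
        return sum(a == b for a, b in zip(sigA, sigB)) / k      # fraction of collisions

    A, B = set(range(0, 1000)), set(range(100, 1100))
    print(len(A & B) / len(A | B), jaccard_estimate(A, B))      # true similarity vs. estimate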
The analysis goes as follows. Suppose we are interested in flagging pairs of documents
whose Jaccard-similarity is at least 0.9. Then we compute k hashes and flag the pair if at
least a 0.9 − ε fraction of the hashes collide. Chernoff bounds imply that if k = Ω(1/ε²) this
flags all document pairs that have similarity at least 0.9 and does not flag any pairs with
similarity less than 0.9 − 3ε.
To make this method more realistic we need to replace the idealized random hash func-
tion with a real one and analyse it. That is beyond the scope of this lecture. Indyk showed
that it suffices to use a k-wise independent hash function for k = Ω(log(1/ε)) to let us
estimate Jaccard-similarity up to error ε. Thorup recently showed how to do the estimation
with pairwise independent functions. This analysis seems rather sophisticated; let me know
if you happen to figure it out.
Bibliography
2. Broder, Andrei Z.; Charikar, Moses; Frieze, Alan M.; Mitzenmacher, Michael (1998).
Min-wise independent permutations. Proc. 30th ACM Symposium on Theory of Computing (STOC ’98).
3. Manku, Gurmeet Singh; Das Sarma, Anish (2007). Detecting near-duplicates for web
crawling. Proceedings of the 16th International Conference on World Wide Web, ACM.
Guest lecture by Mark Braverman. Handwritten scribe notes available from course website.
http://www.cs.princeton.edu/courses/archive/fall14/cos521/
Chapter 6
Linear Thinking
According to conventional wisdom, linear thinking describes a thought process that is logical
or step-by-step (i.e., each step must be completed before the next one is undertaken).
Nonlinear thinking, on the other hand, is the opposite of linear: creative, original, capable
of leaps of inference, etc.
From a complexity-theoretic viewpoint, conventional wisdom turns out to be startlingly
right in this case: linear problems are generally computationally easy, and nonlinear prob-
lems are generally not.
Example 3 Solving linear systems of equations is easy. Let’s show solving quadratic sys-
tems of equations is NP-hard. Consider the vertex cover problem, which is NP-hard:
Given a graph G = (V, E) and an integer k, we need to determine if there is a subset of vertices
S of size k such that for each edge {i, j}, at least one of i, j is in S.
We can rephrase this as a problem involving solving a system of nonlinear equations,
where xi = 1 stands for “i is in the vertex cover.”
(1 − x_i)(1 − x_j) = 0   ∀ {i, j} ∈ E
x_i(1 − x_i) = 0   ∀ i ∈ V
Σ_i x_i = k
Not all nonlinear problems are difficult, but the ones that turn out to be easy are
generally those that can leverage linear algebra (eigenvalues, singular value decomposition,
etc.)
In mathematics too linear algebra is simple, and easy to understand. The goal of much
of higher mathematics seems to be to reduce study of complicated (nonlinear!) objects to
study of linear algebra.
2x1 − 3x2 = 5
3x1 + 4x2 = 6
The feasible region has sharp corners; it is a convex region and is called a polytope.
In general, a region of space is called convex if for every pair of points x, y in it, the line
segment joining x, y, namely, {λ · x + (1 − λ) · y : λ ∈ [0, 1]}, lies in the region.
In Linear Programming one is trying to optimize (i.e., maximize or minimize) a linear
function over the set of feasible values. The general form of an LP is
min c^T x   (6.1)
Ax ≥ b   (6.2)
This form is very flexible. To express maximization instead of minimization, just replace
c by −c. To include an inequality of the form a·x ≤ bi just write it as −a·x ≥ −bi . To include
an equation a · x = bi as a constraint just replace with two inequalities a · x ≥ bi , a · x ≤ bi .
Solving LPs: In Figure 6.1 we see the convex feasible region of an LP. The objective
function is linear, so it is clear that the optimum of the linear program is attained at some
vertex of the feasible region. Thus a trivial algorithm to find the optimum is to enumerate
all vertices of the feasible region and take the one with the lowest value of the objective.
This method (sometimes taught in high schools) of graphing the inequalities and their
feasible region does not scale well with n, m. The number of vertices of this feasible region
can grow roughly as m^(n/2) in general. Thus the algorithm is exponential time. The famous
simplex method is a clever method to enumerate these vertices one by one, ensuring that
the objective keeps decreasing at each step. It works well in practice. The first polynomial-
time method to determine feasibility of linear inequalities was only discovered in 1979 by
Khachiyan, a Soviet mathematician. We will discuss the core ideas of this method later in
the course. For now, we just assume polynomial-time solvability and see how to use LP as
a tool.
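Since the course encourages playing with off-the-shelf packages, here is how a tiny LP in the form (6.1)–(6.2) can be solved with SciPy (a sketch with made-up data; scipy.optimize.linprog expects constraints as A_ub x ≤ b_ub, so Ax ≥ b is negated):

    import numpy as np
    from scipy.optimize import linprog

    # minimize c^T x  subject to  Ax >= b, x >= 0   (a toy two-variable instance)
    c = np.array([1.0, 2.0])
    A = np.array([[1.0, 1.0],
                  [2.0, 1.0]])
    b = np.array([4.0, 5.0])

    # rewrite Ax >= b as (-A) x <= -b for linprog
    res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None), (0, None)])
    print(res.x, res.fun)     # an optimal vertex of the feasible region and its value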
Fact: the solution to this LP has the property that all xij variables are either 0 or 1.
(Maybe this will be a future homework.) Thus solving the LP actually solves the assignment
problem.
In general one doesn’t get so lucky: solutions to LPs end up being nonintegral no matter
how hard we pray for the opposite outcome. Next lecture we will discuss what to do if that
happens.2
Example 5 (Diet) You wish to balance meat, sugar, veggies, and grains in your diet. You
have a certain dollar budget and a certain calorie goal. You don’t like these foodstuffs
equally; you can give them a score between 1 and 10 according to how much you like them.
Let lm , ls , lv , lg denote your score for meat, sugar, veggies and grains respectively. Assuming
your overall happiness is given by
m × lm + g × lg + v × lv + s × ls ,
where m, g, v, s denote your consumption of meat, grain, veggies and sugar respectively
(note: this is a modeling assumption about you) then the problem of maximizing your
happiness subject to a dollar and calorie budget is a linear program. 2
Example 6 (`1 regression) This example is from Bob Vanderbei’s book on linear program-
ming. You are given data containing grades in different courses for various students; say
Gij is the grade of student i in course j. (Of course, Gij is not defined for all i, j since
each student has only taken a few courses.) You can try to come up with a model for
explaining these scores. You hypothesize that a student’s grade in a course is determined
by the student’s innate aptitude and by the difficulty of the course. One could try various
functional forms for how the grade is determined by these factors, but the simplest form to
try is linear. Of course, such a simple relationship will not completely explain the data, so
you must allow for some error. This linear model hypothesizes that
G_{ij} = aptitude_i + easiness_j + ε_{ij},
where ε_{ij} is an error term; finding the values aptitude_i, easiness_j that minimize Σ_{ij} |ε_{ij}| is a linear program (this is ℓ₁ regression).
Just as LP is the tool of choice to squeeze out inefficiencies of production and planning,
linear modeling is the bedrock of data analysis in science and even social science.
where Interest rate(T ) denotes say the interest rate at time T , etc. Here α, β may not be
constant and may be probabilistic variables (e.g., a random variable uniformly distributed in
[0.5, 0.8]) since future growth may not be a deterministic function of the current variables.
Often these models are solved (i.e., for α, β in this case) by regression methods related
to the previous example, or more complicated probabilistic inference methods that we may
study later in the course.2
Example 8 (Perceptrons and Support Vector Machines in machine learning) Suppose you
have a bunch of images labeled by whether or not they contain a car. These are data
points of the form (x, y) where x is n-dimensional (n = number of pixels in the image) and
y ∈ {0, 1}, where 1 denotes that the image contains a car. You are trying to train an algorithm
to recognize cars in other unlabeled images. There is a general method called SVM’s that
allows you to find some kind of a linear model. (Aside: such simple linear models don’t
really work for finding cars in images; this is just an illustrative example.) This involves hypothesizing that there
is an unknown set of coefficients α0 , α1 , α2 , . . . , αn such that
Σ_i α_i x_i ≥ α₀ + error_x   if x is an image containing a car,
Σ_i α_i x_i ≤ 0.5 α₀ + error_x   if x does not contain a car,
where error_x is required to be nonpositive for each x. Then finding such α_i’s while minimizing
the sum of the absolute values of the error terms is a linear program. After finding these
α_i’s, given a new image the program tries to predict whether it has a car by just checking
whether Σ_i α_i x_i ≥ α₀ or ≤ 0.5 α₀. (There is nothing magical about the 0.5 gap here; one
usually stipulates a gap or margin between the yes and no cases.)
This technique is related to the so-called support vector machines in machine learning
(and an older model called perceptrons), though we’re dropping a few technical details
(ℓ₂-regression, regularization etc.). Also, in practice it could be that the linear explanation
is a good fit only after you first apply a nonlinear transformation on the x’s. This is the
idea in kernel SVMs. For instance, let z be the vector whose ith coordinate is z_i = φ(x_i) =
exp(−x_i²/2). You then find a linear predictor using the z’s. (How to choose such nonlinear
transformations is an art.)
One reason for the popularity of linear models is that the mathematics is simple, elegant,
and most importantly, efficient. Thus if the number of variables is large, a linear model is
easiest to solve.
A theoretical justification for linear modeling is Taylor expansion, according to which
every “well-behaved” function is expressible as an infinite series of terms involving the derivatives.
Here is the Taylor series for an m-variate function f:
f(x₁, x₂, . . . , x_m) = f(0, 0, . . . , 0) + Σ_i x_i ∂f/∂x_i (0) + Σ_{i₁,i₂} x_{i₁} x_{i₂} ∂²f/∂x_{i₁}∂x_{i₂} (0) + · · · .
If we assume the higher order terms are negligible, we obtain a linear expression.
Whenever you see an article in the newspaper describing certain quantitative relationships
—e.g., the effect of more policing on crime, or the effect of a certain economic policy
on interest rates—chances are it has been obtained via a linear model and ℓ₁
regression (or the related ℓ₂ regression). So don’t put blind faith in those numbers; they
are necessarily rough approximations to the complex behavior of a complex world.
Towards this end, first note that standard arithmetic operations +, −, × run in time
polynomial in the input size (e.g., multiplying two k-bit integers takes time at most O(k 2 )
even using the gradeschool algorithm).
Next, note that by Cramer’s rule for solving linear systems, the numbers produced during
the algorithm are related to the determinant of n × n submatrices of A. For example if A is
invertible then the solution to Ax = b is x = A−1 b, and the i, j entry of A−1 is Cij /det(A),
where C_ij is a cofactor, i.e., ± the determinant of an (n−1)×(n−1) submatrix of A. The determinant of an n×n
matrix whose entries are L-bit integers is at most n! 2^(Ln). This follows from the formula for
the determinant of an n × n matrix, which is
det(A) = Σ_σ sgn(σ) ∏_i A_{iσ(i)},
One of the running themes in this course is the notion of approximate solutions. Of course,
this notion is tossed around a lot in applied work: whenever the exact solution seems hard to
achieve, you do your best and call the resulting solution an approximation. In theoretical
work, approximation has a more precise meaning whereby you prove that the computed
solution is close to the exact or optimum solution in some precise metric. We saw some
earlier examples of approximation in sampling-based algorithms; for instance our hashing-
based estimator for set size. It produces an answer that is whp within (1 + ε) of the true
answer. Today we will see many other examples that rely upon linear programming (LP).
Recall that most NP-hard optimization problems involve finding 0/1 solutions. Using
LP one can find fractional solutions, where the relevant variables are constrained to take
real values in [0, 1].
Recall the example of the assignment problem from last time, which is also a 0/1 problem
(a job is either assigned to a particular factory or it is not) but the LP relaxation magically
produces a 0/1 solution (although we didn’t prove this in class). Whenever the LP produces
a solution in which all variables are 0/1, then this must be the optimum 0/1 solution as well
since it is the best fractional solution, and the class of fractional solutions contains every
0/1 solution. Thus the assignment problem is solvable in polynomial time.
Needless to say, we don’t expect this magic to repeat for NP-hard problems. So the
LP relaxation yields a fractional solution in general. Then we give a way to round the
fractional solutions to 0/1 solutions. This is accompanied by a mathematical proof that the
new solution is provably approximate.
Recall the weighted vertex cover problem: given a graph G = (V, E) with node weights w_i, we
seek a subset of vertices S such that every edge {i, j} has at least one endpoint in S. Furthermore,
we wish to find such a subset of minimum total weight. Let VC_min be this minimum weight.
The following is the LP relaxation:
min Σ_i w_i x_i
0 ≤ x_i ≤ 1   ∀i
x_i + x_j ≥ 1   ∀ {i, j} ∈ E.
Let OPT_f be the optimum value of this LP. It is no more than VC_min since every 0/1
solution (including in particular the 0/1 solution of minimum cost) is also an acceptable
fractional solution.
Applying deterministic rounding, we can produce a new set S: every node i with xi ≥
1/2 is placed in S and every other i is left out of S.
Claim 1: S is a vertex cover.
Reason: For every edge {i, j} we know xi + xj ≥ 1, and thus at least one of the xi ’s is at
least 1/2. Hence at least one of i, j must be in S.
Claim 2: The weight of S is at most 2 OPT_f.
Reason: OPT_f = Σ_i w_i x_i, and we are only picking those i’s for which x_i ≥ 1/2, so w_i ≤ 2 w_i x_i for every i ∈ S.
Thus we have constructed a vertex cover whose cost is within a factor 2 of the optimum
cost even though we don’t know the optimum cost per se.
Exercise: Show that for the complete graph the above method indeed computes a set of
size no better than 2 times OPT_f.
Remark: This 2-approximation was discovered a long time ago, and despite myriad attempts
we still don’t know if it can be improved. Using the so-called PCP Theorems Dinur and
Safra showed (improving a long line of work) that 1.36-approximation is NP-hard. Khot
and Regev showed that computing a (2 − ε)-approximation is UG-hard, which is a new form
of hardness popularized in recent years. The bibliography mentions a popular article on
UG-hardness.
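Putting the relaxation and the deterministic rounding together in code (a sketch with SciPy and toy data; not from the notes):

    import numpy as np
    from scipy.optimize import linprog

    def vertex_cover_2approx(n, edges, weights):
        # Solve the LP relaxation of weighted vertex cover, then round x_i >= 1/2 up.
        A_ub = np.zeros((len(edges), n))
        for k, (i, j) in enumerate(edges):
            A_ub[k, i] = A_ub[k, j] = -1.0        # x_i + x_j >= 1  <=>  -(x_i + x_j) <= -1
        b_ub = -np.ones(len(edges))
        res = linprog(np.array(weights, dtype=float), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(0.0, 1.0)] * n)
        frac = res.x                              # fractional optimum, of value OPT_f
        cover = [i for i in range(n) if frac[i] >= 0.5]
        return cover, res.fun                     # weight(cover) <= 2 OPT_f <= 2 VC_min

    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    print(vertex_cover_2approx(4, edges, weights=[1, 1, 1, 1]))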
max Σ_{j∈J} z_j
0 ≤ x_i ≤ 1   ∀i
0 ≤ z_j ≤ 1   ∀j ∈ J
y_{j1} + y_{j2} ≥ z_j   ∀j ∈ J,
where y_{j1} is shorthand for x_i if the first literal in the jth clause is the ith variable, and
shorthand for 1 − x_i if the literal is the negation of the ith variable. (Similarly for y_{j2}.)
If MAX-2SAT denotes the number of clauses satisfied by the best assignment, then it is
no more than OPT_f, the value of the above LP. Let us apply randomized rounding to the
fractional solution to get a 0/1 assignment. How good is it?
Claim: E[number of clauses satisfied] ≥ (3/4) OPT_f.
We show that the probability that the jth clause is satisfied is at least 3z_j/4, and then
the claim follows by linearity of expectation.
If the clause has a single literal, say x_r, then the probability it gets satisfied is x_r. Since the
LP contains the constraint x_r ≥ z_j, this probability is at least z_j, and hence certainly at least
3z_j/4.
Suppose the clause is x_r ∨ x_s. Then z_j ≤ x_r + x_s, and in fact it is easy to see that
z_j = min{1, x_r + x_s} at the optimum solution: after all, why would the LP not make z_j as
large as allowed; its goal is to maximize Σ_j z_j. The probability that randomized rounding
satisfies this clause is exactly 1 − (1 − x_r)(1 − x_s) = x_r + x_s − x_r x_s.
But x_r x_s ≤ (x_r + x_s)²/4 (prove this!), so we conclude that the probability that clause j
is satisfied is at least z_j − z_j²/4 ≥ 3z_j/4.
Remark: This algorithm is due to Goemans-Williamson, but the original 3/4-approximation
is due to Yannakakis. The 3/4 factor has been improved by other methods to 0.94.
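A sketch of the randomized rounding step itself (my own illustration; literals are encoded as signed variable indices, and the fractional values below simply stand in for an LP solution):

    import random

    def randomized_rounding(x_frac, clauses, trials=100):
        # Set variable i to True with probability x_frac[i]; count satisfied clauses.
        # A literal +i means x_i, -i means its negation (variables numbered from 1).
        def satisfied(assign, clause):
            return any((lit > 0 and assign[lit]) or (lit < 0 and not assign[-lit])
                       for lit in clause)
        best = 0
        for _ in range(trials):
            assign = {i: random.random() < p for i, p in x_frac.items()}
            best = max(best, sum(satisfied(assign, c) for c in clauses))
        return best      # a single rounding satisfies >= (3/4) OPT_f clauses in expectation

    clauses = [(1, 2), (-1, 3), (-2, -3), (2, 3)]
    print(randomized_rounding({1: 0.5, 2: 0.5, 3: 0.5}, clauses))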
The idea is to write an LP. For each endpoint pair i, j that has to be connected and
each edge e = (u, v) we have a variable x^{i,j}_{uv} that is supposed to be 1 if the path from i to j
passes through (u, v), and 0 otherwise. (Note that edges are directed.) Then for each edge
(u, v) we can add a capacity constraint
Σ_{i,j: endpoints} x^{i,j}_{uv} ≤ c_{uv}.
But since we can’t require variables to be 0/1 in an LP, we relax to 0 ≤ x^{i,j}_{uv} ≤ 1. This
allows a path to be split over many paths (this will remind you of network flow if you have
seen it in undergrad courses). Of course, this seems all wrong since avoiding such splitting
was the whole point in the problem! Be patient just a bit more.
Furthermore we need the so-called flow conservation constraints. These say that the
fractional amount of paths leaving i and arriving at j is 1, and that paths never get stranded
in between.
Σ_v x^{ij}_{uv} = Σ_v x^{ij}_{vu}   ∀u ≠ i, j
Σ_v x^{ij}_{uv} − Σ_v x^{ij}_{vu} = 1   u = i
Σ_v x^{ij}_{vu} − Σ_v x^{ij}_{uv} = 1   u = j
Under our hypothesis about the problem, this LP is feasible and we get a fractional
solution {x^{i,j}_{uv}}. These values can be seen as bits and pieces of paths lying strewn about
the network.
Let us first see that neither deterministic rounding nor simple randomized rounding is
a good idea. Consider a node u where x^{ij}_{uv} is 1/3 on three incoming edges and 1/2 on two
outgoing edges. Then deterministic rounding would round the incoming edges down to 0 and the
outgoing edges up to 1, creating a bad situation where the path never enters u but leaves it on
two edges! Simple randomized rounding will also create a similar bad situation with Ω(1)
(i.e., constant) probability. Clearly, it would be much better to round along entire paths
instead of piecemeal.
Flow decomposition: For each endpoint pair i, j we create a finite set of paths p₁, p₂, . . .
from i to j, as well as associated weights w_{p₁}, w_{p₂}, . . . that lie in [0, 1] and sum up to 1.
Furthermore, for each edge (u, v): x^{i,j}_{u,v} = sum of weights of all paths among these that
contain (u, v).
Flow decomposition is easily accomplished via depth first search. Just repeatedly find a
path from i to j in the weighted graph defined by the x^{ij}_{uv}’s: the flow conservation constraints
imply that this path can leave every vertex it arrives at, except possibly at j. After you
find such a path from i to j, subtract from all edges on it the minimum x^{ij}_{uv} value along the
path. This ensures that at least one x^{ij}_{uv} gets zeroed out at every step, so the process is
finite.
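A sketch of this decomposition for a single endpoint pair (my own illustration; x maps directed edges (u, v) to fractional values, and the flow is assumed to satisfy the conservation constraints and to contain no cycles):

    def flow_decomposition(x, i, j):
        # Decompose a unit (i -> j) fractional flow into weighted paths.
        x = {e: v for e, v in x.items() if v > 1e-12}     # work on a copy
        paths = []
        while any(tail == i for (tail, head) in x):       # while some flow still leaves i
            path, u = [], i
            while u != j:                                  # conservation lets us leave every vertex
                v = next(head for (tail, head) in x if tail == u)
                path.append((u, v))
                u = v
            w = min(x[e] for e in path)                    # bottleneck weight of this path
            paths.append((path, w))
            for e in path:                                 # subtract; at least one edge zeroes out
                x[e] -= w
                if x[e] <= 1e-12:
                    del x[e]
        return paths                                       # weights sum to 1 for a unit flow

    print(flow_decomposition({("i", "a"): 0.5, ("a", "j"): 0.5,
                              ("i", "b"): 0.5, ("b", "j"): 0.5}, "i", "j"))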
Randomized rounding: For each endpoint pair i, j pick a path from the above decom-
position randomly by picking it with probability proportional to its weight.
Part 1: We show that this satisfies the edge capacities approximately.
This follows from Chernoff bounds. The expected number of paths that use an edge
(u, v) is
Σ_{i,j: endpoints} x^{i,j}_{u,v}.
The LP constraint says this is at most c_{uv}, and since c_{uv} > d log n this is a sum of at least
d log n random variables. Chernoff bounds (see our earlier lecture) imply that this is at most
(1 + ε) times its expectation for all edges, with high probability. Chernoff bounds similarly
imply that the overall number of paths is pretty close to k.
Part 2: We show that in expectation, a (1 − 1/e) fraction of endpoint pairs get connected
by paths. Consider any endpoint pair. Suppose they are connected by t fractional paths
p₁, p₂, . . . with weights w₁, w₂, . . .. Then Σ_i w_i = 1 since the endpoints were fractionally
connected. The probability that the randomized rounding rounds all these paths down
to 0 is
∏_i (1 − w_i) ≤ (Σ_i (1 − w_i)/t)^t   (Geometric mean ≤ Arithmetic mean)
= (1 − 1/t)^t ≤ 1/e.
The downside of this rounding is that some of the endpoint pairs may end up with zero
paths, whereas others may end up with 2 or more. We can of course discard extra paths.
(There are better variations of this approximation but covering them is beyond the scope
of this lecture.)
Remark: We have only computed the expectation here, but one can check using Markov’s
inequality that the algorithm gets arbitrarily close to this expectation with probability at
least 1/n (say).
Chapter 8
Decision-making under uncertainty: Part 1
This lecture is an introduction to decision theory, which gives tools for making rational
choices in face of uncertainty. It is useful in all kinds of disciplines from electrical engineering
to economics. In computer science, a compelling setting to consider is an autonomous
vehicle or robot navigating in a new environment. It may have some prior notions about
the environment but inevitably it encounters many different situations and must respond
to them. The actions it chooses (drive over the object on the road or drive around it?)
changes the set of future events it will see, and thus its choice of the immediate action must
necessarily take into account the continuing effects of that choice far into the future. You
can immediately see that the same issues arise in any kind of decision-making in real life:
save your money in stocks or bonds; go to grad school or get a job; marry the person you
are dating now, or wait a few more years?
Of course, italicized terms in the previous paragraph are all very loaded. What is a
rational choice? What is “uncertainty”? In everyday life uncertainty can be interpreted in
many ways: risk, ignorance, probability, etc.
Decision theory suggests some answers —perhaps simplistic, but a good start. The first
element of this theory is its probabilistic interpretation of uncertainty: there is a probability
distribution on future events that the decision maker is assumed to know. The second
element is quantifying “rational choice.” It is assumed that each outcome has some utility
to the decision-maker, which is a number. The decision-making is said to be rational if it
maximises the expected utility.
Example 9 Say your utility involves job satisfaction quantified in some way. If you decide
to go for a PhD the distribution of your utility is given by random variable X0 . If you
decide to take a job instead, your return is a random variable X1 . Decision theory assumes
that you (i.e.,the decision-maker) know and understand these two random variables. You
choose to get a PhD if E[X0 ] > E[X1 ].
Example 10 17th century mathematician Blaise Pascal’s famous wager is an early example
of an argument recognizable as modern decision theory. He tried to argue that it is the
rational choice for humans to believe in God (he meant Christian god, of course). If you
choose to be a disbeliever and sin all your life, you may have infinite loss if God exists
(eternal damnation). If you choose to believe and live your life in virtue, and God doesn’t
exist it is all for naught. Therefore if you think that the probability that God exists is
nonzero, you must choose to live as a believer to avoid an infinite expected loss. (Aside:
how convincing is this argument to you?) 2
We will not go into a precise definition of utility (wikipedia moment) but illustrate it
with an example. You can think of it as a quantification of “satisfaction.” In computer
science we also use payoff, reward etc.
Example 11 (Meaning of utility) You have bought a cake. On any single day, if you eat
x percent of the cake your utility is √x. (This happiness is sublinear because the 5th bite
of the cake brings less happiness than the first.) The cake reaches its expiration date in 5
days and if any is still left at that point you might as well finish it (since there is no payoff
from throwing away cake).
What schedule of cake eating will maximise your total utility over 5 days? Your optimal
choice is to eat 20% of the cake each day, since it yields a payoff of 5 × √20, which is a
lot more than any of the alternatives. For instance, eating it all on day 1 would produce a
much lower payoff of √(5 × 20) = 10.
This example is related to Modigliani’s Life cycle hypothesis, which suggests that con-
sumers consume wealth in a way that evens out consumption over their lifetime. (For
instance, it is rational to take a loan early in life to get an education or buy a house, be-
cause it lets you enjoy a certain quality of life, and pay for it later in life when your earnings
are higher.)
In our class discussion some of you were unconvinced about the axiom about maximising
expected utility. (And the existence of lotteries in real life suggests you are on to something.)
Others objected that one doesn’t truly know —at least very precisely—the distribution of
outcomes, as in the PhD vs job example. Very true. (The financial crash of 2008 relates
to some of this, but that’s a story for another day.) It is important to understand the
limitations of this powerful theory.
Example 12 (Cake eating revisited) Let’s now complicate the cake-eating problem. In
addition to the expiration date, your decision must contend with actions of your housemates,
who tend to eat small amounts of cake when you are not looking. On each day with
probability 1/2 they eat 10% of the cake.
Assume that each day the amount you eat as a percentage of the original is a multiple
of 10. You have to compute the cake eating schedule that maximises your expected utility.
Now you can draw a tree of depth 5 that describes all possible outcomes. (For instance
the first level consists of an 11-way choice between eating 0%, 10%, . . . , 100%.) Computing
your optimum cake-eating schedule is a simple dynamic programming over this tree. 2
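The dynamic program is short enough to write down explicitly; here is a sketch for this particular example (assumptions, matching the example: 5 days, 10% granularity, housemates eat 10% with probability 1/2 after each of your moves, and whatever is left on the last day is eaten):

    from functools import lru_cache
    from math import sqrt

    DAYS = 5

    @lru_cache(maxsize=None)
    def value(day, left):
        # Maximum expected utility from 'day' onward with 'left' percent of cake remaining.
        if day == DAYS:
            return sqrt(left)                     # expiration day: finish whatever is left
        best = 0.0
        for eat in range(0, left + 10, 10):       # eat a multiple of 10 percent
            remaining = left - eat
            # housemates eat 10% with probability 1/2 (never more than what is left)
            future = 0.5 * value(day + 1, remaining) + \
                     0.5 * value(day + 1, max(remaining - 10, 0))
            best = max(best, sqrt(eat) + future)
        return best

    print(value(1, 100))                          # optimal expected utility over the 5 days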
The above cake-eating examples can be seen as a metaphor for all kinds of decision-
making in life: e.g., how should you spend/save throughout your life to maximize overall
happiness1 ?
Decision choice theory says that all such decisions can be made by an appropriate
dynamic programming over some tree. Say you think of time as discrete and you have
a finite choice of actions at each step: say, two actions labeled 0 and 1. In response the
environment responds with a coin toss. (In cake-eating if the coin comes up heads, 10%
of the cake disappears.) Then you receive some payoff/utility, which is a real number, and
depends upon the sequence of T moves made so far. If this goes on for T steps, we can
represent this entire game as a tree of depth T .
Then the best decision at each step involves a simple dynamic programming where the
operation at each action node is max and the operation at each probabilistic node is average.
If the node is a leaf it just returns its value. Note that this takes time exponential in T.
Interestingly, dynamic programming was invented by R. Bellman in this decision-theory
context. (If you ever wondered what the “dynamic”in dynamic programming refers to, well
now you know. Check out wikipedia for the full story.) The dynamic programming is also
related to the game-theoretic notion of backwards induction.
The cake example had a finite horizon of 5 days and often such a finite horizon is imposed
on the problem to make it tractable.
But one can consider a process that goes on forever and still make it tractable using
discounted payoffs. The payoff is being accumulated at every step, but the decision-maker
discounts the value of payoffs at time t by γ^t, where γ is the discount factor. This notion is
based upon the observation that most people, given a choice between getting 10 dollars now
versus 11 a year from now, will choose the former. This means that they discount payoffs
made a year from now by 10/11 at least.
Since γ^t → 0 as t gets large, discounting ensures that payoffs obtained a long time from
now are perceived as almost zero. Thus it is a “soft” way to impose a finite horizon.
Aside: Children tend to be fairly shortsighted in their decisions, and don’t understand
the importance of postponement of gratification. Is growing up a process of adjusting your
γ to a higher value? There is evidence that people are born with different values of γ, and
this is known to correlate with material success later in life. (See the wikipedia page on the
Stanford marshmallow experiment.)
An MDP consists of a finite set of states; in each state the decision-maker has a finite set of actions it is allowed to take. (For example, a state for an autonomous vehicle
could be defined using a finite set of variables: its speed, what lane it is in, whether or not
there is a vehicle in front/back/left/right, whether or not one of them is getting closer at
a fast rate.) Upon taking an action the decision-maker gets a reward and then “nature”or
“chance”transitions him probabilistically to another state. The optimal policy is defined as
one that maximises the total reward (or discounted reward).
For simplicity assume the set of states is labeled by integers 1, . . . , n, the possible actions
in each state are 0/1. For each action b there is a probability p(i, b, j) of transitioning to
state j if this action is taken in that state. Such a transition brings an immediate reward
of R(i, b, j). Note that this process goes forever; the decision-maker keeps taking actions,
which affect the sequence of states it passes through and the rewards it gets.
The name Markov: This refers to the memoryless aspect of the above setup: the reward
and transition probabilities do not depend upon the past history.
Example 13 If the decision-maker always takes action 0 and s1 , s2 , . . . , are the random
variables denoting the states it passes through, then its total reward is
Σ_{t=1}^∞ R(s_t, 0, s_{t+1}).
Furthermore, the distribution of st is completely determined (as described above) given st−1
(i.e., we don’t need to know the earlier sequence of states that were visited).
This sum of rewards is typically going to be infinite, so if we use a discount factor γ
then the discounted reward of the above sequence is

    ∑_{t=1}^{∞} γ^t R(s_t, 0, s_{t+1}).
Clearly this converges. Under some technical condition it can be shown that the optimum
policy is history-independent.³
To compute the rewards from the optimum policy one ignores transient effects as the
random walk settles down, and looks at the final steady state. This computation can be
done via linear programming.
Let V_i be the expected reward of following the optimum policy if one starts in state i.
In the first step the policy takes action π(i) ∈ {0, 1}, and transitions to another state j.
Then the subpolicy that kicks in after this transition must also be optimal, though its
contribution is attenuated by γ. So V_i must satisfy

    V_i = ∑_{j=1}^{n} p(i, π(i), j) (R(i, π(i), j) + γ V_j).        (8.1)
Thus if the allowed actions are 0, 1 the optimum policy must satisfy:

    V_i ≥ ∑_{j=1}^{n} p(i, 0, j) (R(i, 0, j) + γ V_j),

and

    V_i ≥ ∑_{j=1}^{n} p(i, 1, j) (R(i, 1, j) + γ V_j).
³ This condition has to do with the ergodicity of the MDP. For each fixing of the policy the MDP turns
into a simple random walk on the state space. One needs this walk to converge to a stationary distribution
in which each state i appears during the walk some p_i fraction of the time.
The objective is to minimize ∑_i V_i subject to the above constraints. (Note that the
constraints for other states will have V_i on the other side of the inequality, which will
constrain it also.) So the LP is really solving for

    V_i = max_{b∈{0,1}} ∑_{j=1}^{n} p(i, b, j) (R(i, b, j) + γ V_j).
After solving the LP one has to look at which of the above two inequalities involving Vi is
tight to figure out whether the optimum action π(i) is 0 or 1.
In practice solving via LP is considered too slow (since the number of states could be
100,000 or more) and iterative methods are used instead. We’ll see some iterative methods
later in the course in other contexts.
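For concreteness, here is a minimal numpy sketch of one such iterative method, value iteration, which simply applies the max/expectation recurrence above as a repeated update instead of solving the LP. The transition probabilities and rewards below are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
n, gamma = 10, 0.9
p = rng.random((n, 2, n))
p /= p.sum(axis=2, keepdims=True)        # p[i,b,j]: each row is a distribution (a random MDP)
R = rng.random((n, 2, n))                # R[i,b,j]: immediate rewards

V = np.zeros(n)
for _ in range(1000):
    # Q[i,b] = sum_j p(i,b,j) * (R(i,b,j) + gamma * V_j)
    Q = (p * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=1)                # take the better of the two actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)                # pi(i): whichever action achieves the max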
Bibliography
(Today’s notes below are largely lifted with minor modifications from a survey by Arora,
Hazan, Kale in Theory of Computing journal, Volume 8 (2012), pp. 121-164.)
Today we study decision-making under total uncertainty: there is no a priori distri-
bution on the set of possible outcomes. (This line will cause heads to explode among
devout Bayesians, but it makes sense in many computer science settings. One reason is
computational complexity or general lack of resources: the decision-maker usually lacks the
computational power to construct the tree of all exp(T ) outcomes possible in the next T
steps, and the resources to do enough samples/polls/surveys to figure out their distribution.
Or the algorithm designer may not be a Bayesian.)
Such decision-making (usually done with efficient algorithms) is studied in the field of
online computation, which takes the view that the algorithm is responding to a sequence of
requests that arrive one by one. The algorithm must take an action as each request arrives,
and it may discover later, after seeing more requests, that its past actions were suboptimal.
But past actions cannot be changed.
See the book by Borodin and El-Yaniv for a fuller introduction to online algorithms.
This lecture and the next cover one such success story: an online optimization tool called
the multiplicative weight update method. The power of the method arises from the very
minimalistic assumptions, which allow it to be plugged into various settings (as we will do
in next lecture).
Imagine a stock whose daily price movement we model as a sequence
of binary events: up/down. (Below, this will be generalized to allow non-binary events.)
Each morning we try to predict whether the price will go up or down that day; if our
prediction happens to be wrong we lose a dollar that day, and if it’s correct, we lose nothing.
The stock movements can be arbitrary and even adversarial.¹ To balance out this
pessimistic assumption, we assume that while making our predictions, we are allowed to
watch the predictions of n “experts”. These experts could be arbitrarily correlated, and
they may or may not know what they are talking about. The algorithm’s goal is to limit its
cumulative losses (i.e., bad predictions) to roughly the same as the best of these experts. At
first sight this seems an impossible goal, since it is not known until the end of the sequence
who the best expert was, whereas the algorithm is required to make predictions all along.
For example, the first algorithm one thinks of is to compute each day’s up/down pre-
diction by going with the majority opinion among the experts that day. But this algorithm
doesn’t work because a majority of experts may be consistently wrong on every single day,
while some single expert in this crowd happens to be right every time.
The weighted majority algorithm corrects the trivial algorithm. It maintains a weighting
of the experts. Initially all have equal weight. As time goes on, some experts are seen as
making better predictions than others, and the algorithm increases their weight proportion-
ately. The algorithm’s prediction of up/down for each day is computed by going with the
opinion of the weighted majority of the experts for that day.
1. Make the prediction that is the weighted majority of the experts' predictions based
on the weights w_1^{(t)}, . . . , w_n^{(t)}. That is, predict “up” or “down” depending on which
prediction has a higher total weight of experts advising it (breaking ties arbitrarily).

2. For every expert i who predicts wrongly, decrease his weight for the next round by
multiplying it by a factor of (1 − η): set w_i^{(t+1)} = (1 − η) w_i^{(t)}.
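As a quick illustration, here is a minimal Python sketch of these two steps, run against synthetic expert predictions and outcomes (both are random placeholders standing in for a real prediction task).

import numpy as np

rng = np.random.default_rng(1)
n, T, eta = 20, 500, 0.1
w = np.ones(n)                             # all experts start with weight 1
alg_mistakes, expert_mistakes = 0, np.zeros(n)

for t in range(T):
    preds = rng.integers(0, 2, size=n)     # experts' up/down predictions
    outcome = rng.integers(0, 2)           # what actually happened
    # Step 1: predict with the weighted majority of the experts.
    our_pred = 1 if w[preds == 1].sum() >= w[preds == 0].sum() else 0
    alg_mistakes += int(our_pred != outcome)
    # Step 2: multiply the weight of every wrong expert by (1 - eta).
    wrong = preds != outcome
    expert_mistakes += wrong
    w[wrong] *= 1 - eta

print(alg_mistakes, expert_mistakes.min())  # compare with the bound in Theorem 5 below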
Theorem 5
After T steps, let m_i^{(T)} be the number of mistakes of expert i and M^{(T)} be the number of
mistakes our algorithm has made. Then we have the following bound for every i:

    M^{(T)} ≤ 2(1 + η) m_i^{(T)} + (2 ln n)/η.

In particular, this holds for i which is the best expert, i.e. the one having the least m_i^{(T)}.
¹ Note that finance experts have studied stock movements for over a century and there are all kinds
of stochastic models fitted to them. But we are doing computer science here, and we will see that this
adversarial view will help us apply the same idea to a variety of other settings.
Proof: A simple induction shows that w_i^{(t+1)} = (1 − η)^{m_i^{(t)}}. Let Φ^{(t)} = ∑_i w_i^{(t)} (“the
potential function”). Thus Φ^{(1)} = n. Each time we make a mistake, the weighted majority
of experts also made a mistake, so at least half the total weight decreases by a factor 1 − η.
Thus, the potential function decreases by a factor of at least (1 − η/2):

    Φ^{(t+1)} ≤ Φ^{(t)} (1/2 + (1/2)(1 − η)) = Φ^{(t)} (1 − η/2).

Thus simple induction gives Φ^{(T+1)} ≤ n(1 − η/2)^{M^{(T)}}. Finally, since Φ^{(T+1)} ≥ w_i^{(T+1)} for
all i, the claimed bound follows by comparing the above two expressions and using the fact
that − ln(1 − η) ≤ η + η² since η < 1/2. □
The beauty of this analysis is that it makes no assumption about the sequence of events:
they could be arbitrarily correlated and could even depend upon our current weighting of
the experts. In this sense, the algorithm delivers more than initially promised, and this lies
at the root of why (after obvious generalization) it can give rise to the diverse algorithms
mentioned earlier. In particular, the scenario where the events are chosen adversarially
resembles a zero-sum game, which we will study in a future lecture.
Now let's slightly change notation: let m_i^{(t)} be 1 if expert i makes a wrong prediction at
time t and 0 otherwise. (Thus m_i^{(t)} is the cost incurred by this expert at that time.) Then the
probability the algorithm makes a mistake at time t is simply ∑_i p_i^{(t)} m_i^{(t)}, which we will
write as the inner product of the m and p vectors: m^{(t)} · p^{(t)}. Thus the expected number
of mistakes by our algorithm at the end is

    ∑_{t=0}^{T−1} m^{(t)} · p^{(t)}.
Now let's compute the change in potential Φ^{(t)} = ∑_i w_i^{(t)}:

    Φ^{(t+1)} = ∑_i w_i^{(t+1)}
             = ∑_i w_i^{(t)} (1 − η m_i^{(t)})
             = Φ^{(t)} − η Φ^{(t)} ∑_i m_i^{(t)} p_i^{(t)}
             = Φ^{(t)} (1 − η m^{(t)} · p^{(t)})
             ≤ Φ^{(t)} exp(−η m^{(t)} · p^{(t)}).
Note that this potential drop is not a random variable; it is a deterministic quantity
that depends only on the loss vector m(t) and the current expert weights (which in turn are
determined by the loss vectors of the previous steps).
We conclude by induction that the final potential is at most

    Φ^{(0)} ∏_{t=0}^{T} exp(−η m^{(t)} · p^{(t)}) = Φ^{(0)} exp(−η ∑_t m^{(t)} · p^{(t)}).

For each i this final potential is at least the final weight of the ith expert, which is

    ∏_t (1 − η m_i^{(t)}) ≥ (1 − η)^{∑_t m_i^{(t)}}.

Thus taking logs and using that − log(1 − η) ≤ η(1 + η), we conclude that ∑_{t=0}^{T−1} m^{(t)} · p^{(t)}
(which is also the expected number of mistakes by our algorithm) is at most (1 + η) times
the number of mistakes by expert i, plus the same old additive factor 2 log n/η.
The idea, as before, is to penalize the costly decisions by decreasing their probability of being
picked, and to reward the good ones by increasing their probability of being picked in the next round (hence the
multiplicative weight update rule).
Intuitively, being in complete ignorance about the decisions at the outset, we select them
uniformly at random. This maximum entropy starting rule reflects our ignorance. As we
learn which ones are the good decisions and which ones are bad, we lower the entropy to
reflect our increased knowledge. The multiplicative weight update is our means of skewing
the distribution.
We now set up some notation. Let t = 1, 2, . . . , T denote the current round, and let i
be a generic decision. In each round t, we select a distribution p(t) over the set of decisions,
and select a decision i randomly from it. At this point, the costs of all the decisions are
revealed by nature in the form of the vector m(t) such that decision i incurs cost mi (t) .
We assume that the costs lie in the range [−1, 1]. This is the only assumption we make on
the costs; nature is completely free to choose the cost vector as long as these bounds are
respected, even with full knowledge of the distribution that we choose our decision from.
The expected cost to the algorithm for sampling a decision i from the distribution p^{(t)}
is

    E_{i∼p^{(t)}}[m_i^{(t)}] = m^{(t)} · p^{(t)}.

The total expected cost over all rounds is therefore ∑_{t=1}^{T} m^{(t)} · p^{(t)}. Just as before, our
goal is to design an algorithm which achieves a total expected cost not too much more
than the cost of the best decision in hindsight, viz. min_i ∑_{t=1}^{T} m_i^{(t)}. Consider the following
algorithm, which we call the Multiplicative Weights Algorithm. This algorithm has been
studied before as the prod algorithm of Cesa-Bianchi, Mansour, and Stoltz.
1. Choose decision i with probability proportional to its weight w_i^{(t)}. I.e., use the dis-
tribution over decisions p^{(t)} = {w_1^{(t)}/Φ^{(t)}, . . . , w_n^{(t)}/Φ^{(t)}} where Φ^{(t)} = ∑_i w_i^{(t)}.

3. Penalize the costly decisions by updating their weights as follows: for every decision
i, set

    w_i^{(t+1)} = w_i^{(t)} (1 − η m_i^{(t)}).        (9.2)
The following theorem —completely analogous to Theorem 5— bounds the total ex-
pected cost of the Multiplicative Weights algorithm (given in Figure 9.1) in terms of the
total cost of the best decision:
Theorem 6
Assume that all costs m_i^{(t)} ∈ [−1, 1] and η ≤ 1/2. Then the Multiplicative Weights algo-
rithm guarantees that after T rounds, for any decision i, we have

    ∑_{t=1}^{T} m^{(t)} · p^{(t)} ≤ ∑_{t=1}^{T} m_i^{(t)} + η ∑_{t=1}^{T} |m_i^{(t)}| + (ln n)/η.
Note that we have not addressed the optimal choice of η thus far. Firstly, it should
be small enough that all calculations in the analysis hold, say η · m_i^{(t)} ≤ 1/2 for all
i, t. Typically this is done by rescaling the payoffs to lie in [−1, 1], which means that
∑_{t=1}^{T} |m_i^{(t)}| ≤ T. Then setting η ≈ √(ln n/T) gives the tightest upper bound on the right
hand side in Theorem 6, by reducing the additive error to about √(T ln n). Of course, this
is a safe choice; in practice the best η depends upon the actual sequence of events, but of
course those are not known in advance.
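Putting the pieces together, here is a minimal numpy sketch of the Multiplicative Weights algorithm with the η just discussed; the cost vectors below are random placeholders standing in for whatever nature chooses.

import numpy as np

rng = np.random.default_rng(2)
n, T = 50, 2000
eta = np.sqrt(np.log(n) / T)          # the choice of eta discussed above
w = np.ones(n)
total_cost, cum_cost = 0.0, np.zeros(n)

for t in range(T):
    p = w / w.sum()                   # the distribution p^(t)
    m = rng.uniform(-1, 1, size=n)    # cost vector m^(t) in [-1,1], revealed by nature
    total_cost += p @ m               # expected cost of the algorithm this round
    cum_cost += m
    w *= 1 - eta * m                  # the update rule (9.2)

print(total_cost, cum_cost.min())     # algorithm's cost vs. best decision in hindsight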
bibliography

S. Arora, E. Hazan, S. Kale. The multiplicative weights update method: A meta-algorithm
and its applications. Theory of Computing, Volume 8 (2012), pp. 121–164.

A. Borodin and R. El-Yaniv. Online Computation and Competitive Analysis. Cambridge
University Press, 1998.
Chapter 10
Applications of multiplicative
weight updates: LP solving,
Portfolio Management
Today we see how to use the multiplicative weight update method to solve other problems.
In many settings there is a natural way to make local improvements that “make sense.” The
multiplicative weight updates analysis from last time (via a simple potential function) allows
us to understand and analyse the net effect of such sensible improvements. (Formally, what
we are doing in many settings is analysing an algorithm called gradient descent which we’ll
encounter more formally later in the course.)
system 1

    a_1 · x ≥ b_1
    a_2 · x ≥ b_2
        ⋮
    a_m · x ≥ b_m
    x_i ≥ 0        ∀i = 1, 2, . . . , n
    ∑_i x_i = 1.
In your high school you learnt the “graphical” method to solve linear inequalities, and
as we discussed in Lecture 6, those methods can take m^{n/2} time. Here we design an algorithm that,
given an error parameter ε > 0, runs in O(mL/ε) time and either tells us that the original
system is infeasible, or gives us a solution x satisfying the last two lines of the above system,
and
aj · x ≥ bj − ε ∀j = 1, . . . , m.
(Note that this allows the possibility that the system is infeasible per se and nevertheless
the algorithm returns such an approximate solution. In that case we have to be happy with
the approximate solution.) Here L is an instance-specific parameter that will be clarified
below; roughly speaking it is the maximum absolute value of any coefficient. (Recall that
the dependence would need to be poly(log L) to be considered polynomial time. We will
study such a method later on in the course.)
What is a way to certify to somebody that the system is infeasible? The following is
sufficient: Come up with a system of nonnegative weights w1 , w2 , . . . , wm , one per inequality,
such that the following linear program has a negative value:
system 2

    max ∑_j w_j (a_j · x − b_j)
    x_i ≥ 0        ∀i = 1, 2, . . . , n
    ∑_i x_i = 1.
Note: the wj ’s are fixed constants. So this linear program has only two nontrivial constraints
(not counting the constraints xi ≥ 0) so it is trivial to find a solution quickly, as we saw in
class.
This method of certifying infeasibility is eminently sensible and the weighting of in-
equalities is highly reminiscent of the weighting of experts in the last lecture. So we can try
to leverage it into a precise algorithm. It will have the following guarantee: (a) either it
finds a set of nonnegative weights certifying infeasibility, or (b) it finds a solution x^{(f)} that
approximately satisfies the system, in that a_j · x^{(f)} − b_j ≥ −ε for all j. Note that conditions (a) and
(b) are not disjoint; if a system satisfies both conditions, the algorithm may do either (a) or
(b).
We use the meta theorem on MW (Theorem 2) from Lecture 8, where experts have
positive or negative costs (where negative costs can be seen as payoffs) and the algorithm
seeks to minimize costs by adaptively decreasing the weights of experts with larger cost.
The meta theorem says that the algorithm’s payoff over many steps tracks —within (1 + ε)
multiplicative factor—the cost incurred by the best player, plus an additive term O(log n/ε).
We identify m “experts,” one per inequality. We maintain a weighting of the experts, with
w_1^{(t)}, w_2^{(t)}, . . . , w_m^{(t)} denoting the weights at step t. (At t = 0 all weights are 1.) Solve
system 2 using these weights. If it turns out to have a negative value, we have proved the
infeasibility of system 1 and can HALT right away. Otherwise take any solution, say x^{(t)},
and think of it as imposing a “cost” of m_j^{(t)} = a_j · x^{(t)} − b_j on the jth expert. (In particular,
the objective of system 2 is merely —up to scaling by the sum of weights— the expected
cost for our MW algorithm, and it is positive.) Thus the MW update rule will update the
experts' weights as:

    w_j^{(t+1)} ← w_j^{(t)} (1 − η m_j^{(t)}).
We continue thus for some number T of steps and if we never found a certificate of the
infeasibility of system 1 we output the solution x^{(f)} = (1/T)(x^{(1)} + x^{(2)} + · · · + x^{(T)}), which is the
average of all the solution vectors found at the various steps. Now let L denote the maximum
possible absolute value of any a_j · x − b_j subject to the final two lines of system 2.
Claim: If T > L² log n/ε² then x^{(f)} satisfies a_j · x^{(f)} − b_j ≥ −ε for all j.
The proof involves the MW meta theorem, which requires us to rescale (multiplying by
1/L) so all costs lie in [−1, 1], and to set the MW step size to about √(log n/T).
We wish to make T large enough so that the per-step additive error √(log n/T) < ε/L,
which implies T > L² log n/ε².
Then we can reason as follows: (a) The expected per-step cost of the MW algorithm
was positive (in fact it was positive in each step). (b) The quantity aj · x(f ) − bj is simply
the average cost for expert j per step. (c) The total number of steps is large enough that
our MW theorem says that (a) cannot be ε more than (b).
Here is another intuitive explanation that suggests why this algorithm makes sense
independent of the experts idea. Vectors x^{(1)}, x^{(2)}, . . . , x^{(T)} represent simplistic attempts
to find a solution to system 1. If a_j · x^{(t)} − b_j is positive (resp., negative) this means
that the jth constraint was satisfied (resp., unsatisfied) and thus designating it as a cost
(resp., reward) ensures that the constraint is given less (resp., more) weight in the next
round. Thus the multiplicative update rule is a reasonable way to search for a weighting of
constraints that gives us the best shot at proving infeasibility.
Remarks: See the AHK survey on multiplicative weights for the history of this algorithm,
which is actually a quantitative version of an older algorithm called Lagrangian relaxation.
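Here is a minimal numpy sketch of this algorithm, under my own choice of constants (L is taken to be the largest |a_j · x − b_j| over the vertices of the simplex, as in the claim above). Note that with fixed weights, system 2 is maximized at a vertex of the simplex, so the inner step is a single argmax.

import numpy as np

def mw_feasibility(A, b, eps):
    """A has rows a_j; we seek x >= 0, sum(x) = 1 with a_j.x >= b_j - eps for all j,
    or a nonnegative weighting of the inequalities certifying infeasibility."""
    m, n = A.shape
    L = np.abs(A - b[:, None]).max()            # max |a_j.x - b_j| over simplex vertices
    T = int(np.ceil(L * L * np.log(m) / eps**2)) + 1
    eta = np.sqrt(np.log(m) / T)
    w = np.ones(m)
    xs = []
    for _ in range(T):
        c = w @ A                                # objective coefficients of system 2
        i_star = int(np.argmax(c))
        if c[i_star] - w @ b < 0:                # system 2 has negative value:
            return "infeasible", w               # the weights certify infeasibility
        x = np.zeros(n)
        x[i_star] = 1.0                          # an optimal vertex of the simplex
        xs.append(x)
        cost = (A @ x - b) / L                   # cost of each "expert", rescaled to [-1,1]
        w *= 1 - eta * cost                      # the MW update
    return "approx-feasible", np.mean(xs, axis=0)   # x^(f), the average of the iterates

The averaged point returned in the second case plays the role of x^{(f)} in the analysis above.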
Now suppose there are n stocks (e.g., the 500 stocks in the S&P 500) and you wish to manage an investment portfolio
using them. You wish to do at least as well as the best stock in hindsight, and also better
than index funds, which keep a fixed proportion of wealth in each stock. Let c_i^{(t)} be the
price of stock i at the end of day t.
If you have P_i^{(t)} fraction of your wealth invested in stock i then on the tth day your
portfolio will rise in value by a multiplicative factor ∑_i P_i^{(t)} c_i^{(t)}/c_i^{(t−1)}. Looks familiar?
Let r_i^{(t)} be shorthand for c_i^{(t)}/c_i^{(t−1)}.
If you invested all your money in stock i on day 0 then the rise in wealth at the end is

    c_i^{(T)}/c_i^{(0)} = ∏_{t=0}^{T−1} r_i^{(t)}.
Since log ab = log a + log b this gives us the idea to set up the MW algorithm as follows.
We run it by looking at n imagined experts, each corresponding to one of the stocks. The
payoff for expert i on day t is log r_i^{(t)}. Then as noted above, the total payoff for expert i
over all days is ∑_t log r_i^{(t)} = log(c_i^{(T)}/c_i^{(0)}). This is simply the log of the multiplicative
factor by which our wealth would increase in T days if we had just invested all of it in stock
i on the first day. (This is the jackpot we are shooting for: imagine the money we could
have made if we'd put all our savings in Google stock on the day of its IPO.)
Our algorithm plays the canonical MW strategy from last lecture with a suitably small η
and with the probability distribution P (t) on experts at time t being interpreted as follows:
P_i^{(t)} is the fraction of wealth invested in stock i at the start of day t. Thus we are no longer
thinking of picking one expert to follow at each time step; the distribution on experts is the
way of splitting our money into the n stocks. In particular on day t our portfolio increases
in value by a factor ∑_i P_i^{(t)} · r_i^{(t)}.
Note that we are playing the MW strategy that involves maximising payoffs, not mini-
mizing costs. (That is, increase the weight of experts if they get positive payoff; and reduce
weight in case of negative payoff.) The MW theorem says that the total payoff of the MW
strategy, namely ∑_t ∑_i P_i^{(t)} · log r_i^{(t)}, is at least (1 − ε) times the payoff of the best expert provided T is
large enough.
It only remains to make sense of the total payoff for the MW strategy, namely ∑_t ∑_i P_i^{(t)} ·
log r_i^{(t)}, since thus far it is just an abstract quantity in a mental game that doesn't make
sense per se in terms of actual money made.
Since the logarithm is a concave function (i.e. (1/2)(log x + log y) ≤ log((x + y)/2)) and ∑_i P_i^{(t)} = 1,
simple calculus shows that

    ∑_i P_i^{(t)} · log r_i^{(t)} ≤ log(∑_i P_i^{(t)} · r_i^{(t)}).
The right hand side is exactly the logarithm of the rise in value of the portfolio of the MW
strategy on day t. Thus we conclude that the total payoff over all days lower bounds the
sum of the logarithms of these rises, which of course is the log of the ratio (final value of
the portfolio)/(initial value).
All of this requires that the number of steps T should be large enough. Specifically, if
|log r_i^{(t)}| ≤ 1 (i.e., no stock changes value by more than a factor of 2 on a single day) then
the total difference between the desired payoff and the actual payoff is √(log n/T) times
max_i ∑_t log r_i^{(t)}, as noted in Lecture 8. This performance can be improved by other
variations of the method (see the paper by Hazan and Kale). In practice this method
doesn't work very well; we'll later explore a better algorithm.
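Here is a minimal numpy sketch of this strategy on synthetic price relatives (the r_i^{(t)} below are random placeholders); since we are maximizing payoffs, weights are multiplied by (1 + η · payoff).

import numpy as np

rng = np.random.default_rng(3)
n, T, eta = 10, 1000, 0.05
r = np.exp(rng.normal(0.0, 0.01, size=(T, n)))   # r[t,i] = c_i^(t) / c_i^(t-1)

w = np.ones(n)
log_wealth = 0.0
for t in range(T):
    P = w / w.sum()                       # fraction of wealth in each stock today
    log_wealth += np.log(P @ r[t])        # portfolio grows by the factor sum_i P_i r_i
    w *= 1 + eta * np.log(r[t])           # reward experts with positive payoff log r_i

best_stock = np.log(r).sum(axis=0).max()  # log-growth of the best single stock in hindsight
print(log_wealth, best_stock)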
Remark: One limitation of this strategy is that we have ignored trading costs (i.e.,
dealers' commissions). As you can imagine, researchers have also incorporated trading costs
in this framework (see Blum and Kalai). Perhaps the bigger limitation of the MW strat-
egy is that it assumes nothing about price movements whereas there is a lot known about
the (random-like) behavior of the stock market. Traditional portfolio management theory
assumes such stochastic models, and is more akin to the decision theory we studied two
lectures ago. But stochastic models of the stock market fail sometimes (even catastrophi-
cally) and so ideally one wants to combine the stochastic models with the more pessimistic
viewpoint taken in the MW method. See the paper by Hazan and Kale. See also a recent
interesting paper by Abernathy et al. that suggests that the standard stochastic model
arises from optimal actions of market players.
Thomas Cover was the originator of the notion of managing a portfolio against an
adversarial market. His strategy is called universal portfolio.
bibliography
2. E. Hazan and S. Kale. On Stochastic and Worst-case Models for Investing. Proc.
NIPS 2009.
High-dimensional vectors are ubiquitous in applications (gene expression data, set of movies
watched by Netflix customer, etc.) and this lecture seeks to introduce some common prop-
erties of these vectors. We encounter the so-called curse of dimensionality which refers to
the fact that algorithms are simply harder to design in high dimensions and often have a
running time exponential in the dimension. We also encounter the blessings of dimensional-
ity, which allows us to reason about higher dimensional geometry using tools like Chernoff
bounds. We also show the possibility of dimension reduction — it is sometimes possible to
reduce the dimension of a dataset, for some purposes.
Notation: For a vector x ∈ <d its ℓ_2-norm is |x|_2 = (∑_i x_i²)^{1/2} and the ℓ_1-norm is
|x|_1 = ∑_i |x_i|. For any two vectors x, y their Euclidean distance refers to |x − y|_2 and
Manhattan distance refers to |x − y|_1.
High dimensional geometry is inherently different from low-dimensional geometry.
Example 15 Consider how many almost orthogonal unit vectors we can have in space, such
that all pairwise angles lie between 88 degrees and 92 degrees.
In <2 the answer is 2. In <3 it is 3. (Prove these to yourself.)
In <d the answer is exp(cd) where c > 0 is some constant.
Example 16 Another example is the ratio of the volume of the unit sphere to its
circumscribing cube (i.e. cube of side 2). In <2 it is π/4 or about 0.78. In <3 it is π/6 or
about 0.52. In d dimensions it is exp(−c · d log d).
Let’s start with useful generalizations of some geometric objects to higher dimensional
geometry:
• The n-cube in <n : {(x1 ...xn ) : 0 ≤ xi ≤ 1}. To visualize this in <4 , think of yourself
as looking at one of the faces, say x1 = 1. This is a cube in <3 and if you were
able to look in the fourth dimension you would see a parallel cube at x1 = 0. The
visualization in <n is similar.
The volume of the n-cube is 1.
• The unit ball in <d: B_d := {(x_1, ..., x_d) : ∑_i x_i² ≤ 1}. Again, to visualize the ball in
<4, imagine you have sliced through it with a hyperplane, say x_1 = 1/2. This slice is
a ball in <3 of radius √(1 − 1/2²) = √3/2. Every parallel slice also gives a ball.
The volume of B_d is π^{d/2}/(d/2)! (assume d is even if the previous expression bothers you),
which is 1/d^{Θ(d)}.
• In <2, if we slice the unit ball (i.e., disk) with a line at distance 1/2 from the center
then a significant fraction of the ball's volume lies on each side. In <d if we do the
same with a hyperplane, then the radius of the (d − 1)-dimensional ball is √3/2, and so
the volume on the other side is negligible. In fact a constant fraction of the volume lies
within a slice at distance 1/√d from the center, and for any c > 1, a (1 − 1/c)-fraction
of the volume of the d-ball lies in a strip of width O(√(log c/d)) around the center.
Denote by X the random variable a · x = ∑_i a_i x_i. Then:

    Pr[|X| > t] < e^{−nt²}.

Proof: We have

    μ = E[X] = E[∑_i a_i x_i] = 0,

    σ² = E[(∑_i a_i x_i)²] = E[∑_{i,j} a_i a_j x_i x_j] = ∑_{i,j} a_i a_j E[x_i x_j] = ∑_i a_i²/n = 1/n.

Using the Chernoff bound, we see that

    Pr[|X| > t] < e^{−(t/σ)²} = e^{−nt²}.
Corollary 8
If x, y are chosen at random from {−1, 1}^n, and the angle between them is θ_{x,y}, then

    Pr[|cos(θ_{x,y})| > √(log c/n)] < 1/c.

Hence if we pick, say, √c/2 random vectors in {−1, 1}^n, the union bound says that
the chance that some pair makes an angle whose cosine exceeds √(log c/n) in absolute value is less than 1/2.
Hence we can make c = exp(0.01n) and still have the vectors be almost-orthogonal (i.e.
every pairwise cosine is a very small constant).
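This is easy to check numerically; the following small numpy sketch draws a couple hundred random ±1 vectors and reports the largest pairwise cosine.

import numpy as np

rng = np.random.default_rng(4)
n, k = 2000, 200
X = rng.choice([-1.0, 1.0], size=(k, n))       # k random vectors from {-1,1}^n
cosines = (X @ X.T) / n                        # cos(theta) for every pair (diagonal = 1)
off_diag = cosines[~np.eye(k, dtype=bool)]
print(np.abs(off_diag).max())                  # small, on the order of sqrt(log k / n)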
The following ideas do not work to prove this theorem (as we discussed in class): (a)
take a random sample of m coordinates out of n. (b) Partition the n coordinates into m
subsets of size about n/m and add up the values in each subset to get a new coordinate.
Proof: Choose m vectors x^1, ..., x^m ∈ <n at random by choosing each coordinate randomly
from {+√((1+ε)/m), −√((1+ε)/m)}. Then consider the mapping from <n to <m given by

    z → (z · x^1, z · x^2, . . . , z · x^m).
In other words, z ∈ <n maps to a vector u ∈ <m whose ith coordinate is z · x^i. If we can
show that ‖u‖² is concentrated enough around its mean, then it would prove the theorem.
More formally, this is done in the following Chernoff bound lemma.
Lemma 10
There exist constants c_1 > 0 and c_2 > 0 such that:

    1. Pr[‖u‖² > (1 + β)μ] < e^{−c_1 β² m}

    2. Pr[‖u‖² < (1 − β)μ] < e^{−c_2 β² m}

Therefore there is a constant c such that the probability of a “bad” case is bounded by:

    Pr[(‖u‖² > (1 + β)μ) ∨ (‖u‖² < (1 − β)μ)] < e^{−c β² m}.
Now, we have (n choose 2) random variables of the type ‖u^i − u^j‖². Choose β = ε/2. Using the
union bound, we get that the probability that any of these random variables is not within
(1 ± ε/2) of its expected value is bounded by

    (n choose 2) e^{−c ε² m/4}.

For m = Ω(log n/ε²) this is less than 1, so there is a choice of the x^i's for which every pair satisfies

    ‖z^i − z^j‖ ≤ ‖u^i − u^j‖ ≤ (1 + ε)‖z^i − z^j‖,

as required. □
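Here is a minimal numpy sketch of this dimension reduction on random data. For simplicity the projection entries are ±1/√m rather than the ±√((1+ε)/m) used in the proof, so distances come out preserved up to a 1 ± ε factor rather than sandwiched exactly as above.

import numpy as np

rng = np.random.default_rng(5)
n, d, eps = 100, 500, 0.2
m = int(np.ceil(4 * np.log(n) / eps**2))       # O(log n / eps^2), constant chosen loosely
Z = rng.standard_normal((n, d))                # the original points z^1, ..., z^n
X = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)
U = Z @ X                                      # u^i = (z^i . x^1, ..., z^i . x^m)

def pairwise_dists(A):
    # all pairwise Euclidean distances between the rows of A
    G = A @ A.T
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G
    return np.sqrt(np.maximum(sq, 0))[np.triu_indices(len(A), k=1)]

ratios = pairwise_dists(U) / pairwise_dists(Z)
print(ratios.min(), ratios.max())              # roughly within [1 - eps, 1 + eps]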
Question: The above dimension reduction preserves (approximately) `2 -distances. Can
we do dimension reduction that preserves `1 distance? This was an open problem for many
years until Brinkman and Charikar showed in 2004 that no such dimension reduction is
possible.
Question: Is the theorem tight, or can we reduce the dimension even further below
O(log n/ε2 )? Alon has shown that this is essentially tight.
Finally, we note that there is a now-extensive literature on more efficient techniques
for JL-style dimension reduction, with a major role played by a 2006 paper of Ailon and
Chazelle. Do a google search for “Fast Johnson Lindenstrauss Transforms.”
Figure 11.2: Margin of a linear classifier with respect to some labeled points
For instance each datapoint could represent an email (described, say, by the counts of various
words in them) and the label indicates whether or not the user labeled them as spam. We
are trying to learn the rule (or “classifier”) that separates the 1's from 0's.
The simplest classifier is a halfspace. Finding whether there exists a halfspace ∑_i a_i x_i ≥
b that separates the 0's from 1's is solvable via Linear Programming. This LP has n + 1
variables and m constraints.
However, there is no guarantee in general that the halfspace that separates the training
data will generalize to new examples. ML theory suggests conditions under which the
classifier does generalize, and the simplest is margin. Suppose the data points are unit
vectors. We say the halfspace has margin ε if every datapoint has distance at least ε to the
halfspace.
In the next homework you will show that if such a margin exists then dimension reduction
to O(log n/ε2 ) dimensions at most halves the margin. Hence the LP to find it only has
O(log n/ε2 ) variables instead of n + 1.
Bibliography:
Today we study random walks on graphs. When the graph is allowed to be directed and
weighted, such a walk is also called a Markov Chain. These are ubiquitous in modeling
many real-life settings.
Example 18 (Exercise) Suppose the drunkard does his random walk in a city that’s
designed like a grid. At each step he goes North/South/East/West by one block with
probability 1/4. How many steps does it take him to get to his intended address, which is
n blocks north and n blocks east away?
Random walks in space are sometimes called Brownian motion, after botanist Robert
Brown, who in 1826 peered at a drop of water using a microscope and observed tiny par-
ticles (such as pollen grains and other impurities) in it performing strange random-looking
movements. He probably saw motion similar to the one in the above figure. Explaining
this movement was a big open problem. In 1905, during his ”miraculous year” (when he
solved 3 famous open problems in physics) Einstein explained Brownian motion as a ran-
dom walk in space caused by the little momentum being imparted to the pollen in random
directions by the (invisible) molecules of water. This theoretical prediction was soon ex-
perimentally confirmed and seen as a “proof” of the existence of molecules. Today random
walks and Brownian motion are used to model the movements of many systems, including
stock prices.
In a random walk, the next step does not depend upon the previous history of steps, only
on the current position/state of the moving particle. In general, the term markovian refers
to systems with a “memoryless” property. In an earlier lecture we encountered Markov
Decision Processes, which also had this memoryless property.
Example 20 (Bigram and trigram models in speech recognition) Language recog-
nition systems work by constantly predicting what’s coming next. Having heard the first
i words they try to generate a prediction of the (i + 1)th word.¹ This is a very complicated
piece of software, and one underlying idea is to model language generation as a markov
chain. (This is not an exact model; language is known to not be markovian, at least in the
simple way described below.)
The simplest idea would be to model this as a markov chain on the words of a dictionary.
Recall that everyday English has about 5, 000 words. A simple markovian model consists
of thinking of a piece of text as a random walk on a space with 5000 states (= words).
A state corresponds to the last word that was just seen. For each word pair w1 , w2 there
is a probability p_{w_1 w_2} of going from w_1 to w_2. According to this Markovian model, the
probability of generating a sentence with the words w_1, w_2, w_3, w_4 is q_{w_1} p_{w_1 w_2} p_{w_2 w_3} p_{w_3 w_4}
where q_{w_1} is the probability that the first word is w_1.
¹ You can see this in the typing box on smartphones, which always display their guesses of the next word
you are going to type. This lets you save time by clicking the correct guess.
To actually fit such a model to real-life text data, we have to estimate 5,000 probabilities
q_{w_1} for all words and (5,000)² probabilities p_{w_1 w_2} for all word pairs. Here

    p_{w_1 w_2} = Pr[w_2 | w_1] = Pr[w_2 w_1] / Pr[w_1],

namely, the probability that word w_2 is the next word given that the last word was w_1.
One can derive empirical values of these probabilities using a sufficiently large text
corpus. (Realize that we have to estimate 25 million numbers, which requires either a very
large text corpus or using some shortcuts.)
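Here is a minimal Python sketch of this estimation on a stand-in corpus: count word pairs and divide by the count of the first word.

from collections import Counter

corpus = "the cat sat on the mat and the cat ran".split()   # stand-in for a large corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w2, w1):
    # estimate of Pr[next word is w2 | last word was w1] = count(w1 w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(p("cat", "the"))   # 2/3 in this toy corpus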
An even better model in practice is a trigram model which uses the previous two words
to predict the next word. This involves a markov chain containing one state for every pair of
words. Thus the model is specified by (5,000)³ numbers of the form Pr[w_3 | w_2 w_1]. Fitting
such a model is beyond the reach of current computers but we won’t discuss the shortcuts
that need to be taken.
A Markov chain on a set of states S is specified by a transition matrix M,
where M_{ij} corresponds to the probability that the state at time step t + 1 will be j, given
that the state at time t is i. This process is memoryless in the sense that this transition
probability does not depend upon the history of previous transitions.
Therefore, each row in the matrix M is a distribution, implying M_{ij} ≥ 0 ∀i, j ∈ S and
∑_j M_{ij} = 1. The bigram or trigram models are examples of Markov chains.
Using a slight twist in the viewpoint we can use linear algebra to analyse random walks.
Instead of thinking of the drunkard as being at a specific point in the state space, we think
of the vector that specifies his probability of being at point i ∈ S. Then the randomness
goes away and this vector evolves according to deterministic rules. Let us understand this
evolution.
Let the initial distribution be given by the row vector x ∈ <n, with x_i ≥ 0 and ∑_i x_i = 1.
After one step, the probability of being at state i is ∑_j x_j M_{ji}, which corresponds to a new
distribution xM. It is easy to see that xM is again a distribution.
Sometimes it is useful to think of x as describing the amount of probability fluid sitting
at each node, such that the sum of the amounts is 1. After one step, the fluid sitting at
node i distributes to its neighbors, such that Mij fraction goes to j.
Suppose we take two steps in this Markov chain. The memoryless property implies that
the probability of going from i to j is ∑_k M_{ik} M_{kj}, which is just the (i, j)th entry of the
matrix M². In general taking t steps in the Markov chain corresponds to the matrix M^t,
and the state at the end is xM^t.
In other words, no matter what initial distribution you choose, if you let it evolve long
enough the distribution converges to the stationary distribution. Some basic questions
are when stationary distributions exist, whether or not they are unique, and how fast the
Markov chain converges to the stationary distribution.
Does Definition 2 remind you of something? Almost all of you know about eigenvalues,
and you can see that the definition requires π to be an eigenvector which has all nonnegative
coordinates and whose corresponding eigenvalue is 1.
In today’s lecture we will be interested in Markov chains corresponding to undirected
d-regular graphs, where the math is easier because the underlying matrix is symmetric:
Mij = Mji .
Definition 4 The mixing time of an ergodic Markov chain M is t if for every starting
distribution x, the distribution xM^t satisfies ‖xM^t − π‖_1 ≤ 1/4. (Here ‖·‖_1 denotes the ℓ_1
norm and the constant “1/4” is arbitrary.)
Example 23 (Mixing time of drunkard’s walk on a cycle) Let us consider the mix-
ing time of the walk in Example 21. Suppose the initial distribution concentrates all prob-
ability at state 0. Then 2t steps correspond to about t random steps (= coin tosses) since
with probability 1/2 the drunk does not move. Thus after 2t steps the location of the drunk is,
with high probability, within about ±O(√t) of its starting point.
As argued earlier, it takes Ω(n2 ) steps for the walk to reach the other half of the circle
with any reasonable probability, which implies that the mixing time is Ω(n2 ). We will soon
see that this lowerbound is fairly tight; the walk takes about O(n2 log n) steps to mix well.
Example 24 (Exercise: ) Show that if the graph is connected, then every eigenvalue of M
apart from the first one is strictly less than 1. However, the value −1 is still possible. Show
that if −1 is an eigenvalue then the graph is bipartite.
Write x = (1/n)~1 + ∑_{i=2}^{n} α_i e_i,
where e_i are the eigenvectors of M, which form an orthogonal basis, and ~1 is the first eigen-
vector, with eigenvalue 1. (Clearly, x can be written as a combination of the eigenvectors;
the observation here is that the coefficient in front of the first eigenvector ~1 is (~1 · x)/‖~1‖²,
which is (1/n)∑_i x_i = 1/n.)
    M^t x = M^{t−1}(M x)
          = M^{t−1}((1/n)~1 + ∑_{i=2}^{n} α_i λ_i e_i)
          = M^{t−2}(M((1/n)~1 + ∑_{i=2}^{n} α_i λ_i e_i))
          · · ·
          = (1/n)~1 + ∑_{i=2}^{n} α_i λ_i^t e_i.

Also, writing λ_max = max_{i≥2} |λ_i|,

    ‖∑_{i=2}^{n} α_i λ_i^t e_i‖_2 ≤ λ_max^t.
Note also that if we let the Markov chain run for O(k log n/(1 − λ_max)) steps then the distance
to the uniform distribution drops to exp(−k). This is why we were not very fussy about the
constant 1/4 in the definition of the mixing time earlier.
Remark: What if λ_max is 1 (i.e., −1 is an eigenvalue)? This breaks the proof and in fact
the walk may not be ergodic. However, we can get around this problem by modifying the
random walk to be lazy, by adding a self-loop at each node that ensures that the walk stays
at a node with probability 1/2. Then the matrix describing the new walk is (1/2)(I + M), and
its eigenvalues are (1/2)(1 + λ_i). Now all eigenvalues are nonnegative, and every eigenvalue except
the first is strictly less than 1 in absolute value. This is a general technique for making walks ergodic.
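Here is a minimal numpy sketch of the lazy drunkard's walk on an n-cycle: start with all the probability mass on one state and watch the ℓ_1 distance to the uniform (stationary) distribution shrink below 1/4.

import numpy as np

n = 50
M = np.zeros((n, n))
for i in range(n):
    M[i, i] = 0.5                          # lazy: stay put with probability 1/2
    M[i, (i - 1) % n] = 0.25
    M[i, (i + 1) % n] = 0.25

x = np.zeros(n)
x[0] = 1.0                                 # start concentrated at state 0
pi = np.ones(n) / n                        # the stationary distribution
for t in range(1, 100000):
    x = x @ M
    if np.abs(x - pi).sum() <= 0.25:       # the mixing-time criterion of Definition 4
        print("mixed after", t, "steps")   # grows roughly like n^2 for the cycle
        break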
Example 25 (Exercise) Compute the eigenvalues of the drunkard’s walk on the n-cycle
and show that its mixing time is O(n2 log n).
Multiplying the above equation by ~1, we get α_1 = 1 (since x · ~1 = π · ~1 = 1). Therefore
xM^t = π + ∑_{i=2}^{n} α_i λ_i^t v_i, and hence

    ‖xM^t − π‖_2 ≤ ‖∑_{i=2}^{n} α_i λ_i^t v_i‖_2          (12.1)
                 ≤ λ^t √(α_2² + · · · + α_n²)              (12.2)
                 ≤ λ^t ‖x‖_2,                              (12.3)
as needed. □
Chapter 13
Today’s topic is a technique called singular value decomposition or SVD. We’ll take two
views of it, and then encounter a surprising algorithm for it, which in turn leads to a third
interesting view.
This problem is nonlinear and nonconvex as stated. Today we will try to understand it
more and learn how to solve it. We will find that it is actually easy (which I find one of the
miracles of math: one of few natural nonlinear problems that are solvable in polynomial
time).
But first some examples of why this problem arises in practice.
problem

    min ∑_{ij} (M_{ij} − M̃_{ij})²    s.t. M̃ is a rank-k matrix.        (13.2)
Again, seems like a hopeless nonlinear optimization problem. Peer a little harder and
you realize that, first, a rank-k matrix is just one whose rows are linear combinations of k
independent vectors, and second, if you let Mi denote the ith column of M then you are
trying to solve nothing but problem (13.1)!
Example 28 (Planted bisection/Hidden Bisection) Graph bisection is the problem
where we are given a graph G = (V, E) and wish to partition V into two equal sets S, S
such that we minimize the number of edges between S, S. It is NP-complete. Let’s consider
the following average case version.
Nature creates a random graph on n nodes as follows. It partitions the nodes into S_1, S_2.
Within S_1 and within S_2 it puts each edge with probability p, and between S_1, S_2 it puts each edge with probability
q, where q < p. Now this graph is given to the algorithm. Note that the algorithm doesn't
know S_1, S_2. It has to find the optimum bisection.
It is possible to show using Chernoff bounds that if q = Ω(log n/n) then with high proba-
bility the optimum bisection in the graph is the planted one, namely, S_1, S_2. How can the
algorithm recover this partition?
Figure 13.1: Planted Bisection problem: Edge probability is p within S1 , S2 and q between
S1 , S2 where q < p. On the right hand side is the adjacency matrix. If we somehow knew
S1 , S2 and grouped the corresponding rows and columns together, and squint at the matrix
from afar, we’d see more density of edges within S1 , S2 and less density between S1 , S2 .
Thus from a distance the adjacency matrix looks like a rank 2 matrix.
The observation in Figure 13.1 suggests that the adjacency matrix is close to the rank 2
matrix shown there: the blocks within S_1 and within S_2 have value p in each entry; the blocks between
S_1, S_2 have q in each entry.
Maybe if we can solve (13.2) with k = 2 we are done? This turns out to be correct as
we will see in next lecture.
One can study planted versions of many other NP-hard problems as well.
Many practical problems involve graph partitioning. For instance, image recognition
involves first partitioning the image into its component pieces (sky, ground, tree, etc.); a
process called image segmentation in computer vision. This is done by graph partitioning
on a graph defined on pixels where edges denote pixel-pixel similarity. Perhaps planted
graphs are a better model for such real-life settings than worst-case graphs.
Proof: At first sight, the equality does not even seem to pass a “typecheck”; a matrix on
the left and vectors on the right. But then we realize that ei eTi is actually an n × n matrix
(it has rank 1 since every column is a multiple of ei ). So the right hand side is indeed a
matrix. Let us call it B.
Any matrix can be specified completely by describing how it acts on an orthonor-
mal basis. By definition, M is the matrix that acts as follows on the orthonormal set
{e1 , e2 , . . . , en }: M ej = λj ej . How does B act on this orthonormal set? We have
    B e_j = (∑_i λ_i e_i e_i^T) e_j
          = ∑_i λ_i e_i (e_i^T e_j)        (distributivity and associativity of matrix multiplication)
          = λ_j e_j.

Thus B acts on the basis exactly as M does, and hence B = M. □
The proof of this theorem uses the following, which is not too hard to prove from the
spectral decomposition using definitions.
Theorem 17 (Courant-Fisher)
If e_1, e_2, . . . , e_n are the eigenvectors as above then λ_1 = max_{|x|_2=1} x^T M x (and more generally,
λ_k is the maximum of x^T M x over unit vectors x orthogonal to e_1, . . . , e_{k−1}).
Let's prove Theorem 16 for k = 1 by verifying that the first term of the spectral de-
composition gives the best rank 1 approximation to M. A rank 1 matrix is one in which each
row is a multiple of some unit vector x; in other words each row lies on the line defined by x. Denote
by M_i the ith row of M. The best approximation to M_i along this line is ⟨M_i, x⟩x^T, so the total squared
error is ∑_i |M_i|² − ∑_i ⟨M_i, x⟩², and we should pick the unit vector x that maximizes ∑_i ⟨M_i, x⟩² = x^T M^T M x,
which by the Courant-Fisher theorem happens for x = e_1. Thus the best rank 1 approx-
imation to M is the matrix whose ith row is ⟨M_i, e_1⟩ e_1^T, which of course is λ_1 e_{1i} e_1^T.
Thus the rank 1 matrix approximation is λ_1 e_1 e_1^T, which proves the theorem for k = 1. The
proof of Theorem 16 for general k follows similarly by induction and is left as an exercise.
The best rank k approximation to M consists of taking the first k terms of (14.2) and
discarding the rest.
This solves problems (13.1) and (13.2). Next time we’ll go into some detail of the
algorithm for computing them. In practice you can just use matlab or another package.
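For instance, a couple of lines of numpy compute the best rank-k approximation and confirm that its error in (13.2) equals the sum of the squared discarded singular values.

import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((40, 30))
k = 5

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]       # keep only the top k singular triples

err = np.sum((M - M_k) ** 2)
print(err, np.sum(s[k:] ** 2))             # the two numbers agree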
The first SVD direction is the direction of maximum variance; the second
SVD direction corresponds to the direction with maximum variance after we have removed the
component along the first direction, and so on.
bibliography
The best rank k approximation to M consists of taking the first k terms of (14.2) and
discarding the rest (where σ1 ≥ σ2 · · · ≥ σr ).
Taking the best rank k approximation is also called Principal Component Analysis or
PCA.
You probably have seen eigenvalue and eigenvector computations in your linear algebra
course, so you know how to compute the PCA for symmetric matrices. The nonsymmetric
case reduces to the symmetric one by using the following observation. If M is the matrix
in (14.2) then
    M M^T = (∑_i σ_i u_i v_i^T)(∑_i σ_i v_i u_i^T) = ∑_i σ_i² u_i u_i^T    since v_i^T v_j = 1 if i = j and 0 otherwise.

Thus we can recover the u_i's and σ_i's by computing the eigenvalues and eigenvectors of
M M^T, and then recover the v_i's by using (14.1).
Another application of singular vectors is the Pagerank algorithm for ranking webpages.
vector, so that ∑_i α_i² = 1.
Since |λ_i| ≤ |λ_1| − γ for i ≥ 2, we have

    ∑_{i≥2} |α_i| |λ_i|^t ≤ n α_max (|λ_1| − γ)^t = n |λ_1|^t (1 − γ/|λ_1|)^t,
Figure 14.1: Planted Bisection problem: Edge probability is p within S1 , S2 and q between
S1 , S2 where q < p. On the right hand side is the adjacency matrix. If we somehow knew
S1 , S2 and grouped the corresponding rows and columns together, and squint at the matrix
from afar, we’d see more density of edges within S1 , S2 and less density between S1 , S2 .
Thus from a distance the adjacency matrix looks like a rank 2 matrix.
Now we sketch why the best rank-2 approximation to the adjacency matrix will more or
less recover the planted bisection. Specifically, the idea is to find the rank 2 approximation;
with very high probability its columns can be cleanly clustered into 2 clusters. This gives a
grouping of the vertices into 2 groups as well, which turns out to be the planted bisection.
Why this works has to do with the properties of rank k approximations. First we define
two norms of a matrix.
Definition 6 (Frobenius and spectral norm) If M is an n×n matrix then its Frobe-
nius norm |M|_F is √(∑_{ij} M_{ij}²) and its spectral norm |M|_2 is the maximum value of |M x|_2
over all unit vectors x ∈ <n. (By Courant-Fisher, the spectral norm is also the highest
eigenvalue.) For matrices that are not symmetric the definition of Frobenius norm is anal-
ogous and the spectral norm is the highest singular value.
Last time we defined the best rank k approximation to M as the matrix M̃ that has rank
k and minimizes |M − M̃|_F². The following lemma shows that we could have defined it
equivalently using the spectral norm.

Lemma 20
The matrix M̃ as defined above also satisfies |M − M̃|_2 ≤ |M − B|_2 for all B that have
rank k.
Theorem 21
If M̃ is the best rank-k approximation to M, then for every rank k matrix C:

    |M̃ − C|_F² ≤ 5k |M − C|_2².
Proof: Follows from the spectral decomposition and the Courant-Fisher theorem, and the fact that
the column vectors in M̃ and C together span a space of dimension at most 2k. Thus
|M̃ − C|_F² involves a matrix of rank at most 2k. The rest of the details are cut and pasted from
Hopcroft-Kannan in Figure 14.2. □
Returning to planted graph bisection, let M be the adjacency matrix of the graph with
planted bisection. Let C be the rank-2 matrix that we think is a good approximation to M ,
namely, the one in Figure 14.1. Let M̃ be the true rank 2 approximation found via SVD.
In general M̃ is not the same as C. But Theorem 21 implies that we can upper bound the
average coordinate-wise squared difference of M̃ and C by the quantity on the right hand
side, which is the spectral norm (i.e., largest eigenvalue) of M − C.
Notice that M − C is a random matrix each of whose coordinates is one of the four values 1 −
p, −p, 1 − q, −q. More importantly, the expectation of each coordinate is 0 (since each entry
of M is a coin toss whose expected value is the corresponding entry of C). The study
of eigenvalues of such random matrices is a famous subfield of science with unexpected
connections to number theory (including the famous Riemann hypothesis), quantum physics
(quantum gravity, quantum chaos), etc. We show below that |M − C|_2² is at most O(np).
We conclude that the corresponding columns of M̃ and C (whose squared norms are about np)
are apart, on average, by O(p) in squared distance. Thus intuitively, clustering the columns of M̃ into two will find us the
bipartition. Actually showing this requires more work which we will not do.
Here is a generic clustering algorithm into two clusters: Pick a random column of M̃
and put into one cluster all columns whose distance from it is at most 10p. Put all other
columns in the other cluster.
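Here is a minimal numpy sketch of the whole pipeline: plant a bisection, compute a rank-2 view of the adjacency matrix via SVD, and cluster. For brevity the clustering step below uses the signs of the second singular vector (a common shortcut) instead of the distance-based clustering just described.

import numpy as np

rng = np.random.default_rng(7)
n, p, q = 400, 0.5, 0.1
side = np.array([0] * (n // 2) + [1] * (n // 2))         # the hidden bisection
prob = np.where(side[:, None] == side[None, :], p, q)
A = (rng.random((n, n)) < prob).astype(float)
A = np.triu(A, 1)
A = A + A.T                                              # symmetric adjacency matrix

U, s, Vt = np.linalg.svd(A)
guess = (Vt[1] > 0).astype(int)                          # cluster by the 2nd singular vector
accuracy = max((guess == side).mean(), (guess != side).mean())
print(accuracy)                                          # close to 1: planted cut recovered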
In other words we are trying to cover the unit sphere with spheres of radius 0.2.
Try to pick this set greedily. Pick x(1) arbitrarily, and throw out the unit sphere of
radius 0.2 around it. Then pick x(2) arbitrarily out of the remaining sphere, and throw out
the unit sphere of radius 0.2 around it. And so on.
How many points did we end up with? By construction, each point that was picked
has distance at least 0.2 from every other point that was picked, so the spheres of radius
0.1 around the picked points are mutually disjoint. Thus the maximum number of points
we could have picked is the number of disjoint spheres of radius 0.1 in a ball of radius at
most 1.1. Denoting by B(r) the volume of a sphere of radius r, this is at most
B(1.1)/B(0.1) = exp(n).
Idea 3) Combining Ideas 1 and 2 and the union bound, we have that with high probability

    x_{(i)}^T R x_{(i)} ≤ O(√n)    for all the special directions.

Idea 4) If v is the eigenvector corresponding to the largest eigenvalue, then there
is some special direction satisfying x_{(i)}^T R x_{(i)} > 0.4 v^T R v.
By the covering property, there is some special direction x_{(i)} that is close to v. Represent
it as αv + βu where u ⊥ v and u is a unit vector. So α ≥ 0.9 and β ≤ √0.19 ≤ 0.5. Then
x_{(i)}^T R x_{(i)} = α² v^T R v + β² u^T R u, since the cross terms vanish (u ⊥ v and v is an
eigenvector of R). But v corresponds to the largest eigenvalue, so |u^T R u| ≤ v^T R v. We
conclude x_{(i)}^T R x_{(i)} ≥ (0.81 − 0.25) v^T R v ≥ 0.4 v^T R v, as claimed.
The theorem now follows from Ideas 3 and 4. □
bibliography
Chapter 15
Recall that a set of points K is convex if for every two x, y ∈ K the line segment joining x, y,
i.e., {λx + (1 − λ)y : λ ∈ [0, 1]}, lies entirely inside K. A function f : <n → < is convex if
f((x + y)/2) ≤ (1/2)(f(x) + f(y)). It is called concave if the previous inequality goes the other way. A
linear function is both convex and concave. A convex program consists of a convex function
f and a convex body K and the goal is to minimize f(x) subject to x ∈ K. It is a vast
generalization of linear programming and like LP, can be solved in polynomial time under
fairly general conditions on f, K. Today's lecture is about a special type of convex program
called semidefinite programs.
Recall that a symmetric n × n matrix M is positive semidefinite (PSD for short) iff it
can be written as M = AAT for some real-valued matrix A (need not be square). It is a
simple exercise that this happens iff every eigenvalue is nonnegative. Another equivalent
characterization is that there are n vectors u1 , u2 , . . . , un such that Mij = hui , uj i. Given
a PSD matrix M one can compute such n vectors in polynomial time using a procedure
called Cholesky decomposition.
Lemma 23
The set of all n × n PSD matrices is a convex set in <^{n²}.
Proof: It is easily checked that if M_1 and M_2 are PSD then so is M_1 + M_2 and hence so
is (1/2)(M_1 + M_2). □
Now we are ready to define semidefinite programs. These are very useful in a variety
of optimization settings as well as control theory. We will use them for combinatorial
optimization, specifically to compute approximations to some NP-hard problems. In this
respect SDPs are more powerful than LPs.
View 1: A linear program in n² real valued variables Y_{ij} where 1 ≤ i, j ≤ n, with the
additional constraint “Y is a PSD matrix.”
View 2: A vector program where we are seeking n vectors u1 , u2 , . . . , un ∈ <n such that
their inner products hui , uj i satisfy some set of linear constraints.
Clearly, these views are equivalent.
Exercise: Show that every LP can be rewritten as a (slightly larger) SDP. The idea is
that a diagonal matrix, i.e., a matrix whose offdiagonal entries are 0, is PSD iff the entries
are nonnegative.
Question: Can the vectors u1 , . . . , un in View 2 be required to be in <d for d < n?
Answer: This is not known and imposing such a constraint makes the program nonconvex.
(The reason is that the sum of two matrices of rank d can have rank higher than d.)
This works since an edge contributes 1 to the objective iff the endpoints have opposite signs.
The SDP relaxation is to find vectors u_1, u_2, . . . , u_n such that |u_i|_2² = 1 for all i, so
as to maximise

    (1/4) ∑_{{i,j}∈E} |u_i − u_j|².

This is a relaxation since every ±1 solution to the problem is also a vector solution where
every u_i is ±v_0 for some fixed unit vector v_0.
Thus when we solve this SDP we get n vectors, and the value of the objective OPT_{SDP}
is at least as large as the capacity of the max cut. How do we get a cut out of these vectors?
The following is the simplest rounding one can think of. Pick a random vector z. If ⟨u_i, z⟩
is positive, put i in S and otherwise in S̄. Note that this is the same as picking a random
hyperplane passing through the origin and partitioning the vertices according to which side
of the hyperplane they lie on.
Figure 15.1: SDP solutions are unit vectors and they are rounded to ±1 by using a random
hyperplane through the origin. The probability that i, j end up on opposite sides of the cut
is proportional to Θij , the angle between them.
Theorem 24 (Goemans-Williamson’94)
The expected number of edges in the cut produced by this rounding is at least 0.878.. times
OP TSDP .
Proof: The rounding is essentially picking a random hyperplane through the origin and
vertices i, j fall on opposite sides of the cut iff ui , uj lie on opposite sides of the hyperplane.
Let’s estimate the probability they end up on opposite sides. This may seem a difficult n-
dimensional calculation, until we realize that there is a 2-dimensional subspace defined by
ui , uj , and all that matters is the intercept of the random hyperplane with this 2-dimensional
subspace, which is a random line in this subspace. Specifically θij be the angle between ui
and uj . Then the probability that they fall on opposite sides of this random line is θij /π.
Thus by linearity of expectations,
X θij
E[Number of edges in cut] = . (15.1)
π
{i,j}∈E
How do we relate this to OP TSDP ? We use the fact that hui , uj i = cos θij to rewrite
the objective as
X 1 X 1 X 1
|vi − vj |2 = (|vi |2 + |vj |2 − 2hvi , vj i) = (1 − cos θij ). (15.2)
4 4 2
{i,j}∈E {i,j}∈E {i,j}∈E
This seems hopeless to analyse for us mortals: we know almost nothing about the graph or
the set of vectors. Luckily Goemans and Williamson had the presence of mind to verify the
following in Matlab: each term of (15.1) is at least 0.878.. times the corresponding term of
(15.2)! Specifically, Matlab shows that

    (2θ)/(π(1 − cos θ)) ≥ 0.878        ∀θ ∈ [0, π].        (15.3)

□
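Here is a minimal numpy sketch of the rounding step. Solving the SDP itself requires a solver, so the unit vectors below are random stand-ins for an SDP solution; since (15.3) is a per-term inequality, the expected rounded cut is still at least 0.878 times the vector objective for any unit vectors.

import numpy as np

rng = np.random.default_rng(8)
n = 30
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < 0.2]

U = rng.standard_normal((n, n))
U /= np.linalg.norm(U, axis=1, keepdims=True)    # stand-in unit vectors u_1, ..., u_n

z = rng.standard_normal(n)                       # random hyperplane normal
side = np.sign(U @ z)
cut = sum(1 for i, j in edges if side[i] != side[j])

objective = sum(np.sum((U[i] - U[j]) ** 2) / 4 for i, j in edges)
print(cut, 0.878 * objective)                    # the first is >= the second in expectation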
The saga of 0.878... The GW paper came on the heels of the PCP Theorem (1992) which
established that there is a constant
vector u_0 is a dummy vector that stands for “1”. If u_i = u_0 then we think of this variable
being set to True and if u_i = −u_0 we think of the variable being set to False. Of course, in
general hui , u0 i need not be ±1 in the optimum solution.
So the SDP is to find these vectors satisfying |u_i|² = 1 for all i so as to maximize
∑_{clauses l} v_l, where v_l is the expression for the lth clause. For instance if the clause is y_i ∨ y_j then
the expression is

    1 − (1/4)(u_0 − u_i)·(u_0 − u_j) = (1/4)(1 + u_0 · u_j) + (1/4)(1 + u_0 · u_i) + (1/4)(1 − u_i · u_j).
This is a very Goemans-Williamson like expression, except we have expressions like
1 + u_0 · u_i whereas in MAX-CUT we have 1 − u_i · u_j. Now we do Goemans-Williamson
rounding. The key insight is that since we round to ±1, each term 1 + u_i · u_j becomes 2
with probability 1 − θ_{ij}/π = (π − θ_{ij})/π and is 0 otherwise. Similarly, 1 − u_i · u_j becomes 2 with
probability θ_{ij}/π and 0 else.
Now the term-by-term analysis used for MAX-CUT works again once we realize that
(15.3) also implies (by substituting π − θ for θ in the expression) that

    2(π − θ)/(π(1 + cos θ)) ≥ 0.878    for θ ∈ [0, π].

We conclude that the expected number of satisfied clauses is at least 0.878 times OPT_{SDP}.
Chapter 16
This lecture is about gradient descent, a popular method for continuous optimization (es-
pecially nonlinear optimization).
We start by recalling that allowing nonlinear constraints in optimization leads to NP-
hard problems in general. For instance the following single constraint can be used to force
all variables to be 0/1:

    ∑_i x_i² (1 − x_i)² = 0.
Notice, this constraint is nonconvex. We saw in earlier lectures that the Ellipsoid method
can solve convex optimization problems efficiently under fairly general conditions. But it is
slow in practice.
Gradient descent is a popular alternative because it is simple and it gives some kind
of meaningful result for both convex and nonconvex optimization. It tries to improve the
function value by moving in a direction related to the gradient (i.e., the first derivative).
For convex optimization it gives the global optimum under fairly general conditions. For
nonconvex optimization it arrives at a local optimum.
Figure 16.1: For nonconvex functions, a local optimum may be different from the global
optimum
We will first study unconstrained gradient descent where we are simply optimizing a
function f (·). Recall that the function is convex if f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)
for all x, y and λ ∈ [0, 1].
For a univariate function f, Taylor expansion gives

    f(x + η) = f(x) + η f′(x) + (η²/2) f″(x) + (η³/3!) f‴(x) + · · · .        (16.1)
If f″(x) ≥ 0 for all x then the function is convex. This is because f′(x) is an
increasing function of x. The minimum is attained at the x where f′(x) = 0, since f′(x) is positive
to the right of it and negative to the left. Thus moving both left and right of this point increases
f and it never drops. The function is concave if f″(x) ≤ 0 for all x; such functions have a
unique maximum.
Examples of convex functions: ax + b for any a, b ∈ ℝ; exp(ax) for any a ∈ ℝ; x^α for x ≥ 0 and α ≥ 1 or α ≤ 0. Another interesting example is the negative entropy x log x for x ≥ 0.
Examples of concave functions: ax + b for any a, b ∈ ℝ; x^α for α ∈ [0, 1] and x ≥ 0; log x for x ≥ 0.
Suppose we are at a point x_i and take a step of size η to x_{i+1} = x_i + η. Ignoring higher order terms,
$$f(x_{i+1}) = f(x_i) + \eta f'(x_i) + \frac{\eta^2}{2} f''(x_i),$$
and the best value for η (the one that gives the most reduction in one step) is η = −f'(x_i)/f''(x_i), which gives
$$f(x_{i+1}) = f(x_i) - \frac{(f'(x_i))^2}{2 f''(x_i)}.$$
Thus the algorithm makes progress so long as f''(x_i) > 0. Convex functions whose second derivative is bounded away from zero, i.e. f''(x) ≥ ℓ for some constant ℓ > 0, are called strongly convex.
The above calculation is the main idea in Newton's method, which you may have seen in calculus. Proving convergence requires further assumptions.
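As a quick illustration, here is a minimal one-dimensional Newton iteration in Python; the function, starting point, and tolerance below are arbitrary illustrative choices, not prescriptions from these notes.

# Minimal sketch of Newton's method in one dimension.
def newton_1d(fprime, fdoubleprime, x0, tol=1e-10, max_iter=100):
    x = x0
    for _ in range(max_iter):
        step = fprime(x) / fdoubleprime(x)   # eta = -f'(x)/f''(x), applied as x - step
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: minimize f(x) = x^2 - 2x + 3 (minimum at x = 1).
print(newton_1d(lambda x: 2 * x - 2, lambda x: 2.0, x0=10.0))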
In several variables, ∇f(x) denotes the vector of first-order derivatives, whose ith coordinate is ∂f/∂x_i; it is called the gradient. Sometimes we restate convexity equivalently as
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \qquad \text{for all } x, y. \qquad (16.2)$$
Figure 16.3: A differentiable convex function lies above the tangent plane f(x) + ∇f(x) · (y − x)
For a convex set C, define
$$\tilde{I}_C(x) = \begin{cases} 0 & x \in C \\ \infty & x \notin C. \end{cases}$$
The convex function Ĩ_C is called the indicator function of the set C. We can play several notational tricks with it. For example, the problem of minimizing a function f (defined on all of ℝ^n, say) on the set C is the same as minimizing the function f + Ĩ_C over all of ℝ^n. Indeed, the function f + Ĩ_C is (by our convention) f restricted to the set C. In a similar way we can extend a concave function by defining it to be −∞ outside its domain.
If higher derivatives also exist, the multivariate Taylor expansion for an n-variate function f is
$$f(x + y) = f(x) + \nabla f(x) \cdot y + \frac{1}{2}\, y^T \nabla^2 f(x)\, y + \cdots. \qquad (16.4)$$
Here ∇²f(x) denotes the n × n matrix whose (i, j) entry is ∂²f/∂x_i∂x_j; it is called the Hessian. It can be checked that f is convex if the Hessian is positive semidefinite; this means yᵀ∇²f(x) y ≥ 0 for all y.
• f(x) = log(e^{x_1} + e^{x_2} + · · · + e^{x_n}) is convex on ℝ^n. This fact is used in practice as an analytic approximation of the max function, since
max{x_1, . . . , x_n} ≤ f(x) ≤ max{x_1, . . . , x_n} + log n.
It turns out this fact is at the root of the multiplicative weight update method; the algorithm for approximately solving LPs that we saw in Lecture 10 can be seen as doing gradient descent on this function, where the x_i's are the slacks of the linear constraints. (For a linear constraint aᵀz ≥ b the slack is aᵀz − b.)
• f(x) = xᵀAx = Σ_{ij} A_{ij} x_i x_j where A is positive semidefinite. Its Hessian is 2A.
Some important examples of concave functions are the geometric mean (∏_{i=1}^n x_i)^{1/n} and the log-determinant (defined for X ∈ ℝ^{n²} as log det(X), where X is interpreted as an n × n matrix). Many famous inequalities in mathematics (such as Cauchy-Schwarz) are derived using convex functions. 2
Example 30 (Linear equations with PSD constraint matrix) In linear algebra you learnt that the method of choice for solving a system of equations Ax = b is Gaussian elimination. In many practical settings its O(n³) running time may be too high. Instead one does gradient descent on the function ½xᵀAx − bᵀx, whose local optimum satisfies Ax = b. If A is positive semidefinite this function is also convex since its Hessian is A, and gradient descent will actually find the solution. (In real life these systems are solved using more advanced methods such as conjugate gradient.) Also, if A is diagonally dominant, a stronger condition than PSD, then Spielman and Teng (2003) have shown how to solve this problem in time that is near linear in the number of nonzero entries. This has had surprising applications to basic algorithmic problems like max-flow.
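Here is a minimal sketch of this idea in Python with numpy; the matrix, step size, and iteration count are illustrative assumptions only.

import numpy as np

# Gradient descent on f(x) = 0.5 x^T A x - b^T x; the gradient is Ax - b.
def solve_psd_gd(A, b, eta=0.1, iters=500):
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x - eta * (A @ x - b)   # move opposite to the gradient
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # a small symmetric positive definite matrix
b = np.array([1.0, 1.0])
print(solve_psd_gd(A, b), np.linalg.solve(A, b))  # the two answers should be close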
Example 31 (Least squares) In some settings we are given a set of points a_1, a_2, . . . , a_m ∈ ℝ^n and some data values b_1, b_2, . . . , b_m taken at these points by some function of interest. We suspect that the unknown function is a line, except the data values have a little error in them. One standard technique is to find a least squares fit: a line that minimizes the sum of squared distances from the datapoints to the line. The objective is min_x ‖Ax − b‖₂², where A ∈ ℝ^{m×n} is the matrix whose rows are the a_i's. (We saw in an earlier lecture that the solution is also the first singular vector.) This objective is just xᵀAᵀAx − 2(Ax)ᵀb + bᵀb, which is convex.
In the univariate case, gradient descent has a choice of only two directions to move in: right or left. In n dimensions, it can move in any direction in ℝ^n. The most direct analog of the univariate method is to move in the direction opposite to the gradient.
The most direct analogue of our univariate analysis would be to assume a lowerbound on yᵀ∇²f y for all unit vectors y (in other words, a lowerbound on the eigenvalues of ∇²f). This will be explored in the homework. In the rest of the lecture we will only assume (16.2).
Example 32 (Spam classification via SVMs) This example will run through the entire lecture. Support Vector Machine is the name in machine learning for a linear classifier; we saw these before in Lecture 6 (Linear Thinking). Suppose we wish to train the classifier to classify emails as spam/nonspam. Each email is represented by a vector in ℝ^n that gives the frequencies of various words in it ("bag of words" model). Say a_1, a_2, . . . , a_N are the emails, and for each there is a corresponding bit b_i ∈ {−1, 1}, where b_i = 1 means a_i is spam. SVMs use a linear classifier to separate spam from nonspam. If spam were perfectly identifiable by a linear classifier, there would be a vector W such that W · a_i ≥ 1 if a_i is spam, and W · a_i ≤ −1 if not. In other words,
$$1 - b_i\, W \cdot a_i \le 0 \qquad \forall i. \qquad (16.5)$$
Of course, in practice a linear classifier makes errors, so we have to allow for the possibility that (16.5) is violated by some a_i's. The obvious thing to try is to find a W that satisfies as many of the constraints as possible, but that leads to a nonconvex NP-hard problem. (Even approximating this weakly is NP-hard.) Thus a more robust version of the problem is
$$\min \sum_i \mathrm{Loss}\big(1 - W \cdot (b_i a_i)\big) \qquad (16.6)$$
$$\text{subject to } \|W\|_2^2 \le n \quad \text{(scaling constraint)}$$
where Loss(·) is a function that penalizes unsatisfied constraints according to the amount by which they are unsatisfied. (Note that W is the vector of variables, and the scaling constraint gives meaning to the separation by "1" in (16.5) by restricting W to a ball of radius √n, which is a convex constraint.) The most obvious loss function would be to count the number of unsatisfied constraints, but that is nonconvex. For this lecture we focus on convex loss functions; the simplest is the hinge loss Loss(t) = max{0, t}. Applying it to 1 − W · (b_i a_i) ensures that correctly classified emails contribute 0 to the loss, and incorrectly classified emails contribute exactly the amount by which they fail the inequality. The function in (16.6) is convex because the function inside Loss() is linear (and thus convex), and the hinge loss preserves convexity: the maximum of two convex functions (here the constant 0 and a linear function) is convex.
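To make this concrete, here is a small Python sketch of the hinge-loss objective and one of its subgradients; the tiny synthetic data and the function names are my own illustrative choices.

import numpy as np

def hinge_objective(W, A, b):
    """Sum of hinge losses max(0, 1 - b_i * W.a_i) over the examples."""
    margins = 1 - b * (A @ W)
    return np.maximum(0.0, margins).sum()

def hinge_subgradient(W, A, b):
    """A subgradient of the objective: each violated example i contributes -b_i * a_i."""
    margins = 1 - b * (A @ W)
    violated = margins > 0
    return -(A[violated] * b[violated, None]).sum(axis=0)

# Tiny synthetic example: two spam-like and two nonspam-like "emails".
A = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]])
b = np.array([1.0, 1.0, -1.0, -1.0])
W = np.zeros(2)
print(hinge_objective(W, A, b), hinge_subgradient(W, A, b))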
If x ∈ K is the current point and we use the gradient to step to x − η∇f(x), then in general this new point will not be in K. Thus one needs to do a projection.
Definition 7 The projection of a point y on K is the x ∈ K that minimizes ‖y − x‖₂. (It is also possible to use norms other than ℓ₂ to define projections.)
A projection oracle for the convex body K is a black box that, for every point y, returns its projection on K.
Here is a simple algorithm for solving the constrained optimization problem. The algorithm only needs to access f via a gradient oracle and K via a projection oracle.
Definition 8 (Gradient Oracle) A gradient oracle for a function f is a black box that, for every point z, returns ∇f(z), the gradient evaluated at z. (Notice, this defines a linear function of the form g · x, where g is the vector of partial derivatives evaluated at z.)
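The algorithm, whose formal description is omitted above, is projected gradient descent: repeatedly take a gradient step and project back onto K, then output the average iterate. Here is a minimal Python sketch; the step size, the example function, and the ball projection are illustrative assumptions.

import numpy as np

def projected_gradient_descent(grad, project, x0, eta, T):
    """x^{(i+1)} = project(x^{(i)} - eta * grad(x^{(i)})); return the average iterate."""
    x = np.array(x0, dtype=float)
    iterates = []
    for _ in range(T):
        x = project(x - eta * grad(x))
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)   # the analysis below bounds f at this average

# Illustration: minimize f(x) = |x - c|^2 over the unit ball K = {x : |x| <= 1}.
c = np.array([2.0, 0.0])
grad = lambda x: 2 * (x - c)
project = lambda y: y / max(1.0, np.linalg.norm(y))   # projection onto the unit ball
print(projected_gradient_descent(grad, project, x0=[0.0, 0.0], eta=0.1, T=200))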
Using (16.3), we can lowerbound the left hand side by f (x(i) ) − f (x∗ ). We conclude that
$$f(x^{(i)}) - f(x^*) \le \frac{1}{2\eta}\left( \|x^{(i)} - x^*\|^2 - \|x^{(i+1)} - x^*\|^2 \right) + \frac{\eta}{2} G^2. \qquad (16.7)$$
Now sum the previous inequality over i = 1, 2, . . . , T and use the telescoping cancellations
to obtain
$$\sum_{i=1}^{T} \big(f(x^{(i)}) - f(x^*)\big) \le \frac{1}{2\eta}\left( \|x^{(0)} - x^*\|^2 - \|x^{(T)} - x^*\|^2 \right) + \frac{T\eta}{2} G^2.$$
Finally, by convexity $f\big(\frac{1}{T}\sum_i x^{(i)}\big) \le \frac{1}{T}\sum_i f(x^{(i)})$, so we conclude that the point $z = \frac{1}{T}\sum_i x^{(i)}$ satisfies
$$f(z) - f(x^*) \le \frac{D^2}{2\eta T} + \frac{\eta}{2} G^2.$$
Now set $\eta = \frac{D}{G\sqrt{T}}$ to upperbound the right hand side by $\frac{2DG}{\sqrt{T}}$. Since $T = \frac{4D^2 G^2}{\varepsilon^2}$, we see that $f(z) \le f(x^*) + \varepsilon$.
This notion should remind you of multiplicative weights, except here we may have general convex functions as "payoffs."
Zinkevich noticed that the analysis of gradient descent applies to this much more general scenario. Specifically, modify the above gradient descent algorithm by replacing ∇f(x^{(i)}) with ∇f_i(x^{(i)}). This algorithm is called Online Gradient Descent. The earlier analysis works essentially unchanged, once we realize that the left hand side of (16.7) is the regret for trial i. Summing over i gives the total regret on the left side, and the right hand side is analysed and upperbounded as before. Thus we have shown:
Stochastic gradient descent can be analysed using Online Gradient Descent (OGD). Let g_i be the estimated gradient at step i (its expectation is the true gradient ∇f(x^{(i)})). Then we use the linear function f_i(x) = g_i · x (which is convex) as f_i in the ith step of OGD. Let z = (1/T) Σ_{i=1}^T x^{(i)}. Let x* be the point in K where f attains its minimum value.
Theorem 26
$$E[f(z)] \le f(x^*) + \frac{2DG}{\sqrt{T}},$$
where D is the diameter as before and G is an upperbound on the norm of any gradient vector ever output by the oracle.
Proof:
$$E[f(z) - f(x^*)] \le E\Big[\frac{1}{T}\sum_i \big(f(x^{(i)}) - f(x^*)\big)\Big] \qquad \text{by convexity of } f$$
$$\le \frac{1}{T}\sum_i E\big[\nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)\big] \qquad \text{using (16.2)}$$
$$= \frac{1}{T}\sum_i E\big[g_i \cdot (x^{(i)} - x^*)\big] \qquad \text{since the expected gradient is the true gradient}$$
$$= \frac{1}{T}\sum_i E\big[f_i(x^{(i)}) - f_i(x^*)\big] \qquad \text{by definition of } f_i$$
$$= E\Big[\frac{1}{T}\sum_i \big(f_i(x^{(i)}) - f_i(x^*)\big)\Big],$$
and the theorem now follows since the expression in the E[·] is just the regret, which is always
upperbounded by the quantity given in Zinkevich’s theorem, so the same upperbound holds
also for the expectation. 2
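As a sketch of how this is used in practice, here is stochastic gradient descent for the hinge-loss SVM objective of Example 32, where each step uses the (sub)gradient of a single randomly chosen example as the unbiased gradient estimate; the step size, projection radius, and scaling by m are illustrative choices of mine.

import numpy as np

def sgd_svm(A, b, radius, eta, T, seed=0):
    """SGD on sum_i max(0, 1 - b_i * W.a_i), projecting W onto a ball of the given radius."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = np.zeros(n)
    avg = np.zeros(n)
    for _ in range(T):
        i = rng.integers(m)                   # one random example per step
        if 1 - b[i] * (A[i] @ W) > 0:         # constraint violated: subgradient is -m*b_i*a_i
            W = W + eta * m * b[i] * A[i]     # scaled by m so the estimate is unbiased
        norm = np.linalg.norm(W)
        if norm > radius:                     # project back onto the ball |W| <= radius
            W = W * (radius / norm)
        avg += W
    return avg / T                            # average iterate, as in the analysis

A = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]])
b = np.array([1.0, 1.0, -1.0, -1.0])
print(sgd_svm(A, b, radius=np.sqrt(A.shape[1]), eta=0.05, T=2000))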
Some thought shows (confirming conventional wisdom) that it can be very suboptimal
to put all money in a single stock. A strategy that works better in practice is Constant
Rebalanced Portfolio (CRB): decide upon a fixed proportion of money to put into each stock,
and buy/sell individual stocks each day to maintain this proportion.
Example 36 Say there are only two assets, stocks and bonds. One CRB strategy is to split money equally between these two. Notice what this implies: if an asset's price falls, you tend to buy more of it, and if the price rises, you tend to sell it. Thus this strategy roughly implements the age-old advice to "buy low, sell high." Concretely, suppose that on odd days the stock price rises by a factor 4/3 while the bond price falls by a factor 3/4, and on even days the opposite happens.
Note that the prices go up and down by the same ratio on alternate days, so money
parked fully in stocks or fully in bonds earns nothing in the long run. (Aside: This kind of
fluctuation is not unusual; it is generally observed that bonds and stocks move in opposite
directions.) And what happens if you split your money equally between these two assets?
Each day it increases by a factor 0.5 × (4/3 + 3/4) = 0.5 × 25/12 ≈ 1.04. Thus your money
grows exponentially!
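A quick sanity check of these numbers in Python (the 20-day horizon is arbitrary):

# Simulate the two-asset example: the stock multiplies by 4/3 then 3/4 on alternating
# days, bonds do the opposite; compare buy-and-hold with the 50-50 rebalanced portfolio.
days = 20
stock, bond, crb = 1.0, 1.0, 1.0
for t in range(days):
    r_stock, r_bond = (4/3, 3/4) if t % 2 == 0 else (3/4, 4/3)
    stock *= r_stock
    bond *= r_bond
    crb *= 0.5 * r_stock + 0.5 * r_bond   # rebalance to 50-50 every day
print(stock, bond, crb)   # stock and bond end near 1.0; crb grows like 1.042**20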
Exercise: Modify the price increases in the above example so that keeping all money
in stocks or bonds alone will cause it to drop exponentially, but the 50-50 CRB increases
money at an exponential rate.
CRB uses a fixed split among n assets, but what is this split? Wouldn’t it be great to
have an angel whisper in our ears on day 1 what this magic split is? Online optimization
is precisely such an angel. Suppose the algorithm uses the vector x(t) at time t; the ith
coordinate gives the proportion of money in stock i at the start of the tth day. Then the
algorithm's wealth increases on day t by a factor r^{(t)} · x^{(t)}, where r^{(t)} is the vector of price relatives on day t (its ith coordinate is the factor by which the price of stock i changes that day). Thus the goal is to find the x^{(t)}'s to maximize the final wealth, which is
$$\prod_t r^{(t)} \cdot x^{(t)}.$$
For any fixed r^{(1)}, r^{(2)}, . . . the logarithm of this quantity, Σ_t log(r^{(t)} · x^{(t)}), is a concave function of the x^{(t)}'s, which is fine since we are interested in maximization. Now we can try to run online gradient descent on this objective. By Zinkevich's theorem, the quantity in (16.8) converges to
$$\sum_t \log(r^{(t)} \cdot x^*). \qquad (16.9)$$
maximize cᵀx
subject to Ax ≤ b, x ≥ 0
where A is an m × n real constraint matrix and x, c ∈ ℝ^n. Recall that if the number of bits to represent the input is L, a polynomial time solution to the problem is allowed to have a running time of poly(n, m, L).
The Ellipsoid algorithm for linear programming is a specific application of the ellipsoid method developed by the Soviet mathematicians Shor (1970) and Yudin and Nemirovskii (1975). Khachiyan (1979) applied the ellipsoid method to derive the first polynomial time algorithm for linear programming. Although the algorithm is theoretically better than the Simplex algorithm, which has exponential running time in the worst case, it is very slow in practice and not competitive with Simplex. Nevertheless, it is a very important theoretical tool for developing polynomial time algorithms for a large class of convex optimization problems, which are much more general than linear programming.
In fact we can use it to solve convex optimization problems that are even too large to
write down.
Example 37 Semidefinite programming (SDP) uses the convex set of PSD matrices in ℝ^{n×n}. This set is defined by the following infinite set of constraints: aᵀXa ≥ 0 for all a ∈ ℝ^n. Each of these is really a linear constraint on the X_ij's:
$$\sum_{ij} X_{ij}\, a_i a_j \ge 0.$$
Example 38 (Held-Karp relaxation for TSP) In the traveling salesman problem (TSP)
we are given n points and distances dij between every pair. We have to find a salesman
tour, which is a sequence of hops among the points such that each point is visited exactly
once and the total distance covered is minimized.
An integer programming formulation of this problem is:
$$\min \sum_{ij} d_{ij} X_{ij}$$
$$X_{ij} \in \{0, 1\} \quad \forall i, j$$
$$\sum_j X_{ij} = 2 \quad \forall i \qquad \text{(degree constraints)}$$
$$\sum_{i \in S,\, j \notin S} X_{ij} \ge 2 \quad \forall S \subseteq V,\ S \ne \emptyset, V \qquad \text{(subtour elimination)}$$
The last family of constraints is needed because without it the solution could be a disjoint union of subtours; hence these are called subtour elimination constraints. The Held-Karp relaxation relaxes the first constraint to 0 ≤ X_ij ≤ 1. Now this is a linear program, but it has about 2^n + n² constraints! We cannot afford to write them all down (for then we might as well use the trivial exponential time algorithm for TSP).
Clearly, we would like to solve such large (or infinite) programs, but we need a different
paradigm than the usual one that examines the entire input.
Abstractly the problem is
min cᵀx subject to x ∈ K,
where K is a convex body. In the SDP example above this body was finite (i.e., bounded), since we had a constraint like X_ii = 1 for all i, which implies that |X_ij| ≤ 1 for all i, j. In most settings of interest we can place some a priori upper bound on the desired solution that ensures K is a finite body.
In fact, since we can use binary search to reduce optimization to a decision problem, we can replace the objective by a constraint cᵀx ≥ c₀. Then we are looking for a point in the convex body K ∩ {x : cᵀx ≥ c₀}, which is another convex body K′. We conclude that convex programming boils down to finding a single point in a convex body (where we may repeat this basic operation multiple times with different convex bodies).
Here are other examples of convex sets and bodies.
1. The whole space ℝ^n is trivially an (infinite) convex set.
2. The hypercube of side length l is the set of all x such that 0 ≤ x_i ≤ l for 1 ≤ i ≤ n.
3. The ball of radius r around the origin is the set of all x such that $\sum_{i=1}^n x_i^2 \le r^2$.
Figure 17.1: Farkas’s Lemma: Between every convex body and a point outside it, there’s a
hyperplane
Example 41 Consider the polytope defined by the Held-Karp relaxation. We are given a
candidate solution P = (Pij ). Suppose P12 = 1.1. Then it violates the constraint X12 ≤ 1,
and thus the hyperplane X12 = 1 separates the polytope from P .
Thus to check that P lies in the polytope defined by all the constraints, we first check that Σ_j P_ij = 2 for all i. This can be done in polynomial time. If the equality is violated for any i then that gives a separating hyperplane.
If all the other constraints are satisfied, we finally turn to the subtour elimination constraints. We construct the weighted graph on n nodes where the weight of edge {i, j} is P_ij. We compute the minimum cut in this weighted graph. The subtour elimination constraints are all satisfied iff the minimum cut (S, S̄) has capacity ≥ 2. If the mincut (S, S̄) has capacity less than 2 then the hyperplane
$$\sum_{i \in S,\, j \notin S} X_{ij} = 2$$
has P on the < 2 side and the Held-Karp polytope on the ≥ 2 side.
Thus you can think of a separation oracle as providing a “letter of rejection”to the point
outside it explaining why it is not in the body K.
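For illustration, here is a sketch of the subtour-elimination separation oracle in Python, assuming the networkx package is available and using its Stoer-Wagner global minimum cut; the fractional candidate P below is made up.

import numpy as np
import networkx as nx

def heldkarp_separation(P, tol=1e-9):
    """Return a violated subtour-elimination cut (S, cut value) if one exists, else None."""
    n = P.shape[0]
    # Degree constraints are checked separately; here we only look for a cut of weight < 2.
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if P[i, j] > tol:
                G.add_edge(i, j, weight=float(P[i, j]))
    cut_value, (S, _) = nx.stoer_wagner(G)   # global minimum cut of the weighted graph
    return (set(S), cut_value) if cut_value < 2 - tol else None

# Two triangles joined by one weak edge: the weak edge gives a cut of value 0.5 < 2.
P = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    P[a, b] = P[b, a] = 1.0
P[2, 3] = P[3, 2] = 0.5
print(heldkarp_separation(P))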
Example 42 For the set of PSD matrices, the separation oracle is given a matrix P. It computes eigenvalues and eigenvectors to check whether P has only nonnegative eigenvalues. If not, then it takes an eigenvector a corresponding to a negative eigenvalue and returns the hyperplane
$$\sum_{ij} X_{ij}\, a_i a_j = 0.$$
(Note that the a_i's are constants here.) Then the PSD matrices are on the ≥ 0 side and P is on the < 0 side.
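A minimal numpy sketch of this oracle (assuming the candidate P is symmetric; the example matrix is made up):

import numpy as np

def psd_separation(P, tol=1e-9):
    """Return a vector a with a^T P a < 0 (a violated constraint), or None if P is PSD."""
    eigenvalues, eigenvectors = np.linalg.eigh(P)   # eigh is for symmetric matrices
    if eigenvalues[0] >= -tol:
        return None                                  # all eigenvalues nonnegative: P is PSD
    return eigenvectors[:, 0]                        # eigenvector of the most negative eigenvalue

P = np.array([[1.0, 2.0], [2.0, 1.0]])               # eigenvalues 3 and -1, so not PSD
a = psd_separation(P)
print(a, a @ P @ a)                                   # a^T P a is negative, certifying this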
where the λ_i's are nonzero reals. In 3D this is an egg-like object where a_1, a_2, a_3 are the radii along the three axes (see Figure 17.2). A general ellipsoid in ℝ^n can be represented as
(x − a)ᵀ B (x − a) ≤ 1,
The convex body K is presented by a separation oracle, and we are told that the body lies somewhere inside some ellipsoid E₀ whose description is given to us. At the ith iteration the algorithm maintains the invariant that the body is inside some ellipsoid E_i. The iteration is very simple.
Let p = central point of E_i. Ask the oracle if p ∈ K. If it says "Yes," declare success. Else the oracle returns some halfspace aᵀx ≥ b that contains K whereas p lies on the other side. Let E_{i+1} = the minimum ellipsoid containing the convex body E_i ∩ {x : aᵀx ≥ b}.
Figure 17.3: Couple of runs of the Ellipsoid method showing the tiny convex set in blue
and the containing ellipsoids. The separating hyperplanes do not pass through the centers
of the ellipsoids in this figure.
The running time of each iteration depends on the running time of the separation oracle
and the time required to find Ei+1 . For linear programming, the separation oracle runs in
O(mn) time as all we need to do is check whether p satisfies all the constraints, and return
a violating constraint as the halfspace (if it exists). The time needed to find Ei+1 is also
polynomial by the following non-trivial lemma from convex geometry.
Lemma 27
The minimum volume ellipsoid surrounding a half-ellipsoid (i.e. E_i ∩ H⁺ where H⁺ is a halfspace as above) can be calculated in polynomial time, and
$$Vol(E_{i+1}) \le \left(1 - \frac{1}{2n}\right) Vol(E_i).$$
Thus after t steps the volume of the enclosing ellipsoid has dropped by (1 − 1/2n)t ≤
exp(−t/2n).
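For concreteness, here is one iteration of the central-cut ellipsoid update in Python. The update formulas are the standard ones from the literature rather than spelled out in these notes, so treat them (and the toy run) as an assumption; the ellipsoid is represented as E = {x : (x − c)ᵀP⁻¹(x − c) ≤ 1}.

import numpy as np

def ellipsoid_step(c, P, a):
    """One central-cut update: keep the half-ellipsoid on the side {x : a.(x - c) <= 0}."""
    n = len(c)
    Pa = P @ a
    g = Pa / np.sqrt(a @ Pa)                          # normalized direction
    c_new = c - g / (n + 1)
    P_new = (n**2 / (n**2 - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(g, g))
    return c_new, P_new

# Toy run: always discard the half not containing a fixed target point.
target = np.array([3.0, 4.0])
c, P = np.zeros(2), 100.0 * np.eye(2)
for _ in range(120):
    c, P = ellipsoid_step(c, P, c - target)
print(c)   # the center homes in on the neighborhood of the target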
Technically speaking, there are many fine points one has to address. (i) The Ellipsoid method can never say unequivocally that the convex body is empty; it can only say after T steps that the volume is less than exp(−T/2n). In many settings we know a priori that the volume of K, if nonempty, is at least exp(−n²) or some such number, so this is good enough. (ii) The convex body may be low-dimensional. Then its n-dimensional volume is 0 and the containing ellipsoid continues to shrink forever. At some point the algorithm has to take notice of this, identify the lower dimensional subspace that the convex body lies in, and continue in that subspace.
As for linear programming, it can be shown that for a linear program which requires L bits to represent the input, it suffices to start with an ellipsoid E₀ of volume 2^{c₂nL} (since the solution can be written in c₂nL bits, it fits inside an ellipsoid of about this size) and to finish when the volume of E_t is 2^{−c₁nL}, for some constants c₁, c₂, which implies t = O(n²L). Therefore, after O(n²L) iterations the containing ellipsoid is so small that the algorithm can easily "round" it to some vertex of the polytope. (This number of iterations can be improved to O(nL) with some work.) Thus the overall running time is poly(n, m, L). For a detailed proof of the
above lemma and other derivations, please refer to Santosh Vempala’s notes linked from the
webpage. The classic [GLS] text is a very readable yet authoritative account of everything
related (and there’s a lot) to the Ellipsoid method and its variants.
Bibliography
We are used to the concept of duality in life: yin and yang, Mars and Venus, etc. In
mathematics duality refers to the phenomenon whereby two objects that look very different
are actually the same in a technical sense.
Today we first see LP duality, which will then be explored a bit more in the homeworks.
Duality has several equivalent statements.
a_1 · X ≥ b_1
a_2 · X ≥ b_2
...
a_m · X ≥ b_m          (18.1)
X ≥ 0
a_1 · X ≥ b_1
a_2 · X ≥ b_2
...
a_m · X ≥ b_m          (18.2)
X ≥ 0
The notation X ≥ Y simply means that X is componentwise at least Y. Now we represent the system in (18.2) more compactly using matrix notation. Let
$$A = \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{pmatrix} \qquad \text{and} \qquad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}.$$
Then the Linear Program (LP for short) can be rewritten as:
min cT X :
AX ≥ b (18.3)
X ≥0
This form is general enough to represent any possible linear program. For instance,
if the linear program involves a linear equality a · X = b then we can replace it by two
inequalities
a · X ≥ b and − a · X ≥ −b.
If the variable X_i is unconstrained, then we can replace each occurrence by X_i⁺ − X_i⁻, where X_i⁺, X_i⁻ are two new non-negative variables.
Primal                          Dual
min cᵀX :                       max Yᵀb :
AX ≥ b                          YᵀA ≤ cᵀ          (18.4)
X ≥ 0                           Y ≥ 0
(Aside: if the primal contains an equality constraint instead of inequality then the
corresponding dual variable is unconstrained.)
It is an easy exercise that the dual of the dual is just the primal.
Theorem 28
The Duality Theorem. If both the Primal and the Dual of an LP are feasible, then the
two optima coincide.
Proof of the easy direction (weak duality): let X* be any primal feasible solution and Y* any dual feasible solution. Primal feasibility gives
AX* ≥ b.
Since Y* ≥ 0, multiplying on the left by Y*ᵀ preserves the inequality:
Y*ᵀAX* ≥ Y*ᵀb.
On the other hand, dual feasibility says Y*ᵀA ≤ cᵀ, and since X* ≥ 0,
cᵀX* ≥ (Y*ᵀA)X*.
Chaining these inequalities, cᵀX* ≥ Y*ᵀb: every dual feasible value is a lower bound on the primal optimum.
For the other direction, let k be the optimum value of the primal and fix any ε > 0. Then the following system of inequalities has no solution, since its first inequality says cᵀX ≤ k − ε:
−cᵀX ≥ −(k − ε)
AX ≥ b          (18.5)
X ≥ 0
By Farkas' Lemma there are nonnegative multipliers λ₀, λ₁, . . . , λ_m certifying this infeasibility. Note that λ₀ > 0, since omitting the first inequality in (18.5) leaves a feasible system by our assumption about the primal. Thus, consider the nonnegative vector
Λ = (λ₁/λ₀, . . . , λ_m/λ₀)ᵀ.
The inequality (18.6) implies that ΛT A ≤ cT . So Λ is a feasible solution to the Dual.
The inequality (18.7) implies that ΛT b > (k−ε), and since the Dual is a maximization
problem, this implies that the Dual optimal is bigger than k − ε. Since this holds for
every ε > 0, by compactness we conclude that there is a Dual feasible solution of value
k. Thus, this part is proved, too. Hence the Duality Theorem is proved.
2
My thoughts on this business:
(1) Usually textbooks bundle the case of infeasible systems into the statement of the Duality
theorem. This muddies the issue for the student. Usually all applications of LPs fall into
two cases: (a) We either know (for trivial reasons) that the system is feasible, and are only
interested in the value of the optimum or (b) We do not know if the system is feasible and
that is precisely what we want to determine. Then it is best to just use Farkas’ Lemma.
(2) The proof of the Duality theorem is interesting. The first part shows that for any
dual feasible solution Y the various Yi ’s can be used to obtain a weighted sum of primal
inequalities, and thus obtain a lowerbound on the primal. The second part shows that
this method of taking weighted sums of inequalities is sufficient to obtain the best possible
lowerbound on the primal: there is no need to do anything fancier (e.g., taking products of
inequalities or some such thing).
Since P could contain exponentially many paths, this is an LP with exponentially many
variables. Luckily duality tells us how to solve it using the Ellipsoid method.
Going over to the dual, we get:
$$\min \sum_{e \in E} c_e y_e : \qquad (18.11)$$
$$y_e \ge 0 \quad \forall e \in E \qquad (18.12)$$
$$\sum_{e \in P} y_e \ge 1 \quad \forall P \in \mathcal{P} \qquad (18.13)$$
Notice that the dual in fact represents the fractional min s − t cut problem: think of
each edge e being picked up to a fraction ye . The constraints say that a total weight of 1
must be picked on each path. Thus the usual s-t min cut problem simply involves 0 − 1
solutions to the ye ’s in the dual.
Exercise 1 Prove that the optimum solution does have ye ∈ {0, 1}, and thus the solution
to the dual is the best s-t min cut.
This setting is called zero sum because what one player wins, the other loses. By
contrast, war (say) is a setting where both parties may lose material and men. Thus their
combined worth at the end may be lower than at the start. (Aside: An important stimulus
for development of game theory in the 1950s was the US government’s desire to behave
“strategically ”in matters of national defence, e.g. the appropriate tit-for-tat policy for
waging war —whether nuclear or conventional or cold.)
von Neumann was interested in a notion of equilibrium. In physics, chemistry etc. an
equilibrium is a stable state for the system that results in no further change. In game theory
it is a pair of strategies g1 , g2 for the two players such that each is the optimum response
to the other.
Let’s examine this for zero sum games. If player 1 announces he will play the ith move,
then the rational move for player 2 is the move j that maximises Aij . Conversely, if player
2 announces she will play the jth move, player 1 will respond with move i0 that minimizes
Ai0 j . In general, there may be no equilibrium in such announcements: the response of player
1 to player 2’s response to his announced move i will not be i in general:
It turns out this result, von Neumann's min-max theorem, is a simple consequence of LP duality and is in fact equivalent to it. You will explore this further in the homework.
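For a concrete illustration, here is how one might compute an optimal mixed strategy for the row player with scipy's LP solver, under the convention used above that the row player minimizes the payoff A_ij; the example matrix is made up.

import numpy as np
from scipy.optimize import linprog

def row_player_strategy(A):
    """Minimize v subject to (x^T A)_j <= v for every column j, sum(x) = 1, x >= 0."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = 1.0                                   # objective: minimize v
    A_ub = np.hstack([A.T, -np.ones((n, 1))])     # for each column j: sum_i A_ij x_i - v <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]     # x >= 0, v unrestricted
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                   # optimal mixed strategy and game value

# Matching-pennies style payoffs: the optimal strategy is uniform and the value is 0.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(row_player_strategy(A))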
What if the game is not zero sum? Defining an equilibrium for it was an open problem
until John Nash at Princeton managed to define it in the early 1950s; this solution is called
a Nash equilibrium. We’ll return to it in a future lecture. BTW, you can still sometimes
catch a glimpse of Nash around campus.
Chapter 19
Economic and game-theoretic reasoning —specifically, how agents respond to economic in-
centives as well as to each other’s actions– has become increasingly important in algorithm
design. Examples: (a) Protocols for networking have to allow for sharing of network re-
sources among users, companies etc., who may be mutually cooperating or competing. (b)
Algorithm design at Google, Facebook, Netflix etc.—what ads to show, which things to
recommend to users, etc.—not only has to be done using objective functions related to eco-
nomics, but also with an eye to how users and customers change their behavior in response
to the algorithms and to each other.
Algorithm design mindful of economic incentives and strategic behavior is studied in a
new field called Algorithmic Game Theory. (See the book by Nisan et al., or many excellent
lecture notes on the web.)
Last lecture we encountered zero sum games, a simple setting. Today we consider more
general games.
Example 43 (Prisoners’ Dilemma) This is a classic example that people in myriad dis-
ciplines have discussed for over six decades. Two people suspected of having committed a
crime have been picked up by the police. In line with usual practice, they have been placed
in separate cells and offered the standard deal: help with the investigation, and you’ll be
treated with leniency. How should each prisoner respond: Cooperate (i.e., stick to the story
he and his accomplice decided upon in advance), or Defect (rat on his accomplice and get
a reduced term)?
Let's describe their incentives as a 2 × 2 matrix, where the first entry in each cell is the payoff for the player whose action determines the row:

              Cooperate   Defect
  Cooperate   3, 3        0, 4
  Defect      4, 0        1, 1

If they both cooperate, the police can't prove much and they get off with fairly light sentences, after which they can enjoy their loot (payoff of 3). If one defects and the other cooperates, then the defector goes scot free and has a high payoff of 4, whereas the other one has a payoff of 0 (long prison term, plus anger at his accomplice).
The only pure Nash equilibrium is (Defect, Defect), with both receiving payoff 1. In every other scenario, the player who's cooperating can improve his payoff by switching to Defect. This is much worse for both of them than (Cooperate, Cooperate), which is the social optimum: there the sum of their payoffs is highest, namely 6. Thus in particular the social optimum is not a Nash equilibrium. (OK, we are talking about criminals here, so maybe the social optimum is (Defect, Defect) after all. But read on.)
One can imagine other games with similar payoff structure. For instance, two companies
in a small town deciding whether to be polluters or to go green. Going green requires
investment of money and effort. If one does it and the other doesn’t, then the one who is
doing it has incentive to also become a polluter. Or, consider two people sharing an office.
Being organized and neat takes effort, and if both do it, then the office is neat and both are
fairly happy. If one is a slob and the other is neat, then the neat person has an incentive
to become a slob (saves a lot of effort, and the end result is not much worse).
Such games are actually ubiquitous if you think about it, and it is a miracle that humans
(and animals) cooperate as much as they do. Social scientists have long pondered how to
cope with this paradox. For instance, how can one change the game definition (e.g. a wise
governing body changes the payoff structure via fines or incentives) so that cooperating
with each other —the socially optimal solution—becomes a Nash equilibrium? The game
can also be studied via the repeated game interpretation, whereby people realize that they
participate in repeated games through their lives, and playing nice may well be a Nash
equilibrium in that setting. As you can imagine, many books have been written. 2
Example 44 (Chicken) This dangerous game was supposedly popular among bored teenagers
in American towns in the 1950s (as per some classic movies). Two kids would drive their
cars at high speed towards each other on a collision course. The one who swerved away first
to avoid a collision was the “chicken.”How should we assign payoffs in this game? Each
player has two possible actions, Chicken or Dare. If both play Dare, they wreck their cars
and risk injury or death. Lets call this a payoff of 0 to each. If both go Chicken, they both
live and have not lost face, so let’s call it a payoff of 5 for each. But if one goes Chicken and
the other goes Dare, then the one who went Dare looks like the tough one (and presumably
attracts more dates), whereas the Chicken is better off being alive than dead but lives in shame. So we get the payoff table:

            Chicken   Dare
  Chicken   5, 5      1, 6
  Dare      6, 1      0, 0
This has two pure Nash equilibria: (Dare, Chicken) and (Chicken, Dare). We may think of these as representing two types of behavior: the reckless type may play Dare and the careful type may play Chicken.
Note that the socially optimal solution —both players play chicken, which maximises
their total payoff—is not a Nash equilibrium.
Many games do not have any pure Nash equilibrium. Nash’s great insight during his
grad school years in Princeton was to consider what happens if we allow players to play a
mixed strategy, which is a probability distribution over actions. An equilibrium now is a
pair of mixed strategies x, y such that each strategy is the optimum response (in terms of
maximising expected payoff) to the other.
Theorem 30 (Nash 1950)
For every pair of payoff matrices A, B there is an odd number (hence nonzero) of mixed
equilibria.
Unfortunately, Nash’s proof doesn’t yield an efficient algorithm for computing an equi-
librium: when the number of possible actions is n, computation may require exp(n) time.
Recent work has shown that this may be inherent: computing Nash equilibria is PPAD-
complete (Chen and Deng’06).
The Chicken game has a mixed equilibrium: play each of Chicken and Dare with probability 1/2. This has expected payoff ¼(5 + 1 + 6 + 0) = 3 for each player, and a simple calculation shows that neither can improve his payoff against the other by changing to a different strategy.
1. There are n users. If user i gets x units of bandwidth by paying w dollars, his/her utility is U_i(x) − w, where the utility function U_i is nonnegative, increasing, and concave.¹
¹ Concavity implies that going from 0 units to 1 brings more happiness than going from 1 to 2, which in turn brings more happiness than going from 2 to 3. For twice differentiable functions, concavity means the second derivative is nonpositive.
2. The game is as follows: user i offers to pay a sum of w_i. The link owner allocates a w_i/Σ_j w_j fraction of the bandwidth to user i. Thus the entire bandwidth is used up, and the effective price for the entire bandwidth is p = Σ_j w_j.
User i therefore receives x_i = w_i/p units of bandwidth and chooses w_i to maximise U_i(w_i/p) − w_i. Setting the derivative with respect to w_i to zero gives
$$U_i'(x_i)\left(\frac{1}{p} - \frac{w_i}{p^2}\right) = 1 \quad \Longrightarrow \quad U_i'(x_i)(1 - x_i) = p.$$
This implicitly defines x_i in terms of p. Furthermore, the left hand side is easily checked to be a decreasing function of x_i. (Specifically, its derivative is (1 − x_i)U_i''(x_i) − U_i'(x_i), whose first term is nonpositive by concavity and whose second term is nonpositive because U_i'(x_i) ≥ 0 by our assumption that U_i is an increasing function.) Thus Σ_i x_i is a decreasing function of p. When p = +∞, the x_i's that maximise utility are all 0, whereas for p = 0 the x_i's are all 1, which violates the constraint Σ_i x_i = 1. By the intermediate value theorem, there must exist a choice of p between 0 and +∞ where Σ_i x_i = 1, and the corresponding values of the w_i's then constitute a Nash equilibrium.
Is this equilibrium socially optimal? Let p* be the socially optimal price. At this price the ith user desires a bandwidth x_i that maximises U_i(x_i) − p*x_i, which is the unique x_i that satisfies U_i'(x_i) = p*. Furthermore these x_i's must sum to 1.
By contrast, the Nash equilibrium price p_N corresponds to solving U_i'(x_i)(1 − x_i) = p_N. If the number of users is large (and the utility functions not "too different", so that the x_i's are not too different) then each x_i is small and 1 − x_i ≈ 1. Thus the Nash equilibrium price is close to, but not the same as, the socially optimal choice.
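As an illustration, the equilibrium price can be computed numerically by the bisection argument above. Here is a sketch with made-up utility functions U_i(x) = c_i log(1 + x), for which the condition U_i'(x_i)(1 − x_i) = p has the closed form x_i(p) = max(0, (c_i − p)/(c_i + p)); the c_i's and tolerance are arbitrary.

def equilibrium_price(c, tol=1e-10):
    """Bisect on the price p until the induced allocations x_i(p) sum to 1."""
    lo, hi = 0.0, max(c)          # at p = 0 the sum exceeds 1; at p = max(c) it is 0
    while hi - lo > tol:
        p = (lo + hi) / 2
        total = sum(max(0.0, (ci - p) / (ci + p)) for ci in c)
        if total > 1.0:
            lo = p                # demand too high: raise the price
        else:
            hi = p
    return p

c = [4.0, 3.0, 2.0]               # made-up "willingness to pay" parameters
p = equilibrium_price(c)
print(p, [max(0.0, (ci - p) / (ci + p)) for ci in c])   # allocations sum to about 1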
Price of Anarchy
One of the notions highlighted by algorithmic game theory is price of anarchy, which is the
ratio between the cost of the Nash equilibrium and the social optimum. The idea behind
this name is that Nash equilibrium is what would be achieved in a free market, whereas
social optimum is what could be achieved by a planner who knows everybody’s utilities.
One identifies a family of games, such as bandwidth sharing, and looks at the maximum of
this ratio over all choices of the players’ utilities. The price of anarchy for the bandwidth
sharing game happens to be 4/3. Please see the chapter on inefficiency of equilibria in the
AGT book.
Possibly you originally guessed that the players would converge to playing Rock, Paper, Scissors uniformly at random. However, this is not regret minimizing, since in expectation it leads to payoff 0 every third round. What you probably saw in your simulation was that the players converged to a correlated strategy that guarantees one of them a payoff every other round. Thus they learnt to game the system together and maximise their profits.
This is a subcase of a more general phenomenon, whereby playing low-regret strategies
in general leads to a different type of equilibrium, called correlated equilibrium.
The previous example illustrates the notion of correlated equilibrium, and we won’t
define it more precisely. The main point is that it can be arrived at using a simple algorithm,
namely, multiplicative weights (this statement also has caveats; see the relevant chapter in
the AGT book). Unfortunately, correlated equilibria are also not guaranteed to maximise
social welfare.
Bibliography
3. Settling the Complexity of 2-Player Nash Equilibrium. X. Chen and X. Deng. IEEE
FOCS 2006.
Chapter 20
Computer and information systems are prone to data loss—lost packets, crashed or cor-
rupted hard drives, noisy transmissions, etc.—and it is important to prevent actual loss of
important information when this happens. Today’s lecture concerns error correcting codes,
a stepping point to many other ideas, including a big research area (usually based in EE de-
partments) called information theory. This area started with a landmark paper by Claude
Shannon in 1948, whose key insight was that data transmission is possible despite noise and
errors if the data is encoded in some redundant way.
these subsets. This works against Ω(m) errors but we don’t know of an efficient decoding
algorithm. (Decoding in exp(k) time is no problem.)
Theorem 31
Such E, D do not exist if $m < \frac{n}{1 - H(p)}$, and do exist for p ≤ 1/4 if $m > \frac{n}{1 - H(2p)}$. Here $H(p) = p \log_2 \frac{1}{p} + (1 - p) \log_2 \frac{1}{1 - p}$ is the so-called entropy function.
Proof: We only prove existence; the method does not give efficient algorithms to encode/decode. For any string y ∈ {0, 1}^m let Ball(y) denote the set of strings that differ from y in at most 2pm coordinates. The number of such strings is
$$\binom{m}{0} + \binom{m}{1} + \cdots + \binom{m}{2pm},$$
which is at most 2^{H(2p)m} by Stirling's approximation.
Define the encoding function E using the following greedy procedure. Number the strings in {0, 1}^n from 1 to 2^n and one by one assign to each string x its encoding E(x) as follows. The first string is assigned an arbitrary string in {0, 1}^m. At step i the ith string is assigned an arbitrary string that lies outside Ball(E(x)) for every string x considered before it.
By construction, such an encoding function satisfies that E(x) and E(x′) differ in more than 2pm coordinates. Thus we only need to show that the greedy procedure succeeds in assigning an encoding to each string. To do this it suffices to note that if 2^m > 2^n · 2^{H(2p)m} then the greedy procedure never runs out of strings to assign as encodings.
The nonexistence is proved in a similar way. Now for y ∈ {0, 1}^m let Ball′(y) be the set of strings that differ from y in at most pm indices. By a similar calculation as above, this has cardinality about 2^{H(p)m}. If an encoding function exists, then Ball′(E(x)) and Ball′(E(x′)) must be disjoint for all x ≠ x′ (since otherwise some string in the intersection would not have an unambiguous decoding). Hence 2^n · 2^{H(p)m} ≤ 2^m, which implies that m ≥ n/(1 − H(p)). 2
Figure 20.2: Linear system corresponding to polynomial interpolation; matrix on left side
is Vandermonde.
Proof: The polynomial c(x) − e(x)p(x) has a root at ui whenever vi is uncorrupted since
p(ui ) = vi . Thus this polynomial, which has degree d + k, has n − k roots. Thus if
n − k > d + k + 1 this polynomial is identically 0. 2
In fact even approximating such integrals can be NP-hard, as shown by Koutis (2003). Valiant (1979) showed that the computational heart of such problems is combinatorial counting problems. The goal in such problems is to compute the size of a set S where we can test membership in S in polynomial time. The class of such problems is called #P.
Example 48 #SAT is the problem where, given a boolean formula ϕ, we have to compute the number of satisfying assignments to ϕ. Clearly it is NP-hard, since if we can solve it we can in particular solve the decision problem: decide whether the number of satisfying assignments is at least 1.
#CYCLE is the problem where, given a graph G = (V, E), we have to compute the number of cycles in G. Here the decision problem ("is G acyclic?") is easily solvable using breadth first search. Nevertheless, the counting problem turns out to be NP-hard.
#SPANNINGTREE is the problem where, given a graph G = (V, E), we have to compute the number of spanning trees in G. This has been known to be solvable using a simple determinant computation (Kirchhoff's matrix-tree theorem) since the 19th century.
Valiant's class #P captures most interesting counting problems. Many of these are NP-hard, but not all. You can learn more about them in COS 522: Computational Complexity, usually taught in the spring semester. 2
It is easy to see that the above integration problem can be reduced to a counting problem with some loss of precision. First, recall that integration basically involves summation: we appropriately discretize the space and then take the sum of the integrand values (assuming the integrand doesn't vary much within each cell of the discretization). Thus the integration reduces to some sum of the form
$$\sum_{x_1 \in [N],\, x_2 \in [N],\, \ldots,\, x_n \in [N]} g(x_1, x_2, \ldots, x_n),$$
where [N] denotes the set of integers 0, 1, . . . , N. Now assuming g(·) ≥ 0, this sum is easily estimated using the sizes of sets of the form
$$\{(x, c) : x \in [N]^n;\ c \le g(x) \le c + \varepsilon\}.$$
Note that if g is computable in polynomial time then we can test membership in this set in polynomial time given (x, c, ε), so we have shown that integration is a #P problem.
We will also be interested in sampling a random element of a set S. In fact, this will
turn out to be intimately related to the problem of counting.
Proof: For concreteness, let's prove this for the problem of counting the number of satisfying assignments to a boolean formula. Let #ϕ denote the number of satisfying assignments to formula ϕ.
Sampling ⇒ Approximate counting: Suppose we have an algorithm that is an approximate sampler for the set of satisfying assignments of any formula. For now assume it is an exact sampler instead of an approximate one. Take m samples from it and let p₀ be the fraction that have a 0 in the first bit x₁, and p₁ be the fraction that have a 1. Assume p₀ ≥ 1/2. Then the estimate of p₀ is correct up to a factor (1 + 1/√m) by Chernoff bounds. But denoting by ϕ|_{x₁=0} the formula obtained from ϕ by fixing x₁ to 0, we have
$$p_0 = \frac{\#\varphi|_{x_1=0}}{\#\varphi}.$$
Since we have a good estimate of p₀, to get a good estimate of #ϕ it suffices to have a good estimate of #ϕ|_{x₁=0}. So produce the formula ϕ|_{x₁=0} obtained from ϕ by fixing x₁ to 0, then use the same algorithm recursively on this smaller formula to estimate N₀, the value of #ϕ|_{x₁=0}. Then output N₀/p₀ as your estimate of #ϕ. (The base case n = 1 can be solved exactly, of course.)
Thus if Err_n is the multiplicative error in the estimate for formulae with n variables, it satisfies
$$Err_n \le (1 + 1/\sqrt{m})\, Err_{n-1},$$
which solves to Err_n ≤ (1 + 1/√m)^n. By picking m ≫ n²/ε² this error can be made less than 1 + ε. It is easily checked that if the sampler is not exact but only approximate, the algorithm works essentially unchanged, except the sampling error also enters the expression for the error in estimating p₀.
Approximate counting ⇒ Sampling: This involves reversing the above reasoning. Given an approximate counting algorithm, we are trying to generate a random satisfying assignment. First use the counting algorithm to approximate #ϕ|_{x₁=0} and #ϕ, and take their ratio to get a good estimate of p₀, the fraction of satisfying assignments that have 0 in the first bit. (If p₀ is too small, then we instead have a good estimate of p₁ = 1 − p₀.) Now toss a coin with Pr[heads] = p₀. If it comes up heads, output 0 as the first bit of the assignment and then recursively use the same algorithm on ϕ|_{x₁=0} to generate the remaining n − 1 bits. If it comes up tails, output 1 as the first bit and then recursively use the same algorithm on ϕ|_{x₁=1} to generate the remaining n − 1 bits.
Note that the quality ε of the approximation suffers a bit in going between counting and sampling. 2
Figure 21.1: Monte Carlo (dart throwing) method to estimate the area of a circle. The
fraction of darts that fall inside the disk is π/4.
Now replace "circle" with any set S and "square" with any set Ω that contains S and can be sampled in polynomial time. Then just take many samples from Ω and observe the fraction that fall in S. This gives an estimate for |S|/|Ω|, and hence for |S|. The problem with this method is that usually the obvious Ω is much bigger than S, and we need about |Ω|/|S| samples before any of them lie in S. (For instance the obvious Ω for computing #ϕ is the set of all possible assignments, which may be exponentially bigger.)
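For instance, the dart-throwing estimate of π from Figure 21.1 takes a couple of lines (the sample count is arbitrary):

import random

# Estimate pi by throwing darts at the unit square and counting hits in the quarter disk.
samples = 1_000_000
hits = sum(1 for _ in range(samples)
           if random.random() ** 2 + random.random() ** 2 <= 1.0)
print(4.0 * hits / samples)   # close to 3.14159...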
In the second case note that this element j satisfies w_j > W/n, which implies w_j′ ≥ n. Clearly, g is at most n-to-1, since a set T in S can have at most n pre-images under g.
Now let's verify that T = g(T′) lies in S:
$$\sum_{i \in T} w_i \le \sum_{i \in T} \frac{W}{n^2}(w_i' + 1) \le \frac{W}{n^2}\,(W' - w_j' + n - 1) \le W \qquad \text{(since } W' = n^2 \text{ and } w_j' \ge n\text{)},$$
which implies T ∈ S. 2
Sampling algorithm for Ω To sample from Ω, use our earlier equivalence of approximate
counting and sampling. That algorithm needs an approximate count not only for |Ω| but
also for the subset of Ω that contain the first element. This is another knapsack problem
and can thus be solved by Dyer’s dynamic programming. And same is true for instances
obtained in the recursion.
Bibliography
Say we want to distribute a secret, say a number a₀, among n people. (For example, a₀ could be the secret key to decrypt an important message.) We want the following property: every subset of t + 1 people should be able to pool their information and recover the secret, but no subset of t people should be able to recover any information at all about the secret by pooling theirs.
For simplicity interpret a₀ as a number in a finite field Z_q. Then pick t random numbers a₁, a₂, . . . , a_t in Z_q, construct the polynomial p(x) = a₀ + a₁x + a₂x² + · · · + a_t x^t, and evaluate it at n points x₁, x₂, . . . , x_n that are known to all of them. Then give p(x_i) to person i.
Notice, the shares are t-wise independent random variables. (Each subset of t shares is distributed like a uniformly random t-tuple over Z_q.) This follows from polynomial interpolation (which we explained last time using the Vandermonde determinant): for every t-tuple of people and every t-tuple of values y₁, y₂, . . . , y_t ∈ Z_q, there is a unique polynomial whose constant term is a₀ and which takes these values for those people. Thus every t-tuple of values is equally likely, irrespective of a₀, and gives no information about a₀.
Furthermore, since p has degree t, each subset of t + 1 shares can be used to reconstruct p(x) and hence also the secret a₀.
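Here is a small Python sketch of this scheme over Z_q for a prime q; the prime, the secret, and the evaluation points 1, . . . , n are illustrative choices.

import random

Q = 2_147_483_647   # a prime; all arithmetic below is modulo Q

def share(secret, t, n):
    """Split `secret` into n shares so that any t+1 of them determine it."""
    coeffs = [secret] + [random.randrange(Q) for _ in range(t)]   # p(x) = a0 + a1 x + ... + at x^t
    return [(i, sum(c * pow(i, k, Q) for k, c in enumerate(coeffs)) % Q)
            for i in range(1, n + 1)]

def reconstruct(shares):
    """Recover p(0) = a0 by Lagrange interpolation at x = 0 from t+1 shares."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for k, (xk, _) in enumerate(shares):
            if k != j:
                num = num * (-xk) % Q
                den = den * (xj - xk) % Q
        secret = (secret + yj * num * pow(den, Q - 2, Q)) % Q   # divide via Fermat inverse
    return secret

shares = share(secret=123456789, t=3, n=7)
print(reconstruct(shares[:4]))   # any 4 = t+1 shares recover 123456789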
Each step of a straight line program computes y_i ← y_{i1} op y_{i2}, where op is + or × and y_{i1}, y_{i2} are inputs or values computed in earlier steps. A simple induction shows that a straight line program with inputs x₁, x₂, . . . , x_n computes a multivariate polynomial in these variables. The degree can be rather high, about 2^m for a program of m steps. So this is a powerful model.
(Aside: Straight line programs are sometimes called algebraic circuits. If you replace
the arithmetic operations with boolean operations ∨, ¬, ∧ you get a model that can do any
computation at all.)
Definition 12 ((t, n)-secretsharing) If a₀ ∈ Z_q then a (t, n)-secretsharing of it is a sequence of n numbers β₁, β₂, . . . , β_n obtained as in Section 22.1 by using a polynomial of the form $a_0 + \sum_{i=1}^{t} a_i x^i$, where a₁, a₂, . . . , a_t are random numbers in Z_q.
The general invariant maintained by the protocol is the following: At the end of step i, the
n players hold the n values in some (t, n)-secretsharing of the value of yi .
Clearly, at the start of the protocol such a secretsharing for the values of the n input
variables x1 , x2 , . . . , xn has been divided among the players. So the invariant is true for
i ≤ n. Assuming it is true for i we show how to maintain it for i + 1. If yi+1 is the +
of two earlier variables, then the simple protocol of Section 22.3 allows the invariant to be
maintained.
So assume y_{i+1} is the × of two earlier variables. If these two earlier variables were secretshared using polynomials $g(x) = \sum_{r=0}^{t} g_r x^r$ and $h(x) = \sum_{r=0}^{t} h_r x^r$, then the values being secretshared are g₀, h₀, and the obvious polynomial to secretshare their product is
$$\pi(x) = g(x)h(x) = \sum_{r=0}^{2t} x^r \sum_{j \le r} g_j h_{r-j}.$$
The constant term in this polynomial is g₀h₀, which is indeed the desired product. Secretsharing this polynomial means everybody takes their shares of g and h respectively and multiplies them. Nothing more to do.
Unfortunately, this polynomial π has two problems: the degree is 2t instead of t and,
more seriously, its coefficients are not random numbers in Zq . Thus it is not a (t, n)-
secretsharing of g0 h0 .
The degree problem is easy to solve: just drop the higher degree terms and stay with
the first t terms. Dropping terms is a linear operation and can be done using the simple
protocol of Section 22.3. We won’t go into details.
To solve the problem about the coefficients not being random numbers, each of the
players does the following. The kth player picks a random degree 2t polynomial rk (x) whose
constant term is 0. Then he secret shares this polynomial among all the other players. Now
the players can compute their secretshares of the polynomial
$$\pi(x) + \sum_{k=1}^{n} r_k(x),$$
and the constant term in this polynomial is still g₀h₀. Then they apply truncation to this polynomial to drop the higher order terms. Thus at the end the players have a (t, n)-secretsharing of the value y_{i+1}, thereby maintaining the invariant.
Subtleties The above description assumes that the malicious players follow the protocol.
In general the t malicious players may not follow the protocol in an attempt to learn things
they otherwise can’t. Modifying the protocol to handle this —and proving it works—is
more nontrivial.
Bibliography
The first two-thirds of this chapter is based upon the guest lecture of Kai Li, and the rest upon Sanjeev's lecture.
Commodity clusters: Large number of off-the-shelf computers (with their own memories)
linked together with a LAN. There is no shared memory or storage. Pros: Easy to
scale; can easily handle the petabyte-size or larger data sets. Cons: Programming
model has to deal explicitly with failures.
Tech companies and data centers have gravitated towards commodity clusters with tens
of thousands or more processors. The power consumption may approach that of a small
town. In such massive systems failures —processor, power supplies, hard drives etc.—are
inevitable. The software must be designed to provide reliability on top of such frequent
failures. Some techniques: (a) replicate data on multiple disks/machines (b) replicate com-
putation by splitting into smaller subtasks (c) use good data placement to avoid long latency.
Google pioneered many such systems for their data centers and released some of these
for general use. MapReduce is a notable example. The open source community then came
up with its own versions of such systems, such as Hadoop. SPARK is another programming
environment developed for ML applications.
23.2 MapReduce
MapReduce is Google's programming interface for commodity computing. It evolved from older ideas in functional programming and databases. It is easy to pick up, but achieving
high performance requires mastery of the system.
It abstracts away issues of data replication, processor failure/retry etc. from the pro-
grammer. One consequence is that there is no guarantee on running time.
The programming abstraction is rather simple: the data resides in an unsorted clump
of (key, value) pairs. We call this a database to ease exposition. (The programmer has to
write a mapper function that produces this database from the data.) Starting with such
a database, the system applies a sort that moves all pairs with the same key to the same
physical location. Then it applies a reduce operation –provided by the programmer—that
takes a bunch of pairs with the same key and applies some combiner function to produce
a new single pair with that key and whose value is some specified combination of the old
values.
Example 49 (Word Count) The analog of the usual Hello World program in the MapReduce world is the program that counts the number of repetitions of each word. The programmer provides the following.
mapper: Input: a text corpus. Output: for each occurrence of a word w, produce the pair (w, 1). This gives a database.
reduce: Given a bunch of pairs of type (w, count), produce a single pair (w, C) where C is the sum of all the counts.
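A self-contained simulation of this pipeline in plain Python; the shuffle/sort step is modeled with a dictionary, and names like map_phase are my own.

from collections import defaultdict

def map_phase(corpus):
    """Mapper: emit (word, 1) for every occurrence of every word."""
    return [(word, 1) for line in corpus for word in line.split()]

def shuffle(pairs):
    """The system's sort step: group all pairs with the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: combine the values for each key (here, by summing the counts)."""
    return {word: sum(counts) for word, counts in groups.items()}

corpus = ["hello world", "hello mapreduce world"]
print(reduce_phase(shuffle(map_phase(corpus))))   # {'hello': 2, 'world': 2, 'mapreduce': 1}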
Any smart teenager who knows how to program can come up with a new algorithm.
Analysing algorithms, by contrast, is not easy and usually beyond the teenager’s skillset. In
fact, if the algorithm is complicated enough, proving things about it (i.e., whether or not it
works) becomes very difficult for even the best experts. Thus not all algorithms that have
been designed have been analyzed. The algorithms we study today are called heuristics:
for most of them we know that they do not work on worst-case instances, but there is good
evidence that they work very well on many instances of practical interest. Explaining this
discrepancy theoretically is an interesting and challenging open problem.
Though the heuristics apply to many problems, for pedagogical reasons, throughout the
lecture we use the same problem as an example: 3SAT. Recall that the input to this problem
consists of clauses which are ∨ (i.e., logical OR) of three literals, where a literal is one of
n variables x1 , x2 , . . . , xn , or its negation. For example: (x1 ∨ ¬x4 ∨ x7 ) ∧ (x2 ∨ x3 ∨ ¬x4 ).
The goal is to find an assignment to the variables that makes all clauses evaluate to true.
This is the canonical NP-complete problem: every other NP problem can be reduced
to 3SAT (Cook-Levin Theorem, early 1970s). More importantly, problems in a host of
areas are actually solved this way: convert the instance to an instance of 3SAT, and use
an algorithm for 3SAT. In AI this is done for problems such as constraint satisfaction and
motion planning. In hardware and software verification, the job of verifying some property
of a piece of code or circuit is also reduced to 3SAT.
Let's get the simplest algorithm for 3SAT out of the way: try all assignments. This has the disadvantage that it takes 2^n time on instances that have few (or no) satisfying assignments. But there are cleverer algorithms, which run very fast and often solve 3SAT instances arising in practice, even with hundreds of thousands of variables. The code for these solvers is publicly available, and whenever faced with a difficult problem you should try to represent it as 3SAT and use these solvers.
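For example, assuming the pycosat package (a Python binding to the PicoSAT solver) is installed, the formula (x₁ ∨ ¬x₄ ∨ x₇) ∧ (x₂ ∨ x₃ ∨ ¬x₄) from above can be handed to a solver as DIMACS-style clause lists:

import pycosat   # assumes the pycosat package is installed

# Each clause is a list of nonzero integers: i means x_i, -i means its negation.
clauses = [[1, -4, 7], [2, 3, -4]]
print(pycosat.solve(clauses))        # prints a satisfying assignment as a list of signed literals
print(pycosat.solve([[1], [-1]]))    # an unsatisfiable formula: returns "UNSAT"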
Figure 24.1: Local search algorithms try to improve the solution by looking for small changes
that improve it.
Figure 24.2: The relation "is a divisor of" is a partial order among integers.
Clearly, for every partial order on a finite set, there is a maximal element i such that
i ⊀ j for all j (namely, any leaf of the directed acyclic graph.) This simple mathematical
statement can be represented as an unsatisfiable formula. However, the above heuristics
seem to have difficulty detecting that it is unsatisfiable.
This formula has a variable x_ij for every pair of elements i, j. There is a family of clauses representing the properties of a partial order:
¬x_ii   ∀i
¬x_ij ∨ ¬x_jk ∨ x_ik   ∀i, j, k
¬x_ij ∨ ¬x_ji   ∀i, j
Finally, there is a family of clauses saying that no i is a maximal element. These clauses don't have size 3 but can be rewritten as clauses of size 3 using new variables.
x_i1 ∨ x_i2 ∨ · · · ∨ x_in   ∀i
Figure 24.3: Monte Carlo (dart throwing) method to estimate the area of a circle. The
fraction of darts that fall inside the disk is π/4.
Now suppose we are trying to integrate a nonnegative valued function f over the region. Then we should throw a dart that lands at x with probability proportional to f(x). We'll examine how to throw such a dart.
First note that this is an example of sampling from a probability distribution for which only the density function is known. Say the distribution is defined on {0, 1}^n and we have a goodness function f(x) that is nonnegative and computable in polynomial time given x ∈ {0, 1}^n. Then we wish to sample from the distribution where the probability of getting x is proportional to f(x). Since probabilities must sum to 1, we conclude that this probability is f(x)/N, where $N = \sum_{x \in \{0,1\}^n} f(x)$ is the so-called partition function. The main problem here is that N is in general hard to compute; it is complete for the class #P mentioned in the earlier lecture.
Example 53 The dart throwing/integration problem arises in machine learning (more generally, in statistical procedures). For instance if there is a density p(x, y) and we wish to compute p(x|y) using Bayes' rule, then we need p(x, y)/p(y), and
$$p(y) = \int p(x, y)\, dx.$$
Let's note that if one could do such dart throwing in general, then 3SAT becomes easy. Suppose the formula has n variables and m clauses. For any assignment x define f(x) = 2^{2n f_x}, where f_x is the number of clauses satisfied by x. Then if the formula has a satisfying assignment, N > 2^{2nm}, whereas if the formula is unsatisfiable then N < 2^n × 2^{2n(m−1)} < 2^{2nm}. In particular, the mass f(x) of a satisfying assignment exceeds the combined mass of all unsatisfying assignments. So the ability to sample from the distribution would yield a satisfying assignment with high probability.
The Metropolis-Hastings algorithm (named after its inventors) is a heuristic for sampling
from such a distribution. Define the following random walk on {0, 1}^n. At every step the
walk is at some x ∈ {0, 1}^n. (At the beginning use an arbitrary x.) At every step, toss a
coin. If it comes up heads, stay at x. (In other words, there is a self-loop of probability at
least 1/2.) If the coin comes up tails, pick a uniformly random neighbor x' of x (i.e., a string
differing from x in one coordinate) and move to x' with probability min{1, f(x')/f(x)}.
(In other words, if f(x') ≥ f(x), definitely move; otherwise move with probability equal to
the ratio f(x')/f(x).)
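In code, the walk looks roughly as follows; this is a minimal Python sketch assuming f(x) > 0 everywhere, with the choice of f and the number of steps left to the caller.

    import random

    def metropolis_walk(f, n, steps):
        # Lazy Metropolis walk on {0,1}^n; stationary distribution proportional to f.
        x = [random.randint(0, 1) for _ in range(n)]      # arbitrary starting point
        for _ in range(steps):
            if random.random() < 0.5:                     # self-loop with probability 1/2
                continue
            i = random.randrange(n)                       # uniformly random Hamming neighbor
            y = list(x)
            y[i] = 1 - y[i]
            if random.random() < min(1.0, f(y) / f(x)):   # accept with prob min{1, f(y)/f(x)}
                x = y
        return x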
Claim: If f(x) > 0 for all x, then the stationary distribution of this Markov chain is exactly
f(x)/N, the desired distribution.
Proof: The Markov chain defined by this random walk is ergodic, since f(x) > 0 for all x implies
the chain is connected, and the self-loops imply it is aperiodic. Thus it suffices to show that the (unique)
stationary distribution has the form f(x)/K for some scale factor K; it then follows
that K is the partition function N. To do so it suffices to verify that such a distribution is
stationary, i.e., in one step the probability flowing out of a vertex equals the probability flowing in. For any
x, let L be the set of neighbors with a lower f value and H the set of neighbors with f value at least
as high. Ignoring the common factor 1/n for the choice of a uniformly random neighbor (it
appears in both outflow and inflow), the outflow of probability per step is
f(x)/2K · ( Σ_{x' ∈ L} f(x')/f(x) + Σ_{x' ∈ H} 1 ) = 1/2K · ( Σ_{x' ∈ L} f(x') + Σ_{x' ∈ H} f(x) ).
The inflow is Σ_{x' ∈ L} f(x')/2K · 1 + Σ_{x' ∈ H} f(x')/2K · f(x)/f(x'), which is the same
quantity, so the distribution f(x)/K is indeed stationary.
Note: The advantage of the random walk method is that it can in principle explore a space
of exponential size while using only the memory needed to store the current x. In this sense it is like
local search; in fact it is a probabilistic version of local search on the objective f(x). In
local search one moves from x to x' only if that improves f, whereas here the move is made
with a probability depending upon the ratio f(x')/f(x), so every possible move has nonzero
probability.
Simulated Annealing. If we use the suggested goodness function for 3SAT, f(x) = 2^{2n·f_x},
then this Markov chain can be shown to mix poorly. So a variant is to use a Markov
chain that updates itself. The goodness function is initialized to 2^{γ·f_x} with, say, γ = 1, and the
chain is allowed to mix. This stationary distribution may put too little weight on the satisfying
assignments, so γ is then slowly increased from 1 to 2n, allowing the chain to mix for a while
after each increase. This family of algorithms is called simulated annealing, named after the
physical process of annealing.
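Here is a rough Python sketch of this idea for 3SAT. The schedule of γ values and the number of steps per value are arbitrary illustrative choices, not prescriptions from these notes.

    import random

    def simulated_annealing_3sat(clauses, n, gammas, steps_per_gamma):
        # Goodness function 2^(gamma * f_x), with gamma increased according to `gammas`.
        def num_satisfied(x):
            return sum(any(x[abs(lit) - 1] == (lit > 0) for lit in c) for c in clauses)

        x = [random.randint(0, 1) for _ in range(n)]
        for gamma in gammas:                       # e.g. gammas = [1, 2, 4, ..., 2*n]
            for _ in range(steps_per_gamma):
                if random.random() < 0.5:          # lazy self-loop
                    continue
                i = random.randrange(n)
                y = list(x)
                y[i] = 1 - y[i]
                diff = num_satisfied(y) - num_satisfied(x)
                # acceptance probability min{1, 2^(gamma * diff)}
                if diff >= 0 or random.random() < 2.0 ** (gamma * diff):
                    x = y
            if num_satisfied(x) == len(clauses):
                return x                           # found a satisfying assignment
        return None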
For further information see this survey and its list of references.
You can collaborate with your classmates, but be sure to list your collaborators with your
answer. If you get help from a published source (book, paper etc.), cite that. The answer
must be written by you and you should not be looking at any other source while writing it.
Also, limit your answers to one page, preferably less —you just need to give enough detail
to convince the grader.
Typeset your answer in latex. (If you don’t know latex, you can write by hand, but scan
your answers into pdf form before submitting. There are scanners in the mailroom, and most
smartphones can also scan.)
§1 The simplest model for a random graph consists of n vertices, with a fair coin tossed
for each pair {i, j} to decide whether that edge should be present in the graph. Call
this G(n, 1/2). A triangle is a set of 3 vertices with an edge between each pair.
What is the expected number of triangles? What is the variance? Use the Chebyshev
inequality to show that the number is concentrated around the expectation and give
an expression for the exact decay in probability. Is it possible to use Chernoff bounds
in this setting?
§2 (Part 1): You are given a fair coin, and a program that generates the binary expansion
of p up to any desired accuracy. Formally describe a procedure to simulate a biased
coin that comes up heads with probability p. (This was sketched in class.) (Part 2)
Now show how to do the reverse: generate a fair coin toss using a biased coin whose
bias is unknown.
§4 Show that given n numbers in [0, 1] it is impossible to estimate the value of the median
within, say, a factor of 1.1 using o(n) samples. (Hint: to show an impossibility result,
exhibit two different sets of n numbers that have very different medians but which
—whp— generate identical samples of size o(n).)
Now calculate the sample size needed (as a function of t) so that the following is true:
with high probability, the median of the sample has at least n/2 − t numbers less than
it and at least n/2 − t numbers more than it.
§5 Consider the following process for matching n jobs to n processors. In each step, every
job picks a processor at random. The jobs that have no contention on the processors
they picked get executed, and all the other jobs back off and then try again. Jobs only
take one round of time to execute, so in every round all the processors are available.
Show that all the jobs finish executing whp after O(log log n) steps.
§6 In class we saw a hash-based data structure to estimate the size of a set. Change it to
estimate frequencies. Thus there is a stream of packets, each containing a key, and you
wish to maintain a data structure which allows you to estimate, at the end, the number of
times each key appeared in the stream. The size of the data structure should not
depend upon the number of distinct keys in the stream, but it can depend upon the
success probability, the approximation error, etc. Just shoot for the following kind of
approximation: if a_k is the true number of times that key k appeared in the stream,
then your estimate should be a_k ± ε(Σ_k a_k). In other words, the estimate is going
to be accurate only for keys that appear frequently (“heavy hitters”) in the stream.
(This is useful in detecting anomalies or malicious attacks.) Hint: Think in terms of
maintaining m1 × m2 counts using as many independent hash functions, where each
key updates m2 of them.
You can collaborate with your classmates, but be sure to list your collaborators with your
answer. If you get help from a published source (book, paper etc.), cite that. The answer
must be written by you and you should not be looking at any other source while writing it.
Also, limit your answers to one page, preferably less —you just need to give enough detail
to convince the grader.
Typeset your answer in latex (if you don’t know latex, scan your handwritten work into
pdf form before submitting). To make things easier to grade, submit answers in the numbered
order listed below, and also make sure your name appears on every page.
§1 Draw the full tree of possibilities for the cake-eating problem discussed in class, and
compute the optimum cake-eating schedule. To keep the tree size manageable, draw
it with the following slight changes: the amount you eat each day has to be an integer
multiple of 1/3, and on each day your roommates will, with probability 1/2, eat 1/3
of the cake. (If multiple branches or sub-branches are identical, you may label the
branch with a variable and use the variable in lieu of re-drawing the branch. You may
also omit branches where you eat 0 cake.)
§2 (Stable Matchings with Real-Valued Utilities) We saw stable matchings in a guest lec-
ture. Another formulation of the bipartite stable matching problem has each agent i
submit a real number u_i(j) for each element j in the opposite partition, representing
the utility of being matched with that element. We then define a perfect matching M
to be stable if there does not exist a pair (v, w) ∉ M such that both u_v(w) > u_v(v′)
and u_w(v) > u_w(w′), where (v, v′) and (w, w′) are both in M.
(a) Prove that if the two partitions of the graph are of equal size, u_v(w) = u_w(v) for
all pairs (v, w), and u_v(w) ≠ u_{v′}(w′) for all {v, w} ≠ {v′, w′}, then there exists a
unique stable matching among the agents.
(b) Show by example that if we remove the final condition (that utilities are unique
between different pairs of agents) from part (a), then the instance can contain
multiple stable matchings.
§3 In ℓ2 regression you are given datapoints x_1, x_2, . . . , x_n ∈ R^k and some values y_1, y_2, . . . , y_n ∈ R,
and wish to find the “best” linear function that fits this dataset. A frequent choice for
best fit is the one with least squared error, i.e., find a ∈ R^k that minimizes
Σ_{i=1}^{n} |y_i − a · x_i|^2.
Show how to solve this problem in polynomial time. (Hint: reduce to solving linear
equations; at some point you may need a certain matrix to be invertible, which you
can assume.)
§4 (Firehouse location) Suppose we model a city as an m-point finite metric space with
d(x, y) denoting the distance between points x, y. These m(m−1)/2 distances (which satisfy
the triangle inequality) are given as part of the input. The city has n houses located at
points v_1, v_2, . . . , v_n in this metric space. The city wishes to build k firehouses and asks
you to help find the best locations c_1, c_2, . . . , c_k for them, which can be located at any
of the m points in the city. The happiness of a town resident with the final locations
depends upon his distance from the closest firehouse. So you decide to minimize the
cost function Σ_{i=1}^{n} d(v_i, u_i), where u_i ∈ {c_1, c_2, . . . , c_k} is the firehouse closest to v_i.
Describe an LP-based algorithm that runs in poly(m) time and solves this problem
approximately. If OPT is the optimum cost of a solution with k firehouses, your
solution is allowed to use O(k log m) firehouses and have cost at most (1 + ε)OPT.
§6 You are given data containing grades in different courses for 5 students. As discussed
in Lecture 5, we are trying to “explain” the grades as a linear function of the student’s
aptitude, the easiness of the course, and some error term. Denoting by Grade_ij the
grade of student i in course j, this linear model hypothesizes that
Grade_ij = aptitude_i + easiness_j + ε_ij.
§7 (Optimal life partners via MDP) Your friend is trying to find a life partner by going
on dates with n people selected for her by an online dating service. After each date
she has two choices: select the latest person she dated and stop the process, or reject
this person and continue to date. She has asked you to suggest the optimum stopping
rule. You can assume that the n persons are all linearly orderable (i.e. given a choice
between any two, she is not indifferent and prefers one over the other). The dating
service presents the n chosen people in a random order, and her goal is to maximise the
chance of ending up with the person that she will like the most among these n. (Thus
ending up even with her second favorite person out of the n counts as failure; she’s a
perfectionist.) Represent her actions as an MDP, compute the optimum strategy for
her and the expected probability of success by following this strategy.
(Hint: The optimal rule is of the form: date γn people and decide beforehand to pass
on them; after that, select the first person who is preferable to all people seen so far.
You may also need that Σ_{k=t1}^{t2} 1/k ≈ ln(t2/t1).)
§8 (extra credit) In question 4 try to design an algorithm that uses k firehouses but
has cost O(OPT). (Needs a complicated dependent rounding; you can also try other
ideas.) Partial credit available for partial progress.
1. Compute the mixing time (both upper and lower bounds) of a graph on 2n nodes
that consists of two complete graphs on n nodes joined by a single edge. (Hint: Use
elementary probability calculations and reasoning about “probability fluid”; no need
for eigenvalues.)
2. Let M be the Markov chain of a connected 5-regular undirected graph, where each
node additionally has a self-loop of probability 1/2. We saw in class that 1 is an eigenvalue,
with the all-ones vector as eigenvector. Show that every other eigenvalue has magnitude at most 1 − 1/(10n^2).
(Hint: First show that a connected graph cannot have two eigenvalues equal to 1.)
What does this imply about the mixing time for a random walk on this graph from
an arbitrary starting point?
3. (Game-playing equilibria) Recall the game of Rock, Paper, Scissors. Let’s make it
quantitative by saying that the winning player wins $1 whereas the loser gets $0.
(In other words, the game is not zero sum.) A draw results in both getting 0. Suppose
we make two copies of the multiplicative weight update algorithm to play each other
over many iterations. Both start using the uniformly random strategy (i.e., play each
of Rock/paper/scissors with probability 1/3) and learn from experience using the
MW rule. One imagines that repeated play causes them to converge to some kind
of equilibrium. (a) Predict by just calculation/introspection what this equilibrium
is. (Be honest; it’s Ok to be wrong!). (b) Run this experiment on Matlab or any
other programming environment and report what you discovered and briefly explain
it. (We’ll discuss the result in class.)
4. This question will study how mixing can be much slower on directed graphs. Describe
an n-node directed graph (with max indegree and outdegree at most 5) that is fully
connected but where the random walk takes exp(Ω(n)) time to mix (and the walk
ultimately does mix). Argue carefully.
5. Describe an example (i.e., an appropriate set of n points in R^n) that shows that the
Johnson-Lindenstrauss dimension reduction method —precisely the transformation
described in Lecture— does not preserve ℓ1 distances within even a factor of 2. (Extra
credit: Show that no linear transformation suffices, let alone J-L.)
6. (Dimension reduction for SVMs with margin) Suppose we are given two sets P, N of
unit vectors in R^n with the guarantee that there exists a hyperplane a · x = 0 such that
every point in P is on one side and every point in N is on the other. Furthermore,
the ℓ2 distance of each point in P and N to this hyperplane is at least ε. Show,
using the Johnson-Lindenstrauss lemma (hint: you can use it as a black box), that a
random linear mapping to O(log n/ε^2) dimensions preserves this property in the sense
that, with high probability, the mapped points are still separable by a hyperplane with margin ε/2.
7. Suppose you are trying to convince your friend that there is no perfect randomness in
his head. One way to do it would be to show that if you ask him to write down 100
random bits (say) then his last 20 are fairly predictable after you see the first 80.
Describe the design of such a predictor using a Markovian model, carefully describing
any assumptions. Implement the predictor in any suitable environment and submit
the code with your answer. Report the results from a couple of experiments of the
following form: ask a couple of friends to input 100 bits quickly (or 200 if they are
patient), and see how well the model predicts the last 20 (or 50) bits. The metric for
the model’s success in prediction is
Number of correct guesses − Number of incorrect guesses.
In order to do better than random guessing this number should be fairly positive.
8. (Extra credit) Calculate the eigenvectors and eigenvalues of the n-dimensional boolean
hypercube, which is the graph with vertex set {−1, 1}n and x, y are connected by an
edge iff they differ in exactly one of the n locations. (Hint: Use symmetry extensively.)
1. Implement the portfolio management strategy appearing in the notes for Lecture 16 in any
programming environment and check its performance on S&P stock data (download
from http://ocobook.cs.princeton.edu/links.htm). Include your code as well as the
final performance (i.e., the percentage gain achieved by your strategy).
2. Consider a set of n objects (images, songs etc.) and suppose somebody has designed a
distance function d(·) among them where d(i, j) is the distance between objects i and
j. We are trying to find a geometric realization of these distances. Of course, exact
realization may be impossible and we are willing to tolerate a factor 2 approximation.
We want n vectors u1 , u2 , . . . , un such that d(i, j) ≤ |ui − uj |2 ≤ 2d(i, j) for all pairs
i, j. Describe a polynomial-time algorithm that determines whether such ui ’s exist.
3. The course webpage links to a grayscale photo. Interpret it as an n×m matrix and run
SVD on it. What is the value of k such that a rank k approximation gives a reasonable
approximation (visually) to the image? What value of k gives an approximation that
looks high quality to your eyes? Attach both pictures and your code. (In matlab you
need the mat2gray function.) Extra credit: Try to explain from first principles why SVD
works for image compression at all.
4. Suppose we have a set of n images and for some multiset E of image pairs we have been
told whether they are similar (denoted +edges in E) or dissimilar (denoted −edges).
These ratings were generated by different users and may not be mutually consistent
(in fact the same pair may be rated as + as well as −). We wish to partition them
into clusters S1 , S2 , S3 , . . . so as to maximise:
(# of +edges that lie within clusters) + (# of −edges that lie between clusters).
Show that the following SDP is an upper bound on this, where w+(ij) and w−(ij) are
the number of times the pair i, j has been rated + and − respectively:
max Σ_{(i,j)∈E} [ w+(ij)(x_i · x_j) + w−(ij)(1 − x_i · x_j) ]
subject to   |x_i|_2^2 = 1   ∀ i
             x_i · x_j ≥ 0   ∀ i ≠ j.
5. For the problem in the previous question describe a clustering into 4 clusters that
achieves an objective value 0.75 times the SDP value. (Hint: Use Goemans-Williamson
style rounding but with two random hyperplanes instead of one. You may need a quick
matlab calculation just like GW.)
6. Suppose you are given m halfspaces in <n with rational coefficients. Describe a
polynomial-time algorithm to find the largest sphere that is contained inside the poly-
hedron defined by these halfspaces.
7. Let f be an n-variate convex function such that for every x, every eigenvalue of
∇^2 f(x) lies in [m, M]. Show that the optimum value of f is lower bounded by
f(x) − (1/2m)|∇f(x)|_2^2 and upper bounded by f(x) − (1/2M)|∇f(x)|_2^2, where x is any point. In other
words, if the gradient at x is small, then the value of f at x is near-optimal. (Hint:
By the mean value theorem, f(y) = f(x) + ∇f(x)^T(y − x) + (1/2)(y − x)^T ∇^2 f(z)(y − x),
where z is some point on the line segment joining x and y.)
1. Prove von Neumann’s min max theorem. You can assume LP duality.
2. One unit of traffic (a large number of individual drivers) needs to travel from s to t
in the network shown in Figure (a). (Actually, assume it is just a tiny bit less than one unit.)
Each driver’s choice of route can be seen as a move in a multiplayer game. What is the
Nash equilibrium, and what is each driver’s travel time to t in this equilibrium?
Figure (b) depicts the same network with a new superfast highway constructed from
v to w. What is the new Nash equilibrium and the new travel time?
3. Show that approximating the number of simple cycles within a factor 100 in a directed
graph is NP-hard. (Hint: Show that if there is a polynomial-time algorithm for this
task, then we can solve the Hamiltonian cycle problem in directed graphs, which is
NP-hard. Here the exact constant 100 is not important, and can even be replaced by,
say, n.)
4. (Extra credit) (Sudan’s list decoding) Let (a_1, b_1), (a_2, b_2), . . . , (a_n, b_n) ∈ F^2 where
F = GF(q) and q ≫ n. We say that a polynomial p(x) describes k of these pairs if
p(a_i) = b_i for k values of i. This question concerns an algorithm that recovers p even
if k < n/2 (in other words, even when a majority of the values are wrong).
(a) Show that there exists a bivariate polynomial Q(z, x) of degree at most ⌈√n⌉ + 1
in z and in x such that Q(b_i, a_i) = 0 for each i = 1, . . . , n. Show also that there is
an efficient (poly(n) time) algorithm to construct such a Q.
(b) Show that if R(z, x) is a bivariate polynomial and g(x) a univariate polynomial
then z − g(x) divides R(z, x) iff R(g(x), x) is the 0 polynomial.
(c) Suppose p(x) is a degree d polynomial that describes k of the points. Show that
if d is an integer and k > (d + 1)(⌈√n⌉ + 1), then z − p(x) divides the bivariate
polynomial Q(z, x) described in part (a). (Aside: Note that this places an upper
bound on the number of such polynomials. Can you improve this upper bound
by other methods?)
Princeton University
COS 521: Advanced Algorithms
Final Exam Fall 2014
Sanjeev Arora
Due electronically by Jan 19 5pm.
Instructions: The test has 6 questions. Finish the test within 48 hours after first
reading it. You can consult any notes/handouts from this class and feel free to quote,
without proof, any results from there. You cannot consult any other source or person in
any way.
Do not read the test before you are ready to work on it.
Write and sign the honor code pledge on your exam (The pledge is “I pledge
my honor that I have not violated the honor code during this examination.”)
Sanjeev, Kevin, and Siyu will be available Jan 11–19 via email and piazza to answer
any questions. We will also offer to call you if your confusion does not clear up. In case
of unresolved doubt, try to explain your confusion as part of the answer and maybe you
will receive partial credit. In general, stating clearly what you are trying to do can get you
partial credit.
5. In class we showed the existence of good error correcting codes. Now let us see that
there actually exist good error correcting codes with linear structure. A linear error
correcting code over GF(2)^n is an m × n matrix M such that the encoding of a column
vector x ∈ GF(2)^n is the m-bit vector E(x) = Mx. Show that if m(1 − H(p)) > n then there
exists such a linear error correcting code such that E(x) and E(y) disagree in at least pm
of the bits for every pair of distinct x, y.
6. (a) Show that if the graph is 3-colorable then the following SDP (with a unit vector v_i
for each vertex i) is feasible:
⟨v_i, v_j⟩ ≤ −1/2   ∀ {i, j} ∈ E
⟨v_i, v_i⟩ = 1       ∀ i
(b) Consider the following rounding algorithm: pick a random unit vector u and
threshold τ > 0 and select the set S = {i : hu, vi i > τ }. Argue that there is a
probability p such that for every i the probability that it lies in S is p.
(c) Show that for every edge {i, j} the probability that both i, j lie in S is at most
p^4. (Hint: Reason about the plane spanned by v_i and v_j.)
(d) Now argue that there is an efficient algorithm that, given a 3-colorable graph
with maximum degree d, finds an independent set of size Ω(n/d^{1/3}). Argue that
this can be turned into an algorithm that colors the graph with O(d^{1/3} log n)
colors. (Full credit also given if you have extra factors of log n in any of these
bounds.)