CMU 15-850 (Fall 2020)
https://www.cs.cmu.edu/~15850/
The style files (as well as the text on this page!) are mildly adapted
from the ones developed by Yufei Zhao (MIT), for his notes on Graph
Theory and Additive Combinatorics. As some of you may guess, the
LaTeX template used for these notes is called tufte-book.
I Classical Algorithms

1 Minimum Spanning Trees
• Otakar Borůvka gave the first known MST algorithm in 1926; it was subsequently rediscovered by Gustave Choquet (1938), Georges Sollin (1965), and several others. Vojtěch Jarník gave his algorithm in 1930, and it was independently discovered by Robert Prim ('57) and Edsger Dijkstra ('59), among others. Joseph Kruskal gave his algorithm in 1956.
• In 1984, Michael Fredman and Bob Tarjan gave an O(m log∗ n) time algorithm, based on their Fibonacci heaps data structure (published in 1987). Here log∗ is the iterated logarithm function, and denotes the number of times we must take logarithms before the argument becomes smaller than 1. The actual runtime is a bit more nuanced, which we will not bother with today.
• In 1995, David Karger, Phil Klein, and Bob Tarjan finally got the
holy grail of O(m) time! . . . but it was a randomized algorithm, so
the search for a deterministic linear-time algorithm continued.
• In 1998, Seth Pettie and Vijaya Ramachandran gave an optimal algorithm for computing minimum spanning trees—however, we don't know its runtime! More formally, they show that if there exists an algorithm which uses MST∗(m, n) comparisons to find MSTs on all graphs with m edges and n nodes, the Pettie-Ramachandran algorithm will run in time O(MST∗(m, n)). (This was part of Seth's Ph.D. thesis, and Vijaya was his advisor.)
Theorem 1.1 (Cut Rule). For any cut of the graph, the minimum-weight
edge that crosses the cut must be in the MST. This rule helps us determine
what to add to our MST.
Theorem 1.2 (Cycle Rule). For any cycle in G, the heaviest edge on that
cycle cannot be in the MST. This helps us determine what we can remove in
constructing the MST.
Proof. Let C be any cycle, and let e be the heaviest edge in C. For a con-
tradiction, let T be an MST that contains e. Dropping e from T gives
two components. Now there must be some edge e′ in C \ {e} that
crosses between these two components, and hence T ′ := ( T − {e′ }) ∪
{e} is a spanning tree. (Make sure you see why.) By the choice of e
we have w(e′ ) < w(e), so T ′ is a lower-weight spanning tree than T, a
contradiction.
Given some cycle containing no red edge, use the cycle rule to mark the heaviest edge as not being in the MST, and color it red. (Again, this edge cannot already be blue, for similar reasons.) And if either of the rules is not applicable, we are done. Indeed, if we cannot apply the blue rule, the blue edges cross every cut, and hence form a spanning tree, which must be the MST. Similarly, once the non-red edges do not contain a cycle, they form a spanning tree, which must be the MST. All known algorithms differ only in their choice of cut/cycle, and how they find these fast. Indeed, all the deterministic algorithms we discuss today will just use the cut rule, whereas the randomized algorithm will use the cycle rule as well.
• union(elem1 , elem2 ), which merges the two sets that elem1 and
elem2 are in.
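To make these operations concrete, here is a minimal union-find sketch in Python; the class name and the path-compression/union-by-rank choices are standard implementation details of mine, not taken from the notes:

```python
class UnionFind:
    """Disjoint-set structure supporting the union operation in the
    text, plus the usual find; with path compression and union by
    rank, each operation takes near-constant amortized time."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # follow parent pointers, halving the path as we go
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        # merge the two sets containing x and y
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```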
of our current tree T of blue edges to some vertex not yet in T, and color it blue—thereby adding this edge to T and increasing its size by one. Figure 1.2 shows an example of how edges are added.

We'll use a priority queue data structure which keeps track of the lightest edge connecting T to each vertex not yet in T. A priority queue data structure is equipped with (at least) three operations:

• insert(elem, key), which inserts the given (element, key) pair into the queue,
• decreasekey(elem, newkey), which lowers the key of elem to newkey, and
• extractmin(), which removes and returns the element with the smallest key.

[Figure 1.2: Dashed lines are not yet in the MST. We started at the red node, and the blue nodes are also part of T right now.]
Note that by using the standard binary heap data structure we can
get O(log n) worst-case time for each priority queue operation above.
To implement the Jarník/Prim algorithm, we initially insert each vertex in V \ {r} into the priority queue with key ∞, and the root r with key 0. The key of a node v denotes the weight of the least-weight edge from a node in T to v; it is zero if v ∈ T, and ∞ if there are no edges yet from nodes in T to v. At each step, use extractmin to find the vertex u with smallest key, and add u to the tree using this edge. Then for each neighbor v of u, do decreasekey(v, w({u, v})). Overall we do m decreasekey operations, n inserts, and n extractmins, with the decreasekeys supplying the dominating O(m log n) term. (We can optimize slightly by inserting a vertex into the priority queue only when it has an edge to the current tree T. This does not seem particularly useful right now, but will be crucial in the Fredman-Tarjan proof.)
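As a concrete rendering of this description, here is a Python sketch using the standard binary heap from heapq; since heapq has no decreasekey, the sketch uses the usual lazy-deletion workaround, which keeps the same O(m log n) bound. The function name and input format (adjacency lists adj[u] of (neighbor, weight) pairs) are my own choices:

```python
import heapq

def jarnik_prim(n, adj, r=0):
    """Sketch of the Jarnik/Prim algorithm above with a binary heap.
    adj[u] = list of (v, w) pairs; returns the list of tree edges."""
    key = [float("inf")] * n
    key[r] = 0
    in_T = [False] * n
    parent = [None] * n
    pq = [(0, r)]                        # (key, vertex) pairs
    tree_edges = []
    while pq:
        k, u = heapq.heappop(pq)
        if in_T[u] or k > key[u]:
            continue                     # stale heap entry: skip
        in_T[u] = True
        if parent[u] is not None:
            tree_edges.append((parent[u], u))
        for v, w in adj[u]:
            if not in_T[v] and w < key[v]:
                key[v], parent[v] = w, u
                heapq.heappush(pq, (w, v))   # stands in for decreasekey
        # note: a vertex first enters the heap only once it has an
        # edge to T, matching the optimization in the margin note
    return tree_edges
```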
[Figure 1.4: We begin at vertices A, H, R, and D (in that order) with K = 6. Although D begins as its own component, it stops when it joins with tree A. Dashed edges are not chosen in this step (though they may be chosen in the next recursive call), and colors denote trees.]
Let’s first note that the runtime of one round of the algorithm is
O(m + n log K ). Each edge is considered at most twice, once from
each endpoint, giving us the O(m) term. Each time we grow the
current tree in step 1, the number of connected components decreases
by 1, so there are at most n such steps. Each step calls findmin on
a heap of size at most K, which takes O(log K) time. Hence, at the
end of this round, we’ve successfully identified a forest, each edge of
which is part of the final MST, in O(m + n log K ) time.
Let d_v be the degree of the vertex v in the graph we consider in this round. We claim that every marked vertex u belongs to a component C such that ∑_{v∈C} d_v ≥ K. Indeed, if u became marked because the neighborhood of its component had size at least K, then this is true. Otherwise, u became marked because it entered a component C of marked vertices. Since the vertices of C were marked, ∑_{v∈C} d_v ≥ K before u joined, and this sum only increased when u (and other vertices) joined. Thus, if C_1, . . . , C_l are the components at the end of this routine, we have

2m = ∑_v d_v = ∑_{i=1}^{l} ∑_{v∈C_i} d_v ≥ ∑_{i=1}^{l} K = K·l.

Thus l ≤ 2m/K, i.e., this routine produces at most 2m/K trees.
The choice of K will change over the course of the algorithm. How should we set the thresholds K_i? Say we start round i with n_i nodes and m_i ≤ m edges. One clean way is to set

K_i := 2^{2m/n_i}.
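To see why this choice of thresholds gives the O(m log∗ n) bound (a short calculation, spelled out here for concreteness using the bounds above): round i costs O(m + n_i log K_i) = O(m + n_i · 2m/n_i) = O(m), and since round i ends with n_{i+1} ≤ 2m/K_i trees,

K_{i+1} = 2^{2m/n_{i+1}} ≥ 2^{2m/(2m/K_i)} = 2^{K_i}.

So the thresholds grow as a tower of 2s, and within O(log∗ n) rounds the threshold exceeds n, at which point the round completes the entire MST—for O(m log∗ n) total time.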
being devilishly clever. I think it is the latter (and that is the beauty of the best algorithms). Indeed, there's a lovely idea here—keeping the neighborhoods small at the beginning, when there's a lot of work to do, but allowing them to grow quickly as the graph collapses. It is quite non-obvious at the start, and obvious in hindsight. And once you see it, you cannot un-see it!
Fact 1.6 (Soundness). For any forest F, the F-light edges contain the
MST of the underlying graph G. In other words, any F-heavy edge is
also heavy with respect to the MST of the entire graph.
This suggests a clear strategy: pick a forest F from the current
edges, and discard all the F-heavy edges. Hopefully the number of
edges remaining is small. By Fact 1.6 these edges contain the MST of
G, so repeat the process on them. To make this idea work, we want
a forest F with many F-heavy edges. The catch is that a forest has many heavy edges only if it has small weight, i.e., only if there are many off-forest edges forming cycles on which they are the heaviest edges. Indeed, one such forest is the MST T∗ of G: Fact 1.5 shows there are m − (n − 1) many T∗-heavy edges, the maximum possible. How do we find some similarly good tree/forest, but in linear time?
A second issue is to classify edges as light/heavy, given a forest F.
It is easy to classify a single edge e in linear time, but the following
remarkable theorem is also true:
Proof. This follows from Fact 1.6, that discarding heavy edges of any
forest F in a graph does not change the MST. Indeed, the MST on
G2 is the same as the MST on G ′ , since the discarded F1 -heavy edges
cannot be in MST ( G ′ ) because of Fact 1.6. Adding back the edges
picked by Borůvka’s algorithm in Step 1 gives the MST on G, by the
cut rule.
is also short, but before we prove it, let us complete the proof of the
linear running time.
Theorem 1.11. The KKT algorithm, run on a graph with m edges and n
vertices, terminates in expected time O(m + n).
T_{m,n} := max_{G=(V,E): |V|=n, |E|=m} T_G.
Proof of Claim 1.10. For the sake of the proof, we can use any correct
algorithm to compute F1 , so let us use Kruskal’s algorithm. Moreover,
let's run a lazy version as follows: first sort all the edges in E′, and not just those in E1 ⊆ E′, and consider them in increasing order
of weights. Now if the currently considered edge ei connects two
different trees in the current blue forest, call ei useful and flip an
independent unbiased coin: if the coin comes up “heads”, color ei
blue and add it to F1 , else color ei red. The crucial observation is
that this process produces a forest from the same distribution as first
choosing G1 and then computing F1 by running Kruskal’s algorithm
on it.
Now, let us consider the lazy process again: which edges are F1 -
light? We claim that these are precisely the useful edges. Indeed,
any non-useful edge e j forms a cycle with the previously chosen
blue edges in F1 , and it is the heaviest edge on that cycle. Hence
e j does not belong to MST ( F1 ∪ {e j }), so it is F1 -heavy by Fact 1.4.
And a useful edge ei would belong to MST ( F1 ∪ {ei }), since run-
ning Kruskal’s algorithm on F1 ∪ {ei } would see that ei connects two
different blue components and hence would pick it.
Finally, how many useful edges are there, in expectation? Let’s
abstract away the details: we’re running a process that periodically
asks us to flip an independent unbiased coin. Each time we see a heads, we add an edge to the forest, so we definitely stop when we
see n′ − 1 heads. (We may stop earlier, in case the process runs out of
edges, but then we can pad the random sequence to flip some more
coins.) Since the coins are independent and unbiased, the expected
number of flips until we see n′ − 1 heads is exactly 2(n′ − 1). This
proves Claim 1.10.
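For concreteness, here is a small Python sketch of the lazy coin-flipping process from this proof; the sampling details and the inlined union-find are standard choices of mine, not from the notes:

```python
import random

def lazy_kruskal(n, edges):
    """Lazy process from the proof of Claim 1.10: scan edges in
    increasing weight order; every *useful* edge (one joining two
    different blue trees) gets an unbiased coin flip, and joins F1
    on heads. edges: list of (w, u, v). Returns (F1, #useful)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    F1, useful = [], 0
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                        # useful edge
            useful += 1
            if random.random() < 0.5:       # heads: color it blue
                parent[ru] = rv
                F1.append((u, v, w))
    # E[useful] <= 2(n-1), matching the negative-binomial argument
    return F1, useful
```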
That's it. The algorithm and proof are both short and slick and beautiful: this result is a real gem. I think it's an algorithm from The Book. (Paul Erdős claimed that God has "The Book", which contains the most elegant proof of each mathematical theorem.) The one slight annoyance with the algorithm is the relative complexity of the MST verification algorithm, which we use to find the F1-light edges in linear time. Nonetheless, these verification algorithms also contain many nice ideas, which we now discuss. (The current verification algorithms are deterministic; can we use randomness to simplify these as well?)
by János Komlós (1985). His result was subsequently made algorithmic ("how do you find (in linear time) which linear number of queries to make?") by Brendan Dixon, Monika Rauch (now Monika Henzinger), and Bob Tarjan. This algorithm was further simplified by Valerie King, and by Thomas Hagerup. We will just discuss Komlós's query-complexity bound.
A_e := (a_1, a_2, . . . , a_k),
Claim 1.13. The total number of comparisons for all queries is at most

∑_e log(|Q_e| + 1) ≤ O(n + n log((m + n)/n)) = O(m + n).
Exercise 1.14. Show that each node in T ′ has at least two children,
and all leaves belong to the same level. There are n leaves (corre-
sponding to the nodes in T), and at most 2n − 1 nodes in T ′ . Also
show how to construct T ′ in linear time.
Exercise 1.15. For nodes u, v in a tree T, let maxwtT (u, v) be the maxi-
mum weight of an edge on the (unique) path between u, v in the tree
T. Show that for all u, v ∈ V, maxwt_T(u, v) = maxwt_{T′}(u, v).
  j:      1    2    3            4              · · ·   n
  row 1:  2    4    6            8              · · ·   2n
  row 2:  2    4    8            16             · · ·   2^n
  row 3:  2    4    2^(2^2)      2^(2^(2^2))    · · ·   tower of 2s
  row 4:  2    4    65536 (!!)   huge!          · · ·
1.7 Matroids
2.1 Arborescences
are non-negative. Because no outgoing arcs from r will be part of any arborescence, we can assume no such arcs exist in G either. For brevity, we fix r and simply say arborescence when we mean r-arborescence. (If there are negative arc weights, add a large positive constant M to every weight. This increases the total weight of each arborescence by M(n − 1), and hence the identity of the minimum-weight one remains unchanged.)
Proof. Each arborescence has exactly one arc leaving each vertex.
Decreasing the weight of every arc exiting v by MG (v) decreases the
weight of every possible arborescence by MG (v) as well. Thus, the set
of min-weight arborescences remains unchanged.
Now each vertex has at least one 0-weight arc leaving it; for each vertex, pick an arbitrary 0-weight arc out of it. If this choice forms an arborescence, it has weight zero and hence is optimal; otherwise the chosen arcs contain a cycle, which the next step handles.
The proof also gives an algorithm for finding the min-weight arborescence on G′ by contracting the cycle C (in linear time), recursing on G′′, and then "lifting" the solution T′′ back to a solution T′. Since we recurse on a graph which has at least one fewer node, there are at most n recursive calls. Moreover, the weight-reduction, contraction, and lifting steps in each recursive call take O(m) time, so the runtime of the algorithm is O(mn).

[Figure 2.4: Contracting the two white nodes down to a cycle, and removing arc b.]
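Here is a hedged Python sketch of the reduce/contract/recurse scheme just described, written iteratively and computing only the weight of the optimal arborescence (recovering the arcs via "lifting" needs extra bookkeeping). The function name and input format are mine:

```python
def min_arborescence_weight(n, arcs, root):
    """Weight of the min-weight r-arborescence: every non-root vertex
    keeps exactly one outgoing arc, and all paths lead to `root`.
    arcs: list of (u, v, w). Assumes an arborescence exists. O(mn)."""
    total = 0
    INF = float("inf")
    while True:
        # cheapest out-arc for every non-root vertex
        best, head = [INF] * n, [None] * n
        for u, v, w in arcs:
            if u != root and w < best[u]:
                best[u], head[u] = w, v
        total += sum(best[v] for v in range(n) if v != root)
        # look for a cycle among the chosen (reduced-to-zero) arcs
        state, cycle = [0] * n, None   # 0 unseen, 1 on path, 2 done
        state[root] = 2
        for s in range(n):
            u, path = s, []
            while state[u] == 0:
                state[u] = 1
                path.append(u)
                u = head[u]
            if state[u] == 1:          # walked back into our own path
                cycle = path[path.index(u):]
                break
            for x in path:
                state[x] = 2
        if cycle is None:
            return total               # chosen arcs form an arborescence
        # contract the cycle into a supernode, reducing arc weights
        on_cycle = set(cycle)
        comp, cid = [0] * n, 0
        for v in range(n):
            if v not in on_cycle:
                comp[v] = cid
                cid += 1
        for v in on_cycle:
            comp[v] = cid              # shared supernode id
        new_arcs = []
        for u, v, w in arcs:
            w2 = w - best[u] if u != root else w
            if comp[u] != comp[v]:     # drop arcs inside the cycle
                new_arcs.append((comp[u], comp[v], w2))
        n, arcs, root = cid + 1, new_arcs, comp[root]
```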
Remark 2.8. This is not the best known run-time bound: there are many optimizations possible. Tarjan (1971) presents an implementation of the above algorithm using priority queues in O(min(m log n, n²)) time, and Gabow, Galil, Spencer, and Tarjan (1986) give an algorithm to solve the min-weight arborescence problem in O(n log n + m) time. The best runtime currently known is O(m log log n), due to Mendelson, Tarjan, Thorup, and Zwick (2006).
Open problem 2.9. Is there a linear-time (randomized or determinis-
tic) algorithm to find a min-weight arborescence in a digraph G?
The dual linear program has a single variable yi for each constraint
in the original (primal) linear program. This variable can be thought
of as giving an importance weight to the constraint, so that taking a
linear combination of constraints with these weights shows that the
primal cannot possibly surpass a certain value for c⊺ x. This purpose
is exemplified by the following theorem.
Proof. c⊺ x ≥ ( A⊺ y)⊺ x = y⊺ Ax ≥ y⊺ b = b⊺ y.
minimize   ∑_{a∈A} w(a) x_a
subject to ∑_{a∈∂⁺S} x_a ≥ 1   ∀S ⊆ V − {r}            (2.1)
           ∑_{a∈∂⁺v} x_a = 1   ∀v ≠ r
           x_a ∈ {0, 1}        ∀a ∈ A.

minimize   ∑_{a∈A} w(a) x_a
subject to ∑_{a∈∂⁺S} x_a ≥ 1   ∀S ⊆ V − {r}            (2.2)
           ∑_{a∈∂⁺v} x_a = 1   ∀v ≠ r
           x_a ≥ 0             ∀a ∈ A.
Exercise 2.15. Suppose all the arc weights are non-negative. Show that the optimal solution to the linear program remains unchanged even if we drop the constraints ∑_{a∈∂⁺v} x_a = 1.
maximize   ∑_{S⊆V−{r}} y_S
subject to ∑_{S: a∈∂⁺S} y_S ≤ w(a)   ∀a ∈ A            (2.3)
           y_S ≥ 0                   ∀S ⊆ V − {r}, |S| > 1.
Lemma 2.16. If arc weights are non-negative, there exists a solution for the dual LP (2.3) such that w⊺x = 1⊺y, where all y_S values are non-negative.
• The base case is when the chosen zero-weight arcs out of each
node form an arborescence. In this case we can set yS = 0 for
all S; since all arc weights are non-negative, this is a feasible dual
solution. Moreover, both the primal and dual values are zero.
• Else, suppose the first step reduces the weight of every arc out of some vertex u by M := M_G(u). Inductively, let y′ be a feasible dual for the reduced instance, and set y_{{u}} := y′_{{u}} + M, keeping all other dual values. Then for any arc a out of u,

∑_{S: a∈∂⁺S} y_S = ∑_{S: a∈∂⁺S, |S|=1} y_S + ∑_{S: a∈∂⁺S, |S|≥2} y_S
                = (y′_{{u}} + M) + ∑_{S: a∈∂⁺S, |S|≥2} y′_S
                ≤ M + w′(a) = M + (w(a) − M) = w(a).

Moreover, the value of the dual increases by M, the same as the increase in the weight of the arborescence.
• Else, suppose the chosen zero-weight arcs contain a cycle C, which we contract down to a node v_C. Using induction for this new graph G′, let y′ be the feasible dual solution. For any subset S′ of nodes in G′ that contains the new node v_C, let S = (S′ \ {v_C}) ∪ C, and define y_S = y′_{S′}. For all other subsets S′ in G′ not containing v_C, define y_S = y′_{S′}. Moreover, for all nodes v ∈ C, define y_{{v}} = 0. The dual value remains unchanged, as does the weight of the solution T obtained by lifting T′. The dual constraint changes only for arcs of the form a = (v, u), where v ∈ C and u ∉ C. But such an arc is replaced by an arc a′ = (v_C, u), whose weight is at most w(a). Hence

∑_{S: a∈∂⁺S} y_S = y′_{{v_C}} + ∑_{S′: a′∈∂⁺S′, S′≠{v_C}} y′_{S′} ≤ w(a′) ≤ w(a).

[Figure 2.5: An optimal dual solution: vertex sets are labeled with dual values, and arcs with costs.]
Corollary 2.17. There exists a solution for the dual LP (2.3) such that
w⊺ x = 1⊺ y. Hence the algorithm produces an optimal arborescence even for
negative arc weights.
Proof. If some arc weights are negative, add M to all arc weights to get the new graph G′ where all arc weights are positive. Let y′ be the optimal dual for G′ from Lemma 2.16; define y_S = y′_S for all sets of size at least two, and y_{{v}} = y′_{{v}} − M for singletons. Note that the weight of the optimal solution on G is precisely M(n − 1) smaller than on G′; the same is true for the total dual value. Moreover, for arc e = (u, v), we have

∑_{S: e∈∂⁺S} y_S = (y′_{{u}} − M) + ∑_{S: e∈∂⁺S, |S|≥2} y′_S ≤ w′(e) − M = w(e),

so the dual remains feasible.
Karb ⊆ K.
In general, the two polytopes are not equal. But Corollary 2.17 implies that in this particular setting, the two are indeed equal. A geometric, hand-wavy argument is easy to make — if K were strictly bigger than Karb, there would be some direction
s to all vertices in V.

2. The all-pairs shortest paths (APSP) problem asks for the distances between each pair of vertices in V.

We will consider both these variants, and give multiple algorithms for both. (We do not consider the s-t shortest-path problem, since algorithms for that problem also tend to solve the SSSP problem on worst-case instances.)
Lemma 3.1. After i iterations of the algorithm, dist(v) equals the weight of
the shortest-path from s to v containing at most i edges. (This is defined to
be ∞ if there are no such paths.)
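The invariant in Lemma 3.1 is exactly that of Bellman-Ford-style relaxation; here is a minimal Python sketch (the per-round snapshot ensures the invariant holds verbatim; names and input format are mine):

```python
def bellman_ford(n, arcs, s):
    """After round i, dist[v] equals the weight of the shortest s->v
    path using at most i edges (infinity if none), as in Lemma 3.1.
    arcs: list of (u, v, w) triples; assumes no negative cycles."""
    INF = float("inf")
    dist = [INF] * n
    dist[s] = 0
    for _ in range(n - 1):          # shortest paths use <= n-1 edges
        prev = dist[:]              # relax against the previous round
        for u, v, w in arcs:        # so the invariant holds exactly
            if prev[u] + w < dist[v]:
                dist[v] = prev[u] + w
    return dist
```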
1. The new weights ŵ are all non-negative. This comes from the definition of the feasible potential.

2. Let P_ab be a path from a to b. Let ℓ(P_ab) be the length of P_ab when we use the weights w, and ℓ̂(P_ab) be its length when we use the weights ŵ. Then ℓ̂(P_ab) = ℓ(P_ab) + ϕ(a) − ϕ(b), since the potential terms telescope along the path.
4. If we set ϕ(s) = 0 for some vertex s, then ϕ(v) for any other vertex
v is an underestimate of the s-to-v distance. This is because for all
maximize   ∑_{x∈V} ϕ_x
subject to ϕ_s = 0
           w_{vu} + ϕ_v − ϕ_u ≥ 0   ∀(v, u) ∈ E
This is the usual matrix multiplication, but over the semiring (R, min, +). (A semiring has a notion of addition and one of multiplication; however, neither the addition nor the multiplication operation is required to have inverses.)

It turns out that computing Min-Sum Products is precisely the operation needed for the APSP problem. Indeed, initialize a matrix D exactly as in the Floyd-Warshall algorithm:

D_ij = w_ij if (i, j) ∈ E;   D_ij = ∞ if (i, j) ∉ E, i ≠ j;   D_ij = 0 if i = j.
Now ( D ⊚ D )ij represents the cheapest i-j path using at most 2 hops!
(It’s as though we made the outer-most loop of Floyd-Warshall into
the inner-most loop.) Similarly, we can compute
D^{⊚k} := D ⊚ D ⊚ · · · ⊚ D   (k − 1 MSPs),
whose entries give the shortest i-j paths using at most k hops (or at
most k − 1 intermediate nodes). Since the shortest paths would have
at most n − 1 hops, we can compute D⊚n−1 .
How much time would this take? The very definition of MSP
shows how to implement it in O(n3 ) time. But performing it n − 1
times would be O(n) worse than all other approaches! But here’s a
classical trick, which probably goes back to the Babylonians: for any
integer k,
D⊚2k = D⊚k ⊚ D⊚k .
(Here we use that the underlying operations are associative.) Now it
is a simple exercise to compute D⊚n−1 using at most 2 log2 n MSPs.
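A short Python sketch of the two ingredients just described—the O(n³) min-sum product, and repeated squaring to reach D^{⊚(n−1)} with O(log n) MSPs. The input D is assumed initialized as above; function names are mine:

```python
def msp(A, B):
    """Min-sum product over the (min, +) semiring:
    C[i][j] = min_k (A[i][k] + B[k][j]). Plain O(n^3)."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def apsp(D):
    """Repeated squaring: D^(2k) = D^k (msp) D^k, so O(log n) MSPs
    reach D^(n-1), whose entries are the all-pairs distances
    (powers beyond n-1 don't change, since diagonals are 0)."""
    k, n = 1, len(D)
    while k < n - 1:
        D = msp(D, D)
        k *= 2
    return D
```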
This value, and Strassen's idea, has been refined over the years, to its current value of 2.3728 due to François Le Gall (2014). (See the survey by Virginia for a discussion of algorithmic progress until 2013.) There has been a flurry of work on lower bounds as well, e.g., by Josh Alman and Virginia Vassilevska Williams, showing limitations for all known approaches. (The big improvements in this line of work were due to Arnold Schönhage (1981), and Don Coppersmith and Shmuel Winograd (1990), with recent refinements by Andrew Stothers, CMU alumna Virginia Vassilevska Williams, and François Le Gall (2014).)
But how about MSP(n)? Sadly, progress on this has been less
impressive. Despite much effort, we don’t even know if it can be
done in O(n3−ϵ ) time. In fact, most of the recent work has been on
giving evidence that getting sub-cubic algorithms for MSP and APSP
may not be possible. There is an interesting theory of hardness within
P developed around this problem, and related ones. For instance, it is
now known that several problems are equivalent to APSP, and truly
sub-cubic algorithms for one will lead to sub-cubic algorithms for all
of them.
Yet there is some interesting progress on the positive side, albeit qualitatively small. As far back as 1976, Fredman had shown an algorithm to compute MSP in O(n³ · (log log n)/(log n)) time. He used the fact that the decision-tree complexity of APSP is sub-cubic (a result we will discuss in §3.5) in order to speed up computations over nearly-logarithmic-sized sub-instances; this gives the improvement above.
More recently, another CMU alumnus, Ryan Williams (2018), improved on this idea quite substantially to O(n³/2^{√(log n)}), using very interesting ideas from circuit complexity. We will discuss this result in a later section, if we get a chance.
Now consider the graph G2 , the square of G, which has the same
vertex set as G but where an edge in G2 corresponds to being at most
two hops away in G—that is, uv ∈ E( G2 ) ⇐⇒ dG (u, v) ≤ 2. To
construct the adjacency matrix for G² from the adjacency matrix A of G, we can use the following idea:
u, a1 , b1 , a2 , b2 , . . . , ak , bk , v
u, a1 , b1 , a2 , b2 , . . . , ak , bk , ak+1 , v.
But which one? The following lemmas give us a simple rule to decide.
Let NG (v) denote the set of neighbors of v in G.
Lemma 3.6. If duv = 2Duv , then for all w ∈ NG (v) we have Duw ≥ Duv .
Proof. Assume not, and let w ∈ NG(v) be such that Duw < Duv. Since both are integers, we have 2Duw ≤ 2Duv − 2. Then the shortest u-w path in G along with the edge wv forms a u-v path in G of length at most 2Duw + 1 < 2Duv = duv, contradicting the assumption that duv is the shortest-path distance in G.
Lemma 3.7. If duv = 2Duv − 1, then Duw ≤ Duv for all w ∈ NG (v);
moreover, there exists z ∈ NG (v) such that Duz < Duv .
In matrix notation: d_{uv} = 2D_{uv} − 1 exactly when (DA)_{uv} < D_{uv} · deg(v).
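Putting the squaring step and Lemmas 3.6/3.7 together gives Seidel's recursion; here is a hedged numpy sketch (plain matrix multiplication stands in for fast matrix multiplication, so this version runs in O(n³ log n) rather than Õ(n^ω); input is a 0/1 adjacency matrix of a connected graph with zero diagonal):

```python
import numpy as np

def seidel(A):
    """All-pairs distances for an unweighted, connected, undirected
    graph, following the recursion described above."""
    n = A.shape[0]
    if np.all(A + np.eye(n, dtype=A.dtype) > 0):
        return A.copy()              # complete graph: all distances 1
    # adjacency of the square graph G^2: within two hops in G
    A2 = (((A @ A) > 0) | (A > 0)).astype(A.dtype)
    np.fill_diagonal(A2, 0)
    D2 = seidel(A2)                  # distances in G^2, recursively
    # d(u,v) is 2*D2[u,v] or 2*D2[u,v]-1; Lemmas 3.6/3.7 decide by
    # comparing sum_{w in N(v)} D2[u,w] = (D2 A)[u,v] with deg(v)*D2[u,v]
    deg = A.sum(axis=1)
    S = D2 @ A
    return 2 * D2 - (S < D2 * deg[None, :]).astype(A.dtype)
```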
output the paths is fairly simple. But for Seidel’s algorithm, things
get tricky. Indeed, since the runtime of Seidel’s algorithm is strictly
sub-cubic, how can we write down the shortest paths in nω time,
since the total length of all these paths may be Ω(n3 )? We don’t: we
just write down the successor pointers. Indeed, for each pair u, v, define S_v(u) to be the second node on a shortest u-v path (the first node being u, and the last being v). Then to get the entire u-v shortest path, we just follow these pointers: u, S_v(u), S_v(S_v(u)), . . . , v.
Given the algorithmic advances, one may wonder about lower bounds
for the APSP problem. There is the obvious Ω(n2 ) lower bound
from the time required to write down the answer. Maybe even the
decision-tree complexity of the problem is Ω(n3 )? Then no algorithm
can do any faster, and we’d have shown the Floyd-Warshall and the
Matrix-Multiplication methods are optimal.
However, thanks to a result of Michael Fredman (1976), we know this is not the case. If we just care about the decision-tree complexity, we can get much better. Specifically, Fredman shows the following:
A_{ik∗} + B^⊺_{jk∗} ≤ A_{ik} + B^⊺_{jk}   ∀k            (3.2)
⟺ A_{ik∗} − A_{ik} ≤ −(B^⊺_{jk∗} − B^⊺_{jk})   ∀k       (3.3)
Now, for every pair of columns p, q, sort the following 2n numbers:

A_{1p} − A_{1q}, A_{2p} − A_{2q}, . . . , A_{np} − A_{nq}, −(B_{1p} − B_{1q}), . . . , −(B_{np} − B_{nq}).
This result does not give us a fast algorithm, since it just counts
the number of comparisons, and not the actual time to figure out
which comparisons to make. Regardless, many of the algorithms
that achieve n3 / poly log n time for APSP use Fredman’s result on
tiny instances (say of size O(poly log n), so that we can find the best
decision-tree using brute-force) to achieve their results.
4
Low-Stretch Spanning Trees
Given that shortest paths from a single source node s can be repre-
sented by a single shortest-path tree, can we get an analog for all-
pairs shortest paths? Given a graph can we find a tree T that gives us
the shortest-path distances between every pair of nodes? Does such
a tree even exist? Sadly, the answer is negative—and it remains neg-
ative even if we allow this tree to stretch distances by a small factor,
as we will soon see. However, we show that allowing randomiza-
tion will allow us to circumvent the problems, and get low-stretch
spanning trees in general graphs.
In this chapter, we consider undirected graphs G = (V, E), where each edge e has a non-negative weight/length we. For all u, v in V, let dG(u, v) be the distance between u, v, i.e., the length of a shortest path in G from u to v. Observe that the set V along with the distance function dG forms a metric space. (A metric space is a set V with a distance function d satisfying symmetry, i.e., d(x, y) = d(y, x) for all x, y ∈ V, and the triangle inequality, d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ V. Typically, the definition also asks for x = y ⟺ d(x, y) = 0, but we will merely assume d(x, x) = 0 for all x.)

4.1 Towards a Definition

The study of low-stretch spanning trees is guided by two high-level hopes:
1. Graphs have spanning trees that preserve their distances. That is, given G there exists a subtree T = (V, ET) with ET ⊆ E such that dG(u, v) ≈ dT(u, v) for all u, v ∈ V. (We assume that the weights of edges in ET are the same as those in G.)
Now, since π T is the optimal ordering for the tree T, and πG is some
other ordering,
which is α · OPTG .
Observe that the first property must hold with probability 1 (i.e.,
it holds for all trees in the support of the distribution), whereas the
second property holds only on average. Is this definition any good
for our TSP example above? If we change the algorithm to sample a
tree T from the distribution and then return the optimal tour for T,
we get a randomized algorithm that is good in expectation. Indeed,
(4.1) becomes
E[dT(u, v)] = ((n − 1)/n) · 1 + (1/n) · (n − 1) = 2 − 2/n.
And what about an arbitrary pair of nodes u, v in Cn? We can use the exercise on the right to show that the stretch on other pairs is no worse! (Exercise: Given a graph G, suppose the stretch on all edges is at most α. Show that the stretch on all pairs of nodes is at most α. Hint: linearity of expectation.)
While we will not manage to get α < 1.49 for general graphs (or
even for the above examples, for which the bounds of 2 − n2 are the
best possible), we show that α ≈ O(log n) can indeed be achieved.
The following theorem is the current best result, due to Ittai Abra-
ham and Ofer Neiman:
Theorem 4.4. For any graph G, there exists a distribution D over span-
ning trees of G with stretch α = O(log n log log n). Moreover, the
construction is efficient: we can sample trees from this distribution D in
O(m log n log log n) time.
Theorem 4.7. For any metric space M = (V, d), there exists an efficiently sampleable α_B-stretch spanning tree distribution D_B, where α_B = O(log n log ∆_M).

Pr[x, y in different clusters] ≤ β · d(x, y)/D.
Let’s see a few examples, to get a better sense for the definition:
1. Consider a set of points on the real line. One way to partition the
line into pieces of diameter D is simple: imagine making notches
low-stretch spanning trees 55
3. What about lower bounds? One can show that for the k-dimensional
hypergrid, we cannot get β = o (k). Or for a constant-degree n-
vertex expander, we cannot get β = o (log n). Details to come soon.
Since the aspect ratio of the metric space is invariant to scaling all
the edge lengths by the same factor, it will be convenient to assume
that the smallest non-zero distance in d is 1, so the largest distance is
∆. The basic algorithm is then quite simple:
Now the probability that Rv > D/2 for one particular cluster is

Pr[Rv > D/2] = (1 − p)^{D/2} ≤ e^{−pD/2} ≤ e^{−2 log n} = 1/n².

(We use that 1 − z ≤ e^{−z} for all z ∈ R.) By a union bound, every cluster has diameter at most D with probability

1 − Pr[∃v ∈ V : Rv > D/2] ≥ 1 − n · (1/n²) = 1 − 1/n.
To bound the probability of some pair u, v being separated, we use the fact that sampling from the geometric distribution with parameter p means repeatedly flipping a coin with bias p and counting the number of flips until we see the first heads. Recall this process is memoryless, meaning that even if we have already performed k flips without having seen a heads, the time until the first heads is still geometrically distributed.

Hence, the steps of drawing Rv and then forming the cluster can be viewed as starting from v, where the cluster is a unit-radius ball around v. Each time we flip a coin of bias p: if it comes up heads, we set the radius Rv to the current value, form the cluster Cv (and mark its vertices), and then pick a new unmarked point v; on seeing tails, we just increment the radius of v's cluster by one and flip again. The process ends when all vertices lie in some cluster.

[Figure 4.1: A cluster forming around v in the LDD process, separating x and y. To reduce clutter, only some of the distances are shown.]
For x, y, consider the first time when one of these vertices lies
inside the current ball centered at some point, say, v. (This must hap-
pen at some point, since all vertices are eventually marked.) With-
out loss of generality, let the point inside the current ball be x. At
this point, we have performed d(v, x ) flips without having seen a
heads. Now we will separate x, y if we see a heads within the next
⌈d(v, y) − d(v, x )⌉ ≤ ⌈d( x, y)⌉ flips—beyond that, both x, y will have
been contained in v’s cluster and hence cannot be separated. But
the probability of getting a heads among these flips is at most (by a
union bound)
⌈d(x, y)⌉ · p ≤ 2 d(x, y) · p ≤ 8 log n · d(x, y)/D.
(Here we used that the minimum distance is 1, so rounding up dis-
tances at most doubles things.) This proves the claimed probability of
separation.
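A minimal Python sketch of the ball-growing process just analyzed, for a finite metric given as a distance matrix. The parameter p = 4 ln n / D matches the calculation above; the explicit cap at D/2 (which the analysis shows is rarely needed) and all names are my choices:

```python
import math
import random

def ldd(dist, D):
    """Low-diameter decomposition sketch: dist[u][v] = d(u, v) on
    points 0..n-1. Returns a list of clusters (lists of points)."""
    n = len(dist)
    p = min(1.0, 4 * math.log(max(n, 2)) / D)
    unmarked = set(range(n))
    clusters = []
    while unmarked:
        v = next(iter(unmarked))         # new cluster center
        r = 1                            # start with a unit-radius ball
        while random.random() > p:       # tails: grow the radius by one
            r += 1
        # capping at D/2 enforces diameter <= D outright (my choice;
        # the text instead argues the cap is unnecessary w.h.p.)
        ball = [u for u in unmarked if dist[v][u] <= min(r, D / 2)]
        unmarked.difference_update(ball)
        clusters.append(ball)
    return clusters
```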
Lemma 4.11. If the random tree T returned by some call LDD(M′, δ) has root r, then (a) every vertex x in T has distance d(x, r) ≤ 2^{δ+1}, and (b) the expected distance between any x, y ∈ T has E[d_T(x, y)] ≤ 8δβ · d(x, y).

Proof. The proof is by induction on δ. For the base case, the tree has a single vertex, so the claims are trivial. Else, let x lie in cluster C_i, so inductively the distance to the root r_i of the tree T_i is d(x, r_i) ≤ 2^{(δ−1)+1}. Now the distance to the new root r is at most 2^δ more, which gives 2^δ + 2^δ = 2^{δ+1}, as claimed.

Moreover, any pair x, y is separated by the LDD with probability β · d(x, y)/2^{δ−1}, in which case their distance is at most 2 · 2^{δ+1} (by part (a)).
Else they lie in the same cluster, and inductively have expected dis-
This proves Theorem 4.7 because β = O(log n), and the initial
call on the entire metric defines δ = O(log ∆). In fact, if we have a
better LDD (with smaller β), we immediately get a better low-stretch
tree. For example, shortest-path metrics of planar graphs admit an
LDD with parameter β = O(1); this shows that planar metrics admit
(randomized) low-stretch trees with stretch O(log ∆).
It turns out this factor of O(log n log ∆) can be improved to O(log n)—
this was done by Fakcharoenphol, Rao, and Talwar. Moreover, the
bound of O(log n) is tight: the lower bounds of Theorem 4.5 continue
to hold even for low-stretch non-spanning trees.
TO be added in
6
Blank
TO be added in
7
Graph Matchings I: Combinatorial Algorithms
edge covers two vertices, and each vertex can be covered by at most
one edge.
Given sets S, T, their symmetric difference is denoted S △ T := (S \ T) ∪ (T \ S).

[Figure 7.2: An augmenting path]
Note that the entire set V is trivially a vertex cover, and the chal-
lenge is to find small vertex covers. We denote the size of the smallest
cardinality vertex cover of graph G as VC ( G ). Our motivation for
calling it a “dual” object comes from the following fundamental theo-
rem from the early 20th century:
Theorem 7.9 (König's Minimax Theorem, Dénes König (1916)). In a bipartite graph, the size of the largest possible matching equals the cardinality of the smallest vertex cover:

MM(G) = VC(G).
Hence, suppose we do not find an open node in an even level, and stop when some X_j is empty. Let X = ∪_j X_j be all nodes added to any of the sets X_j; we call these marked nodes. Define the set C to be the vertices on the left which are not marked, plus the vertices on the right which are marked. That is,

C := (L \ X) ∪ (R ∩ X)

[Figure 7.3: Illustration of the process to find augmenting paths in a bipartite graph. Mistakes here, to be fixed!]
Theorem 7.12 (The Tutte-Berge Max-Min Theorem). Given a graph G,

(Tutte (1947) showed that the graph has a perfect matching precisely if for every U ⊆ V, odd(G \ U) ≤ |U|; Berge (1958) gave the generalization to maximum matchings.)
The rest of this section defines the algorithm, and proves this theorem. The essential idea of the algorithm is simple, and similar to the one for the bipartite case: if we have a matching M, Berge's characterization from Theorem 7.7 says that if M is not optimal, there exists an M-augmenting path. So the natural idea would be to find such an augmenting path. However, it is not clear how to do this directly. The clever idea in the Blossom algorithm is to either find an M-augmenting path, or else find a structure called a "blossom". The good thing about blossoms is that we can use them to contract the graph in a certain way, and make progress. Let us now give some definitions, and details.

A flower is a subgraph of G that looks like the object to the right: it has an open vertex at the base, then a stem with an even number of edges (alternating between matched and unmatched edges), and then a blossom: an odd cycle whose edges alternate between matched and unmatched, except at the vertex where it meets the stem.

[Figure: (a) a flower, with its stem and blossom; (b) legend: matched edge, unmatched edge, open vertex.]
Let's give some more details for the last step. Suppose we find a flower F, with stem S and blossom B. First, toggle the stem (by setting M ← M △ S): this moves the open node to the blossom, without changing the size of the matching M. (It makes the following arguments easier, with one less case to consider.) (Change figure.)

[Figure 7.7: The shrinking of a blossom. Image found at http://en.wikipedia.org/wiki/Blossom_algorithm.]
Proof. Since we toggled the stem, the vertex v at the base of the blos-
som B is open, and so is the vertex v B created in G ′ by contracting
B. Moreover, all other nodes in the blossom are matched by edges
within itself, so all edges leaving B are non-matching edges. The
picture essentially gives the proof, and can be used to follow along.
(This is where we use the fact that the cycle is odd, and is alternating except for the two edges incident to v.)
3. If v ∈ X2j for j < i, then u would have been added to the odd level
X2j+1 , which is impossible.
Now for the edges out of the odd layers considered in line 9.3.
Given u ∈ X2i+1 and matching edge uv ∈ M, the cases are:
Observe that if the algorithm does not succeed, all the matching
edges we explored are odd-to-even, whereas all the non-matching
edges are even-to-odd. Now we can prove Lemma 7.14.
(a) the marked vertices in the even levels, X_even, which are all singletons since there are no cross edges, and

Hence

(n + |U| − odd(G \ U))/2 = (n + |X_odd| − |X_even|)/2
                         = (2|X_odd| + (n − |X|))/2
                         = |X_odd| + (n − |X|)/2 = |M|.
The last equality uses that all nodes in V \ X are perfectly matched
among themselves, and all nodes in Xodd are matched using unique
edges.
The last piece is to show that a Tutte-Berge set U′ for a contracted graph G′ = G/B with respect to M′ = M/B can be lifted to one for G with respect to M. We leave it as an exercise to show that adding the entire blossom B to U′ gives such a U.
• The first result along these lines is that of Laci Lovász (1979), who introduced the general idea, and gave a randomized algorithm to detect the presence of perfect matchings in time O(n^ω), and to find it in time O(mn^ω). We will present all the details of this elegant idea soon.

• Dick Karp, Eli Upfal, and Avi Wigderson (1986), and then Ketan Mulmuley, Umesh Vazirani, and Vijay Vazirani (1987), showed how to find such a matching in parallel. The question of getting a deterministic parallel algorithm remains an outstanding open problem, despite recent progress (which we discuss at the end of the chapter).

• Michael Rabin and Vijay Vazirani (1989) sped up the sequential algorithm to run in O(n · n^ω). This was substantially improved by the work of Marcin Mucha and Piotr Sankowski (2006) to get a runtime of O(n^ω).
For the rest of this lecture, we fix a field F, and consider (univariate and multivariate) polynomials over this field. We assume that we can perform basic arithmetic operations in constant time, though sometimes it will be important to look more closely at this assumption. (For finite fields F_q, where q is a prime power, we can perform arithmetic operations (addition, multiplication, division) in time poly log q.)
Pr[p(R) = 0] ≤ d/|S|.
This statement holds for multivariate polynomials as well, as we see next. The result is called the Schwartz-Zippel lemma, and it appears in papers by Richard DeMillo and Richard Lipton (1978), by Richard Zippel (1979), and by Jacob Schwartz (1980). (Like many powerful ideas, the provenance of this result gets complicated: a version of this for finite fields was apparently already proved in 1922 by Øystein Ore. Anyone have a copy of that paper?)

Theorem 8.3. Let p(x_1, . . . , x_n) be a non-zero polynomial over a field F, such that p has degree at most d. Suppose we choose values R_1, . . . , R_n independently and uniformly at random from a subset S ⊆ F. Then

Pr[p(R_1, . . . , R_n) = 0] ≤ d/|S|.

Hence, the number of roots of p inside S^n is at most d·|S|^{n−1}. (A monomial is a product of a collection of variables; the degree of a monomial is the sum of the degrees of the variables in it. The degree of a polynomial is the maximum degree of any monomial in it.)

Proof. We argue by induction on n. The base case of n = 1 considers univariate polynomials, so the claim follows from Theorem 8.1. Now
for the inductive step for n variables. Let k be the highest power of
xn that appears in p, and let q be the quotient and r be the remainder
when dividing p by xnk . That is, let q( x1 , . . . , xn−1 ) and r ( x1 , . . . , xn )
be the (unique) polynomials such that
p( x1 , . . . , xn ) = xnk q( x1 , . . . , xn−1 ) + r ( x1 , . . . , xn ),
Pr[p(R_1, . . . , R_n) = 0] = Pr[p(R_1, . . . , R_n) = 0 | E] · Pr[E] + Pr[p(R_1, . . . , R_n) = 0 | Ē] · Pr[Ē]
                          ≤ Pr[E] + Pr[p(R_1, . . . , R_n) = 0 | Ē].

Thus we get

Pr[p(R_1, . . . , R_n) = 0] ≤ (d − k)/|S| + k/|S| = d/|S|.
Remark 8.4. Choosing the set S ⊆ F such that |S| ≥ dn² guarantees that if p is a non-zero polynomial,

Pr[p(R_1, . . . , R_n) = 0] ≤ 1/n².

Naturally, if p is the zero polynomial, then the probability equals 1.
Observe that now each variable is of the form x_{i,j} for i < j; it occurs twice in the matrix, with the variables below the diagonal being the negations of those above.

Example 8.10. For the graph to the right, the Tutte matrix is

⎡    0       x_{1,2}     0       x_{1,4} ⎤
⎢ −x_{1,2}      0     x_{2,3}    x_{2,4} ⎥
⎢    0      −x_{2,3}     0       x_{3,4} ⎥
⎣ −x_{1,4}  −x_{2,4}  −x_{3,4}      0    ⎦

[Figure 8.2: A non-bipartite graph on vertices 1, 2, 3, 4.]
We claim the same property for this matrix as we did for the Ed-
monds matrix:
Theorem 8.11. For any graph G, the determinant of the Tutte matrix T( G )
is a non-zero polynomial over any field F if and only if there exists a perfect
matching in G.
Now given Theorem 8.11, the Tutte matrix can simply be substi-
tuted instead of the Edmonds matrix to extend the results to general
graphs.
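Here is a hedged Python sketch of the resulting tester: substitute random elements of a large prime field into the Tutte matrix and test whether the determinant is nonzero (Gaussian elimination over F_p avoids floating-point issues). The field size and function names are my choices; by Theorem 8.3 the error probability is at most n/(p − 1):

```python
import random

def det_mod(M, p):
    """Determinant of M over F_p (p prime), by Gaussian elimination."""
    n = len(M)
    M = [row[:] for row in M]
    det = 1
    for c in range(n):
        piv = next((r for r in range(c, n) if M[r][c] % p != 0), None)
        if piv is None:
            return 0
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            det = -det                      # row swap flips the sign
        det = det * M[c][c] % p
        inv = pow(M[c][c], p - 2, p)        # inverse via Fermat
        for r in range(c + 1, n):
            f = M[r][c] * inv % p
            for k in range(c, n):
                M[r][k] = (M[r][k] - f * M[c][k]) % p
    return det % p

def tutte_pm_tester(n, edges, p=(1 << 61) - 1):
    """Plug random field values into the Tutte matrix and test
    det != 0. 'True' answers are always correct; 'False' is wrong
    with probability <= n/(p-1) when a perfect matching exists."""
    T = [[0] * n for _ in range(n)]
    for i, j in edges:                      # assume i < j
        r = random.randrange(1, p)
        T[i][j], T[j][i] = r, p - r         # x and -x, mod p
    return det_mod(T, p) != 0
```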
We can convert the above perfect matching tester (which solves the decision version of the perfect matching problem) into an algorithm for the search version: one that outputs a perfect matching in a graph (if one exists), using the simple but brilliant idea of self-reducibility. (We are reducing the problem to smaller instances of itself, hence the term self-reducibility.) Suppose that graph G has a perfect matching. Then we can pick any edge e = uv and check if G[E − e], the subgraph of G obtained by dropping just the edge e, contains a perfect matching. If not, then edge e must be part of every perfect matching in G, and hence we can find a perfect matching on the induced subgraph G[V \ {u, v}]. The following algorithm is based on this observation.
Algorithm 11: Find-PM(bipartite graph G, S ⊆ F)
11.1 Assume: G has a perfect matching; let e = uv be an edge in G
11.2 if PM-tester(G[E − e], S) == Yes then
11.3     return Find-PM(G[E − e], S)
11.4 else
11.5     M′ ← Find-PM(G[V − {u, v}], S)
11.6     return M′ ∪ {e}
Theorem 8.12. Let |S| ≥ n³. Given a bipartite graph G that contains some perfect matching, Algorithm 11 finds a perfect matching with probability at least 1/2, and runs in time O(m · n^ω).
Proof. At each step, we call the tester once, and then recurse after either deleting an edge or deleting two vertices. Thus, the number of total recursive steps inside Algorithm 11 is at most m + n/2 ≤ 2m if the graph is connected. This gives a runtime of O(m · n^ω). Moreover, at each step, the probability that the tester returns a wrong answer is at most 1/n², so some call to the PM-tester makes a mistake with probability at most (m + n/2)/n² ≤ 1/2, by a union bound (for a simple graph, m ≤ n(n − 1)/2, so m + n/2 ≤ n²/2).
Claim 8.15. Let G have at most one perfect matching with k red edges.
The determinant det(M) has a term of the form ck yk if and only if G
has a k-red matching.
The polynomial p(y) has degree at most n, and hence we can recover it by Lagrange interpolation. Indeed, we can choose n + 1 distinct numbers a_0, . . . , a_n, and evaluate p(a_0), . . . , p(a_n) by computing the determinant det(M) at y = a_i, for each i. These n + 1 values are enough to determine the polynomial as follows:

p(y) = ∑_{i=0}^{n} p(a_i) ∏_{j≠i} (y − a_j)/(a_i − a_j).

(E.g., see 451 lecture notes or Ryan's lecture notes.) Note this is a completely deterministic algorithm, so far.
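A small Python sketch of this interpolation step, using exact rationals (the function name and input format are mine):

```python
from fractions import Fraction

def lagrange_interpolate(points, y):
    """Evaluate at y the unique degree-<=n polynomial through the
    n+1 given (a_i, p(a_i)) pairs, via the Lagrange formula above."""
    total = Fraction(0)
    for i, (ai, pai) in enumerate(points):
        term = Fraction(pai)
        for j, (aj, _) in enumerate(points):
            if j != i:
                term *= Fraction(y - aj, ai - aj)
        total += term
    return total

# e.g. for p(y) = y^2 recovered from three evaluations:
# lagrange_interpolate([(0, 0), (1, 1), (2, 4)], 5) == 25
```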
where Q_i is a multilinear degree-n polynomial that corresponds to all the i-red matchings. (Multilinear just means that the degree of each variable in each monomial is at most one.) If we set the x variables randomly (say, to values x_ij = a_ij) from a large enough set S, we get a polynomial R(y) = P(a, y) whose only variable is y. The coefficient of y^k in this polynomial is Q_k(a), which is non-zero with high probability, by the Schwartz-Zippel lemma. Now we can again use interpolation to find out this coefficient, and decide the red-blue matching problem based on whether it is non-zero.
And in this case, the signs of the permutations cancel each other out. What if we defined a new quantity which does not have these pesky negative signs? This function is called the permanent, defined as:

perm(A) = ∑_{σ∈S_n} ∏_i A_{i,σ(i)}.

(The term comes from Cauchy's use of "fonctions symétriques permanentes" for a related class of functions. The term determinant comes from Gauss, but apparently Cauchy was the one to use determinant to mean precisely the same object as we do.)

Given this definition, we immediately get the following fact:

Fact 8.17. Given the (bipartite) adjacency matrix A for a bipartite graph G, perm(A) is the number of perfect matchings in G.
This sounds like great news, since we no longer have to rely on the above randomization ideas. However, we seem to have gone from a minor annoyance to a major one: how do we compute the permanent efficiently? This was a source of theoretical and practical annoyance for some time, and attempts to transform permanent computations into determinant computations had been fruitless. Finally, in 1979, Les Valiant proved the following surprising theorem:

Theorem 8.18 (Valiant (1979)). It is NP-hard to compute the permanent of square {0, 1}-matrices. In fact, computing the number of perfect matchings of a bipartite graph is as hard as counting the number of satisfying assignments to a 3SAT formula.

This is truly a remarkable theorem. Finding a satisfying assignment to a 3SAT formula is NP-hard, whereas finding a perfect matching is in polynomial time. But counting the number of these two objects has the same complexity! (The class of problems reducible to counting the number of satisfying assignments to a 3SAT formula is called #P; this contains all problems in NP, naturally, but also seems to contain much more. Valiant's theorem says: counting the number of perfect matchings is as hard as all the problems in #P, which blows my mind.)

8.7 A Matrix Scaling Approach
8.7 A Matrix Scaling Approach
B := RAC.
In other words, taking the matrix A, and scaling each row i by R_ii and each column j by C_jj, gives the matrix B. Matrix scaling gives us yet another characterization of bipartite graphs that have perfect matchings:

Theorem 8.19. A bipartite graph G admits a perfect matching if and only if for each ε > 0 there exist non-negative matrices R, C such that R·A_G·C is ε-approximately doubly-stochastic.

(A matrix A is doubly-stochastic if it has unit row- and column-sums; in other words, A1 = A⊺1 = 1. The ε-doubly-stochasticity requires that A1 and A⊺1 both have entries in (1 − ε, 1 + ε).)
Given the adjacency matrix A ∈ {0, 1}n for the bipartite graph G,
we now try to find scaling matrices R and C. Since we want the row-
and column-sums to be close to 1, one “greedy” idea is to start with
A and repeatedly do the following two steps:
1. Scale each row to make the row sums equal to 1; this may put the
column sums out of whack.
2. Scale each column to make the column sums equal to 1; this may now mess up the row sums.
We show that if we ever reach a matrix where both row and column
sums are very close to 1, then Theorem 8.19 tells us that the graph
has a perfect matching. And if we don’t manage to get close to 1 in a
“reasonable” time (which depends on n and ε), interestingly we can
conclude it has no perfect matching!
To make this precise, let’s define two diagonal matrices R( A) :=
diag( A1) and C ( A) = diag( A⊺ 1). Then the algorithm becomes:
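The algorithm display itself appears to have been lost in this excerpt; here is a hedged numpy sketch of the alternating row/column normalization it describes (the stopping rule and iteration cap are my choices, standing in for the precise "reasonable time" bound mentioned above):

```python
import numpy as np

def sinkhorn(A, eps=1e-3, max_iters=100000):
    """Alternately normalize rows and columns of a non-negative
    matrix A, returning an eps-approximately doubly-stochastic
    scaling if one is reached."""
    A = A.astype(float)
    for _ in range(max_iters):
        A = A / A.sum(axis=1, keepdims=True)   # row sums -> 1
        A = A / A.sum(axis=0, keepdims=True)   # column sums -> 1
        if np.abs(A.sum(axis=1) - 1).max() < eps:
            return A        # eps-approximately doubly stochastic
    return None             # no convergence: suspect no perfect matching
```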
Let A^{(t)} be the matrix obtained after t rescalings. Here are the three crucial facts:

1. perm(A^{(t)}) ≤ 1. To show this, observe that for any non-negative matrix M,

perm(M) ≤ ∏_{i=1}^{n} (M_{i1} + . . . + M_{in}),

(While this looks very similar to the Leibniz formula for the determinant—it just lacks the (−1)^{sign(σ)} term inside the summation—the small difference completely changes the complexity of the two problems. While we can compute determinants in polynomial time, the computation of permanents is #P-complete.)

2. perm(A^{(1)}) ≥ n^{−n}. This follows from the fact that each matrix A^{(t)} has normalized rows or columns. Suppose the rows are normalized
9
Graph Matchings III: Weighted Matchings
As an example, the half-space S = {x⃗ | (1, 1) · x⃗ ≥ 3} in R² is shown on the right. (Note that we implicitly restrict ourselves to closed half-spaces.)
K = { Ax ≤ b},
Although all linear programs can be put into this canonical form,
in practice they may have many different forms. These presenta-
tions can be shown to be equivalent to one another by adding new
variables and constraints, negating the entries of A and c, etc. For
example, the following are all linear programs:
max_x {c · x : Ax ≤ b}        min_x {c · x : Ax = b}
min_x {c · x : Ax ≥ b}        min_x {c · x : Ax ≤ b, x ≥ 0}.
In other words, x is an extreme point of K if it cannot be written as the convex combination of two other points in K. See Figure 9.3 for an example. [Figure 9.3: Here y is an extreme point, but x is not.]

Here's another kind of point in K. (In this course, we will use the notation c · x, c⊺x, and ⟨c, x⟩ interchangeably to denote the inner-product between vectors c and x.)
K = CH(ext(K )).
It is conceptually easy to define an |E|-dimensional polytope whose vertices are precisely the perfect matchings of G: we simply define

C_PM(G) = CH({χ_M | M is a perfect matching in G}).        (9.3)

[Figure 9.4: This graph has one perfect matching M: it contains edges 1, 4, 5, and 6, represented by the vector χ_M = (1, 0, 0, 1, 1, 1).]
K_PM(G) = { x ∈ R^{|E|} s.t. ∑_{r∈N(l)} x_{lr} = 1 ∀l ∈ L;  ∑_{l∈N(r)} x_{lr} = 1 ∀r ∈ R;  x_e ≥ 0 ∀e ∈ E }
Proof. For brevity, let us refer to the polytopes as K and C. The easy
direction is to show that C ⊆ K. Indeed, the characteristic vector χ M
for each perfect matching M satisfies the constraints for K. Moreover
K is convex, so if it contains all the vertices of the convex set C, it
contains all their convex combinations, and hence contains all of C.
Now to show K ⊆ C, we again show that the vertices of K are
contained in C, and then use Fact 9.12 to infer it for the rest of K.
Consider an arbitrary vertex x ∗ of K. In this proof, we use the equiv-
alent view of a vertex as an extreme point of K. (A proof using the
“basic feasible solution” perspective appears in §9.2.3, and a proof
using the “vertex” perspective appears in §9.3.)
Let supp(x∗) = {e | x∗_e > 0} be the support of this solution. We claim that supp(x∗) is acyclic. Indeed, suppose not, and some cycle C = e_1, e_2, . . . , e_k is contained within the support supp(x∗). Since the graph is bipartite, this is an even-length cycle. Define

ε := min_{e∈supp(x∗)} x∗_e.
min{w · x | x ∈ K PM ( G )}
x∗|_{E\E′} = C^{−1}(1 − C′ x∗|_{E′}) = C^{−1}1.

By Cramer's rule,

x∗_e = det(C[1]_i) / det(C).

The numerator is an integer (since the entries of C are integers), so showing det(C) ∈ {±1} means that x∗_e is an integer.
Using the claim and the fact that C is non-singular (hence det(C) cannot be zero), we get that the entries of x∗ are integers. By the structure of the LP, the only integers possible in a feasible solution are {0, 1}, and so the vector x∗ corresponds to a matching.
The results of the previous section show that the bipartite perfect
matching polytope is integral, and hence the max-weight perfect
Consider the setting with a set B with n buyers and another set I with
n items, where buyer b has value vbi for item i. The goal is to find a
max-value perfect matching, that matches each buyer to a distinct
item and maximizes the sum of the values obtained by this matching.
Our algorithm will maintain a set of prices for items: each item i
will have price pi . Given a price vector p := ( p1 , . . . , pn ), define the
utility of item i to buyer b to be
ubi ( p) := vbi − pi .
A buyer has at least one preferred item, and can have multiple
preferred items, since there can be ties. Given prices p, we build a
preference graph H = H ( p), where the vertices are buyers B on the
left, items I on the right, and where bi is an edge if buyer b prefers
item i at prices p. The two examples show preference graphs, where
the second graph results from an increase in price of item 1. Flip the
figure.
min_{p=(p_1,...,p_n)}  ( ∑_{i∈I} p_i + ∑_{b∈B} u_b(p) ).
Consider the dual solution given by the price vector p∗. Recall that M is a perfect matching in the preference graph H(p∗), and let M(i) be the buyer matched to item i by it. Since u_{M(i)}(p∗) = v_{M(i)i} − p∗_i, the dual objective is

∑_i p∗_i + ∑_b u_b(p∗) = ∑_i p∗_i + ∑_i (v_{M(i)i} − p∗_i) = ∑_i v_{M(i)i},

which is exactly the value of the matching M. Since the primal and dual values are equal, the primal matching M must be optimal.
That’s it. Running the algorithm on our running example gives the
prices on the right.
The only way the algorithm can stop is to produce an optimal
matching. So we must show it does stop, for which we use a “semi-
invariant” argument. We keep track of the “potential”
Φ(p) := ∑_i p_i + ∑_b u_b(p),
Lemma 9.17. Every time we increase the prices in N (S) by 1, the value of
∑i pi + ∑b ub decreases by at least 1.
that all values were integral.) Therefore, the value of the potential ∑_i p_i + ∑_b u_b changes by |N(S)| − |S| ≤ −1.
• In fact, one can get rid of the integrality assumption by raising the prices by the maximum amount possible for the above proof to still go through, namely

min_{b∈S} ( u_b(p) − max_{i∉N(S)} (v_{bi} − p_i) ).
It can be shown that this update rule makes the algorithm stop in
only O(n3 ) iterations.
• If all the values are non-negative, and we don't like the utilities to be negative, then we can do one of the following things: (a) when all the prices become non-zero, subtract the same amount from all of them to make the lowest price hit zero, or (b) choose S to be a minimal "constricted" set and raise the prices for N(S). This way, we can ensure that each buyer still has at least one item which gives it nonnegative utility. (Exercise!)
• Suppose there are n buyers and a single item, with all non-negative values. (Imagine there are n − 1 dummy items, with buyers having zero values for them.) The above algorithm behaves like the usual ascending-price English or Vickrey auction, where prices are raised until only one bidder remains. Indeed, the final price for the "real" item will be such that the second-highest bidder is indifferent between it and a dummy item.

This is a more general phenomenon: indeed, even in the setting with multiple items, the final prices are those produced by the Vickrey-Clarke-Groves truthful mechanism, at least if we use the version of the algorithm that raises prices on minimal constricted sets. The truthfulness of the mechanism means there is no incentive for buyers to unilaterally lie about their values for items. See, e.g., the footnoted reference for the rich connection of matching algorithms to auctions.
This proof shows that for any setting of values, there is an optimal
integer solution to the linear program
max{v · x | x ∈ K LP(G) }.
Let us now see yet another algorithm for solving weighted matching
problems in bipartite graphs. For now, we switch from maximum-
weight matchings to minimum-weight matchings, because they are
conceptually cleaner to explain here. Of course, the two problems are
equivalent, since we can always negate edge weights.
In fact, we solve a min-cost max-flow problem here: given a flow network with terminals s and t, edge capacities ue, and also edge costs/weights we, find an s-t flow with maximum flow value, and whose total cost/weight is the least among all such flows. (Moreover, if the capacities are integers, the flow we find will also have integer flow values on all edges.) Casting the maximum-cardinality bipartite matching problem as an integer max-flow problem, as in §blah, gives us a minimum-weight bipartite matching.
This algorithm uses an augmenting path subroutine, much like
the algorithm of Ford and Fulkerson. The subroutine, which takes in
a matching M and returns one of size | M | + 1, is presented below.
Then, we can start with the empty matching and call this subroutine
until we get a maximum matching.
Let the original bipartite graph be G. Construct the directed graph
G M as follows: For each edge e ∈ M, insert that edge directed from
right to left, with weight −we . For each edge e ∈ G \ M, insert that
edge directed from left to right, with weight we . Then, compute the
shortest path P that starts from the left and ends on the right, and
return M △ P. It is easy to see that M △ P is a matching of size | M | +
1, and has total weight equal to the sum of the weights of M and P.
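A hedged Python sketch of this subroutine for a complete bipartite graph with weight matrix w: build G_M, run Bellman-Ford (the matched arcs have negative weights, and when M is extreme there are no negative cycles), and return M △ P. All conventions (M as a set of (l, r) pairs, the vertex numbering) are mine:

```python
def augment(nL, nR, w, M):
    """Grow matching M (a set of (l, r) pairs) by one edge, using a
    min-weight augmenting path found via Bellman-Ford on G_M."""
    INF = float("inf")
    n = nL + nR                          # left: 0..nL-1, right: nL..n-1
    arcs = []
    for l in range(nL):
        for r in range(nR):
            if (l, r) in M:
                arcs.append((nL + r, l, -w[l][r]))   # matched: right->left
            else:
                arcs.append((l, nL + r, w[l][r]))    # unmatched: left->right
    free_L = set(range(nL)) - {l for l, _ in M}
    free_R = {nL + r for r in range(nR)} - {nL + r for _, r in M}
    dist = [0 if v in free_L else INF for v in range(n)]
    parent = [None] * n
    for _ in range(n - 1):               # Bellman-Ford relaxations
        for u, v, c in arcs:
            if dist[u] + c < dist[v]:
                dist[v], parent[v] = dist[u] + c, u
    t = min(free_R, key=lambda v: dist[v])   # cheapest free right endpoint
    path = []
    while t is not None:                 # walk parents back to the source
        path.append(t)
        t = parent[t]
    newM = set(M)
    for a, b in zip(path, path[1:]):     # toggle edges along P (M xor P)
        e = (b, a - nL) if b < nL else (a, b - nL)
        newM.symmetric_difference_update({e})
    return newM
```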
Call a matching M an extreme matching if M has minimum
weight among all matchings of size | M |. The main idea is to show
that the above subroutine preserves extremity, so that the final match-
ing must be extreme and therefore optimal.
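To make the subroutine concrete, here is a minimal Python sketch (ours, not from the notes) of one augmentation step for minimum-weight bipartite matching. The graph representation, the Bellman-Ford shortest-path computation (needed since matched edges carry negative weights), and all names are our own illustrative choices.

```python
import itertools

def augment(left, right, weight, matching):
    """One augmentation step: given matching M (a set of (u, v) pairs with
    u in left, v in right), return a min-weight matching of size |M| + 1."""
    matched_left = {u for (u, v) in matching}
    matched_right = {v for (u, v) in matching}

    # Build G_M: unmatched edges left-to-right with weight w(e),
    # matched edges right-to-left with weight -w(e).
    arcs = []
    for (u, v), w in weight.items():
        if (u, v) in matching:
            arcs.append((v, u, -w))
        else:
            arcs.append((u, v, w))

    # Bellman-Ford from a virtual source attached to all free left vertices.
    dist = {x: float('inf') for x in itertools.chain(left, right)}
    parent = {}
    for u in left:
        if u not in matched_left:
            dist[u] = 0.0
    for _ in range(len(dist)):
        for (a, b, w) in arcs:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
                parent[b] = a

    # Cheapest free right vertex ends the augmenting path P; walk it back.
    end = min((v for v in right if v not in matched_right),
              key=lambda v: dist[v])
    path_edges = set()
    while end in parent:
        a = parent[end]
        path_edges.add((a, end) if (a, end) in weight else (end, a))
        end = a
    return matching.symmetric_difference(path_edges)  # M triangle P
```

Starting from the empty matching and calling this repeatedly gives a minimum-weight maximum matching, assuming (as the notes argue) that extremity is preserved, so no negative cycles arise in G_M.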
When the graph is not bipartite, there are no "left" and "right" sets of
vertices, so we can simply define
$K_{\deg}(G) := \{ x \in \mathbb{R}^{|E|} \mid x(\partial v) = 1 \;\forall v \in V,\; x \geq 0 \}.$
(Recall that $\partial v$ is the set of edges incident on vertex v.)
This matches with the definition (9.3) when the graph is bipartite.
Interestingly,
CPM ( G ) ⊊ Kdeg ( G )
for non-bipartite graphs. Indeed, consider graph K3 which consists
of a single 3-cycle: this graph has no perfect matching, but setting
xe = 1/2 for each edge satisfies all the constraints. Or in the graph
K6 (which does have perfect matchings), the solution where we set
$x_e = 1/2$ on two disjoint 3-cycles is an extreme point. This suggests
that the linear constraints defining $K_{\deg}(G)$ are not enough, and we
need to add more constraints to capture the convex hull of perfect
matchings in general graphs. (Can you find a cost vector for which
this half-integral solution is the unique optimum?)
In situations like this, it is instructive to look at the counter-
example, to see what constraints must be satisfied by any integer
solution, but are violated by this fractional solution. For a set of ver-
tices S ⊆ V, let ∂S denote the edges leaving S. Here is one such set of
constraints:
$\sum_{e \in \partial S} x_e \geq 1 \qquad \forall S \subseteq V \text{ such that } |S| \text{ is odd}.$
KgenPM ( G ) = CgenPM ( G ),
H has even size. If each vertex in H has an even degree, we can find
an Euler tour of these edges (which uses these edges exactly once),
and then apply the idea from Theorem 9.13 to this Euler tour to show
that x ∗ is the convex combination of two other solutions x + /x − . The
argument we used did not rely on the cycle being simple—just that
it was of even length, which holds because H has even size, and that
the solutions we get are different from x ∗ !
The rest of the proof just handles the case of non-Eulerian com-
ponents of even size. Such a non-Eulerian graph H must contain
vertices with an odd degree; but there must be an even number of
them. Pair them up in any way you like; pick the pairs one by one, Why? The Handshake Lemma says the
pick a path between them in H, and “duplicate” it—make one copy sum of degrees is twice the number of
edges and hence is even.
of each edge on the path. This increases the degree of each endpoint
by 1 (thereby changing the parity), but does not change the parity
of each other node. One important comment: pick some edge e′ on
some cycle, and ensure that the duplicated path does not use this
edge e′ by going the “other way around the cycle”.
At the end this fixes the degrees of all vertices to be even (at the
cost of duplicating edges in H). Again find an Euler tour and do
the +ε/−ε trick on these. If an edge is duplicated, it may be used
as an “odd” edge some pe number of times and an “even” edge the
remaining ne times, so will get an offset of ε( pe − ne ) in one solution
and −ε( pe − ne ) in the other. This again shows that x ∗ is not an Since the edge e′ is used only once in
extreme point. the Euler tour, it is definitely increased
or decreased in the two solutions,
ensuring that x + /x − are not equal to
Now we use the basic feasible solution perspective of x ∗ : this x∗ .
means we have a “basis” containing some | E| linearly independent
constraints of the linear program defining the polytope that are tight
at x ∗ . Fix any such basis, and suppose S is the collection of sets for
which the odd-set constraints are tight in this basis. There are two
cases:
Moreover, both these polytopes are for smaller graphs, and have
integer vertices by induction. Hence, any point within them can be
written as a convex combination of their vertices. In particular,
$x^{(1)} = \sum_{M'} \alpha_{M'}\, \chi_{M'}$
We just saw several proofs that the bipartite perfect matching poly-
tope has a compact linear program. Moreover, we claimed that the
perfect matching polytope on general graphs has an explicit linear
program that, while exponential-sized, can be solved in polynomial
time. Such results allow us to solve the weighted bipartite matching
problems using generic linear programming solvers (as long as they
return vertex solutions).
Having many different ways to view a problem gives us deeper
insight, and thereby lets us come up with faster and better ways to solve it.
Moreover, these different perspectives give us a handle into solving
extensions of these problems. E.g., if we have a matching problem
with two different kinds of weights w1 and w2 on the edges: we want to
find a matching x ∈ K PM ( G ) minimizing w1 · x, now subject to the
additional constraint w2 · x ≤ B. While the problem is now NP-hard,
this linear constraint can easily be added to the linear program to
get a fractional optimal solution. Then we can reason about how to
“round” this solution to get a near-optimal matching.
We now show how two problems we considered earlier, namely
minimum-cost arborescence and spanning trees, can be exactly mod-
eled using linear programs. We then conclude with a pointer to a
general theory of integral polyhedra.
9.6.1 Arborescences
We already saw a linear program for the min-weight r-arborescence
polytope in §2.3.2: since each node that is not the root r must have a
path in the arborescence to the root, it is natural to say that for any
subset of vertices S ⊆ V that does not contain the root, there must
be an edge leaving it. Specifically, given the digraph G = (V, A), the
polytope can be written as
$K_{Arb}(G) = \Big\{ x \in \mathbb{R}^{|A|} \;\Big|\; \sum_{a \in \partial^+(S)} x_a \geq 1 \;\; \forall S \subset V \text{ s.t. } r \notin S; \quad x_a \geq 0 \;\; \forall a \in A \Big\}.$
Here ∂+ (S) is the set of arcs that leave set S. The proof in §2.3.2 al-
ready showed that for each weight vector w ∈ R| A| , we can find an
optimal solution to the linear program min{w · x | x ∈ K Arb ( G )}.
(The first constraint excludes the case where S is either empty or the
entire vertex set.) One could hope for a similar cut-based program for
undirected spanning trees, but sadly it does not precisely capture the
spanning tree polytope: e.g., for the familiar cycle graph having three
vertices, setting $x_e = 1/2$ for all three edges satisfies all the constraints.
If all edge weights are 1, this solution gets a value of $\sum_e x_e = 3/2$,
whereas any spanning tree on 3 vertices must have 2 edges.
One can indeed write a different linear program that captures the
spanning tree polytope, but it is a bit non-trivial:
$K_{ST}(G) = \Big\{ x \in \mathbb{R}^{|E|} \;\Big|\; \sum_{ij \in E:\, i,j \in S} x_{ij} \leq |S| - 1 \;\; \forall S \subseteq V, S \neq \emptyset; \quad \sum_{ij \in E} x_{ij} = |V| - 1; \quad x_{ij} \geq 0 \;\; \forall ij \in E \Big\}$
Theorem 9.22 (Hoffman and Kruskal Theorem). If the constraint A.J. Hoffman and J.B. Kruskal (1956)
matrix [ A]m×n is totally unimodular and the vector ⃗b is integral, i.e., ⃗b ∈
Zm , then the vertices of the polytope induced by the LP are integer valued.
Moreover, if for some matrix A the polytope has integer vertices for all
integer vectors b, then the matrix A is totally unimodular.
Proof. (Sketch) This proof uses that solutions to linear systems can be
obtained using Cramer's rule: a vertex is the unique solution of $A'x = b'$
for some nonsingular square submatrix $A'$ of A, each of its coordinates
is a ratio $\det(A'_i)/\det(A')$ of determinants, and total unimodularity
forces $\det(A') = \pm 1$, so the solution is integral.
Thus, to show that the vertices are indeed integer valued, one
need not go through producing combinatorial proofs, as we have.
Instead, one could just check that the constraint matrix A is totally
unimodular. Here’s a nice presentation by Marc Uetz about the rela-
tion between total unimodularity and graph matchings.
Part II
Interlude: Dimension
Reduction
10
Concentration of Measure
3. How many unit vectors can you choose in Rn that are almost
orthonormal? I.e., they must satisfy $|\langle v_i, v_j \rangle| \leq \varepsilon$ for all $i \neq j$?
All these questions can be answered by the same basic tool, which
goes by the name of Chernoff bounds or concentration inequalities or tail
inequalities or concentration of measure, or tens of other names. The ba-
sic question is simple: if we have a real-valued function f ( X1 , X2 , . . . , Xm )
of several independent random variables Xi , such that it is “not too sensitive
to each coordinate”, how often does it deviate far from its mean? To make it
more concrete, consider this—
Given n independent random variables X1 , . . . , Xn , each bounded in
the interval [0, 1], let Sn = ∑in=1 Xi . What is
$\Pr\big[ S_n \notin (1 \pm \varepsilon)\, \mathbb{E} S_n \big]?$
The quantity $\Pr[X \geq \mu + \lambda]$ is called the upper tail, and $\Pr[X \leq \mu - \lambda]$
is the lower tail. We are interested in bounding these tails for various
values of λ.
$\Pr(X \geq \lambda) \leq \frac{\mathbb{E}(X)}{\lambda}$ (10.5)
With this in hand, we can start substituting various non-negative
functions of random variables X to deduce interesting bounds. For
instance, the next inequality looks at both the mean µ := EX and the
variance σ2 := E[( X − µ)2 ] of a random variable, and bounds both
the upper and lower tails.
$\Pr[|X - \mu| \geq \lambda] \leq \frac{\sigma^2}{\lambda^2}.$
Proof. Using Markov’s inequality on the non-negative r.v. Y = ( X −
µ)2 , we get
$\Pr[Y \geq \lambda^2] \leq \frac{\mathbb{E}[Y]}{\lambda^2}.$
The proof follows from Pr[Y ≥ λ2 ] = Pr[| X − µ| ≥ λ].
$\Pr[S_n - pn \geq \beta n] \leq \frac{pn}{pn + \beta n} = \frac{1}{1 + (\beta/p)}.$
$\Pr[|S_n - pn| \geq \beta n] \leq \frac{np(1-p)}{\beta^2 n^2} < \frac{p}{\beta^2 n}.$
In particular, this already says that the sample mean $S_n/n$ lies in the
interval $p \pm \beta$ with probability at least $1 - \frac{p}{\beta^2 n}$. Equivalently, to get
confidence $1 - \delta$, we just need to set $\delta \geq \frac{p}{\beta^2 n}$, i.e., take $n \geq \frac{p}{\beta^2 \delta}$. (We
will see a better bound soon.) (Concretely, to get within an additive
1% error of the correct bias p with probability 99.9%, set β = 0.01 and
δ = 0.001, so taking $n \geq 10^7 \cdot p$ samples suffices.)
Example 2 (Balls and Bins): Throw n balls uniformly at random and
independently into n bins. Then for a fixed bin i, let Li denote the
number of balls in it. Observe that Li is distributed as a Bin(n, 1/n)
random variable. Markov’s inequality gives a bound on the probabil-
ity that Li deviates from its mean 1 by λ ≫ 1 as
$\Pr[L_i \geq 1 + \lambda] \leq \frac{1}{1 + \lambda} \approx \frac{1}{\lambda}.$
However, Chebychev's inequality gives a much tighter bound:
$\Pr\big[|L_i - 1| \geq \lambda\big] \leq \frac{(1 - 1/n)}{\lambda^2} \approx \frac{1}{\lambda^2}.$
So setting $\lambda = 2\sqrt{n}$ says that the probability of any fixed bin having
more than $2\sqrt{n} + 1$ balls is at most $\frac{(1-1/n)}{4n}$. Now a union bound over
all bins i means that, except with probability $n \cdot \frac{(1-1/n)}{4n} \leq 1/4$, the
load on every bin is at most $1 + 2\sqrt{n}$. (Doing this argument with
Markov's inequality would give a trivial upper bound of $1 + 2n$ on the
load. This is useless, since there are at most n balls, so the load can
never be more than n.)
Example 3 (Random Walk): Suppose we start at the origin and at
each step move a unit distance either left or right uniformly ran-
domly and independently. We can then ask about the behaviour of
the final position after n steps. Each step $X_i$ can be modelled as a
Rademacher random variable with the following distribution. (A random
sign is also called a Rademacher random variable, the name Bernoulli
being already taken for a random bit in {0, 1}.)
$X_i = \begin{cases} +1 & \text{w.p. } 1/2 \\ -1 & \text{w.p. } 1/2 \end{cases}$
For large n, we can use Stirling's formula $n! \approx \sqrt{2\pi n}\,(n/e)^n$.
If $\lambda \ll n$, then we can approximate $1 + \frac{k\lambda}{n}$ by $e^{k\lambda/n}$:
This shows that most of the probability mass lies in the region |Sn | ≤
√
O( n), and drops off exponentially as we go further. And indeed,
this is the bound we will derive next—we will get slightly weaker
constants, but we will avoid these tedious approximations.
$x \mapsto e^{tx}$
for some value t > 0 to be chosen carefully. Since this map is mono-
tone,
Bernoulli random variables: Assume that all the Xi ∈ {0, 1}; we will
remove this assumption later. Let the mean be µi = E[ Xi ], so the
moment generating function can be explicitly computed as
Substituting, we get
$\Pr[S_n \geq \mu + \lambda] \leq \frac{\prod_i \mathbb{E}[e^{tX_i}]}{e^{t(\mu+\lambda)}}$ (10.12)
$\leq \frac{\prod_i \exp(\mu_i(e^t - 1))}{e^{t(\mu+\lambda)}}$ (10.13)
$= \frac{\exp(\mu(e^t - 1))}{e^{t(\mu+\lambda)}}$ (since $\mu = \sum_i \mu_i$)
$= \exp\big(\mu(e^t - 1) - t(\mu + \lambda)\big).$ (10.14)
Since this calculation holds for all positive t, and we want the tightest
upper bound, we should minimize the expression (10.14). Setting the
derivative w.r.t. t to zero gives t = ln(1 + λ/µ) which is non-negative
for λ ≥ µ.
$\Pr[S_n \geq \mu + \lambda] \leq \frac{e^{\lambda}}{(1 + \lambda/\mu)^{\mu + \lambda}}.$ (10.15)
(This bound on the upper tail is also one to be kept in mind; it often is
useful when we are interested in large deviations where λ ≫ µ. One such
example will be the load-balancing application with jobs and machines.)
$\frac{\beta}{1 + \beta/2} \leq \ln(1 + \beta)$ (10.17)
Removing the assumption that Xi ∈ {0, 1}: If the r.v.s are not Bernoullis,
then we define new Bernoulli r.v.s Yi ∼ Bernoulli(µi ), which take
value 0 with probability 1 − µi , and value 1 with probability µi , so
that E[ Xi ] = E[Yi ]. Note that f ( x ) = etx is convex for every value
of t ≥ 0; hence the function ℓ( x ) = (1 − x ) · f (0) + x · f (1) satisfies
f ( x ) ≤ ℓ( x ) for all x ∈ [0, 1]. Hence E[ f ( Xi )] ≤ E[ℓ( Xi )]; moreover
ℓ( x ) is a linear function so E[ℓ( Xi )] = ℓ(E[ Xi ]) = E[ℓ(Yi )], since
Xi and Yi have the same mean. Finally, ℓ(y) = f (y) for y ∈ {0, 1}.
Putting all this together,
so the step from (10.12) to (10.13) goes through again. This completes
the proof of Theorem 10.8.
Since the proof has a few steps, let’s take stock of what we did:
i. Apply Markov’s inequality on the function etX ,
ii. Use independence of the $X_i$ to break $\mathbb{E}[e^{tS_n}]$ into the product $\prod_i \mathbb{E}[e^{tX_i}]$,
iii. Reduce to the Bernoulli case Xi ∈ {0, 1},
iv. Compute the MGF (moment generating function) E[etXi ],
v. Choose t to minimize the resulting bound, and
vi. Use convexity to argue that Bernoullis are the “worst case”. Do make sure you see why the bounds
You can get tail bounds for other functions of random variables of Theorem 10.8 are impossible in
general if we do not assume some kind
by varying this template around; e.g., we will see an application for of boundedness and independence.
sums of independent normal (a.k.a. Gaussian) random variables in
the next chapter.
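As a sanity check on this template, here is a small Python sketch (ours, not from the notes) that numerically minimizes the MGF bound over t for a sum of independent Bernoullis, and compares it against the empirical tail; the grid search and all constants are illustrative choices.

```python
import math
import random

def chernoff_upper_tail(mus, lam, grid=10000):
    """Upper bound on Pr[S >= mu + lam] for independent Bernoulli(mu_i)'s,
    by minimizing prod_i E[e^{t X_i}] / e^{t(mu + lam)} over t > 0."""
    mu = sum(mus)
    best = 1.0
    for j in range(1, grid):
        t = 5.0 * j / grid  # scan t over (0, 5]
        log_mgf = sum(math.log(1 - m + m * math.exp(t)) for m in mus)
        best = min(best, math.exp(log_mgf - t * (mu + lam)))
    return best

mus = [0.1] * 200            # 200 coins of bias 0.1, so mu = 20
lam = 10
print("Chernoff bound:", chernoff_upper_tail(mus, lam))

trials = 20000
hits = sum(sum(random.random() < m for m in mus) >= sum(mus) + lam
           for _ in range(trials))
print("empirical tail:", hits / trials)
```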
For the rest of the proof of the Chernoff bound, we can just focus
on computing the dual ψ∗ (λ) of the log-MGF ψ(t). Let’s see some
examples:
$\psi(t) = \frac{t^2 \sigma^2}{2} \quad\text{and}\quad \psi^*(\lambda) = \frac{\lambda^2}{2\sigma^2},$
the latter by basic calculus. Now the generic Chernoff bound (10.19)
the latter by basic calculus. Now the generic Chernoff bound (10.19)
for the sum of n normal N (0, σ2 ) variables says:
$\Pr[S_n \geq \lambda] \leq e^{-\frac{\lambda^2}{2n\sigma^2}}.$ (10.21)
$\mathbb{E}[e^{tX}] = \frac{e^t + e^{-t}}{2} = \cosh t = 1 + \frac{t^2}{2!} + \frac{t^4}{4!} + \cdots \leq e^{t^2/2},$
so
$\psi(t) = \frac{t^2}{2} \quad\text{and}\quad \psi^*(\lambda) = \frac{\lambda^2}{2}.$
Note that
$\psi_{\text{Rademacher}}(t) \leq \psi_{N(0,1)}(t) \implies \psi^*_{\text{Rademacher}}(\lambda) \geq \psi^*_{N(0,1)}(\lambda).$
$\Pr\big[|S_n - np| \geq \beta n\big] \leq \exp\Big(-\frac{\beta^2 n}{2p + \beta}\Big) \leq \exp\Big(-\frac{\beta^2 n}{2}\Big).$
(For the second inequality, we use that the interesting settings have
$p + \beta \leq 1$, so $2p + \beta \leq 2$.) Hence, if $n \geq \frac{2\ln(1/\delta)}{\beta^2}$, the empirical
average $S_n/n$ is within an additive β of the bias p with probability at
least 1 − δ. This has an exponentially better dependence on 1/δ than
the bound we obtained from Chebychev's inequality.
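A quick Python illustration (our own, not from the notes) of this sample-complexity bound: estimate the bias of a coin using n = 2 ln(1/δ)/β² flips and check how often the estimate is off by more than β.

```python
import math
import random

def estimate_bias(p, beta, delta):
    """Flip n = ceil(2 ln(1/delta) / beta^2) coins of bias p; the empirical
    average should be within beta of p with probability >= 1 - delta."""
    n = math.ceil(2 * math.log(1 / delta) / beta**2)
    return sum(random.random() < p for _ in range(n)) / n

p, beta, delta = 0.3, 0.05, 0.01
failures = sum(abs(estimate_bias(p, beta, delta) - p) > beta
               for _ in range(1000))
print("empirical failure rate:", failures / 1000, "(target <=", delta, ")")
```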
This is asymptotically the correct answer: consider the problem
where we have n coins, n − 1 of them having bias 1/2, and one having
bias 1/2 + 2β. We want to find the higher-bias coin. One way is to es-
timate the bias of each coin to within β with confidence $1 - \frac{1}{2n}$, using
If we set λ = Θ(log n), the probability of the load Li being larger than
1 + λ is at most 1/n2 . Now taking a union bound over all bins, the
probability that any bin receives at least 1 + λ balls is at most n1 . I.e.,
the maximum load is O(log n) balls with high probability.
In fact, the correct answer is that the maximum load is $(1 + o(1))\frac{\ln n}{\ln\ln n}$
with high probability. For example, the proofs in cite show
this. Getting this precise bound requires a bit more work, but we can
get an asymptotically correct bound by using (10.15) instead, with a
setting of $\lambda = \frac{C \ln n}{\ln\ln n}$ with a large constant C.
Moreover, this shows that the asymmetry in the bounds (10.8)
and (10.9) is essential. A first reaction would have been to believe
our proof to be weak, and to hope for a better proof to get
$\Pr[S_n \geq (1+\beta)\mu] \leq \exp(-\beta^2 \mu / c)$
for some constant c > 0, for all values of β. This is not possible,
however, because it would imply a max-load of $\Theta(\sqrt{\log n})$ with high
probability. (The situation where λ ≤ µ is often called the Gaussian
regime, since the bound on the upper tail behaves like $\exp(-\lambda^2/\mu) = \exp(-\beta^2\mu)$,
with β = λ/µ. In other cases, the upper tail bound behaves like
$\exp(-\lambda)$, and is said to be the Poisson regime.)
Recall from §10.2.5 that the tail bound of ≈ exp(−t2 /O(1)) is indeed
in the right ballpark.
and that the function Sn is the sum of these r.v.s. Add details and refs
to this section.
But before we move on, let us give the bound that Sergei Bernstein
gave in the 1920s: it uses knowledge about the variance of the ran-
dom variable to get a potentially sharper bound than Theorem 10.8.
We can use this in the step (10.11), since the function etx is monotone
increasing for t > 0.
Negative association arises in many settings: say we want to
choose a subset S of k items out of a universe of size n, and let
Xi = 1i∈S be the indicator for whether the ith item is selected. The
variables X1 , . . . , Xn are clearly not independent, but they are nega-
tively associated.
10.4.2 Martingales
A different and powerful set of results can be obtained when we
stop requiring the random variables to be independent, but instead
allow each variable $X_j$ to take on values that depend on the past choices
$X_1, X_2, \ldots, X_{j-1}$, in a controlled way. One powerful formalization
is the notion of a martingale. A martingale difference sequence is a se-
quence of r.v.s Y1 , Y2 , . . . , Yn , such that E[Yi | Y1 , . . . , Yi−1 ] = 0 for each
i. (This is true for mean-zero independent r.v.s, but may be true in
other settings too.)
This inequality does not assume very much about the function,
except it being $c_i$-Lipschitz in the ith coordinate; hence we can also
apply it to the truncated random walk example above, or to many
other applications.
$\Pr[S_n \geq \lambda] \leq \min_{k \geq 0} \frac{\mathbb{E}[S_n^k]}{\lambda^k} \leq \inf_{t \geq 0} \frac{\mathbb{E}[e^{tS_n}]}{e^{t\lambda}}.$
with a d-bit vector. Each vertex i has a single packet (which we also
call packet i), destined for vertex π (i ), where π is a permutation on
the nodes [n].
Packets move in synchronous rounds. Each edge is bi-directed,
and at most one packet can cross each directed edge in each round.
Moreover, each packet can cross at most one edge per round. So if
uv ∈ E( Qd ), one packet can cross from u to v, and one from v to u,
in a round. Each edge e has an associated waiting queue We ; so each
node has d queues, one for each edge leaving it. If several packets
want to cross an edge e in the same round, only one can cross; the
rest wait in the queue We and try again the next round. We assume
the queues are allowed to grow to arbitrary size (though one can also
show queue length bounds in the algorithm below). The goal is to get
a simple routing scheme that delivers the packets in O(d) rounds, no
matter what permutation π needs to be routed.
One natural proposal is the bit-fixing routing scheme: each packet
i looks at its current position u, finds the first bit position where u
differs from π (i ), and flips the bit (which corresponds to traversing
an edge out of u). For example: to route from u = 0110 to π(i) = 1000,
the packet moves 0110 → 1110 → 1010 → 1000, fixing the differing bits
from left to right.
However, this proposal can create "congestion hotspots" in the network,
and therefore delay some packets by $2^{\Omega(d)}$. In fact, it turns
out any deterministic oblivious strategy (that does not depend on the
actual sources and destinations) must have a delay of $\Omega(\sqrt{2^d/d})$
rounds. (Suppose we choose a permutation π such that $\pi(w\vec{0}) = \vec{0}w$,
where $w, \vec{0} \in \{0,1\}^{d/2}$. All these $2^{d/2}$ packets have to pass through
the all-zeros node in the bit-fixing routing scheme; since this node can
send out at most d packets at each timestep, we need at least $2^{d/2}/d$
rounds.)

10.5.1 A Randomized Algorithm. . .
Here’s a lovely randomized strategy, due to Les Valiant, and to Valiant (1982)
Valiant and Brebner. It requires no centralized control, and is opti-
mal in the sense of requiring O(d) rounds (with high probability) on
any permutation π.
Each node i picks a randomized midpoint Ri independently and uni-
formly from [n]: it sends its packet to Ri . Then after 5d rounds have
elapsed, the packets proceed to their final destinations π (i ). All routing
is done using bit-fixing.
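The following Python sketch (ours; the routing rule is from the notes, but the queueing details and all names are our simplifications) simulates the two phases of Valiant's trick on the hypercube and reports the number of rounds each phase takes.

```python
import random

def bit_fix_path(u, v, d):
    """Nodes visited when routing u -> v by fixing differing bits in order."""
    path, cur = [u], u
    for b in range(d):
        if (cur ^ v) >> b & 1:
            cur ^= 1 << b
            path.append(cur)
    return path

def route_phase(pairs, d):
    """Synchronous rounds: at most one packet crosses each directed edge
    per round; the rest wait. Returns the number of rounds used."""
    paths = [bit_fix_path(s, t, d) for (s, t) in pairs]
    pos = [0] * len(paths)
    rounds = 0
    while any(pos[i] < len(paths[i]) - 1 for i in range(len(paths))):
        used = set()
        order = list(range(len(paths)))
        random.shuffle(order)
        for i in order:
            if pos[i] < len(paths[i]) - 1:
                edge = (paths[i][pos[i]], paths[i][pos[i] + 1])
                if edge not in used:
                    used.add(edge)
                    pos[i] += 1
        rounds += 1
    return rounds

d = 8
n = 1 << d
perm = list(range(n))
random.shuffle(perm)
mids = [random.randrange(n) for _ in range(n)]  # random midpoints R_i
print("phase 1:", route_phase([(i, mids[i]) for i in range(n)], d), "rounds")
print("phase 2:", route_phase([(mids[i], perm[i]) for i in range(n)], d), "rounds")
```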
Proof. We only prove that all packets reach their midpoints by time
5d, with high probability. The argument for the second phase is then
2. Suppose packet i traverses the last edge eℓ on its path and reaches
its destination at timestep T. Since it has lag T − ℓ = T − | Pi | just
before it traverses the edge, it reaches the destination at time | Pi |
plus its final lag. So it suffices to show that i’s final lag is at most
| Si | .
5. We show (in the next bullet point) how to maintain the invariant
that at the beginning of each time, any token numbered L still on
the path Pi is carried by some packet in Si with current lag L. This
implies that when a packet in Si makes its final traversal and it has
some final lag L′ , it is either carrying a single token numbered L′
at that time or no token at all. Since each token is carried by some
packet, this means there can be at most |Si | tokens overall, and
hence i’s final lag value is at most |Si |.
6. To ensure the invariant, note that when j got the token numbered
L from i, packet j had lag value L. Now as long as j does not get
delayed as it proceeds along the path, its lag remains L (and it
keeps the token). If it does get delayed, say while waiting in queue
$W_{e_{k'}}$ while some other packet j′ (having the same lag value L,
because they were sharing the same queue) traverses the edge $e_{k'}$,
packet j gives its token numbered L to this j′. This maintains the
invariant.
dimensions.
$1 - \varepsilon \leq \frac{\|A(x_i) - A(x_j)\|_2^2}{\|x_i - x_j\|_2^2} \leq 1 + \varepsilon.$
Moreover, such a map can be computed in expected poly(n, D, 1/ε) time.
Note that the target dimension k is independent of the original
dimension D, and depends only on the number of points n and the
accuracy parameter ε. It is not difficult to show that we need at least
Ω(log n) dimensions. (Given n points with Euclidean distances in (1 ± ε),
the balls of radius $\frac{1-\varepsilon}{2}$ around these points are mutually disjoint,
so a volume argument gives the lower bound.)
with probability at least $1 - 1/n^2$, where $v_{ij}$ is the unit vector in the
direction of $x_i - x_j$. By a union bound, all $\binom{n}{2}$ pairs of distances in
$\binom{X}{2}$ are maintained with probability at least $1 - \binom{n}{2}\frac{1}{n^2} \geq 1/2$. A few
comments about this construction:
• The above proof shows not only the existence of a good map, we
also get that a random map as above works with constant prob-
ability! In other words, a Monte-Carlo randomized algorithm
for dimension reduction. (Since we can efficiently check that the
distances are preserved to within the prescribed bounds, we can
convert this into a Las Vegas algorithm.) Or we can also get deter-
ministic algorithms: see here.
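Here is a small numpy sketch of this random map (our own illustration): project n points from dimension D down to k = O(ε⁻² log n) dimensions using a scaled Gaussian matrix, and measure the worst distortion; the constant 8 in our choice of k is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, eps = 50, 1000, 0.2
k = int(np.ceil(8 * np.log(n) / eps**2))  # constant 8 is illustrative

X = rng.normal(size=(n, D))        # n arbitrary points in R^D
M = rng.normal(size=(k, D))        # i.i.d. N(0,1) entries
Y = X @ M.T / np.sqrt(k)           # A(x) = Mx / sqrt(k), all points at once

worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        ratio = (np.linalg.norm(Y[i] - Y[j]) /
                 np.linalg.norm(X[i] - X[j])) ** 2
        worst = max(worst, abs(ratio - 1))
print(f"k = {k}, worst squared-distance distortion = {worst:.3f}")
```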
Let us recall some basic facts about Gaussian distributions. The prob-
ability density function for the Gaussian $N(\mu, \sigma^2)$ is
$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$
We also use the following; the proof just needs some elbow grease.
(The fact that the means and the variances take on the claimed values
should not be surprising; this is true for all r.v.s. The surprising part
is that the resulting variables are also Gaussians.)

Proposition 11.3. If $G_1 \sim N(\mu_1, \sigma_1^2)$ and $G_2 \sim N(\mu_2, \sigma_2^2)$ are independent,
then for $c \in \mathbb{R}$, we have $cG_1 \sim N(c\mu_1, c^2\sigma_1^2)$ and $G_1 + G_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
Now, here’s the main idea in the proof of Lemma 11.2. Imagine
that the vector x is the elementary unit vector e1 = (1, 0, . . . , 0). Then
M e1 is just the first column of M, which is a vector with independent
and identical Gaussian values.
$M e_1 = \begin{pmatrix} G_{1,1} & G_{1,2} & \cdots & G_{1,D} \\ G_{2,1} & G_{2,2} & \cdots & G_{2,D} \\ \vdots & \vdots & \ddots & \vdots \\ G_{k,1} & G_{k,2} & \cdots & G_{k,D} \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} G_{1,1} \\ G_{2,1} \\ \vdots \\ G_{k,1} \end{pmatrix}$
$A(x)$ is a scaling-down of this vector by $\sqrt{k}$: every entry in this
random vector $A(x) = A(e_1)$ is distributed as
$\frac{1}{\sqrt{k}} \cdot N(0,1) = N(0, 1/k)$ (by (11.1)).
Thus, the expected squared length of $A(x) = A(e_1)$ is
$\mathbb{E}\big[\|A(x)\|^2\big] = \mathbb{E}\Big[\sum_{i=1}^k A(x)_i^2\Big] = \sum_{i=1}^k \mathbb{E}\big[A(x)_i^2\big] = \sum_{i=1}^k \frac{1}{k} = 1.$
(If G has mean µ and variance σ², then $\mathbb{E}[G^2] = \mathrm{Var}[G] + \mathbb{E}[G]^2 = \sigma^2 + \mu^2$.)
E[∥ A( x )∥2 ] = ∥ x ∥2 .
Observe that we did not use the fact that the matrix entries were
Gaussians. We will use that fact for the concentration bound, which we
show next.
$Z := \|A(x)\|^2 = \frac{1}{k}\sum_{i=1}^k (Mx)_i^2,$
Plugging back into (11.4), the bound on the upper tail shows that for
all $t \in (0, 1/2)$,
$\Pr[Z \geq (1+\varepsilon)] \leq \Big( \frac{1}{e^{t(1+\varepsilon)}\sqrt{1-2t}} \Big)^k.$
if we set t = ε/4 and use the fact that 1 − 2t ≥ 1/2 for ε ≤ 1/2. (Note:
this setting of t also satisfies t ∈ (0, 1/2), which we needed from our
previous calculations.)
Almost done: let’s take stock of the situation. We observed that
∥ A( x )∥22 was distributed like an average of squares of Gaussians, and
by a Chernoff-like calculation we proved that
It turns out that the proof of Lemma 11.2 is a bit cleaner (with fewer
calculations) if we use the abstraction provided by the generic Cher-
noff bound from last lecture, and the notion of subGaussian random
variables which we introduce next. This abstraction will also allow
us to extend the result to JL matrices having i.i.d. entries from other
distributions, e.g., where each $M_{ij}$ is drawn uniformly from $\{-1, +1\}$.
$\psi(t) \leq \frac{\sigma^2 t^2}{2}$
for all t ≥ 0. It is subgaussian with parameter σ up to t0 if the above
inequality holds for all |t| ≤ t0 .
Most tail bounds you will prove using the subgaussian perspective
will come down to showing that some random variable is subgaus-
sian with parameter σ, whereupon you can use Theorem 11.7. Given
that you will often reason about sums of subgaussians, you may use
the next fact, which is an analog of Proposition 11.3.
$\mathbb{E}[e^{tV}] = \mathbb{E}\big[e^{t\sum_i x_i V_i}\big] = \prod_i \mathbb{E}[e^{t x_i V_i}] \leq \prod_i e^{(t x_i)^2 \sigma_i^2 / 2}.$
Finally, taking logarithms, $\psi_V(t) = \sum_i \psi_{V_i}(t x_i) \leq \sum_i \frac{t^2 x_i^2 \sigma_i^2}{2}$.
$Z := \|A(x)\|^2 = \frac{1}{k}\sum_{i=1}^k (Mx)_i^2$ (11.9)
$\mathbb{E}[Z] = \mathbb{E}\big[\|A(x)\|^2\big] = \|x\|^2 = 1.$
(Note that we’ve just introduced W into the mix, without any provo-
cation!) Hence, rewriting
$\mathbb{E}_{V,W}\big[e^{\sqrt{2t}\,(V/\sigma)\,W}\big] = \mathbb{E}_W\big[\mathbb{E}_V[e^{(\sqrt{2t}\,W/\sigma)\,V}]\big],$
Excellent. Now the bound on the upper tail for sums of squares
of symmetric mean-zero σ-subgaussians follows from that of Gaus-
2
sians. The lower tail (which requires us to bound E[etV ] for t < 0)
needs one more idea: suppose V is a mean-zero σ-subgaussian with
$\mathbb{E}[e^{tV^2}] = 1 + t\,\mathbb{E}[V^2] + \sum_{i \geq 2} \frac{t^i\, \mathbb{E}[V^{2i}]}{i!}.$
Since $\mathbb{E}[V^2] = 1$ and $|t| < 1$, this is at most $1 + t + t^2\,\mathbb{E}[e^{V^2}]$. Now
use the above bound $\mathbb{E}[e^{V^2}] \leq \mathbb{E}[e^{W^2}]$ to get that $\mathbb{E}[e^{tV^2}] \leq 1 + t + t^2/\sqrt{1-2t}$,
and the proof proceeds as for the Gaussian case.
In summary, we get the same tail bounds as in §11.4.1, and hence
that the Rademacher matrix also has the distributional JL property,
while using far fewer random bits!
In general one can use other σ-subgaussian distributions to fill
the matrix M—using σ different than 1 may require us to rework the
proof from §11.4.1 since the linear terms in (11.6) don’t cancel any
more, see works by Indyk and Naor or Matousek for details. Indyk and Naor (2008)
Matoušek (2008)
Lemma 11.11 (Unique Decoding). If A has Kruskal rank ≥ 2s, then for
any b we have Ax = b for at most one s-sparse x.
So we can just find some sensing matrix with large Kruskal rank (give
examples here), and ensure our results will be unique. The next
question is: how fast can we find x? (We should also be worried
about noise in the measurements.) A generic construction of matrices
with large Kruskal rank may not give us efficient solutions to (11.10).
Indeed, it turns out that the problem as formulated is NP-hard, as-
suming A and b are contrived by an adversary.
Of course, asking to solve (11.10) for general A, b is a more difficult
problem than we need to solve. In our setting, we can choose A as
we like and then are given b = Ax, so we can ask whether there are
matrices A for which this decoding process is indeed efficient. This is
precisely what we do next.
$\geq (1-\varepsilon)\,\|\Delta_{S \cup B_1}\|_2 - (1+\varepsilon) \sum_{j \geq 2} \|\Delta_{B_j}\|_2$
$\geq (1-\varepsilon)\,\|\Delta_S\|_2 - \frac{1+\varepsilon}{\sqrt{2}}\,\|\Delta_S\|_2,$
where the first step uses the triangle inequality for norms, the second
uses that each ∆S∪ B1 and ∆ Bj are 3s-sparse, and the last step uses
$\|\Delta_{S \cup B_1}\|_2 \geq \|\Delta_S\|_2$ and also Claim 11.15. Finally, since $\varepsilon \leq 1/9$, we
have $1 - \varepsilon > \frac{1+\varepsilon}{\sqrt{2}}$, so the only remaining possibility is that $\Delta_S = 0$.
The next claim implies that ∆S = 0 implies that ∆ = 0, giving a
contradiction and hence the proof of Lemma 11.14.
Claim 11.16. $\|\Delta_S\|_1 \geq \|\Delta_{\bar{S}}\|_1$.
$\|\Delta_{B_j}\|_2 \leq \sqrt{2s} \cdot \frac{\|\Delta_{B_{j-1}}\|_1}{2s} = \frac{\|\Delta_{B_{j-1}}\|_1}{\sqrt{2s}}.$
Summing this over all $j \geq 2$, we get
$\sum_{j \geq 2} \|\Delta_{B_j}\|_2 \leq \sum_{j \geq 2} \frac{\|\Delta_{B_{j-1}}\|_1}{\sqrt{2s}} = \frac{\|\Delta_{\bar{S}}\|_1}{\sqrt{2s}}.$
Now $\|\Delta_{\bar{S}}\|_1 \leq \|\Delta_S\|_1$ by Claim 11.16. And finally, since the support
of $\Delta_S$ is of size s, we can bound its $\ell_1$ length by $\sqrt{s}$ times its $\ell_2$
length, finishing the claim. (Since we wanted that factor of $\sqrt{2}$ in the
denominator, we made the buckets slightly larger than the size of S.)
(Exercise: for any vector $v \in \mathbb{R}^d$, show that $\|v\|_1 \leq \sqrt{|\mathrm{supp}(v)|} \cdot \|v\|_2$.)
Proof. The proof is simple, but uses some fairly general ideas worth
emphasizing. First, focus on some s-dimensional subspace of Rn
(obtained by restricting to some subset of coordinates). For notational
simplicity, we just identify this subspace with Rs .
For the contraction, consider any x ∈ Ss−1 , with closest net point y.
Then $\|Ax\| \geq \|Ay\| - \|A(x-y)\| \geq (1-\delta) - (1+3\delta)\delta \geq 1 - 3\delta$,
again as long as δ ≤ 1/3.
4. Now apply the above argument to each of the $\binom{n}{s}$ subspaces ob-
tained by restricting to some subset S of coordinates. By a union
bound over all subsets S, and over all points in the net for that
subspace, the matrix A is an 3δ-isometry on all points with sup-
port in S except with probability
$\binom{n}{s} \cdot (4/\delta)^s \cdot \exp(-c\delta^2 m) \leq \exp(-\Theta(m)),$
Theorem 11.18 (Heavy Shells). At least $1 - \varepsilon$ of the mass of the unit ball
in $\mathbb{R}^d$ lies within a $\Theta\big(\frac{\log 1/\varepsilon}{d}\big)$-width shell next to the surface.
Theorem 11.19 (Heavy Slabs). At least $1 - \varepsilon$ of the mass of the unit ball
in $\mathbb{R}^d$ lies within a $\Theta(1/\sqrt{d})$-width slab around any hyperplane that passes
through the origin.
where $G \sim N(0, \sigma^2)$. But we know that $\Pr[G \geq w] \leq e^{-w^2/2\sigma^2}$ by
our generic Chernoff bound for Gaussians (10.21). So setting that tail
probability to be ε gives
$w \approx \sqrt{2\sigma^2 \log(1/\varepsilon)} = O\Big(\sqrt{\frac{\log(1/\varepsilon)}{d}}\Big).$
This may seem quite counter-intuitive: that 99% of the volume
of the sphere is within O(1/d) of the surface, yet 99% is within
$O(1/\sqrt{d})$ of any central slab! This challenges our notion of the ball
"looking like" a smooth circular object; it looks more like a very spiky
sea-urchin. (Figure 11.1: Sea Urchin, from uncommoncaribbean.com.)
Finally, a last observation:
a1 , a2 , a3 , . . . , a t , . . .
1. Can we compute the sum of all the integers seen so far? I.e.,
$F(a_{[1:t]}) = \sum_{i=1}^t a_i$. We want the outputs to be
3, 4, 21, 25, 16, 48, 149, 152, −570, −567, 333, 337, 369, . . .
3, 1, 17, 17, 17, 32, 101, 101, 101, 101, 900, 900, 900
3. The median? The outputs on the various prefixes of (12.1) now are
3, 1, 3, 3, 3, 3, 4, 3, . . .
1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 9, 9, 9 . . .
may just want to read over the file in one quick pass and come up
with an answer. Such an algorithm might also be cache-friendly. But
how to do this?
Two of the recurring themes will be:
and xit is the number of times the ith element in U has been seen until
time t. (Hence, xi0 = 0 for all i ∈ U.) When the next element comes in
and it is element j, we increment x j by 1.
(add, A), (add, B), (add, A), (del, B), (del, A), (add, C ), . . .
$F_p := \sum_{i=1}^{|U|} (x_i^t)^p.$ (12.2)
This estimator was given by Noga Alon, Yossi Matias, and Mario
Szegedy, in their Gödel-award winning paper on streaming computa- Alon, Matias, Szegedy (2000)
tion.
The choice of the hash family will be crucial: we want a small fam-
ily so that we require only a small amount of space to store the hash
function, but we want it to be rich enough for the subsequent analy-
sis to go through.
$C := \sum_{i \in U} x_i\, h(i).$
Then $\mathbb{E}[C^2] = \sum_{i,j} x_i x_j\, \mathbb{E}[h(i)h(j)] = \sum_i x_i^2 = F_2.$
$\mathbb{E}[(C^2)^2] = \mathbb{E}\Big[\sum_{p,q,r,s} h(p)h(q)h(r)h(s)\, x_p x_q x_r x_s\Big] = \sum_p x_p^4\, \mathbb{E}[h(p)^4] + 6\sum_{p<q} x_p^2 x_q^2\, \mathbb{E}[h(p)^2 h(q)^2] + \text{other terms}.$
This is because all the other terms have expectation zero. Why? The
terms like $\mathbb{E}[h(p)h(q)h(r)h(s)]$ where p, q, r, s are all distinct all become
zero because of 4-universality. Terms like $\mathbb{E}[h(p)^2 h(r)h(s)]$
become zero for the same reason. It is only terms like $\mathbb{E}[h(p)^2 h(q)^2]$
and $\mathbb{E}[h(p)^4]$ that survive, and since $h(p) \in \{-1,1\}$, they have expectation 1. So
$\Pr\big[|C^2 - \mathbb{E}[C^2]| > \varepsilon\, \mathbb{E}[C^2]\big] \leq \frac{\mathrm{Var}(C^2)}{(\varepsilon\, \mathbb{E}[C^2])^2} \leq \frac{2}{\varepsilon^2}.$
This is pretty pathetic: since ε is usually less than 1, the RHS is
usually more than 1.
this estimator has mean µ and variance σ²/k. (Why? Summing the
k independent copies sums the variances and so increases the total by a
factor k, but dividing by k reduces the variance by k².)
So if we keep k such independent counters $C_1, C_2, \ldots, C_k$, and return
their average $\overline{C^2} := \frac{1}{k}\sum_i C_i^2$, we get
$\Pr\big[|\overline{C^2} - \mathbb{E}[\overline{C^2}]| > \varepsilon\, \mathbb{E}[\overline{C^2}]\big] \leq \frac{\mathrm{Var}(\overline{C^2})}{(\varepsilon\, \mathbb{E}[\overline{C^2}])^2} \leq \frac{2}{k\varepsilon^2}.$
$M_{ij} := h_i(j).$
The estimate $\overline{C^2} = \frac{1}{k}\sum_{i=1}^k C_i^2$ is nothing but
$\frac{1}{k}\,\|Mx\|_2^2.$
This is completely analogous to the construction for JL: we've got
a slightly taller matrix with $k = O(\varepsilon^{-2}\delta^{-1})$ rows instead of $k = O(\varepsilon^{-2}\log \delta^{-1})$
rows. However, the matrix entries are not fully independent
(as in JL), just 4-wise independent. I.e., we need to store only
$O(k \log D)$ bits and can generate any entry of M quickly, whereas the
construction for JL stored all kD bits. (Henceforth, we use $S = \frac{1}{\sqrt{k}} M$
to denote the "sketch" matrix.)
Let us record two properties of this construction:
$\|\tilde{C} - AB\|_F^2 \leq \text{small}.$
This usually takes $O(n^3)$ time. Indeed, the ijth entry of the product
is the dot-product of the ith row $A_{i\star}$ of A with the jth column $B_{\star j}$ of
B, and each dot-product takes O(n) time.
Suppose instead we use a "fat and short" k × n matrix S (for k ≪ n),
and calculate
$\tilde{C} = A S^\intercal S B.$
By associativity of matrix multiplication, we could first compute
$(AS^\intercal)$ and $(SB)$ in time $O(n^2 k)$, and then multiply the results in
time $O(n^2 k)$. Moreover, the matrix S from the previous section works
pretty well, where we set D = n. (The intuition is that $S^\intercal S$ is an
almost-identity matrix: it has 1s on the diagonal and at most ε everywhere
else, and hence it gives only a small error. Of course, we don't multiply
out $S^\intercal S$, but instead compute $AS^\intercal$ and SB, and then multiply
the smaller matrices.)
Indeed, the entries of the error matrix $Y = AB - \tilde{C}$ satisfy
$\mathbb{E}[Y_{ij}] = 0$
and
$\mathbb{E}\big[\|Y\|_F^2\big] = \cdots = \frac{2}{k}\, \|A\|_F^2\, \|B\|_F^2.$
Finally, setting $k = \frac{2}{\varepsilon^2 \delta}$ and using Markov's inequality, we can say that
for any fixed ε > 0, we can compute an approximate matrix product
$\tilde{C} := AS^\intercal SB$ such that
$\Pr\big[\|AB - \tilde{C}\|_F \leq \varepsilon \cdot \|A\|_F\, \|B\|_F\big] \geq 1 - \delta,$
in time $O\big(\frac{n^2}{\varepsilon^2 \delta}\big)$. (If we want to make δ very small, at the expense of
picking more independent random bits in the sketching matrix S, we
can use the JL matrices instead. Details will appear in a homework.)
Finally, if the matrices A, B are sparse and contain only ≪ n² entries,
the time can be made to depend on nnz(A, B).
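A small numpy sketch (ours) of this approximate-product idea; for simplicity we use a Gaussian sketching matrix in place of the 4-wise independent one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 300, 1200  # k ~ 2/(eps^2 delta) for accuracy eps, failure prob delta

A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
S = rng.normal(size=(k, n)) / np.sqrt(k)   # sketch matrix, E[S^T S] = I

C_tilde = (A @ S.T) @ (S @ B)              # O(n^2 k) instead of O(n^3)
err = np.linalg.norm(A @ B - C_tilde, 'fro')
bound = np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')
print("relative Frobenius error:", err / bound)
```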
The approximate matrix product question has been considered
often, e.g., by Edith Cohen and David Lewis using a random-walks Cohen and Lewis (1999)
approach. The algorithm we present is due to Tamás Sarlós; his pa-
per gives better results, as well as extensions to computing SVDs
faster. Better bounds have subsequently been given by Clarkson and
Woodruff. More recent refs too.
same argument, for any integer s we expect the sth smallest mapped
value to be at $\approx \frac{sM}{d}$. We use a larger value of s to reduce the variance.
$D_t = \frac{M \cdot s}{L_t}.$
$\Pr\big[D_t > 2\|x^t\|_0\big] \leq \frac{3}{s}, \quad\text{and}$ (12.4)
$\Pr\Big[D_t < \frac{\|x^t\|_0}{2}\Big] \leq \frac{3}{s}.$ (12.5)
We will prove this in the next section. First, some observations.
Firstly, we now use the stronger assumption that the hash family is
2-universal; recall the definition from Section 12.2.2. Next, setting
s = 12 means that the estimate $D_t$ lies within $\big[\frac{\|x^t\|_0}{2},\, 2\|x^t\|_0\big]$ with
probability at least 1 − (1/4 + 1/4) = 1/2. (And we can boost the
$\Pr[\text{estimate too low}] = \Pr[D_t < d/2] = \Pr\Big[L_t > \frac{2sM}{d}\Big].$
Recall T is the set of all d (= ∥xt ∥0 ) distinct elements in U that
have appeared so far. How many of these elements in T hashed to
values greater than 2sM/d? The event that Lt > 2sM/d (which
is what we want to bound the probability of) is the same as saying
that fewer than s of the elements in T hashed to values smaller than
2sM/d. For each i = 1, 2, . . . , d, define the indicator
$X_i = \begin{cases} 1 & \text{if } h(e_i) \leq 2sM/d \\ 0 & \text{otherwise} \end{cases}$ (12.6)
$\Pr[X_i = 1] = \frac{\lfloor sM/2d \rfloor}{M} \geq \frac{s}{2d} - \frac{1}{M}.$ (12.7)
By linearity of expectations,
$\mathbb{E}[X] = \mathbb{E}\Big[\sum_{i=1}^d X_i\Big] = \sum_{i=1}^d \mathbb{E}[X_i] = \sum_{i=1}^d \Pr[X_i = 1] \geq d\Big(\frac{s}{2d} - \frac{1}{M}\Big) = \frac{s}{2} - \frac{d}{M}.$
Let's imagine we set M large enough so that d/M is, say, at most $\frac{s}{100}$.
Which means
$\mathbb{E}[X] \geq \frac{s}{2} - \frac{s}{100} = \frac{49s}{100}.$
So by Markov's inequality,
$\Pr[X > s] = \Pr\Big[X > \frac{100}{49}\, \mathbb{E}[X]\Big] \leq \frac{49}{100}.$
49 100
Good? Well, not so good. We wanted the probability of failure to be
smaller than 2/s; we got it to be only slightly less than 1/2. Good try,
but no cigar.
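To make the estimator concrete, here is a small Python sketch (ours) of the s-th-minimum idea: hash each element, keep the s smallest distinct hash values, and output D = Ms/L. The memoized dictionary stands in for a 2-universal hash function and is only for illustration.

```python
import random

def distinct_estimate(stream, s=32, M=2**31):
    """Estimate the number of distinct elements by tracking the s smallest
    distinct hash values and returning M*s / L, where L is the s-th
    smallest. A real implementation stores only O(s) words plus the
    hash function; the dict here is just a demo stand-in."""
    hash_of = {}
    mins = []  # sorted list of the s smallest distinct hash values seen
    for item in stream:
        v = hash_of.setdefault(item, random.randrange(1, M))
        if v not in mins and (len(mins) < s or v < mins[-1]):
            mins.append(v)
            mins.sort()
            del mins[s:]
    return M * s / mins[-1]  # assumes at least s distinct elements

stream = [random.randrange(10000) for _ in range(100000)]
print("estimate:", round(distinct_estimate(stream)), "true:", len(set(stream)))
```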
13.1 Introduction
$AV = UDV^\intercal V = UD$
lowing, we see how to obtain the SVD and why it solves our best fit
problem. The lecture is partly based on [2].
(Figure 13.2: a point $a_i$, its distance $\beta_i$ to the subspace V, and the length $\alpha_i$ of its projection onto V.)
We start with the case that k = 1. Thus, we look for the line
through the origin that minimizes the sum of the squared errors.
See Figure 13.2. It depicts a one-dimensional subspace V in blue. We
look at a point ai , its distance β i to V, and the length of its projection
to V which is named $\alpha_i$ in the picture. Notice that the squared length
of $a_i$ is $\alpha_i^2 + \beta_i^2$. Thus, for our fixed $a_i$, minimizing $\beta_i$ is equivalent to maxi-
mizing αi . If we represent V by a unit vector v that spans V (depicted
in orange in the picture), then we can compute the projection of ai to
V by the dot product ⟨ ai , v⟩. We have just argued that we can find the
best fit subspace of dimension one by solving
$\max_{v \in \mathbb{R}^d,\, \|v\|=1} \; \sum_{i=1}^n \langle a_i, v \rangle^2,$
which has the same optimizers as $\min_{v \in \mathbb{R}^d,\, \|v\|=1} \sum_{i=1}^n \mathrm{dist}(a_i, \mathrm{span}(v))^2$,
where $\mathrm{dist}(a_i, \mathrm{span}(v))$ denotes the distance between the point $a_i$ and
the line spanned by v. Now because $Av = (\langle a_1, v\rangle, \langle a_2, v\rangle, \ldots, \langle a_n, v\rangle)^\intercal$,
we can rewrite $\sum_{i=1}^n \langle a_i, v\rangle^2$ as $\|Av\|^2$. We define the first right
singular vector to be a unit vector that maximizes $\|Av\|$. We thus know
that the subspace spanned by it is the best fit subspace of dimension
one. (There may be many vectors that achieve the maximum: indeed,
for every v that achieves the maximum, $-v$ does too. Let us break ties
arbitrarily.)
Now we want to generalize this concept to more than one dimen- arbitrarily.
$u_i := \frac{Av_i}{\|Av_i\|} \quad \text{for all } i = 1, \ldots, r.$
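As a quick illustration (ours, using numpy), the first right singular vector returned by np.linalg.svd indeed maximizes ∥Av∥ over unit vectors, giving the best-fit line through the origin:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points in R^2, roughly along the direction (2, 1)
A = np.outer(rng.normal(size=200), [2.0, 1.0]) + 0.1 * rng.normal(size=(200, 2))

U, D, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                           # first right singular vector

print("best-fit direction:", v1)
print("||A v1|| =", np.linalg.norm(A @ v1))
# random unit vectors should not beat v1
for _ in range(5):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    print("||A w||  =", np.linalg.norm(A @ w))
```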
Proof. We prove the claim by using the fact that two matrices $A, B \in \mathbb{R}^{n \times d}$
are identical iff for all vectors v, the images are equal, i.e. $Av = Bv$.
Assume that the entries in U and V are positive. Since the column
vectors are unit vectors, they define a convex combination of the
$A(A^+ b) = b \quad \forall b \text{ in the image of } A$
$x^* = A^+ b$
For instance, you can check that $A^k$ or $e^A$ defined this way indeed
correspond to what you think they might mean. (The other way to
define $e^A$ would be $\sum_{k \geq 0} \frac{A^k}{k!}$.)
Part III
“Modern” Algorithms
14
Online Learning: Experts and Bandits
$2.41\,(m_i + \log_2 N),$
Note that
Theorem 14.4. For ε ∈ (0, 1/2), penalizing each incorrect expert by a factor
of (1 − ε) guarantees that the number of mistakes made by MW is at most
$2(1+\varepsilon)\, m_i + O\Big(\frac{\log N}{\varepsilon}\Big).$
This shows that we can make our mistakes bound as close to 2m∗
as we want, but this approach seems to have this inherent loss of
a factor of 2. In fact, no deterministic strategy can do better than a
factor of 2, as we show next.
Proposition 14.5. No deterministic algorithm A can do better than a factor
of 2, compared to the best expert.
Proof. Note that if the algorithm is deterministic, its predictions are
completely determined by the sequence seen thus far (and hence can
also be computed by the adversary). Consider a scenario with two
experts: the first always predicts 1, and the second always predicts 0.
Since A is deterministic, an adversary can fix the outcomes such that
A's predictions are always wrong. Hence at least one of the two experts
has an error rate of at most 1/2, while A's error rate is 1.
Note that the update of the weights proceeds exactly the same as
previously.
Theorem 14.6. Fix ε ≤ 1/2. For any fixed sequence of predictions, the
expected number of mistakes made by randomized weighted majority (RWM)
is at most
$\mathbb{E}[M] \leq (1+\varepsilon)\, m_i + O\Big(\frac{\log N}{\varepsilon}\Big).$
(The quantity $\varepsilon m_i + O\big(\frac{\log N}{\varepsilon}\big)$, the gap between the algorithm's
performance and that of the best expert, is called the regret with respect
to expert i.)
Proof. The proof is an analysis of the weight evolution that is more
careful than in Theorem 14.4. Again, the potential is $\Phi_t = \sum_i w_i^{(t)}$.
Define
$F_t := \frac{\sum_{i\ \text{incorrect}} w_i^{(t)}}{\sum_i w_i^{(t)}}$
to be the fraction of weight on incorrect experts at time t. Note that
$\mathbb{E}[M] = \sum_{t \in [T]} F_t.$
$\mathbb{E}[M] \leq m_i (1+\varepsilon) + \frac{\ln N}{\varepsilon}.$
Let's broaden the setting slightly, and consider the following dot-product
game. In each round,
1. The algorithm produces a vector of probabilities
$p^t = (p_1^t, p_2^t, \cdots, p_N^t) \in \Delta_N.$
(Define the probability simplex as $\Delta_N := \{ x \in [0,1]^N \mid \sum_i x_i = 1 \}$.)
to deduce that
Pr[mistake at time t] = ℓt , pt .
Theorem 14.7. Consider a fixed ε ≤ 1/2. For any sequence of loss vectors
in $[-1,1]^N$ and for all indices $i \in [N]$, the Hedge algorithm guarantees:
$\sum_{t=1}^T \langle p^t, \ell^t \rangle \leq \sum_{t=1}^T \ell_i^t + \varepsilon T + \frac{\ln N}{\varepsilon}$
Theorem 14.8. Consider a fixed ε ≤ 1/2. For any sequence of loss vectors
in $[-1,1]^N$, the Hedge algorithm guarantees that for any index $i \in [N]$:
$\sum_{t=1}^T \langle p^t, \ell^t \rangle \leq \sum_{t=1}^T \ell_i^t + \varepsilon \sum_{t=1}^T \big\langle (\ell^t)^2,\, p^t \big\rangle + \frac{\ln N}{\varepsilon},$
where $(\ell^t)^2$ denotes the entry-wise square.
$\frac{1}{T}\sum_t \langle p^t, \ell^t \rangle \leq \min_i \frac{1}{T}\sum_t \ell_i^t + \varepsilon = \min_{p^* \in \Delta_N} \frac{1}{T}\sum_t \langle \ell^t, p^* \rangle + \varepsilon.$
Corollary 14.10 (Average Gain). Let ρ ≥ 1 and ε ∈ (0, 1/2). For any
sequence of gain vectors $g^1, \ldots, g^T \in [-\rho, \rho]^N$ with $T \geq \frac{4\rho^2 \ln N}{\varepsilon^2}$, the gains
version of the Hedge algorithm produces probability vectors $p^t \in \Delta_N$ such
that
$\frac{1}{T}\sum_{t=1}^T \langle g^t, p^t \rangle \geq \max_{i \in [N]} \frac{1}{T}\sum_{t=1}^T \langle g^t, e_i \rangle - \varepsilon.$
However, now the algorithm only gets to see the loss ℓtat corre-
sponding to the action chosen by the algorithm, and not the entire
loss vector.
This limited-information setting is called the bandit setting. The name comes from the analysis of
slot machines, which are affectionately
known as “one-armed bandits”.
14.5.1 The Exp3 Algorithm
Surprisingly, we can obtain algorithms for the bandit setting from
algorithms for the experts setting, by simply “hallucinating” the
online learning: experts and bandits 181
cost vector, using an idea called importance sampling. This causes the
parameters to degrade, however.
Indeed, consider the following algorithm: we run an instance A
of the RWM algorithm, which is in the full information model. So at
each timestep,
1. A produces a probability vector $p^t \in \Delta_N$.
2. We sample an expert $I^t$ from the distribution $q^t := (1-\gamma)p^t + \frac{\gamma}{N}\mathbf{1}$, and follow its advice.
3. We get back the loss value $\ell^t_{I^t}$ for this chosen expert.
However, the LHS is not our real loss, since we chose $I^t$ according to
$q^t$ and not $p^t$. This means our expected total loss is really
$\sum_t \langle q^t, \ell^t \rangle = (1-\gamma)\sum_t \langle p^t, \ell^t \rangle + \frac{\gamma}{N} \sum_t \langle \mathbf{1}, \ell^t \rangle$
$\leq \sum_t \ell_i^t + \varepsilon T + \frac{N \log N}{\gamma\, \varepsilon} + \gamma T.$
Now choosing $\varepsilon = \sqrt{\frac{\log N}{T}}$ and $\gamma = \sqrt{N}\big(\frac{\log N}{T}\big)^{1/4}$ gives us a regret
of $\approx N^{1/2} T^{3/4}$. The interesting fact here is that the regret is again
sub-linear in T, the number of timesteps: this means that as T → ∞,
the per-step regret tends to zero.
The dependence on N, the number of experts/options, is now
polynomial, instead of being logarithmic as in the full-information
case. This is necessary: there is a lower bound of $\Omega(\sqrt{NT})$ in the
bandit setting. And indeed, the Exp3 algorithm itself achieves a near-optimal
regret bound of $O(\sqrt{NT \log N})$; we can show this by using a
finer analysis of Hedge that makes more careful approximations. We
defer these improvements to §14.5.3, and instead give an application
of this bandit setting to a problem in item pricing.
We can now use the low-regret algorithms for the experts problem to
show how to approximately solve linear programs (LPs). As a warm-
up, we use it to solve two-player zero-sum games, which are a special
case of LPs. In fact, zero-sum games are equivalent
to linear programming, see this work of
Ilan Adler. Is there an earlier reference?
15.1 (Two-Player) Zero-Sum Games
There are two players in such a game, traditionally called the “row
player" and the “column player". Each of them has some set of ac-
tions: the row player with m actions (associated with the set [m]), and
the column player with the n actions in [n]. Finally, we have a payoff
matrix M ∈ Rm×n . In a play of the game, the row player chooses a
row i ∈ [m], and simultaneously, the column player chooses a column
j ∈ [n]. If this happens, the row player gets $M_{i,j}$, and the column
player loses $M_{i,j}$. The winnings of the two players sum to zero, and
so we imagine that the payoff flows from the column player to the
row player. (Henceforth, when we talk about payoffs, these will always
refer to payoffs to the row player from the column player. This payoff
may be negative, which would capture situations where the column
player does better.)

15.1.1 Strategies, and Best-Response

Each player is allowed to have a randomized strategy. Given strategies
$p \in \Delta_m$ for the row player, and $q \in \Delta_n$ for the column player, the
expected payoff (to the row player) is
$p^\intercal M q = \sum_{i,j} p_i\, M_{i,j}\, q_j.$
The row player wants to maximize this value, while the column
player wants to minimize it.
Suppose the row player fixes a strategy p ∈ ∆m . Knowing p, the
column player can choose an action to minimize the expected payoff:
$C(p) := \min_{q \in \Delta_n} p^\intercal M q = \min_{j \in [n]} p^\intercal M e_j.$
The equality holds because the expected payoff is linear, and hence
the column player’s best strategy is to choose a column that mini-
mizes the expected payoff. The column player is said to be playing
their best response. Analogously, if the column player fixes a strategy
q ∈ ∆n , the row player can maximize the expected payoff by playing
their own best response:
$R(q) := \max_{p \in \Delta_m} p^\intercal M q = \max_{i \in [m]} e_i^\intercal M q.$
Now, the row player would love to play the strategy p such that
even if the column player plays best-response, the payoff is as large
as possible: i.e., it wants to achieve
max C ( p).
p∈∆m
min R(q).
q∈∆n
C ( p) ≤ R(q) (15.1)
$p^t \in \Delta_m$. Initially $p^1 = \big(\frac{1}{m}, \ldots, \frac{1}{m}\big)$, which represents that the row
player chooses each row with equal probability, when they have no
information to work with.
At each time t, the column player plays the best-response to pt , i.e.,
$j^t := \arg\max_{j \in [n]} (p^t)^\intercal M e_j.$
to be the average long-term plays of the row player, and of the best
responses of the column player to those plays. We know that
$C(\hat{p}) \leq R(\hat{q})$
by (15.1). But by Corollary 14.10, after $T \geq \frac{4 \ln m}{\varepsilon^2}$ steps,
$\frac{1}{T}\sum_t \langle p^t, g^t \rangle \geq \max_i \frac{1}{T}\sum_t \langle e_i, g^t \rangle - \varepsilon$ (by Hedge)
$= \max_i \Big\langle e_i, \frac{1}{T}\sum_t g^t \Big\rangle - \varepsilon$
$= \max_i \Big\langle e_i, M\, \frac{1}{T}\sum_t e_{j^t} \Big\rangle - \varepsilon$ (by definition of $g^t$)
$= \max_i \langle e_i, M\hat{q} \rangle - \varepsilon$
$= R(\hat{q}) - \varepsilon.$
Since $p^t$ is the row player's strategy, and C is concave (i.e., the payoff
on the average strategy $\hat{p}$ is at least the average of the payoffs):
$\frac{1}{T}\sum_t \langle p^t, g^t \rangle = \frac{1}{T}\sum_t C(p^t) \leq C\Big(\frac{1}{T}\sum_t p^t\Big) = C(\hat{p}).$
(To see this, recall that $C(p) := \min_q p^\intercal M q$ is a minimum of linear
functions of p, and hence concave.)
We assume that ρ ≥ 1.
$\langle p^t, g^t \rangle = \langle p^t, Ax^t - b \rangle = \langle p^t, Ax^t \rangle - \langle p^t, b \rangle = \langle \alpha^t, x^t \rangle - \beta^t \leq 0,$
$\frac{1}{T}\sum_{t=1}^T \langle p^t, g^t \rangle \leq 0.$
$\frac{1}{T}\sum_{t=1}^T \langle e_i, g^t \rangle = \Big\langle e_i, \frac{1}{T}\sum_{t=1}^T g^t \Big\rangle = \frac{1}{T}\sum_{t=1}^T \big(\langle a_i, x^t \rangle - b_i\big) = \langle a_i, \hat{x} \rangle - b_i.$
$0 \geq \frac{1}{T}\sum_{t=1}^T \langle p^t, g^t \rangle \geq \max_i \big(\langle a_i, \hat{x} \rangle - b_i\big) - \varepsilon.$
This shows that $A\hat{x} \leq b + \varepsilon \mathbf{1}$.
$A\hat{x} \leq b + (\varepsilon + \delta)\mathbf{1},$
but now the number of calls to the relaxed oracle can be even
smaller, namely $O(\rho_{\mathrm{rlx}}^2 \ln m/\varepsilon^2)$.
In the s-t maximum flow problem, we are given a graph G = (V, E), and
distinguished vertices s and t. Each edge has a capacity $u_e \geq 0$; we
will mostly focus on the unit-capacity case of ue = 1 in this chapter.
The graph may be directed or undirected; an undirected edge can be
modeled by two oppositely directed edges having the same capacity.
Recall that an s-t flow is an assignment $f : E \to \mathbb{R}_+$ such that at
every vertex $v \notin \{s, t\}$, the flow into v equals the flow out of v.
The value of flow f is ∑e=(s,w)∈E f (e) − ∑e=(u,s)∈E f (e), the net amount
of flow leaving the source node s. The goal is to find an s-t flow in
the network, that satisfies the edge capacities, and has maximum
value.
Algorithms by Edmonds and Karp, by Yefim Dinitz, and many
others can solve the s-t max-flow problem exactly in polynomial
time. For the special case of (directed) graphs with unit capaci-
ties, Shimon Even and Bob Tarjan, and independently, Alexander
Karzanov showed in 1975 that the Ford-Fulkerson algorithm finds
the maximum flow in time O(m · min(m1/2 , n2/3 )). This runtime
was eventually matched for general capacities (up to some poly-
logarithmic factors) by an algorithm of Andrew Goldberg and Satish
Rao in 1998. For the special case of m = O(n), these results gave a
runtime of O(m1.5 ), but nothing better was known even for approx-
imate max-flows, even for unit-capacity undirected graphs—until a
breakthrough in 2010, which we will see at the end of this chapter.
$\max \sum_{P \in \mathcal{P}} f_P$ (16.1)
$\text{s.t.} \quad \sum_{P: e \in P} f_P \leq u_e \quad \forall e \in E$
$f_P \geq 0 \quad \forall P \in \mathcal{P}$
The first set of constraints says that for each edge e, the total contribution
of the paths through e is no greater than the capacity $u_e$ of that edge.
The second set of constraints says that the flow on each path
must be non-negative. This is a gigantic linear program: there could
be an exponential number of s-t paths. As we see, this will not be a
hurdle.
$K := \Big\{ f \;\Big|\; \sum_{P \in \mathcal{P}} f_P = F,\; f \geq 0 \Big\}.$
$\sum_{P \in \mathcal{P}} f_P\, \mathrm{len}_t(P) \leq 1,$ (16.3)
3. it increases the length of each edge on this path multiplicatively.
(The factor happens to be $(1 + \varepsilon/F)$, because of how we rescale the
gains, but that does not matter for this intuition.)
This length-increase makes congested edges (those with a lot of flow)
be much longer, and hence become very undesirable when search-
ing for short paths. Note that the process is repeated some number
of times, and then we average all the flows we find. So unlike usual
network flow algorithms based on residual networks, these algo-
rithms are truly greedy and cannot “undo” past actions (which is
what pushing flow in residual flow networks does, when we use an
arc backwards). This means these MW-based algorithms must ensure
that very little flow goes on edges that are “wasteful”.
To illustrate this point, consider an example commonly used to
show that the greedy algorithm does not work for max-flow: Change
the figure to make it more instructive.
undirected graphs. Since then, works by Jonah Sherman (2013), and by
Kelner, Lee, Orecchia, and Sidford (2013) gave $O(m^{1+o(1)}/\varepsilon^{O(1)})$-time
algorithms for the problem. The current best runtime is
$O(m\,\mathrm{poly}\log m/\varepsilon^{O(1)})$, due to Richard Peng (2014).
Interestingly, Shang-Hua Teng, Jonah
Sherman, and Richard Peng are all
CMU graduates.
16.3.1 Electrical Flows
Given a connected undirected graph with general edge-capacities, we
can view it as an electrical circuit, where each edge e of the original
graph represents a resistor with resistance $r_e = 1/u_e$, and we connect
(say, a 1-volt) battery between s and t. This causes electrical current to
flow from s (the node with higher potential) to t. Recall the following
laws about electrical flows. (Figure 16.1: the currents on the wires
would produce an electric flow, where all the wires within the graph
have resistance 1; here $\varphi_s = 1$ and $\varphi_t = 0$.)
Theorem 16.2 (Kirchoff's Voltage Law). The directed potential changes
along any cycle sum to 0.
This means we can assign each node v a potential $\phi_v$. Now the
actual amount of current on any edge is given by Ohm's law, and is
related to the potential drop across the edge.
$f_{uv} = \frac{\phi_u - \phi_v}{r_{uv}}.$
For example, if we take the 6-node graph in Figure 16.1 and assume
that all edges have unit conductance, then its Laplacian matrix $L_G$ is:
$L_G = \begin{array}{c|cccccc} & s & t & u & v & w & x \\ \hline s & 2 & 0 & -1 & -1 & 0 & 0 \\ t & 0 & 2 & 0 & 0 & -1 & -1 \\ u & -1 & 0 & 3 & 0 & -1 & -1 \\ v & -1 & 0 & 0 & 2 & 0 & -1 \\ w & 0 & -1 & -1 & 0 & 2 & 0 \\ x & 0 & -1 & -1 & -1 & 0 & 3 \end{array}$
Now for a general graph G, we define the Laplacian to be:
$L_G = \sum_{uv \in E} L_{uv}.$
(The Laplacian $L_{uv}$ for the single edge uv has 1s on the diagonal at
locations (u, u), (v, v), and −1s at locations (u, v), (v, u). Draw figure.)
In other words, $L_G$ is the sum of little 'per-edge' Laplacians $L_{uv}$.
(Since each of those Laplacians is clearly positive semidefinite (PSD),
it follows that $L_G$ is PSD too. Recall that a symmetric matrix $A \in \mathbb{R}^{n \times n}$
is called PSD if $x^\intercal A x \geq 0$ for all $x \in \mathbb{R}^n$, or equivalently, if all its
eigenvalues are non-negative.)
For yet another definition for the Laplacian, first consider the
edge-vertex incidence matrix B ∈ {−1, 0, 1}m×n , where the rows are
indexed by edges and the columns by vertices. The row correspond-
ing to edge e = uv has zeros in all columns other than u, v, it has
an entry +1 in one of those columns (say u) and an entry −1 in the
$B = \begin{array}{c|ccccccc} & su & sv & uw & ux & vx & wt & xt \\ \hline s & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ t & 0 & 0 & 0 & 0 & 0 & -1 & -1 \\ u & -1 & 0 & 1 & 1 & 0 & 0 & 0 \\ v & 0 & -1 & 0 & 0 & 1 & 0 & 0 \\ w & 0 & 0 & -1 & 0 & 0 & 1 & 0 \\ x & 0 & 0 & 0 & -1 & -1 & 0 & 1 \end{array}$
(Strictly speaking, this display shows $B^\intercal$: the rows of B are indexed by edges and its columns by vertices.)
A little algebra shows this to be the vth entry of the vector Lϕ. Finally,
by 16.4, this net current into v must be zero, unless v is either s or t,
in which case it is either −k or k respectively. Summarizing, if ϕ are
the voltages at the nodes, they satisfy the linear system:
Lϕ = k(es − et ).
$\mathcal{E}(f) := \sum_{e \in E} f_e^2\, r_e = \sum_{(u,v) \in E} \frac{(\phi_u - \phi_v)^2}{r_{uv}} = \phi^\intercal L \phi.$
$\|L\hat{x} - b\|_L \leq \varepsilon\, \|\bar{x}\|_L.$
The algorithm is randomized? and runs in time O(m log2 n log 1/ε).
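For intuition, here is a dense numpy sketch (ours; the real solvers are the near-linear-time ones cited above) that computes electrical potentials and the resulting flow on the 6-node example by solving Lϕ = e_s − e_t with the pseudoinverse:

```python
import numpy as np

nodes = ['s', 't', 'u', 'v', 'w', 'x']
edges = [('s','u'), ('s','v'), ('u','w'), ('u','x'),
         ('v','x'), ('w','t'), ('x','t')]
idx = {v: i for i, v in enumerate(nodes)}

L = np.zeros((6, 6))
for (a, b) in edges:                      # unit resistances r_e = 1
    i, j = idx[a], idx[b]
    L[i, i] += 1; L[j, j] += 1
    L[i, j] -= 1; L[j, i] -= 1

b_vec = np.zeros(6)
b_vec[idx['s']], b_vec[idx['t']] = 1, -1  # one unit of current from s to t
phi = np.linalg.pinv(L) @ b_vec           # L is singular; use pseudoinverse

for (a, b) in edges:                      # Ohm's law: f = (phi_u - phi_v)/r
    print(f"flow on {a}{b}: {phi[idx[a]] - phi[idx[b]]:+.3f}")
print("energy phi^T L phi =", phi @ L @ phi)
```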
E ( f ) ≤ (1 + δ)E ( fe),
For the rest of this lecture we assume we can compute the corresponding
minimum-energy flow exactly in time $\widetilde{O}(m)$. The arguments
can easily be extended to incorporate the errors.
$K = \Big\{ f \;\Big|\; \sum_{P \in \mathcal{P}} f_P = F,\; f \geq 0 \Big\},$
$\sum_e p_e f_e \leq 1.$ (16.4)
$\sum_{e \in E} p_e f_e \leq (1+\varepsilon) \sum_{e \in E} p_e + \varepsilon = 1 + 2\varepsilon.$
2. (width) $\max_e f_e \leq O(\sqrt{m/\varepsilon})$.
Proof. Since the flow $f^*$ satisfies all the constraints, it burns energy
$\mathcal{E}(f^*) = \sum_e (f_e^*)^2 r_e \leq \sum_e r_e = \sum_e \big(p_e + \tfrac{\varepsilon}{m}\big) = 1 + \varepsilon.$
This proves the first part of the theorem. For the second part, we may
use the bound on energy burnt to obtain
$\sum_e f_e^2\, \tfrac{\varepsilon}{m} \leq \sum_e f_e^2 \big(p_e + \tfrac{\varepsilon}{m}\big) = \sum_e f_e^2\, r_e = \mathcal{E}(f) \leq 1 + \varepsilon.$
The idea to get an improved bound on the width is to use a crude but
effective trick: if we have an edge with electrical flow of more than
ρ ≈ m1/3 in some iteration, we delete it for that iteration (and for the
rest of the process), and find a new flow. Clearly, no edge now carries
a flow more than ρ. The main thrust of the proof is to show that we
do not end up butchering the graph, and that the maximum flow
value reduces by only a small amount due to these edge deletions.
Formally, we set
$\rho = \frac{m^{1/3} \log m}{\varepsilon},$ (16.5)
and show that at most εF edges are ever deleted by the process. The
crucial ingredient in this proof is this observation: every time we
delete an edge, the effective resistance between s and t increases by a
lot. (We assume that a flow value of F is feasible; moreover, $F \geq \rho$,
else Ford-Fulkerson can be implemented in time $O(mF) \leq \widetilde{O}(m^{4/3})$.)
Since we need to argue about how many edges are deleted in the
entire algorithm (and not just in one call to the oracle), we explicitly
maintain edge-weights $w_e^t$, instead of using the results from the
previous sections as a black-box.
$R'_{\mathrm{eff}} \geq R_{\mathrm{eff}}.$
$R'_{\mathrm{eff}} \geq \frac{R_{\mathrm{eff}}}{1 - \beta}.$
2. If there is an edge e with f et > ρ, delete e (for the rest of the algo-
rithm), and go back to Item 1.
Lemma 16.10. We delete at most m1/3 ≤ εF edges over the run of the
algorithm.
We defer the proof to later, and observe that the total number of
electrical flows computed is therefore O(T). Each such computation
takes $\widetilde{O}(m/\varepsilon)$ by Corollary 16.6, so the overall runtime of our
algorithm is $O(m^{4/3}/\mathrm{poly}(\varepsilon))$.
Next, we show that the flow $\hat{f}$ is a $(1 + O(\varepsilon))$-approximate maxi-
mum s-t flow. We start with an analog of Theorem 16.7 that accounts
for edge deletions.
The last step is very loose, but it will suffice for our purposes.
To calculate the congestion of the final flow, observe that even
though the algorithm above explicitly maintains weights, we can just
appeal directly to the guarantees. Indeed, define $p_e^t := \frac{w_e^t}{W^t}$ for each
time t; the previous part implies that the flow $f^t$ satisfies
$\sum_e p_e^t f_e^t \leq 1 + 3\varepsilon$
3. Each deleted edge e has flow at least ρ, and hence energy burn at
least $\rho^2 w_e^t \geq \rho^2 \frac{\varepsilon}{m} W^t$. Since the total energy burn is at most
$2W^t$ from Lemma 16.11, the deleted edge e was burning at least
$\beta := \frac{\rho^2 \varepsilon}{2m}$ fraction of the total energy. Hence
$R^{\mathrm{new}}_{\mathrm{eff}} \geq \frac{R^{\mathrm{old}}_{\mathrm{eff}}}{1 - \frac{\rho^2 \varepsilon}{2m}} \geq R^{\mathrm{old}}_{\mathrm{eff}} \cdot \exp\Big(\frac{\rho^2 \varepsilon}{4m}\Big)$
if we use $\frac{1}{1-x} \geq e^{x/2}$ when $x \in [0, 1/4]$.
4. For the final effective resistance, note that we send $F$ flow with total energy burn $2W^T$; since the energy depends on the square of the flow, we have $R^{final}_{\mathrm{eff}} \le \frac{2W^T}{F^2} \le 2W^T$.
(All these calculations hold as long as we have not deleted more than
ε F edges.) Now, to show that this invariant is maintained, suppose D
edges are deleted over the course of the T steps. Then
$$R^{0}_{\mathrm{eff}} \cdot \exp\Big(D \cdot \frac{\rho^2\varepsilon}{4m}\Big) \le R^{final}_{\mathrm{eff}} \le 2W^T \le 2m \cdot \exp\Big(\frac{2\ln m}{\varepsilon}\Big).$$
Taking logs and simplifying, we get that
$$\frac{\varepsilon \rho^2 D}{4m} \le \ln(2m^3) + \frac{2\ln m}{\varepsilon}
\implies D \le \frac{8m(\ln m)(1+O(\varepsilon))}{\varepsilon^2 \rho^2} \ll m^{1/3} \le \varepsilon F.$$
This bounds the number of deleted edges D as desired.
$$\mathcal{E}(f) = \sum_e f_e^2\, r_e$$
for flow values $f$ that represent a unit flow from $s$ to $t$ (these form a polytope). We alluded to algorithms that solve this problem, but one can also observe that $\mathcal{E}(f)$ is a convex function, and we want to find a minimizer within some polytope $K$. Equivalently, we wanted to solve the linear system $L\phi = (e_s - e_t)$, i.e., to minimize $\|L\phi - (e_s - e_t)\|_2$.
$$f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y), \tag{17.2}$$

There are two kinds of problems that we will study. The most basic question is that of unconstrained convex minimization (UCM): given a convex function $f$, we want to find
$$\min_{x \in \mathbb{R}^n} f(x).$$
17.1.1 Gradient
For most of the following discussion, we assume that the function f
is differentiable. In that case, we can give an equivalent characteriza-
tion, based on the notion of the gradient $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$.

(The directional derivative of $f$ at $x$ in the direction $y$ is defined as $f'(x; y) := \lim_{\varepsilon \to 0} \frac{f(x + \varepsilon y) - f(x)}{\varepsilon}$. If there exists a vector $g$ such that $\langle g, y\rangle = f'(x; y)$ for all $y$, then $f$ is called differentiable at $x$, and $g$ is called the gradient. It follows that the gradient must be of the form $\nabla f(x) = \big(\frac{\partial f}{\partial x_1}(x), \frac{\partial f}{\partial x_2}(x), \cdots, \frac{\partial f}{\partial x_n}(x)\big)$.)

Fact 17.3 (First-order condition). A function $f : K \to \mathbb{R}$ is convex if and only if
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle, \tag{17.3}$$
for all $x, y \in K$.

Geometrically, Fact 17.3 states that the function always lies above its tangent plane, for all points in $K$. If the function $f$ is twice-differentiable, and if $H_f(x)$ is its Hessian matrix, i.e. its matrix of second derivatives at $x \in K$:
$$(H_f)_{i,j}(x) := \frac{\partial^2 f}{\partial x_i \partial x_j}(x), \tag{17.4}$$
then we get yet another characterization of convex functions.
Fact 17.4 (Second-order condition). A twice-differentiable function f
is convex if and only if H f ( x ) is positive semidefinite for all x ∈ K.
| f ( x ) − f (y)| ≤ G ∥ x − y∥ ,
for all x, y ∈ K.
∥∇ f ( x )∥2 ≤ G, (17.5)
for all x ∈ K.
∇ f ( x ) = 0 ⇐⇒ Ax = b ⇐⇒ x = A−1 b.
$$x_{t+1} \leftarrow x_t - \eta_t \cdot \nabla f(x_t). \tag{17.6}$$
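To make the update rule (17.6) concrete, here is a minimal sketch (our illustration, not code from the notes); the names `gradient_descent`, `grad`, `eta`, and `T` are hypothetical, and we return the average iterate since that is what the guarantees below bound.

```python
import numpy as np

def gradient_descent(grad, x0, eta, T):
    """Iterate x_{t+1} = x_t - eta * grad(x_t), per (17.6), and return
    the average iterate x-hat, for which guarantee (17.7) is stated."""
    x = x0.astype(float)
    iterates = []
    for _ in range(T):
        iterates.append(x)
        x = x - eta * grad(x)
    return np.mean(iterates, axis=0)

# Example: f(x) = ||x||^2 / 2 has gradient x and minimizer 0.
x_hat = gradient_descent(grad=lambda x: x, x0=np.ones(5), eta=0.1, T=1000)
```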
$$f(\hat{x}) \le f(x^*) + \varepsilon. \tag{17.7}$$
$$\sum_{t=1}^{T} f(x_t) \le \sum_{t=1}^{T} f(x^*) + \frac{1}{2}\eta T G^2 + \frac{1}{2\eta}\|x_0 - x^*\|^2. \tag{17.8}$$
We will prove Theorem 17.8 in the next section, but let’s first use it
to prove Proposition 17.7, our guarantee on the offline convergence of
vanilla gradient descent.
By Theorem 17.8,
$$\frac{1}{T}\sum_{t=1}^{T} f(x_t) \le f(x^*) + \underbrace{\frac{1}{2}\eta G^2 + \frac{1}{2\eta T}\|x_0 - x^*\|^2}_{\text{error}}.$$
The error terms balance when $\eta = \frac{\|x_0 - x^*\|}{G\sqrt{T}}$, giving
$$f(\hat{x}) \le f(x^*) + \frac{\|x_0 - x^*\|\, G}{\sqrt{T}}.$$
Finally, we set $T = \frac{1}{\varepsilon^2}\, G^2 \|x_0 - x^*\|^2$ to obtain
$$f(\hat{x}) \le f(x^*) + \varepsilon.$$
Unlike the unconstrained case, the gradient at the minimizer may not be zero in the constrained case—it may be at the boundary. In this case, the condition for a convex function $f : K \to \mathbb{R}$ to be minimized at $x^* \in K$ is that $\langle \nabla f(x^*), y - x^* \rangle \ge 0$ for all $y \in K$. (This is the analog of the minimizer of a single-variable function being achieved either at a point where the derivative is zero, or at the boundary.)
function values. But we must change our algorithm to ensure that the new point $x_{t+1}$ lies within $K$. To ensure this, we simply project the new iterate $x_{t+1}$ back onto $K$. Let $\operatorname{proj}_K : \mathbb{R}^n \to K$ be defined as
$$\operatorname{proj}_K(y) := \arg\min_{x \in K} \|x - y\|_2.$$
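As a small illustration (a sketch under the assumption that $K$ is the Euclidean unit ball, one of the few bodies with a closed-form projection), projected gradient descent just composes each step with $\operatorname{proj}_K$:

```python
import numpy as np

def proj_unit_ball(y):
    """Projection onto K = {x : ||x|| <= 1}: rescale points outside the ball."""
    norm = np.linalg.norm(y)
    return y if norm <= 1.0 else y / norm

def projected_gradient_descent(grad, x0, eta, T, proj=proj_unit_ball):
    x = proj(x0)
    iterates = []
    for _ in range(T):
        iterates.append(x)
        x = proj(x - eta * grad(x))  # gradient step, then project back onto K
    return np.mean(iterates, axis=0)
```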
$$\Phi_{t+1} - \Phi_t = \frac{1}{2\eta}\Big(\|x_{t+1} - x^*\|^2 - \|x_t - x^*\|^2\Big) \tag{17.13}$$
$$\le \frac{1}{2\eta}\Big(\|x'_{t+1} - x^*\|^2 - \|x_t - x^*\|^2\Big). \tag{17.14}$$
Now the rest of the proof of Theorem 17.8 goes through unchanged.
Why is the claim $\|x'_{t+1} - x^*\| \ge \|x_{t+1} - x^*\|$ true? Since $K$ is convex, projecting onto it gets us closer to every point in $K$, in particular to $x^* \in K$. To formally prove this fact about projections, consider the angle $x^* \to x_{t+1} \to x'_{t+1}$. This is a non-acute angle, since the orthogonal projection means $K$ lies to one side of the hyperplane defined by the vector $x'_{t+1} - x_{t+1}$, as in the figure on the right.
Looking back at the proof in §17.2, the proof of Lemma 17.9 immedi-
ately extends to give us
$$f_t(x_t) + \Phi_{t+1} - \Phi_t \le f_t(x^*) + \frac{1}{2}\eta G^2.$$
Now summing this over all times $t$ gives
$$\sum_{t=1}^{T} \big(f_t(x_t) - f_t(x^*)\big) \le \sum_{t=1}^{T} \big(\Phi_t - \Phi_{t+1}\big) + \frac{\eta}{2} T G^2 \le \Phi_1 + \frac{1}{2}\eta T G^2,$$
since $\Phi_{T+1} \ge 0$. The proof is now unchanged: setting $T \ge \frac{\|x_1 - x^*\|^2 G^2}{\varepsilon^2}$ and $\eta = \frac{\|x_1 - x^*\|}{G\sqrt{T}}$, and doing some elementary algebra as above,
$$\frac{1}{T}\sum_{t=1}^{T} \big(f_t(x_t) - f_t(x^*)\big) \le \frac{\|x_1 - x^*\|\, G}{\sqrt{T}} \le \varepsilon.$$
for all convex bodies K and all convex functions, as opposed to being
just for the unit simplex ∆n and linear losses f t ( x ) = ⟨ℓt , x ⟩, say for
ℓt ∈ [−1, 1]n . However, in order to make a fair comparison, suppose
we restrict ourselves to ∆n and linear losses, and consider the number
of rounds T before we get an average regret of ε.
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|x - y\|^2. \tag{17.15}$$
We will work with the first-order definition, and show that the gradient descent algorithm with (time-varying) step size $\eta_t = O\big(\frac{1}{\alpha t}\big)$ converges to a value at most $f(x^*) + \varepsilon$ in time $T = \Theta\big(\frac{G^2}{\alpha \varepsilon}\big)$. Note there is no more dependence on the diameter of the polytope. Before we give this proof, let us give the other relevant definitions.
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2}\|x - y\|^2. \tag{17.16}$$
In this case, the gradient descent algorithm with fixed step size $\eta_t = \eta = O\big(\frac{1}{\beta}\big)$ yields an $\hat{x}$ which satisfies $f(\hat{x}) - f(x^*) \le \varepsilon$ when $T = \Theta\big(\frac{\beta \|x_1 - x^*\|^2}{\varepsilon}\big)$. In this case, note we have no dependence on the Lipschitzness $G$ any more; we only depend on the diameter of the polytope. Again, we defer the proof for the moment.
$$f(x_t) - f(x^*) \le \frac{\beta}{2}\,\exp\Big(\frac{-t}{\kappa}\Big)\,\|x_1 - x^*\|^2.$$
Proof. For $\beta$-smooth $f$, we can use Definition 17.12 to get
$$f(x_{t+1}) \le f(x_t) - \eta\,\|\nabla f(x_t)\|^2 + \frac{\beta}{2}\eta^2\,\|\nabla f(x_t)\|^2.$$
$$f(x_{t+1}) - f(x_t) \le -\frac{1}{2\beta}\,\|\nabla f(x_t)\|^2. \tag{17.17}$$
2β
$$\begin{aligned}
f(x_t) - f(x^*) &\le \langle \nabla f(x_t), x_t - x^* \rangle - \frac{\alpha}{2}\|x_t - x^*\|^2 \\
&\le \|\nabla f(x_t)\|\,\|x_t - x^*\| - \frac{\alpha}{2}\|x_t - x^*\|^2 \\
&\le \frac{1}{2\alpha}\,\|\nabla f(x_t)\|^2,
\end{aligned} \tag{17.18}$$
17.6.1 Subgradients
What if the convex function f is not differentiable? Staring at the
proofs above, all we need is the following:
17.6.3 Acceleration
setting the gradient of the function to zero; this gives us the expression
$$\eta \cdot \nabla f_t(x_t) + (x_{t+1} - x_t) = 0 \implies x_{t+1} = x_t - \eta \cdot \nabla f_t(x_t),$$
$$D_h(y \| x) = \tfrac{1}{2}\|y - x\|^2,$$
Again, setting the gradient at xt+1 to zero (i.e., the optimality condi-
tion for xt+1 ) now gives:
or, rephrasing:
$$\nabla h(x_{t+1}) = \nabla h(x_t) - \eta\, \nabla f_t(x_t) \tag{18.3}$$
$$\implies x_{t+1} = (\nabla h)^{-1}\big(\nabla h(x_t) - \eta\, \nabla f_t(x_t)\big) \tag{18.4}$$
1. When $h(x) = \frac{1}{2}\|x\|^2$, the gradient is $\nabla h(x) = x$. So we get
$$x_{t+1} = x_t - \eta\, \nabla f_t(x_t),$$
The name of the process comes from thinking of the dual space as be-
ing a mirror image of the primal space. But how do we choose these mir-
ror maps? Again, this comes down to understanding the geometry
of the problem, the kinds of functions and the set K we care about,
and the kinds of guarantees we want. To discuss these, let us first look at the notion of norms in some more detail.
for x ∈ Rn .
∥∇ f ( x )∥∗ ≤ G.
$\nabla h : \mathbb{R}^n \to \mathbb{R}^n$.
(iii) Map $\theta_{t+1}$ back to the primal space: $x'_{t+1} \leftarrow (\nabla h)^{-1}(\theta_{t+1})$.

(iv) Project $x'_{t+1}$ back into the feasible region $K$ by using the Bregman divergence: $x_{t+1} \leftarrow \arg\min_{x \in K} D_h(x \| x'_{t+1})$. In case $x'_{t+1} \in K$, e.g., in the unconstrained case, we get $x_{t+1} = x'_{t+1}$.
Note that the choice of h affects almost every step of this algorithm.
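To see how strongly the choice of $h$ matters, here is a sketch (ours, with hypothetical names) for the case where $K$ is the probability simplex and $h(x) = \sum_i x_i \ln x_i$ is the negative entropy: the dual step exponentiates, the Bregman projection renormalizes, and we recover the multiplicative-weights update.

```python
import numpy as np

def mirror_descent_simplex(grads, x1, eta):
    """Mirror descent with h(x) = sum_i x_i ln x_i on the simplex.
    grads is a list of gradient vectors, one per round."""
    x = np.asarray(x1, dtype=float)
    iterates = [x]
    for g in grads:
        w = x * np.exp(-eta * g)  # steps (i)-(iii): to the dual space and back
        x = w / w.sum()           # step (iv): Bregman projection = renormalize
        iterates.append(x)
    return iterates
```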
$$\frac{KL(x^* \| x_1)}{\eta} + \frac{\eta T}{2/\ln 2} \le \frac{\ln n}{\eta} + \eta T.$$
The last inequality uses that the KL divergence to the uniform distri-
bution on n items is at most ln n. (Exercise!) In fact, if we start with
a distribution x1 that is closer to x ∗ , the first term of the regret gets
smaller.
$$\Phi_t = \frac{D_h(x^* \| x_t)}{\eta}$$
$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*) \le \Phi_1 - \Phi_{T+1} + \sum_{t=1}^{T} \mathrm{blah}_t \le \Phi_1 + \sum_{t=1}^{T} \mathrm{blah}_t = \frac{D_h(x^* \| x_1)}{\eta} + \sum_{t=1}^{T} \mathrm{blah}_t.$$
The last inequality above uses that the Bregman divergence is always non-negative for convex functions. To complete the proof, it remains to show that $\mathrm{blah}_t$ in inequality (18.7) can be made $\frac{\eta}{2\alpha}\|\nabla f_t(x_t)\|_*^2$. Let
$$\Phi_{t+1} - \Phi_t = \frac{1}{\eta}\underbrace{\Big(D_h(x^* \| x_{t+1}) - D_h(x^* \| x_t)\Big)}_{(\star)};$$
where the latter inequality used the update rule (18.5) for mirror de-
scent, and the Cauchy-Schwarz inequality Corollary 18.4 for general
norms. Now using the AM-GM inequality shows that
$$(\dagger) \le \frac{1}{2\alpha}\,\|\eta\, \nabla f_t(x_t)\|_*^2.$$
Finally, remembering that the change in potential is given by $\frac{1}{\eta}(\star)$ finishes the proof of Lemma 18.8.
$$D_h(x^* \| x_{t+1}) \le D_h(x^* \| x'_{t+1})$$
$$x_{t+1} = x_t - \eta\, H_h(x_t)^{-1}\, \nabla f(x_t).$$
Some of you may have seen Newton’s method for minimizing convex
functions, which has the following update rule:
x t +1 = x t − η H f ( x t ) −1 ∇ f ( x t ).
This means mirror descent replaces the Hessian of the function itself
by the Hessian of a strongly convex function h. Newton’s method has
very strong convergence properties (it gets error ε in O(log log 1/ε)
iterations!) but is not “robust”—it is only guaranteed to converge
when the starting point is “close” to the minimizer. We can view
mirror descent as trading off the convergence time for robustness. Fill
in more on this view.
$$f(\hat{x}) \le \min_{x \in K} f(x) + \varepsilon.$$
1. Access to both a gradient oracle and a value oracle for the function $f$, and
1. By Grünbaum’s theorem,
$$E_{t+1} \leftarrow E_t \cap H$$
$$\frac{\operatorname{vol}(E_{t+1})}{\operatorname{vol}(E_t)} \le e^{-\frac{1}{2(n+1)}} \approx 1 - O(1/n).$$
3. Finally, after $T = 2n(n+1)\ln(R/r)$ rounds either we have not seen any point in $K$—in which case we say "$K$ is empty"—or else we output
$$\hat{x} \leftarrow \arg\min\{\, f(c_t) \mid c_t \in K_t,\ t \in \{1, \dots, T\} \,\}.$$
Now adapting the analysis from the previous sections gives us the
following result (assuming exact arithmetic again):
$$f(\hat{x}) - f(x^*) \le \frac{O(GR)}{r}\,\exp\Big(-\frac{T}{2n(n+1)}\Big).$$
Note the similarity to Theorem 19.2, as well as the differences: the
exponential term is slower by a factor of 2(n + 1). This is because
the volume of the successive ellipsoids shrinks much slower than
in Grünbaum’s lemma. Also, we lose a factor of R/r because K is
potentially smaller than the starting body by precisely this factor.
(Again, this presentation ignores precision issues, and assumes we
can do exact real arithmetic.)
The Ellipsoid algorithm is usually attributed to Naum Shor (N. Z. Šor and N. G. Žurbenko (1971)); the fact that this algorithm gives a polynomial-time algorithm for linear programming was a breakthrough result due to Khachiyan (L.G. Khachiyan (1979)), and was front-page news at the time. A great source of information about this algorithm is the Grötschel-Lovász-Schrijver book (M. Grötschel, L. Lovász, and A. Schrijver (1988)). A historical perspective appears in this survey by Bland, Goldfarb, and Todd.
min{c⊺ x | x ∈ K }
L(Ball(0, 1)) = { Lx : x⊺ x ≤ 1}
= { y : ( L −1 y )⊺ ( L −1 y ) ≤ 1 }
= {y : y⊺ ( LL⊺ )−1 y ≤ 1}
= { y : y ⊺ Q −1 y ≤ 1 } ,
$$\{y + c : y^\intercal Q^{-1} y \le 1\} = \{y : (y - c)^\intercal Q^{-1} (y - c) \le 1\}.$$
$$c_{t+1} := c_t - \frac{1}{n+1}\, h$$
and
$$Q_{t+1} = \frac{n^2}{n^2 - 1}\Big(Q_t - \frac{2}{n+1}\, hh^\intercal\Big),$$
where $h = Q_t a_t / \sqrt{a_t^\intercal Q_t a_t}$.
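A small sketch of this update step (our illustration; it assumes `a` is the normal of the separating halfspace through the current center):

```python
import numpy as np

def ellipsoid_update(c, Q, a):
    """One central-cut update of the ellipsoid {x : (x-c)^T Q^{-1} (x-c) <= 1},
    keeping the half containing {x : a^T x <= a^T c}."""
    n = len(c)
    h = Q @ a / np.sqrt(a @ Q @ a)            # the vector h defined above
    c_new = c - h / (n + 1)                   # shift the center into the halfspace
    Q_new = (n**2 / (n**2 - 1.0)) * (Q - (2.0 / (n + 1)) * np.outer(h, h))
    return c_new, Q_new
```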
$$\frac{(1 - c_1)^2}{a^2} \le 1 \qquad\text{and}\qquad \frac{c_1^2}{a^2} + \frac{1}{b^2} \le 1.$$
and moreover the ratio of the volume of the ellipsoid to that of the ball is
$$a\, b^{n-1} = (1 - c_1)\cdot\Big(\frac{(1 - c_1)^2}{1 - 2c_1}\Big)^{(n-1)/2}.$$
This is minimized by setting $c_1 = \frac{1}{n+1}$, which gives us
$$\frac{\operatorname{vol}(\mathcal{E})}{\operatorname{vol}(\mathrm{Ball}(0,1))} = \cdots \le e^{-\frac{1}{2(n+1)}}.$$
vol(Ball(0, 1))
For a more detailed description and proof of this process, see these notes from our LP/SDP course.
In fact, we can view this as the question of finding the minimum-volume ellipsoid that contains the half-ball $K$: this is a convex program, and looking at the optimality conditions for it gives us the same construction as above (without having to make the assumptions of symmetry).
Simplex: This is perhaps the first algorithm for solving LPs that most
of us see. It was also the first general-purpose linear program
solver known, having been developed by George Dantzig in 1947. G.B. Dantzig (1990)
This is a local-search algorithm: it maintains a vertex of the poly-
hedron K, and at each step it moves to a neighboring vertex with-
out decreasing the objective function value, until it reaches an op-
timal vertex. (The convexity of K ensures that such a sequence of
steps is possible.) The strategy to choose the next vertex is called
the pivot rule. Unfortunately, for most known pivot rules, there are examples on which following the pivot rule takes an exponential (or at least super-polynomial) number of steps. Despite
that, it is often used in practice: e.g., the Excel software contains an
implementation of simplex.
x≥0
We will only sketch the high-level idea behind Step 1 (finding the
starting solution), and will skip Step 2 (the rounding); our focus will
20.1.1 The Primal and Dual LPs, and the Duality Gap
Recall the primal linear program:
( P) min c⊺ x
Ax = b
x ≥ 0,
( D ) max b⊺ y
A⊺ y ≤ c.
( D ′ ) max b⊺ y
A⊺ y + s = c
s ≥ 0.
We assume that both the primal $(P)$ and dual $(D)$ are strictly feasible: i.e., they have solutions even if we replace the inequalities with strict ones. Then we can prove the following result, which relates the optimizer for $f_\eta$ to feasible primal and dual solutions:
$$\begin{aligned}
Ax - b &= 0 &\quad& (20.1)\\
A^\intercal y + s &= c && (20.2)\\
\forall i \in [n]:\ s_i x_i &= \eta && (20.3)
\end{aligned}$$
The conditions (20.1) and (20.2) show that $x$ and $(y, s)$ are feasible for the primal $(P)$ and dual $(D')$ respectively. The condition (20.3) is an analog of the usual complementary slackness result that arises when $\eta = 0$. To prove this lemma, we use the method of Lagrange multipliers. (Observe: we get that if there exists an optimum $x^*$, then $x^*$ satisfies these conditions.)

Theorem 20.2 (The Method of Lagrange Multipliers). Let functions $f$ and $g_1, \cdots, g_m$ be continuously differentiable, and defined on some open subset of $\mathbb{R}^n$. If $x^*$ is a local optimum of the following optimization problem
$$\min f(x) \quad \text{s.t.}\quad \forall i \in [m]: g_i(x) = 0,$$
then there exists $y^* \in \mathbb{R}^m$ such that $\nabla f(x^*) = \sum_{i=1}^{m} y_i^*\, \nabla g_i(x^*)$.
The first step uses that if there are strictly feasible primal and dual solutions $(\hat{x}, \hat{y}, \hat{s})$, then the region $\{x \mid Ax = b,\ f_\eta(x) \le f_\eta(\hat{x})\}$ is bounded (and clearly closed), and hence the continuous function $f_\eta(x)$ achieves its minimum at some point $x^*$ inside this region, by the Extreme Value theorem. (See Lemma 7.2.1 of Matoušek and Gärtner, say.)
For the second step, we use the functions $f_\eta(x)$ and $g_i(x) = a_i^\intercal x - b_i$ in Theorem 20.2 to get the existence of $y^* \in \mathbb{R}^m$ such that:
$$\nabla f_\eta(x^*) = \sum_{i=1}^{m} y_i^* \cdot \nabla(a_i^\intercal x^* - b_i) \iff c - \eta\cdot\big(1/x_1^*, \cdots, 1/x_n^*\big)^\intercal = \sum_{i=1}^{m} y_i^*\, a_i.$$
By weak duality, the optimal value of the linear program lies be-
tween the values of any feasible primal and dual solution, so the
duality gap c⊺ x − b⊺ y bounds the suboptimality c⊺ x − OPT of our
current solution. Lemma 20.1 allows us to relate the duality gap to η
as follows.
c⊺ x − b⊺ y = c⊺ x − ( Ax )⊺ y = x⊺ c − x⊺ (c − s) = x⊺ s = n · η.
$$\begin{aligned}
Ax &= b &\quad& (20.4)\\
A^\intercal y + s &= c && (20.5)\\
\sum_{i=1}^{n} \big(s_i x_i - \eta_t\big)^2 &\le (\eta_t/4)^2. && (20.6)
\end{aligned}$$
The first two are again feasibility conditions for ( P) and ( D ′ ). The
third condition is new, and is an approximate version of (20.3). Sup-
pose that
$$\eta' := \eta \cdot \Big(1 - \frac{1}{4\sqrt{n}}\Big).$$
$$\begin{aligned}
A(\Delta x) &= 0 \\
A^\intercal(\Delta y) + (\Delta s) &= 0 \\
s_i(\Delta x_i) + (\Delta s_i)x_i + (\Delta s_i)(\Delta x_i) &= \eta' - x_i s_i.
\end{aligned}$$
Note the quadratic term $(\Delta s_i)(\Delta x_i)$. Since we are aiming for an approximation anyways, and these increments are meant to be tiny, we drop the quadratic term to get a system of linear equations in these increments:
$$\begin{aligned}
A(\Delta x) &= 0 \\
A^\intercal(\Delta y) + (\Delta s) &= 0 \\
s_i(\Delta x_i) + (\Delta s_i)x_i &= \eta' - x_i s_i.
\end{aligned}$$
putting down just so that you recognize it the next time you see it):
$$\begin{pmatrix} A & 0 & 0 \\ 0 & A^\intercal & I \\ \operatorname{diag}(s) & 0 & \operatorname{diag}(x) \end{pmatrix}
\begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} =
\begin{pmatrix} 0 \\ 0 \\ \eta'\mathbf{1} - x \circ s \end{pmatrix}.$$
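For concreteness, here is a minimal sketch (our illustration, not the notes' implementation) of one increment: assemble the block system above, solve it, and advance the strictly feasible iterate.

```python
import numpy as np

def ipm_step(A, x, y, s, eta_new):
    """Solve the linearized system for (dx, dy, ds) and take the full step.
    Assumes Ax = b, A^T y + s = c, and x, s > 0 entrywise."""
    m, n = A.shape
    K = np.zeros((m + 2 * n, m + 2 * n))
    K[:m, :n] = A                        # A dx = 0
    K[m:m + n, n:n + m] = A.T            # A^T dy + ds = 0
    K[m:m + n, n + m:] = np.eye(n)
    K[m + n:, :n] = np.diag(s)           # s_i dx_i + x_i ds_i = eta' - x_i s_i
    K[m + n:, n + m:] = np.diag(x)
    rhs = np.concatenate([np.zeros(m + n), eta_new - x * s])
    sol = np.linalg.solve(K, rhs)
    dx, dy, ds = sol[:n], sol[n:n + m], sol[n + m:]
    return x + dx, y + dy, s + ds
```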
Proof. The last set of equalities in the linear system ensures that $s_i(\Delta x_i) + (\Delta s_i)x_i = \eta' - x_i s_i$ for each $i$, so we get
$$\begin{aligned}
\langle x', s' \rangle &= \langle x + \Delta x,\ s + \Delta s \rangle \\
&= \sum_i \big(s_i x_i + s_i(\Delta x_i) + (\Delta s_i)x_i\big) + \langle \Delta x, \Delta s \rangle \\
&= n\eta' + \langle \Delta x, -A^\intercal(\Delta y) \rangle \\
&= n\eta' - \langle A(\Delta x), \Delta y \rangle \\
&= n\eta',
\end{aligned}$$
Proof. As in the proof of Lemma 20.3, we get that $s_i' x_i' - \eta' = (\Delta s_i)(\Delta x_i)$, so it suffices to show that
$$\sqrt{\sum_{i=1}^{n} (\Delta s_i)^2 (\Delta x_i)^2} \le \eta'/4.$$
where we set $a_i^2 = \frac{x_i(\Delta s_i)^2}{s_i}$ and $b_i^2 = \frac{s_i(\Delta x_i)^2}{x_i}$, and use that $a_i b_i \le \frac{1}{4}(a_i + b_i)^2$. Hence
$$\begin{aligned}
\sqrt{\sum_{i=1}^{n} (\Delta s_i \Delta x_i)^2} &\le \frac{1}{4}\sum_{i=1}^{n}\Big(\frac{x_i}{s_i}(\Delta s_i)^2 + \frac{s_i}{x_i}(\Delta x_i)^2 + 2(\Delta s_i)(\Delta x_i)\Big) \\
&= \frac{1}{4}\sum_{i=1}^{n} \frac{(x_i \Delta s_i)^2 + (s_i \Delta x_i)^2}{s_i x_i} \qquad [\text{since } (\Delta s)^\intercal \Delta x = 0 \text{ by Claim 20.3}] \\
&\le \frac{1}{4}\,\frac{\sum_{i=1}^{n} (x_i \Delta s_i + s_i \Delta x_i)^2}{\min_{i \in [n]} s_i x_i} \\
&= \frac{1}{4}\,\frac{\sum_{i=1}^{n} (\eta' - s_i x_i)^2}{\min_{i \in [n]} s_i x_i}.
\end{aligned} \tag{20.8}$$
$$\sum_{i=1}^{n} (\eta' - s_i x_i)^2 = \sum_{i=1}^{n} \big((1-\delta)\eta - s_i x_i\big)^2 = \sum_{i=1}^{n} (\eta - s_i x_i)^2 + \sum_{i=1}^{n} (\delta\eta)^2 - 2\delta\eta \sum_{i=1}^{n} (\eta - s_i x_i).$$
The last sum vanishes, since $\sum_i s_i x_i = n\eta$ by Lemma 20.3. Thus
$$\sum_{i=1}^{n} (\eta' - s_i x_i)^2 \le (\eta/4)^2 + n\Big(\frac{\eta}{4\sqrt{n}}\Big)^2 = \eta^2/8.$$
which are analogs of Lemmas 20.3 and 20.4 respectively. The latter
inequality means that |si′′ xi′′ − η ′′ | ≤ η ′′ /4 for each coordinate i, else
that coordinate itself would violate inequality (20.9). Specifically,
this means that neither xi′′ nor si′′ ever becomes zero for any value of
α ∈ [0, 1]. Now since ( xi′′ , si′′ ) is a linear interpolation between ( xi , si )
and ( xi′ , si′ ), and the former were strictly positive, the latter cannot be
non-positive.
20.3.2 An Example
Given an $n$-bit integer $a \in \mathbb{Z}$, suppose we want to compute its reciprocal $1/a$ without using divisions. This reciprocal is a zero of the expression
$$g(x) = 1/x - a.$$
Hence, the Newton-Raphson method says, we can start with $x_1 = 1$, say, and then use (20.10) to get
$$x_{t+1} \leftarrow x_t - \frac{(1/x_t - a)}{(-1/x_t^2)} = x_t + x_t(1 - a x_t) = 2x_t - a x_t^2.$$
If we define $\varepsilon_t := 1 - a x_t$, then $\varepsilon_{t+1} = 1 - a(2x_t - a x_t^2) = (1 - a x_t)^2 = \varepsilon_t^2$, so the error squares at each step.

(This method for computing reciprocals appears in the classic book of Aho, Hopcroft, Ullman, without any elaboration—it always mystified me until I realized the connection to the Newton-Raphson method. I guess they expected their readers to be familiar with these connections, since computer science used to have closer connections to numerical analysis in the 1970s.)
$$x_{t+1} \leftarrow x_t - \frac{f'(x_t)}{f''(x_t)}. \tag{20.11}$$
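A quick sketch of the division-free reciprocal iteration from above (illustrative only); note the starting point must satisfy $|1 - a x_0| < 1$ for the error-squaring to take hold:

```python
def reciprocal(a, x0, iters=10):
    """Newton-Raphson for g(x) = 1/x - a: the update x <- 2x - a*x^2
    squares the error eps_t = 1 - a*x_t at every step."""
    x = x0
    for _ in range(iters):
        x = 2 * x - a * x * x
    return x

print(reciprocal(7.0, x0=0.1))  # converges to 1/7 = 0.142857...
```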
20.4 Self-Concordance
defined to be the worst-case ratio between the cost of the algorithm's solution and that of the optimal solution (so that $\mathrm{Alg} \le \rho \cdot \mathrm{Opt}$ on every instance):
$$\rho = \rho_{\mathcal{A}} := \max_{I} \frac{c(\mathrm{Alg}(I))}{c(\mathrm{Opt}(I))},$$
However, there are problems that do not fall into any of these clean categories, such as Asymmetric $k$-Center, for which there exists an $O(\log^* n)$-approximation algorithm, and this is best possible unless P = NP. Or Group Steiner Tree, where the approximation ratio is $O(\log^2 n)$ on trees, and this is also best possible.
This shows that $\mathrm{Alg}(I) \le \alpha\, \mathrm{Opt}(I)$, which leaves us with the question of how to construct the surrogate. Sometimes we use the combinatorial properties of the problem to get a surrogate, and at other times we use a linear programming relaxation.
The greedy algorithm does not achieve a better ratio than Ω(log n):
one example is given by the figure to the right. The optimal sets are
the two rows, whereas the greedy algorithm may break ties poorly
and pick the set covering the left half, and then half the remainder,
etc. A more sophisticated example can show a matching gap of ln n.
(Figure 24.2: a tight example for the greedy algorithm.)

Proof of Theorem 24.1. Suppose Opt picks $k$ sets from $\mathcal{S}$. Let $n_i$ be the number of elements yet uncovered when the algorithm has picked $i$ sets; then $n_0 = n = |U|$. Since the $k$ sets in Opt cover all the elements of $U$, they also cover the $n_i$ yet-uncovered elements. By averaging, there must exist a set in $\mathcal{S}$ that covers $n_i/k$ of the yet-uncovered elements. Hence
$$n_{i+1} \le n_i - n_i/k = n_i(1 - 1/k).$$
(As always, we use $1 + x \le e^x$; here the inequality is strict whenever $x \ne 0$.)
Moreover, for the weighted case, the greedy algorithm changes to picking the set $S \in \mathcal{S}$ that maximizes the number of yet-uncovered elements covered per unit cost. (Here $H_B := 1 + \frac{1}{2} + \cdots + \frac{1}{B}$ is the $B$th Harmonic number.)
One can give an analysis somewhat like the one above for this weighted case as well: let $k$ now be the total cost of sets in the optimal set cover. After $i$ sets have been picked, the remaining $n_i$ elements can still be covered using a collection of cost $k$, so there must be a set whose cost-to-fresh-coverage ratio is at most $k/n_i$. If it covers $n_i - n_{i+1}$ previously uncovered elements, then we know that its cost must be at most
$$(n_i - n_{i+1}) \cdot k/n_i.$$
Summing over the $\ell$ iterations, the total cost is at most
$$\sum_{i=1}^{\ell} (n_i - n_{i+1}) \cdot k/n_i \le k\big(1/n + 1/(n-1) + \cdots + 1/2 + 1\big) = k \cdot H_n.$$
The second algorithm for Set Cover uses the popular relax-and-
round framework. The steps of this process are as follows:
1. Write an integer linear program for the problem. This will also be
NP-hard to solve, naturally.
2. Relax the integrality constraints (here $x_S \in \{0,1\}$) to get a linear program; this can be solved efficiently.

3. Now solve the linear program, and round the fractional variables to integer values, while ensuring that the cost of this integer solution is not much higher than the LP value.
Let’s see this in action: here is the integer linear program (ILP) that
precisely models Set Cover:
$$\begin{aligned}
\min \quad & \sum_{S \in \mathcal{S}} c_S x_S &\quad& \text{(ILP-SC)} \\
\text{s.t.} \quad & \sum_{S: e \in S} x_S \ge 1 && \forall e \in U \\
& x_S \in \{0, 1\} && \forall S \in \mathcal{S}.
\end{aligned}$$
$$\begin{aligned}
\min \quad & \sum_{S \in \mathcal{S}} c_S x_S &\quad& \text{(LP-SC)} \\
\text{s.t.} \quad & \sum_{S: e \in S} x_S \ge 1 && \forall e \in U \\
& x_S \ge 0 && \forall S \in \mathcal{S}.
\end{aligned}$$
If LP( I ) is the optimal value for the linear program, then we get:
LP( I ) ≤ Opt( I ).
2. Phase 2: For each element e yet uncovered, pick any set covering it.
1. First-Fit: add the item to the earliest opened bin where it fits.
Exercise: if all the items were of size at most ε, then each bin (ex-
cept the last one) would have at least 1 − ε total size, thereby giving
an approximation of
$$\frac{1}{1-\varepsilon}\,\mathrm{Opt}(I) + 1 \approx (1+\varepsilon)\,\mathrm{Opt}(I) + 1.$$
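A minimal sketch of First-Fit (our illustration; item sizes are assumed to lie in $(0, 1]$ and bins have capacity 1):

```python
def first_fit(sizes):
    """Place each item into the earliest opened bin where it fits."""
    bins = []  # bins[i] = total size currently packed into bin i
    for s in sizes:
        for i in range(len(bins)):
            if bins[i] + s <= 1.0:
                bins[i] += s
                break
        else:
            bins.append(s)  # nothing fits: open a new bin
    return len(bins)

print(first_fit([0.5, 0.7, 0.5, 0.2, 0.4, 0.2]))  # packs into 3 bins
```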
• Define the new size si ′ for each item i to be the size of the largest
element in i’s group.
There are $D$ distinct item sizes, and all sizes are only increased, so it remains to show a packing for the items in $I'$ that uses at most $\mathrm{Opt}(I) + \lceil n/D \rceil$ bins. Indeed, suppose $\mathrm{Opt}(I)$ assigns item $i$ to some bin $b$; then we assign item $(i + \lceil n/D \rceil)$ to bin $b$. Since the new size of item $i + \lceil n/D \rceil$ is at most the original size of item $i$, this allocates all the items except those in the first group, without violating the bin capacities. Now we assign each item in the first group to a new bin, thereby opening up $\lceil n/D \rceil$ more bins.
this result for the case where s1 ≥ ε.) Let C be the collection of all
configurations.
We now use an integer LP due to Paul Gilmore and Ralph Gomory
(from the 1950s). It has one variable xC for every configuration C ∈ C
that denotes the number of bins with configuration C in the solution.
The LP is:
$$\begin{aligned}
\min \quad & \sum_{C \in \mathcal{C}} x_C, \\
\text{s.t.} \quad & \sum_{C} A_{Cs}\, x_C \ge n_s \qquad \forall\ \text{sizes } s \\
& x_C \in \mathbb{N}.
\end{aligned}$$
Here ACs is the number of items of type s being placed in the config-
uration C, and $n_s$ is the total number of items of size $s$ in the instance.
This is an exact formulation, and relaxing the integrality constraint to
xC ≥ 0 gives us an LP that we can solve in time poly( N, n). This is
polynomial time when $N$ is a constant. We use the optimal value of this LP as our surrogate. (In fact, we show in a homework problem that the LP can be solved in time polynomial in $n$ even when $N$ is not a constant.)

How do we round the optimal solution for this LP? There are only
D non-trivial constraints in the LP, and N non-negativity constraints.
So if we pick an optimal vertex solution, it must have some N of
these constraints at equality. This means at least N − D of these tight
constraints come from the latter set, and therefore N − D variables
are set to zero. In other words, at most D of the variables are non-
zero. Rounding these variables up to the closest integer, we get a
solution that uses at most LP( I ) + D ≤ Opt( I ) + D bins. Since D is a
constant, we have approximated the solution up to a constant.
Opt( I ) + ⌈n/D ⌉ + D
bins. Now if we could ensure that n/D were at most ε Opt( I ), when
D was f (ε), we would be done. Indeed, if all the items have size at
least ε, the total volume (and therefore Opt( I )) is at least εn. If we
now set D = 1/ε2 , then n/D ≤ ε2 n ≤ ε Opt( I ), and the number of
bins is at most
$$(1+\varepsilon)\,\mathrm{Opt}(I) + \lceil 1/\varepsilon^2 \rceil.$$
What if some of the items are smaller than ε? We now use the
observation that First-Fit behaves very well when the item sizes
are small. Indeed, we first hold back all the items smaller than ε,
and solve the remaining instance as above. Then we add in the small
items using First-Fit: if it does not open any new bins, we are
fine. And if adding these small items results in opening some new
bin, then each of the existing bins—and all the newly opened bins
(except the last one)—must have at least (1 − ε) total size in them.
The number of bins is then at most
$$\frac{1}{1-\varepsilon}\,\mathrm{Opt}(I) + 1 \approx (1 + O(\varepsilon))\,\mathrm{Opt}(I) + 1,$$
as long as ε ≤ 1/2.
Just like the use of linear programming was a major advance in the design of approximation algorithms, specifically in the use of linear programs in the relax-and-round framework, another significant advance was the use of semidefinite programs in the same framework. For instance, the approximation guarantee for the Max-Cut problem was improved from 1/2 to 0.878 using this technique. Moreover, subsequent results have shown that any improvement to this approximation guarantee in polynomial time would disprove the Unique Games Conjecture.
a. x⊺ Ax ≥ 0 for all x ∈ Rn .
Lemma 25.2. Let $A \succeq 0$. If $A_{i,i} = 0$ then $A_{j,i} = A_{i,j} = 0$ for all $j$. (We will write $A \succeq 0$ to denote that $A$ is PSD; more generally, we write $A \succeq B$ if $A - B$ is PSD: this partial order on symmetric matrices is called the Löwner order.)

Proof. Let $j \ne i$. The determinant of the submatrix indexed by $\{i, j\}$ is
$$A_{i,i} A_{j,j} - A_{i,j} A_{j,i} = -A_{i,j}^2,$$
which is non-negative since every principal minor of a PSD matrix is non-negative; hence $A_{i,j} = 0$.
We can think of this as being the usual vector inner product treating $A$ and $B$ as vectors of length $n \times n$. Note that by the cyclic property of the trace, $A \bullet xx^\intercal = \operatorname{Tr}(A xx^\intercal) = \operatorname{Tr}(x^\intercal A x) = x^\intercal A x$; we will use this fact to derive yet another characterization of PSD matrices.
$$A \bullet X = \sum_i \lambda_i\, (A \bullet x_i x_i^\intercal) = \sum_i \lambda_i\, x_i^\intercal A x_i \ge 0.$$
$$\begin{aligned}
\max_{v_1, \dots, v_n \in \mathbb{R}^n} \quad & \sum_{i,j} c_{ij}\, \langle v_i, v_j \rangle \\
\text{subject to} \quad & \sum_{i,j} a_{ij}^{(k)}\, \langle v_i, v_j \rangle \le b_k \qquad \forall k \in [m].
\end{aligned}$$
$$\begin{aligned}
\max_{X \in \mathbb{R}^{n \times n}} \quad & A \bullet X \\
\text{subject to} \quad & I \bullet X = 1 \tag{25.1}\\
& X \succeq 0
\end{aligned}$$
Proof. Let X maximize SDP (25.1) (this exists as the objective is con-
tinuous and the feasible set is compact). Consider the spectral de-
composition X = ∑in=1 λi xi xi⊺ where λi ≥ 0 and ∥ xi ∥2 = 1. The
trace constraint I • X = 1 implies ∑i λi = 1. Thus the objective value
A • X = ∑i λi xi⊺ Axi is a convex combination of xi⊺ Axi . Hence without
loss of generality, we can put all the weight into one of these terms,
in which case X = yy⊺ is a rank-one matrix with ∥y∥2 = 1. By the
Courant-Fischer theorem, OPT ≤ max∥y∥2 =1 y⊺ Ay = λmax .
Here is another SDP for the same problem:
$$\begin{aligned}
\min \quad & t \\
\text{subject to} \quad & tI - A \succeq 0.
\end{aligned} \tag{25.2}$$
(In fact, it turns out that this SDP is dual to the one in (25.1). Weak duality still holds for this case, but strong duality does not hold in general for SDPs. Indeed, there could be a duality gap in some cases, where both the primal and dual are finite, but the optimal solutions are not equal to each other. However, under some mild regularity conditions (e.g., the Slater conditions) we can show strong duality. More about SDP duality here.)

Lemma 25.7. SDP (25.2) computes the maximum eigenvalue of $A$.

Proof. The matrix $tI - A$ has eigenvalues $t - \lambda_i$. Hence the constraint $tI - A \succeq 0$ is equivalent to the constraint $t - \lambda \ge 0$ for all its eigenvalues $\lambda$. In other words, $t \ge \lambda_{\max}$, and thus $\mathrm{OPT} = \lambda_{\max}$.
This result shows two things: (a) every graph has a bipartition that cuts half the edges of the graph, so $\mathrm{Opt} \ge |E|/2$; and (b) since $\mathrm{Opt} \le |E|$ on any graph, $\mathrm{Alg} \ge |E|/2 \ge \mathrm{Opt}/2$.

Here's a simple randomized algorithm: place each vertex in either $S$ or $\bar{S}$ independently and uniformly at random. Since each edge is cut with probability 1/2, the expected number of cut edges is $|E|/2$. Moreover, by the probabilistic method, $\mathrm{Opt} \ge |E|/2$. (We cannot hope to prove a better result than Lemma 25.9 in terms of $|E|$, since the complete graph $K_n$ has $\binom{n}{2} \approx n^2/2$ edges and any partition can cut at most $n^2/4$ of them.)
$$\begin{aligned}
\max_{x_1, \dots, x_n \in \mathbb{R}} \quad & \sum_{(i,j) \in E} \frac{(x_i - x_j)^2}{4} \\
\text{subject to} \quad & x_i^2 = 1 \qquad \forall i.
\end{aligned} \tag{25.3}$$
$$\begin{aligned}
\max_{v_1, \dots, v_n \in \mathbb{R}^n} \quad & \sum_{(i,j) \in E} \frac{1 - \langle v_i, v_j \rangle}{2} \\
\text{subject to} \quad & \langle v_i, v_i \rangle = 1 \qquad \forall i.
\end{aligned} \tag{25.5}$$
(Figure 25.1: a geometric picture of Goemans-Williamson randomized rounding.)

Proof. By linearity of expectation, it suffices to bound the probability of an edge $(i,j)$ being cut. Let
$$\theta_{ij} := \cos^{-1}(\langle v_i, v_j \rangle)$$
be the angle between the unit vectors $v_i$ and $v_j$. Now consider the 2-dimensional plane $P$ containing $v_i$, $v_j$ and the origin, and let $\tilde{g}$ be the projection of the Gaussian vector $g$ onto this plane. Observe that the edge $(i,j)$ is cut precisely when the hyperplane defined by $g$ separates $v_i, v_j$. This is precisely when the vector perpendicular to $\tilde{g}$ in the plane $P$ lands between $v_i$ and $v_j$. As the projection of the standard Gaussian onto a subspace is again a standard Gaussian (by spherical symmetry),
$$\Pr[(i,j) \text{ cut}] = \frac{2\theta_{ij}}{2\pi} = \frac{\theta_{ij}}{\pi}.$$
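A sketch of this rounding step (our illustration; `V` is assumed to be an $n \times d$ matrix whose rows are the unit vectors $v_i$, e.g., obtained from a Cholesky factorization of the SDP solution):

```python
import numpy as np

def gw_round(V, edges):
    """Goemans-Williamson rounding: split vertices by the sign of their
    projection onto one random Gaussian direction; return #edges cut."""
    g = np.random.randn(V.shape[1])   # random hyperplane normal
    side = np.sign(V @ g)             # which side each v_i falls on
    return sum(1 for (i, j) in edges if side[i] != side[j])
```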
Proof. Pick any vertex v, recursively color the remaining graph, and
then assign v a color not among the colors of its ∆ neighbors.
$$\begin{aligned}
\text{find} \quad & v_1, \dots, v_n \in \mathbb{R}^n \\
\text{subject to} \quad & \langle v_i, v_j \rangle \le \lambda \qquad \forall (i,j) \in E \tag{25.6}\\
& \langle v_i, v_i \rangle = 1 \qquad \forall i \in V.
\end{aligned}$$
Why is this SDP relevant to our problem? The goal is to have vectors clustered together in groups, such that each cluster represents a color. Intuitively, we want the vectors of adjacent vertices to be far apart, so we want their inner product to be close to $-1$ (recall we are dealing with unit vectors, due to the last constraint), and the vectors of same-colored vertices to be close together.
Proof. Consider the vector placement shown in the figure to the right (Figure 25.3: three unit vectors at pairwise angle $120°$, the optimal distribution for a 3-colorable graph). If the graph is 3-colorable, we can assign all vertices with color 1 the red vector, all vertices with color 2 the blue vector, and all vertices with color 3 the green vector. Now for every edge $(i,j) \in E$, we have that
$$\langle v_i, v_j \rangle = \cos\frac{2\pi}{3} = -\frac{1}{2}.$$
At first sight, it may seem like we are done: if we solve the above SDP with $\lambda = -1/2$, don't all the vectors look like the figure above? No, that would only hold if all of them were co-planar. In $n$ dimensions we can have an exponential number of cones of angle $\frac{2\pi}{3}$, as in the next figure (Figure 25.4: the dimensionality problem of vectors at pairwise angle $2\pi/3$), so we cannot cluster vectors as easily as in the above example.

To solve this issue, we apply a hyperplane rounding technique similar to that from the MaxCut algorithm. Indeed, for some parameter $t$ we will pick later, pick $t$ random hyperplanes. Formally, we pick $g_i \in \mathbb{R}^n$ from a standard $n$-dimensional Gaussian distribution, for $i \in [t]$. Each of these defines a normal hyperplane, and together they split the unit sphere in $\mathbb{R}^n$ into $2^t$ regions (except if two of them point in the same direction, which has zero probability). Now, the vectors $v_i$ that lie in the same region can be considered "close" to each other, and we can try to assign them a unique color. Formally, this means that if $v_i$ and $v_j$ are such that
$$\operatorname{sign}(\langle v_i, g_k \rangle) = \operatorname{sign}(\langle v_j, g_k \rangle)$$
for all $k \in [t]$, then $i$ and $j$ are given the same color. Each region is given a different color, of course.
However, this may color some neighbors with the same color, so we use the method of alterations: while there exists an edge between vertices of the same color, we uncolor both endpoints. When this uncoloring stops, we remove the still-colored vertices from the graph, and then repeat the same procedure on the remaining graph, until we color every vertex. Note that since we use $t$ hyperplanes, we add at most $2^t$ new colors per iteration. The goal is now to show that (a) the number of iterations is small, and (b) the value of $2^t$ is also small.
Lemma 25.16. The expected number of vertices that remain uncolored after a single iteration is at most $n\Delta\, (1/3)^t$.

Proof. Fix an edge $ij$: for a single random hyperplane, the probability that $v_i, v_j$ are not separated by it is
$$\frac{\pi - \theta_{ij}}{\pi} \le \frac{1}{3},$$
so the probability that $v_i$ and $v_j$ receive the same color is at most $(1/3)^t$. There are $n$ vertices, and each vertex has degree at most $\Delta$, which proves the result.
Typically we do not know the future requests that the CPU will make, so it is sensible to model this as an online problem. (If the entire sequence of requests is known, show that Belády's rule is optimal: evict the page in cache that is next requested furthest in the future.) We let $U$ be a universe of $n$ items or pages. The cache is a memory containing at most $k$ pages. The requests are pages $\sigma_i \in U$, and the online algorithm is an eviction policy. Now we return to defining the performance of an online algorithm.
$$\max_{\sigma} \frac{\mathrm{Alg}(\sigma)}{\mathrm{Opt}(\sigma)}$$
$$\mathbb{E}[\mathrm{Alg}(\sigma)] \le \alpha \cdot \mathrm{Opt}(\sigma).$$
Lemma 26.1. The competitive ratio of algorithm AlgB is 2 − 1/B and this is
the best possible ratio for any deterministic algorithm.
Proof. There are two cases to consider: $j < B$ and $j \ge B$. For the
first case, AlgB ( Ij ) = j and Opt( Ij ) = j, so AlgB ( Ij )/ Opt( Ij ) =
1. In the second case, AlgB ( Ij ) = 2B − 1 and Opt( Ij ) = B, so
AlgB ( Ij )/ Opt( Ij ) = 2 − 1/B. Thus the competitive ratio of AlgB
is
$$\max_{I_j} \frac{\mathrm{Alg}_B(I_j)}{\mathrm{Opt}(I_j)} = 2 - \frac{1}{B}.$$
Now we show that this is the best possible competitive ratio for any deterministic algorithm. Consider algorithm $\mathrm{Alg}_i$: we find an instance $I_j$ such that $\mathrm{Alg}_i(I_j)/\mathrm{Opt}(I_j) \ge 2 - 1/B$. If $i \ge B$ then we take $j = B$, so that $\mathrm{Alg}_i(I_j) = i - 1 + B$ and $\mathrm{Opt}(I_j) = B$, giving
$$\frac{\mathrm{Alg}_i(I_j)}{\mathrm{Opt}(I_j)} = \frac{i - 1 + B}{B} = \frac{i}{B} + 1 - \frac{1}{B} \ge 2 - \frac{1}{B}.$$
Otherwise $i < B$, and we take $j = i$, so that $\mathrm{Alg}_i(I_j) = i - 1 + B$ and $\mathrm{Opt}(I_j) = i$, giving
$$\frac{\mathrm{Alg}_i(I_j)}{\mathrm{Opt}(I_j)} = \frac{i - 1 + B}{i} \ge 2 \ge 2 - \frac{1}{B}.$$
In either case,
$$\frac{\mathrm{Alg}_i(I_j)}{\mathrm{Opt}(I_j)} \ge 2 - \frac{1}{B}.$$
The table below shows $\mathrm{Alg}_i(I_j)$ over $\mathrm{Opt}(I_j)$ for $B = 4$:

       I1    I2    I3    I∞
Alg1  4/1   4/2   4/3   4/4
Alg2  1/1   5/2   5/3   5/4
Alg3  1/1   2/2   6/3   6/4
Alg4  1/1   2/2   3/3   7/4
$$\begin{aligned}
4p_1 + p_2 + p_3 + p_4 &\le c \\
\frac{4p_1 + 5p_2 + 2p_3 + 2p_4}{2} &\le c \\
\frac{4p_1 + 5p_2 + 6p_3 + 3p_4}{3} &\le c \\
\frac{4p_1 + 5p_2 + 6p_3 + 7p_4}{4} &\le c \\
p_1 + p_2 + p_3 + p_4 &= 1
\end{aligned}$$
$$\begin{aligned}
3p_1 &= c - 1 \\
4p_2 + p_3 + p_4 &= c \\
4p_3 + p_4 &= c \\
4p_4 &= c.
\end{aligned}$$
$$c = \frac{1}{1 - (1 - 1/4)^4} \qquad\text{and}\qquad p_i = (3/4)^{4-i}\,\frac{c}{4}.$$
$$c = c_B = \frac{1}{1 - (1 - 1/B)^B} \le \frac{e}{e - 1}.$$
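A quick numeric check of this formula (illustrative):

```python
def ski_rental_ratio(B):
    """Competitive ratio c_B = 1 / (1 - (1 - 1/B)^B) of the randomized strategy."""
    return 1.0 / (1.0 - (1.0 - 1.0 / B) ** B)

for B in [1, 2, 4, 10, 100]:
    print(B, ski_rental_ratio(B))  # increases toward e/(e-1) ~ 1.582
```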
$$f'(\ell) - f(\ell) = 0.$$
But this solves to $f(\ell) = Ce^{\ell}$ for some constant $C$. Since $f$ is a probability density function, $\int_{\ell=0}^{1} f(\ell)\, d\ell = 1$, we get $C = \frac{1}{e-1}$. Substituting into (26.1), we get that the competitive ratio is $c = \frac{e}{e-1}$, as desired.
Proof. We break up the proof into an upper bound on Alg’s cost and
a lower bound on Opt’s cost. Before doing this we set up some no-
tation. For the ith phase, let Si be the set of pages in the algorithm’s
cache at the beginning of the phase. Now define
$$\Delta_i = |S_{i+1} \setminus S_i|.$$
$$\sum_{s=0}^{k-1} \frac{c}{k-s} \le \Delta_i \sum_{s=0}^{k-1} \frac{1}{k-s} = \Delta_i H_k.$$
$$\Delta_i H_k + \Delta_i = \Delta_i (H_k + 1).$$
Now we claim that $\mathrm{Opt} \ge \frac{1}{2}\sum_i \Delta_i$. Let $S_i^*$ be the pages in Opt's cache
at the beginning of phase i. Let ϕi be the number of pages in Si but
not in Opt’s cache at the beginning of phase i, i.e., ϕi = |Si \ Si∗ |.
Now let Opti be the cost that Opt incurs in phase i. We have that
Opti ≥ ∆i − ϕi since this is the number of “clean” requests that Opt
sees. Moreover, consider the end of phase i. Alg has the k most recent
requests in cache, but Opt does not have ϕi+1 of them by definition of
ϕi+1 . Hence Opti ≥ ϕi+1 . Now by averaging,
$$\mathrm{Opt}_i \ge \max\{\phi_{i+1},\ \Delta_i - \phi_i\} \ge \frac{1}{2}\big(\phi_{i+1} + \Delta_i - \phi_i\big).$$
So summing over all phases we have
$$\mathrm{Opt} \ge \frac{1}{2}\Big(\sum_i \Delta_i + \phi_{final} - \phi_{initial}\Big) \ge \frac{1}{2}\sum_i \Delta_i,$$
since $\phi_{final} \ge 0$ and $\phi_{initial} = 0$. Combining the upper and lower
bound yields
$$\mathbb{E}[\mathrm{Alg}] \le 2(H_k + 1)\,\mathrm{Opt} = O(\log k)\,\mathrm{Opt}.$$
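For reference, here is a sketch of the randomized marking algorithm to which this phase-based analysis applies (our reading of the eviction policy being analyzed): on a miss, evict a uniformly random unmarked page; when every cached page is marked, start a new phase by clearing all marks.

```python
import random

def marking(requests, k):
    """A sketch of the randomized marking algorithm; returns #faults."""
    cache, marked = set(), set()
    faults = 0
    for p in requests:
        if p not in cache:
            faults += 1
            if len(cache) == k:              # cache full: must evict
                if not (cache - marked):     # all pages marked: new phase
                    marked = set()
                victim = random.choice(list(cache - marked))
                cache.remove(victim)
            cache.add(p)
        marked.add(p)                        # every requested page gets marked
    return faults
```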
It can also be shown that no randomized algorithm can do better
than Ω(log k )-competitive for the paging problem. For some intuition
as to why this might be true, consider the coupon collector problem: if
you repeatedly sample a uniformly random number from {1, . . . , k +
1} with replacement, show that the expected number of samples to
see all k + 1 coupons is Hk+1 .