CMU 15-850 (Advanced Algorithms), Fall 2020
https://www.cs.cmu.edu/~15850/
The style files (as well as the text on this page!) are mildly adapted
from the ones developed by Yufei Zhao (MIT), for his notes on Graph
Theory and Additive Combinatorics. As some of you may guess, the
LaTeX template used for these notes is called tufte-book.
Contents

I Discrete Algorithms

1 Minimum Spanning Trees
• In 1984, Michael Fredman and Bob Tarjan gave an O(m log∗ n)-time algorithm, based on their Fibonacci heaps data structure (Fredman and Tarjan 1987). Here log∗ is the iterated logarithm function, and denotes the number of times we must take logarithms before the argument becomes smaller than 1. The actual runtime is a bit more nuanced, which we will not bother with today.
• In 1995, David Karger, Phil Klein and Bob Tarjan finally got the holy grail of O(m) time (Karger, Klein, and Tarjan 1995)! . . . but it was a randomized algorithm, so the search for a deterministic linear-time algorithm continued.

• In 1998, Seth Pettie and Vijaya Ramachandran gave an optimal algorithm for computing minimum spanning trees—however, we don't know its runtime! More formally, they show that if there exists an algorithm which uses MST∗(m, n) comparisons to find MSTs on all graphs with m edges and n nodes, the Pettie-Ramachandran algorithm will run in time O(MST∗(m, n)). (This was part of Seth's Ph.D. thesis, and Vijaya was his advisor.)
Theorem 1.1 (Cut Rule). For any cut of the graph, the minimum-weight
edge that crosses the cut must be in the MST. This rule helps us determine
what to add to our MST.
Theorem 1.2 (Cycle Rule). For any cycle in G, the heaviest edge on that
cycle cannot be in the MST. This helps us determine what we can remove in
constructing the MST.
Proof. Let C be any cycle, and let e be the heaviest edge in C. For a contradiction, let T be an MST that contains e. Dropping e from T gives two components. Now there must be some edge e′ in C \ {e} that crosses between these two components, and hence T′ := (T − {e}) ∪ {e′} is a spanning tree. (Make sure you see why.) By the choice of e we have w(e′) < w(e), so T′ is a lower-weight spanning tree than T, a contradiction.
two vertices which are not currently in the same blue component. Figure 1.1 gives an example of how edges are added.
• union(elem1 , elem2 ), which merges the two sets that elem1 and
elem2 are in.
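To make these operations concrete, here is a minimal union-find sketch in Python (with the standard path-compression and union-by-size optimizations), plus a small Kruskal-style driver that uses it to add safe blue edges. This is an illustrative sketch, with names of our choosing, not the exact implementation from lecture.

```python
class UnionFind:
    """Union-find with path compression and union by size."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Walk up to the root, halving the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False          # already in the same blue component
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra      # hang the smaller tree under the larger
        self.size[ra] += self.size[rb]
        return True

def kruskal(n, edges):
    """edges: list of (weight, u, v) tuples. Returns (MST weight, MST edges)."""
    uf, mst, total = UnionFind(n), [], 0
    for w, u, v in sorted(edges):
        if uf.union(u, v):        # endpoints in different components: edge is safe
            mst.append((u, v))
            total += w
    return total, mst
```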
of our current tree T of blue edges to some vertex not yet in T, and color it blue—thereby adding this edge to T and increasing its size by one. Figure 1.2 shows an example of how edges are added.

Figure 1.2: Dashed lines are not yet in the MST. We started at the red node, and the blue nodes are also part of T right now.

We'll use a priority queue data structure which keeps track of the lightest edge connecting T to each vertex not yet in T. A priority queue data structure is equipped with (at least) three operations:

• insert(elem, key) inserts the given (element, key) pair into the queue,
• decreasekey(elem, newkey) lowers the key of elem to newkey, and
• extractmin() removes and returns the element with the smallest key.
Note that by using the standard binary heap data structure we can
get O(log n) worst-case time for each priority queue operation above.
To implement the Jarnik/Prim algorithm, we initially insert each vertex in V \ {r} into the priority queue with key ∞, and the root r with key 0. The key of a node v denotes the weight of the lightest edge from the current tree T to v (which is ∞ if there is no such edge).
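A minimal sketch of this implementation in Python: the standard `heapq` module has no decreasekey, so the sketch uses the usual lazy-deletion workaround (push a new pair and skip stale entries on extraction), which preserves the O(m log n) bound for binary heaps. The adjacency-list format and the function name are our assumptions.

```python
import heapq

def prim(adj, r=0):
    """adj[u] = list of (v, w) pairs for each edge {u, v} of weight w.
    Returns the total MST weight, starting from root r."""
    n = len(adj)
    in_tree = [False] * n
    heap = [(0, r)]                  # the root r enters with key 0
    total = added = 0
    while heap and added < n:
        key, v = heapq.heappop(heap)
        if in_tree[v]:
            continue                 # stale entry: v was added with a smaller key
        in_tree[v] = True
        total += key
        added += 1
        for u, w in adj[v]:
            if not in_tree[u]:
                heapq.heappush(heap, (w, u))   # lazy "decreasekey"
    return total
```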
1.2.4 A Slight Improvement on Jarnik/Prim

Figure 1.3: The red edges will be chosen and contracted in a single step, yielding the graph on the right, which we recurse on. Colors designate components.

We can actually easily improve the performance of Jarnik/Prim's
algorithm by using a more sophisticated data structure, namely by
using Fibonacci heaps instead of binary heaps to implement the
priority queue. Fibonacci heaps (invented by Fredman and Tarjan)
implement the insert and decreasekey operations in constant amor-
tized time, and extractmin in amortized O(log H ) time, where H is
the maximum number of elements in the heap during the execution.
Since we do n extractmins, and O(m + n) of the other two opera-
tions, and the maximum size of the heap is at most n, this gives us a
total cost of O(m + n log n).
Note that this is linear time on graphs with m = Ω(n log n) edges;
however, we’d like to get linear-time on all graphs. So the remaining
cases are the graphs with m = o (n log n) edges.
Figure 1.4: We begin at vertices A, H, R, and D (in that order) with K = 6. Although D begins as its own component, it stops when it joins with tree A. Dashed edges are not chosen in this step (though they may be chosen in the next recursive call), and colors denote trees.

Formally, in each round of the algorithm, all vertices start as unmarked.

1. Pick an arbitrary unmarked vertex and start Jarnik/Prim's algorithm from it, creating a tree T. Keep track of the lightest edge from T to each vertex in the neighborhood N(T) of T, where N(T) := {v ∈ V − T | ∃u ∈ T s.t. {u, v} ∈ E}. Note that N(T) may contain vertices that are marked.
Let's first note that the runtime of one round of the algorithm is O(m + n log K). Each edge is considered at most twice, once from each endpoint, giving us the O(m) term. Each time we grow the current tree in step 1, the number of connected components decreases by 1, so there are at most n such steps. Each step calls findmin on a heap of size at most K, which takes O(log K) time. Hence, at the end of this round, we've successfully identified a forest, each edge of which is part of the final MST, in O(m + n log K) time.
Let d_v be the degree of the vertex v in the graph we consider in this round. We claim that every marked vertex u belongs to a component C such that ∑_{v∈C} d_v ≥ K. Indeed, if u became marked because the neighborhood of its component had size at least K, then this is true. Otherwise, u became marked because it entered a component C of marked vertices. Since the vertices of C were marked, ∑_{v∈C} d_v ≥ K before u joined, and this sum only increased when u (and other vertices) joined. Thus, if C_1, . . . , C_l are the components at the end of this routine, we have
$$ 2m = \sum_v d_v = \sum_{i=1}^{l} \sum_{v \in C_i} d_v \ge \sum_{i=1}^{l} K = Kl. $$
Thus l ≤ 2m/K, i.e., this routine produces at most 2m/K trees.
The choice of K will change over the course of the algorithm. How should we set the thresholds K_i? Say we start round i with n_i nodes and m_i ≤ m edges. One clean way is to set
$$ K_i := 2^{2m/n_i}. $$
In turn, this means the number of trees, and hence the number of nodes n_{i+1} in the next round, is at most 2m_i/K_i ≤ 2m/K_i. Since n_{i+1} ≤ 2m/K_i, we get
$$ K_i \le \frac{2m}{n_{i+1}} = \lg K_{i+1} \implies K_{i+1} \ge 2^{K_i}. $$
Hence the threshold value exponentiates in each step (the threshold increases "tetrationally"), so after log∗ n rounds, the value of K would be at least n, and we would be done.
The next facts follow from the definition:

Figure 1.5: (Fix this figure, make it interesting.) Every edge in F is F-light, as are the edges on the left, and also those going between the components. The edge on the right is F-heavy.

Fact 1.4. Edge e is F-light ⇐⇒ e ∈ MST(F ∪ {e}).

Fact 1.5 (Completeness). If T is an MST of G then edge e ∈ E(G) is T-light if and only if e ∈ T.
Fact 1.6 (Soundness). For any forest F, the F-light edges contain the MST of the underlying graph G. In other words, any F-heavy edge is also heavy with respect to the MST of the entire graph.

This suggests a clear strategy: pick a forest F from the current edges, and discard all the F-heavy edges. Hopefully the number of edges remaining is small. By Fact 1.6 these edges contain the MST of G, so repeat the process on them. To make this idea work, we want a forest F with many F-heavy edges. The catch is that a forest has many heavy edges only if it has small weight, i.e., if there are many off-forest edges forming cycles on which they are the heaviest edges.
Proof. This follows from Fact 1.6, that discarding heavy edges of any forest F in a graph does not change the MST. Indeed, the MST on G2 is the same as the MST on G′, since the discarded F1-heavy edges cannot be in MST(G′) because of Fact 1.6. Adding back the edges picked by Borůvka's algorithm in Step 1 gives the MST on G, by the cut rule.

Theorem 1.11. The KKT algorithm, run on a graph with m edges and n vertices, terminates in expected time O(m + n).
$$ T_{m,n} := \max_{G=(V,E),\, |V|=n,\, |E|=m} \{ T_G \}. $$
$$ T_G \le cm + \mathbb{E}[c(2m_1 + n')] + \mathbb{E}[c(2m_2 + n')] \le c(m + m' + 6n') \le c(2m + n). $$
Proof of Claim 1.10. For the sake of the proof, we can use any correct algorithm to compute F1, so let us use Kruskal's algorithm. Moreover, let's run a lazy version as follows: first sort all the edges in E′, and not just those in E1 ⊆ E′, and consider them in increasing order
That's it. The algorithm and proof are both short and slick and beautiful: this result is a real gem. I think it's an algorithm from The Book. (Paul Erdős claimed that God has "The Book", which contains the most elegant proof of each mathematical theorem.) The one slight annoyance with the algorithm is the relative complexity of the MST verification algorithm, which we use to find the F1-light edges in linear time. Nonetheless, these verification algorithms also contain many nice ideas, which we now discuss. (The current verification algorithms are deterministic; can we use randomness to simplify these as well?)
$$ A_e := (a_1, a_2, \cdots, a_k), $$
and all leaves belong to the same level. There are n leaves (corresponding to the nodes in T), and at most 2n − 1 nodes in T′. Also show how to construct T′ in linear time.
Exercise 1.15. For nodes u, v in a tree T, let maxwt_T(u, v) be the maximum weight of an edge on the (unique) path between u, v in the tree T. Show that for all u, v ∈ V, maxwt_T(u, v) = maxwt_{T′}(u, v).
 i\j   1   2   3          4             ...   n
  1    2   4   6          8             ...   2n
  2    2   4   8          16            ...   2^n
  3    2   4   2^{2^2}    2^{2^{2^2}}   ...   2^{2^{···^2}} (a tower of height n)
  4    2   4   65536      !!!           ...   huge!
1.7 Matroids
2.1 Arborescences
are non-negative. (If there are negative arc weights, add a large positive constant M to every weight; this increases the total weight of each arborescence by M(n − 1), and hence the identity of the minimum-weight one remains unchanged.) Because no outgoing arcs from r will be part of any arborescence, we can assume no such arcs exist in G either. For brevity, we fix r and simply say arborescence when we mean r-arborescence.
Proof. Each arborescence has exactly one arc leaving each vertex.
Decreasing the weight of every arc exiting v by MG (v) decreases the
weight of every possible arborescence by MG (v) as well. Thus, the set
of min-weight arborescences remains unchanged.
Now each vertex has at least one 0-weight arc leaving it. For each vertex, pick an arbitrary 0-weight arc out of it. If this choice is
The proof also gives an algorithm for finding the min-weight arborescence on G′ by contracting the cycle C (in linear time), recursing on G′′, and then "lifting" the solution T′′ back to a solution T′. Since we recurse on a graph which has at least one fewer node, there are at most n recursive calls. Moreover, the weight-reduction, contraction, and lifting steps in each recursive call take O(m) time, so the runtime of the algorithm is O(mn).

Figure 2.4: Contracting the two white nodes down to a cycle, and removing arc b.
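A compact recursive sketch of this contract-and-recurse scheme in Python, following the convention above that every non-root vertex keeps exactly one outgoing arc and all paths lead to r. For brevity it returns only the weight of the min-weight arborescence; recovering the arcs requires the lifting bookkeeping described above. All names are ours, and we assume every non-root vertex has at least one outgoing arc.

```python
def min_arborescence_weight(n, arcs, r):
    """arcs: list of (u, v, w), an arc u -> v of weight w; vertices are 0..n-1.
    Assumes every non-root vertex has at least one outgoing arc."""
    if n <= 1:
        return 0
    # Weight reduction: subtract M_G(u) from every arc leaving each u != r.
    INF = float('inf')
    m_out = [INF] * n
    for u, v, w in arcs:
        if u != r:
            m_out[u] = min(m_out[u], w)
    base = sum(m_out[u] for u in range(n) if u != r)   # added back at the end
    red = [(u, v, w - m_out[u]) for u, v, w in arcs if u != r]
    # Pick an arbitrary zero-weight arc out of each non-root vertex.
    choice = {}
    for u, v, w in red:
        if w == 0 and u not in choice:
            choice[u] = v
    # Follow the chosen arcs to look for a cycle among them.
    color = [0] * n        # 0 = unseen, 1 = on current path, 2 = finished
    color[r] = 2
    cycle = None
    for s in range(n):
        if color[s] != 0:
            continue
        path, u = [], s
        while color[u] == 0:
            color[u] = 1
            path.append(u)
            u = choice[u]
        if color[u] == 1:              # walked back into the current path
            cycle = path[path.index(u):]
            break
        for x in path:
            color[x] = 2
    if cycle is None:
        return base                    # chosen arcs already form an arborescence
    # Contract the cycle into a single super-node and recurse.
    cyc, comp, nxt = set(cycle), {}, 0
    for v in range(n):
        if v not in cyc:
            comp[v], nxt = nxt, nxt + 1
    for v in cyc:
        comp[v] = nxt                  # all cycle vertices map to one new node
    new_arcs = [(comp[u], comp[v], w) for u, v, w in red if comp[u] != comp[v]]
    return base + min_arborescence_weight(nxt + 1, new_arcs, comp[r])
```

Each level of recursion does O(m) work and loses at least one node, which is where the O(mn) bound comes from.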
Remark 2.8. This is not the best known run-time bound: there are many optimizations possible. Tarjan presents an implementation
The dual linear program has a single variable y_i for each constraint in the original (primal) linear program. This variable can be thought of as giving an importance weight to the constraint, so that taking a linear combination of constraints with these weights shows that the primal value c⊺x cannot possibly drop below a certain value. This purpose is exemplified by the following theorem.
Proof. c⊺ x ≥ ( A⊺ y)⊺ x = y⊺ Ax ≥ y⊺ b = b⊺ y.
$$
\begin{aligned}
\text{minimize} \quad & \sum_{a \in A} w(a)\, x_a \\
\text{subject to} \quad & \sum_{a \in \partial^+ S} x_a \ge 1 \quad \forall S \subseteq V - \{r\} \\
& \sum_{a \in \partial^+ v} x_a = 1 \quad \forall v \ne r \\
& x_a \in \{0, 1\} \quad \forall a \in A. \tag{2.1}
\end{aligned}
$$
$$
\begin{aligned}
\text{minimize} \quad & \sum_{a \in A} w(a)\, x_a \\
\text{subject to} \quad & \sum_{a \in \partial^+ S} x_a \ge 1 \quad \forall S \subseteq V - \{r\} \\
& \sum_{a \in \partial^+ v} x_a = 1 \quad \forall v \ne r \\
& x_a \ge 0 \quad \forall a \in A. \tag{2.2}
\end{aligned}
$$
Exercise 2.15. Suppose all the arc weights are non-negative. Show that the optimal solution to the linear program remains unchanged even if we drop the constraints ∑_{a∈∂⁺v} x_a = 1.
$$
\begin{aligned}
\text{maximize} \quad & \sum_{S \subseteq V - \{r\}} y_S \\
\text{subject to} \quad & \sum_{S:\, a \in \partial^+ S} y_S \le w(a) \quad \forall a \in A \\
& y_S \ge 0 \quad \forall S \subseteq V - \{r\}, |S| > 1. \tag{2.3}
\end{aligned}
$$
Lemma 2.16. If arc weights are non-negative, there exists a solution for the dual LP (2.3) such that w⊺x = 1⊺y, where all y_S values are non-negative.
• The base case is when the chosen zero-weight arcs out of each
node form an arborescence. In this case we can set yS = 0 for
all S; since all arc weights are non-negative, this is a feasible dual
solution. Moreover, both the primal and dual values are zero.
• Else, suppose the algorithm reduces the weights of all arcs leaving some vertex u by M := M_G(u); inductively, let y′ be a feasible dual for the reduced-weight graph, and define y_{u} := y′_{u} + M, keeping all other values of y′. For any arc a leaving u,
$$
\sum_{S:\, a \in \partial^+ S} y_S = \sum_{S:\, a \in \partial^+ S, |S|=1} y_S + \sum_{S:\, a \in \partial^+ S, |S| \ge 2} y_S
= (y'_{\{u\}} + M) + \sum_{S:\, a \in \partial^+ S, |S| \ge 2} y'_S
\le M + w'(a) = M + (w(a) - M) = w(a).
$$
Moreover, the value of the dual increases by M, the same as the increase in the weight of the arborescence.
• Else, suppose the chosen zero-weight arcs contain a cycle C, which we contract down to a node v_C. Using induction for this new graph G′, let y′ be the feasible dual solution. For any subset S′ of nodes in G′ that contains the new node v_C, let S = (S′ \ {v_C}) ∪ C, and define y_S = y′_{S′}. For all other subsets S in G′ not containing v_C, define y_S = y′_S. Moreover, for all nodes v ∈ C, define y_{v} = 0. The dual value remains unchanged, as does the weight of the solution T obtained by lifting T′. The dual constraint changes only for arcs of the form a = (v, u), where v ∈ C and u ∉ C. But such an arc is replaced by an arc a′ = (v_C, u), whose weight is at most w(a). Hence
$$
\sum_{S:\, a \in \partial^+ S} y_S = y'_{\{v_C\}} + \sum_{S':\, a' \in \partial^+ S',\, S' \ne \{v_C\}} y'_{S'} \le w(a') \le w(a).
$$

Figure 2.5: An optimal dual solution: vertex sets are labeled with dual values, and arcs with costs.
Corollary 2.17. There exists a solution for the dual LP (2.3) such that
w⊺ x = 1⊺ y. Hence the algorithm produces an optimal arborescence even for
negative arc weights.
Proof. If some arc weights are negative, add M to all arc weights to
get the new graph G ′ where all arc weights are positive. Let y′ be the
optimal dual for G ′ from Lemma 2.16; define yS = y′S for all sets of
size at least two, and y{v} = y′{v} − M for singletons. Note that the
weight of the optimal solution on G is precisely M(n − 1) smaller
than on G ′ ; the same is true for the total dual value. Moreover, for arc
e = (u, v), we have
Karb ⊆ K.
In general, the two polytopes are not equal. But in this case, Corol-
lary 2.17 implies that for this particular setting, the two are indeed
equal. Indeed, a geometric hand-wavy argument is easy to make —
if K were strictly bigger than Karb , there would be some direction
This implies, for instance, that max{tu , tq } = Ω(log n). More details
to come.
3. Link (u, v) adds the edge uv to the trees containing u and v (as-
suming these trees are distinct), and
Iterating over the edges incident to one of these trees (say Tu) would take time ∑_{x∈Tu} deg(x). How can we do it faster? All the algorithms of this section address this replacement edge question in different ways.
queries on G and H are the same. The graph H has O(m) nodes and edges.
The mapping is simple: pick any vertex v in G with degree d ≥ 2, create a cycle v = v1, v2, . . . , vd in H, and connect each vertex on this cycle to a unique neighbor of v. A vertex of degree one is unchanged. (We don't need the cycle for nodes that have degree at most 3 in G, but it is easier to enforce a uniform rule.) Moreover, this mapping can be maintained dynamically: if an edge e = uv is inserted into G, we may have to increase the size of the cycles for both endpoints (or create a cycle, in case the degree has gone from 1 to 2), and then add an edge. This requires a constant number of InsertNode and InsertEdge operations in H (and maybe one DeleteEdge as well), so the number of updates blows up by only a constant factor. Deleting an edge in G is the exact inverse of this process.
We need to remember this mapping between elements of G and H: this can be done using hash-tables, and we omit the details here.
3.2.4 Clustering a Tree
The advantage of a sub-cubic graph is that any spanning forest F also has maximum degree 3. This allows us to use the following elementary clustering algorithm:
Lemma 3.1 (Tree Clustering). Given a positive integer z, and any tree
T with at least z nodes and maximum degree 3, we can partition its vertex
set into clusters such that each cluster (a) induces a connected subtree, and
(b) contains between z and 3z nodes.
Proof. Root the tree T at some leaf node. Since the maximum degree
is 3, each node has at most two children. Find some node u such
that the subtree Tu rooted at u has at least z nodes, but each of its
children has strictly less than z nodes in its subtrees. Since there
are at most two children, the total number of nodes in Tu is at most
2(z − 1) + 1 ≤ 2z, the last +1 to account for u itself. Put all the
nodes in Tu in a cluster, and remove them from T, and repeat. If
this process leaves T with fewer than z nodes at the end, add these
remaining nodes to one of its adjacent clusters, making its size at
most 3(z − 1) + 1 ≤ 3z.
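A sketch of this peeling procedure in Python (names ours): a postorder pass accumulates subtree sizes, cuts off a cluster as soon as the pending subtree reaches z nodes, and merges the final leftover into the last cluster, mirroring the proof.

```python
def cluster_tree(children, root, z):
    """children[v]: list of children of v in a tree rooted at a leaf, with
    maximum degree 3 (so at most two children per node). Returns clusters of
    size in [z, 3z], each inducing a connected subtree (assuming the tree
    has at least z nodes)."""
    clusters = []

    def dfs(v):
        # Returns the vertices of v's subtree not yet assigned to a cluster.
        pending = [v]
        for c in children[v]:
            pending += dfs(c)
        if len(pending) >= z:          # this is the node u from the proof
            clusters.append(pending)   # size <= 2(z - 1) + 1 <= 2z
            return []
        return pending

    leftover = dfs(root)               # fewer than z stragglers near the root
    if clusters:
        clusters[-1] += leftover       # merged cluster has size <= 3z
    else:
        clusters.append(leftover)      # tree had fewer than z nodes overall
    return clusters
```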
1. Insert (uv): We call Tree (u) and Tree (v) to find the clusters Cu
and Cv that contain them. Then we call Tree (Cu ) and Tree (Cv )
on the cluster forest to find their trees Tu and Tv in F. If u, v belong
to the same tree in F, update the crossing-edge information for the
$$ G = G_0 \supseteq G_1 \supseteq \cdots \supseteq G_i \supseteq \cdots, $$
Adding an edge is easy: set its level to zero, and add it to F0 if it does not create a cycle in F0. Deleting an edge not in F0 is also easy: just delete it. Finally, if an edge e ∈ F0 (say with level ℓ) is deleted, we need to search for a replacement. Such a replacement edge can only be at level ℓ or lower—this follows from property (⋆), and the Cut Rule from Theorem 1.1. Moreover, to maintain property (⋆), we should also add a replacement edge of the highest level possible. So we consider off-tree edges from level ℓ downwards.
Remember, when we scan an edge, we want to raise its level, so that we can charge to it. That could mess with property (⋆), because this may cause off-tree edges to have higher level/weight than tree edges. To avoid this problem, we first raise the levels of some of the tree edges. Specifically, we do the following steps:
Lemma 3.2. Each tree in forest Fℓ has at most ⌊n/2^ℓ⌋ nodes. Hence, the level ℓ of any edge is at most log₂ n.
Note: we consider all neighbors in the graph G, not just in the span-
ning forest.
It is easy and fast to maintain the fingerprint F (v) for each vertex
v as edges change. Now when we delete an edge uv (and if there is a
unique replacement edge f ), consider the following:
Fact 3.4. The label of the unique replacement edge is given by the exclusive-or of the fingerprints of all nodes in L. That is,
$$ \bigoplus_{v \in L} F(v) = \ell(f). $$
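The cancellation behind Fact 3.4 is easy to see in code. A toy sketch (the example graph and all names are ours): each edge gets a random bit-string label, F(v) is the XOR of the labels of edges incident to v, and XORing the fingerprints over L kills every edge with both endpoints inside L.

```python
import random
from functools import reduce

# Toy component L = {0, 1, 2}; the unique L-to-R edge is (2, 5).
lab = {e: random.getrandbits(32) for e in [(0, 1), (1, 2), (2, 5)]}

# F(v) = XOR of the labels of edges incident to v.
F = {0: lab[(0, 1)],
     1: lab[(0, 1)] ^ lab[(1, 2)],
     2: lab[(1, 2)] ^ lab[(2, 5)]}

# Edges internal to L contribute twice and cancel; the crossing edge survives.
xor_over_L = reduce(lambda a, b: a ^ b, (F[v] for v in [0, 1, 2]))
assert xor_over_L == lab[(2, 5)]
```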
of 2 between 1/2 and 1/n. There are log₂ n such values, and one of them will be the correct one, up to a factor of 2. Specifically, we keep O(log n) different data structures, each one using a different one of these O(log n) sampling rates; at least one of them (the "right" one) will succeed with probability $1 - \frac{O(\log n)}{\mathrm{poly}(n)}$, by a trivial union bound.
3.4.3 Wrapping Up
Many of the random experiments would have multiple L-to-R edges:
the answers for those would not make sense: they may give names
of non-edges, or of edges that do not cross between L and R. Hence,
the algorithm above needs a mechanism to check that the answer is
indeed an L-to-R edge. Details of this can be found in the Kapron,
King, and Mountjoy paper.
More worryingly, what about multiple deletions? We might be
tempted to say that if the input sequence is oblivious to the al-
gorithm’s randomness, we can just take a union bound over all
timesteps. However, the algorithm’s behavior—the structure of the
current spanning forest F, and hence which cut is queried during
later edge deletions—depends on the replacement edges found in
previous steps, which are correlated with the randomness. Hence,
we cannot claim independence for the calculations in (3.1). To han-
dle this, Kapron et al. construct a multi-level data structure; see their
paper for details.
Finally, we did not talk about any of the implementation details here. For instance, how can we compute the XOR of the set of nodes in L quickly? How do we check if the replacement edge names are valid? All these can be done using a dynamic tree data structure. Putting all this together, the update time becomes O(log⁵ n), still in the worst-case. As an aside, if we allow randomization and amortized bounds, the update times can be improved to within poly(log log n) factors of the Ω(log n) lower bound.
We first consider Dijkstra's algorithm for the case of non-negative edge-weights, and then give the Bellman-Ford algorithm that handles negative weights as well.
4.1.2 The Bellman-Ford Algorithm

Figure 4.1: Example with negative edge-weights: Dijkstra's algorithm gives a label of 4 for t, whereas the correct answer is 3.

Dijkstra's algorithm does not work on instances with negative edge weights; see the example on the right. For such instances, we want a correct SSSP algorithm to either return the distances from s to all other vertices, or else find a negative-weight cycle in the graph. The most well-known algorithm for this case is the Shimbel-Bellman-Ford algorithm.
Lemma 4.1. After i iterations of the algorithm, dist(v) equals the weight of
the shortest-path from s to v containing at most i edges. (This is defined to
be ∞ if there are no such paths.)
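For reference, a minimal Python sketch of the algorithm that Lemma 4.1 analyzes, with an early exit once distances stabilize and one extra relaxation pass to detect negative-weight cycles (the function name is ours):

```python
def bellman_ford(n, arcs, s):
    """arcs: list of (u, v, w). Returns distances from s, or None if a
    negative-weight cycle is reachable from s."""
    INF = float('inf')
    dist = [INF] * n
    dist[s] = 0
    for _ in range(n - 1):        # after i rounds: shortest path with <= i edges
        changed = False
        for u, v, w in arcs:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    for u, v, w in arcs:          # any improvement now implies a negative cycle
        if dist[u] + w < dist[v]:
            return None
    return dist
```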
Don Johnson gave an algorithm that does the edge re-weighting in a slightly cleverer way, using the idea of feasible potentials (Johnson 1977). Loosely, it runs the Bellman-Ford algorithm once, then uses the information gathered to do the re-weighting. At first glance, the concept of a feasible potential does not seem very useful. It is just an assignment of weights ϕ_v to each vertex v of the graph, with some conditions:
1. The new weights ŵ are all non-negative. This comes from the definition of the feasible potential.

2. Let P_ab be a path from a to b. Let ℓ(P_ab) be the length of P_ab when we use the weights w, and ℓ̂(P_ab) be its length when we use the weights ŵ. Then
$$ \hat\ell(P_{ab}) = \ell(P_{ab}) + \phi_a - \phi_b, $$
so the two lengths differ by an amount independent of the path itself.

4. If we set ϕ(s) = 0 for some vertex s, then ϕ(v) for any other vertex v is an underestimate of the s-to-v distance. This is because for all the paths from s to v we have
$$ 0 \le \hat\ell(P_{sv}) = \ell(P_{sv}) + \phi_s - \phi_v = \ell(P_{sv}) - \phi_v. $$
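Putting these properties to work, here is a sketch of Johnson's re-weighting in Python, reusing the `bellman_ford` sketch above: a dummy source with zero-weight arcs to every vertex yields feasible potentials ϕ, and Dijkstra then runs on the non-negative weights ŵ(u, v) = w(u, v) + ϕ(u) − ϕ(v). The code structure is our illustration, not the exact routine from lecture.

```python
import heapq

def johnson_apsp(n, arcs):
    """All-pairs shortest paths, allowing negative arcs but no negative cycles."""
    # Dummy source n, with 0-weight arcs to all vertices, gives potentials phi.
    phi = bellman_ford(n + 1, arcs + [(n, v, 0) for v in range(n)], n)
    if phi is None:
        raise ValueError("graph contains a negative-weight cycle")
    adj = [[] for _ in range(n)]
    for u, v, w in arcs:
        adj[u].append((v, w + phi[u] - phi[v]))      # re-weighted: w_hat >= 0
    INF = float('inf')
    all_dist = []
    for s in range(n):                               # one Dijkstra per source
        d = [INF] * n
        d[s] = 0
        pq = [(0, s)]
        while pq:
            dv, v = heapq.heappop(pq)
            if dv > d[v]:
                continue
            for u, w in adj[v]:
                if dv + w < d[u]:
                    d[u] = dv + w
                    heapq.heappush(pq, (d[u], u))
        # Undo the re-weighting (property 2) to recover true distances.
        all_dist.append([d[v] - phi[s] + phi[v] if d[v] < INF else INF
                         for v in range(n)])
    return all_dist
```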
This is the usual matrix multiplication, but over the semiring (ℝ, min, +). (A semiring has a notion of addition and one of multiplication; however, neither the addition nor the multiplication operations are required to have inverses.)

It turns out that computing Min-Sum Products is precisely the operation needed for the APSP problem. Indeed, initialize a matrix D exactly as in the Floyd-Warshall algorithm:
$$
D_{ij} = \begin{cases} w_{ij}, & (i,j) \in E \\ \infty, & (i,j) \notin E,\; i \ne j \\ 0, & i = j. \end{cases}
$$
Now (D ⊚ D)_ij represents the cheapest i-j path using at most 2 hops! (It's as though we made the outer-most loop of Floyd-Warshall into the inner-most loop.) Similarly, we can compute
$$ D^{\circledcirc k} := \underbrace{D \circledcirc D \circledcirc \cdots \circledcirc D}_{k-1 \text{ MSPs}}, $$
whose entries give the shortest i-j paths using at most k hops (or at most k − 1 intermediate nodes). Since the shortest paths would have at most n − 1 hops, we can compute D^{⊚(n−1)}.
How much time would this take? The very definition of MSP shows how to implement it in O(n³) time. But performing it n − 1 times would be O(n) worse than all other approaches! But here's a classical trick, which probably goes back to the Babylonians: for any integer k,
$$ D^{\circledcirc 2k} = D^{\circledcirc k} \circledcirc D^{\circledcirc k}. $$
(Here we use that the underlying operations are associative.) Now it is a simple exercise to compute D^{⊚(n−1)} using at most 2 log₂ n MSPs. This gives a runtime of O(MSP(n) log n), where MSP(n) is the time it takes to compute the min-sum product of two n × n matrices. (In fact, with some more work, we can implement APSP in time O(MSP(n)); you will probably see this in a homework.) Now using the naive implementation of MSP gives a total runtime of O(n³ log n), which is almost in the right ballpark! The natural question is: can we implement MSPs faster?
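Before moving on, here is the naive approach in full, as a Python sketch (names ours): the cubic min-sum product, plus repeated squaring to reach n − 1 hops in O(log n) products. Squaring past n − 1 hops is harmless, since the entries stabilize.

```python
INF = float('inf')

def min_sum_product(A, B):
    """(A ⊚ B)[i][j] = min_k (A[i][k] + B[k][j]); O(n^3) time."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def apsp_by_squaring(D):
    """D: the 1-hop matrix (0 on the diagonal, INF for non-edges)."""
    n, hops = len(D), 1
    while hops < n - 1:
        D = min_sum_product(D, D)    # doubles the number of allowed hops
        hops *= 2
    return D
```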
Can we get algorithms for MSP that run in time O(n^{3−ε}) for some constant ε > 0? To answer this question, we can first consider the more common case, that of matrix multiplication over the reals (or over some field). Here, the answer is yes, and this has been known for now over 50 years. In 1969, Volker Strassen showed that one could multiply n × n matrices over any field F using O(n^{log₂ 7}) = O(n^{2.81}) additions and multiplications. (One can allow divisions as well, but Strassen showed that divisions do not help asymptotically.) (Mike Paterson has a beautiful but still mysterious geometric interpretation of the sub-problems Strassen comes up with, and how they relate to Karatsuba's algorithm to multiply numbers.)
If we define the exponent of matrix multiplication ω > 0 to be the smallest real such that two n × n matrices over any field F can be multiplied in time O(n^ω), then Strassen's result can be phrased as saying:
$$ \omega \le \log_2 7. $$
This value, and Strassen's idea, has been refined over the years, to its current value of 2.3728 due to François Le Gall (2014). (The big improvements in this line of work were due to Arnold Schönhage (1981), Don Coppersmith and Shmuel Winograd (1990), with recent refinements by Andrew Stothers, CMU alumna Virginia Vassilevska Williams, and François Le Gall.) See the survey by Virginia for a discussion of algorithmic progress until 2013. There has been a flurry of work on lower bounds as well, e.g., by Josh Alman and Virginia Vassilevska Williams showing limitations for all known approaches.
But how about MSP(n)? Sadly, progress on this has been less im-
pressive. Despite much effort, we don’t even know if it can be done
in O(n3−ϵ ) time. In fact, most of the recent work has been on giving
evidence that getting sub-cubic algorithms for MSP and APSP may
not be possible. There is an interesting theory of hardness within P
developed around this problem, and related ones. For instance, it is
now known that several problems are equivalent to APSP, and truly
sub-cubic algorithms for one will lead to sub-cubic algorithms for all
of them.
Yet there is some interesting progress on the positive side, albeit qualitatively small. As far back as 1976, Fredman had shown an algorithm to compute MSP in $O\big(n^3 \tfrac{\log\log n}{\log n}\big)$ time. He used the fact that the decision-tree complexity of APSP is sub-cubic (a result we will discuss in §4.5) in order to speed up computations over nearly-logarithmic-sized sub-instances; this gives the improvement above. More recently, another CMU alumnus, Ryan Williams, improved on this idea quite substantially to $O\big(n^3 / 2^{\sqrt{\log n}}\big)$, using very interesting ideas from circuit complexity. We will discuss this result in a later section, if we get a chance.
Now consider the graph G², the square of G, which has the same vertex set as G but where an edge in G² corresponds to being at most two hops away in G—that is, uv ∈ E(G²) ⇐⇒ d_G(u, v) ≤ 2. If we consider A as a matrix over the finite field (F₂, +, ∗), then the adjacency matrix of G² has a nice formulation:
$$ A_{G^2} = A_G * A_G + A_G. $$
(In this field over {0, 1}, observe that the multiplication operation behaves just like the Boolean AND function, and the addition like the Boolean XOR.) This shows how to get the adjacency matrix of G² given one for G, having spent one Boolean matrix multiplication and one matrix addition. Suppose we recursively compute APSP on G²: how can we translate this result back to G? The next lemma shows that the shortest-path distances in G² are nicely related to those in G.
$$ u, a_1, b_1, a_2, b_2, \ldots, a_k, b_k, v $$
$$ u, a_1, b_1, a_2, b_2, \ldots, a_k, b_k, a_{k+1}, v. $$
But which one? The following lemmas give us a simple rule to decide. Let N_G(v) denote the set of neighbors of v in G.
Lemma 4.6. If duv = 2Duv , then for all w ∈ NG (v) we have Duw ≥ Duv .
Proof. Assume not, and let w ∈ N_G(v) be such that D_uw < D_uv. Since both of them are integers, we have 2D_uw ≤ 2D_uv − 2. Then the shortest u-w path in G along with the edge wv forms a u-v path in G of length at most 2D_uw + 1 ≤ 2D_uv − 1 < 2D_uv = d_uv, which contradicts the assumption that d_uv is the shortest-path distance in G.
Lemma 4.7. If d_uv = 2D_uv − 1, then D_uw ≤ D_uv for all w ∈ N_G(v); moreover, there exists z ∈ N_G(v) such that D_uz < D_uv.
$$ \hat{A}_{wv} = \mathbf{1}_{wv \in E} \cdot \frac{1}{\deg(v)}. $$
$$ d_{uv} = 2D_{uv} - \mathbf{1}\big[(D\hat{A})_{uv} < D_{uv}\big]. $$
graphs, and gives an algorithm with runtime $\tilde{O}\big(W^{\frac{1}{4-\omega}}\, n^{2+\frac{1}{4-\omega}}\big)$.
Given the algorithmic advances, one may wonder about lower bounds
for the APSP problem. There is the obvious Ω(n2 ) lower bound
from the time required to write down the answer. Maybe even the
decision-tree complexity of the problem is Ω(n3 )? Then no algorithm
can do any faster, and we’d have shown the Floyd-Warshall and the
Matrix-Multiplication methods are optimal.
However, thanks to a result of Fredman, we know this is not the case.
Now for every pair of columns p, q of A_i and B_i^⊺, sort the following 2n numbers:
$$ A_{1p} - A_{1q},\; A_{2p} - A_{2q},\; \ldots,\; A_{np} - A_{nq},\; -(B_{1p} - B_{1q}),\; \ldots,\; -(B_{np} - B_{nq}). $$
This result does not give us a fast algorithm, since it just counts
the number of comparisons, and not the actual time to figure out
which comparisons to make. Regardless, many of the algorithms
that achieve n3 / poly log n time for APSP use Fredman’s result on
tiny instances (say of size O(poly log n), so that we can find the best
decision-tree using brute-force) to achieve their results.
5 Low-Stretch Spanning Trees
Given that shortest paths from a single source node s can be repre-
sented by a single shortest-path tree, can we get an analog for all-
pairs shortest paths? Given a graph can we find a tree T that gives us
the shortest-path distances between every pair of nodes? Does such
a tree even exist? Sadly, the answer is negative—and it remains neg-
ative even if we allow this tree to stretch distances by a small factor,
as we will soon see. However, we show that allowing randomization lets us circumvent these problems, and get low-stretch spanning trees in general graphs.
In this chapter, we consider undirected graphs G = (V, E), where each edge e has a non-negative weight/length w_e. For all u, v in V, let d_G(u, v) be the distance between u, v, i.e., the length of a shortest path in G from u to v. Observe that the set V along with the distance function d_G forms a metric space. (A metric space is a set V with a distance function d satisfying symmetry (i.e., d(x, y) = d(y, x) for all x, y ∈ V) and the triangle inequality (d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ V). Typically, the definition also asks for x = y ⇐⇒ d(x, y) = 0, but we will merely assume d(x, x) = 0 for all x.)

5.1 Towards a Definition

The study of low-stretch spanning trees is guided by two high level hopes:

1. Graphs have spanning trees that preserve their distances. That is, given G there exists a subtree T = (V, E_T) with E_T ⊆ E such that d_G(u, v) ≈ d_T(u, v) for all u, v ∈ V. (We assume that the weights of edges in E_T are the same as those in G.)
Now, since π T is the optimal ordering for the tree T, and πG is some
other ordering,
which is α · OPTG .
Observe that the first property must hold with probability 1 (i.e.,
it holds for all trees in the support of the distribution), whereas the
second property holds only on average. Is this definition any good
for our TSP example above? If we change the algorithm to sample a
tree T from the distribution and then return the optimal tour for T,
we get a randomized algorithm that is good in expectation. Indeed,
(5.1) becomes
$$ \mathbb{E}[d_T(u, v)] = \frac{n-1}{n} \cdot 1 + \frac{1}{n} \cdot (n - 1) = 2 - \frac{2}{n}. $$
And what about an arbitrary pair of nodes u, v in C_n? We can use the exercise on the right to show that the stretch on other pairs is no worse! (Exercise: Given a graph G, suppose the stretch on all edges is at most α. Show that the stretch on all pairs of nodes is at most α. Hint: linearity of expectation.)
While we will not manage to get α < 1.49 for general graphs (or
even for the above examples, for which the bounds of 2 − n2 are the
best possible), we show that α ≈ O(log n) can indeed be achieved.
The following theorem is the current best result, due to Ittai Abra-
ham and Ofer Neiman:
Theorem 5.4. For any graph G, there exists a distribution D over span-
ning trees of G with stretch α = O(log n log log n). Moreover, the
construction is efficient: we can sample trees from this distribution D in
O(m log n log log n) time.
Theorem 5.7. For any metric space M = (V, d), there exists an efficiently sampleable α_B-stretch spanning tree distribution D_B, where α_B = O(log n log ∆_M).
$$ \Pr[x, y \text{ in different clusters}] \le \beta \cdot \frac{d(x, y)}{D}. $$
Let’s see a few examples, to get a better sense for the definition:
1. Consider a set of points on the real line. One way to partition the
line into pieces of diameter D is simple: imagine making notches
3. What about lower bounds? One can show that for the k-dimensional
hypergrid, we cannot get β = o (k). Or for a constant-degree n-
vertex expander, we cannot get β = o (log n). Details to come soon.
Since the aspect ratio of the metric space is invariant to scaling all
the edge lengths by the same factor, it will be convenient to assume
that the smallest non-zero distance in d is 1, so the largest distance is
∆. The basic algorithm is then quite simple:
Now the probability that R_v > D/2 for one particular cluster is (using 1 − z ≤ e^{−z} for all z ∈ ℝ)
$$ \Pr[R_v > D/2] = (1 - p)^{D/2} \le e^{-pD/2} \le e^{-2\log n} = \frac{1}{n^2}. $$
By a union bound, every cluster has diameter at most D with probability
$$ 1 - \Pr[\exists v \in V,\, R_v > D/2] \ge 1 - \frac{n}{n^2} = 1 - \frac{1}{n}. $$
To bound the probability of some pair x, y being separated, we use the fact that sampling from the geometric distribution with parameter p means repeatedly flipping a coin with bias p and counting the number of flips until we see the first heads. Recall this process is memoryless, meaning that even if we have already performed k flips without having seen a heads, the time until the first heads is still geometrically distributed.

Figure 5.1: A cluster forming around v in the LDD process, separating x and y. To reduce clutter, only some of the distances are shown.

Hence, the steps of drawing R_v and then forming the cluster can be viewed as starting from v, where the cluster is a unit-radius ball around v. Each time we flip a coin of bias p: if it comes up heads we set the radius R_v to the current value, form the cluster C_v (and mark its vertices) and then pick a new unmarked point v; on seeing tails, we just increment the radius of v's cluster by one and flip again. The process ends when all vertices lie in some cluster.
For x, y, consider the first time when one of these vertices lies inside the current ball centered at some point, say, v. (This must happen at some point, since all vertices are eventually marked.) Without loss of generality, let the point inside the current ball be x. At this point, we have performed d(v, x) flips without having seen a heads. Now we will separate x, y only if we see a heads within the next ⌈d(v, y) − d(v, x)⌉ ≤ ⌈d(x, y)⌉ flips—beyond that, both x, y will have been contained in v's cluster and hence cannot be separated. But the probability of getting a heads among these flips is at most (by a union bound)
$$ \lceil d(x, y) \rceil \, p \le 2\, d(x, y)\, p \le 8 \log n \cdot \frac{d(x, y)}{D}. $$
(Here we used that the minimum distance is 1, so rounding up distances at most doubles things.) This proves the claimed probability of separation.
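A toy sketch of this ball-growing process in Python over a finite metric given as a distance matrix, with p = 4 log n / D as in the calculation above (names and the exact boundary convention for the ball are ours):

```python
import math, random

def ldd(dist, D):
    """dist: n x n metric with minimum non-zero distance >= 1. Partitions
    {0..n-1} into clusters; each has radius <= D/2 with high probability,
    and Pr[x, y separated] <= O(log n) * d(x, y) / D."""
    n = len(dist)
    p = min(1.0, 4 * math.log(n) / D)
    unmarked = set(range(n))
    clusters = []
    while unmarked:
        v = random.choice(sorted(unmarked))
        R_v = 1
        while random.random() >= p:       # geometric radius, parameter p
            R_v += 1
        ball = {u for u in unmarked if dist[v][u] < R_v}   # includes v itself
        unmarked -= ball
        clusters.append(sorted(ball))
    return clusters
```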
Lemma 5.11. If the random tree T returned by some call LDD(M′, δ) has root r, then (a) every vertex x in T has distance d(x, r) ≤ 2^{δ+1}, and (b) the expected distance between any x, y ∈ T has E[d_T(x, y)] ≤ 8δβ d(x, y).

Proof. The proof is by induction on δ. For the base case, the tree has a single vertex, so the claims are trivial. Else, let x lie in cluster C_j, so inductively the distance to the root r_j of the tree T_j is d(x, r_j) ≤ 2^{(δ−1)+1}. Now the distance to the new root r is at most 2^δ more, which gives 2^δ + 2^δ = 2^{δ+1} as claimed.
Moreover, any pair x, y is separated by the LDD with probability β · d(x, y)/2^{δ−1}, in which case their distance is at most
Else they lie in the same cluster, and inductively have expected dis-
This proves Theorem 5.7 because β = O(log n), and the initial call on the entire metric defines δ = O(log ∆). In fact, if we have a better LDD (with smaller β), we immediately get a better low-stretch tree. For example, shortest-path metrics of planar graphs admit an LDD with parameter β = O(1); this shows that planar metrics admit (randomized) low-stretch trees with stretch O(log ∆).
It turns out this factor of O(log n log ∆) can be improved to O(log n)—this was done by Fakcharoenphol, Rao, and Talwar. Moreover, the bound of O(log n) is tight: the lower bounds of Theorem 5.5 continue to hold even for low-stretch non-spanning trees.
Given sets S, T, their symmetric difference is denoted S △ T := (S \ T) ∪ (T \ S).

Figure 6.2: An augmenting path.
Note that the entire set V is trivially a vertex cover, and the chal-
lenge is to find small vertex covers. We denote the size of the smallest
cardinality vertex cover of graph G as VC ( G ). Our motivation for
calling it a “dual” object comes from the following fundamental theo-
rem from the early 20th century:
Theorem 6.9 (König's Minimax Theorem; Dénes König, 1916). In a bipartite graph, the size of the largest possible matching equals the cardinality of the smallest vertex cover:
$$ MM(G) = VC(G). $$
at least MM( G ).
Figure 6.3: Illustration of the process to find augmenting paths in a bipartite graph. (Mistakes here, to be fixed!)

Hence, suppose we do not find an open node in an even level, and stop when some X_j is empty. Let X = ∪_j X_j be all nodes added to any of the sets X_j; we call these marked nodes. Define the set C to be the vertices on the left which are not marked, plus the vertices on the right which are marked. That is,
$$ C := (L \setminus X) \cup (R \cap X). $$
Theorem 6.12 (The Tutte-Berge Max-Min Theorem; Tutte 1947, Berge 1958). Given a graph G, the size of the maximum matching is described by the following equation:
$$ MM(G) = \min_{U \subseteq V} \frac{n + |U| - \mathrm{odd}(G \setminus U)}{2}. $$
(Tutte showed that the graph has a perfect matching precisely if for every U ⊆ V, odd(G \ U) ≤ |U|. Berge gave the generalization to maximum matchings.)
The expression on the right can seem a bit confusing, so let’s con-
sider some cases.
$$ |M| \le |U| + \sum_{i=1}^{t} \left\lfloor \frac{|K_i|}{2} \right\rfloor = |U| + \frac{n - |U|}{2} - \frac{\mathrm{odd}(G \setminus U)}{2} = \frac{|U| + n - \mathrm{odd}(G \setminus U)}{2}. $$
We can prove the “hard” direction using induction (see the webpage
for several such proofs). However, we defer it for now, and derive it
later from the proof of the Blossom algorithm.
The rest of this section defines the algorithm, and proves this
theorem. The essential idea of the algorithm is simple, and similar
to the one for the bipartite case: if we have a matching M, Berge’s
characterization from Theorem 6.7 says that if M is not optimal, there
exists an M-augmenting path. So the natural idea would be to find
such an augmenting path. However, it is not clear how to do this
directly. The clever idea in the Blossom algorithm is to either find
an M-augmenting path, or else find a structure called a “blossom”.
The good thing about blossoms is that we can use them to contract
the graph in a certain way, and make progress. Let us now give some definitions, and details.

(Figure: a flower, consisting of a stem and a blossom. Legend: matched edge, unmatched edge, open vertex.)

A flower is a subgraph of G that looks like the object to the right: it has an open vertex at the base, then a stem with an even number of edges (alternating between matched and unmatched edges),
Let's give some more details for the last step. Suppose we find a flower F, with stem S and blossom B. First, toggle the stem (by setting M ← M △ S): this moves the open node to the blossom, without changing the size of the matching M. (It makes the following arguments easier, with one less case to consider.) (Change figure.)

Figure 6.7: The shrinking of a blossom. Image found at http://en.wikipedia.org/wiki/Blossom_algorithm.
Proof. Since we toggled the stem, the vertex v at the base of the blos-
som B is open, and so is the vertex v B created in G ′ by contracting
B. Moreover, all other nodes in the blossom are matched by edges
within itself, so all edges leaving B are non-matching edges. The
picture essentially gives the proof, and can be used to follow along.
3. If v ∈ X2j for j < i, then u would have been added to the odd level
X2j+1 , which is impossible.
Now for the edges out of the odd layers considered in line 9.3.
Given u ∈ X2i+1 and matching edge uv ∈ M, the cases are:
Observe that if the algorithm does not succeed, all the matching
edges we explored are odd-to-even, whereas all the non-matching
edges are even-to-odd. Now we can prove Lemma 6.14.
(a) the marked vertices in the even levels, X_even, which are all singletons since there are no cross edges, and
Hence
$$ \frac{n + |U| - \mathrm{odd}(G \setminus U)}{2} = \frac{n + |X_{\mathrm{odd}}| - |X_{\mathrm{even}}|}{2} = \frac{2|X_{\mathrm{odd}}| + (n - |X|)}{2} = |X_{\mathrm{odd}}| + \frac{n - |X|}{2} = |M|. $$
The last equality uses that all nodes in V \ X are perfectly matched among themselves, and all nodes in X_odd are matched using unique edges.
The last piece is to show that a Tutte-Berge set U′ for a contracted graph G′ = G/B with respect to M′ = M/B can be lifted to one for G with respect to M. We leave it as an exercise to show that adding the entire blossom B to U′ gives such a U.
As an example, the half space $S = \{\vec{x} \mid \binom{1}{1} \cdot \vec{x} \ge 3\}$ in ℝ² is shown on the right. (Note that we implicitly restrict ourselves to closed half-spaces.)
K = { Ax ≤ b},
Although all linear programs can be put into this canonical form, in practice they may have many different forms. These presentations can be shown to be equivalent to one another by adding new variables and constraints, negating the entries of A and c, etc. For example, the following are all linear programs:
$$ \max_x \{c \cdot x : Ax \le b\} \qquad \min_x \{c \cdot x : Ax = b\} $$
$$ \min_x \{c \cdot x : Ax \ge b\} \qquad \min_x \{c \cdot x : Ax \le b,\, x \ge 0\}. $$
In other words, x is an extreme point of K if it cannot be written as the convex combination of two other points in K. See Figure 7.3 for an example. (Figure 7.3: Here y is an extreme point, but x is not.)
Here's another kind of point in K. (In this course, we will use the notation c · x, c⊺x, and ⟨c, x⟩ to denote the inner-product between vectors c and x.)
K = CH(ext(K )).
It is conceptually easy to define an |E|-dimensional polytope whose vertices are precisely the perfect matchings of G: we simply define
$$ C_{PM(G)} = CH(\{\chi_M \mid M \text{ is a perfect matching in } G\}). \tag{7.3} $$
(Figure 7.4: This graph has one perfect matching M: it contains edges 1, 4, 5, and 6, represented by the vector χ_M = (1, 0, 0, 1, 1, 1).)
$$ \min \{ w \cdot x \mid x \in C_{PM(G)} \}. $$
$$
K_{PM(G)} = \left\{ x \in \mathbb{R}^{|E|} \;\middle|\; \begin{array}{ll} \sum_{r \in N(l)} x_{lr} = 1 & \forall l \in L \\ \sum_{l \in N(r)} x_{lr} = 1 & \forall r \in R \\ x_e \ge 0 & \forall e \in E \end{array} \right\}
$$
Proof. For brevity, let us refer to the polytopes as K and C. The easy
direction is to show that C ⊆ K. Indeed, the characteristic vector χ M
for each perfect matching M satisfies the constraints for K. Moreover
K is convex, so if it contains all the vertices of C, it contains all their
convex combinations, and hence all of C.
For the other direction, we show that an arbitrary vertex x∗ of K is contained within C. By Fact 7.8, x∗ is also an extreme point of K. (We can also use the fact that x∗ is a basic feasible solution, or that it is a vertex of the polytope, to prove this theorem; we will add the former proof soon, and the latter proof appears in §7.3.)
Let supp(x∗) = {e | x∗_e > 0} be the support of this solution. We claim that supp(x∗) is acyclic. Indeed, suppose not, and some cycle C = e₁, e₂, . . . , e_k is contained within the support supp(x∗). Since the graph is bipartite, this is an even-length cycle. Define
$$ \varepsilon := \min_{e \in \mathrm{supp}(x^*)} x^*_e. $$
This completes the proof that the polytope K PM(G) exactly captures
precisely the perfect matchings in G, despite having such a simple
description. Now, using the fact that the linear program
min{w · x | x ∈ K PM(G) }
$$ x^*|_{E \setminus E'} = C^{-1}(\mathbf{1} - C' x^*|_{E'}) = C^{-1} \mathbf{1}. $$
By Cramer's rule,
$$ x^*_e = \frac{\det(C[\mathbf{1}]_i)}{\det(C)}, $$
where C[1]_i denotes C with its i-th column replaced by the all-ones vector. The numerator is an integer (since the entries of C are integers), so showing det(C) ∈ {±1} means that x∗_e is an integer.
Using the claim and the fact that C is non-singular (hence det(C) cannot be zero), we get that the entries of x∗ are integers. By the structure of the LP, the only integers possible in a feasible solution are {0, 1}, and the vector x∗ corresponds to a matching.
The results of the previous section show that the bipartite perfect
matching polytope is integral, and hence the max-weight perfect
Consider the setting with a set B with n buyers and another set I with
n items, where buyer b has value vbi for item i. The goal is to find a
max-value perfect matching, that matches each buyer to a distinct
item and maximizes the sum of the values obtained by this matching.
Our algorithm will maintain a set of prices for items: each item i
will have price pi . Given a price vector p := ( p1 , . . . , pn ), define the
utility of item i to buyer b to be
ubi ( p) := vbi − pi .
A buyer has at least one preferred item, and can have multiple preferred items, since there can be ties. Given prices p, we build a preference graph H = H(p), where the vertices are buyers B on the left, items I on the right, and where bi is an edge if buyer b prefers item i at prices p. The two examples show preference graphs, where the second graph results from an increase in price of item 1. (Flip the figure.)
$$ \min_{p = (p_1, \ldots, p_n)} \; \sum_{i \in I} p_i + \sum_{b \in B} u_b(p). $$
Consider the dual solution given by the price vector p∗ . Recall that
M is a perfect matching in the preference graph H ( p∗ ), and let M(i )
be the buyer matched to item i by it. Since u M(i) ( p) = v M(i)i − pi , the
dual objective is
Since the primal and dual values are equal, the primal matching M
must be optimal.
That’s it. Running the algorithm on our running example gives the
prices on the right.
The only way the algorithm can stop is to produce an optimal
matching. So we must show it does stop, for which we use a “semi-
invariant” argument. We keep track of the “potential”
$$ \Phi(p) := \sum_i p_i + \sum_b u_b(p), $$
Lemma 7.17. Every time we increase the prices in N (S) by 1, the value of
∑i pi + ∑b ub decreases by at least 1.
that all values were integral.) Therefore, the value of the potential ∑_i p_i + ∑_b u_b changes by |N(S)| − |S| ≤ −1.
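A sketch of the entire auction in Python, for integral values (all helper names ours): rebuild the preference graph, look for a perfect matching by augmenting paths, and when the search from some buyer gets stuck, the set of items it reached is exactly N(S) for a constricted set S, so those prices go up by one.

```python
def auction(v):
    """v[b][i]: integer value of item i to buyer b (n buyers, n items).
    Returns (matching, prices), where matching[b] is the item given to b."""
    n = len(v)
    p = [0] * n
    while True:
        # Preference graph: each buyer points to its maximum-utility items.
        pref = []
        for b in range(n):
            best = max(v[b][i] - p[i] for i in range(n))
            pref.append([i for i in range(n) if v[b][i] - p[i] == best])
        owner = [None] * n                 # owner[i] = buyer holding item i

        def try_augment(b, seen):
            for i in pref[b]:
                if i not in seen:
                    seen.add(i)
                    if owner[i] is None or try_augment(owner[i], seen):
                        owner[i] = b
                        return True
            return False

        stuck = None
        for b in range(n):
            seen = set()
            if not try_augment(b, seen):
                stuck = seen               # = N(S) for a constricted set S
                break
        if stuck is None:                  # perfect matching in H(p): done
            matching = [None] * n
            for i, b in enumerate(owner):
                matching[b] = i
            return matching, p
        for i in stuck:                    # raise prices on N(S) by one
            p[i] += 1
```

When the augmenting search from b fails, S consists of b together with the owners of the items in `seen`, so |S| = |N(S)| + 1 and S is indeed constricted; the potential argument above bounds the number of rounds.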
• In fact, one can get rid of the integrality assumption by raising the prices by the maximum amount possible for the above proof to still go through, namely
$$ \min_{b \in S} \Big( u_b(p) - \max_{i \notin N(S)} (v_{bi} - p_i) \Big). $$
It can be shown that this update rule makes the algorithm stop in only O(n³) iterations.
• If all the values are non-negative, and we don't like the utilities to be negative, then we can do one of the following things: (a) when all the prices become non-zero, subtract the same amount from all of them to make the lowest price hit zero, or (b) choose S to be a minimal "constricted" set and raise the prices for N(S). This way, we can ensure that each buyer still has at least one item which gives it non-negative utility. (Exercise!)
• Suppose there are n buyers and a single item, with all non-negative values. (Imagine there are n − 1 dummy items, with buyers having zero values for them.) The above algorithm behaves like the usual ascending-price English or Vickrey auction, where prices are raised until only one bidder remains. Indeed, the final price for the "real" item will be such that the second-highest bidder is indifferent between it and a dummy item.
This is a more general phenomenon: indeed, even in the setting with multiple items, the final prices are those produced by the Vickrey-Clarke-Groves truthful mechanism, at least if we use the version of the algorithm that raises prices on minimal constricted sets. The truthfulness of the mechanism means there is no incentive for buyers to unilaterally lie about their values for items. See, e.g., the references for the rich connection of matching algorithms to auction theory.
This proof shows that for any setting of values, there is an optimal
integer solution to the linear program
max{v · x | x ∈ K LP(G) }.
Let us now see yet another algorithm for solving weighted matching problems in bipartite graphs. For now, we switch from maximum-weight matchings to minimum-weight matchings, because they are conceptually cleaner to explain here. Of course, the two problems are equivalent, since we can always negate edges.
In fact, we solve a min-cost max-flow problem here: given a flow network with terminals s and t, edge capacities u_e, and also edge costs/weights w_e, find an s-t flow with maximum flow value, and whose total cost/weight is the least among all such flows. (Moreover, if the capacities are integers, the flow we find will also have integer flow values on all edges.) Casting the maximum-cardinality bipartite matching problem as an integer max-flow problem, as in §blah, gives us a minimum-weight bipartite matching.
This algorithm uses an augmenting path subroutine, much like
the algorithm of Ford and Fulkerson. The subroutine, which takes in
a matching M and returns one of size | M | + 1, is presented below.
Then, we can start with the empty matching and call this subroutine
until we get a maximum matching.
Let the original bipartite graph be G. Construct the directed graph
G M as follows: For each edge e ∈ M, insert that edge directed from
right to left, with weight −we . For each edge e ∈ G \ M, insert that
edge directed from left to right, with weight we . Then, compute the
shortest path P that starts from the left and ends on the right, and
return M △ P. It is easy to see that M △ P is a matching of size | M | +
1, and has total weight equal to the sum of the weights of M and P.
Call a matching M an extreme matching if M has minimum
weight among all matchings of size | M |. The main idea is to show
that the above subroutine preserves extremity, so that the final match-
ing must be extreme and therefore optimal.
$$
K_{genPM(G)} = \left\{ x \in \mathbb{R}^{|E|} \;\middle|\; \begin{array}{ll} \sum_{u \in \partial(v)} x_{vu} = 1 & \forall v \in V \\ \sum_{e \in \partial(S)} x_e \ge 1 & \forall S \text{ s.t. } |S| \text{ odd} \\ x_e \ge 0 & \forall e \in E \end{array} \right\}
$$
K genPM(G) = CgenPM(G) ,
We just saw several proofs that the bipartite perfect matching polytope has a compact linear program. Moreover, we claimed that the perfect matching polytope on general graphs has an explicit linear program that, while exponential-sized, can be solved in polynomial time. Such results allow us to solve the weighted bipartite matching problems using generic linear programming solvers (as long as they return vertex solutions).
7.6.1 Arborescences

We already saw a linear program for the min-weight r-arborescence polytope in §2.3.2: since each node that is not the root r must have a path in the arborescence to the root, it is natural to say that for any subset of vertices S ⊆ V that does not contain the root, there must be an edge leaving it. Specifically, given the digraph G = (V, A), the polytope can be written as
$$
K_{Arb(G)} = \left\{ x \in \mathbb{R}^{|A|} \;\middle|\; \begin{array}{ll} \sum_{a \in \partial^+(S)} x_a \ge 1 & \forall S \subset V \text{ s.t. } r \notin S \\ x_a \ge 0 & \forall a \in A \end{array} \right\}.
$$
Here ∂⁺(S) is the set of arcs that leave set S. The proof in §2.3.2 already showed that for each weight vector w ∈ ℝ^{|A|}, we can find an optimal solution to the linear program min{w · x | x ∈ K_Arb(G)}.
(The first constraint excludes the case where S is either empty or the entire vertex set.) Sadly, this does not precisely capture the spanning tree polytope: e.g., for the familiar cycle graph having three vertices, setting x_e = 1/2 for all three edges satisfies all the constraints. If all edge weights are 1, this solution gets a value of ∑_e x_e = 3/2, whereas any spanning tree on 3 vertices must have 2 edges.
One can indeed write a different linear program that captures the spanning tree polytope, but it is a bit non-trivial:
$$
K_{ST(G)} = \left\{ x \in \mathbb{R}^{|E|} \;\middle|\; \begin{array}{ll} \sum_{ij \in E:\, i,j \in S} x_{ij} \le |S| - 1 & \forall S \subseteq V, S \ne \emptyset \\ \sum_{ij \in E} x_{ij} = |V| - 1 & \\ x_{ij} \ge 0 & \forall ij \in E \end{array} \right\}
$$
The homework exercises will ask you to write such a compact ex-
tended formulation for the arborescence problem.
Theorem 7.21 (Hoffman-Kruskal Theorem; A.J. Hoffman and J.B. Kruskal, 1956). If the constraint matrix [A]_{m×n} is totally unimodular and the vector b is integral, i.e., b ∈ ℤ^m, then the vertices of the polytope induced by the LP are integer-valued.
Thus, to show that the vertices are indeed integer-valued, one need not go through producing combinatorial proofs, as we have. Instead, one could just check that the constraint matrix A is totally unimodular.
8 Graph Matchings III: Algebraic Algorithms
• The first result along these lines is that of Laci Lovász (1979), who introduced the general idea, and gave a randomized algorithm to detect the presence of perfect matchings in time O(n^ω), and to find it in time O(mn^ω). We will present all the details of this elegant idea soon.

• Dick Karp, Eli Upfal, and Avi Wigderson (1986), and then Ketan Mulmuley, Umesh Vazirani, and Vijay Vazirani (1987) showed how to find such a matching in parallel. The question of getting a deterministic parallel algorithm remains an outstanding open problem, despite recent progress (which we discuss at the end of the chapter).

• Michael Rabin and Vijay Vazirani (1989) sped up the sequential algorithm to run in O(n · n^ω). This was substantially improved by the work of Marcin Mucha and Piotr Sankowski (2006) to get a runtime of O(n^ω).
For the rest of this lecture, we fix a field F, and consider (univariate and multivariate) polynomials over this field. We assume that we can perform basic arithmetic operations in constant time, though sometimes it will be important to look more closely at this assumption. (For finite fields F_q, where q is a prime power, we can perform arithmetic operations (addition, multiplication, division) in time poly log q.)
$$ \Pr[p(R) = 0] \le \frac{d}{|S|}. $$
This statement holds for multivariate polynomials as well, as we see next. The result is called the Schwartz-Zippel lemma, and it appears in papers by Richard DeMillo and Richard Lipton, by Richard Zippel, and by Jack Schwartz.
$$ p(x_1, \ldots, x_n) = x_n^k \, q(x_1, \ldots, x_{n-1}) + r(x_1, \ldots, x_n), $$
$$
\begin{aligned}
\Pr[p(R_1, \ldots, R_n) = 0] &= \Pr[p(R_1, \ldots, R_n) = 0 \mid \mathcal{E}] \Pr[\mathcal{E}] + \Pr\big[p(R_1, \ldots, R_n) = 0 \mid \overline{\mathcal{E}}\big] \Pr\big[\overline{\mathcal{E}}\big] \\
&\le \Pr[\mathcal{E}] + \Pr\big[p(R_1, \ldots, R_n) = 0 \mid \overline{\mathcal{E}}\big].
\end{aligned}
$$
Thus we get
$$ \Pr[p(R_1, \ldots, R_n) = 0] \le \frac{d - k}{|S|} + \frac{k}{|S|} = \frac{d}{|S|}. $$
Remark 8.4. Choosing a set S ⊆ F with |S| ≥ dn² guarantees that if p is a non-zero polynomial,
$$ \Pr[p(R_1, \ldots, R_n) = 0] \le \frac{1}{n^2}. $$
Naturally, if p is the zero polynomial, then the probability equals 1.
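To make the tester concrete, here is a sketch for the bipartite case: take S to be a large prime field ℤ_P, fill the Edmonds matrix with random field elements, and report "perfect matching" iff the determinant of the resulting numeric matrix is non-zero. The determinant is computed by Gaussian elimination mod P to avoid fraction blow-up; the names and the specific prime are our choices.

```python
import random

P = (1 << 61) - 1    # a Mersenne prime; Z_P is our set S, so error <= n / P

def det_mod_p(M, p=P):
    """Determinant of a square matrix over Z_p, by Gaussian elimination."""
    M = [row[:] for row in M]
    n, det = len(M), 1
    for c in range(n):
        piv = next((r for r in range(c, n) if M[r][c] % p != 0), None)
        if piv is None:
            return 0
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            det = -det
        det = det * M[c][c] % p
        inv = pow(M[c][c], p - 2, p)          # inverse via Fermat's little theorem
        for r in range(c + 1, n):
            f = M[r][c] * inv % p
            for k in range(c, n):
                M[r][k] = (M[r][k] - f * M[c][k]) % p
    return det % p

def has_perfect_matching(n, edges):
    """Bipartite G with sides {0..n-1}; edges is a set of (l, r) pairs.
    One-sided error: may wrongly answer False with probability <= n / P."""
    M = [[random.randrange(1, P) if (i, j) in edges else 0
          for j in range(n)] for i in range(n)]
    return det_mod_p(M) != 0
```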
Observe that now each variable occurs twice, with the variables
below the diagonal being the negations of those above. We claim the
same property for this matrix as we did for the Edmonds matrix:
Theorem 8.11. For any graph G, the determinant of the Tutte matrix T( G )
is a non-zero polynomial over any field F if and only if there exists a perfect
matching in G.
If the cycle cover for σ has an odd-length cycle, take the permutation σ′ obtained by reversing the order of, say, the first odd cycle in it. This does not change the sign of the permutation, which depends on how many odd and even cycles there are, but the skew symmetry and the odd cycle length mean the product of matrix entries flips sign. Hence the monomials for σ and σ′ cancel each other. Formally, this pairing gives a sign-reversing bijection on the permutations whose cycle covers contain odd-length cycles, so all of their monomials cancel out.
The remaining monomials correspond to cycle covers with only even cycles. Choosing either the odd edges or the even edges on each such even cycle gives a perfect matching.
Now given Theorem 8.11, the Tutte matrix can simply be substi-
tuted instead of the Edmonds matrix to extend the results to general
graphs.
We can convert the above perfect matching tester (which solves the
decision version of the perfect matching problem) into an algorithm
for the search version: one that outputs a perfect matching in a graph
(if one exists), using the simple but brilliant idea of self-reducibility. We are reducing the problem to smaller
Suppose that graph G has a perfect matching. Then we can pick any instances of itself, hence the name.
Theorem 8.12. Let |S| ≥ n3 . Given a bipartite graph G that contains some
perfect matching, Algorithm 11 finds a perfect matching with probability at
least 12 , and runs in time O(m · nω ).
Proof. At each step, we call the tester once, and then recurse after
either deleting an edge or two vertices. Thus, the number of total
recursive steps inside Algorithm 11 is at most max{m, n/2} = m, if
the graph is connected. This gives a runtime of O(m · nω ). Moreover,
graph matchings iii: algebraic algorithms 111
at each step, the probability that the tester returns a wrong answer is
at most n12 , so the PM-tester makes a mistake with probability at most
m
n2
≤ 1/2, by a union bound.
e −1
e −1,− j ) = E
det(E × (−1) j+1 × det(E
e)
j,1
if (i, j) ̸∈ E,
0
Mi,j = 1 if (i, j) ∈ E and colored blue,
y if (i, j) ∈ E and colored red.
Claim 8.15. Let G have at most one perfect matching with k red edges.
The determinant det(M) has a term of the form ck yk if and only if G
has a k-red matching.
The polynomial p(y) has degree at most n, and hence we can re-
cover it by Lagrangian interpolation. Indeed, we can choose n + 1
distinct numbers a0 , . . . , an , and evaluate p( a0 ), . . . , p( an ) by comput-
ing the determinant det(M) at y = ai , for each i. These n + 1 values
are enough to determine the polynomial as follows:
n +1
x − aj
p(y) = ∑ p ( ai ) ∏ ai − a j
.
i =1 j ̸ =i
(E.g., see 451 lecture notes or Ryan’s lecture notes.) Note this is a
completely deterministic algorithm, so far.
114 matchings in parallel, and the isolation lemma
where Qi is a multilinear degree-n polynomial that corresponds to Multilinear just means that the degree
all the i-red matchings. If we set the x variables randomly (say, to of each variable in each monomial is at
most one.
values xij = aij ) from a large enough set S, we get a polynomial
R(y) = P(a, y) whose only variable is y. The coefficient of yk in this
polynomial is Qk (a), which is non-zero with high probability, by the
Schwartz-Zippel lemma. Now we can again use interpolation to find
out this coefficient, and decide the red-blue matching problem based
on whether it is non-zero.
The Curse of
Dimensionality, and
Dimension Reduction
9
Concentration of Measure
3. How many unit vectors can you choose in Rn that are almost
orthonormal? I.e., they must satisfy | vi , v j | ≤ ε for all i ̸= j?
All these questions can be answered by the same basic tool, which
goes by the name of Chernoff bounds or concentration inequalities
or tail inequalities or concentration of measure, or tens of other
names. The basic question is simple: if we have a real-valued function
f ( X1 , X2 , . . . , Xm ) of several independent random variables Xi , such that it
is “not too sensitive to each coordinate”, how often does it deviate far from
its mean? To make it more concrete, consider this—
Given n independent random variables X1 , . . . , Xn , each bounded in
the interval [0, 1], let Sn = ∑in=1 Xi . What is
h i
Pr Sn ̸∈ (1 ± ε)ESn ?
Pr[ X ≥ µ + λ]
Pr[ X ≤ µ − λ]
is the lower tail. We are interested in bounding these tails for various
values of λ.
E( X )
P( X ≥ λ ) ≤ (9.5)
λ
With this in hand, we can start substituting various non-negative
functions of random variables X to deduce interesting bounds. For
instance, the next inequality looks at both the mean µ := EX and the
variance σ2 := E[( X − µ)2 ] of a random variable, and bounds both
the upper and lower tails.
σ2
Pr[| X − µ| ≥ λ] ≤ .
λ2
Proof. Using Markov’s inequality on the non-negative r.v. Y = ( X −
µ)2 , we get
E [Y ]
Pr[Y ≥ λ2 ] ≤ .
λ2
The proof follows from Pr[Y ≥ λ2 ] = Pr[| X − µ| ≥ λ].
122 non-asymptotic convergence bounds
9.2.3 Examples I
Example 1 (Coin Flips): Let X1 , X2 , . . . , Xn be i.i.d. Bernoulli random
variables with Pr[ Xi = 0] = 1 − p and Pr[ Xi = 1] = p. (Im other
words, these are the outcomes of independently flipping n coins,
each with bias p.) Let Sn := ∑in Xi be the number of heads. Then Sn is
distributed as a binomial random variable Bin(n, p), with Recall that linearity of expectations for
r.v.s X, Y means E[ X + Y ] = E[ X ] +
E[Y ]. For independent we have Var[ X +
E[Sn ] = np and Var[Sn ] = np(1 − p). Y ] = Var[ X ] + Var[Y ].
pn 1
Pr[Sn − pn ≥ βn] ≤ = .
pn + βn 1 + ( β/p)
np(1 − p) p
Pr[|Sn − pn| ≥ βn] ≤ 2 2
< 2 .
β n β n
In particular, this already says that the sample mean Sn /n lies in the
p
interval p ± β with probability at least 1 − β2 n . Equivalently, to get Concretely, to get within an additive
p p 1% error of the correct bias p with
confidence 1 − δ, we just need to set δ ≥ β2 n
, i.e., take n ≥ β2 δ
. (We probability 99.9%, set β = 0.01 and
will see a better bound soon.) δ = 0.001, so taking n ≥ 107 · p samples
suffices.
1 1
Pr Li ≥ 1 + λ ≤ ≈ .
1+λ λ
the final position after n steps. Each step (Xi ) can be modelled as a
Rademacher random variable with the following distribution.
1 w.p. 12
Xi =
−1 w.p. 1
2
Theorem 9.7 (2kth -Order Moment inequalities). Let k ∈ Z≥0 . For any
random variable X having mean µ, and finite moments upto order 2k, we
have
E(( X − µ)2k )
Pr[| X − µ| ≥ λ] ≤ .
λ2k
Proof. The proof is exactly the same: using Markov’s inequality on
the non-negative r.v. Y := ( X − µ)2k ,
E [Y ]
Pr[| X − µ| ≥ λ] = Pr[Y ≥ λ2k ] ≤ .
λ2k
We can get stronger tail bounds for large values of k, however
it becomes increasingly tedious to compute E(( X − µ)2k ) for the
random variables of interest.
where we crucially used that the r.v.s are independent and mean-
zero, hence terms like Xi3 X j , Xi2 X j Xk , and Xi X j Xk Xl all have mean
124 chernoff bounds, and hoeffding’s inequality
λ
If λ ≪ n, then we can approximate 1 + k λn by ek n :
Pr[Sn = 2λ] −2λ n 2λ n −4λ2
≈ e n ( 2 +λ) e n ( 2 −λ) = e n .
Pr[Sn = 0]
√
Finally, substituting λ = tσ = t n, we get
2
Pr[Sn = 2λ] ≈ Pr[Sn = 0] · e−4t .
This shows that most of the probability mass lies in the region |Sn | ≤
√
O( n), and drops off exponentially as we go further. And indeed,
this is the bound we will derive next—we will get slightly weaker
constants, but we will avoid these tedious approximations.
Proof. We only prove (9.8); the proof for (9.9) is similar. The idea is to
use Markov’s inequality not on the square or the fourth power, but
on a function which is fast-growing enough so that we get tighter
bounds, and “not too fast” so that we can control the errors. So we
consider the Laplace transform, i.e., the function
x 7→ etx
for some value t > 0 to be chosen carefully. Since this map is mono-
tone,
Bernoulli random variables: Assume that all the Xi ∈ {0, 1}; we will
remove this assumption later. Let the mean be µi = E[ Xi ], so the
moment generating function can be explicitly computed as
Substituting, we get
∏i E[etXi ]
Pr[Sn ≥ µ(1 + β)] ≤ (9.11)
etµ(1+ β)
∏ exp(µi (et − 1))
≤ i (9.12)
etµ(1+ β)
exp(µ(et − 1))
≤
etµ(1+ β)
(since µ = ∑ µi )
i
t
= exp(µ(e − 1) − tµ(1 + β)). (9.13)
Since this calculation holds for all positive t, and we want the tightest
upper bound, we should minimize the expression (9.13). Setting the
derivative w.r.t. t to zero gives t = ln(1 + β) which is non-negative for
β ≥ 0. This bound on the upper tail is also
one to be kept in mind; it often is
µ
eβ useful when we are interested in large
Pr[Sn ≥ µ(1 + β)] ≤ . (9.14) deviations where β ≫ 1. One such
(1 + β )1+ β example will be the load-balancing
application with jobs and machines.
We’re almost there: a slight simplification is that
β
≤ ln(1 + β) (9.15)
1 + β/2
for all β ≥ 0, so
(9.15) − β2 µ
(9.13) = exp(µ( β − (1 + β) ln(1 + β))) ≤ exp ,
2+β
126 chernoff bounds, and hoeffding’s inequality
with the last inequality following from simple algebra. This proves
the upper tail bound (9.8); a similar proof gives us the lower tail as
well.
Removing the assumption that Xi ∈ {0, 1}: If the r.v.s are not Bernoullis,
then we define new Bernoulli r.v.s Yi ∼ Bernoulli(µi ), which take
value 0 with probability 1 − µi , and value 1 with probability µi , so
that E[ Xi ] = E[Yi ]. Note that f ( x ) = etx is convex for every value
of t ≥ 0; hence the function ℓ( x ) = (1 − x ) · f (0) + x · f (1) satisfies
f ( x ) ≤ ℓ( x ) for all x ∈ [0, 1]. Hence E[ f ( Xi )] ≤ E[ℓ( Xi )]; moreover
ℓ( x ) is a linear function so E[ℓ( Xi )] = ℓ(E[ Xi ]) = E[ℓ(Yi )], since
Xi and Yi have the same mean. Finally, ℓ(y) = f (y) for y ∈ {0, 1}.
Putting all this together,
so the step from (9.11) to (9.12) goes through again. This completes
the proof of Theorem 9.8.
Since the proof has a few steps, let’s take stock of what we did:
i. Markov’s inequality on the function etX ,
ii. independence and linearity of expectations to break into etXi ,
iii. reduction to the Bernoulli case Xi ∈ {0, 1},
iv. compute the MGF (moment generating function) E[etXi ],
v. choose t to minimize the resulting bound, and
vi. use convexity to argue that Bernoullis are the “worst case”. Do make sure you see why the bounds
You can get tail bounds for other functions of random variables of Theorem 9.8 are impossible in
general if we do not assume some kind
by varying this template around; e.g., we will see an application for of boundedness and independence.
sums of independent normal (a.k.a. Gaussian) random variables in
the next chapter.
If we set β = Θ(log n), the probability of the load Li being larger than
1 + β is at most 1/n2 . Now taking a union bound over all bins, the
probability that any bin receives at least 1 + β balls is at most n1 . I.e.,
the maximum load is O(log n) balls with high probability.
In fact, the correct answer is that the maximum load is (1 +
o (1)) lnlnlnnn with high probability. For example, the proofs in cite show
this. Getting this precise bound requires a bit more work, but we can
get an asymptotically correct bound by using (9.14) instead, with a
C ln n
setting of β = ln ln n with a large constant C.
Moreover, this shows that the asymmetry in the bounds (9.8)
and (9.9) is essential. A first reaction would have been to believe The situation where β ≪ 1 is often
our proof to be weak, and to hope for a better proof to get called the Gaussian regime, since the
bound on the upper tail behaves like
exp(− β2 µ). In other cases, the upper
Pr[Sn ≥ (1 + β)µ] ≤ exp(− β2 µ/c) tail bound behaves like exp(− βµ), and
is said to be the Poisson regime.
for some constant c > 0, for all values of β. This is not possible,
p
however, because it would imply a max-load of Θ( log n) with high
probability.
Recall from §9.2.5 that the tail bound of ≈ exp(−t2 /O(1)) is indeed
in the right ballpark.
and that the function Sn is the sum of these r.v.s. Add details and refs
to this section.
But before we move on, let us give the bound that Sergei Bern-
stein gave in the 1920s: it uses knowledge about the variance of the
random variable to get a potentially sharper bound than Theorem 9.8
We can use this in the step (9.10), since the function etx is monotone
increasing for t > 0.
Negative association arises in many settings: say we want to
choose a subset S of k items out of a universe of size n, and let
Xi = 1i∈S be the indicator for whether the ith item is selected. The
variables X1 , . . . , Xn are clearly not independent, but they are nega-
tively associated.
9.4.2 Martingales
A different and powerful set of results can be obtained when we
stop considering random variables are not independent, but al-
low variables X j to take on values that depend on the past choices
X1 , X2 , . . . , X j−1 but in a controlled way. One powerful formalization
is the notion of a martingale. A martingale difference sequence is a se-
quence of r.v.s Y1 , Y2 , . . . , Yn , such that E[Yi | Y1 , . . . , Yi−1 ] = 0 for each
i. (This is true for mean-zero independent r.v.s, but may be true in
other settings too.)
This inequality does not assume very much about the function,
except it being ci -Lipschitz in the ith coordinate; hence we can also
use this to the truncated random walk example above, or for many
other applications.
E[ X k ] E[etX ]
Pr[Sn ≥ λ] ≤ min ≤ inf
k ≥0 λk t≥0 etλ
with a d-bit vector. Each vertex i has a single packet (which we also
call packet i), destined for vertex π (i ), where π is a permutation on
the nodes [n].
Packets move in synchronous rounds. Each edge is bi-directed,
and at most one packet can cross each directed edge in each round.
Moreover, each packet can cross at most one edge per round. So if
uv ∈ E( Qd ), one packet can cross from u to v, and one from v to u,
in a round. Each edge e has an associated queue; if several packets
want to cross e in the same round, only one can crosse, and the rest
wait in the queue, and try again the next round. (So each node has
d queues, one for each edge leaving it.) We assume the queues are
allowed to grow to arbitrary size (though one can also show queue
length bounds in the algorithm below). The goal is to get a simple
routing scheme that delivers the packets in O(d) rounds.
One natural proposal is the bit-fixing routing scheme: each packet
i looks at its current position u, finds the first bit position where u
differs from π (i ), and flips the bit (which corresponds to traversing
an edge out of u). For example:
Claim 9.16. Packet i reaches the midpoint by time at most d + |S(i )|.
Proof. If Xij is the indicator of the event that Pi and Pj intersect, then
|S(i )| = ∑ j̸=i Xij , i.e., it is a sum of a collection of independent {0, 1}-
valued random variables. Now conditioned on any choice of Pi
(which is of length at most d), the expected number of paths using
each edge in it is at most 1/2, so the conditional expectation of S(i ) is
at most d/2. Since this holds for any choice of Pi , the unconditional
expectation µ = E[S(i )] is also at most d/2. Now apply the Chernoff
bound to S(i ) with βµ = 4d − µ and µ ≤ d/2 to get
(4d − µ)2
Pr[|S(i )| ≥ 4d] ≤ exp − ≤ e−2d .
2µ + (4d − µ)
Note that we could apply the bound even though the variables Xij
were not i.i.d., and moreover we did not need estimates for E[ Xij ],
just an upper bound for their expected sum.
dimensions.
∥ A( xi ) − A( x j )∥22
1−ε ≤ ≤ 1 + ε.
∥ xi − x j ∥22
Moreover, such a map can be computed in expected poly(n, D, 1/ε) time.
Note that the target dimension k is independent of the original
dimension D, and depends only on the number of points n and the
accuracy parameter ε. Given n points with Euclidean distances
It is not difficult to show that we need at least Ω(log n) dimen- in (1 ± ε), the balls of radius 1− 2
ε
with probability at least 1 − 1/n2 , where vij is the unit vector in the
direction of xi − x j . By a union bound, all (n2 ) pairs of distances in
( X2 ) are maintained with probability at least 1 − (n2 ) n12 ≥ 1/2. A few
comments about this construction:
• The above proof shows not only the existence of a good map, we
also get that a random map as above works with constant prob-
ability! In other words, a Monte-Carlo randomized algorithm
for dimension reduction. (Since we can efficiently check that the
distances are preserved to within the prescribed bounds, we can
convert this into a Las Vegas algorithm.) Or we can also get deter-
ministic algorithms: see here.
Let us recall some basic facts about Gaussian distributions. The prob-
ability density function for the Gaussian N (µ, σ2 ) is
( x − µ )2
f (x) = √1 e 2σ2 .
2πσ
We also use the following; the proof just needs some elbow grease. The fact that the means and the vari-
ances take on the claimed values should
Proposition 10.3. If G1 ∼ N (µ1 , σ12 ) and G2 ∼ N (µ2 , σ22 ) are indepen- not be surprising; this is true for all
r.v.s. The surprising part is that the
dent, then for c ∈ R, resulting variables are also Gaussians.
Now, here’s the main idea in the proof of Lemma 10.2. Imagine
that the vector x is the elementary unit vector e1 = (1, 0, . . . , 0). Then
M e1 is just the first column of M, which is a vector with independent
and identical Gaussian values.
G1,1 G1,2 ···
G1,D 1 G1,1
G2,1 G2,2 ···
G2,D 0 G2,1
M e1 =
.. .. ..
.. .. = ..
.
. . .. . .
Gk,1 Gk,2 · · · Gk,D 0 Gk,1
√
A( x ) is a scaling-down of this vector by k: every entry in this
random vector A( x ) = A(e1 ) is distributed as
√
1/ k · N (0, 1) = N (0, 1/k) (by (10.1)).
138 the direct proof of Lemma 10.2
Thus, the expected squared length of A( x ) = A(e1 ) is If G has mean µ and variance σ2 , then
" # E[ G2 ] = Var[ G ] + E[ G ]2 = σ2 + µ2 .
h i k k h i k
1
E ∥ A( x )∥ = E ∑ A( x )2i = ∑ E A( x )2i =
2
∑k = 1.
i =1 i =1 i =1
Yi ∼ ⟨ G1 , G2 , . . . , GD ⟩ · x = ∑ x j Gj
j
where the Gj ’s are the i.i.d. N (0, 1) r.v.s on the ith row of M. Now
Proposition 10.3 tells us that Yi ∼ N (0, x12 + x22 + . . . + x2D ). Since x is
a unit length vector, we get
Yi ∼ N (0, 1).
k
1
Z := ∥ A(z)∥2 = ∑ k · Yi2
i =1
So our current bound on the upper tail is that for all t ∈ (0, 1/2) we
have
k
1
Pr[ Z ≥ (1 + ε)] ≤ √ .
et(1+ε) 1 − 2t
Let’s just focus on part of this expression:
1 1
√ = exp −t − log(1 − 2t)) (10.6)
et 1 − 2t 2
= exp (2t)2 /4 + (2t)3 /6 + · · · (10.7)
≤ exp t2 (1 + 2t + 2t2 + · · · ) (10.8)
if we set t = ε/4 and use the fact that 1 − 2t ≥ 1/2 for ε ≤ 1/2. (Note:
this setting of t also satisfies t ∈ (0, 1/2), which we needed from our
previous calculations.)
Almost done: let’s take stock of the situation. We observed that
∥ A( x )∥22 was distributed like an average of squares of Gaussians, and
by a Chernoff-like calculation we proved that
While Gaussians have all kinds of nice properties, they are real-
valued and hence require more randomness to generate. What other
classes of r.v.s could give us bounds that are comparable? E.g., what
about setting each Mij ∈ R {−1, +1}? A random sign is also called a
It turns out that Rademacher r.v.s also suffice, and we can prove Rademacher random variable, the
name Bernoulli being already taken for
this with some effort. But instead of giving a proof from first princi- a random bit in {0, 1}.
ples, let us abstract out the process of proving Chernoff-like bounds,
and give a proof using this abstraction.
Recall the basic principle of a Chernoff bound: to bound the upper
tail of an r.v. V with mean µ, we can choose any t ≥ 0 to get
Pr[V − µ ≥ λ] ≤ e−(tλ−ψ(t)) .
so we get the concise statement for a generic Chernoff bound: Bounds for the lower tail follow from
the arguments applied to the r.v. − X.
dimension reduction and the jl lemma 141
where M j s are pairwise independent, with mean zero and unit variance.
h i h i
E Yi2 = E (∑ M j x j )(∑ Ml xl )
j l
h i
=E ∑ M2j x2j + ∑ Mj Ml x j xl
j j̸=l
h h i i
= ∑ E M2j x2j + ∑ E M j Ml x j xl = ∑ x2j .
j j̸=l j
1 k 1 k
E[ Z ] = ∑
k i =1
E[Yi2 ] = ∑ ∑ x2j = ∥ x ∥22 .
k i =1 j
(10.14)
10.6.2 Concentration
Now to show concentration: the direct proof from §10.4 showed
the Yi s were themselves Gaussian with variance ∥ x ∥2 . Since the
Rademachers are 1-subgaussian, Lemma 10.5 shows that Yi is sub-
gaussian with parameter ∥ x ∥2 . Next, we need to consider Z, which is
the average of squares of k independent Yi s. The following lemma
shows that the MGF of squares of symmetric σ-subgaussians are
bounded above by the corresponding Gaussians with variance σ2 .
An r.v. X is symmetric if it is dis-
tributed the same as R| X |, where R is
Lemma 10.6. If V is symmetric mean-zero σ-subgaussian r.v., and W ∼ an independent Rademacher.
2 2
N (0, σ2 ), then E[etV ] ≤ E[etW ] for t > 0.
dimension reduction and the jl lemma 143
(Note that we’ve just introduced W into the mix, without any provo-
cation!) Since W is also symmetric, we get |V |W = V |W |. Hence,
rewriting
√ √
2t(|V |/σ ) W 2t|W |/σ )V
EV,W [e ] = EW [EV [e( ]],
Definition 10.9. A matrix A is (t, ε)-RIP if for all unit vectors x with
∥ x ∥0 ≤ t, we have ∥ Ax ∥22 ∈ [1 ± ε].
dimension reduction and the jl lemma 145
Theorem 10.10 (Heavy Shells). At least 1 − ε of the mass of the unit ball
log 1/ε
in Rd lies within a Θ( d )-width shell next to the surface.
Theorem 10.11 (Heavy Slabs). At least (1 − ε) of the mass of the unit ball
√
in Rd lies within Θ(1/ d) slab around any hyperplane that passes through
the origin.
2 2
where G ∼ N (0, σ2 ). But we know that Pr[ G ≥ w] ≤ e−w /2σ by
our generic Chernoff bound for Gaussians (10.11). So setting that tail
probability to be ε gives
q r
log(1/ε)
2
w ≈ 2σ log(1/ε) = O .
d
Corollary 10.12. If we pick two random vectors from the surface of the unit
ball in Rd (i.e., from the sphere), then they are nearly orthogonal
q with high
log(1/ε)
probability. In particular, their dot-product is smaller than O( d )
with probability 1 − ε.
a1 , a2 , a3 , . . . , a t , . . .
1. Can we compute the sum of all the integers seen so far? I.e.,
F ( a[1:t] ) = ∑it=1 ai . We want the outputs to be
3, 4, 21, 25, 16, 48, 149, 152, −570, −567, 333, 337, 369, . . .
3, 1, 17, 17, 17, 32, 101, 101, 101, 101, 900, 900, 900
148 streams as vectors, and additions/deletions
3. The median? The outputs on the various prefixes of (11.1) now are
3, 1, 3, 3, 3, 3, 4, 3, . . .
1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 9, 9, 9 . . .
may just want to read over the file in one quick pass and come up
with an answer. Such an algorithm might also be cache-friendly. But
how to do this?
Two of the recurring themes will be:
and xit is the number of times the ith element in U has been seen until
time t. (Hence, xi0 = 0 for all i ∈ U.) When the next element comes in
and it is element j, we increment x j by 1.
streaming algorithms 149
(add, A), (add, B), (add, A), (del, B), (del, A), (add, C ), . . .
|U |
Fp := ∑ (xit ) p . (11.2)
i =1
This estimator was given by Noga Alon, Yossi Matias, and Mario
Szegedy, in their God̈el-award winning paper on streaming computa- Alon, Matias, Szegedy (2000)
tion.
The choice of the hash family will be crucial: we want a small fam-
ily so that we require only a small amount of space to store the hash
function, but we want it to be rich enough for the subsequent analy-
sis to go through.
C := ∑ x i h ( i ).
i ∈U
= ∑ xi2 = F2 .
i
E[(C2 )2 ] = E[ ∑ h ( p ) h ( q ) h (r ) h ( s ) x p x q xr x s ] =
p,q,r,s
= ∑ x4p E[h( p)4 ] + 6 ∑ x2p xq2 E[h( p)2 h(q)2 ] + other terms
p p<q
This is because all the other terms have expectation zero. Why? The
terms like E[h( p)h(q)h(r )h(s)] where p, q, r, s are all distinct, all be-
come zero because of 4-universality. Terms like E[h( p)2 h(r )h(s)]
become zero for the same reason. It is only terms like E[h( p)2 h(q)2 ]
and E[h( p)4 ] that survive, and since h( p) ∈ {−1, 1}, they have expec-
tation 1. So
Var(C2 ) 2
Pr[|C2 − E[C2 ]| > εE[C2 ]] ≤ ≤ 2.
(εE[C2 ])2 ε
This is pretty pathetic: since ε is usually less than 1, the RHS usually
more than 1.
this estimator has mean µ and variance σ2 /k. (Why? Summing the
independent copies sums the variances and so increases it by k, but
dividing by k reduces it by k2 .)
So if we k such independent counters C1 , C2 , . . . , Ck , and return
their average C = 1k ∑i Ci , we get
2
2 2 2 Var(C ) 2
Pr[|C − E[C ]| > εE[C ]] ≤ 2
≤ .
(εE[C ])2 kε2
Mij := hi ( j).
2
2
The estimate C = 1
k ∑ik=1 Ci is nothing but
1
∥ Mx∥22 .
k
This is completely analogous to the construction for JL: we’ve got
a slightly taller matrix with k = O(ε−2 δ−1 ) rows instead of k =
O(ε−2 log δ−1 ) rows. However, the matrix entries are not fully inde-
pendent (as in JL), just 4-wise independent. I.e., we need to store only
O(k log D ) bits and can generate any entry of M quickly, whereas the
construction for JL stored all kD bits. Henceforth, we use S = √1 M to denote
k
Let us record two properties of this construction: the “sketch” matrix.
∥C − AB∥2F ≤ small.
C = AB
A B = C
154 optional: computing the number of distinct elements
This usually takes O(n3 ) time. Indeed, the ijth entry of the product
C is the dot-product of the ith row Ai⋆ of A with the jth column B⋆ j of
B, and the dot-product takes O(n) time. The intuition is that S⊺ S is an almost-
Suppose instead we use a “fat and short” k × n matrix S (for k ≪ identity matrix, it has 1 on the diag-
onals and at most ε everywhere else.
n), and calculate And hence it gives only a small error.
Ce = AS⊺ SB. Of course, we don’t multiply out S⊺ S,
but instead compute AS⊺ and SB, and
By associativity of matrix multiplication, we could first compute then multiply the smaller matrices.
( AS⊺ ) and (SB) in times O(n2 k), and then multiply the results in
time O(nk2 ). Moreover, the matrix S from the previous section works
pretty well, where we set D = n.
S>
d
A S B ≈ C
Indeed, entries of the error matrix Y = C − Ce satisfy
E[Yij ] = 0
and
= 2k ∥ A∥2F ∥ B∥2F .
Finally, setting k = ε22δ and using Markov’s inequality, we can say that
for any fixed ε > 0, we can compute an approximate matrix product
C := AS⊺ SB such that
Pr ∥ AB − C ∥ F ≤ ε · ∥ A∥ F ∥ B∥ F ≥ 1 − δ,
2
in time O( εn2 δ ). (If we want to make δ very small, at the expense of
picking more independent random bits in the sketching matrix S, we
can use the JL matrices instead. Details will appear in a homework.)
Finally, if the matrices A, B are sparse and contains only ≪ n2 entries,
the time can be made to depend on nnz( A, B).
The approximate matrix product question has been considered
often, e.g., by Edith Cohen and David Lewis using a random-walks Cohen and Lewis (1999)
approach. The algorithm we present is due to Tamás Sarlós; his pa-
per gives better results, as well as extensions to computing SVDs
faster. Better bounds have subsequently been given by Clarkson and
Woodruff. More recent refs too.
bits of memory.1 1
We used the approximation that
k
OK, so why should we be able to uniquely identify the set of el- (mk) ≥ mk , and hence log2 (mk) ≥
k (log2 m − log2 k ).
ements until time N − 1? For a contradiction, suppose we could
not tell whether we’d seen S1 or S2 after N − 1 elements had come
in. Pick any element e ∈ S1 \ S2 . Now if we gave the algorithm e
as the N th element, the number of distinct elements seen would be
N if we’d already seen S2 , and N − 1 if we’d seen S1 . But the algo-
rithm could not distinguish between the two cases, and would return
the same answer. It would be incorrect in one of the two cases. This
contradicts the claim that the algorithm always correctly reports the
number of distinct elements on streams of length N.
same argument, for any integer s we expect the sth smallest mapped
value at ds . We use a larger value of s to reduce the variance.
M·s
Dt = .
Lt
3
Pr[ Dt > 2 ∥xt ∥0 ] ≤ , and (11.4)
s
∥ x t ∥0 3
Pr[ Dt < ]≤ . (11.5)
2 s
We will prove this in the next section. First, some observations.
Firstly, we now use the stronger assumption that that the hash family
2-universal; recall the definition from Section 11.2.2. Next, setting
∥xt ∥
s = 8 means that the estimate Dt lies within [ 2 0 , 2∥xt ∥0 ] with
probability at least 1 − (1/4 + 1/4) = 1/2. (And we can boost the
streaming algorithms 157
2sM
Pr[ estimate too low ] = Pr[ Dt < d/2] = Pr[ Lt > ].
d
Recall T is the set of all d (= ∥xt ∥0 ) distinct elements in U that
have appeared so far. How many of these elements in T hashed to
values greater than 2sM/d? The event that Lt > 2sM/d (which
is what we want to bound the probability of) is the same as saying
that fewer than s of the elements in T hashed to values smaller than
2sM/d. For each i = 1, 2, . . . , d, define the indicator
1 if h(ei ) ≤ 2sM/d
Xi = (11.6)
0 otherwise
⌊sM/2d⌋ s 1
Pr[ Xi = 1] = ≥ − . (11.7)
M 2d M
By linearity of expectations,
" #
d d d
s 1 s d
E[ X ] = E ∑ Xi = ∑ E [Xi ] = ∑ Pr [Xi = 1] ≥ d · 2d
−
M
=
2
−
M
.
i =1 i =1 i =1
s
Let’s imagine we set M large enough so that d/M is, say, at most 100 .
Which means s s 49 s
E[ X ] ≥ − = .
2 100 100
158 optional: computing the number of distinct elements
So by Markov’s inequality,
100 49
Pr X > s = Pr X > E[ X ] ≤ .
49 100
Good? Well, not so good. We wanted a probability of failure to be
smaller than 2/s, we got it to be slightly less than 1/2. Good try, but
no cigar.
100 50 σX2 1 3
Pr[ X > s] = Pr[ X > µ X ] ≤ Pr[| X − µ X | > µX ] ≤ ≤ ≤ .
49 49 (50/49)2 µ2X (50/49)2 µ X s
Which is precisely what we want for the bound (11.4). The proof for
the bound (11.5) is similar and left as an exercise. If we want the estimate to be at most
∥ x t ∥0
,
then we would want to bound
(1+ ε )
E[ X ]
Pr[ X < ]. Similar calculations
11.5.6 Final Bookkeeping (1+ ε )
should give this to be at most ε23s , as
Excellent. We have a hashing-based data structure that answers long as M was large enough. In that
case we would set s = O(1/ε2 ) to get
“number of distinct elements seen so far” queries, such that each some non-trivial guarantees.
answer is within a multiplicative factor of 2 of the actual value ∥xt ∥0 ,
with small error probability.
Let’s see how much space we actually used. Recall that for failure
probability 1/2, we could set s = 12, say. And the space to store
the s smallest hash values seen so far is O(s lg M ) bits. For the hash
functions themselves, the standard constructions use O((lg M) +
(lg U )) bits per hash function. So the total space used for the entire
data structure is
O(log M ) + (lg U ) bits.
What is M? Recall we needed to M large enough so that d/M ≤
s/100. Since d ≤ |U |, the total number of elements in the universe,
set M = Θ(U ). Now the total number of bits stored is
O(log U ).
12.1 Introduction
AV = UDV ⊺ V = UD
lowing, we see how to obtain the SVD and why it solves our best fit
problem. The lecture is partly based on 2 . 2
We start with the case that k = 1. Thus, we look for the line through
the origin that minimizes the sum of the squared errors. See Fig-
ure 12.2. It depicts a one-dimensional subspace V in blue. We look
at a point ai , its distance β i to V, and the length of its projection to
V which is named αi in the picture. Notice that the length of ai is
α2i + β2i . Thus, for our fixed ai , minimizing β i is equivalent to maxi-
mizing αi . If we represent V by a unit vector v that spans V (depicted
in orange in the picture), then we can compute the projection of ai to
V by the dot product ⟨ ai , v⟩. We have just argued that we can find the
best fit subspace of dimension one by solving
n n
max
v∈Rd ,∥v∥=1
∑ ⟨ai , v⟩2 = v∈Rmin
d ,∥ v ∥=1
∑ dist(ai , span(v))2
i =1 i =1
where we denote the distance between a point ai and the line spanned
by v by dist( ai , span(v))2 . Now because Av = (⟨ a1 , v⟩, ⟨ a2 , v⟩, . . . , ⟨ an v⟩)⊺ ,
we can rewrite ∑id=1 ⟨ ai , v⟩2 as ∥ Av∥2 . We define the first right singu-
lar vector to be a unit vector that maximizes ∥ Av∥.3 We thus know 3
There may be many vectors that
that the subspaces spanned by it is the best fit subspace of dimension achieve the maximum: indeed, for
every v that achieves the maximum,
one. −v also has the same maximum. Let us
break ties arbitrarily.
βi
a1 a4
αi
a2
v 1 = arg max ∥ Av ∥ , σ1 ( A ) : = ∥ Av 1 ∥
∥ v ∥= 1
v 2 = arg max ∥ Av ∥ , σ2 ( A ) : = ∥ Av 2 ∥
∥ v ∥= 1, ⟨ v,v 1 ⟩= 0
..
.
v r = arg max ∥ Av ∥ , σr ( A ) : = ∥ Av r ∥
∥ v ∥= 1, ⟨ v,v i ⟩= 0 ∀ i = 1,...,r − 1
∥ Aw 1 ∥ 2 + ∥ Aw 2 ∥ 2 ≤ ∥ Av 1 ∥ 2 + ∥ Av 2 ∥ 2 .
Proof. We prove the claim by using the fact that two matrices A, B ∈
R n × d are identical iff for all vectors v, the images are equal, i.e. Av =
Bv. Notice that it is sufficient to check this for a basis, so it is true if
the following subclaim holds (which we do not prove):
k
∑ σi ui vi .
⊺
Ak :=
i =1
Assume that the entries in U and V are positive. Since the column
vectors are unit vectors, they define a convex combination of the
r topics. We can thus imagine U to contain information on how
much each of the documents consists of each topic. Then, D assigns
a weight to each of the topics. Finally, we V ⊺ gives information on
how much each topic consists of each of the words. The combination
of the three matrices generates the actual documents. By using the
dimension reduction: singular value decompositions 165
a4 x
b4
b2 a3 x
a2 x
b3
a1 a2 a3 a4
a1 x
b1
A( A+ b) = b ∀b in the image of A
x ∗ = A+ b
166 symmetric matrices
For instance, you can check that Ak or e A defined this way indeed
correspond to what you think they might mean. (The other way to
k
define e A would be ∑k≥0 Ak! .)
Part III
From Discrete to
Continuous Algorithms
13
Online Learning: Experts and Bandits
2.41(mi + log2 N ),
Note that
Theorem 13.4. For ε ∈ (0, 1/2), penalizing each incorrect expert by a factor
of (1 − ε) guarantees that the number of mistakes made by MW is at most
log N
2(1 + ε ) m i + O .
ε
172 randomized weighted majority
This shows that we can make our mistakes bound as close to 2m∗
as we want, but this approach seems to have this inherent loss of
a factor of 2. In fact, no deterministic strategy can do better than a
factor of 2, as we show next.
Proposition 13.5. No deterministic algorithm A can do better than a factor
of 2, compared to the best expert.
Proof. Note that if the algorithm is deterministic, its predictions are
completely determined by the sequence seen thus far (and hence can
also be computed by the adversary). Consider a scenario with two
experts A,B, the first always predicts 1 and the second always pre-
dicts 0. Since A is deterministic, an adversary can fix the outcomes
such that A’s predictions are always wrong. Hence at least one of A
and B will have an error rate of ≤ 1/2, while A’s error rate will be
1.
Note that the update of the weights proceeds exactly the same as
previously.
Theorem 13.6. Fix ε ≤ 1/2. For any fixed sequence of predictions, the ex-
pected number of mistakes made by randomized weighted majority (RWM)
is at most
log N
E[ M ] ≤ (1 + ε ) m i + O
ε
log N
The quantity εmi + O ε gap
between the algorithm’s performance
Proof. The proof is an analysis of the weight evolution that is more and that of the best expert is called the
(t)
careful than in Theorem 13.4. Again, the potential is Φt = ∑i wi . regret with respect to expert i.
Define
(t)
∑i incorrect wi
Ft := (t)
∑ i wi
to be the fraction of weight on incorrect experts at time t. Note that
E[ M ] = ∑ Ft .
t∈[ T ]
ln N
E[ M ] ≤ m i (1 + ε ) + .
ε
Let’s broaden the setting slightly, and consider the following dot-
product game. In each round, Define the probability simplex as
1. The algorithm produces a vector of probabilities ∆ N := x ∈ [0, 1] N | ∑ xi = 1 .
i
t
p = ( p1t , p2t , · · · , ptN ) ∈ ∆N .
to deduce that
Pr[mistake at time t] = ℓt , pt .
Theorem 13.7. Consider a fixed ε ≤ 1/2. For any sequences of loss vectors
in [−1, 1] N and for all indices i ∈ [ N ], the Hedge algorithm guarantees:
T T
ln N
∑ ⟨ pt , ℓt ⟩ ≤ ∑ ℓit + εT + ε
i =1 t =1
online learning: experts and bandits 175
1 1
T ∑⟨ pt , ℓt ⟩ ≤ min
i T ∑ ℓit + ε
t t
1
= min
p∗ ∈∆ N T
∑ ℓt , p∗ + ε.
t
Corollary 13.9 (Average Gain). Let ρ ≥ 1 and ε ∈ (0, 1/2). For any
4ρ2 ln N
sequence of gain vectors g1 , . . . , g T ∈ [−ρ, ρ] N with T ≥ ε2 , the gains
version of the Hedge algorithm produces probability vectors pt ∈ ∆ N such
that
1 T
t t 1 T
t
∑
T t =1
g , p ≥ max ∑ g , ei − ε.
i ∈[ N ] T t=1
176 optional: the bandit setting
However, now the algorithm only gets to see the loss ℓtat corre-
sponding to the action chosen by the algorithm, and not the entire
loss vector.
This limited-information setting is called the bandit setting. The name comes from the analysis of
slot machines, which are affectionately
known as “one-armed bandits”.
13.5.1 The Exp3 Algorithm
Surprisingly, we can obtain algorithms for the bandit setting from
algorithms for the experts setting, by simply “hallucinating” the cost
vector, using an idea called importance sampling. This causes the
parameters to degrade, however.
Indeed, consider the following algorithm: we run an instance A
of the RWM algorithm, which is in the full information model. So at
each timestep,
3. We get back the loss value ℓtI t for this chosen expert.
However, the LHS is not our real loss, since we chose I t according to
qt and not pt . This means our expected total loss is really
γ
∑ qt , ℓt = (1 − γ) ∑ pt , ℓt + N ∑ 1, ℓt
t t t
N log N
≤ ∑ ℓit + εT + + γT.
t γ ε
q √ log N 1/4
log N
Now choosing ε = T and γ = N T gives us a regret
of ≈ N 1/2 T 3/4 . The interesting fact here is that the regret is again
sub-linear in T, the number of timesteps: this means that as T → ∞,
the per-step regret tends to zero.
The dependence on N, the number of experts/options, is now
polynomial, instead of being logarithmic as in the full-information
√
case. This is necessary: there is a lower bound of Ω( NT ) in the
bandit setting. And indeed, the Exp3 algorithm itself achieves a near-
p
optimal regret bound of O( NT log N ); we can show this by using a
finer analysis of Hedge that makes more careful approximations. We
defer these improvements for now, and instead give an application of
this bandit setting to a problem in item pricing.
178 optional: the bandit setting
We can now use the low-regret algorithms for the experts problem to
show how to approximately solve linear programs (LPs). As a warm-
up, we use it to solve two-player zero-sum games, which are a special
case of LPs. In fact, zero-sum games are equivalent
to linear programming, see this work of
Ilan Adler. Is there an earlier reference?
14.1 (Two-Player) Zero-Sum Games
There are two players in such a game, traditionally called the “row
player" and the “column player". Each of them has some set of ac-
tions: the row player with m actions (associated with the set [m]), and
the column player with the n actions in [n]. Finally, we have a payoff
matrix M ∈ Rm×n . In a play of the game, the row player chooses a
row i ∈ [m], and simultaneously, the column player chooses a column
j ∈ [n]. If this happens, the row player gets Mi,j , and the column
player loses Mi,j . The winnings of the two players sum to zero, and
so we imagine that the payoff is from the row player to the column
player. Henceforth, when we talk about pay-
offs, these will always refer to payoffs to
the row player from the column player.
14.1.1 Strategies, and Best-Response This payoff may be negative, which
would capture situations where the
Each player is allowed to have a randomized strategy. Given strate- column player does better.
gies p ∈ ∆m for the row player, and q ∈ ∆n for the column player, the
expected payoff (to the row player) is
The row player wants to maximize this value, while the column
player wants to minimize it.
Suppose the row player fixes a strategy p ∈ ∆m . Knowing p, the
column player can choose an action to minimize the expected payoff:
C ( p) := min p⊺ Mq = min p⊺ Me j .
q∈∆n j∈[n]
180 (two-player) zero-sum games
The equality holds because the expected payoff is linear, and hence
the column player’s best strategy is to choose a column that mini-
mizes the expected payoff. The column player is said to be playing
their best response. Analogously, if the column player fixes a strategy
q ∈ ∆n , the row player can maximize the expected payoff by playing
their own best response:
Now, the row player would love to play the strategy p such that
even if the column player plays best-response, the payoff is as large
as possible: i.e., it wants to achieve
max C ( p).
p∈∆m
min R(q).
q∈∆n
C ( p) ≤ R(q) (14.1)
pt ∈ ∆m . Initially p1 = m1 , . . . , m1 , which represents that the row
player chooses each row with equal probability, when they have no
information to work with.
At each time t, the column player plays the best-response to pt , i.e.,
jt := arg max ( pt )⊺ Me j .
j∈[n]
to be the average long-term plays of the row player, and of the best
responses of the column player to those plays. We know that
C ( pb) ≤ R(qb)
4 ln m
by (14.1). But by Corollary 13.9, after T ≥ ε2
steps,
1 1
T ∑⟨ pt , gt ⟩ ≥ max
i T∑
ei , g t − ε (by Hedge)
t t
D 1 E
= max ei , ∑ gt − ε
i T t
D 1 E
= max ei , M
i
∑
T t
e jt −ε (by definition of gt )
= max⟨ei , Mb
y⟩ − ε
i
= R(qb) − ε.
Since pt is the row player’s strategy, and C is concave (i.e., the payoff
on the average strategy pb is no more than the average of the payoffs: To see this, recall that
1 1 1 C ( p) := min p⊺ Mq.
T ∑ ⟨ pt , gt ⟩ = ∑ C ( pt ) ≤ C
T T ∑ pt = C ( xb). q
We assume that ρ ≥ 1.
⟨ pt , gt ⟩ = ⟨ pt , Ax t − b⟩
= ⟨ pt , Ax t ⟩ − ⟨ pt , b⟩
= ⟨αt , x t ⟩ − βt ≤ 0,
t
α , x ≤ βt . Averaging over all times, the left hand side of (14.5) is
T
1
T ∑ ⟨ pt , gt ⟩ ≤ 0.
t =1
1 T
D 1 T E
T ∑ ei , g t = ei ,
T ∑ gt
t =1 t =1
1 T
=
T ∑ ai , xbt − bi
t =1
= ⟨ ai , xb⟩ − bi .
T
1
0≥
T ∑ ⟨ pt , gt ⟩ ≥ max
i
⟨ ai , xb⟩ − bi − ε.
t =1
x ≤ b + ε1.
This shows that Ab
x ≤ b + (ε + δ)1,
Ab
but now the number of calls to the relaxed oracle can be even
smaller, namely O(ρ2rlx ln m/ε2 ).
In the s-t maximum flow problem, we are given a graph G = (V, E), and
distinguished vertices s and t. Each edge has a capacities ue ≥ 0; we
will mostly focus on the unit-capacity case of ue = 1 in this chapter.
The graph may be directed or undirected; an undirected edge can be
modeled by two oppositely directed edges having the same capacity.
Recall that an s-t flow is an assignment f : E → R+ such that
the maximum flow in time O(m · min(m1/2 , n2/3 )). This runtime
was eventually matched for general capacities (up to some poly-
logarithmic factors) by an algorithm of Andrew Goldberg and Satish
Rao in 1998. For the special case of m = O(n), these results gave a
runtime of O(m1.5 ), but nothing better was known even for approx-
imate max-flows, even for unit-capacity undirected graphs—until a
breakthrough in 2010, which we will see at the end of this chapter.
max ∑ fP (15.1)
P∈P
∑ f P ≤ ue ∀e ∈ E
P:e∈ P
fP ≥ 0 ∀P ∈ P
The first set of constraints says that for each edge e, the contribution
of all possible flows is no greater than the capacity ue of that edge.
The second set of constraints say that the contribution from each path
must be non-negative. This is a gigantic linear program: there could
be an exponential number of s-t paths. As we see, this will not be a
hurdle.
K := { f | ∑ f p = F, f ≥ 0}.
P∈P
∑ f P lent ( P) ≤ 1, (15.3)
P∈P
3. it increases the length of each edge on this path multiplicatively. The factor happens to be (1 + ε/F ), be-
cause of how we rescale the gains, but
This length-increase makes congested edges (those with a lot of flow) that does not matter for this intuition.
be much longer, and hence become very undesirable when search-
ing for short paths. Note that the process is repeated some number
of times, and then we average all the flows we find. So unlike usual
network flow algorithms based on residual networks, these algo-
rithms are truly greedy and cannot “undo” past actions (which is
what pushing flow in residual flow networks does, when we use an
arc backwards). This means these MW-based algorithms must ensure
that very little flow goes on edges that are “wasteful”.
To illustrate this point, consider an example commonly used to
show that the greedy algorithm does not work for max-flow: Change
the figure to make it more instructive.
undirected graphs. Since then, works by Jonah Sherman, and Kelner Sherman (2013)
et al. gave O(m1+o(1) /εO(1) )-time algorithms for the problem. The Kelner, Lee, Oracchia, Sidford (2013)
current best runtime is O(m poly log m/εO(1) )-time, due to Richard Peng (2014)
Peng. Interestingly, Shang-Hua Teng, Jonah
Sherman, and Richard Peng are all
CMU graduates.
15.3.1 Electrical Flows
Given a connected undirected graph with general edge-capacities, we
can view it as an electrical circuit, where each edge e of the original 𝜑 𝑠 =1 𝜑 𝑡 =0
ϕu − ϕv
f uv = .
ruv
For example, if we take the 6-node graph in Figure 15.1 and assume
that all edges have unit conductance, then its Laplacian LG matrix is:
s t u v w x
s 2 0 −1 −1 0 0
t 0 2 0 0 −1 −1
u −1 0 3 0 −1 −1
.
LG =
v
−1 0 0 2 0 −1
w 0 −1 −1 0 2 0
x 0 −1 −1 −1 0 3
Now for a general graph G, we define the Laplacian to be: This Laplacian for the single edge uv
has 1s on the diagonal at locations
(u, u), (v, v), and −1s at locations
LG = ∑ Luv . (u, v), (v, u). Draw figure.
uv∈ E
In other words, LG is the sum of little ‘per-edge’ Laplacians Luv . A symmetric matrix A ∈ Rn×n is called
(Since each of those Laplacians is clearly positive semidefinite (PSD), PSD if x⊺ Ax ≥ 0 for all x ∈ Rn , or
equivalently, if all its eigenvalues are
it follows that LG is PSD too.) non-negative.
For yet another definition for the Laplacian, first consider the
edge-vertex incidence matrix B ∈ {−1, 0, 1}m×n , where the rows are
indexed by edges and the columns by vertices. The row correspond-
ing to edge e = uv has zeros in all columns other than u, v, it has
an entry +1 in one of those columns (say u) and an entry −1 in the
approximate max-flows using experts 193
su sv uw ux vx wt xt
s 1 1 0 0 0 0 0
t 0 0 0 0 0 −1 −1
u −1 0 1 1 0 0 0 .
B=
v
0 −1 0 0 1 0 0
w 0 0 −1 0 0 1 0
x 0 0 0 −1 −1 0 1
A little algebra shows this to be the vth entry of the vector Lϕ. Finally,
by 15.4, this net current into v must be zero, unless v is either s or t,
in which case it is either −k or k respectively. Summarizing, if ϕ are
the voltages at the nodes, they satisfy the linear system:
Lϕ = k(es − et ).
(ϕu − ϕv )2
E ( f ) := ∑ f e2 re = ∑ r uv
= ϕ⊺ Lϕ.
e∈ E (u,v)∈ E
∥ L x̂ − b∥ L ≤ ε ∥ x̄ ∥ L .
The algorithm is randomized? and runs in time O(m log2 n log 1/ε).
E ( f ) ≤ (1 + δ)E ( fe),
approximate max-flows using experts 195
For the rest of this lecture we assume we can compute the corre-
sponding minimum-energy flow exactly in time O e (m). The arguments
can easily be extended to incorporate the errors.
K = {f | ∑ f P = F, f ≥ 0},
P∈P
∑ pe f e ≤ 1. (15.4)
e
∑ pe f e ≤ (1 + ε) ∑ pe + ε = 1 + 2ε.
e∈ E e∈ E
√
2. (width) maxe f e ≤ O( m/ε).
Proof. Since the flow f ∗ satisfies all the constraints, it burns energy
ε
E ( f ∗ ) = ∑( f e∗ )2 re ≤ ∑ re = ∑( pe + ) = 1 + ε.
e e e m
This proves the first part of the theorem. For the second part, we may
use the bound on energy burnt to obtain
ε ε
∑ f e2 m ≤ ∑ f e2 pe + m = ∑ f e2 re = E ( f ) ≤ 1 + ε.
e e e
The idea to get an improved bound on the width is to use a crude but
effective trick: if we have an edge with electrical flow of more than
approximate max-flows using experts 197
ρ ≈ m1/3 in some iteration, we delete it for that iteration (and for the
rest of the process), and find a new flow. Clearly, no edge now carries
a flow more than ρ. The main thrust of the proof is to show that we
do not end up butchering the graph, and that the maximum flow
value reduces by only a small amount due to these edge deletions.
Formally, we set:
m 1/3 log m
ρ= . (15.5)
ε
and show that at most εF edges are ever deleted by the process. The
crucial ingredient in this proof is this observation: every time we
delete an edge, the effective resistance between s and t increases by a
lot. We assume that a flow value of F is
Since we need to argue about how many edges are deleted in the feasible; moreover, F ≥ ρ, else Ford-
Fulkerson can be implemented in time
entire algortihm (and not just in one call to the oracle), we explic- O(mF ) ≤ O e (m4/3 ).
itly maintain edge-weights wet , instead of using the results from the
previous sections as a black-box.
′
Reff ≥ Reff .
′ Reff
Reff ≥( ).
1−β
198 o ( m 4/3 ) -time algorithm
optional: an e
2. If there is an edge e with f et > ρ, delete e (for the rest of the algo-
rithm), and go back to Item 1.
Lemma 15.10. We delete at most m1/3 ≤ εF edges over the run of the
algorithm.
We defer the proof to later, and observe that the total number of
electrical flows computed is therefore O( T ). Each such computation
takes Oe (m/ε) by Corollary 15.6, so the overall runtime of our algo-
rithm is O(m4/3 / poly(ε)).
Next, we show that the flow fb is an (1 + O(ε)-approximate maxi-
mum s-t flow. We start with an analog of Theorem 15.7 that accounts
for edge deletions.
The last step is very loose, but it will suffice for our purposes.
To calculate the congestion of the final flow, observe that even
though the algorithm above explicitly maintains weights, we can just
wt
appeal directly to the guarantees . Indeed, define p te : = Wet for each
time t; the previous part implies that the flow f t satisfies
∑ p te f et ≤ 1 + 3ε
e
3. Each deleted edge e has flow at least ρ, and hence energy burn at
least ( ρ 2 ) w et ≥ ( ρ 2 ) mε W t . Since the total energy burn is at most
2W t from Lemma 15.11, the deleted edge e was burning at least
ρ2 ε
β := 2m fraction of the total energy. Hence
ol d
new R eff ol d ρ2 ε
R eff ≥ ρ2 ε
≥ R eff · exp
(1 − 2m
2m )
1
if we use 1− x ≥ e x/2 when x ∈ [ 0, 1/4 ] .
4. For the final effective resistance, note that we send F flow with
total energy burn 2W T ; since the energy depends on the square of
f inal T
the flow, we have R eff ≤ 2W F2
≤ 2W T .
(All these calculations hold as long as we have not deleted more than
ε F edges.) Now, to show that this invariant is maintained, suppose D
edges are deleted over the course of the T steps. Then
0 ρ2 ε f inal T 2 ln m
R eff exp D · ≤ R eff ≤ 2W ≤ 2m · exp .
2m ε
Taking logs and simplifying, we get that
ερ 2 D 2 ln m
≤ ln ( 2m 3 ) +
2m ε
2m ( ln m )( 1 + O ( ε ))
=⇒ D ≤ 2 ≪ m 1/3 ≤ εF.
ερ ε
This bounds the number of deleted edges D as desired.
E(f) = ∑ f e2 r e
e
for flow values f that represent a unit flow from s to t (these form
a polytope). We alluded to algorithms that solve this problem, but
one can also observe that E ( f ) is a convex function, and we want to
find a minimizer within some polytope K. Equivalently, we wanted
to solve the linear system
Lϕ = ( e s − e t ) ,
∥ Lϕ − ( e s − e t )∥ 2 .
f ( λx + ( 1 − λ ) y ) ≤ λ f ( x ) + ( 1 − λ ) f ( y ) , (16.2)
y
λ f ( x ) + (1 − λ ) f ( y )
There are two kinds of problems that we will study. The most
basic question is that of unconstrained convex minimization (UCM):
x
given a convex function f , we want to find x λx + (1 − λ)y y
min f ( x ).
x ∈Rn
16.1.1 Gradient
For most of the following discussion, we assume that the function f
is differentiable. In that case, we can give an equivalent characteriza-
tion, based on the notion of the gradient ∇ f : Rn → Rn . The directional derivative of f at x (in the
direction y) is defined as
Fact 16.3 (First-order condition). A function f : K → R is convex if
f ( x + εy) − f ( x )
and only if f ′ ( x; y) := lim .
ε →0 ε
f (y) ≥ f ( x ) + ⟨∇ f ( x ), y − x ⟩ , (16.3) If there exists a vector g such that
⟨ g, y⟩ = f ′ ( x; y) for all y, then f is called
for all x, y ∈ K. differentiable at x, and g is called the
gradient. It follows that the gradient
Geometrically, Fact 16.3 states that the function always lies above must be of the form
its tangent plane, for all points in K. If the function f is twice-differentiable,
∂f ∂f ∂f
∇ f (x) = ( x ), ( x ), · · · , (x) .
and if H f ( x ) is its Hessian matrix, i.e. its matrix of second derivatives ∂x1 ∂x2 ∂xn
at x ∈ K:
∂2 f
( H f )i,j ( x ) := ( x ), (16.4)
∂xi ∂x j
then we get yet another characterization of convex functions.
Fact 16.4 (Second-order condition). A twice-differentiable function f
is convex if and only if H f ( x ) is positive semidefinite for all x ∈ K.
| f ( x ) − f (y)| ≤ G ∥ x − y∥ ,
for all x, y ∈ K.
the gradient descent framework 205
∥∇ f ( x )∥2 ≤ G, (16.5)
for all x ∈ K.
∇ f ( x ) = 0 ⇐⇒ Ax = b ⇐⇒ x = A−1 b.
x t +1 ← x t − η t · ∇ f ( x t ). (16.6)
206 unconstrained convex minimization
f ( xb) ≤ f ( x ∗ ) + ε. (16.7)
In particular, this holds when x ∗ is a minimizer of f .
The core of this proposition lies in the following theorem
Theorem 16.8. Let f : Rn → R be convex, differentiable and G-Lipschitz.
Then the gradient descent algorithm ensures that
T T
1 1
∑ f (xt ) ≤ ∑ f (x∗ ) + 2 ηTG2 + 2η ∥ x0 − x∗ ∥2 . (16.8)
t =1 t =1
We will prove Theorem 16.8 in the next section, but let’s first use it
to prove Proposition 16.7.
T t∑ ∑ f ( x t ).
f ( xb) = f x t ≤
=1 T t =1
the gradient descent framework 207
By Theorem 16.8,
T
1 1 1
T ∑ f ( xt ) ≤ f ( x ∗ ) + ηG2 +
2 2ηT
∥ x0 − x ∗ ∥2 .
t =1 | {z }
error
∥ x0 − ∗
The error terms balance when η = √x ∥ , giving
G T
∥ x0 − x ∗ ∥ G
f ( xb) ≤ f ( x ∗ ) + √ .
T
Finally, we set T = 1 2
ε2
G ∥ x0 − x ∗ ∥2 to obtain
f ( xb) ≤ f ( x ∗ ) + ε.
∥ x t − x ∗ ∥2
Φt := . (16.9)
2η
1
Φ t +1 − Φ t = ∥ x t +1 − x ∗ ∥ 2 − ∥ x t − x ∗ ∥ 2 (16.10)
2η
1
= 2 ⟨x − x t , x t − x ∗ ⟩ + ∥ x t +1 − x t ∥ 2 ;
2η | t+1 {z } | {z }
⟨b,a⟩ ∥ b ∥2
1
= 2 ⟨−η ∇ f ( xt ), xt − x ∗ ⟩ + ∥η ∇ f ( xt )∥2 .
2η
1
f ( xt ) + (Φt+1 − Φt ) ≤ f ( xt ) + ⟨∇ f ( xt ), x ∗ − xt ⟩ + ηG2
2
Since f is convex, we know that f ( xt ) + ⟨∇ f ( xt ), x ∗ − xt ⟩ ≤ f ( x ∗ ).
Thus, we conclude that
1
f ( xt ) + (Φt+1 − Φt ) ≤ f ( x ∗ ) + ηG2 .
2
Now that we understand how our potential changes over time,
proving the theorem is straightforward.
1
f ( xt ) + (Φt+1 − Φt ) ≤ f ( x ∗ ) + ηG2 .
2
Summing over t = 1, . . . , T,
T T T
1
∑ f (xt ) + ∑ (Φt+1 − Φt ) ≤ ∑ f (x∗ ) + 2 ηG2 T
t =1 t =1 t =1
T T
1
∑ f (xt ) + ΦT+1 − Φ1 ≤ ∑ f (x∗ ) + 2 ηG2 T
t =1 t =1
T T
1
∑ f (xt ) − Φ1 ≤ ∑ f (x∗ ) + 2 ηG2 T
t =1 t =1
Unlike the unconstrained case, the gradient at the minimizer may not
be zero in the constrained case—it may be at the boundary. In this This is the analog of the minimizer of a
case, the condition for a convex function f : K → R to be minimized single variable function being achieved
either at a point where the derivative is
at x ∗ ∈ K is now zero, or at the boundary.
function values. But we must change our algorithm to ensure that the
new point xt+1 lies within K. To ensure this, we simply project the
new iterate xt+1 back onto K. Let projK : Rn → K be defined as
1
Φ t +1 − Φ t = ∥ x t +1 − x ∗ ∥ 2 − ∥ x t − x ∗ ∥ 2 (16.13)
2η
1
Φ t +1 − Φ t = ∥ xt′ +1 − x ∗ ∥2 − ∥ xt − x ∗ ∥2 . (16.14)
2η
Now the rest of the proof of Theorem 16.8 goes through unchanged.
Why is the claim
xt′ +1 − x ∗
≥ ∥ xt+1 − x ∗ ∥ true? Since K is
convex, projecting onto it gets us closer to every point in K, in particular
to x ∗ ∈ K. To formally prove this fact about projections, consider
the angle x ∗ → xt+1 → xt′ +1 . This is a non-acute angle, since the
orthogonal projection means K likes to one side of the hyperplane
defined by the vector xt′ +1 − xt+1 , as in the figure on the right.
the gradient descent framework 211
Looking back at the proof in §16.2, the proof of Lemma 16.9 immedi-
ately extends to give us
1
f t ( xt ) + Φt+1 − Φt ≤ f t ( x ∗ ) + ηG2 .
2
Now summing this over all times t gives
T T 1
∑ f t ( xt ) − f t ( x ∗ ) ≤ ∑ Φt − Φt+1 + ηTG2
2
t =1 t =1
1
≤ Φ1 + ηTG2 ,
2
∥ x1 − x ∗ ∥2 G 2
since Φ T +1 ≥ 0. The proof is now unchanged: setting T ≥ ε2
∥ x1 − ∗
and η = √x ∥ , and doing some elementary algebra as above,
G T
1 T ∥ x − x∗ ∥G
T ∑ f t ( xt ) − f t ( x ∗ ) ≤ 1 √
T
≤ ε.
t =0
for all convex bodies K and all convex functions, as opposed to being
just for the unit simplex ∆n and linear losses f t ( x ) = ⟨ℓt , x ⟩, say for
ℓt ∈ [−1, 1]n . However, in order to make a fair comparison, suppose
we restrict ourselves to ∆n and linear losses, and consider the number
of rounds T before we get an average regret of ε.
α
f (y) ≥ f ( x ) + ⟨∇ f ( x ), y − x ⟩ + ∥ x − y ∥2 . (16.15)
2
We will work with the first-order definition, and show that the
1
gradient descent algorithm with (time-varying) step size ηt = O αt
2
converges to a value at most f ( x ∗ ) + ε in time T = Θ( Gαε ). Note there
is no more dependence on the diameter of the polytope. Before we
give this proof, let us give the other relevant definitions.
β
f (y) ≤ f ( x ) + ⟨∇ f ( x ), y − x ⟩ + ∥ x − y ∥2 . (16.16)
2
In this case, the gradient descent algorithm with fixed step size
ηt = η = O β1 yields an xb which satisfies f ( xb) − f ( x ∗ ) ≤ ε when
β ∥ x1 − x ∗ ∥
T = Θ ε . In this case, note we have no dependence on the
Lipschitzness G any more; we only depend on the diameter of the
polytope. Again, we defer the proof for the moment.
β −t
f ( xt ) − f ( x ∗ ) ≤ exp ∥ x1 − x ∗ ∥2 .
2 κ
Proof. For β-smooth f , we can use Definition 16.12 to get
β
f ( xt+1 ) ≤ f ( xt ) − η ∥∇ f ( xt )∥2 + η 2 ∥∇ f ( xt )∥2 .
2
214 stronger assumptions
1
f ( x t +1 ) − f ( x t ) ≤ − ∥∇ f ( xt )∥2 . (16.17)
2β
α
f ( xt ) − f ( x ∗ ) ≤ ⟨∇ f ( xt ), xt − x ∗ ⟩ − ∥ x t − x ∗ ∥2 ,
2
α
≤ ∥∇ f ( xt )∥ ∥ xt − x ∗ ∥ − ∥ xt − x ∗ ∥2 ,
2
1 2
≤ ∥∇ f ( xt )∥ , (16.18)
2α
16.6.1 Subgradients
What if the convex function f is not differentiable? Staring at the
proofs above, all we need is the following:
16.6.3 Acceleration
setting the gradient of the function to zero; this gives us the expres-
sion
η · ∇ f t ( x t ) + ( x t +1 − x t ) = 0 =⇒ x t +1 = x t − η · ∇ f t ( x t ),
Dh (y∥ x ) = 21 ∥y − x ∥2 ,
Again, setting the gradient at xt+1 to zero (i.e., the optimality condition for xt+1) now gives:

∇h(xt+1) = ∇h(xt) − η ∇ft(xt)    (17.3)

or, rephrasing,

xt+1 = (∇h)⁻¹( ∇h(xt) − η ∇ft(xt) ).    (17.4)
1. When h(x) = (1/2)∥x∥², the gradient is ∇h(x) = x. So we get

xt+1 = xt − η ∇ft(xt),
The name of the process comes from thinking of the dual space as be-
ing a mirror image of the primal space. But how do we choose these mir-
ror maps? Again, this comes down to understanding the geometry
of the problem, the kinds of functions and the set K we care about,
and the kinds of guarantees we want. In order to discuss these, let us
discuss the notion of norms in some more detail.
The condition for f being G-Lipschitz with respect to the norm ∥·∥ can then be written in terms of the dual norm as ∥∇f(x)∥∗ ≤ G for all x ∈ Rn.
∇h : Rn → Rn.
(iii) Map θt+1 back to the primal space: x′t+1 ← (∇h)⁻¹(θt+1).

(iv) Project x′t+1 back into the feasible region K by using the Bregman divergence: xt+1 ← arg min_{x∈K} Dh(x∥x′t+1). In case x′t+1 ∈ K, e.g., in the unconstrained case, we get xt+1 = x′t+1.
Note that the choice of h affects almost every step of this algorithm.
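To make these four steps concrete, here is a minimal sketch (an illustrative example, not from the notes) of one mirror-descent step with the entropy mirror map h(x) = ∑i xi ln xi over the simplex, for which ∇h(x) = 1 + ln x, (∇h)⁻¹(θ) = exp(θ − 1), and the Bregman projection onto the simplex is just a renormalization:

import numpy as np

def md_entropy_step(x, g, eta):
    theta = 1.0 + np.log(x)            # (i)  map x_t to the dual space
    theta = theta - eta * g            # (ii) take the gradient step there
    x_prime = np.exp(theta - 1.0)      # (iii) map back to the primal space
    return x_prime / x_prime.sum()     # (iv) Bregman-project onto K = simplex

x = np.ones(4) / 4                     # start at the uniform distribution
g = np.array([1.0, 0.0, 0.0, 0.0])     # a made-up linear loss gradient
print(md_entropy_step(x, g, eta=0.5))  # mass shifts away from coordinate 1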
KL(x∗∥x1)/η + ηT/(2/ln 2) ≤ (ln n)/η + ηT.
The last inequality uses that the KL divergence to the uniform dis-
tribution is at most ln n. (Exercise!) In fact, this suggests a way to
improve the regret bound: if we start with a distribution x1 that is
closer to x ∗ , the first term of the regret gets smaller.
Φt = Dh(x∗∥xt)/η.
The last inequality above uses that the Bregman divergence is always
non-negative for convex functions.
To complete the proof, it remains to show that the remaining term in inequality (17.6) can be made (η/2α) ∥∇ft(xt)∥²∗. Let us focus on the unconstrained case, where xt+1 = x′t+1. The calculations below are fairly routine, and can be skipped at the first reading:
Φt+1 − Φt = (1/η) ( Dh(x∗∥xt+1) − Dh(x∗∥xt) )
= (1/η) ( h(x∗) − h(xt+1) − ⟨∇h(xt+1), x∗ − xt+1⟩ − h(x∗) + h(xt) + ⟨∇h(xt), x∗ − xt⟩ )    (writing θt+1 := ∇h(xt+1) and θt := ∇h(xt))
= (1/η) ( h(xt) − h(xt+1) − ⟨θt − η∇ft(xt), x∗ − xt+1⟩ + ⟨θt, x∗ − xt⟩ )    (using θt+1 = θt − η∇ft(xt))
= (1/η) ( h(xt) − h(xt+1) − ⟨θt, xt − xt+1⟩ + η⟨∇ft(xt), x∗ − xt+1⟩ )
≤ (1/η) ( −(α/2)∥xt+1 − xt∥² + η⟨∇ft(xt), x∗ − xt+1⟩ ).    (by α-strong convexity of h w.r.t. ∥·∥)

Hence

ft(xt) − ft(x∗) + (Φt+1 − Φt)
≤ ft(xt) − ft(x∗) − (α/2η)∥xt+1 − xt∥² + ⟨∇ft(xt), x∗ − xt+1⟩
= ( ft(xt) − ft(x∗) + ⟨∇ft(xt), x∗ − xt⟩ ) − (α/2η)∥xt+1 − xt∥² + ⟨∇ft(xt), xt − xt+1⟩,

where the first parenthesized term is ≤ 0 by convexity of ft, so this is

≤ −(α/2η)∥xt+1 − xt∥² + ∥∇ft(xt)∥∗ ∥xt − xt+1∥    (by Corollary 17.4)
≤ −(α/2η)∥xt+1 − xt∥² + (1/2)( (η/α)∥∇ft(xt)∥²∗ + (α/η)∥xt − xt+1∥² )    (by AM-GM)
= (η/2α) ∥∇ft(xt)∥²∗.
This completes the proof of Theorem 17.7. As you can see, it is syntactically similar to the original proof for gradient descent, just using more general language. In order to extend this to the constrained case, we will need to show that if x′t+1 ∉ K and xt+1 = arg min_{x∈K} Dh(x∥x′t+1), then

Dh(x∗∥xt+1) ≤ Dh(x∗∥x′t+1)
xt+1 = xt − η Hh(xt)⁻¹ ∇f(xt).
Some of you may have seen Newton’s method for minimizing convex
functions, which has the following update rule:
xt+1 = xt − η Hf(xt)⁻¹ ∇f(xt).
This means mirror descent replaces the Hessian of the function itself
by the Hessian of a strongly convex function h. Newton’s method has
very strong convergence properties (it gets error ε in O(log log 1/ε)
iterations!) but is not “robust”—it is only guaranteed to converge
when the starting point is “close” to the minimizer. We can view
mirror descent as trading off the convergence time for robustness.
where vol(K ) is the volume of the set K. The following lemma cap-
tures the crucial fact about the center-of-gravity that we use in our
algorithm.
f ( xb) − f ( x ∗ ) ≤ ε.
K δ := {(1 − δ) x ∗ + δx | x ∈ K }
f(y) = f((1 − δ)x∗ + δx) ≤ (1 − δ) f(x∗) + δ f(x) ≤ (1 − δ) f(x∗) + δB = f(x∗) + δ(B − f(x∗)) ≤ f(x∗) + 2δB,

using f(x∗) ≥ −B in the last step.
The Ellipsoid algorithm is usually attributed to Naum Shor; the fact N. Z. Šor and N. G. Žurbenko (1971)
that this algorithm gives a polynomial-time algorithm for linear pro-
gramming was a breakthrough result due to Khachiyan, and was Khachiyan (1979)
front page news at the time. A great source of information about
this algorithm is the Grötschel-Lovász-Schrijver book. A historical Grötschel, Lovász, and Schrijver (1988)
perspective appears in this survey by Bland, Goldfarb, and Todd.
Let us mention some theorem statements about the Ellipsoid algo-
rithm that are most useful in designing algorithms. The second-most
important theorem is the following. Recall the notion of an extreme
point or basic feasible solution (bfs) from §7.1.2. Let ⟨A⟩, ⟨b⟩, ⟨c⟩ denote the number of bits required to represent A, b, c respectively.
min{c⊺ x | x ∈ K }
3. Finally, after T = 2n(n + 1) ln( R/r ) rounds either we have not seen
any point in K—in which case we say “K is empty”—or else we
output
xb ← arg min{ f (ct ) | ct ∈ Kt , t ∈ 1 . . . T }.
Now adapting the analysis from the previous sections gives us the
following result (assuming exact arithmetic again):
Theorem 18.7 (Idealized Convex Minimization using Ellipsoid).
Given K, r, R as above (and a strong separation oracle K), and a function
f : K → [− B, B], the Ellipsoid algorithm run for T steps either correctly
reports that K = ∅, or else produces a point xb such that
f(x̂) − f(x∗) ≤ (2BR/r) · exp( −T/(2n(n+1)) ).
Note the similarity to Theorem 18.2, as well as the differences: the
exponential term is slower by a factor of 2(n + 1). This is because
the volume of the successive ellipsoids shrinks much slower than
in Grünbaum’s lemma. Also, we lose a factor of R/r because K is
potentially smaller than the starting body by precisely this factor.
(Again, this presentation ignores precision issues, and assumes we
can do exact real arithmetic.)
L(Ball(0, 1)) = { Lx : x⊺ x ≤ 1}
= { y : ( L −1 y )⊺ ( L −1 y ) ≤ 1 }
= {y : y⊺ ( LL⊺ )−1 y ≤ 1}
= { y : y ⊺ Q −1 y ≤ 1 } ,
{ y + c : y⊺Q⁻¹y ≤ 1 } = { y : (y − c)⊺Q⁻¹(y − c) ≤ 1 }.
vol(Et+1)/vol(Et) ≤ e^{−1/(2(n+1))}.
ct+1 := ct − (1/(n+1)) · h

and

Qt+1 = (n²/(n² − 1)) · ( Qt − (2/(n+1)) hh⊺ ),

where h = Qt at / √(a⊺t Qt at).
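Here is a small numerical sketch of this update (not from the notes), assuming the convention that we keep the half-ellipsoid {x : a⊺x ≤ a⊺ct}; the starting ball and cut direction are made up:

import numpy as np

def ellipsoid_step(c, Q, a):
    n = len(c)
    h = Q @ a / np.sqrt(a @ Q @ a)     # the vector h from the update above
    c_new = c - h / (n + 1)
    Q_new = (n**2 / (n**2 - 1.0)) * (Q - (2.0 / (n + 1)) * np.outer(h, h))
    return c_new, Q_new

c, Q = np.zeros(2), np.eye(2)                      # start with the unit ball
c, Q = ellipsoid_step(c, Q, np.array([1.0, 0.0]))  # cut away {x : x_1 > 0}
print(c, np.linalg.det(Q))       # determinant drops: the volume has shrunk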
(1 − c1)²/a² ≤ 1  and  c1²/a² + 1/b² ≤ 1.

Suppose these two inequalities are tight; then we get

a = 1 − c1,  b = √( (1 − c1)² / ((1 − c1)² − c1²) ) = √( (1 − c1)² / (1 − 2c1) ),

and moreover the ratio of the volume of the ellipsoid to that of the ball is

a · bⁿ⁻¹ = (1 − c1) · ( (1 − c1)²/(1 − 2c1) )^{(n−1)/2}.

This is minimized by setting c1 = 1/(n+1), which gives us

vol(E)/vol(Ball(0,1)) = · · · ≤ e^{−1/(2(n+1))}.
For a more detailed description and proof of this process, see these notes from our LP/SDP course.
In fact, we can directly pose the question of finding the minimum-volume ellipsoid that contains the half-ball K: this is a convex program, and looking at its optimality conditions gives us the same construction as above (without having to make the assumptions of symmetry).
Simplex: This is perhaps the first algorithm for solving LPs that most
of us see. It was also the first general-purpose linear program
solver known, having been developed by George Dantzig in 1947.
This is a local-search algorithm: it maintains a vertex of the poly-
hedron K, and at each step it moves to a neighboring vertex with-
out decreasing the objective function value, until it reaches an op-
timal vertex. (The convexity of K ensures that such a sequence of
steps is possible.) The strategy to choose the next vertex is called
the pivot rule. Unfortunately, for most known pivot rules, there are examples on which following the pivot rule takes an exponential (or at least a super-polynomial) number of steps. Despite that, it is often used in practice: e.g., the Excel software contains an implementation of simplex.
For details and references, see this survey by Martin Dyer, Nimrod
Megiddo, and Emo Welzl.
x≥0
We will only sketch the high-level idea behind Step 1 (finding the
starting solution), and will skip Step 2 (the rounding); our focus will be on the path-following iterations themselves.
19.1.1 The Primal and Dual LPs, and the Duality Gap
Recall the primal linear program:
( P) min c⊺ x
Ax = b
x ≥ 0,
( D ) max b⊺ y
A⊺ y ≤ c.
( D ′ ) max b⊺ y
A⊺ y + s = c
s ≥ 0.
We assume that both the primal (P) and dual (D) are strictly feasible: i.e., they have solutions even if we replace the inequalities with strict ones. Then we can prove the following result, which relates the
optimizer for f η to feasible primal and dual solutions:
Ax − b = 0 (19.1)
⊺
A y+s = c (19.2)
∀i ∈ [ n ] : si xi = η (19.3)
The conditions (19.1) and (19.2) show that x and (y, s) are feasible
for the primal ( P) and dual ( D ′ ) respectively. The condition (19.3)
is an analog of the usual complementary slackness result that arises
when η = 0. To prove this lemma, we use the method of Lagrange multipliers. (Observe: we get that if there exists a maximum x∗, then x∗ satisfies these conditions.)

Theorem 19.2 (The Method of Lagrange Multipliers). Let functions f and g1, · · · , gm be continuously differentiable, and defined on some open subset of Rn. If x∗ is a local optimum of the following optimization problem

min f(x)
s.t. ∀i ∈ [m] : gi(x) = 0,

then there exists y∗ ∈ Rm such that ∇f(x∗) = ∑_{i=1}^m y∗i ∇gi(x∗).
The first step uses that if there are strictly feasible primal and
dual solutions (x̂, ŷ, ŝ), then the region { x | Ax = b, fµ(x) ≤ fµ(x̂) }
is bounded (and clearly closed) and hence the continuous function
f µ ( x ) achieves its minimum at some point x ∗ inside this region, by
the Extreme Value theorem. (See Lemma 7.2.1 of Matoušek and Gärt-
ner, say.)
For the second step, we use the functions f µ ( x ), and gi ( x ) = a⊺i x −
bi in Theorem 19.2 to get the existence of y∗ ∈ Rm such that:
∇fη(x∗) = ∑_{i=1}^m y∗i · ∇(a⊺i x∗ − bi)  ⟺  c − η · (1/x∗1, · · · , 1/x∗n)⊺ = ∑_{i=1}^m y∗i ai.
By weak duality, the optimal value of the linear program lies be-
tween the values of any feasible primal and dual solution, so the
duality gap c⊺ x − b⊺ y bounds the suboptimality c⊺ x − OPT of our
current solution. Lemma 19.1 allows us to relate the duality gap to η
as follows.
c⊺ x − b⊺ y = c⊺ x − ( Ax )⊺ y = x⊺ c − x⊺ (c − s) = x⊺ s = n · η.
Ax = b (19.4)
⊺
A y+s = c (19.5)
∑_{i=1}^n ( si xi − ηt )² ≤ (ηt/4)².    (19.6)
The first two are again feasibility conditions for ( P) and ( D ′ ). The
third condition is new, and is an approximate version of (19.3). Sup-
pose that
η′ := η · ( 1 − 1/(4√n) ).
A (∆x ) = 0
⊺
A (∆y) + (∆s) = 0
si (∆xi ) + (∆si ) xi + (∆si )(∆xi ) = η ′ − xi si .
Note the quadratic term in blue. Since we are aiming for an approxi-
mation anyways, and these increments are meant to be tiny, we drop
the quadratic term to get a system of linear equations in these incre-
ments:
A (∆x ) = 0
⊺
A (∆y) + (∆s) = 0
si (∆xi ) + (∆si ) xi = η ′ − xi si .
putting down just so that you recognize it the next time you see it):
⎡ A        0    0       ⎤ ⎡ ∆x ⎤   ⎡ 0           ⎤
⎢ 0        A⊺   I       ⎥ ⎢ ∆y ⎥ = ⎢ 0           ⎥
⎣ diag(s)  0    diag(x) ⎦ ⎣ ∆s ⎦   ⎣ η′1 − x ∘ s ⎦
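As a sanity check, here is a sketch (not from the notes) that assembles and solves this block system with numpy; the one-constraint LP and the strictly feasible point (x, y, s) below are made up for illustration.

import numpy as np

def ipm_step(A, x, y, s, eta_new):
    m, n = A.shape
    M = np.zeros((2 * n + m, 2 * n + m))
    M[:m, :n] = A                          # A (dx) = 0
    M[m:m + n, n:n + m] = A.T              # A^T (dy) + (ds) = 0
    M[m:m + n, n + m:] = np.eye(n)
    M[m + n:, :n] = np.diag(s)             # s_i dx_i + x_i ds_i = eta' - x_i s_i
    M[m + n:, n + m:] = np.diag(x)
    rhs = np.concatenate([np.zeros(m + n), eta_new - x * s])
    sol = np.linalg.solve(M, rhs)
    return sol[:n], sol[n:n + m], sol[n + m:]

A = np.array([[1.0, 1.0]])                 # feasible region: x_1 + x_2 = 1, x >= 0
x, y, s = np.array([0.5, 0.5]), np.array([-1.0]), np.array([1.0, 1.0])
dx, dy, ds = ipm_step(A, x, y, s, eta_new=0.4)
print((x + dx) * (s + ds))                 # each coordinate lands near eta' = 0.4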
Proof. The last set of equalities in the linear system ensures that si(∆xi) + (∆si)xi = η′ − xi si for each i, so we get

⟨x′, s′⟩ = ⟨x + ∆x, s + ∆s⟩
= ∑i ( si xi + si(∆xi) + (∆si)xi ) + ⟨∆x, ∆s⟩
= nη′ + ⟨∆x, −A⊺(∆y)⟩
= nη′ − ⟨A(∆x), ∆y⟩
= nη′,
Proof. As in the proof of Lemma 19.3, we get that si′ xi′ − η ′ = (∆si )(∆xi ),
so it suffices to show that
√( ∑_{i=1}^n (∆si)² (∆xi)² ) ≤ η′/4.
where we set a²i = xi(∆si)²/si and b²i = si(∆xi)²/xi. Hence
√( ∑_{i=1}^n (∆si ∆xi)² ) ≤ (1/4) ∑_{i=1}^n ( (xi/si)·(∆si)² + (si/xi)·(∆xi)² + 2(∆si)(∆xi) )
= (1/4) ∑_{i=1}^n ( (xi∆si)² + (si∆xi)² ) / (si xi)    [since (∆s)⊺∆x = 0 by Claim 19.3]
≤ (1/4) ∑_{i=1}^n (xi∆si + si∆xi)² / min_{i∈[n]} si xi
= (1/4) ∑_{i=1}^n (η′ − si xi)² / min_{i∈[n]} si xi.    (19.8)
∑_{i=1}^n (η′ − si xi)² = ∑_{i=1}^n ( (η − si xi) − δη )²
= ∑_{i=1}^n (η − si xi)² + n(δη)² − 2δη ∑_{i=1}^n (η − si xi),

where δ = 1/(4√n); the cross term vanishes since ∑i si xi = nη. Thus

∑_{i=1}^n (η′ − si xi)² ≤ (η/4)² + n (η/(4√n))² = η²/8.
which are analogs of Lemmas 19.3 and 19.4 respectively. The latter
inequality means that |si′′ xi′′ − η ′′ | ≤ η ′′ /4 for each coordinate i, else
that coordinate itself would violate inequality (19.9). Specifically,
this means that neither xi′′ nor si′′ ever becomes zero for any value of
α ∈ [0, 1]. Now since ( xi′′ , si′′ ) is a linear interpolation between ( xi , si )
and ( xi′ , si′ ), and the former were strictly positive, the latter cannot be
non-positive.
19.3.2 An Example
Given an n-bit integer a ∈ Z, suppose we want to compute its re-
ciprocal 1/a without using divisions. This reciprocal is a zero of the
expression This method for computing recip-
g( x ) = 1/x − a. rocals appears in the classic book of
Aho, Hopcroft, Ullman, without any
Hence, the Newton-Raphson method says, we can start with x1 = 1, elaboration—it always mystified me
until I realized the connection to the
say, and then use (19.10) to get Newton-Raphson method. I guess they
(1/xt − a) expected their readers to be familiar
x t +1 ← x t − = xt + xt (1 − a xt ) = 2xt − a xt2 . with these connections, since computer
(−1/xt2 ) science used to have closer connections
If we define ε t := 1 − a xt , then to numerical analysis in the 1970s.
f ′ ( xt )
x t +1 ← x t − . (19.11)
f ′′ ( xt )
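Here is a quick numerical check of this division-free iteration and of the recursion εt+1 = ε²t (a sketch; note the starting point must satisfy |1 − a·x1| < 1 for the method to converge):

a = 7.0
x = 0.1
for t in range(1, 7):
    print(t, x, 1 - a * x)     # the error 1 - a x_t squares every iteration
    x = 2 * x - a * x * x      # x_{t+1} = 2 x_t - a x_t^2
print("final:", x, "true value:", 1 / a)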
19.4 Self-Concordance
Combating Intractability
20
Approximation Algorithms
to be the worst-case ratio between the cost of the algorithm's solution and that of the optimal solution (so that Alg ≤ ρ · Opt on every instance):

ρ = ρA := max_I  c(Alg(I)) / c(Opt(I)),
However, there are problems that do not fall into any of these clean categories, such as Asymmetric k-Center, for which there exists an O(log∗ n)-approximation algorithm, and this is best possible unless P = NP. Or Group Steiner Tree, where the approximation ratio is O(log² n) on trees, and this is also best possible.
This shows that Alg(I) ≤ α Opt(I), which leaves us with the question of how to construct the surrogate. Sometimes we use the combinatorial properties of the problem to get a surrogate, and at other times we use a linear programming relaxation.
The greedy algorithm does not achieve a better ratio than Ω(log n):
one example is given by the figure to the right. The optimal sets are
the two rows, whereas the greedy algorithm may break ties poorly
and pick the set covering the left half, and then half the remainder,
etc. A more sophisticated example can show a matching gap of ln n.
Proof of Theorem 20.1. Suppose Opt picks k sets from S. Let ni be the number of elements still uncovered when the algorithm has picked i sets; then n0 = n = |U|. Since the k sets in Opt cover all the elements of U, they also cover the ni yet-uncovered elements. By averaging, there must exist a set in S that covers at least ni/k of the yet-uncovered elements. Hence,

ni+1 ≤ ni − ni/k = ni(1 − 1/k).

Iterating, we get nt ≤ n0(1 − 1/k)^t < n · e^{−t/k}. (As always, we use 1 + x ≤ e^x, and here we can use that the inequality is strict whenever x ≠ 0.) So setting T = k ln n, we get nT < 1. Since nT must be an integer, it is zero, so we have covered all elements using T = k ln n sets.
If the sets are of size at most B, we can show that the greedy algorithm is a (1 + ln B)-approximation. Moreover, for the weighted case, the greedy algorithm changes to picking the set S ∈ S that maximizes

(number of yet-uncovered elements in S) / cS.

One can give an analysis along the lines of the one above for this weighted case as well, although the proof changes a fair bit.
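For concreteness, here is a minimal sketch of the unweighted greedy algorithm (assuming the given sets do cover the universe); the instance is made up:

def greedy_set_cover(U, sets):
    uncovered, picked = set(U), []
    while uncovered:
        # pick the set covering the most yet-uncovered elements
        best = max(sets, key=lambda S: len(S & uncovered))
        picked.append(best)
        uncovered -= best
    return picked

U = range(6)
sets = [{0, 1, 2}, {3, 4, 5}, {0, 3}, {1, 4}, {2, 5}]
print(greedy_set_cover(U, sets))    # here greedy finds the two "rows"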
The second algorithm for Set Cover uses the popular relax-and-
round framework. The steps of this process are as follows:
1. Write an integer linear program for the problem. This will also be NP-hard to solve, naturally.

2. Relax the integrality constraints to get a linear program.

3. Now solve the linear program, and round the fractional variables to integer values, while ensuring that the cost of this integer solution is not much higher than the LP value.
Let’s see this in action: here is the integer linear program (ILP) that
precisely models Set Cover:
min ∑ cS xS (ILP-SC)
S∈S
s.t. ∑ xS ≥ 1 ∀e ∈ U
S:e∈S
xS ∈ {0, 1} ∀S ∈ S .
min ∑ cS xS (LP-SC)
S∈S
s.t. ∑ xS ≥ 1 ∀e ∈ U
S:e∈S
xS ≥ 0 ∀S ∈ S .
If LP( I ) is the optimal value for the linear program, then we get:
LP( I ) ≤ Opt( I ).
1. Phase 1: Repeat ⌈ln n⌉ times: pick each set S ∈ S independently with probability x∗S.

2. Phase 2: For each element e yet uncovered, pick any set covering it.
Clearly the solution produced by the algorithm is feasible; it just
remains to bound the number of sets picked by it.
Theorem 20.2. The expected number of sets picked by this algorithm is
(ln n) LP( I ) + 1.
Proof. Clearly, the expected number of sets picked in each round of phase 1 is ∑S x∗S = LP(I), and hence the expected number of sets picked in phase 1 is at most ln n times as much.

For the second phase, the number of sets picked is precisely the number of elements not covered in Phase 1. To calculate this, consider an arbitrary element e: the probability it stays uncovered in one round is ∏_{S∋e} (1 − x∗S) ≤ e^{−∑_{S∋e} x∗S} ≤ 1/e, so the probability it stays uncovered after all ⌈ln n⌉ rounds is at most 1/n. By linearity of expectation, the expected number of uncovered elements, and hence of sets picked in Phase 2, is at most 1.
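Here is a sketch of the two phases in code, assuming the fractional optimum x_star is already in hand from an LP solver (hard-coded below for a toy instance):

import math, random

def round_set_cover(U, sets, x_star):
    picked = set()
    for _ in range(math.ceil(math.log(len(U)))):        # phase 1: ln n rounds
        for j, xs in enumerate(x_star):
            if random.random() < xs:
                picked.add(j)
    covered = set().union(*(sets[j] for j in picked)) if picked else set()
    for e in set(U) - covered:                          # phase 2: patch up
        picked.add(next(j for j, S in enumerate(sets) if e in S))
    return picked

U = range(4)
sets = [{0, 1}, {2, 3}, {0, 2}, {1, 3}]
print(round_set_cover(U, sets, [0.5, 0.5, 0.5, 0.5]))   # a feasible LP solution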
1. First-Fit: add the item to the earliest opened bin where it fits.
Exercise: if all the items were of size at most ε, then each bin (ex-
cept the last one) would have at least 1 − ε total size, thereby giving
an approximation of
Opt(I)/(1 − ε) + 1 ≈ (1 + ε) Opt(I) + 1.
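For concreteness, here is a minimal sketch of First-Fit (unit-capacity bins, items given as sizes):

def first_fit(items, cap=1.0):
    bins = []                      # each entry is the total size in that bin
    for s in items:
        for i, load in enumerate(bins):
            if load + s <= cap:    # earliest opened bin where the item fits
                bins[i] += s
                break
        else:
            bins.append(s)         # no open bin fits: open a new one
    return bins

print(first_fit([0.6, 0.5, 0.4, 0.3, 0.2]))   # two bins: 0.6+0.4 and 0.5+0.3+0.2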
• Define the new size si ′ for each item i to be the size of the largest
element in i’s group.
There are D distinct item sizes, and all sizes are only increased, so it remains to show a packing for the items in I′ that uses at most Opt(I) + ⌈n/D⌉ bins. Indeed, number the items in decreasing order of size, and suppose Opt(I) assigns item i to some bin b. Then we assign item (i + ⌈n/D⌉) to bin b: its new size is at most the original size of item i, so this allocates all the items except those in the first group, without violating the bin capacities. Now we assign each item in the first group to a new bin, thereby opening up ⌈n/D⌉ more bins.
min ∑ xC ,
C ∈C
s.t. ∑ ACs xC ≥ ns , ∀ sizes s
C
xC ∈ N.
Here ACs is the number of items of size s being placed in the configuration C, and ns is the total number of items of size s in the instance. This is an exact formulation, and relaxing the integrality constraint to xC ≥ 0 gives us an LP that we can solve in time poly(N, n); this is polynomial time when N is a constant. (In fact, we show in a homework problem that the LP can be solved in time polynomial in n even when N is not a constant.) We use the optimal value of this LP as our surrogate.

How do we round the optimal solution for this LP? There are only
D non-trivial constraints in the LP, and N non-negativity constraints.
So if we pick an optimal vertex solution, it must have some N of
these constraints at equality. This means at least N − D of these tight
constraints come from the latter set, and therefore N − D variables
are set to zero. In other words, at most D of the variables are non-
zero. Rounding these variables up to the closest integer, we get a
solution that uses at most LP( I ) + D ≤ Opt( I ) + D bins. Since D is a
constant, we have approximated the solution up to a constant.
Opt(I) + ⌈n/D⌉ + D

bins. Now if we could ensure that n/D were at most ε Opt(I), for some D = f(ε), we would be done. Indeed, if all the items have size at least ε, the total volume (and therefore Opt(I)) is at least εn. If we now set D = 1/ε², then n/D ≤ ε²n ≤ ε Opt(I), and the number of bins is at most

(1 + ε) Opt(I) + ⌈1/ε²⌉.
What if some of the items are smaller than ε? We now use the
observation that First-Fit behaves very well when the item sizes
are small. Indeed, we first hold back all the items smaller than ε,
and solve the remaining instance as above. Then we add in the small
items using First-Fit: if it does not open any new bins, we are
fine. And if adding these small items results in opening some new
bin, then each of the existing bins—and all the newly opened bins
(except the last one)—must have at least (1 − ε) total size in them.
The number of bins is then at most
Opt(I)/(1 − ε) + 1 ≈ (1 + O(ε)) Opt(I) + 1,
as long as ε ≤ 1/2.
Just like the use of linear programming was a major advance in the design of approximation algorithms, specifically in the use of linear programs in the relax-and-round framework, another significant advance was the use of semidefinite programs in the same framework. For instance, the approximation guarantee for the Max-Cut problem was improved from 1/2 to 0.878 using this technique. Moreover, subsequent results have shown that any improvement to this approximation guarantee in polynomial time would disprove the Unique Games Conjecture.
a. x⊺ Ax ≥ 0 for all x ∈ Rn .
Lemma 21.2. Let A ⪰ 0. If Ai,i = 0 then Aj,i = Ai,j = 0 for all j. (We will write A ⪰ 0 to denote that A is PSD; more generally, we write A ⪰ B if A − B is PSD: this partial order on symmetric matrices is called the Löwner order.)

Proof. Let j ≠ i. The determinant of the 2 × 2 submatrix indexed by {i, j} is

Ai,i Aj,j − Ai,j Aj,i = −A²i,j,

using Ai,i = 0 and symmetry. Since every principal minor of a PSD matrix is non-negative, this forces Ai,j = 0.
We can think of this as being the usual vector inner product, treating A and B as vectors of length n × n. Note that by the cyclic property of the trace, A • xx⊺ = Tr(Axx⊺) = Tr(x⊺Ax) = x⊺Ax; we will use this fact to derive yet another characterization of PSD matrices:

A • X = ∑i λi (A • xi x⊺i) = ∑i λi x⊺i A xi ≥ 0.
maximize_{X ∈ Rn×n}  A • X
subject to  I • X = 1,  X ⪰ 0.    (21.1)
Proof. Let X maximize SDP (21.1) (this exists as the objective is con-
tinuous and the feasible set is compact). Consider the spectral de-
composition X = ∑in=1 λi xi xi⊺ where λi ≥ 0 and ∥ xi ∥2 = 1. The
trace constraint I • X = 1 implies ∑i λi = 1. Thus the objective value
A • X = ∑i λi xi⊺ Axi is a convex combination of xi⊺ Axi . Hence without
loss of generality, we can put all the weight into one of these terms,
in which case X = yy⊺ is a rank-one matrix with ∥y∥2 = 1. By the
Courant-Fischer theorem, OPT ≤ max∥y∥2 =1 y⊺ Ay = λmax .
Here is another SDP for the same problem: (In fact, it turns out that this SDP is dual to the one in (21.1). Weak duality still holds for this case, but strong duality does not hold in general for SDPs. Indeed, there could be a duality gap in some cases, where both the primal and dual are finite, but the optimal solutions are not equal to each other. However, under some mild regularity conditions (e.g., the Slater conditions) we can show strong duality. More about SDP duality here.)

minimize_t  t
subject to  tI − A ⪰ 0.    (21.2)

Lemma 21.7. SDP (21.2) computes the maximum eigenvalue of A.

Proof. The matrix tI − A has eigenvalues t − λi. Hence the constraint tI − A ⪰ 0 is equivalent to the constraint t − λ ≥ 0 for all its eigenvalues λ. In other words, t ≥ λmax, and thus OPT = λmax.
Here's a simple randomized algorithm: place each vertex in either S or S̄ independently and uniformly at random. Since each edge is cut with probability 1/2, the expected number of cut edges is |E|/2. This result shows two things: (a) by the probabilistic method, every graph has a bipartition that cuts half the edges of the graph, so Opt ≥ |E|/2; and (b) since Opt ≤ |E| on any graph, Alg ≥ |E|/2 ≥ Opt/2. (We cannot hope to prove a better result than Lemma 21.9 in terms of |E|, since the complete graph Kn has (n choose 2) ≈ n²/2 edges and any partition can cut at most n²/4 of them.)
maximize_{x1,...,xn ∈ R}  ∑_{(i,j)∈E} (xi − xj)²/4
subject to  x²i = 1  ∀i.    (21.3)
maximize_{v1,...,vn ∈ Rn}  ∑_{(i,j)∈E} (1 − ⟨vi, vj⟩)/2
subject to  ⟨vi, vi⟩ = 1  ∀i.    (21.5)
Proof. By linearity of expectation, it suffices to bound the probability of an edge (i, j) being cut. Let

θij := cos⁻¹(⟨vi, vj⟩)

be the angle between the unit vectors vi and vj. Now consider the 2-dimensional plane P containing vi, vj and the origin, and let g̃ be the projection of the Gaussian vector g onto this plane. Observe that the edge (i, j) is cut precisely when the hyperplane defined by g separates vi, vj. This is precisely when the vector perpendicular to g̃ in the plane P lands between vi and vj. As the projection of the standard Gaussian onto a subspace is again a standard Gaussian (by spherical symmetry),

Pr[(i, j) cut] = 2θij/2π = θij/π.

(Figure 21.1 gives a geometric picture of the Goemans-Williamson randomized rounding.)
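Here is a sketch of the hyperplane rounding in code, assuming the SDP vectors (the unit rows of V) were already computed; the toy instance is a triangle with the three vectors at mutual angle 2π/3, which optimize the SDP, so the rounding always cuts 2 of the 3 edges, and that equals Opt here.

import numpy as np

def gw_round(V, edges, seed=None):
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(V.shape[1])    # random Gaussian direction
    side = np.sign(V @ g)                  # which side of the hyperplane
    return sum(1 for (i, j) in edges if side[i] != side[j])

s = np.sqrt(3) / 2
V = np.array([[1.0, 0.0], [-0.5, s], [-0.5, -s]])
edges = [(0, 1), (1, 2), (2, 0)]
print(gw_round(V, edges))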
Proof. Pick any vertex v, recursively color the remaining graph, and
then assign v a color not among the colors of its ∆ neighbors.
find v 1 , . . . , v n ∈ Rn
subject to vi , v j ≤ λ ∀(i, j) ∈ E (21.6)
⟨ vi , vi ⟩ = 1 ∀i ∈ V.
Why is this SDP relevant to our problem? The goal is to have vectors clustered together in groups, such that each cluster represents a color. Intuitively, we want the vectors of adjacent vertices to be far apart, so we want their inner product to be close to −1 (recall we are dealing with unit vectors, due to the last constraint), and the vectors of same-colored vertices to be close together.
Proof. Consider the vector placement shown in the figure to the right (Figure 21.3: three unit vectors at pairwise angles of 120°, the optimal distribution of vectors for 3-coloring). If the graph is 3-colorable, we can assign all vertices with color 1 the red vector, all vertices with color 2 the blue vector, and all vertices with color 3 the green vector. Now for every edge (i, j) ∈ E, we have that

⟨vi, vj⟩ = cos(2π/3) = −1/2.
At first sight, it may seem like we are done: if we solve the above SDP with λ = −1/2, don't all three vectors look like the figure above? No, that would only hold if all of them were to be co-planar. And in n dimensions we can have an exponential number of cones of pairwise angle 2π/3, like in the next figure (Figure 21.4: the dimensionality problem of 2π/3-far vectors), so we cannot cluster vectors as easily as in the above example.
To solve this issue, we apply a hyperplane rounding technique similar to that from the MaxCut algorithm. Indeed, for some parameter t we will pick later, pick t random hyperplanes. Formally, we pick gi ∈ Rn from a standard n-dimensional Gaussian distribution, for i ∈ [t]. Each of these defines a normal hyperplane, and together they split the unit sphere in Rn into 2^t regions (except if two of them point in the same direction, which has zero probability). Now, the vectors {vi} that lie in the same region can be considered "close" to each other, and we can try to assign them a unique color. Formally, this means that if vi and vj are such that

sign(⟨vi, gk⟩) = sign(⟨vj, gk⟩)

for all k ∈ [t], then i and j are given the same color. Each region is given a different color, of course.
However, this may color some neighbors with the same color, so we use the method of alterations: while there exists an edge between vertices of the same color, we uncolor both endpoints. When this uncoloring stops, we remove the still-colored vertices from the graph, and then repeat the same procedure on the remaining graph, until we color every vertex. Note that since we use t hyperplanes, we add at most 2^t new colors per iteration. The goal is now to show that (a) the number of iterations is small, and (b) the value of 2^t is also small.
Lemma 21.16. The expected number of vertices that remain uncolored after
a single iteration is at most n∆ (1/3)t .
Proof. Fix an edge ij: for a single random hyperplane, the probability that vi, vj are not separated by it is

(π − θij)/π ≤ 1/3,

since ⟨vi, vj⟩ ≤ −1/2 means θij ≥ 2π/3. Hence the probability that none of the t hyperplanes separates vi and vj is at most (1/3)^t, and each such uncut edge uncolors its two endpoints. There are n vertices, each of degree at most ∆, hence at most n∆/2 edges, which proves the result.
Typically we do not know the future requests that the CPU will make, so it is sensible to model this as an online problem. (If the entire sequence of requests is known, show that Belády's rule is optimal: evict the page in cache that is next requested furthest in the future.) We let U be a universe of n items or pages. The cache is a memory containing at most k pages. The requests are pages σi ∈ U, and the online algorithm is an eviction policy. Now we return to defining the performance of an online algorithm.
max_σ  Alg(σ)/Opt(σ)
E[Alg(σ)] ≤ α · Opt(σ ).
Lemma 22.1. The competitive ratio of algorithm AlgB is 2 − 1/B and this is
the best possible ratio for any deterministic algorithm.
Proof. There are two cases to consider: j < B and j ≥ B. For the first case, AlgB(Ij) = j and Opt(Ij) = j, so AlgB(Ij)/Opt(Ij) = 1. In the second case, AlgB(Ij) = 2B − 1 and Opt(Ij) = B, so AlgB(Ij)/Opt(Ij) = 2 − 1/B. Thus the competitive ratio of AlgB is

max_{Ij}  AlgB(Ij)/Opt(Ij) = 2 − 1/B.

Now to show that this is the best possible competitive ratio for any deterministic algorithm, consider algorithm Algi. We find an instance Ij such that Algi(Ij)/Opt(Ij) ≥ 2 − 1/B. If i ≥ B then we take j = B, so that Algi(Ij) = i − 1 + B and Opt(Ij) = B, giving

Algi(Ij)/Opt(Ij) = (i − 1 + B)/B = i/B + 1 − 1/B ≥ 2 − 1/B.

Otherwise i < B, and we take j = i, so that Algi(Ij) = i − 1 + B and Opt(Ij) = i; for instance, for i = 1,

Algi(Ij)/Opt(Ij) = B/1 ≥ 2,

and in general (i − 1 + B)/i ≥ 2 − 1/B whenever i ≤ B. In all cases,

Algi(Ij)/Opt(Ij) ≥ 2 − 1/B.

The ratios Algi(Ij)/Opt(Ij) for B = 4 are shown in the table below.
I1 I2 I3 I∞
Alg1 4/1 4/2 4/3 4/4
Alg2 1/1 5/2 5/3 5/4
Alg3 1/1 2/2 6/3 6/4
Alg4 1/1 2/2 3/3 7/4
4p1 + p2 + p3 + p4 ≤ c
(4p1 + 5p2 + 2p3 + 2p4)/2 ≤ c
(4p1 + 5p2 + 6p3 + 3p4)/3 ≤ c
(4p1 + 5p2 + 6p3 + 7p4)/4 ≤ c
p1 + p2 + p3 + p4 = 1
3p1 = c − 1
4p2 + p3 + p4 = c
4p3 + p4 = c
4p4 = c.
c = 1/(1 − (1 − 1/4)⁴)  and  pi = (3/4)^{4−i} (c/4),

and in general

c = cB = 1/(1 − (1 − 1/B)^B) ≤ e/(e − 1).
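A quick sketch that computes this distribution for general B (using the closed form pi = (1 − 1/B)^{B−i} · cB/B suggested by the calculation above) and checks that it is a valid probability distribution, with cB approaching e/(e − 1):

import math

def ski_rental(B):
    c = 1 / (1 - (1 - 1 / B) ** B)
    p = [(1 - 1 / B) ** (B - i) * c / B for i in range(1, B + 1)]
    return c, p

for B in [4, 10, 100]:
    c, p = ski_rental(B)
    print(B, round(c, 4), round(sum(p), 4))   # sum(p) should always be 1.0
print("e/(e-1) =", round(math.e / (math.e - 1), 4))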
f ′ (ℓ) − f (ℓ) = 0
But this solves to f(ℓ) = Ce^ℓ for some constant C. Since f is a probability density function, ∫₀¹ f(ℓ) dℓ = 1, we get C = 1/(e − 1). Substituting into (22.1), we get that the competitive ratio is c = e/(e − 1), as desired.
Proof. We break up the proof into an upper bound on Alg’s cost and
a lower bound on Opt’s cost. Before doing this we set up some no-
tation. For the ith phase, let Si be the set of pages in the algorithm’s
cache at the beginning of the phase. Now define
∆i = |Si+1 \ Si|.
∑_{s=0}^{k−1} c/(k − s) ≤ ∆i ∑_{s=0}^{k−1} 1/(k − s) = ∆i Hk.
∆i Hk + ∆i = ∆i ( Hk + 1).
Now we claim that Opt ≥ 12 ∑i ∆i . Let Si∗ be the pages in Opt’s cache
at the beginning of phase i. Let ϕi be the number of pages in Si but
not in Opt’s cache at the beginning of phase i, i.e., ϕi = |Si \ Si∗ |.
Now let Opti be the cost that Opt incurs in phase i. We have that
Opti ≥ ∆i − ϕi since this is the number of “clean” requests that Opt
sees. Moreover, consider the end of phase i. Alg has the k most recent
requests in cache, but Opt does not have ϕi+1 of them by definition of
ϕi+1 . Hence Opti ≥ ϕi+1 . Now by averaging,
Opti ≥ max{ ϕi+1, ∆i − ϕi } ≥ (1/2) ( ϕi+1 + ∆i − ϕi ).
So summing over all phases we have
Opt ≥ (1/2) ( ∑i ∆i + ϕfinal − ϕinitial ) ≥ (1/2) ∑i ∆i,
2 i
since ϕ f inal ≥ 0 and ϕinitial = 0. Combining the upper and lower
bound yields
E[Alg] ≤ 2( Hk + 1) Opt = O(log k) Opt .
It can also be shown that no randomized algorithm can do better
than Ω(log k )-competitive for the paging problem. For some intuition
as to why this might be true, consider the coupon collector prob-
lem: if you repeatedly sample a uniformly random number from
{1, . . . , k + 1} with replacement, show that the expected number of
samples to see all k + 1 coupons is Hk+1 .
Additional Topics
23
Smoothed Analysis of Algorithms
the maximum cost for any input of size n. Consider Figure 23.1 and
imagine that all instances of size n are lined up on the x-axis. The
blue line could be the running time of an algorithm that is fast for
most inputs but very slow for a small number of instances. The worst
case is the height of the tallest peak. For the green curve, which
could for example be the running time of a dynamic programming
algorithm, the worst case is a tight bound since all instances induce
the same cost.
Worst case analysis provides a bound that is true for all instances, but it might be very loose for some or most instances, as happens for the blue curve. When running the algorithm on real world data
sets, the worst case instances might not occur, and there are different
approaches to provide a more realistic analysis.
The idea behind smoothed analysis is to analyze how an algorithm
acts on data that is jittered by a bit of random noise (notice that for
real world data, we might have a bit of noise anyway). This is mod-
eled by defining a notion of ‘neighborhood’ for instances and then
analyzing the expected cost of an instance from the neighborhood of
I instead of the running time of I.
The choice of the neighborhood is problem specific. We assume
that the neighborhood of I is given in form of a probability distribu-
tion over all instances in In . Thus, it is allowed that all instances in
In are in the neighborhood but their influence will be different. The
distribution depends on a parameter σ that says how much the input
shall be jittered. We denote the neighborhood of I parameterized by σ by nbrhdσ(I). Then the smoothed complexity is given by

smoothedₙ(σ) := max_{I∈In} E_{I′∼nbrhdσ(I)} [ cost(I′) ].
Note that the (n − 1)! represents the maximum possible number of it-
erations the algorithm can take, since each swap is strictly improving
and hence no permutation can repeat.
To bound this probability, we use the simple fact that a large num-
ber of iterations t means there is an iteration where the decrease in
weight was very small, and this is an event with low probability.
Lemma 23.2. The probability that the algorithm takes at least t iterations is at most n⁵/(tσ).
Proof. Note that wij ∈ [0, 1] implies that the weight of any cycle is in [0, n]. This implies, by the pigeonhole principle, that there was an iteration where the decrease in weight was at most ε := n/t. By Lemma 23.1 the probability of this event is at most n⁵/(tσ), as advertised.
Theorem 23.3. The expected number of iterations that the algorithm takes is at most O(n⁶ log n / σ).

E[X] = ∑_{t=1}^{(n−1)!} Pr[X ≥ t] ≤ ∑_{t=1}^{(n−1)!} n⁵/(tσ) = (n⁵/σ) ∑_{t=1}^{(n−1)!} 1/t = O(n⁶ log n / σ),

since ∑_{t=1}^{(n−1)!} 1/t = O(log (n−1)!) = O(n log n).
We now turn our attention to the simplex algorithm for solving gen-
eral linear programs. Given a vector c ∈ Rd and a matrix A ∈ Rn×d ,
max c⊺ x
Ax ≤ 1.
max c⊺ x
( A + G ) x ≤ 1 + g.
This is one specific neighborhood model. Notice that for any input
( A, c, 1), all inputs ( A + G, c, 1, g) are potential neighbors, but the
probability decreases exponentially when we go ‘further away’ from
the original input. The vector c is not changed, only A and 1 are
jittered. The variance of the Gaussian distributions scales with the
smoothness parameter σ. For σ = 0, the problem reduces to the original, unperturbed problem.
Finally, we turn to a problem for which we will give (almost) all the details of the smoothed analysis. The input to the knapsack problem is a collection of n items, each with size/weight wi ∈ R≥0 and profit pi ∈ R≥0, and the goal is to pick a subset of items that maximizes the sum of profits, subject to the total weight of the subset being at most some bound B. As always, we can write the inputs as vectors p, w ∈ Rⁿ≥0, and hence the solution as x ∈ {0, 1}ⁿ. (For example, with p = (1, 2, 3), w = (2, 4, 5) and B = 10, the optimal solution is x = (0, 1, 1) with p⊺x = 5 and w⊺x = 9.)

The knapsack problem is weakly NP-hard—e.g., if the sizes are integers, then it can be solved by dynamic programming in time O(nB). Notice that perturbing the input by a real number prohibits the standard dynamic programming approach, which assumes integral input. Therefore we show a smoothed analysis for a different algorithm, one due to George Nemhauser and Jeff Ullman. The smoothed analysis result is by René Beier, Heiko Röglin, and Berthold Vöcking.
P(j + 1) ⊆ P(j) ∪ { S ∪ {j + 1} | S ∈ P(j) },

where we denote the second set by Aj.
Note that for this analysis we no longer care about the size bound B of the knapsack. The remainder of the section will be focused on bounding E[|P(j)|].
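Here is a minimal sketch of the Nemhauser-Ullman algorithm: maintain the Pareto curve P(j) of (weight, profit) pairs, merging in Aj and pruning dominated points with a sweep in order of increasing weight; the instance is the small example from the margin note above.

def pareto_knapsack(weights, profits):
    P = [(0.0, 0.0)]
    for w, p in zip(weights, profits):
        candidates = sorted(P + [(W + w, Q + p) for (W, Q) in P],
                            key=lambda t: (t[0], -t[1]))   # P(j) union A_j
        P, best = [], float("-inf")
        for W, Q in candidates:       # keep only the profit "records"
            if Q > best:
                P.append((W, Q))
                best = Q
    return P

P = pareto_knapsack([2, 4, 5], [1, 2, 3])
print(P)                                  # the Pareto-optimal (weight, profit) pairs
print(max(q for W, q in P if W <= 10))    # budget B = 10 gives profit 5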
E[|P|] ≤ 1 + lim_{k→∞} ∑_{ℓ=0}^{nk} Pr[ P ∩ (ℓ/k, (ℓ+1)/k] ≠ ∅ ].    (23.2)
Proof. Assume we have fixed all weights except wi. Then xUL,−i is completely determined. We remind the reader that x∗,+i is defined as the leftmost point that contains item i and is higher than xUL,−i. (Figure 23.7 illustrates the definitions of xUL and xR, with the horizontal axis being the sum of the weights; ∆ is only plotted in the right figure.)

Now we turn our attention to the "precursor" of x∗,+i without the item i, namely the point x∗,+i − ei, where ei is the i-th basis vector. The claim is that this point is completely determined once we have fixed all weights except wi. Name that point y (a formal definition of this point will be given later). The point y is exactly the one that is leftmost with the condition that yi = 0 and p⊺y + pi ≥ p⊺xUL,−i (by definition of x∗,+i). Note that the order of the y's does not change when adding wi: in particular, if y¹ was left of y² (had smaller weight), then adding the (currently undetermined) wi will not change their ordering.
More formally, let y := arg miny∈2[n] {w⊺ y | p⊺ y + pi ≥ p⊺ xUL,−i , yi =
0} (we drop the index i from y for clarity). In other words, it is the
leftmost solution without i higher than xUL,−i when we add the profit
of i to it. It holds that w⊺ x ∗,+i = w⊺ y + wi . Therefore,
Pr[ w⊺x∗,+i ∈ (t, t + ε] ] = Pr[ w⊺y + wi ∈ (t, t + ε] ]
= Pr[ wi ∈ (t − w⊺y, t + ε − w⊺y] ]
≤ ε/σ,

since the density of wi is bounded by 1/σ.
Proof. Let xUL be the most profitable (highest) point left of t. Since the zero vector is always Pareto optimal, there is always at least one point left of t, so xUL is well-defined. There must be an item i that is contained in xR but not in xUL: formally, pick and fix any i ∈ [n] s.t. xRi = 1, xULi = 0. Such an i must exist, since otherwise xR ⊆ xUL, and then xUL would have weight at least that of xR, impossible since xUL lies left of t and xR to its right. Also, the height of xR is greater than that of xUL, since they are both on the Pareto curve.

Clearly (for this i) it holds that xUL = xUL,−i. Assume for the sake of contradiction that x∗,+i is not xR. Then:

• The only remaining spot for x∗,+i is right of t and left of xR, but that contradicts the choice of xR as the leftmost point right of t.
≤ 1 + lim_{k→∞} nk · ( n · (1/k) ) / σ = 1 + n²/σ.
max p⊺x
s.t. Ax ≤ b
x ∈ S
The prophet and secretary problems are two classes of problems where
online decision-making meets stochasticity: in the first set of prob-
lems the inputs are random variables, whereas in the second one
the inputs are worst-case but revealed to the algorithm (a.k.a. the
decision-maker) in random order. Here we survey some results,
proofs, and techniques, and give some pointers to the rich body of
work developing around them.
Proof. Observe that we pick an item with probability exactly 1/2, but
how does the expected reward compare with E[ Xmax ]?
ALG ≥ τ · Pr[Xmax ≥ τ] + ∑_{i=1}^n E[(Xi − τ)⁺] · Pr[ ⋀_{j<i} (Xj < τ) ]
≥ τ · Pr[Xmax ≥ τ] + ∑_{i=1}^n E[(Xi − τ)⁺] · Pr[Xmax < τ].
But both these probability values equal half, by the choice of τ as the median of Xmax; and since Xmax ≤ τ + ∑i (Xi − τ)⁺, we conclude ALG ≥ 1/2 E[Xmax].
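Here is a small simulation sketch of this single-threshold rule (with made-up uniform Xi's), estimating the median τ of Xmax empirically and checking the factor-1/2 guarantee:

import random, statistics

def values(n):
    return [random.random() for _ in range(n)]

n, runs = 5, 20000
tau = statistics.median(max(values(n)) for _ in range(runs))   # median of X_max
alg_total = max_total = 0.0
for _ in range(runs):
    xs = values(n)
    alg_total += next((x for x in xs if x >= tau), 0.0)   # first X_i >= tau
    max_total += max(xs)
print(alg_total / runs, ">=", max_total / (2 * runs))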
vi ( pi ) := E[ Xi | Xi ≥ τi ]
Proof. Say we “reach” item i if we’ve not picked an item before i. The
expected value of the algorithm is
n
ALG ≥ ∑ Pr[reach item i] · 1/2 · Pr[Xi ≥ τi ] · E[Xi | Xi ≥ τi ]
i =1
n
= ∑ Pr[reach item i] · 1/2 · pi · vi ( pi ). (24.1)
i =1
Since we pick each item with probability pi /2, the expected number
of items we choose is half. So, by Markov’s inequality, the probability
we pick no item at all is at least half. Hence, the probability we reach
item i is at least one half too, the above expression is 1/4 ∑i vi ( pi ) · pi
as claimed.
ri+1 = ri (1 − qi · pi)    (24.2)

we can just set qi+1 = 1/(2 ri+1). A simple induction shows that ri+1 ≥ 1/2—indeed, sum up (24.2) to get ri+1 = r1 − ∑_{j≤i} pj/2—so that qi+1 ∈ [0, 1] and is well-defined.
E[Xmax] = ∑_{i≤j∗} Wi/2^i + Wj∗+1/2^{j∗}.

Indeed, Xmax = Wi if all the previous Wj's belong to the sample (i.e., they are S's and not X's) but Wi belongs to the actual values (it is an X). Moreover, if all the previous values are S's, then Wj∗+1 would be an X and hence the maximum, by our choice of j∗.
What about the algorithm? If W1 is a sample (i.e., an S-value) then we don't get any value. Else if W1, . . . , Wi are all X values and Wi+1 is a sample (S value), then we get value at least Wi. If i < j∗, this happens with probability 1/2^{i+1}, since all the i + 1 coins are independent. Else if i = j∗, the probability is 1/2^i = 1/2^{j∗}. Hence
∑i pi = k. (24.3)
What about an algorithm to get value 1/4 of the value in (24.4)? The
same as above: reject each item outright with probability 1/2, else
pick i if Xi ≥ τi . Proof #2 goes through unchanged.
For this case of picking multiple items, we can do much better: a result of Alaei shows that one can get within (1 − 1/√(k+3)) of the value in (24.4)—for k = 1, this matches the factor 1/2 we showed above. One can, however, get a factor of (1 − O(√(log k / k))) using a simple concentration bound.
which gives the claimed bound of (1 − O(√(log k / k))).
∑i vi(yi) · yi,  where y ranges over the matroid polytope for M.
24.1.7 Exercises
1. Give a dynamic programming algorithm for the best strategy when we know the
order in which r.v.s are revealed to us. (Footnote 1). Extend this to the case where
you can pick k items.
Open problem: is this “best strategy” problem computationally hard when we are
given a general matroid constraint? Even a laminar matroid or graphical matroid?
2. If we can choose the order in which we see the items, show that we can get ex-
pected value ≥ (1 − 1/e)E[ Xmax ]. (Hint: use proof #2, but consider the elements in
decreasing order of vi ( pi ).)
Open problem: can you beat (1 − 1/e)E[ Xmax ]? A recent paper of Abolhassani et al.
does so for i.i.d. Xi s.
The problem setting: there are n items, each having some intrinsic
non-negative value. For simplicity, assume the values are distinct, but
we know nothing about their ranges. We know n, and nothing else.
The items are presented to us one-by-one. Upon seeing an item, we
can either pick it (in which case the process ends) or we can pass (but
then this item is rejected and we cannot ever pick it again). The goal
is to maximize the probability of picking the item with the largest
value vmax .
If an adversary chooses the order in which the items are presented,
every deterministic strategy must fail. Suppose there are just two
items, the first one with value 1. If the algorithm picks it, the adver-
sary can send a second item with value 2, else it sends one with value
1/2. Randomizing our algorithm can help, but we cannot do much better.
Note that this algorithm succeeds if the best item is in the second half of the items (which happens w.p. 1/2) and the second-best item is in the first half (which, conditioned on the above event, happens w.p. ≥ 1/2). Hence the success probability is at least 1/4. It turns out that rejecting the first half of the items is not optimal, and there are other cases where the algorithm succeeds that this simple analysis does not account for, so let's be more careful. Consider the following 37%-algorithm:
Ignore the first n/e items, and then pick the next item that is better
than all the ones seen so far.
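Here is a small simulation sketch of the 37%-algorithm on random inputs; the empirical success probability should be close to 1/e ≈ 0.37.

import math, random

def secretary_trial(n):
    vals = [random.random() for _ in range(n)]     # distinct almost surely
    cutoff = int(n / math.e)
    best_seen = max(vals[:cutoff], default=float("-inf"))
    for v in vals[cutoff:]:
        if v > best_seen:
            return v == max(vals)                  # did we pick the best?
    return False                                   # never picked anything

n, runs = 50, 20000
print(sum(secretary_trial(n) for _ in range(runs)) / runs)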
Theorem 24.4. The strategy that maximizes the probability of picking the
highest number can be assumed to be a wait-and-pick strategy.
Proof. (Due to Niv Buchbinder, Kamal Jain, and Mohit Singh.) Let us
fix an optimal strategy. By the first proof above, we know what it is,
but let us ignore that for the time being. Let us just assume w.l.o.g.
that it does not pick any item that is not the best so far (since such an
item cannot be the global best).
Let pi be the probability that this strategy picks an item at position i. Let qi be the probability that we pick an item at position i, conditioned on it being the best so far; so qi = pi/(1/i) = i · pi.
Now, the probability of picking the best item is
But not picking the first i − 1 items is independent of i being the best
so far, so we get
pi ≤ (1/i) ( 1 − ∑_{j<i} pj ).
max ∑i (i/n) · pi
s.t. i · pi ≤ 1 − ∑_{j<i} pj  ∀i
pi ∈ [0, 1].
qi ≤ 1 − (i − 1) p.
Now the Buchbinder, Jain, and Singh paper shows the optimal value of this LP is at least 1 − 1/√2 ≈ 0.29; they also give a slightly more involved algorithm that achieves this success probability.
Is this tradeoff optimal? No. Kleinberg showed that one can get
expected value V ⋆ (1 − O(k−1/2 )), and this is asymptotically optimal.
Exercises
1. Give an algorithm for general matroids that finds an independent set with expected
value at least an O(1/(log k))-fraction of the max-value independent set.
2. Improve the above result to O(1)-fraction for graphic matroids.