CMU 15-850 (Fall 2020)
https://www.cs.cmu.edu/~15850/
The style files (as well as the text on this page!) are mildly adapted
from the ones developed by Yufei Zhao (MIT), for his notes on Graph
Theory and Additive Combinatorics. As some of you may guess, the
LaTeX template used for these notes is called tufte-book.
I Classical Algorithms

1 Minimum Spanning Trees
• Otakar Borůvka gave the first known MST algorithm in 1926; it was subsequently rediscovered by Gustave Choquet (1938), Georges Sollin (1965), and several others. Vojtěch Jarník gave his algorithm in 1930, and it was independently discovered by Robert Prim ('57) and Edsger Dijkstra ('59), among others. Joseph Kruskal gave his algorithm in 1956.
• In 1984, Michael Fredman and Bob Tarjan gave an O(m log∗ n) time algorithm, based on their Fibonacci heaps data structure (published in 1987). Here log∗ is the iterated logarithm function, and denotes the number of times we must take logarithms before the argument becomes smaller than 1. The actual runtime is a bit more nuanced, which we will not bother with today.
• In 1995, David Karger, Phil Klein, and Bob Tarjan finally got the
holy grail of O(m) time! . . . but it was a randomized algorithm, so
the search for a deterministic linear-time algorithm continued.
• In 1998, Seth Pettie and Vijaya Ramachandran gave an optimal algorithm for computing minimum spanning trees—however, we don't know its runtime! More formally, they show that if there exists an algorithm which uses MST∗(m, n) comparisons to find MSTs on all graphs with m edges and n nodes, the Pettie-Ramachandran algorithm will run in time O(MST∗(m, n)). (This was part of Seth's Ph.D. thesis, and Vijaya was his advisor.)
Theorem 1.1 (Cut Rule). For any cut of the graph, the minimum-weight
edge that crosses the cut must be in the MST. This rule helps us determine
what to add to our MST.
Theorem 1.2 (Cycle Rule). For any cycle in G, the heaviest edge on that
cycle cannot be in the MST. This helps us determine what we can remove in
constructing the MST.
Proof. Let C be any cycle, and let e be the heaviest edge in C. For a con-
tradiction, let T be an MST that contains e. Dropping e from T gives
two components. Now there must be some edge e′ in C \ {e} that
crosses between these two components, and hence T ′ := ( T − {e′ }) ∪
{e} is a spanning tree. (Make sure you see why.) By the choice of e
we have w(e′ ) < w(e), so T ′ is a lower-weight spanning tree than T, a
contradiction.
Given some cycle containing no red edge, use the cycle rule to mark the heaviest edge as not being in the MST, and color it red. (Again, this edge cannot already be blue, for similar reasons.) And if either of the rules is not applicable, we are done. Indeed, if we cannot apply the blue rule, the blue edges cross every cut, and hence form a spanning tree, which must be the MST. Similarly, once the non-red edges do not contain a cycle, they form a spanning tree, which must be the MST. All known algorithms differ only in their choice of cut/cycle, and how they find these fast. Indeed, all the deterministic algorithms we discuss today will just use the cut rule, whereas the randomized algorithm will use the cycle rule as well.
• union(elem1 , elem2 ), which merges the two sets that elem1 and
elem2 are in.
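To make these operations concrete, here is a minimal union-find sketch in Python; the class name and the path-compression/union-by-rank choices are standard implementation details of mine, not taken from the notes:

```python
class UnionFind:
    """Disjoint-set structure supporting the union operation in the
    text, plus the usual find; with path compression and union by
    rank, each operation takes near-constant amortized time."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # follow parent pointers, halving the path as we go
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        # merge the two sets containing x and y
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```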
of our current tree T of blue edges to some vertex not yet in T, and color it blue—thereby adding this edge to T and increasing its size by one. Figure 1.2 shows an example of how edges are added.

We'll use a priority queue data structure which keeps track of the lightest edge connecting T to each vertex not yet in T. A priority queue data structure is equipped with (at least) three operations:

• insert(elem, key), which inserts the given (element, key) pair into the queue,
• decreasekey(elem, newkey), which lowers the key of elem to newkey, and
• extractmin(), which removes and returns the element with the smallest key.

[Figure 1.2: Dashed lines are not yet in the MST. We started at the red node, and the blue nodes are also part of T right now.]
Note that by using the standard binary heap data structure we can
get O(log n) worst-case time for each priority queue operation above.
To implement the Jarník/Prim algorithm, we initially insert each vertex in V \ {r} into the priority queue with key ∞, and the root r with key 0. The key of a node v denotes the weight of the least-weight edge from a node in T to v; it is zero if v ∈ T, and ∞ if there are no edges yet from nodes in T to v. At each step, use extractmin to find the vertex u with smallest key, and add u to the tree using this edge. Then for each neighbor v of u, do decreasekey(v, w({u, v})). Overall we do m decreasekey operations, n inserts, and n extractmins, with the decreasekeys supplying the dominating O(m log n) term. (We can optimize slightly by inserting a vertex into the priority queue only when it has an edge to the current tree T. This does not seem particularly useful right now, but will be crucial in the Fredman-Tarjan proof.)
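As a concrete rendering of this description, here is a Python sketch using the standard binary heap from heapq; since heapq has no decreasekey, the sketch uses the usual lazy-deletion workaround, which keeps the same O(m log n) bound. The function name and input format (adjacency lists adj[u] of (neighbor, weight) pairs) are my own choices:

```python
import heapq

def jarnik_prim(n, adj, r=0):
    """Sketch of the Jarnik/Prim algorithm above with a binary heap.
    adj[u] = list of (v, w) pairs; returns the list of tree edges."""
    key = [float("inf")] * n
    key[r] = 0
    in_T = [False] * n
    parent = [None] * n
    pq = [(0, r)]                        # (key, vertex) pairs
    tree_edges = []
    while pq:
        k, u = heapq.heappop(pq)
        if in_T[u] or k > key[u]:
            continue                     # stale heap entry: skip
        in_T[u] = True
        if parent[u] is not None:
            tree_edges.append((parent[u], u))
        for v, w in adj[u]:
            if not in_T[v] and w < key[v]:
                key[v], parent[v] = w, u
                heapq.heappush(pq, (w, v))   # stands in for decreasekey
        # note: a vertex first enters the heap only once it has an
        # edge to T, matching the optimization in the margin note
    return tree_edges
```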
[Figure 1.4: We begin at vertices A, H, R, and D (in that order) with K = 6. Although D begins as its own component, it stops when it joins with tree A. Dashed edges are not chosen in this step (though they may be chosen in the next recursive call), and colors denote trees.]
Let’s first note that the runtime of one round of the algorithm is
O(m + n log K ). Each edge is considered at most twice, once from
each endpoint, giving us the O(m) term. Each time we grow the
current tree in step 1, the number of connected components decreases
by 1, so there are at most n such steps. Each step calls findmin on
a heap of size at most K, which takes O(log K) time. Hence, at the
end of this round, we’ve successfully identified a forest, each edge of
which is part of the final MST, in O(m + n log K ) time.
Let d_v be the degree of the vertex v in the graph we consider in this round. We claim that every marked vertex u belongs to a component C such that ∑_{v∈C} d_v ≥ K. Indeed, if u became marked because the neighborhood of its component had size at least K, then this is true. Otherwise, u became marked because it entered a component C of marked vertices. Since the vertices of C were marked, ∑_{v∈C} d_v ≥ K before u joined, and this sum only increased when u (and other vertices) joined. Thus, if C_1, . . . , C_l are the components at the end of this routine, we have

2m = ∑_v d_v = ∑_{i=1}^{l} ∑_{v∈C_i} d_v ≥ ∑_{i=1}^{l} K = K·l.

Thus l ≤ 2m/K, i.e., this routine produces at most 2m/K trees.
The choice of K will change over the course of the algorithm. How should we set the thresholds K_i? Say we start round i with n_i nodes and m_i ≤ m edges. One clean way is to set

K_i := 2^{2m/n_i}.
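To see why this choice of thresholds gives the O(m log∗ n) bound (a short calculation, spelled out here for concreteness using the bounds above): round i costs O(m + n_i log K_i) = O(m + n_i · 2m/n_i) = O(m), and since round i ends with n_{i+1} ≤ 2m/K_i trees,

K_{i+1} = 2^{2m/n_{i+1}} ≥ 2^{2m/(2m/K_i)} = 2^{K_i}.

So the thresholds grow as a tower of 2s, and within O(log∗ n) rounds the threshold exceeds n, at which point the round completes the entire MST—for O(m log∗ n) total time.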
being devilishly clever. I think it is the latter (and that is the beauty of the best algorithms). Indeed, there's a lovely idea here—keeping the neighborhoods small at the beginning, when there's a lot of work to do, but allowing them to grow quickly as the graph collapses. It is quite non-obvious at the start, and obvious in hindsight. And once you see it, you cannot un-see it!
Fact 1.6 (Soundness). For any forest F, the F-light edges contain the
MST of the underlying graph G. In other words, any F-heavy edge is
also heavy with respect to the MST of the entire graph.
This suggests a clear strategy: pick a forest F from the current
edges, and discard all the F-heavy edges. Hopefully the number of
edges remaining is small. By Fact 1.6 these edges contain the MST of
G, so repeat the process on them. To make this idea work, we want
a forest F with many F-heavy edges. The catch is that a forest has many heavy edges only if it has small weight, i.e., only if there are many off-forest edges forming cycles on which they are the heaviest edges. Indeed, one such forest is the MST T∗ of G: Fact 1.5 shows there are m − (n − 1) many T∗-heavy edges, the maximum possible. How do we find some similarly good tree/forest, but in linear time?
A second issue is to classify edges as light/heavy, given a forest F.
It is easy to classify a single edge e in linear time, but the following
remarkable theorem is also true:
Proof. This follows from Fact 1.6, that discarding heavy edges of any
forest F in a graph does not change the MST. Indeed, the MST on
G2 is the same as the MST on G ′ , since the discarded F1 -heavy edges
cannot be in MST ( G ′ ) because of Fact 1.6. Adding back the edges
picked by Borůvka’s algorithm in Step 1 gives the MST on G, by the
cut rule.
is also short, but before we prove it, let us complete the proof of the
linear running time.
Theorem 1.11. The KKT algorithm, run on a graph with m edges and n
vertices, terminates in expected time O(m + n).
T_{m,n} := max_{G=(V,E): |V|=n, |E|=m} T_G.
Proof of Claim 1.10. For the sake of the proof, we can use any correct
algorithm to compute F1 , so let us use Kruskal’s algorithm. Moreover,
let's run a lazy version as follows: first sort all the edges in E′, and not just those in E1 ⊆ E′, and consider them in increasing order
of weights. Now if the currently considered edge ei connects two
different trees in the current blue forest, call ei useful and flip an
independent unbiased coin: if the coin comes up “heads”, color ei
blue and add it to F1 , else color ei red. The crucial observation is
that this process produces a forest from the same distribution as first
choosing G1 and then computing F1 by running Kruskal’s algorithm
on it.
Now, let us consider the lazy process again: which edges are F1 -
light? We claim that these are precisely the useful edges. Indeed,
any non-useful edge e j forms a cycle with the previously chosen
blue edges in F1 , and it is the heaviest edge on that cycle. Hence
e j does not belong to MST ( F1 ∪ {e j }), so it is F1 -heavy by Fact 1.4.
And a useful edge ei would belong to MST ( F1 ∪ {ei }), since run-
ning Kruskal’s algorithm on F1 ∪ {ei } would see that ei connects two
different blue components and hence would pick it.
Finally, how many useful edges are there, in expectation? Let’s
abstract away the details: we’re running a process that periodically
asks us to flip an independent unbiased coin. Each time we see a heads, we add an edge to the forest, so we definitely stop when we
see n′ − 1 heads. (We may stop earlier, in case the process runs out of
edges, but then we can pad the random sequence to flip some more
coins.) Since the coins are independent and unbiased, the expected
number of flips until we see n′ − 1 heads is exactly 2(n′ − 1). This
proves Claim 1.10.
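For concreteness, here is a small Python sketch of the lazy coin-flipping process from this proof; the sampling details and the inlined union-find are standard choices of mine, not from the notes:

```python
import random

def lazy_kruskal(n, edges):
    """Lazy process from the proof of Claim 1.10: scan edges in
    increasing weight order; every *useful* edge (one joining two
    different blue trees) gets an unbiased coin flip, and joins F1
    on heads. edges: list of (w, u, v). Returns (F1, #useful)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    F1, useful = [], 0
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                        # useful edge
            useful += 1
            if random.random() < 0.5:       # heads: color it blue
                parent[ru] = rv
                F1.append((u, v, w))
    # E[useful] <= 2(n-1), matching the negative-binomial argument
    return F1, useful
```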
That's it. The algorithm and proof are both short and slick and beautiful: this result is a real gem. I think it's an algorithm from The Book. (Paul Erdős claimed that God has "The Book", which contains the most elegant proof of each mathematical theorem.) The one slight annoyance with the algorithm is the relative complexity of the MST verification algorithm, which we use to find the F1-light edges in linear time. Nonetheless, these verification algorithms also contain many nice ideas, which we now discuss. (The current verification algorithms are deterministic; can we use randomness to simplify these as well?)
by János Komlós (1985). His result was subsequently made algorithmic ("how do you find (in linear time) which linear number of queries to make?") by Brendan Dixon, Monika Rauch (now Monika Henzinger), and Bob Tarjan. This algorithm was further simplified by Valerie King, and by Thomas Hagerup. We will just discuss Komlós's query-complexity bound.
A_e := (a_1, a_2, . . . , a_k),
Claim 1.13. The total number of comparisons for all queries is at most

∑_e log(|Q_e| + 1) ≤ O(n + n log((m + n)/n)) = O(m + n).
Exercise 1.14. Show that each node in T ′ has at least two children,
and all leaves belong to the same level. There are n leaves (corre-
sponding to the nodes in T), and at most 2n − 1 nodes in T ′ . Also
show how to construct T ′ in linear time.
Exercise 1.15. For nodes u, v in a tree T, let maxwtT (u, v) be the maxi-
mum weight of an edge on the (unique) path between u, v in the tree
T. Show that for all u, v ∈ V, maxwt_T(u, v) = maxwt_{T′}(u, v).
  j:      1    2    3            4              · · ·   n
  row 1:  2    4    6            8              · · ·   2n
  row 2:  2    4    8            16             · · ·   2^n
  row 3:  2    4    2^(2^2)      2^(2^(2^2))    · · ·   tower of 2s
  row 4:  2    4    65536 (!!)   huge!          · · ·
1.7 Matroids
2.1 Arborescences
are non-negative. Because no outgoing arcs from r will be part of any arborescence, we can assume no such arcs exist in G either. For brevity, we fix r and simply say arborescence when we mean r-arborescence. (If there are negative arc weights, add a large positive constant M to every weight. This increases the total weight of each arborescence by M(n − 1), and hence the identity of the minimum-weight one remains unchanged.)
Proof. Each arborescence has exactly one arc leaving each vertex.
Decreasing the weight of every arc exiting v by MG (v) decreases the
weight of every possible arborescence by MG (v) as well. Thus, the set
of min-weight arborescences remains unchanged.
Now each vertex has at least one 0-weight arc leaving it; for each vertex, pick an arbitrary 0-weight arc out of it. If this choice forms an arborescence, it has weight zero and hence is optimal; otherwise the chosen arcs contain a cycle, which the next step handles.
The proof also gives an algorithm for finding the min-weight arborescence on G′ by contracting the cycle C (in linear time), recursing on G′′, and then "lifting" the solution T′′ back to a solution T′. Since we recurse on a graph which has at least one fewer node, there are at most n recursive calls. Moreover, the weight-reduction, contraction, and lifting steps in each recursive call take O(m) time, so the runtime of the algorithm is O(mn).

[Figure 2.4: Contracting the two white nodes down to a cycle, and removing arc b.]
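Here is a hedged Python sketch of the reduce/contract/recurse scheme just described, written iteratively and computing only the weight of the optimal arborescence (recovering the arcs via "lifting" needs extra bookkeeping). The function name and input format are mine:

```python
def min_arborescence_weight(n, arcs, root):
    """Weight of the min-weight r-arborescence: every non-root vertex
    keeps exactly one outgoing arc, and all paths lead to `root`.
    arcs: list of (u, v, w). Assumes an arborescence exists. O(mn)."""
    total = 0
    INF = float("inf")
    while True:
        # cheapest out-arc for every non-root vertex
        best, head = [INF] * n, [None] * n
        for u, v, w in arcs:
            if u != root and w < best[u]:
                best[u], head[u] = w, v
        total += sum(best[v] for v in range(n) if v != root)
        # look for a cycle among the chosen (reduced-to-zero) arcs
        state, cycle = [0] * n, None   # 0 unseen, 1 on path, 2 done
        state[root] = 2
        for s in range(n):
            u, path = s, []
            while state[u] == 0:
                state[u] = 1
                path.append(u)
                u = head[u]
            if state[u] == 1:          # walked back into our own path
                cycle = path[path.index(u):]
                break
            for x in path:
                state[x] = 2
        if cycle is None:
            return total               # chosen arcs form an arborescence
        # contract the cycle into a supernode, reducing arc weights
        on_cycle = set(cycle)
        comp, cid = [0] * n, 0
        for v in range(n):
            if v not in on_cycle:
                comp[v] = cid
                cid += 1
        for v in on_cycle:
            comp[v] = cid              # shared supernode id
        new_arcs = []
        for u, v, w in arcs:
            w2 = w - best[u] if u != root else w
            if comp[u] != comp[v]:     # drop arcs inside the cycle
                new_arcs.append((comp[u], comp[v], w2))
        n, arcs, root = cid + 1, new_arcs, comp[root]
```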
Remark 2.8. This is not the best known run-time bound: there are many optimizations possible. Tarjan (1971) presents an implementation of the above algorithm using priority queues in O(min(m log n, n²)) time, and Gabow, Galil, Spencer, and Tarjan (1986) give an algorithm to solve the min-weight arborescence problem in O(n log n + m) time. The best runtime currently known is O(m log log n), due to Mendelson, Tarjan, Thorup, and Zwick (2006).
Open problem 2.9. Is there a linear-time (randomized or determinis-
tic) algorithm to find a min-weight arborescence in a digraph G?
The dual linear program has a single variable yi for each constraint
in the original (primal) linear program. This variable can be thought
of as giving an importance weight to the constraint, so that taking a
linear combination of constraints with these weights shows that the
primal cannot possibly surpass a certain value for c⊺ x. This purpose
is exemplified by the following theorem.
Proof. c⊺ x ≥ ( A⊺ y)⊺ x = y⊺ Ax ≥ y⊺ b = b⊺ y.
minimize   ∑_{a∈A} w(a) x_a
subject to ∑_{a∈∂⁺S} x_a ≥ 1   ∀S ⊆ V − {r}            (2.1)
           ∑_{a∈∂⁺v} x_a = 1   ∀v ≠ r
           x_a ∈ {0, 1}        ∀a ∈ A.

minimize   ∑_{a∈A} w(a) x_a
subject to ∑_{a∈∂⁺S} x_a ≥ 1   ∀S ⊆ V − {r}            (2.2)
           ∑_{a∈∂⁺v} x_a = 1   ∀v ≠ r
           x_a ≥ 0             ∀a ∈ A.
Exercise 2.15. Suppose all the arc weights are non-negative. Show that the optimal solution to the linear program remains unchanged even if we drop the constraints ∑_{a∈∂⁺v} x_a = 1.
maximize   ∑_{S⊆V−{r}} y_S
subject to ∑_{S: a∈∂⁺S} y_S ≤ w(a)   ∀a ∈ A            (2.3)
           y_S ≥ 0                   ∀S ⊆ V − {r}, |S| > 1.
Lemma 2.16. If arc weights are non-negative, there exists a solution for the dual LP (2.3) such that w⊺x = 1⊺y, where all y_S values are non-negative.
• The base case is when the chosen zero-weight arcs out of each
node form an arborescence. In this case we can set yS = 0 for
all S; since all arc weights are non-negative, this is a feasible dual
solution. Moreover, both the primal and dual values are zero.
• Else, suppose the first step reduces the weight of every arc out of some vertex u by M := M_G(u). Inductively, let y′ be a feasible dual for the reduced instance, and set y_{{u}} := y′_{{u}} + M, keeping all other dual values. Then for any arc a out of u,

∑_{S: a∈∂⁺S} y_S = ∑_{S: a∈∂⁺S, |S|=1} y_S + ∑_{S: a∈∂⁺S, |S|≥2} y_S
                = (y′_{{u}} + M) + ∑_{S: a∈∂⁺S, |S|≥2} y′_S
                ≤ M + w′(a) = M + (w(a) − M) = w(a).

Moreover, the value of the dual increases by M, the same as the increase in the weight of the arborescence.
• Else, suppose the chosen zero-weight arcs contain a cycle C, which we contract down to a node v_C. Using induction for this new graph G′, let y′ be the feasible dual solution. For any subset S′ of nodes in G′ that contains the new node v_C, let S = (S′ \ {v_C}) ∪ C, and define y_S = y′_{S′}. For all other subsets S′ in G′ not containing v_C, define y_S = y′_{S′}. Moreover, for all nodes v ∈ C, define y_{{v}} = 0. The dual value remains unchanged, as does the weight of the solution T obtained by lifting T′. The dual constraint changes only for arcs of the form a = (v, u), where v ∈ C and u ∉ C. But such an arc is replaced by an arc a′ = (v_C, u), whose weight is at most w(a). Hence

∑_{S: a∈∂⁺S} y_S = y′_{{v_C}} + ∑_{S′: a′∈∂⁺S′, S′≠{v_C}} y′_{S′} ≤ w(a′) ≤ w(a).

[Figure 2.5: An optimal dual solution: vertex sets are labeled with dual values, and arcs with costs.]
Corollary 2.17. There exists a solution for the dual LP (2.3) such that
w⊺ x = 1⊺ y. Hence the algorithm produces an optimal arborescence even for
negative arc weights.
Proof. If some arc weights are negative, add M to all arc weights to get the new graph G′ where all arc weights are positive. Let y′ be the optimal dual for G′ from Lemma 2.16; define y_S = y′_S for all sets of size at least two, and y_{{v}} = y′_{{v}} − M for singletons. Note that the weight of the optimal solution on G is precisely M(n − 1) smaller than on G′; the same is true for the total dual value. Moreover, for arc e = (u, v), we have

∑_{S: e∈∂⁺S} y_S = (y′_{{u}} − M) + ∑_{S: e∈∂⁺S, |S|≥2} y′_S ≤ w′(e) − M = w(e),

so the dual remains feasible.
Karb ⊆ K.
In general, the two polytopes are not equal. But Corollary 2.17 implies that in this particular setting, the two are indeed equal. A geometric, hand-wavy argument is easy to make — if K were strictly bigger than Karb, there would be some direction
s to all vertices in V.

2. The all-pairs shortest paths (APSP) problem asks for the distances between each pair of vertices in V.

We will consider both these variants, and give multiple algorithms for both. (We do not consider the s-t shortest-path problem, since algorithms for that problem also tend to solve the SSSP problem on worst-case instances.)
Lemma 3.1. After i iterations of the algorithm, dist(v) equals the weight of
the shortest-path from s to v containing at most i edges. (This is defined to
be ∞ if there are no such paths.)
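The invariant in Lemma 3.1 is exactly that of Bellman-Ford-style relaxation; here is a minimal Python sketch (the per-round snapshot ensures the invariant holds verbatim; names and input format are mine):

```python
def bellman_ford(n, arcs, s):
    """After round i, dist[v] equals the weight of the shortest s->v
    path using at most i edges (infinity if none), as in Lemma 3.1.
    arcs: list of (u, v, w) triples; assumes no negative cycles."""
    INF = float("inf")
    dist = [INF] * n
    dist[s] = 0
    for _ in range(n - 1):          # shortest paths use <= n-1 edges
        prev = dist[:]              # relax against the previous round
        for u, v, w in arcs:        # so the invariant holds exactly
            if prev[u] + w < dist[v]:
                dist[v] = prev[u] + w
    return dist
```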
1. The new weights ŵ are all non-negative. This comes from the definition of the feasible potential.

2. Let P_ab be a path from a to b. Let ℓ(P_ab) be the length of P_ab when we use the weights w, and ℓ̂(P_ab) be its length when we use the weights ŵ. Then ℓ̂(P_ab) = ℓ(P_ab) + ϕ(a) − ϕ(b), since the potential terms telescope along the path.
4. If we set ϕ(s) = 0 for some vertex s, then ϕ(v) for any other vertex
v is an underestimate of the s-to-v distance. This is because for all
maximize   ∑_{x∈V} ϕ_x
subject to ϕ_s = 0
           w_{vu} + ϕ_v − ϕ_u ≥ 0   ∀(v, u) ∈ E
This is the usual matrix multiplication, but over the semiring (R, min, +). (A semiring has a notion of addition and one of multiplication; however, neither the addition nor the multiplication operation is required to have inverses.)

It turns out that computing Min-Sum Products is precisely the operation needed for the APSP problem. Indeed, initialize a matrix D exactly as in the Floyd-Warshall algorithm:

D_ij = w_ij if (i, j) ∈ E;   D_ij = ∞ if (i, j) ∉ E, i ≠ j;   D_ij = 0 if i = j.
Now ( D ⊚ D )ij represents the cheapest i-j path using at most 2 hops!
(It’s as though we made the outer-most loop of Floyd-Warshall into
the inner-most loop.) Similarly, we can compute
D^{⊚k} := D ⊚ D ⊚ · · · ⊚ D   (k − 1 MSPs),
whose entries give the shortest i-j paths using at most k hops (or at
most k − 1 intermediate nodes). Since the shortest paths would have
at most n − 1 hops, we can compute D⊚n−1 .
How much time would this take? The very definition of MSP
shows how to implement it in O(n3 ) time. But performing it n − 1
times would be O(n) worse than all other approaches! But here’s a
classical trick, which probably goes back to the Babylonians: for any
integer k,
D⊚2k = D⊚k ⊚ D⊚k .
(Here we use that the underlying operations are associative.) Now it
is a simple exercise to compute D⊚n−1 using at most 2 log2 n MSPs.
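A short Python sketch of the two ingredients just described—the O(n³) min-sum product, and repeated squaring to reach D^{⊚(n−1)} with O(log n) MSPs. The input D is assumed initialized as above; function names are mine:

```python
def msp(A, B):
    """Min-sum product over the (min, +) semiring:
    C[i][j] = min_k (A[i][k] + B[k][j]). Plain O(n^3)."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def apsp(D):
    """Repeated squaring: D^(2k) = D^k (msp) D^k, so O(log n) MSPs
    reach D^(n-1), whose entries are the all-pairs distances
    (powers beyond n-1 don't change, since diagonals are 0)."""
    k, n = 1, len(D)
    while k < n - 1:
        D = msp(D, D)
        k *= 2
    return D
```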
This value, and Strassen's idea, has been refined over the years, to its current value of 2.3728 due to François Le Gall (2014). (See the survey by Virginia for a discussion of algorithmic progress until 2013.) There has been a flurry of work on lower bounds as well, e.g., by Josh Alman and Virginia Vassilevska Williams, showing limitations for all known approaches. (The big improvements in this line of work were due to Arnold Schönhage (1981), and Don Coppersmith and Shmuel Winograd (1990), with recent refinements by Andrew Stothers, CMU alumna Virginia Vassilevska Williams, and François Le Gall (2014).)
But how about MSP(n)? Sadly, progress on this has been less
impressive. Despite much effort, we don’t even know if it can be
done in O(n3−ϵ ) time. In fact, most of the recent work has been on
giving evidence that getting sub-cubic algorithms for MSP and APSP
may not be possible. There is an interesting theory of hardness within
P developed around this problem, and related ones. For instance, it is
now known that several problems are equivalent to APSP, and truly
sub-cubic algorithms for one will lead to sub-cubic algorithms for all
of them.
Yet there is some interesting progress on the positive side, albeit qualitatively small. As far back as 1976, Fredman had shown an algorithm to compute MSP in O(n³ · (log log n)/(log n)) time. He used the fact that the decision-tree complexity of APSP is sub-cubic (a result we will discuss in §3.5) in order to speed up computations over nearly-logarithmic-sized sub-instances; this gives the improvement above.
More recently, another CMU alumnus, Ryan Williams (2018), improved on this idea quite substantially to O(n³/2^{√(log n)}), using very interesting ideas from circuit complexity. We will discuss this result in a later section, if we get a chance.
Now consider the graph G2 , the square of G, which has the same
vertex set as G but where an edge in G2 corresponds to being at most
two hops away in G—that is, uv ∈ E( G2 ) ⇐⇒ dG (u, v) ≤ 2. To
construct the adjacency matrix for G² from the adjacency matrix A of G, we can use the following idea:
u, a1 , b1 , a2 , b2 , . . . , ak , bk , v
u, a1 , b1 , a2 , b2 , . . . , ak , bk , ak+1 , v.
But which one? The following lemmas give us a simple rule to decide.
Let NG (v) denote the set of neighbors of v in G.
Lemma 3.6. If duv = 2Duv , then for all w ∈ NG (v) we have Duw ≥ Duv .
Proof. Assume not, and let w ∈ NG(v) be such that Duw < Duv. Since both are integers, we have 2Duw ≤ 2Duv − 2. Then the shortest u-w path in G along with the edge wv forms a u-v path in G of length at most 2Duw + 1 < 2Duv = duv, contradicting the assumption that duv is the shortest-path distance in G.
Lemma 3.7. If duv = 2Duv − 1, then Duw ≤ Duv for all w ∈ NG (v);
moreover, there exists z ∈ NG (v) such that Duz < Duv .
In matrix notation: d_{uv} = 2D_{uv} − 1 exactly when (DA)_{uv} < D_{uv} · deg(v).
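Putting the squaring step and Lemmas 3.6/3.7 together gives Seidel's recursion; here is a hedged numpy sketch (plain matrix multiplication stands in for fast matrix multiplication, so this version runs in O(n³ log n) rather than Õ(n^ω); input is a 0/1 adjacency matrix of a connected graph with zero diagonal):

```python
import numpy as np

def seidel(A):
    """All-pairs distances for an unweighted, connected, undirected
    graph, following the recursion described above."""
    n = A.shape[0]
    if np.all(A + np.eye(n, dtype=A.dtype) > 0):
        return A.copy()              # complete graph: all distances 1
    # adjacency of the square graph G^2: within two hops in G
    A2 = (((A @ A) > 0) | (A > 0)).astype(A.dtype)
    np.fill_diagonal(A2, 0)
    D2 = seidel(A2)                  # distances in G^2, recursively
    # d(u,v) is 2*D2[u,v] or 2*D2[u,v]-1; Lemmas 3.6/3.7 decide by
    # comparing sum_{w in N(v)} D2[u,w] = (D2 A)[u,v] with deg(v)*D2[u,v]
    deg = A.sum(axis=1)
    S = D2 @ A
    return 2 * D2 - (S < D2 * deg[None, :]).astype(A.dtype)
```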
output the paths is fairly simple. But for Seidel’s algorithm, things
get tricky. Indeed, since the runtime of Seidel’s algorithm is strictly
sub-cubic, how can we write down the shortest paths in nω time,
since the total length of all these paths may be Ω(n3 )? We don’t: we
just write down the successor pointers. Indeed, for each pair u, v, define S_v(u) to be the second node on a shortest u-v path (the first node being u, and the last being v). Then to get the entire u-v shortest path, we just follow these pointers: u, S_v(u), S_v(S_v(u)), . . . , v.
Given the algorithmic advances, one may wonder about lower bounds
for the APSP problem. There is the obvious Ω(n2 ) lower bound
from the time required to write down the answer. Maybe even the
decision-tree complexity of the problem is Ω(n3 )? Then no algorithm
can do any faster, and we’d have shown the Floyd-Warshall and the
Matrix-Multiplication methods are optimal.
However, thanks to a result of Michael Fredman (1976), we know this is not the case. If we just care about the decision-tree complexity, we can get much better. Specifically, Fredman shows the following:
A_{ik∗} + B^⊺_{jk∗} ≤ A_{ik} + B^⊺_{jk}   ∀k            (3.2)
⟺ A_{ik∗} − A_{ik} ≤ −(B^⊺_{jk∗} − B^⊺_{jk})   ∀k       (3.3)
Now, for every pair of columns p, q, sort the following 2n numbers:

A_{1p} − A_{1q}, A_{2p} − A_{2q}, . . . , A_{np} − A_{nq}, −(B_{1p} − B_{1q}), . . . , −(B_{np} − B_{nq}).
This result does not give us a fast algorithm, since it just counts
the number of comparisons, and not the actual time to figure out
which comparisons to make. Regardless, many of the algorithms
that achieve n3 / poly log n time for APSP use Fredman’s result on
tiny instances (say of size O(poly log n), so that we can find the best
decision-tree using brute-force) to achieve their results.
4
Low-Stretch Spanning Trees
Given that shortest paths from a single source node s can be repre-
sented by a single shortest-path tree, can we get an analog for all-
pairs shortest paths? Given a graph can we find a tree T that gives us
the shortest-path distances between every pair of nodes? Does such
a tree even exist? Sadly, the answer is negative—and it remains neg-
ative even if we allow this tree to stretch distances by a small factor,
as we will soon see. However, we show that allowing randomiza-
tion will allow us to circumvent the problems, and get low-stretch
spanning trees in general graphs.
In this chapter, we consider undirected graphs G = (V, E), where each edge e has a non-negative weight/length we. For all u, v in V, let dG(u, v) be the distance between u, v, i.e., the length of a shortest path in G from u to v. Observe that the set V along with the distance function dG forms a metric space. (A metric space is a set V with a distance function d satisfying symmetry, i.e., d(x, y) = d(y, x) for all x, y ∈ V, and the triangle inequality, d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ V. Typically, the definition also asks for x = y ⟺ d(x, y) = 0, but we will merely assume d(x, x) = 0 for all x.)

4.1 Towards a Definition

The study of low-stretch spanning trees is guided by two high-level hopes:
1. Graphs have spanning trees that preserve their distances. That is, given G there exists a subtree T = (V, ET) with ET ⊆ E such that dG(u, v) ≈ dT(u, v) for all u, v ∈ V. (We assume that the weights of edges in ET are the same as those in G.)
Now, since π T is the optimal ordering for the tree T, and πG is some
other ordering,
which is α · OPTG .
Observe that the first property must hold with probability 1 (i.e.,
it holds for all trees in the support of the distribution), whereas the
second property holds only on average. Is this definition any good
for our TSP example above? If we change the algorithm to sample a
tree T from the distribution and then return the optimal tour for T,
we get a randomized algorithm that is good in expectation. Indeed,
(4.1) becomes
E[dT(u, v)] = ((n − 1)/n) · 1 + (1/n) · (n − 1) = 2 − 2/n.
And what about an arbitrary pair of nodes u, v in Cn? We can use the exercise on the right to show that the stretch on other pairs is no worse! (Exercise: Given a graph G, suppose the stretch on all edges is at most α. Show that the stretch on all pairs of nodes is at most α. Hint: linearity of expectation.)
While we will not manage to get α < 1.49 for general graphs (or
even for the above examples, for which the bounds of 2 − n2 are the
best possible), we show that α ≈ O(log n) can indeed be achieved.
The following theorem is the current best result, due to Ittai Abra-
ham and Ofer Neiman:
Theorem 4.4. For any graph G, there exists a distribution D over span-
ning trees of G with stretch α = O(log n log log n). Moreover, the
construction is efficient: we can sample trees from this distribution D in
O(m log n log log n) time.
Theorem 4.7. For any metric space M = (V, d), there exists an efficiently sampleable α_B-stretch spanning tree distribution D_B, where α_B = O(log n log ∆_M).

Pr[x, y in different clusters] ≤ β · d(x, y)/D.
Let’s see a few examples, to get a better sense for the definition:
1. Consider a set of points on the real line. One way to partition the
line into pieces of diameter D is simple: imagine making notches
low-stretch spanning trees 55
3. What about lower bounds? One can show that for the k-dimensional
hypergrid, we cannot get β = o (k). Or for a constant-degree n-
vertex expander, we cannot get β = o (log n). Details to come soon.
Since the aspect ratio of the metric space is invariant to scaling all
the edge lengths by the same factor, it will be convenient to assume
that the smallest non-zero distance in d is 1, so the largest distance is
∆. The basic algorithm is then quite simple:
Now the probability that Rv > D/2 for one particular cluster is

Pr[Rv > D/2] = (1 − p)^{D/2} ≤ e^{−pD/2} ≤ e^{−2 log n} = 1/n².

(We use that 1 − z ≤ e^{−z} for all z ∈ R.) By a union bound, every cluster has diameter at most D with probability

1 − Pr[∃v ∈ V : Rv > D/2] ≥ 1 − n · (1/n²) = 1 − 1/n.
To bound the probability of some pair u, v being separated, we use the fact that sampling from the geometric distribution with parameter p means repeatedly flipping a coin with bias p and counting the number of flips until we see the first heads. Recall this process is memoryless, meaning that even if we have already performed k flips without having seen a heads, the time until the first heads is still geometrically distributed.

Hence, the steps of drawing Rv and then forming the cluster can be viewed as starting from v, where the cluster is a unit-radius ball around v. Each time we flip a coin of bias p: if it comes up heads, we set the radius Rv to the current value, form the cluster Cv (and mark its vertices), and then pick a new unmarked point v; on seeing tails, we just increment the radius of v's cluster by one and flip again. The process ends when all vertices lie in some cluster.

[Figure 4.1: A cluster forming around v in the LDD process, separating x and y. To reduce clutter, only some of the distances are shown.]
For x, y, consider the first time when one of these vertices lies
inside the current ball centered at some point, say, v. (This must hap-
pen at some point, since all vertices are eventually marked.) With-
out loss of generality, let the point inside the current ball be x. At
this point, we have performed d(v, x ) flips without having seen a
heads. Now we will separate x, y if we see a heads within the next
⌈d(v, y) − d(v, x )⌉ ≤ ⌈d( x, y)⌉ flips—beyond that, both x, y will have
been contained in v’s cluster and hence cannot be separated. But
the probability of getting a heads among these flips is at most (by a
union bound)
⌈d(x, y)⌉ · p ≤ 2 d(x, y) · p ≤ 8 log n · d(x, y)/D.
(Here we used that the minimum distance is 1, so rounding up dis-
tances at most doubles things.) This proves the claimed probability of
separation.
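A minimal Python sketch of the ball-growing process just analyzed, for a finite metric given as a distance matrix. The parameter p = 4 ln n / D matches the calculation above; the explicit cap at D/2 (which the analysis shows is rarely needed) and all names are my choices:

```python
import math
import random

def ldd(dist, D):
    """Low-diameter decomposition sketch: dist[u][v] = d(u, v) on
    points 0..n-1. Returns a list of clusters (lists of points)."""
    n = len(dist)
    p = min(1.0, 4 * math.log(max(n, 2)) / D)
    unmarked = set(range(n))
    clusters = []
    while unmarked:
        v = next(iter(unmarked))         # new cluster center
        r = 1                            # start with a unit-radius ball
        while random.random() > p:       # tails: grow the radius by one
            r += 1
        # capping at D/2 enforces diameter <= D outright (my choice;
        # the text instead argues the cap is unnecessary w.h.p.)
        ball = [u for u in unmarked if dist[v][u] <= min(r, D / 2)]
        unmarked.difference_update(ball)
        clusters.append(ball)
    return clusters
```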
Lemma 4.11. If the random tree T returned by some call LDD(M′, δ) has root r, then (a) every vertex x in T has distance d(x, r) ≤ 2^{δ+1}, and (b) the expected distance between any x, y ∈ T has E[d_T(x, y)] ≤ 8δβ · d(x, y).

Proof. The proof is by induction on δ. For the base case, the tree has a single vertex, so the claims are trivial. Else, let x lie in cluster C_i, so inductively the distance to the root r_i of the tree T_i is d(x, r_i) ≤ 2^{(δ−1)+1}. Now the distance to the new root r is at most 2^δ more, which gives 2^δ + 2^δ = 2^{δ+1}, as claimed.

Moreover, any pair x, y is separated by the LDD with probability β · d(x, y)/2^{δ−1}, in which case their distance is at most 2 · 2^{δ+1} (by part (a)).
Else they lie in the same cluster, and inductively have expected dis-
This proves Theorem 4.7 because β = O(log n), and the initial
call on the entire metric defines δ = O(log ∆). In fact, if we have a
better LDD (with smaller β), we immediately get a better low-stretch
tree. For example, shortest-path metrics of planar graphs admit an
LDD with parameter β = O(1); this shows that planar metrics admit
(randomized) low-stretch trees with stretch O(log ∆).
It turns out this factor of O(log n log ∆) can be improved to O(log n)—
this was done by Fakcharoenphol, Rao, and Talwar. Moreover, the
bound of O(log n) is tight: the lower bounds of Theorem 4.5 continue
to hold even for low-stretch non-spanning trees.
TO be added in
6
Blank
TO be added in
7
Graph Matchings I: Combinatorial Algorithms
edge covers two vertices, and each vertex can be covered by at most
one edge.
Given sets S, T, their symmetric difference is denoted S △ T := (S \ T) ∪ (T \ S).

[Figure 7.2: An augmenting path]
Note that the entire set V is trivially a vertex cover, and the chal-
lenge is to find small vertex covers. We denote the size of the smallest
cardinality vertex cover of graph G as VC ( G ). Our motivation for
calling it a “dual” object comes from the following fundamental theo-
rem from the early 20th century:
Theorem 7.9 (König's Minimax Theorem, Dénes König (1916)). In a bipartite graph, the size of the largest possible matching equals the cardinality of the smallest vertex cover:

MM(G) = VC(G).
Hence, suppose we do not find an open node in an even level, and stop when some X_j is empty. Let X = ∪_j X_j be all nodes added to any of the sets X_j; we call these marked nodes. Define the set C to be the vertices on the left which are not marked, plus the vertices on the right which are marked. That is,

C := (L \ X) ∪ (R ∩ X)

[Figure 7.3: Illustration of the process to find augmenting paths in a bipartite graph. Mistakes here, to be fixed!]
Theorem 7.12 (The Tutte-Berge Max-Min Theorem). Given a graph G,

(Tutte (1947) showed that the graph has a perfect matching precisely if for every U ⊆ V, odd(G \ U) ≤ |U|; Berge (1958) gave the generalization to maximum matchings.)
The rest of this section defines the algorithm, and proves this theorem. The essential idea of the algorithm is simple, and similar to the one for the bipartite case: if we have a matching M, Berge's characterization from Theorem 7.7 says that if M is not optimal, there exists an M-augmenting path. So the natural idea would be to find such an augmenting path. However, it is not clear how to do this directly. The clever idea in the Blossom algorithm is to either find an M-augmenting path, or else find a structure called a "blossom". The good thing about blossoms is that we can use them to contract the graph in a certain way, and make progress. Let us now give some definitions, and details.

A flower is a subgraph of G that looks like the object to the right: it has an open vertex at the base, then a stem with an even number of edges (alternating between matched and unmatched edges), and then a blossom: an odd cycle whose edges alternate between matched and unmatched, except at the vertex where it meets the stem.

[Figure: (a) a flower, with its stem and blossom; (b) legend: matched edge, unmatched edge, open vertex.]
Let's give some more details for the last step. Suppose we find a flower F, with stem S and blossom B. First, toggle the stem (by setting M ← M △ S): this moves the open node to the blossom, without changing the size of the matching M. (It makes the following arguments easier, with one less case to consider.) (Change figure.)

[Figure 7.7: The shrinking of a blossom. Image found at http://en.wikipedia.org/wiki/Blossom_algorithm.]
Proof. Since we toggled the stem, the vertex v at the base of the blos-
som B is open, and so is the vertex v B created in G ′ by contracting
B. Moreover, all other nodes in the blossom are matched by edges
within itself, so all edges leaving B are non-matching edges. The
picture essentially gives the proof, and can be used to follow along.
(This is where we use the fact that the cycle is odd, and is alternating except for the two edges incident to v.)
3. If v ∈ X2j for j < i, then u would have been added to the odd level
X2j+1 , which is impossible.
Now for the edges out of the odd layers considered in line 9.3.
Given u ∈ X2i+1 and matching edge uv ∈ M, the cases are:
Observe that if the algorithm does not succeed, all the matching
edges we explored are odd-to-even, whereas all the non-matching
edges are even-to-odd. Now we can prove Lemma 7.14.
(a) the marked vertices in the even levels, X_even, which are all singletons since there are no cross edges, and

Hence

(n + |U| − odd(G \ U))/2 = (n + |X_odd| − |X_even|)/2
                         = (2|X_odd| + (n − |X|))/2
                         = |X_odd| + (n − |X|)/2 = |M|.
The last equality uses that all nodes in V \ X are perfectly matched
among themselves, and all nodes in Xodd are matched using unique
edges.
The last piece is to show that a Tutte-Berge set U′ for a contracted graph G′ = G/B with respect to M′ = M/B can be lifted to one for G with respect to M. We leave it as an exercise to show that adding the entire blossom B to U′ gives such a U.
• The first result along these lines is that of Laci Lovász (1979), who introduced the general idea, and gave a randomized algorithm to detect the presence of perfect matchings in time O(n^ω), and to find it in time O(mn^ω). We will present all the details of this elegant idea soon.

• Dick Karp, Eli Upfal, and Avi Wigderson (1986), and then Ketan Mulmuley, Umesh Vazirani, and Vijay Vazirani (1987), showed how to find such a matching in parallel. The question of getting a deterministic parallel algorithm remains an outstanding open problem, despite recent progress (which we discuss at the end of the chapter).

• Michael Rabin and Vijay Vazirani (1989) sped up the sequential algorithm to run in O(n · n^ω). This was substantially improved by the work of Marcin Mucha and Piotr Sankowski (2006) to get a runtime of O(n^ω).
For the rest of this lecture, we fix a field F, and consider (univariate and multivariate) polynomials over this field. We assume that we can perform basic arithmetic operations in constant time, though sometimes it will be important to look more closely at this assumption. (For finite fields F_q, where q is a prime power, we can perform arithmetic operations (addition, multiplication, division) in time poly log q.)
Pr[p(R) = 0] ≤ d/|S|.
This statement holds for multivariate polynomials as well, as we see next. The result is called the Schwartz-Zippel lemma, and it appears in papers by Richard DeMillo and Richard Lipton (1978), by Richard Zippel (1979), and by Jacob Schwartz (1980). (Like many powerful ideas, the provenance of this result gets complicated: a version of this for finite fields was apparently already proved in 1922 by Øystein Ore. Anyone have a copy of that paper?)

Theorem 8.3. Let p(x_1, . . . , x_n) be a non-zero polynomial over a field F, such that p has degree at most d. Suppose we choose values R_1, . . . , R_n independently and uniformly at random from a subset S ⊆ F. Then

Pr[p(R_1, . . . , R_n) = 0] ≤ d/|S|.

Hence, the number of roots of p inside S^n is at most d·|S|^{n−1}. (A monomial is a product of a collection of variables; the degree of a monomial is the sum of the degrees of the variables in it. The degree of a polynomial is the maximum degree of any monomial in it.)

Proof. We argue by induction on n. The base case of n = 1 considers univariate polynomials, so the claim follows from Theorem 8.1. Now
for the inductive step for n variables. Let k be the highest power of
xn that appears in p, and let q be the quotient and r be the remainder
when dividing p by xnk . That is, let q( x1 , . . . , xn−1 ) and r ( x1 , . . . , xn )
be the (unique) polynomials such that
p( x1 , . . . , xn ) = xnk q( x1 , . . . , xn−1 ) + r ( x1 , . . . , xn ),
Pr[p(R_1, . . . , R_n) = 0] = Pr[p(R_1, . . . , R_n) = 0 | E] · Pr[E] + Pr[p(R_1, . . . , R_n) = 0 | Ē] · Pr[Ē]
                          ≤ Pr[E] + Pr[p(R_1, . . . , R_n) = 0 | Ē].

Thus we get

Pr[p(R_1, . . . , R_n) = 0] ≤ (d − k)/|S| + k/|S| = d/|S|.
Remark 8.4. Choosing the set S ⊆ F such that |S| ≥ dn² guarantees that if p is a non-zero polynomial,

Pr[p(R_1, . . . , R_n) = 0] ≤ 1/n².

Naturally, if p is the zero polynomial, then the probability equals 1.
Observe that now each variable is of the form x_{i,j} for i < j; it occurs twice in the matrix, with the variables below the diagonal being the negations of those above.

Example 8.10. For the graph to the right, the Tutte matrix is

⎡    0       x_{1,2}     0       x_{1,4} ⎤
⎢ −x_{1,2}      0     x_{2,3}    x_{2,4} ⎥
⎢    0      −x_{2,3}     0       x_{3,4} ⎥
⎣ −x_{1,4}  −x_{2,4}  −x_{3,4}      0    ⎦

[Figure 8.2: A non-bipartite graph on vertices 1, 2, 3, 4.]
We claim the same property for this matrix as we did for the Ed-
monds matrix:
Theorem 8.11. For any graph G, the determinant of the Tutte matrix T( G )
is a non-zero polynomial over any field F if and only if there exists a perfect
matching in G.
Now given Theorem 8.11, the Tutte matrix can simply be substi-
tuted instead of the Edmonds matrix to extend the results to general
graphs.
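Here is a hedged Python sketch of the resulting tester: substitute random elements of a large prime field into the Tutte matrix and test whether the determinant is nonzero (Gaussian elimination over F_p avoids floating-point issues). The field size and function names are my choices; by Theorem 8.3 the error probability is at most n/(p − 1):

```python
import random

def det_mod(M, p):
    """Determinant of M over F_p (p prime), by Gaussian elimination."""
    n = len(M)
    M = [row[:] for row in M]
    det = 1
    for c in range(n):
        piv = next((r for r in range(c, n) if M[r][c] % p != 0), None)
        if piv is None:
            return 0
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            det = -det                      # row swap flips the sign
        det = det * M[c][c] % p
        inv = pow(M[c][c], p - 2, p)        # inverse via Fermat
        for r in range(c + 1, n):
            f = M[r][c] * inv % p
            for k in range(c, n):
                M[r][k] = (M[r][k] - f * M[c][k]) % p
    return det % p

def tutte_pm_tester(n, edges, p=(1 << 61) - 1):
    """Plug random field values into the Tutte matrix and test
    det != 0. 'True' answers are always correct; 'False' is wrong
    with probability <= n/(p-1) when a perfect matching exists."""
    T = [[0] * n for _ in range(n)]
    for i, j in edges:                      # assume i < j
        r = random.randrange(1, p)
        T[i][j], T[j][i] = r, p - r         # x and -x, mod p
    return det_mod(T, p) != 0
```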
We can convert the above perfect matching tester (which solves the decision version of the perfect matching problem) into an algorithm for the search version: one that outputs a perfect matching in a graph (if one exists), using the simple but brilliant idea of self-reducibility. (We are reducing the problem to smaller instances of itself, hence the term self-reducibility.) Suppose that graph G has a perfect matching. Then we can pick any edge e = uv and check if G[E − e], the subgraph of G obtained by dropping just the edge e, contains a perfect matching. If not, then edge e must be part of every perfect matching in G, and hence we can find a perfect matching on the induced subgraph G[V \ {u, v}]. The following algorithm is based on this observation.
Algorithm 11: Find-PM(bipartite graph G, S ⊆ F)
11.1 Assume: G has a perfect matching; let e = uv be an edge in G
11.2 if PM-tester(G[E − e], S) == Yes then
11.3     return Find-PM(G[E − e], S)
11.4 else
11.5     M′ ← Find-PM(G[V − {u, v}], S)
11.6     return M′ ∪ {e}
Theorem 8.12. Let |S| ≥ n³. Given a bipartite graph G that contains some perfect matching, Algorithm 11 finds a perfect matching with probability at least 1/2, and runs in time O(m · n^ω).
Proof. At each step, we call the tester once, and then recurse after either deleting an edge or deleting two vertices. Thus, the number of total recursive steps inside Algorithm 11 is at most m + n/2 ≤ 2m if the graph is connected. This gives a runtime of O(m · n^ω). Moreover, at each step, the probability that the tester returns a wrong answer is at most 1/n², so some call to the PM-tester makes a mistake with probability at most (m + n/2)/n² ≤ 1/2, by a union bound (for a simple graph, m ≤ n(n − 1)/2, so m + n/2 ≤ n²/2).
Claim 8.15. Let G have at most one perfect matching with k red edges.
The determinant det(M) has a term of the form ck yk if and only if G
has a k-red matching.
The polynomial p(y) has degree at most n, and hence we can recover it by Lagrange interpolation. Indeed, we can choose n + 1 distinct numbers a_0, . . . , a_n, and evaluate p(a_0), . . . , p(a_n) by computing the determinant det(M) at y = a_i, for each i. These n + 1 values are enough to determine the polynomial as follows:

p(y) = ∑_{i=0}^{n} p(a_i) ∏_{j≠i} (y − a_j)/(a_i − a_j).

(E.g., see 451 lecture notes or Ryan's lecture notes.) Note this is a completely deterministic algorithm, so far.
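A small Python sketch of this interpolation step, using exact rationals (the function name and input format are mine):

```python
from fractions import Fraction

def lagrange_interpolate(points, y):
    """Evaluate at y the unique degree-<=n polynomial through the
    n+1 given (a_i, p(a_i)) pairs, via the Lagrange formula above."""
    total = Fraction(0)
    for i, (ai, pai) in enumerate(points):
        term = Fraction(pai)
        for j, (aj, _) in enumerate(points):
            if j != i:
                term *= Fraction(y - aj, ai - aj)
        total += term
    return total

# e.g. for p(y) = y^2 recovered from three evaluations:
# lagrange_interpolate([(0, 0), (1, 1), (2, 4)], 5) == 25
```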
where Q_i is a multilinear degree-n polynomial that corresponds to all the i-red matchings. (Multilinear just means that the degree of each variable in each monomial is at most one.) If we set the x variables randomly (say, to values x_ij = a_ij) from a large enough set S, we get a polynomial R(y) = P(a, y) whose only variable is y. The coefficient of y^k in this polynomial is Q_k(a), which is non-zero with high probability, by the Schwartz-Zippel lemma. Now we can again use interpolation to find out this coefficient, and decide the red-blue matching problem based on whether it is non-zero.
And in this case, the signs of the permutations cancel each other out. What if we defined a new quantity which does not have these pesky negative signs? This function is called the permanent, defined as:

perm(A) = ∑_{σ∈S_n} ∏_i A_{i,σ(i)}.

(The term comes from Cauchy's use of "fonctions symétriques permanentes" for a related class of functions. The term determinant comes from Gauss, but apparently Cauchy was the one to use determinant to mean precisely the same object as we do.)

Given this definition, we immediately get the following fact:

Fact 8.17. Given the (bipartite) adjacency matrix A for a bipartite graph G, perm(A) is the number of perfect matchings in G.
This sounds like great news, since we no longer have to rely on the above randomization ideas. However, we seem to have gone from a minor annoyance to a major one: how do we compute the permanent efficiently? This was a source of theoretical and practical annoyance for some time, and attempts to transform permanent computations into determinant computations had been fruitless. Finally, in 1979, Les Valiant proved the following surprising theorem:

Theorem 8.18 (Valiant (1979)). It is NP-hard to compute the permanent of square {0, 1}-matrices. In fact, computing the number of perfect matchings of a bipartite graph is as hard as counting the number of satisfying assignments to a 3SAT formula.

This is truly a remarkable theorem. Finding a satisfying assignment to a 3SAT formula is NP-hard, whereas finding a perfect matching is in polynomial time. But counting the number of these two objects has the same complexity! (The class of problems reducible to counting the number of satisfying assignments to a 3SAT formula is called #P; this contains all problems in NP, naturally, but also seems to contain much more. Valiant's theorem says: counting the number of perfect matchings is as hard as all the problems in #P, which blows my mind.)

8.7 A Matrix Scaling Approach
8.7 A Matrix Scaling Approach
B := RAC.
In other words, taking the matrix A, and scaling each row i by R_ii and each column j by C_jj, gives the matrix B. Matrix scaling gives us yet another characterization of bipartite graphs that have perfect matchings:

Theorem 8.19. A bipartite graph G admits a perfect matching if and only if for each ε > 0 there exist non-negative matrices R, C such that R·A_G·C is ε-approximately doubly-stochastic.

(A matrix A is doubly-stochastic if it has unit row- and column-sums; in other words, A1 = A⊺1 = 1. The ε-doubly-stochasticity requires that A1 and A⊺1 both have entries in (1 − ε, 1 + ε).)
Given the adjacency matrix A ∈ {0, 1}n for the bipartite graph G,
we now try to find scaling matrices R and C. Since we want the row-
and column-sums to be close to 1, one “greedy” idea is to start with
A and repeatedly do the following two steps:
1. Scale each row to make the row sums equal to 1; this may put the
column sums out of whack.
2. Scale each column to make the column sums equal to 1; this may now mess up the row sums.
We show that if we ever reach a matrix where both row and column
sums are very close to 1, then Theorem 8.19 tells us that the graph
has a perfect matching. And if we don’t manage to get close to 1 in a
“reasonable” time (which depends on n and ε), interestingly we can
conclude it has no perfect matching!
To make this precise, let’s define two diagonal matrices R( A) :=
diag( A1) and C ( A) = diag( A⊺ 1). Then the algorithm becomes:
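The algorithm display itself appears to have been lost in this excerpt; here is a hedged numpy sketch of the alternating row/column normalization it describes (the stopping rule and iteration cap are my choices, standing in for the precise "reasonable time" bound mentioned above):

```python
import numpy as np

def sinkhorn(A, eps=1e-3, max_iters=100000):
    """Alternately normalize rows and columns of a non-negative
    matrix A, returning an eps-approximately doubly-stochastic
    scaling if one is reached."""
    A = A.astype(float)
    for _ in range(max_iters):
        A = A / A.sum(axis=1, keepdims=True)   # row sums -> 1
        A = A / A.sum(axis=0, keepdims=True)   # column sums -> 1
        if np.abs(A.sum(axis=1) - 1).max() < eps:
            return A        # eps-approximately doubly stochastic
    return None             # no convergence: suspect no perfect matching
```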
Let A^{(t)} be the matrix obtained after t rescalings. Here are the three crucial facts:

1. perm(A^{(t)}) ≤ 1. To show this, observe that for any non-negative matrix M,

perm(M) ≤ ∏_{i=1}^{n} (M_{i1} + . . . + M_{in}),

(While this looks very similar to the Leibniz formula for the determinant—it just lacks the (−1)^{sign(σ)} term inside the summation—the small difference completely changes the complexity of the two problems. While we can compute determinants in polynomial time, the computation of permanents is #P-complete.)

2. perm(A^{(1)}) ≥ n^{−n}. This follows from the fact that each matrix A^{(t)} has normalized rows or columns. Suppose the rows are normalized
9
Graph Matchings III: Weighted Matchings
As an example, the half-space S = {x⃗ | (1, 1) · x⃗ ≥ 3} in R² is shown on the right. (Note that we implicitly restrict ourselves to closed half-spaces.)
K = { Ax ≤ b},
Although all linear programs can be put into this canonical form,
in practice they may have many different forms. These presenta-
tions can be shown to be equivalent to one another by adding new
variables and constraints, negating the entries of A and c, etc. For
example, the following are all linear programs:
max_x {c · x : Ax ≤ b}        min_x {c · x : Ax = b}
min_x {c · x : Ax ≥ b}        min_x {c · x : Ax ≤ b, x ≥ 0}.
In other words, x is an extreme point of K if it cannot be written as the convex combination of two other points in K. See Figure 9.3 for an example. [Figure 9.3: Here y is an extreme point, but x is not.]

Here's another kind of point in K. (In this course, we will use the notation c · x, c⊺x, and ⟨c, x⟩ interchangeably to denote the inner-product between vectors c and x.)
K = CH(ext(K )).
It is conceptually easy to define an |E|-dimensional polytope whose vertices are precisely the perfect matchings of G: we simply define

C_PM(G) = CH({χ_M | M is a perfect matching in G}).        (9.3)

[Figure 9.4: This graph has one perfect matching M: it contains edges 1, 4, 5, and 6, represented by the vector χ_M = (1, 0, 0, 1, 1, 1).]
K_PM(G) = { x ∈ R^{|E|} s.t. ∑_{r∈N(l)} x_{lr} = 1 ∀l ∈ L;  ∑_{l∈N(r)} x_{lr} = 1 ∀r ∈ R;  x_e ≥ 0 ∀e ∈ E }
Proof. For brevity, let us refer to the polytopes as K and C. The easy
direction is to show that C ⊆ K. Indeed, the characteristic vector χ M
for each perfect matching M satisfies the constraints for K. Moreover
K is convex, so if it contains all the vertices of the convex set C, it
contains all their convex combinations, and hence contains all of C.
Now to show K ⊆ C, we again show that the vertices of K are
contained in C, and then use Fact 9.12 to infer it for the rest of K.
Consider an arbitrary vertex x ∗ of K. In this proof, we use the equiv-
alent view of a vertex as an extreme point of K. (A proof using the
“basic feasible solution” perspective appears in §9.2.3, and a proof
using the “vertex” perspective appears in §9.3.)
Let supp(x∗) = {e | x∗_e > 0} be the support of this solution. We claim that supp(x∗) is acyclic. Indeed, suppose not, and some cycle C = e_1, e_2, . . . , e_k is contained within the support supp(x∗). Since the graph is bipartite, this is an even-length cycle. Define

ε := min_{e∈supp(x∗)} x∗_e.
min{w · x | x ∈ K PM ( G )}
x∗|_{E\E′} = C^{−1}(1 − C′ x∗|_{E′}) = C^{−1}1.

By Cramer's rule,

x∗_e = det(C[1]_i) / det(C).

The numerator is an integer (since the entries of C are integers), so showing det(C) ∈ {±1} means that x∗_e is an integer.
Using the claim and the fact that C is non-singular (hence det(C) cannot be zero), we get that the entries of x∗ are integers. By the structure of the LP, the only integers possible in a feasible solution are {0, 1}, and so the vector x∗ corresponds to a matching.
The results of the previous section show that the bipartite perfect
matching polytope is integral, and hence the max-weight perfect
Consider the setting with a set B with n buyers and another set I with
n items, where buyer b has value vbi for item i. The goal is to find a
max-value perfect matching, that matches each buyer to a distinct
item and maximizes the sum of the values obtained by this matching.
Our algorithm will maintain a set of prices for items: each item i
will have price pi . Given a price vector p := ( p1 , . . . , pn ), define the
utility of item i to buyer b to be
ubi ( p) := vbi − pi .
A buyer has at least one preferred item, and can have multiple
preferred items, since there can be ties. Given prices p, we build a
preference graph H = H ( p), where the vertices are buyers B on the
left, items I on the right, and where bi is an edge if buyer b prefers
item i at prices p. The two examples show preference graphs, where
the second graph results from an increase in price of item 1. Flip the
figure.
min_{p=(p_1,...,p_n)}  ( ∑_{i∈I} p_i + ∑_{b∈B} u_b(p) ).
Consider the dual solution given by the price vector p∗. Recall that M is a perfect matching in the preference graph H(p∗), and let M(i) be the buyer matched to item i by it. Since u_{M(i)}(p∗) = v_{M(i)i} − p∗_i, the dual objective is

∑_i p∗_i + ∑_b u_b(p∗) = ∑_i p∗_i + ∑_i (v_{M(i)i} − p∗_i) = ∑_i v_{M(i)i},

which is exactly the value of the matching M. Since the primal and dual values are equal, the primal matching M must be optimal.
That’s it. Running the algorithm on our running example gives the
prices on the right.
The only way the algorithm can stop is to produce an optimal
matching. So we must show it does stop, for which we use a “semi-
invariant” argument. We keep track of the “potential”
Φ(p) := ∑_i p_i + ∑_b u_b(p),
Lemma 9.17. Every time we increase the prices in N (S) by 1, the value of
∑i pi + ∑b ub decreases by at least 1.
that all values were integral.) Therefore, the value of the potential ∑_i p_i + ∑_b u_b changes by |N(S)| − |S| ≤ −1.
• In fact, one can get rid of the integrality assumption by raising the prices by the maximum amount possible for the above proof to still go through, namely

min_{b∈S} ( u_b(p) − max_{i∉N(S)} (v_{bi} − p_i) ).
It can be shown that this update rule makes the algorithm stop in
only O(n3 ) iterations.
• If all the values are non-negative, and we don't like the utilities to be negative, then we can do one of the following things: (a) when all the prices become non-zero, subtract the same amount from all of them to make the lowest price hit zero, or (b) choose S to be a minimal "constricted" set and raise the prices for N(S). This way, we can ensure that each buyer still has at least one item which gives it nonnegative utility. (Exercise!)
• Suppose there are n buyers and a single item, with all non-negative values. (Imagine there are n − 1 dummy items, with buyers having zero values for them.) The above algorithm behaves like the usual ascending-price English or Vickrey auction, where prices are raised until only one bidder remains. Indeed, the final price for the "real" item will be such that the second-highest bidder is indifferent between it and a dummy item.

This is a more general phenomenon: indeed, even in the setting with multiple items, the final prices are those produced by the Vickrey-Clarke-Groves truthful mechanism, at least if we use the version of the algorithm that raises prices on minimal constricted sets. The truthfulness of the mechanism means there is no incentive for buyers to unilaterally lie about their values for items. See, e.g., the footnoted reference for the rich connection of matching algorithms to auctions.
This proof shows that for any setting of values, there is an optimal
integer solution to the linear program
max{v · x | x ∈ K LP(G) }.
Let us now see yet another algorithm for solving weighted matching
problems in bipartite graphs. For now, we switch from maximum-
weight matchings to minimum-weight matchings, because they are
conceptually cleaner to explain here. Of course, the two problems are
equivalent, since we can always negate edge weights.
In fact, we solve a min-cost max-flow problem here: given a flow network with terminals s and t, edge capacities ue, and also edge costs/weights we, find an s-t flow with maximum flow value, and whose total cost/weight is the least among all such flows. (Moreover, if the capacities are integers, the flow we find will also have integer flow values on all edges.) Casting the maximum-cardinality bipartite matching problem as an integer max-flow problem, as in §blah, gives us a minimum-weight bipartite matching.
This algorithm uses an augmenting path subroutine, much like
the algorithm of Ford and Fulkerson. The subroutine, which takes in
a matching M and returns one of size | M | + 1, is presented below.
Then, we can start with the empty matching and call this subroutine
until we get a maximum matching.
Let the original bipartite graph be G. Construct the directed graph
G M as follows: For each edge e ∈ M, insert that edge directed from
right to left, with weight −we . For each edge e ∈ G \ M, insert that
edge directed from left to right, with weight we . Then, compute the
shortest path P that starts from the left and ends on the right, and
return M △ P. It is easy to see that M △ P is a matching of size | M | +
1, and has total weight equal to the sum of the weights of M and P.
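A hedged Python sketch of this subroutine for a complete bipartite graph with weight matrix w: build G_M, run Bellman-Ford (the matched arcs have negative weights, and when M is extreme there are no negative cycles), and return M △ P. All conventions (M as a set of (l, r) pairs, the vertex numbering) are mine:

```python
def augment(nL, nR, w, M):
    """Grow matching M (a set of (l, r) pairs) by one edge, using a
    min-weight augmenting path found via Bellman-Ford on G_M."""
    INF = float("inf")
    n = nL + nR                          # left: 0..nL-1, right: nL..n-1
    arcs = []
    for l in range(nL):
        for r in range(nR):
            if (l, r) in M:
                arcs.append((nL + r, l, -w[l][r]))   # matched: right->left
            else:
                arcs.append((l, nL + r, w[l][r]))    # unmatched: left->right
    free_L = set(range(nL)) - {l for l, _ in M}
    free_R = {nL + r for r in range(nR)} - {nL + r for _, r in M}
    dist = [0 if v in free_L else INF for v in range(n)]
    parent = [None] * n
    for _ in range(n - 1):               # Bellman-Ford relaxations
        for u, v, c in arcs:
            if dist[u] + c < dist[v]:
                dist[v], parent[v] = dist[u] + c, u
    t = min(free_R, key=lambda v: dist[v])   # cheapest free right endpoint
    path = []
    while t is not None:                 # walk parents back to the source
        path.append(t)
        t = parent[t]
    newM = set(M)
    for a, b in zip(path, path[1:]):     # toggle edges along P (M xor P)
        e = (b, a - nL) if b < nL else (a, b - nL)
        newM.symmetric_difference_update({e})
    return newM
```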
Call a matching M an extreme matching if M has minimum
weight among all matchings of size | M |. The main idea is to show
that the above subroutine preserves extremity, so that the final match-
ing must be extreme and therefore optimal.
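To make the subroutine concrete, here is a minimal Python sketch (ours, not from the notes) of one augmentation step for minimum-weight bipartite matching. The graph representation, the Bellman-Ford shortest-path computation (needed since matched edges carry negative weights), and all names are our own illustrative choices.

```python
import itertools

def augment(left, right, weight, matching):
    """One augmentation step: given matching M (a set of (u, v) pairs with
    u in left, v in right), return a min-weight matching of size |M| + 1."""
    matched_left = {u for (u, v) in matching}
    matched_right = {v for (u, v) in matching}

    # Build G_M: unmatched edges left-to-right with weight w(e),
    # matched edges right-to-left with weight -w(e).
    arcs = []
    for (u, v), w in weight.items():
        if (u, v) in matching:
            arcs.append((v, u, -w))
        else:
            arcs.append((u, v, w))

    # Bellman-Ford from a virtual source attached to all free left vertices.
    dist = {x: float('inf') for x in itertools.chain(left, right)}
    parent = {}
    for u in left:
        if u not in matched_left:
            dist[u] = 0.0
    for _ in range(len(dist)):
        for (a, b, w) in arcs:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
                parent[b] = a

    # Cheapest free right vertex ends the augmenting path P; walk it back.
    end = min((v for v in right if v not in matched_right),
              key=lambda v: dist[v])
    path_edges = set()
    while end in parent:
        a = parent[end]
        path_edges.add((a, end) if (a, end) in weight else (end, a))
        end = a
    return matching.symmetric_difference(path_edges)  # M triangle P
```

Starting from the empty matching and calling this repeatedly gives a minimum-weight maximum matching, assuming (as the notes argue) that extremity is preserved, so no negative cycles arise in G_M.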
When the graph is not bipartite, there are no "left" and "right" sets of
vertices, so we can simply define
$K_{\deg}(G) := \{ x \in \mathbb{R}^{|E|} \mid x(\partial v) = 1 \;\forall v \in V,\; x \geq 0 \}.$
(Recall that $\partial v$ is the set of edges incident on vertex v.)
This matches with the definition (9.3) when the graph is bipartite.
Interestingly,
CPM ( G ) ⊊ Kdeg ( G )
for non-bipartite graphs. Indeed, consider graph K3 which consists
of a single 3-cycle: this graph has no perfect matching, but setting
xe = 1/2 for each edge satisfies all the constraints. Or in the graph
K6 (which does have perfect matchings), the solution where we set
$x_e = 1/2$ on two disjoint 3-cycles is an extreme point. This suggests
that the linear constraints defining $K_{\deg}(G)$ are not enough, and we
need to add more constraints to capture the convex hull of perfect
matchings in general graphs. (Can you find a cost vector for which
this half-integral solution is the unique optimum?)
In situations like this, it is instructive to look at the counter-
example, to see what constraints must be satisfied by any integer
solution, but are violated by this fractional solution. For a set of ver-
tices S ⊆ V, let ∂S denote the edges leaving S. Here is one such set of
constraints:
$\sum_{e \in \partial S} x_e \geq 1 \qquad \forall S \subseteq V \text{ such that } |S| \text{ is odd}.$
KgenPM ( G ) = CgenPM ( G ),
H has even size. If each vertex in H has an even degree, we can find
an Euler tour of these edges (which uses these edges exactly once),
and then apply the idea from Theorem 9.13 to this Euler tour to show
that x ∗ is the convex combination of two other solutions x + /x − . The
argument we used did not rely on the cycle being simple—just that
it was of even length, which holds because H has even size, and that
the solutions we get are different from x ∗ !
The rest of the proof just handles the case of non-Eulerian com-
ponents of even size. Such a non-Eulerian graph H must contain
vertices with an odd degree; but there must be an even number of
them. Pair them up in any way you like; pick the pairs one by one, Why? The Handshake Lemma says the
pick a path between them in H, and “duplicate” it—make one copy sum of degrees is twice the number of
edges and hence is even.
of each edge on the path. This increases the degree of each endpoint
by 1 (thereby changing the parity), but does not change the parity
of each other node. One important comment: pick some edge e′ on
some cycle, and ensure that the duplicated path does not use this
edge e′ by going the “other way around the cycle”.
At the end this fixes the degrees of all vertices to be even (at the
cost of duplicating edges in H). Again find an Euler tour and do
the +ε/−ε trick on these. If an edge is duplicated, it may be used
as an “odd” edge some pe number of times and an “even” edge the
remaining ne times, so will get an offset of ε( pe − ne ) in one solution
and −ε( pe − ne ) in the other. This again shows that x ∗ is not an Since the edge e′ is used only once in
extreme point. the Euler tour, it is definitely increased
or decreased in the two solutions,
ensuring that x + /x − are not equal to
Now we use the basic feasible solution perspective of x ∗ : this x∗ .
means we have a “basis” containing some | E| linearly independent
constraints of the linear program defining the polytope that are tight
at x ∗ . Fix any such basis, and suppose S is the collection of sets for
which the odd-set constraints are tight in this basis. There are two
cases:
Moreover, both these polytopes are for smaller graphs, and have
integer vertices by induction. Hence, any point within them can be
written as a convex combination of their vertices. In particular,
$x^{(1)} = \sum_{M'} \alpha_{M'}\, \chi_{M'}$
We just saw several proofs that the bipartite perfect matching poly-
tope has a compact linear program. Moreover, we claimed that the
perfect matching polytope on general graphs has an explicit linear
program that, while exponential-sized, can be solved in polynomial
time. Such results allow us to solve the weighted bipartite matching
problems using generic linear programming solvers (as long as they
return vertex solutions).
Having many different ways to view a problem gives us deeper
insight, and thereby lets us come up with faster and better ways to solve it.
Moreover, these different perspectives give us a handle into solving
extensions of these problems. E.g., if we have a matching problem
with two different kinds of weights w1 and w2 on the edges: we want to
find a matching x ∈ K PM ( G ) minimizing w1 · x, now subject to the
additional constraint w2 · x ≤ B. While the problem is now NP-hard,
this linear constraint can easily be added to the linear program to
get a fractional optimal solution. Then we can reason about how to
“round” this solution to get a near-optimal matching.
We now show how two problems we considered earlier, namely
minimum-cost arborescence and spanning trees, can be exactly mod-
eled using linear programs. We then conclude with a pointer to a
general theory of integral polyhedra.
9.6.1 Arborescences
We already saw a linear program for the min-weight r-arborescence
polytope in §2.3.2: since each node that is not the root r must have a
path in the arborescence to the root, it is natural to say that for any
subset of vertices S ⊆ V that does not contain the root, there must
be an edge leaving it. Specifically, given the digraph G = (V, A), the
polytope can be written as
$K_{Arb}(G) = \Big\{ x \in \mathbb{R}^{|A|} \;\Big|\; \sum_{a \in \partial^+(S)} x_a \geq 1 \;\; \forall S \subset V \text{ s.t. } r \notin S; \quad x_a \geq 0 \;\; \forall a \in A \Big\}.$
Here ∂+ (S) is the set of arcs that leave set S. The proof in §2.3.2 al-
ready showed that for each weight vector w ∈ R| A| , we can find an
optimal solution to the linear program min{w · x | x ∈ K Arb ( G )}.
(The first constraint excludes the case where S is either empty or the
entire vertex set.) One could hope for a similar cut-based program for
undirected spanning trees, but sadly it does not precisely capture the
spanning tree polytope: e.g., for the familiar cycle graph having three
vertices, setting $x_e = 1/2$ for all three edges satisfies all the constraints.
If all edge weights are 1, this solution gets a value of $\sum_e x_e = 3/2$,
whereas any spanning tree on 3 vertices must have 2 edges.
One can indeed write a different linear program that captures the
spanning tree polytope, but it is a bit non-trivial:
$K_{ST}(G) = \Big\{ x \in \mathbb{R}^{|E|} \;\Big|\; \sum_{ij \in E:\, i,j \in S} x_{ij} \leq |S| - 1 \;\; \forall S \subseteq V, S \neq \emptyset; \quad \sum_{ij \in E} x_{ij} = |V| - 1; \quad x_{ij} \geq 0 \;\; \forall ij \in E \Big\}$
Theorem 9.22 (Hoffman and Kruskal Theorem). If the constraint A.J. Hoffman and J.B. Kruskal (1956)
matrix [ A]m×n is totally unimodular and the vector ⃗b is integral, i.e., ⃗b ∈
Zm , then the vertices of the polytope induced by the LP are integer valued.
Moreover, if for some matrix A the polytope has integer vertices for all
integer vectors b, then the matrix A is totally unimodular.
Proof. (Sketch) This proof uses that solutions to linear systems can be
obtained using Cramer's rule: a vertex is the unique solution of $A'x = b'$
for some nonsingular square submatrix $A'$ of A, each of its coordinates
is a ratio $\det(A'_i)/\det(A')$ of determinants, and total unimodularity
forces $\det(A') = \pm 1$, so the solution is integral.
Thus, to show that the vertices are indeed integer valued, one
need not go through producing combinatorial proofs, as we have.
Instead, one could just check that the constraint matrix A is totally
unimodular. Here’s a nice presentation by Marc Uetz about the rela-
tion between total unimodularity and graph matchings.
Part II
Interlude: Dimension
Reduction
10
Concentration of Measure
3. How many unit vectors can you choose in Rn that are almost
orthonormal? I.e., they must satisfy $|\langle v_i, v_j \rangle| \leq \varepsilon$ for all $i \neq j$?
All these questions can be answered by the same basic tool, which
goes by the name of Chernoff bounds or concentration inequalities or tail
inequalities or concentration of measure, or tens of other names. The ba-
sic question is simple: if we have a real-valued function f ( X1 , X2 , . . . , Xm )
of several independent random variables Xi , such that it is “not too sensitive
to each coordinate”, how often does it deviate far from its mean? To make it
more concrete, consider this—
Given n independent random variables X1 , . . . , Xn , each bounded in
the interval [0, 1], let Sn = ∑in=1 Xi . What is
$\Pr\big[ S_n \notin (1 \pm \varepsilon)\, \mathbb{E} S_n \big]?$
The quantity $\Pr[X \geq \mu + \lambda]$ is called the upper tail, and $\Pr[X \leq \mu - \lambda]$
is the lower tail. We are interested in bounding these tails for various
values of λ.
$\Pr(X \geq \lambda) \leq \frac{\mathbb{E}(X)}{\lambda}$ (10.5)
With this in hand, we can start substituting various non-negative
functions of random variables X to deduce interesting bounds. For
instance, the next inequality looks at both the mean µ := EX and the
variance σ2 := E[( X − µ)2 ] of a random variable, and bounds both
the upper and lower tails.
$\Pr[|X - \mu| \geq \lambda] \leq \frac{\sigma^2}{\lambda^2}.$
Proof. Using Markov’s inequality on the non-negative r.v. Y = ( X −
µ)2 , we get
$\Pr[Y \geq \lambda^2] \leq \frac{\mathbb{E}[Y]}{\lambda^2}.$
The proof follows from Pr[Y ≥ λ2 ] = Pr[| X − µ| ≥ λ].
$\Pr[S_n - pn \geq \beta n] \leq \frac{pn}{pn + \beta n} = \frac{1}{1 + (\beta/p)}.$
$\Pr[|S_n - pn| \geq \beta n] \leq \frac{np(1-p)}{\beta^2 n^2} < \frac{p}{\beta^2 n}.$
In particular, this already says that the sample mean $S_n/n$ lies in the
interval $p \pm \beta$ with probability at least $1 - \frac{p}{\beta^2 n}$. Equivalently, to get
confidence $1 - \delta$, we just need to set $\delta \geq \frac{p}{\beta^2 n}$, i.e., take $n \geq \frac{p}{\beta^2 \delta}$. (We
will see a better bound soon.) (Concretely, to get within an additive
1% error of the correct bias p with probability 99.9%, set β = 0.01 and
δ = 0.001, so taking $n \geq 10^7 \cdot p$ samples suffices.)
Example 2 (Balls and Bins): Throw n balls uniformly at random and
independently into n bins. Then for a fixed bin i, let Li denote the
number of balls in it. Observe that Li is distributed as a Bin(n, 1/n)
random variable. Markov’s inequality gives a bound on the probabil-
ity that Li deviates from its mean 1 by λ ≫ 1 as
$\Pr[L_i \geq 1 + \lambda] \leq \frac{1}{1 + \lambda} \approx \frac{1}{\lambda}.$
However, Chebychev's inequality gives a much tighter bound:
$\Pr\big[|L_i - 1| \geq \lambda\big] \leq \frac{(1 - 1/n)}{\lambda^2} \approx \frac{1}{\lambda^2}.$
So setting $\lambda = 2\sqrt{n}$ says that the probability of any fixed bin having
more than $2\sqrt{n} + 1$ balls is at most $\frac{(1-1/n)}{4n}$. Now a union bound over
all bins i means that, except with probability $n \cdot \frac{(1-1/n)}{4n} \leq 1/4$, the
load on every bin is at most $1 + 2\sqrt{n}$. (Doing this argument with
Markov's inequality would give a trivial upper bound of $1 + 2n$ on the
load. This is useless, since there are at most n balls, so the load can
never be more than n.)
Example 3 (Random Walk): Suppose we start at the origin and at
each step move a unit distance either left or right uniformly ran-
domly and independently. We can then ask about the behaviour of
the final position after n steps. Each step $X_i$ can be modelled as a
Rademacher random variable with the following distribution. (A random
sign is also called a Rademacher random variable, the name Bernoulli
being already taken for a random bit in {0, 1}.)
$X_i = \begin{cases} +1 & \text{w.p. } 1/2 \\ -1 & \text{w.p. } 1/2 \end{cases}$
For large n, we can use Stirling's formula $n! \approx \sqrt{2\pi n}\,(n/e)^n$.
If $\lambda \ll n$, then we can approximate $1 + \frac{k\lambda}{n}$ by $e^{k\lambda/n}$:
This shows that most of the probability mass lies in the region |Sn | ≤
√
O( n), and drops off exponentially as we go further. And indeed,
this is the bound we will derive next—we will get slightly weaker
constants, but we will avoid these tedious approximations.
$x \mapsto e^{tx}$
for some value t > 0 to be chosen carefully. Since this map is mono-
tone,
Bernoulli random variables: Assume that all the Xi ∈ {0, 1}; we will
remove this assumption later. Let the mean be µi = E[ Xi ], so the
moment generating function can be explicitly computed as
Substituting, we get
$\Pr[S_n \geq \mu + \lambda] \leq \frac{\prod_i \mathbb{E}[e^{tX_i}]}{e^{t(\mu+\lambda)}}$ (10.12)
$\leq \frac{\prod_i \exp(\mu_i(e^t - 1))}{e^{t(\mu+\lambda)}}$ (10.13)
$= \frac{\exp(\mu(e^t - 1))}{e^{t(\mu+\lambda)}}$ (since $\mu = \sum_i \mu_i$)
$= \exp\big(\mu(e^t - 1) - t(\mu + \lambda)\big).$ (10.14)
Since this calculation holds for all positive t, and we want the tightest
upper bound, we should minimize the expression (10.14). Setting the
derivative w.r.t. t to zero gives t = ln(1 + λ/µ) which is non-negative
for λ ≥ µ.
$\Pr[S_n \geq \mu + \lambda] \leq \frac{e^{\lambda}}{(1 + \lambda/\mu)^{\mu + \lambda}}.$ (10.15)
(This bound on the upper tail is also one to be kept in mind; it often is
useful when we are interested in large deviations where λ ≫ µ. One such
example will be the load-balancing application with jobs and machines.)
$\frac{\beta}{1 + \beta/2} \leq \ln(1 + \beta)$ (10.17)
Removing the assumption that Xi ∈ {0, 1}: If the r.v.s are not Bernoullis,
then we define new Bernoulli r.v.s Yi ∼ Bernoulli(µi ), which take
value 0 with probability 1 − µi , and value 1 with probability µi , so
that E[ Xi ] = E[Yi ]. Note that f ( x ) = etx is convex for every value
of t ≥ 0; hence the function ℓ( x ) = (1 − x ) · f (0) + x · f (1) satisfies
f ( x ) ≤ ℓ( x ) for all x ∈ [0, 1]. Hence E[ f ( Xi )] ≤ E[ℓ( Xi )]; moreover
ℓ( x ) is a linear function so E[ℓ( Xi )] = ℓ(E[ Xi ]) = E[ℓ(Yi )], since
Xi and Yi have the same mean. Finally, ℓ(y) = f (y) for y ∈ {0, 1}.
Putting all this together,
so the step from (10.12) to (10.13) goes through again. This completes
the proof of Theorem 10.8.
Since the proof has a few steps, let’s take stock of what we did:
i. Apply Markov’s inequality on the function etX ,
ii. Use independence of the $X_i$ to break $\mathbb{E}[e^{tS_n}]$ into the product $\prod_i \mathbb{E}[e^{tX_i}]$,
iii. Reduce to the Bernoulli case Xi ∈ {0, 1},
iv. Compute the MGF (moment generating function) E[etXi ],
v. Choose t to minimize the resulting bound, and
vi. Use convexity to argue that Bernoullis are the “worst case”. Do make sure you see why the bounds
You can get tail bounds for other functions of random variables of Theorem 10.8 are impossible in
general if we do not assume some kind
by varying this template around; e.g., we will see an application for of boundedness and independence.
sums of independent normal (a.k.a. Gaussian) random variables in
the next chapter.
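As a sanity check on this template, here is a small Python sketch (ours, not from the notes) that numerically minimizes the MGF bound over t for a sum of independent Bernoullis, and compares it against the empirical tail; the grid search and all constants are illustrative choices.

```python
import math
import random

def chernoff_upper_tail(mus, lam, grid=10000):
    """Upper bound on Pr[S >= mu + lam] for independent Bernoulli(mu_i)'s,
    by minimizing prod_i E[e^{t X_i}] / e^{t(mu + lam)} over t > 0."""
    mu = sum(mus)
    best = 1.0
    for j in range(1, grid):
        t = 5.0 * j / grid  # scan t over (0, 5]
        log_mgf = sum(math.log(1 - m + m * math.exp(t)) for m in mus)
        best = min(best, math.exp(log_mgf - t * (mu + lam)))
    return best

mus = [0.1] * 200            # 200 coins of bias 0.1, so mu = 20
lam = 10
print("Chernoff bound:", chernoff_upper_tail(mus, lam))

trials = 20000
hits = sum(sum(random.random() < m for m in mus) >= sum(mus) + lam
           for _ in range(trials))
print("empirical tail:", hits / trials)
```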
For the rest of the proof of the Chernoff bound, we can just focus
on computing the dual ψ∗ (λ) of the log-MGF ψ(t). Let’s see some
examples:
$\psi(t) = \frac{t^2 \sigma^2}{2} \quad\text{and}\quad \psi^*(\lambda) = \frac{\lambda^2}{2\sigma^2},$
the latter by basic calculus. Now the generic Chernoff bound (10.19)
the latter by basic calculus. Now the generic Chernoff bound (10.19)
for the sum of n normal N (0, σ2 ) variables says:
$\Pr[S_n \geq \lambda] \leq e^{-\frac{\lambda^2}{2n\sigma^2}}.$ (10.21)
$\mathbb{E}[e^{tX}] = \frac{e^t + e^{-t}}{2} = \cosh t = 1 + \frac{t^2}{2!} + \frac{t^4}{4!} + \cdots \leq e^{t^2/2},$
so
$\psi(t) = \frac{t^2}{2} \quad\text{and}\quad \psi^*(\lambda) = \frac{\lambda^2}{2}.$
Note that
$\psi_{\text{Rademacher}}(t) \leq \psi_{N(0,1)}(t) \implies \psi^*_{\text{Rademacher}}(\lambda) \geq \psi^*_{N(0,1)}(\lambda).$
$\Pr\big[|S_n - np| \geq \beta n\big] \leq \exp\Big(-\frac{\beta^2 n}{2p + \beta}\Big) \leq \exp\Big(-\frac{\beta^2 n}{2}\Big).$
(For the second inequality, we use that the interesting settings have
$p + \beta \leq 1$, so $2p + \beta \leq 2$.) Hence, if $n \geq \frac{2\ln(1/\delta)}{\beta^2}$, the empirical
average $S_n/n$ is within an additive β of the bias p with probability at
least 1 − δ. This has an exponentially better dependence on 1/δ than
the bound we obtained from Chebychev's inequality.
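A quick Python illustration (our own, not from the notes) of this sample-complexity bound: estimate the bias of a coin using n = 2 ln(1/δ)/β² flips and check how often the estimate is off by more than β.

```python
import math
import random

def estimate_bias(p, beta, delta):
    """Flip n = ceil(2 ln(1/delta) / beta^2) coins of bias p; the empirical
    average should be within beta of p with probability >= 1 - delta."""
    n = math.ceil(2 * math.log(1 / delta) / beta**2)
    return sum(random.random() < p for _ in range(n)) / n

p, beta, delta = 0.3, 0.05, 0.01
failures = sum(abs(estimate_bias(p, beta, delta) - p) > beta
               for _ in range(1000))
print("empirical failure rate:", failures / 1000, "(target <=", delta, ")")
```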
This is asymptotically the correct answer: consider the problem
where we have n coins, n − 1 of them having bias 1/2, and one having
bias 1/2 + 2β. We want to find the higher-bias coin. One way is to es-
timate the bias of each coin to within β with confidence $1 - \frac{1}{2n}$, using
If we set λ = Θ(log n), the probability of the load Li being larger than
1 + λ is at most 1/n2 . Now taking a union bound over all bins, the
probability that any bin receives at least 1 + λ balls is at most n1 . I.e.,
the maximum load is O(log n) balls with high probability.
In fact, the correct answer is that the maximum load is $(1 + o(1))\frac{\ln n}{\ln\ln n}$
with high probability. For example, the proofs in cite show
this. Getting this precise bound requires a bit more work, but we can
get an asymptotically correct bound by using (10.15) instead, with a
setting of $\lambda = \frac{C \ln n}{\ln\ln n}$ with a large constant C.
Moreover, this shows that the asymmetry in the bounds (10.8)
and (10.9) is essential. A first reaction would have been to believe
our proof to be weak, and to hope for a better proof to get
$\Pr[S_n \geq (1+\beta)\mu] \leq \exp(-\beta^2 \mu / c)$
for some constant c > 0, for all values of β. This is not possible,
however, because it would imply a max-load of $\Theta(\sqrt{\log n})$ with high
probability. (The situation where λ ≤ µ is often called the Gaussian
regime, since the bound on the upper tail behaves like $\exp(-\lambda^2/\mu) = \exp(-\beta^2\mu)$,
with β = λ/µ. In other cases, the upper tail bound behaves like
$\exp(-\lambda)$, and is said to be the Poisson regime.)
Recall from §10.2.5 that the tail bound of ≈ exp(−t2 /O(1)) is indeed
in the right ballpark.
and that the function Sn is the sum of these r.v.s. Add details and refs
to this section.
But before we move on, let us give the bound that Sergei Bernstein
gave in the 1920s: it uses knowledge about the variance of the ran-
dom variable to get a potentially sharper bound than Theorem 10.8.
We can use this in the step (10.11), since the function etx is monotone
increasing for t > 0.
Negative association arises in many settings: say we want to
choose a subset S of k items out of a universe of size n, and let
Xi = 1i∈S be the indicator for whether the ith item is selected. The
variables X1 , . . . , Xn are clearly not independent, but they are nega-
tively associated.
10.4.2 Martingales
A different and powerful set of results can be obtained when we
stop requiring the random variables to be independent, but instead
allow each variable $X_j$ to take on values that depend on the past choices
$X_1, X_2, \ldots, X_{j-1}$, in a controlled way. One powerful formalization
is the notion of a martingale. A martingale difference sequence is a se-
quence of r.v.s Y1 , Y2 , . . . , Yn , such that E[Yi | Y1 , . . . , Yi−1 ] = 0 for each
i. (This is true for mean-zero independent r.v.s, but may be true in
other settings too.)
This inequality does not assume very much about the function,
except it being $c_i$-Lipschitz in the ith coordinate; hence we can also
apply it to the truncated random walk example above, or to many
other applications.
$\Pr[S_n \geq \lambda] \leq \min_{k \geq 0} \frac{\mathbb{E}[S_n^k]}{\lambda^k} \leq \inf_{t \geq 0} \frac{\mathbb{E}[e^{tS_n}]}{e^{t\lambda}}.$
with a d-bit vector. Each vertex i has a single packet (which we also
call packet i), destined for vertex π (i ), where π is a permutation on
the nodes [n].
Packets move in synchronous rounds. Each edge is bi-directed,
and at most one packet can cross each directed edge in each round.
Moreover, each packet can cross at most one edge per round. So if
uv ∈ E( Qd ), one packet can cross from u to v, and one from v to u,
in a round. Each edge e has an associated waiting queue We ; so each
node has d queues, one for each edge leaving it. If several packets
want to cross an edge e in the same round, only one can cross; the
rest wait in the queue We and try again the next round. We assume
the queues are allowed to grow to arbitrary size (though one can also
show queue length bounds in the algorithm below). The goal is to get
a simple routing scheme that delivers the packets in O(d) rounds, no
matter what permutation π needs to be routed.
One natural proposal is the bit-fixing routing scheme: each packet
i looks at its current position u, finds the first bit position where u
differs from π (i ), and flips the bit (which corresponds to traversing
an edge out of u). For example: to route from u = 0110 to π(i) = 1000,
the packet moves 0110 → 1110 → 1010 → 1000, fixing the differing bits
from left to right.
However, this proposal can create "congestion hotspots" in the network,
and therefore delay some packets by $2^{\Omega(d)}$. In fact, it turns
out any deterministic oblivious strategy (that does not depend on the
actual sources and destinations) must have a delay of $\Omega(\sqrt{2^d/d})$
rounds. (Suppose we choose a permutation π such that $\pi(w\vec{0}) = \vec{0}w$,
where $w, \vec{0} \in \{0,1\}^{d/2}$. All these $2^{d/2}$ packets have to pass through
the all-zeros node in the bit-fixing routing scheme; since this node can
send out at most d packets at each timestep, we need at least $2^{d/2}/d$
rounds.)

10.5.1 A Randomized Algorithm. . .
Here’s a lovely randomized strategy, due to Les Valiant, and to Valiant (1982)
Valiant and Brebner. It requires no centralized control, and is opti-
mal in the sense of requiring O(d) rounds (with high probability) on
any permutation π.
Each node i picks a randomized midpoint Ri independently and uni-
formly from [n]: it sends its packet to Ri . Then after 5d rounds have
elapsed, the packets proceed to their final destinations π (i ). All routing
is done using bit-fixing.
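The following Python sketch (ours; the routing rule is from the notes, but the queueing details and all names are our simplifications) simulates the two phases of Valiant's trick on the hypercube and reports the number of rounds each phase takes.

```python
import random

def bit_fix_path(u, v, d):
    """Nodes visited when routing u -> v by fixing differing bits in order."""
    path, cur = [u], u
    for b in range(d):
        if (cur ^ v) >> b & 1:
            cur ^= 1 << b
            path.append(cur)
    return path

def route_phase(pairs, d):
    """Synchronous rounds: at most one packet crosses each directed edge
    per round; the rest wait. Returns the number of rounds used."""
    paths = [bit_fix_path(s, t, d) for (s, t) in pairs]
    pos = [0] * len(paths)
    rounds = 0
    while any(pos[i] < len(paths[i]) - 1 for i in range(len(paths))):
        used = set()
        order = list(range(len(paths)))
        random.shuffle(order)
        for i in order:
            if pos[i] < len(paths[i]) - 1:
                edge = (paths[i][pos[i]], paths[i][pos[i] + 1])
                if edge not in used:
                    used.add(edge)
                    pos[i] += 1
        rounds += 1
    return rounds

d = 8
n = 1 << d
perm = list(range(n))
random.shuffle(perm)
mids = [random.randrange(n) for _ in range(n)]  # random midpoints R_i
print("phase 1:", route_phase([(i, mids[i]) for i in range(n)], d), "rounds")
print("phase 2:", route_phase([(mids[i], perm[i]) for i in range(n)], d), "rounds")
```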
Proof. We only prove that all packets reach their midpoints by time
5d, with high probability. The argument for the second phase is then
2. Suppose packet i traverses the last edge eℓ on its path and reaches
its destination at timestep T. Since it has lag T − ℓ = T − | Pi | just
before it traverses the edge, it reaches the destination at time | Pi |
plus its final lag. So it suffices to show that i’s final lag is at most
| Si | .
5. We show (in the next bullet point) how to maintain the invariant
that at the beginning of each time, any token numbered L still on
the path Pi is carried by some packet in Si with current lag L. This
implies that when a packet in Si makes its final traversal and it has
some final lag L′ , it is either carrying a single token numbered L′
at that time or no token at all. Since each token is carried by some
packet, this means there can be at most |Si | tokens overall, and
hence i’s final lag value is at most |Si |.
6. To ensure the invariant, note that when j got the token numbered
L from i, packet j had lag value L. Now as long as j does not get
delayed as it proceeds along the path, its lag remains L (and it
keeps the token). If it does get delayed, say while waiting in queue
$W_{e_{k'}}$ while some other packet j′ (having the same lag value L,
because they were sharing the same queue) traverses the edge $e_{k'}$,
packet j gives its token numbered L to this j′. This maintains the
invariant.
dimensions.
$1 - \varepsilon \leq \frac{\|A(x_i) - A(x_j)\|_2^2}{\|x_i - x_j\|_2^2} \leq 1 + \varepsilon.$
Moreover, such a map can be computed in expected poly(n, D, 1/ε) time.
Note that the target dimension k is independent of the original
dimension D, and depends only on the number of points n and the
accuracy parameter ε. It is not difficult to show that we need at least
Ω(log n) dimensions. (Given n points with Euclidean distances in (1 ± ε),
the balls of radius $\frac{1-\varepsilon}{2}$ around these points are mutually disjoint,
so a volume argument gives the lower bound.)
with probability at least $1 - 1/n^2$, where $v_{ij}$ is the unit vector in the
direction of $x_i - x_j$. By a union bound, all $\binom{n}{2}$ pairs of distances in
$\binom{X}{2}$ are maintained with probability at least $1 - \binom{n}{2}\frac{1}{n^2} \geq 1/2$. A few
comments about this construction:
• The above proof shows not only the existence of a good map, we
also get that a random map as above works with constant prob-
ability! In other words, a Monte-Carlo randomized algorithm
for dimension reduction. (Since we can efficiently check that the
distances are preserved to within the prescribed bounds, we can
convert this into a Las Vegas algorithm.) Or we can also get deter-
ministic algorithms: see here.
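Here is a small numpy sketch of this random map (our own illustration): project n points from dimension D down to k = O(ε⁻² log n) dimensions using a scaled Gaussian matrix, and measure the worst distortion; the constant 8 in our choice of k is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, eps = 50, 1000, 0.2
k = int(np.ceil(8 * np.log(n) / eps**2))  # constant 8 is illustrative

X = rng.normal(size=(n, D))        # n arbitrary points in R^D
M = rng.normal(size=(k, D))        # i.i.d. N(0,1) entries
Y = X @ M.T / np.sqrt(k)           # A(x) = Mx / sqrt(k), all points at once

worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        ratio = (np.linalg.norm(Y[i] - Y[j]) /
                 np.linalg.norm(X[i] - X[j])) ** 2
        worst = max(worst, abs(ratio - 1))
print(f"k = {k}, worst squared-distance distortion = {worst:.3f}")
```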
Let us recall some basic facts about Gaussian distributions. The prob-
ability density function for the Gaussian $N(\mu, \sigma^2)$ is
$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$
We also use the following; the proof just needs some elbow grease.
(The fact that the means and the variances take on the claimed values
should not be surprising; this is true for all r.v.s. The surprising part
is that the resulting variables are also Gaussians.)

Proposition 11.3. If $G_1 \sim N(\mu_1, \sigma_1^2)$ and $G_2 \sim N(\mu_2, \sigma_2^2)$ are independent,
then for $c \in \mathbb{R}$, we have $cG_1 \sim N(c\mu_1, c^2\sigma_1^2)$ and $G_1 + G_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
Now, here’s the main idea in the proof of Lemma 11.2. Imagine
that the vector x is the elementary unit vector e1 = (1, 0, . . . , 0). Then
M e1 is just the first column of M, which is a vector with independent
and identical Gaussian values.
$M e_1 = \begin{pmatrix} G_{1,1} & G_{1,2} & \cdots & G_{1,D} \\ G_{2,1} & G_{2,2} & \cdots & G_{2,D} \\ \vdots & \vdots & \ddots & \vdots \\ G_{k,1} & G_{k,2} & \cdots & G_{k,D} \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} G_{1,1} \\ G_{2,1} \\ \vdots \\ G_{k,1} \end{pmatrix}$
$A(x)$ is a scaling-down of this vector by $\sqrt{k}$: every entry in this
random vector $A(x) = A(e_1)$ is distributed as
$\frac{1}{\sqrt{k}} \cdot N(0,1) = N(0, 1/k)$ (by (11.1)).
Thus, the expected squared length of $A(x) = A(e_1)$ is
$\mathbb{E}\big[\|A(x)\|^2\big] = \mathbb{E}\Big[\sum_{i=1}^k A(x)_i^2\Big] = \sum_{i=1}^k \mathbb{E}\big[A(x)_i^2\big] = \sum_{i=1}^k \frac{1}{k} = 1.$
(If G has mean µ and variance σ², then $\mathbb{E}[G^2] = \mathrm{Var}[G] + \mathbb{E}[G]^2 = \sigma^2 + \mu^2$.)
E[∥ A( x )∥2 ] = ∥ x ∥2 .
Observe that we did not use the fact that the matrix entries were
Gaussians. We will use that fact for the concentration bound, which we
show next.
$Z := \|A(x)\|^2 = \frac{1}{k}\sum_{i=1}^k (Mx)_i^2,$
Plugging back into (11.4), the bound on the upper tail shows that for
all $t \in (0, 1/2)$,
$\Pr[Z \geq (1+\varepsilon)] \leq \Big( \frac{1}{e^{t(1+\varepsilon)}\sqrt{1-2t}} \Big)^k.$
if we set t = ε/4 and use the fact that 1 − 2t ≥ 1/2 for ε ≤ 1/2. (Note:
this setting of t also satisfies t ∈ (0, 1/2), which we needed from our
previous calculations.)
Almost done: let’s take stock of the situation. We observed that
∥ A( x )∥22 was distributed like an average of squares of Gaussians, and
by a Chernoff-like calculation we proved that
It turns out that the proof of Lemma 11.2 is a bit cleaner (with fewer
calculations) if we use the abstraction provided by the generic Cher-
noff bound from last lecture, and the notion of subGaussian random
variables which we introduce next. This abstraction will also allow
us to extend the result to JL matrices having i.i.d. entries from other
distributions, e.g., where each $M_{ij}$ is drawn uniformly from $\{-1, +1\}$.
$\psi(t) \leq \frac{\sigma^2 t^2}{2}$
for all t ≥ 0. It is subgaussian with parameter σ up to t0 if the above
inequality holds for all |t| ≤ t0 .
Most tail bounds you will prove using the subgaussian perspective
will come down to showing that some random variable is subgaus-
sian with parameter σ, whereupon you can use Theorem 11.7. Given
that you will often reason about sums of subgaussians, you may use
the next fact, which is an analog of Proposition 11.3.
$\mathbb{E}[e^{tV}] = \mathbb{E}\big[e^{t\sum_i x_i V_i}\big] = \prod_i \mathbb{E}[e^{t x_i V_i}] \leq \prod_i e^{(t x_i)^2 \sigma_i^2 / 2}.$
Finally, taking logarithms, $\psi_V(t) = \sum_i \psi_{V_i}(t x_i) \leq \sum_i \frac{t^2 x_i^2 \sigma_i^2}{2}$.
$Z := \|A(x)\|^2 = \frac{1}{k}\sum_{i=1}^k (Mx)_i^2$ (11.9)
$\mathbb{E}[Z] = \mathbb{E}\big[\|A(x)\|^2\big] = \|x\|^2 = 1.$
(Note that we’ve just introduced W into the mix, without any provo-
cation!) Hence, rewriting
$\mathbb{E}_{V,W}\big[e^{\sqrt{2t}\,(V/\sigma)\,W}\big] = \mathbb{E}_W\big[\mathbb{E}_V[e^{(\sqrt{2t}\,W/\sigma)\,V}]\big],$
Excellent. Now the bound on the upper tail for sums of squares
of symmetric mean-zero σ-subgaussians follows from that of Gaus-
2
sians. The lower tail (which requires us to bound E[etV ] for t < 0)
needs one more idea: suppose V is a mean-zero σ-subgaussian with
$\mathbb{E}[e^{tV^2}] = 1 + t\,\mathbb{E}[V^2] + \sum_{i \geq 2} \frac{t^i\, \mathbb{E}[V^{2i}]}{i!}.$
Since $\mathbb{E}[V^2] = 1$ and $|t| < 1$, this is at most $1 + t + t^2\,\mathbb{E}[e^{V^2}]$. Now
use the above bound $\mathbb{E}[e^{V^2}] \leq \mathbb{E}[e^{W^2}]$ to get that $\mathbb{E}[e^{tV^2}] \leq 1 + t + t^2/\sqrt{1-2t}$,
and the proof proceeds as for the Gaussian case.
In summary, we get the same tail bounds as in §11.4.1, and hence
that the Rademacher matrix also has the distributional JL property,
while using far fewer random bits!
In general one can use other σ-subgaussian distributions to fill
the matrix M—using σ different than 1 may require us to rework the
proof from §11.4.1 since the linear terms in (11.6) don’t cancel any
more, see works by Indyk and Naor or Matousek for details. Indyk and Naor (2008)
Matoušek (2008)
Lemma 11.11 (Unique Decoding). If A has Kruskal rank ≥ 2s, then for
any b we have Ax = b for at most one s-sparse x.
So we can just find some sensing matrix with large Kruskal rank (give
examples here), and ensure our results will be unique. The next
question is: how fast can we find x? (We should also be worried
about noise in the measurements.) A generic construction of matrices
with large Kruskal rank may not give us efficient solutions to (11.10).
Indeed, it turns out that the problem as formulated is NP-hard, as-
suming A and b are contrived by an adversary.
Of course, asking to solve (11.10) for general A, b is a more difficult
problem than we need to solve. In our setting, we can choose A as
we like and then are given b = Ax, so we can ask whether there are
matrices A for which this decoding process is indeed efficient. This is
precisely what we do next.
$\geq (1-\varepsilon)\,\|\Delta_{S \cup B_1}\|_2 - (1+\varepsilon) \sum_{j \geq 2} \|\Delta_{B_j}\|_2$
$\geq (1-\varepsilon)\,\|\Delta_S\|_2 - \frac{1+\varepsilon}{\sqrt{2}}\,\|\Delta_S\|_2,$
where the first step uses the triangle inequality for norms, the second
uses that each ∆S∪ B1 and ∆ Bj are 3s-sparse, and the last step uses
$\|\Delta_{S \cup B_1}\|_2 \geq \|\Delta_S\|_2$ and also Claim 11.15. Finally, since $\varepsilon \leq 1/9$, we
have $1 - \varepsilon > \frac{1+\varepsilon}{\sqrt{2}}$, so the only remaining possibility is that $\Delta_S = 0$.
The next claim implies that ∆S = 0 implies that ∆ = 0, giving a
contradiction and hence the proof of Lemma 11.14.
Claim 11.16. $\|\Delta_S\|_1 \geq \|\Delta_{\bar{S}}\|_1$.
$\|\Delta_{B_j}\|_2 \leq \sqrt{2s} \cdot \frac{\|\Delta_{B_{j-1}}\|_1}{2s} = \frac{\|\Delta_{B_{j-1}}\|_1}{\sqrt{2s}}.$
Summing this over all $j \geq 2$, we get
$\sum_{j \geq 2} \|\Delta_{B_j}\|_2 \leq \sum_{j \geq 2} \frac{\|\Delta_{B_{j-1}}\|_1}{\sqrt{2s}} = \frac{\|\Delta_{\bar{S}}\|_1}{\sqrt{2s}}.$
Now $\|\Delta_{\bar{S}}\|_1 \leq \|\Delta_S\|_1$ by Claim 11.16. And finally, since the support
of $\Delta_S$ is of size s, we can bound its $\ell_1$ length by $\sqrt{s}$ times its $\ell_2$
length, finishing the claim. (Since we wanted that factor of $\sqrt{2}$ in the
denominator, we made the buckets slightly larger than the size of S.)
(Exercise: for any vector $v \in \mathbb{R}^d$, show that $\|v\|_1 \leq \sqrt{|\mathrm{supp}(v)|} \cdot \|v\|_2$.)
Proof. The proof is simple, but uses some fairly general ideas worth
emphasizing. First, focus on some s-dimensional subspace of Rn
(obtained by restricting to some subset of coordinates). For notational
simplicity, we just identify this subspace with Rs .
For the contraction, consider any x ∈ Ss−1 , with closest net point y.
Then $\|Ax\| \geq \|Ay\| - \|A(x-y)\| \geq (1-\delta) - (1+3\delta)\delta \geq 1 - 3\delta$,
again as long as δ ≤ 1/3.
4. Now apply the above argument to each of the $\binom{n}{s}$ subspaces ob-
tained by restricting to some subset S of coordinates. By a union
bound over all subsets S, and over all points in the net for that
subspace, the matrix A is an 3δ-isometry on all points with sup-
port in S except with probability
$\binom{n}{s} \cdot (4/\delta)^s \cdot \exp(-c\delta^2 m) \leq \exp(-\Theta(m)),$
Theorem 11.18 (Heavy Shells). At least $1 - \varepsilon$ of the mass of the unit ball
in $\mathbb{R}^d$ lies within a $\Theta\big(\frac{\log 1/\varepsilon}{d}\big)$-width shell next to the surface.
Theorem 11.19 (Heavy Slabs). At least $1 - \varepsilon$ of the mass of the unit ball
in $\mathbb{R}^d$ lies within a $\Theta(1/\sqrt{d})$-width slab around any hyperplane that passes
through the origin.
where $G \sim N(0, \sigma^2)$. But we know that $\Pr[G \geq w] \leq e^{-w^2/2\sigma^2}$ by
our generic Chernoff bound for Gaussians (10.21). So setting that tail
probability to be ε gives
$w \approx \sqrt{2\sigma^2 \log(1/\varepsilon)} = O\Big(\sqrt{\frac{\log(1/\varepsilon)}{d}}\Big).$
This may seem quite counter-intuitive: that 99% of the volume
of the sphere is within O(1/d) of the surface, yet 99% is within
$O(1/\sqrt{d})$ of any central slab! This challenges our notion of the ball
"looking like" a smooth circular object; it looks more like a very spiky
sea-urchin. (Figure 11.1: Sea Urchin, from uncommoncaribbean.com.)
Finally, a last observation:
a1 , a2 , a3 , . . . , a t , . . .
1. Can we compute the sum of all the integers seen so far? I.e.,
$F(a_{[1:t]}) = \sum_{i=1}^t a_i$. We want the outputs to be
3, 4, 21, 25, 16, 48, 149, 152, −570, −567, 333, 337, 369, . . .
3, 1, 17, 17, 17, 32, 101, 101, 101, 101, 900, 900, 900
3. The median? The outputs on the various prefixes of (12.1) now are
3, 1, 3, 3, 3, 3, 4, 3, . . .
1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 9, 9, 9 . . .
may just want to read over the file in one quick pass and come up
with an answer. Such an algorithm might also be cache-friendly. But
how to do this?
Two of the recurring themes will be:
and xit is the number of times the ith element in U has been seen until
time t. (Hence, xi0 = 0 for all i ∈ U.) When the next element comes in
and it is element j, we increment x j by 1.
(add, A), (add, B), (add, A), (del, B), (del, A), (add, C ), . . .
$F_p := \sum_{i=1}^{|U|} (x_i^t)^p.$ (12.2)
This estimator was given by Noga Alon, Yossi Matias, and Mario
Szegedy, in their Gödel-award winning paper on streaming computa- Alon, Matias, Szegedy (2000)
tion.
The choice of the hash family will be crucial: we want a small fam-
ily so that we require only a small amount of space to store the hash
function, but we want it to be rich enough for the subsequent analy-
sis to go through.
$C := \sum_{i \in U} x_i\, h(i).$
Then $\mathbb{E}[C^2] = \sum_{i,j} x_i x_j\, \mathbb{E}[h(i)h(j)] = \sum_i x_i^2 = F_2.$
$\mathbb{E}[(C^2)^2] = \mathbb{E}\Big[\sum_{p,q,r,s} h(p)h(q)h(r)h(s)\, x_p x_q x_r x_s\Big] = \sum_p x_p^4\, \mathbb{E}[h(p)^4] + 6\sum_{p<q} x_p^2 x_q^2\, \mathbb{E}[h(p)^2 h(q)^2] + \text{other terms}.$
This is because all the other terms have expectation zero. Why? The
terms like $\mathbb{E}[h(p)h(q)h(r)h(s)]$ where p, q, r, s are all distinct all become
zero because of 4-universality. Terms like $\mathbb{E}[h(p)^2 h(r)h(s)]$
become zero for the same reason. It is only terms like $\mathbb{E}[h(p)^2 h(q)^2]$
and $\mathbb{E}[h(p)^4]$ that survive, and since $h(p) \in \{-1,1\}$, they have expectation 1. So
$\Pr\big[|C^2 - \mathbb{E}[C^2]| > \varepsilon\, \mathbb{E}[C^2]\big] \leq \frac{\mathrm{Var}(C^2)}{(\varepsilon\, \mathbb{E}[C^2])^2} \leq \frac{2}{\varepsilon^2}.$
This is pretty pathetic: since ε is usually less than 1, the RHS is
usually more than 1.
this estimator has mean µ and variance σ²/k. (Why? Summing the
k independent copies sums the variances and so increases the total by a
factor k, but dividing by k reduces the variance by k².)
So if we keep k such independent counters $C_1, C_2, \ldots, C_k$, and return
their average $\overline{C^2} := \frac{1}{k}\sum_i C_i^2$, we get
$\Pr\big[|\overline{C^2} - \mathbb{E}[\overline{C^2}]| > \varepsilon\, \mathbb{E}[\overline{C^2}]\big] \leq \frac{\mathrm{Var}(\overline{C^2})}{(\varepsilon\, \mathbb{E}[\overline{C^2}])^2} \leq \frac{2}{k\varepsilon^2}.$
$M_{ij} := h_i(j).$
The estimate $\overline{C^2} = \frac{1}{k}\sum_{i=1}^k C_i^2$ is nothing but
$\frac{1}{k}\,\|Mx\|_2^2.$
This is completely analogous to the construction for JL: we've got
a slightly taller matrix with $k = O(\varepsilon^{-2}\delta^{-1})$ rows instead of $k = O(\varepsilon^{-2}\log \delta^{-1})$
rows. However, the matrix entries are not fully independent
(as in JL), just 4-wise independent. I.e., we need to store only
$O(k \log D)$ bits and can generate any entry of M quickly, whereas the
construction for JL stored all kD bits. (Henceforth, we use $S = \frac{1}{\sqrt{k}} M$
to denote the "sketch" matrix.)
Let us record two properties of this construction:
$\|\tilde{C} - AB\|_F^2 \leq \text{small}.$
This usually takes $O(n^3)$ time. Indeed, the ijth entry of the product
is the dot-product of the ith row $A_{i\star}$ of A with the jth column $B_{\star j}$ of
B, and each dot-product takes O(n) time.
Suppose instead we use a "fat and short" k × n matrix S (for k ≪ n),
and calculate
$\tilde{C} = A S^\intercal S B.$
By associativity of matrix multiplication, we could first compute
$(AS^\intercal)$ and $(SB)$ in time $O(n^2 k)$, and then multiply the results in
time $O(n^2 k)$. Moreover, the matrix S from the previous section works
pretty well, where we set D = n. (The intuition is that $S^\intercal S$ is an
almost-identity matrix: it has 1s on the diagonal and at most ε everywhere
else, and hence it gives only a small error. Of course, we don't multiply
out $S^\intercal S$, but instead compute $AS^\intercal$ and SB, and then multiply
the smaller matrices.)
Indeed, the entries of the error matrix $Y = AB - \tilde{C}$ satisfy
$\mathbb{E}[Y_{ij}] = 0$
and
$\mathbb{E}\big[\|Y\|_F^2\big] = \cdots = \frac{2}{k}\, \|A\|_F^2\, \|B\|_F^2.$
Finally, setting $k = \frac{2}{\varepsilon^2 \delta}$ and using Markov's inequality, we can say that
for any fixed ε > 0, we can compute an approximate matrix product
$\tilde{C} := AS^\intercal SB$ such that
$\Pr\big[\|AB - \tilde{C}\|_F \leq \varepsilon \cdot \|A\|_F\, \|B\|_F\big] \geq 1 - \delta,$
in time $O\big(\frac{n^2}{\varepsilon^2 \delta}\big)$. (If we want to make δ very small, at the expense of
picking more independent random bits in the sketching matrix S, we
can use the JL matrices instead. Details will appear in a homework.)
Finally, if the matrices A, B are sparse and contain only ≪ n² entries,
the time can be made to depend on nnz(A, B).
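A small numpy sketch (ours) of this approximate-product idea; for simplicity we use a Gaussian sketching matrix in place of the 4-wise independent one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 300, 1200  # k ~ 2/(eps^2 delta) for accuracy eps, failure prob delta

A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
S = rng.normal(size=(k, n)) / np.sqrt(k)   # sketch matrix, E[S^T S] = I

C_tilde = (A @ S.T) @ (S @ B)              # O(n^2 k) instead of O(n^3)
err = np.linalg.norm(A @ B - C_tilde, 'fro')
bound = np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')
print("relative Frobenius error:", err / bound)
```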
The approximate matrix product question has been considered
often, e.g., by Edith Cohen and David Lewis using a random-walks Cohen and Lewis (1999)
approach. The algorithm we present is due to Tamás Sarlós; his pa-
per gives better results, as well as extensions to computing SVDs
faster. Better bounds have subsequently been given by Clarkson and
Woodruff. More recent refs too.
same argument, for any integer s we expect the sth smallest mapped
value to be at $\approx \frac{sM}{d}$. We use a larger value of s to reduce the variance.
$D_t = \frac{M \cdot s}{L_t}.$
$\Pr\big[D_t > 2\|x^t\|_0\big] \leq \frac{3}{s}, \quad\text{and}$ (12.4)
$\Pr\Big[D_t < \frac{\|x^t\|_0}{2}\Big] \leq \frac{3}{s}.$ (12.5)
We will prove this in the next section. First, some observations.
Firstly, we now use the stronger assumption that the hash family is
2-universal; recall the definition from Section 12.2.2. Next, setting
s = 12 means that the estimate $D_t$ lies within $\big[\frac{\|x^t\|_0}{2},\, 2\|x^t\|_0\big]$ with
probability at least 1 − (1/4 + 1/4) = 1/2. (And we can boost the
$\Pr[\text{estimate too low}] = \Pr[D_t < d/2] = \Pr\Big[L_t > \frac{2sM}{d}\Big].$
Recall T is the set of all d (= ∥xt ∥0 ) distinct elements in U that
have appeared so far. How many of these elements in T hashed to
values greater than 2sM/d? The event that Lt > 2sM/d (which
is what we want to bound the probability of) is the same as saying
that fewer than s of the elements in T hashed to values smaller than
2sM/d. For each i = 1, 2, . . . , d, define the indicator
$X_i = \begin{cases} 1 & \text{if } h(e_i) \leq 2sM/d \\ 0 & \text{otherwise} \end{cases}$ (12.6)
$\Pr[X_i = 1] = \frac{\lfloor sM/2d \rfloor}{M} \geq \frac{s}{2d} - \frac{1}{M}.$ (12.7)
By linearity of expectations,
$\mathbb{E}[X] = \mathbb{E}\Big[\sum_{i=1}^d X_i\Big] = \sum_{i=1}^d \mathbb{E}[X_i] = \sum_{i=1}^d \Pr[X_i = 1] \geq d\Big(\frac{s}{2d} - \frac{1}{M}\Big) = \frac{s}{2} - \frac{d}{M}.$
Let's imagine we set M large enough so that d/M is, say, at most $\frac{s}{100}$.
Which means
$\mathbb{E}[X] \geq \frac{s}{2} - \frac{s}{100} = \frac{49s}{100}.$
So by Markov's inequality,
$\Pr[X > s] = \Pr\Big[X > \frac{100}{49}\, \mathbb{E}[X]\Big] \leq \frac{49}{100}.$
49 100
Good? Well, not so good. We wanted the probability of failure to be
smaller than 2/s; we got it to be only slightly less than 1/2. Good try,
but no cigar.
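To make the estimator concrete, here is a small Python sketch (ours) of the s-th-minimum idea: hash each element, keep the s smallest distinct hash values, and output D = Ms/L. The memoized dictionary stands in for a 2-universal hash function and is only for illustration.

```python
import random

def distinct_estimate(stream, s=32, M=2**31):
    """Estimate the number of distinct elements by tracking the s smallest
    distinct hash values and returning M*s / L, where L is the s-th
    smallest. A real implementation stores only O(s) words plus the
    hash function; the dict here is just a demo stand-in."""
    hash_of = {}
    mins = []  # sorted list of the s smallest distinct hash values seen
    for item in stream:
        v = hash_of.setdefault(item, random.randrange(1, M))
        if v not in mins and (len(mins) < s or v < mins[-1]):
            mins.append(v)
            mins.sort()
            del mins[s:]
    return M * s / mins[-1]  # assumes at least s distinct elements

stream = [random.randrange(10000) for _ in range(100000)]
print("estimate:", round(distinct_estimate(stream)), "true:", len(set(stream)))
```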
13.1 Introduction
$AV = UDV^\intercal V = UD$
lowing, we see how to obtain the SVD and why it solves our best fit
problem. The lecture is partly based on [2].
(Figure 13.2: a point $a_i$, its distance $\beta_i$ to the subspace V, and the length $\alpha_i$ of its projection onto V.)
We start with the case that k = 1. Thus, we look for the line
through the origin that minimizes the sum of the squared errors.
See Figure 13.2. It depicts a one-dimensional subspace V in blue. We
look at a point ai , its distance β i to V, and the length of its projection
to V which is named $\alpha_i$ in the picture. Notice that the squared length
of $a_i$ is $\alpha_i^2 + \beta_i^2$. Thus, for our fixed $a_i$, minimizing $\beta_i$ is equivalent to maxi-
mizing αi . If we represent V by a unit vector v that spans V (depicted
in orange in the picture), then we can compute the projection of ai to
V by the dot product ⟨ ai , v⟩. We have just argued that we can find the
best fit subspace of dimension one by solving
$\max_{v \in \mathbb{R}^d,\, \|v\|=1} \; \sum_{i=1}^n \langle a_i, v \rangle^2,$
which has the same optimizers as $\min_{v \in \mathbb{R}^d,\, \|v\|=1} \sum_{i=1}^n \mathrm{dist}(a_i, \mathrm{span}(v))^2$,
where $\mathrm{dist}(a_i, \mathrm{span}(v))$ denotes the distance between the point $a_i$ and
the line spanned by v. Now because $Av = (\langle a_1, v\rangle, \langle a_2, v\rangle, \ldots, \langle a_n, v\rangle)^\intercal$,
we can rewrite $\sum_{i=1}^n \langle a_i, v\rangle^2$ as $\|Av\|^2$. We define the first right
singular vector to be a unit vector that maximizes $\|Av\|$. We thus know
that the subspace spanned by it is the best fit subspace of dimension
one. (There may be many vectors that achieve the maximum: indeed,
for every v that achieves the maximum, $-v$ does too. Let us break ties
arbitrarily.)
Now we want to generalize this concept to more than one dimen- arbitrarily.
$u_i := \frac{Av_i}{\|Av_i\|} \quad \text{for all } i = 1, \ldots, r.$
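As a quick illustration (ours, using numpy), the first right singular vector returned by np.linalg.svd indeed maximizes ∥Av∥ over unit vectors, giving the best-fit line through the origin:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points in R^2, roughly along the direction (2, 1)
A = np.outer(rng.normal(size=200), [2.0, 1.0]) + 0.1 * rng.normal(size=(200, 2))

U, D, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                           # first right singular vector

print("best-fit direction:", v1)
print("||A v1|| =", np.linalg.norm(A @ v1))
# random unit vectors should not beat v1
for _ in range(5):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    print("||A w||  =", np.linalg.norm(A @ w))
```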
Proof. We prove the claim by using the fact that two matrices $A, B \in \mathbb{R}^{n \times d}$
are identical iff for all vectors v, the images are equal, i.e. $Av = Bv$.
Assume that the entries in U and V are positive. Since the column
vectors are unit vectors, they define a convex combination of the
$A(A^+ b) = b \quad \forall b \text{ in the image of } A$
$x^* = A^+ b$
For instance, you can check that $A^k$ or $e^A$ defined this way indeed
correspond to what you think they might mean. (The other way to
define $e^A$ would be $\sum_{k \geq 0} \frac{A^k}{k!}$.)
Part III
“Modern” Algorithms
14
Online Learning: Experts and Bandits
$2.41\,(m_i + \log_2 N),$
Note that
Theorem 14.4. For ε ∈ (0, 1/2), penalizing each incorrect expert by a factor
of (1 − ε) guarantees that the number of mistakes made by MW is at most
$2(1+\varepsilon)\, m_i + O\Big(\frac{\log N}{\varepsilon}\Big).$
This shows that we can make our mistakes bound as close to 2m∗
as we want, but this approach seems to have this inherent loss of
a factor of 2. In fact, no deterministic strategy can do better than a
factor of 2, as we show next.
Proposition 14.5. No deterministic algorithm A can do better than a factor
of 2, compared to the best expert.
Proof. Note that if the algorithm is deterministic, its predictions are
completely determined by the sequence seen thus far (and hence can
also be computed by the adversary). Consider a scenario with two
experts: the first always predicts 1, and the second always predicts 0.
Since A is deterministic, an adversary can fix the outcomes such that
A's predictions are always wrong. Hence at least one of the two experts
has an error rate of at most 1/2, while A's error rate is 1.
Note that the update of the weights proceeds exactly the same as
previously.
Theorem 14.6. Fix ε ≤ 1/2. For any fixed sequence of predictions, the
expected number of mistakes made by randomized weighted majority (RWM)
is at most
$\mathbb{E}[M] \leq (1+\varepsilon)\, m_i + O\Big(\frac{\log N}{\varepsilon}\Big).$
(The quantity $\varepsilon m_i + O\big(\frac{\log N}{\varepsilon}\big)$, the gap between the algorithm's
performance and that of the best expert, is called the regret with respect
to expert i.)
Proof. The proof is an analysis of the weight evolution that is more
careful than in Theorem 14.4. Again, the potential is $\Phi_t = \sum_i w_i^{(t)}$.
Define
$F_t := \frac{\sum_{i\ \text{incorrect}} w_i^{(t)}}{\sum_i w_i^{(t)}}$
to be the fraction of weight on incorrect experts at time t. Note that
$\mathbb{E}[M] = \sum_{t \in [T]} F_t.$
$\mathbb{E}[M] \leq m_i (1+\varepsilon) + \frac{\ln N}{\varepsilon}.$
Let's broaden the setting slightly, and consider the following dot-product
game. In each round,
1. The algorithm produces a vector of probabilities
$p^t = (p_1^t, p_2^t, \cdots, p_N^t) \in \Delta_N.$
(Define the probability simplex as $\Delta_N := \{ x \in [0,1]^N \mid \sum_i x_i = 1 \}$.)
to deduce that
Pr[mistake at time t] = ℓt , pt .
Theorem 14.7. Consider a fixed ε ≤ 1/2. For any sequence of loss vectors
in $[-1,1]^N$ and for all indices $i \in [N]$, the Hedge algorithm guarantees:
$\sum_{t=1}^T \langle p^t, \ell^t \rangle \leq \sum_{t=1}^T \ell_i^t + \varepsilon T + \frac{\ln N}{\varepsilon}$
Theorem 14.8. Consider a fixed ε ≤ 1/2. For any sequence of loss vectors
in $[-1,1]^N$, the Hedge algorithm guarantees that for any index $i \in [N]$:
$\sum_{t=1}^T \langle p^t, \ell^t \rangle \leq \sum_{t=1}^T \ell_i^t + \varepsilon \sum_{t=1}^T \big\langle (\ell^t)^2,\, p^t \big\rangle + \frac{\ln N}{\varepsilon},$
where $(\ell^t)^2$ denotes the entry-wise square.
$\frac{1}{T}\sum_t \langle p^t, \ell^t \rangle \leq \min_i \frac{1}{T}\sum_t \ell_i^t + \varepsilon = \min_{p^* \in \Delta_N} \frac{1}{T}\sum_t \langle \ell^t, p^* \rangle + \varepsilon.$
Corollary 14.10 (Average Gain). Let ρ ≥ 1 and ε ∈ (0, 1/2). For any
sequence of gain vectors $g^1, \ldots, g^T \in [-\rho, \rho]^N$ with $T \geq \frac{4\rho^2 \ln N}{\varepsilon^2}$, the gains
version of the Hedge algorithm produces probability vectors $p^t \in \Delta_N$ such
that
$\frac{1}{T}\sum_{t=1}^T \langle g^t, p^t \rangle \geq \max_{i \in [N]} \frac{1}{T}\sum_{t=1}^T \langle g^t, e_i \rangle - \varepsilon.$
However, now the algorithm only gets to see the loss ℓtat corre-
sponding to the action chosen by the algorithm, and not the entire
loss vector.
This limited-information setting is called the bandit setting. The name comes from the analysis of
slot machines, which are affectionately
known as “one-armed bandits”.
14.5.1 The Exp3 Algorithm
Surprisingly, we can obtain algorithms for the bandit setting from
algorithms for the experts setting, by simply “hallucinating” the
online learning: experts and bandits 181
cost vector, using an idea called importance sampling. This causes the
parameters to degrade, however.
Indeed, consider the following algorithm: we run an instance A
of the RWM algorithm, which is in the full information model. So at
each timestep,
1. A produces a probability vector $p^t \in \Delta_N$.
2. We sample an expert $I^t$ from the distribution $q^t := (1-\gamma)p^t + \frac{\gamma}{N}\mathbf{1}$, and follow its advice.
3. We get back the loss value $\ell^t_{I^t}$ for this chosen expert.
However, the LHS is not our real loss, since we chose $I^t$ according to
$q^t$ and not $p^t$. This means our expected total loss is really
$\sum_t \langle q^t, \ell^t \rangle = (1-\gamma)\sum_t \langle p^t, \ell^t \rangle + \frac{\gamma}{N} \sum_t \langle \mathbf{1}, \ell^t \rangle$
$\leq \sum_t \ell_i^t + \varepsilon T + \frac{N \log N}{\gamma\, \varepsilon} + \gamma T.$
Now choosing $\varepsilon = \sqrt{\frac{\log N}{T}}$ and $\gamma = \sqrt{N}\big(\frac{\log N}{T}\big)^{1/4}$ gives us a regret
of $\approx N^{1/2} T^{3/4}$. The interesting fact here is that the regret is again
sub-linear in T, the number of timesteps: this means that as T → ∞,
the per-step regret tends to zero.
The dependence on N, the number of experts/options, is now
polynomial, instead of being logarithmic as in the full-information
case. This is necessary: there is a lower bound of $\Omega(\sqrt{NT})$ in the
bandit setting. And indeed, the Exp3 algorithm itself achieves a near-optimal
regret bound of $O(\sqrt{NT \log N})$; we can show this by using a
finer analysis of Hedge that makes more careful approximations. We
defer these improvements to §14.5.3, and instead give an application
of this bandit setting to a problem in item pricing.
We can now use the low-regret algorithms for the experts problem to
show how to approximately solve linear programs (LPs). As a warm-
up, we use it to solve two-player zero-sum games, which are a special
case of LPs. In fact, zero-sum games are equivalent
to linear programming, see this work of
Ilan Adler. Is there an earlier reference?
15.1 (Two-Player) Zero-Sum Games
There are two players in such a game, traditionally called the “row
player" and the “column player". Each of them has some set of ac-
tions: the row player with m actions (associated with the set [m]), and
the column player with the n actions in [n]. Finally, we have a payoff
matrix M ∈ Rm×n . In a play of the game, the row player chooses a
row i ∈ [m], and simultaneously, the column player chooses a column
j ∈ [n]. If this happens, the row player gets $M_{i,j}$, and the column
player loses $M_{i,j}$. The winnings of the two players sum to zero, and
so we imagine that the payoff flows from the column player to the
row player. (Henceforth, when we talk about payoffs, these will always
refer to payoffs to the row player from the column player. This payoff
may be negative, which would capture situations where the column
player does better.)

15.1.1 Strategies, and Best-Response

Each player is allowed to have a randomized strategy. Given strategies
$p \in \Delta_m$ for the row player, and $q \in \Delta_n$ for the column player, the
expected payoff (to the row player) is
$p^\intercal M q = \sum_{i,j} p_i\, M_{i,j}\, q_j.$
The row player wants to maximize this value, while the column
player wants to minimize it.
Suppose the row player fixes a strategy p ∈ ∆m . Knowing p, the
column player can choose an action to minimize the expected payoff:
$C(p) := \min_{q \in \Delta_n} p^\intercal M q = \min_{j \in [n]} p^\intercal M e_j.$
The equality holds because the expected payoff is linear, and hence
the column player’s best strategy is to choose a column that mini-
mizes the expected payoff. The column player is said to be playing
their best response. Analogously, if the column player fixes a strategy
q ∈ ∆n , the row player can maximize the expected payoff by playing
their own best response:
$R(q) := \max_{p \in \Delta_m} p^\intercal M q = \max_{i \in [m]} e_i^\intercal M q.$
Now, the row player would love to play the strategy p such that
even if the column player plays best-response, the payoff is as large
as possible: i.e., it wants to achieve
max C ( p).
p∈∆m
min R(q).
q∈∆n
C ( p) ≤ R(q) (15.1)
$p^t \in \Delta_m$. Initially $p^1 = \big(\frac{1}{m}, \ldots, \frac{1}{m}\big)$, which represents that the row
player chooses each row with equal probability, when they have no
information to work with.
At each time t, the column player plays the best-response to pt , i.e.,
$j^t := \arg\max_{j \in [n]} (p^t)^\intercal M e_j.$
to be the average long-term plays of the row player, and of the best
responses of the column player to those plays. We know that
$C(\hat{p}) \leq R(\hat{q})$
by (15.1). But by Corollary 14.10, after $T \geq \frac{4 \ln m}{\varepsilon^2}$ steps,
$\frac{1}{T}\sum_t \langle p^t, g^t \rangle \geq \max_i \frac{1}{T}\sum_t \langle e_i, g^t \rangle - \varepsilon$ (by Hedge)
$= \max_i \Big\langle e_i, \frac{1}{T}\sum_t g^t \Big\rangle - \varepsilon$
$= \max_i \Big\langle e_i, M\, \frac{1}{T}\sum_t e_{j^t} \Big\rangle - \varepsilon$ (by definition of $g^t$)
$= \max_i \langle e_i, M\hat{q} \rangle - \varepsilon$
$= R(\hat{q}) - \varepsilon.$
Since $p^t$ is the row player's strategy, and C is concave (i.e., the payoff
on the average strategy $\hat{p}$ is at least the average of the payoffs):
$\frac{1}{T}\sum_t \langle p^t, g^t \rangle = \frac{1}{T}\sum_t C(p^t) \leq C\Big(\frac{1}{T}\sum_t p^t\Big) = C(\hat{p}).$
(To see this, recall that $C(p) := \min_q p^\intercal M q$ is a minimum of linear
functions of p, and hence concave.)
We assume that ρ ≥ 1.
$\langle p^t, g^t \rangle = \langle p^t, Ax^t - b \rangle = \langle p^t, Ax^t \rangle - \langle p^t, b \rangle = \langle \alpha^t, x^t \rangle - \beta^t \leq 0,$
$\frac{1}{T}\sum_{t=1}^T \langle p^t, g^t \rangle \leq 0.$
$\frac{1}{T}\sum_{t=1}^T \langle e_i, g^t \rangle = \Big\langle e_i, \frac{1}{T}\sum_{t=1}^T g^t \Big\rangle = \frac{1}{T}\sum_{t=1}^T \big(\langle a_i, x^t \rangle - b_i\big) = \langle a_i, \hat{x} \rangle - b_i.$
$0 \geq \frac{1}{T}\sum_{t=1}^T \langle p^t, g^t \rangle \geq \max_i \big(\langle a_i, \hat{x} \rangle - b_i\big) - \varepsilon.$
This shows that $A\hat{x} \leq b + \varepsilon \mathbf{1}$.
$A\hat{x} \leq b + (\varepsilon + \delta)\mathbf{1},$
but now the number of calls to the relaxed oracle can be even
smaller, namely $O(\rho_{\mathrm{rlx}}^2 \ln m/\varepsilon^2)$.
In the s-t maximum flow problem, we are given a graph G = (V, E), and
distinguished vertices s and t. Each edge has a capacity $u_e \geq 0$; we
will mostly focus on the unit-capacity case of ue = 1 in this chapter.
The graph may be directed or undirected; an undirected edge can be
modeled by two oppositely directed edges having the same capacity.
Recall that an s-t flow is an assignment $f : E \to \mathbb{R}_+$ such that at
every vertex $v \notin \{s, t\}$, the flow into v equals the flow out of v.
The value of flow f is ∑e=(s,w)∈E f (e) − ∑e=(u,s)∈E f (e), the net amount
of flow leaving the source node s. The goal is to find an s-t flow in
the network, that satisfies the edge capacities, and has maximum
value.
Algorithms by Edmonds and Karp, by Yefim Dinitz, and many
others can solve the s-t max-flow problem exactly in polynomial
time. For the special case of (directed) graphs with unit capaci-
ties, Shimon Even and Bob Tarjan, and independently, Alexander
Karzanov showed in 1975 that the Ford-Fulkerson algorithm finds
the maximum flow in time O(m · min(m1/2 , n2/3 )). This runtime
was eventually matched for general capacities (up to some poly-
logarithmic factors) by an algorithm of Andrew Goldberg and Satish
Rao in 1998. For the special case of m = O(n), these results gave a
runtime of O(m1.5 ), but nothing better was known even for approx-
imate max-flows, even for unit-capacity undirected graphs—until a
breakthrough in 2010, which we will see at the end of this chapter.
$\max \sum_{P \in \mathcal{P}} f_P$ (16.1)
$\text{s.t.} \quad \sum_{P: e \in P} f_P \leq u_e \quad \forall e \in E$
$f_P \geq 0 \quad \forall P \in \mathcal{P}$
The first set of constraints says that for each edge e, the total contribution
of the paths through e is no greater than the capacity $u_e$ of that edge.
The second set of constraints says that the flow on each path
must be non-negative. This is a gigantic linear program: there could
be an exponential number of s-t paths. As we see, this will not be a
hurdle.
$K := \Big\{ f \;\Big|\; \sum_{P \in \mathcal{P}} f_P = F,\; f \geq 0 \Big\}.$
$\sum_{P \in \mathcal{P}} f_P\, \mathrm{len}_t(P) \leq 1,$ (16.3)
3. it increases the length of each edge on this path multiplicatively.
(The factor happens to be $(1 + \varepsilon/F)$, because of how we rescale the
gains, but that does not matter for this intuition.)
This length-increase makes congested edges (those with a lot of flow)
be much longer, and hence become very undesirable when search-
ing for short paths. Note that the process is repeated some number
of times, and then we average all the flows we find. So unlike usual
network flow algorithms based on residual networks, these algo-
rithms are truly greedy and cannot “undo” past actions (which is
what pushing flow in residual flow networks does, when we use an
arc backwards). This means these MW-based algorithms must ensure
that very little flow goes on edges that are “wasteful”.
To illustrate this point, consider an example commonly used to
show that the greedy algorithm does not work for max-flow: Change
the figure to make it more instructive.
undirected graphs. Since then, works by Jonah Sherman (2013), and by
Kelner, Lee, Orecchia, and Sidford (2013) gave $O(m^{1+o(1)}/\varepsilon^{O(1)})$-time
algorithms for the problem. The current best runtime is
$O(m\,\mathrm{poly}\log m/\varepsilon^{O(1)})$, due to Richard Peng (2014).
Interestingly, Shang-Hua Teng, Jonah
Sherman, and Richard Peng are all
CMU graduates.
16.3.1 Electrical Flows
Given a connected undirected graph with general edge-capacities, we
can view it as an electrical circuit, where each edge e of the original
graph represents a resistor with resistance $r_e = 1/u_e$, and we connect
(say, a 1-volt) battery between s and t. This causes electrical current to
flow from s (the node with higher potential) to t. Recall the following
laws about electrical flows. (Figure 16.1: the currents on the wires
would produce an electric flow, where all the wires within the graph
have resistance 1; here $\varphi_s = 1$ and $\varphi_t = 0$.)
Theorem 16.2 (Kirchoff's Voltage Law). The directed potential changes
along any cycle sum to 0.
This means we can assign each node v a potential $\phi_v$. Now the
actual amount of current on any edge is given by Ohm's law, and is
related to the potential drop across the edge.
$f_{uv} = \frac{\phi_u - \phi_v}{r_{uv}}.$
For example, if we take the 6-node graph in Figure 16.1 and assume
that all edges have unit conductance, then its Laplacian matrix $L_G$ is:
$L_G = \begin{array}{c|cccccc} & s & t & u & v & w & x \\ \hline s & 2 & 0 & -1 & -1 & 0 & 0 \\ t & 0 & 2 & 0 & 0 & -1 & -1 \\ u & -1 & 0 & 3 & 0 & -1 & -1 \\ v & -1 & 0 & 0 & 2 & 0 & -1 \\ w & 0 & -1 & -1 & 0 & 2 & 0 \\ x & 0 & -1 & -1 & -1 & 0 & 3 \end{array}$
Now for a general graph G, we define the Laplacian to be:
$L_G = \sum_{uv \in E} L_{uv}.$
(The Laplacian $L_{uv}$ for the single edge uv has 1s on the diagonal at
locations (u, u), (v, v), and −1s at locations (u, v), (v, u). Draw figure.)
In other words, $L_G$ is the sum of little 'per-edge' Laplacians $L_{uv}$.
(Since each of those Laplacians is clearly positive semidefinite (PSD),
it follows that $L_G$ is PSD too. Recall that a symmetric matrix $A \in \mathbb{R}^{n \times n}$
is called PSD if $x^\intercal A x \geq 0$ for all $x \in \mathbb{R}^n$, or equivalently, if all its
eigenvalues are non-negative.)
For yet another definition for the Laplacian, first consider the
edge-vertex incidence matrix B ∈ {−1, 0, 1}m×n , where the rows are
indexed by edges and the columns by vertices. The row correspond-
ing to edge e = uv has zeros in all columns other than u, v, it has
an entry +1 in one of those columns (say u) and an entry −1 in the
$B = \begin{array}{c|ccccccc} & su & sv & uw & ux & vx & wt & xt \\ \hline s & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ t & 0 & 0 & 0 & 0 & 0 & -1 & -1 \\ u & -1 & 0 & 1 & 1 & 0 & 0 & 0 \\ v & 0 & -1 & 0 & 0 & 1 & 0 & 0 \\ w & 0 & 0 & -1 & 0 & 0 & 1 & 0 \\ x & 0 & 0 & 0 & -1 & -1 & 0 & 1 \end{array}$
(Strictly speaking, this display shows $B^\intercal$: the rows of B are indexed by edges and its columns by vertices.)
A little algebra shows this to be the vth entry of the vector Lϕ. Finally,
by 16.4, this net current into v must be zero, unless v is either s or t,
in which case it is either −k or k respectively. Summarizing, if ϕ are
the voltages at the nodes, they satisfy the linear system:
Lϕ = k(es − et ).
$\mathcal{E}(f) := \sum_{e \in E} f_e^2\, r_e = \sum_{(u,v) \in E} \frac{(\phi_u - \phi_v)^2}{r_{uv}} = \phi^\intercal L \phi.$
$\|L\hat{x} - b\|_L \leq \varepsilon\, \|\bar{x}\|_L.$
The algorithm is randomized? and runs in time O(m log2 n log 1/ε).
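For intuition, here is a dense numpy sketch (ours; the real solvers are the near-linear-time ones cited above) that computes electrical potentials and the resulting flow on the 6-node example by solving Lϕ = e_s − e_t with the pseudoinverse:

```python
import numpy as np

nodes = ['s', 't', 'u', 'v', 'w', 'x']
edges = [('s','u'), ('s','v'), ('u','w'), ('u','x'),
         ('v','x'), ('w','t'), ('x','t')]
idx = {v: i for i, v in enumerate(nodes)}

L = np.zeros((6, 6))
for (a, b) in edges:                      # unit resistances r_e = 1
    i, j = idx[a], idx[b]
    L[i, i] += 1; L[j, j] += 1
    L[i, j] -= 1; L[j, i] -= 1

b_vec = np.zeros(6)
b_vec[idx['s']], b_vec[idx['t']] = 1, -1  # one unit of current from s to t
phi = np.linalg.pinv(L) @ b_vec           # L is singular; use pseudoinverse

for (a, b) in edges:                      # Ohm's law: f = (phi_u - phi_v)/r
    print(f"flow on {a}{b}: {phi[idx[a]] - phi[idx[b]]:+.3f}")
print("energy phi^T L phi =", phi @ L @ phi)
```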
E ( f ) ≤ (1 + δ)E ( fe),
For the rest of this lecture we assume we can compute the corresponding
minimum-energy flow exactly in time $\widetilde{O}(m)$. The arguments
can easily be extended to incorporate the errors.
$K = \Big\{ f \;\Big|\; \sum_{P \in \mathcal{P}} f_P = F,\; f \geq 0 \Big\},$
$\sum_e p_e f_e \leq 1.$ (16.4)
$\sum_{e \in E} p_e f_e \leq (1+\varepsilon) \sum_{e \in E} p_e + \varepsilon = 1 + 2\varepsilon.$
2. (width) $\max_e f_e \leq O(\sqrt{m/\varepsilon})$.
Proof. Since the flow $f^*$ satisfies all the constraints, it burns energy
$\mathcal{E}(f^*) = \sum_e (f_e^*)^2 r_e \leq \sum_e r_e = \sum_e \big(p_e + \tfrac{\varepsilon}{m}\big) = 1 + \varepsilon.$
This proves the first part of the theorem. For the second part, we may
use the bound on energy burnt to obtain
$\sum_e f_e^2\, \tfrac{\varepsilon}{m} \leq \sum_e f_e^2 \big(p_e + \tfrac{\varepsilon}{m}\big) = \sum_e f_e^2\, r_e = \mathcal{E}(f) \leq 1 + \varepsilon.$
The idea to get an improved bound on the width is to use a crude but
effective trick: if we have an edge with electrical flow of more than
ρ ≈ m1/3 in some iteration, we delete it for that iteration (and for the
rest of the process), and find a new flow. Clearly, no edge now carries
a flow more than ρ. The main thrust of the proof is to show that we
do not end up butchering the graph, and that the maximum flow
value reduces by only a small amount due to these edge deletions.
Formally, we set
$\rho = \frac{m^{1/3} \log m}{\varepsilon},$ (16.5)
and show that at most εF edges are ever deleted by the process. The
crucial ingredient in this proof is this observation: every time we
delete an edge, the effective resistance between s and t increases by a
lot. (We assume that a flow value of F is feasible; moreover, $F \geq \rho$,
else Ford-Fulkerson can be implemented in time $O(mF) \leq \widetilde{O}(m^{4/3})$.)
Since we need to argue about how many edges are deleted in the
entire algorithm (and not just in one call to the oracle), we explicitly
maintain edge-weights $w_e^t$, instead of using the results from the
previous sections as a black-box.
$R'_{\mathrm{eff}} \geq R_{\mathrm{eff}}.$
$R'_{\mathrm{eff}} \geq \frac{R_{\mathrm{eff}}}{1 - \beta}.$
2. If there is an edge e with f et > ρ, delete e (for the rest of the algo-
rithm), and go back to Item 1.
Lemma 16.10. We delete at most m1/3 ≤ εF edges over the run of the
algorithm.
We defer the proof to later, and observe that the total number of
electrical flows computed is therefore O(T). Each such computation
takes $\widetilde{O}(m/\varepsilon)$ by Corollary 16.6, so the overall runtime of our
algorithm is $O(m^{4/3}/\mathrm{poly}(\varepsilon))$.
Next, we show that the flow $\hat{f}$ is a $(1 + O(\varepsilon))$-approximate maxi-
mum s-t flow. We start with an analog of Theorem 16.7 that accounts
for edge deletions.
The last step is very loose, but it will suffice for our purposes.
To calculate the congestion of the final flow, observe that even
though the algorithm above explicitly maintains weights, we can just
appeal directly to the guarantees. Indeed, define $p_e^t := \frac{w_e^t}{W^t}$ for each
time t; the previous part implies that the flow $f^t$ satisfies
$\sum_e p_e^t f_e^t \leq 1 + 3\varepsilon$
3. Each deleted edge e has flow at least ρ, and hence energy burn at
least $\rho^2 w_e^t \geq \rho^2 \frac{\varepsilon}{m} W^t$. Since the total energy burn is at most
$2W^t$ from Lemma 16.11, the deleted edge e was burning at least
$\beta := \frac{\rho^2 \varepsilon}{2m}$ fraction of the total energy. Hence
$R^{\mathrm{new}}_{\mathrm{eff}} \geq \frac{R^{\mathrm{old}}_{\mathrm{eff}}}{1 - \frac{\rho^2 \varepsilon}{2m}} \geq R^{\mathrm{old}}_{\mathrm{eff}} \cdot \exp\Big(\frac{\rho^2 \varepsilon}{4m}\Big)$
if we use $\frac{1}{1-x} \geq e^{x/2}$ when $x \in [0, 1/4]$.
4. For the final effective resistance, note that we send $F$ flow with total energy burn $2W^T$; since the energy depends on the square of the flow, we have $R^{final}_{\mathrm{eff}} \le \frac{2W^T}{F^2} \le 2W^T$.
(All these calculations hold as long as we have not deleted more than
ε F edges.) Now, to show that this invariant is maintained, suppose D
edges are deleted over the course of the T steps. Then
$$R^{0}_{\mathrm{eff}} \cdot \exp\Big(D \cdot \frac{\rho^2\varepsilon}{4m}\Big) \le R^{final}_{\mathrm{eff}} \le 2W^T \le 2m \cdot \exp\Big(\frac{2\ln m}{\varepsilon}\Big).$$
Taking logs and simplifying, we get that
$$\frac{\varepsilon \rho^2 D}{4m} \le \ln(2m^3) + \frac{2\ln m}{\varepsilon}
\implies D \le \frac{8m(\ln m)(1+O(\varepsilon))}{\varepsilon^2 \rho^2} \ll m^{1/3} \le \varepsilon F.$$
This bounds the number of deleted edges D as desired.
$$\mathcal{E}(f) = \sum_e f_e^2\, r_e$$
for flow values $f$ that represent a unit flow from $s$ to $t$ (these form a polytope). We alluded to algorithms that solve this problem, but one can also observe that $\mathcal{E}(f)$ is a convex function, and we want to find a minimizer within some polytope $K$. Equivalently, we wanted to solve the linear system $L\phi = (e_s - e_t)$, i.e., to minimize $\|L\phi - (e_s - e_t)\|_2$.
$$f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y), \tag{17.2}$$

There are two kinds of problems that we will study. The most basic question is that of unconstrained convex minimization (UCM): given a convex function $f$, we want to find
$$\min_{x \in \mathbb{R}^n} f(x).$$
17.1.1 Gradient
For most of the following discussion, we assume that the function f
is differentiable. In that case, we can give an equivalent characteriza-
tion, based on the notion of the gradient $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$.

(The directional derivative of $f$ at $x$ in the direction $y$ is defined as $f'(x; y) := \lim_{\varepsilon \to 0} \frac{f(x + \varepsilon y) - f(x)}{\varepsilon}$. If there exists a vector $g$ such that $\langle g, y\rangle = f'(x; y)$ for all $y$, then $f$ is called differentiable at $x$, and $g$ is called the gradient. It follows that the gradient must be of the form $\nabla f(x) = \big(\frac{\partial f}{\partial x_1}(x), \frac{\partial f}{\partial x_2}(x), \cdots, \frac{\partial f}{\partial x_n}(x)\big)$.)

Fact 17.3 (First-order condition). A function $f : K \to \mathbb{R}$ is convex if and only if
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle, \tag{17.3}$$
for all $x, y \in K$.

Geometrically, Fact 17.3 states that the function always lies above its tangent plane, for all points in $K$. If the function $f$ is twice-differentiable, and if $H_f(x)$ is its Hessian matrix, i.e. its matrix of second derivatives at $x \in K$:
$$(H_f)_{i,j}(x) := \frac{\partial^2 f}{\partial x_i \partial x_j}(x), \tag{17.4}$$
then we get yet another characterization of convex functions.
Fact 17.4 (Second-order condition). A twice-differentiable function f
is convex if and only if H f ( x ) is positive semidefinite for all x ∈ K.
| f ( x ) − f (y)| ≤ G ∥ x − y∥ ,
for all x, y ∈ K.
∥∇ f ( x )∥2 ≤ G, (17.5)
for all x ∈ K.
∇ f ( x ) = 0 ⇐⇒ Ax = b ⇐⇒ x = A−1 b.
$$x_{t+1} \leftarrow x_t - \eta_t \cdot \nabla f(x_t). \tag{17.6}$$
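To make the update rule (17.6) concrete, here is a minimal sketch (our illustration, not code from the notes); the names `gradient_descent`, `grad`, `eta`, and `T` are hypothetical, and we return the average iterate since that is what the guarantees below bound.

```python
import numpy as np

def gradient_descent(grad, x0, eta, T):
    """Iterate x_{t+1} = x_t - eta * grad(x_t), per (17.6), and return
    the average iterate x-hat, for which guarantee (17.7) is stated."""
    x = x0.astype(float)
    iterates = []
    for _ in range(T):
        iterates.append(x)
        x = x - eta * grad(x)
    return np.mean(iterates, axis=0)

# Example: f(x) = ||x||^2 / 2 has gradient x and minimizer 0.
x_hat = gradient_descent(grad=lambda x: x, x0=np.ones(5), eta=0.1, T=1000)
```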
$$f(\hat{x}) \le f(x^*) + \varepsilon. \tag{17.7}$$
$$\sum_{t=1}^{T} f(x_t) \le \sum_{t=1}^{T} f(x^*) + \frac{1}{2}\eta T G^2 + \frac{1}{2\eta}\|x_0 - x^*\|^2. \tag{17.8}$$
We will prove Theorem 17.8 in the next section, but let’s first use it
to prove Proposition 17.7, our guarantee on the offline convergence of
vanilla gradient descent.
By Theorem 17.8,
$$\frac{1}{T}\sum_{t=1}^{T} f(x_t) \le f(x^*) + \underbrace{\frac{1}{2}\eta G^2 + \frac{1}{2\eta T}\|x_0 - x^*\|^2}_{\text{error}}.$$
The error terms balance when $\eta = \frac{\|x_0 - x^*\|}{G\sqrt{T}}$, giving
$$f(\hat{x}) \le f(x^*) + \frac{\|x_0 - x^*\|\, G}{\sqrt{T}}.$$
Finally, we set $T = \frac{1}{\varepsilon^2}\, G^2 \|x_0 - x^*\|^2$ to obtain
$$f(\hat{x}) \le f(x^*) + \varepsilon.$$
Unlike the unconstrained case, the gradient at the minimizer may not be zero in the constrained case—it may be at the boundary. In this case, the condition for a convex function $f : K \to \mathbb{R}$ to be minimized at $x^* \in K$ is that $\langle \nabla f(x^*), y - x^* \rangle \ge 0$ for all $y \in K$. (This is the analog of the minimizer of a single-variable function being achieved either at a point where the derivative is zero, or at the boundary.)
function values. But we must change our algorithm to ensure that the new point $x_{t+1}$ lies within $K$. To ensure this, we simply project the new iterate $x_{t+1}$ back onto $K$. Let $\operatorname{proj}_K : \mathbb{R}^n \to K$ be defined as
$$\operatorname{proj}_K(y) := \arg\min_{x \in K} \|x - y\|_2.$$
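As a small illustration (a sketch under the assumption that $K$ is the Euclidean unit ball, one of the few bodies with a closed-form projection), projected gradient descent just composes each step with $\operatorname{proj}_K$:

```python
import numpy as np

def proj_unit_ball(y):
    """Projection onto K = {x : ||x|| <= 1}: rescale points outside the ball."""
    norm = np.linalg.norm(y)
    return y if norm <= 1.0 else y / norm

def projected_gradient_descent(grad, x0, eta, T, proj=proj_unit_ball):
    x = proj(x0)
    iterates = []
    for _ in range(T):
        iterates.append(x)
        x = proj(x - eta * grad(x))  # gradient step, then project back onto K
    return np.mean(iterates, axis=0)
```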
$$\Phi_{t+1} - \Phi_t = \frac{1}{2\eta}\Big(\|x_{t+1} - x^*\|^2 - \|x_t - x^*\|^2\Big) \tag{17.13}$$
$$\le \frac{1}{2\eta}\Big(\|x'_{t+1} - x^*\|^2 - \|x_t - x^*\|^2\Big). \tag{17.14}$$
Now the rest of the proof of Theorem 17.8 goes through unchanged.
Why is the claim $\|x'_{t+1} - x^*\| \ge \|x_{t+1} - x^*\|$ true? Since $K$ is convex, projecting onto it gets us closer to every point in $K$, in particular to $x^* \in K$. To formally prove this fact about projections, consider the angle $x^* \to x_{t+1} \to x'_{t+1}$. This is a non-acute angle, since the orthogonal projection means $K$ lies to one side of the hyperplane defined by the vector $x'_{t+1} - x_{t+1}$, as in the figure on the right.
Looking back at the proof in §17.2, the proof of Lemma 17.9 immedi-
ately extends to give us
$$f_t(x_t) + \Phi_{t+1} - \Phi_t \le f_t(x^*) + \frac{1}{2}\eta G^2.$$
Now summing this over all times $t$ gives
$$\sum_{t=1}^{T} \big(f_t(x_t) - f_t(x^*)\big) \le \sum_{t=1}^{T} \big(\Phi_t - \Phi_{t+1}\big) + \frac{\eta}{2} T G^2 \le \Phi_1 + \frac{1}{2}\eta T G^2,$$
since $\Phi_{T+1} \ge 0$. The proof is now unchanged: setting $T \ge \frac{\|x_1 - x^*\|^2 G^2}{\varepsilon^2}$ and $\eta = \frac{\|x_1 - x^*\|}{G\sqrt{T}}$, and doing some elementary algebra as above,
$$\frac{1}{T}\sum_{t=1}^{T} \big(f_t(x_t) - f_t(x^*)\big) \le \frac{\|x_1 - x^*\|\, G}{\sqrt{T}} \le \varepsilon.$$
for all convex bodies K and all convex functions, as opposed to being
just for the unit simplex ∆n and linear losses f t ( x ) = ⟨ℓt , x ⟩, say for
ℓt ∈ [−1, 1]n . However, in order to make a fair comparison, suppose
we restrict ourselves to ∆n and linear losses, and consider the number
of rounds T before we get an average regret of ε.
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|x - y\|^2. \tag{17.15}$$
We will work with the first-order definition, and show that the gradient descent algorithm with (time-varying) step size $\eta_t = O\big(\frac{1}{\alpha t}\big)$ converges to a value at most $f(x^*) + \varepsilon$ in time $T = \Theta\big(\frac{G^2}{\alpha \varepsilon}\big)$. Note there is no more dependence on the diameter of the polytope. Before we give this proof, let us give the other relevant definitions.
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2}\|x - y\|^2. \tag{17.16}$$
In this case, the gradient descent algorithm with fixed step size $\eta_t = \eta = O\big(\frac{1}{\beta}\big)$ yields an $\hat{x}$ which satisfies $f(\hat{x}) - f(x^*) \le \varepsilon$ when $T = \Theta\big(\frac{\beta \|x_1 - x^*\|^2}{\varepsilon}\big)$. In this case, note we have no dependence on the Lipschitzness $G$ any more; we only depend on the diameter of the polytope. Again, we defer the proof for the moment.
$$f(x_t) - f(x^*) \le \frac{\beta}{2}\,\exp\Big(\frac{-t}{\kappa}\Big)\,\|x_1 - x^*\|^2.$$
Proof. For $\beta$-smooth $f$, we can use Definition 17.12 to get
$$f(x_{t+1}) \le f(x_t) - \eta\,\|\nabla f(x_t)\|^2 + \frac{\beta}{2}\eta^2\,\|\nabla f(x_t)\|^2.$$
$$f(x_{t+1}) - f(x_t) \le -\frac{1}{2\beta}\,\|\nabla f(x_t)\|^2. \tag{17.17}$$
2β
$$\begin{aligned}
f(x_t) - f(x^*) &\le \langle \nabla f(x_t), x_t - x^* \rangle - \frac{\alpha}{2}\|x_t - x^*\|^2 \\
&\le \|\nabla f(x_t)\|\,\|x_t - x^*\| - \frac{\alpha}{2}\|x_t - x^*\|^2 \\
&\le \frac{1}{2\alpha}\,\|\nabla f(x_t)\|^2,
\end{aligned} \tag{17.18}$$
17.6.1 Subgradients
What if the convex function f is not differentiable? Staring at the
proofs above, all we need is the following:
17.6.3 Acceleration
setting the gradient of the function to zero; this gives us the expression
$$\eta \cdot \nabla f_t(x_t) + (x_{t+1} - x_t) = 0 \implies x_{t+1} = x_t - \eta \cdot \nabla f_t(x_t),$$
$$D_h(y \| x) = \tfrac{1}{2}\|y - x\|^2,$$
Again, setting the gradient at xt+1 to zero (i.e., the optimality condi-
tion for xt+1 ) now gives:
or, rephrasing:
$$\nabla h(x_{t+1}) = \nabla h(x_t) - \eta\, \nabla f_t(x_t) \tag{18.3}$$
$$\implies x_{t+1} = (\nabla h)^{-1}\big(\nabla h(x_t) - \eta\, \nabla f_t(x_t)\big) \tag{18.4}$$
1. When $h(x) = \frac{1}{2}\|x\|^2$, the gradient is $\nabla h(x) = x$. So we get
$$x_{t+1} = x_t - \eta\, \nabla f_t(x_t),$$
The name of the process comes from thinking of the dual space as be-
ing a mirror image of the primal space. But how do we choose these mir-
ror maps? Again, this comes down to understanding the geometry
of the problem, the kinds of functions and the set K we care about,
and the kinds of guarantees we want. To discuss these, let us first look at the notion of norms in some more detail.
for x ∈ Rn .
∥∇ f ( x )∥∗ ≤ G.
$\nabla h : \mathbb{R}^n \to \mathbb{R}^n$.
(iii) Map $\theta_{t+1}$ back to the primal space: $x'_{t+1} \leftarrow (\nabla h)^{-1}(\theta_{t+1})$.

(iv) Project $x'_{t+1}$ back into the feasible region $K$ by using the Bregman divergence: $x_{t+1} \leftarrow \arg\min_{x \in K} D_h(x \| x'_{t+1})$. In case $x'_{t+1} \in K$, e.g., in the unconstrained case, we get $x_{t+1} = x'_{t+1}$.
Note that the choice of h affects almost every step of this algorithm.
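To see how strongly the choice of $h$ matters, here is a sketch (ours, with hypothetical names) for the case where $K$ is the probability simplex and $h(x) = \sum_i x_i \ln x_i$ is the negative entropy: the dual step exponentiates, the Bregman projection renormalizes, and we recover the multiplicative-weights update.

```python
import numpy as np

def mirror_descent_simplex(grads, x1, eta):
    """Mirror descent with h(x) = sum_i x_i ln x_i on the simplex.
    grads is a list of gradient vectors, one per round."""
    x = np.asarray(x1, dtype=float)
    iterates = [x]
    for g in grads:
        w = x * np.exp(-eta * g)  # steps (i)-(iii): to the dual space and back
        x = w / w.sum()           # step (iv): Bregman projection = renormalize
        iterates.append(x)
    return iterates
```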
$$\frac{KL(x^* \| x_1)}{\eta} + \frac{\eta T}{2/\ln 2} \le \frac{\ln n}{\eta} + \eta T.$$
The last inequality uses that the KL divergence to the uniform distri-
bution on n items is at most ln n. (Exercise!) In fact, if we start with
a distribution x1 that is closer to x ∗ , the first term of the regret gets
smaller.
$$\Phi_t = \frac{D_h(x^* \| x_t)}{\eta}$$
$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*) \le \Phi_1 - \Phi_{T+1} + \sum_{t=1}^{T} \mathrm{blah}_t \le \Phi_1 + \sum_{t=1}^{T} \mathrm{blah}_t = \frac{D_h(x^* \| x_1)}{\eta} + \sum_{t=1}^{T} \mathrm{blah}_t.$$
The last inequality above uses that the Bregman divergence is always non-negative for convex functions. To complete the proof, it remains to show that $\mathrm{blah}_t$ in inequality (18.7) can be made $\frac{\eta}{2\alpha}\|\nabla f_t(x_t)\|_*^2$. Let
$$\Phi_{t+1} - \Phi_t = \frac{1}{\eta}\underbrace{\Big(D_h(x^* \| x_{t+1}) - D_h(x^* \| x_t)\Big)}_{(\star)};$$
where the latter inequality used the update rule (18.5) for mirror de-
scent, and the Cauchy-Schwarz inequality Corollary 18.4 for general
norms. Now using the AM-GM inequality shows that
$$(\dagger) \le \frac{1}{2\alpha}\,\|\eta\, \nabla f_t(x_t)\|_*^2.$$
Finally, remembering that the change in potential is given by $\frac{1}{\eta}(\star)$ finishes the proof of Lemma 18.8.
$$D_h(x^* \| x_{t+1}) \le D_h(x^* \| x'_{t+1})$$
$$x_{t+1} = x_t - \eta\, H_h(x_t)^{-1}\, \nabla f(x_t).$$
Some of you may have seen Newton’s method for minimizing convex
functions, which has the following update rule:
x t +1 = x t − η H f ( x t ) −1 ∇ f ( x t ).
This means mirror descent replaces the Hessian of the function itself
by the Hessian of a strongly convex function h. Newton’s method has
very strong convergence properties (it gets error ε in O(log log 1/ε)
iterations!) but is not “robust”—it is only guaranteed to converge
when the starting point is “close” to the minimizer. We can view
mirror descent as trading off the convergence time for robustness. Fill
in more on this view.
$$f(\hat{x}) \le \min_{x \in K} f(x) + \varepsilon.$$
1. Access to both a gradient oracle and a value oracle for the function $f$, and
1. By Grünbaum’s theorem,
$$E_{t+1} \leftarrow E_t \cap H$$
$$\frac{\operatorname{vol}(E_{t+1})}{\operatorname{vol}(E_t)} \le e^{-\frac{1}{2(n+1)}} \approx 1 - O(1/n).$$
3. Finally, after $T = 2n(n+1)\ln(R/r)$ rounds either we have not seen any point in $K$—in which case we say "$K$ is empty"—or else we output
$$\hat{x} \leftarrow \arg\min\{\, f(c_t) \mid c_t \in K_t,\ t \in \{1, \dots, T\} \,\}.$$
Now adapting the analysis from the previous sections gives us the
following result (assuming exact arithmetic again):
$$f(\hat{x}) - f(x^*) \le \frac{O(GR)}{r}\,\exp\Big(-\frac{T}{2n(n+1)}\Big).$$
Note the similarity to Theorem 19.2, as well as the differences: the
exponential term is slower by a factor of 2(n + 1). This is because
the volume of the successive ellipsoids shrinks much slower than
in Grünbaum’s lemma. Also, we lose a factor of R/r because K is
potentially smaller than the starting body by precisely this factor.
(Again, this presentation ignores precision issues, and assumes we
can do exact real arithmetic.)
The Ellipsoid algorithm is usually attributed to Naum Shor (N. Z. Šor and N. G. Žurbenko (1971)); the fact that this algorithm gives a polynomial-time algorithm for linear programming was a breakthrough result due to Khachiyan (L.G. Khachiyan (1979)), and was front-page news at the time. A great source of information about this algorithm is the Grötschel-Lovász-Schrijver book (M. Grötschel, L. Lovász, and A. Schrijver (1988)). A historical perspective appears in this survey by Bland, Goldfarb, and Todd.
min{c⊺ x | x ∈ K }
L(Ball(0, 1)) = { Lx : x⊺ x ≤ 1}
= { y : ( L −1 y )⊺ ( L −1 y ) ≤ 1 }
= {y : y⊺ ( LL⊺ )−1 y ≤ 1}
= { y : y ⊺ Q −1 y ≤ 1 } ,
$$\{y + c : y^\intercal Q^{-1} y \le 1\} = \{y : (y - c)^\intercal Q^{-1} (y - c) \le 1\}.$$
$$c_{t+1} := c_t - \frac{1}{n+1}\, h$$
and
$$Q_{t+1} = \frac{n^2}{n^2 - 1}\Big(Q_t - \frac{2}{n+1}\, hh^\intercal\Big),$$
where $h = Q_t a_t / \sqrt{a_t^\intercal Q_t a_t}$.
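A small sketch of this update step (our illustration; it assumes `a` is the normal of the separating halfspace through the current center):

```python
import numpy as np

def ellipsoid_update(c, Q, a):
    """One central-cut update of the ellipsoid {x : (x-c)^T Q^{-1} (x-c) <= 1},
    keeping the half containing {x : a^T x <= a^T c}."""
    n = len(c)
    h = Q @ a / np.sqrt(a @ Q @ a)            # the vector h defined above
    c_new = c - h / (n + 1)                   # shift the center into the halfspace
    Q_new = (n**2 / (n**2 - 1.0)) * (Q - (2.0 / (n + 1)) * np.outer(h, h))
    return c_new, Q_new
```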
$$\frac{(1 - c_1)^2}{a^2} \le 1 \qquad\text{and}\qquad \frac{c_1^2}{a^2} + \frac{1}{b^2} \le 1.$$
and moreover the ratio of the volume of the ellipsoid to that of the ball is
$$a\, b^{n-1} = (1 - c_1)\cdot\Big(\frac{(1 - c_1)^2}{1 - 2c_1}\Big)^{(n-1)/2}.$$
This is minimized by setting $c_1 = \frac{1}{n+1}$, which gives us
$$\frac{\operatorname{vol}(\mathcal{E})}{\operatorname{vol}(\mathrm{Ball}(0,1))} = \cdots \le e^{-\frac{1}{2(n+1)}}.$$
vol(Ball(0, 1))
For a more detailed description and proof of this process, see these notes from our LP/SDP course.
In fact, we can view this as the question of finding the minimum-volume ellipsoid that contains the half-ball $K$: this is a convex program, and looking at the optimality conditions for it gives us the same construction as above (without having to make the assumptions of symmetry).
Simplex: This is perhaps the first algorithm for solving LPs that most
of us see. It was also the first general-purpose linear program
solver known, having been developed by George Dantzig in 1947. G.B. Dantzig (1990)
This is a local-search algorithm: it maintains a vertex of the poly-
hedron K, and at each step it moves to a neighboring vertex with-
out decreasing the objective function value, until it reaches an op-
timal vertex. (The convexity of K ensures that such a sequence of
steps is possible.) The strategy to choose the next vertex is called
the pivot rule. Unfortunately, for most known pivot rules, there are examples on which following the pivot rule takes an exponential (or at least super-polynomial) number of steps. Despite
that, it is often used in practice: e.g., the Excel software contains an
implementation of simplex.
x≥0
We will only sketch the high-level idea behind Step 1 (finding the
starting solution), and will skip Step 2 (the rounding); our focus will
20.1.1 The Primal and Dual LPs, and the Duality Gap
Recall the primal linear program:
( P) min c⊺ x
Ax = b
x ≥ 0,
( D ) max b⊺ y
A⊺ y ≤ c.
( D ′ ) max b⊺ y
A⊺ y + s = c
s ≥ 0.
We assume that both the primal $(P)$ and dual $(D)$ are strictly feasible: i.e., they have solutions even if we replace the inequalities with strict ones. Then we can prove the following result, which relates the optimizer for $f_\eta$ to feasible primal and dual solutions:
$$\begin{aligned}
Ax - b &= 0 &\quad& (20.1)\\
A^\intercal y + s &= c && (20.2)\\
\forall i \in [n]:\ s_i x_i &= \eta && (20.3)
\end{aligned}$$
The conditions (20.1) and (20.2) show that $x$ and $(y, s)$ are feasible for the primal $(P)$ and dual $(D')$ respectively. The condition (20.3) is an analog of the usual complementary slackness result that arises when $\eta = 0$. To prove this lemma, we use the method of Lagrange multipliers. (Observe: we get that if there exists an optimum $x^*$, then $x^*$ satisfies these conditions.)

Theorem 20.2 (The Method of Lagrange Multipliers). Let functions $f$ and $g_1, \cdots, g_m$ be continuously differentiable, and defined on some open subset of $\mathbb{R}^n$. If $x^*$ is a local optimum of the following optimization problem
$$\min f(x) \quad \text{s.t.}\quad \forall i \in [m]: g_i(x) = 0,$$
then there exists $y^* \in \mathbb{R}^m$ such that $\nabla f(x^*) = \sum_{i=1}^{m} y_i^*\, \nabla g_i(x^*)$.
The first step uses that if there are strictly feasible primal and dual solutions $(\hat{x}, \hat{y}, \hat{s})$, then the region $\{x \mid Ax = b,\ f_\eta(x) \le f_\eta(\hat{x})\}$ is bounded (and clearly closed), and hence the continuous function $f_\eta(x)$ achieves its minimum at some point $x^*$ inside this region, by the Extreme Value theorem. (See Lemma 7.2.1 of Matoušek and Gärtner, say.)
For the second step, we use the functions $f_\eta(x)$ and $g_i(x) = a_i^\intercal x - b_i$ in Theorem 20.2 to get the existence of $y^* \in \mathbb{R}^m$ such that:
$$\nabla f_\eta(x^*) = \sum_{i=1}^{m} y_i^* \cdot \nabla(a_i^\intercal x^* - b_i) \iff c - \eta\cdot\big(1/x_1^*, \cdots, 1/x_n^*\big)^\intercal = \sum_{i=1}^{m} y_i^*\, a_i.$$
By weak duality, the optimal value of the linear program lies be-
tween the values of any feasible primal and dual solution, so the
duality gap c⊺ x − b⊺ y bounds the suboptimality c⊺ x − OPT of our
current solution. Lemma 20.1 allows us to relate the duality gap to η
as follows.
c⊺ x − b⊺ y = c⊺ x − ( Ax )⊺ y = x⊺ c − x⊺ (c − s) = x⊺ s = n · η.
$$\begin{aligned}
Ax &= b &\quad& (20.4)\\
A^\intercal y + s &= c && (20.5)\\
\sum_{i=1}^{n} \big(s_i x_i - \eta_t\big)^2 &\le (\eta_t/4)^2. && (20.6)
\end{aligned}$$
The first two are again feasibility conditions for ( P) and ( D ′ ). The
third condition is new, and is an approximate version of (20.3). Sup-
pose that
$$\eta' := \eta \cdot \Big(1 - \frac{1}{4\sqrt{n}}\Big).$$
$$\begin{aligned}
A(\Delta x) &= 0 \\
A^\intercal(\Delta y) + (\Delta s) &= 0 \\
s_i(\Delta x_i) + (\Delta s_i)x_i + (\Delta s_i)(\Delta x_i) &= \eta' - x_i s_i.
\end{aligned}$$
Note the quadratic term $(\Delta s_i)(\Delta x_i)$. Since we are aiming for an approximation anyways, and these increments are meant to be tiny, we drop the quadratic term to get a system of linear equations in these increments:
$$\begin{aligned}
A(\Delta x) &= 0 \\
A^\intercal(\Delta y) + (\Delta s) &= 0 \\
s_i(\Delta x_i) + (\Delta s_i)x_i &= \eta' - x_i s_i.
\end{aligned}$$
putting down just so that you recognize it the next time you see it):
$$\begin{pmatrix} A & 0 & 0 \\ 0 & A^\intercal & I \\ \operatorname{diag}(s) & 0 & \operatorname{diag}(x) \end{pmatrix}
\begin{pmatrix} \Delta x \\ \Delta y \\ \Delta s \end{pmatrix} =
\begin{pmatrix} 0 \\ 0 \\ \eta'\mathbf{1} - x \circ s \end{pmatrix}.$$
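For concreteness, here is a minimal sketch (our illustration, not the notes' implementation) of one increment: assemble the block system above, solve it, and advance the strictly feasible iterate.

```python
import numpy as np

def ipm_step(A, x, y, s, eta_new):
    """Solve the linearized system for (dx, dy, ds) and take the full step.
    Assumes Ax = b, A^T y + s = c, and x, s > 0 entrywise."""
    m, n = A.shape
    K = np.zeros((m + 2 * n, m + 2 * n))
    K[:m, :n] = A                        # A dx = 0
    K[m:m + n, n:n + m] = A.T            # A^T dy + ds = 0
    K[m:m + n, n + m:] = np.eye(n)
    K[m + n:, :n] = np.diag(s)           # s_i dx_i + x_i ds_i = eta' - x_i s_i
    K[m + n:, n + m:] = np.diag(x)
    rhs = np.concatenate([np.zeros(m + n), eta_new - x * s])
    sol = np.linalg.solve(K, rhs)
    dx, dy, ds = sol[:n], sol[n:n + m], sol[n + m:]
    return x + dx, y + dy, s + ds
```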
Proof. The last set of equalities in the linear system ensures that $s_i(\Delta x_i) + (\Delta s_i)x_i = \eta' - x_i s_i$ for each $i$, so we get
$$\begin{aligned}
\langle x', s' \rangle &= \langle x + \Delta x,\ s + \Delta s \rangle \\
&= \sum_i \big(s_i x_i + s_i(\Delta x_i) + (\Delta s_i)x_i\big) + \langle \Delta x, \Delta s \rangle \\
&= n\eta' + \langle \Delta x, -A^\intercal(\Delta y) \rangle \\
&= n\eta' - \langle A(\Delta x), \Delta y \rangle \\
&= n\eta',
\end{aligned}$$
Proof. As in the proof of Lemma 20.3, we get that $s_i' x_i' - \eta' = (\Delta s_i)(\Delta x_i)$, so it suffices to show that
$$\sqrt{\sum_{i=1}^{n} (\Delta s_i)^2 (\Delta x_i)^2} \le \eta'/4.$$
where we set $a_i^2 = \frac{x_i(\Delta s_i)^2}{s_i}$ and $b_i^2 = \frac{s_i(\Delta x_i)^2}{x_i}$, and use that $a_i b_i \le \frac{1}{4}(a_i + b_i)^2$. Hence
$$\begin{aligned}
\sqrt{\sum_{i=1}^{n} (\Delta s_i \Delta x_i)^2} &\le \frac{1}{4}\sum_{i=1}^{n}\Big(\frac{x_i}{s_i}(\Delta s_i)^2 + \frac{s_i}{x_i}(\Delta x_i)^2 + 2(\Delta s_i)(\Delta x_i)\Big) \\
&= \frac{1}{4}\sum_{i=1}^{n} \frac{(x_i \Delta s_i)^2 + (s_i \Delta x_i)^2}{s_i x_i} \qquad [\text{since } (\Delta s)^\intercal \Delta x = 0 \text{ by Claim 20.3}] \\
&\le \frac{1}{4}\,\frac{\sum_{i=1}^{n} (x_i \Delta s_i + s_i \Delta x_i)^2}{\min_{i \in [n]} s_i x_i} \\
&= \frac{1}{4}\,\frac{\sum_{i=1}^{n} (\eta' - s_i x_i)^2}{\min_{i \in [n]} s_i x_i}.
\end{aligned} \tag{20.8}$$
$$\sum_{i=1}^{n} (\eta' - s_i x_i)^2 = \sum_{i=1}^{n} \big((1-\delta)\eta - s_i x_i\big)^2 = \sum_{i=1}^{n} (\eta - s_i x_i)^2 + \sum_{i=1}^{n} (\delta\eta)^2 - 2\delta\eta \sum_{i=1}^{n} (\eta - s_i x_i).$$
The last sum vanishes, since $\sum_i s_i x_i = n\eta$ by Lemma 20.3. Thus
$$\sum_{i=1}^{n} (\eta' - s_i x_i)^2 \le (\eta/4)^2 + n\Big(\frac{\eta}{4\sqrt{n}}\Big)^2 = \eta^2/8.$$
which are analogs of Lemmas 20.3 and 20.4 respectively. The latter
inequality means that |si′′ xi′′ − η ′′ | ≤ η ′′ /4 for each coordinate i, else
that coordinate itself would violate inequality (20.9). Specifically,
this means that neither xi′′ nor si′′ ever becomes zero for any value of
α ∈ [0, 1]. Now since ( xi′′ , si′′ ) is a linear interpolation between ( xi , si )
and ( xi′ , si′ ), and the former were strictly positive, the latter cannot be
non-positive.
20.3.2 An Example
Given an $n$-bit integer $a \in \mathbb{Z}$, suppose we want to compute its reciprocal $1/a$ without using divisions. This reciprocal is a zero of the expression
$$g(x) = 1/x - a.$$
Hence, the Newton-Raphson method says, we can start with $x_1 = 1$, say, and then use (20.10) to get
$$x_{t+1} \leftarrow x_t - \frac{(1/x_t - a)}{(-1/x_t^2)} = x_t + x_t(1 - a x_t) = 2x_t - a x_t^2.$$
If we define $\varepsilon_t := 1 - a x_t$, then $\varepsilon_{t+1} = 1 - a(2x_t - a x_t^2) = (1 - a x_t)^2 = \varepsilon_t^2$, so the error squares at each step.

(This method for computing reciprocals appears in the classic book of Aho, Hopcroft, Ullman, without any elaboration—it always mystified me until I realized the connection to the Newton-Raphson method. I guess they expected their readers to be familiar with these connections, since computer science used to have closer connections to numerical analysis in the 1970s.)
$$x_{t+1} \leftarrow x_t - \frac{f'(x_t)}{f''(x_t)}. \tag{20.11}$$
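A quick sketch of the division-free reciprocal iteration from above (illustrative only); note the starting point must satisfy $|1 - a x_0| < 1$ for the error-squaring to take hold:

```python
def reciprocal(a, x0, iters=10):
    """Newton-Raphson for g(x) = 1/x - a: the update x <- 2x - a*x^2
    squares the error eps_t = 1 - a*x_t at every step."""
    x = x0
    for _ in range(iters):
        x = 2 * x - a * x * x
    return x

print(reciprocal(7.0, x0=0.1))  # converges to 1/7 = 0.142857...
```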
20.4 Self-Concordance
defined to be the worst-case ratio between the cost of the algorithm's solution and that of the optimal solution (so that $\mathrm{Alg} \le \rho \cdot \mathrm{Opt}$ on every instance):
$$\rho = \rho_{\mathcal{A}} := \max_{I} \frac{c(\mathrm{Alg}(I))}{c(\mathrm{Opt}(I))},$$
However, there are problems that do not fall into any of these clean categories, such as Asymmetric $k$-Center, for which there exists an $O(\log^* n)$-approximation algorithm, and this is best possible unless P = NP. Or Group Steiner Tree, where the approximation ratio is $O(\log^2 n)$ on trees, and this is also best possible.
This shows that $\mathrm{Alg}(I) \le \alpha\, \mathrm{Opt}(I)$, which leaves us with the question of how to construct the surrogate. Sometimes we use the combinatorial properties of the problem to get a surrogate, and at other times we use a linear programming relaxation.
The greedy algorithm does not achieve a better ratio than Ω(log n):
one example is given by the figure to the right. The optimal sets are
the two rows, whereas the greedy algorithm may break ties poorly
and pick the set covering the left half, and then half the remainder,
etc. A more sophisticated example can show a matching gap of ln n.
(Figure 24.2: a tight example for the greedy algorithm.)

Proof of Theorem 24.1. Suppose Opt picks $k$ sets from $\mathcal{S}$. Let $n_i$ be the number of elements yet uncovered when the algorithm has picked $i$ sets; then $n_0 = n = |U|$. Since the $k$ sets in Opt cover all the elements of $U$, they also cover the $n_i$ yet-uncovered elements. By averaging, there must exist a set in $\mathcal{S}$ that covers $n_i/k$ of the yet-uncovered elements. Hence
$$n_{i+1} \le n_i - n_i/k = n_i(1 - 1/k).$$
(As always, we use $1 + x \le e^x$; here the inequality is strict whenever $x \ne 0$.)
Moreover, for the weighted case, the greedy algorithm changes to picking the set $S \in \mathcal{S}$ that maximizes the number of yet-uncovered elements covered per unit cost. (Here $H_B := 1 + \frac{1}{2} + \cdots + \frac{1}{B}$ is the $B$th Harmonic number.)
One can give an analysis somewhat like the one above for this weighted case as well: let $k$ now be the total cost of sets in the optimal set cover. After $i$ sets have been picked, the remaining $n_i$ elements can still be covered using a collection of cost $k$, so there must be a set whose cost-to-fresh-coverage ratio is at most $k/n_i$. If it covers $n_i - n_{i+1}$ previously uncovered elements, then we know that its cost must be at most
$$(n_i - n_{i+1}) \cdot k/n_i.$$
Summing over the $\ell$ iterations, the total cost is at most
$$\sum_{i=1}^{\ell} (n_i - n_{i+1}) \cdot k/n_i \le k\big(1/n + 1/(n-1) + \cdots + 1/2 + 1\big) = k \cdot H_n.$$
The second algorithm for Set Cover uses the popular relax-and-
round framework. The steps of this process are as follows:
1. Write an integer linear program for the problem. This will also be
NP-hard to solve, naturally.
2. Relax the integrality constraints (here $x_S \in \{0,1\}$) to get a linear program; this can be solved efficiently.

3. Now solve the linear program, and round the fractional variables to integer values, while ensuring that the cost of this integer solution is not much higher than the LP value.
Let’s see this in action: here is the integer linear program (ILP) that
precisely models Set Cover:
$$\begin{aligned}
\min \quad & \sum_{S \in \mathcal{S}} c_S x_S &\quad& \text{(ILP-SC)} \\
\text{s.t.} \quad & \sum_{S: e \in S} x_S \ge 1 && \forall e \in U \\
& x_S \in \{0, 1\} && \forall S \in \mathcal{S}.
\end{aligned}$$
$$\begin{aligned}
\min \quad & \sum_{S \in \mathcal{S}} c_S x_S &\quad& \text{(LP-SC)} \\
\text{s.t.} \quad & \sum_{S: e \in S} x_S \ge 1 && \forall e \in U \\
& x_S \ge 0 && \forall S \in \mathcal{S}.
\end{aligned}$$
If LP( I ) is the optimal value for the linear program, then we get:
LP( I ) ≤ Opt( I ).
2. Phase 2: For each element e yet uncovered, pick any set covering it.
1. First-Fit: add the item to the earliest opened bin where it fits.
Exercise: if all the items were of size at most ε, then each bin (ex-
cept the last one) would have at least 1 − ε total size, thereby giving
an approximation of
$$\frac{1}{1-\varepsilon}\,\mathrm{Opt}(I) + 1 \approx (1+\varepsilon)\,\mathrm{Opt}(I) + 1.$$
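A minimal sketch of First-Fit (our illustration; item sizes are assumed to lie in $(0, 1]$ and bins have capacity 1):

```python
def first_fit(sizes):
    """Place each item into the earliest opened bin where it fits."""
    bins = []  # bins[i] = total size currently packed into bin i
    for s in sizes:
        for i in range(len(bins)):
            if bins[i] + s <= 1.0:
                bins[i] += s
                break
        else:
            bins.append(s)  # nothing fits: open a new bin
    return len(bins)

print(first_fit([0.5, 0.7, 0.5, 0.2, 0.4, 0.2]))  # packs into 3 bins
```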
• Define the new size si ′ for each item i to be the size of the largest
element in i’s group.
There are $D$ distinct item sizes, and all sizes are only increased, so it remains to show a packing for the items in $I'$ that uses at most $\mathrm{Opt}(I) + \lceil n/D \rceil$ bins. Indeed, suppose $\mathrm{Opt}(I)$ assigns item $i$ to some bin $b$; then we assign item $(i + \lceil n/D \rceil)$ to bin $b$. Since the new size of item $i + \lceil n/D \rceil$ is at most the original size of item $i$, this allocates all the items except those in the first group, without violating the bin capacities. Now we assign each item in the first group to a new bin, thereby opening up $\lceil n/D \rceil$ more bins.
this result for the case where s1 ≥ ε.) Let C be the collection of all
configurations.
We now use an integer LP due to Paul Gilmore and Ralph Gomory
(from the 1950s). It has one variable xC for every configuration C ∈ C
that denotes the number of bins with configuration C in the solution.
The LP is:
$$\begin{aligned}
\min \quad & \sum_{C \in \mathcal{C}} x_C, \\
\text{s.t.} \quad & \sum_{C} A_{Cs}\, x_C \ge n_s \qquad \forall\ \text{sizes } s \\
& x_C \in \mathbb{N}.
\end{aligned}$$
Here ACs is the number of items of type s being placed in the config-
uration C, and $n_s$ is the total number of items of size $s$ in the instance.
This is an exact formulation, and relaxing the integrality constraint to
xC ≥ 0 gives us an LP that we can solve in time poly( N, n). This is
polynomial time when $N$ is a constant. We use the optimal value of this LP as our surrogate. (In fact, we show in a homework problem that the LP can be solved in time polynomial in $n$ even when $N$ is not a constant.)

How do we round the optimal solution for this LP? There are only
D non-trivial constraints in the LP, and N non-negativity constraints.
So if we pick an optimal vertex solution, it must have some N of
these constraints at equality. This means at least N − D of these tight
constraints come from the latter set, and therefore N − D variables
are set to zero. In other words, at most D of the variables are non-
zero. Rounding these variables up to the closest integer, we get a
solution that uses at most LP( I ) + D ≤ Opt( I ) + D bins. Since D is a
constant, we have approximated the solution up to a constant.
Opt( I ) + ⌈n/D ⌉ + D
bins. Now if we could ensure that n/D were at most ε Opt( I ), when
D was f (ε), we would be done. Indeed, if all the items have size at
least ε, the total volume (and therefore Opt( I )) is at least εn. If we
now set D = 1/ε2 , then n/D ≤ ε2 n ≤ ε Opt( I ), and the number of
bins is at most
$$(1+\varepsilon)\,\mathrm{Opt}(I) + \lceil 1/\varepsilon^2 \rceil.$$
What if some of the items are smaller than ε? We now use the
observation that First-Fit behaves very well when the item sizes
are small. Indeed, we first hold back all the items smaller than ε,
and solve the remaining instance as above. Then we add in the small
items using First-Fit: if it does not open any new bins, we are
fine. And if adding these small items results in opening some new
bin, then each of the existing bins—and all the newly opened bins
(except the last one)—must have at least (1 − ε) total size in them.
The number of bins is then at most
$$\frac{1}{1-\varepsilon}\,\mathrm{Opt}(I) + 1 \approx (1 + O(\varepsilon))\,\mathrm{Opt}(I) + 1,$$
as long as ε ≤ 1/2.
Just like the use of linear programming was a major advance in the design of approximation algorithms, specifically in the use of linear programs in the relax-and-round framework, another significant advance was the use of semidefinite programs in the same framework. For instance, the approximation guarantee for the Max-Cut problem was improved from 1/2 to 0.878 using this technique. Moreover, subsequent results have shown that any improvement to this approximation guarantee in polynomial time would disprove the Unique Games Conjecture.
a. x⊺ Ax ≥ 0 for all x ∈ Rn .
Lemma 25.2. Let $A \succeq 0$. If $A_{i,i} = 0$ then $A_{j,i} = A_{i,j} = 0$ for all $j$. (We will write $A \succeq 0$ to denote that $A$ is PSD; more generally, we write $A \succeq B$ if $A - B$ is PSD: this partial order on symmetric matrices is called the Löwner order.)

Proof. Let $j \ne i$. The determinant of the submatrix indexed by $\{i, j\}$ is
$$A_{i,i} A_{j,j} - A_{i,j} A_{j,i} = -A_{i,j}^2,$$
which is non-negative since every principal minor of a PSD matrix is non-negative; hence $A_{i,j} = 0$.
We can think of this as being the usual vector inner product treating $A$ and $B$ as vectors of length $n \times n$. Note that by the cyclic property of the trace, $A \bullet xx^\intercal = \operatorname{Tr}(A xx^\intercal) = \operatorname{Tr}(x^\intercal A x) = x^\intercal A x$; we will use this fact to derive yet another characterization of PSD matrices.
$$A \bullet X = \sum_i \lambda_i\, (A \bullet x_i x_i^\intercal) = \sum_i \lambda_i\, x_i^\intercal A x_i \ge 0.$$
$$\begin{aligned}
\max_{v_1, \dots, v_n \in \mathbb{R}^n} \quad & \sum_{i,j} c_{ij}\, \langle v_i, v_j \rangle \\
\text{subject to} \quad & \sum_{i,j} a_{ij}^{(k)}\, \langle v_i, v_j \rangle \le b_k \qquad \forall k \in [m].
\end{aligned}$$
$$\begin{aligned}
\max_{X \in \mathbb{R}^{n \times n}} \quad & A \bullet X \\
\text{subject to} \quad & I \bullet X = 1 \tag{25.1}\\
& X \succeq 0
\end{aligned}$$
Proof. Let X maximize SDP (25.1) (this exists as the objective is con-
tinuous and the feasible set is compact). Consider the spectral de-
composition X = ∑in=1 λi xi xi⊺ where λi ≥ 0 and ∥ xi ∥2 = 1. The
trace constraint I • X = 1 implies ∑i λi = 1. Thus the objective value
A • X = ∑i λi xi⊺ Axi is a convex combination of xi⊺ Axi . Hence without
loss of generality, we can put all the weight into one of these terms,
in which case X = yy⊺ is a rank-one matrix with ∥y∥2 = 1. By the
Courant-Fischer theorem, OPT ≤ max∥y∥2 =1 y⊺ Ay = λmax .
Here is another SDP for the same problem:
$$\begin{aligned}
\min \quad & t \\
\text{subject to} \quad & tI - A \succeq 0.
\end{aligned} \tag{25.2}$$
(In fact, it turns out that this SDP is dual to the one in (25.1). Weak duality still holds for this case, but strong duality does not hold in general for SDPs. Indeed, there could be a duality gap in some cases, where both the primal and dual are finite, but the optimal solutions are not equal to each other. However, under some mild regularity conditions (e.g., the Slater conditions) we can show strong duality. More about SDP duality here.)

Lemma 25.7. SDP (25.2) computes the maximum eigenvalue of $A$.

Proof. The matrix $tI - A$ has eigenvalues $t - \lambda_i$. Hence the constraint $tI - A \succeq 0$ is equivalent to the constraint $t - \lambda \ge 0$ for all its eigenvalues $\lambda$. In other words, $t \ge \lambda_{\max}$, and thus $\mathrm{OPT} = \lambda_{\max}$.
This result shows two things: (a) every graph has a bipartition that cuts half the edges of the graph, so $\mathrm{Opt} \ge |E|/2$; and (b) since $\mathrm{Opt} \le |E|$ on any graph, $\mathrm{Alg} \ge |E|/2 \ge \mathrm{Opt}/2$.

Here's a simple randomized algorithm: place each vertex in either $S$ or $\bar{S}$ independently and uniformly at random. Since each edge is cut with probability 1/2, the expected number of cut edges is $|E|/2$. Moreover, by the probabilistic method, $\mathrm{Opt} \ge |E|/2$. (We cannot hope to prove a better result than Lemma 25.9 in terms of $|E|$, since the complete graph $K_n$ has $\binom{n}{2} \approx n^2/2$ edges and any partition can cut at most $n^2/4$ of them.)
$$\begin{aligned}
\max_{x_1, \dots, x_n \in \mathbb{R}} \quad & \sum_{(i,j) \in E} \frac{(x_i - x_j)^2}{4} \\
\text{subject to} \quad & x_i^2 = 1 \qquad \forall i.
\end{aligned} \tag{25.3}$$
$$\begin{aligned}
\max_{v_1, \dots, v_n \in \mathbb{R}^n} \quad & \sum_{(i,j) \in E} \frac{1 - \langle v_i, v_j \rangle}{2} \\
\text{subject to} \quad & \langle v_i, v_i \rangle = 1 \qquad \forall i.
\end{aligned} \tag{25.5}$$
(Figure 25.1: a geometric picture of Goemans-Williamson randomized rounding.)

Proof. By linearity of expectation, it suffices to bound the probability of an edge $(i,j)$ being cut. Let
$$\theta_{ij} := \cos^{-1}(\langle v_i, v_j \rangle)$$
be the angle between the unit vectors $v_i$ and $v_j$. Now consider the 2-dimensional plane $P$ containing $v_i$, $v_j$ and the origin, and let $\tilde{g}$ be the projection of the Gaussian vector $g$ onto this plane. Observe that the edge $(i,j)$ is cut precisely when the hyperplane defined by $g$ separates $v_i, v_j$. This is precisely when the vector perpendicular to $\tilde{g}$ in the plane $P$ lands between $v_i$ and $v_j$. As the projection of the standard Gaussian onto a subspace is again a standard Gaussian (by spherical symmetry),
$$\Pr[(i,j) \text{ cut}] = \frac{2\theta_{ij}}{2\pi} = \frac{\theta_{ij}}{\pi}.$$
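A sketch of this rounding step (our illustration; `V` is assumed to be an $n \times d$ matrix whose rows are the unit vectors $v_i$, e.g., obtained from a Cholesky factorization of the SDP solution):

```python
import numpy as np

def gw_round(V, edges):
    """Goemans-Williamson rounding: split vertices by the sign of their
    projection onto one random Gaussian direction; return #edges cut."""
    g = np.random.randn(V.shape[1])   # random hyperplane normal
    side = np.sign(V @ g)             # which side each v_i falls on
    return sum(1 for (i, j) in edges if side[i] != side[j])
```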
Proof. Pick any vertex v, recursively color the remaining graph, and
then assign v a color not among the colors of its ∆ neighbors.
$$\begin{aligned}
\text{find} \quad & v_1, \dots, v_n \in \mathbb{R}^n \\
\text{subject to} \quad & \langle v_i, v_j \rangle \le \lambda \qquad \forall (i,j) \in E \tag{25.6}\\
& \langle v_i, v_i \rangle = 1 \qquad \forall i \in V.
\end{aligned}$$
Why is this SDP relevant to our problem? The goal is to have vectors clustered together in groups, such that each cluster represents a color. Intuitively, we want the vectors of adjacent vertices to be far apart, so we want their inner product to be close to $-1$ (recall we are dealing with unit vectors, due to the last constraint), and the vectors of same-colored vertices to be close together.
Proof. Consider the vector placement shown in the figure to the right (Figure 25.3: three unit vectors at pairwise angle $120°$, the optimal distribution for a 3-colorable graph). If the graph is 3-colorable, we can assign all vertices with color 1 the red vector, all vertices with color 2 the blue vector, and all vertices with color 3 the green vector. Now for every edge $(i,j) \in E$, we have that
$$\langle v_i, v_j \rangle = \cos\frac{2\pi}{3} = -\frac{1}{2}.$$
At first sight, it may seem like we are done: if we solve the above SDP with $\lambda = -1/2$, don't all the vectors look like the figure above? No, that would only hold if all of them were co-planar. In $n$ dimensions we can have an exponential number of cones of angle $\frac{2\pi}{3}$, as in the next figure (Figure 25.4: the dimensionality problem of vectors at pairwise angle $2\pi/3$), so we cannot cluster vectors as easily as in the above example.

To solve this issue, we apply a hyperplane rounding technique similar to that from the MaxCut algorithm. Indeed, for some parameter $t$ we will pick later, pick $t$ random hyperplanes. Formally, we pick $g_i \in \mathbb{R}^n$ from a standard $n$-dimensional Gaussian distribution, for $i \in [t]$. Each of these defines a normal hyperplane, and together they split the unit sphere in $\mathbb{R}^n$ into $2^t$ regions (except if two of them point in the same direction, which has zero probability). Now, the vectors $v_i$ that lie in the same region can be considered "close" to each other, and we can try to assign them a unique color. Formally, this means that if $v_i$ and $v_j$ are such that
$$\operatorname{sign}(\langle v_i, g_k \rangle) = \operatorname{sign}(\langle v_j, g_k \rangle)$$
for all $k \in [t]$, then $i$ and $j$ are given the same color. Each region is given a different color, of course.
However, this may color some neighbors with the same color, so we use the method of alterations: while there exists an edge between vertices of the same color, we uncolor both endpoints. When this uncoloring stops, we remove the still-colored vertices from the graph, and then repeat the same procedure on the remaining graph, until we color every vertex. Note that since we use $t$ hyperplanes, we add at most $2^t$ new colors per iteration. The goal is now to show that (a) the number of iterations is small, and (b) the value of $2^t$ is also small.
Lemma 25.16. The expected number of vertices that remain uncolored after a single iteration is at most $n\Delta\, (1/3)^t$.

Proof. Fix an edge $ij$: for a single random hyperplane, the probability that $v_i, v_j$ are not separated by it is
$$\frac{\pi - \theta_{ij}}{\pi} \le \frac{1}{3},$$
so the probability that $v_i$ and $v_j$ receive the same color is at most $(1/3)^t$. There are $n$ vertices, and each vertex has degree at most $\Delta$, which proves the result.
Typically we do not know the future requests that the CPU will make, so it is sensible to model this as an online problem. (If the entire sequence of requests is known, show that Belády's rule is optimal: evict the page in cache that is next requested furthest in the future.) We let $U$ be a universe of $n$ items or pages. The cache is a memory containing at most $k$ pages. The requests are pages $\sigma_i \in U$, and the online algorithm is an eviction policy. Now we return to defining the performance of an online algorithm.
$$\max_{\sigma} \frac{\mathrm{Alg}(\sigma)}{\mathrm{Opt}(\sigma)}$$
$$\mathbb{E}[\mathrm{Alg}(\sigma)] \le \alpha \cdot \mathrm{Opt}(\sigma).$$
Lemma 26.1. The competitive ratio of algorithm AlgB is 2 − 1/B and this is
the best possible ratio for any deterministic algorithm.
Proof. There are two cases to consider: $j < B$ and $j \ge B$. For the
first case, AlgB ( Ij ) = j and Opt( Ij ) = j, so AlgB ( Ij )/ Opt( Ij ) =
1. In the second case, AlgB ( Ij ) = 2B − 1 and Opt( Ij ) = B, so
AlgB ( Ij )/ Opt( Ij ) = 2 − 1/B. Thus the competitive ratio of AlgB
is
$$\max_{I_j} \frac{\mathrm{Alg}_B(I_j)}{\mathrm{Opt}(I_j)} = 2 - \frac{1}{B}.$$
Now we show that this is the best possible competitive ratio for any deterministic algorithm. Consider algorithm $\mathrm{Alg}_i$: we find an instance $I_j$ such that $\mathrm{Alg}_i(I_j)/\mathrm{Opt}(I_j) \ge 2 - 1/B$. If $i \ge B$ then we take $j = B$, so that $\mathrm{Alg}_i(I_j) = i - 1 + B$ and $\mathrm{Opt}(I_j) = B$, giving
$$\frac{\mathrm{Alg}_i(I_j)}{\mathrm{Opt}(I_j)} = \frac{i - 1 + B}{B} = \frac{i}{B} + 1 - \frac{1}{B} \ge 2 - \frac{1}{B}.$$
Otherwise $i < B$, and we take $j = i$, so that $\mathrm{Alg}_i(I_j) = i - 1 + B$ and $\mathrm{Opt}(I_j) = i$, giving
$$\frac{\mathrm{Alg}_i(I_j)}{\mathrm{Opt}(I_j)} = \frac{i - 1 + B}{i} \ge 2 \ge 2 - \frac{1}{B}.$$
In either case,
$$\frac{\mathrm{Alg}_i(I_j)}{\mathrm{Opt}(I_j)} \ge 2 - \frac{1}{B}.$$
The table below shows $\mathrm{Alg}_i(I_j)$ over $\mathrm{Opt}(I_j)$ for $B = 4$:

       I1    I2    I3    I∞
Alg1  4/1   4/2   4/3   4/4
Alg2  1/1   5/2   5/3   5/4
Alg3  1/1   2/2   6/3   6/4
Alg4  1/1   2/2   3/3   7/4
$$\begin{aligned}
4p_1 + p_2 + p_3 + p_4 &\le c \\
\frac{4p_1 + 5p_2 + 2p_3 + 2p_4}{2} &\le c \\
\frac{4p_1 + 5p_2 + 6p_3 + 3p_4}{3} &\le c \\
\frac{4p_1 + 5p_2 + 6p_3 + 7p_4}{4} &\le c \\
p_1 + p_2 + p_3 + p_4 &= 1
\end{aligned}$$
$$\begin{aligned}
3p_1 &= c - 1 \\
4p_2 + p_3 + p_4 &= c \\
4p_3 + p_4 &= c \\
4p_4 &= c.
\end{aligned}$$
$$c = \frac{1}{1 - (1 - 1/4)^4} \qquad\text{and}\qquad p_i = (3/4)^{4-i}\,\frac{c}{4}.$$
$$c = c_B = \frac{1}{1 - (1 - 1/B)^B} \le \frac{e}{e - 1}.$$
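A quick numeric check of this formula (illustrative):

```python
def ski_rental_ratio(B):
    """Competitive ratio c_B = 1 / (1 - (1 - 1/B)^B) of the randomized strategy."""
    return 1.0 / (1.0 - (1.0 - 1.0 / B) ** B)

for B in [1, 2, 4, 10, 100]:
    print(B, ski_rental_ratio(B))  # increases toward e/(e-1) ~ 1.582
```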
$$f'(\ell) - f(\ell) = 0.$$
But this solves to $f(\ell) = Ce^{\ell}$ for some constant $C$. Since $f$ is a probability density function, $\int_{\ell=0}^{1} f(\ell)\, d\ell = 1$, we get $C = \frac{1}{e-1}$. Substituting into (26.1), we get that the competitive ratio is $c = \frac{e}{e-1}$, as desired.
Proof. We break up the proof into an upper bound on Alg’s cost and
a lower bound on Opt’s cost. Before doing this we set up some no-
tation. For the ith phase, let Si be the set of pages in the algorithm’s
cache at the beginning of the phase. Now define
$$\Delta_i = |S_{i+1} \setminus S_i|.$$
$$\sum_{s=0}^{k-1} \frac{c}{k-s} \le \Delta_i \sum_{s=0}^{k-1} \frac{1}{k-s} = \Delta_i H_k.$$
$$\Delta_i H_k + \Delta_i = \Delta_i (H_k + 1).$$
Now we claim that $\mathrm{Opt} \ge \frac{1}{2}\sum_i \Delta_i$. Let $S_i^*$ be the pages in Opt's cache
at the beginning of phase i. Let ϕi be the number of pages in Si but
not in Opt’s cache at the beginning of phase i, i.e., ϕi = |Si \ Si∗ |.
Now let Opti be the cost that Opt incurs in phase i. We have that
Opti ≥ ∆i − ϕi since this is the number of “clean” requests that Opt
sees. Moreover, consider the end of phase i. Alg has the k most recent
requests in cache, but Opt does not have ϕi+1 of them by definition of
ϕi+1 . Hence Opti ≥ ϕi+1 . Now by averaging,
$$\mathrm{Opt}_i \ge \max\{\phi_{i+1},\ \Delta_i - \phi_i\} \ge \frac{1}{2}\big(\phi_{i+1} + \Delta_i - \phi_i\big).$$
So summing over all phases we have
$$\mathrm{Opt} \ge \frac{1}{2}\Big(\sum_i \Delta_i + \phi_{final} - \phi_{initial}\Big) \ge \frac{1}{2}\sum_i \Delta_i,$$
since $\phi_{final} \ge 0$ and $\phi_{initial} = 0$. Combining the upper and lower
bound yields
$$\mathbb{E}[\mathrm{Alg}] \le 2(H_k + 1)\,\mathrm{Opt} = O(\log k)\,\mathrm{Opt}.$$
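For reference, here is a sketch of the randomized marking algorithm to which this phase-based analysis applies (our reading of the eviction policy being analyzed): on a miss, evict a uniformly random unmarked page; when every cached page is marked, start a new phase by clearing all marks.

```python
import random

def marking(requests, k):
    """A sketch of the randomized marking algorithm; returns #faults."""
    cache, marked = set(), set()
    faults = 0
    for p in requests:
        if p not in cache:
            faults += 1
            if len(cache) == k:              # cache full: must evict
                if not (cache - marked):     # all pages marked: new phase
                    marked = set()
                victim = random.choice(list(cache - marked))
                cache.remove(victim)
            cache.add(p)
        marked.add(p)                        # every requested page gets marked
    return faults
```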
It can also be shown that no randomized algorithm can do better
than Ω(log k )-competitive for the paging problem. For some intuition
as to why this might be true, consider the coupon collector problem: if
you repeatedly sample a uniformly random number from {1, . . . , k +
1} with replacement, show that the expected number of samples to
see all k + 1 coupons is Hk+1 .