
Algorithms

Prof. Mikhail Kapralov — EPFL

Notes by Joachim Favre

Computer science bachelor — Semester 3


Autumn 2022
I made this document for my own use, but I thought that typed notes might be of interest to others. So, I shared it (with you, if you are reading this!), since it did not cost me anything. I just ask you to keep in mind that there are mistakes; it is impossible not to make any. If you find some, please feel free to share them with me (grammatical and vocabulary errors are of course also welcome). You can contact me at the following e-mail address:

[email protected]

If you did not get this document through my GitHub repository, then you may be interested to know that I have one on which I publish my typed notes. Here is the link (take a look at the “Releases” section to find the compiled documents):

https://github.com/JoachimFavre/EPFLNotesIN

Please note that the content does not belong to me. I have made some structural changes, reworded some
parts, and added some personal notes; but the wording and explanations come mainly from the Professor,
and from the book on which they based their course.
I think it is worth mentioning that, in order to get these notes typed up, I took my notes in LaTeX during the course, and then made some corrections. I do not think typing up handwritten notes afterwards would be doable in terms of the amount of work. To take notes in LaTeX, I took my inspiration from the following link, written by Gilles Castel. If you want more details, feel free to contact me at my e-mail address, mentioned hereinabove.

https://castel.dev/post/lecture-notes-1/

I would also like to specify that the words “trivial” and “simple” do not have, in this course, the definition
you find in a dictionary. We are at EPFL, nothing we do is trivial. Something trivial is something that
a random person in the street would be able to do. In our context, understand these words more as
“simpler than the rest”. Also, it is okay if you take a while to understand something that is said to be
trivial (especially as I love using this word everywhere hihi).
Since you are reading this, I will give you a little advice. Sleep is a much more powerful tool than you
may imagine, so never neglect a good night of sleep in favour of studying (especially the night before the
exam). I will also take the liberty of paraphrasing my high school philosophy teacher, Ms. Marques: I hope you will have fun during your exams!

Version 2023–04–12
To Gilles Castel, whose work inspired this note-taking method.
Rest in peace; nobody deserves to go so young.
Contents

1 Summary by lecture 11

2 Introduction 15
2.1 Definitions and example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Recall: Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Sorting algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Insertion sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Divide and conquer 21


3.1 Merge sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Fast multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4 Great data structures yield great algorithms 31


4.1 Heap sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Priority queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Stack and queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Linked list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 Binary search trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5 Dynamic programming 43
5.1 Introduction and Fibonacci . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Application: Rod cutting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Application: Change-making problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Application: Matrix-chain multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5 Application: Longest common subsequence . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.6 Application: Optimal binary search tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

6 Graphs 53
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Primitives for traversing and searching a graph . . . . . . . . . . . . . . . . . . . . . . . . 54
6.3 Topological sort of graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.4 Strongly connected components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.5 Flow networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.6 Disjoint sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.7 Minimum spanning tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.8 Single-source shortest paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.9 All-pairs shortest paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

7 Probabilistic analysis 85
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.2 Hash functions and tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

8 Back to sorting 91
8.1 Quick sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.2 Sorting lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

List of lectures

Lecture 1 : I’m rambling a bit — Friday 23rd September 2022 . . . . . . . . . . . . . . . . . . 15

Lecture 2 : Teile und herrsche — Monday 26th September 2022 . . . . . . . . . . . . . . . . . 18

Lecture 3 : Trees which grow in the wrong direction — Friday 30th September 2022 . . . . 22

Lecture 4 : Master theorem — Monday 3rd October 2022 . . . . . . . . . . . . . . . . . . . . . 24

Lecture 5 : Fast matrix multiplication — Friday 7th October 2022 . . . . . . . . . . . . . . . 26

Lecture 6 : Heap sort — Monday 10th October 2022 . . . . . . . . . . . . . . . . . . . . . . . . 31

Lecture 7 : Queues, stacks and linked list — Friday 14th October 2022 . . . . . . . . . . . . 34

Lecture 8 : More trees growing in the wrong direction — Monday 17th October 2022 . . . 38

Lecture 9 : Dynamic cannot be a pejorative word — Friday 21st October 2022 . . . . . . . 40

Lecture 10 : “There are 3 types of mathematicians: the ones who can count, and the
ones who cannot” (Prof. Kapralov) (what do you mean by “this title is too long”?)
— Monday 24th October 2022 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Lecture 11 : LCS but not LoL’s one — Friday 28th October 2022 . . . . . . . . . . . . . . . . 48

Lecture 12 : More binary search trees — Monday 31st October 2022 . . . . . . . . . . . . . . 50

Lecture 13 : An empty course. — Friday 4th November 2022 . . . . . . . . . . . . . . . . . . . 52

Lecture 14 : I love XKCD — Monday 7th November 2022 . . . . . . . . . . . . . . . . . . . . . 53

Lecture 15 : I definitely really like this date — Monday 14th November 2022 . . . . . . . . . 60

Lecture 16 : This date is really nice too, though — Friday 18th November 2022 . . . . . . . 63

Lecture 17 : The algorithm may stop, or may not — Monday 21st November 2022 . . . . . 67

Lecture 18 : Either Levi or Mikasa made this function — Friday 25th November 2022 . . . 70

Lecture 19 : Finding the optimal MST — Monday 28th November 2022 . . . . . . . . . . . . 74

Lecture 20 : I like the structure of maths courses — Friday 2nd December 2022 . . . . . . . 77

Lecture 24 : Doing fun stuff with matrices (really) — Friday 16th December 2022 . . . . . 80

Lecture 21 : Stochastic probabilistic randomness — Monday 5th December 2022 . . . . . . 85


Lecture 22 : Hachis Parmentier — Friday 9th December 2022 . . . . . . . . . . . . . . . . . . 86

Lecture 23 : Quantum bogosort is a comparison sort in Θ(n) — Monday 12th December 2022 89

Chapter 1

Summary by lecture

Lecture 1 : I’m rambling a bit — Friday 23rd September 2022 p. 15

• Definition of algorithm and instance.


• Recall on asymptotics (I ramble a bit on this subject, but I find it very interesting).
• Definition of the sorting problem, and of loop invariants.
• Explanation of the insertion sort algorithm.

Lecture 2 : Teile und herrsche — Monday 26th September 2022 p. 18

• Proof that insertion sort works, and analysis of its complexity.


• Definition of divide-and-conquer algorithms.
• Explanation of merge sort, and analysis of its complexity.

Lecture 3 : Trees which grow in the wrong direction — Friday 30th September 2022 p. 22

• Proof of correctness of Merge-Sort.


• Analysis of the complexity of merge-sort, through the substitution method and through trees.

Lecture 4 : Master theorem — Monday 3rd October 2022 p. 24

• Explanation of the master theorem.


• Explanation of how to count the number of inversions in an array.

Lecture 5 : Fast matrix multiplication — Friday 7th October 2022 p. 26

• Explanation of a solution to the maximum subarray problem.


• Explanation of a divide-and-conquer algorithm for number multiplication.
• Explanation of Strassen’s algorithm, a divide-and-conquer algorithm for matrix multiplication.

Lecture 6 : Heap sort — Monday 10th October 2022 p. 31

• Definition of max-heap.
• Explanation on how to store a heap.
• Explanation of the MaxHeapify procedure.
• Explanation on how to make a heap out of a random array.
• Explanation on how to use heaps to make heapsort.


Lecture 7 : Queues, stacks and linked list — Friday 14th October 2022 p. 34

• Explanation on how to implement a priority queue through a heap.


• Explanation on how to implement a stack.
• Explanation on how to implement a queue.
• Explanation on how to implement a linked list.

Lecture 8 : More trees growing in the wrong direction — Monday 17th October 2022 p. 38

• Definition of binary search trees.


• Explanation on how to search, find the extrema, find the successor and predecessor of a given element,
how to print and how to insert an element in a binary search tree.

Lecture 9 : Dynamic cannot be a pejorative word — Friday 21st October 2022 p. 40

• Explanation on how to delete a node from a binary search tree.


• Explanation of top-down with memoisation and bottom-up algorithms in Dynamic Programming,
through the example of the Fibonacci numbers.

Lecture 10 : “There are 3 types of mathematicians: the ones who can count, and the ones
who cannot” (Prof. Kapralov) (what do you mean by “this title is too long”?) — Monday
24th October 2022 p. 45

• Application of dynamic programming to the rod-cutting, change-making and matrix-multiplication


problems.

Lecture 11 : LCS but not LoL’s one — Friday 28th October 2022 p. 48

• Application of dynamic programming to the longest common subsequence problem.

Lecture 12 : More binary search trees — Monday 31st October 2022 p. 50

• Explanation on how to use dynamic programming in order to find the optimal binary search tree given
a sorted sequence and a list of probabilities.

Lecture 13 : An empty course. — Friday 4th November 2022 p. 52

• No really, we only did some revisions for the midterm.

Lecture 14 : I love XKCD — Monday 7th November 2022 p. 53

• Definition of directed and undirected graphs, and explanation on how to store them in memory.
• Explanation of BFS.
• Explanation of DFS, and of the depth-first forest and edge classification it implies.
• Explanation of the parenthesis theorem.
• Explanation of the white-path theorem.
• Definition of directed acyclic graphs.
• Proof that a DAG is acyclic if and only if it does not have any back edge.
• Definition of topological sort, and explanation of an algorithm to compute it.


Lecture 15 : I definitely really like this date — Monday 14th November 2022 p. 60

• Definition of SCCs, and proof of their existence and uniqueness.


• Definition of component graphs.
• Explanation of Kosaraju's algorithm for finding component graphs.

Lecture 16 : This date is really nice too, though — Friday 18th November 2022 p. 63

• Definition of flow network and flow.


• Definition of residual capacity and residual networks.
• Explanation of the Ford-Fulkerson greedy method for finding the maximum flow in a flow network.
• Definition of the cut of a flow network, and its flow and capacity.

Lecture 17 : The algorithm may stop, or may not — Monday 21st November 2022 p. 67

• Explanation and proof of the max-flow min-cut theorem.


• Complexity analysis of the Ford-Fulkerson method.
• Application of the Ford-Fulkerson method to the Bipartite matching problem and the Edge-disjoint
paths problem.

Lecture 18 : Either Levi or Mikasa made this function — Friday 25th November 2022 p. 70

• There exists no other Ackerman in the world, and when they wrote the term “Inverse Ackermann
function”, they definitely made a mistake while writing the word “Ackerman”.
• Definition of the disjoint-set data structures.
• Explanation of how to implement a disjoint-set data structure through linked lists.
• Explanation of how to implement a disjoint-set data structure through a forest of trees.
• Definition of spanning trees.
• Explanation of the minimum spanning tree problem.

Lecture 19 : Finding the optimal MST — Monday 28th November 2022 p. 74

• Explanation and proof of Prim’s algorithm for finding a MST.


• Explanation and proof of Kruskal’s algorithm for finding a MST.
• Definition of the shortest path problem.

Lecture 20 : I like the structure of maths courses — Friday 2nd December 2022 p. 77

• Explanation of the Bellman-Ford algorithm for finding shortest paths and detecting negative cycles.
• Proof of optimality of the Bellman-Ford algorithm.
• Explanation of Dijkstra’s algorithm for finding a shortest path in a weighted graph, and proof of its
optimality.

Lecture 24 : Doing fun stuff with matrices (really) — Friday 16th December 2022 p. 80

• Applying dynamic programming to solve the all-pairs shortest paths problem.


• Translating our dynamic programming algorithm into matrix form, in order to use fast exponentiation.
• Explanation of Floyd-Warshall’s algorithm for solving the all-pairs shortest paths problem.
• Explanation of Johnson’s algorithm for solving the all-pairs shortest paths problem.


Lecture 21 : Stochastic probabilistic randomness — Monday 5th December 2022 p. 85

• Introduction to probabilistic analysis.


• Definition of indicator random variables.

Lecture 22 : Hachis Parmentier — Friday 9th December 2022 p. 86

• Explanation of the birthday paradox, and proof of the birthday lemma.


• Definition of hash function.
• Explanation of hash tables.
• Proof of an upper bound on the runtime complexity of unsuccessful search in hash tables.

Lecture 23 : Quantum bogosort is a comparison sort in Θ(n) — Monday 12th December 2022 p. 89

• Proof of an upper bound on the runtime complexity of successful search in hash tables.
• Explanation of quicksort.
• Proof and analysis of naive quick sort.
• Analysis of randomised quick sort.
• Proof of the Ω(n log(n)) lower bound for comparison sorts.

Friday 23rd September 2022 — Lecture 1 : I’m rambling a bit

Chapter 2

Introduction

2.1 Definitions and example


Definition: Algorithm   An algorithm is any well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output. An algorithm is thus a sequence of computational steps that transform the input into the output.

Definition: Instance   Given a problem, an instance is a set of precise inputs.

Remark   Note that for a problem that expects a number n as an input, “a positive integer” is not an instance, whereas 232 would be one.

Example: Arithmetic series   Let's say that, given n, we want to compute the sum ∑_{i=1}^{n} i. There are multiple ways to do so.
Naive algorithm   The first algorithm that could come to mind is to compute the sum:
    ans = 0
    for i = 1, 2, ..., n
        ans = ans + i
    return ans

This algorithm is very space efficient, since it only stores 2 numbers.


However, it has a time-complexity of Θ(n) elementary operations,
which is not very great.


Clever algorithm   A much better way is to simply use the arithmetic partial series formula, yielding:
    return n*(n + 1)/2
This algorithm is both very efficient in space and in time. This shows that the first algorithm we think of is not necessarily the best, and that sometimes we can really improve it.
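
Python sketch   To make this concrete, here is a possible Python version of both algorithms (not from the course; the function names are illustrative):

    def arithmetic_series_naive(n):
        # Sum the integers 1..n one by one: Theta(n) elementary operations.
        ans = 0
        for i in range(1, n + 1):
            ans = ans + i
        return ans

    def arithmetic_series_formula(n):
        # Use the closed-form formula: Theta(1).
        return n * (n + 1) // 2

    assert arithmetic_series_naive(1000) == arithmetic_series_formula(1000)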

2.2 Recall: Asymptotics


Complexity analysis   We want to analyse algorithm complexities and, to do so, we need a model. We will consider that any primitive operation (basically any line in pseudocode) takes a constant amount of time. Different lines may take a different time, but this time does not depend on the size of the input.
When we only have primitive operations, we basically only need to compute the number of times each line is executed, and then look at how this behaves asymptotically.
We will mainly consider worst-case behaviour, since it gives a guaranteed upper bound and, for some algorithms, the worst case occurs often. Also, the average case is often as bad as the worst case.
Remark   When comparing asymptotic behaviour, we have to be careful about the fact that it is asymptotic. In other words, some algorithms which behave less well when n is very large might be better when n is very small.
As a personal remark, this makes me think about galactic algorithms. These are algorithms which would be better for some tasks on very large inputs, but those inputs would have to be so large that the algorithms are never used in practice (for instance, numbers having more bits than the number of atoms in the universe). My favourite example is an algorithm which does a 1729-dimensional Fourier transform to multiply two numbers.

Personal note: Definitions   We say that f(x) ∈ O(g(x)), or more informally f(x) = O(g(x)), read “f is big-O of g”, if there exists an M ∈ R+ and an x0 ∈ R such that:
    |f(x)| ≤ M·|g(x)|,   ∀x ≥ x0
This leads to many other definitions:
• We say that f(x) ∈ Ω(g(x)) when g(x) ∈ O(f(x)).
• We say that f(x) ∈ Θ(g(x)) when f(x) ∈ O(g(x)) and f(x) ∈ Ω(g(x)). Functions belonging to Θ(g(x)) represent an equivalence class.
• We say that f(x) ∈ o(g(x)) when f(x) ∈ O(g(x)) but f(x) ∉ Θ(g(x)).
• We say that f(x) ∈ ω(g(x)) when f(x) ∈ Ω(g(x)) but f(x) ∉ Θ(g(x)).

Personal note: Intuition   We can have the following intuition:
• f(x) ∈ O(g(x)) means that f grows slower than (or as fast as) g when x → ∞.
• f (x) ∈ Ω(g(x)) means that f grows faster than (or as fast as) g when x → ∞.
• f (x) ∈ Θ(g(x)) means that f grows exactly as fast as g when x → ∞.
• f (x) ∈ o(g(x)) means that f grows strictly slower than g when x → ∞.
• f (x) ∈ ω(g(x)) means that f grows strictly faster than g when x → ∞.
The following theorem can also help the intuition.

Personal note: Theorem   Let f and g be two functions, such that the following limit exists or diverges:
    lim_{x→∞} |f(x)| / |g(x)| = ℓ ∈ R ∪ {∞}
We can draw the following conclusions, depending on the value of ℓ:
• If ℓ = 0, then f(x) ∈ o(g(x)).


• If ℓ = ∞, then f(x) ∈ ω(g(x)).
• If ℓ ∈ R*, then f(x) ∈ Θ(g(x)).

Proof   We will only prove the third point; the other two are left as exercises to the reader.
First, we can see that ℓ > 0, since ℓ ≠ 0 by hypothesis and since |f(x)| / |g(x)| ≥ 0 for all x.
We can apply the definition of the limit. Since it is valid for all ε > 0, we know that, in particular, it is true for ε = ℓ/2 > 0. Thus, by definition of the limit, we know that for ε = ℓ/2 > 0, there exists an x0 ∈ R such that, for all x ≥ x0, we have:
    | |f(x)| / |g(x)| − ℓ | ≤ ε = ℓ/2
    ⇐⇒ −ℓ/2 ≤ |f(x)| / |g(x)| − ℓ ≤ ℓ/2
    ⇐⇒ (ℓ/2)·|g(x)| ≤ |f(x)| ≤ (3ℓ/2)·|g(x)|
since |g(x)| > 0.
Since (ℓ/2)·|g(x)| ≤ |f(x)| for x ≥ x0, we get that f ∈ Ω(g(x)). Also, since |f(x)| ≤ (3ℓ/2)·|g(x)| for x ≥ x0, we get that f ∈ O(g(x)).
We can indeed conclude that f ∈ Θ(g(x)).

Example   Let a ∈ R and b ∈ R+. Let us compute the following ratio:
    lim_{n→∞} |(n + a)^b| / |n^b| = lim_{n→∞} (1 + a/n)^b = 1
which allows us to conclude that (n + a)^b ∈ Θ(n^b).

Side note: Link with series   You can go read my Analyse 1 notes on my GitHub (in French) if you want more information, but there is an interesting link with series we can do here.
You can convince yourself that if a_n ∈ Θ(b_n), then ∑_{n=1}^{∞} |a_n| and ∑_{n=1}^{∞} |b_n| have the same nature. Indeed, this hypothesis yields that there exist C1, C2 ∈ R+ and an n0 ∈ N such that, for all n ≥ n0:
    0 ≤ C1·|b_n| ≤ |a_n| ≤ C2·|b_n|   =⇒   C1·∑_{n=1}^{∞} |b_n| ≤ ∑_{n=1}^{∞} |a_n| ≤ C2·∑_{n=1}^{∞} |b_n|
by the comparison criterion.
Also, we know very well the convergence of the series ∑_{n=1}^{∞} 1/n^p (which converges for all p > 1, and diverges otherwise). Using Taylor series, this allows us to know very easily the convergence of a series. For instance, let us consider ∑_{n=1}^{∞} (cos(1/n) − 1). We can compute the following limit:
    lim_{n→∞} (cos(1/n) − 1) / (1/n^p) = lim_{n→∞} n^p·(1 − (1/2)(1/n)² + ε(1/n⁴) − 1)
                                       = lim_{n→∞} (−n^p/(2n²) + n^p·ε(1/n⁴))
                                       = −1/2
for p = 2. In other words, our series has the same nature as ∑_{n=1}^{∞} 1/n², which converges. This allows us to conclude that ∑_{n=1}^{∞} (cos(1/n) − 1) converges absolutely.


You can see how powerful those tools are.

2.3 Sorting algorithms


Definition: The sorting problem   For the sorting problem, we take a sequence of n numbers (a1, . . . , an) as input, and we want to output a reordering (a′1, . . . , a′n) of those numbers such that a′1 ≤ . . . ≤ a′n.

Example Given the input (5, 2, 4, 6, 1, 3), a correct output is (1, 2, 3, 4, 5, 6).

Personal note: Remark   It is important to have the same numbers at the start and the end. Otherwise, it would allow algorithms such as the Stalin sort (remove all elements which are not in order, leading to a complexity of Θ(n)), or the Nagasaki sort (clearing the list, leading to a complexity of Θ(1)). They are more jokes than real algorithms; here is where I found the Nagasaki sort:
https://www.reddit.com/r/ProgrammerHumor/comments/o5w3eo

Definition: In place algorithm   An algorithm solving the sorting problem is said to be in place when the numbers are rearranged within the array (with at most a constant number of variables outside the array at any time).

Loop invariant   We will see algorithms which we will need to prove correct. To do so, one of the methods is to use a loop invariant: something that stays true at every iteration of a loop. The idea is very similar to induction.
To use a loop invariant, we need to do three steps. In the initialization, we show
that the invariant is true prior to the first iteration of the loop. In the maintenance,
we show that, if the invariant is true before an iteration, then it remains true before
the next iteration. Finally, in the termination, we use the invariant when the loop
terminates to show that our algorithm works.

2.4 Insertion sort


Insertion sort   The idea of insertion sort is to iteratively sort the sequence: we iteratively insert elements at the right place.
This algorithm can be formulated as:
    for j = 2 to n:
        key = a[j]
        // Insert a[j] into the sorted sequence a[1...(j-1)].
        i = j - 1
        while i > 0 and a[i] > key
            a[i + 1] = a[i]
            i = i - 1
        a[i+1] = key

We can see that this algorithm is in place.
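
Python sketch   A possible runnable Python translation of the pseudocode above (not from the course; Python lists are 0-indexed, which shifts the indices by one):

    def insertion_sort(a):
        # Sort the list a in place, exactly like the pseudocode above.
        for j in range(1, len(a)):          # j = 2 to n in 1-indexed notation
            key = a[j]
            # Insert a[j] into the sorted prefix a[0..j-1].
            i = j - 1
            while i >= 0 and a[i] > key:
                a[i + 1] = a[i]
                i = i - 1
            a[i + 1] = key

    values = [5, 2, 4, 6, 1, 3]
    insertion_sort(values)
    print(values)  # [1, 2, 3, 4, 5, 6]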

Monday 26th September 2022 — Lecture 2 : Teile und herrsche

Proof Let us prove that insertion sort works by using a loop invariant.
We take as an invariant that at the start of each iteration of the outer for loop,
the subarray a[1…(j-1)] consists of the elements originally in a[1…(j-1)] but in
sorted order.
1. Before the first iteration of the loop, we have j = 2. Thus, the subarray consists only of a[1], which is trivially sorted.
2. We assume the invariants holds at the beginning of an iteration j = k. The
body of our inner while loop works by moving the elements a[k-1], a[k-2],


and so on one step to the right, until it finds the proper position for a[k],
at which point it inserts the value of a[k]. Thus, at the end of the loop,
the subarray a[1…k] consists of the elements originally in a[1…k] in a sorted
order.
3. The loop terminates when j = n + 1. Thus, the loop invariant implies that a[1…n] contains the original elements in sorted order.

Complexity analysis   We can see that the first line is executed n times, and the lines which do not belong to the inner loop are executed n − 1 times (the first line of a loop is executed one time more than its body, since we need to do a last comparison before knowing we can exit the loop). We only need to compute how many times the inner loop is executed at every iteration.
In the best case, the array is already sorted, meaning that the inner loop is never entered. This leads to a T(n) = Θ(n) complexity, where T(n) is the number of operations required by the algorithm.
In the worst case, the array is sorted in reverse order, meaning that the first line of the inner loop is executed j times. Thus, our complexity is given by:
    T(n) = Θ( ∑_{j=2}^{n} j ) = Θ( n(n + 1)/2 − 1 ) = Θ(n²)
As mentioned in the previous course, we mainly have to keep in mind the worst-case scenario.

Chapter 3

Divide and conquer

3.1 Merge sort


Divide-and-conquer   We will use a powerful algorithmic approach: recursively divide the problem into smaller subproblems.
We first divide the problem into a subproblems of size n/b that are smaller instances of the same problem. We then conquer the subproblems by solving them recursively (and if the subproblems are small enough, let's say of size less than c for some constant c, we can just solve them by brute force). Finally, we combine the subproblem solutions to give a solution to the original problem.
This gives us the following cost function:
    T(n) = Θ(1),                        if n ≤ c
    T(n) = a·T(n/b) + D(n) + C(n),      otherwise
where D(n) is the time to divide and C(n) the time to combine solutions.

Merge sort   Merge sort is a divide-and-conquer algorithm:
    MergeSort(A, p, r):
        // p is the beginning index and r is the end index; they represent
        // the section of the array we try to sort.
        if p < r:                      // base case
            q = floor((p + r)/2)       // divide
            MergeSort(A, p, q)         // conquer
            MergeSort(A, q+1, r)       // conquer
            Merge(A, p, q, r)          // combine
For it to be efficient, we need to have an efficient merge procedure. Note that merging two sorted subarrays is rather easy: if we have two sorted piles of cards and we want to merge them, we only have to iteratively take the smallest card between the two piles; there cannot be any smaller card that would come later, since the piles are sorted. This gives us the following algorithm:
    Merge(A, p, q, r):
        // p is the beginning index of the first subarray, q is its end index
        // (the second subarray starts at q+1), and r is the end of the second subarray.
        n1 = q - p + 1    // number of elements in first subarray
        n2 = r - q        // number of elements in second subarray
        let L[1...(n1+1)] and R[1...(n2+1)] be new arrays
        for i = 1 to n1:
            L[i] = A[p + i - 1]
        for j = 1 to n2:
            R[j] = A[q + j]
        L[n1 + 1] = infinity
        R[n2 + 1] = infinity
        // Merge the two created subarrays.
        i = 1
        j = 1
        for k = p to r:
            // Since both subarrays are sorted, the next element is one of L[i] or R[j].
            if L[i] <= R[j]:
                A[k] = L[i]
                i = i + 1
            else:
                A[k] = R[j]
                j = j + 1

We can see that this algorithm is not in place, making it require more memory than insertion sort.
Remark   The Professor put the following video on the slides, and I like it very much, so here it is (reading the comments, the dancers say “teile und herrsche”, which means “divide and conquer”):
https://www.youtube.com/watch?v=dENca26N6V4
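
Python sketch   A possible runnable Python version of the two procedures above (not from the course; 0-indexed, with inclusive bounds p and r, and math.inf playing the role of the sentinels):

    import math

    def merge(a, p, q, r):
        # Merge the sorted subarrays a[p..q] and a[q+1..r] (inclusive bounds).
        left = a[p:q + 1] + [math.inf]      # sentinel at the end of each pile
        right = a[q + 1:r + 1] + [math.inf]
        i = j = 0
        for k in range(p, r + 1):
            # The next smallest element is the smaller of the two pile tops.
            if left[i] <= right[j]:
                a[k] = left[i]
                i += 1
            else:
                a[k] = right[j]
                j += 1

    def merge_sort(a, p, r):
        if p < r:                    # a single element is already sorted
            q = (p + r) // 2         # divide
            merge_sort(a, p, q)      # conquer
            merge_sort(a, q + 1, r)  # conquer
            merge(a, p, q, r)        # combine

    values = [5, 2, 4, 6, 1, 3]
    merge_sort(values, 0, len(values) - 1)
    print(values)  # [1, 2, 3, 4, 5, 6]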

Friday 30th September 2022 — Lecture 3 : Trees which grow in the wrong direction

Theorem: Correctness of Merge-Sort   Assuming that the implementation of the merge procedure is correct, mergeSort(A, p, r) correctly sorts the numbers in A[p . . . r].

Proof Let’s do a proof by induction on n = r − p.


• When n = 0, we have r = p, and thus A[p . . . r] is trivially sorted.
• We suppose our statement is true for all n ∈ {0, . . . , k − 1} for
some k, and we want to prove it for n = k.
By the inductive hypothesis, both mergeSort(A, p, q) and mergeSort(A, q+1, r) successfully sort the two subarrays. Therefore, a correct merge procedure will successfully sort A[p . . . r] as required.

Complexity analysis   Let's analyse the complexity of merge sort.
Adapting the cost function for divide and conquer, we get:
    T(n) = Θ(1),                  if n = 1
    T(n) = 2T(n/2) + Θ(n),        otherwise
Let's first try to guess the solution of this recurrence. We can set Θ(n) = c·n for some c, leading to:
    T(n) = 2T(n/2) + c·n
         = 2(2T(n/4) + c·n/2) + cn
         = 4T(n/4) + 2cn
         = 4(2T(n/8) + c·n/4) + 2cn
         = 8T(n/8) + 3cn
Thus, it seems that, continuing this enough times, we get:
    T(n) = nT(1) + log2(n)·cn   =⇒   T(n) = Θ(n log(n))
We still need to prove that this is true. We can do this by induction, and this is then named the substitution method.


Proof: Upper bound   We want to show that there exists a constant a > 0 such that T(n) ≤ an·log(n) for all n ≥ 2 (meaning that T(n) = O(n log(n))), by induction on n.
• For any constant n ∈ {2, 3, 4}, T(n) has a constant value; selecting a larger than this value will satisfy the base cases when n ∈ {2, 3, 4}.
• We assume that our statement is true for all n ∈ {2, 3, . . . , k − 1} and we want to prove it for n = k:
    T(n) = 2T(n/2) + cn
         ≤ 2·(a·(n/2)·log(n/2)) + cn        (by the induction hypothesis)
         = an·log(n/2) + cn
         = an·log(n) − an + cn
         ≤ an·log(n)
if we select a ≥ c.
We can thus select a to be a positive constant so that both the base case and the inductive step hold.

Proof: Lower bound   We want to show that there exists a constant b > 0 such that T(n) ≥ bn·log(n) for all n ≥ 1 (meaning that T(n) = Ω(n log(n))), by induction on n.
• For n = 1, T(n) = c and bn·log(n) = 0, so the base case is satisfied for any b.
• We assume that our statement is true for all n ∈ {1, . . . , k − 1} and we want to prove it for n = k:
    T(n) = 2T(n/2) + cn
         ≥ 2·(b·(n/2)·log(n/2)) + cn        (by the induction hypothesis)
         = bn·log(n/2) + cn
         = bn·log(n) − bn + cn
         ≥ bn·log(n)
selecting b ≤ c.
We can thus select b to be a positive constant so that both the base case and the inductive step hold.
base case and the inductive step hold.

Proof: Conclusion   Since T(n) = O(n log(n)) and T(n) = Ω(n log(n)), we have proven that T(n) = Θ(n log(n)).

Remark   The real recurrence relation for merge sort is:
    T(n) = c,                                     if n = 1
    T(n) = T(⌊n/2⌋) + T(⌈n/2⌉) + c·n,             otherwise.
Note that we are allowed to take the same c everywhere since we
are considering the worst-case, and thus we can take the maximum
of the two constants supposed to be there, and call it c.
Anyhow, in our proof, we did not consider floor and ceiling functions.
Indeed, they make calculations really messy but don’t change the
final asymptotic result. Thus, when analysing recurrences, we simply
assume for simplicity that all divisions evaluate to an integer.


Remark   We have to be careful when using asymptotic notations with induction. For instance, if we know that T(n) = 4T(n/4) + n, and we want to prove that T(n) = O(n), then we cannot just do:
    T(n) ≤ 4·c·(n/4) + n = cn + n = n(c + 1) = O(n)        (using the induction hypothesis)
Indeed, we have to clearly state that we want to prove that T(n) ≤ nc, and then prove it for the exact constant c during the inductive step. The proof above is wrong and, in fact, T(n) ≠ O(n).

Other proof: Tree   Another way of guessing the complexity of merge sort, which works for many recurrences, is thinking of the entire recurrence tree. A recurrence tree is a tree (really?) where each node corresponds to the cost of a subproblem. We can thus sum the costs within each level of the tree to obtain a set of per-level costs, and then sum all the per-level costs to determine the total cost of all levels of recursion.
For merge sort, we can draw the following tree:

We can observe that, on any level, the amount of work sums up to cn. Since there are log2(n) levels, we can guess that T(n) = cn·log2(n) = Θ(n log(n)). To prove it formally, we would again need to use induction.

Tree: Other example   Let's do another example of a tree, but for which the substitution method does not work: we take T(n) = T(n/3) + T(2n/3) + cn. The tree looks like:

Again, we notice that every level contributes around cn, and we have at least log_3(n) full levels. Therefore, it seems reasonable to say that a·n·log_3(n) ≤ T(n) ≤ b·n·log_{3/2}(n), and thus T(n) = Θ(n log(n)).

Monday 3rd October 2022 — Lecture 4 : Master theorem


Example   Let's look at the following recurrence:
    T(n) = T(n/4) + T(3n/4) + 1
We want to show it is Θ(n).

Upper bound   Let's prove that there exists a b such that T(n) ≤ bn. We consider the base case to be correct, by choosing b to be large enough.
Let's do the inductive step. We get:
    T(n) = T(n/4) + T(3n/4) + 1 ≤ b·(n/4) + b·(3n/4) + 1 = bn + 1
But we wanted bn, so it proves nothing. We could consider that our guess is wrong, or do another proof.

Upper bound (better)   Let's now instead use the stronger induction hypothesis stating that T(n) ≤ bn − b′. This gives us:
    T(n) = T(n/4) + T(3n/4) + 1
         ≤ (b·(n/4) − b′) + (b·(3n/4) − b′) + 1
         = bn − b′ + (1 − b′)
         ≤ bn − b′
as long as b′ ≥ 1.
Thus, taking b such that the base case works, we have proven that T(n) ≤ bn − b′ ≤ bn, and thus T(n) ∈ O(n). We needed to make our claim stronger for it to work, and this is something that is often needed.
Master theorem   Let a ≥ 1 and b > 1 be constants. Also, let T(n) be a function defined on the nonnegative integers by the following recurrence:
    T(n) = a·T(n/b) + f(n)
Then, T(n) has the following asymptotic bounds:
1. If f(n) = O(n^(log_b(a) − ε)) for some constant ε > 0, then T(n) = Θ(n^(log_b(a))).
2. If f(n) = Θ(n^(log_b(a))), then T(n) = Θ(n^(log_b(a))·log(n)).
3. If f(n) = Ω(n^(log_b(a) + ε)) for some constant ε > 0, and if a·f(n/b) ≤ c·f(n) for some constant c < 1 and all sufficiently large n, then T(n) = Θ(f(n)). Note that the second condition holds for most functions.
Example   Let us consider the case of merge sort, thus T(n) = 2T(n/2) + cn. We get a = b = 2, so log_b(a) = 1 and:
    f(n) = Θ(n¹) = Θ(n^(log_b(a)))
This means that we are in the second case, telling us:
    T(n) = Θ(n^(log_b(a))·log(n)) = Θ(n log(n))

Tree   To learn this theorem, we only need to get the intuition of why it works, and to be able to reconstruct it. To do so, we can draw a tree. The depth of this tree is log_b(n), and there are a^(log_b(n)) = n^(log_b(a)) leaves. If a node does f(n) work, then its a children together do a·f(n/b) work.
1. If f grows slowly, a parent does less work than all its children combined. This means that most of the work is done at the leaves. Thus, the only thing that matters is the number of leaves: n^(log_b(a)).
2. If f grows such that the children of a node together contribute exactly the same as their parent, then every level does the same work. Since we have n^(log_b(a)) leaves which each contribute a constant amount of work, the last level adds up to c·n^(log_b(a)) work, and thus every level adds up to this value. We have log_b(n) levels, meaning that we have a total work of c·n^(log_b(a))·log_b(n).
3. If f grows fast, then a parent does more work than all its children combined. This means that all the work is done at the root and, thus, that all that matters is f(n).

Application Let’s use a modified version of merge sort in order to count the number of inversions
in an array A (an inversion is i < j such that A[j] < A[i], where A never has twice
the same value).
The idea is that we can just add a return value to merge sort: the number of
inversions. In the trivial case n = 0, there is no inversion. For the recursive part,
we can just add the number of inversions of the two sub-cases and the number of
inversions we get from the merge procedure (which is the complicated part).
For the merge procedure, since the two subarrays are sorted, we notice that if the
element we are considering from the first subarray is greater than the one we are
considering of the second subarray, then we need to add (q − i + 1) to our current
count.
Indeed, if A[i] > A[j], then all of those (q − i + 1) numbers A[i . . . q] are greater than A[j].
This solution is Θ(n log(n)), and thus much better than the trivial Θ(n²) double for-loop solution.

Remark   We can notice that there are at most n(n − 1)/2 inversions (in a reverse-sorted array). It seems great that our algorithm manages to count this value with a smaller complexity. This comes from the fact that, sometimes, we add much more than 1 at the same time in the merge procedure.
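
Python sketch   A possible implementation of this modified merge sort (not from the course; it returns a sorted copy together with the number of inversions rather than sorting in place):

    def sort_and_count(a):
        # Returns (sorted copy of a, number of inversions in a).
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_left = sort_and_count(a[:mid])
        right, inv_right = sort_and_count(a[mid:])
        merged = []
        inv_split = 0
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # All remaining elements of left are greater than right[j]:
                # that many new inversions at once.
                inv_split += len(left) - i
                merged.append(right[j])
                j += 1
        merged += left[i:] + right[j:]
        return merged, inv_left + inv_right + inv_split

    print(sort_and_count([2, 4, 1, 3, 5])[1])  # 3 inversions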

Friday 7th October 2022 — Lecture 5 : Fast matrix multiplication

Maximum subarray problem   We have an array of values representing a stock price, and we want to find when we should have bought and when we should have sold (retrospectively, so this is no
investment advice). We want to buy when the cost is as low as possible and sell
when it is as high as possible. Note that we cannot just take the all time minimum
and all time maximum since the maximum could be before the minimum.
Let’s switch our perspective by instead considering the array of changes: the difference
between i and i − 1. We then want to find the largest contiguous subarray that has
the maximum sum; this is named the maximum subarray problem. In other
words, we want to find i < j such that A[i . . . j] has the biggest sum possible. For
instance, for A = [1, −4, 3, −4], we have i = j = 3, and the sum is 3.
The brute-force solution, in which we compute the sums efficiently, has a runtime of Θ(C(n, 2)) = Θ(n²), which is not great.



Let's now instead use a divide-and-conquer method. Only the merge procedure is complicated: we must not miss solutions that cross the midpoint. However, if we know that we want to find a subarray which crosses the midpoint, we can find the best i in the left part up to the midpoint (which takes linear time), and find the best j such that the subarray from the midpoint to j has maximum sum (which also takes linear time).
This means that we get three candidate subarrays: one that is only in the left part, one that crosses the midpoint, and one that is only in the right part. These cover all possible subarrays, and we can just take the best one amongst those three.
We get that the divide step is Θ(1), the conquer step solves two problems, each of size n/2, and the merge step takes linear time. Thus, we have the exact same recurrence relation as for merge sort, giving us a time complexity of Θ(n log(n)).

Remark We will make a Θ(n) algorithm to solve this problem in the third
exercise series.
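
Python sketch   A possible Θ(n log(n)) divide-and-conquer implementation of the maximum subarray problem (not from the course; it only returns the best sum, not the indices i and j):

    def max_crossing_sum(a, lo, mid, hi):
        # Best sum of a subarray that crosses the midpoint: best suffix of the
        # left half plus best prefix of the right half (each in linear time).
        best_left = -float("inf")
        total = 0
        for i in range(mid, lo - 1, -1):
            total += a[i]
            best_left = max(best_left, total)
        best_right = -float("inf")
        total = 0
        for j in range(mid + 1, hi + 1):
            total += a[j]
            best_right = max(best_right, total)
        return best_left + best_right

    def max_subarray(a, lo, hi):
        if lo == hi:
            return a[lo]
        mid = (lo + hi) // 2
        return max(max_subarray(a, lo, mid),          # entirely in the left part
                   max_subarray(a, mid + 1, hi),      # entirely in the right part
                   max_crossing_sum(a, lo, mid, hi))  # crossing the midpoint

    print(max_subarray([1, -4, 3, -4], 0, 3))  # 3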

3.2 Fast multiplication


Problem   We want to multiply two numbers quickly. The regular algorithm seen in primary school is O(n²), but we think that we may be able to go faster.


We are given two integers a, b with n bits each (they are given to us through arrays
of bits), and we want to output a · b. This can be important for cryptography for
instance.
Fast multiplication   We want to use a divide-and-conquer strategy.
Let's say we have an array of values a_0, . . . , a_{n−1} giving us a, and an array of values b_0, . . . , b_{n−1} giving us b (we will use base 10 here, but it works for any base):
    a = ∑_{i=0}^{n−1} a_i·10^i,      b = ∑_{i=0}^{n−1} b_i·10^i
Let's divide our numbers in the middle. We get four numbers a_L, a_H, b_L and b_H, defined as:
    a_L = ∑_{i=0}^{n/2−1} a_i·10^i,    a_H = ∑_{i=n/2}^{n−1} a_i·10^{i−n/2},    b_L = ∑_{i=0}^{n/2−1} b_i·10^i,    b_H = ∑_{i=n/2}^{n−1} b_i·10^{i−n/2}
We can represent this geometrically: the digits of a split into a high half a_H followed by a low half a_L.
We get the following relations:
    a = a_L + 10^{n/2}·a_H,      b = b_L + 10^{n/2}·b_H
Thus, the multiplication is given by:
    ab = (a_L + 10^{n/2}·a_H)(b_L + 10^{n/2}·b_H) = a_L·b_L + 10^{n/2}·(a_H·b_L + b_H·a_L) + 10^n·a_H·b_H
This gives us a recursive algorithm. We compute a_L·b_L, a_H·b_L, a_L·b_H and a_H·b_H recursively. We can then do the corresponding shifts, and finally add everything up.

Complexity of the algorithm   The recurrence of this algorithm is given by:
    T(n) = 4T(n/2) + n
since addition takes linear time. However, this solves to T(n) = Θ(n²) by the master theorem.

Karatsuba algorithm   Karatsuba, a computer scientist, realised that we do not need 4 multiplications. Indeed, let's compute the following value:
    (a_L + a_H)(b_L + b_H) = a_L·b_L + a_H·b_H + a_H·b_L + b_H·a_L
This means that, having computed a_L·b_L and a_H·b_H, we can extract a_H·b_L + b_H·a_L from the product hereinabove by computing:
    (a_L + a_H)(b_L + b_H) − a_L·b_L − a_H·b_H = a_H·b_L + b_H·a_L
Thus, considering what we did before, we this time only need three multiplications: (a_L + a_H)(b_L + b_H), a_L·b_L and a_H·b_H.

Complexity analysis   The recurrence of this algorithm is given by:
    T(n) = 3T(n/2) + n
This solves to T(n) = Θ(n^(log₂(3))), which is better than the primary-school algorithm.
Note that we are cheating a bit on the complexity, since computing (a_L + a_H)(b_L + b_H) costs T(n/2 + 1). However, as mentioned in the last lesson, we don't really care about floor and ceiling functions (nor about this +1).

Remark Note that, in most of the cases, we are working with 64-bit numbers which can be
multiplied in constant time on a 64-bit CPU. The algorithm above is in fact really
useful for huge numbers (in cryptography for instance).
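
Python sketch   A possible recursive Karatsuba implementation on Python integers (not from the course; it splits the numbers by their decimal digits and falls back to the built-in product for small inputs):

    def karatsuba(a, b):
        # Multiply two non-negative integers with 3 recursive multiplications.
        if a < 10 or b < 10:
            return a * b
        half = max(len(str(a)), len(str(b))) // 2
        shift = 10 ** half
        a_high, a_low = divmod(a, shift)
        b_high, b_low = divmod(b, shift)
        low = karatsuba(a_low, b_low)                        # aL * bL
        high = karatsuba(a_high, b_high)                     # aH * bH
        cross = karatsuba(a_low + a_high, b_low + b_high) - low - high
        return high * shift * shift + cross * shift + low

    print(karatsuba(1234, 5678) == 1234 * 5678)  # True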

3.3 Matrix multiplication


Problem We are given two n × n matrices, A = (aij ) and B = (bij ), and we want to output a
n × n matrix C = (cij ) such that C = AB.
Basically, when computing the value of cij , we compute the dot-product of the ith
row of A and the j th column of B.

Example   For instance, for n = 2:
    [ c11  c12 ]   [ a11  a12 ] [ b11  b12 ]   [ a11·b11 + a12·b21   a11·b12 + a12·b22 ]
    [ c21  c22 ] = [ a21  a22 ] [ b21  b22 ] = [ a21·b11 + a22·b21   a21·b12 + a22·b22 ]

Naive algorithm   The naive algorithm is:
    let C be a new n x n matrix
    for i = 1 to n
        for j = 1 to n
            c[i][j] = 0
            for k = 1 to n
                c[i][j] = c[i][j] + a[i][k]*b[k][j]

Complexity   There are three nested for-loops, so we get a runtime of Θ(n³).




Divide and conquer   We can realise that, when multiplying matrices, it is as if we were multiplying submatrices. If we have A and B being two n × n matrices, then we can split them into submatrices and get:
    [ C11  C12 ]   [ A11  A12 ] [ B11  B12 ]
    [ C21  C22 ] = [ A21  A22 ] [ B21  B22 ]


where those elements are matrices.


In other words, we do have that:

C11 = A11 B11 + A12 B21

and similarly for all other elements.

Complexity   Since we are splitting our multiplication into 8 matrix multiplications, each of two (n/2) × (n/2) matrices, we get the following recurrence relation:
    T(n) = 8T(n/2) + n²
since adding two matrices takes O(n²) time.
The master theorem tells us that we have T(n) = Θ(n³), which is no improvement.

Strassen's algorithm   Strassen realised that we only need to perform 7 recursive multiplications of (n/2) × (n/2) matrices rather than 8. This gives us the recurrence:
    T(n) = 7T(n/2) + Θ(n²)
where the Θ(n²) comes from additions, subtractions and copying some matrices. This solves to T(n) = Θ(n^(log₂(7))) by the master theorem, which is better!

Remark   Strassen was the first to beat the Θ(n³) bound, but we now know algorithms with better and better complexity (even though the best ones currently known are galactic algorithms).
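
Python sketch   A possible NumPy-based sketch of Strassen's algorithm for matrices whose size is a power of two (not from the course; the seven products M1, …, M7 are the standard ones, and small matrices fall back to the ordinary product):

    import numpy as np

    def strassen(A, B, cutoff=64):
        # A and B are n x n NumPy arrays with n a power of two.
        n = A.shape[0]
        if n <= cutoff:
            return A @ B  # ordinary product for small sizes
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # The 7 recursive multiplications of (n/2) x (n/2) matrices.
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        # Recombine the blocks using only additions and subtractions.
        C11 = M1 + M4 - M5 + M7
        C12 = M3 + M5
        C21 = M2 + M4
        C22 = M1 - M2 + M3 + M6
        return np.block([[C11, C12], [C21, C22]])

    A = np.random.rand(128, 128)
    B = np.random.rand(128, 128)
    print(np.allclose(strassen(A, B), A @ B))  # True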

Monday 10th October 2022 — Lecture 6 : Heap sort

Chapter 4

Great data structures yield great algorithms

4.1 Heap sort


Nearly-complete binary tree   A binary tree of depth d is nearly complete if the first d − 1 levels are full, and, at level d, if a node is present, then all nodes to its left must also be present.

Terminology The size of a tree is its number of vertices.

Example   For instance, the tree on the left is a nearly-complete binary tree of depth 3, but not the one on the right:

Both binary trees are of size 10.

Heap A heap (or max-heap) is a nearly-complete binary tree such that, for every node i,
the key (value stored at that node) of its children is less than or equal to its key.

Examples   For instance, the nearly-complete binary tree of depth 3 on the left is a max-heap, but not the one on the right:

Observations We notice that the maximum number is necessarily at the root.


Also, any path linking a leaf to the root is an increasing sequence.

Remark 1 We can define the min-heap to be like the max-heap, but the property
each node follows is that the key of its children is greater than or
equal to its key.

Remark 2 We must not confuse heaps and binary-search trees (which we will
define later), which are very similar but have a more restrictive
property.


Height The height of a node is defined to be the number of edges on a longest simple path
from the node down to a leaf.
Example For instance, in the following picture, the node holding 10 has height
1, the node holding 14 has height 2, and the one holding a 2 has
height 0.

Remark We note that, if we have n nodes, we can bound the height h of any
node:
h ≤ log2 (n)
Also, we notice that the height of the root is the largest height of a
node from the tree. This is defined to be the height of the heap.
We notice it is thus Θ(log2 (n)).

Storing a heap We will store a heap in an array, layer by layer. Thus, take the first layer and store
it in the array. Then, we take the next layer, and store it after the first layer. We
continue this way until the end.
Let's consider that we store our numbers in an array with indices starting at 1. The children of a node A[i] are stored in A[2i] and A[2i + 1]. Also, if i > 1, the parent of the node A[i] is A[⌊i/2⌋].
Using this method, we realise that we do not need a pointer to the left and right
elements for each node.
Example For instance, let’s consider again the following tree, but considering
the index of each node:

This would be stored in memory as:

A = [16, 14, 10, 8, 7, 9, 3, 2, 4, 1]

Then, the left child of i = 3 is left(i) = 2 · 3 = 6, and its right child


is right(i) = 2 · 3 + 1 = 7, as expected.

Max heapify To manipulate a heap, we need to max-heapify. Given an i such that the subtrees
of i are heaps (this condition is important), this algorithm ensures that the subtree
rooted at i is a heap (satisfying the heap property). The only violation we could
have is the root of one of the subtrees being larger than the node i.
So, to fix our tree, we compare A[i], A[left(i)] and A[right(i)]. If necessary, we
swap A[i] with the largest of the two children to get the heap property. This could
break the previous sub-heap, so we need to continue this process, comparing and
swapping down the heap, until the subtree rooted at i is a max-heap. We could
write this algorithm in pseudocode as:


    procedure maxHeapify(A, i, n):
        l = left(i)     // 2i
        r = right(i)    // 2i + 1
        if l <= n and A[l] > A[i]    // don't want an overflow, so check l <= n
            largest = l
        else
            largest = i
        if r <= n and A[r] > A[largest]
            largest = r
        if largest != i
            swap(A, i, largest)    // swap A[i] and A[largest]
            maxHeapify(A, largest, n)

Complexity   Asymptotically, we do no more computations than the height of our node i (up to a constant factor), yielding a complexity of O(height(i)). Also, we are working in place, thus we are using Θ(n) space.

Remark   This procedure is the main primitive we have to work with heaps; it is really important.

Building a heap   To make a heap from an unordered array A of length n, we can use the following buildMaxHeap procedure:
    procedure buildMaxHeap(A, n)
        for i = floor(n/2) downto 1
            maxHeapify(A, i, n)
The idea is that the nodes with index strictly larger than ⌊n/2⌋ are leaves (no node after ⌊n/2⌋ can have a left child, since it would be at index 2i, leading to an overflow; so they are all leaves), which are trivial subheaps. Then, we can merge increasingly higher heaps through the maxHeapify procedure. Note that we cannot loop in the other direction (from 1 to ⌊n/2⌋), since maxHeapify does not create a heap when our subtrees are not heaps.

Complexity   We can use the fact that maxHeapify is O(height(i)) to compute the complexity of our new procedure. We can note that there are approximately 2^ℓ nodes at the ℓ-th level (this is not really true for the last level, but it is not important) and, since we have ℓ + h ≈ log₂(n) (the sum of the height and the level of a given node is approximately the height of our tree), we get that we have approximately n(h) = 2^(log₂(n) − h) = n·2^(−h) nodes at height h. This yields:
    T(n) = ∑_{h=0}^{log₂(n)} n(h)·O(h) = ∑_{h=0}^{log₂(n)} (n/2^h)·O(h) = O( n·∑_{h=0}^{log₂(n)} h/2^h )
However, we notice that:
    ∑_{h=0}^{∞} h/2^h = (1/2) / (1 − 1/2)² = 2
Thus, we get that our function is bounded by O(n).

Correctness   To prove the correctness of this algorithm, we can use a loop invariant: at the start of every iteration of the for loop, each node i + 1, . . . , n is the root of a max-heap.
1. At start, each node ⌊n/2⌋ + 1, . . . , n is a leaf, which is the root of a trivial max-heap. Since, at start, i = ⌊n/2⌋, the initialisation of the loop invariant is true.
2. We notice that both children of the node i are indexed higher than i and thus, by the loop invariant, they are both roots of max-heaps. This means that the maxHeapify procedure makes node i a max-heap root. This means that the invariant stays true after each iteration.
3. The loop terminates when i = 0. By our loop invariant, each node (notably node 1) is the root of a max-heap.

Heapsort   Now that we have built our heap, we can use it to sort our array:
    procedure heapsort(A, n)
        buildMaxHeap(A, n)
        for i = n downto 2
            swap(A, 1, i)            // swap A[1] and A[i]
            maxHeapify(A, 1, i-1)
A max-heap is only useful for one thing: getting the maximum element. When we get it, we swap it to its right place. We can then max-heapify the new tree (without the element we put in the right place), and start again.

Complexity   We run the heap repair (which runs in O(log(n))) O(n) times, thus we get that our algorithm has complexity O(n log₂(n)).
It is interesting to see that, here, the good complexity comes from a really good data structure. This is basically like selection sort (finding the biggest element and putting it at the end, which runs in O(n²)), but finding the maximum in constant time.

Remark We can note that, unlike Merge Sort, this sorting algorithm is in
place.
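
Python sketch   The three procedures translate to Python as follows (not from the course; 0-indexed, so the children of node i are 2i + 1 and 2i + 2):

    def max_heapify(a, i, n):
        # Restore the heap property at node i, assuming its subtrees are heaps.
        l, r = 2 * i + 1, 2 * i + 2
        largest = l if l < n and a[l] > a[i] else i
        if r < n and a[r] > a[largest]:
            largest = r
        if largest != i:
            a[i], a[largest] = a[largest], a[i]
            max_heapify(a, largest, n)

    def build_max_heap(a):
        n = len(a)
        for i in range(n // 2 - 1, -1, -1):
            max_heapify(a, i, n)

    def heapsort(a):
        build_max_heap(a)
        for i in range(len(a) - 1, 0, -1):
            a[0], a[i] = a[i], a[0]   # move the current maximum to its place
            max_heapify(a, 0, i)      # repair the heap on the remaining prefix

    values = [16, 4, 10, 14, 7, 9, 3, 2, 8, 1]
    heapsort(values)
    print(values)  # [1, 2, 3, 4, 7, 8, 9, 10, 14, 16]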

Friday 14th October 2022 — Lecture 7 : Queues, stacks and linked list

4.2 Priority queue


Definition: Priority queue   A priority queue maintains a dynamic set S of elements, where each element has a key (an associated value that regulates its importance). This is a more constraining data structure than an array, since we cannot access an arbitrary element.
We want to have the following operations:
• Insert(S, x): inserts the element x into S.
• Maximum(S): Returns the element of S with the largest key.
• Extract-Max(S): removes and returns the element of S with the largest key.
• Increase-Key(S, x, k): increases the value of element x’s key to k, assuming
that k is greater than its current key value.

Usage   Priority queues have many usages; the biggest one will be in Dijkstra's algorithm, which we will see later in this course.

Using a heap Let us try to implement a priority queue using a heap.

Maximum Since we are using a heap, we have two procedures for free.
Maximum(S) simply returns the root. This is Θ(1).
For Extract-Max(S), we can move the last element of the array to
the root and run Max-Heapify on the root (like what we do with
heap-sort, but without needing to put the root to the last element
of the heap).

34
4.3. STACK AND QUEUE Notes by Joachim Favre

Extract-Max runs in the same time as applying Max-Heapify on


the root, and thus in O(log(n)).

Increase key   To implement Increase-Key, after having changed the key of our element, we can make it go up until its parent has a bigger key than it:
    procedure HeapIncreaseKey(A, i, key)
        if key < A[i]:
            error "new key is smaller than current key"
        A[i] = key
        while i > 1 and A[Parent(i)] < A[i]
            exchange A[i] with A[Parent(i)]
            i = Parent(i)
This looks a lot like max-heapify, and it is thus O(log(n)).
Note that if we wanted to implement Decrease-Key, we could just run Max-Heapify on the element we modified.

Insert   To insert a new key into the heap, we can increment the heap size, insert a new node in the last position of the heap with the key −∞, and increase this −∞ value to key using Heap-Increase-Key:
    procedure HeapInsert(A, key, n)
        n = n + 1    // this is more complex in real life, but this is not important here
        A[n] = -infinity
        HeapIncreaseKey(A, n, key)

Remark We can make min-priority queues with min-heaps similarly.
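
Python sketch   In practice, Python's standard heapq module already provides a binary min-heap; a max-priority queue can be simulated by negating the keys (not from the course):

    import heapq

    heap = []
    for key in [5, 1, 8, 3]:
        heapq.heappush(heap, -key)   # negate to turn the min-heap into a max-heap

    print(-heap[0])                  # Maximum: 8, in Theta(1)
    print(-heapq.heappop(heap))      # Extract-Max: 8, in O(log n)
    print(-heapq.heappop(heap))      # next maximum: 5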

4.3 Stack and queue


Introduction We realise that the heap was really great because it led to very efficient algorithms.
So, we can try to make more great data structures.

Definition: Stack A stack is a data structure where we can insert (Push(S, x)) and delete elements
(Pop(S)). This is known as a last-in, first-out (LIFO), meaning that the element we
get by using the Pop procedure is the one that was inserted the most recently.

Intuition This is really like a stack: we put elements over one another, and
then we can only take elements back from the top.

Usage   Stacks are everywhere in computing: a computer has a stack and that's how it operates.
Another usage is to know whether an expression with parentheses, brackets and curly brackets is well-parenthesised. Indeed, we can go through all the characters of our expression. When we get an opening character ((, [ or {), we push it onto our stack. When we get a closing character (), ] or }), we pop an element from the stack and verify that both characters correspond.
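
Python sketch   The parenthesis-matching idea can be written in a few lines, using a Python list as the stack (not from the course):

    def well_parenthesised(expression):
        matching = {')': '(', ']': '[', '}': '{'}
        stack = []
        for character in expression:
            if character in "([{":
                stack.append(character)           # push the opening character
            elif character in ")]}":
                if not stack or stack.pop() != matching[character]:
                    return False                  # nothing to match, or wrong kind
        return not stack                          # every opening character was closed

    print(well_parenthesised("f(a[i] + {b})"))    # True
    print(well_parenthesised("(]"))               # False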

Stack implementation   A good way to implement a stack is using an array.


We have an array of size n, and a pointer S.top to the last element (some space in
the array can be unused).

Empty   To know if our stack is empty, we can simply return S.top == 0. This clearly has a complexity of O(1).

Push   To push an element onto our stack, we can do:
    procedure Push(S, x):
        S.top = S.top + 1
        S[S.top] = x
Note that, in reality, we would need to verify that we have the space to add one more element, so as not to get an IndexOutOfBoundException.
We can notice that this is executed in constant time.
Pop   Popping an element is very similar to pushing:
    procedure Pop(S):
        if StackEmpty(S)
            error "underflow"
        S.top = S.top - 1
        return S[S.top + 1]
We can notice that this is also done in constant time.


Queue A queue is a data structure where we can insert elements (Enqueue(Q, x)) and
delete elements (Dequeue(Q)). This is known as a first-in, first-out (FIFO), meaning
that the element we get by using the Dequeue procedure is the one that was inserted
the least recently.

Intuition This is really like a queue in real life: people that get out of the
queue are people who were there for the longest.

Usage Queues are also used a lot, for instance in packet switches in the
internet.
Queue implementation   We have an array Q, a pointer Q.head to the first element of the queue, and a pointer Q.tail to the place after the last element.

Enqueue   To insert an element, we can simply use the tail pointer, making sure to have it wrap around the array if needed:
    procedure Enqueue(Q, x)
        Q[Q.tail] = x
        if Q.tail == Q.length
            Q.tail = 1
        else
            Q.tail = Q.tail + 1
Note that, in real life, we must check for overflow. We can observe that this procedure is executed in constant time.

Dequeue To get an element out of our queue, we can use the head pointer:
procedure Dequeue(Q)
    x = Q[Q.head]
    if Q.head == Q.length
        Q.head = 1
    else
        Q.head = Q.head + 1
    return x

Note that, again, in real life, we must check for underflow. Also,
this procedure is again executed in constant time.
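
For illustration, here is a minimal Python sketch of such a circular-array queue (the class name is mine, and the explicit size counter is my addition to make the overflow and underflow checks easy):

class ArrayQueue:
    """Fixed-capacity FIFO queue backed by a circular array (0-indexed variant of the pseudocode above)."""

    def __init__(self, capacity):
        self.q = [None] * capacity
        self.head = 0   # index of the first element
        self.tail = 0   # index one past the last element
        self.size = 0

    def enqueue(self, x):
        if self.size == len(self.q):
            raise OverflowError("queue is full")
        self.q[self.tail] = x
        self.tail = (self.tail + 1) % len(self.q)  # wrap around
        self.size += 1

    def dequeue(self):
        if self.size == 0:
            raise IndexError("queue is empty")
        x = self.q[self.head]
        self.head = (self.head + 1) % len(self.q)  # wrap around
        self.size -= 1
        return x

Q = ArrayQueue(3)
Q.enqueue(1); Q.enqueue(2); Q.enqueue(3)
print(Q.dequeue(), Q.dequeue())  # 1 2  (first in, first out)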

Summary Both stacks and queues are very efficient and have natural operations. However,
they support only a limited set of operations (we cannot search). Also, implementations using arrays
have a fixed-size capacity.

4.4 Linked list


Linked list The idea of a linked list is that, instead of having predefined contiguous memory slots like an
array, each object is stored at some point in memory and has a pointer to the next
element (meaning that we do not need all elements to follow each other in
memory). In an array we cannot increase the size after creation, whereas for linked
lists we have space as long as we have memory left.
We use a pointer L.head to the first element of our list.

We can have multiple types of linked list. In a singly-linked list, every element only
knows the position of the next element in memory. In a doubly-linked list, every
element knows the position of the elements before and after it. It is also possible to
have a circular linked list, by making the first element the successor of the
last one.
Note that, when we are not working with a circular linked list, the prev pointer of the
first element and the next pointer of the last element are nullptr.
This is a pointer to nowhere, usually implemented as pointing to the memory
address 0. It can also be represented as Nil.
It is also possible to have a sentinel, a node like all the others, except that it does not
carry a value. This can simplify the algorithms a lot, since the pointer to the head
of the list always points to this sentinel. For instance, a circular doubly-linked list
with a sentinel would look like:

Operations Let’s consider the operations we can do with a linked list.

Search We can search in a linked list just as we search in an unsorted array:

procedure List-Search(L, k)
    x = L.head
    while x != nil and x.key != k
        x = x.next
    return x

We can note that, if k cannot be found, then this procedure returns
nil. Also, clearly, this procedure is O(n).

Insertion We can insert an element at the first position of a doubly-linked list
by doing:
procedure List-Insert(L, x)
    x.next = L.head
    if L.head != nil
        L.head.prev = x
    L.head = x
    x.prev = nil

We are basically just rewiring the pointers of L.head, x and the
first element for everything to work.
This runs in O(1).

Delete Given a pointer to an element x, we want to remove it from L:

procedure List-Delete(L, x)
    if x.prev != nil
        x.prev.next = x.next
    else
        L.head = x.next
    if x.next != nil
        x.next.prev = x.prev

We are basically just rewiring the pointers, making sure we do
not modify things that don't exist.
When we are working with a linked list with a sentinel, this algorithm
becomes very simple:
procedure Sentinel-List-Delete(L, x):
    x.prev.next = x.next
    x.next.prev = x.prev

Both of those implementations run in O(1).
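
Putting the three operations together, here is a minimal Python sketch of a doubly-linked list (class and method names are mine):

class Node:
    def __init__(self, key):
        self.key = key
        self.prev = None
        self.next = None

class DoublyLinkedList:
    def __init__(self):
        self.head = None

    def search(self, k):
        """Return the first node with key k, or None. O(n)."""
        x = self.head
        while x is not None and x.key != k:
            x = x.next
        return x

    def insert(self, x):
        """Insert node x at the front of the list. O(1)."""
        x.next = self.head
        if self.head is not None:
            self.head.prev = x
        self.head = x
        x.prev = None

    def delete(self, x):
        """Unlink node x from the list, given a pointer to it. O(1)."""
        if x.prev is not None:
            x.prev.next = x.next
        else:
            self.head = x.next
        if x.next is not None:
            x.next.prev = x.prev

L = DoublyLinkedList()
for k in [3, 1, 4]:
    L.insert(Node(k))
L.delete(L.search(1))
print([L.head.key, L.head.next.key])  # [4, 3]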

Summary A linked list is interesting because it does not have a fixed capacity, and we can insert
and delete elements in O(1) (as long as we have a pointer to the given element).
However, searching is O(n), which is not great (as we will see later).

Monday 17th October 2022 — Lecture 8 : More trees growing in the wrong direction

4.5 Binary search trees


Intuition Let’s think about the following following game: Alice thinks about an integer n
between 1 and 15, and Bob must guess it. To do so, he can make guesses, and Alice
tells him if the number is correct, smaller, or larger.
Intuitively, it seems like we can make a much more efficient algorithm than the basic
linear one, one that would work in O(log(n)) instead.

Definition: Binary Search Tree A binary search tree is a binary tree (which is not necessarily nearly-complete),
which follows the following core property: for any node x, if y is in its left subtree
then y.key < x.key, and if y is in the right subtree of x, then y.key ≥ x.key.

Example For instance, here is a binary search tree of height h = 3:

We could also have the following binary search tree of height h = 14:


We will see that good binary search trees are the ones with the smallest
height, since the complexity will depend on it.

Remark Even though binary search trees are not necessarily nearly-complete,
we can notice that their property is much more restrictive than the
one of heaps.

Searching We designed this data structure for searching, so the implementation is not very
complicated:
procedure TreeSearch(x, k)
    if x == Nil or k == x.key
        return x
    else if k < x.key
        return TreeSearch(x.left, k)
    else
        return TreeSearch(x.right, k)

Extrema We can notice that the minimum element is located in the leftmost node, and the
maximum is located in the rightmost node.

This gives us the following procedure to find the minimum element, in complexity
O(h):
procedure TreeMinimum(x)
while x.left != Nil
x = x.left
return x

We can make a very similar procedure to find the maximum element, which also
runs in complexity O(h):
procedure TreeMaximum(x)
while x.right != Nil
x = x.right
return x

Successor The successor of a node x is the node y where y.key is the smallest key such that
y.key > x.key. For instance, if we have a tree containing the numbers 1, 2, 3, 5, 6,
the successor of 2 is 3 and the successor of 3 is 5.
To find the successor of a given element, we can distinguish two cases. If x has a non-empty
right subtree, then its successor is the minimum in this right subtree (it is the minimum
number which is greater than x). However, if x does not have a right subtree, we can

consider the point of view of the successor y of x: we know that x is the predecessor
of y, meaning that it is at the rightmost bottom of y’s left subtree. This means that
we can go to the up-left (go up, as long as we are the right child), until we need to
go to the up-right (go up, but we are the left child). When we went up-right, we
found the successor of x.
We can convince ourselves that this is indeed an “else” (do the first case if there is a
right subtree, otherwise do the second case) by seeing that, if x has a right subtree, then
all of its elements are greater than x but smaller than the y that we would find
through the second procedure. In other words, if x has a right subtree, then its
successor must definitely be there. Also, this means that the only element for which
we will not find a successor is the one at the rightmost of the tree (since it has no
right subtree, and we will never be able to go up-right), which makes sense since
this is the maximum number.
This gives us the following algorithm for finding the successor in O(h):
procedure TreeSuccessor(x)
    if x.right != nil
        return TreeMinimum(x.right)
    y = x.p
    while y != nil and x == y.right
        x = y
        y = y.p
    return y

We can note that looking at the successor of the greatest element yields y = Nil, as
expected. Also, the procedure to find the predecessor is symmetrical.

Printing To print a binary tree, we have three methods: preorder, inorder and postorder
tree walk. They all run in Θ(n).
The preorder tree walk looks like:
procedure PreorderTreeWalk(x)
    if x != nil
        print x.key
        PreorderTreeWalk(x.left)
        PreorderTreeWalk(x.right)

The inorder walk has the print x.key statement one line lower, and the postorder tree
walk has this instruction two lines lower.
Insertion To insert an element, we can search for its supposed position and, when we find it,
insert the element at that position.
procedure BinarySearchTreeInsert(T, z)
    y = Nil  // previous node we looked at
    x = T.root  // current node to look at

    // Search
    while x != Nil
        y = x
        if z.key < x.key
            x = x.left
        else
            x = x.right

    // Insert
    z.p = y  // remember z's parent (used, e.g., by deletion)
    if y == Nil
        T.root = z  // tree was empty
    else if z.key < y.key
        y.left = z
    else
        y.right = z
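
Here is a minimal Python sketch gathering the main BST procedures seen so far (search, minimum, successor and insertion); the class and function names are mine, and the small demo tree is an arbitrary example:

class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.p = None  # parent pointer, used by the successor procedure

def tree_search(x, k):
    if x is None or k == x.key:
        return x
    return tree_search(x.left if k < x.key else x.right, k)

def tree_minimum(x):
    while x.left is not None:
        x = x.left
    return x

def tree_successor(x):
    if x.right is not None:
        return tree_minimum(x.right)       # smallest key in the right subtree
    y = x.p
    while y is not None and x is y.right:  # climb while we are a right child
        x, y = y, y.p
    return y

def tree_insert(root, z):
    """Insert node z into the tree rooted at root; returns the (possibly new) root."""
    y, x = None, root
    while x is not None:
        y = x
        x = x.left if z.key < x.key else x.right
    z.p = y
    if y is None:
        return z
    if z.key < y.key:
        y.left = z
    else:
        y.right = z
    return root

root = None
for k in [5, 2, 8, 1, 3]:
    root = tree_insert(root, BSTNode(k))
print(tree_minimum(root).key)                     # 1
print(tree_successor(tree_search(root, 3)).key)   # 5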

Friday 21st October 2022 — Lecture 9 : Dynamic cannot be a pejorative word


Deletion When deleting a node z, we can consider three cases. If z has no child, then we
can just remove it. If it has exactly one child, then we can make that child take z’s
position in the tree. If it has two children, then we can find its successor y (which is
at the leftmost of its right subtree, since this tree is not empty by hypothesis) and
replace z by y. Note that deleting the node y to move it at the place of z is not so
hard since, by construction, y has 0 or 1 child. Also, note that we could use the
predecessor instead of the successor.
To implement our algorithm, it is easier to have a transplant procedure (to replace
a subtree rooted at u with the one rooted at v):
procedure Transplant(T, u, v):
    // u is the root
    if u.p == Nil:
        T.root = v
    // u is the left child of its parent
    else if u == u.p.left:
        u.p.left = v
    // u is the right child of its parent
    else:
        u.p.right = v

    if v != Nil:
        v.p = u.p

We can then write our deletion procedure:


procedure TreeDelete(T, z):
    // z has no left child
    if z.left == Nil:
        Transplant(T, z, z.right)
    // z has just a left child
    else if z.right == Nil:
        Transplant(T, z, z.left)
    // z has two children
    else:
        y = TreeMinimum(z.right)  // z's successor
        if y.p != z:
            // y is in z's subtree but is not its root
            Transplant(T, y, y.right)
            y.right = z.right
            y.right.p = y
        // Replace z by y
        Transplant(T, z, y)
        y.left = z.left
        y.left.p = y

Balancing Note that neither our insertion nor our deletion procedure keeps the height low. For
instance, creating a binary search tree by inserting, in order, the elements of a sorted list
of n elements makes a tree of height n − 1.
There are balancing tricks for insertions and deletions, such as red-black
trees or AVL trees, which allow us to keep h = Θ(log(n)), but they will not be
covered in this course.
However, generally, making a tree from random-order insertions is not so bad.

Summary We have been able to implement search, max, min, predecessor, successor, insertion and
deletion in O(h).
Note that binary search trees are really interesting because we can easily
insert elements. This is the main thing which makes them more interesting than a
basic binary search on a sorted array.

Chapter 5

Dynamic programming

5.1 Introduction and Fibonacci


Introduction Dynamic programming is a way to design algorithms and has little to do with pro-
gramming. The idea is that we never compute the same thing twice, by remembering
calculations already made.
The name was coined in the 1950s by Richard Bellman, who was trying to get
funding for his research and picked a name that could hardly be used
negatively.

Fibonacci num- Let’s consider the Fibonacci numbers:


bers
F0 = 1, F1 = 1, Fn = Fn−1 + Fn−2

The first idea to compute this is through the given recurrence relation:
procedure Fib(n):
    if n == 0 or n == 1:
        return 1
    else:
        return Fib(n-1) + Fib(n-2)

However, this runs in exponential time. Indeed, when we increase n by 1, we roughly double the number
of computations we have to do. Another way to see this is that we only ever
add 1s (we never add large numbers in a single step), thus the number of additions is literally Θ(Fn) =
Θ(ϕⁿ), which is exponential.

However, we can see that we are computing many things more than twice, meaning
that we are wasting resources and that we can do better. The idea is to remember
what we have already computed so that we do not compute it again.
We have two solutions. The first method is top-down with memoisation: we
solve recursively but store each result in a table. The second method is bottom-up:
we sort the problems, and solve the smaller ones first; that way, when solving a
subproblem, we have already solved the smaller subproblems we need.


Top-down with memoisation The code looks like:
procedure MemoisedFib(n):
    // we store our answers in r
    let r = [0...n] be a new array
    for i = 0 to n
        r[i] = -infinity
    return MemoisedFibAux(n, r)

procedure MemoisedFibAux(n, r):
    if r[n] >= 0: return r[n]
    if n == 0 or n == 1:
        answer = 1
    else:
        answer = MemoisedFibAux(n-1, r) + MemoisedFibAux(n-2, r)
    r[n] = answer
    return r[n]

We can note that this has a much better runtime complexity than
the naive version, since it is Θ(n). Also, memory-wise, we use
about the same amount, since the naive algorithm already had a recursion
depth of O(n), which also needs to be stored.

Bottom-up The code looks like:


procedure BottomUpFibonacci(n):
    let r = [0...n] be a new array
    r[0] = 1
    r[1] = 1
    for i = 2 to n:
        r[i] = r[i-1] + r[i-2]
    return r[n]

Generally, the bottom-up version is slightly more optimised than


top-down with memoisation.
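
As a concrete comparison, here is a small Python sketch of both approaches (using functools.lru_cache for the memoisation, and the convention F0 = F1 = 1 of these notes):

from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memoised(n):
    """Top-down with memoisation: each subproblem is solved once, so this is Theta(n)."""
    if n <= 1:
        return 1
    return fib_memoised(n - 1) + fib_memoised(n - 2)

def fib_bottom_up(n):
    """Bottom-up: solve the smaller subproblems first."""
    r = [1] * (n + 1)
    for i in range(2, n + 1):
        r[i] = r[i - 1] + r[i - 2]
    return r[n]

print(fib_memoised(10), fib_bottom_up(10))  # 89 89 (with the convention F0 = F1 = 1)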

Designing a dynamic programming algorithm Dynamic programming is a good idea when our problem has an optimal substructure:
the problem consists of making a choice, which leaves one or several subproblems to
solve, where finding the optimal solution to all those subproblems allows us to get
the optimal solution to our problem.
This is a really important idea which we must always keep in mind when solving
problems.


Monday 24th October 2022 — Lecture 10 : ”There are 3 types of mathematicians: the ones
who can count, and the ones who cannot” (Prof. Kapralov) (what do you mean by “this
title is too long”?)

5.2 Application: Rod cutting


Problem Let’s say we have a rod of length n, and we would want to sell it on the market. We
can sell it as is, but also we could cut it in different lengths. The goal is that, given
the prices pi for lengths i = 1, . . . , n, we must find the way to cut our rod which
gives us the most money.

Brute force Let’s first consider the brute force case: why work smart when we can work fast.
We have n − 1 places where we can cut the rod, at each place we can either cut or
not, so we have 2n−1 possibilities to cut the rod (which is not great).
Also, as we will show in the 5th exercise series, a greedy algorithm does not work.

Theorem: Optimal substructure We can notice that, if the leftmost cut in an optimal solution is after i units, and an
optimal way to cut a rod of size n − i is into rods of sizes s1, …, sk, then an
optimal way to cut our rod is into rods of sizes i, s1, ..., sk.

Proof Let’s prove the optimality of our solution. Let i, o1 , . . . , o` be an


optimal solution (which exits by assumption). Also, let s1 , . . . , sk be
an optimal solution to the subproblem of cutting a rod of size n − 1.
Since this second solution is an optimal solution to the subproblem,
we get that:
Xk X`
psj ≥ poj
j=1 j=1

Hence, adding pi on both sides, we get that:


k
X `
X
pi + ≥ pi + poj
j=1 j=1

This yields that, since the optimal solution is worse or equal to the
solution we made (using the solution of the subproblem), this new
solution is indeed optimal.

Dynamic programming The theorem above shows the optimal substructure of our problem, meaning that we
can apply dynamic programming. Letting r(n) be the optimal revenue from a rod
of length n, we get by the structural theorem that we can express r(n) recursively as:

$$r(n) = \begin{cases} 0, & \text{if } n = 0, \\ \max_{1 \leq i \leq n} \{p_i + r(n-i)\}, & \text{otherwise} \end{cases}$$

This allows us to make the following algorithm:


procedure CutRodBad(p, n):
    if n == 0:
        return 0
    q = -infinity
    for i = 1 to n:
        q = max(q, p[i] + CutRodBad(p, n-i))
    return q

However, just as for the Fibonacci algorithm, we are computing the same thing many
times. Thus, let us use a bottom-up approach, storing the subproblem solutions in a table:


procedure BottomUpCutRod(p, n)
    // r contains the solutions to the subproblems
    // s contains where to cut to get the optimal solution
    let r[0...n] and s[0...n] be new arrays
    // initial condition
    r[0] = 0
    // fill in the array
    for j = 1 to n:
        // find the max
        q = -infinity
        for i = 1 to j:
            if q < p[i] + r[j - i]:
                q = p[i] + r[j-i]
                s[j] = i
        r[j] = q
    return r and s

We can then reconstruct the solution using our array:


procedure PrintCutRodSolution(p, n):
(r, s) = BottomUpCutRod(p, n)
while n > 0:
print(s[n])
n = n - s[n]

Note that this is a typical Dynamic Programming approach: fill in the array of
subproblems (r here) and another array containing more information (s here), and
then use them to reconstruct the solution.
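
Here is a minimal Python sketch of this bottom-up rod-cutting procedure together with the reconstruction; the price table used in the demo is a classic textbook example, not one from the course:

def bottom_up_cut_rod(p, n):
    """p[i] is the price of a rod of length i (p[0] unused). Returns (r, s):
    r[j] = best revenue for length j, s[j] = length of the first piece in an optimal cut."""
    r = [0] * (n + 1)
    s = [0] * (n + 1)
    for j in range(1, n + 1):
        q = float("-inf")
        for i in range(1, j + 1):
            if q < p[i] + r[j - i]:
                q = p[i] + r[j - i]
                s[j] = i
        r[j] = q
    return r, s

def print_cut_rod_solution(p, n):
    r, s = bottom_up_cut_rod(p, n)
    pieces = []
    while n > 0:
        pieces.append(s[n])
        n -= s[n]
    return r[-1], pieces

# Classic CLRS-style price table (an assumption, not from the course notes).
p = [0, 1, 5, 8, 9, 10, 17, 17, 20]
print(print_cut_rod_solution(p, 8))  # (22, [2, 6])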

5.3 Application: Change-making problem


Problem We need to give change for some amount of money W (a positive integer),
knowing that we have n distinct coin denominations (also positive
integers) 0 < w1 < ... < wn. We want to know the minimum number of coins
needed in order to make the change:

$$\min\left\{\sum_{j=1}^{n} x_j \;:\; x_j \in \mathbb{N}_0 \text{ and } \sum_{j=1}^{n} x_j w_j = W\right\}$$

For instance, if we have w1 = 1, w2 = 2, w3 = 5 and W = 8, then the output must
be 3, since the best way of giving 8 is x1 = x2 = x3 = 1 (give one coin of each).

Solution We first want to see our optimal substructure. To do so, we need to define our
subproblems. Thus, let r[w] be the smallest number of coins needed to make change
for w. We can note that, if we choose which coin i to use first and know the optimal
solution for making w − wi, we can make the optimal solution for w by adding 1 to the
solution we just found (since we use exactly one more coin wi):

$$r[w] = \min_{1 \leq i \leq n} \{1 + r[w - w_i]\}$$

We can add the boundary conditions r[w] = +∞ for all w < 0 and r[0] = 0.
We can see that the runtime is O(nW), since we have W subproblems, and computing
each entry (checking all n coins to take the minimum) takes O(n) time.
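
Since no pseudocode is given for this problem in the notes, here is a minimal Python sketch of the corresponding bottom-up algorithm (function and variable names are mine):

import math

def min_coins(w, W):
    """r[v] = minimum number of coins from denominations w summing exactly to v
    (math.inf if v cannot be formed). Runs in O(n * W)."""
    r = [0] + [math.inf] * W
    for v in range(1, W + 1):
        for wi in w:
            if wi <= v and r[v - wi] + 1 < r[v]:
                r[v] = r[v - wi] + 1
    return r[W]

print(min_coins([1, 2, 5], 8))  # 3  (5 + 2 + 1)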


5.4 Application: Matrix-chain multiplication


Problem We have n matrices A1, ..., An, where each matrix Ai can have a different size
pi−1 × pi. We have seen that scalar multiplication is what takes the most time,
so we want to minimise the number of such operations. To do so, we want to
output a full parenthesisation of the product (A1 · · · An) in a way that minimises
the number of scalar multiplications (we can do this since matrix multiplication is
associative).
For this algorithm, we need to make the following observation. Let A ∈ ℝ^{p×q},
B ∈ ℝ^{q×r} and C ∈ ℝ^{p×r} be matrices such that C = AB. Using the regular Θ(n³) matrix-
multiplication algorithm, we can see that each element of C takes q multiplications
to compute. However, since C has pr elements, we get that, to compute C = AB,
we need pqr scalar multiplications.
For instance, let’s say we have a product A1 A2 A3 with matrices of dimensions
50 × 5, 5 × 100 and 100 × 10, respectively. Then, calculating (A1 A2 )A3 requires
50 · 5 · 100 + 50 · 100 · 10 = 75000 scalar multiplications. However, calculating
A1 (A2 A3 ) requires 5 · 100 · 10 + 50 · 5 · 10 = 7500 scalar multiplications. We can
see that, indeed, a good parenthesisation can tremendously decrease the number of
operations, without changing the result.

Theorem: Optimal substructure We can notice that, if the outermost parenthesisation in an optimal solution is
(A1 · · · Ai)(Ai+1 · · · An), and if PL and PR are optimal parenthesisations for A1 · · · Ai
and Ai+1 · · · An, respectively, then ((PL) · (PR)) is an optimal parenthesisation for
A1 · · · An.
Proof Let ((OL) · (OR)) be an optimal parenthesisation (we know it has
this form by hypothesis), where OL and OR are parenthesisations
of A1 · · · Ai and Ai+1 · · · An, respectively. Also, let M(P) be the
number of scalar multiplications required by a parenthesisation P.
Since PL and PR are optimal by hypothesis, we know that M(PL) ≤
M(OL) and M(PR) ≤ M(OR). This allows us to get that:

$$M((O_L) \cdot (O_R)) = p_0 p_i p_n + M(O_L) + M(O_R) \geq p_0 p_i p_n + M(P_L) + M(P_R) = M((P_L) \cdot (P_R))$$

However, since ((PL) · (PR)) needs at most as many scalar multiplications as an
optimal parenthesisation, it is necessarily optimal itself.

Dynamic programming We can note that our theorem gives us the following recursive formula, where m[i, j]
is the optimal number of scalar multiplications for calculating Ai · · · Aj:

$$m[i, j] = \begin{cases} 0, & \text{if } i = j \\ \min_{i \leq k < j} \{m[i, k] + m[k+1, j] + p_{i-1} p_k p_j\}, & \text{if } i < j \end{cases}$$

We can use a bottom-up algorithm to solve the subproblems in increasing j − i
order.
procedure MatrixChainOrder(p)
    n = p.length - 1
    // m is the number of multiplications needed
    // s is where we cut to make the optimal solution
    let m[1...n, 1...n] and s[1...n, 1...n] be new tables
    // initial conditions
    for i = 1 to n:
        m[i, i] = 0
    // solve subproblems
    for l = 2 to n:  // l is the chain length
        for i = 1 to n-l+1:
            j = i+l-1
            // find the min
            m[i, j] = infinity
            for k = i to j - 1:
                q = m[i, k] + m[k+1, j] + p[i-1] * p[k] * p[j]
                if q < m[i, j]:
                    m[i, j] = q
                    s[i, j] = k
    return m and s

We can then get our optimal solution:

procedure PrintOptimalParens(s, i, j):
    if i == j:
        print "A_{i}"
    else:
        print "("
        PrintOptimalParens(s, i, s[i, j])
        PrintOptimalParens(s, s[i, j] + 1, j)
        print ")"

We can note that this is Θ(n³). A good way to see this is that we have Θ(n²)
subproblems, each of which takes around Θ(n) time to compute the minimum.
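
As a sanity check of the recurrence, here is a compact Python sketch that only computes the optimal cost m[1, n] (reconstruction of the parenthesisation is omitted), applied to the three-matrix example above:

def matrix_chain_order(p):
    """p = list of dimensions: matrix A_i is p[i-1] x p[i]. Returns the minimal
    number of scalar multiplications for A_1 ... A_n."""
    n = len(p) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]
    for l in range(2, n + 1):              # chain length
        for i in range(1, n - l + 2):
            j = i + l - 1
            m[i][j] = min(m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                          for k in range(i, j))
    return m[1][n]

print(matrix_chain_order([50, 5, 100, 10]))  # 7500, i.e. the cost of A1(A2A3) from the example above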

Summary To summarise, we first choose where to make the outermost parenthesis:

(A1 · · · Ak )(Ak+1 · · · An )

Then, we noticed the optimal substructure: to obtain an optimal solution, we need


to parenthesise the two remaining expressions in an optimal way. This gives us a
recurrence relation, which we solve through dynamic programming.

Friday 28th October 2022 — Lecture 11 : LCS but not LoL’s one

5.5 Application: Longest common subsequence


Problem We have two sequences X = ⟨x1, ..., xm⟩ and Y = ⟨y1, ..., yn⟩, and we want to
find the longest common subsequence (LCS) of both (a subsequence does not have to be made
of consecutive elements, as long as they appear in order).
For instance, if we have h,e,r,o,i,c,a,l,l,y and s,c,h,o,l,a,r,l,y, then the longest common
subsequence is h,o,l,l,y.

Remark This problem can for instance be useful if we want a way to compute
how far apart two strings are: the length of the
longest common subsequence is one way to measure this
distance.
Brute force We can note that brute force does not work, since we have 2^m subsequences of X
and each subsequence takes Θ(n) time to check (we scan Y for
the first letter, then for the second, and so on, until we know whether it is also a
subsequence of Y). This leads to a runtime complexity of Θ(n2^m).

Theorem: Optimal substructure We can note the following idea. Let's say we start at the end of both words and
move to the left step by step (the other direction would also work), considering one
letter from each word at any time. If the two letters are the same, then we can take
it in the subsequence. If they are not the same, then the optimal subsequence can be
obtained by moving to the left in one of the words.
Let's write this more formally. Let Xi and Yj denote the prefixes ⟨x1, ..., xi⟩
and ⟨y1, ..., yj⟩, respectively. Also, let Z = ⟨z1, ..., zk⟩ be any longest common
subsequence (LCS) of Xi and Yj. We then know that:
1. If xi = yj, then zk = xi = yj and Zk−1 is an LCS of Xi−1 and Yj−1.
2. If xi ≠ yj and zk ≠ xi, then Z is an LCS of Xi−1 and Yj.
3. If xi ≠ yj and zk ≠ yj, then Z is an LCS of Xi and Yj−1.

Proof of the Let’s prove the first point, supposing for contradiction that zk 6=
first part of the xi = yj .
first point However, there is contradiction, since we can just create Z 0 =
hz1 , . . . , zk , xi i. This is indeed a new subsequence of Xi and Yj
which is one longer, and thus contradicts the fact that Z was the
longest common subsequence.
We also note that, if yj would be already matched to something
(meaning that we would not be able to match it now), it would
mean that zk = yj : yj is the last letter of Y and it must thus be the
last letter of Z. Naturally, we can do a similar reasoning to show
that xi was not already matched.

Proof of the The proofs of the second part of the first point, and for the second
rest and the third point (which are very similar) are considered trivial
and left as exercises to the reader. ,
Dynamic programming The theorem above gives us the following recurrence, where c[i, j] is the length of an
LCS of Xi and Yj:

$$c[i, j] = \begin{cases} 0, & \text{if } i = 0 \text{ or } j = 0 \\ c[i-1, j-1] + 1, & \text{if } i, j > 0 \text{ and } x_i = y_j \\ \max(c[i-1, j], c[i, j-1]), & \text{if } i, j > 0 \text{ and } x_i \neq y_j \end{cases}$$

We have to be careful since the naive implementation solves the same subproblems
many times. We can treat this problem with dynamic programming, as usual:
procedure LCSLength(X, Y, m, n):
    // c is the length of an LCS
    // b is the path we take to get our LCS
    let b[1...m, 1...n] and c[0...m, 0...n] be new tables

    // base cases
    for i = 1 to m:
        c[i, 0] = 0
    for j = 0 to n:
        c[0, j] = 0

    // bottom up
    for i = 1 to m:
        for j = 1 to n:
            // three cases given by the recurrence relation
            if X[i] == Y[j]:
                c[i, j] = c[i-1, j-1] + 1
                b[i, j] = (-1, -1)
            // max of the two
            else if c[i-1, j] >= c[i, j-1]:
                c[i, j] = c[i-1, j]
                b[i, j] = (-1, 0)
            else:
                c[i, j] = c[i, j-1]
                b[i, j] = (0, -1)
    return c and b

We can note that time is dominated by instructions inside the two nested loops,
which are executed m · n times. We thus get that the total runtime of our solution
is Θ(mn).
We can then consider the procedure to print our result:
procedure PrintLCS(b, X, i, j):
    if i == 0 or j == 0:
        return
    (di, dj) = b[i, j]
    PrintLCS(b, X, i+di, j+dj)
    if (di, dj) == (-1, -1):
        print X[i]

49
Algorithms CHAPTER 5. DYNAMIC PROGRAMMING

We can note that each recursive call decreases i + j by at least one. Hence, the
time needed satisfies T(i + j) ≤ T(i + j − 1) + Θ(1), which means it is O(i + j).
Also, each recursive call decreases i + j by at most two. Hence, the time needed
satisfies T(i + j) ≥ T(i + j − 2) + Θ(1), which means it is Ω(i + j). In other words,
this means that we can print the result in Θ(m + n).

Intuition Conceptually, for X = ⟨B, A, B, D, B, A⟩ and Y =
⟨D, A, C, B, C, B, A⟩, we are drawing the following table:

where the numbers are stored in the array c and the arrows in
the array b. For this example, this would give us that the longest
common subsequence is Z = ⟨A, B, B, A⟩.
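
Here is a compact Python sketch of the same dynamic program, applied to the heroically/scholarly example from above (note that several longest common subsequences of length 5 exist, and which one is reconstructed depends on how ties are broken):

def lcs(X, Y):
    """Return one longest common subsequence of X and Y (bottom-up DP, Theta(m*n))."""
    m, n = len(X), len(Y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    # Walk back through the table to reconstruct one subsequence.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1]); i -= 1; j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("heroically", "scholarly"))  # one LCS of length 5 (e.g. "hoaly"; "holly" is another)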

Monday 31st October 2022 — Lecture 12 : More binary search trees

5.6 Application: Optimal binary search tree


Optimal binary search trees The goal is that, given a sorted sequence K = ⟨k1, ..., kn⟩ of n distinct keys, we want
to build a binary search tree. There is a slight twist: we know that the probability
to search for ki is pi (for instance, on social media, people are much more likely to
search for stars than for regular people). We thus want to build the binary search
tree with minimum expected search cost.
The cost is the number of items examined. For a key ki, the cost is given by
depth_T(ki) + 1, where the term depth_T(ki) is the depth of ki in the tree T.
This gives us the following expected search cost:

$$E[\text{search cost in } T] = \sum_{i=1}^{n} (\text{depth}_T(k_i) + 1)\, p_i = \sum_{i=1}^{n} p_i + \sum_{i=1}^{n} \text{depth}_T(k_i)\, p_i = 1 + \sum_{i=1}^{n} \text{depth}_T(k_i)\, p_i$$

Example For instance, let us consider the following tree and probability table:

i:  1     2    3     4    5
pi: 0.25  0.2  0.05  0.2  0.3

Then, we can compute the expected search cost to be:

E[search cost] = 1 + (1 · 0.25 + 0 · 0.2 + 2 · 0.05 + 1 · 0.2 + 2 · 0.3)

which is equal to:

E[search cost] = 1 + 1.15 = 2.15

Remark Designing a good binary search tree is equivalent to designing a


good binary search strategy.

Observation We notice that the optimal binary search tree might not have the smallest height,
and that it might not have the highest-probability key at the root too.

Brute force Before doing anything too fancy, let us start by considering the brute force algorithm:
we construct every n-node BST, fill each of them with the keys, and compute its expected
search cost. We can finally pick the tree with the smallest expected search cost.
However, there are exponentially many trees, making our algorithm really bad.

Theorem: Optimal substructure We know that:

$$E[\text{search cost}] = \sum_{i=1}^{n} (\text{depth}_T(k_i) + 1)\, p_i$$

Thus, if we increase the depth of every key by one, we have:

$$E[\text{search cost deeper}] = \sum_{i=1}^{n} (\text{depth}_T(k_i) + 2)\, p_i = \sum_{i=1}^{n} (\text{depth}_T(k_i) + 1)\, p_i + \sum_{i=1}^{n} p_i$$

This means that, if we know two trees which we want to merge under a root kr, we
can easily compute the expected search cost through a recursive formula (for a subtree
holding the keys ki, ..., kj with root kr):

$$E[\text{search cost}] = p_r + \left(\sum_{\ell=i}^{r-1} p_\ell + E[\text{search cost in left subtree}]\right) + \left(\sum_{\ell=r+1}^{j} p_\ell + E[\text{search cost in right subtree}]\right)$$
$$= E[\text{search cost in left subtree}] + E[\text{search cost in right subtree}] + \sum_{\ell=i}^{j} p_\ell$$

However, this means that, given i, j and r, the optimality of our tree only
depends on the optimality of the two subtrees (the formula above is minimal when
the two expected values are minimal). We have thus proven the optimal substructure
of our problem.
Let e[i, j] be the expected search cost of an optimal binary search tree on the keys
ki, ..., kj. This gives us the following recurrence relation:

$$e[i, j] = \begin{cases} 0, & \text{if } i = j + 1 \\ \min_{i \leq r \leq j} \left\{ e[i, r-1] + e[r+1, j] + \sum_{\ell=i}^{j} p_\ell \right\}, & \text{if } i \leq j \end{cases}$$


Bottom-up algorithm We can write our algorithm as:
procedure OptimalBST(p, q, n):
    // e[i, j]: expected search cost of an optimal BST on ki, ..., kj
    let e[1...n+1, 0...n] be a new table
    // root[i, j]: root of an optimal BST on ki, ..., kj
    let root[1...n, 1...n] be a new table
    // w[i, j]: sum pi + ... + pj (so that we don't have to recompute it every time)
    let w[1...n+1, 0...n] be a new table

    for i = 1 to n+1:
        e[i, i-1] = 0
        w[i, i-1] = 0

    for l = 1 to n:
        for i = 1 to n-l+1:
            j = i + l - 1
            e[i, j] = infinity
            w[i, j] = w[i, j-1] + p[j]
            for r = i to j:
                t = e[i, r-1] + e[r+1, j] + w[i, j]
                if t < e[i, j]:
                    e[i, j] = t
                    root[i, j] = r
    return e and root

We needed to be careful not to compute the sums again and again.
We note that there are three nested loops, thus the total runtime is Θ(n³). Another
way to see this is that we have Θ(n²) cells to fill in and that most cells take Θ(n)
time to fill in.

Friday 4th November 2022 — Lecture 13 : An empty course.

Monday 7th November 2022 — Lecture 14 : I love XKCD

Chapter 6

Graphs

6.1 Introduction
Introduction Graphs are everywhere. For instance, in social media, when we have a bunch of
entities and relationships between them, graphs are the way to go.

Definition: Graph A graph G = (V, E) consists of a vertex set V, and an edge set E that contains
pairs of vertices.

Terminology We can have a directed graph, where such pairs are ordered (if
(a, b) ∈ E, then it is not necessarily the case that (b, a) ∈ E), or an
undirected graph where such pairs are non-ordered (if (a, b) ∈ E,
then (b, a) ∈ E).

Personal re- It is funny that we are beginning graph theory right now because,
mark very recently, XKCD published a comic about this subject:

https://xkcd.com/2694/

Definition: Degree For a graph G = (V, E), the degree of a vertex u ∈ V, denoted degree(u), is its
number of outgoing edges.

Remark For an undirected graph, the degree of a vertex is its number of


neighbours.

Storing a graph We can store a graph in two ways: adjacency lists and adjacency matrices. Note
that any of those two representations can be extended to include other attributes,
such as edge weights.


Adjacency lists We use an array. Each index represents a vertex, where we store
the pointer to the head of a list containing its neighbours.
For an undirected graph, if a is in the list of b, then b is in the list
of a.
For instance, for the undirected graph above, we could have the
adjacency list:

Note that, in pseudo-code, we will denote this array as the attribute
Adj of the graph G. In other words, to get the adjacent nodes of a
vertex u, we will use G.Adj[u].

Adjacency matrix We can also use a |V| × |V| matrix A = (aij), where:

$$a_{ij} = \begin{cases} 1, & \text{if } (i, j) \in E \\ 0, & \text{otherwise} \end{cases}$$

We can note that, for an undirected graph, the matrix is symmetric:
Complexities Let us consider the different complexities of our two representations.
The space complexity of adjacency lists is Θ(V + E) and the one
of adjacency matrices is Θ(V²). Then, the time to list all vertices
adjacent to u is Θ(degree(u)) for an adjacency list but Θ(V) for an
adjacency matrix. Finally, the time to determine whether (u, v) ∈ E
is O(degree(u)) for adjacency lists but only Θ(1) for adjacency
matrices.
We can thus note that adjacency matrices can be very interesting for
edge queries (determining whether an edge exists), but have an important
space cost. Generally, adjacency matrices only become interesting
when graphs are very dense.
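
To make the two representations concrete, here is a small Python sketch (the example graph is arbitrary):

# Directed graph with vertices 0..4 and edges as (u, v) pairs.
V = 5
edges = [(0, 1), (0, 2), (1, 3), (3, 4), (4, 1)]

# Adjacency lists: space Theta(V + E), listing the neighbours of u is Theta(degree(u)).
adj = [[] for _ in range(V)]
for u, v in edges:
    adj[u].append(v)
    # For an undirected graph we would also do: adj[v].append(u)

# Adjacency matrix: space Theta(V^2), but edge queries are Theta(1).
matrix = [[0] * V for _ in range(V)]
for u, v in edges:
    matrix[u][v] = 1

print(adj[0])        # [1, 2]
print(matrix[3][4])  # 1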

6.2 Primitives for traversing and searching a graph


Breadth-First Search We have as input a graph G = (V, E), which is either directed or undirected, and
some source vertex s ∈ V. The goal is, for all v ∈ V, to output the distance from s
to v, named v.d.

Algorithm The idea is to send some kind of wave out from s. It will first hit all
vertices which are 1 edge away from s. From those points, we can again
send some kind of waves, which will hit all vertices at a distance of
two edges from s, and so on. In other words, beginning with the
source, we repeatedly dequeue the closest vertex, set all
its undiscovered neighbours' distances to the distance of the current vertex plus 1,
and store them in a queue to consider their neighbours later.


[Figure: example graph with the BFS distances from s — s has distance 0, a and c distance 1, d and f distance 2, b, e and g distance 3, and the unreachable vertex h distance ∞.]

This yields Breadth-First Search (BFS; named that way since it


prioritises breadth over depth), which can be expressed as:
procedure BFS(V, E, s):
    for u in V without s:
        u.d = infinity
    s.d = 0
    Q = empty queue
    Enqueue(Q, s)
    while !Q.isEmpty():
        u = Dequeue(Q)
        for v in G.Adj[u]:
            if v.d == infinity:
                v.d = u.d + 1
                Enqueue(Q, v)

Note that cycles are never a problem, since we enqueue a node only
if it has not been visited before, and thus only once.
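
Here is a minimal Python sketch of BFS using a deque as the queue; the example adjacency lists loosely mirror the figure above (with h and g unreachable from s):

from collections import deque

def bfs_distances(adj, s):
    """adj: adjacency lists (dict or list of lists). Returns a dict of distances from s
    (unreachable vertices are simply absent). Runs in O(V + E)."""
    dist = {s: 0}
    Q = deque([s])
    while Q:
        u = Q.popleft()
        for v in adj[u]:
            if v not in dist:          # not yet discovered
                dist[v] = dist[u] + 1
                Q.append(v)
    return dist

adj = {'s': ['a', 'c'], 'a': ['d'], 'c': ['a', 'f'], 'd': ['b'], 'f': ['e'],
       'b': [], 'e': [], 'g': [], 'h': ['g']}
print(bfs_distances(adj, 's'))  # {'s': 0, 'a': 1, 'c': 1, 'd': 2, 'f': 2, 'b': 3, 'e': 3}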

Analysis The informal proof of correctness is that we always consider the
nodes closest to the source first, meaning that whenever we visit a
node, we could not have reached it with a shorter distance. We will do
a formal proof for the generalisation of this algorithm, Dijkstra's
algorithm.
The runtime is O(V + E). Indeed, it is O(V) since each vertex is
enqueued at most once (less if the graph is disconnected), and O(E) since each edge
(u, v) is examined only when u is dequeued, which happens at most
once. In other words, every edge is examined at most once in a
directed graph and at most twice if undirected.
We notice that we have to adjust our thinking a bit when we reason
about algorithm complexity for graphs: the input is described by two parameters. We
know that the worst case is E = O(V²), but we may well have
E = Θ(V) for a given graph, meaning that our complexity needs to
take both into account.
Observation We note that BFS may not reach all the vertices (if they are not
connected).

Remark We can save the shortest-path tree by keeping track of the edge that
discovered each vertex. Note that each vertex (other than the source,
and whose distance is not infinite) has exactly one such edge,
so this indeed forms a tree. Then, given a vertex,
we can find the shortest path by following those pointers in reverse
order, climbing the tree towards the source.


[Figure: the same example graph, with the BFS shortest-path tree edges highlighted.]
Depth-First Search BFS goes through every connected vertex, but not necessarily every edge. We would
now want to make an algorithm that goes through every edge. Note that this
algorithm may seem very abstract and useless for now but, as we will see right after,
it gives us a very interesting insight about a graph.
The idea is, starting from a given point, we keep going along a path until we
get stuck, then backtrack, and get going again. Doing so, we want to output two
timestamps on each vertex v: the discovery time v.d (when we first start visiting a
node) and the finishing time v.f (when we have finished visiting all neighbours of our
node). This algorithm is named Depth-First Search (DFS).
Where we start is not really important for now.

Algorithm Our algorithm can be stated the following way, where WHITE means
not yet visited, GREY means currently visiting, and BLACK means finished
visiting:
procedure DFS(G):
    for u in G.V:
        u.color = WHITE
    time = 0
    for u in G.V:
        if u.color == WHITE:
            DFSVisit(G, u)

procedure DFSVisit(G, u):
    time = time + 1
    u.d = time
    u.color = GREY
    for v in G.Adj[u]:
        if v.color == WHITE:
            DFSVisit(G, v)
    u.color = BLACK
    time = time + 1
    u.f = time
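
Here is a minimal Python sketch of DFS with discovery and finishing times, using recursion and a colour map as in the pseudocode (the example graph is arbitrary):

def dfs(adj):
    """adj: dict vertex -> list of neighbours. Returns (d, f): discovery and
    finishing times of every vertex. Runs in Theta(V + E)."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {u: WHITE for u in adj}
    d, f = {}, {}
    time = 0

    def visit(u):
        nonlocal time
        time += 1
        d[u] = time
        color[u] = GREY
        for v in adj[u]:
            if color[v] == WHITE:
                visit(v)
        color[u] = BLACK
        time += 1
        f[u] = time

    for u in adj:
        if color[u] == WHITE:
            visit(u)
    return d, f

adj = {'a': ['b'], 'b': ['c'], 'c': ['a'], 'd': []}
d, f = dfs(adj)
print(d, f)  # a, b, c are discovered in one tree; d forms its own tree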

Example For instance, running DFS on the following graph, where we have
two DFSVisit calls (one on b and one on e):

Analysis The runtime is Θ(V + E). Indeed, Θ(V ) since every vertex is
discovered once, and Θ(E) since each edge is examined once if it is
a directed graph and twice if it is an undirected graph.

XKCD XKCD’s viewpoint on the ways we have to traverse a graph:


https://xkcd.com/2407/
And there is also the following great XKCD on the slides:

https://xkcd.com/761/

Depth-First forest Just as BFS leads to a tree, DFS leads to a forest (a set of trees).
Indeed, we can again consider the edge that we used to discover a given vertex as
an edge linking this vertex to its parent. Since the graph might be disconnected but we
run DFS until every vertex is discovered, we may have multiple detached trees.
There will be examples hereinafter.

Remark Since we have trees, in particular, DFS leads to a certain partial
ordering of our nodes: a node can be a descendant of another in a
tree (or have no relation to it, for instance because they are not in the same tree).

Formal definition Very formally, each tree is made of edges (u, v) such that u (currently
explored) is grey and v is white (not yet explored) when (u, v) is
explored.


Parenthesis theorem We can think of the discovery time as an opening parenthesis and the finishing time
as a closing parenthesis. Let us note u's discovery and finishing times by parentheses and
v's discovery and finishing times by braces. Then, to make a well-parenthesised
formulation, we have only the following possibilities:
1. (){}
2. {}()
3. ({})
4. {()}
However, it is for instance impossible to have ({)}.
Very formally, this yields that, for any vertices u, v, we have exactly one of the
following properties (where, in order, they exactly correspond to the well-parenthesised
expressions above):
1. u.d < u.f < v.d < v.f and neither of u and v are descendant of each other.
2. v.d < v.f < u.d < u.f and neither of u and v are descendent of each other.
3. u.d < v.d < v.f < u.f and v is a descendant of u.
4. v.d < u.d < u.f < v.f and u is a descendant of v.

White-path theorem Vertex v is a descendant of u if and only if, at time u.d, there is a path from u to v
consisting of only white vertices (except for u, which was just coloured grey).

Edge classification A depth-first-search run gives us a classification of edges:
1. Tree edges are the edges making up our trees, the edges which we used to visit new
nodes when running DFSVisit.
2. Back edges are edges (u, v) where u is a descendant (a child, grand-child, or
any further, in the tree) of v. This is when v.d < u.d < u.f < v.f.
3. Forward edges are edges (u, v) where v is a descendant of u, but which are not tree
edges. This is when u.d < v.d < v.f < u.f (but, again, (u, v) is not a tree
edge).
4. Cross edges are any other edges. This is when v.d < v.f < u.d < u.f.
Note that having both u.d < u.f < v.d < v.f and an edge (u, v) is not possible
since, then, because of our edge, we would have explored v from u and thus we
would have had v.d < u.f. This explains why no edge is classified with the condition
u.d < u.f < v.d < v.f.

Example For instance, in the following graph, tree edges are represented in
orange, back edges in blue, forward edges in red and cross edges in
green.

Note that, in this DFS forest, we have two trees.

Remark In the DFS of an undirected graph, it no longer makes sense to
make the distinction between back and forward edges. We thus call
both of them back edges.
Also, in an undirected graph, we cannot have any cross edge.


Observation A different starting point for DFS will lead to a different edge
classification.

6.3 Topological sort of graphs


Definition: Directed acyclic graph A directed acyclic graph (DAG) is a directed graph G such that there are no
cycles (what a definition). In other words, for all u, v ∈ V where u ≠ v, if there
exists a path from u to v, then there exists no path from v to u.

Topological sort We have as input a directed acyclic graph, and we want to output a linear ordering
of vertices such that, if (u, v) ∈ E, then u appears somewhere before v.

Use This can for instance really be useful for dependency resolution
when compiling files: we need to know in what order we need to
compile files.

Example For instance, let us say that, as good computer scientists, we made
the following graph to remember which clothes we absolutely need
to put before other clothes, in order to get dressed in the morning:

Then, we would want to output an order in which we could put


those clothes:

Theorem A directed graph G is acyclic if and only if a DFS of G yields no back edges.

Proof =⇒ We want to show that a cycle implies a back edge.
Let v be the first vertex discovered in the cycle C, and let (u, v) be
its preceding edge in C (meaning that u is also in the cycle). At
time v.d, the vertices in C form a white path from v to u, and hence u is a
descendant of v. This means that the edge (u, v) is a back edge.

Proof ⇐= We want to show that a back edge implies a cycle.
We suppose by hypothesis that there is a back edge (u, v). Then, v
is an ancestor of u in the depth-first forest. Therefore, there is a
path from v to u which, together with the edge (u, v), creates a cycle.

Algorithm The idea of topological sort is to call DFS on our graph (starting from any vertex),
in order to compute finishing times v.f for all v ∈ V . We can then output vertices
in order of decreasing finishing time.
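
Here is a minimal Python sketch of this idea (DFS recording the finishing order, then reversing it); the getting-dressed dependencies in the demo are a made-up variant of the clothes example:

def topological_sort(adj):
    """Return the vertices of a DAG in topological order: run DFS and output
    vertices by decreasing finishing time."""
    visited = set()
    order = []  # vertices appended in order of increasing finishing time

    def visit(u):
        visited.add(u)
        for v in adj[u]:
            if v not in visited:
                visit(v)
        order.append(u)  # u is finished once all its descendants are

    for u in adj:
        if u not in visited:
            visit(u)
    return list(reversed(order))

# Hypothetical "getting dressed" dependencies (edges go from what must be put on first).
adj = {'underwear': ['trousers'], 'trousers': ['shoes', 'belt'], 'socks': ['shoes'],
       'shirt': ['belt'], 'belt': [], 'shoes': []}
print(topological_sort(adj))  # one valid order, e.g. shirt, socks, underwear, trousers, belt, shoes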

Example For instance, let’s consider the graph above. Running DFS, we may
get:


Then, outputting the vertices by decreasing v.f , we get the exact


topological order shown above. Note that it could have naturally
outputted a different topological sort (if we had started our DFS at
other points), since those are definitely not unique.

Proof of correctness We want to show that, if the graph is acyclic and (u, v) ∈ E, then
v.f < u.f.
When we traverse the edge (u, v), u is grey (since we are currently
considering it). We can then split our proof according to the colour
v can have:
1. v cannot be grey since, otherwise, it would mean that we got to v
first, then got to u, and finally got back to v. In other words,
we would have v.d < u.d and thus v.d < u.d < u.f < v.f. This
would imply that (u, v) is a back edge, contradicting that our
graph is acyclic.
2. v could be white, which would mean that it is a descendant
of u and thus, by the parenthesis theorem, we would have u.d <
v.d < v.f < u.f.
3. v could also be black, which would mean that v is already finished
and thus, definitely, v.f < u.d, implying that v.f < u.f.


Monday 14th November 2022 — Lecture 15 : I definitely really like this date

6.4 Strongly connected components


Definition: Connected vertices Two vertices of an undirected graph are connected if there exists a path between
those two vertices.
Remark To know if two vertices are connected, we can run BFS from one
of the vertices, and see if the other vertex has a finite distance.

Observation For directed graphs, this definition no longer really makes sense.
Since we may want a similar one, we will define strongly connected
components right after.

Notation: Path In a graph, if there is a path from u to v, then we write u ⇝ v.

Definition: Strongly connected component A strongly connected component (SCC) of a directed graph G = (V, E) is a
maximal set of vertices C ⊆ V such that, for all u, v ∈ C, both u ⇝ v and v ⇝ u.

Remark To verify if we indeed have a SCC, we first verify that every vertex
can reach every other vertex. We then also need to verify that it is
maximal, which we can do by adding any element which has one
connection to the potential SCC, and verifying that what it yields
is not a SCC.
Example For instance, the first example is not an SCC since there is no path c ⇝ b; the second
is not either, since we could add f and it is thus not maximal:


However, here are all of the SCC of the graph:

Theorem: Existence and unicity of SCCs Any vertex belongs to one and exactly one SCC.
Proof First, we notice that a vertex always belongs to at least one SCC,
since we can always start with the set containing only this vertex
and add elements to it until it is maximal. This shows
the existence.
Second, let us suppose for contradiction that SCCs are not unique.
Thus, for some graph, there exists a vertex v such that v ∈ C1 and
v ∈ C2 , where C1 and C2 are two SCCs such that C1 6= C2 . By
definition of SCCs, for all u1 ∈ C1 , we have u1 ; v and v ; u1 ,
and similarly for all u2 ∈ C2 . However, by transitivity, this also
means that u1 ; u2 and u2 ; u1 . This yields that we can create a
new SCC C1 ∪ C2 , which contradicts the maximality of C1 and C2
and thus shows the unicity.

Definition: Component graph For a directed graph (digraph) G = (V, E), its component graph G^SCC =
(V^SCC, E^SCC) is defined to be the graph where V^SCC has a vertex for each SCC
in G, and E^SCC has an edge between two SCCs whenever G has an edge between
vertices of the corresponding SCCs.

Example For instance, for the digraph hereinabove:

Theorem For any digraph G, its component graph GSCC is a DAG (directed acyclic graph).

Proof Let’s suppose for contradiction that GSCC has a cycle. This means
that we can access one SCC from G from another SCC (or more);
and thus any elements from the first SCC have a path to elements
of the second SCC, and reciprocally. However, this means that we
could the SCCs, contradicting their maximality.



Definition: Graph transpose Let G be a digraph (directed graph).
The transpose of G, written G^T, is the graph where all the edges have their
direction reversed:

G^T = (V, E^T), where E^T = {(u, v) : (v, u) ∈ E}




Remark We call this a transpose since the transpose of G is basically given


by transposing its adjacency matrix.

Observation We can create GT in Θ(V + E) if we are using adjacency lists.

Theorem A graph and its transpose have the same SCCs.

Kosarju’s al- The idea of Kosarju’s algorithm to compute component graphs efficiently is:
gorithm 1. Call DFS(G) to compute the finishing times u.f for all u.
2. Compute GT .
3. Call DFS(GT ) where the order of the main loop of this procedure goes in
order of decreasing u.f (as computed in the first DFS).
4. Output the vertices in each tree of the depth-first forest formed in second
DFS, as a separate SCC. Cross-edges represent links in the component graph.
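
Here is a minimal Python sketch of these four steps (function and variable names are mine; the small demo graph is arbitrary):

def kosaraju_scc(adj):
    """adj: dict vertex -> list of successors. Returns a list of SCCs (as sets)."""
    # 1. First DFS: record vertices in order of increasing finishing time.
    visited, finish_order = set(), []
    def dfs1(u):
        visited.add(u)
        for v in adj[u]:
            if v not in visited:
                dfs1(v)
        finish_order.append(u)
    for u in adj:
        if u not in visited:
            dfs1(u)

    # 2. Transpose the graph.
    adj_t = {u: [] for u in adj}
    for u in adj:
        for v in adj[u]:
            adj_t[v].append(u)

    # 3. Second DFS on the transpose, in decreasing finishing time; each tree is one SCC.
    visited.clear()
    sccs = []
    def dfs2(u, comp):
        visited.add(u)
        comp.add(u)
        for v in adj_t[u]:
            if v not in visited:
                dfs2(v, comp)
    for u in reversed(finish_order):
        if u not in visited:
            comp = set()
            dfs2(u, comp)
            sccs.append(comp)
    return sccs

adj = {'a': ['b'], 'b': ['c'], 'c': ['a', 'd'], 'd': ['e'], 'e': ['d']}
print(kosaraju_scc(adj))  # two SCCs: {a, b, c} and {d, e}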

Unicity Since SCCs are unique, the result will always be the same, even
though graphs can be traversed in very different ways with DFS.

Analysis Since every step runs in Θ(V + E), our algorithm runs in Θ(V + E).

Intuition The main intuition for this algorithm is to realise that elements of an
SCC can be accessed from one another both when going forwards (in
the regular graph) and backwards (in the transposed graph). Thus,
we first compute some kind of “topological sort” (not a real
one, since we do not have a DAG), and use its reverse order as starting
points to go in the other direction. If two elements can reach each other
in both directions, they will indeed end up in the same tree at the end. If
there is a direction in which one element cannot reach the other,
then the first DFS orders them so that we begin the second DFS
with the one which cannot reach the other.

Personal re- The Professor used the name “magic algorithm” since we do not
mark prove this theorem and it seems very magical. I feel like it is better
to give it its real name, but probably it is important to know its
informal name for exams.


Friday 18th November 2022 — Lecture 16 : This date is really nice too, though

6.5 Flow networks


Basic problem The basic problem solved by flow networks is shipping as much of a resource as possible from
one node to another. Edges have a weight which, if they were pipes, would represent
their flow capacity. The question is then how to optimise the rate of flow from the
source to the sink.
Applications This has many applications, for instance evacuating people out of
a building: given the exits and the corridor sizes, we can then
know how many people we could evacuate in a given time.
Another application is finding the best way to ship goods on roads,
or disrupting such shipments in another country.

Definition: Flow network A flow network is a directed graph G = (V, E), where each edge (u, v) has a capacity
c(u, v) ≥ 0. This function is such that c(u, v) = 0 if and only if (u, v) ∉ E. Finally,
we have a source node s and a sink node t.
We also assume that there are never antiparallel edges (both (u, v) ∈ E and (v, u) ∈
E). This assumption is more or less without loss of generality since we could
just break one of the antiparallel edges into two edges linking a new node v′ (see
the picture below). This will simplify the notation in our algorithm.

Definition: Flow A flow is a function f : V × V → ℝ satisfying the two following constraints. First,
the capacity constraint states that, for all u, v ∈ V, we have:

0 ≤ f(u, v) ≤ c(u, v)

In other words, the flow cannot be greater than what is supported by the pipe. The
second constraint is flow conservation, which states that for all u ∈ V \ {s, t}:

$$\sum_{v \in V} f(v, u) = \sum_{v \in V} f(u, v)$$

In other words, the flow coming into u is the same as the flow coming out of u.

Notation We will note flows on a flow network by writing f(u, v)/c(u, v) on
every edge. For instance, we could have:

Definition: Value of a flow The value of a flow f, denoted |f|, is:

$$|f| = \sum_{v \in V} f(s, v) - \sum_{v \in V} f(v, s)$$

which is the flow out of the source minus the flow into the source.


Observation By the flow conservation constraint, this is equivalent to the flow
into the sink minus the flow out of the sink:

$$|f| = \sum_{v \in V} f(v, t) - \sum_{v \in V} f(t, v)$$

Example For instance, for the flow graph and flow hereinabove:

|f| = (1 + 2) − 0 = 3

Goal The idea is now to develop an algorithm that, given a flow network, finds the
maximum flow. The basic idea that could come to mind is to take a random path
through our network, consider its bottleneck link, and send this value of flow onto
this path. We then have a new graph, with capacities reduced and possibly some links
removed (if the remaining capacity happens to be 0). We can continue this iteratively until the source
and the sink are disconnected.
This idea would work, for example, on the following (very simple) flow network:

Indeed, its bottleneck link has capacity 3, so we send 3 units of flow on the only path.
Then, it leads to a new graph with one edge less, where the source and the sink are
no longer connected.
However, we notice that this runs into problems on other graphs. This
algorithm may produce the following sub-optimal result on the following flow network:

This means that we need a way to “undo” bad choices of paths made by our algorithm.
To do so, we will need the following definitions.

Definition: Residual capacity Given a flow network G and a flow f, the residual capacity is defined as:

$$c_f(u, v) = \begin{cases} c(u, v) - f(u, v), & \text{if } (u, v) \in E \\ f(v, u), & \text{if } (v, u) \in E \\ 0, & \text{otherwise} \end{cases}$$

The main idea of this function is its second part: the first part is just the capacity
left in the pipe, but the second part is a new, reversed, edge we add. This new edge
holds a capacity representing the amount of flow that can be reversed.

Example For instance, if we have an edge (u, v) with capacity c(u, v) = 5 and
current flow f (u, v) = 3, then cf (u, v) = 5 − 3 = 2 and cf (v, u) =
f (u, v) = 3.

Remark This definition is the reason why we do not want antiparallel edges:
the notation is much simpler without.

Definition: Residual network Given a flow network G and a flow f, the residual network G_f is defined as:

G_f = (V, E_f), where E_f = {(u, v) ∈ V × V : c_f(u, v) > 0}

We basically use our residual capacity function, removing edges with 0 capacity left.


Definition: Augmenting path Given a flow network G and a flow f, an augmenting path is a simple path (never
going twice through the same vertex) from s to t in the residual network G_f.
Augmenting the flow f by this path means pushing the minimum residual capacity over
the path: we add it to original edges of G on the path, and subtract it from the
original edge whenever the path uses a reversed edge added by the residual capacity.
This can easily be seen by looking at the definition of residual capacity (if (u, v) ∈ E,
the flow appears with a minus sign, and if (v, u) ∈ E, it appears with a plus sign).

Ford-Fulkerson The idea of the Ford-Fulkerson greedy algorithm for finding the maximum flow in a
algorithm flow network is, as the one we had before, to improve our flow iteratively; but using
residual networks in order to cancel wrong choices of paths.

Example Let’s consider again our non-trivial flow network, and the suboptimal
flow our naive algorithm found:

Now, the residual networks looks like:

Now, the new algorithm will indeed be able to take the new path.
Taking the reversed edge basically cancels the choice it made before.
Being careful to apply the new path correctly (meaning adding it
to edges from G and subtracting it on edges introduced by the
residual network), we get the following flow and residual network:

Proof of optimality We will want to prove its optimality. However, to do so, we need
the following definitions.

Definition: Cut A cut of flow network G = (V, E) is a partition of V into S and T = V \ S such
of flow network that s ∈ S and t ∈ T .
In other words, we split our graph into nodes on the source side and on the sink
side.
Example For instance, we could have the following cut (where nodes from S
are coloured in black, and ones from T are coloured in white):


Note that the cut does not necessarily have to be a straight line
(since, anyway, straight lines make no sense for a graph).

Definition: Net flow across a cut The net flow across a cut (S, T) is:

$$f(S, T) = \sum_{u \in S,\, v \in T} f(u, v) - \sum_{u \in S,\, v \in T} f(v, u)$$

This is basically the flow leaving S minus the flow entering S.

Example For instance, on the graph hereinabove, it is:

f (S, T ) = 12 + 11 − 4 = 19

Property Let f be a flow. For any cut S, T :

|f | = f (S, T )

Proof We make a proof by structural induction on S.
• If S = {s}, then the net flow is the flow out of s minus the
flow into s, which is exactly equal to the value of the flow.
• Let's say S = S′ ∪ {w}, supposing |f| = f(S′, T′). We then know that
T = T′ \ {w}.
By conservation of flow, we know that everything coming into this
new node w also comes out of it. Thus, removing it from T′ and
putting it in S′ does not change anything: in any case, it does not
add or remove any flow across the cut, it only relays it:

$$f(S, T) = f(S', T') - \sum_{v \in V} f(v, w) + \sum_{v \in V} f(w, v) = f(S', T')$$

where the last two sums cancel by flow conservation.


Definition: Capacity of a cut The capacity of a cut (S, T) is defined as:

$$c(S, T) = \sum_{u \in S,\, v \in T} c(u, v)$$

Example For instance, on the graph hereinabove, the capacity of the cut is:

12 + 14 = 26

Note that we do not add the 9, since it goes in the wrong direction.

Observation This value, however, depends on the cut.


Monday 21st November 2022 — Lecture 17 : The algorithm may stop, or may not

Property For any flow f and any cut (S, T), we have:

|f| ≤ c(S, T)

Proof Starting from the left-hand side:

$$|f| = f(S, T) = \sum_{u \in S,\, v \in T} f(u, v) - \underbrace{\sum_{u \in S,\, v \in T} f(v, u)}_{\geq 0}$$

And thus:

$$|f| \leq \sum_{u \in S,\, v \in T} f(u, v) \leq \sum_{u \in S,\, v \in T} c(u, v) = c(S, T)$$

Definition: Min- Let f be a flow. A min-cut is a cut with minimum capacity. In other words, it is a
cut cut (Smin , Tmin ), such that for any cut (S, T ):

c(Smin , Tmin ) ≤ c(S, T )

Remark By the property above, the value of the flow is less than or equal to
the min-cut:
|f | ≤ c(Smin , Tmin )
We will prove right after that, in fact, |fmax | = c(Smin , Tmin ).

Max-flow min- Let G = (V, E) be a flow network, with source s, sink t, capacities c and flow f .
cut theorem Then, the following propositions are equivalent:
1. f is a maximum flow.
2. Gf has no augmenting path.
3. |f | = c(S, T ) for some cut (S, T ).

Remark This theorem shows that the Ford-Fulkerson method gives the
optimal value. Indeed, it terminates when Gf has no augmenting
path, which is, as this theorem says, equivalent to having found a
maximum flow.
Proof (1) =⇒ Let’s suppose for contradiction that Gf has an augmenting path p.
(2) However, then, Ford-Fulkerson method would augment f by p to
obtain a flow with increased value. This contradicts the fact that f
was a maximum flow.
Proof (2) =⇒ Let S be the set of nodes reachable from s in the residual network,
(3) and T = V \ S.
Every edge going out of S in G must be at capacity. Indeed,
otherwise, we could reach a node outside S in the residual network,
contradicting the construction of S.
Since every edge is at capacity, we get that f (S, T ) = c(S, T ).
However, since |f | = f (S, T ) for any cut, we indeed find that:

|f | = c(S, T )

Proof (3) =⇒ We know that |f | ≤ c(S, T ) for all cuts S, T . Therefore, if the
(1) value of the flow is equal to the capacity of some cut, it cannot be
improved. This shows its maximality.



Summary All this shows that our Ford-Fulkerson method for finding a max-flow works:
start with the 0-flow
while there is an augmenting path from s to t in the residual network:
    find an augmenting path
    compute the bottleneck // the min residual capacity on the path
    increase the flow on the path by the bottleneck and update the residual network
// flow is maximal

Also, when we have found a max-flow, we can use our flow to find a min-cut:
if no augmenting path exists in the residual network:
    find the set of nodes S reachable from s in the residual network
    set T = V \ S
    // (S, T) is a minimum cut
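To make the method concrete, here is a minimal Python sketch of it (an illustration, not the course's reference implementation): it finds augmenting paths with BFS (the Edmonds-Karp variant) and assumes the network is given as a dictionary mapping each edge (u, v) to its capacity, with no anti-parallel edge pairs.

from collections import defaultdict, deque

def max_flow(capacity, s, t):
    """Ford-Fulkerson with BFS augmenting paths (Edmonds-Karp).
    capacity: dict mapping (u, v) -> capacity of the edge from u to v."""
    residual = defaultdict(int)          # residual capacities, 0 by default
    adj = defaultdict(set)               # adjacency in the residual network
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)                    # reverse edges exist in the residual network
    flow = 0
    while True:
        # BFS for an augmenting path from s to t in the residual network
        pred = {s: None}
        queue = deque([s])
        while queue and t not in pred:
            u = queue.popleft()
            for v in adj[u]:
                if v not in pred and residual[(u, v)] > 0:
                    pred[v] = u
                    queue.append(v)
        if t not in pred:                # no augmenting path: the flow is maximal
            return flow
        # reconstruct the path and compute the bottleneck
        path = []
        v = t
        while pred[v] is not None:
            path.append((pred[v], v))
            v = pred[v]
        bottleneck = min(residual[e] for e in path)
        # augment: decrease residual capacity forward, increase it backward
        for (u, v) in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        flow += bottleneck

Once it stops, the vertices still reachable from s through positive residual capacities form the S side of a min-cut, exactly as in the pseudocode above.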

High complexity It takes O(E) to find a path in the residual network (using breadth-first search for
analysis instance). Each time, the flow value is increased by at least 1. Thus, the running
time has a worst case of O(E|fmax |).
We can note that, indeed, there are some cases where we reach such a complexity if
we always choose a bad path (the one taking the link in the middle of the example graph, which
will always exist in the residual network):

On such a graph, the algorithm would not terminate before the heat death of the universe.

Lower complex- In fact, if we choose our paths more carefully (rather than arbitrarily) and if the capacities are integers
ity analysis (or rational numbers, which does not really matter since we could then just multiply
everything by a common denominator and get an equivalent problem), then we
can get a much better complexity.
If we take the shortest augmenting path given by BFS, then the number of iterations is bounded by EV/2.
If we take the fattest path (the path whose bottleneck has the largest capacity),
then the number of iterations is bounded by E log(E|fmax |).

Proof We will not prove these two claims in this course.

Observation If the capacities of our network are irrational, then the Ford-Fulkerson method might
not really terminate.

Application: Bi- Let’s consider the Bipartite matching problem. It is easier to explain it with an
partite matching example. We have N students applying for M ≥ N jobs, where each student gets
problem several offers. Every job can be taken at most once, and every student can have at
most one job.


We want to know if it is possible to match all students to jobs. To do so, we add


a source linked to all students, and a sink linked to all jobs, where all edges have
capacity 1.

If the Ford-Fulkerson method gives us that |fmax | = N , then every student was
able to find a job. Indeed, flows obtained by Ford-Fulkerson are integer valued if
capacities are integers, so the value on every edge is 0 or 1. Since every student
has the in-flow for at most one job, and each job has the out-flow for at most one
student, there cannot be any student matched to two jobs or any job matched to
two students, by conservation of the flow.
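A possible sketch of this reduction, reusing the max_flow function sketched above (the vertex names 'source' and 'sink' are placeholders assumed not to clash with student or job names):

def can_match_all(students, jobs, offers):
    """offers: dict mapping each student to the list of jobs offered to them.
    Returns True if every student can be matched to a distinct job."""
    capacity = {}
    for student in students:
        capacity[('source', student)] = 1        # each student is hired at most once
        for job in offers.get(student, []):
            capacity[(student, job)] = 1         # edge from student to an offered job
    for job in jobs:
        capacity[(job, 'sink')] = 1              # each job is taken at most once
    return max_flow(capacity, 'source', 'sink') == len(students)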

Application: In an undirected graph, we may want to know the maximum number of routes
Edge-disjoint from a start point to a destination that do not share a common road. To do so, we set an edge of capacity
paths problem 1 in both directions for every road (in a non-anti-parallel fashion, as seen earlier).
Then, the max-flow is the number of edge-disjoint paths, and the min-cut gives the
minimum number of roads that need to be closed so that there is no more route
going from the start to the end.


Friday 25th November 2022 — Lecture 18 : Either Levi or Mikasa made this function

6.6 Disjoint sets


Disjoint-set data The idea of disjoint-set data structures is to maintain a collection S =
structures {S1 , . . . , Sk } of disjoint sets, which can change over time. Each set is identified by a
representative, which is some member of the set. It does not matter which element is
the representative as long as, asking for the representative twice without modifying
the set, we get the same answer both times.
We want our data structure to have the following operations:
• Make-Set(x) makes a new set Si = {x}, and adds Si to our collection S.
• Union(x, y) modifies S such that, if x ∈ Sx and y ∈ Sy , then:

S = (S \ {Sx , Sy }) ∪ {Sx ∪ Sy }

In other words, we destroy Sx and Sy , but create a new set Sx ∪ Sy , whose
representative is any member of Sx ∪ Sy .
• Find(x) returns the representative of the set containing x.

Remark This data structure is also known as union-find.

Linked list rep- A way to represent this data structure is through linked lists. To do so, each set is
resentation an object structured as a singly linked list. Each set object is represented by a pointer
to the head of the list (which we will take as the representative) and a pointer to
the tail of the list. Also, each element in the list has a pointer to the set object and
to the next element.

Make-Set For the procedure Make-Set(x), we can just create a singleton list
containing x. This is easily done in time Θ(1).

Find For the procedure Find(x), we can follow the pointer back to the
list object, and then follow the head pointer to the representative.
This is also done in time Θ(1).

Union For the procedure Union(x, y), everything gets more complicated.
We notice that we can append a list to the end of another list.
However, we will need to update all the elements of the list we
appended to point to the right set object, which will take a lot of
time if its size is big. So, to do so, we can just append the smaller list
to the larger one (if their sizes are equal, we can make an arbitrary
choice). This method is named the weighted-union heuristic.
We notice that, on a single operation, both ideas have exactly the
same bound. So, to understand why this is better, let’s consider
the following theorem.

Theorem Let us consider a linked-list implementation of a disjoint-set data structure.
With the weighted-union heuristic, a sequence of (any) m operations takes
O(m + n log(n)) time, where n is the number of elements our structure ends with
after those operations. Without this heuristic, this bound becomes O(m + n²).


Proof with The inefficiency comes from constantly rewiring our elements when
running the Union procedure. Let us count how many times an
element i may get rewired by the Union calls amongst those m
operations.
When we merge a set A containing i and another set B, if we have
to update wiring of i, then it means that the size of the list A was
smaller than the one of B, and thus the size of the total list of A ∪ B
is at least twice the size of the one of A. However, the size of the list containing i
can double at most log(n) times, meaning that the element i
has been rewired at most log(n) times. Since we have n elements
for which we could have made the exact same analysis, we get a
complexity of O(n log(n)) for this scenario.
Note that we also need to consider the case where there are many
more Make-Set and Find calls than Union ones. This is pretty
trivial since they are both Θ(1), and thus this case is Θ(m).
Putting everything together, we get a worst case complexity of
O(m + n log(n)) = O(max{m, n log(n)}).

Proof without Let’s say that we have n elements each in a singleton set and that
our m operations consist in always appending the list of the first
set to the second one, through unions. This way, the first set
will get a size constantly growing. Thus, we will have to rewire
1 + 2 + . . . + (n − 1) elements, leading to a worst case complexity of
O(n²) for this scenario.
Again, considering the case where there are mostly Make-Set and
Find calls, it leads to a complexity of Θ(m). Putting everything
together, we indeed get a worst case of O(m + n²).



Remark This kind of analysis is an amortised complexity analysis: we don't
bound each single operation on its own, since a single operation may be
really expensive. However, averaged over the whole sequence of operations, the cost is fine.

Forest of trees Now, let’s consider instead a much better idea. We make a forest of trees (which are
not binary), where each tree represents one set, and the root is the representative.
Also, since we are working with trees, naturally each node only points to its parent.

Make-Set Make-Set(x) can be done easily by making a single-node tree.


procedure MakeSet(x):
x.p = x
x.rank = 0

The rank will be defined and used in the Union procedure.

Find For Find(x), we can just follow pointers to the root.


However, we can also use the following great heuristic: path com-
pression. The Find(x) procedure follows a path to the origin.
Thus, we can make all those elements’ parent be the representative
directly (in order to make the following calls quicker).


procedure FindSet(x):
    if x != x.p:
        x.p = FindSet(x.p) // update parent
    return x.p

Union For Union(x, y), we can make the root of one of the trees the child
of another.
Again, we can optimise this procedure with another great heuristic:
union by rank. For the Find(x) procedure to be efficient, we need
to keep the height of our trees as small as possible. So, the idea
is to append the tree with smallest height to the other. However,
using heights is not really efficient (since they change often and are
thus hard to keep track of), so we use ranks instead, which give the
same kind of insights.
procedure Union(x, y):
    Link(FindSet(x), FindSet(y))

procedure Link(x, y):
    if x.rank > y.rank:
        y.p = x
    else:
        x.p = y
        if x.rank == y.rank:
            y.rank = y.rank + 1
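Both heuristics combined in a compact Python sketch (one possible implementation; dictionaries play the role of the parent and rank attributes):

class DisjointSet:
    """Union-find with union by rank and path compression."""
    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, x):
        self.parent[x] = x
        self.rank[x] = 0

    def find_set(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find_set(self.parent[x])   # path compression
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        if self.rank[rx] > self.rank[ry]:                     # union by rank
            rx, ry = ry, rx
        self.parent[rx] = ry                                  # attach the lower-rank root
        if self.rank[rx] == self.rank[ry]:
            self.rank[ry] += 1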

Complexity Let’s also consider applying m operations to a datastructure with n


elements.
We can show that, using both union by rank and path compression,
we have a complexity of O(mα(n)), where α(n) is the inverse Acker-
mann function. This function grows extremely slowly: we can consider
α(n) ≤ 5 for any n that makes sense when compared to the size
of the universe. In other words, our complexity is approximately
O(m).
Note that this bound is tight: the inverse Ackermann function is not
an artefact of the analysis.
Application: For instance, we can construct a disjoint-set data structure for all the connected
Connected com- components of an undirected graph. Using the fact that, in an undirected graph,
ponents two elements are connected if and only if there is a path between them:
procedure ConnectedComponents(G):
    for each vertex v in G.V:
        MakeSet(v)
    for each edge (u, v) in G.E:
        if FindSet(u) != FindSet(v):
            Union(u, v)

Example For instance, in the following graph, we have two connected com-
ponents:

This means that our algorithm will give us two disjoint sets in the
end.


Analysis We notice that we have V elements, and we have at most V + 3E


union or find operations.
Thus, using the best implementation we saw for disjoint set data
structures, we get a complexity of O((V + E)α(V )) ≈ O(V + E).
For the other implementation we would get O(V log(V ) + E).

6.7 Minimum spanning tree


Definition: Span- A spanning tree of a graph G is a set T of edges that is acyclic and spanning (it
ning tree connects all vertices).

Example For instance, the following is a spanning tree:

However, the following is not a spanning tree since it has no cycle


but is not spanning (the node e is never reached):

Similarly, the following is not a spanning tree since it is spanning


but has a cycle:

Remark The number of edges of a spanning tree is Espan = V − 1.

Minimum span- The goal is now that, given an undirected graph G = (V, E) and weights w(u, v) for
ning tree (MST) each edge (u, v) ∈ E, we want to output a spanning tree of minimum total weight
(i.e. whose sum of weights is the smallest).

Application: This problem can have many applications. For instance, let’s say
Communica- we have some cities between which we can make communication
tion networks lines at different costs. Finding how to connect all the cities at the
smallest cost possible is exactly an application of this problem.

Application: Another application is clustering. Let’s consider the following graph,


Clustering where edge weights are equal to the distance between nodes:


Then, to find n clusters, we can make the minimum spanning tree


(which will tend to use the light edges), and remove the n − 1
heaviest edges.

Definition: Cut Let G = (V, E) be a graph. A cut (S, V \ S) is a partition of the vertices into two
non-empty disjoint sets S and V \ S.

Definition: Let G = (V, E) be a graph, and (S, V \ S) be a cut. A crossing edge is an edge
Crossing edge connecting a vertex from S to a vertex from V \ S.

Monday 28th November 2022 — Lecture 19 : Finding the optimal MST

Theorem: Cut Let S, V \ S be a cut. Also, let T be a tree on S which is part of a MST, and let e
property be a crossing edge of minimum weight.
Then, there is a MST of G containing both e and T .

Proof Let us consider the MST T is part of.


If e is already in it, then we are done.
Since there must be a crossing edge (to span both S and V \ S), if
e is not part of the MST, then another crossing edge f is part of
the MST. However, we can just replace f by e: since w(e) ≤ w(f )
by hypothesis, we get that the new spanning tree has a weight less
than or equal to the MST we considered. But, since the latter
was minimal, it means that our new tree is also minimal and that
w(f ) = w(e). Note that the new tree is indeed spanning since, if we
consider the original MST to which we add e, then it has a cycle
going through e (since adding an edge to a MST always leads to
a cycle), and thus through f too. We can then remove any of the
edges in this cycle (f in particular) and still have our spanning
property.
In both cases, we have been able to create a MST containing both
T and e, finishing our proof.

Prim’s algorithm The idea of Prim’s algorithm for finding MSTs is to greedily construct the tree by
always picking the crossing edge with smallest weight.

Proof Let’s do this proof by structural induction on the number of nodes


in T .
Our base case is trivial: starting from any point, a single element
is always a subtree of a MST. For the inductive step, we can just


see that starting with a subtree of a MST and adding the crossing
edge with smallest weight yields another subtree of a MST by the
cut property.

Implementa- We need to keep track of all the crossing edges at every iteration,
tion and to be able to efficiently find the minimum crossing edge at every
iteration.
Checking out all outgoing edges is not really good since it leads to
O(E) comparisons at every iteration and thus a total running time
of O(EV ).
Let’s consider a better solution. For every node w, we keep a value
dist(w) that measures the “distance” (the minimum sum of weights
to reach it) of w from the current tree. When a new node u is
added to the tree, we check whether neighbours of u have their
distance to the tree decreased and, if so, we decrease it. To extract
the minimum efficiently, we use a min-priority queue for the nodes
and their distances. In pseudocode, it looks like:
procedure Prim(G, w, r):
    let Q be an empty min-priority queue
    for each u in G.V:
        u.key = infinity
        u.pred = Nil
        Insert(Q, u)
    decreaseKey(Q, r, 0) // set r.key to 0
    while !Q.isEmpty():
        u = extractMin(Q)
        for each v in G.Adj[u]:
            if v in Q and w(u, v) < v.key:
                v.pred = u
                decreaseKey(Q, v, w(u, v))

Analysis Initialising Q and the first for loop take O(V log(V )) time. Then,
decreasing the key of r takes O(log(V )). Finally, in the while loop,
we make V extractMin calls—leading to O(V log(V ))—and at most
E decreaseKey calls—leading to O(E log(V )).
In total, this sums up to O(E log(V )).
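A possible Python sketch of this implementation using the heapq module; since heapq has no decreaseKey, stale entries are simply skipped when popped, which gives O(E log(E)) = O(E log(V)) as well (assuming adj maps every vertex of a connected undirected graph to its list of (neighbour, weight) pairs):

import heapq

def prim(adj, r):
    """Returns the set of MST edges as (u, v, weight) triples, starting from root r."""
    in_tree = {r}
    mst = []
    heap = [(w, r, v) for v, w in adj[r]]          # crossing edges out of {r}
    heapq.heapify(heap)
    while heap and len(in_tree) < len(adj):
        w, u, v = heapq.heappop(heap)              # lightest remaining crossing edge
        if v in in_tree:
            continue                               # stale entry, not a crossing edge anymore
        in_tree.add(v)
        mst.append((u, v, w))
        for x, wx in adj[v]:
            if x not in in_tree:
                heapq.heappush(heap, (wx, v, x))
    return mst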

Kruskal’s al- Let’s consider another way to solve this problem. The idea of Kruskal’s algorithm
gorithm for finding MSTs is to start from a forest T with all nodes being in singleton trees.
Then, at each step, we greedily add the cheapest edge that does not create a cycle.
The forest will have been merged into a single tree at the end of the procedure.

Proof Let’s do a proof by structural induction on the number of edges in


T to show that T is always a sub-forest of a MST.
The base case is trivial since, at the beginning, T is a union of
singleton vertices and thus, definitely, it is the sub-forest of any tree
on the graph (and of any MST, in particular).
For the inductive step, by hypothesis, the current T is a sub-forest of
a MST. Let e be an edge of minimum weight that does not create
a cycle, and let's suppose for contradiction that T ∪ {e} is not part of a
MST. We notice that adding this edge to the MST containing T will create
a cycle, since adding an edge to a spanning tree always creates a cycle.
However, since there was no cycle when adding this edge to the
forest, it means that there were some edges that were added later
(meaning that they have greater weight) that compose this cycle.
We can just remove one of those edges, getting a tree with weight
smaller than the MST and also spanning, which is our contradiction.



Implementa- To implement this algorithm, we need to be able to efficiently check


tion whether the cheapest edge creates a cycle. However, this is the same
as checking whether its endpoints belong to the same component,
meaning that we can use disjoint sets data structure.
We can thus implement our algorithm by making each singleton a
set, and then, when an edge (u, v) is added to T , we take the union
of the two connected components u and v.
procedure Kruskal(G, w):
    let result be an empty set of edges
    for each vertex v in G.V:
        makeSet(v)
    sort the edges of G.E into nondecreasing order by weight w
    for each (u, v) from G.E, in this order:
        if findSet(u) != findSet(v):
            result = SetUnion(result, (u, v)) // add the edge to the spanning forest
            Union(u, v) // merge the two components
    return result

Analysis Initialising result is in O(1), the first for loop represents V makeSet
calls, sorting E takes O(E log(E)) and the second for loop is O(E)
findSets and unions. We thus get a complexity of:

$$\underbrace{O((V + E)\alpha(V))}_{=O(E\alpha(V))} + O(E\log(E)) = O(E\log(E)) = O(E\log(V))$$

since E = O(V²) for any graph.




We can note that, if the edges are already sorted, then we get a
complexity of O(Eα(V )), which is almost linear.
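A possible Python sketch, reusing the DisjointSet class sketched in the previous section (edges are assumed to be given as (weight, u, v) triples, so that sorted orders them by weight):

def kruskal(vertices, edges):
    """Returns a minimum spanning forest as a list of (u, v, weight) triples."""
    ds = DisjointSet()
    for v in vertices:
        ds.make_set(v)
    result = []
    for w, u, v in sorted(edges):                  # nondecreasing order by weight
        if ds.find_set(u) != ds.find_set(v):       # (u, v) does not create a cycle
            result.append((u, v, w))
            ds.union(u, v)
    return result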

6.8 Single-source shortest paths


Definition: Let G = (V, E) be a directed graph with edge-weights w(u, v) for all (u, v) ∈ E.
Shortest path We want to find the path (v0 , v1 , . . . , vk ) from a ∈ V to b ∈ V such that its weight
problem $\sum_{i=1}^{k} w(v_{i-1}, v_i)$ is minimum.

Variants Note that there are many variants of this problem.


In single-source, we want to find the shortest path from a given
source vertex to every other vertex of the graph. In single-
destination, we want to find the shortest path from every vertex
in the graph to a given destination vertex. In single-pair, we want
to find the shortest path from u to v. In all-pairs, we want to find
the shortest path from u to v for all pairs u, v of vertices.
We can observe that single-destination can be solved by solving
single-source and by reversing edge directions. For single-pair, no
algorithm better than the one for single-source is known for now.
Finally, for all-pairs, it can be solved using single-source on every
vertex, even though better algorithms are known.

Negative-weight Note that we will try to allow negative weights, as long as there is no negative-weight
edges cycle (a cycle whose sum of weights is negative) reachable from the source (since then we could
just keep going around the cycle and all its nodes would have distance −∞). In fact,
one of our algorithms will allow us to detect such negative-weight cycles.

Remark Dijkstra’s algorithm, which we will present in the following course,


only works with non-negative weights.


Application This can for instance be really interesting for exchange rates. Let’s
say we have some exchange rate for some given currencies. We are
wondering if we can make an infinite amount of money by trading
money to a currency, and then to another, and so on until, when
we come back to the first currency, we have made more money.
To detect this, we need to reason about the product of our rates.
Since shortest-path algorithms work with sums of weights,
we can take a logarithm of every exchange rate. That
way, minimising the sum of logarithms is equivalent to minimising
the logarithm of the product of rates, and thus minimising the
product of rates. Moreover, since we want to find the way to make
the maximum amount of money, we consider the opposite
of all our logarithms. To sum up, we set w(u, v) = − log(r(u, v)).
Now, we only need to find negative cycles: they correspond to sequences of trades
making an arbitrarily large amount of money.

Friday 2nd December 2022 — Lecture 20 : I like the structure of maths courses

Bellman-Ford We receive as input a directed graph with edge weights, a source s and no negative
algorithm cycle, and, for each vertex v, we want to output `(v), the distance of the shortest
path, and π(v) = pred(v), the predecessor of v in the shortest path (this is enough
to reconstruct the path at the end).
Note that, as the algorithm iterates, `(v) will always be the current upper estimate
of the length of the shortest path to v, and pred(v) be the predecessor of v in this
shortest path.

Algorithm The idea is to iteratively relax all paths, meaning to replace the
current path by a new path if it is better:
procedure Relax(u, v, w):
    if u.d + w(u, v) < v.d:
        v.d = u.d + w(u, v)
        v.pred = u

We need to relax all edges at most |V | − 1 times (this number will


be explained in the proof right after). Thus, it gives us:
procedure initSingleSource(G, s):
    for each v in G.V:
        v.d = infinity
        v.pred = Nil
    s.d = 0

procedure BellmanFord(G, w, s):
    initSingleSource(G, s)

    // Main algorithm
    for i = 1 to len(G.V) - 1:
        for each edge (u, v) in G.E:
            relax(u, v, w)

    // Detect negative cycles (does not modify the graph)
    for each edge (u, v) in G.E:
        if v.d > u.d + w(u, v): // there would be a modification with relax
            return false
    return true

Note that the negative cycle detection will be explained through its
proof after.

Remark This algorithm is only guaranteed to work if there is no negative


cycle, but it can also detect them.
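A compact Python sketch of the whole procedure (assuming the graph is given as a list of vertices and a list of (u, v, w) edges):

def bellman_ford(vertices, edges, s):
    """Returns (dist, pred, no_negative_cycle)."""
    dist = {v: float('inf') for v in vertices}
    pred = {v: None for v in vertices}
    dist[s] = 0
    for _ in range(len(vertices) - 1):             # |V| - 1 rounds of relaxation
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
    for u, v, w in edges:                          # one more pass to detect negative cycles
        if dist[u] + w < dist[v]:
            return dist, pred, False
    return dist, pred, True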


Analysis InitSingleSource updates ` and pred for each vertex in time Θ(V ).
Then, the nested for loops run Relax V − 1 times for each edge,
giving Θ(EV ). Finally, the last for loop runs once for each edge,
giving a time of Θ(E).
This gives us a total runtime of Θ(EV ).

Lemma: Optimal If (s, v1 , . . . , vk+1 ) is a shortest path from s to vk+1 , then (s, v1 , . . . , vk ) is a shortest
substructure path from s to vk .

Proof Let q = (s, v1 , . . . , vk+1 ) be a shortest path from s to vk+1 , and


p = (s, v1 , . . . , vk ) be the path of which we want to prove the
optimality. Let's suppose for contradiction that there exists a path
p′ = (s, w1 , . . . , wj , vk ) from s to vk whose sum of weights is smaller
than the one of p.
Let's append vk+1 to both our paths, giving q and q′ =
(s, w1 , . . . , wj , vk , vk+1 ). This increases both of their sums of weights
by the same amount, w(vk , vk+1 ), meaning that q′ also has a smaller
weight than q:

$$\ell(p') < \ell(p) \iff \ell(p') + w(v_k, v_{k+1}) < \ell(p) + w(v_k, v_{k+1}) \iff \ell(q') < \ell(q)$$

This contradicts the optimality of q, showing the optimal substruc-


ture of our problem.

Lemma: Number After the ith iteration, `(v) is at most the length of the shortest path from s to v
of iterations using at most i edges.

Proof We want to do this proof by induction.


The base case is trivial since, when there have been 0 iterations, ℓ(s) = 0 and
all others equal infinity. Those are indeed the shortest paths using
at most 0 edges.
For the inductive step, let us consider any shortest path q from s to
vk+1 using at most i edges. We want to show that ℓ(vk+1 ) ≤ w(q) after the ith iteration.
By the optimal substructure, the path p from s to vk is a shortest
path, and it uses at most i − 1 edges. Thus, by the inductive
hypothesis, after the (i − 1)th iteration, ℓ(vk ) is at most the length
of the shortest path p, i.e. ℓ(vk ) ≤ w(p). However, since ℓ(vk+1 ) ≤
ℓ(vk ) + w(vk , vk+1 ) after our ith iteration by construction of the Relax
procedure, this implies that:

ℓ(vk+1 ) ≤ ℓ(vk ) + w(vk , vk+1 ) ≤ w(p) + w(vk , vk+1 ) = w(q)

Since `(v) never increases, we know that this property will hold
until the end of the procedure, which concludes our proof.

Lemma: Num- If there are no negative cycles reachable from s, then for any v there is a shortest
ber of edges in path from s to v using at most |V | − 1 edges.
shortest path

Proof Let’s suppose for contradiction that a shortest path with the smallest
number of edges has |V | or more edges. However, this means that
there exists at least a vertex the path uses at least twice by the
pigeon-hole principle, meaning that there is a cycle. Since the weight
of this cycle is non-negative, it can be removed without increasing


the length of the path, contradicting that it had the smallest number
of edges.

Theorem: Op- If there is no negative cycle, Bellman-Ford will return the correct answer after |V | − 1
timality of iterations.
Bellman-Ford
Proof The proof directly comes from the two previous lemmas: there
always exists a shortest path with at most |V | − 1 edges when there
is no negative cycle, and after |V | − 1 iterations we are sure that
we have found all paths with at most |V | − 1 edges.

Theorem: Detec- There is no negative cycle reachable from s if and only if no node's ℓ-value
tion of negative changes when we run a |V |th iteration of Bellman-Ford.
cycles

Proof We already know, from the lemma bounding the number of edges
in a shortest path, that if there are no negative
cycles reachable from the source, then the ℓ-values don't change in
the nth iteration. We would now want to show that this is actually
equivalent, by showing that if the `-values of the vertices do not
change in the nth iteration, then there is no negative cycle that is
reachable from the source.
If there is no cycle, then definitely there is no negative cycle and
everything works perfectly. Let’s thus consider the case where
there are cycles. We consider any cycle, (v0 , v1 , . . . , vt−1 , vt ) with
v0 = vt . Since no `-value changed by hypothesis, we know that, by
construction of the Relax procedure, `(vi ) ≤ `(vi−1 ) + w(vi−1 , vi )
(since it is true for all edges, then it is definitely true for the ones in
our cycle). We can sum those inequalities:

$$\sum_{i=1}^{t} \ell(v_i) \leq \sum_{i=1}^{t} \left(\ell(v_{i-1}) + w(v_{i-1}, v_i)\right) = \sum_{i=1}^{t} \ell(v_{i-1}) + \sum_{i=1}^{t} w(v_{i-1}, v_i)$$

However, since we have a cycle, $\sum_{i=1}^{t} \ell(v_i) = \sum_{i=1}^{t} \ell(v_{i-1})$. Subtracting
this value from both sides, this indeed gives us that our
cycle is non-negative:

$$0 \leq \sum_{i=1}^{t} w(v_{i-1}, v_i)$$


Remark We have shown that Bellman-Ford returns true if and only if there
is no (strictly) negative cycle.

Remark If we have a DAG (directed acyclic graph), we can first use a topological sort,
followed by one pass of Bellman-Ford using this ordering. This allows to find a
solution in O(V + E).

Dijkstra’s al- Let’s consider a much faster algorithm, which only works with non-negative weights.
gorithm The idea is to make a version of BFS working with weighted graphs. We start
with a set containing the source, S = {s}. Then, we greedily grow the set S by
iteratively adding to S the vertex that is closest to S (the v ∉ S minimising
min_{u∈S} (u.d + w(u, v))). To find those, we use a priority queue. At any time, we have
found the shortest path of every element in S.


Implementa- The program looks like Prim’s algorithm, but we are minimising
tion u.d + w(u, v) instead of w(u, v):
procedure Dijkstra(G, w, s):
    InitSingleSource(G, s)
    let S be an empty set
    let Q be an empty priority queue
    insert all G.V into Q
    while !Q.isEmpty():
        u = ExtractMin(Q)
        S = union(S, {u})
        for each v in G.Adj[u]:
            Relax(u, v, w, Q) // also decreases the key of v in Q

Analysis Just like Prim’s algorithm, the running time is dominated by opera-
tions on the queue. If we implement it using a binary heap, each
operation takes O(log(V )) time, meaning that we have a runtime
of O(E log(V )).
Note that we can use a more careful implementation of the priority
queue, leading to a runtime of O(V log(V ) + E). This is almost as
efficient as BFS.
Proof The proof of optimality of this algorithm is considered trivial and
left as an exercise to the reader. The idea is to show the loop
invariant stating that, at the start of each iteration, we have for all
v ∈ S that the distance v.d from s to v is equal to the shortest path
from s to v. This naturally uses the assumption that all weights are
non-negative (positive or zero).
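A possible Python sketch using the heapq module; since heapq has no decreaseKey, it pushes duplicate entries and skips the stale ones, which keeps the O(E log(V)) bound (assuming adj maps each vertex to a list of (neighbour, weight) pairs with non-negative weights):

import heapq

def dijkstra(adj, s):
    """Returns the dictionary of shortest distances from s to the reachable vertices."""
    dist = {s: 0}
    heap = [(0, s)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue                               # stale heap entry
        done.add(u)
        for v, w in adj[u]:
            if v not in dist or d + w < dist[v]:   # relax edge (u, v)
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist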

Friday 16th December 2022 — Lecture 24 : Doing fun stuff with matrices (really)

6.9 All-pairs shortest paths


Goal We want to find efficient ways to compute all-pairs shortest paths.
For now, we have seen algorithms allowing us to compute the shortest path from a
source to every vertex. Running these algorithms |V| times allows us to get all-pairs shortest
paths. Using BFS if the graph is unweighted, it is V·O(V + E) = O(EV); using
Dijkstra if the graph has non-negative edge weights yields V·O(E + V log(V)) =
O(EV + V² log(V)); and using Bellman-Ford, it gives V·O(V E) = O(V²E).
Since for any graph E = O(V²), we have an algorithm that works for any graph in
O(V⁴). However, we would want to have a better algorithm for all-pairs shortest paths,
one that has the same complexity as |V| runs of Dijkstra: O(EV + V² log(V)) = O(V³).


Dynamic pro- Our first guess is to use dynamic programming.


gramming Let dm (u, v) be the weight of the shortest path from u to v that uses at most m
edges. For our recursive function, we guess that the last edge of our path is a given
edge (x, v) ∈ E, giving us:
$$d_m(u, v) = \begin{cases} 0, & \text{if } m = 0, u = v \\ \infty, & \text{if } m = 0, u \neq v \\ \min_{x \in V}\{d_{m-1}(u, x) + w(x, v)\}, & \text{otherwise} \end{cases}$$

where we consider w(v, v) = 0.


When proving Bellman-Ford, we have proven that any shortest path has at most |V| − 1
edges. Thus, we only need to find d_{|V|−1}(u, v) for all (u, v) ∈ V × V. This means
that we have O(V³) subproblems, and we need O(V) to compute the minimum for each
of those subproblems, leading to a complexity of O(V⁴). This is too much.



Matrix multiplic- Let's reinterpret our algorithm as a matrix multiplication. We know very
ation efficient algorithms running in O(n^2.807) (Strassen's algorithm) or even O(n^2.376)
(the Coppersmith-Winograd algorithm) allowing us to compute efficiently the following:

$$c_{uv} = \sum_{k=1}^{n} a_{uk} b_{kv}$$

We thus hope that we could use those algorithms to make sense of:

$$c_{uv} = \min_{x \in V}\{d_{ux} + w_{xv}\}$$

We can note that if we interpret a ⊕ b = min{a, b} and a ⊙ b = a + b, then we
can rewrite the recursive definition of our dynamic program as:

$$c_{uv} = (a_{u1} \odot b_{1v}) \oplus \ldots \oplus (a_{un} \odot b_{nv})$$

We can see that it has the form of a component of a matrix resulting from a matrix
multiplication. In other words, defining Dm = (dm(i, j)) and W = (w(i, j)) and a
matrix multiplication A ⊙ B defined element-wise by our reinterpretation above,
then we can write:

Dm = Dm−1 ⊙ W

This might seem a bit abstract, but we are allowed to do so since (ℝ, min, +) is a
semiring (it is not important if we don't exactly understand this explanation; we
only need to get the main idea).
In fact, since Dm = Dm−1 ⊙ W, this means that Dm = W^m (powers taken with ⊙). If we just use the naive
method to compute this product, it takes |V| “multiplications” and thus we still have
Θ(|V|⁴) time. However, we can notice that we can use fast exponentiation—squaring
repeatedly and multiplying the correct terms—giving O(|V|³ log|V|).

We may want to try doing more. We designed all this with the idea of using Strassen's
algorithm, but sadly we cannot use it since the minimum operation does not have
an inverse.
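A small Python sketch of this (min, +) “multiplication” together with repeated squaring (assuming W is given as an n × n list of lists with zeros on the diagonal, edge weights elsewhere, float('inf') for missing edges, and no negative cycles):

def min_plus(A, B):
    """(min, +) 'product': C[u][v] = min over x of A[u][x] + B[x][v]."""
    n = len(A)
    return [[min(A[u][x] + B[x][v] for x in range(n)) for v in range(n)]
            for u in range(n)]

def all_pairs_shortest_paths(W):
    """Computes D_m for some m >= |V| - 1 by repeated squaring of W."""
    n = len(W)
    D, m = W, 1
    while m < n - 1:
        D = min_plus(D, D)                         # D_{2m} = D_m (.) D_m
        m *= 2
    return D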
Application This idea of applying matrix multiplications with modified opera-
tions is a really nice one, which can be applied to many problems.
For instance, let’s say we want to solve an easier version of our
problem: transitive closure. We want to output an array t[i, j] such
that:
$$t[i, j] = \begin{cases} 1, & \text{if there is a path from } i \text{ to } j \\ 0, & \text{otherwise} \end{cases}$$

Now, we can use ∨ (OR) and ∧ (AND) operations:

t[u, v] = (t[u, x1] ∧ w[x1, v]) ∨ . . . ∨ (t[u, xn] ∧ w[xn, v])

({0, 1}, ∨, ∧) is not a ring, but we can still use fast matrix multiplication
in our case. This gives us a O(V^2.376 log(V)) running time (which is
really good).

Floyd-Warshall The idea of this algorithm is to make another dynamic program, but in a more clever
algorithm way.
Let’s consider a slightly different subproblem: let ck (u, v) be the weight of the
shortest path from u to v with intermediate vertices in {1, 2, . . . , k}. Our guess is
that our shortest path uses vertex k or not, leading to:

$$c_k(u, v) = \begin{cases} w(u, v), & \text{if } k = 0 \\ \min\left(c_{k-1}(u, v),\ c_{k-1}(u, k) + c_{k-1}(k, v)\right), & \text{if } k \neq 0 \end{cases}$$


Again, we only need to consider up to k = |V|. This means that we have O(V³)
subproblems, each with only 2 choices, giving us a complexity of O(V³). This is already
really good.
We'd now want to try to improve this result even further when the graph is sparse, meaning
that |E| ≪ |V|² (typically having |E| = o(|V|²)).
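A direct Python sketch of this recurrence, computed bottom-up and in place (same matrix convention as in the previous sketch: zeros on the diagonal, float('inf') for missing edges, no negative cycles):

def floyd_warshall(W):
    """Returns the matrix of all-pairs shortest-path weights."""
    n = len(W)
    C = [row[:] for row in W]                      # c_0(u, v) = w(u, v)
    for k in range(n):                             # now allow k as an intermediate vertex
        for u in range(n):
            for v in range(n):
                if C[u][k] + C[k][v] < C[u][v]:
                    C[u][v] = C[u][k] + C[k][v]
    return C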

Johnson's al- Our new idea is based on the hope of applying Dijkstra |V| times. However, our
gorithm: Intu- graph may have negative edge weights, so we need to find a clever way to transform
ition our graph in order to only have non-negative weights, while keeping the property that
a shortest path over the new weights is a shortest path over the original weights.
To do so, we first need the following lemmas.

Theorem: Let h : V → ℝ be any function. Let's define, for all u, v ∈ V, the
Weight trans- following modified weights:
formation
wh (u, v) = w(u, v) + h(u) − h(v)
where the wh (u, v) are the edge weights of a modified version of our graph. Then, for
any path p from u to v:
w(p) = wh (p) − h(u) + h(v)
where wh (p) is the weight of our path p on the modified version of our graph.

Intuition In other words, if we are able to construct a new graph with weights
wh (u, v) which are all positive, then, computing a distance on
this graph, we can easily transform it to a distance on the non-
transformed graph.
In particular, if we find a path which has the shortest distance
on the modified graph, then it will also be the path with shortest
distance on our original graph.

Proof This is just a telescoping series. Supposing that we have any path
p = (x0 , x1 , . . . , xk ) where x0 = u and xk = v, it yields that:

$$w_h(p) = \sum_{i=1}^{k} w_h(x_{i-1}, x_i) = \sum_{i=1}^{k} \left(w(x_{i-1}, x_i) + h(x_{i-1}) - h(x_i)\right) = \left(\sum_{i=1}^{k} w(x_{i-1}, x_i)\right) + h(x_0) - h(x_k) = w(p) + h(u) - h(v)$$


Definition: Sys- Our goal is to have a modified graph with non-negative edge weights. This means that, for all
tem of difference u, v ∈ V we want:
constraints
wh (u, v) = w(u, v) + h(u) − h(v) ≥ 0 ⇐⇒ h(v) − h(u) ≤ w(u, v)

This last expression is what we are trying to solve, it is a system of difference


constraints.
Theorem: Inex- If a weighted graph has a negative cycle, then there is no solution to the difference
istence constraints for it.
Proof Let’s suppose that we have a negative weight cycle c =
(v0 , v1 , . . . , vk ) with v0 = vk . Supposing for contradiction that
there exists a h such that we have h(vi ) − h(vi−1 ) ≤ w(vi−1 , vi ) for


all i, we get:
$$w(c) = \sum_{i=1}^{k} w(v_{i-1}, v_i) \geq \sum_{i=1}^{k} \left(h(v_i) - h(v_{i-1})\right) = 0$$

since this is another telescopic series and since v0 = vk .


However, w(c) ≥ 0 implies that c was not a negative weight cycle,
which is our contradiction.

Theorem: Exist- If a weighted graph has no negative cycle, then we can solve the system of difference
ence constraints.
Proof Let us first add a new vertex s to our graph, which we link to every
other vertex with an edge of weight 0:

w(s, x) = 0, ∀x ∈ V

We define our function h as:

h(v) = d(s, v)

We want to show that h has the required property. Before doing so,
we can notice that, by the triangle inequality, taking the shortest
path from s to u and then from u to v cannot be a shorter path
than directly taking the shortest path from s to v:

d(s, u) + w(u, v) ≥ d(s, v) ⇐⇒ d(s, v) − d(s, u) ≤ w(u, v)

This means that:

h(v) − h(u) = d(s, v) − d(s, u) ≤ w(u, v)

as required.

Johnson’s al- We now have all the keys to make Johnson’s algorithm, in order to find the all-pair
gorithm shortest path on a given graph.
First, we consider a new graph with a new vertex s connected to all other vertices
with edges of weight 0. We run Bellman-Ford on this modified graph from s, i.e.
to get the shortest paths from s to all vertices. We let h(v) = d(s, v) for all vertex
v ∈ V (note that this is useful if there are paths of negative weight going to v).
Second, we run |V | times Dijkstra on another graph where we let wh (u, v) = w(u, v)+
h(u) − h(v) (this can be done without modifying the graph, we just need to fake the
way Dijkstra sees weights). At the same time, when we compute the distance dh (u, v)
from a vertex u to a vertex v, we output the distance d(u, v) = dh (u, v) − h(u) + h(v).
The first Bellman-Ford takes O(V E). Then, the |V| runs of Dijkstra take
O(V E + V² log(V)). This indeed yields an algorithm running in
O(V E + V² log(V)).
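Putting the pieces together, here is a rough Python sketch of Johnson's algorithm, reusing the bellman_ford and dijkstra sketches given earlier (and therefore inheriting their assumed input formats):

def johnson(vertices, edges):
    """All-pairs shortest paths. edges: list of (u, v, w) triples.
    Returns a dict u -> dict of distances, or None if a negative cycle is found."""
    s = object()                                      # a fresh vertex not in the graph
    augmented = list(edges) + [(s, v, 0) for v in vertices]
    h, _, ok = bellman_ford(list(vertices) + [s], augmented, s)
    if not ok:
        return None                                   # negative cycle: no valid h exists
    adj = {v: [] for v in vertices}
    for u, v, w in edges:
        adj[u].append((v, w + h[u] - h[v]))           # reweighted, non-negative weights
    dist = {}
    for u in vertices:
        d_h = dijkstra(adj, u)
        dist[u] = {v: d_h[v] - h[u] + h[v] for v in d_h}   # translate back to w
    return dist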


Monday 5th December 2022 — Lecture 21 : Stochastic probabilistic randomness

Chapter 7

Probabilistic analysis

7.1 Introduction
Goal We see that the worst case usually does not happen, so doing average-case and
amortised analysis often makes more sense. In fact, using randomness on the input
may allow us to get out of the worst case and land in the average case: randomising
the elements in an array before sorting it allows us to avoid the reverse-sorted case.
Also, randomness is really important in cryptography.

Remark Getting randomness (or random-looking sequences) is a non-trivial


task for computers. We will not consider this problem here.

Hiring problem Let's say we have n persons entering one after the other, and we want to hire tall
people. Our strategy is that, when someone comes in, we hire him or her if he or she is taller
than the tallest person we have hired so far. We want to know how many people we
will have hired at the end, on average.
For instance, if n = 3 people enter with sizes 2, 1, 3, then we will hire 2 and 3, and
thus have hired 2 people.

Unsatisfactory First, we notice that, in the worst case, we hire n people if they
answer come from the shortest to the tallest. On the contrary, in the best
case, we hire 1 person if they come from the tallest to the shortest.
However, we realise that we should expect them to enter in uniform
random order. Listing all n! possibilities for n = 3, we get an
expected value of:

$$\frac{3+2+2+2+1+1}{3!} = \frac{11}{6}$$
The thing is, listing n! possibilities is really not tractable as n grows.
So, we will need to develop a more intelligent theory.

Theorem: Linear- Expected values are linear. In other words, for n random variables X1 , . . . , Xn and
ity of expected constants α1 , . . . , αn :
values
E(α1 X1 + . . . + αn Xn ) = α1 E(X1 ) + . . . + αn E(Xn )

Remark This result is really incredible since it works for any X1 , . . . , Xn


even if they are absolutely not independent.


Definition: In- Given a sample space (the space of all possible outcomes) and an event A, we define
dicator Random the indicator random variable to be:
Variables
$$I(A) = \begin{cases} 1, & \text{if } A \text{ occurs} \\ 0, & \text{if } A \text{ does not occur} \end{cases}$$

Theorem Let XA = I(A) for some event A.


Then, the expected value of XA is the probability of the event A to occur, i.e:

E(XA ) = P(A)

Example: Coin Let’s say we throw a coin n times. We want to know the expected number of heads.
flip We could compute it using the definition of expected values:
$$E(X) = \sum_{k=0}^{n} k\, P(X = k)$$

This is the binomial distribution, so it works very well, but we can find a much
better way.
Let's consider a single coin. Our sample space is {H, T} where every event has the
same probability to occur. Let us take an indicator variable XH = I(H), which
counts the number of heads in one flip. Since P(H) = 1/2, we get that E(XH) = 1/2 by
our theorem.
Now, let X be a random variable for the number of heads in n flips, and let
Xi = I(the ith flip is heads). By the linearity of expectation, we get:

$$E(X) = E(X_1 + \ldots + X_n) = E(X_1) + \ldots + E(X_n) = \frac{1}{2} + \ldots + \frac{1}{2} = \frac{n}{2}$$
as expected (lolz).

Back to hiring Let’s consider again the hiring problem. We want to do a similar analysis as for the
problem coin flip example.
Let Hi be the event that the candidate i is hired, and let Xi = I(Hi ). We need to
find this expected value and, to do so, we need P(Hi). There are i! ways for the first
i candidates to come, and only (i − 1)! of them where the ith is the tallest (we set the tallest
to come last, and count the permutations of the (i − 1) others). Since a person is
hired if and only if he or she is the tallest so far, it gives us:

$$E(X_i) = P(H_i) = \frac{(i-1)!}{i!} = \frac{1}{i}$$

Thus, we find that:

$$E(X) = E(X_1 + \ldots + X_n) = E(X_1) + \ldots + E(X_n) = \frac{1}{1} + \ldots + \frac{1}{n} = H_n$$

We know that the partial sum of the harmonic series is log(n) + O(1). However,
the best case is 1 with probability 1/n (if the tallest comes first), and the worst case is
n with probability 1/n! (if they come in sorted order, from shortest to tallest). By
randomising our input before hiring people, we prevent malicious users from giving us
the worst case, and we (almost with certainty) land in the average-case scenario
of log(n) + O(1) hires.
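As a quick sanity check of this analysis, here is a small Python simulation (purely illustrative):

import random

def hires(heights):
    """Number of people hired: count of left-to-right maxima."""
    count, tallest = 0, float('-inf')
    for h in heights:
        if h > tallest:
            count, tallest = count + 1, h
    return count

def average_hires(n, trials=10000):
    total = 0
    for _ in range(trials):
        order = list(range(n))
        random.shuffle(order)                      # uniformly random arrival order
        total += hires(order)
    return total / trials

# For n = 100, the result should be close to H_100 ≈ 5.19.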

Friday 9th December 2022 — Lecture 22 : Hachis Parmentier

Randomness Let's say that we have a random function that returns 1 with probability p and 0
extraction with probability 1 − p. We want to generate 0 or 1, each with probability 1/2.
To do so, we can generate a pair of random numbers (a, b). We notice that there is
the same probability p(1 − p) to get (a, b) = (0, 1) and (a, b) = (1, 0). Thus, if we


generate new pairs until we have a ≠ b, we can then just output a, which will be 0
or 1, each with probability 1/2.
In the worst case, this method requires an infinite number of throws, but we can
compute that the expected number of pairs we need is 1/(2p(1−p)), meaning that we are fine
on average (if p is not too small and not too close to 1).
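A tiny Python sketch of this extraction trick (the bias value used in biased_bit is an arbitrary example):

import random

def biased_bit(p=0.8):
    """A biased source: returns 1 with probability p, and 0 with probability 1 - p."""
    return 1 if random.random() < p else 0

def fair_bit(source=biased_bit):
    """Von Neumann extraction: output a fair bit from a biased source."""
    while True:
        a, b = source(), source()
        if a != b:
            return a                               # (0, 1) and (1, 0) are equally likely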

7.2 Hash functions and tables


Birthday para- Let's say that each person has a uniform probability of 1/365 of being born on any given day of the
dox year. We wonder how many people we need so that at least two of them have their
birthday the same day with probability 50% or higher.
Clearly, if we have 366 people, two people have the same date. However, we only
need 23 people to get a probability of 50% and 57 to reach a probability of 97%.

Birthday lemma Let M be a finite set, q ∈ ℕ and f : {1, 2, . . . , q} → M a function chosen uniformly
at random (meaning that f(1) is chosen uniformly at random in M, and so on until
f(q); once those values are chosen they become fixed).
If q > 1.78√|M|, then the probability that f is injective (meaning that there is no
collision) is at most 1/2, i.e. P(injective) ≤ 1/2.

Example Applying this to the birthday problem, we would have M =


{1, 2, . . . , 365} and the function f maps any person to a given day
in the year, and it definitely seems chosen uniformly at random.
Note that, in this case, the bound on q is not very good (we get
q ≥ 31 but we only needed q ≥ 23) but this lower bound will be
very interesting hereinafter.

Proof Let m = |M|. A function defined on only one element is injective with
probability 100%. Then, when we add the second element, the
probability that it is still injective is 1 · (m−1)/m (this second element
can map to any element, except to the one already taken by f(1)).
Similarly, when we add the third element, the probability that our
function is still injective is 1 · (m−1)/m · (m−2)/m. So, with q elements:

$$P(\text{injective}) = \frac{m}{m} \cdot \frac{m-1}{m} \cdots \frac{m-(q-1)}{m} = 1\left(1 - \frac{1}{m}\right)\cdots\left(1 - \frac{q-1}{m}\right)$$

However, we know that 1 − x ≤ e^{−x} for all x (this can be proven
easily by the Taylor series of the exponential). So:

$$P(\text{injective}) \leq e^{-0}\, e^{-\frac{1}{m}} \cdots e^{-\frac{q-1}{m}} = \exp\left(-\frac{0 + 1 + \ldots + (q-1)}{m}\right) = \exp\left(-\frac{q(q-1)}{2m}\right)$$

This is really interesting since it tells us that the probability that f
is injective is exponentially small.
Requiring this bound to be at most 1/2 gives us:

$$\exp\left(-\frac{q(q-1)}{2m}\right) \leq \frac{1}{2} \iff q^2 - q \geq 2\ln(2)\,m \iff q \geq \frac{1 + \sqrt{1 + 8\ln(2)\,m}}{2} \approx 1.78\sqrt{m}$$



Definition: Hash A hash function h : U → {0, 1, . . . , m − 1} is a “many-to-one” function: a function
function which has many more possible inputs than outputs.
A good hash function must have three properties. First, it must be efficiently
computable. Second, the elements u ∈ U should be distributed uniformly over
{0, 1, . . . , m − 1}. Third, once the hash function h is set, h(k) must always give
the same value.
This is simple uniform hashing.

Examples A way to choose hash functions is to use a modulo:

h(k) = k mod m

where m is often selected to be a prime not too close to a power of


2.
We could also consider a multiplicative method:

h(k) = ⌊m · (Ak − ⌊Ak⌋)⌋

where x − ⌊x⌋ is the fractional part of x. For A, there
are different valid choices, and Knuth suggests choosing A = (√5 − 1)/2.
The important thing to remember is only that hash functions are a
big area of research, and that they all depend on data distribution
and other properties.
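Both methods as a short Python sketch (the concrete parameters passed to them are choices left to the user):

from math import floor, sqrt

def hash_division(k, m):
    """Division method; m is typically a prime not too close to a power of 2."""
    return k % m

def hash_multiplication(k, m, A=(sqrt(5) - 1) / 2):
    """Multiplication method, with Knuth's suggested constant as the default A."""
    frac = (A * k) % 1.0            # fractional part of A * k
    return floor(m * frac)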

Naive table We want to design a data structure which allows to insert, delete and search for
elements in (expected) constant time.
A first way to do it is to give a unique number to every element (such as the ISBN if we
are considering books), and we can make a table with an entry for each possible
number. This indeed gives a running time of O(1) for each operation, but a space of
O(|U|). The thing is we only have |K| ≪ |U| keys, so this is very space inefficient.

Bad hash table In our table, instead of storing an element with key k into index k, we store it in
h(k), where h : U 7→ {0, 1, . . . , m − 1} is a hash function (note that the array needs
m slots now). Two keys could collide, but we take m to be big enough so that it is
very unlikely. Search, insertion and deletion are also in O(1) in the average case
as long as there is no collision.
By the birthday lemma, for h to be injective with good probability, we need m > K².
This means that we would still have a lot of empty memory, which is really inefficient.

Hash table Let’s now (finally) see a correct hash table. The goal is to have a space proportional
to the number K of keys stored, i.e. Θ(K). Instead of avoiding every collision, we
will try to deal with them.
In every index we can instead store a double-linked list. Then, searching, inserting
and deleting are also very simple to implement. To insert an element, we insert it at
the head of the linked list in the good slot, which can be done in Θ(1). To delete an
element we can remove it easily from the linked list in Θ(1), since we are given a
pointer to it. Searching however requires to search in the list at the correct index,
which takes a time proportional to the size of this list. The worst case for searching
is thus Θ(n) (if the hash function conspires against us and puts all elements in the
same slot), but we want to show that this is actually Θ(1) in average case, without
requiring m too large. This is done in the following paragraphs.
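As an illustration of the structure (a simplified sketch: Python lists stand in for the doubly-linked lists, and Python's built-in hash plays the role of the hash function h):

class ChainedHashTable:
    """Hashing with chaining: one bucket per slot, collisions go into the same bucket."""
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        return hash(key) % self.m                  # stand-in for a proper hash function

    def insert(self, key, value):
        self.slots[self._h(key)].insert(0, (key, value))   # insert at the head, Θ(1)

    def search(self, key):
        for k, v in self.slots[self._h(key)]:      # expected Θ(1 + α) work
            if k == key:
                return v
        return None

    def delete(self, key):
        slot = self.slots[self._h(key)]
        for i, (k, _) in enumerate(slot):
            if k == key:
                del slot[i]
                return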

Expected list Let nj denote the length of the list T [j]. By construction, we have:
size
n = n0 + n1 + . . . + nm−1
Also, we have:
$$E(n_j) = P(h(k_1) = j) + \ldots + P(h(k_n) = j) = \frac{1}{m} + \frac{1}{m} + \ldots + \frac{1}{m} = \frac{n}{m}$$

Since this value will be important, we call E(nj) = n/m = α.


Unsuccessful An unsuccessful search takes an average time of Θ(1 + α).


search
Proof Since we use uniform hashing, any key which is not in the table is
equally likely to hash to any of the m slots.
To search unsuccessfully for any key k, we need to search
 until the
end of list T [h(k)], which has expected length E nh(k) = α. Thus,
the expected number of elements examined in an unsuccessful search
is α.
Finally, we add the cost for computing the hash function, leading
to Θ(1 + α).

Remark Since α = n/m, depending on the way we choose m, we can have
α = Θ(1/n). This is why we need to add the 1 to the runtime (we
cannot get below constant running time).

Monday 12th December 2022 — Lecture 23 : Quantum bogosort is a comparison sort in Θ(n)

Successful search A successful search also takes expected time Θ(1 + α).

Proof Let’s say we search for a uniformly random element. However, a list
with many elements has higher probability of being chosen. This
means that we need to be careful.
Let x be the element we search, selected at random amongst all
the n elements from the table. The number of elements examined
during a successful search for x is one more than the number of
elements that appear before x in the list. By the implementation
of our table, these elements are the elements inserted after x was
inserted, which have the same hash. Thus, we need to find the
average of how many elements were inserted into x’s list after x was
inserted, over the n possibilities to take x in the table.
For i = 1, . . . , n, let xi be the ith element inserted into the table, and
let ki = key(xi). For all i and j, we define the following indicator
variable:
Xij = I(h(ki) = h(kj))
Since we are using simple uniform hashing, P(h(ki) = h(kj)) = 1/m,
and thus E(Xij) = 1/m. This tells us that the expected number of
elements examined in a successful search is given by:

$$E\left(\frac{1}{n}\sum_{i=1}^{n}\left(1 + \sum_{j=i+1}^{n} X_{ij}\right)\right) = \frac{1}{n}\sum_{i=1}^{n}\left(1 + \sum_{j=i+1}^{n} E(X_{ij})\right) = \frac{1}{n}\sum_{i=1}^{n}\left(1 + \sum_{j=i+1}^{n} \frac{1}{m}\right) = \ldots = 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}$$

which is indeed Θ(1 + α).

Complexity of Both successful and unsuccessful searches have average complexity Θ(1 + α), where
search α = n/m. So, if we choose the size of our table to be proportional to the number of
elements stored (meaning m = Θ(n)), then search has O(1) average complexity.
This allowed us to construct a table which has insertion and deletion in O(1), and
search in expected O(1) time.

Chapter 8

Back to sorting

8.1 Quick sort


Idea The idea is to again use divide and conquer, but in a way that allows us to sort the array
in place. So, at any iteration, we choose a value (which yields an index q such that
the following properties can be satisfied) named the pivot, and we partition A[p . . . r]
into two (possibly empty) subarrays A[p . . . (q − 1)] and A[(q + 1) . . . r],
such that the elements in the first subarray are less than or equal to A[q], and the
ones in the second are greater than or equal to A[q]. We can then sort each subarray
recursively and, since we are working in place, no combine step is needed.

Naive partition- Let’s consider the last element to be the pivot. We thus want to reorder our array
ing in place such that the pivot lands in index q, and that the two other subarrays have
the property mentioned above.
To do so, we can use two counters: i and j. j goes through all elements one by one,
and if it finds an element less than or equal to the pivot, it moves i one forward and places
this element at position i. That way, all elements at positions up to i are always less than or equal
to the pivot.
procedure Partition(A, p, r):
    pivot = A[r]
    i = p - 1 // will be incremented before usage
    for j = p to r-1:
        if A[j] <= pivot:
            i = i + 1
            swap(A, i, j) // exchange A[i] with A[j]
    swap(A, i+1, r) // place pivot correctly by swapping A[i+1] and A[r]
    return i+1 // pivot index


Example For instance, it turns the first array to the second one:

Proof Let’s show that our partition algorithm is correct by showing the
loop invariant that all entries in A[p . . . i] are less than or equal to
the pivot, that all entries in A[(i + 1) . . . (j − 1)] are strictly greater
than the pivot, and that A[r] is always equal to the pivot.
The initial step is trivial since, before the loop starts, A[r] is the
pivot by construction, and A[p . . . i] and A[(i + 1) . . . (j − 1)] are
empty.
For the inductive step, let us split our proof in different cases, letting
x to be the value of the pivot. If A[j] ≤ x, then A[j] and A[i + 1] are
swapped, and then i and j are incremented; keeping our properties
as required. If A[j] > x, then we only increment j, which is correct
too. In both cases, we don’t touch at the pivot A[r].
When the loop terminates, j = r, so all elements in A are partitioned
into A[p . . . i], which only has values less than or equal to the pivot,
A[(i + 1) . . . (r − 1)], which only has values strictly greater than the
pivot, and A[r], which is the pivot, showing our loop invariant.
We can then move the pivot the right place by swapping A[i + 1]
and A[r].

Complexity Let’s consider the time complexity of our procedure.
The for loop runs around n = r − p + 1 times. Each iteration takes
time Θ(1), meaning that the total running time is Θ(n) for an array
of length n. We can also observe that the number of comparisons
made is around n.
Naive quick sort We can now write our quick sort procedure:
procedure Quicksort(A, p, r):
    if p < r:
        q = Partition(A, p, r)
        Quicksort(A, p, q-1)
        Quicksort(A, q+1, r)
    // no need for a combine step :D

Worst case Let’s consider the worst case running time of this algorithm. If
the list is already sorted, we will always have one of our subarray
empty:

Since the Partition procedure takes time proportional to the list


size, we get a worst case complexity of:

Θ(n) + Θ(n − 1) + . . . + Θ(1) = Θ(n²)




which is really not good.


Best case Let's consider the best case of our algorithm now. This happens
when the partition is perfectly balanced every time, meaning
that the pivots always split the array into two subarrays of equal
size. This gives us the following recurrence:

T(n) = 2T(n/2) + Θ(n)
We have already seen it, typically for merge sort, and it can be
solved to T (n) = Θ(n log(n)) by the master theorem.

Average case Let's now consider the average case over all possible inputs. Doing it
formally would take too much time, but let's consider some intuition
to get why we have an average complexity of Θ(n log(n)).
First, we notice that even if Partition always produces a 9-to-1 split
(9 times more elements in one subarray than in the other), then we get the recurrence
T(n) = T(9n/10) + T(n/10) + Θ(n), which solves to Θ(n log(n)).
Also, we can notice that even if the recursion tree will not always be
good, it will usually be a mix of good and bad splits. For instance,
having a bad split (splitting n into 0 and n − 1) followed by a perfect
split takes Θ(n), and yields almost the same result as only having a
good split (which also takes Θ(n)).

Randomised quick sort There is a huge difference between the expected running time over all inputs and the
expected running time for any input. We saw intuitively that the first is Θ(n log(n)),
and by computing a random permutation of our input array, we are able to turn
the second into the first.
However, let’s use a better strategy: instead of computing a permutation, let’s pick
the pivot randomly in the subarray we are considering.

Implementation This modification can be made very easily by just changing the
Partition procedure:
procedure RandomisedPartition(A, p, r):
    i = Random(p, r)
    exchange A[r] with A[i]   // the randomly chosen element becomes the pivot A[r]
    return Partition(A, p, r)

procedure RandomisedQuicksort(A, p, r):   // the same as Quicksort
    if p < r:
        q = RandomisedPartition(A, p, r)
        RandomisedQuicksort(A, p, q-1)
        RandomisedQuicksort(A, q+1, r)
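
Again as an illustration, here is a runnable Python sketch of the randomised variant, reusing the partition function from the earlier sketch (names and interface are assumptions of mine):

import random

def randomised_partition(A, p, r):
    """Swap a uniformly random element of A[p..r] into position r, then partition."""
    i = random.randint(p, r)      # random pivot index, inclusive on both ends
    A[r], A[i] = A[i], A[r]
    return partition(A, p, r)

def randomised_quicksort(A, p, r):
    if p < r:
        q = randomised_partition(A, p, r)
        randomised_quicksort(A, p, q - 1)
        randomised_quicksort(A, q + 1, r)

A = [3, 1, 4, 1, 5, 9, 2, 6]
randomised_quicksort(A, 0, len(A) - 1)
print(A)   # [1, 1, 2, 3, 4, 5, 6, 9]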

Proposition: Randomised quick sort expected running time
Randomised quick sort has expected running time O(n log(n)) for any input.
Proof The dominant cost of the algorithm is partitioning. Each call to
Partition has cost Θ(1 + Fi) where Fi is the number of comparisons
made in the for loop. Since an element can be a pivot at most once,
Partition is called at most n times.
This means that, letting X be the number of comparisons
performed in all calls to Partition, the total work done over
the entire execution is O(n + X). We thus want to show that
E(X) = O(n log(n)).
Let’s call z1, . . . , zn the elements of A, in a way such that z1 ≤
. . . ≤ zn. Also, let Zij = {zi, zi+1, . . . , zj}, and let Cij be the event that zi
is compared to zj throughout the execution of the algorithm, with
Xij = I(Cij) the corresponding indicator random variable.
As each pair is compared at most once (when one of them is the
pivot), the total number of comparisons performed by the algorithm is:

    X = Σ_{i=1}^{n} Σ_{j=i+1}^{n} Xij

We want to compute its expected value. By linearity of expectation, and since the
expectation of an indicator variable is the probability of its event:

    E(X) = E(Σ_{i=1}^{n} Σ_{j=i+1}^{n} Xij) = Σ_{i=1}^{n} Σ_{j=i+1}^{n} E(Xij) = Σ_{i=1}^{n} Σ_{j=i+1}^{n} P(Cij)

This means that we need to compute the probability that zi is
compared to zj. We notice that if at some point we have a pivot x
such that zi < x < zj, then zi and zj will never be compared later
(since they will be part of two different subarrays).
We also notice that if zi or zj is chosen as a pivot before any other element of
Zij, then it will be compared to all the elements of Zij, except itself.
In other words, the probability that zi is compared to zj is the
probability that either zi or zj is the first element of Zij chosen as a pivot.
There are j − i + 1 elements in Zij, and pivots are chosen randomly
and independently. Thus, the probability that any particular one of
them is the first one chosen is 1/(j − i + 1), and:

    P(Cij) = 2/(j − i + 1)
Finally, we can put everything together:

    E(X) = Σ_{i=1}^{n} Σ_{j=i+1}^{n} P(Cij) = Σ_{i=1}^{n} Σ_{j=i+1}^{n} 2/(j − i + 1) = Σ_{i=1}^{n−1} Σ_{k=1}^{n−i} 2/(k + 1)

which is such that:

    E(X) < Σ_{i=1}^{n−1} Σ_{k=1}^{n} 2/k = Σ_{i=1}^{n−1} O(log(n)) = O(n log(n))

as required.

Remark We have thus seen that randomised quick sort has an expected running time of
O(n log(n)) for any input. Also, this algorithm is in-place, very efficient, and easy to
implement in practice.

8.2 Sorting lower bound


Theorem: Sorting lower bound
Any sorting algorithm takes time Ω(n).

Proof This is trivial since our algorithm definitely needs to read every
input element at least once.

Definition: Comparison sorting
A comparison sorting algorithm is a sorting algorithm which only uses comparisons
of two elements to gain order information about a sequence.


Remark All sorts we have seen so far (insertion sort, merge sort, heapsort,
and quicksort) are comparison sorts.

Theorem: Comparison sorting lower bound
Any comparison sorting algorithm takes (expected) time Ω(n log(n)).
Proof Any comparison sorting algorithm can be summed up by a decision
tree (the figure of such a decision tree is not reproduced here).
We have n! possible permutations of our numbers, meaning at least
n! leaves. The goal of a sorting algorithm is to find which permutation
is the sorted one, meaning to find the path to the correct leaf. At
any step, we can only learn whether a comparison a < b is true or false,
meaning that our tree is a binary tree. A binary tree with at least n!
leaves must have height at least log₂(n!), so the height of our tree is at least:

    Ω(log₂(n!)) = Ω(n log(n))

since log₂(n!) = Θ(n log(n)) (a short derivation is given below).
By definition of the height of a tree, some of its leaves are at this
depth, and thus there exist input permutations for which we need
to do Ω(n log(n)) comparisons.
In fact, a similar argument implies that Ω(n log(n)) comparisons
are necessary for most inputs.
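
Here is a short derivation (my own, not from the original notes) of the bound log₂(n!) = Ω(n log(n)) used above; it simply keeps only the largest half of the factors of n!:

\[
    \log_2(n!) = \sum_{i=1}^{n} \log_2(i)
    \;\ge\; \sum_{i=\lceil n/2 \rceil}^{n} \log_2(i)
    \;\ge\; \frac{n}{2}\,\log_2\!\left(\frac{n}{2}\right)
    = \Omega(n \log n).
\]

The matching upper bound follows from log₂(n!) ≤ log₂(nⁿ) = n log₂(n).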

Remark In some sense, it means that merge sort, heapsort and quicksort are
asymptotically optimal among comparison sorts.

Counting sort Let’s try to make a sorting algorithm running in O(n). By the theorem above, it must
not be a comparison sort.
Let’s say that the array we receive, A[1 . . . n], has values A[j] ∈ {0, 1, . . . , k} for some
k. Note that, as long as the array has integer values, we can always shift it like that
by subtracting the minimum (found in O(n)).
Now, the idea is to make an array C[0 . . . k] such that C[i] represents the number
of times the value i appears in A (it can be 0, or more). It can be constructed from
A in one pass. From this array, we can then make a new, sorted array B[1 . . . n]
with the values of A, in another single pass.

Algorithm We can realise that adding one more step makes the algorithm
much simpler to implement. Once we have computed our array C,
we can turn it into a cumulative sum, so that C[i] represents the
number of times the values 0, . . . , i appear in A. We then notice
that this value is exactly the position where the value i should
land in B (if it needs to be added to B), i.e. B[C[i]] = i.
procedure CountingSort(A, B, n, k):
    // Initialise the counts to 0
    let C[0...k] be a new array
    for i = 0 to k:
        C[i] = 0

    // Count occurrences
    for j = 1 to n:
        C[A[j]] += 1

    // Turn C into a cumulative sum
    for i = 1 to k:
        C[i] += C[i-1]

    // Put the values in order in B
    for j = n downto 1:
        B[C[A[j]]] = A[j]   // place the value A[j] in B[C[A[j]]]
        C[A[j]] -= 1        // if this value must be placed again, it must not land at the same place
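
A minimal Python sketch of this counting sort, using 0-based arrays instead of the 1-based pseudocode above (the function name and interface are my own choices):

def counting_sort(A, k):
    """Return a sorted copy of A, assuming every value is an integer in {0, ..., k}."""
    C = [0] * (k + 1)
    for value in A:               # count occurrences
        C[value] += 1
    for i in range(1, k + 1):     # cumulative sum: C[i] = number of values <= i
        C[i] += C[i - 1]
    B = [0] * len(A)
    for value in reversed(A):     # place each value at its final position (stable)
        B[C[value] - 1] = value   # -1 because B is 0-based here
        C[value] -= 1
    return B

print(counting_sort([2, 5, 3, 0, 2, 3, 0, 3], 5))   # [0, 0, 2, 2, 3, 3, 3, 5]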

Analysis Our algorithm only consists of for loops with Θ(k) iterations and
Θ(n) iterations. This gives us a runtime complexity of Θ(n + k).
Note that, because k typically has to be very big, this is
not better than the comparison sorts we saw for arrays about which we
have no information. However, if k is small enough, then counting sort
can be very powerful.

Remark This algorithm will not be at the exam since we did not have time
to cover it in the lecture.

